#Quiz

> Build a recommendation system using the overview column of tmdb_5000_movies.
1. Read the file
2. Preprocess
3. Process the data
4. Analyze similarity

##1. Reading the file

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd

df_movies = pd.read_csv('/content/drive/MyDrive/2. 추천 알고리즘/2. content Based/data/tmdb_5000_movies.csv')
df_movies.info()

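> From the columns listed above, select the ones to use.
* genres, id, keywords, overview, popularity, title, vote_average, vote_count

> The goal here is to build the recommendation system from the overview column. However, overview has 3 missing values, so those rows must be removed.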

##2. Data preprocessing

In [None]:
df_movies_part = df_movies[['id', 'title', 'genres', 'keywords', 'overview', 'popularity', 'vote_average', 'vote_count']]
df_movies_part = df_movies_part.dropna()
df_movies_part

> Keep only the needed columns and drop the rows with missing values.

In [None]:
df_movies_part['overview'] = df_movies_part['overview'].apply(lambda x : x.lower())
df_movies_part['overview']

0       in the 22nd century, a paraplegic marine is di...
1       captain barbossa, long believed to be dead, ha...
2       a cryptic message from bond’s past sends him o...
3       following the death of district attorney harve...
4       john carter is a war-weary, former military ca...
                              ...                        
4798    el mariachi just wants to play his guitar and ...
4799    a newlywed couple's honeymoon is upended by th...
4800    "signed, sealed, delivered" introduces a dedic...
4801    when ambitious new york attorney sam is sent t...
4802    ever since the second grade when he first saw ...
Name: overview, Length: 4800, dtype: object

In [None]:
word_list = ["'s ", ' a ', ' is ', ' the ', ' on ', 
             ' as ', ' to ', ' the ', ' of ', 
             '’s ', ' him ', ', ', 
             ' in ', ' an ', ' will ', ' with ', 
             ' it ', ' but ', ' and ', ' be ', 
             ' for ', ' by ',' who ', ' what ', ' that ',
             ' which ', ' has ', ' have ', ' from ', ' while ',
             ' been ', ' he ', ' you ', ' its ', ' his ',
             ' when ', ' she ', ' are ', ' at ', ' than ', ' those ', ' can ', ' could ', ' on. ']
testword = df_movies_part['overview'][10]
for word in word_list:
  testword = testword.replace(word, ' ')

testword

'superman returns discover 5-year absence allowed lex luthor walk free was closest too felt abandoned moved luthor plots ultimate revenge see millions killed change face planet forever well ridding himself man steel.'

> The proper approach would be morphological analysis that keeps only the nouns, but here we simply strip the unneeded words with replace.

In [None]:
def remove_word(x):
  for word in word_list:
    x = x.replace(word, ' ')
  
  return x

# remove_word(df_movies_part['overview'][0])
df_movies_part['overview_part'] = df_movies_part['overview'].apply(remove_word)
df_movies_part
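
> The replace-based approach only matches words that have a space on both sides, so a stopword at the very start of an overview survives. A minimal sketch of a token-based alternative follows. The stopword set, the sample sentence, and the name `remove_stopwords` are made up for illustration and are not part of the notebook.

```python
import re

# Hypothetical stopword set for illustration (no padding spaces needed)
STOPWORDS = {'a', 'an', 'the', 'is', 'in', 'on', 'of', 'to', 'and'}

def remove_stopwords(text, stopwords=STOPWORDS):
  # Split into lowercase tokens; apostrophes and hyphens are kept inside a token
  tokens = re.findall(r"[a-z0-9]+(?:['’-][a-z0-9]+)*", text.lower())
  # Drop every token that is a stopword, wherever it appears in the sentence
  return ' '.join(t for t in tokens if t not in stopwords)

# Made-up sample sentence, not taken from the dataset
sample = 'The hero is on the run and the villain is in the city'
cleaned = remove_stopwords(sample)
print(cleaned)  # hero run villain city
```

> Because the text is tokenized first, the leading "The" and back-to-back stopwords are removed as well, which the padded-string replace can miss.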

##3. Data count

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
overview_matrix = vect.fit_transform(df_movies_part['overview_part'])
print(sorted(vect.vocabulary_.items()))
overview_matrix.shape



(4800, 21261)

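> The overview does not hold just a few specific keywords; it holds full descriptive sentences.
* In the 22nd century, a paraplegic Marine is di...

> The first movie has the overview shown above. Words such as in, the, a, and is carry no meaning there, so they should not drive the result. For that reason we use TF-IDF, which down-weights words that appear in many documents, rather than the CountVectorizer used earlier, which only counts raw frequencies.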
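
> To make the weighting idea concrete, the sketch below computes the inverse document frequency by hand for three made-up documents. It assumes the smoothed formula `ln((1 + n) / (1 + df)) + 1`, which is the default of scikit-learn's `TfidfVectorizer` (`smooth_idf=True`). The documents and the helper name `idf` are illustrative only.

```python
import math

# Three toy documents (made up for illustration, not taken from the dataset)
docs = [
  'the marine travels to the moon',
  'the pirate sails to the island',
  'the hacker fights the machines',
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def idf(term):
  # Document frequency: number of documents that contain the term
  df = sum(term in tokens for tokens in tokenized)
  # Smoothed idf (assumed to match TfidfVectorizer's default smooth_idf=True)
  return math.log((1 + n_docs) / (1 + df)) + 1

print(round(idf('the'), 3), round(idf('moon'), 3))  # 1.0 1.693
```

> A word that occurs in every document ("the") gets the smallest possible weight, while a word found in a single document ("moon") is weighted higher. Common words are therefore discounted rather than removed outright.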

##4. Similarity analysis

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

overview_similarity = cosine_similarity(overview_matrix, overview_matrix)
overview_similarity[:1]

array([[1., 0., 0., ..., 0., 0., 0.]])

> As shown above, cosine_similarity gives the pairwise similarity between overviews. Let's use it to build the recommendation system.
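
> For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The pure-Python sketch below uses made-up vectors. The helper name `cosine` and the choice to return 0.0 for an all-zero vector are assumptions of this sketch.

```python
import math

def cosine(u, v):
  # Dot product divided by the product of the two vector lengths
  dot = sum(a * b for a, b in zip(u, v))
  norm_u = math.sqrt(sum(a * a for a in u))
  norm_v = math.sqrt(sum(b * b for b in v))
  if norm_u == 0 or norm_v == 0:
    return 0.0  # choice made in this sketch for all-zero vectors
  return dot / (norm_u * norm_v)

# Made-up term-weight vectors
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a
c = [0.0, 0.0, 3.0]  # shares no terms with a

print(cosine(a, b), cosine(a, c))
```

> Vectors pointing in the same direction score close to 1 and vectors with no terms in common score 0. This is why the first row of overview_similarity starts with 1 (the movie compared with itself) and is mostly 0 elsewhere.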

In [None]:
overview_similarity_sorted_idx = overview_similarity.argsort()[:,::-1]
overview_similarity_sorted_idx[:3]

array([[   0, 3603, 2130, ..., 3027, 3026, 2399],
       [   1, 2542, 3094, ..., 2872, 2871,    0],
       [   2, 1343, 4070, ..., 2985, 2984,    0]])

##5. Recommendation

In [None]:
C = df_movies['vote_average'].mean()
m = df_movies['vote_count'].quantile(0.6)
print('C : ', round(C, 3), '\nm : ', round(m, 3))

C :  6.092 
m :  370.2


In [None]:
def weighted_vote_average(dataFrame):
  v = dataFrame['vote_count']
  R = dataFrame['vote_average']

  return (v/(v+m)) * R + (m/(v+m)) * C

df_movies['weighted_vote'] = df_movies.apply(weighted_vote_average, axis=1)
df_movies[:3]
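
> A small worked example of the weighted vote formula, using the C and m values printed above. The two movies are hypothetical and `weighted_rating` is a standalone copy of the formula written for this sketch.

```python
# Values printed by the earlier cell
C = 6.092  # mean vote_average over all movies
m = 370.2  # 60th percentile of vote_count

def weighted_rating(v, R, C=C, m=m):
  # v: number of votes, R: the movie's own average rating
  return (v / (v + m)) * R + (m / (v + m)) * C

# Two hypothetical movies with the same raw rating of 9.0
few_votes = weighted_rating(v=10, R=9.0)
many_votes = weighted_rating(v=10000, R=9.0)
print(round(few_votes, 2), round(many_votes, 2))
```

> With only 10 votes the score stays close to the global mean C, while with 10000 votes it approaches the movie's own rating. That is the point of the formula: a high rating backed by few votes is trusted less.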

In [None]:
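import numpy as np

def find_similarity_movie(dataFrame, similarity_sorted_idx, movieName, top=10):
  # dataFrame rows must be in the same order as the rows of the similarity matrix
  # Use the row position (not the index label) of the first matching title
  title_pos = np.flatnonzero((dataFrame['title'].str.lower() == movieName.lower()).to_numpy())
  if len(title_pos) == 0:
    raise ValueError('movie not found: {}'.format(movieName))
  title_pos = title_pos[0]
  # Take twice as many candidates as requested
  similar_indexes = similarity_sorted_idx[title_pos, :(top*2)]
  # Exclude the query movie itself
  similar_indexes = similar_indexes[similar_indexes != title_pos]

  return dataFrame.iloc[similar_indexes].sort_values('weighted_vote', ascending=False)[:top]

# The similarity matrix was built from df_movies_part (3 rows dropped), so align df_movies to it
df_movies_aligned = df_movies.loc[df_movies_part.index]
movies_top10 = find_similarity_movie(df_movies_aligned, overview_similarity_sorted_idx, 'avatar', top=30)
movies_top10[['title', 'vote_average', 'vote_count', 'weighted_vote', 'overview']]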

In [None]:
print(df_movies_part['overview'].iloc[0])
print(df_movies_part['overview'].iloc[634])
print(df_movies_part['overview_part'].iloc[2966])

In [None]:
word_split = df_movies_part['overview_part'].iloc[0].split(' ')
# word_split = df_movies_part['overview'].iloc[0].split(' ')

test_word = df_movies_part['overview_part'].iloc[634]
# print( df_movies_part['overview'].iloc[3605])
# test_word = df_movies_part['overview'].iloc[3605]
for word in word_split:
  word_space = ' {} '.format(word)
  # print(word_space)
  test_word = test_word.replace(word_space, '    ~~~~change~~~~     ')

test_word

'set    ~~~~change~~~~        ~~~~change~~~~     matrix tells story computer hacker joins group underground insurgents fighting vast powerful computers now rule earth.'