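## Friday, October 27 exercise: TF-IDF-based movie recommender system

- Dataset: 5,000 movies from movies_metadata.csv
- Behavior: when the user enters a movie title, recommend the 10 movies whose "overview" is most similar to that of the entered title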

---------
Enter the title of a movie you enjoyed:
Toy Story
Here are the movie titles we recommend for you.
1)
2)
...
10)
---------

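<Development steps>
1) Extract the overview column > preprocess the words (stop-word removal, lemmatization (or stemming), case normalization, special-character removal, word normalization (wordnet), etc.) > build the corpus (regular expressions may be used)
2) Build the TF-IDF matrix
3) Recommend movies using cosine similarity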
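
As a rough illustration of step 1, the sketch below chains case normalization, special-character removal, short-word removal and stop-word removal using only the standard library. The `preprocess` function and the tiny `STOP_WORDS` set are hypothetical names made up for this example, and lemmatization is left out because it needs the WordNet corpus.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real run would use a fuller list (for example from NLTK or scikit-learn).
STOP_WORDS = {"the", "and", "his", "her", "their", "from", "but", "when", "until"}

def preprocess(text):
    """Lowercase, strip non-letters, then drop short words and stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # special characters and digits become spaces
    tokens = text.split()
    return [t for t in tokens if len(t) > 2 and t not in STOP_WORDS]

sample = "Led by Woody, Andy's toys live happily in his room."
print(preprocess(sample))
# ['led', 'woody', 'andy', 'toys', 'live', 'happily', 'room']
```

Joining the returned tokens with spaces gives one cleaned string per movie, which is the form a TF-IDF vectorizer expects.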

In [100]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import word_tokenize
from nltk.tokenize import WordPunctTokenizer
from tensorflow.keras.preprocessing.text import text_to_word_sequence

In [101]:
data=pd.read_csv('archive/movies_metadata.csv')
data=data.head(5000)
data


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,False,,0,"[{'id': 53, 'name': 'Thriller'}, {'id': 27, 'n...",,43715,tt0050294,en,The Deadly Mantis,The calving of an Arctic iceberg releases a gi...,...,1957-05-01,0.0,79.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,This Was the Day That Engulfed the World in Te...,The Deadly Mantis,False,5.3,16.0
4996,False,,60000000,"[{'id': 18, 'name': 'Drama'}]",,10052,tt0259288,en,Dragonfly,A grieving doctor is being contacted by his la...,...,2002-02-22,52322400.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,When someone you love dies... are they gone fo...,Dragonfly,False,6.2,209.0
4997,False,"{'id': 217704, 'name': 'The Vampire Chronicles...",35000000,"[{'id': 18, 'name': 'Drama'}, {'id': 14, 'name...",,11979,tt0238546,en,Queen of the Damned,Lestat de Lioncourt is awakened from his slumb...,...,2002-02-10,45479110.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,This time there are no interviews.,Queen of the Damned,False,5.5,247.0
4998,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name...",,75151,tt0260746,en,Big Bad Love,Vietnam veteran Leon Barlow is struggling as a...,...,2001-10-11,0.0,111.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Big Bad Love,False,6.5,4.0


In [102]:
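# .copy() makes an independent frame, so the in-place overview updates later on do not trigger SettingWithCopyWarning
data=data[['original_title', 'overview']].copy()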
data

Unnamed: 0,original_title,overview
0,Toy Story,"Led by Woody, Andy's toys live happily in his ..."
1,Jumanji,When siblings Judy and Peter discover an encha...
2,Grumpier Old Men,A family wedding reignites the ancient feud be...
3,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom..."
4,Father of the Bride Part II,Just when George Banks has recovered from his ...
...,...,...
4995,The Deadly Mantis,The calving of an Arctic iceberg releases a gi...
4996,Dragonfly,A grieving doctor is being contacted by his la...
4997,Queen of the Damned,Lestat de Lioncourt is awakened from his slumb...
4998,Big Bad Love,Vietnam veteran Leon Barlow is struggling as a...


In [103]:
overviews=data.overview
overviews

0       Led by Woody, Andy's toys live happily in his ...
1       When siblings Judy and Peter discover an encha...
2       A family wedding reignites the ancient feud be...
3       Cheated on, mistreated and stepped on, the wom...
4       Just when George Banks has recovered from his ...
                              ...                        
4995    The calving of an Arctic iceberg releases a gi...
4996    A grieving doctor is being contacted by his la...
4997    Lestat de Lioncourt is awakened from his slumb...
4998    Vietnam veteran Leon Barlow is struggling as a...
4999    A tale about Vietnamese refugees sent to an or...
Name: overview, Length: 5000, dtype: object

In [104]:
overview=overviews[0]
overview

"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences."

In [105]:
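import re

# Drop words of one or two characters, together with any punctuation or whitespace directly before them
# (this also strips the possessive "'s", so "Andy's" becomes "Andy")
shortword = re.compile(r'\W*\b\w{1,2}\b')
overview = shortword.sub('', overview)
print(overview)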

Led Woody, Andy toys live happily his room until Andy birthday brings Buzz Lightyear onto the scene. Afraid losing his place Andy heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns put aside their differences.
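
To make the effect of the short-word pattern easier to see, here is a small standalone check on two made-up sentences. The helper name `drop_short_words` is hypothetical; the regular expression is the one used in the cell above.

```python
import re

# Same pattern as in the notebook: optional non-word characters,
# followed by a whole word of one or two characters.
shortword = re.compile(r'\W*\b\w{1,2}\b')

def drop_short_words(text):
    """Remove every 1-2 character word plus the separators right before it."""
    return shortword.sub('', text)

print(drop_short_words("Andy's toys go to a zoo"))  # Andy toys zoo
print(drop_short_words("Rocky II is on TV"))        # Rocky
```

The second line shows a side effect worth keeping in mind: short but meaningful tokens such as sequel numerals or abbreviations are removed along with words like "is" and "on".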


In [106]:
words=text_to_word_sequence(overview)
words

['led',
 'woody',
 'andy',
 'toys',
 'live',
 'happily',
 'his',
 'room',
 'until',
 'andy',
 'birthday',
 'brings',
 'buzz',
 'lightyear',
 'onto',
 'the',
 'scene',
 'afraid',
 'losing',
 'his',
 'place',
 'andy',
 'heart',
 'woody',
 'plots',
 'against',
 'buzz',
 'but',
 'when',
 'circumstances',
 'separate',
 'buzz',
 'and',
 'woody',
 'from',
 'their',
 'owner',
 'the',
 'duo',
 'eventually',
 'learns',
 'put',
 'aside',
 'their',
 'differences']

In [107]:
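from nltk.stem import WordNetLemmatizer

# Requires the WordNet corpus (nltk.download('wordnet') on first use)
lemmatizer = WordNetLemmatizer()
# lemmatize() treats every word as a noun by default, so verb forms such as 'brings' and 'learns' are left unchanged
words=[lemmatizer.lemmatize(word) for word in words]
words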

['led',
 'woody',
 'andy',
 'toy',
 'live',
 'happily',
 'his',
 'room',
 'until',
 'andy',
 'birthday',
 'brings',
 'buzz',
 'lightyear',
 'onto',
 'the',
 'scene',
 'afraid',
 'losing',
 'his',
 'place',
 'andy',
 'heart',
 'woody',
 'plot',
 'against',
 'buzz',
 'but',
 'when',
 'circumstance',
 'separate',
 'buzz',
 'and',
 'woody',
 'from',
 'their',
 'owner',
 'the',
 'duo',
 'eventually',
 'learns',
 'put',
 'aside',
 'their',
 'difference']

In [108]:
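# Apply the same steps to every overview: drop short words, tokenize and lowercase, lemmatize
shortword = re.compile(r'\W*\b\w{1,2}\b')
result = []

for overview in overviews:
    # str() guards against missing values; any missing overview (NaN) would become the literal token 'nan'
    overview_s = shortword.sub('', str(overview))
    words = text_to_word_sequence(overview_s)
    words_final = [lemmatizer.lemmatize(word) for word in words]
    result.append(words_final)

result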

[['led',
  'woody',
  'andy',
  'toy',
  'live',
  'happily',
  'his',
  'room',
  'until',
  'andy',
  'birthday',
  'brings',
  'buzz',
  'lightyear',
  'onto',
  'the',
  'scene',
  'afraid',
  'losing',
  'his',
  'place',
  'andy',
  'heart',
  'woody',
  'plot',
  'against',
  'buzz',
  'but',
  'when',
  'circumstance',
  'separate',
  'buzz',
  'and',
  'woody',
  'from',
  'their',
  'owner',
  'the',
  'duo',
  'eventually',
  'learns',
  'put',
  'aside',
  'their',
  'difference'],
 ['when',
  'sibling',
  'judy',
  'and',
  'peter',
  'discover',
  'enchanted',
  'board',
  'game',
  'that',
  'open',
  'the',
  'door',
  'magical',
  'world',
  'they',
  'unwittingly',
  'invite',
  'alan',
  'adult',
  'who',
  'been',
  'trapped',
  'inside',
  'the',
  'game',
  'for',
  'year',
  'into',
  'their',
  'living',
  'room',
  'alan',
  'only',
  'hope',
  'for',
  'freedom',
  'finish',
  'the',
  'game',
  'which',
  'prof',
  'risky',
  'all',
  'three',
  'find',
  'th

In [109]:
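shortword = re.compile(r'\W*\b\w{1,2}\b')
lemmatizer = WordNetLemmatizer()

# Overwrite each overview with its cleaned, space-joined tokens so that TfidfVectorizer can consume plain strings
# (enumerate() matches the row labels here because the frame keeps its default 0..4999 index)
for index, overview in enumerate(data['overview']):
    overview_s = shortword.sub('', str(overview))
    words = text_to_word_sequence(overview_s)
    words_final = [lemmatizer.lemmatize(word) for word in words]
    data.at[index, 'overview'] = ' '.join(words_final)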

In [110]:
data

Unnamed: 0,original_title,overview
0,Toy Story,led woody andy toy live happily his room until...
1,Jumanji,when sibling judy and peter discover enchanted...
2,Grumpier Old Men,family wedding reignites the ancient feud betw...
3,Waiting to Exhale,cheated mistreated and stepped the woman are h...
4,Father of the Bride Part II,just when george bank ha recovered from his da...
...,...,...
4995,The Deadly Mantis,the calving arctic iceberg release giant prayi...
4996,Dragonfly,grieving doctor being contacted his late wife ...
4997,Queen of the Damned,lestat lioncourt awakened from his slumber bor...
4998,Big Bad Love,vietnam veteran leon barlow struggling writer ...


In [111]:
data.overview[0]

'led woody andy toy live happily his room until andy birthday brings buzz lightyear onto the scene afraid losing his place andy heart woody plot against buzz but when circumstance separate buzz and woody from their owner the duo eventually learns put aside their difference'

In [112]:
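# Default settings: stop words are not removed at this point
# (TfidfVectorizer(stop_words='english') would drop them and change the vocabulary size shown below)
tfidf=TfidfVectorizer()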

In [113]:
tfidf_mat=tfidf.fit_transform(data['overview'])

In [114]:
tfidf_mat

<5000x20240 sparse matrix of type '<class 'numpy.float64'>'
	with 183341 stored elements in Compressed Sparse Row format>
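
For intuition about what each stored value in this matrix represents, the following standard-library sketch computes TF-IDF weights by hand on a made-up three-document corpus. It follows the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The function name `tfidf_vectors` and the toy documents are assumptions for illustration, and the simple whitespace tokenization differs from the vectorizer's own tokenizer.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, each L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf, as in scikit-learn's documented default (smooth_idf=True).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts
        weights = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Toy corpus, made up for illustration (not taken from the dataset).
docs = ["toy cowboy toy", "space toy ranger", "board game jungle"]
vecs = tfidf_vectors(docs)
print(vecs[0])
```

Storing only the non-zero terms per document mirrors why the notebook's matrix is sparse: each overview uses a tiny fraction of the 20,240-term vocabulary.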

In [115]:
tfidf_mat.toarray()[0]

array([0., 0., 0., ..., 0., 0., 0.])

In [116]:
tfidf.vocabulary_

{'led': 10503,
 'woody': 19984,
 'andy': 902,
 'toy': 18475,
 'live': 10754,
 'happily': 8206,
 'his': 8570,
 'room': 15519,
 'until': 19080,
 'birthday': 2052,
 'brings': 2509,
 'buzz': 2730,
 'lightyear': 10661,
 'onto': 12874,
 'the': 18148,
 'scene': 15895,
 'afraid': 563,
 'losing': 10879,
 'place': 13731,
 'heart': 8361,
 'plot': 13796,
 'against': 578,
 'but': 2714,
 'when': 19766,
 'circumstance': 3416,
 'separate': 16155,
 'and': 882,
 'from': 7293,
 'their': 18154,
 'owner': 13130,
 'duo': 5698,
 'eventually': 6327,
 'learns': 10487,
 'put': 14429,
 'aside': 1236,
 'difference': 5103,
 'sibling': 16466,
 'judy': 9874,
 'peter': 13561,
 'discover': 5216,
 'enchanted': 6041,
 'board': 2190,
 'game': 7425,
 'that': 18140,
 'open': 12879,
 'door': 5471,
 'magical': 11080,
 'world': 20004,
 'they': 18181,
 'unwittingly': 19099,
 'invite': 9520,
 'alan': 656,
 'adult': 502,
 'who': 19810,
 'been': 1807,
 'trapped': 18559,
 'inside': 9330,
 'for': 7086,
 'year': 20115,
 'into': 9464

In [117]:
tfidf.get_feature_names_out()

array(['000', '007', '05pm', ..., 'zyto', 'émigré', 'état'], dtype=object)

In [118]:
tfidf_df=pd.DataFrame(tfidf_mat.toarray(), columns=tfidf.get_feature_names_out())
tfidf_df

Unnamed: 0,000,007,05pm,100,101,103,10th,114,1183,11th,...,zords,zorin,zorro,zubin,zuckermann,zula,zulu,zyto,émigré,état
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [135]:
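# cosine_similarity accepts the sparse matrix directly, which avoids going through the dense 5000 x 20240 DataFrame
cs=cosine_similarity(tfidf_mat)
cs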

array([[1.        , 0.03425493, 0.01012865, ..., 0.02340008, 0.03019161,
        0.00996605],
       [0.03425493, 1.        , 0.0542086 , ..., 0.03500056, 0.00875896,
        0.02331573],
       [0.01012865, 0.0542086 , 1.        , ..., 0.03529602, 0.00962463,
        0.02168051],
       ...,
       [0.02340008, 0.03500056, 0.03529602, ..., 1.        , 0.06262757,
        0.0346923 ],
       [0.03019161, 0.00875896, 0.00962463, ..., 0.06262757, 1.        ,
        0.03874545],
       [0.00996605, 0.02331573, 0.02168051, ..., 0.0346923 , 0.03874545,
        1.        ]])
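
Each entry of this matrix is the cosine of the angle between two TF-IDF vectors, that is, their dot product divided by the product of their lengths. A minimal sketch of that computation for two sparse vectors stored as dictionaries is shown below; the `cosine` helper and the example weights are made up for illustration, and returning 0.0 for an empty vector is a convention chosen here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

# Made-up weights for illustration.
a = {"toy": 2.0, "cowboy": 1.0}
b = {"toy": 1.0, "space": 1.0}
c = {"jungle": 1.0}
print(cosine(a, a), cosine(a, b), cosine(a, c))
```

A vector compared with itself scores (up to rounding) 1, which explains the diagonal of ones above, and documents that share no terms score 0.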

In [136]:
cs.shape

(5000, 5000)

In [137]:
cs[0]

array([1.        , 0.03425493, 0.01012865, ..., 0.02340008, 0.03019161,
       0.00996605])

In [148]:
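# argsort() sorts ascending: the last 11 indices are the 11 highest similarities, and [:10] drops the final one (the movie itself, similarity 1.0)
# The result is therefore ordered from least to most similar
cs[0].argsort()[-11:][:10]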

array([1884, 3252, 2157, 4078,  485,  448, 1932, 3057, 1071, 2997],
      dtype=int64)

In [126]:
data['original_title'].iloc[[1884, 3252, 2157, 4078,  485,  448, 1932, 3057, 1071, 2997]]

1884           Child's Play 3
3252          Bound for Glory
2157        Indecent Proposal
4078                Losin' It
485                    Malice
448         For Love or Money
1932                Condorman
3057          Man on the Moon
1071    Rebel Without a Cause
2997              Toy Story 2
Name: original_title, dtype: object

In [151]:
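movie_title = input('Movie title: ')

# Row index of the first movie whose title matches exactly (raises IndexError if the title is not in the data)
movie_index = data[data['original_title']==movie_title].index[0]

# Ten most similar movies, excluding the movie itself, listed from least to most similar
movie = cs[movie_index].argsort()[-11:][:10]

data['original_title'].iloc[movie]

Movie title: Jumanji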


1951                    BASEketball
4823    The People That Time Forgot
363                        Maverick
3056               Any Given Sunday
696                    Celtic Pride
3872             Dungeons & Dragons
4652          Sidewalks of New York
2056       Your Friends & Neighbors
2486                       eXistenZ
1506             The Innocent Sleep
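
The lookup above lists results from least to most similar and fails with an IndexError when the title is unknown. One possible variant, sketched with the standard library only, excludes the query by index, returns the closest match first, and returns an empty list for an unknown title. The `recommend` function, the four titles and the similarity values are hypothetical stand-ins for `data['original_title']` and `cs`.

```python
def recommend(title, titles, sim, k=10):
    """Return up to k titles most similar to `title`, most similar first.

    `titles` is a list of movie titles and `sim` a square similarity matrix
    (list of lists) in the same order; both are hypothetical stand-ins.
    """
    if title not in titles:
        return []  # unknown title: nothing to recommend
    i = titles.index(title)
    # Rank every other movie by similarity, highest first; skip the query itself.
    others = [j for j in range(len(titles)) if j != i]
    ranked = sorted(others, key=lambda j: sim[i][j], reverse=True)
    return [titles[j] for j in ranked[:k]]

# Made-up titles and similarities, for illustration only.
titles = ["A", "B", "C", "D"]
sim = [
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.4],
    [0.7, 0.3, 1.0, 0.0],
    [0.1, 0.4, 0.0, 1.0],
]
print(recommend("A", titles, sim, k=2))  # ['C', 'B']
```

Excluding the query by index rather than by position in the sorted order also keeps the result correct when another movie happens to have a similarity of exactly 1.0.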
Name: original_title, dtype: object