# 1. Movie Metadata and TF-IDF Exercise
- Apply TF-IDF to movie plot summaries to compute the similarity between movies

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2021)

### 1.1 Data
- From the dataset, use the `title` column (movie title) and the `overview` column (plot summary)

In [2]:
df = pd.read_csv('../02. Data/movies_metadata.csv')
df = df[["title", "overview"]]



In [3]:
df = df.iloc[:1000].copy()  # copy so later column assignments do not trigger SettingWithCopyWarning
df.shape

(1000, 2)

In [4]:
df

|     | title | overview |
| --- | --- | --- |
| 0 | Toy Story | Led by Woody, Andy's toys live happily in his ... |
| 1 | Jumanji | When siblings Judy and Peter discover an encha... |
| 2 | Grumpier Old Men | A family wedding reignites the ancient feud be... |
| 3 | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... |
| 4 | Father of the Bride Part II | Just when George Banks has recovered from his ... |
| ... | ... | ... |
| 995 | The Three Caballeros | For Donald's birthday he receives a box with t... |
| 996 | The Sword in the Stone | Wart is a young boy who aspires to be a knight... |
| 997 | So Dear to My Heart | The tale of Jeremiah Kincaid and his quest to ... |
| 998 | Robin Hood: Prince of Thieves | When the dastardly Sheriff of Nottingham murde... |


### 1.2 Cleaning
- Replace missing values in `overview` with an empty string

In [5]:
df["overview"].isna().sum()

12

In [6]:
df["overview"] = df["overview"].fillna('')

# 2. TF-IDF Calculation
- Compute the TF-IDF values with `TfidfVectorizer` from `sklearn.feature_extraction.text`
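
To make the weights less opaque, here is a minimal pure-Python sketch of the computation that `TfidfVectorizer` performs with its default settings (`smooth_idf=True`, `norm='l2'`): raw term counts times a smoothed inverse document frequency, followed by L2 normalization of each row. The function name `tfidf_rows` and the toy token lists are hypothetical, and tokenization and stop-word removal are assumed to have happened already, so this illustrates the arithmetic rather than reproducing the vectorizer.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Compute TF-IDF weights for pre-tokenized documents.

    docs: list of token lists. Returns one {term: weight} dict per document.
    Uses idf = ln((1 + n_docs) / (1 + df)) + 1 and L2-normalizes each row.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}

    rows = []
    for doc in docs:
        counts = Counter(doc)  # raw term frequency
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # An empty document has no terms, so it maps to an empty row
        rows.append({t: w / norm for t, w in raw.items()} if norm else {})
    return rows


# Hypothetical toy corpus: "room" appears in both documents, "buzz" twice in the first
docs = [
    ["woody", "buzz", "buzz", "room"],
    ["alan", "game", "room"],
]
weights = tfidf_rows(docs)
for row in weights:
    print({term: round(w, 4) for term, w in row.items()})
```

A term shared by both documents gets a lower weight than a term unique to one of them, and a term that occurs twice gets twice the weight of a term that occurs once with the same document frequency. The two-document sample in Section 2.1 is consistent with this: 0.10216005 / 0.14358239 is roughly 0.71, which matches 1 / (ln(3/2) + 1), and 0.43074717 is three times 0.14358239.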

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

### 2.1 Sample Data

In [8]:
df["overview"].values[:2]

array(["Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",
       "When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures."],
      dtype=object)

In [9]:
transformer = TfidfVectorizer(stop_words='english')
tfidf_matrix = transformer.fit_transform(df['overview'].values[:2])

tfidf_matrix.toarray()

array([[0.        , 0.        , 0.14358239, 0.        , 0.43074717,
        0.14358239, 0.14358239, 0.        , 0.14358239, 0.43074717,
        0.14358239, 0.        , 0.14358239, 0.        , 0.        ,
        0.14358239, 0.        , 0.14358239, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.14358239, 0.14358239,
        0.        , 0.        , 0.        , 0.        , 0.14358239,
        0.14358239, 0.14358239, 0.14358239, 0.        , 0.14358239,
        0.        , 0.        , 0.        , 0.14358239, 0.        ,
        0.14358239, 0.14358239, 0.        , 0.        , 0.        ,
        0.10216005, 0.        , 0.14358239, 0.14358239, 0.        ,
        0.        , 0.14358239, 0.        , 0.        , 0.43074717,
        0.        , 0.        ],
       [0.15160873, 0.15160873, 0.        , 0.30321746, 0.        ,
        0.        , 0.        , 0.15160873, 0.        , 0.        ,
        0.        , 0.15160873, 0.        , 0.15160873, 0.15160873,
        0.     

In [10]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
transformer.get_feature_names_out()[:10].tolist()

['26',
 'adult',
 'afraid',
 'alan',
 'andy',
 'aside',
 'birthday',
 'board',
 'brings',
 'buzz']

In [11]:
pd.DataFrame(tfidf_matrix.toarray(), columns=transformer.get_feature_names_out()).T.head(10)

|     | 0 | 1 |
| --- | --- | --- |
| 26 | 0.0 | 0.151609 |
| adult | 0.0 | 0.151609 |
| afraid | 0.143582 | 0.0 |
| alan | 0.0 | 0.303217 |
| andy | 0.430747 | 0.0 |
| aside | 0.143582 | 0.0 |
| birthday | 0.143582 | 0.0 |
| board | 0.0 | 0.151609 |
| brings | 0.143582 | 0.0 |
| buzz | 0.430747 | 0.0 |


### 2.2 Learning

In [12]:
transformer = TfidfVectorizer(stop_words='english')

### 2.3 Transformation

In [13]:
tfidf_matrix = transformer.fit_transform(df['overview'])
tfidf_matrix.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [14]:
# Inspect the last few terms in the learned vocabulary
transformer.get_feature_names_out()[-5:].tolist()

['zombies', 'zones', 'zookeeper', 'zorro', 'zulu']

# 3. Calculate Similarity Between Movies
- Compute the pairwise similarity between movies using cosine similarity
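
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells this out in plain Python for two dense vectors. The function name `cosine` is hypothetical, and returning 0.0 for an all-zero vector (such as the TF-IDF row of an empty overview) is a convention chosen here to avoid dividing by zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_u * norm_v)


print(cosine([1, 2], [2, 4]))   # same direction
print(cosine([1, 0], [0, 1]))   # no shared terms
print(cosine([1, 1], [1, 0]))   # partial overlap
```

Because TF-IDF weights are non-negative, the similarities in this notebook fall between 0 (no shared terms) and 1 (identical direction).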

In [15]:
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(tfidf_matrix)
similarity

array([[1.        , 0.01570657, 0.        , ..., 0.        , 0.        ,
        0.01234882],
       [0.01570657, 1.        , 0.05047128, ..., 0.        , 0.01578968,
        0.02378018],
       [0.        , 0.05047128, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.01578968, 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.01234882, 0.02378018, 0.        , ..., 0.        , 0.        ,
        1.        ]])

# 4. Recommend Similar Movies
- Recommend movies similar to the one at index 998 in the data

In [16]:
idx = 998
print(df.loc[idx, 'title'])

Robin Hood: Prince of Thieves


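From the `similarity` matrix computed above, extract the similarities between the movie at index 998 and every other movie, then return the indices with the highest similarity.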

In [17]:
similarity_one_idx = similarity[idx]

1. `argsort` returns the indices that would sort the values in ascending order
2. Reversing the `argsort` result puts the most similar indices first
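
The two steps above can be seen on a small array. The `scores` values below are hypothetical and chosen to be distinct so that the ordering is unambiguous.

```python
import numpy as np

# Hypothetical similarity scores for four items
scores = np.array([0.2, 1.0, 0.0, 0.7])

ascending = scores.argsort()    # indices from smallest to largest score
descending = ascending[::-1]    # reversed: indices from largest to smallest score

print(ascending.tolist())   # [2, 0, 3, 1]
print(descending.tolist())  # [1, 3, 0, 2]
```

Index 1 holds the largest score (1.0), so it comes first after the reversal, just as the query movie itself comes first in the notebook.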

In [18]:
order_idx = similarity_one_idx.argsort()[::-1]
order_idx[:100]

array([998, 515, 913, 215, 779, 598,  43, 150, 675,   3, 392, 148, 181,
        25,  99, 231, 241, 548, 725, 363, 331, 447, 988, 207, 270, 804,
       256, 517, 940, 807, 401, 319, 245, 660, 977, 400,  18, 648,  11,
       695, 459, 621, 402, 503, 903, 135, 971, 594, 790, 975, 185, 978,
       513, 823, 883, 408, 853, 628, 944, 668, 663,  64, 446, 147, 953,
       718, 240, 264, 670, 642, 484, 742, 461, 857, 178, 797, 295, 671,
       763, 412, 470, 564, 243,  54, 268, 829, 841, 547, 890,   8, 597,
        13, 623, 778, 581, 768, 316, 867, 777, 133], dtype=int64)

In [19]:
# The movie is most similar to itself, so it comes first; the next five indices are its closest matches
top5 = order_idx[:6]
top5

array([998, 515, 913, 215, 779, 598], dtype=int64)

In [20]:
# Look up the titles at those indices in the original data
df.loc[top5, 'title']

998      Robin Hood: Prince of Thieves
515          Robin Hood: Men in Tights
913       The Adventures of Robin Hood
215                   Boys on the Side
779                          Lone Star
598    Candyman: Farewell to the Flesh
Name: title, dtype: object

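The two closest matches, "Robin Hood: Men in Tights" and "The Adventures of Robin Hood", are other Robin Hood films, confirming that the recommendations are similar to "Robin Hood: Prince of Thieves".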
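
The lookup steps in this section could be wrapped in a small helper. The sketch below is one possible way to do it, not part of the original notebook: the function name `recommend`, its parameters, and the toy data are all hypothetical. Unlike the cells above, it drops the query movie from the results instead of returning it in first place.

```python
import numpy as np


def recommend(titles, similarity, idx, k=5):
    """Return the k titles most similar to titles[idx], excluding the movie itself."""
    order = np.argsort(similarity[idx])[::-1]  # most similar first
    order = order[order != idx][:k]            # drop the query movie, keep k matches
    return [titles[i] for i in order]


# Hypothetical 4x4 similarity matrix standing in for the notebook's `similarity`
toy_titles = ["A", "B", "C", "D"]
toy_similarity = np.array([
    [1.0, 0.1, 0.8, 0.3],
    [0.1, 1.0, 0.2, 0.0],
    [0.8, 0.2, 1.0, 0.5],
    [0.3, 0.0, 0.5, 1.0],
])

print(recommend(toy_titles, toy_similarity, idx=0, k=2))  # ['C', 'D']
```

With the objects from this notebook, the equivalent call would be `recommend(df["title"].tolist(), similarity, 998)`.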