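## Content-based Filtering
- Recommends items whose content is similar
- Recommends items similar to those the user has experienced in the past
- Recommends item Y, which is similar to an item X that user A rated highly or showed strong interest in
    - ex : 
        - Websites, blogs, news : find and recommend posts with similar content
        - Movies : recommend movies that share features such as actors, genre, or director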

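#### Overall process
1. Identify items the user has encountered in the past and was satisfied with
2. Select items similar to some or all of the items the user liked
3. Recommend the selected items to the user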
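
The three steps can be sketched with NumPy on toy data. The feature matrix, the ratings, and the "rating >= 4 means liked" threshold are all assumptions made for illustration, not values from the notebook below.

```python
import numpy as np

# Hypothetical toy data: 5 items described by 3 binary features.
item_features = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
], dtype=float)

# Hypothetical ratings for one user (0 means "not seen yet").
user_ratings = np.array([5, 0, 0, 1, 0])

# Step 1: items the user has seen and liked (assumed threshold: rating >= 4).
liked = np.where(user_ratings >= 4)[0]

# Step 2: cosine similarity of every item to its closest liked item.
unit = item_features / np.linalg.norm(item_features, axis=1, keepdims=True)
sim_to_liked = (unit @ unit[liked].T).max(axis=1)

# Step 3: recommend unseen items, most similar first.
unseen = np.where(user_ratings == 0)[0]
recommended = unseen[np.argsort(-sim_to_liked[unseen], kind="stable")]
print(recommended)
```

Taking the maximum over liked items is one possible choice; averaging the similarities would be an equally reasonable variant.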

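Advantages
- Can start with little information
- Not affected by other users
- Can recommend a wide range of items ( unique, new, unpopular )
- Can explain why an item is recommended ( by presenting the item's features )

Disadvantages
- Hard to find appropriate features
- Hard to recommend for new users ( no user profile, or too little data )
- Repeatedly recommends items with the features the user already prefers
    - Overspecialization
    - Hard to reflect a user's diverse tastes
    - Cannot recommend anything outside the user profile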

### item profile
- Extract features for every item
- item representation
    - `one-hot vector`
    - DataFrame(pandas) : `dummy vector`, `get_dummies()`
- item similarity
    - cosine similarity
    - euclidean distance
    - pearson correlation
    - jaccard similarity

- ex 
    - Movies : title, writer, actors, genre, movie reviews, etc.
    - News : article body, headline, etc.
    - Music : genre, lyrics, audio, title, reviews, music introduced together
    - Text data : extract keywords, then apply TF-IDF
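
As a minimal sketch of the one-hot item representation, `pd.get_dummies()` can turn a categorical column into dummy vectors. The catalogue below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy catalogue; the titles and genres are made up.
items = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": ["Action", "Comedy", "Action"],
})

# One-hot (dummy) encoding of the categorical 'genre' column.
item_profile = pd.get_dummies(items["genre"], dtype=int)
item_profile.index = items["title"]
print(item_profile)
```

Each row is then an item vector that can be fed to any of the similarity measures listed above.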


### User profile
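- Capture user preferences from explicit feedback : ratings, like/dislike
- Built from the features of the items the user has, plus surveyed user attributes
- A feature's weight is the average of that feature's weights over the user's items
- Implicit feedback : search logs, purchases after selecting an item, history of tagging similar items
- Learning a user profile is equivalent to a classification problem in machine learning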
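
The "average of item feature weights" rule can be sketched as follows. The feature matrix and the list of consumed items are hypothetical, and scoring items by cosine similarity to the profile is one common choice rather than the only one.

```python
import numpy as np

# Hypothetical item feature weights (rows: items, columns: features).
item_features = np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [1.0, 1.0, 0.0],
])

# Indices of the items this (hypothetical) user has consumed.
user_items = [0, 2]

# User profile: mean feature weight over the user's items.
user_profile = item_features[user_items].mean(axis=0)

# Score every item by cosine similarity to the profile.
scores = item_features @ user_profile / (
    np.linalg.norm(item_features, axis=1) * np.linalg.norm(user_profile)
)
print(user_profile, scores)
```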

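### Distance and similarity REMIND
- The notion of distance in clustering
    - The opposite of similarity
    - Nature of the feature values ( what kind of data is it? )
        ```
        Quantitative features --- numeric values ------ distance is defined
                               |                 |
                               |--- ordinal values
        Qualitative features --|
                               |
                               |--- nominal values --- distance is not defined
        ```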

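- `Minkowski distance` : a generalization of the Euclidean and Manhattan distances
    - A distance measure between two points $x_i=(x_{i1}, ..., x_{id})^T$ and $x_j=(x_{j1}, ..., x_{jd})^T$
    - $d_{ij} = (\sum_{k=1}^{d} |x_{ik} - x_{jk}|^p)^{1/p}$
    - Euclidean distance : $(p=2)$ $d_{ij} = \sqrt{\sum_{k=1}^{d} |x_{ik} - x_{jk}|^2}$
    - Manhattan distance : $(p=1)$ $d_{ij} = \sum_{k=1}^{d} |x_{ik} - x_{jk}|$

- `Mahalanobis distance` : takes into account the distribution the points belong to when computing the distance
    - $((x-\mu_{i})^T \Sigma^{-1} (x-\mu_{i}))^{1/2}$, where $\mu_i$ is the mean and $\Sigma$ the covariance matrix of the distribution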

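- Cosine similarity : similarity based on the angle between two vectors
    - Mainly used in document retrieval (term frequencies are the feature values)
    - Cosine of the angle between vectors A and B : $\frac{A \cdot B}{\|A\| \|B\|}$
- Tanimoto coefficient
    - Used when the data is binary
    - $sim(x,y) = \frac{c}{a+b-c}$
        - $a$ : number of 1s in x
        - $b$ : number of 1s in y
        - $c$ : number of positions where both x and y are 1
- Pearson correlation coefficient
    - Correlation analysis : a method from probability theory and statistics for analyzing what linear relationship two variables have
    - Regression analysis : models how one variable depends on the other (direction, strength, and a mathematical model)
    - $r = \frac{\text{degree to which X and Y vary together}}{\text{degree to which X and Y vary separately}}$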
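
The measures above can be written directly from their formulas. This is a small NumPy sketch; the function names and test vectors are illustrative choices, not part of the notebook.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def mahalanobis(x, mu, cov):
    """Mahalanobis distance of x from a distribution with mean mu and covariance cov."""
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def tanimoto(x, y):
    """Tanimoto coefficient c / (a + b - c) for binary (0/1 integer) vectors."""
    a, b, c = x.sum(), y.sum(), (x & y).sum()
    return float(c / (a + b - c))

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

u = np.array([0.0, 0.0])
v = np.array([3.0, 4.0])
print(minkowski(u, v, 1), minkowski(u, v, 2))   # Manhattan, Euclidean
print(mahalanobis(v, u, np.eye(2)))             # identity covariance reduces to Euclidean
print(tanimoto(np.array([1, 1, 0, 1]), np.array([1, 0, 0, 1])))
print(pearson(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))
```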
    

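### Euclidean distance vs. cosine similarity
- Scale difference between vectors
    - Large : use cosine similarity
    - Not large : use Euclidean distance
- Euclidean distance : simply the distance between two vectors
- Cosine similarity : the angle between two vectors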
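
A quick way to see the effect of scale is to compare a vector with a scaled copy of itself. The vectors below are arbitrary illustrative values.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 10 * a          # same direction, 10x the magnitude

euclid = np.linalg.norm(a - b)                                   # grows with the scale gap
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))         # depends only on direction
print(euclid, cosine)
```

The Euclidean distance is large even though the two vectors point in exactly the same direction, while the cosine similarity stays at its maximum.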

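### Algorithm : KNN
- Algorithms for analyzing item content
    - Nearest-neighbor methods (K-NN), Machine Learning, TF-IDF
- k-nearest neighbors
    - A classification method that, for data distributed in a feature space, looks at the k closest neighbors (by Euclidean distance) and assigns a label by majority vote
    - `K-NN Classification` : find the k nearest neighbors and classify by majority vote
    - `K-NN Regression` : find the k nearest neighbors and predict their average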
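
Both K-NN variants fit in a few lines of NumPy. This is a minimal sketch on hypothetical toy data; `knn_predict` is an illustrative helper, not a library function.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, mode="classification"):
    """Predict for one query point x using its k nearest neighbors (Euclidean)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    if mode == "classification":
        # Majority vote over the neighbors' labels.
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    # Regression: average of the neighbors' target values.
    return float(np.mean(y_train[nearest]))

# Hypothetical toy data: two well-separated groups in 2-D.
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
labels = np.array(["a", "a", "a", "b", "b", "b"])
targets = np.array([1.0, 1.0, 1.0, 4.0, 5.0, 6.0])

print(knn_predict(X, labels, np.array([0.5, 0.5]), k=3))
print(knn_predict(X, targets, np.array([5.2, 5.2]), k=3, mode="regression"))
```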

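### Algorithm : TF-IDF
- A statistical method for evaluating how important a word is in a document
- Commonly used to turn text data into numbers. Words that occur often but are spread across the whole corpus get a lower weight, and words that occur often in a particular document get a higher weight, which emphasizes the important words in a text.
    ```
    1. Computed by combining term frequency (TF) and inverse document frequency (IDF)
    2. TF(d,t) = (number of occurrences of term t in document d) / (total number of terms in document d)
    3. IDF : indicates how rare a term is across the whole corpus
        - IDF(t) = log(total number of documents / number of documents containing term t)
    4. TF-IDF = TF * IDF
    ```
- Document Term Matrix
    - BoW(Bag of Words), stopword removal
    - DTM : document-term matrix
    - TDM : term-document matrix
    - TTM : term-term matrix
- TF(Term Frequency) : how often word w appears in document d
- DF(Document Frequency) : number of documents d in which word w appears
- IDF(Inverse Document Frequency) : a value inversely related to the number of documents containing word w, $\log\frac{N}{DF}$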
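
The formulas above can be applied by hand to a tiny corpus. The three tokenized documents are made up for illustration, and the natural logarithm is an arbitrary choice of base.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenized.
docs = [
    ["space", "battle", "space"],
    ["love", "story"],
    ["space", "love"],
]

n_docs = len(docs)
# DF: number of documents in which each term appears.
df = Counter(term for doc in docs for term in set(doc))
# IDF(t) = log(N / DF(t)), as in the formula above.
idf = {term: math.log(n_docs / count) for term, count in df.items()}

def tf_idf(doc):
    """TF-IDF weights for one document, with TF = count / document length."""
    counts = Counter(doc)
    return {t: (c / len(doc)) * idf[t] for t, c in counts.items()}

weights = [tf_idf(doc) for doc in docs]
print(weights[0])
```

In the first document, "battle" ends up with a higher weight than "space" even though "space" occurs twice, because "space" also appears in another document. Libraries such as scikit-learn use a smoothed and normalized variant of these formulas, so their numbers will differ from this sketch.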


In [1]:
import numpy as np
import pandas as pd

### Data Load

In [2]:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('/Users/jun/Library/Mobile Documents/com~apple~CloudDocs/Github/ai _recommendation _system/data/u.user', sep='|', names=u_cols, encoding='latin-1')
users.head()

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [3]:
i_cols = ['movie_id', 'title', 'release date', 'video release date', 'IMDB URL', 'unknown', 
          'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 
          'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 
          'Thriller', 'War', 'Western']
movies = pd.read_csv('/Users/jun/Library/Mobile Documents/com~apple~CloudDocs/Github/ai _recommendation _system/data/u.item', sep='|', names=i_cols, encoding='latin-1')
movies.head()

Unnamed: 0,movie_id,title,release date,video release date,IMDB URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [4]:
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('/Users/jun/Library/Mobile Documents/com~apple~CloudDocs/Github/ai _recommendation _system/data/u.data', sep='\t', names=r_cols, encoding='latin-1')
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [5]:
# drop the timestamp column
ratings = ratings.drop('timestamp', axis=1)
ratings.head()

Unnamed: 0,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


### Build a per-movie genre DataFrame

In [6]:
# keep only the genre columns (drop movie ID, title, and the other metadata)
genres = ['unknown', 'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 
          'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
genre_representation = movies[genres]
genre_representation

Unnamed: 0,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1678,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0
1679,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0
1680,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


---
## Movie recommendation using IDF

### Count movies per genre to compute IDF

In [7]:
movie_genre = []
genre_count = dict()
for each_movie_genre in genres:
    genre_count[each_movie_genre]= genre_representation[each_movie_genre].sum()

genre_count 

{'unknown': np.int64(2),
 'Action': np.int64(251),
 'Adventure': np.int64(135),
 'Animation': np.int64(42),
 "Children's": np.int64(122),
 'Comedy': np.int64(505),
 'Crime': np.int64(109),
 'Documentary': np.int64(50),
 'Drama': np.int64(725),
 'Fantasy': np.int64(22),
 'Film-Noir': np.int64(24),
 'Horror': np.int64(92),
 'Musical': np.int64(56),
 'Mystery': np.int64(61),
 'Romance': np.int64(247),
 'Sci-Fi': np.int64(101),
 'Thriller': np.int64(251),
 'War': np.int64(71),
 'Western': np.int64(27)}

### IDF per genre
- Measures how often each genre appears across the whole dataset
- In other words, quantifies how rare a given genre is among all movies
- This IDF value reflects each genre's rarity: rarer genres get a larger weight, which emphasizes them in the recommender

In [8]:
total_movie = len(genre_representation) 

genre_idf = dict()
for each_movie_genre in genres:
    genre_idf[each_movie_genre]= np.log10(total_movie/genre_count[each_movie_genre]) # total number of movies / number of movies in this genre

genre_idf                                     

{'unknown': np.float64(2.924795995797912),
 'Action': np.float64(0.8261522699808552),
 'Adventure': np.float64(1.0954922229668873),
 'Animation': np.float64(1.6025767010639929),
 "Children's": np.float64(1.1394661607871452),
 'Comedy': np.float64(0.5225346133432319),
 'Crime': np.float64(1.1883994935212696),
 'Documentary': np.float64(1.5268559871258747),
 'Drama': np.float64(0.3654879848908996),
 'Fantasy': np.float64(1.883403310639687),
 'Film-Noir': np.float64(1.8456147497502873),
 'Horror': np.float64(1.262038164116338),
 'Musical': np.float64(1.477637964455693),
 'Mystery': np.float64(1.4404961564511263),
 'Romance': np.float64(0.8331290382022276),
 'Sci-Fi': np.float64(1.2215046176792508),
 'Thriller': np.float64(0.8261522699808552),
 'War': np.float64(1.3745676427428182),
 'Western': np.float64(1.794462227302906)}

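### Per-movie genre IDF DataFrame
- Apply the IDF value of each genre a movie has, so that the movie's genre representation becomes more specific.
- Each movie is treated as a separate document, and its genre composition is quantified using each genre's IDF weight.
- Each movie's unique combination of genres then forms its profile, which lets content-based filtering make more accurate recommendations.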

In [9]:
from tqdm import tqdm

genre_idf_representation = pd.DataFrame(columns=genres, index=movies.index)

for index, each_row in tqdm(genre_representation.iterrows()): # iterate over each row (movie); index : the movie's index, each_row : that movie's genre flags
    dict_temp = dict()

    for each_genre in genres:
        if each_row[each_genre] > 0 : # the movie belongs to this genre
            dict_temp[each_genre] = genre_idf[each_genre] # store the genre's IDF value

    row_to_add = pd.DataFrame(dict_temp, index=[index]) # one-row DataFrame holding this movie's per-genre IDF values
    genre_idf_representation.update(row_to_add) # updates the row with the same index in place

genre_idf_representation

1682it [00:01, 1479.00it/s]


Unnamed: 0,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,,,,1.602577,1.139466,0.522535,,,,,,,,,,,,,
1,,0.826152,1.095492,,,,,,,,,,,,,,0.826152,,
2,,,,,,,,,,,,,,,,,0.826152,,
3,,0.826152,,,,0.522535,,,0.365488,,,,,,,,,,
4,,,,,,,1.188399,,0.365488,,,,,,,,0.826152,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,,,,,,,,,0.365488,,,,,,,,,,
1678,,,,,,,,,,,,,,,0.833129,,0.826152,,
1679,,,,,,,,,0.365488,,,,,,0.833129,,,,
1680,,,,,,0.522535,,,,,,,,,,,,,


- `update()`
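    - Overwrites existing values : only values at matching index and column labels are updated. NaN values in the DataFrame passed to `update()` are ignored, and the original values are kept.
    - Aligns on index and columns : the update applies only where the index and column labels of the two DataFrames match; non-matching labels are ignored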
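
Both behaviors of `DataFrame.update()` can be checked on a small example. The frames `target` and `patch` are made up for illustration.

```python
import numpy as np
import pandas as pd

target = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=[0, 1])
patch = pd.DataFrame({"a": [10.0], "b": [np.nan], "c": [99.0]}, index=[1])

# Modifies `target` in place and returns None.
target.update(patch)
print(target)
```

Only `target.loc[1, "a"]` changes: the NaN in `patch["b"]` is skipped, and column `c` is ignored because `target` has no such column.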

```python
# dict comprehension style
from tqdm import tqdm

genre_idf_representation = pd.DataFrame(columns=genres, index=movies.index)

for index, each_row in tqdm(genre_representation.iterrows()):
    dict_temp = {each_genre: genre_idf[each_genre] for each_genre in genres if each_row[each_genre] > 0}
    row_to_add = pd.DataFrame(dict_temp, index=[index])
    genre_idf_representation.update(row_to_add)

genre_idf_representation
```
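
As an alternative to the row-by-row loop, the same kind of table can be produced with a single broadcasted multiplication. This is a sketch on a hypothetical four-movie stand-in for `genre_representation`; note that it yields 0 instead of NaN for absent genres, so no later `fillna(0)` would be needed.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for genre_representation: 4 movies, 3 genres.
genre_flags = pd.DataFrame(
    {"Action": [1, 0, 1, 0], "Comedy": [0, 1, 1, 1], "Drama": [0, 0, 0, 1]}
)

# IDF per genre: log10(total movies / movies in the genre).
genre_idf = np.log10(len(genre_flags) / genre_flags.sum())

# Multiply each 0/1 column by its genre's IDF (aligned on column labels).
genre_idf_matrix = genre_flags * genre_idf
print(genre_idf_matrix)
```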

### Content similarity based on genre IDF

In [13]:
genre_idf_representation = genre_idf_representation.fillna(0) # fill missing values with 0
genre_idf_representation.head()



Unnamed: 0,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0.0,0.0,0.0,1.602577,1.139466,0.522535,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.826152,1.095492,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.826152,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.826152,0.0,0.0
3,0.0,0.826152,0.0,0.0,0.0,0.522535,0.0,0.0,0.365488,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,1.188399,0.0,0.365488,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.826152,0.0,0.0


In [14]:
from sklearn.metrics.pairwise import cosine_similarity

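def cos_sim_matrix(a, b):
    cos_sim = cosine_similarity(a, b)
    # rows are labeled by a's index, columns by b's index
    result_df = pd.DataFrame(data=cos_sim, index=a.index, columns=b.index)

    return result_df

- `data=cos_sim`: use the cosine similarity matrix as the DataFrame's data
- `index=a.index`, `columns=b.index`: label the rows with the index of input a and the columns with the index of input b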

In [15]:
cs_df = cos_sim_matrix(genre_idf_representation, genre_idf_representation) # (a, b)
cs_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1672,1673,1674,1675,1676,1677,1678,1679,1680,1681
0,1.0,0.0,0.0,0.128589,0.0,0.0,0.0,0.59149,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.256822,0.0
1,0.0,1.0,0.515826,0.408337,0.285474,0.0,0.0,0.0,0.0,0.0,...,0.729488,0.0,0.0,0.0,0.0,0.0,0.363207,0.0,0.0,0.0
2,0.0,0.515826,1.0,0.0,0.553431,0.0,0.0,0.0,0.0,0.0,...,0.707107,0.0,0.0,0.0,0.0,0.0,0.704127,0.0,0.0,0.0
3,0.128589,0.408337,0.0,1.0,0.085744,0.35021,0.100389,0.298391,0.35021,0.089992,...,0.559759,0.35021,0.35021,0.35021,0.35021,0.35021,0.0,0.140692,0.500692,0.35021
4,0.0,0.285474,0.553431,0.085744,1.0,0.244837,0.070184,0.068531,0.244837,0.062914,...,0.391335,0.244837,0.244837,0.244837,0.244837,0.244837,0.389686,0.09836,0.0,0.244837


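- `cos_sim_matrix()`: a function that takes two matrices and computes the cosine similarity between their rows
    - Cosine similarity : based on the angle between two vectors; the closer the value is to 1, the more similar the vectors, and the closer to 0, the less similar

- `Representing genre information as TF-IDF` : expressing **how important each genre is for each movie** as a vector
    - A movie plays the role of a document, and a genre plays the role of a word.
    - TF ( genre frequency ) : whether a given genre appears in a movie. A movie can have several genres, but each genre appears at most once, so here TF is 0 or 1
    - IDF ( inverse genre frequency ) : how rare a given genre is across all movies
    - TF-IDF : the product of TF and IDF, indicating how important a given genre is for a given movie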
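
To make the similarity values concrete, the entry for rows 0 (Toy Story) and 3 (Get Shorty) can be recomputed by hand. The sketch below copies the rounded IDF weights from the `genre_idf` output above; `genre_vector` is an illustrative helper, and only the genres these two movies have are included because the remaining entries are 0 for both.

```python
import numpy as np

# IDF weights copied from the genre_idf output above (rounded).
idf = {"Action": 0.826152, "Animation": 1.602577, "Children's": 1.139466,
       "Comedy": 0.522535, "Drama": 0.365488}
genres = list(idf)

def genre_vector(movie_genres):
    """IDF-weighted genre vector: the genre's IDF if present, else 0."""
    return np.array([idf[g] if g in movie_genres else 0.0 for g in genres])

toy_story = genre_vector({"Animation", "Children's", "Comedy"})   # row 0
get_shorty = genre_vector({"Action", "Comedy", "Drama"})          # row 3

cos = toy_story @ get_shorty / (np.linalg.norm(toy_story) * np.linalg.norm(get_shorty))
print(round(float(cos), 6))
```

The two movies share only Comedy, a common and therefore low-weight genre, so the result should land close to the small value shown for rows 0 and 3 in the `cs_df` output.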

### Movies most similar to movie[i]

In [24]:
# sort in descending order
cs_df[1].sort_values(ascending=False)[:10]

826     1.0
117     1.0
929     1.0
1012    1.0
1313    1.0
1       1.0
981     1.0
1015    1.0
116     1.0
565     1.0
Name: 1, dtype: float64

In [17]:
movies.loc[1]

movie_id                                                              2
title                                                  GoldenEye (1995)
release date                                                01-Jan-1995
video release date                                                  NaN
IMDB URL              http://us.imdb.com/M/title-exact?GoldenEye%20(...
unknown                                                               0
Action                                                                1
Adventure                                                             1
Animation                                                             0
Children's                                                            0
Comedy                                                                0
Crime                                                                 0
Documentary                                                           0
Drama                                                           

In [18]:
movies.loc[826]

movie_id                                                            827
title                                                   Daylight (1996)
release date                                                06-Dec-1996
video release date                                                  NaN
IMDB URL              http://us.imdb.com/M/title-exact?Daylight%20(1...
unknown                                                               0
Action                                                                1
Adventure                                                             1
Animation                                                             0
Children's                                                            0
Comedy                                                                0
Crime                                                                 0
Documentary                                                           0
Drama                                                           

---
## Recommendation based on movie overviews (TF-IDF)

In [20]:
# read the data
movies = pd.read_csv('/Users/jun/Library/Mobile Documents/com~apple~CloudDocs/Github/ai _recommendation _system/data/movies_metadata.csv', encoding='latin-1', low_memory=False)
movies = movies[['id', 'title', 'overview']]
print(len(movies)) # 45442
movies.head()

45442


Unnamed: 0,id,title,overview
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ..."
1,8844,Jumanji,When siblings Judy and Peter discover an encha...
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom..."
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...


### Data preprocessing

In [21]:
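# data preprocessing
movies = movies.drop_duplicates() # drop duplicate rows, keeping the first occurrence
movies = movies.dropna() # drop rows with missing values
movies['overview'] = movies['overview'].fillna('') # replace NaN in 'overview' with ''; a no-op here because dropna() already removed such rows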
len(movies)

44300

In [22]:
movies

Unnamed: 0,id,title,overview
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ..."
1,8844,Jumanji,When siblings Judy and Peter discover an encha...
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom..."
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...
...,...,...,...
45437,439050,Subdue,Rising and falling between a man and woman.
45438,111109,Century of Birthing,An artist struggles to finish his work while a...
45439,67758,Betrayal,"When one of her hits goes wrong, a professiona..."
45440,227506,Satan Triumphant,"In a small town live two brothers, one a minis..."


### TF-IDF
- Stopwords : words removed in text analysis because they carry little meaning
    - ex : in English, personal pronouns such as 'i', 'me', 'my', 'myself', as well as articles and conjunctions
- `TfidfVectorizer()` : scikit-learn's tool for vectorizing text data with TF-IDF

In [25]:
# compute TF-IDF with English stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english') # exclude English stopwords when vectorizing
tfidf_matrix = tfidf.fit_transform(movies['overview']) # compute TF-IDF values for each movie overview in movies['overview']

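- Result
    - tfidf_matrix : a sparse matrix holding the vectorized movie overviews. Its shape is (number of movies, number of terms), and it contains the TF-IDF values of the terms that appear in the overviews
    - In other words, each overview is stored as a vector of term weights (TF-IDF)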

In [26]:
# compute cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix) # pairwise cosine similarity between the rows of the sparse matrix
cosine_sim = pd.DataFrame(cosine_sim, index=movies.index, columns=movies.index)

In [27]:
cosine_sim.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,45432,45433,45434,45435,45436,45437,45438,45439,45440,45441
0,1.0,0.014981,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005955,0.0
1,0.014981,1.0,0.046968,0.0,0.0,0.050222,0.0,0.102622,0.0,0.007219,...,0.0,0.0,0.0,0.011276,0.0,0.0,0.066866,0.0,0.022018,0.009356
2,0.0,0.046968,1.0,0.0,0.02507,0.0,0.006414,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014077,0.0
3,0.0,0.0,0.0,1.0,0.0,0.007214,0.008982,0.0,0.0,0.0,...,0.0,0.0,0.0,0.021457,0.0,0.026478,0.0,0.0,0.009531,0.016436
4,0.0,0.0,0.02507,0.0,1.0,0.0,0.0,0.03282,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007014,0.0


In [28]:
# build a Series indexed by title that maps each title to the movie's row label
indices = pd.Series(movies.index, index=movies['title'])
indices

title
Toy Story                          0
Jumanji                            1
Grumpier Old Men                   2
Waiting to Exhale                  3
Father of the Bride Part II        4
                               ...  
Subdue                         45437
Century of Birthing            45438
Betrayal                       45439
Satan Triumphant               45440
Queerama                       45441
Length: 44300, dtype: int64

In [29]:
# takes a movie title and returns recommended movies
def content_recommender(title, n_of_recomm):
    # look up the movie's index from its title
    idx = indices[title]
    # get the similarity between the given movie and every other movie
    sim_scores = cosine_sim[idx]
    # sort by similarity and take n_of_recomm entries (excluding the movie itself)
    sim_scores = sim_scores.sort_values(ascending=False)[1:n_of_recomm+1] # most similar first; the slice starts at 1 to skip the movie itself
    # return the movie titles
    return movies.loc[sim_scores.index]['title'] # sim_scores.index : indices of the most similar movies
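
The function above assumes that each title is unique and that the query movie always sorts first. Neither is guaranteed: in the genre example earlier, several movies tie at 1.0 and the query row is not at the top. The following is a hedged sketch of a variant that drops the query movie by label and picks the first match for a duplicated title. `recommend`, `toy_movies`, and `toy_sim` are hypothetical names and data, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy catalogue with a duplicated title.
toy_movies = pd.DataFrame({"title": ["A", "B", "A", "C"]})
toy_sim = pd.DataFrame(
    [[1.0, 0.2, 0.1, 0.7],
     [0.2, 1.0, 0.3, 0.4],
     [0.1, 0.3, 1.0, 0.5],
     [0.7, 0.4, 0.5, 1.0]],
    index=toy_movies.index, columns=toy_movies.index,
)

def recommend(title, n, movies_df, sim_df):
    """Top-n most similar titles; uses the first match if the title is duplicated."""
    matches = movies_df.index[movies_df["title"] == title]
    if len(matches) == 0:
        raise KeyError(f"title not found: {title}")
    idx = matches[0]
    # Drop the query movie by label instead of assuming it sorts first.
    scores = sim_df[idx].drop(idx).sort_values(ascending=False).head(n)
    return movies_df.loc[scores.index, "title"]

print(recommend("A", 2, toy_movies, toy_sim))
```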

In [30]:
print(content_recommender('The Lion King', 5))

34664    How the Lion Cub and the Turtle Sang a Song
9339                               The Lion King 1Â½
9101                  The Lion King 2: Simba's Pride
42806                                           Prey
25637                                 Fearless Fagan
Name: title, dtype: object


In [31]:
print(content_recommender('The Dark Knight Rises', 10))

12468                                      The Dark Knight
149                                         Batman Forever
1321                                        Batman Returns
15497                           Batman: Under the Red Hood
584                                                 Batman
21179    Batman Unmasked: The Psychology of the Dark Kn...
9216                    Batman Beyond: Return of the Joker
18021                                     Batman: Year One
19778              Batman: The Dark Knight Returns, Part 1
3085                          Batman: Mask of the Phantasm
Name: title, dtype: object


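### Summary
1. Recommending with IDF
- Focus : a recommender that uses only IDF (inverse document frequency) considers only how rare each feature is
    - Features that appear rarely in the whole dataset get a larger weight.
    - When only a small share of all movies belong to a genre, that genre gets a high IDF value, so a shared rare genre contributes more to the similarity between movies.
2. Recommending with TF-IDF
- Focus : considers both the frequency (TF) and the rarity (IDF) of each feature.
    - This reflects both the feature's importance within a particular document and its rarity across the whole dataset.
    - Words that appear often within a document but are rare across the corpus are emphasized as important features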