# Recommender System - Item-Based Collaborative Filtering

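### Collaborative filtering
A method that uses the ratings between users and items to find 'similarity', either between users (user-based) or between items (item-based).

It 'predicts' what a given user will like from the behavior of similar users, such as the ratings they left and their purchase history, and 'recommends' items accordingly.

In other words, it estimates numerically how much the user will like an item.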
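
As a rough illustration of what "estimating numerically" can mean in the item-based setting, the sketch below predicts a rating as a similarity-weighted average of the ratings the user has already given. The function name `predict_rating` and the sample numbers are assumptions made for this example; the notebook itself only ranks movies by similarity and does not compute predicted ratings.

```python
import numpy as np

def predict_rating(user_ratings, similarities):
    """Similarity-weighted average of the ratings a user has already given.

    user_ratings: the user's ratings for other items (np.nan where unrated)
    similarities: similarity of each of those items to the target item
    """
    user_ratings = np.asarray(user_ratings, dtype=float)
    similarities = np.asarray(similarities, dtype=float)
    rated = ~np.isnan(user_ratings)          # only rated items contribute
    weights = np.abs(similarities[rated])
    if weights.sum() == 0:
        return float("nan")                  # nothing to base a prediction on
    return float(np.dot(similarities[rated], user_ratings[rated]) / weights.sum())

# The user rated the first and third items; the second is unrated and ignored.
predicted = predict_rating([4.0, np.nan, 2.0], [0.8, 0.5, 0.2])
print(predicted)
```

Items that are more similar to the target pull the estimate toward their own rating, which is why the result lands closer to 4.0 than to 2.0 here.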

The data is the Movielens (Small) dataset from Kaggle (https://www.kaggle.com/sengzhaotoo/movielens-small).

- ratings.csv: the ratings that users gave to movies
- movies.csv: movie information

Goal: use item-based collaborative filtering to recommend movies similar to one the user has watched.

## 1. Data preparation

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
rating_data = pd.read_csv('./ratings.csv')
movie_data = pd.read_csv('./movies.csv')

In [3]:
rating_data.head()

|   | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 1260759144 |
| 1 | 1 | 1029 | 3.0 | 1260759179 |
| 2 | 1 | 1061 | 3.0 | 1260759182 |
| 3 | 1 | 1129 | 2.0 | 1260759185 |
| 4 | 1 | 1172 | 4.0 | 1260759205 |


In [4]:
movie_data.head()

|   | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


## 2. Data preprocessing

**Drop the unneeded column**

In [5]:
rating_data.drop('timestamp', axis=1, inplace = True)

In [6]:
rating_data.head()

|   | userId | movieId | rating |
|---|---|---|---|
| 0 | 1 | 31 | 2.5 |
| 1 | 1 | 1029 | 3.0 |
| 2 | 1 | 1061 | 3.0 |
| 3 | 1 | 1129 | 2.0 |
| 4 | 1 | 1172 | 4.0 |


**Merge the movie data and the rating data into a single DataFrame**

In [7]:
usr_movie_rating = pd.merge(rating_data, movie_data, on="movieId")
usr_movie_rating.head()

|   | userId | movieId | rating | title | genres |
|---|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | Dangerous Minds (1995) | Drama |
| 1 | 7 | 31 | 3.0 | Dangerous Minds (1995) | Drama |
| 2 | 31 | 31 | 4.0 | Dangerous Minds (1995) | Drama |
| 3 | 32 | 31 | 4.0 | Dangerous Minds (1995) | Drama |
| 4 | 36 | 31 | 3.0 | Dangerous Minds (1995) | Drama |
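
`pd.merge` performs an inner join by default, so a rating whose `movieId` has no match in the movie table would be dropped without any warning. The sketch below, on small made-up frames (`ratings`, `movies`, and their contents are assumptions for illustration), shows one way to check for such rows using `how="left"` together with `indicator=True`.

```python
import pandas as pd

ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 3.0, 5.0],
})
movies = pd.DataFrame({
    "movieId": [10, 30],
    "title": ["Movie A", "Movie C"],
})

# Default how="inner": only movieId values present in both frames survive.
inner = pd.merge(ratings, movies, on="movieId")

# how="left" keeps every rating; indicator=True adds a _merge column that
# marks rows for which no matching movie was found.
left = pd.merge(ratings, movies, on="movieId", how="left", indicator=True)
unmatched = left[left["_merge"] == "left_only"]

print(inner)
print(unmatched)
```

If `unmatched` is empty on the real data, the inner join used above loses nothing.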


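**Build the pivot tables**

To apply item-based collaborative filtering, the data must be arranged as a matrix of rating values indexed by user and movie.

There are two possible orientations:

- movie_usr_rating: movie-user pivot table (index: movie, columns: user)
- usr_movie_rating: user-movie pivot table (index: user, columns: movie)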

In [8]:
movie_usr_rating = usr_movie_rating.pivot_table('rating',index='title',columns='userId')
usr_movie_rating = usr_movie_rating.pivot_table('rating',index='userId',columns='title')

In [9]:
movie_usr_rating.head()

| title (rows) / userId (columns) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 662 | 663 | 664 | 665 | 666 | 667 | 668 | 669 | 670 | 671 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "Great Performances" Cats (1998) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| $9.99 (2008) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 'Hellboy': The Seeds of Creation (2004) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 'Neath the Arizona Skies (1934) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 'Round Midnight (1986) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


In [10]:
usr_movie_rating.head()

| userId (rows) / title (columns) | "Great Performances" Cats (1998) | $9.99 (2008) | 'Hellboy': The Seeds of Creation (2004) | 'Neath the Arizona Skies (1934) | 'Round Midnight (1986) | 'Salem's Lot (2004) | 'Til There Was You (1997) | 'burbs, The (1989) | 'night Mother (1986) | (500) Days of Summer (2009) | ... | Zulu (1964) | Zulu (2013) | [REC] (2007) | eXistenZ (1999) | loudQUIETloud: A Film About the Pixies (2006) | xXx (2002) | xXx: State of the Union (2005) | ¡Three Amigos! (1986) | À nous la liberté (Freedom for Us) (1931) | İtirazım Var (2014) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


Because this is item-based collaborative filtering, we use movie_usr_rating, the movie-user pivot table whose index is the movie (index: movie, columns: user).
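
The two orientations can be sketched on a tiny made-up frame (the `toy` data and the names `movie_by_user` and `user_by_movie` are assumptions for illustration). One detail worth knowing: `pivot_table` aggregates with the mean by default, so if two distinct `movieId` values happen to share a title, their ratings from the same user are averaged into one cell when pivoting on `title`.

```python
import pandas as pd

toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 2],
    "movieId": [10, 11, 10, 11, 12],
    # movieId 12 deliberately shares its title with movieId 10
    "title": ["Movie A", "Movie B", "Movie A", "Movie B", "Movie A"],
    "rating": [4.0, 3.0, 5.0, 2.0, 1.0],
})

# index: movie, columns: user (the orientation used for item-based filtering)
movie_by_user = toy.pivot_table(values="rating", index="title", columns="userId")

# index: user, columns: movie
user_by_movie = toy.pivot_table(values="rating", index="userId", columns="title")

print(movie_by_user)
print(user_by_movie)
```

User 2 rated both "Movie A" entries (5.0 and 1.0), so that cell holds their mean. The second table is simply the transpose of the first.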

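**Handling NaN in the pivot table**

A NaN means the user has not rated that movie yet. We use fillna to replace each NaN with 0. Note that this treats an unrated movie the same as a rating of 0, which is a simplification.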

In [11]:
movie_usr_rating.fillna(0, inplace=True)
movie_usr_rating.head()

| title (rows) / userId (columns) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 662 | 663 | 664 | 665 | 666 | 667 | 668 | 669 | 670 | 671 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "Great Performances" Cats (1998) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| $9.99 (2008) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 'Hellboy': The Seeds of Creation (2004) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 'Neath the Arizona Skies (1934) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 'Round Midnight (1986) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
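
Before filling, it can be useful to measure how sparse the matrix is, since that information is lost once every NaN becomes 0. The following is a minimal sketch on a made-up 3x3 pivot (the frame `pivot` and its values are assumptions for illustration). It also uses `fillna` without `inplace`, which returns a new frame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

pivot = pd.DataFrame(
    {1: [4.0, np.nan, 3.0], 2: [np.nan, np.nan, 5.0], 3: [2.0, 1.0, np.nan]},
    index=["Movie A", "Movie B", "Movie C"],
)

# Fraction of (movie, user) cells that have no rating, measured before filling.
sparsity = pivot.isna().to_numpy().mean()

# Without inplace=True, fillna returns a filled copy; `pivot` keeps its NaNs.
filled = pivot.fillna(0)

print(f"sparsity: {sparsity:.2f}")
print(filled)
```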


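## 3. Item-based collaborative filtering

**Computing similarity (cosine similarity)**

This approach recommends items that are similar to each other; that is, it recommends items (movies) whose ratings are similar.

The pivot table now holds the ratings as its values, so we compute cosine similarity directly on it.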

In [12]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_rate = cosine_similarity(movie_usr_rating, movie_usr_rating)
print(similarity_rate)

[[1.         0.         0.         ... 0.         0.         0.        ]
 [0.         1.         0.         ... 0.05821787 0.         0.        ]
 [0.         0.         1.         ... 0.         0.         0.        ]
 ...
 [0.         0.05821787 0.         ... 1.         0.         0.        ]
 [0.         0.         0.         ... 0.         1.         0.        ]
 [0.         0.         0.         ... 0.         0.         1.        ]]
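
To make the numbers in this matrix less opaque, cosine similarity between two rating vectors a and b is a·b / (‖a‖ ‖b‖). The sketch below computes it by hand on a small made-up matrix and compares the result with `cosine_similarity`; the array `ratings` and the helper `cosine` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows are movies, columns are users; 0 means "not rated" after fillna(0).
ratings = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
])

def cosine(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

manual = cosine(ratings[0], ratings[1])

# With a single argument, cosine_similarity compares every row with every row.
matrix = cosine_similarity(ratings)

print(manual, matrix[0, 1])
print(matrix[0, 2])
```

Movies 0 and 2 share no rater, so their dot product and similarity are 0. This is why most entries in the real matrix are 0.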


**Building a DataFrame of similarity values**

Each item (movie) now has a similarity value against every other item.

In [13]:
similarity_rate_df = pd.DataFrame(
    data=similarity_rate,
    index=movie_usr_rating.index,
    columns=movie_usr_rating.index)

In [14]:
similarity_rate_df.head()

| title | "Great Performances" Cats (1998) | $9.99 (2008) | 'Hellboy': The Seeds of Creation (2004) | 'Neath the Arizona Skies (1934) | 'Round Midnight (1986) | 'Salem's Lot (2004) | 'Til There Was You (1997) | 'burbs, The (1989) | 'night Mother (1986) | (500) Days of Summer (2009) | ... | Zulu (1964) | Zulu (2013) | [REC] (2007) | eXistenZ (1999) | loudQUIETloud: A Film About the Pixies (2006) | xXx (2002) | xXx: State of the Union (2005) | ¡Three Amigos! (1986) | À nous la liberté (Freedom for Us) (1931) | İtirazım Var (2014) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "Great Performances" Cats (1998) | 1.0 | 0.0 | 0.0 | 0.164399 | 0.020391 | 0.0 | 0.014046 | 0.0 | 0.0 | 0.003166 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| $9.99 (2008) | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.079474 | 0.0 | 0.15633 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.013899 | 0.0 | 0.058218 | 0.0 | 0.0 |
| 'Hellboy': The Seeds of Creation (2004) | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.217357 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 'Neath the Arizona Skies (1934) | 0.164399 | 0.0 | 0.0 | 1.0 | 0.124035 | 0.0 | 0.085436 | 0.0 | 0.0 | 0.019259 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 'Round Midnight (1986) | 0.020391 | 0.0 | 0.0 | 0.124035 | 1.0 | 0.0 | 0.010597 | 0.143786 | 0.0 | 0.136163 | ... | 0.0 | 0.0 | 0.0 | 0.121567 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


The more similar two movies are, the closer the value is to 1, and a movie compared with itself always has a similarity of 1 (the diagonal).
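
The table also contains an off-diagonal 1.0 ('Hellboy': The Seeds of Creation (2004) against 'Salem's Lot (2004)). Cosine similarity is exactly 1 whenever two rating vectors point in the same direction, which happens, for example, when both movies were rated only by the same single user, whatever the ratings were. The sketch below reproduces that effect on made-up data and shows one possible mitigation: dropping movies with fewer than `min_ratings` ratings before computing similarity. The frame `toy` and the threshold `min_ratings` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Rows are movies, columns are users; 0 means "not rated".
toy = pd.DataFrame(
    [[4.0, 0.0, 0.0, 0.0],
     [1.5, 0.0, 0.0, 0.0],
     [5.0, 3.0, 4.0, 0.0],
     [4.0, 2.0, 5.0, 1.0]],
    index=["Obscure A", "Obscure B", "Popular C", "Popular D"],
    columns=[1, 2, 3, 4],
)

sim = pd.DataFrame(cosine_similarity(toy), index=toy.index, columns=toy.index)

# Both obscure movies were rated only by user 1, so their vectors are parallel.
print(sim.loc["Obscure A", "Obscure B"])

min_ratings = 2  # hypothetical threshold, to be tuned on the real data
rating_counts = (toy > 0).sum(axis=1)      # number of ratings per movie
filtered = toy[rating_counts >= min_ratings]
print(filtered.index.tolist())
```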

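## 4. Writing a movie recommendation function

Now we use this DataFrame to write a function that implements movie recommendation.

Given a movie the user has watched (the title is passed as the argument), it recommends movies similar to that one.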

In [15]:
# Top 5 most similar movies; the slice takes 6 rows because the first row is the queried movie itself
def recommand_movie(title):
    return similarity_rate_df[title].sort_values(ascending=False)[:6]

In [16]:
recommand_movie("Toy Story (1995)")

title
Toy Story (1995)                             1.000000
Toy Story 2 (1999)                           0.594710
Star Wars: Episode IV - A New Hope (1977)    0.576188
Forrest Gump (1994)                          0.564534
Independence Day (a.k.a. ID4) (1996)         0.562946
Groundhog Day (1993)                         0.548023
Name: Toy Story (1995), dtype: float64

These are the movies most similar to Toy Story (1995), assuming the user has watched it; the first row is Toy Story (1995) itself.
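
Since the queried movie always appears in its own result, a variant that removes it explicitly may be more convenient. The following is a minimal sketch under stated assumptions: `recommend_similar`, its `top_n` parameter, and the `toy_similarity` frame are hypothetical names introduced for this example, and the function takes the similarity DataFrame as an argument instead of reading a global.

```python
import pandas as pd

def recommend_similar(similarity_df, title, top_n=5):
    """Return the top_n titles most similar to `title`, excluding `title` itself."""
    if title not in similarity_df.columns:
        raise KeyError(f"Unknown title: {title!r}")
    # Drop the queried movie by label, then rank the rest by similarity.
    scores = similarity_df[title].drop(labels=title)
    return scores.sort_values(ascending=False).head(top_n)

titles = ["Movie A", "Movie B", "Movie C"]
toy_similarity = pd.DataFrame(
    [[1.0, 0.6, 0.2],
     [0.6, 1.0, 0.4],
     [0.2, 0.4, 1.0]],
    index=titles,
    columns=titles,
)

result = recommend_similar(toy_similarity, "Movie A", top_n=2)
print(result)
```

Dropping by label is used here instead of skipping the first row because another movie can also have a similarity of exactly 1, in which case the queried movie is not guaranteed to sort first.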