# 1. Matrix Factorization Exercise
- Apply Matrix Factorization, a model-based collaborative filtering method, to the same `ratings` data used for KNN

In [1]:
import pandas as pd
import numpy as np

np.random.seed(2021)

### 1.1 Data Load
- Use the user-movie rating data to recommend movies that a user has not yet rated
- Use three columns: `userId` (the unique user ID), `movieId` (the unique movie ID), and `rating` (the score the user gave the movie)

In [3]:
ratings = pd.read_csv("../02. Data/ratings_small.csv")
ratings = ratings[["userId", "movieId", "rating"]]

ratings.head()

,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


In [6]:
# Load two more datasets to look up the movie title for each `movieId` in `ratings`
movies = pd.read_csv("../02. Data/movies_metadata.csv")
links = pd.read_csv("../02. Data/links_small.csv")

movies.head(3)


,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


In [7]:
links.head(3)

,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0


### 1.2 Preprocessing
Extract the numeric part of `imdb_id` in the `movies` data (formatted as "tt" followed by digits) and  
join it with `imdbId` in the `links` data, which contains digits only

In [8]:
movies = movies.fillna('')
movies = movies[movies["imdb_id"].str.startswith('tt')] ## keep only rows whose imdb_id starts with 'tt'

movies['imdb_id'].head(3)

0    tt0114709
1    tt0113497
2    tt0113228
Name: imdb_id, dtype: object

In [9]:
movies["imdbId"] = movies["imdb_id"].apply(lambda x: int(x[2:])) ## drop the 'tt' prefix and keep the number
movies["imdbId"].head(3)

0    114709
1    113497
2    113228
Name: imdbId, dtype: int64

In [10]:
movies = movies.merge(links, on="imdbId") # join with links on the imdbId column

movies = movies[["title", "movieId"]]
movies = movies.set_index("movieId")

movies.head()

movieId,title
1,Toy Story
2,Jumanji
3,Grumpier Old Men
4,Waiting to Exhale
5,Father of the Bride Part II

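As a quick sanity check, a title table indexed by `movieId` can be used to attach titles to rating rows. The sketch below uses small stand-in frames (`toy_movies` and `toy_ratings`, both hypothetical) rather than the real files, and assumes each `movieId` appears at most once in the title table.

```python
import pandas as pd

# Hypothetical stand-ins for `movies` (indexed by movieId) and `ratings`
toy_movies = pd.DataFrame(
    {"title": ["Toy Story", "Jumanji", "Grumpier Old Men"]},
    index=pd.Index([1, 2, 3], name="movieId"),
)
toy_ratings = pd.DataFrame(
    {"userId": [1, 1, 2], "movieId": [1, 3, 2], "rating": [4.0, 2.5, 3.0]}
)

# Left join on movieId so every rating row is kept, even if a title is missing
rated_titles = toy_ratings.join(toy_movies, on="movieId", how="left")
print(rated_titles)
```

A left join is used here so that ratings for movies without a matching title would show up as missing values instead of silently disappearing.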

Use the `pivot` function to build `user_movie_matrix`, with user IDs as the index, movie IDs as the columns, and ratings as the values  
Missing values are replaced with 0

In [11]:
user_movie_matrix = ratings.pivot(
    index="userId",
    columns="movieId",
    values="rating",
)
user_movie_matrix = user_movie_matrix.fillna(0)
user_movie_matrix

userId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
668,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
669,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
670,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

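Because unrated entries are filled with 0, a 0 in this matrix means "no rating", not "a rating of 0". The following is a minimal sketch of how one might separate observed entries from missing ones and measure how sparse the matrix is. It uses a small hypothetical `toy_ratings` frame in place of the real data.

```python
import pandas as pd

# Hypothetical ratings: 3 users, 3 movies, 4 observed ratings
toy_ratings = pd.DataFrame(
    {
        "userId": [1, 1, 2, 3],
        "movieId": [10, 20, 10, 30],
        "rating": [4.0, 2.5, 3.0, 5.0],
    }
)

toy_matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating")

# Record which cells hold a real rating before the NaNs are overwritten
observed = toy_matrix.notna().values
toy_R = toy_matrix.fillna(0).values  # 0 now marks "not rated"

density = observed.mean()  # fraction of cells that hold a rating
print(toy_R.shape, int(observed.sum()), density)
```

Keeping a mask like `observed` around is one way to make sure a later loss is computed only over ratings that actually exist.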

# 2. Matrix Factorization

### 2.1 Initial Setup

#### 2.1.1 Ground Truth R

In [14]:
R = user_movie_matrix.values

n_user = R.shape[0]  # total number of users
n_item = R.shape[1]  # total number of movies

print('Total users', n_user)
print('Total movies', n_item)

Total users 671
Total movies 9066


#### 2.1.2 Latent Factor Matrices
- Set the latent factor dimension to 10 for both the user matrix and the movie matrix

In [15]:
K = 10
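
With $K$ fixed, the factorization can be written out explicitly. Here $R$ is the rating matrix and $P$ and $Q$ are the user and movie factor matrices. The entry-wise form in the second line is the standard dot-product prediction, stated as an assumption about how the factors are combined:

```latex
R \approx \hat{R} = P Q^{\top},
\qquad P \in \mathbb{R}^{n_{\text{user}} \times K},
\quad Q \in \mathbb{R}^{n_{\text{item}} \times K}

\hat{r}_{ui} = \sum_{k=1}^{K} p_{uk} \, q_{ik}
```

With the sizes above, $P$ is $671 \times 10$ and $Q$ is $9066 \times 10$, so $P Q^{\top}$ has the same $671 \times 9066$ shape as $R$.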

#### 2.1.3 Initialize P and Q with Random Values
- Initialize the user matrix P and the movie matrix Q with random values

In [16]:
P = np.random.normal(size=(n_user, K))
Q = np.random.normal(size=(n_item, K))

P

array([[ 1.48860905,  0.67601087, -0.41845137, ...,  0.64500184,
         0.10641374,  0.42215483],
       [ 0.12420684, -0.83795346,  0.4090157 , ..., -0.22508127,
        -1.33620597,  0.30372151],
       [-0.72015884,  2.5449146 ,  1.31729112, ...,  1.37626076,
        -0.47218397,  0.5240849 ],
       ...,
       [-0.34036392,  1.10504404,  0.25446956, ..., -0.20915116,
         0.65492966, -0.3958868 ],
       [-0.31165161,  1.78026007,  1.08668056, ...,  0.03222073,
        -0.52333827, -0.11044398],
       [-1.2146398 , -0.10685361,  0.845032  , ..., -1.02719008,
         0.00569836,  0.22101445]])

In [17]:
Q

array([[ 0.30194165,  0.36629183, -0.52061911, ..., -0.43741366,
         1.19149681,  0.03748171],
       [-0.02156433, -1.76596912, -0.05909484, ...,  0.45219164,
        -0.99925363,  1.92936678],
       [-0.26655993, -0.48104382, -0.16922735, ...,  0.48428921,
        -0.04504006, -0.35068684],
       ...,
       [-0.33373493, -0.76955212, -1.0908092 , ...,  0.88754135,
        -2.14405834,  1.25667084],
       [-0.32719638, -0.73017883,  0.04958502, ...,  0.20299266,
         0.02776886,  0.30185611],
       [ 0.0813312 ,  0.29697644,  1.11559121, ..., -1.66948007,
        -0.15183078,  0.60258872]])
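
Once the two factor matrices exist, a full predicted rating matrix follows from a single matrix product. The sketch below uses small hypothetical sizes (`toy_n_user`, `toy_n_item`, `toy_K`) and its own random generator so that it runs on its own. The shapes play the same role as `n_user`, `n_item`, and `K` above.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; the seed is arbitrary

toy_n_user, toy_n_item, toy_K = 4, 6, 3
toy_P = rng.normal(size=(toy_n_user, toy_K))  # one row of factors per user
toy_Q = rng.normal(size=(toy_n_item, toy_K))  # one row of factors per movie

# Predicted ratings for every (user, movie) pair at once
toy_R_hat = toy_P @ toy_Q.T

# A single prediction is the dot product of one user row and one movie row
u, i = 3, 0
single_pred = toy_P[u] @ toy_Q[i]
print(toy_R_hat.shape, np.isclose(toy_R_hat[u, i], single_pred))
```

Right after random initialization these predictions are unrelated to the true ratings, which is why a learning step is needed.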

### 2.2 Learning the Latent Factor Matrices with Gradient Descent
- Walk through learning the rating that user "670" gave to movie "0"
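
One common way to fit a single (user, movie) rating is a gradient step on the squared error. The sketch below illustrates that general technique and is not necessarily the exact update used in this notebook. It works on toy vectors (`p_u`, `q_i`), uses a hypothetical learning rate `lr` and target `r_ui`, and applies no regularization.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; the seed is arbitrary

toy_K = 3
p_u = rng.normal(size=toy_K)  # latent factors of one user (toy values)
q_i = rng.normal(size=toy_K)  # latent factors of one movie (toy values)
r_ui = 4.0                    # hypothetical observed rating
lr = 0.05                     # hypothetical learning rate

errors = []
for step in range(200):
    err = r_ui - p_u @ q_i    # prediction error for this pair
    errors.append(abs(err))

    # Gradients of 0.5 * err**2, both computed before either vector changes
    grad_p = -err * q_i
    grad_q = -err * p_u

    p_u = p_u - lr * grad_p
    q_i = q_i - lr * grad_q

print(errors[0], errors[-1])
```

Both gradients are computed from the old values before either vector is updated, so the user update and the movie update stay consistent with the same error term.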