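### Item-based nearest-neighbor collaborative filtering
- Nearest-neighbor filtering comes in two variants: user-based and item-based.
- The item-based variant is generally the more accurate of the two.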

#### STEP 1. Data processing and transformation

In [7]:
import pandas as pd
import numpy as np

movies = pd.read_csv("movies.csv")    # Metadata table: title, genres
ratings = pd.read_csv("ratings.csv")  # Ratings each user gave to movies

print(movies.shape, ratings.shape)

(9742, 3) (100836, 4)


In [8]:
movies.head(3)

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


In [10]:
ratings.head(3)
    # This needs to be reshaped so that rows are users and columns are movies.
    # As with gather & spread (or melt & cast) in R, the movieId values must become column names before a recommender can be built on the matrix.

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224


In [11]:
ratings = ratings[['userId', 'movieId', 'rating']]  # The timestamp column is not needed.
ratings_matrix = ratings.pivot_table('rating', index='userId', columns='movieId')
    # What R does with ratings %>% spread(key='movieId', value='rating'), pandas does with pivot_table.
ratings_matrix.head(3)

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,


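- Each user has left most movies unrated, so most cells are NaN and the resulting matrix is sparse.
- The lowest possible rating is 0.5, so 0 never occurs as a real rating; every unrated NaN cell is therefore converted to 0.
    - Is converting to 0 actually safe? -> To look into.
- The columns above are movieId values; let's show titles instead, as below.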

In [17]:
rating_movies = pd.merge(ratings, movies, on='movieId')  # Join with movies to get the titles.

ratings_matrix = rating_movies.pivot_table('rating', index='userId', columns='title')
ratings_matrix = ratings_matrix.fillna(0)  # fillna replaces NaN with the given value.
ratings_matrix.head(3)
    # The columns are now titles, and the NaN values have been replaced with 0.

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
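
The long-to-wide reshaping above can be tried on a tiny table first. The following is a minimal sketch on made-up data (the `toy` frame and its titles "A", "B", "C" are hypothetical, not taken from MovieLens). It shows that unrated (user, title) pairs become NaN after `pivot_table` and 0 after `fillna(0)`.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings.csv (after the title merge).
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "title": ["A", "B", "A", "C"],
    "rating": [4.0, 3.0, 5.0, 2.5],
})

# Long -> wide: one row per user, one column per title.
wide = toy.pivot_table("rating", index="userId", columns="title")

# Unrated (user, title) pairs are NaN at this point; fillna(0) marks them as 0.
wide_filled = wide.fillna(0)

print(wide)
print(wide_filled)
```

Note that `pivot_table` aggregates with the mean by default, so if one user had several ratings under the same title they would be averaged into a single cell.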


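#### STEP 2. Computing movie-to-movie similarity

- Similarity is measured with scikit-learn's cosine_similarity. This function compares rows with each other, so on the current matrix it would produce user-to-user similarity. The rows and columns therefore need to be swapped (transposed) first.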

In [19]:
ratings_matrix_T=ratings_matrix.T
ratings_matrix_T.head(3)  # Transposed matrix

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
'71 (2014),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
'Hellboy': The Seeds of Creation (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
from sklearn.metrics.pairwise import cosine_similarity

item_sim = cosine_similarity(ratings_matrix_T, ratings_matrix_T)  # Compute the cosine similarity matrix
item_sim_df = pd.DataFrame(data=item_sim,                         # Convert to a DataFrame
                           index=ratings_matrix.columns,
                           columns=ratings_matrix.columns)

print(item_sim_df.shape)
item_sim_df.head(3)

(9719, 9719)


title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
'71 (2014),1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.141653,0.0,...,0.0,0.342055,0.543305,0.707107,0.0,0.0,0.139431,0.327327,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.0,1.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.707107,1.0,0.0,0.0,0.0,0.176777,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
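
To see why the transpose matters, here is a small NumPy-only sketch of row-wise cosine similarity. `cosine_sim_rows` is a hypothetical helper written for illustration from the textbook definition (dot product divided by the product of norms); it is not scikit-learn's implementation, and the toy matrix is made up.

```python
import numpy as np

def cosine_sim_rows(m):
    """Cosine similarity between every pair of rows of m (illustrative helper)."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid dividing by zero for all-zero rows
    unit = m / norms                 # scale each row to unit length
    return unit @ unit.T             # dot products of unit rows = cosines

# Toy user-item matrix: 2 users (rows) x 3 items (columns).
toy = np.array([[4.0, 0.0, 4.0],
                [5.0, 3.0, 0.0]])

user_sim = cosine_sim_rows(toy)      # rows are users -> (2, 2) user-to-user
item_sim = cosine_sim_rows(toy.T)    # rows are items -> (3, 3) item-to-item

print(user_sim.shape, item_sim.shape)
print(item_sim.round(3))
```

Passing the matrix as-is yields a users-by-users result; only the transposed matrix gives the items-by-items similarity that item-based filtering needs.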


In [37]:
item_sim_df.loc["Godfather, The (1972)"].sort_values(ascending=False)[:6]

title
Godfather, The (1972)                        1.000000
Godfather: Part II, The (1974)               0.821773
Goodfellas (1990)                            0.664841
One Flew Over the Cuckoo's Nest (1975)       0.620536
Star Wars: Episode IV - A New Hope (1977)    0.595317
Fargo (1996)                                 0.588614
Name: Godfather, The (1972), dtype: float64

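- Movies from quite different genres also come out with high similarity.
- The earlier content-based filtering computed similarity on genre alone, so it only recommended movies of the same genre; the item-based approach can recommend other genres as well.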

In [38]:
item_sim_df.loc["Inception (2010)"].sort_values(ascending=False)[:6]

title
Inception (2010)                 1.000000
Dark Knight, The (2008)          0.727263
Inglourious Basterds (2009)      0.646103
Shutter Island (2010)            0.617736
Dark Knight Rises, The (2012)    0.617504
Fight Club (1999)                0.615417
Name: Inception (2010), dtype: float64

- The Dark Knight has the highest similarity.
- The item-based similarity data is now in place.
- Next, use it to build a personalized movie recommendation algorithm.

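#### STEP 3. Personalized movie recommendation with item-based nearest-neighbor collaborative filtering

- In short, the goal is to predict the ratings for movies a user has not seen yet.
- That is, compute a predicted rating for every movie from the ratings the user has already given, then recommend the movies with the highest predictions.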

![alt text](Item_based.jpg "Title")
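
As a hedged reading of the weighted sum that the `predict_rating` function in this notebook computes, the predicted rating can be written as follows. The symbols are notation assumed here for illustration: $R_{u,j}$ is the actual rating of user $u$ for item $j$ (0 if unrated), $S_{i,j}$ is the cosine similarity between items $i$ and $j$, and $\hat{R}_{u,i}$ is the predicted rating of user $u$ for item $i$.

$$
\hat{R}_{u,i} = \frac{\sum_{j} S_{i,j}\, R_{u,j}}{\sum_{j} \lvert S_{i,j} \rvert}
$$

The sums run over all items $j$. Because unrated items enter the numerator as 0 but still contribute to the denominator, predictions on a sparse matrix tend to come out well below the rating scale.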

In [58]:
# Check the inputs we will use

print(ratings_matrix.values)
print(item_sim_df.values)

print(ratings_matrix.values.shape, item_sim_df.values.shape)

[[0.  0.  0.  ... 0.  4.  0. ]
 [0.  0.  0.  ... 0.  0.  0. ]
 [0.  0.  0.  ... 0.  0.  0. ]
 ...
 [0.  0.  0.  ... 0.  0.  0. ]
 [0.  0.  0.  ... 0.  0.  0. ]
 [4.  0.  0.  ... 1.5 0.  0. ]]
[[1.         0.         0.         ... 0.32732684 0.         0.        ]
 [0.         1.         0.70710678 ... 0.         0.         0.        ]
 [0.         0.70710678 1.         ... 0.         0.         0.        ]
 ...
 [0.32732684 0.         0.         ... 1.         0.         0.        ]
 [0.         0.         0.         ... 0.         1.         0.        ]
 [0.         0.         0.         ... 0.         0.         1.        ]]
(610, 9719) (9719, 9719)


In [59]:
def predict_rating(ratings_arr, item_sim_arr):
    ratings_pred = ratings_arr.dot(item_sim_arr)/np.array([np.abs(item_sim_arr).sum(axis=1)])
    return ratings_pred

In [63]:
ratings_pred = predict_rating(ratings_matrix.values, item_sim_df.values)

In [64]:
# Check the shapes inside the function above

ratings_matrix.values.dot(item_sim_df.values).shape, np.array([np.abs(item_sim_df.values).sum(axis=1)]).shape, (ratings_matrix.values.dot(item_sim_df.values)/np.array([np.abs(item_sim_df.values).sum(axis=1)])).shape
# Numerator shape, denominator shape, final shape

((610, 9719), (1, 9719), (610, 9719))
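
The same shape logic can be followed by hand on a small case. This sketch repeats `predict_rating` on made-up 2-user, 3-item data (`toy_ratings` and `toy_sim` are hypothetical) so that each number can be checked manually.

```python
import numpy as np

def predict_rating(ratings_arr, item_sim_arr):
    # Numerator:   (users x items) . (items x items) -> (users x items)
    # Denominator: per-item sum of |similarity|, shape (1, items), broadcast over users
    return ratings_arr.dot(item_sim_arr) / np.array([np.abs(item_sim_arr).sum(axis=1)])

# Hypothetical data: 0 means "not rated".
toy_ratings = np.array([[4.0, 0.0, 2.0],
                        [0.0, 5.0, 0.0]])
toy_sim = np.array([[1.0, 0.5, 0.0],
                    [0.5, 1.0, 0.2],
                    [0.0, 0.2, 1.0]])

toy_pred = predict_rating(toy_ratings, toy_sim)
print(toy_pred.round(3))
```

For user 0 and item 0 the numerator is 4*1.0 + 0*0.5 + 2*0.0 = 4 and the denominator is 1.5, so the prediction is about 2.667 even though the actual rating is 4. The zeros standing in for unrated movies pull every prediction down.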

In [66]:
ratings_pred_matrix = pd.DataFrame(data = ratings_pred, 
                                   index= ratings_matrix.index,
                                  columns=ratings_matrix.columns)
ratings_pred_matrix.head(3)

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,0.070345,0.577855,0.321696,0.227055,0.206958,0.194615,0.249883,0.102542,0.157084,0.178197,...,0.113608,0.181738,0.133962,0.128574,0.006179,0.21207,0.192921,0.136024,0.292955,0.720347
2,0.01826,0.042744,0.018861,0.0,0.0,0.035995,0.013413,0.002314,0.032213,0.014863,...,0.01564,0.020855,0.020119,0.015745,0.049983,0.014876,0.021616,0.024528,0.017563,0.0
3,0.011884,0.030279,0.064437,0.003762,0.003749,0.002722,0.014625,0.002085,0.005666,0.006272,...,0.006923,0.011665,0.0118,0.012225,0.0,0.008194,0.007017,0.009229,0.01042,0.084501


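- Each user's actual ratings were multiplied by the cosine similarity matrix, so cells whose actual rating was 0 now hold predicted values.
- MSE is used below as the evaluation metric for the recommender.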

In [68]:
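# Compute MSE only over movies the user actually rated (i.e. nonzero entries).

from sklearn.metrics import mean_squared_error

def get_mse(pred, actual):
    pred = pred[actual.nonzero()].flatten()      # Keep predictions only where the actual rating is nonzero, then flatten
    actual = actual[actual.nonzero()].flatten()  # Keep the matching actual ratings
    return mean_squared_error(actual, pred)      # Argument order is (y_true, y_pred); the MSE value is the same either way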

get_mse(ratings_pred, ratings_matrix.values)

9.895354759094706

In [69]:
# The calculation step by step: extract the values, flatten them, then compare.

pred = ratings_pred[ratings_matrix.values.nonzero()].flatten()  # Predictions where the actual rating is nonzero, flattened
actual = ratings_matrix.values[ratings_matrix.values.nonzero()].flatten()
print(pred)
print(actual)

[0.2855597  1.08359021 0.35404974 ... 2.57350896 1.08329872 1.81609065]
[4.  4.  4.  ... 3.  2.  1.5]
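
The masking step can be checked on a two-by-two case. This is a NumPy-only sketch with made-up arrays; `mse_on_rated` is a hypothetical stand-in that follows the same idea as `get_mse` without scikit-learn.

```python
import numpy as np

def mse_on_rated(pred, actual):
    rated = actual.nonzero()             # (row indices, col indices) of rated cells
    diff = pred[rated] - actual[rated]   # indexing with the tuple already returns 1-D arrays
    return np.mean(diff ** 2)

toy_actual = np.array([[4.0, 0.0],
                       [0.0, 2.0]])      # 0 means "not rated"
toy_pred = np.array([[3.0, 1.0],
                     [5.0, 2.0]])

toy_mse = mse_on_rated(toy_pred, toy_actual)
print(toy_mse)  # 0.5: errors are -1 and 0 on the two rated cells
```

The predictions 1.0 and 5.0 sit on unrated cells, so they do not affect the score.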


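### Let's reduce the MSE
- Above, every movie contributed to each prediction, and with this many movies the predictive power suffered.
- Instead, apply the similarity vector only to the movies most similar to the target movie.
- With N as a parameter, only the top-N similarity values are used to compute each prediction.
- This requires for loops, however, so it is computationally expensive.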

In [119]:
def predict_rating_topsim(ratings_arr, item_sim_arr, n=20):
    pred = np.zeros(ratings_arr.shape)  # Initialize with zeros, same shape as the user-item matrix.

    for col in range(ratings_arr.shape[1]):  # For each item
        # Indices of the n items most similar to item `col` (argsort ascending, then walk backwards).
        # These are item (movie) indices, not user indices. A plain index array is used rather than
        # a list wrapping an array, which newer NumPy versions index differently.
        top_n_items = np.argsort(item_sim_arr[:, col])[:-n-1:-1]

        for row in range(ratings_arr.shape[0]):  # For each user
            pred[row, col] = item_sim_arr[col, :][top_n_items].dot(ratings_arr[row, :][top_n_items])
            pred[row, col] /= np.sum(np.abs(item_sim_arr[col, :][top_n_items]))
    return pred


In [120]:
ratings_pred = predict_rating_topsim(ratings_matrix.values , item_sim_df.values, n=20)
ratings_pred

array([[0.        , 0.        , 0.        , ..., 0.        , 1.67729077,
        0.28437161],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.17548935, 0.70242952,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [3.7       , 0.11494779, 0.        , ..., 0.74587428, 0.17047454,
        0.        ]])

In [122]:
get_mse(ratings_pred, ratings_matrix.values)
    # MSE dropped from 9.89 to 3.69, a clear improvement.

3.69501623729494
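
The inner loop over users is where most of the time goes. One possible way to cut it, sketched here on random made-up data, is to compute a whole column of predictions at once with a matrix-vector product. `predict_rating_topsim_vec` is a hypothetical variant, not the function used above, and it assumes the top-n similarities for an item do not sum to zero.

```python
import numpy as np

def predict_rating_topsim_vec(ratings_arr, item_sim_arr, n=20):
    pred = np.zeros(ratings_arr.shape)
    for col in range(ratings_arr.shape[1]):                  # still one pass per item
        top = np.argsort(item_sim_arr[:, col])[:-n-1:-1]     # indices of the n most similar items
        weights = item_sim_arr[col, top]                     # their similarity values
        # All users at once: (users x n) . (n,) -> (users,)
        pred[:, col] = ratings_arr[:, top].dot(weights) / np.abs(weights).sum()
    return pred

# Hypothetical random data: 5 users, 8 items, symmetric similarity with 1.0 on the diagonal.
rng = np.random.default_rng(0)
toy_ratings = rng.integers(0, 6, size=(5, 8)).astype(float)
toy_sim = rng.random((8, 8))
toy_sim = (toy_sim + toy_sim.T) / 2
np.fill_diagonal(toy_sim, 1.0)

toy_pred = predict_rating_topsim_vec(toy_ratings, toy_sim, n=3)
print(toy_pred.shape)
```

Each cell is the same weighted average as in the double-loop version; only the loop over users is replaced by a single dot product per item.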

In [130]:
# Store the final predicted-rating matrix as a DataFrame again
ratings_pred_matrix = pd.DataFrame(data = ratings_pred,
                                  index= ratings_matrix.index,
                                  columns = ratings_matrix.columns)
ratings_pred_matrix.head(10)

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.220798,0.0,0.0,1.677291,0.284372
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.220798,0.0,0.0,0.194828,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022562,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.295227,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.270718,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.107105,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


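#### Dissecting the function above
- The matrix dimensions were confusing, so I worked them out in OneNote.
- The computation is carried out selectively, as shown below.
- STEP1 through STEP3 are repeated once per user and once per item of the user-item matrix.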

![alt text](Item_based_top_n.jpg "Title")

In [111]:
# Example: only the single cell with col = 0, row = 1
n = 20
col = 0
row = 1

print(ratings_matrix.shape)
top_n_items = np.argsort(item_sim_df.values[:, col])[:-n-1:-1]  # Indices of the n largest similarities (argsort); the trailing -1 step gives descending order
top_n_items

(610, 9719)


array([   0,  179, 7085, 6471, 2253, 5591, 7674, 7095, 2247, 3584, 4925,
       3565, 7537, 8267, 7676, 5111,  183, 8251, 3990,  199], dtype=int64)

In [112]:
a = item_sim_df.values[col, :][top_n_items]  # Pull the actual similarity values at the top-n indices found above.
a

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1.])

In [117]:
b=ratings_matrix.values[row, :][top_n_items]
b

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0.])

In [114]:
a.dot(b.T)

0.0

In [115]:
print(np.abs(item_sim_df.values[col, :][top_n_items]))
np.sum(np.abs(item_sim_df.values[col, :][top_n_items]))  

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


20.0

In [116]:
# The final value that gets computed
a.dot(b.T) / np.sum(np.abs(item_sim_df.values[col, :][top_n_items]))  

0.0
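
The slice `[:-n-1:-1]` applied to the `argsort` result is the least obvious part of the walkthrough. This small sketch on a made-up four-element vector shows what it selects.

```python
import numpy as np

sims = np.array([0.1, 0.9, 0.4, 0.7])   # hypothetical similarity values
n = 2

order = np.argsort(sims)     # indices sorted by ascending similarity
top_n = order[:-n-1:-1]      # walk backwards from the end and take n indices

print(order)   # [0 2 3 1]
print(top_n)   # [1 3]  -> positions of 0.9 and 0.7
```

`np.argsort(-sims)[:n]` gives the same indices here; the two forms can differ in how they order tied values.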

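### Finally, recommending to a user!
- In effect, the recommender was finished once `ratings_pred_matrix` was computed above.
- However, the prediction matrix also holds predictions for movies the user has already rated, so one more step is needed to exclude movies already seen.
- Let's recommend movies for userId = 9.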

In [126]:
user_rating_id = ratings_matrix.loc[9, :]  # Ratings given by user 9
user_rating_id[user_rating_id > 0].sort_values(ascending=False)[:10]  # Top 10 among ratings greater than 0

title
Adaptation (2002)                                                                 5.0
Austin Powers in Goldmember (2002)                                                5.0
Lord of the Rings: The Fellowship of the Ring, The (2001)                         5.0
Lord of the Rings: The Two Towers, The (2002)                                     5.0
Producers, The (1968)                                                             5.0
Citizen Kane (1941)                                                               5.0
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)    5.0
Back to the Future (1985)                                                         5.0
Glengarry Glen Ross (1992)                                                        4.0
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)                                     4.0
Name: 9, dtype: float64

- Overall, high ratings go to popular adventure and comedy movies.
- Now let's pick the movies to recommend to this user.

In [127]:
### Function that picks out only the movies a user has not seen

def get_unseen_movies(ratings_matrix, userId):
    user_rating = ratings_matrix.loc[userId, :]  # All ratings of userId, as a Series
    already_seen = user_rating[user_rating > 0].index.tolist()
        # user_rating > 0 : movies with a rating greater than 0
        # .index          : take only the index values (titles)
        # .tolist()       : convert to a list
    movies_list = ratings_matrix.columns.tolist()  # List of all movie titles
    unseen_list = [movie for movie in movies_list if movie not in already_seen]  # List comprehension that drops already_seen

    return unseen_list


In [131]:
### Finally, recommend movies to the user.

def recomm_movie_by_userid(pred_df, userId, unseen_list, top_n=10):
    recomm_movies = pred_df.loc[userId, unseen_list].sort_values(ascending=False)[:top_n]
        # pred_df.loc[userId, unseen_list]       : predictions for the user's unseen movies
        # .sort_values(ascending=False)[:top_n]  : the top_n movies with the highest predicted rating

    return recomm_movies

In [133]:
unseen_list = get_unseen_movies(ratings_matrix, 9)  # Movies not yet seen
recomm_movies = recomm_movie_by_userid(ratings_pred_matrix, 9, unseen_list, top_n=10)

recomm_movies = pd.DataFrame(data = recomm_movies.values,
                            index=recomm_movies.index,
                            columns=['pred_score'])
recomm_movies

title,pred_score
Shrek (2001),0.866202
Spider-Man (2002),0.857854
"Last Samurai, The (2003)",0.817473
Indiana Jones and the Temple of Doom (1984),0.816626
"Matrix Reloaded, The (2003)",0.80099
Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001),0.765159
Gladiator (2000),0.740956
"Matrix, The (1999)",0.732693
Pirates of the Caribbean: The Curse of the Black Pearl (2003),0.689591
"Lord of the Rings: The Return of the King, The (2003)",0.676711


- And with that, the list of recommendations is complete.
- Overall, these are popular titles (Shrek, Spider-Man, and so on).
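
The two final steps (filter out seen movies, then take the highest predictions) can be followed end to end on a tiny case. In this sketch the frames `toy_ratings` and `toy_pred`, the titles "A" to "D", and the combined helper `recommend` are all hypothetical; the helper only mirrors the idea of `get_unseen_movies` plus `recomm_movie_by_userid`.

```python
import pandas as pd

# Hypothetical one-user matrices laid out like ratings_matrix and ratings_pred_matrix.
titles = ["A", "B", "C", "D"]
toy_ratings = pd.DataFrame([[5.0, 0.0, 0.0, 3.0]], index=[9], columns=titles)  # 0 = not rated
toy_pred = pd.DataFrame([[4.1, 0.9, 2.3, 2.8]], index=[9], columns=titles)

def recommend(pred_df, ratings_df, user_id, top_n=10):
    user_ratings = ratings_df.loc[user_id]
    unseen = user_ratings[user_ratings == 0].index            # titles the user has not rated
    return pred_df.loc[user_id, unseen].sort_values(ascending=False).head(top_n)

recs = recommend(toy_pred, toy_ratings, 9, top_n=2)
print(recs)
```

"A" has the highest prediction but is already rated, so it is excluded; "C" and "B" are returned in descending order of predicted score.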