## Collaborative Filtering

### Types of collaborative filtering
1. Nearest-neighbor based<br>
    1) User-based (user-user CF): find the users most similar to a given user<br>
    2) Item-based (item-item CF): find the items most similar to a given item<br>
<br>
2. Latent-factor based<br>
    1) Matrix factorization: rating matrix -> SVD decomposition -> recombination -> rating prediction and recommendation
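
The matrix-factorization flow listed above (rating matrix -> SVD -> recombination -> prediction) can be sketched with plain NumPy. This is only an illustration on a made-up 4x5 rating matrix: the matrix `R`, the choice `k = 2`, and all variable names are assumptions, and filling unrated cells with 0 before running SVD is a simplification that practical recommenders usually avoid.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 means "not rated".
R = np.array([
    [5.0, 4.0, 0.0, 1.0, 0.0],
    [4.0, 0.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 5.0, 4.0, 0.0],
])

k = 2  # number of latent factors to keep (an arbitrary choice for this sketch)

# Decompose: R = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Recombine using only the k largest singular values.
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Predicted scores for the items user 0 has not rated yet.
unrated = np.where(R[0] == 0)[0]
print({int(i): round(float(R_hat[0, i]), 3) for i in unrated})
```

The unrated items with the highest reconstructed scores would be the recommendation candidates.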

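### Characteristics of collaborative filtering
Nearest-neighbor based: recommending to a user<br>
**The key step is building a DataFrame whose rows are users and whose columns are item ratings.**<br>
-> From the items a user has rated, predict ratings for the items the user has not rated -> recommend an item if its predicted rating is high, and do not recommend it otherwise.<br>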
<img src=https://blog.kakaocdn.net/dn/o66ik/btrGNCDxcmI/yWFWqfkWZ6CB9P6OjDJlK1/img.png width=1000>

### Building the dataset for collaborative filtering
Convert the data so that rows are users and columns are item ratings.<br>
<br><img src=https://blog.kakaocdn.net/dn/bLlwAm/btrGLy2Xz86/jROqSTzfUW0OEIproTEHH1/img.png width=1000>
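
As a small illustration of that conversion, the sketch below pivots a hypothetical long-format table into a user-by-item matrix with pandas. The column names and values are made up for this example.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, item, rating).
ratings_long = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "item":   ["A", "B", "A", "C", "B"],
    "rating": [4.0, 5.0, 3.0, 2.0, 1.0],
})

# Rows become users, columns become items; unrated cells are NaN.
user_item = ratings_long.pivot_table(values="rating", index="userId", columns="item")
print(user_item)
```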

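### Understanding user-based and item-based collaborative filtering
1. User-based: when users A and B rate items very similarly, B has given item L a good rating, and A has not tried L yet, recommend item L to A.<br>
2. Item-based: when users' ratings for items I and J are very similar, tell someone buying item I that "other customers who chose this product also bought item J."<br>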
<br><img src=https://blog.kakaocdn.net/dn/dY98VK/btrGLghaIMQ/jII7ihwhG9jwblpkhVcbm1/img.png width=1000>

#### User-based collaborative filtering example
Given users A, B, and C, if A's and B's ratings are similar, we judge that their tastes are similar.<br>
If user B then really enjoyed Prometheus, it is worth recommending to user A.<br>
<br><img src=https://blog.kakaocdn.net/dn/bdeKMg/btrGM4tBQCO/WFCF5IHW2oq1sb4UPN4gs1/img.png width=1000>
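
The same reasoning can be written as a tiny sketch. The ratings below are invented, the last column merely stands in for "Prometheus", and the helper `cosine` is a hypothetical name defined here rather than a library function.

```python
import numpy as np

users = ["A", "B", "C"]
# Columns are four hypothetical movies; 0 means the user has not rated the movie.
R = np.array([
    [5.0, 4.0, 1.0, 0.0],   # A has not seen the last movie
    [5.0, 5.0, 1.0, 5.0],   # B rates like A and loved the last movie
    [1.0, 2.0, 5.0, 2.0],   # C has different tastes
])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Compare A with the other users on the movies A has actually rated.
seen_by_a = R[0] > 0
sims = {users[i]: cosine(R[0, seen_by_a], R[i, seen_by_a]) for i in (1, 2)}
nearest = max(sims, key=sims.get)
print(sims, nearest)
```

Because B comes out as A's nearest neighbor, B's high rating for the last movie makes it a candidate recommendation for A.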

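### User-based vs. item-based
**-> In general, the item-based approach is preferred over the user-based one.**<br>
Buying the same product is often not enough to conclude that two people are similar.<br>
In other words, people who bought the same product frequently do not share the same tastes.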

### Collaborative filtering uses cosine similarity
<br><img src=https://blog.kakaocdn.net/dn/dqf572/btrGLfii5dw/mQk7mcmeHkMXj5YjMGdkPK/img.png width=1000>
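
For reference, cosine similarity between two rating vectors is their dot product divided by the product of their lengths. The sketch below computes it with NumPy on made-up vectors; `cosine_sim` is a hypothetical helper, not the scikit-learn function used later in this post.

```python
import numpy as np

def cosine_sim(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

item_i = np.array([4.0, 0.0, 5.0])  # ratings from three users for item I
item_j = np.array([2.0, 0.0, 2.5])  # same direction, half the magnitude
item_k = np.array([0.0, 3.0, 0.0])  # rated only by the user who skipped item I

print(cosine_sim(item_i, item_j))  # parallel vectors
print(cosine_sim(item_i, item_k))  # no overlapping raters
```

The measure depends only on direction, so items I and J count as maximally similar even though J's ratings are uniformly lower.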

### Converting between user-based and item-based
<br><img src=https://blog.kakaocdn.net/dn/dqIsQg/btrGKwERqiY/LKprwkmQAZH3y8C91dPnP1/img.png width=1000>

### Item-based collaborative filtering
-> Personalized movie recommendation<br>
<br><img src=https://blog.kakaocdn.net/dn/chDSSX/btrGKS1XcnK/LjY1JJBuNsP9WnA1LbqWCk/img.png width=1000>
<br>
- Formula for the personalized predicted rating<br>
<br><img src=https://blog.kakaocdn.net/dn/bxt9th/btrGOcdyo50/tKIjocEsTIiY68iVJjWwsk/img.png width=1000>
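
The formula in the figure is not available as text here. As a sketch, the weighted rating sum that the `predict_rating` function later in this post computes can be written as below. The notation is assumed rather than taken from the figure: $\hat{R}_{u,i}$ is the predicted rating of user $u$ for item $i$, $S_{i,j}$ is the cosine similarity between items $i$ and $j$, and $R_{u,j}$ is the rating user $u$ gave item $j$ (0 if unrated).

```latex
\hat{R}_{u,i} = \frac{\sum_{j} S_{i,j} \, R_{u,j}}{\sum_{j} \lvert S_{i,j} \rvert}
```

In the top-n variant, the sums run only over the n items most similar to item $i$.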

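### Steps to implement item-based collaborative filtering
1. Convert the user-item matrix into an item-user matrix<br>
2. Compute item-to-item similarity using cosine similarity<br>
3. For the items the user has not watched (or bought), compute predicted scores that reflect item-to-item similarity<br>
4. Recommend items in descending order of predicted score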
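
The four steps can be sketched end to end on a toy matrix. Everything here (the 4x4 ratings, the item labels, the variable names) is made up for illustration, and cosine similarity is computed by hand with NumPy instead of scikit-learn so the sketch has no extra dependency.

```python
import numpy as np

items = ["I1", "I2", "I3", "I4"]
# Hypothetical user-item ratings; 0 means "not rated".
R = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

# Step 1: item-user matrix.
R_T = R.T

# Step 2: item-item cosine similarity.
norms = np.linalg.norm(R_T, axis=1, keepdims=True)
S = (R_T @ R_T.T) / (norms @ norms.T)

# Step 3: similarity-weighted predicted scores for every (user, item) pair.
pred = (R @ S) / np.abs(S).sum(axis=0)

# Step 4: for user 0, rank only the items not rated yet.
user = 0
unseen = np.where(R[user] == 0)[0]
ranked = unseen[np.argsort(pred[user, unseen])[::-1]]
print([items[i] for i in ranked])
```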

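## Item-based collaborative filtering hands-on

### Loading the data
https://grouplens.org/datasets/movielens/<br>

The dataset has about 100,000 ratings given by roughly 600 users to roughly 9,000 movies.<br>

ratings.csv holds the rating each userId gave to each movieId.<br>
movies.csv maps each movieId to its movie title.<br>

Join ratings and movies to build rating data keyed by user and movie title.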

In [3]:
import pandas as pd
import numpy as np

movies = pd.read_csv('./ml-latest-small/movies.csv')
ratings = pd.read_csv('./ml-latest-small/ratings.csv')
print(f"{movies.shape=}")
print(f"{ratings.shape=}")

movies.shape=(9742, 3)
ratings.shape=(100836, 4)


In [6]:
movies.head(3)

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |


In [7]:
ratings.head(3)

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |


In [71]:
print(ratings.rating.describe())
# Ratings range from 0.5 to 5.0 in steps of 0.5; the mean is about 3.5.
print(f"Rating values: {sorted(ratings.rating.unique())}")

count    100836.000000
mean          3.501557
std           1.042529
min           0.500000
25%           3.000000
50%           3.500000
75%           4.000000
max           5.000000
Name: rating, dtype: float64
Rating values: [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]


### pivot_table: user-item rating matrix
**Convert the row-level (long-format) user rating data into a user-item rating matrix**

In [8]:
ratings = ratings[['userId', 'movieId', 'rating']]
ratings_matrix = ratings.pivot_table('rating', index='userId', columns='movieId')
ratings_matrix.head(3)

| movieId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 193565 | 193567 | 193571 | 193573 | 193579 | 193581 | 193583 | 193585 | 193587 | 193609 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| userId | | | | | | | | | | | | | | | | | | | | | |
| 1 | 4.0 | NaN | 4.0 | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


In [86]:
ratings_matrix.shape

(610, 9719)

In [17]:
# Join with movies to get the title column
rating_movies = pd.merge(ratings, movies, on='movieId')
print(f"{ratings.shape=}, {movies.shape=}, {rating_movies.shape=}")
rating_movies.head(3)

ratings.shape=(100836, 3), movies.shape=(9742, 3), rating_movies.shape=(100836, 5)


| | userId | movieId | rating | title | genres |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 5 | 1 | 4.0 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 2 | 7 | 1 | 4.5 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |


In [13]:
# Pivot with columns='title' so that each column is a movie title.
ratings_matrix = rating_movies.pivot_table('rating', index='userId', columns='title')

# Replace all NaN values with 0.
ratings_matrix = ratings_matrix.fillna(0)

print(f"{ratings_matrix.shape=} -> 610 users, 9719 movies")
ratings_matrix.head(3)
# However many movies one person watches, nobody can watch all ~9,700, so this is a sparse matrix.
## This is not specific to this dataset: product and content data for recommendation are sparse too, because no individual consumes everything.

ratings_matrix.shape=(610, 9719) -> 610 users, 9719 movies


| title | '71 (2014) | 'Hellboy': The Seeds of Creation (2004) | 'Round Midnight (1986) | 'Salem's Lot (2004) | 'Til There Was You (1997) | 'Tis the Season for Love (2015) | 'burbs, The (1989) | 'night Mother (1986) | (500) Days of Summer (2009) | \*batteries not included (1987) | ... | Zulu (2013) | [REC] (2007) | [REC]² (2009) | [REC]³ 3 Génesis (2012) | anohana: The Flower We Saw That Day - The Movie (2013) | eXistenZ (1999) | xXx (2002) | xXx: State of the Union (2005) | ¡Three Amigos! (1986) | À nous la liberté (Freedom for Us) (1931) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| userId | | | | | | | | | | | | | | | | | | | | | |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
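
To put a number on how sparse this matrix is, the sketch below uses only the figures printed earlier in this post. It gives an upper bound on the share of non-zero cells rather than an exact count, because `pivot_table` averages any duplicate (user, title) pairs into one cell.

```python
# Figures printed earlier in this post.
n_ratings = 100_836             # rows in ratings.csv
n_users, n_titles = 610, 9_719  # shape of ratings_matrix

# Upper bound on the share of filled cells: duplicates merged by
# pivot_table can only make the real count smaller.
max_density = n_ratings / (n_users * n_titles)
print(f"{max_density:.4f}")      # largest possible share of non-zero cells
print(f"{1 - max_density:.4f}")  # smallest possible share of zero cells
```

Under these figures, at most about 1.7% of the cells hold a rating.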


### Transpose user-item -> item-user

In [15]:
ratings_matrix_T = ratings_matrix.transpose()
print(f"{ratings_matrix_T.shape=}")
ratings_matrix_T.head(3)

ratings_matrix_T.shape=(9719, 610)


| userId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| title | | | | | | | | | | | | | | | | | | | | | |
| '71 (2014) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
| 'Hellboy': The Seeds of Creation (2004) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 'Round Midnight (1986) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### Rating-based movie-to-movie cosine similarity

In [18]:
from sklearn.metrics.pairwise import cosine_similarity

# item_sim is an ndarray
item_sim = cosine_similarity(ratings_matrix_T, ratings_matrix_T) # pass the same DataFrame twice to get pairwise cosine similarity

# Wrap item_sim in a DataFrame for readability:
# label the NumPy array returned by cosine_similarity() with movie titles
item_sim_df = pd.DataFrame(data=item_sim, index=ratings_matrix.columns,
                          columns=ratings_matrix.columns)
print(f"{item_sim_df.shape=}")
item_sim_df.head(3)

item_sim_df.shape=(9719, 9719)


| title | '71 (2014) | 'Hellboy': The Seeds of Creation (2004) | 'Round Midnight (1986) | 'Salem's Lot (2004) | 'Til There Was You (1997) | 'Tis the Season for Love (2015) | 'burbs, The (1989) | 'night Mother (1986) | (500) Days of Summer (2009) | \*batteries not included (1987) | ... | Zulu (2013) | [REC] (2007) | [REC]² (2009) | [REC]³ 3 Génesis (2012) | anohana: The Flower We Saw That Day - The Movie (2013) | eXistenZ (1999) | xXx (2002) | xXx: State of the Union (2005) | ¡Three Amigos! (1986) | À nous la liberté (Freedom for Us) (1931) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| title | | | | | | | | | | | | | | | | | | | | | |
| '71 (2014) | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.141653 | 0.0 | ... | 0.0 | 0.342055 | 0.543305 | 0.707107 | 0.0 | 0.0 | 0.139431 | 0.327327 | 0.0 | 0.0 |
| 'Hellboy': The Seeds of Creation (2004) | 0.0 | 1.0 | 0.707107 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 'Round Midnight (1986) | 0.0 | 0.707107 | 1.0 | 0.0 | 0.0 | 0.0 | 0.176777 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [10]:
item_sim_df["Godfather, The (1972)"].sort_values(ascending=False)[:6]

title
Godfather, The (1972)                        1.000000
Godfather: Part II, The (1974)               0.821773
Goodfellas (1990)                            0.664841
One Flew Over the Cuckoo's Nest (1975)       0.620536
Star Wars: Episode IV - A New Hope (1977)    0.595317
Fargo (1996)                                 0.588614
Name: Godfather, The (1972), dtype: float64

In [11]:
item_sim_df["Inception (2010)"].sort_values(ascending=False)[1:6]

title
Dark Knight, The (2008)          0.727263
Inglourious Basterds (2009)      0.646103
Shutter Island (2010)            0.617736
Dark Knight Rises, The (2012)    0.617504
Fight Club (1999)                0.615417
Name: Inception (2010), dtype: float64

### Weighted Rating Sum

Personalized movie recommendation with item-based nearest-neighbor collaborative filtering
<br><img src=https://blog.kakaocdn.net/dn/chDSSX/btrGKS1XcnK/LjY1JJBuNsP9WnA1LbqWCk/img.png width=1000>

- ratings_arr: each user's movie ratings
- item_sim_arr: cosine similarity between movies

In [20]:
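def predict_rating(ratings_arr, item_sim_arr):
    # Similarity-weighted sum of each user's ratings, divided by the sum of absolute
    # similarities per item (item_sim_arr is symmetric, so axis=1 matches axis=0).
    ratings_pred = ratings_arr.dot(item_sim_arr) / np.array([np.abs(item_sim_arr).sum(axis=1)])
    return ratings_pred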

In [73]:
ratings_matrix.head(2) # actual ratings

| title | '71 (2014) | 'Hellboy': The Seeds of Creation (2004) | 'Round Midnight (1986) | 'Salem's Lot (2004) | 'Til There Was You (1997) | 'Tis the Season for Love (2015) | 'burbs, The (1989) | 'night Mother (1986) | (500) Days of Summer (2009) | \*batteries not included (1987) | ... | Zulu (2013) | [REC] (2007) | [REC]² (2009) | [REC]³ 3 Génesis (2012) | anohana: The Flower We Saw That Day - The Movie (2013) | eXistenZ (1999) | xXx (2002) | xXx: State of the Union (2005) | ¡Three Amigos! (1986) | À nous la liberté (Freedom for Us) (1931) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| userId | | | | | | | | | | | | | | | | | | | | | |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [72]:
item_sim_df.head(2) # rating-based similarity between movies

| title | '71 (2014) | 'Hellboy': The Seeds of Creation (2004) | 'Round Midnight (1986) | 'Salem's Lot (2004) | 'Til There Was You (1997) | 'Tis the Season for Love (2015) | 'burbs, The (1989) | 'night Mother (1986) | (500) Days of Summer (2009) | \*batteries not included (1987) | ... | Zulu (2013) | [REC] (2007) | [REC]² (2009) | [REC]³ 3 Génesis (2012) | anohana: The Flower We Saw That Day - The Movie (2013) | eXistenZ (1999) | xXx (2002) | xXx: State of the Union (2005) | ¡Three Amigos! (1986) | À nous la liberté (Freedom for Us) (1931) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| title | | | | | | | | | | | | | | | | | | | | | |
| '71 (2014) | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.141653 | 0.0 | ... | 0.0 | 0.342055 | 0.543305 | 0.707107 | 0.0 | 0.0 | 0.139431 | 0.327327 | 0.0 | 0.0 |
| 'Hellboy': The Seeds of Creation (2004) | 0.0 | 1.0 | 0.707107 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [26]:
ratings_pred = predict_rating(ratings_matrix.values , item_sim_df.values) # compute predicted ratings (ndarray)
# Wrap in a DataFrame for readability
ratings_pred_matrix = pd.DataFrame(data=ratings_pred, index= ratings_matrix.index, 
                                   columns = ratings_matrix.columns)
print(f"{ratings_pred.shape=}, {ratings_pred_matrix.shape=}")
ratings_pred_matrix.head(3) # personalized predicted ratings per user

ratings_pred.shape=(610, 9719), ratings_pred_matrix.shape=(610, 9719)


| title | '71 (2014) | 'Hellboy': The Seeds of Creation (2004) | 'Round Midnight (1986) | 'Salem's Lot (2004) | 'Til There Was You (1997) | 'Tis the Season for Love (2015) | 'burbs, The (1989) | 'night Mother (1986) | (500) Days of Summer (2009) | \*batteries not included (1987) | ... | Zulu (2013) | [REC] (2007) | [REC]² (2009) | [REC]³ 3 Génesis (2012) | anohana: The Flower We Saw That Day - The Movie (2013) | eXistenZ (1999) | xXx (2002) | xXx: State of the Union (2005) | ¡Three Amigos! (1986) | À nous la liberté (Freedom for Us) (1931) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| userId | | | | | | | | | | | | | | | | | | | | | |
| 1 | 0.070345 | 0.577855 | 0.321696 | 0.227055 | 0.206958 | 0.194615 | 0.249883 | 0.102542 | 0.157084 | 0.178197 | ... | 0.113608 | 0.181738 | 0.133962 | 0.128574 | 0.006179 | 0.21207 | 0.192921 | 0.136024 | 0.292955 | 0.720347 |
| 2 | 0.01826 | 0.042744 | 0.018861 | 0.0 | 0.0 | 0.035995 | 0.013413 | 0.002314 | 0.032213 | 0.014863 | ... | 0.01564 | 0.020855 | 0.020119 | 0.015745 | 0.049983 | 0.014876 | 0.021616 | 0.024528 | 0.017563 | 0.0 |
| 3 | 0.011884 | 0.030279 | 0.064437 | 0.003762 | 0.003749 | 0.002722 | 0.014625 | 0.002085 | 0.005666 | 0.006272 | ... | 0.006923 | 0.011665 | 0.0118 | 0.012225 | 0.0 | 0.008194 | 0.007017 | 0.009229 | 0.01042 | 0.084501 |


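### Evaluating prediction quality (MSE)
**After computing the weighted ratings, measure prediction quality with MSE**<br>
We cannot otherwise tell whether the predicted ratings are any good, so we compare them against the ratings that actually exist and compute the MSE.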

In [31]:
from sklearn.metrics import mean_squared_error

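# Compute the MSE only over the movies each user actually rated.
def get_mse(pred, actual):
    # Keep only the entries where an actual (nonzero) rating exists.
    pred = pred[actual.nonzero()].flatten()
    actual = actual[actual.nonzero()].flatten()
    # scikit-learn's argument order is (y_true, y_pred); MSE is symmetric either way.
    return mean_squared_error(actual, pred)

print(f'Item-based, all neighbors MSE: {get_mse(ratings_pred, ratings_matrix.values ):.4f}')

Item-based, all neighbors MSE: 9.8954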
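
A likely contributor to such a large error is the normalization: when every item is treated as a neighbor, items the user never rated contribute 0 to the numerator but still add their similarity to the denominator, which pulls predictions toward 0. The sketch below shows the effect on made-up numbers (the similarities and ratings are hypothetical) and compares it with using only the two most similar neighbors.

```python
import numpy as np

# Hypothetical similarities between one target item and four other items.
sims = np.array([0.9, 0.8, 0.3, 0.2])
# The user rated only the first two of them; 0 means "not rated".
ratings = np.array([4.0, 5.0, 0.0, 0.0])

# All neighbors: unrated items still add their similarity to the denominator.
pred_all = sims @ ratings / np.abs(sims).sum()

# Only the two most similar neighbors.
top2 = np.argsort(sims)[:-3:-1]
pred_top2 = sims[top2] @ ratings[top2] / np.abs(sims[top2]).sum()

print(round(float(pred_all), 3), round(float(pred_top2), 3))
```

With thousands of mostly unrated movies in the real matrix, the same dilution is much stronger than in this four-item toy case.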


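### Weighted rating sum over the top-n items
Upgrading the predict_rating function

#### Function definition
**Compute predicted ratings using only the items with the top-n similarity**<br>
Run the process below for only n (20) items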
<br><img src=https://blog.kakaocdn.net/dn/chDSSX/btrGKS1XcnK/LjY1JJBuNsP9WnA1LbqWCk/img.png width=1000>

In [76]:
def predict_rating_topsim(ratings_arr, item_sim_arr, n=20):
    # Initialize the prediction matrix with zeros, same shape as the user-item rating matrix
    pred = np.zeros(ratings_arr.shape)

    # Loop over the columns of the user-item rating matrix, i.e. once per movie
    print(f"{ratings_arr.shape[1]=}")
    for col in range(ratings_arr.shape[1]):
        # Indices of the n items most similar to this one, in descending order of similarity
        # (a plain index array; wrapping it in a list triggers NumPy indexing warnings)
        top_n_items = np.argsort(item_sim_arr[:, col])[:-n-1:-1]
        # Compute the personalized predicted rating for every user
        for row in range(ratings_arr.shape[0]):
            pred[row, col] = item_sim_arr[col, :][top_n_items].dot(ratings_arr[row, :][top_n_items].T)
            pred[row, col] /= np.sum(np.abs(item_sim_arr[col, :][top_n_items]))
    return pred
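
The double loop above runs once per (user, movie) pair, which is slow for 610 x 9719 cells. As a sketch only, the same top-n weighted sum can be expressed with array operations. The function name `predict_rating_topsim_vec` is hypothetical, and the code assumes the similarity matrix is symmetric (true for cosine similarity) and that ties among similarities do not matter.

```python
import numpy as np

def predict_rating_topsim_vec(ratings_arr, item_sim_arr, n=20):
    """Vectorized sketch of the top-n weighted rating sum (no Python loops)."""
    n_items = item_sim_arr.shape[0]
    # Row indices of the n largest similarities in every column, shape (n, n_items).
    top_idx = np.argsort(item_sim_arr, axis=0)[:-n-1:-1, :]
    cols = np.arange(n_items)
    # Keep only the top-n similarities per column; everything else stays 0.
    sim_top = np.zeros(item_sim_arr.shape, dtype=float)
    sim_top[top_idx, cols] = item_sim_arr[top_idx, cols]
    # Numerator: similarity-weighted sum of ratings; denominator: sum of |similarity|.
    return ratings_arr.dot(sim_top) / np.abs(sim_top).sum(axis=0)

# Small demo on random data (values are arbitrary).
rng = np.random.default_rng(0)
R = rng.random((5, 8)) + 0.1
norms = np.linalg.norm(R, axis=0)
S = (R.T @ R) / np.outer(norms, norms)
print(predict_rating_topsim_vec(R, S, n=3).shape)
```

The trade-off is memory: this version allocates additional arrays the size of the similarity matrix, so whether it is practical depends on the machine.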

#### Computing the MSE
**Predicted ratings and MSE based on top-n similarity**

In [77]:
import warnings
warnings.filterwarnings('ignore')
ratings_pred = predict_rating_topsim(ratings_matrix.values , item_sim_df.values, n=20)
print(f'Item-based, top-20 neighbors MSE: {get_mse(ratings_pred, ratings_matrix.values):.4f}')

ratings_arr.shape[1]=9719
Item-based, top-20 neighbors MSE: 3.6950


In [38]:
# Rebuild the computed predicted ratings as a DataFrame
ratings_pred_matrix = pd.DataFrame(data=ratings_pred, index= ratings_matrix.index,
                                   columns = ratings_matrix.columns)

In [55]:
ratings_pred_matrix.head(5)

| title | '71 (2014) | 'Hellboy': The Seeds of Creation (2004) | 'Round Midnight (1986) | 'Salem's Lot (2004) | 'Til There Was You (1997) | 'Tis the Season for Love (2015) | 'burbs, The (1989) | 'night Mother (1986) | (500) Days of Summer (2009) | \*batteries not included (1987) | ... | Zulu (2013) | [REC] (2007) | [REC]² (2009) | [REC]³ 3 Génesis (2012) | anohana: The Flower We Saw That Day - The Movie (2013) | eXistenZ (1999) | xXx (2002) | xXx: State of the Union (2005) | ¡Three Amigos! (1986) | À nous la liberté (Freedom for Us) (1931) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| userId | | | | | | | | | | | | | | | | | | | | | |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.220798 | 0.0 | 0.0 | 1.677291 | 0.284372 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.220798 | 0.0 | 0.0 | 0.194828 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [39]:
# Movies rated by user 9, highest rating first
user_rating_id = ratings_matrix.loc[9, :]
user_rating_id[ user_rating_id > 0].sort_values(ascending=False)[:10]

title
Adaptation (2002)                                                                 5.0
Citizen Kane (1941)                                                               5.0
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)    5.0
Producers, The (1968)                                                             5.0
Lord of the Rings: The Two Towers, The (2002)                                     5.0
Lord of the Rings: The Fellowship of the Ring, The (2001)                         5.0
Back to the Future (1985)                                                         5.0
Austin Powers in Goldmember (2002)                                                5.0
Minority Report (2002)                                                            4.0
Witness (1985)                                                                    4.0
Name: 9, dtype: float64

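### Recommending movies with collaborative filtering
**Recommend movies the user has not watched, using item-based nearest-neighbor collaborative filtering**

#### Recommend unrated movies with the highest predicted rating
**From the ratings predicted via item-based similarity, recommend the movies with the highest predicted rating among those the user has not watched**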

In [78]:
def get_unseen_movies(ratings_matrix, userId):
    # Extract every movie rating for the given userId as a Series.
    user_rating = ratings_matrix.loc[userId,:]
    # A rating above 0 means the movie was already watched, so a rating of 0 means not watched.
    unseen_list = user_rating[ user_rating == 0].index.tolist()
    return unseen_list

def recomm_movie_by_userid(pred_df, userId, unseen_list, top_n=10):
    # Select the userId row and the unseen_list title columns from the predicted-rating DataFrame,
    # then sort by predicted rating in descending order.
    recomm_movies = pred_df.loc[userId, unseen_list].sort_values(ascending=False)[:top_n]
    return recomm_movies
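
To see how the two functions fit together, here is a toy run on a single user with four made-up titles. The DataFrames `toy_ratings` and `toy_pred` are hypothetical, and the functions are repeated (without comments) only so the sketch runs on its own.

```python
import pandas as pd

# Same logic as the two functions above, repeated so the sketch is self-contained.
def get_unseen_movies(ratings_matrix, userId):
    user_rating = ratings_matrix.loc[userId, :]
    return user_rating[user_rating == 0].index.tolist()

def recomm_movie_by_userid(pred_df, userId, unseen_list, top_n=10):
    return pred_df.loc[userId, unseen_list].sort_values(ascending=False)[:top_n]

titles = ["Movie A", "Movie B", "Movie C", "Movie D"]  # hypothetical titles
toy_ratings = pd.DataFrame([[5.0, 0.0, 0.0, 3.0]], index=[9], columns=titles)
toy_pred = pd.DataFrame([[4.1, 0.7, 2.3, 2.9]], index=[9], columns=titles)

unseen = get_unseen_movies(toy_ratings, 9)
top = recomm_movie_by_userid(toy_pred, 9, unseen, top_n=1)
print(unseen)
print(top)
```

Movie A has the highest predicted score overall, but it is excluded because the user already rated it; only unseen titles compete for the recommendation.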

In [85]:
user9_rating = rating_movies[rating_movies.userId==9].sort_values('rating', ascending=False)
print(user9_rating.shape[0])
user9_rating.head(13)

46


| | userId | movieId | rating | title | genres |
|---|---|---|---|---|---|
| 7132 | 9 | 1198 | 5.0 | Raiders of the Lost Ark (Indiana Jones and the... | Action\|Adventure |
| 8727 | 9 | 1270 | 5.0 | Back to the Future (1985) | Adventure\|Comedy\|Sci-Fi |
| 40308 | 9 | 5481 | 5.0 | Austin Powers in Goldmember (2002) | Comedy |
| 35930 | 9 | 4993 | 5.0 | Lord of the Rings: The Fellowship of the Ring,... | Adventure\|Fantasy |
| 40439 | 9 | 5902 | 5.0 | Adaptation (2002) | Comedy\|Drama\|Romance |
| 37032 | 9 | 5952 | 5.0 | Lord of the Rings: The Two Towers, The (2002) | Adventure\|Fantasy |
| 5641 | 9 | 923 | 5.0 | Citizen Kane (1941) | Drama\|Mystery |
| 40176 | 9 | 2300 | 5.0 | Producers, The (1968) | Comedy |
| 1262 | 9 | 223 | 4.0 | Clerks (1994) | Comedy |
| 39987 | 9 | 1095 | 4.0 | Glengarry Glen Ross (1992) | Drama |


In [49]:
# Extract the titles of the movies user 9 has not watched
unseen_list = get_unseen_movies(ratings_matrix, 9)

In [57]:
unseen_list[:10]

["'71 (2014)",
 "'Hellboy': The Seeds of Creation (2004)",
 "'Round Midnight (1986)",
 "'Salem's Lot (2004)",
 "'Til There Was You (1997)",
 "'Tis the Season for Love (2015)",
 "'burbs, The (1989)",
 "'night Mother (1986)",
 '(500) Days of Summer (2009)',
 '*batteries not included (1987)']

In [52]:
# Recommend movies with item-based nearest-neighbor collaborative filtering
recomm_movies = recomm_movie_by_userid(ratings_pred_matrix, 9, unseen_list, top_n=10)
print(f"{recomm_movies=}")

recomm_movies=title
Shrek (2001)                                                                                      0.866202
Spider-Man (2002)                                                                                 0.857854
Last Samurai, The (2003)                                                                          0.817473
Indiana Jones and the Temple of Doom (1984)                                                       0.816626
Matrix Reloaded, The (2003)                                                                       0.800990
Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)    0.765159
Gladiator (2000)                                                                                  0.740956
Matrix, The (1999)                                                                                0.732693
Pirates of the Caribbean: The Curse of the Black Pearl (2003)                                     0.689591
Lord of the Rings: The Return of the King, The (2003)                                             0.676711

In [53]:
# Build a DataFrame from the predicted scores.
recomm_movies = pd.DataFrame(data=recomm_movies.values,index=recomm_movies.index,columns=['pred_score'])
recomm_movies

| title | pred_score |
|---|---|
| Shrek (2001) | 0.866202 |
| Spider-Man (2002) | 0.857854 |
| Last Samurai, The (2003) | 0.817473 |
| Indiana Jones and the Temple of Doom (1984) | 0.816626 |
| Matrix Reloaded, The (2003) | 0.80099 |
| Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001) | 0.765159 |
| Gladiator (2000) | 0.740956 |
| Matrix, The (1999) | 0.732693 |
| Pirates of the Caribbean: The Curse of the Black Pearl (2003) | 0.689591 |
| Lord of the Rings: The Return of the King, The (2003) | 0.676711 |
