### Item-based Collaborative Filtering

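- Recommends items by finding other items that are similar to a given item
- Computes item-to-item similarity from users' past behavior data and generates recommendations from those similarities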

**Process**
1. Compute item-to-item similarity
2. Identify the user's preferences
3. Predict weighted ratings
4. Serve recommendations
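
The four steps above can be sketched end to end on a tiny, made-up rating matrix. This is a minimal sketch, not the notebook's implementation: the users `u1`..`u4`, items `A`..`D`, and every `toy_*` name are hypothetical, and cosine similarity is computed directly with NumPy instead of a library helper.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: rows are users, columns are items, 0 means "not rated".
toy_ratings = pd.DataFrame(
    [[5, 4, 0, 1],
     [4, 5, 0, 1],
     [1, 0, 5, 4],
     [0, 1, 4, 5]],
    index=["u1", "u2", "u3", "u4"],
    columns=["A", "B", "C", "D"],
    dtype=float,
)

# Step 1: item-to-item cosine similarity, computed over the item columns.
values = toy_ratings.to_numpy()
norms = np.linalg.norm(values, axis=0)
toy_sim = pd.DataFrame(
    values.T @ values / np.outer(norms, norms),
    index=toy_ratings.columns,
    columns=toy_ratings.columns,
)

# Steps 2-3: similarity-weighted sum of each user's ratings,
# normalized per item by the sum of absolute similarities.
toy_pred = toy_ratings.dot(toy_sim) / np.abs(toy_sim).sum(axis=1)

# Step 4: keep predictions only for items the user has not rated, then pick the best one.
toy_unseen = toy_pred.where(toy_ratings == 0)
toy_top = toy_unseen.idxmax(axis=1)
print(toy_top)
```

In this toy data each user has exactly one unrated item, so the recommendation is simply that item. With real data the last step would rank many unseen items.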

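**Advantages**
- Similarity is computed between items, so the computation stays relatively cheap even as the number of users grows
- Item attributes are not used, so the method still works when item metadata is scarce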

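**Disadvantages**
- Only item-to-item similarity is considered, so shifts in a user's preferences or individual taste can be hard to reflect
- Without enough underlying data, similarity cannot be computed accurately (cold start)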

In [2]:
import numpy as np
import pandas as pd

In [3]:
movies_df = pd.read_csv('./data/ml-latest-small/movies.csv')
ratings_df = pd.read_csv('./data/ml-latest-small/ratings.csv')
movies_df.shape, ratings_df.shape

((9742, 3), (100836, 4))

In [4]:
# movies_df.head()
ratings_df.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
movies_ratings_df = pd.merge(ratings_df, movies_df, on='movieId', how='inner')
print(movies_ratings_df.shape)
movies_ratings_df.head()

(100836, 6)


,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller


##### Computing item (movie) similarity from user ratings

In [6]:
users_movies_df = movies_ratings_df.pivot_table('rating', index='userId', columns='title', fill_value=0)
print(users_movies_df.shape)
users_movies_df.head()

(610, 9719)


userId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
# Look up one user's movie ratings (row position 555), highest first
users_movies_df.iloc[555].sort_values(ascending=False)[:30]

title
Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)    5.0
How to Train Your Dragon (2010)                                                                   5.0
Guardians of the Galaxy (2014)                                                                    5.0
Aladdin (1992)                                                                                    5.0
Harry Potter and the Chamber of Secrets (2002)                                                    4.5
Harry Potter and the Deathly Hallows: Part 2 (2011)                                               4.5
Eragon (2006)                                                                                     4.5
Lord of the Rings: The Fellowship of the Ring, The (2001)                                         4.5
Harry Potter and the Prisoner of Azkaban (2004)                                                   4.0
Into the Woods (2014)                                                       

In [8]:
# Number of ratings per user
(users_movies_df != 0).sum(axis=1).describe()

count     610.000000
mean      165.298361
std       269.466692
min        20.000000
25%        35.000000
50%        70.500000
75%       168.000000
max      2698.000000
dtype: float64

In [9]:
from sklearn.metrics.pairwise import cosine_similarity

movies_sim = cosine_similarity(users_movies_df.T, users_movies_df.T)
movies_sim_df = pd.DataFrame(movies_sim, index=users_movies_df.columns, columns=users_movies_df.columns)

In [10]:
movies_sim_df["'Hellboy': The Seeds of Creation (2004)"].sort_values(ascending=False)[:10]

title
'Hellboy': The Seeds of Creation (2004)                       1.000000
Space Battleship Yamato (2010)                                1.000000
Monsters (2010)                                               1.000000
All the Right Moves (1983)                                    0.780869
Hidden Fortress, The (Kakushi-toride no san-akunin) (1958)    0.747409
...And Justice for All (1979)                                 0.715542
'Round Midnight (1986)                                        0.707107
Kagemusha (1980)                                              0.542720
Sanjuro (Tsubaki Sanjûrô) (1962)                              0.526685
Ghost Rider: Spirit of Vengeance (2012)                       0.525226
Name: 'Hellboy': The Seeds of Creation (2004), dtype: float64
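
Similarities of exactly 1.0 between seemingly unrelated titles are probably a side effect of sparsity rather than a sign of real similarity. The notebook does not show per-movie rating counts, so this is an assumption. If two movies were each rated only by the same single user, their rating vectors point in the same direction and the cosine is 1.0 regardless of the rating values. The sketch below illustrates this with two hypothetical item vectors; `cosine`, `item_x`, and `item_y` are illustrative names only.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D rating vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical item columns over 5 users (0 = not rated).
# Only the third user rated either item, and the two ratings differ.
item_x = np.array([0.0, 0.0, 4.0, 0.0, 0.0])
item_y = np.array([0.0, 0.0, 2.5, 0.0, 0.0])

print(cosine(item_x, item_y))
```

In the notebook, `(users_movies_df != 0).sum(axis=0)` would give the number of ratings per movie, which is one way to check whether this explanation applies.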

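##### Weighted rating prediction
1. Per-user item (movie) ratings are available.
2. Rating-based item (movie) similarities are available.
3. Predict a weighted rating for every user and every item (movie).
4. Recommend movies to each user in descending order of predicted rating.
    - Recommend only movies the user has not seen (no rating, i.e. a rating of 0).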
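
Step 4 can be sketched as a small helper that drops the movies a user has already rated and returns the highest predicted ratings among the rest. This is a minimal sketch: `recommend_unseen` and the sample Series are hypothetical and do not appear in the notebook.

```python
import pandas as pd

def recommend_unseen(user_ratings: pd.Series, user_pred: pd.Series, top_n: int = 3) -> pd.Series:
    """Return the top_n predicted ratings among items the user has not rated (rating == 0)."""
    unseen = user_ratings[user_ratings == 0].index
    return user_pred.loc[unseen].sort_values(ascending=False).head(top_n)

# Hypothetical actual and predicted ratings for a single user.
user_ratings = pd.Series({"A": 5.0, "B": 0.0, "C": 0.0, "D": 3.0, "E": 0.0})
user_pred = pd.Series({"A": 4.1, "B": 2.7, "C": 3.9, "D": 3.2, "E": 1.5})

top = recommend_unseen(user_ratings, user_pred, top_n=2)
print(top)
```

With the notebook's frames, a call such as `recommend_unseen(users_movies_df.loc[1], ratings_pred_df.loc[1])` would presumably serve the same purpose for userId 1, once the predictions have been computed.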

**Weighted Rating Sum**

The predicted rating of user $u$ for item $i$ is the similarity-weighted sum of the ratings user $u$ gave to the $N$ items similar to $i$, divided by the sum of the absolute similarities.

$$
\hat{R}_{u,i} = \frac{\sum_{N} (S_{i,N} \times R_{u,N})}{\sum_{N} (|S_{i,N}|)}
$$


- $\hat{R}_{u,i}$: predicted rating of user $u$ for item $i$
- $S_{i,N}$: similarity between item $i$ and each of the $N$ similar items
- $R_{u,N}$: ratings user $u$ gave to those similar items

**Ratings of user $u$ for the similar items**

| Item | j | k | **i** | m | n |
|------|---|---|---|---|---|
| Rating | 5 | 4 | **1** | 3 | 2 |

**Similarity**

|   | (i,j) | (i,k) | (i,i) | (i,m) | (i,n) |
|---|-------|-------|-------|-------|-------|
| **$S_{i,N}$** | 0.2 | 0.1 | **0.4** | 0.1 | 0.2 |

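**Calculation:** take the dot product of the rating vector and the similarity vector above, then divide by the sum of the similarities.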

$$
5 \times 0.2 + 4 \times 0.1 + 1 \times 0.4 + 3 \times 0.1 + 2 \times 0.2 = 2.5
$$

$$
\hat{R}_{u,i} = \frac{(5 \times 0.2) + (4 \times 0.1) + (1 \times 0.4) + (3 \times 0.1) + (2 \times 0.2)}{0.2 + 0.1 + 0.4 + 0.1 + 0.2}
= \frac{1 + 0.4 + 0.4 + 0.3 + 0.4}{1} = \frac{2.5}{1} = 2.5
$$

As a result, the predicted rating of user $u$ for item $i$ is **2.5**.
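
The worked example can be checked numerically. The sketch below plugs the ratings and similarities from the two tables into the weighted rating formula; the variable names are illustrative.

```python
import numpy as np

# Values from the tables above, in item order j, k, i, m, n.
ratings = np.array([5.0, 4.0, 1.0, 3.0, 2.0])   # R_{u,N}
sims = np.array([0.2, 0.1, 0.4, 0.1, 0.2])      # S_{i,N}

# Weighted rating: dot product divided by the sum of absolute similarities.
pred = ratings.dot(sims) / np.abs(sims).sum()
print(round(float(pred), 2))  # approximately 2.5
```

Because the similarities in this example sum to 1, the denominator does not change the result. In general it rescales the weighted sum back to the rating scale.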


- Predict weighted ratings for all users and all movies

In [13]:
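def predict_ratings(user_movies_df, movies_sim_df):
    # Similarity-weighted sum of each user's ratings, divided per movie by the sum of absolute similarities
    return user_movies_df.dot(movies_sim_df) / np.abs(movies_sim_df).sum(axis=1)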

ratings_pred_df = predict_ratings(users_movies_df, movies_sim_df)
print(ratings_pred_df.shape)
ratings_pred_df.head(1).T

(610, 9719)


title,userId=1
'71 (2014),0.070345
'Hellboy': The Seeds of Creation (2004),0.577855
'Round Midnight (1986),0.321696
'Salem's Lot (2004),0.227055
'Til There Was You (1997),0.206958
...,...
eXistenZ (1999),0.212070
xXx (2002),0.192921
xXx: State of the Union (2005),0.136024
¡Three Amigos! (1986),0.292955


In [15]:
from sklearn.metrics import mean_squared_error

# Compare actual and predicted ratings, using only entries that have an actual rating
def get_mse(actual, pred):
    non_zero_idx = actual.nonzero()
    # print(non_zero_idx) # ([row_idx, row_idx, ...], [col_idx, col_idx, ...])
    actual = actual[non_zero_idx]
    pred = pred[non_zero_idx]
    return mean_squared_error(actual, pred)

get_mse(users_movies_df.values, ratings_pred_df.values)

9.895354759094706
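
The masking step in `get_mse` matters because unrated cells are stored as 0 and should not count as errors. The sketch below shows the same idea on two tiny hypothetical arrays, using plain NumPy rather than `mean_squared_error`; `masked_mse` is an illustrative name.

```python
import numpy as np

def masked_mse(actual: np.ndarray, pred: np.ndarray) -> float:
    """Mean squared error over the positions where an actual rating exists (non-zero)."""
    mask = actual != 0
    return float(np.mean((actual[mask] - pred[mask]) ** 2))

# Hypothetical 2x2 example: only two cells have actual ratings.
actual = np.array([[5.0, 0.0],
                   [0.0, 3.0]])
pred = np.array([[4.0, 2.0],
                 [1.0, 1.0]])

# Squared errors on the rated cells are (5-4)^2 = 1 and (3-1)^2 = 4, so the mean is 2.5.
print(masked_mse(actual, pred))
```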

- Predict the rating of a single movie for a specific user

In [16]:
users_movies_df.iloc[176, 35]    # rating at row position 176, column position 35 (0-based iloc positions)

np.float64(5.0)

In [None]:
topn_sim_idx = movies_sim_df.iloc[35].argsort()[::-1]