# 130. Item-Based Collaborative Filtering (아이템 기반 협업 필터링)

### User-Based CF vs Item_Basef CF

| 사용자 | 기생충 | 겨울왕국 | 부산행 | 백두산 | 
| --- | --- |--- |--- | ---|
| 철수|4 | 3 | 5 |   |  
|영희| |2|1|2|
|길동|1|5| | 3| 
|정숙| | |4|5|

<img src="https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FpxM35%2FbtrbAIgrefV%2FKPUOwAAuXRUw5DOaQshNi1%2Fimg.png" width="500" />

### UBCF

- 장점 : 각 사용자별로 맞춤형 추천을 하므로 정확한 추천이 가능
- 단점 : 
    - 데이터가 풍부한 경우(구매나 평가 정보) 추천이 가능   
    - 데이터가 조금만 바뀌어도 업데이트 필요
    

### IBCF

- 장점 :
    - 사용자별로 따로 계산 않으므로 계산 속도가 빠르다.
    - 데이터가 조금 바뀌어도 추천 결과에 영향이 크지 않으므로 업데이트를 자주하지 않아도 된다.  
    - 대규모 사이트에 적합
- 단점 : 정확도는 UBCF 보다 떨어짐

In [11]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity

i_cols = ['movie_id', 'title', 'release date', 'video release date', 'IMDB URL', 'unknown', 
          'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 
          'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 
          'Thriller', 'War', 'Western']

movies = pd.read_csv('data/u.item', sep='|', names=i_cols, encoding='latin-1')

r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('data/u.rating', sep='\t', names=r_cols, encoding='latin-1')

- content-based filtering 과 달리 CF 는 similarity 를 이용하므로 item 정보가 필요하지 않기 때문에 movies dataframe에서 title 을 제외한 모든 다른 열을 drop 한다.  

- ratings 만을 입력으로 사용

In [12]:
# timestamp 제거 
ratings = ratings.drop('timestamp', axis=1)
ratings.head()

Unnamed: 0,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


In [13]:
# item 정보가 필요하지 않으므로 
#movie ID와 title 빼고 다른 데이터 제거
movies = movies[['movie_id', 'title']]
movies.head()

Unnamed: 0,movie_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [14]:
# train, test set 분리
X = ratings.copy()
y = ratings['user_id']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((75000, 3), (25000, 3), (75000,), (25000,))

### train 데이터로 Full matrix 구하기 

In [25]:
rating_matrix = X_train.pivot(index="movie_id", columns="user_id", values="rating")
print(rating_matrix.shape)
rating_matrix

(1638, 943)


user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,4.0,,,4.0,4.0,,,,4.0,...,2.0,3.0,4.0,,4.0,,,5.0,,
2,3.0,,,,3.0,,,,,,...,,,,,,,,,,
3,4.0,,,,,,,,,,...,,,,,,,,,,
4,3.0,,,,,,,,,4.0,...,5.0,,,,,,2.0,,,
5,3.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,,,,,,,,,,,...,,,,,,,,,,
1679,,,,,,,,,,,...,,,,,,,,,,
1680,,,,,,,,,,,...,,,,,,,,,,
1681,,,,,,,,,,,...,,,,,,,,,,


### Item-Based CF 알고리즘

1) 모든 Item 간의 평가의 유사도를 계산  
2) 현재 추천 대상이 되는 아이템과 다른 아이템들의 유사도를 추출  
3) 현재 사용자가 평가하지 않은 모든 아이템에 대해서 현재 사용자의 예상 평가 값 계산. 예상 평가 값은 추천하려는 영화와 사용자가 평가한 영화의 유사도로 가중 평균  
4) 아이템 중에서 예상 평가 값이 가장 높은 N 개의 아이템을 추천  

이때, 추천 대상 영화와 가장 유사도가 높은 K 개의 다른 영화만 반영하여 예측 평가의 품질을 높인다.

이를 위해 추천 사용자, 영화 id 를 parameter 로 받아 예상 평점을 반환하는 함수를 작성

In [26]:
# train set의 모든 가능한 아이템 pair의 Cosine similarities 계산
matrix_dummy = rating_matrix.copy().fillna(0)
item_similarity = cosine_similarity(matrix_dummy, matrix_dummy)
item_similarity = pd.DataFrame(item_similarity, index=rating_matrix.index, columns=rating_matrix.index)
print(item_similarity.shape)
item_similarity

(1638, 1638)


movie_id,1,2,3,4,5,6,7,8,9,10,...,1672,1673,1674,1675,1676,1677,1679,1680,1681,1682
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.000000,0.291547,0.270898,0.328703,0.196434,0.096052,0.464825,0.306123,0.367212,0.195947,...,0.053819,0.040364,0.0,0.0,0.0,0.040364,0.0,0.0,0.053819,0.000000
2,0.291547,1.000000,0.257153,0.398155,0.192962,0.083040,0.327136,0.284269,0.150688,0.126659,...,0.090412,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.090412,0.090412
3,0.270898,0.257153,1.000000,0.251134,0.205386,0.074995,0.284027,0.124798,0.174688,0.110316,...,0.000000,0.000000,0.0,0.0,0.0,0.037424,0.0,0.0,0.000000,0.112272
4,0.328703,0.398155,0.251134,1.000000,0.253780,0.043336,0.353637,0.396401,0.288082,0.161575,...,0.062911,0.000000,0.0,0.0,0.0,0.041941,0.0,0.0,0.062911,0.083881
5,0.196434,0.192962,0.205386,0.253780,1.000000,0.050670,0.284305,0.186362,0.177654,0.040043,...,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.105540
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,0.040364,0.000000,0.037424,0.041941,0.000000,0.000000,0.000000,0.000000,0.083057,0.000000,...,0.000000,0.000000,0.0,0.0,0.0,1.000000,0.0,0.0,0.000000,0.000000
1679,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.0,0.0,0.0,0.000000,1.0,1.0,0.000000,0.000000
1680,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.0,0.0,0.0,0.000000,1.0,1.0,0.000000,0.000000
1681,0.053819,0.090412,0.000000,0.062911,0.000000,0.000000,0.059642,0.095277,0.066446,0.000000,...,1.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,1.000000,0.000000


현재 추천 대상이 되는 영화와 다른 영화의 유사도를 추출

In [27]:
user_id = 1     # 현재 추천 대상이 되는 사용자
movie_id = 8   # 현재 추천 대상이 되는 영화

# 현재 영화와 다른 영화의 similarity 값 가져오기
sim_scores = item_similarity[movie_id]
print(sim_scores.shape)
sim_scores.head()

(1638,)


movie_id
1    0.306123
2    0.284269
3    0.124798
4    0.396401
5    0.186362
Name: 8, dtype: float64

### 추천 대상이 되는 사용자에게 보지 않은 영화를 추천할 때 사용자가 부여할 rating 예측

1. 현 사용자의 모든 rating값 가져오기  
2. 현 사용자가 평가하지 않은 영화의 index 가져오기
3. 현 사용자가 평가하지 않은 영화의 rating (null) 제거 --> 사용자가 평가한 영화만 남김

In [31]:
 # 현 사용자의 모든 rating 값 가져오기
user_ratings = rating_matrix[user_id]
print(user_ratings.shape)
user_ratings.head()

(1638,)


movie_id
1    5.0
2    3.0
3    4.0
4    3.0
5    3.0
Name: 1, dtype: float64

In [32]:
# 사용자가 평가하지 않은 영화 index 가져오기
non_rating_idx = user_ratings[user_ratings.isnull()].index
print(non_rating_idx.shape)
non_rating_idx

(1434,)


Int64Index([   8,   12,   15,   16,   25,   29,   34,   37,   41,   46,
            ...
            1672, 1673, 1674, 1675, 1676, 1677, 1679, 1680, 1681, 1682],
           dtype='int64', name='movie_id', length=1434)

In [33]:
 # 사용자가 평가하지 않은 영화 제거
user_ratings = user_ratings.drop(non_rating_idx)
print(user_ratings.shape)
user_ratings.head()

(204,)


movie_id
1    5.0
2    3.0
3    4.0
4    3.0
5    3.0
Name: 1, dtype: float64

In [34]:
 # 사용자가 평가하지 않은 영화의 similarity 값 제거
sim_scores = sim_scores.drop(non_rating_idx)
print(sim_scores.shape)
sim_scores.head()

(204,)


movie_id
1    0.306123
2    0.284269
3    0.124798
4    0.396401
5    0.186362
Name: 8, dtype: float64

추천하려는 대상 영화에 대한 예상 rating 계산

In [35]:
# 추천하려는 영화에 대한 예상 rating 계산, 가중치는 현 영화와 사용자가 평가한 영화의 유사도
predicted_rating = np.dot(sim_scores, user_ratings) / sim_scores.sum()
predicted_rating

3.7923458513321173

In [36]:
# 가장 유사도가 높은 K 영화의  유사도로 가중평균한 rating 구하기
K = 10
sorted_item_idx = np.argsort(sim_scores)
sorted_item_idx[-K:]

movie_id
261    125
262     61
263     97
264    178
266    142
267    126
269     53
270    160
271     22
272    149
Name: 8, dtype: int64

In [38]:
sim_scores = sim_scores.values[sorted_item_idx][-K:]
sim_scores

array([0.40706372, 0.40858816, 0.41572646, 0.41726409, 0.42294405,
       0.4230396 , 0.42776092, 0.43394906, 0.43842876, 0.45214382])

In [40]:
movie_ratings = user_ratings.values[sorted_item_idx][-K:]
movie_ratings

array([5., 5., 4., 4., 4., 5., 3., 5., 4., 5.])

In [42]:
predicted_rating = np.dot(sim_scores, movie_ratings) / sim_scores.sum()
predicted_rating

4.3995902872832495

### 모델 성능 평가 - 모델별 RMSE를 계산하는 함수 
- Test set을 이용하여 측정  

In [43]:
def score(model, n_neighbors=0):
    id_pairs = zip(X_test['user_id'], X_test['movie_id'])
    y_pred = np.array([model(user, movie, n_neighbors) for (user, movie) in id_pairs])
    y_true = np.array(X_test['rating'])
    return mean_squared_error(y_true, y_pred, squared=True)

###  특정 사용자에게 추천하는 영화(movie_id)의  가중평균 rating을 계산하는 함수
- 위의 과정을 model 함수로 작성
- 위에서 개별 사용자와 아이템에 대해 구하던 predicted_rating을 함수로 작성
- 가중치는 주어진 영화와 주어진 사용자가 평가한 영화의 유사도(user_similarity)

In [44]:
def IBCF_knn_model(user_id, movie_id, n_neighbors=0):
    if movie_id in item_similarity:      # 현재 영화가 train set에 있는지 확인
        # 현재 영화와 다른 영화의 similarity 값 가져오기
        sim_scores = item_similarity[movie_id]
        # 현 사용자의 모든 rating 값 가져오기
        user_ratings = rating_matrix[user_id]
        # 사용자가 평가하지 않은 영화 index 가져오기
        non_rating_idx = user_ratings[user_ratings.isnull()].index
        # 사용자가 평가하지 않은 영화 제거
        user_ratings = user_ratings.drop(non_rating_idx)
        # 사용자가 평가하지 않은 영화의 similarity 값 제거
        sim_scores = sim_scores.drop(non_rating_idx)

        if n_neighbors == 0:   # 모든 item 의 유사도 사용, 가중치는 현 영화와 사용자가 평가한 영화의 유사도
            predicted_rating = np.dot(sim_scores, user_ratings) / sim_scores.sum()
        else:
            if len(sim_scores) > 1: #사용자가 평가한 영화 수가 2 개 이상인 경우만 계산
                neighbor_size = min(n_neighbors, len(sim_scores))
                sorted_item_idx = np.argsort(sim_scores)
                sim_scores = sim_scores.values[sorted_item_idx][-neighbor_size:]
                user_ratings = user_ratings.values[sorted_item_idx][-neighbor_size:]
                predicted_rating = np.dot(sim_scores, user_ratings) / sim_scores.sum()
    else:
        predicted_rating = 3.0
    return predicted_rating

# 정확도 계산
print("knn 감안하지 않은 simple CF - 모든 아이템 포함")
print(score(IBCF_knn_model, n_neighbors=0))
print("knn 감안한 CF - K 개의 가장 유사한 아이템만 포함")
print(score(IBCF_knn_model, n_neighbors=30))

knn 감안하지 않은 simple CF - 모든 아이템 포함
1.031842067616898
knn 감안한 CF - K 개의 가장 유사한 아이템만 포함
0.9667645371046725


### 특정 사용자에 대하여 영화 추천

In [47]:
# 사용자가 평가한 영화 ratings
user_id = 1
user_movie_ratings = rating_matrix.loc[user_id]
user_movie_ratings.head()

user_id
1    5.0
2    4.0
3    NaN
4    NaN
5    4.0
Name: 1, dtype: float64

In [48]:
# 현 사용자가 평가하지 않은 영화의 예상 평점 계산
movie_id = 8
user_movie_ratings.loc[movie_id] = IBCF_knn_model(user_id, movie_id, n_neighbors=30)
user_movie_ratings.head()

user_id
1    5.0
2    4.0
3    NaN
4    NaN
5    4.0
Name: 1, dtype: float64

영화를 예상 평점에 따라 정렬해서 제목을 뽑아서 돌려 줌

In [51]:
movie_ratings_sorted = user_movie_ratings.sort_values(ascending=False)[:5]
movie_ratings_sorted

user_id
1      5.0
416    5.0
343    5.0
381    5.0
388    5.0
Name: 1, dtype: float64

In [52]:
recom_movies = movies.loc[movie_ratings_sorted.index]
recommendations = recom_movies['title']
recommendations

user_id
1                                       GoldenEye (1995)
416                              Parent Trap, The (1961)
343                                  Apostle, The (1997)
381    Adventures of Priscilla, Queen of the Desert, ...
388                                  Black Beauty (1994)
Name: title, dtype: object

### 주어진 사용자에 대하여 추천 받는 함수

In [53]:
def recom_movie(user_id, n_items, neighbor_size=30):
    # 현 사용자가 평가한 영화 ratings 가져오기
    user_movie_ratings = rating_matrix.loc[user_id]
    
    for movie in rating_matrix:       
        if pd.notnull(user_movie_ratings.loc[movie]):
             # 현 사용자가 이미 평가한 영화는 제외 (평점을 0으로) 
            user_movie_ratings.loc[movie] = 0
        else:
            # 현 사용자가 평가하지 않은 영화의 예상 평점 계산
            user_movie_ratings.loc[movie] = IBCF_knn_model(user_id, movie, neighbor_size)
    # 영화를 예상 평점에 따라 정렬해서 제목을 뽑아서 돌려 줌
    movie_sort = user_movie_ratings.sort_values(ascending=False)[:n_items]
    recom_movies = movies.loc[movie_sort.index]
    recommendations = recom_movies['title']
    return recommendations

recom_movie(user_id=2, n_items=5, neighbor_size=30)

user_id
784          Only You (1994)
711           Tin Men (1987)
701         Barcelona (1994)
45            Exotica (1994)
647    Quiet Man, The (1952)
Name: title, dtype: object

### 최적의 neighbor size 구하기

In [54]:
for neighbor_size in [10, 20, 30, 40, 50, 60]:
    print(f"Neighbor size = {neighbor_size} : RMSE = {score(IBCF_knn_model, neighbor_size):.4f}")

Neighbor size = 10 : RMSE = 0.9480
Neighbor size = 20 : RMSE = 0.9516
Neighbor size = 30 : RMSE = 0.9635
Neighbor size = 40 : RMSE = 0.9745
Neighbor size = 50 : RMSE = 0.9833
Neighbor size = 60 : RMSE = 0.9906
