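## Collaborative Filtering with the Ratings of the K Most Similar Users (KNN)
- 💬 REMIND
    - Memory-based: a matrix-based recommendation approach that infers preference patterns from users' ratings or usage records and recommends directly from that stored data.

- KNN (K-Nearest Neighbors): finds the K nearest neighbors. Using the ratings users have given, it finds either similar users or similar items and recommends from them.
    - To remove bias (some users rate generously overall, others harshly), the compared users' ratings are shifted up or down so that they sit on the same scale.
    - The method is simple and intuitive, so it is easy to get started with.
    - The user-based variant is expensive in both computation time and memory.
    - Sparsity is a limitation: if none of the similar neighbors has rated an item, it cannot be recommended.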
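
The bias correction described in the list above can be sketched as follows: subtract each user's own mean rating before combining neighbors' ratings, then add the target user's mean back. This is only a minimal sketch on a hypothetical toy matrix. The names `toy`, `user_mean`, `centered`, and `neighbors` are illustrative assumptions, the neighbors are weighted equally for simplicity, and this step is not part of the `CF_knn` function defined later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: rows are users, columns are movies, NaN = not rated.
toy = pd.DataFrame(
    {"m1": [5.0, 2.0, 4.0], "m2": [4.0, 1.0, np.nan], "m3": [np.nan, 3.0, 5.0]},
    index=["generous", "harsh", "target"],
)

# Each user's mean over the movies they actually rated (NaN is skipped).
user_mean = toy.mean(axis=1)

# Mean-centered ratings: how far each rating deviates from that user's own average.
centered = toy.sub(user_mean, axis=0)

# Predict the target user's rating for m2 from the neighbors' deviations
# (equal weights here), then shift back onto the target user's own scale.
neighbors = ["generous", "harsh"]
predicted = user_mean["target"] + centered.loc[neighbors, "m2"].mean()
print(predicted)
```

Working on deviations rather than raw ratings means a 4 from a harsh rater and a 4 from a generous rater are no longer treated as the same signal.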


In [1]:
import numpy as np
import pandas as pd

### Data Load

In [2]:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('/Users/jun/Library/Mobile Documents/com~apple~CloudDocs/Github/ai _recommendation _system/data/u.user', sep='|', names=u_cols, encoding='latin-1')

i_cols = ['movie_id', 'title', 'release date', 'video release date', 'IMDB URL', 'unknown', 
          'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 
          'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 
          'Thriller', 'War', 'Western']
movies = pd.read_csv('/Users/jun/Library/Mobile Documents/com~apple~CloudDocs/Github/ai _recommendation _system/data/u.item', sep='|', names=i_cols, encoding='latin-1')

r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('/Users/jun/Library/Mobile Documents/com~apple~CloudDocs/Github/ai _recommendation _system/data/u.data', sep='\t', names=r_cols, encoding='latin-1')

In [3]:
# Drop the timestamp column from the ratings DataFrame
ratings = ratings.drop('timestamp', axis=1)
ratings.head()

,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


### Restructure the movie data

In [4]:
# Keep only the movie ID and title; drop every other column
movies = movies[['movie_id', 'title']]
movies = movies.set_index('movie_id')
movies

movie_id,title
1,Toy Story (1995)
2,GoldenEye (1995)
3,Four Rooms (1995)
4,Get Shorty (1995)
5,Copycat (1995)
...,...
1678,Mat' i syn (1997)
1679,B. Monkey (1998)
1680,Sliding Doors (1998)
1681,You So Crazy (1994)


### RMSE
- $RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2}$

In [5]:
# Function that computes the accuracy metric (RMSE)
def RMSE(y_true, y_pred):
    return np.sqrt(np.mean((np.array(y_true) - np.array(y_pred))**2))

### Compute each model's accuracy by comparing test-set predictions with the actual ratings

In [6]:
# Function that computes the RMSE of a given model
def score(model, neighbor_size=0):
    id_pairs = zip(x_test['user_id'], x_test['movie_id'])
    y_pred = np.array([model(user, movie, neighbor_size) for (user, movie) in id_pairs])
    y_true = np.array(x_test['rating'])
    return RMSE(y_true, y_pred)

### Train/test split
- Split the data into training and test sets at a fixed ratio per user by stratifying on user_id (`stratify=y`)

In [7]:
from sklearn.model_selection import train_test_split
x = ratings.copy()
y = ratings['user_id']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state= 42, stratify=y)
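
The effect of `stratify` can be checked on a small example. The sketch below uses a hypothetical toy DataFrame (`toy_ratings`, 4 users with 8 ratings each, all values made up) and assumes the same `test_size=0.25` as the cell above. It is an illustration, not a result from the MovieLens data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 4 users with 8 ratings each.
toy_ratings = pd.DataFrame({
    "user_id": [u for u in range(1, 5) for _ in range(8)],
    "movie_id": list(range(1, 9)) * 4,
    "rating": [3] * 32,
})

# Stratify on user_id so every user is split in the same 75/25 proportion.
toy_train, toy_test = train_test_split(
    toy_ratings, test_size=0.25, random_state=42, stratify=toy_ratings["user_id"]
)

# Number of test rows contributed by each user.
print(toy_test["user_id"].value_counts().sort_index())
```

Because every user appears in the training set, a similarity row exists for each test user when predictions are made.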

### Training-data matrix (users × movies, values = ratings)

In [8]:
# Build the full matrix from the training data
rating_matrix = x_train.pivot(index='user_id', columns='movie_id', values='rating')
rating_matrix.head()

user_id \ movie_id,1,2,3,4,5,6,7,8,9,10,...,1671,1672,1673,1674,1676,1677,1679,1680,1681,1682
1,5.0,3.0,4.0,,3.0,5.0,4.0,1.0,5.0,3.0,...,,,,,,,,,,
2,4.0,,,,,,,,,2.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,3.0,,,,,,,,,...,,,,,,,,,,


### Compute the similarity (cosine similarity) between all users in the training data

In [9]:
from sklearn.metrics.pairwise import cosine_similarity

matrix_dummy = rating_matrix.copy().fillna(0)
user_similarity = cosine_similarity(matrix_dummy, matrix_dummy)
user_similarity = pd.DataFrame(user_similarity, index=rating_matrix.index, columns=rating_matrix.index)
user_similarity

user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
1,1.000000,0.108361,0.046638,0.029577,0.245753,0.335853,0.344724,0.191582,0.057149,0.251979,...,0.257073,0.069412,0.231643,0.108093,0.176842,0.104799,0.232472,0.051528,0.129555,0.256333
2,0.108361,1.000000,0.057613,0.130237,0.054918,0.190552,0.079399,0.076146,0.167992,0.147376,...,0.136993,0.252887,0.255454,0.285193,0.232751,0.149088,0.102807,0.062386,0.109143,0.107686
3,0.046638,0.057613,1.000000,0.139805,0.000000,0.032485,0.043869,0.080968,0.022263,0.059925,...,0.027402,0.000000,0.175060,0.010343,0.105635,0.019052,0.127099,0.023917,0.060392,0.000000
4,0.029577,0.130237,0.139805,1.000000,0.000000,0.045190,0.088586,0.199526,0.135013,0.026919,...,0.055392,0.049773,0.076549,0.139382,0.113886,0.000000,0.130343,0.077357,0.157890,0.063911
5,0.245753,0.054918,0.000000,0.000000,1.000000,0.176443,0.281860,0.132205,0.038790,0.134200,...,0.183969,0.019305,0.073714,0.041807,0.081088,0.029743,0.188392,0.068342,0.055557,0.207259
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.104799,0.149088,0.019052,0.000000,0.029743,0.086464,0.075012,0.095736,0.000000,0.080883,...,0.061061,0.299811,0.158064,0.221251,0.323989,1.000000,0.047368,0.162173,0.058828,0.124548
940,0.232472,0.102807,0.127099,0.130343,0.188392,0.230566,0.270071,0.164157,0.131458,0.255758,...,0.195863,0.113346,0.144570,0.173568,0.139877,0.047368,1.000000,0.092911,0.199881,0.135868
941,0.051528,0.062386,0.023917,0.077357,0.068342,0.095478,0.020036,0.076269,0.106763,0.063461,...,0.021901,0.055348,0.226017,0.170493,0.249612,0.162173,0.092911,1.000000,0.072402,0.099200
942,0.129555,0.109143,0.060392,0.157890,0.055557,0.197307,0.236086,0.089871,0.089297,0.169309,...,0.111291,0.078263,0.051882,0.137759,0.069516,0.058828,0.199881,0.072402,1.000000,0.142812
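
For reference, the cosine similarity of two rating vectors $u$ and $v$ is $\frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. The sketch below computes it by hand with NumPy for two hypothetical users (`u` and `v` are made-up vectors, not rows from the data), with 0 standing in for unrated movies as in the `fillna(0)` step above.

```python
import numpy as np

# Hypothetical rating vectors for two users over four movies;
# 0 stands for "not rated", mirroring fillna(0).
u = np.array([5.0, 3.0, 0.0, 4.0])
v = np.array([4.0, 0.0, 0.0, 5.0])

# cosine(u, v) = (u . v) / (||u|| * ||v||)
cos_uv = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(round(cos_uv, 4))
```

Since ratings are non-negative, every similarity falls between 0 and 1, and a movie that only one of the two users rated adds nothing to the dot product while still increasing that user's norm.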


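## Computing Predictions with a Fixed Number of Neighbors K
- A function that predicts the rating a given user would give a given movie
- **Implementing the KNN (k-nearest neighbors) CF algorithm**
    - Copy and clean the similarity and rating data
        - `sim_scores`: a copy of the similarity scores between the given user and all other users
        - `movie_ratings`: a copy of all users' ratings for the given movie
        - `none_rating_idx`: the index of users who have not rated the movie
        - Drop the users who have not rated the movie from both `sim_scores` and `movie_ratings`
    - Predict the rating according to the neighbor size
        - When no neighbor size is specified (`neighbor_size == 0`)
            - Compute the weighted average over all users who rated the movie
        - When a neighbor size is specified
            - Among the users who rated the movie, compute a weighted average using **the ratings of the most similar users**, up to `neighbor_size` of them
            - `np.argsort()` orders the user positions by similarity, and only the `neighbor_size` users with the highest similarity are kept
            - The weighted average of the selected neighbors' ratings, weighted by their similarities, is then returned
    - In short, the prediction uses the ratings of highly similar users, and specifying a neighbor size restricts it to a fixed number of neighbors.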
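
The neighbor-selection step can be illustrated in isolation. This is a minimal sketch with hypothetical values (`sims`, `rates`, and `k` are made up for the example). It shows how `np.argsort` picks out the `k` most similar users while keeping similarities and ratings aligned.

```python
import numpy as np

# Hypothetical similarities and ratings of five users who rated the movie.
sims = np.array([0.10, 0.45, 0.30, 0.05, 0.60])
rates = np.array([2.0, 4.0, 5.0, 1.0, 3.0])
k = 3

# argsort returns the positions that would sort sims in ascending order,
# so the last k positions belong to the k most similar users.
order = np.argsort(sims)
top_sims = sims[order][-k:]
top_rates = rates[order][-k:]
print(top_sims, top_rates)
```

Indexing both arrays with the same `order` is what keeps each similarity paired with the rating from the same user.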

In [10]:
def CF_knn(user_id, movie_id, neighbor_size=0):
    if movie_id in rating_matrix:
        # (1) Similarities between the current user and every other user
        sim_scores = user_similarity[user_id].copy()
        # Ratings of every user for the current movie
        movie_ratings = rating_matrix[movie_id].copy()
        # Index of users who have not rated the current movie
        none_rating_idx = movie_ratings[movie_ratings.isnull()].index
        # Drop the missing (null) ratings so that only users who rated the movie remain
        movie_ratings = movie_ratings.drop(none_rating_idx)
        # Drop the similarities of those same users so both Series stay aligned
        sim_scores = sim_scores.drop(none_rating_idx)

        # (2) Neighbor size not specified
        if neighbor_size == 0:
            # Weighted average over all users who rated the movie
            if sim_scores.sum() == 0:    # every remaining similarity can be 0
                mean_rating = 3.0
            else:
                mean_rating = np.dot(sim_scores, movie_ratings) / sim_scores.sum()

        # (3) Neighbor size specified
        else:
            # Only compute when at least 2 users rated the movie
            if len(sim_scores) > 1:
                # Use the smaller of the requested neighbor size and the number of raters
                neighbor_size = min(neighbor_size, len(sim_scores))
                # Convert to arrays so the positions returned by argsort can be used
                sim_scores = np.array(sim_scores)
                movie_ratings = np.array(movie_ratings)
                # argsort returns the positions that sort the similarities in ascending order
                user_idx = np.argsort(sim_scores)
                # Keep the similarities and ratings of the neighbor_size most similar users
                sim_scores = sim_scores[user_idx][-neighbor_size:]
                movie_ratings = movie_ratings[user_idx][-neighbor_size:]
                if sim_scores.sum() == 0:    # every selected similarity can be 0
                    mean_rating = 3.0
                else:
                    mean_rating = np.dot(sim_scores, movie_ratings) / sim_scores.sum()
            else:
                # Too few raters: fall back to the default rating of 3.0
                mean_rating = 3.0
    else:
        # Movie not in the training matrix: fall back to the default rating of 3.0
        mean_rating = 3.0
    return mean_rating

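- **Weighted average**: `mean_rating = np.dot(sim_scores, movie_ratings) / sim_scores.sum()`
    - Predicts the rating for a given movie by using the similarities between users as weights.
    - Each neighbor's rating for the movie is weighted by that neighbor's similarity to the target user, so more similar users influence the prediction more.
    - Definition: an average in which each data point is assigned its own importance, or weight
        - Unlike a simple average, the influence of each value on the result is scaled by its importance or frequency.
        - $\text{Weighted Average} = \frac{w_1x_1 + w_2x_2 + \cdots + w_nx_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$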
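
A small numeric example may make the formula concrete. The values below are hypothetical (`sims` as weights, `ratings_for_movie` as the neighbors' ratings) and are chosen only to show how the weighted average differs from the simple mean.

```python
import numpy as np

# Hypothetical similarities (weights) and ratings from three neighbors.
sims = np.array([0.6, 0.3, 0.1])
ratings_for_movie = np.array([5.0, 3.0, 1.0])

# Weighted average: sum(w_i * x_i) / sum(w_i)
weighted = np.dot(sims, ratings_for_movie) / sims.sum()
# Simple average for comparison: every neighbor counts equally.
simple = ratings_for_movie.mean()
print(round(weighted, 2), simple)
```

The most similar neighbor gave the highest rating, so the weighted result is pulled above the simple mean.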

In [11]:
score(CF_knn)

np.float64(1.0237210431087944)

In [12]:
score(CF_knn, neighbor_size=30)

np.float64(1.0169668415833513)

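- RMSE dropped from 1.0237 (all raters) to 1.0170 with `neighbor_size=30`, so restricting the prediction to the most similar neighbors gives a slightly better result here.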

---
### Recommendations from the full dataset

In [13]:
rating_matrix = ratings.pivot_table(values='rating', index='user_id', columns='movie_id')

from sklearn.metrics.pairwise import cosine_similarity
matrix_dummy = rating_matrix.copy().fillna(0)
user_similarity = cosine_similarity(matrix_dummy, matrix_dummy)
user_similarity = pd.DataFrame(user_similarity, index=rating_matrix.index, columns=rating_matrix.index)

In [14]:
def recommender(user, n_items=10, neighbor_size=20):
    # Predict the current user's rating for every item they have not rated yet
    predictions = []
    rated_index = rating_matrix.loc[user][rating_matrix.loc[user] > 0].index    # movies already rated
    items = rating_matrix.loc[user].drop(rated_index)
    for item in items.index:
        predictions.append(CF_knn(user, item, neighbor_size))                   # predicted rating
    recommendations = pd.Series(data=predictions, index=items.index, dtype=float)
    recommendations = recommendations.sort_values(ascending=False)[:n_items]    # movies with the highest predicted ratings
    recommended_items = movies.loc[recommendations.index]['title']
    return recommended_items

In [15]:
recommender(user=2, n_items=5, neighbor_size=30) 

movie_id
1500               Santa with Muscles (1996)
1189                      Prefontaine (1997)
1293                         Star Kid (1997)
1467    Saint of Fort Washington, The (1993)
318                  Schindler's List (1993)
Name: title, dtype: object

### Finding the best neighbor size

In [16]:
# Build the full matrix and the cosine similarity from the training set
rating_matrix = x_train.pivot_table(values='rating', index='user_id', columns='movie_id')

matrix_dummy = rating_matrix.copy().fillna(0)
user_similarity = cosine_similarity(matrix_dummy, matrix_dummy)
user_similarity = pd.DataFrame(user_similarity, index=rating_matrix.index, columns=rating_matrix.index)

for neighbor_size in range(10, 100, 5):
    print("Neighbor size = %d : RMSE = %.4f" % (neighbor_size, score(CF_knn, neighbor_size)))


Neighbor size = 10 : RMSE = 1.0323
Neighbor size = 15 : RMSE = 1.0231
Neighbor size = 20 : RMSE = 1.0192
Neighbor size = 25 : RMSE = 1.0184
Neighbor size = 30 : RMSE = 1.0170
Neighbor size = 35 : RMSE = 1.0165
Neighbor size = 40 : RMSE = 1.0165
Neighbor size = 45 : RMSE = 1.0164
Neighbor size = 50 : RMSE = 1.0167
Neighbor size = 55 : RMSE = 1.0169
Neighbor size = 60 : RMSE = 1.0173
Neighbor size = 65 : RMSE = 1.0178
Neighbor size = 70 : RMSE = 1.0182
Neighbor size = 75 : RMSE = 1.0185
Neighbor size = 80 : RMSE = 1.0188
Neighbor size = 85 : RMSE = 1.0192
Neighbor size = 90 : RMSE = 1.0196
Neighbor size = 95 : RMSE = 1.0199


- RMSE is lowest at a neighbor size of 45 (1.0164); sizes 35 and 40 are almost identical at 1.0165.
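
Rather than reading the minimum off the printed list, it can be picked programmatically. The sketch below copies the RMSE values from the output above into a dictionary and selects the smallest. The names `rmse_by_k` and `best_k` are illustrative assumptions; in the notebook the values could instead be collected inside the loop.

```python
# RMSE values copied from the printed output above, keyed by neighbor size.
rmse_by_k = {
    10: 1.0323, 15: 1.0231, 20: 1.0192, 25: 1.0184, 30: 1.0170, 35: 1.0165,
    40: 1.0165, 45: 1.0164, 50: 1.0167, 55: 1.0169, 60: 1.0173, 65: 1.0178,
    70: 1.0182, 75: 1.0185, 80: 1.0188, 85: 1.0192, 90: 1.0196, 95: 1.0199,
}

# Neighbor size with the smallest RMSE.
best_k = min(rmse_by_k, key=rmse_by_k.get)
print(best_k, rmse_by_k[best_k])
```

The differences between sizes 35 and 50 are in the fourth decimal place, so any value in that range performs about the same on this split.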