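Problems with user-based and item-based collaborative filtering  
- **Scalability**: computations over a large matrix are expensive
    - The item-based approach can require less computation
    - Spark makes large matrix computations feasible
- **Sparse data**
    - Many users do not leave enough reviews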
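
To make the sparsity problem concrete, here is a minimal sketch that measures what fraction of a user-item matrix actually holds a rating. The `matrix_density` helper and the toy `(user, item, rating)` triples are made up for illustration and are not part of the original notebook.

```python
def matrix_density(ratings):
    """Return the fraction of user-item cells that hold a rating."""
    users = {u for u, _, _ in ratings}
    items = {i for _, i, _ in ratings}
    observed = {(u, i) for u, i, _ in ratings}
    return len(observed) / (len(users) * len(items))


# Hypothetical toy data: (user, item, rating)
toy_ratings = [
    ("u1", "m1", 4.0),
    ("u1", "m2", 3.0),
    ("u2", "m3", 5.0),
    ("u3", "m1", 2.5),
]

density = matrix_density(toy_ratings)
print(f"density = {density:.3f}, sparsity = {1 - density:.3f}")
```

On real rating data the density is typically far lower than in this toy example, which is why neighborhood-based similarity estimates become unreliable.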

## Solution -> model-based collaborative filtering
Use machine learning to predict ratings; the input is the user-item rating matrix  
- Matrix factorization (SVD)
- Deep learning

## SVD
Reframes collaborative filtering as the problem of filling in the user-item rating matrix  
Simplifies the problem by reducing users or items to a small number of dimensions

- PCA
- SVD or SVDpp


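### PCA
Reduces the number of dimensions while preserving as much of the information as possible  
Every cell of the matrix must have a value, so null values have to be handled first  
It is hard to tell on what basis the reduced dimensions were computed, so they are difficult to interpret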

### SVD
Simplifies the matrix into a product of two or three smaller matrices  
Every cell of the matrix must have a value
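
As a rough illustration of the "product of smaller matrices" idea, the sketch below decomposes a small dense matrix with NumPy and rebuilds a rank-2 approximation from it. The matrix `R` and the choice `k = 2` are invented for this example; note that the matrix has no missing cells, which is exactly the requirement stated above.

```python
import numpy as np

# Hypothetical dense 4x3 rating matrix (rows: users, columns: items).
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

# Full decomposition: R = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep only the k largest singular values to get a low-rank approximation.
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("singular values:", np.round(s, 3))
print("rank-2 approximation:")
print(np.round(R_k, 2))
```

Keeping fewer singular values gives a more compact representation at the cost of a larger reconstruction error.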

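### SVD++
An algorithm that, given a sparse matrix, learns how to fill in the empty cells  
It is trained with SGD so that its predictions for the filled cells match the known values as closely as possible  
It is usually trained to minimize RMSE  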
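
The idea of learning from the filled cells only can be sketched in plain Python as follows. This is a simplified matrix factorization trained with SGD, under stated assumptions: it leaves out the bias terms and the implicit-feedback term that SVD++ adds, it is not Surprise's implementation, and the function names, hyperparameters, and toy ratings are all made up for illustration.

```python
import random


def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_ui ~ p_u . q_i by SGD, visiting only the observed ratings."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    P = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    Q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - sum(pf * qf for pf, qf in zip(P[u], Q[i]))
            for f in range(n_factors):
                pf, qf = P[u][f], Q[i][f]
                # Gradient step on the squared error with L2 regularization.
                P[u][f] += lr * (err * qf - reg * pf)
                Q[i][f] += lr * (err * pf - reg * qf)
    return P, Q


def predict(P, Q, u, i):
    """Predicted rating: dot product of the user and item factors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def rmse(P, Q, ratings):
    """Root mean squared error over the given ratings."""
    se = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5


# Hypothetical sparse data: 8 of the 12 user-item cells are filled.
toy_ratings = [
    ("u1", "m1", 5.0), ("u1", "m2", 4.0),
    ("u2", "m1", 4.0), ("u2", "m3", 1.0),
    ("u3", "m2", 1.0), ("u3", "m3", 5.0),
    ("u4", "m1", 2.0), ("u4", "m3", 4.0),
]

P0, Q0 = train_mf(toy_ratings, n_epochs=0)  # untrained factors
P, Q = train_mf(toy_ratings)
rmse_before = rmse(P0, Q0, toy_ratings)
rmse_after = rmse(P, Q, toy_ratings)
print("RMSE before training:", round(rmse_before, 3))
print("RMSE after training: ", round(rmse_after, 3))
print("prediction for empty cell (u1, m3):", round(predict(P, Q, "u1", "m3"), 2))
```

Once the factors are learned, any empty cell can be filled in by taking the dot product of the corresponding user and item vectors.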

**Model evaluation**  
Split the data into train and test sets and check accuracy  
Use cross-validation
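
The cross-validation idea can be sketched without any library: every record is used for testing exactly once and for training in the remaining folds. The helpers `k_fold_splits` and `rmse_of_mean_baseline`, the toy records, and the mean-rating baseline are assumptions made for illustration; the notebook itself relies on Surprise for this step.

```python
import random


def k_fold_splits(records, k=3, seed=0):
    """Yield (train, test) lists; every record lands in exactly one test fold."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [rec for j, fold in enumerate(folds) if j != i for rec in fold]
        yield train, test


def rmse_of_mean_baseline(train, test):
    """Score a baseline that always predicts the training-set mean rating."""
    mean = sum(r for _, _, r in train) / len(train)
    return (sum((r - mean) ** 2 for _, _, r in test) / len(test)) ** 0.5


# Hypothetical (user, item, rating) records.
toy = [("u%d" % (n % 4), "m%d" % (n % 5), 1.0 + n % 5) for n in range(20)]

scores = [rmse_of_mean_baseline(tr, te) for tr, te in k_fold_splits(toy, k=3)]
print("per-fold RMSE:", [round(s, 3) for s in scores])
print("mean RMSE:", round(sum(scores) / len(scores), 3))
```

Averaging the per-fold scores gives a steadier estimate than a single train/test split, because no single lucky or unlucky split dominates the result.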

In [1]:
# Install the module
!pip install surprise

Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise (from surprise)
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.3 surprise-0.1


In [2]:
!wget "https://grepp-reco-test.s3.ap-northeast-2.amazonaws.com/movielens/movies.csv"
!wget "https://grepp-reco-test.s3.ap-northeast-2.amazonaws.com/movielens/ratings.csv"

--2023-08-18 02:42:51--  https://grepp-reco-test.s3.ap-northeast-2.amazonaws.com/movielens/movies.csv
Resolving grepp-reco-test.s3.ap-northeast-2.amazonaws.com (grepp-reco-test.s3.ap-northeast-2.amazonaws.com)... 52.219.144.66, 3.5.145.124, 52.219.204.34, ...
Connecting to grepp-reco-test.s3.ap-northeast-2.amazonaws.com (grepp-reco-test.s3.ap-northeast-2.amazonaws.com)|52.219.144.66|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 458390 (448K) [text/csv]
Saving to: ‘movies.csv’


2023-08-18 02:42:53 (639 KB/s) - ‘movies.csv’ saved [458390/458390]

--2023-08-18 02:42:53--  https://grepp-reco-test.s3.ap-northeast-2.amazonaws.com/movielens/ratings.csv
Resolving grepp-reco-test.s3.ap-northeast-2.amazonaws.com (grepp-reco-test.s3.ap-northeast-2.amazonaws.com)... 52.219.144.66, 3.5.145.124, 52.219.204.34, ...
Connecting to grepp-reco-test.s3.ap-northeast-2.amazonaws.com (grepp-reco-test.s3.ap-northeast-2.amazonaws.com)|52.219.144.66|:443... connected.
HTTP request s

In [3]:
import numpy as np
import pandas as pd

In [4]:
movies = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")

In [5]:
movies.head()

|   | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [6]:
ratings.head()

|   | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 1260759144 |
| 1 | 1 | 1029 | 3.0 | 1260759179 |
| 2 | 1 | 1061 | 3.0 | 1260759182 |
| 3 | 1 | 1129 | 2.0 | 1260759185 |
| 4 | 1 | 1172 | 4.0 | 1260759205 |


In [None]:
# The per-user movie ratings show how sparse the data is
itemRatings = ratings.pivot_table(index=['userId'], columns=['movieId'], values='rating')
itemRatings.head()

| userId \ movieId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 161084 | 161155 | 161594 | 161830 | 161918 | 161944 | 162376 | 162542 | 162672 | 163949 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


In [None]:
movies.shape, ratings.shape

((9125, 3), (100004, 4))

In [7]:
def get_movie_name(movie_ratings, movie_id):
    return movie_ratings[movie_ratings["movieId"] == movie_id][["title", "genres"]].values[0]

def get_movie_id(movie_ratings, movie_name):
    return movie_ratings[movie_ratings["title"] == movie_name][["movieId", "genres"]].values[0]

In [None]:
from collections import defaultdict

from surprise import Dataset, Reader

reader = Reader(line_format="user item rating timestamp", sep=",", skip_lines=1)
data = Dataset.load_from_file('ratings.csv', reader=reader)

In [None]:
from surprise import SVD, NormalPredictor
from surprise.model_selection import GridSearchCV


param_grid = {
    'n_epochs': [20, 30],
    'lr_all': [0.005, 0.010],
    'n_factors': [50, 100]      # number of latent factors (size of the diagonal matrix in SVD)
}

# 3-fold cross-validation, scored with two error measures: RMSE and MAE
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

In [None]:
# RMSE
print(f"Best RMSE score attained: {gs.best_score['rmse']}")
print(f"Best RMSE params: {gs.best_params['rmse']}")

Best RMSE score attained: 0.8994460670194662
Best RMSE params: {'n_epochs': 20, 'lr_all': 0.005, 'n_factors': 50}


In [None]:
# MAE
print(f"Best MAE score attained: {gs.best_score['mae']}")
print(f"Best MAE params: {gs.best_params['mae']}")

Best MAE score attained: 0.6932052642643666
Best MAE params: {'n_epochs': 30, 'lr_all': 0.005, 'n_factors': 50}


In [None]:
# Train the model with the best-performing parameters and predict
svd = gs.best_estimator['rmse']
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f58846efc40>

In [None]:
uid = str(196)
iid = str(302)

pred = svd.predict(uid, iid, verbose=True)  # actual label: 4

user: 196        item: 302        r_ui = None   est = 3.89   {'was_impossible': False}


In [None]:
# Train and evaluate with a train/test split
from surprise import accuracy
from surprise.model_selection import train_test_split

trainset, testset = train_test_split(data, test_size=.25)

svd = SVD()
svd.fit(trainset)
predictions = svd.test(testset)

accuracy.rmse(predictions)

RMSE: 0.9010


0.901024346062311

In [None]:
testset[0:10]

[('99', '1947', 4.0),
 ('306', '1955', 5.0),
 ('77', '44191', 4.5),
 ('201', '2420', 3.0),
 ('253', '2115', 2.5),
 ('162', '650', 3.0),
 ('55', '104', 5.0),
 ('353', '21', 3.0),
 ('110', '485', 4.0),
 ('481', '8645', 3.5)]

In [None]:
pred = svd.predict("358", "367", verbose=True)

user: 358        item: 367        r_ui = None   est = 2.87   {'was_impossible': False}


In [None]:
pred = svd.predict("41", "7817", verbose=True)

user: 41         item: 7817       r_ui = None   est = 3.93   {'was_impossible': False}


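## Top-N recommendation accuracy
- For model-based recommendation, assuming N=10, repeat the following cycle for each user
    - Find all of the user's records in the rating data
    - Hold out one rating record and add it to the LeaveOneOut test set
    - Remaining records -> training set
- Train the model on the resulting training set
- Predict ratings for the records not used in training (build_anti_testset)
    - Every (user ID, item ID) pair that has no rating goes into this test set
- For each user, take the highly rated items in the test set, compute the fraction of them that appear in the recommended Top 10, then average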
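
The last step above can be sketched as a small function. This is a simplified illustration, not the notebook's exact code: `hit_rate`, the `top_n` dictionary, and the `left_out` records are hypothetical. The strict `>` comparison against the cutoff mirrors the notebook's own hit-rate cell.

```python
def hit_rate(top_n, left_out, rating_cutoff=4.0):
    """Share of held-out, highly rated items that show up in the user's top-N list."""
    hits = 0
    total = 0
    for user, item, true_rating in left_out:
        if not true_rating > rating_cutoff:
            continue  # only held-out items the user rated above the cutoff count
        total += 1
        recommended = {iid for iid, _ in top_n.get(user, [])}
        if item in recommended:
            hits += 1
    return hits / total if total else 0.0


# Hypothetical top-N lists: user -> [(item, predicted rating), ...]
top_n = {
    "u1": [("m3", 4.8), ("m7", 4.5)],
    "u2": [("m1", 4.9), ("m2", 4.1)],
    "u3": [("m5", 4.2)],
}
# Hypothetical leave-one-out test set: (user, held-out item, true rating)
left_out = [("u1", "m7", 5.0), ("u2", "m9", 4.5), ("u3", "m5", 3.0)]

print(hit_rate(top_n, left_out))
```

In this toy data two held-out ratings exceed the cutoff and one of them appears in its user's list, so the hit rate is 0.5.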

## NDCG
Weights each of the N recommended items according to the position at which it was recommended
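
A minimal sketch of the position weighting, assuming the common formulation in which the item at rank `r` is discounted by `log2(r + 1)` and the score is normalized by the best possible ordering. The notes do not specify a formula, so the discount and the function names `dcg` and `ndcg_at_n` are assumptions for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: later positions get smaller weights."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg_at_n(relevances, n=10):
    """DCG of the list as ranked, divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True)[:n])
    return dcg(relevances[:n]) / ideal if ideal > 0 else 0.0


# Relevance of each recommended item, in the order it was recommended.
print(ndcg_at_n([3, 2, 1]))  # already in the ideal order
print(ndcg_at_n([1, 2, 3]))  # best item recommended last
```

A list that places the most relevant items first scores 1.0, and the score drops as relevant items are pushed further down the list.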

In [None]:
# Install the module
!pip install surprise


In [None]:
!wget "https://grepp-reco-test.s3.ap-northeast-2.amazonaws.com/movielens/movies.csv"
!wget "https://grepp-reco-test.s3.ap-northeast-2.amazonaws.com/movielens/ratings.csv"


In [8]:
import numpy as np
import pandas as pd

In [9]:
movies = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")

In [10]:
def get_movie_name(movie_ratings, movie_id):
    return movie_ratings[movie_ratings["movieId"] == movie_id][["title", "genres"]].values[0]

def get_movie_id(movie_ratings, movie_name):
    return movie_ratings[movie_ratings["title"] == movie_name][["movieId", "genres"]].values[0]

In [11]:
from collections import defaultdict

from surprise import Dataset, Reader

reader = Reader(line_format="user item rating timestamp", sep=",", skip_lines=1)
data = Dataset.load_from_file('ratings.csv', reader=reader)

In [14]:
from surprise import SVD, NormalPredictor
from surprise.model_selection import GridSearchCV

In [22]:
# Build the top-n recommended items for each user (n=10 by default)
# `predictions` is the list returned by the SVD model's test() method
def get_top_n(predictions, n=10):
    top_n = defaultdict(list)

    # Each prediction is (user id, item id, true rating, estimated rating, details)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # For each user, keep the n items with the highest predicted rating
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [16]:
# Compute Top-N accuracy
from surprise.model_selection import LeaveOneOut

LOOCV = LeaveOneOut(n_splits=1, random_state=1)
ratingCutoff = 4.0

# With n_splits=1 the splitter yields a single (trainSet, testSet) pair
trainSet, testSet = next(LOOCV.split(data))

In [17]:
len(testSet)

671

In [18]:
svd = SVD()
svd.fit(trainSet)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7faf44c2e3b0>

In [19]:
bigTestSet = trainSet.build_anti_testset()
print(len(bigTestSet))

5975230


In [20]:
allPredictions = svd.test(bigTestSet)

In [23]:
topNPredicted = get_top_n(allPredictions, n=20)

for tnp in topNPredicted['1']:
    print(tnp)

('318', 3.7414861563241324)
('858', 3.7081088254637504)
('50', 3.685493357405708)
('969', 3.670341687981738)
('6016', 3.6422463496651267)
('1945', 3.6420167976286932)
('926', 3.631352008734992)
('356', 3.6252746143137333)
('1704', 3.597113813370548)
('1198', 3.576683952870792)
('7502', 3.572909492705307)
('48780', 3.5610898736125924)
('904', 3.5577765025758725)
('7153', 3.5519739764023286)
('1221', 3.5516687935810625)
('593', 3.548962545501422)
('908', 3.5476480043890897)
('1203', 3.543564846630735)
('1196', 3.5398301004444446)
('913', 3.5394226832224405)


In [24]:
hits = 0
total = 0

for userId, leftOutMovieId, trueRating in testSet:
    print(userId, leftOutMovieId, trueRating)
    if trueRating > ratingCutoff:
        for movieId, predictedRating in topNPredicted[userId]:
            if leftOutMovieId == movieId:
                hits += 1
                break
        total += 1

print(hits / float(total) * 100.)

1 1263 2.0
2 165 3.0
3 377 2.5
4 2114 5.0
5 5679 4.5
6 903 4.0
7 1371 3.0
8 2918 5.0
9 1358 4.0
10 152 4.0
11 2042 3.5
12 1028 1.0
13 54286 3.5
14 2628 3.0
15 60295 3.5
16 4718 4.5
17 6503 0.5
18 748 3.0
19 1374 3.0
20 8493 3.5
21 1302 4.0
22 4239 2.0
23 3310 5.0
24 786 4.0
25 260 4.0
26 8784 2.0
27 2858 5.0
28 924 4.0
29 2571 5.0
30 2114 4.0
31 372 3.5
32 317 3.0
33 2793 2.0
34 21 4.0
35 24 2.5
36 381 3.0
37 3357 4.0
38 364 4.0
39 188 2.0
40 33794 4.0
41 2877 4.0
42 48394 4.0
43 3869 1.0
44 95 3.0
45 1748 4.5
46 1270 5.0
47 95 3.0
48 63808 3.5
49 173 2.0
50 95 3.0
51 2881 5.0
52 7160 4.5
53 1772 4.0
54 44199 5.0
55 494 3.0
56 38038 4.0
57 357 5.0
58 3510 5.0
59 5349 3.0
60 2115 3.5
61 8464 3.0
62 106920 4.0
63 4846 5.0
64 266 5.0
65 5060 5.0
66 1270 4.0
67 198 4.0
68 380 4.0
69 4896 4.5
70 88 3.0
71 1259 4.0
72 81591 3.5
73 7845 3.0
74 1302 4.0
75 8957 4.5
76 45 5.0
77 7147 4.0
78 3949 5.0
79 5952 3.0
80 610 3.0
81 1213 3.0
82 150 4.0
83 3160 0.5
84 34162 3.5
85 370 2.0
86 178 5.0
87 