# Surprise

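- A library for building recommender systems
- scikit-learn does not provide recommender functionality, so Surprise is used instead
- Provides dataset loading, train/test splitting, prediction, and many other features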
- http://surpriselib.com/


In [3]:
from surprise import SVD, Dataset, accuracy
from surprise.model_selection import train_test_split

In [5]:
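# Load the built-in MovieLens 100k dataset (downloaded on first use)
data = Dataset.load_builtin('ml-100k')
# Hold out 25% of the ratings as the test set
trainset, testset = train_test_split(data, test_size=.25, random_state=0)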

Dataset ml-100k could not be found. Do you want to download it? [Y/n] 

 Y


Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /Users/hakchangs/.surprise_data/ml-100k


In [6]:
algo = SVD()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7feae8938c10>

In [7]:
predictions = algo.test(testset)
print('prediction type :', type(predictions), ' size:', len(predictions))
print('First 5 predictions:')
predictions[:5]

prediction type : <class 'list'>  size: 25000
First 5 predictions:


[Prediction(uid='120', iid='282', r_ui=4.0, est=3.669898280967229, details={'was_impossible': False}),
 Prediction(uid='882', iid='291', r_ui=4.0, est=3.8437213891480853, details={'was_impossible': False}),
 Prediction(uid='535', iid='507', r_ui=5.0, est=4.169159315299788, details={'was_impossible': False}),
 Prediction(uid='697', iid='244', r_ui=5.0, est=3.7262438152300152, details={'was_impossible': False}),
 Prediction(uid='751', iid='385', r_ui=4.0, est=3.2496170497209254, details={'was_impossible': False})]

In [8]:
[(pred.uid, pred.iid, pred.est) for pred in predictions[:3]]

[('120', '282', 3.669898280967229),
 ('882', '291', 3.8437213891480853),
 ('535', '507', 4.169159315299788)]

In [9]:
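# predict() takes raw ids; ids loaded from the ml-100k files are strings
uid = str(196)
iid = str(302)
# No true rating is passed, so r_ui is None in the result
pred = algo.predict(uid, iid)
print(pred)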

user: 196        item: 302        r_ui = None   est = 3.80   {'was_impossible': False}


In [10]:
accuracy.rmse(predictions)

RMSE: 0.9479


0.9479113211056617
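
For reference, the RMSE reported above is the square root of the mean squared difference between each true rating (`r_ui`) and its estimate (`est`). The following is a minimal sketch of that calculation in plain Python. `Pred` is a hypothetical stand-in for Surprise's `Prediction` tuple, and the sample values are the first three predictions shown earlier with estimates rounded to two decimals, so the result is not the test-set RMSE.

```python
import math
from collections import namedtuple

# Hypothetical stand-in for surprise.Prediction, keeping only the fields used here.
Pred = namedtuple("Pred", ["uid", "iid", "r_ui", "est"])


def rmse(predictions):
    """Root mean squared error between true ratings (r_ui) and estimates (est)."""
    if not predictions:
        raise ValueError("predictions is empty")
    squared_errors = [(p.r_ui - p.est) ** 2 for p in predictions]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# First three predictions from the output above, estimates rounded for readability.
sample = [
    Pred("120", "282", 4.0, 3.67),
    Pred("882", "291", 4.0, 3.84),
    Pred("535", "507", 5.0, 4.17),
]
print(f"RMSE on the sample: {rmse(sample):.4f}")
```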

In [11]:
import pandas as pd

In [12]:
ratings = pd.read_csv('../datasets/movielens/ratings.csv')
# Write a copy without the header row (and without the index) for Dataset.load_from_file
ratings.to_csv('../datasets/movielens/ratings_noh.csv', index=False, header=False)

In [13]:
from surprise import Reader

In [14]:
# line_format names the fields in file order; rating_scale is the (min, max) rating
reader = Reader(line_format='user item rating timestamp', sep=',', rating_scale=(0.5, 5))
data = Dataset.load_from_file('../datasets/movielens/ratings_noh.csv', reader=reader)

In [15]:
trainset, testset = train_test_split(data, test_size=.25, random_state=0)

algo = SVD(n_factors=50, random_state=0)
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

RMSE: 0.8682


0.8681952927143516

In [16]:
import pandas as pd
from surprise import Reader, Dataset

In [17]:
ratings = pd.read_csv('../datasets/movielens/ratings.csv')
reader = Reader(rating_scale=(0.5, 5.0))

# load_from_df expects the columns in user, item, rating order
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
trainset, testset = train_test_split(data, test_size=.25, random_state=0)

algo = SVD(n_factors=50, random_state=0)
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

RMSE: 0.8682


0.8681952927143516

### Baseline Rating Reflecting User Tendencies

- $\hat{r}_{ui} = b_{ui} = \mu + b_u + b_i$
- Global mean rating over all users ($\mu$) + user bias ($b_u$) + item bias ($b_i$)
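
To make the formula concrete, here is a small sketch that computes $\mu$, $b_u$, and $b_i$ on toy data in plain Python. It rests on simplifying assumptions: the ratings are made up, and the biases are plain unregularized averages (user bias first, then item bias on the residual). Surprise's own baseline estimation fits regularized biases, so its numbers will generally differ.

```python
from collections import defaultdict

# Toy (user, item, rating) triples; hypothetical data, not taken from MovieLens.
ratings = [
    ("u1", "i1", 5.0), ("u1", "i2", 4.0),
    ("u2", "i1", 3.0), ("u2", "i2", 2.0),
    ("u3", "i1", 4.0),
]

# mu: global mean rating
mu = sum(r for _, _, r in ratings) / len(ratings)

# b_u: average deviation of each user's ratings from mu
user_dev = defaultdict(list)
for u, _, r in ratings:
    user_dev[u].append(r - mu)
b_u = {u: sum(d) / len(d) for u, d in user_dev.items()}

# b_i: average residual per item after removing mu and the user bias
item_dev = defaultdict(list)
for u, i, r in ratings:
    item_dev[i].append(r - mu - b_u[u])
b_i = {i: sum(d) / len(d) for i, d in item_dev.items()}


def baseline(u, i):
    """b_ui = mu + b_u + b_i; unknown users or items contribute a zero bias."""
    return mu + b_u.get(u, 0.0) + b_i.get(i, 0.0)


# u3 never rated i2, so the baseline combines u3's bias with i2's bias.
print(f"mu={mu:.2f}  b_u[u3]={b_u['u3']:.2f}  b_i[i2]={b_i['i2']:.2f}")
print(f"baseline(u3, i2) = {baseline('u3', 'i2'):.2f}")
```

In this toy data u3 rates slightly above average and i2 is rated below average, so the baseline for the unseen pair lands a little under the global mean.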


In [18]:
from surprise.model_selection import cross_validate

In [19]:
ratings = pd.read_csv('../datasets/movielens/ratings.csv')
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

algo = SVD(random_state=0)
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8840  0.8643  0.8711  0.8717  0.8690  0.8720  0.0065  
MAE (testset)     0.6790  0.6649  0.6682  0.6693  0.6680  0.6699  0.0048  
Fit time          3.87    4.02    3.69    3.83    3.80    3.84    0.11    
Test time         0.08    0.09    0.07    0.08    0.07    0.08    0.01    


{'test_rmse': array([0.88399859, 0.86426585, 0.87110413, 0.87166934, 0.86900528]),
 'test_mae': array([0.67900358, 0.66490739, 0.66823221, 0.66930204, 0.6680453 ]),
 'fit_time': (3.8739969730377197,
  4.02209210395813,
  3.691829204559326,
  3.830047130584717,
  3.799591302871704),
 'test_time': (0.07763314247131348,
  0.09044766426086426,
  0.07229113578796387,
  0.08048796653747559,
  0.07348990440368652)}

In [20]:
from surprise.model_selection import GridSearchCV

In [21]:
param_grid = {
    'n_epochs': [20, 40, 60],
    'n_factors': [50, 100, 200]
}

gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

print(gs.best_score['rmse'])
print(gs.best_params['rmse'])

0.8776674347696437
{'n_epochs': 20, 'n_factors': 50}
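
To see how much work the grid search above involves, the following sketch enumerates the parameter combinations with `itertools.product`. It only illustrates the size of the search space (every combination is evaluated on every fold) and does not reproduce how `GridSearchCV` is implemented internally. The names `combos` and `names` are introduced here for illustration.

```python
from itertools import product

param_grid = {
    'n_epochs': [20, 40, 60],
    'n_factors': [50, 100, 200]
}
cv = 3  # number of folds used in the grid search above

# Cartesian product of all parameter values, one dict per combination.
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combos), "combinations,", len(combos) * cv, "model fits in total")
for combo in combos[:3]:
    print(combo)
```

With 3 values for each of 2 parameters and 3 folds, the search trains 27 models, which is worth keeping in mind before enlarging the grid.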
