## Surprise 패키지
추천 시스템 구축을 돕는 패키지
사이킷런과 유사한 API 명으로 작성

In [1]:
from surprise import SVD, Dataset, accuracy
from surprise.model_selection import train_test_split

In [2]:
data = Dataset.load_builtin("ml-100k")
trainset, testset = train_test_split(data, test_size=.25, random_state=0)

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Trying to download dataset from https://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /Users/jinjae/.surprise_data/ml-100k


In [3]:
algo = SVD(random_state=0)
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x15f949dc0>

생성한 알고리즘 기반으로 추천 수행
test(): 데이터 세트 전체에 대해 추천 예측
predict(): 개별 사용자와 영화에 대한 추천 평점 반환

In [4]:
predictions = algo.test(testset)
print("prediction type:", type(predictions), "size:", len(predictions))
print("extract first 5")
predictions[:5]

prediction type: <class 'list'> size: 25000
extract first 5


[Prediction(uid='120', iid='282', r_ui=4.0, est=3.5114147666251547, details={'was_impossible': False}),
 Prediction(uid='882', iid='291', r_ui=4.0, est=3.5738724195814906, details={'was_impossible': False}),
 Prediction(uid='535', iid='507', r_ui=5.0, est=4.033583485472446, details={'was_impossible': False}),
 Prediction(uid='697', iid='244', r_ui=5.0, est=3.846363949593691, details={'was_impossible': False}),
 Prediction(uid='751', iid='385', r_ui=4.0, est=3.1807542478219157, details={'was_impossible': False})]

prediction 객체 => Surprise 패키지 제공 데이터 타입

In [5]:
[ (prd.uid, prd.iid, prd.est) for prd in predictions[:3] ]

[('120', '282', 3.5114147666251547),
 ('882', '291', 3.5738724195814906),
 ('535', '507', 4.033583485472446)]

추천 평점 예측

In [6]:
uid = str(196)
iid = str(302)
prd = algo.predict(uid, iid)
print(prd)

user: 196        item: 302        r_ui = None   est = 4.49   {'was_impossible': False}


In [7]:
accuracy.rmse(predictions)

RMSE: 0.9467


0.9466860806937948

일반 데이터 파일이나 판다스 DataFrame 등 로딩 가능
사용자 아이디, 아이템 아이디, 평점  순서를 지켜야 함

Dataset.load_builtin(name="ml-100k"): 데이터 내려받기
Dataset.load_from_file(file_path, reader): OS 파일에서 데이터 로드
Dataset.load_from_df(df, reader): DataFrame에서 데이터 로드

In [8]:
import pandas as pd

ratings = pd.read_csv("../ml-latest-small/ratings.csv")

ratings.to_csv("../ml-latest-small/ratings_noh.csv", index=False, header=False)

각 필드의 칼럼명과 칼럼 분리문자, 최소~최대 평점 입력, Reader 객체 참조

In [12]:
from surprise import Reader

reader = Reader(line_format="user item rating timestamp", sep=',', rating_scale=(0.5, 5))
data = Dataset.load_from_file("../ml-latest-small/ratings_noh.csv", reader=reader)

Reader 클래스 생성 주요 파라미터
line_format(string): 칼럼을 순서대로 나열
sep(char): 칼럼 분리 분리자
rating_scale: 최소, 최대 평점 설정

In [13]:
trainset, testset = train_test_split(data, test_size=.25, random_state=0)

algo = SVD(n_factors=50, random_state=0)
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

RMSE: 0.8682


0.8681952927143516

#### Pandas DataFrame

In [14]:
import pandas as pd
from surprise import Reader, Dataset

ratings = pd.read_csv("../ml-latest-small/ratings.csv")
reader = Reader(rating_scale=(0.5, 5))

data = Dataset.load_from_df(ratings[["userId", "movieId", "rating"]], reader)
trainset, testset = train_test_split(data, test_size=.25, random_state=0)

algo = SVD(n_factors=50, random_state=0)
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

RMSE: 0.8682


0.8681952927143516

### Surprise 추천 알고리즘 클래스
SVD, KNNBasic, BaselineOnly 등 사용

### 베이스라인 평점
평가에 편향성 요소를 반영하여 평점 부과
전체 평균 평점 + 사용자 편향 점수 + 아이템 편향 점수
- 전체 평균 평점 = 모든 사용자 아이템에 대한 평점 평균
- 사용자 편향 점수 = 사용자별 아이템 평점 평균 값 - 전체 평균 평점
- 아이템 편향 점수 = 아이템별 평점 평균 값 - 전체 평균 평점

### 교차 검증과 하이퍼파라미터 튜닝

In [15]:
from surprise.model_selection import cross_validate

ratings = pd.read_csv("../ml-latest-small/ratings.csv")
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(ratings[["userId", "movieId", "rating"]], reader)

algo = SVD(random_state=0)
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8753  0.8675  0.8636  0.8797  0.8769  0.8726  0.0060  
MAE (testset)     0.6696  0.6671  0.6676  0.6738  0.6738  0.6704  0.0029  
Fit time          0.53    0.46    0.46    0.46    0.47    0.48    0.03    
Test time         0.05    0.05    0.06    0.05    0.05    0.05    0.00    


{'test_rmse': array([0.87527672, 0.86749823, 0.86364509, 0.87971374, 0.87692562]),
 'test_mae': array([0.66962188, 0.6670856 , 0.66762888, 0.67381821, 0.6738331 ]),
 'fit_time': (0.5292291641235352,
  0.461118221282959,
  0.46279096603393555,
  0.46213293075561523,
  0.4671659469604492),
 'test_time': (0.05265378952026367,
  0.05006217956542969,
  0.056197166442871094,
  0.05397462844848633,
  0.05055999755859375)}

하이퍼 파라미터 최적화

In [16]:
from surprise.model_selection import GridSearchCV

param_grid = {"n_epochs": [20, 40, 60],
              "n_factors": [50, 100, 200]}

gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)

print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

0.8764014130380192
{'n_epochs': 20, 'n_factors': 50}


개인화 영화 추천 시스템 구축

In [17]:
# error
data = Dataset.load_from_df(ratings[["userId", "movieId", "rating"]], reader)
algo = SVD(n_factors=50, random_state=0)
algo.fit(data)

AttributeError: 'DatasetAutoFolds' object has no attribute 'n_users'

In [18]:
from surprise.dataset import DatasetAutoFolds

reader = Reader(line_format="user item rating timestamp", sep=',', rating_scale=(0.5, 5))
data_folds = DatasetAutoFolds(ratings_file="../ml-latest-small/ratings_noh.csv", reader=reader)

# make all data to train set
trainset = data_folds.build_full_trainset()

In [19]:
algo = SVD(n_epochs=20, n_factors=50, random_state=0)
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x16b66c3a0>

In [20]:
movies = pd.read_csv("../ml-latest-small/movies.csv")

movieIds = ratings[ratings["userId"] == 9]["movieId"]

if movieIds[movieIds == 42].count() == 0:
    print("movie 42: no review from user 9")

print(movies[movies["movieId"] == 42])

movie 42: no review from user 9
    movieId                   title              genres
38       42  Dead Presidents (1995)  Action|Crime|Drama


In [21]:
uid = str(9)
iid = str(42)

prd = algo.predict(uid, iid, verbose=True)

user: 9          item: 42         r_ui = None   est = 3.13   {'was_impossible': False}


In [22]:
def get_unseen_surprise(ratings, movies, userId):
    seen_movies = ratings[ratings["userId"] == userId]["movieId"].tolist()
    total_movies = movies["movieId"].tolist()
    unseen_movies = [movie for movie in total_movies if movie not in seen_movies]
    print("graded movies:", len(seen_movies), "to recommends:", len(unseen_movies),
          "all movies:", len(total_movies))

    return unseen_movies

unseen_movies = get_unseen_surprise(ratings, movies, 9)

graded movies: 46 to recommends: 9696 all movies: 9742


In [23]:
def recomm_movie_by_surprise(algo, userId, unseen_movies, top_n=10):
    predictions = [algo.predict(str(userId), str(movieId)) for movieId in unseen_movies]

    def sortkey_est(pred):
        return pred.est

    predictions.sort(key=sortkey_est, reverse=True)
    top_predictions= predictions[:top_n]

    top_movie_ids = [ int(pred.iid) for pred in top_predictions]
    top_movie_rating = [ pred.est for pred in top_predictions]
    top_movie_titles = movies[movies.movieId.isin(top_movie_ids)]['title']
    top_movie_preds = [ (id, title, rating) for id, title, rating in zip(top_movie_ids, top_movie_titles, top_movie_rating)]

    return top_movie_preds

unseen_movies = get_unseen_surprise(ratings, movies, 9)
top_movie_preds = recomm_movie_by_surprise(algo, 9, unseen_movies, top_n=10)
print('##### Top-10 추천 영화 리스트 #####')

for top_movie in top_movie_preds:
    print(top_movie[1], ":", top_movie[2])

graded movies: 46 to recommends: 9696 all movies: 9742
##### Top-10 추천 영화 리스트 #####
Usual Suspects, The (1995) : 4.306302135700814
Star Wars: Episode IV - A New Hope (1977) : 4.281663842987387
Pulp Fiction (1994) : 4.278152632122758
Silence of the Lambs, The (1991) : 4.226073566460876
Godfather, The (1972) : 4.1918097904381995
Streetcar Named Desire, A (1951) : 4.154746591122658
Star Wars: Episode V - The Empire Strikes Back (1980) : 4.122016128534504
Star Wars: Episode VI - Return of the Jedi (1983) : 4.108009609093436
Goodfellas (1990) : 4.083464936588478
Glory (1989) : 4.07887165526957
