<a href="https://colab.research.google.com/github/jiminmini/mini/blob/main/10_17_%ED%95%84%EC%82%AC_%EA%B3%BC%EC%A0%9C.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**[Concept Summary]**

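##**[Latent Factor Collaborative Filtering]**
- Uses matrix factorization: the sparse user-item rating matrix is decomposed into user and item latent factor matrices
- SVD, NMF, and similar techniques can be applied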
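
To make the idea of matrix factorization concrete, here is a minimal sketch that applies NumPy's SVD to a small, fully observed toy matrix (the values are invented for illustration). Real rating matrices are sparse, so this dense decomposition is only an intuition aid and is not the method used later in this notebook.

```python
import numpy as np

# Toy 4x3 user-item rating matrix (made-up values, fully observed).
R = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0],
              [2.0, 1.0, 4.0]])

# Thin SVD: R = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

def reconstruct(k):
    """Rebuild R from the k largest singular values (k latent factors)."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

R_full = reconstruct(len(s))   # all factors: exact reconstruction
R_k2 = reconstruct(2)          # 2 latent factors: low-rank approximation

print(np.round(R_k2, 2))
```

Using fewer factors gives a coarser approximation of `R`; the factor count plays the same role as `K` and `n_factors` in the cells below.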

##**[surprise package]**
- A Python-based recommender system package
- Provides an API and framework similar to scikit-learn

##**[Main surprise modules]**
- Dataset.load_builtin(name='ml-100k')
- Dataset.load_from_file(file_path, reader)
- Dataset.load_from_df(df, reader)

##**[surprise recommendation algorithm classes]**
- SVD
- KNNBasic
- BaselineOnly
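
As a rough guide to what these three classes estimate, the prediction rules described in the Surprise documentation can be summarized as follows. Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are user and item biases, $p_u$ and $q_i$ are latent factor vectors, and $N_i^k(u)$ is the set of the $k$ users most similar to $u$ who rated item $i$ (the user-based case). This is a summary, not a full specification of each algorithm's options.

$$
\text{BaselineOnly:}\quad \hat{r}_{ui} = \mu + b_u + b_i
$$

$$
\text{SVD:}\quad \hat{r}_{ui} = \mu + b_u + b_i + q_i^\top p_u
$$

$$
\text{KNNBasic:}\quad \hat{r}_{ui} = \frac{\sum_{v \in N_i^k(u)} \text{sim}(u, v)\, r_{vi}}{\sum_{v \in N_i^k(u)} \text{sim}(u, v)}
$$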

##**[Cross-Validation and Hyperparameter Tuning]**
- Provides the cross_validate() function and the GridSearchCV class

#**[Code Transcription]**

In [None]:
import numpy as np

def matrix_factorization(R, K, steps=200, learning_rate=0.01, r_lambda=0.01):
    num_users, num_items = R.shape
    # Set the sizes of the P and Q matrices and initialize them with normally distributed random values.
    np.random.seed(1)
    P = np.random.normal(scale=1./K, size=(num_users, K))
    Q = np.random.normal(scale=1./K, size=(num_items, K))
    # Store the row index, column index, and value of every R > 0 entry in the non_zeros list.
    non_zeros = [(i, j, R[i, j]) for i in range(num_users) for j in range(num_items) if R[i, j] > 0]
    # Keep updating the P and Q matrices with SGD.
    for step in range(steps):
        for i, j, r in non_zeros:
            # Error: the difference between the actual and the predicted value
            eij = r - np.dot(P[i, :], Q[j, :].T)
            # Apply the SGD update rule with regularization
            P[i, :] = P[i, :] + learning_rate*(eij * Q[j, :] - r_lambda*P[i, :])
            Q[j, :] = Q[j, :] + learning_rate*(eij * P[i, :] - r_lambda*Q[j, :])
        # get_rmse() is defined in a later cell; run that cell before calling this function.
        rmse = get_rmse(R, P, Q, non_zeros)
        if (step % 10) == 0:
            print("### iteration step : ", step, " rmse : ", rmse)
    return P, Q
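
For reference, the update lines in the loop above can be written as equations, with $\eta$ standing for `learning_rate` and $\lambda$ for `r_lambda`. They are gradient steps on the regularized squared error $\tfrac{1}{2}\left[e_{ij}^2 + \lambda\left(\lVert p_i \rVert^2 + \lVert q_j \rVert^2\right)\right]$ for a single observed rating $r_{ij}$:

$$
e_{ij} = r_{ij} - p_i^\top q_j
$$

$$
p_i \leftarrow p_i + \eta\,\left(e_{ij}\, q_j - \lambda\, p_i\right)
$$

$$
q_j \leftarrow q_j + \eta\,\left(e_{ij}\, p_i - \lambda\, q_j\right)
$$

Note that, as the code is written, the update of $q_j$ uses the value of $p_i$ that was just updated on the previous line.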

In [None]:
import pandas as pd
import numpy as np
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
ratings = ratings[['userId', 'movieId', 'rating']]
ratings_matrix = ratings.pivot_table('rating', index='userId',columns='movieId')
# Join with movies to get the title column
rating_movies = pd.merge(ratings, movies, on='movieId')
# Pivot on the title column by setting columns='title'.
ratings_matrix = rating_movies.pivot_table('rating', index='userId', columns='title')
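
To illustrate what the pivot produces, here is a small sketch on a hypothetical mini table (the ids, titles, and ratings are invented). It shows that user-title pairs without a rating become NaN, which is why a filter such as `R[i, j] > 0` later skips them.

```python
import pandas as pd

# Hypothetical mini ratings table (values invented for illustration).
toy_ratings = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'title': ['A', 'B', 'A', 'C'],
    'rating': [4.0, 3.5, 5.0, 2.0],
})

# Rows become users, columns become titles; unrated pairs are NaN.
toy_matrix = toy_ratings.pivot_table(values='rating', index='userId', columns='title')
print(toy_matrix)

# NaN > 0 evaluates to False, so only the rated cells are counted.
n_rated = int((toy_matrix.values > 0).sum())
```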

In [None]:
import numpy as np

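def get_rmse(R, P, Q, non_zeros):
    """Return the RMSE between actual and predicted ratings over the observed entries."""
    error = 0
    # non_zeros holds (row index, column index, actual rating) tuples
    for i, j, r in non_zeros:
        pred = np.dot(P[i, :], Q[j, :].T)
        error += pow(r - pred, 2)
    rmse = np.sqrt(error / len(non_zeros))
    return rmse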
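
The loop above is easy to follow but slow on large matrices. A possible vectorized alternative is sketched below; `get_rmse_vectorized` is a hypothetical name, and the toy data is invented only to check that both approaches agree.

```python
import numpy as np

def get_rmse_vectorized(R, P, Q):
    """RMSE over observed (R > 0) entries, computed without a Python loop."""
    pred = P @ Q.T                 # full predicted matrix
    mask = R > 0                   # NaN and 0 entries are excluded
    diff = R[mask] - pred[mask]
    return np.sqrt(np.mean(diff ** 2))

# Tiny check against a loop-based version on made-up data
rng = np.random.default_rng(0)
R_toy = np.array([[5.0, 0.0, 3.0],
                  [0.0, 4.0, np.nan]])
P_toy = rng.normal(size=(2, 2))
Q_toy = rng.normal(size=(3, 2))

non_zeros = [(i, j, R_toy[i, j]) for i in range(2) for j in range(3) if R_toy[i, j] > 0]
loop_rmse = np.sqrt(sum((r - P_toy[i] @ Q_toy[j]) ** 2 for i, j, r in non_zeros) / len(non_zeros))
vec_rmse = get_rmse_vectorized(R_toy, P_toy, Q_toy)
```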


In [None]:
P, Q = matrix_factorization(ratings_matrix.values, K=50, steps=200, learning_rate=0.01, r_lambda = 0.01)
pred_matrix = np.dot(P, Q.T)

### iteration step :  0  rmse :  2.9023619751337115
### iteration step :  10  rmse :  0.7335768591017939
### iteration step :  20  rmse :  0.5115539026853438
### iteration step :  30  rmse :  0.37261628282537734
### iteration step :  40  rmse :  0.29608182991810145
### iteration step :  50  rmse :  0.2520353192341621
### iteration step :  60  rmse :  0.22487503275269882
### iteration step :  70  rmse :  0.20685455302331512
### iteration step :  80  rmse :  0.19413418783028674
### iteration step :  90  rmse :  0.1847008200272031
### iteration step :  100  rmse :  0.17742927527209082
### iteration step :  110  rmse :  0.17165226964707506
### iteration step :  120  rmse :  0.16695181946871496
### iteration step :  130  rmse :  0.16305292191997453
### iteration step :  140  rmse :  0.159766919296796
### iteration step :  150  rmse :  0.15695986999457337
### iteration step :  160  rmse :  0.15453398186715442
### iteration step :  170  rmse :  0.1524161855107769
### iteration step :  180  rm

In [None]:
ratings_pred_matrix = pd.DataFrame(data=pred_matrix, index= ratings_matrix.index, columns = ratings_matrix.columns)
ratings_pred_matrix.head(3)

userId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,3.055084,4.092018,3.56413,4.502167,3.981215,1.271694,3.603274,2.333266,5.091749,3.972454,...,1.402608,4.208382,3.705957,2.720514,2.787331,3.475076,3.253458,2.161087,4.010495,0.859474
2,3.170119,3.657992,3.308707,4.166521,4.31189,1.275469,4.237972,1.900366,3.392859,3.647421,...,0.973811,3.528264,3.361532,2.672535,2.404456,4.232789,2.911602,1.634576,4.135735,0.725684
3,2.307073,1.658853,1.443538,2.208859,2.229486,0.78076,1.997043,0.924908,2.9707,2.551446,...,0.520354,1.709494,2.281596,1.782833,1.635173,1.323276,2.88758,1.042618,2.29389,0.396941


In [None]:
# Extract the titles of movies the user has not watched
unseen_list = get_unseen_movies(ratings_matrix, 9)
# Recommend movies with latent factor collaborative filtering
recomm_movies = recomm_movie_by_userid(ratings_pred_matrix, 9, unseen_list, top_n=10)
# Build a DataFrame from the predicted ratings.
recomm_movies = pd.DataFrame(data=recomm_movies.values, index=recomm_movies.index, columns=['pred_score'])
recomm_movies

title,pred_score
Rear Window (1954),5.704612
"South Park: Bigger, Longer and Uncut (1999)",5.4511
Rounders (1998),5.298393
Blade Runner (1982),5.244951
Roger & Me (1989),5.191962
Gattaca (1997),5.183179
Ben-Hur (1959),5.130463
Rosencrantz and Guildenstern Are Dead (1990),5.087375
"Big Lebowski, The (1998)",5.03869
Star Wars: Episode V - The Empire Strikes Back (1980),4.989601
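
The cell above calls `get_unseen_movies` and `recomm_movie_by_userid`, which are not defined anywhere in this notebook. The sketch below shows one way such helpers might look, assuming the ratings matrix has users as rows and titles as columns, with NaN or 0 marking unrated movies. These are hypothetical stand-ins, not the original definitions, and the toy data is invented.

```python
import numpy as np
import pandas as pd

def get_unseen_movies(ratings_matrix, userId):
    """Titles the user has not rated (NaN or 0 in the user's row). Hypothetical helper."""
    user_rating = ratings_matrix.loc[userId, :]
    # NaN > 0 and 0 > 0 are both False, so both count as unseen
    return user_rating[~(user_rating > 0)].index.tolist()

def recomm_movie_by_userid(pred_df, userId, unseen_list, top_n=10):
    """Top-n unseen titles ranked by predicted score. Hypothetical helper."""
    return pred_df.loc[userId, unseen_list].sort_values(ascending=False)[:top_n]

# Toy data (invented) to exercise both helpers
toy_actual = pd.DataFrame([[5.0, np.nan, np.nan], [np.nan, 3.0, 4.0]],
                          index=pd.Index([1, 2], name='userId'),
                          columns=pd.Index(['A', 'B', 'C'], name='title'))
toy_pred = pd.DataFrame([[4.8, 2.1, 3.9], [3.0, 3.2, 4.1]],
                        index=toy_actual.index, columns=toy_actual.columns)

unseen = get_unseen_movies(toy_actual, 1)
top = recomm_movie_by_userid(toy_pred, 1, unseen, top_n=1)
```

Both helpers return plain pandas objects, so the `.values` and `.index` accesses in the cell above work on the result.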


In [None]:
!pip install numpy==1.26.4 --force-reinstall --no-cache-dir

Collecting numpy==1.26.4
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.0 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opencv-contrib-python 4.12.0.88 requires nu

In [None]:
! pip install scikit-surprise



In [None]:
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split

In [None]:
data = Dataset.load_builtin('ml-100k')
# Set random_state so the data is split the same way on every run
trainset, testset = train_test_split(data, test_size=.25, random_state=0)

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from https://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


In [None]:
algo = SVD(random_state=0)
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7c7c3d95c560>

In [None]:
predictions = algo.test(testset)
print('prediction type:', type(predictions), ' size:', len(predictions))
print('First 5 prediction results')
predictions[:5]

prediction type: <class 'list'>  size: 25000
First 5 prediction results


[Prediction(uid='120', iid='282', r_ui=4.0, est=3.5114147666251547, details={'was_impossible': False}),
 Prediction(uid='882', iid='291', r_ui=4.0, est=3.573872419581491, details={'was_impossible': False}),
 Prediction(uid='535', iid='507', r_ui=5.0, est=4.033583485472447, details={'was_impossible': False}),
 Prediction(uid='697', iid='244', r_ui=5.0, est=3.8463639495936905, details={'was_impossible': False}),
 Prediction(uid='751', iid='385', r_ui=4.0, est=3.1807542478219157, details={'was_impossible': False})]

In [None]:
[(pred.uid, pred.iid, pred.est) for pred in predictions[:3] ]

[('120', '282', 3.5114147666251547),
 ('882', '291', 3.573872419581491),
 ('535', '507', 4.033583485472447)]

In [None]:
# The user ID and item ID must be passed as strings.
uid = str(196)
iid = str(302)
pred = algo.predict(uid, iid)
print(pred)

user: 196        item: 302        r_ui = None   est = 4.49   {'was_impossible': False}
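
Why strings? The ml-100k data is parsed from a text file, so the raw ids are stored as strings, and an id passed with a different type is treated as unknown. The snippet below is a simplified stand-in for that kind of raw-id lookup, written only to illustrate the type mismatch; it is not Surprise's actual implementation.

```python
# Simplified stand-in for a raw-id -> inner-id lookup table (not Surprise's real code).
raw_to_inner = {}
for raw_uid in ['196', '186', '22']:        # ids read from a text file are strings
    raw_to_inner.setdefault(raw_uid, len(raw_to_inner))

known_str = '196' in raw_to_inner    # same type as the stored keys
known_int = 196 in raw_to_inner      # int 196 is a different key than '196'

print(known_str, known_int)
```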


In [None]:
accuracy.rmse(predictions)

RMSE: 0.9467


0.9466860806937948

In [None]:
import pandas as pd
ratings = pd.read_csv('ratings.csv')
# Export to ratings_noh.csv, writing a new file with both the index and the header removed.
ratings.to_csv('ratings_noh.csv', index=False, header=False)

In [None]:
from surprise import Reader
reader = Reader(line_format='user item rating timestamp', sep=',', rating_scale=(0.5, 5))
data=Dataset.load_from_file('ratings_noh.csv', reader=reader)

In [None]:
trainset, testset = train_test_split(data, test_size=.25, random_state=0)
# Set random_state so that every run produces the same result
algo = SVD(n_factors=50, random_state=0)
# Fit on the training set, predict ratings on the test set, then evaluate RMSE
algo.fit(trainset)
predictions = algo.test( testset )
accuracy.rmse(predictions)

RMSE: 0.8682


0.8681952927143516

In [None]:
import pandas as pd
from surprise import Reader, Dataset
ratings = pd.read_csv('ratings.csv')
reader = Reader(rating_scale=(0.5, 5.0))
# The columns of the ratings DataFrame must be in this order: user ID, item ID, rating.
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
trainset, testset = train_test_split(data, test_size=.25, random_state=0)
algo = SVD(n_factors=50, random_state=0)
algo.fit(trainset)
predictions = algo.test( testset )
accuracy.rmse(predictions)

RMSE: 0.8682


0.8681952927143516

In [None]:
from surprise.model_selection import cross_validate
# Load data from a pandas DataFrame into a Surprise dataset
ratings = pd.read_csv('ratings.csv')
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
algo = SVD(random_state=0)
cross_validate(algo, data, measures=[ 'RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8692  0.8721  0.8723  0.8737  0.8781  0.8731  0.0029  
MAE (testset)     0.6699  0.6717  0.6700  0.6698  0.6760  0.6715  0.0024  
Fit time          1.56    1.43    1.55    1.56    1.93    1.61    0.17    
Test time         0.12    0.26    0.13    0.37    0.12    0.20    0.10    


{'test_rmse': array([0.8692297 , 0.87207469, 0.87234181, 0.87367126, 0.87807656]),
 'test_mae': array([0.66993509, 0.67174206, 0.66997254, 0.66976128, 0.676     ]),
 'fit_time': (1.560788869857788,
  1.4319169521331787,
  1.5507066249847412,
  1.560366153717041,
  1.9313790798187256),
 'test_time': (0.11897802352905273,
  0.25997495651245117,
  0.13253045082092285,
  0.36750173568725586,
  0.12000370025634766)}
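
The Mean and Std columns of the table can be reproduced from the per-fold `test_rmse` array in the returned dictionary. The sketch below assumes the table reports the population standard deviation (NumPy's default, `ddof=0`); with that assumption the rounded values agree with the 0.8731 and 0.0029 shown above.

```python
import numpy as np

# Per-fold RMSE values copied from the 'test_rmse' array printed above
fold_rmse = np.array([0.8692297, 0.87207469, 0.87234181, 0.87367126, 0.87807656])

mean_rmse = fold_rmse.mean()
std_rmse = fold_rmse.std()     # population standard deviation (ddof=0)

print(f"mean={mean_rmse:.4f}  std={std_rmse:.4f}")
```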

In [None]:
from surprise.model_selection import GridSearchCV
# Specify the parameters to tune as a dictionary.
param_grid = {'n_epochs': [20, 40, 60], 'n_factors': [50, 100, 200]}
# Configure GridSearchCV with 3 CV folds and RMSE and MAE as the evaluation measures
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)
# Best RMSE score and the hyperparameters that produced it
print(gs.best_score['rmse'])
print(gs.best_params['rmse'])

0.8767471946623759
{'n_epochs': 20, 'n_factors': 50}
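
To get a feel for the cost of this search, the sketch below enumerates the combinations that an exhaustive grid over `param_grid` covers, using only the standard library. It illustrates the idea of a grid search and does not reproduce GridSearchCV's internals.

```python
from itertools import product

param_grid = {'n_epochs': [20, 40, 60], 'n_factors': [50, 100, 200]}

# Every combination of the listed values: 3 x 3 candidate settings
keys = list(param_grid)
combos = [dict(zip(keys, values)) for values in product(*(param_grid[k] for k in keys))]

cv = 3
n_fits = len(combos) * cv     # each candidate is trained once per fold

print(len(combos), n_fits)
```

The best setting reported above, `{'n_epochs': 20, 'n_factors': 50}`, is one of these candidates.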


In [None]:
# The following code raises an error because fit() expects a Trainset, but it is called on a dataset that has not been split with train_test_split().
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
algo = SVD(n_factors=50, random_state=0)
algo.fit(data)

AttributeError: 'DatasetAutoFolds' object has no attribute 'n_users'

In [None]:
from surprise.dataset import DatasetAutoFolds
reader = Reader(line_format='user item rating timestamp', sep=',', rating_scale=(0.5, 5))
# Create a DatasetAutoFolds object from the ratings_noh.csv file.
data_folds = DatasetAutoFolds(ratings_file='ratings_noh.csv', reader=reader)
# Build a training set from the entire dataset.
trainset = data_folds.build_full_trainset()

In [None]:
algo = SVD(n_epochs=20, n_factors=50, random_state=0)
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7c7c2af27140>

In [None]:
# Load the DataFrame with detailed movie attributes
movies = pd.read_csv('movies.csv')
# Extract the movieId values rated by userId=9 and check whether movieId=42 is among them.
movieIds = ratings[ratings['userId'] == 9]['movieId']
if movieIds[movieIds == 42].count() == 0:
    print('User ID 9 has no rating for movie ID 42')
print(movies[movies['movieId'] == 42])

User ID 9 has no rating for movie ID 42
    movieId                   title              genres
38       42  Dead Presidents (1995)  Action|Crime|Drama


In [None]:
uid = str(9)
iid = str(42)
pred = algo.predict(uid, iid, verbose=True)

user: 9          item: 42         r_ui = None   est = 3.13   {'was_impossible': False}


In [None]:
def get_unseen_surprise(ratings, movies, userId):
    # Movies that the given userId has already rated
    seen_movies = ratings[ratings['userId'] == userId]['movieId'].tolist()

    # movieId list of all movies
    total_movies = movies['movieId'].tolist()

    # Exclude movies that were already rated (a set keeps the membership test fast)
    seen_set = set(seen_movies)
    unseen_movies = [movie for movie in total_movies if movie not in seen_set]

    print('Rated movies:', len(seen_movies),
          'Recommendation candidates:', len(unseen_movies),
          'Total movies:', len(total_movies))

    return unseen_movies

# Example: userId 9
unseen_movies = get_unseen_surprise(ratings, movies, 9)


Rated movies: 46 Recommendation candidates: 9696 Total movies: 9742


In [None]:
def recomm_movie_by_surprise(algo, userId, unseen_movies, movies, top_n=10):
    # Call the algorithm object's predict() for every movie the user has not rated
    predictions = [algo.predict(str(userId), str(movieId)) for movieId in unseen_movies]

    # Sort by the estimated rating (est), highest first
    def sortkey_est(pred):
        return pred.est

    predictions.sort(key=sortkey_est, reverse=True)
    top_predictions = predictions[:top_n]

    # Movie information for the top_n predictions
    top_movie_ids = [int(pred.iid) for pred in top_predictions]
    top_movie_rating = [pred.est for pred in top_predictions]
    # NOTE: isin() returns rows in the order of the movies DataFrame, not in ranked order,
    # so the titles are not guaranteed to line up with the ids and ratings zipped below.
    top_movie_titles = movies[movies.movieId.isin(top_movie_ids)]['title'].tolist()

    top_movie_preds = [(id, title, rating) for id, title, rating in zip(top_movie_ids, top_movie_titles, top_movie_rating)]
    return top_movie_preds

# Build unseen_movies
unseen_movies = get_unseen_surprise(ratings, movies, 9)

# Run the recommendation
top_movie_preds = recomm_movie_by_surprise(algo, 9, unseen_movies, movies, top_n=10)

# Print the results
print('##### Top-10 recommended movies #####')
for top_movie in top_movie_preds:
    print(top_movie[1], ":", round(top_movie[2], 2))


Rated movies: 46 Recommendation candidates: 9696 Total movies: 9742
##### Top-10 recommended movies #####
Usual Suspects, The (1995) : 4.31
Star Wars: Episode IV - A New Hope (1977) : 4.28
Pulp Fiction (1994) : 4.28
Silence of the Lambs, The (1991) : 4.23
Godfather, The (1972) : 4.19
Streetcar Named Desire, A (1951) : 4.15
Star Wars: Episode V - The Empire Strikes Back (1980) : 4.12
Star Wars: Episode VI - Return of the Jedi (1983) : 4.11
Goodfellas (1990) : 4.08
Glory (1989) : 4.08
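
One caveat about `recomm_movie_by_surprise`: the title lookup uses `isin()`, which returns rows in the order of the `movies` DataFrame rather than in ranked order, so the titles listed above may not be paired with their own scores. The sketch below shows the difference on an invented toy table and one possible order-preserving lookup with `set_index` and `.loc`. It is offered as an option, not as the original author's method.

```python
import pandas as pd

# Toy movies table (invented) in movieId order, like movies.csv
toy_movies = pd.DataFrame({'movieId': [1, 2, 3], 'title': ['A', 'B', 'C']})

# Suppose the predictions ranked the ids as 3, 1, 2 (best first)
ranked_ids = [3, 1, 2]

# isin() keeps the table's row order, so the ranking is lost
titles_isin = toy_movies[toy_movies.movieId.isin(ranked_ids)]['title'].tolist()

# Indexing by movieId and selecting with .loc keeps the ranked order
titles_ranked = toy_movies.set_index('movieId').loc[ranked_ids, 'title'].tolist()

print(titles_isin, titles_ranked)
```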
