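# Recommender System with MF (Accuracy Evaluation)
### Accuracy of Matrix Factorization
- Accuracy here refers to the metric used to evaluate the predictive performance of a recommender system model.
- MF factorizes the user-item rating matrix to learn latent factors, then uses them to predict unseen ratings or to generate recommendations.
- Accuracy measures how closely these predictions match the actual data. This notebook reports it as RMSE on the training and test sets.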
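
As a minimal sketch of the metric used throughout this notebook, RMSE can be computed as below. The helper name `rmse` and the toy rating lists are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical actual and predicted ratings for four user-item pairs
actual = [4, 3, 5, 2]
predicted = [3.5, 3.0, 4.0, 2.5]
print(rmse(actual, predicted))
```

A lower value means the predictions are closer to the observed ratings; a value of 0 would mean a perfect match.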

## Env

In [1]:
import numpy as np
import pandas as pd

## Data load

In [2]:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('/Users/jun/Library/Mobile Documents/com~apple~CloudDocs/Github/ai _recommendation _system/data/u.user', sep='|', names=u_cols, encoding='latin-1')

In [3]:
i_cols = ['movie_id', 'title', 'release date', 'video release date', 'IMDB URL', 'unknown',
          'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary',
          'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
          'Thriller', 'War', 'Western']
movies = pd.read_csv('/Users/jun/Library/Mobile Documents/com~apple~CloudDocs/Github/ai _recommendation _system/data/u.item', sep='|', names=i_cols, encoding='latin-1')
movies.head()

| | movie_id | title | release date | video release date | IMDB URL | unknown | Action | Adventure | Animation | Children's | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | GoldenEye (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?GoldenEye%20(... | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | Four Rooms (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Four%20Rooms%... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 4 | Get Shorty (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Get%20Shorty%... | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | Copycat (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


In [4]:
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('/Users/jun/Library/Mobile Documents/com~apple~CloudDocs/Github/ai _recommendation _system/data/u.data', sep='\t', names=r_cols, encoding='latin-1')
ratings.head()

| | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |


In [5]:
ratings = ratings[['user_id', 'movie_id', 'rating']].astype(int)
ratings

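| | user_id | movie_id | rating |
|---|---|---|---|
| 0 | 196 | 242 | 3 |
| 1 | 186 | 302 | 3 |
| 2 | 22 | 377 | 1 |
| 3 | 244 | 51 | 2 |
| 4 | 166 | 346 | 1 |
| ... | ... | ... | ... |
| 99995 | 880 | 476 | 3 |
| 99996 | 716 | 204 | 5 |
| 99997 | 276 | 1090 | 1 |
| 99998 | 13 | 225 | 2 |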


## Train/test split

In [6]:
# A plain shuffle does not guarantee that each user's ratings are spread evenly across the train and test sets
from sklearn.utils import shuffle 

TRAIN_SIZE = 0.75
ratings = shuffle(ratings, random_state=1)
cutoff = int(TRAIN_SIZE * len(ratings))
ratings_train = ratings.iloc[:cutoff]
ratings_test = ratings.iloc[cutoff:]

- This differs from a stratified split. With a plain shuffle, some user IDs may end up with no ratings in one of the two sets.

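#### Role of stratify in train_test_split
- `stratify` preserves the distribution of a chosen column when the data is split. It is mainly used with **imbalanced data**, so that the train and test sets keep the same distribution as the original data.
- In short, it is useful for **splitting a dataset while preserving its distribution along a specific criterion**.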
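
The sketch below illustrates the idea with scikit-learn's `train_test_split`, stratifying on `user_id` so that every user contributes ratings to both sets. The toy DataFrame is hypothetical, and this is not the split used in the notebook, which relies on `shuffle` and a positional cutoff.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy ratings: 5 users with 8 ratings each
toy = pd.DataFrame({
    "user_id": [u for u in range(1, 6) for _ in range(8)],
    "movie_id": list(range(1, 9)) * 5,
    "rating": [((u + m) % 5) + 1 for u in range(1, 6) for m in range(1, 9)],
})

# stratify keeps the per-user proportion identical in both sets
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy["user_id"], random_state=1
)

print(train["user_id"].value_counts().sort_index())
print(test["user_id"].value_counts().sort_index())

# Users that appear in the test set but never in the train set
missing = set(test["user_id"]) - set(train["user_id"])
print(missing)
```

Note that stratifying requires at least two rows per user; users with a single rating would have to be handled separately.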

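#### Splitting without shuffling
- Without shuffling, the data is split in its original order. If the dataset is sorted, a particular class or group is likely to be concentrated in one of the two sets.
- Because the rows are not randomly mixed, the train and test sets can end up with different distributions.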
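
A small sketch of this failure mode, using a hypothetical toy DataFrame sorted by `user_id` and the same positional cutoff as the notebook:

```python
import pandas as pd

# Hypothetical ratings sorted by user_id (3 ratings per user)
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "rating":  [5, 3, 4, 2, 4, 4, 1, 5, 3, 4, 2, 5],
})

cutoff = int(0.75 * len(toy))   # 9 rows for training
train = toy.iloc[:cutoff]       # no shuffle: the first 9 rows in original order
test = toy.iloc[cutoff:]        # the last 3 rows

print(sorted(set(train["user_id"])))  # users seen during training
print(sorted(set(test["user_id"])))   # users evaluated in the test set
```

Here the test set contains only user 4, who never appears in the training set, so the model would have no ratings to learn that user's latent factors from.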


In [7]:
ratings_train

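| | user_id | movie_id | rating |
|---|---|---|---|
| 43660 | 508 | 185 | 5 |
| 87278 | 518 | 742 | 5 |
| 14317 | 178 | 28 | 5 |
| 81932 | 899 | 291 | 4 |
| 95321 | 115 | 117 | 4 |
| ... | ... | ... | ... |
| 73044 | 751 | 399 | 3 |
| 94380 | 455 | 286 | 5 |
| 16494 | 429 | 1119 | 3 |
| 37067 | 412 | 172 | 5 |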


In [8]:
ratings_test

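| | user_id | movie_id | rating |
|---|---|---|---|
| 53670 | 345 | 715 | 4 |
| 77110 | 92 | 998 | 2 |
| 69323 | 934 | 195 | 4 |
| 85968 | 586 | 423 | 2 |
| 30243 | 336 | 383 | 1 |
| ... | ... | ... | ... |
| 50057 | 26 | 840 | 2 |
| 98047 | 625 | 198 | 4 |
| 5192 | 56 | 568 | 4 |
| 77708 | 882 | 172 | 5 |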


## MF Class

In [10]:
class NEW_MF():
    def __init__(self, ratings, K, alpha, beta, iterations, verbose=True):
        self.R = np.array(ratings)
        
        ##### >>>>> (2) Build dictionaries that map user_id and item_id to the row/column indices of R
        item_id_index = []
        index_item_id = []
        for i, one_id in enumerate(ratings):
            item_id_index.append([one_id, i])
            index_item_id.append([i, one_id])
        self.item_id_index = dict(item_id_index)
        self.index_item_id = dict(index_item_id)

        user_id_index = []
        index_user_id = []
        for i, one_id in enumerate(ratings.T): # col = user_id, row = movie_id
            user_id_index.append([one_id, i])
            index_user_id.append([i, one_id])
        self.user_id_index = dict(user_id_index)
        self.index_user_id = dict(index_user_id)

#### <<<<< (2)
        self.num_users, self.num_items = np.shape(self.R)
        self.K = K
        self.alpha = alpha
        self.beta = beta
        self.iterations = iterations
        self.verbose = verbose

    # Compute RMSE on the training set
    def rmse(self):
        xs, ys = self.R.nonzero()
        self.predictions = []
        self.errors = []
        for x, y in zip(xs, ys):
            prediction = self.get_prediction(x, y)
            self.predictions.append(prediction)
            self.errors.append(self.R[x, y] - prediction)
        self.predictions = np.array(self.predictions)
        self.errors = np.array(self.errors)
        return np.sqrt(np.mean(self.errors**2))

    # Ratings for user i and item j
    def get_prediction(self, i, j):
        prediction = self.b + self.b_u[i] + self.b_d[j] + self.P[i, :].dot(self.Q[j, :].T)
        return prediction

    # Stochastic gradient descent to get optimized P and Q matrix
    def sgd(self):
        for i, j, r in self.samples:
            prediction = self.get_prediction(i, j)
            e = (r - prediction)

            self.b_u[i] += self.alpha * (e - self.beta * self.b_u[i])
            self.b_d[j] += self.alpha * (e - self.beta * self.b_d[j])

            self.P[i, :] += self.alpha * (e * self.Q[j, :] - self.beta * self.P[i,:])
            self.Q[j, :] += self.alpha * (e * self.P[i, :] - self.beta * self.Q[j,:])

    ##### >>>>> (3)
    # Register the test set (the tail of the shuffled ratings) and hide it from training
    def set_test(self, ratings_test):
        test_set = []
        for i in range(len(ratings_test)):      # for each row of the test data
            try:
                x = self.user_id_index[ratings_test.iloc[i, 0]]
                y = self.item_id_index[ratings_test.iloc[i, 1]]
                z = ratings_test.iloc[i, 2]
                test_set.append([x, y, z])      # keep the actual rating for evaluation
                self.R[x, y] = 0                # zero out the rating so it is excluded from training
            except KeyError:
                print(i)                        # user_id or movie_id is not present in R

        self.test_set = test_set
        return test_set                         # Return test set

    # Compute RMSE on the test set
    def test_rmse(self):
        error = 0
        for one_set in self.test_set: # test_set = [x, y, z]
            predicted = self.get_prediction(one_set[0], one_set[1]) 
            error += pow(one_set[2] - predicted, 2)
        return np.sqrt(error/len(self.test_set))

    # Train while tracking RMSE on both the training set and the test set
    def test(self):
        # Initializing user-feature and item-feature matrix
        self.P = np.random.normal(scale=1./self.K, size=(self.num_users, self.K))
        self.Q = np.random.normal(scale=1./self.K, size=(self.num_items, self.K))

        # Initializing the bias terms
        self.b_u = np.zeros(self.num_users)
        self.b_d = np.zeros(self.num_items)
        self.b = np.mean(self.R[self.R.nonzero()])

        # List of training samples
        rows, columns = self.R.nonzero()
        self.samples = [(i, j, self.R[i,j]) for i, j in zip(rows, columns)]

        # Stochastic gradient descent for given number of iterations
        training_process = []
        for i in range(self.iterations):
            np.random.shuffle(self.samples)
            self.sgd()
            rmse1 = self.rmse()
            rmse2 = self.test_rmse()
            training_process.append((i+1, rmse1, rmse2))
            if self.verbose:
                if (i+1) % 10 == 0:
                    print("Iteration: %d ; Train RMSE = %.4f ; Test RMSE = %.4f" % (i+1, rmse1, rmse2))
        return training_process

    # Ratings for given user_id and item_id
    def get_one_prediction(self, user_id, item_id):
        return self.get_prediction(self.user_id_index[user_id], self.item_id_index[item_id])

    # Full user-movie rating matrix
    def full_prediction(self):
        return self.b + self.b_u[:,np.newaxis] + self.b_d[np.newaxis,:] + self.P.dot(self.Q.T)


- Compared with the basic MF class, NEW_MF adds ID mapping, test-set handling, and ID-based prediction, which makes it a more practical implementation for real datasets.

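| Item | MF Class | NEW_MF Class |
|:---:|:---:|:---:|
| ID mapping | None; matrix indices are used directly | User IDs and item IDs are mapped separately to matrix indices |
| Test data handling | No test-related methods | Handled by `set_test` and `test_rmse` |
| Output | Training RMSE only | Training RMSE and test RMSE together |
| Additional features | None | ID-based prediction and full rating-matrix prediction |
| Extensibility and flexibility | Basic MF implementation | Includes features usable with real-world data |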
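
The ID mapping in the table can be sketched with dictionary comprehensions, as below. The tiny rating matrix and its IDs are hypothetical; the dictionary names mirror those in `NEW_MF`, but the construction is a compact alternative, not the class's own code.

```python
import pandas as pd

# Hypothetical rating matrix: rows are user IDs, columns are item IDs
R = pd.DataFrame(
    [[5, 0, 3],
     [0, 4, 0]],
    index=[10, 20],            # user IDs need not start at 0 or be contiguous
    columns=[101, 205, 309],   # item IDs
)

# ID -> positional index, and the reverse lookup
user_id_index = {user_id: i for i, user_id in enumerate(R.index)}
item_id_index = {item_id: i for i, item_id in enumerate(R.columns)}
index_user_id = {i: user_id for user_id, i in user_id_index.items()}
index_item_id = {i: item_id for item_id, i in item_id_index.items()}

# Look up the rating of user 20 for item 205 through the mapping
row, col = user_id_index[20], item_id_index[205]
print(R.to_numpy()[row, col])
```

With such a mapping, callers can keep working with the dataset's own IDs while the model works with plain array positions.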

# Recommendation on the full dataset

In [11]:
R_temp = ratings.pivot(index='user_id', columns='movie_id', values='rating').fillna(0)
R_temp

| user_id \ movie_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 1673 | 1674 | 1675 | 1676 | 1677 | 1678 | 1679 | 1680 | 1681 | 1682 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5.0 | 3.0 | 4.0 | 3.0 | 3.0 | 5.0 | 4.0 | 1.0 | 5.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 4.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 939 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 940 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 4.0 | 5.0 | 3.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 941 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 942 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [12]:
# Testing MF RMSE
mf = NEW_MF(R_temp, K=30, alpha=0.001, beta=0.02, iterations=100, verbose=True) # verbose=True prints the RMSE every 10 iterations
test_set = mf.set_test(ratings_test)
result = mf.test()


Iteration: 10 ; Train RMSE = 0.9659 ; Test RMSE = 0.9833
Iteration: 20 ; Train RMSE = 0.9410 ; Test RMSE = 0.9644
Iteration: 30 ; Train RMSE = 0.9298 ; Test RMSE = 0.9565
Iteration: 40 ; Train RMSE = 0.9231 ; Test RMSE = 0.9522
Iteration: 50 ; Train RMSE = 0.9184 ; Test RMSE = 0.9496
Iteration: 60 ; Train RMSE = 0.9146 ; Test RMSE = 0.9477
Iteration: 70 ; Train RMSE = 0.9110 ; Test RMSE = 0.9463
Iteration: 80 ; Train RMSE = 0.9071 ; Test RMSE = 0.9451
Iteration: 90 ; Train RMSE = 0.9025 ; Test RMSE = 0.9438
Iteration: 100 ; Train RMSE = 0.8964 ; Test RMSE = 0.9422
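
The log above can also be inspected programmatically. The sketch below copies the printed values into a list shaped like the `training_process` list returned by `test()` (one `(iteration, train_rmse, test_rmse)` tuple per entry) and looks for the lowest test RMSE and the train/test gap. Only every 10th iteration is included here because that is all the log shows; in the notebook itself the same check could run on `result`.

```python
# Values copied from the printed log above: (iteration, train RMSE, test RMSE)
logged = [
    (10, 0.9659, 0.9833), (20, 0.9410, 0.9644), (30, 0.9298, 0.9565),
    (40, 0.9231, 0.9522), (50, 0.9184, 0.9496), (60, 0.9146, 0.9477),
    (70, 0.9110, 0.9463), (80, 0.9071, 0.9451), (90, 0.9025, 0.9438),
    (100, 0.8964, 0.9422),
]

# Iteration with the lowest test RMSE
best_iter, best_train, best_test = min(logged, key=lambda t: t[2])
print(best_iter, best_test)

# Gap between test and train RMSE at each logged iteration
gaps = [(it, round(te - tr, 4)) for it, tr, te in logged]
print(gaps[0], gaps[-1])
```

In this run the test RMSE is still decreasing at iteration 100, while the gap between test and train RMSE widens over the course of training.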


In [13]:
# Printing predictions
print(mf.full_prediction())
print('-' * 25)
print(mf.get_one_prediction(1, 2))

[[3.78430234 3.36718811 3.05836321 ... 3.3435388  3.46105491 3.44343456]
 [3.93066453 3.4707566  3.14697219 ... 3.41991483 3.55759529 3.55094461]
 [3.32624291 2.89546635 2.5365547  ... 2.81707493 2.93455307 2.92348546]
 ...
 [4.21800506 3.77571321 3.42783345 ... 3.70598755 3.83445771 3.82303292]
 [4.35649575 3.89975769 3.54555097 ... 3.8407829  3.95513537 3.95496368]
 [3.82108172 3.36898384 2.98273203 ... 3.29286379 3.42311374 3.42659454]]
-------------------------
3.3671881066834555


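- `mf.full_prediction()`: predicts and returns the full user-item rating matrix.
    - That is, it computes the expected rating for every user-item combination from the learned parameters.
    - Each row holds the predicted ratings of one user for all items, and each column holds the predicted ratings of all users for one item.
- `mf.get_one_prediction(1, 2)`: predicts the rating of a specific user for a specific item. In the output above, the result 3.3671881066834555 equals the entry at row 0, column 1 of the full matrix (3.36718811), because user_id 1 maps to row index 0 and movie_id 2 maps to column index 1.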
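
The relationship between the two methods can be checked on a small example. The sketch below uses randomly generated parameters with hypothetical sizes (4 users, 3 items, K=2) in place of a trained model, and confirms that the broadcast expression used in `full_prediction` agrees with the per-entry formula used in `get_prediction`.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, K = 4, 3, 2

# Stand-ins for trained parameters (random here, for illustration only)
P = rng.normal(size=(num_users, K))
Q = rng.normal(size=(num_items, K))
b = 3.5
b_u = rng.normal(size=num_users)
b_d = rng.normal(size=num_items)

# Same expression as full_prediction: biases are broadcast over rows and columns
full = b + b_u[:, np.newaxis] + b_d[np.newaxis, :] + P.dot(Q.T)

# Same expression as get_prediction for user index 1 and item index 2
single = b + b_u[1] + b_d[2] + P[1, :].dot(Q[2, :])

print(full.shape)
print(np.isclose(full[1, 2], single))
```

`b_u[:, np.newaxis]` has shape `(num_users, 1)` and `b_d[np.newaxis, :]` has shape `(1, num_items)`, so NumPy broadcasting adds each user bias along its row and each item bias along its column.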