# Collaborative Filtering

- A method that computes item-item or user-user similarity from users' item rating histories

#### How it works
1. Represent each user as a vector built from the rating data
2. Compute the similarity between users with cosine (or Pearson) similarity
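
The two steps above can be sketched on a toy example. The rating values and the helper names `cosine_similarity` and `pearson_similarity` are illustrative assumptions, not part of the notebook; Pearson is written here as the cosine similarity of mean-centered vectors.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson_similarity(u, v):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    return cosine_similarity(u - u.mean(), v - v.mean())

# Hypothetical ratings of two users on the same four items
user_a = np.array([5.0, 3.0, 4.0, 4.0])
user_b = np.array([3.0, 1.0, 2.0, 2.0])  # same taste, but rates 2 points lower

cos_ab = cosine_similarity(user_a, user_b)
pearson_ab = pearson_similarity(user_a, user_b)
print(f"cosine : {cos_ab:.3f}")
print(f"pearson: {pearson_ab:.3f}")
```

Because `user_b` is `user_a` shifted by a constant, Pearson treats the two users as perfectly correlated, while cosine similarity is slightly below 1.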


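#### Why use it?
- CF is based on user-item usage patterns, so it generally gives better recommendation quality than content-based filtering (CB)
- When there is a cold-start problem, CB is used more often
- There are two variants: user-based and item-based
- User-based: treats each user as a vector and filters on user-user similarity
- Item-based: treats each item as a vector and filters on item-item similarity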

#### Similarity measures
<img width='500' src='img/CF/유사도종류.png'>

- In general, using cosine similarity alone does not change performance much (this depends on the domain).
- ref: [The Prediction accuracy of a RS is really not affected by the choice of the similarity measure (seth 2015)]

#### Prediction method
<img width='500' src='img/CF/CF평점계산법.png'>

- For the item-based case, apply the same formula with the roles of users and items swapped
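
For reference, a commonly used similarity-weighted average can be written as follows. This is a sketch under assumed notation and may differ from the figure: $N(u)$ is the set of neighbors of user $u$ who rated item $i$, and $I(u)$ is the set of items rated by user $u$.

User-based:

$$
\hat{r}_{u,i} = \frac{\sum_{v \in N(u)} \mathrm{sim}(u, v)\, r_{v,i}}{\sum_{v \in N(u)} \lvert \mathrm{sim}(u, v) \rvert}
$$

Item-based (users and items swapped):

$$
\hat{r}_{u,i} = \frac{\sum_{j \in I(u)} \mathrm{sim}(i, j)\, r_{u,j}}{\sum_{j \in I(u)} \lvert \mathrm{sim}(i, j) \rvert}
$$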

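#### User-based vs. item-based: pros and cons
Pros
- User-based: generally higher accuracy
- Item-based: with enough data, item-item similarities change little over time, so recommendations are stable

Cons
- User-based: there are usually more users than items, so the sparse-matrix problem is worse and more storage is needed
- Item-based: generally lower accuracy

Conclusion
- In practice, item-based is usually the better choice.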

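#### Things to consider
- Only one similar vector is found when measuring similarity
- Only a few similar vectors (2-3) are found when measuring similarity

Remedies
- Apply filtering (exclude users or items with too few ratings from the similarity computation)
- Apply smoothing (when the number of ratings is below k, adjust the value using similarity/k)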
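
A minimal sketch of the two remedies, under loud assumptions: the function name `adjust_similarity`, the thresholds `k` and `min_ratings`, and the reading of the smoothing rule as scaling by `n_ratings / k` are all illustrative choices, not the notebook's definition.

```python
def adjust_similarity(sim, n_ratings, k=5, min_ratings=2):
    """Filter and shrink a similarity score based on how much history backs it.

    sim         : raw similarity between two vectors
    n_ratings   : number of ratings behind the similarity (assumed meaning)
    k           : hypothetical smoothing threshold
    min_ratings : hypothetical filtering threshold
    """
    if n_ratings < min_ratings:
        return 0.0                   # filtering: too little history, drop the neighbor
    if n_ratings < k:
        return sim * n_ratings / k   # smoothing: shrink the score toward 0
    return sim                       # enough history: keep the raw score

print(adjust_similarity(0.9, n_ratings=1))   # filtered out
print(adjust_similarity(0.8, n_ratings=4))   # shrunk
print(adjust_similarity(0.8, n_ratings=10))  # unchanged
```

The point of both remedies is the same: a high similarity that rests on only a handful of ratings should count for less than one backed by many.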

# Item-based Collaborative Filtering

Import Packages

In [4]:
from IPython.display import display, HTML
import math
import numpy as np
from numpy import linalg as LA
import pandas as pd

np.set_printoptions(precision=2)
pd.set_option('display.precision', 2)

def displayMovies(movies, movieIds, ratings=[]):

    html = ""

    for i, movieId in enumerate(movieIds):
        movie = movies[movies['movieId'] == movieId].iloc[0]

        html += f"""
            <div style="display:inline-block;min-width:150px;max-width:150px; vertical-align:top">
                <img src='{movie.imgurl}' width=120> <br/>
                <span>{movie.title}</span> <br/>
                {f"<span>{ratings[i]}</span> <br/>" if len(ratings) > 0 else ""}
                <ul>{"".join([f"<li>{genre}</li>" for genre in movie.genres.split('|')])}</ul>
            </div>
        """

    display(HTML(html))


def getMAE(real, pred):
    errors = real - pred
    return errors.abs().mean()

def getRMSE(real, pred):
    errors = real - pred
    return math.sqrt(errors.pow(2).mean())

## Read Data: movies and ratings

Read Movies

In [2]:
movies = pd.read_csv('movielens/movies_w_imgurl.csv')
movies

Unnamed: 0,movieId,imdbId,title,genres,imgurl
0,1,114709,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,https://images-na.ssl-images-amazon.com/images...
1,2,113497,Jumanji (1995),Adventure|Children|Fantasy,https://images-na.ssl-images-amazon.com/images...
2,3,113228,Grumpier Old Men (1995),Comedy|Romance,https://images-na.ssl-images-amazon.com/images...
3,4,114885,Waiting to Exhale (1995),Comedy|Drama|Romance,https://images-na.ssl-images-amazon.com/images...
4,5,113041,Father of the Bride Part II (1995),Comedy,https://images-na.ssl-images-amazon.com/images...
...,...,...,...,...,...
9120,162672,3859980,Mohenjo Daro (2016),Adventure|Drama|Romance,https://images-na.ssl-images-amazon.com/images...
9121,163056,4262980,Shin Godzilla (2016),Action|Adventure|Fantasy|Sci-Fi,https://images-na.ssl-images-amazon.com/images...
9122,163949,2531318,The Beatles: Eight Days a Week - The Touring Y...,Documentary,https://images-na.ssl-images-amazon.com/images...
9123,164977,27660,The Gay Desperado (1936),Comedy,https://images-na.ssl-images-amazon.com/images...


Read Rating Data

In [3]:
ratings = pd.read_csv('ratings-9_1.csv')

train = ratings[ratings['type'] == 'train'][['userId', 'movieId', 'rating']]
test = ratings[ratings['type'] == 'test'][['userId', 'movieId', 'rating']]

## Convert Ratings to Item-User Sparse Matrix
### Create Index to Id Maps

In [4]:
movieIds = train.movieId.unique()

movieIdToIndex = {}
indexToMovieId = {}

rowIdx = 0

for movieId in movieIds:
    movieIdToIndex[movieId] = rowIdx
    indexToMovieId[rowIdx] = movieId
    rowIdx += 1

In [5]:
userIds = train.userId.unique()

userIdToIndex = {}
indexToUserId = {}

colIdx = 0

for userId in userIds:
    userIdToIndex[userId] = colIdx
    indexToUserId[colIdx] = userId
    colIdx += 1

Build dictionaries that map each movieId and userId to a matrix index (and back).
They are used later when building the matrix with `coo_matrix`.
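
The same id-to-index maps can also be built more compactly with `enumerate`. The sketch below uses a hypothetical miniature table (`train_demo`) with the same columns as `train`; the snake_case names are illustrative.

```python
import pandas as pd

# Hypothetical miniature ratings table with the same columns as `train`
train_demo = pd.DataFrame({
    "userId": [10, 10, 20, 30],
    "movieId": [31, 1061, 31, 1129],
    "rating": [2.5, 3.0, 4.0, 5.0],
})

# unique() keeps the order of first appearance, so indices are stable
movie_ids = train_demo.movieId.unique()
movie_id_to_index = {movie_id: idx for idx, movie_id in enumerate(movie_ids)}
index_to_movie_id = dict(enumerate(movie_ids))

print(movie_id_to_index)
```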

### Create Item-User Sparse Matrix

In [6]:
import scipy.sparse as sp

rows = []
cols = []
vals = []

for row in train.itertuples():
    rows.append(movieIdToIndex[row.movieId])
    cols.append(userIdToIndex[row.userId])
    vals.append(row.rating)

coomat = sp.coo_matrix((vals, (rows, cols)), shape=(rowIdx, colIdx))

matrix = coomat.todense()
matrix

# coo_matrix example
'''
row  = np.array([0, 3, 1, 0])
col  = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
coo_matrix((data, (row, col)), shape=(4, 4)).toarray()
array([[4, 0, 9, 0],
       [0, 7, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 5]])
       '''

matrix([[2.5, 0. , 0. , ..., 0. , 0. , 0. ],
        [3. , 0. , 0. , ..., 0. , 0. , 0. ],
        [2. , 0. , 0. , ..., 0. , 0. , 0. ],
        ...,
        [0. , 0. , 0. , ..., 0. , 0. , 0. ],
        [0. , 0. , 0. , ..., 0. , 0. , 0. ],
        [0. , 0. , 0. , ..., 0. , 0. , 0. ]])

- With scipy's `coo_matrix`, the matrix can be built using little memory (note that calling `todense()` expands it back into a full dense matrix)
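
To make the memory argument concrete, the sketch below compares the storage of a sparse matrix with its dense equivalent. The 1000 x 1000 shape and the three ratings are made-up values for illustration.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical 1000 x 1000 rating matrix that holds only three ratings
rows = np.array([0, 3, 999])
cols = np.array([0, 7, 999])
vals = np.array([4.0, 5.0, 2.5])

coo = sp.coo_matrix((vals, (rows, cols)), shape=(1000, 1000))
csr = coo.tocsr()  # CSR supports fast row slicing and matrix products

# A CSR matrix stores only the non-zero values plus two index arrays
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
dense_bytes = coo.toarray().nbytes

print(f"sparse: {sparse_bytes} bytes, dense: {dense_bytes} bytes")
```

The saving only holds while the data stays in sparse form, so for large datasets it can be worth keeping the matrix sparse through the similarity computation instead of densifying it first.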

## Compute Item-Item Similarities

Compute $l_2$-norm

In [7]:
from numpy import linalg as LA

norms = LA.norm(matrix, ord = 2, axis=1)
norms

array([20.71, 20.35, 22.94, ...,  3.  ,  1.  ,  1.  ])

Normalize Row Vectors

In [8]:
normmat = np.divide(matrix.T, norms).T
normmat

matrix([[0.12, 0.  , 0.  , ..., 0.  , 0.  , 0.  ],
        [0.15, 0.  , 0.  , ..., 0.  , 0.  , 0.  ],
        [0.09, 0.  , 0.  , ..., 0.  , 0.  , 0.  ],
        ...,
        [0.  , 0.  , 0.  , ..., 0.  , 0.  , 0.  ],
        [0.  , 0.  , 0.  , ..., 0.  , 0.  , 0.  ],
        [0.  , 0.  , 0.  , ..., 0.  , 0.  , 0.  ]])

Compute Similarities ( = inner product)

In [9]:
sims = pd.DataFrame(data = np.matmul(normmat, normmat.T), index = movieIds, columns=movieIds)
sims

Unnamed: 0,31,1061,1129,1172,1287,1293,1339,1343,1371,1405,...,134528,134783,137595,138204,60832,64997,72380,129,4736,6425
31,1.00,0.20,0.13,0.07,0.09,0.15,0.07,0.18,0.12,0.14,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0
1061,0.20,1.00,0.25,0.18,0.14,0.13,0.25,0.20,0.16,0.23,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.15,0.15,0.0
1129,0.13,0.25,1.00,0.17,0.25,0.16,0.26,0.30,0.34,0.31,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0
1172,0.07,0.18,0.17,1.00,0.27,0.33,0.12,0.12,0.11,0.16,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0
1287,0.09,0.14,0.25,0.27,1.00,0.32,0.17,0.07,0.18,0.18,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64997,0.00,0.00,0.00,0.00,0.00,0.00,0.15,0.00,0.00,0.14,...,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.00,0.00,0.0
72380,0.00,0.00,0.00,0.00,0.00,0.00,0.15,0.00,0.00,0.14,...,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.00,0.00,0.0
129,0.00,0.15,0.00,0.00,0.00,0.00,0.13,0.00,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.00,1.00,0.0
4736,0.00,0.15,0.00,0.00,0.00,0.00,0.13,0.00,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.00,1.00,0.0
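
The normalize-then-multiply steps above can be checked on a small matrix. This is a sketch with made-up ratings; the guard for all-zero rows is an added assumption to avoid division by zero and is not part of the notebook cells.

```python
import numpy as np

# Hypothetical item-user matrix: 3 items (rows) x 4 users (columns), 0 = not rated
item_user = np.array([
    [5.0, 0.0, 3.0, 0.0],
    [4.0, 0.0, 3.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],   # an item nobody rated
])

norms = np.linalg.norm(item_user, axis=1, keepdims=True)
safe_norms = np.where(norms == 0, 1.0, norms)   # leave all-zero rows as zeros
unit_rows = item_user / safe_norms

# The inner product of unit-length rows is their cosine similarity
item_sims = unit_rows @ unit_rows.T
print(np.round(item_sims, 2))
```

Rated items get a similarity of 1 with themselves, and the unrated item gets 0 everywhere rather than NaN.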


### Similarity Example

In [10]:
movieIdx = 6

rels = sims.iloc[movieIdx,:].sort_values(ascending=False).head(6)[1:]

displayMovies(movies, [indexToMovieId[movieIdx]])
displayMovies(movies, rels.index, rels.values)

## User Rating Prediction

In [11]:
userId = 33

userRatings = train[train['userId'] == userId][['movieId', 'rating']] 

userRatings

Unnamed: 0,movieId,rating
6176,19,3.0
6177,88,3.0
6178,157,1.0
6179,231,3.0
6180,344,4.0
...,...,...
6309,5282,4.0
6310,5339,4.0
6311,5483,4.0
6312,5669,4.0


### Predict Ratings
- Predict the ratings for userId 33 and compute the error

In [12]:
recSimSums = sims.loc[userRatings['movieId'].values, :].sum().values

recWeightedRatingSums = np.matmul(sims.loc[userRatings['movieId'].values, :].T.values, userRatings['rating'].values)

recItemRatings = pd.DataFrame(data = np.divide(recWeightedRatingSums, recSimSums), index=sims.index)

recItemRatings.columns = ['pred']

recItemRatings



Unnamed: 0,pred
31,3.43
1061,3.38
1129,3.36
1172,3.49
1287,3.42
...,...
64997,3.23
72380,3.23
129,3.00
4736,3.00
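
The prediction cell computes, for every target item, a similarity-weighted average of the ratings the user has already given (`recWeightedRatingSums` divided by `recSimSums`). A toy version with a hypothetical 3-item similarity matrix and made-up ratings:

```python
import pandas as pd

# Hypothetical item-item similarity matrix for items A, B, C
items = ["A", "B", "C"]
sims_demo = pd.DataFrame(
    [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.2],
     [0.0, 0.2, 1.0]],
    index=items, columns=items,
)

# The user has rated A and B; C is unseen
rated = pd.Series({"A": 4.0, "B": 2.0})

rated_sims = sims_demo.loc[rated.index, :]   # rows: items the user rated
sim_sums = rated_sims.sum(axis=0)            # denominator per target item
weighted = rated_sims.T.dot(rated)           # numerator per target item
pred = weighted / sim_sums

print(pred.round(2))
```

Item C is only similar to B, so its prediction equals the user's rating of B. If a target item has zero similarity to every rated item, the denominator is 0 and the prediction becomes NaN.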


In [13]:
top30Movies = recItemRatings.sort_values(by='pred', ascending=False).head(30)

displayMovies(movies, top30Movies.index, top30Movies['pred'].values)

### Compute Errors (MAE, RMSE)

In [14]:
userTestRatings = pd.DataFrame(data=test[test['userId'] == userId])

temp = userTestRatings.join(recItemRatings.loc[userTestRatings['movieId']], on='movieId')

mae = getMAE(temp['rating'], temp['pred'])
rmse = getRMSE(temp['rating'], temp['pred'])

print(f"MAE : {mae:.4f}")
print(f"RMSE: {rmse:.4f}")

MAE : 0.7424
RMSE: 0.8455
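
As a worked check of the two metrics, the sketch below applies the same formulas as `getMAE` and `getRMSE` to three made-up rating pairs.

```python
import math
import pandas as pd

# Hypothetical true and predicted ratings
real = pd.Series([4.0, 2.0, 5.0])
pred = pd.Series([3.5, 3.0, 4.5])

errors = real - pred                     # 0.5, -1.0, 0.5
mae = errors.abs().mean()                # (0.5 + 1.0 + 0.5) / 3
rmse = math.sqrt(errors.pow(2).mean())   # sqrt((0.25 + 1.0 + 0.25) / 3)

print(f"MAE : {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
```

RMSE squares each error before averaging, so it is never smaller than MAE and penalizes large misses more heavily.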


### Compare Logs and Recommendations

In [15]:
logs = userRatings.sort_values(by='rating', ascending=False).head(20)
recs = recItemRatings.sort_values(by='pred', ascending=False).head(20)

In [16]:
displayMovies(movies, logs['movieId'].values, logs['rating'].values)
displayMovies(movies, recs.index, recs['pred'].values)

# User-based Collaborative Filtering

## Read Data: movies and ratings

Read Movies

In [10]:
movies = pd.read_csv('data/movielens/movies_w_imgurl.csv')
movies

Unnamed: 0,movieId,imdbId,title,genres,imgurl
0,1,114709,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,https://images-na.ssl-images-amazon.com/images...
1,2,113497,Jumanji (1995),Adventure|Children|Fantasy,https://images-na.ssl-images-amazon.com/images...
2,3,113228,Grumpier Old Men (1995),Comedy|Romance,https://images-na.ssl-images-amazon.com/images...
3,4,114885,Waiting to Exhale (1995),Comedy|Drama|Romance,https://images-na.ssl-images-amazon.com/images...
4,5,113041,Father of the Bride Part II (1995),Comedy,https://images-na.ssl-images-amazon.com/images...
...,...,...,...,...,...
9120,162672,3859980,Mohenjo Daro (2016),Adventure|Drama|Romance,https://images-na.ssl-images-amazon.com/images...
9121,163056,4262980,Shin Godzilla (2016),Action|Adventure|Fantasy|Sci-Fi,https://images-na.ssl-images-amazon.com/images...
9122,163949,2531318,The Beatles: Eight Days a Week - The Touring Y...,Documentary,https://images-na.ssl-images-amazon.com/images...
9123,164977,27660,The Gay Desperado (1936),Comedy,https://images-na.ssl-images-amazon.com/images...


Read Rating Data

In [11]:
ratings = pd.read_csv('data/ratings-9_1.csv')

train = ratings[ratings['type'] == 'train'][['userId', 'movieId', 'rating']]
test = ratings[ratings['type'] == 'test'][['userId', 'movieId', 'rating']]

## Convert Ratings to User-Item Sparse Matrix

### Create Index to Id Maps

In [12]:
movieIds = train.movieId.unique()

movieIdToIndex = {}
indexToMovieId = {}

colIdx = 0

for movieId in movieIds:
    movieIdToIndex[movieId] = colIdx
    indexToMovieId[colIdx] = movieId
    colIdx += 1

In [13]:
userIds = train.userId.unique()

userIdToIndex = {}
indexToUserId = {}

rowIdx = 0

for userId in userIds:
    userIdToIndex[userId] = rowIdx
    indexToUserId[rowIdx] = userId
    rowIdx += 1

### Create User-Item Sparse Matrix

In [14]:
rows = []
cols = []
vals = []

for row in train.itertuples():
    rows.append(userIdToIndex[row.userId])
    cols.append(movieIdToIndex[row.movieId])
    vals.append(row.rating)

coomat = sp.coo_matrix((vals, (rows, cols)), shape=(rowIdx, colIdx))

matrix = coomat.todense()
matrix.shape

(671, 8740)

## Compute User-User Similarities

Compute $l_2$-norm

In [15]:
norms = LA.norm(matrix, ord = 2, axis=1)
norms.shape

(671,)

Normalize Row Vectors

In [16]:
normmat = np.divide(matrix.T, norms).T
normmat.shape

(671, 8740)

Compute Similarities ( = inner product)

In [17]:
sims = pd.DataFrame(data = np.matmul(normmat, normmat.T), index = userIds, columns=userIds)
sims

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
1,1.00,0.00e+00,0.00,0.07,0.02,0.00,0.08,0.00,1.38e-02,0.00,...,0.00,0.00,0.02,0.03,0.00,0.00,0.00,6.72e-02,0.00,0.00
2,0.00,1.00e+00,0.10,0.11,0.10,0.00,0.18,0.11,1.18e-01,0.05,...,0.46,0.04,0.07,0.15,0.43,0.35,0.09,6.22e-03,0.15,0.07
3,0.00,1.04e-01,1.00,0.08,0.15,0.04,0.17,0.27,1.43e-01,0.08,...,0.16,0.07,0.14,0.16,0.19,0.10,0.14,8.54e-02,0.11,0.16
4,0.07,1.07e-01,0.08,1.00,0.11,0.07,0.29,0.15,9.89e-03,0.10,...,0.11,0.03,0.11,0.21,0.13,0.07,0.08,9.22e-02,0.04,0.14
5,0.02,1.04e-01,0.15,0.11,1.00,0.07,0.08,0.13,9.35e-02,0.04,...,0.19,0.02,0.11,0.20,0.15,0.03,0.05,2.06e-02,0.06,0.22
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,0.00,3.51e-01,0.10,0.07,0.03,0.00,0.21,0.06,6.84e-02,0.04,...,0.28,0.03,0.07,0.12,0.29,1.00,0.10,1.88e-02,0.16,0.09
668,0.00,8.98e-02,0.14,0.08,0.05,0.02,0.07,0.12,2.02e-01,0.11,...,0.08,0.06,0.10,0.08,0.12,0.10,1.00,0.00e+00,0.15,0.12
669,0.07,6.22e-03,0.09,0.09,0.02,0.02,0.08,0.06,2.95e-02,0.04,...,0.02,0.03,0.08,0.09,0.03,0.02,0.00,1.00e+00,0.05,0.09
670,0.00,1.46e-01,0.11,0.04,0.06,0.00,0.06,0.22,3.38e-01,0.15,...,0.14,0.10,0.10,0.11,0.19,0.16,0.15,4.84e-02,1.00,0.22


## Similarity Example

In [18]:
userId = 33
topK = 5

In [19]:
simUsers = sims.loc[userId, :].sort_values(ascending=False).head(6).tail(5)
simUsers

598    0.20
457    0.20
350    0.19
461    0.19
15     0.18
Name: 33, dtype: float64

show liked user movies

In [20]:
def displayLikedUserMovies(movies, userId, topK):
    topKRatings = train[train['userId'] == userId].sort_values(by='rating', ascending=False).head(topK)
    display(HTML(f"<h3>{userId}</h3><hr>"))
    displayMovies(movies, topKRatings.movieId.values, topKRatings.rating.values)

In [21]:
for index, simUser in simUsers.items():  # Series.iteritems() was removed in pandas 2.0
    displayLikedUserMovies(movies, index, topK)

## User Rating Prediction

set test user

In [22]:
userId = 33

### Predict Ratings
- Predict the ratings for userId 33 and compute the error

In [23]:
ratingDF = pd.DataFrame(data=matrix, index=userIds, columns=movieIds)
# 1 where a rating exists, 0 otherwise (0 in the matrix means "not rated")
binDF = (ratingDF > 0).astype(int)

In [24]:
userAvgRatings = pd.DataFrame(data = ratingDF.sum(axis=1).divide(binDF.sum(axis=1)), columns=['avg'])
userAvgRatings

Unnamed: 0,avg
1,2.69
2,3.47
3,3.60
4,4.36
5,3.89
...,...
667,3.67
668,3.83
669,3.35
670,3.72


In [25]:
simUsers = sims.loc[userId, :]
simUsers[userId] = 0

In [26]:
simRatingSums = (ratingDF - binDF.T.multiply(userAvgRatings['avg']).T).T.multiply(simUsers).T.sum(axis=0)
simSums = binDF.T.multiply(simUsers).T.sum(axis=0)
recItemRatings = userAvgRatings.loc[userId].avg + pd.Series(data = simRatingSums.divide(simSums), name='pred')
recItemRatings.fillna(0, inplace=True)

recItemRatings

31       2.84
1061     3.44
1129     3.15
1172     3.80
1287     3.68
         ... 
64997    1.96
72380    2.96
129      2.97
4736     0.97
6425     0.41
Name: pred, Length: 8740, dtype: float64

### Modified rating prediction formula
<img width='500' src='CF평점예측변형.png'>

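- The target user's average rating is added, and each neighbor's own average is subtracted from that neighbor's rating inside the similarity-weighted sum
- As a result, predicted ratings do not stray far from the user's average.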
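
The mean-centered variant can be traced on a toy example. This is a sketch with made-up ratings and similarities; unlike the notebook cells, it marks missing ratings with NaN instead of 0, and all names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical user-item ratings; NaN = not rated
ratings_demo = pd.DataFrame(
    {"m1": [4.0, 5.0, 2.0], "m2": [np.nan, 3.0, 1.0], "m3": [2.0, np.nan, 3.0]},
    index=["u1", "u2", "u3"],
)
# Hypothetical similarities of the neighbors to the target user u1
sim_to_u1 = pd.Series({"u2": 0.8, "u3": 0.4})

user_means = ratings_demo.mean(axis=1)                 # NaN entries are skipped
neighbors = ratings_demo.loc[sim_to_u1.index]
deviations = neighbors.sub(user_means.loc[sim_to_u1.index], axis=0)

# Weighted sum of deviations; missing ratings contribute nothing
numerator = deviations.mul(sim_to_u1, axis=0).sum(axis=0)
# Sum of similarities over the neighbors who actually rated each item
rated_mask = neighbors.notna().astype(float)
denominator = rated_mask.mul(sim_to_u1, axis=0).sum(axis=0)

pred = user_means["u1"] + numerator / denominator
print(pred.round(2))
```

Each neighbor contributes how far above or below their own average they rated the item, so a generous rater and a strict rater are put on the same scale before being combined.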

### Compute Errors (MAE, RMSE)

In [27]:
userTestRatings = pd.DataFrame(data=test[test['userId'] == userId])

temp = userTestRatings.join(recItemRatings.loc[userTestRatings['movieId']], on='movieId')

mae = getMAE(temp['rating'], temp['pred'])
rmse = getRMSE(temp['rating'], temp['pred'])

print(f"MAE : {mae:.4f}")
print(f"RMSE: {rmse:.4f}")

MAE : 0.5524
RMSE: 0.6825


In [28]:
temp

Unnamed: 0,userId,movieId,rating,pred
6187,33,1060,4.0,4.0
6198,33,1291,4.0,3.67
6199,33,1347,2.0,3.34
6208,33,1982,4.0,3.45
6212,33,2005,4.0,3.6
6215,33,2064,5.0,4.0
6257,33,3794,4.0,3.27
6292,33,4678,3.0,3.54
6303,33,4974,3.0,2.91
