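# Content-based (CB)

- Build a content-based (CB) recommender using the genres in the MovieLens data.
- Convert the genres into TF, IDF, and TF-IDF weights, and predict ratings from cosine similarity for each.
- Predict ratings by multiplying the TF-IDF values by the ratings.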

In [1]:
import math
import numpy as np
from numpy import linalg as LA
import pandas as pd

### Movies Weight Matrix on Genres

Read movie metadata from a csv file.

In [2]:
movies = pd.read_csv('data/movielens/movies_w_imgurl.csv')
movies.head()

Unnamed: 0,movieId,imdbId,title,genres,imgurl
0,1,114709,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,https://images-na.ssl-images-amazon.com/images...
1,2,113497,Jumanji (1995),Adventure|Children|Fantasy,https://images-na.ssl-images-amazon.com/images...
2,3,113228,Grumpier Old Men (1995),Comedy|Romance,https://images-na.ssl-images-amazon.com/images...
3,4,114885,Waiting to Exhale (1995),Comedy|Drama|Romance,https://images-na.ssl-images-amazon.com/images...
4,5,113041,Father of the Bride Part II (1995),Comedy,https://images-na.ssl-images-amazon.com/images...


Split genres and stack genres into one column.

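## TF (Term Frequency)
- A value indicating how often a particular term appears in a document.
- Three ways to compute TF:

1. Boolean frequency: tf(t,d) = 1 if t appears in d at least once, otherwise 0
2. Log-scaled frequency: tf(t,d) = log(f(t,d) + 1)
3. Augmented frequency: the target term's frequency divided by the frequency of the most frequent term in the document. It is generally used to adjust term frequencies when documents are relatively long.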
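The three TF variants listed above can be sketched as follows. This is a minimal illustration, not code from the notebook: the function name `term_frequencies` and the toy token list are assumptions, and the augmented form follows the description in the list (a commonly used smoothed variant is `0.5 + 0.5 * count / max_count`).

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """Return boolean, log-scaled, and augmented TF for each term of one document."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())  # frequency of the most frequent term
    return {
        term: {
            "boolean": 1,                        # the term occurs at least once
            "log": math.log(count + 1),          # log(f(t,d) + 1)
            "augmented": count / max_count,      # f(t,d) / max_w f(w,d)
        }
        for term, count in counts.items()
    }

# Hypothetical document in which "comedy" occurs twice
tf = term_frequencies(["comedy", "romance", "comedy", "drama"])
print(tf["comedy"], tf["romance"])
```

In the genre data used here each genre appears at most once per movie, so the weights built later effectively rely on the boolean form.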

In [3]:
# Split the pipe-separated genres into lists, then explode to one row per (movie, genre).
# The original movie row index is repeated for each of its genres.
movieGenres = movies['genres'].str.split('|').explode().to_frame('genre')

The genres are now stacked into a single column of the movieGenres DataFrame.

In [4]:
movieGenres

Unnamed: 0,genre
0,Adventure
0,Animation
0,Children
0,Comedy
0,Fantasy
...,...
9121,Fantasy
9121,Sci-Fi
9122,Documentary
9123,Comedy


Count movies that have each genre and then compute IDF of genres.

In [34]:
genres = pd.DataFrame(data=movieGenres.groupby('genre')['genre'].count())
genres.columns = ['movieCount']

totalItems = movies.shape[0]

genres['idf'] = genres['movieCount'].apply(lambda x: math.log10(totalItems/x))

genres.head()

genre,movieCount,idf
(no genres listed),18,2.70496
Action,1545,0.771304
Adventure,1117,0.91218
Animation,447,1.309925
Children,583,1.194564


Join each genre's IDF to the movie-genre DataFrame.

In [13]:
movieGenreWeights = movieGenres.join(genres['idf'], on='genre')
movieGenreWeights

Unnamed: 0,genre,idf
0,Adventure,0.912180
0,Animation,1.309925
0,Children,1.194564
0,Comedy,0.439749
0,Fantasy,1.144655
...,...,...
9121,Fantasy,1.144655
9121,Sci-Fi,1.061508
9122,Documentary,1.265628
9123,Comedy,0.439749


In [14]:
movieWeights = movies[['movieId']]

for genre in genres.index:
    movieGenreIdf = movieGenreWeights[movieGenreWeights['genre'] == genre][['idf']]
    movieGenreIdf = movieGenreIdf.rename(columns={'idf':genre})
    movieWeights = movieWeights.join(movieGenreIdf)

movieWeights.fillna(0, inplace=True)
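The same movie-by-genre weight matrix could also be built without a loop. The sketch below is one possible alternative, not the notebook's method: it assumes pipe-separated genre strings like those in `movies['genres']`, and `toy_movies` and `toy_idf` are made-up stand-ins with illustrative values.

```python
import pandas as pd

# Toy stand-ins for the notebook's `movies` and `genres['idf']` (illustrative values)
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Adventure|Comedy", "Comedy|Romance", "Comedy"],
})
toy_idf = pd.Series({"Adventure": 0.9, "Comedy": 0.4, "Romance": 0.8})

# One 0/1 indicator column per genre, one row per movie
indicator = toy_movies["genres"].str.get_dummies(sep="|")

# Multiplying by the Series aligns on column names, scaling each genre by its IDF
toy_weights = indicator * toy_idf
toy_weights.insert(0, "movieId", toy_movies["movieId"])
print(toy_weights)
```

Genres a movie does not have come out as 0 directly, so no `fillna` step is needed in this variant.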

In [22]:
display(movieGenreIdf)
display(movieWeights)

Unnamed: 0,Western
142,1.734924
238,1.734924
271,1.734924
332,1.734924
347,1.734924
...,...
8595,1.734924
8696,1.734924
8824,1.734924
9024,1.734924


Unnamed: 0,movieId,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,0.0,0.000000,0.91218,1.309925,1.194564,0.439749,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0
1,2,0.0,0.000000,0.91218,0.000000,1.194564,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0
2,3,0.0,0.000000,0.00000,0.000000,0.000000,0.439749,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.771304,0.000000,0.0,0.0,0.0
3,4,0.0,0.000000,0.00000,0.000000,0.000000,0.439749,0.0,0.000000,0.320249,...,0.0,0.0,0.0,0.0,0.0,0.771304,0.000000,0.0,0.0,0.0
4,5,0.0,0.000000,0.00000,0.000000,0.000000,0.439749,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9120,162672,0.0,0.000000,0.91218,0.000000,0.000000,0.000000,0.0,0.000000,0.320249,...,0.0,0.0,0.0,0.0,0.0,0.771304,0.000000,0.0,0.0,0.0
9121,163056,0.0,0.771304,0.91218,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,1.061508,0.0,0.0,0.0
9122,163949,0.0,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,1.265628,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0
9123,164977,0.0,0.000000,0.00000,0.000000,0.000000,0.439749,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0


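## IDF (Inverse Document Frequency)
- If a term is used frequently across the document collection, the term is common.
- DF: the number of documents in the collection that contain the term.
- IDF: obtained by dividing the total number of documents (n) by the number of documents containing the term (DF) and taking the logarithm.   
In other words, it measures how common a term is across the whole collection: the more documents contain it, the smaller its IDF.   

- Why take the log: if IDF were simply the ratio n/DF, its value would become very large for rare terms as the total number of documents n grows. Taking the log dampens this growth.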
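To make the effect of the log concrete, the sketch below compares n/DF with log10(n/DF) for three of the genre counts shown in the table above. The total of 9125 movies is an assumption, chosen because it is consistent with the IDF values displayed there; the variable names are illustrative.

```python
import math

total_items = 9125  # assumption: consistent with the IDF values in the table above
movie_counts = {"(no genres listed)": 18, "Action": 1545, "Animation": 447}

idf_values = {}
for genre, count in movie_counts.items():
    raw_ratio = total_items / count          # n / DF, without the log
    idf_values[genre] = math.log10(raw_ratio)  # same base as the notebook's math.log10
    print(f"{genre:20s} n/DF = {raw_ratio:8.2f}   log10(n/DF) = {idf_values[genre]:.4f}")
```

The raw ratios differ by almost two orders of magnitude between the rarest and the most common genre here, while the logged values stay within a narrow range.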

### Movie-Movie Cosine Similarity Matrix

Compute the $l_2$-norm of each movie's genre weight vector.

In [24]:
movieNorms = pd.DataFrame(data = LA.norm(movieWeights.iloc[:,1:].values, ord=2, axis=1), index=movieWeights.index, columns=['norm2'])
movieNorms

Unnamed: 0,norm2
0,2.34
1,1.89
2,0.89
3,0.94
4,0.44
...,...
9120,1.24
9121,1.97
9122,1.27
9123,0.44


Normalize each movie vector to unit length so that cosine similarity reduces to a plain inner product between vectors.

$$ cosine(u, v)=\frac{\sum_{\forall i}{u_i v_i}}{||u||_2||v||_2}=\sum_{\forall i}{\frac{u_i v_i}{||u||_2||v||_2}}=\sum_{\forall i}{\frac{u_i}{||u||_2}\frac{v_i}{||v||_2}}=u'\cdot v'$$
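The identity above can be checked numerically. The sketch below uses the Comedy, Drama, and Romance weights of movieId 3 and movieId 4 from the movieWeights output shown earlier; the variable names are illustrative.

```python
import numpy as np

# Genre weights in the order (Comedy, Drama, Romance), copied from the movieWeights output
u = np.array([0.439749, 0.0, 0.771304])       # movieId 3
v = np.array([0.439749, 0.320249, 0.771304])  # movieId 4

# Left-hand side: dot product divided by the product of the l2-norms
cosine_direct = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Right-hand side: normalize first, then take the inner product
u_unit = u / np.linalg.norm(u)
v_unit = v / np.linalg.norm(v)
cosine_normalized = u_unit @ v_unit

print(cosine_direct, cosine_normalized)
```

Both expressions give the same value, which is why normalizing the rows once lets a single matrix product produce every pairwise similarity.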

In [25]:
normalizedMovieWeights = movieWeights.iloc[:, 1:].divide(movieNorms['norm2'], axis=0)

normalizedMovieWeights

Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0.0,0.00,0.39,0.56,0.51,0.19,0.0,0.0,0.00,0.49,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0,0.0,0.0
1,0.0,0.00,0.48,0.00,0.63,0.00,0.0,0.0,0.00,0.61,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0,0.0,0.0
2,0.0,0.00,0.00,0.00,0.00,0.50,0.0,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.87,0.00,0.0,0.0,0.0
3,0.0,0.00,0.00,0.00,0.00,0.47,0.0,0.0,0.34,0.00,0.0,0.0,0.0,0.0,0.0,0.82,0.00,0.0,0.0,0.0
4,0.0,0.00,0.00,0.00,0.00,1.00,0.0,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9120,0.0,0.00,0.74,0.00,0.00,0.00,0.0,0.0,0.26,0.00,0.0,0.0,0.0,0.0,0.0,0.62,0.00,0.0,0.0,0.0
9121,0.0,0.39,0.46,0.00,0.00,0.00,0.0,0.0,0.00,0.58,0.0,0.0,0.0,0.0,0.0,0.00,0.54,0.0,0.0,0.0
9122,0.0,0.00,0.00,0.00,0.00,0.00,0.0,1.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0,0.0,0.0
9123,0.0,0.00,0.00,0.00,0.00,1.00,0.0,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0,0.0,0.0


Create item-item similarity matrix

In [26]:
# Multiply the underlying NumPy arrays; the movieId labels are attached below
sims = pd.DataFrame(data=np.matmul(normalizedMovieWeights.values, normalizedMovieWeights.values.T))

sims.index = movieWeights['movieId']
sims.columns = movieWeights['movieId']

sims


movieId,1,2,3,4,5,6,7,8,9,10,...,161830,161918,161944,162376,162542,162672,163056,163949,164977,164979
1,1.00,0.81,0.09,0.09,0.19,0.00,0.09,0.64,0.00,0.25,...,0.00,0.19,0.00,0.00,0.00,0.29,0.47,0.0,0.19,0.0
2,0.81,1.00,0.00,0.00,0.00,0.00,0.00,0.80,0.00,0.32,...,0.00,0.23,0.00,0.00,0.00,0.36,0.58,0.0,0.00,0.0
3,0.09,0.00,1.00,0.94,0.50,0.00,1.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.63,0.54,0.00,0.0,0.50,0.0
4,0.09,0.00,0.94,1.00,0.47,0.00,0.94,0.00,0.00,0.00,...,0.08,0.00,0.34,0.34,0.60,0.60,0.00,0.0,0.47,0.0
5,0.19,0.00,0.50,0.47,1.00,0.00,0.50,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.0,1.00,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162672,0.29,0.36,0.54,0.60,0.00,0.00,0.54,0.45,0.00,0.48,...,0.06,0.36,0.26,0.26,0.46,1.00,0.34,0.0,0.00,0.0
163056,0.47,0.58,0.00,0.00,0.00,0.22,0.00,0.28,0.39,0.52,...,0.00,0.69,0.00,0.00,0.00,0.34,1.00,0.0,0.00,0.0
163949,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.0,0.00,1.0
164977,0.19,0.00,0.50,0.47,1.00,0.00,0.50,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.0,1.00,0.0
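Once the similarity matrix exists, the movies most similar to a given one can be read off a single row. The sketch below is a hypothetical helper, not part of the notebook: `most_similar` is an assumed name, and `toy_sims` copies the similarities among movieIds 1, 2, and 3 from the output above.

```python
import pandas as pd

# Similarities among movieIds 1, 2, 3, copied from the sims output above
toy_sims = pd.DataFrame(
    [[1.00, 0.81, 0.09],
     [0.81, 1.00, 0.00],
     [0.09, 0.00, 1.00]],
    index=[1, 2, 3], columns=[1, 2, 3],
)

def most_similar(sims, movie_id, k=2):
    """Return the k movies most similar to movie_id, excluding the movie itself."""
    return sims.loc[movie_id].drop(movie_id).nlargest(k)

print(most_similar(toy_sims, 1))
```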


## Recommend Movies based on Predicted Ratings

Read ratings as train and test datasets.

In [27]:
ratings = pd.read_csv('ratings-9_1.csv')

train = ratings[ratings['type'] == 'train'][['userId', 'movieId', 'rating']]
test = ratings[ratings['type'] == 'test'][['userId', 'movieId', 'rating']]

Set test user ID

In [28]:
userId = 33

Check top rated movies of the test user

In [29]:
userRatings = train[train['userId'] == userId][['movieId', 'rating']] 

topRatings = userRatings.sort_values(by='rating', ascending=False).head(20)

topRatings

displayMovies(movies, topRatings['movieId'].values, topRatings['rating'].values)

Predict item ratings for the test user.

In [30]:
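# For each candidate movie, sum its similarities to the movies the user rated
recSimSums = sims.loc[userRatings['movieId'].values, :].sum().values

# Add 1 to the denominator so it is never zero
recSimSums = recSimSums + 1

# For each candidate movie, sum the user's ratings weighted by similarity
recWeightedRatingSums = np.matmul(sims.loc[userRatings['movieId'].values, :].T.values, userRatings['rating'].values)

# Predicted rating = weighted rating sum / (1 + similarity sum)
recItemRatings = pd.DataFrame(data = np.divide(recWeightedRatingSums, recSimSums), index=sims.index)

recItemRatings.columns = ['pred']

recItemRatings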

# np.matmul()

movieId,pred
1,2.99
2,2.72
3,3.21
4,3.22
5,3.22
...,...
162672,3.07
163056,2.75
163949,2.67
164977,3.22
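The prediction step above computes, for each movie j, the sum over rated movies i of sim(i, j) times the rating of i, divided by 1 plus the sum of sim(i, j). The sketch below replays that arithmetic on a toy example; the 3x3 similarity matrix, the movie IDs, and the ratings are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy item-item similarity matrix (illustrative values, not from the dataset)
toy_sims = pd.DataFrame(
    [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.2],
     [0.0, 0.2, 1.0]],
    index=[10, 20, 30], columns=[10, 20, 30],
)
# A hypothetical user who rated movies 10 and 20
toy_ratings = pd.DataFrame({"movieId": [10, 20], "rating": [4.0, 2.0]})

rated = toy_sims.loc[toy_ratings["movieId"].values, :]     # rows: rated movies
sim_sums = rated.sum().values + 1                           # 1 + sum_i sim(i, j)
weighted = rated.T.values @ toy_ratings["rating"].values    # sum_i sim(i, j) * r_i
toy_pred = pd.Series(weighted / sim_sums, index=toy_sims.index, name="pred")
print(toy_pred)
```

Movie 30 is only weakly similar to the rated movies, so the added 1 dominates its denominator and its prediction comes out at about 0.33. This shows how the smoothing term pulls predictions toward zero when little similarity supports them.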


Check recommended items

In [31]:
top30Movies = recItemRatings.sort_values(by='pred', ascending=False).head(30)

displayMovies(movies, top30Movies.index, top30Movies['pred'].values)

Compute MAE and RMSE for the test user.

In [32]:
userTestRatings = pd.DataFrame(data=test[test['userId'] == userId])

temp = userTestRatings.join(recItemRatings.loc[userTestRatings['movieId']], on='movieId')

mae = getMAE(temp['rating'], temp['pred'])
rmse = getRMSE(temp['rating'], temp['pred'])

print(f"MAE : {mae:.4f}")
print(f"RMSE: {rmse:.4f}")

MAE : 0.9682
RMSE: 1.1347
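The helpers `getMAE` and `getRMSE` are called above but not defined in this section. A minimal sketch of what they might look like follows, assuming each takes two aligned sequences of actual and predicted ratings; the notebook's real implementations may differ.

```python
import numpy as np

def getMAE(actual, predicted):
    """Mean absolute error between two aligned sequences of ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

def getRMSE(actual, predicted):
    """Root mean squared error between two aligned sequences of ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical ratings: the errors are 1 and -2
print(getMAE([4, 2], [3, 4]), getRMSE([4, 2], [3, 4]))
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE. That matches the result above, where RMSE exceeds MAE.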
