<a href="https://colab.research.google.com/github/jieunjeon/Data-Science-Fundamental/blob/master/Exploration/%5BE_09%5D_Movie_Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Recommender System (MovieLens)

## Goal of this project:
- Understand the concept and purpose of a recommendation system.
- Create a Matrix Factorization (hereinafter MF)-based recommendation model - using the implicit library.
- Learn CSR Matrix, a data structure frequently used in recommender systems.
- Learn the difference between explicit data and implicit data among user behavior data.
- Create my own recommendation model with a new dataset.

## MovieLens Dataset
- The data shows how users rate movies by data size. We recommend using the MovieLens 1M Dataset.
- Star rating data is representative explicit data. But you can consider it implicit data and test it out.
- Notice that the star rating as the number of views.
- Assume that data with less than 3 points is not preferred.
- [MovieLens 1M Dataset](https://grouplens.org/datasets/movielens/1m/)

# Table of Contents
1. Load the Data
2. Analyze the dataset
3. Add my 5 fav movies to the 'rating'
4. Create CSR Matrix
5. Create ALS model and train (AlternatingLeastSquares)
6. Get preference level of my fav movies from the trained model
7. Get recommendations for movies that are similar to your favorite movies.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [43]:
!pip install implicit

Collecting implicit
  Downloading implicit-0.4.4.tar.gz (1.1 MB)
[?25l[K     |▎                               | 10 kB 20.8 MB/s eta 0:00:01[K     |▋                               | 20 kB 27.6 MB/s eta 0:00:01[K     |▉                               | 30 kB 30.0 MB/s eta 0:00:01[K     |█▏                              | 40 kB 30.1 MB/s eta 0:00:01[K     |█▌                              | 51 kB 32.1 MB/s eta 0:00:01[K     |█▊                              | 61 kB 33.5 MB/s eta 0:00:01[K     |██                              | 71 kB 23.9 MB/s eta 0:00:01[K     |██▍                             | 81 kB 25.5 MB/s eta 0:00:01[K     |██▋                             | 92 kB 27.3 MB/s eta 0:00:01[K     |███                             | 102 kB 27.5 MB/s eta 0:00:01[K     |███▎                            | 112 kB 27.5 MB/s eta 0:00:01[K     |███▌                            | 122 kB 27.5 MB/s eta 0:00:01[K     |███▉                            | 133 kB 27.5 MB/s eta 0:00:01

In [146]:
import pandas as pd
import os
from scipy.sparse import csr_matrix
from implicit.als import AlternatingLeastSquares
import os
import numpy as np

# 1. Load the Data

In [147]:
data_path = '/content/drive/MyDrive/aiffel/EXP_9_data/'

In [164]:
import pandas as pd
import os
rating_file_path = data_path + 'ratings.dat'
ratings_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv(rating_file_path, sep='::', names=ratings_cols, engine='python')
orginal_data_size = len(ratings)
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [165]:
ratings = ratings[ratings['rating']>=3]
filtered_data_size = len(ratings)

print(f'orginal_data_size: {orginal_data_size}, filtered_data_size: {filtered_data_size}')
print(f'Ratio of Remaining Data is {filtered_data_size / orginal_data_size:.2%}')

orginal_data_size: 1000209, filtered_data_size: 836478
Ratio of Remaining Data is 83.63%


In [166]:
ratings.rename(columns={'rating':'count'}, inplace=True)

In [167]:
ratings['count']

0          5
1          3
2          3
3          4
4          5
          ..
1000203    3
1000205    5
1000206    5
1000207    4
1000208    4
Name: count, Length: 836478, dtype: int64

In [168]:
movie_file_path = data_path + 'movies.dat'
cols = ['movie_id', 'title', 'genre'] 
movies = pd.read_csv(movie_file_path, sep='::', names=cols, engine='python')
movies.head()

Unnamed: 0,movie_id,title,genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


# 2. Analyze the dataset
- Number of unique movies in ratings
- number of unique users in rating
- The 30 most popular movies (in order of popularity)

In [169]:
ratings['movie_id'].nunique()

3628

In [154]:
ratings['user_id'].nunique()

6039

In [170]:
ratings_df = pd.merge(ratings, movies, on='movie_id')
count_movies = ratings_df.groupby('title')['user_id'].count()
count_movies = count_movies.sort_values(ascending=False)
count_movies.head(30)

title
American Beauty (1999)                                   3211
Star Wars: Episode IV - A New Hope (1977)                2910
Star Wars: Episode V - The Empire Strikes Back (1980)    2885
Star Wars: Episode VI - Return of the Jedi (1983)        2716
Saving Private Ryan (1998)                               2561
Terminator 2: Judgment Day (1991)                        2509
Silence of the Lambs, The (1991)                         2498
Raiders of the Lost Ark (1981)                           2473
Back to the Future (1985)                                2460
Matrix, The (1999)                                       2434
Jurassic Park (1993)                                     2413
Sixth Sense, The (1999)                                  2385
Fargo (1996)                                             2371
Braveheart (1995)                                        2314
Men in Black (1997)                                      2297
Schindler's List (1993)                                  2257
Pr

In [111]:
count_movies.head(60)

title
American Beauty (1999)                                   3211
Star Wars: Episode IV - A New Hope (1977)                2910
Star Wars: Episode V - The Empire Strikes Back (1980)    2885
Star Wars: Episode VI - Return of the Jedi (1983)        2716
Saving Private Ryan (1998)                               2561
Terminator 2: Judgment Day (1991)                        2509
Silence of the Lambs, The (1991)                         2498
Raiders of the Lost Ark (1981)                           2473
Back to the Future (1985)                                2460
Matrix, The (1999)                                       2434
Jurassic Park (1993)                                     2413
Sixth Sense, The (1999)                                  2385
Fargo (1996)                                             2371
Braveheart (1995)                                        2314
Men in Black (1997)                                      2297
Schindler's List (1993)                                  2257
Pr

# 3.Add my 5 fav movies to the 'rating'

In [173]:
using_cols = ['user_id', 'title', 'count']
ratings = ratings_df[using_cols]
ratings.head()

Unnamed: 0,user_id,title,count
0,1,One Flew Over the Cuckoo's Nest (1975),5
1,2,One Flew Over the Cuckoo's Nest (1975),5
2,12,One Flew Over the Cuckoo's Nest (1975),4
3,15,One Flew Over the Cuckoo's Nest (1975),4
4,17,One Flew Over the Cuckoo's Nest (1975),5


In [174]:
my_favorite = ['Toy Story (1995)' , 'Forrest Gump (1994)' ,'Bug\'s Life, A (1998)','Stand by Me (1986)' ,'Star Wars: Episode V - The Empire Strikes Back (1980)']

my_playlist = pd.DataFrame({'user_id': [9999]*5, 'title': my_favorite, 'count':[5]*5})

my_playlist

Unnamed: 0,user_id,title,count
0,9999,Toy Story (1995),5
1,9999,Forrest Gump (1994),5
2,9999,"Bug's Life, A (1998)",5
3,9999,Stand by Me (1986),5
4,9999,Star Wars: Episode V - The Empire Strikes Back...,5


In [175]:
ratings = ratings.append(my_playlist)

In [176]:
ratings.head()

Unnamed: 0,user_id,title,count
0,1,One Flew Over the Cuckoo's Nest (1975),5
1,2,One Flew Over the Cuckoo's Nest (1975),5
2,12,One Flew Over the Cuckoo's Nest (1975),4
3,15,One Flew Over the Cuckoo's Nest (1975),4
4,17,One Flew Over the Cuckoo's Nest (1975),5


In [177]:
ratings[ratings['user_id'] == 9999]

Unnamed: 0,user_id,title,count
0,9999,Toy Story (1995),5
1,9999,Forrest Gump (1994),5
2,9999,"Bug's Life, A (1998)",5
3,9999,Stand by Me (1986),5
4,9999,Star Wars: Episode V - The Empire Strikes Back...,5


# 4. Create CSR Matrix


## Data Preprocessing (Indexing)

In [178]:
user_unique = ratings['user_id'].unique()
title_unique = ratings['title'].unique()

user_to_idx = {v:k for k,v in enumerate(user_unique)}
title_to_idx = {v:k for k,v in enumerate(title_unique)}

In [179]:
print(user_to_idx[9999])
print(title_to_idx['Toy Story (1995)'])

6039
40


In [180]:
temp_user_data = ratings['user_id'].map(user_to_idx.get).dropna()
if len(temp_user_data) == len(ratings):   
    print('user_id column indexing OK!!')
    ratings['user_id'] = temp_user_data   
else:
    print('user_id column indexing Fail!!')

temp_title_data = ratings['title'].map(title_to_idx.get).dropna()
if len(temp_title_data) == len(ratings):
    print('title column indexing OK!!')
    ratings['title'] = temp_title_data
else:
    print('title column indexing Fail!!')

ratings

user_id column indexing OK!!
title column indexing OK!!


Unnamed: 0,user_id,title,count
0,0,0,5
1,1,0,5
2,2,0,4
3,3,0,4
4,4,0,5
...,...,...,...
0,6039,40,5
1,6039,160,5
2,6039,4,5
3,6039,80,5


In [183]:
num_user = ratings['user_id'].nunique()
num_movie = ratings['title'].nunique()

print(num_user,num_movie)
csr_data = csr_matrix((ratings['count'],(ratings['user_id'], ratings['title'])))

csr_data

6040 3628


<6040x3628 sparse matrix of type '<class 'numpy.longlong'>'
	with 836483 stored elements in Compressed Sparse Row format>

# 5. Create ALS model and train (AlternatingLeastSquares)


Recommended from `Implicit` Library as follow:

In [184]:
os.environ['OPENBLAS_NUM_THREADS']='1'
os.environ['KMP_DUPLICATE_LIB_OK']='True'
os.environ['MKL_NUM_THREADS']='1'

Declare `als_model`

In [185]:
als_model = AlternatingLeastSquares(factors=100, regularization=0.01, use_gpu=False, iterations=30, dtype=np.float32)


In [186]:
csr_data_transpose = csr_data.T
csr_data_transpose

<3628x6040 sparse matrix of type '<class 'numpy.longlong'>'
	with 836483 stored elements in Compressed Sparse Column format>

In [187]:
als_model.fit(csr_data_transpose)

  0%|          | 0/30 [00:00<?, ?it/s]

# 6. Get preference level of my fav movies from the trained model


Predicting the preference of my favorite movies

In [188]:
jieun, toystory = user_to_idx[9999], 2000
jieun_vector, toystory_vector = als_model.user_factors[jieun], als_model.item_factors[toystory]

In [189]:
jieun_vector

array([-0.21731155, -0.16243088,  0.3459847 ,  0.7857577 , -0.67380977,
        0.69026375, -0.12849012, -0.6673374 ,  0.4345594 ,  0.06172493,
       -0.13035168,  0.40272513,  0.07888252,  0.84380615, -0.37019327,
       -0.09008068,  0.09297616,  0.9707458 , -0.5767809 , -0.16720784,
        0.0229263 , -0.39049682, -0.10922033, -0.3883964 , -0.12710854,
        0.2884064 , -0.0467146 ,  0.22819225,  0.9771122 , -0.78253895,
        0.7854523 , -0.26202673, -0.25814393,  0.14835005, -0.37672985,
        0.42490304,  0.17876416,  0.2514161 ,  0.24307497, -0.39245582,
        0.5226281 ,  0.6528023 ,  0.8273683 , -0.6731733 , -0.50528395,
       -0.00225086, -0.6086046 , -0.12206251, -0.34630227,  0.0410092 ,
        0.09792275, -0.05464485,  0.11516984, -0.05163636, -0.22902432,
        0.13922457,  0.15628834, -1.4676596 , -0.34168586, -0.10798316,
        0.22197226,  0.08388081,  0.18967581, -0.09891239, -0.16538824,
        1.0148449 ,  0.01640047,  0.53528124, -0.22062808, -0.38

In [190]:
toystory_vector

array([ 8.7005016e-04,  1.8048884e-03,  3.4659640e-03,  2.3006853e-03,
        5.7497732e-03,  4.8243310e-04,  6.4813625e-03,  8.0564516e-03,
        2.1479810e-03, -2.5455949e-03, -2.5671280e-03, -5.5790627e-03,
        3.8866813e-03, -1.3015044e-03,  2.8111271e-03,  4.4910447e-04,
        9.7162994e-03,  4.9158428e-03, -4.4545438e-03,  2.9683104e-03,
       -4.3224832e-03, -2.5790522e-03,  3.3652433e-03, -8.5877785e-03,
        3.4659596e-03, -4.2457497e-03, -4.2089480e-03,  5.5026677e-03,
        4.7180224e-03,  5.8445334e-03,  8.4327385e-03,  2.8736070e-03,
       -4.6423841e-03,  5.1864266e-04, -8.0362447e-03, -1.6890940e-03,
       -9.8971173e-04,  9.7194538e-03, -1.0512193e-02,  1.8122756e-03,
        6.8663298e-03,  5.3020120e-03, -1.4738153e-03,  1.0527914e-03,
        1.0584030e-02, -7.6987519e-04,  3.6738275e-03, -7.4431614e-04,
       -2.6720494e-03, -2.5080428e-03,  9.4146840e-03,  4.0215207e-03,
        5.4372535e-03,  2.5894407e-03, -5.8555813e-04,  5.1524043e-03,
      

In [191]:
np.dot(jieun_vector, toystory_vector)

0.008474321

Predicting my preference for other movies : Speed (1994) - index 1442

In [192]:
jieun, speed = user_to_idx[9999], 1442
jieun_vector, speed_vector = als_model.user_factors[jieun], als_model.item_factors[speed]

In [193]:
np.dot(jieun_vector, speed_vector)

-0.0078366855

# 7. Get recommendations for movies that are similar to your favorite movies.

In [194]:
idx_to_title = {v:k for k, v in title_to_idx.items()}

def get_similar_movie(movie_title: str):
    title_id = title_to_idx[movie_title]
    similar_movie = als_model.similar_items(title_id)
    similar_movie = [idx_to_title[i[0]] for i in similar_movie]
    return similar_movie

def get_similar_movie_by_id(title_id: int):
    similar_movie = als_model.similar_items(movie_id)
    similar_movie = [idx_to_title[i[0]] for i in similar_movie]
    return similar_movie

Get recommendations for my fav movies

In [195]:
get_similar_movie('Toy Story (1995)')

['Toy Story (1995)',
 'Toy Story 2 (1999)',
 "Bug's Life, A (1998)",
 'Babe (1995)',
 'Aladdin (1992)',
 'Groundhog Day (1993)',
 'Pleasantville (1998)',
 'Lion King, The (1994)',
 'Beauty and the Beast (1991)',
 "There's Something About Mary (1998)"]

Get recommendations for another movie

In [196]:
get_similar_movie('Men in Black (1997)')

['Men in Black (1997)',
 'Jurassic Park (1993)',
 'Terminator 2: Judgment Day (1991)',
 'Total Recall (1990)',
 'Independence Day (ID4) (1996)',
 'Matrix, The (1999)',
 'Fifth Element, The (1997)',
 'Lost World: Jurassic Park, The (1997)',
 'Galaxy Quest (1999)',
 'Schlafes Bruder (Brother of Sleep) (1995)']

In [198]:
user = user_to_idx[9999]
artist_recommended = als_model.recommend(user, csr_data, N=20, filter_already_liked_items=True)
artist_recommended

[(50, 0.75809884),
 (64, 0.59311485),
 (44, 0.38345823),
 (110, 0.37946138),
 (322, 0.32532197),
 (5, 0.3136822),
 (120, 0.30574942),
 (22, 0.2963306),
 (26, 0.28146118),
 (99, 0.28129023),
 (87, 0.2774638),
 (33, 0.26344126),
 (126, 0.24480475),
 (1030, 0.2203416),
 (48, 0.21387866),
 (20, 0.21099932),
 (189, 0.20866719),
 (172, 0.2013833),
 (1084, 0.17556304),
 (475, 0.17154062)]

In [200]:
[idx_to_title[i[0]] for i in artist_recommended]


['Toy Story 2 (1999)',
 'Star Wars: Episode VI - Return of the Jedi (1983)',
 'Star Wars: Episode IV - A New Hope (1977)',
 'Groundhog Day (1993)',
 'Babe (1995)',
 'Princess Bride, The (1987)',
 'Raiders of the Lost Ark (1981)',
 'Back to the Future (1985)',
 'E.T. the Extra-Terrestrial (1982)',
 'American Beauty (1999)',
 'Braveheart (1995)',
 'Aladdin (1992)',
 'Shakespeare in Love (1998)',
 'Platoon (1986)',
 'Saving Private Ryan (1998)',
 'Pleasantville (1998)',
 'Breakfast Club, The (1985)',
 'Indiana Jones and the Last Crusade (1989)',
 'Full Metal Jacket (1987)',
 'My Cousin Vinny (1992)']

Let's check movies that are contributed to this recommendations.

In [205]:
slam = title_to_idx['Star Wars: Episode VI - Return of the Jedi (1983)']
explain = als_model.explain(user, csr_data, itemid=slam)

In [206]:
[(idx_to_title[i[0]], i[1]) for i in explain[1]]


[('Star Wars: Episode V - The Empire Strikes Back (1980)', 0.4652252443664239),
 ('Forrest Gump (1994)', 0.14910034953971713),
 ("Bug's Life, A (1998)", 0.02734312202215841),
 ('Toy Story (1995)', 0.004513152299155231),
 ('Stand by Me (1986)', -0.05906457799085581)]

As expected(?), Star Wars series 5 contributed to recommend Star Wars series 6..! including other animes I chose.

# Conclusion
- Succesfully created a Matrix Factorization (hereinafter MF)-based recommendation model - using the implicit library.
- Utilized CSR Matrix, a data structure frequently used in recommender systems.
- Created my own recommendation model with a new dataset.
- Got movie recommendations based on my fav movies.

# References
- Hu, Yifan, et al. “Collaborative Filtering for Implicit Feedback Datasets.” 2008 Eighth IEEE International Conference on Data Mining, 2008, doi:10.1109/icdm.2008.22. 