<a href="https://colab.research.google.com/github/jackiekuen2/recommender-systems/blob/main/Kaggle_The_Movie_Dataset_Collaborative_Filtering_and_Hybird_Recommender.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Credits
- Kaggle - Movie Recommender System https://www.kaggle.com/rounakbanik/movie-recommender-systems/notebook
- https://stackoverflow.com/questions/61305393/svd-has-no-split-attribute
- https://stackoverflow.com/questions/62046795/attributeerror-datasetautofolds-object-has-no-attribute-split
- Surprise https://surprise.readthedocs.io/en/stable/getting_started.html?highlight=train#train-test-split-and-the-fit-method

In [1]:
!pip install -q kaggle

In [None]:
from google.colab import files

files.upload()

In [3]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!ls ~/.kaggle
!chmod 600 /root/.kaggle/kaggle.json

kaggle.json


In [4]:
!kaggle datasets download -d rounakbanik/the-movies-dataset

Downloading the-movies-dataset.zip to /content
 88% 201M/228M [00:02<00:00, 60.3MB/s]
100% 228M/228M [00:02<00:00, 92.9MB/s]


In [5]:
!unzip \*.zip  && rm *.zip


Archive:  the-movies-dataset.zip
  inflating: credits.csv             
  inflating: keywords.csv            
  inflating: links.csv               
  inflating: links_small.csv         
  inflating: movies_metadata.csv     
  inflating: ratings.csv             
  inflating: ratings_small.csv       


In [6]:
!pip install surprise

Collecting surprise
  Downloading https://files.pythonhosted.org/packages/61/de/e5cba8682201fcf9c3719a6fdda95693468ed061945493dea2dd37c5618b/surprise-0.1-py2.py3-none-any.whl
Collecting scikit-surprise
  Downloading https://files.pythonhosted.org/packages/97/37/5d334adaf5ddd65da99fc65f6507e0e4599d092ba048f4302fe8775619e8/scikit-surprise-1.1.1.tar.gz (11.8MB)
     |████████████████████████████████| 11.8MB 3.1MB/s 
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp36-cp36m-linux_x86_64.whl size=1670927 sha256=468f2b57ec908e76d814af27b9fdf1eff9ad18b34ca3e18fd41a55409bfc3da4
  Stored in directory: /root/.cache/pip/wheels/78/9c/3d/41b419c9d2aff5b6e2b4c0fc8d25c538202834058f9ed110d0
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.1 surprise-0.1


# I. Collaborative Filtering
- Using the Surprise library with Singular Value Decomposition (SVD) to minimise RMSE (Root Mean Square Error) 
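As a reminder of what the two error measures used below mean, here is a minimal sketch in plain Python. The rating lists are made-up values for illustration only and do not come from the dataset; the helper names `rmse` and `mae` are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length lists of ratings."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mae(actual, predicted):
    """Mean absolute error between two equal-length lists of ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical ratings, for illustration only
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

example_rmse = rmse(actual, predicted)
example_mae = mae(actual, predicted)
print(example_rmse, example_mae)
```

RMSE squares each error before averaging, so it penalises large misses more heavily than MAE does.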


In [7]:
import pandas as pd
import numpy as np
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

In [8]:
ratings = pd.read_csv('ratings_small.csv')

In [9]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100004 non-null  int64  
 1   movieId    100004 non-null  int64  
 2   rating     100004 non-null  float64
 3   timestamp  100004 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [10]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [11]:
%%time
reader = Reader()

data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

CPU times: user 86 ms, sys: 9.21 ms, total: 95.3 ms
Wall time: 94.3 ms


## Cross Validation

In [12]:
%%time
svd = SVD()

cross_validate(svd, data, measures=['rmse', 'mae'], cv=5, n_jobs=-1, verbose=1)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8923  0.9005  0.8931  0.9014  0.8987  0.8972  0.0038  
MAE (testset)     0.6878  0.6909  0.6852  0.6947  0.6917  0.6900  0.0033  
Fit time          4.58    4.56    4.61    4.58    4.57    4.58    0.02    
Test time         0.15    0.16    0.15    0.17    0.16    0.16    0.01    
CPU times: user 7.85 s, sys: 410 ms, total: 8.26 s
Wall time: 13.2 s
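The Mean and Std columns summarise the five per-fold scores. The sketch below recomputes them for the RMSE row from the fold values printed above, assuming the Std column is a population standard deviation (that assumption reproduces the printed figure).

```python
import statistics

# Per-fold RMSE values copied from the cross-validation output above
fold_rmse = [0.8923, 0.9005, 0.8931, 0.9014, 0.8987]

mean_rmse = statistics.mean(fold_rmse)
# Population standard deviation (divides by n, not n - 1); an assumption
std_rmse = statistics.pstdev(fold_rmse)

print(f"{mean_rmse:.4f} {std_rmse:.4f}")  # prints "0.8972 0.0038"
```

A small standard deviation relative to the mean suggests the score is stable across the five splits.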


## Model Training

In [13]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7faae2c11fd0>

In [14]:
ratings[ratings['userId']==1]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
5,1,1263,2.0,1260759151
6,1,1287,2.0,1260759187
7,1,1293,2.0,1260759148
8,1,1339,3.5,1260759125
9,1,1343,2.0,1260759131


## Predictions

In [16]:
# Predict the rating of user 1 for movie 1953; r_ui is the known true rating (optional, kept for reference)
svd.predict(uid=1, iid=1953, r_ui=4)

Prediction(uid=1, iid=1953, r_ui=4, est=3.230811973660988, details={'was_impossible': False})

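The true rating of user 1 for movie 1953 is 4.0; the predicted rating is about 3.23. Because the model was fitted on the full dataset, this rating was part of the training data.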
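For intuition, Surprise's SVD algorithm estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item latent factors, and `predict` clips the result to the rating scale by default. The sketch below illustrates that formula in plain Python. The function name `svd_estimate` and all numbers are hypothetical and are not the parameters learned by the model above.

```python
def svd_estimate(mu, b_u, b_i, p_u, q_i, low=1.0, high=5.0):
    """Biased matrix-factorisation estimate, clipped to the rating scale.

    mu: global mean rating, b_u / b_i: user and item biases,
    p_u / q_i: user and item latent factor vectors (same length).
    """
    raw = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    return min(high, max(low, raw))

# Hypothetical parameters, for illustration only
est = svd_estimate(mu=3.5, b_u=-0.3, b_i=0.1, p_u=[0.2, -0.1], q_i=[0.5, 0.4])
print(round(est, 2))

# A raw estimate above the scale is clipped to the maximum rating
clipped = svd_estimate(mu=4.8, b_u=0.5, b_i=0.3, p_u=[0.0], q_i=[0.0])
print(clipped)
```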

# II. Hybrid
- Content-based part: tagline + descriptions
- Collaborative Filtering part: SVD on user ratings
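The two parts combine as a two-stage ranking: content similarity builds a shortlist of candidates, and the collaborative filtering model re-orders that shortlist for a specific user. The sketch below shows the idea in plain Python with dictionaries. The function name `hybrid_rank`, its parameters, and the toy data are hypothetical and only stand in for the similarity matrix and the SVD model used in this notebook.

```python
def hybrid_rank(similarities, predict_rating, shortlist_size=30, top_n=10):
    """Shortlist by content similarity, then re-rank by predicted rating.

    similarities: dict mapping candidate movie -> similarity to the seed movie
    predict_rating: callable returning the user's predicted rating for a movie
    """
    # Stage 1: keep the most similar candidates
    shortlist = sorted(similarities, key=similarities.get, reverse=True)[:shortlist_size]
    # Stage 2: order the shortlist by the user's predicted rating
    scored = [(movie, predict_rating(movie)) for movie in shortlist]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Toy data, for illustration only
similarities = {"A": 0.9, "B": 0.8, "C": 0.1, "D": 0.7}
predicted = {"A": 2.0, "B": 4.5, "C": 5.0, "D": 3.0}

result = hybrid_rank(similarities, predicted.get, shortlist_size=3, top_n=2)
print(result)  # [('B', 4.5), ('D', 3.0)]
```

Movie "C" has the highest predicted rating but is dropped in stage 1 because it is not similar to the seed, which is the intended effect of the shortlist.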

## I. Content-based part

In [96]:
from ast import literal_eval

md = pd.read_csv('movies_metadata.csv')

# Drop the bad movie ids
md = md.drop([19730, 29503, 35587])

md['id'] = md['id'].astype('int')



In [97]:
md.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.9469,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.0155,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


In [98]:
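# Get release year as a string (NaN when the release date is missing or cannot be parsed)
# Note: `x != np.nan` is always True, so pd.notnull is used to detect missing dates
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)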

In [99]:
# Use the small links file (due to memory limits); only the TMDB ids are needed for mapping
links_small = pd.read_csv('links_small.csv')

links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')

In [100]:
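# Keep only movies present in the small links file; copy so later column assignments do not write to a slice of md
smd = md[md['id'].isin(links_small)].copy()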

smd.shape

(9099, 25)

The subset is roughly five times smaller than the full metadata table.

In [101]:
smd['tagline'] = smd['tagline'].fillna('')
smd['soup'] = smd['overview'] + smd['tagline']
smd['soup'] = smd['soup'].fillna('')



In [102]:
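from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

# Unigrams and bigrams, English stop words removed.
# min_df=1 (the default) keeps every term; newer scikit-learn releases reject an integer min_df of 0.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(smd['soup'])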

In [103]:
%%time
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

CPU times: user 896 ms, sys: 22.9 ms, total: 919 ms
Wall time: 918 ms
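`linear_kernel` computes plain dot products. Since `TfidfVectorizer` L2-normalises each row by default, those dot products equal cosine similarities. The sketch below checks this identity in plain Python on two made-up vectors; the helper names are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(u, v):
    """Cosine similarity computed directly from its definition."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Made-up vectors, for illustration only
u = [1.0, 2.0, 0.0]
v = [2.0, 1.0, 1.0]

plain_cosine = cosine(u, v)
dot_of_normalized = dot(l2_normalize(u), l2_normalize(v))
print(math.isclose(plain_cosine, dot_of_normalized))  # True
```

Skipping the explicit normalisation step is why the dot-product form is the cheaper choice for a matrix that is already normalised.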


In [104]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

## II. Hybrid part: incorporating SVD

In [105]:
def convert_int(x):
    # Convert to int; return NaN for missing or non-numeric values
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan

In [106]:
id_map = pd.read_csv('links_small.csv')[['movieId', 'tmdbId']]

id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int) # For merging
id_map.columns = ['movieId', 'id']

id_map = id_map.merge(smd[['title', 'id']], on='id').set_index('title')

In [107]:
id_map.head()

Unnamed: 0_level_0,movieId,id
title,Unnamed: 1_level_1,Unnamed: 2_level_1
Toy Story,1,862.0
Jumanji,2,8844.0
Grumpier Old Men,3,15602.0
Waiting to Exhale,4,31357.0
Father of the Bride Part II,5,11862.0


In [108]:
indices_map = id_map.set_index('id')

In [109]:
indices_map.head()

Unnamed: 0_level_0,movieId
id,Unnamed: 1_level_1
862.0,1
8844.0,2
15602.0,3
31357.0,4
11862.0,5


In [110]:
indices.head()

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
dtype: int64

In [126]:
def hybird(userId, title):
    # Row position of the seed movie in smd / cosine_sim
    idx = indices[title]

    # Rank all movies by content similarity to the seed; skip the top match (normally the seed itself)
    sim_scores = list(enumerate(cosine_sim[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]

    movie_indices = [i[0] for i in sim_scores]  # Row positions of the 30 most similar movies

    movies = smd.iloc[movie_indices][['title', 'year', 'id', 'vote_count', 'vote_average']].copy()
    # Predict this user's rating for each candidate (TMDB id -> MovieLens movieId -> SVD estimate)
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, indices_map.loc[x]['movieId']).est)
    # Return the 10 candidates with the highest predicted rating
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)

In [128]:
hybird(1, 'Jurassic World')

Unnamed: 0,title,year,id,vote_count,vote_average,est
6189,Grizzly Man,2005,501,213.0,7.3,3.083463
7550,South Park: Imaginationland,2008,16023,75.0,7.9,3.018926
2236,National Lampoon's Vacation,1983,11153,412.0,7.1,2.902166
8385,Austenland,2013,156711,192.0,6.6,2.865549
7363,Grey Gardens,2009,19851,35.0,6.7,2.858866
5907,Sympathy for Mr. Vengeance,2002,4689,302.0,7.3,2.796226
8097,Piranha 3DD,2012,71668,319.0,4.1,2.768469
8310,The Way Way Back,2013,147773,695.0,7.1,2.764324
857,Johns,1996,56830,6.0,6.0,2.749366
7181,Adventureland,2009,16614,748.0,6.4,2.689013


In [130]:
hybird(2, 'Jurassic World')

Unnamed: 0,title,year,id,vote_count,vote_average,est
2236,National Lampoon's Vacation,1983,11153,412.0,7.1,3.91374
6189,Grizzly Man,2005,501,213.0,7.3,3.802314
857,Johns,1996,56830,6.0,6.0,3.6631
8310,The Way Way Back,2013,147773,695.0,7.1,3.611717
6097,Ringu 0,2000,9674,49.0,5.8,3.581086
5907,Sympathy for Mr. Vengeance,2002,4689,302.0,7.3,3.546119
7686,Yogi Bear,2010,41515,228.0,5.2,3.531717
1412,Meet the Deedles,1998,40688,19.0,4.1,3.522619
8071,Geri's Game,1997,13929,309.0,7.8,3.513051
7363,Grey Gardens,2009,19851,35.0,6.7,3.454112


In [129]:
hybird(500, 'The Godfather')

Unnamed: 0,title,year,id,vote_count,vote_average,est
2192,The Color Purple,1985,873,345.0,7.7,3.541606
973,The Godfather: Part II,1974,240,3418.0,8.3,3.453374
5593,Eulogy,2004,16358,34.0,6.4,3.339311
7741,Elite Squad: The Enemy Within,2010,47931,477.0,7.5,3.330485
29,Shanghai Triad,1995,37557,17.0,6.5,3.279322
5271,The Valachi Papers,1972,33357,18.0,7.4,3.164755
4168,Songs from the Second Floor,2000,34070,61.0,7.0,3.160048
8931,Afro Samurai,2007,62931,63.0,7.3,3.152096
2412,American Movie,1999,14242,57.0,7.7,3.151237
8816,Run All Night,2015,241554,1169.0,6.3,3.074118


In [131]:
hybird(501, 'The Godfather')

Unnamed: 0,title,year,id,vote_count,vote_average,est
973,The Godfather: Part II,1974,240,3418.0,8.3,4.426557
2192,The Color Purple,1985,873,345.0,7.7,4.216693
2412,American Movie,1999,14242,57.0,7.7,4.167032
7741,Elite Squad: The Enemy Within,2010,47931,477.0,7.5,4.12493
7591,Machete,2010,23631,1171.0,6.3,3.899884
8387,The Family,2013,112205,1052.0,6.1,3.871404
29,Shanghai Triad,1995,37557,17.0,6.5,3.855555
4144,Road to Perdition,2002,4147,1102.0,7.3,3.829779
6398,Renaissance,2006,9389,78.0,6.7,3.818227
3560,Moon Over Parador,1988,34014,17.0,6.0,3.814549
