In [1]:
import pandas as pd
import numpy as np
import os
import plotly.graph_objects as go
import plotly.figure_factory as ff

# Dataset Loading and Preprocessing

Download the dataset [here](https://grouplens.org/datasets/movielens/1m/).
The dataset consists of three files: *movies.dat*, *users.dat*, and *ratings.dat*.

I load each .dat file and convert it to a CSV file.

In [2]:
users=pd.read_csv('ml-1m/users.dat',sep='::',
                   engine='python',names=['user_id','gender','age','occupation','zipcode'])
print(users['user_id'].count(),'users are loaded.')

6040 users are loaded.
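
The `sep='::'` and `engine='python'` arguments go together: pandas treats a separator longer than one character as a regular expression, which only the Python parsing engine supports. A minimal sketch on two illustrative lines in the users.dat layout follows. The sample rows, the name `sample_users`, and the `dtype` option are assumptions added here, not part of the notebook.

```python
import io
import pandas as pd

# Two illustrative lines in the users.dat layout (hypothetical sample, not read from disk).
sample = io.StringIO("1::F::1::10::48067\n4::M::45::7::02460\n")

# A multi-character separator is interpreted as a regular expression,
# so the Python parsing engine has to be requested explicitly.
sample_users = pd.read_csv(
    sample,
    sep="::",
    engine="python",
    names=["user_id", "gender", "age", "occupation", "zipcode"],
    dtype={"zipcode": str},  # assumption: keep zip codes as text so leading zeros survive
)
print(sample_users)
```

Reading the zip code as text is optional, but without it a value such as `02460` would be parsed as the integer 2460 whenever every value in the column looks numeric.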


From the [README](http://files.grouplens.org/datasets/movielens/ml-1m-README.txt) file:<br>
1) `age` does not hold the actual age but an age-group code, so I replace the codes with the age-group labels.<br>
2) `occupation` is stored as an int but actually represents a category, so I replace the codes with the occupation names.

In [3]:
ages = { 1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44", 45: "45-49", 50: "50-55", 56: "56+" }
occupations = { 0: "other or not specified", 1: "academic/educator", 2: "artist", 3: "clerical/admin",
                4: "college/grad student", 5: "customer service", 6: "doctor/health care",
                7: "executive/managerial", 8: "farmer", 9: "homemaker", 10: "K-12 student", 11: "lawyer",
                12: "programmer", 13: "retired", 14: "sales/marketing", 15: "scientist", 16: "self-employed",
                17: "technician/engineer", 18: "tradesman/craftsman", 19: "unemployed", 20: "writer" }

In [4]:
users['age'] = users['age'].apply(lambda x: ages[x])
users.rename(columns={'age':'age_group'},inplace=True)
users['occupation'] = users['occupation'].apply(lambda x: occupations[x])
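
The same recoding can also be written with `Series.map`, which returns `NaN` for a code that is missing from the dictionary instead of raising a `KeyError`. The sketch below uses a toy series; the code `99` is a deliberately invalid, hypothetical value, and the names `age_codes`, `age_labels`, and `unmapped` are illustrative.

```python
import pandas as pd

# Toy age codes; 99 is a hypothetical invalid code used to show the failure mode.
age_codes = pd.Series([1, 56, 25, 99])
age_labels = {1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
              45: "45-49", 50: "50-55", 56: "56+"}

# map() looks each code up in the dictionary; unknown codes become NaN.
age_groups = age_codes.map(age_labels)

# Collect the codes that could not be translated so they can be inspected.
unmapped = age_codes[age_groups.isna()]
print(age_groups.tolist())
print(unmapped.tolist())
```

Checking `unmapped` after the recoding is a cheap way to confirm that every code in the file is covered by the dictionary.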

Let's take a look at the contents of the file.

In [5]:
users.head()

   user_id  gender  age_group            occupation  zipcode
0        1       F   Under 18          K-12 student    48067
1        2       M        56+         self-employed    70072
2        3       M      25-34             scientist    55117
3        4       M      45-49  executive/managerial     2460
4        5       M      25-34                writer    55455


In [6]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6040 entries, 0 to 6039
Data columns (total 5 columns):
user_id       6040 non-null int64
gender        6040 non-null object
age_group     6040 non-null object
occupation    6040 non-null object
zipcode       6040 non-null object
dtypes: int64(1), object(4)
memory usage: 236.0+ KB


This confirms that there are 6040 users, each with the attributes user_id, gender, age_group, occupation, and zipcode, and that no column has missing values.

In [7]:
users.to_csv('users.csv',sep=',',header=True,
             columns=['user_id', 'gender', 'age_group', 'occupation', 'zipcode'])
print('Saved to users.csv')

Saved to users.csv


In [8]:
movies=pd.read_csv('ml-1m/movies.dat',sep='::',
                   engine='python',names=['movie_id','title','genres'])
print(movies['movie_id'].count(),'movies are loaded.')

3883 movies are loaded.


Let's take a look at the contents of the movies file.

In [9]:
movies.head()

   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy


In [10]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3883 entries, 0 to 3882
Data columns (total 3 columns):
movie_id    3883 non-null int64
title       3883 non-null object
genres      3883 non-null object
dtypes: int64(1), object(2)
memory usage: 91.1+ KB


This dataset contains attributes of 3883 movies. There are 3 columns: movie_id, title, and genres. Genres are pipe-separated and are selected from <br>
(Action, Adventure, Animation, Children's, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western)
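
Because several genres share one pipe-separated string, it can help to expand them into one indicator column per genre. The notebook does not do this; the following is a minimal sketch on two rows copied from the output above, with `sample_movies` and `genre_flags` as illustrative names.

```python
import pandas as pd

# Two rows taken from the movies.head() output above.
sample_movies = pd.DataFrame({
    "movie_id": [1, 3],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Animation|Children's|Comedy", "Comedy|Romance"],
})

# str.get_dummies splits each string on the separator and returns
# one 0/1 column per distinct genre.
genre_flags = sample_movies["genres"].str.get_dummies(sep="|")
print(genre_flags)
```

The resulting frame has one row per movie, so it can be joined back to the movie table by position or index.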

In [11]:
movies.to_csv('movies.csv',sep=',',header=True,columns=['movie_id','title','genres'])
print('Saved to movies.csv')

Saved to movies.csv


In [12]:
ratings=pd.read_csv('ml-1m/ratings.dat',sep='::',engine='python',
                    names=['user_id','movie_id','rating','timestamp'])
print(ratings['rating'].count(),'ratings are loaded')

1000209 ratings are loaded


Let's take a look at the contents of the ratings file.

In [13]:
ratings.head()

   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291


In [14]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 4 columns):
user_id      1000209 non-null int64
movie_id     1000209 non-null int64
rating       1000209 non-null int64
timestamp    1000209 non-null int64
dtypes: int64(4)
memory usage: 30.5 MB


This confirms that there are about 1 million ratings (1,000,209 rows), each with the attributes user_id, movie_id, rating, and timestamp.
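
The timestamp column is an integer. Assuming it counts seconds since the Unix epoch, as the dataset README describes, it can be turned into a readable datetime. This conversion is not part of the notebook; the sketch uses the first two rows of the output above, and `rated_at` is a hypothetical column name.

```python
import pandas as pd

# First two rows from the ratings.head() output above.
sample_ratings = pd.DataFrame({
    "user_id": [1, 1],
    "movie_id": [1193, 661],
    "rating": [5, 3],
    "timestamp": [978300760, 978302109],
})

# Assumption: timestamps are seconds since 1970-01-01 00:00:00 UTC.
sample_ratings["rated_at"] = pd.to_datetime(sample_ratings["timestamp"], unit="s")
print(sample_ratings[["timestamp", "rated_at"]])
```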

In [15]:
ratings.to_csv('ratings.csv',sep=',',header=True, 
               columns=['user_id', 'movie_id', 'rating', 'timestamp'])
print('Saved to ratings.csv')

Saved to ratings.csv


In [16]:
ratings=pd.read_csv('ratings.csv',sep=',',usecols=['user_id','movie_id','rating','timestamp'])
movies=pd.read_csv('movies.csv',sep=',',usecols=['movie_id','title','genres'])
users=pd.read_csv('users.csv',sep=',',
                  usecols=['user_id','gender','age_group','occupation','zipcode'])

# Model Based Collaborative Filtering

*Model-based collaborative filtering* is commonly built on matrix factorization (MF), an unsupervised technique for latent-variable decomposition and dimensionality reduction. MF is widely used in recommender systems because it handles scalability and sparsity better than memory-based CF:
* The goal of MF is to learn the latent preferences of users and the latent attributes of items from the known ratings, and then to predict each unknown rating as the dot product of the user's and the item's latent vectors.
* A very sparse, high-dimensional user-item matrix can be approximated by the product of two low-rank matrices whose rows hold the latent vectors of the users and of the items.
* The low-rank matrices are fitted so that their product matches the original matrix as closely as possible; that product also supplies values for the entries that were missing in the original.
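
The low-rank idea in the bullets above can be illustrated on a toy matrix. This sketch is not the notebook's pipeline: the 4x4 matrix is built from made-up factor values so that it has rank 2 exactly, and every name in it is illustrative.

```python
import numpy as np

# Hypothetical latent factors: 4 users x 2 factors and 2 factors x 4 items.
user_factors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
item_factors = np.array([[5.0, 3.0, 1.0, 1.0], [1.0, 1.0, 4.0, 5.0]])
R = user_factors @ item_factors  # toy "ratings" matrix of rank 2

# Full SVD, then keep only the k largest singular values.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
print(np.allclose(R, R_k))  # a rank-2 matrix is recovered by its rank-2 SVD

# A single prediction is the dot product of one user vector and one item vector.
user_vec = U[2, :k] * s[:k]
item_vec = Vt[:k, 1]
print(user_vec @ item_vec, R[2, 1])
```

Real rating matrices are not exactly low rank, so the truncated product only approximates the observed entries; the approximation error is the price paid for the compact representation.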

In [17]:
ratings_matrix_df = ratings.pivot(index='user_id',columns='movie_id',values='rating')

In [18]:
ratings_matrix_df.head()

movie_id     1     2     3     4     5     6     7     8     9    10  ...  3943  3944  3945  3946  3947  3948  3949  3950  3951  3952
user_id
1          5.0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
2          NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
3          NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
4          NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
5          NaN   NaN   NaN   NaN   NaN   2.0   NaN   NaN   NaN   NaN  ...   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
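
Most cells of the pivoted matrix are empty, and the share of filled cells (the density) quantifies how sparse it is. A minimal sketch on a toy ratings frame follows; the data and the names `toy`, `toy_matrix`, and `density` are hypothetical.

```python
import pandas as pd

# Toy long-format ratings: 4 ratings from 3 users on 3 movies (hypothetical values).
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "movie_id": [10, 20, 10, 30],
    "rating": [5, 3, 4, 2],
})

# Same pivot as above: one row per user, one column per rated movie.
toy_matrix = toy.pivot(index="user_id", columns="movie_id", values="rating")

# Density = observed cells / all cells.
density = toy_matrix.notna().to_numpy().mean()
print(toy_matrix)
print(density)
```

Note that `pivot` only creates columns for movies that appear in the ratings, so movies nobody rated are absent from the matrix.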


In [19]:
from scipy.sparse import csc_matrix

In [20]:
# Missing ratings are replaced with 0 so the matrix can be stored in sparse form.
ratings_matrix_sparse=csc_matrix(ratings_matrix_df.fillna(0).values)

In [21]:
from scipy.sparse.linalg import svds

In [22]:
# Truncated SVD that keeps only the 50 largest singular values.
u, sigma, vt = svds(ratings_matrix_sparse,k=50)

In [23]:
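# svds returns the singular values as a vector; make it a diagonal matrix and
# rebuild the rank-50 approximation, which serves as the predicted ratings.
sigma = np.diag(sigma)
all_user_predicted_ratings = np.dot(np.dot(u, sigma), vt)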

In [24]:
predicted_ratings_matrix_df = pd.DataFrame(all_user_predicted_ratings,columns=ratings_matrix_df.columns,index=ratings_matrix_df.index)
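
Filling missing ratings with 0, as done above, makes the factorization treat every unrated movie as a very low rating. One common refinement, which this notebook does not use, is to subtract each user's mean rating first so that a missing entry sits at that user's average, and to add the mean back after reconstruction. The sketch below shows the idea on a toy dense matrix with `np.linalg.svd`; the values and names are hypothetical.

```python
import numpy as np

# Toy user-item matrix; np.nan marks a missing rating (hypothetical values).
R = np.array([[5.0, 4.0, np.nan, 1.0],
              [4.0, np.nan, 1.0, 1.0],
              [1.0, 1.0, 5.0, np.nan]])
observed = ~np.isnan(R)

# Center each user's observed ratings on that user's mean; missing entries become 0,
# which now means "average for this user" rather than "rated zero".
user_means = np.nanmean(R, axis=1, keepdims=True)
R_centered = np.where(observed, R - user_means, 0.0)

# Rank-k reconstruction, then add the user means back.
U, s, Vt = np.linalg.svd(R_centered, full_matrices=False)
k = 2
predictions = (U[:, :k] * s[:k]) @ Vt[:k, :] + user_means
print(np.round(predictions, 2))
```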

In [25]:
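def recommend(user_id,original_ratings,predicted_ratings,movies_df,top=20):
    """Return the `top` movies the user has not rated, ordered by predicted rating."""
    orig=original_ratings.loc[user_id]
    pred=predicted_ratings.loc[user_id]
    # Keep only unrated movies and take the highest predicted ratings.
    recom=pred[orig.isna()].sort_values(ascending=False).iloc[:top]
    # Use the movies_df argument rather than the global `movies`;
    # copy the slice so adding a column does not trigger SettingWithCopyWarning.
    recom_movies=movies_df.loc[movies_df['movie_id'].isin(recom.index)].copy()
    # Attach each prediction by movie_id instead of relying on row order.
    recom_movies['pred']=recom_movies['movie_id'].map(recom)
    return recom_movies.sort_values(by='pred',ascending=False)[['movie_id','title','genres']]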

In [26]:
recommended_movies=recommend(1310,ratings_matrix_df,predicted_ratings_matrix_df,movies,20)

In [27]:
recommended_movies

      movie_id                                              title                               genres
1628      1674                                     Witness (1985)               Drama|Romance|Thriller
1892      1961                                    Rain Man (1988)                                Drama
1222      1242                                       Glory (1989)                     Action|Drama|War
1207      1225                                     Amadeus (1984)                                Drama
1192      1210  Star Wars: Episode VI - Return of the Jedi (1983)  Action|Adventure|Romance|Sci-Fi|War
1282      1302                             Field of Dreams (1989)                                Drama
1893      1962                          Driving Miss Daisy (1989)                                Drama
1226      1246                          Dead Poets Society (1989)                                Drama
1888      1957                            Chariots of Fire (1981)                                Drama
1951      2020                          Dangerous Liaisons (1988)                        Drama|Romance
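
The core of the recommendation step is a mask followed by a ranking: drop the movies the user already rated, then keep the highest predictions. A minimal sketch on toy series follows; the movie ids and scores are hypothetical, and `nlargest` is used here as a shorthand for sorting and slicing.

```python
import numpy as np
import pandas as pd

# Toy data for one user, indexed by movie_id (hypothetical values).
orig = pd.Series([5.0, np.nan, np.nan, 2.0, np.nan], index=[10, 20, 30, 40, 50])
pred = pd.Series([4.8, 3.1, 4.2, 2.5, 3.9], index=orig.index)

# Keep predictions only for unrated movies, then take the two best.
top_n = pred[orig.isna()].nlargest(2)
print(top_n)
```

Movie 10 has the highest prediction overall but is excluded because the user has already rated it.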


## Model Evaluation

In [28]:
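# evaluate() and data.split() belong to the older Surprise API (see the deprecation
# warnings in the output); the suggested replacement is
# surprise.model_selection.cross_validate(svd, data, measures=['RMSE'], cv=5).
from surprise import Reader, Dataset, SVD, evaluate
# Reader() defaults to a 1-5 rating scale, which matches the MovieLens ratings.
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], Reader())
data.split(n_folds=5)  # 5-fold cross-validation
svd = SVD()
evaluate(svd, data, measures=['RMSE'])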


The evaluate() method is deprecated. Please use model_selection.cross_validate() instead.


Using data.split() or using load_from_folds() without using a CV iterator is now deprecated. 



Evaluating RMSE of algorithm SVD.

------------
Fold 1
RMSE: 0.8732
------------
Fold 2
RMSE: 0.8749
------------
Fold 3
RMSE: 0.8749
------------
Fold 4
RMSE: 0.8714
------------
Fold 5
RMSE: 0.8705
------------
------------
Mean RMSE: 0.8730
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'rmse': [0.8732042775103183,
                             0.874886595248249,
                             0.8749020220719009,
                             0.8713986051931782,
                             0.870481777791052]})
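
RMSE is the square root of the mean squared difference between predicted and actual ratings, so lower is better and the value is in rating units. The sketch below defines it with NumPy and averages the five fold scores reported above; the `rmse` helper and the three example ratings are illustrative and not part of Surprise.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical example: errors of 1, 0 and 2 stars on three ratings.
example = rmse([5, 3, 4], [4, 3, 2])

# Per-fold RMSE values copied from the output above.
fold_rmse = [0.8732042775103183, 0.874886595248249, 0.8749020220719009,
             0.8713986051931782, 0.870481777791052]
mean_rmse = float(np.mean(fold_rmse))
print(example, round(mean_rmse, 4))
```

The averaged value agrees with the mean RMSE of 0.8730 printed by the evaluation run.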