# Movies Recommender System


The dataset used is the MovieLens Latest Datasets, which can be found at https://grouplens.org/datasets/movielens/latest/

```
For this project, you will build a simple demo recommender system of your own, given the tools and skills you have learned in this course. It is up to you whether you want to implement a similarity- or machine learning-based recommender system.
```

We'll be using a machine learning-based recommender system, to select a __Collaborative recommender model__.

Using the small dataset, with 100k rating. 

__We don't need much more than a list of users, of movies, and of ratings to use this model.__

In [49]:
from surprise import Reader, Dataset, SVD
from surprise import SVD
from surprise import Dataset
import pandas as pd
from surprise.model_selection import cross_validate

# The Movie Database (md)

* The author use a dataset other than the one provided (Amazon Home Improvement Reviews) : Yes
* Description: as stated above, we provided information on what data is in the dataset, how much data, and where to find the dataset (if applicable).

In [50]:
md = pd. read_csv('data/movies_metadata.csv')
md = md[["id","imdb_id","title"]]
md.columns = ["movieId","imdb_id","title"]
#md.movieId = md.movieId.astype('int64',errors="ignore")
md.head()

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,movieId,imdb_id,title
0,862,tt0114709,Toy Story
1,8844,tt0113497,Jumanji
2,15602,tt0113228,Grumpier Old Men
3,31357,tt0114885,Waiting to Exhale
4,11862,tt0113041,Father of the Bride Part II


# Opening ratings for all users

In [51]:
reader = Reader()
ratings = pd.read_csv('data/ratings_small.csv')
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader) 

__Verifications__. We will check below for the integrity of data 
* The data was cleaned and free of faulty, unnecessary, and missing values. 
* The data was appropriate for the analyses.

In [68]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [69]:
ratings.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,100004.0,100004.0,100004.0,100004.0
mean,347.01131,12548.664363,3.543608,1129639000.0
std,195.163838,26369.198969,1.058064,191685800.0
min,1.0,1.0,0.5,789652000.0
25%,182.0,1028.0,3.0,965847800.0
50%,367.0,2406.5,4.0,1110422000.0
75%,520.0,5418.0,4.0,1296192000.0
max,671.0,163949.0,5.0,1476641000.0


* The data seem OK !

In [52]:
# Use the famous SVD algorithm
algo = SVD()

# Run 10-fold cross-validation and then print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=10, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 10 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Fold 8  Fold 9  Fold 10 Mean    Std     
RMSE (testset)    0.8903  0.8872  0.9017  0.8976  0.8851  0.8948  0.8942  0.8981  0.8999  0.8817  0.8931  0.0064  
MAE (testset)     0.6850  0.6825  0.6950  0.6905  0.6812  0.6872  0.6867  0.6912  0.6938  0.6789  0.6872  0.0051  
Fit time          5.33    5.40    5.86    5.48    5.62    5.50    5.38    5.41    5.50    5.44    5.49    0.15    
Test time         0.07    0.07    0.08    0.07    0.09    0.19    0.07    0.13    0.08    0.12    0.10    0.04    


{'test_rmse': array([0.89033037, 0.88721131, 0.90166726, 0.89757537, 0.88507642,
        0.89484201, 0.89419808, 0.89814205, 0.89989092, 0.8817161 ]),
 'test_mae': array([0.68498511, 0.68249329, 0.69497896, 0.6905423 , 0.68124366,
        0.68715458, 0.68670905, 0.69115397, 0.69380814, 0.67887323]),
 'fit_time': (5.332747220993042,
  5.397272825241089,
  5.864047527313232,
  5.475974798202515,
  5.622844934463501,
  5.502814292907715,
  5.38006067276001,
  5.41113805770874,
  5.503418922424316,
  5.436424970626831),
 'test_time': (0.07189559936523438,
  0.07032608985900879,
  0.07537055015563965,
  0.06665873527526855,
  0.08669495582580566,
  0.1870861053466797,
  0.06791830062866211,
  0.1251087188720703,
  0.0785529613494873,
  0.12360572814941406)}

We get a **MSE** of 0.8817161 - not so bad. 

Let us now train on our dataset and arrive at predictions.

In [53]:
trainset = data.build_full_trainset()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fbe4d46ec18>

# Does it work?
Checking user 333 and check the ratings s/he has given.

In [54]:
ratings[ratings['userId'] == 333]

Unnamed: 0,userId,movieId,rating,timestamp
46068,333,1,4.0,1441197471
46069,333,318,5.0,1441197184
46070,333,356,4.5,1441197368
46071,333,527,5.0,1441197187
46072,333,588,3.5,1441198986
...,...,...,...,...
46144,333,105844,3.5,1441198673
46145,333,109487,4.5,1441197391
46146,333,116797,5.0,1441197436
46147,333,117176,4.0,1441197950


In [84]:
algo.predict(333, 318, 3).est # rating for user 333 on movie 318

4.787334371802595

# Let's compare on this user

We should find something that matches the ratings the user has provided.

In [83]:
User333 =  ratings[ratings['userId'] == 333]
User333["estimated"] = User333.movieId.apply(lambda x: algo.predict(333, x, 3).est) 
User333

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,userId,movieId,rating,timestamp,estimated
46068,333,1,4.0,1441197471,4.151951
46069,333,318,5.0,1441197184,4.787334
46070,333,356,4.5,1441197368,4.338452
46071,333,527,5.0,1441197187,4.564067
46072,333,588,3.5,1441198986,4.088364
...,...,...,...,...,...
46144,333,105844,3.5,1441198673,4.229592
46145,333,109487,4.5,1441197391,4.545094
46146,333,116797,5.0,1441197436,4.868722
46147,333,117176,4.0,1441197950,4.379123


* Looks not too bad!

# Generalisation

In [66]:
md.movieId = md.movieId.astype('int64',errors="ignore")
def BestForUser(ID):
    BestGuessesUserOne = ratings[['userId', 'movieId', 'rating']].drop_duplicates(subset=['movieId'])
    BestGuessesUserOne["NewScore"] = BestGuessesUserOne.movieId.apply(lambda x: algo.predict(ID, x, 3).est)
    #BestGuessesUserOne = BestGuessesUserOne[~BestGuessesUserOne["userId"] == ID]
    md.movieId = BestGuessesUserOne.movieId.astype('int64',errors="ignore")
    RES = BestGuessesUserOne.sort_values(by=['NewScore'],ascending = False)[:10].merge(md,on="movieId")
    return RES

# Let's get favorite movies for User no 10

* Let's check that the system does what it set out to do. 

In [82]:
BestForUser(10)

Unnamed: 0,userId,movieId,rating,NewScore,imdb_id,title
0,4,858,5.0,4.851819,tt0110604,Mute Witness
1,15,1252,5.0,4.774213,tt0032455,Fantasia
2,5,1221,2.5,4.736326,tt0111693,When a Man Loves a Woman
3,19,969,5.0,4.717092,tt0099871,Jacob's Ladder
4,2,527,4.0,4.686256,tt0112445,The White Balloon
5,2,50,4.0,4.656696,tt0113627,Leaving Las Vegas
6,4,913,5.0,4.638657,tt0113986,Nine Months
7,3,7361,3.0,4.62854,tt0118055,Up Close & Personal
8,15,912,5.0,4.611156,tt0101026,Tie Me Up! Tie Me Down!
9,2,296,4.0,4.600558,tt0114814,The Usual Suspects


* Recommendations are quite close to expected results.