## Movie Rommendation using Collaborative Filtering

### Installing libraries and packages

In [0]:
import pandas as pd 
import numpy as np 

### Ingest

In [7]:
reader = Reader()
ratings = pd.read_csv('ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


### EDA

Total number of movies for which the users have given rating

In [34]:
len(ratings['movieId'].unique())

6540

Total number of Users participated in giving rating

In [19]:
len(ratings['userId'].unique())

300

In [20]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45529 entries, 0 to 45528
Data columns (total 4 columns):
userId       45529 non-null int64
movieId      45529 non-null int64
rating       45529 non-null float64
timestamp    45529 non-null int64
dtypes: float64(1), int64(3)
memory usage: 1.4 MB


**Average movie rating and the count of rating given for each movie. **

This will help us know whether our prediction results would be meaningful based on the number of ratings that movie has received.

For example, if we are going to predict what a new user would give rating to a movie which has got less count of ratings, the rating prediction won't make any meaning as it would be biased towards the opinion of the small set of users.

In [32]:
movie_rating_avg = ratings.groupby('movieId')['rating'].agg(['mean', 'count'])

movie_rating_avg

Unnamed: 0_level_0,mean,count
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,3.875000,112
2,3.291667,48
3,3.410714,28
4,2.600000,5
5,3.480000,25
6,3.925532,47
7,3.274194,31
8,3.000000,4
9,3.000000,5
10,3.508197,61


Similar to the above section, if the user has not given more ratings, when predicting the rate for any movie by this user, the prediction won't be correct.

In [33]:
user_rating_avg = ratings.groupby('userId')['rating'].agg(['mean', 'count'])
user_rating_avg

Unnamed: 0_level_0,mean,count
userId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,4.366379,232
2,3.948276,29
3,2.435897,39
4,3.555556,216
5,3.636364,44
6,3.493631,314
7,3.230263,152
8,3.574468,47
9,3.260870,46
10,3.278571,140


In [45]:
df_p = pd.pivot_table(ratings,values='rating',index='userId',columns='movieId')

print(df_p.shape)

(300, 6540)


### Model

In [84]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
data.split(n_folds=5)
svd = SVD()
evaluate(svd, data, measures=['RMSE', 'MAE'])



Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.8872
MAE:  0.6837
------------
Fold 2
RMSE: 0.8985
MAE:  0.6924
------------
Fold 3
RMSE: 0.9007
MAE:  0.6925
------------
Fold 4
RMSE: 0.9005
MAE:  0.6948
------------
Fold 5
RMSE: 0.8942
MAE:  0.6890
------------
------------
Mean RMSE: 0.8962
Mean MAE : 0.6905
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'mae': [0.6837485872623522,
                             0.6924226637469555,
                             0.692501711674358,
                             0.6948099038620642,
                             0.6889979731440053],
                            'rmse': [0.8871549220967876,
                             0.898454737347983,
                             0.9006932492266636,
                             0.9005452619270903,
                             0.8942471054149498]})

In [10]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f9713f59320>

In [46]:
algo = SVD()

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8858  0.8936  0.8995  0.9002  0.8935  0.8945  0.0052  
MAE (testset)     0.6828  0.6833  0.6971  0.6920  0.6915  0.6893  0.0055  
Fit time          2.79    2.79    2.74    2.69    2.70    2.74    0.04    
Test time         0.07    0.07    0.06    0.07    0.06    0.07    0.00    


{'fit_time': (2.7944629192352295,
  2.791969060897827,
  2.7406959533691406,
  2.6890599727630615,
  2.696943998336792),
 'test_mae': array([0.68277884, 0.68325066, 0.69705772, 0.69203251, 0.69150438]),
 'test_rmse': array([0.88578559, 0.89357735, 0.89947106, 0.90021259, 0.89347544]),
 'test_time': (0.06848740577697754,
  0.06882214546203613,
  0.06481456756591797,
  0.06637024879455566,
  0.06383824348449707)}

In [0]:
def recommend(movie_id, min_count):
    print("For movie ({})".format(movie_id))
    print("- Top 10 movies recommended based on Pearsons'R correlation - ")
    i = int(ratings.index[ratings['movieId'] == movie_id][0])
    target = df_p[i+1]
    similar_to_target = df_p.corrwith(target)
    corr_target = pd.DataFrame(similar_to_target, columns = ['PearsonR'])
    corr_target.dropna(inplace = True)
    corr_target = corr_target.sort_values('PearsonR', ascending = False)
    corr_target.index = corr_target.index.map(int)
    corr_target = corr_target.join(ratings).join(movie_rating_avg)[['PearsonR', 'movieId', 'count', 'mean']]
    print(corr_target[corr_target['count']>min_count][:10].to_string(index=False))

### Prediction

**Predicting user rating for a move by a user already present in the data**

In [39]:
svd.predict(1, 16, 3)

Prediction(uid=1, iid=16, r_ui=3, est=4.670604351336178, details={'was_impossible': False})

**Predicting movie rating for a new user**

A new user's rating for movie 16 would be 4.163

In [40]:
svd.predict(3098, 16, 3)

Prediction(uid=3098, iid=16, r_ui=3, est=4.163077673388976, details={'was_impossible': False})

**Predicting movie rating for a new movie by existing user**

User 1's rating for a new movie would be 4.223

**Recommending movies for User**

In [82]:
recommend(1, 30)
#ratings.index[ratings['movieId'] == 1]
#df_p

For movie (1)
- Top 10 movies recommended based on Pearsons'R correlation - 


  c = cov(x, y, rowvar)
  c *= 1. / np.float64(fact)


PearsonR   movieId  count      mean
                                   
1.000000       3.0    112  3.875000
0.769518   54503.0     60  3.733333
0.763212     367.0     69  3.913043
0.750847     247.0     41  3.402439
0.700637    2528.0     33  3.545455
0.698296    8984.0     33  3.560606
0.697911      47.0     91  3.785714
0.646771    2109.0     34  3.676471
0.641514  107141.0     35  4.042857
0.632267    2266.0     32  2.843750


In [41]:
svd.predict(1, 5050, 3)

Prediction(uid=1, iid=5050, r_ui=3, est=4.222670366912637, details={'was_impossible': False})

### Conclusion

Many users have rated only for few movies. Even though our recommendation engine suggests rating for them, this rating cannot be relied as it is not trained well for that user. Hence, the movie suggested for them may not be appropriate. Another thing is that, we do not know anything about the movie say, genre or cast. So, we do not know the liking of the user.