# Project 3 - Recommender Systems

## Association Rule mining, Collaborative Filtering and Content Based Filtering


### Brett Hallum, Mridul Jain, and Solomon Ndungu


# Introduction

The goal of this project is to analyze the MovieLens dataset and use it to generate movie recommendations for specific users, based on the movies they have already watched and the ratings they gave. Using the concepts of collaborative filtering, we can find a movie X that was liked by users similar to User A, and therefore recommend X to User A as well.

# Understanding the Data
GroupLens Research has collected and made available rating data sets from the MovieLens web site (http://movielens.org). The data sets were collected over various periods of time, depending on the size of the set.
The dataset contains multiple files. We are interested in two of them: u.data, which holds the userId, the movieId, the rating, and the date the rating was given, and u.item, which holds information about each movie.

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from this data set.

u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies.  Users and items are
              numbered consecutively from 1.  The data is randomly
              ordered. This is a tab separated list of 
              user id | item id | rating | timestamp.
              The time stamps are unix seconds since 1/1/1970 UTC   
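
To make the timestamp field concrete, here is a minimal sketch (standard-library Python 3, independent of the notebook's own Python 2 environment) that converts one timestamp to a UTC date. The value 881250949 is the timestamp of the first row printed by `data.head()` later in this notebook; the helper name `to_utc` is made up for this illustration.

```python
from datetime import datetime, timezone

def to_utc(unix_seconds):
    """Convert Unix seconds (as stored in u.data) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

# Timestamp of user 196's rating of item 242 (first row of data.head()).
rated_at = to_utc(881250949)
print(rated_at.isoformat())
```

The result is a date in early December 1997, which falls inside the collection window quoted above.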


u.item     -- Information about the items (movies); this is a tab separated
              list of
              movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.
              The movie ids are the ones used in the u.data data set.
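
As a rough illustration of how the 19 genre flags could be decoded, the sketch below maps the trailing fields of one already-split row to genre names. The row itself is made up for the example (it is not copied from u.item), and `genres_of` is a hypothetical helper, not something the notebook defines.

```python
# Genre names in the order listed in the u.item description above.
GENRES = ["unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
          "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
          "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]

def genres_of(fields):
    """Return the genre names whose flag is '1'; the flags are the last 19 fields."""
    flags = fields[-len(GENRES):]
    return [name for name, flag in zip(GENRES, flags) if flag == "1"]

# Made-up row for illustration only: 5 descriptive fields followed by 19 flags.
row = (["9999", "Example Movie (1995)", "01-Jan-1995", "", "http://example.com"]
       + ["0", "0", "0", "1", "1", "1"] + ["0"] * 13)
print(genres_of(row))
```

Because a movie can carry several flags at once, the function returns a list rather than a single genre.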


# Data Exploration and Visualization


In [1]:
import os
# Raw string so the backslashes in the Windows path are never read as escape sequences
os.chdir(r'C:\Users\Halltrino\Desktop\MDS Downloads\Data Mining\Project 3\ml-100k')
os.getcwd()

'C:\\Users\\Halltrino\\Desktop\\MDS Downloads\\Data Mining\\Project 3\\ml-100k'

In [2]:
import numpy as np 
import pandas as pd

In [34]:
#Files to be used for analysis

dataFile='u.data'
movieInfoFile='u.item'

In [35]:
# The files contain no header row, so we pass the column names explicitly.
# From 'u.item' we only need columns 0 and 1 (the movie ID and the title).

data=pd.read_csv(dataFile,sep="\t",header=None,names=['userId','itemId','rating','timestamp'])
movieInfo=pd.read_csv(movieInfoFile,sep="|", header=None, index_col=False,
                     names=['itemId','title'], usecols=[0,1])

In [36]:
print data.head()
print '\n'
print movieInfo.head()

   userId  itemId  rating  timestamp
0     196     242       3  881250949
1     186     302       3  891717742
2      22     377       1  878887116
3     244      51       2  880606923
4     166     346       1  886397596


   itemId              title
0       1   Toy Story (1995)
1       2   GoldenEye (1995)
2       3  Four Rooms (1995)
3       4  Get Shorty (1995)
4       5     Copycat (1995)


In [37]:
# Merging the two files together into one single dataFrame. We will use this dataFrame in the further analysis.

data=pd.merge(data,movieInfo,on='itemId')

print (data.head())

   userId  itemId  rating  timestamp         title
0     196     242       3  881250949  Kolya (1996)
1      63     242       3  875747190  Kolya (1996)
2     226     242       5  883888671  Kolya (1996)
3     154     242       3  879138235  Kolya (1996)
4     306     242       5  876503793  Kolya (1996)


In [40]:
import graphlab as gl

gl_data = gl.SFrame(data)
print (gl_data.head())

model = gl.recommender.create(gl_data, user_id="userId", item_id="title", target="rating")
results = model.recommend(users=None, k=5)
model.save("my_model")

results.head() # the recommendation output

+--------+--------+--------+-----------+--------------+
| userId | itemId | rating | timestamp |    title     |
+--------+--------+--------+-----------+--------------+
|  196   |  242   |   3    | 881250949 | Kolya (1996) |
|   63   |  242   |   3    | 875747190 | Kolya (1996) |
|  226   |  242   |   5    | 883888671 | Kolya (1996) |
|  154   |  242   |   3    | 879138235 | Kolya (1996) |
|  306   |  242   |   5    | 876503793 | Kolya (1996) |
|  296   |  242   |   4    | 884196057 | Kolya (1996) |
|   34   |  242   |   5    | 888601628 | Kolya (1996) |
|  271   |  242   |   4    | 885844495 | Kolya (1996) |
|  201   |  242   |   4    | 884110598 | Kolya (1996) |
|  209   |  242   |   4    | 883589606 | Kolya (1996) |
+--------+--------+--------+-----------+--------------+
[10 rows x 5 columns]



userId,title,score,rank
196,Titanic (1997),4.30369069694,1
196,Star Wars (1977),4.2474428591,2
196,Casablanca (1942),4.20361530675,3
196,Schindler's List (1993),4.17192911623,4
196,Good Will Hunting (1997),4.13874630152,5
63,Titanic (1997),4.48721780895,1
63,Good Will Hunting (1997),4.32851194857,2
63,Casablanca (1942),4.29275974153,3
63,Rear Window (1954),4.21884612261,4
63,"Boot, Das (1981)",4.0773907852,5


In [50]:
item_item = gl.recommender.item_similarity_recommender.create(gl_data, 
                                  user_id="userId", 
                                  item_id="title", 
                                  target="rating",
                                  only_top_k=5,
                                  similarity_type="cosine")

results = item_item.get_similar_items(k=5)
results.head()

title,similar,score,rank
Kolya (1996),"English Patient, The (1996) ...",0.392465889454,1
Kolya (1996),"Full Monty, The (1997)",0.382244706154,2
Kolya (1996),Everyone Says I Love You (1996) ...,0.355488300323,3
Kolya (1996),Secrets & Lies (1996),0.352207303047,4
Kolya (1996),Ulee's Gold (1997),0.345902621746,5
L.A. Confidential (1997),"Full Monty, The (1997)",0.554080307484,1
L.A. Confidential (1997),"English Patient, The (1996) ...",0.52740240097,2
L.A. Confidential (1997),Contact (1997),0.504989147186,3
L.A. Confidential (1997),Titanic (1997),0.502544820309,4
L.A. Confidential (1997),Apt Pupil (1998),0.479608356953,5
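
For intuition about the `score` column, here is a minimal sketch of cosine similarity between two items, each represented by the ratings it received. It assumes the textbook definition in which a user who did not rate an item contributes 0; GraphLab's internal implementation is not shown in this notebook and may differ in details. The ratings below are made up.

```python
import math

def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two items, each given as a {userId: rating} dict.

    Users missing from a dict are treated as having rating 0 for that item.
    """
    common = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[u] * ratings_b[u] for u in common)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical ratings for two items (user ids and values are invented).
item_x = {1: 5, 2: 3, 3: 4}
item_y = {2: 3, 3: 4, 4: 2}
print(cosine_similarity(item_x, item_y))
```

Only users 2 and 3 rated both items, so only they contribute to the numerator, while every rating contributes to the norms. Items with many raters who do not overlap therefore score low.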


In [44]:
train, test = gl.recommender.util.random_split_by_user(gl_data,
                                                    user_id="userId", item_id="title",
                                                    max_num_users=100, item_test_proportion=0.2)

In [61]:
from IPython.display import display
from IPython.display import Image

gl.canvas.set_target('ipynb')


item_item = gl.recommender.item_similarity_recommender.create(train, 
                                  user_id="userId", 
                                  item_id="title", 
                                  target="rating",
                                  only_top_k=5,
                                  similarity_type="cosine")

rmse_results = item_item.evaluate(test)


Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.54      | 0.0337716050497 |
|   2    |      0.5       | 0.0652904091777 |
|   3    |      0.46      | 0.0862699435694 |
|   4    |     0.4275     |  0.101901829927 |
|   5    |     0.406      |  0.120263715656 |
|   6    | 0.396666666667 |  0.140199119811 |
|   7    | 0.385714285714 |  0.156405050301 |
|   8    |    0.37125     |  0.171508486366 |
|   9    | 0.363333333333 |  0.188212550986 |
|   10   |      0.35      |  0.200560665057 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 3.685461074115566)

Per User RMSE (best)
+--------+-------+---------------+
| userId | count |      rmse     |
+--------+-------+---------------+
|  502   |   3   | 1.83556845078 |
+--------+-------+---------------+
[1 rows x 3 columns]


Per User RMSE (worst)


In [47]:
print rmse_results.viewkeys()
print rmse_results['rmse_by_item']

dict_keys(['rmse_by_user', 'precision_recall_overall', 'rmse_by_item', 'precision_recall_by_user', 'rmse_overall'])
+-------------------------------+-------+---------------+
|             title             | count |      rmse     |
+-------------------------------+-------+---------------+
|        Sneakers (1992)        |   4   | 4.17301815399 |
| Much Ado About Nothing (1993) |   2   | 4.51155886047 |
|     Drop Dead Fred (1991)     |   1   |      5.0      |
| Terminator 2: Judgment Day... |   5   |  3.8793210072 |
|      Jurassic Park (1993)     |   10  | 4.16177615306 |
|  Fried Green Tomatoes (1991)  |   3   | 4.07356373546 |
|      Reality Bites (1994)     |   1   |      1.0      |
|      Mary Poppins (1964)      |   5   | 3.39099247779 |
|         Casper (1995)         |   1   | 4.97396377741 |
| Free Willy 2: The Adventur... |   1   | 1.99172160051 |
+-------------------------------+-------+---------------+
[816 rows x 3 columns]
Note: Only the head of the SFrame is printed.

In [51]:
rmse_results['rmse_by_user']

userId,count,rmse
71,10,3.76520625646
112,9,3.89153999894
750,7,2.95196665913
134,4,3.39326343458
285,7,4.15225832136
653,61,2.98682996762
364,7,3.51027759862
932,53,3.97447551233
80,6,3.37942691148
66,10,3.32874255519
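
The `rmse` column can be read as the usual root-mean-square error computed over each user's held-out ratings. The sketch below assumes that standard definition (the notebook does not show GraphLab's own code) and uses made-up ratings and predictions.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length rating sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical held-out ratings for one user and the model's predictions.
actual = [4, 3, 5, 2]
predicted = [3.5, 3.0, 4.0, 3.0]
print(rmse(actual, predicted))
```

Because errors are squared before averaging, a single prediction that is far off weighs more heavily than several small misses.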


In [52]:
rmse_results['precision_recall_by_user']

userId,cutoff,precision,recall,count
12,1,0.0,0.0,12
12,2,0.5,0.0833333333333,12
12,3,0.333333333333,0.0833333333333,12
12,4,0.25,0.0833333333333,12
12,5,0.2,0.0833333333333,12
12,6,0.166666666667,0.0833333333333,12
12,7,0.142857142857,0.0833333333333,12
12,8,0.125,0.0833333333333,12
12,9,0.222222222222,0.166666666667,12
12,10,0.2,0.166666666667,12
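
The rows for user 12 can be reproduced with a short sketch of precision and recall at a cutoff k, assuming precision is hits / k and recall is hits / (number of held-out items). The item names are invented, and the hit positions (ranks 2 and 9) were chosen only because they match the numbers above; the output does not say which titles were actually hit.

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """Precision and recall of the top-k recommendations against held-out items."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    precision = hits / float(k)
    recall = hits / float(len(relevant_items)) if relevant_items else 0.0
    return precision, recall

# Hypothetical user with 12 held-out items (the `count` column for user 12).
relevant = set("held_out_%d" % i for i in range(12))
ranked = ["other_%d" % i for i in range(10)]
ranked[1] = "held_out_0"   # hit at rank 2
ranked[8] = "held_out_1"   # hit at rank 9

for k in (1, 2, 3, 9, 10):
    print(k, precision_recall_at_k(ranked, relevant, k))
```

Recall can only stay flat or rise as k grows, while precision drops whenever an extra slot is filled by a miss, which is the pattern visible in the table.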


In [53]:
import graphlab.aggregate as agg

# Aggregations to compute for each cutoff
agg_list = [agg.AVG('precision'),agg.STD('precision'),agg.AVG('recall'),agg.STD('recall')]

# Group the per-user results by 'cutoff' (the number of top-ranked items considered, k)
# and apply the aggregations to each group. The exact equations are given at:
# https://dato.com/products/create/docs/generated/graphlab.recommender.util.precision_recall_by_user.html#graphlab.recommender.util.precision_recall_by_user
rmse_results['precision_recall_by_user'].groupby('cutoff',agg_list)

# Note: the resulting groups are not sorted by cutoff

cutoff,Avg of precision,Stdv of precision,Avg of recall,Stdv of recall
16,0.295625,0.21863479109,0.264979854992,0.1527319002
10,0.355,0.256661255354,0.203903261099,0.141956316442
36,0.210277777778,0.164347522113,0.39190022659,0.174300815001
26,0.244230769231,0.186478975448,0.341312702463,0.170462592758
41,0.19487804878,0.154969053348,0.406922634427,0.16791318387
3,0.46,0.335923271663,0.0862699435694,0.0884616982703
1,0.55,0.497493718553,0.0347716050497,0.0565883645696
6,0.393333333333,0.285306852354,0.137260131716,0.111453593295
11,0.342727272727,0.253522294323,0.213772285929,0.14479146177
2,0.51,0.380657326213,0.0687904091777,0.0809333656706


## Cross-Validated Collaborative Filtering

In [54]:
rec1 = gl.recommender.ranking_factorization_recommender.create(train, 
                                  user_id="userId", 
                                  item_id="title", 
                                  target="rating")

rmse_results = rec1.evaluate(test)


Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.34      | 0.0168347480649 |
|   2    |      0.29      | 0.0301152742516 |
|   3    | 0.316666666667 | 0.0525759989921 |
|   4    |     0.2975     | 0.0653803011618 |
|   5    |     0.278      | 0.0822655256673 |
|   6    | 0.258333333333 | 0.0899961891204 |
|   7    | 0.255714285714 |  0.101416976214 |
|   8    |    0.24625     |  0.11139613242  |
|   9    | 0.234444444444 |  0.118448697233 |
|   10   |     0.231      |  0.128633052462 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 1.3414340599301695)

Per User RMSE (best)
+--------+-------+----------------+
| userId | count |      rmse      |
+--------+-------+----------------+
|  549   |   6   | 0.641544379553 |
+--------+-------+----------------+
[1 rows x 3 columns]


Per User RMSE (w

In [56]:
rmse_results['precision_recall_by_user'].groupby('cutoff',[agg.AVG('precision'),agg.STD('precision'),agg.AVG('recall'),agg.STD('recall')])

cutoff,Avg of precision,Stdv of precision,Avg of recall,Stdv of recall
16,0.206875,0.179777423152,0.176885504309,0.133767637785
10,0.231,0.201839044786,0.128633052462,0.122170430548
36,0.161666666667,0.135593118657,0.293910550565,0.158672806653
26,0.175384615385,0.151259016065,0.231069021139,0.150865260114
41,0.154146341463,0.128454872902,0.321246512059,0.15991395476
3,0.316666666667,0.306865877325,0.0525759989921,0.0638382433847
1,0.34,0.473708771293,0.0168347480649,0.0306457872504
6,0.258333333333,0.228977194401,0.0899961891204,0.099109754155
11,0.220909090909,0.192527769812,0.132511818526,0.120857481108
2,0.29,0.340440890611,0.0301152742516,0.0457600279036


In [57]:
rec1 = gl.recommender.ranking_factorization_recommender.create(train, 
                                  user_id="userId", 
                                  item_id="title", 
                                  target="rating",
                                  num_factors=16,                 # override the default value
                                  regularization=1e-02,           # override the default value
                                  linear_regularization = 1e-3)   # override the default value

rmse_results = rec1.evaluate(test)


Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.26      | 0.0108226642059 |
|   2    |     0.225      | 0.0179836007001 |
|   3    |      0.2       |  0.024245414697 |
|   4    |     0.1875     | 0.0308640844179 |
|   5    |     0.178      | 0.0370503787486 |
|   6    | 0.183333333333 | 0.0496711891019 |
|   7    | 0.181428571429 | 0.0568688817641 |
|   8    |      0.17      | 0.0591621851826 |
|   9    | 0.158888888889 | 0.0620789052571 |
|   10   |     0.157      | 0.0694198425776 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 1.0834667246990073)

Per User RMSE (best)
+--------+-------+----------------+
| userId | count |      rmse      |
+--------+-------+----------------+
|   46   |   4   | 0.406251880845 |
+--------+-------+----------------+
[1 rows x 3 columns]


Per User RMSE (w

## Comparison to Item-Item matrix

In [62]:
comparison = gl.recommender.util.compare_models(test, [item_item, rec1])

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.54      | 0.0337716050497 |
|   2    |      0.5       | 0.0652904091777 |
|   3    |      0.46      | 0.0862699435694 |
|   4    |     0.4275     |  0.101901829927 |
|   5    |     0.406      |  0.120263715656 |
|   6    | 0.396666666667 |  0.140199119811 |
|   7    | 0.385714285714 |  0.156405050301 |
|   8    |    0.37125     |  0.171508486366 |
|   9    | 0.363333333333 |  0.188212550986 |
|   10   |      0.35      |  0.200560665057 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 3.685461074115566)

Per User RMSE (best)
+--------+-------+---------------+
| userId | count |      rmse     |
+--------+-------+---------------+
|  502   |   3   | 1.83556845078 |
+--------+-------+---------------+
[1 rows x 3 colum

In [63]:
comparisonstruct = gl.compare(test,[item_item, rec1])

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.54      | 0.0337716050497 |
|   2    |      0.5       | 0.0652904091777 |
|   3    |      0.46      | 0.0862699435694 |
|   4    |     0.4275     |  0.101901829927 |
|   5    |     0.406      |  0.120263715656 |
|   6    | 0.396666666667 |  0.140199119811 |
|   7    | 0.385714285714 |  0.156405050301 |
|   8    |    0.37125     |  0.171508486366 |
|   9    | 0.363333333333 |  0.188212550986 |
|   10   |      0.35      |  0.200560665057 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.26      | 0.0108226

In [64]:
gl.show_comparison(comparisonstruct,[item_item, rec1])

In [65]:
params = {'user_id': 'userId', 
          'item_id': 'title', 
          'target': 'rating',
          'num_factors': [8, 12, 16, 24, 32], 
          'regularization':[0.001] ,
          'linear_regularization': [0.001]}

job = gl.model_parameter_search.create( (train,test),
        gl.recommender.ranking_factorization_recommender.create,
        params,
        max_models=5,
        environment=None)

# Also note that this evaluator supports sklearn
# https://dato.com/products/create/docs/generated/graphlab.toolkits.model_parameter_search.create.html?highlight=model_parameter_search

[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.job: Creating a LocalAsync environment called 'async'.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-10-2016-21-18-1000000' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-10-2016-21-18-1000000' scheduled.
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: A job with name 'Model-Parameter-Search-Aug-10-2016-21-18-1000000' already exists. Renaming the job to 'Model-Parameter-Search-Aug-10-2016-21-18-1000000-743f0'.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-10-2016-21-18-1000000-743f0' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-10-2016-21-18-1000000-743f0' scheduled.


In [68]:
bst_prms = job.get_best_params()
bst_prms
models = job.get_models()

In [69]:
comparisonstruct = gl.compare(test,models)
gl.show_comparison(comparisonstruct,models)

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    |      0.17      | 0.00521618024262 |
|   2    |     0.205      | 0.0160499609046  |
|   3    | 0.173333333333 | 0.0227758209531  |
|   4    |     0.1675     | 0.0277766242282  |
|   5    |     0.156      | 0.0334380004475  |
|   6    |     0.155      | 0.0392380007957  |
|   7    |      0.15      | 0.0457521935858  |
|   8    |    0.14625     |  0.049645427178  |
|   9    | 0.147777777778 | 0.0569375351715  |
|   10   |     0.139      |  0.058921409606  |
+--------+----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    |      0.18

In [71]:
comparisonstruct = gl.compare(test,[models[4], item_item])
gl.show_comparison(comparisonstruct,[models[4], item_item])

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.25      | 0.0113564230827 |
|   2    |      0.24      | 0.0185154042942 |
|   3    | 0.223333333333 | 0.0268064623146 |
|   4    |     0.2075     | 0.0341635535247 |
|   5    |     0.198      | 0.0411876239919 |
|   6    | 0.196666666667 | 0.0503010076153 |
|   7    | 0.182857142857 | 0.0566678937011 |
|   8    |      0.18      | 0.0619655090517 |
|   9    |      0.17      | 0.0648553030167 |
|   10   |     0.167      | 0.0717265676749 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.54      | 0.0337716

In [39]:
print data.shape
print data.head()

(100000, 5)
   userId  itemId  rating  timestamp         title
0     196     242       3  881250949  Kolya (1996)
1      63     242       3  875747190  Kolya (1996)
2     226     242       5  883888671  Kolya (1996)
3     154     242       3  879138235  Kolya (1996)
4     306     242       5  876503793  Kolya (1996)


In [8]:
data=data.sort_values(['userId','itemId'],ascending=[False,True])

# Let's see how many users and how many movies there are.
# IDs are numbered consecutively from 1, so the largest ID equals the count.
numUsers=max(data.userId)
numMovies=max(data.itemId)

moviesPerUser=data.userId.value_counts()
usersPerMovie=data.title.value_counts()

print 'Number of Users: ', numUsers
print 'Number of Movies: ', numMovies
print '\n'
print 'Number of users that rate a particular Movie: \n\n', usersPerMovie.head()
print '\n'
print 'Number of movies rated by particular User: \n\n', moviesPerUser.head()

Number of Users:  943
Number of Movies:  1682


Number of users that rate a particular Movie: 

Star Wars (1977)             583
Contact (1997)               509
Fargo (1996)                 508
Return of the Jedi (1983)    507
Liar Liar (1997)             485
Name: title, dtype: int64


Number of movies rated by particular User: 

405    737
655    685
13     636
450    540
276    518
Name: userId, dtype: int64


In [9]:
data.head()

,userId,itemId,rating,timestamp,title
23781,943,2,5,888639953,GoldenEye (1995)
65410,943,9,3,875501960,Dead Man Walking (1995)
35098,943,11,4,888639000,Seven (Se7en) (1995)
43773,943,12,5,888639093,"Usual Suspects, The (1995)"
57040,943,22,4,888639042,Braveheart (1995)


In [10]:
# Return up to N movies that a specific user rated 5 (rating > 4), in the current row order of `data`.
# N is an arbitrary number and can be changed as needed.

def topN(activeUser,N):
    user_topN = data.loc[data.userId == activeUser]
    return user_topN.loc[user_topN.rating > 4].head(N)

In [11]:
moviesPerUser.index[:10]

Int64Index([405, 655, 13, 450, 276, 416, 537, 303, 234, 393], dtype='int64')

In [42]:
Num_Active_Critics_to_Check = 20
Num_Movies_by_Each_Critic = 500

# Collect the top-rated movies of the most active users ("critics").
# pd.concat is used instead of repeated DataFrame.append calls, which pandas 2.0 removed.
TopMoviesList = pd.concat([topN(i,Num_Movies_by_Each_Critic)
                           for i in moviesPerUser.index[:Num_Active_Critics_to_Check]])

del TopMoviesList['userId']
del TopMoviesList['timestamp']

# Keep the movies that more than 20% of the critics gave the top rating

TopMoviesList = TopMoviesList.title.value_counts()
TopMoviesList = TopMoviesList[TopMoviesList>Num_Active_Critics_to_Check/5]

print '\nMovies that are rated highly by most active movie raters in the dataset\n\n', TopMoviesList.head(10)


Movies that are rated highly by most active movie raters in the dataset

Star Wars (1977)                                                               15
Godfather, The (1972)                                                          13
Usual Suspects, The (1995)                                                     11
Pulp Fiction (1994)                                                            10
Monty Python and the Holy Grail (1974)                                         10
Apocalypse Now (1979)                                                           9
Schindler's List (1993)                                                         9
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)     9
Empire Strikes Back, The (1980)                                                 9
Shawshank Redemption, The (1994)                                                9
Name: title, dtype: int64


In [13]:
# User ID 405 is the most active user and appears to be a movie buff, so it is worth checking which movies they liked.
# Let's look at user 405's highest- and lowest-rated movies.

user_405 = data.loc[data.userId == 405]
user_405_HighestRatings = user_405.loc[user_405.rating > 4]
user_405_LowestRatings = user_405.loc[user_405.rating < 2]

In [14]:
print '5 Highest Rated Movies by UserID 405', user_405_HighestRatings.head(5)
print '\n5 Lowest Rated Movies by UserID 405', user_405_LowestRatings.head(5)

5 Highest Rated Movies by UserID 405        userId  itemId  rating  timestamp                       title
43709     405      12       5  885545306  Usual Suspects, The (1995)
56861     405      22       5  885545167           Braveheart (1995)
14992     405      23       5  885545372          Taxi Driver (1976)
68788     405      38       5  885548093             Net, The (1995)
48303     405      47       5  885545429              Ed Wood (1994)

5 Lowest Rated Movies by UserID 405        userId  itemId  rating  timestamp                 title
23701     405       2       1  885547953      GoldenEye (1995)
72281     405      27       1  885546487       Bad Boys (1995)
89654     405      30       1  885549544  Belle de jour (1967)
87587     405      31       1  885548579   Crimson Tide (1995)
6166      405      32       1  885546025          Crumb (1994)


As in any personalized recommendation scenario, the introduction of new users or new items can
cause the cold start problem: there is too little data on these new entries for
collaborative filtering to work accurately.
To work around this, we can quickly find the most active raters, whom we call Movie Critics, and see which movies they rated highest
and which movies they rated lowest. The highest-rated of these movies can then be recommended to people who are new to the system
and have not yet rated or seen any movies.

In [15]:
# Return up to N movies that a specific user rated below 3, in the current row order of `data`. N is an arbitrary number and can be changed as needed.

def bottomN(activeUser,N):
    user_bottomN = data.loc[data.userId == activeUser]
    return user_bottomN.loc[user_bottomN.rating < 3].head(N)

In [41]:
Num_Active_Critics_to_Check = 20
Num_Movies_by_Each_Critic = 500

# Collect the low-rated movies of the most active users ("critics").
# pd.concat is used instead of repeated DataFrame.append calls, which pandas 2.0 removed.
bottomMoviesList = pd.concat([bottomN(i,Num_Movies_by_Each_Critic)
                              for i in moviesPerUser.index[:Num_Active_Critics_to_Check]])

del bottomMoviesList['userId']
del bottomMoviesList['timestamp']

# Keep the movies that more than 20% of the critics gave a low rating

bottomMoviesList = bottomMoviesList.title.value_counts()
bottomMoviesList = bottomMoviesList[bottomMoviesList>Num_Active_Critics_to_Check/5]

print '\nMovies that are rated low by most active movie raters in the dataset\n\n', bottomMoviesList.head(10)


Movies that are rated low by most active movie raters in the dataset

Batman Forever (1995)                8
Die Hard: With a Vengeance (1995)    7
Natural Born Killers (1994)          7
Pretty Woman (1990)                  7
Very Brady Sequel, A (1996)          7
Volcano (1997)                       7
Waterworld (1995)                    7
Broken Arrow (1996)                  6
Mission: Impossible (1996)           6
Brady Bunch Movie, The (1995)        6
Name: title, dtype: int64


In [17]:
from scipy.spatial.distance import correlation 
def similarity(user1,user2):
    # Mean-center each user's ratings. np.nanmean() returns the mean of an
    # array while ignoring any NaN values (movies the user has not rated).
    user1=np.array(user1)-np.nanmean(user1)
    user2=np.array(user2)-np.nanmean(user2)
    # Restrict both users to the movies they have in common.
    # Note: comparisons with NaN are False, so unrated movies are dropped, but this
    # test also drops movies that either user rated at or below their own mean.
    commonItemIds=[i for i in range(len(user1)) if user1[i]>0 and user2[i]>0]
    if len(commonItemIds)==0:
        # If there are no movies in common 
        return 0
    else:
        user1=np.array([user1[i] for i in commonItemIds])
        user2=np.array([user2[i] for i in commonItemIds])
        # scipy's correlation() is the correlation distance, 1 - r:
        # 0 means perfectly correlated and 2 means perfectly anti-correlated.
        return correlation(user1,user2)
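
For comparison, here is a hedged alternative sketch rather than the notebook's method. `scipy.spatial.distance.correlation` returns the correlation distance 1 - r, so small values mean similar users; the variant below returns the Pearson correlation r itself, so larger means more similar. It also selects co-rated movies with an explicit NaN test, so below-average ratings are kept. The function name and the sample vectors are made up, and the means are taken over co-rated movies only, which is a design choice.

```python
import math

def pearson_similarity(user1, user2):
    """Pearson correlation over co-rated movies; NaN marks 'not rated'.

    Returns None when the correlation is undefined (fewer than two
    co-rated movies, or zero variance for either user).
    """
    pairs = [(a, b) for a, b in zip(user1, user2)
             if not math.isnan(a) and not math.isnan(b)]
    if len(pairs) < 2:
        return None
    mean_a = sum(a for a, _ in pairs) / len(pairs)
    mean_b = sum(b for _, b in pairs) / len(pairs)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return None
    return cov / math.sqrt(var_a * var_b)

nan = float("nan")
u1 = [5.0, 3.0, nan, 1.0]   # hypothetical rating rows
u2 = [4.0, 2.0, 5.0, nan]
u3 = [1.0, 3.0, nan, 5.0]
print(pearson_similarity(u1, u2))  # same ordering on the co-rated movies
print(pearson_similarity(u1, u3))  # opposite ordering
```

Returning None for the undefined case keeps "no evidence" separate from "uncorrelated", which a plain 0 would blur.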

In [18]:
# Create a very sparse matrix "user_to_Movie_Rating_Matrix" of user IDs by movie ratings, which we will use later
# to compute the user-user correlation and so find which users are similar to each other.

user_to_Movie_Rating_Matrix=pd.pivot_table(data, values='rating',
                                    index=['userId'], columns=['itemId'])

In [19]:
user_to_Movie_Rating_Matrix.head()

userId\itemId,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,,,,,,,,,,
2,4.0,,,,,,,,,2.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,3.0,,,,,,,,,...,,,,,,,,,,
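
To put a number on how sparse this matrix is, a quick calculation with the counts reported earlier (943 users, 1682 movies, 100000 ratings) gives the share of filled cells. It assumes each user rated a given movie at most once, so that every rating occupies its own cell.

```python
num_users = 943
num_movies = 1682
num_ratings = 100000

cells = num_users * num_movies
density = num_ratings / float(cells)

print(cells)                      # total cells in the user x movie matrix
print(round(100 * density, 2))    # percentage of cells that hold a rating
```

Roughly 6% of the cells hold a rating; the rest are NaN, which is why the similarity step has to handle missing values explicitly.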


### Collaborative Filtering

#### Memory-based: Find similar users (user-based CF) or items (item-based CF) to predict missing ratings
1. Produce recommendations based on the preferences of similar users 
	(Goldberg et al., 1992; Resnick et al., 1994; Mild and Reutterer, 2001)
2. Produce recommendations based on the relationship between items in the user-item matrix 
	(Kitts et al., 2000; Sarwar et al., 2001)

#### Model-based: Build a model from the rating data (clustering, latent semantic structure, etc.) and then use this model to predict missing ratings
There are many techniques:
1. Cluster users and then recommend the items liked by users in the cluster closest to the active user
2. Mine association rules and then use the rules to recommend items (for binary/binarized data)
3. Define a null-model (a stochastic process which models usage of independent items) and then find significant deviation from the null-model
4. Learn a latent factor model from the data and then use the discovered factors to find items with high expected ratings
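
As an illustration of technique 4, the sketch below scores one user-movie pair with a common form of latent factor model: global mean, plus user and item biases, plus the dot product of the two factor vectors. All parameter values are made up, and the model GraphLab actually fits may include further terms.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors):
    """Latent factor prediction: global mean + biases + dot product of the factors."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    return global_mean + user_bias + item_bias + interaction

# Made-up parameters for one user and one movie, using 3 latent factors.
score = predict_rating(
    global_mean=3.5, user_bias=0.2, item_bias=-0.1,
    user_factors=[0.5, -1.0, 0.25], item_factors=[1.0, 0.5, 2.0])
print(score)
```

Training consists of choosing the biases and factor vectors that make such predictions close to the observed ratings. The number of factors sets the length of each vector.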

First we are going to use the K-Nearest Neighbors technique, a memory-based collaborative filtering technique.
We find the K nearest neighbors (the most similar users) of the user in question and use those neighbors' ratings for a specific movie to predict the rating that user would give it.

The idea is to predict a user's ratings for the movies or products they have not yet rated, based on the ratings or feedback given by other users who are similar to the user we are predicting for.

Next we are going to use a model-based approach, applying latent factor models and association rule mining to predict ratings and recommend movies to users.
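
For the association rule part, the usual quantities behind a rule such as "liked A, so recommend B" are support, confidence and lift. The sketch below computes them on binarized data, where each user is reduced to the set of titles they liked. The baskets and the helper name `rule_metrics` are made up for illustration.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    n = float(len(baskets))
    both = sum(1 for b in baskets if antecedent in b and consequent in b)
    ante = sum(1 for b in baskets if antecedent in b)
    cons = sum(1 for b in baskets if consequent in b)
    support = both / n                                  # share of users who liked both
    confidence = both / float(ante) if ante else 0.0    # P(consequent | antecedent)
    lift = confidence / (cons / n) if cons else 0.0     # confidence relative to baseline
    return support, confidence, lift

# Made-up "liked" sets, one per user.
baskets = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A"},
    {"B", "C"},
]
print(rule_metrics(baskets, "A", "B"))
```

A lift below 1, as in this toy data, means that liking A makes liking B slightly less likely than the baseline, so such a rule would not be a useful recommendation.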
    

In [20]:
similarityMatrix=pd.DataFrame(index=user_to_Movie_Rating_Matrix.index,
                                  columns=['Similarity'])

In [21]:
similarityMatrix.head()

userId,Similarity
1,
2,
3,
4,
5,


In [22]:
for i in user_to_Movie_Rating_Matrix.index:
    # Find the similarity between user i and user 2 and add it to the similarityMatrix
    similarityMatrix.loc[i]=similarity(user_to_Movie_Rating_Matrix.loc[2],
                                          user_to_Movie_Rating_Matrix.loc[i])

# Sort once after the loop, in descending order of the value returned by similarity()
similarityMatrix=similarityMatrix.sort_values('Similarity',ascending=False)

In [23]:
similarityMatrix.head()

userId,Similarity
55,2
107,2
77,2
443,2
847,2


In [24]:
nearestNeighbours=similarityMatrix[:10]
nearestNeighbours

userId,Similarity
55,2.0
107,2.0
77,2.0
443,2.0
847,2.0
370,1.66667
314,1.6455
230,1.63246
913,1.61237
675,1.61237
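
Once a set of neighbours has been chosen, a rating can be predicted from them. The sketch below uses one common scheme: the active user's mean plus the similarity-weighted average of each neighbour's deviation from their own mean. It assumes the weights are similarities where larger means more alike (for example a correlation r), not the distance values listed above, and all numbers are made up.

```python
def predict_from_neighbours(active_mean, neighbours):
    """Predict a rating from neighbours' mean-centered ratings.

    neighbours: list of (similarity, neighbour_rating, neighbour_mean) tuples.
    """
    weighted = sum(sim * (rating - mean) for sim, rating, mean in neighbours)
    total = sum(abs(sim) for sim, _, _ in neighbours)
    if total == 0:
        # No usable neighbours: fall back to the active user's own mean.
        return active_mean
    return active_mean + weighted / total

# Hypothetical neighbours for one movie: (similarity, their rating, their mean rating).
neighbours = [(0.9, 5.0, 4.0), (0.6, 3.0, 3.5), (0.3, 4.0, 3.0)]
print(predict_from_neighbours(3.2, neighbours))
```

Working with deviations from each neighbour's mean compensates for users who rate everything high or everything low.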
