# Project 3 - Recommender Systems

## Association Rule mining, Collaborative Filtering and Content Based Filtering


### Brett Hallum, Mridul Jain, and Solomon Ndungu


# Introduction

The goal of this project is to analyze Movilens Dataset to understand. We will use this data to generate some of the movie recommendations for specific users, by looking at the movies they already watched and ratings they gave. By using the concepts of collaborative filtering we can find " Movie "X" "LIKED" BY “SIMILAR” USERS as "User-A" " and hence can be recommended to User-A as well.

# Understanding the Data
GroupLens Research has collected and made available rating data sets from the MovieLens web site (http://movielens.org). The data sets were collected over various periods of time, depending on the size of the set.
There are multiple files in this dataset. There are 2 files that we are interested in u.data - this has the userId, the movieId, the rating and the date that rating was given. 

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up - users who had less than 20 ratings or did not have complete demographic information were removed from this data set.

u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies.  Users and items are
              numbered consecutively from 1.  The data is randomly
              ordered. This is a tab separated list of 
	         user id | item id | rating | timestamp. 
              The time stamps are unix seconds since 1/1/1970 UTC   


u.item     -- Information about the items (movies); this is a tab separated
              list of
              movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.
              The movie ids are the ones used in the u.data data set.


# Data Exploration and Visualization


In [2]:
import os
os.chdir('C:\Users\Halltrino\Desktop\MDS Downloads\Data Mining\Project 3\ml-100k')
os.getcwd()

'C:\\Users\\Halltrino\\Desktop\\MDS Downloads\\Data Mining\\Project 3\\ml-100k'

In [3]:
import numpy as np 
import pandas as pd

In [4]:
#Files to be used for analysis

dataFile='u.data'
movieInfoFile='u.item'

In [5]:
#We are passing the header explicitly as there is no header info in the files
#We are not interested in all the columns of 'u.item'. We are going to use only 0,1 columns from this file.

data=pd.read_csv(dataFile,sep="\t",header=None,names=['userId','itemId','rating','timestamp'])
movieInfo=pd.read_csv(movieInfoFile,sep="|", header=None, index_col=False,
                     names=['itemId','title'], usecols=[0,1])

In [6]:
print data.head()
print '\n'
print movieInfo.head()

   userId  itemId  rating  timestamp
0     196     242       3  881250949
1     186     302       3  891717742
2      22     377       1  878887116
3     244      51       2  880606923
4     166     346       1  886397596


   itemId              title
0       1   Toy Story (1995)
1       2   GoldenEye (1995)
2       3  Four Rooms (1995)
3       4  Get Shorty (1995)
4       5     Copycat (1995)


In [7]:
# Merging the two files together into one single dataFrame. We will use this dataFrame in the further analysis.

data=pd.merge(data,movieInfo,left_on='itemId',right_on="itemId")

# Create a combined csv file that we will use to load in pyspark for Latent Factor Collaborative filtering \
# using ALTERNATING LEAST SQUARES method
data.to_csv('combined_user_movie_file.csv')
print (data.head())

   userId  itemId  rating  timestamp         title
0     196     242       3  881250949  Kolya (1996)
1      63     242       3  875747190  Kolya (1996)
2     226     242       5  883888671  Kolya (1996)
3     154     242       3  879138235  Kolya (1996)
4     306     242       5  876503793  Kolya (1996)


In [8]:
print data.shape
print data.head()

(100000, 5)
   userId  itemId  rating  timestamp         title
0     196     242       3  881250949  Kolya (1996)
1      63     242       3  875747190  Kolya (1996)
2     226     242       5  883888671  Kolya (1996)
3     154     242       3  879138235  Kolya (1996)
4     306     242       5  876503793  Kolya (1996)


In [9]:
data=pd.DataFrame.sort_values(data,['userId','itemId'],ascending=[0,1])

# Let's see how many users and how  many movies there are 
numUsers=max(data.userId)
numMovies=max(data.itemId)

moviesPerUser=data.userId.value_counts()
usersPerMovie=data.title.value_counts()

print 'Number of Users: ', numUsers
print 'Number of Movies: ', numMovies
print '\n'
print 'Number of users that rate a particular Movie: \n\n', usersPerMovie.head()
print '\n'
print 'Number of movies rated by particular User: \n\n', moviesPerUser.head()

Number of Users:  943
Number of Movies:  1682


Number of users that rate a particular Movie: 

Star Wars (1977)             583
Contact (1997)               509
Fargo (1996)                 508
Return of the Jedi (1983)    507
Liar Liar (1997)             485
Name: title, dtype: int64


Number of movies rated by particular User: 

405    737
655    685
13     636
450    540
276    518
Name: userId, dtype: int64


In [10]:
data.head()

Unnamed: 0,userId,itemId,rating,timestamp,title
23781,943,2,5,888639953,GoldenEye (1995)
65410,943,9,3,875501960,Dead Man Walking (1995)
35098,943,11,4,888639000,Seven (Se7en) (1995)
43773,943,12,5,888639093,"Usual Suspects, The (1995)"
57040,943,22,4,888639042,Braveheart (1995)


In [11]:
#Function to return the topN Movies for a specific user. N is an arbitrary number, and can be changed as needed.

def topN(activeUser,N):
    user_topN = data.loc[data.userId == activeUser]
    return user_topN.loc[user_topN.rating > 4].head(N)

In [12]:
moviesPerUser.index[:10]

Int64Index([405, 655, 13, 450, 276, 416, 537, 303, 234, 393], dtype='int64')

In [13]:
TopMoviesList = pd.DataFrame()

Num_Active_Critics_to_Check = 20
Num_Movies_by_Each_Critic = 500

for i in moviesPerUser.index[:Num_Active_Critics_to_Check]:
    TopMoviesList = TopMoviesList.append(topN(i,Num_Movies_by_Each_Critic))

del TopMoviesList['userId']
del TopMoviesList['timestamp']

#Atleast 20% of the critics are agreein to the top rating for the movies

TopMoviesList = TopMoviesList.title.value_counts()
TopMoviesList = TopMoviesList[TopMoviesList>Num_Active_Critics_to_Check/5]

print '\nMovies that are rated highly by most active movie raters in the dataset\n\n', TopMoviesList.head(10)


Movies that are rated highly by most active movie raters in the dataset

Star Wars (1977)                                                               15
Godfather, The (1972)                                                          13
Usual Suspects, The (1995)                                                     11
Monty Python and the Holy Grail (1974)                                         10
Pulp Fiction (1994)                                                            10
Apocalypse Now (1979)                                                           9
Jaws (1975)                                                                     9
Schindler's List (1993)                                                         9
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)     9
Empire Strikes Back, The (1980)                                                 9
Name: title, dtype: int64


In [14]:
# Since userID 405 is the most active user and seems like a movie buff. Its a good idea to check which movies he liked
# Lets see user ID 405's highest and lowest rated movies.

user_405 = data.loc[data.userId == 405]
user_405_HighestRatings = user_405.loc[user_405.rating > 4]
user_405_LowestRatings = user_405.loc[user_405.rating < 2]

In [15]:
print '5 Highest Rated Movies by UserID 405', user_405_HighestRatings.head(5)
print '\n5 Lowest Rated Movies by UserID 405', user_405_LowestRatings.head(5)

5 Highest Rated Movies by UserID 405        userId  itemId  rating  timestamp                       title
43709     405      12       5  885545306  Usual Suspects, The (1995)
56861     405      22       5  885545167           Braveheart (1995)
14992     405      23       5  885545372          Taxi Driver (1976)
68788     405      38       5  885548093             Net, The (1995)
48303     405      47       5  885545429              Ed Wood (1994)

5 Lowest Rated Movies by UserID 405        userId  itemId  rating  timestamp                 title
23701     405       2       1  885547953      GoldenEye (1995)
72281     405      27       1  885546487       Bad Boys (1995)
89654     405      30       1  885549544  Belle de jour (1967)
87587     405      31       1  885548579   Crimson Tide (1995)
6166      405      32       1  885546025          Crumb (1994)


As in the personalized recommendation scenario, the introduction of new users or new items can 
cause the cold start problem, as there will be insufficient data on these new entries for the 
collaborative filtering to work accurately
Next we can quickly find the active raters, we call them Movie Critics, and see which movies they rated highest
and which movies they rated lowest. These movies in general can be recommended to the people who have not rated
or seen any movies yet, and are new to the system.

In [16]:
#Function to return the topN Movies for a specific user. N is an arbitrary number, and can be changed as needed.

def bottomN(activeUser,N):
    user_bottomN = data.loc[data.userId == activeUser]
    return user_bottomN.loc[user_bottomN.rating < 3].head(N)

In [17]:
bottomMoviesList = pd.DataFrame()

Num_Active_Critics_to_Check = 20
Num_Movies_by_Each_Critic = 500

for i in moviesPerUser.index[:Num_Active_Critics_to_Check]:
    bottomMoviesList = bottomMoviesList.append(bottomN(i,Num_Movies_by_Each_Critic))

del bottomMoviesList['userId']
del bottomMoviesList['timestamp']

#Atleast 20% of the critics are agreein to the bottom rating for the movies

bottomMoviesList = bottomMoviesList.title.value_counts()
bottomMoviesList = bottomMoviesList[bottomMoviesList>Num_Active_Critics_to_Check/5]

print '\nMovies that are rated low by most active movie raters in the dataset\n\n', bottomMoviesList.head(10)


Movies that are rated low by most active movie raters in the dataset

Batman Forever (1995)                8
Very Brady Sequel, A (1996)          7
Volcano (1997)                       7
Waterworld (1995)                    7
Die Hard: With a Vengeance (1995)    7
Natural Born Killers (1994)          7
Pretty Woman (1990)                  7
Lord of Illusions (1995)             6
Free Willy (1993)                    6
Long Kiss Goodnight, The (1996)      6
Name: title, dtype: int64


In [18]:
from scipy.spatial.distance import correlation 
def similarity(user1,user2):
    user1=np.array(user1)-np.nanmean(user1) # we are first normalizing user1 by 
    # the mean rating of user 1 for any movie. Note the use of np.nanmean() - this 
    # returns the mean of an array after ignoring and NaN values 
    user2=np.array(user2)-np.nanmean(user2)
    # Now to find the similarity between 2 users
    # We'll first subset each user to be represented only by the ratings for the 
    # movies the 2 users have in common 
    commonItemIds=[i for i in range(len(user1)) if user1[i]>0 and user2[i]>0]
    # Gives us movies for which both users have non NaN ratings 
    if len(commonItemIds)==0:
        # If there are no movies in common 
        return 0
    else:
        user1=np.array([user1[i] for i in commonItemIds])
        user2=np.array([user2[i] for i in commonItemIds])
        return correlation(user1,user2)

In [19]:
# Let's write a function to find the top N favorite movies of a user 
def favoriteMovies(activeUser,N):
    #1. subset the dataframe to have the rows corresponding to the active user
    # 2. sort by the rating in descending order
    # 3. pick the top N rows
    topMovies=pd.DataFrame.sort_values(
        data[data.userId==activeUser],['rating'],ascending=[0])[:N]
    # return the title corresponding to the movies in topMovies 
    return list(topMovies.title)

print favoriteMovies(5,3) # Print the top 3 favorite movies of user 5

['Men in Black (1997)', 'Blade Runner (1982)', 'Empire Strikes Back, The (1980)']


In [20]:
#Creating a very sparse Matrix "user_to_Movie_Rating_Matrix" of UserID and MovieRatig which we will use later 
# on to find the user-user correlation and hence will be able to find which users are similar to each other.

userItemRatingMatrix=pd.pivot_table(data, values='rating',
                                    index=['userId'], columns=['itemId'])

In [21]:
userItemRatingMatrix.head()

itemId,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,,,,,,,,,,
2,4.0,,,,,,,,,2.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,3.0,,,,,,,,,...,,,,,,,,,,


## GraphLab Recommender System

In [22]:
import graphlab as gl

gl_data = gl.SFrame(data)
print (gl_data.head())

model = gl.recommender.create(gl_data, user_id="userId", item_id="title", target="rating")
results = model.recommend(users=None, k=5)
model.save("my_model")

results.head() # the recommendation output

This non-commercial license of GraphLab Create for academic use is assigned to bhallum@smu.edu and will expire on August 11, 2017.


[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: C:\Users\HALLTR~1\AppData\Local\Temp\graphlab_server_1471403771.log.0


+--------+--------+--------+-----------+----------------------------+
| userId | itemId | rating | timestamp |           title            |
+--------+--------+--------+-----------+----------------------------+
|  943   |   2    |   5    | 888639953 |      GoldenEye (1995)      |
|  943   |   9    |   3    | 875501960 |  Dead Man Walking (1995)   |
|  943   |   11   |   4    | 888639000 |    Seven (Se7en) (1995)    |
|  943   |   12   |   5    | 888639093 | Usual Suspects, The (1995) |
|  943   |   22   |   4    | 888639042 |     Braveheart (1995)      |
|  943   |   23   |   4    | 888638897 |     Taxi Driver (1976)     |
|  943   |   24   |   4    | 875502074 | Rumble in the Bronx (1995) |
|  943   |   27   |   4    | 888639954 |      Bad Boys (1995)       |
|  943   |   28   |   4    | 875409978 |      Apollo 13 (1995)      |
|  943   |   31   |   4    | 888639066 |    Crimson Tide (1995)     |
+--------+--------+--------+-----------+----------------------------+
[10 rows x 5 columns

userId,title,score,rank
943,Casablanca (1942),4.18583912669,1
943,Blade Runner (1982),4.08277056633,2
943,One Flew Over the Cuckoo's Nest (1975) ...,4.07698758958,3
943,Amadeus (1984),4.0680909407,4
943,Alien (1979),4.06442926346,5
942,Casablanca (1942),5.04674100398,1
942,"Silence of the Lambs, The (1991) ...",4.92353759228,2
942,Fargo (1996),4.87269386171,3
942,"Godfather, The (1972)",4.8452357751,4
942,Dr. Strangelove or: How I Learned to Stop Worrying ...,4.82898130535,5


In [23]:
item_item = gl.recommender.item_similarity_recommender.create(gl_data, 
                                  user_id="userId", 
                                  item_id="title", 
                                  target="rating",
                                  only_top_k=5,
                                  similarity_type="cosine")

results = item_item.get_similar_items(k=5)
results.head()

title,similar,score,rank
GoldenEye (1995),Under Siege (1992),0.659618616104,1
GoldenEye (1995),Top Gun (1986),0.623543560505,2
GoldenEye (1995),True Lies (1994),0.617273688316,3
GoldenEye (1995),Batman (1989),0.616143107414,4
GoldenEye (1995),Stargate (1994),0.604969024658,5
Dead Man Walking (1995),Fargo (1996),0.618000686169,1
Dead Man Walking (1995),Leaving Las Vegas (1995),0.590753376484,2
Dead Man Walking (1995),"Godfather, The (1972)",0.529238581657,3
Dead Man Walking (1995),Twelve Monkeys (1995),0.527462303638,4
Dead Man Walking (1995),Jerry Maguire (1996),0.527136862278,5


In [24]:
train, test = gl.recommender.util.random_split_by_user(gl_data,
                                                    user_id="userId", item_id="title",
                                                    max_num_users=100, item_test_proportion=0.2)

In [25]:
from IPython.display import display
from IPython.display import Image

gl.canvas.set_target('ipynb')


item_item = gl.recommender.item_similarity_recommender.create(train, 
                                  user_id="userId", 
                                  item_id="title", 
                                  target="rating",
                                  only_top_k=5,
                                  similarity_type="cosine")

rmse_results = item_item.evaluate(test)


Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.61      | 0.0429091904431 |
|   2    |      0.53      | 0.0744869886285 |
|   3    | 0.493333333333 |  0.103385197641 |
|   4    |     0.4575     |  0.124967035172 |
|   5    |     0.452      |  0.148665159835 |
|   6    | 0.438333333333 |  0.174192999748 |
|   7    | 0.417142857143 |  0.187193377713 |
|   8    |    0.39125     |  0.196085729773 |
|   9    | 0.384444444444 |  0.21419539291  |
|   10   |     0.366      |  0.221083676131 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 3.676979752890059)

Per User RMSE (best)
+--------+-------+---------------+
| userId | count |      rmse     |
+--------+-------+---------------+
|  925   |   1   | 1.96951662533 |
+--------+-------+---------------+
[1 rows x 3 columns]


Per User RMSE (worst)


In [26]:
print rmse_results.viewkeys()
print rmse_results['rmse_by_item']

dict_keys(['rmse_by_user', 'precision_recall_overall', 'rmse_by_item', 'precision_recall_by_user', 'rmse_overall'])
+-------------------------------+-------+---------------+
|             title             | count |      rmse     |
+-------------------------------+-------+---------------+
|        Sneakers (1992)        |   3   | 2.70334387852 |
|     Drop Dead Fred (1991)     |   1   |      2.0      |
| Terminator 2: Judgment Day... |   8   | 4.11037103677 |
|    Ruby in Paradise (1993)    |   1   |      4.0      |
|      Jurassic Park (1993)     |   9   | 3.47782205326 |
|  Fried Green Tomatoes (1991)  |   5   | 3.68182413987 |
|       Cliffhanger (1993)      |   4   | 2.89740010347 |
|      Reality Bites (1994)     |   2   | 3.53051894272 |
|      Mary Poppins (1964)      |   3   | 3.32622394357 |
|         Casper (1995)         |   1   | 4.95907067259 |
+-------------------------------+-------+---------------+
[802 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You

In [27]:
rmse_results['rmse_by_user']

userId,count,rmse
71,5,3.75280877258
112,8,3.12627604164
750,5,2.11865602824
134,5,3.93618426405
285,7,4.34232539231
653,56,2.90238515196
364,6,3.37381441511
932,50,4.14149309464
80,3,4.35889894354
66,7,3.35283520357


In [28]:
rmse_results['precision_recall_by_user']

userId,cutoff,precision,recall,count
12,1,1.0,0.0769230769231,13
12,2,0.5,0.0769230769231,13
12,3,0.333333333333,0.0769230769231,13
12,4,0.25,0.0769230769231,13
12,5,0.4,0.153846153846,13
12,6,0.5,0.230769230769,13
12,7,0.428571428571,0.230769230769,13
12,8,0.375,0.230769230769,13
12,9,0.333333333333,0.230769230769,13
12,10,0.3,0.230769230769,13


In [29]:
import graphlab.aggregate as agg

# we will be using these aggregations
agg_list = [agg.AVG('precision'),agg.STD('precision'),agg.AVG('recall'),agg.STD('recall')]

# apply these functions to each group (we will group the results by 'k' which is the cutoff)
# the cutoff is the number of top items to look for see the following URL for the actual equation
# https://dato.com/products/create/docs/generated/graphlab.recommender.util.precision_recall_by_user.html#graphlab.recommender.util.precision_recall_by_user
rmse_results['precision_recall_by_user'].groupby('cutoff',agg_list)

# the groups are not sorted

cutoff,Avg of precision,Stdv of precision,Avg of recall,Stdv of recall
16,0.30375,0.215787076768,0.283004894821,0.195005645219
10,0.366,0.253069160508,0.221083676131,0.164408205116
36,0.216111111111,0.162188923523,0.413494916225,0.20814553451
26,0.245769230769,0.183884370022,0.348262666888,0.200310191469
41,0.202195121951,0.154538517788,0.43362818633,0.199159354562
3,0.493333333333,0.328227563336,0.103385197641,0.107841686417
1,0.61,0.48774993593,0.0429091904431,0.0616446080683
6,0.438333333333,0.264212666868,0.174192999748,0.143957247507
11,0.354545454545,0.249793302982,0.232547457545,0.167189337107
2,0.53,0.352278299076,0.0744869886285,0.0785919826771


## Cross Validated Collab Filtering

In [30]:
rec1 = gl.recommender.ranking_factorization_recommender.create(train, 
                                  user_id="userId", 
                                  item_id="title", 
                                  target="rating")

rmse_results = rec1.evaluate(test)


Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.31      |  0.016795979674 |
|   2    |     0.265      | 0.0278245607391 |
|   3    | 0.263333333333 | 0.0402932191147 |
|   4    |     0.2475     | 0.0476731194612 |
|   5    |     0.232      | 0.0532962898121 |
|   6    | 0.218333333333 | 0.0579236521334 |
|   7    | 0.214285714286 | 0.0653083776976 |
|   8    |     0.205      | 0.0697411450408 |
|   9    | 0.201111111111 | 0.0758857803894 |
|   10   |     0.194      | 0.0813952486696 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 1.4673011556459044)

Per User RMSE (best)
+--------+-------+-----------------+
| userId | count |       rmse      |
+--------+-------+-----------------+
|  925   |   1   | 0.0893884652263 |
+--------+-------+-----------------+
[1 rows x 3 columns]


Per User RM

In [31]:
rmse_results['precision_recall_by_user'].groupby('cutoff',[agg.AVG('precision'),agg.STD('precision'),agg.AVG('recall'),agg.STD('recall')])

cutoff,Avg of precision,Stdv of precision,Avg of recall,Stdv of recall
16,0.17875,0.196655631753,0.11801910251,0.111732339911
10,0.194,0.229268401661,0.0813952486696,0.0976898026314
36,0.136666666667,0.14118545723,0.20028620441,0.144553759044
26,0.155,0.161213500984,0.169635294508,0.135130216143
41,0.131463414634,0.135969648781,0.223147735623,0.149080659044
3,0.263333333333,0.306576073575,0.0402932191147,0.0675441820757
1,0.31,0.462493243194,0.016795979674,0.0377036964808
6,0.218333333333,0.260016025147,0.0579236521334,0.0776211234523
11,0.192727272727,0.225366551743,0.0871650897882,0.0988517876953
2,0.265,0.363696301878,0.0278245607391,0.0554627204968


In [32]:
rec1 = gl.recommender.ranking_factorization_recommender.create(train, 
                                  user_id="userId", 
                                  item_id="title", 
                                  target="rating",
                                  num_factors=16,                 # override the default value
                                  regularization=1e-02,           # override the default value
                                  linear_regularization = 1e-3)   # override the default value

rmse_results = rec1.evaluate(test)


Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.3       | 0.0176498022543 |
|   2    |     0.255      | 0.0295393977648 |
|   3    | 0.223333333333 | 0.0346217811607 |
|   4    |      0.2       | 0.0389973714138 |
|   5    |     0.192      | 0.0447030801685 |
|   6    |     0.195      | 0.0559737168413 |
|   7    | 0.198571428571 | 0.0630862682948 |
|   8    |     0.1875     | 0.0663491808156 |
|   9    | 0.184444444444 |  0.074002446739 |
|   10   |     0.179      | 0.0792802102477 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 1.0645586631233674)

Per User RMSE (best)
+--------+-------+----------------+
| userId | count |      rmse      |
+--------+-------+----------------+
|  275   |   18  | 0.434469368511 |
+--------+-------+----------------+
[1 rows x 3 columns]


Per User RMSE (w

## Comparison to Item-Item matrix

In [33]:
comparison = gl.recommender.util.compare_models(test, [item_item, rec1])

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.61      | 0.0429091904431 |
|   2    |      0.53      | 0.0744869886285 |
|   3    | 0.493333333333 |  0.103385197641 |
|   4    |     0.4575     |  0.124967035172 |
|   5    |     0.452      |  0.148665159835 |
|   6    | 0.438333333333 |  0.174192999748 |
|   7    | 0.417142857143 |  0.187193377713 |
|   8    |    0.39125     |  0.196085729773 |
|   9    | 0.384444444444 |  0.21419539291  |
|   10   |     0.366      |  0.221083676131 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 3.676979752890059)

Per User RMSE (best)
+--------+-------+---------------+
| userId | count |      rmse     |
+--------+-------+---------------+
|  925   |   1   | 1.96951662533 |
+--------+-------+---------------+
[1 rows x 3 colum

In [34]:
 comparisonstruct = gl.compare(test,[item_item, rec1])

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.61      | 0.0429091904431 |
|   2    |      0.53      | 0.0744869886285 |
|   3    | 0.493333333333 |  0.103385197641 |
|   4    |     0.4575     |  0.124967035172 |
|   5    |     0.452      |  0.148665159835 |
|   6    | 0.438333333333 |  0.174192999748 |
|   7    | 0.417142857143 |  0.187193377713 |
|   8    |    0.39125     |  0.196085729773 |
|   9    | 0.384444444444 |  0.21419539291  |
|   10   |     0.366      |  0.221083676131 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.3       | 0.0176498

In [35]:
gl.show_comparison(comparisonstruct,[item_item, rec1])

In [36]:
params = {'user_id': 'userId', 
          'item_id': 'title', 
          'target': 'rating',
          'num_factors': [8, 12, 16, 24, 32], 
          'regularization':[0.001] ,
          'linear_regularization': [0.001]}

job = gl.model_parameter_search.create( (train,test),
        gl.recommender.ranking_factorization_recommender.create,
        params,
        max_models=5,
        environment=None)

# also note thatthis evaluator also supports sklearn
# https://dato.com/products/create/docs/generated/graphlab.toolkits.model_parameter_search.create.html?highlight=model_parameter_search

[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.job: Creating a LocalAsync environment called 'async'.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-16-2016-22-16-3400000' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-16-2016-22-16-3400000' scheduled.
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: A job with name 'Model-Parameter-Search-Aug-16-2016-22-16-3400000' already exists. Renaming the job to 'Model-Parameter-Search-Aug-16-2016-22-16-3400000-1fb7a'.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-16-2016-22-16-3400000-1fb7a' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-16-2016-22-16-3400000-1fb7a' scheduled.


In [37]:
bst_prms = job.get_best_params()
bst_prms
models = job.get_models()

In [38]:
comparisonstruct = gl.compare(test,models)
gl.show_comparison(comparisonstruct,models)

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.32      | 0.0191853444246 |
|   2    |     0.255      | 0.0271885574287 |
|   3    |      0.23      | 0.0351476140235 |
|   4    |     0.2075     | 0.0407824927862 |
|   5    |     0.208      | 0.0505811628346 |
|   6    |      0.2       |  0.058707455081 |
|   7    | 0.198571428571 | 0.0664698247844 |
|   8    |     0.1975     |  0.073838196408 |
|   9    |      0.19      | 0.0795490218959 |
|   10   |     0.191      | 0.0853603079678 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.34      | 0.0195045

In [39]:
comparisonstruct = gl.compare(test,[models[4], item_item])
gl.show_comparison(comparisonstruct,[models[4], item_item])

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.3       | 0.0177050356152 |
|   2    |     0.265      | 0.0299401914156 |
|   3    | 0.236666666667 | 0.0354866453626 |
|   4    |     0.2125     |  0.040860355377 |
|   5    |     0.208      | 0.0467590291452 |
|   6    | 0.201666666667 | 0.0564364119131 |
|   7    | 0.202857142857 | 0.0636093961348 |
|   8    |    0.18625     | 0.0660174832168 |
|   9    | 0.182222222222 | 0.0723398634677 |
|   10   |     0.186      | 0.0823019263178 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.61      | 0.0429091

### Collaborative Filtering

#### Memory-based: Find similar users (user-based CF) or items (item-based CF) to predict missing ratings
1. Produce recommendations based on the preferences of similar users 
	(Goldberg et al., 1992; Resnick et al., 1994; Mild and Reutterer, 2001)
2. Produce recommendations based on the relationship between items in the user-item matrix 
	(Kitts et al., 2000; Sarwar et al., 2001)

#### Model-based: Build a model from the rating data (clustering, latent semantic structure, etc.) and then use this model to predict missing ratings

There are many techniques:

1. Cluster users and then recommend items the users in the cluster closest to the active user like
2. Mine association rules and then use the rules to recommend items (for binary/binarized data)
3. Define a null-model (a stochastic process which models usage of independent items) and then find significant deviation from the null-model
4. Learn a latent factor model from the data and then use the discovered factors to find items with high expected ratings

First we are going to use the K Nearest Neighbors technique (Memory Based Collaborative Filtering technique)
To achieve this we are going to create a K-Nearest Neighbors (Similar Users) of the user in question, and looking at "Neighbors / Similar Users" ratings for a specific item/movie, predict the rating for the user in question.

The idea here is to predict users ratings for the Movies/Products they have not yet rated based on the ratings or feedback received by other users who are in one way or other very similar to the user we are trying to recommend/predict for


In [27]:
def nearestNeighbourRatings(activeUser,K):
    # This function will find the K Nearest neighbours of the active user, then 
    # use their ratings to predict the activeUsers ratings for other movies 
    similarityMatrix=pd.DataFrame(index=userItemRatingMatrix.index,
                                  columns=['Similarity'])
    # Creates an empty matrix whose row index is userIds, and the value will be 
    # similarity of that user to the active User
    for i in userItemRatingMatrix.index:
        similarityMatrix.loc[i]=similarity(userItemRatingMatrix.loc[activeUser],
                                          userItemRatingMatrix.loc[i])
        # Find the similarity between user i and the active user and add it to the 
        # similarityMatrix 
    similarityMatrix=pd.DataFrame.sort_values(similarityMatrix,
                                              ['Similarity'],ascending=[0])
    # Sort the similarity matrix in the descending order of similarity 
    nearestNeighbours=similarityMatrix[:K]
    # The above line will give us the K Nearest neighbours 
    
    # We'll now take the nearest neighbours and use their ratings 
    # to predict the active user's rating for every movie
    neighbourItemRatings=userItemRatingMatrix.loc[nearestNeighbours.index]
    # The similarity matrix had an index which was the userId, By sorting 
    # and picking the top K rows, the nearestNeighbours dataframe now has 
    # a dataframe whose row index is the userIds of the K Nearest neighbours 
    # Using this index we can directly find the corresponding rows in the 
    # user Item rating matrix 
    predictItemRating=pd.DataFrame(index=userItemRatingMatrix.columns, columns=['Rating'])
    # A placeholder for the predicted item ratings. It's row index is the 
    # list of itemIds which is the same as the column index of userItemRatingMatrix
    #Let's fill this up now
    for i in userItemRatingMatrix.columns:
        # for each item 
        predictedRating=np.nanmean(userItemRatingMatrix.loc[activeUser])
        # start with the average rating of the user
        for j in neighbourItemRatings.index:
            # for each neighbour in the neighbour list 
            if userItemRatingMatrix.loc[j,i]>0:
                # If the neighbour has rated that item
                # Add the rating of the neighbour for that item
                #    adjusted by 
                #    the average rating of the neighbour 
                #    weighted by 
                #    the similarity of the neighbour to the active user
                predictedRating += (userItemRatingMatrix.loc[j,i]
                                    -np.nanmean(userItemRatingMatrix.loc[j]))*nearestNeighbours.loc[j,'Similarity']
        # We are out of the loop which uses the nearest neighbours, add the 
        # rating to the predicted Rating matrix
        predictItemRating.loc[i,'Rating']=predictedRating
    return predictItemRating

In [28]:
# Let's now use these predicted Ratings to find the top N Recommendations for the active user 

def topNRecommendations(activeUser,N):
    predictItemRating=nearestNeighbourRatings(activeUser,10)
    # Use the 10 nearest neighbours to find the predicted ratings
    moviesAlreadyWatched=list(userItemRatingMatrix.loc[activeUser]
                              .loc[userItemRatingMatrix.loc[activeUser]>0].index)
    # find the list of items whose ratings which are not NaN
    predictItemRating=predictItemRating.drop(moviesAlreadyWatched)
    topRecommendations=pd.DataFrame.sort_values(predictItemRating,
                                                ['Rating'],ascending=[0])[:N]
    # This will give us the list of itemIds which are the top recommendations 
    # Let's find the corresponding movie titles 
    topRecommendationTitles=(movieInfo.loc[movieInfo.itemId.isin(topRecommendations.index)])
    return list(topRecommendationTitles.title)

In [29]:
# Let's use this for one specific user and predict the top N recommendations for that user
activeUser=5
print favoriteMovies(activeUser,5),"\n",topNRecommendations(activeUser,3)

['Men in Black (1997)', 'Blade Runner (1982)', 'Empire Strikes Back, The (1980)', 'Wrong Trousers, The (1993)', 'Blues Brothers, The (1980)'] 
['Truth About Cats & Dogs, The (1996)', 'Scream (1996)', 'First Wives Club, The (1996)']


### LATENT FACTOR COLLABORATIVE FILTERING

The objective of Matrix Factoriation is to decompose each user rating into a user-factor vector and a product-factor vector. This is analogous to what happens in singular value decomposition or principal component analysis. However, these techniques would only make sense if you knew all the ratings for all the users for all products, which is not the case in the case of user-movie rating.
In order to overcome this issue, we only solve for the ratings which are available.

Next we are going to use model based approach by using Latent Factor and Association Rules mining to predict the ratings and recommend the movies to users.

#### Two popular methods to solve matrix factorization for recommendations

1. STOCHASTIC GRADIENT DESCENT
2. ALTERNATING LEAST SQUARES

### We will implement SGD algorithm manually and ALS using SPARK's MLLib Library

In [30]:
# Let's now use matrix factorization to do the same exercise ie
# finding the recommendations for a user
# The idea here is to identify some factors (these are factors which influence
# a user'r rating). The factors are identified by decomposing the 
# user item rating matrix into a user-factor matrix and a item-factor matrix
# Each row in the user-factor matrix maps the user onto the hidden factors
# Each row in the product factor matrix maps the item onto the hidden factors
# This operation will be pretty expensive because it will effectively give us 
# the factor vectors needed to find the rating of any product by any user 
# (in the  previous case (KNN) we only did the computations for 1 user)

def matrixFactorization(R, K, steps=10, gamma=0.001,lamda=0.02):
    # R is the user item rating matrix 
    # K is the number of factors we will find 
    # We'll be using Stochastic Gradient descent to find the factor vectors 
    # steps, gamma and lamda are parameters the SGD will use - we'll get to them
    # in a bit 
    N=len(R.index)# Number of users
    M=len(R.columns) # Number of items 
    P=pd.DataFrame(np.random.rand(N,K),index=R.index)
    # This is the user factor matrix we want to find. It will have N rows 
    # on for each user and K columns, one for each factor. We are initializing 
    # this matrix with some random numbers, then we will iteratively move towards 
    # the actual value we want to find 
    Q=pd.DataFrame(np.random.rand(M,K),index=R.columns)
    # This is the product factor matrix we want to find. It will have M rows, 
    # one for each product/item/movie. 
    for step in xrange(steps):
        # SGD will loop through the ratings in the user item rating matrix 
        # It will do this as many times as we specify (number of steps) or 
        # until the error we are minimizing reaches a certain threshold 
        for i in R.index:
            for j in R.columns:
                if R.loc[i,j]>0:
                    # For each rating that exists in the training set 
                    eij=R.loc[i,j]-np.dot(P.loc[i],Q.loc[j])
                    # This is the error for one rating 
                    # ie difference between the actual value of the rating 
                    # and the predicted value (dot product of the corresponding 
                    # user factor vector and item-factor vector)
                    # We have an error function to minimize. 
                    # The Ps and Qs should be moved in the downward direction 
                    # of the slope of the error at the current point 
                    P.loc[i]=P.loc[i]+gamma*(eij*Q.loc[j]-lamda*P.loc[i])
                    # Gamma is the size of the step we are taking / moving the value
                    # of P by 
                    # The value in the brackets is the partial derivative of the 
                    # error function ie the slope. Lamda is the value of the 
                    # regularization parameter which penalizes the model for the 
                    # number of factors we are finding. 
                    Q.loc[j]=Q.loc[j]+gamma*(eij*P.loc[i]-lamda*Q.loc[j])
        # At the end of this we have looped through all the ratings once. 
        # Let's check the value of the error function to see if we have reached 
        # the threshold at which we want to stop, else we will repeat the process
        e=0
        for i in R.index:
            for j in R.columns:
                if R.loc[i,j]>0:
                    #Sum of squares of the errors in the rating
                    e= e + pow(R.loc[i,j]-np.dot(P.loc[i],Q.loc[j]),2)+lamda*(pow(np.linalg.norm(P.loc[i]),2)+pow(np.linalg.norm(Q.loc[j]),2))
        if e<0.001:
            break
        print step
    return P,Q

In [47]:
# Let's call this function now 
(P,Q)=matrixFactorization(userItemRatingMatrix.iloc[:100,:100],K=2,gamma=0.001,lamda=0.02, steps=25)
# Ideally we should run this over the entire matrix for a few 1000's steps, 
# This will be pretty expensive computationally. For now lets just do it over a 
# part of the rating matrix to see how it works. We've kept the steps to 25. 
  

0
1
2
3
4
5
6
7
8
9


In [48]:
# Let's quickly use these ratings to find top recommendations for a user 
activeUser=5
predictItemRating=pd.DataFrame(np.dot(P.loc[activeUser],Q.T),index=Q.index,columns=['Rating'])
topRecommendations=pd.DataFrame.sort_values(predictItemRating,['Rating'],ascending=[0])[:5]
# We found the ratings of all movies by the active user and then sorted them to find the top 3 movies 
topRecommendationTitles=movieInfo.loc[movieInfo.itemId.isin(topRecommendations.index)]
print list(topRecommendationTitles.title)

['Star Wars (1977)', 'Good Will Hunting (1997)', 'L.A. Confidential (1997)', 'Titanic (1997)', "Schindler's List (1993)"]


### SPARK MLLib for Latent Factor Collaborative Filtering - Matrix Factorization
#### Alternative Least Squares Method for Calculating 

To run spark on local laptop/machine we install Spark from http://spark.apache.org/downloads.html, and follow the steps. 

1.Choose a Spark release: 2.0.0 (Jul 26 2016)1.6.2 
2.Choose a package type: Pre-built for Hadoop 2.6  
3.Choose a download type: Direct DownloadSelect Apache Mirror
4.Download Spark: spark-2.0.0-bin-hadoop2.6.tgz

Once downloaded, unzip the binaries in a specific folder like "C:\Apache-Spark\spark-2.0.0-bin-hadoop2.6" and change directory to this folder.
From shell/command prompt, run ./bin/spark-shell
This will start Spark Shell on the local machine.

Set the path variable or .bash_profile as needed: 
In case of Windows PATH: C:\Apache-Spark\spark-2.0.0-bin-hadoop2.6;C:\Apache-Spark\spark-2.0.0-bin-hadoop2.6\bin;

Next we will configure pyspark context in IPYTHON Notebook, which will enable us to use all the features of SPARK right from IPYTHON Notebook. pyspark is a python shell with all the fucntionalities and libraries from SPARK like MLLib etc.


In [1]:
# Below code will enable Spark Shell from IPYTHON
import sys
import os


spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError ('SPARK_HOME environment variable not set')

sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python\lib\pyspark.zip')) 
sys.path.insert(0, os.path.join(spark_home, 'python\lib\py4j-0.9-src.zip')) 

execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.11 (default, Feb 16 2016 09:58:36)
SparkSession available as 'spark'.


In [10]:
from pyspark import SparkContext

In [11]:
uadatapath="C:/Users/emrijai/Documents/IPython Notebooks/MS7331/Project3/ml-100k/ml-100k/combined_user_movie_file.csv"
rawUserArtistData = sc.textFile(uadatapath)
rawUserArtistData.take(10)

[u',userId,itemId,rating,timestamp,title',
 u'0,196,242,3,881250949,Kolya (1996)',
 u'1,63,242,3,875747190,Kolya (1996)',
 u'2,226,242,5,883888671,Kolya (1996)',
 u'3,154,242,3,879138235,Kolya (1996)',
 u'4,306,242,5,876503793,Kolya (1996)',
 u'5,296,242,4,884196057,Kolya (1996)',
 u'6,34,242,5,888601628,Kolya (1996)',
 u'7,271,242,4,885844495,Kolya (1996)',
 u'8,201,242,4,884110598,Kolya (1996)']

In [None]:
# Filter the header out 
rawUserMovieData_wo_header = rawUserMovieData.filter(lambda x:"userId" not in x)

In [None]:
rawUserMovieData_wo_header.take(10)

In [None]:
rawUserMovieData_wo_header.count()

In [None]:
# Extract the ratings column nwhere ratings is 4 or 5
# The code below gives the mean rating given to the movies (average of all the ratings) in the dataset
rawUserMovieData_wo_header.map(lambda x:float(x.split(",")[3])).stats()

In [None]:
from pyspark.mllib.recommendation import Rating,ALS

In [None]:
# Extract the ratings column nwhere ratings is 4 or 5
# Since we are running this algorithm on a local machine, filtering low ratings will help
# 1. Reduce the amount of processing
# 2. Reduce the amount of data held in-memory

# Since "rawUserMovieData_wo_header" is an RDD of Strings, we need to convert this into RDD of Rating objects
# Additionally, we have filtered out any ratings that are below 4
# Convert the list into a Rating object (Line #4 below)
# Using persist function, ALS will pass over this RDD many times. Persisting will make the computation much faster

uaData=rawUserMovieData_wo_header\
    .map(lambda x:x.split(","))\
    .filter(lambda x: float(x[3])>=4)\
    .map(lambda x:Rating(x[1],x[2],x[3]))
uaData.persist()

In [None]:
uaData.take(10)

In [None]:
# ALS has 2 methods : train and trainImplicit. Since our ratings are explicit we use the train method.
# Explicit vs. implicit feedback
# The standard approach to matrix factorization based collaborative filtering treats 
# the entries in the user-item matrix as explicit preferences given by the user to the item, 
# for example, users giving ratings to movies.
# model = ALS.train(ratings, rank (Factors), numIterations, lambda)

model=ALS.train(uaData,10,5,0.01)

In [None]:
user = 8

In [None]:
# Give below method a user id, and the number of recommendations we want
recommendations=model.recommendProducts(user,5)

In [None]:
recommendations

In [None]:
# Split the row into a tuple of (Movie ID, Movie Name)

moviesPath="C:/Users/emrijai/Documents/IPython Notebooks/MS7331/Project3/ml-100k/ml-100k/u.item"
moviesLookup=sc.textFile(moviesPath).map(lambda x:x.split("|"))
moviesLookup.persist()

In [None]:
# Let's see which movies the user (specific user with userId) likes and rated 5

userMovies=rawUserMovieData_wo_header\
    .map(lambda x:x.split(","))\
    .filter(lambda x:int(x[1])==user and int(x[3])>4)\
    .map(lambda x:x[2]).collect()

In [None]:
# Use the lookup action to print the names of the movies this user already likes
for movies in userMovies: 
    print moviesLookup.lookup(movies)

Looks like the user likes Action, War/Drama movies

In [None]:
# Let’s print the recommended Artist names

for rating in recommendations: 
    print moviesLookup.lookup(str(rating.product))

In [None]:
Latent Factor analysis and ALS are pretty magical. We just need to have a good dataset with User-Product Ratings
The algorithm takes care of finding out the hidden factors that influence user’s preferences
Running this in a spark cluster with millions or records will help us get better results quickly and with very less effort 

## Association rules from the Movielens dataset

Association rules normally make sense with purchases / transactions datasets for example market basket analysis and hence stacking up specific products together logically. In this case, the rules we create may not make much sense, but they can help to determine and understand for example if a person who watches movie a will also be likely to have watched movie b, and hence we can see which movies are normally associate with each other with some minimum support and confidence. That way we can bucketize these movies together and display them on the screen of a specific user next to each other.

The itertools module below will help us generate all permutations of movies
We'll use that to find the possible rules and then filter for those with the required confidence

Since this is a very expensive operation to iterate over all the permutations of such a huge dataset, we are going to increase the required support to 40% for this to work on a single laptop machine. We can use the logic on a large dataset using Spark Cluster Computing environment, where multiple worker nodes can work on the data set in parallel and generate the results much faster.


In [56]:
import itertools

allitems=[]

def support(itemset):
    userList=userItemRatingMatrix.index
    nUsers=len(userList)
    ratingMatrix=userItemRatingMatrix
    for item in itemset:
        ratingMatrix=ratingMatrix.loc[ratingMatrix.loc[:,item]>0]
        #Subset the ratingMatrix to the set of users who have rated this item 
        userList=ratingMatrix.index
    # After looping through all the items in the set, we are left only with the
    # users who have rated all the items in the itemset
    return float(len(userList))/float(nUsers)
# Support is the proportion of all users who have watched this set of movies 

minsupport=0.4
for item in list(userItemRatingMatrix.columns):
    itemset=[item]
    if support(itemset)>minsupport:
        allitems.append(item)

# We are now left only with the items which have been rated by atleast 40% of the users

In [57]:
print 'Number of movies were watched by atleast 40% of the users = ' , (len(allitems))

print '\nFrom these movies we will generate rules and test again for support and confidence'

print '\nList of movie Ids', allitems

Number of movies were watched by atleast 40% of the users =  17

From these movies we will generate rules and test again for support and confidence

List of movie Ids [1, 7, 50, 56, 98, 100, 117, 121, 127, 174, 181, 237, 258, 286, 288, 294, 300]


The below snippet will generate all possible 2 item rules which satisfy the support and confidence constraints. 
By Iterating over i, we can continue on  for finding 3 item rules or even n item rules. At each step make sure that every rule satisfies minconfidence and minsupport

In [64]:
minconfidence=0.2
assocRules=[]
i=2
for rule in itertools.permutations(allitems,i):
    #Generates all possible permutations of i items from the remaining list of movies 
    from_item=[rule[0]]
    to_item=rule
    # each rule is a tuple of i items 
    confidence=support(to_item)/support(from_item)
    if confidence>minconfidence and support(to_item)>minsupport:
        assocRules.append(rule)

In [65]:
assocRules

[(1, 50),
 (50, 1),
 (50, 100),
 (50, 174),
 (50, 181),
 (100, 50),
 (174, 50),
 (181, 50)]