# Project 3 - Recommender Systems

## Association Rule mining, Collaborative Filtering and Content Based Filtering


### Brett Hallum, Mridul Jain, and Solomon Ndungu


# Introduction

The goal of this project is to analyze the MovieLens dataset and understand how users rate movies. We will use this data to generate movie recommendations for specific users by looking at the movies they have already watched and the ratings they gave. Using collaborative filtering, we can find a movie "X" that is liked by users similar to "User-A", and recommend it to User-A as well.

# Understanding the Data
GroupLens Research has collected and made available rating data sets from the MovieLens web site (http://movielens.org). The data sets were collected over various periods of time, depending on the size of the set.
This dataset contains multiple files. We are interested in two of them: u.data, which holds the userId, the movieId, the rating, and the time the rating was given, and u.item, which holds information about each movie.

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from this data set.

u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies.  Users and items are
              numbered consecutively from 1.  The data is randomly
              ordered. This is a tab separated list of 
	         user id | item id | rating | timestamp. 
              The time stamps are unix seconds since 1/1/1970 UTC   
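
Because the timestamps are Unix seconds, they can be converted to readable dates with `pd.to_datetime`. The sketch below uses the first three rows of `u.data` as printed by `data.head()` in this notebook; the `sample` frame and the `ratedAt` column name are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Three rows copied from the data.head() output of this notebook
sample = pd.DataFrame({'userId': [196, 186, 22],
                       'itemId': [242, 302, 377],
                       'rating': [3, 3, 1],
                       'timestamp': [881250949, 891717742, 878887116]})

# unit='s' tells pandas the integers are seconds since 1970-01-01 UTC
sample['ratedAt'] = pd.to_datetime(sample['timestamp'], unit='s')
print(sample[['timestamp', 'ratedAt']])
```

The resulting dates should fall inside the collection window quoted above (September 1997 through April 1998), which is a quick sanity check on the column order.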


u.item     -- Information about the items (movies); this is a tab separated
              list of
              movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.
              The movie ids are the ones used in the u.data data set.
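
To illustrate how the 19 genre flags can be read, the sketch below parses two rows in the `u.item` layout from an in-memory string and lists the genres flagged with 1. The rows, titles, and URLs are invented placeholders (not real dataset entries), and `makeRow`, `items`, and `genreList` are hypothetical names used only for this example.

```python
import io
import pandas as pd

genres = ['unknown', 'Action', 'Adventure', 'Animation', "Children's",
          'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir',
          'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
          'War', 'Western']
columns = ['itemId', 'title', 'releaseDate', 'videoReleaseDate', 'imdbUrl'] + genres

def makeRow(itemId, title, flagged):
    # Build one pipe-separated line with a 1 for every genre in `flagged`
    flags = u'|'.join(u'1' if g in flagged else u'0' for g in genres)
    return u'%d|%s|01-Jan-1995||http://example.com/%d|%s' % (itemId, title, itemId, flags)

# Two invented rows in the u.item layout (placeholders, not real data)
raw = u'\n'.join([
    makeRow(1, u'Movie A (1995)', ['Animation', "Children's", 'Comedy']),
    makeRow(2, u'Movie B (1995)', ['Action', 'Adventure', 'Thriller'])])

items = pd.read_csv(io.StringIO(raw), sep='|', header=None, names=columns)

# For each movie, collect the names of the genres whose flag is 1
items['genreList'] = [[g for g, flag in zip(genres, row) if flag == 1]
                      for row in items[genres].values]
print(items[['title', 'genreList']])
```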


# Data Exploration and Visualization


In [1]:
import os
os.chdir('C:/Users/emrijai/Documents/IPython Notebooks/MS7331/Project3/ml-100k/ml-100k/')
os.getcwd()

'C:\\Users\\emrijai\\Documents\\IPython Notebooks\\MS7331\\Project3\\ml-100k\\ml-100k'

In [2]:
import numpy as np 
import pandas as pd

In [3]:
#Files to be used for analysis

dataFile='u.data'
movieInfoFile='u.item'

In [4]:
#We are passing the header explicitly as there is no header info in the files
#We are not interested in all the columns of 'u.item'. We are going to use only 0,1 columns from this file.

data=pd.read_csv(dataFile,sep="\t",header=None,names=['userId','itemId','rating','timestamp'])
movieInfo=pd.read_csv(movieInfoFile,sep="|", header=None, index_col=False,
                     names=['itemId','title'], usecols=[0,1])

In [5]:
print data.head()
print '\n'
print movieInfo.head()

   userId  itemId  rating  timestamp
0     196     242       3  881250949
1     186     302       3  891717742
2      22     377       1  878887116
3     244      51       2  880606923
4     166     346       1  886397596


   itemId              title
0       1   Toy Story (1995)
1       2   GoldenEye (1995)
2       3  Four Rooms (1995)
3       4  Get Shorty (1995)
4       5     Copycat (1995)


In [6]:
# Merging the two files together into one single dataFrame. We will use this dataFrame in the further analysis.

data=pd.merge(data,movieInfo,on='itemId')

In [7]:
print data.shape
print data.head()

(100000, 5)
   userId  itemId  rating  timestamp         title
0     196     242       3  881250949  Kolya (1996)
1      63     242       3  875747190  Kolya (1996)
2     226     242       5  883888671  Kolya (1996)
3     154     242       3  879138235  Kolya (1996)
4     306     242       5  876503793  Kolya (1996)


In [8]:
data=data.sort_values(['userId','itemId'],ascending=[False,True])

# Let's see how many users and how many movies there are.
# IDs are numbered consecutively from 1, so the largest ID equals the count.
numUsers=max(data.userId)
numMovies=max(data.itemId)

moviesPerUser=data.userId.value_counts()
usersPerMovie=data.title.value_counts()

print 'Number of Users: ', numUsers
print 'Number of Movies: ', numMovies
print '\n'
print 'Number of users that rate a particular Movie: \n\n', usersPerMovie.head()
print '\n'
print 'Number of movies rated by particular User: \n\n', moviesPerUser.head()

Number of Users:  943
Number of Movies:  1682


Number of users that rate a particular Movie: 

Star Wars (1977)             583
Contact (1997)               509
Fargo (1996)                 508
Return of the Jedi (1983)    507
Liar Liar (1997)             485
Name: title, dtype: int64


Number of movies rated by particular User: 

405    737
655    685
13     636
450    540
276    518
Name: userId, dtype: int64


In [9]:
data.head()

       userId  itemId  rating  timestamp                       title
23781     943       2       5  888639953            GoldenEye (1995)
65410     943       9       3  875501960     Dead Man Walking (1995)
35098     943      11       4  888639000        Seven (Se7en) (1995)
43773     943      12       5  888639093  Usual Suspects, The (1995)
57040     943      22       4  888639042           Braveheart (1995)


In [10]:
#Function to return the topN Movies for a specific user. N is an arbitrary number, and can be changed as needed.

def topN(activeUser,N):
    user_topN = data.loc[data.userId == activeUser]
    return user_topN.loc[user_topN.rating > 4].head(N)

In [11]:
moviesPerUser.index[:10]

Int64Index([405, 655, 13, 450, 276, 416, 537, 303, 234, 393], dtype='int64')

In [12]:
TopMoviesList = pd.DataFrame()

Num_Active_Critics_to_Check = 20
Num_Movies_by_Each_Critic = 500

for i in moviesPerUser.index[:Num_Active_Critics_to_Check]:
    TopMoviesList = TopMoviesList.append(topN(i,Num_Movies_by_Each_Critic))

del TopMoviesList['userId']
del TopMoviesList['timestamp']

#More than 20% of the critics (at least 5 of 20) gave the movie a top rating

TopMoviesList = TopMoviesList.title.value_counts()
TopMoviesList = TopMoviesList[TopMoviesList>Num_Active_Critics_to_Check/5]

print '\nMovies that are rated highly by most active movie raters in the dataset\n\n', TopMoviesList


Movies that are rated highly by most active movie raters in the dataset

Star Wars (1977)                                                               15
Godfather, The (1972)                                                          13
Usual Suspects, The (1995)                                                     11
Monty Python and the Holy Grail (1974)                                         10
Pulp Fiction (1994)                                                            10
Apocalypse Now (1979)                                                           9
Jaws (1975)                                                                     9
Schindler's List (1993)                                                         9
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)     9
Empire Strikes Back, The (1980)                                                 9
Raising Arizona (1987)                                                          9
Raiders of the Lost Ark 

In [15]:
# Since userId 405 is the most active user and seems to be a movie buff, it is a good idea to check which movies they liked.
# Let's see user 405's highest and lowest rated movies.

user_405 = data.loc[data.userId == 405]
user_405_HighestRatings = user_405.loc[user_405.rating > 4]
user_405_LowestRatings = user_405.loc[user_405.rating < 2]

In [16]:
print '5 Highest Rated Movies by UserID 405', user_405_HighestRatings.head(5)
print '\n5 Lowest Rated Movies by UserID 405', user_405_LowestRatings.head(5)

5 Highest Rated Movies by UserID 405        userId  itemId  rating  timestamp                       title
43709     405      12       5  885545306  Usual Suspects, The (1995)
56861     405      22       5  885545167           Braveheart (1995)
14992     405      23       5  885545372          Taxi Driver (1976)
68788     405      38       5  885548093             Net, The (1995)
48303     405      47       5  885545429              Ed Wood (1994)

5 Lowest Rated Movies by UserID 405        userId  itemId  rating  timestamp                 title
23701     405       2       1  885547953      GoldenEye (1995)
72281     405      27       1  885546487       Bad Boys (1995)
89654     405      30       1  885549544  Belle de jour (1967)
87587     405      31       1  885548579   Crimson Tide (1995)
6166      405      32       1  885546025          Crumb (1994)


As in any personalized recommendation scenario, the introduction of new users or new items can
cause the cold start problem: there is insufficient data on these new entries for
collaborative filtering to work accurately.
To work around this, we find the most active raters, whom we call Movie Critics, and see which movies they rated highest
and which movies they rated lowest. The highly rated movies can be recommended in general to people who have not rated
or seen any movies yet and are new to the system, while the low-rated ones can be left out.
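
One way to sketch this fallback for brand-new users is to rank movies by how many high ratings they received and require a minimum number of such ratings. The toy frame below is invented, and `popularFallback` with its `minRating`, `minVotes`, and `N` parameters is a hypothetical helper, not the function used in the cells of this notebook.

```python
import pandas as pd

# Invented toy ratings in the same layout as the merged `data` frame
toy = pd.DataFrame({
    'userId': [1, 1, 2, 2, 3, 3, 3],
    'title':  ['A', 'B', 'A', 'C', 'A', 'B', 'C'],
    'rating': [5, 2, 4, 5, 5, 1, 4]})

def popularFallback(ratings, minRating=4, minVotes=2, N=10):
    """Titles with at least `minVotes` ratings >= `minRating`, most liked first."""
    liked = ratings.loc[ratings.rating >= minRating, 'title'].value_counts()
    return liked[liked >= minVotes].head(N)

print(popularFallback(toy))
```

The minimum-vote threshold plays the same role as the 20% agreement cut-off used for the critics: it keeps a single enthusiastic rating from promoting a movie.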

In [13]:
#Function to return up to N low-rated movies (rating below 3) for a specific user. N is an arbitrary number, and can be changed as needed.

def bottomN(activeUser,N):
    user_bottomN = data.loc[data.userId == activeUser]
    return user_bottomN.loc[user_bottomN.rating < 3].head(N)

In [14]:
bottomMoviesList = pd.DataFrame()

Num_Active_Critics_to_Check = 20
Num_Movies_by_Each_Critic = 500

for i in moviesPerUser.index[:Num_Active_Critics_to_Check]:
    bottomMoviesList = bottomMoviesList.append(bottomN(i,Num_Movies_by_Each_Critic))

del bottomMoviesList['userId']
del bottomMoviesList['timestamp']

#More than 20% of the critics (at least 5 of 20) gave the movie a low rating

bottomMoviesList = bottomMoviesList.title.value_counts()
bottomMoviesList = bottomMoviesList[bottomMoviesList>Num_Active_Critics_to_Check/5]

print '\nMovies that are rated low by most active movie raters in the dataset\n\n', bottomMoviesList


Movies that are rated low by most active movie raters in the dataset

Batman Forever (1995)                                      8
Very Brady Sequel, A (1996)                                7
Volcano (1997)                                             7
Waterworld (1995)                                          7
Die Hard: With a Vengeance (1995)                          7
Natural Born Killers (1994)                                7
Pretty Woman (1990)                                        7
Lord of Illusions (1995)                                   6
Free Willy (1993)                                          6
Long Kiss Goodnight, The (1996)                            6
Event Horizon (1997)                                       6
Remains of the Day, The (1993)                             6
Twister (1996)                                             6
High School High (1996)                                    6
Liar Liar (1997)                                           6
Broken Arrow (

In [18]:
from scipy.spatial.distance import correlation

def similarity(user1,user2):
    # Center each user's ratings on that user's own mean rating.
    # np.nanmean() computes the mean while ignoring NaN (unrated) entries.
    user1=np.array(user1)-np.nanmean(user1)
    user2=np.array(user2)-np.nanmean(user2)
    # Compare the two users only on a common subset of movies.
    # After centering, the test below keeps the movies that both users rated
    # above their own mean; unrated movies drop out because NaN > 0 is False.
    commonItemIds=[i for i in range(len(user1)) if user1[i]>0 and user2[i]>0]
    if len(commonItemIds)==0:
        # If there are no movies in common
        return 0
    else:
        user1=np.array([user1[i] for i in commonItemIds])
        user2=np.array([user2[i] for i in commonItemIds])
        # correlation() is a distance: 1 - Pearson's r, ranging from 0 to 2,
        # so smaller values mean more similar users
        return correlation(user1,user2)
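
Note that `scipy.spatial.distance.correlation` returns a correlation distance (1 minus Pearson's r), so small values indicate similar users. As a hedged alternative sketch, and not the function that produced the outputs in this notebook, the variant below selects co-rated movies with an explicit NaN mask and returns Pearson's r directly, so that larger values mean more similar users. The name `pearsonSimilarity` and the choice to center on the co-rated movies only are assumptions made for this illustration.

```python
import numpy as np

def pearsonSimilarity(user1, user2):
    """Pearson correlation over co-rated items; 0.0 when it is undefined."""
    user1 = np.asarray(user1, dtype=float)
    user2 = np.asarray(user2, dtype=float)
    # Keep only the positions where both users have a rating (non-NaN)
    common = ~np.isnan(user1) & ~np.isnan(user2)
    if common.sum() < 2:
        return 0.0
    a = user1[common] - user1[common].mean()
    b = user2[common] - user2[common].mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    # A user with constant ratings on the common items has zero variance
    return float((a * b).sum() / denom) if denom > 0 else 0.0

nan = np.nan
similarTastes = pearsonSimilarity([5, 3, nan, 1], [4, 2, 5, 1])
oppositeTastes = pearsonSimilarity([5, 3, nan, 1], [1, 3, 5, 5])
print(similarTastes)
print(oppositeTastes)
```

With this convention, a descending sort puts the most similar users first.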

In [19]:
#Creating a very sparse matrix "user_to_Movie_Rating_Matrix" of userId by movie rating, which we will use later
# on to find the user-user correlation and hence which users are similar to each other.

user_to_Movie_Rating_Matrix=pd.pivot_table(data, values='rating',
                                    index=['userId'], columns=['itemId'])
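
To make "very sparse" concrete, the sketch below pivots a small invented ratings frame the same way and measures the share of cells that hold a rating. It then repeats the calculation with the counts from the dataset description (100000 ratings, 943 users, 1682 items). The `toy`, `matrix`, and `density` names are illustrative assumptions.

```python
import pandas as pd

# Invented toy ratings; the notebook itself pivots the merged `data` frame
toy = pd.DataFrame({'userId': [1, 1, 2, 3, 3],
                    'itemId': [10, 20, 10, 20, 30],
                    'rating': [5, 3, 4, 2, 1]})

matrix = pd.pivot_table(toy, values='rating', index=['userId'], columns=['itemId'])

# Share of user/movie cells that actually hold a rating
density = matrix.notnull().values.mean()
print(matrix)
print(density)

# Same ratio from the counts quoted in the dataset description
fullDensity = 100000.0 / (943 * 1682)
print(fullDensity)
```

For the full data set the ratio is roughly 6%, so the great majority of cells in the user-by-movie matrix are NaN.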

In [20]:
user_to_Movie_Rating_Matrix.head()

itemId  1     2     3     4     5     6     7     8     9     10    ...  1673  1674  1675  1676  1677  1678  1679  1680  1681  1682
userId
1       5.0   3.0   4.0   3.0   3.0   5.0   4.0   1.0   5.0   3.0   ...  NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
2       4.0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   2.0   ...  NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
3       NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   ...  NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
4       NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   ...  NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
5       4.0   3.0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   ...  NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN


### Collaborative Filtering

#### Memory-based: Find similar users (user-based CF) or items (item-based CF) to predict missing ratings
1. Produce recommendations based on the preferences of similar users 
	(Goldberg et al., 1992; Resnick et al., 1994; Mild and Reutterer, 2001)
2. Produce recommendations based on the relationship between items in the user-item matrix 
	(Kitts et al., 2000; Sarwar et al., 2001)

#### Model-based: Build a model from the rating data (clustering, latent semantic structure, etc.) and then use this model to predict missing ratings
There are many techniques:
1. Cluster users, then recommend the items liked by the users in the cluster closest to the active user
2. Mine association rules and then use the rules to recommend items (for binary/binarized data)
3. Define a null-model (a stochastic process which models usage of independent items) and then find significant deviation from the null-model
4. Learn a latent factor model from the data and then use the discovered factors to find items with high expected ratings
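
As a rough illustration of the latent factor idea in item 4, the sketch below centers a small invented rating matrix on each user's mean, keeps the `k` strongest factors of a singular value decomposition, and reads predictions for the empty cells off the low-rank reconstruction. Treating missing entries as the user's average and using plain `np.linalg.svd` are simplifying assumptions for this example, not necessarily the model used later in the project.

```python
import numpy as np

nan = np.nan
# Invented 4-user x 4-movie rating matrix; NaN marks an unrated movie
R = np.array([[5, 4, nan, 1],
              [4, nan, 1, 1],
              [1, 1, 5, nan],
              [nan, 1, 4, 5]], dtype=float)

userMeans = np.nanmean(R, axis=1, keepdims=True)
# Center each user's ratings and treat missing entries as "average" (0)
centered = np.where(np.isnan(R), 0.0, R - userMeans)

# Keep the k strongest latent factors of the centered matrix
k = 2
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
lowRank = U[:, :k].dot(np.diag(s[:k])).dot(Vt[:k, :])
approx = lowRank + userMeans

# Predicted ratings for the cells that were originally empty
for u, i in zip(*np.where(np.isnan(R))):
    print((int(u), int(i), round(float(approx[u, i]), 2)))
```

The number of factors `k` controls how much structure is kept: a small `k` smooths over individual ratings, while `k` equal to the full rank simply reproduces the filled-in matrix.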

First we are going to use the K-Nearest Neighbors technique, a memory-based collaborative filtering technique.
To do this, we find the K nearest neighbors (the most similar users) of the user in question and use those neighbors' ratings for a specific item/movie to predict the rating for the user in question.

The idea is to predict a user's ratings for the movies/products they have not yet rated, based on the ratings or feedback given by other users who are in one way or another very similar to the user we are making recommendations for.
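
A common way to turn neighbours' ratings into a prediction is a similarity-weighted average of each neighbour's offset from their own mean, added to the active user's mean. The sketch below shows this under stated assumptions: the ratings and similarity values are invented, and `predictRating` is a hypothetical helper rather than the formula this project necessarily uses.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Invented ratings: rows are users, columns are movies
ratings = pd.DataFrame([[5, 4, nan], [4, 5, 5], [2, 1, 2]],
                       index=['A', 'B', 'C'], columns=['m1', 'm2', 'm3'])
# Invented similarities of users B and C to the active user A
sims = pd.Series({'B': 0.9, 'C': 0.3})

def predictRating(activeUser, item, ratings, sims):
    """Active user's mean plus the similarity-weighted mean offset of the neighbours."""
    baseline = ratings.loc[activeUser].mean()
    # Only neighbours who actually rated the item can contribute
    neighbours = [v for v in sims.index if not np.isnan(ratings.loc[v, item])]
    weights = sims[neighbours].values
    if len(neighbours) == 0 or np.abs(weights).sum() == 0:
        return baseline
    offsets = np.array([ratings.loc[v, item] - ratings.loc[v].mean()
                        for v in neighbours])
    return baseline + (weights * offsets).sum() / np.abs(weights).sum()

predicted = predictRating('A', 'm3', ratings, sims)
print(predicted)
```

Working with offsets from each user's mean compensates for users who rate everything high or everything low.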

Next we are going to use a model-based approach, using latent factors and association rule mining, to predict ratings and recommend movies to users.

In [21]:
similarityMatrix=pd.DataFrame(index=user_to_Movie_Rating_Matrix.index,
                                  columns=['Similarity'])

In [22]:
similarityMatrix.head()

       Similarity
userId
1             NaN
2             NaN
3             NaN
4             NaN
5             NaN


In [23]:
for i in user_to_Movie_Rating_Matrix.index:
    similarityMatrix.loc[i]=similarity(user_to_Movie_Rating_Matrix.loc[2],
                                          user_to_Movie_Rating_Matrix.loc[i])
    
    # The statement above compares user 2 with user i and stores the result in similarityMatrix
        
    similarityMatrix=pd.DataFrame.sort_values(similarityMatrix,
                                              ['Similarity'],ascending=[0])

In [24]:
similarityMatrix.head()

       Similarity
userId
55              2
107             2
77              2
443             2
847             2


In [25]:
nearestNeighbours=similarityMatrix[:10]
nearestNeighbours

       Similarity
userId
55            2.0
107           2.0
77            2.0
443           2.0
847           2.0
370       1.66667
314        1.6455
230       1.63246
913       1.61237
675       1.61237
