# Assignment 2

---

In this assignment for CUNY's DATA 643 Recommender Systems, I use both collaborative filtering and content-based filtering to build two systems that recommend movies to users.

The data was downloaded from the [MovieLens](https://grouplens.org/datasets/movielens/) dataset. I chose the 100k dataset because the larger datasets are too large to run on my computer, and it was not worth the extra time and expense to run them in the cloud.

While the ratings come from the MovieLens data, I used the [OMDBAPI](http://www.omdbapi.com) to scrape plot summaries for the content-based system. For this I purchased a membership and scraped the data in a separate script (available on [my github](https://github.com/kaiserxc/DATA643)).

---

## Collaborative Filtering (CF):

CF uses the similarity of user ratings to recommend movies that users have not seen yet. This is commonly done with Pearson correlation, which can be seen as a mean-centered cosine similarity. It computes the correlation between the target user and every other user. This can be computationally expensive because the full calculation involves an $M \times M$ correlation matrix, where $M$ is the number of users.
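
To make the "mean-centered cosine" remark concrete, here is a minimal sketch in plain Python. The two users and their ratings are hypothetical, and the sketch assumes both users rated the same five movies; a real user-user system would first restrict each pair to the movies they have both rated.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson(u, v):
    """Pearson correlation: cosine similarity after subtracting each user's mean."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])

# Hypothetical ratings from two users on the same five movies.
alice = [5, 4, 1, 2, 3]
bob = [4, 5, 2, 1, 3]

print(round(cosine(alice, bob), 3))   # 0.964
print(round(pearson(alice, bob), 3))  # 0.8
```

The raw cosine is close to 1 simply because all ratings are positive. Centering removes each user's average, so the Pearson value reflects agreement about which movies are above or below that user's own baseline.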

In my implementation I used the [Surprise library](http://surpriselib.com) to implement a [Singular Value Decomposition (SVD) algorithm](https://medium.com/@m_n_malaeb/singular-value-decomposition-svd-in-recommender-systems-for-non-math-statistics-programming-4a622de653e9).

SVD uses matrix factorization to reduce the dimensions from n movies to k latent features. These k features can be thought of almost like genres, in that they represent ideas like 'action movies' or 'movies with a strong female lead'. However, the learned values will not align neatly with such labels; it is just an easier way for humans to think about SVD.
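
For reference, the `SVD` class in Surprise is, according to its documentation, the Funk-style factorization popularized during the Netflix Prize rather than a literal SVD of the ratings matrix. Under that reading, the predicted rating of user $u$ for movie $i$ is:

$$
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u
$$

Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are user and movie bias terms, and $p_u, q_i \in \mathbb{R}^k$ are the k-dimensional user and movie factor vectors. The parameters are learned by stochastic gradient descent on the known ratings only, which is why missing ratings do not need to be imputed first.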

In [1]:
import pandas as pd
import surprise
from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise import SVD

from surprise.model_selection import cross_validate

In [2]:
df = pd.read_csv('/Users/kailukowiak/DATA643/Project2/ml-latest-small/ratings.csv')
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)
algo = SVD()
mod = cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8944  0.9053  0.8968  0.8902  0.8998  0.8973  0.0051  
Fit time          7.25    7.13    7.07    7.11    7.09    7.13    0.07    
Test time         0.28    0.22    0.22    0.25    0.22    0.24    0.03    


In [3]:
mod

{'test_rmse': array([ 0.89438535,  0.90533662,  0.89681796,  0.89020844,  0.89976644]),
 'fit_time': (7.253556966781616,
  7.129652976989746,
  7.066343069076538,
  7.105708837509155,
  7.090865135192871),
 'test_time': (0.28129005432128906,
  0.22143006324768066,
  0.2174060344696045,
  0.24765801429748535,
  0.21558666229248047)}

In [4]:
mod['test_rmse']

array([ 0.89438535,  0.90533662,  0.89681796,  0.89020844,  0.89976644])

This model is simple to implement but provides good results: the mean test RMSE across the five folds is about 0.90.
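
As a quick check on how the summary columns of the cross-validation table are derived, the sketch below defines RMSE and recomputes the mean and standard deviation from the five fold values printed above. The `rmse` helper is illustrative and is not part of Surprise.

```python
import math
import statistics

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Per-fold test RMSE values copied from the cross-validation output above.
fold_rmse = [0.89438535, 0.90533662, 0.89681796, 0.89020844, 0.89976644]

mean_rmse = statistics.mean(fold_rmse)
std_rmse = statistics.pstdev(fold_rmse)  # population std, which matches the table

print(round(mean_rmse, 4), round(std_rmse, 4))  # 0.8973 0.0051
```

An RMSE of roughly 0.9 means that a typical prediction is off by a little under one star.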

---

## Content Based Filtering (CBF):

CBF uses attributes of the film to find similar movies. At its simplest, it could involve finding the nearest other movie based on Euclidean distance over attributes like `action movie`, `Bruce Willis` and `Rated R`.

I wanted to use plot synopses taken from OMDB and perform TF-IDF on them. Unfortunately, the download process was very slow, so I ended up using only a subset of the movies. This also made computation easier.

Because the dataset is much smaller, we can't compare the results with the collaborative filtering model, but it is interesting to do anyway.

In [5]:
df = pd.read_csv('~/DATA643/Project2/movieInfo.csv', index_col=0)
df.head()

Unnamed: 0,plot,director,actors,rated,genre,imdb_rating,title
1,A cowboy doll is profoundly threatened and jea...,John Lasseter,"Tom Hanks, Tim Allen, Don Rickles, Jim Varney",G,"Animation, Adventure, Comedy",8.3,Toy Story
2,When two kids find and play a magical board ga...,Joe Johnston,"Robin Williams, Jonathan Hyde, Kirsten Dunst, ...",PG,"Adventure, Family, Fantasy",6.9,Jumanji
3,John and Max resolve to save their beloved bai...,Howard Deutch,"Walter Matthau, Jack Lemmon, Sophia Loren, Ann...",PG-13,"Comedy, Romance",6.6,Grumpier Old Men
4,"Based on Terry McMillan's novel, this film fol...",Forest Whitaker,"Whitney Houston, Angela Bassett, Loretta Devin...",R,"Comedy, Drama, Romance",5.7,Waiting to Exhale
5,George Banks must deal not only with the pregn...,Charles Shyer,"Steve Martin, Diane Keaton, Martin Short, Kimb...",PG,"Comedy, Family, Romance",6.0,Father of the Bride Part II


This is the metadata, scraped from OMDB, for a subset of the MovieLens movies.

In [6]:
df[df['plot'].isnull()]

Unnamed: 0,plot,director,actors,rated,genre,imdb_rating,title
1165,,Nina Menkes,"Russ Little, Tinka Menkes, Robert Muller, Jack...",,"Crime, Drama",3.9,The Bloody Child


There is only one missing value here so I'm just going to delete it.

In [7]:
df = df.drop(1165)

Now for the TF-IDF system:

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')

tfidfMatrix = tfidf.fit_transform(df['plot'])

tfidfMatrix.shape

(1791, 7829)

In [9]:
from sklearn.metrics.pairwise import linear_kernel
cosineSim = linear_kernel(tfidfMatrix, tfidfMatrix)


In [10]:
testSeries = pd.Series(list(enumerate(cosineSim[0])))
testSeries.index = df.index
testSeries[0:10]

1                 (0, 1.0)
2                 (1, 0.0)
3     (2, 0.0193175578159)
4                 (3, 0.0)
5                 (4, 0.0)
6                 (5, 0.0)
7                 (6, 0.0)
8                 (7, 0.0)
9                 (8, 0.0)
10                (9, 0.0)
dtype: object

This is the row of the similarity matrix for the first movie, _Toy Story_, and it shows its cosine similarity to each other movie. All that's left is to find the highest similarities and recommend those movies.
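
A note on why `linear_kernel` (a plain dot product) gives cosine similarity here: `TfidfVectorizer` L2-normalises each row by default, so the dot product of two rows is already their cosine. The sketch below reproduces the idea in plain Python under simplifying assumptions: whitespace tokenisation, no stop-word removal, and three made-up one-line plots rather than the OMDB text.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalised TF-IDF dict per document (whitespace tokens)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed IDF, the same form scikit-learn uses by default.
        weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
                   for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v[t] for t, w in u.items() if t in v)

# Hypothetical one-line plots, not taken from the OMDB data.
plots = [
    "cowboy doll jealous new spaceman toy",
    "doll possessed serial killer",
    "retired boxer returns village",
]
vecs = tfidf_vectors(plots)
print(round(dot(vecs[0], vecs[0]), 6))                # 1.0, a plot compared with itself
print(dot(vecs[0], vecs[1]) > dot(vecs[0], vecs[2]))  # True, because of the shared word "doll"
```

The similarity is driven entirely by shared vocabulary, so two plots that both mention a doll score as related regardless of tone or genre.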

In [11]:
df[df.title == 'Hamlet']

Unnamed: 0,plot,director,actors,rated,genre,imdb_rating,title
1411,"Hamlet, Prince of Denmark, returns home to fin...",Kenneth Branagh,"Riz Abbasi, Richard Attenborough, David Blair,...",PG-13,Drama,7.8,Hamlet
1941,Prince Hamlet struggles over whether or not he...,Laurence Olivier,"John Laurie, Esmond Knight, Anthony Quayle, Ni...",NOT RATED,Drama,7.8,Hamlet


Because there are overlaps in the titles, I'm going to stick with the movie IDs. This makes it less user friendly, but I don't want to drop too many data points.

In [12]:
cosineSim = pd.DataFrame(cosineSim)
cosineSim.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1781,1782,1783,1784,1785,1786,1787,1788,1789,1790
0,1.0,0.0,0.019318,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.088884,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.019876,0.123599,0.0,...,0.012895,0.0,0.0,0.0,0.0,0.0,0.0,0.016867,0.0,0.0
2,0.019318,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Here we have a similarity matrix as a dataframe. We now need to reindex it for easier lookup. 

In [13]:
cosineSim.index = df.index
cosineSim.columns = df.index
cosineSim.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,2390,2391,2392,2394,2395,2396,2397,2398,2399,2400
1,1.0,0.0,0.019318,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.088884,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.019876,0.123599,0.0,...,0.012895,0.0,0.0,0.0,0.0,0.0,0.0,0.016867,0.0,0.0
3,0.019318,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
cosineSim.loc[1,:].sort_values(ascending=False)[1:11]

2315    0.163656
2321    0.109381
1799    0.106079
60      0.103100
1875    0.094552
147     0.094310
392     0.093675
1991    0.093601
1707    0.090216
1767    0.090205
Name: 1, dtype: float64

This is the basis for our recommender: _Toy Story_ is now indexed at 1, and we take the top 10 movies excluding _Toy Story_ itself.
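
The `[1:11]` slice works because the movie itself sorts first with a similarity of 1.0. If another movie happened to have an identical plot text, that assumption could break, so a slightly more defensive variant removes the movie by its ID before ranking. The sketch below uses a plain dict as a hypothetical stand-in for one row of the similarity matrix; movie ID 9999 is made up to show the tie case.

```python
def top_n(similarities, movie_id, n=10):
    """Return the n most similar movie IDs, excluding movie_id itself.

    `similarities` is a hypothetical dict mapping movie ID -> similarity score,
    standing in for one row of the similarity matrix.
    """
    others = {mid: s for mid, s in similarities.items() if mid != movie_id}
    return sorted(others, key=others.get, reverse=True)[:n]

# A few scores copied from the Toy Story row above, plus a made-up duplicate (9999).
row = {1: 1.0, 2315: 0.163656, 2321: 0.109381, 1799: 0.106079, 9999: 1.0}
print(top_n(row, movie_id=1, n=3))  # [9999, 2315, 2321]
```

With pandas, the same idea could likely be written as dropping the movie's label from the row and then taking the largest values, which avoids relying on sort position.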

In [15]:
def top10(df, cosineSim, movieIndex):
    recs = cosineSim.loc[movieIndex,:].sort_values(ascending=False)[1:11]
    recsIndex = recs.index
    return df.loc[recsIndex,['title', 'plot']]

top10(df, cosineSim, 1)

Unnamed: 0,title,plot
2315,Bride of Chucky,"Chucky, the doll possessed by a serial killer,..."
2321,Pleasantville,Two 1990s teenage siblings find themselves in ...
1799,Suicide Kings,A group of youngsters kidnap a respected Mafia...
60,The Indian in the Cupboard,On his ninth birthday a boy receives many pres...
1875,Clockwatchers,The relationship between four female temps all...
147,The Basketball Diaries,A teenager finds his dreams of becoming a bask...
392,The Secret Adventures of Tom Thumb,A boy born the size of a small doll is kidnapp...
1991,Child's Play,A single mother gives her son a much sought-af...
1707,Home Alone 3,"Alex Pruitt, a young boy of nine living in Chi..."
1767,Music from Another Room,Music From Another Room is a romantic comedy t...


_Toy Story_ is at movie index 1. The top recommendation for it is _Bride of Chucky_, which is a horror film. It's a pretty disturbing failure.

Let's see how it fares with _Bride of Chucky_:

In [16]:
top10(df, cosineSim, 2315)

Unnamed: 0,title,plot
1991,Child's Play,A single mother gives her son a much sought-af...
481,Kalifornia,A journalist duo go on a tour of serial killer...
320,Suture,"After his brother tries to kill him, a man sur..."
1,Toy Story,A cowboy doll is profoundly threatened and jea...
22,Copycat,An agoraphobic psychologist and a female detec...
1324,Amityville Dollhouse,"A children's doll house, which is a miniature ..."
1661,Switchback,An FBI agent tries to catch a serial killer wh...
1493,Love and Other Catastrophes,A day in the life of two film-school students ...
392,The Secret Adventures of Tom Thumb,A boy born the size of a small doll is kidnapp...
1993,Child's Play 3,"Chucky returns for revenge against Andy, the y..."


This is more of a mixed bag. Again, we have _Toy Story_, but _Kalifornia_ might be more appropriate.

In [17]:
top10(df, cosineSim, 296) # Pulp Fiction
# At least we get Goodfellas.


Unnamed: 0,title,plot
2318,Happiness,The lives of several individuals intertwine as...
1228,Raging Bull,"Emotionally self-destructive boxer, Jake La Mo..."
1900,Children of Heaven,"After a boy loses his sister's pair of shoes, ..."
2019,Seven Samurai,A poor village under attack by bandits recruit...
962,They Made Me a Criminal,A boxer flees believing he has comitted a murd...
1226,The Quiet Man,A retired American boxer returns to the villag...
1114,The Funeral,"In the 30's, in New York, the coffin of the le..."
1213,Goodfellas,The story of Henry Hill and his life in the mo...
18,Four Rooms,Four interlocking tales that take place in a f...
2130,Atlantic City,"In a corrupt city, a small-time gangster and t..."


## Attribute Based Recommender

Given the questionable results from the TF-IDF recommender, let's try an attribute based recommender.

In [19]:
movies = pd.read_csv('DATA643/Project2/ml-latest-small/movies.csv', index_col='movieId')
movies.head()
# Merge the attributes with the previous df

movies = df.join(movies, lsuffix='_left', rsuffix='_right')
# Drop the columns not used as attributes (axis passed by keyword for newer pandas)
movies = movies.drop(['title_right', 'plot', 'imdb_rating', 'genres'], axis=1)

movies.head()

Unnamed: 0,director,actors,rated,genre,title_left
1,John Lasseter,"Tom Hanks, Tim Allen, Don Rickles, Jim Varney",G,"Animation, Adventure, Comedy",Toy Story
2,Joe Johnston,"Robin Williams, Jonathan Hyde, Kirsten Dunst, ...",PG,"Adventure, Family, Fantasy",Jumanji
3,Howard Deutch,"Walter Matthau, Jack Lemmon, Sophia Loren, Ann...",PG-13,"Comedy, Romance",Grumpier Old Men
4,Forest Whitaker,"Whitney Houston, Angela Bassett, Loretta Devin...",R,"Comedy, Drama, Romance",Waiting to Exhale
5,Charles Shyer,"Steve Martin, Diane Keaton, Martin Short, Kimb...",PG,"Comedy, Family, Romance",Father of the Bride Part II


It could be worthwhile separating the genres into individual dummy variables so that Animation, Adventure and Comedy each have their own feature. I'm not going to do this, partly because I don't think the relationship is linear and partly because keeping the combined genre string is easier. However, I am going to split the actors, because treating each full cast list as a single category would leave almost no overlap between movies, and because I think movies with Tom Hanks would be of interest to people even if Tim Allen wasn't in them.

Now we just need to make dummy variables for each feature.

In [20]:
dumDirector = pd.get_dummies(movies['director'])
dumRated = pd.get_dummies(movies['rated'])
dumGenre = pd.get_dummies(movies['genre'])
dumActors = df['actors'].str.get_dummies(sep=',')

In [21]:
distDF = pd.concat([dumDirector, dumRated, dumGenre, dumActors], axis=1)
distDF.head()

Unnamed: 0,Aaron Speiser,Abbas Kiarostami,Abel Ferrara,Abraham Polonsky,Adam Resnick,Adrian Lyne,Agnieszka Holland,Akira Kurosawa,Al Pacino,Alain Berliner,...,Ying Huang,You Ge,Yun-Fat Chow,Yves Montand,Yûki Kudô,Zach Galligan,Zbigniew Zamachowski,Zdenek Sverák,Élodie Bouchez,Émile Genest
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
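
One detail worth checking in the actor encoding: the actor strings are separated by a comma followed by a space, while the split above uses `sep=','`. As far as I know, `str.get_dummies` splits on the literal separator without stripping whitespace, so an actor listed first and the same actor listed later could end up in two different columns. The plain-Python sketch below imitates that splitting step; the helper name and the two cast strings are hypothetical.

```python
def multi_hot_columns(values, sep):
    """Collect the distinct labels produced by splitting each string on `sep`."""
    labels = set()
    for value in values:
        labels.update(value.split(sep))
    return sorted(labels)

# Actor strings in the same "A, B" format as the OMDB data (order swapped on purpose).
actors = ["Tom Hanks, Tim Allen", "Tim Allen, Tom Hanks"]

print(multi_hot_columns(actors, ","))   # [' Tim Allen', ' Tom Hanks', 'Tim Allen', 'Tom Hanks']
print(multi_hot_columns(actors, ", "))  # ['Tim Allen', 'Tom Hanks']
```

If that behaviour applies here, splitting on `', '` instead would merge the duplicated columns, at the cost of changing the distances reported in the rest of this section.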


In [22]:
from sklearn.metrics.pairwise import euclidean_distances
dist = euclidean_distances(distDF, distDF)

In [23]:
distDF = pd.DataFrame(dist)
distDF.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1781,1782,1783,1784,1785,1786,1787,1788,1789,1790
0,0.0,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,...,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.464102
1,3.741657,0.0,3.741657,3.741657,3.464102,3.741657,3.464102,3.464102,3.741657,3.741657,...,3.741657,3.741657,3.464102,3.464102,3.741657,3.741657,3.464102,3.741657,3.162278,3.741657
2,3.741657,3.741657,0.0,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.464102,...,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657
3,3.741657,3.741657,3.741657,0.0,3.741657,3.464102,3.741657,3.741657,3.464102,3.741657,...,3.464102,3.464102,3.741657,3.741657,3.464102,3.464102,3.741657,3.741657,3.741657,3.741657
4,3.741657,3.464102,3.741657,3.741657,0.0,3.741657,3.464102,3.464102,3.741657,3.741657,...,3.741657,3.741657,3.464102,3.464102,3.741657,3.741657,3.464102,3.741657,3.464102,3.741657


In [24]:
distDF.columns = df.index
distDF.index = df.index
distDF.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,2390,2391,2392,2394,2395,2396,2397,2398,2399,2400
1,0.0,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,...,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.464102
2,3.741657,0.0,3.741657,3.741657,3.464102,3.741657,3.464102,3.464102,3.741657,3.741657,...,3.741657,3.741657,3.464102,3.464102,3.741657,3.741657,3.464102,3.741657,3.162278,3.741657
3,3.741657,3.741657,0.0,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.464102,...,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657,3.741657
4,3.741657,3.741657,3.741657,0.0,3.741657,3.464102,3.741657,3.741657,3.464102,3.741657,...,3.464102,3.464102,3.741657,3.741657,3.464102,3.464102,3.741657,3.741657,3.741657,3.741657
5,3.741657,3.464102,3.741657,3.741657,0.0,3.741657,3.464102,3.464102,3.741657,3.741657,...,3.741657,3.741657,3.464102,3.464102,3.741657,3.741657,3.464102,3.741657,3.464102,3.741657
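
The printed corner of the distance matrix contains only three distinct non-zero values: 3.741657, 3.464102 and 3.162278, which are $\sqrt{14}$, $\sqrt{12}$ and $\sqrt{10}$. That is what one would expect for 0/1 rows, where the Euclidean distance is the square root of the number of columns in which two movies differ. The sketch below assumes each movie switches on seven columns (one director, one rating, one genre string and four actors); the feature names are placeholders.

```python
import math

def binary_euclidean(features_a, features_b):
    """Euclidean distance between two one-hot rows, given their active columns as sets."""
    return math.sqrt(len(features_a ^ features_b))

# Hypothetical feature sets: 1 director + 1 rating + 1 genre string + 4 actors = 7 each.
movie_a = {"dir:A", "rated:G", "genre:X", "a1", "a2", "a3", "a4"}
movie_b = {"dir:B", "rated:PG", "genre:Y", "b1", "b2", "b3", "b4"}  # nothing shared
movie_c = {"dir:C", "rated:G", "genre:Z", "c1", "c2", "c3", "c4"}   # shares the rating
movie_d = {"dir:D", "rated:G", "genre:X", "d1", "d2", "d3", "d4"}   # shares rating and genre

for other in (movie_b, movie_c, movie_d):
    print(round(binary_euclidean(movie_a, other), 6))
# 3.741657
# 3.464102
# 3.162278
```

Each shared attribute lowers the squared distance by two, so the metric is effectively counting shared attributes. Many movies therefore tie, and the order among tied movies carries no information.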


In [25]:
def eucludRec(df=df, distDF=distDF, movieIndex=1):
    recs = distDF.loc[movieIndex,:].sort_values(ascending=True)[1:11]
    recsIndex = recs.index
    return df.loc[recsIndex,['title', 'plot']]
eucludRec(df, distDF, 1)

Unnamed: 0,title,plot
1881,The Magic Sword: Quest for Camelot,"An adventurous girl, a young blind hermit, and..."
1031,Bedknobs and Broomsticks,"An apprentice witch, three kids and a cynical ..."
759,Maya Lin: A Strong Clear Vision,A film about the work of the artist most famou...
709,Oliver & Company,A lost and alone kitten joins a gang of dogs e...
1566,Hercules,The son of the Greek Gods Zeus and Hera is str...
2080,Lady and the Tramp,The romantic tale of a sheltered uptown Cocker...
2085,101 Dalmatians,When a litter of Dalmatian puppies are abducte...
588,Aladdin,When a street urchin vies for the love of a be...
2090,The Rescuers,Two mice of the Rescue Aid Society search for ...
2092,Aladdin: The Return of Jafar,"Jafar comes for revenge on Aladdin, using a fo..."


In [26]:
df.loc[1881]

plot           An adventurous girl, a young blind hermit, and...
director                                        Frederik Du Chau
actors         Jessalyn Gilsig, Andrea Corr, Cary Elwes, Brya...
rated                                                          G
genre                               Animation, Adventure, Comedy
imdb_rating                                                  6.3
title                         The Magic Sword: Quest for Camelot
Name: 1881, dtype: object

This looks a lot better. While _Toy Story_ is a bit different from _The Magic Sword_, many of the other recommendations are very similar.

For _Bride of Chucky_ we get:

In [27]:
eucludRec(df, distDF, 2315)

Unnamed: 0,title,plot
2102,Steamboat Willie,"Mickey Mouse, piloting a steamboat, delights h..."
1746,Senseless,A student gets his senses enhanced by an exper...
285,Beyond Bedlam,Dr. Stephanie Lyell works for Neurological Res...
1991,Child's Play,A single mother gives her son a much sought-af...
204,Under Siege 2: Dark Territory,Casey Ryback hops on a Colorado to LA train to...
759,Maya Lin: A Strong Clear Vision,A film about the work of the artist most famou...
1261,Evil Dead II,The lone survivor of an onslaught of flesh-pos...
496,What Happened Was...,This darkly humorous film explores the persona...
1321,An American Werewolf in London,Two American college students on a walking tou...
1241,Dead Alive,A young man's mother is bitten by a Sumatran r...


Which also looks a bit better, although I'm still uncomfortable that it recommended a Mickey Mouse short from the 1920s.

Many of the other results are much better though.

---

There might be room for improvement by mixing the TF-IDF and Euclidean recommenders, but that is beyond the scope of this assignment.
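
For the curious, one possible shape of such a mix is a weighted blend of the two scores. This is only a sketch under stated assumptions and was not run on the data above: the function name, the `alpha` weight and the `1 / (1 + d)` mapping from distance to similarity are all illustrative choices, and the two candidates are made up.

```python
def blend_scores(cosine_sim, euclid_dist, alpha=0.5):
    """Blend a plot cosine similarity with an attribute Euclidean distance.

    The distance is mapped to a similarity with 1 / (1 + d); alpha is a
    hypothetical weight on the plot-based similarity.
    """
    attribute_sim = 1.0 / (1.0 + euclid_dist)
    return alpha * cosine_sim + (1.0 - alpha) * attribute_sim

# Made-up candidates: (plot cosine similarity, attribute distance).
candidates = {
    "close attributes": (0.05, 3.162278),
    "close plot": (0.16, 3.741657),
}
ranked = sorted(candidates, key=lambda k: blend_scores(*candidates[k]), reverse=True)
print(ranked)  # ['close plot', 'close attributes']
```

Because the two scores live on different scales, the choice of `alpha` and of the distance mapping would need tuning before the blend could be trusted.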