## Import Modules

Before working with the data, import the modules whose functions the analysis relies on.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Loading movielens dataset

# User's data
users_columns = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=users_columns)
# Ratings data
rating_columns = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=rating_columns)
# Movies data
movie_columns = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
movies = pd.read_csv('ml-100k/u.item', sep='|', names=movie_columns, usecols=range(5),encoding='latin-1')

In [3]:
users.head()

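| | user_id | age | sex | occupation | zip_code |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | 24 | M | technician | 85711 |
| 1 | 2 | 53 | F | other | 94043 |
| 2 | 3 | 23 | M | writer | 32067 |
| 3 | 4 | 24 | M | technician | 43537 |
| 4 | 5 | 33 | F | other | 15213 |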


In [4]:
ratings.head()

| | user_id | movie_id | rating | unix_timestamp |
| --- | --- | --- | --- | --- |
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |


In [5]:
movies.head()

| | movie_id | title | release_date | video_release_date | imdb_url |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Toy%20Story%2... |
| 1 | 2 | GoldenEye (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?GoldenEye%20(... |
| 2 | 3 | Four Rooms (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Four%20Rooms%... |
| 3 | 4 | Get Shorty (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Get%20Shorty%... |
| 4 | 5 | Copycat (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Copycat%20(1995) |


In [6]:
# Merge movie's data with their ratings
movie_ratings = pd.merge(movies, ratings)
# merging movie_ratings data with the User's dataframe
df = pd.merge(movie_ratings, users)
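
Called without `on=`, `pd.merge` joins on whatever column names the two frames share (`movie_id` for the first merge, `user_id` for the second). The sketch below spells the keys out on tiny made-up frames; the rows, the `df_small` name, and the use of `validate=` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Tiny stand-ins for the movies, ratings and users frames (made-up rows)
movies_demo = pd.DataFrame({"movie_id": [1, 2], "title": ["A (1995)", "B (1996)"]})
ratings_demo = pd.DataFrame(
    {"user_id": [10, 10, 11], "movie_id": [1, 2, 1], "rating": [4, 5, 3]}
)
users_demo = pd.DataFrame({"user_id": [10, 11], "age": [24, 53]})

# Name the join key explicitly; validate= raises if the key relationship
# is not the one we expect (one movie to many ratings, many ratings to one user)
movie_ratings_demo = pd.merge(
    movies_demo, ratings_demo, on="movie_id", how="inner", validate="one_to_many"
)
df_small = pd.merge(
    movie_ratings_demo, users_demo, on="user_id", how="inner", validate="many_to_one"
)

print(df_small)
```

Naming the key guards against an accidental join on an unrelated column that happens to share a name in both frames.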

In [7]:
df.head()

| | movie_id | title | release_date | video_release_date | imdb_url | user_id | rating | unix_timestamp | age | sex | occupation | zip_code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 308 | 4 | 887736532 | 60 | M | retired | 95076 |
| 1 | 4 | Get Shorty (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Get%20Shorty%... | 308 | 5 | 887737890 | 60 | M | retired | 95076 |
| 2 | 5 | Copycat (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 308 | 4 | 887739608 | 60 | M | retired | 95076 |
| 3 | 7 | Twelve Monkeys (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Twelve%20Monk... | 308 | 4 | 887738847 | 60 | M | retired | 95076 |
| 4 | 8 | Babe (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Babe%20(1995) | 308 | 5 | 887736696 | 60 | M | retired | 95076 |


### Pre-processing

In [8]:
# Drop columns that aren't needed (by name rather than by position)
df.drop(columns=['video_release_date', 'imdb_url', 'unix_timestamp'], inplace=True)
ratings.drop(columns='unix_timestamp', inplace=True)
movies.drop(columns=['video_release_date', 'imdb_url'], inplace=True)

In [9]:
df.head()

| | movie_id | title | release_date | user_id | rating | age | sex | occupation | zip_code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 | 308 | 4 | 60 | M | retired | 95076 |
| 1 | 4 | Get Shorty (1995) | 01-Jan-1995 | 308 | 5 | 60 | M | retired | 95076 |
| 2 | 5 | Copycat (1995) | 01-Jan-1995 | 308 | 4 | 60 | M | retired | 95076 |
| 3 | 7 | Twelve Monkeys (1995) | 01-Jan-1995 | 308 | 4 | 60 | M | retired | 95076 |
| 4 | 8 | Babe (1995) | 01-Jan-1995 | 308 | 5 | 60 | M | retired | 95076 |


### Exploratory Analysis

Let's explore the data a bit and look at the movies with the highest average rating.

In [10]:
df[['title','rating']].groupby('title')['rating'].mean().sort_values(ascending=False)

title
Marlene Dietrich: Shadow and Light (1996)       5.0
Prefontaine (1997)                              5.0
Santa with Muscles (1996)                       5.0
Star Kid (1997)                                 5.0
Someone Else's America (1995)                   5.0
                                               ... 
Touki Bouki (Journey of the Hyena) (1973)       1.0
JLG/JLG - autoportrait de décembre (1994)       1.0
Daens (1992)                                    1.0
Butterfly Kiss (1995)                           1.0
Eye of Vichy, The (Oeil de Vichy, L') (1993)    1.0
Name: rating, Length: 1664, dtype: float64

Movies with the most ratings:

In [11]:
df[['title','rating']].groupby('title')['rating'].count().sort_values(ascending=False)

title
Star Wars (1977)                              583
Contact (1997)                                509
Fargo (1996)                                  508
Return of the Jedi (1983)                     507
Liar Liar (1997)                              485
                                             ... 
Man from Down Under, The (1943)                 1
Marlene Dietrich: Shadow and Light (1996)       1
Mat' i syn (1997)                               1
Mille bolle blu (1993)                          1
Á köldum klaka (Cold Fever) (1994)              1
Name: rating, Length: 1664, dtype: int64
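
The two listings above interact: a title such as Marlene Dietrich: Shadow and Light (1996) has a perfect 5.0 mean but only a single rating. One possible way to get a more trustworthy ranking is to compute both statistics together and keep only titles with enough ratings. The sketch below uses made-up data, and `min_ratings` is a hypothetical threshold chosen for illustration.

```python
import pandas as pd

# Made-up long-format ratings: title "B" has a perfect mean but only one rating
ratings_demo = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "rating": [4, 5, 3, 5, 2, 4],
})

# Mean rating and number of ratings per title in one pass
stats = ratings_demo.groupby("title")["rating"].agg(
    mean_rating="mean", n_ratings="count"
)

min_ratings = 2  # hypothetical cut-off; tune it for the real dataset
popular = (
    stats[stats["n_ratings"] >= min_ratings]
    .sort_values("mean_rating", ascending=False)
)

print(popular)
```

On the full dataset the same pattern would apply to `df` with a larger cut-off.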

## Recommending Similar Movies

The next step is to create a matrix with user IDs on one axis and movie titles on the other. Each cell holds the rating that a particular user gave to a particular movie.

First, create the pivot table (a matrix of users by movie ratings):

In [16]:
ratings_matrix = df.pivot_table(index=['user_id'],columns=['title'],values='rating').reset_index(drop=True)
ratings_matrix.fillna( 0, inplace = True )

In [17]:
ratings_matrix

| title | 'Til There Was You (1997) | 1-900 (1994) | 101 Dalmatians (1996) | 12 Angry Men (1957) | 187 (1997) | 2 Days in the Valley (1996) | 20,000 Leagues Under the Sea (1954) | 2001: A Space Odyssey (1968) | 3 Ninjas: High Noon At Mega Mountain (1998) | 39 Steps, The (1935) | ... | Yankee Zulu (1994) | Year of the Horse (1997) | You So Crazy (1994) | Young Frankenstein (1974) | Young Guns (1988) | Young Guns II (1990) | Young Poisoner's Handbook, The (1995) | Zeus and Roxanne (1997) | unknown | Á köldum klaka (Cold Fever) (1994) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 2.0 | 5.0 | 0.0 | 0.0 | 3.0 | 4.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 5.0 | 3.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 5.0 | 5.0 | 0.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 5.0 | 3.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


Most cells are 0. These were NaN values before `fillna(0)`, which is expected because most users have not seen most movies.
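
To make the pivot step concrete, here is a minimal sketch on a made-up three-user, two-title dataset (the `long_ratings` and `wide` names are assumptions for illustration). It shows where the NaN values come from and what `fillna(0)` turns them into. Keep in mind that after filling, a 0 means "not rated" rather than a very low rating, and any correlation computed on the filled matrix treats it as a numeric 0.

```python
import pandas as pd

# Made-up long-format ratings: one row per (user, title) pair that exists
long_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "title": ["A", "B", "A", "B"],
    "rating": [5, 3, 4, 2],
})

# One row per user, one column per title; missing pairs become NaN
wide = long_ratings.pivot_table(index="user_id", columns="title", values="rating")
print(wide)

# Zeros now stand in for "not rated"
filled = wide.fillna(0)
print(filled)

# Number of real ratings per title, counted before filling
print(wide.notna().sum())
```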

In [18]:
toy = ratings_matrix['Toy Story (1995)']

In [19]:
toy.head()

0    5.0
1    4.0
2    0.0
3    0.0
4    4.0
Name: Toy Story (1995), dtype: float64

We can then use the `corrwith()` method to compute the correlation between every column of the ratings matrix and the Toy Story ratings Series:

In [21]:
similar_to_toy = ratings_matrix.corrwith(toy)

In [22]:
similar_to_toy

title
'Til There Was You (1997)                0.027505
1-900 (1994)                            -0.051609
101 Dalmatians (1996)                    0.277678
12 Angry Men (1957)                      0.140519
187 (1997)                              -0.004323
                                           ...   
Young Guns II (1990)                     0.128569
Young Poisoner's Handbook, The (1995)   -0.005483
Zeus and Roxanne (1997)                  0.060806
unknown                                  0.071664
Á köldum klaka (Cold Fever) (1994)       0.002251
Length: 1664, dtype: float64
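
As a small illustration of what `corrwith()` returns, the sketch below applies it to a made-up four-user matrix (the data and the `sims` name are assumptions). Each column is correlated with the target Series using Pearson correlation, so the target column itself scores 1.0.

```python
import pandas as pd

# Made-up user-by-title matrix: four users, three titles
demo_matrix = pd.DataFrame({
    "A": [5.0, 4.0, 1.0, 2.0],
    "B": [4.0, 5.0, 2.0, 1.0],   # similar taste profile to A
    "C": [1.0, 2.0, 5.0, 4.0],   # opposite taste profile to A
})

target = demo_matrix["A"]

# Pearson correlation of every column with the target column
sims = demo_matrix.corrwith(target)
print(sims.sort_values(ascending=False))
```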

In [23]:
corr_toy = pd.DataFrame(similar_to_toy,columns=['correlation'])
corr_toy.dropna(inplace=True)

In [26]:
corr_toy.head()

| title | correlation |
| --- | --- |
| 'Til There Was You (1997) | 0.027505 |
| 1-900 (1994) | -0.051609 |
| 101 Dalmatians (1996) | 0.277678 |
| 12 Angry Men (1957) | 0.140519 |
| 187 (1997) | -0.004323 |


In [27]:
corr_toy.sort_values('correlation',ascending=False).head(10)

| title | correlation |
| --- | --- |
| Toy Story (1995) | 1.0 |
| Star Wars (1977) | 0.457677 |
| Independence Day (ID4) (1996) | 0.454544 |
| Rock, The (1996) | 0.431789 |
| Willy Wonka and the Chocolate Factory (1971) | 0.423975 |
| Return of the Jedi (1983) | 0.422991 |
| Mission: Impossible (1996) | 0.41677 |
| Aladdin (1992) | 0.407829 |
| Twister (1996) | 0.404908 |
| Star Trek: First Contact (1996) | 0.391073 |


And we're done! These are the recommendations for someone who has just watched Toy Story (the top entry is Toy Story itself, with a correlation of 1.0).

## Matrix Factorization

We start from the same user-by-title ratings matrix built earlier:

In [28]:
ratings_matrix


In [29]:
# Transpose so that rows are movies and columns are users
X = ratings_matrix.values.T
X.shape

(1664, 943)

### Now let's fit the model

In [42]:
import sklearn
from sklearn.decomposition import TruncatedSVD

SVD = TruncatedSVD(n_components=300, random_state=1)
matrix = SVD.fit_transform(X)
matrix.shape

(1664, 300)

As a rule of thumb, an SVD whose explained variance ratios sum to roughly 90% to 95% is considered good enough.

In [43]:
SVD.explained_variance_ratio_.sum()

0.9063283592501954
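
Rather than guessing `n_components`, one option is to fit with a generous number of components and read the smallest count that reaches the target off the cumulative explained variance. The sketch below does this on a synthetic low-rank matrix; `X_demo`, `target`, and `k` are illustrative names, and the data is random, not MovieLens.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Synthetic rank-3 matrix plus small noise, standing in for the movie-by-user matrix
X_demo = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 20))
X_demo += 0.01 * rng.normal(size=(50, 20))

# Fit with more components than we expect to need
svd = TruncatedSVD(n_components=10, random_state=1)
svd.fit(X_demo)

# Running total of the explained variance ratio, component by component
cumulative = np.cumsum(svd.explained_variance_ratio_)

target = 0.90  # the lower end of the 90% to 95% rule of thumb
reached = cumulative >= target

# Smallest number of components whose cumulative ratio reaches the target,
# or None if even all fitted components fall short
k = int(np.argmax(reached)) + 1 if reached.any() else None

print(cumulative.round(3))
print(k)
```

If `k` comes back as `None`, the fit would need to be repeated with a larger `n_components`.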

In [44]:
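import warnings
# Silence RuntimeWarnings that np.corrcoef can emit (e.g. for zero-variance rows)
warnings.filterwarnings("ignore", category=RuntimeWarning)
# Pearson correlation between every pair of movies (rows of `matrix`)
corr = np.corrcoef(matrix)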
corr.shape

(1664, 1664)
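
`np.corrcoef` treats each row as one variable, which is why a 1664 x 300 input yields a 1664 x 1664 movie-by-movie matrix. The sketch below shows this on a made-up 4 x 3 array (the `embeddings` values are assumptions). It also includes a constant row to show how a zero-variance row produces NaN entries, which is one plausible reason for silencing `RuntimeWarning` above.

```python
import numpy as np

# Made-up 4 x 3 matrix: each row plays the role of one movie's latent vector
embeddings = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],   # same trend as row 0
    [3.0, 2.0, 1.0],   # reversed trend of row 0
    [5.0, 5.0, 5.0],   # constant row, zero variance
])

# Suppress the floating-point warnings locally instead of globally
with np.errstate(invalid="ignore", divide="ignore"):
    corr_demo = np.corrcoef(embeddings)

print(corr_demo.shape)                    # one row and one column per input row
print(corr_demo[0, 1], corr_demo[0, 2])   # row 0 against rows 1 and 2
print(np.isnan(corr_demo[3, 0]))          # the constant row has no defined correlation
```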

### Now let's check the results

In [45]:
movie = ratings_matrix.columns
movie_list = list(movie)

Here are the recommendations for someone who has watched Toy Story, taken as every movie whose correlation with it is at least 0.6:

In [49]:
movie_idx = movie_list.index("Toy Story (1995)")
corr_movie = corr[movie_idx]
list(movie[(corr_movie >= 0.6)])

['Apollo 13 (1995)',
 'Empire Strikes Back, The (1980)',
 'Fargo (1996)',
 'Independence Day (ID4) (1996)',
 'Jerry Maguire (1996)',
 'Mission: Impossible (1996)',
 'Raiders of the Lost Ark (1981)',
 'Return of the Jedi (1983)',
 'Rock, The (1996)',
 'Star Trek: First Contact (1996)',
 'Star Wars (1977)',
 'Toy Story (1995)',
 'Twelve Monkeys (1995)',
 'Twister (1996)',
 'Willy Wonka and the Chocolate Factory (1971)']
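
A fixed threshold such as 0.6 yields lists of very different lengths depending on the movie, and the query title itself always appears in its own list. An alternative is to return the top `n` most correlated titles and drop the query. The helper below is a hypothetical sketch (the `top_n_similar` name, its signature, and the demo matrix are all assumptions); on the real data it would be called with `corr` and `movie_list`.

```python
import numpy as np

def top_n_similar(corr_matrix, titles, title, n=5):
    """Return the n titles most correlated with `title`, excluding itself."""
    idx = titles.index(title)
    scores = np.array(corr_matrix[idx], dtype=float)  # copy, so the input is untouched
    scores[np.isnan(scores)] = -np.inf   # undefined correlations sort last
    scores[idx] = -np.inf                # never recommend the query title
    order = np.argsort(scores)[::-1][:n]
    return [(titles[i], float(scores[i])) for i in order]

# Made-up symmetric correlation matrix for four titles
titles_demo = ["A", "B", "C", "D"]
corr_small = np.array([
    [1.0, 0.2, 0.9, -0.3],
    [0.2, 1.0, 0.1, 0.4],
    [0.9, 0.1, 1.0, 0.0],
    [-0.3, 0.4, 0.0, 1.0],
])

result = top_n_similar(corr_small, titles_demo, "A", n=2)
print(result)
```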

Now let's try another movie:

In [50]:
movie_idx = movie_list.index("Star Wars (1977)")
corr_movie = corr[movie_idx]
list(movie[(corr_movie >= 0.6)])

['Alien (1979)',
 'Aliens (1986)',
 'Apollo 13 (1995)',
 'Back to the Future (1985)',
 'Blade Runner (1982)',
 'Braveheart (1995)',
 'Contact (1997)',
 'Die Hard (1988)',
 'E.T. the Extra-Terrestrial (1982)',
 'Empire Strikes Back, The (1980)',
 'Fargo (1996)',
 'Forrest Gump (1994)',
 'Fugitive, The (1993)',
 'Godfather, The (1972)',
 'Groundhog Day (1993)',
 'Hunt for Red October, The (1990)',
 'Independence Day (ID4) (1996)',
 'Indiana Jones and the Last Crusade (1989)',
 'Jerry Maguire (1996)',
 'Jurassic Park (1993)',
 'Men in Black (1997)',
 'Mission: Impossible (1996)',
 'Monty Python and the Holy Grail (1974)',
 'Princess Bride, The (1987)',
 'Pulp Fiction (1994)',
 'Raiders of the Lost Ark (1981)',
 'Return of the Jedi (1983)',
 'Rock, The (1996)',
 'Silence of the Lambs, The (1991)',
 'Star Trek: First Contact (1996)',
 'Star Trek: The Wrath of Khan (1982)',
 'Star Wars (1977)',
 'Terminator 2: Judgment Day (1991)',
 'Terminator, The (1984)',
 'Toy Story (1995)',
 'Twelve Mon

In [51]:
movie_idx = movie_list.index("Terminator, The (1984)")
corr_movie = corr[movie_idx]
list(movie[(corr_movie >= 0.6)])

['2001: A Space Odyssey (1968)',
 'Alien (1979)',
 'Aliens (1986)',
 'Apollo 13 (1995)',
 'Back to the Future (1985)',
 'Batman (1989)',
 'Blade Runner (1982)',
 'Blues Brothers, The (1980)',
 'Braveheart (1995)',
 'Clear and Present Danger (1994)',
 'Dances with Wolves (1990)',
 'Die Hard (1988)',
 'Die Hard 2 (1990)',
 'Die Hard: With a Vengeance (1995)',
 'E.T. the Extra-Terrestrial (1982)',
 'Empire Strikes Back, The (1980)',
 'Fish Called Wanda, A (1988)',
 'Forrest Gump (1994)',
 'Fugitive, The (1993)',
 'Full Metal Jacket (1987)',
 'Get Shorty (1995)',
 'GoodFellas (1990)',
 'Groundhog Day (1993)',
 'Highlander (1986)',
 'Hunt for Red October, The (1990)',
 'Indiana Jones and the Last Crusade (1989)',
 'Jaws (1975)',
 'Jurassic Park (1993)',
 'Monty Python and the Holy Grail (1974)',
 'Princess Bride, The (1987)',
 'Pulp Fiction (1994)',
 'Raiders of the Lost Ark (1981)',
 'Return of the Jedi (1983)',
 'Seven (Se7en) (1995)',
 'Shawshank Redemption, The (1994)',
 'Silence of the

In [53]:
movie_idx = movie_list.index("Babe (1995)")
corr_movie = corr[movie_idx]
list(movie[(corr_movie >= 0.5)])

['2001: A Space Odyssey (1968)',
 'Aladdin (1992)',
 'Amadeus (1984)',
 'American President, The (1995)',
 'Apollo 13 (1995)',
 'Babe (1995)',
 'Back to the Future (1985)',
 'Beauty and the Beast (1991)',
 'Braveheart (1995)',
 'Clueless (1995)',
 'Dances with Wolves (1990)',
 'Dead Poets Society (1989)',
 'E.T. the Extra-Terrestrial (1982)',
 'Empire Strikes Back, The (1980)',
 'Field of Dreams (1989)',
 'Fish Called Wanda, A (1988)',
 'Forrest Gump (1994)',
 'Four Weddings and a Funeral (1994)',
 'Fugitive, The (1993)',
 'Get Shorty (1995)',
 'Groundhog Day (1993)',
 'Indiana Jones and the Last Crusade (1989)',
 "It's a Wonderful Life (1946)",
 'Jurassic Park (1993)',
 'Lion King, The (1994)',
 'Mary Poppins (1964)',
 'Monty Python and the Holy Grail (1974)',
 'Mrs. Doubtfire (1993)',
 'Princess Bride, The (1987)',
 'Pulp Fiction (1994)',
 'Quiz Show (1994)',
 'Raiders of the Lost Ark (1981)',
 'Raising Arizona (1987)',
 'Return of the Jedi (1983)',
 "Schindler's List (1993)",
 'Shaw

Done!!!