# Popular Recommender System Algorithms

Recommendation systems are a collection of algorithms that recommend items to users based on information gathered about those users. These systems have become ubiquitous and are commonly seen in online stores, movie databases, and job finders. In this kernel, we will explore different types of recommendation systems and implement them.

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import math
%matplotlib inline

import os
# print(os.listdir("../input"))

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Data Preprocessing

In [None]:
#Storing the movie information into a pandas dataframe
movies = pd.read_csv('/content/drive/MyDrive/movie data/movie.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
# Using regular expressions to find a year stored between parentheses
# We specify the parentheses so we don't conflict with movies that have years in their titles
movies['year'] = (movies.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
                              .str.extract(r'(\d\d\d\d)', expand=False))  # Removing the parentheses

# Removing the years from the 'title' column
# regex=True is passed explicitly because pandas 2.0 changed the default to regex=False
# str.strip gets rid of any trailing whitespace that may have been left behind
movies['title'] = (movies.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
                               .str.strip())

# Every genre is separated by a | so we simply have to call the split function on |
movies['genres'] = movies.genres.str.split('|')
movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995


In [None]:
movies.info() # Check for null elements

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  27278 non-null  int64 
 1   title    27278 non-null  object
 2   genres   27278 non-null  object
 3   year     27256 non-null  object
dtypes: int64(1), object(3)
memory usage: 852.6+ KB


Before we move on, let's reduce the memory footprint of the ratings as we read them, which speeds up further processing. The data description tells us that the userId values range from 1 to 138,493, so int32, whose range (-2,147,483,648 to +2,147,483,647) comfortably covers them, is sufficient. The same applies to movieId. The rating column can be read as float32 (pandas doesn't support np.float16 for most of its operations, so we have to stick with float32). Because memory usage is still high, we can shrink the data further by multiplying the rating column by 2, so that every rating becomes a whole number, and then converting it to np.int8.
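The effect of these dtype choices can be checked on a tiny frame. The following is a minimal sketch on made-up ratings (the `toy` frame and its values are assumptions for illustration, not part of the dataset); it compares per-column memory before and after the conversion described above.

```python
import numpy as np
import pandas as pd

# Made-up ratings on a half-step scale; not taken from the real dataset.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [2, 29, 2, 50],
    "rating": [3.5, 4.0, 0.5, 5.0],
})

# Memory of the columns with pandas' default dtypes (index excluded).
default_bytes = toy.memory_usage(index=False).sum()

# Narrow the id columns to int32.
compact = toy.astype({"userId": np.int32, "movieId": np.int32})
# Doubling turns half-step ratings into whole numbers that fit in int8.
compact["rating"] = (compact["rating"] * 2).astype(np.int8)

# 4 + 4 + 1 = 9 bytes per row after conversion.
compact_bytes = compact.memory_usage(index=False).sum()

print(default_bytes, compact_bytes)
print(compact.dtypes)
```

Nine bytes per row is consistent with the 171.7 MB that `ratings.info()` reports for roughly 20 million rows once the rating column has been converted.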

In [None]:
# Storing the user information into a pandas dataframe
ratings = pd.read_csv('/content/drive/MyDrive/movie data/rating.csv', usecols=['userId', 'movieId', 'rating'],
                     dtype={'userId':np.int32, 'movieId':np.int32, 'rating':np.float32})
ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,2,3.5
1,1,29,3.5
2,1,32,3.5
3,1,47,3.5
4,1,50,3.5


In [None]:
ratings['rating'] = ratings['rating'] * 2
ratings['rating'] = ratings['rating'].astype(np.int8)
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000263 entries, 0 to 20000262
Data columns (total 3 columns):
 #   Column   Dtype
---  ------   -----
 0   userId   int32
 1   movieId  int32
 2   rating   int8 
dtypes: int32(2), int8(1)
memory usage: 171.7 MB


In [None]:
ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,2,7
1,1,29,7
2,1,32,7
3,1,47,7
4,1,50,7


## Popularity Based Recommenders

In this part, we find the most popular movies and recommend them to users. This can be useful for new users who don't yet know much about the movies.

In [None]:
most_voted = (ratings.groupby('movieId')[['rating']]
                     .count()
                     .sort_values('rating', ascending=False)
                     .reset_index())
most_voted = pd.merge(most_voted, movies, on='movieId').drop('rating', axis=1)
most_voted.head()

Unnamed: 0,movieId,title,genres,year
0,296,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994
1,356,Forrest Gump,"[Comedy, Drama, Romance, War]",1994
2,318,"Shawshank Redemption, The","[Crime, Drama]",1994
3,593,"Silence of the Lambs, The","[Crime, Horror, Thriller]",1991
4,480,Jurassic Park,"[Action, Adventure, Sci-Fi, Thriller]",1993


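Our result shows that:
- Pulp Fiction
- Forrest Gump
- The Shawshank Redemption
- The Silence of the Lambs
- Jurassic Park

are the most-rated movies in the dataset. So, based on this method, we could suggest them to newcomers. A complementary approach is to rank movies by their average rating, restricted to movies with at least n = 1000 ratings so that a handful of votes cannot dominate the ranking.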

In [None]:
n = 1000

avg_vote = ((ratings.groupby('movieId')[['rating']]
                     .sum()/ratings.groupby('movieId')[['rating']]
                     .count()))

avg_vote_n = avg_vote[ratings.groupby('movieId')[['rating']]
                      .count()['rating']>=n]

avg_vote_n = pd.merge(avg_vote_n.sort_values('rating', ascending=False)
                      .reset_index(), movies, on='movieId').drop('rating', axis=1)
avg_vote_n.reset_index(drop=True, inplace=True)
avg_vote_n.head()

Unnamed: 0,movieId,title,genres,year
0,318,"Shawshank Redemption, The","[Crime, Drama]",1994
1,858,"Godfather, The","[Crime, Drama]",1972
2,50,"Usual Suspects, The","[Crime, Mystery, Thriller]",1995
3,527,Schindler's List,"[Drama, War]",1993
4,1221,"Godfather: Part II, The","[Crime, Drama]",1974


### Collaborative Recommender Systems



<!-- User Based Collaborative Filtering -->

<!-- User Based Collaborative Filtering

Collaborative filtering makes recommendations by combining your own experience with the experiences of other people.
First, we build a user vs. item matrix.
Each row is a user and each column is an item such as a movie, product, or website.
Second, we compute similarity scores between users.
Each row is a user, and each row is treated as a vector.
We compute the similarity of these rows (users).
Third, we find users who are similar to you based on past behaviour.
Finally, the system suggests items that you have not experienced before.
Let's work through an example of user based collaborative filtering.

Suppose there are two people.
The first one watched two movies: The Lord of the Rings and The Hobbit.
The second one watched only The Lord of the Rings.
User based collaborative filtering computes the similarity of these two people and sees that both watched The Lord of the Rings.
It then recommends The Hobbit to the second person, as can be seen in the picture * -->

<!-- User based collaborative filtering has some problems

In this system, each row of the matrix is a user. Therefore, comparing users and finding similarities between them is computationally expensive.
Also, people's habits can change, so making correct and useful recommendations can become harder over time.
To address these problems, let's look at another recommender system: item based collaborative filtering. -->

### Item Based Collaborative Filtering

The next type of recommendation system to look at is the correlation-based recommender. These recommenders offer a basic form of collaborative filtering: items are recommended based on similarities in their user ratings, so they do take user preferences into account. In these systems, you use Pearson's r correlation to recommend the item that is most similar to an item the user has already chosen. In other words, you recommend an item whose ratings correlate with those of an item the user has already chosen.
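To make the idea concrete, here is a minimal sketch on a made-up user-by-movie matrix (the movie labels "A", "B", "C" and all ratings are hypothetical). It computes Pearson's r between one movie's rating column and every column with `DataFrame.corrwith`.

```python
import pandas as pd

# Hypothetical user x movie rating matrix (rows: users, columns: movies).
toy_pivot = pd.DataFrame(
    {
        "A": [5, 4, 1, 2],
        "B": [4, 5, 2, 1],
        "C": [1, 2, 5, 4],
    },
    index=["u1", "u2", "u3", "u4"],
)

# Pearson's r between the ratings of movie "A" and the ratings of each movie.
similarity = toy_pivot.corrwith(toy_pivot["A"]).sort_values(ascending=False)
print(similarity)
```

In this toy matrix, "B" is rated much like "A" (r = 0.8), while "C" is rated in the opposite way (r = -1.0), so a user who chose "A" would be pointed to "B" first.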

In [None]:
movies.columns

Index(['movieId', 'title', 'genres', 'year'], dtype='object')

In [None]:
# we only need the movie id and title
movie = movies.loc[:,["movieId","title"]]
movie.head(10)

Unnamed: 0,movieId,title
0,1,Toy Story
1,2,Jumanji
2,3,Grumpier Old Men
3,4,Waiting to Exhale
4,5,Father of the Bride Part II
5,6,Heat
6,7,Sabrina
7,8,Tom and Huck
8,9,Sudden Death
9,10,GoldenEye


In [None]:
# we only need the user id, movie id and rating
rating = ratings.loc[:,["userId","movieId","rating"]]
rating.head(10)

Unnamed: 0,userId,movieId,rating
0,1,2,7
1,1,29,7
2,1,32,7
3,1,47,7
4,1,50,7
5,1,112,7
6,1,151,8
7,1,223,8
8,1,253,8
9,1,260,8


In [None]:
# then merge the movie and rating data on their shared movieId column
data = pd.merge(movie, rating, on="movieId")

In [None]:
# now let's look at our data
data.head(10)

Unnamed: 0,movieId,title,userId,rating
0,1,Toy Story,3,8
1,1,Toy Story,6,10
2,1,Toy Story,8,8
3,1,Toy Story,10,8
4,1,Toy Story,11,9
5,1,Toy Story,12,8
6,1,Toy Story,13,8
7,1,Toy Story,14,9
8,1,Toy Story,16,6
9,1,Toy Story,19,10


In [None]:
data.shape

(20000263, 4)

As the data frame above shows, we have 4 features: movie id, title, user id, and rating.
Using this data frame, we will build an item based recommendation system.
Let's look at the shape of the data. The data frame has about 20 million samples, which is too many: a pivot table of that size can cause memory problems on Kaggle and also in desktop IDEs such as Spyder or PyCharm.
Therefore, to learn item based recommendation, let's use a sample of 100,000 ratings.

In [None]:
# pivot_table on the full data throws a MemoryError,
# so for this part we work with a sample of the ratings.
sample_ratings = ratings.sample(n=100000, random_state=20)

# Creating our user-by-movie matrix (mostly zeros); NAs are filled with 0 so the values can later be stored as int8.
pivot = pd.pivot_table(sample_ratings, values='rating', index='userId', columns='movieId', fill_value=0)
pivot.head()

movieId,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,...,114935,115071,115170,115235,115569,115617,115664,115971,116401,116797,116799,116823,116977,117123,117176,117434,117444,117490,117511,117851,118200,118482,118546,118696,118898,118942,118952,118972,118997,120474,120610,120819,121235,123947,125916,126420,127622,128151,129659,130490
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
12,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
pivot = pivot.astype(np.int8)
pivot.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52242 entries, 1 to 138493
Columns: 8433 entries, 1 to 130490
dtypes: int8(8433)
memory usage: 420.5 MB
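The reported memory usage can be reproduced with simple arithmetic. This is a rough sketch that assumes pandas reports "MB" in units of 2**20 bytes and that at most 100,000 cells (one per sampled rating) are non-zero; the shape is taken from the `pivot.info()` output above.

```python
n_users, n_movies = 52_242, 8_433  # shape reported by pivot.info()
bytes_per_cell = 1                 # one int8 value per cell

# A dense matrix stores every cell, rated or not.
dense_bytes = n_users * n_movies * bytes_per_cell
dense_mib = dense_bytes / 2**20

# Upper bound on non-zero cells: one per sampled rating.
max_nonzero = 100_000
fill_fraction = max_nonzero / (n_users * n_movies)

print(f"dense size: {dense_mib:.1f} MiB")
print(f"non-zero cells: at most {fill_fraction:.4%}")
```

The cell data alone comes to about 420.1 MiB, and the 64-bit row index accounts for most of the small remainder. Almost all of that memory holds zeros, which is why the full 20 million ratings cannot be pivoted this way.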


In [None]:
# # lets make a pivot table in order to make rows are users and columns are movies. And values are rating
# pivot_table = pd.pivot_table(data1,index = "userId",columns = "movieId",values = "rating",fill_value=0)
# pivot_table.shape

In [None]:
# # lets make a pivot table in order to make rows are users and columns are movies. And values are rating
# pivot_table = data.pivot_table(index = ["userId"],columns = ["movieId"],values = "rating",fill_value=0)
# pivot_table.head(10)

In [None]:
movie_watched = pivot[50]  # movieId 50 is "Usual Suspects, The"
# Pearson correlation between the ratings of movieId 50 and the ratings of every other movie
similarity_with_other_movies = pivot.corrwith(movie_watched, drop=True).to_frame(name='PearsonR')
# similarity_with_other_movies = similarity_with_other_movies.sort_values('PearsonR',ascending=False)
similarity_with_other_movies.head()

Unnamed: 0_level_0,PearsonR
movieId,Unnamed: 1_level_1
1,0.000923
2,0.002701
3,-0.001973
4,-0.000864
5,-0.002013


In [None]:
rating_count = (ratings.groupby('movieId')[['rating']]
                       .count()
                       .sort_values('rating', ascending=False)
                       .reset_index())
rating_count = pd.merge(rating_count, movies, on='movieId')
rating_count.head()

Unnamed: 0,movieId,rating,title,genres,year
0,296,67310,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994
1,356,66172,Forrest Gump,"[Comedy, Drama, Romance, War]",1994
2,318,63366,"Shawshank Redemption, The","[Crime, Drama]",1994
3,593,63299,"Silence of the Lambs, The","[Crime, Horror, Thriller]",1991
4,480,59715,Jurassic Park,"[Action, Adventure, Sci-Fi, Thriller]",1993


But let's think about this for a minute. If we found some movies that correlated really well with The Usual Suspects but had only, say, ten ratings in total, then those movies probably wouldn't really be all that similar to it. Maybe those movies got similar ratings, but a correlation computed from so few ratings wouldn't be significant. So, in addition to how well a movie's ratings correlate with those of the chosen movie, we also need to take stock of how popular each movie is. To do that, we will join our correlation data frame with the rating-count data frame.
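The filtering step described above can be sketched on made-up numbers. The movie ids, correlations, and counts below are hypothetical; the point is that the counts are keyed by movieId before the join, and that a highly correlated but rarely rated movie is dropped.

```python
import pandas as pd

# Hypothetical correlations with a chosen movie, indexed by movieId.
corr = pd.DataFrame(
    {"PearsonR": [0.90, 0.40, 0.35]},
    index=pd.Index([10, 20, 30], name="movieId"),
)

# Hypothetical rating counts, stored with movieId as an ordinary column.
counts = pd.DataFrame({"movieId": [10, 20, 30], "rating": [8, 1200, 950]})

min_ratings = 500  # popularity threshold

# join() aligns on the index, so the counts must be indexed by movieId too.
joined = corr.join(counts.set_index("movieId")["rating"])

# Keep only sufficiently popular movies, best correlation first.
popular = (joined[joined["rating"] >= min_ratings]
           .sort_values(["PearsonR", "rating"], ascending=[False, False]))
print(popular)
```

Movie 10 has the highest correlation but only 8 ratings, so it is filtered out and movies 20 and 30 remain.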

In [None]:
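# rating_count went through reset_index(), so key it by movieId before joining;
# otherwise rows are aligned by rank position instead of by movie.
similar_sum = similarity_with_other_movies.join(rating_count.set_index('movieId')['rating'])
similar_top10 = similar_sum[similar_sum['rating']>=500].sort_values(['PearsonR', 'rating'],
                                                            ascending=[False, False]).head(11)
# Add movie names (the first row is the chosen movie itself, so it is skipped)
similar_top10 = pd.merge(similar_top10[1:11], movie[['title', 'movieId']], on='movieId')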
similar_top10

Unnamed: 0,movieId,PearsonR,rating,title
0,1896,0.077938,2224.0,Cousin Bette
1,3492,0.053275,834.0,"Son of the Sheik, The"
2,3350,0.047593,901.0,"Raisin in the Sun, A"
3,366,0.046885,11975.0,Wes Craven's New Nightmare (Nightmare on Elm S...
4,2902,0.045475,1152.0,Psycho II
5,3299,0.042488,924.0,Hanging Up
6,3548,0.041289,805.0,Auntie Mame
7,4111,0.033645,604.0,Gardens of Stone
8,1713,0.032848,2601.0,Mouse Hunt
9,1605,0.031811,2847.0,Excess Baggage


### User Based

In [None]:
#Storing the movie information into a pandas dataframe
movies = pd.read_csv('/content/drive/MyDrive/movie data/movie.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
# Storing the user information into a pandas dataframe
ratings = pd.read_csv('/content/drive/MyDrive/movie data/rating.csv', usecols=['userId', 'movieId', 'rating'],
                     dtype={'userId':np.int32, 'movieId':np.int32, 'rating':np.float32})
ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,2,3.5
1,1,29,3.5
2,1,32,3.5
3,1,47,3.5
4,1,50,3.5


In [None]:
df = movies.merge(ratings, how='left', on='movieId')
df.head()

Unnamed: 0,movieId,title,genres,userId,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.0,4.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,6.0,5.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8.0,4.0
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,10.0,4.0
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11.0,4.5


In [None]:
# number of ratings per title
comment_counts = pd.DataFrame(df["title"].value_counts())
comment_counts

Unnamed: 0,title
Pulp Fiction (1994),67310
Forrest Gump (1994),66172
"Shawshank Redemption, The (1994)",63366
"Silence of the Lambs, The (1991)",63299
Jurassic Park (1993),59715
...,...
Century of the Dragon (Long zai bian yuan) (1999),1
"Dangerous Man, A (2009)",1
El chocolate del loro (2004),1
We Cause Scenes (2014),1


In [None]:
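# movies rarely rated (1000 ratings or fewer)
# the counts column is selected by position: it is named "title" in older pandas and "count" in pandas >= 2.0
rare_movies = comment_counts[comment_counts.iloc[:, 0] <= 1000].index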

In [None]:
# exclusion of movies rarely rated 
common_movies = df[~df["title"].isin(rare_movies)]
common_movies.shape
# check number of common movies
common_movies["title"].nunique()

3159

In [None]:
# creating pivot table consisting of so called common movies             
user_movie_df = common_movies.pivot_table(index=["userId"], columns=["title"], values="rating")
#user_movie_df.shape
user_movie_df.head(10)

title,"'burbs, The (1989)",(500) Days of Summer (2009),*batteries not included (1987),...And Justice for All (1979),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),102 Dalmatians (2000),12 Angry Men (1957),12 Years a Slave (2013),127 Hours (2010),13 Going on 30 (2004),"13th Warrior, The (1999)",1408 (2007),15 Minutes (2001),16 Blocks (2006),17 Again (2009),1984 (Nineteen Eighty-Four) (1984),2 Days in the Valley (1996),"2 Fast 2 Furious (Fast and the Furious 2, The) (2003)","20,000 Leagues Under the Sea (1954)",200 Cigarettes (1999),2001: A Space Odyssey (1968),2010: The Year We Make Contact (1984),2012 (2009),2046 (2004),21 (2008),21 Grams (2003),21 Jump Street (2012),24 Hour Party People (2002),25th Hour (2002),27 Dresses (2008),28 Days (2000),28 Days Later (2002),28 Weeks Later (2007),3 Ninjas (1992),3-Iron (Bin-jip) (2004),30 Days of Night (2007),300 (2007),...,"World According to Garp, The (1982)","World Is Not Enough, The (1999)","World's Fastest Indian, The (2005)",Wreck-It Ralph (2012),"Wrestler, The (2008)",Wristcutters: A Love Story (2006),Wyatt Earp (1994),"X-Files: Fight the Future, The (1998)",X-Men (2000),X-Men Origins: Wolverine (2009),X-Men: Days of Future Past (2014),X-Men: First Class (2011),X-Men: The Last Stand (2006),X2: X-Men United (2003),"Year of Living Dangerously, The (1982)",Yellow Submarine (1968),Yes Man (2008),Yojimbo (1961),You Can Count on Me (2000),You Don't Mess with the Zohan (2008),You Only Live Twice (1967),You've Got Mail (1998),"You, Me and Dupree (2006)",Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Young Sherlock Holmes (1985),Zack and Miri Make a Porno (2008),Zelig (1983),Zero Dark Thirty (2012),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),Zulu (1964),[REC] (2007),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
1.0,,,,,,,,,,,,,,,,,,,,,,,,3.5,,,,,,,,,,,3.5,,,,,,...,,,,,,,,,,,,,,4.0,,,,3.0,,,,,,4.0,,,,,,,,,,,,,,,,
2.0,,,,,,,,,,,,,,,,,,,,,,,,5.0,,,,,,,,,,3.0,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3.0,,,,,,,,,,,,,,,,,,,5.0,,,,,5.0,4.0,,,,,,,,,,,,,,,,...,,,,,,,,5.0,,,,,,,,3.0,,,,,,,,5.0,,,,,,,,,,,,,,,,
4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
7.0,,,,,,,,,,,,,,,,,,,,,,,,3.0,3.0,,,,,,,,,,,,,,,,...,,,,,,,,,4.0,,,,,,,,,,,,,3.0,,,,,,,,,,,,,,,,,,2.0
8.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
9.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
10.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [None]:
random_user=28941.0

In [None]:
# picking a random user for user based recommendation
# .iloc[0] extracts the single sampled id without converting a whole array to int
random_user = int(pd.Series(user_movie_df.index).sample(1).iloc[0])

In [None]:
# selecting the movies that the randomly picked user watched
random_user_df = user_movie_df[user_movie_df.index == random_user]
random_user_df.head()

title,"'burbs, The (1989)",(500) Days of Summer (2009),*batteries not included (1987),...And Justice for All (1979),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),102 Dalmatians (2000),12 Angry Men (1957),12 Years a Slave (2013),127 Hours (2010),13 Going on 30 (2004),"13th Warrior, The (1999)",1408 (2007),15 Minutes (2001),16 Blocks (2006),17 Again (2009),1984 (Nineteen Eighty-Four) (1984),2 Days in the Valley (1996),"2 Fast 2 Furious (Fast and the Furious 2, The) (2003)","20,000 Leagues Under the Sea (1954)",200 Cigarettes (1999),2001: A Space Odyssey (1968),2010: The Year We Make Contact (1984),2012 (2009),2046 (2004),21 (2008),21 Grams (2003),21 Jump Street (2012),24 Hour Party People (2002),25th Hour (2002),27 Dresses (2008),28 Days (2000),28 Days Later (2002),28 Weeks Later (2007),3 Ninjas (1992),3-Iron (Bin-jip) (2004),30 Days of Night (2007),300 (2007),...,"World According to Garp, The (1982)","World Is Not Enough, The (1999)","World's Fastest Indian, The (2005)",Wreck-It Ralph (2012),"Wrestler, The (2008)",Wristcutters: A Love Story (2006),Wyatt Earp (1994),"X-Files: Fight the Future, The (1998)",X-Men (2000),X-Men Origins: Wolverine (2009),X-Men: Days of Future Past (2014),X-Men: First Class (2011),X-Men: The Last Stand (2006),X2: X-Men United (2003),"Year of Living Dangerously, The (1982)",Yellow Submarine (1968),Yes Man (2008),Yojimbo (1961),You Can Count on Me (2000),You Don't Mess with the Zohan (2008),You Only Live Twice (1967),You've Got Mail (1998),"You, Me and Dupree (2006)",Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Young Sherlock Holmes (1985),Zack and Miri Make a Porno (2008),Zelig (1983),Zero Dark Thirty (2012),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),Zulu (1964),[REC] (2007),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
61207.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,3.5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [None]:
#moving them to a list 
movies_watched = random_user_df.columns[random_user_df.notna().any()].tolist() 
movies_watched

['Alien (1979)',
 'American Beauty (1999)',
 'Armageddon (1998)',
 'Austin Powers: International Man of Mystery (1997)',
 'Battle Royale (Batoru rowaiaru) (2000)',
 'Big (1988)',
 'Blade Runner (1982)',
 'Blues Brothers, The (1980)',
 'Bridge on the River Kwai, The (1957)',
 'Butch Cassidy and the Sundance Kid (1969)',
 'Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)',
 'Demolition Man (1993)',
 'Fistful of Dollars, A (Per un pugno di dollari) (1964)',
 'For a Few Dollars More (Per qualche dollaro in più) (1965)',
 'Full Metal Jacket (1987)',
 'Good, the Bad and the Ugly, The (Buono, il brutto, il cattivo, Il) (1966)',
 'Great Escape, The (1963)',
 'Happy Gilmore (1996)',
 'Hard-Boiled (Lat sau san taam) (1992)',
 'Hunt for Red October, The (1990)',
 'Insider, The (1999)',
 'Kill Bill: Vol. 1 (2003)',
 'Kill Bill: Vol. 2 (2004)',
 'Killer, The (Die xue shuang xiong) (1989)',
 'Kingdom of Heaven (2005)',
 'Lord of the Rings: The Return of the King, The (2003)',
 'Léon: The Prof

In [None]:
# ratings from all users, restricted to the movies that the random user watched
movies_watched_df = user_movie_df[movies_watched]
movies_watched_df.head()

title,Alien (1979),American Beauty (1999),Armageddon (1998),Austin Powers: International Man of Mystery (1997),Battle Royale (Batoru rowaiaru) (2000),Big (1988),Blade Runner (1982),"Blues Brothers, The (1980)","Bridge on the River Kwai, The (1957)",Butch Cassidy and the Sundance Kid (1969),"Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)",Demolition Man (1993),"Fistful of Dollars, A (Per un pugno di dollari) (1964)",For a Few Dollars More (Per qualche dollaro in più) (1965),Full Metal Jacket (1987),"Good, the Bad and the Ugly, The (Buono, il brutto, il cattivo, Il) (1966)","Great Escape, The (1963)",Happy Gilmore (1996),Hard-Boiled (Lat sau san taam) (1992),"Hunt for Red October, The (1990)","Insider, The (1999)",Kill Bill: Vol. 1 (2003),Kill Bill: Vol. 2 (2004),"Killer, The (Die xue shuang xiong) (1989)",Kingdom of Heaven (2005),"Lord of the Rings: The Return of the King, The (2003)",Léon: The Professional (a.k.a. The Professional) (Léon) (1994),"Machinist, The (Maquinista, El) (2004)",Memento (2000),Million Dollar Baby (2004),Monty Python Live at the Hollywood Bowl (1982),Monty Python and the Holy Grail (1975),Monty Python's Life of Brian (1979),Monty Python's The Meaning of Life (1983),Old Boy (2003),Papillon (1973),"Passion of the Christ, The (2004)",Pulp Fiction (1994),Reservoir Dogs (1992),Ringu (Ring) (1998),"Shining, The (1980)",Sin City (2005),Star Trek II: The Wrath of Khan (1982),Star Wars: Episode III - Revenge of the Sith (2005),"Thing, The (1982)",Top Gun (1986),"Usual Suspects, The (1995)",Who Framed Roger Rabbit? (1988),X-Men (2000)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1
1.0,4.0,,,,,,4.0,,,3.0,4.0,,,,3.5,3.0,3.5,,3.5,,,,4.0,,,5.0,4.0,,3.5,,,3.5,3.5,3.5,,,,4.0,3.5,3.5,4.0,,4.0,,4.0,,3.5,,
2.0,5.0,3.0,,,,,5.0,,,,,,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3.0,5.0,,4.0,,,4.0,5.0,5.0,,5.0,,3.0,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,5.0,,5.0,,5.0,,5.0,,5.0,,
4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
5.0,,,,,,,,,,,,,,,,,,2.0,,,,,,,,,,,,,,5.0,5.0,,,,,,,,,,,,,,,,


In [None]:
movies_watched_df.T.notnull().sum().reset_index()

Unnamed: 0,userId,0
0,1.0,22
1,2.0,4
2,3.0,13
3,4.0,0
4,5.0,3
...,...,...
138488,138489.0,2
138489,138490.0,3
138490,138491.0,1
138491,138492.0,9


In [None]:
# for each user, count how many of the random user's movies they have also rated
user_movie_count = movies_watched_df.T.notnull().sum()
user_movie_count = user_movie_count.reset_index()
user_movie_count.columns = ["userId", "movie_count"]
user_movie_count.head(10)
# excluding the user who watched less than 20 movies to get similar pattern with random user
# user_movie_count[user_movie_count["movie_count"] > 20].sort_values("movie_count", ascending=False)
# users who watched same amount of movies with random user
# user_movie_count[user_movie_count["movie_count"] == 33].count() # just 17

Unnamed: 0,userId,movie_count
0,1.0,22
1,2.0,4
2,3.0,13
3,4.0,0
4,5.0,3
5,6.0,0
6,7.0,6
7,8.0,2
8,9.0,0
9,10.0,5


In [None]:
# select the users who rated more than 60% of the movies the random user watched, so correlations rest on enough shared movies
perc = len(movies_watched) * 60 / 100
users_same_movies = user_movie_count[user_movie_count["movie_count"] > perc]["userId"]
len(users_same_movies)

2833
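
The overlap filter above can be tried on a small table first. The following is a minimal sketch on made-up data; `toy_watched`, `toy_counts`, and `toy_overlap_users` are hypothetical names that do not appear elsewhere in this notebook. It counts ratings row-wise with `sum(axis=1)`, which gives the same counts as the transpose used above.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, already restricted to the target user's 5 movies
toy_watched = pd.DataFrame(
    {"A": [5.0, np.nan, 4.0, np.nan],
     "B": [3.0, np.nan, 4.5, 2.0],
     "C": [4.0, np.nan, 5.0, 1.0],
     "D": [4.0, 2.0, 3.5, np.nan],
     "E": [np.nan, np.nan, 4.0, np.nan]},
    index=pd.Index([10, 11, 12, 13], name="userId"),
)

# Number of the target user's movies that each user has rated
toy_counts = toy_watched.notnull().sum(axis=1)

# Keep users who rated more than 60% of those movies
threshold = toy_watched.shape[1] * 60 / 100
toy_overlap_users = toy_counts[toy_counts > threshold].index.tolist()

print(toy_counts.to_dict())
print(toy_overlap_users)
```

Note that the filter yields user IDs (the index labels of `toy_counts`), which is what a later `isin` lookup against a `userId` index needs.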

In [None]:
# filter on the userId values; users_same_movies.index holds row positions, not user IDs
movies_watched_df[movies_watched_df.index.isin(users_same_movies)]

title,Alien (1979),American Beauty (1999),Armageddon (1998),Austin Powers: International Man of Mystery (1997),Battle Royale (Batoru rowaiaru) (2000),Big (1988),Blade Runner (1982),"Blues Brothers, The (1980)","Bridge on the River Kwai, The (1957)",Butch Cassidy and the Sundance Kid (1969),"Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)",Demolition Man (1993),"Fistful of Dollars, A (Per un pugno di dollari) (1964)",For a Few Dollars More (Per qualche dollaro in più) (1965),Full Metal Jacket (1987),"Good, the Bad and the Ugly, The (Buono, il brutto, il cattivo, Il) (1966)","Great Escape, The (1963)",Happy Gilmore (1996),Hard-Boiled (Lat sau san taam) (1992),"Hunt for Red October, The (1990)","Insider, The (1999)",Kill Bill: Vol. 1 (2003),Kill Bill: Vol. 2 (2004),"Killer, The (Die xue shuang xiong) (1989)",Kingdom of Heaven (2005),"Lord of the Rings: The Return of the King, The (2003)",Léon: The Professional (a.k.a. The Professional) (Léon) (1994),"Machinist, The (Maquinista, El) (2004)",Memento (2000),Million Dollar Baby (2004),Monty Python Live at the Hollywood Bowl (1982),Monty Python and the Holy Grail (1975),Monty Python's Life of Brian (1979),Monty Python's The Meaning of Life (1983),Old Boy (2003),Papillon (1973),"Passion of the Christ, The (2004)",Pulp Fiction (1994),Reservoir Dogs (1992),Ringu (Ring) (1998),"Shining, The (1980)",Sin City (2005),Star Trek II: The Wrath of Khan (1982),Star Wars: Episode III - Revenge of the Sith (2005),"Thing, The (1982)",Top Gun (1986),"Usual Suspects, The (1995)",Who Framed Roger Rabbit? (1988),X-Men (2000)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1
57.0,,,,2.5,,,,,4.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4.0,,,3.0
90.0,3.5,5.0,3.0,4.5,,,,,,,,,,,,,,3.0,,,,4.0,4.5,,,,,,,,,,,,,,,3.5,,,,,,3.5,,,,5.0,4.0
293.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
358.0,3.0,,,,,,4.0,3.0,,,,2.0,,,,,,,,,,,,,,,,,,,,4.0,4.0,,,,,3.0,3.0,,,,,,,,4.0,,
366.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138306.0,,5.0,,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
138324.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
138405.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.0,,,,,,,,,5.0,,
138410.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [None]:
# ratings of the overlapping users plus the random user, restricted to the random user's movies
# (filter on the userId values; users_same_movies.index holds row positions, not user IDs)
final_df = pd.concat([movies_watched_df[movies_watched_df.index.isin(users_same_movies)],
                      random_user_df[movies_watched]])
final_df

title,Alien (1979),American Beauty (1999),Armageddon (1998),Austin Powers: International Man of Mystery (1997),Battle Royale (Batoru rowaiaru) (2000),Big (1988),Blade Runner (1982),"Blues Brothers, The (1980)","Bridge on the River Kwai, The (1957)",Butch Cassidy and the Sundance Kid (1969),"Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)",Demolition Man (1993),"Fistful of Dollars, A (Per un pugno di dollari) (1964)",For a Few Dollars More (Per qualche dollaro in più) (1965),Full Metal Jacket (1987),"Good, the Bad and the Ugly, The (Buono, il brutto, il cattivo, Il) (1966)","Great Escape, The (1963)",Happy Gilmore (1996),Hard-Boiled (Lat sau san taam) (1992),"Hunt for Red October, The (1990)","Insider, The (1999)",Kill Bill: Vol. 1 (2003),Kill Bill: Vol. 2 (2004),"Killer, The (Die xue shuang xiong) (1989)",Kingdom of Heaven (2005),"Lord of the Rings: The Return of the King, The (2003)",Léon: The Professional (a.k.a. The Professional) (Léon) (1994),"Machinist, The (Maquinista, El) (2004)",Memento (2000),Million Dollar Baby (2004),Monty Python Live at the Hollywood Bowl (1982),Monty Python and the Holy Grail (1975),Monty Python's Life of Brian (1979),Monty Python's The Meaning of Life (1983),Old Boy (2003),Papillon (1973),"Passion of the Christ, The (2004)",Pulp Fiction (1994),Reservoir Dogs (1992),Ringu (Ring) (1998),"Shining, The (1980)",Sin City (2005),Star Trek II: The Wrath of Khan (1982),Star Wars: Episode III - Revenge of the Sith (2005),"Thing, The (1982)",Top Gun (1986),"Usual Suspects, The (1995)",Who Framed Roger Rabbit? (1988),X-Men (2000)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1
57.0,,,,2.5,,,,,4.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4.0,,,3.0
90.0,3.5,5.0,3.0,4.5,,,,,,,,,,,,,,3.0,,,,4.0,4.5,,,,,,,,,,,,,,,3.5,,,,,,3.5,,,,5.0,4.0
293.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
358.0,3.0,,,,,,4.0,3.0,,,,2.0,,,,,,,,,,,,,,,,,,,,4.0,4.0,,,,,3.0,3.0,,,,,,,,4.0,,
366.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138324.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
138405.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.0,,,,,,,,,5.0,,
138410.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
138436.0,4.5,3.0,,,,3.5,,4.5,,,4.0,,,,,2.0,,,,,,2.5,4.0,,,3.5,3.5,4.5,5.0,,,4.5,1.5,3.5,3.5,,,5.0,5.0,,,,4.5,,,,5.0,,4.5


In [None]:
# Pearson correlation between every pair of users (the transpose puts users in columns)
corr_df = final_df.T.corr().unstack().sort_values().drop_duplicates()
corr_df = pd.DataFrame(corr_df, columns=["corr"])
corr_df.index.names = ['user_id_1', 'user_id_2']
corr_df = corr_df.reset_index()
corr_df

Unnamed: 0,user_id_1,user_id_2,corr
0,116646.0,41443.0,-1.0
1,27052.0,107255.0,-1.0
2,121051.0,34102.0,-1.0
3,22549.0,35779.0,-1.0
4,2396.0,116094.0,-1.0
...,...,...,...
132623,132184.0,61990.0,1.0
132624,128761.0,35779.0,1.0
132625,36525.0,118415.0,1.0
132626,122533.0,32299.0,1.0
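
To make the correlation step concrete, here is a small sketch on made-up ratings. The names `toy`, `user_corr`, and `pairs` are hypothetical. It differs from the cell above in two ways, both of which are assumptions rather than the notebook's method: it uses `min_periods` so that a pair is scored only when enough movies are co-rated, and it keeps one row per pair by taking the upper triangle of the matrix instead of calling `drop_duplicates()` on the correlation values, which can also discard distinct pairs that happen to share the same value.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie ratings
toy = pd.DataFrame(
    {"m1": [5.0, 4.0, 1.0],
     "m2": [4.0, 5.0, 2.0],
     "m3": [1.0, 2.0, 5.0],
     "m4": [np.nan, 1.0, 4.0]},
    index=pd.Index([1, 2, 3], name="userId"),
)

# DataFrame.corr correlates columns, so transpose to put users in columns.
# min_periods=3 requires at least 3 co-rated movies per pair.
user_corr = toy.T.corr(method="pearson", min_periods=3)

# Keep the strict upper triangle: one row per unordered pair, no self-pairs
mask = np.triu(np.ones(user_corr.shape, dtype=bool), k=1)
pairs = user_corr.where(mask).stack().dropna().rename("corr")
pairs.index.names = ["user_id_1", "user_id_2"]
pairs = pairs.reset_index()

print(pairs)
```

In this toy data, users 1 and 3 have exactly opposite tastes on their three shared movies, so their correlation is -1.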


In [None]:
# select users whose correlation with the random user is at least 0.65
top_users = corr_df[(corr_df["user_id_1"] == random_user) & (corr_df["corr"] >= 0.65)][
    ["user_id_2", "corr"]].reset_index(drop=True)
top_users = top_users.sort_values(by='corr', ascending=False)
top_users.rename(columns={"user_id_2": "userId"}, inplace=True)
top_users

Unnamed: 0,userId,corr
26,77089.0,0.883883
25,52501.0,0.87114
24,3032.0,0.856897
23,41214.0,0.835881
22,45492.0,0.813113
21,46280.0,0.796107
20,81337.0,0.793969
19,96749.0,0.768408
18,90614.0,0.762586
17,13344.0,0.758071


In [None]:
# ratings given by the users most similar to the random user
top_users_ratings = top_users.merge(ratings[["userId", "movieId", "rating"]], how='inner')
top_users_ratings.head()

Unnamed: 0,userId,corr,movieId,rating
0,77089.0,0.883883,50,5.0
1,77089.0,0.883883,158,1.0
2,77089.0,0.883883,163,3.0
3,77089.0,0.883883,172,3.0
4,77089.0,0.883883,186,4.0


In [None]:
# combine rating and correlation: weight each rating by the rater's correlation with the random user
top_users_ratings['weighted_rating'] = top_users_ratings['corr'] * top_users_ratings['rating']
# average the weighted ratings per movie
recommendation_df = top_users_ratings.groupby('movieId').agg({"weighted_rating": "mean"})
recommendation_df = recommendation_df.reset_index()
recommendation_df.head()

Unnamed: 0,movieId,weighted_rating
0,1,2.569328
1,2,2.339958
2,3,1.857263
3,5,1.697224
4,6,2.891553
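
The score above is the mean of `corr * rating`, so it is scaled down by the correlations and no longer sits on the original rating scale. One possible variant, shown here as a sketch on made-up data and not as the method used in this notebook, is a normalised weighted average, sum(corr * rating) / sum(corr), which stays within the range of the ratings. The names `toy_neighbors`, `mean_of_products`, and `weighted_avg` are hypothetical.

```python
import pandas as pd

# Hypothetical neighbour ratings with each neighbour's correlation to the target user
toy_neighbors = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "corr":    [0.9, 0.9, 0.7, 0.7, 0.8],
    "movieId": [10, 20, 10, 20, 10],
    "rating":  [5.0, 3.0, 4.0, 2.0, 3.0],
})
toy_neighbors["weighted_rating"] = toy_neighbors["corr"] * toy_neighbors["rating"]

# Mean of corr * rating, as computed in the cell above
mean_of_products = toy_neighbors.groupby("movieId")["weighted_rating"].mean()

# Normalised weighted average: sum(corr * rating) / sum(corr)
sums = toy_neighbors.groupby("movieId")[["weighted_rating", "corr"]].sum()
weighted_avg = sums["weighted_rating"] / sums["corr"]

print(pd.DataFrame({"mean_of_products": mean_of_products,
                    "weighted_avg": weighted_avg}))
```

With the normalised version, a fixed cut-off such as "greater than 4" can be read directly as a rating threshold.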


In [None]:
# 5 movies to recommend (user-based)
movies_to_be_recommend = recommendation_df[recommendation_df["weighted_rating"] > 4].sort_values("weighted_rating", ascending=False)
movies_to_be_recommend = movies_to_be_recommend.merge(movies[["movieId", "title"]])["title"]
movies_to_be_recommend.head(5)

0                        Magnificent Seven, The (1960)
1             Man Who Shot Liberty Valance, The (1962)
2                    Flight of the Phoenix, The (1965)
3    Maltese Falcon, The (a.k.a. Dangerous Female) ...
4                                       Rebecca (1940)
Name: title, dtype: object

https://www.kaggle.com/sankha1998/content-based-movie-reommendation-system

## Model-based Collaborative Filtering Systems
## SVD Matrix Factorization

With these systems, you build a model from user ratings and then make recommendations from that model. This offers speed and scalability that are not available when every prediction requires going back to the entire dataset. The central data structure here is the utility matrix.

The utility matrix is also known as the user-item matrix. It has one row per user and one column per item, and each cell holds the rating that user gave to that item. Utility matrices are sparse because no user rates every item; in practice, most users rate only a few items, so most cells are null. Before turning to the truncated version, let's look at the regular singular value decomposition (SVD).

SVD is a linear algebra method that decomposes a utility matrix into three matrices. It is useful for building a model-based recommender because, once truncated, these matrices are much smaller than the original data, and we can make recommendations from them without referring back to the complete dataset. With SVD, we uncover latent variables: inferred variables that influence the behavior of a dataset but are not directly observable. Now let's look at the anatomy of SVD.

Utility Matrix = U x S x V^T

The decomposition produces three matrices: U, S, and V. U is the left orthogonal matrix; it holds the important, non-redundant information about users. V is the right orthogonal matrix (it enters the product transposed); it holds the important, non-redundant information about items. In the middle is S, a diagonal matrix whose entries are the singular values, sorted from largest to smallest. Each singular value measures how much of the structure in the data its latent variable accounts for, which is what allows us to keep only the largest ones.
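
The factorisation described above can be checked numerically with NumPy. This is a minimal sketch on a small, fully filled toy matrix; the matrix `A` and the rank `k` are made-up values for illustration. It reconstructs the matrix from all three factors and then from only the `k` largest singular values.

```python
import numpy as np

# Hypothetical utility matrix: 4 users x 3 items, no missing values
A = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0],
              [2.0, 1.0, 4.0]])

# s holds the singular values (the diagonal of S); Vt is V transposed
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Using all singular values reproduces A
A_full = U @ np.diag(s) @ Vt

# Keeping only the k largest singular values gives a rank-k approximation
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
err = np.linalg.norm(A - A_k)  # Frobenius norm of the approximation error

print("singular values:", np.round(s, 3))
print("rank-2 error:", round(float(err), 3))
```

Here the approximation error equals the single discarded singular value, which shows why dropping small singular values loses little information.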

We want to use the similarities between users to decide which movies to recommend, so we use truncated SVD to compress all of the user ratings down to a small number of latent variables. These variables capture most of the information that was previously stored in the user columns, and they represent a generalized view of users' tastes and preferences. First, we transpose our matrix so that movies are represented by rows and users by columns. Then we use SVD to compress it. All of the individual movies are retained along the rows, but the users are compressed into a number of synthetic components, which we choose, that represent a generalized view of users' tastes.

In [None]:
pivot

In [None]:
from sklearn.decomposition import TruncatedSVD

X = pivot.T
SVD = TruncatedSVD(n_components=500, random_state=20)
SVD_matrix = SVD.fit_transform(X)

Let's see how much of the variance in the data these 500 components explain.

In [None]:
SVD.explained_variance_ratio_.sum()

The 500 components explain about 52% of the variance in the data.
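
The number of components is a choice, and the cumulative explained variance is one way to guide it. The sketch below uses a synthetic matrix (three hidden factors plus a little noise) rather than the movie data, and the 90% target is an arbitrary assumption; `toy_X`, `toy_svd`, `cum_var`, and `n_needed` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Synthetic "movies x users" matrix driven by 3 hidden factors plus small noise
toy_X = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 40))
toy_X = toy_X + 0.1 * rng.normal(size=(60, 40))

toy_svd = TruncatedSVD(n_components=10, random_state=20)
toy_svd.fit(toy_X)

# Running total of the variance explained by the first 1, 2, ... components
cum_var = np.cumsum(toy_svd.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 90%
n_needed = int(np.searchsorted(cum_var, 0.90) + 1)

print(np.round(cum_var, 3))
print("components needed for 90%:", n_needed)
```

On real rating data the curve rises far more slowly than in this synthetic case, which is consistent with 500 components explaining only about half of the variance.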

### Generating a Correlation Matrix

In [None]:
# Calculate the Pearson correlation coefficient for every pair of movies
# in the reduced matrix. Each row holds one movie's latent features, so the
# correlation reflects how similarly users rated the two movies.

corr_mat = np.corrcoef(SVD_matrix)
corr_mat.shape

### Isolating One Movie From the Correlation Matrix

Let's stick with our earlier choice, Pulp Fiction.

In [None]:
rand_movie = 296
corr_pulp_fiction = corr_mat[rand_movie]

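# Recommending highly correlated movies.
# Results will differ from the earlier approach because SVD compression is lossy.
# Note: corr_mat is indexed by row position in X, so row 296 is assumed to be Pulp Fiction.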
idx = X[(corr_pulp_fiction < 1.0) & (corr_pulp_fiction > 0.5)].index
movies.loc[idx+1, 'title']
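
Because `np.corrcoef` returns a plain array, lookups in it are by row position. One way to avoid mixing up positions and movie IDs is to wrap the result in a labelled DataFrame. The following is a sketch with made-up latent features; the numbers are hypothetical and only the titles are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical latent features: one row per movie, one column per component
toy_latent = pd.DataFrame(
    [[2.0, 0.1, 0.3, 1.0],
     [1.8, 0.2, 0.4, 1.1],
     [0.1, 2.0, 1.5, 0.2],
     [0.2, 1.7, 1.6, 0.1]],
    index=["Pulp Fiction", "Reservoir Dogs", "Toy Story", "Jumanji"],
)

# Movie-by-movie correlation matrix with titles on both axes
toy_corr = pd.DataFrame(np.corrcoef(toy_latent.values),
                        index=toy_latent.index, columns=toy_latent.index)

# Look up by label, drop the movie itself, and keep strong positive matches
similar = toy_corr["Pulp Fiction"].drop("Pulp Fiction").sort_values(ascending=False)
top = similar[similar > 0.5]

print(top)
```

Dropping the query movie by label also replaces the `< 1.0` filter, which would otherwise exclude any other movie whose correlation is exactly 1.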

### Content-Based Movie Recommendation System

In [None]:
#Storing the movie information into a pandas dataframe
movies = pd.read_csv('/content/drive/MyDrive/movie data/movie.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
# Storing the user information into a pandas dataframe
ratings = pd.read_csv('/content/drive/MyDrive/movie data/rating.csv', usecols=['userId', 'movieId', 'rating'],
                     dtype={'userId':np.int32, 'movieId':np.int32, 'rating':np.float32})
ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,2,3.5
1,1,29,3.5
2,1,32,3.5
3,1,47,3.5
4,1,50,3.5


In [None]:
# merge the movie and rating tables on movieId to get one table suitable for analysis
movie_details=movies.merge(rating,on='movieId')
movie_details.head()

Unnamed: 0,movieId,title,genres,userId,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3,8
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,6,10
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8,8
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,10,8
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11,9


In [None]:
total_ratings=movie_details.groupby(['movieId','genres'])['rating'].sum().reset_index()
total_ratings.head()

Unnamed: 0,movieId,genres,rating
0,1,Adventure|Animation|Children|Comedy|Fantasy,389732.0
1,2,Adventure|Children|Fantasy,142888.0
2,3,Comedy|Romance,80257.0
3,4,Comedy|Drama|Romance,15772.0
4,5,Comedy,74537.0


In [None]:
df=movie_details.copy()
df.drop_duplicates(['title','genres'],inplace=True) 
df.head()

Unnamed: 0,movieId,title,genres,userId,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3,8
49695,2,Jumanji (1995),Adventure|Children|Fantasy,1,7
71938,3,Grumpier Old Men (1995),Comedy|Romance,2,8
84673,4,Waiting to Exhale (1995),Comedy|Drama|Romance,41,4
87429,5,Father of the Bride Part II (1995),Comedy,12,4


In [None]:
df=df.merge(total_ratings,on='movieId')
df.drop(columns=['userId','rating_x','genres_y'],inplace=True)
df.rename(columns={'genres_x':'genres','rating_y':'rating'},inplace=True)
df.head()

Unnamed: 0,movieId,title,genres,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,389732.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,142888.0
2,3,Grumpier Old Men (1995),Comedy|Romance,80257.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,15772.0
4,5,Father of the Bride Part II (1995),Comedy,74537.0


In [None]:
df['rating']=df['rating'].astype(int)

In [None]:
df.dtypes

movieId     int64
title      object
genres     object
rating      int64
dtype: object

In [None]:
df = df[df['rating']>100]

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfv = TfidfVectorizer(analyzer='word',stop_words='english',ngram_range=(1, 2),min_df=1)
x = tfv.fit_transform(df['genres'])

In [None]:
from sklearn.metrics.pairwise import sigmoid_kernel
model = sigmoid_kernel(x, x)
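
With `analyzer='word'` and the default token pattern, the vectorizer splits on punctuation, so `Sci-Fi` becomes the two tokens `sci` and `fi`, and `ngram_range=(1, 2)` adds bigrams of neighbouring genres. A possible alternative, shown here as a sketch and not as the notebook's method, treats each genre as a single token by splitting on `|` and compares movies with cosine similarity. The last row of `toy_movies` is a made-up movie, and all variable names are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_movies = pd.DataFrame({
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men", "Example Sci-Fi Film"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy",
               "Comedy|Romance",
               "Action|Sci-Fi|Thriller"],  # hypothetical entry
})

# One token per genre; token_pattern=None because a custom tokenizer is supplied
vec = TfidfVectorizer(tokenizer=lambda s: s.split("|"),
                      lowercase=False, token_pattern=None)
genre_tfidf = vec.fit_transform(toy_movies["genres"])
vocab = sorted(vec.vocabulary_)

# Cosine similarity between every pair of movies
sim = cosine_similarity(genre_tfidf)

print(vocab)
print(pd.DataFrame(sim, index=toy_movies["title"], columns=toy_movies["title"]).round(2))
```

Movies that share no genre get a similarity of exactly 0 with this setup, which makes the scores easier to interpret than sigmoid-kernel values.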

In [None]:
df1=df.copy()
ti=[]
for i in df1['title']:
    ti.append(i.split(' (')[0])
df1['title']=ti

In [None]:
df1

Unnamed: 0,movieId,title,genres,rating
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,389732
1,2,Jumanji,Adventure|Children|Fantasy,142888
2,3,Grumpier Old Men,Comedy|Romance,80257
3,4,Waiting to Exhale,Comedy|Drama|Romance,15772
4,5,Father of the Bride Part II,Comedy,74537
...,...,...,...,...
26178,128488,Wild Card,Crime|Drama|Thriller,129
26368,129354,Focus,Comedy|Crime|Drama|Romance,192
26393,129428,The Second Best Exotic Marigold Hotel,Comedy|Drama,150
26530,130073,Cinderella,Adventure|Children|Drama|Sci-Fi,143


In [None]:
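# Reverse map from movie title to row position in df1 (positions, not index
# labels, because the similarity matrix is positional and df1's index has gaps)
indices = pd.Series(np.arange(len(df1)), index=df1['title'])
indices = indices[~indices.index.duplicated(keep='first')]  # first match per title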
indices.head()

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
dtype: int64

In [None]:
# Function that takes a movie title and returns the 10 most similar movies
# (cosine_sim defaults to the sigmoid-kernel matrix computed above)
def get_recommendations(title, cosine_sim=model):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df['title'].iloc[movie_indices]

In [None]:
get_recommendations('Jumanji')

55                         Kids of the Round Table (1995)
59                     Indian in the Cupboard, The (1995)
124                     NeverEnding Story III, The (1994)
990                       Escape to Witch Mountain (1975)
1959            Darby O'Gill and the Little People (1959)
2009                                  Return to Oz (1985)
2077                        NeverEnding Story, The (1984)
2078    NeverEnding Story II: The Next Chapter, The (1...
2314                        Santa Claus: The Movie (1985)
4800    Harry Potter and the Sorcerer's Stone (a.k.a. ...
Name: title, dtype: object

https://www.kaggle.com/sankha1998/content-based-movie-reommendation-system

## Conclusions

### Advantages and Disadvantages of Content-Based Filtering

##### Advantages
* Learns user's preferences
* Highly personalized for the user

##### Disadvantages
* Doesn't take into account what others think of the item, so low-quality items may be recommended
* Extracting useful features from items is not always straightforward
* Determining which characteristics of an item the user likes or dislikes is not always obvious

### Advantages and Disadvantages of Collaborative Filtering

##### Advantages
* Takes other users' ratings into consideration
* Doesn't need to study or extract information from the recommended item
* Adapts to the user's interests which might change over time

##### Disadvantages
* The approximation function can be slow
* There might be too few users to approximate from
* Privacy issues when trying to learn the user's preferences
