# Movie Recommendation System

### Data files for the project
1) [movie.csv](https://www.kaggle.com/grouplens/movielens-20m-dataset?select=movie.csv)

2) [tag.csv](https://www.kaggle.com/grouplens/movielens-20m-dataset?select=tag.csv)

3) [rating.csv](https://www.kaggle.com/grouplens/movielens-20m-dataset?select=rating.csv)

In [2]:
#Importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
import pickle
import sklearn

In [5]:
#Importing all datasets
movies=pd.read_csv('C:/Projects/Movie Recommendation System/Data/movie.csv')
tags=pd.read_csv('C:/Projects/Movie Recommendation System/Data/tag.csv')
ratings=pd.read_csv('C:/Projects/Movie Recommendation System/Data/rating.csv')

In [6]:
#Checking head of all datasets
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,Mark Waters,2009-04-24 18:19:40
1,65,208,dark hero,2013-05-10 01:41:18
2,65,353,dark hero,2013-05-10 01:41:19
3,65,521,noir thriller,2013-05-10 01:39:43
4,65,592,dark hero,2013-05-10 01:41:18


In [6]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40


**Data Cleaning and preprocessing**

In [7]:
#Dropping time stamps from ratings and tags
ratings.drop(['timestamp'],axis=1,inplace=True)
tags.drop(['timestamp'],axis=1,inplace=True)

In [8]:
#Replacing '|' with a space in the genres column; regex=False treats '|' as a literal character
movies['genres']=movies['genres'].str.replace('|',' ',regex=False)

  


In [9]:
#Checking the number of unique movie IDs
movies['movieId'].nunique()

27278

In [10]:
#Keeping only users who have rated at least 300 movies
ratings_f=ratings.groupby('userId').filter(lambda x:len(x)>=300)
movie_list_rating=ratings_f['movieId'].unique().tolist()
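
`groupby(...).filter(...)` calls the Python lambda once per user, which can be slow on a ratings table of this size. The following is a minimal sketch of an equivalent vectorised filter on a made-up toy frame. The data and the `min_ratings` value of 2 are illustrative assumptions, not the notebook's values.

```python
import pandas as pd

# Toy ratings frame (made-up data for illustration only)
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 3, 3],
    'movieId': [10, 20, 30, 10, 20, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

min_ratings = 2  # hypothetical threshold; the notebook uses 300

# Number of ratings per user, broadcast back onto every row of that user
counts = toy_ratings.groupby('userId')['movieId'].transform('size')

# Keep only rows that belong to users with at least min_ratings ratings
toy_filtered = toy_ratings[counts >= min_ratings]

# The lambda-based form used in the notebook, for comparison
toy_filtered_lambda = toy_ratings.groupby('userId').filter(lambda x: len(x) >= min_ratings)
```

Both forms keep the original row order and index, so the two results can be compared directly with `DataFrame.equals`.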

In [11]:
#Percentage of movies still present after filtering
(ratings_f['movieId'].nunique()/movies['movieId'].nunique())*100

96.66397829752914

In [12]:
#Percentage of users retained after filtering
(ratings_f['userId'].nunique()/ratings['userId'].nunique())*100

11.774602326471374

In [13]:
#Filtering movies data set based on Movie Id
movies=movies[movies['movieId'].isin(movie_list_rating)]

In [14]:
Mapping_file=dict(zip(movies['title'].tolist(),movies['movieId'].tolist()))

In [15]:
#Combining movies and tag Dataframe
mixed=pd.merge(movies,tags,on='movieId',how='left')
mixed.head()

Unnamed: 0,movieId,title,genres,userId,tag
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,1644.0,Watched
1,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,1741.0,computer animation
2,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,1741.0,Disney animated feature
3,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,1741.0,Pixar animation
4,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,1741.0,Téa Leoni does not star in this movie


In [16]:
#Combining all tags of movies together
mixed.fillna('',inplace=True)
mixed=pd.DataFrame(mixed.groupby('movieId')['tag'].apply(lambda x:"%s" % ' '.join(x)))
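
To make the tag-combining step concrete, here is a small sketch on a made-up frame. The names `toy_tags` and `toy_joined` are hypothetical. A missing tag, as produced by the left merge for movies without tags, is first turned into an empty string so that the join never sees `NaN`.

```python
import pandas as pd

# Toy movie/tag pairs; movie 3 has no tag (made-up data)
toy_tags = pd.DataFrame({
    'movieId': [1, 1, 2, 3],
    'tag': ['pixar', 'animation', 'board game', None],
})

# Missing tags become empty strings so that ' '.join only sees strings
toy_tags['tag'] = toy_tags['tag'].fillna('')

# One row per movie: all of its tags joined by a single space
toy_joined = toy_tags.groupby('movieId')['tag'].agg(' '.join).reset_index()
```

Filling only the `tag` column, rather than the whole frame, leaves numeric columns such as `userId` untouched.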

In [17]:
#Combining genres and tags together
Final=pd.merge(movies,mixed,on='movieId',how='left')
Final['metadata']=Final[['genres','tag']].apply(lambda x:' '.join(x),axis=1)
Final.head()

Unnamed: 0,movieId,title,genres,tag,metadata
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,Watched computer animation Disney animated fea...,Adventure Animation Children Comedy Fantasy Wa...
1,2,Jumanji (1995),Adventure Children Fantasy,time travel adapted from:book board game child...,Adventure Children Fantasy time travel adapted...
2,3,Grumpier Old Men (1995),Comedy Romance,old people that is actually funny sequel fever...,Comedy Romance old people that is actually fun...
3,4,Waiting to Exhale (1995),Comedy Drama Romance,chick flick revenge characters chick flick cha...,Comedy Drama Romance chick flick revenge chara...
4,5,Father of the Bride Part II (1995),Comedy,Diane Keaton family sequel Steve Martin weddin...,Comedy Diane Keaton family sequel Steve Martin...


In [18]:
#Checking the shape of the Final dataset
Final.shape

(26368, 5)

In [19]:
#Vectorization of our metadata column
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer(stop_words='english')
tfidf_matrix=tfidf.fit_transform(Final['metadata'])

tfidf_df=pd.DataFrame(tfidf_matrix.toarray(),index=Final.index.tolist())
print(tfidf_df.shape)

(26368, 23647)
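
As a rough illustration of what the vectorizer produces, the sketch below runs `TfidfVectorizer` on three made-up metadata strings. The `toy_` names and the strings are assumptions for illustration only. Each document becomes one row and each distinct token one column. With the default settings every non-empty row is scaled to unit L2 norm.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up metadata strings standing in for Final['metadata']
toy_metadata = [
    'Adventure Animation Children pixar toys',
    'Adventure Children Fantasy board game',
    'Comedy Romance sequel',
]

toy_tfidf = TfidfVectorizer(stop_words='english')
toy_matrix = toy_tfidf.fit_transform(toy_metadata)  # SciPy sparse matrix

# Rows are documents, columns are distinct lower-cased tokens
n_docs, n_terms = toy_matrix.shape

# vocabulary_ maps each token to its column index
pixar_column = toy_tfidf.vocabulary_['pixar']
```

The result is a SciPy sparse matrix that stores only the non-zero weights.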


In [20]:
#Applying SVD to reduce the columns to 200
from sklearn.decomposition import TruncatedSVD
svd=TruncatedSVD(n_components=200)
latent_matrix=svd.fit_transform(tfidf_df)

In [21]:
n=200
latent_matrix_1_df=pd.DataFrame(latent_matrix[:,0:n],index=Final.title.tolist())
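
A dense 26368 x 23647 array of `float64` values takes roughly 5 GB, so converting the TF-IDF matrix with `.toarray()` may not fit in memory on smaller machines. `TruncatedSVD` also accepts SciPy sparse input directly. Below is a minimal sketch of that on random made-up data. The matrix size and the five components are arbitrary assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix standing in for a TF-IDF matrix (made-up data)
rng = np.random.default_rng(0)
dense = rng.random((100, 50))
dense[dense < 0.9] = 0.0            # leave roughly 10% non-zero entries
toy_sparse = csr_matrix(dense)

# Fit directly on the sparse matrix; no dense conversion is needed
toy_svd = TruncatedSVD(n_components=5, random_state=0)
toy_latent = toy_svd.fit_transform(toy_sparse)

# Fraction of the total variance kept by the retained components
retained = toy_svd.explained_variance_ratio_.sum()
```

Checking `explained_variance_ratio_` is one way to judge whether a component count such as 200 keeps enough of the signal.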

In [22]:
ratings_f.head()

Unnamed: 0,userId,movieId,rating
960,11,1,4.5
961,11,10,2.5
962,11,19,3.5
963,11,32,5.0
964,11,39,4.5


In [23]:
#Merging ratings_f and movieId
ratings_f1=pd.merge(movies[['movieId']],ratings_f,on='movieId',how='right')

In [24]:
#Creating a pivot table for collaborative filtering
ratings_f2=ratings_f1.pivot_table(index='movieId',columns='userId',values='rating').fillna(0)
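
The reshaping done by `pivot_table` can be seen on a tiny made-up example (the `toy_` names are hypothetical). Each movie becomes a row, each user a column, and any movie/user pair without a rating is filled with 0.

```python
import pandas as pd

# Long-format ratings: one row per (movie, user) pair (made-up data)
toy_long = pd.DataFrame({
    'movieId': [1, 1, 2, 3],
    'userId':  [11, 24, 11, 24],
    'rating':  [4.5, 4.0, 3.0, 2.0],
})

# Wide format: movies as rows, users as columns, 0 where no rating exists
toy_wide = toy_long.pivot_table(index='movieId', columns='userId', values='rating').fillna(0)
```

Note that after `fillna(0)` a zero means "not rated", not "rated zero", which is worth keeping in mind when interpreting the similarities computed from this matrix.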

In [25]:
#Used for content based filtering
ratings_f1

Unnamed: 0,movieId,userId,rating
0,1,11,4.5
1,10,11,2.5
2,19,11,3.5
3,32,11,5.0
4,39,11,4.5
...,...,...,...
9928187,68954,138493,4.5
9928188,69526,138493,4.5
9928189,69644,138493,3.0
9928190,70286,138493,5.0


In [26]:
ratings_f.userId.nunique()

16307

In [27]:
#Used for collaborative filtering
ratings_f2

movieId \ userId,11,24,54,58,91,96,104,116,131,132,...,138406,138411,138414,138436,138437,138454,138456,138472,138474,138493
1,4.5,4.0,4.0,5.0,4.0,3.5,0.0,3.0,2.0,0.0,...,4.0,5.0,0.0,3.5,4.0,5.0,1.0,3.0,5.0,3.5
2,0.0,0.0,3.0,0.0,3.5,0.0,0.0,2.0,1.0,3.0,...,3.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,4.0,4.0
3,0.0,0.0,0.0,0.0,3.0,4.0,0.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,2.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
131254,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
131256,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
131258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
131260,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [28]:
#Applying SVD to ratings_f2
from sklearn.decomposition import TruncatedSVD
svd=TruncatedSVD(n_components=200)
latent_matrix_2=svd.fit_transform(ratings_f2)
latent_matrix_2_df=pd.DataFrame(latent_matrix_2,index=Final.title.tolist())
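
Labelling the rows with `Final.title.tolist()` assumes that `Final` and `ratings_f2` list the movies in exactly the same order. A sketch that removes this assumption by looking each title up through its `movieId` is shown below, on made-up frames. All `toy_` names and the titles are hypothetical.

```python
import pandas as pd

# Movies deliberately listed in a different order than the pivot table
toy_movies = pd.DataFrame({
    'movieId': [3, 1, 2],
    'title': ['C (1995)', 'A (1995)', 'B (1995)'],
})

# Pivot-style matrix indexed by sorted movieId (made-up ratings)
toy_pivot = pd.DataFrame(
    [[4.0, 0.0], [0.0, 3.0], [5.0, 1.0]],
    index=pd.Index([1, 2, 3], name='movieId'),
    columns=[11, 24],
)

# Look up the title of each pivot row by movieId, not by position
id_to_title = toy_movies.set_index('movieId')['title']
toy_titles = id_to_title.reindex(toy_pivot.index)

# Copy the matrix and relabel its rows with the aligned titles
toy_labelled = toy_pivot.copy()
toy_labelled.index = toy_titles.to_numpy()
```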

In [29]:
latent_matrix_2_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,190,191,192,193,194,195,196,197,198,199
Toy Story (1995),382.188851,13.029184,-40.358686,24.273674,68.447791,69.829067,16.43498,-32.100606,1.480774,11.909373,...,-1.085201,-3.71548,6.437322,1.737904,10.537574,1.328712,-2.732184,-2.454114,8.433241,3.898177
Jumanji (1995),200.45373,45.210932,-58.400621,-13.959314,11.362901,47.081919,8.325662,-19.377823,-33.636647,30.273885,...,-1.086374,4.180515,-4.619468,11.207845,7.429976,2.290622,0.94024,6.95824,9.416351,-1.346856
Grumpier Old Men (1995),88.368562,-12.677233,-44.868782,-29.019624,3.160094,-4.068618,-9.726881,-13.613156,3.770563,33.364314,...,0.734489,-1.521906,-2.577084,-3.328841,-1.872034,-0.505561,7.036442,-1.376763,1.665229,1.252015
Waiting to Exhale (1995),21.918169,-17.092062,-6.422947,-16.739235,5.023366,-4.264192,-6.76948,-16.370571,-11.493518,4.386103,...,-0.770158,-0.65052,0.208153,-0.166949,-0.853833,1.635238,-2.397123,-1.325518,0.903486,-1.498916
Father of the Bride Part II (1995),73.641019,-7.29896,-37.683245,-40.260038,15.340435,3.449107,-4.039195,-16.293266,-7.758173,25.737076,...,-2.263518,-1.95843,-3.78009,0.125907,-0.262076,0.387163,0.804806,0.051535,-1.571198,0.044327


### Taking input from the user about the movie they watched

In [30]:
movie_given=input('Enter the movies watched : ')

Enter the movies watched :  Toy Story (1995)
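
The title typed here is later used as an exact index label, so a small typo or a missing year would raise a `KeyError`. One possible safeguard, sketched with a hypothetical helper `find_titles`, is to search the known titles case-insensitively and let the user pick from the matches.

```python
def find_titles(query, titles, limit=5):
    """Return up to `limit` titles that contain `query`, ignoring case."""
    needle = query.strip().lower()
    if not needle:
        return []
    return [title for title in titles if needle in title.lower()][:limit]

# Made-up list standing in for the titles in the latent matrices
toy_titles = ['Toy Story (1995)', 'Jumanji (1995)', 'Toy Story 2 (1999)']
toy_matches = find_titles('toy story', toy_titles)
```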


In [31]:
#Applying cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
a_1=np.array(latent_matrix_1_df.loc[movie_given]).reshape(1,-1)
a_2=np.array(latent_matrix_2_df.loc[movie_given]).reshape(1,-1)
score_1=cosine_similarity(latent_matrix_1_df,a_1).reshape(-1)
score_2=cosine_similarity(latent_matrix_2_df,a_2).reshape(-1)   
hybrid=(score_1+score_2)/2.0 
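
To show what the scores mean, the sketch below recomputes cosine similarity with plain NumPy on two tiny made-up latent matrices and averages them into a hybrid score. `cosine_scores` is a hypothetical helper written for this illustration. The notebook itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine_scores(matrix, query):
    """Cosine similarity between every row of `matrix` and the 1-D `query`."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # Rows with zero norm get a score of 0 instead of a division by zero
    return np.divide(matrix @ query, norms, out=np.zeros(len(matrix)), where=norms > 0)

# Made-up 2-D latent vectors for three movies; row 0 is the query movie
toy_content = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
toy_collab = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 0.0]])

content_scores = cosine_scores(toy_content, toy_content[0])
collab_scores = cosine_scores(toy_collab, toy_collab[0])

# Equal-weight average of the two views, as in the notebook
hybrid_scores = (content_scores + collab_scores) / 2.0
```

The query movie scores 1.0 against itself in both views. The second movie is similar only by content and the third only by ratings, so the hybrid ranks them in between.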

In [32]:
score_1

array([ 1.        ,  0.14277717,  0.08307063, ...,  0.07299973,
       -0.00111878,  0.06622199])

In [33]:
#Creating a final dataframe 
dict_1={'Content':score_1,'Collaborative':score_2,'Hybrid':hybrid}
similar_movies=pd.DataFrame(dict_1,index=latent_matrix_1_df.index)

### Taking input from user about the type of filtering

In [34]:
print('Content based Recommendation - 1 \nCollaborative based Recommendation - 2\nHybrid based Recommendation - 3')
choice=int(input('Enter your choice : '))

Content based Recommendation - 1 
Collaborative based Recommendation - 2
Hybrid based Recommendation - 3


Enter your choice :  1


In [35]:
mapping={1:'Content',2:'Collaborative',3:'Hybrid'}
sort_type=mapping[choice]
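
`int(input(...))` raises a `ValueError` on non-numeric text, and the dictionary lookup raises a `KeyError` on a number outside 1 to 3. A small sketch of a validating wrapper follows. `parse_choice` and `toy_mapping` are hypothetical names introduced for illustration.

```python
def parse_choice(raw, mapping):
    """Convert the raw text typed by the user into a score column name."""
    try:
        key = int(raw.strip())
    except ValueError:
        raise ValueError(f'not a number: {raw!r}') from None
    if key not in mapping:
        raise ValueError(f'choice must be one of {sorted(mapping)}')
    return mapping[key]

toy_mapping = {1: 'Content', 2: 'Collaborative', 3: 'Hybrid'}
toy_sort_type = parse_choice(' 3 ', toy_mapping)
```

Both failure modes then surface as a single `ValueError` with a readable message, which is easier to catch in a retry loop.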

In [36]:
#Sorting the dataframe in descending order of the chosen score
similar_movies.sort_values(sort_type,ascending=False,inplace=True)
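
Skipping the first row after sorting assumes that the query movie always ranks first. A sketch of an alternative that removes the query by label and leaves the original frame unsorted is given below. `top_similar` and the toy scores are hypothetical.

```python
import pandas as pd

def top_similar(scores, query_title, column, k=20):
    """Return the k rows with the highest `column` score, excluding the query title."""
    if column not in scores.columns:
        raise ValueError(f'unknown score column: {column!r}')
    # Drop the query movie by label rather than by position
    others = scores.drop(index=query_title, errors='ignore')
    return others.nlargest(k, column)

# Made-up scores for three movies
toy_scores = pd.DataFrame(
    {'Content': [1.0, 0.9, 0.2], 'Collaborative': [1.0, 0.4, 0.8], 'Hybrid': [1.0, 0.65, 0.5]},
    index=['Toy Story (1995)', 'Toy Story 2 (1999)', 'Heat (1995)'],
)
toy_top = top_similar(toy_scores, 'Toy Story (1995)', 'Collaborative', k=2)
```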

In [37]:
#Showing the first 20 similar movies
print('Movies the user is expected to Like (based on his filtering choice): ')
print(similar_movies[1:].head(20))

Movies the user is expected to Like (based on his filtering choice): 
                                          Content  Collaborative    Hybrid
Toy Story 2 (1999)                       0.961404       0.931317  0.946360
Bug's Life, A (1998)                     0.906896       0.899005  0.902950
Ratatouille (2007)                       0.898726       0.606812  0.752769
Monsters, Inc. (2001)                    0.883125       0.851515  0.867320
Ice Age (2002)                           0.870921       0.715849  0.793385
Finding Nemo (2003)                      0.870415       0.819746  0.845080
Toy Story 3 (2010)                       0.866037       0.526972  0.696505
Incredibles, The (2004)                  0.788992       0.777376  0.783184
Monsters University (2013)               0.780310       0.309095  0.544703
Up (2009)                                0.743946       0.557677  0.650812
Antz (1998)                              0.735723       0.778833  0.757278
Cars (2006)                   