# Recommender System
---
A web-based movie recommendation engine. The recommendations are content-based: 
if a person likes a particular movie, related movies are recommended by comparing 
tags built from the dataset (overview, genres, keywords, cast and director).

## Working on Dataset
First, we are going to download the required dataset. In this project we 
are using the TMDB 5000 Movie Dataset (metadata on ~5,000 movies). There 
are two CSV files:
> → tmdb_5000_credits.csv: This file contains movie id, title, cast and crew.

> → tmdb_5000_movies.csv: This file contains 20 columns with almost all data regarding every movie.

First, we’ll import the libraries which we are going to use in this project.

In [44]:
import pandas as pd
import numpy as np
import ast
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.stem.porter import PorterStemmer
import pickle

Importing both the datasets which we have downloaded.

In [45]:
movies = pd.read_csv('./dataset/tmdb_5000_movies.csv')
credits = pd.read_csv('./dataset/tmdb_5000_credits.csv')

In [46]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [47]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


Combine both datasets on the title column so that we can access all the 
data from a single data frame.

In [48]:
movies = movies.merge(credits, on='title')

In [49]:
movies.isnull().sum()

budget                     0
genres                     0
homepage                3096
id                         0
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title                      0
vote_average               0
vote_count                 0
movie_id                   0
cast                       0
crew                       0
dtype: int64

In [50]:
movies.duplicated().sum()

0

### Important Data for Recommender System
Now we’ll select the columns for our recommender system. In this project we 
mainly use the 7 most important columns out of the 23 in the merged data frame.

In [51]:
movies = movies[['movie_id','title','overview','genres','keywords','cast','crew']]

### Check for Null Values
Check for null values in the dataset.

In [52]:
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

Since only 3 rows have a null value (all in the overview column), which is very 
few, we’ll drop these three rows.

In [53]:
movies.dropna(inplace=True)

In [54]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


### Data Cleaning 
Now, since genres and keywords are stored as strings representing lists of 
dictionaries with id and name, we’ll convert each into a list of names only. The 
ids inside these dictionaries are not needed, because each row is already 
identified by the movie_id column.

In [55]:
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [56]:
def convert(obj):
    L = []
    for i in ast.literal_eval(obj):
        L.append(i['name'])
    return L

In [57]:
movies['genres'] = movies['genres'].apply(convert)
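
To make the conversion step concrete, here is a minimal standalone sketch of what convert does to a single cell. The sample string is a shortened copy of the genres value shown above; the variable names are only for illustration.

```python
import ast

# Shortened sample of one cell from the genres column
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval parses the string into Python objects (here a list of dicts)
# without executing arbitrary code
parsed = ast.literal_eval(raw)

# Keep only the "name" field of each dictionary
names = [entry["name"] for entry in parsed]
print(names)  # ['Action', 'Adventure']
```

The same parsing applies to the keywords column, since it has the same id/name structure.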

In [58]:
movies.iloc[0].keywords

'[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}, {"id": 3388, "name": "space colony"}, {"id": 3679, "name": "society"}, {"id": 3801, "name": "space travel"}, {"id": 9685, "name": "futuristic"}, {"id": 9840, "name": "romance"}, {"id": 9882, "name": "space"}, {"id": 9951, "name": "alien"}, {"id": 10148, "name": "tribe"}, {"id": 10158, "name": "alien planet"}, {"id": 10987, "name": "cgi"}, {"id": 11399, "name": "marine"}, {"id": 13065, "name": "soldier"}, {"id": 14643, "name": "battle"}, {"id": 14720, "name": "love affair"}, {"id": 165431, "name": "anti war"}, {"id": 193554, "name": "power relations"}, {"id": 206690, "name": "mind and soul"}, {"id": 209714, "name": "3d"}]'

In [59]:
movies['keywords'] = movies['keywords'].apply(convert)

In [60]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


Similarly, the cast column is a list of dictionaries that contains a lot of 
data, but we only need the cast names. We’ll do the same thing we did with 
genres and keywords, with one difference: since a movie has many cast 
members, we’ll keep only the first three cast names for the recommendation.

In [61]:
def convert3(obj):
    L = []
    counter = 0
    for i in ast.literal_eval(obj):
        if counter != 3:
            L.append(i['name'])
            counter+=1
        else:
            break
    return L 

In [62]:
movies['cast'] = movies['cast'].apply(convert3)
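
As an aside, the same "first three names" logic can be sketched with list slicing instead of a manual counter. The helper name first_names and the sample data are hypothetical and not part of the notebook.

```python
import ast

def first_names(obj, limit=3):
    """Return up to `limit` names from a stringified list of dicts (hypothetical helper)."""
    return [entry["name"] for entry in ast.literal_eval(obj)[:limit]]

# Made-up sample with four cast entries
sample = '[{"name": "A"}, {"name": "B"}, {"name": "C"}, {"name": "D"}]'
print(first_names(sample))  # ['A', 'B', 'C']
```

Slicing also handles movies with fewer than three cast entries, because a slice never raises when the list is shorter than the limit.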

In [63]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


The crew column contains data for all departments, such as editing, camera, 
sound, production, art and many more, each with different jobs, so we’ll 
filter out the director’s name only.

In [64]:
def fetch_director(obj):
    L = []
    for i in ast.literal_eval(obj):
        if i['job']=='Director':
            L.append(i['name'])
            break
    return L   

In [65]:
movies['crew'] = movies['crew'].apply(fetch_director)
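
A compact way to express "take the first crew member whose job is Director" is a generator with next. This is only a sketch of an equivalent formulation; first_director and the sample string are hypothetical.

```python
import ast

def first_director(obj):
    """Hypothetical variant of fetch_director: first crew member whose job is 'Director'."""
    crew = ast.literal_eval(obj)
    name = next((m["name"] for m in crew if m.get("job") == "Director"), None)
    # Return a list so the column can later be concatenated with the other tag lists
    return [name] if name is not None else []

sample = ('[{"job": "Editor", "name": "X"}, '
          '{"job": "Director", "name": "Y"}, '
          '{"job": "Director", "name": "Z"}]')
print(first_director(sample))  # ['Y']
```

Like the original function, it returns an empty list when no director is listed.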

In [66]:
movies['overview'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

The overview column contains a description of every movie as a sentence or 
paragraph. So that it can be combined with the other list-valued columns, we 
will split each overview into a list of words.


In [67]:
movies['overview'] = movies['overview'].apply(lambda x:x.split())

Now, we’ll remove the spaces inside the names in genres, keywords, cast and 
crew, so that each multi-word name is treated as a single token later on. 
E.g. Joe Russo -> JoeRusso

In [68]:
def collapse(L):
    L1 = []
    for i in L:
        L1.append(i.replace(" ",""))
    return L1

movies['genres']=movies['genres'].apply(collapse)
movies['keywords']=movies['keywords'].apply(collapse)
movies['cast']=movies['cast'].apply(collapse)
movies['crew']=movies['crew'].apply(collapse)
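
The reason for collapsing names can be illustrated with a small standalone sketch: without it, two different people who share a first name would contribute a common token. The tokens helper below is hypothetical and only mimics whitespace tokenization.

```python
def tokens(names, collapse):
    """Split names into lowercase word tokens, optionally removing inner spaces first."""
    out = []
    for name in names:
        if collapse:
            name = name.replace(" ", "")
        out.extend(name.lower().split())
    return set(out)

movie_a = ["Sam Worthington"]
movie_b = ["Sam Mendes"]

# Without collapsing, the two movies appear to share the token "sam"
shared_raw = tokens(movie_a, False) & tokens(movie_b, False)
# After collapsing, each full name is one token and nothing is shared
shared_collapsed = tokens(movie_a, True) & tokens(movie_b, True)
print(shared_raw, shared_collapsed)  # {'sam'} set()
```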

In [69]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan]
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]


Creating a new column named tags, which concatenates the lists from overview, 
genres, keywords, cast and crew.

In [70]:
movies['tags'] = movies['overview']+movies['genres']+movies['keywords']+movies['cast']+movies['crew']

### Creating DataFrame 
Now, we’ll create a data frame with only movie id, title and tags from the data 
we have filtered and manipulated so far, in order to do further operations.

In [71]:
new_df = movies[['movie_id','title','tags']]

Now, we’ll join each list of tags into a single space-separated string and convert it to lowercase.

In [72]:
new_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."


In [73]:
new_df['tags'] = new_df['tags'].apply(lambda x:" ".join(x))

Note: pandas emits a SettingWithCopyWarning here (“A value is trying to be set 
on a copy of a slice from a DataFrame”) because new_df was selected from movies. 
Calling .copy() on that column selection when creating new_df avoids the warning.


In [74]:
new_df['tags'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'

In [75]:
new_df['tags'] = new_df['tags'].apply(lambda x:x.lower())



In [76]:
new_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."


In [77]:
new_df['tags'][0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron'

### Using CountVectorizer to Extract Features from Text 
Now, we’ll transform the text data into vectors based on the frequency 
(count) of each word. The vocabulary is limited to the 5,000 most frequent 
words, and English stop words are ignored.

In [78]:
cv = CountVectorizer(max_features=5000, stop_words='english')

Now, we’ll fit the vectorizer on the tags column and transform it into an 
array of raw word counts, one row per movie (the counts are not normalised).

In [79]:
vectors = cv.fit_transform(new_df['tags']).toarray()
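
The counting idea behind this step can be sketched in plain Python. This is not scikit-learn’s implementation: CountVectorizer also handles tokenization, stop-word removal and the max_features cap. The two-document corpus and the names vocab, count_vector and vectors_demo are made up for illustration.

```python
from collections import Counter

# Tiny made-up corpus standing in for the tags column
docs = ["action space alien space", "action crime drama"]

# Vocabulary: every distinct word across the corpus, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors_demo = [count_vector(doc) for doc in docs]
print(vocab)         # ['action', 'alien', 'crime', 'drama', 'space']
print(vectors_demo)  # [[1, 1, 0, 0, 2], [1, 0, 1, 1, 0]]
```

Each row has one entry per vocabulary word, which is why most entries of a real movie vector are 0.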

In [80]:
vectors[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

### Stemming words with NLTK 
Now we’ll use NLTK’s PorterStemmer to apply stemming, which reduces inflected 
forms of a word to a common stem. For example, “retrieval”, “retrieved” and 
“retrieves” all reduce to the same stem (“retriev” with the Porter algorithm; 
the stem need not be a dictionary word), so they are counted as one feature.

In [81]:
ps = PorterStemmer()

In [82]:
def stem(text):
    y=[]
    for i in text.split():
        y.append(ps.stem(i))
        
    return " ".join(y)

In [83]:
new_df['tags'] = new_df['tags'].apply(stem)



Again, we’ll transform the text data into vectors based on the frequency 
(count) of each word, this time fitting the vectorizer on the stemmed tags 
so that the counts reflect the stems.

In [84]:
cv = CountVectorizer(max_features=5000, stop_words='english')

In [85]:
vectors = cv.fit_transform(new_df['tags']).toarray()

In [86]:
# len(cv.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

## Using Cosine Similarity to identify similarity among vectors 
Cosine similarity measures the angle between two vectors, each of which 
represents the features of one movie in the dataset, and so determines 
whether the two vectors point in roughly the same direction. A small angle 
means a high degree of similarity; a large angle means a low degree of 
similarity.

In [87]:
similarity = cosine_similarity(vectors)
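
For intuition, the cosine similarity of two count vectors is their dot product divided by the product of their lengths. The sketch below computes it by hand for two small made-up vectors. Returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption of this sketch: no direction, so no similarity
    return dot / (norm_u * norm_v)

a = [1, 1, 0, 0, 2]
b = [1, 0, 1, 1, 0]

print(cosine(a, a))  # approximately 1: identical direction
print(cosine(a, b))  # 1 / sqrt(18), roughly 0.24: one shared word
```

Because word counts are never negative, the values for movie vectors fall between 0 (no shared words) and 1 (same direction).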

In [88]:
sorted(list(enumerate(similarity[1])),reverse=True,key=lambda x:x[1])[1:6]

[(12, 0.4140043409440133),
 (199, 0.27500954910846337),
 (17, 0.2533201985524494),
 (216, 0.20579830217101058),
 (3572, 0.20579830217101058)]

## Recommendation Function 
Now, we’ll create a function that takes a movie title as a parameter, looks up 
the corresponding row, reads that movie’s cosine similarity with every other 
movie, and prints the titles of the 20 most similar movies, sorted with the 
highest similarity (smallest angle) first.

In [89]:
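def recommend(movie):
    # Use the row position, not the index label: after dropna() the labels
    # have gaps, while similarity and iloc are both positional
    movie_index = np.flatnonzero((new_df['title'] == movie).to_numpy())[0]
    distances = similarity[movie_index]
    # Highest similarity first; skip position 0, which is the movie itself
    movies_list = sorted(list(enumerate(distances)),reverse=True,key=lambda x:x[1])[1:21]
    
    for i in movies_list:
        print(new_df.iloc[i[0]].title)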

Now, we’ll check our recommendation system:

In [90]:
recommend('Iron Man')

Iron Man 3
Iron Man 2
Avengers: Age of Ultron
The Avengers
Captain America: Civil War
Guardians of the Galaxy
X-Men
Thor: The Dark World
Ant-Man
X-Men Origins: Wolverine
X-Men: Days of Future Past
X2
X-Men: The Last Stand
The Helix... Loaded
The Incredible Hulk
Hellboy II: The Golden Army
Teenage Mutant Ninja Turtles II: The Secret of the Ooze
Captain America: The First Avenger
The Wolverine
Thor


#### Working fine :)
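
The ranking step inside recommend can be sketched on a tiny hand-written similarity matrix. The titles, the values in sim and the helper top_similar are all made up for illustration. Unlike the notebook, this version excludes the query movie explicitly instead of relying on it sorting first.

```python
titles = ["A", "B", "C", "D"]
# Symmetric toy similarity matrix; row i holds movie i's similarity to every movie
sim = [
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
]

def top_similar(title, k=2):
    """Return the k titles most similar to `title`, best match first."""
    row = titles.index(title)  # positional row of the query title
    ranked = sorted(enumerate(sim[row]), key=lambda pair: pair[1], reverse=True)
    # Drop the query itself explicitly rather than assuming it sorts first
    return [titles[i] for i, _ in ranked if i != row][:k]

print(top_similar("A"))  # ['C', 'D']
```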

## Using Pickle to store data Serially 
Now, we’ll use Python’s pickle module to store this data and export it to new 
.pkl files. Pickle is used for serializing and de-serializing a Python object 
structure: it “serializes” the object before writing it to a file. Pickling 
converts a Python object (list, dict, etc.) into a byte stream that contains 
all the information necessary to reconstruct the object in another Python 
script.

In [91]:
with open('movie_dict.pkl','wb') as f:
    pickle.dump(new_df.to_dict(), f)
with open('similarity.pkl','wb') as f:
    pickle.dump(similarity, f)

This creates two new files named movie_dict.pkl and similarity.pkl.
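
To show how such a file is read back, here is a minimal round-trip sketch. It uses a small made-up dictionary and a temporary directory rather than the real movie_dict.pkl; the file name demo.pkl is hypothetical.

```python
import os
import pickle
import tempfile

# Small stand-in for new_df.to_dict()
data = {"title": {0: "Avatar", 1: "Spectre"}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pkl")  # hypothetical file name
    # Serialize the object to a byte stream on disk
    with open(path, "wb") as f:
        pickle.dump(data, f)
    # De-serialize it again, as another script would
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == data)  # True
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.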

---
#### - By Sanket Saurav