# Importing libraries

In [1]:
import numpy as np
import pandas as pd
import ast
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk import PorterStemmer
from sklearn.metrics.pairwise import cosine_similarity
import compress_pickle as cPickle

# Problem Formulation
---
The objective is to build a model that recommends movies similar to one the user likes or has already watched. We first pre-process the data, using `nltk` to prepare the text for ML analysis, and then use `cosine similarity` from `scikit-learn` to retrieve similar movies from the `TMDB` dataset, which contains data on 5000 movies. The next step is to integrate the model with `streamlit` to create a web app.

During model building we will cover concepts such as `Data loading`, `Data cleaning`, `Feature engineering`, `Dimensionality reduction`, `Vectorization`, `Stemming` and `Cosine similarity`.

In terms of technology and tools, this project covers:
 - Language: `Python`
 - Data cleaning: `NumPy`, `Pandas`
 - Data visualization: `Matplotlib`
 - Data pre-processing: `Natural Language Toolkit (NLTK)`, `Scikit-learn`
 - Model building: `Scikit-learn`
 - UI: `Streamlit`

# Data Loading
---
Dataset - TMDB | 5000 movies and credits dataset
 - Reading data from the CSV files and creating DataFrames using Pandas.

In [2]:
# contains movie info
movies = pd.read_csv(r"C:\Users\noble\PycharmProjects\movie-recommender\data\tmdb_3000_movies.csv",encoding='latin1')
# contains cast and crew info
credits = pd.read_csv(r"C:\Users\noble\PycharmProjects\movie-recommender\data\tmdb_3000_credits.csv",encoding='latin1')

In [3]:
# merging movies and credits
movies = movies.merge(credits,on='title')
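
Merging on `title` works as long as titles are unique in both files. The sketch below uses made-up toy frames (not TMDB data) to show how duplicate titles multiply rows, and how joining on a numeric identifier avoids that. The `movie_id` column name in the credits frame is an assumption here.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (hypothetical data, not from TMDB).
toy_movies = pd.DataFrame({"id": [1, 2, 3], "title": ["Alpha", "Beta", "Beta"]})
toy_credits = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["Alpha", "Beta", "Beta"]})

# Merging on a non-unique key pairs every "Beta" row with every other "Beta" row.
by_title = toy_movies.merge(toy_credits, on="title")
print(len(by_title))  # 5 rows: 1 for Alpha, 2 x 2 for Beta

# Merging on the numeric identifier keeps one row per movie.
by_id = toy_movies.merge(
    toy_credits, left_on="id", right_on="movie_id", suffixes=("", "_credits")
)
print(len(by_id))  # 3 rows
```

Comparing the row count before and after a merge is a cheap way to spot this kind of duplication.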

In [4]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2985 entries, 0 to 2984
Columns: 188 entries, budget to Unnamed: 168
dtypes: float64(3), int64(4), object(181)
memory usage: 4.3+ MB


# Data cleaning 
Keeping only the columns required for content-based filtering

In [5]:
movies = movies[['title','genres','id','keywords','overview','cast','crew']]

Checking how many null values each column contains

In [6]:
movies.isnull().sum()

title       0
genres      0
id          0
keywords    0
overview    1
cast        0
crew        2
dtype: int64

In [7]:
movies.shape

(2985, 7)

Dropping the rows that have null values

In [8]:
movies.dropna(inplace=True)
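
One side effect of `dropna` is that the remaining rows keep their original index labels, so labels and row positions no longer agree. That matters later, when row positions are used to look up similarity scores. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical three-row frame with one missing overview.
toy = pd.DataFrame({"title": ["A", "B", "C"], "overview": ["x", None, "z"]})

cleaned = toy.dropna()
print(list(cleaned.index))  # [0, 2] -> label 2 now sits at position 1

# Resetting the index makes labels and positions agree again.
cleaned = cleaned.reset_index(drop=True)
print(list(cleaned.index))  # [0, 1]
```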

Parsing the JSON-formatted strings and keeping only the names of the genres and keywords

In [9]:
def derive(a):
    L = []
    for i in ast.literal_eval(a):
        L.append(i['name'])
    return L
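
As a quick check of what `derive` does, the sketch below applies the same logic to a short sample string shaped like the `genres` column. The sample string is made up for illustration. Because the column holds JSON, `json.loads` should parse it as well, so both are shown for comparison.

```python
import ast
import json

def derive(a):
    # Same logic as the cell above: parse the string, keep each "name".
    return [item["name"] for item in ast.literal_eval(a)]

sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

print(derive(sample))  # ['Action', 'Adventure']

# Alternative parser for the same string.
json_names = [item["name"] for item in json.loads(sample)]
print(json_names)  # ['Action', 'Adventure']
```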

In [10]:
movies['genres']

0       [{"id": 28, "name": "Action"}, {"id": 12, "nam...
1       [{"id": 12, "name": "Adventure"}, {"id": 14, "...
2       [{"id": 28, "name": "Action"}, {"id": 12, "nam...
3       [{"id": 28, "name": "Action"}, {"id": 80, "nam...
4       [{"id": 28, "name": "Action"}, {"id": 12, "nam...
                              ...                        
2980    [{"id": 18, "name": "Drama"}, {"id": 35, "name...
2981    [{"id": 878, "name": "Science Fiction"}, {"id"...
2982    [{"id": 28, "name": "Action"}, {"id": 12, "nam...
2983    [{"id": 28, "name": "Action"}, {"id": 12, "nam...
2984    [{"id": 27, "name": "Horror"}, {"id": 9648, "n...
Name: genres, Length: 2982, dtype: object

In [11]:
movies['genres'] = movies['genres'].apply(derive)

In [12]:
movies['keywords'] = movies['keywords'].apply(derive)

In [13]:
# movies.head(2)

In [14]:
movies['cast'].head(10)

0    [{"cast_id": 242, "character": "Jake Sully", "...
1    [{"cast_id": 4, "character": "Captain Jack Spa...
2    [{"cast_id": 1, "character": "James Bond", "cr...
3    [{"cast_id": 2, "character": "Bruce Wayne / Ba...
4    [{"cast_id": 5, "character": "John Carter", "c...
5    [{"cast_id": 30, "character": "Peter Parker / ...
6    [{"cast_id": 34, "character": "Flynn Rider (vo...
7    [{"cast_id": 76, "character": "Tony Stark / Ir...
8    [{"cast_id": 3, "character": "Harry Potter", "...
9    [{"cast_id": 18, "character": "Bruce Wayne / B...
Name: cast, dtype: object

Getting the top 3 cast members of each movie

In [15]:
def derive3(a):
    L = []
    counter = 0
    for i in ast.literal_eval(a):
        if counter<3:
            L.append(i['name'])
            counter+=1
        else:
            break
    return L
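
The counter in `derive3` can also be expressed as a slice of the parsed list. This is only a sketch of an equivalent form; `top_cast` and its `n` parameter are hypothetical names, and the sample string is made up.

```python
import ast

def top_cast(a, n=3):
    # Hypothetical variant of derive3: slice the parsed list instead of counting.
    return [member["name"] for member in ast.literal_eval(a)[:n]]

sample = (
    '[{"cast_id": 1, "name": "A"}, {"cast_id": 2, "name": "B"}, '
    '{"cast_id": 3, "name": "C"}, {"cast_id": 4, "name": "D"}]'
)
print(top_cast(sample))  # ['A', 'B', 'C']
```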

In [16]:
movies['cast'] = movies['cast'].apply(derive3)

In [17]:
movies['cast'].head(10)

0     [Sam Worthington, Zoe Saldana, Sigourney Weaver]
1        [Johnny Depp, Orlando Bloom, Keira Knightley]
2         [Daniel Craig, Christoph Waltz, Léa Seydoux]
3         [Christian Bale, Michael Caine, Gary Oldman]
4       [Taylor Kitsch, Lynn Collins, Samantha Morton]
5         [Tobey Maguire, Kirsten Dunst, James Franco]
6            [Zachary Levi, Mandy Moore, Donna Murphy]
7    [Robert Downey Jr., Chris Hemsworth, Mark Ruff...
8        [Daniel Radcliffe, Rupert Grint, Emma Watson]
9               [Ben Affleck, Henry Cavill, Gal Gadot]
Name: cast, dtype: object

Getting only the director's name from the crew JSON

In [18]:
def deriveDirector(a):
    L = []
    for i in ast.literal_eval(a):
        if i['job']=="Director":
            L.append(i['name'])
            break
    return L
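
The same "first match or nothing" behaviour can be sketched with a generator and `next`. The function name `first_director` and the sample crew strings are hypothetical. Like `deriveDirector`, it returns an empty list when no director is listed.

```python
import ast

def first_director(a):
    # Hypothetical variant of deriveDirector using a generator expression.
    names = (m["name"] for m in ast.literal_eval(a) if m["job"] == "Director")
    name = next(names, None)
    return [name] if name is not None else []

crew = '[{"job": "Editor", "name": "E One"}, {"job": "Director", "name": "D One"}]'
print(first_director(crew))  # ['D One']
print(first_director("[]"))  # []
```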

In [19]:
movies['crew'] = movies['crew'].apply(deriveDirector)

In [20]:
movies.head(2)

| | title | genres | id | keywords | overview | cast | crew |
|---|---|---|---|---|---|---|---|
| 0 | Avatar | [Action, Adventure, Fantasy, Science Fiction] | 19995 | [culture clash, future, space war, space colon... | In the 22nd century, a paraplegic Marine is di... | [Sam Worthington, Zoe Saldana, Sigourney Weaver] | [James Cameron] |
| 1 | Pirates of the Caribbean: At World's End | [Adventure, Fantasy, Action] | 285 | [ocean, drug abuse, exotic island, east india ... | Captain Barbossa, long believed to be dead, ha... | [Johnny Depp, Orlando Bloom, Keira Knightley] | [Gore Verbinski] |


Converting the overview from a string to a list of words

In [21]:
movies['overview'] = movies['overview'].apply(lambda x:x.split(" "))

Creating a new `tags` column by concatenating the genres, keywords, overview, cast and crew columns

In [22]:
movies['tags'] = movies['genres']+movies['keywords']+movies['overview']+movies['cast']+movies['crew']
movies.head()

| | title | genres | id | keywords | overview | cast | crew | tags |
|---|---|---|---|---|---|---|---|---|
| 0 | Avatar | [Action, Adventure, Fantasy, Science Fiction] | 19995 | [culture clash, future, space war, space colon... | [In, the, 22nd, century,, a, paraplegic, Marin... | [Sam Worthington, Zoe Saldana, Sigourney Weaver] | [James Cameron] | [Action, Adventure, Fantasy, Science Fiction, ... |
| 1 | Pirates of the Caribbean: At World's End | [Adventure, Fantasy, Action] | 285 | [ocean, drug abuse, exotic island, east india ... | [Captain, Barbossa,, long, believed, to, be, d... | [Johnny Depp, Orlando Bloom, Keira Knightley] | [Gore Verbinski] | [Adventure, Fantasy, Action, ocean, drug abuse... |
| 2 | Spectre | [Action, Adventure, Crime] | 206647 | [spy, based on novel, secret agent, sequel, mi... | [A, cryptic, message, from, Bonds, past, send... | [Daniel Craig, Christoph Waltz, Léa Seydoux] | [Sam Mendes] | [Action, Adventure, Crime, spy, based on novel... |
| 3 | The Dark Knight Rises | [Action, Crime, Drama, Thriller] | 49026 | [dc comics, crime fighter, terrorist, secret i... | [Following, the, death, of, District, Attorney... | [Christian Bale, Michael Caine, Gary Oldman] | [Christopher Nolan] | [Action, Crime, Drama, Thriller, dc comics, cr... |
| 4 | John Carter | [Action, Adventure, Science Fiction] | 49529 | [based on novel, mars, medallion, space travel... | [John, Carter, is, a, war-weary,, former, mili... | [Taylor Kitsch, Lynn Collins, Samantha Morton] | [Andrew Stanton] | [Action, Adventure, Science Fiction, based on ... |
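
The `+` used to build `tags` works because each cell holds a Python list, so adding two columns concatenates the lists row by row. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical two-row frame whose cells are lists, like genres and cast above.
toy = pd.DataFrame({
    "genres": [["Action"], ["Drama"]],
    "cast": [["A One", "B Two"], ["C Three"]],
})

# "+" on two object columns concatenates the lists in each row.
toy["tags"] = toy["genres"] + toy["cast"]
print(toy["tags"].tolist())
# [['Action', 'A One', 'B Two'], ['Drama', 'C Three']]
```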


Storing the result in a new DataFrame

In [23]:
new_movies = movies[['id','title','tags']].copy()

Removing the spaces inside each tag, so that a multi-word name or keyword becomes a single token

In [24]:
new_movies['tags'] = new_movies['tags'].apply(lambda x:[i.replace(" ","") for i in x])
new_movies.head()

| | id | title | tags |
|---|---|---|---|
| 0 | 19995 | Avatar | [Action, Adventure, Fantasy, ScienceFiction, c... |
| 1 | 285 | Pirates of the Caribbean: At World's End | [Adventure, Fantasy, Action, ocean, drugabuse,... |
| 2 | 206647 | Spectre | [Action, Adventure, Crime, spy, basedonnovel, ... |
| 3 | 49026 | The Dark Knight Rises | [Action, Crime, Drama, Thriller, dccomics, cri... |
| 4 | 49529 | John Carter | [Action, Adventure, ScienceFiction, basedonnov... |
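
Why remove the spaces? A likely reason is that a word-level vectorizer would otherwise split "Sam Worthington" and "Sam Mendes" into tokens that share "sam", making unrelated people look alike. The sketch below illustrates this with `CountVectorizer` on two names taken from the output above; the default tokenizer settings are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

with_spaces = ["Sam Worthington", "Sam Mendes"]
without_spaces = [name.replace(" ", "") for name in with_spaces]

# With spaces, both names contribute the shared token "sam".
vocab_with = sorted(CountVectorizer().fit(with_spaces).vocabulary_)
print(vocab_with)  # ['mendes', 'sam', 'worthington']

# Without spaces, each person becomes one distinct token.
vocab_without = sorted(CountVectorizer().fit(without_spaces).vocabulary_)
print(vocab_without)  # ['sammendes', 'samworthington']
```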


In [25]:
pd.set_option('max_colwidth', None)

Converting tags from a list to a string

In [26]:
new_movies['tags'] = new_movies['tags'].apply(lambda x:" ".join(x))

In [27]:
new_movies.head(3)

Unnamed: 0,id,title,tags
0,19995,Avatar,"Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. SamWorthington ZoeSaldana SigourneyWeaver JamesCameron"
1,285,Pirates of the Caribbean: At World's End,"Adventure Fantasy Action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger Captain Barbossa, long believed to be dead, has come back to life and is headed to the edge of the Earth with Will Turner and Elizabeth Swann. But nothing is quite as it seems. JohnnyDepp OrlandoBloom KeiraKnightley GoreVerbinski"
2,206647,Spectre,"Action Adventure Crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom A cryptic message from Bonds past sends him on a trail to uncover a sinister organization. While M battles political forces to keep the secret service alive, Bond peels back the layers of deceit to reveal the terrible truth behind SPECTRE. DanielCraig ChristophWaltz LéaSeydoux SamMendes"


In [28]:
new_movies['tags'] = new_movies['tags'].apply(lambda x:x.lower())

In [29]:
new_movies.head(4)

Unnamed: 0,id,title,tags
0,19995,Avatar,"action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. samworthington zoesaldana sigourneyweaver jamescameron"
1,285,Pirates of the Caribbean: At World's End,"adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. johnnydepp orlandobloom keiraknightley goreverbinski"
2,206647,Spectre,"action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom a cryptic message from bonds past sends him on a trail to uncover a sinister organization. while m battles political forces to keep the secret service alive, bond peels back the layers of deceit to reveal the terrible truth behind spectre. danielcraig christophwaltz léaseydoux sammendes"
3,49026,The Dark Knight Rises,"action crime drama thriller dccomics crimefighter terrorist secretidentity burglar hostagedrama timebomb gothamcity vigilante cover-up superhero villainess tragichero terrorism destruction catwoman catburglar imax flood criminalunderworld batman following the death of district attorney harvey dent, batman assumes responsibility for dent's crimes to protect the late attorney's reputation and is subsequently hunted by the gotham city police department. eight years later, batman encounters the mysterious selina kyle and the villainous bane, a new terrorist leader who overwhelms gotham's finest. the dark knight resurfaces to protect a city that has branded him an enemy. christianbale michaelcaine garyoldman christophernolan"


Preprocessing ends here, with the tags converted to lowercase strings.

# Vectorization
 - Creating count vectors with `CountVectorizer` from sklearn, keeping the 5000 most frequent terms and removing English stop words

In [30]:
cv = CountVectorizer(max_features=5000,stop_words='english')

vector = cv.fit_transform(new_movies['tags']).toarray()

Inspecting the feature names of the vectors

In [31]:
cv.get_feature_names_out()

array(['000', '007', '10', ..., 'zone', 'zoo', 'zooeydeschanel'],
      dtype=object)

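# Stemming
---
Reducing each word to its root form with the stem function from `nltk`, so that variants of the same word are counted as one term. The cells shown here only define the function and do not apply it to `tags`.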

In [32]:
ps = PorterStemmer()

In [33]:
def stemming(text):
    # Stem each whitespace-separated word and rebuild the string
    L = []
    for i in text.split():
        L.append(ps.stem(i))
    return " ".join(L)
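
The effect of stemming is easiest to see on a few inflected forms of one word. The sketch below is illustrative only; `stem_text` is a hypothetical helper name.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stem_text(text):
    # Split into words first; iterating over the string itself would yield characters.
    return " ".join(ps.stem(word) for word in text.split())

print(stem_text("loved loving loves"))  # love love love
```

After stemming, the three forms map to the same token, so a count vectorizer would treat them as one feature.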

# Cosine similarity

In [34]:
similarity = cosine_similarity(vector)

In [35]:
sorted(list(enumerate(similarity[0])),reverse=True,key=lambda x:x[1])[1:6]

[(535, 0.24715576637149034),
 (1185, 0.24459979523511427),
 (503, 0.241948228618021),
 (257, 0.23007892341722033),
 (1207, 0.23000322710873397)]
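
For two count vectors, cosine similarity is the dot product divided by the product of the vector lengths. The sketch below computes it by hand for two small made-up vectors and compares the result with `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up count vectors over a four-term vocabulary.
a = np.array([1, 1, 0, 1])
b = np.array([1, 1, 1, 0])

# Dot product divided by the product of the Euclidean norms.
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
library = cosine_similarity([a], [b])[0, 0]

print(round(manual, 4), round(library, 4))  # 0.6667 0.6667
```

A score of 1 means the two vectors point in the same direction, and 0 means they share no terms.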

# Recommendation function
Writing a function that finds the index of the given movie, sorts all movies by their similarity to it, and prints the top 5 most similar titles

In [36]:
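def recommend(movie):
    # Row position (not index label) of the movie: dropna() left gaps in the
    # labels, while `similarity` and `iloc` are both positional
    movie_index = np.flatnonzero(new_movies['title'].to_numpy() == movie)[0]
    # Sort all movies by similarity score, highest first
    distance = sorted(list(enumerate(similarity[movie_index])),reverse=True,key=lambda x:x[1])
    # Skip the first entry (the movie itself) and print the next five
    for i in distance[1:6]:
        print(new_movies.iloc[i[0]].title)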

In [37]:
# recommend("Avatar")
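
For use outside the notebook, a variant that returns the titles instead of printing them may be handier. The sketch below runs on a made-up three-movie frame whose index labels have a gap, as they would after `dropna()`. `recommend_titles`, `toy_movies` and `toy_similarity` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical data: index labels have a gap, as they would after dropna().
toy_movies = pd.DataFrame({"title": ["A", "B", "C"]}, index=[0, 2, 3])
toy_similarity = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
])

def recommend_titles(movie, k=2):
    # Look up the row position (not the index label) of the requested title.
    matches = np.flatnonzero(toy_movies["title"].to_numpy() == movie)
    if matches.size == 0:
        return []  # unknown title
    pos = matches[0]
    # Sort positions by descending similarity and skip the movie itself.
    order = np.argsort(-toy_similarity[pos])
    return [toy_movies["title"].iloc[i] for i in order if i != pos][:k]

print(recommend_titles("C"))  # ['A', 'B']
```

Returning an empty list for an unknown title is one possible design choice; raising an error would be equally reasonable.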

# Exporting
---
 - Pickling the pre-processed movie data and the similarity matrix we built, so that both can be used in streamlit
 - Using `compress-pickle` with `lzma` compression to reduce the size of the pickles

In [38]:
with open('C:/Users/noble/PycharmProjects/movie-recommender/assets/pickles/movies.lzma','wb') as f:
    cPickle.dump(new_movies,f,compression='lzma')

In [39]:
with open('C:/Users/noble/PycharmProjects/movie-recommender/assets/pickles/similarity.lzma','wb') as f:
    cPickle.dump(similarity,f,compression='lzma')