<img align='center' src="https://heartoflongmont.org/wp-content/uploads/2019/02/Movie-Recommendation.jpg">

<h1 align='center'>Content Based Filtering</h1>

<h3>Importing required packages, modules, functions</h3>

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import ipywidgets as widgets

### Reading the dataset

In [2]:
df = pd.read_csv("Downloads/imdb.csv")

In [3]:
df.head()

| | cast_crew | description | genre | gross | imdb_score | movie_name | rating | runtime | votes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | \n Director:\n,Thomas Kail,\n ... | \n The real life of one of America's foremo... | \nBiography, Drama, History | | 9.0 | Hamilton | PG-13 | 160 min | 22819 |
| 1 | \n Director:\n,David Dobkin,\n ... | \n When aspiring musicians Lars and Sigrit ... | \nComedy, Music | | 6.6 | Eurovision Song Contest: The Story of Fire Saga | PG-13 | 123 min | 45555 |
| 2 | \n Director:\n,Gina Prince-Bythewood,\n ... | \n A covert team of immortal mercenaries ar... | \nAction, Adventure, Fantasy | | 6.7 | The Old Guard | R | 125 min | 43653 |
| 3 | \n Directors:\n,Barbara Bialowas,, \n,Tomas... | \n Massimo is a member of the Sicilian Mafi... | \nDrama, Romance | | 3.3 | 365 Days | TV-MA | 114 min | 26324 |
| 4 | \n Director:\n,Aaron Schneider,\n ... | \n Early in World War II, an inexperienced ... | \nAction, Drama, History | | 7.1 | Greyhound | PG-13 | 91 min | 16830 |


### Data Cleaning and Data Preparation

<p><b>Data Cleaning</b> removes unwanted characters such as the newline character '\n' and surrounding whitespace.</p>
<p><b>Data Preparation</b> separates the directors and the main cast of each movie from the 'cast_crew' column, then creates a new column, 'combined', that holds the text compared during content-based filtering.</p>
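
The separation described above can be sketched on a single raw string. This is only an illustration under stated assumptions: the function name `parse_cast_crew` is hypothetical, the sample string and its names are made up, and the layout (a director section and a stars section joined by `|`, each introduced by a label ending in `:`) is assumed from the cleaning cells in this notebook rather than checked against the full dataset.

```python
def parse_cast_crew(raw):
    """Split one raw 'cast_crew' string into (directors, cast) name lists.

    Assumed layout (hypothetical): "Director:,<names>|Stars:,<names>", with
    names separated by commas and padded by newlines and spaces.
    """
    if not isinstance(raw, str):  # NaN or another missing value
        return [], []
    sections = []
    for part in raw.split("|")[:2]:
        # Drop a leading label such as "Director:", "Directors:" or "Stars:".
        _, sep, tail = part.partition(":")
        names = tail if sep else part
        cleaned = [name.strip() for name in names.split(",")]
        # dict.fromkeys removes duplicates while keeping the original order.
        sections.append([name for name in dict.fromkeys(cleaned) if name])
    while len(sections) < 2:  # no "|" in the string means no cast section
        sections.append([])
    return sections[0], sections[1]


# Made-up sample string in the assumed layout.
sample = "\n    Directors:\n,Ann Lee,, \n,Bo Chen,\n|\n    Stars:\n,Cy Diaz,\n,Di Evans,\n,Cy Diaz\n"
directors, cast = parse_cast_crew(sample)
print(directors)
print(cast)
```

Because the function works on one string at a time, it could be applied to the whole column with `df['cast_crew'].apply(parse_cast_crew)`. Using `dict.fromkeys` instead of `set` keeps the names in their original billing order.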

##### Data Cleaning

In [4]:
df['cast_crew'] = df['cast_crew'].str.strip()
df['description'] = df['description'].str.strip()
df['genre'] = df['genre'].str.strip()

In [5]:
df['cast_crew'] = df['cast_crew'].str.split("|")

In [6]:
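# Split each 'cast_crew' list into a 'director' entry and a 'cast' entry.
for i, cast in enumerate(df['cast_crew']):
    try:
        df.loc[i, 'director'] = cast[0]
    except (IndexError, TypeError):  # empty list, or NaN instead of a list
        df.loc[i, 'director'] = np.nan
    try:
        df.loc[i, 'cast'] = cast[1]
    except (IndexError, TypeError):  # no second part after the "|" split
        df.loc[i, 'cast'] = np.nan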

In [7]:
df.drop('cast_crew', axis=1, inplace=True)

In [8]:
df['director'] = df['director'].str.replace("\n", "")
df['cast'] = df['cast'].str.replace("\n", "")

In [9]:
df['director'] = df['director'].str.replace("Director:,", "")
df['cast'] = df['cast'].str.replace("Stars:,", "")

In [10]:
df['director'] = df['director'].str.replace("Directors:,", "")

In [11]:
df['cast'] = df['cast'].str.replace(",", "", 1)

In [12]:
df['cast'] = df['cast'].str.strip()

In [13]:
for i, d in enumerate(df['director']):
    d = d[::-1]
    d = d.replace(",", "", 1)
    d = d.strip()
    d = d[1:]
    d = d[::-1]
    df.loc[i, 'director'] = d

In [14]:
for i, c in enumerate(df['cast']):
    try:
        c = set(c.split(","))
        c.discard("")
        c.discard(" ")
        c = list(c)
        y = ""
        for x in c:
            y = y + x + ", "
        df.loc[i, 'cast'] = y
    except:
        pass
    

In [15]:
df['cast'] = df['cast'].str.rstrip()

In [16]:
for i, c in enumerate(df['cast']):
    try:
        c = c[::-1]
        c = c.replace(",", "", 1)
        c = c[::-1]
        df.loc[i, 'cast'] = c
    except:
        pass

In [17]:
for i, d in enumerate(df['director']):
    try:
        d = set(d.split(","))
        d.discard("")
        d.discard(" ")
        d = list(d)
        y = ""
        for j, x in enumerate(d):
            if j == len(d)-1:
                y = y + x
            else:
                y = y + x + ", "
        df.loc[i, 'director'] = y
    except:
        pass

In [18]:
for i, g in enumerate(df['gross']):
    try:
        g = float(g[1:-1])
        df.loc[i, 'gross'] = g
    except:
        pass

In [19]:
df = df[['movie_name', 'director', 'cast', 'genre', 'description', 'imdb_score', 'votes', 'rating', 'runtime', 'gross']]

In [20]:
df['votes'] = df['votes'].str.replace(",", "")

In [21]:
df['votes'] = df['votes'].astype(float)

In [22]:
df['runtime'] = df['runtime'].str.replace(" min", "")

In [23]:
df['runtime'] = df['runtime'].astype(float)

In [25]:
#df.to_csv("imdb_cleaned.csv")

##### Data preparation

In [24]:
df.fillna("", inplace=True)
df.head()

| | movie_name | director | cast | genre | description | imdb_score | votes | rating | runtime | gross |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Hamilton | Thomas Kail | Leslie Odom Jr., Renée Elise Goldsberry, Phill... | Biography, Drama, History | The real life of one of America's foremost fou... | 9.0 | 22819 | PG-13 | 160 | |
| 1 | Eurovision Song Contest: The Story of Fire Saga | David Dobkin | Dan Stevens, Rachel McAdams, Mikael Persbrandt... | Comedy, Music | When aspiring musicians Lars and Sigrit are gi... | 6.6 | 45555 | PG-13 | 123 | |
| 2 | The Old Guard | Gina Prince-Bythewood | Matthias Schoenaerts, Charlize Theron, Marwan ... | Action, Adventure, Fantasy | A covert team of immortal mercenaries are sudd... | 6.7 | 43653 | R | 125 | |
| 3 | 365 Days | Barbara Bialowas, Tomasz Mandes | Anna Maria Sieklucka, Bronislaw Wroclawski, Mi... | Drama, Romance | Massimo is a member of the Sicilian Mafia fami... | 3.3 | 26324 | TV-MA | 114 | |
| 4 | Greyhound | Aaron Schneider | Stephen Graham, Tom Hanks, Elisabeth Shue, Mat... | Action, Drama, History | Early in World War II, an inexperienced U.S. N... | 7.1 | 16830 | PG-13 | 91 | |


In [25]:
df['cast'] = df['cast'].str.replace(",", "")
df['director'] = df['director'].str.replace(",", "")
df['genre'] = df['genre'].str.replace(",", "")

In [26]:
df.head(1)

| | movie_name | director | cast | genre | description | imdb_score | votes | rating | runtime | gross |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Hamilton | Thomas Kail | Leslie Odom Jr. Renée Elise Goldsberry Phillip... | Biography Drama History | The real life of one of America's foremost fou... | 9 | 22819 | PG-13 | 160 | |


The following function concatenates the fields of each movie into a single string, which is the text that content-based filtering compares.

In [27]:
def combine(row):
    return row['movie_name'] + " " + row['director'] + " " + row['cast'] + " " + row['genre'] + " " + row['description'] + " " + str(row['imdb_score']) + " " + row['rating'] + " " + str(row['runtime'])

In [28]:
df['combined'] = df.apply(combine, axis=1)
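
As a small illustration of what ends up in the 'combined' column, the sketch below builds the string for one made-up row. The names `FIELDS` and `combine_fields` are hypothetical, and the row values are invented. Unlike `combine` above, this variant skips missing values itself, so it does not rely on `fillna("")` having been called first.

```python
# Columns that feed the similarity text, in the same order as combine() uses.
FIELDS = ["movie_name", "director", "cast", "genre",
          "description", "imdb_score", "rating", "runtime"]


def combine_fields(row, fields=FIELDS):
    """Join the selected fields of one row (dict or pandas Series) into text."""
    parts = []
    for name in fields:
        value = row[name]
        # Skip None, empty strings and NaN (NaN is the only value != itself).
        if value is None or value != value or value == "":
            continue
        parts.append(str(value))
    return " ".join(parts)


# Made-up example row; the runtime is deliberately missing.
example_row = {
    "movie_name": "Example Movie",
    "director": "Ann Lee",
    "cast": "Cy Diaz Di Evans",
    "genre": "Action Drama",
    "description": "A made-up plot.",
    "imdb_score": 7.1,
    "rating": "PG-13",
    "runtime": float("nan"),
}
print(combine_fields(example_row))
```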

In [29]:
df.head()

| | movie_name | director | cast | genre | description | imdb_score | votes | rating | runtime | gross | combined |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Hamilton | Thomas Kail | Leslie Odom Jr. Renée Elise Goldsberry Phillip... | Biography Drama History | The real life of one of America's foremost fou... | 9.0 | 22819 | PG-13 | 160 | | Hamilton Thomas Kail Leslie Odom Jr. Renée Eli... |
| 1 | Eurovision Song Contest: The Story of Fire Saga | David Dobkin | Dan Stevens Rachel McAdams Mikael Persbrandt W... | Comedy Music | When aspiring musicians Lars and Sigrit are gi... | 6.6 | 45555 | PG-13 | 123 | | Eurovision Song Contest: The Story of Fire Sag... |
| 2 | The Old Guard | Gina Prince-Bythewood | Matthias Schoenaerts Charlize Theron Marwan Ke... | Action Adventure Fantasy | A covert team of immortal mercenaries are sudd... | 6.7 | 43653 | R | 125 | | The Old Guard Gina Prince-Bythewood Matthias S... |
| 3 | 365 Days | Barbara Bialowas Tomasz Mandes | Anna Maria Sieklucka Bronislaw Wroclawski Mich... | Drama Romance | Massimo is a member of the Sicilian Mafia fami... | 3.3 | 26324 | TV-MA | 114 | | 365 Days Barbara Bialowas Tomasz Mandes Anna M... |
| 4 | Greyhound | Aaron Schneider | Stephen Graham Tom Hanks Elisabeth Shue Matt Helm | Action Drama History | Early in World War II, an inexperienced U.S. N... | 7.1 | 16830 | PG-13 | 91 | | Greyhound Aaron Schneider Stephen Graham Tom H... |


### Movie Recommendation using Content Based Filtering

In [30]:
cv = CountVectorizer()
count_matrix = cv.fit_transform(df['combined'])
cosine = cosine_similarity(count_matrix)
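
To make the step above concrete, here is a dependency-free sketch of what `CountVectorizer` and `cosine_similarity` compute: each text becomes a vector of word counts, and two texts are compared by the cosine of the angle between their vectors. The helper names `count_vector` and `cosine_sim` are illustrative, and the three toy strings are made up. The regular expression is the default `token_pattern` of `CountVectorizer`, which keeps only tokens of two or more word characters, so short values such as the rating `R` or a score like `7.1` contribute no tokens.

```python
import math
import re
from collections import Counter

# Default token pattern of CountVectorizer: tokens of 2+ word characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")


def count_vector(text):
    """Bag-of-words counts for one text (lowercased, like the default)."""
    return Counter(TOKEN.findall(text.lower()))


def cosine_sim(a, b):
    """Cosine similarity between two count vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Made-up toy texts standing in for rows of the 'combined' column.
docs = ["war drama history", "war action drama", "romantic comedy music"]
vectors = [count_vector(doc) for doc in docs]

print(cosine_sim(vectors[0], vectors[1]))  # two of three words shared, about 0.667
print(cosine_sim(vectors[0], vectors[2]))  # no shared words, 0.0
print(count_vector("R 7.1 PG-13"))         # only 'pg' and '13' survive
```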

In [31]:
#Returns the row index of the first movie whose name matches the title
def get_index(title):
    return df[df['movie_name'] == title].index[0]

In [32]:
#A dropdown menu containing all the movie titles
r = widgets.Dropdown(
    options=list(df['movie_name'].values),
    description='Movie:',
    disabled=False,
)
r

Dropdown(description='Movie:', options=('Hamilton', 'Eurovision Song Contest: The Story of Fire Saga', 'The Ol…

In [36]:
#This step generates a sorted list of movies based on the similarity score
movie = r.value
movie_index = get_index(movie)
#print(movie_index)
recommended = list(enumerate(cosine[movie_index]))
recommended_sorted = sorted(recommended, reverse=True, key=lambda x:x[1])

In [37]:
#Returns the title of the movie at the given row index
def get_title(index):
    return df.loc[index, 'movie_name']

In [38]:
#Display the selected movie (normally ranked first, with similarity 1.0) followed by its 50 most similar movies
i = 0
for movie in recommended_sorted:
    print(get_title(movie[0]))
    i += 1
    if i > 50:
        break

1917
Mudbound
Force 10 from Navarone
Us
World War Z
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
Windtalkers
War Horse
Men in Black 3
Hart's War
Thank You for Your Service
War Games
Tae Guk Gi: The Brotherhood of War
The Monuments Men
Stalingrad
The Lobster
Men with Brooms
Sunshine on Leith
The Keeping Room
Silence
Far from Men
Revolutionary Road
Hearts and Bones
The Aftermath
The Lucky Ones
The Midnight Sky
The Prophecy
Tomorrow Never Dies
A Private War
The Beach
Duck Butter
Phase IV
The Big Red One
The Great War
Thunder Road
New Rose Hotel
Enemy Lines
Mata Hari
Sonatine
Those Magnificent Men in Their Flying Machines or How I Flew from London to Paris in 25 hours 11 minutes
Bus Stop
My Way
The Foreman Went to France
When Hitler Stole Pink Rabbit
War Pigs
3:10 to Yuma
122
Jailbreak
The Best Years of Our Lives
Two Brothers
Blade the Iron Cross
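
The ranking cells above can be wrapped in one function. This is a sketch rather than the notebook's own code: `recommend`, `titles` and `similarity` are hypothetical names, the toy matrix is made up, and unlike the loop above the function leaves the selected movie out of its own results.

```python
def recommend(title, titles, similarity, n=50):
    """Return up to n (title, score) pairs most similar to `title`.

    `titles` is a list of movie names and `similarity` a square matrix
    (list of lists or NumPy array) where similarity[i][j] compares
    movie i with movie j.
    """
    query = titles.index(title)  # first match, like get_index above
    scores = similarity[query]
    # Rank every row index by its similarity to the query, best first.
    ranked = sorted(range(len(titles)), key=lambda i: scores[i], reverse=True)
    # Exclude the query movie itself, then keep the top n.
    return [(titles[i], float(scores[i])) for i in ranked if i != query][:n]


# Made-up toy data: three movies and their pairwise similarities.
toy_titles = ["A", "B", "C"]
toy_similarity = [
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.1],
    [0.7, 0.1, 1.0],
]
print(recommend("A", toy_titles, toy_similarity, n=2))
```

With the notebook's objects, the call would look like `recommend(r.value, list(df['movie_name']), cosine)`.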
