As always, we first import the **libraries** that we are going to use.



In [1]:
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
import string
pd.set_option('display.max_rows', None)
np.set_printoptions(edgeitems=100, linewidth=200)  # public option instead of the private np.core.arrayprint._line_width


## Content-based filtering ##

We will first develop the **content-based filtering system**.
1. Data **exploration**
2. Data **cleaning**
3. Features **plotting**
4. System builder

### Data exploration ###

Loading the data set with **+100k movies** and their details

In [2]:
movies=pd.read_csv("Datasets/IMDb movies.csv")



In [36]:
movies.tail(1000)

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
84855,tt8900082,Commando Ninja,Commando Ninja,2018,2018-12-21,"Action, Comedy",68,France,"English, French",Benjamin Combes,...,"Eric Carlesi, Philippe Allier, Stéphane Asensi...","John Hunter is Vietnam Green Beret Veteran, Ha...",6.8,354,EUR 35000,,,,4.0,6.0
84856,tt8900098,Deslembro,Deslembro,2018,2019-06-20,Drama,96,Brazil,"Spanish, Portuguese, French",Flávia Castro,...,"Jeanne Boudier, Sara Antunes, Eliane Giardini,...",In 1979 a teenage girl emigrates back to Brazi...,7.2,282,,,,,,5.0
84857,tt8900142,Yom Adaatou Zouli,Yom Adaatou Zouli,2018,2019-09-28,Drama,94,"Syria, France, Lebanon, Qatar",Arabic,Soudade Kaadan,...,"Reham Alkassar, Sawsan Arshid, Samer Ismail, O...",In Syria in 2012 a mother ventures into a war ...,5.9,143,,,,,1.0,8.0
84858,tt8900172,"Chelovek, kotoryy udivil vsekh","Chelovek, kotoryy udivil vsekh",2018,2018-10-25,Drama,105,"Russia, France, Estonia",Russian,"Aleksey Chupov, Natasha Merkulova",...,"Evgeniy Tsyganov, Natalya Kudryashova, Yuriy K...",When a Siberian state forest guard discovers h...,6.6,621,,,$ 113717,,3.0,21.0
84859,tt8900302,Kucumbu Tubuh Indahku,Kucumbu Tubuh Indahku,2018,2019-04-18,Drama,105,Indonesia,Indonesian,Garin Nugroho,...,"Muhammad Khan, Raditya Evandra, Rianto, Sujiwo...",A pre-teen boy who abandoned by his father joi...,7.5,337,,,,,6.0,8.0
84860,tt8900984,"Dark, Deadly & Dreadful","Dark, Deadly & Dreadful",2018,2018-07-28,Horror,87,USA,English,"Luke Jaden, Jeanne Jo",...,Jessee Foudray,"Fun Size Horror Presents ""Dark, Deadly & Dread...",4.9,259,,,,,11.0,2.0
84861,tt8901582,Saf,Saf,2018,2019-04-19,Drama,102,"Turkey, Germany, Romania","Turkish, Arabic",Ali Vatansever,...,"Erol Afsin, Saadet Aksoy, Onur Buldu, Emrullah...",The Fikirtepe district of Istanbul. Urban tran...,6.3,359,EUR 530000,,$ 7813,,,7.0
84862,tt8902948,Tigertail,Tigertail,2020,2020-04-10,Drama,91,USA,"English, Min Nan, Mandarin",Alan Yang,...,"Tzi Ma, Christine Ko, Hong-Chi Lee, Yo-Hsing F...","In this multi-generational drama, a Taiwanese ...",6.4,2552,,,,65.0,37.0,63.0
84863,tt8902990,The Sky Is Pink,The Sky Is Pink,2019,2019-12-05,"Drama, Family, Romance",143,"India, UK, Canada, USA",Hindi,Shonali Bose,...,"Priyanka Chopra, Farhan Akhtar, Zaira Wasim, R...",Based on the love story of a couple spanning 2...,7.5,6427,,$ 652592,$ 1088641,55.0,195.0,34.0
84864,tt8903294,À cause des filles..?,À cause des filles..?,2019,2019-01-30,Comedy,96,France,French,Pascal Thomas,...,"José Garcia, Valérie Decobert-Koretzky, Elisa ...",A comedy of characters on the theme of seducti...,5.0,132,,,$ 106143,,1.0,1.0


In [4]:
movies.dtypes

imdb_title_id             object
title                     object
original_title            object
year                      object
date_published            object
genre                     object
duration                   int64
country                   object
language                  object
director                  object
writer                    object
production_company        object
actors                    object
description               object
avg_vote                 float64
votes                      int64
budget                    object
usa_gross_income          object
worlwide_gross_income     object
metascore                float64
reviews_from_users       float64
reviews_from_critics     float64
dtype: object

We first create a new DataFrame with the features we are going to use to match similar movies

In [5]:
movies_def= movies[["title","director","genre","country","description"]].copy()
movies_def.head()

Unnamed: 0,title,director,genre,country,description
0,Miss Jerry,Alexander Black,Romance,USA,The adventures of a female reporter in the 1890s.
1,The Story of the Kelly Gang,Charles Tait,"Biography, Crime, Drama",Australia,True story of notorious Australian outlaw Ned ...
2,Den sorte drøm,Urban Gad,Drama,"Germany, Denmark",Two men of high rank are both wooing the beaut...
3,Cleopatra,Charles L. Gaskill,"Drama, History",USA,The fabled queen of Egypt's affair with Roman ...
4,L'Inferno,"Francesco Bertolini, Adolfo Padovan","Adventure, Drama, Fantasy",Italy,Loosely adapted from Dante's Divine Comedy and...


We check for **NaN** values and **fill** them with a blank space

In [6]:
movies_def.isnull().sum()

title             0
director         87
genre             0
country          64
description    2115
dtype: int64

In [7]:
columns=["title","director","genre","country","description"]
for column in columns:
    movies_def[column]=movies_def[column].fillna(" ")
movies_def.head()

Unnamed: 0,title,director,genre,country,description
0,Miss Jerry,Alexander Black,Romance,USA,The adventures of a female reporter in the 1890s.
1,The Story of the Kelly Gang,Charles Tait,"Biography, Crime, Drama",Australia,True story of notorious Australian outlaw Ned ...
2,Den sorte drøm,Urban Gad,Drama,"Germany, Denmark",Two men of high rank are both wooing the beaut...
3,Cleopatra,Charles L. Gaskill,"Drama, History",USA,The fabled queen of Egypt's affair with Roman ...
4,L'Inferno,"Francesco Bertolini, Adolfo Padovan","Adventure, Drama, Fantasy",Italy,Loosely adapted from Dante's Divine Comedy and...


Define a **function** to combine all the features of a row into one single string

In [8]:
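def combined_features(row):
    # Join the feature columns of one row into a single space-separated string.
    # NaN values were already replaced with " " above, so every value is a string.
    cols = ["title", "director", "genre", "country", "description"]
    return " ".join(row[col] for col in cols)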



Then we **apply** the function to every row of the dataset, adding a new column that holds all the features as one string

In [9]:
movies_def["combined_features"]=movies_def.apply(combined_features, axis =1)
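As a possible alternative to the row-wise `apply`, the same string could be built with a column-wise join. The sketch below runs on a made-up two-row frame (`toy` and `feature_cols` are hypothetical names, not part of the notebook) and assumes missing values have already been filled with a blank space.

```python
import pandas as pd

# Toy frame standing in for movies_def (hypothetical data)
toy = pd.DataFrame({
    "title": ["Miss Jerry", "Cleopatra"],
    "director": ["Alexander Black", " "],
    "genre": ["Romance", "Drama, History"],
    "country": ["USA", "USA"],
    "description": ["A reporter in the 1890s.", " "],
})

feature_cols = ["title", "director", "genre", "country", "description"]

# Join the feature columns of each row with single spaces, then lowercase the result
toy["combined_features"] = (
    toy[feature_cols].astype(str).agg(" ".join, axis=1).str.lower()
)
print(toy["combined_features"].tolist())
```

This does the join and the lowercasing in one step, so the separate `str.lower()` cell would not be needed if this variant were used.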

In [10]:
movies_def["combined_features"]=movies_def["combined_features"].str.lower()# Lowercase for consistency (CountVectorizer also lowercases by default)

Double-check that the dataframe looks the way we wanted and that there are no missing values

In [11]:
movies_def.head()


Unnamed: 0,title,director,genre,country,description,combined_features
0,Miss Jerry,Alexander Black,Romance,USA,The adventures of a female reporter in the 1890s.,miss jerry alexander black romance usa the adv...
1,The Story of the Kelly Gang,Charles Tait,"Biography, Crime, Drama",Australia,True story of notorious Australian outlaw Ned ...,the story of the kelly gang charles tait biogr...
2,Den sorte drøm,Urban Gad,Drama,"Germany, Denmark",Two men of high rank are both wooing the beaut...,"den sorte drøm urban gad drama germany, denmar..."
3,Cleopatra,Charles L. Gaskill,"Drama, History",USA,The fabled queen of Egypt's affair with Roman ...,"cleopatra charles l. gaskill drama, history us..."
4,L'Inferno,"Francesco Bertolini, Adolfo Padovan","Adventure, Drama, Fantasy",Italy,Loosely adapted from Dante's Divine Comedy and...,"l'inferno francesco bertolini, adolfo padovan ..."


In [22]:
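moviestest=movies_def.tail(10000).reset_index(drop=True) # Subset of the last 10000 rows; reset the index so row labels match matrix positions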


In [23]:
movies_def.isnull().sum()

title                0
director             0
genre                0
country              0
description          0
combined_features    0
dtype: int64

Now we have to create the **count matrix** and compute the **cosine similarity** for this new column with all the feature values
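To make the two steps concrete, here is a minimal sketch on three made-up feature strings (the data and the names `docs`, `cv_toy`, `counts` and `sim` are illustrative assumptions). Each string becomes a row of word counts, and the cosine similarity of two rows is their dot product divided by the product of their lengths.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up "combined_features" strings (hypothetical data)
docs = [
    "space adventure drama usa",
    "space adventure comedy usa",
    "romance drama france",
]

cv_toy = CountVectorizer()
counts = cv_toy.fit_transform(docs)   # sparse matrix: one row per string, one column per word
sim = cosine_similarity(counts)       # dense 3x3 matrix of pairwise similarities

print(sorted(cv_toy.vocabulary_))     # the words that became columns
print(sim.round(2))
```

The first two strings share three of their four words, so their similarity is 3 / (2 * 2) = 0.75, while the second and third share no words and score 0.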

In [24]:
# Instantiate the vectorizer and fit it on the combined column to build the count matrix
cv=CountVectorizer()
#count_matrix=cv.fit_transform(movies_def["combined_features"])

count_matrixtest=cv.fit_transform(moviestest["combined_features"]) # Test on the subset, as the full one is too big

In [25]:
# Compute the cosine similarity of the count matrix
#cos_sim=cosine_similarity(count_matrix)


cos_simtest=cosine_similarity(count_matrixtest)# Test on the subset, as the full one is too big
cos_simtest

array([[1.        , 0.07819291, 0.03768892, 0.125     , 0.02946278,
        0.26261287, 0.08333333, 0.12166607, 0.11952286, 0.13608276,
        0.09682458, 0.04902903, 0.16222142, 0.21650635, 0.13130643,
        0.02665009, 0.12141073, 0.02475369, 0.10910895, 0.10606602,
        0.07372098, 0.16248342, 0.07354355, 0.08006408, 0.02727724,
        0.07426107, 0.03175003, 0.04902903, 0.0559017 , 0.15638581,
        0.        , 0.05892557, 0.02665009, 0.21821789, 0.02451452,
        0.05661385, 0.22086305, 0.09375   , 0.04256283, 0.16535946,
        0.        , 0.35355339, 0.02362278, 0.1767767 , 0.02321192,
        0.04642383, 0.06565322, 0.05455447, 0.12126781, 0.1181139 ,
        0.08282364, 0.09712859, 0.05455447, 0.06350006, 0.05      ,
        0.        , 0.15715464, 0.02551552, 0.13693064, 0.06963575,
        0.04856429, 0.07576144, 0.18190172, 0.03175003, 0.06454972,
        0.08087458, 0.14301939, 0.03857584, 0.06565322, 0.19245009,
        0.13479097, 0.14530955, 0.15523011, ...
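The full similarity matrix is too big because `cosine_similarity` returns a dense N x N array: for the 10,000-row subset that is already 100 million values, about 800 MB as float64, and the size grows quadratically with the number of rows. One possible workaround, sketched here on toy data, is to compare a single row of the sparse count matrix against all rows, which produces only N values per query. All names and data in the block are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up feature strings (hypothetical data)
docs = [
    "space adventure drama usa",
    "space adventure comedy usa",
    "romance drama france",
    "space drama usa",
]
counts = CountVectorizer().fit_transform(docs)

query_pos = 0
# Similarity of one row against every row: a 1 x N result instead of N x N
row_sim = cosine_similarity(counts[[query_pos]], counts).ravel()

# Positions sorted by decreasing similarity, without the query itself
order = np.argsort(-row_sim)
order = order[order != query_pos][:2]
print(order, row_sim[order].round(2))
```

With this approach the count matrix for the whole dataset stays sparse, and only one row of similarities is held in memory at a time.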

Defining functions to get **index from title and title from index**

In [26]:
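def get_index(title):
    # Row position of the first exact title match within moviestest,
    # the subset cos_simtest was built from (raises IndexError if there is no match)
    return np.flatnonzero((moviestest["title"]==title).to_numpy())[0]
def get_title(index):
    # Title stored at a given row position of moviestest
    return moviestest["title"].iloc[index]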


Defining the function to get the top 5 **recommended movies** based on similarity

In [27]:
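def movie_recomendation():
    movie_user_likes=input("Please write your choice here...")
    movie_user_index=get_index(movie_user_likes)
    # Pair every row position with its similarity to the chosen movie
    sim_movies=list(enumerate(cos_simtest[movie_user_index]))
    sorted_sim_movies=sorted(sim_movies,key=lambda x: x[1],reverse= True)
    # Keep the 5 most similar movies, leaving out the chosen movie itself
    top_movies=[movie for movie in sorted_sim_movies if movie[0]!=movie_user_index][:5]
    for movie in top_movies:
        print(get_title(movie[0]))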
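Because the function above reads from `input()` and relies on global variables, it is awkward to check in isolation. A non-interactive variant could look like the following sketch, where `recommend` is a hypothetical helper and the titles and similarity values are made up for illustration.

```python
import numpy as np
import pandas as pd

def recommend(title, titles, sim, top_n=5):
    """Return the top_n titles most similar to `title` (hypothetical helper)."""
    titles = pd.Series(titles).reset_index(drop=True)
    matches = np.flatnonzero((titles == title).to_numpy())
    if matches.size == 0:
        raise ValueError(f"No movie titled {title!r}")
    pos = matches[0]
    # Sort positions by decreasing similarity and drop the query movie itself
    order = np.argsort(-sim[pos], kind="stable")
    order = order[order != pos][:top_n]
    return titles.iloc[order].tolist()

# Made-up titles and a symmetric similarity matrix (hypothetical data)
titles = ["A", "B", "C", "D"]
sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])
print(recommend("A", titles, sim, top_n=2))
```

Passing the titles and the matrix as arguments keeps the row positions of both aligned by construction, which is the property the recommendation depends on.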

In [43]:
movie_recomendation()

Please write your choice here...Ironman


IndexError: index 0 is out of bounds for axis 0 with size 0
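The `IndexError` means that the title filter matched no rows, so taking element `[0]` of the empty result failed. The lookup is an exact, case-sensitive string comparison, and "Ironman" evidently matches no title exactly. A more forgiving lookup could normalise both sides before comparing. The sketch below is one possible approach, not the notebook's method; `normalise` and `find_positions` are hypothetical helpers and the titles are made up.

```python
import re
import numpy as np
import pandas as pd

def normalise(text):
    # Lowercase and remove spaces and punctuation (Unicode letters are kept)
    return re.sub(r"\W+", "", str(text).lower())

def find_positions(query, titles):
    """Row positions whose normalised title equals the normalised query."""
    keys = titles.map(normalise)
    return np.flatnonzero((keys == normalise(query)).to_numpy())

# Made-up titles (hypothetical data)
titles = pd.Series(["Iron Man", "Tigertail", "The Sky Is Pink"])

print(find_positions("Ironman", titles))   # one match despite spacing and case
print(find_positions("Unknown", titles))   # empty array: report "not found" instead of indexing [0]
```

Checking whether the returned array is empty before indexing lets the caller print a friendly message rather than fail with a traceback.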