## Content-Based Recommender System

In [1]:
import numpy as np 
import pandas as pd

The TMDB (The Movie Database) dataset provides two commonly used files:

1. **Movies Dataset**:  
   This dataset contains general information about movies. It includes features like:
   - **Title**: The name of the movie.
   - **Genres**: A list of genres associated with the movie.
   - **Overview**: A short summary or synopsis.
   - **Release Date**: The date when the movie was released.
   - **Popularity**: A score reflecting the popularity of the movie.
   - **Vote Average/Count**: Aggregated ratings and the number of votes it received.
   - **Runtime**: The duration of the movie.
   - **Language**: The language(s) in which the movie is available.
   - **Production Companies**: Information about the studios involved in the movie's creation.

2. **Credits Dataset**:  
   This dataset contains information about the people associated with the movie. Key features include:
   - **Cast**: A list of actors and their respective roles in the movie.
   - **Crew**: A list of crew members, including their roles (e.g., director, producer, writer).
   - **Movie ID**: A unique identifier linking the credits data to the movies dataset.

### Relationship Between the Two:
- **Movies dataset** provides metadata about the movie itself.
- **Credits dataset** focuses on the people (cast and crew) involved in the movie's production.
- The two datasets can be merged using the **Movie ID** field for comprehensive analyses or building applications like recommender systems.
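
As a minimal sketch of a join on the identifier, the example below uses two tiny made-up frames rather than the real CSV files. It assumes the key is called `id` in the movies file and `movie_id` in the credits file, which matches the column names printed later in this notebook. The `validate` argument asks pandas to raise an error if the join is not one-to-one. The title-based join is included only for comparison, to show how repeated titles multiply rows.

```python
import pandas as pd

# Toy stand-ins for the two files; the rows are invented for illustration.
movies_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Alpha", "Beta", "Beta"],  # two different movies share a title
})
credits_toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Alpha", "Beta", "Beta"],
    "cast": ["[]", "[]", "[]"],
})

# Joining on the identifier keeps exactly one row per movie.
by_id = movies_toy.merge(
    credits_toy.drop(columns="title"),
    left_on="id",
    right_on="movie_id",
    validate="one_to_one",
)

# Joining on title pairs every "Beta" row with every other "Beta" row.
by_title = movies_toy.merge(credits_toy, on="title")

print(len(by_id), len(by_title))
```

Joining on the identifier is usually the safer choice when titles can repeat, for example with remakes.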

In [2]:
movies = pd.read_csv('data/tmdb_5000_movies.csv')
credits = pd.read_csv('data/tmdb_5000_credits.csv')

In [3]:
movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [4]:
movies.shape

(4803, 20)

In [5]:
credits.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [6]:
credits.shape

(4803, 4)

### Merging the datasets

Merging the movies and credits datasets gives a single table that holds both the metadata and the cast and crew for each movie, which makes the rest of the data analysis simpler.


In [7]:
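# Merge on the shared "title" column.
# Note: titles are not unique, so the merged frame has a few more rows than the inputs (4803 -> 4809 below).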

movies = movies.merge(credits,on='title')

In [8]:
movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,206647,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,49026,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,49529,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [9]:
movies.shape

(4809, 23)

In [10]:
movies.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count', 'movie_id', 'cast', 'crew'],
      dtype='object')

#### Let's select the most relevant columns for our consideration

In [11]:
# Keeping important columns for recommendation

movies = movies[['movie_id','title','overview','genres','keywords','cast','crew']]

In [12]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [13]:
movies.shape

(4809, 7)

In [14]:
# Let's check for missing values

movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [15]:
movies.dropna(inplace=True)
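
Because `movies` was created by selecting columns from a larger frame, an in-place `dropna` can trigger a `SettingWithCopyWarning` in some pandas versions. A hedged alternative, shown here on a toy frame with invented values, is to reassign the result and restrict the check to the column that actually has gaps.

```python
import pandas as pd

# Toy frame with one missing overview; the values are invented.
toy = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma"],
    "overview": ["A story.", None, "Another story."],
})

# Reassign instead of using inplace=True, and only test the 'overview' column.
cleaned = toy.dropna(subset=["overview"]).reset_index(drop=True)

print(cleaned.shape)
```

Resetting the index is optional; the notebook itself keeps the original row labels.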

In [16]:
movies.isnull().sum()

movie_id    0
title       0
overview    0
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [17]:
movies.shape

(4806, 7)

### Let's now check for duplicate rows within our dataset

In [18]:
movies.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
4804    False
4805    False
4806    False
4807    False
4808    False
Length: 4806, dtype: bool

In [19]:
movies.duplicated().sum()

0
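
`duplicated()` with no arguments only flags rows that match in every column, so a count of zero does not rule out repeated titles. The sketch below, on an invented toy frame, shows how the `subset` and `keep` parameters can be used to look for repeats in a single column.

```python
import pandas as pd

# Toy frame; the rows are invented for illustration.
toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Alpha", "Beta", "Beta"],
})

# No two rows are identical across every column...
full_row_dupes = toy.duplicated().sum()

# ...but two rows share a title. keep=False flags every member of the group.
title_dupes = toy.duplicated(subset=["title"], keep=False).sum()

print(full_row_dupes, title_dupes)
```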

In [20]:
movies["genres"].head(10)

0    [{"id": 28, "name": "Action"}, {"id": 12, "nam...
1    [{"id": 12, "name": "Adventure"}, {"id": 14, "...
2    [{"id": 28, "name": "Action"}, {"id": 12, "nam...
3    [{"id": 28, "name": "Action"}, {"id": 80, "nam...
4    [{"id": 28, "name": "Action"}, {"id": 12, "nam...
5    [{"id": 14, "name": "Fantasy"}, {"id": 28, "na...
6    [{"id": 16, "name": "Animation"}, {"id": 10751...
7    [{"id": 28, "name": "Action"}, {"id": 12, "nam...
8    [{"id": 12, "name": "Adventure"}, {"id": 14, "...
9    [{"id": 28, "name": "Action"}, {"id": 12, "nam...
Name: genres, dtype: object

In [21]:
# handle genres

movies['genres'].iloc[0]

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [22]:
movies['genres'].iloc[2]

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 80, "name": "Crime"}]'

In [23]:
movies['genres'].iloc[3]

'[{"id": 28, "name": "Action"}, {"id": 80, "name": "Crime"}, {"id": 18, "name": "Drama"}, {"id": 53, "name": "Thriller"}]'

### Now let us create a function to extract the values of the key "name" from the "genres" column

In [24]:
import ast

# Example text (string representation of a list of dictionaries)
text = '[{"name": "Action"}, {"name": "Comedy"}, {"name": "Drama"}]'

# Use literal_eval to convert the string into a Python object (list of dictionaries)
ast.literal_eval(text)


[{'name': 'Action'}, {'name': 'Comedy'}, {'name': 'Drama'}]

In [25]:
import ast #for converting str to list

def convert(text):
    L = []
    for dictionary in ast.literal_eval(text):
        L.append(dictionary['name']) 
    return L
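
The cell values are JSON text. `ast.literal_eval` works here because the values shown contain only strings and numbers, which are written the same way in JSON and in Python; it would fail on JSON-only literals such as `true` or `null`. As a hedged alternative (not the notebook's method), a parser based on the standard `json` module could look like the sketch below. The name `convert_json` and the choice to return an empty list on bad input are assumptions made for illustration.

```python
import json

def convert_json(text):
    """Return the 'name' values from a JSON-encoded list of dictionaries."""
    try:
        items = json.loads(text)
    except (TypeError, json.JSONDecodeError):
        # Missing or malformed cell: fall back to an empty list.
        return []
    return [item["name"] for item in items if "name" in item]

print(convert_json('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'))
```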

In [26]:
movies['genres'] = movies['genres'].apply(convert)

In [27]:
movies['genres'].iloc[3]


['Action', 'Crime', 'Drama', 'Thriller']

In [28]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [29]:
# handle keywords
movies.iloc[0]['keywords']

'[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}, {"id": 3388, "name": "space colony"}, {"id": 3679, "name": "society"}, {"id": 3801, "name": "space travel"}, {"id": 9685, "name": "futuristic"}, {"id": 9840, "name": "romance"}, {"id": 9882, "name": "space"}, {"id": 9951, "name": "alien"}, {"id": 10148, "name": "tribe"}, {"id": 10158, "name": "alien planet"}, {"id": 10987, "name": "cgi"}, {"id": 11399, "name": "marine"}, {"id": 13065, "name": "soldier"}, {"id": 14643, "name": "battle"}, {"id": 14720, "name": "love affair"}, {"id": 165431, "name": "anti war"}, {"id": 193554, "name": "power relations"}, {"id": 206690, "name": "mind and soul"}, {"id": 209714, "name": "3d"}]'

### Let's apply the convert function to the "keywords" column to keep only the value of each "name" key

In [30]:
movies['keywords'] = movies['keywords'].apply(convert)
movies['keywords']

0       [culture clash, future, space war, space colon...
1       [ocean, drug abuse, exotic island, east india ...
2       [spy, based on novel, secret agent, sequel, mi...
3       [dc comics, crime fighter, terrorist, secret i...
4       [based on novel, mars, medallion, space travel...
                              ...                        
4804    [united states–mexico barrier, legs, arms, pap...
4805                                                   []
4806    [date, love at first sight, narration, investi...
4807                                                   []
4808            [obsession, camcorder, crush, dream girl]
Name: keywords, Length: 4806, dtype: object

In [31]:
movies['keywords'].iloc[3]


['dc comics',
 'crime fighter',
 'terrorist',
 'secret identity',
 'burglar',
 'hostage drama',
 'time bomb',
 'gotham city',
 'vigilante',
 'cover-up',
 'superhero',
 'villainess',
 'tragic hero',
 'terrorism',
 'destruction',
 'catwoman',
 'cat burglar',
 'imax',
 'flood',
 'criminal underworld',
 'batman']

In [32]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [33]:
# handle cast
movies.iloc[0]['cast']

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

### Here we want to keep the names of the first three cast members from the 'cast' column

In [34]:
# Let's keep the first three cast members

def convert_cast(text):
    # Parse the string into a list of dictionaries, then keep the names
    # of the first three entries (fewer if the list is shorter)
    return [member['name'] for member in ast.literal_eval(text)[:3]]

In [35]:
movies['cast'] = movies['cast'].apply(convert_cast)
movies['cast']

0        [Sam Worthington, Zoe Saldana, Sigourney Weaver]
1           [Johnny Depp, Orlando Bloom, Keira Knightley]
2            [Daniel Craig, Christoph Waltz, Léa Seydoux]
3            [Christian Bale, Michael Caine, Gary Oldman]
4          [Taylor Kitsch, Lynn Collins, Samantha Morton]
                              ...                        
4804    [Carlos Gallardo, Jaime de Hoyos, Peter Marqua...
4805         [Edward Burns, Kerry Bishé, Marsha Dietlein]
4806           [Eric Mabius, Kristin Booth, Crystal Lowe]
4807            [Daniel Henney, Eliza Coupe, Bill Paxton]
4808    [Drew Barrymore, Brian Herzlinger, Corey Feldman]
Name: cast, Length: 4806, dtype: object

In [36]:
movies['cast'].iloc[0]

['Sam Worthington', 'Zoe Saldana', 'Sigourney Weaver']

In [37]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


### Let's handle the crew

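We want to extract the director of each movie. The function below stops at the first crew member whose job is `Director`, so at most one name is kept per row.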

In [38]:
# handle crew

movies['crew'].iloc[0]

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

In [39]:
def fetch_director(text):
    L = []
    for dictionary in ast.literal_eval(text):
        if dictionary['job'] == 'Director':
            L.append(dictionary['name'])
            break
    return L
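
Some films credit more than one director. If every director should be kept rather than only the first, a variant could drop the `break`. The sketch below is an illustrative assumption (the name `fetch_directors` and the sample string are invented), not part of the original notebook.

```python
import ast

def fetch_directors(text):
    """Return the names of all crew members whose job is 'Director'."""
    return [
        member["name"]
        for member in ast.literal_eval(text)
        if member.get("job") == "Director"
    ]

# Invented sample in the same shape as a 'crew' cell.
sample = (
    '[{"job": "Editor", "name": "A. Editor"}, '
    '{"job": "Director", "name": "First Director"}, '
    '{"job": "Director", "name": "Second Director"}]'
)
print(fetch_directors(sample))
```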

In [40]:
movies['crew'] = movies['crew'].apply(fetch_director)
movies['crew']

0           [James Cameron]
1          [Gore Verbinski]
2              [Sam Mendes]
3       [Christopher Nolan]
4          [Andrew Stanton]
               ...         
4804     [Robert Rodriguez]
4805         [Edward Burns]
4806          [Scott Smith]
4807          [Daniel Hsia]
4808     [Brian Herzlinger]
Name: crew, Length: 4806, dtype: object

In [41]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]


### Let's handle the overview column



In [42]:
# handle overview (converting to list)

movies['overview'].iloc[0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

In [43]:
# Split each overview into a list of whitespace-separated words

movies['overview'] = movies['overview'].apply(lambda x:x.split())
movies['overview']

0       [In, the, 22nd, century,, a, paraplegic, Marin...
1       [Captain, Barbossa,, long, believed, to, be, d...
2       [A, cryptic, message, from, Bond’s, past, send...
3       [Following, the, death, of, District, Attorney...
4       [John, Carter, is, a, war-weary,, former, mili...
                              ...                        
4804    [El, Mariachi, just, wants, to, play, his, gui...
4805    [A, newlywed, couple's, honeymoon, is, upended...
4806    ["Signed,, Sealed,, Delivered", introduces, a,...
4807    [When, ambitious, New, York, attorney, Sam, is...
4808    [Ever, since, the, second, grade, when, he, fi...
Name: overview, Length: 4806, dtype: object

In [44]:
movies['overview'].iloc[0]

['In',
 'the',
 '22nd',
 'century,',
 'a',
 'paraplegic',
 'Marine',
 'is',
 'dispatched',
 'to',
 'the',
 'moon',
 'Pandora',
 'on',
 'a',
 'unique',
 'mission,',
 'but',
 'becomes',
 'torn',
 'between',
 'following',
 'orders',
 'and',
 'protecting',
 'an',
 'alien',
 'civilization.']
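
Note that `str.split()` leaves punctuation attached to words (`'century,'`, `'civilization.'`), so `'century,'` and `'century'` would be different tokens if these lists were compared directly. Whether that matters depends on how the text is tokenized in later steps. As a hedged sketch, a small regex-based tokenizer (the name `tokenize` is an assumption for illustration) could lowercase the text and drop surrounding punctuation:

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of word characters and apostrophes."""
    return re.findall(r"[\w']+", text.lower())

print(tokenize("In the 22nd century, a paraplegic Marine is dispatched."))
```

Hyphenated words such as `war-weary` are split into two tokens by this pattern.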

In [45]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]


In [46]:
movies.sample(5)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
3661,298312,The Visit,"[The, terrifying, story, of, a, brother, and, ...","[Horror, Thriller]","[rap music, pennsylvania, brother sister relat...","[Olivia DeJonge, Ed Oxenbould, Kathryn Hahn]",[M. Night Shyamalan]
605,59981,Legends of Oz: Dorothy's Return,"[Dorothy, wakes, up, in, post-tornado, Kansas,...","[Animation, Music, Family]",[],"[Lea Michele, Dan Aykroyd, Patrick Stewart]",[Dan St. Pierre]
3008,12182,Nick and Norah's Infinite Playlist,"[Nick, cannot, stop, obsessing, over, his, ex-...","[Comedy, Music, Romance]","[concert, teenager, one night, based on young ...","[Michael Cera, Kat Dennings, Aaron Yoo]",[Peter Sollett]
3844,11363,She's the One,"[Mickey,, a, free-spirited, New, York, cabbie,...","[Comedy, Romance]","[brother brother relationship, taxi, ex-girlfr...","[Edward Burns, Michael McGlone, Cameron Diaz]",[Edward Burns]
4339,1366,Rocky,"[When, world, heavyweight, boxing, champion,, ...",[Drama],"[underdog, philadelphia, transporter, italo-am...","[Sylvester Stallone, Talia Shire, Burt Young]",[John G. Avildsen]


### Let's remove the spaces inside multi-word names and phrases so that each one becomes a single token during tokenization

In [47]:
# Remove the space inside each name or phrase, e.g.
# 'Gabriel John' -> 'GabrielJohn'


def remove_space(L):
    L1 = []
    for words in L:
        L1.append(words.replace(" ",""))
    return L1

In [48]:

movies['cast'] = movies['cast'].apply(remove_space)
movies['crew'] = movies['crew'].apply(remove_space)
movies['genres'] = movies['genres'].apply(remove_space)
movies['keywords'] = movies['keywords'].apply(remove_space)

### **Why did we need to remove the spaces inside each name and phrase?**

In a **content-based recommendation system**, the text data (genres, cast names, director names, and keywords) is eventually vectorized so that similarities between movies can be computed. Vectorizers typically split text on whitespace, so a space inside a name or phrase changes what counts as a token:

- **Text Tokenization**: `'Gabriel John'` would be split into two tokens, `'Gabriel'` and `'John'`, whereas `'GabrielJohn'` stays a single token that identifies one person.

- **Vectorization Issues**: With techniques like TF-IDF, split names create misleading shared features. Two movies whose cast lists contain different people with the same first name would appear to have a token in common, which makes the similarity scores less accurate.

- **String Matching**: Joining the parts ensures that two movies match on a person, genre, or keyword only when the whole name or phrase is identical, for example `'ScienceFiction'` rather than the separate words `'Science'` and `'Fiction'`.
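
The effect can be seen with a small pure-Python sketch. The helper `tokens` is a hypothetical stand-in for a whitespace tokenizer, and the two names are taken from the cast and crew outputs shown earlier.

```python
def tokens(names, keep_spaces):
    """Join a list of names into one string and split it on whitespace."""
    joined = " ".join(n if keep_spaces else n.replace(" ", "") for n in names)
    return set(joined.lower().split())

movie_a = ["Sam Worthington"]  # a cast member
movie_b = ["Sam Mendes"]       # a director

shared_with_spaces = tokens(movie_a, True) & tokens(movie_b, True)
shared_without = tokens(movie_a, False) & tokens(movie_b, False)

print(shared_with_spaces)  # {'sam'}
print(shared_without)      # set()
```

With spaces kept, the two movies appear to share the token `sam` even though they involve different people; with spaces removed, they share nothing.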


### Sources:
- **Content-Based Recommender Systems**: [Towards Data Science on Medium](https://towardsdatascience.com/understanding-content-based-filtering-for-recommendation-systems-631488cf3b66)

In [49]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan]
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]


In [50]:
# Concatenate all
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']
movies['tags']

0       [In, the, 22nd, century,, a, paraplegic, Marin...
1       [Captain, Barbossa,, long, believed, to, be, d...
2       [A, cryptic, message, from, Bond’s, past, send...
3       [Following, the, death, of, District, Attorney...
4       [John, Carter, is, a, war-weary,, former, mili...
                              ...                        
4804    [El, Mariachi, just, wants, to, play, his, gui...
4805    [A, newlywed, couple's, honeymoon, is, upended...
4806    ["Signed,, Sealed,, Delivered", introduces, a,...
4807    [When, ambitious, New, York, attorney, Sam, is...
4808    [Ever, since, the, second, grade, when, he, fi...
Name: tags, Length: 4806, dtype: object

### **Why did we need to convert everything into one column named `tags`?**

The reason for concatenating different textual features into one column (`tags`) is to **simplify the feature set** and enable better similarity calculations:

- **Improved Similarity Measures**: Content-based recommendation relies on comparing items (movies) based on their content, typically using cosine similarity or other distance metrics. By combining the `overview`, `genres`, `keywords`, `cast`, and `crew` into a single column, you're ensuring that all of the available information is used when calculating the similarity between movies. Each of these attributes contributes to the "identity" of the movie, and when combined, they provide a richer representation.

- **Simplified Text Processing**: Combining the columns into one reduces complexity in subsequent steps of the processing pipeline. Instead of processing each column individually, you can treat the movie as a single "document" of textual information. This helps streamline the model, especially when using techniques like **TF-IDF** or **word embeddings** for text vectorization.

- **Feature Fusion**: The `tags` column essentially fuses different sources of information (genres, cast, keywords, etc.) that may help in making recommendations. By concatenating these features, you're essentially creating a **unified feature** that captures all aspects of the movie's content in one place. This can lead to better similarity matching and more relevant recommendations.

- **Scalability**: A single text column means there is one vectorizer and one feature matrix to fit, store, and update, rather than a separate pipeline for each column.

In conclusion, **removing spaces** ensures accurate text processing, and **combining everything into one column (`tags`)** allows you to utilize all of the movie's content when determining similarities, making the recommendation system more comprehensive and efficient.
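
As a small illustration of what the concatenation does for a single movie, the sketch below uses a plain dictionary in place of a DataFrame row. The shortened token lists and the `FEATURES` constant are assumptions made for the example. Adding Python lists with `+` appends one to the other, which is what the column-wise `+` in the cell above does row by row.

```python
# One movie, with every feature already converted to a list of tokens.
row = {
    "overview": ["In", "the", "22nd", "century,"],
    "genres": ["Action", "Adventure", "Fantasy", "ScienceFiction"],
    "keywords": ["cultureclash", "future"],
    "cast": ["SamWorthington", "ZoeSaldana", "SigourneyWeaver"],
    "crew": ["JamesCameron"],
}

# Order in which the features are fused into a single "document".
FEATURES = ["overview", "genres", "keywords", "cast", "crew"]

tags = []
for feature in FEATURES:
    tags = tags + row[feature]  # list + list concatenates

print(tags)
print(len(tags))  # 14
```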

### Sources:
- **Text Preprocessing in NLP**: [KDnuggets](https://www.kdnuggets.com/2018/12/essential-guide-text-preprocessing.html)

In [51]:
movies['tags'].iloc[0]

['In',
 'the',
 '22nd',
 'century,',
 'a',
 'paraplegic',
 'Marine',
 'is',
 'dispatched',
 'to',
 'the',
 'moon',
 'Pandora',
 'on',
 'a',
 'unique',
 'mission,',
 'but',
 'becomes',
 'torn',
 'between',
 'following',
 'orders',
 'and',
 'protecting',
 'an',
 'alien',
 'civilization.',
 'Action',
 'Adventure',
 'Fantasy',
 'ScienceFiction',
 'cultureclash',
 'future',
 'spacewar',
 'spacecolony',
 'society',
 'spacetravel',
 'futuristic',
 'romance',
 'space',
 'alien',
 'tribe',
 'alienplanet',
 'cgi',
 'marine',
 'soldier',
 'battle',
 'loveaffair',
 'antiwar',
 'powerrelations',
 'mindandsoul',
 '3d',
 'SamWorthington',
 'ZoeSaldana',
 'SigourneyWeaver',
 'JamesCameron']

In [52]:
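# Let's create a new dataframe (.copy() makes it independent of `movies`,
# which avoids pandas' SettingWithCopyWarning on later column assignments)
Content_Based_df = movies[['movie_id','title','tags']].copy()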
Content_Based_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."


In [53]:
# Converting list to str
Content_Based_df['tags'] = Content_Based_df['tags'].apply(lambda x: " ".join(x))
Content_Based_df.head()



Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


In [54]:
Content_Based_df['tags'].iloc[0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'

## Why is it necessary to convert the list into a plain string in a content-based recommendation system?

### 1. **Compatibility with Text Vectorization Techniques**
   Text vectorization tools such as **TF-IDF** (Term Frequency-Inverse Document Frequency) and **Count Vectorizer** expect, by default, **one string of text** per document/item, which they tokenize themselves before converting it into a numerical vector of word counts or weights. A Python list of tokens is not accepted with the default settings, so joining the list of words (tags) into a single string is the simplest way to use these vectorizers, which are crucial for computing similarities between movies based on their content.

### 2. **Similarity Calculations**
   Metrics like **cosine similarity** or **Euclidean distance** operate on numerical vectors, not on text, so every movie must first be turned into a vector of the same length. Joining each movie's tags into one string gives the vectorizer a single, unified "document" per row, and the vectors it produces can then be compared directly.

### 3. **Text Processing and Machine Learning Models**
   Most machine learning models, especially those designed for **Natural Language Processing (NLP)**, require the data to be in a string format to properly process and analyze the content. The `join()` operation is a simple method to ensure the data is in a usable form for text-based models, which generally take one continuous string of words as input.

### 4. **Data Representation**
   In data analysis, it's easier to manipulate and analyze data when it is in a consistent format. Lists within a DataFrame column can make it more difficult to apply operations like **text cleaning, filtering**, or **pattern matching**. By converting the lists to strings, you simplify the representation of the data and make it easier to apply various text processing functions, such as removing stopwords, stemming, or lemmatizing.
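
A rough sketch of the points above, using only the standard library. The regular expression is the default `token_pattern` documented for scikit-learn's `CountVectorizer` (words of two or more characters). The shortened tag list is an assumption for the example.

```python
import re

# A shortened tag list for one movie.
tags = ["In", "the", "22nd", "century,", "a", "paraplegic", "Marine",
        "ScienceFiction", "JamesCameron"]

# Joining gives one string, so ordinary string methods and regex tokenizers apply.
document = " ".join(tags)

# Default token pattern of scikit-learn's CountVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

tokens = TOKEN_PATTERN.findall(document.lower())

print(document)
print(tokens)
```

Note that the pattern drops the single-character word `a` and the comma after `century`, which matches the cleaned-up vocabulary entries seen later in the notebook.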


### Sources:
- **TF-IDF and text vectorization**: [DataCamp](https://www.datacamp.com/community/tutorials/tutorial-implement-tf-idf-python)
- **Cosine similarity and its application in recommendation systems**: [GeeksforGeeks](https://www.geeksforgeeks.org/cosine-similarity-python/)
- **Text Preprocessing and NLP**: [Towards Data Science](https://towardsdatascience.com/text-preprocessing-in-nlp-steps-tools-and-techniques-5efebcdbf3b3)

In [55]:
# Let's convert the contents of the "tags" column to lower case

Content_Based_df['tags'] = Content_Based_df['tags'].apply(lambda x:x.lower())



In [56]:
Content_Based_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."


In [57]:
Content_Based_df['tags'].iloc[0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron'

### Let's now develop the prediction model

In [58]:
import nltk
from nltk.stem import PorterStemmer as ps
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pickle


In [59]:
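stemmer = ps()  # build the Porter stemmer once and reuse it for every word

def stems(text):
    """Stem each whitespace-separated word in `text` and re-join with spaces."""
    return " ".join(stemmer.stem(word) for word in text.split())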

In [60]:
Content_Based_df['tags'] = Content_Based_df['tags'].apply(stems)



In [61]:
Content_Based_df['tags'].iloc[0]

'in the 22nd century, a parapleg marin is dispatch to the moon pandora on a uniqu mission, but becom torn between follow order and protect an alien civilization. action adventur fantasi sciencefict cultureclash futur spacewar spacecoloni societi spacetravel futurist romanc space alien tribe alienplanet cgi marin soldier battl loveaffair antiwar powerrel mindandsoul 3d samworthington zoesaldana sigourneyweav jamescameron'
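
In the output above, `century,` and `civilization.` still carry their punctuation and appear to be left unstemmed, most likely because `text.split()` keeps trailing punctuation attached to the word. One possible refinement, sketched here with a hypothetical helper `clean_tokens`, is to strip punctuation from each token before passing it to the stemmer.

```python
import string


def clean_tokens(text):
    """Split on whitespace and strip leading/trailing punctuation from each token."""
    stripped = (token.strip(string.punctuation) for token in text.split())
    # Drop tokens that consisted only of punctuation.
    return [token for token in stripped if token]


sample = "in the 22nd century, on a unique mission, an alien civilization."
print(clean_tokens(sample))
```

Punctuation inside a word (for example the hyphen in `war-weary`) is left alone, since only the ends of each token are stripped.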

#### Let's convert the 'tags' column, which contains text data, into numerical vectors: one vector per movie, holding the count of each vocabulary word in that movie's tags. These vectors will then be used for the content-based similarity calculations in the recommendation system.


In [62]:
cv = CountVectorizer(max_features=5000,stop_words='english')
vector = cv.fit_transform(Content_Based_df['tags']).toarray()
vector[0]

array([0, 0, 0, ..., 0, 0, 0])
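
To make the idea of a count vector concrete, here is a simplified, standard-library sketch of bag-of-words vectorization. It is not how `CountVectorizer` is implemented. The function `count_vectorize` and the toy documents are assumptions for illustration, and stop-word removal is left out.

```python
from collections import Counter

docs = [
    "action adventur space alien",
    "action crime spy",
    "space alien space battl",
]


def count_vectorize(documents, max_features=None):
    """Return (vocabulary, vectors) with one count vector per document."""
    tokenized = [doc.split() for doc in documents]
    totals = Counter(token for tokens in tokenized for token in tokens)
    # Keep the most frequent terms, then sort them for a stable column order.
    kept = [term for term, _ in totals.most_common(max_features)]
    vocabulary = sorted(kept)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = count_vectorize(docs)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each column corresponds to one vocabulary word and each row to one document, which is the same layout as the `(4806, 5000)` matrix produced in the notebook.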

### Let's get the vocabulary mapping

In [63]:
# Get the vocabulary as an array (column index -> word)
vocab = cv.get_feature_names_out()
non_zero_indices = vector[0].nonzero()[0]

# Map indices to words in the vocabulary
matching_words = [vocab[i] for i in non_zero_indices]
matching_words


['3d',
 'action',
 'adventur',
 'alien',
 'alienplanet',
 'battl',
 'becom',
 'century',
 'cultureclash',
 'dispatch',
 'fantasi',
 'follow',
 'futur',
 'futurist',
 'jamescameron',
 'marin',
 'mission',
 'moon',
 'order',
 'pandora',
 'protect',
 'romanc',
 'sciencefict',
 'sigourneyweav',
 'societi',
 'soldier',
 'space',
 'spacetravel',
 'torn',
 'tribe',
 'uniqu',
 'zoesaldana']

In [64]:
vector.shape

(4806, 5000)

Let's confirm the number of features, since we specified `max_features=5000`

In [None]:
len(cv.get_feature_names_out())



5000

### Let's compute the pairwise similarity between movies

In [66]:
similarity = cosine_similarity(vector)
similarity

array([[1.        , 0.08346223, 0.0860309 , ..., 0.04499213, 0.        ,
        0.        ],
       [0.08346223, 1.        , 0.06063391, ..., 0.02378257, 0.        ,
        0.02615329],
       [0.0860309 , 0.06063391, 1.        , ..., 0.02451452, 0.        ,
        0.        ],
       ...,
       [0.04499213, 0.02378257, 0.02451452, ..., 1.        , 0.03962144,
        0.04229549],
       [0.        , 0.        , 0.        , ..., 0.03962144, 1.        ,
        0.08714204],
       [0.        , 0.02615329, 0.        , ..., 0.04229549, 0.08714204,
        1.        ]])

In [67]:
similarity.shape

(4806, 4806)
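
For reference, the cosine similarity of two vectors a and b is their dot product divided by the product of their lengths: cos(a, b) = (a · b) / (‖a‖ ‖b‖). The sketch below computes it by hand for three toy count vectors (assumed values, not taken from the dataset). Returning 0.0 for an all-zero vector is a convention chosen for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: empty documents match nothing
    return dot / (norm_a * norm_b)


# Toy count vectors for three documents.
vectors = [
    [1, 1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 2, 0],
]

# Pairwise matrix: entry [i][j] compares document i with document j.
similarity_matrix = [[cosine(a, b) for b in vectors] for a in vectors]

for row in similarity_matrix:
    print([round(value, 3) for value in row])
```

The matrix is symmetric with ones on the diagonal, which explains the shape `(4806, 4806)` above: every movie is compared with every other movie, including itself.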

### Now let's create a function for our content-based recommender

In [68]:
Content_Based_df['title']

0                                         Avatar
1       Pirates of the Caribbean: At World's End
2                                        Spectre
3                          The Dark Knight Rises
4                                    John Carter
                          ...                   
4804                                 El Mariachi
4805                                   Newlyweds
4806                   Signed, Sealed, Delivered
4807                            Shanghai Calling
4808                           My Date with Drew
Name: title, Length: 4806, dtype: object

In [70]:
Content_Based_df[Content_Based_df['title'] == 'Spectre']

Unnamed: 0,movie_id,title,tags
2,206647,Spectre,a cryptic messag from bond’ past send him on a...


In [71]:
Content_Based_df[Content_Based_df['title'] == 'Newlyweds']

Unnamed: 0,movie_id,title,tags
4805,72766,Newlyweds,a newlyw couple' honeymoon is upend by the arr...


In [74]:
# Let's get the index label of a movie (after earlier row drops, the label can differ from the row's position in `similarity`)

index = Content_Based_df[Content_Based_df['title'] == 'Newlyweds'].index[0]
index

4805

### Let's pair each movie's index with its similarity score to the selected movie.

In [75]:
list(enumerate(similarity[index]))


[(0, 0.0),
 (1, 0.02615328904829707),
 (2, 0.0),
 (3, 0.08633531446374632),
 (4, 0.04448840533001291),
 (5, 0.04075695729696112),
 (6, 0.0),
 (7, 0.045980048987170286),
 (8, 0.1113692092779409),
 (9, 0.022733144649015782),
 (10, 0.12056070554260302),
 (11, 0.04448840533001291),
 (12, 0.02831827358942995),
 (13, 0.0),
 (14, 0.024112141108520606),
 (15, 0.07874992309581577),
 (16, 0.0),
 (17, 0.037542552788913885),
 (18, 0.10709164570498712),
 (19, 0.019525441139613995),
 (20, 0.10911909431527576),
 (21, 0.0515539262226467),
 (22, 0.03112864031823452),
 (23, 0.0),
 (24, 0.07521183158458504),
 (25, 0.12788955112115236),
 (26, 0.0),
 (27, 0.05857632341884199),
 (28, 0.05477910356647767),
 (29, 0.02037847864848056),
 (30, 0.08378915848902159),
 (31, 0.0),
 (32, 0.060072129859745485),
 (33, 0.0),
 (34, 0.0),
 (35, 0.03112864031823452),
 (36, 0.023255813953488372),
 (37, 0.024738534799764674),
 (38, 0.1143739277494535),
 (39, 0.02577696311132335),
 (40, 0.04822428221704121),
 (41, 0.066732607

### Let's see how to sort the similarity list

`key=lambda x: x[0]` sorts the tuples by their first element (the movie's position), which ignores the similarity score.

`key=lambda x: x[1]` sorts by the second element (the similarity score); with `reverse=True`, the most similar movies come first, which is the order needed for recommendations.
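
The difference between the two keys is easy to see on a tiny list. The scores below are made up for illustration.

```python
# Similarity of one movie to five movies (toy values); index 2 is the movie itself.
scores = [0.0, 0.26, 1.0, 0.08, 0.23]
pairs = list(enumerate(scores))

# Key x[0]: ordered by index, so the scores end up in no useful order.
by_index = sorted(pairs, reverse=True, key=lambda x: x[0])

# Key x[1]: ordered by score, so the best matches come first.
by_score = sorted(pairs, reverse=True, key=lambda x: x[1])

print(by_index)  # [(4, 0.23), (3, 0.08), (2, 1.0), (1, 0.26), (0, 0.0)]
print(by_score)  # [(2, 1.0), (1, 0.26), (4, 0.23), (3, 0.08), (0, 0.0)]
```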

In [76]:
distances = sorted(list(enumerate(similarity[index])),reverse=True,key = lambda x: x[0])
distances

[(4805, 1.0000000000000002),
 (4804, 0.0871420401900598),
 (4803, 0.042295493443781355),
 (4802, 0.0),
 (4801, 0.12668775432006435),
 (4800, 0.07587087446739768),
 (4799, 0.024738534799764674),
 (4798, 0.03970724559546876),
 (4797, 0.0),
 (4796, 0.0),
 (4795, 0.0),
 (4794, 0.08804509063256237),
 (4793, 0.0),
 (4792, 0.0),
 (4791, 0.0),
 (4790, 0.0),
 (4789, 0.1143739277494535),
 (4788, 0.0),
 (4787, 0.11095900821829638),
 (4786, 0.05247835714059896),
 (4785, 0.07397267214553091),
 (4784, 0.08151391459392224),
 (4783, 0.09983374884595828),
 (4782, 0.024112141108520606),
 (4781, 0.05477910356647767),
 (4780, 0.11717224905430869),
 (4779, 0.02934836354418746),
 (4778, 0.14159136794714974),
 (4777, 0.050832856777534886),
 (4776, 0.0),
 (4775, 0.0),
 (4774, 0.0),
 (4773, 0.1113692092779409),
 (4772, 0.022484687520664393),
 (4771, 0.0),
 (4770, 0.0),
 (4769, 0.03251280443811775),
 (4768, 0.11627906976744186),
 (4767, 0.034099716973523674),
 (4766, 0.04849444837696853),
 (4765, 0.038124642583

In [77]:
distances = sorted(list(enumerate(similarity[index])),reverse=True,key = lambda x: x[1])
distances

[(4805, 1.0000000000000002),
 (2534, 0.26687249808205815),
 (3815, 0.2346903478277816),
 (4529, 0.22421317963806006),
 (4451, 0.2114774672189068),
 (193, 0.20736758755125534),
 (3065, 0.20691022044226628),
 (3734, 0.20339830145074406),
 (4603, 0.1985362279773438),
 (2295, 0.1979082783981174),
 (2960, 0.19760804978495328),
 (1907, 0.19687480773953941),
 (868, 0.19525441139613994),
 (4453, 0.19525441139613994),
 (3507, 0.19289712886816485),
 (4585, 0.19172686248267184),
 (2040, 0.19062321291575585),
 (4674, 0.1903297204970161),
 (2420, 0.1867718419094071),
 (1070, 0.18604651162790697),
 (4383, 0.18088778123376797),
 (2491, 0.18021638957923647),
 (3192, 0.18021638957923647),
 (4743, 0.178754329258837),
 (2692, 0.17795362132005163),
 (3573, 0.1779149987213721),
 (4103, 0.17609018126512474),
 (2867, 0.17572897025652595),
 (4275, 0.17572897025652595),
 (1953, 0.17454775580151477),
 (3419, 0.17430604019629448),
 (4260, 0.17316974359835272),
 (1678, 0.17157429638972982),
 (3271, 0.171560891624

In [78]:
def recommend(movie):
    # Use the row *position*, not the index label: rows were dropped earlier,
    # so labels no longer line up with the rows of `similarity`.
    position = np.flatnonzero(Content_Based_df['title'].values == movie)[0]
    distances = sorted(enumerate(similarity[position]), reverse=True, key=lambda x: x[1])
    # Skip the first entry (the input movie itself) and print the top 5 similar movies.
    for i in distances[1:6]:
        print(Content_Based_df.iloc[i[0]].title)

In [80]:
recommend('Superman')

Superman Returns
Superman II
Iron Man 2
Superman III
Superman IV: The Quest for Peace
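
Sorting all scores just to keep five of them is more work than strictly needed. A possible alternative, sketched here with a hypothetical helper `top_k_similar` and toy scores, uses `heapq.nlargest` and excludes the query movie by position instead of assuming it always sorts first.

```python
import heapq


def top_k_similar(scores, position, k=5):
    """Return the k (index, score) pairs with the highest score, skipping `position` itself."""
    candidates = ((i, score) for i, score in enumerate(scores) if i != position)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# One row of a similarity matrix (toy values); position 1 is the query movie.
row = [0.10, 1.00, 0.27, 0.00, 0.23, 0.22, 0.05]

print(top_k_similar(row, position=1, k=3))  # [(2, 0.27), (4, 0.23), (5, 0.22)]
```

Excluding by position also behaves sensibly if another movie has identical tags and therefore ties with the query movie at a score of 1.0.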


### Now let us save our processed dataframe and similarity matrix into the artifacts folder so they can be loaded at prediction time

In [None]:
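# Dump the dataframe (the context manager closes the file once the write completes)
with open('artifacts/movie_list.pkl', 'wb') as f:
    pickle.dump(Content_Based_df, f)

# Dump the similarity matrix
with open('artifacts/similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)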
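
For completeness, here is a minimal sketch of the save-and-load round trip, using a temporary directory and a toy matrix so it can run anywhere. The helper names `save_artifact` and `load_artifact` are assumptions, not part of the notebook. In the real application the paths would be the `artifacts/*.pkl` files written above.

```python
import os
import pickle
import tempfile


def save_artifact(obj, path):
    """Serialize `obj` to `path` with pickle."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load_artifact(path):
    """Load a pickled object; only use this on files you created yourself."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as artifacts_dir:
    path = os.path.join(artifacts_dir, "similarity.pkl")
    toy_similarity = [[1.0, 0.5], [0.5, 1.0]]  # stand-in for the real matrix
    save_artifact(toy_similarity, path)
    restored = load_artifact(path)

print(restored == toy_similarity)  # True
```

Unpickling can execute arbitrary code, so these files should only be loaded from a trusted source.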