## Group 4 - Netflix Movie Recommender | Jay-An, Huu Huy Anh, Luis and Anmol

## 1. Installing Packages and Installing Libraries

In [1]:
# Necessary Packages:

#!pip install rake-nltk
#!pip install pandas
#!pip install nltk
#!pip install scikit-learn
#!install requests
#!pip install numpy
#!pip install rake-nltk
#!pip install gensim
#nltk.download('punkt')
#nltk.download('wordnet')
#nltk.download('stopwords')
#nltk.download('omw-1.4')

In [2]:
import pandas as pd
import ast 
import nltk 
import string
import re
import requests
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances, manhattan_distances
from nltk.stem import WordNetLemmatizer
from scipy.sparse import hstack
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from rake_nltk import Rake
from gensim.utils import tokenize
from gensim.parsing.preprocessing import remove_stopwords
from pprint import PrettyPrinter
from nltk.tokenize import RegexpTokenizer
from gensim.models import Word2Vec

## 2. Importing DataFrames

This section of code imports the data we need and displays a few rows for checking purposes. 

In [3]:
# Reads .csv files:
credits_df = pd.read_csv('data/tmdb_5000_credits.csv')
movies_df = pd.read_csv('data/tmdb_5000_movies.csv')

In [4]:
movies_df.head(3)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466


In [5]:
credits_df.head(3)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."


## 3. Joining DataFrames

This section of code merges the "credits" dataframe and "movies" dataframe that were imported earlier into a new dataframe called "df_not_cleaned" for further use.

In [6]:
# Renames movie_id column to id
credits_df = credits_df.rename(columns={'movie_id': 'id'})

# Joins dataframes
df_not_cleaned = pd.merge(movies_df, credits_df, on='id', how='inner')
df_not_cleaned = df_not_cleaned[['id', 'genres', 'original_title', 'title_x', 'title_y', 'keywords','overview', 'popularity', 'release_date', 'runtime', 'vote_average', 'vote_count', 'cast', 'crew','production_companies']]

# Displays dataframe
df_not_cleaned.head(3)

Unnamed: 0,id,genres,original_title,title_x,title_y,keywords,overview,popularity,release_date,runtime,vote_average,vote_count,cast,crew,production_companies
0,19995,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",Avatar,Avatar,Avatar,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di...",150.437577,2009-12-10,162.0,7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...","[{""name"": ""Ingenious Film Partners"", ""id"": 289..."
1,285,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",Pirates of the Caribbean: At World's End,Pirates of the Caribbean: At World's End,Pirates of the Caribbean: At World's End,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","Captain Barbossa, long believed to be dead, ha...",139.082615,2007-05-19,169.0,6.9,4500,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""..."
2,206647,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",Spectre,Spectre,Spectre,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",A cryptic message from Bond’s past sends him o...,107.376788,2015-10-26,148.0,6.3,4466,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...","[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam..."


## 4. Cleaning DataFrame

This section of code is responsible for cleaning the dataset. First, we check for null data and rows with 0 runtime, using an API key to fill in missing runtime and overview information from the TMDB website. Since there was only one null data point for the release date, we decided to drop it. 

Next, we created functions to extract specific data from multiple columns, convert text to lowercase, remove spaces from certain column values, combine relevant data into new columns, and rename/drop columns for clarity. We then created a new dataframe with the cleaned data. Finally, we joined the "tag", "tag_genres", and "tag_ppl" lists into single string values for further processing.  

## 4.a. Droppping runtime and release_date null rows

In [7]:
# Displays movies that has 0 or null in 'runtime' and 'vote_average' 
for index, row in df_not_cleaned.iterrows():
    if pd.isnull(row['runtime']) or row['runtime'] == 0 or pd.isnull(row['vote_average']) or row['vote_average'] == 0:
        print(row['id'], row['title_x'], row['runtime'], row['vote_average'])

53953 The Tooth Fairy 0.0 4.3
310706 Black Water Transit 100.0 0.0
370980 Chiamatemi Francesco - Il Papa della gente nan 7.3
41894 Blood Done Sign My Name 0.0 6.0
113406 Should've Been Romeo 0.0 0.0
447027 Running Forever 88.0 0.0
158150 How to Fall in Love 0.0 5.2
395766 The Secret 200.0 0.0
370662 Time to Choose 100.0 0.0
281230 Fort McCoy 0.0 6.3
170480 The Deported 90.0 0.0
79587 Four Single Fathers 100.0 0.0
346081 Sardaarji 0.0 9.5
433715 8 Days 90.0 0.0
364083 Mi America 126.0 0.0
371085 Sharkskin 0.0 0.0
325140 Hum To Mohabbat Karega 0.0 0.0
459488 To Be Frank, Sinatra at 100 nan 0.0
386826 A Beginner's Guide to Snuff 87.0 0.0
66468 N-Secure 0.0 4.3
74084 Dil Jo Bhi Kahey... 0.0 0.0
51820 The Salon 0.0 3.5
280381 House at the End of the Drive 91.0 0.0
218500 The Ballad of Gregorio Cortez 104.0 0.0
295914 Queen of the Mountains 135.0 0.0
357834 The Algerian 99.0 0.0
114065 Down & Out With The Dolls 88.0 0.0
49951 Certifiably Jonathan 85.0 0.0
355629 The Blade of Don Juan 98.0 0.

In [8]:
# Defines function to fetch data for a single movie by ID

# Defines your TMDB API key
api_key = 'a38b4a8ed24b9edec801f0bc153f0177'

def get_movie_data(movie_id, column_name):
    url = f'https://api.themoviedb.org/3/movie/{movie_id}?api_key={api_key}&language=en-US'
    response = requests.get(url)
    if response.status_code == 200:
        data = response.json()
        return data[column_name]
    else:
        return None

In [9]:
# Loops through each movie and fill in 'runtime' and 'vote_avearge' using get_movie_data function
for i, row in df_not_cleaned.iterrows():
    if pd.isnull(row['runtime']) or row['runtime'] == 0:
        runtime = get_movie_data(row['id'], 'runtime')
        if runtime:
            df_not_cleaned.at[i, 'runtime'] = runtime
            
    if pd.isnull(row['vote_average']) or row['vote_average'] == 0:
        vote_average = get_movie_data(row['id'], 'vote_average')
        if vote_average:
            df_not_cleaned.at[i, 'vote_average'] = vote_average

In [10]:
# Checks if there are movies with 'runtime' equal 0 or null
for index, row in df_not_cleaned.iterrows():
    if pd.isnull(row['runtime']) or row['runtime'] == 0:
        print(row['id'], row['title_x'], row['runtime'])

41894 Blood Done Sign My Name 0.0
113406 Should've Been Romeo 0.0
281230 Fort McCoy 0.0
51820 The Salon 0.0
310933 Bleeding Hearts 0.0
325579 Diamond Ruff 0.0
328307 Rise of the Entrepreneur: The Search for a Better Way 0.0
320435 UnDivided 0.0


In [11]:
# Checks if there are movies with 'vote_average' equal 0 or null
for index, row in df_not_cleaned.iterrows():
    if pd.isnull(row['vote_average']) or row['vote_average'] == 0:
        print(row['id'], row['title_x'], row['vote_average'])

395766 The Secret 0.0
170480 The Deported 0.0
364083 Mi America 0.0
371085 Sharkskin 0.0
296943 The Hadza:  Last of the First 0.0
181940 Carousel of Revenge 0.0
331493 Light from the Darkroom 0.0
43743 Fabled 0.0
300327 Death Calls 0.0
378237 Amidst the Devil's Wings 0.0
320435 UnDivided 0.0
376010 Western Religion 0.0
194588 Short Cut to Nirvana: Kumbh Mela 0.0
361398 Theresa Is a Mother 0.0
288927 Archaeology of a Woman 0.0
354624 Heroes of Dirt 0.0
282128 An American in Hollywood 0.0
266857 The Work and The Story 0.0
366967 Dutch Kills 0.0


In [12]:
# There are no data online for the remaining 0s in 'runtime' and 'vote_average' columns
# Drops movies that have 'runtime' equal 0
# Drops movies that have 'release_date' null
# Drops movies that have 'vote_average' equal 0

df_not_cleaned = df_not_cleaned[df_not_cleaned['runtime'] != 0]
df_not_cleaned.dropna(subset=['release_date'], inplace=True)
df_not_cleaned.drop(df_not_cleaned[df_not_cleaned['vote_average'] == 0].index, inplace=True)

df_not_cleaned.isnull().sum() #checking if any null data left

id                       0
genres                   0
original_title           0
title_x                  0
title_y                  0
keywords                 0
overview                31
popularity               0
release_date             0
runtime                  0
vote_average             0
vote_count               0
cast                     0
crew                     0
production_companies     0
dtype: int64

In [13]:
# Displays dataframe
df_not_cleaned.head(3)

Unnamed: 0,id,genres,original_title,title_x,title_y,keywords,overview,popularity,release_date,runtime,vote_average,vote_count,cast,crew,production_companies
0,19995,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",Avatar,Avatar,Avatar,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di...",150.437577,2009-12-10,162.0,7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...","[{""name"": ""Ingenious Film Partners"", ""id"": 289..."
1,285,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",Pirates of the Caribbean: At World's End,Pirates of the Caribbean: At World's End,Pirates of the Caribbean: At World's End,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","Captain Barbossa, long believed to be dead, ha...",139.082615,2007-05-19,169.0,6.9,4500,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""..."
2,206647,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",Spectre,Spectre,Spectre,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",A cryptic message from Bond’s past sends him o...,107.376788,2015-10-26,148.0,6.3,4466,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...","[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam..."


## 4.b. Dealing with "overview" null rows

In [14]:
# Creates new dataframe with movies that are null in 'overview' column 
null_overview = df_not_cleaned[df_not_cleaned["overview"].isnull()]

# Upon further analysis, we decided to only keep 'title_x' as 'title_y' is the same thing as 'title_x'
# and 'original_title' shows names that are not in english
# Drops 'title_y' and 'original_title' columns
df_not_cleaned = df_not_cleaned.drop(columns=['title_y','original_title'])

# Prints out all movie titles that have NAN in 'overview' column 
null_overview["title_x"] 

65                                        The Dark Knight
77                                             Inside Out
94                                Guardians of the Galaxy
95                                           Interstellar
96                                              Inception
262     The Lord of the Rings: The Fellowship of the Ring
287                                      Django Unchained
298                               The Wolf of Wall Street
329         The Lord of the Rings: The Return of the King
330                 The Lord of the Rings: The Two Towers
494                                         The Lion King
634                                            The Matrix
662                                            Fight Club
690                                        The Green Mile
809                                          Forrest Gump
1553                                                Se7en
1818                                     Schindler's List
1881          

In [15]:
# Loops through each movie and fill in 'overview' using get_movie_data function
for i, row in df_not_cleaned.iterrows():
    if pd.isnull(row['overview']):
        overview = get_movie_data(row['id'], 'overview')
        if overview:
            df_not_cleaned.at[i, 'overview'] = overview  

In [16]:
# Checks if there are any null data in overview column
null_overview = df_not_cleaned[df_not_cleaned["overview"].isnull()]
null_overview["title_x"]  # No Null data on overview.

Series([], Name: title_x, dtype: object)

## 4.c. Creating Functions to Clean Rows

In [17]:
# Cleaning Functions:

def extract(lst): # Function to extract values from a dictionary
    feat = []
    for i in ast.literal_eval(lst):
        feat.append(i['name'])        
    return feat

def get_names (lst): # Function to extract first 3 values from a dictionary
    feat = []
    counter = 0 
    for i in ast.literal_eval(lst):
        if counter != 3:
            feat.append(i['name'])
            counter += 1
        else:
            break
    return feat

def get_director(lst): # Function to extract director name from crew column
    feat = []
    for i in ast.literal_eval(lst):
        if i['job'] == 'Director':
            feat.append(i['name'])
    return feat

## 4.d. Cleaning Cast, Crew, Genres, Keywords, and Production Company Columns

In [18]:
# Extracts data from multiple columns and converts to lowercase:

# 1 Cast - ONLY TOP 3 Actor/Actress
df_not_cleaned['cast_names'] = df_not_cleaned['cast'].apply(lambda x: get_names(x.lower()))
df_not_cleaned = df_not_cleaned.drop(columns=['cast']) #cast is now cast_names

# 2 Crew - ONLY Director
df_not_cleaned['crew_names'] = df_not_cleaned['crew'].apply(get_director)
df_not_cleaned['crew_names'] = df_not_cleaned['crew_names'].apply(lambda x: [name.lower() for name in x])
df_not_cleaned = df_not_cleaned.drop(columns=['crew']) #crew is now crew_names

# 3 Genres - ONLY TOP 3 Genres
df_not_cleaned['genres_names'] = df_not_cleaned['genres'].apply(lambda x: get_names(x.lower()))
df_not_cleaned = df_not_cleaned.drop(columns=['genres']) #genres is now genres_names

# 4 Keywords - All
df_not_cleaned['keywords_names'] = df_not_cleaned['keywords'].apply(lambda x: extract(x.lower()))
df_not_cleaned = df_not_cleaned.drop(columns=['keywords'])#keywords is now keywords_names

# 5 Production Companies - All
df_not_cleaned['production_companies_names'] = df_not_cleaned['production_companies'].apply(lambda x: extract(x.lower()))
df_not_cleaned = df_not_cleaned.drop(columns=['production_companies'])#production_companies is now Production_companiesnames_names

# Sets the maximum column width to display full column contents
pd.set_option('max_colwidth', None)

# Displays dataframe
df_not_cleaned.head(3)

Unnamed: 0,id,title_x,overview,popularity,release_date,runtime,vote_average,vote_count,cast_names,crew_names,genres_names,keywords_names,production_companies_names
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.",150.437577,2009-12-10,162.0,7.2,11800,"[sam worthington, zoe saldana, sigourney weaver]",[james cameron],"[action, adventure, fantasy]","[culture clash, future, space war, space colony, society, space travel, futuristic, romance, space, alien, tribe, alien planet, cgi, marine, soldier, battle, love affair, anti war, power relations, mind and soul, 3d]","[ingenious film partners, twentieth century fox film corporation, dune entertainment, lightstorm entertainment]"
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, has come back to life and is headed to the edge of the Earth with Will Turner and Elizabeth Swann. But nothing is quite as it seems.",139.082615,2007-05-19,169.0,6.9,4500,"[johnny depp, orlando bloom, keira knightley]",[gore verbinski],"[adventure, fantasy, action]","[ocean, drug abuse, exotic island, east india trading company, love of one's life, traitor, shipwreck, strong woman, ship, alliance, calypso, afterlife, fighter, pirate, swashbuckler, aftercreditsstinger]","[walt disney pictures, jerry bruckheimer films, second mate productions]"
2,206647,Spectre,"A cryptic message from Bond’s past sends him on a trail to uncover a sinister organization. While M battles political forces to keep the secret service alive, Bond peels back the layers of deceit to reveal the terrible truth behind SPECTRE.",107.376788,2015-10-26,148.0,6.3,4466,"[daniel craig, christoph waltz, léa seydoux]",[sam mendes],"[action, adventure, crime]","[spy, based on novel, secret agent, sequel, mi6, british secret service, united kingdom]","[columbia pictures, danjaq, b24]"


In [19]:
# Converts overview text to lowercase and split into a list of words
df_not_cleaned['overview'] = df_not_cleaned['overview'].apply(lambda x: x.lower().split())

# Removes spaces from genre, keyword, and production company names
df_not_cleaned['genres_names'] = df_not_cleaned['genres_names'].apply(lambda x:[i.replace(" ","") for i in x])
df_not_cleaned['keywords_names'] = df_not_cleaned['keywords_names'].apply(lambda x:[i.replace(" ","") for i in x])
df_not_cleaned['production_companies_names'] = df_not_cleaned['production_companies_names'].apply(lambda x:[i.replace(" ","") for i in x])

# Displays dataframe
df_not_cleaned.head(3)

Unnamed: 0,id,title_x,overview,popularity,release_date,runtime,vote_average,vote_count,cast_names,crew_names,genres_names,keywords_names,production_companies_names
0,19995,Avatar,"[in, the, 22nd, century,, a, paraplegic, marine, is, dispatched, to, the, moon, pandora, on, a, unique, mission,, but, becomes, torn, between, following, orders, and, protecting, an, alien, civilization.]",150.437577,2009-12-10,162.0,7.2,11800,"[sam worthington, zoe saldana, sigourney weaver]",[james cameron],"[action, adventure, fantasy]","[cultureclash, future, spacewar, spacecolony, society, spacetravel, futuristic, romance, space, alien, tribe, alienplanet, cgi, marine, soldier, battle, loveaffair, antiwar, powerrelations, mindandsoul, 3d]","[ingeniousfilmpartners, twentiethcenturyfoxfilmcorporation, duneentertainment, lightstormentertainment]"
1,285,Pirates of the Caribbean: At World's End,"[captain, barbossa,, long, believed, to, be, dead,, has, come, back, to, life, and, is, headed, to, the, edge, of, the, earth, with, will, turner, and, elizabeth, swann., but, nothing, is, quite, as, it, seems.]",139.082615,2007-05-19,169.0,6.9,4500,"[johnny depp, orlando bloom, keira knightley]",[gore verbinski],"[adventure, fantasy, action]","[ocean, drugabuse, exoticisland, eastindiatradingcompany, loveofone'slife, traitor, shipwreck, strongwoman, ship, alliance, calypso, afterlife, fighter, pirate, swashbuckler, aftercreditsstinger]","[waltdisneypictures, jerrybruckheimerfilms, secondmateproductions]"
2,206647,Spectre,"[a, cryptic, message, from, bond’s, past, sends, him, on, a, trail, to, uncover, a, sinister, organization., while, m, battles, political, forces, to, keep, the, secret, service, alive,, bond, peels, back, the, layers, of, deceit, to, reveal, the, terrible, truth, behind, spectre.]",107.376788,2015-10-26,148.0,6.3,4466,"[daniel craig, christoph waltz, léa seydoux]",[sam mendes],"[action, adventure, crime]","[spy, basedonnovel, secretagent, sequel, mi6, britishsecretservice, unitedkingdom]","[columbiapictures, danjaq, b24]"


## 4.e. Organizing DataFrame

In [20]:
# Combines overview, genres, keywords, and production company into one column "tag"
# Renames "genres_names" to "tag_genres"
# Combines "cast_names", "crew_names" into one column "tag_ppl"

df_not_cleaned['tag'] = df_not_cleaned['overview']+df_not_cleaned['genres_names']+df_not_cleaned['keywords_names']+df_not_cleaned['production_companies_names']
df_not_cleaned['tag_genres'] = df_not_cleaned['genres_names']
df_not_cleaned['tag_ppl'] = df_not_cleaned['cast_names']+df_not_cleaned['crew_names']

# Displays dataframe
df_not_cleaned.head(3)

Unnamed: 0,id,title_x,overview,popularity,release_date,runtime,vote_average,vote_count,cast_names,crew_names,genres_names,keywords_names,production_companies_names,tag,tag_genres,tag_ppl
0,19995,Avatar,"[in, the, 22nd, century,, a, paraplegic, marine, is, dispatched, to, the, moon, pandora, on, a, unique, mission,, but, becomes, torn, between, following, orders, and, protecting, an, alien, civilization.]",150.437577,2009-12-10,162.0,7.2,11800,"[sam worthington, zoe saldana, sigourney weaver]",[james cameron],"[action, adventure, fantasy]","[cultureclash, future, spacewar, spacecolony, society, spacetravel, futuristic, romance, space, alien, tribe, alienplanet, cgi, marine, soldier, battle, loveaffair, antiwar, powerrelations, mindandsoul, 3d]","[ingeniousfilmpartners, twentiethcenturyfoxfilmcorporation, duneentertainment, lightstormentertainment]","[in, the, 22nd, century,, a, paraplegic, marine, is, dispatched, to, the, moon, pandora, on, a, unique, mission,, but, becomes, torn, between, following, orders, and, protecting, an, alien, civilization., action, adventure, fantasy, cultureclash, future, spacewar, spacecolony, society, spacetravel, futuristic, romance, space, alien, tribe, alienplanet, cgi, marine, soldier, battle, loveaffair, antiwar, powerrelations, mindandsoul, 3d, ingeniousfilmpartners, twentiethcenturyfoxfilmcorporation, duneentertainment, lightstormentertainment]","[action, adventure, fantasy]","[sam worthington, zoe saldana, sigourney weaver, james cameron]"
1,285,Pirates of the Caribbean: At World's End,"[captain, barbossa,, long, believed, to, be, dead,, has, come, back, to, life, and, is, headed, to, the, edge, of, the, earth, with, will, turner, and, elizabeth, swann., but, nothing, is, quite, as, it, seems.]",139.082615,2007-05-19,169.0,6.9,4500,"[johnny depp, orlando bloom, keira knightley]",[gore verbinski],"[adventure, fantasy, action]","[ocean, drugabuse, exoticisland, eastindiatradingcompany, loveofone'slife, traitor, shipwreck, strongwoman, ship, alliance, calypso, afterlife, fighter, pirate, swashbuckler, aftercreditsstinger]","[waltdisneypictures, jerrybruckheimerfilms, secondmateproductions]","[captain, barbossa,, long, believed, to, be, dead,, has, come, back, to, life, and, is, headed, to, the, edge, of, the, earth, with, will, turner, and, elizabeth, swann., but, nothing, is, quite, as, it, seems., adventure, fantasy, action, ocean, drugabuse, exoticisland, eastindiatradingcompany, loveofone'slife, traitor, shipwreck, strongwoman, ship, alliance, calypso, afterlife, fighter, pirate, swashbuckler, aftercreditsstinger, waltdisneypictures, jerrybruckheimerfilms, secondmateproductions]","[adventure, fantasy, action]","[johnny depp, orlando bloom, keira knightley, gore verbinski]"
2,206647,Spectre,"[a, cryptic, message, from, bond’s, past, sends, him, on, a, trail, to, uncover, a, sinister, organization., while, m, battles, political, forces, to, keep, the, secret, service, alive,, bond, peels, back, the, layers, of, deceit, to, reveal, the, terrible, truth, behind, spectre.]",107.376788,2015-10-26,148.0,6.3,4466,"[daniel craig, christoph waltz, léa seydoux]",[sam mendes],"[action, adventure, crime]","[spy, basedonnovel, secretagent, sequel, mi6, britishsecretservice, unitedkingdom]","[columbiapictures, danjaq, b24]","[a, cryptic, message, from, bond’s, past, sends, him, on, a, trail, to, uncover, a, sinister, organization., while, m, battles, political, forces, to, keep, the, secret, service, alive,, bond, peels, back, the, layers, of, deceit, to, reveal, the, terrible, truth, behind, spectre., action, adventure, crime, spy, basedonnovel, secretagent, sequel, mi6, britishsecretservice, unitedkingdom, columbiapictures, danjaq, b24]","[action, adventure, crime]","[daniel craig, christoph waltz, léa seydoux, sam mendes]"


In [21]:
# Creates new dataframe with cleaned columns we needed for recommendation system 
df_cleaned = df_not_cleaned[['id','title_x','tag','tag_genres','tag_ppl','runtime','vote_average']]

# Joins 'tag', 'tag_genres', and 'tag_ppl' lists into single string values
df_cleaned['tag'] = df_cleaned['tag'].apply(lambda x:' '.join(x))
df_cleaned['tag_genres'] = df_cleaned['tag_genres'].apply(lambda x:' '.join(x))
df_cleaned['tag_ppl'] = df_cleaned['tag_ppl'].apply(lambda x:' '.join(x))

# Renames 'title_x' to 'title'
df_cleaned = df_cleaned.rename(columns={'title_x': 'title'})

# Displays new dataframe
df_cleaned.head(3)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned['tag'] = df_cleaned['tag'].apply(lambda x:' '.join(x))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned['tag_genres'] = df_cleaned['tag_genres'].apply(lambda x:' '.join(x))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned['tag_ppl'] = df_cleaned['tag_ppl'].apply(lambda 

Unnamed: 0,id,title,tag,tag_genres,tag_ppl,runtime,vote_average
0,19995,Avatar,"in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment",action adventure fantasy,sam worthington zoe saldana sigourney weaver james cameron,162.0,7.2
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions",adventure fantasy action,johnny depp orlando bloom keira knightley gore verbinski,169.0,6.9
2,206647,Spectre,"a cryptic message from bond’s past sends him on a trail to uncover a sinister organization. while m battles political forces to keep the secret service alive, bond peels back the layers of deceit to reveal the terrible truth behind spectre. action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24",action adventure crime,daniel craig christoph waltz léa seydoux sam mendes,148.0,6.3


## 5. Text Mining Preprocessing Techniques

This section of code is responsible for text mining processing. The techniques included in this section are remove punctuation, tokenization, stop words, and lemmatization. In the previous section, we already applied lower case while extracting the data.

## 5.a. Removing Punctuation

In [22]:
# Creates copy of cleaned dataframe
df_clean_rp = df_cleaned

# Defines function to remove punctuation
def remove_punctuation(text):
    if isinstance(text, str):
        return text.translate(str.maketrans('', '', string.punctuation))
    return text

# Applies the function to 'tag', 'tag_genres', and 'tag_ppl' columns
df_clean_rp[['RP_tag','RP_tag_genres','RP_tag_ppl']] = df_clean_rp[['tag','tag_genres','tag_ppl']].apply(remove_punctuation)

# Displays dataframe
df_clean_rp.head(3)

Unnamed: 0,id,title,tag,tag_genres,tag_ppl,runtime,vote_average,RP_tag,RP_tag_genres,RP_tag_ppl
0,19995,Avatar,"in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment",action adventure fantasy,sam worthington zoe saldana sigourney weaver james cameron,162.0,7.2,"in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment",action adventure fantasy,sam worthington zoe saldana sigourney weaver james cameron
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions",adventure fantasy action,johnny depp orlando bloom keira knightley gore verbinski,169.0,6.9,"captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions",adventure fantasy action,johnny depp orlando bloom keira knightley gore verbinski
2,206647,Spectre,"a cryptic message from bond’s past sends him on a trail to uncover a sinister organization. while m battles political forces to keep the secret service alive, bond peels back the layers of deceit to reveal the terrible truth behind spectre. action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24",action adventure crime,daniel craig christoph waltz léa seydoux sam mendes,148.0,6.3,"a cryptic message from bond’s past sends him on a trail to uncover a sinister organization. while m battles political forces to keep the secret service alive, bond peels back the layers of deceit to reveal the terrible truth behind spectre. action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24",action adventure crime,daniel craig christoph waltz léa seydoux sam mendes


## 5.b. Tokenization

In [23]:
# Creates copy of cleaned dataframe
df_clean_token = df_cleaned

# Splits the text based on commas, Removes empty strings/extra whitespace, and Converts to lowercase
def tokenize_string(text):
    tokens = re.split(r',\s*', str(text))
    tokens = [token.strip() for token in tokens if token.strip()]
    tokens = [token.lower() for token in tokens]
    return tokens

# Applies function to 'tag', 'tag_genres', and 'tag_ppl' columns
df_clean_token[['tokenized_tag','tokenized_tag_genres','tokenized_tag_ppl']] = df_clean_token[['tag','tag_genres','tag_ppl']].applymap(tokenize_string)

# Displays dataframe
df_clean_token.head(3)

Unnamed: 0,id,title,tag,tag_genres,tag_ppl,runtime,vote_average,RP_tag,RP_tag_genres,RP_tag_ppl,tokenized_tag,tokenized_tag_genres,tokenized_tag_ppl
0,19995,Avatar,"in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment",action adventure fantasy,sam worthington zoe saldana sigourney weaver james cameron,162.0,7.2,"in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment",action adventure fantasy,sam worthington zoe saldana sigourney weaver james cameron,"[in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment]",[action adventure fantasy],[sam worthington zoe saldana sigourney weaver james cameron]
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions",adventure fantasy action,johnny depp orlando bloom keira knightley gore verbinski,169.0,6.9,"captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions",adventure fantasy action,johnny depp orlando bloom keira knightley gore verbinski,"[captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions]",[adventure fantasy action],[johnny depp orlando bloom keira knightley gore verbinski]
2,206647,Spectre,"a cryptic message from bond’s past sends him on a trail to uncover a sinister organization. while m battles political forces to keep the secret service alive, bond peels back the layers of deceit to reveal the terrible truth behind spectre. action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24",action adventure crime,daniel craig christoph waltz léa seydoux sam mendes,148.0,6.3,"a cryptic message from bond’s past sends him on a trail to uncover a sinister organization. while m battles political forces to keep the secret service alive, bond peels back the layers of deceit to reveal the terrible truth behind spectre. action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24",action adventure crime,daniel craig christoph waltz léa seydoux sam mendes,"[a cryptic message from bond’s past sends him on a trail to uncover a sinister organization. while m battles political forces to keep the secret service alive, bond peels back the layers of deceit to reveal the terrible truth behind spectre. action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24]",[action adventure crime],[daniel craig christoph waltz léa seydoux sam mendes]


## 5.c. Removing Stop Words

In [24]:
# Creates copy of cleaned dataframe
df_clean_stopwords = df_cleaned

# Creates set of stopwords for English language
stop_words = set(stopwords.words('english'))

# Defines function to remove stopwords
def remove_stopwords(text):
    tokens = word_tokenize(text.lower())
    filtered_tokens = [token for token in tokens if not token in stop_words]
    return ' '.join(filtered_tokens)

# Applies Stop Words to Overview Column
df_clean_stopwords['tag_sw'] = df_clean_stopwords['tag'].apply(remove_stopwords)

# Displays dataframe
df_clean_stopwords.head(3)

Unnamed: 0,id,title,tag,tag_genres,tag_ppl,runtime,vote_average,RP_tag,RP_tag_genres,RP_tag_ppl,tokenized_tag,tokenized_tag_genres,tokenized_tag_ppl,tag_sw
0,19995,Avatar,"in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment",action adventure fantasy,sam worthington zoe saldana sigourney weaver james cameron,162.0,7.2,"in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment",action adventure fantasy,sam worthington zoe saldana sigourney weaver james cameron,"[in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment]",[action adventure fantasy],[sam worthington zoe saldana sigourney weaver james cameron],"22nd century , paraplegic marine dispatched moon pandora unique mission , becomes torn following orders protecting alien civilization . action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment"
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions",adventure fantasy action,johnny depp orlando bloom keira knightley gore verbinski,169.0,6.9,"captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions",adventure fantasy action,johnny depp orlando bloom keira knightley gore verbinski,"[captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions]",[adventure fantasy action],[johnny depp orlando bloom keira knightley gore verbinski],"captain barbossa , long believed dead , come back life headed edge earth turner elizabeth swann . nothing quite seems . adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions"
2,206647,Spectre,"a cryptic message from bond’s past sends him on a trail to uncover a sinister organization. while m battles political forces to keep the secret service alive, bond peels back the layers of deceit to reveal the terrible truth behind spectre. action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24",action adventure crime,daniel craig christoph waltz léa seydoux sam mendes,148.0,6.3,"a cryptic message from bond’s past sends him on a trail to uncover a sinister organization. while m battles political forces to keep the secret service alive, bond peels back the layers of deceit to reveal the terrible truth behind spectre. action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24",action adventure crime,daniel craig christoph waltz léa seydoux sam mendes,"[a cryptic message from bond’s past sends him on a trail to uncover a sinister organization. while m battles political forces to keep the secret service alive, bond peels back the layers of deceit to reveal the terrible truth behind spectre. action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24]",[action adventure crime],[daniel craig christoph waltz léa seydoux sam mendes],"cryptic message bond ’ past sends trail uncover sinister organization . battles political forces keep secret service alive , bond peels back layers deceit reveal terrible truth behind spectre . action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24"


## 5.d. Lemmatization

In [25]:
# Creates copy of cleaned dataframe
df_clean_lm = df_cleaned

# Creates instances for lemmatization
lemmatizer = WordNetLemmatizer()

# Defines funtion to lemmatize text 
def lemmatize(tokens):
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(lemmatized_tokens)

# Applies Lemmatization to 'tag_sw', 'tokenized_tag_genres', and 'tokenized_tag_ppl' columns
df_clean_lm['tag_sw_lem'] = df_clean_lm['tag_sw'].apply(word_tokenize).apply(lemmatize)
df_clean_lm['tag_genres_lem'] = df_clean_lm['tokenized_tag_genres'].apply(lemmatize)
df_clean_lm['tag_ppl_lem'] = df_clean_lm['tokenized_tag_ppl'].apply(lemmatize)

# Displays dataframe
df_clean_lm.head(3)

Unnamed: 0,id,title,tag,tag_genres,tag_ppl,runtime,vote_average,RP_tag,RP_tag_genres,RP_tag_ppl,tokenized_tag,tokenized_tag_genres,tokenized_tag_ppl,tag_sw,tag_sw_lem,tag_genres_lem,tag_ppl_lem
0,19995,Avatar,"in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment",action adventure fantasy,sam worthington zoe saldana sigourney weaver james cameron,162.0,7.2,"in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment",action adventure fantasy,sam worthington zoe saldana sigourney weaver james cameron,"[in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment]",[action adventure fantasy],[sam worthington zoe saldana sigourney weaver james cameron],"22nd century , paraplegic marine dispatched moon pandora unique mission , becomes torn following orders protecting alien civilization . action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment","22nd century , paraplegic marine dispatched moon pandora unique mission , becomes torn following order protecting alien civilization . action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment",action adventure fantasy,sam worthington zoe saldana sigourney weaver james cameron
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions",adventure fantasy action,johnny depp orlando bloom keira knightley gore verbinski,169.0,6.9,"captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions",adventure fantasy action,johnny depp orlando bloom keira knightley gore verbinski,"[captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions]",[adventure fantasy action],[johnny depp orlando bloom keira knightley gore verbinski],"captain barbossa , long believed dead , come back life headed edge earth turner elizabeth swann . nothing quite seems . adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions","captain barbossa , long believed dead , come back life headed edge earth turner elizabeth swann . nothing quite seems . adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions",adventure fantasy action,johnny depp orlando bloom keira knightley gore verbinski
2,206647,Spectre,"a cryptic message from bond’s past sends him on a trail to uncover a sinister organization. while m battles political forces to keep the secret service alive, bond peels back the layers of deceit to reveal the terrible truth behind spectre. action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24",action adventure crime,daniel craig christoph waltz léa seydoux sam mendes,148.0,6.3,"a cryptic message from bond’s past sends him on a trail to uncover a sinister organization. while m battles political forces to keep the secret service alive, bond peels back the layers of deceit to reveal the terrible truth behind spectre. action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24",action adventure crime,daniel craig christoph waltz léa seydoux sam mendes,"[a cryptic message from bond’s past sends him on a trail to uncover a sinister organization. while m battles political forces to keep the secret service alive, bond peels back the layers of deceit to reveal the terrible truth behind spectre. action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24]",[action adventure crime],[daniel craig christoph waltz léa seydoux sam mendes],"cryptic message bond ’ past sends trail uncover sinister organization . battles political forces keep secret service alive , bond peels back layers deceit reveal terrible truth behind spectre . action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24","cryptic message bond ’ past sends trail uncover sinister organization . battle political force keep secret service alive , bond peel back layer deceit reveal terrible truth behind spectre . action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24",action adventure crime,daniel craig christoph waltz léa seydoux sam mendes


## 5.e. Cleaning Text

In [26]:
# Creates copy of cleaned dataframe
df_clean_lm_sw_cleaned = df_cleaned

# Defines function to remove punctuation
def remove_punctuation(text):
    text = re.sub(r'[^\w\s]', '', text)
    return text

# Applies the function to 'tag_sw_lem','tag_genres_lem', and 'tag_ppl_lem' columns
df_clean_lm_sw_cleaned['tag_lem_2x_sw'] = df_clean_lm_sw_cleaned['tag_sw_lem'].apply(remove_punctuation)
df_clean_lm_sw_cleaned['tag_lem_2x_genres'] = df_clean_lm_sw_cleaned['tag_genres_lem'].apply(remove_punctuation)
df_clean_lm_sw_cleaned['tag_lem_2x_ppl'] = df_clean_lm_sw_cleaned['tag_ppl_lem'].apply(remove_punctuation)

# Displays dataframe
df_clean_lm_sw_cleaned.head(3)

Unnamed: 0,id,title,tag,tag_genres,tag_ppl,runtime,vote_average,RP_tag,RP_tag_genres,RP_tag_ppl,tokenized_tag,tokenized_tag_genres,tokenized_tag_ppl,tag_sw,tag_sw_lem,tag_genres_lem,tag_ppl_lem,tag_lem_2x_sw,tag_lem_2x_genres,tag_lem_2x_ppl
0,19995,Avatar,"in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment",action adventure fantasy,sam worthington zoe saldana sigourney weaver james cameron,162.0,7.2,"in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment",action adventure fantasy,sam worthington zoe saldana sigourney weaver james cameron,"[in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment]",[action adventure fantasy],[sam worthington zoe saldana sigourney weaver james cameron],"22nd century , paraplegic marine dispatched moon pandora unique mission , becomes torn following orders protecting alien civilization . action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment","22nd century , paraplegic marine dispatched moon pandora unique mission , becomes torn following order protecting alien civilization . action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment",action adventure fantasy,sam worthington zoe saldana sigourney weaver james cameron,22nd century paraplegic marine dispatched moon pandora unique mission becomes torn following order protecting alien civilization action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment,action adventure fantasy,sam worthington zoe saldana sigourney weaver james cameron
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions",adventure fantasy action,johnny depp orlando bloom keira knightley gore verbinski,169.0,6.9,"captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions",adventure fantasy action,johnny depp orlando bloom keira knightley gore verbinski,"[captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions]",[adventure fantasy action],[johnny depp orlando bloom keira knightley gore verbinski],"captain barbossa , long believed dead , come back life headed edge earth turner elizabeth swann . nothing quite seems . adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions","captain barbossa , long believed dead , come back life headed edge earth turner elizabeth swann . nothing quite seems . adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions",adventure fantasy action,johnny depp orlando bloom keira knightley gore verbinski,captain barbossa long believed dead come back life headed edge earth turner elizabeth swann nothing quite seems adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofoneslife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions,adventure fantasy action,johnny depp orlando bloom keira knightley gore verbinski
2,206647,Spectre,"a cryptic message from bond’s past sends him on a trail to uncover a sinister organization. while m battles political forces to keep the secret service alive, bond peels back the layers of deceit to reveal the terrible truth behind spectre. action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24",action adventure crime,daniel craig christoph waltz léa seydoux sam mendes,148.0,6.3,"a cryptic message from bond’s past sends him on a trail to uncover a sinister organization. while m battles political forces to keep the secret service alive, bond peels back the layers of deceit to reveal the terrible truth behind spectre. action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24",action adventure crime,daniel craig christoph waltz léa seydoux sam mendes,"[a cryptic message from bond’s past sends him on a trail to uncover a sinister organization. while m battles political forces to keep the secret service alive, bond peels back the layers of deceit to reveal the terrible truth behind spectre. action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24]",[action adventure crime],[daniel craig christoph waltz léa seydoux sam mendes],"cryptic message bond ’ past sends trail uncover sinister organization . battles political forces keep secret service alive , bond peels back layers deceit reveal terrible truth behind spectre . action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24","cryptic message bond ’ past sends trail uncover sinister organization . battle political force keep secret service alive , bond peel back layer deceit reveal terrible truth behind spectre . action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24",action adventure crime,daniel craig christoph waltz léa seydoux sam mendes,cryptic message bond past sends trail uncover sinister organization battle political force keep secret service alive bond peel back layer deceit reveal terrible truth behind spectre action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24,action adventure crime,daniel craig christoph waltz léa seydoux sam mendes


## 5.f. Choosing and Preparing Final Dataframe

In [27]:
# Selects relevant rows to new dataframe, Renames and Displays
df_ready_to_vector = df_clean_lm[['id', 'title', 'tag_lem_2x_sw','tag_lem_2x_genres','tag_lem_2x_ppl', 'runtime','vote_average']]
df_ready_to_vector = df_ready_to_vector.rename(columns={'tag_lem_2x_sw': 'tag','tag_lem_2x_genres': 'genres_tag','tag_lem_2x_ppl': 'ppl_tag'})

df_ready_to_vector.head(3)

Unnamed: 0,id,title,tag,genres_tag,ppl_tag,runtime,vote_average
0,19995,Avatar,22nd century paraplegic marine dispatched moon pandora unique mission becomes torn following order protecting alien civilization action adventure fantasy cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment,action adventure fantasy,sam worthington zoe saldana sigourney weaver james cameron,162.0,7.2
1,285,Pirates of the Caribbean: At World's End,captain barbossa long believed dead come back life headed edge earth turner elizabeth swann nothing quite seems adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofoneslife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger waltdisneypictures jerrybruckheimerfilms secondmateproductions,adventure fantasy action,johnny depp orlando bloom keira knightley gore verbinski,169.0,6.9
2,206647,Spectre,cryptic message bond past sends trail uncover sinister organization battle political force keep secret service alive bond peel back layer deceit reveal terrible truth behind spectre action adventure crime spy basedonnovel secretagent sequel mi6 britishsecretservice unitedkingdom columbiapictures danjaq b24,action adventure crime,daniel craig christoph waltz léa seydoux sam mendes,148.0,6.3


## 5.g. Creating Genres Matrix (Genre Feature)

In [28]:
# Extracts movie ID and genres columns
genre_df = df_ready_to_vector[['id', 'genres_tag']]

# Creates dictionary of merged genres
merged_genres = {'romanc': 'romance', 'rom': 'romance', 'thril': 'thriller', 'mysteri': 'mystery', 
                 'sciencefiction': 'sciencefict', 'act': 'action', 'famili': 'family', 'tvmovi': 'tvmovie', 
                 'documentari': 'documentary', 'anim': 'animation', 'crim': 'crime', 'mus': 'music', 
                 'adventur': 'adventure', 'comedi': 'comedy', 'histori':'history', 'fantasi':'fantasy'}

# Merges similar genres
genre_df['genres_tag'] = genre_df['genres_tag'].apply(lambda x: [merged_genres.get(genre, genre) for genre in x.split()])

# Creates set of unique genres
unique_genres = set(g for genres in genre_df['genres_tag'] for g in genres)

# Creates matrix with movie ID as row index and genre as column index
genre_matrix = pd.DataFrame(index=genre_df['id'], columns=list(unique_genres))

# Fills matrix with binary values based on the genres of each movie
for i, row in genre_df.iterrows():
    for genre in row['genres_tag']:
        genre_matrix.loc[row['id'], genre] = 1

# Fills NaN values with 0 and Displays
genre_matrix.fillna(0, inplace=True)
genre_matrix.head(3)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  genre_df['genres_tag'] = genre_df['genres_tag'].apply(lambda x: [merged_genres.get(genre, genre) for genre in x.split()])


Unnamed: 0_level_0,romance,action,horror,adventure,war,comedy,foreign,mystery,music,documentary,drama,sciencefict,thriller,family,crime,tvmovie,western,history,fantasy,animation
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
19995,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
285,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
206647,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0


## 6. Vectorization and Similarity Measurement 

This section of code is our movie recommender. System has two possible paths: "Path 1" is when the user inputs a movie title whereas "Path 2" is triggered when user inputs a name of a person (actress, director, etc.) 

Path 1: When the user inputs a movie name, the system filter the movie by genre first. It looks for other movies that match at least 70% of the genres of the other movies. After that, we manually filtered the ouput to only show results above 85% genre-matching, which essentially filters the output by half. Once the genre portion is done, we compute TdifVectorization along with Cosine similarity on the "tag" of filtered movies ("tag" is a combination of overview, keywords, production company, and other metadata). Lastly, the function adds a "runtime" filter, in the sense that we don't want to suggest movies that are either too short or too long compared to the inputted movie. The system then provides the recommendations. 

Path 2: If the user inputs a person's name, we assume that the person more concerned about seeing their favorite actress, actor, or director, and runtime nor genre necessarily matter as much. Therefore, the system first computes TdifVectorization along with Cosine similarity on the "tag" of the movies containing the person. Second, the system will calculate and sort for vote average, which represents the overall "liking" or "rating" of the movie. Finally, the system shows the recommendations.

In [29]:
# Movie Recommender Function:

# Creates TfidfVectorizer object
vectorizer = TfidfVectorizer()

def get_movie_recommendation(movie_input, top_n=5):
    
    # PATH 1
    if movie_input in df_ready_to_vector['title'].values:
        # Gets id and then genres
        movie_id = df_ready_to_vector[df_ready_to_vector['title'] == movie_input]['id'].values[0]
        movie_genres = genre_matrix.loc[movie_id]

        # Gets % of Genre Matching, then Computes Scores
        genre_match_percentages = (genre_matrix == movie_genres).sum(axis=1) / len(movie_genres)
        score = 100 * (1 - (len(movie_genres) - 1) / (genre_match_percentages.max() * len(movie_genres)))

        # Selects movies with Genre Matching Percentage > 70% (our decision factor) and sorts by that score
        recommended_movies = genre_match_percentages[genre_match_percentages > 0.70]
        sorted_recommendations = recommended_movies.sort_values(ascending=False)

        # Gets all recommended movies and Creates in dataframe
        recommended_movies_based_on_genres = []
        for movie_id, score in sorted_recommendations.items():
            score_percentage = int(score * 100)
            if score_percentage >= 85:  #85% eliminates about half of the output (arbitrary, our filter factor)
                recommended_movies_based_on_genres.append({'id': movie_id, 'score_percentage': score_percentage})
            else:
                continue       

        recommended_movies_based_on_genres = pd.DataFrame(recommended_movies_based_on_genres, columns=['id', 'score_percentage'])

        # Saves output to recommended_movies_df
        recommended_movies_df = recommended_movies_based_on_genres

        # Merges two dataframes into one
        genre_filtered_output = pd.merge(recommended_movies_df[['id', 'score_percentage']], df_ready_to_vector[['id', 'title', 'tag','genres_tag', 'ppl_tag', 'runtime', 'vote_average']], on='id', how='inner')
        
        # Create Bag of Words then computes pairwise cosine similarities based on their BoW feature vectors
        tag_bow = vectorizer.fit_transform(genre_filtered_output['tag'] + ' ' + genre_filtered_output['ppl_tag'])
        bow_features = tag_bow.toarray()
        movie_similarities = cosine_similarity(bow_features, bow_features)

        # Calculates original movie's runtime range
        original_movie_title = movie_input
        original_movie_runtime = genre_filtered_output.loc[genre_filtered_output['title'] == original_movie_title, 'runtime'].iloc[0]
        runtime_range = (original_movie_runtime - 30, original_movie_runtime + 30)

        # Applies runtime filter and Recommend movies
        if original_movie_title in genre_filtered_output['title'].values:
            # Gets the index of the movie in the feature matrix, computes Cosine similarites, and sorts
            movie_index = genre_filtered_output.index[genre_filtered_output['title'] == original_movie_title][0]
            movie_similarities_matrix = cosine_similarity(bow_features[movie_index].reshape(1, -1), bow_features)
            movie_indices = np.argsort(movie_similarities_matrix.squeeze())[::-1][1:top_n+1]

            print(f"\nRecommended movies within 30 minutes of {original_movie_title} (Runtime: {original_movie_runtime} minutes)")
            for i in movie_indices:
                movie_title = genre_filtered_output.iloc[i].title
                movie_runtime = genre_filtered_output.iloc[i].runtime
                print(f"{movie_title} (Runtime: {movie_runtime} minutes)")
        else:
            print("Movie input is not in genre_filtered_output dataframe")
    
    # PATH 2
    else: # If user inserts a person's name that is not a movie, this code (Path 2) will run
        tag_bow = vectorizer.fit_transform(df_ready_to_vector['tag'])
        ppl_bow = vectorizer.fit_transform(df_ready_to_vector['ppl_tag'])
        bow_features = np.hstack((tag_bow.toarray(), ppl_bow.toarray()))

        # Computes pairwise cosine similarities between movies
        movie_similarities = cosine_similarity(bow_features, bow_features)

        # Filters by actor/actress/director name, gets indices, computes cosine similarity, sorts on similarity, then on vote average
        movies_by_cast = df_ready_to_vector[df_ready_to_vector['ppl_tag'].apply(lambda x: movie_input.lower() in x.lower())]
        movie_indices = movies_by_cast.index
        distance_cast = cosine_similarity(bow_features[movie_indices], bow_features)
        similarity_scores = distance_cast.sum(axis=0)
        movie_indices = np.argsort(similarity_scores.squeeze())[::-1][:top_n]
        movie_indices = sorted(movie_indices, key=lambda i: df_ready_to_vector.loc[i, 'vote_average'], reverse=True)

        print(f"Recommended movies based on {movie_input}:")
        for i in movie_indices:
            movie_title = df_ready_to_vector.loc[i, 'title']
            movie_vote_average = df_ready_to_vector.loc[i, 'vote_average']
            print(f"{movie_title} (Vote Average: {movie_vote_average})")

## 7. Recommendations

This section highlights our movie recommendations.

In [30]:
get_movie_recommendation('Frozen', top_n=5) 


Recommended movies within 30 minutes of Frozen (Runtime: 102.0 minutes)
The Snow Queen (Runtime: 76.0 minutes)
Aladdin (Runtime: 90.0 minutes)
Delgo (Runtime: 94.0 minutes)
Snow White and the Seven Dwarfs (Runtime: 83.0 minutes)
Brave (Runtime: 93.0 minutes)


In [31]:
get_movie_recommendation('The Dark Knight', top_n=5) 


Recommended movies within 30 minutes of The Dark Knight (Runtime: 152.0 minutes)
The Dark Knight Rises (Runtime: 165.0 minutes)
Batman Begins (Runtime: 140.0 minutes)
Batman Returns (Runtime: 126.0 minutes)
Batman: The Dark Knight Returns, Part 2 (Runtime: 78.0 minutes)
Batman Forever (Runtime: 121.0 minutes)


In [32]:
get_movie_recommendation('The Shawshank Redemption', top_n=5)


Recommended movies within 30 minutes of The Shawshank Redemption (Runtime: 142.0 minutes)
Prison (Runtime: 102.0 minutes)
Civil Brand (Runtime: 95.0 minutes)
Penitentiary (Runtime: 99.0 minutes)
The Longest Yard (Runtime: 113.0 minutes)
Mean Machine (Runtime: 99.0 minutes)


In [33]:
get_movie_recommendation('Christopher Nolan', top_n=5) 

Recommended movies based on Christopher Nolan:
The Dark Knight (Vote Average: 8.2)
Interstellar (Vote Average: 8.1)
The Prestige (Vote Average: 8.0)
The Dark Knight Rises (Vote Average: 7.6)
Batman Begins (Vote Average: 7.5)


## 8. Different Vectorizations Comparison

This section of code compares the movie recommendation result using a combination of different vectorizer (CountVectorizer, TfidfVectorizer, HashingVectorizer) and similarity measurement (cosine_similarity, euclidean_distances).

In [34]:
def compare_get_movie_recommendation(movie_input, top_n=5):
    if movie_input in df_ready_to_vector['title'].values:
        # Gets id and then genres
        movie_id = df_ready_to_vector[df_ready_to_vector['title'] == movie_input]['id'].values[0]
        movie_genres = genre_matrix.loc[movie_id]

        # Gets % of Genre Matching, then Computes Scores
        genre_match_percentages = (genre_matrix == movie_genres).sum(axis=1) / len(movie_genres)
        score = 100 * (1 - (len(movie_genres) - 1) / (genre_match_percentages.max() * len(movie_genres)))

        # Selects movies with Genre Matching Percentage > 70% (our decision factor) and sorts by that score
        recommended_movies = genre_match_percentages[genre_match_percentages > 0.70]
        sorted_recommendations = recommended_movies.sort_values(ascending=False)

        # Get all recommended movies and Creates dataframe
        recommended_movies_based_on_genres = []
        for movie_id, score in sorted_recommendations.items():
            score_percentage = int(score * 100)
            if score_percentage >= 85:  #85% eliminates about half of the output (arbitrary, our filter factor)
                recommended_movies_based_on_genres.append({'id': movie_id, 'score_percentage': score_percentage})
            else:
                continue     

        recommended_movies_based_on_genres = pd.DataFrame(recommended_movies_based_on_genres, columns=['id', 'score_percentage'])

        # Saves the output to recommended_movies_df
        recommended_movies_df = recommended_movies_based_on_genres

        # Merges two dataframes into one
        genre_filtered_output = pd.merge(recommended_movies_df[['id', 'score_percentage']], df_ready_to_vector[['id', 'title', 'tag', 'genres_tag', 'ppl_tag', 'runtime', 'vote_average']], on='id', how='inner')
        
        # Defines list of vectorizers and list of distance metrics
        vectorizers = [CountVectorizer(), TfidfVectorizer(), HashingVectorizer(n_features=10000)]
        distance_metrics = [cosine_similarity, euclidean_distances]


        # Iterates through the combinations of vectorizers and distance metrics
        for vectorizer in vectorizers:
            for distance_metric in distance_metrics:
                tag_bow = vectorizer.fit_transform(genre_filtered_output['tag'] + ' ' + genre_filtered_output['ppl_tag'])
                bow_features = tag_bow.toarray()

                # Computes pairwise similarities between movies
                movie_similarities = distance_metric(bow_features, bow_features)

                # Applies runtime filter and Recommend movies
                if movie_input in genre_filtered_output['title'].values:
                    # Gets the index of the movie in the feature matrix, computes Cosine similarites, and sorts
                    movie_index = genre_filtered_output.index[genre_filtered_output['title'] == movie_input][0]
                    movie_similarities_matrix = distance_metric(bow_features[movie_index].reshape(1, -1), bow_features)
                    movie_indices = np.argsort(movie_similarities_matrix.squeeze())[::-1][1:top_n+1]

                    print(f"Recommended movies based on {movie_input} using {vectorizer.__class__.__name__} and {distance_metric.__name__}:")
                    for i in movie_indices:
                        movie_title = genre_filtered_output.iloc[i]['title']
                        similarity_score = round(movie_similarities_matrix[0][i], 2)
                        print(f"{movie_title}")
                    
                    print() 

In [35]:
compare_get_movie_recommendation('Frozen', top_n=5)

Recommended movies based on Frozen using CountVectorizer and cosine_similarity:
Aladdin
Delgo
Spirit: Stallion of the Cimarron
The Princess and the Frog
Mulan

Recommended movies based on Frozen using CountVectorizer and euclidean_distances:
Star Wars: Clone Wars: Volume 1
Alpha and Omega
Henry & Me
Dear Frankie
Madagascar 3: Europe's Most Wanted

Recommended movies based on Frozen using TfidfVectorizer and cosine_similarity:
The Snow Queen
Aladdin
Delgo
Snow White and the Seven Dwarfs
Brave

Recommended movies based on Frozen using TfidfVectorizer and euclidean_distances:
The Looking Glass
Harrison Montgomery
The Algerian
Sardaarji
Sparkler

Recommended movies based on Frozen using HashingVectorizer and cosine_similarity:
Aladdin
Delgo
Spirit: Stallion of the Cimarron
The Princess and the Frog
Mulan

Recommended movies based on Frozen using HashingVectorizer and euclidean_distances:
Harrison Montgomery
Lisa Picard Is Famous
The Algerian
The Looking Glass
Sardaarji



In [36]:
compare_get_movie_recommendation('The Dark Knight', top_n=5)

Recommended movies based on The Dark Knight using CountVectorizer and cosine_similarity:
The Dark Knight Rises
Batman Begins
Batman Returns
Batman: The Dark Knight Returns, Part 2
Batman

Recommended movies based on The Dark Knight using CountVectorizer and euclidean_distances:
Fight Valley
Gladiator
Thank You for Smoking
Tae Guk Gi: The Brotherhood of War
Pocketful of Miracles

Recommended movies based on The Dark Knight using TfidfVectorizer and cosine_similarity:
The Dark Knight Rises
Batman Begins
Batman Returns
Batman: The Dark Knight Returns, Part 2
Batman Forever

Recommended movies based on The Dark Knight using TfidfVectorizer and euclidean_distances:
Crowsnest
The Book of Mormon Movie, Volume 1: The Journey
The Outrageous Sophie Tucker
Childless
The Looking Glass

Recommended movies based on The Dark Knight using HashingVectorizer and cosine_similarity:
The Dark Knight Rises
Batman Begins
Batman Returns
Batman: The Dark Knight Returns, Part 2
Batman

Recommended movies based 

In [37]:
compare_get_movie_recommendation('The Shawshank Redemption', top_n=5)

Recommended movies based on The Shawshank Redemption using CountVectorizer and cosine_similarity:
Civil Brand
Prison
Penitentiary
Mean Machine
The Longest Yard

Recommended movies based on The Shawshank Redemption using CountVectorizer and euclidean_distances:
Solomon and Sheba
Tadpole
Semi-Pro
Slacker Uprising
The Midnight Meat Train

Recommended movies based on The Shawshank Redemption using TfidfVectorizer and cosine_similarity:
Prison
Civil Brand
Penitentiary
The Longest Yard
Mean Machine

Recommended movies based on The Shawshank Redemption using TfidfVectorizer and euclidean_distances:
Sisters in Law
Baggage Claim
The FP
Hostel: Part II
Dream with the Fishes

Recommended movies based on The Shawshank Redemption using HashingVectorizer and cosine_similarity:
Prison
Civil Brand
Penitentiary
The Last Station
The Longest Yard

Recommended movies based on The Shawshank Redemption using HashingVectorizer and euclidean_distances:
Dwegons
Private Benjamin
The Ten
Devil's Due
Grand Theft 