# MOVIE RECOMMENDATION SYSTEM

### Dataset link: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset

These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.

This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.

### Importing libraries

In [279]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

# warnings
import warnings
warnings.filterwarnings('ignore')

# abstract syntax trees
import ast

# feature extraction
from sklearn.feature_extraction.text import CountVectorizer

# measure document similarity in text analysis
from sklearn.metrics.pairwise import cosine_similarity

# split data
from sklearn.model_selection import train_test_split

# linear and logistic regression models
from sklearn.linear_model import LinearRegression, LogisticRegression

# Polynomial features
from sklearn.preprocessing import PolynomialFeatures

# evaluation metrics (regression errors, R^2, classification accuracy)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, accuracy_score

# Natural Language Toolkit (NLTK)
import nltk
from nltk.stem.porter import PorterStemmer

### Importing datasets

In [280]:
df_credits = pd.read_csv("credits.csv")
df_movies_data = pd.read_csv("movies_metadata.csv")
df_keywords = pd.read_csv("keywords.csv")

In [281]:
df_credits.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [282]:
pd.set_option('display.max_columns', None)
df_movies_data.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [283]:
df_keywords.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [284]:
df_credits.shape

(45476, 3)

In [285]:
df_movies_data.shape

(45466, 24)

In [286]:
df_keywords.shape

(46419, 2)

In [287]:
df_credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   cast    45476 non-null  object
 1   crew    45476 non-null  object
 2   id      45476 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.0+ MB


### Explore and Clean the data

In [288]:
df_movies_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [289]:
df_keywords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46419 entries, 0 to 46418
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        46419 non-null  int64 
 1   keywords  46419 non-null  object
dtypes: int64(1), object(1)
memory usage: 725.4+ KB


In [290]:
df_movies_data.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

In [291]:
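# NOTE: column assignment aligns on the row index, not on the movie id.
# The three files have different row counts (45476, 45466 and 46419),
# so a row's cast, crew and keywords are not guaranteed to belong to its title.
df_movies_data["cast"] = df_credits["cast"]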

In [292]:
df_movies_data["crew"] = df_credits["crew"]

In [293]:
df_movies_data["keywords"] = df_keywords["keywords"]
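
The three assignments above attach `cast`, `crew` and `keywords` by row position. One alternative, sketched here on toy data, is to join the frames on the shared `id` column. The frames `movies`, `credits` and `keywords`, their contents, and the unparseable id value are made up for illustration. The numeric coercion assumes, as the `info()` output suggests, that `id` is stored as text in the metadata file and as an integer in the other two.

```python
import pandas as pd

# Toy stand-ins for movies_metadata.csv, credits.csv and keywords.csv (illustrative data only).
movies = pd.DataFrame({"id": ["862", "8844", "1997-08-20"],
                       "original_title": ["Toy Story", "Jumanji", "bad row"]})
credits = pd.DataFrame({"id": [8844, 862],
                        "cast": ["cast of 8844", "cast of 862"]})
keywords = pd.DataFrame({"id": [862, 8844, 8844],
                         "keywords": ["kw 862", "kw 8844", "kw 8844"]})

# The metadata id is text, so coerce it to a number and drop rows that cannot be parsed.
movies["id"] = pd.to_numeric(movies["id"], errors="coerce")
movies = movies.dropna(subset=["id"]).astype({"id": "int64"})

# Join on the shared key; drop duplicate ids first so the merge cannot multiply rows.
merged = (
    movies
    .merge(credits.drop_duplicates("id"), on="id", how="inner")
    .merge(keywords.drop_duplicates("id"), on="id", how="inner")
)
print(merged)
```

With a key-based join, the order of rows in each file no longer matters, and titles without a matching credits or keywords row are dropped rather than paired with another film's data.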

In [294]:
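# .copy() makes df_movies an independent frame, so the in-place cleaning below does not act on a slice of df_movies_data
df_movies = df_movies_data[["id", "original_title", "genres", "overview", "cast", "crew", "keywords"]].copy()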

In [295]:
df_movies.head()

Unnamed: 0,id,original_title,genres,overview,cast,crew,keywords
0,862,Toy Story,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","Led by Woody, Andy's toys live happily in his ...","[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,Jumanji,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",When siblings Judy and Peter discover an encha...,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,Grumpier Old Men,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",A family wedding reignites the ancient feud be...,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,Waiting to Exhale,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","Cheated on, mistreated and stepped on, the wom...","[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,Father of the Bride Part II,"[{'id': 35, 'name': 'Comedy'}]",Just when George Banks has recovered from his ...,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [296]:
df_movies.isnull().sum()

id                  0
original_title      0
genres              0
overview          954
cast                0
crew                0
keywords            0
dtype: int64

In [297]:
df_movies.dropna(inplace = True)

In [298]:
df_movies.isnull().sum()

id                0
original_title    0
genres            0
overview          0
cast              0
crew              0
keywords          0
dtype: int64

In [299]:
df_movies.duplicated().sum()

8

In [300]:
df_movies.drop_duplicates(inplace = True)

In [301]:
df_movies.duplicated().sum()

0

In [302]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44504 entries, 0 to 45465
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              44504 non-null  object
 1   original_title  44504 non-null  object
 2   genres          44504 non-null  object
 3   overview        44504 non-null  object
 4   cast            44504 non-null  object
 5   crew            44504 non-null  object
 6   keywords        44504 non-null  object
dtypes: object(7)
memory usage: 2.7+ MB


In [303]:
df_movies.head()

Unnamed: 0,id,original_title,genres,overview,cast,crew,keywords
0,862,Toy Story,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","Led by Woody, Andy's toys live happily in his ...","[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,Jumanji,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",When siblings Judy and Peter discover an encha...,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,Grumpier Old Men,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",A family wedding reignites the ancient feud be...,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,Waiting to Exhale,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","Cheated on, mistreated and stepped on, the wom...","[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,Father of the Bride Part II,"[{'id': 35, 'name': 'Comedy'}]",Just when George Banks has recovered from his ...,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [304]:
df_movies.iloc[0]["genres"]

"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

In [305]:
df_movies.iloc[0]["cast"]

"[{'cast_id': 14, 'character': 'Woody (voice)', 'credit_id': '52fe4284c3a36847f8024f95', 'gender': 2, 'id': 31, 'name': 'Tom Hanks', 'order': 0, 'profile_path': '/pQFoyx7rp09CJTAb932F2g8Nlho.jpg'}, {'cast_id': 15, 'character': 'Buzz Lightyear (voice)', 'credit_id': '52fe4284c3a36847f8024f99', 'gender': 2, 'id': 12898, 'name': 'Tim Allen', 'order': 1, 'profile_path': '/uX2xVf6pMmPepxnvFWyBtjexzgY.jpg'}, {'cast_id': 16, 'character': 'Mr. Potato Head (voice)', 'credit_id': '52fe4284c3a36847f8024f9d', 'gender': 2, 'id': 7167, 'name': 'Don Rickles', 'order': 2, 'profile_path': '/h5BcaDMPRVLHLDzbQavec4xfSdt.jpg'}, {'cast_id': 17, 'character': 'Slinky Dog (voice)', 'credit_id': '52fe4284c3a36847f8024fa1', 'gender': 2, 'id': 12899, 'name': 'Jim Varney', 'order': 3, 'profile_path': '/eIo2jVVXYgjDtaHoF19Ll9vtW7h.jpg'}, {'cast_id': 18, 'character': 'Rex (voice)', 'credit_id': '52fe4284c3a36847f8024fa5', 'gender': 2, 'id': 12900, 'name': 'Wallace Shawn', 'order': 4, 'profile_path': '/oGE6JqPP2xH4t

In [306]:
df_movies.iloc[0]["crew"]

'[{\'credit_id\': \'52fe4284c3a36847f8024f49\', \'department\': \'Directing\', \'gender\': 2, \'id\': 7879, \'job\': \'Director\', \'name\': \'John Lasseter\', \'profile_path\': \'/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f4f\', \'department\': \'Writing\', \'gender\': 2, \'id\': 12891, \'job\': \'Screenplay\', \'name\': \'Joss Whedon\', \'profile_path\': \'/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f55\', \'department\': \'Writing\', \'gender\': 2, \'id\': 7, \'job\': \'Screenplay\', \'name\': \'Andrew Stanton\', \'profile_path\': \'/pvQWsu0qc8JFQhMVJkTHuexUAa1.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f5b\', \'department\': \'Writing\', \'gender\': 2, \'id\': 12892, \'job\': \'Screenplay\', \'name\': \'Joel Cohen\', \'profile_path\': \'/dAubAiZcvKFbboWlj7oXOkZnTSu.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f61\', \'department\': \'Writing\', \'gender\': 0, \'id\': 12893, \'job\': \'Screenplay\', \'name\': \'A

In [307]:
df_movies.iloc[0]["keywords"]

"[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}, {'id': 5202, 'name': 'boy'}, {'id': 6054, 'name': 'friendship'}, {'id': 9713, 'name': 'friends'}, {'id': 9823, 'name': 'rivalry'}, {'id': 165503, 'name': 'boy next door'}, {'id': 170722, 'name': 'new toy'}, {'id': 187065, 'name': 'toy comes to life'}]"

In [308]:
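def extract(obj):
    """Parse a stringified list of dicts and return the lower-cased 'name' values."""
    names = []
    for item in ast.literal_eval(obj):  # 'item' avoids shadowing the built-in dict
        names.append(item["name"].lower())
    return names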
df_movies["genres"] = df_movies["genres"].apply(extract)
df_movies["cast"] = df_movies["cast"].apply(extract)
df_movies["crew"] = df_movies["crew"].apply(extract)
df_movies["keywords"] = df_movies["keywords"].apply(extract)
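
To make the parsing step concrete, here is a minimal standalone sketch of the same idea applied to the `genres` string printed earlier for Toy Story. The helper name `extract_names` is hypothetical. It only mirrors what `extract` does and is not part of the notebook.

```python
import ast

def extract_names(raw):
    """Parse a stringified list of dicts and return the lower-cased 'name' values."""
    # ast.literal_eval only evaluates Python literals, so it is safer than eval() here.
    return [item["name"].lower() for item in ast.literal_eval(raw)]

raw_genres = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"
print(extract_names(raw_genres))  # ['animation', 'comedy', 'family']
print(extract_names("[]"))        # []
```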

In [309]:
df_movies.head()

Unnamed: 0,id,original_title,genres,overview,cast,crew,keywords
0,862,Toy Story,"[animation, comedy, family]","Led by Woody, Andy's toys live happily in his ...","[tom hanks, tim allen, don rickles, jim varney...","[john lasseter, joss whedon, andrew stanton, j...","[jealousy, toy, boy, friendship, friends, riva..."
1,8844,Jumanji,"[adventure, fantasy, family]",When siblings Judy and Peter discover an encha...,"[robin williams, jonathan hyde, kirsten dunst,...","[larry j. franco, jonathan hensleigh, james ho...","[board game, disappearance, based on children'..."
2,15602,Grumpier Old Men,"[romance, comedy]",A family wedding reignites the ancient feud be...,"[walter matthau, jack lemmon, ann-margret, sop...","[howard deutch, mark steven johnson, mark stev...","[fishing, best friend, duringcreditsstinger, o..."
3,31357,Waiting to Exhale,"[comedy, drama, romance]","Cheated on, mistreated and stepped on, the wom...","[whitney houston, angela bassett, loretta devi...","[forest whitaker, ronald bass, ronald bass, ez...","[based on novel, interracial relationship, sin..."
4,11862,Father of the Bride Part II,[comedy],Just when George Banks has recovered from his ...,"[steve martin, diane keaton, martin short, kim...","[alan silvestri, elliot davis, nancy meyers, n...","[baby, midlife crisis, confidence, aging, daug..."


In [310]:
df_movies["overview"] = df_movies["overview"].apply(lambda text: text.lower())

In [311]:
df_movies["overview"] = df_movies["overview"].apply(lambda x:x.split())

In [312]:
df_movies.head()

Unnamed: 0,id,original_title,genres,overview,cast,crew,keywords
0,862,Toy Story,"[animation, comedy, family]","[led, by, woody,, andy's, toys, live, happily,...","[tom hanks, tim allen, don rickles, jim varney...","[john lasseter, joss whedon, andrew stanton, j...","[jealousy, toy, boy, friendship, friends, riva..."
1,8844,Jumanji,"[adventure, fantasy, family]","[when, siblings, judy, and, peter, discover, a...","[robin williams, jonathan hyde, kirsten dunst,...","[larry j. franco, jonathan hensleigh, james ho...","[board game, disappearance, based on children'..."
2,15602,Grumpier Old Men,"[romance, comedy]","[a, family, wedding, reignites, the, ancient, ...","[walter matthau, jack lemmon, ann-margret, sop...","[howard deutch, mark steven johnson, mark stev...","[fishing, best friend, duringcreditsstinger, o..."
3,31357,Waiting to Exhale,"[comedy, drama, romance]","[cheated, on,, mistreated, and, stepped, on,, ...","[whitney houston, angela bassett, loretta devi...","[forest whitaker, ronald bass, ronald bass, ez...","[based on novel, interracial relationship, sin..."
4,11862,Father of the Bride Part II,[comedy],"[just, when, george, banks, has, recovered, fr...","[steve martin, diane keaton, martin short, kim...","[alan silvestri, elliot davis, nancy meyers, n...","[baby, midlife crisis, confidence, aging, daug..."


In [313]:
df_movies["full"] = df_movies["genres"] + df_movies["overview"] + df_movies["cast"] + df_movies["crew"] + df_movies["keywords"]

In [314]:
df_movies["full"] = df_movies["full"].apply(lambda tokens: " ".join(tokens))

In [315]:
df_movies.head()

Unnamed: 0,id,original_title,genres,overview,cast,crew,keywords,full
0,862,Toy Story,"[animation, comedy, family]","[led, by, woody,, andy's, toys, live, happily,...","[tom hanks, tim allen, don rickles, jim varney...","[john lasseter, joss whedon, andrew stanton, j...","[jealousy, toy, boy, friendship, friends, riva...","animation comedy family led by woody, andy's t..."
1,8844,Jumanji,"[adventure, fantasy, family]","[when, siblings, judy, and, peter, discover, a...","[robin williams, jonathan hyde, kirsten dunst,...","[larry j. franco, jonathan hensleigh, james ho...","[board game, disappearance, based on children'...",adventure fantasy family when siblings judy an...
2,15602,Grumpier Old Men,"[romance, comedy]","[a, family, wedding, reignites, the, ancient, ...","[walter matthau, jack lemmon, ann-margret, sop...","[howard deutch, mark steven johnson, mark stev...","[fishing, best friend, duringcreditsstinger, o...",romance comedy a family wedding reignites the ...
3,31357,Waiting to Exhale,"[comedy, drama, romance]","[cheated, on,, mistreated, and, stepped, on,, ...","[whitney houston, angela bassett, loretta devi...","[forest whitaker, ronald bass, ronald bass, ez...","[based on novel, interracial relationship, sin...","comedy drama romance cheated on, mistreated an..."
4,11862,Father of the Bride Part II,[comedy],"[just, when, george, banks, has, recovered, fr...","[steve martin, diane keaton, martin short, kim...","[alan silvestri, elliot davis, nancy meyers, n...","[baby, midlife crisis, confidence, aging, daug...",comedy just when george banks has recovered fr...


In [316]:
df_movies.iloc[0]["full"]

"animation comedy family led by woody, andy's toys live happily in his room until andy's birthday brings buzz lightyear onto the scene. afraid of losing his place in andy's heart, woody plots against buzz. but when circumstances separate buzz and woody from their owner, the duo eventually learns to put aside their differences. tom hanks tim allen don rickles jim varney wallace shawn john ratzenberger annie potts john morris erik von detten laurie metcalf r. lee ermey sarah freeman penn jillette john lasseter joss whedon andrew stanton joel cohen alec sokolow bonnie arnold ed catmull ralph guggenheim steve jobs lee unkrich ralph eggleston robert gordon mary helen leasman kim blanchette marilyn mccoppen randy newman dale e. grahn robin cooper john lasseter pete docter joe ranft patsy bouge norm decarlo ash brannon randy newman roman figun don davis james flamberg mary beth smith rick mackay susan bradley william reeves randy newman andrew stanton pete docter gary rydstrom karen robert ja

In [317]:
# no random_state is passed, so every run draws a different 5000-movie sample
df_movies_random = df_movies.sample(5000).reset_index()

In [318]:
df_movies_random.head()

Unnamed: 0,index,id,original_title,genres,overview,cast,crew,keywords,full
0,24859,84352,The Violent Enemy,"[drama, crime]","[during, the, troubles, in, ireland, an, ira, ...","[benedict cumberbatch, keira knightley, matthe...","[maria djurkovic, william goldenberg, alexandr...","[gay, england, world war ii, mathematician, bi...",drama crime during the troubles in ireland an ...
1,13233,84397,Double Dynamite,"[comedy, music]","[an, innocent, bank, teller,, suspected, of, e...","[jane russell, groucho marx, frank sinatra, do...","[melville shavelson, robert de grasse, leo ros...","[bookie, bank teller]","comedy music an innocent bank teller, suspecte..."
2,3267,820,JFK,"[drama, thriller, history]","[new, orleans, district, attorney, jim, garris...","[kevin costner, tommy lee jones, gary oldman, ...","[john williams, heidi levitt, robert richardso...","[assassination, cia, homophobia, new orleans, ...",drama thriller history new orleans district at...
3,35250,105952,Huge,[],"[a, feuding, double, act, try, to, make, it, i...","[masato sakai, takayuki yamada, sakura ando, h...","[masaaki akahori, masaaki akahori]",[],a feuding double act try to make it in the cut...
4,43419,412209,Tramps,"[comedy, romance]","[a, young, man, and, woman, find, love, in, an...","[kate o'rourke, te kaea beri, campbell cooley,...","[david blyth, andrew beattie, jed town, marc m...","[civil war, chinese communists, 3d]",comedy romance a young man and woman find love...


### Vectorizing the text and computing similarity

In [319]:
cv = CountVectorizer(max_features = 5200, stop_words = "english")

In [320]:
arr = cv.fit_transform(df_movies_random["full"]).toarray()

In [321]:
arr.shape

(5000, 5200)
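
As a rough illustration of what the `CountVectorizer` cell produces, the sketch below fits it on a three-document toy corpus. The corpus and the small `max_features` value are made up for this example. Each row of the result is one document and each column counts one vocabulary term, which is why the real matrix has shape `(5000, 5200)`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (illustrative only); each string plays the role of one movie's "full" text.
docs = [
    "animation comedy family toys",
    "adventure fantasy family board game",
    "romance comedy",
]

# Keep only the 5 most frequent terms and ignore English stop words.
vectorizer = CountVectorizer(max_features=5, stop_words="english")
counts = vectorizer.fit_transform(docs).toarray()

print(counts.shape)  # (3, 5): 3 documents, 5 retained terms

# vocabulary_ maps each retained term to its column index.
comedy_col = vectorizer.vocabulary_["comedy"]
print(counts[:, comedy_col])  # how often "comedy" appears in each document
```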

In [322]:
ps = PorterStemmer()

In [323]:
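def joiningAll(text):
    # Reduce every whitespace-separated token to its Porter stem.
    array = []
    for string in text.split():
        array.append(ps.stem(string))
    return " ".join(array)
# NOTE: df_new is a second name for df_movies_random, not a copy, so both change.
df_new = df_movies_random
# NOTE: arr was built from the unstemmed text in In [320]; the stemmed column only
# affects the similarity matrix if the vectorizer is fitted again after this cell.
df_new["full"] = df_new["full"].apply(joiningAll)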

In [324]:
df_movies_random.head()

Unnamed: 0,index,id,original_title,genres,overview,cast,crew,keywords,full
0,24859,84352,The Violent Enemy,"[drama, crime]","[during, the, troubles, in, ireland, an, ira, ...","[benedict cumberbatch, keira knightley, matthe...","[maria djurkovic, william goldenberg, alexandr...","[gay, england, world war ii, mathematician, bi...",drama crime dure the troubl in ireland an ira ...
1,13233,84397,Double Dynamite,"[comedy, music]","[an, innocent, bank, teller,, suspected, of, e...","[jane russell, groucho marx, frank sinatra, do...","[melville shavelson, robert de grasse, leo ros...","[bookie, bank teller]","comedi music an innoc bank teller, suspect of ..."
2,3267,820,JFK,"[drama, thriller, history]","[new, orleans, district, attorney, jim, garris...","[kevin costner, tommy lee jones, gary oldman, ...","[john williams, heidi levitt, robert richardso...","[assassination, cia, homophobia, new orleans, ...",drama thriller histori new orlean district att...
3,35250,105952,Huge,[],"[a, feuding, double, act, try, to, make, it, i...","[masato sakai, takayuki yamada, sakura ando, h...","[masaaki akahori, masaaki akahori]",[],a feud doubl act tri to make it in the cut-thr...
4,43419,412209,Tramps,"[comedy, romance]","[a, young, man, and, woman, find, love, in, an...","[kate o'rourke, te kaea beri, campbell cooley,...","[david blyth, andrew beattie, jed town, marc m...","[civil war, chinese communists, 3d]",comedi romanc a young man and woman find love ...


In [325]:
df_new.head()

Unnamed: 0,index,id,original_title,genres,overview,cast,crew,keywords,full
0,24859,84352,The Violent Enemy,"[drama, crime]","[during, the, troubles, in, ireland, an, ira, ...","[benedict cumberbatch, keira knightley, matthe...","[maria djurkovic, william goldenberg, alexandr...","[gay, england, world war ii, mathematician, bi...",drama crime dure the troubl in ireland an ira ...
1,13233,84397,Double Dynamite,"[comedy, music]","[an, innocent, bank, teller,, suspected, of, e...","[jane russell, groucho marx, frank sinatra, do...","[melville shavelson, robert de grasse, leo ros...","[bookie, bank teller]","comedi music an innoc bank teller, suspect of ..."
2,3267,820,JFK,"[drama, thriller, history]","[new, orleans, district, attorney, jim, garris...","[kevin costner, tommy lee jones, gary oldman, ...","[john williams, heidi levitt, robert richardso...","[assassination, cia, homophobia, new orleans, ...",drama thriller histori new orlean district att...
3,35250,105952,Huge,[],"[a, feuding, double, act, try, to, make, it, i...","[masato sakai, takayuki yamada, sakura ando, h...","[masaaki akahori, masaaki akahori]",[],a feud doubl act tri to make it in the cut-thr...
4,43419,412209,Tramps,"[comedy, romance]","[a, young, man, and, woman, find, love, in, an...","[kate o'rourke, te kaea beri, campbell cooley,...","[david blyth, andrew beattie, jed town, marc m...","[civil war, chinese communists, 3d]",comedi romanc a young man and woman find love ...


In [326]:
similar_df = cosine_similarity(arr)

In [327]:
similar_df

array([[1.        , 0.        , 0.23498886, ..., 0.        , 0.03572158,
        0.03600115],
       [0.        , 1.        , 0.03928007, ..., 0.        , 0.0587558 ,
        0.03289758],
       [0.23498886, 0.03928007, 1.        , ..., 0.        , 0.04569706,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.048795  ],
       [0.03572158, 0.0587558 , 0.04569706, ..., 0.        , 1.        ,
        0.10631081],
       [0.03600115, 0.03289758, 0.        , ..., 0.048795  , 0.10631081,
        1.        ]])

In [328]:
similar_df[0]

array([1.        , 0.        , 0.23498886, ..., 0.        , 0.03572158,
       0.03600115])

In [329]:
similar_df.shape

(5000, 5000)
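
Each entry of the matrix above is the cosine of the angle between two count vectors, `u . v / (|u| * |v|)`: 1 for identical directions and 0 when the two movies share no terms. The sketch below works this out by hand on made-up vectors. The helper `cosine` is hypothetical and only illustrates the formula; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two count vectors; 0.0 if either is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm else 0.0

# Made-up term counts for three tiny "movies" over a 4-word vocabulary.
a = np.array([1, 1, 0, 2])
b = np.array([1, 0, 1, 2])
c = np.array([0, 0, 3, 0])

print(round(cosine(a, b), 4))  # 0.8333, i.e. 5 / (sqrt(6) * sqrt(6))
print(cosine(a, c))            # 0.0, no shared terms
```

Because the score depends only on direction, a long overview and a short one with the same word proportions count as equally similar.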

In [339]:
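def recommended(movie):
    matches = df_new[df_new["original_title"] == movie].index
    if len(matches) == 0:  # the title is not in the 5000-movie sample
        print(f"'{movie}' is not in the sampled data")
        return
    scores = similar_df[matches[0]]
    # Rank by similarity, highest first; position 0 is normally the movie itself, so skip it.
    movie_lists = sorted(enumerate(scores), reverse=True, key=lambda x: x[1])[1:11]
    for index, _ in movie_lists:
        print(df_new.iloc[index].original_title)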

In [342]:
recommended("Doctor Strange")

War of the Worlds
Armageddon
Schalcken the Painter
Demolition Man
Gyakufunsha kazoku
Blade II
Batman & Robin
火锅英雄
The Social Network
Constantine


In [344]:
recommended("Gyakufunsha kazoku")

Doctor Strange
Constantine
Schalcken the Painter
The Social Network
This Is the End
War of the Worlds
3:10 to Yuma
Children of Men
Tant qu'on a la santé
Freddy vs. Jason
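
The lookup used by `recommended` can also be written as a function that returns a list instead of printing, which makes it easier to test. This is a minimal sketch on a made-up 4x4 similarity matrix. The name `top_similar`, its parameters, and the toy titles are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def top_similar(titles, similarity, movie, k=10):
    """Return up to k titles most similar to `movie`, or [] if the title is unknown."""
    if movie not in titles:
        return []
    i = titles.index(movie)
    # Sort indices by descending similarity, then drop the query movie by index.
    order = [j for j in np.argsort(-similarity[i], kind="stable") if j != i]
    return [titles[j] for j in order[:k]]

# Toy symmetric similarity matrix (illustrative values only).
titles = ["A", "B", "C", "D"]
similarity = np.array([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])

print(top_similar(titles, similarity, "A", k=2))  # ['C', 'B']
print(top_similar(titles, similarity, "Z"))       # []
```

Removing the query by index, rather than slicing off the first result, still works when another row ties with it at a score of 1.0.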
