In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns   # for heatmap


In [1]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to C:\Users\Omar
[nltk_data]     Hady\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Omar
[nltk_data]     Hady\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
#read the data

df_movies = pd.read_csv('data/tmdb_5000_movies.csv')
df_credits = pd.read_csv('data/tmdb_5000_credits.csv')

print(df_movies.shape)
print(df_credits.shape)


(4803, 20)
(4803, 4)


In [3]:
#merge the two dataframes
df = pd.merge(df_movies, df_credits, left_on='id', right_on='movie_id')
print(df.shape)

(4803, 24)
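
The merge above assumes that each `id` matches exactly one `movie_id`. A minimal sketch of how that assumption could be checked, using two tiny made-up frames rather than the TMDB files (the frame contents and variable names are illustrative assumptions):

```python
import pandas as pd

# Toy stand-ins for the movies and credits tables (hypothetical data).
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
credits = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["A", "B", "C"]})

# validate="one_to_one" raises pandas.errors.MergeError if a key repeats on either side.
# indicator=True adds a "_merge" column saying which table each row came from.
merged = pd.merge(
    movies, credits,
    left_on="id", right_on="movie_id",
    how="outer", validate="one_to_one", indicator=True,
)

# Rows present in only one table would be marked "left_only" or "right_only".
unmatched = merged[merged["_merge"] != "both"]
print(len(unmatched))

# The shared "title" column gets suffixes, which is where title_x and title_y come from.
print(sorted(c for c in merged.columns if c.startswith("title")))
```

An outer join is used here only so that unmatched rows would be visible; the notebook's own merge is the default inner join.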


## Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a crucial step in the data analysis process. It involves summarizing the main characteristics of the dataset, often using visual methods. Here, we will perform EDA on our merged movie dataset to understand its structure, identify patterns, detect anomalies, and check assumptions.

### Steps in EDA:

1. **Data Summary**: 
    - Display the first few rows of the dataset to get an initial understanding.
    - Check the dimensions of the dataset.
    - Summarize the data types and missing values.

2. **Descriptive Statistics**:
    - Calculate summary statistics for numerical columns (mean, median, standard deviation, etc.).
    - Analyze the distribution of key variables.

3. **Data Visualization**:
    - Plot histograms, box plots, and scatter plots to visualize the distribution and relationships between variables.
    - Use heatmaps to visualize correlations between numerical variables.

4. **Handling Missing Values**:
    - Identify columns with missing values.
    - Decide on strategies to handle missing data (e.g., imputation, removal).

5. **Feature Engineering**:
    - Create new features based on existing data to enhance the predictive power of the dataset.

By performing these steps, we aim to gain insights into the dataset, which will guide us in further analysis and modeling.
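
The steps listed above can be sketched on a small made-up table. The column names echo the movie dataset, but the values, the 50% threshold for dropping a column, and the median imputation are assumptions chosen for illustration, not the choices made later in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset standing in for the merged movie table.
toy = pd.DataFrame({
    "budget": [100, 250, 0, 80],
    "revenue": [300, 900, 10, np.nan],
    "runtime": [120.0, 169.0, np.nan, 95.0],
    "homepage": ["http://example.com", None, None, None],
})

# 1. Data summary: dimensions and missing values per column.
print(toy.shape)
missing = toy.isna().sum()
print(missing)

# 2. Descriptive statistics for the numerical columns.
print(toy.describe())

# 3. Pairwise correlations between numerical columns (the input to a heatmap).
corr = toy.select_dtypes("number").corr()
print(corr.round(2))

# 4. One possible strategy: drop mostly-empty columns, impute the rest with the median.
mostly_empty = missing[missing / len(toy) > 0.5].index
cleaned = toy.drop(columns=mostly_empty)
cleaned = cleaned.fillna(cleaned.median(numeric_only=True))
print(cleaned)
```

The `corr` frame could be passed to `sns.heatmap(corr, annot=True)` to draw the correlation heatmap mentioned in step 3.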

In [4]:
#view all columns 
pd.set_option('display.max_columns', None)
df.head(5)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title_x,vote_average,vote_count,movie_id,title_y,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

The `homepage`, `status`, `title_x` and `title_y` columns do not describe a movie's content and won't be used.
`homepage` and release `status` aren't core features, and the duplicated title columns are already covered by `original_title`. The remaining features are specific to each movie.

In [6]:
#select the columns that we need
df_workable = df[['id', 'original_title', 'genres', 'keywords', 'cast', 'crew', 'overview', 'popularity','runtime', 'release_date']]
df_workable.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              4803 non-null   int64  
 1   original_title  4803 non-null   object 
 2   genres          4803 non-null   object 
 3   keywords        4803 non-null   object 
 4   cast            4803 non-null   object 
 5   crew            4803 non-null   object 
 6   overview        4800 non-null   object 
 7   popularity      4803 non-null   float64
 8   runtime         4801 non-null   float64
 9   release_date    4802 non-null   object 
dtypes: float64(2), int64(1), object(7)
memory usage: 375.4+ KB


In [7]:
#remove nan values
df_workable = df_workable.dropna()
df_workable.info()


<class 'pandas.core.frame.DataFrame'>
Index: 4799 entries, 0 to 4802
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              4799 non-null   int64  
 1   original_title  4799 non-null   object 
 2   genres          4799 non-null   object 
 3   keywords        4799 non-null   object 
 4   cast            4799 non-null   object 
 5   crew            4799 non-null   object 
 6   overview        4799 non-null   object 
 7   popularity      4799 non-null   float64
 8   runtime         4799 non-null   float64
 9   release_date    4799 non-null   object 
dtypes: float64(2), int64(1), object(7)
memory usage: 412.4+ KB


In [8]:
df.crew[3]

'[{"credit_id": "52fe4781c3a36847f81398c3", "department": "Sound", "gender": 2, "id": 947, "job": "Original Music Composer", "name": "Hans Zimmer"}, {"credit_id": "52fe4781c3a36847f8139899", "department": "Production", "gender": 0, "id": 282, "job": "Producer", "name": "Charles Roven"}, {"credit_id": "52fe4781c3a36847f813991b", "department": "Writing", "gender": 2, "id": 525, "job": "Screenplay", "name": "Christopher Nolan"}, {"credit_id": "52fe4781c3a36847f8139865", "department": "Directing", "gender": 2, "id": 525, "job": "Director", "name": "Christopher Nolan"}, {"credit_id": "52fe4781c3a36847f8139893", "department": "Production", "gender": 2, "id": 525, "job": "Producer", "name": "Christopher Nolan"}, {"credit_id": "52fe4781c3a36847f8139915", "department": "Writing", "gender": 2, "id": 525, "job": "Story", "name": "Christopher Nolan"}, {"credit_id": "52fe4781c3a36847f813990f", "department": "Writing", "gender": 2, "id": 527, "job": "Screenplay", "name": "Jonathan Nolan"}, {"credit_

In [9]:
#extract director name from crew column

#convert string to list
import ast
df_workable['crew'] = df_workable['crew'].apply(ast.literal_eval)

def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

df_workable['director'] = df_workable['crew'].apply(get_director)
df_workable = df_workable.drop('crew', axis=1)
df_workable.director.head(10)

0        James Cameron
1       Gore Verbinski
2           Sam Mendes
3    Christopher Nolan
4       Andrew Stanton
5            Sam Raimi
6         Byron Howard
7          Joss Whedon
8          David Yates
9          Zack Snyder
Name: director, dtype: object
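
The `crew` strings shown above look like JSON, so `json.loads` is a possible alternative to `ast.literal_eval`; it also accepts JSON literals such as `null` that `literal_eval` would reject. A minimal sketch, where `extract_director` is a hypothetical helper and the sample string is shortened from the record printed earlier:

```python
import json

def extract_director(crew_json):
    """Return the name of the first crew member whose job is 'Director', or None."""
    try:
        crew = json.loads(crew_json)
    except (TypeError, json.JSONDecodeError):
        return None  # missing or malformed cell
    for member in crew:
        if member.get("job") == "Director":
            return member.get("name")
    return None

sample = (
    '[{"department": "Sound", "job": "Original Music Composer", "name": "Hans Zimmer"},'
    ' {"department": "Directing", "job": "Director", "name": "Christopher Nolan"}]'
)
print(extract_director(sample))   # Christopher Nolan
print(extract_director("[]"))     # None
print(extract_director(None))     # None
```

On the raw string column this could be applied as `df_workable['crew'].apply(extract_director)`, before any other parsing of that column.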

In [10]:
#extract genres
df_workable['genres'] = df_workable['genres'].apply(ast.literal_eval)
df_workable['genres'] = df_workable['genres'].apply(lambda x: [i['name'] for i in x])

In [11]:
df_workable.genres.value_counts().head(10)

genres
[Drama]                     369
[Comedy]                    282
[Drama, Romance]            164
[Comedy, Romance]           144
[Comedy, Drama]             142
[Comedy, Drama, Romance]    109
[Horror, Thriller]           88
[Documentary]                66
[Horror]                     64
[Drama, Thriller]            62
Name: count, dtype: int64

In [12]:
#let's look at the keywords column
df_workable['keywords'].head()

0    [{"id": 1463, "name": "culture clash"}, {"id":...
1    [{"id": 270, "name": "ocean"}, {"id": 726, "na...
2    [{"id": 470, "name": "spy"}, {"id": 818, "name...
3    [{"id": 849, "name": "dc comics"}, {"id": 853,...
4    [{"id": 818, "name": "based on novel"}, {"id":...
Name: keywords, dtype: object

In [13]:
df_workable['keywords'] = df_workable['keywords'].apply(ast.literal_eval)
df_workable['keywords'] = df_workable['keywords'].apply(lambda x: [i['name'] for i in x])
df_workable['keywords'].head()

0    [culture clash, future, space war, space colon...
1    [ocean, drug abuse, exotic island, east india ...
2    [spy, based on novel, secret agent, sequel, mi...
3    [dc comics, crime fighter, terrorist, secret i...
4    [based on novel, mars, medallion, space travel...
Name: keywords, dtype: object

In [14]:
#extract cast
df_workable['cast'] = df_workable['cast'].apply(ast.literal_eval)
df_workable['cast'] = df_workable['cast'].apply(lambda x: [i['name'] for i in x])
df_workable['cast'].head()

0    [Sam Worthington, Zoe Saldana, Sigourney Weave...
1    [Johnny Depp, Orlando Bloom, Keira Knightley, ...
2    [Daniel Craig, Christoph Waltz, Léa Seydoux, R...
3    [Christian Bale, Michael Caine, Gary Oldman, A...
4    [Taylor Kitsch, Lynn Collins, Samantha Morton,...
Name: cast, dtype: object

In [18]:
#keep the first three cast members (the output below shows three per movie), lowercase them and remove spaces
df_workable['cast'] = df_workable['cast'].apply(lambda x: [i.replace(" ", "").lower() for i in x[:3]])
df_workable['cast'].head(20)

0      [samworthington, zoesaldana, sigourneyweaver]
1         [johnnydepp, orlandobloom, keiraknightley]
2          [danielcraig, christophwaltz, léaseydoux]
3          [christianbale, michaelcaine, garyoldman]
4        [taylorkitsch, lynncollins, samanthamorton]
5          [tobeymaguire, kirstendunst, jamesfranco]
6             [zacharylevi, mandymoore, donnamurphy]
7     [robertdowneyjr., chrishemsworth, markruffalo]
8         [danielradcliffe, rupertgrint, emmawatson]
9                [benaffleck, henrycavill, galgadot]
10         [brandonrouth, kevinspacey, katebosworth]
11      [danielcraig, olgakurylenko, mathieuamalric]
12        [johnnydepp, orlandobloom, keiraknightley]
13        [johnnydepp, armiehammer, williamfichtner]
14           [henrycavill, amyadams, michaelshannon]
15       [benbarnes, williammoseley, annapopplewell]
16        [robertdowneyjr., chrisevans, markruffalo]
17            [johnnydepp, penélopecruz, ianmcshane]
18            [willsmith, tommyleejones, joshb

In [19]:
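#lowercase and remove spaces from the director name, genres and keywords
#note: astype('str') turns a missing director (NaN) into the string 'nan'
#the director is wrapped in a list so it can be concatenated with the other token lists later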
df_workable['director'] = df_workable['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
df_workable['director'] = df_workable['director'].apply(lambda x: [x])
df_workable['genres'] = df_workable['genres'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
df_workable['keywords'] = df_workable['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
df_workable.head(5)

Unnamed: 0,id,original_title,genres,keywords,cast,overview,popularity,runtime,release_date,director
0,19995,Avatar,"[action, adventure, fantasy, sciencefiction]","[cultureclash, future, spacewar, spacecolony, ...","[samworthington, zoesaldana, sigourneyweaver]","In the 22nd century, a paraplegic Marine is di...",150.437577,162.0,2009-12-10,jamescameron
1,285,Pirates of the Caribbean: At World's End,"[adventure, fantasy, action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[johnnydepp, orlandobloom, keiraknightley]","Captain Barbossa, long believed to be dead, ha...",139.082615,169.0,2007-05-19,goreverbinski
2,206647,Spectre,"[action, adventure, crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[danielcraig, christophwaltz, léaseydoux]",A cryptic message from Bond’s past sends him o...,107.376788,148.0,2015-10-26,sammendes
3,49026,The Dark Knight Rises,"[action, crime, drama, thriller]","[dccomics, crimefighter, terrorist, secretiden...","[christianbale, michaelcaine, garyoldman]",Following the death of District Attorney Harve...,112.31295,165.0,2012-07-16,christophernolan
4,49529,John Carter,"[action, adventure, sciencefiction]","[basedonnovel, mars, medallion, spacetravel, p...","[taylorkitsch, lynncollins, samanthamorton]","John Carter is a war-weary, former military ca...",43.926995,132.0,2012-03-07,andrewstanton


In [21]:
#make the overview and original_title columns lower case
df_workable['overview'] = df_workable['overview'].str.lower()
df_workable['original_title'] = df_workable['original_title'].str.lower()
df_workable.head(5)

Unnamed: 0,id,original_title,genres,keywords,cast,overview,popularity,runtime,release_date,director
0,19995,avatar,"[action, adventure, fantasy, sciencefiction]","[cultureclash, future, spacewar, spacecolony, ...","[samworthington, zoesaldana, sigourneyweaver]","in the 22nd century, a paraplegic marine is di...",150.437577,162.0,2009-12-10,jamescameron
1,285,pirates of the caribbean: at world's end,"[adventure, fantasy, action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[johnnydepp, orlandobloom, keiraknightley]","captain barbossa, long believed to be dead, ha...",139.082615,169.0,2007-05-19,goreverbinski
2,206647,spectre,"[action, adventure, crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[danielcraig, christophwaltz, léaseydoux]",a cryptic message from bond’s past sends him o...,107.376788,148.0,2015-10-26,sammendes
3,49026,the dark knight rises,"[action, crime, drama, thriller]","[dccomics, crimefighter, terrorist, secretiden...","[christianbale, michaelcaine, garyoldman]",following the death of district attorney harve...,112.31295,165.0,2012-07-16,christophernolan
4,49529,john carter,"[action, adventure, sciencefiction]","[basedonnovel, mars, medallion, spacetravel, p...","[taylorkitsch, lynncollins, samanthamorton]","john carter is a war-weary, former military ca...",43.926995,132.0,2012-03-07,andrewstanton


In [22]:
#change the overview column to word list
df_workable['overview'] = df_workable['overview'].apply(lambda x: x.split())
df_workable['overview'].head()  


0    [in, the, 22nd, century,, a, paraplegic, marin...
1    [captain, barbossa,, long, believed, to, be, d...
2    [a, cryptic, message, from, bond’s, past, send...
3    [following, the, death, of, district, attorney...
4    [john, carter, is, a, war-weary,, former, mili...
Name: overview, dtype: object

In [29]:
#make a soup column

df_workable['soup'] =  df_workable['overview']+ df_workable['keywords'] + df_workable['cast'] + df_workable['director'] + df_workable['genres']
df_workable.head(5)


Unnamed: 0,id,original_title,genres,keywords,cast,overview,popularity,runtime,release_date,director,soup
0,19995,avatar,"[action, adventure, fantasy, sciencefiction]","[cultureclash, future, spacewar, spacecolony, ...","[samworthington, zoesaldana, sigourneyweaver]","[in, the, 22nd, century,, a, paraplegic, marin...",150.437577,162.0,2009-12-10,[jamescameron],"[in, the, 22nd, century,, a, paraplegic, marin..."
1,285,pirates of the caribbean: at world's end,"[adventure, fantasy, action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[johnnydepp, orlandobloom, keiraknightley]","[captain, barbossa,, long, believed, to, be, d...",139.082615,169.0,2007-05-19,[goreverbinski],"[captain, barbossa,, long, believed, to, be, d..."
2,206647,spectre,"[action, adventure, crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[danielcraig, christophwaltz, léaseydoux]","[a, cryptic, message, from, bond’s, past, send...",107.376788,148.0,2015-10-26,[sammendes],"[a, cryptic, message, from, bond’s, past, send..."
3,49026,the dark knight rises,"[action, crime, drama, thriller]","[dccomics, crimefighter, terrorist, secretiden...","[christianbale, michaelcaine, garyoldman]","[following, the, death, of, district, attorney...",112.31295,165.0,2012-07-16,[christophernolan],"[following, the, death, of, district, attorney..."
4,49529,john carter,"[action, adventure, sciencefiction]","[basedonnovel, mars, medallion, spacetravel, p...","[taylorkitsch, lynncollins, samanthamorton]","[john, carter, is, a, war-weary,, former, mili...",43.926995,132.0,2012-03-07,[andrewstanton],"[john, carter, is, a, war-weary,, former, mili..."


In [25]:
#convert release_date to datetime
df_workable['release_date'] = pd.to_datetime(df_workable['release_date'], errors='coerce')  #coerce will put NaT for the errors

In [30]:
#join the soup tokens into one string per movie and take a final look before vectorizing
df_workable['soup'] = df_workable['soup'].apply(lambda x: ' '.join(x))
df_workable.soup.head(5)

0    in the 22nd century, a paraplegic marine is di...
1    captain barbossa, long believed to be dead, ha...
2    a cryptic message from bond’s past sends him o...
3    following the death of district attorney harve...
4    john carter is a war-weary, former military ca...
Name: soup, dtype: object

In [31]:
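#Now let's create a new dataframe with the columns that we need
#.copy() makes df_new independent of df_workable, so later assignments do not raise SettingWithCopyWarning
df_new = df_workable[['id', 'original_title', 'soup', 'popularity', 'runtime', 'release_date']].copy()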
df_new.head(5)

Unnamed: 0,id,original_title,soup,popularity,runtime,release_date
0,19995,avatar,"in the 22nd century, a paraplegic marine is di...",150.437577,162.0,2009-12-10
1,285,pirates of the caribbean: at world's end,"captain barbossa, long believed to be dead, ha...",139.082615,169.0,2007-05-19
2,206647,spectre,a cryptic message from bond’s past sends him o...,107.376788,148.0,2015-10-26
3,49026,the dark knight rises,following the death of district attorney harve...,112.31295,165.0,2012-07-16
4,49529,john carter,"john carter is a war-weary, former military ca...",43.926995,132.0,2012-03-07


## Creating Embeddings

Embeddings are a crucial component in natural language processing (NLP) and machine learning. They are dense vector representations of words, phrases, or other entities that capture semantic meaning and relationships. Here, we will discuss why we create embeddings and their targets.

### Why Create Embeddings?

1. **Dimensionality Reduction**:
    - Embeddings reduce the high-dimensional space of one-hot encoded vectors into a lower-dimensional space, making computations more efficient.

2. **Semantic Similarity**:
    - Embeddings capture semantic relationships between words. Words with similar meanings have similar vector representations, enabling better understanding and processing of language.

3. **Handling Sparsity**:
    - One-hot encoding results in sparse vectors with many zeros. Embeddings provide dense representations, which are more efficient for machine learning models.

4. **Improved Performance**:
    - Embeddings improve the performance of NLP models by providing meaningful representations that capture context and relationships between words.

5. **Transfer Learning**:
    - Pre-trained embeddings (e.g., Word2Vec, GloVe, BERT) can be used across different tasks, leveraging knowledge from large datasets and reducing the need for extensive training data.

### Targets of Embeddings

1. **Words**:
    - Word embeddings represent individual words in a continuous vector space. Examples include Word2Vec and GloVe.

2. **Sentences**:
    - Sentence embeddings capture the meaning of entire sentences. They are useful for tasks like sentence similarity and sentiment analysis. Examples include Universal Sentence Encoder and Sentence-BERT.

3. **Documents**:
    - Document embeddings represent entire documents or paragraphs. They are used in tasks like document classification and topic modeling. Examples include Doc2Vec.

4. **Entities**:
    - Entity embeddings represent categorical variables or entities in a dataset. They are used in recommendation systems and structured data tasks.

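By creating embeddings, we aim to transform raw data into representations that machine learning models can use effectively. Note that the cells below do not train dense embeddings: they build sparse bag-of-words count vectors with `CountVectorizer` and compare movies by the cosine similarity of those vectors.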
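
The contrast between one-hot vectors and dense vectors described above can be illustrated with a toy example. The three-word vocabulary and the two-dimensional "dense" vectors below are invented by hand for illustration; they do not come from Word2Vec, GloVe, or any trained model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["king", "queen", "banana"]

# One-hot: one dimension per word, with a single 1 in each vector.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Made-up 2-dimensional dense vectors, chosen by hand for illustration only.
dense = {"king": [0.9, 0.1], "queen": [0.8, 0.2], "banana": [0.1, 0.9]}

# Distinct one-hot vectors are always orthogonal, so they carry no notion of similarity.
print(cosine(one_hot["king"], one_hot["queen"]))  # 0.0

# Dense vectors can place related words closer together than unrelated ones.
print(cosine(dense["king"], dense["queen"]) > cosine(dense["king"], dense["banana"]))  # True
```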

In [32]:
!pip install nltk

In [33]:
#Now let's preprocess the soup column using nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import PorterStemmer

#We start with stemming
stemmer = PorterStemmer()
df_new['soup'] = df_new['soup'].apply(lambda x: ' '.join([stemmer.stem(i) for i in x.split()]))
df_new['soup'].head(5)


0    in the 22nd century, a parapleg marin is dispa...
1    captain barbossa, long believ to be dead, ha c...
2    a cryptic messag from bond’ past send him on a...
3    follow the death of district attorney harvey d...
4    john carter is a war-weary, former militari ca...
Name: soup, dtype: object

In [34]:
#second, lemmatisation (applied to the already-stemmed tokens, so the output below is unchanged)
lemmatizer = WordNetLemmatizer()
df_new['soup'] = df_new['soup'].apply(lambda x: ' '.join([lemmatizer.lemmatize(i) for i in x.split()]))
df_new['soup'].head(5)


0    in the 22nd century, a parapleg marin is dispa...
1    captain barbossa, long believ to be dead, ha c...
2    a cryptic messag from bond’ past send him on a...
3    follow the death of district attorney harvey d...
4    john carter is a war-weary, former militari ca...
Name: soup, dtype: object

In [35]:
#Now let's build bag-of-words count vectors for the soup column and compare movies with cosine similarity
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df_new['soup'])
cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim


array([[1.        , 0.06776309, 0.06862635, ..., 0.03571429, 0.        ,
        0.        ],
       [0.06776309, 1.        , 0.05063697, ..., 0.01976424, 0.        ,
        0.02132007],
       [0.06862635, 0.05063697, 1.        , ..., 0.02001602, 0.        ,
        0.        ],
       ...,
       [0.03571429, 0.01976424, 0.02001602, ..., 1.        , 0.03311331,
        0.03370999],
       [0.        , 0.        , 0.        , ..., 0.03311331, 1.        ,
        0.07143996],
       [0.        , 0.02132007, 0.        , ..., 0.03370999, 0.07143996,
        1.        ]])
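
Each entry of the matrix above is the cosine similarity of two count vectors, cos(a, b) = a·b / (‖a‖ ‖b‖). A minimal pure-Python sketch of that calculation follows; `count_cosine` is a hypothetical helper that splits on whitespace only, so it does not reproduce `CountVectorizer`'s tokenization or its English stop-word removal, and the three soup strings are made up:

```python
import math
from collections import Counter

def count_cosine(text_a, text_b):
    """Cosine similarity between whitespace-token count vectors of two strings."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)

soup_a = "action adventure jamescameron space"
soup_b = "action adventure space mars"
soup_c = "romance comedy"

print(round(count_cosine(soup_a, soup_b), 2))  # 3 shared tokens, norms 2 and 2 -> 0.75
print(count_cosine(soup_a, soup_c))            # no shared tokens -> 0.0
```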

In [47]:
#Now let's create a function that takes a movie title as input and returns the 10 most similar movies
#map each title to its row position in cosine_sim; positions are used instead of index labels
#because dropna() left gaps in df_new.index, so labels no longer line up with matrix rows
indices = pd.Series(np.arange(len(df_new)), index=df_new['original_title'])
indices = indices[~indices.index.duplicated(keep='first')]  #keep the first row for repeated titles

def get_recommendations(title, cosine_sim=cosine_sim):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]  #skip the first entry, which is normally the movie itself
    movie_indices = [i[0] for i in sim_scores]
    return df_new['original_title'].iloc[movie_indices]

df_new['original_title'].head(10)


0                                      avatar
1    pirates of the caribbean: at world's end
2                                     spectre
3                       the dark knight rises
4                                 john carter
5                                spider-man 3
6                                     tangled
7                     avengers: age of ultron
8      harry potter and the half-blood prince
9          batman v superman: dawn of justice
Name: original_title, dtype: object

In [48]:
get_recommendations('tangled')

269     the princess and the frog
255             home on the range
1695                      aladdin
42                    toy story 3
2309                         逃出生天
1594                 corpse bride
1676          konferenz der tiere
391                     enchanted
506               despicable me 2
358     atlantis: the lost empire
Name: original_title, dtype: object
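
The lookup relies on row positions in the similarity matrix, not on DataFrame labels. A minimal sketch of the same ranking logic on plain lists, where `top_n`, the four titles, and the similarity scores are all made up for illustration; unlike the function above, it excludes the query by position instead of assuming it sorts first:

```python
def top_n(title, titles, sim, n=2):
    """Return the n titles most similar to `title`, excluding the title itself."""
    pos = titles.index(title)  # row position in the matrix, not a DataFrame label
    scored = [(j, s) for j, s in enumerate(sim[pos]) if j != pos]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [titles[j] for j, _ in scored[:n]]

titles = ["tangled", "aladdin", "spectre", "enchanted"]
sim = [  # invented symmetric similarity scores
    [1.0, 0.4, 0.0, 0.3],
    [0.4, 1.0, 0.1, 0.2],
    [0.0, 0.1, 1.0, 0.0],
    [0.3, 0.2, 0.0, 1.0],
]
print(top_n("tangled", titles, sim))  # ['aladdin', 'enchanted']
```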

In [49]:
df_new.to_csv('data/df_new.csv', index=False)

In [3]:
#export this code in a python file
!jupyter nbconvert --to script Notebook.ipynb




[NbConvertApp] Converting notebook Notebook.ipynb to script
[NbConvertApp] Writing 9501 bytes to Notebook.py
