# Movie plot processing
This notebook reads a dataset of movie summaries (`movies_metadata.csv`, downloaded from https://www.kaggle.com/rounakbanik/the-movies-dataset); the code below loads it from `original.csv`, presumably a renamed local copy. The goal is a table holding the title, the summary, and an embedding of each for roughly a thousand of the most popular movies.

In [1]:
import pandas as pd
import spacy

In [2]:
original = pd.read_csv('original.csv', low_memory=False)
print(f'example summary: {original["overview"][0]}')
print(f'number of movies: {len(original)}')

example summary: Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.
number of movies: 45466


In [3]:
numeric = original.copy()  # explicit copy, so converting the column leaves `original` untouched
numeric['popularity'] = pd.to_numeric(numeric['popularity'], errors='coerce')

In [4]:
popular = numeric.nlargest(1024, 'popularity')
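
The two cells above convert `popularity` to numbers and keep the highest-ranked rows. A minimal sketch of how that combination behaves on toy data, assuming that some popularity entries may not parse as numbers (which the use of `errors='coerce'` suggests); the frame `toy` and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data: one popularity value is not a number.
toy = pd.DataFrame({
    'original_title': ['A', 'B', 'C', 'D'],
    'popularity': ['3.5', 'not a number', '10.0', '7.25'],
})

# errors='coerce' turns unparseable entries into NaN instead of raising.
toy['popularity'] = pd.to_numeric(toy['popularity'], errors='coerce')

# nlargest returns the rows with the highest values, sorted in descending order.
top = toy.nlargest(2, 'popularity')
print(top['original_title'].tolist())  # ['C', 'D']
```

Because the result is sorted by descending popularity, later steps that keep the first of several rows keep the most popular one.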

In [5]:
processed = pd.DataFrame()
processed['title'] = popular['original_title']
processed['summary'] = popular['overview']

In [6]:
processed.head()

index,title,summary
30700,Minions,"Minions Stuart, Kevin and Bob are recruited by..."
33356,Wonder Woman,An Amazon princess comes to the world of Man t...
42222,Beauty and the Beast,A live-action adaptation of Disney's version o...
43644,Baby Driver,After being coerced into working for a crime b...
24455,Big Hero 6,The special bond that develops between plus-si...


In [7]:
def harry_potter(df):
    locations = df['title'].str.contains('Harry Potter')
    return df.loc[locations]

In [8]:
harry_potter(processed)

index,title,summary
4766,Harry Potter and the Philosopher's Stone,Harry Potter has lived under the stairs at his...
5678,Harry Potter and the Chamber of Secrets,"Ignoring threats to his life, Harry returns to..."
7725,Harry Potter and the Prisoner of Azkaban,"Harry, Ron and Hermione return to Hogwarts for..."
17437,Harry Potter and the Deathly Hallows: Part 2,"Harry, Ron and Hermione continue their quest t..."
10554,Harry Potter and the Goblet of Fire,"Harry starts his fourth year at Hogwarts, comp..."
16128,Harry Potter and the Deathly Hallows: Part 1,"Harry, Ron and Hermione walk away from their l..."
11927,Harry Potter and the Order of the Phoenix,Returning for his fifth year of study at Hogwa...
13893,Harry Potter and the Half-Blood Prince,"As Harry begins his sixth year at Hogwarts, he..."


In [9]:
processed = processed.drop_duplicates(subset='title')
processed = processed.drop_duplicates(subset='summary')

In [10]:
harry_potter(processed)

index,title,summary
4766,Harry Potter and the Philosopher's Stone,Harry Potter has lived under the stairs at his...
5678,Harry Potter and the Chamber of Secrets,"Ignoring threats to his life, Harry returns to..."
7725,Harry Potter and the Prisoner of Azkaban,"Harry, Ron and Hermione return to Hogwarts for..."
17437,Harry Potter and the Deathly Hallows: Part 2,"Harry, Ron and Hermione continue their quest t..."
10554,Harry Potter and the Goblet of Fire,"Harry starts his fourth year at Hogwarts, comp..."
16128,Harry Potter and the Deathly Hallows: Part 1,"Harry, Ron and Hermione walk away from their l..."
11927,Harry Potter and the Order of the Phoenix,Returning for his fifth year of study at Hogwa...
13893,Harry Potter and the Half-Blood Prince,"As Harry begins his sixth year at Hogwarts, he..."


In [11]:
len(processed)

1014
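
The deduplication step removes rows that repeat a title and then rows that repeat a summary. A small sketch on invented data shows which rows survive; `drop_duplicates` keeps the first occurrence by default, and since `popular` comes out of `nlargest` sorted by descending popularity, the first occurrence should be the most popular one. The toy titles and summaries are hypothetical:

```python
import pandas as pd

# Hypothetical toy data with one repeated title and one repeated summary.
toy = pd.DataFrame({
    'title': ['Alpha', 'Alpha', 'Beta', 'Gamma'],
    'summary': ['first plot', 'second plot', 'shared plot', 'shared plot'],
})

# Drop the second 'Alpha' (index 1); the first occurrence is kept.
deduped = toy.drop_duplicates(subset='title')
# Drop 'Gamma' (index 3), whose summary repeats the one of 'Beta'.
deduped = deduped.drop_duplicates(subset='summary')

print(deduped.index.tolist())  # [0, 2]
```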

In [12]:
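# Keep only summaries longer than 64 characters; the upper bound is left disabled.
lengths = processed['summary'].str.len()  # unlike map(len), this does not fail on missing summaries
okay_length = lengths > 64  # & (lengths < 8192)
processed = processed.loc[okay_length].copy()  # copy, so new columns can be added without warnings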

In [13]:
harry_potter(processed)

index,title,summary
4766,Harry Potter and the Philosopher's Stone,Harry Potter has lived under the stairs at his...
5678,Harry Potter and the Chamber of Secrets,"Ignoring threats to his life, Harry returns to..."
7725,Harry Potter and the Prisoner of Azkaban,"Harry, Ron and Hermione return to Hogwarts for..."
17437,Harry Potter and the Deathly Hallows: Part 2,"Harry, Ron and Hermione continue their quest t..."
10554,Harry Potter and the Goblet of Fire,"Harry starts his fourth year at Hogwarts, comp..."
16128,Harry Potter and the Deathly Hallows: Part 1,"Harry, Ron and Hermione walk away from their l..."
11927,Harry Potter and the Order of the Phoenix,Returning for his fifth year of study at Hogwa...
13893,Harry Potter and the Half-Blood Prince,"As Harry begins his sixth year at Hogwarts, he..."


In [14]:
len(processed)

1012
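
The length filter drops very short summaries (two rows here, going from 1014 to 1012). A hedged sketch of the same filter on toy data, written with `.str.len()` so that a missing summary is filtered out rather than raising an error; the toy frame and the missing value are assumptions for illustration, not rows from the dataset:

```python
import pandas as pd

# Hypothetical toy data: a short summary, a long one, and a missing one.
toy = pd.DataFrame({'summary': ['short', 'x' * 100, float('nan')]})

# .str.len() returns a missing value for missing entries instead of failing.
lengths = toy['summary'].str.len()

# Comparisons against a missing length are not True, so that row is dropped.
okay_length = lengths > 64
kept = toy.loc[okay_length]

print(kept.index.tolist())  # [1]
```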

## It's embedding time!

In [15]:
nlp = spacy.load('en_core_web_lg')

In [17]:
title_vectors = [nlp(title).vector for title in processed['title']]

In [18]:
print(f'example title vector: {title_vectors[0].dtype}, len: {len(title_vectors[0])}')

example title vector: float32, len: 300
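
According to the spaCy documentation, `Doc.vector` defaults to the average of the token vectors, which is why every title and summary ends up with the same length of 300 regardless of how many words it has. A minimal numpy sketch of that averaging, using made-up 3-dimensional token vectors instead of the model's real ones:

```python
import numpy as np

# Hypothetical 3-dimensional vectors for a three-token text;
# the real model produces 300 dimensions per token.
token_vectors = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 4.0],
], dtype=np.float32)

# Averaging over the token axis yields one fixed-size vector per text.
doc_vector = token_vectors.mean(axis=0)

print(doc_vector, doc_vector.dtype)  # [2. 2. 2.] float32
```

One consequence of averaging is that word order is lost: two texts made of the same tokens in a different order get the same vector.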


In [19]:
processed['title_vector'] = title_vectors

In [20]:
summary_vectors = [nlp(summary).vector for summary in processed['summary']]
processed['summary_vector'] = summary_vectors

In [21]:
harry_potter(processed)

index,title,summary,title_vector,summary_vector
4766,Harry Potter and the Philosopher's Stone,Harry Potter has lived under the stairs at his...,"[0.14711046, 0.08602543, -0.042742576, -0.0879...","[0.007231035, 0.17376778, -0.08632122, 0.01730..."
5678,Harry Potter and the Chamber of Secrets,"Ignoring threats to his life, Harry returns to...","[0.10979699, 0.11679685, -0.07873614, -0.11134...","[-0.08677669, 0.11283945, -0.15686333, 0.04908..."
7725,Harry Potter and the Prisoner of Azkaban,"Harry, Ron and Hermione return to Hogwarts for...","[0.061607003, -0.09384886, -0.0041819983, -0.0...","[-0.0077857934, 0.15869738, -0.13243689, -0.00..."
17437,Harry Potter and the Deathly Hallows: Part 2,"Harry, Ron and Hermione continue their quest t...","[0.07672289, -0.08384389, -0.096302, -0.060086...","[-0.0021664782, 0.016848696, -0.12276313, -0.0..."
10554,Harry Potter and the Goblet of Fire,"Harry starts his fourth year at Hogwarts, comp...","[-0.011108716, -0.049533136, -0.06561043, -0.1...","[-0.02013416, 0.108690456, -0.06220861, -0.037..."
16128,Harry Potter and the Deathly Hallows: Part 1,"Harry, Ron and Hermione walk away from their l...","[0.05401389, -0.06443133, -0.07180156, -0.0186...","[-0.010623759, 0.12037627, -0.08234681, -0.007..."
11927,Harry Potter and the Order of the Phoenix,Returning for his fifth year of study at Hogwa...,"[0.03729488, -0.0061281826, -0.076897874, -0.1...","[0.011530182, 0.1461688, -0.09513208, 0.000799..."
13893,Harry Potter and the Half-Blood Prince,"As Harry begins his sixth year at Hogwarts, he...","[-0.0019846223, 0.01903063, -0.067967, -0.0175...","[0.022718994, 0.21176529, -0.08548818, 0.00756..."


In [22]:
processed.to_csv('processed.csv', index=False)
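
One thing to keep in mind: `to_csv` writes array-valued cells as their text representation, so the vector columns come back as strings when the file is read again. If the vectors need to be reloaded as numbers, one possible alternative (not what this notebook does) is to stack them into a matrix and store that in numpy's binary format next to the CSV. The sketch below uses an in-memory buffer and toy vectors; with real data the buffer would be replaced by a file path of your choosing:

```python
import io

import numpy as np

# Hypothetical stand-ins for processed['title_vector'].
vectors = [
    np.array([0.1, 0.2, 0.3], dtype=np.float32),
    np.array([0.4, 0.5, 0.6], dtype=np.float32),
]

# One row per movie, in the same order as the rows of the table.
matrix = np.stack(vectors)

# Save to and load from an in-memory buffer (a file path works the same way).
buffer = io.BytesIO()
np.save(buffer, matrix)
buffer.seek(0)
restored = np.load(buffer)

print(restored.shape, restored.dtype)  # (2, 3) float32
```

The row order of the matrix has to stay aligned with the rows of the saved table, since the matrix itself carries no titles.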