# 01 - Data Preprocessing

## 1. Import Packages <a name="import"></a>

In [1]:
import pandas as pd
import numpy as np
import pprint as pp
import pickle
import json

In [2]:
import os

## Table of Contents <a name="table"></a>
1. [Import Packages](#import)
2. [Load Data](#load)
3. [Combine Data](#combine)
4. [Save Data](#save)

## 2. Load Data <a name="load"></a>

The movie data for this project was obtained from https://www.kaggle.com/rmisra/imdb-spoiler-dataset/data# <br>
The files were compressed with gzip from the terminal.

In [3]:
movie_details = pd.read_json('./data/IMDB_movie_details.json.gz', 
                             lines = True, compression = 'gzip')

movie_details.head()

Unnamed: 0,movie_id,plot_summary,duration,genre,rating,release_date,plot_synopsis
0,tt0105112,"Former CIA analyst, Jack Ryan is in England wi...",1h 57min,"[Action, Thriller]",6.9,1992-06-05,"Jack Ryan (Ford) is on a ""working vacation"" in..."
1,tt1204975,"Billy (Michael Douglas), Paddy (Robert De Niro...",1h 45min,[Comedy],6.6,2013-11-01,Four boys around the age of 10 are friends in ...
2,tt0243655,"The setting is Camp Firewood, the year 1981. I...",1h 37min,"[Comedy, Romance]",6.7,2002-04-11,
3,tt0040897,"Fred C. Dobbs and Bob Curtin, both down on the...",2h 6min,"[Adventure, Drama, Western]",8.3,1948-01-24,Fred Dobbs (Humphrey Bogart) and Bob Curtin (T...
4,tt0126886,Tracy Flick is running unopposed for this year...,1h 43min,"[Comedy, Drama, Romance]",7.3,1999-05-07,Jim McAllister (Matthew Broderick) is a much-a...


In [4]:
movie_details.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1572 entries, 0 to 1571
Data columns (total 7 columns):
movie_id         1572 non-null object
plot_summary     1572 non-null object
duration         1572 non-null object
genre            1572 non-null object
rating           1572 non-null float64
release_date     1572 non-null object
plot_synopsis    1572 non-null object
dtypes: float64(1), object(6)
memory usage: 86.1+ KB
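
As a minimal sketch of the file format used above, the snippet below writes a tiny DataFrame as gzip-compressed JSON Lines and reads it back with the same `pd.read_json` arguments. The toy records and the temporary file name are made up for illustration and are not part of the dataset.

```python
import os
import tempfile

import pandas as pd

# Toy records standing in for the real movie details (values are made up).
toy = pd.DataFrame({
    'movie_id': ['tt0000001', 'tt0000002'],
    'plot_synopsis': ['First synopsis.', ''],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_details.json.gz')

    # One JSON object per line, gzip-compressed on the way out.
    toy.to_json(toy_path, orient='records', lines=True, compression='gzip')

    # Same arguments as the read_json call used for the real file.
    toy_loaded = pd.read_json(toy_path, lines=True, compression='gzip')

print(toy_loaded)
```

Note that the empty synopsis comes back as an empty string rather than a missing value, which is why `info()` can report every row as non-null even when some synopses are blank.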


A list of titles associated with movie IDs can be found at https://datasets.imdbws.com/ <br>
Only the first and third columns (the title ID and the title itself) are of interest.

In [5]:
titles = pd.read_csv('./data/title.basics.tsv.gz', encoding = 'utf8',
                     delimiter = '\t', compression = 'gzip', usecols = [0, 2])

#rename the columns
titles.columns = ['movie_id', 'title']

titles.head()

Unnamed: 0,movie_id,title
0,tt0000001,Carmencita
1,tt0000002,Le clown et ses chiens
2,tt0000003,Pauvre Pierrot
3,tt0000004,Un bon bock
4,tt0000005,Blacksmith Scene


In [6]:
titles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6621144 entries, 0 to 6621143
Data columns (total 2 columns):
movie_id    object
title       object
dtypes: object(2)
memory usage: 101.0+ MB
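
The column selection above can be sketched on an in-memory sample. The header names `tconst`, `titleType`, and `primaryTitle` are assumed from the public description of the IMDb files, and the two rows are a hand-made extract, so treat this as an illustration rather than the real file layout.

```python
import io

import pandas as pd

# Hand-made, tab-separated sample; header names are an assumption.
toy_tsv = (
    "tconst\ttitleType\tprimaryTitle\n"
    "tt0000001\tshort\tCarmencita\n"
    "tt0000002\tshort\tLe clown et ses chiens\n"
)

# usecols=[0, 2] keeps only the first and third columns while parsing,
# so the unused columns are never loaded into memory.
toy_titles = pd.read_csv(io.StringIO(toy_tsv), sep='\t', usecols=[0, 2])

# Rename to match the column names used in the movie details.
toy_titles.columns = ['movie_id', 'title']

print(toy_titles)
```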


The movie reviews are our main source of data for spoiler detection.

In [7]:
movie_reviews = pd.read_json('./data/IMDB_reviews.json.gz', 
                             lines = True, compression = 'gzip')

movie_reviews.head()

Unnamed: 0,review_date,movie_id,user_id,is_spoiler,review_text,rating,review_summary
0,10 February 2006,tt0111161,ur1898687,True,"In its Oscar year, Shawshank Redemption (writt...",10,A classic piece of unforgettable film-making.
1,6 September 2000,tt0111161,ur0842118,True,The Shawshank Redemption is without a doubt on...,10,Simply amazing. The best film of the 90's.
2,3 August 2001,tt0111161,ur1285640,True,I believe that this film is the best story eve...,8,The best story ever told on film
3,1 September 2002,tt0111161,ur1003471,True,"**Yes, there are SPOILERS here**This film has ...",10,Busy dying or busy living?
4,20 May 2004,tt0111161,ur0226855,True,At the heart of this extraordinary movie is a ...,8,"Great story, wondrously told and acted"


In [8]:
movie_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 573913 entries, 0 to 573912
Data columns (total 7 columns):
review_date       573913 non-null object
movie_id          573913 non-null object
user_id           573913 non-null object
is_spoiler        573913 non-null bool
review_text       573913 non-null object
rating            573913 non-null int64
review_summary    573913 non-null object
dtypes: bool(1), int64(1), object(5)
memory usage: 26.8+ MB


Return to [Table of Contents](#table)

## 3. Combine Data <a name="combine"></a>

Some movie IDs contain a '/' character, which would prevent them from matching during the merge later. <br>
These characters need to be removed.

In [9]:
movie_details['movie_id'] = movie_details['movie_id'].apply(lambda movie_id: movie_id.replace('/',''))
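
To illustrate why the stray character matters, here is a small sketch with two toy frames (the trailing slash on the first ID is made up for the example). It uses the vectorized `Series.str.replace` with `regex=False` as an alternative to `apply`; both approaches should give the same result on string IDs.

```python
import pandas as pd

toy_details = pd.DataFrame({'movie_id': ['tt0105112/', 'tt1204975']})
toy_titles = pd.DataFrame({
    'movie_id': ['tt0105112', 'tt1204975'],
    'title': ['Patriot Games', 'Last Vegas'],
})

# With the slash in place, only one of the two IDs finds a match.
before = toy_details.merge(toy_titles, on='movie_id', how='inner')

# Remove every '/' as a literal character (no regular expression).
toy_details['movie_id'] = toy_details['movie_id'].str.replace('/', '', regex=False)

# After cleaning, both IDs match.
after = toy_details.merge(toy_titles, on='movie_id', how='inner')

print(len(before), len(after))
```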

We are only interested in movies with plot synopses because we need the text data in order to perform our modeling later. <br>
A plot synopsis can be regarded as the ultimate spoiler.

In [10]:
movie_details = movie_details[movie_details['plot_synopsis'] != '']

movie_details

Unnamed: 0,movie_id,plot_summary,duration,genre,rating,release_date,plot_synopsis
0,tt0105112,"Former CIA analyst, Jack Ryan is in England wi...",1h 57min,"[Action, Thriller]",6.9,1992-06-05,"Jack Ryan (Ford) is on a ""working vacation"" in..."
1,tt1204975,"Billy (Michael Douglas), Paddy (Robert De Niro...",1h 45min,[Comedy],6.6,2013-11-01,Four boys around the age of 10 are friends in ...
3,tt0040897,"Fred C. Dobbs and Bob Curtin, both down on the...",2h 6min,"[Adventure, Drama, Western]",8.3,1948-01-24,Fred Dobbs (Humphrey Bogart) and Bob Curtin (T...
4,tt0126886,Tracy Flick is running unopposed for this year...,1h 43min,"[Comedy, Drama, Romance]",7.3,1999-05-07,Jim McAllister (Matthew Broderick) is a much-a...
5,tt0286716,"Bruce Banner, a brilliant scientist with a clo...",2h 18min,"[Action, Sci-Fi]",5.7,2003-06-20,Bruce Banner (Eric Bana) is a research scienti...
...,...,...,...,...,...,...,...
1563,tt0120655,An abortion clinic worker with a special herit...,2h 10min,"[Adventure, Comedy, Drama]",7.3,1999-11-12,The film opens with a homeless man (Bud Cort) ...
1565,tt0276751,Twelve year old Marcus Brewer lives with his c...,1h 41min,"[Comedy, Drama, Romance]",7.1,2002-05-17,Will Freeman (Hugh Grant) is a 38-year-old bac...
1567,tt0289879,Evan Treborn grows up in a small town with his...,1h 53min,"[Sci-Fi, Thriller]",7.7,2004-01-23,"In the year 1998, Evan Treborn (Ashton Kutcher..."
1568,tt1723811,Brandon is a 30-something man living in New Yo...,1h 41min,[Drama],7.2,2012-01-13,"Brandon (Michael Fassbender) is a successful, ..."
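
The filter above compares against the empty string because blank synopses are stored as `''`, not as missing values. The sketch below, on made-up values, shows the difference and also counts whitespace-only entries, which an exact `!= ''` comparison would not catch. Whether the real data contains such entries is not checked here.

```python
import pandas as pd

toy = pd.DataFrame({'plot_synopsis': ['Some text.', '', '   ']})

# Empty strings are not treated as missing values.
n_missing = int(toy['plot_synopsis'].isna().sum())

# Exact comparison finds only the truly empty string.
n_empty = int((toy['plot_synopsis'] == '').sum())

# Stripping whitespace first also catches whitespace-only entries.
n_blank = int(toy['plot_synopsis'].str.strip().eq('').sum())

print(n_missing, n_empty, n_blank)
```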


We then keep only movies whose plot synopses have at least 50 sentences, approximating the sentence count by splitting each synopsis on periods. <br>
This is to ensure there is enough text for topic modeling later.

In [12]:
num_sentences = movie_details['plot_synopsis'].apply(lambda synopsis: len(synopsis.split('.')))

movie_details = movie_details[num_sentences >= 50]
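
Splitting on periods is only a rough proxy for counting sentences. The sketch below compares it with a hypothetical regex-based counter (`count_by_regex` is not part of the notebook) on a made-up sentence. Neither is exact, since abbreviations such as "Dr." still inflate both counts.

```python
import re


def count_by_period(text):
    """Number of pieces produced by splitting on '.', as in the cell above."""
    return len(text.split('.'))


def count_by_regex(text):
    """Split on whitespace that follows '.', '!' or '?', ignoring empty pieces."""
    pieces = re.split(r'(?<=[.!?])\s+', text.strip())
    return len([piece for piece in pieces if piece])


sample = 'Dr. Smith arrives... He leaves.'

print(count_by_period(sample), count_by_regex(sample))
print(count_by_period(''), count_by_regex(''))
```

The period-based count is always at least 1, even for an empty string, so it slightly overestimates. For a threshold as coarse as 50 this is unlikely to matter much, but it is worth keeping in mind when reusing the count elsewhere.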

We combine our 'movie_details' dataframe with the 'titles' dataframe and drop the 'rating', 'release_date', and 'duration' columns. <br>
Because this is an inner join, every remaining movie ID now has a title.

In [13]:
movie_details = (movie_details.merge(titles, on='movie_id', how='inner')
                              .drop(['rating', 'release_date', 'duration'], axis = 1))

movie_details.head()

Unnamed: 0,movie_id,plot_summary,genre,plot_synopsis,title
0,tt0105112,"Former CIA analyst, Jack Ryan is in England wi...","[Action, Thriller]","Jack Ryan (Ford) is on a ""working vacation"" in...",Patriot Games
1,tt1204975,"Billy (Michael Douglas), Paddy (Robert De Niro...",[Comedy],Four boys around the age of 10 are friends in ...,Last Vegas
2,tt0126886,Tracy Flick is running unopposed for this year...,"[Comedy, Drama, Romance]",Jim McAllister (Matthew Broderick) is a much-a...,Election
3,tt0286716,"Bruce Banner, a brilliant scientist with a clo...","[Action, Sci-Fi]",Bruce Banner (Eric Bana) is a research scienti...,Hulk
4,tt0090605,57 years after Ellen Ripley had a close encoun...,"[Action, Adventure, Sci-Fi]","After the opening credits, we see a spacecraft...",Aliens
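
An inner merge silently drops any movie ID without a title. As a sketch of how one might audit that, the snippet below runs a left merge with `indicator=True` on toy frames to list the IDs that would be lost. The ID `tt9999999` and the ratings are made up for the example.

```python
import pandas as pd

toy_details = pd.DataFrame({
    'movie_id': ['tt0105112', 'tt9999999'],  # the second ID is made up
    'rating': [6.9, 5.0],
})
toy_titles = pd.DataFrame({
    'movie_id': ['tt0105112', 'tt1204975'],
    'title': ['Patriot Games', 'Last Vegas'],
})

# A left merge keeps every detail row; '_merge' records where each row matched.
audit = toy_details.merge(toy_titles, on='movie_id', how='left', indicator=True)
unmatched_ids = audit.loc[audit['_merge'] == 'left_only', 'movie_id'].tolist()

# The inner merge keeps only matched IDs, then drops the unused column.
merged = (toy_details.merge(toy_titles, on='movie_id', how='inner')
                     .drop(columns=['rating']))

print(unmatched_ids)
print(merged)
```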


In [24]:
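# Count the period-delimited pieces in each review.
# str.split always returns at least one piece, so this filter keeps every row;
# the result is only displayed, not assigned back to movie_reviews.
num_sentences = movie_reviews['review_text'].apply(lambda review: len(review.split('.')))

movie_reviews[num_sentences >= 1]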

Unnamed: 0,review_date,movie_id,user_id,is_spoiler,review_text,rating,review_summary
0,10 February 2006,tt0111161,ur1898687,True,"In its Oscar year, Shawshank Redemption (writt...",10,A classic piece of unforgettable film-making.
1,6 September 2000,tt0111161,ur0842118,True,The Shawshank Redemption is without a doubt on...,10,Simply amazing. The best film of the 90's.
2,3 August 2001,tt0111161,ur1285640,True,I believe that this film is the best story eve...,8,The best story ever told on film
3,1 September 2002,tt0111161,ur1003471,True,"**Yes, there are SPOILERS here**This film has ...",10,Busy dying or busy living?
4,20 May 2004,tt0111161,ur0226855,True,At the heart of this extraordinary movie is a ...,8,"Great story, wondrously told and acted"
...,...,...,...,...,...,...,...
573908,8 August 1999,tt0139239,ur0100166,False,"Go is wise, fast and pure entertainment. Assem...",10,The best teen movie of the nineties
573909,31 July 1999,tt0139239,ur0021767,False,"Well, what shall I say. this one´s fun at any ...",9,Go - see the movie
573910,20 July 1999,tt0139239,ur0392750,False,"Go is the best movie I have ever seen, and I'v...",10,It's the best movie I've ever seen
573911,11 June 1999,tt0139239,ur0349105,False,Call this 1999 teenage version of Pulp Fiction...,3,Haven't we seen this before?


To save memory, we first identify which movie IDs are shared between 'movie_reviews' and 'movie_details'. <br>
We then keep only the movies that have both review data and movie details. <br>
While we could merge 'movie_reviews' and 'movie_details' on 'movie_id', keeping them separate uses less memory. <br>
This is because a merge would repeat each movie's details on every one of its reviews.

In [25]:
movie_ids_details = set(movie_details['movie_id'].values)
movie_ids_reviews = set(movie_reviews['movie_id'].unique())

shared_movie_ids = movie_ids_details.intersection(movie_ids_reviews)

In [26]:
movie_details = movie_details[movie_details['movie_id'].isin(shared_movie_ids)]

movie_details

Unnamed: 0,movie_id,plot_summary,genre,plot_synopsis,title
0,tt0105112,"Former CIA analyst, Jack Ryan is in England wi...","[Action, Thriller]","Jack Ryan (Ford) is on a ""working vacation"" in...",Patriot Games
1,tt1204975,"Billy (Michael Douglas), Paddy (Robert De Niro...",[Comedy],Four boys around the age of 10 are friends in ...,Last Vegas
2,tt0126886,Tracy Flick is running unopposed for this year...,"[Comedy, Drama, Romance]",Jim McAllister (Matthew Broderick) is a much-a...,Election
3,tt0286716,"Bruce Banner, a brilliant scientist with a clo...","[Action, Sci-Fi]",Bruce Banner (Eric Bana) is a research scienti...,Hulk
4,tt0090605,57 years after Ellen Ripley had a close encoun...,"[Action, Adventure, Sci-Fi]","After the opening credits, we see a spacecraft...",Aliens
...,...,...,...,...,...
875,tt0120891,Jim West is a guns-a-blazing former Civil War ...,"[Action, Comedy, Sci-Fi]",The story opens in Louisiana in 1869. A man fi...,Wild Wild West
876,tt0120655,An abortion clinic worker with a special herit...,"[Adventure, Comedy, Drama]",The film opens with a homeless man (Bud Cort) ...,Dogma
877,tt0276751,Twelve year old Marcus Brewer lives with his c...,"[Comedy, Drama, Romance]",Will Freeman (Hugh Grant) is a 38-year-old bac...,About a Boy
878,tt1723811,Brandon is a 30-something man living in New Yo...,[Drama],"Brandon (Michael Fassbender) is a successful, ...",Shame


In [27]:
movie_reviews = movie_reviews[movie_reviews['movie_id'].isin(shared_movie_ids)]

movie_reviews

Unnamed: 0,review_date,movie_id,user_id,is_spoiler,review_text,rating,review_summary
0,10 February 2006,tt0111161,ur1898687,True,"In its Oscar year, Shawshank Redemption (writt...",10,A classic piece of unforgettable film-making.
1,6 September 2000,tt0111161,ur0842118,True,The Shawshank Redemption is without a doubt on...,10,Simply amazing. The best film of the 90's.
2,3 August 2001,tt0111161,ur1285640,True,I believe that this film is the best story eve...,8,The best story ever told on film
3,1 September 2002,tt0111161,ur1003471,True,"**Yes, there are SPOILERS here**This film has ...",10,Busy dying or busy living?
4,20 May 2004,tt0111161,ur0226855,True,At the heart of this extraordinary movie is a ...,8,"Great story, wondrously told and acted"
...,...,...,...,...,...,...,...
573908,8 August 1999,tt0139239,ur0100166,False,"Go is wise, fast and pure entertainment. Assem...",10,The best teen movie of the nineties
573909,31 July 1999,tt0139239,ur0021767,False,"Well, what shall I say. this one´s fun at any ...",9,Go - see the movie
573910,20 July 1999,tt0139239,ur0392750,False,"Go is the best movie I have ever seen, and I'v...",10,It's the best movie I've ever seen
573911,11 June 1999,tt0139239,ur0349105,False,Call this 1999 teenage version of Pulp Fiction...,3,Haven't we seen this before?
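
The memory argument can be checked on toy data. The sketch below uses made-up frames (long placeholder synopses and many short reviews), keeps only the shared IDs as in the cells above, and compares the footprint of the two separate frames with that of a merged frame using `memory_usage(deep=True)`. The exact byte counts depend on the pandas version, so only the comparison is meaningful.

```python
import pandas as pd

# Three movies with long synopses (placeholder text).
toy_details = pd.DataFrame({
    'movie_id': ['tt0000001', 'tt0000002', 'tt0000003'],
    'plot_synopsis': ['a' * 1000, 'b' * 1000, 'c' * 1000],
})

# Many short reviews; 'tt0000009' has no matching details.
toy_reviews = pd.DataFrame({
    'movie_id': ['tt0000001'] * 50 + ['tt0000002'] * 50 + ['tt0000009'] * 5,
    'review_text': ['short review'] * 105,
})

# Keep only the IDs present in both frames.
shared_ids = set(toy_details['movie_id']) & set(toy_reviews['movie_id'])
toy_details = toy_details[toy_details['movie_id'].isin(shared_ids)]
toy_reviews = toy_reviews[toy_reviews['movie_id'].isin(shared_ids)]

# Footprint of the two frames kept separate.
separate_bytes = (toy_details.memory_usage(deep=True).sum()
                  + toy_reviews.memory_usage(deep=True).sum())

# Footprint after merging: each synopsis is repeated once per review.
merged_bytes = (toy_reviews.merge(toy_details, on='movie_id')
                           .memory_usage(deep=True).sum())

print(separate_bytes, merged_bytes)
```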


Return to [Table of Contents](#table)

## 4. Save Data <a name="save"></a>

We will now save 'movie_details' and 'movie_reviews' for later use.

In [39]:
file_dir = os.path.abspath('.')
data_folder = 'data'
path = os.path.join(file_dir, data_folder, 'movie_details.pkl.gz')

movie_details.to_pickle(path, compression = 'gzip')

In [41]:
# Reload the saved file to confirm that it can be read back.
movie_details = pd.read_pickle('./data/movie_details.pkl.gz', compression = 'gzip')

movie_details.head()

Unnamed: 0,movie_id,plot_summary,genre,plot_synopsis,title
0,tt0105112,"Former CIA analyst, Jack Ryan is in England wi...","[Action, Thriller]","Jack Ryan (Ford) is on a ""working vacation"" in...",Patriot Games
1,tt1204975,"Billy (Michael Douglas), Paddy (Robert De Niro...",[Comedy],Four boys around the age of 10 are friends in ...,Last Vegas
2,tt0126886,Tracy Flick is running unopposed for this year...,"[Comedy, Drama, Romance]",Jim McAllister (Matthew Broderick) is a much-a...,Election
3,tt0286716,"Bruce Banner, a brilliant scientist with a clo...","[Action, Sci-Fi]",Bruce Banner (Eric Bana) is a research scienti...,Hulk
4,tt0090605,57 years after Ellen Ripley had a close encoun...,"[Action, Adventure, Sci-Fi]","After the opening credits, we see a spacecraft...",Aliens


In [42]:
file_dir = os.path.abspath('.')
data_folder = 'data'
path = os.path.join(file_dir, data_folder, 'movie_reviews.pkl.gz')

movie_reviews.to_pickle(path, compression = 'gzip')

In [43]:
# Reload the saved file to confirm that it can be read back.
movie_reviews = pd.read_pickle('./data/movie_reviews.pkl.gz', compression = 'gzip')

movie_reviews.head()

Unnamed: 0,review_date,movie_id,user_id,is_spoiler,review_text,rating,review_summary
0,10 February 2006,tt0111161,ur1898687,True,"In its Oscar year, Shawshank Redemption (writt...",10,A classic piece of unforgettable film-making.
1,6 September 2000,tt0111161,ur0842118,True,The Shawshank Redemption is without a doubt on...,10,Simply amazing. The best film of the 90's.
2,3 August 2001,tt0111161,ur1285640,True,I believe that this film is the best story eve...,8,The best story ever told on film
3,1 September 2002,tt0111161,ur1003471,True,"**Yes, there are SPOILERS here**This film has ...",10,Busy dying or busy living?
4,20 May 2004,tt0111161,ur0226855,True,At the heart of this extraordinary movie is a ...,8,"Great story, wondrously told and acted"
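
One practical reason to use pickle here is that it preserves Python objects such as the lists in the 'genre' column, which a plain-text format like CSV would turn into strings. The following is a minimal round-trip sketch on a one-row toy frame written to a temporary directory. The file name is made up for the example.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    'movie_id': ['tt0105112'],
    'genre': [['Action', 'Thriller']],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy.pkl.gz')

    # Write and read with the same gzip setting used for the real files.
    toy.to_pickle(toy_path, compression='gzip')
    restored = pd.read_pickle(toy_path, compression='gzip')

# The genre entry is still a Python list after the round trip.
restored_genre = restored['genre'].iloc[0]
print(type(restored_genre), restored_genre)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.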


Return to [Table of Contents](#table)