# Final Project

**Use plot synopses of movies to predict the genre of the film.**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import warnings
warnings.filterwarnings('ignore')

## Data

### Import Datasets

In [2]:
basics = pd.read_csv('title.basics.tsv', sep='\t')
synopses = pd.read_csv('mpst_full_data.csv')

In [3]:
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [4]:
synopses.head()

Unnamed: 0,imdb_id,title,plot_synopsis,tags,split,synopsis_source
0,tt0057603,I tre volti della paura,Note: this synopsis is for the orginal Italian...,"cult, horror, gothic, murder, atmospheric",train,imdb
1,tt1733125,Dungeons & Dragons: The Book of Vile Darkness,"Two thousand years ago, Nhagruul the Foul, a s...",violence,train,imdb
2,tt0033045,The Shop Around the Corner,"Matuschek's, a gift store in Budapest, is the ...",romantic,test,imdb
3,tt0113862,Mr. Holland's Opus,"Glenn Holland, not a morning person by anyone'...","inspiring, romantic, stupid, feel-good",train,imdb
4,tt0086250,Scarface,"In May 1980, a Cuban man named Tony Montana (A...","cruelty, murder, dramatic, cult, violence, atm...",val,imdb


### Combine Data

In [5]:
all = pd.merge(left = basics, right = synopses, how = 'inner', left_on = 'tconst', right_on = 'imdb_id')
all.head(3)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,imdb_id,title,plot_synopsis,tags,split,synopsis_source
0,tt0000091,short,The House of the Devil,Le manoir du diable,0,1896,\N,3,"Horror,Short",tt0000091,Le manoir du diable,The film opens with a large bat flying into a ...,"paranormal, gothic",train,wikipedia
1,tt0000225,short,Beauty and the Beast,La belle et la bête,0,1899,\N,\N,"Family,Fantasy,Romance",tt0000225,La belle et la bête,A widower merchant lives in a mansion with his...,fantasy,train,wikipedia
2,tt0000230,short,Cinderella,Cendrillon,0,1899,\N,6,"Drama,Family,Fantasy",tt0000230,Cendrillon,"A prologue in front of the curtain, suppressed...",fantasy,train,wikipedia


In [6]:
print('Number of movies with plot data:', synopses.shape[0])
print('Number of movies after merge:', all.shape[0])

Number of movies with plot data: 14828
Number of movies after merge: 14820


The IMDb basics dataset is quite large and contains information about all of the digital media content tracked by IMDb. This includes **`genres`**, the response variable that I hope to predict using a multinomial classification approach. Some other variables in this dataset could also be useful (like `runtimeMinutes` and `startYear`).

The MPST Kaggle dataset is much smaller, containing plot descriptions and summaries that were collected from IMDb and Wikipedia pages. It also has an `imdb_id` column, which corresponds directly to the `tconst` variable that uniquely identifies each film in the IMDb dataset. This made it possible to merge the IMDb data with the Kaggle plot synopsis data, keeping only the movies common to both datasets. The loss of plot data was very small: only 8 of the 14,828 synopses did not have a match in the IMDb dataset.
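
One way to see which synopses fail to match is to join with `indicator=True` and filter on the resulting `_merge` column. The sketch below uses two tiny made-up frames (`basics_toy` and `synopses_toy` are hypothetical stand-ins, not the real data), so it only illustrates the idea.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical rows, not real data)
basics_toy = pd.DataFrame({'tconst': ['tt01', 'tt02', 'tt03'],
                           'genres': ['Drama', 'Comedy', 'Horror']})
synopses_toy = pd.DataFrame({'imdb_id': ['tt02', 'tt03', 'tt99'],
                             'plot_synopsis': ['plot a', 'plot b', 'plot c']})

# A right join keeps every synopsis; indicator=True adds a `_merge` column
# that records whether each row was found in both frames or only one
checked = pd.merge(basics_toy, synopses_toy, how='right',
                   left_on='tconst', right_on='imdb_id', indicator=True)

# Synopses with no IMDb match are flagged 'right_only'
unmatched = checked.loc[checked['_merge'] == 'right_only', 'imdb_id'].tolist()
print(unmatched)
```

Applied to the real frames, the same filter should list the IDs of the 8 synopses that were dropped by the inner join.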

At this point, what's left over are movies for which there is an available plot summary. Additional filtering and cleaning will be performed in the next section.

### Data Cleaning

#### Filter where `titleType == 'movie'` and `isAdult == 0`

In [7]:
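#all['titleType'].value_counts()
## keep non-adult feature films; .copy() avoids SettingWithCopyWarning on later in-place edits
movies = all[(all['titleType'] == 'movie') & (all['isAdult'] == 0)].copy()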
movies.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,imdb_id,title,plot_synopsis,tags,split,synopsis_source
12,tt0002130,movie,Dante's Inferno,L'Inferno,0,1911,\N,71,"Adventure,Drama,Fantasy",tt0002130,L'Inferno,The exhumation of Lizzie Siddal's desiccated b...,"psychedelic, violence",train,wikipedia
13,tt0003419,movie,The Student of Prague,Der Student von Prag,0,1913,\N,85,"Drama,Fantasy,Horror",tt0003419,Der Student von Prag,Being praised as the finest fencer in his Univ...,haunting,test,imdb
14,tt0003489,movie,The Last Days of Pompeii,Gli ultimi giorni di Pompei,0,1913,\N,88,"Adventure,Drama",tt0003489,Gli ultimi giorni di Pompei,"In Pompeii 79AD, Glaucus and Jone are in love ...","romantic, murder",train,wikipedia
15,tt0004022,movie,Julius Caesar,Cajus Julius Caesar,0,1914,\N,112,"Drama,History",tt0004022,Cajus Julius Caesar,The play opens with the commoners of Rome cele...,tragedy,train,wikipedia
16,tt0004099,movie,The New Wizard of Oz,"His Majesty, the Scarecrow of Oz",0,1914,\N,59,"Adventure,Comedy,Family",tt0004099,"His Majesty, the Scarecrow of Oz",King Krewl (Raymond Russell) is a cruel dictat...,romantic,train,wikipedia


In [8]:
print('Number of observations left after filtering:', movies.shape[0])

Number of observations left after filtering: 12296


#### Drop Variables

In [9]:
movies.drop(['titleType', 'originalTitle', 'isAdult', 'endYear', 'imdb_id', 
             'title', 'tags', 'split', 'synopsis_source'],
            axis = 1, inplace = True)
movies.head()

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres,plot_synopsis
12,tt0002130,Dante's Inferno,1911,71,"Adventure,Drama,Fantasy",The exhumation of Lizzie Siddal's desiccated b...
13,tt0003419,The Student of Prague,1913,85,"Drama,Fantasy,Horror",Being praised as the finest fencer in his Univ...
14,tt0003489,The Last Days of Pompeii,1913,88,"Adventure,Drama","In Pompeii 79AD, Glaucus and Jone are in love ..."
15,tt0004022,Julius Caesar,1914,112,"Drama,History",The play opens with the commoners of Rome cele...
16,tt0004099,The New Wizard of Oz,1914,59,"Adventure,Comedy,Family",King Krewl (Raymond Russell) is a cruel dictat...


#### Missing Values

In [10]:
## find how many \Ns in each variable
(movies == '\\N').sum()

tconst             0
primaryTitle       0
startYear         18
runtimeMinutes    52
genres             1
plot_synopsis      0
dtype: int64

Missing values are represented in the data by the string `\N`. Some of these can be imputed: missing `runtimeMinutes` values will be replaced with the variable mean. Observations missing `startYear` or `genres` should be removed, especially those without a genre, since that is the target variable.

In [29]:
## replace \N strings with NaNs
movies = movies.replace(to_replace = '\\N', value = np.nan)

## make runtimeMinutes numeric, fill NAs with the mean
movies['runtimeMinutes'] = pd.to_numeric(movies['runtimeMinutes'])
movies['runtimeMinutes'] = movies['runtimeMinutes'].fillna(movies['runtimeMinutes'].mean())

## drop remaining NAs (rows missing startYear or genres)
movies_clean = movies.dropna().copy()
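
As an alternative to replacing `\N` after loading, `pd.read_csv` can be told to treat it as missing through its `na_values` parameter. This is a sketch on a made-up three-row sample (the `sample` string is hypothetical), not the notebook's actual loading step.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same tab-separated layout as the IMDb file
sample = (
    "tconst\tstartYear\truntimeMinutes\n"
    "tt01\t1911\t71\n"
    "tt02\t\\N\t85\n"
    "tt03\t1913\t\\N\n"
)

# na_values marks the literal string \N as missing while parsing,
# so the numeric columns are read as numbers instead of strings
toy = pd.read_csv(io.StringIO(sample), sep='\t', na_values='\\N')

n_missing = toy.isna().sum()
print(n_missing)
print(toy.dtypes)
```

With this approach the `pd.to_numeric` conversion is no longer needed, because columns containing `\N` are parsed as floats directly.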

In [33]:
## reformat
movies_clean.reset_index(drop = True, inplace = True)
movies_clean.head()

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres,plot_synopsis
0,tt0002130,Dante's Inferno,1911,71.0,"Adventure,Drama,Fantasy",The exhumation of Lizzie Siddal's desiccated b...
1,tt0003419,The Student of Prague,1913,85.0,"Drama,Fantasy,Horror",Being praised as the finest fencer in his Univ...
2,tt0003489,The Last Days of Pompeii,1913,88.0,"Adventure,Drama","In Pompeii 79AD, Glaucus and Jone are in love ..."
3,tt0004022,Julius Caesar,1914,112.0,"Drama,History",The play opens with the commoners of Rome cele...
4,tt0004099,The New Wizard of Oz,1914,59.0,"Adventure,Comedy,Family",King Krewl (Raymond Russell) is a cruel dictat...


In [34]:
movies_clean.shape

(12277, 6)

At the end of the data cleaning process, I am left with a dataframe with 12277 observations and 6 features. I expect to remove one or both of `tconst` and `primaryTitle`, as they identify individual movies and won't aid in modeling.

Further feature engineering will include deriving a main genre for each movie, creating classes of interest, and computing TF-IDF scores for the text in the `plot_synopsis` feature.

### Feature Engineering

#### Refine `genre` feature

In [47]:
## extract the main genre (assume it is the first listed)
## note: \w+ stops at hyphens, so "Sci-Fi" becomes "Sci" and "Film-Noir" becomes "Film"
movies_clean['genre'] = movies_clean['genres'].apply(lambda x: re.findall(r'\w+', x)[0])
print('Number of unique primary genres:', movies_clean['genre'].nunique())
print('Number of movies of each genre:')
print(movies_clean['genre'].value_counts())

Number of unique primary genres: 20
Number of movies of each genre:
Action         2778
Comedy         2649
Drama          2444
Crime          1207
Horror         1048
Adventure       917
Biography       478
Western         154
Animation       130
Fantasy         107
Mystery          83
Documentary      69
Romance          49
Thriller         40
Sci              38
Family           30
Musical          25
Film             17
Music             7
History           7
Name: genre, dtype: int64


Make **Action, Comedy, Drama, Crime, Horror, and Adventure** the six main classes. All other genres, each with fewer than 500 movies, can be lumped together in an **Other** category.

In [57]:
## lump smaller genre groups into a bigger OTHER level
other = ['Biography', 'Western', 
         'Animation', 'Fantasy',
         'Mystery', 'Documentary',
         'Romance', 'Thriller',
         'Sci', 'Family',
         'Musical', 'Film',
         'Music', 'History']

movies_clean['genre'] = movies_clean['genre'].replace(to_replace = other, value = "Other")
movies_clean.drop(['genres'], axis = 1, inplace = True)
movies_clean.head(10)

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,plot_synopsis,genre
0,tt0002130,Dante's Inferno,1911,71.0,The exhumation of Lizzie Siddal's desiccated b...,Adventure
1,tt0003419,The Student of Prague,1913,85.0,Being praised as the finest fencer in his Univ...,Drama
2,tt0003489,The Last Days of Pompeii,1913,88.0,"In Pompeii 79AD, Glaucus and Jone are in love ...",Adventure
3,tt0004022,Julius Caesar,1914,112.0,The play opens with the commoners of Rome cele...,Drama
4,tt0004099,The New Wizard of Oz,1914,59.0,King Krewl (Raymond Russell) is a cruel dictat...,Adventure
5,tt0004635,The Squaw Man,1914,74.0,James Wynnegate (Dustin Farnum) and his cousin...,Action
6,tt0004972,The Birth of a Nation,1915,195.0,=== Part 1: Civil War of United States ===\nTh...,Drama
7,tt0005059,The Captive,1915,50.0,The Captive chronicles the life of a young wom...,Drama
8,tt0006206,Les vampires,1915,421.0,"=== Episode 1 – ""The Severed Head"" ===\nPhilip...",Action
9,tt0006780,Hell's Hinges,1916,64.0,Hell's Hinges tells the story of a weak-willed...,Other


In [None]:
## encode genre categories as integers
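
A minimal sketch of one way this step could look, assuming `pd.factorize` is an acceptable encoder here. The `genre` series below is a hypothetical stand-in for `movies_clean['genre']`.

```python
import pandas as pd

# Hypothetical labels standing in for movies_clean['genre']
genre = pd.Series(['Action', 'Drama', 'Other', 'Action', 'Comedy'])

# sort=True assigns codes in alphabetical order of the labels,
# so the mapping does not depend on row order
codes, classes = pd.factorize(genre, sort=True)

print(list(codes))    # integer code for each row
print(list(classes))  # classes[i] is the label encoded as i
```

Keeping `classes` around makes it possible to map predicted integers back to genre names later.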

#### Train and Test Split
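
This section is still to be written. As a placeholder, here is a sketch of a stratified split using only pandas, assuming a 20% test fraction (the fraction, the seed, and the `toy` frame are all illustrative choices, not decisions made in this project).

```python
import pandas as pd

# Hypothetical frame standing in for movies_clean
toy = pd.DataFrame({
    'plot_synopsis': [f'plot {i}' for i in range(20)],
    'genre': ['Action'] * 10 + ['Drama'] * 10,
})

# Sample 20% of the rows within each genre so that the test set
# keeps the same class proportions as the full data
test = toy.groupby('genre').sample(frac=0.2, random_state=42)

# Everything not sampled into the test set becomes training data
train = toy.drop(test.index)

print(len(train), len(test))
```

Stratifying matters here because the classes are imbalanced; a purely random split could leave the smaller genres underrepresented in the test set.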

#### TF-IDF with `plot_synopsis`
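
This section is also still to be written. To make the idea concrete, the sketch below computes plain TF-IDF by hand on three made-up sentences, using term frequency times `log(N / document frequency)`. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant of this formula, so their numbers will differ.

```python
import math
import re
from collections import Counter

# Hypothetical one-line "synopses"
docs = [
    "a detective hunts a killer",
    "a killer stalks a small town",
    "two friends plan a wedding",
]

# Lowercase and split each document into word tokens
tokenized = [re.findall(r'\w+', d.lower()) for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return {term: tf * idf} for one tokenized document."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

scores = [tfidf(tokens) for tokens in tokenized]
print(scores[0])
```

Words that appear in every document (here "a") score zero, while words specific to one synopsis (here "detective") score highest, which is what makes these weights useful as features for genre classification.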

## Modeling