# Final Project

**Use plot synopses of movies to predict the genre of the film.**

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

## Data

### Import Datasets

In [3]:
basics = pd.read_csv('title.basics.tsv', sep='\t')
synopses = pd.read_csv('mpst_full_data.csv')

  interactivity=interactivity, compiler=compiler, result=result)


In [4]:
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [5]:
synopses.head()

Unnamed: 0,imdb_id,title,plot_synopsis,tags,split,synopsis_source
0,tt0057603,I tre volti della paura,Note: this synopsis is for the orginal Italian...,"cult, horror, gothic, murder, atmospheric",train,imdb
1,tt1733125,Dungeons & Dragons: The Book of Vile Darkness,"Two thousand years ago, Nhagruul the Foul, a s...",violence,train,imdb
2,tt0033045,The Shop Around the Corner,"Matuschek's, a gift store in Budapest, is the ...",romantic,test,imdb
3,tt0113862,Mr. Holland's Opus,"Glenn Holland, not a morning person by anyone'...","inspiring, romantic, stupid, feel-good",train,imdb
4,tt0086250,Scarface,"In May 1980, a Cuban man named Tony Montana (A...","cruelty, murder, dramatic, cult, violence, atm...",val,imdb


### Combine Data

In [7]:
all = pd.merge(left = basics, right = synopses, how = 'inner', left_on = 'tconst', right_on = 'imdb_id')
all.head(3)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,imdb_id,title,plot_synopsis,tags,split,synopsis_source
0,tt0000091,short,The House of the Devil,Le manoir du diable,0,1896,\N,3,"Horror,Short",tt0000091,Le manoir du diable,The film opens with a large bat flying into a ...,"paranormal, gothic",train,wikipedia
1,tt0000225,short,Beauty and the Beast,La belle et la bête,0,1899,\N,\N,"Family,Fantasy,Romance",tt0000225,La belle et la bête,A widower merchant lives in a mansion with his...,fantasy,train,wikipedia
2,tt0000230,short,Cinderella,Cendrillon,0,1899,\N,6,"Drama,Family,Fantasy",tt0000230,Cendrillon,"A prologue in front of the curtain, suppressed...",fantasy,train,wikipedia


In [10]:
print('Number of movies with plot data:', synopses.shape[0])
print('Number of movies after merge:', all.shape[0])

Number of movies with plot data: 14828
Number of movies after merge: 14820


The IMDb basics dataset is quite large, containing a lot of information about all IMDb tracked digital media content. Including **`genre`**, which is the response variable that I'm hope to be able to predict using a multinomial classification approach. Some other variables in this dataset could also be useful (like `runtime` and `release year`).

The MPST Kaggle dataset is much smaller, containing plot descriptions and summaries that were collected from IMDb and Wikipedia pages. There is also an `imdb_id`, which corresponds directly to the `tconst` variable that uniquely identifies each film in the IMDb dataset. This made it possible to merge the IMDb data with the Kaggle plot synopsis data, keeping only the movies that were common between those two datasets. There was a very small loss of plot data, only 8 observations did not have a match with a movie in the IMDb dataset.

At this point, what's left over are movies for which there is an available plot summary. Additional filtering and cleaning will be performed in the next section.

### Data Cleaning

####  Filter where `titleType == 'movie'` and `isAdult == 0`

In [None]:
#all['titleType'].value_counts()
movies = all[(all['titleType'] == 'movie') & (all['isAdult'] == 0)]
movies.head()

In [None]:
print('Number of obersavtions left after filtering:', movies.shape[0])

#### Drop Variables

In [12]:
movies.drop(['titleType', 'originalTitle', 'isAdult', 'endYear', 'imdb_id', 
             'title', 'tags', 'split', 'synopsis_source'],
            axis = 1, inplace = True)
movies.head()

Unnamed: 0,tconst,titleType,primaryTitle,isAdult,startYear,runtimeMinutes,genres,plot_synopsis
0,tt0000091,short,The House of the Devil,0,1896,3,"Horror,Short",The film opens with a large bat flying into a ...
1,tt0000225,short,Beauty and the Beast,0,1899,\N,"Family,Fantasy,Romance",A widower merchant lives in a mansion with his...
2,tt0000230,short,Cinderella,0,1899,6,"Drama,Family,Fantasy","A prologue in front of the curtain, suppressed..."
3,tt0000417,short,A Trip to the Moon,0,1902,13,"Action,Adventure,Comedy","At a meeting of the Astronomic Club, its presi..."
4,tt0000488,short,The Land Beyond the Sunset,0,1912,14,"Drama,Fantasy,Short",Joe is an impoverished New York newsboy who li...


#### Missing values

In [20]:
## find how many \Ns in each variable
(movies == '\\N').sum()

tconst             0
titleType          0
primaryTitle       0
isAdult            0
startYear         18
runtimeMinutes    52
genres             1
plot_synopsis      0
dtype: int64

Missing values are represented in the data by a '\N' character. Some of these can be imputed, like `runtimeMinutes` which will be replaced with the variable mean. Others like `startYear` and `genre` should be removed (especially the oberservations that do not have data in the genre field, as that is the target variable).

In [24]:
## replace \N strings with NaNs
movies.replace(to_replace = '\\N', value = np.nan, inplace = True)

## fill NAs in runtime with the average
movies.fillna(movies.mean())

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  method=method,


KeyboardInterrupt: 

In [25]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12296 entries, 12 to 14819
Data columns (total 8 columns):
tconst            12296 non-null object
titleType         12296 non-null object
primaryTitle      12296 non-null object
isAdult           12296 non-null int64
startYear         12278 non-null object
runtimeMinutes    12244 non-null object
genres            12295 non-null object
plot_synopsis     12296 non-null object
dtypes: int64(1), object(7)
memory usage: 864.6+ KB
