<a href="https://colab.research.google.com/github/itayse10/GoingToMovies/blob/master/Going_to_the_movies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning 3253 - Project Assignment - "Going to the Movies"

## Team Members
    
* Craig Barbisan
* Nisha Choondassery
* Mark Hubbard
* Itay Segal

## Project Overview

Using the "Movies Dataset"  from Kaggle, this project will create an optimal model for predicting a movie's financial success.

Only features that would be known *prior* to a movie's release will be used. A classifier for predicting success will be developed, using a threshold on the revenue-to-budget ratio to define success.

The feature set will be further assessed via an optimal regression algorithm to predict revenue, where a low RMSE will increase confidence in the feature engineering and feature selection for the classifier.

### The "Movies Dataset" from Kaggle
The dataset for this project is originally sourced from https://www.kaggle.com/rounakbanik/the-movies-dataset.

Credit: Rounak Banik

This dataset is an ensemble of data collected from TMDB ("The Movie Database") and GroupLens.
* The Movie Details (i.e. Metadata), Credits and Keywords have been collected from the TMDB Open API.
* The Movie Links and Ratings have been obtained from the Official GroupLens website.

The following spreadsheets are used by this project (all from TMDB):

* __movies_metadata.csv__: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.

* __credits.csv__: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.

* __keywords.csv__:  Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.

### Project Video

The following 10-minute video describes the project.

In [0]:
from IPython.display import YouTubeVideo

YouTubeVideo('c1TbyxPeM4Y', width=600, height=400)


### The Big Picture (Pun Intended)

Profit is the primary measure of a movie's financial success.  

The proposed machine learning solution for predicting success is a supervised learning classifier with a profit-oriented target, trained on a set of features known prior to a movie's release.

Profit itself is tied to a specific currency and does not account for inflation, or for marketing and distribution expenses that grow over time or vary by country. A better proxy is the revenue-to-budget ratio, which can be used in any geographic region and remains relatively constant over time and across movie genres. A ratio of 2.5 will be used as the target for the classifier.
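
To make the target concrete, the following is a minimal sketch of how a binary label could be derived from the ratio. The toy figures and the names `SUCCESS_RATIO` and `toy` are illustrative assumptions, as is the choice to count a ratio of exactly 2.5 as a success; the labelling code used later in the project may differ.

```python
import pandas as pd

# threshold taken from the text; the name is a hypothetical constant
SUCCESS_RATIO = 2.5

# made-up budgets and revenues, for illustration only
toy = pd.DataFrame({
    'budget':  [10e6, 50e6, 2e6],
    'revenue': [30e6, 60e6, 5e6],
})

# the ratio is unit-free, so currency and scale cancel out
toy['ratio'] = toy['revenue'] / toy['budget']

# assumption: a ratio at or above the threshold counts as a success
toy['success'] = toy['ratio'] >= SUCCESS_RATIO
print(toy)
```

Because the label depends only on the ratio, a small film and a blockbuster are judged on the same scale.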

A reasonable model should have a probability of predicting financial success that exceeds a simple coin toss (i.e. 50% probability).  Given that critical success (i.e. how the movie is received by the general viewing audience) is a big factor but less predictable, this project is aiming for a model with a 75% probability of predicting financial success based on the features known prior to creating the movie itself.

## Notebook Contents

This notebook will explore the data, evaluate some models and draw conclusions. It is divided into the following main sections:

0. Setup the environment - setting up the notebook environment.
1. Get the data - loading the data set into the notebook.
2. Explore the data - exploring the raw movies data.
3. Prepare the data - data cleansing, feature selection/reduction/engineering and standardization.
4. Model - training and testing various models.
5. Evaluate the model - analyzing model results.
6. Summary - summarizing the observations and conclusions.

## 0. Setup the environment

### Initialize libraries

In [0]:
# import the basic libraries
import os
import numpy as np
import pandas as pd
import json
import ast
import tarfile

In [0]:
# scatter matrix plotting
from pandas.plotting import scatter_matrix

# enable advanced plots
import seaborn as sns

In [0]:
# for reading and writing ZIP files
from zipfile import ZipFile
# to retrieve from a url
from urllib.request import urlretrieve
# to parse URLs into components
from urllib.parse import urlparse

In [0]:
# import sklearn libraries

# feature extraction and decomposition
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

# feature selection and reduction
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.decomposition import PCA
# feature clustering
from sklearn.cluster import KMeans

# pipeline processing
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin

# data splitting
from sklearn.model_selection import train_test_split

# classifier models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import VotingClassifier

# regression models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet

# model evaluation
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error

### Basic settings

In [0]:
# make this notebook's output stable across runs
SEED = 42
np.random.seed(SEED)

In [0]:
# ensure full display for dataframe content
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.precision', 5)
pd.set_option('display.large_repr', 'truncate')
pd.set_option('display.max_colwidth', None)  # None disables truncation (pandas >= 1.0; older versions used -1)
pd.set_option('display.colheader_justify', 'left')

In [0]:
# enable basic plots with pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

In [0]:
# advanced plots settings
sns.set(style='darkgrid')
sns.set(font_scale=1.2)

In [0]:
# suppress warnings
import warnings
warnings.filterwarnings('ignore')

### Constants

In [0]:
# specify file names and locations
URL_DROP_DOMAIN    = 'https://www.dropbox.com'  # data is stored on dropbox due to size limitations of GitHub
URL_DROP_PATH      = '/s/89uv5kntgiolkno/the-movies-dataset.zip?dl=1'

URL_GIT_DOMAIN    = 'https://raw.githubusercontent.com'  # GitHub domain
URL_GIT_PROJECT_PATH      = '/itayse10/GoingToMovies/master'  # GitHub project path

PROJECT_LOCAL_ROOT_DIR = 'movies'
PROJECT_LOCAL_DATA_DIR  = os.path.join(PROJECT_LOCAL_ROOT_DIR,'datasets')
PROJECT_LOCAL_RESOURCE_DIR  = os.path.join(PROJECT_LOCAL_ROOT_DIR,'resources')
PROJECT_LOCAL_CHECKPOINT_DIR  = os.path.join(PROJECT_LOCAL_ROOT_DIR,'checkpoints')

PROJECT_OUTPUT_DIR = '/content/movies/output'
PROJECT_FILE       = 'the-movies-dataset.zip'

## 1. Get the data

### Data retrieval and storage functions

In [0]:
# function: download files from the internet and optionally unzip them
def fetch_data(url, file, local_path, zip=False):
    if not os.path.isdir(local_path):
        os.makedirs(local_path)
    
    remote_file = url
    local_file  = os.path.join(local_path, file)
   
    print(remote_file)
    
    if not os.path.isfile(local_file):
        print('Downloading ' + file + '...')
        
        download_from_Web(remote_file, local_file)

        print('Download complete.')
        
    else:
        print('Already downloaded.')

    if zip:
        unzip_file(local_file)

In [0]:
# unzip a file
def unzip_file(file):
 
    zip_path = file[:-4]
  
    if (os.path.isdir(zip_path)):
        print('Already extracted.')
    else:
        print('Extracting...')
        zfile = ZipFile(file, 'r')
        print(zfile.infolist())
        zfile.extractall(zip_path)
        zfile.close()
        print('Extraction complete.')

In [0]:
# download an entire file from a public Web site
def download_from_Web(remote_file, local_file):
    urlretrieve(remote_file, local_file)

In [0]:
# save a checkpoint for a dataset locally
def save_checkpoint(df, df_name, checkpoint_identifier):
    current_checkpoint = 'checkpoint_' + checkpoint_identifier
    directory_path = os.path.join(PROJECT_LOCAL_CHECKPOINT_DIR,current_checkpoint)
    if not os.path.isdir(directory_path):
        os.makedirs(directory_path)
        print('Creating new directory: {}'.format(directory_path))
    target_file = df_name + '.csv'
    file_path = os.path.join(directory_path, target_file)
    df.to_csv(file_path, encoding='utf-8', index=False )
    print('File {} saved in {}'.format(target_file, directory_path))

In [0]:
# restore a checkpoint into a dataframe object
def restore_checkpoint(df_name, checkpoint_identifier):
    current_checkpoint = 'checkpoint_' + checkpoint_identifier
    directory_path = os.path.join(PROJECT_LOCAL_CHECKPOINT_DIR, current_checkpoint)
    if not os.path.isdir(directory_path):
        # raise instead of calling exit(), which would stop the notebook kernel
        raise FileNotFoundError('Directory for checkpoint does not exist: {}'.format(directory_path))
    target_file = df_name + '.csv'
    file_path = os.path.join(directory_path, target_file)
    if not os.path.isfile(file_path):
        raise FileNotFoundError('File for checkpoint does not exist: {}'.format(file_path))
    df = pd.read_csv(file_path, encoding='utf-8')
    return df

### Data retrieval - load the data sets

In [0]:
# download the dataset file
url = URL_DROP_DOMAIN + URL_DROP_PATH
fetch_data(url, PROJECT_FILE, PROJECT_LOCAL_DATA_DIR, zip=True)

### Create initial data frames
Create dataframes for each of the spreadsheets.

As part of loading the dataframes, convert null values to NaN on the fly (via pd.read_csv).

By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
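
As a small illustration of this behaviour, the sketch below reads an in-memory CSV and passes extra markers through `na_values`, which are added to the defaults listed above. The sample text and column names are made up for the example, and `dtype=str` is used here as a stand-in for reading every column as text.

```python
import io
import pandas as pd

# made-up CSV content: 'NA' is a default null marker,
# 'no info' and '.' are the extra markers passed via na_values
csv_text = (
    "id,budget,status\n"
    "1,100,Released\n"
    "2,NA,no info\n"
    "3,.,Rumored\n"
)

sample = pd.read_csv(io.StringIO(csv_text),
                     dtype=str,
                     na_values=['no info', '.'])

# count nulls per column after the on-the-fly conversion
print(sample.isnull().sum())
```

Both the default marker (`NA`) and the custom ones end up as NaN, even though all columns are read as strings.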

In [0]:
# initialize the dataframes (and convert null fields on the fly)
zip_path = os.path.join(PROJECT_LOCAL_DATA_DIR, PROJECT_FILE[:-4])

metadata_file = os.path.join(zip_path, 'movies_metadata.csv')
credits_file  = os.path.join(zip_path, 'credits.csv')
plot_file     = os.path.join(zip_path, 'keywords.csv')

# load the metadata dataframe
metadata = pd.read_csv(metadata_file,
                     dtype = 'unicode',
                     na_values = ['no info', '.']
                    )

# load the credits dataframe
credits = pd.read_csv(credits_file,
                      dtype = 'unicode',
                      na_values = ['no info', '.']
                     )

# load the plot dataframe
plot =  pd.read_csv(plot_file,
                    dtype = 'unicode',
                    na_values = ['no info', '.']
                   )

## 2. Explore the data

In [0]:
# make copies of data for EDA
metadata_copy = metadata.copy()
metadata_copy.name = 'metadata'

credits_copy = credits.copy()
credits_copy.name = 'credit'

plot_copy = plot.copy()
plot_copy.name = 'plot'

### Data exploration functions

In [0]:
# basic schema analysis
def quick_schema_analysis(df):
    if not hasattr(df, 'name'):
        df.name = '' # set an empty name
    print("Basic Schema Analysis for dataframe=" + df.name)
    print("************************************************")
    
    print("Rows and Columns:")
    print(df.shape)
    print(df.info())
    print("\n")
    
    print("Null Values - percentage:")
    print((1 - df.count()/len(df.index)) * 100)
    print("\n")
    
    print("Null Values - count:")
    print(df.isnull().sum())
    print("\n")

In [0]:
# basic data analysis
def quick_data_analysis(df):
    if not hasattr(df, 'name'):
        df.name = '' # set an empty name
    print("Basic Data Analysis for dataframe=" + df.name)
    print(df.shape)

In [0]:
# review column results
def assess_columns(df, columns, categories=False):
    for column in columns:
        print('Info for column: {}'.format(column))
        num_empty = df[column].isnull().sum()
        print('Number of null entries: {}'.format(num_empty))
        if categories:
            num_unique = df[column].unique()
            print('Unique values: {}'.format(num_unique))
    print(df[columns].head(10))

In [0]:
def count_values(df, column):
    value_count = {}
    for row in df[column].dropna():
        if len(row) > 0:
            for key in row:
                if key in value_count:
                    value_count[key] += 1
                else:
                    value_count[key] = 1
        else:
            pass
    return value_count

### Explore the "metadata" data

#### Potential features

* __adult__: Indicates if the movie is X-Rated or Adult.
* __belongs_to_collection__: A stringified dictionary that gives information on the movie series the particular film belongs to.
* __budget__: The budget of the movie in dollars.
* __genres__: A stringified list of dictionaries that list out all the genres associated with the movie.
* __homepage__: The Official Homepage of the movie.
* __id__: The ID of the movie.
* __imdb_id__: The IMDB ID of the movie.
* __original_language__: The language in which the movie was originally shot.
* __original_title__: The original title of the movie.
* __overview__: A brief blurb of the movie.
* __popularity__: The Popularity Score assigned by TMDB.
* __poster_path__: The URL of the poster image.
* __production_companies__: A stringified list of production companies involved with the making of the movie.
* __production_countries__: A stringified list of countries where the movie was shot/produced.
* __release_date__: Theatrical Release Date of the movie.
* __revenue__: The total revenue of the movie in dollars.
* __runtime__: The runtime of the movie in minutes.
* __spoken_languages__: A stringified list of spoken languages in the film.
* __status__: The status of the movie (Released, To Be Released, Announced, etc.)
* __tagline__: The tagline of the movie.
* __title__: The Official Title of the movie.
* __video__: Indicates if there is a video present of the movie with TMDB.
* __vote_average__: The average rating of the movie.
* __vote_count__: The number of votes by users, as counted by TMDB.

In [0]:
# quick review of the metadata data
quick_schema_analysis(metadata_copy)
quick_data_analysis(metadata_copy)

In [0]:
# examine the first 10 rows
metadata_copy.head(10)

In [0]:
# additional column info - counts, unique values, etc
metadata_copy.describe()

#### Assess columns

In [0]:
assess_columns(metadata_copy, ['release_date'])

In [0]:
assess_columns(metadata_copy, ['budget','revenue'])

In [0]:
# assess json columns
assess_columns(metadata_copy, ['belongs_to_collection','genres','production_companies','production_countries','spoken_languages'])

In [0]:
# assess numerical columns
assess_columns(metadata_copy, ['runtime','vote_average','vote_count'])

### Explore the "credits" data

In [0]:
# quick review of the credits data
quick_schema_analysis(credits_copy)
quick_data_analysis(credits_copy)

In [0]:
# examine the first 10 rows
credits_copy.head(10)

#### Assess columns

In [0]:
assess_columns(credits_copy, ['cast', 'crew'])

### Explore the "plot" data

In [0]:
# quick review of the plot data
quick_schema_analysis(plot_copy)
quick_data_analysis(plot_copy)

In [0]:
# examine the first 10 rows
plot_copy.head(10)

#### Assess columns

In [0]:
assess_columns(plot_copy,['keywords'])

## 3. Prepare the data

### Merge the 3 raw data sets
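
`DataFrame.merge` performs an inner join by default, so only rows whose `id` appears in both frames are kept. The sketch below shows this on two made-up frames; the names `left` and `right` and their contents are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# two toy frames sharing an 'id' key (ids kept as strings, as in the loaded CSVs)
left = pd.DataFrame({'id': ['1', '2', '3'], 'title': ['A', 'B', 'C']})
right = pd.DataFrame({'id': ['2', '3', '4'], 'cast': ['x', 'y', 'z']})

# default how='inner': ids '1' and '4' have no partner and are dropped
merged = left.merge(right, on=['id'])
print(merged)
```

Comparing the shapes before and after the merge, as done in this section, is a quick way to see how many rows were lost or duplicated.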

In [0]:
# compare the shapes of all the dataframes
print(metadata.shape)
print(credits.shape)
print(plot.shape)

In [0]:
# initialize the merged dataframe "movies"
movies = metadata.copy()

In [0]:
# merge the other data frames
movies = movies.merge(credits, on=["id"])
movies = movies.merge(plot, on=['id'])

In [0]:
# check the new data shape
print(movies.shape)

In [0]:
# explore the movies data
movies.head(5)

### Create data checkpoint - checkpoint_1
Separate data sets in their raw state

In [0]:
# save checkpoint of the dataframes
save_checkpoint(metadata, 'metadata', '1')
save_checkpoint(credits, 'credits', '1')
save_checkpoint(plot, 'plot', '1')
save_checkpoint(movies, 'movies', '1')

In [0]:
# create a checkpoint for the movies data set
movies.name = 'movies_checkpoint_1'
movies_checkpoint_1 = movies.copy()
#movies=movies_checkpoint_1.copy()

In [0]:
movies_checkpoint_1.shape

### Data conversion functions

In [0]:
# convert json to abstract syntax trees (ast)
# use ast because json data has single quotes in the csvs
# which is invalid for a json object (should be double quotes)
def convert_json_to_ast(df, json_columns):
    for column in json_columns:
        df[column] = df[column].apply(lambda x: np.nan if pd.isnull(x)
                                                else ast.literal_eval(str(x)))

In [0]:
# convert the columns that contain the dictionary field with the name value in its ast 
def get_dict_val_from_ast(df,columns,field_name, fillna_str):
    for column in columns:
        df[column] = df[column].fillna(fillna_str) # first replace NaN with fillna_str
        df[column] = df[column].apply(lambda x: x[field_name] if isinstance(x, dict) else [])

In [0]:
# convert the columns that contain the list field with the name value in its ast
def get_list_val_from_ast(df,columns,field_name, fillna_str=None, new_col_dict=None):
    for column in columns:
        if(fillna_str):
            df[column] = df[column].fillna(fillna_str) # first replace NaN with fillna_str
        new_column = column
        if(new_col_dict and column in new_col_dict):
            new_column = new_col_dict[column]
        df[new_column] = df[column].apply(lambda x: [i[field_name] for i in x] if isinstance(x, list) else []) 

In [0]:
# convert to float
def cast_to_float(df, columns,d_cast=None):
    for column in columns:
        if(d_cast):
            df[column] = pd.to_numeric(df[column],errors='coerce',downcast=d_cast)
        else:
            df[column] = pd.to_numeric(df[column],errors='coerce')

### Data conversions 
Perform the following type conversions:
* Convert __release_date__ to datetime
* Convert __budget__ and __revenue__ to numerics
* Convert __runtime__ to float
* Convert __vote_average__ to float
* Convert __vote_count__ and __popularity__ to float
* Convert all JSON fields to abstract syntax trees
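
The last conversion relies on `ast.literal_eval` rather than `json.loads` because the stringified fields use single quotes. A minimal sketch of the difference, using a made-up value in the same shape as the `genres` column:

```python
import ast
import json

# illustrative value: a Python-style literal, not valid JSON (single quotes)
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# json.loads rejects single-quoted strings
try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# ast.literal_eval safely evaluates the string as a Python literal
parsed = ast.literal_eval(raw)
names = [genre['name'] for genre in parsed]

print(json_ok, names)
```

`literal_eval` only accepts literals (lists, dicts, strings, numbers and so on), so it does not execute arbitrary code from the CSV.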

In [0]:
# convert movies columns to numeric
def data_convert(movies_df):
    if 'release_date' in movies_df.columns:
        movies_df['release_date'] = pd.to_datetime(movies_df['release_date'], errors='coerce')
    if 'budget' in movies_df.columns:
        movies_df['budget']  = pd.to_numeric(movies_df['budget'],  errors='coerce')
        movies_df['budget']  = movies_df['budget'].replace(0, np.nan)
    if 'revenue' in movies_df.columns:
        movies_df['revenue']  = pd.to_numeric(movies_df['revenue'],  errors='coerce')
        movies_df['revenue']  = movies_df['revenue'].replace(0, np.nan)
    if 'runtime' in movies_df.columns:
        cast_to_float(movies_df,['runtime'])
    if 'vote_average' in movies_df.columns:
        cast_to_float(movies_df,['vote_average'])
    if 'vote_count' in movies_df.columns:
        cast_to_float(movies_df,['vote_count'])
    if 'popularity' in movies_df.columns:
        cast_to_float(movies_df,['popularity'])
  
    return movies_df

In [0]:
movies = data_convert(movies)

__json columns__

In [0]:
# convert json columns to abstract syntax trees
json_columns = ['belongs_to_collection',
                'genres',
                'production_companies',
                'production_countries',
                'spoken_languages',
                'cast',
                'crew',
                'keywords']

convert_json_to_ast(movies, json_columns)

In [0]:
# convert the columns that contain the dictionary field with the name value
get_dict_val_from_ast(movies, ['belongs_to_collection'],'name','')

In [0]:
# convert the columns that contain the list field with the name value
get_list_val_from_ast(movies, ['genres','spoken_languages'],'name','Other')
get_list_val_from_ast(movies, ['production_companies','keywords'],'name','')
get_list_val_from_ast(movies, ['production_countries'],'iso_3166_1','Other')

### Create data checkpoint - checkpoint_2
After data conversion, before cleaning up empty values

In [0]:
save_checkpoint(movies, 'movies', '2')

In [0]:
# create a checkpoint for the movies data set
movies.name = 'movies_checkpoint_2'
movies_checkpoint_2 = movies.copy()
#movies=movies_checkpoint_2.copy()

In [0]:
movies_checkpoint_2.shape

### Handle empty values

__revenue__
* Because revenue is the target value in the model, all rows with an empty revenue must be removed

In [0]:
# check shape before
print(movies.shape)

In [0]:
# remove empty target values (copy so later column assignments act on an independent frame)
movies = movies[movies['revenue'].notnull()].copy()

In [0]:
# check shape after
print(movies.shape)

In [0]:
# running a quick schema and data analysis after reducing empty revenue items:
quick_schema_analysis(movies)
quick_data_analysis(movies)

__budget, runtime__  - replace nan with mean
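
A minimal sketch of mean imputation on a made-up column follows; the frame `toy` and its values are illustrative only. Note that `mean()` skips NaN entries, so the fill value is the mean of the known values.

```python
import numpy as np
import pandas as pd

# toy column with one missing budget
toy = pd.DataFrame({'budget': [10.0, np.nan, 30.0]})

# mean of the known values (NaN is ignored): (10 + 30) / 2
fill_value = toy['budget'].mean()

# assign the filled column back to the frame
toy['budget'] = toy['budget'].fillna(fill_value)
print(toy)
```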

In [0]:
# assign the result back instead of using inplace=True on a column selection
movies['budget'] = movies['budget'].fillna(movies['budget'].mean())
movies['runtime'] = movies['runtime'].fillna(movies['runtime'].mean())

__homepage, overview, poster_path, status, tagline__ - replace nan with empty string

In [0]:
# fill all textual columns in one step and assign the result back
text_columns = ['homepage', 'overview', 'poster_path', 'status', 'tagline']
movies[text_columns] = movies[text_columns].fillna('')

### Examine distribution and correlations

In [0]:
# define the collection of numeric columns
num_columns = ['budget','popularity','runtime','vote_average','vote_count','revenue']

In [0]:
# plot histograms
movies[num_columns].hist(bins=50, figsize=(11,8))
plt.show()

In [0]:
# examine correlation of numerical columns
scatter_matrix(movies[num_columns], figsize=(15, 15))
plt.show()

### Create data checkpoint - checkpoint_3
Before removing unnecessary features and adding new features to movies

In [0]:
save_checkpoint(movies, 'movies', '3')

In [0]:
# create a checkpoint for the movies data set
movies.name = 'movies_checkpoint_3'
movies_checkpoint_3 = movies.copy()
#movies=movies_checkpoint_3.copy()

In [0]:
movies_checkpoint_3.shape

### Feature removal
Drop features that don't provide any added value

In [0]:
movies.drop(columns = 'poster_path',inplace=True) # textual column with low added value
movies.drop(columns = 'imdb_id',inplace=True) # not used in this model

In [0]:
# assess categorical columns
assess_columns(movies,['adult','video'],True)

In [0]:
# drop video and adult, as they carry no information (their only distinct value is False)
movies.drop(columns = ['adult','video'],inplace=True)

In [0]:
# case for dropping original_title: show rows where it differs from title
compare = ['title', 'original_title']
movies[movies['original_title'] != movies['title']][compare].head(5)

In [0]:
# drop original_title (it appears to be the untranslated version of the title)
movies.drop(columns='original_title',inplace=True)

### Feature engineering

In [0]:
movies.describe()

In [0]:
# add a primary genre field strictly for use in other feature transformations
movies['primary_genre'] = movies['genres'].apply(lambda x: (x and x[0]) or ('Other'))

#### New feature - actors (from cast)

In [0]:
# create actors columns based on cast
get_list_val_from_ast(df=movies, columns=['cast'],field_name='name',new_col_dict = {'cast' : 'actors'})

In [0]:
assess_columns(movies, ['actors'])

#### New feature - director (from crew)

In [0]:
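# extract the director from the crew field
def get_director(x):
    # rows without a parsed crew list (e.g. NaN) have no known director
    if not isinstance(x, list):
        return 'Unknown'
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return 'Unknown'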
  
movies['director'] = movies['crew'].apply(get_director)

In [0]:
assess_columns(movies, ['director'])

#### New features - cast_size and crew_size (from cast and crew)

In [0]:
movies['cast_size'] = movies['cast'].apply(lambda x: len(x))
movies['crew_size'] = movies['crew'].apply(lambda x: len(x))

In [0]:
assess_columns(movies, ['cast_size','crew_size'])

#### New feature - franchise (from belongs_to_collection)

In [0]:
movies['franchise'] = (movies['belongs_to_collection']
                       .apply(lambda x: len(x)>0)
                      )

In [0]:
assess_columns(movies, ['franchise'])

#### New features - season and release_year (from release_date)

In [0]:
# calculate the season of the release (spring, summer, autumn, winter)
def season_of_date(date):
    if pd.isnull(date):
        return 'unknown'
    # compare (month, day) against fixed season boundaries
    month_day = (date.month, date.day)
    if (3, 21) <= month_day <= (6, 20):
        return 'spring'
    if (6, 21) <= month_day <= (9, 22):
        return 'summer'
    if (9, 23) <= month_day <= (12, 20):
        return 'autumn'
    return 'winter'

# create a new column    
movies['season'] = (movies['release_date']
                          .fillna(pd.NaT)
                          .apply(lambda x: season_of_date(x))
                   )

In [0]:
# get the year the movie was released
def year_of_date(date):
    if pd.isnull(date):
      return -1
    year = date.year
    return year

# create a new column    
movies['release_year'] = (movies['release_date']
                          .fillna(pd.NaT)
                          .apply(lambda x: year_of_date(x))
                   )

In [0]:
assess_columns(movies, ['season','release_year'],True)

#### New feature - homepage_domain (from homepage)

In [0]:
# get domain from url
def get_url_domain(url):
    if(not pd.isnull(url)):
        parsed_uri = urlparse(url )
        return '{uri.netloc}'.format(uri=parsed_uri)
    else:
        return url

In [0]:
# extract the domain from the homepage url
movies['homepage_domain'] = movies['homepage'].apply(get_url_domain)

In [0]:
assess_columns(movies, ['homepage_domain'])

#### New feature - weighted_rating (from vote_average and vote_count)
Weighted Rating (WR) = $\left(\frac{v}{v+m}\cdot R\right) + \left(\frac{m}{v+m}\cdot C\right)$ where,

* $v$ is the number of votes for the movie
* $m$ is the minimum number of votes required to be listed in the chart
* $R$ is the average rating of the movie
* $C$ is the mean vote across the whole report

Source: https://www.kaggle.com/rounakbanik/movie-recommender-systems
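
To illustrate how the formula behaves, the sketch below evaluates it for a movie with few votes and one with many. All numbers are made up, and the helper name `weighted_rating_example` is hypothetical; it takes the four symbols as plain arguments rather than reading them from the dataframe.

```python
def weighted_rating_example(v, R, m, C):
    """Weighted rating: blend the movie's own rating R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# same average rating (9.0), very different vote counts
few_votes = weighted_rating_example(v=10, R=9.0, m=100, C=6.0)
many_votes = weighted_rating_example(v=10000, R=9.0, m=100, C=6.0)

print(round(few_votes, 4), round(many_votes, 4))
```

With few votes the result stays close to the global mean $C$; with many votes it approaches the movie's own rating $R$.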

In [0]:
# add a weighted rating feature
vote_averages = (movies[movies['vote_average']
                 .notnull()]['vote_average'].astype('int')
                )

vote_counts = (movies[movies['vote_count']
               .notnull()]['vote_count'].astype('int')
              )

C = vote_averages.mean()
m = vote_counts.quantile(0.75)

def weighted_rating(x):
    v = x['vote_count']+1
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

movies['weighted_rating'] = movies.apply(weighted_rating, axis=1)

In [0]:
assess_columns(movies, ['weighted_rating'])

#### New feature - revenue_to_budget_ratio (from revenue and budget)

In [0]:
# this new feature can indicate "success" or "failure"
# (depending on whether the ratio is below or above the chosen threshold, e.g. 2.5)
movies['revenue_to_budget_ratio'] = movies['revenue'] / movies['budget']

In [0]:
assess_columns(movies, ['revenue_to_budget_ratio'])

#### Drop columns that are no longer needed
Drop columns that were already engineered to generate other columns

In [0]:
# drop additional columns
movies.drop(columns = 'homepage',inplace=True)  #replaced by homepage_domain
movies.drop(columns = 'cast',inplace=True)  # parsed into actors
movies.drop(columns = 'crew',inplace=True)  # partially expressed in director feature
movies.drop(columns = 'release_date',inplace=True)  # can be replaced with season and release_year

#### Examine distribution and correlations - of new features

In [0]:
# define the collection of new numeric columns
new_num_columns = ['cast_size','crew_size','release_year','weighted_rating','revenue_to_budget_ratio']

In [0]:
# plot histograms
movies[new_num_columns].hist(bins=50, figsize=(11,8))
plt.show()

In [0]:
# examine correlation of numerical columns
if not 'revenue' in new_num_columns:
    new_num_columns.append('revenue')
scatter_matrix(movies[new_num_columns], figsize=(15, 15))
plt.show()

#### Remove columns that don't have any correlation with revenue

In [0]:
# drop columns that are not correlated with revenue
movies.drop(columns = ['revenue_to_budget_ratio','release_year'],inplace=True)

#### Assess movies data set to apply final data cleansing

In [0]:
movies.name = 'movies'
quick_schema_analysis(movies)

### Create data checkpoint - checkpoint_4
After feature engineering and some feature removal

In [0]:
save_checkpoint(movies, 'movies', '4')

In [0]:
# create a checkpoint for the movies data set
movies.name = 'movies_checkpoint_4'
movies_checkpoint_4 = movies.copy()
#movies=movies_checkpoint_4.copy()

In [0]:
movies_checkpoint_4.shape

### Handle textual and list-oriented features

#### Get the stopwords file

In [0]:
# specify file names and locations
stop_file_url = URL_GIT_DOMAIN + URL_GIT_PROJECT_PATH + '/stopwords.txt'

fetch_data(stop_file_url, 'stopwords.txt', PROJECT_LOCAL_RESOURCE_DIR, zip=False)

In [0]:
# load stop words file
def get_stop_words(stop_file_path):
    """load stop words """
    
    with open(stop_file_path, 'r', encoding="utf-8") as f:
        stopwords = f.readlines()
        stop_set = set(m.strip() for m in stopwords)
        return frozenset(stop_set)

# load the stop words as a sorted list (recent scikit-learn versions expect a list for stop_words)
stopwords = sorted(get_stop_words(os.path.join(PROJECT_LOCAL_RESOURCE_DIR,'stopwords.txt')))

#### Encoded and vectorized feature reduction functions
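
The idea behind the functions in this subsection is to drop encoded columns that are almost always zero. The following is a minimal vectorized sketch of that idea on a made-up frame; `drop_rare_columns` and the column names are hypothetical and are not the functions defined in this notebook.

```python
import pandas as pd

def drop_rare_columns(df, threshold):
    """Keep columns whose share of rows equal to 1 is at least `threshold`."""
    share_of_ones = (df == 1).mean()
    return df.loc[:, share_of_ones >= threshold]

# toy one-hot style encoding with 4 rows
encoded = pd.DataFrame({
    'title_love':  [1, 0, 1, 0],   # present in 50% of rows
    'title_zebra': [0, 0, 0, 1],   # present in 25% of rows
    'title_night': [0, 0, 0, 0],   # never present
})

reduced = drop_rare_columns(encoded, threshold=0.5)
print(list(reduced.columns))
```

A column sitting exactly at the threshold is kept, which matches a strict "less than" test for dropping.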

In [0]:
# reduce the encoded features using a threshold
def reduce_sparse_encoding(df,threshold):
    headers_list = list(df.columns.values)
    num_samples = df.shape[0]

    for header in headers_list:
        col_true_count = len(df[df[header] == 1])
        if col_true_count/num_samples < threshold:
            df = df.drop(header, axis=1)
    
    return df

In [0]:
# reduce the weighted features using a threshold
def reduce_sparse_weights(df,threshold):
    headers_list = list(df.columns.values)

    for header in headers_list:
        col_weight_sum = df[header].sum()
        if col_weight_sum < threshold:
            df = df.drop(header, axis=1)
    
    return df

#### title

In [0]:
# vectorizing the titles - using word counts
count_vectorizer = CountVectorizer(max_features=1000, binary=True, max_df=0.8,stop_words=stopwords) 
titles = movies["title"] 
titles_transformed = count_vectorizer.fit_transform(titles)

In [0]:
# examine the vectorized titles
idx_to_word = np.array(count_vectorizer.get_feature_names_out())  # scikit-learn >= 1.0; older versions used get_feature_names()
print(idx_to_word)

In [0]:
title_titles = ['title_' + title for title in idx_to_word]
titles_df = pd.DataFrame(titles_transformed.toarray(),columns=title_titles)
titles_df.shape

In [0]:
# drop encoded columns that equal 1 in fewer than 0.1% of the rows
titles_df = reduce_sparse_encoding(titles_df,0.001)
titles_df.shape

#### belongs_to_collection

In [0]:
# replace empty lists with an empty string
collections = [''.join(collection).lower().replace('collection','').replace('series','') 
               for collection in movies['belongs_to_collection'].values]
# Vectorizing the collection names - using word counts
count_vectorizer = CountVectorizer(max_features=100, stop_words=stopwords) 
collections_transformed = count_vectorizer.fit_transform(collections)

In [0]:
# examine the vectorized collections
idx_to_word = np.array(count_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# create a DataFrame
collection_titles = ['coll_' + title for title in idx_to_word]
collections_df = pd.DataFrame(collections_transformed.toarray(),columns=collection_titles)
collections_df.shape

In [0]:
# drop columns whose value is 1 in fewer than 0.1% of the rows
collections_df = reduce_sparse_encoding(collections_df,0.001)
collections_df.shape

#### keywords

In [0]:
# prepare the keywords: join each movie's keyword entry into one lower-case string
keywords = [''.join(keyword).lower() 
               for keyword in movies['keywords'].values]

# Vectorizing the keywords - using word counts
count_vectorizer = CountVectorizer(max_features=100, stop_words=stopwords) 
keywords_transformed = count_vectorizer.fit_transform(keywords)

In [0]:
# examine the vectorized keywords
idx_to_word = np.array(count_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# create a DataFrame
keyword_titles = ['key_' + title for title in idx_to_word]
keywords_df = pd.DataFrame(keywords_transformed.toarray(),columns=keyword_titles)
keywords_df.shape

In [0]:
# drop columns whose value is 1 in fewer than 0.1% of the rows
keywords_df = reduce_sparse_encoding(keywords_df,0.001)
keywords_df.shape

#### tagline

In [0]:
# vectorizing tagline - using tf-idf weighted term-document matrix
tfidf_vectorizer = TfidfVectorizer(max_features=1000, max_df=0.8,stop_words=stopwords) 
taglines = movies['tagline'].replace(np.nan,'')
taglines_transformed = tfidf_vectorizer.fit_transform(taglines)

In [0]:
# examine the vectorized taglines
idx_to_word = np.array(tfidf_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# apply NMF
nmf = NMF(n_components=100, solver="mu")
taglines_nmf = nmf.fit_transform(taglines_transformed)

In [0]:
# create a DataFrame for tagline
tagline_titles = nmf.components_
tagline_titles = ['tagline_' + '_'.join(idx_to_word[title.argsort()[-10:]]) for title in tagline_titles]
tagline_df = pd.DataFrame(taglines_nmf,columns=tagline_titles)
tagline_df.shape

In [0]:
# drop NMF topic features whose summed weight over all rows is below 1
tagline_df = reduce_sparse_weights(tagline_df,1)
tagline_df.shape
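
The tagline steps above (TF-IDF, then NMF, then naming each topic after its highest-weighted terms) can be sketched on a toy corpus. The documents, the component count and the `init`, `random_state` and `max_iter` settings are assumptions chosen for illustration; the index-to-term array is rebuilt from the fitted `vocabulary_` so the sketch does not depend on a particular scikit-learn version.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "love story in paris",
    "a love story of two friends",
    "space war between planets",
    "war in deep space",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Map column index -> term using the fitted vocabulary.
idx_to_word = np.array(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))

nmf = NMF(n_components=2, solver="mu", init="random", random_state=0, max_iter=500)
doc_topics = nmf.fit_transform(tfidf)  # shape: (n_docs, n_topics)

# Name each topic after its three highest-weighted terms.
topic_names = ["_".join(idx_to_word[row.argsort()[-3:]]) for row in nmf.components_]
print(doc_topics.shape, topic_names)
```

Each row of `doc_topics` holds one document's non-negative weight per topic, which is what ends up as a `tagline_...` column in the notebook.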

#### overview

In [0]:
# vectorizing overview - using tf-idf weighted term-document matrix 
overviews = movies['overview'].replace(np.nan,'')
tfidf_vectorizer = TfidfVectorizer(max_features=1000, max_df=0.8,stop_words=stopwords) 
overviews_transformed = tfidf_vectorizer.fit_transform(overviews)

In [0]:
# examine the vectorized overviews
idx_to_word = np.array(tfidf_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# apply NMF: re-fit the same NMF instance on the overview matrix
overviews_nmf = nmf.fit_transform(overviews_transformed)

In [0]:
# create a DataFrame for overview
overview_titles = nmf.components_
overview_titles = ['overview_' + '_'.join(idx_to_word[title.argsort()[-10:]]) for title in overview_titles]
overview_df = pd.DataFrame(overviews_nmf,columns=overview_titles)
overview_df.shape

In [0]:
# drop NMF topic features whose summed weight over all rows is below 1
overview_df = reduce_sparse_weights(overview_df,1)
overview_df.shape

#### spoken_languages

In [0]:
# vectorizing spoken languages - using word counts (a new vectorizer with default settings)
count_vectorizer = CountVectorizer()
spoken_laguages = [','.join(lang).lower() 
               for lang in movies['spoken_languages'].values]
spoken_laguages_transformed = count_vectorizer.fit_transform(spoken_laguages)

In [0]:
# examine the vectorized spoken languages
idx_to_word = np.array(count_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# create a DataFrame
spoken_languages_titles = ['s_lang_' + lang for lang in idx_to_word]
spoken_languages_df = pd.DataFrame(spoken_laguages_transformed.toarray(),columns=spoken_languages_titles)
spoken_languages_df.shape

In [0]:
# drop columns whose value is 1 in fewer than 0.1% of the rows
spoken_languages_df = reduce_sparse_encoding(spoken_languages_df,0.001)
spoken_languages_df.shape

#### original_language

In [0]:
# vectorizing original language - using word counts (a new vectorizer with default settings)
count_vectorizer = CountVectorizer()
original_languages = movies['original_language'].replace(np.nan,'')
original_languages_transformed = count_vectorizer.fit_transform(original_languages)

In [0]:
# examine the vectorized original languages
idx_to_word = np.array(count_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# create a DataFrame
original_languages_titles = ['o_lang_' + lang for lang in idx_to_word]
original_languages_df = pd.DataFrame(original_languages_transformed.toarray(),columns=original_languages_titles)
original_languages_df.shape

In [0]:
# drop columns whose value is 1 in fewer than 0.1% of the rows
original_languages_df = reduce_sparse_encoding(original_languages_df,0.001)
original_languages_df.shape

#### homepage_domain

In [0]:
# vectorizing homepage domain - using 3-gram vectorizing (to select all 3 parts of the url)
ngram_vectorizer = CountVectorizer(max_features=100, binary=True, max_df=0.8,stop_words=stopwords,ngram_range=(3, 3)) 
homepage_domains = movies['homepage_domain'].replace(np.nan,'')
homepage_domains_transformed = ngram_vectorizer.fit_transform(homepage_domains)

In [0]:
# examine the vectorized homepage domains
idx_to_word = np.array(ngram_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# create a DataFrame
homepage_domain_titles = ['home_'+ url.replace(' ','.') for url in idx_to_word]
homepage_domains_df = pd.DataFrame(homepage_domains_transformed.toarray(),columns=homepage_domain_titles)
homepage_domains_df.shape

In [0]:
# drop columns whose value is 1 in fewer than 0.1% of the rows
homepage_domains_df = reduce_sparse_encoding(homepage_domains_df,0.001)
homepage_domains_df.shape

#### genres

In [0]:
# Prepare the data in the column
genres = [','.join(gen) for gen in movies['genres'].values]

# Vectorizing genres 
count_vectorizer = CountVectorizer()
genres_transformed = count_vectorizer.fit_transform(genres)

In [0]:
# examine the vectorized genres
idx_to_word = np.array(count_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# set titles and define new DataFrame
genre_list = idx_to_word
genre_titles = ['gen_' + gen for gen in idx_to_word]
genres_df = pd.DataFrame(genres_transformed.toarray(),columns=genre_titles)
genres_df.shape

In [0]:
# show the top 10 genres

genres_count = pd.Series(count_values(movies, 'genres'))
genres_count.sort_values(ascending = False).head(10).plot(kind = 'bar')

In [0]:
# show the genre distribution

plt.rc('font', weight='bold')
f, ax = plt.subplots(figsize=(5,5))
genre_count = []
for genre in genre_titles:
    genre_name = genre[4:]
    genre_count.append([genre_name, genres_df[genre].values.sum()])

genre_count.sort(key = lambda x:x[1], reverse = True)

labels, sizes = zip(*genre_count)
labels_selected = [n if v > sum(sizes) * 0.01 else '' for n, v in genre_count]

ax.pie(sizes, labels=labels_selected,
      autopct = lambda x:'{:2.0f}%'.format(x) if x>1 else '',
      shadow = False, startangle=0)
ax.axis('equal')

plt.show()

In [0]:
# explore impact of genre on budget and revenue

mean_per_genre = movies[['primary_genre','revenue','budget']].groupby(['primary_genre'], as_index=True).mean()

mean_per_genre.rename({'revenue':'mean_revenue', 'budget':'mean_budget'}, axis=1, inplace=True)

mean_per_genre

In [0]:
# drop columns whose value is 1 in fewer than 0.1% of the rows
genres_df = reduce_sparse_encoding(genres_df,0.001)
genres_df.shape

#### production_companies

In [0]:
# prepare the data in the column
production_companies = [','.join(prod).strip().replace(' ','_') 
               for prod in movies['production_companies'].values]

# Vectorizing production companies 
count_vectorizer = CountVectorizer(max_features=1000)
production_companies_transformed = count_vectorizer.fit_transform(production_companies)

In [0]:
# examine the vectorized production companies
idx_to_word = np.array(count_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# set titles and define new DataFrame
production_companies_titles = ['prod_' + prod for prod in idx_to_word]
production_companies_df = pd.DataFrame(production_companies_transformed.toarray(),columns=production_companies_titles)
production_companies_df.shape

In [0]:
# drop columns whose value is 1 in fewer than 0.1% of the rows
production_companies_df = reduce_sparse_encoding(production_companies_df,0.001)
production_companies_df.shape

#### production_countries

In [0]:
# Vectorizing production countries 
count_vectorizer = CountVectorizer() 
production_countries = [','.join(country).lower() 
               for country in movies['production_countries'].values]
production_countries_transformed = count_vectorizer.fit_transform(production_countries)

In [0]:
# examine the vectorized production countries
idx_to_word = np.array(count_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# create a DataFrame
production_countries_titles = ['country_' + country for country in idx_to_word]
production_countries_df = pd.DataFrame(production_countries_transformed.toarray(),columns=production_countries_titles)
production_countries_df.shape

In [0]:
# drop columns whose value is 1 in fewer than 0.1% of the rows
production_countries_df = reduce_sparse_encoding(production_countries_df,0.001)
production_countries_df.shape

#### director

In [0]:
# prepare the data in the column
directors = [director.strip().replace(' ','_') 
               for director in movies['director'].values]

# Vectorizing directors
count_vectorizer = CountVectorizer(max_features=100) 
directors_transformed = count_vectorizer.fit_transform(directors)

In [0]:
# examine the vectorized directors
idx_to_word = np.array(count_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# set titles and define new DataFrame
director_titles = ['director_' + director for director in idx_to_word]
directors_df = pd.DataFrame(directors_transformed.toarray(),columns=director_titles)
directors_df.shape

In [0]:
# drop columns whose value is 1 in fewer than 0.1% of the rows
directors_df = reduce_sparse_encoding(directors_df,0.001)
directors_df.shape

#### actors

In [0]:
# prepare the data in the column
actors = [','.join(actor).strip().replace(' ','_') 
               for actor in movies['actors'].values]

# Vectorizing actors
count_vectorizer = CountVectorizer(max_features=1000) 
actors_transformed = count_vectorizer.fit_transform(actors)

In [0]:
# examine the vectorized actors
idx_to_word = np.array(count_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# set titles and define new DataFrame
actor_titles = ['actor_' + actor for actor in idx_to_word]
actors_df = pd.DataFrame(actors_transformed.toarray(),columns=actor_titles)
actors_df.shape

In [0]:
# drop columns whose value is 1 in fewer than 0.1% of the rows
actors_df = reduce_sparse_encoding(actors_df,0.001)
actors_df.shape

### One-hot encoding for categorical features

#### season

In [0]:
season_df = pd.get_dummies(movies[['season']],prefix='season')
season_df.shape

In [0]:
# drop columns whose value is 1 in fewer than 0.1% of the rows
season_df = reduce_sparse_encoding(season_df,0.001)
season_df.shape

#### status

In [0]:
status_df = pd.get_dummies(movies[['status']],prefix='status')
status_df.shape

In [0]:
# drop columns whose value is 1 in fewer than 0.1% of the rows
status_df = reduce_sparse_encoding(status_df,0.001)
status_df.shape

#### franchise

In [0]:
franchise_df = pd.get_dummies(movies[['franchise']],prefix='fran')
franchise_df.shape

### Merge encoded and vectorized sets into movies

In [0]:
remove_columns = ['revenue','title', 'belongs_to_collection', 'keywords', 'tagline', 'overview', 'spoken_languages',
                  'original_language', 'homepage_domain', 'genres', 'production_companies', 'production_countries', 
                  'director', 'actors', 'season', 'status', 'franchise']

# create the final data set
movies_final_arr = np.concatenate((movies[['revenue']].values,movies.drop(remove_columns,axis=1).values,
                                 titles_df.values,collections_df.values,
                                 keywords_df.values,tagline_df.values,overview_df.values,
                                 spoken_languages_df.values,original_languages_df.values,
                                 homepage_domains_df.values,genres_df.values,
                                 production_companies_df.values,production_countries_df.values,
                                 directors_df.values,actors_df.values,season_df.values,
                                 status_df.values,franchise_df.values),  axis=1)

# set column headers
movies_headers_arr = np.concatenate((movies[['revenue']].columns.values,movies.drop(remove_columns,axis=1).columns.values,
                                 titles_df.columns.values,collections_df.columns.values,
                                 keywords_df.columns.values,tagline_df.columns.values,overview_df.columns.values,
                                 spoken_languages_df.columns.values,original_languages_df.columns.values,
                                 homepage_domains_df.columns.values,genres_df.columns.values,
                                 production_companies_df.columns.values,production_countries_df.columns.values,
                                 directors_df.columns.values,actors_df.columns.values,season_df.columns.values,
                                 status_df.columns.values,franchise_df.columns.values))

# create a DataFrame object
movies_final = pd.DataFrame(movies_final_arr,columns=movies_headers_arr)
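
Concatenating `.values` arrays collapses everything into a single array, so if any remaining column is non-numeric (for example `primary_genre`) the resulting frame holds `object` columns only. A possible alternative, sketched here on toy frames with made-up values, is `pd.concat`, which keeps each column's dtype and its header in one step. Resetting the index is an assumption about the data: it makes the join positional, matching what the array concatenation does.

```python
import pandas as pd

# Toy stand-ins: a base frame with a non-default index (as after row filtering)
# and one encoded frame with a fresh 0..n-1 index.
base = pd.DataFrame(
    {"budget": [10.0, 20.0, 30.0], "primary_genre": ["drama", "comedy", "drama"]},
    index=[5, 7, 9],
)
genres_toy = pd.DataFrame({"gen_drama": [1, 0, 1], "gen_comedy": [0, 1, 0]})

parts = [base, genres_toy]
# Reset every index so rows are joined by position, not by label.
merged = pd.concat([part.reset_index(drop=True) for part in parts], axis=1)

print(merged.dtypes)
```

Without the `reset_index` calls, `pd.concat` would align on index labels and introduce missing values wherever the labels differ.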

In [0]:
# check the new schema
movies_final.name = 'movies_final'
quick_schema_analysis(movies_final)

In [0]:
# assign back to movies
movies = movies_final

### Create data checkpoint - checkpoint_5
After creating the full set of the features 

In [0]:
save_checkpoint(movies, 'movies', '5')

In [0]:
# create a checkpoint for the movies data set
movies.name = 'movies_checkpoint_5'
movies_checkpoint_5 = movies.copy()
# movies=movies_checkpoint_5.copy()

In [0]:
movies_checkpoint_5.shape

### Feature selection and reduction - continued
Once the full feature set has been assembled, we use feature selection and reduction algorithms to come up with a subset of the features.

The algorithms below are optimized for regression (specifically linear regression). Further algorithms will be used in the model creation section.

In [0]:
# drop primary_genre since it was only used to assess and calculate other features
movies.drop(['primary_genre'], axis=1, inplace=True)

In [0]:
# define target value and features
target = 'revenue'
features = list(movies.columns)
features = [f for f in features if f!=target]

In [0]:
# define train set and test set
train_set, test_set = train_test_split(movies, test_size=0.3)

X_tr = train_set[features]
y_tr = train_set[[target]]

X_te = test_set[features]
y_te = test_set[[target]]

# add a boolean target for classification: True when revenue/budget exceeds 2.5
y_tr_cl = np.ravel(y_tr)/np.ravel(train_set.budget)>2.50
y_te_cl = np.ravel(y_te)/np.ravel(test_set.budget)>2.50

#### Target

The regression target is revenue. To classify a movie as a financial success, the classifier-specific target is whether the revenue/budget ratio exceeds 2.5.
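
As a small worked example of the classification target, the sketch below derives the success label from made-up revenue and budget figures. The numbers and the constant name `SUCCESS_RATIO` are illustrative; only the 2.5 threshold and the strict comparison come from the notebook.

```python
import numpy as np

SUCCESS_RATIO = 2.5  # threshold used by the notebook

# Made-up figures, in millions.
revenue = np.array([300.0, 40.0, 125.0, 90.0])
budget = np.array([100.0, 50.0, 50.0, 30.0])

ratio = revenue / budget
is_success = ratio > SUCCESS_RATIO  # strict comparison: exactly 2.5 is not a success

print(ratio, is_success)
```

Note that a budget of zero would make the ratio infinite or undefined, so this assumes such rows have already been removed.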

#### Scoring functions

In [0]:
# display the score
def display_scores(scores):
    mean_score = scores.mean()
    print("Scores:", scores)
    print("Mean:", mean_score)
    return(mean_score)

#### SelectKBest with chi square

In [0]:
# define the select K best model - using chi square
select_best = SelectKBest(score_func=chi2)

In [0]:
# identify the best hyper parameter
k_arr = [20, 30, 40 ,50, 75, 100, 200]

best_k = 0
best_score = float('inf')

for k in k_arr:
    select_best.k = k
    select_best.fit(np.asarray(X_tr),np.asarray(y_tr, dtype="|S100"))
    X_tr_new = select_best.transform(X_tr)
    X_te_new = select_best.transform(X_te)
    print("\n")
    print("Performance for k={}".format(k))
    print("Training set:")
    lin_scores = cross_val_score(LinearRegression(), X_tr_new, y_tr, scoring="neg_mean_squared_error", cv=4)
    lin_rmse_scores = np.sqrt(-lin_scores)
    mean_train_score = display_scores(lin_rmse_scores)
    print("Test set:")
    lin_scores = cross_val_score(LinearRegression(), X_te_new, y_te, scoring="neg_mean_squared_error", cv=4)
    lin_rmse_scores = np.sqrt(-lin_scores)
    mean_test_score = display_scores(lin_rmse_scores)
    
    if mean_test_score < best_score:
        best_score = mean_test_score
        best_k = k

In [0]:
print('Best Score = ', best_score)
print('Best K     = ', best_k)

In [0]:
# pick the best k
select_best.k = best_k
select_best.fit(np.asarray(X_tr),np.asarray(y_tr, dtype="|S100"))
X_tr = select_best.transform(X_tr)
X_te = select_best.transform(X_te)
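
One way to tune `k` without consulting the test set is to put the selector inside a `Pipeline` and let `GridSearchCV` pick `k` by cross-validation on the training data. The sketch below is an alternative under stated assumptions, not the notebook's procedure: it uses synthetic data from `make_regression`, and it swaps in `f_regression` as the scoring function because the target is continuous.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the movie features.
X, y = make_regression(n_samples=300, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),
    ("regression", LinearRegression()),
])
search = GridSearchCV(pipe, {"select__k": [5, 10, 20, 30]}, cv=4,
                      scoring="neg_mean_squared_error")
search.fit(X_train, y_train)  # selection and regression are both fit per training fold

best_k = search.best_params_["select__k"]
cv_rmse = np.sqrt(-search.best_score_)
test_rmse = np.sqrt(np.mean((search.predict(X_test) - y_test) ** 2))
print(best_k, cv_rmse, test_rmse)
```

Because the test split is used only for the final `predict` call, `test_rmse` remains an estimate on data that played no part in choosing `k`.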

#### Scaling the features

In [0]:
scaler = StandardScaler()
scaler.fit(X_tr)
X_tr = scaler.transform(X_tr)
X_te = scaler.transform(X_te)

#### Feature tuning pipeline (using linear regression)

In [0]:
# Linear regression and pipeline (to evaluate feature selection and reduction)
steps = [('regression',LinearRegression())]
pipeline = Pipeline(steps)

#### PCA

In [0]:
# Make an instance of the Model
pca = PCA()

In [0]:
# add to the steps collection
pipeline.steps.insert(0,('pca',pca))

In [0]:
# GridSearchCV
parameters = {}
grid_search = GridSearchCV(pipeline, parameters, cv=4, scoring='neg_mean_squared_error')

In [0]:
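# add corresponding parameters
# (a float in (0, 1) is the fraction of variance to keep; None keeps all components,
#  whereas the integer 1 would keep a single component)
grid_search.param_grid['pca__n_components'] = [0.9,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,0.999,None]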

In [0]:
# fit the model
grid_search.fit(X_tr,y_tr)

In [0]:
# training scores
print("Best pca parameter is {}".format(grid_search.best_params_))
lin_scores = grid_search.best_score_
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

In [0]:
# evaluate the best model (already fitted on the training data) on the test data
grid_search_final = grid_search.best_estimator_ 
y_te_pred = grid_search_final.predict(X_te)
test_rmse = np.sqrt(mean_squared_error(y_te, y_te_pred))
print("Test RMSE:", test_rmse)

In [0]:
# tidy up - remove the pca step from the pipeline
del pipeline.steps[0]

* __Conclusion:__ PCA does not noticeably improve the results
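
When reading a PCA grid, it helps to keep in mind that scikit-learn treats a float `n_components` between 0 and 1 as a fraction of variance to retain, an integer as an exact component count, and `None` as "keep everything". The sketch below illustrates the difference on random data; the data and its variance profile are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# 200 samples, 10 features with decreasing standard deviation (10, 9, ..., 1).
X_demo = rng.normal(size=(200, 10)) * np.arange(10, 0, -1)

pca_fraction = PCA(n_components=0.95).fit(X_demo)  # enough components for >= 95% variance
pca_single = PCA(n_components=1).fit(X_demo)       # exactly one component
pca_all = PCA(n_components=None).fit(X_demo)       # every component

print(pca_fraction.n_components_, pca_single.n_components_, pca_all.n_components_)
print(pca_fraction.explained_variance_ratio_.sum())
```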

#### Clustering - K-means

In [0]:
kmeans = KMeans()

# add to the steps collection
pipeline.steps.insert(0,('kmeans',kmeans))

In [0]:
# GridSearchCV
parameters = {}
grid_search = GridSearchCV(pipeline,parameters, cv=4, scoring='neg_mean_squared_error')

# add corresponding parameters
grid_search.param_grid['kmeans__n_clusters'] = [2,4,6,8,10,15,20,30,40]

In [0]:
# fit the model
grid_search.fit(X_tr,y_tr)

In [0]:
# training set scores
print("Best k-means parameter is {}".format(grid_search.best_params_))
lin_scores = grid_search.best_score_
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

In [0]:
# evaluate the best model (already fitted on the training data) on the test data
grid_search_final = grid_search.best_estimator_ 
y_te_pred = grid_search_final.predict(X_te)
test_rmse = np.sqrt(mean_squared_error(y_te, y_te_pred))
print("Test RMSE:", test_rmse)

* __Conclusion__: KMeans clustering does not improve the results
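
It may help to see what the `kmeans` step hands to the regressor. As an intermediate pipeline step, `KMeans` acts as a transformer whose output is the distance from each sample to every cluster centre, so the regression sees `n_clusters` columns instead of the original features. A minimal sketch on synthetic data (the data, cluster count and seeds are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Two synthetic blobs with four features each.
X_demo = np.vstack([
    rng.normal(loc=0.0, size=(50, 4)),
    rng.normal(loc=5.0, size=(50, 4)),
])

kmeans_demo = KMeans(n_clusters=3, n_init=10, random_state=0)
distances = kmeans_demo.fit_transform(X_demo)  # one column per cluster centre

print(X_demo.shape, "->", distances.shape)
```

With only a handful of clusters, much of the information in the selected features is compressed away, which is one plausible reason the step did not help here.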

## 4. Model

### Reasons for considering a classification model in addition to the regression model

Box office profits depend on production budgets, ticket prices, and marketing/distribution costs, all of which vary with the year of release.

Hence, when evaluating historical values, profit calculations that do not account for inflation do not always provide the big picture (pun intended). A recommended alternative is the revenue-to-budget ratio (R/B), which offers better insight than revenue alone.

This ratio is also useful for determining the relative success of a theatrical release. The total cost of a movie is not limited to its production budget; it should also take into account distribution and marketing costs, which are increasingly important to a movie's success. Accordingly, movie producers and industry watchers now believe that a movie needs to make in excess of 2.5 times its production budget to be considered even a moderate success.

Since the model predicts success based on the revenue-to-budget ratio, the feature matrix drops 'revenue' and 'revenue to budget ratio': neither is available for a movie that has yet to be released, which is exactly the case we want to predict. The aim is to predict film success using features such as genre, production company, month/season of release, cast/crew size, country of production, franchise membership, weighted rating and budget. These are parameters that are already known before the movie's release.

### Classification

#### Logistic regression

In [0]:
logreg=LogisticRegression()
params_logreg={'C':np.logspace(-5, 8, 15)}
logreg_cv = GridSearchCV(logreg,params_logreg,cv=4)
logreg_cv.fit(X_tr,y_tr_cl)

In [0]:
print('Accuracy: ',logreg_cv.best_score_)

#### Random forest

In [0]:
forest=RandomForestClassifier(random_state=42)
params_forest = {'n_estimators': [3, 4, 6, 7, 10, 20, 50, 100]}
forest_cv=GridSearchCV(forest, params_forest,cv=4)
forest_cv.fit(X_tr,y_tr_cl)

In [0]:
print('Accuracy: ',forest_cv.best_score_)

#### KNN

In [0]:
knn=KNeighborsClassifier()
params_knn = {'n_neighbors': [3, 5, 10, 20]}
knn_cv=GridSearchCV(knn, params_knn,cv=4,n_jobs=-1)
knn_cv.fit(X_tr,y_tr_cl)

In [0]:
print('Accuracy:',knn_cv.best_score_)

#### Decision tree

In [0]:
tree = DecisionTreeClassifier()
params_tree = {"criterion": ["gini", "entropy"],"min_samples_split": [2, 10, 20],"max_depth": [None, 2, 5, 10],
              "min_samples_leaf": [1, 5, 10],"max_leaf_nodes": [None, 5, 10, 20]
              }
tree_cv=GridSearchCV(tree, params_tree,cv=4,n_jobs=-1)
tree_cv.fit(X_tr,y_tr_cl)

In [0]:
print('Accuracy:',tree_cv.best_score_)

#### AdaBoostClassifier with DecisionTree as base estimator

In [0]:
ABC= AdaBoostClassifier(base_estimator=tree_cv.best_estimator_,n_estimators=100)
np.mean(cross_val_score(ABC, X_tr, y_tr_cl, cv=3))

#### Voting classifier

In [0]:
classifiers = [('Logistic Regression', logreg_cv.best_estimator_),('K Nearest Neighbours', knn_cv.best_estimator_),
               ('RandomForestClassifier', forest_cv.best_estimator_),('DecisionTreeClassifier',tree_cv.best_estimator_)]

vc = VotingClassifier(estimators=classifiers)

In [0]:
print('Accuracy: ',np.mean(cross_val_score(vc, X_tr, y_tr_cl, cv=3)))

### Regression

#### Linear regression

In [0]:
lin_reg=LinearRegression()
lin_scores = cross_val_score(lin_reg,X_tr,y_tr, scoring="neg_mean_squared_error", cv=4)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

#### Ridge regression

In [0]:
param_grid = [{'alpha': [0.001,0.01,0.1,1,10,100,1000]}]
rr_cv = GridSearchCV(Ridge(), param_grid, cv=4, scoring='neg_mean_squared_error')
rr_cv.fit(X_tr, y_tr)

In [0]:
print(np.sqrt(-rr_cv.best_score_))

#### Lasso regression

In [0]:
param_grid = [{'alpha': [0.001,0.01,0.1,1,10,100,1000]}]
lr_cv = GridSearchCV(Lasso(), param_grid, cv=3, scoring='neg_mean_squared_error')
lr_cv.fit(X_tr, y_tr)

In [0]:
print(np.sqrt(-lr_cv.best_score_))

#### Elastic Net

In [0]:
# run grid search over alpha and l1_ratio
param_grid = [{'alpha': [0.001,0.01,0.1,1,10,100,1000], 'l1_ratio': [0.2, 0.4, 0.6, 0.8]}]
enr_cv = GridSearchCV(ElasticNet(), param_grid, cv=3,scoring='neg_mean_squared_error', verbose=2)
enr_cv.fit(X_tr, y_tr)

In [0]:
print('Best parameters: {}'.format(enr_cv.best_params_))
print('Best regression result: {}'.format(np.sqrt(-enr_cv.best_score_)))

### Test data set evaluation

#### Classification

In [0]:
final_cl_model = forest_cv.best_estimator_  
y_pred_cl = final_cl_model.predict(X_te)

print('Accuracy score: ',accuracy_score(y_te_cl,y_pred_cl))

* Random Forest has a train accuracy of __0.8080__ and a test accuracy of __0.8068__

#### Regression

In [0]:
final_reg_model = enr_cv.best_estimator_
final_reg_model.fit(X_tr,y_tr)

y_pred_reg = final_reg_model.predict(X_te)

final_mse = mean_squared_error(y_te, y_pred_reg)
final_rmse = np.sqrt(final_mse)
print(final_rmse)

## 5. Evaluate the model

### Best Classifier - Random Forest
### Best Regressor - ElasticNet (with l1_ratio=0.6)

### Evaluation functions

In [0]:
# heatmap for the confusion matrix
def cm_heatmap(y_test,y_pred):
    cm=confusion_matrix(y_test,y_pred)
    conf_matrix=pd.DataFrame(data=cm,columns=['Predicted:0','Predicted:1'],index=['Actual:0','Actual:1'])
    plt.figure(figsize = (8,5))
    sns.heatmap(conf_matrix, annot=True,fmt='d',cmap="YlGnBu")

In [0]:
# compare different models
def metrics_comp(y_te_cl,y_pred_cl,y_pred_tree):
    metrics=pd.DataFrame(index=['Accuracy','Precision','Recall'],columns=['Random Forest','Decision Tree'])
    metrics.loc['Accuracy','Random Forest']=accuracy_score(y_te_cl,y_pred_cl)
    metrics.loc['Precision','Random Forest']=precision_score(y_te_cl,y_pred_cl)
    metrics.loc['Recall','Random Forest']=recall_score(y_te_cl,y_pred_cl)
    metrics.loc['Accuracy','Decision Tree']=accuracy_score(y_te_cl,y_pred_tree)
    metrics.loc['Precision','Decision Tree']=precision_score(y_te_cl,y_pred_tree)
    metrics.loc['Recall','Decision Tree']=recall_score(y_te_cl,y_pred_tree)
    return(metrics)

In [0]:
# precision-recall graph
def p_r_threshold(thresholds,precision,recall):
    fig,ax=plt.subplots(figsize=(8,5))
    # precision_recall_curve returns one more precision/recall value than thresholds;
    # the final pair (precision=1, recall=0) has no threshold, so drop it
    ax.plot(thresholds,precision[:-1],label='Precision')
    ax.plot(thresholds,recall[:-1],label='Recall')
    ax.set_xlabel('Classification threshold')
    ax.set_ylabel('Precision,Recall')
    ax.set_title('Random Forest: Precision Recall')
    ax.legend()
    ax.grid();

### Plots

#### Confusion Matrix

In [0]:
cm_heatmap(y_te_cl,y_pred_cl)

#### ROC Curve and AUC

In [0]:
fpr, tpr, thresholds = roc_curve(y_te_cl, final_cl_model.predict_proba(X_te)[:,1])
plt.plot(fpr,tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for RandomForest classifier')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('Recall')
plt.grid(True)

In [0]:
print ("Area under the curve is ", roc_auc_score(y_te_cl,final_cl_model.predict_proba(X_te)[:,1]))

#### Classification Report

In [0]:
print(classification_report(y_te_cl, y_pred_cl))

#### Evaluation of the 2nd best classifier model (Decision Tree)

In [0]:
# Test data
y_pred_tree = tree_cv.predict(X_te)
print('Accuracy score: ',accuracy_score(y_te_cl,y_pred_tree))

In [0]:
# Confusion matrix
cm_heatmap(y_te_cl,y_pred_tree)

In [0]:
print(classification_report(y_te_cl, y_pred_tree))

In [0]:
fpr, tpr, thresholds = roc_curve(y_te_cl, tree_cv.predict_proba(X_te)[:,1])
plt.plot(fpr,tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for Decision Tree Classifier')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('Recall')
plt.grid(True)

In [0]:
print ("Area under the curve is ", roc_auc_score(y_te_cl,tree_cv.predict_proba(X_te)[:,1]))

### Comparison of Accuracy, Precision and Recall between Random Forest and Decision Tree classifiers
In movie box office prediction, high recall means that few films that will be successful are missed, while high precision ensures that the majority of films predicted to be successful actually are.

As the average cost of producing a major studio film is extremely high and the number of films produced by a production house is relatively low, it is important that the films the classifier predicts to be successful actually succeed. A false positive can result in a huge loss for the production house. Hence a model with high precision is preferred over one with high recall.
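
The two metrics discussed above follow directly from confusion-matrix counts. The sketch below uses hypothetical counts (not results from this notebook) to show how a model can be precise while still missing many successful films.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted successes that were real
    recall = tp / (tp + fn)     # share of real successes that were found
    return precision, recall

# Hypothetical counts: 40 correct positive predictions, 10 false alarms,
# and 60 successful films the model missed.
precision, recall = precision_recall(tp=40, fp=10, fn=60)
print(precision, recall)
```

For a studio deciding which few films to back, the 10 false alarms are the costly errors, which is why precision is weighted more heavily here.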

In [0]:
# compare random forest and decision tree
metrics_comp(y_te_cl,y_pred_cl,y_pred_tree)

In [0]:
fig,ax=plt.subplots(figsize=(8,6))
metrics_comp(y_te_cl,y_pred_cl,y_pred_tree).plot(kind='bar',ax=ax)
plt.show()

* __Conclusion__: The accuracy of the two models is close, with Random Forest slightly ahead. Random Forest also has higher precision than Decision Tree.

#### Tradeoff between precision and recall at different thresholds

In [0]:
precision_forest, recall_forest, thresholds_forest=precision_recall_curve(y_te_cl,final_cl_model.predict_proba(X_te)[:,1])

In [0]:
p_r_threshold(thresholds_forest,precision_forest,recall_forest)

* __Conclusion__: At a threshold of 0.5, recall is 0.4 and precision is 0.7. Moving the threshold to 0.6 raises precision to 0.8 without a significant drop in recall.
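
Applying a non-default threshold means comparing the predicted probability of the positive class against the chosen cut-off instead of calling `predict`. The sketch below uses made-up labels and probabilities; with the notebook's model, the `proba` array would correspond to `final_cl_model.predict_proba(X_te)[:, 1]`.

```python
import numpy as np

# Made-up ground truth and predicted P(success) for eight films.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
proba = np.array([0.9, 0.7, 0.55, 0.3, 0.65, 0.52, 0.2, 0.1])

def precision_recall_at(threshold):
    """Precision and recall when predicting success for proba >= threshold."""
    y_pred = proba >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)

for threshold in (0.5, 0.6):
    p, r = precision_recall_at(threshold)
    print(threshold, round(p, 2), round(r, 2))
```

In this toy data, raising the threshold trades recall for precision; how favourable that trade is on real data depends on how the model's probabilities are distributed.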

## 6. Summary

### Conclusions
For this particular problem definition ("movie success") and dataset (from TMDB), the best classifier model was a **Random Forest**. 

This classifier produced the highest AUC and the highest precision of all the models; precision can be raised further, at the cost of a small reduction in recall, by adjusting the threshold.
- AUC  **0.87**
- Precision **0.7**

Confidence in the feature set and classifier model was supported by a regression model that predicted revenue with an **RMSE** of roughly **$81M**.  Automatic feature selection using PCA or KMeans clustering did not improve the model.

Because producing a major studio film is extremely expensive and a production house releases relatively few films, each positive prediction needs to be reliable: a false positive can mean a huge loss. For this reason a model with high precision is preferred over one with high recall.

### Recommendations
The majority of movies in this dataset with revenue and budget values are from the US.  This model could be used in other geographic markets, in particular for domestic releases, if values for revenue and budget were available.

Analysis of the feature set indicates that further model improvement is possible.  In particular, given the revenue variation by "genre", new features based on a *primary_genre* would be a next step:
- **revenue_mean_by_genre**
- **budget_mean_by_genre**
- **popularity_mean_by_genre**
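
A minimal sketch of how such a per-genre feature might be built is shown below. The frames and values are toy data, and the fallback to the overall mean for unseen genres is an assumption; the point is that the means are learned from the training rows only, so test revenue never leaks into the feature.

```python
import pandas as pd

train = pd.DataFrame({
    "primary_genre": ["drama", "drama", "action", "action"],
    "revenue": [10.0, 30.0, 100.0, 300.0],
})
test = pd.DataFrame({"primary_genre": ["action", "drama", "western"]})

# Learn the per-genre mean on the training rows only.
genre_means = train.groupby("primary_genre")["revenue"].mean()
overall_mean = train["revenue"].mean()

for frame in (train, test):
    # Genres unseen in training fall back to the overall training mean.
    frame["revenue_mean_by_genre"] = (
        frame["primary_genre"].map(genre_means).fillna(overall_mean)
    )

print(test)
```

The same pattern would apply to `budget_mean_by_genre` and `popularity_mean_by_genre` by swapping the aggregated column.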

Furthermore, the impact of directors and actors on movie success could be explored, by generating a rating formula based on the revenue of other movies in which they were involved (where genre should also be factored in):
- **director_rating_by_genre**
- **actor_rating_by_genre**