<a href="https://colab.research.google.com/github/itayse10/GoingToMovies/blob/master/going_to__the_movies_11_April_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning 3253 - Project Assignment



## Team Members
    
* Craig Barbisan
* Nisha Choondassery
* Mark Hubbard
* Itay Segal



## Project Overview

Using the "Movies Dataset" from Kaggle, this project aims to find the best-performing model for predicting a movie's rating.

## Notebook Overview

This notebook will explore the data, evaluate some models and draw conclusions.

It is divided into the following main sections:

1. Setup - setting up the Notebook environment.
2. Data - acquiring, exploring and processing the data.
3. Model - training and testing various models.
4. Analysis - analyzing the model results.
5. Summary - summarizing the observations and conclusions.

# 1.0 Setup

## Libraries

In [0]:
# import the basic libraries
import os
import numpy as np
import pandas as pd
import json
import ast
import tarfile

# make this notebook's output stable across runs
SEED = 42
np.random.seed(SEED)

# ensure full display for dataframe content
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
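# use fully qualified option names (short names can be ambiguous in newer pandas)
pd.set_option('display.precision', 5)
pd.set_option('display.large_repr', 'truncate')
pd.set_option('display.max_colwidth', None)  # None = no truncation (-1 is deprecated)
pd.set_option('display.colheader_justify', 'left')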

# enable basic plots with pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# enable advanced plots
import seaborn as sns
sns.set(style='darkgrid')
sns.set(font_scale=1.2)

In [0]:
# import sklearn libraries

# pipeline processing
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin

# data splitting
from sklearn.model_selection import train_test_split

# classifier models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# model evaluation
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

In [0]:
# suppress warnings
import warnings
# warnings.filterwarnings('ignore')

## Common Functions

### Exploratory Data Analysis Functions

In [0]:
# function: basic schema analysis
def quick_schema_analysis(df):
    print("Basic Schema Analysis for dataframe=" + df.name)
    print("************************************************")
    
    print("Rows and Columns:")
    print(df.shape)
    df.info()  # info() prints its report itself and returns None
    print("\n")
    
    print("Null Values - percentage:")
    print((1 - df.count()/len(df.index)) * 100)
    print("\n")
    
    print("Null Values - count:")
    print(df.isnull().sum())
    print("\n")
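
As a quick sanity check of the null-percentage expression used above, the sketch below runs it on a small made-up frame (the `toy` data is hypothetical, not taken from the movies dataset) and compares it with the shorter `isnull().mean()` form, which should give the same result.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame, not taken from the movies dataset
toy = pd.DataFrame({'budget': [100.0, np.nan, np.nan, 50.0],
                    'title': ['a', 'b', None, 'd']})

# expression used in quick_schema_analysis
pct_from_count = (1 - toy.count() / len(toy.index)) * 100

# shorter equivalent: the mean of a boolean mask is the share of True values
pct_from_isnull = toy.isnull().mean() * 100

print(pct_from_count)
print(pct_from_isnull)
```

Either form could be used; the `count()` version simply makes the "non-null rows over total rows" arithmetic explicit.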

In [0]:
# function: basic data analysis
def quick_data_analysis(df):
    print("Basic Data Analysis for dataframe=" + df.name)
    print(df.shape)

### Exploratory Data Analysis Plots

In [0]:
# function: plot a column vs column correlation map
def plot_data_correlation(data):
    sns.pairplot(data)

In [0]:
# function: plot a complete histogram for all columns
def plot_data_histograms(data):
    plt.figure()
    data.hist(bins=50, figsize=(20,15))
    plt.show()

In [0]:
# function: plot percentage/count of categoric feature values per target value
def plot_feature_vs_target(data, column_id, feature_name, target_name):

    plt.figure()
    plt.title(target_name + " by " + feature_name + " (Percent)")
    plt.xlabel(feature_name)
    ax = sns.barplot(x=column_id, y=column_id, data=data, estimator=lambda x: len(x) / len(data) * 100)
    ax.set(ylabel="Percent")
    plt.show()

    plt.figure()
    plt.title(target_name + " by " + feature_name + " (Count)")
    sns.countplot(x=column_id, hue=target_name, data=data, palette='RdBu')
    plt.show()

In [0]:
# function: plot a correlation heatmap
def plot_correlation(data):
    # correlations are only defined for numeric (and boolean) columns
    corrmat = data.select_dtypes(include=['number', 'bool']).corr()
    plt.figure(figsize=(20,20))

    # plot heat map
    sns.heatmap(corrmat, annot=True, cmap="RdYlGn")

### Model Evaluation Plots

#### Confusion Matrix Plots

In [0]:
# function: generate a confusion matrix
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

def generate_confusion_matrix(classifier, X, y, cv):
  
    # calculate the predicted values
    y_cvp = cross_val_predict(classifier, X, y, cv=cv)
    
    # calculate the confusion matrix
    cm = confusion_matrix(y, y_cvp)

    # unpacking into four values only works for a binary target
    tn, fp, fn, tp = cm.ravel()
    print("True  Negatives: {}".format(tn))
    print("False Positives: {}".format(fp))
    print("False Negatives: {}".format(fn))
    print("True  Positives: {}".format(tp))

    # plot the confusion matrix
    plot_confusion_matrix(cm, classes=[target_name_negative, 
                                       target_name_positive], 
                          normalize=False, title='Confusion Matrix')
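
To make the unpacking order of the confusion matrix concrete, here is a minimal sketch on hand-made labels (the `y_true` and `y_pred` lists are invented for illustration). For a binary target with labels 0 and 1, scikit-learn lays the matrix out as rows = true class and columns = predicted class, so `ravel()` yields the values in the order tn, fp, fn, tp.

```python
from sklearn.metrics import confusion_matrix

# invented binary labels: 0 = negative class, 1 = positive class
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)

# row 0 holds the true negatives class, row 1 the true positives class
tn, fp, fn, tp = cm.ravel()

print(cm)
print("tn={}, fp={}, fn={}, tp={}".format(tn, fp, fn, tp))
```

Note that this four-way unpacking assumes exactly two classes; with more classes `ravel()` returns more than four values.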

In [0]:
# function: plot a confusion matrix
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.figure(figsize=(7, 7))
    
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print('Confusion Matrix (with normalization)')
    else:
        print('Confusion Matrix (without normalization)')

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.show()
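
The `normalize=True` branch divides every row of the matrix by that row's total, so each row then shows the share of a true class that went to each predicted class. A small sketch with an invented 2x2 matrix:

```python
import numpy as np

# invented confusion matrix: rows = true label, columns = predicted label
cm = np.array([[8, 2],
               [1, 9]])

# same expression as in plot_confusion_matrix: divide each row by its sum
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

print(cm_normalized)
```

After normalization every row sums to 1, which makes classes of very different sizes easier to compare.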

#### Precision Recall Plots

In [0]:
# function: generate the Precision-Recall curves
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import precision_score, recall_score

def generate_precision_vs_recall(y, y_scores, y_pred=None):

    precisions, recalls, thresholds = precision_recall_curve(y, y_scores)

    plot_precision_recall_vs_threshold(precisions, recalls, thresholds)

    plot_precision_vs_recall(precisions, recalls)

    # precision and recall scores need hard class predictions, not scores;
    # pass them in explicitly instead of relying on notebook globals
    if y_pred is not None:
        print(precision_score(y, y_pred))
        print(recall_score(y, y_pred))

In [0]:
# function: plot the precision-recall-threshold curve
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.figure(figsize=(8, 4))
    plt.plot(thresholds, precisions[:-1], 'b--', label='Precision')
    plt.plot(thresholds, recalls[:-1], 'g-', label='Recall')
    plt.xlabel('Threshold')
    plt.legend(loc='best')
    plt.ylim([0, 1])
    plt.grid(True)
    plt.show()

In [0]:
# function: plot the precision vs recall curve
def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, 'b-', linewidth=2)
    plt.xlabel('Recall', fontsize=16)
    plt.ylabel('Precision', fontsize=16)
    plt.axis([0, 1, 0, 1])
    plt.show()

#### ROC Plots

In [0]:
# function: generate ROC curve
def generate_roc_curve(classifier, X, y, cv):
  
    # calculate probabilities
    y_probability_score = calculate_probability_score(classifier, X, y, cv)
    
    fpr, tpr, thresholds = roc_curve(y, y_probability_score)
    
    # plot the ROC curve
    plt.title('ROC Curve')

    plot_roc_curve(fpr, tpr,'Best Classifier')

    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    plt.legend(loc='lower right', fontsize=16)
    plt.show()
    
    print("AUC Score: {}".format(roc_auc_score(y, y_probability_score)))

In [0]:
# function: plot the ROC curve
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])   

In [0]:
# calculate probability score
from sklearn.model_selection import cross_val_predict

def calculate_probability_score(classifier, X, y, cv):

  clf_y_probas = cross_val_predict(classifier, X, y, cv=cv, 
                                         method="predict_proba")

  clf_y_scores = clf_y_probas[:, 1] # score = proba of positive class
  
  return clf_y_scores

### Model Scoring Functions

In [0]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())

### Data Functions

In [0]:
# class: DataFrameSelector transform (scikit doesn't support DataFrames yet)
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        # nothing to learn: the selector is stateless
        return self
    def transform(self, X):
        return X[self.attribute_names].copy()
    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

In [0]:
# function: download a file from Kaggle in chunks
import requests

# placeholders only; never store real credentials in a notebook
USERNAME = 'username'
PASSWORD = 'password'

def download_from_Kaggle(remote_file, local_file):

    # stream the response so the whole file is not held in memory
    r = requests.get(remote_file, auth=(USERNAME, PASSWORD), stream=True)
    r.raise_for_status()

    # read and write 512KB chunks at a time
    with open(local_file, 'wb') as f:
        for chunk in r.iter_content(chunk_size = 512 * 1024):
            if chunk:
                f.write(chunk)

In [0]:
# function: download an entire file from a public Web site
from urllib.request import urlretrieve

def download_from_Web(remote_file, local_file):
    urlretrieve(remote_file, local_file)

In [0]:
# function: download files from the internet and optionally unzip them
from zipfile import ZipFile

def fetch_data(url, file, local_path, zip=False):
    if not os.path.isdir(local_path):
        os.makedirs(local_path)
    
    remote_file = url
    local_file  = os.path.join(local_path, file)
   
    print(remote_file)
    
    if not os.path.isfile(local_file):
        print('Downloading ' + file + '...')
        
        download_from_Web(remote_file, local_file)

        print('Download complete.')
        
    else:
        print('Already downloaded.')

    if zip:
        unzip_file(local_file)


In [0]:
# function: unzip a file

def unzip_file(file):
 
    zip_path = file[:-4]
  
    if (os.path.isdir(zip_path)):
        print('Already extracted.')
    else:
        print('Extracting...')
        zfile = ZipFile(file, 'r')
        print(zfile.infolist())
        zfile.extractall(zip_path)
        zfile.close()
        print('Extraction complete.')

In [0]:
# convert json to abstract syntax trees (ast)

# use ast because json data has single quotes in the csvs
# which is invalid for a json object (should be double quotes)

def convert_json_to_ast(df, json_columns):
  
    for column in json_columns:
        df[column] = df[column].apply(lambda x: np.nan if pd.isnull(x)
                                                else ast.literal_eval(str(x)))
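
The reason for using `ast.literal_eval` rather than `json.loads` can be seen on a single value. The string below is an illustrative example written in the same single-quoted style as the CSV columns (it is not copied from the dataset): the JSON parser rejects it, while `ast.literal_eval` reads it as an ordinary Python list of dictionaries.

```python
import ast
import json

# illustrative value in the single-quoted style found in the CSV columns
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# strict JSON requires double quotes, so this parse fails
try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# literal_eval safely evaluates Python literals (lists, dicts, strings, numbers)
parsed = ast.literal_eval(raw)
names = [item['name'] for item in parsed]

print(json_ok)
print(names)
```

Strictly speaking the result is a plain Python object (a list of dicts), not a syntax tree; the `ast` module is only the tool used to parse it.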

In [0]:
# for each column, replace a parsed dictionary with the value of one of its fields (e.g. 'name')
def get_dict_val_from_ast(df,columns,field_name, fillna_str):
    for column in columns:
        df[column] = df[column].fillna(fillna_str) # first replace NaN with fillna_str
        df[column] = df[column].apply(lambda x: x[field_name] if isinstance(x, dict) else [])  #.apply(ast.literal_eval)

In [0]:
# for each column, replace a parsed list of dictionaries with the list of one field's values (e.g. 'name')
def get_list_val_from_ast(df,columns,field_name, fillna_str=None, new_col_dict=None):
    for column in columns:
        if(fillna_str):
            df[column] = df[column].fillna(fillna_str) # first replace NaN with fillna_str
        new_column = column
        if(new_col_dict and column in new_col_dict):
            new_column = new_col_dict[column]
        df[new_column] = df[column].apply(lambda x: [i[field_name] for i in x] if isinstance(x, list) else [])  #.apply(ast.literal_eval)

In [0]:
# convert to float
def cast_to_float(df, columns,d_cast=None):
    for column in columns:
        if(d_cast):
            df[column] = pd.to_numeric(df[column],errors='coerce',downcast=d_cast)
        else:
            df[column] = pd.to_numeric(df[column],errors='coerce')
                

In [0]:
# fill in empty values with mean
def replace_empty_with_mean(df, columns):
    for column in columns:
        df[column] = df[column].fillna(df[column].mean())

In [0]:
# review column results after data cleansing

def assess_column(df, column, categories=False):
    print(df[column].head(10))
    print('Number of null entries: ', df[column].isnull().sum())
    if categories:
      print(df[column].unique())

In [0]:
# display column stats
def column_stats(df,feature):
    print('Total count is {}'.format(len(df[feature].value_counts())))
    print(df[feature].value_counts())

In [0]:
from urllib.parse import urlparse

# get domain from url
def get_url_domain(url):
    if(not pd.isnull(url)):
        parsed_uri = urlparse(url )
        return '{uri.netloc}'.format(uri=parsed_uri)
    else:
        return url
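
A short sketch of what `urlparse` returns may help when reading the extracted domains. The URLs are made-up examples; the second one shows a caveat worth knowing: without a scheme such as `http://`, the standard library treats the whole string as a path and the domain comes back empty.

```python
from urllib.parse import urlparse

# made-up URL with a scheme: netloc holds the domain
with_scheme = urlparse('http://www.example.com/movies/some-title?ref=home')
domain = with_scheme.netloc

# made-up URL without a scheme: netloc is empty, everything lands in path
without_scheme = urlparse('www.example.com/movies/some-title')
missing_domain = without_scheme.netloc

print(repr(domain))
print(repr(missing_domain))
```

If homepage values without a scheme turn out to be common, they would need to be normalized before the domain is extracted.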

In [0]:
# list all unique values in a categorical column

# 2.0 Data 

## Data Acquisition

The dataset for this project is sourced from https://www.kaggle.com/rounakbanik/the-movies-dataset.

Credit: Rounak Banik

This dataset is an ensemble of data collected from TMDB and GroupLens.
* The Movie Details (i.e. Metadata), Credits and Keywords have been collected from the TMDB Open API.
* The Movie Links and Ratings have been obtained from the Official GroupLens website.

The following spreadsheets are used by this project:

* __movies_metadata.csv__: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.

* __credits.csv__: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.

* __keywords.csv__:  Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.

* __links.csv__: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.

* __ratings_small.csv__: The subset of 100,000 ratings from 700 users on 9,000 movies.

The Full MovieLens Dataset consisting of 26 million ratings and 750,000 tag applications from 270,000 users on all the 45,000 movies in this dataset can be accessed at https://grouplens.org/datasets/movielens/latest/

The dataset for this project is stored as a public file on Dropbox:
https://www.dropbox.com/s/89uv5kntgiolkno/the-movies-dataset.zip?dl=1

(Kaggle requires multiple manual steps by the notebook user, and Google Drive injects a virus-detection warning page for downloads of large files.)

In [0]:
# specify file names and locations
URL_DOMAIN    = 'https://www.dropbox.com'
URL_PATH      = '/s/89uv5kntgiolkno/the-movies-dataset.zip?dl=1'

G_URL_DOMAIN  = 'https://drive.google.com'
G_URL_PATH    = '/uc?export=download&confirm=no_antivirus&id=16zqahjyBrcdJYKBMK-zo2NbRHyyNHVQJ'

K_URL_DOMAIN2 = 'http://www.kaggle.com'
K_URL_PATH    = '/rounakbanik/the-movies-dataset/downloads/'

PROJECT_LOCAL_DIR  = 'movies/'
PROJECT_OUTPUT_DIR = '/content/movies/output'
PROJECT_FILE       = 'the-movies-dataset.zip'

In [0]:
# download the dataset file

url = URL_DOMAIN + URL_PATH

fetch_data(url, PROJECT_FILE, PROJECT_LOCAL_DIR, zip=True)

In [0]:
# download the text parsing file

url = URL_DOMAIN + '/s/svv8312fd9e7e9k/stopwords.txt?dl=1'

fetch_data(url, 'stopwords.txt', 'movies/resources', zip=False)

Create dataframes for each of the spreadsheets.

As part of loading the dataframes, convert null values to NaN on the fly (via pd.read_csv).

By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
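
The effect of extending the default NaN markers can be sketched on a tiny in-memory CSV (the three rows are invented for illustration). The extra markers `'no info'` and `'.'` match the `na_values` used when loading the project files, and `'NA'` is already covered by the pandas defaults.

```python
import io
import pandas as pd

# invented CSV content, held in memory instead of a file
sample_csv = io.StringIO(
    "title,runtime,status\n"
    "Movie A,90,Released\n"
    "Movie B,no info,.\n"
    "Movie C,NA,Released\n"
)

# every column is read as text; custom markers are added to the defaults
sample = pd.read_csv(sample_csv, dtype=str, na_values=['no info', '.'])

null_counts = sample.isnull().sum()
print(sample)
print(null_counts)
```

Both the custom markers and the default ones end up as NaN, so later null checks treat them uniformly.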

In [0]:
# initialize the dataframes (and convert null fields on the fly)

zip_path = os.path.join(PROJECT_LOCAL_DIR, PROJECT_FILE[:-4])


metadata_file = os.path.join(zip_path, 'movies_metadata.csv')
credits_file  = os.path.join(zip_path, 'credits.csv')
plot_file     = os.path.join(zip_path, 'keywords.csv')
links_file    = os.path.join(zip_path, 'links.csv')
ratings_file  = os.path.join(zip_path, 'ratings_small.csv')

# load the metadata dataframe

metadata = pd.read_csv(metadata_file,
                     dtype = 'unicode',
                     na_values = ['no info', '.']
                    )

# load the credits dataframe

credits = pd.read_csv(credits_file,
                      dtype = 'unicode',
                      na_values = ['no info', '.']
                     )

# load the plot dataframe

plot =  pd.read_csv(plot_file,
                    dtype = 'unicode',
                    na_values = ['no info', '.']
                   )

# load the links dataframe

links =  pd.read_csv(links_file,
                    dtype = 'unicode',
                    na_values = ['no info', '.']
                   )

# load the ratings dataframe

ratings = pd.read_csv(ratings_file,
                      dtype = 'unicode',
                      na_values = ['no info', '.']
                     )

dataframes = [metadata, credits, plot, links, ratings]

## Exploratory Data Analysis (EDA)

In [0]:
# make copies of data for EDA

metadata_copy = metadata.copy()
metadata_copy.name = 'metadata'

credits_copy = credits.copy()
credits_copy.name = 'credits'

plot_copy = plot.copy()
plot_copy.name = 'plot'

links_copy = links.copy()
links_copy.name = 'links'



eda_dataframes = [metadata_copy,
                  credits_copy,
                  plot_copy,
                  links_copy
                  
                 ]


### Explore the "metadata" data

__Features__

* __adult__: Indicates if the movie is X-Rated or Adult.
* __belongs_to_collection__: A stringified dictionary that gives information on the movie series the particular film belongs to.
* __budget__: The budget of the movie in dollars.
* __genres__: A stringified list of dictionaries that list out all the genres associated with the movie.
* __homepage__: The Official Homepage of the movie.
* __id__: The ID of the movie.
* __imdb_id__: The IMDB ID of the movie.
* __original_language__: The language in which the movie was originally shot.
* __original_title__: The original title of the movie.
* __overview__: A brief blurb of the movie.
* __popularity__: The Popularity Score assigned by TMDB.
* __poster_path__: The URL of the poster image.
* __production_companies__: A stringified list of production companies involved with the making of the movie.
* __production_countries__: A stringified list of countries where the movie was shot/produced.
* __release_date__: Theatrical Release Date of the movie.
* __revenue__: The total revenue of the movie in dollars.
* __runtime__: The runtime of the movie in minutes.
* __spoken_languages__: A stringified list of spoken languages in the film.
* __status__: The status of the movie (Released, To Be Released, Announced, etc.)
* __tagline__: The tagline of the movie.
* __title__: The Official Title of the movie.
* __video__: Indicates if there is a video present of the movie with TMDB.
* __vote_average__: The average rating of the movie.
* __vote_count__: The number of votes by users, as counted by TMDB.

In [0]:
# quick review of the metadata data
quick_schema_analysis(metadata_copy)
quick_data_analysis(metadata_copy)

metadata_copy.head(5)

In [0]:
# check if there are any duplicates

print(metadata_copy.shape)
print(metadata_copy.drop_duplicates().shape)

In [0]:
# plot a heatmap of nulls
sns.heatmap(metadata_copy.isnull(), yticklabels = False, cbar = False, cmap = 'viridis')

### Explore the "credits" data

In [0]:
# quick review of the credits data

quick_schema_analysis(credits_copy)
quick_data_analysis(credits_copy)

credits_copy.head(3)

### Explore the "plot" data

In [0]:
# quick review of the plot data

quick_schema_analysis(plot_copy)
quick_data_analysis(plot_copy)

plot_copy.head(5)

### Explore the "links" data

In [0]:
# quick review of the links data

quick_schema_analysis(links_copy)
quick_data_analysis(links_copy)

links_copy.head(5)

## Data Cleansing

### Clean the metadata data

In [0]:
metadata.head(1)

In [0]:
metadata['id'].isnull().sum()

Perform the following type conversions:
* Convert __release_date__ to datetime
* Convert __budget__ and __revenue__ to numerics
* Parse all stringified JSON fields into Python objects (using `ast.literal_eval`)
* Convert __vote_average__ to float
* Convert __vote_count__ to integer
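
The conversions listed above rely on `errors='coerce'`, which turns values that cannot be parsed into NaT or NaN instead of raising an exception. A minimal sketch on invented values (the `raw` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# invented raw values, all stored as text like the loaded CSV columns
raw = pd.DataFrame({
    'release_date': ['1995-10-30', 'not a date', None],
    'budget': ['30000000', '0', 'abc'],
})

# unparseable dates become NaT
raw['release_date'] = pd.to_datetime(raw['release_date'], errors='coerce')

# unparseable numbers become NaN; a budget of 0 is also treated as missing
raw['budget'] = pd.to_numeric(raw['budget'], errors='coerce').replace(0, np.nan)

print(raw)
print(raw.dtypes)
```

The trade-off is that bad values disappear silently, so it is worth checking the null counts after each conversion.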

In [0]:
# convert each item of release_date to a datetime type entity

metadata['release_date'] = pd.to_datetime(metadata['release_date'],
                                          errors='coerce')

# metadata['release_date'] = metadata['release_date'].fillna('?')

In [0]:
assess_column(metadata, 'release_date')

In [0]:
# convert budget and revenue fields to be numeric
# (...and convert 0 to a NaN to enable budget and revenue math)

metadata['budget']  = pd.to_numeric(metadata['budget'],  errors='coerce')
metadata['revenue'] = pd.to_numeric(metadata['revenue'], errors='coerce')

metadata['budget']  = metadata['budget'].replace(0, np.nan)
metadata['revenue'] = metadata['revenue'].replace(0, np.nan)

In [0]:
assess_column(metadata, 'budget')
assess_column(metadata, 'revenue')

In [0]:
# convert json columns to abstract syntax trees

json_columns = ['belongs_to_collection',
                'genres',
                'production_companies',
                'production_countries',
                'spoken_languages'
               ]

convert_json_to_ast(metadata, json_columns)

In [0]:
# extract the collection name from the belongs_to_collection dictionary
get_dict_val_from_ast(metadata, ['belongs_to_collection'],'name','')

In [0]:
assess_column(metadata, 'belongs_to_collection')

In [0]:
get_list_val_from_ast(metadata, ['genres'],'name','Other')

In [0]:
assess_column(metadata, 'genres')

In [0]:
get_list_val_from_ast(metadata, ['production_companies'],'name','')

In [0]:
assess_column(metadata, 'production_companies')

In [0]:
get_list_val_from_ast(metadata, ['production_countries'],'iso_3166_1','Other')

In [0]:
assess_column(metadata, 'production_countries')

In [0]:
cast_to_float(metadata, ['runtime'])

In [0]:
# replaced by the previous cell
# convert runtime to float

metadata['runtime'] = pd.to_numeric(metadata['runtime'],
                                       errors='coerce'
                                      )

In [0]:
assess_column(metadata, 'runtime')

In [0]:
get_list_val_from_ast(metadata, ['spoken_languages'],'name','Other')

In [0]:
assess_column(metadata, 'spoken_languages')

In [0]:
cast_to_float(metadata, ['vote_average'])

replace_empty_with_mean(metadata, ['vote_average'])

In [0]:
assess_column(metadata, 'vote_average')

In [0]:
cast_to_float(metadata, ['vote_count'],'integer')

In [0]:
metadata['vote_count'] = metadata['vote_count'].fillna(0)

In [0]:
assess_column(metadata, 'vote_count')

### Clean the credits data

In [0]:
credits.head(1)

In [0]:
# convert json columns to abstract syntax trees

json_columns = ['cast', 'crew']
    
convert_json_to_ast(credits, json_columns)

In [0]:
# replace missing cast and crew values with an empty string

credits['cast'] = (credits['cast'].fillna('')
#                                 .apply(ast.literal_eval)
                  )
 
credits['crew'] = (credits['crew'].fillna('')
#                                 .apply(ast.literal_eval)
                  )

In [0]:
assess_column(credits, 'cast')
assess_column(credits, 'crew')

### Clean the plot data

In [0]:
plot.head(1)

In [0]:
# convert json columns to abstract syntax trees

json_columns = ['keywords']

convert_json_to_ast(plot, json_columns)

In [0]:
get_list_val_from_ast(plot, ['keywords'],'name','')

In [0]:
assess_column(plot, 'keywords')

## Data Merge

In [0]:
# compare the shapes of all the dataframes

print(metadata.shape)
print(credits.shape)
print(plot.shape)
print(links.shape)



In [0]:
# initialize the merged dataframe "movies"
movies = metadata.copy()
print(movies.shape)

### Merge metadata and credits

In [0]:
# perform an inner join (the pandas merge default) of credits to movies on id
# (adds the credits columns, including cast and crew)
movies = movies.merge(credits, on=["id"])
print(movies.shape)
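
Because `merge` defaults to `how='inner'`, rows whose `id` is missing from the other frame are dropped, which is why the shape is printed after each merge. The sketch below uses two invented frames to contrast the default with an explicit left join:

```python
import pandas as pd

# invented frames; ids are strings, as in the loaded CSVs
left = pd.DataFrame({'id': ['1', '2', '3'], 'title': ['A', 'B', 'C']})
right = pd.DataFrame({'id': ['1', '3', '4'], 'cast': ['x', 'y', 'z']})

# default: inner join keeps only ids present in both frames
inner = left.merge(right, on=['id'])

# left join keeps every row of the left frame and fills gaps with NaN
left_join = left.merge(right, on=['id'], how='left')

print(inner.shape, left_join.shape)
```

Passing `how='left'` would be the option to use if every movie in the metadata should be kept even when it has no credits.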

In [0]:
movies.drop(['crew','cast'],axis=1).head(5)

### Merge plot

In [0]:
movies = movies.merge(plot, on=['id'])
print(movies.shape)

In [0]:
movies.drop(['crew','cast'],axis=1).head(5)

### Review movies Dataframe

In [0]:
# explore new movies dataframe

movies.name = 'movies'

quick_schema_analysis(movies)
quick_data_analysis(movies)

movies.head(5)

# 3.0 Features

__Feature Engineering__ - create new features to improve the model
* __Feature Extractions__ - derive new features from a single existing feature
* __Feature Aggregations__ - derive new features by combining multiple features (columns) or spanning samples (rows)
* __Feature Transformations__ - improve the form of existing features to improve the model

__Feature Selection__ - prune features to optimize the model

## Feature Engineering

### Feature Extractions

Derive new features from a single existing feature.
1. actors - from cast
2. director - from crew
3. cast_size - from cast
4. crew_size - from crew
5. franchise - from belongs_to_collection
6. season - from release_date
7. release_year - from release_date
8. homepage_domain - from homepage

#### New Feature: actors

In [0]:
# create actors columns based on cast
get_list_val_from_ast(df=movies, columns=['cast'],field_name='name',new_col_dict = {'cast' : 'actors'})

In [0]:
assess_column(movies, 'actors')

#### New Feature: director

In [0]:
movies['cast']=movies['cast'].fillna('')
movies['crew']=movies['crew'].fillna('')

In [0]:
# extract the director from the crew field

def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return 'Unknown'
  
movies['director'] = movies['crew'].apply(get_director)

In [0]:
assess_column(movies, 'director')

#### New Features: cast_size and crew_size

In [0]:
# populate cast and crew size based on # of items in the cast and crew fields

movies['cast_size'] = movies['cast'].apply(lambda x: len(x))
movies['crew_size'] = movies['crew'].apply(lambda x: len(x))

In [0]:
assess_column(movies, 'cast_size')
assess_column(movies, 'crew_size')

#### New Feature: franchise

In [0]:
# populate as a boolean depending on whether belongs_to_collection is empty

movies['belongs_to_collection'] = movies['belongs_to_collection'].fillna('')

movies['franchise'] = (movies['belongs_to_collection']
                       .apply(lambda x: len(x)>0)
                      )

In [0]:
assess_column(movies, 'franchise')

#### New Feature: season

In [0]:
# calculate the season of the release (spring, summer, fall, winter)
def season_of_date(date):

    if pd.isnull(date):
        return 'unknown'

    # compare (month, day) tuples instead of building date ranges per row
    month_day = (date.month, date.day)
    if (3, 21) <= month_day <= (6, 20):
        return 'spring'
    if (6, 21) <= month_day <= (9, 22):
        return 'summer'
    if (9, 23) <= month_day <= (12, 20):
        return 'autumn'
    return 'winter'

# create a new column    
movies['season'] = (movies['release_date']
                          .fillna(pd.NaT)
                          .apply(lambda x: season_of_date(x))
                   )

In [0]:
assess_column(movies, 'season')

#### New Feature: release_year

In [0]:
# get the year the movie was released
def year_of_date(date):
    if pd.isnull(date):
      return -1
    year = date.year
    return year

# create a new column    
movies['release_year'] = (movies['release_date']
                          .fillna(pd.NaT)
                          .apply(lambda x: year_of_date(x))
                   )

#### New Feature: homepage_domain

In [0]:
# extract the domain from the homepage url
movies['homepage_domain'] = movies['homepage'].apply(get_url_domain)

### Feature Aggregations

Derive new features by combining existing features.
1. weighted_rating
2. revenue_to_budget_ratio

#### New Feature: weighted_rating

Weighted Rating (WR):

$$WR = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

where,

* v is the number of votes for the movie
* m is the minimum votes required to be listed in the chart
* R is the average rating of the movie
* C is the mean vote across the whole report

Source: https://www.kaggle.com/rounakbanik/movie-recommender-systems
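
A worked example may make the behaviour of the formula clearer. The helper name `wr_example` and the numbers (m = 50, C = 6.0, a rating of 9.0) are invented for illustration and are not the values computed from the dataset. With few votes the result is pulled toward the overall mean C; with many votes it stays close to the movie's own rating R.

```python
def wr_example(v, R, m, C):
    """Weighted rating: blend of the movie's rating R and the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# invented parameters, for illustration only
m = 50     # minimum votes threshold
C = 6.0    # mean vote across all movies

few_votes = wr_example(v=5, R=9.0, m=m, C=C)      # dominated by C
many_votes = wr_example(v=5000, R=9.0, m=m, C=C)  # dominated by R

print(round(few_votes, 3), round(many_votes, 3))
```

The two weights always sum to 1, so the weighted rating lies between C and R.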

In [0]:
# add a weighted rating feature

vote_averages = (movies[movies['vote_average']
                 .notnull()]['vote_average'].astype('int')
                )

vote_counts = (movies[movies['vote_count']
               .notnull()]['vote_count'].astype('int')
              )


C = vote_averages.mean()

m = vote_counts.quantile(0.75)

def weighted_rating(x):
    v = x['vote_count']+1
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

movies['weighted_rating'] = movies.apply(weighted_rating, axis=1)

In [0]:
assess_column(movies, 'weighted_rating')

#### New Feature: revenue_to_budget_ratio

In [0]:
# this new feature can indicate "success" or "failure"
# (depending on whether the ratio is > 1 or < 1)

movies['revenue_to_budget_ratio'] = movies['revenue'] / movies['budget']

In [0]:
assess_column(movies, 'revenue_to_budget_ratio')

### Feature Transformations

Change existing features such that they can contribute more efficiently and effectively during machine learning.
* one hot encoding transformation of categoric features
* scaling transformation of numeric features
* vectorizing paragraphs
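
The first two transformations in the list can be sketched with scikit-learn's `ColumnTransformer`, which applies a different transformer to each group of columns. This is only one possible arrangement, shown on an invented frame; the column names `runtime` and `season` are borrowed from this notebook, but the values are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# invented sample with one numeric and one categoric feature
toy = pd.DataFrame({'runtime': [90.0, 120.0, 150.0],
                    'season': ['winter', 'summer', 'winter']})

preprocess = ColumnTransformer([
    # scale numeric features to zero mean and unit variance
    ('num', StandardScaler(), ['runtime']),
    # expand each category into its own 0/1 column
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['season']),
])

features = preprocess.fit_transform(toy)
print(features.shape)
```

The result has one scaled column plus one column per distinct season value; `handle_unknown='ignore'` makes the encoder output all zeros for a category it did not see during fitting.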


In [0]:
# examine the movies data set without the "large" textual features
movies.drop(['cast','crew','actors','overview'],axis=1).head(20)

In [0]:
# examine the movies data set for the "large" textual features
movies[['cast','crew','actors','overview']].head(5)

In [0]:
# check the numeric columns
movies.describe()

#### Textual and categorical features

In [0]:
# Import required libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

In [0]:
## load stop words file

def get_stop_words(stop_file_path):
    """load stop words """
    
    with open(stop_file_path, 'r', encoding="utf-8") as f:
        # one stop word per line; de-duplicate and return a list,
        # the form documented for CountVectorizer's stop_words parameter
        return sorted(set(line.strip() for line in f))

#load a set of stop words
stopwords=get_stop_words("movies/resources/stopwords.txt")

In [0]:
# title
# Vectorizing the titles - using word counts
count_vectorizer = CountVectorizer(max_features=1000, binary=True, max_df=0.8,stop_words=stopwords) 
titles = movies["title"].replace(np.nan,'')
titles_transformed = count_vectorizer.fit_transform(titles)

In [0]:
# Examine the vectorized titles
idx_to_word = np.array(count_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
title_titles = ['title_' + title for title in idx_to_word]
titles_df = pd.DataFrame(titles_transformed.toarray(),columns=title_titles)
titles_df.shape

In [0]:
## reduce the encoded features using a threshold
def reduce_sparse_encoding(df,threshold):
    headers_list = list(df.columns.values)
    num_samples = df.shape[0]

    for header in headers_list:
        col_true_count = len(df[df[header] == 1])
        if col_true_count/num_samples < threshold:
            df = df.drop(header, axis=1)
    
    return df

In [0]:
# reduce the features: drop columns where fewer than 0.1% of rows are 1
titles_df = reduce_sparse_encoding(titles_df,0.001)
titles_df.shape
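
For wide encodings, dropping columns one at a time can be slow, since each `drop` builds a new frame. A vectorized variant of the same threshold rule might look like the sketch below. `reduce_sparse_encoding_fast` is a hypothetical name, not part of the notebook, and like the original it only tallies cells that are exactly 1.

```python
import pandas as pd

def reduce_sparse_encoding_fast(df, threshold):
    """Keep only columns whose share of rows equal to 1 is at least `threshold`."""
    share_of_ones = (df == 1).mean(axis=0)   # fraction of rows equal to 1, per column
    return df.loc[:, share_of_ones >= threshold]

# Toy encoding: 'common' is set in 3 of 4 rows, 'rare' in 1 of 4, 'never' in none
toy = pd.DataFrame({
    "common": [1, 1, 0, 1],
    "rare": [0, 0, 0, 1],
    "never": [0, 0, 0, 0],
})
reduced = reduce_sparse_encoding_fast(toy, threshold=0.5)
print(list(reduced.columns))
```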

In [0]:
# belongs_to_collection
# Replace empty lists with an empty string
collections = [''.join(collection).lower().replace('collection','').replace('series','') 
               for collection in movies['belongs_to_collection'].values]
# Vectorizing the collection names - using word counts
count_vectorizer = CountVectorizer(max_features=100, stop_words=stopwords) 
collections_transformed = count_vectorizer.fit_transform(collections)

In [0]:
# Examine the vectorized collections
idx_to_word = np.array(count_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# create a DataFrame
collection_titles = ['coll_' + title for title in idx_to_word]
collections_df = pd.DataFrame(collections_transformed.toarray(),columns=collection_titles)
collections_df.shape

In [0]:
# reduce the features using 0.1% of 1
collections_df = reduce_sparse_encoding(collections_df,0.001)
collections_df.shape

In [0]:
# keywords
# Prepare the keyword strings (a new vectorizer with the same settings as for collections is created below)
keywords = [''.join(keyword).lower() 
               for keyword in movies['keywords'].values]

# Vectorizing the keywords - using word counts
count_vectorizer = CountVectorizer(max_features=100, stop_words=stopwords) 
keywords_transformed = count_vectorizer.fit_transform(keywords)

In [0]:
# Examine the vectorized keywords
idx_to_word = np.array(count_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# create a DataFrame
keyword_titles = ['key_' + title for title in idx_to_word]
keywords_df = pd.DataFrame(keywords_transformed.toarray(),columns=keyword_titles)
keywords_df.shape

In [0]:
# reduce the features using 0.1% of 1
keywords_df = reduce_sparse_encoding(keywords_df,0.001)
keywords_df.shape

In [0]:
# tagline
# Vectorizing tagline - using tf-idf weighted term-document matrix
tfidf_vectorizer = TfidfVectorizer(max_features=1000, max_df=0.8,stop_words=stopwords) 
taglines = movies['tagline'].replace(np.nan,'')
taglines_transformed = tfidf_vectorizer.fit_transform(taglines)

In [0]:
# Examine the vectorized taglines
idx_to_word = np.array(tfidf_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# apply NMF
nmf = NMF(n_components=100, solver="mu")
taglines_nmf = nmf.fit_transform(taglines_transformed)

In [0]:
# Create a DataFrame for tagline
tagline_titles = nmf.components_
tagline_titles = ['tagline_' + '_'.join(idx_to_word[title.argsort()[-10:]]) for title in tagline_titles]
tagline_df = pd.DataFrame(taglines_nmf,columns=tagline_titles)
tagline_df.shape

In [0]:
## reduce the weighted features using a threshold
def reduce_sparse_weights(df,threshold):
    headers_list = list(df.columns.values)

    for header in headers_list:
        col_weight_sum = df[header].sum()
        if col_weight_sum < threshold:
            df = df.drop(header, axis=1)
    
    return df

In [0]:
# reduce the features using sum weights of 1
tagline_df = reduce_sparse_weights(tagline_df,1)
tagline_df.shape

In [0]:
# overview
# Vectorizing overview - using tf-idf weighted term-document matrix 
overviews = movies['overview'].replace(np.nan,'')
tfidf_vectorizer = TfidfVectorizer(max_features=1000, max_df=0.8,stop_words=stopwords) 
overviews_transformed = tfidf_vectorizer.fit_transform(overviews)

In [0]:
# Examine the vectorized overviews
idx_to_word = np.array(tfidf_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# apply NMF : re-use the nmf component 
overviews_nmf = nmf.fit_transform(overviews_transformed)

In [0]:
# Create a DataFrame for overview
overview_titles = nmf.components_
overview_titles = ['overview_' + '_'.join(idx_to_word[title.argsort()[-10:]]) for title in overview_titles]
overview_df = pd.DataFrame(overviews_nmf,columns=overview_titles)
overview_df.shape

In [0]:
# reduce the features using sum weights of 1
overview_df = reduce_sparse_weights(overview_df,1)
overview_df.shape

In [0]:
# spoken_languages
# Vectorizing spoken languages - using word counts with a new default vectorizer
count_vectorizer = CountVectorizer()
spoken_laguages = [','.join(lang).lower() 
               for lang in movies['spoken_languages'].values]
spoken_laguages_transformed = count_vectorizer.fit_transform(spoken_laguages)

In [0]:
# Examine the vectorized spoken languages
idx_to_word = np.array(count_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# create a DataFrame
spoken_languages_titles = ['s_lang_' + lang for lang in idx_to_word]
spoken_languages_df = pd.DataFrame(spoken_laguages_transformed.toarray(),columns=spoken_languages_titles)
spoken_languages_df.shape

In [0]:
# reduce the features using 0.1% of 1
spoken_languages_df = reduce_sparse_encoding(spoken_languages_df,0.001)
spoken_languages_df.shape

In [0]:
# original_language
# Vectorizing original language - using word counts with a new default vectorizer
count_vectorizer = CountVectorizer()
original_languages = movies['original_language'].replace(np.nan,'')
original_languages_transformed = count_vectorizer.fit_transform(original_languages)

In [0]:
# Examine the vectorized original languages
idx_to_word = np.array(count_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# create a DataFrame
original_languages_titles = ['o_lang_' + lang for lang in idx_to_word]
original_languages_df = pd.DataFrame(original_languages_transformed.toarray(),columns=original_languages_titles)
original_languages_df.shape

In [0]:
# reduce the features using 0.1% of 1
original_languages_df = reduce_sparse_encoding(original_languages_df,0.001)
original_languages_df.shape

In [0]:
# homepage_domain
# Vectorizing homepage domain - using ngram vectorizing (to select all 3 parts of the url)
ngram_vectorizer = CountVectorizer(max_features=100, binary=True, max_df=0.8,stop_words=stopwords,ngram_range=(3, 3)) 
homepage_domains = movies['homepage_domain'].replace(np.nan,'')
homepage_domains_transformed = ngram_vectorizer.fit_transform(homepage_domains)

In [0]:
# Examine the vectorized homepage domains
idx_to_word = np.array(ngram_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# create a DataFrame
homepage_domain_titles = ['home_'+ url.replace(' ','.') for url in idx_to_word]
homepage_domains_df = pd.DataFrame(homepage_domains_transformed.toarray(),columns=homepage_domain_titles)
homepage_domains_df.shape

In [0]:
# reduce the features using 0.1% of 1
homepage_domains_df = reduce_sparse_encoding(homepage_domains_df,0.001)
homepage_domains_df.shape

In [0]:
# genres
# Prepare the data in the column
genres = [','.join(gen) for gen in movies['genres'].values]

# Vectorizing genres 
count_vectorizer = CountVectorizer()
genres_transformed = count_vectorizer.fit_transform(genres)

In [0]:
# Examine the vectorized genres
idx_to_word = np.array(count_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# Set titles and define new DataFrame
genre_titles = ['gen_' + gen for gen in idx_to_word]
genres_df = pd.DataFrame(genres_transformed.toarray(),columns=genre_titles)
genres_df.shape

In [0]:
# reduce the features using 0.1% of 1
genres_df = reduce_sparse_encoding(genres_df,0.001)
genres_df.shape

In [0]:
# production_companies

# Prepare the data in the column
production_companies = [','.join(prod).strip().replace(' ','_') 
               for prod in movies['production_companies'].values]

# Vectorizing production companies 
count_vectorizer = CountVectorizer(max_features=1000)
production_companies_transformed = count_vectorizer.fit_transform(production_companies)

In [0]:
# Examine the vectorized production companies
idx_to_word = np.array(count_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# Set titles and define new DataFrame
production_companies_titles = ['prod_' + prod for prod in idx_to_word]
production_companies_df = pd.DataFrame(production_companies_transformed.toarray(),columns=production_companies_titles)
production_companies_df.shape

In [0]:
# reduce the features using 0.1% of 1
production_companies_df = reduce_sparse_encoding(production_companies_df,0.001)
production_companies_df.shape

In [0]:
# production_countries

# Vectorizing production countries 
count_vectorizer = CountVectorizer() 
production_countries = [','.join(country).lower() 
               for country in movies['production_countries'].values]
production_countries_transformed = count_vectorizer.fit_transform(production_countries)

In [0]:
# Examine the vectorized production countries
idx_to_word = np.array(count_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# create a DataFrame
production_countries_titles = ['country_' + country for country in idx_to_word]
production_countries_df = pd.DataFrame(production_countries_transformed.toarray(),columns=production_countries_titles)
production_countries_df.shape

In [0]:
# reduce the features using 0.1% of 1
production_countries_df = reduce_sparse_encoding(production_countries_df,0.001)
production_countries_df.shape

In [0]:
# director
# Prepare the data in the column
directors = [director.strip().replace(' ','_') 
               for director in movies['director'].values]

# Vectorizing directors
count_vectorizer = CountVectorizer(max_features=100) 
directors_transformed = count_vectorizer.fit_transform(directors)

In [0]:
# Examine the vectorized directors
idx_to_word = np.array(count_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# Set titles and define new DataFrame
director_titles = ['director_' + director for director in idx_to_word]
directors_df = pd.DataFrame(directors_transformed.toarray(),columns=director_titles)
directors_df.shape

In [0]:
# reduce the features using 0.1% of 1
directors_df = reduce_sparse_encoding(directors_df,0.001)
directors_df.shape

In [0]:
# actors

# Prepare the data in the column
actors = [','.join(actor).strip().replace(' ','_') 
               for actor in movies['actors'].values]

# Vectorizing actors
count_vectorizer = CountVectorizer(max_features=1000) 
actors_transformed = count_vectorizer.fit_transform(actors)

In [0]:
# Examine the vectorized actors
idx_to_word = np.array(count_vectorizer.get_feature_names())
print(idx_to_word)

In [0]:
# Set titles and define new DataFrame
actor_titles = ['actor_' + actor for actor in idx_to_word]
actors_df = pd.DataFrame(actors_transformed.toarray(),columns=actor_titles)
actors_df.shape

In [0]:
# reduce the features using 0.1% of 1
actors_df = reduce_sparse_encoding(actors_df,0.001)
actors_df.shape

In [0]:
# one-hot encoding for categorical features

# season
season_df = pd.get_dummies(movies[['season']],prefix='season')
season_df.shape

In [0]:
# reduce the features using 0.1% of 1
season_df = reduce_sparse_encoding(season_df,0.001)
season_df.shape

In [0]:
# status
status_df = pd.get_dummies(movies[['status']],prefix='status')
status_df.shape

In [0]:
# reduce the features using 0.1% of 1
status_df = reduce_sparse_encoding(status_df,0.001)
status_df.shape

In [0]:
# franchise
franchise_df = pd.get_dummies(movies[['franchise']],prefix='fran')
franchise_df.shape

In [0]:
# check the stats - no need to remove encoded features
column_stats(movies,'franchise')

In [0]:
movies.columns

## Feature Selection



In [0]:
# create a copy of original movies data frame
movies_reduced = movies.copy() 
# drop columns with no added value to the problem space
movies_reduced = movies_reduced.drop(columns = 'poster_path')


In [0]:
# case for dropping external IMDB identifiers

print('MOVIES: missing imdb ids in main data:',
      movies_reduced['imdb_id'].isnull().sum())

# print('LINKS: data missing Imdb ids:',
#       movies_reduced["imdb_id"].isnull().sum())

# What's tmdbId?
#print('LINKS: data missing Tmdb ids:'
#      ,movies_reduced["tmdbId"].isnull().sum())

# print('# of records where movies.id and links.movieId match:', 
#       movies_reduced.drop(['imdb_id'], axis=1)
#             .merge(links, left_on='id', right_on='movieId', how='inner').shape
#      )

# print('# of records where movies.imdb_id and links.imdbId match:', 
#       movies_reduced.merge(links, left_on='imdb_id',
#                           right_on='imdb_id',
#                           how='inner').shape
#      )

In [0]:
# drop id columns (since integrity of these fields is poor)
movies_reduced = movies_reduced.drop(columns = 'imdb_id')

In [0]:
# check adult feature
column_stats(movies_reduced,'adult')

In [0]:
# check video feature
column_stats(movies_reduced,'video')

In [0]:
# drop additional columns
movies_reduced = movies_reduced.drop(columns = 'adult')  # as it has only 9 true values
movies_reduced = movies_reduced.drop(columns = 'homepage')  #replaced by homepage_domain
movies_reduced = movies_reduced.drop(columns = 'video')  # as it has only 95 true values
movies_reduced = movies_reduced.drop(columns = 'cast')  # parsed into actors
movies_reduced = movies_reduced.drop(columns = 'crew')  # partially expressed in director feature
movies_reduced = movies_reduced.drop(columns = 'release_date')  # can be replaced with season and release_year

In [0]:
movies_reduced.info()

In [0]:
# case for dropping original_title
compare = ['title', 'original_title']
movies_reduced[movies_reduced['original_title'] != movies_reduced['title']][compare].head(5)

In [0]:
# drop original_title (it appears to be the untranslated version of the title)
movies_reduced = movies_reduced.drop(columns='original_title')

In [0]:
# remove vectorized and encoded features (to be merged later)
movies_reduced = movies_reduced.drop(columns = 'belongs_to_collection') # to be replaced with collections_df
movies_reduced = movies_reduced.drop(columns = 'genres') # to be replaced with genres_df
movies_reduced = movies_reduced.drop(columns = 'title') # to be replaced with titles_df
movies_reduced = movies_reduced.drop(columns = 'keywords') # to be replaced with keywords_df
movies_reduced = movies_reduced.drop(columns = 'tagline') # to be replaced with tagline_df
movies_reduced = movies_reduced.drop(columns = 'overview') # to be replaced with overview_df
movies_reduced = movies_reduced.drop(columns = 'spoken_languages') # to be replaced with spoken_languages_df
movies_reduced = movies_reduced.drop(columns = 'original_language') # to be replaced with original_languages_df
movies_reduced = movies_reduced.drop(columns = 'homepage_domain') # to be replaced with homepage_domains_df
movies_reduced = movies_reduced.drop(columns = 'production_companies') # to be replaced with production_companies_df
movies_reduced = movies_reduced.drop(columns = 'production_countries') # to be replaced with production_countries_df
movies_reduced = movies_reduced.drop(columns = 'director') # to be replaced by directors_df
movies_reduced = movies_reduced.drop(columns = 'actors') # to be replaced with actors_df
movies_reduced = movies_reduced.drop(columns = 'season') # to be replaced with season_df
movies_reduced = movies_reduced.drop(columns = 'status') # to be replaced with status_df
movies_reduced = movies_reduced.drop(columns = 'franchise') # to be replaced with franchise_df


In [0]:
movies_reduced.head(5)

#### Analyze Correlation

In [0]:
# Examine rating vs numerical features

from pandas.plotting import scatter_matrix

attributes = ["budget", "popularity","revenue","runtime","vote_average","vote_count","cast_size","crew_size","revenue_to_budget_ratio","release_year","weighted_rating"]
scatter_matrix(movies_reduced[attributes], figsize=(15, 15))
plt.show()

* Most of the features are correlated to some extent
* revenue_to_budget_ratio and release_year don't seem to be correlated with anything - probably can be removed
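
Scatter plots give a visual impression only; the same question can be checked numerically with Pearson correlations. The sketch below uses synthetic data and made-up column names purely to show the call pattern. On the notebook's data, the equivalent call would be applied to `movies_reduced[attributes]`.

```python
import numpy as np
import pandas as pd

# Synthetic data (not from the Movies Dataset)
rng = np.random.default_rng(0)
n = 200
budget = rng.uniform(1, 100, size=n)
toy = pd.DataFrame({
    "budget": budget,
    "revenue": 3 * budget + rng.normal(0, 10, size=n),   # strongly tied to budget
    "noise": rng.normal(0, 1, size=n),                   # unrelated feature
})

# Pearson correlation of every feature with the target column
corr_with_revenue = toy.corr()["revenue"].drop("revenue")

# Rank features by absolute correlation, strongest first
print(corr_with_revenue.abs().sort_values(ascending=False))
```

Features whose absolute correlation with the target stays near zero are candidates for removal, subject to the usual caveat that Pearson correlation only captures linear relationships.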

In [0]:
# remove features that don't have meaningful correlation with the revenue
movies_reduced = movies_reduced.drop(columns = 'release_year') 
movies_reduced = movies_reduced.drop(columns = 'revenue_to_budget_ratio') 

In [0]:
movies_reduced.info()

#### Adding the encoded data sets

In [0]:
movies_full = movies_reduced.copy()
movies_full.shape

In [0]:
movies_full = movies_full.join(titles_df)
movies_full.shape

In [0]:
movies_full = movies_full.join(collections_df)
movies_full.shape

In [0]:
movies_full = movies_full.join(keywords_df)
movies_full.shape

In [0]:
movies_full = movies_full.join(tagline_df)
movies_full.shape

In [0]:
movies_full = movies_full.join(overview_df)
movies_full.shape

In [0]:
movies_full = movies_full.join(spoken_languages_df)
movies_full.shape

In [0]:
movies_full = movies_full.join(original_languages_df)
movies_full.shape

In [0]:
movies_full = movies_full.join(homepage_domains_df)
movies_full.shape

In [0]:
movies_full = movies_full.join(production_companies_df)
movies_full.shape

In [0]:
movies_full = movies_full.join(production_countries_df)
movies_full.shape

In [0]:
movies_full = movies_full.join(directors_df)
movies_full.shape

In [0]:
movies_full = movies_full.join(actors_df)
movies_full.shape

In [0]:
movies_full = movies_full.join(season_df)
movies_full.shape

In [0]:
movies_full = movies_full.join(status_df)
movies_full.shape

In [0]:
movies_full = movies_full.join(franchise_df)
movies_full.shape

In [0]:
movies_full = movies_full.join(genres_df)
movies_full.shape

In [0]:
movies_full.head(10)

#### Final cleansing

In [0]:
# remove empty target values
movies_full_filtered = movies_full[movies_full['revenue'].notnull()]

In [0]:
# identify which columns have empty values
print('Empty values for budget: {}'.format(movies_full_filtered['budget'].isnull().values.any()))
print('Empty values for popularity: {}'.format(movies_full_filtered['popularity'].isnull().values.any()))
print('Empty values for runtime: {}'.format(movies_full_filtered['runtime'].isnull().values.any()))
print('Empty values for vote_average: {}'.format(movies_full_filtered['vote_average'].isnull().values.any()))
print('Empty values for vote_count: {}'.format(movies_full_filtered['vote_count'].isnull().values.any()))
print('Empty values for cast_size: {}'.format(movies_full_filtered['cast_size'].isnull().values.any()))
print('Empty values for crew_size: {}'.format(movies_full_filtered['crew_size'].isnull().values.any()))
print('Empty values for weighted_rating: {}'.format(movies_full_filtered['weighted_rating'].isnull().values.any()))

In [0]:
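# replace NaN with mean (work on a copy and assign back, avoiding chained assignment on a filtered frame)
movies_full_filtered = movies_full_filtered.copy()
movies_full_filtered['budget'] = movies_full_filtered['budget'].fillna(movies_full_filtered['budget'].mean())
movies_full_filtered['runtime'] = movies_full_filtered['runtime'].fillna(movies_full_filtered['runtime'].mean())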

In [0]:
# get data set info
movies_full_filtered.info()
movies_full_filtered.describe()

#### Review and Save Enhanced Dataset

In [0]:
# copy the final feature set to movies
movies = movies_full_filtered.copy()

# verify the schema and data end results

movies.name = 'movies'
quick_schema_analysis(movies)
quick_data_analysis(movies)

In [0]:
# save movies dataframe to a csv
                    
movies_final_file = os.path.join(PROJECT_LOCAL_DIR, 'movies_final.csv')

print('Saving final dataset to ' + movies_final_file + '...')

movies.to_csv(movies_final_file,
              encoding='utf-8',
              index=False
              )

print('Dataset saved.')

## Feature selection and reduction - continued

In [0]:
movies_model = movies_full_filtered.copy().drop('id',axis=1)

In [0]:
# define target value and features
target = 'revenue'
features = list(movies_model.columns)
features = [f for f in features if f!=target]

feature_set = movies_model[features]
target_col = movies_model[target]

#### prepare the train and test data sets

In [0]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(movies_model, test_size=0.3)

X_tr = train_set[features]
y_tr = train_set[[target]]

X_te = test_set[features]
y_te = test_set[[target]]

#add a new target variable for classification 

y_tr_cl = np.ravel(y_tr)/np.ravel(train_set.budget)>2.50
y_te_cl = np.ravel(y_te)/np.ravel(test_set.budget)>2.50

#### SelectKBest with chi square

In [0]:
# Import the necessary libraries first
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [0]:
# define the select K best model - using chi square
select_best = SelectKBest(score_func=chi2)

In [0]:
# identify the best hyper parameter
from sklearn.linear_model import LinearRegression

k_arr = [10, 20, 30, 40 ,50, 75, 100]

for k in k_arr:
    select_best.k = k
    select_best.fit(np.asarray(X_tr),np.asarray(y_tr, dtype="|S100"))
    X_tr_new = select_best.transform(X_tr)
    X_te_new = select_best.transform(X_te)
    print("\n")
    print("Performance for k={}".format(k))
    print("Training set:")
    lin_scores = cross_val_score(LinearRegression(), X_tr_new, y_tr, scoring="neg_mean_squared_error", cv=4)
    lin_rmse_scores = np.sqrt(-lin_scores)
    display_scores(lin_rmse_scores)
    print("Test set:")
    lin_scores = cross_val_score(LinearRegression(), X_te_new, y_te, scoring="neg_mean_squared_error", cv=4)
    lin_rmse_scores = np.sqrt(-lin_scores)
    display_scores(lin_rmse_scores)

In [0]:
# pick the best : for k=40
select_best.k = 40
select_best.fit(np.asarray(X_tr),np.asarray(y_tr, dtype="|S100"))
X_tr = select_best.transform(X_tr)
X_te = select_best.transform(X_te)
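
`chi2` is designed for non-negative features and a categorical target, which is why the cells above cast revenue to strings before fitting. For a continuous target, scikit-learn also provides `f_regression`. The sketch below shows that alternative on synthetic data. It is a swapped-in technique for comparison, not what the notebook uses, and all names in it are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: only columns 0 and 3 drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
y_demo = 5 * X_demo[:, 0] - 4 * X_demo[:, 3] + rng.normal(scale=0.5, size=300)

# Score each feature with a univariate F-test and keep the best 2
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X_demo, y_demo)

print(selector.get_support(indices=True))   # indices of the retained columns
print(X_selected.shape)
```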

#### Scaling

In [0]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_tr)
X_tr = scaler.transform(X_tr)
X_te = scaler.transform(X_te)

#### Feature tuning pipeline (using linear regression)

In [0]:
# Linear regression and pipeline (to evaluate feature selection and reduction)
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

steps = [('regression',LinearRegression())]
pipeline = Pipeline(steps)

#### PCA

In [0]:
from sklearn.decomposition import PCA
# Make an instance of the Model
pca = PCA()

In [0]:
# add to the steps collection
pipeline.steps.insert(0,('pca',pca))

In [0]:
# GridSearchCV
parameters = {}
grid_search = GridSearchCV(pipeline,parameters, cv=4, scoring='neg_mean_squared_error')

In [0]:
# add corresponding parameters
# (floats select the fraction of variance to keep; the integer 1 keeps a single component)
grid_search.param_grid['pca__n_components'] = [0.9,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,0.999,1]

In [0]:
# fit the model
grid_search.fit(X_tr,y_tr)

In [0]:
# training scores
print("Best pca parameter is {}".format(grid_search.best_params_))
lin_scores = grid_search.best_score_
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

In [0]:
# run best model with test data
grid_search_final = grid_search.best_estimator_ 
grid_search_final.fit(X_te,y_te)
lin_scores = cross_val_score(grid_search_final, X_te, y_te, scoring="neg_mean_squared_error", cv=4)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

In [0]:
# apply transformation to the features
pca.n_components = 0.999
pca.fit(X_tr)
X_tr = pca.transform(X_tr)
X_te = pca.transform(X_te)

# check the number of principal components
print(pca.n_components_)
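
When `n_components` is a float between 0 and 1, `PCA` keeps the smallest number of components whose cumulative explained variance reaches that fraction. A small sketch on synthetic data illustrates this; the names and numbers are made up, and `svd_solver="full"` is set explicitly as an assumption to keep the behaviour the same across scikit-learn versions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 features, but almost all variance lives in 2 latent directions
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X_demo = latent @ mixing + rng.normal(scale=0.01, size=(500, 10))

# Ask PCA to retain at least 99% of the variance
pca_demo = PCA(n_components=0.99, svd_solver="full")
X_reduced = pca_demo.fit_transform(X_demo)

print(pca_demo.n_components_)                      # number of components actually kept
print(pca_demo.explained_variance_ratio_.sum())    # cumulative variance retained
```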

In [0]:
# tidy up - remove the pca step from the pipeline
del pipeline.steps[0]

#### Clustering - K-means

In [0]:
from sklearn.cluster import KMeans
kmeans = KMeans()

In [0]:
# add to the steps collection
pipeline.steps.insert(0,('kmeans',kmeans))

In [0]:
# GridSearchCV
parameters = {}
grid_search = GridSearchCV(pipeline,parameters, cv=4, scoring='neg_mean_squared_error')

# add corresponding parameters
grid_search.param_grid['kmeans__n_clusters'] = [2,4,6,8,10,15,20,30,40]

In [0]:
# fit the model
grid_search.fit(X_tr,y_tr)

In [0]:
# training set scores
print("Best k-means parameter is {}".format(grid_search.best_params_))
lin_scores = grid_search.best_score_
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

In [0]:
# try with test data
grid_search_final = grid_search.best_estimator_ 
grid_search_final.fit(X_te,y_te)
lin_scores = cross_val_score(grid_search_final, X_te, y_te, scoring="neg_mean_squared_error", cv=4)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

**Conclusion**: results don't improve, so clustering is skipped.

# 4.0 Model 

## Reasons for considering a classification model in addition to the regression model.

Box office profits are a function of production budgets, ticket prices, and marketing and distribution costs, all of which depend on the year of release.


Hence, when evaluating historical values, profit calculations that do not account for inflation don't always provide the big picture (pun intended). A recommended option is to use a revenue-to-budget ratio (R/B), which offers better insight than revenue alone.

Further, this ratio is also useful in determining the relative success of a theatrical release. The total cost of a movie is not limited to its production budget but should also take into account distribution and marketing costs, which are becoming increasingly important. Accordingly, movie producers and industry watchers now believe that a movie needs to make in excess of 2.5 times its production budget to be considered even a moderate success.

Hence, the classifier feature matrix can drop 'revenue' and 'revenue to budget ratio', as new unseen data (movies that are yet to be released, for which we want to predict success or failure) will not have a revenue column. The attempt here is to predict film success using features such as genre, production company, month/season of release, cast/crew size, country of production, franchise membership and weighted rating. These are parameters that are already known before the movie's release.
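
The labelling rule described above (success when revenue exceeds 2.5 times the production budget) can be sketched as follows. The figures are invented, and `SUCCESS_RATIO` and `label_success` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd

SUCCESS_RATIO = 2.5   # threshold quoted in the text above

def label_success(revenue, budget, ratio=SUCCESS_RATIO):
    """Return a boolean array: True where revenue exceeds `ratio` times the budget."""
    revenue = np.asarray(revenue, dtype=float)
    budget = np.asarray(budget, dtype=float)
    # comparing revenue with ratio * budget avoids dividing by a zero budget
    return revenue > ratio * budget

# Toy figures (made up, in millions)
toy = pd.DataFrame({
    "budget":  [10.0, 50.0, 100.0],
    "revenue": [40.0, 100.0, 250.0],
})
toy["success"] = label_success(toy["revenue"], toy["budget"])
print(toy)
```

Only the first film counts as a success here: the second falls short of 2.5 times its budget and the third lands exactly on the threshold rather than above it.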


## Classification

### 1. Logistic Regression

In [0]:
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
params_logreg={'C':np.logspace(-5, 8, 15)}
logreg_cv = GridSearchCV(logreg,params_logreg,cv=4)
logreg_cv.fit(X_tr,y_tr_cl)

In [0]:
print('Accuracy: ',logreg_cv.best_score_)

### 2. Random Forest

In [0]:
forest=RandomForestClassifier(random_state=42)
params_forest = {'n_estimators': [3, 4, 6, 7, 10, 20, 50, 100]}
forest_cv=GridSearchCV(forest, params_forest,cv=4)
forest_cv.fit(X_tr,y_tr_cl)
print('Accuracy: ',forest_cv.best_score_)

### 3. KNN

In [0]:
knn=KNeighborsClassifier()
params_knn = {'n_neighbors': [3, 5, 10, 20]}
knn_cv=GridSearchCV(knn, params_knn,cv=4,n_jobs=-1)
knn_cv.fit(X_tr,y_tr_cl)

In [0]:
print('Accuracy:',knn_cv.best_score_)

### 4. Decision tree

In [0]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()
params_tree = {"criterion": ["gini", "entropy"],"min_samples_split": [2, 10, 20],"max_depth": [None, 2, 5, 10],
              "min_samples_leaf": [1, 5, 10],"max_leaf_nodes": [None, 5, 10, 20]
              }
tree_cv=GridSearchCV(tree, params_tree,cv=4,n_jobs=-1)
tree_cv.fit(X_tr,y_tr_cl)


In [0]:
print('Accuracy:',tree_cv.best_score_)

### 5. AdaBoostClassifier with DecisionTree as base estimator

In [0]:
from sklearn.ensemble import AdaBoostClassifier
ABC= AdaBoostClassifier(base_estimator=tree_cv.best_estimator_,n_estimators=100)
np.mean(cross_val_score(ABC, X_tr, y_tr_cl, cv=3))

### 6. Voting Classifier 

In [0]:
from sklearn.ensemble import VotingClassifier
classifiers = [('Logistic Regression', logreg_cv.best_estimator_),('K Nearest Neighbours', knn_cv.best_estimator_),
               ('RandomForestClassifier', forest_cv.best_estimator_),('DecisionTreeClassifier',tree_cv.best_estimator_)]

vc = VotingClassifier(estimators=classifiers)

print('Accuracy: ',np.mean(cross_val_score(vc, X_tr, y_tr_cl, cv=3)))

## Regression

### 1. Linear Regression

In [0]:
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LinearRegression
lin_reg=LinearRegression()
lin_scores = cross_val_score(lin_reg,X_tr,y_tr, scoring="neg_mean_squared_error", cv=4)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)


### 2. Ridge Regression

In [0]:
from sklearn.linear_model import Ridge
param_grid = [{'alpha': [0.001,0.01,0.1,1,10,100,1000]}]
rr_cv = GridSearchCV(Ridge(), param_grid, cv=4, scoring='neg_mean_squared_error')
rr_cv.fit(X_tr, y_tr)

In [0]:
print(np.sqrt(-rr_cv.best_score_))

### 3. Lasso Regression

In [0]:
from sklearn.linear_model import Lasso
param_grid = [{'alpha': [0.001,0.01,0.1,1,10,100,1000]}]
lr_cv = GridSearchCV(Lasso(), param_grid, cv=3, scoring='neg_mean_squared_error')
lr_cv.fit(X_tr, y_tr)

In [0]:
print(np.sqrt(-lr_cv.best_score_))

## Test

#### Classification

In [0]:
from sklearn.metrics import accuracy_score

final_cl_model = logreg_cv.best_estimator_  
y_pred_cl = final_cl_model.predict(X_te)

print('Accuracy score: ',accuracy_score(y_te_cl,y_pred_cl))

**Logistic Regression has a train accuracy of 0.769 and a test accuracy of 0.782**

#### Regression

In [0]:
from sklearn.metrics import mean_squared_error

final_reg_model = lin_reg
final_reg_model.fit(X_tr,y_tr)

y_pred_reg = final_reg_model.predict(X_te)

final_mse = mean_squared_error(y_te, y_pred_reg)
final_rmse = np.sqrt(final_mse)
print(final_rmse)

## Model Evaluation 

### Best model - Logistic Regression

#### Confusion Matrix

In [0]:
def cm_heatmap(y_test,y_pred):
    cm=confusion_matrix(y_test,y_pred)
    conf_matrix=pd.DataFrame(data=cm,columns=['Predicted:0','Predicted:1'],index=['Actual:0','Actual:1'])
    plt.figure(figsize = (8,5))
    sns.heatmap(conf_matrix, annot=True,fmt='d',cmap="YlGnBu")

    
cm_heatmap(y_te_cl,y_pred_cl)

#### Classification report

In [0]:
print(classification_report(y_te_cl, y_pred_cl))

#### ROC Curve and AUC

In [0]:
fpr, tpr, thresholds = roc_curve(y_te_cl, final_cl_model.predict_proba(X_te)[:,1])
plt.plot(fpr,tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for Logistic Regression classifier')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('Recall')
plt.grid(True)

In [0]:
print ("Area under the curve is ", roc_auc_score(y_te_cl,final_cl_model.predict_proba(X_te)[:,1]))

### Evaluation of second best model

In [0]:
## Test
y_pred_forest = forest_cv.predict(X_te)
print('Accuracy score: ',accuracy_score(y_te_cl,y_pred_forest))

In [0]:
## Confusion matrix
cm_heatmap(y_te_cl,y_pred_forest)

In [0]:
print(classification_report(y_te_cl, y_pred_forest))

In [0]:
fpr, tpr, thresholds = roc_curve(y_te_cl, forest_cv.predict_proba(X_te)[:,1])
plt.plot(fpr,tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for RandomForest classifier')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('Recall')
plt.grid(True)

In [0]:
print ("Area under the curve is ", roc_auc_score(y_te_cl,forest_cv.predict_proba(X_te)[:,1]))

### Comparison of Accuracy, Precision and Recall between Logistic Regression and Random Forest classifiers

In the case of movie box office prediction, **high recall** means that few films that go on to be successful are missed, while **high precision** ensures that the majority of positively predicted films are in fact successful.

As the average cost of producing a major studio film is extremely high and the number of films produced by a production house is relatively low, it is important that the films the classifier predicts as successful actually turn out to be successful. A false positive can result in a huge loss for the production house. Hence a model with high precision is preferred over one with high recall.
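
To make the two metrics concrete, the sketch below computes them from confusion-matrix counts. The counts are hypothetical and `precision_recall` is an illustrative helper, not a function used elsewhere in this notebook.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # share of predicted hits that were hits
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # share of actual hits that were found
    return precision, recall

# Hypothetical counts: 100 truly successful films, 50 predicted as successful
tp, fp, fn = 40, 10, 60
precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

With these counts, 8 out of 10 positive predictions are correct (precision 0.80) even though only 4 out of 10 successful films are identified (recall 0.40), which is the kind of trade-off a cautious production house would accept.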

In [0]:
def metrics_comp(y_te_cl,y_pred_cl,y_pred_forest):
    metrics=pd.DataFrame(index=['Accuracy','Precision','Recall'],columns=['Logistic Regression','Random Forest'])
    metrics.loc['Accuracy','Logistic Regression']=accuracy_score(y_te_cl,y_pred_cl)
    metrics.loc['Precision','Logistic Regression']=precision_score(y_te_cl,y_pred_cl)
    metrics.loc['Recall','Logistic Regression']=recall_score(y_te_cl,y_pred_cl)
    metrics.loc['Accuracy','Random Forest']=accuracy_score(y_te_cl,y_pred_forest)
    metrics.loc['Precision','Random Forest']=precision_score(y_te_cl,y_pred_forest)
    metrics.loc['Recall','Random Forest']=recall_score(y_te_cl,y_pred_forest)
    return(metrics)

In [0]:
metrics_comp(y_te_cl,y_pred_cl,y_pred_forest)

In [0]:
fig,ax=plt.subplots(figsize=(8,6))
metrics_comp(y_te_cl,y_pred_cl,y_pred_forest).plot(kind='bar',ax=ax)
plt.show()

Accuracy levels are consistent between both models. However, Logistic Regression has higher precision than Random Forest.

### Tradeoff between precision and recall at different thresholds.

In [0]:
precision_logreg, recall_logreg, thresholds_logreg=precision_recall_curve(y_te_cl,final_cl_model.predict_proba(X_te)[:,1])


In [0]:
def p_r_threshold(thresholds,precision,recall):
    # precision and recall have one more element than thresholds;
    # drop the final (precision=1, recall=0) point so the arrays align
    fig,ax=plt.subplots(figsize=(8,5))
    ax.plot(thresholds,precision[:-1],label='Precision')
    ax.plot(thresholds,recall[:-1],label='Recall')
    ax.set_xlabel('Classification threshold')
    ax.set_ylabel('Precision, Recall')
    ax.set_title('Logistic Regression: Precision Recall')
    ax.legend()
    ax.grid()

p_r_threshold(thresholds_logreg,precision_logreg,recall_logreg)

At a threshold of 0.5 the recall is 0.4 with a high precision (0.7). Moving the threshold to 0.6 yields a higher precision (0.8) without a significant drop in recall.
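
Classifiers such as `LogisticRegression` predict at a fixed cut-off of 0.5, so using a different threshold means comparing the positive-class probabilities against it manually. A minimal sketch, assuming made-up probabilities and a hypothetical helper `predict_with_threshold`:

```python
import numpy as np

def predict_with_threshold(probabilities, threshold=0.5):
    """Turn positive-class probabilities into 0/1 labels at a chosen threshold."""
    return (np.asarray(probabilities) >= threshold).astype(int)

# Made-up positive-class probabilities, e.g. the output of predict_proba(X)[:, 1]
proba = np.array([0.15, 0.45, 0.55, 0.65, 0.90])

print(predict_with_threshold(proba, 0.5))   # default cut-off
print(predict_with_threshold(proba, 0.6))   # stricter cut-off: fewer positives
```

Raising the threshold can only remove positive predictions, which is why precision tends to rise and recall can only stay level or fall.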

# 5.0 Analysis

## Plots

## Scoring

# 6.0 Summary

## Observations

## Conclusions