# Problem 3: Predict whether a movie will be profitable or not

Predicting the exact revenue a movie will earn is interesting, but it is also valuable for a production company to know whether the movie will be profitable at all, since that can inform the final decision to pursue a project or not.


In [1]:
# Loading necessary packages
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Loading the training & testing set
# Remember, this has been previously randomly split
# The training data contains approx 70% of observations

movies_train = pd.read_csv('../00_DATA/train_dataset.csv')
movies_test = pd.read_csv('../00_DATA/test_dataset.csv')

movies_train.head(3)

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,Animation,Fantasy,GameShow,History,Music,...,Sport,War,Western,averageRating,numVotes,budget,revenue,isTopActor,isTopDirector,yearsSinceProduced
0,tt0035423,Kate & Leopold,Kate & Leopold,2001,118,0,1,0,0,0,...,0,0,0,6.4,81936.0,48000000,76019048,1,0,20
1,tt0116391,Gang,Gang,2000,152,0,0,0,0,0,...,0,0,0,6.2,236.0,30000000,41480851,0,0,21
2,tt0118589,Glitter,Glitter,2001,104,0,0,0,0,1,...,0,0,0,2.3,23292.0,22000000,5271666,0,0,20


# Creating the dependent Variable

As of now, the variable of interest is continuous: the column "revenue" contains the total box office revenue, in dollars, achieved by the movie in question. In this section, however, we're only interested in predicting whether 'revenue' is larger than or equal to 'budget'. We therefore create a new column, 'profitable':

In [3]:
movies_train['profitable'] = np.where(movies_train['revenue'] >= movies_train['budget'], 1, 0)
movies_test['profitable'] = np.where(movies_test['revenue'] >= movies_test['budget'], 1, 0)


In [4]:
movies_train.head(5)

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,Animation,Fantasy,GameShow,History,Music,...,War,Western,averageRating,numVotes,budget,revenue,isTopActor,isTopDirector,yearsSinceProduced,profitable
0,tt0035423,Kate & Leopold,Kate & Leopold,2001,118,0,1,0,0,0,...,0,0,6.4,81936.0,48000000,76019048,1,0,20,1
1,tt0116391,Gang,Gang,2000,152,0,0,0,0,0,...,0,0,6.2,236.0,30000000,41480851,0,0,21,1
2,tt0118589,Glitter,Glitter,2001,104,0,0,0,0,1,...,0,0,2.3,23292.0,22000000,5271666,0,0,20,0
3,tt0120166,The Sorcerer's Apprentice,The Sorcerer's Apprentice,2001,86,0,1,0,0,0,...,0,0,4.5,565.0,150000000,215283742,0,0,20,1
4,tt0120467,Vulgar,Vulgar,2000,87,0,0,0,0,0,...,0,0,5.2,4078.0,120000,14904,0,0,21,0
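
As a quick check of the labelling rule, the sketch below recomputes `profitable` from the `budget` and `revenue` values of the five rows displayed above. It is a dependency-free illustration; the `rows` list and the `is_profitable` helper are hypothetical names introduced here, not part of the notebook.

```python
# (title, budget, revenue, profitable) copied from the head(5) output above
rows = [
    ("Kate & Leopold", 48_000_000, 76_019_048, 1),
    ("Gang", 30_000_000, 41_480_851, 1),
    ("Glitter", 22_000_000, 5_271_666, 0),
    ("The Sorcerer's Apprentice", 150_000_000, 215_283_742, 1),
    ("Vulgar", 120_000, 14_904, 0),
]

def is_profitable(budget, revenue):
    """Hypothetical helper: same rule as the np.where call (revenue >= budget)."""
    return 1 if revenue >= budget else 0

# Recompute the label for each row and compare with the displayed column
recomputed = [is_profitable(budget, revenue) for _, budget, revenue, _ in rows]
expected = [label for *_, label in rows]
print(recomputed == expected)
```

Because the rule uses `>=`, a movie that exactly breaks even is counted as profitable.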


In [5]:
movies_test.head(5)

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,Animation,Fantasy,GameShow,History,Music,...,War,Western,averageRating,numVotes,budget,revenue,isTopActor,isTopDirector,yearsSinceProduced,profitable
0,tt0293429,Mortal Kombat,Mortal Kombat,2021,110,0,1,0,0,0,...,0,0,6.1,150542.0,20000000,83601013,0,0,0,1
1,tt0315642,Wazir,Wazir,2016,103,0,0,0,0,0,...,0,0,7.2,18426.0,5200000,9200000,0,0,5,1
2,tt0385887,Motherless Brooklyn,Motherless Brooklyn,2019,144,0,0,0,0,0,...,0,0,6.8,51825.0,26000000,18377736,1,0,2,0
3,tt0437086,Alita: Battle Angel,Alita: Battle Angel,2019,122,0,0,0,0,0,...,0,0,7.3,249934.0,170000000,404852543,1,1,2,1
4,tt0441881,Danger Close,Danger Close: The Battle of Long Tan,2019,118,0,0,0,0,0,...,1,0,6.8,11395.0,23934823,2078370,0,0,2,0


# Model Generation

In this section, we'll generate classification models that predict whether a movie is profitable or not given its features. The models we will generate are the following:

* Baseline
* Logistic Regression
* CART
* Vanilla Bagging
* Random Forest
* Gradient Boosting

We will then evaluate our performance metric of choice, **accuracy**, for each model.

In [6]:
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import accuracy_score

In [7]:
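# Helper metric functions for later use.
# For binary labels, confusion_matrix(...).ravel() returns (tn, fp, fn, tp).
def tpr_score(y_test, y_pred):
    # True positive rate: TP / (TP + FN)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    return tp / (tp + fn)

def fpr_score(y_test, y_pred):
    # False positive rate: FP / (FP + TN)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    return fp / (fp + tn)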
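
The index arithmetic in these helpers relies on the order in which scikit-learn flattens a binary confusion matrix: `ravel()` yields true negatives, false positives, false negatives, then true positives. The sketch below reproduces that ordering in plain Python on a toy example; `binary_counts` and the toy labels are illustrative assumptions, not part of the notebook.

```python
def binary_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp), the same order as confusion_matrix(...).ravel()."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Toy labels, made up for illustration only
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

tn, fp, fn, tp = binary_counts(y_true, y_pred)
tpr = tp / (tp + fn)  # index 3 / (index 3 + index 2) in the flattened matrix
fpr = fp / (fp + tn)  # index 1 / (index 1 + index 0) in the flattened matrix
print((tn, fp, fn, tp), tpr, fpr)
```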

## Baseline Model

Let us first build a baseline model that we'll use as the "strict minimum" against which to assess the performance of the next models. A baseline model simply predicts the most frequent class in the training set. Let us inspect:

In [8]:
# Which value of "profitable" occurs most frequently in the training set?
movies_train['profitable'].value_counts()

1    3947
0    1604
Name: profitable, dtype: int64

There are more profitable movies than non-profitable ones. Therefore, our baseline model will always predict profitable = 1.

In [48]:
# The baseline predicts the majority class (1) for every test movie
baseline_y_pred = [1] * len(movies_test)
y_actual = movies_test['profitable']

In [49]:
# CM and performance metrics of the baseline model

baseline_cm = confusion_matrix(y_actual, baseline_y_pred).ravel()
baseline_accuracy = accuracy_score(y_actual, baseline_y_pred)

print("Confusion Matrix of the Baseline Model", baseline_cm)
print("Accuracy of the baseline model:", baseline_accuracy)

Confusion Matrix of the Baseline Model [   0  634    0 1721]
Accuracy of the baseline model: 0.7307855626326963
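
The baseline numbers can be reproduced from the class counts alone. The following sketch, which assumes only the counts printed above, picks the majority class from the training counts and derives the test accuracy; the variable names are illustrative.

```python
# Class counts reported by value_counts() on the training set
train_counts = {1: 3947, 0: 1604}
# Test-set class counts implied by the baseline confusion matrix [0, 634, 0, 1721]
test_counts = {1: 1721, 0: 634}

# The baseline always predicts the most frequent training class
majority_class = max(train_counts, key=train_counts.get)

# Test movies of the majority class are classified correctly, all others are not
baseline_accuracy = test_counts[majority_class] / sum(test_counts.values())
print(majority_class, baseline_accuracy)
```

Any later model therefore has to beat roughly 73% accuracy to add value over always predicting "profitable".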


## Logistic Regression Model

### Ignored irrelevant columns:

Before we begin modelling, some columns must be removed or ignored, as they are not relevant for this study:
- tconst is just a database ID and carries no predictive information
- primaryTitle and originalTitle, as we chose not to do NLP on titles for this portion of the project
- startYear, as we used it to obtain yearsSinceProduced
- **important** numVotes and averageRating: we assume in this study that ratings are gathered AFTER the revenue is obtained. They would therefore not be available to companies that want to determine profitability BEFORE a movie is released.
- revenue: as it was used to create the dependent variable, we cannot use it to predict profitability
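
The exclusions above can be expressed as a simple filter over the column names. This is a minimal sketch: the column list is copied from the notebook's column listing, while `all_columns`, `ignored`, and `features` are hypothetical names. Note that the logistic regression formula used in the notebook additionally leaves out `GameShow`, which the list above does not cover.

```python
# All columns of movies_train except the target, as listed in the notebook
all_columns = [
    "tconst", "primaryTitle", "originalTitle", "startYear", "runtimeMinutes",
    "Animation", "Fantasy", "GameShow", "History", "Music", "Musical", "News",
    "SciFi", "Sport", "War", "Western", "averageRating", "numVotes", "budget",
    "revenue", "isTopActor", "isTopDirector", "yearsSinceProduced",
]

# Columns ignored for the reasons given in the list above
ignored = {
    "tconst", "primaryTitle", "originalTitle",  # identifiers and titles
    "startYear",                                 # replaced by yearsSinceProduced
    "numVotes", "averageRating",                 # assumed unknown before release
    "revenue",                                   # used to build the target
}

# Keep the original column order while dropping the ignored columns
features = [col for col in all_columns if col not in ignored]
formula = "profitable ~ " + " + ".join(features)
print(len(features))
print(formula)
```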

In [11]:
movies_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5551 entries, 0 to 5550
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   tconst              5551 non-null   object 
 1   primaryTitle        5551 non-null   object 
 2   originalTitle       5551 non-null   object 
 3   startYear           5551 non-null   int64  
 4   runtimeMinutes      5551 non-null   int64  
 5   Animation           5551 non-null   int64  
 6   Fantasy             5551 non-null   int64  
 7   GameShow            5551 non-null   int64  
 8   History             5551 non-null   int64  
 9   Music               5551 non-null   int64  
 10  Musical             5551 non-null   int64  
 11  News                5551 non-null   int64  
 12  SciFi               5551 non-null   int64  
 13  Sport               5551 non-null   int64  
 14  War                 5551 non-null   int64  
 15  Western             5551 non-null   int64  
 16  averag

In [12]:
# function that writes the formula for us, if needed
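def formula_for_logreg(df, y, cols_to_remove=None):
    # Avoid a mutable default argument; None means "remove nothing extra"
    cols_to_remove = cols_to_remove or []
    return y + ' ~ ' + ' + '.join(df.columns.drop([y] + cols_to_remove))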

formula_for_logreg(movies_train, 'profitable', [])

'profitable ~ tconst + primaryTitle + originalTitle + startYear + runtimeMinutes + Animation + Fantasy + GameShow + History + Music + Musical + News + SciFi + Sport + War + Western + averageRating + numVotes + budget + revenue + isTopActor + isTopDirector + yearsSinceProduced'

In [67]:
import statsmodels.formula.api as smf

logreg0 = smf.logit(formula = ''' profitable ~ 
                    runtimeMinutes +  
                    Animation + 
                    Fantasy + 
                    History + 
                    Music + 
                    Musical + 
                    News + 
                    SciFi + 
                    Sport + 
                    War + 
                    Western +
                    budget +
                    isTopActor + 
                    isTopDirector + 
                    yearsSinceProduced
                ''' ,data = movies_train).fit()

print(logreg0.summary())

Optimization terminated successfully.
         Current function value: 0.576422
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:             profitable   No. Observations:                 5551
Model:                          Logit   Df Residuals:                     5535
Method:                           MLE   Df Model:                           15
Date:                Fri, 03 Dec 2021   Pseudo R-squ.:                 0.04124
Time:                        20:29:25   Log-Likelihood:                -3199.7
converged:                       True   LL-Null:                       -3337.3
Covariance Type:            nonrobust   LLR p-value:                 7.596e-50
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              0.4478      0.171      2.611      0.009       0.112       0.784
runti

In [68]:
y_pred_logreg0 = [1 if pred >= 0.5 else 0 for pred in logreg0.predict(movies_test)]
y_actual = movies_test['profitable']

# scikit-learn's convention is accuracy_score(y_true, y_pred)
accuracy_score(y_actual, y_pred_logreg0)

0.7290870488322717

In [69]:
cm_logreg0 = confusion_matrix(y_actual, y_pred_logreg0).ravel()

print("Confusion Matrix of Logistic Regression : \n", cm_logreg0)

Confusion Matrix of Logistic Regression : 
 [   5  629    9 1712]
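
Reading the flattened confusion matrix as (TN, FP, FN, TP) makes it easier to compare this model with the baseline. The sketch below only redoes the arithmetic on the four numbers printed above; no new results are introduced, and the variable names are illustrative.

```python
# Flattened confusion matrix of the logistic regression, as printed above
tn, fp, fn, tp = 5, 629, 9, 1712

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of correct predictions
tpr = tp / (tp + fn)           # profitable movies correctly flagged
fpr = fp / (fp + tn)           # unprofitable movies wrongly flagged as profitable
predicted_negative = tn + fn   # movies the model labels as unprofitable

print(total, predicted_negative)
print(round(accuracy, 4), round(tpr, 4), round(fpr, 4))
```

The model labels only 14 of the 2,355 test movies as unprofitable, so at the 0.5 threshold it behaves almost like the always-profitable baseline. That is why its accuracy (about 0.729) sits so close to, and slightly below, the baseline's (about 0.731).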


In [70]:
movies_test['profitable'].value_counts()

1    1721
0     634
Name: profitable, dtype: int64