# Problem 3: Predict whether a movie will be profitable or not

Predicting the exact revenue of a movie is certainly interesting, but it is also very valuable for a movie-making company to know whether the movie will be profitable at all, as this can drive the final decision to pursue a project or not.


In [1]:
# Loading necessary packages
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Loading the training & testing set
# Remember, this has been previously randomly split
# The training data contains approx 70% of observations

movies_train = pd.read_csv('../00_DATA/train_dataset.csv')
movies_test = pd.read_csv('../00_DATA/test_dataset.csv')

movies_train.head(3)

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,Animation,Fantasy,GameShow,History,Music,...,Sport,War,Western,averageRating,numVotes,budget,revenue,isTopActor,isTopDirector,yearsSinceProduced
0,tt0035423,Kate & Leopold,Kate & Leopold,2001,118,0,1,0,0,0,...,0,0,0,6.4,81936.0,48000000,76019048,1,1,20
1,tt0116391,Gang,Gang,2000,152,0,0,0,0,0,...,0,0,0,6.2,236.0,30000000,41480851,0,0,21
2,tt0118589,Glitter,Glitter,2001,104,0,0,0,0,1,...,0,0,0,2.3,23292.0,22000000,5271666,0,0,20


# Creating the Dependent Variable

As of now, the variable of interest is continuous: the column 'revenue' contains the total box office revenue, in dollars, achieved by the movie in question. In this section, however, we're only interested in predicting whether 'revenue' is larger than or equal to 'budget'. We therefore create a new column, 'profitable':

In [3]:
movies_train['profitable'] = np.where(movies_train['revenue'] >= movies_train['budget'], 1, 0)
movies_test['profitable'] = np.where(movies_test['revenue'] >= movies_test['budget'], 1, 0)


In [4]:
movies_train.head(5)

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,Animation,Fantasy,GameShow,History,Music,...,War,Western,averageRating,numVotes,budget,revenue,isTopActor,isTopDirector,yearsSinceProduced,profitable
0,tt0035423,Kate & Leopold,Kate & Leopold,2001,118,0,1,0,0,0,...,0,0,6.4,81936.0,48000000,76019048,1,1,20,1
1,tt0116391,Gang,Gang,2000,152,0,0,0,0,0,...,0,0,6.2,236.0,30000000,41480851,0,0,21,1
2,tt0118589,Glitter,Glitter,2001,104,0,0,0,0,1,...,0,0,2.3,23292.0,22000000,5271666,0,0,20,0
3,tt0120166,The Sorcerer's Apprentice,The Sorcerer's Apprentice,2001,86,0,1,0,0,0,...,0,0,4.5,565.0,150000000,215283742,0,0,20,1
4,tt0120467,Vulgar,Vulgar,2000,87,0,0,0,0,0,...,0,0,5.2,4078.0,120000,14904,0,0,21,0


In [5]:
movies_test.head(5)

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,Animation,Fantasy,GameShow,History,Music,...,War,Western,averageRating,numVotes,budget,revenue,isTopActor,isTopDirector,yearsSinceProduced,profitable
0,tt0293429,Mortal Kombat,Mortal Kombat,2021,110,0,1,0,0,0,...,0,0,6.1,150542.0,20000000,83601013,0,0,0,1
1,tt0315642,Wazir,Wazir,2016,103,0,0,0,0,0,...,0,0,7.2,18426.0,5200000,9200000,0,1,5,1
2,tt0385887,Motherless Brooklyn,Motherless Brooklyn,2019,144,0,0,0,0,0,...,0,0,6.8,51825.0,26000000,18377736,1,1,2,0
3,tt0437086,Alita: Battle Angel,Alita: Battle Angel,2019,122,0,0,0,0,0,...,0,0,7.3,249934.0,170000000,404852543,1,1,2,1
4,tt0441881,Danger Close,Danger Close: The Battle of Long Tan,2019,118,0,0,0,0,0,...,1,0,6.8,11395.0,23934823,2078370,1,0,2,0


# Model Generation

In this section, we'll generate classification models that predict whether a movie is profitable or not given its features. The models we will generate are the following:

* Baseline
* Logistic Regression
* CART
* Vanilla Bagging
* Random Forest
* Gradient Boosting

We will then evaluate our performance metric of choice, **accuracy**, for each model.

In [6]:
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import accuracy_score

In [7]:
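# Custom metric helpers for later use.
# confusion_matrix(...).ravel() returns the counts in the order tn, fp, fn, tp.
# labels=[0, 1] keeps the matrix 2x2 even if one class is absent.
def tpr_score(y_test, y_pred):
    """True positive rate: tp / (tp + fn)."""
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn)

def fpr_score(y_test, y_pred):
    """False positive rate: fp / (fp + tn)."""
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    return fp / (fp + tn)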
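To make the index arithmetic in these helpers concrete, here is a small sketch that counts the four confusion-matrix cells by hand on made-up labels. The labels and the function name `confusion_counts` are assumptions for illustration and are not part of the notebook; the cell order (tn, fp, fn, tp) mirrors what `confusion_matrix(...).ravel()` returns for binary 0/1 labels.

```python
# Illustrative only: toy labels, not taken from the movie dataset.
def confusion_counts(y_true, y_pred):
    """Count (tn, fp, fn, tp) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp

y_true_toy = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_toy = [1, 0, 1, 1, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_counts(y_true_toy, y_pred_toy)
tpr = tp / (tp + fn)  # share of actual positives that were caught
fpr = fp / (fp + tn)  # share of actual negatives flagged as positive
print(tn, fp, fn, tp)
print(tpr, fpr)
```

The true positive rate divides by the actual positives (tp + fn), while the false positive rate divides by the actual negatives (fp + tn), which is why the two helpers read different pairs of cells.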

## Baseline Model

Let us first build a baseline model that we'll use as a "strict minimum" against which to assess the performance of the next models. A baseline model simply predicts whatever happens most frequently in the training set. Let us inspect:

In [8]:
# Which value of "profitable" happens most frequently in the training set?
movies_train['profitable'].value_counts()

1    3946
0    1604
Name: profitable, dtype: int64

There are more profitable movies (3946) than non-profitable ones (1604) in the training set. Therefore, our baseline model will always predict profitable = 1.

In [9]:
baseline_y_pred = [1] * len(movies_test)
y_actual = movies_test['profitable']

In [10]:
# CM and performance metrics of the baseline model

baseline_cm = confusion_matrix(y_actual, baseline_y_pred).ravel()
baseline_accuracy = accuracy_score(y_actual, baseline_y_pred)

print("Confusion Matrix of the Baseline Model", baseline_cm)
print("Accuracy of the baseline model:", baseline_accuracy)

Confusion Matrix of the Baseline Model [   0  633    0 1720]
Accuracy of the baseline model: 0.7309817254568636
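The majority-class rule can also be written as a small reusable function. The sketch below is a plain-Python illustration that rebuilds label lists from the counts printed above (3946 and 1604 in training, 633 and 1720 in test); the helper name `majority_class` is an assumption made for this example. scikit-learn's `DummyClassifier(strategy="most_frequent")` implements the same idea.

```python
from collections import Counter

def majority_class(labels):
    """Return the most frequent label among the training labels."""
    return Counter(labels).most_common(1)[0][0]

# Label lists rebuilt from the counts reported in the notebook output
train_labels = [1] * 3946 + [0] * 1604
test_labels = [0] * 633 + [1] * 1720

baseline_label = majority_class(train_labels)
baseline_preds = [baseline_label] * len(test_labels)

# Accuracy is the share of test labels equal to the constant prediction
accuracy = sum(p == t for p, t in zip(baseline_preds, test_labels)) / len(test_labels)
print(baseline_label, accuracy)
```

Because the prediction is constant, the baseline accuracy is simply the share of the majority class in the test set, 1720 / 2353.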


## Logistic Regression

### Ignored irrelevant columns:

Before we even begin modelling, some columns must be ignored as they are not relevant for this study:
- tconst is just a database ID and carries no predictive information
- primaryTitle and originalTitle, as we chose not to do NLP on titles for this portion of the project
- startYear, as we used it to obtain yearsSinceProduced
- **Important:** numVotes and averageRating. We assume in this study that ratings are gathered AFTER the revenue is obtained, so they would not be available to companies wanting to determine profitability BEFORE movies are released.
- revenue, as it was used to create the dependent variable and therefore cannot be used to predict profitability

In [11]:
movies_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5550 entries, 0 to 5549
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   tconst              5550 non-null   object 
 1   primaryTitle        5550 non-null   object 
 2   originalTitle       5550 non-null   object 
 3   startYear           5550 non-null   int64  
 4   runtimeMinutes      5550 non-null   int64  
 5   Animation           5550 non-null   int64  
 6   Fantasy             5550 non-null   int64  
 7   GameShow            5550 non-null   int64  
 8   History             5550 non-null   int64  
 9   Music               5550 non-null   int64  
 10  Musical             5550 non-null   int64  
 11  News                5550 non-null   int64  
 12  SciFi               5550 non-null   int64  
 13  Sport               5550 non-null   int64  
 14  War                 5550 non-null   int64  
 15  Western             5550 non-null   int64  
 16  averag

In [12]:
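# Helper that writes the formula for us, if needed
def formula_for_logreg(df, y, cols_to_remove=None):
    # Avoid a mutable default argument; None means "remove nothing"
    cols_to_remove = list(cols_to_remove or [])
    return y + ' ~ ' + ' + '.join(df.columns.drop([y] + cols_to_remove))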

formula_for_logreg(movies_train, 'profitable', [])

'profitable ~ tconst + primaryTitle + originalTitle + startYear + runtimeMinutes + Animation + Fantasy + GameShow + History + Music + Musical + News + SciFi + Sport + War + Western + averageRating + numVotes + budget + revenue + isTopActor + isTopDirector + yearsSinceProduced'
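The formula printed above still contains the columns we decided to ignore. As a sketch of how an exclusion list could be applied, the variant below works on a plain list of column names; the function name `build_formula` and the shortened column list are assumptions for illustration only.

```python
def build_formula(columns, target, exclude=()):
    """Build an R-style formula string, skipping the target and excluded columns."""
    skipped = set(exclude) | {target}
    predictors = [c for c in columns if c not in skipped]
    return target + ' ~ ' + ' + '.join(predictors)

# Shortened, illustrative column list (the real dataframe has more genre columns)
columns = ['tconst', 'primaryTitle', 'originalTitle', 'startYear', 'runtimeMinutes',
           'Animation', 'budget', 'revenue', 'averageRating', 'numVotes',
           'isTopActor', 'isTopDirector', 'yearsSinceProduced', 'profitable']

# Columns listed as ignored in the section above
ignored = ['tconst', 'primaryTitle', 'originalTitle', 'startYear',
           'averageRating', 'numVotes', 'revenue']

formula = build_formula(columns, 'profitable', exclude=ignored)
print(formula)
```

Keeping the exclusions in one list makes it easier to see that 'revenue' and the rating columns never reach the model.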

### Logistic Regression model #0 

In [13]:
# logreg0: All relevant features except numVotes and averageRating
import statsmodels.formula.api as smf

# Removed: numVotes, averageRating
logreg0 = smf.logit(formula = ''' profitable ~ 
                    runtimeMinutes +  
                    Animation + 
                    Fantasy + 
                    History + 
                    Music + 
                    Musical + 
                    News + 
                    SciFi + 
                    Sport + 
                    War + 
                    Western +
                    budget +
                    isTopActor + 
                    isTopDirector + 
                    yearsSinceProduced
                ''' ,data = movies_train).fit()

print(logreg0.summary())

Optimization terminated successfully.
         Current function value: 0.573974
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:             profitable   No. Observations:                 5550
Model:                          Logit   Df Residuals:                     5534
Method:                           MLE   Df Model:                           15
Date:                Sat, 04 Dec 2021   Pseudo R-squ.:                 0.04538
Time:                        15:02:55   Log-Likelihood:                -3185.6
converged:                       True   LL-Null:                       -3337.0
Covariance Type:            nonrobust   LLR p-value:                 1.396e-55
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              0.4530      0.177      2.564      0.010       0.107       0.799
runti

In [14]:
y_pred_logreg0 = [1 if pred >= 0.5 else 0 for pred in logreg0.predict(movies_test)]
y_actual = movies_test['profitable']

accuracy_score(y_actual, y_pred_logreg0)

0.7292817679558011

In [15]:
cm_logreg0 = confusion_matrix(y_actual, y_pred_logreg0).ravel()

print ("Confusion Matrix of Logistic Regression : \n", cm_logreg0)

Confusion Matrix of Logistic Regression : 
 [   8  625   12 1708]


### Logistic Regression model #1: Trying to include voting data

In [16]:
# logreg1: All relevant features **including** numVotes and averageRating
import statsmodels.formula.api as smf

# Removed: none
logreg1 = smf.logit(formula = ''' profitable ~ 
                    runtimeMinutes +  
                    Animation + 
                    Fantasy + 
                    History + 
                    Music + 
                    Musical + 
                    News + 
                    SciFi + 
                    Sport + 
                    War + 
                    Western +
                    budget +
                    isTopActor + 
                    isTopDirector + 
                    yearsSinceProduced +
                    numVotes +
                    averageRating
                ''' ,data = movies_train).fit()

print(logreg1.summary())

Optimization terminated successfully.
         Current function value: 0.528663
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:             profitable   No. Observations:                 5550
Model:                          Logit   Df Residuals:                     5532
Method:                           MLE   Df Model:                           17
Date:                Sat, 04 Dec 2021   Pseudo R-squ.:                  0.1207
Time:                        15:02:55   Log-Likelihood:                -2934.1
converged:                       True   LL-Null:                       -3337.0
Covariance Type:            nonrobust   LLR p-value:                2.590e-160
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              0.6996      0.233      3.002      0.003       0.243       1.156
runti

In [17]:
y_pred_logreg1 = [1 if pred >= 0.5 else 0 for pred in logreg1.predict(movies_test)]
y_actual = movies_test['profitable']

accuracy_score(y_actual, y_pred_logreg1)

0.7280067998300043

### Checking VIFs (checking for collinearity of independent variables)

In [18]:
# VIF Check
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def VIF(df, columns):
    # variance_inflation_factor needs the intercept term in the design matrix,
    # so we add a constant column first.
    values = sm.add_constant(df[columns]).values
    num_columns = len(columns) + 1
    vif = [variance_inflation_factor(values, i) for i in range(num_columns)]

    # Skip index 0, which is the VIF of the constant itself
    return pd.Series(vif[1:], index=columns)
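As a reminder of what this helper computes: the VIF of column j is 1 / (1 - R_j²), where R_j² comes from regressing column j on all the other columns plus an intercept. The sketch below reproduces that definition with NumPy on synthetic data; the function name `vif_by_hand` and the generated columns are assumptions for illustration, not the notebook's data.

```python
import numpy as np

def vif_by_hand(X):
    """VIF of each column of X: 1 / (1 - R^2) from regressing it on the others."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # intercept + other columns
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic columns (assumption): b is independent, c is nearly a copy of a
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif_by_hand(np.column_stack([a, b, c]))
print(vifs.round(2))
```

In this toy setup the two nearly collinear columns get large VIFs while the independent column stays close to 1, which is the pattern to look for in the real output.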

In [19]:
# checking VIF of columns of logreg0
cols = '''
            runtimeMinutes 
            Animation
            Fantasy 
            History 
            Music 
            Musical 
            News
            SciFi 
            Sport 
            War
            Western
            budget
            isTopActor 
            isTopDirector 
            yearsSinceProduced
                '''.split()
VIF(movies_train, cols)


runtimeMinutes        1.150473
Animation             1.047916
Fantasy               1.022809
History               1.036128
Music                 1.012351
Musical               1.016676
News                  1.003986
SciFi                 1.043541
Sport                 1.006872
War                   1.019436
Western               1.001371
budget                1.068120
isTopActor            1.373503
isTopDirector         1.413826
yearsSinceProduced    1.030560
dtype: float64

In [20]:
# checking VIF of columns of logreg1
cols = '''
        runtimeMinutes
        Animation 
        Fantasy 
        History 
        Music  
        Musical 
        News  
        SciFi  
        Sport  
        War 
        Western 
        budget 
        isTopActor  
        isTopDirector  
        yearsSinceProduced 
        numVotes 
        averageRating
                '''.split()
VIF(movies_train, cols)

runtimeMinutes        1.200381
Animation             1.051623
Fantasy               1.029710
History               1.046060
Music                 1.015461
Musical               1.017900
News                  1.005254
SciFi                 1.081139
Sport                 1.009518
War                   1.020329
Western               1.002417
budget                1.152986
isTopActor            1.439676
isTopDirector         1.548722
yearsSinceProduced    1.031681
numVotes              1.602614
averageRating         1.221574
dtype: float64

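All VIFs are below 2 (the largest is about 1.60, for numVotes), well under the commonly used rule-of-thumb threshold of 5, so there is no sign of problematic collinearity between our independent variables. We can therefore go ahead and remove variables with a high p-value, one at a time, to see how the model evolves.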

### Logistic Regression model #2: Removing 'Animation' due to high p-value

In [21]:
# logreg2: logreg0 - 'Animation'
import statsmodels.formula.api as smf

# Removed: numVotes, averageRating, Animation
logreg2 = smf.logit(formula = ''' profitable ~ 
                    runtimeMinutes +  
                    Fantasy + 
                    History + 
                    Music + 
                    Musical + 
                    News + 
                    SciFi + 
                    Sport + 
                    War + 
                    Western +
                    budget +
                    isTopActor + 
                    isTopDirector + 
                    yearsSinceProduced
                ''' ,data = movies_train).fit()

print(logreg2.summary())

Optimization terminated successfully.
         Current function value: 0.573983
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:             profitable   No. Observations:                 5550
Model:                          Logit   Df Residuals:                     5535
Method:                           MLE   Df Model:                           14
Date:                Sat, 04 Dec 2021   Pseudo R-squ.:                 0.04537
Time:                        15:02:56   Log-Likelihood:                -3185.6
converged:                       True   LL-Null:                       -3337.0
Covariance Type:            nonrobust   LLR p-value:                 3.080e-56
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              0.4599      0.175      2.624      0.009       0.116       0.803
runti

In [22]:
y_pred_logreg2 = [1 if pred >= 0.5 else 0 for pred in logreg2.predict(movies_test)]
y_actual = movies_test['profitable']

accuracy_score(y_actual, y_pred_logreg2)

0.7292817679558011

### Logistic Regression model #3: Removing 'Fantasy' due to high p-value

In [23]:
import statsmodels.formula.api as smf

# Removed: numVotes, averageRating, Animation, Fantasy
logreg3 = smf.logit(formula = ''' profitable ~ 
                    runtimeMinutes +  
                    History + 
                    Music + 
                    Musical + 
                    News + 
                    SciFi + 
                    Sport + 
                    War + 
                    Western +
                    budget +
                    isTopActor + 
                    isTopDirector + 
                    yearsSinceProduced
                ''' ,data = movies_train).fit()

print(logreg3.summary())

Optimization terminated successfully.
         Current function value: 0.573983
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:             profitable   No. Observations:                 5550
Model:                          Logit   Df Residuals:                     5536
Method:                           MLE   Df Model:                           13
Date:                Sat, 04 Dec 2021   Pseudo R-squ.:                 0.04537
Time:                        15:02:56   Log-Likelihood:                -3185.6
converged:                       True   LL-Null:                       -3337.0
Covariance Type:            nonrobust   LLR p-value:                 6.240e-57
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              0.4599      0.175      2.625      0.009       0.117       0.803
runti

In [24]:
y_pred_logreg3 = [1 if pred >= 0.5 else 0 for pred in logreg3.predict(movies_test)]
y_actual = movies_test['profitable']

accuracy_score(y_actual, y_pred_logreg3)

0.7292817679558011

### Logistic Regression model #4: Removing 'Sport' due to high p-value


In [26]:
import statsmodels.formula.api as smf

# Removed: numVotes, averageRating, Animation, Fantasy, Sport
logreg4 = smf.logit(formula = ''' profitable ~ 
                    runtimeMinutes +  
                    History + 
                    Music + 
                    Musical + 
                    News + 
                    SciFi +
                    War + 
                    Western +
                    budget +
                    isTopActor + 
                    isTopDirector + 
                    yearsSinceProduced
                ''' ,data = movies_train).fit()

print(logreg4.summary())

Optimization terminated successfully.
         Current function value: 0.573986
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:             profitable   No. Observations:                 5550
Model:                          Logit   Df Residuals:                     5537
Method:                           MLE   Df Model:                           12
Date:                Sat, 04 Dec 2021   Pseudo R-squ.:                 0.04537
Time:                        15:09:17   Log-Likelihood:                -3185.6
converged:                       True   LL-Null:                       -3337.0
Covariance Type:            nonrobust   LLR p-value:                 1.232e-57
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              0.4609      0.175      2.632      0.008       0.118       0.804
runti

In [27]:
y_pred_logreg4 = [1 if pred >= 0.5 else 0 for pred in logreg4.predict(movies_test)]
y_actual = movies_test['profitable']

accuracy_score(y_actual, y_pred_logreg4)

0.7292817679558011

### Logistic Regression model #5: Removing 'Music' due to high p-value

In [28]:
import statsmodels.formula.api as smf

# Removed: numVotes, averageRating, Animation, Fantasy, Sport, Music
logreg5 = smf.logit(formula = ''' profitable ~ 
                    runtimeMinutes +  
                    History + 
                    Musical + 
                    News + 
                    SciFi +
                    War + 
                    Western +
                    budget +
                    isTopActor + 
                    isTopDirector + 
                    yearsSinceProduced
                ''' ,data = movies_train).fit()

print(logreg5.summary())

Optimization terminated successfully.
         Current function value: 0.574007
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:             profitable   No. Observations:                 5550
Model:                          Logit   Df Residuals:                     5538
Method:                           MLE   Df Model:                           11
Date:                Sat, 04 Dec 2021   Pseudo R-squ.:                 0.04533
Time:                        15:11:11   Log-Likelihood:                -3185.7
converged:                       True   LL-Null:                       -3337.0
Covariance Type:            nonrobust   LLR p-value:                 2.564e-58
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              0.4600      0.175      2.628      0.009       0.117       0.803
runti

In [29]:
y_pred_logreg5 = [1 if pred >= 0.5 else 0 for pred in logreg5.predict(movies_test)]
y_actual = movies_test['profitable']

accuracy_score(y_actual, y_pred_logreg5)

0.7292817679558011

### Logistic Regression model #6: Removing 'News' due to high p-value

In [34]:
import statsmodels.formula.api as smf

# Removed: numVotes, averageRating, Animation, Fantasy, Sport, Music, News
logreg6 = smf.logit(formula = ''' profitable ~ 
                    runtimeMinutes +  
                    History + 
                    Musical +
                    SciFi +
                    War + 
                    Western +
                    budget +
                    isTopActor + 
                    isTopDirector + 
                    yearsSinceProduced
                ''' ,data = movies_train).fit()

print(logreg6.summary())

Optimization terminated successfully.
         Current function value: 0.574031
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:             profitable   No. Observations:                 5550
Model:                          Logit   Df Residuals:                     5539
Method:                           MLE   Df Model:                           10
Date:                Sat, 04 Dec 2021   Pseudo R-squ.:                 0.04529
Time:                        15:13:16   Log-Likelihood:                -3185.9
converged:                       True   LL-Null:                       -3337.0
Covariance Type:            nonrobust   LLR p-value:                 5.154e-59
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              0.4653      0.175      2.664      0.008       0.123       0.808
runti

In [31]:
y_pred_logreg6 = [1 if pred >= 0.5 else 0 for pred in logreg6.predict(movies_test)]
y_actual = movies_test['profitable']

accuracy_score(y_actual, y_pred_logreg6)

0.7292817679558011

### Logistic Regression model #7: Removing 'SciFi' due to high p-value

In [35]:
import statsmodels.formula.api as smf

# Removed: numVotes, averageRating, Animation, Fantasy, Sport, Music, News, SciFi
logreg7 = smf.logit(formula = ''' profitable ~ 
                    runtimeMinutes +  
                    History + 
                    Musical +
                    War + 
                    Western +
                    budget +
                    isTopActor + 
                    isTopDirector + 
                    yearsSinceProduced
                ''' ,data = movies_train).fit()

print(logreg7.summary())

Optimization terminated successfully.
         Current function value: 0.574183
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:             profitable   No. Observations:                 5550
Model:                          Logit   Df Residuals:                     5540
Method:                           MLE   Df Model:                            9
Date:                Sat, 04 Dec 2021   Pseudo R-squ.:                 0.04504
Time:                        15:14:47   Log-Likelihood:                -3186.7
converged:                       True   LL-Null:                       -3337.0
Covariance Type:            nonrobust   LLR p-value:                 1.973e-59
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              0.4589      0.175      2.624      0.009       0.116       0.802
runti

In [36]:
y_pred_logreg7 = [1 if pred >= 0.5 else 0 for pred in logreg7.predict(movies_test)]
y_actual = movies_test['profitable']

accuracy_score(y_actual, y_pred_logreg7)

0.7292817679558011

### Logistic Regression model #8: Removing 'Musical' due to high p-value

In [37]:
import statsmodels.formula.api as smf

# Removed: numVotes, averageRating, Animation, Fantasy, Sport, Music, News, SciFi, Musical
logreg8 = smf.logit(formula = ''' profitable ~ 
                    runtimeMinutes +  
                    History + 
                    War + 
                    Western +
                    budget +
                    isTopActor + 
                    isTopDirector + 
                    yearsSinceProduced
                ''' ,data = movies_train).fit()

print(logreg8.summary())

Optimization terminated successfully.
         Current function value: 0.574350
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:             profitable   No. Observations:                 5550
Model:                          Logit   Df Residuals:                     5541
Method:                           MLE   Df Model:                            8
Date:                Sat, 04 Dec 2021   Pseudo R-squ.:                 0.04476
Time:                        15:16:33   Log-Likelihood:                -3187.6
converged:                       True   LL-Null:                       -3337.0
Covariance Type:            nonrobust   LLR p-value:                 7.704e-60
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              0.4780      0.174      2.749      0.006       0.137       0.819
runti

In [38]:
y_pred_logreg8 = [1 if pred >= 0.5 else 0 for pred in logreg8.predict(movies_test)]
y_actual = movies_test['profitable']

accuracy_score(y_actual, y_pred_logreg8)

0.7292817679558011
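Models #2 to #8 repeat the same step by hand: refit, find the predictor with the highest p-value, drop it. A minimal sketch of how that loop could be automated is shown below. It runs on synthetic data, and the function name `backward_eliminate`, the 0.05 cutoff, and the column names x1 to x3 are assumptions for illustration; the notebook itself does not state a cutoff.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def backward_eliminate(df, target, features, alpha=0.05):
    """Refit a logit model, dropping the least significant predictor each round."""
    features = list(features)
    while features:
        formula = target + ' ~ ' + ' + '.join(features)
        model = smf.logit(formula=formula, data=df).fit(disp=0)
        pvalues = model.pvalues.drop('Intercept')
        worst = pvalues.idxmax()
        if pvalues[worst] <= alpha:
            # Every remaining predictor is below the cutoff: stop here
            return model, features
        features.remove(worst)
    return None, features

# Synthetic data (assumption): only x1 actually drives the outcome
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'x1': rng.normal(size=2000),
    'x2': rng.normal(size=2000),
    'x3': rng.normal(size=2000),
})
prob = 1 / (1 + np.exp(-(0.5 + 1.5 * toy['x1'])))
toy['y'] = (rng.uniform(size=2000) < prob).astype(int)

final_model, kept = backward_eliminate(toy, 'y', ['x1', 'x2', 'x3'])
print(kept)
```

Note that in the results above, models #0 and #2 through #8 all reach the same test accuracy (0.7293), so this kind of pruning simplifies the model without changing its predictions at the 0.5 threshold.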

## CART Classifier

A CART model, where:
* min_samples_leaf is arbitrarily set to 5
* min_samples_split is set to 10
* ccp_alpha is optimized through k-fold cross-validation, with k=5
* all other settings are left at their defaults
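As a sketch of how these settings might translate into scikit-learn, the example below tunes `ccp_alpha` with 5-fold cross-validation through `GridSearchCV`. The synthetic data, the grid of candidate `ccp_alpha` values, and `random_state` are assumptions made for illustration; they are not the notebook's actual choices.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (assumption)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(600, 4))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=600) > 0).astype(int)

# Fixed settings from the list above; everything else stays at its default
tree = DecisionTreeClassifier(min_samples_leaf=5, min_samples_split=10, random_state=0)

# Hypothetical grid of pruning strengths to search over
param_grid = {'ccp_alpha': np.linspace(0.0, 0.02, 11)}

# 5-fold cross-validation, scored by accuracy
search = GridSearchCV(tree, param_grid, cv=5, scoring='accuracy')
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is the tree refit on the full training data with the selected `ccp_alpha`, and can be used directly to predict on a test set.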

In [None]:
### CART 1) NOT using features that may not make sense in real life: numVotes, yearsSinceProduced, averageRating

In [None]:
# we need to split the data into dependent (y) and independent (X) variables
y_train = movies_train['profitable']
X_train = movies_train.drop(['profitable', 'tconst', 'revenue', 'primaryTitle', 'originalTitle', 'numVotes', 'averageRating', 'yearsSinceProduced'], axis=1)

# same thing for the test set
y_test = movies_test['profitable']
X_test = movies_test.drop(['profitable', 'tconst', 'revenue', 'primaryTitle', 'originalTitle', 'numVotes', 'averageRating', 'yearsSinceProduced'], axis=1)

In [39]:
movies_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5550 entries, 0 to 5549
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   tconst              5550 non-null   object 
 1   primaryTitle        5550 non-null   object 
 2   originalTitle       5550 non-null   object 
 3   startYear           5550 non-null   int64  
 4   runtimeMinutes      5550 non-null   int64  
 5   Animation           5550 non-null   int64  
 6   Fantasy             5550 non-null   int64  
 7   GameShow            5550 non-null   int64  
 8   History             5550 non-null   int64  
 9   Music               5550 non-null   int64  
 10  Musical             5550 non-null   int64  
 11  News                5550 non-null   int64  
 12  SciFi               5550 non-null   int64  
 13  Sport               5550 non-null   int64  
 14  War                 5550 non-null   int64  
 15  Western             5550 non-null   int64  
 16  averag