# PART 1: Cleaning and Preparing Dataset


### Question: How can we predict the potential success of a movie?

A predictive model lets us anticipate how an upcoming film is likely to be received. This supports film investment decisions by helping investors choose which projects to fund.

### Dataset: https://www.kaggle.com/datasets/victorsoeiro/netflix-tv-shows-and-movies

In [1]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
sb.set() # set the default Seaborn style for graphics

In [2]:
data = pd.read_csv('movies.csv')
data.head()

| | name | rating | genre | year | released | score | votes | director | writer | star | country | budget | gross | company | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shining | R | Drama | 1980 | June 13, 1980 (United States) | 8.4 | 927000.0 | Stanley Kubrick | Stephen King | Jack Nicholson | United Kingdom | 19000000.0 | 46998772.0 | Warner Bros. | 146.0 |
| 1 | The Blue Lagoon | R | Adventure | 1980 | July 2, 1980 (United States) | 5.8 | 65000.0 | Randal Kleiser | Henry De Vere Stacpoole | Brooke Shields | United States | 4500000.0 | 58853106.0 | Columbia Pictures | 104.0 |
| 2 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 1980 | June 20, 1980 (United States) | 8.7 | 1200000.0 | Irvin Kershner | Leigh Brackett | Mark Hamill | United States | 18000000.0 | 538375067.0 | Lucasfilm | 124.0 |
| 3 | Airplane! | PG | Comedy | 1980 | July 2, 1980 (United States) | 7.7 | 221000.0 | Jim Abrahams | Jim Abrahams | Robert Hays | United States | 3500000.0 | 83453539.0 | Paramount Pictures | 88.0 |
| 4 | Caddyshack | R | Comedy | 1980 | July 25, 1980 (United States) | 7.3 | 108000.0 | Harold Ramis | Brian Doyle-Murray | Chevy Chase | United States | 6000000.0 | 39846344.0 | Orion Pictures | 98.0 |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7668 entries, 0 to 7667
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      7668 non-null   object 
 1   rating    7591 non-null   object 
 2   genre     7668 non-null   object 
 3   year      7668 non-null   int64  
 4   released  7666 non-null   object 
 5   score     7665 non-null   float64
 6   votes     7665 non-null   float64
 7   director  7668 non-null   object 
 8   writer    7665 non-null   object 
 9   star      7667 non-null   object 
 10  country   7665 non-null   object 
 11  budget    5497 non-null   float64
 12  gross     7479 non-null   float64
 13  company   7651 non-null   object 
 14  runtime   7664 non-null   float64
dtypes: float64(5), int64(1), object(9)
memory usage: 898.7+ KB


We start by loading our dataset and inspecting it to understand its structure and content. We identify the numeric and categorical predictors as shown below. 

### Variables we are making use of:

**Numeric Predictors**
1. votes
2. gross
3. budget
4. runtime

**Categorical Predictors**
1. rating
2. genre
3. country
4. company

**Response Variable**

score
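
The variable roles listed above can also be recorded in code, which makes it easy to check that each column has the expected kind of dtype. This is a minimal sketch, not part of the original notebook: the names `numeric_predictors`, `categorical_predictors`, `response`, and `sample` are assumptions for illustration, and `sample` holds only the first two rows shown in the `data.head()` output.

```python
import pandas as pd

# Column roles as listed above (list names are illustrative assumptions)
numeric_predictors = ['votes', 'gross', 'budget', 'runtime']
categorical_predictors = ['rating', 'genre', 'country', 'company']
response = 'score'

# Small stand-in frame built from the first two rows of data.head()
sample = pd.DataFrame({
    'votes': [927000.0, 65000.0],
    'gross': [46998772.0, 58853106.0],
    'budget': [19000000.0, 4500000.0],
    'runtime': [146.0, 104.0],
    'rating': ['R', 'R'],
    'genre': ['Drama', 'Adventure'],
    'country': ['United Kingdom', 'United States'],
    'company': ['Warner Bros.', 'Columbia Pictures'],
    'score': [8.4, 5.8],
})

# Numeric predictors and the response should have a numeric dtype
numeric_ok = all(
    pd.api.types.is_numeric_dtype(sample[col])
    for col in numeric_predictors + [response]
)

# Categorical predictors should hold text labels, not numbers
categorical_ok = all(
    not pd.api.types.is_numeric_dtype(sample[col])
    for col in categorical_predictors
)

print(numeric_ok, categorical_ok)
```

With the real dataset, the same check could be run on `data` instead of `sample`.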


### Cleaning dataset to include our chosen predictors

Our chosen variables cover a movie's budget, gross, votes, and runtime, as well as its genre, rating, and country. We selected them because each could plausibly help predict whether a movie is successful. For example, a larger budget usually means higher production value, which may contribute to a more "successful" movie.

In [4]:
# keep only the chosen predictors and the response variable
movieData = data[['gross', 'runtime', 'budget','votes','country','genre', 'rating', 'company', 'score']]
movieData.shape

(7668, 9)

#### Dropping company as predictor

In [5]:
# count how many movies each company has
company_counts = movieData['company'].value_counts()
# turn the counts into a two-column DataFrame for display
company_counts_df = company_counts.reset_index()
company_counts_df.columns = ['Company', 'Count']
print(company_counts_df)

                       Company  Count
0           Universal Pictures    377
1                 Warner Bros.    334
2            Columbia Pictures    332
3           Paramount Pictures    320
4        Twentieth Century Fox    240
...                        ...    ...
2380  Digital Image Associates      1
2381    Kopelson Entertainment      1
2382              Clavius Base      1
2383    Tim Burton Productions      1
2384               PK 65 Films      1

[2385 rows x 2 columns]


We chose not to include the 'company' variable in the analysis because it has too many categories to be informative. As the output above shows, there are 2385 different companies. Including a categorical variable with this many levels can significantly increase the complexity of our model. Furthermore, some companies have only one movie associated with them, leading to sparse data for those categories. Sparse data can make it difficult for the model to learn meaningful patterns and relationships.

Therefore, we decided to drop company as a categorical predictor, even though we initially expected it to be useful.
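
The cardinality argument can be made concrete by counting the distinct levels, the share of levels that occur only once, and the number of indicator columns a one-hot encoding would create. The sketch below is illustrative only: the series `companies` is a small made-up stand-in for `movieData['company']`, and one-hot encoding via `pd.get_dummies` is an assumed encoding choice, not something the notebook has applied at this point.

```python
import pandas as pd

# Hypothetical stand-in for movieData['company'] (counts are made up)
companies = pd.Series(
    ['Universal Pictures'] * 4
    + ['Warner Bros.'] * 3
    + ['Clavius Base', 'PK 65 Films', 'Kopelson Entertainment'],
    name='company',
)

counts = companies.value_counts()

# Number of distinct companies (levels of the categorical variable)
n_levels = counts.size

# Companies that appear exactly once, and their share of all levels
n_singletons = int((counts == 1).sum())
singleton_share = n_singletons / n_levels

# One-hot encoding creates one indicator column per level
n_dummy_columns = pd.get_dummies(companies).shape[1]

print(n_levels, n_singletons, singleton_share, n_dummy_columns)
```

Run on the real column, `n_levels` would match the 2385 rows of the count table above, and one-hot encoding would add the same number of indicator columns.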

In [6]:
# work on an explicit copy so the drop does not act on a slice of `data`
movieData = movieData.copy()
movieData.drop(columns=['company'], inplace=True)

#### Dropping null values

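We drop every row that contains a NaN value in any of the remaining columns, then confirm that none are left. Dropping these rows gives us a clean dataset to work with, as missing values can disrupt our analysis. Most of the dropped rows come from budget, which has the fewest non-null entries (5497 of 7668).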
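
A per-column count of missing values shows which columns account for the rows lost to `dropna()`. The following is a minimal sketch on a made-up frame: the name `toy` and all of its values are illustrative assumptions, not the real data.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing values (illustrative only)
toy = pd.DataFrame({
    'budget': [10.0, np.nan, 30.0, np.nan],
    'gross': [100.0, 200.0, np.nan, 400.0],
    'score': [8.4, 5.8, 8.7, 7.7],
})

# Missing values per column, largest first
missing_per_column = toy.isna().sum().sort_values(ascending=False)
print(missing_per_column)

# Drop every row with at least one NaN, then renumber the rows
rows_before = len(toy)
cleaned = toy.dropna().reset_index(drop=True)
rows_dropped = rows_before - len(cleaned)

print(f"Dropped {rows_dropped} of {rows_before} rows")
```

Applied to `movieData`, the same `isna().sum()` call would list the missing count for each of the eight remaining columns.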

In [7]:
# drop all the NaN values
movieData = movieData.dropna()

# reset the index of the rows of the DataFrame
movieData = movieData.reset_index(drop=True)

print(f"The shape of the new dataset: {movieData.shape}")

The shape of the new dataset: (5423, 8)


In [8]:
# checking if NaNs exist in our dataset after dropping
movieData.isnull().values.any()

False

In [9]:
movieData.head()

| | gross | runtime | budget | votes | country | genre | rating | score |
|---|---|---|---|---|---|---|---|---|
| 0 | 46998772.0 | 146.0 | 19000000.0 | 927000.0 | United Kingdom | Drama | R | 8.4 |
| 1 | 58853106.0 | 104.0 | 4500000.0 | 65000.0 | United States | Adventure | R | 5.8 |
| 2 | 538375067.0 | 124.0 | 18000000.0 | 1200000.0 | United States | Action | PG | 8.7 |
| 3 | 83453539.0 | 88.0 | 3500000.0 | 221000.0 | United States | Comedy | PG | 7.7 |
| 4 | 39846344.0 | 98.0 | 6000000.0 | 108000.0 | United States | Comedy | R | 7.3 |


In [11]:
movieData.to_csv('cleaned-movie-dataset.csv', index=False)
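
Passing `index=False` keeps the row index out of the saved file, so reloading it does not produce an extra `Unnamed: 0` column. The sketch below checks this round trip under stated assumptions: it uses an in-memory buffer instead of a file on disk, and the frame `toy` is a small stand-in built from the first two rows of the cleaned data.

```python
import io

import pandas as pd

# Small stand-in for the cleaned dataset (first two rows, three columns)
toy = pd.DataFrame({
    'gross': [46998772.0, 58853106.0],
    'genre': ['Drama', 'Adventure'],
    'score': [8.4, 5.8],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read the CSV text back in
buffer.seek(0)
reloaded = pd.read_csv(buffer)

same_shape = reloaded.shape == toy.shape
no_index_column = 'Unnamed: 0' not in reloaded.columns

print(same_shape, no_index_column)
```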