# **Part 1: Data cleaning and preparation**
---
### **Question:** What factors have the greatest influence on the success of a movie?

**Dataset:** [TMBD Movie Dataset](https://www.kaggle.com/datasets/successikuku/tmbd-movie-dataset)

Data cleaning and preparation is the process of identifying and then correcting or removing errors, inconsistencies, and inaccuracies in raw data so that it is suitable for analysis.

The quality of the data is critical to the success of any analysis, so these steps are essential to ensure that the data is accurate, complete, and consistent. The process typically involves several steps, including data profiling, data cleansing, data transformation, and data integration.

Cleaning and preparing the data helps avoid drawing incorrect or biased conclusions from flawed data. It increases the accuracy and reliability of the results, which can lead to better decision-making and improved business outcomes.






In [1]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt 
sb.set() 

In [2]:
movie_data = pd.read_csv('TMBD Movie Dataset.csv')

### We use *profit* and *popularity* as the main measures of success

In [3]:
profit = pd.DataFrame(movie_data['profit'])
popularity = pd.DataFrame(movie_data['popularity'])

### We select the *factors* that we identified as relevant to the *success* of a movie



In [4]:
budget = pd.DataFrame(movie_data['budget'])
cast = pd.DataFrame(movie_data['cast'])
director = pd.DataFrame(movie_data['director'])
genres = pd.DataFrame(movie_data['genres'])
runtime = pd.DataFrame(movie_data['runtime'])
production_companies = pd.DataFrame(movie_data['production_companies'])
release_year = pd.DataFrame(movie_data['release_year'])
release_date = pd.DataFrame(movie_data['release_date'])
allfactors = movie_data[['budget', 'cast', 'director', 'genres', 'production_companies', 'release_year', 'release_date', 'runtime', 'popularity', 'profit']]

### Removing NaN values

Since we only want complete records, we remove every movie (row) that contains a NaN value.

In [5]:
allfactors = allfactors.dropna()  # dropna returns a new DataFrame, so assign the result
allfactors = allfactors.reset_index(drop=True)
print(f"The shape of the new dataset: {allfactors.shape}")
allfactors.isnull().values.any()

The shape of the new dataset: (1287, 10)


False
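As a minimal sketch of how `dropna` behaves, the toy table below (made-up values, not taken from the dataset) shows that the method returns a new DataFrame and leaves the original unchanged, so the result has to be assigned for the rows to actually be removed. The names `toy` and `toy_clean` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the movie table (values are made up for illustration)
toy = pd.DataFrame({
    "budget": [100.0, np.nan, 300.0],
    "director": ["A", "B", None],
})

# dropna() returns a new DataFrame; without assignment the original is untouched
toy.dropna()
rows_without_assignment = len(toy)

# Assigning the result keeps only the fully filled rows
toy_clean = toy.dropna().reset_index(drop=True)
rows_with_assignment = len(toy_clean)

print(rows_without_assignment, rows_with_assignment)
```

Comparing the shape before and after the call is a quick way to confirm how many rows were dropped.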

### Splitting of data 

We noticed that some factors, such as *cast*, *director*, *genres* and *production_companies*, hold several values in a single field, separated by `|`. We therefore split these fields and use the `explode` function so that each row contains only one value. We also split *release_date* into its components to extract the release *month*.

In [6]:
allfactors[['year', 'month', 'day']] = allfactors['release_date'].str.split('-', expand=True)
allfactors.drop(['year', 'day'], axis=1, inplace=True)
factors = allfactors[['budget', 'cast', 'director', 'genres', 'production_companies', 'release_year', 'release_date', 'runtime', 'month']]
success = allfactors[['popularity', 'profit']]
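The date split can be illustrated on a small example. This is only a sketch with made-up dates in the same `YYYY-MM-DD` string format; the names `toy`, `parts` and `month_number` are assumptions and do not appear in the notebook.

```python
import pandas as pd

# Made-up dates in the same YYYY-MM-DD string format as release_date
toy = pd.DataFrame({"release_date": ["2015-06-09", "1973-07-05"]})

# expand=True returns one column per piece instead of a single column of lists
parts = toy["release_date"].str.split("-", expand=True)
toy["month"] = parts[1]

# The pieces are still strings; cast to int if a numeric month is needed
toy["month_number"] = toy["month"].astype(int)
print(toy)
```

Keeping the month as a string preserves the leading zero, while the integer version is easier to sort or feed into a model.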


In [7]:
cast["cast"] = cast["cast"].str.split("|")
cast = cast.explode("cast")
director["director"] = director["director"].str.split("|")
director = director.explode("director")
production_companies["production_companies"] = production_companies["production_companies"].str.split("|")
production_companies = production_companies.explode("production_companies")
genres["genres"] = genres["genres"].str.split("|")
genres = genres.explode("genres")
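To make the effect of `str.split` followed by `explode` concrete, here is a small sketch on made-up genre strings. The name `toy_genres` is hypothetical.

```python
import pandas as pd

# Two made-up movies; the first one has two genres in a single field
toy_genres = pd.DataFrame({"genres": ["Action|Thriller", "Drama"]})

# Step 1: turn each "a|b" string into a Python list ["a", "b"]
toy_genres["genres"] = toy_genres["genres"].str.split("|")

# Step 2: explode gives every list element its own row
exploded = toy_genres.explode("genres")
print(exploded)
```

Note that `explode` repeats the original index for rows that came from the same movie, which makes it possible to join the exploded values back to the other columns later.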

### One-hot encoding of specific data

One-hot encoding converts categorical data into numerical form: each category becomes its own column, holding 1 if the row belongs to that category and 0 otherwise. This is needed because most machine learning models can only be applied to numerical data.

In [9]:
def encodetable(y, separator): 
    options_list = []
    
    # iterate through data and find all available options
    for val in y:
        options = str(val).split(separator)
        options_list.append(options)
    
    # options_list is a list of lists containing the available options;
    # flatten it into a single non-nested list, then convert
    # to a set and back to a list to remove duplicate options
    options = list(set([val for option in options_list for val in option]))
    
    # sort the list so the DataFrame columns are sorted
    options.sort()
    
    # create a DataFrame of zeros with shape (len(y), len(options))
    df = pd.DataFrame(0, index=range(len(y)), columns=options)
    
    # set value to 1 if the option was selected
    for index, vals in enumerate(y):
        options = str(vals).split(separator)
        for val in options:
            df.at[index, val] = 1
    return df
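For comparison, pandas also provides `Series.str.get_dummies`, which builds a similar 0/1 table from delimiter-separated strings. The sketch below uses made-up genre strings; it is an alternative shown for illustration, not the function used in this notebook, and its handling of missing values may differ from `encodetable`.

```python
import pandas as pd

# Made-up genre strings using the same "|" separator as the dataset
toy = pd.Series(["Action|Thriller", "Drama", "Action|Drama"])

# One column per distinct option, 1 where the row contains that option
encoded = toy.str.get_dummies(sep="|")
print(encoded)
```

Either approach yields one column per distinct option, so the number of columns grows with the number of unique values, which is why the cast table is so much wider than the genre table.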

In [10]:
encoded_cast = encodetable(factors['cast'], '|')
encoded_cast

Unnamed: 0,50 Cent,A. Michael Baldwin,A.J. Cook,Aamir Khan,Aaran Thomas,Aaron Burns,Aaron Eckhart,Aaron Paul,Aaron Stanford,Aaron Taylor-Johnson,...,Zenzo Ngqobe,Zhang Ziyi,Zineb Oukach,Zoe Saldana,Zoe Sloane,Zooey Deschanel,Zoë Bell,Zoë Kravitz,Ólafur Darri Ólafsson,สรพงษ์ ชาตรี
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1282,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1283,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1284,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1285,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
encoded_genre = encodetable(factors['genres'], '|')
encoded_genre

Unnamed: 0,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,Foreign,History,Horror,Music,Mystery,Romance,Science Fiction,Thriller,War,Western
0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0
1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0
2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0
3,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
4,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1282,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1283,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1284,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1285,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0


In [12]:
# assign the result instead of using inplace=True, which avoids pandas'
# SettingWithCopyWarning because factors was created as a slice of allfactors
factors = factors.drop(['genres', 'cast', 'director', 'production_companies'], axis=1)
factors


Unnamed: 0,budget,release_year,release_date,runtime,month
0,150000000.0,2015,2015-06-09,124,06
1,150000000.0,2015,2015-05-13,120,05
2,110000000.0,2015,2015-03-18,119,03
3,200000000.0,2015,2015-12-15,136,12
4,190000000.0,2015,2015-04-01,137,04
...,...,...,...,...,...
1282,7000000.0,1973,1973-07-05,121,07
1283,11000000.0,1965,2065-12-16,130,12
1284,7000000.0,1969,2069-12-12,142,12
1285,300000.0,1978,1978-10-25,91,10


### Pickle data

Lastly, we use the `to_pickle` function so that the cleaned data can be easily loaded from another file.

In [13]:
factors.to_pickle('factors.pkl')
success.to_pickle('success.pkl')
cast.to_pickle('cast.pkl')
director.to_pickle('director.pkl')
genres.to_pickle('genres.pkl')
encoded_genre.to_pickle('encoded_genre.pkl')
encoded_cast.to_pickle('encoded_cast.pkl')
production_companies.to_pickle('production_companies.pkl')
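The saved files can be loaded back with `pd.read_pickle`. The following is a minimal round-trip sketch on made-up data, written to a temporary directory so it does not touch the real `.pkl` files; the file name `toy_factors.pkl` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the cleaned factors table
toy = pd.DataFrame({"budget": [150000000.0, 300000.0], "month": ["06", "10"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_factors.pkl")  # hypothetical file name
    toy.to_pickle(path)               # save, as done for factors.pkl above
    restored = pd.read_pickle(path)   # load it back, e.g. in another notebook

# Pickle preserves dtypes and the index, so the round trip is lossless
print(restored.equals(toy))
```

Unlike CSV, pickle keeps column types intact, so string months such as `"06"` keep their leading zero when reloaded.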