# Box Office Movie Franchise Predictor

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Load Data File(s)

In [2]:
data_path = './data/'
file_name = 'movies_with_sequels_final.csv'

df = pd.read_csv(data_path + file_name)

## Data Wrangling and Cleaning

In [3]:
df.head()

Unnamed: 0,Title,url,IMDB Score,Metacritic,Runtime (mins),Budget,Opening Weekend,Gross USA,Gross World,Release Date,Rating,Genres,Country
0,Spider-Man (2002),http://www.imdb.com/title/tt0145487/,7.3,73,121,"$139,000,000",114844116,407022860,825025036,3-May-02,PG-13,Action Adventure Sci-Fi,USA
1,Spider-Man 2 (2004),http://www.imdb.com/title/tt0316654/,7.3,83,127,"$200,000,000",88156227,373585825,788976453,30-Jun-04,PG-13,Action Adventure Sci-Fi,USA
2,The Matrix (1999),http://www.imdb.com/title/tt0133093/,8.7,73,136,"$63,000,000",27788331,171479930,465343787,31-Mar-99,R,Action Sci-Fi,USA
3,The Matrix Reloaded (2003),http://www.imdb.com/title/tt0234215/,7.2,62,138,"$150,000,000",91774413,281576461,741846459,15-May-03,R,Action Sci-Fi,USA
4,The Lord of the Rings: The Fellowship of the R...,http://www.imdb.com/title/tt0120737/,8.8,92,178,"$93,000,000",47211490,315544750,887832826,19-Dec-01,PG-13,Action Adventure Drama,New Zealand


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1080 entries, 0 to 1079
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Title            1080 non-null   object 
 1   url              1080 non-null   object 
 2   IMDB Score       1080 non-null   float64
 3   Metacritic       1080 non-null   object 
 4   Runtime (mins)   1080 non-null   object 
 5   Budget           1080 non-null   object 
 6   Opening Weekend  1080 non-null   object 
 7   Gross USA        1080 non-null   object 
 8   Gross World      1080 non-null   object 
 9   Release Date     1080 non-null   object 
 10  Rating           1080 non-null   object 
 11  Genres           1080 non-null   object 
 12  Country          1080 non-null   object 
dtypes: float64(1), object(12)
memory usage: 109.8+ KB


The info table above reports no missing values, but that is misleading. Missing values do exist; the web-scraping step recorded them as the string 'None', so pandas counts them as non-null and reads the affected columns as object. The scraper has since been corrected to write np.nan instead. Nonetheless, the replacement is also applied here.
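One way to confirm this before cleaning is to count the literal 'None' strings in each column. The sketch below is only an illustration: `toy` is a hypothetical stand-in for `df`, with made-up rows in the same style as the scraped data.

```python
import pandas as pd

# Hypothetical stand-in for df: missing entries stored as the string 'None'.
toy = pd.DataFrame({
    'IMDB Score': [7.3, 8.7, 7.2],
    'Metacritic': ['73', 'None', '62'],
    'Opening Weekend': ['114844116', 'None', 'None'],
})

# isnull() finds nothing missing, because 'None' is an ordinary string here.
null_counts = toy.isnull().sum()

# isin() counts the placeholder strings column by column.
placeholder_counts = toy.isin(['None']).sum()
print(placeholder_counts)
```

Running the same `isin(['None']).sum()` call on `df` should show which columns carry the placeholder.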

In [5]:
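def replace_none_w_nan(data):
    """Return a copy of `data` with the string 'None' replaced by np.nan."""
    # DataFrame.replace handles every column in one call. The chained form
    # data[col].replace(..., inplace=True) may not update `data` in recent
    # pandas versions, so the result is returned instead of mutated in place.
    return data.replace('None', np.nan)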
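An alternative, sketched here under assumptions, is to declare the placeholder when the file is read, so that no replacement pass is needed afterwards. The in-memory CSV and its column subset are hypothetical; only the `na_values` argument of `pd.read_csv` is the point of the example.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same shape as the scraped file.
sample_csv = io.StringIO(
    "Title,Metacritic,Opening Weekend\n"
    "Movie A,73,114844116\n"
    "Movie B,None,None\n"
)

# na_values adds 'None' to the strings that read_csv treats as missing.
sample = pd.read_csv(sample_csv, na_values=['None'])
print(sample.dtypes)
```

With the placeholder parsed as NaN, purely numeric columns in this sample load as float64. Columns in the real file that contain other non-numeric text, such as the formatted Budget values, would still load as object.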


In [6]:
df = replace_none_w_nan(df)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1080 entries, 0 to 1079
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Title            1080 non-null   object 
 1   url              1080 non-null   object 
 2   IMDB Score       1080 non-null   float64
 3   Metacritic       975 non-null    object 
 4   Runtime (mins)   1077 non-null   object 
 5   Budget           1078 non-null   object 
 6   Opening Weekend  953 non-null    object 
 7   Gross USA        1006 non-null   object 
 8   Gross World      1079 non-null   object 
 9   Release Date     1080 non-null   object 
 10  Rating           1080 non-null   object 
 11  Genres           1080 non-null   object 
 12  Country          1080 non-null   object 
dtypes: float64(1), object(12)
memory usage: 109.8+ KB
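The non-null counts now reflect the missing values, but every column except IMDB Score is still object. A possible next step, shown as a minimal sketch rather than the method this notebook goes on to use, is to convert the numeric and date columns explicitly. The frame `toy` is hypothetical, and the formats it assumes (a `$` prefix with comma separators, dates such as `3-May-02`) are taken from the `df.head()` output.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the formats shown by df.head().
toy = pd.DataFrame({
    'Budget': ['$139,000,000', np.nan, '$63,000,000'],
    'Metacritic': ['73', np.nan, '73'],
    'Release Date': ['3-May-02', '30-Jun-04', '31-Mar-99'],
})

# Strip the dollar sign and thousands separators, then convert to numbers.
toy['Budget'] = pd.to_numeric(
    toy['Budget'].str.replace(r'[$,]', '', regex=True), errors='coerce'
)

# Plain numeric strings only need the conversion itself.
toy['Metacritic'] = pd.to_numeric(toy['Metacritic'], errors='coerce')

# Day, abbreviated month name, two-digit year.
toy['Release Date'] = pd.to_datetime(toy['Release Date'], format='%d-%b-%y')
print(toy.dtypes)
```

With `errors='coerce'`, any value that does not parse becomes NaN, so it is worth comparing the non-null counts before and after the conversion.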
