Import the required libraries, load the data, and take a first look at the bom_df data set.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
bom_df = pd.read_csv('zippedData/bom.movie_gross.csv.gz')

In [3]:
bom_df.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [4]:
bom_df.shape

(3387, 5)

In [5]:
bom_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


The foreign_gross column is missing data (only 2037 of its 3387 values are non-null) and is stored as object rather than as a numeric type. These are the first things we will address. Missing values alone do not explain the object dtype, since a float column can hold NaN, so at least some entries are probably stored as strings. We first examine the data set and then decide how best to deal with the missing data.
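
One way to check why a column ends up as object is to count the Python type of each entry and to flag the non-missing entries that do not parse as numbers. The sketch below uses a small hypothetical stand-in series (the comma-formatted value is an assumption for illustration, not a row taken from the file); in the notebook, `sample` would be `bom_df['foreign_gross']`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for bom_df['foreign_gross']; values are illustrative only.
sample = pd.Series(['652000000', '1,131.6', np.nan], dtype=object)

# Count the Python type of each entry: strings show up as 'str', NaN as 'float'.
type_counts = pd.Series([type(v).__name__ for v in sample]).value_counts()
print(type_counts)

# Non-missing entries that pd.to_numeric cannot parse as written.
parsed = pd.to_numeric(sample, errors='coerce')
unparseable = sample[sample.notna() & parsed.isna()]
print(unparseable)
```

If `unparseable` is non-empty, those entries need cleaning before the column can be converted.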

In [6]:
bom_df['foreign_gross'].tail()

3382    NaN
3383    NaN
3384    NaN
3385    NaN
3386    NaN
Name: foreign_gross, dtype: object

In [7]:
bom_df.describe()

Unnamed: 0,domestic_gross,year
count,3359.0,3387.0
mean,28745850.0,2013.958075
std,66982500.0,2.478141
min,100.0,2010.0
25%,120000.0,2012.0
50%,1400000.0,2014.0
75%,27900000.0,2016.0
max,936700000.0,2018.0


In [8]:
bom_df['foreign_gross'].isna().sum()

1350

In [9]:
bom_df.dropna(subset=['foreign_gross'], inplace=True)

In [10]:
bom_df['foreign_gross'].isna().sum()

0

I decided to drop the rows with a missing foreign_gross value rather than fill them with the mean or median, since imputing could skew the data. Note that this removes 1350 of the 3387 rows. Next, I would like to convert the foreign_gross column to numeric form.

In [11]:
bom_df['foreign_gross'].head()

0    652000000
1    691300000
2    664300000
3    535700000
4    513900000
Name: foreign_gross, dtype: object
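
The conversion itself is not shown in this notebook, so the following is only a sketch of one way to do it. It assumes the remaining values are digit strings, possibly with thousands separators; the three sample values are a hypothetical stand-in for the column (the last one is invented to illustrate the comma case).

```python
import pandas as pd

# Hypothetical stand-in for the non-null foreign_gross column (stored as strings).
foreign_gross = pd.Series(['652000000', '691300000', '1,131.6'], dtype=object)

# Remove thousands separators, then parse. errors='coerce' turns anything
# that still cannot be parsed into NaN instead of raising an error.
foreign_gross_num = pd.to_numeric(
    foreign_gross.str.replace(',', '', regex=False),
    errors='coerce',
)
print(foreign_gross_num.dtype)
```

In the notebook the result would be assigned back with `bom_df['foreign_gross'] = ...`. It is worth checking `isna().sum()` afterwards to see whether any values were coerced to NaN.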

In [12]:
tmdbmovies = pd.read_csv('zippedData/tmdb.movies.csv.gz')
tmdbmovies.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [13]:
tnmovie_budgets = pd.read_csv('zippedData/tn.movie_budgets.csv.gz')
tnmovie_budgets.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [22]:
# Strip '$' and ',' from each budget string and convert it to an integer.
production_budget = tnmovie_budgets['production_budget']
production_budget_list = []
for num in production_budget:
    cleaned = num.replace(',', '').replace('$', '')
    production_budget_list.append(int(cleaned))


In [23]:
# Same cleaning for the worldwide gross column.
worldwide_gross = tnmovie_budgets['worldwide_gross']
worldwide_gross_list = []
for num in worldwide_gross:
    cleaned = num.replace(',', '').replace('$', '')
    worldwide_gross_list.append(int(cleaned))


In [29]:
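# Same cleaning for the domestic gross column.
domestic_gross = tnmovie_budgets['domestic_gross']
domestic_gross_list = []
for num in domestic_gross:
    cleaned = num.replace(',', '').replace('$', '')
    # Append to domestic_gross_list, not worldwide_gross_list.
    domestic_gross_list.append(int(cleaned))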
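
The three loops above repeat the same cleaning step, so a single vectorized helper can reduce the chance of a copy-paste slip. This is a sketch, not the notebook's method: `currency_to_int` is a hypothetical helper name, and the sample frame holds only the first two rows of the `tnmovie_budgets.head()` output shown earlier.

```python
import pandas as pd

def currency_to_int(column):
    """Convert strings like '$425,000,000' to 64-bit integers (hypothetical helper)."""
    # Remove every '$' and ',' in one pass, then cast the digit strings.
    # int64 is requested explicitly because some grosses exceed the 32-bit range.
    return column.str.replace(r'[$,]', '', regex=True).astype('int64')

# First two rows of tnmovie_budgets.head(), used here as sample data.
sample = pd.DataFrame({
    'production_budget': ['$425,000,000', '$410,600,000'],
    'domestic_gross': ['$760,507,625', '$241,063,875'],
    'worldwide_gross': ['$2,776,345,279', '$1,045,663,875'],
})

money_cols = ['production_budget', 'domestic_gross', 'worldwide_gross']
cleaned = sample[money_cols].apply(currency_to_int)
print(cleaned.dtypes)
```

Applied to the full frame, `tnmovie_budgets[money_cols] = tnmovie_budgets[money_cols].apply(currency_to_int)` would keep the cleaned values in the DataFrame, so the columns stay aligned row by row.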

In [30]:
fig, ax = plt.subplots()
ax.scatter(production_budget_list, worldwide_gross_list)
ax.set_title('worldwide')
ax.set_xlabel('production budget')
ax.set_ylabel('worldwide gross')

In [31]:
#fig, (ax) = plt.subplots()
#ax.set_title('domestic')
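
The commented-out cell suggests a matching domestic plot was planned. A minimal sketch of what it might look like follows; the two short lists are copied from the five `tnmovie_budgets.head()` rows shown earlier and stand in for the full `production_budget_list` and `domestic_gross_list`.

```python
import matplotlib.pyplot as plt

# Values from the five head() rows shown earlier, already cleaned to integers.
production_budget_sample = [425000000, 410600000, 350000000, 330600000, 317000000]
domestic_gross_sample = [760507625, 241063875, 42762350, 459005868, 620181382]

# Scatter of budget against domestic gross, mirroring the worldwide plot.
fig, ax = plt.subplots()
ax.scatter(production_budget_sample, domestic_gross_sample)
ax.set_title('domestic')
ax.set_xlabel('production budget')
ax.set_ylabel('domestic gross')
```

Both lists must have the same length for `ax.scatter` to accept them, which is a quick check that each cleaning loop filled the list it was meant to fill.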