# Data Preparation
## Step 1: Data Sources

In our analysis we look at data on movie releases from IMDB and The Numbers to answer questions for a hypothetical movie studio start-up. Our data are contained in the `../zippedData` directory of this repo and need to be unzipped and imported before they can be used in this analysis.

### How did we choose our data?

We decided to use data from `tn.movie_budgets.csv.gz`, `imdb.title.basics.csv.gz` and `imdb.name.basics.csv.gz`. We chose `tn.movie_budgets.csv.gz` because it provides detailed information about revenue and production costs, which lets us ask and answer more meaningful questions about the overall return on investment for each film. We also included `imdb.title.basics.csv.gz` in order to take a more detailed look at what _types_ of films performed best over time. Finally, we took a look at the personnel files in `imdb.name.basics.csv.gz`, linked to titles through the director IDs in `imdb.title.crew.csv.gz`, to answer questions about which industry professionals were involved in successful titles. After duplicate titles are dropped, the budgets data contains 5,698 unique movie records.

First we will import the required packages and build an unzip function to help access our relevant files.

In [1]:
!ls -la ../zippedData/

total 23100
drwxr-xr-x 1 smang 197609        0 Jun 22 19:51 .
drwxr-xr-x 1 smang 197609        0 Jun 22 21:01 ..
-rw-r--r-- 1 smang 197609    53544 Jun 22 19:51 bom.movie_gross.csv.gz
-rw-r--r-- 1 smang 197609 18070960 Jun 22 19:51 imdb.name.basics.csv.gz
-rw-r--r-- 1 smang 197609  3459897 Jun 22 19:51 imdb.title.basics.csv.gz
-rw-r--r-- 1 smang 197609  1898523 Jun 22 19:51 imdb.title.crew.csv.gz
-rw-r--r-- 1 smang 197609   153218 Jun 22 19:51 tn.movie_budgets.csv.gz


In [2]:
import gzip
from io import StringIO

import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline


# Reads a gzipped csv into a DataFrame: falls back to latin-1 when the bytes
# are not valid utf-8, and to tab separation when comma parsing fails
def unzip_csv(file_location):
    with gzip.open(file_location, 'rb') as file:
        content = file.read()
    try:
        content_str = content.decode('utf-8')
    except UnicodeDecodeError:
        content_str = content.decode('latin-1')
    try:
        return pd.read_csv(StringIO(content_str))
    except Exception:
        # use a fresh buffer so the retry starts from the beginning of the file
        return pd.read_csv(StringIO(content_str), sep='\t')

    
#hard-coding the file locations and nicknames into two parallel lists for future reference
file_locations = ['../zippedData/imdb.name.basics.csv.gz'
                  ,'../zippedData/imdb.title.basics.csv.gz'
                  ,'../zippedData/tn.movie_budgets.csv.gz'
                  ,'../zippedData/imdb.title.crew.csv.gz']

file_nicknames = ['name','basics','budgets','crew']


#zip pairs each nickname with its file location and dict() turns those pairs into a lookup,
#so we have a reference from each raw df's nickname to its location on the drive.
file_dict = dict(zip(file_nicknames, file_locations))

#we unzip and define frames
name= unzip_csv(file_dict['name'])
basics= unzip_csv(file_dict['basics'])
budgets= unzip_csv(file_dict['budgets'])
crew= unzip_csv(file_dict['crew'])
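
As an alternative to the manual unzip helper above, pandas can usually decompress gzip files on its own. The following is a minimal sketch, assuming a comma-separated file; the `read_gz_csv` name and the demo file are hypothetical, and the demo file is written to a temporary directory rather than read from `../zippedData`.

```python
import gzip
import os
import tempfile

import pandas as pd


def read_gz_csv(path, **kwargs):
    """Hypothetical helper: try UTF-8 first, then fall back to latin-1."""
    try:
        return pd.read_csv(path, compression="gzip", encoding="utf-8", **kwargs)
    except UnicodeDecodeError:
        return pd.read_csv(path, compression="gzip", encoding="latin-1", **kwargs)


# Build a tiny gzipped CSV so the sketch runs without the project data
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.csv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write("id,movie\n1,Avatar\n2,Dark Phoenix\n")

demo = read_gz_csv(demo_path)
print(demo.shape)
```

Extra keyword arguments such as `sep='\t'` are passed straight through to `pd.read_csv`, so a tab-separated file could be handled by the caller instead of by a retry.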

## Step 2: Cleaning the Data

In the next step we take the raw data frames and format the values to their appropriate data types, then drop duplicates, null values, and redundant or irrelevant columns. We'll examine the head of our budgets DataFrame below as a starting point:

### Budgets

In [3]:
budgets.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [4]:
budgets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In [5]:
budgets.isna().sum()

id                   0
release_date         0
movie                0
production_budget    0
domestic_gross       0
worldwide_gross      0
dtype: int64

After looking at some summary data for the budgets data frame, we can see that we have a few tasks to complete before it is useful for analysis. There is a redundant index column (`id`), the dollar amounts are stored as strings, and the release dates are stored as strings rather than datetimes.

In [6]:
#id column is a redundant index so we're dropping it
budgets.drop('id', axis=1, inplace=True)

#setting date column to datetime object for use in charts etc.
budgets['release_date'] = pd.to_datetime(budgets['release_date'])

#stripping any stray whitespace from the column names and movie titles
budgets.columns = budgets.columns.str.strip()
budgets['movie'] = budgets['movie'].str.strip()

#this function launders the money ;D
def clean_money(budgets_series):
    #the map function applies the cleanup to each cell in the given series; x[1:] skips the $
    return budgets_series.map(lambda x: int(x[1:].replace(',','')))

budgets['production_budget'] = clean_money(budgets['production_budget'])
budgets['domestic_gross'] = clean_money(budgets['domestic_gross'])
budgets['worldwide_gross'] = clean_money(budgets['worldwide_gross'])

#adding in relevant columns
budgets['foreign_gross'] = budgets.worldwide_gross - budgets.domestic_gross
budgets['profit'] = budgets.worldwide_gross - budgets.production_budget

#dropping duplicates
budgets.drop_duplicates('movie', keep='first',inplace=True)
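
The currency cleanup in `clean_money` can also be written with pandas string methods instead of a Python lambda. This is a minimal sketch of that alternative, not the notebook's own method; the sample values are copied from the budgets preview above.

```python
import pandas as pd

# Sample values copied from the budgets preview above
raw = pd.Series(["$425,000,000", "$410,600,000", "$350,000,000"])

# Remove every "$" and "," with a regular expression, then cast to integers
cleaned = raw.str.replace(r"[$,]", "", regex=True).astype("int64")
print(cleaned.tolist())
```

Unlike `x[1:]`, the regular expression does not assume the dollar sign is the first character, so it tolerates values with leading whitespace before the `$`.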

In [7]:
#looks good now
budgets.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5698 entries, 0 to 5781
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   release_date       5698 non-null   datetime64[ns]
 1   movie              5698 non-null   object        
 2   production_budget  5698 non-null   int64         
 3   domestic_gross     5698 non-null   int64         
 4   worldwide_gross    5698 non-null   int64         
 5   foreign_gross      5698 non-null   int64         
 6   profit             5698 non-null   int64         
dtypes: datetime64[ns](1), int64(5), object(1)
memory usage: 356.1+ KB


### Basics
Now that the general shape of the cleaning process has been defined, we can rinse and repeat on our other data sets, making them easier to use in later analysis.

In [8]:
basics.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [9]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   tconst           146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [10]:
basics['movie'] = basics['primary_title']

#keeping only 'movie' and 'tconst' as keys for our other data, and 'genres' for further analysis
basics.drop(['primary_title','original_title','runtime_minutes'],axis=1,inplace=True)

The columns look correct:

In [11]:
basics.columns

Index(['tconst', 'start_year', 'genres', 'movie'], dtype='object')

In [12]:
#the strip functions remove unwanted whitespace if it's lurking in there
basics.columns = basics.columns.str.strip()

for column in list(basics.columns):
    try:
        basics[column] = basics[column].str.strip()
    except AttributeError:
        #non-string columns (e.g. start_year) have no .str accessor, so we skip them
        pass

#Dropping duplicates
basics.drop_duplicates('movie', keep='first', inplace=True)

#Dropping null values
to_drop = basics[basics['genres'].isna()].index
basics.drop(to_drop,inplace=True)

#this .map would apply a .split to each genres string at every ",", turning it into a list of genres.
#it is left commented out, so genres stays a comma-separated string.
#basics['genres'] = basics['genres'].map(lambda x: x.split(","))

As shown below, the genres column is now free of null values but is still stored as a comma-separated string, so indexing into the first entry returns a single character rather than a genre:

In [13]:
basics['genres']

0            Action,Crime,Drama
1               Biography,Drama
2                         Drama
3                  Comedy,Drama
4          Comedy,Drama,Fantasy
                  ...          
146138    Adventure,History,War
146139                    Drama
146140              Documentary
146141                   Comedy
146143              Documentary
Name: genres, Length: 131180, dtype: object

In [14]:
basics['genres'][0][0]

'A'
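
If per-genre analysis is needed later, the comma-separated strings can be split into lists and expanded into one row per genre. The following is a minimal sketch under that assumption; the `genre_list` column name is hypothetical and the sample rows are copied from the basics preview above.

```python
import pandas as pd

# Sample rows copied from the basics preview above
sample = pd.DataFrame({
    "movie": ["Sunghursh", "The Other Side of the Wind"],
    "genres": ["Action,Crime,Drama", "Drama"],
})

# Split each comma-separated string into a list of genres
sample["genre_list"] = sample["genres"].str.split(",")

# explode() gives every genre its own row, which makes counting simple
per_genre = sample.explode("genre_list")
genre_counts = per_genre["genre_list"].value_counts()
print(genre_counts["Drama"])
```

Keeping the original `genres` string alongside the list leaves the existing cells that index into `genres` unchanged.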

### Name

In [15]:
name.head()

Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
0,nm0061671,Mary Ellen Bauder,,,"miscellaneous,production_manager,producer","tt0837562,tt2398241,tt0844471,tt0118553"
1,nm0061865,Joseph Bauer,,,"composer,music_department,sound_department","tt0896534,tt6791238,tt0287072,tt1682940"
2,nm0062070,Bruce Baum,,,"miscellaneous,actor,writer","tt1470654,tt0363631,tt0104030,tt0102898"
3,nm0062195,Axel Baumann,,,"camera_department,cinematographer,art_department","tt0114371,tt2004304,tt1618448,tt1224387"
4,nm0062798,Pete Baxter,,,"production_designer,art_department,set_decorator","tt0452644,tt0452692,tt3458030,tt2178256"


In [16]:
name.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 606648 entries, 0 to 606647
Data columns (total 6 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   nconst              606648 non-null  object 
 1   primary_name        606648 non-null  object 
 2   birth_year          82736 non-null   float64
 3   death_year          6783 non-null    float64
 4   primary_profession  555308 non-null  object 
 5   known_for_titles    576444 non-null  object 
dtypes: float64(2), object(4)
memory usage: 27.8+ MB


In [17]:
name.isnull().sum()

nconst                     0
primary_name               0
birth_year            523912
death_year            599865
primary_profession     51340
known_for_titles       30204
dtype: int64

In [18]:
#dropping these since they're outside the scope of our analysis
name.drop(['primary_profession','birth_year','death_year','known_for_titles'],axis=1,inplace=True)

#cleaning the object data
name.columns = name.columns.str.strip()

#for loop will work here since all columns are object data
for column in list(name.columns):
    name[column] = name[column].str.strip()

In [19]:
name.head(5)

Unnamed: 0,nconst,primary_name
0,nm0061671,Mary Ellen Bauder
1,nm0061865,Joseph Bauer
2,nm0062070,Bruce Baum
3,nm0062195,Axel Baumann
4,nm0062798,Pete Baxter


### Crew

In [20]:
crew.head()

Unnamed: 0,tconst,directors,writers
0,tt0285252,nm0899854,nm0899854
1,tt0438973,,"nm0175726,nm1802864"
2,tt0462036,nm1940585,nm1940585
3,tt0835418,nm0151540,"nm0310087,nm0841532"
4,tt0878654,"nm0089502,nm2291498,nm2292011",nm0284943


In [21]:
crew.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   tconst     146144 non-null  object
 1   directors  140417 non-null  object
 2   writers    110261 non-null  object
dtypes: object(3)
memory usage: 3.3+ MB


In [22]:
#dropping writers since it's outside the scope of our analysis
crew.drop(['writers'],axis=1,inplace=True)

#cleaning the object data
crew.columns = crew.columns.str.strip()

#for loop will work here since all columns are object data
for column in list(crew.columns):
    crew[column] = crew[column].str.strip()

Since the director is the only reason we're using this data set, we're going to drop rows with null director values.

In [23]:
to_drop = crew[crew.directors.isna()].index
crew.drop(to_drop,inplace=True)

In [24]:
crew.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 140417 entries, 0 to 146142
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   tconst     140417 non-null  object
 1   directors  140417 non-null  object
dtypes: object(2)
memory usage: 3.2+ MB


In [25]:
crew.directors = crew.directors.map(lambda x: x.split(","))

In [26]:
crew.head()

Unnamed: 0,tconst,directors
0,tt0285252,[nm0899854]
2,tt0462036,[nm1940585]
3,tt0835418,[nm0151540]
4,tt0878654,"[nm0089502, nm2291498, nm2292011]"
5,tt0879859,[nm2416460]


In [27]:
files_list = [name,basics,crew,budgets]

In [28]:
for file in files_list:
    display(file.head())

Unnamed: 0,nconst,primary_name
0,nm0061671,Mary Ellen Bauder
1,nm0061865,Joseph Bauer
2,nm0062070,Bruce Baum
3,nm0062195,Axel Baumann
4,nm0062798,Pete Baxter


Unnamed: 0,tconst,start_year,genres,movie
0,tt0063540,2013,"Action,Crime,Drama",Sunghursh
1,tt0066787,2019,"Biography,Drama",One Day Before the Rainy Season
2,tt0069049,2018,Drama,The Other Side of the Wind
3,tt0069204,2018,"Comedy,Drama",Sabse Bada Sukh
4,tt0100275,2017,"Comedy,Drama,Fantasy",The Wandering Soap Opera


Unnamed: 0,tconst,directors
0,tt0285252,[nm0899854]
2,tt0462036,[nm1940585]
3,tt0835418,[nm0151540]
4,tt0878654,"[nm0089502, nm2291498, nm2292011]"
5,tt0879859,[nm2416460]


Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,foreign_gross,profit
0,2009-12-18,Avatar,425000000,760507625,2776345279,2015837654,2351345279
1,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875,804600000,635063875
2,2019-06-07,Dark Phoenix,350000000,42762350,149762350,107000000,-200237650
3,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963,944008095,1072413963
4,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747,696540365,999721747


In [29]:
budgets['year'] = pd.DatetimeIndex(budgets['release_date']).year

In [30]:
budgets['movie'] = budgets['movie'] +' '+ budgets['year'].astype(str)

In [31]:
budgets['movie'].head()

0                                         Avatar 2009
1    Pirates of the Caribbean: On Stranger Tides 2011
2                                   Dark Phoenix 2019
3                        Avengers: Age of Ultron 2015
4              Star Wars Ep. VIII: The Last Jedi 2017
Name: movie, dtype: object

In [32]:
basics['movie'] = basics['movie'] +' '+ basics['start_year'].astype(str)

In [33]:
df = budgets.merge(basics)
df.head()

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,foreign_gross,profit,year,tconst,start_year,genres
0,2011-05-20,Pirates of the Caribbean: On Stranger Tides 2011,410600000,241063875,1045663875,804600000,635063875,2011,tt1298650,2011,"Action,Adventure,Fantasy"
1,2019-06-07,Dark Phoenix 2019,350000000,42762350,149762350,107000000,-200237650,2019,tt6565702,2019,"Action,Adventure,Sci-Fi"
2,2015-05-01,Avengers: Age of Ultron 2015,330600000,459005868,1403013963,944008095,1072413963,2015,tt2395427,2015,"Action,Adventure,Sci-Fi"
3,2018-04-27,Avengers: Infinity War 2018,300000000,678815482,2048134200,1369318718,1748134200,2018,tt4154756,2018,"Action,Adventure,Sci-Fi"
4,2017-11-17,Justice League 2017,300000000,229024295,655945209,426920914,355945209,2017,tt0974015,2017,"Action,Adventure,Fantasy"


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1349 entries, 0 to 1348
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   release_date       1349 non-null   datetime64[ns]
 1   movie              1349 non-null   object        
 2   production_budget  1349 non-null   int64         
 3   domestic_gross     1349 non-null   int64         
 4   worldwide_gross    1349 non-null   int64         
 5   foreign_gross      1349 non-null   int64         
 6   profit             1349 non-null   int64         
 7   year               1349 non-null   int64         
 8   tconst             1349 non-null   object        
 9   start_year         1349 non-null   int64         
 10  genres             1349 non-null   object        
dtypes: datetime64[ns](1), int64(7), object(3)
memory usage: 126.5+ KB
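
The merge above passes no `on` argument, so pandas joins on the column names the two frames share, which here is only `movie` (title plus year). A minimal sketch of how that key could be made explicit and checked is shown below; the toy frames reuse a few values from the previews above, and the `budgets_demo` and `basics_demo` names are hypothetical.

```python
import pandas as pd

# Toy frames; the column names mirror the notebook, the rows are a small sample
budgets_demo = pd.DataFrame({
    "movie": ["Dark Phoenix 2019", "Avatar 2009"],
    "profit": [-200237650, 2351345279],
})
basics_demo = pd.DataFrame({
    "movie": ["Dark Phoenix 2019", "Sunghursh 2013"],
    "genres": ["Action,Adventure,Sci-Fi", "Action,Crime,Drama"],
})

# An outer merge with indicator=True labels each row as left_only, right_only or both;
# validate raises an error if the key is duplicated on either side
merged = budgets_demo.merge(
    basics_demo,
    on="movie",
    how="outer",
    indicator=True,
    validate="one_to_one",
)

# Rows found in both frames are what the default inner merge would keep
matched = merged[merged["_merge"] == "both"]
print(merged["_merge"].value_counts())
```

Inspecting the `left_only` rows is one way to see which budget records had no title-and-year match in the IMDB data.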


In [35]:
name.head()

Unnamed: 0,nconst,primary_name
0,nm0061671,Mary Ellen Bauder
1,nm0061865,Joseph Bauer
2,nm0062070,Bruce Baum
3,nm0062195,Axel Baumann
4,nm0062798,Pete Baxter


In [36]:
crew.head()

Unnamed: 0,tconst,directors
0,tt0285252,[nm0899854]
2,tt0462036,[nm1940585]
3,tt0835418,[nm0151540]
4,tt0878654,"[nm0089502, nm2291498, nm2292011]"
5,tt0879859,[nm2416460]


In [37]:
df = df.merge(crew,how='left')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1349 entries, 0 to 1348
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   release_date       1349 non-null   datetime64[ns]
 1   movie              1349 non-null   object        
 2   production_budget  1349 non-null   int64         
 3   domestic_gross     1349 non-null   int64         
 4   worldwide_gross    1349 non-null   int64         
 5   foreign_gross      1349 non-null   int64         
 6   profit             1349 non-null   int64         
 7   year               1349 non-null   int64         
 8   tconst             1349 non-null   object        
 9   start_year         1349 non-null   int64         
 10  genres             1349 non-null   object        
 11  directors          1349 non-null   object        
dtypes: datetime64[ns](1), int64(7), object(4)
memory usage: 137.0+ KB


In [38]:
df_e = df.explode('directors')

In [39]:
df_e = df_e.merge(name,left_on='directors',right_on='nconst')
df_e.columns

Index(['release_date', 'movie', 'production_budget', 'domestic_gross',
       'worldwide_gross', 'foreign_gross', 'profit', 'year', 'tconst',
       'start_year', 'genres', 'directors', 'nconst', 'primary_name'],
      dtype='object')

In [40]:
df_e.drop(['directors','nconst','tconst'],axis=1,inplace=True)

In [41]:
df_e.head()
df_e['primary_name'].isna().sum()

0

In [42]:
df_e = df_e.groupby('movie')['primary_name'].apply(", ".join).reset_index()

In [43]:
df = df.merge(df_e)

In [44]:
df['director'] = df['primary_name']
df.drop(['tconst','directors','primary_name'],axis=1,inplace=True)
df.head()

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,foreign_gross,profit,year,start_year,genres,director
0,2011-05-20,Pirates of the Caribbean: On Stranger Tides 2011,410600000,241063875,1045663875,804600000,635063875,2011,2011,"Action,Adventure,Fantasy",Rob Marshall
1,2019-06-07,Dark Phoenix 2019,350000000,42762350,149762350,107000000,-200237650,2019,2019,"Action,Adventure,Sci-Fi",Simon Kinberg
2,2015-05-01,Avengers: Age of Ultron 2015,330600000,459005868,1403013963,944008095,1072413963,2015,2015,"Action,Adventure,Sci-Fi",Joss Whedon
3,2018-04-27,Avengers: Infinity War 2018,300000000,678815482,2048134200,1369318718,1748134200,2018,2018,"Action,Adventure,Sci-Fi","Anthony Russo, Joe Russo"
4,2017-11-17,Justice League 2017,300000000,229024295,655945209,426920914,355945209,2017,2017,"Action,Adventure,Fantasy",Zack Snyder
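
The director lookup in cells 38 to 43 follows an explode, merge, then group-and-join pattern. The sketch below reproduces that pattern on a toy example; the titles and names come from the table above, while the ids (`nm_a`, `nm_b`, `nm_c`) are hypothetical placeholders rather than real IMDB identifiers.

```python
import pandas as pd

# Placeholder ids are hypothetical; titles and names come from the table above
films = pd.DataFrame({
    "movie": ["Avengers: Infinity War 2018", "Justice League 2017"],
    "directors": [["nm_a", "nm_b"], ["nm_c"]],
})
people = pd.DataFrame({
    "nconst": ["nm_a", "nm_b", "nm_c"],
    "primary_name": ["Anthony Russo", "Joe Russo", "Zack Snyder"],
})

# One row per (movie, director id), then look up each director's name
long_form = films.explode("directors").merge(
    people, left_on="directors", right_on="nconst"
)

# Collapse back to one row per movie with the names joined into a single string
director_names = (
    long_form.groupby("movie")["primary_name"].apply(", ".join).reset_index()
)
result = dict(zip(director_names["movie"], director_names["primary_name"]))
print(result)
```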


In [45]:
df['genres'].loc[0][0]

'A'

In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1349 entries, 0 to 1348
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   release_date       1349 non-null   datetime64[ns]
 1   movie              1349 non-null   object        
 2   production_budget  1349 non-null   int64         
 3   domestic_gross     1349 non-null   int64         
 4   worldwide_gross    1349 non-null   int64         
 5   foreign_gross      1349 non-null   int64         
 6   profit             1349 non-null   int64         
 7   year               1349 non-null   int64         
 8   start_year         1349 non-null   int64         
 9   genres             1349 non-null   object        
 10  director           1349 non-null   object        
dtypes: datetime64[ns](1), int64(7), object(3)
memory usage: 166.5+ KB


In [47]:
df['year'] = df['start_year']
df.drop(['release_date','start_year'],axis=1,inplace=True)

In [48]:
!pwd

/c/Users/smang/Documents/Flatiron/dsc-phase-1-project/Data


In [49]:
df.to_csv('data-clean.csv')
df.to_excel('data-clean.xlsx')