# Data Preparation

## Step 1: Importing the data

Our data are contained in the zippedData directory of this repo and will need to be unzipped and imported to be useful for this analysis. First we will import the required packages and build an unzip function to help access our relevant files.

### How did we choose our data?

We decided to use data from `tn.movie_budgets.csv.gz`, `imdb.title.basics.csv.gz` and `imdb.name.basics.csv.gz`. We chose `tn.movie_budgets.csv.gz` because it provides more detailed information about revenue and production costs, which allows us to ask and answer more meaningful questions about the overall return on investment for each film. We also included `imdb.title.basics.csv.gz` in order to take a more detailed look at what _types_ of films performed best over time. Finally, we used the personnel records in `imdb.name.basics.csv.gz` to answer questions about which industry professionals were involved in successful titles.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import gzip
from io import StringIO
%matplotlib inline


# This function reads a gzipped file and parses it as CSV. It falls back to latin-1
# decoding if utf-8 fails, and to tab-separated parsing if comma-separated parsing fails.
def unzip_csv(file_location):
    with gzip.open(file_location, 'rb') as file:
        content = file.read()
    try:
        content_str = str(content, 'utf-8')
    except UnicodeDecodeError:
        content_str = str(content, 'latin-1')
    content_data = StringIO(content_str)
    try:
        return pd.read_csv(content_data)
    except Exception:
        # rewind the buffer, since the failed attempt may have consumed part of it
        content_data.seek(0)
        return pd.read_csv(content_data, sep='\t')

    
#hard-coding the file-locations and nicknames into a dict for future reference
file_locations = ['zippedData/imdb.name.basics.csv.gz'
                  ,'zippedData/imdb.title.basics.csv.gz'
                  ,'zippedData/tn.movie_budgets.csv.gz']

file_nicknames = ['name','basics','budgets']


#zip pairs each nickname with its file location and dict turns the pairs into a lookup table,
#so we have a reference for each raw df and its location on the drive.
file_dict = dict(zip(file_nicknames, file_locations))

#we unzip and define frames
name_raw = unzip_csv(file_dict['name'])
basics_raw = unzip_csv(file_dict['basics'])
budgets_raw = unzip_csv(file_dict['budgets'])

#copying frames for cleaning so the in-place edits below leave the raw frames untouched
name = name_raw.copy()
basics = basics_raw.copy()
budgets = budgets_raw.copy()
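
As a possible alternative to the helper above, `pd.read_csv` can read gzip-compressed files directly, inferring the compression from the `.gz` suffix. The sketch below is only an illustration: it writes a small made-up file (the path and contents are hypothetical, not the repo's data) and reads it back. It does not reproduce the encoding or separator fallbacks in `unzip_csv`.

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a tiny gzip-compressed CSV so the example runs without the repo's data.
# The file name and contents are made up for illustration.
sample_csv = 'id,movie,production_budget\n1,Example Film,"$1,000"\n'
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    handle.write(sample_csv)

# compression="infer" is the default, so the ".gz" suffix is enough.
sample_df = pd.read_csv(sample_path)
print(sample_df)
```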

## Step 2: Cleaning the Data

In the next step we take the raw data frames, convert the values to their appropriate data types, and drop duplicates, null values, and redundant or irrelevant columns. We'll examine the head of our budgets DataFrame below as a starting point:

### Budgets

In [None]:
budgets.head()

In [None]:
budgets.info()

In [None]:
budgets.isna().sum()

After looking at some summary data for the budgets data frame, we can see that we have a few tasks to complete before it will be useful for analysis. There is a redundant index column, and the numerical and date information is stored as strings rather than numbers and datetimes.

In [None]:
#id column is a redundant index so we're dropping it
budgets.drop('id', axis=1, inplace=True)

#setting date column to datetime object for use in charts etc.
budgets['release_date'] = pd.to_datetime(budgets['release_date'])

#stripping any stray whitespace from the column names and movie titles
budgets.columns = budgets.columns.str.strip()
budgets['movie'] = budgets['movie'].str.strip()

#this function launders the money ;D
def clean_money(budgets_series):
    #the map function applies the .replace to each cell in the given series, x[1:] skips the $
    return budgets_series.map(lambda x: int(x[1:].replace(',','')))

budgets['production_budget'] = clean_money(budgets['production_budget'])
budgets['domestic_gross'] = clean_money(budgets['domestic_gross'])
budgets['worldwide_gross'] = clean_money(budgets['worldwide_gross'])

#adding in relevant columns
budgets['foreign_gross'] = budgets.worldwide_gross - budgets.domestic_gross
budgets['profit'] = budgets.worldwide_gross - budgets.production_budget

#dropping duplicates
budgets.drop_duplicates('movie', keep='first',inplace=True)
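
To make the money conversion and the derived columns concrete, here is a small worked example on a toy frame. The titles and dollar amounts are invented for illustration; only the column names and the cleaning logic mirror the cell above.

```python
import pandas as pd

def clean_money(series):
    # Drop the leading "$", remove thousands separators, and cast to int.
    return series.map(lambda x: int(x[1:].replace(",", "")))

# Made-up rows in the same string format as the budgets data.
toy = pd.DataFrame({
    "movie": ["Film A", "Film B"],
    "production_budget": ["$1,000,000", "$250,000"],
    "domestic_gross": ["$400,000", "$0"],
    "worldwide_gross": ["$3,500,000", "$100,000"],
})

for col in ["production_budget", "domestic_gross", "worldwide_gross"]:
    toy[col] = clean_money(toy[col])

# Same derived columns as in the cleaning cell.
toy["foreign_gross"] = toy["worldwide_gross"] - toy["domestic_gross"]
toy["profit"] = toy["worldwide_gross"] - toy["production_budget"]
print(toy[["movie", "foreign_gross", "profit"]])
```

A negative `profit`, as for "Film B" here, means the worldwide gross did not cover the production budget.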

In [None]:
#looks good now
budgets.info()

### Basics
Now that the general shape of the cleaning process has been defined, we can rinse and repeat on our other data sets, making them easier to use in later analysis.

In [None]:
basics.head()

In [None]:
basics.info()

In [None]:
basics.isna().sum()

#### Null values
It looks like the column we're most interested in from this data frame has a rather significant portion of null values, something we didn't encounter in the data frame we already processed. Let's address the null values.

In [None]:
#the genres column is very important to our analysis, so we collect the indices of the rows
#that have a null value in that column and drop them.
to_drop = basics[basics['genres'].isna()].index

#simple drop will finish the job
basics.drop(to_drop,inplace=True)
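
One way to quantify how large a share of a column is null before dropping anything is to take the mean of the boolean null mask. The sketch below uses a made-up four-row frame, and also shows `dropna(subset=...)`, which should give the same result as the index-based drop above.

```python
import pandas as pd

# Toy stand-in for the basics frame; the values are invented.
toy_basics = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003", "tt004"],
    "genres": ["Drama", None, "Action,Comedy", None],
})

# The mean of a boolean mask is the fraction of True values per column.
null_share = toy_basics.isna().mean()
print(null_share)

# Drop only the rows where 'genres' is null.
cleaned = toy_basics.dropna(subset=["genres"])
print(cleaned)
```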

#### Pressing on...

Now that the null values have been removed, we will apply a similar set of cleaning techniques to this dataframe. Since the 'movie' column in budgets is how we will match records across frames, we copy the primary title column of basics into a 'movie' column to make that relationship explicit.

In [None]:
basics['movie'] = basics['primary_title']

#keeping only 'movie' and 'tconst' as keys for our other data, and 'genres' for further analysis
basics.drop(['primary_title','original_title','start_year'
                ,'runtime_minutes'],axis=1,inplace=True)

The columns look correct:

In [None]:
basics.columns

In [None]:
#the strip functions remove unwanted whitespace if it's lurking in there
basics.columns = basics.columns.str.strip()

for column in list(basics.columns):
    basics[column] = basics[column].str.strip()

#Dropping duplicates
basics.drop_duplicates('movie', keep='first', inplace=True)
    
#this .map will apply a .split to all the genres at each "," decoding the genres data into a nested list.
basics['genres'] = basics['genres'].map(lambda x: x.split(","))

As demonstrated below, the previously difficult-to-use string data has now been munged into a useful format:

In [None]:
basics['genres'].iloc[0]

In [None]:
basics['genres'].iloc[0][0]
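
As an illustration of why the list format is convenient, the sketch below counts how often each genre appears in a toy frame. The titles and genres are invented, and this is just one possible way to use the column, not necessarily the approach taken later in the analysis.

```python
import pandas as pd

# Toy frame with comma-separated genre strings, as in the raw data.
toy = pd.DataFrame({
    "movie": ["A", "B", "C"],
    "genres": ["Action,Drama", "Drama", "Comedy,Drama"],
})

# Same split as above: each cell becomes a list of genre strings.
toy["genres"] = toy["genres"].map(lambda x: x.split(","))

# One row per (movie, genre) pair, then count the rows per genre.
genre_counts = toy.explode("genres")["genres"].value_counts()
print(genre_counts)
```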

### Name

In [None]:
name.head()

In [None]:
name.info()

In [None]:
name.isnull().sum()

In [None]:
#dropping these since they're outside the scope of our analysis
name.drop(['nconst','birth_year','death_year'],axis=1,inplace=True)

#this mask finds null values in either column
to_drop = name[name['primary_profession'].isna() |
               name['known_for_titles'].isna()].index

name.drop(to_drop,inplace=True)

#cleaning the object data
name.columns = name.columns.str.strip()

#for loop will work here since all columns are object data
for column in list(name.columns):
    name[column] = name[column].str.strip()

#splitting the cleaned data into nested lists
name['known_for_titles'] = name['known_for_titles'].map(lambda x: x.split(","))
name['primary_profession'] = name['primary_profession'].map(lambda x: x.split(","))

#### Unnesting the tconst values
Our primary link to the rest of the data set is encoded in a nested list. To access these records we need to employ `.explode()` so that each record holds just one tconst and one job value.

In [None]:
name_exploded = name.explode('known_for_titles')

In [None]:
name_exploded = name_exploded.explode('primary_profession', ignore_index=True)
name_exploded.info()
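
Exploding two list columns one after the other produces one row for every combination of title and profession. The toy example below, with an invented person and invented tconst values, shows a single record with two titles and two professions becoming four rows.

```python
import pandas as pd

# One made-up person with two titles and two professions.
toy_name = pd.DataFrame({
    "primary_name": ["Jane Doe"],
    "primary_profession": [["director", "producer"]],
    "known_for_titles": [["tt001", "tt002"]],
})

# First explode: one row per title. Second explode: one row per (title, profession).
toy_exploded = (
    toy_name
    .explode("known_for_titles")
    .explode("primary_profession", ignore_index=True)
)
print(toy_exploded)
```

This row multiplication is why the exploded frame is so much larger than the original, and why filtering to the professions of interest afterwards helps keep it manageable.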

In [None]:
#drop_duplicates returns a new frame, so we assign the result back
name_exploded = name_exploded.drop_duplicates()

In [None]:
#keeping only directors and producers
name_exploded = name_exploded[name_exploded['primary_profession'].isin(['director', 'producer'])]

In [None]:
name_exploded.info()

In [None]:
name_gb = name_exploded.groupby('known_for_titles')
group_dict = name_gb.groups

In [None]:
name_gb = name_exploded.groupby(['known_for_titles'])['primary_name'].apply(', '.join).reset_index()

In [None]:
name_gb2 = name_exploded.groupby(['known_for_titles'])['primary_profession'].apply(', '.join).reset_index()
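
The group-and-join step above collapses the exploded rows back to one row per title. A minimal sketch on invented data follows. It also shows `agg(list)` as a possible alternative that keeps the values as lists and so avoids joining strings and splitting them again later; that alternative is a suggestion, not what the notebook does.

```python
import pandas as pd

# Toy exploded frame: one row per (title, person, profession). Values are invented.
toy = pd.DataFrame({
    "known_for_titles": ["tt001", "tt001", "tt002"],
    "primary_name": ["Jane Doe", "John Roe", "Jane Doe"],
    "primary_profession": ["director", "producer", "producer"],
})

# Same pattern as above: join all names for a title into one string.
names_joined = (
    toy.groupby("known_for_titles")["primary_name"]
    .apply(", ".join)
    .reset_index()
)
print(names_joined)

# Alternative: collect the names for each title into a list directly.
names_listed = (
    toy.groupby("known_for_titles")["primary_name"]
    .agg(list)
    .reset_index()
)
print(names_listed)
```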

## Step 3: Merging the Data

In [None]:
name = name_gb.merge(name_gb2)
#the values were joined with ', ' so we split on the same separator to avoid leading spaces
name['primary_profession'] = name['primary_profession'].map(lambda x: x.split(", "))
name['primary_name'] = name['primary_name'].map(lambda x: x.split(", "))

In [None]:
df = budgets.merge(basics.merge(name,how='left',left_on='tconst',right_on='known_for_titles'),how='left')

In [None]:
df.head()

In [None]:
df.drop(['tconst','known_for_titles'],axis=1,inplace=True)

In [None]:
df.head()
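
Because both merges are left joins, budget rows without a matching title are kept with missing values in the added columns. One way to check how many rows actually matched is the `indicator` parameter of `merge`. The sketch below uses two invented toy frames rather than the real data.

```python
import pandas as pd

# Toy frames standing in for budgets and basics; the values are invented.
budgets_toy = pd.DataFrame({"movie": ["Film A", "Film B"], "profit": [10, 20]})
basics_toy = pd.DataFrame({"movie": ["Film A"], "genres": [["Drama"]]})

# indicator=True adds a '_merge' column recording where each row came from.
merged = budgets_toy.merge(basics_toy, how="left", on="movie", indicator=True)
match_counts = merged["_merge"].value_counts()
print(merged)
print(match_counts)
```

Rows marked `left_only` are films from the budgets frame that found no match on `movie`, which gives a quick sense of how much of the data survives the title-based join.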