In [210]:
import numpy as np
import pandas as pd
import os
import requests
import matplotlib.pyplot as plt
import time
import seaborn as sns
import datetime
from datetime import date
from bs4 import BeautifulSoup as bs

## 1. Project Overview

The aim of the following analysis is to test the old adage "you have to spend money to make money" within the film industry. As a self-anointed cinephile who spends much of his time either watching films after work or listening to film-related podcasts during work, I have always been fascinated by the industry. Coming from a finance background, I have been particularly interested in the financial side of film production, but I never had the tools to obtain the required information and analyse it accordingly.

This course has provided me with those missing tools, and I will not pass up the opportunity to put this together, both for the purposes of this course and for my personal needs going forward.

My first task in this project is to build a population of films with accurate production costs. To do this, I pulled a number of .csv files from Kaggle and reviewed them in Excel before loading them into Python. Once I had the population and the key variable (production cost), I considered which other variables would be best to compare the costs against.

Below are the other variables I aim to compare production costs against:
- Worldwide Gross Amounts:
    - Does spending more make more?
- IMDB User Ratings:
    - Does spending more increase audience enjoyment?
- Rotten Tomatoes Critic Scores:
    - Does spending more increase critical reception?

Using these three variables, I can expand my analysis and see whether there is a genuine correlation between production cost and each of them, along with any other insights that appear as I analyse.

## 2. Data Loading and Cleansing

The first step is to build up a population of films. I have sourced eight .csv files from Kaggle, most of which provide a list of film titles, release dates/years, production costs and worldwide gross amounts. Only some of the files contain all of those variables, so I must build separate DataFrames (DFs) for production costs and worldwide grosses, cleansing each by removing duplicates and blanks, before joining them as one.

But first I should load the files:

## Loading the Film Production/Gross Files:

In [211]:
movie_filenames = os.listdir("C:/Users/New User/Documents/UCD/Assignment/assignment_files/csv_downloads/movie_data/") # lists the csv files in my folder

In [212]:
def extract_name_files(text): # removes the .csv extension from a file name and lowercases it
    # os.path.splitext removes only the extension, unlike str.strip('.csv')
    name_file = os.path.splitext(text)[0].lower()
    return name_file
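
A brief aside on removing the extension: `str.strip('.csv')` treats its argument as a set of characters to trim from both ends rather than as a suffix, so a name that starts or ends with `.`, `c`, `s` or `v` can lose extra letters. The sketch below compares it with `os.path.splitext`, using made-up file names (the helper names `strip_based` and `splitext_based` are hypothetical and exist only for this comparison).

```python
import os

def strip_based(text):
    # str.strip removes any of the characters '.', 'c', 's', 'v' from both ends
    return text.strip('.csv').lower()

def splitext_based(text):
    # os.path.splitext splits off only the final extension
    return os.path.splitext(text)[0].lower()

# Hypothetical file names, used purely for illustration
examples = ["blockbuster.csv", "scores.csv", "Top_Movies.csv"]
for name in examples:
    print(name, "->", strip_based(name), "|", splitext_based(name))
```

The file names in this project happen to survive either approach, but a file such as `scores.csv` would not.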

In [213]:
names_of_movie_files = list(map(extract_name_files,movie_filenames)) # creates the list used to name the DataFrames from the file names

In [214]:
for i in range(0,len(names_of_movie_files)): # saves each csv in a dataframe structure
    exec(names_of_movie_files[i] + " =  pd.read_csv('C:/Users/New User/Documents/UCD/Assignment/assignment_files/csv_downloads/movie_data/'+movie_filenames[i])")
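
As an alternative to creating variables with `exec`, the files could be collected in a dictionary keyed by the cleaned file name. This is only a sketch: `load_csv_folder` is a hypothetical helper, and the demo writes two tiny made-up CSV files to a temporary folder so that it runs on its own rather than relying on the Kaggle files.

```python
import os
import tempfile
import pandas as pd

def load_csv_folder(folder):
    """Read every .csv in `folder` into a dict of DataFrames keyed by file name."""
    frames = {}
    for filename in sorted(os.listdir(folder)):
        stem, ext = os.path.splitext(filename)
        if ext.lower() != ".csv":
            continue  # skip anything that is not a csv
        frames[stem.lower()] = pd.read_csv(os.path.join(folder, filename))
    return frames

# Demo with made-up data in a temporary folder (not the Kaggle files)
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"film_title": ["A"], "film_budget": [100]}).to_csv(
        os.path.join(tmp, "Blockbuster.csv"), index=False)
    pd.DataFrame({"film_title": ["B"], "imdb_rating": [7.5]}).to_csv(
        os.path.join(tmp, "ratings.csv"), index=False)
    demo_frames = load_csv_folder(tmp)

print(sorted(demo_frames))
print(demo_frames["blockbuster"])
```

A frame is then accessed as `demo_frames['blockbuster']`, and looping over `demo_frames.items()` gives each name together with its DataFrame without building code strings.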

In [215]:
# I will then use the below to display the dataframes currently loaded:
%whos DataFrame

Variable                    Type         Data/Info
--------------------------------------------------
all_time_worldwide_bo       DataFrame         Rank  Year          <...>n\n[595 rows x 6 columns]
blockbuster                 DataFrame         release_year  rank_i<...>\n[430 rows x 13 columns]
blockbuster_PD              DataFrame         release_year        <...>n\n[430 rows x 4 columns]
file_names                  DataFrame                          fil<...>0         top_movies_data
imdb_movie_metadata         DataFrame          color      director<...>n[5043 rows x 28 columns]
imdb_top_1000               DataFrame                             <...>n[1000 rows x 16 columns]
movie_industry_dataset      DataFrame                             <...>n[7633 rows x 15 columns]
movie_industry_dataset_PD   DataFrame          release_year       <...>\n[7633 rows x 4 columns]
production_file_names       DataFrame                            f<...>       top_movies_data_PD
rotten_tomatoes_movie    

## Loading the Film Rating Files:

I repeat the same steps for the film rating files, which sit in a different folder. This time I can reuse the extract_name_files(text) function defined earlier, which means one less step:

In [216]:
rating_filenames = os.listdir("C:/Users/New User/Documents/UCD/Assignment/assignment_files/csv_downloads/rating_data/")

In [217]:
names_of_rating_files = list(map(extract_name_files,rating_filenames))

In [218]:
for i in range(0,len(names_of_rating_files)):
    exec(names_of_rating_files[i] + " =  pd.read_csv('C:/Users/New User/Documents/UCD/Assignment/assignment_files/csv_downloads/rating_data/'+rating_filenames[i])")

In [219]:
%whos DataFrame

Variable                    Type         Data/Info
--------------------------------------------------
all_time_worldwide_bo       DataFrame         Rank  Year          <...>n\n[595 rows x 6 columns]
blockbuster                 DataFrame         release_year  rank_i<...>\n[430 rows x 13 columns]
blockbuster_PD              DataFrame         release_year        <...>n\n[430 rows x 4 columns]
file_names                  DataFrame                          fil<...>0         top_movies_data
imdb_movie_metadata         DataFrame          color      director<...>n[5043 rows x 28 columns]
imdb_top_1000               DataFrame                             <...>n[1000 rows x 16 columns]
movie_industry_dataset      DataFrame                             <...>n[7633 rows x 15 columns]
movie_industry_dataset_PD   DataFrame          release_year       <...>\n[7633 rows x 4 columns]
production_file_names       DataFrame                            f<...>       top_movies_data_PD
rotten_tomatoes_movie    

The number of files loaded as DataFrames has increased by the three rating files.

To explore the contents of these DataFrames, I will create a new DataFrame listing their names and then use a loop to print the info of each loaded DataFrame.

In [220]:
file_names = pd.DataFrame({'file': ['all_time_worldwide_bo','blockbuster','imdb_movie_metadata','imdb_top_1000','movie_industry_dataset',
                                    'rotten_tomatoes_movie','tgm_bo_summary','tmdb_movie','tmds_movies_metadata',
                                    'top_grossing_film','top_movies_data']})

In [221]:
for index, row in file_names.iterrows():
    print(row['file'])
    exec('print('+row['file']+'.info())')

all_time_worldwide_bo
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 595 entries, 0 to 594
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Rank                     595 non-null    int64 
 1   Year                     595 non-null    int64 
 2   Movie                    595 non-null    object
 3   WorldwideBox Office      595 non-null    object
 4   DomesticBox Office       588 non-null    object
 5   InternationalBox Office  595 non-null    object
dtypes: int64(2), object(4)
memory usage: 28.0+ KB
None
blockbuster
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 430 entries, 0 to 429
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   release_year          430 non-null    int64  
 1   rank_in_year          430 non-null    int64  
 2   imdb_rating           430 non-null    float64
 3   mpaa_rating      

The above provides me with an overview of the DFs: I can see all columns within each DF and the data type of each column. In building my population, I will require columns containing:
- Film Title
    - For Obvious Reasons
- Release Date or Release Year
    - While having a full release date would be perfect as it would allow analysis by month and even day, I would settle for release year if it means getting a bigger population. 
- Production Costs Amount
    - The entire report hinges on the presence and accuracy of production costs for each film. 

I will start by building this population: extracting these columns from the files where all three are present, cleansing the data in each subsetted DF, and amending the column headings so that they are uniform across the new DFs before stacking them to form a production cost population. I will then repeat the steps for the files containing gross box office figures.
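
The subset, rename and stack steps can be sketched as follows. The two small frames and their values are invented for illustration; only the column names mirror those found in the loaded files.

```python
import pandas as pd

# Invented sample rows; the column names mirror two of the source files
source_a = pd.DataFrame({"name": ["Film A"], "year": [1999],
                         "budget": [1_000_000], "score": [7.1]})
source_b = pd.DataFrame({"Movie Title": ["Film B"], "release_year": [2005],
                         "Production Budget": [2_500_000]})

common = ["release_year", "film_title", "film_budget"]

# Rename to shared headings, then keep only the required columns
a = source_a.rename(columns={"name": "film_title", "year": "release_year",
                             "budget": "film_budget"})[common]
b = source_b.rename(columns={"Movie Title": "film_title",
                             "Production Budget": "film_budget"})[common]

# Stack the uniform frames into one population
production = pd.concat([a, b], ignore_index=True)
print(production)
```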


## Analysis of DataFrames:

Excluding the three film rating DFs (imdb_movie_metadata, imdb_top_1000 and rotten_tomatoes_movie) and the DFs which have no production costs or budgets (all_time_worldwide_bo, tgm_bo_summary and top_grossing_film), I am left with the following DFs and their relevant columns:
- blockbuster
    - release_year
    - film_title
    - film_budget
    - length_in_min
        - I am including the length to filter out movies 60 minutes or less where possible
- movie_industry_dataset
    - name
    - year
    - budget
    - runtime
- tmdb_movie
    - budget
    - release_date
    - runtime
    - title
- tmds_movies_metadata
    - budget
    - release_date
    - title
    - runtime
- top_movies_data
    - Release Date
    - Movie Title
    - Production Budget
    
I will use the column names from the blockbuster DF and change the other column names to match. In addition to the column name changes, the following steps will be taken to cleanse each DF:
- Filter out films of 60 minutes and under
- Filter out any rows where the production cost is blank
- Remove duplicates based on these two criteria, in this order:
    - film_title & release_year
    - film_title & film_budget
        - This should remove any potential duplicates where multiple versions of the same title appear, while retaining films with identical titles produced in different years, remakes being a prime example.

The files should be ready to be appended on top of each other at this point. 
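
A minimal sketch of these cleansing steps, assuming the columns have already been renamed to the uniform headings. `cleanse_production` is a hypothetical helper and the sample rows are invented. Keeping rows whose runtime is missing is also an assumption, made here because the runtime filter is only applied "where possible".

```python
import pandas as pd

def cleanse_production(df, min_runtime=60):
    """Apply the three cleansing steps to a frame with uniform column names."""
    out = df.copy()
    if "film_runtime" in out.columns:
        # Keep films longer than min_runtime; rows with unknown runtime are kept
        out = out[out["film_runtime"].isna() | (out["film_runtime"] > min_runtime)]
    # Drop rows with no production cost
    out = out.dropna(subset=["film_budget"])
    # Remove duplicates on each key pair in turn
    out = out.drop_duplicates(subset=["film_title", "release_year"])
    out = out.drop_duplicates(subset=["film_title", "film_budget"])
    return out.reset_index(drop=True)

# Invented rows for illustration
sample = pd.DataFrame({
    "film_title":   ["Short", "Remake", "Remake", "Remake", "NoBudget"],
    "release_year": [2000,    1960,     2010,     2010,     2001],
    "film_budget":  [1e5,     2e6,      9e7,      9e7,      None],
    "film_runtime": [45,      110,      120,      120,      95],
})
cleaned = cleanse_production(sample)
print(cleaned)
```

In this sample the short film, the row without a budget and the repeated 2010 entry are dropped, while the 1960 and 2010 films sharing a title are both retained.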

The next stage is to pull together the DFs containing film gross data. From the loaded DFs, these are listed below along with the columns required to join them with the production DFs:
- all_time_worldwide_bo
    - Year
    - Movie
    - WorldwideBox Office
- blockbuster
    - release_year 
    - film_title
    - worldwide_gross   
- movie_industry_dataset
    - name
    - year
    - gross
- tgm_bo_summary
    - Title
    - Lifetime Gross
    - Year
- tmdb_movie
    - release_date
    - revenue
    - title                 
- tmds_movies_metadata
    - release_date
    - revenue
    - title
- top_grossing_film
    - Movie_Name
    - Lifetime Gross
    - Year  
- top_movies_data
    - Release Date
    - Movie Title
    - Worldwide Gross
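
Joining the two populations on title and year might look like the sketch below. The rows are invented, and the choice of a left join (keeping every film that has a production cost) is an assumption rather than something decided above.

```python
import pandas as pd

# Invented rows standing in for the stacked production and gross populations
production = pd.DataFrame({
    "film_title": ["Film A", "Film B", "Film C"],
    "release_year": [1999, 2005, 2010],
    "film_budget": [1_000_000, 2_500_000, 40_000_000],
})
gross = pd.DataFrame({
    "film_title": ["Film A", "Film C"],
    "release_year": [1999, 2010],
    "film_gross": [5_000_000, 30_000_000],
})

# indicator=True adds a _merge column showing which rows found a match
combined = production.merge(gross, on=["film_title", "release_year"],
                            how="left", indicator=True)
print(combined)
```

The `_merge` column makes it easy to count how many films have no gross figure before deciding whether to drop them.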


In [222]:
# Before subsetting the files, I will cleanse some of the data based on the data type information above. The budget columns in the blockbuster and tmds_movies_metadata DFs are not numeric data types.

# The loop below prints the first two rows of each file to show potential reasons for the data type issues.

for index, row in file_names.iterrows():
    print(row['file'])
    exec('print('+row['file']+'.head(2))')

all_time_worldwide_bo
   Rank  Year              Movie WorldwideBox Office DomesticBox Office  \
0     1  2009             Avatar      $2,845,899,541       $760,507,625   
1     2  2019  Avengers: Endgame      $2,797,800,564       $858,373,000   

  InternationalBox Office  
0          $2,085,391,916  
1          $1,939,427,564  
blockbuster
   release_year  rank_in_year  imdb_rating mpaa_rating         film_title  \
0          2019             1          8.5       PG-13  Avengers: Endgame   
1          2019             2          7.0          PG      The Lion King   

   film_budget  length_in_min domestic_distributor worldwide_gross  \
0  356,000,000            181         Walt Disney    2,797,800,564   
1  260,000,000            118         Walt Disney    1,656,943,394   

  domestic_gross    genre_1    genre_2 genre_3  
0    858,373,000     Action  Adventure   Drama  
1    543,638,043  Animation  Adventure   Drama  
imdb_movie_metadata
   color   director_name  num_critic_for_revie

In [223]:
# based on the above, the following changes are required for the data to be usable:
# remove commas and currency signs from columns where required
# regex=False so that ',' and '$' are matched as literal characters
all_time_worldwide_bo['WorldwideBox Office']=all_time_worldwide_bo['WorldwideBox Office'].str.replace(',','', regex=False)
all_time_worldwide_bo['WorldwideBox Office']=all_time_worldwide_bo['WorldwideBox Office'].str.replace('$','', regex=False)
blockbuster['film_budget']=blockbuster['film_budget'].str.replace(',','', regex=False)
blockbuster['worldwide_gross']=blockbuster['worldwide_gross'].str.replace(',','', regex=False)
tgm_bo_summary['Lifetime Gross']=tgm_bo_summary['Lifetime Gross'].str.replace(',','', regex=False)
tgm_bo_summary['Lifetime Gross']=tgm_bo_summary['Lifetime Gross'].str.replace('$','', regex=False)

# change the data types of the columns to a numeric type
all_time_worldwide_bo['WorldwideBox Office'] = pd.to_numeric(all_time_worldwide_bo['WorldwideBox Office'])
blockbuster['film_budget'] = pd.to_numeric(blockbuster['film_budget'])
blockbuster['worldwide_gross'] = pd.to_numeric(blockbuster['worldwide_gross'])
tgm_bo_summary['Lifetime Gross'] = pd.to_numeric(tgm_bo_summary['Lifetime Gross'])
tmds_movies_metadata['budget'] = pd.to_numeric(tmds_movies_metadata['budget'])
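
The repeated replace-then-convert pattern could be wrapped in one small function. The sketch below is illustrative only: `money_to_number` is a hypothetical helper, the sample values are invented, and using `errors="coerce"` (unparseable values become NaN instead of raising an error) is an assumption about the desired behaviour. Passing `regex=False` matters because `$` is an end-of-string anchor in a regular expression, and the default for `regex` has differed between pandas versions.

```python
import pandas as pd

def money_to_number(series):
    """Strip '$' and ',' as literal characters, then convert to a numeric dtype."""
    cleaned = (series.astype(str)
                     .str.replace("$", "", regex=False)
                     .str.replace(",", "", regex=False))
    # errors="coerce" turns anything unparseable into NaN instead of raising
    return pd.to_numeric(cleaned, errors="coerce")

# Invented sample values in the formats seen in the head() output
raw = pd.Series(["$2,845,899,541", "356,000,000", "n/a"])
converted = money_to_number(raw)
print(converted)
```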

In [224]:
# the info below confirms that the converted columns now have numeric data types

for index, row in file_names.iterrows():
    print(row['file'])
    exec('print('+row['file']+'.info())')

all_time_worldwide_bo
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 595 entries, 0 to 594
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Rank                     595 non-null    int64 
 1   Year                     595 non-null    int64 
 2   Movie                    595 non-null    object
 3   WorldwideBox Office      595 non-null    int64 
 4   DomesticBox Office       588 non-null    object
 5   InternationalBox Office  595 non-null    object
dtypes: int64(3), object(3)
memory usage: 28.0+ KB
None
blockbuster
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 430 entries, 0 to 429
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   release_year          430 non-null    int64  
 1   rank_in_year          430 non-null    int64  
 2   imdb_rating           430 non-null    float64
 3   mpaa_rating      

To make the column names uniform, I will rename them, where required, to the following:
- release_year (or date)
- film_title
- film_budget
- film_runtime
- film_gross

In [225]:
all_time_worldwide_bo.rename(columns={'Year': 'release_year','Movie': 'film_title', 'WorldwideBox Office': 'film_gross'}, inplace=True)

blockbuster.rename(columns={'length_in_min': 'film_runtime','worldwide_gross': 'film_gross'}, inplace=True)

movie_industry_dataset.rename(columns={'year': 'release_year', 'name': 'film_title',
                                          'budget': 'film_budget', 'runtime': 'film_runtime','gross': 'film_gross'}, inplace=True)

tgm_bo_summary.rename(columns={'Title': 'film_title','Lifetime Gross': 'film_gross', 'Year': 'release_year'}, inplace=True)

tmdb_movie.rename(columns={'title': 'film_title','budget': 'film_budget', 'runtime': 'film_runtime','revenue': 'film_gross'}, inplace=True)

tmds_movies_metadata.rename(columns={'title': 'film_title','budget': 'film_budget', 'runtime': 'film_runtime','revenue': 'film_gross'}, inplace=True)

top_grossing_film.rename(columns={'Year': 'release_year', 'Movie_Name': 'film_title','Lifetime Gross': 'film_gross'}, inplace=True)

top_movies_data.rename(columns={'Release Date': 'release_date', 'Movie Title': 'film_title',
                                          'Production Budget': 'film_budget', 'Worldwide Gross': 'film_gross'}, inplace=True)

In [226]:
tmdb_movie['release_date'] = pd.to_datetime(tmdb_movie['release_date'])
tmds_movies_metadata['release_date'] = pd.to_datetime(tmds_movies_metadata['release_date'])
top_movies_data['release_date'] = pd.to_datetime(top_movies_data['release_date'])

In [227]:
# extract the release year from the parsed release dates
tmdb_movie['release_year'] = tmdb_movie['release_date'].dt.year
tmds_movies_metadata['release_year'] = tmds_movies_metadata['release_date'].dt.year
top_movies_data['release_year'] = top_movies_data['release_date'].dt.year
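
The release years derived this way appear as floats (for example 2009.0) in the outputs further down, which is what pandas produces when some dates are missing. A small sketch with invented dates shows one way to keep whole-number years using the nullable `Int64` dtype. Parsing with `errors="coerce"` is an assumption here and is not part of the cells above.

```python
import pandas as pd

# Invented values: two valid dates, one missing, one unparseable
dates = pd.Series(["2009-09-09", "1995-10-30", None, "not a date"])

# errors="coerce" turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(dates, errors="coerce")

# .dt.year returns floats when NaT is present; Int64 keeps whole numbers plus <NA>
years = parsed.dt.year.astype("Int64")
print(years)
```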

In [228]:
top_movies_data_PD.head(1)

Unnamed: 0,release_year,release_date,film_title,film_budget
0,2009.0,2009-09-09,9,30000000


## Production DF Analysis

Here I take a deeper look at the files used in the production DF and make the required amendments.

In [229]:
# Create new DFs with only the columns required
columns_to_subset1 = ['release_year', 'film_title', 'film_budget','film_runtime']
columns_to_subset2 = ['release_year', 'release_date', 'film_title', 'film_budget','film_runtime']
columns_to_subset3 = ['release_year', 'release_date', 'film_title', 'film_budget']
# .copy() makes each subset independent of its source DF, so later edits do not raise SettingWithCopyWarning
blockbuster_PD = blockbuster[columns_to_subset1].copy()
movie_industry_dataset_PD = movie_industry_dataset[columns_to_subset1].copy()
tmdb_movie_PD = tmdb_movie[columns_to_subset2].copy()
tmds_movies_metadata_PD = tmds_movies_metadata[columns_to_subset2].copy()
top_movies_data_PD = top_movies_data[columns_to_subset3].copy()

In [230]:
production_file_names = pd.DataFrame({'file': ['blockbuster_PD','movie_industry_dataset_PD','tmdb_movie_PD','tmds_movies_metadata_PD',
                                    'top_movies_data_PD']})

In [231]:
for index, row in production_file_names.iterrows():
    print(row['file'])
    exec('print('+row['file']+'.info())')

blockbuster_PD
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 430 entries, 0 to 429
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   release_year  430 non-null    int64 
 1   film_title    430 non-null    object
 2   film_budget   430 non-null    int64 
 3   film_runtime  430 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 13.6+ KB
None
movie_industry_dataset_PD
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7633 entries, 0 to 7632
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   release_year  7633 non-null   int64  
 1   film_title    7633 non-null   object 
 2   film_budget   5462 non-null   float64
 3   film_runtime  7629 non-null   float64
dtypes: float64(2), int64(1), object(1)
memory usage: 238.7+ KB
None
tmdb_movie_PD
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 5 colu

In [232]:
for index, row in production_file_names.iterrows():
    print(row['file'])
    exec('print('+row['file']+'.head(2))')

blockbuster_PD
   release_year         film_title  film_budget  film_runtime
0          2019  Avengers: Endgame    356000000           181
1          2019      The Lion King    260000000           118
movie_industry_dataset_PD
   release_year       film_title  film_budget  film_runtime
0          1980      The Shining   19000000.0         146.0
1          1980  The Blue Lagoon    4500000.0         104.0
tmdb_movie_PD
   release_year release_date                                film_title  \
0        2009.0   2009-10-12                                    Avatar   
1        2007.0   2007-05-19  Pirates of the Caribbean: At World's End   

   film_budget  film_runtime  
0    237000000         162.0  
1    300000000         169.0  
tmds_movies_metadata_PD
   release_year release_date film_title  film_budget  film_runtime
0        1995.0   1995-10-30  Toy Story     30000000          81.0
1        1995.0   1995-12-15    Jumanji     65000000         104.0
top_movies_data_PD
   release_year rel