In [1]:
import numpy as np
import pandas as pd
import os
import requests
import matplotlib.pyplot as plt
import time
import seaborn as sns
import datetime
from datetime import date
from bs4 import BeautifulSoup as bs

## 1. Project Overview

The aim of the following analysis is to test the old adage "you have to spend money to make money" within the film industry. As a self-anointed cinephile who spends much of his time either watching films after work or listening to film-related podcasts during work, I've always been fascinated by the industry. Coming from a finance background, I have been particularly interested in the financial side of film production, but I never had the tools to obtain the required information and analyse it accordingly.

This course has provided me with those missing tools, and I will not let an opportunity pass to put this together, both for the purposes of this course and for my personal needs going forward.

My first task in this project is to build a population of films with accurate production costs. To do this, I pulled a number of .csv files from Kaggle and reviewed them in Excel before loading them into Python. Once I had the population and the key variable (Production Cost), I considered which other variables would be best to compare the costs against.

Below are the other variables I aim to compare production costs against:
- Worldwide Gross Amounts:
    - Does spending more make more?
- IMDB User Ratings:
    - Does spending more increase audience enjoyment?
- Rotten Tomatoes Critic Scores:
    - Does spending more increase critical reception?

Using these three variables, I can expand my analysis and test whether there is a genuine correlation between production cost and each of them, along with any other insights that appear as I analyse.
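As a rough illustration of what "correlation with production cost" could look like in pandas, here is a minimal sketch. The numbers are made up purely for the example and are not taken from the Kaggle files; the column names follow the blockbuster file, and Pearson correlation is only one possible choice of measure.

```python
import pandas as pd

# Toy values invented for illustration only (not real film data)
films = pd.DataFrame({
    "film_budget":     [10, 50, 100, 150, 200],
    "worldwide_gross": [30, 90, 260, 310, 650],
    "imdb_rating":     [7.1, 6.4, 7.8, 6.9, 7.3],
})

# Pearson correlation of each other variable against production cost
corr_with_budget = films.corr(method="pearson")["film_budget"].drop("film_budget")
print(corr_with_budget)
```

A value close to 1 or -1 would suggest a strong linear relationship, while a value near 0 would suggest little linear relationship; it would not on its own show that spending causes the outcome.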

## 2. Data Loading and Cleansing

The first step is to build up a population of films. I have sourced eight .csv files from Kaggle, most of which provide a list of film titles, release dates/years, production costs and worldwide gross amounts. Only some of the files contain all of those variables, so I must build separate DataFrames (DFs) for production costs and worldwide grosses, cleanse each by removing duplicates and blanks, and then join them as one.

But first I should load the files:

## Loading the Film Production/Gross Files:

In [2]:
movie_filenames = os.listdir("C:/Users/New User/Documents/UCD/Assignment/assignment_files/csv_downloads/movie_data/") # lists the csv files in my folder

In [3]:
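def extract_name_files(text): # this removes the .csv extension from the file names in the folder
    # os.path.splitext drops only the final extension; str.strip('.csv') would also
    # remove any leading or trailing '.', 'c', 's' or 'v' characters from the name itself
    name_file = os.path.splitext(text)[0].lower()
    return name_file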
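A note on stripping the extension: `str.strip('.csv')` treats its argument as a set of characters to remove from both ends, not as a suffix. The short sketch below compares it with `os.path.splitext`. The helper `compare_stems` and the file name `scores.csv` are hypothetical and chosen only to expose the difference.

```python
import os

def compare_stems(filename):
    """Return (result of str.strip, result of os.path.splitext) for one file name."""
    stripped = filename.strip(".csv")        # removes any of '.', 'c', 's', 'v' at either end
    stem = os.path.splitext(filename)[0]     # removes only the final extension
    return stripped, stem

print(compare_stems("scores.csv"))      # hypothetical name: ('ore', 'scores')
print(compare_stems("tmdb_movie.csv"))  # ('tmdb_movie', 'tmdb_movie')
```

For the files used here the two approaches happen to agree, but a file whose name starts or ends with one of those characters would be silently shortened.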

In [4]:
names_of_movie_files = list(map(extract_name_files,movie_filenames)) # creates the list used to name the dataframes after the filenames

In [5]:
for name, filename in zip(names_of_movie_files, movie_filenames): # saves each csv in a dataframe structure
    # globals()[name] creates a variable named after the file without building code strings for exec
    globals()[name] = pd.read_csv('C:/Users/New User/Documents/UCD/Assignment/assignment_files/csv_downloads/movie_data/' + filename)

In [6]:
# I will then use the below to display the dataframes currently loaded:
%whos DataFrame

Variable                 Type         Data/Info
-----------------------------------------------
all_time_worldwide_bo    DataFrame         Rank  Year          <...>n\n[595 rows x 6 columns]
blockbuster              DataFrame         release_year  rank_i<...>\n[430 rows x 13 columns]
movie_industry_dataset   DataFrame                             <...>n[7633 rows x 15 columns]
tgm_bo_summary           DataFrame         Rank                <...>\n[1000 rows x 5 columns]
tmdb_movie               DataFrame             budget          <...>n[4803 rows x 20 columns]
tmds_movies_metadata     DataFrame           adult             <...>[45447 rows x 24 columns]
top_grossing_film        DataFrame             Release_Type    <...>\n[1000 rows x 7 columns]
top_movies_data          DataFrame         Release Date Movie T<...>\n[3900 rows x 5 columns]


## Loading the Film Rating Files:

I repeat the same steps for the film rating files, which sit in a different folder. This time I can reuse the extract_name_files(text) function defined earlier, which means one less step:

In [7]:
rating_filenames = os.listdir("C:/Users/New User/Documents/UCD/Assignment/assignment_files/csv_downloads/rating_data/")

In [8]:
names_of_rating_files = list(map(extract_name_files,rating_filenames))

In [9]:
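for name, filename in zip(names_of_rating_files, rating_filenames): # saves each rating csv in a dataframe structure
    # globals()[name] creates a variable named after the file without building code strings for exec
    globals()[name] = pd.read_csv('C:/Users/New User/Documents/UCD/Assignment/assignment_files/csv_downloads/rating_data/' + filename)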

In [10]:
%whos DataFrame

Variable                 Type         Data/Info
-----------------------------------------------
all_time_worldwide_bo    DataFrame         Rank  Year          <...>n\n[595 rows x 6 columns]
blockbuster              DataFrame         release_year  rank_i<...>\n[430 rows x 13 columns]
imdb_movie_metadata      DataFrame          color      director<...>n[5043 rows x 28 columns]
imdb_top_1000            DataFrame                             <...>n[1000 rows x 16 columns]
movie_industry_dataset   DataFrame                             <...>n[7633 rows x 15 columns]
rotten_tomatoes_movie    DataFrame                            r<...>[17712 rows x 22 columns]
tgm_bo_summary           DataFrame         Rank                <...>\n[1000 rows x 5 columns]
tmdb_movie               DataFrame             budget          <...>n[4803 rows x 20 columns]
tmds_movies_metadata     DataFrame           adult             <...>[45447 rows x 24 columns]
top_grossing_film        DataFrame             Release_Typ

The number of files loaded as DataFrames has increased by the three rating files.
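Since the same load steps are repeated for each folder, one alternative is to wrap them in a single function that returns a dictionary of DataFrames keyed by file name, rather than creating one variable per file. The sketch below is a suggestion, not the approach used in this notebook; `load_csv_folder` is a hypothetical helper, and the demonstration runs on a throwaway folder with two made-up files.

```python
import os
import tempfile
import pandas as pd

def load_csv_folder(folder):
    """Read every .csv in `folder` into a dict keyed by the lower-cased file stem."""
    frames = {}
    for filename in sorted(os.listdir(folder)):
        stem, ext = os.path.splitext(filename)
        if ext.lower() == ".csv":
            frames[stem.lower()] = pd.read_csv(os.path.join(folder, filename))
    return frames

# Demonstration on a temporary folder with two made-up files
with tempfile.TemporaryDirectory() as demo_dir:
    pd.DataFrame({"title": ["A", "B"], "budget": [1, 2]}).to_csv(
        os.path.join(demo_dir, "Demo_Costs.csv"), index=False)
    pd.DataFrame({"title": ["A"], "rating": [7.5]}).to_csv(
        os.path.join(demo_dir, "demo_ratings.csv"), index=False)
    demo_frames = load_csv_folder(demo_dir)

for name, frame in demo_frames.items():
    print(name, frame.shape)
```

With a dictionary, looping over every loaded file becomes `for name, frame in frames.items()`, with no need to maintain a separate list of names.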

To explore the contents of these DataFrames, I will create a new DataFrame listing their names and then use a loop to print the info of each loaded DataFrame:

In [55]:
file_names = pd.DataFrame({'file': ['all_time_worldwide_bo','blockbuster','imdb_movie_metadata','imdb_top_1000','movie_industry_dataset',
                                    'rotten_tomatoes_movie','tgm_bo_summary','tmdb_movie','tmds_movies_metadata','top_grossing_film','top_movies_data']})

In [69]:
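for index, row in file_names.iterrows():
    print(row['file'])
    # look the DataFrame up by name; info() prints its report and returns None,
    # so the outer print adds the trailing "None" seen in the output
    print(globals()[row['file']].info())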

all_time_worldwide_bo
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 595 entries, 0 to 594
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Rank                     595 non-null    int64 
 1   Year                     595 non-null    int64 
 2   Movie                    595 non-null    object
 3   WorldwideBox Office      595 non-null    object
 4   DomesticBox Office       588 non-null    object
 5   InternationalBox Office  595 non-null    object
dtypes: int64(2), object(4)
memory usage: 28.0+ KB
None
blockbuster
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 430 entries, 0 to 429
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   release_year          430 non-null    int64  
 1   rank_in_year          430 non-null    int64  
 2   imdb_rating           430 non-null    float64
 3   mpaa_rating      

The above provides me with an overview of the DFs: I can see all columns within each DF and the data type contained within each column. In building my population, I will require columns containing:
- Film Title
    - For obvious reasons: it identifies each film and is needed to match records across files
- Release Date or Release Year
    - While having a full release date would be perfect as it would allow analysis by month and even day, I would settle for release year if it means getting a bigger population. 
- Production Costs Amount
    - The entire report hinges on the presence and accuracy of production costs for each film. 

I will start by building this population: extract these columns from each file where all three are present, cleanse the data in each new subsetted DF, and amend the column headings so they are uniform across the new DFs before stacking them to form a production cost population. I will then repeat the steps for the files containing gross box office figures.
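Because some files carry a full release date and others only a year, the two need to be brought to a common column before stacking. A minimal sketch of one way to do this is below; the rows are made up, and the assumption that the date strings are in a format `pd.to_datetime` can parse has not been checked against the actual files.

```python
import pandas as pd

# Made-up rows: one source carries full dates, the other only a year
with_dates = pd.DataFrame({"title": ["Film A", "Film B"],
                           "release_date": ["2009-12-10", "not a date"]})
with_years = pd.DataFrame({"name": ["Film C"], "year": [1997]})

# errors="coerce" turns unparseable dates into NaT so they can be dropped instead of raising
with_dates["release_year"] = pd.to_datetime(with_dates["release_date"], errors="coerce").dt.year
with_dates = with_dates.dropna(subset=["release_year"]).astype({"release_year": "int64"})

# The year-only source just needs its columns renamed
with_years = with_years.rename(columns={"name": "title", "year": "release_year"})

print(with_dates[["title", "release_year"]])
print(with_years)
```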


## Analysis of DataFrames:

Excluding the three film rating DFs (imdb_movie_metadata, imdb_top_1000 and rotten_tomatoes_movie) and the DFs which have no production costs or budgets (all_time_worldwide_bo, tgm_bo_summary and top_grossing_film), I am left with the following DFs and the relevant columns:
- blockbuster
    - release_year
    - film_title
    - film_budget
    - length_in_min
        - I am including the length to filter out movies 60 minutes or less where possible
- movie_industry_dataset
    - name
    - year
    - budget
    - runtime
- tmdb_movie
    - budget
    - release_date
    - runtime
    - title
- tmds_movies_metadata
    - budget
    - release_date
    - title
    - runtime
- top_movies_data
    - Release Date
    - Movie Title
    - Production Budget
    
I will use the column names from the blockbuster DF and rename the columns in the other DFs to match. In addition to the column name changes, the following steps will be taken to cleanse each DF:
- Filter out films 60 minutes and under
- Filter out any rows where the production cost is blank
- Remove duplicates based on these two criteria, in this order:
    - film_title & release_year
    - film_title & film_budget
        - This should remove any potential duplicates where multiple versions of the same title appear but should also retain films with identical titles produced in different years, remakes being a prime example of this. 

The files should be ready to be appended on top of each other at this point. 
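The cleansing steps above can be sketched as a single function applied to each DF before stacking. This is a minimal sketch on made-up rows, not the final code: `clean_production` and the second source's column names are hypothetical, and keeping rows whose runtime is unknown is an assumption on my part (only films known to be 60 minutes or under are dropped).

```python
import pandas as pd

def clean_production(df, column_map):
    """Rename columns to the blockbuster names, then apply the cleansing steps in order."""
    out = df.rename(columns=column_map)[list(column_map.values())]
    if "length_in_min" in out.columns:
        # assumption: keep unknown runtimes, drop only films known to be 60 minutes or under
        out = out[out["length_in_min"].isna() | (out["length_in_min"] > 60)]
    out = out.dropna(subset=["film_budget"])                          # blank production costs
    out = out.drop_duplicates(subset=["film_title", "release_year"])  # first criterion
    out = out.drop_duplicates(subset=["film_title", "film_budget"])   # second criterion
    return out.reset_index(drop=True)

# Made-up rows shaped like movie_industry_dataset
raw = pd.DataFrame({
    "name":    ["Film A", "Film A", "Film A", "Short B", "Film C"],
    "year":    [1990, 1990, 2015, 2001, 2005],
    "budget":  [10.0, 10.0, 40.0, 1.0, None],
    "runtime": [100, 100, 110, 45, 95],
})
cleaned = clean_production(raw, {"name": "film_title", "year": "release_year",
                                 "budget": "film_budget", "runtime": "length_in_min"})

# A second made-up source with no runtime column
other = pd.DataFrame({"title": ["Film D"], "year": [2010], "budget": [25.0]})
cleaned_other = clean_production(other, {"title": "film_title", "year": "release_year",
                                         "budget": "film_budget"})

# Append the cleansed DFs on top of each other
stacked = pd.concat([cleaned, cleaned_other], ignore_index=True)
print(stacked)
```

In this toy data the exact duplicate of Film A (1990) is removed, while the 2015 version with a different budget is retained, which is the behaviour wanted for remakes.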

The next stage will be to pull together the DFs containing the film grossing data. From the loaded DFs, these are listed below along with the columns required to join them with the production DFs:
- all_time_worldwide_bo
    - Year
    - Movie
    - WorldwideBox Office
- blockbuster
    - release_year 
    - film_title
    - worldwide_gross   
- movie_industry_dataset
    - name
    - year
    - gross
- tgm_bo_summary
    - Title
    - Lifetime Gross
    - Year
- tmdb_movie
    - release_date
    - revenue
    - title                 
- tmds_movies_metadata
    - release_date
    - revenue
    - title
- top_grossing_film
    - Movie_Name
    - Lifetime Gross
    - Year  
- top_movies_data
    - Release Date
    - Movie Title
    - Worldwide Gross

Once loaded, the same cleansing will take place, along with renaming the columns so that the DFs line up to be appended together.
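Several of the box office columns were loaded with the object dtype (for example WorldwideBox Office), so they will likely need converting to numbers before any comparison. The sketch below assumes the values are strings containing currency symbols and thousands separators, which I have not confirmed for every file; `to_number` is a hypothetical helper, and it assumes grosses are never negative since any minus sign would be removed.

```python
import pandas as pd

def to_number(series):
    """Keep only digits and the decimal point, then convert; anything unparseable becomes NaN."""
    cleaned = series.astype(str).str.replace(r"[^0-9.]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up values covering the formats assumed above
gross = pd.Series(["$1,234,567", "2,000", None, "n/a"])
gross_numeric = to_number(gross)
print(gross_numeric)
```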

After creating the two new DFs for production costs and gross amounts, I will carry out some additional cleansing on both. This will include cleaning and standardising the film titles, then removing the duplicates created by combining the DFs.
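One possible way to standardise titles is to build a simplified matching key (lower case, accents and punctuation removed, whitespace collapsed) and remove duplicates on that key plus the release year. The sketch below uses made-up rows; `normalise_title` and the `title_key` column are hypothetical names, and the exact rules are an assumption that may need tuning against the real titles.

```python
import re
import unicodedata
import pandas as pd

def normalise_title(title):
    """Build a simplified key for matching the same film across sources."""
    # decompose accented characters and drop the accents (e.g. é becomes e)
    text = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    # lower-case, then remove everything except letters, digits and spaces
    text = re.sub(r"[^a-z0-9 ]", "", text.lower())
    # collapse repeated spaces and trim the ends
    return re.sub(r"\s+", " ", text).strip()

films = pd.DataFrame({
    "film_title":   ["Spider-Man: Homecoming ", "SPIDERMAN  Homecoming", "Amélie"],
    "release_year": [2017, 2017, 2001],
})
films["title_key"] = films["film_title"].map(normalise_title)

# Duplicates created by combining sources share the same key and year
deduped = films.drop_duplicates(subset=["title_key", "release_year"])
print(deduped)
```

Keeping the original film_title alongside the key means the display title is not lost when the key is used for matching.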

## Production DF Analysis