# Data Combination over Movie Data Sets (CSV Files)

The purpose of this notebook is to document the steps taken in order to generate movies_df and studios_df, which will be used for further analysis to make recommendations to Microsoft on where they could enter the original movie production business.

The scope of this exercise is the provided CSV files:
- bom.Movie_gross.csv.gz
- rt.movie_info_tsv.gz
- rt.reviews.tsv.gz
- tmdb.movies.csv.gz
- tn.movie_budgets.csv.gz

<figure>
    <img src="Converging Arrows.png"
         alt="Let's synthesize some data"
         width="400"
         height="200">
    <figcaption><center><b>Putting it all together</b></center></figcaption>
</figure>

From the EDA notebook:

The CSV files contain the financial data needed to make a recommendation to Microsoft. Our key metric will be the profitability of individual movie titles, then profitability by genre, and finally identifying which people in the movie-making labor force would be best for the job (using a mix of movie profitability and critic reviews).

## Master Table Import: "tn.movie_budgets.csv.gz"

### *Basic Import*

For this combination, we will use 'tn.movie_budgets.csv.gz' as our starting source of truth, as it contains the information most relevant to our investigation (mainly domestic and international box office sales and movie budget).

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import string
punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~'

In [None]:
df_budgets = pd.read_csv("Raw Data/tn.movie_budgets.csv.gz")

In [None]:
df_budgets.head(2)

Converting the financial data into numbers, release_date info into datetime objects, and creating a 'year' column:

In [None]:
# Strip "$" and "," as literal characters (regex=False), then convert to float.
# Without regex=False, some pandas versions may interpret "$" as a regex anchor.
for col in ['worldwide_gross', 'domestic_gross', 'production_budget']:
    df_budgets[col] = (df_budgets[col]
                       .str.replace('$', '', regex=False)
                       .str.replace(',', '', regex=False)
                       .astype(float))

df_budgets['release_date'] = pd.to_datetime(df_budgets['release_date'])

# Extract the release year directly as an integer
df_budgets['year'] = df_budgets['release_date'].dt.year
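
As a cross-check on the conversion step, here is a minimal sketch on made-up values. The toy frame `raw` and its contents are assumptions for illustration, not rows from the data set. It strips dollar signs and commas with a single regular expression and converts with `pd.to_numeric`, which is one alternative among several.

```python
import pandas as pd

# Hypothetical sample values, formatted like currency strings.
raw = pd.DataFrame({
    "production_budget": ["$1,000,000", "$250,000"],
    "worldwide_gross": ["$3,500,000", "$0"],
    "release_date": ["Dec 18, 2009", "May 20, 2011"],
})

for col in ["production_budget", "worldwide_gross"]:
    # The character class [$,] matches a literal dollar sign or a comma.
    cleaned = raw[col].str.replace(r"[$,]", "", regex=True)
    raw[col] = pd.to_numeric(cleaned).astype(float)

# An explicit format avoids relying on date-format inference.
raw["release_date"] = pd.to_datetime(raw["release_date"], format="%b %d, %Y")
raw["year"] = raw["release_date"].dt.year

print(raw.dtypes)
```

Checking `dtypes` after the conversion is a quick way to confirm that no column was left as a string.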

Since we will later join additional genre information on the movie title, we normalize titles by removing punctuation from all titles in this data source and in later data sources:

In [None]:
transtab = str.maketrans(dict.fromkeys(punct, ''))

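# Apply the translation table to each title individually; this avoids the
# join/split round trip, which misaligns rows if any title contains '|'.
df_budgets['movie'] = df_budgets['movie'].str.translate(transtab)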
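
To see what the translation table does to individual titles, here is a small sketch on illustrative titles (the `titles` Series is an assumption, not taken from the data set). It uses `Series.str.translate`, which applies the table to each element.

```python
import pandas as pd

# Same punctuation set as in the notebook ('|' is intentionally absent).
punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~'
transtab = str.maketrans(dict.fromkeys(punct, ''))

# Illustrative titles only.
titles = pd.Series(["Spider-Man: Homecoming", "Ocean's Eleven"])
cleaned = titles.str.translate(transtab)

print(cleaned.tolist())
```

Note that removing a hyphen joins the surrounding words ("SpiderMan"), so both data sources need the same normalization for titles to match.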

We will create a profit column, calculated as worldwide_gross minus production_budget. We will also create a profit_margin column, which expresses profit as a percentage of worldwide gross.

In [None]:
df_budgets['profit'] = df_budgets['worldwide_gross'] - df_budgets['production_budget']
df_budgets['profit_margin'] = (df_budgets['profit'] / df_budgets['worldwide_gross']) * 100
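
As a worked example of the two formulas, here is a sketch on made-up numbers (the `toy` frame is hypothetical). It also shows one possible way to handle a zero worldwide gross, where the margin is undefined: the division yields an infinity, which is replaced with NaN here.

```python
import numpy as np
import pandas as pd

# Hypothetical figures chosen to make the arithmetic easy to follow.
toy = pd.DataFrame({
    "production_budget": [100.0, 50.0, 20.0],
    "worldwide_gross": [400.0, 25.0, 0.0],
})

toy["profit"] = toy["worldwide_gross"] - toy["production_budget"]
toy["profit_margin"] = toy["profit"] / toy["worldwide_gross"] * 100

# Dividing by a zero gross produces +/-inf; mark those margins as missing.
toy["profit_margin"] = toy["profit_margin"].replace([np.inf, -np.inf], np.nan)

print(toy)
```

The first row earns 300 on a gross of 400, a 75% margin. The second loses 25 on a gross of 25, a margin of -100%.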

In [None]:
df_budgets.head()

A studio like Microsoft will likely put at least 100k into its projects, so we will filter for budgets above $100k. It will likely not be interested in festival releases or other distributions that have no box office figures, so we will filter out movies with no box office sales. We also keep only movies released after 2000.

In [None]:
df_budgets = df_budgets[df_budgets['production_budget'] > 100000]
df_budgets = df_budgets[(df_budgets['domestic_gross'] > 1) | (df_budgets['worldwide_gross'] > 1)]
df_budgets = df_budgets[df_budgets['year'] > 2000]
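
The three filters can also be expressed as a single boolean mask. Here is a sketch on a hypothetical `toy` frame, where each dropped row fails exactly one condition:

```python
import pandas as pd

# Hypothetical rows: A fails the budget filter, B has no box office,
# D was released before the year cutoff, and only C passes all three.
toy = pd.DataFrame({
    "movie": ["A", "B", "C", "D"],
    "production_budget": [50_000.0, 2_000_000.0, 5_000_000.0, 1_000_000.0],
    "domestic_gross": [10_000.0, 0.0, 3_000_000.0, 700_000.0],
    "worldwide_gross": [10_000.0, 0.0, 9_000_000.0, 900_000.0],
    "year": [2010, 2012, 2015, 1995],
})

mask = (
    (toy["production_budget"] > 100_000)
    & ((toy["domestic_gross"] > 1) | (toy["worldwide_gross"] > 1))
    & (toy["year"] > 2000)
)
kept = toy[mask]

print(kept["movie"].tolist())
```

Keeping the conditions in one named mask makes it easier to inspect how many rows each rule removes.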

In [None]:
duplicates = df_budgets[df_budgets.duplicated(['movie'], keep=False)]

In [None]:
duplicates
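
To illustrate what `duplicated(..., keep=False)` flags, here is a sketch on illustrative rows (the `toy` frame is an assumption, not taken from the data set). Remakes share a title but not a release year, so checking title and year together is one possible way to tell remakes apart from true duplicate records.

```python
import pandas as pd

# Illustrative rows: one title appears twice with different release years.
toy = pd.DataFrame({
    "movie": ["Halloween", "Halloween", "Inception"],
    "year": [2007, 2018, 2010],
})

# keep=False marks every member of a duplicate group, not just the repeats.
by_title = toy[toy.duplicated(["movie"], keep=False)]
by_title_year = toy[toy.duplicated(["movie", "year"], keep=False)]

print(len(by_title), len(by_title_year))
```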

In [None]:
df_budgets.info()

## Appending Genre using additional IMDB Data

In [None]:
df_additional = pd.read_csv("Raw Data/title.basics.tsv.gz", sep='\t', encoding='latin1')
df_additional['primaryTitle'] = df_additional['primaryTitle'].map(str)
df_additional = df_additional[df_additional['titleType'] == 'movie']

In [None]:
transtab = str.maketrans(dict.fromkeys(punct, ''))

# Apply the translation table per title; the join/split approach raises a
# length mismatch if any title contains '|'.
df_additional['primaryTitle'] = df_additional['primaryTitle'].str.translate(transtab)

## Appending Genre using CSV Data

In [None]:
df_budget_genre = df_budgets.merge(df_additional.drop_duplicates('primaryTitle'), how='left', left_on='movie', right_on='primaryTitle')
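
A title-only merge can attach the genres of a different film that shares the same title, because `drop_duplicates` keeps whichever row comes first. The sketch below uses hypothetical frames (`budgets`, `imdb`) to show this, along with two options: `indicator=True` to count unmatched rows, and a merge on title plus year. The `startYear` column is assumed here to hold integer release years.

```python
import pandas as pd

# Hypothetical data: two different films are titled "Alpha".
budgets = pd.DataFrame({
    "movie": ["Alpha", "Beta", "Gamma"],
    "year": [2010, 2012, 2015],
})
imdb = pd.DataFrame({
    "primaryTitle": ["Alpha", "Alpha", "Beta"],
    "startYear": [1980, 2010, 2012],
    "genres": ["Drama", "Action", "Comedy"],
})

# Title-only merge, as in the notebook, plus a match indicator column.
merged = budgets.merge(
    imdb.drop_duplicates("primaryTitle"),
    how="left",
    left_on="movie",
    right_on="primaryTitle",
    indicator=True,
)
match_counts = merged["_merge"].value_counts()

# Alternative: match on title and year to pick the intended film.
merged_year = budgets.merge(
    imdb,
    how="left",
    left_on=["movie", "year"],
    right_on=["primaryTitle", "startYear"],
)

print(merged[["movie", "genres", "_merge"]])
print(merged_year[["movie", "genres"]])
```

In the title-only merge, the 2010 "Alpha" picks up the genres of the 1980 film. The title-and-year merge finds the intended row, at the cost of missing films whose recorded years differ between sources.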

In [None]:
df_budget_genre.head(60)

In [None]:
df_budget_genre[df_budget_genre['genres'].isnull()]

In [None]:
df_budget_genre.isna().sum()

In [None]:
from pathlib import Path  
filepath = Path('financial_data.csv')  
filepath.parent.mkdir(parents=True, exist_ok=True)  
df_budget_genre.to_csv(filepath)  
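
By default, `to_csv` also writes the DataFrame index as an unnamed first column. If that column is not wanted when the file is read back, passing `index=False` is one option. Here is a sketch using an in-memory buffer and a hypothetical `toy` frame:

```python
import io

import pandas as pd

# Hypothetical frame standing in for the merged financial data.
toy = pd.DataFrame({"movie": ["Alpha"], "profit": [300.0]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # omit the index column from the output
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```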