# Exploratory Data Analysis - Box Office Datasets


## Business Understanding
Microsoft sees all the big companies creating original video content and wants to get in on the fun. It has decided to create a new movie studio, but it doesn't know anything about creating movies.
<br>
You are charged with:
- Exploring what types of films are currently doing the best at the box office
- Translating the findings into actionable insights that the head of Microsoft's new movie studio can use to help decide what type of films to create


## Data Understanding
The folder **`Data`** contains the following movie datasets:
<br>

| File | Source | Contains |
|------|------|------|
|bom.movie_gross.csv|Box Office Mojo|Movie gross information|
|rt.movie_info.tsv|Rotten Tomatoes|Movie information|
|rt.reviews.tsv|Rotten Tomatoes|Movie review information|
|tmdb.movies.csv|TheMovieDB|Movie popularity information|
|tn.movie_budgets.csv|The Numbers|Movie budget information|
|im.db|IMDB|Movie information, as shown in the database schema below|

Because the data was collected from various sources, the files come in different formats, as shown in the table above. Some are compressed CSV (comma-separated values) or TSV (tab-separated values) files that can be opened with spreadsheet software or `pd.read_csv`, while the IMDB data is stored in a SQLite database.

>_Below is the Database Schema of the IMDB database_
![movie_data_erd.jpeg](movie_data_erd.jpeg)
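
Since the IMDB data lives in a SQLite database rather than a flat file, it is read through a database connection instead of `pd.read_csv`. The following is a minimal sketch of that pattern, not the notebook's actual loading code. It builds a tiny in-memory database so it runs on its own; the table name `movie_basics` and its columns are hypothetical placeholders, and in practice the connection would point at the extracted `im.db` file.

```python
import sqlite3
import pandas as pd

# Stand-in for sqlite3.connect(<path to the extracted im.db>);
# an in-memory database keeps this sketch self-contained.
conn = sqlite3.connect(":memory:")

# Hypothetical table and columns, used only for illustration
conn.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
conn.execute("INSERT INTO movie_basics VALUES ('tt0000001', 'Example Movie')")
conn.commit()

# List the tables available in the database
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type = 'table'", conn)

# Load a few rows of one table into a DataFrame
movie_basics = pd.read_sql("SELECT * FROM movie_basics LIMIT 5", conn)

conn.close()
```

Querying `sqlite_master` first is a convenient way to check the table names against the schema diagram before writing joins.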

# IMDB Dataset
## Data Preparation 
### Unzipping the `.zip` files
In the cells below we import the libraries needed to unzip the data and access the SQLite database.

In [1]:
#importing the necessary libraries for unzipping
import glob
import zipfile

In [2]:
# Collect the .zip archives that hold the IMDB data
files = glob.glob('zippedData/*.zip')
files

In [3]:
for file in files:
    print('Unzipping:',file)

    with zipfile.ZipFile(file, 'r') as zip_ref:
        zip_ref.extractall('data/raw')

### 1.0 Load and Preview our datasets
Importing the required libraries for the EDA process

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Box Office Mojo Dataset


In [5]:
# File path for the Box Office Mojo dataset
grossBom_path='zippedData/bom.movie_gross.csv.gz'

# Convert the gzipped CSV file into a DataFrame
grossBom=pd.read_csv(grossBom_path, compression='gzip')
grossBom.head(2)

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
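
In the preview, `domestic_gross` is displayed as a float while `foreign_gross` is not, which suggests (but does not prove) that `foreign_gross` was read as text. The sketch below shows one way to coerce such a column to numbers before combining the two, using a small stand-in DataFrame built from the rows above. The text dtype and the `total_gross` column name are assumptions; checking `grossBom.dtypes` on the real data is the first step.

```python
import pandas as pd

# Stand-in for grossBom, using the two previewed rows
gross = pd.DataFrame({
    "title": ["Toy Story 3", "Alice in Wonderland (2010)"],
    "domestic_gross": [415000000.0, 334200000.0],
    "foreign_gross": ["652000000", "691300000"],  # assumed to be stored as text
})

# Drop any thousands separators, then convert; unparseable values become NaN
gross["foreign_gross"] = pd.to_numeric(
    gross["foreign_gross"].astype(str).str.replace(",", "", regex=False),
    errors="coerce",
)

# Hypothetical derived column: total gross across both markets
gross["total_gross"] = gross["domestic_gross"] + gross["foreign_gross"]
```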


# The MovieDB Dataset

In [6]:
# File path for the TMDB dataset
tmdb_path='zippedData/tmdb.movies.csv.gz'
tmdb=pd.read_csv(tmdb_path,index_col=0,compression='gzip')
tmdb.head(2)

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
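
Because the data comes from a CSV file, each `genre_ids` value is read as a string such as `"[12, 14, 10751]"` rather than as a Python list. A minimal sketch of parsing that column, using a stand-in DataFrame built from the two previewed rows (the names `tmdb_sample` and `genres_long` are illustrative only):

```python
import ast
import pandas as pd

# Stand-in for tmdb, using the two previewed rows
tmdb_sample = pd.DataFrame({
    "title": [
        "Harry Potter and the Deathly Hallows: Part 1",
        "How to Train Your Dragon",
    ],
    "genre_ids": ["[12, 14, 10751]", "[14, 12, 16, 10751]"],
})

# Safely evaluate each string into a real list of integers
tmdb_sample["genre_ids"] = tmdb_sample["genre_ids"].apply(ast.literal_eval)

# One row per (movie, genre id) pair, convenient for per-genre aggregation
genres_long = tmdb_sample.explode("genre_ids")
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for this kind of parsing.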


# The Numbers Dataset 

In [7]:
# File path for The Numbers dataset
tnBudget_path='zippedData/tn.movie_budgets.csv.gz'
tnBudget=pd.read_csv(tnBudget_path,compression='gzip')
tnBudget.head(2)

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
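
The monetary columns in this preview are formatted strings (for example `"$425,000,000"`), so they need cleaning before any arithmetic. The following sketch shows one possible approach on a stand-in DataFrame built from the two previewed rows. It assumes every value follows the `$` plus comma-separated digits pattern and every date follows the `"Dec 18, 2009"` pattern seen above.

```python
import pandas as pd

# Stand-in for tnBudget, using the two previewed rows
budgets = pd.DataFrame({
    "release_date": ["Dec 18, 2009", "May 20, 2011"],
    "movie": ["Avatar", "Pirates of the Caribbean: On Stranger Tides"],
    "production_budget": ["$425,000,000", "$410,600,000"],
    "domestic_gross": ["$760,507,625", "$241,063,875"],
    "worldwide_gross": ["$2,776,345,279", "$1,045,663,875"],
})

# Strip "$" and "," from each monetary column, then convert to integers
money_cols = ["production_budget", "domestic_gross", "worldwide_gross"]
for col in money_cols:
    budgets[col] = budgets[col].str.replace(r"[$,]", "", regex=True).astype("int64")

# Parse the release date strings into datetime values
budgets["release_date"] = pd.to_datetime(budgets["release_date"], format="%b %d, %Y")
```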


# Rotten Tomatoes Dataset

In [8]:
# File paths for the Rotten Tomatoes TSV files
rt_movie_path='zippedData/rt.movie_info.tsv.gz'
rt_reviews_path='zippedData/rt.reviews.tsv.gz'

# Convert the gzipped TSV files into DataFrames
rt_movies=pd.read_csv(rt_movie_path,delimiter='\t',encoding='latin1',compression='gzip')
rt_reviews=pd.read_csv(rt_reviews_path,delimiter='\t',encoding='latin1',compression='gzip')

# Preview of the Rotten Tomatoes movies dataset
rt_movies.head(2)

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
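
Two columns in this preview pack structured values into text: `runtime` carries a unit suffix (`"104 minutes"`) and `genre` joins several genres with `|`. A minimal sketch of unpacking both, on a stand-in DataFrame built from the previewed rows (the new column names `runtime_minutes` and `genre_list` are illustrative):

```python
import pandas as pd

# Stand-in for rt_movies, using the two previewed rows
movies = pd.DataFrame({
    "id": [1, 3],
    "genre": [
        "Action and Adventure|Classics|Drama",
        "Drama|Science Fiction and Fantasy",
    ],
    "runtime": ["104 minutes", "108 minutes"],
})

# Pull the leading number out of the runtime text; missing values stay NaN
movies["runtime_minutes"] = (
    movies["runtime"].str.extract(r"(\d+)", expand=False).astype(float)
)

# Split the pipe-delimited genre string into a list of genres
movies["genre_list"] = movies["genre"].str.split("|")
```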


In [9]:
# Preview of the Rotten Tomatoes reviews dataset
rt_reviews.head(2)

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
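
The `rating` column in this preview holds a fraction written as text (`"3/5"`) and can be missing, while `fresh` is a two-valued label. The sketch below shows one way to turn both into analysis-ready values, on a stand-in DataFrame built from the previewed rows. It assumes ratings follow a `number/number` pattern; any rating in another format would come out as NaN here and would need separate handling. The column names `rating_score` and `is_fresh` are illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in for rt_reviews, using the two previewed rows
reviews = pd.DataFrame({
    "id": [3, 3],
    "rating": ["3/5", np.nan],
    "fresh": ["fresh", "rotten"],
})

# Capture numerator and denominator; rows that do not match become NaN
parts = reviews["rating"].str.extract(r"^(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)$").astype(float)

# Normalise each rating to a 0-1 score
reviews["rating_score"] = parts[0] / parts[1]

# Boolean flag for the fresh/rotten label
reviews["is_fresh"] = reviews["fresh"].eq("fresh")
```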
