# Business Problem
For this project, you have been hired to produce a MySQL database on Movies from a subset of IMDB's publicly available dataset. Ultimately, you will use this database to analyze what makes a movie successful and will provide recommendations to the stakeholder on how to make a successful movie.

Over the course of this project, you will:

* Part 1: Download several files from IMDB’s movie data set and filter out the subset of moves requested by the stakeholder.
* Part 2: Use an API to extract box office revenue and profit data to add to your IMDB data and perform exploratory data analysis.
* Part 3: Construct and export a MySQL database using your data.
* Part 4: Apply hypothesis testing to explore what makes a movie successful.
* Part 5 (Optional): Produce a Linear Regression model to predict movie performance.

# Specifications
Your stakeholder only wants you to include information for movies based on the following specifications:

- Exclude any movie with missing values for genre or runtime
- Include only full-length movies (titleType = "movie").
- Include only fictional movies (not from documentary genre)
- Include only movies that were released 2000 - 2021 (include 2000 and 2021)
- Include only movies that were released in the United States

## Data Source
https://datasets.imdbws.com/

<img src='blue_long_2-9665a76b1ae401a510ec1e0ca40ddcb3b0cfe45f1d51b77a308fea0845885648.svg'>


# Imports

In [1]:
import pandas as pd
import numpy as np

## URLS

In [2]:
basics_url = 'https://datasets.imdbws.com/title.basics.tsv.gz'
ratings_url = 'https://datasets.imdbws.com/title.ratings.tsv.gz'
akas_url = 'https://datasets.imdbws.com/title.akas.tsv.gz'

## Loading TSV's with Pandas

In [3]:
basics = pd.read_csv(basics_url, sep = '\t', low_memory = False)
ratings = pd.read_csv(ratings_url, sep = '\t', low_memory = False)
akas = pd.read_csv(akas_url, sep = '\t', low_memory = False)

## Replacing \N to NaN

In [4]:
basics.replace({'\\N' : np.nan}, inplace = True)
ratings.replace({'\\N' : np.nan}, inplace = True)
akas.replace({'\\N' : np.nan}, inplace = True)

# Loading data and Preprocessing
## Basics

In [5]:
basics.info()
basics.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9864481 entries, 0 to 9864480
Data columns (total 9 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          object
 1   titleType       object
 2   primaryTitle    object
 3   originalTitle   object
 4   isAdult         object
 5   startYear       object
 6   endYear         object
 7   runtimeMinutes  object
 8   genres          object
dtypes: object(9)
memory usage: 677.3+ MB


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,,1,"Comedy,Short"


In [12]:
# Eliminate movies that are null for runtimeMinutes
basics.dropna(subset = ['runtimeMinutes'], inplace = True)
# Eliminate movies that are null for genre
basics.dropna(subset = ['genres'], inplace = True)
# keep only titleType==Movie
basics = basics[basics['titleType'] == 'movie']
# keep startYear 2000-2022
basics.dropna(subset = ['startYear'], inplace = True)
basics['startYear'] = basics['startYear'].astype(int)
basics = basics[(basics['startYear'] >= 2000) & (basics['startYear'] <= 2022)]
# Eliminate movies that include "Documentary" in genre
is_documentary = basics['genres'].str.contains('documentary', case = False)
basics = basics[~is_documentary]
# Keep only US movies*

basics.info()
basics.sample(3)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 147574 entries, 34803 to 9864331
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          147574 non-null  object
 1   titleType       147574 non-null  object
 2   primaryTitle    147574 non-null  object
 3   originalTitle   147574 non-null  object
 4   isAdult         147574 non-null  object
 5   startYear       147574 non-null  int32 
 6   endYear         0 non-null       object
 7   runtimeMinutes  147574 non-null  object
 8   genres          147574 non-null  object
dtypes: int32(1), object(8)
memory usage: 10.7+ MB


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
9379990,tt8869128,movie,Summer Camp,Summer Camp,0,2018,,128,"Comedy,Drama"
2416835,tt1262896,movie,Forever Plaid,Forever Plaid,0,2008,,90,"Comedy,Musical"
9793533,tt9763638,movie,Mike Polk Jr. Live at the Kent Stage,Mike Polk Jr. Live at the Kent Stage,0,2019,,56,Comedy


- We have reduced the number of rows in the basic data file. We haven't done the 'Keep only US movies' just yet as region is only found in akas.

## AKAs

In [18]:
akas.info()
akas.sample(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35949703 entries, 0 to 35949702
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   titleId          object
 1   ordering         int64 
 2   title            object
 3   region           object
 4   language         object
 5   types            object
 6   attributes       object
 7   isOriginalTitle  object
dtypes: int64(1), object(7)
memory usage: 2.1+ GB


Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
7114316,tt11664878,7,Episodio #1.24,ES,es,,,0
15056541,tt15410190,26,Fishbowl Wives,ID,en,imdbDisplay,,0
20917587,tt21617584,6,2004年11月3日 のエピソード,JP,ja,,,0
7691600,tt11927000,2,Episodio #1.81,ES,es,,,0
6486535,tt11394288,5,Blanco de verano,MX,,imdbDisplay,,0


In [22]:
akas = akas[akas['region'] == 'US']
akas.info()
akas.sample(3)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1438075 entries, 5 to 35949447
Data columns (total 8 columns):
 #   Column           Non-Null Count    Dtype 
---  ------           --------------    ----- 
 0   titleId          1438075 non-null  object
 1   ordering         1438075 non-null  int64 
 2   title            1438075 non-null  object
 3   region           1438075 non-null  object
 4   language         3933 non-null     object
 5   types            978977 non-null   object
 6   attributes       46596 non-null    object
 7   isOriginalTitle  1436730 non-null  object
dtypes: int64(1), object(7)
memory usage: 98.7+ MB


Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
30613103,tt6749018,1,Trouble Maker,US,,imdbDisplay,,0
13647010,tt14754772,1,Requiems and Revivals: Facing the Past,US,,,,0
11851378,tt1388410,1,Stormrise,US,,imdbDisplay,,0


## Ratings

In [24]:
ratings.info()
ratings.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313762 entries, 0 to 1313761
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1313762 non-null  object 
 1   averageRating  1313762 non-null  float64
 2   numVotes       1313762 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 30.1+ MB


Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1974
1,tt0000002,5.8,264
2,tt0000003,6.5,1822
3,tt0000004,5.6,178
4,tt0000005,6.2,2617


In [27]:
US = ratings['tconst'].isin(akas['titleId'])
ratings = ratings[US]

In [28]:
ratings.info()
ratings.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 499625 entries, 0 to 1313737
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   tconst         499625 non-null  object 
 1   averageRating  499625 non-null  float64
 2   numVotes       499625 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 15.2+ MB


Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1974
1,tt0000002,5.8,264
4,tt0000005,6.2,2617
5,tt0000006,5.1,182
6,tt0000007,5.4,820


# Save

In [29]:
# example making new folder with os
import os
os.makedirs('Data/',exist_ok=True) 
# Confirm folder created
os.listdir('Data/')

[]

In [30]:
basics.to_csv('Data/title_basics.csv.gz', compression = 'gzip', index = False)
akas.to_csv('Data/title_akas.csv.gz', compression = 'gzip', index = False)
ratings.to_csv('Data/title_ratings.csv.gz', compression = 'gzip', index = False)

In [31]:
# Open saved file and preview again
basics = pd.read_csv('Data/title_basics.csv.gz', low_memory = False)
basics.sample(3)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
41727,tt1929276,movie,Estranged,Estranged,0,2015,,92,"Horror,Mystery,Thriller"
73903,tt6253942,movie,Do It Right,De toutes mes forces,0,2017,,98,Drama
20370,tt11660572,movie,Bridges,Bridges,0,2021,,81,Drama


In [32]:
# Open saved file and preview again
akas = pd.read_csv('Data/title_akas.csv.gz', low_memory = False)
akas.sample(3)

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
595714,tt13740096,2,Forty Soldiers,US,,imdbDisplay,,0.0
261391,tt0465430,13,The Cottage,US,,imdbDisplay,,0.0
516421,tt12346504,1,Giant Alligator Feeding!,US,,imdbDisplay,,0.0


In [33]:
# Open saved file and preview again
ratings = pd.read_csv('Data/title_ratings.csv.gz', low_memory = False)
ratings.sample(3)

Unnamed: 0,tconst,averageRating,numVotes
481685,tt8315874,6.6,197
328617,tt1786763,6.4,56
297559,tt1476886,5.7,29
