# Business Problem
For this project, you have been hired to build a MySQL database of movies from a subset of IMDb's publicly available dataset. You will then use this database to analyze what makes a movie successful and recommend to the stakeholder how to make one.

Over the course of this project, you will:

* Part 1: Download several files from IMDb’s movie data set and filter out the subset of movies requested by the stakeholder.
* Part 2: Use an API to extract box office revenue and profit data to add to your IMDB data and perform exploratory data analysis.
* Part 3: Construct and export a MySQL database using your data.
* Part 4: Apply hypothesis testing to explore what makes a movie successful.
* Part 5 (Optional): Produce a Linear Regression model to predict movie performance.

# Part 1

## Specifications
Your stakeholder only wants you to include information for movies based on the following specifications:

- Exclude any movie with missing values for genre or runtime.
- Include only full-length movies (titleType = "movie").
- Include only fictional movies (exclude the Documentary genre).
- Include only movies released between 2000 and 2021, inclusive.
- Include only movies that were released in the United States
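
The five rules above can be sketched as a single function on a small toy table. This is only an illustration: `apply_specifications`, the toy rows, and `us_title_ids` are hypothetical names that are not part of the notebook, and the year bounds follow the specification as written (2000 to 2021, inclusive).

```python
import numpy as np
import pandas as pd


def apply_specifications(basics: pd.DataFrame, us_title_ids: set) -> pd.DataFrame:
    """Hypothetical helper: apply the five stakeholder rules to a basics-like frame."""
    # Rule 1: drop rows with missing genre or runtime (and missing year, so it can be cast)
    df = basics.dropna(subset=["genres", "runtimeMinutes", "startYear"]).copy()
    df["startYear"] = df["startYear"].astype(int)
    is_movie = df["titleType"] == "movie"                               # Rule 2
    is_fiction = ~df["genres"].str.contains("documentary", case=False)  # Rule 3
    in_range = df["startYear"].between(2000, 2021)                      # Rule 4, inclusive
    in_us = df["tconst"].isin(us_title_ids)                             # Rule 5
    return df[is_movie & is_fiction & in_range & in_us]


# Toy data (made up for illustration): each row after tt1 breaks one rule, except tt7
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5", "tt6", "tt7", "tt8"],
    "titleType": ["movie", "short", "movie", "movie", "movie", "movie", "movie", "movie"],
    "startYear": ["2005", "2005", "1999", "2010", "2012", "2015", "2021", "2022"],
    "runtimeMinutes": ["90", "10", "95", np.nan, "100", "80", "110", "120"],
    "genres": ["Drama", "Comedy", "Drama", "Drama", "Action,Documentary",
               "Comedy", "Comedy", "Horror"],
})
us_ids = {"tt1", "tt2", "tt3", "tt4", "tt5", "tt7", "tt8"}  # tt6 has no US release
kept = apply_specifications(toy, us_title_ids=us_ids)
print(kept["tconst"].tolist())
```

Only `tt1` and `tt7` survive: the 2021 title is kept because the range is inclusive, while the 2022 title is dropped.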

## Data Source
https://datasets.imdbws.com/



## Imports

In [1]:
import pandas as pd
import numpy as np 
import os

## URLs

In [2]:
basics_url = 'https://datasets.imdbws.com/title.basics.tsv.gz'
ratings_url = 'https://datasets.imdbws.com/title.ratings.tsv.gz'
akas_url = 'https://datasets.imdbws.com/title.akas.tsv.gz'

## Loading TSVs with pandas

In [3]:
basics = pd.read_csv(basics_url, sep = '\t', low_memory = False)
ratings = pd.read_csv(ratings_url, sep = '\t', low_memory = False)
akas = pd.read_csv(akas_url, sep = '\t', low_memory = False)

## Replacing `\N` with NaN

In [4]:
basics.replace({'\\N' : np.nan}, inplace = True)
ratings.replace({'\\N' : np.nan}, inplace = True)
akas.replace({'\\N' : np.nan}, inplace = True)
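
As an alternative sketch, the `\N` placeholder can be converted to NaN while the file is being read, via the `na_values` parameter of `pd.read_csv`. The example below uses a tiny in-memory TSV (made up for illustration) rather than the real IMDb files, so it runs without a download.

```python
import io

import pandas as pd

# A two-row stand-in for a title.basics-style TSV; '\\N' in Python source is the two characters \N
sample_tsv = (
    "tconst\tstartYear\tgenres\n"
    "tt0000001\t1894\tDocumentary,Short\n"
    "tt0000002\t\\N\t\\N\n"
)

# na_values tells the parser to treat \N as missing at read time
parsed = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")
print(parsed.isna().sum())
```

One side effect worth knowing: because the placeholder never reaches the column as a string, a numeric column such as `startYear` is parsed as a float column with NaN instead of an object column.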

## Loading data and Preprocessing
### Basics

In [5]:
basics.info()
basics.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9906183 entries, 0 to 9906182
Data columns (total 9 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          object
 1   titleType       object
 2   primaryTitle    object
 3   originalTitle   object
 4   isAdult         object
 5   startYear       object
 6   endYear         object
 7   runtimeMinutes  object
 8   genres          object
dtypes: object(9)
memory usage: 680.2+ MB


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,,1,"Comedy,Short"


In [6]:
# Drop titles with a missing runtime
basics.dropna(subset = ['runtimeMinutes'], inplace = True)
# Drop titles with a missing genre
basics.dropna(subset = ['genres'], inplace = True)
# Keep only titleType == 'movie'
basics = basics[basics['titleType'] == 'movie']
# Keep startYear 2000-2022 (note: the specification asks for 2000-2021; the outputs below reflect the 2022 bound)
basics.dropna(subset = ['startYear'], inplace = True)
basics['startYear'] = basics['startYear'].astype(int)
basics = basics[(basics['startYear'] >= 2000) & (basics['startYear'] <= 2022)]
# Drop titles whose genres include "Documentary"
is_documentary = basics['genres'].str.contains('documentary', case = False)
basics = basics[~is_documentary]
# The US-only filter is applied later, because region is only available in akas

basics.info()
basics.sample(3)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 147657 entries, 34803 to 9906033
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          147657 non-null  object
 1   titleType       147657 non-null  object
 2   primaryTitle    147657 non-null  object
 3   originalTitle   147657 non-null  object
 4   isAdult         147657 non-null  object
 5   startYear       147657 non-null  int32 
 6   endYear         0 non-null       object
 7   runtimeMinutes  147657 non-null  object
 8   genres          147657 non-null  object
dtypes: int32(1), object(8)
memory usage: 10.7+ MB


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
4033486,tt15614672,movie,Rakshak Fantastic 4,Rakshak Fantastic 4,0,2009,,45,Crime
2478571,tt12747860,movie,Bhima cha Wagh,Bhima cha Wagh,0,2015,,145,Drama
839943,tt0867306,movie,The Human Trace,The Human Trace,0,2008,,109,Thriller


- These filters reduce basics from 9,906,183 to 147,657 rows. The 'Keep only US movies' filter has not been applied yet, because region is only available in akas.
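
A minimal sketch of how a region column in one table can filter another, using `Series.isin` on toy data. The frames `basics_toy` and `akas_toy` are hypothetical stand-ins, not the real files.

```python
import pandas as pd

# Toy stand-ins: basics has one row per title, akas has one row per title/region pair
basics_toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primaryTitle": ["A", "B", "C"],
})
akas_toy = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt2", "tt3", "tt3"],
    "region": ["US", "GB", "FR", "US", "US"],
})

# Collect the IDs that have at least one US entry, then keep matching basics rows
us_ids = akas_toy.loc[akas_toy["region"] == "US", "titleId"]
us_basics = basics_toy[basics_toy["tconst"].isin(us_ids)]
print(us_basics["tconst"].tolist())
```

Using `isin` keeps one row per title even when a title has several US entries in akas (as `tt3` does here), whereas an inner merge on the ID would repeat that title once per matching akas row.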

### AKAs

In [7]:
akas.info()
akas.sample(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36139188 entries, 0 to 36139187
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   titleId          object
 1   ordering         int64 
 2   title            object
 3   region           object
 4   language         object
 5   types            object
 6   attributes       object
 7   isOriginalTitle  object
dtypes: int64(1), object(7)
memory usage: 2.2+ GB


Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
10318852,tt13156472,10,Prince of Muck,US,,imdbDisplay,,0
7771583,tt11961526,1,The Three Little Bumble Nums,CA,,,,0
35265187,tt9457430,1,エピソード #1.331,JP,ja,,,0
34140190,tt8850038,1,Vintage Crown Point with the Cowbell Song,CA,,,,0
19089778,tt1956432,2,Ashes,US,,imdbDisplay,,0


In [8]:
akas = akas[akas['region'] == 'US']
akas.info()
akas.sample(3)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1443212 entries, 5 to 36138932
Data columns (total 8 columns):
 #   Column           Non-Null Count    Dtype 
---  ------           --------------    ----- 
 0   titleId          1443212 non-null  object
 1   ordering         1443212 non-null  int64 
 2   title            1443212 non-null  object
 3   region           1443212 non-null  object
 4   language         3945 non-null     object
 5   types            979812 non-null   object
 6   attributes       46703 non-null    object
 7   isOriginalTitle  1441870 non-null  object
dtypes: int64(1), object(7)
memory usage: 99.1+ MB


Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
1735275,tt0345672,4,The Bridge,US,,festival,,0
472823,tt0062347,14,The Hired Killer,US,,imdbDisplay,,0
34314384,tt8943888,1,At Least 5 People Are Dead as Tropical Storm F...,US,,,,0


In [9]:
# Flag the rows in basics whose tconst appears in the US-filtered akas
akas_keepers = basics['tconst'].isin(akas['titleId'])
# Keep only the flagged rows
basics = basics[akas_keepers]
# Check that the number of entries has decreased
basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 86756 entries, 34803 to 9905949
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tconst          86756 non-null  object
 1   titleType       86756 non-null  object
 2   primaryTitle    86756 non-null  object
 3   originalTitle   86756 non-null  object
 4   isAdult         86756 non-null  object
 5   startYear       86756 non-null  int32 
 6   endYear         0 non-null      object
 7   runtimeMinutes  86756 non-null  object
 8   genres          86756 non-null  object
dtypes: int32(1), object(8)
memory usage: 6.3+ MB


### Ratings

In [10]:
ratings.info()
ratings.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1320058 entries, 0 to 1320057
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1320058 non-null  object 
 1   averageRating  1320058 non-null  float64
 2   numVotes       1320058 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 30.2+ MB


Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1978
1,tt0000002,5.8,265
2,tt0000003,6.5,1831
3,tt0000004,5.6,179
4,tt0000005,6.2,2621


In [11]:
# Keep only ratings for titles that have a US entry in akas
rating_keepers = ratings['tconst'].isin(akas['titleId'])
ratings = ratings[rating_keepers]

In [12]:
ratings.info()
ratings.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 501007 entries, 0 to 1320033
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   tconst         501007 non-null  object 
 1   averageRating  501007 non-null  float64
 2   numVotes       501007 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 15.3+ MB


Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1978
1,tt0000002,5.8,265
4,tt0000005,6.2,2621
5,tt0000006,5.1,182
6,tt0000007,5.4,821


## Save

In [13]:
# Create the Data/ folder if it does not already exist
os.makedirs('Data/', exist_ok = True)
# List the folder contents to confirm it exists
os.listdir('Data/')

['title_akas.csv.gz', 'title_basics.csv.gz', 'title_ratings.csv.gz']

In [14]:
basics.to_csv('Data/title_basics.csv.gz', compression = 'gzip', index = False)
akas.to_csv('Data/title_akas.csv.gz', compression = 'gzip', index = False)
ratings.to_csv('Data/title_ratings.csv.gz', compression = 'gzip', index = False)

In [15]:
# Open saved file and preview again
basics = pd.read_csv('Data/title_basics.csv.gz', low_memory = False)
basics.sample(3)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
4730,tt0341546,movie,Replay,Replay,0,2003,,85,"Crime,Mystery"
31511,tt1483507,movie,Shirley Adams,Shirley Adams,0,2009,,92,Drama
21062,tt11853944,movie,The Doll 3,The Doll 3,0,2022,,115,"Horror,Thriller"


In [16]:
basics_num_of_rows = len(basics)
print(f"The number of rows is {basics_num_of_rows}")

The number of rows is 86756


In [17]:
# Open saved file and preview again
akas = pd.read_csv('Data/title_akas.csv.gz', low_memory = False)
akas.sample(3)

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
206925,tt0349166,2,Brain Child,US,,working,,0.0
896338,tt2281385,2,Nostalgia,US,,imdbDisplay,,0.0
779540,tt1830899,1,Vibrations: A Documentary,US,,imdbDisplay,,0.0


In [18]:
akas_num_of_rows = len(akas)
print(f"The number of rows is {akas_num_of_rows}")

The number of rows is 1443212


In [19]:
# Open saved file and preview again
ratings = pd.read_csv('Data/title_ratings.csv.gz', low_memory = False)
ratings.sample(3)

Unnamed: 0,tconst,averageRating,numVotes
249369,tt11512490,7.0,31
134622,tt0442703,6.9,68
208706,tt0820887,6.2,43


In [21]:
ratings_num_of_rows = len(ratings)
print(f"The number of rows is {ratings_num_of_rows}")

The number of rows is 501007
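
The row counts above match the counts before saving. A further check that may be worth adding is whether column dtypes survive the round trip, since the reloaded akas preview shows `isOriginalTitle` as `0.0`. The sketch below reproduces that effect on a toy frame written to a temporary directory; the frame and file name are made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame mimicking akas after the \N replacement: string flags plus one missing value
frame = pd.DataFrame({
    "titleId": ["tt1", "tt2", "tt3"],
    "isOriginalTitle": ["0", "1", np.nan],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_akas.csv.gz")
    frame.to_csv(path, compression="gzip", index=False)
    # read_csv infers gzip compression from the .gz extension
    reloaded = pd.read_csv(path)

# The shape is preserved, but the flag column is re-inferred as a numeric column
print(reloaded.shape == frame.shape)
print(frame["isOriginalTitle"].dtype, "->", reloaded["isOriginalTitle"].dtype)
```

CSV stores no type information, so pandas infers dtypes again on load. A column of `"0"`/`"1"` strings with a missing value comes back as floats, which is why the flag appears as `0.0` after reloading.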
