# Business Problem

For this project, you have been hired to produce a MySQL database of movies from a subset of IMDB's publicly available dataset. You will then use this database to analyze what makes a movie successful and give the stakeholder recommendations on how to make one.

Over the course of this project, you will:

- Part 1: Download several files from IMDB’s movie data set and filter out the subset of movies requested by the stakeholder.
- Part 2: Use an API to extract box office revenue and profit data to add to your IMDB data and perform exploratory data analysis.
- Part 3: Construct and export a MySQL database using your data.
- Part 4: Apply hypothesis testing to explore what makes a movie successful.
- Part 5 (Optional): Produce a Linear Regression model to predict movie performance.

For Part 1 of the project, you will be creating your project repository, loading the official IMDB data for the requested tables, filtering out unnecessary data, and saving the filtered tables as gzip-compressed csv files (".csv.gz") in your repository.

## Getting Started Tips:
Please make sure to read the lesson "Getting Started - Project 3" for additional tips and directions!

## The Data
- IMDB provides several files with varied information on movies, TV shows, made-for-TV movies, etc.

 - Overview/Data Dictionary: https://www.imdb.com/interfaces/
 - Downloads page: https://datasets.imdbws.com/


- Based on their previous research, the stakeholder wants to focus on the following files:

 - title.basics.tsv.gz
 - title.ratings.tsv.gz
 - title.akas.tsv.gz

## Specifications

Your stakeholder only wants you to include information for movies based on the following specifications:

- Exclude any movie with missing values for genre or runtime.
- Include only full-length movies (titleType = "movie").
- Include only fictional movies (exclude the Documentary genre).
- Include only movies released from 2000 through 2021 (inclusive).
- Include only movies that were released in the United States.
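
As a rough illustration of how these five rules combine, the sketch below applies them as one boolean mask to a few made-up rows. The frames `toy_basics` and `toy_akas` and all of their values are assumptions for illustration only; the column names follow the IMDB files listed above.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for title.basics (values are made up for illustration).
toy_basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5", "tt6"],
    "titleType": ["movie", "short", "movie", "movie", "movie", "movie"],
    "startYear": ["2001", "2005", "1999", "2010", "2021", "2015"],
    "runtimeMinutes": ["118", "12", "95", np.nan, "101", "88"],
    "genres": ["Comedy", "Drama", "Drama", "Horror",
               "Documentary,History", "Action,Drama"],
})
# Toy rows standing in for title.akas.
toy_akas = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt5", "tt6"],
    "region": ["US", "GB", "US", "FR"],
})

us_ids = toy_akas.loc[toy_akas["region"] == "US", "titleId"]
years = pd.to_numeric(toy_basics["startYear"], errors="coerce")

mask = (
    toy_basics["runtimeMinutes"].notna()          # runtime present
    & toy_basics["genres"].notna()                # genre present
    & (toy_basics["titleType"] == "movie")        # full-length movies only
    & ~toy_basics["genres"].str.contains("Documentary", case=False, na=False)
    & years.between(2000, 2021)                   # inclusive on both ends
    & toy_basics["tconst"].isin(us_ids)           # has a US entry in akas
)
filtered = toy_basics[mask]
print(filtered["tconst"].tolist())
```

Only `tt1` satisfies every rule in this toy data; each of the other rows fails at least one condition.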

## Deliverable
After filtering out movies that do not meet the stakeholder's specifications:

- Before saving, run a final .info() for each of the dataframes to show a summary of how many movies remain and the datatypes of each feature
- Save each dataframe as a compressed csv file in the "Data/" folder inside your repository.
- Commit your changes to your repository in GitHub Desktop and Publish repository / Push Changes.
- Submit the link to your repository

# Part 1: Download files from IMDB’s movie data set

## Imports

In [1]:
import pandas as pd
import numpy as np

#pd.set_option('display.max_rows', None) 

## Load Data

In [2]:
# set urls for database from IMDB website
url_basics = 'https://datasets.imdbws.com/title.basics.tsv.gz'
url_akas = 'https://datasets.imdbws.com/title.akas.tsv.gz'
url_ratings = 'https://datasets.imdbws.com/title.ratings.tsv.gz'

In [3]:
# load data
basics = pd.read_csv(url_basics, sep = '\t', low_memory = False)
akas = pd.read_csv(url_akas, sep = '\t', low_memory = False)
ratings = pd.read_csv(url_ratings, sep = '\t', low_memory = False)
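
One possible alternative to replacing `\N` after loading is to tell `read_csv` about the marker up front through its `na_values` parameter. The sketch below runs on a tiny in-memory sample rather than the real download; the rows in `sample_tsv` are made up for illustration.

```python
import io
import pandas as pd

# Tiny tab-separated sample in the same shape as title.basics (made-up rows).
sample_tsv = (
    "tconst\ttitleType\tstartYear\truntimeMinutes\tgenres\n"
    "tt0000001\tmovie\t2001\t118\tComedy\n"
    "tt0000002\tmovie\t\\N\t\\N\tDrama\n"
    "tt0000003\tshort\t2010\t12\t\\N\n"
)

# na_values adds the literal two-character string \N to the markers
# that pandas already treats as missing.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")
print(sample.isna().sum())
```

With this approach the later `replace` step would be unnecessary, and numeric columns such as `startYear` are parsed as numbers instead of strings.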

## Data Cleaning

### Title Basics Database

#### Replace "\N" with np.nan

In [4]:
basics.isna().sum()

tconst             0
titleType          0
primaryTitle      11
originalTitle     11
isAdult            0
startYear          0
endYear            0
runtimeMinutes     0
genres            15
dtype: int64

In [5]:
# Missing values appear as both NaN and the string \N. Replace \N with NaN so they can all be dropped together.
basics.replace({'\\N':np.nan}, inplace = True)
basics.isna().sum()

tconst                  0
titleType               0
primaryTitle           11
originalTitle          11
isAdult                 1
startYear         1334564
endYear           9783545
runtimeMinutes    6969735
genres             444697
dtype: int64

#### Eliminate movies that are null for runtimeMinutes

In [6]:
basics.dropna(subset = ['runtimeMinutes'], axis = 0, inplace = True)
basics.isna().sum()

tconst                  0
titleType               0
primaryTitle            1
originalTitle           1
isAdult                 1
startYear          170803
endYear           2869321
runtimeMinutes          0
genres              77286
dtype: int64

#### Eliminate movies that are null for genre

In [7]:
basics.dropna(subset = ['genres'], axis = 0, inplace = True)
basics.isna().sum()

tconst                  0
titleType               0
primaryTitle            1
originalTitle           1
isAdult                 0
startYear          165858
endYear           2793630
runtimeMinutes          0
genres                  0
dtype: int64
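
The two `dropna` cells above could also be written as a single call, since `subset` accepts a list of columns. A minimal sketch on made-up rows (the frame `toy` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "runtimeMinutes": ["90", "\\N", "100", "45"],
    "genres": ["Drama", "Comedy", "\\N", "Horror"],
})

# Convert the \N placeholder to NaN, then drop rows missing either field.
cleaned = toy.replace({"\\N": np.nan}).dropna(subset=["runtimeMinutes", "genres"])
print(cleaned["tconst"].tolist())
```

A row is dropped if any column listed in `subset` is missing, which matches running the two separate drops in sequence.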

#### Keep only titleType == "movie"

In [8]:
basics = basics[basics['titleType'] == 'movie']
basics['titleType'].info()

<class 'pandas.core.series.Series'>
Int64Index: 383178 entries, 8 to 9890706
Series name: titleType
Non-Null Count   Dtype 
--------------   ----- 
383178 non-null  object
dtypes: object(1)
memory usage: 5.8+ MB


In [9]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 383178 entries, 8 to 9890706
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          383178 non-null  object
 1   titleType       383178 non-null  object
 2   primaryTitle    383178 non-null  object
 3   originalTitle   383178 non-null  object
 4   isAdult         383178 non-null  object
 5   startYear       376718 non-null  object
 6   endYear         0 non-null       object
 7   runtimeMinutes  383178 non-null  object
 8   genres          383178 non-null  object
dtypes: object(9)
memory usage: 29.2+ MB


#### Keep startYear 2000-2022

In [10]:
basics.dropna(subset = ['startYear'], axis = 0, inplace = True)
basics['startYear'] = basics['startYear'].astype(dtype = int) 
basics = basics[(basics['startYear'] >= 2000) & (basics['startYear'] <= 2022)]
basics['startYear'].describe()

count    223583.000000
mean       2013.372980
std           5.853606
min        2000.000000
25%        2009.000000
50%        2014.000000
75%        2018.000000
max        2022.000000
Name: startYear, dtype: float64
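
The Specifications section asks for movies released from 2000 through 2021 inclusive, while the cell above keeps years up to 2022. The sketch below shows one way to express the specified range on made-up rows, using `pd.to_numeric` and `Series.between`; the frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5"],
    "startYear": ["1999", "2000", "2021", "2022", None],
})

# errors="coerce" turns missing or unparseable years into NaN instead of raising.
toy = toy.assign(startYear=pd.to_numeric(toy["startYear"], errors="coerce"))

# between() is inclusive on both ends by default, and NaN never matches.
in_range = toy[toy["startYear"].between(2000, 2021)].copy()

# No NaN remains after the filter, so the column can be cast to int safely.
in_range["startYear"] = in_range["startYear"].astype(int)
print(in_range)
```

Taking a `.copy()` of the filtered frame before assigning to a column avoids modifying a view of the original data.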

#### Eliminate movies that include "Documentary" in genre

In [11]:
basics['genres'].value_counts()

Documentary                  53314
Drama                        36068
Comedy                       13459
Comedy,Drama                  6459
Horror                        5804
                             ...  
Crime,Documentary,Romance        1
Animation,Biography,Sport        1
Adventure,History,Music          1
Adventure,History,War            1
Crime,Fantasy,Sci-Fi             1
Name: genres, Length: 1187, dtype: int64

In [12]:
is_documentary = basics['genres'].str.contains('documentary',case = False)
basics = basics[~is_documentary]

In [13]:
basics['genres'].value_counts()

Drama                     36068
Comedy                    13459
Comedy,Drama               6459
Horror                     5804
Drama,Romance              4317
                          ...  
Action,Fantasy,Western        1
Family,Musical,Sport          1
Horror,Music,Mystery          1
Comedy,History,Mystery        1
Crime,Fantasy,Sci-Fi          1
Name: genres, Length: 966, dtype: int64
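
Because `genres` holds comma-separated strings, a substring match catches multi-genre rows such as `Crime,Documentary,Romance` as well. A small sketch on a made-up Series, with `na=False` added as a precaution in case missing genres are still present:

```python
import numpy as np
import pandas as pd

genres = pd.Series(
    ["Documentary", "Drama", "Crime,Documentary,Romance", "Comedy,Drama", np.nan],
    name="genres",
)

# na=False makes a missing genre count as "not a documentary"
# instead of producing NaN in the boolean mask.
is_documentary = genres.str.contains("documentary", case=False, na=False)
fiction = genres[~is_documentary]
print(is_documentary.tolist())
```

In the notebook itself the null genres were already dropped, so `na=False` would only matter if the steps were reordered.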

### AKAS Database

In [14]:
akas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36055639 entries, 0 to 36055638
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   titleId          object
 1   ordering         int64 
 2   title            object
 3   region           object
 4   language         object
 5   types            object
 6   attributes       object
 7   isOriginalTitle  object
dtypes: int64(1), object(7)
memory usage: 2.1+ GB


#### Keep only US movies.

In [15]:
akas = akas[akas['region'] == 'US']
akas['region'].value_counts()

US    1440226
Name: region, dtype: int64
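
The full AKAs table takes over 2 GB in memory before filtering. If that is a problem, one option is to read the file in chunks and keep only the US rows from each chunk. The sketch below uses a small in-memory sample instead of the real file; the rows and the chunk size of 2 are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up sample in the shape of title.akas (only a few columns shown).
sample_tsv = (
    "titleId\tordering\ttitle\tregion\n"
    "tt0000001\t1\tCarmencita\tUS\n"
    "tt0000001\t2\tCarmencita\tDE\n"
    "tt0000002\t1\tLe clown et ses chiens\tFR\n"
    "tt0000002\t2\tThe Clown and His Dogs\tUS\n"
    "tt0000003\t1\tPauvre Pierrot\tFR\n"
)

# chunksize makes read_csv return an iterator of DataFrames
# instead of one large frame.
reader = pd.read_csv(io.StringIO(sample_tsv), sep="\t", chunksize=2)

# Filter each chunk as it arrives, then stitch the surviving rows together.
us_chunks = [chunk[chunk["region"] == "US"] for chunk in reader]
us_akas = pd.concat(us_chunks, ignore_index=True)
print(us_akas)
```

For the real file a much larger chunk size would be more practical; the value here is small only so that the sample spans several chunks.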

#### Replace "\N" with np.nan

In [16]:
akas.replace({'\\N':np.nan}, inplace = True)
akas.isna().sum()

titleId                  0
ordering                 0
title                    0
region                   0
language           1436286
types               460804
attributes         1393562
isOriginalTitle       1345
dtype: int64

#### Keep only US movies using AKAs table

In [17]:
keepers = basics['tconst'].isin(akas['titleId'])
basics = basics[keepers]
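
A title can appear several times in the AKAs table, once per alternate title. Filtering with `isin` keeps each matching row of `basics` exactly once, whereas a merge would repeat it. A minimal sketch on made-up frames (`toy_basics` and `toy_us_akas` are assumptions for illustration):

```python
import pandas as pd

toy_basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primaryTitle": ["A", "B", "C"],
})
# tt1 is listed twice, and tt9 has no matching row in toy_basics.
toy_us_akas = pd.DataFrame({"titleId": ["tt1", "tt1", "tt3", "tt9"]})

# isin returns one True/False per row of toy_basics, regardless of
# how many times an id is repeated on the right-hand side.
keepers = toy_basics["tconst"].isin(toy_us_akas["titleId"])
us_basics = toy_basics[keepers]
print(us_basics["tconst"].tolist())
```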

### RATINGS Database

In [18]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1316976 entries, 0 to 1316975
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1316976 non-null  object 
 1   averageRating  1316976 non-null  float64
 2   numVotes       1316976 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 30.1+ MB


#### Replace "\N" with np.nan (if any)

In [19]:
ratings.replace({'\\N':np.nan}, inplace = True)
ratings.isna().sum()

tconst           0
averageRating    0
numVotes         0
dtype: int64

#### Keep only ratings for titles with a US entry in the AKAs table

In [30]:
keepers = ratings['tconst'].isin(akas['titleId'])
ratings = ratings[keepers]

## Review

In [21]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 86777 entries, 34803 to 9890522
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tconst          86777 non-null  object
 1   titleType       86777 non-null  object
 2   primaryTitle    86777 non-null  object
 3   originalTitle   86777 non-null  object
 4   isAdult         86777 non-null  object
 5   startYear       86777 non-null  int64 
 6   endYear         0 non-null      object
 7   runtimeMinutes  86777 non-null  object
 8   genres          86777 non-null  object
dtypes: int64(1), object(8)
memory usage: 6.6+ MB


In [22]:
akas.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1440226 entries, 5 to 36055383
Data columns (total 8 columns):
 #   Column           Non-Null Count    Dtype 
---  ------           --------------    ----- 
 0   titleId          1440226 non-null  object
 1   ordering         1440226 non-null  int64 
 2   title            1440226 non-null  object
 3   region           1440226 non-null  object
 4   language         3940 non-null     object
 5   types            979422 non-null   object
 6   attributes       46664 non-null    object
 7   isOriginalTitle  1438881 non-null  object
dtypes: int64(1), object(7)
memory usage: 98.9+ MB


In [23]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 500376 entries, 0 to 1316951
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   tconst         500376 non-null  object 
 1   averageRating  500376 non-null  float64
 2   numVotes       500376 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 15.3+ MB


## Save Files

In [24]:
import os
os.makedirs('Data/',exist_ok=True) 

# Confirm folder created
os.listdir("Data/")

['.DS_Store']

In [25]:
## Save current dataframe to file.
basics.to_csv("Data/title_basics.csv.gz",compression='gzip',index=False)

# Open saved file and preview again
basics = pd.read_csv("Data/title_basics.csv.gz", low_memory = False)
basics.head()

| | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0035423 | movie | Kate & Leopold | Kate & Leopold | 0 | 2001 | | 118 | Comedy,Fantasy,Romance |
| 1 | tt0043139 | movie | Life of a Beijing Policeman | Wo zhe yi bei zi | 0 | 2013 | | 120 | Drama,History |
| 2 | tt0062336 | movie | The Tango of the Widower and Its Distorting Mi... | El tango del viudo y su espejo deformante | 0 | 2020 | | 70 | Drama |
| 3 | tt0069049 | movie | The Other Side of the Wind | The Other Side of the Wind | 0 | 2018 | | 122 | Drama |
| 4 | tt0088751 | movie | The Naked Monster | The Naked Monster | 0 | 2005 | | 100 | Comedy,Horror,Sci-Fi |


In [26]:
## Save current dataframe to file.
akas.to_csv("Data/title_akas.csv.gz",compression='gzip',index=False)

# Open saved file and preview again
akas = pd.read_csv("Data/title_akas.csv.gz", low_memory = False)
akas.head()

| | titleId | ordering | title | region | language | types | attributes | isOriginalTitle |
|---|---|---|---|---|---|---|---|---|
| 0 | tt0000001 | 6 | Carmencita | US | | imdbDisplay | | 0.0 |
| 1 | tt0000002 | 7 | The Clown and His Dogs | US | | | literal English title | 0.0 |
| 2 | tt0000005 | 10 | Blacksmith Scene | US | | imdbDisplay | | 0.0 |
| 3 | tt0000005 | 1 | Blacksmithing Scene | US | | alternative | | 0.0 |
| 4 | tt0000005 | 6 | Blacksmith Scene #1 | US | | alternative | | 0.0 |


In [27]:
## Save current dataframe to file.
ratings.to_csv("Data/title_ratings.csv.gz",compression='gzip',index=False)

# Open saved file and preview again
ratings = pd.read_csv("Data/title_ratings.csv.gz", low_memory = False)
ratings.head()

| | tconst | averageRating | numVotes |
|---|---|---|---|
| 0 | tt0000001 | 5.7 | 1976 |
| 1 | tt0000002 | 5.8 | 264 |
| 2 | tt0000005 | 6.2 | 2617 |
| 3 | tt0000006 | 5.1 | 182 |
| 4 | tt0000007 | 5.4 | 820 |
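
The save-and-reload pattern used for the three tables can be checked in isolation. The sketch below writes a two-row frame to a gzip-compressed csv in a temporary folder and reads it back; the temporary path is an assumption for illustration and stands in for the "Data/" folder.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "averageRating": [5.7, 5.8],
    "numVotes": [1976, 264],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "title_ratings.csv.gz")
    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the file is read back.
    toy.to_csv(path, compression="gzip", index=False)
    # On read, pandas infers gzip compression from the .gz extension.
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
```

The reloaded frame has the same three columns as the original, which confirms that the index was not written to disk.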


# Part 2: Use an API to extract revenue and profit

## Additional Imports

In [28]:
# Standard
import matplotlib.pyplot as plt
import seaborn as sns

# New Imports
import os, json, math, time
from yelpapi import YelpAPI
from tqdm.notebook import tqdm_notebook

# Install tqdm (provides the tqdm_notebook progress bar imported above)
!pip install tqdm



In [29]:
import json
with open('/Users/jasontracey/.secret/tmdb_api.json') as f: #change the path to match YOUR path!!
    login = json.load(f)
login.keys()

dict_keys(['Client ID', 'API Key'])
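
Keeping credentials in a JSON file outside the repository, as above, prevents the API key from being committed. A small helper can also fail early when the file lacks an expected key. In the sketch below, `load_credentials` is a hypothetical helper name, and the demo writes placeholder values to a temporary file rather than reading real credentials.

```python
import json
import os
import tempfile


def load_credentials(path, required_keys=("API Key",)):
    """Load a JSON credentials file and raise if an expected key is missing."""
    with open(path) as f:
        login = json.load(f)
    missing = [key for key in required_keys if key not in login]
    if missing:
        raise KeyError(f"Missing keys in {path}: {missing}")
    return login


# Demonstrate with a throwaway file holding placeholder values only.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "tmdb_api.json")
    with open(demo_path, "w") as f:
        json.dump({"Client ID": "placeholder-id", "API Key": "placeholder-key"}, f)
    login = load_credentials(demo_path)

print(sorted(login.keys()))
```

In the notebook, the path passed to the helper would be the location of your own credentials file.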