# Business Problem

For this project, you have been hired to produce a MySQL database on Movies from a subset of IMDB's publicly available dataset. Ultimately, you will use this database to analyze what makes a movie successful and will provide recommendations to the stakeholder on how to make a successful movie.

Over the course of this project, you will:

- Part 1: Download several files from IMDB’s movie data set and filter out the subset of movies requested by the stakeholder.
- Part 2: Use an API to extract box office revenue and profit data to add to your IMDB data and perform exploratory data analysis.
- Part 3: Construct and export a MySQL database using your data.
- Part 4: Apply hypothesis testing to explore what makes a movie successful.
- Part 5 (Optional): Produce a Linear Regression model to predict movie performance.

For Part 1 of the project, you will be creating your project repository, loading the official IMDB data for the requested tables, filtering out unnecessary data, and saving the filtered tables as gzip-compressed csv files (".csv.gz") in your repository.

## Getting Started Tips:
Please make sure to read the following lesson "Getting Started - Project 3" for additional tips and directions!

## The Data
- IMDB Provides Several Files with varied information for Movies, TV Shows, Made for TV Movies, etc.

 - Overview/Data Dictionary: https://www.imdb.com/interfaces/
 - Downloads page: https://datasets.imdbws.com/


- From their previous research, they realized they want to focus on the following files:

 - title.basics.tsv.gz
 - title.ratings.tsv.gz
 - title.akas.tsv.gz

## Specifications

Your stakeholder only wants you to include information for movies based on the following specifications:

- Exclude any movie with missing values for genre or runtime
- Include only full-length movies (titleType = "movie").
- Include only fictional movies (not from documentary genre)
- Include only movies that were released 2000 - 2021 (include 2000 and 2021)
- Include only movies that were released in the United States

## Deliverable
After filtering out movies that do not meet the stakeholder's specifications:

- Before saving, run a final .info() for each of the dataframes to show a summary of how many movies remain and the datatypes of each feature
- Save each file to a compressed csv file "Data/" folder inside your repository.
- Commit your changes to your repository in GitHub desktop and Publish repository / Push Changes.
- Submit the link to your repository

# Imports

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_rows', None) 

# Load Data

In [2]:
# set urls for database from IMDB website
url_basics = 'https://datasets.imdbws.com/title.basics.tsv.gz'
url_akas = 'https://datasets.imdbws.com/title.akas.tsv.gz'
url_ratings = 'https://datasets.imdbws.com/title.ratings.tsv.gz'

In [3]:
# load data
basics = pd.read_csv(url_basics, sep = '\t', low_memory = False)
akas = pd.read_csv(url_akas, sep = '\t', low_memory = False)
ratings = pd.read_csv(url_ratings, sep = '\t', low_memory = False)

# Data Cleaning

## Title Basics Database

### Replace "\N" with np.nan

In [4]:
basics.isna().sum()

tconst             0
titleType          0
primaryTitle      11
originalTitle     11
isAdult            0
startYear          0
endYear            0
runtimeMinutes     0
genres            15
dtype: int64

In [5]:
# Missing values are nan and \N. I wlll replace them all with nan so I can delete them. 
basics.replace({'\\N':np.nan}, inplace = True)
basics.isna().sum()

tconst                  0
titleType               0
primaryTitle           11
originalTitle          11
isAdult                 1
startYear         1332457
endYear           9768787
runtimeMinutes    6963746
genres             444031
dtype: int64

### Eliminate movies that are null for runtimeMinutes

In [6]:
basics.dropna(subset = ['runtimeMinutes'], axis = 0, inplace = True)
basics.isna().sum()

tconst                  0
titleType               0
primaryTitle            1
originalTitle           1
isAdult                 1
startYear          170690
endYear           2860402
runtimeMinutes          0
genres              76914
dtype: int64

### Eliminate movies that are null for genre

In [7]:
basics.dropna(subset = ['genres'], axis = 0, inplace = True)
basics.isna().sum()

tconst                  0
titleType               0
primaryTitle            1
originalTitle           1
isAdult                 0
startYear          165736
endYear           2785081
runtimeMinutes          0
genres                  0
dtype: int64

### Keep only titleType==Movie

In [8]:
basics = basics[basics['titleType'] == 'movie']
basics['titleType'].info()

<class 'pandas.core.series.Series'>
Int64Index: 382882 entries, 8 to 9875688
Series name: titleType
Non-Null Count   Dtype 
--------------   ----- 
382882 non-null  object
dtypes: object(1)
memory usage: 5.8+ MB


In [9]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 382882 entries, 8 to 9875688
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          382882 non-null  object
 1   titleType       382882 non-null  object
 2   primaryTitle    382882 non-null  object
 3   originalTitle   382882 non-null  object
 4   isAdult         382882 non-null  object
 5   startYear       376423 non-null  object
 6   endYear         0 non-null       object
 7   runtimeMinutes  382882 non-null  object
 8   genres          382882 non-null  object
dtypes: object(9)
memory usage: 29.2+ MB


### Keep startYear 2000-2022

In [10]:
basics.dropna(subset = ['startYear'], axis = 0, inplace = True)
basics['startYear'] = basics['startYear'].astype(dtype = int) 
basics = basics[(basics['startYear'] >= 2000) & (basics['startYear'] <= 2022)]
basics['startYear'].describe()

count    223494.000000
mean       2013.372314
std           5.853115
min        2000.000000
25%        2009.000000
50%        2014.000000
75%        2018.000000
max        2022.000000
Name: startYear, dtype: float64

### Eliminate movies that include "Documentary" in genre

In [11]:
basics['genres'].value_counts()

Documentary                         53271
Drama                               36060
Comedy                              13458
Comedy,Drama                         6458
Horror                               5802
Drama,Romance                        4312
Thriller                             3938
Comedy,Drama,Romance                 3036
Comedy,Romance                       2946
Action                               2739
Biography,Documentary                2557
Documentary,Music                    2345
Drama,Thriller                       2257
Romance                              2039
Horror,Thriller                      1946
Documentary,Drama                    1627
Documentary,History                  1544
Action,Crime,Drama                   1471
Crime,Drama                          1418
Biography,Documentary,History        1346
Crime,Drama,Thriller                 1256
Animation                            1254
Family                               1216
Documentary,Sport                 

In [12]:
is_documentary = basics['genres'].str.contains('documentary',case = False)
basics = basics[~is_documentary]

In [13]:
basics['genres'].value_counts()

Drama                            36060
Comedy                           13458
Comedy,Drama                      6458
Horror                            5802
Drama,Romance                     4312
Thriller                          3938
Comedy,Drama,Romance              3036
Comedy,Romance                    2946
Action                            2739
Drama,Thriller                    2257
Romance                           2039
Horror,Thriller                   1946
Action,Crime,Drama                1471
Crime,Drama                       1418
Crime,Drama,Thriller              1256
Animation                         1254
Family                            1216
Music                             1132
Drama,Family                      1126
Comedy,Horror                     1117
Action,Drama                       978
Sport                              957
Crime                              921
Horror,Mystery,Thriller            859
Sci-Fi                             856
Drama,Mystery,Thriller   

### Keep only US movies using AKAs table

In [14]:
keepers =basics['tconst'].isin(akas['titleId'])
basics = basics[keepers]

## AKAS Database

In [15]:
akas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35996402 entries, 0 to 35996401
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   titleId          object
 1   ordering         int64 
 2   title            object
 3   region           object
 4   language         object
 5   types            object
 6   attributes       object
 7   isOriginalTitle  object
dtypes: int64(1), object(7)
memory usage: 2.1+ GB


### Keep only US movies.

In [20]:
akas = akas[akas['region'] == 'US']
akas['region'].value_counts()

US    1438843
Name: region, dtype: int64

### Replace "\N" with np.nan

In [22]:
akas.replace({'\\N':np.nan}, inplace = True)
akas.isna().sum()

titleId                  0
ordering                 0
title                    0
region                   0
language           1434911
types               459777
attributes         1392216
isOriginalTitle       1345
dtype: int64

## RATINGS Database

In [16]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1314586 entries, 0 to 1314585
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1314586 non-null  object 
 1   averageRating  1314586 non-null  float64
 2   numVotes       1314586 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 30.1+ MB


### Replace "\N" with np.nan (if any)

In [23]:
ratings.replace({'\\N':np.nan}, inplace = True)
ratings.isna().sum()

tconst           0
averageRating    0
numVotes         0
dtype: int64

### Keep only US movies using AKAs table

In [25]:
keepers =ratings['tconst'].isin(akas['titleId'])
ratings = ratings[keepers]

# Review

In [27]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 146957 entries, 34803 to 9875588
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          146957 non-null  object
 1   titleType       146957 non-null  object
 2   primaryTitle    146957 non-null  object
 3   originalTitle   146957 non-null  object
 4   isAdult         146957 non-null  object
 5   startYear       146957 non-null  int64 
 6   endYear         0 non-null       object
 7   runtimeMinutes  146957 non-null  object
 8   genres          146957 non-null  object
dtypes: int64(1), object(8)
memory usage: 11.2+ MB


In [28]:
akas.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1438843 entries, 5 to 35996146
Data columns (total 8 columns):
 #   Column           Non-Null Count    Dtype 
---  ------           --------------    ----- 
 0   titleId          1438843 non-null  object
 1   ordering         1438843 non-null  int64 
 2   title            1438843 non-null  object
 3   region           1438843 non-null  object
 4   language         3932 non-null     object
 5   types            979066 non-null   object
 6   attributes       46627 non-null    object
 7   isOriginalTitle  1437498 non-null  object
dtypes: int64(1), object(7)
memory usage: 98.8+ MB


In [29]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 499778 entries, 0 to 1314561
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   tconst         499778 non-null  object 
 1   averageRating  499778 non-null  float64
 2   numVotes       499778 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 15.3+ MB


# Save Files

In [30]:
import os
os.makedirs('Data/',exist_ok=True) 

# Confirm folder created
os.listdir("Data/")

[]

In [31]:
## Save current dataframe to file.
basics.to_csv("Data/title_basics.csv.gz",compression='gzip',index=False)

# Open saved file and preview again
basics = pd.read_csv("Data/title_basics.csv.gz", low_memory = False)
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020,,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
3,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"
4,tt0096056,movie,Crime and Punishment,Crime and Punishment,0,2002,,126,Drama


In [32]:
## Save current dataframe to file.
akas.to_csv("Data/title_akas.csv.gz",compression='gzip',index=False)

# Open saved file and preview again
akas = pd.read_csv("Data/title_akas.csv.gz", low_memory = False)
akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
1,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
2,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
3,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
4,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0


In [33]:
## Save current dataframe to file.
ratings.to_csv("Data/title_ratings.csv.gz",compression='gzip',index=False)

# Open saved file and preview again
ratings = pd.read_csv("Data/title_ratings.csv.gz", low_memory = False)
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1977
1,tt0000002,5.8,264
2,tt0000005,6.2,2617
3,tt0000006,5.1,182
4,tt0000007,5.4,820
