# Business Problem

For this project, you have been hired to produce a MySQL database on Movies from a subset of IMDB's publicly available dataset. Ultimately, you will use this database to analyze what makes a movie successful and will provide recommendations to the stakeholder on how to make a successful movie.

Over the course of this project, you will:

- Part 1: Download several files from IMDB’s movie data set and filter out the subset of movies requested by the stakeholder.
- Part 2: Use an API to extract box office revenue and profit data to add to your IMDB data and perform exploratory data analysis.
- Part 3: Construct and export a MySQL database using your data.
- Part 4: Apply hypothesis testing to explore what makes a movie successful.
- Part 5 (Optional): Produce a Linear Regression model to predict movie performance.

For Part 1 of the project, you will be creating your project repository, loading the official IMDB data for the requested tables, filtering out unnecessary data, and saving the filtered tables as gzip-compressed csv files (".csv.gz") in your repository.

## Getting Started Tips:
Please make sure to read the following lesson "Getting Started - Project 3" for additional tips and directions!

## The Data
- IMDB provides several files with varied information for movies, TV shows, made-for-TV movies, etc.

 - Overview/Data Dictionary: https://www.imdb.com/interfaces/
 - Downloads page: https://datasets.imdbws.com/


- Based on previous research, the stakeholder wants to focus on the following files:

 - title.basics.tsv.gz
 - title.ratings.tsv.gz
 - title.akas.tsv.gz

## Specifications

Your stakeholder only wants you to include information for movies based on the following specifications:

- Exclude any movie with missing values for genre or runtime
- Include only full-length movies (titleType = "movie").
- Include only fictional movies (exclude any movie in the Documentary genre)
- Include only movies that were released from 2000 through 2021 (including both 2000 and 2021)
- Include only movies that were released in the United States

## Deliverable
After filtering out movies that do not meet the stakeholder's specifications:

- Before saving, run a final .info() for each of the dataframes to show a summary of how many movies remain and the datatypes of each feature
- Save each dataframe as a compressed csv file in a "Data/" folder inside your repository.
- Commit your changes to your repository in GitHub Desktop and Publish repository / Push Changes.
- Submit the link to your repository
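
The save step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the required implementation: the `demo` dataframe holds made-up rows, the file name `title_basics.csv.gz` is a hypothetical choice, and a temporary directory stands in for the repository root.

```python
import os
import tempfile
import pandas as pd

# Made-up rows standing in for a filtered table (not real IMDB data)
demo = pd.DataFrame({"tconst": ["tt01", "tt03"], "startYear": [2001, 2015]})

# Assumption: a temporary directory plays the role of the repository root
repo_root = tempfile.mkdtemp()
data_dir = os.path.join(repo_root, "Data")
os.makedirs(data_dir, exist_ok=True)

# Final summary of remaining rows and dtypes before saving
demo.info()

# Hypothetical file name; the .csv.gz suffix matches the gzip compression
out_path = os.path.join(data_dir, "title_basics.csv.gz")
demo.to_csv(out_path, compression="gzip", index=False)

# Read the file back to confirm it round-trips
reloaded = pd.read_csv(out_path)
print(reloaded)
```

Passing `index=False` keeps the dataframe index from being written as an extra unnamed column, which would otherwise reappear when the file is reloaded.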

# Imports

In [1]:
import pandas as pd
import numpy as np

# Show every row when displaying a DataFrame or Series
pd.set_option('display.max_rows', None)

# Load Data

In [2]:
# set urls for database from IMDB website
url_basics = 'https://datasets.imdbws.com/title.basics.tsv.gz'
url_akas = 'https://datasets.imdbws.com/title.akas.tsv.gz'
url_ratings = 'https://datasets.imdbws.com/title.ratings.tsv.gz'

In [None]:
# load data
basics = pd.read_csv(url_basics, sep = '\t', low_memory = False)
akas = pd.read_csv(url_akas, sep = '\t', low_memory = False)
ratings = pd.read_csv(url_ratings, sep = '\t', low_memory = False)
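
The IMDB files mark missing values with the literal string `\N`. As an alternative to replacing them after loading, the sketch below shows `pd.read_csv` treating `\N` as missing at read time through its `na_values` parameter. The `sample_tsv` string is made-up illustration data in the same tab-separated layout, not real IMDB rows.

```python
import io
import pandas as pd

# Made-up three-row sample laid out like title.basics (assumption, not real data)
sample_tsv = (
    "tconst\ttitleType\tstartYear\truntimeMinutes\tgenres\n"
    "tt0000001\tmovie\t2001\t90\tDrama\n"
    "tt0000002\tmovie\t\\N\t\\N\tComedy\n"
    "tt0000003\tshort\t2005\t12\t\\N\n"
)

# na_values tells pandas to read the literal string \N as NaN
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")
print(sample.isna().sum())
```

The same `na_values="\\N"` argument could be passed when reading the real URLs, in which case the later `replace` step would find nothing left to replace.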

# Data Cleaning

## Title Basics Database

### Replace "\N" with np.nan

In [None]:
basics.isna().sum()

In [None]:
# Missing values appear as both nan and \N. Replace every \N with nan so the affected rows can be dropped.
basics.replace({'\\N':np.nan}, inplace = True)
basics.isna().sum()

### Eliminate movies that are null for runtimeMinutes

In [None]:
basics.dropna(subset = ['runtimeMinutes'], axis = 0, inplace = True)
basics.isna().sum()

### Eliminate movies that are null for genre

In [None]:
basics.dropna(subset = ['genres'], axis = 0, inplace = True)
basics.isna().sum()

### Keep only titleType == "movie"

In [None]:
# .copy() avoids a SettingWithCopyWarning when columns are modified later
basics = basics[basics['titleType'] == 'movie'].copy()
basics['titleType'].value_counts()

In [None]:
basics.info()

### Keep startYear 2000-2021

In [None]:
basics.dropna(subset = ['startYear'], axis = 0, inplace = True)
basics['startYear'] = basics['startYear'].astype(dtype = int)
# The stakeholder asked for 2000 through 2021, inclusive
basics = basics[(basics['startYear'] >= 2000) & (basics['startYear'] <= 2021)]
basics['startYear'].describe()

### Eliminate movies that include "Documentary" in genre

In [None]:
basics['genres'].value_counts()

In [None]:
is_documentary = basics['genres'].str.contains('documentary',case = False)
basics = basics[~is_documentary]

In [None]:
basics['genres'].value_counts()

### Keep only US movies using AKAs table

## AKAS Database

In [None]:
akas.info()
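
The notebook does not yet apply the "released in the United States" requirement. One possible approach is sketched below on made-up rows: keep only the AKAs rows whose `region` is `US`, then keep only the titles whose `tconst` appears among those rows' `titleId` values. The column names follow the IMDB data dictionary. The `akas_demo` and `basics_demo` frames and every name derived from them are hypothetical.

```python
import pandas as pd

# Made-up rows using the IMDB column names (assumption, not real data)
akas_demo = pd.DataFrame({
    "titleId": ["tt01", "tt01", "tt02", "tt03"],
    "region": ["US", "DE", "FR", "US"],
})
basics_demo = pd.DataFrame({
    "tconst": ["tt01", "tt02", "tt03", "tt04"],
    "primaryTitle": ["A", "B", "C", "D"],
})

# Keep only the AKAs rows for United States releases
akas_us = akas_demo[akas_demo["region"] == "US"]

# Keep only the titles whose ID appears among the US AKAs rows
keep_us = basics_demo["tconst"].isin(akas_us["titleId"])
basics_us = basics_demo[keep_us]
print(basics_us)
```

Using `isin` rather than a merge leaves the basics table with one row per title, even when a title has several US entries in the AKAs table.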

## RATINGS Database

In [None]:
ratings.info()
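
The ratings table can be restricted to the movies that survive the other filters in a similar way. This is a minimal sketch on made-up rows, assuming the filtered titles are available as a dataframe with a `tconst` column. The names `ratings_demo`, `basics_kept`, and `ratings_kept` are hypothetical.

```python
import pandas as pd

# Made-up rows using the title.ratings column names (assumption, not real data)
ratings_demo = pd.DataFrame({
    "tconst": ["tt01", "tt02", "tt05"],
    "averageRating": [7.1, 5.4, 8.0],
    "numVotes": [1200, 85, 40],
})

# Stand-in for the filtered title basics table
basics_kept = pd.DataFrame({"tconst": ["tt01", "tt03"]})

# Keep only the ratings whose title ID is still present in the filtered titles
keep_rated = ratings_demo["tconst"].isin(basics_kept["tconst"])
ratings_kept = ratings_demo[keep_rated]
ratings_kept.info()
```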