<table style="background-color: rgb(13,32,76); border-radius: 10px">
	<thead>
		<tr>
			<th colspan="2" style="border: hidden; vertical-align: top;" width="15%"><img src="https://assets.codingdojo.com/boomyeah2015/codingdojo/curriculum/content/chapter/1674755235__Icons_400px_Core Assignment.png">
			</th>
			<th style="border: hidden;">
				<h1 style="color: white;">Project Part 1 (Core)</h1>
			</th>
		</tr>
	</thead>
</table>

<h1>Business Problem</h1>
<p>For this project, you have been hired to produce a MySQL database on Movies from a subset of IMDB's publicly available dataset. Ultimately, you will use this database to analyze what makes a movie successful and will provide recommendations to the stakeholder on how to make a successful movie.</p><p><br></p><p><img src="https://assets.codingdojo.com/boomyeah2015/codingdojo/curriculum/content/chapter/1649183790__theatre-background.jpeg" referrerpolicy="no-referrer" alt="img" style="cursor: pointer; max-width: 100%; height: 377px; width: 696px; display: block; margin: auto;" width="696" height="377" title="img"></p>
<p class="text-center"><a href="https://thesenatortheatre.com/">Image Source</a></p>
<p>Over the course of this project, you will:</p>
<ul>
<li>Part 1: Create your project repository, download IMDB’s movie data, and filter out the subset of movies requested by the stakeholder.</li>
<li>Part 2: Design a MySQL database for your data and insert the data.</li>
<li>Part 3: Use an API to extract box office financial data and transform and load it into your database.</li>
<li>Part 4: Apply hypothesis testing to explore what makes a movie "successful."</li>
</ul>
<h2><br>Part 1</h2>
<p>For Part 1 of the project, you will be creating your project repository, downloading the IMDB data for the requested tables, filtering out unnecessary data, and saving the filtered tables as csv files (".csv.gz") in your repository.</p>


<h3>The Data</h3><div>
<p>IMDB provides a large dataset with varied information on movies, TV shows, made-for-TV movies, etc., free of charge for non-commercial use. The data dictionary is located here: <a href="https://www.imdb.com/interfaces/" target="_blank" class="url" style="background-color: rgb(255, 255, 255);">https://www.imdb.com/interfaces/</a>.</p>
<ul><li>We have provided partially-processed files for you in this <a href="https://drive.google.com/drive/folders/1I8FKN3S9acXMNzyXq3lo8n9PjplSPB97?usp=drive_link" style="background-color: rgb(255, 255, 255);">Google Drive folder</a>.</li></ul>
<h2>Specifications</h2>
<p>Your stakeholder only wants you to include information for movies based on the following specifications:</p>
<ul>
<li>Include only movies that were released in the United States.</li>
<li>Include only movies that were released 2000 - 2022 (startYear &gt;=2000 and startYear&lt;=2022)</li>
<li>Include only full-length movies (titleType = "movie").</li>
<li>Exclude movies that are missing genre or runtime.</li>
<li>Include only fictional genres (where genres does not include "Documentary").</li>
</ul>
<h2>Provided Files </h2>
<ul>
<li>From their previous research, they realized the data they want is in the following files:
<ul>
<li>title.basics.tsv.gz</li>
<li>title.ratings.tsv.gz</li>
</ul>
</li>
<li>However, to filter for movies released in the United States, you will also need "title-akas-us-only.csv"</li></ul>
<p>Note: this is a pre-filtered version of the title.akas.tsv.gz file. The full file is large and can cause problems for computers with less RAM/memory. We have included information on the preprocessing steps that have already been performed in the included Google Doc "IMDB Movie Dataset Info" in the folder linked above.</p>

</ul>
<hr>


In [1]:
import pandas as pd
import numpy as np

## Processing Title Basics

In [3]:
# Load title basics
basics = pd.read_csv("Data_v23/title.basics.tsv.gz", sep='\t', low_memory=False)
basics

FileNotFoundError: [Errno 2] No such file or directory: 'Data_v23/title.basics.tsv.gz'

In [None]:
basics.isna().sum()

### Include only movies that were released in the United States

#### 1) Get unique title IDs from US akas

In [None]:
# Load us AKAs
us_akas = pd.read_csv("Data_v23/title-akas-us-only.csv", low_memory=False)
us_akas.head()

In [None]:
# Get list of unique movie IDs from US
us_titleids = us_akas['titleId'].unique()
len(us_titleids)

In [None]:
us_titleids

#### 2) Remove rows from title basics that are not in the us title IDs


In [None]:
# Only keep movies released in US
basics = basics[basics['tconst'].isin(us_titleids)]
basics
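
The two steps above can be sketched on toy data. This is a minimal illustration only: the IDs, titles, and `toy_*` names are made up and are not part of the IMDB files.

```python
import pandas as pd

# Toy stand-ins for title basics and the US-only AKAs table (made-up IDs).
toy_basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "primaryTitle": ["A", "B", "C"],
})
toy_akas = pd.DataFrame({
    "titleId": ["tt0000001", "tt0000003", "tt0000003"],
    "region": ["US", "US", "US"],
})

# unique() collapses repeated IDs (one title can have several US AKAs).
toy_us_ids = toy_akas["titleId"].unique()

# isin() builds a boolean mask; rows whose tconst is not in the US IDs are dropped.
toy_us_basics = toy_basics[toy_basics["tconst"].isin(toy_us_ids)]
print(toy_us_basics)
```

Only the rows whose `tconst` appears among the US title IDs survive, which is the same pattern applied to `basics` above.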

### Exclude any movie with missing values for genre or runtime

In [None]:
## Replace "\N" with np.nan
basics = basics.replace({'\\N':np.nan})
basics.isna().sum()

In [None]:
## Eliminate movies that are null for runtimeMinutes or genres
basics = basics.dropna(subset=['runtimeMinutes','genres'])
basics.isna().sum()
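
As an alternative to replacing `"\N"` after loading, the placeholder can be declared as missing at read time with the `na_values` parameter of `pd.read_csv`. The sketch below uses a small in-memory TSV with made-up rows (`raw`, `toy`, and `toy_clean` are hypothetical names), not the real IMDB file.

```python
import io
import pandas as pd

# A tiny tab-separated sample that uses "\N" for missing values (made-up rows).
raw = (
    "tconst\truntimeMinutes\tgenres\n"
    "tt1\t90\tDrama\n"
    "tt2\t\\N\tComedy\n"
    "tt3\t100\t\\N\n"
)

# na_values tells pandas to parse the literal string \N as NaN while reading.
toy = pd.read_csv(io.StringIO(raw), sep="\t", na_values="\\N")
print(toy.isna().sum())

# Drop rows missing either runtime or genre, as in the cell above.
toy_clean = toy.dropna(subset=["runtimeMinutes", "genres"])
print(toy_clean)
```

A side effect of this approach is that numeric columns such as `runtimeMinutes` are parsed as numbers directly instead of as strings.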

In [None]:
len(basics)

### Include only full-length movies (titleType = "movie").


In [None]:
basics['titleType'].value_counts()

In [None]:
filter_type = basics['titleType'] == 'movie'
basics = basics[filter_type]
len(basics)

In [None]:
basics.head()

### Include only fictional movies (not from documentary genre)

In [None]:
# Get filter for movies with genre="Documentary"
filter_docs = basics['genres'].str.contains("Documentary")
filter_docs.sum()

In [None]:
# Remove documentaries
basics = basics[~filter_docs]
len(basics)
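
To see why `str.contains` is used rather than an equality check, here is a small sketch on made-up rows (the `toy_genres` frame is hypothetical). IMDB stores genres as a comma-separated string, so a documentary can also carry other genres.

```python
import pandas as pd

toy_genres = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "genres": ["Drama", "Documentary", "Biography,Documentary"],
})

# str.contains matches the substring anywhere in the comma-separated string.
# na=False treats missing genres as "not a documentary" instead of returning NaN.
is_doc = toy_genres["genres"].str.contains("Documentary", case=False, na=False)

# ~ inverts the mask, keeping only non-documentary rows.
toy_fiction = toy_genres[~is_doc]
print(toy_fiction)
```

An equality test (`== "Documentary"`) would have missed the third row, which lists two genres.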


### Include only movies that were released 2000 - 2022 (include 2000 and 2022)

#### Convert Start Year to Float

In [None]:
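# Convert startYear to a float.
# Work on an explicit copy so the assignment does not trigger a SettingWithCopyWarning.
basics = basics.copy()
basics['startYear'] = basics['startYear'].astype(float)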

In [None]:
# Filter based on years
filter_years = (basics['startYear'] < 2023) & (basics['startYear']>=2000)
filter_years.sum()

In [None]:
# remove unwanted years
basics = basics[filter_years]
len(basics)
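
The same inclusive range can also be written with `Series.between`, which includes both endpoints by default. A minimal sketch on made-up years (the `toy_years` frame is hypothetical):

```python
import pandas as pd

toy_years = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "startYear": [1999.0, 2000.0, 2022.0, 2023.0],
})

# between() is inclusive on both ends by default, so 2000 and 2022 are kept.
in_range = toy_years["startYear"].between(2000, 2022)
toy_kept = toy_years[in_range]
print(toy_kept)
```

This expresses the stakeholder's specification (startYear >= 2000 and startYear <= 2022) in a single call.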

### Title Basics - Final Check & save to csv

In [None]:
# Final preview before saving
basics.info()
basics.head()

In [None]:
basics.to_csv('Data/title-basics.csv', index=False)
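
The Part 1 brief asks for the filtered tables to be saved as compressed files (".csv.gz"). One way to do that, sketched here with a made-up frame and a hypothetical file name in a temporary folder, is to give `to_csv` a path ending in `.gz`; pandas then infers gzip compression from the extension.

```python
import os
import tempfile
import pandas as pd

toy_out = pd.DataFrame({"tconst": ["tt1", "tt2"], "startYear": [2001.0, 2010.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; the ".gz" suffix makes pandas gzip the output.
    out_path = os.path.join(tmp_dir, "title-basics-example.csv.gz")
    toy_out.to_csv(out_path, index=False)

    # Gzip files start with the two magic bytes 0x1f 0x8b.
    with open(out_path, "rb") as fh:
        magic = fh.read(2)

    # read_csv also infers the compression from the extension.
    reloaded = pd.read_csv(out_path)

print(magic == b"\x1f\x8b")
print(reloaded)
```

Reading the file back is a quick check that the saved table round-trips with the expected rows and columns.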

## Filter Ratings to Keep Only Movies in Final Title Basics

In [None]:
ratings = pd.read_csv('Data/title.ratings.tsv.gz', sep='\t')
ratings.info()
ratings

In [None]:
## Replace "\N" with np.nan
ratings = ratings.replace({'\\N':np.nan})
ratings.isna().sum()

In [None]:
ratings = ratings[ratings['tconst'].isin(basics['tconst'])]
ratings.info()
ratings.head()

In [None]:
ratings.to_csv("Data/title-ratings.csv", index=False)