# Intro

Jon Messier

2/20/2023

---

**Business Problem**

For this project, you have been hired to produce a MySQL database on Movies from a subset of IMDB's publicly available dataset. Ultimately, you will use this database to analyze what makes a movie successful and will provide recommendations to the stakeholder on how to make a successful movie.

# Part 1

For Part 1 of the project, you will be creating your project repository, loading the official IMDB data for the requested tables, filtering out unnecessary data, and saving the filtered tables as gzip-compressed csv files (".csv.gz") in your repository.

**Getting Started Tips:**

- Please make sure to read the following lesson ["Getting Started - Project 3"](https://login.codingdojo.com/m/376/12528/88061) for additional tips and directions!
    
**The Data**

IMDB provides several files with varied information for movies, TV shows, made-for-TV movies, etc.

- Overview/Data Dictionary: https://www.imdb.com/interfaces/
- Downloads page: https://datasets.imdbws.com/

- From their previous research, the stakeholder decided to focus on the following files:
  - title.basics.tsv.gz
  - title.ratings.tsv.gz
  - title.akas.tsv.gz


**Specifications**

Your stakeholder only wants you to include information for movies based on the following specifications:

-    Exclude any movie with missing values for genre or runtime
-    Include only full-length movies (titleType = "movie").
-    Include only fictional movies (not from documentary genre)
-   Include only movies that were released 2000 - 2021 (include 2000 and 2021)
-    Include only movies that were released in the United States
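
The specifications above can be sketched as a single filtering function. This is a minimal illustration on made-up rows, not the notebook's actual pipeline: `apply_specs`, `toy_basics`, and `toy_akas` are hypothetical names, and the year bounds are parameters that default to the 2000-2021 range stated in the list.

```python
import numpy as np
import pandas as pd

# Toy stand-in for title.basics (hypothetical rows, not real IMDB data)
toy_basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5"],
    "titleType": ["movie", "movie", "short", "movie", "movie"],
    "startYear": ["2005", "1999", "2010", "2015", "2012"],
    "runtimeMinutes": ["100", "95", "12", np.nan, "88"],
    "genres": ["Drama", "Comedy", "Short", "Drama", "Documentary,History"],
})

# Toy stand-in for title.akas
toy_akas = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt5", "tt2"],
    "region": ["US", "DE", "US", "US"],
})

def apply_specs(basics, akas, first_year=2000, last_year=2021):
    """Apply the stakeholder's filters to a basics-style dataframe."""
    # Exclude rows with missing runtime or genre
    out = basics.dropna(subset=["runtimeMinutes", "genres"])
    # Full-length movies only
    out = out[out["titleType"] == "movie"]
    # Fictional movies only (no Documentary genre)
    out = out[~out["genres"].str.contains("documentary", case=False)]
    # Release year inside the inclusive range
    years = pd.to_numeric(out["startYear"], errors="coerce")
    out = out[years.between(first_year, last_year)]
    # Released in the United States, according to the AKAs table
    us_ids = akas.loc[akas["region"] == "US", "titleId"]
    return out[out["tconst"].isin(us_ids)]

filtered = apply_specs(toy_basics, toy_akas)
print(filtered["tconst"].tolist())
```

In this toy data only `tt1` passes every filter; each of the other rows fails exactly one specification.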


**Deliverable**

After filtering out movies that do not meet the stakeholder's specifications:

-    Before saving, run a final .info() for each of the dataframes to show a summary of how many movies remain and the datatypes of each feature
-    Save each dataframe as a compressed csv file in a "Data/" folder inside your repository.
-    Commit your changes to your repository in GitHub desktop and Publish repository / Push Changes.
-    Submit the link to your repository



# Class/Data imports

In [1]:
import pandas as pd
import numpy as np

In [2]:
basics_url = "https://datasets.imdbws.com/title.basics.tsv.gz"
ratings_url = 'https://datasets.imdbws.com/title.ratings.tsv.gz'
akas_url = 'https://datasets.imdbws.com/title.akas.tsv.gz'

# Data Inspection and cleanup

## AKAs
- [x] Replace "\N" with np.nan
- [x] Keep only US movies.

In [3]:
df_akas = pd.read_csv(akas_url, sep='\t', low_memory=False)
df_akas.info()
df_akas.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35009891 entries, 0 to 35009890
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   titleId          object
 1   ordering         int64 
 2   title            object
 3   region           object
 4   language         object
 5   types            object
 6   attributes       object
 7   isOriginalTitle  object
dtypes: int64(1), object(7)
memory usage: 2.1+ GB


Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0


In [4]:
#replace \N with np.nan
df_akas = df_akas.replace({'\\N':np.nan})

In [5]:
#Keep only US movies
df_akas=df_akas.loc[df_akas['region']=="US"]
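
Because the full AKAs table is large (about 35 million rows and over 2 GB in the summary above), one possible alternative is to convert `\N` while reading and filter region by region in chunks. The sketch below assumes a small in-memory TSV in place of the real download; `read_us_rows` and `toy_tsv` are hypothetical names, and the chunk size is only illustrative.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for title.akas.tsv.gz (hypothetical rows)
toy_tsv = (
    "titleId\tordering\ttitle\tregion\ttypes\n"
    "tt0000001\t1\tCarmencita\tDE\t\\N\n"
    "tt0000001\t2\tCarmencita\tUS\timdbDisplay\n"
    "tt0000001\t3\tCarmencita\tUS\t\\N\n"
    "tt0000002\t1\tLe clown et ses chiens\tFR\timdbDisplay\n"
    "tt0000002\t2\tThe Clown and His Dogs\tUS\t\\N\n"
    "tt0000002\t3\tThe Clown and His Dogs\tUS\talternative\n"
)

def read_us_rows(source, chunksize=3):
    """Read a tab-separated AKAs file in chunks and keep only US rows."""
    # na_values turns the literal \N marker into NaN at read time
    chunks = pd.read_csv(source, sep="\t", na_values="\\N", chunksize=chunksize)
    us_parts = [chunk[chunk["region"] == "US"] for chunk in chunks]
    return pd.concat(us_parts, ignore_index=True)

us_akas = read_us_rows(io.StringIO(toy_tsv))
print(us_akas)
```

With the real file, `source` would be the AKAs URL and `chunksize` something much larger; only the US rows of each chunk are kept in memory at once.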

## Basics
- [x] Replace "\N" with np.nan
- [x] Eliminate movies that are null for runtimeMinutes
- [x] Eliminate movies that are null for genre
- [x] Keep only titleType == "movie"
- [x] Keep startYear 2000-2022
- [x] Eliminate movies that include "Documentary" in genre (see tip below)
- [x] Keep only US movies (Use AKAs table, see "Filtering one dataframe based on another" section below)

In [6]:
df_basics = pd.read_csv(basics_url, sep='\t', low_memory=False)
df_basics.info()
df_basics.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9631839 entries, 0 to 9631838
Data columns (total 9 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          object
 1   titleType       object
 2   primaryTitle    object
 3   originalTitle   object
 4   isAdult         object
 5   startYear       object
 6   endYear         object
 7   runtimeMinutes  object
 8   genres          object
dtypes: object(9)
memory usage: 661.4+ MB


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [7]:
#replace \N with np.nan
df_basics = df_basics.replace({'\\N':np.nan})
df_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9631839 entries, 0 to 9631838
Data columns (total 9 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          object
 1   titleType       object
 2   primaryTitle    object
 3   originalTitle   object
 4   isAdult         object
 5   startYear       object
 6   endYear         object
 7   runtimeMinutes  object
 8   genres          object
dtypes: object(9)
memory usage: 661.4+ MB


In [8]:
# Drop rows with null runtimes (subset given as a list, which every pandas version accepts)
df_basics = df_basics.dropna(axis=0, subset=['runtimeMinutes'])
df_basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2821887 entries, 0 to 9631838
Data columns (total 9 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          object
 1   titleType       object
 2   primaryTitle    object
 3   originalTitle   object
 4   isAdult         object
 5   startYear       object
 6   endYear         object
 7   runtimeMinutes  object
 8   genres          object
dtypes: object(9)
memory usage: 215.3+ MB


In [9]:
# Drop rows with null genres
df_basics = df_basics.dropna(axis=0, subset=['genres'])
df_basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2746306 entries, 0 to 9631838
Data columns (total 9 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          object
 1   titleType       object
 2   primaryTitle    object
 3   originalTitle   object
 4   isAdult         object
 5   startYear       object
 6   endYear         object
 7   runtimeMinutes  object
 8   genres          object
dtypes: object(9)
memory usage: 209.5+ MB


In [10]:
# Keep only rows where titleType is 'movie'
df_basics=df_basics[df_basics['titleType']=='movie']
df_basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 377538 entries, 8 to 9631789
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          377538 non-null  object
 1   titleType       377538 non-null  object
 2   primaryTitle    377538 non-null  object
 3   originalTitle   377538 non-null  object
 4   isAdult         377538 non-null  object
 5   startYear       371162 non-null  object
 6   endYear         0 non-null       object
 7   runtimeMinutes  377538 non-null  object
 8   genres          377538 non-null  object
dtypes: object(9)
memory usage: 28.8+ MB


In [11]:
# Drop rows with null startYear so the column can be converted to int
df_basics = df_basics.dropna(axis=0, subset=['startYear'])
df_basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 371162 entries, 8 to 9631789
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          371162 non-null  object
 1   titleType       371162 non-null  object
 2   primaryTitle    371162 non-null  object
 3   originalTitle   371162 non-null  object
 4   isAdult         371162 non-null  object
 5   startYear       371162 non-null  object
 6   endYear         0 non-null       object
 7   runtimeMinutes  371162 non-null  object
 8   genres          371162 non-null  object
dtypes: object(9)
memory usage: 28.3+ MB


In [12]:
#convert year to an int
df_basics['startYear']=df_basics['startYear'].astype('int')

#Keep only the movies between 2000-2022
df_basics=df_basics.loc[(df_basics["startYear"]>= 2000) 
                        & (df_basics["startYear"]<= 2022)]
df_basics['startYear'].describe()

count    221677.000000
mean       2013.341402
std           5.841255
min        2000.000000
25%        2009.000000
50%        2014.000000
75%        2018.000000
max        2022.000000
Name: startYear, dtype: float64
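
As a hedged aside, the year conversion can also be done without dropping nulls first by using pandas' nullable integer dtype. The values below are made up for illustration and `start_year` is a hypothetical series, not the notebook's column.

```python
import numpy as np
import pandas as pd

# Hypothetical startYear values as they might look after replacing "\N"
start_year = pd.Series(["2001", np.nan, "1999", "2022"], name="startYear")

# to_numeric tolerates missing values; "Int64" (capital I) is the nullable integer dtype
years = pd.to_numeric(start_year, errors="coerce").astype("Int64")

# Missing years compare as <NA>, so treat them as outside the range
in_range = years.between(2000, 2022).fillna(False)
kept = years[in_range]
print(kept.tolist())
```

Here the missing value and 1999 are both excluded, leaving 2001 and 2022.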

In [13]:
# Exclude movies that are included in the documentary category.
is_documentary = df_basics['genres'].str.contains('documentary',case=False)
df_basics = df_basics[~is_documentary]
df_basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
34803,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
61116,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020,,70,Drama
67669,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
77964,tt0079644,movie,November 1828,November 1828,0,2001,,140,"Drama,War"
86801,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"


In [14]:
# Keep only titles whose tconst appears in the US-filtered AKAs dataframe
keepers =df_basics['tconst'].isin(df_akas['titleId'])
df_basics = df_basics[keepers]
df_basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
34803,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
61116,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020,,70,Drama
67669,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
86801,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"
93938,tt0096056,movie,Crime and Punishment,Crime and Punishment,0,2002,,126,Drama
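
The AKAs table can hold several US rows for the same title, so the choice of `isin` over a merge matters. The sketch below uses hypothetical miniature tables (`toy_basics`, `toy_akas_us`) to show the difference in row counts.

```python
import pandas as pd

# Hypothetical miniature tables; tt2 has two US rows in the AKAs table
toy_basics = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"],
                           "primaryTitle": ["A", "B", "C"]})
toy_akas_us = pd.DataFrame({"titleId": ["tt2", "tt2", "tt3"],
                            "title": ["B", "B (alt)", "C"]})

# isin builds a boolean mask, so each basics row appears at most once
via_isin = toy_basics[toy_basics["tconst"].isin(toy_akas_us["titleId"])]

# An inner merge repeats a basics row once per matching AKAs row
via_merge = toy_basics.merge(toy_akas_us, left_on="tconst", right_on="titleId")

print(len(via_isin), len(via_merge))  # prints: 2 3
```

The mask keeps one row per movie, while the merge would need a de-duplication step afterwards.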


## Ratings

- [x] Replace "\N" with np.nan
- [x] Keep only US movies (Use AKAs table, see "Filtering one dataframe based on another" section below)

In [15]:
df_ratings = pd.read_csv(ratings_url, sep='\t', low_memory=False)
df_ratings.info()
df_ratings.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1282624 entries, 0 to 1282623
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1282624 non-null  object 
 1   averageRating  1282624 non-null  float64
 2   numVotes       1282624 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 29.4+ MB


Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1953
1,tt0000002,5.8,263
2,tt0000003,6.5,1787
3,tt0000004,5.6,179
4,tt0000005,6.2,2589


In [16]:
#replace \N with np.nan
df_ratings = df_ratings.replace({'\\N':np.nan})

In [17]:
# Keep only ratings whose tconst appears in the US-filtered AKAs dataframe
keepers =df_ratings['tconst'].isin(df_akas['titleId'])
df_ratings = df_ratings[keepers]
df_ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1953
1,tt0000002,5.8,263
4,tt0000005,6.2,2589
5,tt0000006,5.1,177
6,tt0000007,5.4,812


# Data File storage

In [18]:
# example making new folder with os
import os
os.makedirs('Data/',exist_ok=True) 
# Confirm folder created
os.listdir("Data/")

['title_akas.csv.gz', 'title_basics.csv.gz', 'title_ratings.csv.gz']
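
One way to keep each dataframe paired with its file name is to drive the saving from a single mapping and then reload every file as a check. This is only a sketch under assumptions: `save_tables` is a hypothetical helper, the one-row tables are placeholders, and a temporary directory stands in for the "Data/" folder.

```python
import os
import tempfile
import pandas as pd

def save_tables(tables, folder):
    """Save each dataframe as <folder>/title_<name>.csv.gz and return the paths."""
    os.makedirs(folder, exist_ok=True)
    paths = {}
    for name, df in tables.items():
        path = os.path.join(folder, f"title_{name}.csv.gz")
        df.to_csv(path, compression="gzip", index=False)
        paths[name] = path
    return paths

# Hypothetical one-row tables standing in for the filtered dataframes
tables = {
    "basics": pd.DataFrame({"tconst": ["tt1"], "genres": ["Drama"]}),
    "ratings": pd.DataFrame({"tconst": ["tt1"], "averageRating": [7.1]}),
    "akas": pd.DataFrame({"titleId": ["tt1"], "region": ["US"]}),
}

with tempfile.TemporaryDirectory() as tmp:
    paths = save_tables(tables, tmp)
    # Reload each file and confirm its columns match the dataframe saved under that name
    round_trip_ok = {
        name: list(pd.read_csv(path).columns) == list(tables[name].columns)
        for name, path in paths.items()
    }

print(round_trip_ok)
```

Deriving the file name from the dictionary key means a table cannot be written under another table's name by accident.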

In [19]:
## Save current dataframe to file.
df_basics.to_csv("Data/title_basics.csv.gz",compression='gzip',index=False)
df_akas.to_csv("Data/title_akas.csv.gz",compression='gzip',index=False)
df_ratings.to_csv("Data/title_ratings.csv.gz",compression='gzip',index=False)

In [20]:
# Open saved file and preview again
df_basics = pd.read_csv("Data/title_basics.csv.gz", low_memory = False)
df_basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020,,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
3,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"
4,tt0096056,movie,Crime and Punishment,Crime and Punishment,0,2002,,126,Drama


In [21]:
# Open saved file and preview again
df_akas = pd.read_csv("Data/title_akas.csv.gz", low_memory = False)
df_akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
1,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
2,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
3,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
4,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0


In [22]:
# Open saved file and preview again
df_ratings = pd.read_csv("Data/title_ratings.csv.gz", low_memory = False)
df_ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1953
1,tt0000002,5.8,263
2,tt0000005,6.2,2589
3,tt0000006,5.1,177
4,tt0000007,5.4,812


# File .info() summary

In [23]:
df_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85569 entries, 0 to 85568
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          85569 non-null  object 
 1   titleType       85569 non-null  object 
 2   primaryTitle    85569 non-null  object 
 3   originalTitle   85569 non-null  object 
 4   isAdult         85569 non-null  int64  
 5   startYear       85569 non-null  int64  
 6   endYear         0 non-null      float64
 7   runtimeMinutes  85569 non-null  int64  
 8   genres          85569 non-null  object 
dtypes: float64(1), int64(3), object(5)
memory usage: 5.9+ MB


In [24]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 489896 entries, 0 to 489895
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   tconst         489896 non-null  object 
 1   averageRating  489896 non-null  float64
 2   numVotes       489896 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 11.2+ MB


In [25]:
df_akas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1416293 entries, 0 to 1416292
Data columns (total 8 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   titleId          1416293 non-null  object 
 1   ordering         1416293 non-null  int64  
 2   title            1416293 non-null  object 
 3   region           1416293 non-null  object 
 4   language         3832 non-null     object 
 5   types            974033 non-null   object 
 6   attributes       46039 non-null    object 
 7   isOriginalTitle  1414948 non-null  float64
dtypes: float64(1), int64(1), object(6)
memory usage: 86.4+ MB
