# Project 3
Jacob Tanzi


## Specifications

Your stakeholder only wants you to include information for movies based on the following specifications:

* Include only movies that were released in the United States.
* Include only movies that were released 2000 - 2021 (include 2000 and 2021)
* Include only full-length movies (titleType = "movie").
* Exclude any movie with missing values for genre or runtime
* Include only fictional movies (not from the Documentary genre)




In [1]:
import os
os.makedirs('Data/',exist_ok=True)
os.listdir("Data/")

[]

In [6]:
import pandas as pd
import gzip

In [13]:
file_path = 'title.basics.tsv.gz'


def read_gzipped_tsv(file_path):
    with gzip.open(file_path, 'rb') as f:
        basics = pd.read_csv(f, sep='\t', low_memory=False)
    return basics

basics = read_gzipped_tsv(file_path)


In [14]:
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [17]:
file_path = 'title.ratings.tsv.gz'


def read_gzipped_tsv(file_path):
    with gzip.open(file_path, 'rb') as f:
        ratings = pd.read_csv(f, sep='\t', low_memory=False)
    return ratings

ratings = read_gzipped_tsv(file_path)

In [18]:
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1989
1,tt0000002,5.8,265
2,tt0000003,6.5,1850
3,tt0000004,5.5,178
4,tt0000005,6.2,2634


In [25]:

akas = pd.read_csv('title-akas-us-only.csv', low_memory=False)


In [26]:
akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,6,Carmencita,US,\N,imdbDisplay,\N,0
1,tt0000002,7,The Clown and His Dogs,US,\N,\N,literal English title,0
2,tt0000005,10,Blacksmith Scene,US,\N,imdbDisplay,\N,0
3,tt0000005,1,Blacksmithing Scene,US,\N,alternative,\N,0
4,tt0000005,6,Blacksmith Scene #1,US,\N,alternative,\N,0


### Required Preprocessing - Details

Below is an overview of the steps required for this part of the project. Further down the page are detailed tips on how to accomplish many of these steps.

Filtering/Cleaning Steps:

AKAs:
* keep only US movies.
* Replace "\N" with np.nan



In [29]:
akas.replace({'\\N':'np.nan'}, inplace = True)

Ratings:
* Keep only movies that were included in your final title basics dataframe above.
(Use AKAs table, see "Filtering one dataframe based on another" section below)
* Replace "\N" with np.nan (if any)

In [30]:
ratings.replace({'\\N':'np.nan'}, inplace = True)

Title Basics:
* Keep only US movies (Use AKAs table, see "Filtering one dataframe based on another" section below)
* Replace "\N" with np.nan
* Eliminate movies that are null for runtimeMinutes
* Eliminate movies that are null for genre
* keep only titleType==Movie
* Convert the startYear column to float data type.
* Filter the dataframe using startYear. Keep years between 2000-2021 (Including 2000 and 2021)
* Eliminate movies that include "Documentary" in the genre (see tip below).



In [31]:
basics.replace({'\\N':'np.nan'}, inplace = True)

In [33]:
keepers =basics['tconst'].isin(akas['titleId'])
keepers

0            True
1            True
2           False
3           False
4            True
            ...  
10036507    False
10036508    False
10036509    False
10036510    False
10036511    False
Name: tconst, Length: 10036512, dtype: bool

In [34]:
basics = basics[keepers]
basics

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,np.nan,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,np.nan,5,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,np.nan,1,"Comedy,Short"
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,np.nan,1,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,np.nan,1,"Short,Sport"
...,...,...,...,...,...,...,...,...,...
10036373,tt9916560,tvMovie,March of Dimes Presents: Once Upon a Dime,March of Dimes Presents: Once Upon a Dime,0,1963,np.nan,58,Family
10036402,tt9916620,movie,The Copeland Case,The Copeland Case,0,np.nan,np.nan,np.nan,Drama
10036440,tt9916702,short,Loving London: The Playground,Loving London: The Playground,0,np.nan,np.nan,np.nan,"Drama,Short"
10036463,tt9916756,short,Pretty Pretty Black Girl,Pretty Pretty Black Girl,0,2019,np.nan,np.nan,Short


In [35]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1365742 entries, 0 to 10036467
Data columns (total 9 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   tconst          1365742 non-null  object
 1   titleType       1365742 non-null  object
 2   primaryTitle    1365742 non-null  object
 3   originalTitle   1365742 non-null  object
 4   isAdult         1365742 non-null  object
 5   startYear       1365742 non-null  object
 6   endYear         1365742 non-null  object
 7   runtimeMinutes  1365742 non-null  object
 8   genres          1365742 non-null  object
dtypes: object(9)
memory usage: 104.2+ MB


In [39]:

basics.dropna(subset=['runtimeMinutes'], inplace=True)

In [40]:
basics.dropna(subset=['genres'], inplace=True)

In [41]:
basics = basics[basics['titleType'] == 'Movie']

In [42]:
basics['startYear'] = basics['startYear'].astype(float)

In [43]:
filtered_basics = basics[(basics['startYear'] >= 2000) 
                         & (basics['startYear'] <= 2021)]

In [44]:
basics = basics[filtered_basics]
basics

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres


In [45]:
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres


## Deliverable

After filtering out movies that do not meet the stakeholder's specifications:

* Before saving, run a final .info() for each of the dataframes to show a summary of how many movies remain and the datatypes of each feature
* Save each file to a compressed csv file "Data/" folder inside your repository.
* Commit your changes to your repository in GitHub desktop and * Publish repository / Push Changes.
* Submit the link to your repository