# Project 3
Jacob Tanzi


## Specifications

Your stakeholder only wants you to include information for movies based on the following specifications:

* Include only movies that were released in the United States.
* Include only movies that were released 2000 - 2021 (include 2000 and 2021)
* Include only full-length movies (titleType = "movie").
* Exclude any movie with missing values for genre or runtime
* Include only fictional movies (not from the Documentary genre)




In [91]:
import os
os.makedirs('Data/',exist_ok=True)
os.listdir("Data/")

[]

In [92]:
import pandas as pd
import gzip

In [93]:
file_path = 'title.basics.tsv.gz'


def read_gzipped_tsv(file_path):
    with gzip.open(file_path, 'rb') as f:
        basics = pd.read_csv(f, sep='\t', low_memory=False)
    return basics

basics = read_gzipped_tsv(file_path)


In [94]:
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [95]:
file_path = 'title.ratings.tsv.gz'


def read_gzipped_tsv(file_path):
    with gzip.open(file_path, 'rb') as f:
        ratings = pd.read_csv(f, sep='\t', low_memory=False)
    return ratings

ratings = read_gzipped_tsv(file_path)

In [96]:
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1989
1,tt0000002,5.8,265
2,tt0000003,6.5,1850
3,tt0000004,5.5,178
4,tt0000005,6.2,2634


In [97]:

akas = pd.read_csv('title-akas-us-only.csv', low_memory=False)


In [98]:
akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,6,Carmencita,US,\N,imdbDisplay,\N,0
1,tt0000002,7,The Clown and His Dogs,US,\N,\N,literal English title,0
2,tt0000005,10,Blacksmith Scene,US,\N,imdbDisplay,\N,0
3,tt0000005,1,Blacksmithing Scene,US,\N,alternative,\N,0
4,tt0000005,6,Blacksmith Scene #1,US,\N,alternative,\N,0


### Required Preprocessing - Details

Below is an overview of the steps required for this part of the project. Further down the page are detailed tips on how to accomplish many of these steps.

Filtering/Cleaning Steps:

AKAs:
* keep only US movies.
* Replace "\N" with np.nan



In [99]:
akas.replace({'\\N':'np.nan'}, inplace = True)

In [100]:
akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,6,Carmencita,US,np.nan,imdbDisplay,np.nan,0
1,tt0000002,7,The Clown and His Dogs,US,np.nan,np.nan,literal English title,0
2,tt0000005,10,Blacksmith Scene,US,np.nan,imdbDisplay,np.nan,0
3,tt0000005,1,Blacksmithing Scene,US,np.nan,alternative,np.nan,0
4,tt0000005,6,Blacksmith Scene #1,US,np.nan,alternative,np.nan,0


Ratings:
* Keep only movies that were included in your final title basics dataframe above.
(Use AKAs table, see "Filtering one dataframe based on another" section below)
* Replace "\N" with np.nan (if any)

In [101]:
ratings.replace({'\\N':'np.nan'}, inplace = True)

In [102]:
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1989
1,tt0000002,5.8,265
2,tt0000003,6.5,1850
3,tt0000004,5.5,178
4,tt0000005,6.2,2634


Title Basics:
* Keep only US movies (Use AKAs table, see "Filtering one dataframe based on another" section below)
* Replace "\N" with np.nan
* Eliminate movies that are null for runtimeMinutes
* Eliminate movies that are null for genre
* keep only titleType==Movie
* Convert the startYear column to float data type.
* Filter the dataframe using startYear. Keep years between 2000-2021 (Including 2000 and 2021)
* Eliminate movies that include "Documentary" in the genre (see tip below).



In [103]:
basics.replace({'\\N':'np.nan'}, inplace = True)

In [104]:
keepers =basics['tconst'].isin(akas['titleId'])
keepers

0            True
1            True
2           False
3           False
4            True
            ...  
10036507    False
10036508    False
10036509    False
10036510    False
10036511    False
Name: tconst, Length: 10036512, dtype: bool

In [105]:
basics = basics[keepers]
basics

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,np.nan,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,np.nan,5,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,np.nan,1,"Comedy,Short"
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,np.nan,1,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,np.nan,1,"Short,Sport"
...,...,...,...,...,...,...,...,...,...
10036373,tt9916560,tvMovie,March of Dimes Presents: Once Upon a Dime,March of Dimes Presents: Once Upon a Dime,0,1963,np.nan,58,Family
10036402,tt9916620,movie,The Copeland Case,The Copeland Case,0,np.nan,np.nan,np.nan,Drama
10036440,tt9916702,short,Loving London: The Playground,Loving London: The Playground,0,np.nan,np.nan,np.nan,"Drama,Short"
10036463,tt9916756,short,Pretty Pretty Black Girl,Pretty Pretty Black Girl,0,2019,np.nan,np.nan,Short


In [106]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1365742 entries, 0 to 10036467
Data columns (total 9 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   tconst          1365742 non-null  object
 1   titleType       1365742 non-null  object
 2   primaryTitle    1365742 non-null  object
 3   originalTitle   1365742 non-null  object
 4   isAdult         1365742 non-null  object
 5   startYear       1365742 non-null  object
 6   endYear         1365742 non-null  object
 7   runtimeMinutes  1365742 non-null  object
 8   genres          1365742 non-null  object
dtypes: object(9)
memory usage: 104.2+ MB


In [107]:

basics.dropna(subset=['runtimeMinutes'], inplace=True)

In [108]:
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,np.nan,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,np.nan,5,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,np.nan,1,"Comedy,Short"
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,np.nan,1,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,np.nan,1,"Short,Sport"


In [109]:
basics.dropna(subset=['genres'], inplace=True)

In [110]:
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,np.nan,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,np.nan,5,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,np.nan,1,"Comedy,Short"
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,np.nan,1,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,np.nan,1,"Short,Sport"


In [111]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1365742 entries, 0 to 10036467
Data columns (total 9 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   tconst          1365742 non-null  object
 1   titleType       1365742 non-null  object
 2   primaryTitle    1365742 non-null  object
 3   originalTitle   1365742 non-null  object
 4   isAdult         1365742 non-null  object
 5   startYear       1365742 non-null  object
 6   endYear         1365742 non-null  object
 7   runtimeMinutes  1365742 non-null  object
 8   genres          1365742 non-null  object
dtypes: object(9)
memory usage: 104.2+ MB


In [112]:
basics['titleType'].unique()

array(['short', 'movie', 'tvSeries', 'tvMovie', 'tvEpisode', 'tvShort',
       'tvMiniSeries', 'tvSpecial', 'video', 'videoGame'], dtype=object)

In [113]:
basics = basics[basics['titleType'] == 'movie']

In [114]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 299523 entries, 8 to 10036402
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          299523 non-null  object
 1   titleType       299523 non-null  object
 2   primaryTitle    299523 non-null  object
 3   originalTitle   299523 non-null  object
 4   isAdult         299523 non-null  object
 5   startYear       299523 non-null  object
 6   endYear         299523 non-null  object
 7   runtimeMinutes  299523 non-null  object
 8   genres          299523 non-null  object
dtypes: object(9)
memory usage: 22.9+ MB


In [115]:
basics['startYear'].unique()

array(['1894', '1897', '1906', '1907', '1908', '1910', '1909', '1911',
       '1913', '1912', '1914', '1919', '1917', '1915', '1936', '1916',
       '1925', '1918', '1922', '1920', '1921', '1924', '1923', '1927',
       '1926', '1935', '1929', '1928', '1942', '1930', '1932', '1939',
       '1931', 'np.nan', '1937', '1950', '1933', '1938', '1934', '1940',
       '1944', '1946', '1957', '1943', '1941', '1948', '2001', '1945',
       '1949', '1953', '1954', '1983', '1947', '1980', '1952', '1951',
       '1962', '1955', '1961', '1958', '1956', '1959', '1960', '1963',
       '1965', '1971', '1964', '1969', '1972', '1966', '1967', '1968',
       '1990', '1970', '1973', '1979', '1976', '2020', '1978', '1989',
       '1974', '1975', '1981', '1995', '1986', '1987', '2016', '2018',
       '1992', '1977', '1984', '1985', '1982', '1993', '1988', '1991',
       '2005', '1998', '2002', '1997', '2009', '2017', '2006', '1996',
       '1994', '1999', '2004', '2000', '2003', '2008', '2007', '2022',
    

In [122]:
basics['startYear'].value_counts()

2019.0    9619
2018.0    9429
2017.0    9320
2021.0    9235
2022.0    9217
          ... 
2028.0       2
1894.0       1
2030.0       1
1904.0       1
2031.0       1
Name: startYear, Length: 135, dtype: int64

In [117]:
basics['startYear'] = basics['startYear'].replace('np.nan', np.nan)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  basics['startYear'] = basics['startYear'].replace('np.nan', np.nan)


In [123]:
basics['startYear'] = basics['startYear'].astype(float)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  basics['startYear'] = basics['startYear'].astype(float)


In [124]:
filtered_basics = basics[(basics['startYear'] >= 2000) 
                         & (basics['startYear'] <= 2021)]

In [125]:
basics = basics[filtered_basics]
basics

ValueError: Boolean array expected for the condition, not object

In [126]:
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894.0,np.nan,45,Romance
144,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0,1897.0,np.nan,100,"Documentary,News,Sport"
570,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906.0,np.nan,70,"Action,Adventure,Biography"
587,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907.0,np.nan,90,Drama
625,tt0000630,movie,Hamlet,Amleto,0,1908.0,np.nan,np.nan,Drama


In [46]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          0 non-null      object 
 1   titleType       0 non-null      object 
 2   primaryTitle    0 non-null      object 
 3   originalTitle   0 non-null      object 
 4   isAdult         0 non-null      object 
 5   startYear       0 non-null      float64
 6   endYear         0 non-null      object 
 7   runtimeMinutes  0 non-null      object 
 8   genres          0 non-null      object 
dtypes: float64(1), object(8)
memory usage: 0.0+ bytes


In [None]:
basics.to_csv("Data/title_basics.csv.gz",compression='gzip',index=False)

## Deliverable

After filtering out movies that do not meet the stakeholder's specifications:

* Before saving, run a final .info() for each of the dataframes to show a summary of how many movies remain and the datatypes of each feature
* Save each file to a compressed csv file "Data/" folder inside your repository.
* Commit your changes to your repository in GitHub desktop and * Publish repository / Push Changes.
* Submit the link to your repository