# Project 3
Jacob Tanzi


## Specifications

Your stakeholder only wants you to include information for movies based on the following specifications:

* Include only movies that were released in the United States.
* Include only movies released from 2000 through 2021 (inclusive).
* Include only full-length movies (titleType = "movie").
* Exclude any movie with missing values for genre or runtime.
* Include only fictional movies (exclude the Documentary genre).




In [1]:
import os
os.makedirs('Data/',exist_ok=True)
os.listdir("Data/")

['title_basics.csv.gz']

In [2]:
import pandas as pd
import gzip

In [3]:
# Load the IMDb title basics table (tab-separated, gzip-compressed).
basics = pd.read_csv('https://datasets.imdbws.com/title.basics.tsv.gz', sep='\t', low_memory=True)


In [7]:
# Load the IMDb ratings table; later cells use `ratings`, so this read must run.
ratings = pd.read_csv('https://datasets.imdbws.com/title.ratings.tsv.gz', sep='\t', low_memory=True)

In [15]:
# Load the IMDb AKAs table; later cells use `akas`, so this read must run.
akas = pd.read_csv('https://datasets.imdbws.com/title.akas.tsv.gz', sep='\t', low_memory=False)


In [6]:
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [8]:
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1990
1,tt0000002,5.8,265
2,tt0000003,6.5,1851
3,tt0000004,5.5,178
4,tt0000005,6.2,2636


In [16]:
akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0


### Required Preprocessing - Details

Below is an overview of the steps required for this part of the project. Further down the page are detailed tips on how to accomplish many of these steps.

Filtering/Cleaning Steps:

AKAs:
* Keep only US movies.
* Replace "\N" with np.nan
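
The IMDb files mark missing values with the literal string `\N`. As an alternative to replacing it after loading, the sketch below asks `read_csv` to treat `\N` as missing while parsing, then keeps the US rows. The small in-memory TSV and the names `sample_tsv`, `akas_sample`, and `us_akas` are made up for illustration and are not part of the notebook.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for title.akas.tsv.gz (tab-separated, "\N" = missing).
sample_tsv = (
    "titleId\tordering\ttitle\tregion\tlanguage\n"
    "tt0000001\t1\tCarmencita\tDE\t\\N\n"
    "tt0000001\t6\tCarmencita\tUS\t\\N\n"
    "tt0000002\t7\tThe Clown and His Dogs\tUS\t\\N\n"
)

# na_values tells read_csv to parse the literal string \N as NaN.
akas_sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values=["\\N"])

# Keep only the US rows; .copy() makes the result an independent DataFrame.
us_akas = akas_sample[akas_sample["region"] == "US"].copy()
print(us_akas)
```

Handling `\N` at read time avoids a second pass over a table with millions of rows, though replacing it afterwards (as the cells below do) gives the same result.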



In [20]:
import numpy as np

In [21]:
akas.replace('\\N', np.nan, inplace=True)

In [22]:
akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,,imdbDisplay,,0
1,tt0000001,2,Carmencita,DE,,,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,,imdbDisplay,,0
3,tt0000001,4,Καρμενσίτα,GR,,imdbDisplay,,0
4,tt0000001,5,Карменсита,RU,,imdbDisplay,,0


In [23]:
akas = akas[akas['region'] == 'US']

In [24]:
akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
5,tt0000001,6,Carmencita,US,,imdbDisplay,,0
14,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0
33,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0
36,tt0000005,1,Blacksmithing Scene,US,,alternative,,0
41,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0


Title Basics:
* Keep only US movies (Use AKAs table, see "Filtering one dataframe based on another" section below)
* Replace "\N" with np.nan
* Eliminate movies that are null for runtimeMinutes
* Eliminate movies that are null for genres
* Keep only titleType == "movie"
* Convert the startYear column to float data type.
* Filter the dataframe using startYear. Keep years between 2000-2021 (Including 2000 and 2021)
* Eliminate movies that include "Documentary" in the genre (see tip below).
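
The steps in this list can be sketched as one function applied to a toy table. This is only an illustration under stated assumptions: `filter_basics`, `toy`, and the short `tt1`-style IDs are hypothetical, and the notebook itself performs the same steps cell by cell below.

```python
import numpy as np
import pandas as pd


def filter_basics(basics, us_title_ids):
    """Apply the Title Basics filtering steps and return a new DataFrame."""
    df = basics.replace("\\N", np.nan)                     # "\N" -> NaN
    df = df[df["tconst"].isin(us_title_ids)]               # US titles only
    df = df.dropna(subset=["runtimeMinutes", "genres"])    # no missing runtime/genres
    df = df[df["titleType"] == "movie"].copy()             # full-length movies only
    df["startYear"] = df["startYear"].astype(float)        # year as float
    df = df[df["startYear"].between(2000, 2021)]           # inclusive on both ends
    is_documentary = df["genres"].str.contains("documentary", case=False)
    return df[~is_documentary]                             # drop documentaries


# Hypothetical toy rows, each failing a different rule except tt1.
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5", "tt6"],
    "titleType": ["movie", "short", "movie", "movie", "movie", "movie"],
    "startYear": ["2001", "2005", "1999", "2010", "2015", "2020"],
    "runtimeMinutes": ["118", "5", "90", "\\N", "80", "95"],
    "genres": ["Comedy", "Short", "Drama", "Drama", "Documentary,History", "Drama"],
})

kept = filter_basics(toy, us_title_ids=["tt1", "tt2", "tt3", "tt4", "tt5"])
print(kept)
```

Only `tt1` survives: the other rows are a short, a 1999 release, a missing runtime, a documentary, and a title absent from the US list.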



In [25]:
basics.replace('\\N', np.nan, inplace=True)

In [26]:
# Boolean mask: True where the title also appears in the US-only AKAs table.
keepers = basics['tconst'].isin(akas['titleId'])
keepers

0            True
1            True
2           False
3           False
4            True
            ...  
10048734    False
10048735    False
10048736    False
10048737    False
10048738    False
Name: tconst, Length: 10048739, dtype: bool

In [27]:
basics = basics[keepers]
basics

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,,5,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,,1,"Comedy,Short"
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,,1,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,,1,"Short,Sport"
...,...,...,...,...,...,...,...,...,...
10048600,tt9916560,tvMovie,March of Dimes Presents: Once Upon a Dime,March of Dimes Presents: Once Upon a Dime,0,1963,,58,Family
10048629,tt9916620,movie,The Copeland Case,The Copeland Case,0,,,,Drama
10048667,tt9916702,short,Loving London: The Playground,Loving London: The Playground,0,,,,"Drama,Short"
10048690,tt9916756,short,Pretty Pretty Black Girl,Pretty Pretty Black Girl,0,2019,,,Short


In [28]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1368316 entries, 0 to 10048694
Data columns (total 9 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   tconst          1368316 non-null  object
 1   titleType       1368316 non-null  object
 2   primaryTitle    1368316 non-null  object
 3   originalTitle   1368316 non-null  object
 4   isAdult         1368316 non-null  object
 5   startYear       1269473 non-null  object
 6   endYear         37288 non-null    object
 7   runtimeMinutes  864233 non-null   object
 8   genres          1339809 non-null  object
dtypes: object(9)
memory usage: 104.4+ MB


In [29]:

# Reassign instead of using inplace=True, since basics is a filtered slice.
basics = basics.dropna(subset=['runtimeMinutes'])

In [30]:
# Reassign instead of using inplace=True, since basics is a filtered slice.
basics = basics.dropna(subset=['genres'])

In [31]:
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,,5,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,,1,"Comedy,Short"
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,,1,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,,1,"Short,Sport"


In [32]:
basics['titleType'].unique()

array(['short', 'movie', 'tvSeries', 'tvEpisode', 'tvMovie', 'tvShort',
       'tvMiniSeries', 'video', 'tvSpecial', 'videoGame'], dtype=object)

In [33]:
basics = basics[basics['titleType'] == 'movie']

In [34]:
basics['titleType'].unique()

array(['movie'], dtype=object)

In [35]:
basics['startYear'].unique()

array(['1894', '1897', '1906', '1907', '1908', '1909', '1911', '1913',
       '1912', '1919', '1914', '1915', '1936', '1916', '1917', '1925',
       '1918', '1920', '1921', '1924', '1923', '1922', '1927', '1926',
       '1935', '1929', '1928', '1942', '1930', '1932', '1931', '1937',
       '1950', '1933', '1938', '1939', '1934', '1940', '1944', '1946',
       '1957', '1943', '1941', '1948', '2001', '1945', '1953', '1954',
       '1983', '1947', '1949', '1980', '1952', '1951', '1962', '1955',
       '1961', '1958', '1956', '1959', '1960', '1963', '1965', '1971',
       '1964', '1969', '1966', '1967', '1968', '1990', '1970', '1973',
       '1979', '1976', '2020', '1978', '1972', '1989', '1974', '1975',
       '1981', '1995', '1986', '1987', '2016', '2018', '1992', '1977',
       '1984', '1985', '1982', '1993', '1988', nan, '1991', '2005',
       '1998', '2002', '2022', '1997', '2009', '2017', '2006', '1996',
       '1994', '1999', '2004', '2000', '2008', '2007', '2003', '1903',
       '2

In [37]:
# Work on an independent copy so the assignment below does not raise SettingWithCopyWarning.
basics = basics.copy()
basics['startYear'] = basics['startYear'].astype(float)
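
Assigning to a column of a DataFrame that was produced by boolean filtering is what typically triggers pandas' `SettingWithCopyWarning`. A minimal sketch of one way to convert the year column without it, assuming an explicit `.copy()` after the filter and `pd.to_numeric` for the conversion; the names `frame` and `movies` and the rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: years are stored as strings, with one missing value.
frame = pd.DataFrame({
    "titleType": ["movie", "short", "movie"],
    "startYear": ["2001", "1894", np.nan],
})

# .copy() detaches the filtered rows from the original frame.
movies = frame[frame["titleType"] == "movie"].copy()

# errors="coerce" turns anything unparseable into NaN instead of raising.
movies["startYear"] = pd.to_numeric(movies["startYear"], errors="coerce")
print(movies.dtypes)
```

Because the column contains a missing value, the result is a float column, which matches the float conversion the assignment asks for.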


In [38]:
filtered_basics = basics[(basics['startYear'] >= 2000) 
                         & (basics['startYear'] <= 2021)]

In [42]:
filtered_basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
34802,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
61114,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama
67488,tt0068865,movie,Lives of Performers,Lives of Performers,0,2016.0,,90,Drama
67666,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,,122,Drama
86793,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,,100,"Comedy,Horror,Sci-Fi"


In [43]:
basics = filtered_basics

In [44]:
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
34802,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
61114,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama
67488,tt0068865,movie,Lives of Performers,Lives of Performers,0,2016.0,,90,Drama
67666,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,,122,Drama
86793,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,,100,"Comedy,Horror,Sci-Fi"


In [45]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 114501 entries, 34802 to 10048505
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          114501 non-null  object 
 1   titleType       114501 non-null  object 
 2   primaryTitle    114501 non-null  object 
 3   originalTitle   114501 non-null  object 
 4   isAdult         114501 non-null  object 
 5   startYear       114501 non-null  float64
 6   endYear         0 non-null       object 
 7   runtimeMinutes  114501 non-null  object 
 8   genres          114501 non-null  object 
dtypes: float64(1), object(8)
memory usage: 12.8+ MB


In [46]:
is_documentary = basics['genres'].str.contains('documentary',case=False)
basics = basics[~is_documentary]

Ratings:
* Keep only movies that were included in your final title basics dataframe above.
(Use AKAs table, see "Filtering one dataframe based on another" section below)
* Replace "\N" with np.nan (if any)
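
The first bullet asks for ratings of movies in the final title basics dataframe, while the parenthetical points to the AKAs table. A minimal sketch of the stricter reading, filtering ratings on the `tconst` values that remain in basics; `basics_final`, `ratings_all`, and their rows are hypothetical toy data, not values from the IMDb files.

```python
import pandas as pd

# Hypothetical final basics table after all filters.
basics_final = pd.DataFrame({"tconst": ["tt1", "tt3"]})

# Hypothetical ratings table covering more titles than basics_final.
ratings_all = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "averageRating": [6.0, 5.5, 7.1, 4.2],
    "numVotes": [100, 20, 300, 15],
})

# Keep only ratings whose tconst survives in the final basics table.
in_basics = ratings_all["tconst"].isin(basics_final["tconst"])
ratings_kept = ratings_all[in_basics].reset_index(drop=True)
print(ratings_kept)
```

Filtering on the AKAs IDs instead keeps ratings for every US title, including shorts and older releases, so the two approaches can produce very different row counts.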

In [47]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1334565 entries, 0 to 1334564
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1334565 non-null  object 
 1   averageRating  1334565 non-null  float64
 2   numVotes       1334565 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 30.5+ MB


In [48]:
ratings.replace('\\N', np.nan, inplace=True)

In [49]:
r_keep = ratings['tconst'].isin(akas['titleId'])
r_keep

0           True
1           True
2          False
3          False
4           True
           ...  
1334560    False
1334561    False
1334562    False
1334563    False
1334564    False
Name: tconst, Length: 1334565, dtype: bool

In [50]:
ratings = ratings[r_keep]


In [51]:
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1990
1,tt0000002,5.8,265
4,tt0000005,6.2,2636
5,tt0000006,5.0,183
6,tt0000007,5.4,826


In [52]:
basics.to_csv("Data/title_basics.csv.gz",compression='gzip',index=False)

In [53]:
akas.to_csv("Data/title_akas.csv.gz",compression='gzip',index=False)

In [54]:
ratings.to_csv("Data/title_ratings.csv.gz",compression='gzip',index=False)

In [55]:
basics = pd.read_csv("Data/title_basics.csv.gz", low_memory = False)
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama
2,tt0068865,movie,Lives of Performers,Lives of Performers,0,2016.0,,90,Drama
3,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,,122,Drama
4,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,,100,"Comedy,Horror,Sci-Fi"


In [56]:
akas = pd.read_csv("Data/title_akas.csv.gz", low_memory = False)
akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
1,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
2,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
3,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
4,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0


In [57]:
ratings = pd.read_csv("Data/title_ratings.csv.gz", low_memory = False)
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1990
1,tt0000002,5.8,265
2,tt0000005,6.2,2636
3,tt0000006,5.0,183
4,tt0000007,5.4,826


## Deliverable

After filtering out movies that do not meet the stakeholder's specifications:

* Before saving, run a final .info() for each of the dataframes to show a summary of how many movies remain and the datatypes of each feature
* Save each dataframe as a compressed CSV file in the "Data/" folder inside your repository.
* Commit your changes to your repository in GitHub Desktop, then Publish repository / Push changes.
* Submit the link to your repository
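
The first two deliverable steps can be sketched as a small helper that prints `.info()` for each dataframe and then writes it as a gzip-compressed CSV. The helper name `summarize_and_save`, the demo frame, and the temporary folder are assumptions for illustration; in the notebook the real `basics`, `akas`, and `ratings` dataframes and the "Data/" folder would take their place.

```python
import os
import tempfile

import pandas as pd


def summarize_and_save(frames, folder):
    """Print .info() for each DataFrame, then save it as <name>.csv.gz in folder."""
    os.makedirs(folder, exist_ok=True)
    paths = {}
    for name, df in frames.items():
        print(f"--- {name} ---")
        df.info()  # row count, non-null counts, and dtypes
        path = os.path.join(folder, f"{name}.csv.gz")
        df.to_csv(path, compression="gzip", index=False)
        paths[name] = path
    return paths


# Hypothetical one-row frame written to a throwaway folder.
demo_frames = {
    "title_basics": pd.DataFrame({"tconst": ["tt1"], "startYear": [2001.0]}),
}
demo_paths = summarize_and_save(demo_frames, os.path.join(tempfile.mkdtemp(), "Data"))
print(demo_paths)
```

Calling `.info()` immediately before saving makes the summary reflect exactly what ends up in each file.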