# Data Dictionary

### Data Location
- The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily.

### IMDb Dataset Details
- Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing or null for that title/name. The available datasets are as follows:

### title.akas.tsv.gz
- titleId (string) - a tconst, an alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- title (string) – the localized title
- region (string) - the region for this version of the title
- language (string) - the language of the title
- types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
- attributes (array) - Additional terms to describe this alternative title, not enumerated
- isOriginalTitle (boolean) – 0: not original title; 1: original title

### title.basics.tsv.gz
- tconst (string) - alphanumeric unique identifier of the title
- titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
- primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
- isAdult (boolean) - 0: non-adult title; 1: adult title
- startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
- runtimeMinutes – primary runtime of the title, in minutes
- genres (string array) – includes up to three genres associated with the title
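
The array-typed fields arrive as plain strings; `genres`, for example, is a single comma-separated value such as `Action,Adventure,Biography` (see the `title_basics` preview later in this notebook). The sketch below shows one possible way to turn that string into a Python list and into one row per genre. The frame name `genres_demo` and the column `genres_list` are illustrative and not part of the notebook.

```python
import pandas as pd

# Two rows copied from the title_basics preview shown later in this notebook
genres_demo = pd.DataFrame({
    "tconst": ["tt0000574", "tt0000591"],
    "genres": ["Action,Adventure,Biography", "Drama"],
})

# Split the comma-separated string into a list of genres per title
genres_demo["genres_list"] = genres_demo["genres"].str.split(",")

# One row per (title, genre) pair, which is convenient for counting genres
exploded = genres_demo.explode("genres_list")
print(exploded[["tconst", "genres_list"]])
```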

### title.crew.tsv.gz
- tconst (string) - alphanumeric unique identifier of the title
- directors (array of nconsts) - director(s) of the given title
- writers (array of nconsts) – writer(s) of the given title

### title.episode.tsv.gz
- tconst (string) - alphanumeric identifier of episode
- parentTconst (string) - alphanumeric identifier of the parent TV Series
- seasonNumber (integer) – season number the episode belongs to
- episodeNumber (integer) – episode number of the tconst in the TV series

### title.principals.tsv.gz
- tconst (string) - alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given tconst
- nconst (string) - alphanumeric unique identifier of the name/person
- category (string) - the category of job that person was in
- job (string) - the specific job title if applicable, else '\N'
- characters (string) - the name of the character played if applicable, else '\N'

### title.ratings.tsv.gz
- tconst (string) - alphanumeric unique identifier of the title
- averageRating – weighted average of all the individual user ratings
- numVotes - number of votes the title has received

### name.basics.tsv.gz
- nconst (string) - alphanumeric unique identifier of the name/person
- primaryName (string) – name by which the person is most often credited
- birthYear – in YYYY format
- deathYear – in YYYY format if applicable, else '\N'
- primaryProfession (array of strings) – the top-3 professions of the person
- knownForTitles (array of tconsts) – titles the person is known for
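
As a rough illustration of the file format described above (gzipped, tab-separated, UTF-8, `\N` for missing values), the sketch below builds a tiny in-memory stand-in for an IMDb file and reads it with pandas. The rows and the identifiers `tt9999991` and `tt9999992` are hypothetical. Passing `na_values` is one option for converting `\N` at load time; the notebook itself replaces `\N` after loading.

```python
import gzip
import io

import pandas as pd

# Hypothetical rows laid out like a title.basics file (header line, tabs, \N for nulls)
raw_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\n"
    "tt9999991\tmovie\tExample One\t2001\t\\N\n"
    "tt9999992\tmovie\tExample Two\t\\N\t\\N\n"
)

# Compress in memory so the example needs no download and no files on disk
buffer = io.BytesIO(gzip.compress(raw_tsv.encode("utf-8")))

# sep='\t' handles the TSV layout; na_values turns the \N placeholder into NaN
sample_df = pd.read_csv(buffer, sep="\t", compression="gzip", na_values="\\N")
print(sample_df.isna().sum())
```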

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
basics="https://datasets.imdbws.com/title.basics.tsv.gz"
akas="https://datasets.imdbws.com/title.akas.tsv.gz"
ratings="https://datasets.imdbws.com/title.ratings.tsv.gz"

In [3]:
# Create DataFrames directly from the IMDb URLs (left commented out; the saved copies in Data/ are loaded below)
#basics_df = pd.read_csv(basics, sep='\t', low_memory=False)

#akas_df = pd.read_csv(akas, sep='\t', low_memory=False)

#ratings_df = pd.read_csv(ratings, sep='\t', low_memory=False)

In [4]:
# making new folder with os
os.makedirs('Data/', exist_ok=True)
# Confirm folder created
os.listdir("Data/")

['final_tmdb_data_2000.csv.gz',
 'final_tmdb_data_2001.csv.gz',
 'title_akas.csv.gz',
 'title_basics.csv.gz',
 'title_ratings.csv.gz',
 'tmdb_api_results_2000.json',
 'tmdb_api_results_2001.json']

In [5]:
# Open saved file and preview again
basics_df = pd.read_csv("Data/title_basics.csv.gz", low_memory=False)
basics_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,,45,Romance
1,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906,,70,"Action,Adventure,Biography"
2,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907,,90,Drama
3,tt0000679,movie,The Fairylogue and Radio-Plays,The Fairylogue and Radio-Plays,0,1908,,120,"Adventure,Fantasy"
4,tt0001285,movie,The Life of Moses,The Life of Moses,0,1909,,50,"Biography,Drama,Family"


In [6]:
# Open saved file and preview again
akas_df = pd.read_csv("Data/title_akas.csv.gz", low_memory=False)
akas_df.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,6,Carmencita,US,\N,imdbDisplay,\N,0
1,tt0000002,7,The Clown and His Dogs,US,\N,\N,literal English title,0
2,tt0000005,10,Blacksmith Scene,US,\N,imdbDisplay,\N,0
3,tt0000005,1,Blacksmithing Scene,US,\N,alternative,\N,0
4,tt0000005,6,Blacksmith Scene #1,US,\N,alternative,\N,0


In [7]:
# Open saved file and preview again
ratings_df = pd.read_csv("Data/title_ratings.csv.gz", low_memory=False)
ratings_df.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1986
1,tt0000002,5.8,265
2,tt0000003,6.5,1845
3,tt0000004,5.5,178
4,tt0000005,6.2,2627


# Processing

## Filtering/Cleaning Steps

### Title Basics:

In [8]:
# Replace "\N" with np.nan first, so placeholder strings cannot break the numeric conversion
basics_df.replace({'\\N':np.nan},inplace=True)

In [9]:
# Convert startYear to float
basics_df['startYear'] = basics_df['startYear'].astype(float)
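
The `\N` replacement and the float conversion interact: calling `astype(float)` on a column that still contains the string `\N` raises an error. The sketch below shows an alternative that does both in one step with `pd.to_numeric(..., errors="coerce")`. It is a swapped-in technique for illustration, not what the notebook runs, and the series `raw_years` is hypothetical sample data.

```python
import pandas as pd

# Hypothetical raw column that still holds the \N placeholder as a string
raw_years = pd.Series(["1894", "\\N", "2001"], name="startYear")

# errors="coerce" converts anything non-numeric (including \N) to NaN
years = pd.to_numeric(raw_years, errors="coerce")
print(years)
```

Note that `errors="coerce"` also hides any other malformed values, so the explicit replace-then-convert approach is stricter.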

In [10]:
# Eliminate movies that are missing values for runtimeMinutes, genres, startYear
basics_df = basics_df.dropna(subset=['runtimeMinutes', 'genres', 'startYear'])

In [11]:
# Keep titleType movie
basics_df = basics_df.loc[basics_df['titleType'] == 'movie']

In [12]:
# Keep startYear 2000-2021
basics_df = basics_df.loc[(basics_df['startYear'] >= 2000) & (basics_df['startYear'] <= 2021)]

In [13]:
basics_df['startYear'].value_counts()

2019.0    5876
2018.0    5779
2017.0    5642
2016.0    5254
2021.0    5159
2015.0    5056
2020.0    5005
2014.0    4917
2013.0    4711
2012.0    4522
2011.0    4230
2010.0    3861
2009.0    3557
2008.0    2912
2007.0    2576
2006.0    2438
2005.0    2184
2004.0    1902
2003.0    1685
2001.0    1577
2002.0    1570
2000.0    1456
Name: startYear, dtype: int64

In [14]:
# Eliminate movies that include "Documentary" in genre
documentary_filter = basics_df['genres'].str.contains('documentary', case=False)
basics_df = basics_df[~documentary_filter]
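
The documentary filter relies on `genres` having no missing values, which holds here because rows with null genres were dropped earlier. As a sketch under the assumption that nulls might still be present, `na=False` makes `str.contains` treat them as non-matches so the mask stays boolean. The frame `demo` and its titles are hypothetical.

```python
import pandas as pd

# Hypothetical rows, including one title with a missing genres value
demo = pd.DataFrame({
    "primaryTitle": ["Film A", "Film B", "Film C"],
    "genres": ["Drama", "Documentary,History", None],
})

# Case-insensitive substring match; na=False treats missing genres as "not a documentary"
is_documentary = demo["genres"].str.contains("documentary", case=False, na=False)

# Keep everything that is not flagged as a documentary
non_documentaries = demo[~is_documentary]
print(non_documentaries)
```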

In [15]:
basics_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 81869 entries, 16194 to 159782
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          81869 non-null  object 
 1   titleType       81869 non-null  object 
 2   primaryTitle    81869 non-null  object 
 3   originalTitle   81869 non-null  object 
 4   isAdult         81869 non-null  int64  
 5   startYear       81869 non-null  float64
 6   endYear         0 non-null      float64
 7   runtimeMinutes  81869 non-null  int64  
 8   genres          81869 non-null  object 
dtypes: float64(2), int64(2), object(5)
memory usage: 6.2+ MB


### AKAs:

In [16]:
akas_df = pd.read_csv(akas, sep='\t', low_memory=False)

In [17]:
akas_df.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0


In [18]:
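# Note: `info` without parentheses displays the bound method's repr (as below), not the column summary; call akas_df.info() for that
akas_df.info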

<bound method DataFrame.info of             titleId  ordering                      title region language  \
0         tt0000001         1                 Карменсіта     UA       \N   
1         tt0000001         2                 Carmencita     DE       \N   
2         tt0000001         3  Carmencita - spanyol tánc     HU       \N   
3         tt0000001         4                 Καρμενσίτα     GR       \N   
4         tt0000001         5                 Карменсита     RU       \N   
...             ...       ...                        ...    ...      ...   
36516803  tt9916852         5             Episódio #3.20     PT       pt   
36516804  tt9916852         6             Episodio #3.20     IT       it   
36516805  tt9916852         7               एपिसोड #3.20     IN       hi   
36516806  tt9916856         1                   The Wind     DE       \N   
36516807  tt9916856         2                   The Wind     \N       \N   

                types     attributes isOriginalTitle  


In [19]:
# Replace "\N" with np.nan
akas_df.replace({'\\N':np.nan},inplace=True)

In [20]:
#akas_df = akas_df.loc[akas_df['region'] =='US']
us_only = akas_df['region'] == 'US'

In [21]:
akas_df = akas_df[us_only]
akas_df.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
5,tt0000001,6,Carmencita,US,,imdbDisplay,,0
14,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0
33,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0
36,tt0000005,1,Blacksmithing Scene,US,,alternative,,0
41,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0


In [22]:
akas_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1450916 entries, 5 to 36516552
Data columns (total 8 columns):
 #   Column           Non-Null Count    Dtype 
---  ------           --------------    ----- 
 0   titleId          1450916 non-null  object
 1   ordering         1450916 non-null  int64 
 2   title            1450916 non-null  object
 3   region           1450916 non-null  object
 4   language         4000 non-null     object
 5   types            981156 non-null   object
 6   attributes       46968 non-null    object
 7   isOriginalTitle  1449574 non-null  object
dtypes: int64(1), object(7)
memory usage: 99.6+ MB


In [23]:
# Filter the basics table down to only US titles, using the US-filtered akas DataFrame
keepers = basics_df['tconst'].isin(akas_df['titleId'])
keepers

16194     True
28684     True
31584     True
38560     True
41184     True
          ... 
159778    True
159779    True
159780    True
159781    True
159782    True
Name: tconst, Length: 81869, dtype: bool

In [24]:
basics_df = basics_df[keepers]
basics_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
16194,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
28684,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama
31584,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,,122,Drama
38560,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,,100,"Comedy,Horror,Sci-Fi"
41184,tt0096056,movie,Crime and Punishment,Crime and Punishment,0,2002.0,,126,Drama
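
The `isin` mask used above acts like a semi-join: a row of the basics table is kept if its `tconst` appears at least once in the akas table, and it is never duplicated even when several akas rows share the same `titleId`. A minimal sketch with hypothetical identifiers and frame names:

```python
import pandas as pd

# Hypothetical stand-ins for basics_df and the US-only akas_df
basics_demo = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primaryTitle": ["A", "B", "C"],
})
akas_demo = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt3"],  # tt1 has two US alternate titles
    "region": ["US", "US", "US"],
})

# True where the title has at least one matching akas row
keep = basics_demo["tconst"].isin(akas_demo["titleId"])
filtered = basics_demo[keep]
print(filtered)
```

An inner `merge` on the same keys would instead repeat `tt1` once per matching akas row, which is why a boolean mask is the simpler choice when only filtering is needed.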


### Ratings:

In [25]:
# Replace "\N" with np.nan
ratings_df.replace({'\\N':np.nan},inplace=True)

In [26]:
ratings_df.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1986
1,tt0000002,5.8,265
2,tt0000003,6.5,1845
3,tt0000004,5.5,178
4,tt0000005,6.2,2627


In [27]:
# Keep only ratings for titles that appear in the US-filtered akas DataFrame
ratings_us = ratings_df['tconst'].isin(akas_df['titleId'])

In [28]:
ratings_df = ratings_df[ratings_us]
ratings_df

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1986
1,tt0000002,5.8,265
4,tt0000005,6.2,2627
5,tt0000006,5.1,182
6,tt0000007,5.4,820
...,...,...,...
1328545,tt9916200,8.1,230
1328546,tt9916204,8.2,264
1328553,tt9916348,8.3,18
1328554,tt9916362,6.4,5413


In [29]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 503198 entries, 0 to 1328559
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   tconst         503198 non-null  object 
 1   averageRating  503198 non-null  float64
 2   numVotes       503198 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 15.4+ MB


## Save DataFrames to compressed CSVs in the Data folder

In [30]:
## Save dataframe to file
basics_df.to_csv("Data/title_basics.csv.gz", compression='gzip', index=False)

In [31]:
## Save dataframe to file
ratings_df.to_csv("Data/title_ratings.csv.gz", compression='gzip', index=False)

In [32]:
## Save dataframe to file
akas_df.to_csv("Data/title_akas.csv.gz", compression='gzip', index=False)
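
To check that a save like the ones above round-trips cleanly, the sketch below writes a small frame to a gzipped CSV with `index=False` and reads it back. It uses a temporary directory and a hypothetical file name (`title_ratings_demo.csv.gz`) so it does not touch the `Data/` folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame shaped like the ratings table
demo = pd.DataFrame({"tconst": ["tt1", "tt2"], "averageRating": [5.7, 6.2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "title_ratings_demo.csv.gz")
    # index=False keeps the row index out of the file
    demo.to_csv(path, compression="gzip", index=False)
    # pandas infers gzip compression from the .gz extension on read
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
```

With `index=False`, the reloaded frame has only the original columns; saving with the default `index=True` would add an extra unnamed index column when the file is read back.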