# Data Dictionary

### Data Location
- The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily.

### IMDb Dataset Details
- Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing or null for that title/name. The available datasets are as follows:

### title.akas.tsv.gz
- titleId (string) - a tconst, an alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- title (string) – the localized title
- region (string) - the region for this version of the title
- language (string) - the language of the title
- types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
- attributes (array) - Additional terms to describe this alternative title, not enumerated
- isOriginalTitle (boolean) – 0: not original title; 1: original title

### title.basics.tsv.gz
- tconst (string) - alphanumeric unique identifier of the title
- titleType (string) – the type/format of the title (e.g. movie, short, tvSeries, tvEpisode, video, etc)
- primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
- isAdult (boolean) - 0: non-adult title; 1: adult title
- startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
- runtimeMinutes – primary runtime of the title, in minutes
- genres (string array) – includes up to three genres associated with the title

### title.crew.tsv.gz
- tconst (string) - alphanumeric unique identifier of the title
- directors (array of nconsts) - director(s) of the given title
- writers (array of nconsts) – writer(s) of the given title

### title.episode.tsv.gz
- tconst (string) - alphanumeric identifier of episode
- parentTconst (string) - alphanumeric identifier of the parent TV Series
- seasonNumber (integer) – season number the episode belongs to
- episodeNumber (integer) – episode number of the tconst in the TV series

### title.principals.tsv.gz
- tconst (string) - alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- nconst (string) - alphanumeric unique identifier of the name/person
- category (string) - the category of job that person was in
- job (string) - the specific job title if applicable, else '\N'
- characters (string) - the name of the character played if applicable, else '\N'

### title.ratings.tsv.gz
- tconst (string) - alphanumeric unique identifier of the title
- averageRating – weighted average of all the individual user ratings
- numVotes - number of votes the title has received

### name.basics.tsv.gz
- nconst (string) - alphanumeric unique identifier of the name/person
- primaryName (string) – name by which the person is most often credited
- birthYear – in YYYY format
- deathYear – in YYYY format if applicable, else '\N'
- primaryProfession (array of strings) – the top-3 professions of the person
- knownForTitles (array of tconsts) – titles the person is known for
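
The format described above (tab-separated, UTF-8, `\N` for missing values, comma-joined arrays) can be read with pandas roughly as follows. This is a minimal sketch on an in-memory sample rather than the real download; `load_imdb_tsv` and `sample_tsv` are hypothetical names, and reading every column as a string is an assumption made to keep the parsing explicit.

```python
import io

import pandas as pd

# Tiny stand-in for title.basics.tsv.gz (values taken from the previews in this notebook).
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\tgenres\n"
    "tt0000001\tshort\tCarmencita\t1894\t\\N\tDocumentary,Short\n"
    "tt0000009\tmovie\tMiss Jerry\t1894\t\\N\tRomance\n"
)


def load_imdb_tsv(source):
    """Read an IMDb-style TSV, treating the literal \\N marker as missing."""
    return pd.read_csv(
        source,
        sep="\t",
        na_values=["\\N"],      # IMDb's null marker
        keep_default_na=False,  # do not treat other strings (e.g. "NA") as missing
        dtype=str,              # assumption: keep everything as text, convert later
    )


sample = load_imdb_tsv(io.StringIO(sample_tsv))
# "Array" columns are comma-joined strings; split them into Python lists.
sample["genres_list"] = sample["genres"].str.split(",")
print(sample[["tconst", "endYear", "genres_list"]])
```

Passing a URL or a local `.tsv.gz` path instead of the `StringIO` object should work the same way, since pandas infers gzip compression from the file extension.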

In [4]:
import pandas as pd
import numpy as np
import os
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import pymysql
pymysql.install_as_MySQLdb()
from urllib.parse import quote_plus as urlquote

# ## Change username and password to match your personal MySQL Server settings
# username = 'root' # default username for MySQL db is root
# password = 'YOUR_PASSWORD' # whatever password you chose during MySQL installation.

# connection = f'mysql+pymysql://{username}:{password}@localhost/Chinook'
# engine = create_engine(connection)

In [5]:
basics="https://datasets.imdbws.com/title.basics.tsv.gz"
akas="https://datasets.imdbws.com/title.akas.tsv.gz"
ratings="https://datasets.imdbws.com/title.ratings.tsv.gz"

In [6]:
import json
with open('/Users/parri_nqdmzn3/.secret/mysql.JSON') as f:
    login = json.load(f)
login.keys()

dict_keys(['username', 'password'])

In [7]:
connection = f"mysql+pymysql://{login['username']}:{urlquote(login['password'])}@localhost:8888/"
engine = create_engine(connection)
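
The password is passed through `urlquote` because characters such as `@` or `/` would otherwise be read as URL delimiters. A minimal sketch of that step in isolation; `build_mysql_url` is a hypothetical helper, the credentials are made up, `movies` is a placeholder database name, and port 3306 is only MySQL's conventional default (this notebook uses its own port). No connection is opened here.

```python
from urllib.parse import quote_plus


def build_mysql_url(username, password, host="localhost", port=3306, database=""):
    """Return a SQLAlchemy-style MySQL URL with the credentials percent-encoded."""
    return (
        f"mysql+pymysql://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Made-up credentials: the "@" and "/" in the password must not split the URL.
example_url = build_mysql_url("root", "p@ss/word", database="movies")
print(example_url)
```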

In [None]:
basics = pd.read_csv(basics, sep='\t', low_memory=False)

In [91]:
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [92]:
basics.info  # without parentheses this shows the bound method's repr; basics.info() prints the column summary

<bound method DataFrame.info of             tconst  titleType               primaryTitle  \
0        tt0000001      short                 Carmencita   
1        tt0000002      short     Le clown et ses chiens   
2        tt0000003      short             Pauvre Pierrot   
3        tt0000004      short                Un bon bock   
4        tt0000005      short           Blacksmith Scene   
...            ...        ...                        ...   
9971214  tt9916848  tvEpisode              Episode #3.17   
9971215  tt9916850  tvEpisode              Episode #3.19   
9971216  tt9916852  tvEpisode              Episode #3.20   
9971217  tt9916856      short                   The Wind   
9971218  tt9916880  tvEpisode  Horrid Henry Knows It All   

                     originalTitle isAdult startYear endYear runtimeMinutes  \
0                       Carmencita       0      1894      \N              1   
1           Le clown et ses chiens       0      1892      \N              5   
2         

# Processing

## Filtering/Cleaning Steps

### Title Basics:

In [93]:
# Replace "\N" with np.nan
basics.replace({'\\N':np.nan},inplace=True)

In [94]:
# Eliminate movies that are missing values for runtimeMinutes, genres, startYear
basics = basics.dropna(subset = ['runtimeMinutes', 'genres', 'startYear'])

In [95]:
# Eliminate movies that are null for genre
basics = basics.dropna(subset = ['genres'])

In [96]:
# Keep only rows where titleType == 'movie'
basics = basics.loc[basics['titleType'] == 'movie']

In [97]:
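# NOTE: the intended startYear 2000-2022 filter is not applied in this cell;
# the code below excludes documentaries (later previews still include pre-2000 titles).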
documentary_filter = basics['genres'].str.contains('documentary', case=False)
basics = basics[~documentary_filter]

In [98]:
# Exclude movies whose genres include "Documentary" (case-insensitive match)
is_documentary = basics['genres'].str.contains('documentary',case=False)
basics = basics[~is_documentary]
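
The cleaning steps in this section can be collected into one function. This is a minimal sketch on made-up rows, assuming the 2000-2022 `startYear` window mentioned in the comments is wanted; `clean_basics`, its parameters, and the sample rows are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def clean_basics(df, first_year=2000, last_year=2022):
    """Apply the Title Basics filters in order and return the filtered frame."""
    df = df.replace({"\\N": np.nan})
    # Drop rows missing any of the required fields.
    df = df.dropna(subset=["runtimeMinutes", "genres", "startYear"])
    # Keep full-length movies only.
    df = df[df["titleType"] == "movie"]
    # startYear is read as text, so convert before comparing (bounds are inclusive).
    years = pd.to_numeric(df["startYear"], errors="coerce")
    df = df[years.between(first_year, last_year)]
    # Exclude anything tagged as a documentary.
    is_documentary = df["genres"].str.contains("documentary", case=False)
    return df[~is_documentary]


raw = pd.DataFrame({
    "tconst": ["tt_a", "tt_b", "tt_c", "tt_d", "tt_e", "tt_f"],
    "titleType": ["movie", "movie", "short", "movie", "movie", "movie"],
    "startYear": ["2001", "1994", "2005", "2010", "2015", "2022"],
    "runtimeMinutes": ["90", "100", "10", "80", "\\N", "120"],
    "genres": ["Drama", "Comedy", "Short", "Documentary,History", "Action", "Action,Thriller"],
})

cleaned = clean_basics(raw)
print(cleaned["tconst"].tolist())
```

Only `tt_a` and `tt_f` pass every filter: the other rows fail on year, title type, genre, or missing runtime respectively.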



### AKAs:

In [99]:
akas = pd.read_csv(akas, sep='\t', low_memory=False)

In [100]:
akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0


In [101]:
akas.info  # without parentheses this shows the bound method's repr; akas.info() prints the column summary

<bound method DataFrame.info of             titleId  ordering                      title region language  \
0         tt0000001         1                 Карменсіта     UA       \N   
1         tt0000001         2                 Carmencita     DE       \N   
2         tt0000001         3  Carmencita - spanyol tánc     HU       \N   
3         tt0000001         4                 Καρμενσίτα     GR       \N   
4         tt0000001         5                 Карменсита     RU       \N   
...             ...       ...                        ...    ...      ...   
36415764  tt9916852         5             Episódio #3.20     PT       pt   
36415765  tt9916852         6             Episodio #3.20     IT       it   
36415766  tt9916852         7               एपिसोड #3.20     IN       hi   
36415767  tt9916856         1                   The Wind     DE       \N   
36415768  tt9916856         2                   The Wind     \N       \N   

                types     attributes isOriginalTitle  


In [105]:
# keep only US movies.
akas = akas.loc[akas['region'] =='US']

In [106]:
# Replace "\N" with np.nan (the value must be np.nan itself, not the string 'np.nan',
# and the result must be assigned because replace() is not in place by default)
akas = akas.replace({'\\N': np.nan})
akas

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
5,tt0000001,6,Carmencita,US,,imdbDisplay,,0
14,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0
33,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0
36,tt0000005,1,Blacksmithing Scene,US,,alternative,,0
41,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0
...,...,...,...,...,...,...,...,...
36415295,tt9916560,1,March of Dimes Presents: Once Upon a Dime,US,,imdbDisplay,,0
36415365,tt9916620,1,The Copeland Case,US,,imdbDisplay,,0
36415454,tt9916702,1,Loving London: The Playground,US,,,,0
36415497,tt9916756,1,Pretty Pretty Black Girl,US,,imdbDisplay,,0


### Ratings:

In [107]:
ratings = pd.read_csv(ratings, sep='\t', low_memory=False)

In [108]:
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1982
1,tt0000002,5.8,265
2,tt0000003,6.5,1839
3,tt0000004,5.5,178
4,tt0000005,6.2,2624


In [109]:
ratings.info  # without parentheses this shows the bound method's repr; ratings.info() prints the column summary

<bound method DataFrame.info of             tconst  averageRating  numVotes
0        tt0000001            5.7      1982
1        tt0000002            5.8       265
2        tt0000003            6.5      1839
3        tt0000004            5.5       178
4        tt0000005            6.2      2624
...            ...            ...       ...
1325151  tt9916730            8.3        10
1325152  tt9916766            7.0        21
1325153  tt9916778            7.2        36
1325154  tt9916840            7.5         7
1325155  tt9916880            7.0         7

[1325156 rows x 3 columns]>

In [110]:
# Replace "\N" with np.nan
ratings.replace({'\\N':np.nan},inplace=True)

In [112]:
# Keep only US movies (Use AKAs table, see "Filtering one dataframe based on another" section below)
keepers = basics['tconst'].isin(akas['titleId'])
keepers



8           True
570         True
587         True
672         True
930        False
           ...  
9970892     True
9970901     True
9970940    False
9970985     True
9971069    False
Name: tconst, Length: 286498, dtype: bool

In [113]:
basics = basics[keepers]
basics

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,,45,Romance
570,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906,,70,"Action,Adventure,Biography"
587,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907,,90,Drama
672,tt0000679,movie,The Fairylogue and Radio-Plays,The Fairylogue and Radio-Plays,0,1908,,120,"Adventure,Fantasy"
1273,tt0001285,movie,The Life of Moses,The Life of Moses,0,1909,,50,"Biography,Drama,Family"
...,...,...,...,...,...,...,...,...,...
9970357,tt9914942,movie,Life Without Sara Amat,La vida sense la Sara Amat,0,2019,,74,Drama
9970752,tt9915872,movie,The Last White Witch,My Girlfriend is a Wizard,0,2019,,97,"Comedy,Drama,Fantasy"
9970892,tt9916170,movie,The Rehearsal,O Ensaio,0,2019,,51,Drama
9970901,tt9916190,movie,Safeguard,Safeguard,0,2020,,95,"Action,Adventure,Thriller"
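
The `isin` filter used above keeps each title at most once, even when the AKAs table lists several US rows for it. A small sketch with made-up identifiers (`tt_a`, `tt_b`, `tt_c` are hypothetical) contrasts it with an inner merge, which would repeat rows:

```python
import pandas as pd

basics_demo = pd.DataFrame({
    "tconst": ["tt_a", "tt_b", "tt_c"],
    "primaryTitle": ["First", "Second", "Third"],
})
# tt_c has two US alternative titles; tt_b has none.
akas_demo = pd.DataFrame({
    "titleId": ["tt_a", "tt_c", "tt_c"],
    "region": ["US", "US", "US"],
})

# Boolean mask: True where the title appears anywhere in akas_demo.
keepers_demo = basics_demo["tconst"].isin(akas_demo["titleId"])
filtered = basics_demo[keepers_demo]

# For comparison: an inner merge yields one row per matching AKA row.
merged = basics_demo.merge(akas_demo, left_on="tconst", right_on="titleId")

print(filtered["tconst"].tolist())
print(len(filtered), len(merged))
```

Because the mask is built from `basics_demo` itself, the filtered frame keeps its original columns and row count per title, which is why `isin` is the simpler choice when the second table is only used as a lookup.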


In [114]:
# example making new folder with os
import os
os.makedirs('Data/',exist_ok=True) 
# Confirm folder created
os.listdir("Data/")


[]

In [115]:
## Save current dataframe to file.
basics.to_csv("Data/title_basics.csv.gz",compression='gzip',index=False)



In [116]:
# Open saved file and preview again
basics = pd.read_csv("Data/title_basics.csv.gz", low_memory = False)
basics.head()



Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,,45,Romance
1,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906,,70,"Action,Adventure,Biography"
2,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907,,90,Drama
3,tt0000679,movie,The Fairylogue and Radio-Plays,The Fairylogue and Radio-Plays,0,1908,,120,"Adventure,Fantasy"
4,tt0001285,movie,The Life of Moses,The Life of Moses,0,1909,,50,"Biography,Drama,Family"
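
The save-and-reload step can be checked on a small frame before running it on the full table. A minimal sketch, assuming a temporary directory instead of `Data/`; `demo` and the file name `title_basics_demo.csv.gz` are hypothetical. It also shows that column types are re-inferred on reload, so year columns stored as text come back as numbers.

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({
    "tconst": ["tt0000009", "tt0000574"],
    "startYear": ["1894", "1906"],  # text before saving
    "endYear": [None, None],        # missing values
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "title_basics_demo.csv.gz")
    # index=False avoids writing the row index as an extra unnamed column.
    demo.to_csv(path, compression="gzip", index=False)
    reloaded = pd.read_csv(path, low_memory=False)

print(reloaded.dtypes)
```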
