# Data Dictionary

### Data Location
-The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily.

### IMDb Dataset Details
- Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing or null for that title/name. The available datasets are as follows:

### title.akas.tsv.gz
- titleId (string) - a tconst, an alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- title (string) – the localized title
- region (string) - the region for this version of the title
- language (string) - the language of the title
- types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
- attributes (array) - Additional terms to describe this alternative title, not enumerated
- isOriginalTitle (boolean) – 0: not original title; 1: original title

### title.bas-ics.tsv.gz
- tconst (string) - alphanumeric unique identifier of the title
- titleType (string) – the type/format of the title (e.g. movie, short,  tvseries, tvepisode, video, etc)
- primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
isAdult (boolean) - 0: non-adult title; 1: adult title
- startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
- runtimeMinutes – primary runtime of the title, in minutes
- genres (string array) – includes up to three genres associated with the title

### title.crew.tsv.gz
- tconst (string) - alphanumeric unique identifier of the title
- directors (array of nconsts) - director(s) of the given title
- writers (array of nconsts) – writer(s) of the given title

### title.episode.tsv.gz
- tconst (string) - alphanumeric identifier of episode
- parentTconst (string) - alphanumeric identifier of the parent TV Series
- seasonNumber (integer) – season number the episode belongs to
- episodeNumber (integer) – episode number of the tconst in the TV series

### title.principals.tsv.gz
- tconst (string) - alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- nconst (string) - alphanumeric unique identifier of the name/person
- category (string) - the category of job that person was in
- job (string) - the specific job title if applicable, else '\N'
- characters (string) - the name of the character played if applicable, else '\N'

### title.ratings.tsv.gz
- tconst (string) - alphanumeric unique identifier of the title
- averageRating – weighted average of all the individual user ratings
- numVotes - number of votes the title has received

### name.basics.tsv.gz
- nconst (string) - alphanumeric unique identifier of the name/person
- primaryName (string)– name by which the person is most often credited
- birthYear – in YYYY format
- deathYear – in YYYY format if applicable, else '\N'
- primaryProfession (array of strings)– the top-3 professions of the person
- knownForTitles (array of tconsts) – titles the person is known for

In [109]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [110]:
basics_url1="https://datasets.imdbws.com/title.basics.tsv.gz"
basics_url2="https://datasets.imdbws.com/title.akas.tsv.gz"
basics_url3="https://datasets.imdbws.com/title.ratings.tsv.gz"

In [111]:
df1 = pd.read_csv(basics_url1, sep='\t', low_memory=False)

In [112]:
df1.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [113]:
df1.info

<bound method DataFrame.info of             tconst  titleType               primaryTitle  \
0        tt0000001      short                 Carmencita   
1        tt0000002      short     Le clown et ses chiens   
2        tt0000003      short             Pauvre Pierrot   
3        tt0000004      short                Un bon bock   
4        tt0000005      short           Blacksmith Scene   
...            ...        ...                        ...   
9966181  tt9916848  tvEpisode              Episode #3.17   
9966182  tt9916850  tvEpisode              Episode #3.19   
9966183  tt9916852  tvEpisode              Episode #3.20   
9966184  tt9916856      short                   The Wind   
9966185  tt9916880  tvEpisode  Horrid Henry Knows It All   

                     originalTitle isAdult startYear endYear runtimeMinutes  \
0                       Carmencita       0      1894      \N              1   
1           Le clown et ses chiens       0      1892      \N              5   
2         

# Processing

## Filtering/Cleaning Steps

### Title Basics:

In [114]:
# Replace "\N" with np.nan
df.replace({'\\N':'np.nan'})

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1982
1,tt0000002,5.8,265
2,tt0000003,6.5,1839
3,tt0000004,5.5,178
4,tt0000005,6.2,2625
...,...,...,...
1324468,tt9916730,8.3,10
1324469,tt9916766,7.0,21
1324470,tt9916778,7.2,36
1324471,tt9916840,7.5,7


In [124]:
# Eliminate movies that are null for runtimeMinutes
SELECT * FROM runtimeMinutes
WHERE runtimeMinutes IS NOT NULL;

SyntaxError: invalid syntax (3569774482.py, line 2)

In [125]:
# Eliminate movies that are null for genre


In [126]:
# keep only titleType==Movie


In [118]:
# keep startYear 2000-2022


In [119]:
# Eliminate movies that include "Documentary" in genre (see tip below)
# Exclude movies that are included in the documentary category.
is_documentary = df['genres'].str.contains('documentary',case=False)
df = df[~is_documentary]



KeyError: 'genres'

In [None]:
# Keep only US movies (Use AKAs table, see "Filtering one dataframe based on another" section below)


### AKAs:

In [None]:
# keep only US movies.


In [None]:
# Replace "\N" with np.nan
df.replace({'\\N':'np.nan'})

### Ratings:

In [None]:
# Replace "\N" with np.nan
df.replace({'\\N':'np.nan'})

In [None]:
# Keep only US movies (Use AKAs table, see "Filtering one dataframe based on another" section below)
