# Demonstration of Dataset Loading

Here I first show the naive way of loading the data, followed by the approach actually taken to load each dataset and compile their information into a single dataset.

To do that, we first have to download the datasets from the source.

In [2]:
import warnings, requests, gzip, io, gc
import multiprocessing
warnings.simplefilter('ignore')
import pandas as pd
import numpy as np

filenames = [
    "title.akas.tsv.gz",
    "title.basics.tsv.gz",
    "title.crew.tsv.gz",
    "title.principals.tsv.gz",
    "title.ratings.tsv.gz",
    "name.basics.tsv.gz"
    ]
url = 'https://datasets.imdbws.com/'
chunksize = 10 ** 6

def download_file(filename):
    # Fetch one gzipped TSV from the dataset host and save it in the working directory.
    fetch_url = url + filename
    print("Downloading file " + filename + " in a parallelized manner\n")
    r = requests.get(fetch_url)
    r.raise_for_status()  # stop here rather than save an error page as a dataset
    with open(filename, "wb") as f:
        f.write(r.content)

# Pool() starts one worker per CPU core; each worker downloads one file at a time.
with multiprocessing.Pool() as pool:
    pool.map(download_file, filenames)

Downloading file title.basics.tsv.gz in a parallelized manner

Downloading file title.akas.tsv.gz in a parallelized manner

Downloading file title.crew.tsv.gz in a parallelized manner

Downloading file title.principals.tsv.gz in a parallelized manner

Downloading file title.ratings.tsv.gz in a parallelized manner

Downloading file name.basics.tsv.gz in a parallelized manner


# The wrong approach

The naive approach is to read each dataset straight into memory in full and keep it in its own variable.

Let's do that here.

In [3]:
title_basics = pd.read_csv(
    "title.basics.tsv.gz",
    on_bad_lines='skip',
    usecols=["tconst","titleType","primaryTitle","originalTitle","isAdult","startYear","runtimeMinutes","genres"],
    delimiter="\t")

In [4]:
title_akas = pd.read_csv(
    "title.akas.tsv.gz",
    usecols=["titleId", "region"],
    on_bad_lines='skip',
    delimiter="\t")

In [5]:
title_crew = pd.read_csv("title.crew.tsv.gz",
    on_bad_lines='skip',
    delimiter="\t")

In [6]:
title_principals = pd.read_csv("title.principals.tsv.gz",
    usecols=["tconst", "nconst", "category"],
    on_bad_lines='skip',
    delimiter="\t")

In [None]:
name_basics = pd.read_csv("name.basics.tsv.gz",
    on_bad_lines='skip',
    delimiter="\t")

In [None]:
title_ratings = pd.read_csv("title.ratings.tsv.gz",
    on_bad_lines='skip',
    delimiter="\t")

In [None]:
title_ratings.shape

(1376091, 3)

Looking at the code above, we see that our kernel crashes within the first four or five cells. Each dataset is read into RAM in full and stays there for as long as its variable exists, and datasets of this size exhaust the available memory quickly.
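
To get a feel for how much RAM a loaded table occupies, pandas can report memory per column. The sketch below uses a small synthetic frame shaped like `title.ratings` (the values are made up, not IMDb data), and `memory_report` is a hypothetical helper name. The same call could be run on any of the frames above, assuming they fit in memory at all.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in shaped like title.ratings (made-up values, not IMDb data).
n_rows = 100_000
sample = pd.DataFrame({
    "tconst": [f"tt{i:07d}" for i in range(n_rows)],
    "averageRating": np.full(n_rows, 7.0),
    "numVotes": np.arange(n_rows, dtype="int64"),
})

def memory_report(frame):
    """Per-column memory in megabytes; deep=True also counts string contents."""
    per_column = frame.memory_usage(deep=True, index=False)
    return per_column / 1e6

report = memory_report(sample)
print(report.round(2))
print(f"total: {report.sum():.2f} MB for {n_rows:,} rows")
```

The string identifier column should come out noticeably larger than the numeric columns, which hints at why wide tables full of text fill RAM faster than their row counts alone suggest.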