In [1]:
import os
import torch
import sys
import random
if not os.getcwd().endswith("legal_citation_rec"):
    os.chdir("..")
from utils import init_tokenizer
from config import VOCAB_SIZES, ICLOUD_FP, TEXT_FP



The NVIDIA Tesla T4 GPU provided by Google Colab has 16 GB of memory, and the host machine has about 13 GB of RAM.
In the following cells, I test RAM consumption for different dataset sizes and configurations.

First, let's find out the necessary datatype for our input tensors:

In [2]:
tokenizer, _, _ = init_tokenizer()
print(f"Max token index: {len(tokenizer)}")

Max token index: 30524


The largest token index is below `32,767`, the maximum value of a signed 16-bit integer. (Strictly speaking, `len(tokenizer)` is the vocabulary size, so with zero-based indices the largest index is 30,523.) This is great because it means that we can represent each index with an int16, which halves the memory needed compared to int32.
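
As a quick sanity check of this reasoning, the following plain-Python sketch verifies that the printed vocabulary size fits into a signed 16-bit integer and compares the per-sample footprint of int16 and int32. The sequence length of 256 is taken from the allocation cells further down; the constant names are illustrative, and zero-based contiguous token IDs are an assumption.

```python
INT16_MAX = 2**15 - 1   # 32,767, the largest signed 16-bit value
VOCAB_SIZE = 30524      # len(tokenizer) as printed above
SEQ_LEN = 256           # tokens per sample, as used in the allocation cells

# Assumption: token IDs are zero-based and contiguous.
max_index = VOCAB_SIZE - 1
fits_in_int16 = max_index <= INT16_MAX

bytes_int16 = SEQ_LEN * 2  # two bytes per int16 element
bytes_int32 = SEQ_LEN * 4  # four bytes per int32 element
print(fits_in_int16, bytes_int16, bytes_int32)
```

Per sample, int16 needs half the bytes of int32, and the saving scales linearly with the number of samples.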

To reduce disk I/O, we could store the preprocessed input tokens in a single binary file instead of many small ones. Let's find out if this is feasible.

First, find out how many individual files exist per vocab size:

In [6]:
preprocessed_fp: str = os.path.join(ICLOUD_FP, TEXT_FP)

file_counts = {vsize: 0 for vsize in VOCAB_SIZES}
for key in file_counts.keys():
    file_counts[key] = len(os.listdir(os.path.join(preprocessed_fp, f"vocab_size_{key}")))
    
file_counts

{1431: 317568, 859: 237831, 479: 266993, 105: 13413}

Approximate the average number of samples per file from 500 randomly chosen files for each vocab size:

In [8]:
n = 500
nsamples = {vsize: 0 for vsize in VOCAB_SIZES}
for vsize in VOCAB_SIZES:
    dir_fp: str = os.path.join(preprocessed_fp, f"vocab_size_{vsize}")
    fnames = os.listdir(dir_fp)
    random.shuffle(fnames)
    for f in fnames[:n]:
        t = torch.load(os.path.join(dir_fp, f))
        nsamples[vsize] += len(t)
    nsamples[vsize] //= n  # integer mean over the n sampled files (floors the result)
    
print(f"Average number of samples in a file per vocab: {nsamples}.")
    

Average number of samples in a file per vocab: {1431: 2, 859: 2, 479: 2, 105: 2}.
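
Keep in mind that floor division discards the fractional part of the mean, which matters when the average is this small. A minimal sketch with made-up per-file counts (the `counts` list is hypothetical and not taken from the dataset):

```python
from statistics import mean

counts = [1, 2, 3, 5, 3]  # hypothetical samples-per-file values

floored = sum(counts) // len(counts)  # what floor division reports
exact = mean(counts)                  # true arithmetic mean
print(floored, exact)
```

Multiplying the floored value by several hundred thousand files can therefore underestimate the total sample count noticeably.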


Let's allocate a tensor that includes all data samples for each vocab size:

In [10]:
ram_size_in_gb = dict()
for vsize in VOCAB_SIZES:
    t1 = torch.randint( # inputs
        low=0,
        high=30524,
        size=(
            nsamples[vsize] * file_counts[vsize],
            256,
        ),
        dtype=torch.int16,
    )
    t2 = torch.randint( # labels
        low=0,
        high=30524,
        size=(
            nsamples[vsize] * file_counts[vsize],
            1,
        ),
        dtype=torch.int16,
    )
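    # sys.getsizeof returns a size in bytes, not bits.
    size_in_bytes: int = sys.getsizeof(t1.storage()) + sys.getsizeof(t2.storage())
    # NOTE: bytes * 8e-9 yields gigabits; multiply by 1e-9 instead to get gigabytes.
    ram_size_in_gb[vsize] = size_in_bytes * 8e-9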
    
print(f"Needed RAM in GB: {ram_size_in_gb}")

Needed RAM in GB: {1431: 2.6116800000000002, 859: 1.9559229120000001, 479: 2.1957512, 105: 0.11030928000000001}


Loading the whole dataset into memory at once seems feasible for every vocabulary size!
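
The footprint can also be estimated without allocating anything, since an int16 tensor needs two bytes per element. The sketch below redoes the calculation arithmetically with the file counts and the per-file average printed above; `estimate_gb` is an illustrative helper, not part of the project. Because `sys.getsizeof` reports bytes, this estimate in gigabytes comes out at one eighth of the values printed by the cell above, which multiplies the byte count by `8e-9`.

```python
BYTES_PER_INT16 = 2
SEQ_LEN = 256  # input tokens per sample; the label adds one more column

file_counts = {1431: 317568, 859: 237831, 479: 266993, 105: 13413}
samples_per_file = 2  # floored average measured above


def estimate_gb(n_rows: int) -> float:
    """Return the size in GB (1e9 bytes) of inputs plus labels for n_rows samples."""
    n_elements = n_rows * (SEQ_LEN + 1)
    return n_elements * BYTES_PER_INT16 / 1e9


estimates = {v: estimate_gb(c * samples_per_file) for v, c in file_counts.items()}
print(estimates)
```

The estimate ignores the small fixed overhead of the tensor storage objects, which is negligible at this scale.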

Now that I have merged the data into bigger chunks, it is obvious that my approximations were flawed.
To speed up processing, I have not (yet) created a single large .pt file per vocab size.
Instead, I created files of 20,000 samples each. These will later be concatenated into a single large .pt file.
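
The concatenation step itself is simple. The following is a standard-library sketch of the idea that uses the `array` module and raw int16 files as a stand-in for .pt files; in the notebook the equivalent would presumably be `torch.load` on each chunk, `torch.cat` to join them, and `torch.save` for the result. The helper `merge_chunks` and the file names are hypothetical.

```python
import os
import tempfile
from array import array


def merge_chunks(chunk_paths, out_path):
    """Append the int16 contents of each chunk file to one output file."""
    total = 0
    with open(out_path, "wb") as out:
        for path in chunk_paths:
            chunk = array("h")  # "h" = signed 16-bit integer
            n_items = os.path.getsize(path) // chunk.itemsize
            with open(path, "rb") as f:
                chunk.fromfile(f, n_items)
            chunk.tofile(out)
            total += n_items
    return total


# Demo with three tiny hypothetical chunks of token IDs.
tmp_dir = tempfile.mkdtemp()
chunk_paths = []
for i, tokens in enumerate([[1, 2, 3], [4, 5], [30523]]):
    path = os.path.join(tmp_dir, f"chunk_{i}.bin")
    with open(path, "wb") as f:
        array("h", tokens).tofile(f)
    chunk_paths.append(path)

merged_path = os.path.join(tmp_dir, "merged.bin")
n_merged = merge_chunks(chunk_paths, merged_path)

# Read the merged file back to confirm the order was preserved.
merged = array("h")
with open(merged_path, "rb") as f:
    merged.fromfile(f, n_merged)
print(n_merged, merged.tolist())
```

Processing one chunk at a time keeps peak memory at the size of a single chunk rather than the whole dataset.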

Let's see how many files were created for each vocab size:

In [7]:
all_file_names: list[str] = os.listdir(os.path.join("data","text","preprocessed"))
n: int = 20000
tensor_lengths = dict()

for vsize in VOCAB_SIZES:
    file_names_per_vocab: list[str] = [fname for fname in all_file_names if fname.endswith(f"{vsize}.pt")]
    print(f"Number of context files (= n label files) for vocab size {vsize}: {len(file_names_per_vocab)/2}")
    tensor_lengths[vsize] = (len(file_names_per_vocab)//2)*n
    print(f"Final united tensor length for vocab size {vsize}: {tensor_lengths[vsize]}")
    

Number of context files (= n label files) for vocab size 1431: 74.0
Final united tensor length for vocab size 1431: 1480000
Number of context files (= n label files) for vocab size 859: 40.0
Final united tensor length for vocab size 859: 800000
Number of context files (= n label files) for vocab size 479: 27.0
Final united tensor length for vocab size 479: 540000
Number of context files (= n label files) for vocab size 105: 8.0
Final united tensor length for vocab size 105: 160000


Let's again allocate a tensor that includes all data samples for each vocab size, this time using the actual tensor lengths:

In [8]:
ram_size_in_gb = dict()
for vsize in VOCAB_SIZES:
    t1 = torch.randint( # inputs
        low=0,
        high=30524,
        size=(
            tensor_lengths[vsize],
            256,
        ),
        dtype=torch.int16,
    )
    t2 = torch.randint( # labels
        low=0,
        high=30524,
        size=(
            tensor_lengths[vsize],
            1,
        ),
        dtype=torch.int16,
    )
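    # sys.getsizeof returns a size in bytes, not bits.
    size_in_bytes: int = sys.getsizeof(t1.storage()) + sys.getsizeof(t2.storage())
    # NOTE: bytes * 8e-9 yields gigabits; multiply by 1e-9 instead to get gigabytes.
    ram_size_in_gb[vsize] = size_in_bytes * 8e-9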
    
print(f"Needed RAM in GB: {ram_size_in_gb}")

Needed RAM in GB: {1431: 6.085760768, 859: 3.289600768, 479: 2.2204807680000003, 105: 0.657920768}


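This changes things quite a bit. Going by these numbers, vocab sizes 105 and 479 still seem feasible, while 859 and 1431 probably are not. One caveat: `sys.getsizeof` returns bytes, so multiplying by `8e-9` gives gigabits, and the footprint in gigabytes is one eighth of each printed value.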