# Partitioning the 130k Squad2 Dataset

Due to memory constraints, we're unable to run clustering on the entire 400 GB transformed dataset.

To get a representative subset of the Squad2 examples, we extracted the same number of examples (300 examples, or 43,200 rows) from each of the files output by our pipeline. Because each context typically has several questions associated with it, neighboring examples tend to share a context. Sampling from every file therefore gives a better cross-section of the Squad2 examples than a fully sequential subsample would.
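
To make the coverage argument concrete, the sketch below compares which example indices each strategy touches. It assumes that the file labeled `N` holds examples `N - 5000` through `N - 1` and that the kept examples come from the top of each file; the helper names are hypothetical and not part of the pipeline.

```python
# Assumption: the file labeled N holds examples N-5000 .. N-1, and the
# first PER_FILE examples of each file are the ones kept.
EXAMPLES_PER_FILE = 5000
NUM_FILES = 26
PER_FILE = 300


def per_file_sample(num_files=NUM_FILES, per_file=PER_FILE):
    """Indices kept when taking the first `per_file` examples of every file."""
    kept = []
    for file_idx in range(num_files):
        start = file_idx * EXAMPLES_PER_FILE
        kept.extend(range(start, start + per_file))
    return kept


def sequential_sample(total):
    """Indices kept when taking the first `total` examples overall."""
    return list(range(total))


spread = per_file_sample()
sequential = sequential_sample(len(spread))

print(len(spread), len(sequential))  # both strategies keep the same number of examples
print(max(sequential), max(spread))  # how far into the dataset each strategy reaches
```

Under these assumptions both strategies keep 7,800 examples, but the sequential sample stops at index 7,799 while the per-file sample reaches index 125,299.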

When using Dask, it is recommended that the dataset be split into 1-2 GB sections before loading, so this step also takes care of that. Dask with cuML carries some memory overhead compared with single-GPU cuML, and k-means under Dask and cuML was more memory hungry than DBSCAN. Writing fixed-size partitions also gives finer-grained control over the size of the dataset, so we can cluster as large a portion as our VRAM allows.
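
As a rough sizing check, the per-row footprint can be estimated from the source file figures quoted in this notebook (about 16 GB for 720,000 rows) and used to pick an example count for a target partition size. This is a back-of-the-envelope sketch that assumes rows are roughly uniform in size on disk; `examples_for_target` and `partition_size_gb` are hypothetical helpers.

```python
ROWS_PER_EXAMPLE = 12 * 12                   # 144 rows per Squad2 example
SOURCE_FILE_GB = 16                          # approximate size of one pipeline output file
SOURCE_FILE_ROWS = 5000 * ROWS_PER_EXAMPLE   # 720,000 rows per pipeline output file


def partition_size_gb(examples):
    """Estimated on-disk size of a partition holding `examples` examples."""
    return examples * ROWS_PER_EXAMPLE * SOURCE_FILE_GB / SOURCE_FILE_ROWS


def examples_for_target(target_gb):
    """Largest whole number of examples whose rows fit in `target_gb`."""
    rows = int(target_gb * SOURCE_FILE_ROWS // SOURCE_FILE_GB)
    return rows // ROWS_PER_EXAMPLE


print(partition_size_gb(300))    # estimated size of a 300-example partition, in GB
print(examples_for_target(1.0))  # examples that fit in a 1 GB partition
print(examples_for_target(2.0))  # examples that fit in a 2 GB partition
```

With these figures a 300-example partition comes out just under 1 GB, which leaves some headroom at the low end of the 1-2 GB range.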

This notebook takes Squad2 examples 0-129999, stored in 26 files of 5,000 examples each (16 GB and 720,000 rows per file), and subsamples them into 26 segment files of 300 examples apiece (about 1 GB each). With 144 rows per example, this produces 43,200 rows per segment and 1,123,200 rows in total (one row per attention head), covering 7,800 Squad2 examples drawn from across the entire dataset. Output filenames carry the count value left-padded to 6 digits so that the files load sequentially when read in as a batch.
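
The counts above can be re-derived in a few lines. This is only an arithmetic check using the figures from this notebook; the variable names are illustrative.

```python
ROWS_PER_EXAMPLE = 12 * 12          # 144 rows per Squad2 example
EXAMPLES_PER_SOURCE_FILE = 5000
EXAMPLES_PER_SEGMENT = 300

# Count labels of the pipeline output files: 5000, 10000, ..., 130000
file_labels = list(range(5000, 135000, 5000))
num_files = len(file_labels)

rows_per_source_file = EXAMPLES_PER_SOURCE_FILE * ROWS_PER_EXAMPLE
rows_per_segment = EXAMPLES_PER_SEGMENT * ROWS_PER_EXAMPLE
total_rows = num_files * rows_per_segment
total_examples = num_files * EXAMPLES_PER_SEGMENT

print(f'{num_files} files, {rows_per_source_file:,} rows per source file')
print(f'{rows_per_segment:,} rows per segment, {total_rows:,} rows total')
print(f'{total_examples:,} examples sampled')
```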

In [1]:
import pandas as pd
import os
import time
import multiprocessing as mp
import logging
from functools import partial

logger = mp.log_to_stderr()
logger.setLevel(logging.INFO)

data_dir='/rapids/notebooks/host/representations/final/'

In [2]:
def segment(count, rows):
    """Copy the first `rows` rows of one pipeline output file into a partition file."""
    start_time = time.time()
    logger.info(f'Loading segment of {count} ...')
    df = pd.read_csv(os.path.join(data_dir, f'final_representation_df_{count}.csv'), nrows=rows)
    logger.info(f'Writing segment {count} ...')
    # Zero-pad the count to 6 digits so the partition files sort in numeric order.
    df.to_csv(os.path.join(data_dir, f'partitions/{count:06d}_partition.csv'), index=False)
    logger.info(f'--- Finished writing segment {count} in {time.time() - start_time} seconds ---')
    return count
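
The effect of the 6-digit padding in the output filename can be checked without touching any files. The sketch below compares lexicographic ordering of padded and unpadded names. Whether a particular loader sorts its input paths this way is an assumption to verify for the tool in use.

```python
counts = range(5000, 135000, 5000)

# Names as written by segment(), and the same names without zero padding
padded = [f'{c:06d}_partition.csv' for c in counts]
unpadded = [f'{c}_partition.csv' for c in counts]

print(sorted(padded) == padded)      # does a plain string sort keep numeric order?
print(sorted(unpadded) == unpadded)  # same question for the unpadded names
print(sorted(unpadded)[:3])          # first few unpadded names after sorting
```

With padding, a plain string sort matches the numeric order of the counts. Without it, names such as `100000_partition.csv` sort ahead of `5000_partition.csv`.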

In [3]:
squad2_examples_per_segment = 300
# 144 rows (heads) per Squad2 example * 300 examples = 43,200 rows, about 1 GB per segment
representation_rows = squad2_examples_per_segment * 12 * 12
# File labels run from 5000 to 130000 in steps of 5000 (26 files)
with mp.Pool(12) as p:
    p.map(partial(segment, rows=representation_rows), range(5000, 135000, 5000))

[INFO/ForkPoolWorker-1] child process calling self.run()
[INFO/ForkPoolWorker-2] child process calling self.run()
[INFO/ForkPoolWorker-3] child process calling self.run()
[INFO/ForkPoolWorker-8] child process calling self.run()
[INFO/ForkPoolWorker-9] child process calling self.run()
[INFO/ForkPoolWorker-4] child process calling self.run()
[INFO/ForkPoolWorker-10] child process calling self.run()
[INFO/ForkPoolWorker-11] child process calling self.run()
[INFO/ForkPoolWorker-3] Loading segment of 15000 ...
[INFO/ForkPoolWorker-1] Loading segment of 5000 ...
[INFO/ForkPoolWorker-9] Loading segment of 20000 ...
[INFO/ForkPoolWorker-10] Loading segment of 35000 ...
[INFO/ForkPoolWorker-2] Loading segment of 10000 ...
[INFO/ForkPoolWorker-12] child process calling self.run()
[INFO/ForkPoolWorker-8] Loading segment of 25000 ...
[INFO/ForkPoolWorker-11] Loading segment of 40000 ...
[INFO/ForkPoolWorker-4] Loading segment of 30000 ...
[INFO/ForkPoolWorker-12] Loading segment of 45000 ...
[INFO