# GPU resource optimization

Currently, the longest-running step in the pipeline is calculating the embeddings with HuggingFace transformers and msmarco-distilbert-base-tas-b. I should probably read up on embedding models, since it may be possible to pick a better one. We are only using this one because OpenSearch recommended it for their text embedding ingest pipeline. Since we now calculate the embeddings ourselves, outside of OpenSearch, we can use whatever model we want.

But before we start swapping models, I want to try the embedding run with multiple GPU workers, both on the same GPU and across multiple GPUs.
We know from the OpenSearch indexing notebook that in a single process, an embedding batch size of 1 text chunk is the fastest. This simplifies the batch accumulation and submission somewhat. It also means that the GPU has memory to spare and could fit multiple copies of the model for multiple parallel embedding jobs.

Need to be careful to time only the embedding part of the job submission loop. This is done to match the single process benchmark from the OpenSearch indexing notebook. We also don't want/need to include the spin-up time because during the real run, we will have hours or days of GPU compute time and any initialization overhead will be insignificant in comparison.

## 1. Run set-up

### 1.1. Imports

In [None]:
# Change working directory to parent so we can import as we would from __main__.py
print('Working directory: ', end='')
%cd ..

# Standard imports
import time
import random
import multiprocessing as mp
from multiprocessing import Manager, Process

# PyPI imports
import h5py
import numpy as np
import matplotlib.pyplot as plt

# Internal imports
import configuration as config
import notebooks.notebook_helper as helper_funcs

### 1.2. Notebook parameters

In [2]:
# Run or skip experiments
run_multiworker_benchmark=True
run_multigpu_benchmark=True
run_multigpu_queue_benchmark=True

# Estimated total chunks after semantic splitting, determined in
# semantic splitting notebook
estimated_total_chunks=20648877

# Where to save plots
figure_dir='./notebooks/figures/04-GPU_resource_optimization'

### 1.3. Data loading

In [None]:
# Open a connection to the input data on disk
input_file_path=f'{config.DATA_PATH}/wikipedia-sample/{config.PARSED_TEXT}'
input_data=h5py.File(input_file_path, 'r')

# Ingest all of the parsed sample data so we don't have to mess around
# with accumulating batches for the workers from input batches
records=[]

# Loop on the batches
for batch_num in input_data['batches']:

    # Grab the batch from the hdf5 connection
    batch=input_data[f'batches/{batch_num}']

    # Add each record in the batch to the records list
    for record in batch:
        records.append(record)

# Close the connection to the input data
input_data.close()

print(f'Have {len(records)} records')

## 2. Multiple workers, single GPU

### 2.1. Benchmark specific parameters

In [4]:
# GPU to assign work to
worker_gpu=['cuda:0']

# Numbers of workers per GPU to test
gpu_worker_counts=[1,2,4,8]

# Number of texts to encode for each replicate of each worker count
target_texts=3200

# Number of replicates to run for each worker count
replicates=3

# Build holder for results
results={}

### 2.2. Benchmark

In [None]:
%%time

if run_multiworker_benchmark:

    # Loop on the worker counts
    for n_workers in gpu_worker_counts:
        print(f'Embedding with {n_workers} GPU workers.')
        
        # Add an empty list to collect the results, using the worker count as key
        results[f'{n_workers}']=[]

        # Calculate how many texts we need to send to each worker
        # to get the target number of texts
        texts_per_worker=target_texts // n_workers

        # Build the GPU list for this worker count
        worker_gpus=worker_gpu*n_workers

        for replicate in range(replicates):

            # Generate random batches of texts for each worker
            batches=[]

            for i in range(n_workers):

                batches.append(random.sample(records, texts_per_worker))

            # Send the batches
            mean_embedding_time=helper_funcs.submit_batches(worker_gpus, batches)

            # Calculate the embedding rate
            embedding_rate=target_texts / mean_embedding_time

            # Add it to the results
            results[f'{n_workers}'].append(embedding_rate)

    print()

### 2.3. Results

In [None]:
if run_multiworker_benchmark:

    plt.title('Embedding rate benchmark: workers per single GPU')
    plt.xlabel('GPU workers')
    plt.ylabel('Embedding rate (records per second)')

    standard_deviations=[]
    means=[]

    for n_workers in gpu_worker_counts:
        # The results hold embedding rates (records per second), not times
        rates=results[f'{n_workers}']
        means.append(np.mean(rates))
        standard_deviations.append(np.std(rates))

    plt.errorbar(
        gpu_worker_counts, 
        means, 
        yerr=standard_deviations, 
        linestyle='dotted',
        marker='o', 
        capsize=5
    )

    plt.savefig(f'{figure_dir}/2.3-embedding_rate_workers_per_GPU.jpg')
    plt.show()

    mean_embedding_rate=sum(results['8']) / len(results['8'])
    print(f'Estimated total embedding time: {(estimated_total_chunks / mean_embedding_rate) / (60*60*24):.2f} days with 8 workers')
    print(f'Mean embedding rate: {mean_embedding_rate:.0f} records per second with 8 workers')

OK, so - it *looks* like we are speeding up the embedding by using multiple workers. But, in reality, even the fastest multi-worker embedding rate is slower than the single-process benchmark from the OpenSearch indexing notebook. Probably because of the overhead associated with running multiple workers.
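
For reference, the time estimate printed in the results cell is the estimated chunk count divided by the measured rate, converted from seconds to days. A minimal sketch of that arithmetic follows. The function name `estimate_days` and the rate of 100 records per second are made up for illustration and are not measured results.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def estimate_days(total_chunks: int, records_per_second: float) -> float:
    """Convert an embedding rate into an estimated total run time in days."""
    return (total_chunks / records_per_second) / SECONDS_PER_DAY


# Chunk estimate from the notebook parameters
estimated_total_chunks = 20648877

# Hypothetical rate, for illustration only
example_rate = 100.0

example_days = estimate_days(estimated_total_chunks, example_rate)
print(f'{example_days:.2f} days at {example_rate:.0f} records per second')
```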

Let's try it with one worker on each of the three GPUs in the system.

## 3. Single worker on multiple GPUs

### 3.1. Benchmark

In [None]:
%%time

if run_multigpu_benchmark:
        
    # Build the GPU list
    worker_gpus=['cuda:0','cuda:1','cuda:2']
    n_workers=len(worker_gpus)

    # Holder for results
    results=[]

    # Calculate how many texts we need to send to each worker
    # to get the target number of texts
    texts_per_worker=target_texts // n_workers

    for replicate in range(replicates):

        # Generate random batches of texts for each worker
        batches=[]

        for i in range(n_workers):

            batches.append(random.sample(records, texts_per_worker))

        # Send the batches
        mean_embedding_time=helper_funcs.submit_batches(worker_gpus, batches)

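        # Calculate the embedding rate from the number of texts actually
        # sent, since target_texts does not divide evenly by three workers
        embedding_rate=(texts_per_worker * n_workers) / mean_embedding_time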

        # Add it to the results
        results.append(embedding_rate)

    print()

### 3.2. Results

In [None]:
if run_multigpu_benchmark:
    
    mean_embedding_rate=sum(results) / len(results)
    print(f'Estimated total embedding time: {(estimated_total_chunks / mean_embedding_rate) / (60*60*24):.2f} days')
    print(f'Mean embedding rate: {mean_embedding_rate:.0f} records per second')

OK, that worked somewhat - using all three GPUs, we can cut about half of a day off of the encoding time. However, this approach is not great - the GTX1070 is much faster than the K80s so it sits around waiting for the slower cards to finish each round. This effect could be more pronounced during the real run where each GPU will be embedding hundreds of thousands of articles. To really do this right, we would need a queue system where persistent workers can consume smaller batches as needed until all of the text has been embedded. Keep the GPUs fed!
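
To make the waiting problem concrete, here is a small simulation comparing an equal up-front split against workers that pull workunits from a shared queue. It is only a sketch: the per-GPU rates are invented (one fast card, two slow ones) and are not measurements of the GTX1070 or the K80s, and the function names are hypothetical.

```python
import heapq


def static_split_time(n_texts, rates):
    """Wall time when every worker gets an equal share up front.

    The round is only finished when the slowest worker is done.
    """
    share = n_texts / len(rates)
    return max(share / rate for rate in rates)


def queue_fed_time(n_texts, rates, workunit_size=1):
    """Wall time when each idle worker pulls the next workunit from a queue."""
    # Each heap entry is (time at which the worker becomes free, worker index)
    free_at = [(0.0, index) for index in range(len(rates))]
    heapq.heapify(free_at)

    remaining = n_texts
    finish_time = 0.0

    while remaining > 0:
        # The worker that becomes free first takes the next workunit
        now, index = heapq.heappop(free_at)
        size = min(workunit_size, remaining)
        remaining -= size
        done = now + size / rates[index]
        finish_time = max(finish_time, done)
        heapq.heappush(free_at, (done, index))

    return finish_time


# Hypothetical rates in records per second: one fast GPU and two slow ones
rates = [3.0, 1.0, 1.0]
n_texts = 3000

static_time = static_split_time(n_texts, rates)
queued_time = queue_fed_time(n_texts, rates, workunit_size=10)

print(f'Equal split: {static_time:.0f} s, queue-fed: {queued_time:.0f} s')
```

With these made-up rates, the equal split is limited by the slow cards, while the queue-fed version finishes in roughly the time implied by the combined rate of all three workers.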

## 4. Single queue-fed worker on multiple GPUs

Here is what the set-up looks like.

1. Create reader queue to take jobs from reader process to GPU workers.
2. Reader process generates batches of text and puts them in the queue.
3. GPU workers take jobs from the queue and embed them.

For the real implementation, we would also need a writer queue and process to take embedded texts from the GPU workers and write them to disk - hdf5 doesn't like multiple workers writing to the same file.
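
The layout described above, including the writer, can be sketched as follows. This is a minimal illustration, not the code in `notebook_helper`: threads stand in for the GPU worker processes so that it runs without a GPU, `fake_embed` is a placeholder for the model call, a plain list stands in for the hdf5 file, and the one-sentinel-per-worker shutdown is an assumption about how the workers are told to stop.

```python
import queue
import threading

STOP = None  # sentinel telling a consumer that no more work is coming


def reader(texts, batch_size, job_queue, n_workers):
    """Split the texts into workunits and queue them for the workers."""
    for start in range(0, len(texts), batch_size):
        job_queue.put(texts[start:start + batch_size])

    # One sentinel per worker so that every worker shuts down
    for _ in range(n_workers):
        job_queue.put(STOP)


def fake_embed(text):
    """Placeholder for the real model call: returns the text length."""
    return len(text)


def worker(job_queue, result_queue):
    """Pull workunits until the sentinel arrives, pass results to the writer."""
    while True:
        batch = job_queue.get()
        if batch is STOP:
            break
        result_queue.put([(text, fake_embed(text)) for text in batch])

    # Tell the writer that this worker is finished
    result_queue.put(STOP)


def writer(result_queue, n_workers, output):
    """Single consumer that owns the output, standing in for the hdf5 file."""
    finished_workers = 0
    while finished_workers < n_workers:
        item = result_queue.get()
        if item is STOP:
            finished_workers += 1
        else:
            output.extend(item)


def run_pipeline(texts, batch_size=10, n_workers=3):
    """Wire up reader -> workers -> writer and wait for everything to finish."""
    job_queue = queue.Queue(maxsize=10)
    result_queue = queue.Queue()
    output = []

    threads = [threading.Thread(target=writer, args=(result_queue, n_workers, output))]
    threads += [
        threading.Thread(target=worker, args=(job_queue, result_queue))
        for _ in range(n_workers)
    ]
    threads.append(
        threading.Thread(target=reader, args=(texts, batch_size, job_queue, n_workers))
    )

    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    return output


texts = [f'text chunk {i}' for i in range(95)]
embedded = run_pipeline(texts)
print(f'Wrote {len(embedded)} of {len(texts)} records')
```

Only the writer ever touches the output, which is the point of the extra queue: the workers never need to open the file themselves.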

Timing this for benchmarking is also tricky. We can't time inside the workers like we did for the batch rounds because the workers won't necessarily be in sync. I think the trick is to start the GPU workers first, then start the timer, then start the reader process and call join on the GPU workers. The join blocks until all of the GPU workers return, at which point we stop the timer. Since no work starts until the reader process begins putting batches in the queue, this gives the GPU workers a chance to spin up before we start timing.
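
A sketch of that timing order is shown here. It differs from the benchmark in one respect, which is a suggestion rather than what the notebook does: instead of waiting a fixed number of seconds for spin-up, each worker sets a `threading.Event` when it is ready. As before, threads stand in for GPU processes, the short sleep is a placeholder for model loading, and all function names are hypothetical.

```python
import queue
import threading
import time


def worker(job_queue, ready, counts, index):
    """Spin up, signal readiness, then consume workunits until the sentinel."""
    time.sleep(0.1)  # stand-in for loading the model onto the GPU
    ready.set()      # spin-up is finished, the timer may start

    while True:
        batch = job_queue.get()
        if batch is None:  # sentinel: no more work
            break
        counts[index] += len(batch)  # stand-in for embedding the batch


def reader(texts, batch_size, job_queue, n_workers):
    """Queue the workunits, then one sentinel per worker."""
    for start in range(0, len(texts), batch_size):
        job_queue.put(texts[start:start + batch_size])
    for _ in range(n_workers):
        job_queue.put(None)


def timed_run(texts, batch_size, n_workers):
    """Return (records embedded, seconds elapsed), excluding worker spin-up."""
    job_queue = queue.Queue(maxsize=10)
    ready_flags = [threading.Event() for _ in range(n_workers)]
    counts = [0] * n_workers

    workers = [
        threading.Thread(target=worker, args=(job_queue, ready_flags[i], counts, i))
        for i in range(n_workers)
    ]

    # 1. Start the workers and wait until all of them report ready
    for w in workers:
        w.start()
    for flag in ready_flags:
        flag.wait()

    # 2. Start the timer, then the reader: work only begins now
    start_time = time.perf_counter()
    reader_thread = threading.Thread(
        target=reader, args=(texts, batch_size, job_queue, n_workers)
    )
    reader_thread.start()

    # 3. Block until every worker has seen its sentinel, then stop the timer
    for w in workers:
        w.join()
    elapsed = time.perf_counter() - start_time

    reader_thread.join()
    return sum(counts), elapsed


n_embedded, elapsed = timed_run(['some text'] * 1200, batch_size=10, n_workers=3)
print(f'Embedded {n_embedded} records in {elapsed:.4f} s after spin-up')
```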

### 4.1. Benchmark specific parameters

In [9]:
# Number of texts to encode for each replicate of each batch size
target_texts=12000

# Number of texts to send for one workunit
batch_sizes=[1,10,100,1000]

# Number of replicates to run
replicates=3

# Build holder for results
results={}

# GPU list
worker_gpus=['cuda:0','cuda:1','cuda:2']
n_workers=len(worker_gpus)

### 4.2. Benchmark

In [None]:
%%time

if run_multigpu_queue_benchmark:

    # Loop on batch sizes
    for batch_size in batch_sizes:

        # Add an empty list to collect the results, using the batch size as key
        results[f'{batch_size}']=[]

        # Loop on the replicates
        for replicate in range(replicates):
            print(f'Running replicate {replicate}, batch size {batch_size}')

            # Start multiprocessing manager
            manager=Manager()

            # Set-up reader queue
            reader_queue=manager.Queue(maxsize=10)

            # Set-up the reader process
            reader_process=Process(
                target=helper_funcs.reader,
                args=(records, target_texts, batch_size, reader_queue, n_workers)
            )

            # Start the pool
            pool=mp.Pool(processes=len(worker_gpus))

            # Start each GPU worker
            for gpu in worker_gpus:
                pool.apply_async(helper_funcs.calculate_embeddings_from_queue, (gpu,reader_queue,))

            # Wait for the workers to spin up
            time.sleep(3)

            # Start the timer
            start_time=time.time()

            # Start the reader process
            reader_process.start()

            # Wait for the GPU worker pool to finish
            pool.close()
            pool.join()

            # Stop the timer
            dT=time.time() - start_time

            # Collect the apparent embedding rate
            results[f'{batch_size}'].append(target_texts / dT)
            
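            # Clean up: join the reader first, because calling close()
            # on a process that is still running raises ValueError
            reader_process.join()
            reader_process.close()
            manager.shutdown()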

    print()

### 4.3. Results

In [None]:
if run_multigpu_queue_benchmark:

    plt.title('Embedding rate benchmark: queued workunit size')
    plt.xlabel('Records per workunit')
    plt.ylabel('Rate (records per second)')

    standard_deviations=[]
    means=[]

    for batch_size in batch_sizes:
        # The results hold embedding rates (records per second), not times
        rates=results[f'{batch_size}']
        means.append(np.mean(rates))
        standard_deviations.append(np.std(rates))

    plt.errorbar(
        batch_sizes, 
        means, 
        yerr=standard_deviations, 
        linestyle='dotted',
        marker='o', 
        capsize=5
    )

    plt.xscale('log')
    plt.savefig(f'{figure_dir}/4.3-embedding_rate_workunit_size.jpg')
    plt.show()

    mean_embedding_rate=sum(results['1']) / len(results['1'])
    print(f'Estimated total embedding time: {(estimated_total_chunks / mean_embedding_rate) / (60*60*24):.2f} days with workunit size 1.')
    print(f'Mean embedding rate: {mean_embedding_rate:.0f} records per second with workunit size 1.')

Nice! That worked great - two hours of work just saved us a day and a half of embedding time over the first single-process version of this. I think we have squeezed out all of the performance we are going to get. Time to move on.

**One final note**: we should probably be using this queue pattern for everything we do with multiprocessing. Workers sitting around while the pool.join() or results.get() call waits for everyone to finish kills the run time. And stopping and restarting workers is obviously not good.