There is one main quest and several side-quests after that. To get the full grade, you need to attain 10 or more points in whichever way you want.

### Part 1: Memory-efficient training (5 pts)

![img](https://steamuserimages-a.akamaihd.net/ugc/280721626864094662/C48355EE16889197B8D000A198F970CD7E64CB7A/?imw=512&imh=342&ima=fit&impolicy=Letterbox&imcolor=%23000000&letterbox=true)

__Your quest__ is to fine-tune GPT-2-Large while fitting it into 11GiB of GPU memory (as in a 1080Ti or 2080Ti). We deliberately limit GPU memory below and recommend that you check the peak memory usage via [`torch.cuda.max_memory_allocated()`](https://pytorch.org/docs/stable/generated/torch.cuda.max_memory_allocated.html).



In [2]:
import torch
max_memory_gib = torch.cuda.get_device_properties('cuda').total_memory / 2 ** 30
torch.cuda.set_per_process_memory_fraction(min(1.0, 11 / max_memory_gib))
print(f"Setting memory limit to {min(1.0, 11 / max_memory_gib) * 100:.2f}%")

Setting memory limit to 98.45%
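
If it helps, here is a minimal sketch of how the peak-memory check could be wrapped around any piece of code. The helper name `measure_peak_memory` is hypothetical (it is not part of PyTorch), and the sketch assumes it is fine to report 0 when no GPU is present.

```python
import torch


def measure_peak_memory(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, peak allocated GPU memory in GiB).

    Hypothetical helper: the peak includes tensors that were already on the GPU
    before the call, because resetting the statistics restarts the peak from the
    currently allocated amount.
    """
    if not torch.cuda.is_available():
        # assumption: without a GPU there is nothing to measure, report 0.0
        return fn(*args, **kwargs), 0.0
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    result = fn(*args, **kwargs)
    torch.cuda.synchronize()
    return result, torch.cuda.max_memory_allocated() / 2 ** 30


result, peak_gib = measure_peak_memory(lambda: torch.ones(1024, 1024).sum().item())
print(f"result={result}, peak memory: {peak_gib:.3f} GiB")
```

Wrapping one full training step this way gives a single number to compare against the 11GiB limit.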


GPT-2 is a [popular language model by OpenAI](https://openai.com/blog/better-language-models/). We're gonna use [huggingface transformers](https://huggingface.co/docs/transformers/index) as a shortcut to access this model. Here's how it works:

In [3]:
# dependencies: !pip install transformers==4.16 datasets==1.18.3
import transformers
model_name = 'gpt2-large'
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.GPT2LMHeadModel.from_pretrained(model_name)

In [4]:
# here's how the model works: tokenizer converts raw data to pytorch tensors
batch = tokenizer(["A cat sat", "import numpy"], return_tensors='pt')
print("Batch:", repr(batch)[:70].replace('\n', ' '), ' ...')


Batch: {'input_ids': tensor([[   32,  3797,  3332],         [11748,   299, 32  ...


In [5]:
# model turns those tensors into logits (pre-softmax activations)
pred = model(**batch)
print("Logits shape:", pred.logits.shape, " -- [batch size, sequence_length, vocab size]")

# a model is also a standard PyTorch module that has .parameters, can be sent .to(device), etc
print(f"Parameters: {sum(param.numel() for param in model.parameters())/1e6:0.2f} million")

Logits shape: torch.Size([2, 3, 50257])  -- [batch size, sequence_length, vocab size]
Parameters: 774.03 million


In [6]:
# fun fact: you can use the model to generate text given prefix
generated_ids = model.generate(**batch, max_length=16)
print("Sample A:", tokenizer.decode(generated_ids[0]))
print("Sample B:", tokenizer.decode(generated_ids[1]))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample A: A cat sat on the edge of the bed, staring at the ceiling.

Sample B: import numpy as np

from scipy.stats import mean,


Here's some sample data you can use for prototyping -- and for demonstrating that your algorithm works.

In [7]:
from datasets import load_dataset

data = load_dataset("wikitext", "wikitext-2-v1")['train']
tokenizer.pad_token = tokenizer.eos_token
test_batch = tokenizer(sorted(data['text'], key=len)[-64:], max_length=1024, padding=True, pad_to_multiple_of=256, return_tensors='pt')

# if you want actual training, consider this dataset: https://huggingface.co/datasets/transformersbook/codeparrot

Reusing dataset wikitext (/root/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)


  0%|          | 0/3 [00:00<?, ?it/s]

  "`max_length` is ignored when `padding`=`True` and there is no truncation strategy. "


### Victory conditions

- __(1/5 points)__ it trains in some shape or form, regardless of batch size and optimizer
- __(2/5 points)__ it trains with sequence length 1024, batch size 64, with any optimizer (e.g. SGD)
- __(3/5 points)__ same, but using the Adam optimizer, and the training loss is demonstrated to decrease on a fixed batch
- __(4/5 points)__ most of the computation (flops) should be performed on GPU, and training should be at least 0.5x as fast as simply running forward/backward on GPU (in terms of sequences per second)
- __(5/5 points)__ __master parameters should be kept in full float32 precision__.
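
To see why the last three conditions are the hard ones, a back-of-the-envelope estimate may help. The sketch below assumes the naive setup: parameters, gradients and both Adam statistics are all stored in float32 on the GPU, and activations are ignored entirely. The function name is made up for illustration.

```python
GIB = 2 ** 30
NUM_PARAMS = 774.03e6  # GPT-2-Large, from the parameter count printed above


def naive_adam_memory_gib(num_params, bytes_per_value=4):
    """Memory for fp32 weights, fp32 gradients and two fp32 Adam statistics."""
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    adam_statistics = 2 * num_params * bytes_per_value  # first and second moments
    return {
        "weights": weights / GIB,
        "gradients": gradients / GIB,
        "adam_statistics": adam_statistics / GIB,
        "total": (weights + gradients + adam_statistics) / GIB,
    }


estimate = naive_adam_memory_gib(NUM_PARAMS)
for name, gib in estimate.items():
    print(f"{name:>16}: {gib:.2f} GiB")
```

Under these assumptions the total already exceeds 11GiB before a single activation is stored, which is why the options below go beyond just shrinking the batch.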


__How do I do that?__ There's a bunch of things you could try: you can pick and choose which of them you employ. Below are some (but not all) options:
- __Gradient accumulation / micro-batching:__ you probably can't process 64 sequences at once -- but what if you accumulate gradients over several forward/backward passes with a smaller batch size?
- __Gradient checkpointing:__ you can further reduce activation memory by not storing intermediate activations. You can learn how to use the built-in checkpointing [from the transformers docs](https://huggingface.co/docs/transformers/main_classes/model) or build your own using [vanilla checkpointing](https://pytorch.org/docs/stable/checkpoint.html).
- __Mixed precision:__ you can use [pytorch AMP](https://pytorch.org/docs/stable/amp.html) to cast activations to 16-bit. If that is not enough, you can cast (some) parameters to 16-bit as well. Victory conditions require that the optimized parameters must stay 32-bit, but no one said you must keep GPU parameters that way.
- __Offloading:__ any tensor you don't need right now can be offloaded from GPU to RAM. This is especially true for optimizer statistics.
- __Quantization:__ if everything else fails, you could reduce memory usage further via 8-bit quantization, [e.g. for Adam](https://github.com/facebookresearch/bitsandbytes). However, please make sure the loss still goes down after you're done blaspheming.
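
As an illustration of the first two options, here is a minimal sketch on a toy model (not GPT-2). `TinyBlockStack` and `accumulate_gradients` are made-up names, the toy uses a sum-reduced MSE loss so that micro-batch gradients add up to the full-batch gradient, and `use_reentrant=False` assumes a reasonably recent PyTorch. It runs on CPU and checks that micro-batching plus checkpointing gives the same gradients as one big batch.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)


class TinyBlockStack(nn.Module):
    """Hypothetical stand-in for a transformer: a stack of small MLP blocks."""

    def __init__(self, dim=16, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in range(depth)]
        )
        self.head = nn.Linear(dim, 1)
        self.use_checkpoint = False

    def forward(self, x):
        for block in self.blocks:
            if self.use_checkpoint:
                # activations inside `block` are dropped and recomputed in backward
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return self.head(x)


def accumulate_gradients(model, inputs, targets, micro_batch_size):
    """Forward/backward over micro-batches; gradients add up in param.grad."""
    model.zero_grad()
    total = inputs.shape[0]
    for start in range(0, total, micro_batch_size):
        x = inputs[start:start + micro_batch_size]
        y = targets[start:start + micro_batch_size]
        # sum over the micro-batch, divide by the FULL batch size,
        # so the accumulated result equals the mean loss over the full batch
        loss = nn.functional.mse_loss(model(x), y, reduction="sum") / total
        loss.backward()
    return [p.grad.clone() for p in model.parameters()]


model = TinyBlockStack()
inputs, targets = torch.randn(8, 16), torch.randn(8, 1)

reference = accumulate_gradients(model, inputs, targets, micro_batch_size=8)
model.use_checkpoint = True
accumulated = accumulate_gradients(model, inputs, targets, micro_batch_size=2)

max_diff = max((a - b).abs().max().item() for a, b in zip(reference, accumulated))
print(f"max gradient difference: {max_diff:.2e}")
```

The optimizer step would go after the accumulation loop, once per full batch of 64 sequences.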



__Off limits:__
- using more than 11GiB of GPU memory at any point is forbidden (check with [`torch.cuda.max_memory_allocated()`](https://pytorch.org/docs/stable/generated/torch.cuda.max_memory_allocated.html))
- using non-Adam optimizer (e.g. SGD) will only count for 2/5 points. __Adafactor is not Adam__. But FusedAdam or quantized Adam are both legit -- if you can show that the loss goes down;
- finetuning must update all model parameters on every batch; fine-tuning only a subset of parameters will not get a full grade;
- changing the model structure - your code must run the same computation as the original GPT-2-large up to floating-point precision. Removing layers can be useful in practice, but it's off limits for this task.
- deepspeed and fairscale. There's a separate ~~cauldron~~ assignment for that.

In [None]:
# if it helps, <YOUR CODE HERE>

Additional assignments: pick at least one for full grade, or more if you're into it.

### Pipeline parallelism (3+ points)

__TL;DR__ implement pipeline parallelism using [`torch.distributed`](https://pytorch.org/docs/stable/distributed.html) or [`torch.rpc`](https://pytorch.org/docs/stable/rpc.html), your choice.

![img](https://i.imgur.com/Va7GN6x.png)

Please start from [PyTorch CIFAR](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html) starter kit. We will not judge if you choose something heavier, e.g. ImageNet -- but that is not necessary.
For this assignment, you will need to implement pipeline-parallel training similar to the one defined in [GPipe](https://arxiv.org/abs/1811.06965) (or lecture 4 notes). Feel free to use CPU or share one GPU for multiple processes. The main victory condition is that the model must show the same convergence rate as training on a single GPU -- which you will need to demonstrate.

- __(1/3 points)__ doesn't have to be parallel, works with at least 2 stages
- __(2/3 points)__ same parallelism as in GPipe, 4 or more stages, loss is shown to go down
- __(3/3 points)__ achieve the same convergence curves on CIFAR (or the task of your choice) as training with a single process
- __(+1 bonus)__ demonstrate that your implementation can process a very large number of micro-batches (at least 1000 microbatches with 4 stages) without going OOM
- __(+1 bonus)__ implement [pipedream](https://arxiv.org/pdf/1806.03377.pdf) or [interleaved pipeline](https://openreview.net/pdf?id=cw-EmNq5zfD) (this may be hard!)

__Conditions:__ your implementation should use multiprocessing (i.e. not a single process sending data between devices). Existing pipelines (e.g. `torch.distributed.pipeline`) are off limits. If your code resembles one of the existing pipelines too much, we will brutally interrogate you about the implementation details (as in "we'll make sure it's easier to build your own").
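
Before writing any communication code, it may help to see which stage works on which micro-batch at each clock tick in a GPipe-style forward pass. The sketch below is plain Python with made-up function names. It only models the schedule, assuming every stage takes one tick per micro-batch, and does not send any tensors.

```python
def gpipe_forward_schedule(num_stages, num_microbatches):
    """For every clock tick, list the (stage, microbatch) pairs that run in parallel."""
    schedule = []
    for tick in range(num_stages + num_microbatches - 1):
        active = [
            (stage, tick - stage)  # stage s sees micro-batch m at tick s + m
            for stage in range(num_stages)
            if 0 <= tick - stage < num_microbatches
        ]
        schedule.append(active)
    return schedule


def bubble_fraction(num_stages, num_microbatches):
    """Fraction of (stage, tick) slots that stay idle during the forward pass."""
    ticks = num_stages + num_microbatches - 1
    return 1 - num_microbatches / ticks


for tick, active in enumerate(gpipe_forward_schedule(num_stages=4, num_microbatches=2)):
    print(f"tick {tick}: {active}")
print(f"idle fraction with 4 stages, 32 micro-batches: {bubble_fraction(4, 32):.3f}")
```

The idle fraction shrinks as the number of micro-batches grows, which is one reason the bonus about processing many micro-batches is worth attempting.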

### Tensor Parallelism (3+ points)
_aka AlexNet-style parallelism_

__TL;DR__ implement intra-layer model parallelism and make sure it works.

Similarly to the pipeline task, we recommend that you start with the [PyTorch CIFAR](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html) tutorial, but you can choose to use a more complex starter kit if you're feeling adventurous. If you don't have stable access to a multi-GPU setup, feel free to use CPU or share the same GPU across all processes.

The main objective is to implement AlexNet-style model parallelism, wherein each device computes a subset of "neurons" in each layer and then exchanges results with other units. We recommend doing this with [`torch.distributed`](https://pytorch.org/docs/stable/distributed.html) with either `all_reduce` or `all_gather`, depending on whether you split matrix rows or columns.

- __(1/3 points)__ a simple architecture, `Sequential(linear1, relu, linear2)`,
- __(2/3 points)__ a more complex architecture, add at least one batch- or layer normalization
- __(3/3 points)__ train a model like this and show learning curves,
- __(+1 bonus)__ parallelize either ResNet50 or some Transformer variant (__warning__, this is hard!),
- __(+1 bonus)__ implement mixed column + row parallelism, run 2 dense layers with a single communication round (see below).

For both options, please attach the learning curves of your model compared to regular single-process training. A decent explanation of how this type of parallelism works can be found in Section 2.3 of [this paper](https://arxiv.org/pdf/2104.04473.pdf). Optimizations such as Figure 5 from that paper or this [weird trick](https://arxiv.org/abs/1404.5997) are welcome, but not required.
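
To make the math behind the split concrete, here is a single-process simulation for `Sequential(linear1, relu, linear2)`. It is only a sketch: the "devices" are loop iterations, the shapes and `world_size` are arbitrary, and the final sum stands in for what would be an `all_reduce` in a real `torch.distributed` implementation.

```python
import torch

torch.manual_seed(0)
world_size = 2  # number of simulated devices (assumption for this sketch)

x = torch.randn(5, 8)
w1, b1 = torch.randn(16, 8), torch.randn(16)  # linear1: 8 -> 16, weight is [out, in]
w2, b2 = torch.randn(4, 16), torch.randn(4)   # linear2: 16 -> 4

# single-device reference
reference = torch.relu(x @ w1.T + b1) @ w2.T + b2

# linear1 is split by output neurons, linear2 by the matching input features
w1_shards = w1.chunk(world_size, dim=0)
b1_shards = b1.chunk(world_size, dim=0)
w2_shards = w2.chunk(world_size, dim=1)

partial_outputs = []
for rank in range(world_size):
    # each rank computes only its own slice of hidden neurons; relu is elementwise,
    # so no communication is needed between the two layers
    hidden = torch.relu(x @ w1_shards[rank].T + b1_shards[rank])
    partial_outputs.append(hidden @ w2_shards[rank].T)

# in a multi-process run this sum would be a single all_reduce
output = sum(partial_outputs) + b2

print("max abs difference:", (output - reference).abs().max().item())
```

The bias of the second layer is added once after the reduction; adding it on every rank would count it `world_size` times.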


### Bigger and badder (3+ points)

__TL;DR__ repeat the first quest but for [GPT2-XL](https://huggingface.co/gpt2-xl) (~1.5B parameters) instead of GPT2-Large.

![img](https://i.imgflip.com/48jwa8.png)

- __(1/3 points)__ - do that with 16GiB GPU memory. If you can't get one through conventional means, try [free kaggle P100 quotas](https://www.kaggle.com/product-feedback/83643)
- __(2/3 points)__ - the training should be at least 0.5x as fast as running forward/backward on GPU
- __(3/3 points)__ - back to 11GiB memory and no more than 16GiB RAM
- __(+1 bonus)__ - actually fine-tune the model on something (e.g. code, [codeparrot](https://huggingface.co/datasets/transformersbook/codeparrot)) and compare its generated outputs

You will likely need to add more memory-saving tricks on top of what you used for GPT2-Large. It is okay to keep some parameters in fp16.
Alternatively, __Deepspeed and fairscale are no longer forbidden__, but beware that they are not a free win -- you will have to tune the configuration for your specific setup.
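
One possible reading of "keep some parameters in fp16" is sketched below on a toy linear layer: the float32 master weights and Adam statistics stay on CPU, while the compute device only holds a low-precision working copy. This is an assumption-laden illustration, not a reference solution. It falls back to bfloat16 on CPU when no GPU is available, omits loss scaling (which real fp16 training typically needs), and all names are made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
# assumption: fp16 on GPU, bfloat16 as a CPU-friendly fallback for this demo
low_precision = torch.float16 if device == "cuda" else torch.bfloat16

# float32 master weights and Adam statistics live in CPU RAM
master = nn.Linear(32, 32)
optimizer = torch.optim.Adam(master.parameters(), lr=1e-2)

# low-precision working copy on the compute device
worker = nn.Linear(32, 32).to(device=device, dtype=low_precision)


def sync_worker_from_master():
    """Overwrite the working copy with the current master weights."""
    with torch.no_grad():
        for w, m in zip(worker.parameters(), master.parameters()):
            w.copy_(m)  # copy_ converts dtype and device


x = torch.randn(64, 32, device=device, dtype=low_precision)
y = torch.randn(64, 32, device=device, dtype=low_precision)

losses = []
for step in range(20):
    sync_worker_from_master()
    worker.zero_grad()
    loss = (worker(x) - y).float().pow(2).mean()  # fixed batch, loss in fp32
    loss.backward()
    # hand the gradients to the fp32 master copy, then step on CPU
    for w, m in zip(worker.parameters(), master.parameters()):
        m.grad = w.grad.detach().to(device="cpu", dtype=torch.float32)
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

For a real model the copies would need to be done layer by layer (or overlapped with compute) to stay within the speed requirement.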

__Variant:__ if you somehow got your hands on a powerful GPU, we will also accept (A) [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) for 24-32GiB GPUs or multiple smaller ones, or (B) [T5-11B](https://huggingface.co/t5-11b) on any GPUs short of MI250X. Note that you still need to train the entire model: we will not accept solutions based on Denis' [8-bit + LoRA trick](https://huggingface.co/hivemind/gpt-j-6B-8bit).



In [None]:
# <you guessed it...>