## Train a large model on a single GPU

In this section, we will practice strategies for training a large model on a single GPU. After completing this section, you should understand the effect of

-   batch size
-   gradient accumulation
-   reduced precision/mixed precision
-   parameter-efficient fine-tuning

on a large model training job.

This notebook will be executed inside a Jupyter interface **hosted on a GPU server instance on Chameleon**, NOT in the Chameleon Jupyter interface from which we launch experiments (provision servers, etc.).

### Open the notebook on Colab

We should have already started a notebook server in a container on a Chameleon GPU host, and set up an SSH tunnel to this notebook server. Now, we will open this notebook in Google Colab and connect it to the runtime that you have in Chameleon. This is a convenient way to work, because the notebook and its outputs will be saved automatically in your Google Drive.

-   Use this button to open the notebook in Colab: <a target="_blank" href="https://colab.research.google.com/github/teaching-on-testbeds/llm-chi/blob/main/workspace/2_single_gpu_a100.ipynb"> <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> </a>
-   Click “File \> Save a Copy in Drive” to save it in your own Google Drive. Work in your copy, so that the outputs will be saved automatically.
-   Next to the “Connect” button in the top right, there is a ▼ symbol. Click on this symbol to expand the menu, and choose “Connect to a local runtime”.
-   Paste the `http://127.0.0.1:8888/lab?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX` you copied earlier into this space, and choose “Connect”.

**Alternatively, if you prefer not to use Colab** (or can’t, for some reason): just put the `http://127.0.0.1:8888/lab?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX` URL you copied earlier into your browser to open the Jupyter interface directly. But, then you’ll have to open a terminal in that Jupyter interface and run

    wget https://raw.githubusercontent.com/teaching-on-testbeds/llm-chi/refs/heads/main/workspace/2_single_gpu_a100.ipynb

to get a copy of this notebook in that workspace.

Make sure that you can see the GPUs:

In [1]:
!nvidia-smi

Tue Feb 18 06:52:00 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100 80GB PCIe          Off |   00000000:25:00.0 Off |                    0 |
| N/A   48C    P0             51W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                         

### Prepare LitGPT

For this tutorial, we will fine-tune a [TinyLlama](https://arxiv.org/abs/2401.02385) or [OpenLLaMA](https://github.com/openlm-research/open_llama) large language model using [`litgpt`](https://github.com/Lightning-AI/litgpt). LitGPT is a convenient wrapper around many PyTorch Lightning capabilities that makes it easy to fine-tune an LLM on a GPU using a “recipe” defined in a YAML file. (We’ll also try the Python API for LitGPT in the “Multiple GPU” section of this tutorial.)

You may browse the “recipes” for this experiment [in our GitHub repository](https://github.com/teaching-on-testbeds/llm-chi/tree/main/workspace/config).

Our focus will be exclusively on comparing the time and memory requirements of training jobs under different settings - we will completely ignore the loss of the fine-tuned model, and we will make some choices to reduce the overall time of our experiment (to fit in a short Chameleon lease) that wouldn’t make sense if we really needed the fine-tuned model (e.g. using a very small fraction of the training data).

First, install LitGPT:

In [2]:
!pip install 'litgpt[all]'==0.5.7 'lightning<2.5.0.post0'

Collecting litgpt==0.5.7 (from litgpt[all]==0.5.7)
  Downloading litgpt-0.5.7-py3-none-any.whl.metadata (43 kB)
Collecting lightning<2.5.0.post0
  Downloading lightning-2.5.0-py3-none-any.whl.metadata (40 kB)
Collecting numpy<2.0 (from litgpt==0.5.7->litgpt[all]==0.5.7)
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting jsonargparse<=4.32.1,>=4.30.1 (from jsonargparse[signatures]<=4.32.1,>=4.30.1->litgpt==0.5.7->litgpt[all]==0.5.7)
  Downloading jsonargparse-4.32.1-py3-none-any.whl.metadata (12 kB)
Collecting huggingface_hub>=0.23.5 (from litgpt==0.5.7->litgpt[all]==0.5.7)
  Downloading huggingface_hub-0.28.1-py3-none-any.whl.metadata (13 kB)
Collecting safetensors>=0.4.3 (from litgpt==0.5.7->litgpt[all]==0.5.7)
  Downloading safetensors-0.5.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Collecting tokenizers>=0.15.2 (from litgpt==0.5.7->litgpt[all]==0.5.7)
  Downloading tokenizers-0.21.0-cp39-

Then, download the foundation models:

In [3]:
!litgpt download TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T

Setting HF_HUB_ENABLE_HF_TRANSFER=1
config.json: 100%|█████████████████████████████| 560/560 [00:00<00:00, 5.67MB/s]
generation_config.json: 100%|██████████████████| 129/129 [00:00<00:00, 1.07MB/s]
pytorch_model.bin: 100%|███████████████████▉| 4.40G/4.40G [00:04<00:00, 957MB/s]
tokenizer.json: 100%|██████████████████████| 1.84M/1.84M [00:00<00:00, 12.5MB/s]
tokenizer.model: 100%|███████████████████████| 500k/500k [00:00<00:00, 47.7MB/s]
tokenizer_config.json: 100%|███████████████████| 776/776 [00:00<00:00, 5.90MB/s]
Converting checkpoint files to LitGPT format.
{'checkpoint_dir': PosixPath('checkpoints/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T'),
 'debug_mode': False,
 'dtype': None,
 'model_name': None}
Loading weights: pytorch_model.bin: 100%|███████████████| 00:07<00:00, 13.98it/s
Saving converted checkpoint to checkpoints/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T


In [4]:
!litgpt download openlm-research/open_llama_3b

Setting HF_HUB_ENABLE_HF_TRANSFER=1
config.json: 100%|█████████████████████████████| 506/506 [00:00<00:00, 6.63MB/s]
generation_config.json: 100%|██████████████████| 137/137 [00:00<00:00, 1.07MB/s]
pytorch_model.bin: 100%|██████████████████▉| 6.85G/6.85G [00:06<00:00, 1.03GB/s]
tokenizer.model: 100%|███████████████████████| 534k/534k [00:00<00:00, 47.5MB/s]
tokenizer_config.json: 100%|███████████████████| 593/593 [00:00<00:00, 4.78MB/s]
Converting checkpoint files to LitGPT format.
{'checkpoint_dir': PosixPath('checkpoints/openlm-research/open_llama_3b'),
 'debug_mode': False,
 'dtype': None,
 'model_name': None}
Loading weights: pytorch_model.bin: 100%|███████████████| 00:09<00:00, 10.61it/s
Saving converted checkpoint to checkpoints/openlm-research/open_llama_3b


In [5]:
!litgpt download openlm-research/open_llama_7b

Setting HF_HUB_ENABLE_HF_TRANSFER=1
config.json: 100%|█████████████████████████████| 507/507 [00:00<00:00, 2.91MB/s]
generation_config.json: 100%|██████████████████| 137/137 [00:00<00:00, 1.40MB/s]
pytorch_model-00001-of-00002.bin: 100%|███▉| 9.98G/9.98G [00:09<00:00, 1.05GB/s]
pytorch_model-00002-of-00002.bin: 100%|████▉| 3.50G/3.50G [00:04<00:00, 859MB/s]
pytorch_model.bin.index.json: 100%|████████| 26.8k/26.8k [00:00<00:00, 67.2MB/s]
tokenizer.model: 100%|███████████████████████| 534k/534k [00:00<00:00, 45.3MB/s]
tokenizer_config.json: 100%|███████████████████| 593/593 [00:00<00:00, 4.57MB/s]
Converting checkpoint files to LitGPT format.
{'checkpoint_dir': PosixPath('checkpoints/openlm-research/open_llama_7b'),
 'debug_mode': False,
 'dtype': None,
 'model_name': None}
Loading weights: pytorch_model-00002-of-00002.bin: 100%|█| 00:18<00:00,  5.46it/
Saving converted checkpoint to checkpoints/openlm-research/open_llama_7b


In [6]:
!litgpt download openlm-research/open_llama_13b

Setting HF_HUB_ENABLE_HF_TRANSFER=1
config.json: 100%|█████████████████████████████| 507/507 [00:00<00:00, 2.60MB/s]
generation_config.json: 100%|███████████████████| 137/137 [00:00<00:00, 851kB/s]
pytorch_model-00001-of-00003.bin: 100%|███▉| 9.95G/9.95G [00:09<00:00, 1.07GB/s]
pytorch_model-00002-of-00003.bin: 100%|███▉| 9.90G/9.90G [00:09<00:00, 1.06GB/s]
pytorch_model-00003-of-00003.bin: 100%|████▉| 6.18G/6.18G [00:06<00:00, 993MB/s]
pytorch_model.bin.index.json: 100%|█████████| 33.4k/33.4k [00:00<00:00, 103MB/s]
tokenizer.model: 100%|███████████████████████| 534k/534k [00:00<00:00, 47.4MB/s]
tokenizer_config.json: 100%|███████████████████| 593/593 [00:00<00:00, 6.31MB/s]
Converting checkpoint files to LitGPT format.
{'checkpoint_dir': PosixPath('checkpoints/openlm-research/open_llama_13b'),
 'debug_mode': False,
 'dtype': None,
 'model_name': None}
Loading weights: pytorch_model-00003-of-00003.bin: 100%|█| 00:35<00:00,  2.79it/
Saving converted checkpoint to checkpoints/openlm-rese

Also, get the “recipes” that we will use for LLM fine-tuning. Using the file browser on the left side, look at the contents of the “config” directory.

In [7]:
!git clone https://github.com/teaching-on-testbeds/llm-chi/
!mv llm-chi/workspace/config .

Cloning into 'llm-chi'...
remote: Enumerating objects: 122, done.
remote: Counting objects: 100% (122/122), done.
remote: Compressing objects: 100% (88/88), done.
remote: Total 122 (delta 76), reused 71 (delta 32), pack-reused 0 (from 0)
Receiving objects: 100% (122/122), 42.65 KiB | 2.67 MiB/s, done.
Resolving deltas: 100% (76/76), done.


### Experiment: Baseline

As a baseline, let’s try an epoch of fine-tuning the TinyLlama-1.1B, using full precision and a batch size of 32:

In [8]:
!litgpt finetune_full --config config/tiny-llama-full.yaml --train.global_batch_size 32 --train.micro_batch_size 32

{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.5,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x7bf454d93f80>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=False,
                  evaluate_example='first'),
 'logger_name': 'csv',
 'num_nodes': 1,
 'optimizer': {'class_path': 'torch.optim.AdamW',
               'init_args': {'betas': [0.9, 0.95],
                             'lr': 0.0002,
                             'weight_decay': 0.0}},
 'out_dir': PosixPath('out/finetune/full-tiny-llama-1.1b'

This will fail because the training job won’t fit in our 80GB GPU memory.

### Experiment: Reduced batch size

But with a smaller batch size, it fits easily:

In [9]:
!litgpt finetune_full --config config/tiny-llama-full.yaml --train.global_batch_size 8 --train.micro_batch_size 8

{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.5,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x7e4eb5922960>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=False,
                  evaluate_example='first'),
 'logger_name': 'csv',
 'num_nodes': 1,
 'optimizer': {'class_path': 'torch.optim.AdamW',
               'init_args': {'betas': [0.9, 0.95],
                             'lr': 0.0002,
                             'weight_decay': 0.0}},
 'out_dir': PosixPath('out/finetune/full-tiny-llama-1.1b'

Make a note of the training time and memory, which is printed at the end of the training job.

### Experiment: Gradient accumulation

By using gradient accumulation to “step” only after a few “micro batches”, we can train with a larger effective “global” batch size, with minimal effect on the memory required:

In [10]:
!litgpt finetune_full --config config/tiny-llama-full.yaml --train.global_batch_size 32 --train.micro_batch_size 8

{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.5,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x70478e11fb00>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=False,
                  evaluate_example='first'),
 'logger_name': 'csv',
 'num_nodes': 1,
 'optimizer': {'class_path': 'torch.optim.AdamW',
               'init_args': {'betas': [0.9, 0.95],
                             'lr': 0.0002,
                             'weight_decay': 0.0}},
 'out_dir': PosixPath('out/finetune/full-tiny-llama-1.1b'

Make a note of the training time and memory, which is printed at the end of the training job.
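
To see why accumulating over micro batches gives the same update as one large batch, here is a minimal sketch in plain Python. The one-parameter model `y = w * x` and the synthetic data are hypothetical, chosen only for illustration; the batch sizes match the command above. With a global batch size of 32 and a micro batch size of 8, the optimizer steps once after every 4 micro batches.

```python
import random

def mse_gradient(w, batch):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Hypothetical synthetic data: 32 samples from y = 3x plus a little noise
random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(32)]
data = [(x, 3.0 * x + random.gauss(0, 0.1)) for x in xs]

w = 0.5
global_batch_size = 32
micro_batch_size = 8
accumulation_steps = global_batch_size // micro_batch_size  # 4 micro batches per step

# Option 1: one large batch (all 32 samples are processed at once)
full_grad = mse_gradient(w, data)

# Option 2: accumulate over micro batches (only 8 samples are processed at a time)
accumulated_grad = 0.0
for i in range(accumulation_steps):
    micro_batch = data[i * micro_batch_size:(i + 1) * micro_batch_size]
    # Scale each micro-batch gradient so that the sum is the mean over all 32 samples
    accumulated_grad += mse_gradient(w, micro_batch) / accumulation_steps

print(f"accumulation steps:   {accumulation_steps}")
print(f"full-batch gradient:  {full_grad:.6f}")
print(f"accumulated gradient: {accumulated_grad:.6f}")
```

The two gradients agree up to floating-point rounding, but the second option only ever holds one micro batch of intermediate values at a time. The price is time: the 4 micro batches are processed one after another.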

### Experiment: Reduced precision

With the bfloat16 (“brain floating point”) format for numbers, which uses 16 bits per value instead of the 32 bits of float32, we can further reduce the memory required, although this representation is less precise:

In [11]:
!litgpt finetune_full --config config/tiny-llama-full.yaml --train.global_batch_size 32 --train.micro_batch_size 8 --precision bf16-true

{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.5,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x7d07f1790f80>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=False,
                  evaluate_example='first'),
 'logger_name': 'csv',
 'num_nodes': 1,
 'optimizer': {'class_path': 'torch.optim.AdamW',
               'init_args': {'betas': [0.9, 0.95],
                             'lr': 0.0002,
                             'weight_decay': 0.0}},
 'out_dir': PosixPath('out/finetune/full-tiny-llama-1.1b'

Make a note of the training time and memory, which is printed at the end of the training job.
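
To get a feel for how much less precise bfloat16 is, the sketch below emulates it in plain Python. A bfloat16 value has the same sign and exponent bits as a float32 value but keeps only the top 7 bits of the fraction. The helper `to_bfloat16` is a hypothetical name written for this illustration (it is not part of LitGPT or PyTorch), and it ignores special values such as NaN and infinity.

```python
import struct

def to_bfloat16(x):
    """Round x to the nearest bfloat16 value and return it as a Python float."""
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 low bits that bfloat16 drops
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Near 1.0, neighboring bfloat16 values are 2**-7 = 0.0078125 apart
print(to_bfloat16(1.0))    # 1.0
print(to_bfloat16(1.001))  # 1.0 (the difference is too small to represent)
print(to_bfloat16(1.01))   # 1.0078125

# A weight of 1.0 nudged by the recipe's learning rate of 0.0002 does not move at all
print(to_bfloat16(1.0 + 0.0002))  # 1.0
```

Each value takes 2 bytes instead of 4, which is where the memory saving comes from, but small changes to a weight can be rounded away.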

### Experiment: Mixed precision

With mixed precision, we get back some of the lost precision in the results, at the cost of some additional memory and time:

In [12]:
!litgpt finetune_full --config config/tiny-llama-full.yaml --train.global_batch_size 32 --train.micro_batch_size 8 --precision bf16-mixed

{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.5,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x77e037f26960>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=False,
                  evaluate_example='first'),
 'logger_name': 'csv',
 'num_nodes': 1,
 'optimizer': {'class_path': 'torch.optim.AdamW',
               'init_args': {'betas': [0.9, 0.95],
                             'lr': 0.0002,
                             'weight_decay': 0.0}},
 'out_dir': PosixPath('out/finetune/full-tiny-llama-1.1b'

Make a note of the training time and memory, which is printed at the end of the training job.
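
The toy example below sketches why keeping a higher-precision copy of the weights helps. It is an illustration under stated assumptions, not a description of how LitGPT or PyTorch Lightning implement `bf16-mixed`: the helpers `round_to_bfloat16` and `round_to_float32` are hypothetical, the gradient is a made-up constant, and a single scalar stands in for the whole model.

```python
import struct

def round_to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def round_to_bfloat16(x):
    """Round to the nearest bfloat16 value (ignores NaN and infinity)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

lr = 0.0002   # learning rate from the recipe
grad = -1.0   # hypothetical constant gradient, for illustration only
steps = 1000

# Reduced precision only: the weight itself is stored in bfloat16
w_bf16 = 1.0
for _ in range(steps):
    w_bf16 = round_to_bfloat16(w_bf16 - lr * grad)

# Mixed precision: the weight is stored in float32, and a bfloat16 copy
# is what the (much more expensive) forward and backward passes would use
w_master = 1.0
for _ in range(steps):
    w_master = round_to_float32(w_master - lr * grad)
w_compute = round_to_bfloat16(w_master)

print(f"bfloat16 weight after {steps} steps: {w_bf16}")
print(f"float32 weight after {steps} steps:  {w_master:.4f}")
print(f"bfloat16 copy used for compute:      {w_compute}")
```

In this toy setting the bfloat16-only weight never moves, because every individual update is rounded away, while the float32 weight accumulates all 1000 updates. Storing the float32 values is what costs the additional memory.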

### Experiment: Larger model - 3b

We’ve gained so much GPU memory back with these techniques that we can even train a larger model. Let’s switch from the 1.1B to the 3B model:

In [13]:
!litgpt finetune_full --config config/open-llama-3b-full.yaml --train.global_batch_size 32 --train.micro_batch_size 8 --precision bf16-true

{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/openlm-research/open_llama_3b'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.5,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x77ba6150ed80>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=False,
                  evaluate_example='first'),
 'logger_name': 'csv',
 'num_nodes': 1,
 'optimizer': {'class_path': 'torch.optim.AdamW',
               'init_args': {'betas': [0.9, 0.95],
                             'lr': 0.0002,
                             'weight_decay': 0.0}},
 'out_dir': PosixPath('out/finetune/full-open-llama-3b'),
 'precision': 'bf16-

Make a note of the training time and memory, which is printed at the end of the training job.

### Experiment: Larger model - 7b

If we reduce the batch size again, we can even train a 7B model:

In [14]:
!litgpt finetune_full --config config/open-llama-7b-full.yaml --train.global_batch_size 16 --train.micro_batch_size 4 --precision bf16-true

{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/openlm-research/open_llama_7b'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.5,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x719e0af3d1f0>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=False,
                  evaluate_example='first'),
 'logger_name': 'csv',
 'num_nodes': 1,
 'optimizer': {'class_path': 'torch.optim.AdamW',
               'init_args': {'betas': [0.9, 0.95],
                             'lr': 0.0002,
                             'weight_decay': 0.0}},
 'out_dir': PosixPath('out/finetune/full-open-llama-7b'),
 'precision': 'bf16-

Make a note of the training time and memory, which is printed at the end of the training job.
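
A back-of-the-envelope estimate helps explain which models can fit. The sketch below counts only the values that must be stored for every parameter during full fine-tuning: the weight, its gradient, and the optimizer state (two values per parameter for AdamW, none for plain SGD without momentum). It assumes nominal parameter counts read off the model names and 2 bytes per value for `bf16-true`, and it ignores activations, temporary buffers, and fragmentation, so real usage will be higher. The function name `static_memory_gib` is hypothetical.

```python
GIB = 1024 ** 3          # nvidia-smi reports memory in binary units
GPU_MEMORY_GIB = 80      # our A100

def static_memory_gib(n_params, bytes_per_value, optimizer_states_per_param):
    """Memory for weights + gradients + optimizer state, in GiB."""
    values_per_param = 1 + 1 + optimizer_states_per_param
    return n_params * values_per_param * bytes_per_value / GIB

# Nominal parameter counts (assumption: taken from the model names)
models = {"1.1B": 1.1e9, "3B": 3e9, "7B": 7e9, "13B": 13e9}

print(f"{'model':>6} {'AdamW (GiB)':>12} {'SGD (GiB)':>10}")
for name, n_params in models.items():
    adamw = static_memory_gib(n_params, bytes_per_value=2, optimizer_states_per_param=2)
    sgd = static_memory_gib(n_params, bytes_per_value=2, optimizer_states_per_param=0)
    note = "" if adamw < GPU_MEMORY_GIB else "  <- AdamW state alone exceeds the GPU"
    print(f"{name:>6} {adamw:>12.1f} {sgd:>10.1f}{note}")
```

Under these assumptions, the 7B model needs roughly 52 GiB before any activations are counted, which leaves room for only a small micro batch. Passing `bytes_per_value=4` gives the corresponding estimate for full precision.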

### Experiment: Larger model - 13b

Even with the smallest possible batch size, we can’t train a 13B model:

In [15]:
!litgpt finetune_full --config config/open-llama-13b-full.yaml --train.global_batch_size 1 --train.micro_batch_size 1 --precision bf16-true

{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/openlm-research/open_llama_13b'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.5,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x773916f38e90>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=False,
                  evaluate_example='first'),
 'logger_name': 'csv',
 'num_nodes': 1,
 'optimizer': {'class_path': 'torch.optim.AdamW',
               'init_args': {'betas': [0.9, 0.95],
                             'lr': 0.0002,
                             'weight_decay': 0.0}},
 'out_dir': PosixPath('out/finetune/full-open-llama-13b'),
 'precision': 'bf1

This will fail with an “out of memory” error. But, if we switch from the Adam optimizer (which has two state values per parameter) to SGD, we can train a 13B model. It’s *verrrrry* slow, though, so we won’t even train it for a full epoch - just 25 “steps”, so we can get an idea of the memory required:

In [16]:
!litgpt finetune_full --config config/open-llama-13b-full.yaml --train.global_batch_size 1 --train.micro_batch_size 1 --precision bf16-true --optimizer SGD --train.max_steps 25

{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/openlm-research/open_llama_13b'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.5,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x7fa86d5c4530>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=False,
                  evaluate_example='first'),
 'logger_name': 'csv',
 'num_nodes': 1,
 'optimizer': 'SGD',
 'out_dir': PosixPath('out/finetune/full-open-llama-13b'),
 'precision': 'bf16-true',
 'resume': False,
 'seed': 1337,
 'train': TrainArgs(save_interval=100,
                    log_interval=1,
                    global_batch_size=1,
              

### Experiment: Parameter-efficient fine-tuning

If we are only fine-tuning, not training a model from scratch, we can also consider LoRA and QLoRA. Let’s try LoRA first with our 1.1B model:

In [17]:
!litgpt finetune --config config/tiny-llama-lora.yaml

{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.5,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x78267c3b3170>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=False,
                  evaluate_example='first'),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.05,
 'lora_head': True,
 'lora_key': True,
 'lora_mlp': True,
 'lora_projection': True,
 'lora_query': True,
 'lora_r': 32,
 'lora_value': True,
 'num_nodes': 1,
 'optimizer': {'class_path': 'torch.optim.AdamW',
              

The memory required is *shockingly* small! We can see it with our 3B and 7B models, too:

In [18]:
!litgpt finetune --config config/open-llama-3b-lora.yaml

{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/openlm-research/open_llama_3b'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.5,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x7472f0b26c90>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=False,
                  evaluate_example='first'),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.05,
 'lora_head': True,
 'lora_key': True,
 'lora_mlp': True,
 'lora_projection': True,
 'lora_query': True,
 'lora_r': 32,
 'lora_value': True,
 'num_nodes': 1,
 'optimizer': {'class_path': 'torch.optim.AdamW',
               'init_args': {'betas'

In [19]:
!litgpt finetune --config config/open-llama-7b-lora.yaml

{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/openlm-research/open_llama_7b'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.5,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x7df3bf9761b0>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=False,
                  evaluate_example='first'),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.05,
 'lora_head': True,
 'lora_key': True,
 'lora_mlp': True,
 'lora_projection': True,
 'lora_query': True,
 'lora_r': 32,
 'lora_value': True,
 'num_nodes': 1,
 'optimizer': {'class_path': 'torch.optim.AdamW',
               'init_args': {'betas'

We can also further reduce the memory required with quantization:

In [20]:
!litgpt finetune --config config/open-llama-7b-lora.yaml --quantize bnb.nf4

{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/openlm-research/open_llama_7b'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.5,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x747be4784f80>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=False,
                  evaluate_example='first'),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.05,
 'lora_head': True,
 'lora_key': True,
 'lora_mlp': True,
 'lora_projection': True,
 'lora_query': True,
 'lora_r': 32,
 'lora_value': True,
 'num_nodes': 1,
 'optimizer': {'class_path': 'torch.optim.AdamW',
               'init_args': {'betas'
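
As a rough sketch of where the saving from 4-bit quantization comes from, we can compare the memory needed just to hold the frozen base weights at 16 bits and at 4 bits per weight. This assumes a nominal 7 billion parameters and ignores the small per-block scaling constants that a quantized format also stores, as well as the LoRA weights, gradients, and activations. The function name `weight_memory_gib` is hypothetical.

```python
GIB = 1024 ** 3

def weight_memory_gib(n_params, bits_per_weight):
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB

n_params = 7e9  # nominal size of the 7B model (assumption)

bf16_gib = weight_memory_gib(n_params, bits_per_weight=16)
nf4_gib = weight_memory_gib(n_params, bits_per_weight=4)

print(f"16-bit weights: {bf16_gib:.1f} GiB")
print(f" 4-bit weights: {nf4_gib:.1f} GiB")
print(f"reduction:      {bf16_gib / nf4_gib:.0f}x")
```

Compare this estimate with the difference in memory reported by the two 7B LoRA runs, with and without `--quantize bnb.nf4`.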

Even the 13B model can be trained quickly with minimal memory required, using LoRA:

In [21]:
!litgpt finetune --config config/open-llama-13b-lora.yaml

{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/openlm-research/open_llama_13b'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.5,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x7c801011e8d0>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=False,
                  evaluate_example='first'),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.05,
 'lora_head': True,
 'lora_key': True,
 'lora_mlp': True,
 'lora_projection': True,
 'lora_query': True,
 'lora_r': 32,
 'lora_value': True,
 'num_nodes': 1,
 'optimizer': {'class_path': 'torch.optim.AdamW',
               'init_args': {'betas
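
To see why LoRA needs so little memory, we can count trainable parameters for a single weight matrix. LoRA freezes the original `d_out x d_in` matrix and trains two small matrices of shapes `d_out x r` and `r x d_in` instead. The sketch below uses the rank `lora_r = 32` from our recipes; the layer size of 4096 is a hypothetical example and is not read from any of the model configurations.

```python
def full_trainable_params(d_out, d_in):
    """Trainable values when the whole weight matrix is fine-tuned."""
    return d_out * d_in

def lora_trainable_params(d_out, d_in, rank):
    """Trainable values in the two low-rank LoRA matrices."""
    return rank * (d_out + d_in)

d = 4096      # hypothetical layer width, for illustration only
lora_r = 32   # rank used in our LoRA recipes

full = full_trainable_params(d, d)
lora = lora_trainable_params(d, d, lora_r)

print(f"full fine-tuning: {full:,} trainable values")
print(f"LoRA (r={lora_r}):      {lora:,} trainable values")
print(f"fraction:         {lora / full:.2%}")
```

Gradients and optimizer state are only needed for the trainable values, so they shrink by the same factor. The frozen base weights still have to be held in memory, which is why quantizing them reduces the requirement further.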