This notebook follows [this blog post](https://www.philschmid.de/deepspeed-lora-flash-attention) from Philipp Schmid.

## Load Dataset

In [63]:
from datasets import load_dataset
from random import randrange
import os

In [64]:
os.environ["WANDB_ENTITY"] = "hamelsmu"
os.environ["WANDB_PROJECT"] = "deepspeed-bench" # log to your project 
os.environ["WANDB_LOG_MODEL"] = "all" # log your models

In [65]:
dataset = load_dataset("databricks/databricks-dolly-15k", 
                       split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])

dataset size: 15011
{'instruction': 'What kind of dog should I get?', 'context': '', 'response': 'There are many dog breeds to choose from. Choosing a dog breed is a personal choice. Consider what kind of lifestyle you live and pick a dog that fits your lifestyle. For example, if you are allergic to dogs you may consider a poodle, or poodle mix as they tend to be hypoallergenic.', 'category': 'general_qa'}


In [68]:
def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt

In [69]:
from random import randrange

print(format_dolly(dataset[randrange(len(dataset))]))

### Instruction
What is buoyant force?

### Answer
The upward force exerted on a body, partially or fully immersed in a fluid, is known as buoyant force. This upward force is also called Upthrust. This is related  to the Archimedes principle. If an object is partially or fully submerged in any fluid, the upward force and the fluid displaced is equal to the upward force exerted by the fluid.
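The random sample above happens to have an empty `context`, so the `### Context` section is skipped. As a small sketch of the other branch, here is the same formatting logic applied to a made-up record (the record is hypothetical, not taken from the dataset) that does have a context:

```python
def format_dolly(sample):
    # Same logic as the notebook's format_dolly, repeated so this block runs on its own.
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # Join only the parts that exist, separated by a blank line.
    return "\n\n".join(part for part in (instruction, context, response) if part is not None)

# Hypothetical record with a non-empty context (illustration only).
sample = {
    "instruction": "Who wrote the report?",
    "context": "The report was written by Ada.",
    "response": "Ada wrote the report.",
}
formatted = format_dolly(sample)
print(formatted)
```

With a non-empty context the prompt has three sections instead of two, each separated by a blank line.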


## Load Model

In [70]:
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token


In [71]:
from random import randint
from itertools import chain
from functools import partial

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample

In [72]:
# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))
# print random sample
# randint is inclusive on both ends, so subtract 1 to stay in range
print(dataset[randint(0, len(dataset) - 1)]["text"])

### Instruction
What is combinatorial optimisation?

### Answer
Combinatorial optimisation is a field of applied mathematics, combining techniques from combinatorics, linear programming, and the theory of algorithms, to solve discrete optimisation problems. It is usually used as an alias of discrete optimisation. A combinatorial optimisation problem can generally be drawn as a triple (S, f, C), where S is a given search space, f is the objective function, which should be either maximised or minimised, and C is the set of constraints that have to be fulfilled to obtain feasible solutions. The goal is to find a globally optimal solution, meaning a solution s' that belongs to S, with either the highest or lowest objective value in the case of maximisation or minimisation, each under the restriction of constraints.</s>


In [73]:
# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

I changed `chunk_length` below to get the sequence length I wanted. I had two versions:

1. the `2048` version from Philipp's blog post, which is named `dolly-processed`
2. a really small one with `chunk_length=64`, which I cut off at `3200` examples to keep it extra small. I named this data `dolly-processed-tiny-truncated`. I used this small version when I was comparing bs=1 with bs=200.

In [74]:
def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # largest multiple of chunk_length that fits in this batch
    # (0 if the batch is shorter than one chunk, so everything goes to the remainder)
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result

In [75]:
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

In [76]:
len(lm_dataset)

1581

### What is going on here?

`chunk` packs examples into contiguous "rows" of exactly 2048 tokens. Whatever does not fill a complete row gets carried over into the next "batch", so there is effectively no padding. Basically, we are cramming as much data through the model as possible.
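To make the packing concrete, here is a toy sketch in plain Python. It is not the notebook's `chunk` function: `pack` is a hypothetical helper that handles a single list of token ids instead of a dict of columns, and the token ids (with `0` standing in for EOS) are made up for illustration.

```python
def pack(sequences, chunk_length, remainder=None):
    """Concatenate token lists and cut them into fixed-length rows.

    Returns (rows, leftover). leftover holds the tokens that did not
    fill a complete row and should be prepended to the next batch.
    """
    tokens = list(remainder or [])
    for seq in sequences:
        tokens.extend(seq)
    # Largest multiple of chunk_length that fits in the concatenated tokens.
    usable = (len(tokens) // chunk_length) * chunk_length
    rows = [tokens[i : i + chunk_length] for i in range(0, usable, chunk_length)]
    return rows, tokens[usable:]

# Toy "tokenized examples"; 0 plays the role of the EOS token id.
batch_1 = [[1, 2, 3, 0], [4, 5, 0]]
batch_2 = [[6, 7, 8, 9, 0]]

rows_1, leftover_1 = pack(batch_1, chunk_length=4)
rows_2, leftover_2 = pack(batch_2, chunk_length=4, remainder=leftover_1)

print(rows_1)      # [[1, 2, 3, 0]]
print(leftover_1)  # [4, 5, 0]
print(rows_2)      # [[4, 5, 0, 6], [7, 8, 9, 0]]
print(leftover_2)  # []
```

Note how the second example from `batch_1` ends up sharing a row with the start of the example from `batch_2`: rows do not line up with example boundaries.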

I can imagine this is how pre-training works more generally, but I am not sure it makes sense for instruction tuning. In practice the model isn't going to see inputs like this, with several unrelated examples concatenated into one sequence, so should we really be instruction tuning this way? (I DM'd Philipp on Twitter to ask for his opinion.)

In [58]:
print(f"seq len of 1st example: {len(lm_dataset[0]['input_ids'])}")

seq len of 1st example: 2048


In [59]:
print(f"seq len of 2nd example: {len(lm_dataset[1]['input_ids'])}")

seq len of 2nd example: 2048


In [60]:
len(lm_dataset)

1581
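The row count is a direct consequence of the packing. As a rough back-of-the-envelope check (it ignores the final leftover of fewer than 2048 tokens, and the token counts include whatever special tokens the tokenizer adds), the numbers printed above imply the following:

```python
num_examples = 15011   # dataset size printed when loading the data
num_rows = 1581        # len(lm_dataset) after packing
chunk_length = 2048    # tokens per packed row

# Every packed row is completely full, so this is the token count that was kept.
packed_tokens = num_rows * chunk_length
avg_tokens_per_example = packed_tokens / num_examples

print(packed_tokens)                      # 3237888
print(round(avg_tokens_per_example, 1))   # 215.7
```

So the average formatted example is only about 216 tokens long, which is why roughly 9 to 10 examples end up in each 2048-token row.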

### Save Data to Disk

In [61]:
lm_dataset.save_to_disk('dolly-processed')

Saving the dataset (1/1 shards): 100%|███████████████████████████████████████████████████████████████████████████| 1581/1581 [00:00<00:00, 28107.21 examples/s]


## Train Model

Look at the overview tab of the runs in [this project](https://wandb.ai/hamelsmu/deepspeed-bench?workspace=user-hamelsmu) to see (most of) the CLI command used for each training run.

In [62]:
!WANDB_ENTITY=hamelsmu WANDB_PROJECT=deepspeed-bench WANDB_LOG_MODEL=all WANDB_RUN_ID=z3-3gpu-v4 \
    torchrun --nproc_per_node 3 run_lora.py \
  --model_id {model_id} \
  --dataset_path dolly-processed \
  --output_dir {model_id}-fa \
  --num_train_epochs 3 \
  --per_device_train_batch_size 8 \
  --learning_rate 4e-3 \
  --gradient_checkpointing True \
  --gradient_accumulation_steps 2 \
  --bf16 True \
  --tf32 True \
  --lr_scheduler_type "constant_with_warmup" \
  --logging_steps 25 \
  --save_steps 100 \
  --save_total_limit 3 \
  --report_to "wandb" \
  --deepspeed z3.json
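For reference, here is a quick sketch of what the flags above imply for the effective batch size. The step count is an estimate that assumes the data is split evenly across GPUs; the number the trainer actually reports may differ slightly depending on how it handles the last partial batch.

```python
import math

num_gpus = 3                 # --nproc_per_node
per_device_batch_size = 8    # --per_device_train_batch_size
grad_accum_steps = 2         # --gradient_accumulation_steps
seq_len = 2048               # chunk_length used for dolly-processed
num_rows = 1581              # len(lm_dataset)

# Sequences (and tokens) that contribute to a single optimizer step.
sequences_per_step = num_gpus * per_device_batch_size * grad_accum_steps
tokens_per_step = sequences_per_step * seq_len

# Approximate number of optimizer steps per epoch.
steps_per_epoch = math.ceil(num_rows / sequences_per_step)

print(sequences_per_step)  # 48
print(tokens_per_step)     # 98304
print(steps_per_epoch)     # 33
```

Because every row is packed to the full 2048 tokens, each optimizer step sees close to 100k real tokens with no padding.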