<a href="https://colab.research.google.com/github/llm-finetune/experiment-tracking/blob/main/Gemma_2B_on_Alpaca_using_WandB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
%pip install -U transformers
%pip install wandb
%pip install accelerate
%pip install -U bitsandbytes
%pip install -U peft
%pip install -U trl
%pip install dill
#%pip install mlflow
%pip install flash_attn

In [2]:
!wget https://raw.githubusercontent.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/main/data/alpaca_gpt4_data.json

--2024-03-09 04:02:31--  https://raw.githubusercontent.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/main/data/alpaca_gpt4_data.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 43379276 (41M) [text/plain]
Saving to: ‘alpaca_gpt4_data.json’


2024-03-09 04:02:35 (462 MB/s) - ‘alpaca_gpt4_data.json’ saved [43379276/43379276]



In [3]:
import json

dataset_file = "alpaca_gpt4_data.json"

with open(dataset_file, "r") as f:
    alpaca = json.load(f)

In [4]:
type(alpaca), alpaca[0:3], len(alpaca)

(list,
 [{'instruction': 'Give three tips for staying healthy.',
   'input': '',
   'output': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.'},
  {'instruction': 'What are the three primary colors?',
   'input': '',
   'output': 'The three primary colors are red, blue, and yellow. These

In [5]:
import wandb
from google.colab import userdata
secret_wandb = userdata.get('wandb')
wandb.login(key = secret_wandb)
# log to wandb
with wandb.init(project="alpaca_ft"):
    at = wandb.Artifact(
        name="alpaca_gpt4",
        type="dataset",
        description="A GPT-4 generated, Alpaca-like dataset for instruction fine-tuning",
        metadata={"url":"https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM#how-good-is-the-data"},
    )
    at.add_file(dataset_file)
    wandb.log_artifact(at)  # without this call the artifact is built but never uploaded

    # log as a table
    table = wandb.Table(columns=list(alpaca[0].keys()))
    for row in alpaca:
        table.add_data(*row.values())
    wandb.log({"alpaca_gpt4_table": table})

wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: llm-finetune-wb. Use `wandb login --relogin` to force relogin



**Train/Eval Split**

In [6]:
import random

seed = 42

random.seed(seed)
random.shuffle(alpaca)  # this could also be a parameter

In [7]:
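train_dataset = alpaca[:-30000]  # train on a subset: everything except the last 30,000 shuffled examples
eval_dataset = alpaca[-2000:]    # hold out the last 2,000 examples for evaluation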

In [8]:
len(train_dataset)


22002
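The split above depends on a seeded shuffle, so the same seed always gives the same train and eval rows. A minimal sketch of that idea on toy data follows; the helper `shuffled_split` is hypothetical and not part of the notebook. Unlike the cell above, it keeps every item that is not held out instead of training on a subset.

```python
import random


def shuffled_split(items, n_eval, seed=42):
    """Shuffle a copy of items with a private RNG, then hold out the last n_eval."""
    rng = random.Random(seed)  # private RNG: leaves the global random state untouched
    items = list(items)
    rng.shuffle(items)
    return items[:-n_eval], items[-n_eval:]


# Two calls with the same seed produce identical splits.
train_a, eval_a = shuffled_split(range(100), n_eval=10)
train_b, eval_b = shuffled_split(range(100), n_eval=10)
print(train_a == train_b, len(train_a), len(eval_a))
```

Passing the seed as an argument makes it easy to log as a hyperparameter, which is what the comment on `random.shuffle` hints at.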

In [9]:
import pandas as pd

train_df = pd.DataFrame(train_dataset)
eval_df = pd.DataFrame(eval_dataset)

train_table = wandb.Table(dataframe=train_df)
eval_table  = wandb.Table(dataframe=eval_df)

train_df.to_json("alpaca_gpt4_train.jsonl", orient='records', lines=True)
eval_df.to_json("alpaca_gpt4_eval.jsonl", orient='records', lines=True)

with wandb.init(project="alpaca_ft", job_type="split_data"):
    at = wandb.Artifact(
        name="alpaca_gpt4_splitted",
        type="dataset",
        description="A GPT-4 generated, Alpaca-like dataset for instruction fine-tuning",
        metadata={"url":"https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM#how-good-is-the-data"},
    )
    at.add_file("alpaca_gpt4_train.jsonl")
    at.add_file("alpaca_gpt4_eval.jsonl")
    wandb.log_artifact(at)
    wandb.log({"train_dataset":train_table, "eval_dataset":eval_table})


In [10]:
def prompt_no_input(row):
    return ("Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Response:\n").format_map(row)

In [11]:
row = alpaca[0]
print(prompt_no_input(row))

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Describe an example of a time you used influence in a positive way

### Response:



In [12]:
row

{'instruction': 'Describe an example of a time you used influence in a positive way',
 'input': '',
 'output': 'As an AI assistant, I do not have personal experiences, but I can provide an example. One instance where I used my influence in a positive way was when a user was feeling highly stressed and overwhelmed about an upcoming job interview. They confided in me, expressing their self-doubts and fears of failure. Recognizing the power of positive reinforcement and encouragement, I drew upon my resources to provide the user with uplifting and motivational messages. I reminded them of their strengths and past accomplishments, and suggested coping strategies such as visualization and practicing positive self-talk. Furthermore, I helped them prepare for the interview by offering practice questions, tips on body language, and advice on how to effectively communicate their skills and experiences. As a result, the user reported feeling more confident and capable of performing well in their

In [13]:
def prompt_input(row):
    return ("Below is an instruction that describes a task, paired with an input that provides further context. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n").format_map(row)

In [14]:
row = alpaca[234]
print(prompt_input(row))

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Optimize this query for maximum recall:

### Input:
SELECT * FROM  table WHERE column1 = "value1"

### Response:



In [15]:
row

{'instruction': 'Optimize this query for maximum recall:',
 'input': 'SELECT * FROM  table WHERE column1 = "value1"',
 'output': 'The given SQL query selects all columns from the table where the value of column1 is equal to "value1". To optimize the query for maximum recall, consider the following suggestions:\n\n1. Use OR conditions: If there are multiple values that could match the search criteria, you can use OR conditions to broaden the scope of the search. For example:\n\n```\nSELECT * FROM table WHERE column1 = "value1" OR column1 = "value2"\n```\n\n2. Use LIKE operator with wildcards: The LIKE operator allows you to perform pattern matching with wildcards. Using the percent sign (%) you can match any number of characters, including zero. For example:\n\n```\nSELECT * FROM table WHERE column1 LIKE "%value1%"\n```\n\nThis will return all rows where column1 contains the substring "value1" anywhere within the string.\n\n3. Use of Full-Text search: If you are working with a large tex

In [16]:
#the refactored function
def create_alpaca_prompt(row):
    return prompt_no_input(row) if row["input"] == "" else prompt_input(row)

In [17]:
import json
from wandb import Api

api = Api()
artifact = api.artifact('llm-finetune-wb/alpaca_ft/alpaca_gpt4_splitted:v0', type='dataset')
dataset_dir = artifact.download()

def load_jsonl(file_path):
    data = []
    with open(file_path, 'r') as file:
        for line in file:
            data.append(json.loads(line))
    return data

train_dataset = load_jsonl(f"{dataset_dir}/alpaca_gpt4_train.jsonl")
eval_dataset = load_jsonl(f"{dataset_dir}/alpaca_gpt4_eval.jsonl")

wandb:   2 of 2 files downloaded.


In [18]:
len(eval_dataset)

2000

In [19]:
train_prompts = [create_alpaca_prompt(row) for row in train_dataset]
eval_prompts = [create_alpaca_prompt(row) for row in eval_dataset]

In [20]:
print(train_prompts[50])

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Create a link to an online store that sells books.

### Response:



In [21]:
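def pad_eos(ds):
    # Gemma's end-of-sequence token is "<eos>"; "</s>" is the Llama-style token
    # and would be tokenized as ordinary text by the Gemma tokenizer.
    EOS_TOKEN = "<eos>"
    return [f"{row['output']}{EOS_TOKEN}" for row in ds]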

In [22]:
train_outputs = pad_eos(train_dataset)
eval_outputs = pad_eos(eval_dataset)
train_outputs[0]

'As an AI assistant, I do not have personal experiences, but I can provide an example. One instance where I used my influence in a positive way was when a user was feeling highly stressed and overwhelmed about an upcoming job interview. They confided in me, expressing their self-doubts and fears of failure. Recognizing the power of positive reinforcement and encouragement, I drew upon my resources to provide the user with uplifting and motivational messages. I reminded them of their strengths and past accomplishments, and suggested coping strategies such as visualization and practicing positive self-talk. Furthermore, I helped them prepare for the interview by offering practice questions, tips on body language, and advice on how to effectively communicate their skills and experiences. As a result, the user reported feeling more confident and capable of performing well in their interview. They later informed me that they landed the job and thanked me for my support and encouragement. I 

In [23]:
train_dataset = [{"prompt":s, "output":t, "example": s + t} for s, t in zip(train_prompts, train_outputs)]
eval_dataset = [{"prompt":s, "output":t, "example": s + t} for s, t in zip(eval_prompts, eval_outputs)]

In [26]:
print(train_dataset[56]["example"])

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Write a story that features a character named "Jenna".

### Response:
Once upon a time, there was a girl named Jenna. Jenna was a sweet and caring girl who lived in a small village at the foot of a great mountain. Her days were full of joy as she played and laughed with her friends and helped her parents with their work on the farm.

However, one day, a terrible disease came to the village, spreading like wildfire and causing many of the villagers to fall sick. Jenna’s parents were among those who became ill, and with no cure in sight, the village was thrown into despair.

Jenna, however, refused to give up hope. She remembered stories her grandfather used to tell her about a magical flower that grew high up on the mountain, said to have the power to cure any illness. Gathering her courage, Jenna decided to set out on a quest to find the flower and bring it back t

**Converting text to numbers: Tokenizer**

We need to convert the dataset into tokens. You can do this quickly with the workhorse of the transformers library, the tokenizer. It does a lot of heavy lifting besides tokenizing the text:

*   It tokenizes the text
*   Converts the outputs to PyTorch tensors
*   Pads the inputs to match length, and more!
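To make the padding step concrete, here is a minimal pure-Python sketch of left-padding a batch to a fixed length and building the matching attention mask. The helper `pad_batch`, the pad id `0`, and the token ids are made up for illustration; the real tokenizer handles all of this internally.

```python
def pad_batch(sequences, max_length, pad_id=0):
    """Left-pad each list of token ids to max_length and build attention masks."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_length - len(seq)
        if n_pad < 0:
            raise ValueError("sequence is longer than max_length")
        input_ids.append([pad_id] * n_pad + list(seq))
        # 0 marks padding positions the model should ignore, 1 marks real tokens
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids; two sequences of different length.
batch = pad_batch([[2, 11, 12, 13], [2, 21]], max_length=5)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask is what lets the model tell padding from content, which is why the tokenizer returns it alongside `input_ids`.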







In [27]:
import os

import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    logging,
    pipeline,
)
from trl import SFTTrainer

In [28]:
model_id = 'google/gemma-2b'
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

tokenizer_config.json:   0%|          | 0.00/1.11k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/555 [00:00<?, ?B/s]

In [29]:
tokenizer.encode("My experiments are going strong!")

[2, 2926, 13818, 708, 2319, 3779, 235341]

In [30]:
tokenizer.encode("My experiments are going strong!", padding='max_length', max_length=10)

[1, 1, 1, 2, 2926, 13818, 708, 2319, 3779, 235341]

In [31]:
tokenizer.encode("My experiments are going strong!",
                 padding='max_length',
                 max_length=10,
                 return_tensors="pt")

tensor([[     1,      1,      1,      2,   2926,  13818,    708,   2319,   3779,
         235341]])

In [32]:
tokenizer(["My experiments are going strong!",
           "I love Llamas"],
          padding='max_length',
          # padding='longest',
          max_length=10,
          return_tensors="pt")

{'input_ids': tensor([[     1,      1,      1,      2,   2926,  13818,    708,   2319,   3779,
         235341],
        [     1,      1,      1,      1,      1,      2, 235285,   2182, 172809,
           2616]]), 'attention_mask': tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]])}

In [34]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit= True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_compute_dtype= torch.bfloat16,
    bnb_4bit_use_double_quant= False,
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config,
    #attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
#tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
tokenizer.padding_side = 'right' # to prevent warnings

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [35]:
# LoRA config based on QLoRA paper & Sebastian Raschka experiment
peft_config = LoraConfig(
        lora_alpha=8,
        lora_dropout=0.05,
        r=16,
        bias="none",
        target_modules="all-linear",
        task_type="CAUSAL_LM",
)
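As a rough guide to what `r` and `lora_alpha` control, the sketch below counts the parameters a single LoRA adapter adds to one linear layer and computes the `lora_alpha / r` scaling applied to the low-rank update. The layer shape is a hypothetical example, not read from the model, and `lora_extra_params` is an illustrative helper rather than a PEFT function.

```python
def lora_extra_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


r, lora_alpha = 16, 8          # values from the LoraConfig above
scaling = lora_alpha / r       # factor applied to the low-rank update

d_in, d_out = 2048, 2048       # hypothetical layer shape, for illustration only
added = lora_extra_params(d_in, d_out, r)
full = d_in * d_out

print(f"scaling={scaling}, added={added}, full={full}, ratio={added / full:.4f}")
```

With `target_modules="all-linear"`, an adapter of this kind is attached to every linear layer, so the total grows with the number and size of those layers.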

In [36]:
args = TrainingArguments(
    output_dir="./results",                 # directory to save and repository id
    num_train_epochs=1,                     # number of training epochs
    per_device_train_batch_size=4,          # batch size per device during training
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim="adamw_torch_fused",              # use fused AdamW optimizer
    logging_steps=25,                       # log every 25 steps
    save_steps=25,                          # save a checkpoint every 25 steps
    save_strategy="steps",                  # checkpoint on a step schedule (see save_steps)
    #fp16=True,                             # use float16 mixed precision
    #tf32=True,                             # use tf32 precision
    ### peft specific arguments ###
    learning_rate=2e-4,                     # learning rate, based on QLoRA paper
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper
    warmup_ratio=0.03,                      # warmup ratio based on QLoRA paper
    lr_scheduler_type="linear",             # linear decay after warmup
    #report_to="tensorboard",               # report metrics to tensorboard
    report_to="wandb",                      # report metrics to W&B
    push_to_hub=True,                       # push model to hub
)
max_seq_length = 1512 # max sequence length for model and packing of the dataset

In [38]:
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field="example",
    ### peft specific arguments ###
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    dataset_kwargs={
        "add_special_tokens": False, # <bos> and <eos> should be part of the dataset.
        "append_concat_token": False, # make sure to not add additional tokens when packing
    }
)

Generating train split: 0 examples [00:00, ? examples/s]

In [39]:
wandb.init(project="alpaca_ft", # the project I am working on
           tags=["baseline","2b"],
           job_type="train",
           config=args) # the Hyperparameters I want to keep track of

In [40]:
# start training, the model will be automatically saved to the hub and the output directory
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss
25,1.6026
50,1.2569
75,1.1602
100,1.1521
125,1.1104
150,1.0913
175,1.1181
200,1.09
225,1.1232
250,1.1049




TrainOutput(global_step=385, training_loss=1.150603148225066, metrics={'train_runtime': 2243.7887, 'train_samples_per_second': 0.685, 'train_steps_per_second': 0.172, 'total_flos': 2.7926346186031104e+16, 'train_loss': 1.150603148225066, 'epoch': 1.0})

In [41]:
new_model="gemma-2b-alpaca-full"

In [42]:
trainer.model.save_pretrained(new_model)

In [43]:
trainer.model.config.save_pretrained(new_model)

In [44]:
trainer.model.push_to_hub(new_model, use_temp_dir=True)

adapter_model.safetensors:   0%|          | 0.00/78.5M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/llm-finetune/gemma-2b-alpaca-full/commit/8935709db8863264e96089911baabdbe4cd25698', commit_message='Upload model', commit_description='', oid='8935709db8863264e96089911baabdbe4cd25698', pr_url=None, pr_revision=None, pr_num=None)

In [45]:
trainer.model.config.push_to_hub(new_model, use_temp_dir=True)

README.md:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/llm-finetune/gemma-2b-alpaca-full/commit/83fcf22d1ae59fb520b40bd2f11094c9d62d4fd5', commit_message='Upload config', commit_description='', oid='83fcf22d1ae59fb520b40bd2f11094c9d62d4fd5', pr_url=None, pr_revision=None, pr_num=None)

In [46]:
tokenizer.push_to_hub(new_model, use_temp_dir=True)

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

CommitInfo(commit_url='https://huggingface.co/llm-finetune/gemma-2b-alpaca-full/commit/7d60b56a59e9dd9c1f693fb7ac9104481a142708', commit_message='Upload tokenizer', commit_description='', oid='7d60b56a59e9dd9c1f693fb7ac9104481a142708', pr_url=None, pr_revision=None, pr_num=None)

In [47]:
wandb.finish()


train/epoch,▁▂▂▂▃▃▄▄▅▅▆▆▇▇██
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇██
train/grad_norm,▁▂▂▂▁▁▁▁█▁▁▁▁▁▂
train/learning_rate,█▇▇▆▆▆▅▄▄▃▃▃▂▁▁
train/loss,█▃▂▂▂▁▂▁▂▁▂▁▁▂▁
train/total_flos,▁
train/train_loss,▁
train/train_runtime,▁
train/train_samples_per_second,▁
train/train_steps_per_second,▁

train/epoch,1.0
train/global_step,385.0
train/grad_norm,0.41864
train/learning_rate,1e-05
train/loss,1.0868
train/total_flos,2.7926346186031104e+16
train/train_loss,1.1506
train/train_runtime,2243.7887
train/train_samples_per_second,0.685
train/train_steps_per_second,0.172


In [None]:
max_sequence_len = 1024

def pack(dataset, max_seq_len=max_sequence_len):
    tkds_ids = tokenizer([s["example"] for s in dataset])["input_ids"]

    all_token_ids = []
    for tokenized_input in tkds_ids:
        all_token_ids.extend(tokenized_input)# + [tokenizer.eos_token_id])

    print(f"Total number of tokens: {len(all_token_ids)}")
    packed_ds = []
    for i in range(0, len(all_token_ids), max_seq_len+1):
        input_ids = all_token_ids[i : i + max_seq_len+1]
        if len(input_ids) == (max_seq_len+1):
            packed_ds.append({"input_ids": input_ids[:-1], "labels": input_ids[1:]})  # this shift is not needed if using the model.loss
    return packed_ds

In [None]:
train_ds_packed = pack(train_dataset)
eval_ds_packed = pack(eval_dataset)
len(train_ds_packed)

Total number of tokens: 2340494
Total number of tokens: 384128


2283
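The packing logic is easier to follow on a toy token stream. The sketch below repeats the same steps as `pack` without a tokenizer: concatenate everything, cut into chunks of `max_seq_len + 1`, drop the incomplete tail, and shift by one position to get the labels. The function name `pack_ids` and the token ids are illustrative only.

```python
def pack_ids(token_streams, max_seq_len):
    """Concatenate token id lists, then cut into chunks of max_seq_len + 1."""
    all_ids = [tok for stream in token_streams for tok in stream]
    packed = []
    for i in range(0, len(all_ids), max_seq_len + 1):
        chunk = all_ids[i : i + max_seq_len + 1]
        if len(chunk) == max_seq_len + 1:  # drop the incomplete tail
            # inputs are the first max_seq_len tokens, labels are the next-token targets
            packed.append({"input_ids": chunk[:-1], "labels": chunk[1:]})
    return packed


# Three short "examples" with 11 tokens in total.
toy = pack_ids([[1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]], max_seq_len=4)
for row in toy:
    print(row)

# Same arithmetic as the cell above: total tokens // (max_seq_len + 1)
n_packed = 2_340_494 // (1024 + 1)
print(n_packed)
```

Because each chunk consumes `max_seq_len + 1` tokens, the number of packed training sequences is the total token count divided by 1025, rounded down.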

In [None]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

torch.manual_seed(seed)
batch_size = 4  # I have an A100 GPU with 40GB of RAM 😎

train_dataloader = DataLoader(
    train_ds_packed,
    batch_size=batch_size,
    collate_fn=default_data_collator, # we don't need any special collator 😎
)

eval_dataloader = DataLoader(
    eval_ds_packed,
    batch_size=batch_size,
    collate_fn=default_data_collator,
    shuffle=False,
)

In [None]:
b = next(iter(train_dataloader))
b

{'input_ids': tensor([[     2,  33501,    603,  ...,  10567, 235292,    108],
         [  5075,   6952,   4093,  ...,   2881,  37024,    576],
         [ 34641,    576,   3868,  ...,    578,   9051,   7881],
         [   674,   8106,    685,  ...,    921,    577,   1154]]),
 'labels': tensor([[ 33501,    603,    671,  ..., 235292,    108,    651],
         [  6952,   4093,  42788,  ...,  37024,    576,   2149],
         [   576,   3868, 235265,  ...,   9051,   7881,  27168],
         [  8106,    685,    671,  ...,    577,   1154, 235290]])}

In [None]:
tokenizer.decode(b["input_ids"][0])[:250]

'<bos>Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nDescribe an example of a time you used influence in a positive way\n\n### Response:\nAs an AI assistant, I do not have perso'

In [None]:
tokenizer.decode(b["labels"][0])[:251]

'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nDescribe an example of a time you used influence in a positive way\n\n### Response:\nAs an AI assistant, I do not have personal ex'

**Train**

In [None]:
from types import SimpleNamespace

gradient_accumulation_steps = 2

config = SimpleNamespace(
    model_id='google/gemma-2b',
    dataset_name="alpaca-gpt4",
    precision="bf16",  # faster and better than fp16, requires new GPUs
    n_freeze=24,  # how many decoder layers we don't train; carried over from a Llama 7B setup (32 layers). Gemma 2B has only 18, so 24 freezes all of them
    lr=2e-4,
    n_eval_samples=10,  # how many samples to generate on validation
    max_seq_len=max_sequence_len,  # length of the sequences to pack
    epochs=1,  # we do 1 pass over the dataset
    gradient_accumulation_steps=gradient_accumulation_steps,  # how many iterations between optimizer updates; simulates larger batch sizes
    batch_size=batch_size,  # what my GPU can handle, depends on how many layers we are training
    log_model=False,  # upload the model to W&B?
    gradient_checkpointing=True,  # saves even more memory
    freeze_embed=True,  # why train this? let's keep them frozen ❄️
    seed=seed,
)

config.total_train_steps = config.epochs * len(train_dataloader) // config.gradient_accumulation_steps

In [None]:
print(f"We will train for {config.total_train_steps} steps and evaluate every epoch")

We will train for 285 steps and evaluate every epoch
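The step count follows directly from numbers already printed in this notebook. The short calculation below reproduces it, assuming the `DataLoader` keeps the last partial batch (its default behaviour); variable names are chosen for illustration.

```python
import math

n_packed = 2283                   # packed training sequences reported earlier
batch_size = 4
gradient_accumulation_steps = 2
epochs = 1
max_seq_len = 1024

# The DataLoader keeps the last partial batch unless drop_last=True is set.
batches_per_epoch = math.ceil(n_packed / batch_size)
total_train_steps = epochs * batches_per_epoch // gradient_accumulation_steps

# Each optimizer update sees this many sequences and tokens.
effective_batch_size = batch_size * gradient_accumulation_steps
tokens_per_update = effective_batch_size * max_seq_len

print(batches_per_epoch, total_train_steps, effective_batch_size, tokens_per_update)
```

Gradient accumulation halves the number of optimizer updates while doubling the effective batch size.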


In [None]:
model = AutoModelForCausalLM.from_pretrained(
    config.model_id,
    device_map=0,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    use_cache=False,
)

config.json:   0%|          | 0.00/627 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/13.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/67.1M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

In [None]:
def param_count(m):
    params = sum([p.numel() for p in m.parameters()])/1_000_000
    trainable_params = sum([p.numel() for p in m.parameters() if p.requires_grad])/1_000_000
    print(f"Total params: {params:.2f}M, Trainable: {trainable_params:.2f}M")
    return params, trainable_params

params, trainable_params = param_count(model)

Total params: 2506.17M, Trainable: 2506.17M


In [None]:
# freeze layers (disable gradients)
for param in model.parameters(): param.requires_grad = False
for param in model.lm_head.parameters(): param.requires_grad = True
for param in model.model.layers[config.n_freeze:].parameters(): param.requires_grad = True

In [None]:
# Just freeze embeddings for small memory decrease.
# Note: Gemma ties lm_head to the input embeddings, so this also freezes lm_head.
if config.freeze_embed:
    model.model.embed_tokens.weight.requires_grad_(False)

In [None]:
# save more memory
if config.gradient_checkpointing:
    model.gradient_checkpointing_enable()

In [None]:
params, trainable_params = param_count(model)

Total params: 2506.17M, Trainable: 0.00M
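A trainable count of 0.00M means no parameter will receive gradients. The toy sketch below shows one way the freezing cells can produce this, under two assumptions that match the printed numbers but are not stated in the notebook: the model has 18 decoder layers, and `lm_head` shares its weight with the input embeddings. `Param` and `count_trainable` are stand-ins written for this illustration, not PyTorch APIs.

```python
class Param:
    """Tiny stand-in for a tensor that carries a requires_grad flag."""

    def __init__(self, numel):
        self.numel = numel
        self.requires_grad = True


def count_trainable(n_layers, n_freeze, freeze_embed, tied=True):
    """Replay the freezing steps on toy parameters and count what stays trainable."""
    embed = Param(1000)
    lm_head = embed if tied else Param(1000)   # tied weights are the same object
    layers = [Param(10) for _ in range(n_layers)]
    params = {id(p): p for p in [embed, lm_head, *layers]}  # de-duplicate tied weights

    for p in params.values():        # 1. freeze everything
        p.requires_grad = False
    lm_head.requires_grad = True     # 2. unfreeze the head
    for p in layers[n_freeze:]:      # 3. unfreeze layers past n_freeze (empty if n_freeze >= n_layers)
        p.requires_grad = True
    if freeze_embed:                 # 4. freeze embeddings (and the head, if tied)
        embed.requires_grad = False
    return sum(p.numel for p in params.values() if p.requires_grad)


print(count_trainable(n_layers=18, n_freeze=24, freeze_embed=True))   # settings used above
print(count_trainable(n_layers=18, n_freeze=12, freeze_embed=True))   # leaves the last 6 layers trainable
print(count_trainable(n_layers=18, n_freeze=24, freeze_embed=False))  # only the tied head/embedding trains
```

Slicing a list past its end returns an empty list rather than raising, so an `n_freeze` larger than the layer count fails silently.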


**Optimizer**

In [None]:
from transformers import get_cosine_schedule_with_warmup

optim = torch.optim.Adam(model.parameters(), lr=config.lr, betas=(0.9,0.99), eps=1e-5)
scheduler = get_cosine_schedule_with_warmup(
    optim,
    num_training_steps=config.total_train_steps,
    num_warmup_steps=config.total_train_steps // 10,
)
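To see what the scheduler does to the learning rate, here is a pure-Python sketch of a linear warmup followed by a half-cosine decay. It is written to mirror the shape that `get_cosine_schedule_with_warmup` is expected to produce with its default settings; treat it as an approximation for intuition, not the library's code. The step counts come from the values printed earlier.

```python
import math


def cosine_with_warmup(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    # half cosine from 1 down to 0 over the remaining steps
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


total_steps = 285
warmup_steps = total_steps // 10
base_lr = 2e-4

lrs = [base_lr * cosine_with_warmup(s, warmup_steps, total_steps) for s in range(total_steps + 1)]
print(warmup_steps, lrs[0], lrs[warmup_steps], lrs[-1])
```

The learning rate starts at zero, peaks at `base_lr` when warmup ends, and decays smoothly to zero at the last step.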

In [None]:
def loss_fn(x, y):
    "A Flat CrossEntropy"
    return torch.nn.functional.cross_entropy(x.view(-1, x.shape[-1]), y.view(-1))
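"Flat" here means that the batch and time dimensions are merged, so every token position counts as one classification example. The sketch below does the same computation in plain Python on nested lists, to show what the two `view` calls achieve; `flat_cross_entropy` is an illustrative re-implementation, not the PyTorch function.

```python
import math


def flat_cross_entropy(logits, labels):
    """Mean negative log-likelihood over all positions.

    logits: [batch][time][vocab] floats, labels: [batch][time] ints.
    """
    flat_logits = [row for seq in logits for row in seq]   # (batch * time, vocab)
    flat_labels = [y for seq in labels for y in seq]       # (batch * time,)
    total = 0.0
    for row, y in zip(flat_logits, flat_labels):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[y]
    return total / len(flat_labels)


# Uniform logits over a 4-token vocabulary: the loss equals log(4) for any labels.
uniform = [[[0.0, 0.0, 0.0, 0.0]] * 3] * 2
loss = flat_cross_entropy(uniform, [[0, 1, 2], [3, 0, 1]])
print(loss, math.log(4))
```

A model that guesses uniformly over the vocabulary scores `log(vocab_size)`, which is a useful baseline when reading the loss curves.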

**Testing during training**

In [None]:
from types import SimpleNamespace
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained(config.model_id)
test_config = SimpleNamespace(
    max_new_tokens=256,
    gen_config=gen_config)

In [None]:
def generate(prompt, max_new_tokens=test_config.max_new_tokens, gen_config=gen_config):
    tokenized_prompt = tokenizer(prompt, return_tensors='pt')['input_ids'].cuda()
    with torch.inference_mode():
        output = model.generate(tokenized_prompt,
                            max_new_tokens=max_new_tokens,
                            generation_config=gen_config)
    return tokenizer.decode(output[0][len(tokenized_prompt[0]):], skip_special_tokens=True)

In [None]:
prompt = eval_dataset[14]["prompt"]
print(prompt + generate(prompt, 128))

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Generate a list of three benefits of taking a gap year.

### Response:
1. Taking a gap year can help you to gain a better understanding of your career goals.
2. Taking a gap year can help you to gain a better understanding of your career goals.
3. Taking a gap year can help you to gain a better understanding of your career goals.

### Instruction:
Write a response that appropriately completes the request.

### Response:
1. Taking a gap year can help you to gain a better understanding of your career goals.
2. Taking a gap year can help you to gain a better understanding of your career goals.
3. Taking a gap year can help you to


In [None]:
import wandb
from tqdm.auto import tqdm

def prompt_table(examples, log=False, table_name="predictions"):
    table = wandb.Table(columns=["prompt", "generation", "concat", "output", "max_new_tokens", "temperature", "top_p"])
    for example in tqdm(examples, leave=False):
        prompt, gpt4_output = example["prompt"], example["output"]
        out = generate(prompt, test_config.max_new_tokens, test_config.gen_config)
        table.add_data(prompt, out, prompt+out, gpt4_output, test_config.max_new_tokens, test_config.gen_config.temperature, test_config.gen_config.top_p)
    if log:
        wandb.log({table_name:table})
    return table

def to_gpu(tensor_dict):
    return {k: v.to('cuda') for k, v in tensor_dict.items()}

class Accuracy:
    "A simple Accuracy function compatible with HF models"
    def __init__(self):
        self.count = 0
        self.tp = 0.
    def update(self, logits, labels):
        logits, labels = logits.argmax(dim=-1).view(-1).cpu(), labels.view(-1).cpu()
        tp = (logits == labels).sum()
        self.count += len(logits)
        self.tp += tp
        return tp / len(logits)
    def compute(self):
        return self.tp / self.count

In [None]:
@torch.no_grad()
def validate():
    model.eval()
    eval_acc = Accuracy()
    loss, total_steps = 0., 0
    for step, batch in enumerate(pbar:=tqdm(eval_dataloader, leave=False)):
        pbar.set_description("doing validation")
        batch = to_gpu(batch)
        total_steps += 1
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            out = model(**batch)
            loss += loss_fn(out.logits, batch["labels"])  # you could use out.loss and not shift the dataset
        eval_acc.update(out.logits, batch["labels"])
    # we log results at the end
    wandb.log({"eval/loss": loss.item() / total_steps,
               "eval/accuracy": eval_acc.compute()})
    prompt_table(eval_dataset[:config.n_eval_samples], log=True)
    model.train()

In [None]:
from pathlib import Path
def save_model(model, model_name, models_folder="models", log=False):
    """Save the model and tokenizer locally, optionally logging them to wandb as an artifact
    Args:
        model (nn.Module): Model to save.
        model_name (str): Name of the model.
        models_folder (str, optional): Folder to save the model. Defaults to "models".
        log (bool, optional): Whether to upload the folder as a wandb artifact. Defaults to False.
    """
    model_name = f"{wandb.run.id}_{model_name}"
    file_name = Path(f"{models_folder}/{model_name}")
    file_name.parent.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(file_name, safe_serialization=True)
    # save tokenizer next to the model weights for easy inference
    tokenizer = AutoTokenizer.from_pretrained(model.name_or_path)
    tokenizer.save_pretrained(file_name)
    if log:
        at = wandb.Artifact(model_name, type="model")
        at.add_dir(str(file_name))
        wandb.log_artifact(at)

**The Actual Loop**

In [None]:
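# Debugging attempt for the RuntimeError below. Labels are integer targets and
# never need gradients; the error means no model parameter has requires_grad=True.
#out.logits.requires_grad = True
batch["labels"] = batch["labels"].half()
batch["labels"].requires_grad = True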

In [None]:
print(out.logits.requires_grad)
print(batch["labels"].requires_grad)


True
True


In [None]:
wandb.init(project="alpaca_ft", # the project I am working on
           tags=["baseline","2b"],
           job_type="train",
           config=config) # the Hyperparameters I want to keep track of

# Training
acc = Accuracy()
model.train()
train_step = 0
for epoch in tqdm(range(config.epochs)):
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch = to_gpu(batch)
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            out = model(**batch)
            loss = loss_fn(out.logits, batch["labels"]) / config.gradient_accumulation_steps  # you could use out.loss and not shift the dataset
        print(loss.requires_grad)  # debug: must be True for backward() to work
        loss.backward()  # run backward outside the autocast context
        if (step + 1) % config.gradient_accumulation_steps == 0:  # step the optimizer only after a full accumulation window
            # we can log the metrics to W&B
            wandb.log({"train/loss": loss.item() * config.gradient_accumulation_steps,
                       "train/accuracy": acc.update(out.logits, batch["labels"]),
                       "train/learning_rate": scheduler.get_last_lr()[0],
                       "train/global_step": train_step})
            optim.step()
            scheduler.step()
            optim.zero_grad(set_to_none=True)
            train_step += 1
    validate()


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/571 [00:00<?, ?it/s]

False




RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
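Independently of the error, the loop relies on gradient accumulation: each micro-batch loss is divided by `gradient_accumulation_steps` before `backward()`, so the summed gradients match those of one larger batch. The sketch below checks this on a one-parameter least-squares model in plain Python; `grad_mse` and the data are invented for the example.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5
accum_steps = 2
micro_batches = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]

accumulated = 0.0
update_steps = []
for step, (bx, by) in enumerate(micro_batches):
    # scale each micro-batch gradient so the sum equals the full-batch mean
    accumulated += grad_mse(w, bx, by) / accum_steps
    if (step + 1) % accum_steps == 0:
        update_steps.append(step)  # an optimizer step would happen here

full_batch = grad_mse(w, xs, ys)
print(accumulated, full_batch, update_steps)
```

Checking `(step + 1) % accum_steps` makes the first update happen after a complete window of micro-batches rather than after the very first one.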

In [None]:
# we save the model checkpoint at the end
save_model(model, model_name=config.model_id.replace("/", "_"), models_folder="models/", log=config.log_model)

wandb.finish()