<a href="https://colab.research.google.com/github/larsondg2000/Colab-Projects/blob/main/fine_tune_llm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install dependencies

In [None]:
!pip install transformers datasets evaluate



In [None]:
!pip install trl



In [None]:
!pip install sentencepiece



In [None]:
!pip install accelerate -U



In [None]:
!pip install transformers[torch]



In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset("databricks/databricks-dolly-15k")

In [None]:
dataset_split = dataset["train"].train_test_split(test_size=.01)

In [None]:
print(dataset_split)

DatasetDict({
    train: Dataset({
        features: ['instruction', 'context', 'response', 'category'],
        num_rows: 14860
    })
    test: Dataset({
        features: ['instruction', 'context', 'response', 'category'],
        num_rows: 151
    })
})
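
The split sizes printed above can be checked by hand. The following is a minimal sketch, assuming the test set size is the fractional size rounded up and the remainder goes to training (the total of 15,011 rows is just the sum of the two counts shown). Because no seed is set in the call above, the rows landing in each split will differ from run to run; passing `seed=` to `train_test_split` makes the split reproducible.

```python
import math

total_rows = 14860 + 151   # train + test rows from the printed DatasetDict
test_fraction = 0.01       # the test_size used in train_test_split

# Assumption: a fractional test size is rounded up to a whole number of rows.
n_test = math.ceil(total_rows * test_fraction)
n_train = total_rows - n_test

print(n_train, n_test)
```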


In [None]:
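# logits: [sequence_length, vocab_size] for each example in the batch
# At each position in the sequence, the logits are unnormalized scores for the
# next token over all possible tokens in the model vocab.

# Softmax turns the logits into a probability distribution. It does not change
# which token scores highest, so argmax is taken on the logits directly.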

import torch
import evaluate

def preprocess_logits_for_metrics(logits, labels):
  if isinstance(logits, tuple):
    logits=logits[0]
  return logits.argmax(dim=-1)


metric = evaluate.load('accuracy')

def compute_metrics(eval_preds):
  preds, labels = eval_preds

  # shift the labels and predictions so the prediction made at position t
  # is compared with the actual token at position t+1
  labels = labels[:, 1:].reshape(-1)
  preds = preds[:, :-1].reshape(-1)

  mask = labels != -100 # boolean mask

  # apply the mask to filter out ignored labels and the corresponding predictions
  filtered_labels = labels[mask]
  filtered_preds = preds[mask]

  return metric.compute(predictions=filtered_preds,references=filtered_labels)
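
To make the two helper functions above easier to follow, here is a small NumPy sketch on made-up data. It is an illustration under stated assumptions, not part of the training run: the tensor sizes, token ids, and the name `shifted_masked_accuracy` are hypothetical. It checks that argmax gives the same tokens before and after softmax, then applies the same shift-and-mask steps as `compute_metrics` to a toy batch.

```python
import numpy as np

def softmax(x):
    # Subtract the per-row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def shifted_masked_accuracy(preds, labels, ignore_index=-100):
    """Token accuracy using the same shift and mask as compute_metrics."""
    labels = labels[:, 1:].reshape(-1)   # drop the first label
    preds = preds[:, :-1].reshape(-1)    # drop the last prediction
    mask = labels != ignore_index        # keep only positions that count
    return float((preds[mask] == labels[mask]).mean())

# Toy logits: batch of 1, sequence length 5, vocab of 10 tokens (made-up sizes).
rng = np.random.default_rng(0)
logits = rng.normal(size=(1, 5, 10))

# Softmax is monotonic, so the highest-scoring token is unchanged.
same_argmax = bool((logits.argmax(-1) == softmax(logits).argmax(-1)).all())

# Toy labels: the first two positions are masked prompt tokens (-100).
labels = np.array([[-100, -100, 5, 7, 2]])
preds = np.array([[9, 5, 7, 3, 0]])

# After shifting: labels [-100, 5, 7, 2] vs preds [9, 5, 7, 3].
# Three positions are kept and two of them match.
toy_accuracy = shifted_masked_accuracy(preds, labels)
print(same_argmax, round(toy_accuracy, 3))
```

Reducing the logits to token ids in `preprocess_logits_for_metrics` also means the evaluation loop only has to accumulate one integer per position instead of a full vocabulary-sized row.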


In [None]:
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
tokenizer.pad_token = tokenizer.eos_token # end of sequence token for padding


model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T", torch_dtype=torch.bfloat16, device_map='auto')

# processing a raw example
def formatting_prompts_func(example):
  output_texts = []
  for i in range(len(example['instruction'])):
    text = f"### Input: {example['context'][i] + ' ' + example['instruction'][i]}\n ### Output: {example['response'][i]}"
    output_texts.append(text)
  return output_texts

# add context to the response template so it is tokenized the way it appears inside a full prompt
response_template_with_context = "### Input: Dummy Input\n ### Output:"
response_template_ids = tokenizer.encode(response_template_with_context, add_special_tokens=False)
print(response_template_ids)

response_template_ids = response_template_ids[-1*len(tokenizer.encode(" ### Output:", add_special_tokens=False)):]
print(response_template_ids)

collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)

[835, 10567, 29901, 360, 11770, 10567, 13, 835, 10604, 29901]
[13, 835, 10604, 29901]
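
The template is tokenized inside a dummy prompt and then sliced, presumably because the tokenizer can produce different ids for the same text depending on what precedes it; the sliced ids `[13, 835, 10604, 29901]` are the ones that appear in real prompts. The sketch below illustrates the idea behind completion-only training in plain Python. It is a simplified stand-in, not the `trl` implementation: the function name `mask_prompt_tokens` and the response token ids are hypothetical, and treating a missing template as "ignore the whole example" is an assumption.

```python
IGNORE_INDEX = -100  # label value skipped by the loss and by compute_metrics

def mask_prompt_tokens(input_ids, template_ids, ignore_index=IGNORE_INDEX):
    """Return labels where everything up to and including the template is ignored."""
    labels = list(input_ids)
    n = len(template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == template_ids:
            end = start + n
            labels[:end] = [ignore_index] * end
            return labels
    # Assumption: if the template is not found, nothing in the example is trained on.
    return [ignore_index] * len(input_ids)

# Ids printed above for "### Input: Dummy Input\n ### Output:" and the sliced template.
prompt_ids = [835, 10567, 29901, 360, 11770, 10567, 13, 835, 10604, 29901]
template_ids = [13, 835, 10604, 29901]
response_ids = [1000, 2000, 2]  # hypothetical response tokens, for illustration only

labels = mask_prompt_tokens(prompt_ids + response_ids, template_ids)
print(labels)
```

Only the response tokens keep their ids as labels, so the loss and the accuracy metric are computed on the model's answer rather than on the prompt.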


In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=4,

    gradient_accumulation_steps=32,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    warmup_steps=10,
    weight_decay=0.01,
    evaluation_strategy='steps',
    eval_steps=10, #eval every 10 steps
    logging_steps=1,
    gradient_checkpointing=True, # recomputes forward-pass activations during the backward pass to save memory
    save_steps=500 # checkpoint every 500 steps

)

trainer = SFTTrainer(
    model,
    args=training_args,
    train_dataset=dataset_split['train'],
    eval_dataset=dataset_split['test'],
    formatting_func=formatting_prompts_func,
    data_collator=collator,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    compute_metrics=compute_metrics,
    max_seq_length=2048  # truncate sequences longer than 2048 tokens; a larger value uses more memory

)
trainer.train()



Step,Training Loss,Validation Loss,Accuracy
10,1.4912,,0.597245
20,1.4796,,0.600456
30,1.6039,,0.59547
40,1.5496,,0.595977
50,1.1085,,0.598935
60,1.1388,,0.594455
70,1.6956,,0.597667
80,1.5783,,0.596737
90,1.7207,,0.59623
100,1.829,,0.59809
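
For context on the step numbers in the table, here is a rough calculation from the `TrainingArguments` above. It is a sketch assuming a single device and no sequence packing; the exact number of optimizer steps per epoch depends on the library version and on how the last partial accumulation group is handled.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 32
num_devices = 1            # assumption: a single GPU
train_rows = 14860         # from the printed DatasetDict

# Examples that contribute to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Full accumulation groups in one pass over the training split.
full_updates_per_epoch = train_rows // effective_batch_size

# Training examples consumed by the time step 100 is logged.
examples_seen_at_step_100 = 100 * effective_batch_size

print(effective_batch_size, full_updates_per_epoch, examples_seen_at_step_100)
```

Under these assumptions the 100 steps shown cover about 3,200 examples, well under one epoch, which is worth keeping in mind when reading the accuracy column.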



Before Taran can propose to Eilonwy, the bard-king Fflewddur Fflam and his mount Llyan arrive with a gravely injured Gwydion, Prince of Don. Servants of Arawn had assaulted them and seized the magical black sword Dyrnwyn. Fflewddur also states that Taran was involved in the ambush, baffling everyone. With Achren's help, the truth is determined: Arawn himself has come from Annuvin to the verge of Caer Dallben in the guise of Taran, in order to lure Gwydion into the ambush.

Because Dyrnwyn may be pivotal as a threat to Arawn, Dallben consults the oracular pig Hen Wen to determine how it may be regained. During the reading, the ash rods used to communicate shatter and the two thirds of Hen Wen's answer are discouraging and vague. When Gwydion heals sufficiently, he sets out with Taran and others to meet with King Smoit. Gwydion insists that he alone should enter Annuvin to seek the sword, but Smoit's Cantrev Cadiffor is on the way. The small party divides, as Rhun and Eilonwy intend to 