# Finetuning Llama 2 and Mistral - Beginner's Guide

Taken from: https://medium.com/@geronimo7/finetuning-llama2-mistral-945f9c200611

For a general guide on text generation using transformers, check this excellent Hugging Face doc: https://huggingface.co/docs/transformers/llm_tutorial

#### Prerequisites

bitsandbytes - library for 4-bit quantization

peft - parameter-efficient fine-tuning; works by reducing the number of trainable parameters. It provides the LoRA implementation used here.
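To make the "fewer trainable parameters" idea concrete, here is a rough sketch of the arithmetic behind LoRA. It assumes a single weight matrix of shape d x k and a LoRA rank r; the 4096 x 4096 shape and r = 16 are illustrative values, not taken from this guide. LoRA freezes the original matrix and trains two small matrices of shape d x r and r x k instead.

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Parameters in the two low-rank LoRA matrices (d x r) and (r x k)."""
    return d * r + r * k


# Illustrative values (assumptions, not taken from this guide).
d, k, r = 4096, 4096, 16

full = d * k                           # parameters in the frozen weight matrix
lora = lora_trainable_params(d, k, r)  # parameters that are actually trained

print(f"full matrix:   {full:,}")
print(f"LoRA adapters: {lora:,}")
print(f"trainable fraction: {lora / full:.4%}")
```

The trainable fraction shrinks as the matrix grows, because the adapter size scales with d + k while the full matrix scales with d * k.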

In [1]:
# Prerequisites
!pip install -U accelerate bitsandbytes datasets peft transformers tokenizers

Collecting accelerate
  Downloading accelerate-0.28.0-py3-none-any.whl.metadata (18 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl.metadata (9.9 kB)
Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl.metadata (20 kB)
Collecting peft
  Downloading peft-0.10.0-py3-none-any.whl.metadata (13 kB)
Collecting transformers
  Downloading transformers-4.39.1-py3-none-any.whl.metadata (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.8/134.8 kB 1.9 MB/s eta 0:00:00
Collecting scipy (from bitsandbytes)
  Using cached scipy-1.12.0-cp312-cp312-macosx_12_0_arm64.whl.metadata (217 kB)
Collecting pyarrow>=12.0.0 (from datasets)
  Downloading pyarrow-15.0.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (3.0 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl.metadata (3.6 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-n

## Datasets

Datasets can be converted to many training-compatible formats. Here we will take a look at the ChatML format, which looks like this:

<code>
<|im_start|>system
You are an AI assistant. User will you give you a task. Your goal is to 
complete the task as faithfully as you can. While performing the task 
think step-by-step and justify your steps.<|im_end|>
<|im_start|>user
Premise: A man is inline skating in front of a wooden bench. Hypothesis: 
A man is having fun skating in front of a bench. .Choose the correct 
answer: Given the premise, can we conclude the hypothesis? 
Select from: a). yes b). it is not possible to tell c). no<|im_end|>
<|im_start|>assistant
b). it is not possible to tell Justification: Although the man is inline 
skating in front of the wooden bench, we cannot conclude whether he is 
having fun or not, as his emotions are not explicitly mentioned.<|im_end|>
</code>
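The structure above is regular enough to generate programmatically: each turn is wrapped as `<|im_start|>{role}`, a newline, the content, and `<|im_end|>`. The following is a minimal sketch, assuming conversations are stored as a list of `{"role", "content"}` dicts; the helper name `to_chatml` and the example conversation are made up for illustration.

```python
def to_chatml(messages: list[dict[str, str]]) -> str:
    """Render [{'role': ..., 'content': ...}, ...] as a ChatML string."""
    turns = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    # One turn per line, with a trailing newline after the last turn.
    return "\n".join(turns) + "\n"


# Hypothetical conversation, for illustration only.
example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Name one moon of Jupiter."},
    {"role": "assistant", "content": "Europa."},
]
print(to_chatml(example))
```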

Let's start by looking at the Open Assistant dataset.

In [2]:
# Load Open Assistant Dataset
from datasets import load_dataset
dataset = load_dataset("OpenAssistant/oasst_top1_2023-08-25")

Downloading readme:   0%|          | 0.00/512 [00:00<?, ?B/s]

Downloading data: 100%|██████████| 31.0M/31.0M [00:02<00:00, 12.8MB/s]
Downloading data: 100%|██████████| 1.61M/1.61M [00:00<00:00, 6.05MB/s]


Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 12947
    })
    test: Dataset({
        features: ['text'],
        num_rows: 690
    })
})

In [4]:
# look at the first few entries
dataset['train'][:5]

{'text': ['<|im_start|>user\nConsigliami 5 nomi per il mio cucciolo di dobberman<|im_end|>\n<|im_start|>assistant\nEcco 5 nomi per il tuo cucciolo di dobermann:\n\n- Zeus\n- Apollo\n- Thor\n- Athena\n- Odin<|im_end|>\n',
  '<|im_start|>user\n¿Crees que los robots sintientes deben tener los mismos derechos que los humanos?<|im_end|>\n<|im_start|>assistant\nEsta es una pregunta muy interesante y compleja. No tengo una opinión personal al respecto, pero puedo decirte lo que algunas personas piensan sobre este tema.😊\n\nEn general, hay dos posiciones opuestas sobre si los robots sintientes deben tener los mismos derechos que los humanos:\n\n- Sí, son seres conscientes y merecen respeto La primera premisa sería un rotundo SÍ, los robots sintientes deben tener los mismos derechos que los humanos porque son seres conscientes, capaces de sentir, pensar y actuar de forma autónoma. Algunos argumentos a favor de esta postura son:\n\n  - Los robots sintientes tienen una inteligencia artificial ava

Conveniently, this dataset is already in ChatML format (note the `<|im_start|>` and `<|im_end|>` tags).
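One way to double-check that an entry really follows ChatML is to split it back into turns. The sketch below uses a regular expression to do that; `parse_chatml` is a hypothetical helper, and the sample string is the first entry of the train split shown in the output above.

```python
import re

# Matches one ChatML turn: <|im_start|>ROLE\nCONTENT<|im_end|>
TURN_RE = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)


def parse_chatml(text: str) -> list[tuple[str, str]]:
    """Split a ChatML string into (role, content) pairs."""
    return TURN_RE.findall(text)


# First entry of the train split, copied from the output above.
sample = (
    "<|im_start|>user\nConsigliami 5 nomi per il mio cucciolo di dobberman<|im_end|>\n"
    "<|im_start|>assistant\nEcco 5 nomi per il tuo cucciolo di dobermann:\n\n"
    "- Zeus\n- Apollo\n- Thor\n- Athena\n- Odin<|im_end|>\n"
)

for role, content in parse_chatml(sample):
    print(role, "->", repr(content[:40]))
```

An entry that yields no turns, or roles other than `system`, `user`, and `assistant`, would be worth inspecting before training.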

Now let's look at a dataset that we do need to convert: Open Orca, with 4.2M entries.

In [5]:
from datasets import load_dataset

dataset = load_dataset("Open-Orca/OpenOrca")
dataset = dataset["train"].train_test_split(test_size=0.1)

Downloading readme:   0%|          | 0.00/12.0k [00:00<?, ?B/s]

Downloading data: 100%|██████████| 1.01G/1.01G [00:35<00:00, 28.4MB/s]
Downloading data: 100%|██████████| 3.09G/3.09G [02:08<00:00, 24.0MB/s]


Generating train split: 0 examples [00:00, ? examples/s]

In [6]:
# Get the first entry of the new dataset
dataset['train'][0]

{'id': 't0.865507',
 'system_prompt': 'You are an AI assistant that follows instruction extremely well. Help as much as you can.',
 'question': 'Question: "What event did China want to suppress?"  Context: "In 1884, pro-Japanese Koreans in Seoul led the Gapsin Coup. Tensions between China and Japan rose after China intervened to suppress the uprising. Japanese Prime Minister Itō Hirobumi and Li Hongzhang signed the Convention of Tientsin, an agreement to withdraw troops simultaneously, but the First Sino-Japanese War of 1895 was a military humiliation. The Treaty of Shimonoseki recognized Korean independence and ceded Taiwan and the Pescadores to Japan. The terms might have been harsher, but when Japanese citizen attacked and wounded Li Hongzhang, an international outcry shamed the Japanese into revising them. The original agreement stipulated the cession of Liaodong Peninsula to Japan, but Russia, with its own designs on the territory, along with Germany and France, in what was known 

This needs to be converted to ChatML format. Here's a sample helper function:

In [8]:
def format_conversation(row):
    template="<|im_start|>system\n{sys}<|im_end|>\n<|im_start|>user\n{q}<|im_end|>\n<|im_start|>assistant\n{a}<|im_end|>"

    conversation=template.format(
        sys=row["system_prompt"],
        q=row["question"],
        a=row["response"],
    )

    return {"text": conversation}
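If a row's `system_prompt` happens to be empty, the template above still emits an empty system turn. Whether that matters depends on the model and training setup; as one possible variant, the sketch below drops the system turn in that case. The function name and the example row are hypothetical, and the row only mimics the keys of the Open Orca entry shown earlier.

```python
def format_conversation_optional_system(row: dict) -> dict:
    """Like format_conversation, but omits the system turn when it is empty."""
    parts = []
    if row.get("system_prompt"):
        parts.append(f"<|im_start|>system\n{row['system_prompt']}<|im_end|>")
    parts.append(f"<|im_start|>user\n{row['question']}<|im_end|>")
    parts.append(f"<|im_start|>assistant\n{row['response']}<|im_end|>")
    return {"text": "\n".join(parts)}


# Hypothetical row with the same keys as the Open Orca entry above.
row = {"system_prompt": "", "question": "What is 2 + 2?", "response": "4"}
print(format_conversation_optional_system(row)["text"])
```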

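Make a small dataset to try this out. Use `select` rather than slicing: `dataset['train'][:100]` returns a plain dict of columns, and calling `.map` on it fails with `AttributeError: 'dict' object has no attribute 'map'`.

In [11]:
# select() returns a Dataset, so .map() is available on the result
dataset_small = dataset['train'].select(range(100))

In [12]:
import os
dataset_2 = dataset_small.map(
    format_conversation,
    remove_columns=dataset_small.column_names, # remove all columns; only "text" will be left
    num_proc=os.cpu_count()  # run in parallel worker processes
)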
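Slicing a `Dataset`, as in `dataset['train'][:100]`, returns a plain dict that maps each column name to a list of values, which is why it has no `.map` method. The sketch below uses a hand-made dict with hypothetical values to mimic that shape and shows one way to walk it row by row in plain Python; `rows_from_columns` is a made-up helper, not part of the `datasets` library.

```python
def rows_from_columns(batch: dict[str, list]) -> list[dict]:
    """Turn {'col': [v0, v1, ...], ...} into [{'col': v0, ...}, {'col': v1, ...}, ...]."""
    keys = list(batch)
    return [dict(zip(keys, values)) for values in zip(*batch.values())]


# Hand-made stand-in for dataset['train'][:2]; values are hypothetical.
batch = {
    "system_prompt": ["You are helpful.", "You are concise."],
    "question": ["What is 2 + 2?", "Capital of France?"],
    "response": ["4", "Paris"],
}

rows = rows_from_columns(batch)
texts = [f"{r['question']} -> {r['response']}" for r in rows]
print(texts)
```

For real training data it is usually simpler to keep a `Dataset` object, since `.map` can also run across several worker processes.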

In [1]:
!pwd

/Users/jjj/repos/llm-learn/finetune


In [2]:
!ls -al

total 72
drwxr-xr-x  3 jjj  staff     96 Mar 24 22:46 .
drwxr-xr-x  8 jjj  staff    256 Mar 24 22:46 ..
-rw-r--r--  1 jjj  staff  33388 Mar 24 22:40 qlora-min.ipynb
