# Mistral 7B Instruct Fine-tuning Script
### Contents:
- Loading the model from Hugging Face
- Loading the manually labelled dataset from Hugging Face
- Pre-processing the training dataset with the Mistral formatting template
- Defining the LoRA model arguments and the SFT Trainer
- Training the model, with logging to Weights & Biases (W&B)
- Code referenced from
  - https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Mistral/Supervised_fine_tuning_(SFT)_of_an_LLM_using_Hugging_Face_tooling.ipynb





#### Installing Dependencies

In [None]:
!pip install -q torch
!pip install -q git+https://github.com/huggingface/transformers # Hugging Face transformers (from source), for downloading model weights
!pip install -q datasets # Hugging Face datasets, to download and manipulate datasets
!pip install -q peft # Parameter-efficient fine-tuning, for QLoRA fine-tuning
!pip install -q bitsandbytes # For model weight quantisation
!pip install -q trl # Transformer Reinforcement Learning, used here for supervised fine-tuning
!pip install -q -U wandb # Used to monitor the loss during training
!pip install -q --upgrade huggingface_hub
!pip install flash-attn --no-build-isolation
# Never hard-code an access token in a notebook; this assumes it is set in the HF_TOKEN environment variable
!huggingface-cli login --token $HF_TOKEN

import json
import pandas as pd
import torch
import wandb
from datasets import Dataset, load_dataset
from huggingface_hub import notebook_login
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from trl import SFTTrainer



### Logging into Hugging Face and W&B

In [None]:
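import os

# Read the tokens from environment variables instead of pasting them into the notebook
# (the variable names HF_TOKEN and WANDB_API_KEY are assumptions)
hf_token = os.environ["HF_TOKEN"]
wb_token = os.environ["WANDB_API_KEY"]
wandb.login(key=wb_token)
notebook_login()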

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc



### Loading Manually labelled dataset for Instruct Fine-Tuning

In [None]:
from datasets import load_dataset

dataset = load_dataset("mirko5301/crypto_whitepaper_public", split="test")


In [None]:
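# Function to format each row in the dataset with the Mistral instruct template
def create_text_row(instruction, output, input_text):
    # "\n" is a real newline here; the original "\\n" wrote a literal backslash-n into the training text
    text_row = f"""<s>[INST] {instruction} here are the inputs {input_text} [/INST] \n {output} </s>"""
    return text_row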

# Iterate over all the rows, format the dataset, and store it in a JSONL file
def process_jsonl_file(dataset, output_file_path):
    with open(output_file_path, "w") as output_jsonl_file:
        for item in dataset:
            json_object = {
                "text": create_text_row(item["instruction"], item["output"], item["input"]),
                "instruction": item["instruction"],
                "input": item["input"],
                "output": item["output"]
            }
            output_jsonl_file.write(json.dumps(json_object) + "\n")


# Provide the path where you want to save the formatted dataset
process_jsonl_file(dataset, "./training_datasett.jsonl")
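
To make the template concrete, here is a minimal sketch of what one formatted row looks like. The function name `format_mistral_row` and the record are hypothetical and not taken from the dataset, and the sketch assumes a real newline is intended after `[/INST]`.

```python
def format_mistral_row(instruction: str, context: str, answer: str) -> str:
    """Wrap one record in a Mistral-style instruct template."""
    return f"<s>[INST] {instruction} here are the inputs {context} [/INST] \n {answer} </s>"


# Hypothetical record, for illustration only (not taken from the dataset)
row = format_mistral_row(
    instruction="What is the consensus mechanism of ExampleCoin?",
    context="ExampleCoin secures its ledger with proof of stake.",
    answer="ExampleCoin uses proof of stake.",
)
print(row)
```

Everything between `[INST]` and `[/INST]` is the prompt; the text after `[/INST]` is what the model should learn to generate.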

In [None]:
train_dataset = load_dataset('json', data_files='./training_datasett.jsonl' , split='train')



In [None]:
from datasets import DatasetDict

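# Select every row of the dataset (49 rows)
indices = range(len(train_dataset))

# Note: "train" and "test" contain the same rows, so the validation loss
# reported during training is measured on the training data
dataset_dict = {"train": train_dataset.select(indices),
                "test": train_dataset.select(indices)}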

raw_datasets = DatasetDict(dataset_dict)
raw_datasets



DatasetDict({
    train: Dataset({
        features: ['text', 'instruction', 'input', 'output'],
        num_rows: 49
    })
    test: Dataset({
        features: ['text', 'instruction', 'input', 'output'],
        num_rows: 49
    })
})
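
Since both splits above hold the same 49 rows, a held-out split may give a more informative validation loss. The following is a minimal sketch of one way to produce disjoint index lists; the helper `holdout_split`, the 20% test fraction, and the seed are assumptions, not part of the original notebook.

```python
import random


def holdout_split(n_rows: int, test_fraction: float = 0.2, seed: int = 42):
    """Return disjoint (train_indices, test_indices) that together cover range(n_rows)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded shuffle, so the split is reproducible
    n_test = max(1, round(n_rows * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])


train_idx, test_idx = holdout_split(49)
print(len(train_idx), len(test_idx))
```

The two lists could then be passed to `train_dataset.select(...)` in place of the shared `indices`. The `datasets` library also provides `Dataset.train_test_split(test_size=..., seed=...)`, which does the same job in one call.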

In [None]:
example = raw_datasets["train"][0]
print(example.keys())

dict_keys(['text', 'instruction', 'input', 'output'])


In [None]:
import random

# create the splits
train_dataset = raw_datasets["train"]
eval_dataset = raw_datasets["test"]

for index in random.sample(range(len(raw_datasets["train"])), 3):
  print(f"Sample {index} of the processed training set:\n\n{raw_datasets['train'][index]['text']}")

Sample 0 of the processed training set:

<s>[INST] What is the crypto-asset project description of Algorand here are the inputs Algorand is a blockchain-based cryptocurrency protocol that aims to create a decentralized, secure, and scalable platform for diverse applications. Founded by Silvio Micali, it employs a unique Pure Proof of Stake (PPoS) consensus mechanism, ensuring security and decentralization by randomly selecting validators from the pool of token holders. Algorand supports high transaction throughput with low latency, making it suitable for fast-processing applications like finance and gaming. The platform offers robust smart contract capabilities, developer-friendly tools, and a commitment to sustainability through energy-efficient operations. Its decentralized governance model and interoperability features further enhance its appeal for building and integrating various blockchain solutions. Additionally, Algorand supports NFTs and other digital assets, broadening its ut

## Loading Mistral model

In [None]:
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# set pad_token_id equal to the eos_token_id if not set
if tokenizer.pad_token_id is None:
  tokenizer.pad_token_id = tokenizer.eos_token_id

# Set reasonable default for models without max length
if tokenizer.model_max_length > 100_000:
  tokenizer.model_max_length = 2048



### Define model arguments for LoRA

In [None]:
from transformers import BitsAndBytesConfig
import torch

# specify how to quantize the model
quantization_config = BitsAndBytesConfig(
            load_in_4bit=True, # Load the weights in 4-bit precision to reduce GPU memory use
            bnb_4bit_quant_type="nf4", # Use the NormalFloat 4 (nf4) data type, which is designed for normally distributed weights
            bnb_4bit_compute_dtype="float16", # Run the matrix multiplications in 16-bit floating point
)
device_map = {"": torch.cuda.current_device()} if torch.cuda.is_available() else None

# Note: model_kwargs is defined for reference only; it is never passed to the model or the trainer
model_kwargs = dict(
    attn_implementation="flash_attention_2", # Flash Attention reduces attention memory use and speeds up long sequences
    torch_dtype="auto",
    use_cache=False, # False as we're going to use gradient checkpointing
    device_map=device_map,
    quantization_config=quantization_config,
)

### Loading model

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config, device_map={"":0})

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/250544c9a802b0396550d0fd24bc80ff98bb1f5f/config.json
Model config MistralConfig {
  "_name_or_path": "mistralai/Mistral-7B-Instruct-v0.2",
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.0.dev0",
  "use_cache": true,
  "vocab_size": 32000
}

loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.2

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

All model checkpoint weights were used when initializing MistralForCausalLM.

All the weights of MistralForCausalLM were initialized from the model checkpoint at mistralai/Mistral-7B-Instruct-v0.2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MistralForCausalLM for predictions without further training.
loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/250544c9a802b0396550d0fd24bc80ff98bb1f5f/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}
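
The `MistralConfig` values printed above are enough to estimate the parameter count and the memory needed for the weights. The sketch below is a rough calculation under the assumption that the linear layers have no bias terms; the 4-bit figure is a lower bound, because it ignores quantization constants and any layers that stay in 16-bit.

```python
# Values copied from the MistralConfig printed above
hidden_size = 4096
intermediate_size = 14336
num_layers = 32
num_heads = 32
num_kv_heads = 8
vocab_size = 32000

head_dim = hidden_size // num_heads   # 128
kv_dim = num_kv_heads * head_dim      # 1024, smaller than hidden_size (grouped-query attention)

attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim  # q_proj, o_proj, k_proj, v_proj
mlp = 3 * hidden_size * intermediate_size                             # gate, up and down projections
norms = 2 * hidden_size                                               # two RMSNorm weight vectors per layer
per_layer = attention + mlp + norms

embeddings = 2 * vocab_size * hidden_size  # input embedding + untied output head
total_params = num_layers * per_layer + embeddings + hidden_size  # + final norm


def weight_gib(n_params: int, bits: float) -> float:
    """Memory for n_params weights stored at the given bit width, in GiB."""
    return n_params * bits / 8 / 2**30


print(f"parameters:     {total_params:,}")
print(f"16-bit weights: {weight_gib(total_params, 16):.1f} GiB")
print(f" 4-bit weights: {weight_gib(total_params, 4):.1f} GiB (lower bound)")
```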



## Define SFT Trainer

In [None]:
from trl import SFTTrainer
from peft import LoraConfig
from transformers import TrainingArguments

# path where the Trainer will save its checkpoints and logs
output_dir = './mistral-7b-instruct-lora-v3.0'

# based on config
training_args = TrainingArguments(
    fp16=True,
    do_eval=True,
    evaluation_strategy="epoch",
    gradient_accumulation_steps=3,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    learning_rate=1e-5, # Set the learning rate to a small value (0.00001) to ensure gradual and stable updates during fine-tuning
    log_level="info", # Set the logging level to 'info' to capture general information during the training process
    logging_steps=10, # Log training information every 10 steps to monitor progress
    logging_strategy="steps", # Log based on the number of steps rather than time intervals
    lr_scheduler_type="cosine", # Use a cosine annealing schedule for the learning rate, which gradually reduces the learning rate following a cosine curve
    max_steps=-1, # Set the maximum number of training steps; -1 means no limit, and the number of steps is determined by the number of epochs
    num_train_epochs=10, # Set the total number of training epochs to 10, meaning the model will iterate over the entire dataset 10 times
    output_dir=output_dir,
    optim="paged_adamw_8bit",
    overwrite_output_dir=True,
    per_device_eval_batch_size=1, # originally set to 8
    per_device_train_batch_size=1, # originally set to 8
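    push_to_hub=True, # Upload the trained adapter to the Hugging Face Hub
    hub_model_id="mistral-7b-instruct-lora-v3.0", # Name of the model repository on the Hub
    hub_strategy="every_save", # Push each time the model is saved
    save_strategy="no", # Do not save intermediate checkpoints during training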
    save_total_limit=None,
    seed=42,
)

# based on config
peft_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.1,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

trainer = SFTTrainer(
        model=model,
        #model_init_kwargs=model_kwargs,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        dataset_text_field="text",
        tokenizer=tokenizer,
        packing=True,
        peft_config=peft_config,
        max_seq_length=tokenizer.model_max_length,
    )

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).

Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
PyTorch: setting up devices
PyTorch: setting up devices
Using auto half precision backend
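
As a rough illustration of the `lr_scheduler_type="cosine"` setting, the sketch below computes a cosine decay from the base learning rate to zero. It assumes no warmup steps and a total of 40 optimizer steps (the value reported in the training log); `cosine_lr` is a hypothetical helper, not the library's own function.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Cosine decay from base_lr at step 0 to 0 at total_steps, without warmup."""
    progress = step / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


base_lr = 1e-5
total_steps = 40  # assumption: matches the step count in the training log
for step in (0, 10, 20, 30, 40):
    print(step, f"{cosine_lr(step, total_steps, base_lr):.2e}")
```

The rate falls slowly at first, fastest around the midpoint (where it is half the base rate), and flattens out again near the end.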


In [None]:
trainer.train()

***** Running training *****
  Num examples = 12
  Num Epochs = 10
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 3
  Gradient Accumulation steps = 3
  Total optimization steps = 40
  Number of trainable parameters = 6,815,744
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
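
Two numbers in this log can be reproduced by hand: the 40 optimization steps and the 6,815,744 trainable parameters. The sketch below is a worked check that assumes a single GPU and uses the projection shapes implied by the model config (hidden size 4096, 8 key/value heads with a head dimension of 128).

```python
import math

# Optimizer steps, from the values in the log above
num_examples = 12
per_device_batch = 1
grad_accum = 3
epochs = 10
steps_per_epoch = math.ceil(num_examples / (per_device_batch * grad_accum))
total_steps = steps_per_epoch * epochs

# LoRA adds two matrices to each adapted Linear(n_in, n_out): A (r x n_in) and B (n_out x r)
r = 8
hidden, kv_dim, layers = 4096, 1024, 32  # kv_dim = 8 KV heads * 128 head dim
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
}
lora_per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
trainable = lora_per_layer * layers

print(total_steps, f"{trainable:,}")  # 40 6,815,744
```

Both results match the log, which confirms that only the LoRA matrices on the four attention projections are being trained.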


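| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 2.367471 |
| 2 | No log | 2.337193 |
| 3 | 2.361200 | 2.308207 |
| 4 | 2.361200 | 2.281873 |
| 5 | 2.296000 | 2.259811 |
| 6 | 2.296000 | 2.24301 |
| 7 | 2.296000 | 2.231699 |
| 8 | 2.214100 | 2.225432 |
| 9 | 2.214100 | 2.222899 |
| 10 | 2.252000 | 2.222434 |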



***** Running Evaluation *****
  Num examples = 12
  Batch size = 1



Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=40, training_loss=2.280847501754761, metrics={'train_runtime': 242.5221, 'train_samples_per_second': 0.495, 'train_steps_per_second': 0.165, 'total_flos': 1.04951451746304e+16, 'train_loss': 2.280847501754761, 'epoch': 10.0})
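
The log reports 12 training examples although the dataset has 49 rows. This is consistent with `packing=True`, which concatenates the tokenized texts and cuts them into fixed-length sequences. The sketch below is a simplified illustration of that idea on toy data; `pack_sequences` is a hypothetical helper, and TRL's actual implementation may differ in details such as how the leftover tokens are handled.

```python
from typing import Iterable, List


def pack_sequences(tokenized: Iterable[List[int]], seq_length: int, eos_id: int) -> List[List[int]]:
    """Concatenate token lists (separated by EOS) and cut them into full seq_length chunks."""
    buffer: List[int] = []
    for tokens in tokenized:
        buffer.extend(tokens)
        buffer.append(eos_id)
    n_full = len(buffer) // seq_length  # the incomplete tail is dropped in this sketch
    return [buffer[i * seq_length:(i + 1) * seq_length] for i in range(n_full)]


# Toy data: 5 "documents" of 6 tokens each, packed into sequences of length 8
docs = [[10 + d] * 6 for d in range(5)]
packed = pack_sequences(docs, seq_length=8, eos_id=2)
print(len(docs), "documents ->", len(packed), "packed sequences")
```

With packing, the number of training examples depends on the total token count and `max_seq_length`, not on the number of rows.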

In [None]:

trainer.save_state()

## Evaluate model with prompt

In [None]:
text = """<s>[INST] What are the resource allocations for the Chainlink network? here are the inputs During the initial coin offering (ICO) for LINK in September 2017, Chainlink announced a total and maximum supply of 1,000,000,000 LINK tokens. The current supply is about 453,509,553 LINK tokens, or about 45% of the total supply, as of end-September 2021. The Chainlink price at ICO was $0.11 and a total of 350 million LINK tokens were sold. This represents an over 200X from the ICO price to Chainlink price today.

Chainlink price experienced a massive bull run in the period around mid-2019 to mid-2020. Chainlink bulls were colloquially referred to as “LINK Marines,'' becoming a well-known meme in the crypto community. Chainlink price reached an all-time high of $52.88 on May 9, 2021, on the back of an overall crypto market rally, as well as ongoing developments in the Chainlink ecosystem.

According to the ICO documentation, 35% of the total token supply will go towards node operators and the incentivization of the ecosystem. Another 35% of LINK tokens were distributed during public sale events. Lastly, the remaining 30% of the total token supply was directed towards the company for the continued development of the Chainlink ecosystem and network. [/INST]"""

# Define the device for model inference
device = "cuda:0"

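# Tokenize the input text
# Note: the prompt already starts with <s> and the tokenizer adds its own BOS token,
# which is why the decoded output below begins with <s><s>
inputs = tokenizer(text, return_tensors="pt").to(device)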

# Generate output based on the input
outputs = model.generate(**inputs, max_new_tokens=100)

# Decode and print the generated output
print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s><s>[INST] What are the resource allocations for the Chainlink network? here are the inputs During the initial coin offering (ICO) for LINK in September 2017, Chainlink announced a total and maximum supply of 1,000,000,000 LINK tokens. The current supply is about 453,509,553 LINK tokens, or about 45% of the total supply, as of end-September 2021. The Chainlink price at ICO was $0.11 and a total of 350 million LINK tokens were sold. This represents an over 200X from the ICO price to Chainlink price today.

Chainlink price experienced a massive bull run in the period around mid-2019 to mid-2020. Chainlink bulls were colloquially referred to as “LINK Marines,'' becoming a well-known meme in the crypto community. Chainlink price reached an all-time high of $52.88 on May 9, 2021, on the back of an overall crypto market rally, as well as ongoing developments in the Chainlink ecosystem.

According to the ICO documentation, 35% of the total token supply will go towards node operators and the

In [None]:
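# Note: this prompt is not wrapped in [INST] ... [/INST], so it does not follow the training template
text = """What are the organisations and people involved in the development of Ripple?"""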

# Define the device for model inference
device = "cuda:0"

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt").to(device)

# Generate output based on the input
outputs = model.generate(**inputs, max_new_tokens=100)

# Decode and print the generated output
print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> What are the organisations and people involved in the development of Ripple?

Ripple is an open-source, decentralized payment protocol that enables the transfer of various types of assets, including cryptocurrencies, fiat currencies, and other commodities. The Ripple protocol was initially developed by a company called OpenCoin, which was later renamed Ripple Labs.

Ripple Labs was founded in 2012 by Chris Larsen and Jed McCaleb. Larsen served as the CEO
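
The two prompts above are built by hand, one with the `[INST]` tags and one without. A small helper can keep inference prompts consistent with the training template. This is a minimal sketch: `build_inst_prompt` is a hypothetical name, and it leaves out `<s>` on the assumption that the tokenizer adds the BOS token itself.

```python
def build_inst_prompt(instruction: str, context: str = "") -> str:
    """Build an inference prompt in the training template, without the <s> token."""
    body = f"{instruction} here are the inputs {context}" if context else instruction
    return f"[INST] {body} [/INST]"


prompt = build_inst_prompt(
    "What are the organisations and people involved in the development of Ripple?"
)
print(prompt)
```

The resulting string can be passed to `tokenizer(prompt, return_tensors="pt")` in the same way as `text` in the cells above.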


## Saving model

In [None]:
trainer.save_model("mistral-crypto-2.0")

Saving model checkpoint to mistral-crypto-2.0

tokenizer config file saved in mistral-crypto-2.0/tokenizer_config.json
Special tokens file saved in mistral-crypto-2.0/special_tokens_m

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/27.3M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/5.50k [00:00<?, ?B/s]

## Testing the Hugging Face model

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("mirko5301/mistral-7b-instruct-lora")
model = AutoModelForCausalLM.from_pretrained("mirko5301/mistral-7b-instruct-lora")

text = """What is Algorand?"""

# The model is loaded without a device_map, so it runs on the CPU and the inputs stay on the CPU as well
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

tokenizer_config.json:   0%|          | 0.00/2.08k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/437 [00:00<?, ?B/s]

loading file tokenizer.model from cache at None
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--mirko5301--mistral-7b-instruct-lora/snapshots/22df6a57498aec525465401ebe2429e4829bbf81/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--mirko5301--mistral-7b-instruct-lora/snapshots/22df6a57498aec525465401ebe2429e4829bbf81/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--mirko5301--mistral-7b-instruct-lora/snapshots/22df6a57498aec525465401ebe2429e4829bbf81/tokenizer_config.json


adapter_config.json:   0%|          | 0.00/687 [00:00<?, ?B/s]


adapter_model.safetensors:   0%|          | 0.00/218M [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


What is Algorand? Algorand is an open-source, decentralized, and blockchain-based platform designed for building decentralized applications (dApps) and financial services. It was created by MIT professor Silvio Micali and his team
