# Mistral-Finetune: An Introduction!

In this notebook, we'll be exploring [`mistral-finetune`](https://github.com/mistralai/mistral-finetune), a tool from Mistral AI that, according to its README.md, enables "memory-efficient and performant" fine-tuning of Mistral's models!

It leverages LoRA, an industry staple, in order to achieve this goal.
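
To make the memory saving concrete, here is a minimal back-of-the-envelope sketch. It assumes the standard LoRA formulation, in which a frozen `d_out x d_in` weight matrix is adapted by two trainable low-rank factors of shapes `d_out x r` and `r x d_in`. The 4096 x 4096 layer shape and the helper names are illustrative assumptions, not values read from `mistral-finetune`.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Number of weights in a dense d_out x d_in matrix."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights in the two LoRA factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical layer shape, chosen only for illustration.
d_in, d_out, rank = 4096, 4096, 64

full = full_param_count(d_in, d_out)
lora = lora_param_count(d_in, d_out, rank)

print(f"full fine-tuning: {full:,} trainable weights")
print(f"LoRA (rank {rank}): {lora:,} trainable weights")
print(f"fraction trained: {lora / full:.4f}")
```

Under these assumptions only a few percent of the layer's weights receive gradients and optimizer state, which is where most of the memory saving comes from.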

Let's dive in and see what Mistral's new tool can do for us!

## Gathering Dependencies

First things first, we'll start by cloning the repository and installing some dependencies!

In [None]:
!git clone https://github.com/mistralai/mistral-finetune.git

Cloning into 'mistral-finetune'...
remote: Enumerating objects: 171, done.
remote: Counting objects: 100% (36/36), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 171 (delta 24), reused 14 (delta 14), pack-reused 135
Receiving objects: 100% (171/171), 135.62 KiB | 746.00 KiB/s, done.
Resolving deltas: 100% (75/75), done.


In [None]:
%cd mistral-finetune/

/content/mistral-finetune


In [None]:
!pip install -qUr requirements.txt

  Preparing metadata (setup.py) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.5/5.5 MB 18.5 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.0/166.0 MB 3.2 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21.3/21.3 MB 70.8 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85.5/85.5 kB 13.2 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 394.8/394.8 kB 44.7 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.2/2.2 MB 98.4 MB/s eta 0:00:00
  Building wheel for fire (setup.py) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following 

> NOTE: You can safely ignore the dependency conflicts above.

## Downloading the Model

Next up, we're going to download Mistral 7B v0.3 from Mistral's CDN.

> NOTE: You may experience difficulty downloading the model in the Colab environment. Please retry the download if you see your download speeds crash, or you experience a disconnect.

In [None]:
!wget https://models.mistralcdn.com/mistral-7b-v0-3/mistral-7B-v0.3.tar

--2024-06-05 15:10:54--  https://models.mistralcdn.com/mistral-7b-v0-3/mistral-7B-v0.3.tar
Resolving models.mistralcdn.com (models.mistralcdn.com)... 104.26.7.117, 172.67.70.68, 104.26.6.117, ...
Connecting to models.mistralcdn.com (models.mistralcdn.com)|104.26.7.117|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14496675840 (14G) [application/x-tar]
Saving to: ‘mistral-7B-v0.3.tar.1’


2024-06-05 15:16:46 (39.4 MB/s) - ‘mistral-7B-v0.3.tar.1’ saved [14496675840/14496675840]



Now we want to save our model in a directory called `mistral_models` - you can use whatever directory name you like - but be sure to change every reference to `mistral_models` as well!

In [None]:
!MODEL=/content/mistral_models && mkdir -p $MODEL && tar -xf mistral-7B-v0.3.tar -C $MODEL

## Data Collection and Verification

Next, we'll want to gather our data and modify it into the appropriate instruct format - as noted in [the repository](https://github.com/mistralai/mistral-finetune?tab=readme-ov-file#instruct).

In essence, `mistral-finetune` expects the instruction fine-tuning data to be in the following format:

```python
{
  "messages": [
    {
      "role" : "system",
      "content" : "SYSTEM_PROMPT_1"
    },
    {
      "role" : "user",
      "content" : "USER_PROMPT_1"
    },
    {
      "role" : "assistant",
      "content" : "RESPONSE_1"
    }
  ]
}
{
  "messages": [
    {
      "role" : "system",
      "content" : "SYSTEM_PROMPT_2"
    },
    {
      "role" : "user",
      "content" : "USER_PROMPT_2"
    },
    {
      "role" : "assistant",
      "content" : "RESPONSE_2"
    }
  ]
}
...
```

Notice that the format is `JSONL`!
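
In `JSONL`, each record occupies exactly one line, so each of the pretty-printed objects above would be serialized onto a single line in the actual file. The small sketch below, using only the standard library, shows the idea. The `is_valid_record` helper and its checks are hypothetical and much looser than the validation `mistral-finetune` itself performs; the role set covers only the roles used in this notebook.

```python
import json

# Roles used in this notebook; an assumption made for this illustration only.
ALLOWED_ROLES = {"system", "user", "assistant"}


def is_valid_record(line: str) -> bool:
    """Return True if `line` is one JSON object with a non-empty `messages` list."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    return all(
        isinstance(m, dict)
        and isinstance(m.get("role"), str)
        and m["role"] in ALLOWED_ROLES
        and isinstance(m.get("content"), str)
        for m in messages
    )


record = {
    "messages": [
        {"role": "system", "content": "SYSTEM_PROMPT_1"},
        {"role": "user", "content": "USER_PROMPT_1"},
        {"role": "assistant", "content": "RESPONSE_1"},
    ]
}

# json.dumps emits no newlines by default, so the record fits on one JSONL line.
jsonl_line = json.dumps(record)
print(jsonl_line)
print(is_valid_record(jsonl_line))
```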

We're going to be leveraging a subset of the data from [LIMIT: Less Is More for Instruction Tuning](https://www.databricks.com/blog/limit-less-more-instruction-tuning), specifically `Instruct-v1`, aka `dolly_hhrlhf`!

> NOTE: This dataset will require you to accept terms of use - please navigate to [this link](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) if you have not already done so.

We'll start with creating a data directory, and popping into it.

In [None]:
!mkdir -p data

In [None]:
%cd data

/content/data


We're going to grab a few dependencies here for our dataset!

In [None]:
!pip install -qU datasets huggingface-hub

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.7/6.7 MB[0m [31m18.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m207.3/207.3 kB[0m [31m27.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m289.2/289.2 kB[0m [31m36.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.7/62.7 kB[0m [31m10.1 MB/s[0m eta [36m0:00:00[0m
[?25h

Let's log in to Hugging Face with a READ token.

In [None]:
from huggingface_hub import notebook_login

notebook_login()


Now we can download our data!

In [None]:
from datasets import load_dataset

dataset = load_dataset("mosaicml/dolly_hhrlhf")

Let's take a peek at our dataset to see what kind of shape it's in!

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response'],
        num_rows: 59310
    })
    test: Dataset({
        features: ['prompt', 'response'],
        num_rows: 5129
    })
})

In [None]:
dataset["train"][0]

{'prompt': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nwhat is kangen water?\n\n### Response:\n',
 'response': 'Kangen water is alkaline ionized water produced through a process called electrolysis.  Kangen water is also referred to electrolyzed reduced water.  This water is characterized by an large negative oxidation reduction potential and a potential hydrogen level > 7.0 making the water alkaline.  It is also infused with molecular hydrogen in the amount of 1 - 1.5 parts per million per gallon of water produced.  This infused hydrogen has been shown to be a very good anti-inflammatory for the body.'}

As we can see, this is not the expected format - so we'll need to reformat our data.

We can do this with `dataset.map()`, which simply needs a formatting function.

In [None]:
def mistral_finetune_format(sample):
  # System prompt: text before "### Instruction:"; user prompt: text up to "### Response:".
  system_prompt = sample["prompt"].split("### Instruction:")[0].strip()
  user_prompt = sample["prompt"].split("### Instruction:")[-1].split("### Response:")[0].strip()
  messages = [
    {"role" : "system", "content" : system_prompt},
    {"role" : "user", "content" : user_prompt},
    {"role" : "assistant", "content" : sample["response"]},
  ]
  return {"data" : {"messages" : messages}}

Let's verify our formatting function worked!

In [None]:
mistral_finetune_format(dataset["train"][0])

{'data': {'messages': [{'role': 'system',
    'content': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.'},
   {'role': 'user', 'content': 'what is kangen water?'},
   {'role': 'assistant',
    'content': 'Kangen water is alkaline ionized water produced through a process called electrolysis.  Kangen water is also referred to electrolyzed reduced water.  This water is characterized by an large negative oxidation reduction potential and a potential hydrogen level > 7.0 making the water alkaline.  It is also infused with molecular hydrogen in the amount of 1 - 1.5 parts per million per gallon of water produced.  This infused hydrogen has been shown to be a very good anti-inflammatory for the body.'}]}}

Now that our data formatter is tested - let's map it across the entire dataset!

In [None]:
formatted_dataset = dataset.map(mistral_finetune_format)

Map:   0%|          | 0/59310 [00:00<?, ? examples/s]

Map:   0%|          | 0/5129 [00:00<?, ? examples/s]

In [None]:
formatted_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'data'],
        num_rows: 59310
    })
    test: Dataset({
        features: ['prompt', 'response', 'data'],
        num_rows: 5129
    })
})

Let's save our data as a `JSONL` file for compatibility!

We'll create a training set and an evaluation set.

In [None]:
import json

file_path = "/content/data/train_instruct.jsonl"

with open(file_path, "w") as file:
  for item in formatted_dataset["train"]["data"]:
    json_str = json.dumps(item)
    file.write(json_str + "\n")

In [None]:
file_path = "/content/data/test_instruct.jsonl"

with open(file_path, "w") as file:
  for item in formatted_dataset["test"]["data"]:
    json_str = json.dumps(item)
    file.write(json_str + "\n")

### Verifying the Dataset

We can use the provided tools to verify that our dataset is in the correct shape - let's first pass our dataset through the reformat script to clean up, or skip, any problematic samples!

In [None]:
!python -m utils.reformat_data /content/data/train_instruct.jsonl

In [None]:
!python -m utils.reformat_data /content/data/test_instruct.jsonl

Now that our reformat completed with no issues - we can move to validating our data - but before we do, we need to talk about the `.yaml` file that acts as a guide for our training process.

Let's make it together in the following cells - we'll start by adding a reference to our data.

Notice that our data is under the `data` header.

In [None]:
training_dataset_path = "/content/data/train_instruct.jsonl"
eval_dataset_path = "/content/data/test_instruct.jsonl"

training_yaml = f"""\
data:
  instruct_data: '{training_dataset_path}'
  eval_instruct_data: '{eval_dataset_path}'
"""

Next, we'll add a reference to our downloaded and extracted model!

In [None]:
model_path = "/content/mistral_models"

training_yaml += f"\nmodel_id_or_path: '{model_path}'"

Now we can add some additional training parameters.

These are typical, and similar to what you'd see in something like `transformers` from Hugging Face!

In [None]:
LORA_RANK = 64
SEQ_LEN = 4092
BATCH_SIZE = 1
NUM_MICROBATCHES = 8
MAX_STEPS = 300

LEARNING_RATE = 1e-4
WEIGHT_DECAY = 0.1

OUTPUT_DIR = "/content/limit_test"

In [None]:
training_yaml += f"""
# optim
seq_len: {SEQ_LEN}
batch_size: {BATCH_SIZE}
num_microbatches: {NUM_MICROBATCHES}
max_steps: {MAX_STEPS}

optim:
  lr: {LEARNING_RATE}
  weight_decay: {WEIGHT_DECAY}
  pct_start: 0.05

# other
seed: 0
log_freq: 1
eval_freq: 100
no_eval: False
ckpt_freq: 100

save_adapters: True

run_dir: '{OUTPUT_DIR}'
"""
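
To get a feel for the scale of this run, the sketch below estimates how many tokens the configuration processes. It assumes that each optimizer step accumulates `num_microbatches` micro-batches of `batch_size` sequences per GPU, each `seq_len` tokens long; that reading of the parameters is an assumption and worth checking against the repository's README. The `tokens_per_step` helper is hypothetical.

```python
def tokens_per_step(seq_len: int, batch_size: int, num_microbatches: int, num_gpus: int = 1) -> int:
    """Tokens consumed by one optimizer step, under the assumption stated above."""
    return seq_len * batch_size * num_microbatches * num_gpus


# Values taken from the configuration used in this notebook (single GPU).
per_step = tokens_per_step(seq_len=4092, batch_size=1, num_microbatches=8)
total = per_step * 300  # max_steps

print(f"tokens per optimizer step: {per_step:,}")
print(f"tokens over the whole run: {total:,}")
```

If the assumption holds, this is an upper bound on the tokens the model sees during training, which is useful when deciding whether `max_steps` covers enough of the dataset.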

### Weights and Biases Integration

Now we can add references to our Weights and Biases project, API key, and run name!

This integration is straightforward and lets us monitor our fine-tuning very easily!

In [None]:
!pip install -qU wandb

Now we can add these Weights and Biases configurations to our `.yaml` file!

In [None]:
import getpass

WANDB_PROJECT = "MistralFinetune"
WANDB_RUN_NAME = "DollyInstruct"
API_KEY = getpass.getpass("WandB API Key:")

WandB API Key:··········


In [None]:
training_yaml += f"""
wandb:
  project: '{WANDB_PROJECT}'
  run_name: '{WANDB_RUN_NAME}'
  key: '{API_KEY}'
  offline: False
"""
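
Note that the key is interpolated into the YAML string, so it is written in plain text once the config is saved to disk. A minimal sketch of one way to avoid retyping it is shown below: it prefers the `WANDB_API_KEY` environment variable and only falls back to a prompt when the variable is unset. The `resolve_wandb_key` helper is hypothetical and not part of `mistral-finetune` or `wandb`.

```python
from typing import Callable, Mapping


def resolve_wandb_key(env: Mapping[str, str], prompt: Callable[[], str]) -> str:
    """Return the key from `env` if present, otherwise ask for it via `prompt`."""
    key = env.get("WANDB_API_KEY", "").strip()
    return key if key else prompt()


# In the notebook this could be wired up as (requires `import os, getpass`):
#   API_KEY = resolve_wandb_key(os.environ, lambda: getpass.getpass("WandB API Key:"))

# Demonstration with placeholder values instead of a real key.
demo_key = resolve_wandb_key({"WANDB_API_KEY": "example-key"}, lambda: "typed-key")
print(demo_key)
```

Passing the environment and the prompt as arguments keeps the helper easy to test without touching real credentials.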

Now let's save our `.yaml` file and use it to validate our data!

In [None]:
import yaml
with open('/content/instruct_tune_mistral_7B.yaml', 'w') as file:
    yaml.dump(yaml.safe_load(training_yaml), file)

In [None]:
!python -m utils.validate_data --train_yaml /content/instruct_tune_mistral_7B.yaml

0it [00:00, ?it/s]Validating /content/data/train_instruct.jsonl ...

  0% 0/59310 [00:00<?, ?it/s]
  0% 128/59310 [00:00<00:46, 1277.78it/s]
  0% 286/59310 [00:00<00:40, 1452.69it/s]
  1% 432/59310 [00:00<00:40, 1451.90it/s]
  1% 578/59310 [00:00<00:42, 1396.78it/s]
  1% 743/59310 [00:00<00:39, 1484.30it/s]
  2% 892/59310 [00:00<00:39, 1484.37it/s]
  2% 1052/59310 [00:00<00:38, 1519.21it/s]
  2% 1223/59310 [00:00<00:37, 1552.61it/s]
  2% 1379/59310 [00:00<00:37, 1541.97it/s]
  3% 1534/59310 [00:01<00:38, 1499.33it/s]
  3% 1705/59310 [00:01<00:36, 1561.28it/s]
  3% 1862/59310 [00:01<00:37, 1547.57it/s]
  3% 2018/59310 [00:01<00:39, 1466.52it/s]
  4% 2166/59310 [00:01<00:41, 1387.80it/s]
  4% 2320/59310 [00:01<00:39, 1429.32it/s]
  4% 2465/59310 [00:01<00:40, 1410.41it/s]
  4% 2607/59310 [00:01<00:40, 1405.32it/s]
  5% 2770/59310 [00:01<00:38, 1468.58it/s]
  5% 2921/59310 [00:01<00:38, 1478.15it/s]
  5% 3070/59310 [00:02<00:38,

## Model Training

Now that we have our `.yaml` file - we can go ahead and train our model!

We need to do a bit of bookkeeping for the Colab environment before moving on.

In [None]:
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"

We'll also make sure that our `OUTPUT_DIR` does not exist to avoid errors.

In [None]:
!rm -rf /content/limit_test

Now - we can train!

We'll use `torchrun` to run our `train` script leveraging the created `.yaml` file - and away we go!

In [None]:
!torchrun --nproc-per-node 1 -m train /content/instruct_tune_mistral_7B.yaml

2024-06-05 17:19:19.422143: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-05 17:19:19.476085: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-05 17:19:19.476136: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-05 17:19:19.478064: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-06-05 17:19:19.486942: I tensorflow/core/platform/cpu_feature_guar

## Inference with Mistral

Now that we have a trained model - let's see how it responds!

First up - let's install the `mistral_inference` library.

In [None]:
!pip install -qU mistral_inference

Similar to the `transformers` library - we have a set of useful imports that, for the most part, just do what they say!

In [None]:
from mistral_inference.model import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage, SystemMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

Now we can load our downloaded model, our downloaded tokenizer, and our fine-tuned adapter!

In [None]:
tokenizer = MistralTokenizer.from_file("/content/mistral_models/tokenizer.model.v3")
model = Transformer.from_folder("/content/mistral_models")
model.load_lora("/content/limit_test/checkpoints/checkpoint_000100/consolidated/lora.safetensors")
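
The path above hard-codes the step-100 checkpoint. Since the config saves a checkpoint every 100 steps, later ones may exist too once training finishes. The sketch below picks the most recent adapter file, assuming the `checkpoints/checkpoint_<step>/consolidated/lora.safetensors` layout seen in that path; the `latest_lora_checkpoint` helper is hypothetical, and the demo builds a throwaway directory rather than touching a real run.

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_lora_checkpoint(run_dir: str) -> Optional[Path]:
    """Return the adapter file of the highest-numbered checkpoint, or None if there is none."""
    ckpt_root = Path(run_dir) / "checkpoints"
    if not ckpt_root.is_dir():
        return None
    # Step numbers are zero-padded, so lexicographic order matches numeric order.
    candidates = sorted(ckpt_root.glob("checkpoint_*/consolidated/lora.safetensors"))
    return candidates[-1] if candidates else None


# Demo on a temporary directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    for step in (100, 200):
        target = Path(tmp) / "checkpoints" / f"checkpoint_{step:06d}" / "consolidated"
        target.mkdir(parents=True)
        (target / "lora.safetensors").touch()
    latest = latest_lora_checkpoint(tmp)
    latest_step_dir = latest.parent.parent.name

print(latest_step_dir)
```

In the notebook, the returned path could be passed to `model.load_lora(str(path))` in place of the hard-coded string.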

In a very familiar format - we can create a request to our model!

We'll be sure to use the Instruction template we created before, and give a sample request!

In [None]:
completion_request = ChatCompletionRequest(
    messages=
      [
        SystemMessage(content="Below is an instruction that describes a task. Write a response that appropriately completes the request."),
        UserMessage(content="Explain Machine Learning to me in a nutshell.")
      ]
)

We'll go ahead and tokenize our chat completion!

In [None]:
tokens = tokenizer.encode_chat_completion(completion_request).tokens

Now we can generate a response and see how it did!

In [None]:
out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])

In [None]:
print(result)

Machine Learning is a subset of Artificial Intelligence that allows computers to learn from data without being explicitly programmed. Machine Learning algorithms use statistical techniques to give computers the ability to learn without being explicitly programmed. Machine Learning focuses on the development of computer programs that can access data and use it to learn for themselves.


This is a suitable response! Great job, model!