# OPEN SOURCE MODELS IN MISTRAL

- `Mistral-7B` - A 7B transformer model, fast-deployed and easily customisable. Small, yet very powerful for a variety of use cases.
    - Performant in English and code
    - 32k context window

- `Mixtral-8x7B` - A 7B sparse Mixture-of-Experts (SMoE). Uses 12.9B active parameters out of 45B total.
    - Fluent in English, French, Italian, German, Spanish, and strong in code.
    - 32k context window

- `Mixtral-8x22B` - Currently the most performant open model. A 22B sparse Mixture-of-Experts (SMoE). Uses only 39B active parameters out of 141B.
    - Fluent in English, French, Italian, German, Spanish, and strong in code.
    - 64k context window.
    - Native function calling capabilities.
    - Function calling and JSON mode available on Mistral's API endpoint.

- Mistral also offers `Optimized models` such as `Mistral-small`, `Mistral-large` and `Mistral-Embed`. You can find them on [Mistral's website](https://mistral.ai/technology/#models) and on [Hugging Face](https://huggingface.co/mistralai).
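
To make the active-versus-total parameter figures for the two Mixture-of-Experts models above concrete, here is a minimal sketch that computes the share of weights used per token. The numbers are the ones quoted in the list; the dictionary layout and the function name `active_fraction` are illustrative choices, not part of any Mistral API.

```python
# Active vs. total parameters, taken from the model list above (in billions).
models = {
    "Mixtral-8x7B": {"active_b": 12.9, "total_b": 45.0},
    "Mixtral-8x22B": {"active_b": 39.0, "total_b": 141.0},
}


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the weights that take part in processing a single token."""
    return active_b / total_b


fractions = {name: active_fraction(**sizes) for name, sizes in models.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters active per token")
```

In both models a bit more than a quarter of the weights are active for any given token, which is why a sparse model can be cheaper to run than a dense model of the same total size.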

# FINETUNING `Mixtral-8x7B Model`
- I chose the `Mixtral-8x7B` model rather than `Mistral-7B` because the preprocessed FlanV2_19k_samples dataset contains English along with other languages, which a multilingual model should find easier to understand and train on. `Mistral-7B` is tuned mainly for English and code, so it would likely perform worse here unless the sample size were much larger. Note that the loading cell below uses the `unsloth/mistral-7b-v0.3` checkpoint.

In [None]:
!pip install -r /content/drive/MyDrive/requirements.txt

In [28]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes

In [1]:
# Log in to your Hugging Face account.
# You need a Hugging Face access token with WRITE permission.
from huggingface_hub import notebook_login
notebook_login()
# You can also log in from a terminal with `huggingface-cli login`
# and verify the account with `huggingface-cli whoami`.


In [2]:
import torch
import pandas as pd
from datasets import load_dataset, Dataset
from peft import LoraConfig, AutoPeftModelForCausalLM, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, GPTQConfig, TrainingArguments
from trl import SFTTrainer
import os
from transformers import LongformerTokenizer
from unsloth import FastLanguageModel

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [3]:
# Load the preprocessed dataset that was pushed to the Hugging Face Hub.
dataset = load_dataset("karthiksagarn/FlanV2-2024")
dataset

DatasetDict({
    train: Dataset({
        features: ['inputs', 'targets', 'task_source', 'task_name', 'template_type'],
        num_rows: 19391
    })
})

In [4]:
data = dataset['train'].to_pandas()

In [5]:
data

| | inputs | targets | task_source | task_name | template_type |
|---|---|---|---|---|---|
| 0 | Write an article based on this summary:\n\nPur... | They should include long screws and wall ancho... | Flan2021 | gem/wiki_lingua_english_en:1.1.0 | zs_opt |
| 1 | Problem: What would be an example of an negati... | I go here about once every two weeks. They con... | Flan2021 | yelp_polarity_reviews:0.2.0 | fs_noopt |
| 2 | Input: Qingdao is located in northeast China, ... | Queens Park Rangers manager Harry Redknapp is ... | Flan2021 | cnn_dailymail:3.4.0 | fs_opt |
| 3 | Input: Steven Lippard, 7, was playing in the d... | A 21-year-old man in Chicago is charged with b... | Flan2021 | cnn_dailymail:3.4.0 | fs_opt |
| 4 | Here is a news article: In his last press conf... | – President Obama held the final press confere... | Flan2021 | multi_news:1.0.0 | zs_noopt |
| ... | ... | ... | ... | ... | ... |
| 19386 | Consider this response: Wa also occurs as a co... | DIALOG:\nWhat is a The Burning City?\n- The to... | Dialog | wiki_dialog_ii | fs_opt |
| 19387 | What came before. The bridge has been toll-fre... | -When was the New Hope Lambertville Bridge bui... | Dialog | wiki_dialog_ii | zs_opt |
| 19388 | Consider this response: To provide a high rate... | DIALOG:\nWhat was George Lawrence Stone's tech... | Dialog | wiki_dialog_ii | fs_opt |
| 19389 | Read this response and predict the preceding d... | 2-way dialog:\n+ What is the difference betwee... | Dialog | wiki_dialog_ii | zs_opt |


- Many tokenizers are available (for example bert-base-uncased or longformer-base), but since we are fine-tuning a Mistral model we use Mistral's own tokenizer, which is the one that matches the model's vocabulary.

In [5]:
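max_seq_length = 2048  # maximum sequence length the model is set up for
dtype = None           # None auto-detects: float16 on Tesla T4/V100, bfloat16 on Ampere and newer
load_in_4bit = True    # 4-bit quantization to reduce memory usage

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-v0.3",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)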

==((====))==  Unsloth: Fast Mistral patching release 2024.7
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unsloth: Will load unsloth/mistral-7b-v0.3-bnb-4bit as a legacy tokenizer.
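
As a rough illustration of why `load_in_4bit=True` matters on this GPU, the sketch below estimates the memory taken by the model weights alone at 16-bit and 4-bit precision. The parameter count of about 7.25 billion is an assumption for Mistral-7B, and the estimate deliberately ignores activations, LoRA weights, optimizer state and quantization metadata, so treat it as a back-of-the-envelope figure rather than a measurement.

```python
# Weight-only memory estimate; ignores activations, LoRA weights,
# optimizer state and quantization metadata.
N_PARAMS = 7.25e9        # assumption: approximate parameter count of Mistral-7B
GPU_MEMORY_GB = 14.748   # "Max memory" reported in the Unsloth banner above


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f" 4-bit weights: ~{int4_gb:.1f} GB")
print(f"GPU memory:      {GPU_MEMORY_GB} GB")
```

Under these assumptions the 16-bit weights alone would nearly fill the T4, while the 4-bit weights leave most of the card free for activations and the LoRA training state.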


In [6]:
## Add LoRA adapter weights to the model.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # LoRA rank; any integer > 0 works. Suggested: 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", ## Layers that receive LoRA weights: the attention query,
                      "gate_proj", "up_proj", "down_proj",], ## key, value and output projections, plus the MLP gate,
                                                             ## up and down projections of each transformer block.
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True, # True or "unsloth" for very long context
    random_state = 3407, # random seed, fixed for reproducibility
    use_rslora = False,  # Rank-stabilized LoRA (rsLoRA) is supported; it scales the update by lora_alpha / sqrt(r) instead of lora_alpha / r.
    loftq_config = None, # And LoftQ
)

Unsloth 2024.7 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
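
The number of trainable parameters that LoRA adds can be worked out by hand: each adapted linear layer of shape `in × out` gets two low-rank matrices with `r * (in + out)` weights in total. The sketch below does this for the seven target modules, assuming the standard Mistral-7B dimensions (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers); these dimensions are not stated in the notebook itself.

```python
R = 16               # LoRA rank used in the cell above
HIDDEN = 4096        # assumption: Mistral-7B hidden size
KV_DIM = 1024        # assumption: 8 key/value heads x 128 dims (grouped-query attention)
INTERMEDIATE = 14336 # assumption: Mistral-7B MLP width
N_LAYERS = 32        # matches "patched 32 layers" in the output above

# (in_features, out_features) of each linear layer that receives an adapter.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Weights in the A (r x in) and B (out x r) matrices of one adapter."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in shapes.values())
total_trainable = per_layer * N_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total_trainable:,}")
```

Under these assumptions the total comes to 41,943,040, which agrees with the "Number of trainable parameters" that the training banner reports later in the notebook.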


In [7]:
## Reformat the dataset into the prompt template used for fine-tuning.
## Append the EOS (end-of-sequence) token to every example; otherwise generation may never stop.

EOS_TOKEN = tokenizer.eos_token
data["text"] = data.apply(lambda row: "###HUMAN: " + row["inputs"] + " " + "###ASSISTANT: " + row["targets"] + " " + EOS_TOKEN, axis=1)

In [8]:
data["text"]

0        ###HUMAN: Write an article based on this summa...
1        ###HUMAN: Problem: What would be an example of...
2        ###HUMAN: Input: Qingdao is located in northea...
3        ###HUMAN: Input: Steven Lippard, 7, was playin...
4        ###HUMAN: Here is a news article: In his last ...
                               ...                        
19386    ###HUMAN: Consider this response: Wa also occu...
19387    ###HUMAN: What came before. The bridge has bee...
19388    ###HUMAN: Consider this response: To provide a...
19389    ###HUMAN: Read this response and predict the p...
19390    ###HUMAN: Consider this response: One suggesti...
Name: text, Length: 19391, dtype: object
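
The same template can be written as a small function, which makes it easier to reuse at inference time. This is a minimal sketch: `format_example` and `format_prompt` are hypothetical helper names, and the `"</s>"` value is only a stand-in for `tokenizer.eos_token` so that the snippet runs without loading a model.

```python
EOS_TOKEN = "</s>"  # assumption: stand-in; the notebook uses tokenizer.eos_token


def format_example(inputs: str, targets: str, eos_token: str = EOS_TOKEN) -> str:
    """Build one training string in the same layout as the lambda above."""
    return "###HUMAN: " + inputs + " " + "###ASSISTANT: " + targets + " " + eos_token


def format_prompt(inputs: str) -> str:
    """Build an inference prompt: same layout, but stop before the answer."""
    return "###HUMAN: " + inputs + " " + "###ASSISTANT: "


example = format_example("What is 2 + 2?", "4")
prompt = format_prompt("What is 2 + 2?")
print(example)
print(prompt)
```

Keeping the inference prompt identical to the training prefix matters because the model only learns to answer after the exact `###ASSISTANT: ` marker it saw during fine-tuning.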

In [9]:
dataset = Dataset.from_pandas(data)

In [11]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 512, # examples longer than 512 tokens are truncated (the model was loaded with max_seq_length = 2048)
    dataset_num_proc = 1,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 8,
        warmup_steps = 5,
        num_train_epochs = 1,
        # max_steps = 60, # Set num_train_epochs = 1 for full training runs
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        save_steps=500,  # Save checkpoints less frequently
        save_total_limit=2,
    ),
)

Map:   0%|          | 0/19391 [00:00<?, ? examples/s]
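
The combination of `warmup_steps = 5`, `learning_rate = 2e-4` and `lr_scheduler_type = "linear"` describes a learning rate that ramps up linearly and then decays linearly to zero. The sketch below reproduces that shape in plain Python; it is an illustration of the schedule, not the trainer's own code, and the default of 2,423 total steps is taken from the training banner that appears further down.

```python
def linear_lr(step: int, base_lr: float = 2e-4,
              warmup_steps: int = 5, total_steps: int = 2423) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Ramp from 0 up to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr down to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 5, 1214, 2423):
    print(f"step {step:>4}: lr = {linear_lr(step):.2e}")
```

The warmup is very short compared with the run, so almost the entire epoch is spent on the decaying part of the schedule.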

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 19,391 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 8
\        /    Total batch size = 8 | Total steps = 2,423
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|---|---|
| 10 | 1.91 |
| 20 | 1.7381 |
| 30 | 1.5367 |
| 40 | 1.5842 |
| 50 | 1.5311 |
| 60 | 1.6036 |
| 70 | 1.6146 |
| 80 | 1.6014 |
| 90 | 1.4468 |
| 100 | 1.5352 |
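
The step counts in the banner and the spacing of the loss table follow from the training arguments. The sketch below redoes that arithmetic; it assumes that the optimizer steps per epoch are the number of examples divided by the effective batch size, rounded down, which is consistent with the 2,423 steps reported above.

```python
num_examples = 19_391        # "Num examples" from the banner
per_device_batch_size = 1    # per_device_train_batch_size
gradient_accumulation = 8    # gradient_accumulation_steps
num_gpus = 1                 # "Num GPUs" from the banner
num_epochs = 1               # num_train_epochs
logging_steps = 10           # one loss value is logged every 10 steps

# Examples consumed per optimizer update.
effective_batch_size = per_device_batch_size * gradient_accumulation * num_gpus

# Assumption: a trailing partial accumulation window does not count as a step.
steps_per_epoch = num_examples // effective_batch_size
total_steps = steps_per_epoch * num_epochs
logged_rows = total_steps // logging_steps

print(f"Effective batch size: {effective_batch_size}")
print(f"Total optimizer steps: {total_steps:,}")
print(f"Loss values logged over the full run: {logged_rows}")
```

With `gradient_accumulation_steps = 8`, each row of the loss table therefore summarizes 80 training examples.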
