<a href="https://colab.research.google.com/github/muskaanpatel14/FinetuningMistral7BPractise/blob/main/Mistral7BFinetuningLLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is an attempt to fine-tune Mistral 7B, a pretrained generative text model with 7 billion parameters. According to its authors, Mistral-7B-v0.1 outperforms Llama 2 13B on all the benchmarks they tested.

## Loading the Mistral Model
The first step is to load the Mistral model with 4-bit quantization, using the bitsandbytes library through its Hugging Face Transformers integration. Quantization is what makes our goal, fine-tuning the model on a single GPU, achievable: it takes the weights, which are stored in float32 format, and compresses them to a smaller format, here 4 bits.
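
To get a feel for why quantization matters here, the rough arithmetic can be sketched as follows. The parameter count is the base-model figure implied by the totals reported later in this notebook; the helper name `weight_memory_gb` is made up for this illustration, and the estimate deliberately ignores quantization constants, activations, and optimizer state.

```python
# Back-of-the-envelope estimate of the memory taken by the weights alone.
# Assumption: every parameter is stored at the given bit width.

def weight_memory_gb(n_params: int, bits_per_param: float) -> float:
    """Return the weight storage size in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# 7,284,252,672 total parameters minus 42,520,576 LoRA parameters
N_PARAMS = 7_284_252_672 - 42_520_576

for name, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{name:>8}: {weight_memory_gb(N_PARAMS, bits):6.2f} GB")
```

Under these assumptions the weights shrink from roughly 29 GB in float32 to under 4 GB in 4 bits, which is what lets the model fit on a single Colab GPU.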


In [1]:
! pip install peft
! pip install trl ninja packaging
! pip install transformers
! pip install -U datasets
! pip install -i https://pypi.org/simple/ bitsandbytes
! pip install accelerate


With the necessary packages installed, we import the modules used in the rest of the notebook.

In [2]:
import torch
import os
import sys
import json
import IPython
from datetime import datetime
from datasets import load_dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer


As the base model, I have chosen the 7B model from Mistral AI, which performs very well compared to other models of its size. To make it easy to use in Google Colab and avoid Out-Of-Memory (OOM) errors, I use a community re-upload of the model that is split into more, smaller shards, which allows it to be loaded in the free version of Colab without saturating the RAM.

In [3]:
model_name = "Hugofernandez/Mistral-7B-v0.1-colab-sharded"
# Set the device
device = 'cuda'
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
# Reuse the unknown token (unk_token) as the padding token, since the tokenizer does not define one
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.unk_token_id


Next, we define the quantization parameters, following the settings recommended for QLoRA: we load the model in 4 bits, use the NF4 format (4-bit NormalFloat, a data type designed to be optimal for normally distributed weights), and enable double quantization, which quantizes the quantization constants themselves for further memory savings. Computations, however, cannot be carried out in 4 bits: they are performed in float16 or bfloat16, depending on the GPU, so the weights are dequantized on the fly for each calculation while remaining stored in the compressed format.


In [4]:
compute_dtype = getattr(torch, "float16")
print(compute_dtype)
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)

torch.float16
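
The extra saving from double quantization can be estimated with a little arithmetic. This is a sketch under assumed settings taken from the QLoRA paper rather than from this notebook: one float32 scaling constant per block of 64 weights and, with double quantization, those constants themselves stored in 8 bits with one float32 constant per group of 256.

```python
# Per-parameter overhead of the quantization constants, in bits.
# Block sizes are assumptions (QLoRA paper defaults), not values read from the config above.

WEIGHT_BLOCK = 64      # weights sharing one scaling constant
CONSTANT_BLOCK = 256   # scaling constants sharing one second-level constant

def overhead_bits(double_quant: bool) -> float:
    if not double_quant:
        return 32 / WEIGHT_BLOCK  # one float32 constant per block
    # 8-bit first-level constants plus float32 second-level constants
    return 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)

plain = overhead_bits(double_quant=False)
double = overhead_bits(double_quant=True)
print(f"without double quantization: {plain:.3f} bits/param")
print(f"with double quantization:    {double:.3f} bits/param")
print(f"saving on 7e9 params:        {(plain - double) * 7e9 / 8 / 1e9:.2f} GB")
```

The saving is small next to the 4 bits per weight, but on a GPU that is nearly full it can make the difference.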


Next, we load the model and quantize it on the fly using the previous configuration.


In [5]:
import bitsandbytes  # must be installed for 4-bit loading
model = AutoModelForCausalLM.from_pretrained(
          model_name,
          quantization_config=bnb_config,  # quantize to 4-bit NF4 while loading
          use_flash_attention_2=False,
          device_map={"": 0},  # place the whole model on GPU 0; device_map="auto" will cause a problem in the training
)


We can then verify that the model has been loaded successfully, that its linear layers have been replaced by Linear4bit modules, and that it is ready to be trained.

In [6]:
print(model)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )

Next, we define the LoRA parameters. The rank r is the inner dimension of the two low-rank matrices that LoRA adds next to each targeted layer. The higher the rank, the more weights these matrices contain. We set it to 16 for this example, but you can increase it if the performance is not satisfactory, or decrease it to reduce the number of trainable parameters.

The dropout rate is the probability with which each input to the LoRA layers is set to 0 during training, which makes the network more robust and helps prevent overfitting.

The target_modules are the names of the modules to which LoRA is applied; they are the names that appeared when we printed the model (q_proj, k_proj, v_proj, etc.).

In [7]:
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj", "lm_head",]
)
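
To make the effect of `r` concrete, the number of trainable weights can be worked out by hand: for a linear layer with `in_features` inputs and `out_features` outputs, LoRA adds two matrices with `r * (in_features + out_features)` weights in total. The sketch below uses the layer shapes from the printed model; the shape of `lm_head` (4096 to 32000) is an assumption inferred from the embedding size, since the printout is cut off before it.

```python
# Count LoRA parameters for the target modules, using shapes from print(model).
R = 16
N_LAYERS = 32

# (in_features, out_features) of each targeted module inside one decoder layer
PER_LAYER_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
LM_HEAD_SHAPE = (4096, 32000)  # assumed: hidden size -> vocabulary size

def lora_params(in_features: int, out_features: int, r: int) -> int:
    """A is (r x in_features) and B is (out_features x r)."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in PER_LAYER_SHAPES.values())
total = N_LAYERS * per_layer + lora_params(*LM_HEAD_SHAPE, R)
print(f"LoRA parameters per decoder layer: {per_layer:,}")
print(f"LoRA parameters in total:          {total:,}")
```

With these shapes the total comes to 42,520,576, the same figure the trainer reports as trainable parameters further down; doubling `r` would double it. Note also that with `lora_alpha=16` and `r=16`, the LoRA update is scaled by `lora_alpha / r = 1`.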

In [8]:
# Prepare the quantized model for training (casts some modules to fp32)
model = prepare_model_for_kbit_training(model)
# Configure the pad token in the model
model.config.pad_token_id = tokenizer.pad_token_id
# Disable the KV cache: it is not compatible with gradient checkpointing, which is used by default
model.config.use_cache = False

Training Arguments -
* For the optimizer, we use the paged optimizer introduced with QLoRA. Paged optimizers rely on NVIDIA unified memory to move optimizer states between the GPU and the CPU page by page. It is mainly used here to absorb memory spikes and avoid out-of-memory errors.
* We keep the learning rate small (4e-4 here) because we want to stay close to the original model.
* We set the number of epochs to 1.

In [9]:
training_arguments = TrainingArguments(
        output_dir="./results", # directory in which the checkpoints will be saved
        evaluation_strategy="epoch", # set it to "steps" to evaluate every eval_steps instead
        optim="paged_adamw_8bit", # paged 8-bit AdamW, used with QLoRA
        per_device_train_batch_size=4, # training batch size per device
        per_device_eval_batch_size=4, # same, but for evaluation
        gradient_accumulation_steps=1, # number of batches accumulated before each optimizer step; careful, this changes the size of a "step": logging, evaluation and saving then count in units of gradient_accumulation_steps batches
        log_level="debug", # other options include "info", "warning", "error" and "critical"
        save_steps=500, # number of steps between checkpoints
        logging_steps=20, # number of steps between loss logs; adapt it to your dataset size
        learning_rate=4e-4, # you can try different values for this hyperparameter
        num_train_epochs=1,
        warmup_steps=100, # note: no warmup is applied with the "constant" scheduler; "constant_with_warmup" would use it
        lr_scheduler_type="constant",
)

We load the TLDR news dataset from the Hugging Face Hub.

In [10]:
from datasets import load_dataset

# Download the TLDR news dataset, forcing a fresh download instead of using the local cache
dataset = load_dataset("JulesBelveze/tldr_news", download_mode="force_redownload")

# The dataset already provides separate train and test splits, so no manual split is needed

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")


Train dataset size: 7138
Test dataset size: 794
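
With these split sizes and the batch settings from the training arguments, the length of the run can be estimated before launching it. A small sketch (the helper name `steps_per_epoch` is illustrative; a single GPU and `gradient_accumulation_steps=1` are assumed, as in this notebook):

```python
import math

def steps_per_epoch(n_examples: int, batch_size: int) -> int:
    """Batches per epoch on one device, keeping the last partial batch."""
    return math.ceil(n_examples / batch_size)

train_steps = steps_per_epoch(7138, batch_size=4)
eval_batches = steps_per_epoch(794, batch_size=4)
checkpoints = train_steps // 500   # save_steps=500
log_events = train_steps // 20     # logging_steps=20

print(train_steps, eval_batches, checkpoints, log_events)
```

This gives 1,785 optimizer steps for one epoch, matching the step count in the training log, and three intermediate checkpoints at steps 500, 1000 and 1500.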


We define a trainer with the model, the tokenizer, the training and evaluation sets, the PEFT config, and the training arguments defined previously.

In [12]:
trainer = SFTTrainer(
        model=model,
        train_dataset=dataset['train'],
        eval_dataset=dataset['test'],
        peft_config=peft_config,
        dataset_text_field="content",
        tokenizer=tokenizer,
        args=training_arguments,
)
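
Note that the trainer above sees only the raw `content` field, whereas the test prompts used after training wrap a question in `[INST]...[/INST]`. One possible variation, sketched here under the assumption that each record also has a `headline` field and that `</s>` is the end-of-sequence token, is to build the training text in the same layout as the prompts. The function name `format_example` is hypothetical and this is not what the notebook actually ran.

```python
def format_example(example: dict, eos_token: str = "</s>") -> str:
    """Build one training string in the same layout as the test prompts.

    The leading <s> is left out on purpose: the tokenizer normally adds
    the beginning-of-sequence token itself.
    """
    headline = example["headline"].strip()
    content = example["content"].strip()
    return f"[INST]{headline}[/INST]{content}{eos_token}"

sample = {
    "headline": "Example headline",          # made-up record for illustration
    "content": "  Example summary text.  ",
}
print(format_example(sample))
```

Such a function could be mapped over the dataset to create a new text column for `dataset_text_field`; whether it improves the answers would have to be checked by retraining.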





We then check the number of trainable parameters and the proportion they represent compared to the total number of parameters.

In [13]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = model.num_parameters()
    for _, param in model.named_parameters():
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
print_trainable_parameters(model)

trainable params: 42520576 || all params: 7284252672 || trainable%: 0.583732853796316


We run a preliminary “cold” evaluation before starting the training.

In [14]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 794
  Batch size = 4


{'eval_loss': 2.2846853733062744,
 'eval_runtime': 161.3465,
 'eval_samples_per_second': 4.921,
 'eval_steps_per_second': 1.233}

We now start the training. Once it is complete, we can run a few tests to see whether the responses meet our expectations, and consider retraining if the result is not satisfactory.

In [15]:
trainer.train()

Currently training with a batch size of: 4
***** Running training *****
  Num examples = 7,138
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 1,785
  Number of trainable parameters = 42,520,576


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 2.4771        | 2.423998        |


Saving model checkpoint to ./results/tmp-checkpoint-500
tokenizer config file saved in ./results/tmp-checkpoint-500/tokenizer_config.json
Special tokens file saved in ./results/tmp-checkpoint-500/special_tokens_map.json
Saving model checkpoint to ./results/tmp-checkpoint-1000
tokenizer config file saved in ./results/tmp-checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./results/tmp-checkpoint-1000/special_tokens_map.json
Saving model checkpoint to ./results/tmp-checkpoint-1500
tokenizer config file saved in ./results/tmp-checkpoint-1500/tokenizer_config.json
Special tokens file saved in ./results/tmp-checkpoint-1500/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 794
  Batch size = 4




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1785, training_loss=2.281299117165787, metrics={'train_runtime': 4683.6505, 'train_samples_per_second': 1.524, 'train_steps_per_second': 0.381, 'total_flos': 4.603174768233677e+16, 'train_loss': 2.281299117165787, 'epoch': 1.0})
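
The evaluation losses before and after training are easier to compare as perplexities, since perplexity is simply the exponential of the mean cross-entropy loss. A quick sketch using the two validation losses logged above:

```python
import math

loss_before = 2.2846853733062744  # eval_loss from the evaluation run before training
loss_after = 2.423998             # validation loss logged at the end of epoch 1

ppl_before = math.exp(loss_before)
ppl_after = math.exp(loss_after)

print(f"perplexity before training: {ppl_before:.2f}")
print(f"perplexity after training:  {ppl_after:.2f}")
print("validation loss improved" if loss_after < loss_before else "validation loss did not improve")
```

Here the validation loss is higher after the epoch than before it, so this run did not improve on the held-out split; a smaller learning rate or different LoRA settings may be worth trying.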

In [16]:
eval_prompt = """<s>[INST]What is Pomerium?[/INST]"""

# Tokenize the prompt and move it to the GPU
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

# Generate in evaluation mode, without tracking gradients
model.eval()
with torch.no_grad():
    output_ids = model.generate(**model_input, max_new_tokens=256, pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# Switch back to training mode
model.train()

[INST]What is Pomerium?[/INST]What is a tool that allows developers to create and share public and private repositories. It is a fork of GitHub that is designed to be a tool for developers who want to share code with others. Developers can create repositories and invite others to collaborate on them. Users can create public and private repositories and invite others to collaborate on them. The tool is currently in open beta and is free to use. It is currently in limited beta and is only available to developers who have signed up for the waitlist. A screenshot of the tool is available in the article. The tool is currently in development and it is not recommended for production use. It is not currently available for public use. A link to the tool is available in the article. [INST]What is Pomerium? is a video that shows the tool in action. A screenshot of the tool is available in the article. A link to the tool is available in the article. [INST]What is Pomerium? is a video that shows th

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.05, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=4096, out_features=16, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=16, out_features=4096, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
          )
          (k_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.05, inpl
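
The answer above starts to loop on the same sentences. As a rough way to quantify this when comparing runs, one can measure the share of word n-grams that are repeats. The helper below is an illustrative sketch (its name and the choice of n are arbitrary), not a standard metric implementation.

```python
def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that already occurred earlier in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1 - len(set(ngrams)) / len(ngrams)

looping = "It is currently in open beta and is free to use. " * 5
varied = "The quick brown fox jumps over the lazy dog near the river bank."
print(f"looping text: {repeated_ngram_ratio(looping):.2f}")
print(f"varied text:  {repeated_ngram_ratio(varied):.2f}")
```

A ratio close to 0 means little repetition, while values approaching 1 indicate that the text is mostly loops. `generate` also accepts arguments such as `repetition_penalty` and `no_repeat_ngram_size`, which may reduce this kind of looping.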

We save the trained LoRA adapter weights.

In [17]:
new_model = 'MistralAI_QLORA'
trainer.model.save_pretrained(new_model)
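
`save_pretrained` on a PEFT model writes only the adapter weights, so using them later means loading the base model again and attaching the adapter. A minimal sketch of that step, assuming the same base checkpoint and quantization settings as above; the function name is illustrative, and the imports are placed inside it so that defining it does not itself require a GPU:

```python
def load_finetuned_model(base_name: str, adapter_dir: str):
    """Reload the 4-bit base model and attach the saved LoRA adapter."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )
    base_model = AutoModelForCausalLM.from_pretrained(
        base_name,
        quantization_config=bnb_config,
        device_map={"": 0},
    )
    # Wrap the base model with the adapter weights stored in adapter_dir
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    model.eval()
    return model

# Example call (downloads the base model and needs a CUDA GPU):
# model = load_finetuned_model("Hugofernandez/Mistral-7B-v0.1-colab-sharded", "MistralAI_QLORA")
```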



In [21]:
eval_prompt = """<s>[INST]What is a Large Language Model?[/INST]"""

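# Tokenize the prompt and move it to the GPU
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

# Generate in evaluation mode, without tracking gradients
model.eval()
with torch.no_grad():
    output_ids = model.generate(**model_input, max_new_tokens=256, pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# Switch back to training mode
model.train()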

[INST]What is a Large Language Model?[/INST] is a tool that can generate a large language model that can generate text from a small amount of data. It is currently in open beta and it is free to use. The model is currently available in 10 languages and it can generate up to 1000 words per minute. It is not currently available for commercial use. The model is not available for commercial use. [INST]What is currently available in 10 languages and it is not available for commercial use. It is not available for commercial use. [INST]What is not available for commercial use. [INST]What is not available for commercial use. [INST]What is not available for commercial use. [INST]What is not available for commercial use. [INST]What is not available for commercial use. [INST]What is not available for commercial use. [INST]What is not available for commercial use. [INST]What is not available for commercial use. [INST]What is not available for commercial use. [INST]What is not available for commerc

In [23]:
eval_prompt = """<s>[INST]Will Uber deliver pumpkins?[/INST]"""

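# Tokenize the prompt and move it to the GPU
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

# Generate in evaluation mode, without tracking gradients
model.eval()
with torch.no_grad():
    output_ids = model.generate(**model_input, max_new_tokens=256, pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# Switch back to training mode
model.train()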

[INST]Will Uber deliver pumpkins?[/INST]Will Uber deliver pumpkins? is a tool that allows users to search for a location and then select a pumpkin to see if it is available for delivery. It is currently in beta and is free to use. The tool is currently only available for Android devices. Uber has been testing the tool since the end of 2019 and it is unknown when it will be available to the public. The tool is currently only available to Uber employees and their families. Uber has been testing the tool since the end of 2019 and it is unknown when it will be available to the public. The tool is currently only available to Uber employees and their families. Uber has been testing the tool since the end of 2019 and it is unknown when it will be available to the public. Uber has been testing the tool since the end of 2019 and it is unknown when it will be available to the public. The tool is currently only available to Uber employees and their families. Uber has been testing the tool since t

In [24]:
eval_prompt = """<s>[INST]What is CatchUp?[/INST]"""

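# Tokenize the prompt and move it to the GPU
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

# Generate in evaluation mode, without tracking gradients
model.eval()
with torch.no_grad():
    output_ids = model.generate(**model_input, max_new_tokens=256, pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# Switch back to training mode
model.train()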

[INST]What is CatchUp?[/INST]What is a tool that allows developers to easily create and share code snippets. It is currently in open beta and is free to use. The tool is currently in open beta and is free to use. It is currently in open beta and is free to use. The tool is currently in open beta and is free to use. It is currently in open beta and is free to use. The tool is currently in open beta and is free to use. It is currently in open beta and is free to use. The tool is currently in open beta and is free to use. It is currently in open beta and is free to use. The tool is currently in open beta and is free to use. It is currently in open beta and is free to use. The tool is currently in open beta and is free to use. It is currently in open beta and is free to use. The tool is currently in open beta and is free to use. It is currently in open beta and is free to use. The tool is currently in open beta and is free to use. It is currently in open beta and is free to use. The tool i

In [25]:
eval_prompt = """<s>[INST]What's it like as a Senior Engineer'?[/INST]"""

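# Tokenize the prompt and move it to the GPU
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

# Generate in evaluation mode, without tracking gradients
model.eval()
with torch.no_grad():
    output_ids = model.generate(**model_input, max_new_tokens=256, pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# Switch back to training mode
model.train()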

[INST]What's it like as a Senior Engineer'?[/INST]What's it like as a Senior Engineer? is a collection of stories from engineers who have been in the industry for over 10 years. It is a collection of stories from engineers who have been in the industry for over 10 years. The stories are written by engineers who have been in the industry for over 10 years. They are written in a way that is easy to read and understand. The stories are written in a way that is easy to read and understand. They are written in a way that is easy to read and understand. The stories are written in a way that is easy to read and understand. They are written in a way that is easy to read and understand. The stories are written in a way that is easy to read and understand. They are written in a way that is easy to read and understand. The stories are written in a way that is easy to read and understand. They are written in a way that is easy to read and understand. The stories are written in a way that is easy t
