# Mistral Inference, Quantization & Fine-Tuning

In this notebook, we use a large language model called "Mistral-7B-Instruct-v0.3" (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) to build a cooking chatbot, and then try to fine-tune it on the RecipeNLG dataset.

## Install necessary libraries

First, install the Hugging Face Transformers library and some related libraries, such as bitsandbytes and accelerate. Then, import PyTorch and the classes we need from Transformers.

In [1]:
!pip install -q -U transformers bitsandbytes accelerate #xformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline




In [2]:
torch.random.manual_seed(0)

<torch._C.Generator at 0x7844d8dd42f0>

In [3]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("NLP project")

## Import model and tokenizer

Let's load the model and the tokenizer, place the model on the GPU, and inspect the model's structure.

In [4]:
# Define the model name
model_name = "mistralai/Mistral-7B-Instruct-v0.3"

In [5]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    token=secret_value_0,
)


In [6]:
model

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32768, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MistralMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): MistralRMSNorm((4096,), eps=1e-0

In [7]:
tokenizer = AutoTokenizer.from_pretrained(model_name, token=secret_value_0)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token


## Use the model

In [8]:
# Define a pipeline to give the input to the model
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    pad_token_id=tokenizer.pad_token_id
)

Device set to use cuda


In [9]:
# Arguments used during token generation.
generation_args = {
    "max_new_tokens": 800,
    "return_full_text": False,
    "temperature": 0.3,
    "do_sample": True,
}
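
To make the effect of `temperature` more concrete, here is a minimal sketch of temperature scaling applied to a softmax. The logits are made-up values for illustration, not scores taken from the model, and `softmax_with_temperature` is a hypothetical helper, not part of Transformers. Dividing the logits by a temperature below 1 concentrates probability on the highest-scoring token, which is why a value such as 0.3 tends to give focused answers while still sampling.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
probs_default = softmax_with_temperature(toy_logits, temperature=1.0)
probs_sharp = softmax_with_temperature(toy_logits, temperature=0.3)

print("temperature=1.0:", [round(p, 3) for p in probs_default])
print("temperature=0.3:", [round(p, 3) for p in probs_sharp])
```

With the lower temperature, the first token receives a visibly larger share of the probability mass than it does at temperature 1.0.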

Let's test our Mistral model by asking it for cooking advice. We start with a very common Italian dish, *Spaghetti alla Carbonara*, and look at the results.

In [10]:
def chatbot(user_request, system_prompt="You give cooking advices."):
    """Send a system prompt and a user request to the model and return its reply."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
    ]
    output = pipe(messages, **generation_args)
    return output[0]['generated_text']

In [11]:
answer = chatbot("How to cook spaghetti alla carbonara?")
print(answer)

 To cook Spaghetti alla Carbonara, you'll need the following ingredients:

* 1 lb (450g) spaghetti
* 4 large eggs
* 1 cup (100g) grated Pecorino Romano cheese
* 1/2 cup (100g) grated Parmesan cheese
* 8 oz (225g) guanciale or pancetta, diced
* Freshly ground black pepper
* Salt

Here's a step-by-step guide on how to prepare Spaghetti alla Carbonara:

1. Cook the spaghetti in a large pot of salted boiling water until al dente. Reserve 1 cup of pasta water before draining the pasta.

2. While the pasta is cooking, whisk the eggs, Pecorino Romano, Parmesan, and a generous amount of black pepper in a large bowl.

3. In a large skillet, cook the guanciale or pancetta over medium heat until crispy. Remove the skillet from the heat.

4. Drain the pasta, but do not rinse it. Immediately add the drained pasta to the skillet with the rendered fat from the guanciale or pancetta. Toss the pasta to coat it evenly.

5. Remove the skillet from the heat again. Gradually add the reserved pasta water, 1

We can also change the language (e.g., switch to Italian) and compare the answers.

In [12]:
answer = chatbot("Come cucinare gli spaghetti alla carbonara?")
print(answer)

 Sì, qui di seguito trovi la ricetta per preparare gli spaghetti alla carbonara:

Ingredienti:
- 400g di spaghetti
- 100g di guanciale o pancetta tagliata a fettine
- 4 uova
- 100g di pecorino romano
- 50g di parmigiano reggiano
- Sale e pepe
- Olio d'oliva

Istruzioni:
1. Fate bollire una grande casseruola di acqua salata. Aggiungete poi gli spaghetti e fate cuocere secondo le istruzioni sulle confezioni.
2. Nel frattempo, tagliate il guanciale a fettine e calpestate il pecorino romano.
3. Prendete una padella e calpestate l'olio d'oliva, il guanciale e il pecorino romano finché non diventano un po' croccanti.
4. Sia la casseruola con gli spaghetti che la padella con il guanciale e il pecorino romano devono essere caldi.
5. Prendete un cucchiaio di cottura e prelevate un po' dell'acqua calda in cui gli spaghetti sono bolliti. Aggiungete questa acqua calda alla padella con il guanciale e il pecorino romano.
6. Spezzate le uova in una ciotola e aggiungete il parmigiano reggiano.
7. Una 

We can also use the model to check specific properties of a recipe, e.g., whether it is suitable for vegans.

In [13]:
answer = chatbot("Are spaghetti alla carbonara good for vegans?", "You are a friendly but concise chatbot.")
print(answer)

 No, traditional spaghetti alla carbonara is not suitable for vegans as it contains ingredients like guanciale (cured pork cheek), Pecorino Romano cheese, and sometimes eggs. A vegan version can be made by substituting the guanciale with smoked tofu or mushrooms, and using a vegan cheese or a plant-based cream.


We can also try less common recipes, such as *pizzoccheri valtellinesi*.

In [14]:
answer = chatbot("Are you able to prepare pizzoccheri valtellinesi?")
print(answer)

 While I can't physically prepare food, I can certainly guide you through the process of making Pizzoccheri Valtellinesi, a traditional Italian dish from the Valtellina region. Here's a simplified recipe:

Ingredients:
1. 500g fresh buckwheat pasta (pizzoccheri)
2. 200g Swiss chard, stems removed and leaves chopped
3. 200g potatoes, peeled and cubed
4. 100g cubed Swiss cheese (such as Valtellina Casera)
5. 100g cubed Italian salami (such as Bergamasca or Cremasco)
6. 2 tablespoons butter
7. 2 tablespoons extra-virgin olive oil
8. 1 onion, finely chopped
9. Salt
10. Freshly ground black pepper
11. Grated Parmesan cheese, for serving

Instructions:
1. Cook the potatoes in salted boiling water until tender. Drain and set aside.
2. Bring a large pot of salted water to a boil. Add the pizzoccheri and cook until al dente. Add the Swiss chard leaves to the pot during the last minute of cooking. Drain, reserving some of the pasta water.
3. In a large skillet, heat the butter and olive oil over

OK, for lesser-known dishes such as *pizzoccheri*, there is still room for improvement...

What about giving the model a webpage that describes the recipe as a reference?

In [15]:
import urllib.request
import bs4 as bs
import re

html_doc = urllib.request.urlopen('https://it.wikipedia.org/wiki/Pizzoccheri_della_Valtellina').read()
parsed_doc = bs.BeautifulSoup(html_doc,'lxml')
page = '\n'.join(p.text for p in parsed_doc.find_all('p'))
print(page)

I pizzòccheri della Valtellina (pizzocher in lombardo) sono una varietà di pasta alimentare preparata con farina di grano saraceno miscelata con altri sfarinati. Simili alle tagliatelle, ma di colore grigiastro, i pizzoccheri sono un piatto tradizionale della Valtellina, in particolare di Teglio, e si mangiano sia in Lombardia, in Italia, sia nel cantone svizzero dei Grigioni.

Non devono essere confusi con i pizzoccheri di Chiavenna, che sono invece una particolare varietà di gnocchi, preparati con farina di frumento e pane secco ammollato nel latte. Nel 2016 il prodotto ha ottenuto dall'Unione europea il riconoscimento di indicazione geografica protetta (IGP).[1]

La prima testimonianza certa sulla coltivazione del grano saraceno (chiamato formentone) in Valtellina risale al 1616, quando il governatore della Valle dell'Adda (appartenente al cantone dei Grigioni in Svizzera) scrisse che «Il saraceno veniva coltivato soprattutto sul versante retico delle Alpi, in particolare nel compre

In [16]:
document = page[:2500]
print(document)

I pizzòccheri della Valtellina (pizzocher in lombardo) sono una varietà di pasta alimentare preparata con farina di grano saraceno miscelata con altri sfarinati. Simili alle tagliatelle, ma di colore grigiastro, i pizzoccheri sono un piatto tradizionale della Valtellina, in particolare di Teglio, e si mangiano sia in Lombardia, in Italia, sia nel cantone svizzero dei Grigioni.

Non devono essere confusi con i pizzoccheri di Chiavenna, che sono invece una particolare varietà di gnocchi, preparati con farina di frumento e pane secco ammollato nel latte. Nel 2016 il prodotto ha ottenuto dall'Unione europea il riconoscimento di indicazione geografica protetta (IGP).[1]

La prima testimonianza certa sulla coltivazione del grano saraceno (chiamato formentone) in Valtellina risale al 1616, quando il governatore della Valle dell'Adda (appartenente al cantone dei Grigioni in Svizzera) scrisse che «Il saraceno veniva coltivato soprattutto sul versante retico delle Alpi, in particolare nel compre

In [17]:
question = "How to cook pizzoccheri della Valtellina? Suggest also a possible wine which matches with this dish."

user_request = "Answer the user question: '" + question + "' \n\n based on the information in the document: \n```\n" + document + "\n```\n\n"
answer = chatbot(user_request)
print(answer)

 To cook Pizzoccheri della Valtellina, you'll need the specialty flour made from Grano Saraceno, also known as formentone. Here's a simple recipe:

1. Prepare the dough by mixing 300g of Grano Saraceno flour, 100g of whole wheat flour, 2 eggs, and a pinch of salt. Add water gradually until the dough comes together. Knead until smooth. Let it rest for 30 minutes.

2. Roll out the dough thinly and cut into strips, similar to tagliatelle but with a grayish color.

3. Boil the pizzoccheri in salted water until al dente.

4. In a pan, sauté diced potatoes, cabbage, and chopped onions in butter until tender.

5. Drain the cooked pizzoccheri and add them to the pan, tossing to coat with the vegetables and butter.

6. Top with a generous amount of Valtellina Casera cheese (a local cheese) and a ladle of melted butter.

7. Serve immediately.

As for the wine, a red wine from Valtellina, such as Sforzato or Valtellina Superiore, would pair well with the earthy flavors of the Pizzoccheri della Va

Ok, a bit better!
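
The prompt construction used above can be wrapped in a small helper, sketched below. The function name `build_grounded_request` and the `max_chars` parameter are hypothetical, introduced only for illustration; the 2500-character default mirrors the slice applied to `page` earlier. Truncating by characters is a crude way to limit context size, and a token-based limit would be more precise.

```python
def build_grounded_request(question, document, max_chars=2500):
    """Embed a truncated reference document in the user request."""
    excerpt = document[:max_chars]  # keep only the first max_chars characters
    fence = "`" * 3  # delimiter that marks where the document starts and ends
    return (
        f"Answer the user question: '{question}' \n\n"
        f" based on the information in the document: \n"
        f"{fence}\n{excerpt}\n{fence}\n\n"
    )


# Toy document used only to demonstrate the truncation.
toy_document = "Pizzoccheri are a buckwheat pasta from Valtellina. " * 100
example_request = build_grounded_request(
    "How to cook pizzoccheri della Valtellina?", toy_document, max_chars=120
)
print(example_request)
```

The returned string can be passed to `chatbot` in place of the hand-built `user_request`.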

In [18]:
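import gc

# The original model, its tokenizer, and the pipeline that holds references
# to them are no longer needed, so we can delete them.
del pipe, model, tokenizer

# Collect the freed objects, then clear the GPU cache
gc.collect()
torch.cuda.empty_cache()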

# Load a Quantized Model
Reference: https://medium.com/@vishnuchirukandathramesh/how-to-run-mistral-7b-on-free-version-of-google-colab-e0effd9c6a12

First, let's install the necessary packages and import the following libraries.

In [19]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [20]:
!pip install -q -U langchain  # transformers bitsandbytes accelerate


In [21]:
!pip install -q langchain-community langchain-core langchain-huggingface


In [22]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFacePipeline

The quantization configuration loads the model in 4-bit precision, uses torch.float16 for computation, selects the "nf4" quantization type, and enables double quantization. The goal is to reduce the memory footprint of the model at inference time.

In [23]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
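
To get a feel for the memory savings, the sketch below estimates the weight memory from the layer shapes printed for the model earlier. It is a back-of-the-envelope calculation under several assumptions: the output head (`lm_head`) has the same shape as the embedding and is not tied to it, only the linear layers inside the decoder blocks are stored in 4 bits while everything else stays in 16 bits, and the overhead of quantization constants, activations, and the KV cache is ignored. The helper names are hypothetical.

```python
# Shapes copied from the printed model structure.
VOCAB, HIDDEN, KV_DIM, MLP_DIM, N_LAYERS = 32768, 4096, 1024, 14336, 32

# Linear layers inside one decoder block (these become Linear4bit).
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q_proj, o_proj, k_proj, v_proj
mlp = 3 * HIDDEN * MLP_DIM                             # gate_proj, up_proj, down_proj
linear_params = N_LAYERS * (attention + mlp)

# Everything else: embeddings, norms, and the output head.
norm_params = N_LAYERS * 2 * HIDDEN + HIDDEN
embedding_params = VOCAB * HIDDEN
lm_head_params = VOCAB * HIDDEN  # assumption: untied head with the embedding's shape
other_params = norm_params + embedding_params + lm_head_params

total_params = linear_params + other_params


def gib(n_params, bits_per_param):
    """Memory in GiB needed to store n_params values of the given bit width."""
    return n_params * bits_per_param / 8 / 1024**3


fp16_gib = gib(total_params, 16)
nf4_gib = gib(linear_params, 4) + gib(other_params, 16)

print(f"parameters:        {total_params:,}")
print(f"16-bit weights:    {fp16_gib:.2f} GiB")
print(f"4-bit (estimate):  {nf4_gib:.2f} GiB")
```

Once the quantized model is loaded, `model_4bit.get_memory_footprint()` reports the actual figure, which can be compared against this estimate.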

Then, load the quantized model.

In [24]:
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    token=secret_value_0,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, token=secret_value_0)


In [25]:
model_4bit

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32768, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): Mist

The code below sets up a text-generation pipeline using the Mistral-7B-Instruct-v0.3 model loaded in 4-bit precision, along with its tokenizer, with caching and automatic device mapping enabled. It configures generation parameters such as the maximum length, sampling, top-k sampling, and the number of returned sequences. It then wraps the pipeline in a LangChain HuggingFacePipeline so that it can be combined with prompt templates.

In [26]:
pipeline_inst = pipeline(
        "text-generation",
        model=model_4bit,
        tokenizer=tokenizer,
        use_cache=True,
        device_map="auto",
        max_length=2500,
        truncation=True,
        do_sample=True,
        top_k=5,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
)

llm = HuggingFacePipeline(pipeline=pipeline_inst)

Device set to use cuda:0
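
The `top_k=5` setting restricts sampling to the five most likely next tokens. The following minimal sketch illustrates the idea on made-up logits using only the standard library. It is a simplified illustration and not the implementation used by Transformers; `sample_top_k` and the toy values are hypothetical.

```python
import math
import random


def sample_top_k(logits, k=5, rng=None):
    """Sample an index from the k highest-scoring entries of logits."""
    if rng is None:
        rng = random.Random()
    # Indices of the k largest logits, best first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights restricted to the kept candidates.
    largest = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - largest) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [0.1, 3.2, 1.5, -0.7, 2.8, 0.9, 2.1, -1.3]  # hypothetical scores
rng = random.Random(0)  # fixed seed so the demo is repeatable
draws = [sample_top_k(toy_logits, k=5, rng=rng) for _ in range(1000)]

# Only indices among the five largest logits can ever be drawn.
print(sorted(set(draws)))
```

Tokens outside the top five are never selected, however many samples are drawn, which trades some diversity for more predictable output.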


In [27]:
template = """You are an respectful and helpful cooking assistant, respond always and be precise and polite.
Answer the question below: {question}
Answer:
"""

def generate_response(question):
    """Fill the prompt template with the question and run it through the LLM."""
    prompt = PromptTemplate(template=template, input_variables=["question"])
    llm_chain = prompt | llm
    response = llm_chain.invoke({"question": question})
    return response

In [28]:
output = generate_response("How to prepare spaghetti alla carbonara for 2 people?")
print(output)

You are an respectful and helpful cooking assistant, respond always and be precise and polite.
Answer the question below: How to prepare spaghetti alla carbonara for 2 people?
Answer:
Preparing Spaghetti alla Carbonara for 2 people involves several steps. Here's a simple recipe you can follow:

Ingredients:
1. 200 grams of spaghetti
2. 100 grams of guanciale (Italian cured pork jowl) or pancetta as an alternative
3. 2 large eggs
4. 100 grams of Pecorino Romano cheese, finely grated
5. 50 grams of Parmesan cheese, finely grated
6. Fresh black pepper, to taste
7. Salt, to taste

Instructions:

1. Cook the spaghetti in salted boiling water until al dente. Reserve about 1 cup of pasta water.

2. While the spaghetti is cooking, cut the guanciale into small cubes. Cook it in a large skillet over medium heat until it becomes crispy. Remove from the pan and set aside.

3. In a separate bowl, whisk together the eggs, half of the Pecorino Romano, half of the Parmesan, and a generous amount of fr
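
Note that the output above begins with the rendered prompt, since the pipeline returns the full text. One simple way to keep only the answer is to remove the prompt prefix after generation, as in the sketch below. The helper names and the shortened template are hypothetical, and the example response is hand-written rather than produced by the model.

```python
# Shortened stand-in for the notebook's template (hypothetical).
DEMO_TEMPLATE = "Answer the question below: {question}\nAnswer:\n"


def render_prompt(question, template=DEMO_TEMPLATE):
    """Fill the template the same way the prompt is rendered before generation."""
    return template.format(question=question)


def strip_prompt(response, prompt):
    """Return the response without the echoed prompt, if it starts with it."""
    if response.startswith(prompt):
        return response[len(prompt):].lstrip()
    return response.strip()


demo_question = "How to prepare spaghetti alla carbonara for 2 people?"
demo_prompt = render_prompt(demo_question)
# Hand-written stand-in for a model response that echoes the prompt.
demo_response = demo_prompt + "Cook the spaghetti until al dente."

print(strip_prompt(demo_response, demo_prompt))
```

In the notebook, the same idea would apply to the string returned by `generate_response`, using the prompt rendered from the real `template`.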

In [29]:
# Clear GPU cache
torch.cuda.empty_cache()