# Mistral Inference, Quantization & Fine-Tuning

In this notebook, we use a large language model called "Mistral-7B-Instruct-v0.3" (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) to build a cooking chatbot, and then try to fine-tune it on the RecipeNLG dataset.

## Install necessary libraries

First, install the Hugging Face Transformers library and some related libraries, such as accelerate and bitsandbytes. Then, import PyTorch and the Transformers classes used below.

In [1]:
!pip install -q -U transformers bitsandbytes accelerate #xformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.4/10.4 MB[0m [31m82.9 MB/s[0m eta [36m0:00:00[0m:00:01[0m:01[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m76.1/76.1 MB[0m [31m25.0 MB/s[0m eta [36m0:00:00[0m:00:01[0m00:01[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m354.7/354.7 kB[0m [31m26.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m4.7 MB/s[0m eta [36m0:00:00[0m0:00:01[0m00:01[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m2.5 MB/s[0m eta [36m0:00:00[0m0:00:01[0m00:01[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m5.6 MB/s[0m eta [36m0:00:00[0m0:00:01[0m00:01[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.3/56.3 MB[0m [31m30.4 MB/s[0m eta [36m0:00:00[0m:00:01[0m00:01[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━


In [2]:
torch.random.manual_seed(0)

<torch._C.Generator at 0x7c463490c2f0>

In [3]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("NLP project")

## Import model and tokenizer

Let's load the model onto the GPU, load the tokenizer, and inspect the model's structure.

In [4]:
# Define the model name
model_name = "mistralai/Mistral-7B-Instruct-v0.3"

In [5]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    token=secret_value_0
)

In [6]:
model

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32768, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MistralMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): MistralRMSNorm((4096,), eps=1e-0

In [7]:
tokenizer = AutoTokenizer.from_pretrained(model_name, token=secret_value_0)

## Use the model

In [8]:
# Define a pipeline to give the input to the model
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

Device set to use cuda


In [9]:
# Arguments used during token generation.
generation_args = {
    "max_new_tokens": 800,
    "return_full_text": False,
    "temperature": 0.3,
    "do_sample": True,
}
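
To give a feel for what `temperature` does when `do_sample` is enabled, the following sketch applies a temperature-scaled softmax to a few made-up logits. The logit values and the helper name `softmax_with_temperature` are illustrative assumptions, not values taken from the model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, dividing by the temperature first."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens (illustrative values only)
logits = [2.0, 1.0, 0.5]

probs_default = softmax_with_temperature(logits, temperature=1.0)
probs_low = softmax_with_temperature(logits, temperature=0.3)

print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_low])
```

With a temperature below 1, more probability mass moves to the highest-scoring token, so sampled answers vary less between runs.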

Let's test our Mistral model by asking it for cooking advice. We start with a very common Italian dish, *Spaghetti alla Carbonara*, and look at the results.

In [10]:
def chatbot(user_request, system_prompt="You give cooking advices."):
    # Build a two-message chat (system + user) and return only the newly generated text
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
    ]
    output = pipe(messages, **generation_args)
    return output[0]['generated_text']

In [11]:
answer = chatbot("How to cook spaghetti alla carbonara?")
print(answer)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 To cook Spaghetti alla Carbonara, you'll need the following ingredients:

* 1 lb (450g) spaghetti
* 4 large eggs
* 1 cup (100g) grated Pecorino Romano cheese
* 1/2 cup (100g) grated Parmesan cheese
* 8 oz (225g) guanciale or pancetta, diced
* Freshly ground black pepper
* Salt

Here's a step-by-step guide on how to prepare Spaghetti alla Carbonara:

1. Cook the spaghetti in a large pot of salted boiling water until al dente. Reserve 1 cup of pasta water before draining the pasta.

2. While the pasta is cooking, whisk the eggs, Pecorino Romano, Parmesan, and a generous amount of black pepper in a large bowl.

3. In a large skillet, cook the guanciale or pancetta over medium heat until crispy. Remove the skillet from the heat.

4. Drain the pasta, but do not rinse it. Immediately add the drained pasta to the skillet with the rendered fat from the guanciale or pancetta. Toss the pasta to coat it evenly.

5. Remove the skillet from the heat again. Gradually add the reserved pasta water, 1

We can also switch language (e.g. to Italian) and compare the answers.

In [12]:
answer = chatbot("Come cucinare gli spaghetti alla carbonara?")
print(answer)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 Sì, qui di seguito trovi la ricetta per preparare gli spaghetti alla carbonara:

Ingredienti:
- 400g di spaghetti
- 100g di guanciale o pancetta tagliata a fettine
- 4 uova
- 100g di pecorino romano
- 50g di parmigiano reggiano
- Sale e pepe
- Olio d'oliva

Istruzioni:
1. Fate bollire una grande casseruola di acqua salata. Aggiungete poi gli spaghetti e fate cuocere secondo le istruzioni sulle confezioni.
2. Nel frattempo, tagliate il guanciale a fettine e calpestate il pecorino romano.
3. Prendete una padella e calpestate l'olio d'oliva, il guanciale e il pecorino romano finché non diventano un po' croccanti.
4. Sia la casseruola con gli spaghetti che la padella con il guanciale e il pecorino romano devono essere caldi.
5. Prendete un cucchiaio di cottura e prelevate un po' dell'acqua calda in cui gli spaghetti sono bolliti. Aggiungete questa acqua calda alla padella con il guanciale e il pecorino romano.
6. Spezzate le uova in una ciotola e aggiungete il parmigiano reggiano.
7. Una 

We can also use the model to check properties of a recipe, e.g. whether it is vegan.

In [13]:
answer = chatbot("Are spaghetti alla carbonara good for vegans?", "You are a friendly but concise chatbot.")
print(answer)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 No, traditional spaghetti alla carbonara is not suitable for vegans as it contains ingredients like guanciale (cured pork cheek), Pecorino Romano cheese, and sometimes eggs. A vegan version can be made by substituting the guanciale with smoked tofu or mushrooms, and using a vegan cheese or a plant-based cream.


We can also try less common recipes, such as *pizzoccheri valtellinesi*.

In [14]:
answer = chatbot("Are you able to prepare pizzoccheri valtellinesi?")
print(answer)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 While I can't physically prepare food, I can certainly guide you through the process of making Pizzoccheri Valtellinesi, a traditional Italian dish from the Valtellina region. Here's a simplified recipe:

Ingredients:
1. 500g fresh buckwheat pasta (pizzoccheri)
2. 200g Swiss chard, stems removed and leaves chopped
3. 200g potatoes, peeled and cubed
4. 100g cubed Swiss cheese (such as Valtellina Casera)
5. 100g cubed Italian salami (such as Bergamasca or Cremasco)
6. 2 tablespoons butter
7. 2 tablespoons extra-virgin olive oil
8. 1 onion, finely chopped
9. Salt
10. Freshly ground black pepper
11. Grated Parmesan cheese, for serving

Instructions:
1. Cook the potatoes in salted boiling water until tender. Drain and set aside.
2. Bring a large pot of salted water to a boil. Add the pizzoccheri and cook until al dente. Add the Swiss chard leaves to the pot during the last minute of cooking. Drain, reserving some of the pasta water.
3. In a large skillet, heat the butter and olive oil over

OK, on lesser-known dishes such as *pizzoccheri* the model still has room for improvement...

What about providing, as a reference, a web page that describes the recipe?

In [15]:
import urllib.request
import bs4 as bs
import re

html_doc = urllib.request.urlopen('https://it.wikipedia.org/wiki/Pizzoccheri_della_Valtellina').read()
parsed_doc = bs.BeautifulSoup(html_doc,'lxml')
page = '\n'.join(p.text for p in parsed_doc.find_all('p'))
print(page)

I pizzòccheri della Valtellina (pizzocher in lombardo) sono una varietà di pasta alimentare preparata con farina di grano saraceno miscelata con altri sfarinati. Simili alle tagliatelle, ma di colore grigiastro, i pizzoccheri sono un piatto tradizionale della Valtellina, in particolare di Teglio, e si mangiano sia in Lombardia, in Italia, sia nel cantone svizzero dei Grigioni.

Non devono essere confusi con i pizzoccheri di Chiavenna, che sono invece una particolare varietà di gnocchi, preparati con farina di frumento e pane secco ammollato nel latte. Nel 2016 il prodotto ha ottenuto dall'Unione europea il riconoscimento di indicazione geografica protetta (IGP).[1]

La prima testimonianza certa sulla coltivazione del grano saraceno (chiamato formentone) in Valtellina risale al 1616, quando il governatore della Valle dell'Adda (appartenente al cantone dei Grigioni in Svizzera) scrisse che «Il saraceno veniva coltivato soprattutto sul versante retico delle Alpi, in particolare nel compre

In [16]:
document = page[:2000]
print(document)

I pizzòccheri della Valtellina (pizzocher in lombardo) sono una varietà di pasta alimentare preparata con farina di grano saraceno miscelata con altri sfarinati. Simili alle tagliatelle, ma di colore grigiastro, i pizzoccheri sono un piatto tradizionale della Valtellina, in particolare di Teglio, e si mangiano sia in Lombardia, in Italia, sia nel cantone svizzero dei Grigioni.

Non devono essere confusi con i pizzoccheri di Chiavenna, che sono invece una particolare varietà di gnocchi, preparati con farina di frumento e pane secco ammollato nel latte. Nel 2016 il prodotto ha ottenuto dall'Unione europea il riconoscimento di indicazione geografica protetta (IGP).[1]

La prima testimonianza certa sulla coltivazione del grano saraceno (chiamato formentone) in Valtellina risale al 1616, quando il governatore della Valle dell'Adda (appartenente al cantone dei Grigioni in Svizzera) scrisse che «Il saraceno veniva coltivato soprattutto sul versante retico delle Alpi, in particolare nel compre
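
Slicing with `page[:2000]` can cut the text in the middle of a word or sentence. As a possible alternative, the sketch below keeps whole paragraphs until a character budget is reached. The helper name `truncate_at_paragraph` and the sample text are hypothetical.

```python
def truncate_at_paragraph(text, max_chars=2000):
    """Keep whole paragraphs (separated by newlines) within a character budget."""
    kept = []
    used = 0
    for paragraph in text.split("\n"):
        # +1 accounts for the newline that joins this paragraph to the previous one
        extra = len(paragraph) + (1 if kept else 0)
        if used + extra > max_chars:
            break
        kept.append(paragraph)
        used += extra
    return "\n".join(kept)

# Tiny stand-in for the scraped page (illustrative text only)
sample = "First paragraph.\nSecond paragraph, a bit longer.\nThird paragraph."
short = truncate_at_paragraph(sample, max_chars=50)
print(short)
```

If the very first paragraph is already longer than the budget, this sketch returns an empty string, so a fallback to plain slicing may still be needed.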

In [17]:
question = "How to cook pizzoccheri della Valtellina?"

user_request = "Answer the user question: '" + question + "' \n\n based on the information in the document: \n```\n" + document + "\n```\n\n"
answer = chatbot(user_request)
print(answer)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 To cook Pizzoccheri della Valtellina, you'll need the specialty pasta made from Saraceno wheat flour, which is mixed with other types of flour. The pasta resembles tagliatelle but has a grayish color. Here's a traditional recipe:

Ingredients:
- 500g of Pizzoccheri pasta
- 200g of Swiss chard (or other leafy greens)
- 200g of potatoes
- 100g of cubed, smoked, Valtellina ham (Casera or Bitto)
- 100g of cubed, smoked, Valtellina cheese (Casera or Bitto)
- 4 tablespoons of butter
- 2 tablespoons of extra virgin olive oil
- Salt, to taste
- Pepper, to taste
- 2 cloves of garlic
- 1 onion
- 1 glass of white wine
- 1 liter of broth (vegetable or meat)

Instructions:
1. Boil the potatoes in salted water until tender. Once cooked, peel them and cut into small cubes.
2. Clean the Swiss chard, removing the stems and chopping the leaves.
3. In a large pan, heat the butter and olive oil over medium heat. Add the garlic and onion, sauté until softened.
4. Add the Swiss chard leaves to the pan, coo

Ok, a bit better!
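
The prompt assembly in the cell above can be wrapped in a small helper so the same pattern can be reused for other questions and documents. This is only a sketch: the function name `build_grounded_prompt` and the sample inputs are assumptions, and the helper reproduces the same string layout as the concatenation used above.

```python
def build_grounded_prompt(question, document):
    """Combine a user question with a reference document, fenced by backticks."""
    fence = "`" * 3  # three backticks delimit the document inside the prompt
    return (
        f"Answer the user question: '{question}' \n\n"
        f" based on the information in the document: \n{fence}\n"
        f"{document}\n{fence}\n\n"
    )

# Hypothetical inputs, for illustration only
question = "How to cook pizzoccheri della Valtellina?"
document = "I pizzoccheri sono una varietà di pasta."
user_request = build_grounded_prompt(question, document)
print(user_request)
```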

In [18]:
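# The original model, tokenizer, and pipeline are no longer needed.
# The pipeline also holds a reference to the model, so delete it as well.
del pipe
del model
del tokenizer

# Collect the released objects, then clear the GPU cache
import gc
gc.collect()
torch.cuda.empty_cache()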

# Load a Quantized Model
Reference: https://medium.com/@vishnuchirukandathramesh/how-to-run-mistral-7b-on-free-version-of-google-colab-e0effd9c6a12

First, let's install the necessary packages and import the following libraries.

In [19]:
!pip install -q -U langchain  # transformers bitsandbytes accelerate

In [20]:
!pip install -q langchain-community langchain-core

In [21]:
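import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
# Import each LangChain class from the package that provides it
from langchain.chains import LLMChain
from langchain_community.llms import HuggingFacePipeline
from langchain_core.prompts import PromptTemplate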

The quantization configuration loads the model in 4-bit precision with the "nf4" (4-bit NormalFloat) quantization type, uses torch.float16 for computation, and enables double quantization, in which the quantization constants are themselves quantized. Together, these settings reduce the memory footprint of the model.

In [22]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
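
To see roughly why 4-bit loading helps, the sketch below counts parameters from the layer shapes printed for the model above and estimates the weight memory at 16-bit and 4-bit precision. It assumes an output head (`lm_head`) with the same shape as the embedding matrix, assumes that only the linear layers inside the decoder blocks are quantized (embeddings, norms, and the head stay at 16 bits), and ignores the extra storage for quantization constants and activations, so treat the numbers as a rough estimate.

```python
# Shapes taken from the printed model structure
vocab, hidden, kv_dim, mlp_dim, n_layers = 32768, 4096, 1024, 14336, 32

# Linear weights inside one decoder layer (no biases)
attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q_proj, o_proj, k_proj, v_proj
mlp = 3 * hidden * mlp_dim                        # gate_proj, up_proj, down_proj
linear_params = n_layers * (attn + mlp)

# Everything else: embeddings, RMSNorm weights, and the assumed lm_head
norm_params = n_layers * 2 * hidden + hidden
other_params = vocab * hidden + norm_params + vocab * hidden  # embed + norms + lm_head (assumed)

total_params = linear_params + other_params

GB = 1e9
mem_16bit = total_params * 2 / GB                         # 2 bytes per parameter
mem_4bit = (linear_params * 0.5 + other_params * 2) / GB  # 0.5 bytes per quantized weight

print(f"parameters: {total_params:,}")
print(f"16-bit weights: ~{mem_16bit:.1f} GB")
print(f"4-bit weights:  ~{mem_4bit:.1f} GB (before quantization overhead)")
```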

Then, load the quantized model.

In [23]:
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [24]:
model_4bit

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32768, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): Mist

The next cell sets up a text-generation pipeline using the Mistral-7B-Instruct-v0.3 model loaded in 4-bit precision, along with its tokenizer, with caching and automatic device mapping enabled. It configures generation parameters such as the maximum length, sampling with top-k (k=5), and the number of returned sequences. Finally, it wraps the pipeline in a LangChain HuggingFacePipeline so that it can be used inside chains.

In [25]:
pipeline_inst = pipeline(
        "text-generation",
        model=model_4bit,
        tokenizer=tokenizer,
        use_cache=True,
        device_map="auto",
        max_length=2500,
        truncation=True,
        do_sample=True,
        top_k=5,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
)

llm = HuggingFacePipeline(pipeline=pipeline_inst)

Device set to use cuda:0
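
As an illustration of what `top_k=5` combined with `do_sample=True` means, the sketch below keeps only the k highest-scoring candidates, renormalizes them, and samples one. The toy vocabulary and scores are made up, and the helper `top_k_sample` is a hypothetical simplification, not the library's implementation.

```python
import math
import random

def top_k_sample(scores, k, rng):
    """Sample a token from the k highest-scoring candidates only."""
    # Keep the k best (token, logit) pairs
    top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    # Softmax weights over the surviving logits
    m = max(logit for _, logit in top)
    weights = [math.exp(logit - m) for _, logit in top]
    tokens = [token for token, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token logits (illustrative values only)
scores = {"pasta": 3.1, "eggs": 2.7, "cheese": 2.2, "pepper": 1.0,
          "salt": 0.8, "cream": 0.1, "ketchup": -2.0}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [top_k_sample(scores, k=5, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

Tokens outside the top five can never be drawn, which removes very unlikely continuations while still allowing some variety.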


In [29]:
template = """You are an respectful and helpful cooking assistant, respond always and be precise and polite.
Answer the question below: {question}
Answer:
"""

def generate_response(question):
    prompt = PromptTemplate(template=template, input_variables=["question"])
    llm_chain = LLMChain(prompt=prompt, llm=llm)
    response = llm_chain.run({"question": question})
    return response

In [30]:
output = generate_response("How to prepare spaghetti alla carbonara for 2 people?")
print(output)

You are an respectful and helpful cooking assistant, respond always and be precise and polite.
Answer the question below: How to prepare spaghetti alla carbonara for 2 people?
Answer:
To prepare Spaghetti alla Carbonara for 2 people, gather the following ingredients:
- 200 grams of spaghetti
- 2 large eggs
- 100 grams of guanciale (Italian cured pork cheeks) or pancetta (Italian bacon), cut into small pieces
- 80 grams of freshly grated Pecorino Romano cheese
- 30 grams of freshly grated Parmesan cheese
- Freshly ground black pepper to taste
- Salt for the pasta water

Instructions:
1. Boil a large pot of salted water for the pasta. Cook the spaghetti according to the package instructions until al dente. Reserve 1 cup of pasta water and drain the pasta.

2. While the pasta cooks, combine the eggs, Pecorino Romano cheese, and a generous amount of black pepper in a large bowl. Mix until well combined.

3. In a large skillet, cook the guanciale or pancetta over medium heat until crispy, a
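
Note that the output above begins with the prompt itself, because the returned text includes the prompt. One simple way to keep only the answer is to split on the `Answer:` marker from the template. This is a sketch with a hypothetical helper `extract_answer` and a shortened example string; it assumes the marker appears once before the generated answer.

```python
def extract_answer(response, marker="Answer:"):
    """Return the text after the first occurrence of the marker, or the whole response."""
    _, found, answer = response.partition(marker)
    return answer.strip() if found else response.strip()

# Shortened stand-in for a response that echoes the prompt (illustrative only)
response = (
    "You are an respectful and helpful cooking assistant, respond always and be precise and polite.\n"
    "Answer the question below: How to prepare spaghetti alla carbonara for 2 people?\n"
    "Answer:\n"
    "To prepare Spaghetti alla Carbonara for 2 people, gather the following ingredients:"
)
print(extract_answer(response))
```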

In [31]:
# Clear GPU cache
torch.cuda.empty_cache()