<a href="https://colab.research.google.com/github/marxiaoxu/Automatic-code-generation/blob/main/tutorials/getting_started.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Started with `mistral-inference`

This notebook will guide you through the process of running Mistral models locally. We will cover the following:
- How to chat with Mistral 7B Instruct
- How to run Mistral 7B Instruct with function calling capabilities

We recommend using a GPU such as the A100 to run this notebook.

In [1]:
!pip install mistral-inference

Collecting mistral-inference
  Downloading mistral_inference-1.5.0-py3-none-any.whl.metadata (14 kB)
Collecting fire>=0.6.0 (from mistral-inference)
  Downloading fire-0.7.0.tar.gz (87 kB)
  Preparing metadata (setup.py) ... done
Collecting mistral_common>=1.4.0 (from mistral-inference)
  Downloading mistral_common-1.4.4-py3-none-any.whl.metadata (4.6 kB)
Collecting xformers>=0.0.24 (from mistral-inference)
  Downloading xformers-0.0.28.post3-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting pillow>=10.3.0 (from mistral-inference)
  Downloading pillow-10.4.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (9.2 kB)
Collecting tiktoken<0.8.0,>=0.7.0 (from mistral_common>=1.4.0->mistral-inference)
  Downloading tiktoken-0.7.0

## Download Mistral 7B Instruct

In [2]:
!wget https://models.mistralcdn.com/mistral-7b-v0-3/mistral-7B-Instruct-v0.3.tar

--2024-11-14 08:02:31--  https://models.mistralcdn.com/mistral-7b-v0-3/mistral-7B-Instruct-v0.3.tar
Resolving models.mistralcdn.com (models.mistralcdn.com)... 104.26.7.117, 104.26.6.117, 172.67.70.68, ...
Connecting to models.mistralcdn.com (models.mistralcdn.com)|104.26.7.117|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14496675840 (14G) [application/x-tar]
Saving to: ‘mistral-7B-Instruct-v0.3.tar’


2024-11-14 08:20:22 (12.9 MB/s) - ‘mistral-7B-Instruct-v0.3.tar’ saved [14496675840/14496675840]



In [None]:
!DIR=$HOME/mistral_7b_instruct_v3 && mkdir -p $DIR && tar -xf mistral-7B-Instruct-v0.3.tar -C $DIR

In [None]:
!ls $HOME/mistral_7b_instruct_v3
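
Before loading the model, it can help to confirm that the archive was extracted into the folder the later cells read from. The sketch below is one way to do that from Python. `tokenizer.model.v3` is the file name used later in this notebook; `params.json` and `consolidated.safetensors` are assumptions about the archive layout, and `missing_model_files` is a hypothetical helper, not part of `mistral-inference`.

```python
from pathlib import Path

# tokenizer.model.v3 is referenced later in this notebook.
# params.json and consolidated.safetensors are ASSUMED file names; adjust
# them to whatever the `ls` cell above actually lists.
EXPECTED_FILES = ("tokenizer.model.v3", "params.json", "consolidated.safetensors")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    model_dir = Path(model_dir)
    return [name for name in expected if not (model_dir / name).is_file()]


# Same location the extraction cell writes to: $HOME/mistral_7b_instruct_v3
model_dir = Path.home() / "mistral_7b_instruct_v3"
missing = missing_model_files(model_dir)
print("missing files:", missing if missing else "none")
```

An empty list means every expected file was found; anything else points to an incomplete download or a different extraction path.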

## Chat with the model

In [6]:
import os

from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

# Folder the archive was extracted into
model_dir = os.path.join(os.path.expanduser("~"), "mistral_7b_instruct_v3")

# Load the tokenizer
mistral_tokenizer = MistralTokenizer.from_file(os.path.join(model_dir, "tokenizer.model.v3"))
# Build the chat completion request
completion_request = ChatCompletionRequest(messages=[UserMessage(content="Explain Machine Learning to me in a nutshell.")])
# Encode the request into token ids
tokens = mistral_tokenizer.encode_chat_completion(completion_request).tokens
# Load the model
model = Transformer.from_folder(model_dir)
# Generate a completion (temperature=0.0 gives greedy decoding)
out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=mistral_tokenizer.instruct_tokenizer.tokenizer.eos_id)
# Decode the generated tokens of the first (and only) prompt
result = mistral_tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])
print(result)

OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 29.06 MiB is free. Process 24940 has 14.72 GiB memory in use. Of the allocated memory 14.48 GiB is allocated by PyTorch, and 137.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
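
The error above comes from a GPU with 14.75 GiB of total memory, while the downloaded archive is 14,496,675,840 bytes. A rough back-of-the-envelope check shows why this is tight. The sketch below assumes the archive size is a reasonable proxy for the size of the weights in GPU memory, and the 2 GiB headroom for activations, the KV cache, and the CUDA context is an illustrative guess rather than a measured value. `weights_fit` is a hypothetical helper.

```python
GIB = 1024 ** 3


def weights_fit(weights_bytes, gpu_total_gib, headroom_gib=2.0):
    """Return True if the weights plus a headroom allowance fit in GPU memory.

    headroom_gib is an ASSUMED allowance for activations, the KV cache and
    the CUDA context; tune it for your own workload.
    """
    needed_gib = weights_bytes / GIB + headroom_gib
    return needed_gib <= gpu_total_gib


archive_bytes = 14_496_675_840  # "Length" reported by wget in the download cell
gpu_total_gib = 14.75           # total capacity reported in the error above

print(f"weights: ~{archive_bytes / GIB:.2f} GiB of {gpu_total_gib} GiB total")
print("fits with headroom:", weights_fit(archive_bytes, gpu_total_gib))
```

Under these assumptions the weights alone take roughly 13.5 GiB, which leaves little more than 1 GiB for everything else. This is consistent with the recommendation at the top of the notebook to use a larger GPU such as the A100.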

## Function calling

Mistral 7B Instruct v3 also supports function calling!

Let's start by creating a function calling example:

In [None]:
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.protocol.instruct.tool_calls import Function, Tool

completion_request = ChatCompletionRequest(
    tools=[
        Tool(
            function=Function(
                name="get_current_weather",
                description="Get the current weather",
                parameters={
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "format": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The temperature unit to use. Infer this from the user's location.",
                        },
                    },
                    "required": ["location", "format"],
                },
            )
        )
    ],
    messages=[
        UserMessage(content="What's the weather like today in Paris?"),
    ],
)
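
The `parameters` dictionary above follows JSON Schema conventions: `required` lists the arguments that must be present, and `enum` restricts `format` to two values. The model is not guaranteed to respect these constraints, so an application may want to check the arguments before calling the real function. The sketch below is a hand-rolled check that covers only the keywords used in this example (`required`, `type: string`, `enum`); it is not a full JSON Schema validator, and `check_arguments` is a hypothetical helper.

```python
def check_arguments(arguments, schema):
    """Return a list of problems; an empty list means the arguments pass."""
    problems = []
    # Every name listed under "required" must be present.
    for name in schema.get("required", []):
        if name not in arguments:
            problems.append(f"missing required argument: {name}")
    # Check type and enum constraints for the arguments that were supplied.
    for name, spec in schema.get("properties", {}).items():
        if name not in arguments:
            continue
        value = arguments[name]
        if spec.get("type") == "string" and not isinstance(value, str):
            problems.append(f"{name} must be a string")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name} must be one of {spec['enum']}")
    return problems


# Same schema as the `parameters` argument in the cell above (descriptions omitted).
weather_schema = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "format": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location", "format"],
}

print(check_arguments({"location": "Paris", "format": "celsius"}, weather_schema))
print(check_arguments({"location": "Paris", "format": "kelvin"}, weather_schema))
print(check_arguments({"format": "celsius"}, weather_schema))
```

For anything beyond a small example like this one, a dedicated JSON Schema library is likely a better fit than a custom check.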

Since we have already loaded the tokenizer and the model in the example above, we will skip these steps here.

Now we can encode the message with our tokenizer using `MistralTokenizer`.

In [None]:
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tokens = mistral_tokenizer.encode_chat_completion(completion_request).tokens

Next, run `generate` to get a response. Don't forget to pass the EOS id!

In [None]:
from mistral_inference.generate import generate

out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=mistral_tokenizer.instruct_tokenizer.tokenizer.eos_id)

Finally, we can decode the generated tokens.

In [None]:
result = mistral_tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])
result
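
The decoded text is only a description of the call the model wants to make; running the function is up to the application. The sketch below shows one way to dispatch such a call. It assumes the decoded output is a JSON list of objects with `name` and `arguments` keys, which may differ between model and tokenizer versions, so inspect `result` first. `get_current_weather` here is a hypothetical stand-in that returns placeholder text, and `run_tool_calls` is not part of `mistral-inference`.

```python
import json


def get_current_weather(location, format):
    """Hypothetical stand-in; a real version would query a weather service.

    The parameter is named `format` to match the schema, even though it
    shadows the Python builtin inside this function.
    """
    return f"(placeholder) weather for {location} in {format}"


# Map function names the model may request to local Python callables.
AVAILABLE_FUNCTIONS = {"get_current_weather": get_current_weather}


def run_tool_calls(decoded_text):
    """Parse the decoded model output and run each requested function."""
    calls = json.loads(decoded_text)
    results = []
    for call in calls:
        func = AVAILABLE_FUNCTIONS[call["name"]]  # KeyError if the name is unknown
        results.append(func(**call["arguments"]))
    return results


# ASSUMED shape of the decoded output; replace with `result` from the cell above.
example_result = '[{"name": "get_current_weather", "arguments": {"location": "Paris", "format": "celsius"}}]'
print(run_tool_calls(example_result))
```

Keeping an explicit table of allowed functions means the model can only trigger code that the application has chosen to expose.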