# A Guide for Mistral NeMo Instruct-2407 on Hopsworks

For details about this Large Language Model (LLM), visit the model page in the HuggingFace repository ➡️ [link](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407)

## 1️⃣ Download Mistral NeMo Instruct-2407 using the huggingface_hub library

First, we download the Mistral model files (e.g., weights, configuration files) directly from the HuggingFace repository.


In [1]:
!pip install huggingface_hub --quiet

In [2]:
# Place your HuggingFace token in the HF_TOKEN environment variable

import os
os.environ["HF_TOKEN"] = "<INSERT_YOUR_HF_TOKEN>"

In [3]:
from huggingface_hub import snapshot_download

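# Skip consolidated.safetensors; the sharded model-0000x-of-00005.safetensors files are downloaded instead
mistral_nemo_local_dir = snapshot_download("mistralai/Mistral-Nemo-Instruct-2407", ignore_patterns=["consolidated.safetensors"])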

Fetching 17 files:   0%|          | 0/17 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

model-00001-of-00005.safetensors:   0%|          | 0.00/4.87G [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/3.13M [00:00<?, ?B/s]

model-00003-of-00005.safetensors:   0%|          | 0.00/4.91G [00:00<?, ?B/s]

model-00002-of-00005.safetensors:   0%|          | 0.00/4.91G [00:00<?, ?B/s]

model-00004-of-00005.safetensors:   0%|          | 0.00/4.91G [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/622 [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.57k [00:00<?, ?B/s]

params.json:   0%|          | 0.00/204 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/29.9k [00:00<?, ?B/s]

model-00005-of-00005.safetensors:   0%|          | 0.00/4.91G [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.26M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/181k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.47M [00:00<?, ?B/s]

tekken.json:   0%|          | 0.00/14.8M [00:00<?, ?B/s]
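
Before registering the model, it can help to confirm what actually landed on disk. The following is a minimal sketch, not part of the original notebook: `summarize_snapshot` and `total_size_gb` are hypothetical helper names, and the code only assumes that `snapshot_download` returned a path to a local directory.

```python
from pathlib import Path


def summarize_snapshot(local_dir):
    """Return (relative_path, size_in_bytes) pairs for every file under local_dir."""
    root = Path(local_dir)
    files = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # stat() follows symlinks, so files linked from a cache report their real size
            files.append((path.relative_to(root).as_posix(), path.stat().st_size))
    return files


def total_size_gb(files):
    """Sum the sizes returned by summarize_snapshot and convert to gigabytes (10**9 bytes)."""
    return sum(size for _, size in files) / 1e9
```

Calling `summarize_snapshot(mistral_nemo_local_dir)` should list the weight shards, tokenizer files, and configuration files shown in the download output above, and `consolidated.safetensors` should be absent.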

## 2️⃣ Register Mistral NeMo Instruct-2407 into Hopsworks Model Registry

Once the model files are downloaded from the HuggingFace repository, we can register them in the Hopsworks Model Registry.

In [4]:
import hopsworks

project = hopsworks.login()
mr = project.get_model_registry()

2025-01-27 10:18:17,669 INFO: Python Engine initialized.

Logged in to project, explore it here https://hopsworks.ai.local/p/119


In [5]:
# The following instantiates a Hopsworks LLM model, not yet saved in the Model Registry

mistral_nemo = mr.llm.create_model(
    name="mistral_nemo_instruct",
    description="Mistral NeMo Instruct-2407 model (via HF)"
)

In [6]:
# Register the Mistral model pointing to the local model files

mistral_nemo.save(mistral_nemo_local_dir)

  0%|          | 0/6 [00:00<?, ?it/s]

Model created, explore it at https://hopsworks.ai.local/p/119/models/mistral_nemo_instruct/1


Model(name: 'mistral_nemo_instruct', version: 1)

## 3️⃣ Deploy Mistral NeMo Instruct-2407

After registering the LLM model into the Model Registry, we can create a deployment that serves it using the vLLM engine.

Hopsworks provides two types of deployments to serve LLMs with the vLLM engine:

- **Using the official vLLM OpenAI server**: an OpenAI API-compatible server implemented by the creators of vLLM, where the vLLM engine is configured with a user-provided configuration (YAML) file.

- **Using the KServe built-in vLLM server**: a KServe-based implementation of an OpenAI API-compatible server for more advanced users who need to provide a predictor script for the initialization of the vLLM engine and (optionally) the implementation of the *completions* and *chat/completions* endpoints.
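
The contents of the configuration file are not shown in this guide. As a rough sketch only, the snippet below writes a small YAML file whose keys are named after vLLM engine arguments (`max_model_len`, `gpu_memory_utilization`, `dtype`). The values are illustrative, `write_vllm_config` is a hypothetical helper, and the exact key spelling and set of arguments that Hopsworks forwards to vLLM are assumptions to verify against the Hopsworks documentation.

```python
from pathlib import Path

# Hypothetical engine settings: key names follow vLLM engine arguments,
# values are illustrative and should be tuned to the available GPU memory.
VLLM_CONFIG_YAML = """\
max_model_len: 8192
gpu_memory_utilization: 0.9
dtype: bfloat16
"""


def write_vllm_config(path="mistral_vllmconfig.yaml", content=VLLM_CONFIG_YAML):
    """Write the YAML text to disk and return the path as a string."""
    target = Path(path)
    target.write_text(content, encoding="utf-8")
    return str(target)
```

Capping `max_model_len` is a common choice on a single GPU, since Mistral NeMo supports a context window of up to 128k tokens and the KV cache for the full window may not fit in memory.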


In [7]:
# Get a reference to the Mistral model if not obtained yet

mistral_nemo = mr.get_model("mistral_nemo_instruct")




In [8]:
# Upload the vLLM engine config file for the deployments

ds_api = project.get_dataset_api()

path_to_config_file = f"/Projects/{project.name}/" + ds_api.upload("mistral_vllmconfig.yaml", "Resources", overwrite=True)

Uploading: 0.000%|          | 0/184 elapsed<00:00 remaining<?

Uploading: 0.000%|          | 0/4960 elapsed<00:00 remaining<?

### 🟨 Using KServe vLLM server

Create a model deployment by providing a predictor script and (optionally) a configuration file with the arguments for the vLLM engine.

In [None]:
# Upload the predictor script
path_to_predictor_script = f"/Projects/{project.name}/" + ds_api.upload("mistral_predictor.py", "Resources", overwrite=True)

mistral_depl = mistral_nemo.deploy(
    name="mistralnemo1",
    description="Mistral NeMo Instruct-2407 from HuggingFace", 
    script_file=path_to_predictor_script,
    config_file=path_to_config_file,
    resources={"num_instances": 1, "requests": {"cores": 2, "memory": 1024*16, "gpus": 1}}
)

Uploading: 0.000%|          | 0/4960 elapsed<00:00 remaining<?

Deployment created, explore it at https://hopsworks.ai.local/p/119/deployments/32
Before making predictions, start the deployment by using `.start()`


### 🟨 Using vLLM OpenAI server

Create a model deployment by providing a configuration file with the arguments for the vLLM engine.

In [10]:
mistral_depl = mistral_nemo.deploy(
    name="mistralnemo2",
    description="Mistral NeMo Instruct-2407 from HuggingFace",
    config_file=path_to_config_file,
    resources={"num_instances": 1, "requests": {"cores": 2, "memory": 1024*12, "gpus": 1}},
)

Deployment created, explore it at https://hopsworks.ai.local/p/119/deployments/33
Before making predictions, start the deployment by using `.start()`


---

In [11]:
# Retrieve one of the deployments created above

ms = project.get_model_serving()
mistral_depl = ms.get_deployment("mistralnemo2")

In [16]:
mistral_depl.start(await_running=60*15) # wait for 15 minutes maximum

  0%|          | 0/5 [00:00<?, ?it/s]

Start making predictions by using `.predict()`


In [None]:
# mistral_depl.stop()

In [17]:
mistral_depl.get_state()

PredictorState(status: 'Running')

## 4️⃣ Prompting Mistral NeMo Instruct-2407

Once the Mistral deployment is up and running, we can start sending user prompts to the LLM. You can use either an OpenAI API-compatible client (e.g., the `openai` library) or any other HTTP client.

In [18]:
import os

# Get the istio endpoint from the Mistral deployment page in the Hopsworks UI.
istio_endpoint = "<ISTIO_ENDPOINT>" # with format "http://<ip-address>"

# Resolve base uri. NOTE: KServe's vLLM server prepends the URIs with /openai
base_uri = "/openai" if mistral_depl.predictor.script_file is not None else ""

openai_v1_uri = istio_endpoint + base_uri + "/v1"
completions_url = openai_v1_uri + "/completions" 
chat_completions_url = openai_v1_uri + "/chat/completions"

# Resolve API key for request authentication
if "SERVING_API_KEY" in os.environ:
    # if running inside Hopsworks
    api_key_value = os.environ["SERVING_API_KEY"]
else:
    # Create an API KEY using the Hopsworks UI and place the value below
    api_key_value = "<API_KEY>"
    
# Prepare request headers
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'ApiKey ' + api_key_value,
    'Host': f"{mistral_depl.name}.{project.name.lower().replace('_', '-')}.hopsworks.ai", # also provided in the Hopsworks UI
}
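
The URL and header logic above can also be packaged as two small pure functions, which makes it easy to check the result for either deployment type without a running cluster. This is a sketch: `build_endpoints` and `build_headers` are hypothetical names, and the `/openai` prefix and `Host` format are taken directly from the cell above.

```python
def build_endpoints(istio_endpoint, uses_predictor_script):
    """Return the completions and chat/completions URLs for a deployment."""
    # KServe's vLLM server serves the OpenAI routes under /openai
    base_uri = "/openai" if uses_predictor_script else ""
    v1 = istio_endpoint.rstrip("/") + base_uri + "/v1"
    return {
        "completions": v1 + "/completions",
        "chat_completions": v1 + "/chat/completions",
    }


def build_headers(api_key, deployment_name, project_name):
    """Return the request headers, including the Host header used for routing."""
    host = f"{deployment_name}.{project_name.lower().replace('_', '-')}.hopsworks.ai"
    return {
        "Content-Type": "application/json",
        "Authorization": "ApiKey " + api_key,
        "Host": host,
    }
```

For example, `build_endpoints(istio_endpoint, mistral_depl.predictor.script_file is not None)` reproduces the URLs computed in the cell above.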

### 🟨 Using httpx

In [19]:
import httpx

In [20]:
#
# Chat Completion for a user message
# 

user_message = "Who is the best French painter. Answer with detailed explanations."

completion_request = {
    "model": mistral_depl.name,
    "messages": [
        {
            "role": "user",
            "content": user_message
        }
    ]
}

print("Completion request: ", completion_request, end="\n")

response = httpx.post(chat_completions_url, headers=headers, json=completion_request, timeout=45.0)

print(response.json()["choices"][0]["message"]["content"])

Completion request:  {'model': 'mistralnemo2', 'messages': [{'role': 'user', 'content': 'Who is the best French painter. Answer with detailed explanations.'}]}
2025-01-27 13:18:42,979 INFO: HTTP Request: POST http://51.89.4.22/v1/chat/completions "HTTP/1.1 200 OK"
Choosing the "best" French painter can be subjective and depends on personal taste, as well as the specific criteria used for judgment. However, several French painters have made significant contributions to the art world and have left lasting impacts on Western art history. Here, I'll provide detailed explanations of three major French painters often considered among the best:

1. **Claude Monet (1840-1926)** - A founding member of the Impressionist movement, Monet is renowned for his mastery of visible light, his ability to depict the changing effects of light, and his innovative techniques. Here are some reasons why he's considered one of the best:

   - **Influence on Impressionism**: Monet is often considered the fathers
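
The reply above stops mid-sentence. One possible cause, which this guide does not confirm, is a limit on the number of generated tokens. The sketch below shows how the request could set `max_tokens` and `temperature`, both standard OpenAI chat completion parameters, and how to read `finish_reason` from the response. `build_chat_request` and `extract_reply` are hypothetical helpers, and the default values are illustrative.

```python
def build_chat_request(model, messages, max_tokens=512, temperature=0.7):
    """Build a chat completion payload with explicit generation limits."""
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def extract_reply(response_json):
    """Return (content, finish_reason) from a chat completion response body."""
    choices = response_json.get("choices") or []
    if not choices:
        # Error responses typically carry no choices; surface the body instead of a KeyError
        raise ValueError(f"No choices in response: {response_json}")
    first = choices[0]
    return first["message"]["content"], first.get("finish_reason")
```

A `finish_reason` of `"length"` indicates that generation stopped because the token limit was reached, while `"stop"` indicates a natural end. With httpx, calling `response.raise_for_status()` before `response.json()` also turns HTTP errors into exceptions.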

In [21]:
#
# Chat Completion for list of messages
#

messages = [
{
    "role": "user",
    "content": "Hi! How are you doing today?"
}, {
    "role": "assistant",
    "content": "I'm doing well! How can I help you",
}, {
    "role": "user",
    "content": "Can you tell me what the temperate will be in Dallas, in fahrenheit?"
}]

completion_request = {
    "model": mistral_depl.name,
    "messages": messages
}

print("Completion request: ", completion_request, end="\n")

response = httpx.post(chat_completions_url, headers=headers, json=completion_request, timeout=45.0)

print(response.json()["choices"][0]["message"]["content"])

Completion request:  {'model': 'mistralnemo2', 'messages': [{'role': 'user', 'content': 'Hi! How are you doing today?'}, {'role': 'assistant', 'content': "I'm doing well! How can I help you"}, {'role': 'user', 'content': 'Can you tell me what the temperate will be in Dallas, in fahrenheit?'}]}
2025-01-27 13:18:44,790 INFO: HTTP Request: POST http://51.89.4.22/v1/chat/completions "HTTP/1.1 200 OK"
Sure! According to the latest forecast, the temperature in Dallas, TX today will be around 75°F (24°C) during the day, with a low of 57°F (14°C) overnight.
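
Because the endpoint is stateless, a multi-turn conversation is continued by resending the full message list with the latest reply appended. A minimal sketch, where `append_turn` is a hypothetical helper name:

```python
def append_turn(messages, assistant_reply, next_user_message):
    """Return a new message list extended with the assistant reply and the next user turn."""
    # A new list is built so the caller's original history is left unchanged
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_message},
    ]
```

For example, `append_turn(messages, response.json()["choices"][0]["message"]["content"], "And in celsius?")` produces the message list for the next request.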


### 🟨 Using OpenAI client

In [22]:
!pip install openai --quiet

In [23]:
from openai import OpenAI

In [24]:
client = OpenAI(
    base_url=openai_v1_uri,
    api_key="X",  # placeholder; authentication uses the 'Authorization: ApiKey ...' entry in default_headers
    default_headers=headers
)

In [25]:
#
# Chat Completion for a user message
#

chat_response = client.chat.completions.create(
    model=mistral_depl.name,
    messages=[
        {"role": "user", "content": "Who is the best French painter. Answer with detailed explanations."},
    ]
)

print(chat_response.choices[0].message.content)

2025-01-27 13:19:12,434 INFO: HTTP Request: POST http://51.89.4.22/v1/chat/completions "HTTP/1.1 200 OK"
Choosing the "best" French painter can be subjective and depends on the criteria you value most: influence, technical skill, innovation, historical significance, or simply personal preference. However, several names frequently rise to the top of these discussions due to their profound impact on art history. Here are a few notable French painters, each with detailed explanations:

1. **Leonardo da Vinci (1452-1519)**: Although not exclusively French, Leonardo spent the final 18 years of his life in France, working for King Francis I. His influence on French art was immense, and his artistic legacy is widely acknowledged. Leonardo's mastery of sfumato, his incredible anatomical understanding, and his groundbreaking compositions have inspired countless artists. His most famous works, such as the "Mona Lisa" and "The Last Supper," are iconic symbols of Western art.

2. **Jacques-Louis D

In [26]:
#
# Chat Completion for list of messages
#

chat_response = client.chat.completions.create(
    model=mistral_depl.name,
    messages=[{
        "role": "user",
        "content": "Hi! How are you doing today?"
    }, {
        "role": "assistant",
        "content": "I'm doing well! How can I help you?",
    }, {
        "role": "user",
        "content": "Can you tell me what the temperate will be in Dallas, in fahrenheit?"
    }]
)

print(chat_response.choices[0].message.content)

2025-01-27 13:19:13,569 INFO: HTTP Request: POST http://51.89.4.22/v1/chat/completions "HTTP/1.1 200 OK"
Sure! According to the latest forecast, the temperature in Dallas, Texas will be around 75°F (24°C) today.
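
For long answers, the OpenAI client can also stream the reply as it is generated by passing `stream=True`. The sketch below assumes the deployment supports streaming, which this guide does not verify, and `stream_reply` is a hypothetical helper name. It takes the client as an argument, so it works with the `client` object created above.

```python
def stream_reply(client, model, messages):
    """Yield text fragments from a streamed chat completion."""
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        # Some chunks (for example a final usage chunk) carry no choices
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        # The first and last chunks may have no text content
        if delta:
            yield delta
```

A typical use is `for piece in stream_reply(client, mistral_depl.name, messages): print(piece, end="", flush=True)`.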
