# A Guide for Llama3.1 8B-Instruct on Hopsworks

For details about this Large Language Model (LLM), visit the model page in the HuggingFace repository ➡️ [link](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)

## 1️⃣ Download Llama3.1 8B-Instruct using the huggingface_hub library

First, we download the Llama3.1 model files (e.g., weights, configuration files) directly from the HuggingFace repository.


In [1]:
!pip install huggingface_hub --quiet

In [2]:
# Place your HuggingFace token in the HF_TOKEN environment variable

import os
os.environ["HF_TOKEN"] = "<INSERT_YOUR_HF_TOKEN>"

In [3]:
from huggingface_hub import snapshot_download

llama31_local_dir = snapshot_download("meta-llama/Llama-3.1-8B-Instruct", ignore_patterns="original/*")

Fetching 14 files:   0%|          | 0/14 [00:00<?, ?it/s]
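
Before registering the files, it can be useful to confirm that the snapshot contains what you expect. The following is a minimal sketch that is not part of the original notebook: `summarize_model_dir` is a hypothetical helper that uses only the standard library to count the files under a directory and add up their sizes.

```python
import os


def summarize_model_dir(model_dir):
    """Return (file_count, total_bytes) for all files under model_dir.

    Hypothetical helper for illustration; it is not a huggingface_hub API.
    """
    file_count = 0
    total_bytes = 0
    for root, _dirs, files in os.walk(model_dir):
        for name in files:
            file_count += 1
            # getsize follows symlinks, so cached snapshot links report the real file size
            total_bytes += os.path.getsize(os.path.join(root, name))
    return file_count, total_bytes
```

In the notebook, this could be called as `summarize_model_dir(llama31_local_dir)` right after the download finishes.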

## 2️⃣ Register Llama3.1 8B-Instruct into Hopsworks Model Registry

Once the model files are downloaded from the HuggingFace repository, we can register them in the Hopsworks Model Registry.

In [4]:
import hopsworks

project = hopsworks.login()
mr = project.get_model_registry()

2025-01-27 14:53:39,802 INFO: Python Engine initialized.

Logged in to project, explore it here https://hopsworks.ai.local/p/119


In [5]:
# The following instantiates a Hopsworks LLM model, not yet saved in the Model Registry

llama31 = mr.llm.create_model(
    name="llama31_instruct",
    description="Llama3.1 8B-Instruct model (via HF)"
)

In [6]:
# Register the Llama model pointing to the local model files

llama31.save(llama31_local_dir)

  0%|          | 0/6 [00:00<?, ?it/s]

Model created, explore it at https://hopsworks.ai.local/p/119/models/llama31_instruct/1


Model(name: 'llama31_instruct', version: 1)

## 3️⃣ Deploy Llama3.1 8B-Instruct

After registering the LLM model into the Model Registry, we can create a deployment that serves it using the vLLM engine.

Hopsworks provides two types of deployments to serve LLMs with the vLLM engine:

- **Using the official vLLM OpenAI server**: an OpenAI API-compatible server implemented by the creators of vLLM, where the vLLM engine is configured through a user-provided YAML configuration file.

- **Using the KServe built-in vLLM server**: a KServe-based implementation of an OpenAI API-compatible server for more advanced users who need to provide a predictor script for the initialization of the vLLM engine and (optionally) the implementation of the *completions* and *chat/completions* endpoints.


In [7]:
# Get a reference to the Llama model if not obtained yet

llama31 = mr.get_model("llama31_instruct")




In [8]:
# Upload vllm engine config file for the deployments

ds_api = project.get_dataset_api()

path_to_config_file = f"/Projects/{project.name}/" + ds_api.upload("llama_vllmconfig.yaml", "Resources", overwrite=True)

Uploading: 0.000%|          | 0/62 elapsed<00:00 remaining<?

### 🟨 Using KServe vLLM server

Create a model deployment by providing a predictor script and (optionally) a configuration file with the arguments for the vLLM engine.

In [9]:
# Upload the predictor script
path_to_predictor_script = f"/Projects/{project.name}/" + ds_api.upload("llama_predictor.py", "Resources", overwrite=True)

llama31_depl = llama31.deploy(
    name="llama31v1",
    description="Llama3.1 8B-Instruct from HuggingFace", 
    script_file=path_to_predictor_script,
    config_file=path_to_config_file,  # optional
    resources={"num_instances": 1, "requests": {"cores": 2, "memory": 1024*16, "gpus": 1}},
)

Uploading: 0.000%|          | 0/1714 elapsed<00:00 remaining<?

Deployment created, explore it at https://hopsworks.ai.local/p/119/deployments/38
Before making predictions, start the deployment by using `.start()`


### 🟨 Using vLLM OpenAI server

Create a model deployment by providing a configuration file with the arguments for the vLLM engine.

In [10]:
llama31_depl = llama31.deploy(
    name="llama31v2",
    description="Llama3.1 8B-Instruct from HuggingFace",
    config_file=path_to_config_file,
    resources={"num_instances": 1, "requests": {"cores": 2, "memory": 1024*12, "gpus": 1}},
)

Deployment created, explore it at https://hopsworks.ai.local/p/119/deployments/39
Before making predictions, start the deployment by using `.start()`


---

In [11]:
# Retrieve one of the deployments created above

ms = project.get_model_serving()
llama31_depl = ms.get_deployment("llama31v2")

In [12]:
llama31_depl.start(await_running=60*15) # wait for 15 minutes maximum

  0%|          | 0/5 [00:00<?, ?it/s]

Start making predictions by using `.predict()`


In [13]:
# llama31_depl.stop()

In [14]:
llama31_depl.get_state()

PredictorState(status: 'Running')

## 4️⃣ Prompting Llama3.1 8B-Instruct

Once the Llama3.1 deployment is up and running, we can start sending user prompts to the LLM. You can use either an OpenAI API-compatible client (e.g., the openai library) or any other HTTP client.

In [15]:
import os

# Get the istio endpoint from the Llama deployment page in the Hopsworks UI.
istio_endpoint = "<ISTIO_ENDPOINT>" # with format "http://<ip-address>"
    
# Resolve base uri. NOTE: KServe's vLLM server prepends the URIs with /openai
base_uri = "/openai" if llama31_depl.predictor.script_file is not None else ""

openai_v1_uri = istio_endpoint + base_uri + "/v1"
completions_url = openai_v1_uri + "/completions" 
chat_completions_url = openai_v1_uri + "/chat/completions"

# Resolve API key for request authentication
if "SERVING_API_KEY" in os.environ:
    # if running inside Hopsworks
    api_key_value = os.environ["SERVING_API_KEY"]
else:
    # Create an API KEY using the Hopsworks UI and place the value below
    api_key_value = "<API_KEY>"
    
# Prepare request headers
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'ApiKey ' + api_key_value,
    'Host': f"{llama31_depl.name}.{project.name.lower().replace('_', '-')}.hopsworks.ai", # also provided in the Hopsworks UI
}
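
The URL and header logic above can also be wrapped in two small functions, which makes it easier to reuse across deployments. This is only a sketch: `build_openai_urls` and `build_headers` are hypothetical names, and the code simply mirrors the string rules used in the cell above (the `/openai` prefix for the KServe vLLM server and the `<deployment>.<project>.hopsworks.ai` host name).

```python
def build_openai_urls(istio_endpoint, uses_predictor_script):
    """Build the OpenAI-compatible URLs for a deployment (hypothetical helper)."""
    # KServe's vLLM server prepends /openai; the vLLM OpenAI server does not
    base_uri = "/openai" if uses_predictor_script else ""
    v1 = istio_endpoint.rstrip("/") + base_uri + "/v1"
    return {
        "v1": v1,
        "completions": v1 + "/completions",
        "chat_completions": v1 + "/chat/completions",
    }


def build_headers(api_key, deployment_name, project_name):
    """Build the request headers, including the Host header (hypothetical helper)."""
    host = f"{deployment_name}.{project_name.lower().replace('_', '-')}.hopsworks.ai"
    return {
        "Content-Type": "application/json",
        "Authorization": "ApiKey " + api_key,
        "Host": host,
    }
```

With the objects from this notebook, the calls would be `build_openai_urls(istio_endpoint, llama31_depl.predictor.script_file is not None)` and `build_headers(api_key_value, llama31_depl.name, project.name)`.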

### 🟨 Using httpx

In [16]:
import httpx

In [17]:
#
# Chat Completion for a user message
#

user_message = "Who is the best French painter. Answer with detailed explanations."

completion_request = {
    "model": llama31_depl.name,
    "messages": [
        {
            "role": "user",
            "content": user_message
        }
    ]
}

print("Completion request: ", completion_request, end="\n")

response = httpx.post(chat_completions_url, headers=headers, json=completion_request, timeout=45.0)
print(response)
print(response.json()["choices"][0]["message"]["content"])

Completion request:  {'model': 'llama31v2', 'messages': [{'role': 'user', 'content': 'Who is the best French painter. Answer with detailed explanations.'}]}
2025-01-27 15:01:45,144 INFO: HTTP Request: POST http://51.89.4.22/v1/chat/completions "HTTP/1.1 200 OK"
<Response [200 OK]>
Choosing the "best" French painter is subjective, as it depends on personal taste and historical context. However, I can provide you with some of the most renowned French painters and highlight their unique contributions to the world of art.

1. **Claude Monet** (1840-1926)
Monet is often considered one of the greatest French painters. He was a founding member of the Impressionist movement, which emphasized capturing the fleeting effects of light and color in outdoor settings. Monet's brushstrokes were spontaneous and expressive, and he is famous for his series of water lily paintings (Nymphéas) and his iconic depictions of London's fog-shrouded streets.

Monet's innovative techniques and his focus on light,
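
The request above sends only `model` and `messages`, so the server applies its default generation settings. The OpenAI chat completions schema also defines optional fields such as `max_tokens` and `temperature`. The sketch below shows one way to add them and to read the reply defensively; `build_chat_request` and `extract_reply` are hypothetical helpers, and the specific values you pass are up to you.

```python
def build_chat_request(model, messages, max_tokens=None, temperature=None):
    """Build a chat completions payload, adding optional fields only when set."""
    request = {"model": model, "messages": list(messages)}
    if max_tokens is not None:
        request["max_tokens"] = max_tokens
    if temperature is not None:
        request["temperature"] = temperature
    return request


def extract_reply(response_json):
    """Return the first choice's text, or raise if the response has no choices."""
    choices = response_json.get("choices")
    if not choices:
        raise ValueError(f"No choices in response: {response_json}")
    return choices[0]["message"]["content"]
```

A possible use with httpx is to post `build_chat_request(llama31_depl.name, messages, max_tokens=256)` to `chat_completions_url`, call `response.raise_for_status()` to surface HTTP errors, and then pass `response.json()` to `extract_reply`.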

In [18]:
#
# Chat Completion for list of messages
#

messages = [{
    "role": "user",
    "content": "Hi! How are you doing today?"
}, {
    "role": "assistant",
    "content": "I'm doing well! How can I help you?",
}, {
    "role": "user",
     "content": "Can you tell me what the temperate will be in Dallas, in fahrenheit?"
}]


completion_request = {
    "model": llama31_depl.name,
    "messages": messages
}

print("Completion request: ", completion_request, end="\n")

response = httpx.post(chat_completions_url, headers=headers, json=completion_request, timeout=45.0)

print(response.json()["choices"][0]["message"]["content"])

Completion request:  {'model': 'llama31v2', 'messages': [{'role': 'user', 'content': 'Hi! How are you doing today?'}, {'role': 'assistant', 'content': "I'm doing well! How can I help you?"}, {'role': 'user', 'content': 'Can you tell me what the temperate will be in Dallas, in fahrenheit?'}]}
2025-01-27 15:01:50,060 INFO: HTTP Request: POST http://51.89.4.22/v1/chat/completions "HTTP/1.1 200 OK"
However, I'm a large language model, I don't have real-time access to current weather information. But I can suggest some options to help you find the current temperature in Dallas, Texas:

1. **Check online weather websites**: You can visit websites like weather.com, accuweather.com, or wunderground.com and enter "Dallas, TX" in the search bar to get the current temperature.
2. **Use a voice assistant**: If you have a smart speaker or virtual assistant like Siri, Google Assistant, or Alexa, you can ask them to give you the current temperature in Dallas.
3. **Check a mobile app**: Download a wea
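
The chat completions endpoint is stateless, so every request has to carry the whole conversation, as the `messages` list above does. To continue a dialogue, append the assistant's reply and the next user message before sending the next request. The sketch below uses a hypothetical helper, `append_turn`, that returns a new list and leaves the original untouched.

```python
def append_turn(messages, assistant_reply, next_user_message):
    """Return a new message list extended with one assistant/user exchange."""
    return list(messages) + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_message},
    ]
```

For example, after reading the reply text from a response, `messages = append_turn(messages, reply, "And in celsius?")` would prepare the payload for the next request.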

### 🟨 Using OpenAI client

In [19]:
!pip install openai --quiet

In [20]:
from openai import OpenAI

In [21]:
client = OpenAI(
    base_url=openai_v1_uri,
    api_key="X",  # placeholder; requests authenticate through the Authorization header in default_headers
    default_headers=headers
)

In [22]:
#
# Chat Completion for a user message
#

chat_response = client.chat.completions.create(
    model=llama31_depl.name,
    messages=[
        {"role": "user", "content": "Who is the best French painter? Answer with a short explanation."},
    ]
)

print(chat_response.choices[0].message.content)

2025-01-27 15:01:59,744 INFO: HTTP Request: POST http://51.89.4.22/v1/chat/completions "HTTP/1.1 200 OK"
Determining the "best" French painter can be subjective as opinions vary based on personal taste and artistic preferences. However, here are some of the most renowned French painters:

1. **Claude Monet** (1840-1926): A founder of Impressionism, Monet is famous for his captivating landscapes, water lilies, and sunsets. His soft, dreamy brushstrokes revolutionized the art world.
2. **Pierre-Auguste Renoir** (1841-1919): A leading figure in Impressionism, Renoir is celebrated for his vibrant depictions of everyday life, often focusing on the beauty of the human body.
3. **Henri Matisse** (1869-1954): A pioneer of Fauvism, Matisse is renowned for his bold, colorful works that blended elements of modern art and craftsmanship. His intricate cut-outs and paper sculptures are highly acclaimed.
4. **Paul Cézanne** (1839-1906): A Post-Impressionist master, Cézanne played a crucial role in 

In [23]:
#
# Chat Completion for list of messages
#

chat_response = client.chat.completions.create(
    model=llama31_depl.name,
    messages=[{
        "role": "user",
        "content": "Hi! How are you doing today?"
    }, {
        "role": "assistant",
        "content": "I'm doing well! How can I help you?",
    }, {
        "role": "user",
         "content": "Can you tell me what the temperate will be in Dallas, in fahrenheit?"
    }]
)

print(chat_response.choices[0].message.content)

2025-01-27 15:02:02,872 INFO: HTTP Request: POST http://51.89.4.22/v1/chat/completions "HTTP/1.1 200 OK"
However, I'm a large language model, I don't have real-time access to current weather conditions. Nevertheless, I can suggest some options to find the current temperature in Dallas, Texas:

1. Check online weather websites: You can visit websites like weather.com, accuweather.com, or wunderground.com to get the current temperature in Dallas.
2. Use a virtual assistant: You can ask virtual assistants like Siri, Google Assistant, or Alexa to provide you with the current temperature in Dallas.
3. Check a weather app: You can download a weather app on your smartphone to get the current temperature in Dallas.

If you'd like, I can provide you with the average temperature ranges for Dallas during different times of the year.
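
For long answers, the OpenAI client can also stream the reply as it is generated by passing `stream=True` to `client.chat.completions.create`. The sketch below is not from the original notebook: `stream_reply` is a hypothetical helper, and it assumes the deployment and the network path in front of it pass streamed responses through unchanged.

```python
def stream_reply(client, model, messages):
    """Yield text fragments from a streamed chat completion."""
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,  # ask the server to send the reply in chunks
    )
    for chunk in stream:
        # Some chunks carry no choices or no text; skip them
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            yield text
```

With the client created above, a loop such as `for piece in stream_reply(client, llama31_depl.name, messages): print(piece, end="")` would print the answer incrementally.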
