# A Guide for DeepSeek-R1 distilled Llama3.1-8B on Hopsworks

For details about this Large Language Model (LLM), see the [model page on HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B).

## 1️⃣ Download DeepSeek-R1 distilled Llama3.1-8B using the huggingface_hub library

First, we download the distilled model files (e.g., weights and configuration files) directly from the HuggingFace repository.


In [None]:
!pip install huggingface_hub --quiet

In [None]:
# Place your HuggingFace token in the HF_TOKEN environment variable

import os
os.environ["HF_TOKEN"] = "<INSERT_YOUR_HF_TOKEN>"

In [None]:
from huggingface_hub import snapshot_download

deepseekr1_local_dir = snapshot_download("deepseek-ai/DeepSeek-R1-Distill-Llama-8B", ignore_patterns="original/*")
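
Before registering the model, it can help to confirm that the download produced the expected files. The following is a minimal sketch of such a check; the helper name `summarize_model_dir` is hypothetical and not part of `huggingface_hub`.

```python
from pathlib import Path


def summarize_model_dir(model_dir):
    """Return (relative file names, total size in bytes) for a directory tree."""
    root = Path(model_dir)
    # is_file() and stat() follow symlinks, so files that are links into a
    # cache directory are counted with the size of the real file.
    files = sorted(p for p in root.rglob("*") if p.is_file())
    names = [p.relative_to(root).as_posix() for p in files]
    total_bytes = sum(p.stat().st_size for p in files)
    return names, total_bytes


# Example usage in this notebook (assumes the download cell above has run):
# names, total_bytes = summarize_model_dir(deepseekr1_local_dir)
# print(f"{len(names)} files, {total_bytes / 1e9:.1f} GB")
```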

## 2️⃣ Register DeepSeek-R1 distilled Llama3.1-8B in the Hopsworks Model Registry

Once the model files are downloaded from the HuggingFace repository, we can register them in the Hopsworks Model Registry.

In [None]:
import hopsworks

project = hopsworks.login()
mr = project.get_model_registry()

In [None]:
# The following instantiates a Hopsworks LLM model, not yet saved in the Model Registry

deepseekr1 = mr.llm.create_model(
    name="deepseekr1_instruct",
    description="DeepSeek-R1 distilled Llama3.1-8B model (via HF)"
)

In [None]:
# Register the distilled model pointing to the local model files

deepseekr1.save(deepseekr1_local_dir)

## 3️⃣ Deploy DeepSeek-R1 distilled Llama3.1-8B

After registering the LLM model into the Model Registry, we can create a deployment that serves it using the vLLM engine.

In [None]:
# Get a reference to the distilled model if not obtained yet

deepseekr1 = mr.get_model("deepseekr1_instruct")
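
The next cell uploads a file named `deepseek_vllmconfig.yaml`, whose contents are not shown in this guide. As a rough sketch only, the snippet below renders a few vLLM engine arguments as `key: value` lines. The option values are hypothetical, the helper `render_vllm_config` is not a Hopsworks or vLLM function, and the accepted keys depend on your vLLM version (some versions also require `enable-reasoning: true` alongside `reasoning-parser`).

```python
def render_vllm_config(options):
    """Render a flat dict of engine arguments as simple `key: value` YAML lines."""
    lines = []
    for key, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # YAML booleans are lowercase
        lines.append(f"{key}: {value}")
    return "\n".join(lines) + "\n"


# Hypothetical values: adjust to your GPU and your vLLM version.
example_options = {
    "max-model-len": 8192,
    "gpu-memory-utilization": 0.9,
    "reasoning-parser": "deepseek_r1",
}

config_text = render_vllm_config(example_options)
print(config_text)

# To create the file expected by the upload cell, uncomment:
# from pathlib import Path
# Path("deepseek_vllmconfig.yaml").write_text(config_text)
```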

In [None]:
# Upload the vLLM engine config file for the deployment

ds_api = project.get_dataset_api()

path_to_config_file = f"/Projects/{project.name}/" + ds_api.upload("deepseek_vllmconfig.yaml", "Resources", overwrite=True)

### 🟨 Using vLLM OpenAI server

Create a model deployment by providing a configuration file with the arguments for the vLLM engine.

In [None]:
deepseekr1_depl = deepseekr1.deploy(
    name="deepseekr1",
    description="DeepSeek-R1 distilled Llama3.1-8B from HuggingFace",
    config_file=path_to_config_file,
    resources={"num_instances": 1, "requests": {"cores": 2, "memory": 1024*16, "gpus": 1}},
)

---

In [None]:
# Retrieve the deployment created above

ms = project.get_model_serving()
deepseekr1_depl = ms.get_deployment("deepseekr1")

In [None]:
deepseekr1_depl.start(await_running=60*15) # wait for 15 minutes maximum

In [None]:
# deepseekr1_depl.stop()

In [None]:
deepseekr1_depl.get_state()
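
`get_state()` returns a single snapshot of the deployment state. If the deployment was started from another session, or without waiting, a small polling loop can be used instead. The sketch below is generic: `wait_for_status` is a hypothetical helper, and the usage comment assumes the state object exposes a `status` string, which you should verify against your Hopsworks version.

```python
import time


def wait_for_status(get_status, target="RUNNING", timeout_s=900.0, interval_s=10.0):
    """Poll get_status() until it returns target.

    Returns True if the target status was observed, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == target:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example usage (assumption: the state object has a `status` attribute):
# wait_for_status(lambda: deepseekr1_depl.get_state().status.upper())
```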

## 4️⃣ Prompting DeepSeek-R1 distilled Llama3.1-8B

Once the deployment is up and running, we can start sending user prompts to the LLM. You can use either an OpenAI API-compatible client (e.g., the openai library) or any other HTTP client.

In [None]:
import os

# Get the istio endpoint from the deployment page in the Hopsworks UI.
istio_endpoint = "<ISTIO_ENDPOINT>" # with format "http://<ip-address>:<port>"
    
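# Resolve the base URI. NOTE: KServe's vLLM server prefixes its routes with /openai.
# The check below assumes that server is in use only when the deployment has a predictor script.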
base_uri = "/openai" if deepseekr1_depl.predictor.script_file is not None else ""

openai_v1_uri = istio_endpoint + base_uri + "/v1"
completions_url = openai_v1_uri + "/completions" 
chat_completions_url = openai_v1_uri + "/chat/completions"

# Resolve API key for request authentication
if "SERVING_API_KEY" in os.environ:
    # if running inside Hopsworks
    api_key_value = os.environ["SERVING_API_KEY"]
else:
    # Create an API KEY using the Hopsworks UI and place the value below
    api_key_value = "<API_KEY>"
    
# Prepare request headers
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'ApiKey ' + api_key_value,
    'Host': f"{deepseekr1_depl.name}.{project.name.lower().replace('_', '-')}.hopsworks.ai", # also provided in the Hopsworks UI
}
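
To make the URL and `Host` header construction easier to check in isolation, the same string logic can be wrapped in two small functions. This is only a sketch that mirrors the cell above; the function names and the sample values are hypothetical.

```python
def serving_host(deployment_name, project_name):
    """Build the Host header value the same way the cell above does."""
    return f"{deployment_name}.{project_name.lower().replace('_', '-')}.hopsworks.ai"


def openai_urls(istio_endpoint, base_uri=""):
    """Return the OpenAI-compatible URLs derived from the Istio endpoint."""
    v1 = istio_endpoint.rstrip("/") + base_uri + "/v1"
    return {
        "v1": v1,
        "completions": v1 + "/completions",
        "chat": v1 + "/chat/completions",
    }


# Hypothetical values, for illustration only.
print(serving_host("deepseekr1", "My_Project"))    # deepseekr1.my-project.hopsworks.ai
print(openai_urls("http://10.0.0.1:80/")["chat"])  # http://10.0.0.1:80/v1/chat/completions
```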

### 🟨 Using httpx

In [None]:
import httpx

In [None]:
#
# Chat Completion for a user message
#

# Round 1
user_message = "9.11 and 9.8, which is greater?"
completion_request = {
    "model": deepseekr1_depl.name,
    "messages": [
        {
            "role": "user",
            "content": user_message
        }
    ]
}

response = httpx.post(chat_completions_url, headers=headers, json=completion_request, timeout=45.0)
print(response)
content = response.json()["choices"][0]["message"]["content"]

print("Reasoning content: ", response.json()["choices"][0]["message"]["reasoning_content"])
print("Content: ", content)

# Round 2
completion_request["messages"].append({"role": "assistant", "content": content})
completion_request["messages"].append({
    "role": "user",
    "content": "How many Rs are there in the word 'strawberry'?",
})

response = httpx.post(chat_completions_url, headers=headers, json=completion_request, timeout=45.0)
content = response.json()["choices"][0]["message"]["content"]

print("Reasoning content: ", response.json()["choices"][0]["message"]["reasoning_content"])
print("Content: ", content)
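
The cell above indexes `reasoning_content` directly, which raises a `KeyError` if the server does not return that field (for example, when no reasoning parser is enabled in the vLLM config). A more defensive variant, sketched below with a hypothetical helper and a made-up sample payload, parses the JSON once and treats the reasoning field as optional. Calling `response.raise_for_status()` before parsing is also a reasonable way to surface HTTP errors early.

```python
def extract_message(payload):
    """Return (content, reasoning_content) from a chat completion JSON payload.

    reasoning_content is None when the server did not include that field.
    """
    message = payload["choices"][0]["message"]
    return message.get("content"), message.get("reasoning_content")


# Made-up payload with the same shape as an OpenAI-style chat completion.
sample_payload = {
    "choices": [
        {"message": {"role": "assistant", "content": "9.8 is greater."}}
    ]
}

content, reasoning = extract_message(sample_payload)
print("Reasoning content:", reasoning)  # None: the field is absent in this sample
print("Content:", content)

# Example usage with the response from the cell above:
# response.raise_for_status()
# content, reasoning = extract_message(response.json())
```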

### 🟨 Using OpenAI client

In [None]:
!pip install openai --quiet

In [None]:
from openai import OpenAI

In [None]:
client = OpenAI(
    base_url=openai_v1_uri,
    api_key="X",  # placeholder; the Authorization header in default_headers carries the real API key
    default_headers=headers
)

In [None]:
#
# Chat Completion for a user message
#

# Round 1
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]

response = client.chat.completions.create(model=deepseekr1_depl.name, messages=messages)
content = response.choices[0].message.content

print("reasoning_content for Round 1:", response.choices[0].message.reasoning_content)
print("content for Round 1:", content)

# Round 2
messages.append({"role": "assistant", "content": content})
messages.append({
    "role": "user",
    "content": "How many Rs are there in the word 'strawberry'?",
})
response = client.chat.completions.create(model=deepseekr1_depl.name, messages=messages)
content = response.choices[0].message.content

print("reasoning_content for Round 2:", response.choices[0].message.reasoning_content)
print("content for Round 2:", content)
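
If `reasoning_content` comes back empty, the reasoning may instead be embedded in `content` between `<think>` and `</think>` tags, which is how DeepSeek-R1 models mark their reasoning when no server-side parser separates it. The helper below is a minimal sketch of a client-side fallback; `split_think` is a hypothetical name, and it assumes at most one reasoning block at the start of the output.

```python
def split_think(text):
    """Split raw model output into (reasoning, answer) around the </think> tag.

    Returns (None, text) when no closing tag is present. The opening <think>
    tag is optional, since some chat templates place it in the prompt instead.
    """
    closing = "</think>"
    if closing not in text:
        return None, text.strip()
    reasoning, answer = text.split(closing, 1)
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()


# Made-up output string, for illustration only.
raw = "<think>\n9.8 > 9.11\n</think>\n\n9.8 is greater."
reasoning, answer = split_think(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

As in the rounds above, only the final answer (not the reasoning) would be appended to `messages` for the next turn.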