# A Guide for Gemma 2 9B IT on Hopsworks

For details about this Large Language Model (LLM), visit the model page in the HuggingFace repository ➡️ [link](https://huggingface.co/google/gemma-2-9b-it)

## 1️⃣ Download Gemma 2 9B IT using the huggingface_hub library

First, we download the Gemma 2 model files (e.g., weights, configuration files) directly from the HuggingFace repository.


In [None]:
!pip install huggingface_hub --quiet

In [None]:
# Place your HuggingFace token in the HF_TOKEN environment variable

import os
os.environ["HF_TOKEN"] = "<INSERT_YOUR_HF_TOKEN>"
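
Pasting a token into a notebook cell makes it easy to leak when the notebook is shared. The following is a minimal sketch, assuming you would rather read the token from the environment and only prompt for it when it is missing. `resolve_hf_token` is a hypothetical helper written for this guide, not part of `huggingface_hub`.

```python
import getpass
import os


def resolve_hf_token(env=None, prompt=getpass.getpass):
    """Return the token from HF_TOKEN, prompting only if it is absent.

    Hypothetical helper for this guide, not part of huggingface_hub.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token or token.startswith("<"):
        # Unset, or still the "<INSERT_YOUR_HF_TOKEN>" placeholder: ask interactively
        token = prompt("Hugging Face token: ").strip()
    if not token:
        raise ValueError("No Hugging Face token provided")
    return token


# Demonstration with a stand-in environment and prompt, so nothing blocks on input
demo_token = resolve_hf_token(env={"HF_TOKEN": ""}, prompt=lambda _msg: "hf_example_token")
print(demo_token)  # hf_example_token
```

In a real session, `os.environ["HF_TOKEN"] = resolve_hf_token()` would replace the hardcoded assignment.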

In [None]:
import os
from huggingface_hub import snapshot_download

os.environ["HF_HUB_DISABLE_XET"] = "1"

model_id = "google/gemma-2-9b-it"
gemma_local_dir = snapshot_download(model_id, ignore_patterns="original/*", local_dir=model_id)
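
Before registering the files, it can help to confirm that the download looks complete. The sketch below is one possible sanity check, not something the guide requires: the hypothetical helper `summarize_model_dir` counts files per extension and sums their sizes. The file names in the demonstration are illustrative stand-ins, not the real contents of the Gemma repository.

```python
import tempfile
from pathlib import Path


def summarize_model_dir(model_dir):
    """Count files per extension and total size in bytes under model_dir."""
    counts = {}
    total_bytes = 0
    for path in Path(model_dir).rglob("*"):
        if path.is_file():
            ext = path.suffix or "(none)"
            counts[ext] = counts.get(ext, 0) + 1
            total_bytes += path.stat().st_size
    return counts, total_bytes


# Demonstration on a throwaway directory that mimics a downloaded snapshot
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "model-00001-of-00002.safetensors").write_bytes(b"\0" * 16)
    (Path(tmp) / "model-00002-of-00002.safetensors").write_bytes(b"\0" * 16)
    counts, total_bytes = summarize_model_dir(tmp)

print(counts, total_bytes)
```

Against the real download, `summarize_model_dir(gemma_local_dir)` should report at least one `.json` file and several weight files; a total size of a few kilobytes would suggest the weights did not arrive.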

## 2️⃣ Register Gemma 2 9B IT into Hopsworks Model Registry

Once the model files are downloaded from the HuggingFace repository, we can register them in the Hopsworks Model Registry.

In [None]:
import hopsworks

project = hopsworks.login()
mr = project.get_model_registry()

In [None]:
# The following instantiates a Hopsworks LLM model, not yet saved in the Model Registry

gemma = mr.llm.create_model(
    name="gemma2_9b_it",
    description="Gemma 2 9B IT model (via HF)"
)

In [None]:
# Register the Gemma model pointing to the local model files

gemma.save(gemma_local_dir)

## 3️⃣ Deploy Gemma 2 9B IT

After registering the LLM in the Model Registry, we can create a deployment that serves it using the vLLM engine with a user-provided YAML configuration file.
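
The cells in this section upload a file named `gemma_vllmconfig.yaml`, whose contents this guide does not show. The sketch below writes one possible version, assuming the file holds vLLM engine arguments as YAML key-value pairs. `max-model-len` and `gpu-memory-utilization` are vLLM engine arguments, but the values here are illustrative and the exact key format your Hopsworks version expects is an assumption to verify.

```python
from pathlib import Path

# Assumed engine options; the values are illustrative, not tuned for any particular GPU
vllm_options = {
    "max-model-len": 4096,
    "gpu-memory-utilization": 0.9,
}

# Emit simple "key: value" lines, which is valid YAML for flat scalar options
config_text = "".join(f"{key}: {value}\n" for key, value in vllm_options.items())

# Note: this overwrites any existing file with the same name in the working directory
config_path = Path("gemma_vllmconfig.yaml")
config_path.write_text(config_text)
print(config_text)
```

Lowering the maximum sequence length is a common way to reduce GPU memory pressure, at the cost of shorter prompts and replies.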

In [None]:
# Get a reference to the Gemma model if not obtained yet

gemma = mr.get_model("gemma2_9b_it")

In [None]:
# Upload the vLLM engine config file for the deployment

ds_api = project.get_dataset_api()

path_to_config_file = f"/Projects/{project.name}/" + ds_api.upload("gemma_vllmconfig.yaml", "Resources", overwrite=True)

In [None]:
gemma_depl = gemma.deploy(
    name="gemma2v1",
    description="Gemma 2 9B IT from HuggingFace",
    config_file=path_to_config_file,
    resources={"num_instances": 1, "requests": {"cores": 1, "memory": 1024*12, "gpus": 1}, "limits": {"cores": 2, "memory": 1024*16, "gpus": 1}},
)
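
The `resources` argument above packs several numbers into one line. As a rough sketch, the same dictionary can be built with a small helper that makes the memory arithmetic explicit. `build_resources` is a hypothetical helper, and the unit is assumed to be megabytes, as the `1024*12` expression suggests.

```python
def gib_to_mb(gib):
    """Convert GiB to the integer used in the resources dictionary (1 GiB = 1024 MB)."""
    return 1024 * gib


def build_resources(request_gib, limit_gib, gpus=1, instances=1):
    """Build a resources dictionary shaped like the one passed to deploy() above."""
    if request_gib > limit_gib:
        raise ValueError("memory request must not exceed the memory limit")
    return {
        "num_instances": instances,
        "requests": {"cores": 1, "memory": gib_to_mb(request_gib), "gpus": gpus},
        "limits": {"cores": 2, "memory": gib_to_mb(limit_gib), "gpus": gpus},
    }


resources = build_resources(request_gib=12, limit_gib=16)
print(resources["requests"]["memory"], resources["limits"]["memory"])  # 12288 16384
```

The result could be passed as `resources=resources` in the `deploy` call.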

---

In [None]:
# Retrieve the deployment created above

ms = project.get_model_serving()
gemma_depl = ms.get_deployment("gemma2v1")

In [None]:
gemma_depl.start(await_running=60*15) # wait for 15 minutes maximum

In [None]:
# gemma_depl.stop()

In [None]:
gemma_depl.get_state()
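
If the deployment is started elsewhere, or `start` returns before the model is ready, a polling loop is one way to wait for it. The sketch below is a generic helper, not a Hopsworks API: `wait_for_status` is hypothetical, and the status strings used in the demonstration are assumptions, so the comparison ignores case.

```python
import time


def wait_for_status(get_status, target="running", timeout_s=900, interval_s=10, sleep=time.sleep):
    """Poll get_status() until it matches target (case-insensitive) or timeout_s elapses.

    Returns True if the target status was observed, False on timeout.
    """
    waited = 0
    while True:
        if str(get_status()).lower() == target.lower():
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Demonstration with a stand-in status source instead of a real deployment
fake_statuses = iter(["Creating", "Starting", "Running"])
reached = wait_for_status(lambda: next(fake_statuses), interval_s=0, sleep=lambda _s: None)
print(reached)  # True
```

With a real deployment, the callable might be `lambda: gemma_depl.get_state().status`, assuming the returned state object exposes a `status` attribute.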

## 4️⃣ Prompting Gemma 2 9B IT

Once the Gemma deployment is up and running, we can start sending user prompts to the LLM. You can use either an OpenAI API-compatible client (e.g., the openai library) or any other HTTP client.

In [None]:
import os

openai_v1_uri = gemma_depl.get_openai_url()
completions_url = openai_v1_uri + "/completions" 
chat_completions_url = openai_v1_uri + "/chat/completions"

# Resolve API key for request authentication
if "SERVING_API_KEY" in os.environ:
    # if running inside Hopsworks
    api_key_value = os.environ["SERVING_API_KEY"]
else:
    # Create an API KEY using the Hopsworks UI and place the value below
    api_key_value = "<API_KEY>"
    
# Prepare request headers
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'ApiKey ' + api_key_value,
}

### 🟨 Using httpx

In [None]:
import httpx

In [None]:
#
# Chat Completion for a user message
#

user_message = "Who is the best French painter? Answer with a detailed explanation."

completion_request = {
    "model": gemma_depl.name,
    "messages": [
        {
            "role": "user",
            "content": user_message
        }
    ]
}

print("Completion request: ", completion_request, end="\n")

response = httpx.post(chat_completions_url, headers=headers, json=completion_request, timeout=45.0)
print(response)
print(response.json()["choices"][0]["message"]["content"])
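
Indexing straight into `choices` raises an unhelpful `KeyError` when the server returns an error body instead of a completion. The sketch below shows one defensive way to read the reply. `extract_reply` is a hypothetical helper, the sample payload is illustrative rather than real model output, and the top-level `error` key is an assumption about how failures are reported.

```python
def extract_reply(payload):
    """Return the assistant text from a chat completion payload, or raise a clear error."""
    if "error" in payload:
        raise RuntimeError(f"Server returned an error: {payload['error']}")
    choices = payload.get("choices") or []
    if not choices:
        raise RuntimeError("Response contains no choices")
    return choices[0]["message"]["content"]


# Illustrative payload in the OpenAI chat completion shape (not real model output)
sample_payload = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Example reply"},
            "finish_reason": "stop",
        }
    ],
}
print(extract_reply(sample_payload))  # Example reply
```

With httpx, calling `response.raise_for_status()` before `extract_reply(response.json())` also surfaces HTTP-level failures such as a rejected API key.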

In [None]:
#
# Chat Completion for list of messages
#

messages = [{
    "role": "user",
    "content": "Hi! How are you doing today?"
}, {
    "role": "assistant",
    "content": "I'm doing well! How can I help you?",
}, {
    "role": "user",
    "content": "Can you tell me what the temperature will be in Dallas, in Fahrenheit?"
}]


completion_request = {
    "model": gemma_depl.name,
    "messages": messages
}

print("Completion request: ", completion_request, end="\n")

response = httpx.post(chat_completions_url, headers=headers, json=completion_request, timeout=45.0)

print(response.json()["choices"][0]["message"]["content"])
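
Multi-turn requests resend the whole conversation each time, so the message list has to be kept in order. The sketch below keeps a history and checks that roles alternate between `user` and `assistant`. Both helpers are hypothetical, and the alternation rule is an assumption about Gemma's chat template that is worth confirming against the model card.

```python
def check_alternating_roles(messages):
    """Raise ValueError unless roles go user, assistant, user, ... and end on a user turn."""
    expected = "user"
    for index, message in enumerate(messages):
        if message["role"] != expected:
            raise ValueError(
                f"message {index}: expected role {expected!r}, got {message['role']!r}"
            )
        expected = "assistant" if expected == "user" else "user"
    if not messages or messages[-1]["role"] != "user":
        raise ValueError("the conversation should end with a user message")


def add_turn(history, role, content):
    """Return a new history list with one more message appended."""
    return history + [{"role": role, "content": content}]


history = []
history = add_turn(history, "user", "Hi! How are you doing today?")
history = add_turn(history, "assistant", "I'm doing well! How can I help you?")
history = add_turn(history, "user", "What is a good museum to visit in Paris?")
check_alternating_roles(history)
print(len(history))  # 3
```

After each reply, appending the assistant message and the next user message to `history` keeps the request valid for the following turn.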

### 🟨 Using OpenAI client

In [None]:
!pip install openai --quiet

In [None]:
from openai import OpenAI

In [None]:
client = OpenAI(
    base_url=openai_v1_uri,
    api_key="X",  # placeholder; the Authorization header in default_headers carries the real key
    default_headers=headers
)

In [None]:
#
# Chat Completion for a user message
#

chat_response = client.chat.completions.create(
    model=gemma_depl.name,
    messages=[
        {"role": "user", "content": "Who is the best French painter? Answer with a short explanation."},
    ]
)

print(chat_response.choices[0].message.content)
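
The request above relies on the server's default sampling settings. As a sketch, the call's keyword arguments can be assembled separately so that reply length and randomness are explicit. `chat_params` is a hypothetical helper; `max_tokens` and `temperature` are standard chat completion parameters, and the default values chosen here are only illustrative.

```python
def chat_params(model, prompt, max_tokens=256, temperature=0.2):
    """Build keyword arguments for a chat completion request."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0 and 2")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


# "gemma2v1" is the deployment name used earlier in this guide
params = chat_params("gemma2v1", "Name one French painter.", max_tokens=64)
print(params["max_tokens"], params["temperature"])  # 64 0.2
```

The dictionary can then be unpacked into the client call, for example `client.chat.completions.create(**params)`.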

In [None]:
#
# Chat Completion for list of messages
#

chat_response = client.chat.completions.create(
    model=gemma_depl.name,
    messages=[{
        "role": "user",
        "content": "Hi! How are you doing today?"
    }, {
        "role": "assistant",
        "content": "I'm doing well! How can I help you?",
    }, {
        "role": "user",
        "content": "Can you tell me what the temperature will be in Dallas, in Fahrenheit?"
    }]
)

print(chat_response.choices[0].message.content)