# Model Download and Deployment with Kamiwaza SDK

This notebook demonstrates how to download and deploy models using the Kamiwaza SDK. We'll walk through the complete process step by step:

1. Searching for models
2. Downloading model files
3. Deploying the model
4. Using the model with the OpenAI-compatible interface
5. Stopping the model deployment

In this example, we're using a small language model ([Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF)), but the same process works for any supported model.

## Initialize the Kamiwaza Client

First, we initialize the client by connecting to our Kamiwaza server.

In [2]:
from kamiwaza_client import KamiwazaClient

# Initialize the client
client = KamiwazaClient("http://localhost:7777/api/")

## Search for a Model

Let's search for a specific model from Hugging Face. We use the `search_models` method with the repository ID and set `exact=True` to find an exact match.

The search results show:
- Model name and repository ID
- Available files and their types
- Available quantization levels (fp16, q2_k, q3_k, etc.)
- Download status information

In [4]:
hf_repo = 'Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF'
client.models.search_models(hf_repo, exact=True)

[Model: Qwen2.5-Coder-0.5B-Instruct-GGUF
 Repo ID: Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF
 Files: 12 available
   GITATTRIBUTES files: 1
   UNKNOWN files: 1
   MD files: 1
   GGUF files: 9
 Available quantizations:
   - fp16
   - q2_k
   - q3_k
   - q4_0
   - q4_k
   - q5_0
   - q5_k
   - q6_k
   - q8_0
 Files: 0 downloading]

## Download the Model Files

Now we'll initiate the model download using `initiate_model_download`. 

By default, this downloads the best quantization for your hardware. To request a specific quantization level instead, pass the `quantization` parameter:

```python
client.models.initiate_model_download(hf_repo, quantization="q4_k")
```

In [5]:
client.models.initiate_model_download(hf_repo)

{'model': Model: Qwen2.5-Coder-0.5B-Instruct-GGUF
 Repo ID: Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF
 Files: 1 available
   GGUF files: 1
 Available quantizations:
   - fp16
   - q2_k
   - q3_k
   - q4_0
   - q4_k
   - q5_0
   - q5_k
   - q6_k
   - q8_0
 Files: 0 downloading,
 'files': [ModelFile: qwen2.5-coder-0.5b-instruct-q6_k.gguf
  Size: 620.25 MB],
 'download_request': ModelDownloadRequest: Model: Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF, Version: None, Hub: HubsHf,
 'result': {'result': True,
  'message': 'Downloads queued',
  'files': ['9b75e90e-1a86-49d7-9953-32242156fbe8']}}
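If you would rather pin a quantization than rely on the default, one option is to choose from the quantizations listed in the search output using your own preference order. The sketch below is a plain-Python helper, not part of the Kamiwaza SDK: `pick_quantization` and the preference list are hypothetical, and the `available` list is typed by hand from the output above because this notebook does not show how to read it from the returned model object.

```python
from typing import Optional, Sequence


def pick_quantization(available: Sequence[str], preferred: Sequence[str]) -> Optional[str]:
    """Return the first entry of `preferred` that also appears in `available`."""
    available_set = set(available)
    for quantization in preferred:
        if quantization in available_set:
            return quantization
    return None


# Quantizations copied by hand from the search output above.
available = ["fp16", "q2_k", "q3_k", "q4_0", "q4_k", "q5_0", "q5_k", "q6_k", "q8_0"]

# Hypothetical preference order; "q4_k_m" is not offered, so the next match wins.
choice = pick_quantization(available, preferred=["q4_k_m", "q4_k", "q8_0"])
print(choice)  # prints q4_k

# The result could then be passed to the call shown earlier:
# client.models.initiate_model_download(hf_repo, quantization=choice)
```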

## Check Download Status

After initiating the download, we can check its status to see the progress.

In [6]:
client.models.check_download_status(hf_repo)

[ModelDownloadStatus: qwen2.5-coder-0.5b-instruct-q6_k.gguf
 ID: 9b75e90e-1a86-49d7-9953-32242156fbe8
 Model ID: d67d5808-f95b-466f-9f85-09e1354553d7
 Is Downloading: True
 Download Progress: 0%]
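Checking the status by hand amounts to a polling loop. A minimal generic sketch of such a loop follows. `poll_until` is a hypothetical helper, not an SDK function, and the `is_downloading` attribute in the commented usage is an assumption inferred from the printed status, not a confirmed field name.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def poll_until(
    fetch: Callable[[], T],
    is_done: Callable[[T], bool],
    interval: float = 2.0,
    timeout: float = 600.0,
) -> T:
    """Call fetch() until is_done(result) is true; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_done(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Hypothetical usage with the client from this notebook (attribute name assumed):
# statuses = poll_until(
#     lambda: client.models.check_download_status(hf_repo),
#     lambda items: all(not s.is_downloading for s in items),
# )
```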

## Wait for Download Completion

Instead of repeatedly checking the status, we can use the `wait_for_download` method to wait until all downloads are complete. This method provides progress updates and a summary once the download finishes.

In [7]:
client.models.wait_for_download(hf_repo)

Overall: 95.0% [02:36] | Active: 1, Completed: 0, Total: 1 | qwen2.5-coder-0.5b-instruct-q6_k.gguf: 95% (6.16MB/s)
Download complete for: Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF
Total download time: 02:41
Files downloaded:
- qwen2.5-coder-0.5b-instruct-q6_k.gguf (620.25 MB)
Model ID: d67d5808-f95b-466f-9f85-09e1354553d7


[ModelDownloadStatus: qwen2.5-coder-0.5b-instruct-q6_k.gguf
 ID: 9b75e90e-1a86-49d7-9953-32242156fbe8
 Model ID: d67d5808-f95b-466f-9f85-09e1354553d7
 Is Downloading: True
 Download Progress: 95% (6.16MB/s), 00:05 remaining
 Download time: 01:36]

## Deploy the Model

Once the model is downloaded, we can deploy it using `deploy_model`. This method prepares the model for inference and returns a deployment ID.

In [16]:
client.serving.deploy_model(repo_id=hf_repo)

UUID('81ed8fce-4237-40d2-af76-feef5698bee2')

## List Active Deployments

We can view all active model deployments to confirm our model is running. The output shows:
- Deployment ID and model ID
- Model name
- Status (DEPLOYED, STARTING, etc.)
- Instance information
- The endpoint URL for making inference requests

In [17]:
client.serving.list_active_deployments()

[ActiveModelDeployment(id=UUID('81ed8fce-4237-40d2-af76-feef5698bee2'), m_id=UUID('d67d5808-f95b-466f-9f85-09e1354553d7'), m_name='Qwen2.5-Coder-0.5B-Instruct-GGUF', status='DEPLOYED', instances=[ModelInstance:
 ID: 49ba59dc-6021-41ab-943b-1d0871cab23c
 Deployment ID: 81ed8fce-4237-40d2-af76-feef5698bee2
 Status: DEPLOYED
 Listen Port: 50555], lb_port=51122, endpoint='http://localhost:51122/v1'),
 ActiveModelDeployment(id=UUID('2fd2e948-8441-4e88-9179-bc7321600b62'), m_id=UUID('39164ffe-4ba8-4e6e-9b90-42a4e38e4900'), m_name='Qwen2.5-7B-Instruct-GGUF', status='DEPLOYED', instances=[ModelInstance:
 ID: 8102cc7b-bcb8-4bd6-a546-b40610335bf9
 Deployment ID: 2fd2e948-8441-4e88-9179-bc7321600b62
 Status: DEPLOYED
 Listen Port: 49515], lb_port=51121, endpoint='http://localhost:51121/v1')]
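When several deployments are active, as in the output above, it can help to look up the endpoint for one model by name. The sketch below assumes the fields shown in the printed representation (`m_name`, `status`, `endpoint`) are readable as attributes. `find_endpoint` is a hypothetical helper, and the sample objects are stand-ins built with `SimpleNamespace`, not real SDK objects.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def find_endpoint(deployments: Iterable, model_name: str) -> Optional[str]:
    """Return the endpoint of the first DEPLOYED deployment matching model_name."""
    for deployment in deployments:
        if deployment.m_name == model_name and deployment.status == "DEPLOYED":
            return deployment.endpoint
    return None


# Stand-ins mirroring the two deployments printed above.
sample = [
    SimpleNamespace(
        m_name="Qwen2.5-Coder-0.5B-Instruct-GGUF",
        status="DEPLOYED",
        endpoint="http://localhost:51122/v1",
    ),
    SimpleNamespace(
        m_name="Qwen2.5-7B-Instruct-GGUF",
        status="DEPLOYED",
        endpoint="http://localhost:51121/v1",
    ),
]

print(find_endpoint(sample, "Qwen2.5-Coder-0.5B-Instruct-GGUF"))
# prints http://localhost:51122/v1

# With the real client, the list would come from:
# find_endpoint(client.serving.list_active_deployments(), "Qwen2.5-Coder-0.5B-Instruct-GGUF")
```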

## Use the OpenAI-Compatible Interface

Kamiwaza provides an OpenAI-compatible interface, so familiar OpenAI tooling works with your deployed models. We create an OpenAI client with the `get_client` method, then use standard OpenAI API patterns to interact with the model.

In [18]:
openai_client = client.openai.get_client(repo_id=hf_repo)

In [23]:
# Create a streaming chat completion
response = openai_client.chat.completions.create(
    messages=[
        {"role": "user", "content": "How many r's are in the word 'strawberry'? ONLY RESPOND WITH A SINGLE NUMBER"}
    ],
    model="model",
    stream=True
)

# Print each piece of streamed content as it arrives
for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)


2025-03-07 13:46:47,056 - httpx - INFO - HTTP Request: POST http://localhost:51122/v1/chat/completions "HTTP/1.1 200 OK"


1

## Stop the Model Deployment

When we're done using the model, we can stop the deployment to free up resources.

In [24]:
client.serving.stop_deployment(repo_id=hf_repo)

True

In [14]:
client.serving.list_active_deployments()

[]
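To make sure a deployment is always stopped, even when an error occurs while using the model, the deploy and stop calls can be wrapped in a context manager. The sketch below is one possible pattern, not part of the SDK: `deployed_model` is a hypothetical helper that relies only on the `deploy_model` and `stop_deployment` calls used in this notebook.

```python
from contextlib import contextmanager


@contextmanager
def deployed_model(client, repo_id):
    """Deploy repo_id on entry and stop the deployment on exit, even on error."""
    deployment_id = client.serving.deploy_model(repo_id=repo_id)
    try:
        yield deployment_id
    finally:
        client.serving.stop_deployment(repo_id=repo_id)


# Hypothetical usage with the client and repository from this notebook:
# with deployed_model(client, hf_repo) as deployment_id:
#     openai_client = client.openai.get_client(repo_id=hf_repo)
#     ...  # run inference here
```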