# Model Download and Deployment with Kamiwaza SDK

This notebook demonstrates how to download and deploy models using the Kamiwaza SDK. We'll walk through the complete process step by step:

1. Searching for models
2. Downloading model files
3. Deploying the model
4. Using the model with the OpenAI-compatible interface
5. Stopping the model deployment

In this example, we're using a small language model ([Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF)), but the same process works for any supported model.

## Initialize the Kamiwaza Client

First, we initialize the client by connecting to our Kamiwaza server.

In [None]:
from kamiwaza_sdk import kamiwaza_sdk as kz

# Initialize the client
client = kz("http://localhost:7777/api/")
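
If the server is not always on `localhost`, the URL can be resolved before the client is created. The following is a minimal sketch, not part of the SDK: the helper `resolve_api_url` and the environment variable name `KAMIWAZA_API_URL` are hypothetical, chosen only for illustration. It also assumes the trailing slash shown in the URL above should be kept.

```python
import os

DEFAULT_API_URL = "http://localhost:7777/api/"


def resolve_api_url(env=os.environ) -> str:
    """Return the server URL from the environment, always ending in a slash."""
    # KAMIWAZA_API_URL is a hypothetical variable name used for this sketch.
    url = env.get("KAMIWAZA_API_URL", DEFAULT_API_URL)
    return url if url.endswith("/") else url + "/"


# A plain dict stands in for the real environment here.
api_url = resolve_api_url({"KAMIWAZA_API_URL": "http://gpu-box:7777/api"})
print(api_url)
```

The resulting string could then be passed to `kz(...)` in place of the hard-coded URL.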

## Search for a Model

Let's search for a specific model from Hugging Face. We use the `search_models` method with the repository ID and set `exact=True` to find an exact match.

The search results show:
- Model name and repository ID
- Available files and their types
- Available quantization levels (fp16, q2_k, q3_k, etc.)
- Download status information

In [None]:
hf_repo = 'Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF'
client.models.search_models(hf_repo, exact=True)

## Download the Model Files

Now we'll initiate the model download using `initiate_model_download`. 

By default, this downloads the best quantization for your hardware. To request a particular quantization level, pass the `quantization` argument, for example:

```python
client.models.initiate_model_download(hf_repo, quantization="q4_k")
```

In [None]:
client.models.initiate_model_download(hf_repo)

## Check Download Status

After initiating the download, we can check its status to see the progress.

In [None]:
client.models.check_download_status(hf_repo)

## Wait for Download Completion

Instead of repeatedly checking the status, we can use the `wait_for_download` method to wait until all downloads are complete. This method provides progress updates and a summary once the download finishes.

In [None]:
client.models.wait_for_download(hf_repo)
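
For intuition, the loop that `wait_for_download` spares us from writing looks roughly like the sketch below. It is not the SDK's implementation: `poll_until_done`, the `fetch_status` callable, and the `"completed"` status string are hypothetical stand-ins, and the fake status source replaces a real status check against the server.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], str],
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
) -> int:
    """Call fetch_status until it returns "completed"; return the poll count.

    Raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    polls = 0
    while True:
        polls += 1
        if fetch_status() == "completed":  # hypothetical status string
            return polls
        if time.monotonic() >= deadline:
            raise TimeoutError("download did not finish before the timeout")
        time.sleep(interval_s)


# Fake status source standing in for a real status check.
fake_statuses = iter(["downloading", "downloading", "completed"])
polls_needed = poll_until_done(lambda: next(fake_statuses), interval_s=0.01)
print(polls_needed)
```

The timeout matters in practice: without it, a stalled download would block the notebook cell indefinitely.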

## Deploy the Model

Once the model is downloaded, we can deploy it using `deploy_model`. This method prepares the model for inference and returns a deployment ID.

In [None]:
client.serving.deploy_model(repo_id=hf_repo)

## List Active Deployments

We can view all active model deployments to confirm our model is running. The output shows:
- Deployment ID and model ID
- Model name
- Status (DEPLOYED, STARTING, etc.)
- Instance information
- The endpoint URL for making inference requests

In [None]:
client.serving.list_active_deployments()
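
When several models are running, it can help to filter the list by status. The sketch below works on plain dictionaries with made-up keys (`model_name`, `status`); the objects actually returned by `list_active_deployments` may be shaped differently, so treat the field names as assumptions to adapt.

```python
from typing import Iterable


def names_with_status(deployments: Iterable[dict], status: str = "DEPLOYED") -> list[str]:
    """Return the model names of deployments whose status matches."""
    return [d["model_name"] for d in deployments if d.get("status") == status]


# Made-up records for illustration; real data comes from the server.
sample = [
    {"model_name": "model-a", "status": "DEPLOYED"},
    {"model_name": "model-b", "status": "STARTING"},
]
ready = names_with_status(sample)
print(ready)
```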

## Use the OpenAI-Compatible Interface

Kamiwaza provides an OpenAI-compatible interface, making it easy to use familiar tools with your deployed models. We create an OpenAI client using the `get_client` method.

Now we can use the standard OpenAI API patterns to interact with our model.

In [None]:
openai_client = client.openai.get_client(repo_id=hf_repo)

In [None]:
# Create a streaming chat completion
response = openai_client.chat.completions.create(
    messages=[
        {"role": "user", "content": "How many r's are in the word 'strawberry'? ONLY RESPOND WITH A SINGLE NUMBER"}
    ],
    model="model",
    stream=True
)

# Print each piece of the streamed response as it arrives
for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
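
If the full reply is needed as one string rather than printed piece by piece, the same loop can accumulate the deltas. The sketch below uses stand-in objects built with `SimpleNamespace` that mimic the `choices[0].delta.content` shape used above, so it runs without a server; `collect_stream` and `fake_chunk` are illustrative helpers, not part of any SDK.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Join the non-empty delta.content pieces of a chat completion stream."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        piece = chunk.choices[0].delta.content
        if piece is not None:
            parts.append(piece)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


full_text = collect_stream([fake_chunk("Hel"), fake_chunk(None), fake_chunk("lo")])
print(full_text)
```

With a live deployment, the `response` iterator from the cell above could be passed to `collect_stream` instead of the fake chunks. A stream can only be consumed once, so print or collect, not both.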


## Stop the Model Deployment

When we're done using the model, we can stop the deployment to free up resources.

In [None]:
client.serving.stop_deployment(repo_id=hf_repo)

In [None]:
# Confirm that the stopped deployment no longer appears in the list
client.serving.list_active_deployments()
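
To make sure resources are freed even when a cell in the middle fails, the deploy and stop steps can be paired in a context manager. This is a generic Python pattern sketched with stand-in callables; `running_deployment` is a hypothetical helper, and with a live server the two callables would wrap the deploy and stop calls used in this notebook.

```python
from contextlib import contextmanager


@contextmanager
def running_deployment(start, stop):
    """Run start(), yield its result, and always run stop() afterwards."""
    handle = start()
    try:
        yield handle
    finally:
        stop()


# Stand-in callables that only record the order of events.
events = []
try:
    with running_deployment(lambda: events.append("deploy"),
                            lambda: events.append("stop")):
        raise RuntimeError("inference failed")
except RuntimeError:
    events.append("error handled")
print(events)
```

The `finally` clause runs before the exception leaves the `with` block, so the stop step is recorded ahead of the error handler.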