This SDK tool provides helper functions that make it easy to create and deploy custom models.

Let's say we want to serve [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.6) with [Instill Model](https://github.com/instill-ai/model).

1. First, create a file structure like the following:

```bash
.
├── README.md
└── tiny_llama               <=== your model name
    └── 1                    <=== your model version
        ├── model.py         <=== your model file
        ├── ray_pb2.py
        ├── ray_pb2.pyi
        ├── ray_pb2_grpc.py
        └── tinyllama        <=== model weights and dependency folder cloned from Hugging Face (remember to follow the LICENSE of each model)
```
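The skeleton can also be created programmatically. The following is a minimal sketch, not part of the SDK: `scaffold_model_dir` is a hypothetical helper that only creates `README.md` and an empty `model.py`; the proto files and the weights folder are added in the later steps.

```python
import tempfile
from pathlib import Path


def scaffold_model_dir(root: Path, model_name: str, version: int = 1) -> Path:
    """Create <root>/README.md and <root>/<model_name>/<version>/model.py as empty files."""
    version_dir = root / model_name / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    (root / "README.md").touch()
    (version_dir / "model.py").touch()
    return version_dir


# Usage: build the skeleton in a throwaway directory and list what was created.
with tempfile.TemporaryDirectory() as tmp:
    scaffold_model_dir(Path(tmp), "tiny_llama", version=1)
    created = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*"))
    print(created)
```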

Within `README.md`, put the model metadata between the two `---` lines, followed by a brief introduction. For example:
```
---
Task: TextGenerationChat
Tags:
  - TextGenerationChat
  - TinyLlama-1.1B-Chat
---

# Model-llama2-7b-chat-dvc

🔥🔥🔥 Deploy [TinyLlama-1.1B-Chat](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.6) model.
```
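To sanity-check the metadata block before uploading, something like the sketch below can help. It is a deliberately small parser that handles only the `key: value` and `- item` shapes shown in the example; `parse_front_matter` is a hypothetical name, and this is not how Instill Model itself reads the file.

```python
def parse_front_matter(readme_text: str) -> dict:
    """Parse the block between the first two '---' lines into a dict."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("README must start with a '---' line")
    meta = {}
    current_key = None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            return meta
        if stripped.startswith("- ") and isinstance(meta.get(current_key), list):
            # list item belonging to the most recent key that had no inline value
            meta[current_key].append(stripped[2:].strip())
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            current_key = key.strip()
            value = value.strip()
            meta[current_key] = value if value else []
    raise ValueError("closing '---' line not found")


SAMPLE_README = """---
Task: TextGenerationChat
Tags:
  - TextGenerationChat
  - TinyLlama-1.1B-Chat
---

A brief intro goes here.
"""

metadata = parse_front_matter(SAMPLE_README)
print(metadata)
```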

2. Then put the three proto definition files inside the `./{model_name}/{version}` folder. You can find them [here](https://github.com/instill-ai/model-backend/tree/main/assets/ray/proto). We are working to remove this step in the future.
3. Now `git clone` the model weights and dependencies from Hugging Face with Git LFS:
```bash
git lfs install
git clone https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.6 $PROJECT_ROOT/{model_name}/{version}/tinyllama
```
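If Git LFS is not active when cloning, large files are left as small text pointer files that begin with `version https://git-lfs.github.com/spec/v1`. The sketch below scans a folder for such leftovers. `find_lfs_pointers` is a hypothetical helper, and the demo uses a temporary directory rather than a real clone.

```python
import tempfile
from pathlib import Path
from typing import List

LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(folder: Path) -> List[Path]:
    """Return files under `folder` that are still Git LFS pointer stubs."""
    pointers = []
    for path in sorted(folder.rglob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            head = fh.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            pointers.append(path)
    return pointers


# Demo: one ordinary file and one file that is only an LFS pointer.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "config.json").write_text("{}")
    (root / "model.safetensors").write_bytes(
        LFS_POINTER_PREFIX + b"\noid sha256:abc\nsize 12\n"
    )
    unresolved = [p.name for p in find_lfs_pointers(root)]
    print(unresolved)
```

An empty result on the real `tinyllama` folder suggests the weights were downloaded rather than stubbed.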
4. Next, write the model file. With the help of the SDK, it looks much like what you would write when developing in your local environment.

```python
# import necessary packages
import torch
from transformers import pipeline

# import SDK helper functions
# the const module hosts the standard data types and input classes for each standard Instill AI task
from instill.helpers.const import DataType, TextGenerationChatInput
# the ray_io module hosts the parsers that convert the request payload into input parameters,
# and model outputs into the response
from instill.helpers.ray_io import StandardTaskIO
# the ray_config module hosts the decorator and deployment object for the model class
from instill.helpers.ray_config import instill_deployment, InstillDeployable
# ray_pb2 is the proto definition of the gRPC request/response
from ray_pb2 import (
    ModelReadyRequest,
    ModelReadyResponse,
    ModelMetadataRequest,
    ModelMetadataResponse,
    ModelInferRequest,
    ModelInferResponse,
    InferTensor,
)

# use the instill_deployment decorator to convert the model class into a servable model
@instill_deployment
class TinyLlama:

    # within the __init__ function, set up the model instance with the desired framework,
    # in this case the pipeline from transformers
    def __init__(self, model_path: str):
        self.pipeline = pipeline(
            "text-generation",
            model=model_path,
            torch_dtype=torch.float32,
            device_map="cpu",
        )

    # ModelMetadata tells the server what inputs and outputs the model is expecting
    def ModelMetadata(self, req: ModelMetadataRequest) -> ModelMetadataResponse:
        resp = ModelMetadataResponse(
            name=req.name,
            versions=req.version,
            framework="python",
            inputs=[
                ModelMetadataResponse.TensorMetadata(
                    name="conversation",
                    datatype=str(DataType.TYPE_STRING.name),
                    shape=[1],
                ),
                ModelMetadataResponse.TensorMetadata(
                    name="max_new_tokens",
                    datatype=str(DataType.TYPE_UINT32.name),
                    shape=[1],
                ),
                ModelMetadataResponse.TensorMetadata(
                    name="temperature",
                    datatype=str(DataType.TYPE_FP32.name),
                    shape=[1],
                ),
                ModelMetadataResponse.TensorMetadata(
                    name="top_k",
                    datatype=str(DataType.TYPE_UINT32.name),
                    shape=[1],
                ),
                ModelMetadataResponse.TensorMetadata(
                    name="random_seed",
                    datatype=str(DataType.TYPE_UINT64.name),
                    shape=[1],
                ),
                ModelMetadataResponse.TensorMetadata(
                    name="extra_params",
                    datatype=str(DataType.TYPE_STRING.name),
                    shape=[1],
                ),
            ],
            outputs=[
                ModelMetadataResponse.TensorMetadata(
                    name="text",
                    datatype=str(DataType.TYPE_STRING.name),
                    shape=[-1, -1],
                ),
            ],
        )
        return resp

    # ModelReady is the health-check method for the server
    # implement your own logic here and the result will be reflected on the console
    def ModelReady(self, req: ModelReadyRequest) -> ModelReadyResponse:
        resp = ModelReadyResponse(ready=True)
        return resp

    # ModelInfer is the method handling the trigger request from Instill Model
    async def ModelInfer(self, request: ModelInferRequest) -> ModelInferResponse:
        # prepare the response
        resp = ModelInferResponse(
            model_name=request.model_name,
            model_version=request.model_version,
            outputs=[],
            raw_output_contents=[],
        )

        # use StandardTaskIO to parse the request and get the corresponding input
        # for the text-generation-chat task
        task_text_generation_chat_input: TextGenerationChatInput = (
            StandardTaskIO.parse_task_text_generation_chat_input(request=request)
        )

        # prepare the prompt with the chat template
        prompt = self.pipeline.tokenizer.apply_chat_template(
            task_text_generation_chat_input.conversation,
            tokenize=False,
            add_generation_prompt=True,
        )

        # inference
        sequences = self.pipeline(
            prompt,
            max_new_tokens=task_text_generation_chat_input.max_new_tokens,
            do_sample=True,
            temperature=task_text_generation_chat_input.temperature,
            top_k=task_text_generation_chat_input.top_k,
            top_p=0.95,
        )

        # convert the model output into the response output, again with StandardTaskIO
        task_text_generation_chat_output = (
            StandardTaskIO.parse_task_text_generation_chat_output(sequences=sequences)
        )

        # specify the output dimension
        # (use the enum member's name, matching the datatype strings declared in ModelMetadata)
        resp.outputs.append(
            InferTensor(
                name="text",
                shape=[1, len(sequences)],
                datatype=str(DataType.TYPE_STRING.name),
            )
        )

        # finally insert the output into the response
        resp.raw_output_contents.append(task_text_generation_chat_output)

        return resp

# now simply declare a global deployable instance with the model weight name or model file name
# and specify whether this model is going to use a GPU
# (the pipeline above is loaded with device_map="cpu"; keep the two settings consistent)
deployable = InstillDeployable(TinyLlama, model_weight_or_folder_name="tinyllama", use_gpu=True)

# you can also have fine-grained control over the min/max replica numbers
deployable.update_max_replicas(2)
deployable.update_min_replicas(0)

# we plan to open up more detailed resource control in the future
```
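To make the `apply_chat_template` step in `ModelInfer` more concrete, here is a rough pure-Python imitation of the prompt it appears to produce for this model. The format is inferred from the prompt echoed in the sample response at the end of this page; the authoritative template ships with the tokenizer, and `render_chat_prompt` is a hypothetical stand-in, not a transformers or SDK function.

```python
from typing import Dict, List

# Assumption: the end-of-sequence token is "</s>", as seen in the sample response.
EOS_TOKEN = "</s>"


def render_chat_prompt(
    conversation: List[Dict[str, str]], add_generation_prompt: bool = True
) -> str:
    """Render role/content messages as '<|role|>\\ncontent</s>\\n' segments."""
    parts = [f"<|{m['role']}|>\n{m['content']}{EOS_TOKEN}\n" for m in conversation]
    if add_generation_prompt:
        # cue the model to answer as the assistant
        parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = render_chat_prompt(
    [{"role": "user", "content": "is it unhealthy to stay up late?"}]
)
print(repr(prompt))
```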

5. Finally, pack it up and serve it on `Instill Model`:
```bash
zip -r "tiny-llama.zip" .
```
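Where the `zip` command is unavailable, the archive can be built with Python's standard `zipfile` module. This is a sketch under the assumption that the archive should contain every file under the project root with paths relative to it; `zip_model_dir` is a hypothetical helper, and the demo zips a stand-in directory.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List


def zip_model_dir(project_root: Path, archive_path: Path) -> List[str]:
    """Zip every file under project_root, skipping the archive itself."""
    names = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(project_root.rglob("*")):
            if path.is_file() and path.resolve() != archive_path.resolve():
                arcname = path.relative_to(project_root).as_posix()
                zf.write(path, arcname)
                names.append(arcname)
    return names


# Demo on a stand-in project directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tiny_llama" / "1").mkdir(parents=True)
    (root / "README.md").write_text("readme")
    (root / "tiny_llama" / "1" / "model.py").write_text("# model code")
    archived = zip_model_dir(root, root / "tiny-llama.zip")
    with zipfile.ZipFile(root / "tiny-llama.zip") as zf:
        listed = zf.namelist()
    print(archived)
```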
Alternatively, if you have an LFS server or DVC bucket set up, you can push the files along with the `.dvc` or LFS pointer files to GitHub and use our GitHub import.

Now go to the `Model Hub` page on the Instill console and create a model from local with this zip, and profit!

Here is a sample request and response with this model:

_*req:*_
```bash
curl --location 'http://localhost:8080/model/v1alpha/users/admin/models/tinyllama/trigger' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer instill_sk_***' \
--data '{
    "task_inputs": [
        {
            "text_generation_chat": {
                "conversation": [
                    {
                        "role": "user",
                        "content": "is it unhealthy to stay up late?"
                    }
                ],
                "top_k": 5,
                "temperature": 0.7
            }
        }
    ]
}'
```
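The same request can be assembled from Python. The sketch below only builds the request object with the standard library and does not send it; `build_trigger_request` and its parameters are hypothetical names, while the URL path and payload shape are copied from the curl call above.

```python
import json
import urllib.request


def build_trigger_request(
    base_url: str,
    model_id: str,
    api_token: str,
    question: str,
    top_k: int = 5,
    temperature: float = 0.7,
) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "task_inputs": [
            {
                "text_generation_chat": {
                    "conversation": [{"role": "user", "content": question}],
                    "top_k": top_k,
                    "temperature": temperature,
                }
            }
        ]
    }
    return urllib.request.Request(
        url=f"{base_url}/model/v1alpha/users/admin/models/{model_id}/trigger",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )


req = build_trigger_request(
    "http://localhost:8080",
    "tinyllama",
    "instill_sk_***",  # placeholder token, as in the curl example
    "is it unhealthy to stay up late?",
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it to a running Instill Model instance.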
_*resp:*_
```json
{
    "task": "TASK_TEXT_GENERATION_CHAT",
    "task_outputs": [
        {
            "text_generation": {
                "text": "<|user|>\nis it unhealthy to stay up late?</s>\n<|assistant|>\nYes, staying up late can be unhealthy. Longer hours of sleep are important for good health and well-being. The body needs time to rest and recover after a long day, and excessive sleep can lead to a range of health problems, including insomnia, obesity, and heart disease. It's essential to set a regular sleep schedule, limit screen time before bedtime, and get enough sleep to avoid sleep-related health issues."
            }
        }
    ]
}
```
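Judging from this sample, the returned `text` echoes the rendered prompt before the model's answer. A client that only wants the answer could strip the echo as sketched below. `extract_reply` is a hypothetical helper, and the `<|assistant|>` marker and the `text_generation` key are taken from the sample response rather than from a documented contract.

```python
import json

ASSISTANT_MARKER = "<|assistant|>\n"


def extract_reply(response_body: str) -> str:
    """Return the text after the last assistant marker, or the full text if none."""
    body = json.loads(response_body)
    text = body["task_outputs"][0]["text_generation"]["text"]
    _, found, reply = text.rpartition(ASSISTANT_MARKER)
    return reply.strip() if found else text.strip()


# A shortened version of the sample response.
sample_response = json.dumps(
    {
        "task": "TASK_TEXT_GENERATION_CHAT",
        "task_outputs": [
            {
                "text_generation": {
                    "text": "<|user|>\nis it unhealthy to stay up late?</s>\n"
                    "<|assistant|>\nYes, staying up late can be unhealthy."
                }
            }
        ],
    }
)

reply = extract_reply(sample_response)
print(reply)
```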