# Bring Your Own LLM Model for Inference

In this notebook, we test the Qwen 2.5 0.5B-parameter model, log it to the Model Registry, and then create a container service to run inference. You should be able to swap in a different model easily; however, some models require slightly different pipeline flows.

We have not tested larger versions of Qwen. To run them, you may need to change the compute pool's `INSTANCE_FAMILY` parameter (set through the `COMPUTE_POOL_INSTANCE_TYPE` variable later in this notebook) so that it provides the GPU resources a larger model needs. Instance families across cloud providers are described [here](https://docs.snowflake.com/en/sql-reference/sql/create-compute-pool).
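
As a rough guide when choosing an instance family, the memory needed just to hold the weights can be estimated from the parameter count and the data type. The sketch below is a back-of-the-envelope estimate only: the helper name `weight_memory_gib` is hypothetical, the parameter counts are nominal, and the figure ignores activations, the KV cache, and framework overhead, so the real requirement will be higher.

```python
# Approximate bytes used per parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str = "bfloat16") -> float:
    """Return the approximate memory (GiB) needed to hold the model weights.

    This is a lower bound: it excludes activations, KV cache and overhead.
    """
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Nominal sizes only (assumption); check the model card for exact counts.
for label, params in [("0.5B", 0.5e9), ("7B", 7e9), ("72B", 72e9)]:
    print(f"{label}: ~{weight_memory_gib(params):.1f} GiB of weights in bfloat16")
```

Compare the estimate, plus generous headroom, against the GPU memory listed for each instance family on the linked page.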

## Step 1: Test LLM model in this notebook, without creating a service

In [None]:
# Standard huggingface flow using 'text-generation' pipelines
from transformers import pipeline

model_name = "Qwen/Qwen2.5-0.5B-Instruct"

pipe = pipeline("text-generation", model_name, torch_dtype="auto")

In [None]:
# Create and test a batched input
pipe.tokenizer.padding_side="left"

system_message = 'You are a marketing assistant. For each idea, please provide a witty marketing tagline. Please only generate a single tagline and do not provide any other commentary.'

message_batch = [
    # Put the system message first, as chat templates generally expect
    [{"role": "system", "content": system_message}, {"role": "user", "content": "Mobile app for calling a taxi"}],
    [{"role": "system", "content": system_message}, {"role": "user", "content": "Paper towels that have Christmas prints"}],
]

result_batch = pipe(message_batch, max_new_tokens=512, batch_size=2)
response_message_batch = [result[0]["generated_text"] for result in result_batch]
response_message_batch
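
With chat-style input, each `generated_text` entry holds the whole conversation, with the model's reply as the last message. The sketch below shows one way to build the batch and keep only the replies. The helper names `build_message_batch` and `extract_replies` are hypothetical, and the pipeline output is mocked here so the example runs without downloading a model.

```python
from typing import Any


def build_message_batch(system_message: str, ideas: list[str]) -> list[list[dict[str, str]]]:
    """Build one conversation per idea, with the system message first."""
    return [
        [
            {"role": "system", "content": system_message},
            {"role": "user", "content": idea},
        ]
        for idea in ideas
    ]


def extract_replies(result_batch: list[list[dict[str, Any]]]) -> list[str]:
    """Return the text of the final (assistant) message of each result."""
    return [result[0]["generated_text"][-1]["content"] for result in result_batch]


batch = build_message_batch("You are a marketing assistant.", ["Taxi app", "Festive paper towels"])

# Mocked pipeline output (assumption): same shape as pipe(...) returns for chat input.
mock_results = [
    [{"generated_text": conversation + [{"role": "assistant", "content": f"Tagline {i}"}]}]
    for i, conversation in enumerate(batch)
]
print(extract_replies(mock_results))
```

In the notebook, `mock_results` would be replaced by `pipe(batch, max_new_tokens=512, batch_size=len(batch))`.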

## Step 2: Log LLM model (or fine-tuned version) to Model Registry

Model Registry Documentation: https://docs.snowflake.com/en/developer-guide/snowflake-ml/model-registry/overview

* Standard HuggingFace Pipelines: https://docs.snowflake.com/en/developer-guide/snowflake-ml/model-registry/built-in-models/hugging-face (note that the example below constructs a custom model instead)
* Custom Model Pipeline: https://docs.snowflake.com/en/developer-guide/snowflake-ml/model-registry/bring-your-own-model-types

In [None]:
# Build the custom model class
import os
import torch
import pandas as pd
from transformers import pipeline
from snowflake.ml.registry import Registry
from snowflake.ml.model import custom_model, model_signature
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Create a custom model class for the instantiation and inference of this model
class Qwen2Model(custom_model.CustomModel):
    def __init__(self, context: custom_model.ModelContext) -> None:
        super().__init__(context)

        # For `transformers` set the environment variables to use local files only
        # We will download them to a local dir using huggingface_hub
        os.environ['HF_HUB_OFFLINE'] = '1'
        os.environ['TRANSFORMERS_OFFLINE'] = '1'
        
        self.pipe = pipeline(
            "text-generation", 
            context.path("model_path"), 
            torch_dtype="auto",
            device=0,
        )
        self.pipe.tokenizer.padding_side="left"

    # Inference function with a dataframe as input
    @custom_model.inference_api
    def predict(self, prompt_df: pd.DataFrame) -> pd.DataFrame:
        prompts = prompt_df['prompts'].tolist()

        messages = [[{"role": "user", "content": prompt}] for prompt in prompts]

        results = self.pipe(messages, max_new_tokens=512, batch_size=len(messages))
        responses = [result[0]["generated_text"] for result in results]
        
        return pd.DataFrame({
            "prompt": [response[0]["content"] for response in responses],
            "response": [response[1]["content"] for response in responses]
        })
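
The `predict` method above sends every row to the GPU as a single batch (`batch_size=len(messages)`), which may exhaust GPU memory when the input table is large. One possible alternative is to process the prompts in fixed-size chunks. The sketch below is an illustration only: `chunked`, `run_in_chunks`, and the default of 8 prompts per chunk are assumptions, and a stand-in function replaces the real pipeline so the example runs anywhere.

```python
from typing import Callable, Iterator, Sequence


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_chunks(generate: Callable, messages: Sequence, max_batch_size: int = 8) -> list:
    """Call `generate` once per chunk and concatenate the results in order."""
    results = []
    for chunk in chunked(messages, max_batch_size):
        results.extend(generate(chunk, batch_size=len(chunk)))
    return results


# Stand-in for the real pipeline (assumption): echoes each prompt in upper case.
def fake_generate(chunk, batch_size):
    return [prompt.upper() for prompt in chunk]


prompts = [f"prompt {i}" for i in range(20)]
outputs = run_in_chunks(fake_generate, prompts, max_batch_size=8)
print(len(outputs), outputs[:2])
```

Inside `predict`, `generate` could be `lambda chunk, batch_size: self.pipe(chunk, max_new_tokens=512, batch_size=batch_size)`; a suitable chunk size depends on the model and GPU and would need to be found by experiment.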

In [None]:
# Download a model from huggingface to a local directory
# TO USE YOUR OWN MODEL, skip this step and pass in the model directory path in the place of 
# `local_model_location`. Finally instantiate the CustomModel class.
import tempfile
from huggingface_hub import snapshot_download

tmpdir = tempfile.mkdtemp()
local_model_location = snapshot_download(
    repo_id=model_name,
    local_dir=tmpdir
)

path_list = {"model_path": local_model_location}
qwen = Qwen2Model(context=custom_model.ModelContext(artifacts=path_list))

In [None]:
# Generate a response from the model using the predict() method
test_prompt = pd.DataFrame(['What is the internet?', 'The capital of France is'], columns=['prompts'])

response = qwen.predict(test_prompt)
response

In [None]:
# Infer the model signature from the input prompts and the response above.
# Documentation: https://docs.snowflake.com/en/developer-guide/snowflake-ml/model-registry/model-signature
signature = model_signature.infer_signature(test_prompt, response)

In [None]:
# Log the model to the Snowflake Model Registry
reg = Registry(session)
mv = reg.log_model(
    qwen,
    model_name='QWEN25',
    version_name='V4',  # Can remove this parameter to auto-create version names
    conda_dependencies=['transformers', 'tokenizers', 'pytorch', 'huggingface_hub', 'snowflake-ml-python'],
    signatures={"predict":signature},
    options={"cuda_version": "11.8"}
)

In [None]:
# This step is EXPECTED to fail.
# By default, a model version runs in a warehouse, but this model needs GPU inference (see the container service below)
mv.run(test_prompt)

## Step 3: Create a Container Service for Model Serving

Read more here: https://docs.snowflake.com/en/developer-guide/snowflake-ml/model-registry/container

*>>> Important Note: This is a long-running service, so once you are done, you will want to suspend the service to stop incurring costs. To do this, run `ALTER SERVICE QWEN_SERVICE SUSPEND;` in a Notebook or SQL worksheet*

In [None]:
# Create a compute pool for GPU access to run this service

# Compute Pool definition
DATABASE_NAME = 'NOTEBOOK_DEMO_DB'
SCHEMA_NAME = 'LLM_TEST_QWEN'
IMAGE_REPO_NAME = "QWEN_SERVICE_REPO"
COMPUTE_POOL_NAME = "QWEN_SERVICE_POOL_S"
COMPUTE_POOL_NODES = 1
COMPUTE_POOL_INSTANCE_TYPE = 'GPU_NV_S'

session.sql(f"use database {DATABASE_NAME};").collect()
session.sql(f"use schema {SCHEMA_NAME};").collect()
session.sql(f"create image repository if not exists {IMAGE_REPO_NAME}").collect()
session.sql(f"alter compute pool if exists {COMPUTE_POOL_NAME} stop all").collect()
session.sql(f"drop compute pool if exists {COMPUTE_POOL_NAME}").collect()
session.sql(f"create compute pool if not exists {COMPUTE_POOL_NAME} min_nodes={COMPUTE_POOL_NODES} " +
            f"max_nodes={COMPUTE_POOL_NODES} instance_family={COMPUTE_POOL_INSTANCE_TYPE} " +
            f"initially_suspended=True auto_resume=True auto_suspend_secs=300").collect()
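
To try a larger model, the main change in the cell above is the instance family. The sketch below wraps the same `CREATE COMPUTE POOL` statement in a small function so the family and node counts are easy to vary. The function name `create_compute_pool_sql` and the identifier check are assumptions added for illustration; `GPU_NV_M` is used only as an example of a larger family, so check the documentation linked at the top of this notebook for what is available in your region.

```python
import re

# Conservative check for unquoted identifiers (assumption, not Snowflake's full rules).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def create_compute_pool_sql(
    name: str,
    instance_family: str,
    min_nodes: int = 1,
    max_nodes: int = 1,
    auto_suspend_secs: int = 300,
) -> str:
    """Build the CREATE COMPUTE POOL statement used in this notebook."""
    for identifier in (name, instance_family):
        if not _IDENTIFIER.fullmatch(identifier):
            raise ValueError(f"unexpected identifier: {identifier!r}")
    if not 1 <= min_nodes <= max_nodes:
        raise ValueError("need 1 <= min_nodes <= max_nodes")
    return (
        f"CREATE COMPUTE POOL IF NOT EXISTS {name} "
        f"MIN_NODES = {min_nodes} MAX_NODES = {max_nodes} "
        f"INSTANCE_FAMILY = {instance_family} "
        f"INITIALLY_SUSPENDED = TRUE AUTO_RESUME = TRUE "
        f"AUTO_SUSPEND_SECS = {auto_suspend_secs}"
    )


statement = create_compute_pool_sql("QWEN_SERVICE_POOL_M", "GPU_NV_M")
print(statement)
```

The resulting string can be run with `session.sql(statement).collect()`, in the same way as the statements in the cell above.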

In [None]:
# Create a Service object that can be called easily
# Name of the Service for powering inference
SERVICE_NAME = 'QWEN_SERVICE'

# **This step may take >15 mins** - it is building a full container runtime.
mv.create_service(
    service_name=SERVICE_NAME,
    service_compute_pool=COMPUTE_POOL_NAME,
    image_repo=IMAGE_REPO_NAME,
    gpu_requests='1',
    ingress_enabled=True,
    max_instances=int(COMPUTE_POOL_NODES),
    build_external_access_integration='ALLOW_ALL_INTEGRATION'
)
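
Because the service keeps running after it is created, it helps to have the clean-up statements ready. The sketch below builds the `ALTER SERVICE ... SUSPEND` statement from the note above, together with an optional, more aggressive teardown. The function name `teardown_statements` and the `drop` flag are hypothetical, and dropping the service and pool is an assumption about what you may want, not something this notebook requires.

```python
def teardown_statements(service_name: str, compute_pool_name: str, drop: bool = False) -> list[str]:
    """Return SQL statements that stop (and optionally remove) the service."""
    statements = [
        # Stops the service's containers so it no longer consumes the pool.
        f"ALTER SERVICE IF EXISTS {service_name} SUSPEND",
        # Suspends the pool right away instead of waiting for auto-suspend.
        f"ALTER COMPUTE POOL IF EXISTS {compute_pool_name} SUSPEND",
    ]
    if drop:
        statements += [
            f"DROP SERVICE IF EXISTS {service_name}",
            f"DROP COMPUTE POOL IF EXISTS {compute_pool_name}",
        ]
    return statements


for statement in teardown_statements("QWEN_SERVICE", "QWEN_SERVICE_POOL_S"):
    print(statement)
```

In the notebook, each statement could be executed with `session.sql(statement).collect()`; run the statements from the database and schema that contain the service.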

## Step 4: Serve model from Registry and use for Inference
This code can be used elsewhere, such as in a Streamlit app or a SQL worksheet, to call the LLM.

Documentation link: https://docs.snowflake.com/en/developer-guide/snowflake-ml/model-registry/container#using-a-model-deployed-to-spcs

In [None]:
# PYTHON CALL - useful for Streamlit app
# Pull Model from Registry for Inference
from snowflake.ml.registry import Registry
from snowflake.snowpark.context import get_active_session

# Modify these based on your details.
DATABASE_NAME = 'NOTEBOOK_DEMO_DB'
SCHEMA_NAME = 'LLM_TEST_QWEN'
SELECTED_MODEL = 'QWEN25'
MODEL_VERSION = 'V4'

session = get_active_session()
reg = Registry(session=session, database_name=DATABASE_NAME, schema_name=SCHEMA_NAME)
qwen_from_registry = reg.get_model(SELECTED_MODEL).version(MODEL_VERSION)

qwen_from_registry.run(test_prompt, service_name=SERVICE_NAME)

In [None]:
-- SQL CALL - useful for applying to a table of data
-- Note: in the CustomModel class, you may want to modify the predict function to process more than
-- one row of question/answer per call, which performs better when applied to a table of data
USE DATABASE NOTEBOOK_DEMO_DB;
USE SCHEMA LLM_TEST_QWEN;
SELECT QWEN_SERVICE!PREDICT('What are large language models?');

In [None]:
CREATE OR REPLACE LOCAL TEMPORARY TABLE IDEA_GENS ON COMMIT PRESERVE ROWS AS
    SELECT
        SNOWFLAKE.CORTEX.COMPLETE(
            'llama3.1-70b',
            [
                {
                    'role': 'user',
                    'content': 'Please give me the name of a country at random. Don''t include any extra commentary, only the name of a country.'
                }
            ],
            {'temperature': 0.7}
        ) AS IDEA_TEXT
    FROM TABLE(GENERATOR(ROWCOUNT => 1000)) t;

In [None]:
-- Test using our model against all 1,000 rows
ALTER SESSION SET QUERY_TAG = 'llm_vectorization_test';
SELECT
    IDEA_TEXT,
    QWEN_SERVICE!PREDICT(
        CONCAT(
            'What is the capital of this country? Only provide the name of the capital: ',
            IDEA_TEXT:choices[0].messages::VARCHAR
        )
    ) as capital_answer
FROM IDEA_GENS;