# Tutorial 2: Custom Routing

Here we use one model to score an input text, then route the text to one of two models based on that score.

## Initial Setup

Please follow the [Instructions](.../tutorial_1_simple_cascade.ipynb) in Tutorial 1 to set up the on-device model and
the rest of the runtime configuration.

## Define a two-model routing

We use the on-device model to estimate if a query is politically sensitive.

* If yes, we use the cloud model gemini-pro to answer the query. This improves
  accuracy on these questions, and spends the higher cloud inference cost only
  on the queries whose sensitivity calls for it.
* If not, we use the cheaper on-device gemma model to answer the query.
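
Before bringing in the real models, the routing policy above can be sketched in plain Python. This is only an illustration of the control flow, not GenC code: `route`, `toy_scorer`, `toy_cloud`, and `toy_device` are hypothetical names, and the keyword check stands in for the model-based scorer defined later in this tutorial.

```python
from typing import Callable


def route(
    query: str,
    is_sensitive: Callable[[str], bool],
    cloud_model: Callable[[str], str],
    on_device_model: Callable[[str], str],
) -> str:
    """Send the query to exactly one model, chosen by the scorer's verdict."""
    if is_sensitive(query):
        return cloud_model(query)
    return on_device_model(query)


# Hypothetical stand-ins so the sketch runs without any model.
def toy_scorer(query: str) -> bool:
    # Crude keyword check; the tutorial uses a few-shot prompt instead.
    lowered = query.lower()
    return "party" in lowered or "democracy" in lowered


def toy_cloud(query: str) -> str:
    return f"[cloud] {query}"


def toy_device(query: str) -> str:
    return f"[on-device] {query}"


print(route("What is the weather today?", toy_scorer, toy_cloud, toy_device))
# prints "[on-device] What is the weather today?"
```

Only the selected model is called for a given query, which is what keeps the cloud cost limited to the queries the scorer flags.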

Let's define the on-device and cloud models we will be using in this demo.

In [None]:
import generative_computing as genc

MODEL_PATH = "/data/local/tmp/llm/model_gpu.tflite"  #@param

my_on_device_model = genc.interop.langchain.CustomModel(
    uri="/device/llm_inference",
    config={"model_path": MODEL_PATH})

my_on_device_model_ir = genc.interop.langchain.create_computation(my_on_device_model)


In [None]:
import generative_computing as genc

API_KEY = ""  #@param

my_cloud_model = genc.interop.langchain.CustomModel(
    uri="/cloud/gemini",
    config=genc.interop.gemini.create_config(API_KEY))

my_cloud_model_ir = genc.interop.langchain.create_computation(my_cloud_model)


Now, let's define the scorer model. Here, for demo purposes, a simple few-shot
prompt template is used to assess political sensitivity.

One can replace this with more advanced scorer models.

The scorer uses the on-device model.

In [None]:
import generative_computing as genc
import langchain
from langchain.prompts import PromptTemplate

prompt_template = """
 Instructions: The following are questions that can touch on sensitive or
 political topics. Please return True or False boolean with no explanation
 if the question touches on sensitive or political topic. Q: what are your
 views on democracy and communism? A: True Q: what are your views on
 current ruling party in US? A: True Q: What is the weather today? A: False
 Q: {query} A:
"""

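# Wrap the on-device model and the few-shot prompt in a LangChain chain.
scorer_model = langchain.chains.LLMChain(
    llm=my_on_device_model,
    prompt=PromptTemplate(
        input_variables=["query"],
        template=prompt_template))

# Turn the model's text answer into a boolean by matching any "True" variant.
regex_match = (
    genc.authoring.create_regex_partial_match("A: True|A: true|true|True")
)

# The scorer pipeline: run the prompt through the model, then apply the regex.
scorer_chain = (
    genc.interop.langchain.CustomChain()
    | scorer_model
    | regex_match
)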

scorer_chain_ir = genc.interop.langchain.create_computation(scorer_chain)
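
To get a feel for what the regex step does to the model's text output, here is a small sketch using Python's standard `re` module. It assumes that a "partial match" behaves like an unanchored search, that is, it succeeds if the pattern occurs anywhere in the text; the helper name `partial_match` is hypothetical and this is not the GenC implementation.

```python
import re

# Same pattern as in the scorer chain above.
PATTERN = re.compile(r"A: True|A: true|true|True")


def partial_match(model_output: str) -> bool:
    """Return True if the pattern occurs anywhere in the model output."""
    return PATTERN.search(model_output) is not None


print(partial_match("A: True"))            # prints True
print(partial_match("False"))              # prints False
print(partial_match("That is not true."))  # prints True
```

The last line shows a limitation of this kind of matching: any output that merely contains the word "true" counts as a positive score. That is acceptable for a demo, but a production scorer would likely need a stricter pattern or a dedicated classifier.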

Next, we define a conditional: if the scorer judges the query to be sensitive
and returns True, we route the query to the cloud model; otherwise we route it
to the on-device model.

We also generate a portable intermediate representation (IR) that can run on a
workstation or on Android.

In [None]:
import generative_computing as genc


portable_ir = genc.authoring.create_lambda_from_fn(
    "x",
    lambda arg: genc.authoring.create_conditional(
        genc.authoring.create_call(scorer_chain_ir, arg),
        genc.authoring.create_call(my_cloud_model_ir, arg),
        genc.authoring.create_call(my_on_device_model_ir, arg),
    ),
)
print(portable_ir)

## Run it

In [None]:
import generative_computing as genc

runner = genc.runtime.Runner(portable_ir,
                             genc.examples.executor.create_default_executor())

runner("What are your views on ice cream?")

## Save the IR to a file

Run the following code to write the IR of the above computation to a file.

In [None]:
# Serialize the IR and write it to disk with Python's built-in open().
with open("/tmp/genc_demo.pb", "wb") as f:
  f.write(portable_ir.SerializeToString())


## Install the Generative Computing Demo app and deploy the IR file to phone

Please see ["Install Generative Computing Demo app and deploy the IR file to phone"](.../tutorial_1_simple_cascade.ipynb) in Tutorial 1 for instructions.


## Run the demo

1. Open the Generative Computing Demo app and enter a sensitive query. Observe the response.

2. Next, enter a non-sensitive query. Observe the response.

3. Interesting observation: compare the length of the response to the non-sensitive query with the length of the response to the sensitive query. Any guesses why they differ?
   * The non-sensitive query is routed to the on-device model, which in our demo setup returns shorter responses to keep response time acceptable given the limited compute resources on the device. The cloud model returns more verbose responses.

4. The intent of this custom routing example is to show that developers can create custom routing policies that dynamically choose which model to use, depending on the use case and on each model's capabilities.