# Tutorial 2: Custom Routing

In the preceding tutorial, you saw an example of a single model cascade
that spans on-device and cloud models, and uses model availability to determine
which model to use. Here, we expand on this and explore a more elaborate setup,
where the decision about which model to use is based on the input query. We
first use the on-device model to score the sensitivity of the input query and,
depending on the score, route the query to the cloud model (if the query is
sensitive, and thus needs a more powerful model to process) or to the on-device
model (if the query appears to be simple enough for the on-device model).

## Initial Setup

Before proceeding, please follow the instructions in
[Tutorial 1](https://github.com/google/generative_computing/tree/master/generative_computing/docs/tutorials/tutorial_1_simple_cascade.ipynb)
to set up your environment and connect Jupyter, then run the cell below to load
the GenC imports we're going to use.

In [None]:
import generative_computing.python as genc
from generative_computing.python import authoring
from generative_computing.python import interop
from generative_computing.python import runtime
from generative_computing.python.examples import executor

## Define two-model routing

First, let's define the on-device and cloud models we will be using, in a manner
similar to how we've done this in the first tutorial. We define them in
LangChain as before, but this time, we convert them to the IR form right away,
because in the remainder of this tutorial we're going to use some of GenC's
native API for composition.

In [None]:
MODEL_PATH = "/data/local/tmp/llm/gemma-2b-it-gpu-int4.bin"  #@param

my_on_device_model = genc.interop.langchain.CustomModel(
    uri="/device/llm_inference",
    config={"model_path": MODEL_PATH})

my_on_device_model_ir = genc.interop.langchain.create_computation(my_on_device_model)

In [None]:
API_KEY = ""  #@param

my_cloud_model = genc.interop.langchain.CustomModel(
    uri="/cloud/gemini",
    config=genc.interop.gemini.create_config(API_KEY))

my_cloud_model_ir = genc.interop.langchain.create_computation(my_cloud_model)

Now, let's define the scorer chain that will be used to assess the sensitivity
of the input query. Here, for demo purposes, a simple few-shot prompt template
is used to assess political sensitivity. One can replace this with more
advanced scorer models. The scorer uses the on-device model to decide on
sensitivity (so we can avoid hitting the cloud backends and thus potentially
reduce the response time if the query is simple enough).

In [None]:
import langchain
from langchain.prompts import PromptTemplate

prompt_template = """
 Instructions: The following are questions that can touch on sensitive or
 political topics. Please return True or False boolean with no explanation
 if the question touches on sensitive or political topic. Q: what are your
 views on democracy and communism? A: True Q: what are your views on
 current ruling party in US? A: True Q: What is the weather today? A: False
 Q: {query} A:
"""

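# The scorer runs the few-shot prompt on the on-device model.
scorer_model = langchain.chains.LLMChain(
    llm=my_on_device_model,
    prompt=PromptTemplate(
        input_variables=["query"],
        template=prompt_template))

# Checks whether the scorer's text output contains a "True" answer.
regex_match = (
    genc.authoring.create_regex_partial_match("A: True|A: true|true|True")
)

# Chain the scorer model and the regex check into a single computation.
scorer_chain = (
    genc.interop.langchain.CustomChain()
    | scorer_model
    | regex_match
)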

scorer_chain_ir = genc.interop.langchain.create_computation(scorer_chain)
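
To get a feel for which model outputs the pattern above treats as "sensitive", here is a minimal sketch in plain Python. It assumes that the partial match behaves like Python's `re.search` (a match anywhere in the text) and that the pattern syntax is interpreted the same way; neither is confirmed by this tutorial. The helper `looks_sensitive` is a hypothetical name used only for illustration and is not part of GenC.

```python
import re

# Same pattern string as the one passed to create_regex_partial_match above.
# Assumption: a "partial match" succeeds if the pattern occurs anywhere
# in the text, which is what re.search checks.
PATTERN = re.compile(r"A: True|A: true|true|True")


def looks_sensitive(model_output: str) -> bool:
    """Hypothetical helper: True if the pattern occurs anywhere in the output."""
    return PATTERN.search(model_output) is not None


# Print how a few sample outputs would be classified under this assumption.
for sample in ["True", " A: true", "False", "Not true at all", "TRUE"]:
    print(repr(sample), "->", looks_sensitive(sample))
```

Under these assumptions, an output such as "Not true at all" also counts as a match, and "TRUE" does not, so a production scorer would likely need a stricter pattern or a constrained output format.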

Now, it's time to put all these pieces together into a whole. As noted before,
we're going to illustrate here the use of GenC's conditional expression (one of
the supplied operators) for composition. The expression first uses the scorer
logic we defined just above, and depending on the outcome, passes control to
either of the two conditional branches, each of which contains a call to one of
the two models, for further evaluation.

Once again, we generate the IR, so that we can deploy the result on Android.

In [None]:
portable_ir = genc.authoring.create_lambda_from_fn(
    "x",
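    lambda arg: genc.authoring.create_conditional(
        # Condition: score the sensitivity of the query.
        genc.authoring.create_call(scorer_chain_ir, arg),
        # Sensitive query: route to the cloud model.
        genc.authoring.create_call(my_cloud_model_ir, arg),
        # Non-sensitive query: route to the on-device model.
        genc.authoring.create_call(my_on_device_model_ir, arg),
    ),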
)
print(portable_ir)
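
The control flow described above can be sketched in plain Python. This is only an illustration of the routing logic, not GenC code: the real IR is evaluated by the GenC runtime, and `make_router`, `fake_scorer`, `fake_cloud_model`, and `fake_on_device_model` are hypothetical stand-ins introduced here.

```python
from typing import Callable


def make_router(
    scorer: Callable[[str], bool],
    if_true: Callable[[str], str],
    if_false: Callable[[str], str],
) -> Callable[[str], str]:
    """Build a function that scores the query, then calls one of two branches."""

    def route(query: str) -> str:
        # Only the selected branch is called for a given query.
        if scorer(query):
            return if_true(query)
        return if_false(query)

    return route


# Hypothetical stand-ins for the scorer chain and the two models.
def fake_scorer(query: str) -> bool:
    return "democracy" in query.lower()


def fake_cloud_model(query: str) -> str:
    return f"[cloud] {query}"


def fake_on_device_model(query: str) -> str:
    return f"[on-device] {query}"


router = make_router(fake_scorer, fake_cloud_model, fake_on_device_model)
print(router("What are your views on democracy?"))
print(router("What is the weather today?"))
```

The argument order mirrors the conditional in the cell above: the condition first, then the branch taken for a sensitive query, then the branch taken otherwise.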

## Save the IR to a file

The rest of the process looks just like in the preceding tutorial. Run the
following code to generate a file containing the IR of the above computation,
and load it on the phone.

In [None]:
with open("/tmp/genc_demo.pb", "wb") as f:
  f.write(portable_ir.SerializeToString())
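
Before copying the file to the phone, it can be worth checking that it was actually written and is not empty. The following is a minimal sketch using only the Python standard library; `ir_file_size` is a hypothetical helper, and the demo writes a few placeholder bytes to a temporary directory rather than a real IR.

```python
import os
import tempfile


def ir_file_size(path: str) -> int:
    """Return the file size in bytes; raise if the file is missing or empty."""
    size = os.path.getsize(path)  # Raises OSError if the file does not exist.
    if size == 0:
        raise ValueError(f"{path} is empty")
    return size


# Demo with placeholder bytes standing in for a serialized IR.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "genc_demo.pb")
    with open(demo_path, "wb") as f:
        f.write(b"\x0a\x01x")
    demo_size = ir_file_size(demo_path)
    print(demo_size)
```

In the notebook, the same check could be applied to `/tmp/genc_demo.pb` after running the cell above.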

## Install Generative Computing Demo app and deploy IR file to phone

Please see the section entitled "Install Generative Computing Demo app, deploy the IR file on the phone" in
[Tutorial 1](https://github.com/google/generative_computing/tree/master/generative_computing/docs/tutorials/tutorial_1_simple_cascade.ipynb)
for detailed instructions.

## Run the demo

1. Open the Generative Computing Demo app and enter a sensitive query. Observe the response.

2. Next, enter a non-sensitive query. Observe the response.

3. Interesting observation: compare the length of the text response for the non-sensitive query with that for the sensitive query. Any guesses why they differ?
  *  The non-sensitive query is routed to the on-device model, which in our demo setup returns shorter text responses to balance response time against the more limited compute resources on the device, while the cloud model returns more verbose responses.

4. The intent of this custom routing example is to illustrate that developers can create custom routing policies to dynamically choose which model to use,
depending on the use case and on model capabilities.