# Tutorial 2: Custom Routing

In the preceding tutorial, you saw an example of a single model cascade
that spans the Gemini Pro model in the cloud and the on-device Gemma model,
and that uses model availability to determine which model to use. Here, we
expand on this and explore a more elaborate setup, where the decision of which
model to use is based on the input query. We first use one model to score the
sensitivity of the input query and, depending on the score, route the query
either to the cloud model (if the query is sensitive, and thus needs a more
powerful model to process it) or to the on-device model (if the query appears
simple enough for the on-device model to handle).

## Initial Setup

Before proceeding, please follow the instructions in
[Tutorial 1](https://github.com/google/genc/tree/master/genc/docs/tutorials/tutorial_1_simple_cascade.ipynb)
to set up your environment, connect Jupyter, get the API_KEY, model file, etc.,
and run the cell below to import the GenC modules we're going to use.

In [None]:
import genc
from genc.python import authoring
from genc.python import interop
from genc.python import runtime
from genc.python import examples

## Define Two-Model Routing

First, let's define the on-device and cloud models we will be using, much as
we did in the first tutorial. We define them in LangChain as before, but this
time, we convert them to the IR form right away, because in the remainder of
this tutorial, we're going to use some of GenC's native APIs for composition.

In [None]:
MODEL_PATH = "/tmp/gemma-2b-it-q4_k_m.gguf"  #@param

my_on_device_model = genc.python.interop.langchain.CustomModel(
    uri="/device/gemma",
    config={"model_path": MODEL_PATH})

my_on_device_model_ir = genc.python.interop.langchain.create_computation(
    my_on_device_model)

In [None]:
API_KEY = ""  #@param

my_cloud_model = genc.python.interop.langchain.CustomModel(
    uri="/cloud/gemini",
    config=genc.python.interop.gemini.create_config(API_KEY))

my_cloud_model_ir = genc.python.interop.langchain.create_computation(
    my_cloud_model)

Now, let's define the scorer chain that will be used to assess the sensitivity
of the input query. Here, for demo purposes, a simple few-shot prompt template
is used to assess political sensitivity. One can replace this with more
advanced scorer models if desired (we leave this as an exercise to the reader).
To keep things simple, here the scorer uses the cloud model to decide on the
sensitivity. Using the on-device model would also be possible, and might even
be preferred as a way to avoid hitting the cloud backends, but it would require
more careful prompt engineering, which is beyond the scope of this tutorial.

In [None]:
import langchain
from langchain.prompts import PromptTemplate

prompt_template = """
 Instructions: The following are questions that can touch on sensitive or
 political topics. Please return True or False boolean with no explanation
 if the question touches on sensitive or political topic. Q: what are your
 views on democracy and communism? A: True Q: what are your views on
 current ruling party in US? A: True Q: What is the weather today? A: False
 Q: {query} A:
"""

scorer_model = langchain.chains.LLMChain(
    llm=my_cloud_model,
    prompt=PromptTemplate(
        input_variables=["query"],
        template=prompt_template))

regex_match = (
    genc.python.authoring.create_regex_partial_match(
        "A: True|A: true|true|True")
)

scorer_chain = (
    genc.python.interop.langchain.CustomChain()
    | scorer_model
    | regex_match
)

scorer_chain_ir = genc.python.interop.langchain.create_computation(scorer_chain)
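
To see how the last step of the chain turns the model's free-form reply into a
routing decision, the matching can be imitated in plain Python. This is only a
sketch: it assumes that `create_regex_partial_match` behaves like an unanchored
search (a match anywhere in the text counts), and `looks_sensitive` is a
hypothetical helper name that is not part of GenC.

```python
import re

# Same pattern as the one passed to create_regex_partial_match above.
PATTERN = re.compile(r"A: True|A: true|true|True")


def looks_sensitive(model_reply: str) -> bool:
    """Returns True if the pattern occurs anywhere in the reply.

    Assumption: partial matching is modeled here with re.search, which
    accepts a match at any position in the string.
    """
    return PATTERN.search(model_reply) is not None


print(looks_sensitive(" True"))           # True
print(looks_sensitive("A: False"))        # False
print(looks_sensitive("That is untrue"))  # True, "untrue" contains "true"
```

Under this assumption, any reply that merely contains the word "true" counts as
sensitive, which is one reason the prompt asks for a bare boolean with no
explanation.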

At this point, it might be a good idea to play with the scorer chain to see
if it actually works. Sometimes, especially with a small model, you may need to
tweak the prompt to get it right. Let's try something non-sensitive first.

In [None]:
my_runtime = genc.python.examples.executor.create_default_executor()
my_runner = genc.python.runtime.Runner(scorer_chain_ir, my_runtime)
print(my_runner("tell me about scuba diving"))

Now, let's try something more sensitive.

In [None]:
print(my_runner("whom should i vote for"))
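
If you end up tweaking the prompt, checking queries one at a time gets tedious.
A small harness along the following lines can run a list of labeled queries and
report the ones the scorer gets wrong. It is a sketch in plain Python:
`check_scorer`, `keyword_scorer`, and `LABELED` are hypothetical names
introduced for illustration, and the stand-in scorer exists only so the example
runs without a model. In the notebook, you would pass `my_runner` instead,
assuming its result converts cleanly to a Python `bool`.

```python
from typing import Callable, Iterable, List, Tuple


def check_scorer(
    scorer: Callable[[str], bool],
    labeled_queries: Iterable[Tuple[str, bool]],
) -> List[Tuple[str, bool, bool]]:
    """Runs the scorer on each query and returns the mismatches.

    Each mismatch is a tuple of (query, expected, actual).
    """
    mismatches = []
    for query, expected in labeled_queries:
        actual = bool(scorer(query))
        if actual != expected:
            mismatches.append((query, expected, actual))
    return mismatches


# Hypothetical stand-in so the sketch runs without any model.
def keyword_scorer(query: str) -> bool:
    return any(word in query.lower() for word in ("vote", "party", "election"))


LABELED = [
    ("tell me about scuba diving", False),
    ("whom should i vote for", True),
    ("what is the weather today", False),
]

for query, expected, actual in check_scorer(keyword_scorer, LABELED):
    print(f"MISMATCH: {query!r} expected {expected}, got {actual}")
```

An empty result means the scorer agreed with every label in the list.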

Now, it's time to define a cascade that will be powered by the scoring chain.
As noted before, we're going to illustrate here the use of GenC's conditional
expression (one of the supplied operators) for composition. The expression
first uses the scorer chain we defined just above, and depending on the
outcome, passes control to either of the two conditional branches, each of
which contains a call to one of the two models, for further evaluation.

In [None]:
my_portable_ir = genc.python.authoring.create_lambda_from_fn(
    "x",
    lambda arg: genc.python.authoring.create_conditional(
        genc.python.authoring.create_call(scorer_chain_ir, arg),
        genc.python.authoring.create_call(my_cloud_model_ir, arg),
        genc.python.authoring.create_call(my_on_device_model_ir, arg),
    ),
)
print(my_portable_ir)
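
For intuition, the shape of this expression can be mirrored in plain Python
with ordinary functions. The sketch below is not GenC code: `make_router` and
the three stubs are hypothetical names, and it assumes that the conditional
evaluates the scorer first and then only the selected branch, as the
description above suggests.

```python
from typing import Callable


def make_router(
    scorer: Callable[[str], bool],
    if_true: Callable[[str], str],
    if_false: Callable[[str], str],
) -> Callable[[str], str]:
    """Builds a function that mirrors the shape of the conditional above."""

    def route(query: str) -> str:
        # Evaluate the condition first, then call exactly one branch.
        if scorer(query):
            return if_true(query)
        return if_false(query)

    return route


# Hypothetical stubs standing in for the scorer chain and the two models.
calls = []


def stub_scorer(query: str) -> bool:
    return "vote" in query


def stub_cloud(query: str) -> str:
    calls.append("cloud")
    return f"[cloud] {query}"


def stub_on_device(query: str) -> str:
    calls.append("on_device")
    return f"[on-device] {query}"


router = make_router(stub_scorer, stub_cloud, stub_on_device)
print(router("tell me about scuba diving"))  # [on-device] tell me about scuba diving
print(router("whom should i vote for"))      # [cloud] whom should i vote for
print(calls)                                 # ['on_device', 'cloud']
```

The `calls` list records that each query reached only one of the two stubs,
which is the property that lets non-sensitive queries stay on the device.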

Now that the entire cascade powered by a scorer chain is ready, let's test it
locally. You can create a runner just like you did in the first tutorial, try
various types of queries, and see how the result changes. First, with something
non-sensitive:

In [None]:
my_runner = genc.python.runtime.Runner(my_portable_ir, my_runtime)
print(my_runner("tell me about scuba diving"))

And now, with the more sensitive query:

In [None]:
print(my_runner("whom should i vote for"))

You can tell which model is being used based on the length of the response
as well as the presence of the Llama.cpp debug output.

This concludes the second tutorial. If you like, you can follow steps similar
to those in
[Tutorial 1](https://github.com/google/genc/tree/master/genc/docs/tutorials/tutorial_1_simple_cascade.ipynb)
to deploy and play with the demo on the example Java client, or on a different
target platform of your choice.