# Tutorial 1. Simple chain in LangChain powered by a device-to-Cloud model cascade

This tutorial shows how to define simple application logic in LangChain, use our
interop APIs to configure it to be powered by a cascade of models that spans
across a model in Cloud and an on-device model on Android, and deploy it in a
Java app on an Android phone. This illustrates many of the key interoperability
and portability benefits of GenC in one concise package. See the follow-up
tutorials listed in the parent directory for how you can further extend and
customize such logic to power more complex use cases.

## Initial setup

Before we begin, we need to set up your environment so that you can continue
with the rest of this tutorial uninterrupted.

*   First, you need to start a Jupyter notebook with the GenC dependency
    wired-in, and connect to that notebook - see
    [SETUP.md](https://github.com/google/generative_computing/tree/master/SETUP.md)
    at the root of the repo, and the supporting files in the
    [Jupyter setup directory](https://github.com/google/generative_computing/tree/master/generative_computing/docs/tutorials/jupyter_setup/)
    for instructions on how to set up the build and run environment and get
    Jupyter up and running.

*   Next, you need to set up access to the Gemini Pro model that will be used
    in the tutorials. Please see the
    [instructions](https://ai.google.dev/tutorials/rest_quickstart)
    on how to get an API key to access this model through Google AI Studio.

*   Finally, for the last portion of the tutorial that includes deployment on
    Android, you will need to obtain a local model for your device. Please see
    [models.md](https://github.com/google/generative_computing/tree/master/generative_computing/docs/models.md)
    for information on how to obtain models and
    what backends to use. This tutorial supports running your model using
    MediaPipe (optimized GPU performance, but a limited set of models) or
    LlamaCpp (CPU-only, but many models supported). Once you have your model,
    you'll need to push the model file to your phone. Keep note of the path
    you push it to as you'll need to include this when you construct your IR
    below.

    For example, for a MediaPipe Gemma model:
    ```
    adb push {{src-model-dir}}/gemma-2b-it-gpu-int4.bin /data/local/tmp/llm/gemma-2b-it-gpu-int4.bin
    ```

    For a LlamaCpp model (e.g., the [Gemma 2B Quantized model](https://huggingface.co/lmstudio-ai/gemma-2b-it-GGUF/tree/main)):
    ```
    adb push {{src-model-dir}}/gemma-2b-it-q4_k_m.gguf /data/local/tmp/llm/gemma-2b-it-q4_k_m.gguf
    ```
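The on-device path you push to has to match the model path you configure later in the tutorial. As a minimal sketch, the helper below derives that path from a local file name. `device_model_path` and `DEVICE_MODEL_DIR` are hypothetical names introduced for illustration. The sketch assumes POSIX-style local paths and the `/data/local/tmp/llm` directory used in the `adb push` examples above.

```python
import posixpath

# Directory on the phone used in the adb push examples above.
DEVICE_MODEL_DIR = "/data/local/tmp/llm"


def device_model_path(local_model_file: str) -> str:
  """Returns the on-device path for a model file pushed to DEVICE_MODEL_DIR."""
  file_name = posixpath.basename(local_model_file)
  return posixpath.join(DEVICE_MODEL_DIR, file_name)


# The local "models/" directory is a made-up example location.
print(device_model_path("models/gemma-2b-it-gpu-int4.bin"))
```

Computing the path in one place makes it harder for the pushed file and the configured model path to drift apart.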

Now, to verify that the GenC dependencies are loaded correctly, let's run the
imports we're going to use later.

In [None]:
import generative_computing as genc
from generative_computing import authoring
from generative_computing import interop
from generative_computing import runtime
from generative_computing import examples

## Defining application logic in LangChain

Here, we're going to create some example application logic using LangChain APIs.
For the sake of simplicity, let's go with a simple chain that consists of a
prompt template feeding into an LLM call. Let's define it as a function, so that
we can later play with different models.

In [None]:
import langchain
from langchain.prompts import PromptTemplate

def create_my_chain(llm):
  return langchain.chains.LLMChain(
      llm=llm,
      prompt=PromptTemplate(
          input_variables=["topic"],
          template="Tell me about {topic}?",
      ),
  )
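To make the template concrete, the sketch below mimics the placeholder substitution with plain Python string formatting. It does not call LangChain. `render_prompt` is a hypothetical helper introduced only for illustration, and the sketch assumes the `{topic}` placeholder is filled the way `str.format` would fill it.

```python
# Illustration only: mimics how the "{topic}" placeholder gets filled in,
# using plain Python string formatting rather than LangChain itself.
TEMPLATE = "Tell me about {topic}?"


def render_prompt(topic: str) -> str:
  """Returns the prompt text for a given topic (hypothetical helper)."""
  return TEMPLATE.format(topic=topic)


print(render_prompt("scuba diving"))
```

The rendered string is the text the chain passes on to whichever model powers it.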

## Declaring a Cloud model you will use to power the chain

Now, let's define a model we can use. In GenC, we refer to models symbolically,
since the same model may be provisioned differently depending on where you run
the code (recall that what we want to demonstrate in this tutorial is running
your application logic in Colab first, and then porting it to run on a phone).
To facilitate this, GenC provides interop APIs that enable you to declare the
use of a model, as shown below. For this tutorial, we're going to use the
Gemini Pro model from Google AI Studio.

NOTE: Make sure you have an API key to use as `API_KEY`, as covered in the "Initial setup" section above.

In [None]:
API_KEY = ""  #@param

my_cloud_model = genc.interop.langchain.CustomModel(
    uri="/cloud/gemini",
    config=genc.interop.gemini.create_config(API_KEY))

Now, you can construct the chain with it:

In [None]:
my_chain = create_my_chain(my_cloud_model)
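If you'd rather not paste the key into the notebook, one option is to read it from an environment variable. This is a minimal sketch, not part of GenC: `load_api_key` is a hypothetical helper, and the variable name `GEMINI_API_KEY` is an assumption you can change to whatever name you export the key under.

```python
import os

# Assumption: the key is exported under this name; pick any name you like.
API_KEY_ENV_VAR = "GEMINI_API_KEY"


def load_api_key(env_var: str = API_KEY_ENV_VAR) -> str:
  """Returns the API key from the environment, or raises if it is unset."""
  key = os.environ.get(env_var, "").strip()
  if not key:
    raise RuntimeError(f"Set the {env_var} environment variable first.")
  return key
```

With the variable set before the notebook starts, `API_KEY = load_api_key()` could replace the hard-coded empty string in the cell above.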

## Generating portable intermediate representation (IR)

Now that you have the application logic (the chain you defined above), we need
to translate it into what we call a *portable intermediate representation* (IR
for short) that can be deployed on an Android phone. You do this by calling the
conversion function provided by GenC, as follows:

In [None]:
my_portable_ir = genc.interop.langchain.create_computation(my_chain)
print(my_portable_ir)

At the time of this writing, this converter only supports a subset of LangChain
functionality; we'll work to augment the coverage over time (and we welcome your
help if there's a feature of LangChain you'd like to see covered and are willing
to contribute it to the platform).

## Testing the IR locally in the Colab environment

Before we move over to deployment on Android, let's first test that the IR is
indeed working. While our goal is to run it on-device, we can just as well run
it here, in the Colab environment (remember, all the code is portable). To do
this, we first need to construct a runtime instance:

In [None]:
my_runtime = genc.examples.executor.create_default_executor()

Now, the constructor above is provided for convenience in running the examples
and tutorials, and is configured with a number of runtime capabilities that we
use in this context. Runtimes in GenC are fully modular and configurable, and
in most advanced uses, you'll want to configure a runtime that suits the
specific environment you want to run in, or your particular application (e.g.,
with additional custom dependencies, or without certain dependencies you don't
want in your environment). One of the tutorials later in the sequence explains
how to do that. For now, the default example runtime will suffice.

Given the runtime and the portable IR we want to run, we can construct a
*runner* object that will act like an ordinary Python function, and can
be directly invoked, like this:

In [None]:
my_runner = genc.runtime.Runner(my_portable_ir, my_runtime)

print(my_runner("scuba diving"))

Because of the portability of the IR, at this point you could deploy this IR
as-is to your Android phone, and test it there without the need for any changes,
running the query with just the cloud model. One of GenC's goals is to enable
easy experimentation and an iterative style of development, where you can check
the results of your work at each step. To do this, jump to the section below
entitled "Saving the IR to a file for deployment to phone" for instructions on
deployment.

Otherwise, if you'd prefer to keep expanding on the existing logic and deploy
the final version of the IR just once at the end, continue on to the next
section, where you will add a device model and run the complete model cascade.

## Adding an on-device model to form a model cascade

Now, recall that what we promised to demonstrate in this tutorial is running
your application logic on a phone, where it might be powered by an on-device LLM
while the phone may be offline. To achieve this, we're going to need to modify
the IR to include the on-device model. First, let's declare the use of an
on-device model in LangChain, similarly to how we did above for the cloud model.

NOTE: Make sure you have pushed the on-device model to your Android phone as
covered in the "Initial setup" section above. You'll need to provide the
backend and the model path below.

In [None]:
from enum import Enum
class LocalBackend(Enum):
  MEDIAPIPE = 1
  LLAMACPP = 2

# Change these values based on your desired backend and model
BACKEND = LocalBackend.MEDIAPIPE
MODEL_PATH = "/data/local/tmp/llm/gemma-2b-it-gpu-int4.bin"

# Declare the on-device model for the selected backend.
if BACKEND == LocalBackend.MEDIAPIPE:
  my_on_device_model = genc.interop.langchain.CustomModel(
      uri="/device/llm_inference",
      config={"model_path": MODEL_PATH,
              "max_tokens": 64,
              "top_k": 40,
              "temperature": 0.8,
              "random_seed": 100})
elif BACKEND == LocalBackend.LLAMACPP:
  my_on_device_model = genc.interop.langchain.CustomModel(
      uri="/device/llamacpp",
      config={"model_path": MODEL_PATH,
              "num_threads": 4,
              "max_tokens": 64})

Now, we're going to combine the cloud and on-device models into a simple type
of model cascade that spans across cloud and on-device LLMs. For simplicity's
sake, let's define a two-model cascade that first tries to hit a cloud backend
in case we're online, and that defaults to the use of an on-device model when offline.

In [None]:
my_model_cascade = genc.interop.langchain.ModelCascade(models=[
    my_cloud_model, my_on_device_model])

my_chain = create_my_chain(my_model_cascade)
my_portable_ir = genc.interop.langchain.create_computation(my_chain)
print(my_portable_ir)

You could've chosen to order models in the cascade differently to achieve a
different behavior. Everything is customizable! In the next tutorial in the
sequence, we'll show you how you can construct an even more powerful routing
mechanism, where routing is based on query sensitivity. For now, this simple
cascade will suffice.
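To build intuition for why the order matters, here is a toy sketch of the fallback behavior in plain Python. It is not GenC's implementation. `run_cascade` and the two stand-in models are hypothetical, and the sketch assumes that a model signals failure by raising an exception.

```python
from typing import Callable, Sequence

# A "model" here is just a function from prompt text to response text.
Model = Callable[[str], str]


def run_cascade(models: Sequence[Model], prompt: str) -> str:
  """Tries each model in order and returns the first successful response."""
  last_error = None
  for model in models:
    try:
      return model(prompt)
    except Exception as error:  # Any failure moves on to the next model.
      last_error = error
  raise RuntimeError("All models in the cascade failed.") from last_error


def offline_cloud_model(prompt: str) -> str:
  """Stand-in for a cloud model that is unreachable (e.g., airplane mode)."""
  raise ConnectionError("no network")


def on_device_model(prompt: str) -> str:
  """Stand-in for a local model that always succeeds."""
  return "on-device answer to: " + prompt


print(run_cascade([offline_cloud_model, on_device_model],
                  "Tell me about scuba diving?"))
```

In this toy version, listing the on-device stand-in first would mean the cloud stand-in is never consulted, because the first success ends the loop.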

## Saving the IR to a file for deployment to phone

Now that you've tested the IR locally, it's time to deploy it on your phone and
test it there. First, let's save the IR into a file on the local filesystem.

In [None]:
with open("/tmp/genc_demo.pb", "wb") as f:
  f.write(my_portable_ir.SerializeToString())
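Before copying the file anywhere, it can be worth confirming that it was written and isn't empty. The check below is a small sketch that uses only the Python standard library. `ir_file_size` is a hypothetical helper, not a GenC API.

```python
import os


def ir_file_size(path: str) -> int:
  """Returns the file size in bytes, raising if the file is missing or empty."""
  size = os.path.getsize(path)  # Raises FileNotFoundError if the file is missing.
  if size == 0:
    raise ValueError(f"{path} is empty; the IR was not written.")
  return size
```

For example, `print(ir_file_size("/tmp/genc_demo.pb"))` would report the size of the file saved in the cell above.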

## Installing the Generative Computing Demo app and deploying the IR file on the phone

### Building/Deploying the Android app in Docker

If you're already running this notebook inside a Docker container and followed
the steps in
[SETUP.md](https://github.com/google/generative_computing/tree/master/SETUP.md),
then you already have an Android build environment set up. You can build the
APK directly from the root directory (`/generative_computing` if the command
from SETUP.md was used).

```
bazel build --config=android_arm64 generative_computing/java/src/java/org/generativecomputing/examples/apps/gencdemo:app
```

Once built, the APK will be in the `bazel-bin` folder. This will need to be
copied to the folder that is shared between the Docker container and the host
machine (because Docker doesn't have access to your USB devices).
If you followed the command in
[SETUP.md](https://github.com/google/generative_computing/tree/master/SETUP.md)
then this folder is `/generative_computing`:

```
cp bazel-bin/generative_computing/java/src/java/org/generativecomputing/examples/apps/gencdemo/app.apk /generative_computing
cp /tmp/genc_demo.pb /generative_computing
```

In a different terminal outside of the Docker container, you can now use `adb`
to install the APK and push the IR.

```
adb install app.apk
adb push genc_demo.pb /data/local/tmp/genc_demo.pb
```

## Running the demo on the phone

Make sure your device is connected to the Internet.

1. Open the “Generative Computing Demo” app and type a topic in the UI. As a
   reminder, here is the prompt template we are using: "Tell me about {topic}?"
   * Example text to enter in UI: "scuba diving"

2. See the result. This is the result coming from the Cloud model (make sure
   that the internet connection is available).

Now, you may want to play with setting the Airplane Mode on your Android phone
to On and Off. As you do that, retry the query, and notice how the result
changes depending on the model in use (responses from the on-device and cloud
models tend to look different). Under the hood, the cascade you defined
prompts GenC to first try the cloud model, but if it fails (while in airplane
mode), it simply falls back to the on-device model (if present).