# Tutorial 9: Private GenAI

## Introduction and problem statement

This tutorial combines concepts introduced in several of the preceding
tutorials to support scenarios that involve processing private or confidential
data.

The specific scenario we're going to focus on here is a mobile app on Android
that wants to perform GenAI processing, where some of the data fed by the
app as input to that processing might be sensitive.

Privacy is a multi-faceted concept; in this tutorial we're going to focus on
one specific aspect of it. Here, the app developer would like to (or has an
obligation to) ensure that while the GenAI processing takes place, access to
the sensitive data used in this processing, or to any intermediate or final
results derived from it, remains tightly controlled.

In particular, the developer seeks to avoid issuing LLM queries or any other
service calls containing sensitive data to untrusted cloud backends, where
the data might be at risk of being collected or retained, and potentially
later used for purposes that are incompatible with privacy or confidentiality
expectations of the users or data owners, or that may violate the developer's
obligations with respect to responsible handling of such data.

## The approach we're going to follow in this tutorial

* * *

NOTE: Please be advised that GenC is positioned as a research and experimental
framework. Use the techniques described here at your own risk. All code used
here is built using off the shelf and open-source components; we encourage you
to review the code to make an informed decision about potential suitability of
this technology, architecture/design, or any of the ideas described here for
non-research-oriented applications.

* * *

In the preceding tutorials, we've introduced a few mechanisms that a developer
can use. The gist of the approach we'll take in this tutorial is to combine
these individual ingredients into a cohesive whole.

The key ingredients:

*   **Processing locally on-device**, e.g., on Android using the on-device LLM,
    as covered in [Tutorial 8](tutorial_8_android.ipynb). During the process
    of constructing the IR, the developer can simply avoid calls to cloud LLMs
    or other external services. Although we won't go quite as far in this demo,
    the developer could possibly go even further, if needed, and configure a
    custom version of the GenC runtime for Android with support for calling any
    external services removed, such that even if the IR might be created with
    such calls in it (perhaps accidentally), they would fail to run.

*   **Processing in a
    [Trusted Execution Environment (TEE)](https://en.wikipedia.org/wiki/Trusted_execution_environment)**,
    e.g., within a
    [Confidential Computing](https://cloud.google.com/security/products/confidential-computing)
    VM instance set up on GCP by the developer. Here,
    the developer relies on all data and processing being encrypted end-to-end
    during transport and in memory, and on the formal and verifiable guarantees
    offered by the specialized hardware that powers the TEE to ensure that
    neither the cloud provider, nor the administrator/operator of the service
    deployed in the TEE, can intercept the data or results. You've seen an
    example of this in [Tutorial 6](tutorial_6_confidential_computing.ipynb).

*   **Hybrid device and Cloud processing** that combines operations on-device and
    in-Cloud, such that both on-device and Cloud components can work together
    as a seamless whole to support the needs of the application. You've seen
    elements of this in [Tutorial 1](tutorial_1_simple_cascase.ipynb) and
    [Tutorial 2](tutorial_2_custom_routing.ipynb), where we combined on-device
    and Cloud models into a device-to-Cloud model cascade, and demonstrated 2
    forms of routing to decide which model to use for a particular query.
    Whereas the examples shown in these 2 tutorials used an unprotected Cloud
    model and weren't focusing on privacy, here the hybrid approach can enable
    us to combine the on-device and TEE-based genAI processing just mentioned
    above to form a private device-to-TEE model cascade that's protected from
    intrusion by untrusted external parties.

If you haven't yet reviewed the tutorials mentioned above, we strongly
encourage you to do so, as that will make the rest of this tutorial much
easier to follow (we don't explain some of the foundational concepts and APIs in
quite the same depth as we did in the tutorials above).

## Overall architecture

The overall architecture used in this tutorial will be a combination of what
you've seen in [Tutorial 6](tutorial_6_confidential_computing.ipynb) and
[Tutorial 8](tutorial_8_android.ipynb), except with a larger model in the TEE,
and with hybrid processing mixed in.

See the following diagram:

![Private GenAI](private_genai.png)

Let's go through this diagram step-by-step:

*   We will be using two LLMs, both of them instances of
    [Gemma](https://ai.google.dev/gemma). One will be a small quantized 2B
    model that comes in at approximately 1.5GB and fits on a modern Android
    device (e.g., on a Pixel 7). The other will be a much larger, unquantized
    7B model that comes in at slightly over 34GB and doesn't fit on a mobile
    device, but can be hosted in a TEE (e.g., on any of the upper end of the
    [N2D standard](https://cloud.google.com/compute/docs/general-purpose-machines#n2d-standard)
    machine family, one of which we're going to use here to power this
    tutorial). Together, the two models will power the GenAI workloads
    defined in this tutorial.

*   Both instances of Gemma will run in protected environments. One directly
    bundled with the mobile app, the other in a TEE, with encrypted memory and
    other strong assurances offered by the trusted computing environment. The
    device and the TEE are connected with an encrypted communication channel to
    form a whole. You can think of the TEE as effectively a logical extension
    of the mobile device. The device confirms the identity of the workload
    running in the TEE by obtaining and subsequently verifying an
    [attestation](https://cloud.google.com/confidential-computing/confidential-vm/docs/attestation)
    report that contains a `SHA256` digest of the image. The developer, who
    builds the image locally and knows what the `SHA256` image digest should
    be, embeds the digest directly in the IR, as an integral part of the GenAI
    workload.

*   Interaction with the two Gemma instances is mediated by two instances of
    the GenC runtime, one on-device and one in the TEE, that talk to one
    another and jointly execute the GenAI workload you define. The on-device
    runtime
    delegates processing to the runtime in the TEE in a manner defined by the
    developer (you), directly based on the intent declared in the code.

*   The developer's intent and the logic to execute, as always, are defined in
    the form of portable
    [Intermediate Representation (IR)](https://github.com/google/genc/tree/master/genc/docs/ir.md).
    We'll author the IR in this notebook, then deploy and run it on-device.
    GenC runtime instances use chunks of the same IR to exchange the GenAI
    logic to be delegated from device to cloud.
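
The digest check described in the list above can be pictured with a small plain-Python sketch. This is not GenC's implementation: the function name `digest_matches` and the nested claim layout (`submods` / `container` / `image_digest`) are assumptions made for illustration, and the sketch deliberately skips verifying the signature on the attestation report, which a real verifier has to do before trusting any claim in it.

```python
import hmac


def digest_matches(claims: dict, expected_digest: str) -> bool:
    """Return True only if the attested image digest equals the expected one."""
    # Hypothetical claim layout; adjust to whatever your attestation report uses.
    reported = (
        claims.get("submods", {}).get("container", {}).get("image_digest", "")
    )
    # A missing or empty digest on either side never counts as a match.
    if not reported or not expected_digest:
        return False
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(reported.encode(), expected_digest.encode())


expected = "sha256:" + "ab" * 32  # placeholder, not a real image digest
good_report = {"submods": {"container": {"image_digest": expected}}}
bad_report = {"submods": {"container": {"image_digest": "sha256:" + "cd" * 32}}}

print(digest_matches(good_report, expected))  # True
print(digest_matches(bad_report, expected))   # False
```

The point of the sketch is the failure mode: anything other than an exact match, including a missing digest, is treated as an untrusted backend.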

Keep in mind that in this particular deployment scenario, the protections do
inherently rely on all the code involved (GenC runtime, the model, etc.) being
open-source. Use of closed-source or untrusted code in this type of environment
can be possible, but it is more complex, and falls outside the scope of this
introductory tutorial.

## Initial setup

As usual, we start by ensuring that you have the development environment that
can support the remainder of this tutorial.

The best way to go about this is to review the setup steps in
[Tutorial 6](tutorial_6_confidential_computing.ipynb) and
[Tutorial 8](tutorial_8_android.ipynb).

Rather than repeating the content of those tutorials, we'll limit ourselves
here to a quick checklist. Please consult the above for the detailed steps.

*   Follow the steps in
    [SETUP.md](https://github.com/google/genc/tree/master/SETUP.md)
    to download GenC from GitHub, build it, and run the tests in a docker
    container. Make sure you can launch the Jupyter container, and then
    connect to it and reopen this notebook in Jupyter, so that you can execute
    all the Python code below.

*   Set up for Android development (Developer mode on your Android device, USB
    debugging, the `adb` tool on your development workstation, etc.), then
    confirm you can build and install the `GenC Demo App` on your device, as
    described in [Tutorial 8](tutorial_8_android.ipynb). We will continue to
    use the same demo app to execute the GenAI logic authored in this tutorial.

*   Set up a GCP project in which we can host a Confidential VM, as described
    in [Tutorial 6](tutorial_6_confidential_computing.ipynb), install the
    [`gcloud`](https://cloud.google.com/sdk/docs/install) command-line tool,
    and then, as in [Tutorial 6](tutorial_6_confidential_computing.ipynb),
    enter the
    [`confidential_computing`](https://github.com/google/genc/tree/master/genc/cc/examples/confidential_computing)
    directory, edit the
    [config_environment.sh](https://github.com/google/genc/tree/master/genc/cc/examples/confidential_computing/config_environment.sh)
    script there to match your GCP project, account names, etc.

*   Obtain an instance of quantized Gemma 2B weights that you will use as an
    on-device LLM. This is a file named `gemma-2b-it-q4_k_m.gguf`, 1495245728
    bytes in size. Push this file to your mobile device using the `adb` tool,
    as discussed in [Tutorial 8](tutorial_8_android.ipynb), and take note of
    the location. Also, download a copy into the docker container in which you
    are doing the development and where you will be running the Jupyter
    notebook instance, so we can play with it during development.

Assuming you have all the above in place, there are a couple of additional steps
needed to upgrade the service image that runs in the TEE to use the larger
Gemma 7B model.

First, obtain an instance of unquantized Gemma 7B weights, e.g.,
[from HuggingFace](https://huggingface.co/google/gemma-7b). Obtaining the file
requires filling out an online form, so we can't auto-download it for you.
You want to have a file named `gemma-7b-it.gguf` that is 34158344288 bytes in
size before continuing.
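
Since a truncated download is easy to miss with files this large, a quick size check can save a long image build. The following is a minimal sketch: the helper name `check_weights` is made up for this example, and the two sizes are the ones quoted in this tutorial.

```python
import os

# Sizes quoted in this tutorial, in bytes.
EXPECTED_SIZES = {
    "gemma-2b-it-q4_k_m.gguf": 1495245728,
    "gemma-7b-it.gguf": 34158344288,
}


def check_weights(path: str) -> bool:
    """Return True if `path` exists and has the size this tutorial expects."""
    expected = EXPECTED_SIZES.get(os.path.basename(path))
    if expected is None:
        return False  # not one of the files this tutorial knows about
    return os.path.isfile(path) and os.path.getsize(path) == expected


for name, size in EXPECTED_SIZES.items():
    # Decimal gigabytes, matching the "approximately 1.5GB" and
    # "slightly over 34GB" figures used earlier in this tutorial.
    print(f"{name}: {size / 1e9:.2f} GB")
```

For example, `check_weights("gemma-7b-it.gguf")` run from the `confidential_computing` directory should return `True` once the file is fully copied.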

Next, copy this file to the
[`confidential_computing`](https://github.com/google/genc/tree/master/genc/cc/examples/confidential_computing)
directory that's used to build images in
[Tutorial 6](tutorial_6_confidential_computing.ipynb), such that it sits
there side by side with the
[`Dockerfile`](https://github.com/google/genc/tree/master/genc/cc/examples/confidential_computing/Dockerfile).

Then, find this section of the `Dockerfile`:

```
RUN wget --directory-prefix=/ https://huggingface.co/lmstudio-ai/gemma-2b-it-GGUF/resolve/main/gemma-2b-it-q4_k_m.gguf
```

Remove the above, and instead insert the following:

```
COPY gemma-7b-it.gguf /gemma-7b-it.gguf
```

Now you can build and push the image of the runtime that runs in the TEE. The
change you just made ensures that it's the larger Gemma 7B that gets bundled
with the runtime. Go ahead and run `bash ./build_image.sh` and
`bash ./push_image.sh` as described in
[Tutorial 6](tutorial_6_confidential_computing.ipynb).

Before you create the VM by running `bash ./create_debug_vm.sh`, you will want
to tweak the VM parameters to grant more resources to accommodate the larger
model. Edit that script and insert these two lines in place of the existing
`machine_type` setting:

```
  --machine-type="n2d-standard-96" \
  --boot-disk-size="100GB" \
```

With that, you should be able to run the script to create a debug VM, and since
we're in debug mode, you can confirm it's up as discussed in
[Tutorial 6](tutorial_6_confidential_computing.ipynb) by verifying that the VM
has printed the `workload task started` at the end of the log in the serial
console, and that there's no error message indicating that the VM has aborted.
Keep in mind that due to the large image and file sizes involved, you might
see messages about the `systemd-journald.service` crashing and restarting. Wait
until this resolves. It could take an hour, but it should eventually converge,
and you should see the TEE instance coming up and reporting readiness with the
message quoted above.

As long as there are no firewall rules in place that would prevent you from
connecting to the VM from your device, you should be good to go.

To conclude the setup, take note of the SHA256 digest printed after uploading
the image to GCP, and the external IP address of your Confidential VM worker,
enter them below, and then execute the following block of code before moving on
to the next section.

```python
image_digest = ""  # copy here the `sha256:SOMETHING` printed by ./push_image.sh
server_ip = ""  # copy here the `EXTERNAL IP` printed by ./create_debug_vm.sh

server_address = server_ip + ":80"

import genc
from genc.python import authoring
from genc.python import examples
from genc.python import interop
from genc.python import runtime

genc.python.examples.executor.set_default_executor()
```
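
Because both values are pasted in by hand, it can help to sanity-check them before building any IR around them. The helper below is a hypothetical convenience, not part of GenC; it assumes the digest has the usual `sha256:` prefix followed by 64 lowercase hex digits, and that the server is given as a bare IP address.

```python
import ipaddress
import re

_DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


def check_backend_settings(image_digest: str, server_ip: str) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not _DIGEST_RE.fullmatch(image_digest):
        problems.append("image_digest should look like 'sha256:' + 64 hex digits")
    try:
        ipaddress.ip_address(server_ip)
    except ValueError:
        problems.append("server_ip is not a valid IP address")
    return problems


# With both values still empty, both checks fail.
print(check_backend_settings("", ""))
```

An empty list from `check_backend_settings(image_digest, server_ip)` only means the values are well-formed; it does not confirm that the VM is reachable or that the digest belongs to your image.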

## Defining, deploying, and running your GenAI logic

Now that everything is set up, let's start defining the logic you'll use.

First, let's define all the models. For the smaller Gemma 2B, we'll define two
versions, one for deployment on-device, and the other to load into the Jupyter
notebook process, so that we can test the entire cascading setup from the
comfort of the Jupyter notebook before we proceed to deployment onto the mobile
device (which, as you will see, isn't particularly hard, but it does require
jumping outside of the interactive Jupyter experience, dealing with the `adb`
tool, bash scripts, USB cables, etc.).

First, Gemma 2B on-device:

```python
@genc.python.authoring.traced_computation
def gemma_2b_on_device(x):
  return genc.python.authoring.model_inference_with_config[{
      "model_uri": "/device/gemma",
      "model_config": {
          "model_path": "/data/local/tmp/gemma-2b-it-q4_k_m.gguf",
          "num_threads": 4,
          "max_tokens": 64}}](x)
```

Now, Gemma 7B in a TEE:

```python
@genc.python.authoring.traced_computation
def gemma_7b_in_a_tee(x):
  model = genc.python.authoring.model_inference_with_config[{
      "model_uri": "/device/gemma",
      "model_config": {
          "model_path": "/gemma-7b-it.gguf",
          "num_threads": 64,
          "max_tokens": 64}}]
  backend = {"server_address": server_address, "image_digest": image_digest}
  return genc.python.authoring.confidential_computation[model, backend](x)
```

Note that whereas the on-device Gemma 2B call was just a direct model call,
modeled by a single `model_inference` operation, the call to a Gemma 7B in a
TEE is mediated by the `confidential_computation` primitive that relays the
specified processing (in this case, just a single model call) to the specified
backend after first verifying the identity of the backend.

Finally, here's Gemma 2B that we'll use in the notebook process for local
testing, defined similar to the one on-device, but with the parameters slightly
tweaked to reflect the model location and the available resources:

```python
@genc.python.authoring.traced_computation
def gemma_2b_in_notebook(x):
  return genc.python.authoring.model_inference_with_config[{
      "model_uri": "/device/gemma",
      "model_config": {
          "model_path": "/tmp/gemma-2b-it-q4_k_m.gguf",
          "num_threads": 16,
          "max_tokens": 64}}](x)
```

Make sure that the three paths above match the locations where you uploaded the
Gemma weights (on Android, in the TEE as specified in the Dockerfile, and in
the docker container in which you're running Jupyter), so that the GenC
runtime can find them when executing the IR you define below.

Now, following ideas from [Tutorial 2](tutorial_2_custom_routing.ipynb), let's
design a simple Boolean condition to route between a local model and a model
in a TEE, and then use it to decide which model to call. Similarly to what we
did in [Tutorial 2](tutorial_2_custom_routing.ipynb), we'll prompt the small
model (Gemma 2B) to make the determination, and then parse the Boolean result
out of its response. We're going to define the scorer chain here as a function
of the model used, so that we can construct one version of it to use in the
Jupyter notebook, and another set up for deployment on-device.

To keep things simple for the small model, we'll ask it if the sentence is
about cucumbers, and use that as the routing condition, on the hypothesis that
the small model is particularly well suited for handling cucumber-themed input.

```python
def make_scorer_chain(model_to_use_for_scoring):

  prompt = """
  Read this text carefully: "{x}".
  Did the text talk about cucumbers?
  Please answer using words "YES" or "NO".
  """

  regexp = "YES|Yes|yes"

  @genc.python.authoring.traced_computation
  def scorer_chain(x):
    return genc.python.authoring.regex_partial_match[regexp](
        model_to_use_for_scoring(
            genc.python.authoring.prompt_template[prompt](x)))

  return scorer_chain
```

Let's try to see if it works with the Gemma 2B loaded into the notebook:

```python
chain = make_scorer_chain(gemma_2b_in_notebook)
chain("The cucumbers I planted last summer didn't make it through the winter.")
```

You should see:

```
True
```

Now let's try a different input:

```python
chain("When you see a bear, do not run and try to appear larger.")
```

You should see:

```
False
```
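
To build some intuition for how the routing decision is extracted from free-form model output, here is a plain-Python approximation. It assumes that `regex_partial_match` behaves like an unanchored search (comparable to Python's `re.search`); the helper name `looks_like_yes` and the sample responses are made up for this sketch.

```python
import re

ROUTING_REGEXP = "YES|Yes|yes"  # same pattern as in make_scorer_chain


def looks_like_yes(model_response: str) -> bool:
    """Approximate a partial regex match with an unanchored re.search."""
    return re.search(ROUTING_REGEXP, model_response) is not None


print(looks_like_yes("**YES** the text talked about cucumbers."))  # True
print(looks_like_yes("The answer is NO."))                         # False
print(looks_like_yes("NO, but my eyes hurt."))                     # True
```

The last case is a false positive, because "eyes" contains "yes". For a toy router this is acceptable; a stricter pattern with word boundaries, such as `r"\b(YES|Yes|yes)\b"` in Python's syntax, would be one way to tighten it, assuming the regex engine used by the runtime supports `\b`.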

Ok, now that the chain is working, we can use it to define a model cascade,
similarly to what we did in [Tutorial 2](tutorial_2_custom_routing.ipynb).
Once again, we'll define it as a function of the models used, so that we
can play with it in the notebook.

To vary the output and make it easier to tell which model is handling the
query, we'll also alter the behavior. If it's a statement about cucumbers that
we believe the small model should handle, we'll ask it to write a poem.
Otherwise, we'll ask the larger model to angrily contradict everything we just
said.

```python
def make_model_cascade(smaller_model, larger_model):

  scorer_chain = make_scorer_chain(smaller_model)

  @genc.python.authoring.traced_computation
  def if_chain(x):
    return smaller_model(
        genc.python.authoring.prompt_template[
            "Write me a poem that starts with the following text: {x}"](x))

  @genc.python.authoring.traced_computation
  def else_chain(x):
    return larger_model(
        genc.python.authoring.prompt_template[
            "Angrily contradict everything in the following text: {x}"](x))

  @genc.python.authoring.traced_computation
  def model_cascade(x):
    score = scorer_chain(x)
    return genc.python.authoring.conditional[if_chain(x), else_chain(x)](score)

  return model_cascade
```
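
If the control flow is hard to see through the tracing decorators, the following plain-Python stand-in may help. It is not GenC code and produces no IR: the two stub "models" are hypothetical and simply record which one was called, and the sketch assumes that only the selected branch of the conditional is executed.

```python
import re

calls = []  # which stub model was invoked, in order


def small_stub(prompt: str) -> str:
    """Stand-in for Gemma 2B: answers the routing question, echoes otherwise."""
    calls.append("small")
    quoted = re.search(r'Read this text carefully: "(.*)"', prompt)
    if quoted:  # this is the routing question
        return "YES" if "cucumber" in quoted.group(1).lower() else "NO"
    return "[small] " + prompt


def large_stub(prompt: str) -> str:
    """Stand-in for Gemma 7B in the TEE."""
    calls.append("large")
    return "[large] " + prompt


def make_plain_cascade(smaller_model, larger_model):
    def cascade_fn(x: str) -> str:
        # Step 1: the smaller model scores the input.
        answer = smaller_model(
            f'Read this text carefully: "{x}". Did the text talk about cucumbers?')
        # Step 2: exactly one of the two branches runs.
        if re.search("YES|Yes|yes", answer):
            return smaller_model(
                f"Write me a poem that starts with the following text: {x}")
        return larger_model(
            f"Angrily contradict everything in the following text: {x}")
    return cascade_fn


cascade_sketch = make_plain_cascade(small_stub, large_stub)

cascade_sketch("The cucumbers I planted last summer didn't make it through the winter.")
print(calls)  # ['small', 'small']

calls.clear()
cascade_sketch("When you see a bear, do not run and try to appear larger.")
print(calls)  # ['small', 'large']
```

In both cases the smaller model is called first for scoring; the larger model is only reached when the score is negative.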

Now let's try it out on a cascade formed by the Gemma 2B in the Jupyter
notebook, and the Gemma 7B in the TEE.

```python
cascade = make_model_cascade(gemma_2b_in_notebook, gemma_7b_in_a_tee)
cascade("The cucumbers I planted last summer didn't make it through the winter.")
```

As it runs, you should see the debug printouts from llamacpp showing the model
at work. In this case, it will be invoked twice, since the sentence is about
cucumbers. Here's what this might look like:

```
I0000 00:00:1717128998.680136   32925 llamacpp.cc:135] Initial Prompt:
  Read this text carefully: "The cucumbers I planted last summer
  didn't make it through the winter.".
  Did the text talk about cucumbers?
  Please answer using words "YES" or "NO".
  
I0000 00:00:1717129020.039219   32925 llamacpp.cc:237]
  **YES** the text talked about cucumbers.
I0000 00:00:1717129020.039253   32925 llamacpp.cc:238]

Decoded 11 tokens in 1.251856591s, speed: 8.78695 t/s
I0000 00:00:1717129020.039742   33112 llamacpp.cc:135] Initial Prompt:
Write me a poem that starts with the following text: The cucumbers
I planted last summer didn't make it through the winter.
I0000 00:00:1717129035.448336   33112 llamacpp.cc:237]

**Cucumber Blues**

The cucumbers I planted last summer didn't make it through the winter.
Though the sun shone brightly,
And the days were long and warm,
I0000 00:00:1717129035.448377   33112 llamacpp.cc:238]

Decoded 38 tokens in 4.300059072s, speed: 8.83709 t/s
```

So far so good. Now let's try something that will trigger routing of the query
to Gemma 7B running in a TEE:

```python
cascade("When you see a bear, do not run and try to appear larger.")
```

Once again, you will see llamacpp debug output, but only for a single call:
once the local scorer determines the sentence is not about cucumbers, the
remainder is routed to the Gemma 7B in a TEE. You can, however, switch to the
log console in the TEE (since we're running a debug VM) to see those llamacpp
logs as well.

Here's what the local run might look like:

```
I0000 00:00:1717129251.777714   33844 llamacpp.cc:135] Initial Prompt:
  Read this text carefully: "When you see a bear, do not run and
  try to appear larger.".
  Did the text talk about cucumbers?
  Please answer using words "YES" or "NO".
  
I0000 00:00:1717129273.435469   33844 llamacpp.cc:237] The context
does not mention cucumbers, so the answer is NO.
I0000 00:00:1717129273.435500   33844 llamacpp.cc:238]

Decoded 13 tokens in 1.480381989s, speed: 8.78152 t/s
```

And here's what you might see in the logs from llamacpp in the TEE:

```
I0000 00:00:1717129274.006131    3505 llamacpp.cc:135] Initial Prompt:
Angrily contradict everything in the following text: When you see a bear,
do not run and try to appear larger.
I0000 00:00:1717129337.602353    3505 llamacpp.cc:237]  Instead, make
yourself small and quiet, and back away slowly. If the bear sees you and
feels threatened, it may charge at you. If you are caught in a bear's
embrace, play
I0000 00:00:1717129337.602385    3505 llamacpp.cc:238]

Decoded 40 tokens in 13.409074718s, speed: 2.98305 t/s
```
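
The decode summary lines in these logs make it easy to compare the two environments. The snippet below is a small sketch that recomputes tokens per second from such a line; the regular expression is derived from the sample output above rather than from any documented log format, and the numbers come from a single run, so treat them as an illustration, not a benchmark.

```python
import re

DECODE_RE = re.compile(r"Decoded (\d+) tokens in ([0-9.]+)s, speed: ([0-9.]+) t/s")


def tokens_per_second(log_line: str) -> float:
    """Recompute decoding speed from a 'Decoded N tokens in Ts' log line."""
    match = DECODE_RE.search(log_line)
    if match is None:
        raise ValueError("not a decode summary line")
    tokens, seconds = int(match.group(1)), float(match.group(2))
    return tokens / seconds


local = tokens_per_second("Decoded 13 tokens in 1.480381989s, speed: 8.78152 t/s")
tee = tokens_per_second("Decoded 40 tokens in 13.409074718s, speed: 2.98305 t/s")
print(f"local: {local:.2f} t/s, TEE: {tee:.2f} t/s, ratio: {local / tee:.1f}x")
# local: 8.78 t/s, TEE: 2.98 t/s, ratio: 2.9x
```

In this particular run, the larger model in the TEE decodes roughly three times slower than the small model in the notebook process.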

Now that we have verified that our cascade correctly decides
which model to use and prompts the models as desired, we can
move on to the on-device deployment.

First, create a version of the cascade with the correctly configured Gemma 2B
instance on-device, and save the IR to a local file, as follows.

```python
cascade = make_model_cascade(gemma_2b_on_device, gemma_7b_in_a_tee)

portable_ir = cascade.portable_ir

with open("/tmp/genc_demo.pb", "wb") as f:
  f.write(portable_ir.SerializeToString())
```

Once that's done, use the `adb` tool to push the IR to the device, as you did
in [Tutorial 8](tutorial_8_android.ipynb). Do not forget to install the
`GenC Demo App` to run it in case you haven't done it during the initial setup.

As a reminder, here are the highlights of what you may need to run inside the
docker container:

```
bash ./setup_android_build_env.sh

bazel build \
  --config=android_arm64 \
  genc/java/src/java/org/genc/examples/apps/gencdemo:app

cp bazel-bin/genc/java/src/java/org/genc/examples/apps/gencdemo/app.apk .
```

And here's what you may need to run outside of it, to set up your mobile phone:

```
adb install app.apk
adb push genc_demo.pb /data/local/tmp/genc_demo.pb
```

And just as a sanity check:

```
adb ls /data/local/tmp

000041f9 00000d7c 66595547 .
000041e9 00000d7c 663eb4db ..
000081b6 591fa3a0 65d68e51 gemma-2b-it-q4_k_m.gguf
000081b6 00000643 665952d7 genc_demo.pb
```
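
The three columns in front of each file name are hexadecimal: file mode, size in bytes, and modification time. The sketch below decodes the listing above to confirm that the weights file on the device has the expected size; the function name `parse_adb_ls` is made up for this example, and the column interpretation is an assumption that happens to be consistent with the sizes quoted in this tutorial.

```python
import stat

ADB_LS_OUTPUT = """\
000041f9 00000d7c 66595547 .
000041e9 00000d7c 663eb4db ..
000081b6 591fa3a0 65d68e51 gemma-2b-it-q4_k_m.gguf
000081b6 00000643 665952d7 genc_demo.pb
"""


def parse_adb_ls(output: str) -> dict:
    """Map file name -> (is_regular_file, size_in_bytes, mtime_epoch_seconds)."""
    entries = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # Three hex fields, then the name (which may itself contain spaces).
        mode_hex, size_hex, mtime_hex, name = line.split(maxsplit=3)
        mode = int(mode_hex, 16)
        entries[name] = (stat.S_ISREG(mode), int(size_hex, 16), int(mtime_hex, 16))
    return entries


listing = parse_adb_ls(ADB_LS_OUTPUT)
print(listing["gemma-2b-it-q4_k_m.gguf"][1])  # 1495245728
print(listing["genc_demo.pb"][1])             # 1603
```

The first number matches the 1495245728 bytes expected for `gemma-2b-it-q4_k_m.gguf`, which suggests the push completed without truncation.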

With this, you can run the app and test some queries, first one that's related
to cucumbers that triggers the routing decision to use only the small on-device
Gemma 2B model to handle all processing:

![Screenshot 1](tutorial_9_screenshot_1.png)

And then, one that the small on-device Gemma 2B can recognize as not being
about cucumbers, and that will cause our cascade to route the query to the
larger Gemma 7B running in the TEE:

![Screenshot 2](tutorial_9_screenshot_2.png)

To recap, you've seen how a smaller, faster, and cheaper Gemma 2B on-device can
be combined with a larger and more capable, but slower and more expensive Gemma
7B in a TEE to form a cascade that safely handles your sensitive queries, and
that is able to take advantage of the unique strengths of both depending on the
context. Obviously, this was a silly example - we'll leave it as an exercise to
the reader to come up with a more elaborate setup that makes sense in their
specific problem domain.

Also, keep in mind that to support a real production deployment, you'd need to
handle a range of aspects, from access control and authentication, through
scaling, to use of more performant solutions than `llamacpp`, and potentially
a judicious use of acceleration. The discussion of these topics is outside the
scope of this introductory tutorial.