# Tutorial 9: Private GenAI

## Introduction and problem statement

This tutorial combines concepts introduced in several of the preceding
tutorials to support scenarios that focus on processing data that's private
and/or confidential.

The specific scenario we're going to focus on here is a mobile app on Android
that wants to perform GenAI processing, where some of the data fed by the
app as input to that processing might be sensitive.

Privacy is a multi-faceted concept, and in this tutorial we're going to focus
on one specific aspect of it: the app developer would like to (or has an
obligation to) ensure that while the GenAI processing takes place, access to
the sensitive data used in this processing, and to any intermediate or final
results derived from it, remains tightly controlled.

In particular, the developer seeks to avoid issuing LLM queries or any other
service calls containing sensitive data to untrusted cloud backends, where
the data might be at risk of being collected or retained, and potentially
later used for purposes that are incompatible with the privacy or
confidentiality expectations of the users or data owners, or that may violate
the developer's obligations with respect to responsible handling of such data.

## The approach we're going to follow in this tutorial

* * *

NOTE: Please be advised that GenC is positioned as a research and experimental
framework. Use the techniques described here at your own risk. All code used
here is built using off the shelf and open-source components; we encourage you
to review the code to make an informed decision about potential suitability of
this technology, architecture/design, or any of the ideas described here for
non-research-oriented applications.

* * *

In the preceding tutorials, we've introduced a few mechanisms that a developer
can use. The gist of the approach we'll take in this tutorial is to combine
these individual ingredients into a cohesive whole.

The key ingredients:

*   **Processing locally on-device**, e.g., on Android using the on-device LLM,
    as covered in [Tutorial 8](tutorial_8_android.ipynb). During the process
    of constructing the IR, the developer can simply avoid calls to cloud LLMs
    or other external services. Although we won't go quite that far in this demo,
    the developer could possibly go even further, if needed, and configure a
    custom version of the GenC runtime for Android with support for calling any
    external services removed, such that even if the IR might be created with
    such calls in it (perhaps accidentally), they would fail to run.

*   **Processing in a
    [Trusted Execution Environment (TEE)](https://en.wikipedia.org/wiki/Trusted_execution_environment)**,
    e.g., within a
    [Confidential Computing](https://cloud.google.com/security/products/confidential-computing)
    VM instance set up on GCP by the developer. Here,
    the developer relies on all data and processing being encrypted end-to-end
    during transport and in memory, and on the formal and verifiable guarantees
    offered by the specialized hardware that powers the TEE to ensure that
    neither the cloud provider, nor the administrator/operator of the service
    deployed in the TEE, can intercept the data or results. You've seen an
    example of this in [Tutorial 6](tutorial_6_confidential_computing.ipynb).

*   **Hybrid device and Cloud processing** that combines operations on-device and
    in-Cloud, such that both on-device and Cloud components can work together
    as a seamless whole to support the needs of the application. You've seen
    elements of this in [Tutorial 1](tutorial_1_simple_cascase.ipynb) and
    [Tutorial 2](tutorial_2_custom_routing.ipynb), where we combined on-device
    and Cloud models into a device-to-Cloud model cascade, and demonstrated 2
    forms of routing to decide which model to use for a particular query.
    Whereas the examples shown in these 2 tutorials used an unprotected Cloud
    model and weren't focusing on privacy, here the hybrid approach can enable
    us to combine the on-device and TEE-based genAI processing just mentioned
    above to form a private device-to-TEE model cascade that's protected from
    intrusion by untrusted external parties.
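
To make the idea of a private device-to-TEE cascade more concrete, here is a minimal sketch in plain Python. It does not use the GenC API. The functions `on_device_model` and `tee_model`, and the word-count routing rule, are hypothetical stand-ins chosen only for illustration. The point it shows is that both branches of the routing decision stay inside protected environments.

```python
from typing import Callable

# Hypothetical stand-ins for the two protected models. Real code would reach
# the on-device LLM and the TEE-hosted LLM through GenC instead.
def on_device_model(prompt: str) -> str:
    return f"[on-device] {prompt}"

def tee_model(prompt: str) -> str:
    return f"[tee] {prompt}"

def make_private_cascade(
    small: Callable[[str], str],
    large: Callable[[str], str],
    max_words_for_small: int = 8,
) -> Callable[[str], str]:
    """Returns a function that routes each prompt to one of two models.

    The routing rule (prompt length in words) is an illustrative assumption,
    not the routing logic used by the tutorials.
    """
    def run(prompt: str) -> str:
        if len(prompt.split()) <= max_words_for_small:
            return small(prompt)
        return large(prompt)
    return run

cascade = make_private_cascade(on_device_model, tee_model)
```

Calling `cascade("hello there")` in this sketch is handled by the on-device stand-in, whereas a longer prompt is handed to the TEE stand-in.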

If you haven't yet reviewed the tutorials mentioned above, we strongly
encourage you to do so, as that will make the rest of this tutorial much
easier to follow (we don't explain some of the foundational concepts and APIs in
quite the same depth as we did in the tutorials above).

## Overall architecture

The overall architecture used in this tutorial will be a combination of what
you've seen in [Tutorial 6](tutorial_6_confidential_computing.ipynb) and
[Tutorial 8](tutorial_8_android.ipynb), except with a larger model in the TEE,
and with hybrid processing mixed in.

See the following diagram:

![Private GenAI](private_genai.png)

Let's go through this diagram step-by-step:

*   We will be using two LLMs, both of them instances of
    [Gemma](https://ai.google.dev/gemma). One will be a small quantized 2B
    version that comes in at approximately 1.5GB and fits on a modern Android
    device (e.g., on a Pixel 7). The other will be a much larger, unquantized
    7B version that comes in at slightly over 34GB and doesn't fit on a mobile
    device, but that can be hosted in a TEE (e.g., on any of the upper end of
    the
    [N2D standard](https://cloud.google.com/compute/docs/general-purpose-machines#n2d-standard)
    machine family, one of which we're going to use here to power this
    tutorial). The two of them together will be used to power the GenAI
    workloads defined in this tutorial.

*   Both instances of Gemma will run in protected environments. One directly
    bundled with the mobile app, the other in a TEE, with encrypted memory and
    other strong assurances offered by the trusted computing environment. The
    device and the TEE are connected with an encrypted communication channel to
    form a whole. You can think of the TEE as effectively a logical extension
    of the mobile device. The device confirms the identity of the workload
    running in the TEE by obtaining and subsequently verifying an
    [attestation](https://cloud.google.com/confidential-computing/confidential-vm/docs/attestation)
    report that contains a `SHA256` digest of the image. The developer, who
    builds the image locally and knows what the `SHA256` image digest should
    be, embeds the digest directly in the IR, as an integral part of the GenAI
    workload.

*   Interaction with the two Gemma instances is mediated by two instances of
    GenC runtime, one on-device and one in the TEE, that talk to one another
    and jointly execute the GenAI workload you define. The on-device runtime
    delegates processing to the runtime in the TEE in a manner defined by the
    developer (you), directly based on the intent declared in the code.

*   The developer's intent and the logic to execute, as always, are defined in
    the form of portable
    [Intermediate Representation (IR)](https://github.com/google/genc/tree/master/genc/docs/ir.md).
    We'll author the IR in the colab, then deploy and run it on-device. GenC
    runtime instances use chunks of the
    same IR to exchange the GenAI logic to be delegated from device to cloud.
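
The digest check mentioned in the list above can be pictured with a small sketch. This is not the GenC verification code, and it deliberately leaves out validating the attestation report itself (for example, checking its signature). It only illustrates the final step of comparing the digest the developer expects against the digest the report claims. The function name `digest_matches` and both placeholder digests are assumptions made for this example.

```python
import hashlib
import hmac

def digest_matches(expected_digest: str, reported_digest: str) -> bool:
    """Compares two hex digests, ignoring case and surrounding whitespace.

    hmac.compare_digest is used so that the comparison time does not depend
    on where the two values first differ.
    """
    expected = expected_digest.strip().lower().encode("utf-8")
    reported = reported_digest.strip().lower().encode("utf-8")
    return hmac.compare_digest(expected, reported)

# Placeholder digests for illustration only. A real expected digest would be
# the one produced when the developer builds the image, and the reported one
# would be read from a verified attestation report.
expected = hashlib.sha256(b"placeholder image contents").hexdigest()
tampered = hashlib.sha256(b"some other image").hexdigest()

print(digest_matches(expected, expected))
print(digest_matches(expected, tampered))
```

In this sketch, a mismatch would be the signal for the device to refuse to delegate any processing to the TEE.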

Keep in mind that in this particular deployment scenario, the protections do
inherently rely on all the code involved (GenC runtime, the model, etc.) being
open-source. Use of closed-source or untrusted code in this type of environment
can be possible, but it is more complex, and falls outside the scope of this
introductory tutorial.

## Initial setup

As usual, we start by ensuring that you have the development environment that
can support the remainder of this tutorial.

The best way to go about this is to review the setup steps in
[Tutorial 6](tutorial_6_confidential_computing.ipynb) and
[Tutorial 8](tutorial_8_android.ipynb).

Rather than repeating the content of those tutorials, we'll limit ourselves
here to a quick checklist. Please consult the above for the detailed steps.

*   Follow the steps in
    [SETUP.md](https://github.com/google/genc/tree/master/SETUP.md)
    to download GenC from GitHub, build it, and run the tests in a docker
    container. Make sure you can launch the Jupyter container, and then
    connect to it and reopen this notebook in Jupyter, so that you can execute
    all the Python code below.

*   Set up for Android development (Developer mode on your Android device, USB
    debugging, the `adb` tool on your development workstation, etc.), then
    confirm you can build and install the `GenC Demo App` on your device, as
    described in [Tutorial 8](tutorial_8_android.ipynb). We will continue to
    use the same demo app to execute the GenAI logic authored in this tutorial.

*   Set up a GCP project in which we can host a Confidential VM, as described
    in [Tutorial 6](tutorial_6_confidential_computing.ipynb), install the
    [`gcloud`](https://cloud.google.com/sdk/docs/install) command-line tool,
    and then, as in [Tutorial 6](tutorial_6_confidential_computing.ipynb),
    enter the
    [`confidential_computing`](https://github.com/google/genc/tree/master/genc/cc/examples/confidential_computing)
    directory, edit the
    [config_environment.sh](https://github.com/google/genc/tree/master/genc/cc/examples/confidential_computing/config_environment.sh)
    script there to match your GCP project, account names, etc.

*   Obtain an instance of quantized Gemma 2B weights that you will use as an
    on-device LLM. This is a file named `gemma-2b-it-q4_k_m.gguf`, 1495245728
    bytes in size. Push this file to your mobile device using the `adb` tool,
    as discussed in [Tutorial 8](tutorial_8_android.ipynb).

Assuming you have all the above in place, there are a couple of additional steps
needed to upgrade the service image that runs in the TEE to use the larger
Gemma 7B model.

First, obtain an instance of unquantized Gemma 7B weights, e.g.,
[from HuggingFace](https://huggingface.co/google/gemma-7b). Obtaining the file
will require filling out a form online, so we can't auto-download it for you.
You want to have a file named `gemma-7b-it.gguf` that is 34158344288 bytes in
size before continuing.
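
Since the weights files are identified in this tutorial by name and exact byte size, a quick check before moving on can save a long failed build. The sketch below is one possible way to do that in Python. The helper name `check_weights` is an assumption made for this example, while the file names and sizes are the ones quoted in this tutorial.

```python
import os

# File names and exact sizes in bytes, as quoted in this tutorial.
EXPECTED_SIZES = {
    "gemma-2b-it-q4_k_m.gguf": 1495245728,
    "gemma-7b-it.gguf": 34158344288,
}

def check_weights(path: str) -> bool:
    """Returns True if the file at `path` exists and has the expected size.

    Raises ValueError if the file name is not one of the two known models.
    A size match is only a sanity check and does not prove file integrity.
    """
    name = os.path.basename(path)
    expected = EXPECTED_SIZES.get(name)
    if expected is None:
        raise ValueError(f"Unknown weights file name: {name}")
    return os.path.isfile(path) and os.path.getsize(path) == expected
```

For example, `check_weights("gemma-7b-it.gguf")` run from the directory holding the download returns `True` only when the file is present and 34158344288 bytes long.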

Next, copy this file to the
[`confidential_computing`](https://github.com/google/genc/tree/master/genc/cc/examples/confidential_computing)
directory that's used to build images in
[Tutorial 6](tutorial_6_confidential_computing.ipynb), such that it sits
there side by side with the
[`Dockerfile`](https://github.com/google/genc/tree/master/genc/cc/examples/confidential_computing/Dockerfile).

Then, locate this line in the `Dockerfile`:

```
RUN wget --directory-prefix=/ https://huggingface.co/lmstudio-ai/gemma-2b-it-GGUF/resolve/main/gemma-2b-it-q4_k_m.gguf
```

Remove the above, and instead insert the following:

```
COPY gemma-7b-it.gguf /gemma-7b-it.gguf
```
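
If you prefer to script this edit rather than make it by hand, the swap can be sketched in Python as shown here. The helper name `swap_model_line` is an assumption made for this example. The two `Dockerfile` lines are the ones quoted above, and the sketch fails loudly if the original line isn't found, rather than silently leaving the file unchanged.

```python
OLD_LINE = (
    "RUN wget --directory-prefix=/ "
    "https://huggingface.co/lmstudio-ai/gemma-2b-it-GGUF/resolve/main/"
    "gemma-2b-it-q4_k_m.gguf"
)
NEW_LINE = "COPY gemma-7b-it.gguf /gemma-7b-it.gguf"

def swap_model_line(dockerfile_text: str) -> str:
    """Returns the Dockerfile text with the download line replaced.

    Raises ValueError if the expected download line is not present.
    """
    lines = dockerfile_text.splitlines()
    if not any(line.strip() == OLD_LINE for line in lines):
        raise ValueError("Expected wget line not found in Dockerfile text")
    swapped = [NEW_LINE if line.strip() == OLD_LINE else line for line in lines]
    return "\n".join(swapped) + "\n"
```

To apply it, read the `Dockerfile` into a string, pass it through `swap_model_line`, and write the result back.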

Now, you can build and push the image of the runtime that runs in the TEE, as
described in [Tutorial 6](tutorial_6_confidential_computing.ipynb). The above
change you made ensures that it's the larger Gemma 7B that gets bundled with
the runtime. Go ahead and run `bash ./build_image.sh` and `bash ./push_image.sh`
as described in [Tutorial 6](tutorial_6_confidential_computing.ipynb).

Before you create the VM by calling the `bash ./create_debug_vm.sh` script, you
will want to tweak the VM parameters to grant more resources to accommodate
the larger model, by editing the content of that script and inserting these
two lines in place of the existing `machine_type` setting:

```
  --machine-type="n2d-standard-96" \
  --boot-disk-size="100GB" \
```

With that, you should be able to run the script to create a debug VM, and since
we're in debug mode, you can confirm it's up as discussed in
[Tutorial 6](tutorial_6_confidential_computing.ipynb) by verifying that the VM
has printed `workload task started` at the end of the log in the serial
console, and that there's no error message indicating that the VM has aborted.
Keep in mind that due to the large image and file sizes involved, you might
see messages about the `systemd-journald.service` crashing and restarting. Wait
until it resolves. This could take an hour, but it should eventually converge,
and you should see the TEE instance coming up and reporting readiness as noted
by the message quoted above.
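
If you've saved the serial console output as text, the readiness check described above can be sketched as follows. The helper name `workload_started` and the choice to look only at the last 20 non-empty lines are assumptions made for this example. The marker string is the one quoted above. The sketch does not look for abort messages, so you'd still want to scan the log for errors yourself.

```python
READY_MARKER = "workload task started"

def workload_started(serial_log: str, tail_lines: int = 20) -> bool:
    """Returns True if the readiness marker appears near the end of the log.

    Only the last `tail_lines` non-empty lines are inspected, since the
    marker is expected at the end of the serial console output.
    """
    if tail_lines < 1:
        raise ValueError("tail_lines must be at least 1")
    lines = [line for line in serial_log.splitlines() if line.strip()]
    return any(READY_MARKER in line for line in lines[-tail_lines:])
```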

As long as there are no firewall rules in place that would prevent you from
connecting to the VM from your device, you should be good to go.
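
One way to get an early hint about firewall problems is a plain TCP reachability check, sketched below in Python. The helper name `can_connect` is an assumption made for this example, and the host and port you pass in are yours to fill in, since this tutorial doesn't fix them. Note that a check run from your workstation only approximates what the mobile device will see on its own network.

```python
import socket

def can_connect(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Returns True if a TCP connection to (host, port) can be opened.

    A successful connection shows only that the port is reachable. It says
    nothing about the identity of the service listening there.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

A `False` result here would suggest reviewing the firewall rules of your GCP project before debugging anything on the device side.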




## TO BE CONTINUED...