# Demo: Gemma with MLC LLM

Google recently released Gemma: https://blog.google/technology/developers/gemma-open-models/.

This notebook demonstrates how to use the model with MLC LLM: https://llm.mlc.ai/.

For the easiest setup, we recommend trying this out in a Google Colab notebook. Click the button below to get started!

<a target="_blank" href="https://colab.research.google.com/github/mlc-ai/notebooks/blob/main/mlc-llm/models/demo_gemma.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Environment Setup

Let's set up the environment so that you can run `ChatModule`. First, create the Conda environment that this notebook will run in (not required if you are running in Google Colab).

```bash
conda create --name mlc-llm python=3.11
conda activate mlc-llm
```

**Google Colab:** If you are running this in a Google Colab notebook, be sure to change your runtime to GPU by going to Runtime > Change runtime type and setting the Hardware accelerator to be "GPU". Select "Connect" on the top right to instantiate your GPU session.

If you are using CUDA, you can run the following command to confirm that CUDA is set up correctly, and check the version number.

In [1]:
!nvidia-smi

Fri Feb 23 18:19:58 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
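
If you would rather read the CUDA version programmatically than by eye, the header line of the `nvidia-smi` output can be parsed. The following is a minimal sketch, not part of MLC LLM: the helper names `parse_cuda_version` and `detect_cuda_version` are hypothetical, and the regular expression assumes the header format shown in the output above.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_cuda_version(smi_output: str) -> Optional[str]:
    """Return the 'CUDA Version' value from nvidia-smi text, or None if absent."""
    match = re.search(r"CUDA Version:\s*([0-9]+\.[0-9]+)", smi_output)
    return match.group(1) if match else None


def detect_cuda_version() -> Optional[str]:
    """Run nvidia-smi when it is on PATH and parse its output; None otherwise."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=False
    )
    return parse_cuda_version(result.stdout)


# Header line copied from the output above, used here as sample input.
sample_header = (
    "| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05"
    "   CUDA Version: 12.2     |"
)
print(parse_cuda_version(sample_header))  # prints 12.2
```

Calling `detect_cuda_version()` on a machine without an NVIDIA driver returns `None` rather than raising, which makes it easy to branch on in a setup cell.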
                                                                    

Next, let's install the MLC-AI and MLC LLM nightly packages. Go to https://mlc.ai/package/ and replace the command below with the one that matches your hardware and OS.

In [None]:
!pip install --pre mlc-ai-nightly-cu122 mlc-llm-nightly-cu122 -f https://mlc.ai/wheels

**Google Colab**: If you are using Colab, you may see red warnings such as "You must restart the runtime in order to use newly installed versions." For our purposes, simply restart the session and run the next cell after the restart.

Let's confirm we have installed the packages successfully!

In [None]:
!python -c "import tvm; print('tvm installed properly!')"
!python -c "import mlc_llm; print('mlc_llm installed properly!')"
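
The same check can be done from inside the notebook kernel without spawning a new Python process. This is a small sketch using only the standard library; `missing_modules` is a hypothetical helper name. Note that `importlib.util.find_spec` only locates a package and does not import it, so it will not catch a package that is present but fails to load.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# `tvm` and `mlc_llm` are the import names used in the cell above.
missing = missing_modules(["tvm", "mlc_llm"])
if missing:
    print("Not importable:", ", ".join(missing))
else:
    print("tvm and mlc_llm can both be found.")
```

If a package is reported as missing right after installation in Colab, restarting the session as described above is the first thing to try.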

## Running Gemma with MLC-LLM

Next, we clone the Gemma weights, already converted to MLC format, from Hugging Face.

This is the only thing you need. Afterwards, our JIT (just-in-time) compilation will take care of everything for you!

The first run may take longer because the model has to be compiled. The compiled model is then cached in `/pathto/.cache/mlc_llm/`, so later runs are faster.

Alternatively, you can compile the model library ahead of time and pass it to `ChatModule`:

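```python
# Compile the model library ahead of time instead of relying on JIT.
!python -m mlc_llm compile gemma-7b-it-q4f16_2-MLC -o gemma-7b-it-q4f16_2-q4f16_2-cuda.so

from mlc_llm import ChatModule

# Point ChatModule at the weights directory and the library compiled above.
cm = ChatModule("./gemma-7b-it-q4f16_2-MLC", model_lib_path="gemma-7b-it-q4f16_2-q4f16_2-cuda.so")
```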

In [4]:
!git lfs install

Git LFS initialized.


In [5]:
# This is Gemma 7B (instruction-tuned) with 4-bit quantization.
# Other models and quantizations follow the same steps: https://huggingface.co/mlc-ai
!git clone https://huggingface.co/mlc-ai/gemma-7b-it-q4f16_2-MLC

Cloning into 'gemma-7b-it-q4f16_2-MLC'...
remote: Enumerating objects: 113, done.
remote: Counting objects: 100% (110/110), done.
remote: Compressing objects: 100% (110/110), done.
remote: Total 113 (delta 0), reused 0 (delta 0), pack-reused 3
Receiving objects: 100% (113/113), 33.40 KiB | 6.68 MiB/s, done.
Filtering content: 100% (103/103), 5.54 GiB | 62.53 MiB/s, done.
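
If the "Filtering content" step is missing from your output, Git LFS may not have fetched the weights, leaving small text pointer files in place of the real shards. The sketch below is one way to detect that; `find_lfs_pointers` is a hypothetical helper, and it relies only on the fact that Git LFS pointer files begin with the line `version https://git-lfs.github.com/spec/v1`.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir):
    """List files under model_dir that are still Git LFS pointer stubs."""
    pointers = []
    for path in sorted(Path(model_dir).rglob("*")):
        # Skip directories and Git's own metadata.
        if ".git" in path.parts or not path.is_file():
            continue
        with path.open("rb") as handle:
            if handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                pointers.append(path)
    return pointers


model_dir = Path("./gemma-7b-it-q4f16_2-MLC")
if model_dir.is_dir():
    stubs = find_lfs_pointers(model_dir)
    print(f"{len(stubs)} file(s) are still LFS pointers")
else:
    print(f"{model_dir} not found; run the git clone cell first")
```

A count of zero suggests the weights were downloaded in full. A nonzero count usually means `git lfs install` was not run before cloning.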


In [6]:
from mlc_llm import ChatModule
from mlc_llm.callback import StreamToStdout

In [7]:
cm = ChatModule("./gemma-7b-it-q4f16_2-MLC")

In [8]:
output = cm.generate(
    prompt="Tell me about 5 states in the US",
    progress_callback=StreamToStdout(callback_interval=2),
)

Sure, here's a quick overview of five states in the US:

**1. California:**
- Capital: Sacramento
- Largest city: Los Angeles
- Known for: Golden Gate Bridge, Hollywood, Silicon Valley, and its diverse population.

**2. New York:**
- Capital: Albany
- Largest city: New York City
- Known for: Empire State Building, Times Square, Niagara Falls, and its rich history.

**3. Texas:**
- Capital: Austin
- Largest city: Dallas
- Known for: Its large size, diverse culture, and its strong economy.

**4. Florida:**
- Capital: Tallahassee
- Largest city: Jacksonville
- Known for: Its beautiful beaches, warm climate, and its history as a major naval power.

**5. Alaska:**
- Capital: Juneau
- Largest city: Anchorage
- Known for: Its breathtaking natural beauty, including towering mountains, glaciers, and fjords.


In [9]:
output = cm.generate(
    prompt="Two more please",
    progress_callback=StreamToStdout(callback_interval=2),
)

**Sure, here are two more states:**

**6. Nevada:**
- Capital: Carson City
- Largest city: Las Vegas
- Known for: Its casinos, its desert landscapes, and its history as a frontier town.

**7. Idaho:**
- Capital: Boise
- Largest city: Boise
- Known for: Its scenic mountains, its salmon fishing, and its rich Native American heritage.


In [None]:
cm.reset_chat()
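
The cells above show that consecutive `generate` calls share one conversation ("Two more please" continues the earlier list) and that `reset_chat()` starts over. The pattern can be wrapped in a small helper. This is a sketch under stated assumptions: `run_conversation` and the `EchoChat` stand-in are hypothetical, and the helper uses only the `generate(prompt=...)` and `reset_chat()` calls that appear in this notebook. `EchoChat` exists so the example runs without a GPU; in the notebook you would pass `cm` instead.

```python
def run_conversation(chat, prompts, reset_first=True):
    """Send prompts in order to one chat object and collect the replies.

    `chat` is any object with generate(prompt=...) and reset_chat(),
    which is how `cm` is used in the cells above.
    """
    if reset_first:
        chat.reset_chat()  # begin from an empty history
    replies = []
    for prompt in prompts:
        replies.append(chat.generate(prompt=prompt))
    return replies


class EchoChat:
    """Hypothetical stand-in for ChatModule that only records the turns."""

    def __init__(self):
        self.history = []

    def generate(self, prompt):
        self.history.append(prompt)
        return f"turn {len(self.history)}: {prompt}"

    def reset_chat(self):
        self.history.clear()


replies = run_conversation(
    EchoChat(), ["Tell me about 5 states in the US", "Two more please"]
)
print(replies)
```

With the real model, `run_conversation(cm, [...])` would keep follow-up prompts in the same conversation, and passing `reset_first=False` would continue an existing one.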