## LLM Demo, UT Austin, Jessy Li

Based on https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/Gemma_Basics_with_HF.ipynb

You need a GPU for this: in Colab, go to Runtime → Change runtime type and select the T4 GPU :)

To set up:

1. **Hugging Face Account:**  If you don't already have one, you can create a free Hugging Face account by clicking [here](https://huggingface.co/join).

2. **Hugging Face Token:**  Generate a Hugging Face access token (preferably with `write` permission) by clicking [here](https://huggingface.co/settings/tokens). You'll need this token later in the tutorial.

3. Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel. Create a new secret with the name HF_TOKEN. Copy/paste your token key into the Value input box of HF_TOKEN. Toggle the button on the left to allow notebook access to the secret.

4. To set up Gemma, head over to the [Gemma model page](https://huggingface.co/google/gemma-2b) and accept the usage conditions. The same applies to other gated models; e.g., if you want to play with Llama, go to https://huggingface.co/meta-llama/Llama-3.2-1B and accept its terms.
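Steps 2 and 3 can be sketched in code as follows. This is a minimal sketch, not part of the original notebook: `get_hf_token` is a hypothetical helper name, and it assumes the secret is called `HF_TOKEN` as in step 3. Inside Colab it reads the secret through `google.colab.userdata`; anywhere else it falls back to the `HF_TOKEN` environment variable.

```python
import os


def get_hf_token(name: str = "HF_TOKEN"):
    """Return the token from Colab secrets if possible, else from the environment."""
    try:
        # This module is only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        return os.environ.get(name)
    try:
        return userdata.get(name)
    except Exception:
        # Secret missing or notebook access not granted: fall back to the environment.
        return os.environ.get(name)


token = get_hf_token()
print("token found" if token else "token missing")
```

If this prints `token missing`, revisit step 3 (or your local environment variables) before trying to download a gated model.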

In [1]:
!pip install --upgrade -q transformers huggingface_hub peft \
  accelerate bitsandbytes datasets trl


In [6]:
%pip install python-dotenv

Note: you may need to restart the kernel to use updated packages.


In [10]:
from dotenv import load_dotenv


In [11]:
## This part from https://huggingface.co/docs/transformers/main/en/llm_tutorial

import os

load_dotenv()

True

In [12]:
token = os.environ.get('HF_TOKEN')

In [23]:
%pip install ipywidgets

Collecting ipywidgets
  Downloading ipywidgets-8.1.7-py3-none-any.whl.metadata (2.4 kB)
Collecting widgetsnbextension~=4.0.14 (from ipywidgets)
  Downloading widgetsnbextension-4.0.14-py3-none-any.whl.metadata (1.6 kB)
Collecting jupyterlab_widgets~=3.0.15 (from ipywidgets)
  Downloading jupyterlab_widgets-3.0.15-py3-none-any.whl.metadata (20 kB)
Downloading ipywidgets-8.1.7-py3-none-any.whl (139 kB)
Downloading jupyterlab_widgets-3.0.15-py3-none-any.whl (216 kB)
Downloading widgetsnbextension-4.0.14-py3-none-any.whl (2.2 MB)
Installing collected packages: widgetsnbextension, jupyterlab_widgets, ipywidgets
Successfully installed ipywidgets-8.1.7 jupyterlab_widgets-3.0.15 widgetsnbextension-4.0.14

In [13]:
from huggingface_hub import login
login(token)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


## Instantiate the Gemma 2B model

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights in both pre-trained and instruction-tuned variants. Gemma models are well suited to a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources, such as a laptop, a desktop, or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.


Let's get started by loading the model from Hugging Face Hub.

### Loading the model from HF Hub

In [2]:
model_id = "google/gemma-2b"
device = "cuda"
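The cell above hardcodes `device = "cuda"`. A slightly more defensive variant, sketched here under the assumption that you may sometimes run the notebook without a GPU, checks for CUDA first. `pick_device` is a hypothetical helper name, not something from the original notebook.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so no GPU can be used from this process.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

If this prints `cpu` in Colab, the runtime is probably not set to a GPU. The rest of this notebook, in particular the 4-bit loading step, assumes a CUDA device.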

In [15]:
# Let's load the tokenizer first
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)

tokenizer_config.json:   0%|          | 0.00/33.6k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

## Quantize the weights to 4-bit NF4 to reduce the model's memory footprint;
## computation still runs in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

## Load the quantized model, placing all of it on GPU 0
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"": 0})


The 8-bit optimizer is not available on your device, only available on CUDA for now.


Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/67.1M [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

Cancellation requested; stopping current tasks.

thread 'hf-xet-0' panicked at /home/runner/work/xet-core/xet-core/cas_client/src/download_utils.rs:333:54:
index out of bounds: the len is 556 but the index is 1098
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


KeyboardInterrupt: 
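To see roughly why 4-bit loading helps on a small GPU, the memory taken by the weights can be estimated as parameters × bits ÷ 8. The sketch below is back-of-the-envelope arithmetic only. The parameter count of about 2.5 billion is an assumption, inferred from the roughly 5 GB of checkpoint shards in the download log above (taken to be 16-bit weights), and `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: ~2.5e9 parameters, inferred from the checkpoint size, not an official figure.
NUM_PARAMS = 2.5e9

for label, bits in [("float32", 32), ("bfloat16", 16), ("nf4 (4-bit)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Under these assumptions the weights shrink from about 4.7 GiB at 16 bits to about 1.2 GiB at 4 bits. Treat this as a lower bound: activations, the KV cache, and any layers kept at higher precision add to the real footprint.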

### Trying it out

In [6]:
prompt = "My favorite books are"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
output = model.generate(inputs, max_new_tokens = 100)
text = tokenizer.decode(output[0], skip_special_tokens = True)
print(text)

My favorite books are the ones that I read in the car. I love to read on the road, and I love to read on the road. I love to read on the road. I love to read on the road. I love to read on the road. I love to read on the road. I love to read on the road. I love to read on the road. I love to read on the road. I love to read on the road. I love to read on the road. I love
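The repeated sentence above is typical of deterministic decoding: with `do_sample` left unset, `generate` presumably picks the most likely token at every step, so once the text enters a loop it tends to stay there. The toy sketch below is not the model's actual decoding code. It uses a made-up four-word vocabulary with fixed, invented logits to contrast greedy picking with temperature sampling.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Always return the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, temperature, rng):
    """Draw an index at random, weighted by the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical vocabulary and logits, for illustration only.
toy_vocab = ["road", "car", "beach", "train"]
toy_logits = [2.0, 1.5, 1.0, 0.5]

rng = random.Random(0)
greedy_tokens = [toy_vocab[greedy_pick(toy_logits)] for _ in range(5)]
sampled_tokens = [toy_vocab[sample_pick(toy_logits, 1.0, rng)] for _ in range(5)]
print(greedy_tokens)   # the same token every time
print(sampled_tokens)  # varies with the random draws
```

In the real model the logits change as the context grows, so this only illustrates the principle. To try sampling with Gemma itself, `model.generate` accepts arguments such as `do_sample=True`, `temperature`, `top_p`, and `repetition_penalty`; good values are a matter of experimentation.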


In [7]:
prompt = "Write a python function that reverses a string."
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
output = model.generate(inputs, max_new_tokens = 100)
text = tokenizer.decode(output[0], skip_special_tokens = True)
print(text)

Write a python function that reverses a string.

Answer:

def reverse(string):
for i in range(len(string)):
string[i] = string[len(string)-1-i]
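The generated function above would not run as written: Python strings are immutable, so `string[i] = ...` raises a `TypeError`, and the function never returns anything. For comparison, a minimal working version might look like this (the name `reverse_string` is ours, not the model's):

```python
def reverse_string(text: str) -> str:
    """Return the characters of `text` in reverse order."""
    # A slice with step -1 walks the string from the last character to the first.
    return text[::-1]


print(reverse_string("hello"))  # olleh
```

This is a useful reminder to check generated code before using it, especially from a small pre-trained model such as `google/gemma-2b`.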
