<a href="https://colab.research.google.com/github/linhkid/Google-IO-Extended-speechs/blob/main/notebooks/Gemma_2b_and_7b_Google_I_O_2024_Extended.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

Google recently released Gemma, a family of open LLMs built using the same research and training methodologies that Google used to create its flagship Gemini models. They come with big promises about reshaping the open LLM space. The models come in two sizes, 2B and 7B, which makes them small enough to run on consumer-level hardware, or on Google's own Colab service for users who don't have a GPU of their own. The Gemma models are exciting entries in the LLM race, and in this notebook I'll go over how to access them and run them in your own environment using Hugging Face's libraries and tools.

# Preparation

In [1]:
!pip install -q --upgrade transformers accelerate bitsandbytes flash_attn datasets peft

In [1]:
import os
from google.colab import userdata
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')
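
The cell above only works inside Colab, where `google.colab.userdata` exposes notebook secrets. Below is a minimal sketch of a fallback for other environments, assuming the token has already been exported as an `HF_TOKEN` environment variable. The helper name `get_hf_token` is hypothetical and not part of any library.

```python
import os


def get_hf_token(secret_name="HF_TOKEN"):
    """Return the token from Colab secrets if available, else from the environment.

    Hypothetical helper for illustration; returns None when no token is found.
    """
    try:
        # This import only succeeds inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        return os.environ.get(secret_name)
    try:
        return userdata.get(secret_name)
    except Exception:
        # Secret missing or notebook access not granted: fall back to the environment.
        return os.environ.get(secret_name)


token = get_hf_token()
if token:
    # os.environ values must be strings, so only assign when a token was found.
    os.environ["HF_TOKEN"] = token
```

Guarding the assignment avoids the `TypeError` that `os.environ[...] = None` would raise when no token is configured.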


# Gemma 2B

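Gemma 2B is the smaller of the two Gemma sizes. The cells below load its instruction-tuned checkpoint, `google/gemma-2b-it`, at four precisions (float32, bfloat16, 8-bit, and 4-bit) and run the same prompt through each.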
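
The precision chosen at load time largely determines how much memory the weights occupy. As a back-of-the-envelope sketch, the snippet below multiplies a parameter count by the bits stored per parameter. The figure of 2.5 billion parameters is an assumed round number for illustration (check the model card for the exact count). The estimate covers weights only and ignores activations, the KV cache, and any quantization metadata.

```python
# Bits stored per parameter for each precision used in this notebook.
BITS_PER_PARAM = {"float32": 32, "bfloat16": 16, "int8": 8, "4-bit": 4}


def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: roughly 2.5 billion parameters for Gemma 2B (not taken from the notebook).
ASSUMED_PARAMS = 2.5e9

for name, bits in BITS_PER_PARAM.items():
    print(f"{name:>8}: {weight_memory_gib(ASSUMED_PARAMS, bits):.2f} GiB")
```

Each halving of the bit width halves the estimate, which is why the quantized variants fit on much smaller GPUs.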


## Original full precision (torch.float32)


In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_NAME = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Load the weights in full precision; float32 is the default, stated here for clarity.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    torch_dtype=torch.float32
)

input_text = "Write me a poem about Google I/O event."
# Move the tokenized prompt to the same device as the model.
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.



<bos>Write me a poem about Google I/O event.

A digital stage, a virtual space,
Where ideas take their place.
From startups to giants, a diverse array,
Sharing stories, insights, and ways.

The lights shine bright, the speakers speak,
A symphony of knowledge, a vibrant streak.
From AI to VR, the topics unfold,
Connecting minds, stories to be told.

With every click, a new adventure unfolds,
A world of possibilities, a story to be told.
Google I/O, a beacon in the night,
Guiding the future, shining ever bright.<eos>


## Using BFloat16


In [3]:
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# The prompt is unchanged, so input_ids from the float32 cell is reused.

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))




<bos>Write me a poem about Google I/O event.

A digital stage, a virtual space,
Where ideas take their place.
From startups to giants, a diverse array,
Sharing their stories, come what may.

With presentations, panels, and talks,
A chance to learn, to grow, to recall.
The energy's electric, the atmosphere's bright,
As the world's innovators take flight.

With every click, a new horizon unfolds,
A tapestry of dreams, stories yet untold.
From artificial intelligence to the human touch,
Google I/O shines, a beacon of much.

So let us join the virtual fray,
And explore the boundless, ever-changing day.
With Google I/O, we're inspired and free,
To shape the future, to make it for thee.<eos>
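
The bfloat16 poem opens like the float32 one and then diverges. One plausible reason is that bfloat16 keeps float32's 8 exponent bits but only 7 of its 23 mantissa bits, so every weight and activation is rounded, and small shifts in the logits can change which token gets picked. Here is a rough pure-Python sketch of that rounding. The helper `to_bfloat16` is hypothetical, handles finite values only, and is not how PyTorch implements the conversion.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 is the top 16 bits of float32; add a bias so dropping the
    # low 16 bits rounds to nearest, with ties going to the even neighbour.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (1.0, 3.14159265, 0.1):
    print(value, "->", to_bfloat16(value))
```

Values such as `1.0` survive unchanged, while most others keep only two to three significant decimal digits.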


## Quantized 8-bit Integer

In [4]:
from transformers import BitsAndBytesConfig

# Quantize the model weights to 8-bit integers with bitsandbytes at load time.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=quant_config
)

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))

`low_cpu_mem_usage` was None, now set to True since model is quantized.



<bos>Write me a poem about Google I/O event.

A digital stage, a virtual space,
Where ideas flow, a boundless chase.
From startups to giants, a vibrant scene,
Where innovation thrives, a collaborative dream.

The Google I/O stage, a beacon bright,
A portal to the future, a dazzling sight.
With every keynote, a spark is ignited,
Igniting a passion, a burning light.

The speakers, diverse and bold,
Share their stories, stories to be told.
From AI and robotics to the human touch,
They paint a picture of a world that's to come.

The audience, engaged and keen,
A symphony of minds, a collective team.
They connect, they learn, they share their might,
In this digital realm, where dreams take flight.

So let us join the chorus, a global refrain,
In Google's I/O, where the future's here.
A stage for inspiration, a place to ignite,
A beacon of hope, a beacon of light.<eos>
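
As a rough illustration of what 8-bit integer quantization means, the sketch below applies absmax quantization to a short list of weights: the largest magnitude is mapped to 127, every value is rounded to the nearest integer code, and the scale is stored so the values can be approximately recovered. This is a simplified stand-in with hypothetical helper names, not the bitsandbytes implementation, which additionally treats outlier values separately.

```python
def quantize_absmax_int8(values):
    """Map floats to integer codes in [-127, 127] plus one shared scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from integer codes and the scale."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize_int8(codes, scale)

print("codes:   ", codes)
print("restored:", restored)
```

The round trip is lossy: each restored value can differ from the original by up to half a quantization step, which is enough to nudge the generated text.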


## Quantized 4-bit precision


In [6]:
from transformers import BitsAndBytesConfig

# NF4 4-bit weights with double quantization; computation runs in float16.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=quant_config
)

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))

`low_cpu_mem_usage` was None, now set to True since model is quantized.



<bos>Write me a poem about Google I/O event.

A digital stage, a virtual space,
Where ideas flow, a boundless chase.
From startups to giants, a diverse array,
Sharing stories, insights, and a brighter day.

With every click, a journey takes flight,
From product launches to data's might.
The energy is electric, the atmosphere alive,
As communities gather, a collective thrive.

With every session, a spark is ignited,
A thirst for knowledge, a hunger to ignite.
From AI to marketing, the topics unfold,
A symphony of innovation, a story to be told.

So let us gather, in this digital space,
To learn, to grow, and to leave our trace.
Google I/O, a beacon of hope and light,
A journey that inspires, day and night.<eos>
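
To give a feel for how 4-bit storage works, the sketch below quantizes one block of weights to 16 levels and packs two 4-bit codes into each byte. It uses evenly spaced levels for simplicity, which is an assumption made for illustration. The `nf4` type selected above uses a different, non-uniform set of levels, and the helper names here are hypothetical rather than bitsandbytes functions.

```python
def quantize_block_4bit(block):
    """Map floats in one block to codes 0..15 plus the block's absmax."""
    absmax = max(abs(v) for v in block) or 1.0
    # Rescale [-absmax, absmax] onto [0, 15] and round to the nearest code.
    codes = [min(15, max(0, round((v / absmax + 1) * 7.5))) for v in block]
    return codes, absmax


def pack_nibbles(codes):
    """Store two 4-bit codes per byte, padding odd-length input with a zero."""
    padded = list(codes) + [0] * (len(codes) % 2)
    return bytes((hi << 4) | lo for hi, lo in zip(padded[0::2], padded[1::2]))


def unpack_nibbles(packed, count):
    """Inverse of pack_nibbles; count drops any padding code."""
    codes = []
    for byte in packed:
        codes.extend((byte >> 4, byte & 0x0F))
    return codes[:count]


def dequantize_block_4bit(codes, absmax):
    """Recover approximate floats from 4-bit codes and the block's absmax."""
    return [(c / 7.5 - 1) * absmax for c in codes]


block = [-1.0, -0.5, 0.0, 0.5, 1.0]
codes, absmax = quantize_block_4bit(block)
packed = pack_nibbles(codes)
restored = dequantize_block_4bit(unpack_nibbles(packed, len(block)), absmax)

print(len(block), "weights stored in", len(packed), "bytes")
print("restored:", [round(v, 3) for v in restored])
```

Each block also needs its own `absmax` stored alongside the codes; double quantization, enabled above with `bnb_4bit_use_double_quant=True`, reduces that overhead by quantizing those per-block constants as well.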
