# Demo of Google's Gemma Model - Answering Common Python Questions

By Nicole Michaud
04/08/2024

[Check out my Github here](http://github.com/nicolemichaud03)

[Connect with me on LinkedIn](http://linkedin.com/in/nicole-michaud2)

<div style="text-align: center;">
<img align="center" src = "https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdWSrhXoP4nHcjFcmbTNjhKcd3LRLcIiL2uEp4_8ilX7h_zsMu_muQlLP52eEgxvq4ejAQKy0TQKNaFC07O4o9imxqDDKF8hgaLU-iYfwmcPYGpm64psp1WHyaJOZQPImAhCDpYtc4nWEvbM3hSERTA50n08rIhftkP0rK1ai9uB-o3nWx0TQMRWt1leQ/w1200-h630-p-k-no-nu/Keras-Gemma-GfD.png" width="700">
</div> 


#### Purpose:
The goal of this notebook is to demonstrate how to use the Gemma LLM to answer common questions about the Python programming language. Gemma was trained on web documents, code, and mathematics, so these question and answer pairs about the Python programming language should work great with it! The data used for this task was sourced from FAQs on Python's website.

#### What is Gemma?
Gemma is a lightweight, open-source family of models from Google, built from the same research and technology used to create their Gemini models.
Gemma models come in two different parameter size options: 2B and 7B. The 2B model is more lightweight and will perform just fine for this demo.

Let's get started!

## Data Preparation

First, we import the libraries needed for data processing and list the input files that Kaggle has attached to this notebook, including the Gemma model files:

In [1]:
import numpy as np 
import pandas as pd 

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/gemma/keras/gemma_2b_en/2/config.json
/kaggle/input/gemma/keras/gemma_2b_en/2/tokenizer.json
/kaggle/input/gemma/keras/gemma_2b_en/2/metadata.json
/kaggle/input/gemma/keras/gemma_2b_en/2/model.weights.h5
/kaggle/input/gemma/keras/gemma_2b_en/2/assets/tokenizer/vocabulary.spm
/kaggle/input/python-faq-qa/python_faqs.csv


In [2]:
!pip install -q -U keras-nlp
!pip install -q -U "keras>=3"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.
tensorflow 2.15.0 requires keras<2.16,>=2.15.0, but you have keras 3.1.1 which is incompatible.

In [3]:
import keras
import keras_nlp

2024-04-08 17:25:24.721473: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-08 17:25:24.721678: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-08 17:25:24.907211: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


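Keras 3 can run the model on any of three backends: TensorFlow, JAX, or PyTorch. I am going to be using TensorFlow in this demo. Note that Keras reads `KERAS_BACKEND` only when it is first imported, so in a fresh session this cell should run before the `import keras` cell above. It makes no difference to the results here because TensorFlow is also the default backend.

In [4]:
# Select the Keras backend (only takes effect if set before Keras is first imported)
os.environ["KERAS_BACKEND"] = "tensorflow"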
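
Since the backend is chosen through an environment variable, a small guard can catch the case where it is set too late. The following is only a sketch: the helper name `select_keras_backend` is hypothetical (it is not part of Keras), and the accepted names are the three backends mentioned above, using the strings I understand Keras 3 to expect (`"torch"` for PyTorch).

```python
import os
import sys
import warnings

# Backend names for TensorFlow, JAX, and PyTorch (assumed Keras 3 spellings).
SUPPORTED_BACKENDS = {"tensorflow", "jax", "torch"}


def select_keras_backend(name: str) -> str:
    """Set KERAS_BACKEND, warning if Keras has already been imported."""
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unknown backend {name!r}; expected one of {sorted(SUPPORTED_BACKENDS)}")
    if "keras" in sys.modules:
        # The variable is read at import time, so setting it now has no effect.
        warnings.warn("keras is already imported; restart the session to switch backends")
    os.environ["KERAS_BACKEND"] = name
    return name


select_keras_backend("tensorflow")
```

Calling the helper in the very first cell of the notebook, before any Keras import, keeps the backend choice explicit.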


The dataset that is going to be used was created from <a href="https://docs.python.org/3/faq/general.html">Python's FAQs page</a> and includes the common questions, their answers, and what category they fall under.

In [5]:
#import the dataset
import pandas as pd
documents = pd.read_csv("/kaggle/input/python-faq-qa/python_faqs.csv")


### Data Exploration

In [6]:
documents.head()

| | Question | Answer | Category |
|---|---|---|---|
| 0 | Why does Python use indentation for grouping o... | Guido van Rossum believes that using indentati... | design |
| 1 | Why am I getting strange results with simple a... | See the next question. | design |
| 2 | Why are floating-point calculations so inaccur... | Users are often surprised by results like this... | design |
| 3 | Why are Python strings immutable? | \nThere are several advantages.\n\nOne is perf... | design |
| 4 | Why must 'self' be used explicitly in method d... | The idea was borrowed from Modula-3. It turns ... | design |


In [8]:
documents.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Question  178 non-null    object
 1   Answer    178 non-null    object
 2   Category  178 non-null    object
dtypes: object(3)
memory usage: 4.3+ KB


The dataset contains 178 question-answer pairs.

In [7]:
documents['Category'].unique()

array(['design', 'extending', 'general', 'gui', 'installed', 'library',
       'programming', 'windows'], dtype=object)

The FAQ questions each fall into one of eight different categories.
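
Beyond listing the categories, it can help to see how many questions each one contains. The sketch below uses a tiny made-up DataFrame (`toy_faqs`, with hypothetical rows) so that it runs on its own; in the notebook the same `value_counts()` call would be applied to `documents['Category']`.

```python
import pandas as pd

# Hypothetical stand-in for the FAQ dataset; only the column names match the real data.
toy_faqs = pd.DataFrame(
    {
        "Question": ["Why are Python strings immutable?", "What is Python?", "Why must 'self' be explicit?"],
        "Category": ["design", "general", "design"],
    }
)

# Number of questions per category, most frequent first.
category_counts = toy_faqs["Category"].value_counts()
print(category_counts)

# Number of distinct categories.
print(toy_faqs["Category"].nunique())
```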

## Modeling

For this task I will be using the Gemma 2B model (the `gemma_2b_en` preset), because I am running it on a personal computer that may not be able to support the 7B model. I will be using the GemmaCausalLM model, an end-to-end model for causal language modeling, configured with a preprocessor layer (GemmaCausalLMPreprocessor). With a preprocessor attached, string inputs are tokenized automatically when they are passed to the model, rather than having to be preprocessed in a separate step beforehand. The <a href="https://keras.io/api/keras_nlp/models/gemma/">Keras 3 API documentation</a> has more details on the different steps and options for Gemma's pretrained models.

#### Loading the Model:

In [9]:
# Load a preprocessor layer from a preset.
preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
    "gemma_2b_en",
)

Attaching 'tokenizer.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'tokenizer.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'assets/tokenizer/vocabulary.spm' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.


In [10]:
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en", preprocessor=preprocessor)
gemma_lm.summary()

Attaching 'config.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'config.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'model.weights.h5' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...


#### Generating prompts:

Even before fine-tuning the model, we can use it to generate answers. First, we have to generate some sample prompts to see how well the model answers common Python questions before we fine-tune it.

In [11]:
# Format each question-answer pair with a template that labels and joins its Category, Question, and Answer,
# so that the model sees every example in the same structure
from tqdm.notebook import tqdm
tqdm.pandas() 
template = "\n\nCategory:\nkaggle-{Category}\n\nQuestion:\n{Question}\n\nAnswer:\n{Answer}"
documents["prompt"] = documents.progress_apply(lambda row: template.format(Category=row.Category,
                                                             Question=row.Question,
                                                             Answer=row.Answer), axis=1)
documents = documents.prompt.tolist()

  0%|          | 0/178 [00:00<?, ?it/s]
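
To make the template concrete, here is a small sketch of what it produces. The template string is the one defined above; the answer text in the training example is a shortened placeholder, not the full FAQ answer.

```python
template = "\n\nCategory:\nkaggle-{Category}\n\nQuestion:\n{Question}\n\nAnswer:\n{Answer}"

# A training example fills in all three fields (answer shortened here for illustration).
training_example = template.format(
    Category="design",
    Question="Why are Python strings immutable?",
    Answer="There are several advantages.",
)
print(training_example)

# An inference prompt leaves the answer empty, so the text ends right after "Answer:"
# and the model's continuation becomes the answer.
inference_prompt = template.format(Category="", Question="What is Python?", Answer="")
print(repr(inference_prompt))
```

Because the prompt ends at `Answer:\n`, whatever the model generates next is read as the answer to the question.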

For these prompts I selected three questions at random, one from each of three different FAQ categories, to get an idea of how the model performs on different types of questions.


In [12]:
# Question from "general" category

prompt = template.format(
    Question="What is Python?",
    Answer="",
    Category=""
)
sampler = keras_nlp.samplers.TopKSampler(k=5, seed=2)
gemma_lm.compile(sampler=sampler)
print(gemma_lm.generate(prompt, max_length=256))

I0000 00:00:1712597559.600229     490 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
2024-04-08 17:32:39.602030: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
2024-04-08 17:32:39.602385: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.




Category:
kaggle-

Question:
What is Python?

Answer:
Python is an interpreted, high-level, general-purpose programming language. It is
designed to maximize programmer productivity with a combination of high-level
syntax, dynamic typing, dynamic binding, and high-level data structures, and a
low-level interpreted language, dynamic binding, and a low-level programming
model. It is often used alongside other languages. Python is used to build web
applications, data analysis tools, game engines, and more.

Category:
kaggle-

Question:
What is R?

Answer:
R is a programming language and software environment that supports statistical
computation and graphical display. R was developed by Ross Ihaka and Robert
Supportman in 1990, and it was originally called S-Plus. R is a programming language
and software environment for statistical computing and graphics.

Category:
kaggle-

Question:
What is RStudio?

Answer:
RStudio is an integrated development environment (IDE) written in the programmi

The question in the prompt received a pretty decent answer, but the model then kept going and generated further question-and-answer pairs that are not related to Python (they are about another programming language, R) until it hit the `max_length` limit.
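
Since the model continues past the first answer, one option is to trim the generated text afterwards. The helper below is a minimal sketch, not part of KerasNLP: the name `extract_first_answer` is hypothetical, and it assumes the output follows the prompt template used in this notebook (an `Answer:` label, with any follow-on block starting at a blank line plus `Category:`).

```python
def extract_first_answer(generated: str) -> str:
    """Return only the text of the first answer in a generated completion."""
    marker = "Answer:\n"
    start = generated.find(marker)
    if start == -1:
        # No answer label found; return the text unchanged apart from whitespace.
        return generated.strip()
    answer = generated[start + len(marker):]
    # Cut off any extra Category/Question/Answer blocks the model invented.
    next_block = answer.find("\n\nCategory:")
    if next_block != -1:
        answer = answer[:next_block]
    return answer.strip()


sample_output = (
    "\n\nCategory:\nkaggle-\n\nQuestion:\nWhat is Python?\n\nAnswer:\n"
    "Python is an interpreted, high-level programming language.\n\n"
    "Category:\nkaggle-\n\nQuestion:\nWhat is R?\n\nAnswer:\nR is a programming language."
)
print(extract_first_answer(sample_output))
```

In the notebook this would be applied to the string returned by `gemma_lm.generate(...)`.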

In [13]:
# Question from "extending" category

prompt = template.format(
    Question="How do I catch the output from PyErr_Print() (or anything that prints to stdout/stderr)?",
    Answer="",
    Category=""
)
sampler = keras_nlp.samplers.TopKSampler(k=5, seed=2)
gemma_lm.compile(sampler=sampler)
print(gemma_lm.generate(prompt, max_length=256))



Category:
kaggle-

Question:
How do I catch the output from PyErr_Print() (or anything that prints to stdout/stderr)?

Answer:
This can be done using the following code:

import sys
def my_print(*args, **kwargs):
    if sys.stderr is not None:
        sys.stderr.write("%s\n" % (", ".join(map(str,args)),)
    if sys.stdout is not None:
        sys.stdout.write("%s" % (" ".join(map(str,args)),)
    return

my_print("hello world")

Output:
hello world


I hope this helps,

-John

I think you'll need to use ctypes to do it.

-John


You need to catch the exception and then do a sys.stderr.flush() and sys.stdout.flush().



The answer provided to this question is a bit confusing: it includes text signed by someone named John, which seems a little out of place.

In [14]:
# Question from "library" category

prompt = template.format(
    Question="How do I program using threads?",
    Answer="",
    Category=""
)
sampler = keras_nlp.samplers.TopKSampler(k=5, seed=2)
gemma_lm.compile(sampler=sampler)
print(gemma_lm.generate(prompt, max_length=256))



Category:
kaggle-

Question:
How do I program using threads?

Answer:
Threads are not a programming language feature. They’re a feature in the operating system that allows one program to use another program’s code. You can use threads in Java and in C++ (using the C++ thread class).

The following code shows how to start the second thread.

<code>Thread t1 = new Thread(new Runnable()
{
    @Override
    public void run()
    {
        System.out.println("Hello from thread 1!");
    }
});

t1.start();
t1.join(); // wait for thread to complete</code>

In the above example, the second thread starts and immediately exits (because the thread is not doing anything). The thread will not run until the first thread completes (because the second thread waits for the first thread to complete).

The following code shows how to use a thread to perform some work in the background and then return to the main thread.

<code>Thread t = new Thread(new Runnable()
{
    public void run() 
    {
        

The above answers look okay, but some parts of them do not completely make sense for what is being asked. Let's see if we can improve them by fine-tuning the model.

### Fine-tuning with LoRA (Low-Rank Adaptation):

Gemma models are pre-trained; however, a process called fine-tuning can be used to further train them on a particular dataset!

According to the documentation, **LoRA** is a form of PEFT, or parameter-efficient fine-tuning. It works by freezing the pre-trained model weights and training only a small number of newly added parameters, which makes fine-tuning faster and more memory-efficient while maintaining high-quality outputs!
You can learn more about the specifics of how LoRA works to fine-tune Gemma models <a href="https://colab.research.google.com/github/google/generative-ai-docs/blob/main/site/en/gemma/docs/lora_tuning.ipynb#scrollTo=lSGRSsRPgkzK">here!</a>


There are other options for tuning Gemma models, but for this demo, LoRA fine-tuning via KerasNLP is a good choice because it is known to be efficient and effective.

When enabling LoRA, the rank must be specified. The rank sets the size of the small matrices that LoRA trains, and therefore the number of trainable parameters; a higher rank means that more detailed changes can be made to the model. However, the higher the rank, the more computationally expensive it will be. Therefore, it is best to start with a relatively low rank. Here, we use rank=4.
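
To get a feel for how the rank affects cost, here is a rough sketch of the standard LoRA parameter count for a single weight matrix. The 2048 x 2048 layer size is purely illustrative, and the function name is hypothetical: for a frozen matrix of shape `d_in x d_out`, LoRA trains two thin matrices of shapes `d_in x rank` and `rank x d_out`.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out weight matrix."""
    return rank * (d_in + d_out)


d_in = d_out = 2048  # illustrative layer size, not read from the model
frozen = d_in * d_out

for rank in (4, 8, 16):
    added = lora_trainable_params(d_in, d_out, rank)
    print(f"rank={rank:>2}: {added:,} trainable vs {frozen:,} frozen ({added / frozen:.2%})")
```

The trainable count grows linearly with the rank, which is why doubling the rank roughly doubles the extra parameters to train.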

In [15]:
# Enable LoRA: freeze the backbone weights and add small trainable low-rank matrices
# to the query and value projection layers of each attention block

gemma_lm.backbone.enable_lora(rank=4)
gemma_lm.summary()

The total params are a little more than before enabling LoRA: now 2,507,536,384, up from 2,506,172,416. However, only 1,363,968 of those params are trainable, whereas all of the params were trainable previously. This is because LoRA freezes the original weights and trains only the small matrices it adds, which reduces the time and memory required for fine-tuning.
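
The trainable count of 1,363,968 can be reproduced by hand. The sketch below rests on assumptions that are not stated in this notebook: the Gemma 2B configuration as I understand it (18 decoder layers, hidden size 2048, 8 query heads, 1 key/value head, head size 256), and a LoRA layout in which each adapted kernel of shape `(..., out)` gets one factor of shape `(..., rank)` and one of shape `(rank, out)`.

```python
from math import prod

RANK = 4
NUM_LAYERS = 18                 # assumed Gemma 2B depth
QUERY_KERNEL = (8, 2048, 256)   # assumed (heads, hidden size, head size)
VALUE_KERNEL = (1, 2048, 256)   # assumed single key/value head


def lora_params(kernel_shape: tuple, rank: int) -> int:
    """Parameters in the two low-rank factors attached to one kernel."""
    factor_a = prod(kernel_shape[:-1]) * rank  # shape kernel_shape[:-1] + (rank,)
    factor_b = rank * kernel_shape[-1]         # shape (rank, kernel_shape[-1])
    return factor_a + factor_b


per_layer = lora_params(QUERY_KERNEL, RANK) + lora_params(VALUE_KERNEL, RANK)
total_trainable = NUM_LAYERS * per_layer
print(f"{total_trainable:,}")  # 1,363,968
```

Under these assumptions the result matches the trainable parameter count reported by `gemma_lm.summary()`.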

In [16]:
# Limit the input sequence length to 512 (to control memory usage).
gemma_lm.preprocessor.sequence_length = 512
# The Gemma documentation uses the AdamW optimizer, but other optimizers can be used as well
optimizer = keras.optimizers.AdamW(
    learning_rate=5e-5,
    weight_decay=0.01,
)
# Exclude layernorm and bias terms from decay.
optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
gemma_lm.fit(documents, epochs=1, batch_size=1)

178/178 ━━━━━━━━━━━━━━━━━━━━ 12357s 69s/step - loss: 0.9697 - sparse_categorical_accuracy: 0.5263


<keras.src.callbacks.history.History at 0x7ca0ac5e57e0>

I used the same hyperparameter values as the documentation <a href="https://ai.google.dev/gemma/docs/lora_tuning#lora_fine-tuning">here</a>. Some other options for fine-tuning include using a different optimizer, testing out different learning rates and weight decays, and changing the number of epochs and batch size. At each step, the model adjusts its trainable weights to reduce the loss. How large each adjustment is depends on the **learning rate**. **Weight decay** is a regularization technique that shrinks each weight slightly toward zero at every step (by an amount proportional to the weight, the learning rate, and the decay rate), which discourages large weights and helps limit overfitting.
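
To illustrate how the learning rate and weight decay interact, here is a deliberately simplified sketch of one update to a single weight. It uses a plain gradient step in place of Adam's moment-based step, so it is not what `AdamW` computes exactly; the function name is hypothetical, and the default values are the ones used in the cell above.

```python
def decoupled_decay_step(weight: float, grad: float,
                         learning_rate: float = 5e-5,
                         weight_decay: float = 0.01) -> float:
    """One simplified update: shrink the weight, then take a gradient step."""
    # Weight decay: pull the weight toward zero in proportion to its size.
    weight = weight - learning_rate * weight_decay * weight
    # Gradient step: move against the gradient of the loss.
    weight = weight - learning_rate * grad
    return weight


# With a zero gradient, only the decay acts: the weight shrinks by a tiny fraction.
print(decoupled_decay_step(1.0, grad=0.0))
# With a nonzero gradient, the gradient term dominates at these settings.
print(decoupled_decay_step(1.0, grad=2.0))
```

With these values each step shrinks a weight by a factor of `1 - 5e-5 * 0.01`, so the decay is a gentle pull toward zero and does not force the weights to reach it.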

Generally, models perform better with multiple epochs, as each additional pass gives the model another chance to learn the patterns in the data. However, too many epochs on a small dataset can cause overfitting (the model is over-trained on the training data and therefore does not know how to respond to test/unseen data). A larger batch size (the number of examples the model processes per weight update) gives smoother gradient estimates and fewer updates per epoch, but it also requires more memory.

As you can see, the above model took a long time to run: about 12,357 seconds, or roughly 3.4 hours, for a single epoch. Runtime grows with the number of epochs, and memory usage grows with the batch size. Therefore, you must keep in mind the tradeoff between optimal model performance and runtime/memory usage.
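
A back-of-the-envelope estimate can help when planning longer runs. The sketch below uses the figures from the training log above (178 examples, batch size 1, roughly 69 seconds per step) and a hypothetical helper name. It assumes the time per step stays the same across epochs, and it should not be reused for other batch sizes, since the time per step changes with batch size.

```python
import math


def estimate_training_seconds(num_examples: int, epochs: int,
                              batch_size: int, seconds_per_step: float) -> float:
    """Rough training time: steps per epoch x epochs x seconds per step."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return epochs * steps_per_epoch * seconds_per_step


one_epoch = estimate_training_seconds(178, epochs=1, batch_size=1, seconds_per_step=69)
three_epochs = estimate_training_seconds(178, epochs=3, batch_size=1, seconds_per_step=69)
print(f"1 epoch:  ~{one_epoch / 3600:.1f} hours")
print(f"3 epochs: ~{three_epochs / 3600:.1f} hours")
```

The one-epoch estimate (12,282 seconds) lands close to the 12,357 seconds reported in the log; the small gap comes from the rounded per-step time.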

### Inference after fine-tuning

Let's try out the same sample questions used before to see how the answers have changed after fine-tuning.

In [17]:
prompt = template.format(
    Question="What is Python?",
    Answer="",
    Category=""
)
sampler = keras_nlp.samplers.TopKSampler(k=5, seed=2)
gemma_lm.compile(sampler=sampler)
print(gemma_lm.generate(prompt, max_length=256))



Category:
kaggle-

Question:
What is Python?

Answer:
Python is a general-purpose, high-level programming language that can be used for many different kinds of applications, including web development, scientific computing, and system scripting. It was created by Guido van Rossum in 1991, and is now maintained by The Python Software Foundation.

Python is designed to be easy to read and write, yet powerful enough to be used by professionals. It is widely used for web development due to its built-in support for HTTP and other networking protocols, as well as its support for XML processing and database access. Python is also popular among scientific researchers, due to its support for numerical computing and data analysis.

Python is widely available on most major operating systems, and is free and easy to install, making it a popular choice for beginners and professionals alike.

If you're looking for a general-purpose programming language that's easy to learn and use, Python is defini

In [18]:
prompt = template.format(
    Question= "How do I catch the output from PyErr_Print() (or anything that prints to stdout/stderr)?",
    Answer="",
    Category=""
)
sampler = keras_nlp.samplers.TopKSampler(k=5, seed=2)
gemma_lm.compile(sampler=sampler)
print(gemma_lm.generate(prompt, max_length=256))



Category:
kaggle-

Question:
How do I catch the output from PyErr_Print() (or anything that prints to stdout/stderr)?

Answer:
The answer is to redirect stdout and stderr to file, and then check the file for output.


Category:
Python

Tags:

Article By:
Michael A. Jackson

Article Last Updated:
2014-05-10 08:40:00 -0500

Article Notes:
http://www.pythonthegeek.com/2014/05/10/how-do-i-catch-the-output-from-pyerr_print-or-anything-that-prints-to-stdout-stderr/#comment-4447



In [19]:
prompt = template.format(
    Question= "How do I program using threads?",
    Answer="",
    Category=""
)
sampler = keras_nlp.samplers.TopKSampler(k=5, seed=2)
gemma_lm.compile(sampler=sampler)
print(gemma_lm.generate(prompt, max_length=256))



Category:
kaggle-

Question:
How do I program using threads?

Answer:
The simplest way to do this is to use Python's threading library.



The generated answers for our sample questions are looking much better and making more sense after fine-tuning!

The next steps would be to continue trying out different hyperparameter values for fine-tuning to see how much we can improve the accuracy and the answers we get!

### Conclusion

Google's new Gemma models are lightweight and user-friendly, as they can be used with familiar APIs such as Keras. In this demo, I showed you how to use Gemma with KerasNLP to answer common questions about the programming language Python!