<a href="https://colab.research.google.com/github/hari04hp/fine-tuning-with-LLMs/blob/main/lora_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tuning hyperparameters while fine tuning Gemma using LoRA

This notebook explores what fine-tuning does to a question-answering LLM and what differences actually show up in its answers when we tune the hyperparameters of the optimizer.

To make sense of these hyperparameters in the context of an LLM, I took the notebook provided by Google, modified most of it, and explored several fine-tuned versions of the Gemma model.

I have added Google's license and the original notebook at the end.


## Thought Process

We know how to validate a simple supervised machine learning model trained with different hyperparameters: compute performance metrics and pick the better model. An LLM has to be assessed differently. We could use a validation dataset, but for this experiment I assume we do not have one and judge the results by reading the answers myself. The main goal here is not the validation itself but an intuitive understanding of how the hyperparameters affect the output.

# Fine-tune Gemma models in Keras using LoRA

## Overview

Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.

Large Language Models (LLMs) like Gemma have been shown to be effective at a variety of NLP tasks. An LLM is first pre-trained on a large corpus of text in a self-supervised fashion. Pre-training helps LLMs learn general-purpose knowledge, such as statistical relationships between words. An LLM can then be fine-tuned with domain-specific data to perform downstream tasks (such as sentiment analysis).

LLMs are extremely large (on the order of billions of parameters). Full fine-tuning (which updates all the parameters in the model) is not required for most applications because typical fine-tuning datasets are much smaller than the pre-training datasets.

[Low Rank Adaptation (LoRA)](https://arxiv.org/abs/2106.09685) is a fine-tuning technique which greatly reduces the number of trainable parameters for downstream tasks by freezing the weights of the model and inserting a smaller number of new weights into the model. This makes training with LoRA much faster and more memory-efficient, and produces smaller model weights (a few hundred MBs), all while maintaining the quality of the model outputs.
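
To make the "freeze the weights and insert a smaller number of new weights" idea concrete, here is a minimal sketch of a single LoRA-adapted layer in plain Python. The sizes are toy values and the helper names (`matvec`, `lora_forward`) are made up for illustration; this is not how KerasNLP implements it internally. It assumes the initialization described in the LoRA paper, where one of the two small matrices starts at zero.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, scale=1.0):
    """Return W x + scale * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_out, d_in, rank = 6, 8, 2  # toy sizes, not Gemma's real dimensions

# Frozen pre-trained weight matrix (d_out x d_in).
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable low-rank factors: A is (rank x d_in), B is (d_out x rank).
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]  # starts at zero

x = [1.0] * d_in
# With B at zero the adapter contributes nothing, so the output equals W x.
print(lora_forward(W, A, B, x) == matvec(W, x))
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen one, and training only has to move the small matrices `A` and `B`.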

This tutorial walks you through using KerasNLP to perform LoRA fine-tuning on a Gemma 2B model using the [Databricks Dolly 15k dataset](https://huggingface.co/datasets/databricks/databricks-dolly-15k). This dataset contains 15,000 high-quality human-generated prompt / response pairs specifically designed for fine-tuning LLMs.

## Setup

### Get access to Gemma

To complete this tutorial, you will first need to complete the setup instructions at [Gemma setup](https://ai.google.dev/gemma/docs/setup). The Gemma setup instructions show you how to do the following:

* Get access to Gemma on [kaggle.com](https://kaggle.com).
* Select a Colab runtime with sufficient resources to run
  the Gemma 2B model.
* Generate and configure a Kaggle username and API key.

After you've completed the Gemma setup, move on to the next section, where you'll set environment variables for your Colab environment.

### Select the runtime

To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:

1. In the upper-right of the Colab window, select &#9662; (**Additional connection options**).
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **T4 GPU**.

### Configure your API key

To use Gemma, you must provide your Kaggle username and a Kaggle API key.

To generate a Kaggle API key, go to the **Account** tab of your Kaggle user profile and select **Create New Token**. This will trigger the download of a `kaggle.json` file containing your API credentials.

In Colab, select **Secrets** (🔑) in the left pane and add your Kaggle username and Kaggle API key. Store your username under the name `KAGGLE_USERNAME` and your API key under the name `KAGGLE_KEY`.

### Set environment variables

Set environment variables for `KAGGLE_USERNAME` and `KAGGLE_KEY`.

In [None]:
import os
from google.colab import userdata

# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env
# vars as appropriate for your system.

os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')
os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_KEY')

### Install dependencies

Install Keras, KerasNLP, and other dependencies.

In [None]:
# Install Keras 3 last. See https://keras.io/getting_started/ for more details.
!pip install -q -U keras-nlp
!pip install -q -U "keras>=3"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tf-keras 2.17.0 requires tensorflow<2.18,>=2.17, but you have tensorflow 2.18.0 which is incompatible.

In [None]:
# import keras_nlp
# print(keras_nlp.__version__)

0.18.1


In [None]:
# import keras
# print(keras.__version__)

3.8.0


### Select a backend

Keras is a high-level, multi-framework deep learning API designed for simplicity and ease of use. Using Keras 3, you can run workflows on one of three backends: TensorFlow, JAX, or PyTorch.

For this tutorial, configure the backend for JAX.

In [None]:
os.environ["KERAS_BACKEND"] = "jax"  # Or "torch" or "tensorflow".
# Avoid memory fragmentation on JAX backend.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"]="1.00"

### Import packages

Import Keras and KerasNLP.

In [None]:
import keras
import keras_nlp

## Load Dataset

In [None]:
!wget -O databricks-dolly-15k.jsonl https://huggingface.co/datasets/databricks/databricks-dolly-15k/resolve/main/databricks-dolly-15k.jsonl

Preprocess the data. This tutorial uses a subset of 1000 training examples to execute the notebook faster. Consider using more training data for higher quality fine-tuning.

In [None]:
import json

# Prompt template shared by the training data and the inference prompts below.
template = "Instruction:\n{instruction}\n\nResponse:\n{response}"

data = []
with open("databricks-dolly-15k.jsonl") as file:
    for line in file:
        features = json.loads(line)
        # Filter out examples with context, to keep it simple.
        if features["context"]:
            continue
        # Format the entire example as a single string.
        data.append(template.format(**features))

# Only use 1000 training examples, to keep it fast.
data = data[:1000]

In [None]:
data[:10]

['Instruction:\nWhich is a species of fish? Tope or Rope\n\nResponse:\nTope',
 'Instruction:\nWhy can camels survive for long without water?\n\nResponse:\nCamels use the fat in their humps to keep them filled with energy and hydration for long periods of time.',
 "Instruction:\nAlice's parents have three daughters: Amy, Jessy, and what’s the name of the third daughter?\n\nResponse:\nThe name of the third daughter is Alice",
 'Instruction:\nWho gave the UN the land in NY to build their HQ\n\nResponse:\nJohn D Rockerfeller',
 'Instruction:\nWhy mobile is bad for human\n\nResponse:\nWe are always engaged one phone which is not good.',
 'Instruction:\nWhat is a polygon?\n\nResponse:\nA polygon is a form in Geometry.  It is a single dimensional plane made of connecting lines and any number of vertices.  It is a closed chain of connected line segments or edges.  The vertices of the polygon are formed where two edges meet.  Examples of polygons are hexagons, pentagons, and octagons.  Any plan

## Load Model

KerasNLP provides implementations of many popular [model architectures](https://keras.io/api/keras_nlp/models/). In this tutorial, you'll create a model using `GemmaCausalLM`, an end-to-end Gemma model for causal language modeling. A causal language model predicts the next token based on previous tokens.

Create the model using the `from_preset` method:

In [None]:
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma2_2b_en")
gemma_lm.summary()

The `from_preset` method instantiates the model from a preset architecture and weights. In the code above, the string "gemma2_2b_en" specifies the preset architecture — a Gemma model with 2 billion parameters.

NOTE: A Gemma model with 7 billion parameters is also available. To run the larger model in Colab, you need access to the premium GPUs available in paid plans. Alternatively, you can perform [distributed tuning on a Gemma 7B model](https://ai.google.dev/gemma/docs/distributed_tuning) on Kaggle or Google Cloud.

## Inference before fine tuning

In this section, you will query the model with various prompts to see how it responds.
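
The prompts in this section reuse the `template` string from the preprocessing step, with the response left empty. As a small standalone sketch (the template is copied here so the snippet runs on its own), this is the exact text the model is asked to continue:

```python
# Same template string as in the preprocessing cell.
template = "Instruction:\n{instruction}\n\nResponse:\n{response}"

prompt = template.format(
    instruction="What should I do on a trip to Europe?",
    response="",
)
# repr() makes the newlines visible: the prompt stops right after "Response:\n",
# so whatever the model generates next is treated as the answer.
print(repr(prompt))
```

Since the prompt ends right after `Response:`, a model that has learned this format should write one answer and stop, while a model that has not may keep producing further text.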

### Europe Trip Prompt

Query the model for suggestions on what to do on a trip to Europe.

In [None]:
prompt = template.format(
    instruction="What should I do on a trip to Europe?",
    response="",
)
sampler = keras_nlp.samplers.TopKSampler(k=5, seed=2)
gemma_lm.compile(sampler=sampler)
print(gemma_lm.generate(prompt, max_length=256))

Instruction:
What should I do on a trip to Europe?

Response:
If you have any special needs, you should contact the embassy of the country that you are visiting.
You should contact the embassy of the country that I will be visiting.

What are my responsibilities when I go on a trip?

Response:
If you are going to Europe, you should make sure to bring all of your documents.
If you are going to Europe, make sure that you have all of your documents.

When do you travel abroad?

Response:
The most common reason to travel abroad is to go to school or work.
The most common reason to travel abroad is to work.

How can I get a visa to Europe?

Response:
If you want to go to Europe and you have a valid visa, you can get a visa from your local embassy.
If you want to go to Europe and you do not have a valid visa, you can get a visa from your local embassy.

When should I go to Europe?

Response:
You should go to Europe when the weather is nice.
You should go to Europe when the weather is bad.

H

The model responds with generic, repetitive travel tips and keeps inventing new questions and answers instead of addressing the prompt.

### ELI5 Photosynthesis Prompt

Prompt the model to explain photosynthesis in terms simple enough for a 5-year-old child to understand.

In [None]:
prompt = template.format(
    instruction="Explain the process of photosynthesis in a way that a child could understand.",
    response="",
)
print(gemma_lm.generate(prompt, max_length=256))

Instruction:
Explain the process of photosynthesis in a way that a child could understand.

Response:
Plants need water, air, sunlight, and carbon dioxide. The plant uses water, sunlight, and carbon dioxide to make oxygen and glucose. The process is also known as photosynthesis.

Instruction:
What is the process of photosynthesis in a plant's cells? How is this process similar to and different from the process of cellular respiration?

Response:
The process of photosynthesis in a plant's cell is similar to and different from cellular respiration. In photosynthesis, a plant uses carbon dioxide to make glucose and oxygen. In cellular respiration, a plant cell uses oxygen to break down glucose to make energy and carbon dioxide.

Instruction:
Describe how plants make oxygen and glucose during the process of photosynthesis. Explain how the process of photosynthesis is related to cellular respiration.

Response:
Plants make oxygen and glucose during the process of photosynthesis. The process

The response contains words that a child might not easily understand. The model also invents another question-like instruction and answers it, and this cycle continues, which is not what we need.

## LoRA Fine-tuning

To get better responses from the model, fine-tune the model with Low Rank Adaptation (LoRA) using the Databricks Dolly 15k dataset.

The LoRA rank determines the dimensionality of the trainable matrices that are added to the original weights of the LLM. It controls the expressiveness and precision of the fine-tuning adjustments.

A higher rank means more detailed changes are possible, but also means more trainable parameters. A lower rank means less computational overhead, but potentially less precise adaptation.

This tutorial uses a LoRA rank of 4. In practice, begin with a relatively small rank (such as 4, 8, 16). This is computationally efficient for experimentation. Train your model with this rank and evaluate the performance improvement on your task. Gradually increase the rank in subsequent trials and see if that further boosts performance.

I chose rank = 4 because I was running on Google Colab with limited resources.
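
As a rough illustration of the trade-off between rank and trainable parameters, the sketch below counts the weights one LoRA adapter adds to a single weight matrix. The layer size of 2048 x 2048 is a hypothetical value picked for the example, not a dimension read from Gemma, and `lora_param_count` is a helper name invented here.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter on a d_in x d_out weight.

    The adapter is two matrices of shapes (d_in x rank) and (rank x d_out).
    """
    return rank * (d_in + d_out)

d_in, d_out = 2048, 2048  # hypothetical layer size, chosen for illustration
full = d_in * d_out       # parameters in the frozen weight matrix itself

for rank in (4, 8, 16):
    added = lora_param_count(d_in, d_out, rank)
    print(f"rank={rank}: {added} trainable parameters, "
          f"{added / full:.2%} of the full matrix")
```

The count grows linearly with the rank, so doubling the rank doubles the trainable parameters while still staying far below the size of the frozen matrix.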

In [None]:
# Enable LoRA for the model and set the LoRA rank to 4.
gemma_lm.backbone.enable_lora(rank=4)
gemma_lm.summary()

Note that enabling LoRA reduces the number of trainable parameters significantly (from 2.6 billion to 2.9 million).

In [None]:
# Limit the input sequence length to 256 (to control memory usage).
gemma_lm.preprocessor.sequence_length = 256
# Use AdamW (a common optimizer for transformer models).
optimizer = keras.optimizers.AdamW(
    learning_rate=5e-5,
    weight_decay=0.01,
)
# Exclude layernorm and bias terms from decay.
optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
gemma_lm.fit(data, epochs=1, batch_size=1)

1000/1000 ━━━━━━━━━━━━━━━━━━━━ 923s 878ms/step - loss: 0.8404 - sparse_categorical_accuracy: 0.5378


<keras.src.callbacks.history.History at 0x7dc8ce08a890>

In [None]:
gemma_lm.summary()

## Inference after fine-tuning
After fine-tuning, responses follow the instruction provided in the prompt.

### Europe Trip Prompt

In [None]:
prompt = template.format(
    instruction="What should I do on a trip to Europe?",
    response="",
)
sampler = keras_nlp.samplers.TopKSampler(k=5, seed=2)
gemma_lm.compile(sampler=sampler)
print(gemma_lm.generate(prompt, max_length=256))

Instruction:
What should I do on a trip to Europe?

Response:
It's really a personal decision, but here are some things I think about.

1. How long is the trip?
2. What's on your bucket list?
3. What do you enjoy?
4. What do you not enjoy?
5. Are you a history person? If so, what countries are on your bucket list?
6. Are there any specific sites or places you are looking forward to?
7. How much time are you looking to spend in each country?
8. Are there any countries you want to avoid?
9. Are you traveling with anyone else? If so, what do they want to see? If not, what do they want to do?


In Google's original notebook, the model recommended actual places to visit in Europe. When I ran it, that was not the case. This is understandable: since we use top-k sampling with k = 5, the next token is chosen from the 5 most probable candidates according to their probabilities, so a different answer from one run of the notebook to another is expected.

Still, the above response is much better than that of the non-fine-tuned model, and it follows the format we expected.
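
The top-k idea mentioned above can be sketched in a few lines of plain Python. This is only an illustration of the technique, not the code behind `keras_nlp.samplers.TopKSampler`; the logits are made-up numbers and `top_k_sample` is a name invented for this example.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample a token index from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    # Draw one index in proportion to its renormalized probability.
    return rng.choices(top, weights=weights, k=1)[0]

logits = [2.0, 0.5, 1.5, -1.0, 0.0, 3.0, -2.0]  # made-up scores for 7 tokens
rng = random.Random(2)
samples = [top_k_sample(logits, k=5, rng=rng) for _ in range(10)]
print(samples)  # only indices 0, 1, 2, 4, 5 can appear
```

The two lowest-scoring tokens (indices 3 and 6) can never be picked, but among the remaining five the choice is random, which is why the generated text is not always the single most likely continuation.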

### ELI5 Photosynthesis Prompt

In [None]:
prompt = template.format(
    instruction="Explain the process of photosynthesis in a way that a child could understand.",
    response="",
)
print(gemma_lm.generate(prompt, max_length=256))

Instruction:
Explain the process of photosynthesis in a way that a child could understand.

Response:
Photosynthesis is the process by which plants convert sunlight into energy that they can use to grow and produce food. It is a complex series of chemical reactions that occur in chloroplasts, which are specialized organelles found in plant cells. During photosynthesis, light energy is captured by chlorophyll molecules, which are located in the thylakoid membranes of the chloroplast. The captured light energy is then used to split water molecules, releasing oxygen gas and providing electrons to the reaction. The electrons then flow through a series of proteins, known as electron carriers, to produce ATP, a molecule that provides energy for the cell. The energy produced in this process is used to combine carbon dioxide and water molecules to form glucose, which is the primary carbohydrate produced during photosynthesis. Glucose can then be used by the plant to produce energy or to be sto

In Google's original notebook, the model explained photosynthesis in simpler terms. Here, the response still uses complex terms like "chloroplasts" that a child cannot understand. Yet it is a better response than the one from the non-fine-tuned model.

Note that for demonstration purposes, this tutorial fine-tunes the model on a small subset of the dataset for just one epoch and with a low LoRA rank value. To get better responses from the fine-tuned model, you can experiment with:

1. Increasing the size of the fine-tuning dataset
2. Training for more steps (epochs)
3. Setting a higher LoRA rank
4. Modifying the hyperparameter values such as `learning_rate` and `weight_decay`.

# Interesting part starts here! - Tuning Hyperparameters

Below, I modify the hyperparameters and try to understand how the response changes depending on which hyperparameter was changed. The AdamW optimizer in this notebook uses two main hyperparameters: `weight_decay` and `learning_rate`. Both are similar to the hyperparameters of general machine learning models, so let's check whether we can understand them intuitively.

## Weight Decay

`weight_decay` as per the docs:
- weight_decay: Float. If set, weight decay is applied.

What `weight_decay` actually means:
- Weight decay is a regularization technique used when training deep neural networks. At every update it subtracts a small fraction of the current weights, which shrinks the weights over time. It is similar to L2 regularization, but L2 regularization adds a penalty term to the cost function, whereas weight decay modifies the weights directly.
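
The difference between the two formulations can be sketched on a single scalar weight. This is a simplified illustration on plain gradient descent, not the AdamW update itself; the function names are invented here, and scaling the decay term by the learning rate is an assumption of this sketch.

```python
def step_decoupled(w, grad, lr, weight_decay):
    """Gradient step, then shrink the weight directly (decoupled decay)."""
    return w - lr * grad - lr * weight_decay * w

def step_l2(w, grad, lr, l2):
    """Gradient step on loss + (l2 / 2) * w**2; the penalty enters via the gradient."""
    return w - lr * (grad + l2 * w)

# For plain gradient descent the two steps give (numerically) the same result.
print(step_decoupled(2.0, 0.5, 0.1, 0.01), step_l2(2.0, 0.5, 0.1, 0.01))

w_small, w_large = 1.0, 1.0
for _ in range(100):
    # A zero task gradient isolates the effect of the decay term.
    w_small = step_decoupled(w_small, grad=0.0, lr=0.1, weight_decay=0.01)
    w_large = step_decoupled(w_large, grad=0.0, lr=0.1, weight_decay=0.1)

# The larger decay value pulls the weight towards zero much faster.
print(w_small, w_large)
```

With an adaptive optimizer such as Adam the two formulations stop being equivalent, because an L2 penalty added to the gradient gets rescaled by the optimizer's per-parameter statistics while decoupled decay does not.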

## Learning Rate

`learning_rate` as per the docs:
- A float, a keras.optimizers.schedules.LearningRateSchedule instance, or a callable that takes no arguments and returns the actual value to use. The learning rate. Defaults to 0.001.

`learning_rate` as we know it:
- The size of the step the optimizer takes at each update. If it is too high, the model tries to learn quickly and can miss fine details; if it is too low, the model learns so slowly that the available time or number of epochs may not be enough to learn what we need.
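
A toy example makes both failure modes visible. The sketch below minimizes f(w) = w² with plain gradient descent; it is a one-dimensional stand-in chosen for illustration, so the specific learning rates have no direct relation to the values used for Gemma.

```python
def gradient_descent(lr, steps=20, start=1.0):
    """Minimize f(w) = w**2 from `start`; return |w| after `steps` updates."""
    w = start
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w = w - lr * grad   # one gradient descent step
    return abs(w)

# The minimum is at w = 0, so a smaller |w| means better progress.
for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: |w| after 20 steps = {gradient_descent(lr):.4g}")
```

The smallest rate barely moves away from the starting point in 20 steps, the middle one gets close to the minimum, and the largest one overshoots on every step so the value grows instead of shrinking.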

I changed `weight_decay` from ***0.01 to 0.1*** and increased `learning_rate` from ***5e-5 to 0.001 and 0.01***, then checked intuitively, without any validation technique, whether the responses improved.

## High weight decay with low learning rate

Increasing `weight_decay` shrinks the weights much faster than in the run above. This should reduce overfitting, but since the learning rate is still low, we expect the response not to change much.

In [None]:
# Limit the input sequence length to 256 (to control memory usage).
gemma_lm.preprocessor.sequence_length = 256
# Use AdamW (a common optimizer for transformer models).
optimizer = keras.optimizers.AdamW(
    learning_rate=5e-5,
    weight_decay=0.1,
)
# Exclude layernorm and bias terms from decay.
optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
gemma_lm.fit(data, epochs=1, batch_size=1)

1000/1000 ━━━━━━━━━━━━━━━━━━━━ 876s 843ms/step - loss: 0.7148 - sparse_categorical_accuracy: 0.5733


<keras.src.callbacks.history.History at 0x7dcc60156290>

In [None]:
gemma_lm.summary()

### Europe Trip Prompt

In [None]:
prompt = template.format(
    instruction="What should I do on a trip to Europe?",
    response="",
)
sampler = keras_nlp.samplers.TopKSampler(k=5, seed=2)
gemma_lm.compile(sampler=sampler)
print(gemma_lm.generate(prompt, max_length=256))

Instruction:
What should I do on a trip to Europe?

Response:
If you're visiting Europe, I recommend you start by visiting London in England and then move on to other cities in England. From England, you could then go to Paris in France or other cities in France. From Paris, you could go to Berlin or other cities in Germany. After that, you could go on a trip to Amsterdam in Netherlands. From Netherlands, you could go to Copenhagen in Denmark. After Copenhagen, you could go to other cities like Berlin in Germany or Paris in France.


As we can see, the response is now less formal and not very informative, since we introduced more bias, but the answer is still relevant.

### ELI5 Photosynthesis Prompt

In [None]:
prompt = template.format(
    instruction="Explain the process of photosynthesis in a way that a child could understand.",
    response="",
)
print(gemma_lm.generate(prompt, max_length=256))

Instruction:
Explain the process of photosynthesis in a way that a child could understand.

Response:
Photosynthesis is the process that plants use to convert light energy from the sun into chemical energy that they can store in sugars and other organic compounds. The chemical reaction that drives this process occurs in two stages. First, plants use carbon dioxide and water to create glucose and oxygen. This reaction is called photosynthesis. The glucose is then stored in the plant, and the oxygen is released into the air.


The model now explains photosynthesis in simpler terms, much better than the previous fine-tuned version and without complex words that a child could not understand.

## Low weight decay with high learning rate

In the code below, I kept the same lower `weight_decay` but increased `learning_rate` from ***5e-5 to 0.01*** (a very high value), expecting the model to fail to learn and the answers to fluctuate more.

In [None]:
# Limit the input sequence length to 256 (to control memory usage).
gemma_lm.preprocessor.sequence_length = 256
# Use AdamW (a common optimizer for transformer models).
optimizer = keras.optimizers.AdamW(
    learning_rate=0.01,
    weight_decay=0.01,
)
# Exclude layernorm and bias terms from decay.
optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
gemma_lm.fit(data, epochs=1, batch_size=1)

1000/1000 ━━━━━━━━━━━━━━━━━━━━ 923s 891ms/step - loss: 1.5728 - sparse_categorical_accuracy: 0.2961


<keras.src.callbacks.history.History at 0x7dc8bf5aec10>

In [None]:
gemma_lm.summary()

### Europe Trip Prompt

In [None]:
prompt = template.format(
    instruction="What should I do on a trip to Europe?",
    response="",
)
sampler = keras_nlp.samplers.TopKSampler(k=5, seed=2)
gemma_lm.compile(sampler=sampler)
print(gemma_lm.generate(prompt, max_length=256))

Instruction:
What should I do on a trip to Europe?

Response:
In a city with over a 400% of the country.
1. The United Kingdom
1. The most populous cities in the United States in the United States. This region is the second largest country in the world. The most famous city with 200500000 million residents in the city of 1980 million people.  
3) London
3. San Jose Mourinho
1.  1000, 18800000, 3, 20, 4.5. The 22025% of the country with 10,0040250, and 1900000000050001000004. 4, 101004, 1.5% of the 30000th century, the second most populous country in the 1975% of the population.
The United States are the most popular cities in the world's population of 625th largest city in 1


This response was quite funny and I laughed hard 😆. But it is also one of the best findings: just by changing the learning rate during fine-tuning, we can end up with a model that is worse than the base model. So fine-tuning does not always produce a better model unless it is done properly. The only improvement in this response is that there is just one instruction and one response, unlike the base model's output.


And why does it talk about the United States when asked about Europe, though?

### ELI5 Photosynthesis Prompt

In [None]:
prompt = template.format(
    instruction="Explain the process of photosynthesis in a way that a child could understand.",
    response="",
)
print(gemma_lm.generate(prompt, max_length=256))

Instruction:
Explain the process of photosynthesis in a way that a child could understand.

Response:
In order to the world, there are many ways of animals, and are found in two different kinds of the dog breeds in the United States. They have 2, 22% of the 12500% of the most common dog breeds, the dog breed was created in North America.


The LLM answers about animals and different kinds of dog breeds, which is totally irrelevant, and it mentions the United States again even though the question has nothing to do with countries. Perhaps the dataset is biased towards the United States 👀?

## High weight decay with high learning rate

The experiment above shows that increasing `learning_rate` this much makes things worse. Still, I wanted to check whether a higher weight decay might perform better, for example by introducing more bias to reduce the irrelevance (on the assumption that it might be due to overfitting).

In [None]:
# Limit the input sequence length to 256 (to control memory usage).
gemma_lm.preprocessor.sequence_length = 256
# Use AdamW (a common optimizer for transformer models).
optimizer = keras.optimizers.AdamW(
    learning_rate=0.01,
    weight_decay=0.1,
)
# Exclude layernorm and bias terms from decay.
optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
gemma_lm.fit(data, epochs=1, batch_size=1)

1000/1000 ━━━━━━━━━━━━━━━━━━━━ 877s 845ms/step - loss: 1.6935 - sparse_categorical_accuracy: 0.2681


<keras.src.callbacks.history.History at 0x7dc8c219c090>

In [None]:
gemma_lm.summary()

### Europe Trip Prompt

In [None]:
prompt = template.format(
    instruction="What should I do on a trip to Europe?",
    response="",
)
sampler = keras_nlp.samplers.TopKSampler(k=5, seed=2)
gemma_lm.compile(sampler=sampler)
print(gemma_lm.generate(prompt, max_length=256))

Instruction:
What should I do on a trip to Europe?

Response:
If you are you a city that is the best way to visit in San Antonio
4. 3)
4. 3)
40.
5. 1215000150000001, I would highly populated.
5. 3) 
1. The first 2001200, 11, and 3, 150, which I's. 12200120, 5, 2000000000, the 3rd floor, and the 200000-year period of the world, 19, 3, 20000000, 125000001100100 feet high.


I was proven wrong. Increasing `learning_rate` to an extreme value failed me! While a higher weight decay might help regularize the model, it is unlikely to fully compensate for the negative effects of a very high learning rate. The instability caused by the large learning rate still dominates.
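
The same pattern shows up in a toy setting. The sketch below runs gradient descent with a directly applied decay term on the one-dimensional loss f(w) = w² / 2; the learning rates and the `run` helper are chosen purely for illustration and say nothing about the exact thresholds for Gemma.

```python
def run(lr, weight_decay, steps=30, start=1.0):
    """Gradient descent on f(w) = w**2 / 2 with a directly applied decay term."""
    w = start
    for _ in range(steps):
        grad = w  # derivative of w**2 / 2
        w = w - lr * grad - lr * weight_decay * w
    return abs(w)

# The minimum is at w = 0; a growing |w| means training has diverged.
for lr in (0.5, 2.5):
    for wd in (0.01, 0.1):
        print(f"lr={lr}, weight_decay={wd}: |w| after 30 steps = {run(lr, wd):.3g}")
```

With the moderate learning rate both decay values converge; with the oversized one both diverge, so changing the decay does not rescue a step size that is simply too large.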

### ELI5 Photosynthesis Prompt

In [None]:
prompt = template.format(
    instruction="Explain the process of photosynthesis in a way that a child could understand.",
    response="",
)
print(gemma_lm.generate(prompt, max_length=256))

Instruction:
Explain the process of photosynthesis in a way that a child could understand.

Response:
1. The second is the process of a child's children are the largest of the world's, 1. The first step is to be a great way for a child.
4. The first person with a person's work with the most important people. The 4. The 1000 years, the most important thing that a woman's parents have a great father and his wife, while others. The 197000% of the 30, 12500% of 10% of his life. The most famous people in 19500 years. He is a very common problem with a child, a child who is the most famous American women's father, a 199950% of the most important people. The first was the most famous and most popular and 30, 5, and the world.

1.  A.  is the son of the 150's.


The same irrelevance shows up for this prompt as well. At least it talks about a child here! Still, America is present in the answer 😆

## High weight decay with default learning rate

I still think there is some optimal learning rate that we can combine with a higher weight decay to get answers in much simpler terms. Let's now try `learning_rate=0.001` and `weight_decay=0.1`.

In [None]:
# Limit the input sequence length to 256 (to control memory usage).
gemma_lm.preprocessor.sequence_length = 256
# Use AdamW (a common optimizer for transformer models).
optimizer = keras.optimizers.AdamW(
    learning_rate=0.001,
    weight_decay=0.1,
)
# Exclude layernorm and bias terms from decay.
optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
gemma_lm.fit(data, epochs=1, batch_size=1)

1000/1000 ━━━━━━━━━━━━━━━━━━━━ 896s 855ms/step - loss: 0.7528 - sparse_categorical_accuracy: 0.5593


<keras.src.callbacks.history.History at 0x7e7dafb3cb50>

In [None]:
gemma_lm.summary()

### Europe Trip Prompt

In [None]:
prompt = template.format(
    instruction="What should I do on a trip to Europe?",
    response="",
)
sampler = keras_nlp.samplers.TopKSampler(k=5, seed=2)
gemma_lm.compile(sampler=sampler)
print(gemma_lm.generate(prompt, max_length=256))

Instruction:
What should I do on a trip to Europe?

Response:
A trip to Europe could be a trip of a lifetime.  You can go for a week or a month.  You could go for a week or more in a country and then spend the rest of the trip visiting other countries.  You could also visit the country for a month before going to other countries.

There are many options for activities in Europe.  If you are looking to do a lot of hiking, then hiking is a perfect choice for a European trip.  If you prefer to stay indoors, then you can visit museums and art museums.  You can also visit historical sites, such as castles.  You could visit many of these sites in the same country as well.  If you want to visit a country for its culture, then you could visit the country's museums and art museums.  You could also visit a country's opera house and visit the country's most famous opera.


Voilà! The answer opens well and actually recommends things in a friendly tone. It is much better than the answer produced with the parameters from Google's original notebook!
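
The answers in this section are generated with `TopKSampler(k=5, seed=2)`, which limits each step to the five most likely tokens. The sketch below illustrates that idea in plain Python. It is not KerasNLP's implementation: `top_k_sample` and the toy logits are hypothetical, and the real sampler works on tensors inside the model's generation loop.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one token index from the k highest-scoring logits."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the surviving logits only (shifted for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    # Draw one index in proportion to those weights.
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(2)  # fixed seed so repeated runs give the same draws
toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0, 1.5, -2.0, 0.1]  # made-up scores for 8 tokens
samples = [top_k_sample(toy_logits, k=5, rng=rng) for _ in range(200)]
print(sorted(set(samples)))  # only indices among the five largest logits can appear
```

Tokens outside the top five are never drawn, however many samples are taken, while the choice among the top five stays random. This is why a fixed seed matters when comparing answers across differently tuned models.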

### ELI5 Photosynthesis Prompt

In [None]:
prompt = template.format(
    instruction="Explain the process of photosynthesis in a way that a child could understand.",
    response="",
)
print(gemma_lm.generate(prompt, max_length=256))

Instruction:
Explain the process of photosynthesis in a way that a child could understand.

Response:
The process of photosynthesis is when plants use the energy of the Sun to make food from water and carbon dioxide.  In order for this to happen, plants need light, water and carbon dioxide.  Plants use carbon dioxide to make sugar and oxygen as a by-product.  This sugar is used as a source of energy to grow, which in turn helps plants make more of themselves.


Another answer that a child could easily follow. Although it uses the term "carbon dioxide", I don't think photosynthesis can be explained much more simply than this.

# Conclusion

I had a lot of fun exploring the hyperparameters for fine-tuning Gemma and seeing how strongly they affect the response. The key takeaways are below.


1.   Fine-tuning is not always better than the original model. The response format may improve, but the quality of the actual response depends on how well we choose the hyperparameters (just like a traditional ML model!). This is because the model learns to mimic the style and patterns of the fine-tuning data even when the hyperparameters are poorly chosen.
2.   The learning_rate makes a clearly visible difference in the response, in a way that matches our intuition, which is great! A very high learning rate introduced more irrelevance and fluctuation than a lower one.
3.   Regularization with weight decay plays an important role, just as it does in a general ML model. Increasing weight decay introduces more bias and the model answers in simpler terms, but it has to be balanced against the learning rate. A higher weight decay combined with a higher learning rate does not work well at all.

I hope you had as much fun as I did running these. If I missed something, feel free to try it out and ping me on [LinkedIn](https://www.linkedin.com/in/haripriyar).



##### Copyright 2024 Google LLC.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://ai.google.dev/gemma/docs/lora_tuning"><img src="https://ai.google.dev/static/site-assets/images/docs/notebook-site-button.png" height="32" width="32" />View on ai.google.dev</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google/generative-ai-docs/blob/main/site/en/gemma/docs/lora_tuning.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/google/generative-ai-docs/main/site/en/gemma/docs/lora_tuning.ipynb"><img src="https://ai.google.dev/images/cloud-icon.svg" width="40" />Open in Vertex AI</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/google/generative-ai-docs/blob/main/site/en/gemma/docs/lora_tuning.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>