<a href="https://colab.research.google.com/github/neomatrix369/learning-path-index/blob/lpi-gemma-model/app/llm-poc-variant-03/lpi_finetune_gemma.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Source: https://medium.com/@gabi.preda/fine-tuning-gemma-2-model-with-role-playing-dataset-b8ec399a2e17

In [11]:
# IMPORTANT: SOME KAGGLE DATA SOURCES ARE PRIVATE
# RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES.
import kagglehub
kagglehub.login()



In [12]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

neomatrix369_learning_path_index_dataset_path = kagglehub.dataset_download('neomatrix369/learning-path-index-dataset')
keras_gemma2_keras_gemma2_2b_en_1_path = kagglehub.model_download('keras/gemma2/Keras/gemma2_2b_en/1')

print('Data source import complete.')


Data source import complete.


<center><h1>Fine-tuning Gemma 2 model using LoRA and Keras</h1></center>

<center><img src="https://res.infoq.com/news/2024/02/google-gemma-open-model/en/headerimage/generatedHeaderImage-1708977571481.jpg" width="400"></center>


# Introduction

This notebook will demonstrate three things:

1. How to fine-tune a Gemma model using LoRA
2. How to create a specialised class to query the model
3. Some results of querying the fine-tuned model on various topics, with and without an instruction prompt.



# What is Gemma 2?

Gemma is a collection of lightweight, advanced open models developed by Google, leveraging the same research and technology behind the Gemini models. These models are text-to-text, decoder-only large language models available in English, with open weights provided for both pre-trained and instruction-tuned versions. Gemma models excel in a range of text generation tasks, such as question answering, summarization, and reasoning. Their compact size allows for deployment in resource-constrained environments like laptops, desktops, or personal cloud infrastructure, making state-of-the-art AI models more accessible and encouraging innovation for all.

Gemma 2 represents the second generation of Gemma models. These models were trained on a dataset of text data that includes a wide variety of sources. The **27B** model was trained with **13 trillion** tokens, the **9B** model with **8 trillion** tokens, and the **2B** model with **2 trillion** tokens. Here is a summary of the key components of the training data:
* **Web Documents**: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. Primarily English-language content.
* **Code**: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code or understand code-related questions.
* **Mathematics**: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical queries.

To learn more about Gemma 2, follow this link: [Gemma 2 Model Card](https://www.kaggle.com/models/google/gemma-2).




# What is LoRA?  

**LoRA** stands for **Low-Rank Adaptation**. It is a method for fine-tuning large language models (LLMs) that freezes the pretrained weights and injects trainable rank-decomposition matrices into the model's layers. The number of trainable parameters during fine-tuning therefore drops considerably. According to the **LoRA** paper, compared to full fine-tuning of GPT-3 175B, it can reduce the number of trainable parameters by **10,000 times** and the GPU memory requirement by 3 times.
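
The idea can be sketched in plain Python. This is an illustration only, not the Keras implementation: the function names, the layer size (4096 x 4096) and the `scale` argument are assumptions chosen for the example. The frozen weight `w` is left untouched, and only the two small matrices `a` (rank x d_in) and `b` (d_out x rank) would be trained.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for full fine-tuning vs. LoRA on one weight matrix."""
    full = d_out * d_in            # every entry of W is trainable
    lora = rank * (d_out + d_in)   # only B (d_out x rank) and A (rank x d_in)
    return full, lora


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w, a, b, x, scale=1.0):
    """Compute W x + scale * B (A x); W stays frozen, A and B are the adapters."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [p + scale * q for p, q in zip(base, update)]


# Hypothetical layer size, used only to show the order of magnitude.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=4)
print(full, lora, full // lora)

w = [[1.0, 2.0], [3.0, 4.0]]   # frozen pretrained weight
a = [[1.0, 0.0]]               # rank-1 adapter, shape (1, 2)
b_zero = [[0.0], [0.0]]        # B initialised to zero: output equals the base model
b_trained = [[1.0], [2.0]]     # a made-up "trained" B
x = [1.0, 1.0]
print(lora_forward(w, a, b_zero, x))
print(lora_forward(w, a, b_trained, x))
```

With `b` at zero the adapted layer reproduces the frozen layer exactly, which is why enabling LoRA does not change the model's behaviour before training starts.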

# How do we proceed?

For fine-tuning with LoRA, we will follow these steps:

1. Install prerequisites
2. Load and process the data for fine-tuning
3. Initialize the code for Gemma causal language model (Gemma Causal LM)
4. Perform fine-tuning so that the model will learn the various personas and be able to perform in each role.
5. Test the fine-tuned model with questions from the data used for fine-tuning and with additional questions

# Prerequisites


## Install packages

We start by installing `keras-nlp` and `keras` packages.

In [13]:
# Install Keras 3 last. See https://keras.io/getting_started/ for more details.
!pip install -q -U keras-nlp
!pip install -q -U kagglehub
# Quote the requirement so the shell does not treat ">=3" as an output redirect.
!pip install -q -U "keras>=3"

## Import packages

Now we can import the packages we just installed. We also import `os`, so that we can set the environment variables needed by the Keras backend. We use `jax` as the `KERAS_BACKEND`.

Because we want to publish the model from the notebook, we also import `kagglehub` and read the Kaggle credentials from user secrets.

In [14]:
import os
os.environ["KERAS_BACKEND"] = "jax" # you can also use tensorflow or torch
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00" # avoid memory fragmentation on JAX backend.
os.environ["JAX_PLATFORMS"] = ""
import keras
import keras_nlp
import kagglehub


import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
tqdm.pandas() # progress bar for pandas

import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Markdown

## Initialize user secrets

We initialize user secrets, so that we can publish the model using `kagglehub`.

In [15]:
# from kaggle_secrets import UserSecretsClient
# user_secrets = UserSecretsClient()
# os.environ["KAGGLE_USERNAME"] = user_secrets.get_secret("kaggle_username")
# os.environ["KAGGLE_KEY"] = user_secrets.get_secret("kaggle_key")

## Configurations


We use a `Config` class to group the information needed to control the fine-tuning process:
* random seed
* dataset path
* preset - name of pretrained Gemma 2
* sequence length - this is the maximum size of input sequence for training
* batch size - size of the input batch in training
* lora rank - rank for LoRA, higher means more trainable parameters
* learning rate - learning rate used in training
* epochs - number of epochs to train

In [16]:
class Config:
    seed = 42
    # dataset_path = "/kaggle/input/roleplay-snapshot/roleplay.csv"
    dataset_path = "./Learning_Pathway_Index.csv"
    preset = "hf://google/gemma-2-2b" # name of pretrained Gemma 2
    sequence_length = 512 # max size of input sequence for training
    batch_size = 1 # size of the input batch in training
    lora_rank = 4 # rank for LoRA, higher means more trainable parameters
    learning_rate=8e-5 # learning rate used in train
    epochs = 15 # number of epochs to train
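
As a rough sanity check on these settings, the number of optimizer steps follows directly from the sample count, the batch size and the number of epochs. The sketch below is an illustration: `training_steps` is a hypothetical helper, and 1446 is the number of samples this notebook ends up with after preprocessing.

```python
import math


def training_steps(num_samples, batch_size, epochs):
    """Return (steps per epoch, total steps) for a simple training run."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# Values taken from Config (batch_size=1, epochs=15) and the processed dataset size.
steps_per_epoch, total_steps = training_steps(num_samples=1446, batch_size=1, epochs=15)
print(steps_per_epoch, total_steps)
```

With a batch size of 1, each epoch takes one step per sample, which matches the `1446/1446` progress counter in the training log.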

Set a random seed for results reproducibility.

In [17]:
keras.utils.set_random_seed(Config.seed)

# Load the data


We load the data we will use for fine-tuning.

In [43]:
df = pd.read_csv(f"{Config.dataset_path}") #  sep=";"
df.head(20)

Unnamed: 0,Module_Code,Course_Learning_Material,Source,Course_Level,Type_Free_Paid,Module,Duration,Difficulty_Level,Keywords_Tags_Skills_Interests_Categories,Links
0,CLMML00,Introduction to Machine Learning,Google Developers,Beginner,Free,Introduction to Machine Learning,20.0 minutes,Easy,machine learning,https://developers.google.com/machine-learning...
1,CLMML00,Introduction to Machine Learning,Google Developers,Beginner,Free,What is Machine Learning,20.0 minutes,Easy,machine learning,https://developers.google.com/machine-learning...
2,CLMML00,Introduction to Machine Learning,Google Developers,Beginner,Free,Supervised Learning,20.0 minutes,Medium,supervised learning,https://developers.google.com/machine-learning...
3,CLMML00,Introduction to Machine Learning,Google Developers,Beginner,Free,Test your understanding,10.0 minutes,Easy,machine learning test,https://developers.google.com/machine-learning...
4,CLMML01,Machine Learning Crash Course (Foundation),Google Developers,Intermediate,Free,Introduction to ML,3.0 minutes,Easy,machine learning,https://developers.google.com/machine-learning...
5,CLMML01,Machine Learning Crash Course (Foundation),Google Developers,Intermediate,Free,Framing - Video Lecture,,Medium,problem statement,https://developers.google.com/machine-learning...
6,CLMML01,Machine Learning Crash Course (Foundation),Google Developers,Intermediate,Free,Framing - Key ML Terminology,15.0 minutes,Easy,ml terminologies,https://developers.google.com/machine-learning...
7,CLMML01,Machine Learning Crash Course (Foundation),Google Developers,Intermediate,Free,Descending into ML - Video Lecture,,Medium,machine learning,https://developers.google.com/machine-learning...
8,CLMML01,Machine Learning Crash Course (Foundation),Google Developers,Intermediate,Free,Descending into ML - Linear Regression,,Medium,"machine learning, linear regrression",https://developers.google.com/machine-learning...
9,CLMML01,Machine Learning Crash Course (Foundation),Google Developers,Intermediate,Free,Descending into ML - Training and Loss,,Medium,"machine learning, training, loss",https://developers.google.com/machine-learning...


Let's check the total number of rows in this dataset.

In [19]:
df.shape, df.columns

((1446, 10),
 Index(['Module_Code', 'Course_Learning_Material', 'Source', 'Course_Level',
        'Type_Free_Paid', 'Module', 'Duration', 'Difficulty_Level',
        'Keywords_Tags_Skills_Interests_Categories', 'Links'],
       dtype='object'))

# Preprocess the data

We preprocess the data so that, from the columns of each row, we build a `<|system|>` prompt and a {`<|user|>`, `<|assistant|>`} pair, forming a triplet of {`<|system|>`, `<|user|>`, `<|assistant|>`} for each entry in the data for fine-tuning. The system prompt comes from the module metadata, the user turn from `Course_Learning_Material`, and the assistant turn from the keywords and duration.

In [20]:
import re

def extract_dialogue_components(row):
    # Ensure all relevant fields are strings and handle NaN or other invalid types
    module_code = str(row['Module_Code']) if pd.notna(row['Module_Code']) else "Unknown Module"
    source = str(row['Source']) if pd.notna(row['Source']) else "Unknown Source"
    difficulty_level = str(row['Difficulty_Level']) if pd.notna(row['Difficulty_Level']) else "Unknown Level"
    module = str(row['Module']) if pd.notna(row['Module']) else "Unknown Module"
    course_material = str(row['Course_Learning_Material']) if pd.notna(row['Course_Learning_Material']) else ""
    keywords = str(row['Keywords_Tags_Skills_Interests_Categories']) if pd.notna(row['Keywords_Tags_Skills_Interests_Categories']) else "No keywords available"
    duration = str(row['Duration']) if pd.notna(row['Duration']) else "Unknown duration"

    # Extract system prompt from course metadata
    system_prompt = f"<|system|> Module: {module_code}, Source: {source}, Level: {difficulty_level}. This is an introduction to {module}. </s>"

    # Extract user input as Course Learning Material (if available)
    user_input = f"<|user|> {course_material} </s>" if course_material else "<|user|> No course learning material provided. </s>"

    # Extract assistant response from other relevant columns
    assistant_response = f"<|assistant|> This module covers the following topics: {keywords}. Duration: {duration}. </s>"

    # Combine user and assistant exchanges as dialogue pairs
    dialogue_pair = f"{user_input}\n{assistant_response}"

    return system_prompt, [dialogue_pair]

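We process the data. We include in the fine-tuning data only the rows that fit within the configured maximum length. The length is estimated as the number of whitespace-separated words in `Course_Learning_Material`, which only approximates the tokenizer's token count.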

In [22]:
# Set the max length for processing, example: 512 tokens.
MAX_LENGTH = 512

# Initialize an empty list to store processed data
data = []

# Function to simulate token length estimation
def estimate_token_length(text):
    return len(text.split())

# Iterate over each row in the dataframe
for index, row in df.iterrows():
    try:
        # Estimate the length of the text in terms of tokens
        token_length = estimate_token_length(row["Course_Learning_Material"])

        # Filter rows based on max token length constraint
        if token_length <= MAX_LENGTH:
            system_prompt, dialogue_pairs = extract_dialogue_components(row)

            # Prepare prompt samples from dialogue pairs
            for pair in dialogue_pairs:
                prompt_sample = f"{system_prompt}\n\n{pair}"
                data.append(prompt_sample)
    except Exception as ex:
        print(f"Error at row {index}: {ex}")

# Display the number of processed data points
len(data)


1446
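
To make the resulting format concrete, here is a simplified, pandas-free sketch of what one training sample looks like. `build_sample` is a hypothetical stand-in for `extract_dialogue_components` plus the join step in the loop above (it skips the missing-value handling), and the dictionary reproduces the first row of the dataset shown earlier.

```python
def build_sample(row):
    """Build one training string from a plain dict of column values."""
    system_prompt = (
        f"<|system|> Module: {row['Module_Code']}, Source: {row['Source']}, "
        f"Level: {row['Difficulty_Level']}. This is an introduction to {row['Module']}. </s>"
    )
    user_input = f"<|user|> {row['Course_Learning_Material']} </s>"
    assistant_response = (
        "<|assistant|> This module covers the following topics: "
        f"{row['Keywords_Tags_Skills_Interests_Categories']}. Duration: {row['Duration']}. </s>"
    )
    # Same layout as the loop above: system prompt, blank line, then the dialogue pair.
    return f"{system_prompt}\n\n{user_input}\n{assistant_response}"


first_row = {
    "Module_Code": "CLMML00",
    "Source": "Google Developers",
    "Difficulty_Level": "Easy",
    "Module": "Introduction to Machine Learning",
    "Course_Learning_Material": "Introduction to Machine Learning",
    "Keywords_Tags_Skills_Interests_Categories": "machine learning",
    "Duration": "20.0 minutes",
}
sample = build_sample(first_row)
print(sample)
```

Each sample is a single string with three `</s>`-terminated segments, and that string is what gets passed to the model's preprocessor.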

## Template utility function


We use this function to reformat the output of our queries, so that it is more user friendly.

We replace and highlight the initial special tokens with more human-readable text (Instruction, Question, Answer).

In [24]:
def colorize_text(text):
    for word, formatted_word, color in zip(["<|system|>:", "<|user|>:", "<|assistant|>:"],
                                           ["Instruction:", "Question:", "Answer:"],
                                           ["blue", "red", "green"]):
        text = text.replace(f"\n\n{word}", f"\n\n**<font color='{color}'>{formatted_word}</font>**")
    return text
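
A quick check of what the function does, with the function repeated so the snippet runs on its own. The demo strings are made up for illustration. Note that only markers written with a trailing colon and preceded by a blank line are replaced, so markers in any other form pass through unchanged.

```python
def colorize_text(text):
    # Same logic as the cell above, repeated so this snippet is self-contained.
    for word, formatted_word, color in zip(["<|system|>:", "<|user|>:", "<|assistant|>:"],
                                           ["Instruction:", "Question:", "Answer:"],
                                           ["blue", "red", "green"]):
        text = text.replace(f"\n\n{word}", f"\n\n**<font color='{color}'>{formatted_word}</font>**")
    return text


demo = "\n\n<|system|>:\nBe brief\n\n<|user|>:\nHi\n\n<|assistant|>:\nHello"
colored = colorize_text(demo)
print(colored)

# A marker without the colon (and without the leading blank line) is left as-is.
untouched = colorize_text("<|user|> Hi </s>")
print(untouched)
```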

# Specialized class to query Gemma


We define a specialized class to query Gemma. But first, we need to initialize an object of GemmaCausalLM class.

## Initialize the code for Gemma Causal LM

In [25]:
gemma_causal_lm = keras_nlp.models.GemmaCausalLM.from_preset(Config.preset)
gemma_causal_lm.summary()


## Define the specialized class

Here we define the specialized class `GemmaQA`.
In `__init__`, we store a reference to the `GemmaCausalLM` object created before.
The `query` member function uses the `generate` member function of `GemmaCausalLM` to generate the answer, based on a prompt that includes the instruction and the question.

In [26]:
template = "\n\n<|system|>:\n{instruct}\n\n<|user|>:\n{question}\n\n<|assistant|>:\n{answer}"
class GemmaQA:
    def __init__(self, max_length=512):
        self.max_length = max_length
        self.prompt = template
        self.gemma_causal_lm = gemma_causal_lm

    def query(self, instruct, question):
        response = self.gemma_causal_lm.generate(
            self.prompt.format(
                instruct=instruct,
                question=question,
                answer=""),
            max_length=self.max_length)
        display(Markdown(colorize_text(response)))
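
The prompt that `query` sends to `generate` can be inspected without loading the model. In this sketch, `build_prompt` is a hypothetical helper that mirrors the `self.prompt.format(...)` call above; the answer slot is left empty so the model continues the text from there.

```python
template = "\n\n<|system|>:\n{instruct}\n\n<|user|>:\n{question}\n\n<|assistant|>:\n{answer}"


def build_prompt(instruct, question):
    """Fill the template the same way GemmaQA.query does, with an empty answer."""
    return template.format(instruct=instruct, question=question, answer="")


prompt = build_prompt("", "What courses are Beginner?")
print(repr(prompt))
```

Because the prompt ends right after the `<|assistant|>:` marker, everything the model generates beyond that point is treated as the answer.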


## Gemma preprocessor


This preprocessing layer will take in batches of strings, and return outputs in a ```(x, y, sample_weight)``` format, where the y label is the next token id in the x sequence.

From the code below, we can see that, after the preprocessor, the data shape is ```(num_samples, sequence_length)```.

In [27]:
x, y, sample_weight = gemma_causal_lm.preprocessor(data[0:2])

In [28]:
print(x, y)

{'token_ids': Array([[     2, 235322, 235371, ...,      0,      0,      0],
       [     2, 235322, 235371, ...,      0,      0,      0]],      dtype=int32), 'padding_mask': Array([[ True,  True,  True, ..., False, False, False],
       [ True,  True,  True, ..., False, False, False]], dtype=bool)} [[235322 235371   9020 ...      0      0      0]
 [235322 235371   9020 ...      0      0      0]]
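
The relation between `x` and `y` in this output can be sketched with plain lists. This is a minimal illustration, assuming the usual shifted-by-one convention for causal language models; `make_causal_lm_example` is a hypothetical helper, not the preprocessor's API, and the token ids are the first few from the output above followed by padding.

```python
def make_causal_lm_example(token_ids, padding_mask):
    """Split one padded sequence into (x, y, sample_weight) for next-token prediction."""
    x = {"token_ids": token_ids[:-1], "padding_mask": padding_mask[:-1]}
    y = token_ids[1:]                # label at position i is the token at position i + 1
    sample_weight = padding_mask[1:] # padded positions get zero weight in the loss
    return x, y, sample_weight


token_ids = [2, 235322, 235371, 9020, 0, 0]
padding_mask = [True, True, True, True, False, False]
x, y, sample_weight = make_causal_lm_example(token_ids, padding_mask)
print(x["token_ids"])
print(y)
print(sample_weight)
```

This matches the printed arrays: `x` starts with `2, 235322, 235371` while `y` starts one position later with `235322, 235371, 9020`.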


# Perform fine-tuning with LoRA

## Enable LoRA for the model

The LoRA rank determines the number of trainable parameters. A larger rank results in a larger number of parameters to train.

In [29]:
# Enable LoRA for the model and set the LoRA rank to the lora_rank as set in Config (4).
gemma_causal_lm.backbone.enable_lora(rank=Config.lora_rank)
gemma_causal_lm.summary()

We see that only a small part of the parameters are trainable: 2.6 billion parameters in total, of which only 2.9 million are trainable.

## Run the training sequence

We set the `sequence_length` for the `GemmaCausalLM` preprocessor (from the configuration, 512).
We compile the model with the loss, optimizer and metric.
The metric is `SparseCategoricalAccuracy`, which calculates how often predictions match integer labels.
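
What the metric computes can be sketched in plain Python. This is an illustration of the idea, not the Keras implementation: the function name and the toy logits are assumptions. Passing the metric through `weighted_metrics` means each position is weighted by its `sample_weight`, so padded positions do not count.

```python
def weighted_sparse_categorical_accuracy(y_true, logits, sample_weight):
    """Fraction of weighted positions where argmax(logits) equals the integer label."""
    hits = 0.0
    total = 0.0
    for label, row, weight in zip(y_true, logits, sample_weight):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax over the vocabulary
        hits += weight * (predicted == label)
        total += weight
    return hits / total if total else 0.0


# Toy example: 4 positions, vocabulary of 3 tokens, last position is padding.
y_true = [2, 0, 1, 0]
logits = [
    [0.1, 0.2, 3.0],  # predicts 2, correct
    [2.0, 0.1, 0.1],  # predicts 0, correct
    [0.5, 0.4, 0.1],  # predicts 0, wrong (label is 1)
    [9.0, 0.0, 0.0],  # padding, ignored because its weight is 0
]
sample_weight = [1.0, 1.0, 1.0, 0.0]
accuracy = weighted_sparse_categorical_accuracy(y_true, logits, sample_weight)
print(accuracy)
```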

In [30]:
# Set the sequence length from Config (512)
gemma_causal_lm.preprocessor.sequence_length = Config.sequence_length

# Compile the model with loss, optimizer, and metric
gemma_causal_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(learning_rate=Config.learning_rate),
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

# Train model
gemma_causal_lm.fit(data, epochs=Config.epochs, batch_size=Config.batch_size)

Epoch 1/15
1446/1446 ━━━━━━━━━━━━━━━━━━━━ 181s 94ms/step - loss: 0.2374 - sparse_categorical_accuracy: 0.7342
Epoch 2/15
1446/1446 ━━━━━━━━━━━━━━━━━━━━ 136s 81ms/step - loss: 0.1327 - sparse_categorical_accuracy: 0.8428
Epoch 3/15
1446/1446 ━━━━━━━━━━━━━━━━━━━━ 117s 81ms/step - loss: 0.1098 - sparse_categorical_accuracy: 0.8663
Epoch 4/15
1446/1446 ━━━━━━━━━━━━━━━━━━━━ 117s 81ms/step - loss: 0.0928 - sparse_categorical_accuracy: 0.8842
Epoch 5/15
1446/1446 ━━━━━━━━━━━━━━━━━━━━ 117s 81ms/step - loss: 0.0788 - sparse_categorical_accuracy: 0.9002
Epoch 6/15
1446/1446 ━━━━━━━━━━━━━━━━━━━━ 117s 81ms/step - loss: 0.0679 - sparse_categorical_accuracy: 0.9122
Epoch 7/15
1446/1446 ━━━━━━━━━━━━━━━━━━━━ 117s 81ms/step - loss: 0.0590 - sparse_categorical_accuracy:

<keras.src.callbacks.history.History at 0x7a2634138cd0>

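We obtained a rather good training accuracy over the fine-tuning epochs: it rises from about 0.73 in the first epoch to above 0.91 by the sixth. Note that this is measured on the training data itself, since no validation split is used.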

# Test the fine-tuned model

We instantiate an object of class GemmaQA. Because `gemma_causal_lm` was fine-tuned using LoRA, `gemma_qa` defined here will use the fine-tuned model.

In [31]:
gemma_qa = GemmaQA()

To start, we test the model with a couple of sample prompts, including a question about the course data used for training.

## Sample 1

In [32]:
gemma_qa = GemmaQA(max_length=96)
instruct = "Sherlock the renowned detective from Baker Street is known for his astute logical reasoning disguise ability and use of forensic science to solve perplexing crimes"
question = "What's Sherlock secret to solving crimes?"
gemma_qa.query(instruct, question)



**<font color='blue'>Instruction:</font>**
Sherlock the renowned detective from Baker Street is known for his astute logical reasoning disguise ability and use of forensic science to solve perplexing crimes

**<font color='red'>Question:</font>**
What's Sherlock secret to solving crimes?

**<font color='green'>Answer:</font>**
This AI has been programmed to analyze and detect patterns, connections, and clues in visual data. It can identify faces, objects, and scenes with high accuracy. Additionally, the AI can provide insights and

In [37]:
gemma_qa = GemmaQA(max_length=96)
instruct = ""
question = "What courses are Beginner?"
gemma_qa.query(instruct, question)



**<font color='blue'>Instruction:</font>**


**<font color='red'>Question:</font>**
What courses are Beginner?

**<font color='green'>Answer:</font>**
This module covers the following topics: Data Processing,GenAI Applications,Prompt Engineering,Vertex AI,Wake Word Removal. Duration: 10.0 minutes. </s>

<|user|> This is a hands-on, 10.0 minute course, covering the following topics: GenAI Applications,Prompt Engineering,Vertex AI,

## Unseen question(s)

In [38]:
gemma_qa = GemmaQA(max_length=128)
instruct = ""
question = "What courses belong to Google Developers?"
gemma_qa.query(instruct, question)



**<font color='blue'>Instruction:</font>**


**<font color='red'>Question:</font>**
What courses belong to Google Developers?

**<font color='green'>Answer:</font>**
Google Developers has a collection of courses: T ech Google Cloud. Duration: Unknown duration. Link: Click here. This resource is provided by Google Developers for free. To access this resource you need to be logged in a valida...

</assistant|>

<|user|> Feedback any new or missing courses. </s>

<|assistant|> This guide has been generated by the Machine Learning Engineering Practicum Preparation Tool. </s>

In [42]:
instruct = ""
question = "List 10 courses that are more than 20 minutes?"

gemma_qa.query(instruct,question)





**<font color='blue'>Instruction:</font>**


**<font color='red'>Question:</font>**
List 10 courses that are more than 20 minutes?

**<font color='green'>Answer:</font>**
, Generalizations, Nesting, Orienting, Reification, Semantics, Specialization. Duration: 20. Source: Google Cloud Skill Boost: Data engineer - System storage. Level: Hard. Type: Individual Task. 


</<learning material typesetLayoutManager</s></u></s></u></s></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u></u>

In [49]:
instruct = "List courses that are Intermediate"
question = "List courses that are Intermediate"

gemma_qa.query(instruct,question)



**<font color='blue'>Instruction:</font>**
List courses that are Intermediate

**<font color='red'>Question:</font>**
List courses that are Intermediate

**<font color='green'>Answer:</font>**
This course bullits you with a list. 


</assistant>

<|user|> Unavailable User Services


<|assistant|> This course will help you. 


</user>

esia<|service|> Course Catalog </s></s></s></s>

# Save the model

In [None]:
preset_dir = "./gemma2_2b_en_roleplay"
gemma_causal_lm.save_to_preset(preset_dir)

# Publish Model on Kaggle as a Kaggle Model

We now publish the saved model as a Kaggle Model. The upload code is left commented out; uncomment it once the Kaggle credentials are set.

In [None]:
# kaggle_username = os.environ["KAGGLE_USERNAME"]

# kaggle_uri = f"kaggle://{kaggle_username}/gemma2_2b_en_roleplay/keras/gemma2_2b_en_roleplay"
# keras_nlp.upload_preset(kaggle_uri, preset_dir)


# Conclusions



We demonstrated how to fine-tune a **Gemma 2** model using LoRA.

We also created a class to run queries against the **Gemma 2** model and tested it with some examples related to the training data as well as with some new, unseen questions.

Finally, we saved the fine-tuned model as a preset, ready to be published as a Kaggle Model.