##### Copyright 2025 Google LLC.
##### Copyright 2025 ontaptom

**Attribution:** Link to the [original Google notebook](https://colab.research.google.com/github/google/generative-ai-docs/blob/main/site/en/gemma/docs/embeddinggemma/fine-tuning-embeddinggemma-with-sentence-transformers.ipynb). This notebook contains changes including training configuration, dataset, and evaluation approach.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

### Click this button to open file in Colab
<a target="_blank" href="https://colab.research.google.com/github/ontaptom/workshops/blob/main/notebooks/fine-tuning-embeddinggemma.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg">
</a>


# Fine-tune EmbeddingGemma

Fine-tuning helps close the gap between a model's general-purpose understanding and the specialized, high-performance accuracy that your application requires. Since no single model is perfect for every task, fine-tuning adapts it to your specific domain.

## The LLM Problem (No, Not That One)

Imagine your company's Project Management Office has been using "LLM" to mean **"Lesson Learned Meeting"** since 2005, long before Silicon Valley decided it stood for something else. Now every time someone searches the knowledge base for "LLM requirements" or "LLM agenda template," they get results about GPU clusters and transformer architectures instead of meeting room bookings and retrospective templates. Your PMO is not amused.



## Setup

Before starting this tutorial, complete the following steps:

* Get access to EmbeddingGemma by logging into [Hugging Face](https://huggingface.co/google/embeddinggemma-300M) and selecting **Acknowledge license** for a Gemma model.
* Generate a Hugging Face [Access Token](https://huggingface.co/docs/hub/en/security-tokens#how-to-manage-user-access-token) and use it to log in from Colab.

This notebook runs on either CPU or GPU. (**Note**: a GPU significantly shortens the training step. On CPU it takes 10+ minutes; on GPU it is much faster.)

### Install Python packages

Install the libraries required for running the EmbeddingGemma model and generating embeddings. Sentence Transformers is a Python framework for text and image embeddings. For more information, see the [Sentence Transformers](https://www.sbert.net/) documentation.

In [1]:
!pip install -U sentence-transformers git+https://github.com/huggingface/transformers@v4.56.0-Embedding-Gemma-preview

Collecting git+https://github.com/huggingface/transformers@v4.56.0-Embedding-Gemma-preview
  Cloning https://github.com/huggingface/transformers (to revision v4.56.0-Embedding-Gemma-preview) to /tmp/pip-req-build-p9i7xnow
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-p9i7xnow
  Running command git checkout -q 60b68e304cf4b6569b0660a13b558b929d4b0e77
  Resolved https://github.com/huggingface/transformers to commit 60b68e304cf4b6569b0660a13b558b929d4b0e77
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.57.0.dev0-py3-none-any.whl size=12604658 sha256=f730259759be5b025df88bbd8a6d2300810490d29bf45c1d0b1e3e924bc4d7be

After you have accepted the license, you need a valid Hugging Face token to access the model.
There are several ways to authenticate to Hugging Face. The easiest, in my opinion, is to add your token as a Colab secret named `HF_TOKEN` and grant this notebook access to it. Alternatively, you can add this cell:

```
# Log in to the Hugging Face Hub
from huggingface_hub import login
login()
```
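
If you go the Colab secret route, a small helper can look the token up and fall back to an environment variable when the notebook runs outside Colab. This is only a sketch: `resolve_hf_token` is a hypothetical helper name, and it assumes the secret is called `HF_TOKEN` and that notebook access to it is enabled.

```python
import os


def resolve_hf_token(secret_name="HF_TOKEN"):
    """Return the token from Colab secrets if available, else from the environment.

    Hypothetical helper for illustration; returns None when no token is found.
    """
    try:
        # google.colab only exists inside a Colab runtime.
        from google.colab import userdata
        token = userdata.get(secret_name)
        if token:
            return token
    except Exception:
        # Not running in Colab, or the secret is missing / access not granted.
        pass
    return os.environ.get(secret_name)


# Usage (uncomment in the notebook):
# from huggingface_hub import login
# login(token=resolve_hf_token())
```

Returning `None` instead of raising keeps the helper safe to call anywhere; the caller decides whether to fall back to the interactive `login()` prompt.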

### Load Model

Use the `sentence-transformers` library to create a model instance with EmbeddingGemma.

In [2]:
import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "google/embeddinggemma-300M"
model = SentenceTransformer(model_id).to(device=device)

print(f"Device: {model.device}")
print(model)
print("Total number of parameters in the model:", sum([p.numel() for _, p in model.named_parameters()]))

Device: cuda:0
SentenceTransformer(
  (0): Transformer({'max_seq_length': 2048, 'do_lower_case': False, 'architecture': 'Gemma3TextModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 768, 'out_features': 3072, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Dense({'in_features': 3072, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (4): Normalize()
)
Total number of parameters in the model: 307581696


## Prompts

EmbeddingGemma uses task-specific prefixes to optimize embeddings for different use cases. When you encode text, a prompt is prepended to tell the model what kind of task you're performing: semantic similarity, retrieval, classification, and so on.

For retrieval (like our RAG use case), we'll use two different prompts:
- `Retrieval-query` for user questions
- `Retrieval-document` for knowledge base content

For more information, see the [model card](http://ai.google.dev/gemma/docs/embeddinggemma/model_card#prompt_instructions) and [this documentation](https://ai.google.dev/gemma/docs/embeddinggemma/inference-embeddinggemma-with-sentence-transformers#using_prompts_with_embeddinggemma).


Let's see what prompts are available:

In [3]:
# List each prompt name together with its prefix
for key, value in model.prompts.items():
    print(f"{key}: {value}")


query: task: search result | query: 
document: title: none | text: 
BitextMining: task: search result | query: 
Clustering: task: clustering | query: 
Classification: task: classification | query: 
InstructionRetrieval: task: code retrieval | query: 
MultilabelClassification: task: classification | query: 
PairClassification: task: sentence similarity | query: 
Reranking: task: search result | query: 
Retrieval: task: search result | query: 
Retrieval-query: task: search result | query: 
Retrieval-document: title: none | text: 
STS: task: sentence similarity | query: 
Summarization: task: summarization | query: 
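
Conceptually, a prompt is just a string prefix placed in front of the text before it is tokenized. The sketch below mimics that with plain strings, using two prefixes copied from the output above. `apply_prompt` is a hypothetical helper, not part of Sentence Transformers; the library does this for you when you pass `prompt_name` to `encode`.

```python
# Two of the prefixes printed above, copied verbatim (note the trailing space).
PROMPTS = {
    "Retrieval-query": "task: search result | query: ",
    "Retrieval-document": "title: none | text: ",
}


def apply_prompt(text, prompt_name):
    """Prepend the task prefix to the raw text (illustration only)."""
    return PROMPTS[prompt_name] + text


print(apply_prompt("What are the requirements for LLMs?", "Retrieval-query"))
print(apply_prompt("Schedule the lesson learned meeting within 10 days of closure.",
                   "Retrieval-document"))
```

The same sentence therefore reaches the model as different input depending on whether it is treated as a query or as a document.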


## Test Data

Remember our use-case?
Imagine your company's Project Management Office has been using "LLM" to mean "Lesson Learned Meeting", not "Large Language Model"!

Time to set up a trap. We'll ask about "LLM requirements" and offer the model six documents: three about AI stuff (`#A`) and three about our beloved meetings (`#B`).

`query_llm` is our example query to the knowledge base, and `documents_llm` represents our knowledge base. In the real world it would be much larger, but for clarity let's stick with 6 documents: 3 pointing to the AI meaning of LLM, and 3 pointing to ours.

Let's see how the vanilla model handles this before we teach it some manners.

In [4]:
# @title Test Data
query_llm = "What are the requirements for LLMs?"

documents_llm = [
    # A - Related to Large Language Models (AI)
    "#A Large Language Models require massive GPU clusters for training.",
    "#A Next-token prediction is the core task of the large language models.",
    "#A Context length limits how much information the model can process.",

    # B - Related to Lesson Learned Meetings (Project Management)
    "#B Schedule the lesson learned meeting within 10 days of closure.",
    "#B Use the retrospective template from SharePoint for the meeting agenda.",
    "#B All core team members must attend the meeting to discuss what went well."
]

In [5]:
# @title Let's Define `get_scores` function

def get_scores(query, documents):
  # Use specific prompt for the Question
  query_embeddings = model.encode(query, prompt_name="Retrieval-query")

  # Use specific prompt for the Documents
  doc_embeddings = model.encode(documents, prompt_name="Retrieval-document")

  # Calculate the embedding similarities
  similarities = model.similarity(query_embeddings, doc_embeddings)

  # Zip documents with scores and sort by score (descending)
  results = list(zip(documents, similarities.numpy()[0]))
  results.sort(key=lambda x: x[1], reverse=True)

  for doc, score in results:
    print("Document: ", doc, "-> 🤖 Score: ", score)

In [6]:
# @title Time to check the results

get_scores(query_llm, documents_llm)

Document:  #A Large Language Models require massive GPU clusters for training. -> 🤖 Score:  0.35002673
Document:  #A Next-token prediction is the core task of the large language models. -> 🤖 Score:  0.1937868
Document:  #A Context length limits how much information the model can process. -> 🤖 Score:  0.16146486
Document:  #B All core team members must attend the meeting to discuss what went well. -> 🤖 Score:  0.1492557
Document:  #B Schedule the lesson learned meeting within 10 days of closure. -> 🤖 Score:  0.10341947
Document:  #B Use the retrospective template from SharePoint for the meeting agenda. -> 🤖 Score:  0.05001665


Do you see it? As expected, all 3 `#A` documents (AI) ranked above the `#B` documents (meetings). The vanilla model has no idea our PMO exists. Time to fix that.

By the way, feel free to remove the `#A` and `#B` tags from the knowledge base. The model behaves the same with and without them; I added them only to make the results easier to read.
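
The scores above are similarity values between the query embedding and each document embedding. Assuming the model's similarity function is cosine similarity (the Sentence Transformers default), and given that the pipeline ends with a `Normalize()` layer, each score is effectively a dot product of unit vectors and falls between -1 and 1. A minimal pure-Python sketch with made-up 3-dimensional vectors:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration (real embeddings here have 768 dimensions).
query = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]
orthogonal = [0.0, 1.0, 0.0]
opposite = [-1.0, 0.0, 0.0]

print(cosine_similarity(query, same_direction))  # 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
print(cosine_similarity(query, opposite))        # -1.0
```

A score near 1 means the model places the two texts in the same direction, a score near 0 means it sees them as unrelated, and a negative score means it pushes them apart.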

## Prepare the Fine-Tuning Dataset

This is the most important part. The dataset teaches the model what "similar" means in your context. Each example is a triplet:

- **Anchor**: The query (e.g., "What are LLM requirements?")
- **Positive**: A document that *should* match
- **Negative**: A document that *should not* match

I prepared 65 triplets (with a little help from Gemini).

Note that none of these examples overlap with our test data. We want the model to learn what LLM means in our domain, not memorize specific answers.
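
One way to sanity-check the no-overlap claim is to compare the training sentences against the test documents as sets. The sketch below uses a two-row excerpt of the triplets and two test documents with their `#A`/`#B` tags removed. `find_overlap` is a hypothetical helper, and exact string matching is an assumption that will not catch paraphrases.

```python
def find_overlap(triplets, test_documents):
    """Return training sentences that also appear verbatim in the test documents."""
    train_sentences = {sentence.strip().lower() for row in triplets for sentence in row}
    test_sentences = {doc.strip().lower() for doc in test_documents}
    return train_sentences & test_sentences


# Small excerpt of the training triplets: [anchor, positive, negative].
sample_triplets = [
    ["LLM duration",
     "Allocate at least 90 minutes for a full project review to ensure deep discussion.",
     "Inference time is measured in milliseconds per token generated."],
    ["Who sends the LLM invite?",
     "The Project Manager sends the calendar invite for the lesson learned meeting.",
     "The developer sends an API request to the hosted large language model."],
]

# Test documents with the "#A"/"#B" tags stripped.
sample_test_docs = [
    "Schedule the lesson learned meeting within 10 days of closure.",
    "Large Language Models require massive GPU clusters for training.",
]

print(find_overlap(sample_triplets, sample_test_docs))  # set()
```

An empty set means no sentence was reused word for word between the two collections.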

In [7]:
from datasets import Dataset

dataset = [
    ["Who facilitates the LLM?", "The Scrum Master or an external neutral party should guide the retrospective discussion.", "The transformer architecture relies on self-attention mechanisms to process text."],
    ["Where are LLM minutes stored?", "Archive the meeting summary and action items in the project closure folder on the drive.", "Embeddings are typically stored in a vector database for fast similarity retrieval."],
    ["LLM duration", "Allocate at least 90 minutes for a full project review to ensure deep discussion.", "Inference time is measured in milliseconds per token generated."],
    ["Who is responsible for scheduling the LLM?", "The Project Manager must ensure the lesson learned meeting is booked before the team disperses.", "Developers are responsible for choosing the right large language model for their specific application."],
    ["Do we need an external moderator for the LLM?", "For high-conflict projects, an external facilitator is recommended for the lesson learned meeting.", "You can use an external API to access a powerful large language model without hosting it yourself."],
    ["Where can I find the standard LLM template?", "You can download the official lesson learned meeting agenda from the PMO SharePoint site.", "You can find the model card and architecture details for the large language model on Hugging Face."],
    ["Is the LLM mandatory for small projects?", "Yes, a simplified 30-minute lesson learned meeting is required even for minor work packages.", "A smaller distilled large language model can run efficiently on consumer hardware."],
    ["How do we track actions from the LLM?", "Log all tasks identified during the lesson learned meeting into JIRA with the 'Retrospective' tag.", "We track the performance of the large language model using standard evaluation benchmarks."],
    ["LLM catering requirements", "Order coffee and snacks for the in-person lesson learnt meeting to keep energy high.", "The large language model consumes significant electricity during the training process."],
    ["Who signs the LLM attendance sheet?", "Pass around the digital sign-in form at the start of the lesson learnt meeting.", "The tokenizer maps the input text into IDs for the large language model."],
    ["LLM dispute resolution", "If team members disagree during the lesson learnt meeting, park the topic for later.", "Bias mitigation is a key area of research for any large language model."],
    ["Can we reschedule the LLM?", "You can move the lesson learnt meeting only if the Project Sponsor cannot attend.", "You can fine-tune the large language model on a new dataset to update its knowledge."],
    ["LLM historical data", "Review findings from the previous lesson learnt meeting before starting the new one.", "The large language model has a specific training cutoff date affecting its knowledge."],
    ["Can stakeholders attend the LLM?", "Key business stakeholders should be invited to the first 15 minutes of the lesson learned meeting.", "Business stakeholders are increasingly interested in how a large language model can drive automation."],
    ["Required inputs for LLM", "Bring the risk register and issue log to the session for review.", "The model requires a vast corpus of text data for the pre-training phase."],
    ["LLM output expectations", "We need a prioritized list of improvement actions for the next phase.", "The output is a probability distribution over the vocabulary to predict the next token."],
    ["LLM recurring schedule", "For multi-year initiatives, hold these sessions at the completion of each major milestone.", "Fine-tuning happens after the pre-training phase is complete to specialize the model."],
    ["LLM report distribution", "Email the findings summary to the PMO director and the steering committee.", "Deploy the model to a cloud endpoint to allow API access for developers."],
    ["LLM for failed projects", "Even cancelled initiatives require a formal closure review to understand why it stopped.", "Hallucinations occur when the model generates factually incorrect text with high confidence."],
    ["Why is the LLM mandatory?", "Every project must conclude with a formal lesson learned meeting to ensure continuous improvement.", "A large language model is mandatory for applications requiring complex text generation."],
    ["Who facilitates the LLM?", "An impartial Scrum Master usually runs the lesson learned meeting to keep the discussion objective.", "The large language model is facilitated by a powerful inference engine running on GPUs."],
    ["Where do we store LLM notes?", "Save the minutes from the lesson learned meeting in the project closure folder.", "The weights of the large language model are stored in a distributed file system."],
    ["Can stakeholders attend the LLM?", "Stakeholders are encouraged to join the opening of the lesson learned meeting to provide context.", "Stakeholders utilize the large language model to automate customer service interactions."],
    ["How long should the LLM take?", "Schedule 90 minutes for the lesson learned meeting to allow time for deep discussion.", "Training a large language model takes weeks or even months depending on the dataset size."],
    ["What is the output of the LLM?", "The primary output of the lesson learned meeting is a list of actionable improvements.", "The output of a large language model is a sequence of text based on the input prompt."],
    ["Do we record the LLM?", "We typically do not record the lesson learned meeting to encourage honest feedback.", "You can record the input and output logs of a large language model for debugging purposes."],
    ["Who sends the LLM invite?", "The Project Manager sends the calendar invite for the lesson learned meeting.", "The developer sends an API request to the hosted large language model."],
    ["Is the LLM confidential?", "Discussions inside the lesson learned meeting are confidential to build trust.", "Data sent to a public large language model may not be confidential depending on terms of service."],
    ["When should we hold the LLM?", "Host the lesson learned meeting within two weeks of the project go-live date.", "You should use a large language model when you need flexible natural language understanding."],
    ["What tools support the LLM?", "We use a digital whiteboard during the lesson learned meeting for collaborative brainstorming.", "Libraries like Hugging Face support the deployment of a large language model."],
    ["How do we track LLM actions?", "Enter all actions from the lesson learned meeting into the central risk register.", "We track the drift of the large language model by monitoring its responses over time."],
    ["Who approves the LLM report?", "The Program Director must sign off on the findings from the lesson learned meeting.", "The safety team approves the release of the large language model after red-teaming."],
    ["Can we skip the LLM?", "No, skipping the lesson learned meeting is a violation of the PMO governance policy.", "You can skip the fine-tuning step if the base large language model is sufficient."],
    ["What is the goal of the LLM?", "The goal of the lesson learned meeting is to identify root causes of project deviations.", "The goal of a large language model is to simulate human-like text generation."],
    ["How many people are in the LLM?", "Limit the lesson learned meeting to the core team to keep the conversation focused.", "The parameter count of a large language model often exceeds several billion."],
    ["Do we need an agenda for the LLM?", "Yes, distribute the lesson learned meeting agenda 24 hours in advance.", "Prompt engineering acts as the agenda that guides the large language model."],
    ["What if the LLM turns toxic?", "The facilitator must intervene if the lesson learned meeting becomes a blaming session.", "A large language model can produce toxic output if not properly aligned with safety guidelines."],
    ["Who presents at the LLM?", "The Project Lead presents the timeline analysis during the lesson learned meeting.", "The researcher presents the architecture of the new large language model at the conference."],
    ["Is the LLM applicable to Agile?", "Agile teams combine the sprint retrospective with the broader lesson learned meeting.", "A large language model can assist agile teams by generating user stories."],
    ["LLM preparation checklist", "Review the risk log before starting the lesson learned meeting.", "Prepare the dataset cleaning pipeline before training the large language model."],
    ["LLM standard template", "Download the official lesson learned meeting template from the intranet.", "Use a standard prompt template to get consistent results from the large language model."],
    ["LLM recurrence pattern", "For long programs, schedule a lesson learned meeting at the end of each phase.", "The large language model uses recurrent layers or attention to handle sequential data."],
    ["LLM budget allocation", "Charge the time spent in the lesson learned meeting to the admin code.", "The budget for training a large language model runs into millions of dollars."],
    ["LLM remote participation", "Use breakout rooms for the lesson learned meeting if the team is fully remote.", "Access the large language model remotely via a REST API."],
    ["LLM root cause analysis", "Use the '5 Whys' method during the lesson learned meeting to dig deeper.", "The large language model does not perform true root cause analysis, it only predicts text."],
    ["LLM success metrics", "A successful lesson learned meeting results in process changes.", "Perplexity and BLEU score are metrics used to evaluate a large language model."],
    ["LLM conflict resolution", "Facilitators must manage conflict during the lesson learned meeting.", "Reinforcement learning helps resolve conflicts in large language model outputs."],
    ["LLM visual aids", "Bring the project Gantt chart to the lesson learned meeting.", "Visual aids are generated by multimodal versions of a large language model."],
    ["LLM follow-up email", "Send the minutes of the lesson learned meeting to all attendees.", "The large language model generates an email draft based on bullet points."],
    ["Recording the LLM", "Do not record the audio to encourage honest and anonymous feedback from the team.", "Log the training loss and validation accuracy in TensorBoard to monitor convergence."],
    ["LLM action owner", "Assign a specific person to each improvement item identified during the session.", "The attention head focuses on different parts of the sequence to capture context."],
    ["Tools for LLM", "We use a digital whiteboard for grouping sticky notes during the brainstorming phase.", "We use PyTorch or TensorFlow frameworks for developing and training the network."],
    ["LLM voting process", "Team members dot-vote on the most critical issues to discuss in depth.", "Beam search explores multiple potential next tokens to find the most likely sequence."],
    ["What happens if we skip the LLM?", "Skipping the lesson learned meeting is a compliance violation and will flag the project as Red.", "If you skip the pre-training stage, the large language model will not understand human grammar."],
    ["How many people should be in the LLM?", "Keep the lesson learned meeting to under 12 participants to ensure everyone has a chance to speak.", "Scaling laws suggest that increasing data size improves the capabilities of a large language model."],
    ["When should I send the invite for the LLM?", "Send the invitation for the lesson learned meeting at least two weeks in advance to secure calendars.", "Latency depends on when you send the request to the large language model inference server."],
    ["Who approves the final LLM report?", "The Program Director must sign off on the minutes from the lesson learned meeting before archiving.", "The ethics committee approves the deployment of the large language model to ensure safety compliance."],
    ["LLM confidentiality", "What is said in the room stays in the room; only the final actions are published.", "Data privacy is critical when sending prompts to the public API endpoints."],
    ["LLM previous examples", "Check the PMO repository for past findings from similar infrastructure projects.", "Few-shot prompting provides examples to the model within the context window."],
    ["LLM agenda items", "Start with a safety moment, then move to the timeline review and budget analysis.", "Start with the system prompt to define behavior, then append the user query."],
    ["When to skip LLM", "Never. Every project regardless of size requires a formal review phase.", "Skip the decoder layers if you are using an encoder-only architecture like BERT."],
    ["LLM follow up", "Check the status of improvement actions in the next quarterly business review.", "Evaluate the model using standard benchmarks like MMLU or BIG-bench."],
    ["LLM invitation subject", "Please use the standard naming convention: [Project Code] - Retrospective.", "Prompt engineering involves crafting the input text to get the best output."],
    ["LLM identifying risks", "Discuss risks that materialized and how they were handled by the core team.", "The model calculates the probability of the next word based on the previous context."],
    ["LLM pre-read material", "Send the project performance report to attendees 24 hours before the meeting.", "Pre-load the model weights into GPU memory to reduce cold-start latency."],
]

# Convert the list-based dataset into a list of dictionaries.
data_as_dicts = [ {"anchor": row[0], "positive": row[1], "negative": row[2]} for row in dataset ]

# Create a Hugging Face `Dataset` object
train_dataset = Dataset.from_list(data_as_dicts)
print(train_dataset)

Dataset({
    features: ['anchor', 'positive', 'negative'],
    num_rows: 65
})


## Training

Using a framework like `sentence-transformers` in Python, the base model gradually learns the distinctions in your domain vocabulary. Here, that means learning that "LLM" refers to a lesson learned meeting rather than a large language model.
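
To make the training objective more concrete, here is a rough sketch of what `MultipleNegativesRankingLoss` computes for a single triplet, assuming cosine similarity and a scale factor of 20 (which I believe is the library default): a cross-entropy over the scaled similarities of the anchor to its positive and its negative. With a batch size of 1 there are no other in-batch candidates, so the explicit negative is the only competitor. The function name and the similarity values are invented for illustration.

```python
import math


def triplet_mnr_loss(sim_positive, sim_negative, scale=20.0):
    """Cross-entropy of picking the positive over the negative.

    sim_positive / sim_negative are cosine similarities of the anchor to each
    document; scale=20.0 is assumed to match the library default.
    """
    logit_pos = scale * sim_positive
    logit_neg = scale * sim_negative
    # Numerically stable log-sum-exp over the two candidates.
    max_logit = max(logit_pos, logit_neg)
    log_sum = max_logit + math.log(
        math.exp(logit_pos - max_logit) + math.exp(logit_neg - max_logit)
    )
    return log_sum - logit_pos


# Invented similarity values for illustration.
print(triplet_mnr_loss(0.2, 0.2))    # equal similarities: ln(2), about 0.693
print(triplet_mnr_loss(0.1, 0.3))    # negative scores higher: large loss
print(triplet_mnr_loss(0.9, -0.5))   # positive far ahead: loss close to 0
```

Under these assumptions, the loss shrinks toward zero as soon as the positive document is clearly more similar to the anchor than the negative one.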

In [8]:
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import MultipleNegativesRankingLoss
from transformers import TrainerCallback

# check out
# https://ai.google.dev/gemma/docs/embeddinggemma/inference-embeddinggemma-with-sentence-transformers#using_prompts_with_embeddinggemma
# for more details about prompts

prompt_map = {
    "anchor": model.prompts["Retrieval-query"],
    "positive": model.prompts["Retrieval-document"],
    "negative": model.prompts["Retrieval-document"]
}

loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="my-embedding-gemma",
    # Optional training parameters:
    prompts=prompt_map,    # use model's prompt to train
    num_train_epochs=2,
    per_device_train_batch_size=1,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    # Optional tracking/debugging parameters:
    logging_steps=train_dataset.num_rows,
    report_to="none",
)

class MyCallback(TrainerCallback):
    "A callback that runs the evaluation function each time the trainer logs"
    def __init__(self, evaluate):
        self.evaluate = evaluate # evaluate function

    def on_log(self, args, state, control, **kwargs):
        # Re-rank the test documents with the current model weights
        print(f"Step {state.global_step} finished. Running evaluation:")
        self.evaluate()

def evaluate():
  get_scores(query_llm, documents_llm)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
    callbacks=[MyCallback(evaluate)]
)
trainer.train()

| Step | Training Loss |
|------|---------------|
| 65   | 0.0261        |
| 130  | 0.0           |


Step 65 finished. Running evaluation:
Document:  #B Use the retrospective template from SharePoint for the meeting agenda. -> 🤖 Score:  0.9506588
Document:  #B Schedule the lesson learned meeting within 10 days of closure. -> 🤖 Score:  0.94483674
Document:  #B All core team members must attend the meeting to discuss what went well. -> 🤖 Score:  0.9439032
Document:  #A Next-token prediction is the core task of the large language models. -> 🤖 Score:  -0.48791033
Document:  #A Large Language Models require massive GPU clusters for training. -> 🤖 Score:  -0.5403265
Document:  #A Context length limits how much information the model can process. -> 🤖 Score:  -0.5536436
Step 130 finished. Running evaluation:
Document:  #B Use the retrospective template from SharePoint for the meeting agenda. -> 🤖 Score:  0.95086163
Document:  #B Schedule the lesson learned meeting within 10 days of closure. -> 🤖 Score:  0.9450696
Document:  #B All core team members must attend the meet

TrainOutput(global_step=130, training_loss=0.013064063512361912, metrics={'train_runtime': 97.712, 'train_samples_per_second': 1.33, 'train_steps_per_second': 1.33, 'total_flos': 0.0, 'train_loss': 0.013064063512361912, 'epoch': 2.0})
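
The step numbers in the log follow from the configuration: 65 examples with a batch size of 1 give 65 optimizer steps per epoch, and 2 epochs give 130 steps in total. Because `logging_steps` equals the dataset size, the loss is logged at steps 65 and 130. A short sketch of that arithmetic, assuming a single device and no gradient accumulation:

```python
import math

num_rows = 65            # size of train_dataset
batch_size = 1           # per_device_train_batch_size
num_epochs = 2           # num_train_epochs
logging_steps = num_rows

steps_per_epoch = math.ceil(num_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))

print(steps_per_epoch)  # 65
print(total_steps)      # 130
print(logged_at)        # [65, 130]
```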

## After Fine-Tuning

Now let's run the exact same query against the fine-tuned model. Remember: we didn't include any of these test documents in training. The model had to learn the *concept* that LLM means Lesson Learned Meeting, not just memorize answers.

If this worked, we should see all `#B` documents climb to the top with high confidence scores, while the `#A` documents drop significantly.
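
That expectation can be written as a small check. The sketch below relies on the `#A`/`#B` tags at the start of each document and uses the scores printed earlier in this notebook (vanilla run and step 65 of training), rounded to four decimals and with the document text shortened. `meetings_outrank_ai` is a hypothetical helper.

```python
def meetings_outrank_ai(scored_docs):
    """True if every '#B' document scores higher than every '#A' document."""
    b_scores = [score for doc, score in scored_docs if doc.startswith("#B")]
    a_scores = [score for doc, score in scored_docs if doc.startswith("#A")]
    return min(b_scores) > max(a_scores)


vanilla = [
    ("#A Large Language Models require ...", 0.3500),
    ("#A Next-token prediction ...", 0.1938),
    ("#A Context length limits ...", 0.1615),
    ("#B All core team members must attend ...", 0.1493),
    ("#B Schedule the lesson learned meeting ...", 0.1034),
    ("#B Use the retrospective template ...", 0.0500),
]
fine_tuned = [
    ("#B Use the retrospective template ...", 0.9507),
    ("#B Schedule the lesson learned meeting ...", 0.9448),
    ("#B All core team members must attend ...", 0.9439),
    ("#A Next-token prediction ...", -0.4879),
    ("#A Large Language Models require ...", -0.5403),
    ("#A Context length limits ...", -0.5536),
]

print(meetings_outrank_ai(vanilla))     # False
print(meetings_outrank_ai(fine_tuned))  # True
```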


In [10]:
# @title Let's load the original model again, and create `compare_models` function

# 1. Load the Base Model (Vanilla) into a new variable
base_model = SentenceTransformer("google/embeddinggemma-300M").to(device)

# 2. Fine-Tuned Model is already in the 'model' variable
# (The trainer updated it in-place)
fine_tuned_model = model

def compare_models(query, documents):
    print(f"🔎 QUERY: {query}\n")
    print("-" * 80)

    # --- Helper to run inference ---
    def run_inference(model_obj, model_name):
        # Use the correct RAG prompts for both!
        q_emb = model_obj.encode(query, prompt_name="Retrieval-query")
        d_emb = model_obj.encode(documents, prompt_name="Retrieval-document")
        scores = model_obj.similarity(q_emb, d_emb)[0]

        # Sort results
        results = sorted(zip(scores.tolist(), documents), key=lambda x: x[0], reverse=True)

        print(f"🤖 MODEL: {model_name}")
        for rank, (score, doc) in enumerate(results, start=1):
            # Truncate doc for cleaner display
            doc_preview = (doc[:60] + '...') if len(doc) > 60 else doc
            print(f"   #{rank} | Score: {score:.4f} | {doc_preview}")
        print("-" * 80)

    # Run comparison
    run_inference(base_model, "Vanilla (Base)")
    run_inference(fine_tuned_model, "Fine-Tuned (Ours)")


In [11]:
# @title First test - original test set

# Run the test
compare_models(query_llm, documents_llm)

🔎 QUERY: What are the requirements for LLMs?

--------------------------------------------------------------------------------
🤖 MODEL: Vanilla (Base)
   #1 | Score: 0.3500 | #A Large Language Models require massive GPU clusters for tr...
   #2 | Score: 0.1938 | #A Next-token prediction is the core task of the large langu...
   #3 | Score: 0.1615 | #A Context length limits how much information the model can ...
   #4 | Score: 0.1493 | #B All core team members must attend the meeting to discuss ...
   #5 | Score: 0.1034 | #B Schedule the lesson learned meeting within 10 days of clo...
   #6 | Score: 0.0500 | #B Use the retrospective template from SharePoint for the me...
--------------------------------------------------------------------------------
🤖 MODEL: Fine-Tuned (Ours)
   #1 | Score: 0.9509 | #B Use the retrospective template from SharePoint for the me...
   #2 | Score: 0.9451 | #B Schedule the lesson learned meeting within 10 days of clo...
   #3 | Score: 0.9441 | #B A

In [12]:
# @title Completely new test data

compare_models("what is llm?", [
    "it is type of a meeting",
    "it is type of model",
    "it is related to project management",
    "it is related to artificial inteligence"
])

🔎 QUERY: what is llm?

--------------------------------------------------------------------------------
🤖 MODEL: Vanilla (Base)
   #1 | Score: 0.2529 | it is related to artificial inteligence
   #2 | Score: 0.2482 | it is type of model
   #3 | Score: 0.2246 | it is related to project management
   #4 | Score: 0.1922 | it is type of a meeting
--------------------------------------------------------------------------------
🤖 MODEL: Fine-Tuned (Ours)
   #1 | Score: 0.9145 | it is type of a meeting
   #2 | Score: 0.9031 | it is related to project management
   #3 | Score: -0.1416 | it is related to artificial inteligence
   #4 | Score: -0.2677 | it is type of model
--------------------------------------------------------------------------------
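
The negative scores in the fine-tuned output are what you would expect if `similarity` computes cosine similarity, which is the Sentence Transformers default and is assumed here. Cosine similarity ranges from -1 (opposite direction) through 0 (unrelated) to 1 (same direction). The sketch below uses toy 2-D vectors rather than real embeddings, and the helper name `cosine_similarity` is illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as equal-length lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (assumption: real embeddings have hundreds of dimensions).
query = [1.0, 0.0]
sim_same = cosine_similarity(query, [2.0, 0.0])       # same direction, different length
sim_orthogonal = cosine_similarity(query, [0.0, 3.0])  # unrelated direction
sim_opposite = cosine_similarity(query, [-1.0, 0.0])   # opposite direction

print(sim_same, sim_orthogonal, sim_opposite)
```

Read this way, a score such as -0.2677 suggests the fine-tuned model places that document in a direction pointing away from the query, not merely far from it.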


In [13]:
# @title Let's test it on another language (Polish)

compare_models("co to jest llm?", [                        # "what is llm?"
    "spotkanie, gdzie omawiamy błędy",                     # "a meeting where we discuss mistakes"
    "to duży model językowy",                              # "it's a large language model"
    "spotkanie, które odbywa się na zakończeniu projektu", # "a meeting held at the end of a project"
    "to na przykład chatgpt"                               # "it's for example ChatGPT"
])


🔎 QUERY: co to jest llm?

--------------------------------------------------------------------------------
🤖 MODEL: Vanilla (Base)
   #1 | Score: 0.3844 | to duży model językowy
   #2 | Score: 0.2564 | to na przykład chatgpt
   #3 | Score: 0.2235 | spotkanie, które odbywa się na zakończeniu projektu
   #4 | Score: 0.1609 | spotkanie, gdzie omawiamy błędy
--------------------------------------------------------------------------------
🤖 MODEL: Fine-Tuned (Ours)
   #1 | Score: 0.8601 | spotkanie, które odbywa się na zakończeniu projektu
   #2 | Score: 0.7962 | spotkanie, gdzie omawiamy błędy
   #3 | Score: -0.0345 | to na przykład chatgpt
   #4 | Score: -0.4008 | to duży model językowy
--------------------------------------------------------------------------------


### Cross-lingual Transfer: A Bonus

Here's something we didn't explicitly train for. The model learned from English examples that "LLM" means "Lesson Learned Meeting", yet the association carried over to Polish queries.

Why? EmbeddingGemma is built on Gemma, which is multilingual. The embedding space is shared across languages, so "meeting" in English and "spotkanie" in Polish already sit close together. The token "LLM" is also identical in both languages, so it acts as an anchor point.

We fine-tuned in one language and got improvements in another as well. That's impressive, don't you think?
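
To put a number on the transfer, one option is a separation margin: the lowest score among the meeting-sense documents minus the highest score among the AI-sense documents. The sketch below applies that idea to the scores printed in the English and Polish tests above. The `separation_margin` helper and the grouping of documents into "meeting" and "AI" senses are illustrative assumptions, not part of the notebook's evaluation.

```python
def separation_margin(meeting_scores, ai_scores):
    """Worst meeting-sense score minus best AI-sense score.

    A positive value means every meeting-sense document outranks
    every AI-sense document for the query.
    """
    return min(meeting_scores) - max(ai_scores)

# Scores copied from the printed outputs of the two tests above.
runs = {
    ("English", "base"): ([0.2246, 0.1922], [0.2529, 0.2482]),
    ("English", "fine-tuned"): ([0.9145, 0.9031], [-0.1416, -0.2677]),
    ("Polish", "base"): ([0.2235, 0.1609], [0.3844, 0.2564]),
    ("Polish", "fine-tuned"): ([0.8601, 0.7962], [-0.0345, -0.4008]),
}

margins = {key: separation_margin(*scores) for key, scores in runs.items()}

for (language, model_name), margin in margins.items():
    print(f"{language:8s} {model_name:11s} margin: {margin:+.4f}")
# English  base        margin: -0.0607
# English  fine-tuned  margin: +1.0447
# Polish   base        margin: -0.2235
# Polish   fine-tuned  margin: +0.8307
```

Both margins flip from negative to positive after fine-tuning. On these two small tests the Polish margin is somewhat smaller than the English one, which is consistent with an association that transferred rather than one that was trained directly.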

In [14]:
# @title Sanity Check: Does the model still understand non-LLM queries?

compare_models("What is a CPU?", [
    "It is the main chip in a computer responsible for interpreting and executing commands.",
    "The Central Processing Unit acts as the brain of the hardware system.",
    "It is a specialized project management meeting for cost planning.", # PM Trap (Testing overfitting)
    "It is a large language model trained to generate text."             # AI Trap
])

🔎 QUERY: What is a CPU?

--------------------------------------------------------------------------------
🤖 MODEL: Vanilla (Base)
   #1 | Score: 0.5264 | The Central Processing Unit acts as the brain of the hardwar...
   #2 | Score: 0.4350 | It is the main chip in a computer responsible for interpreti...
   #3 | Score: 0.1758 | It is a large language model trained to generate text.
   #4 | Score: 0.1154 | It is a specialized project management meeting for cost plan...
--------------------------------------------------------------------------------
🤖 MODEL: Fine-Tuned (Ours)
   #1 | Score: 0.4790 | It is the main chip in a computer responsible for interpreti...
   #2 | Score: 0.4031 | The Central Processing Unit acts as the brain of the hardwar...
   #3 | Score: 0.2981 | It is a specialized project management meeting for cost plan...
   #4 | Score: -0.1602 | It is a large language model trained to generate text.
--------------------------------------------------------------------------------

Good news: the CPU answers are still on top. The model didn't forget how computers work.

But look at #3 and #4: there is a large gap between two answers that are both completely irrelevant to CPUs. The fine-tuned model scores the "project management meeting" trap at 0.2981 and the "large language model" trap at -0.1602, whereas the base model scored them at 0.1154 and 0.1758. Why?

Our training dataset had a pattern: every positive example was PM content and every negative was AI content. The model didn't just learn "LLM means Lesson Learned Meeting"; it also partially learned a broader bias: "PM content = relevant, AI content = irrelevant."

For our specific use case (a PMO knowledge base with no AI content), this is fine. But if your knowledge base covers both domains, you'd need a more balanced dataset: negatives that aren't always AI-related and positives that aren't always PM-related. Fine-tuning updates shared weights, so every bias you bake in affects the whole embedding space.
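
One way to catch this kind of skew before training is to tag each example with a domain and check how concentrated the positives and negatives are. The sketch below assumes the training data can be viewed as (anchor, positive domain, negative domain) triplets. The domain tags, the example anchors, and the `dominant_share` helper are all hypothetical and are not taken from the notebook's dataset.

```python
from collections import Counter

def dominant_share(labels):
    """Return the most common label and the fraction of items that carry it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Hypothetical triplets: (anchor, positive_domain, negative_domain).
skewed = [
    ("LLM agenda template", "pm", "ai"),
    ("LLM requirements", "pm", "ai"),
    ("LLM attendees", "pm", "ai"),
    ("LLM schedule", "pm", "ai"),
]
balanced = [
    ("LLM agenda template", "pm", "ai"),
    ("LLM requirements", "pm", "hardware"),
    ("GPU cluster sizing", "ai", "pm"),
    ("CPU cache sizes", "hardware", "pm"),
]

for name, triplets in [("skewed", skewed), ("balanced", balanced)]:
    positive_share = dominant_share([t[1] for t in triplets])
    negative_share = dominant_share([t[2] for t in triplets])
    print(f"{name:9s} positives: {positive_share}  negatives: {negative_share}")
```

A dominant share of 1.0 on both sides, as in the `skewed` list, is the pattern described above: the model can lower its loss by learning "PM good, AI bad" without attending to the query at all. What share counts as acceptable is a judgment call that depends on your knowledge base.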

In [None]:
# @title After fine-tuning, you can publish the model on Hugging Face
# Push to Hub
model.push_to_hub("lessonlearned-embeddinggemma")

## Summary and next steps

You have now learned how to adapt an EmbeddingGemma model for a specific domain by fine-tuning it with the Sentence Transformers library.

Explore what more you can do with EmbeddingGemma:
* [Fine-tune EmbeddingGemma docs](https://ai.google.dev/gemma/docs/embeddinggemma/fine-tuning-embeddinggemma-with-sentence-transformers)
* [Training Overview](https://sbert.net/docs/sentence_transformer/training_overview.html) in Sentence Transformers Documentation
* [Generate embeddings with Sentence Transformers](https://ai.google.dev/gemma/docs/embeddinggemma/inference-embeddinggemma-with-sentence-transformers)
* [Simple RAG example](https://github.com/google-gemini/gemma-cookbook/blob/main/Gemma/%5BGemma_3%5DRAG_with_EmbeddingGemma.ipynb) in the Gemma Cookbook
