# SentenceTransformer with Advanced Embedding Models - Complete Guide

## Advanced Sentence Embeddings with SentenceTransformer
This notebook demonstrates how to use the SentenceTransformer library with modern embedding models, including setup for models such as EmbeddingGemma.

#### 1. Installation and Setup

In [1]:
# Install required packages
!pip install -U sentence-transformers torch transformers datasets faiss-cpu matplotlib seaborn scikit-learn pandas numpy PyMuPDF



In [2]:
import os
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from typing import List, Dict, Union, Optional
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

# SentenceTransformer imports
from sentence_transformers import SentenceTransformer, models, util
from sentence_transformers import CrossEncoder, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from torch.utils.data import DataLoader

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

PyTorch version: 2.8.0+cu126
CUDA available: True
CUDA device: Tesla T4


In [3]:
!huggingface-cli login



    A token is already saved on your machine. Run `hf auth whoami` to get more information or `hf auth logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: fineGrained).
The tok

#### 2. Choosing and Loading the Embedding Model

In [4]:
from sentence_transformers import SentenceTransformer

# Load EmbeddingGemma, a roughly 300M-parameter embedding model, from the Hugging Face Hub.
# It is reported to perform strongly for its size on the MTEB leaderboard.
model_name = 'google/embeddinggemma-300m'
model = SentenceTransformer(model_name)

print(f"Model '{model_name}' loaded successfully.")
print(f"Max sequence length: {model.max_seq_length}")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")

Model 'google/embeddinggemma-300m' loaded successfully.
Max sequence length: 2048
Embedding dimension: 768
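
The later steps call `model.encode` for both the chunks and the query. Retrieval-oriented models such as EmbeddingGemma are typically used with different prompts for queries and documents, and recent sentence-transformers releases (v5 and later, to our knowledge) expose `encode_query` and `encode_document` for this purpose. The following is a minimal sketch rather than part of the original notebook: the helper names `embed_documents` and `embed_query` and the two stand-in classes are hypothetical, and the helpers simply fall back to `encode` when the prompt-aware methods are missing.

```python
from typing import Any, Sequence


def embed_documents(model: Any, texts: Sequence[str], **kwargs):
    """Encode passages, preferring model.encode_document when it exists."""
    encoder = getattr(model, "encode_document", None) or model.encode
    return encoder(list(texts), **kwargs)


def embed_query(model: Any, query: str, **kwargs):
    """Encode a search query, preferring model.encode_query when it exists."""
    encoder = getattr(model, "encode_query", None) or model.encode
    return encoder(query, **kwargs)


# Stand-in classes (assumptions for demonstration only) so the sketch runs
# without downloading a model. They just report which method was called.
class LegacyModel:
    def encode(self, inputs, **kwargs):
        return ("encode", inputs)


class PromptAwareModel(LegacyModel):
    def encode_query(self, inputs, **kwargs):
        return ("encode_query", inputs)

    def encode_document(self, inputs, **kwargs):
        return ("encode_document", inputs)


print(embed_query(LegacyModel(), "what is jaundice?")[0])       # encode
print(embed_query(PromptAwareModel(), "what is jaundice?")[0])  # encode_query
print(embed_documents(PromptAwareModel(), ["chunk one"])[0])    # encode_document
```

With the real model, the same helpers would be called as `embed_documents(model, text_chunks)` and `embed_query(model, user_query)`. Whether this changes the ranking for a given document is something to verify empirically.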


#### 3. Processing the PDF Document

In [5]:
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path: str) -> str:
    """Extracts all text from a given PDF file."""
    # The context manager closes the document even if a page fails to parse.
    with fitz.open(pdf_path) as doc:
        # Collect the text of each page and join once at the end.
        return "".join(page.get_text() for page in doc)

# Extract text from the sample paper
pdf_path = "/content/2025 pa.pdf"
paper_text = extract_text_from_pdf(pdf_path)

print("--- Extracted Text ---")
print(paper_text[:500] + "...") # Print the first 500 characters

--- Extracted Text ---
Integrating vision transformer-
based deep learning model with 
kernel extreme learning machine 
for non-invasive diagnosis of 
neonatal jaundice using biomedical 
images
M. Eliazer1, Sibi Amaran1, K. Sreekumar1, A. Vikram2, Gyanendra Prasad Joshi3 & 
Woong Cho3
Birth complications, particularly jaundice, are one of the leading causes of adolescent death and 
disease all over the globe. The main severity of these illnesses may diminish if scholars study more 
about their sources and progress t...
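
The extracted text keeps the hard line breaks of the PDF layout (for example `transformer-` followed by `based` on the next line), and those breaks end up inside the chunks that get embedded. A minimal cleanup sketch is shown below, assuming it is acceptable to keep every end-of-line hyphen; `normalize_pdf_text` is a hypothetical helper and not part of PyMuPDF. Keeping the hyphen is right for compounds such as "transformer-based" but leaves a stray hyphen in words that were split only for layout.

```python
import re


def normalize_pdf_text(raw: str) -> str:
    """Turn layout line breaks into flowing text, keeping paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", raw)  # blank lines separate paragraphs
    cleaned = []
    for para in paragraphs:
        # Hyphen at end of line: join the two halves and keep the hyphen.
        para = re.sub(r"-[ \t]*\n[ \t]*", "-", para)
        # Every other line break becomes a single space.
        para = re.sub(r"\s*\n\s*", " ", para)
        # Collapse leftover runs of spaces or tabs.
        para = re.sub(r"[ \t]{2,}", " ", para).strip()
        if para:
            cleaned.append(para)
    return "\n\n".join(cleaned)


sample = (
    "Integrating vision transformer-\n"
    "based deep learning model with \n"
    "kernel extreme learning machine"
)
print(normalize_pdf_text(sample))
# Integrating vision transformer-based deep learning model with kernel extreme learning machine
```

Applied to the notebook, this would be `paper_text = normalize_pdf_text(paper_text)` before chunking. Note that this changes the character count and therefore the number of chunks.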


#### 4. Text Chunking

In [6]:
def chunk_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 128) -> list[str]:
    """
    Splits a text into overlapping chunks.

    Args:
        text: The input text.
        chunk_size: The desired size of each chunk in characters.
        chunk_overlap: The number of characters to overlap between consecutive chunks.

    Returns:
        A list of text chunks.
    """
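    if not 0 <= chunk_overlap < chunk_size:
        # Otherwise the window never advances and the loop below does not terminate.
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if len(text) <= chunk_size:
        return [text]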

    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - chunk_overlap
    return chunks

# Chunk the extracted paper text
text_chunks = chunk_text(paper_text, chunk_size=800, chunk_overlap=100)

print(f"The text was split into {len(text_chunks)} chunks.")
print("\n--- First Chunk ---")
print(text_chunks[0])

The text was split into 84 chunks.

--- First Chunk ---
Integrating vision transformer-
based deep learning model with 
kernel extreme learning machine 
for non-invasive diagnosis of 
neonatal jaundice using biomedical 
images
M. Eliazer1, Sibi Amaran1, K. Sreekumar1, A. Vikram2, Gyanendra Prasad Joshi3 & 
Woong Cho3
Birth complications, particularly jaundice, are one of the leading causes of adolescent death and 
disease all over the globe. The main severity of these illnesses may diminish if scholars study more 
about their sources and progress toward effective treatment. Assured developments were prepared, 
but they are inadequate. Newborns repeatedly have jaundice as their primary medical concern. A 
raised level of bilirubin is a symbol of jaundice. Generally, in newborns, hyperbilirubinemia peaks in 
the initial post-delivery week. The 
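
With `chunk_size=800` and `chunk_overlap=100`, each chunk starts 700 characters after the previous one, so a text of N characters (N > 800) yields ceil(N / 700) chunks. The 84 chunks reported above therefore imply that the extracted text is between 58,101 and 58,800 characters long. The sketch below checks this arithmetic against a re-implementation of the same sliding window; `sliding_chunks` and `expected_chunk_count` are illustrative names, not part of the notebook.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Same sliding-window logic as chunk_text, repeated here to be self-contained."""
    if len(text) <= chunk_size:
        return [text]
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - chunk_overlap
    return chunks


def expected_chunk_count(n_chars: int, chunk_size: int, chunk_overlap: int) -> int:
    """Closed-form chunk count: 1 for short texts, else ceil(n_chars / stride)."""
    if n_chars <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap
    return -(-n_chars // stride)  # integer ceiling division


for n in (500, 800, 801, 58_101, 58_800):
    actual = len(sliding_chunks("x" * n, 800, 100))
    assert actual == expected_chunk_count(n, 800, 100)
    print(n, actual)
# 500 1
# 800 1
# 801 2
# 58101 84
# 58800 84
```

The 801-character case shows a side effect of this design: the final chunk can consist almost entirely of text already covered by the previous chunk's overlap.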


#### 5. Generating the Embeddings


In [7]:
import numpy as np

# Generate embeddings for each chunk
# The model will automatically handle batching.
# We can show a progress bar for large numbers of chunks.
embeddings = model.encode(text_chunks, show_progress_bar=True)

# The result is a NumPy array
print(f"\nShape of the embeddings array: {embeddings.shape}")
print(f"This means we have {embeddings.shape[0]} vectors, each with {embeddings.shape[1]} dimensions.")

# You can save these embeddings for later use
np.save("paper_embeddings.npy", embeddings)


Shape of the embeddings array: (84, 768)
This means we have 84 vectors, each with 768 dimensions.
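
`np.save` stores only the vectors, so the chunk texts have to be persisted separately and kept in the same order for row `i` of the array to map back to chunk `i`. A minimal sketch, assuming a JSON sidecar file: the file name `paper_chunks.json` and the helpers `save_chunks` and `load_chunks` are hypothetical, not library functions.

```python
import json
import tempfile
from pathlib import Path


def save_chunks(chunks, path) -> None:
    """Write the chunk texts as a JSON list, preserving their order."""
    Path(path).write_text(json.dumps(chunks, ensure_ascii=False), encoding="utf-8")


def load_chunks(path, expected_rows: int):
    """Read the chunk texts back and check they match the embedding row count."""
    chunks = json.loads(Path(path).read_text(encoding="utf-8"))
    if len(chunks) != expected_rows:
        raise ValueError(
            f"{len(chunks)} chunks on disk, but {expected_rows} embedding rows"
        )
    return chunks


# Demonstration with toy data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    chunk_file = Path(tmp) / "paper_chunks.json"
    demo_chunks = ["first chunk", "second chunk"]
    save_chunks(demo_chunks, chunk_file)
    restored = load_chunks(chunk_file, expected_rows=2)

print(restored == demo_chunks)  # True
```

In the notebook this would pair with the saved array: call `save_chunks(text_chunks, "paper_chunks.json")` next to `np.save`, then later pass `np.load("paper_embeddings.npy").shape[0]` as `expected_rows`.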


#### 6. Example Usage: Semantic Search


In [8]:
from sentence_transformers.util import cos_sim

def search(query: str, chunks: list[str], embeddings: np.ndarray, top_k: int = 2):
    """
    Finds the most relevant chunks for a given query.
    """
    # 1. Embed the query
    query_embedding = model.encode(query)

    # 2. Calculate cosine similarity between the query and all chunks
    similarities = cos_sim(query_embedding, embeddings)[0]

    # 3. Find the top_k most similar chunks
    # cos_sim returns a torch tensor, so torch.topk applies directly and
    # returns the scores already sorted from most to least similar.
    k = min(top_k, len(chunks))
    top_scores, top_indices = torch.topk(similarities, k=k)

    # 4. Return the results
    results = []
    for score, idx in zip(top_scores.tolist(), top_indices.tolist()):
        results.append({
            "chunk": chunks[idx],
            "similarity": score
        })
    return results

# --- Let's test it! ---
user_query = "What is Equation for MSA(z)?"

search_results = search(user_query, text_chunks, embeddings)

print(f"\nQuery: '{user_query}'\n")
print("--- Top Search Results ---")
for i, result in enumerate(search_results):
    print(f"Result {i+1} (Similarity: {result['similarity']:.4f}):")
    print(result['chunk'])
    print("-" * 25)


Query: 'What is Equation for MSA(z)?'

--- Top Search Results ---
Result 1 (Similarity: 0.5014):
s any non-commercial use, sharing, distribution and reproduction in 
any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide 
a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have 
permission under this licence to share adapted material derived from this article or parts of it. The images or 
other third party material in this article are included in the article’s Creative Commons licence, unless indicated 
otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence 
and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to 
obtain permission directly from the copyright holder. 
-------------------------
Result 2 (Similarity: 0.4923):
a greater proportion of actual positives (re
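
Step 2 of `search` relies on cosine similarity, which compares the direction of two vectors and ignores their length: cos(a, b) = a·b / (|a| |b|). The following is a small pure-Python sketch with toy 2-D vectors. The function names and data are illustrative; the notebook itself uses `sentence_transformers.util.cos_sim`.

```python
import math


def cosine_similarity(a, b) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_top_k(query_vec, doc_vecs, k: int = 2):
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


query = [1.0, 0.0]
docs = [
    [2.0, 0.0],   # same direction as the query, different length
    [3.0, 4.0],   # partly aligned
    [0.0, 5.0],   # orthogonal
    [-1.0, 0.0],  # opposite direction
]
for index, score in rank_top_k(query, docs, k=2):
    print(index, round(score, 4))
# 0 1.0
# 1 0.6
```

The first toy vector scores 1.0 despite being twice as long as the query, which is why cosine similarity is a common choice for comparing embeddings.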

---

## 🚀 Thank You & Next Steps

Thank you for working through this notebook! You have now seen how to use a modern embedding model for a practical task: extracting text from a PDF, splitting it into chunks, and running a semantic search over those chunks.

We hope this demonstration has been a valuable and practical learning experience.

### Explore More with Us

If you found this guide helpful, we invite you to explore more of our work. Our team is passionate about pushing the boundaries of AI and sharing our findings with the community.

- **Visit Our Website:** Discover our latest projects, research, and the services we offer.
  [**Momen Walied Website**](https://momenwalied.camitai.com)

- **Read Our Blog:** For more deep dives, technical tutorials, and insights into the world of NLP and Large Language Models, be sure to check out our blog.
  [**Read Our Latest Blog Posts**](https://momenwalied.camitai.com/blog/PRDs)

We are constantly updating our content with new findings. Stay connected with us to keep learning!

Happy coding!