<a href="https://colab.research.google.com/github/melrahmtz/purple-box/blob/main/hands-on-practice/3103_embedding_to_retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Embedding and Retrieval Update 4**
Auto Language Detection

This update adds automatic language detection to the pipeline:

1. Detect the language of each chunk (text or table).

2. Use the detected language to:
  * Select the correct stopwords when processing text.

  * Pass the language to Supabase for full-text search (FTS).

3. Keep embeddings language-agnostic, since the Alibaba GTE model supports multilingual text.

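As a rough sketch of step 2, a detected language code can be mapped both to an NLTK stopword list name and to a PostgreSQL text-search configuration name. The helper `language_settings` and the fallback choices are assumptions made for illustration; the notebook itself only maps codes to NLTK names.

```python
# Hypothetical helper: maps a langdetect-style code to NLTK and PostgreSQL names.
LANGUAGE_NAMES = {"en": "english", "it": "italian", "fr": "french", "es": "spanish"}

def language_settings(lang_code):
    """Return (nltk_stopword_list, postgres_fts_config) for a language code."""
    base = lang_code.lower().split("-")[0]  # e.g. "zh-cn" -> "zh"
    name = LANGUAGE_NAMES.get(base)
    if name is None:
        # Assumed fallbacks: English stopwords and PostgreSQL's built-in "simple" config
        return "english", "simple"
    return name, name

print(language_settings("it"))
print(language_settings("zh-cn"))
```
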
# **Embedding**

In [None]:
!pip install supabase numpy psycopg2 langdetect --quiet

In [None]:
import os
import json
import torch
import uuid
import numpy as np
from supabase import create_client, Client
from transformers import AutoTokenizer, AutoModel

# Initialize Supabase
SUPABASE_URL = ""
SUPABASE_KEY = ""

supabase: Client = create_client(SUPABASE_URL, SUPABASE_KEY)

# Load Embedding Model
tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)
model = AutoModel.from_pretrained("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True).to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))



# Retrieval

In [None]:
import json
import uuid
import numpy as np
import torch
from langdetect import detect
from nltk.corpus import stopwords
from scipy.spatial.distance import cosine

def detect_language(text):
    """Detects the language of a given text."""
    try:
        return detect(text)
    except Exception:  # langdetect raises on empty or undetectable input
        return "en"  # Default to English if detection fails

def get_stopwords(language):
    """Returns stopwords for the detected language."""
    lang_map = {"en": "english", "it": "italian", "fr": "french", "es": "spanish"}  # Extend as needed
    return set(stopwords.words(lang_map.get(language, "english")))  # Default to English

def get_embedding(text):
    """Generates an embedding vector from input text."""
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().cpu().tolist()

def generate_table_description(table_data):
    """Generates a natural language description from a table's headers and rows."""
    headers = table_data["headers"]
    rows = table_data["rows"]

    description = []
    for row in rows:
        # zip() tolerates rows that are shorter than the header list
        row_text = ", ".join(f"{header}: {cell}" for header, cell in zip(headers, row))
        description.append(row_text)

    return " | ".join(description)  # Separate rows with "|"

def convert_table_to_text(table_data, metadata):
    """Converts a table (headers + rows) into a structured text format with metadata and description."""
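    # Cast every cell to str so numeric or null values do not break join()
    headers = ", ".join(str(header) for header in table_data["headers"])
    rows = [" | ".join(str(cell) for cell in row) for row in table_data["rows"]]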

    # Retrieve metadata fields
    table_title = metadata.get("table_title", "Unknown Table")
    section = metadata.get("section", "Unknown Section")

    # Generate description from table data
    table_description = generate_table_description(table_data)

    # Combine metadata with table content
    return (
        f"Table Title: {table_title}. Section: {section}.\n"
        f"Table Data:\nHeaders: {headers}\n" + "\n".join(rows) +
        f"\nDescription: {table_description}"
    ), table_description  # Return both formatted text & natural description

def store_chunks_in_supabase(chunks):
    """Stores text and table chunks into Supabase with improved embeddings and language detection."""
    document_entries = []
    table_entries = []

    for chunk in chunks:
        chunk_id = str(uuid.uuid4())  # Generate unique chunk_id

        # Process text content
        if "content" in chunk and chunk["content"]:
            content = chunk["content"]
            detected_lang = detect_language(content)  # Detect language
            embedding = get_embedding(content)

            document_entries.append({
                "chunk_id": chunk_id,
                "content": content,
                "embedding": embedding,
                # Use .get() so chunks without metadata do not raise KeyError
                "metadata": {**chunk.get("metadata", {}), "language": detected_lang},  # Store language
                "type": "text"
            })

        # Process table data
        if "table" in chunk and chunk["table"]:
            table_data = chunk["table"]
            metadata = chunk.get("metadata", {})

            # Generate both structured table text & natural description
            table_text, table_description = convert_table_to_text(table_data, metadata)
            detected_lang = detect_language(table_text)  # Detect language
            table_embedding = get_embedding(table_text)

            table_entries.append({
                "chunk_id": chunk_id,
                "table_data": json.dumps(table_data, ensure_ascii=False),
                "description": table_description,
                "embedding": table_embedding,
                "metadata": {**metadata, "language": detected_lang}  # Store language
            })

    # Batch insert into Supabase
    if document_entries:
        supabase.table("documents").insert(document_entries).execute()

    if table_entries:
        supabase.table("tables").insert(table_entries).execute()

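To make the table flattening concrete, here is a small standalone sketch of the "header: value" description format. The function `describe_table` is a simplified stand-in for `generate_table_description`, and the sample table is invented for illustration rather than taken from any of the manuals.

```python
def describe_table(table_data):
    """Flatten each row into 'header: value' pairs; rows are joined with ' | '."""
    headers = table_data["headers"]
    return " | ".join(
        ", ".join(f"{header}: {cell}" for header, cell in zip(headers, row))
        for row in table_data["rows"]
    )

# Hypothetical table chunk in the same {"headers": [...], "rows": [...]} shape
sample = {
    "headers": ["Part", "Torque"],
    "rows": [["Bolt A", "5 Nm"], ["Bolt B", "8 Nm"]],
}
print(describe_table(sample))
```
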
In [None]:
# Load JSON chunks
json_file_path = "2014-monarch-plus-service-manual_chunks.json"
with open(json_file_path, "r", encoding="utf-8") as json_file:
    json_chunks = json.load(json_file)

# Store chunks in Supabase
store_chunks_in_supabase(json_chunks)
print("Text and table embeddings stored successfully in Supabase!")


Text and table embeddings stored successfully in Supabase!

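For larger manuals, sending every entry in a single insert call may produce a very large request. One possible refinement, sketched here with a hypothetical `batched` helper and an assumed batch size of 100, is to split the entries into fixed-size batches before inserting them.

```python
def batched(entries, batch_size=100):
    """Yield consecutive slices of `entries` with at most `batch_size` items."""
    for start in range(0, len(entries), batch_size):
        yield entries[start:start + batch_size]

# Usage sketch (commented out because it needs a live Supabase client):
# for batch in batched(document_entries):
#     supabase.table("documents").insert(batch).execute()

sizes = [len(batch) for batch in batched(list(range(250)), batch_size=100)]
print(sizes)
```

The batch size is a guess; a suitable value depends on the embedding dimension and on the request limits of the Supabase project.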

In [None]:
from langdetect import detect
import numpy as np
import torch
import json
import uuid
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from scipy.spatial.distance import cosine

# Ensure the NLTK resources used below are available.
# Only the tokenizer models and the stopword lists are needed, so 'all' is not downloaded.
nltk.download('punkt')
nltk.download('punkt_tab')  # Needed by word_tokenize in newer NLTK releases
nltk.download('stopwords')

In [None]:
def detect_query_language(text):
    """Detects the language of the input text."""
    try:
        return detect(text)
    except Exception:  # langdetect raises on empty or undetectable input
        return "en"  # Default to English if detection fails

def get_embedding(text):
    """Generates an embedding vector from input text."""
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().cpu().tolist()

def extract_keywords_simple(text):
    """Extracts important words from a query using language-specific stopwords."""
    lang = detect_query_language(text)

    # langdetect returns ISO codes ("en", "it"), but NLTK expects full names
    # ("english", "italian"). Passing the raw code always failed and fell back
    # to English, so reuse get_stopwords() (defined above) to translate it.
    stop_words = get_stopwords(lang)

    words = word_tokenize(text.lower())
    keywords = [word for word in words if word.isalnum() and word not in stop_words]
    return keywords

def query_requires_table(user_query):
    """Determines if the query is likely asking for table data, using multilingual table-related keywords."""
    table_keywords = {
        "en": {"table", "data", "values", "measurements", "limits", "thresholds", "parameters", "average", "sum", "percentage", "components"},
        "it": {"tabella", "dati", "valori", "misurazioni", "limiti", "soglie", "parametri", "media", "somma", "percentuale", "componenti"}
    }

    lang = detect_query_language(user_query)
    relevant_keywords = table_keywords.get(lang, table_keywords["en"])  # Fallback to English
    return any(word in user_query.lower() for word in relevant_keywords)

def is_unwanted_section(content, metadata, user_query):
    """Returns a score multiplier for a chunk: 1.5 if the query explicitly asks for a
    TOC/References-style section, 0.3 if the chunk looks like one, otherwise 1."""
    unwanted_keywords = {
        "en": {"table of contents", "contents", "reference", "references", "bibliography", "index"},
        "it": {"indice", "sommario", "riferimenti", "bibliografia"}
    }

    lang = detect_query_language(user_query)
    relevant_unwanted = unwanted_keywords.get(lang, unwanted_keywords["en"])  # Fallback to English

    section = metadata.get("section", "").lower()
    user_query_lower = user_query.lower()

    # If the user explicitly asks for references or TOC, boost its relevance
    if any(word in user_query_lower for word in relevant_unwanted):
        return 1.5  # Increase relevance when explicitly requested

    # Otherwise, apply penalty if it matches unwanted sections
    if any(word in content.lower() or word in section for word in relevant_unwanted):
        return 0.3  # Penalize unwanted sections

    return 1  # No penalty for normal content

def query_supabase(user_query):
    """Retrieves both text and table chunks based on query, ensuring relevance balance with FTS and vector search."""
    query_embedding = np.array(get_embedding(user_query), dtype=np.float32).flatten()
    keywords = extract_keywords_simple(user_query)
    requires_table = query_requires_table(user_query)

    #### Step 1: Full-Text Search (FTS) ####
    response_fts = supabase.rpc("match_documents_fts", {"query": user_query}).execute()
    fts_results = [(r["chunk_id"], "text", r["content"], 0.9) for r in response_fts.data]

    response_fts_tables = supabase.rpc("match_tables_fts", {"query": user_query}).execute()
    fts_table_results = [(r["chunk_id"], "table", r["description"], 0.9) for r in response_fts_tables.data]

    #### Step 2: Retrieve Text Chunks (Vector Search) ####
    response_text = supabase.table("documents").select("chunk_id, content, embedding, metadata").execute()
    text_results = []

    for record in response_text.data:
        chunk_embedding = np.array(record["embedding"], dtype=np.float32).flatten()

        similarity = 1 - cosine(query_embedding, chunk_embedding)
        metadata = record.get("metadata", {})
        score_adjustment = is_unwanted_section(record["content"], metadata, user_query)
        final_text_score = similarity * score_adjustment

        text_results.append((record["chunk_id"], "text", record["content"], final_text_score))

    #### Step 3: Retrieve Table Chunks ####
    response_tables = supabase.table("tables").select("chunk_id, table_data, description, embedding, metadata").execute()
    table_results = []

    for record in response_tables.data:
        table_embedding = np.array(record["embedding"], dtype=np.float32).flatten()
        metadata = record.get("metadata", {})

        boost_factor = is_unwanted_section(record["description"], metadata, user_query)

        section_title = (metadata.get("section") or "").lower()
        table_title = (metadata.get("table_title") or "").lower()
        user_query_lower = user_query.lower()

        # Boost if query matches section title
        if any(word in section_title for word in keywords):
            boost_factor *= 1.5  # Mild boost

        # Stronger boost if query matches the table title
        if any(word in table_title for word in keywords):
            boost_factor *= 1.5  # Higher priority boost

        keyword_match_score = boost_factor * sum(
            3 if word in record["description"].split(" ")[:5] else 1
            for word in keywords if word in record["description"]
        )

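        # Skip records whose embedding dimension does not match the query's;
        # otherwise final_table_score would be undefined (or stale) below.
        if table_embedding.shape != query_embedding.shape:
            continue

        embedding_similarity = 1 - cosine(query_embedding, table_embedding)

        keyword_embedding_score = sum(
            1 - cosine(get_embedding(word), table_embedding) for word in keywords
        ) / max(len(keywords), 1)

        # Clamp similarities at 0: a negative base with a fractional exponent is not a real number
        final_table_score = (
            ((max(embedding_similarity, 0.0) ** 0.8) * 0.3) +
            ((keyword_match_score ** 1.5) * 0.5) +
            ((max(keyword_embedding_score, 0.0) ** 1.2) * 0.2)
        ) * boost_factor

        if final_table_score > 0:
            table_results.append((record["chunk_id"], "table", record["description"], final_table_score))

    # Sort once, after all table records have been scored
    table_results.sort(key=lambda x: x[3], reverse=True)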

    #### Step 4: Merge & Rank Results ####
    all_results = fts_results + fts_table_results + text_results + table_results
    all_results.sort(key=lambda x: x[3], reverse=True)

    return all_results[:5]  # Return top 5 most relevant chunks

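The weighted table score in `query_supabase` can be checked by hand with a small standalone function. The sketch below mirrors the three weighted terms for non-negative inputs; `table_score` and the sample values are illustrative only.

```python
def table_score(embedding_similarity, keyword_match_score,
                keyword_embedding_score, boost_factor=1.0):
    """Weighted combination of the three table signals (non-negative inputs assumed)."""
    return (
        (embedding_similarity ** 0.8) * 0.3
        + (keyword_match_score ** 1.5) * 0.5
        + (keyword_embedding_score ** 1.2) * 0.2
    ) * boost_factor

# All three signals equal to 1: the weights 0.3 + 0.5 + 0.2 add up to 1
print(round(table_score(1.0, 1.0, 1.0), 2))

# The keyword match count is unbounded, so four plain matches already dominate
print(round(table_score(1.0, 4.0, 1.0), 2))
```

With the second set of inputs the score is roughly 4.5, well above the cosine-based text scores, which stay at or below 1 (or 1.5 after a boost). Table chunks with several keyword hits can therefore outrank every text chunk in the merged list.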

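Because the FTS and vector-search lists are simply concatenated, the same chunk can appear more than once in the top 5. A possible refinement, which is not part of the pipeline above, keeps only the best-scoring entry per chunk; `merge_results` is a hypothetical helper operating on the same `(chunk_id, type, content, score)` tuples.

```python
def merge_results(*result_lists, top_k=5):
    """Merge (chunk_id, type, content, score) tuples, keeping the best score per chunk."""
    best = {}
    for results in result_lists:
        for chunk_id, kind, content, score in results:
            key = (chunk_id, kind)
            if key not in best or score > best[key][3]:
                best[key] = (chunk_id, kind, content, score)
    return sorted(best.values(), key=lambda r: r[3], reverse=True)[:top_k]

# Invented sample data: chunk "a" is found by both FTS and vector search
fts = [("a", "text", "alpha", 0.9)]
vec = [("a", "text", "alpha", 0.93), ("b", "text", "beta", 0.5)]
merged = merge_results(fts, vec)
print(merged)
```
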
In [None]:
### Example query for Manuale-IRIS_SLIM_IN_TEC_IT ###
#user_query = "What are the key considerations for using and maintaining the Iris Slim units?"  # Answer in Section 2 and Section 6.1
#user_query = "What is the intended use of IRIS Slim units?"  # Answer in Section 2.1
#user_query = "What are the installation requirements for the IRIS Slim unit?"  # Answer in Section 4.2
#user_query = "What are the operating limits?"  # Answer in Section 2.5 (Table 1)
#user_query = "What steps should be taken in case of water leakage?"  # Answer in Section 4.3.1, Section 6.3, Section 6.3.1

### Example query for Manuale-ROTOMARR ###
#user_query = "What is the intended use of the ROTOMARR automatic chestnut roaster?"  # Answer in Section 3 - Destinazione d’uso e utilizzatori
#user_query = "What are the steps to correctly start the ROTOMARR chestnut roaster?"  # Answer in Section 4 - Avviamento
#user_query = "What does the image in the regulation section illustrate?"  # Answer in Section 3 - Regolazione (with labeled image of components)
#user_query = "What are the components of the ROTOMARR chestnut roaster?"  # Answer in Section 6 - Ricambi (table)
#user_query = "What are the power requirements of the ROTOMARR machine?"  # Answer in Section 2 - Specifiche
#user_query = "How should the ROTOMARR machine be cleaned and maintained?"  # Answer in Section 5 - Pulizia & Manutenzione

### Example query for PDF1 ###
#user_query = "What is sentiment analysis and what is NLP?"  # Answer in Section 2
#user_query = "What is the accuracy of the LDA model when using the daily weighted average sentiment score?"  # Answer in Section 3.1 (Table 1)
#user_query = "What is the characteristic of the sentiment distribution score?"  # Answer in Section 3.1 (Figure 1)
#user_query = "Algorithm flow chart"  # Answer in Section 3.4.1 (Figure 5)

### Example query for 17 ###
#user_query = "What were the objectives of the workshop on family medicine training in Africa?" # Answer in Introduction (page 1)
#user_query = "Challenges faced in the workshop"  # Answer in several sections
#user_query = "Which countries participated in the workshop on family medicine training in Africa and its number of participant?"  # Answer in Workshop Participant (page 3, Table 1)
#user_query = "introduction"

### Example query for 2014-monarch-plus-service-manual ###
#user_query = "What safety precautions should be followed when servicing the RockShox Monarch Plus RC3/R?"  # Answer in Safety Instructions
#user_query = "What are the steps to remove the air can from the Monarch Plus RC3/R rear shock?"  # Answer in Air Can Removal
#user_query = "What are the components of the Monarch Plus RC3/R rear shock?"  # Answer in Exploded View (image)
#user_query = "What are the warranty limitations for the Monarch Plus RC3/R rear shock?"  # Answer in SRAM LLC Warranty - Limitations of Warranty
user_query = "What are the p a r t s a n d t o o l s needed for the Monarch Plus RC3/R rear shock service?"  # Answer in Parts and Tools for Service

retrieved_chunks = query_supabase(user_query)

for chunk in retrieved_chunks:
    print(f"Chunk ID: {chunk[0]}\nType: {chunk[1]}\nContent: {chunk[2][:300]}...\nRelevance: {chunk[3]:.4f}\n")


Chunk ID: 667c9e69-e2be-41ac-b52a-d7f2af11396b
Type: text
Content: ## P a r t s   a n d   T o o l s   N e e d e d   F o r   S e r v i c e
- • Safety glasses - • Torque wrench - • Nitrile gloves - • Apron - • Clean, lint-free rags - • Oil pan - • Isopropyl alcohol - • Parker® O-Lube - • Suspension specific grease - • Maxima® Maxum4 Extra 15w50 lube - • RockShox 3wt ...
Relevance: 0.9326

Chunk ID: a519d209-ee2e-4f4c-9440-19db32271547
Type: text
Content: ## M o n a r c h   P l u s ™ R C 3 / R   S e r v i c e
Prior to servicing your rear shock, remove it from the bicycle frame according to the bicycle manufacturer's instructions. Once the shock is removed from the bicycle, remove the mounting hardware before performing any service (see the Mounting H...
Relevance: 0.9306

Chunk ID: a2cc1574-35a0-4157-8192-d405475bbfb3
Type: text
Content: ## E x p l o d e d   V i e w   -   M o n a r c h   P l u s ™ R C 3 / R   R e a r   S h o c k
![Image](/content/markdown/2014-monarch-plus-service-manual_