<a href="https://colab.research.google.com/github/kairamilanifitria/PurpleBox-Intern/blob/main/03_07_Embedding%2BRetrieval%2BLLM_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## chunking

Problem:
> Some text is missing from the generated chunks, and table handling sometimes converts normal text into table format. The causes are the semantic chunking step and the splitter module, plus table-handling rules that do not work well with the previous chunking approach.


**So, the mitigations were:**
1. Remove the semantic chunking
2. Remove the use of `MarkdownHeaderTextSplitter` from LangChain

**Revised Implementation:**
1. Define a customizable Markdown splitter
2. With semantic chunking removed, I use a sliding window to split chunks that are too long. `Maximum words = 400`; `Overlap = 40 words`
3. Table handling: add more context to the "section" field of the chunk metadata
4. No more missing text
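
The sliding-window step in item 2 can be sketched in isolation as follows. This is a minimal illustration, not the notebook's own function: the name `sliding_windows` is hypothetical, and it returns index spans rather than joined text so the window arithmetic is easy to inspect. It assumes each window advances by `max_words - overlap` words and that splitting stops once a window reaches the last word.

```python
def sliding_windows(words, max_words=400, overlap=40):
    """Return (start, end) index pairs that cover `words` with overlapping windows."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    step = max_words - overlap  # each window starts `step` words after the previous one
    spans = []
    start = 0
    while start < len(words):
        end = min(start + max_words, len(words))
        spans.append((start, end))
        if end == len(words):
            break  # the window reached the last word, so no further window is needed
        start += step
    return spans


words = [f"w{i}" for i in range(1000)]
print(sliding_windows(words))  # [(0, 400), (360, 760), (720, 1000)]
```

With 1,000 words, consecutive windows share 40 words (for example, words 360 to 399 appear in both the first and second window), and a text of 400 words or fewer yields a single window.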

In [17]:
import json
import re
import os

# Load Markdown file
file_path = "/content/drive/MyDrive/document_rag/md/PDF1.md"
file_name = os.path.basename(file_path)
with open(file_path, "r", encoding="utf-8") as file:
    markdown_text = file.read()

# Function to check if a chunk contains a Markdown table
def is_table(chunk):
    return bool(re.search(r'^\|.*\|\n\|[-| ]+\|\n(\|.*\|\n)*', chunk, re.MULTILINE))

# Function to extract and split long tables
def extract_and_split_table(chunk, max_rows=10):
    lines = chunk.strip().split("\n")
    header, table_rows = None, []
    for i, line in enumerate(lines):
        if re.match(r'^\|[-| ]+\|$', line):
            header = lines[i - 1].strip("|").split("|")
            header = [h.strip() for h in header]
            continue
        if header:
            row_data = line.strip("|").split("|")
            row_data = [cell.strip() for cell in row_data]
            table_rows.append(row_data)

    # Split table into chunks if too many rows
    table_chunks = []
    for i in range(0, len(table_rows), max_rows):
        chunk_rows = table_rows[i:i + max_rows]
        table_chunks.append({"headers": header, "rows": chunk_rows})

    return table_chunks if header and table_rows else None

# Function to extract section headers
def extract_section_title(header):
    match = re.match(r'^(#+)\s+(.*)', header.strip())
    return match.group(2) if match else None

# Function to detect table title
def detect_table_title(pre_table_text):
    lines = pre_table_text.strip().split("\n")
    if lines and len(lines[-1].split()) < 10:  # Assuming a title is a short line before a table
        return lines[-1]
    return None

# Function to split text into chunks of max 400 words with 40-word overlap
def split_text(text, section_title, max_words=400, overlap=40):
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + max_words, len(words))
        chunk = " ".join(words[start:end])
        # Prepend section title to first chunk
        if start == 0:
            chunk = f"## {section_title}\n{chunk}"
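        chunks.append(chunk)
        if end == len(words):
            break  # Last window reached the end; avoid emitting an overlap-only tail chunk
        start += max_words - overlap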
    return chunks

# Process Markdown
sections = re.split(r'^(#+\s+.*)', markdown_text, flags=re.MULTILINE)
final_chunks = []
current_section = "Unknown"
chunk_id = 1

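# re.split with a capturing group returns [preamble, header, body, header, body, ...].
# Keep the preamble (text before the first heading) so that no text is dropped.
section_pairs = [(None, sections[0])]
section_pairs += [(sections[i], sections[i + 1]) for i in range(1, len(sections), 2)]

for header_line, body in section_pairs:
    section_title = (extract_section_title(header_line) if header_line else None) or current_section
    content = body.strip()
    current_section = section_title  # Update current section to maintain hierarchy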

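    # `content` was stripped, so a table at the end of a section has no trailing newline;
    # allow the last row to end at end-of-string so it is not left behind as plain text
    table_matches = list(re.finditer(r'(\|.*\|\n\|[-| ]+\|\n(?:\|.*\|(?:\n|$))+)', content, re.MULTILINE))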
    last_index = 0

    for match in table_matches:
        start, end = match.span()
        pre_table_text = content[last_index:start].strip()
        table_text = match.group(0)
        last_index = end

        table_title = detect_table_title(pre_table_text)  # Extract table title if present
        if pre_table_text:
            text_chunks = split_text(pre_table_text, section_title)
            for chunk in text_chunks:
                final_chunks.append({
                    "chunk_id": chunk_id,
                    "content": chunk,
                    "metadata": {
                        "source": file_name,
                        "section": section_title,
                        "position": chunk_id
                    }
                })
                chunk_id += 1

        table_chunks = extract_and_split_table(table_text)
        if table_chunks:
            for table_chunk in table_chunks:
                final_chunks.append({
                    "chunk_id": chunk_id,
                    "table": table_chunk,
                    "metadata": {
                        "source": file_name,
                        "section": section_title,
                        "table_title": table_title,
                        "position": chunk_id
                    }
                })
                chunk_id += 1

    remaining_text = content[last_index:].strip()
    if remaining_text:
        text_chunks = split_text(remaining_text, section_title)
        for chunk in text_chunks:
            final_chunks.append({
                "chunk_id": chunk_id,
                "content": chunk,
                "metadata": {
                    "source": file_name,
                    "section": section_title,
                    "position": chunk_id
                }
            })
            chunk_id += 1

# Save JSON output
output_file = "/content/PDF1.json"
with open(output_file, "w", encoding="utf-8") as json_file:
    json.dump(final_chunks, json_file, indent=4, ensure_ascii=False)

print(f"Chunking completed. JSON saved to: {output_file}")


Chunking completed. JSON saved to: /content/PDF1.json


## vector store embedding

In [None]:
!pip install supabase

### **sentence-transformer**

In [None]:
import json
import torch
from sentence_transformers import SentenceTransformer
from supabase import create_client

# Supabase Configuration
SUPABASE_URL = "https://vptbbrmqaqpsynvpizih.supabase.co"
SUPABASE_KEY = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6InZwdGJicm1xYXFwc3ludnBpemloIiwicm9sZSI6ImFub24iLCJpYXQiOjE3NDEwNjU2NzMsImV4cCI6MjA1NjY0MTY3M30.XVOsjwisyi39awcbC3TMf46uMbdlwUkY-wfyo31UthI"
supabase = create_client(SUPABASE_URL, SUPABASE_KEY)

# Load Embedding Model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Load JSON chunks
json_file_path = "/content/17_chunks_v2.json"
with open(json_file_path, "r", encoding="utf-8") as json_file:
    json_chunks = json.load(json_file)

def store_chunks_in_supabase(chunks):
    for chunk in chunks:
        content = chunk.get("content") or json.dumps(chunk.get("table"), ensure_ascii=False)
        embedding = model.encode(content).tolist()
        chunk["embedding"] = embedding

        data = {
            "content": content,
            "embedding": embedding,
            "metadata": chunk["metadata"]
        }
        supabase.table("documents").insert(data).execute()

# Store in Supabase
store_chunks_in_supabase(json_chunks)

print("Chunks with embeddings stored successfully in Supabase!")


Chunks with embeddings stored successfully in Supabase!


### **BAAI/bge-m3**

In [None]:
import json
import torch
from transformers import AutoTokenizer, AutoModel
from supabase import create_client

# Supabase Configuration
SUPABASE_URL = "https://vptbbrmqaqpsynvpizih.supabase.co"
SUPABASE_KEY = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6InZwdGJicm1xYXFwc3ludnBpemloIiwicm9sZSI6ImFub24iLCJpYXQiOjE3NDEwNjU2NzMsImV4cCI6MjA1NjY0MTY3M30.XVOsjwisyi39awcbC3TMf46uMbdlwUkY-wfyo31UthI"
supabase = create_client(SUPABASE_URL, SUPABASE_KEY)

# Load BAAI/bge-m3 Embedding Model
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3").to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))

In [None]:
def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().tolist()

# Load JSON chunks
json_file_path = "/content/17_chunks_v2.json"
with open(json_file_path, "r", encoding="utf-8") as json_file:
    json_chunks = json.load(json_file)

def store_chunks_in_supabase(chunks):
    for chunk in chunks:
        content = chunk.get("content") or json.dumps(chunk.get("table"), ensure_ascii=False)
        embedding = get_embedding(content)
        chunk["embedding"] = embedding

        data = {
            "content": content,
            "embedding": embedding,
            "metadata": chunk["metadata"]
        }
        supabase.table("documents").insert(data).execute()

# Store in Supabase
store_chunks_in_supabase(json_chunks)

print("Chunks with embeddings stored successfully in Supabase!")


Chunks with embeddings stored successfully in Supabase!


### **Alibaba-NLP/gte-multilingual-base**

In [None]:
import json
import torch
from transformers import AutoTokenizer, AutoModel
from supabase import create_client

# Supabase Configuration
SUPABASE_URL = "https://vptbbrmqaqpsynvpizih.supabase.co"
SUPABASE_KEY = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6InZwdGJicm1xYXFwc3ludnBpemloIiwicm9sZSI6ImFub24iLCJpYXQiOjE3NDEwNjU2NzMsImV4cCI6MjA1NjY0MTY3M30.XVOsjwisyi39awcbC3TMf46uMbdlwUkY-wfyo31UthI"
supabase = create_client(SUPABASE_URL, SUPABASE_KEY)

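# Load Embedding Model
# The model repository ships custom modeling code, so trust_remote_code=True is needed to load it
tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-multilingual-base")
model = AutoModel.from_pretrained("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True).to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))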

In [None]:
def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().tolist()

# Load JSON chunks
json_file_path = "/content/17_chunks_v2.json"
with open(json_file_path, "r", encoding="utf-8") as json_file:
    json_chunks = json.load(json_file)

def store_chunks_in_supabase(chunks):
    for chunk in chunks:
        content = chunk.get("content") or json.dumps(chunk.get("table"), ensure_ascii=False)
        embedding = get_embedding(content)
        chunk["embedding"] = embedding

        data = {
            "content": content,
            "embedding": embedding,
            "metadata": chunk["metadata"]
        }
        supabase.table("documents").insert(data).execute()

# Store in Supabase
store_chunks_in_supabase(json_chunks)

print("Chunks with embeddings stored successfully in Supabase!")


### Emptying the table in Supabase

> `DELETE FROM documents;`



### TESTING

### **sentence-transformer**

In [None]:
import json
import torch
from sentence_transformers import SentenceTransformer
from supabase import create_client

# Supabase Configuration
SUPABASE_URL = "https://vptbbrmqaqpsynvpizih.supabase.co"
SUPABASE_KEY = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6InZwdGJicm1xYXFwc3ludnBpemloIiwicm9sZSI6ImFub24iLCJpYXQiOjE3NDEwNjU2NzMsImV4cCI6MjA1NjY0MTY3M30.XVOsjwisyi39awcbC3TMf46uMbdlwUkY-wfyo31UthI"
supabase = create_client(SUPABASE_URL, SUPABASE_KEY)

# Load Embedding Model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def search_matching_documents(query, top_k=3):
    query_embedding = model.encode(query).tolist()

    # Perform similarity search in Supabase
    response = supabase.rpc(
        "match_documents",
        {"query_embedding": query_embedding, "match_count": top_k}
    ).execute()

    if response.data:
        print("\nMatching Documents with Similarity Scores:")
        for idx, entry in enumerate(response.data, start=1):
            print(f"{idx}. Similarity: {entry['similarity']:.4f}")
            print(f"   Content: {entry['content']}\n")
    else:
        print("No matching documents found.")


In [None]:
# Example Query
query = "What is SAAFP?"
search_matching_documents(query)


Matching Documents with Similarity Scores:
1. Similarity: 0.2784
   Content: ## Reflection on workshop  
The group appreciated the richness of the discussion and the value  of  having  a  variety  of  countries  represented  in  the workshop. The group members expressed feeling encouraged and felt motivated to use 'small moments, little bits, part of the mini-CEX' during learning interactions in the workplace (teachable  moments).  This  will  allow  the  supervisors  and registrars to be 'more real' in the workplace, as opposed to striving for the hard-to-reach  perfect or ideal learning interactions. It will necessitate a more honest and pragmatic approach  to  harness  these  learning  moments.  Ongoing discussions  are  needed  around  the  validity  of  continuous assessments  in  the  workplace  for  national  examinations, such as the Fellowship of the College of Family Physicians of South Africa (FCFP[SA]), and the contribution of the learning portfolio to exit examination res

### **huggingface BAAI**

In [None]:
def search_matching_documents(query, top_k=3):
    query_embedding = get_embedding(query)

    # Perform similarity search in Supabase
    response = supabase.rpc(
        "match_documents",
        {"query_embedding": query_embedding, "match_count": top_k}
    ).execute()

    if response.data:
        print("\nMatching Documents with Similarity Scores:")
        for idx, entry in enumerate(response.data, start=1):
            print(f"{idx}. Similarity: {entry['similarity']:.4f}")
            print(f"   Content: {entry['content']}\n")
    else:
        print("No matching documents found.")

In [None]:
# Example Query
query = "What is SAAFP?"
search_matching_documents(query)


Matching Documents with Similarity Scores:
1. Similarity: 0.6064
   Content: 7  
The World Organisation of Family Doctors and the South African Academy of Family Physicians (SAAFP)  have  established  standards  for  the  postgraduate  training  of  family  physicians. 8,9  
However,  family  medicine  is  a  relatively  new  specialty  in many African countries, which adds to the challenges around training and supervision in the context of large rural areas, massive health needs and minimal resources. 10  
The  aim  of  the  workshop  was  to  understand  how  family medicine registrars (postgraduate trainees in family medicine) in  Africa  learn  in  the  workplace. We  particularly  wanted  to explore  the  interaction  between  the  registrar  and  supervisor in the workplace, captured in a portfolio of learning, and in the African  context. We  sought  a  clearer  understanding  of  what it  means  to  be  observed  while  conducting  a  consultation  or performing  a  procedure,

### **Alibaba-NLP/gte-multilingual-base**

In [None]:
def search_matching_documents(query, top_k=3):
    query_embedding = get_embedding(query)

    # Perform similarity search in Supabase
    response = supabase.rpc(
        "match_documents",
        {"query_embedding": query_embedding, "match_count": top_k}
    ).execute()

    if response.data:
        print("\nMatching Documents with Similarity Scores:")
        for idx, entry in enumerate(response.data, start=1):
            print(f"{idx}. Similarity: {entry['similarity']:.4f}")
            print(f"   Content: {entry['content']}\n")
    else:
        print("No matching documents found.")

In [None]:
# Example Query
query = "Participants in Kenya?"
search_matching_documents(query)


Matching Documents with Similarity Scores:
1. Similarity: 0.8272
   Content: ## Participants and process  
Thirty-five people  participated  in  a  2-h  workshop  and included trainers and trainees from nine African countries, the United Kingdom, United States and Sweden (see Table 1). South  Africa  was  represented  by  the  universities  of  Cape Town,  Limpopo,  Pretoria,  Sefako  Makgatho,  Stellenbosch, Walter Sisulu and Witwatersrand.  
We started with an introduction and then divided into buzz pairs (pairs were allowed to form spontaneously, regardless of the trainer or trainee status of the participants). In the buzz pairs, we explored the questions of how do I teach or learn, supervise  or  be  supervised,  and  assess  or  be  assessed. This was followed by an interactive focus group discussion on  the  reflections  created  by  the  buzz  pair  discussions (a guiding style was employed to facilitate this discussion). The  group  reflections  were  captured  on  a  flip  ch

## Trying another language: Italian

### **huggingface BAAI**

In [None]:
!pip install supabase

In [None]:
import json
import torch
from transformers import AutoTokenizer, AutoModel
from supabase import create_client

# Supabase Configuration
SUPABASE_URL = "https://vptbbrmqaqpsynvpizih.supabase.co"
SUPABASE_KEY = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6InZwdGJicm1xYXFwc3ludnBpemloIiwicm9sZSI6ImFub24iLCJpYXQiOjE3NDEwNjU2NzMsImV4cCI6MjA1NjY0MTY3M30.XVOsjwisyi39awcbC3TMf46uMbdlwUkY-wfyo31UthI"
supabase = create_client(SUPABASE_URL, SUPABASE_KEY)

# Load BAAI/bge-m3 Embedding Model
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3").to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))


In [None]:
def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().tolist()

# Load JSON chunks
json_file_path = "/content/ManualeRotomarr.json"
with open(json_file_path, "r", encoding="utf-8") as json_file:
    json_chunks = json.load(json_file)

def store_chunks_in_supabase(chunks):
    for chunk in chunks:
        content = chunk.get("content") or json.dumps(chunk.get("table"), ensure_ascii=False)
        embedding = get_embedding(content)
        chunk["embedding"] = embedding

        data = {
            "content": content,
            "embedding": embedding,
            "metadata": chunk["metadata"]
        }
        supabase.table("documents").insert(data).execute()

# Store in Supabase
store_chunks_in_supabase(json_chunks)

print("Chunks with embeddings stored successfully in Supabase!")


Chunks with embeddings stored successfully in Supabase!


### **huggingface BAAI**

In [None]:
def search_matching_documents(query, top_k=3):
    query_embedding = get_embedding(query)

    # Perform similarity search in Supabase
    response = supabase.rpc(
        "match_documents",
        {"query_embedding": query_embedding, "match_count": top_k}
    ).execute()

    if response.data:
        print("\nMatching Documents with Similarity Scores:")
        for idx, entry in enumerate(response.data, start=1):
            print(f"{idx}. Similarity: {entry['similarity']:.4f}")
            print(f"   Content: {entry['content']}\n")
    else:
        print("No matching documents found.")

In [None]:
# Example Query
query = "INFORMAZIONI SULLA SICUREZZA"
search_matching_documents(query)


Matching Documents with Similarity Scores:
1. Similarity: 0.6989
   Content: ## 1. INFORMAZIONI SULLA SICUREZZA
## 3. DESTINAZIONE D'USO E UTILIZZATORI  
Il  presente  manuale  contiene  indicazioni  ed  informazioni fondamentali per il corretto utilizzo del GIRACASTAGNE AUTOMATICO (CUOCI CALDARROSTE) ROTOMARR .  
- -Leggere il manuale nella sua completezza per comprendere l'utilizzo della macchina;
- -Tenere questo manuale per future consultazioni in un luogo sicuro;
- -Osservare le istruzioni indicate in questo manuale per garantire la sicurezza dell'utilizzatore;
- -La non osservanza delle indicazioni elencate in questo manuale comporterà l'annullamento della garanzia;
- -MECTRONICA S.r.l. non è responsabile per danni o lesioni causate dalla non osservanza delle informazioni elencate nel presente manuale.

2. Similarity: 0.6657
   Content: ## 5. PULIZIA &amp; MANUTENZIONE  
40010 Bentivoglio (BO) Italia  
Tel. +39 0516641440 Fax. +39 0518909108  
Al  termine  di  ogni  utilizzo,  s

Try the same query in English:

In [None]:
# Example Query
query = "safety information"
search_matching_documents(query)


Matching Documents with Similarity Scores:
1. Similarity: 0.6521
   Content: ## 1. INFORMAZIONI SULLA SICUREZZA
## 3. DESTINAZIONE D'USO E UTILIZZATORI  
Il  presente  manuale  contiene  indicazioni  ed  informazioni fondamentali per il corretto utilizzo del GIRACASTAGNE AUTOMATICO (CUOCI CALDARROSTE) ROTOMARR .  
- -Leggere il manuale nella sua completezza per comprendere l'utilizzo della macchina;
- -Tenere questo manuale per future consultazioni in un luogo sicuro;
- -Osservare le istruzioni indicate in questo manuale per garantire la sicurezza dell'utilizzatore;
- -La non osservanza delle indicazioni elencate in questo manuale comporterà l'annullamento della garanzia;
- -MECTRONICA S.r.l. non è responsabile per danni o lesioni causate dalla non osservanza delle informazioni elencate nel presente manuale.

2. Similarity: 0.6195
   Content: ## 6. RICAMBI  
Nelle seguenti pagine saranno indicate a disegno le componenti meccaniche con i loro codici.  
Qualora  sia necessario ordinare 

### **Alibaba-NLP/gte-multilingual-base**

In [None]:
import json
import torch
from transformers import AutoTokenizer, AutoModel
from supabase import create_client

# Supabase Configuration
SUPABASE_URL = "https://vptbbrmqaqpsynvpizih.supabase.co"
SUPABASE_KEY = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6InZwdGJicm1xYXFwc3ludnBpemloIiwicm9sZSI6ImFub24iLCJpYXQiOjE3NDEwNjU2NzMsImV4cCI6MjA1NjY0MTY3M30.XVOsjwisyi39awcbC3TMf46uMbdlwUkY-wfyo31UthI"
supabase = create_client(SUPABASE_URL, SUPABASE_KEY)

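# Load Embedding Model
# The model repository ships custom modeling code, so trust_remote_code=True is needed to load it
tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-multilingual-base")
model = AutoModel.from_pretrained("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True).to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))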

In [None]:
def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().tolist()

# Load JSON chunks
json_file_path = "/content/ManualeRotomarr.json"
with open(json_file_path, "r", encoding="utf-8") as json_file:
    json_chunks = json.load(json_file)

def store_chunks_in_supabase(chunks):
    for chunk in chunks:
        content = chunk.get("content") or json.dumps(chunk.get("table"), ensure_ascii=False)
        embedding = get_embedding(content)
        chunk["embedding"] = embedding

        data = {
            "content": content,
            "embedding": embedding,
            "metadata": chunk["metadata"]
        }
        supabase.table("documents").insert(data).execute()

# Store in Supabase
store_chunks_in_supabase(json_chunks)

print("Chunks with embeddings stored successfully in Supabase!")


Chunks with embeddings stored successfully in Supabase!


### **Alibaba-NLP/gte-multilingual-base**

In [None]:
def search_matching_documents(query, top_k=3):
    query_embedding = get_embedding(query)

    # Perform similarity search in Supabase
    response = supabase.rpc(
        "match_documents",
        {"query_embedding": query_embedding, "match_count": top_k}
    ).execute()

    if response.data:
        print("\nMatching Documents with Similarity Scores:")
        for idx, entry in enumerate(response.data, start=1):
            print(f"{idx}. Similarity: {entry['similarity']:.4f}")
            print(f"   Content: {entry['content']}\n")
    else:
        print("No matching documents found.")

In [None]:
# Example Query
query = "INFORMAZIONI SULLA SICUREZZA"
search_matching_documents(query)


Matching Documents with Similarity Scores:
1. Similarity: 0.7864
   Content: ## ROTOMARR  
- 2) l'utilizzo della macchina su fornelli di dimensioni superiori ai 70 cm di diametro  
Utilizzare  l'apparecchio  solo  su  un  fornello  a gas dal diametro max. di 70mm.  
![Image](/content/markdown2/ManualeRotomarr_artifacts/image_000004_e70e7737d2a45c3228fd944f2ed7008935b52b5367d10e1135a35711464aba0a.png)  
- 3) il lavaggio della macchina in lavastoviglie
- 4) il lavaggio  della macchina  con  getto  d'acqua pieno
- 5) l'apertura dei ripari o una qualsiasi manomissione della macchina
- 6) l'utilizzo all'esterno in caso di cattive condizioni meteorologiche  (pioggia,  neve,  grandine,  vento forte)
- 7) l'utilizzo  in  locali  con  pericolo  di  esplosione  o incendio  o  in  presenza  di  grandi  quantitativi  di materiale infiammabile

2. Similarity: 0.7763
   Content: ## 1. INFORMAZIONI SULLA SICUREZZA
## 3. DESTINAZIONE D'USO E UTILIZZATORI  
Il  presente  manuale  contiene  indicazioni

Try the same query in English:

In [None]:
# Example Query
query = "safety information"
search_matching_documents(query)


Matching Documents with Similarity Scores:
1. Similarity: 0.7606
   Content: ## 1. INFORMAZIONI SULLA SICUREZZA
## 3. DESTINAZIONE D'USO E UTILIZZATORI  
Il  presente  manuale  contiene  indicazioni  ed  informazioni fondamentali per il corretto utilizzo del GIRACASTAGNE AUTOMATICO (CUOCI CALDARROSTE) ROTOMARR .  
- -Leggere il manuale nella sua completezza per comprendere l'utilizzo della macchina;
- -Tenere questo manuale per future consultazioni in un luogo sicuro;
- -Osservare le istruzioni indicate in questo manuale per garantire la sicurezza dell'utilizzatore;
- -La non osservanza delle indicazioni elencate in questo manuale comporterà l'annullamento della garanzia;
- -MECTRONICA S.r.l. non è responsabile per danni o lesioni causate dalla non osservanza delle informazioni elencate nel presente manuale.

2. Similarity: 0.7529
   Content: ## ROTOMARR  
- 2) l'utilizzo della macchina su fornelli di dimensioni superiori ai 70 cm di diametro  
Utilizzare  l'apparecchio  solo  su  un 

## testing: Cosine Similarity
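
The metric used in this section is the mean of the cosine similarities between one query vector and each retrieved document vector. As a minimal sketch of that arithmetic, independent of any embedding model, the example below uses tiny hand-written vectors; the helper names `cosine` and `average_cosine` and the toy vectors are illustrative assumptions, not part of the notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_cosine(query_vec, doc_vecs):
    """Mean cosine similarity between one query vector and several document vectors."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sum(scores) / len(scores)


# Toy 2-D vectors: identical direction (1.0), orthogonal (0.0), 45 degrees apart (~0.7071)
query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(round(average_cosine(query_vec, doc_vecs), 4))  # 0.569
```

Because the score is an average, one weakly related document among the retrieved set pulls the value down even when the top hit is a good match.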

BAAI

In [None]:
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

# Load Embedding Model
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3").to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))

def get_embedding(text):
    """Generate embedding using the BAAI/bge-m3 model."""
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    # Move inputs to the same device as the model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    # Move the tensor to the CPU before converting to NumPy
    return outputs.last_hidden_state[:, 0, :].cpu().squeeze().numpy()  # Use CLS token embedding

def avg_cosine_similarity(query, retrieved_docs, doc_embeddings):
    """Compute the average cosine similarity between a query and retrieved documents."""
    query_embedding = get_embedding(query).reshape(1, -1)
    doc_vectors = np.array([doc_embeddings[doc] for doc in retrieved_docs])

    similarities = cosine_similarity(query_embedding, doc_vectors)[0]
    return np.mean(similarities)

# Example Query
query = "What is SAAFP?"

# Example Retrieved Documents
retrieved_docs = {
    "doc_1": "The World Organisation of Family Doctors and the South African Academy of Family Physicians (SAAFP)  have  established  standards  for  the  postgraduate  training  of  family  physicians. 8,9 However,  family  medicine  is  a  relatively  new  specialty  in many African countries, which adds to the challenges around training and supervision in the context of large rural areas, massive health needs and minimal resources. 10 The  aim  of  the  workshop  was  to  understand  how  family medicine registrars (postgraduate trainees in family medicine) in  Africa  learn  in  the  workplace. We  particularly  wanted  to explore  the  interaction between  the  registrar  and  supervisor in the workplace, captured in a portfolio of learning, and in the African  context. We  sought  a  clearer  understanding  of  what it  means  to  be  observed  while  conducting  a  consultation  or performing  a  procedure,  as  well  as  understanding  the  local experience of giving or receiving feedback, and how various educational meetings are conducted.",
    "doc_2": "It was clear from this workshop discussion that the training  of  family  physicians  across  Africa  shares  many common  themes.  However,  there  are  also  big  differences among the various countries and even programmes within countries. The way forward would include exploring the  local  contextual  enablers  that  influence  the  learning conversations between trainees and their supervisors. Family medicine  training  institutions  and  organisations  (such  as WONCA Africa and SAAFP) have a critical role to play in supporting  trainees  and  trainers  towards  developing  local competencies that facilitate learning in the clinical workplace dominated by service delivery pressures. ## Acknowledgements The  authors  would  like  to  thank  and  acknowledge  the  35 trainers and trainees who participated in the workshop.",
    "doc_3": "Thirty-five people  participated  in  a  2-h  workshop  and included trainers and trainees from nine African countries, the United Kingdom, United States and Sweden (see Table 1). South  Africa  was  represented  by  the  universities  of  Cape Town,  Limpopo,  Pretoria,  Sefako  Makgatho,  Stellenbosch, Walter Sisulu and Witwatersrand. We started with an introduction and then divided into buzz pairs (pairs were allowed to form spontaneously, regardless of the trainer or trainee status of the participants). In the buzz pairs, we explored the questions of how do I teach or learn, supervise  or  be  supervised,  and  assess  or  be  assessed. This was followed by an interactive focus group discussion on  the  reflections  created  by  the  buzz  pair  discussions (a guiding style was employed to facilitate this discussion). The  group  reflections  were  captured  on  a  flip  chart  by  a scribe.  Common  themes  were  identified.  Clarification  was sought and  validated immediately  with the workshop participants. A  preliminary  draft  of  this  report  was  shared with the workshop participants after the conference."
}

# Compute embeddings for documents
doc_embeddings = {doc: get_embedding(content) for doc, content in retrieved_docs.items()}

# Compute Average Cosine Similarity
avg_sim = avg_cosine_similarity(query, retrieved_docs.keys(), doc_embeddings)

print("Average Cosine Similarity:", avg_sim)


Average Cosine Similarity: 0.37750697


Alibaba

In [None]:
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-multilingual-base")
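# The model repository ships custom modeling code, so trust_remote_code=True is needed to load it
model = AutoModel.from_pretrained("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True).to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))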

def get_embedding(text):
    """Generate embedding using the Alibaba-NLP/gte-multilingual-base."""
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    # Move inputs to the same device as the model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    # Move the tensor to the CPU before converting to NumPy
    return outputs.last_hidden_state[:, 0, :].cpu().squeeze().numpy()  # Use CLS token embedding

def avg_cosine_similarity(query, retrieved_docs, doc_embeddings):
    """Compute the average cosine similarity between a query and retrieved documents."""
    query_embedding = get_embedding(query).reshape(1, -1)
    doc_vectors = np.array([doc_embeddings[doc] for doc in retrieved_docs])

    similarities = cosine_similarity(query_embedding, doc_vectors)[0]
    return np.mean(similarities)

# Example Query
query = "What is SAAFP?"

# Example Retrieved Documents
retrieved_docs = {
    "doc_1": "The World Organisation of Family Doctors and the South African Academy of Family Physicians (SAAFP)  have  established  standards  for  the  postgraduate  training  of  family  physicians. 8,9 However,  family  medicine  is  a  relatively  new  specialty  in many African countries, which adds to the challenges around training and supervision in the context of large rural areas, massive health needs and minimal resources. 10 The  aim  of  the  workshop  was  to  understand  how  family medicine registrars (postgraduate trainees in family medicine) in  Africa  learn  in  the  workplace. We  particularly  wanted  to explore  the  interaction  between  the  registrar  and  supervisor in the workplace, captured in a portfolio of learning, and in the African  context. We  sought  a  clearer  understanding  of  what it  means  to  be  observed  while  conducting  a  consultation  or performing  a  procedure,  as  well  as  understanding  the  local experience of giving or receiving feedback, and how various educational meetings are conducted.",
    "doc_2": "The group appreciated the richness of the discussion and the value  of  having  a  variety  of  countries  represented  in  the workshop. The group members expressed feeling encouraged and felt motivated to use 'small moments, little bits, part of the mini-CEX' during learning interactions in the workplace (teachable  moments).  This  will  allow  the  supervisors  and registrars to be 'more real' in the workplace, as opposed to striving for the hard-to-reach  perfect or ideal learning interactions. It will necessitate a more honest and pragmatic approach  to  harness  these  learning  moments.  Ongoing discussions  are  needed  around  the  validity  of  continuous assessments  in  the  workplace  for  national  examinations, such as the Fellowship of the College of Family Physicians of South Africa (FCFP[SA]), and the contribution of the learning portfolio to exit examination results. Collaborative training projects, like Training the Clinical Trainers (TCT) project and 'FamLEAP'  initiative,  are  trying  to  address  the  need  for training of supervisors in South Africa and also now Malawi and  other  countries  in  Africa  in  basic  workplace-based educational skills, such as formative assessment and giving feedback. 13",
    "doc_3": "Louis Jenkins, louis.jenkins@westerncape. gov.za ## Dates:Received: 28 Sept. 2017 Accepted: 09 Nov. 2017 Published: 12 Apr. 2018 How to cite this article: Jenkins LS, Von Pressentin K. Family medicine training in Africa: Views of clinical trainers and trainees. Afr J Prm Health Care Fam Med. 2018;10(1), a1638. https:// doi.org/10.4102/phcfm. v10i1.1638 # Copyright: © 2018. The Authors. Licensee: AOSIS. This work is licensed under the Creative Commons Attribution License. ![Image]/content/drive/MyDrive/document_rag/md/17_artifacts/image_000005_e2cece3be96aa05931eea2488c7312b12d82969056fd50762e3f32ae19090fd2.png) *Image Description:* This image features a QR code with instructions to Scan this QR code with your smart phone or mobile device to read online. It also mentions that online reading can be done by Read online. The text © 2018. The Authors. License: AOSIS. This work is licensed under the Creative Commons Attribution License.The design and information suggest it's related to learning or training programs for family medicine registrars in Africa. Objectives :  The  aim  of  the  workshop  was  to  understand  how  family  medicine  registrars (postgraduate trainees in family medicine) in Africa learn in the workplace. Methods : Thirty-five  trainers  and  registrars  from  nine  African  countries,  the  United Kingdom,  United  States  and  Sweden  participated.  South  Africa  was  represented  by  the universities of Cape Town, Limpopo, Pretoria, Sefako Makgatho, Stellenbosch, Walter Sisulu and Witwatersrand. Results: Six  major  themes  were  identified:  (1)  context  is  critical,  (2)  learning  style  of  the registrar and (teaching style) of the supervisor, (3) learning portfolio is utilised, (4) interactions between registrar and supervisor, (5) giving and receiving feedback and (6) the competence of the supervisor. Conclusion :  The  training of family physicians across Africa shares many common themes. 
However, there are also big differences among the various countries and even programmes within countries. The way forward would include exploring the local contextual enablers that influence the learning conversations between trainees and their supervisors. Family medicine training  institutions  and  organisations  (such  as  WONCA  Africa  and  the  South  African Academy of Family Physicians) have a critical role to play in supporting trainees and trainers towards  developing  local  competencies  which  facilitate  learning  in  the  clinical  workplace dominated by service delivery pressures."
}

# Compute embeddings for documents
doc_embeddings = {doc: get_embedding(content) for doc, content in retrieved_docs.items()}

# Compute Average Cosine Similarity
avg_sim = avg_cosine_similarity(query, retrieved_docs.keys(), doc_embeddings)

print("Average Cosine Similarity:", avg_sim)


tokenizer_config.json:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/964 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.43k [00:00<?, ?B/s]

The repository for Alibaba-NLP/gte-multilingual-base contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/Alibaba-NLP/gte-multilingual-base.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


configuration.py:   0%|          | 0.00/7.13k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/Alibaba-NLP/new-impl:
- configuration.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


The repository for Alibaba-NLP/gte-multilingual-base contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/Alibaba-NLP/gte-multilingual-base.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


modeling.py:   0%|          | 0.00/59.0k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/Alibaba-NLP/new-impl:
- modeling.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors:   0%|          | 0.00/611M [00:00<?, ?B/s]

Some weights of the model checkpoint at Alibaba-NLP/gte-multilingual-base were not used when initializing NewModel: {'classifier.weight', 'classifier.bias'}
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Average Cosine Similarity: 0.63800275
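
The cell above uses the hidden state of the first token (`[:, 0, :]`) as the sentence embedding, while the storage cells further down average all token states with `.mean(dim=1)`. Vectors produced by the two methods are not interchangeable, so queries and stored chunks should use the same pooling. A rough sketch of the two strategies on plain Python lists, assuming a single sequence of per-token vectors and an attention mask (`cls_pool` and `masked_mean_pool` are hypothetical helpers; the masked variant is an addition that skips padding positions, which the notebook's plain mean does not do):

```python
def cls_pool(token_vectors):
    """Return the first token's vector (CLS-style pooling)."""
    return list(token_vectors[0])

def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    kept = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vec):
                totals[i] += value
    return [t / kept for t in totals]

# Three real tokens followed by one padding position (toy values)
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]

cls_vec = cls_pool(tokens)
mean_vec = masked_mean_pool(tokens, mask)
print(cls_vec, mean_vec)
```

When only one text is embedded at a time, as in this notebook, there is no padding and the plain mean and the masked mean coincide; the difference only matters once texts are embedded in padded batches.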


# trial: separate data tables

In [None]:
!pip install supabase numpy psycopg2

In [40]:
import os
import json
import numpy as np
from supabase import create_client, Client

# Initialize Supabase
SUPABASE_URL = "https://vptbbrmqaqpsynvpizih.supabase.co"
# Read the key from the environment instead of hard-coding it (assumes SUPABASE_KEY is set, e.g. from Colab secrets)
SUPABASE_KEY = os.environ["SUPABASE_KEY"]

supabase: Client = create_client(SUPABASE_URL, SUPABASE_KEY)


In [21]:
import json
import torch
import uuid  # Generates a unique ID for each chunk
from transformers import AutoTokenizer, AutoModel
from supabase import create_client

# Load Embedding Model
tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-multilingual-base")
model = AutoModel.from_pretrained("Alibaba-NLP/gte-multilingual-base").to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)

    return outputs.last_hidden_state.mean(dim=1).squeeze().cpu().tolist()

# Load JSON chunks
json_file_path = "/content/PDF1.json"
with open(json_file_path, "r", encoding="utf-8") as json_file:
    json_chunks = json.load(json_file)

def store_chunks_in_supabase(chunks):
    """Stores chunks into Supabase, differentiating between text and tables."""
    document_entries = []
    table_entries = []

    for chunk in chunks:
        chunk_id = str(uuid.uuid4())  # Generate unique chunk_id

        if "content" in chunk and chunk["content"]:
            content = chunk["content"]
            embedding = get_embedding(content)

            document_entries.append({
                "chunk_id": chunk_id,
                "content": content,
                "embedding": embedding,
                "metadata": chunk["metadata"],
                "type": "text"
            })

        if "table" in chunk and chunk["table"]:
            table_data = chunk["table"]
            table_entries.append({
                "chunk_id": chunk_id,
                "table_data": json.dumps(table_data, ensure_ascii=False),
                "metadata": chunk["metadata"]
            })

    # Batch insert into Supabase for efficiency
    if document_entries:
        supabase.table("documents").insert(document_entries).execute()
    if table_entries:
        supabase.table("tables").insert(table_entries).execute()

# Store chunks in Supabase
store_chunks_in_supabase(json_chunks)

print("Chunks with embeddings stored successfully in Supabase!")

Chunks with embeddings stored successfully in Supabase!
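
The cell above sends every row to Supabase in one `insert` call. For larger documents it may be safer to send rows in smaller batches. A minimal sketch of the batching step only, with the batch size chosen arbitrarily and `iter_batches` being a hypothetical helper, not part of the Supabase client:

```python
def iter_batches(rows, batch_size=50):
    """Yield consecutive slices of rows, each holding at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# Stand-in rows; in the notebook these would be document_entries or table_entries
example_rows = [{"chunk_id": str(i)} for i in range(7)]
batch_sizes = [len(batch) for batch in iter_batches(example_rows, batch_size=3)]
print(batch_sizes)
```

Each batch could then be passed to `supabase.table("documents").insert(batch).execute()` in the same way the full list is passed in the cell above.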


revision

In [47]:
# Load Embedding Model
tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-multilingual-base")
model = AutoModel.from_pretrained("Alibaba-NLP/gte-multilingual-base").to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))

def get_embedding(text):
    """Generates an embedding vector from input text."""
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().cpu().tolist()

def convert_table_to_text(table_data):
    """Converts a table (headers + rows) into a structured text format for embedding."""
    headers = ", ".join(table_data["headers"])
    rows = [" | ".join(row) for row in table_data["rows"]]
    return f"Table: {headers}\n" + "\n".join(rows)

# Load JSON chunks
json_file_path = "/content/PDF1.json"
with open(json_file_path, "r", encoding="utf-8") as json_file:
    json_chunks = json.load(json_file)

def store_chunks_in_supabase(chunks):
    """Stores text and table chunks into Supabase with embeddings."""
    document_entries = []
    table_entries = []

    for chunk in chunks:
        chunk_id = str(uuid.uuid4())  # Generate unique chunk_id

        # Process text content
        if "content" in chunk and chunk["content"]:
            content = chunk["content"]
            embedding = get_embedding(content)

            document_entries.append({
                "chunk_id": chunk_id,
                "content": content,
                "embedding": embedding,
                "metadata": chunk["metadata"],
                "type": "text"
            })

        # Process table data
        if "table" in chunk and chunk["table"]:
            table_data = chunk["table"]
            table_text = convert_table_to_text(table_data)
            table_embedding = get_embedding(table_text)  # Generate embedding

            table_entries.append({
                "chunk_id": chunk_id,
                "table_data": json.dumps(table_data, ensure_ascii=False),
                "embedding": table_embedding,  # Store embedding
                "metadata": chunk["metadata"]
            })

    # Batch insert into Supabase
    if document_entries:
        supabase.table("documents").insert(document_entries).execute()
    if table_entries:
        supabase.table("tables").insert(table_entries).execute()

# Store chunks in Supabase
store_chunks_in_supabase(json_chunks)

print("Text and table embeddings stored successfully in Supabase!")

Text and table embeddings stored successfully in Supabase!
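
To see what text is actually embedded for a table in the revised cell, the conversion can be run on a small table. The sketch below repeats `convert_table_to_text` so that it runs on its own; the sample table is invented for illustration and does not come from `PDF1.json`.

```python
def convert_table_to_text(table_data):
    """Flatten a table (headers + rows) into text, as in the cell above."""
    headers = ", ".join(table_data["headers"])
    rows = [" | ".join(row) for row in table_data["rows"]]
    return f"Table: {headers}\n" + "\n".join(rows)

# Invented sample table (not taken from PDF1.json)
sample_table = {
    "headers": ["Country", "Participants"],
    "rows": [["A", "3"], ["B", "5"]],
}
table_text = convert_table_to_text(sample_table)
print(table_text)
```

In this format the header names appear once and each row is a bare list of values, so the embedding sees little about which value belongs to which column.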


In [58]:
import json
import uuid
import torch
from transformers import AutoTokenizer, AutoModel

# Load Embedding Model
tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-multilingual-base")
model = AutoModel.from_pretrained("Alibaba-NLP/gte-multilingual-base").to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))

def get_embedding(text):
    """Generates an embedding vector from input text."""
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().cpu().tolist()

def generate_table_description(table_data):
    """Generates a natural language description from a table's headers and rows."""
    headers = table_data["headers"]
    rows = table_data["rows"]

    description = []
    for row in rows:
        # zip pairs each header with its cell and tolerates rows shorter than the header
        row_text = ", ".join(f"{header}: {cell}" for header, cell in zip(headers, row))
        description.append(row_text)

    return " | ".join(description)  # Separate rows with "|"

def convert_table_to_text(table_data, metadata):
    """Converts a table (headers + rows) into a structured text format with metadata and description for embedding."""
    headers = ", ".join(table_data["headers"])
    rows = [" | ".join(row) for row in table_data["rows"]]

    # Retrieve metadata fields
    table_title = metadata.get("table_title", "Unknown Table")
    section = metadata.get("section", "Unknown Section")

    # Generate description from table data
    table_description = generate_table_description(table_data)

    # Combine metadata with table content
    return (
        f"Table Title: {table_title}. Section: {section}.\n"
        f"Table Data:\nHeaders: {headers}\n" + "\n".join(rows) +
        f"\nDescription: {table_description}"
    ), table_description  # Return both formatted text & natural description

# Load JSON chunks
json_file_path = "/content/PDF1.json"
with open(json_file_path, "r", encoding="utf-8") as json_file:
    json_chunks = json.load(json_file)

def store_chunks_in_supabase(chunks):
    """Stores text and table chunks into Supabase with improved embeddings."""
    document_entries = []
    table_entries = []

    for chunk in chunks:
        chunk_id = str(uuid.uuid4())  # Generate unique chunk_id

        # Process text content
        if "content" in chunk and chunk["content"]:
            content = chunk["content"]
            embedding = get_embedding(content)

            document_entries.append({
                "chunk_id": chunk_id,
                "content": content,
                "embedding": embedding,
                "metadata": chunk["metadata"],
                "type": "text"
            })

        # Process table data
        if "table" in chunk and chunk["table"]:
            table_data = chunk["table"]
            metadata = chunk.get("metadata", {})

            # ✅ Generate both structured table text & natural description
            table_text, table_description = convert_table_to_text(table_data, metadata)
            table_embedding = get_embedding(table_text)

            table_entries.append({
                "chunk_id": chunk_id,
                "table_data": json.dumps(table_data, ensure_ascii=False),
                "description": table_description,  # ✅ Store the generated description
                "embedding": table_embedding,
                "metadata": metadata
            })

    # Batch insert into Supabase
    if document_entries:
        supabase.table("documents").insert(document_entries).execute()
    if table_entries:
        supabase.table("tables").insert(table_entries).execute()

# Store chunks in Supabase
store_chunks_in_supabase(json_chunks)

print("Text and table embeddings stored successfully in Supabase!")


Text and table embeddings stored successfully in Supabase!


idea

In [57]:
import json

def generate_table_description(table_json):
    """Generates a natural language description of a structured table."""
    table_data = json.loads(table_json)
    headers = table_data["headers"]
    rows = table_data["rows"]

    description = []
    for row in rows:
        # zip pairs each header with its cell and tolerates rows shorter than the header
        row_description = ", ".join(f"{header}: {cell}" for header, cell in zip(headers, row))
        description.append(row_description)

    return " | ".join(description)  # Separate rows with "|"

# Example usage
table_json = """{
    "headers": ["Class", "Precision", "Recall", "F1-Score", "Support"],
    "rows": [
        ["0", "0.58", "0.91", "0.71", "32"],
        ["1", "0.73", "0.28", "0.40", "29"],
        ["Accuracy", "0.61", "0.61", "0.61", "0.61"],
        ["Macro Avg", "0.65", "0.59", "0.55", "61"],
        ["Weighted Avg", "0.65", "0.61", "0.56", "61"]
    ]
}"""

description = generate_table_description(table_json)
print(description)


Class: 0, Precision: 0.58, Recall: 0.91, F1-Score: 0.71, Support: 32 | Class: 1, Precision: 0.73, Recall: 0.28, F1-Score: 0.40, Support: 29 | Class: Accuracy, Precision: 0.61, Recall: 0.61, F1-Score: 0.61, Support: 0.61 | Class: Macro Avg, Precision: 0.65, Recall: 0.59, F1-Score: 0.55, Support: 61 | Class: Weighted Avg, Precision: 0.65, Recall: 0.61, F1-Score: 0.56, Support: 61


### **retrieval**


In [34]:
import ast
import numpy as np
import re
from scipy.spatial.distance import cosine

def query_supabase(user_query):
    """Retrieves both text and table chunks based on query."""

    #### 🔹 Step 1: Retrieve Text Chunks (Vector Search) ####
    query_embedding = np.array(get_embedding(user_query), dtype=np.float32).flatten()
    print(f"Query embedding shape: {query_embedding.shape}")  # Debugging

    response_text = supabase.table("documents").select("chunk_id, content, embedding").execute()
    text_results = []

    for record in response_text.data:
        chunk_embedding = record["embedding"]

        # Convert stored string embeddings to list if needed
        if isinstance(chunk_embedding, str):
            chunk_embedding = ast.literal_eval(chunk_embedding)

        chunk_embedding = np.array(chunk_embedding, dtype=np.float32).flatten()
        print(f"Chunk {record['chunk_id']} embedding shape: {chunk_embedding.shape}")  # Debugging

        if chunk_embedding.shape == query_embedding.shape:
            similarity = 1 - cosine(query_embedding, chunk_embedding)
            text_results.append((record["chunk_id"], "text", record["content"], similarity))
        else:
            print(f"⚠️ Skipping chunk {record['chunk_id']} due to shape mismatch.")

    #### 🔹 Step 2: Retrieve Table Chunks (Improved Keyword Search) ####
    response_tables = supabase.table("tables").select("chunk_id, table_data").execute()
    table_results = []

    query_words = set(re.findall(r'\w+', user_query.lower()))  # Extract words from query

    for record in response_tables.data:
        table_data = record["table_data"].lower()
        table_words = set(re.findall(r'\w+', table_data))  # Extract words from table

        common_words = query_words.intersection(table_words)  # Count overlapping words
        match_score = len(common_words) / max(len(query_words), 1)  # Normalize score

        if match_score > 0:  # Only include tables with at least one match
            table_results.append((record["chunk_id"], "table", table_data, match_score))

    #### 🔹 Step 3: Merge & Sort Results ####
    all_results = text_results + table_results
    all_results.sort(key=lambda x: x[3], reverse=True)  # Sort by relevance

    return all_results[:5]  # Return top 5 results
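
The table branch of the function above scores each table by the fraction of query words that also occur in the serialized table. A small standalone sketch of that score, using an invented table string (`keyword_overlap` is an illustrative name, not one used in the notebook):

```python
import re

def keyword_overlap(query, text):
    """Fraction of distinct query words that also appear in text (case-insensitive)."""
    query_words = set(re.findall(r"\w+", query.lower()))
    text_words = set(re.findall(r"\w+", text.lower()))
    return len(query_words & text_words) / max(len(query_words), 1)

# Invented serialized table, loosely shaped like the stored table_data JSON
table_json = '{"headers": ["Country", "Number"], "rows": [["Ireland", "4"], ["Kenya", "6"]]}'
score = keyword_overlap("Number of participants in Ireland", table_json)
print(score)
```

This overlap score and the cosine similarity used for text chunks are different quantities, so sorting both in one list, as the function above does, compares values that are not on a common scale.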


In [None]:
# Example usage
user_query = "Background of reports"
retrieved_chunks = query_supabase(user_query)

# Display results
for chunk in retrieved_chunks:
    print(f"Chunk ID: {chunk[0]}\nType: {chunk[1]}\nContent: {chunk[2][:300]}...\nRelevance: {chunk[3]:.4f}\n")

Query embedding shape: (768,)
Chunk b5ea80d5-dc09-4096-9603-833ab73a3504 embedding shape: (768,)
Chunk d73ba009-0f60-47de-a372-8fc11c669432 embedding shape: (768,)
Chunk 56322f14-18e5-41c8-9a94-69cc2dbc258b embedding shape: (768,)
Chunk 72e54398-d8bb-4ced-87c3-db5ba6e4e1ee embedding shape: (768,)
Chunk aca32bb6-707e-44ba-a6cc-3d3f80b847fc embedding shape: (768,)
Chunk d8a92179-46f3-44f4-a51b-26243f89c2b2 embedding shape: (768,)
Chunk 5531a33f-a880-42db-b934-a66d255cfcc8 embedding shape: (768,)
Chunk 4711385f-c69a-4ebf-ba9b-f1873ff82f60 embedding shape: (768,)
Chunk 1256b5a0-4ef8-458b-affe-b383a37b841f embedding shape: (768,)
Chunk 71c30bd2-e19f-4ae7-8ac7-e5e6ed238067 embedding shape: (768,)
Chunk 3c7266c0-b157-4107-9f05-a97d3d2eef84 embedding shape: (768,)
Chunk 19be3d97-ba94-4fc1-ad18-defb36c4e320 embedding shape: (768,)
Chunk 342f37ff-29b3-433f-89cc-a8ee114dfaf0 embedding shape: (768,)
Chunk 846094a2-0c29-4ab0-8963-20feb220b1e2 embedding shape: (768,)
Chunk ID: 71c30bd2-e19f-4ae7-8ac
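
The retrieval function converts embeddings that come back as strings with `ast.literal_eval`. Assuming the stored value is a bracketed, comma-separated list of numbers such as `"[0.1, 0.2]"` (an assumption about how the column is returned, not something shown in the output), `json.loads` parses the same text while accepting only JSON, which is a narrower format than Python literals. A sketch with a hypothetical `parse_embedding` helper:

```python
import json

def parse_embedding(value):
    """Return a list of floats from either a list or its JSON string form."""
    if isinstance(value, str):
        value = json.loads(value)
    return [float(x) for x in value]

from_string = parse_embedding("[0.1, 0.2, 0.3]")
from_list = parse_embedding([0.1, 0.2, 0.3])
print(from_string == from_list)
```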

In [None]:
# Example usage
user_query = "Number of participants in Ireland"
retrieved_chunks = query_supabase(user_query)

# Display results
for chunk in retrieved_chunks:
    print(f"Chunk ID: {chunk[0]}\nType: {chunk[1]}\nContent: {chunk[2][:300]}...\nRelevance: {chunk[3]:.4f}\n")

Query embedding shape: (768,)
Chunk ID: f2061f00-3b95-46e5-907

In [50]:
import ast
import json
import numpy as np
import re
from scipy.spatial.distance import cosine

def query_supabase(user_query):
    """Retrieves both text and table chunks based on query, using improved embeddings."""

    #### 🔹 Step 1: Get Query Embedding ####
    query_embedding = np.array(get_embedding(user_query), dtype=np.float32).flatten()

    #### 🔹 Step 2: Retrieve Text Chunks (Vector Search) ####
    response_text = supabase.table("documents").select("chunk_id, content, embedding, type, metadata").execute()
    text_results = []

    for record in response_text.data:
        chunk_embedding = record["embedding"]

        # Convert stored string embeddings to list if needed
        if isinstance(chunk_embedding, str):
            chunk_embedding = ast.literal_eval(chunk_embedding)

        chunk_embedding = np.array(chunk_embedding, dtype=np.float32).flatten()

        if chunk_embedding.shape == query_embedding.shape:
            similarity = 1 - cosine(query_embedding, chunk_embedding)
            text_results.append((record["chunk_id"], "text", record["content"], similarity))

    #### 🔹 Step 3: Retrieve Table Chunks (Keyword + Embedding Match) ####
    response_tables = supabase.table("tables").select("chunk_id, table_data, metadata").execute()
    table_results = []

    for record in response_tables.data:
        table_data = record["table_data"]
        metadata = record.get("metadata") or {}  # Treat a NULL metadata column as empty

        # 🔥 FIX: Ensure table_title and section are always strings
        # (str(None) would yield the literal "None", so fall back to "" explicitly)
        table_title = str(metadata.get("table_title") or "")
        section = str(metadata.get("section") or "")

        # 🔥 FIX: Avoid embedding raw table numbers! Only use metadata.
        table_representation = f"Table Title: {table_title}. Section: {section}."

        # 🔹 Step 3.1: Keyword Matching for Table Title & Section
        keyword_match_score = 0
        if table_title and re.search(rf"\b{re.escape(user_query)}\b", table_title, re.IGNORECASE):
            keyword_match_score += 0.5  # Higher weight for title match
        if section and re.search(rf"\b{re.escape(user_query)}\b", section, re.IGNORECASE):
            keyword_match_score += 0.3  # Lower weight for section match

        # 🔹 Step 3.2: Get Embedding for the Table Representation
        table_embedding = get_embedding(table_representation)
        table_embedding = np.array(table_embedding, dtype=np.float32).flatten()

        if table_embedding.shape == query_embedding.shape:
            similarity = 1 - cosine(query_embedding, table_embedding)
            final_score = similarity + keyword_match_score  # Combine similarity & keyword match
            table_results.append((record["chunk_id"], "table", table_representation, final_score))

    #### 🔹 Step 4: Merge & Sort Results ####
    all_results = text_results + table_results
    all_results.sort(key=lambda x: x[3], reverse=True)  # Sort by final similarity score

    return all_results[:5]  # Return top 5 results
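
In the function above, the keyword bonus is added only when the entire query string occurs in the table title or section, so a query like "recall in table 2" earns no bonus from a title that contains "Table 2" among other words. One possible alternative, sketched here as an assumption and not as the notebook's method, scales the same 0.5 and 0.3 weights by the share of query tokens found in each field (`token_share` and `keyword_bonus` are hypothetical names):

```python
import re

def token_share(query, field):
    """Share of distinct query tokens that occur in field (case-insensitive)."""
    query_tokens = set(re.findall(r"\w+", query.lower()))
    field_tokens = set(re.findall(r"\w+", field.lower()))
    if not query_tokens:
        return 0.0
    return len(query_tokens & field_tokens) / len(query_tokens)

def keyword_bonus(query, table_title, section, title_weight=0.5, section_weight=0.3):
    """Weighted bonus; the default weights mirror the 0.5 / 0.3 values used above."""
    return (title_weight * token_share(query, table_title)
            + section_weight * token_share(query, section))

title = "Table 2: Classification Report for Best Volatility Direction Model"
section = "3.1 Data and LDA Model Results"
bonus = keyword_bonus("recall in table 2", title, section)
print(round(bonus, 3))
```

Here two of the four query tokens ("table" and "2") appear in the title and none in the section, so the title contributes half of its weight and the section contributes nothing.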


In [51]:
user_query = "recall in table 2"
retrieved_chunks = query_supabase(user_query)

for chunk in retrieved_chunks:
    print(f"Chunk ID: {chunk[0]}\nType: {chunk[1]}\nContent: {chunk[2][:300]}...\nRelevance: {chunk[3]:.4f}\n")


Chunk ID: 053a793c-d231-4950-80ba-3c109f62d77b
Type: text
Content: ## 3.1 Data and LDA Model Results
For the second, the 'overall daily average" sentiment score: Table 2: Classification Report for Best Volatility Direction Model...
Relevance: 0.8032

Chunk ID: d32f1950-9c72-42a4-9d96-b1a824812fdb
Type: text
Content: like Eikon. <!-- formula-not-decoded --> We propose that shifts in sentiment direction affect the relevance of historical data. This is incorporated into the construction of indicator functions in the new sentiment score equation, adjusting the weight of past data based on directional changes....
Relevance: 0.8010

Chunk ID: d0bd8d46-3596-46ae-b123-11485e6c4d32
Type: table
Content: Table Title: Table 2: Classification Report for Best Volatility Direction Model. Section: 3.1 Data and LDA Model Results....
Relevance: 0.7988

Chunk ID: afd90a2f-f138-4da0-85ca-756c4ff3567d
Type: text
Content: ## 3.1 Data and LDA Model Results
For the third data input, the 'overall daily average
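
In the run above, the Table 2 chunk ranks third and appears to receive no keyword boost. A likely reason is that the regex looks for the entire query as one phrase, and "recall in table 2" never occurs verbatim in the title. A minimal sketch of that behaviour, next to a per-token alternative (the helper `token_overlap` is hypothetical and not part of the notebook):

```python
import re

user_query = "recall in table 2"
table_title = "Table 2: Classification Report for Best Volatility Direction Model"

# Whole-query phrase match, as in query_supabase: the full query must appear verbatim.
phrase_match = re.search(rf"\b{re.escape(user_query)}\b", table_title, re.IGNORECASE)

def token_overlap(query, text):
    """Hypothetical helper: fraction of query tokens that also occur in text."""
    query_tokens = set(re.findall(r"\w+", query.lower()))
    text_tokens = set(re.findall(r"\w+", text.lower()))
    return len(query_tokens & text_tokens) / max(len(query_tokens), 1)

print(phrase_match)                            # None: no boost is applied
print(token_overlap(user_query, table_title))  # 0.5: "table" and "2" are shared
```

With phrase matching, the 0.5 and 0.3 boosts only fire when the query is itself a substring of the title or section, which is rare for natural-language questions.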

In [74]:
############ worked ####################

In [66]:
import ast
import json
import numpy as np
import re
from scipy.spatial.distance import cosine

def query_supabase(user_query):
    """Retrieves both text and table chunks based on query, using improved embeddings."""

    #### 🔹 Step 1: Get Query Embedding ####
    query_embedding = np.array(get_embedding(user_query), dtype=np.float32).flatten()

    #### 🔹 Step 2: Retrieve Text Chunks (Vector Search) ####
    response_text = supabase.table("documents").select("chunk_id, content, embedding, type, metadata").execute()
    text_results = []

    for record in response_text.data:
        chunk_embedding = record["embedding"]

        # Convert stored string embeddings to list if needed
        if isinstance(chunk_embedding, str):
            chunk_embedding = ast.literal_eval(chunk_embedding)

        chunk_embedding = np.array(chunk_embedding, dtype=np.float32).flatten()

        if chunk_embedding.shape == query_embedding.shape:
            similarity = 1 - cosine(query_embedding, chunk_embedding)
            text_results.append((record["chunk_id"], "text", record["content"], similarity))

    #### 🔹 Step 3: Retrieve Table Chunks (Description + Embedding Match) ####
    response_tables = supabase.table("tables").select("chunk_id, table_data, description, embedding, metadata").execute()
    table_results = []

    for record in response_tables.data:
        table_data = record["table_data"]
        metadata = record.get("metadata") or {}  # Treat a null metadata column as empty
        table_description = record.get("description") or ""  # Use generated description; null becomes ""
        table_embedding = record.get("embedding", None)

        # 🔥 Ensure metadata fields are strings (None becomes "", not "None")
        table_title = str(metadata.get("table_title") or "")
        section = str(metadata.get("section") or "")

        # Extract table number from the query (if any)
        table_number_match = re.search(r'table (\d+)', user_query, re.IGNORECASE)
        specified_table_number = table_number_match.group(1) if table_number_match else None

        # 🔹 Step 3.1: Keyword Matching for Table Title, Section & Description
        keyword_match_score = 0
        if re.search(rf"\b{re.escape(user_query)}\b", table_title, re.IGNORECASE):
            keyword_match_score += 0.5  # Higher weight for title match
        if re.search(rf"\b{re.escape(user_query)}\b", section, re.IGNORECASE):
            keyword_match_score += 0.3  # Lower weight for section match
        if re.search(rf"\b{re.escape(user_query)}\b", table_description, re.IGNORECASE):
            keyword_match_score += 0.7  # Highest weight for description match

        # Prioritize the exact table number if mentioned; match it as a whole number
        # so that "2" does not also hit "Table 12" or "Table 20"
        if specified_table_number and re.search(rf"\b{specified_table_number}\b", table_title):
            keyword_match_score += 1.0  # Give a strong boost to matching table numbers

        # 🔹 Step 3.2: Compute Embedding Similarity
        if table_embedding:
            if isinstance(table_embedding, str):
                table_embedding = ast.literal_eval(table_embedding)  # Convert string to list
            table_embedding = np.array(table_embedding, dtype=np.float32).flatten()

            if table_embedding.shape == query_embedding.shape:
                similarity = 1 - cosine(query_embedding, table_embedding)
                final_score = (0.7 * similarity) + (1.3 * keyword_match_score)  # Boost keyword matching
                table_results.append((record["chunk_id"], "table", table_description, final_score))

    #### 🔹 Step 4: Merge & Sort Results ####
    all_results = text_results + table_results
    all_results.sort(key=lambda x: x[3], reverse=True)  # Sort by final similarity score

    return all_results[:5]  # Return top 5 results


In [68]:
user_query = "recall in table 2"
retrieved_chunks = query_supabase(user_query)

for chunk in retrieved_chunks:
    print(f"Chunk ID: {chunk[0]}\nType: {chunk[1]}\nContent: {chunk[2][:300]}...\nRelevance: {chunk[3]:.4f}\n")


Chunk ID: bcb7d0aa-980c-43de-ab3f-ea80e818f23a
Type: table
Content: Class: 0, Precision: 0.58, Recall: 0.91, F1-Score: 0.71, Support: 32 | Class: 1, Precision: 0.73, Recall: 0.28, F1-Score: 0.4, Support: 29 | Class: Accuracy, Precision: 0.61, Recall: 0.61, F1-Score: 0.61, Support: 0.61 | Class: Macro Avg, Precision: 0.65, Recall: 0.59, F1-Score: 0.55, Support: 61 | ...
Relevance: 1.8349

Chunk ID: 3d2f4141-fc5f-46d4-8e18-c7c8acec3f15
Type: text
Content: ## 3.1 Data and LDA Model Results
For the second, the 'overall daily average" sentiment score: Table 2: Classification Report for Best Volatility Direction Model...
Relevance: 0.8032

Chunk ID: d717039a-9137-4508-9849-bed4fec39b35
Type: text
Content: like Eikon. <!-- formula-not-decoded --> We propose that shifts in sentiment direction affect the relevance of historical data. This is incorporated into the construction of indicator functions in the new sentiment score equation, adjusting the weight of past data based on directional chang

# Chat with LLM

In [None]:
!pip install groq

In [23]:
import ast
import numpy as np
import re
import requests
from scipy.spatial.distance import cosine

# Your Groq API key, read from the environment so the secret is not stored in the notebook
import os
GROQ_API_KEY = os.environ["GROQ_API_KEY"]

def query_supabase(user_query):
    """Retrieves both text and table chunks based on query."""

    #### 🔹 Step 1: Retrieve Text Chunks (Vector Search) ####
    query_embedding = np.array(get_embedding(user_query), dtype=np.float32).flatten()
    response_text = supabase.table("documents").select("chunk_id, content, embedding").execute()
    text_results = []

    for record in response_text.data:
        chunk_embedding = record["embedding"]

        # Convert stored string embeddings to list if needed
        if isinstance(chunk_embedding, str):
            chunk_embedding = ast.literal_eval(chunk_embedding)

        chunk_embedding = np.array(chunk_embedding, dtype=np.float32).flatten()

        if chunk_embedding.shape == query_embedding.shape:
            similarity = 1 - cosine(query_embedding, chunk_embedding)
            text_results.append((record["chunk_id"], "text", record["content"], similarity))

    #### 🔹 Step 2: Retrieve Table Chunks (Improved Keyword Search) ####
    response_tables = supabase.table("tables").select("chunk_id, table_data").execute()
    table_results = []

    query_words = set(re.findall(r'\w+', user_query.lower()))  # Extract words from query

    for record in response_tables.data:
        table_data = record["table_data"].lower()
        table_words = set(re.findall(r'\w+', table_data))  # Extract words from table

        common_words = query_words.intersection(table_words)  # Count overlapping words
        match_score = len(common_words) / max(len(query_words), 1)  # Normalize score

        if match_score > 0:  # Only include tables with at least one match
            table_results.append((record["chunk_id"], "table", table_data, match_score))

    #### 🔹 Step 3: Merge & Sort Results ####
    all_results = text_results + table_results
    all_results.sort(key=lambda x: x[3], reverse=True)  # Sort by relevance

    return all_results[:5]  # Return top 5 results


def call_groq_llm(user_query, retrieved_chunks):
    """Send the query along with retrieved context to Groq API and return the response."""

    # Print retrieved chunks for debugging
    print("\n🔹 Retrieved Chunks:")
    for i, chunk in enumerate(retrieved_chunks, 1):
        print(f"Chunk {i} (ID: {chunk[0]}, Type: {chunk[1]}):\n{chunk[2][:500]}...\nRelevance: {chunk[3]:.4f}\n")

    # Prepare context for LLM
    context_text = "\n\n".join([f"Chunk {i+1}: {chunk[2]}" for i, chunk in enumerate(retrieved_chunks)])

    prompt = f"""You are an intelligent assistant. Use the following retrieved information to answer the user's query.

    Context:
    {context_text}

    User's Question: {user_query}

    Provide a clear and concise response.
    """

    url = "https://api.groq.com/openai/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {GROQ_API_KEY}",
        "Content-Type": "application/json"
    }
    data = {
        "model": "qwen-qwq-32b",  # Adjust this based on your Groq model selection
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.7
    }

    # Set a timeout so a stalled request does not hang the notebook cell
    response = requests.post(url, json=data, headers=headers, timeout=60)

    if response.status_code == 200:
        answer = response.json()["choices"][0]["message"]["content"]
        print("\n🔹 Chatbot Response:\n", answer)
        return answer
    else:
        print("\n⚠️ Error:", response.text)
        return None
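
The request body built inside `call_groq_llm` can also be assembled separately, which makes the prompt easy to inspect without calling the API. A minimal sketch, assuming chunks are `(chunk_id, type, content, score)` tuples as returned by `query_supabase`; the helper names `build_context` and `build_payload` and the placeholder chunk id are hypothetical:

```python
def build_context(retrieved_chunks):
    """Join retrieved chunk texts into one numbered context string."""
    return "\n\n".join(
        f"Chunk {i}: {chunk[2]}" for i, chunk in enumerate(retrieved_chunks, 1)
    )

def build_payload(user_query, retrieved_chunks, model="qwen-qwq-32b", temperature=0.7):
    """Hypothetical helper: assemble the chat-completions request body."""
    prompt = (
        "Use the following retrieved information to answer the user's query.\n\n"
        f"Context:\n{build_context(retrieved_chunks)}\n\n"
        f"User's Question: {user_query}\n\n"
        "Provide a clear and concise response."
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": temperature,
    }

# "id-1" is a placeholder id; the content is a fragment of the Table 2 description.
payload = build_payload(
    "recall in table 2",
    [("id-1", "table", "Class: 1, Recall: 0.28", 1.83)],
)
print(payload["messages"][1]["content"])
```

Only the chunk content (index 2) reaches the model; the id, type, and relevance score are used for debugging output alone.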



In [None]:
# Example usage
user_query = "Number of participants in Ireland"
retrieved_chunks = query_supabase(user_query)

if retrieved_chunks:
    call_groq_llm(user_query, retrieved_chunks)
else:
    print("No relevant information found.")



🔹 Retrieved Chunks:
Chunk 1 (ID: f2061f00-3b95-46e5-907f-05b74dd09987, Type: table):
{"headers": ["country", "number of participants"], "rows": [["botswana", "2"], ["ethiopia", "1"], ["ireland", "1"], ["kenya", "1"], ["lesotho", "4"], ["malawi", "3"], ["nigeria", "2"], ["south africa", "12"], ["sweden", "1"], ["uganda", "1"], ["united kingdom", "2"], ["united states (involved with lesotho  programme)", "2"], ["zimbabwe", "3"], ["total", "35"]]}...
Relevance: 0.8000

Chunk 2 (ID: aca32bb6-707e-44ba-a6cc-3d3f80b847fc, Type: text):
## Participants and process  
Thirty-five people  participated  in  a  2-h  workshop  and included trainers and trainees from nine African countries, the United Kingdom, United States and Sweden (see Table 1). South  Africa  was  represented  by  the  universities  of  Cape Town,  Limpopo,  Pretoria,  Sefako  Makgatho,  Stellenbosch, Walter Sisulu and Witwatersrand.  
We started with an introduction and then divided into buzz pairs (pairs were allowed to form 
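
The table relevance of 0.8000 for the Ireland query follows from the word-overlap score in Step 2. A minimal sketch that reproduces it on a shortened copy of the table (only a few rows from the output are kept, for brevity):

```python
import re

user_query = "Number of participants in Ireland"
# Shortened copy of the retrieved table_data string (already lowercased).
table_data = (
    '{"headers": ["country", "number of participants"], '
    '"rows": [["ireland", "1"], ["lesotho", "4"], ["total", "35"]]}'
)

query_words = set(re.findall(r"\w+", user_query.lower()))
table_words = set(re.findall(r"\w+", table_data))

common_words = query_words & table_words
match_score = len(common_words) / max(len(query_words), 1)

print(sorted(common_words))  # ['ireland', 'number', 'of', 'participants']
print(match_score)           # 0.8
```

Four of the five query words occur in the table; only "in" is missing. Function words such as "of" count as much as "ireland" here, and the overlap ratio is not on the same scale as the cosine similarities it is sorted against.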

In [None]:
# Example usage
user_query = "Number of participants in Lesotho"
retrieved_chunks = query_supabase(user_query)

if retrieved_chunks:
    call_groq_llm(user_query, retrieved_chunks)
else:
    print("No relevant information found.")



🔹 Retrieved Chunks:
Chunk 1 (ID: aca32bb6-707e-44ba-a6cc-3d3f80b847fc, Type: text):
## Participants and process  
Thirty-five people  participated  in  a  2-h  workshop  and included trainers and trainees from nine African countries, the United Kingdom, United States and Sweden (see Table 1). South  Africa  was  represented  by  the  universities  of  Cape Town,  Limpopo,  Pretoria,  Sefako  Makgatho,  Stellenbosch, Walter Sisulu and Witwatersrand.  
We started with an introduction and then divided into buzz pairs (pairs were allowed to form spontaneously, regardless of the tra...
Relevance: 0.8174

Chunk 2 (ID: 19be3d97-ba94-4fc1-ad18-defb36c4e320, Type: text):
## Reflection on workshop  
The group appreciated the richness of the discussion and the value  of  having  a  variety  of  countries  represented  in  the workshop. The group members expressed feeling encouraged and felt motivated to use 'small moments, little bits, part of the mini-CEX' during learning interactions in the wo

PDF1 test

In [73]:
################ worked #####################

In [70]:
def call_groq_llm(user_query, retrieved_chunks):
    """Send the query along with retrieved context to Groq API and return the response."""

    # Print retrieved chunks for debugging
    print("\n🔹 Retrieved Chunks:")
    for i, chunk in enumerate(retrieved_chunks, 1):
        print(f"Chunk {i} (ID: {chunk[0]}, Type: {chunk[1]}):\n{chunk[2][:500]}...\nRelevance: {chunk[3]:.4f}\n")

    # Prepare context for LLM
    context_text = "\n\n".join([f"Chunk {i+1}: {chunk[2]}" for i, chunk in enumerate(retrieved_chunks)])

    prompt = f"""You are an intelligent assistant. Use the following retrieved information to answer the user's query.

    Context:
    {context_text}

    User's Question: {user_query}

    Provide a clear and concise response.
    """

    url = "https://api.groq.com/openai/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {GROQ_API_KEY}",
        "Content-Type": "application/json"
    }
    data = {
        "model": "qwen-qwq-32b",  # Adjust this based on your Groq model selection
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.7
    }

    # Set a timeout so a stalled request does not hang the notebook cell
    response = requests.post(url, json=data, headers=headers, timeout=60)

    if response.status_code == 200:
        answer = response.json()["choices"][0]["message"]["content"]
        print("\n🔹 Chatbot Response:\n", answer)
        return answer
    else:
        print("\n⚠️ Error:", response.text)
        return None


In [72]:
# Example usage
user_query = "recall for class 1 in table 2"
retrieved_chunks = query_supabase(user_query)

if retrieved_chunks:
    call_groq_llm(user_query, retrieved_chunks)
else:
    print("No relevant information found.")



🔹 Retrieved Chunks:
Chunk 1 (ID: bcb7d0aa-980c-43de-ab3f-ea80e818f23a, Type: table):
Class: 0, Precision: 0.58, Recall: 0.91, F1-Score: 0.71, Support: 32 | Class: 1, Precision: 0.73, Recall: 0.28, F1-Score: 0.4, Support: 29 | Class: Accuracy, Precision: 0.61, Recall: 0.61, F1-Score: 0.61, Support: 0.61 | Class: Macro Avg, Precision: 0.65, Recall: 0.59, F1-Score: 0.55, Support: 61 | Class: Weighted Avg, Precision: 0.65, Recall: 0.61, F1-Score: 0.56, Support: 61...
Relevance: 1.8561

Chunk 2 (ID: 3d2f4141-fc5f-46d4-8e18-c7c8acec3f15, Type: text):
## 3.1 Data and LDA Model Results
For the second, the 'overall daily average" sentiment score: Table 2: Classification Report for Best Volatility Direction Model...
Relevance: 0.8323

Chunk 3 (ID: 091e3b43-4b04-4bc2-894d-025702a4b579, Type: text):
## 3.1 Data and LDA Model Results
For the third data input, the 'overall daily average title" sentiment score, the result is Table 3: Classification Report for Best Volatility Direction Model...
Relev