## Asymmetric Semantic Search

For asymmetric semantic search, you usually have a short query (like a question or some keywords) and you want to find a longer paragraph answering the query.
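To make the idea concrete, here is a minimal sketch of the ranking step, assuming the query and the passages have already been turned into vectors. The three-dimensional vectors and the helper names (`cosine_similarity`, `rank_passages`) are made up for illustration; real embeddings come from the model used later in this notebook and have many more dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_passages(query_vec, passage_vecs, top_k=2):
    """Return (passage_index, score) pairs, best match first."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(passage_vecs)]
    scored.sort(reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]

# Toy vectors (hypothetical): one short query, three longer passages
query_vec = [1.0, 0.0, 0.0]
passage_vecs = [
    [0.0, 1.0, 0.0],  # unrelated passage
    [0.9, 0.1, 0.0],  # passage that answers the query
    [0.5, 0.5, 0.0],  # partially related passage
]
top = rank_passages(query_vec, passage_vecs)
```

In this toy setup, passage 1 ranks first and passage 2 second; a vector store performs the same kind of comparison at scale.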

### Caution

The language matters: some models focus on English, while others need to be told which language to use. The model used here is multilingual; it is slower, but the results seem good.

In [1]:
import fitz # requires pymupdf
from tqdm.auto import tqdm # for progress bars, requires tqdm
import re

pdf_path = "./example.pdf"

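def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    # Note: these rules are specific to this PDF; other documents may need different ones (best to experiment)
    cleaned_text = text.replace("\n", " ")  # join lines into one string
    cleaned_text = re.sub(r' ! ', '', cleaned_text)  # drop stray " ! " sequences
    cleaned_text = re.sub(r'-\s+', '', cleaned_text)  # rejoin words hyphenated across line breaks
    cleaned_text = re.sub(r'\s\s+', ' ', cleaned_text)  # collapse runs of whitespace
    cleaned_text = re.sub(r'(\d)!', r'\1€', cleaned_text)  # "!" right after a digit becomes "€" (apparently a mis-extracted euro sign)

    return cleaned_text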

# Open PDF and get lines/pages
# Note: this only focuses on text
def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Opens a PDF file, reads its text content page by page, and collects statistics.

    Parameters:
        pdf_path (str): The file path to the PDF document to be opened and read.

    Returns:
        list[dict]: A list of dictionaries, each containing the page number
        (adjusted), character count, word count, sentence count, token count, and the extracted text
        for each page.
    """
    doc = fitz.open(pdf_path)  # open a document
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):  # iterate the document pages
        text = page.get_text()  # get plain text encoded as UTF-8
        text = text_formatter(text)
        pages_and_texts.append({
            "page_number": page_number + 1,
            "page_char_count": len(text),
            "page_word_count": len(text.split(" ")),
            "page_sentence_count_raw": len(text.split(". ")),
            "page_token_count": len(text) / 4,  # 1 token = ~4 chars
            "text": text
        })

    return pages_and_texts


In [2]:
pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:3]


[{'page_number': 1,
  'page_char_count': 472,
  'page_word_count': 67,
  'page_sentence_count_raw': 6,
  'page_token_count': 118.0,
  'text': 'INF 0122 DOCUMENTI INFORMATIVI Documenti informativi relativi al contatto per la ricezione e trasmissione di ordini, nonché esecuzione per conto del Cliente, collocamento e servizi accessori. 1. Informativa Pre-contrattuale – cliente al dettaglio – ed. ottobre 2021 2. Informativa Privacy, ai sensi dell’art.13, del Regolamento UE n.679/2016 (regolamento europeo in materia di protezione dei dati personali “GDPR”) 3. Allegato Economico (Allegato 1) – costi e commissioni '},
 {'page_number': 2,
  'page_char_count': 3910,
  'page_word_count': 550,
  'page_sentence_count_raw': 18,
  'page_token_count': 977.5,
  'text': "1/12 PRE 0124 Directa Società di intermediazione Mobiliare per Azioni Iscritta all’albo delle SIM al n° 59 Codice fiscale e partita iva e iscrizione al registro delle imprese n° 06837440012 Sede legale: Via Bruno Buozzi n° 5 – 10121 To

In [3]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | 1 | 472 | 67 | 6 | 118.0 | INF 0122 DOCUMENTI INFORMATIVI Documenti infor... |
| 1 | 2 | 3910 | 550 | 18 | 977.5 | 1/12 PRE 0124 Directa Società di intermediazio... |
| 2 | 3 | 7265 | 1051 | 25 | 1816.25 | 2/12 www.consob.it e/o richieste a CONSOB, 001... |
| 3 | 4 | 6761 | 954 | 26 | 1690.25 | 3/12 prelevati su richiesta del Cliente dirett... |
| 4 | 5 | 7676 | 1119 | 30 | 1919.0 | 4/12 schio. A rendimenti potenziali maggiori, ... |


In [4]:
from spacy.lang.it import Italian

nlp = Italian()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer/
nlp.add_pipe("sentencizer")

for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure all sentences are strings (by default they are Spans)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])



In [5]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 17.0 | 17.0 | 17.0 | 17.0 | 17.0 | 17.0 |
| mean | 9.0 | 5686.06 | 819.29 | 24.71 | 1421.51 | 25.71 |
| std | 5.05 | 1947.77 | 279.84 | 10.18 | 486.94 | 10.19 |
| min | 1.0 | 472.0 | 67.0 | 6.0 | 118.0 | 6.0 |
| 25% | 5.0 | 4670.0 | 715.0 | 18.0 | 1167.5 | 19.0 |
| 50% | 9.0 | 6549.0 | 941.0 | 25.0 | 1637.25 | 26.0 |
| 75% | 13.0 | 7255.0 | 1039.0 | 36.0 | 1813.75 | 31.0 |
| max | 17.0 | 7806.0 | 1119.0 | 39.0 | 1951.5 | 47.0 |


In [6]:
num_sentence_chunk_size = 5

# Create a function that splits a list into consecutive slices of the desired size
def split_list(input_list: list[str],
               slice_size: int) -> list[list[str]]:
    """
    Splits input_list into sublists of size slice_size (the last one may be shorter).

    For example, with slice_size=10 a list of 17 sentences is split into
    two sublists of 10 and 7 sentences.
    """
    # No overlap here, but it could be useful to add at least one sentence of overlap
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])
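The comment in `split_list` mentions that overlapping chunks could be useful. A minimal sketch of that variant follows; the function name `split_list_with_overlap` and the `overlap` parameter are assumptions for illustration and are not used elsewhere in this notebook.

```python
def split_list_with_overlap(input_list: list,
                            slice_size: int,
                            overlap: int = 1) -> list[list]:
    """
    Splits input_list into sublists of at most slice_size items, where each
    sublist starts with the last `overlap` items of the previous one.
    """
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    if not 0 <= overlap < slice_size:
        raise ValueError("overlap must be in [0, slice_size)")

    step = slice_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(input_list), step):
        chunks.append(input_list[start:start + slice_size])
        # Stop once the window has reached the end of the list,
        # so no trailing chunk made only of repeated items is produced
        if start + slice_size >= len(input_list):
            break
    return chunks
```

With `overlap=0` this behaves like `split_list`. A larger overlap keeps more context across chunk boundaries at the cost of more chunks to embed.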


In [7]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 17.0 | 17.0 | 17.0 | 17.0 | 17.0 | 17.0 | 17.0 |
| mean | 9.0 | 5686.06 | 819.29 | 24.71 | 1421.51 | 25.71 | 5.71 |
| std | 5.05 | 1947.77 | 279.84 | 10.18 | 486.94 | 10.19 | 2.08 |
| min | 1.0 | 472.0 | 67.0 | 6.0 | 118.0 | 6.0 | 2.0 |
| 25% | 5.0 | 4670.0 | 715.0 | 18.0 | 1167.5 | 19.0 | 4.0 |
| 50% | 9.0 | 6549.0 | 941.0 | 25.0 | 1637.25 | 26.0 | 6.0 |
| 75% | 13.0 | 7255.0 | 1039.0 | 36.0 | 1813.75 | 31.0 | 7.0 |
| max | 17.0 | 7806.0 | 1119.0 | 39.0 | 1951.5 | 47.0 | 10.0 |


In [8]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences into a single paragraph-like string, aka a chunk
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        # ".A" -> ". A" for a full stop directly followed by an ASCII capital letter
        # (accented capitals and digits are not matched)
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters

        pages_and_chunks.append(chunk_dict)

# How many chunks do we have?
len(pages_and_chunks)


97
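The `[A-Z]` class in the regex above covers ASCII capitals only, so a sentence boundary followed by an accented capital (for example `predeterminata.È`) stays glued together. One possible Unicode-aware variant is sketched below; the helper name `space_after_sentence_period` is hypothetical.

```python
import re

def space_after_sentence_period(text: str) -> str:
    """Insert a space after a period that is directly followed by an uppercase letter."""
    def fix(match: re.Match) -> str:
        next_char = match.group(1)
        # str.isupper() is Unicode-aware, so accented capitals such as "È" count too;
        # lowercase letters and digits leave the match unchanged
        return ". " + next_char if next_char.isupper() else match.group(0)

    return re.sub(r"\.(\w)", fix, text)
```

Like the original regex, this leaves lowercase continuations such as `www.directa.it` alone, and it still splits abbreviations written with capitals, so it remains a heuristic.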

In [9]:
import random
random.sample(pages_and_chunks, k=3)

[{'page_number': 8,
  'sentence_chunk': "Prima di effettuare l’investimento è quindi necessario accertarsi di quale sia il costo giornaliero da sostenere per mantenere la posizione aperta e valutare quindi di poter affrontare la strategia di trading prescelta.5) I covered warrant I covered warrant è un titolo che incorpora una opzione per acquistare o vendere un certo bene in una data futura predeterminata.È generalmente emesso da banche o imprese di investimento e quotato su mercati regolamentati. Trattandosi di un'opzione, il portatore del titolo ha la facoltà, ma non l'obbligo, di concludere l'acquisto o la vendita. In relazione alla natura del diritto si distinguono i covered warrant di tipo call (diritto ad acquistare) e quelli di tipo put (diritto a vendere).",
  'chunk_char_count': 734,
  'chunk_word_count': 114,
  'chunk_token_count': 183.5},
 {'page_number': 13,
  'sentence_chunk': "SEZIONE H INFORMAZIONI CONCERNENTI I TERMINI DEL CONTRATTO Per i termini del contratto si riman

In [10]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 97.0 | 97.0 | 97.0 | 97.0 |
| mean | 9.45 | 994.67 | 142.56 | 248.67 |
| std | 4.65 | 583.36 | 83.87 | 145.84 |
| min | 1.0 | 53.0 | 8.0 | 13.25 |
| 25% | 6.0 | 550.0 | 81.0 | 137.5 |
| 50% | 9.0 | 961.0 | 135.0 | 240.25 |
| 75% | 14.0 | 1280.0 | 186.0 | 320.0 |
| max | 17.0 | 2735.0 | 404.0 | 683.75 |


In [11]:
# Show chunks with at most min_token_length estimated tokens
min_token_length = 15
for _, row in df[df["chunk_token_count"] <= min_token_length].iterrows():
    print(f'Chunk token count: {row["chunk_token_count"]} | Text: {row["sentence_chunk"]}')

Chunk token count: 13.25 | Text: Allegato Economico (Allegato 1) – costi e commissioni


In [12]:
# The model limit is 512 tokens; stay below it to be safe, because
# "query: " or "passage: " is prepended to every text before embedding
max_token_length = 500
for _, row in df[df["chunk_token_count"] >= max_token_length].iterrows():
    print(f'Chunk token count: {row["chunk_token_count"]} | Text: {row["sentence_chunk"]}')

Chunk token count: 623.5 | Text: 1/12 PRE 0124 Directa Società di intermediazione Mobiliare per Azioni Iscritta all’albo delle SIM al n° 59 Codice fiscale e partita iva e iscrizione al registro delle imprese n° 06837440012 Sede legale: Via Bruno Buozzi n° 5 – 10121 Torino Telefono: +39 011.530101 – fax: +39 011.530532 E.mail: directa@directa.it PEC: directasim@legalmail.it Capitale sociale Euro 7.500.000 interamente versato Aderente al Fondo Nazionale di Garanzia Contratto di: CONTRATTO PER LA RICEZIONE E TRASMISSIONE DI ORDINI, NONCHÉ ESECUZIONE PER CONTO DEL CLIENTE, COLLOCAMENTO E SERVIZI ACCESSORI INFORMATIVA PRE-CONTRATTUALE Cliente al dettaglio Edizione Ottobre 2021 Sezione A Informazioni su Directa e i suoi servizi Pag 1 Sezione B Informazioni concernenti la salvaguardia degli investimenti finanziari e delle somme di denaro della clientela Pag 3 Sezione C Informazioni sugli strumenti finanziari Pag 3 Sezione D Informazioni sugli oneri e sui costi Pag 10 Sezione E Informazioni pe

In [13]:
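# Slice the chunks that are too long
def split_chunks(pages_and_chunks: list[dict], max_token_length: int) -> list[dict]:
    """Splits every chunk close to or above max_token_length into two halves at a sentence boundary."""
    pages_and_chunks_sliced_internal = []
    for item in tqdm(pages_and_chunks):
        # Split slightly below the limit (12-token margin)
        if item["chunk_token_count"] >= max_token_length - 12: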
            sentences = list(nlp(item["sentence_chunk"]).sents)
            midpoint = len(sentences) // 2

            first_half = sentences[:midpoint]
            second_half = sentences[midpoint:]

            first_half_joined = "".join([str(sentence) for sentence in first_half]).strip()
            second_half_joined = "".join([str(sentence) for sentence in second_half]).strip()

            pages_and_chunks_sliced_internal.append({
                "page_number": item["page_number"],
                "sentence_chunk": first_half_joined,
                "chunk_char_count": len(first_half_joined),
                "chunk_word_count": len(first_half_joined.split(" ")),
                "chunk_token_count": len(first_half_joined) / 4
            })

            pages_and_chunks_sliced_internal.append({
                "page_number": item["page_number"],
                "sentence_chunk": second_half_joined,
                "chunk_char_count": len(second_half_joined),
                "chunk_word_count": len(second_half_joined.split(" ")),
                "chunk_token_count": len(second_half_joined) / 4
            })

        else:
            pages_and_chunks_sliced_internal.append(item)
    return pages_and_chunks_sliced_internal


pages_and_chunks_sliced = pages_and_chunks.copy()
while any(item["chunk_token_count"] >= max_token_length for item in pages_and_chunks_sliced):
    pages_and_chunks_sliced = split_chunks(pages_and_chunks_sliced, max_token_length)



In [14]:
df = pd.DataFrame(pages_and_chunks_sliced)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 104.0 | 104.0 | 104.0 | 104.0 |
| mean | 9.39 | 927.51 | 132.82 | 231.88 |
| std | 4.61 | 471.2 | 67.43 | 117.8 |
| min | 1.0 | 53.0 | 8.0 | 13.25 |
| 25% | 6.0 | 537.75 | 77.75 | 134.44 |
| 50% | 9.0 | 952.5 | 133.0 | 238.12 |
| 75% | 13.0 | 1253.25 | 181.5 | 313.31 |
| max | 17.0 | 1913.0 | 273.0 | 478.25 |


In [15]:
for row in df[df["chunk_token_count"] >= max_token_length].iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')
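The halving loop works for this document, but it relies on every over-long chunk containing at least two sentences: a single sentence above the limit would be "split" into an empty half and itself, and the `while` loop would never finish. A more defensive variant is sketched below, assuming the same rough estimate of 4 characters per token; the function name `split_to_fit` and the character-level fallback are assumptions for illustration, not part of the pipeline above.

```python
def split_to_fit(sentences: list[str], max_tokens: float) -> list[str]:
    """
    Recursively halves a list of sentences until every joined piece has an
    estimated token count (characters / 4) below max_tokens.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")

    text = " ".join(sentences).strip()
    if len(text) / 4 < max_tokens:
        return [text] if text else []

    if len(sentences) > 1:
        # Normal case: split at the middle sentence boundary
        mid = len(sentences) // 2
        halves = (sentences[:mid], sentences[mid:])
    else:
        # Fallback: one over-long sentence is cut at its character midpoint
        mid = len(text) // 2
        halves = ([text[:mid]], [text[mid:]])

    return [piece for half in halves for piece in split_to_fit(half, max_tokens)]
```

The character-level fallback can cut a word in half, so it is a last resort; it guarantees termination rather than clean boundaries.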

In [31]:
from typing import List
from sentence_transformers import SentenceTransformer
from langchain_core.embeddings import Embeddings

embedding_model = SentenceTransformer('intfloat/multilingual-e5-large')

class CustomEmbeddings(Embeddings):
    """LangChain wrapper that adds the "passage: " / "query: " prefixes before encoding."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        prefix = "passage: "
        # Encode all texts in one batched call, then convert the NumPy array to plain lists
        return embedding_model.encode([prefix + text for text in texts], batch_size=32).tolist()

    def embed_query(self, text: str) -> List[float]:
        prefix = "query: "
        return embedding_model.encode(prefix + text).tolist()


In [33]:
from langchain.vectorstores import Chroma
from langchain.schema import Document

# Create a list of Documents with metadata
documents = [
    Document(
        page_content=row["sentence_chunk"],
        metadata={
            "page_number": row["page_number"],
            "chunk_char_count": row["chunk_char_count"],
            "chunk_word_count": row["chunk_word_count"],
            "chunk_token_count": row["chunk_token_count"]
        }
    )
    for _, row in df.iterrows()
]

vectorstore = Chroma.from_documents(documents, CustomEmbeddings())

In [36]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama

model = ChatOllama(
    model="mistral-nemo"
)

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

retriever = vectorstore.as_retriever(search_kwargs={"k": 2})


question_answer_chain = create_stuff_documents_chain(model, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

results = rag_chain.invoke({"input": "Ho il profilo semplice. Quali sono le commissioni di trading su EXM?"})

results

{'input': 'Ho il profilo semplice. Quali sono le commissioni di trading su EXM?',
 'context': [Document(metadata={'chunk_char_count': 366, 'chunk_token_count': 91.5, 'chunk_word_count': 58, 'page_number': 16}, page_content='ALL1 0524€ INFORMAZIONI SUGLI ONERI E SUI COSTI Di seguito sono riportate le commissioni e le condizioni economiche in vigore al 2 maggio 2024. Per successive modifiche si rinvia al sito www.directa.it. COMMISSIONI DI TRADING SUI DIVERSI MERCATI EXM (ex MTA), EGM, MIV, ETFplus, GEM Profili alternativi: •!Semplice: 5€ per ordine eseguito •!Dinamica*: da 8 a 1,5€ •!'),
  Document(metadata={'chunk_char_count': 1297, 'chunk_token_count': 324.25, 'chunk_word_count': 183, 'page_number': 11}, page_content="SEZIONE D INFORMAZIONI SUGLI ONERI E SUI COSTI RENDICONTAZIONE EX-ANTE ED EX-POST Directa, in ottemperanza alle nuove disposizioni introdotte con la MiFID II, fornisce un’informativa ex-ante sui costi applicati ai servizi di trading sui principali mercati/strumenti finan

In [38]:
print(results["context"][0].metadata)

{'chunk_char_count': 366, 'chunk_token_count': 91.5, 'chunk_word_count': 58, 'page_number': 16}
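Since every retrieved chunk carries its page number in the metadata, the sources behind an answer can be listed alongside it. The sketch below is one way to do that; `format_sources` is a hypothetical helper, and `RetrievedDoc` is only a stand-in with the two attributes used here (`page_content` and `metadata`) so that the example runs without LangChain.

```python
from dataclasses import dataclass, field

@dataclass
class RetrievedDoc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_sources(docs, preview_chars: int = 60) -> list[str]:
    """Builds one 'p. <page>: <preview>' line per retrieved document."""
    lines = []
    for doc in docs:
        page = doc.metadata.get("page_number", "?")
        preview = doc.page_content[:preview_chars]
        if len(doc.page_content) > preview_chars:
            preview += "..."  # mark that the text was shortened
        lines.append(f"p. {page}: {preview}")
    return lines

# Example with made-up documents
example_docs = [
    RetrievedDoc("COMMISSIONI DI TRADING", {"page_number": 16}),
    RetrievedDoc("x" * 100),
]
source_lines = format_sources(example_docs, preview_chars=10)
```

In the notebook, the same helper could be called as `format_sources(results["context"])`.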


In [41]:
rag_chain.invoke({"input": "Quanto costano le operazioni fuori mercato?"})

{'input': 'Quanto costano le operazioni fuori mercato?',
 'context': [Document(metadata={'chunk_char_count': 366, 'chunk_token_count': 91.5, 'chunk_word_count': 58, 'page_number': 16}, page_content='ALL1 0524€ INFORMAZIONI SUGLI ONERI E SUI COSTI Di seguito sono riportate le commissioni e le condizioni economiche in vigore al 2 maggio 2024. Per successive modifiche si rinvia al sito www.directa.it. COMMISSIONI DI TRADING SUI DIVERSI MERCATI EXM (ex MTA), EGM, MIV, ETFplus, GEM Profili alternativi: •!Semplice: 5€ per ordine eseguito •!Dinamica*: da 8 a 1,5€ •!'),
  Document(metadata={'chunk_char_count': 1516, 'chunk_token_count': 379.0, 'chunk_word_count': 237, 'page_number': 16}, page_content='Variabile: 1,9 per mille per ordine eseguito, con un massimo di 18€ e un minimo di 1,5€ (il minimo è di 5€ per il mercato GEM) per ordini fino a 500.000€ ATFund •!Unico profilo disponibile: 1,9 per mille per ordine eseguito, con un massimo di 200€ e un minimo di 5€ per ordini fino a 500.000€ SEDE