## Chunking methods

Load the document first.

In [1]:
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

# Set the Ollama endpoint (if it's not the default)
# os.environ["OLLAMA_HOST"] = "http://localhost:11434"

# 1. Load the PDF file
# Replace 'data/LNCS.pdf' with the path to your own document
pdf_path = "data/LNCS.pdf"
try:
    loader = PyPDFLoader(pdf_path)
    docs = loader.load()
    print(docs)
except (FileNotFoundError, ValueError):
    # Depending on the loader version, a missing path may surface as either exception
    print(f"Error: The file '{pdf_path}' was not found. Please ensure the file exists.")
    exit()

[Document(metadata={'producer': 'Acrobat Distiller 4.05 for Windows', 'creator': 'DVIPSONE 2.2.1  http://www.YandY.com', 'creationdate': 'D:20020429195111', 'title': "LNCS/LNAI Authors' Instructions", 'author': 'Springer-Verlag Heidelberg', 'subject': 'TeX output 2002.04.29:1951', 'moddate': '2002-04-29T20:05:47+02:00', 'source': 'data/LNCS.pdf', 'total_pages': 9, 'page': 0, 'page_label': '1'}, page_content='Lecture Notes in Computer Science:\nAuthors’ Instructions for the Preparation\nof Camera-Ready Contributions\nto LNCS/LNAI Proceedings\nAlfred Hofmann1, Ingrid Beyer1, Anna Kramer1, Erika Siebert-Cole1,\nAngelika Bernauer-Budiman2, Martina Wiese2, and Anita B¨urk3\n1 Springer-Verlag, Computer Science Editorial III, Postfach 10 52 80,\n69042 Heidelberg, Germany\n{Hofmann, Beyer, Kramer, Erika.Siebert-Cole, LNCS}@Springer.de\nhttp://www.springer.de/comp/lncs/index.html\n2 Springer-Verlag, Computer Science Production, Postfach 10 52 80,\n69042 Heidelberg, Germany\n{Bernauer, Wiese}@Sp

#### Semantic chunking from langchain

Make sure the `mxbai-embed-large` model has been pulled first: `ollama pull mxbai-embed-large`

In [2]:
# 2. Initialize the Ollama embedding model
try:
    embeddings = OllamaEmbeddings(model="mxbai-embed-large")
except Exception as e:
    print(f"Error initializing OllamaEmbeddings: {e}")
    print("Please check that Ollama is running and the specified model is pulled.")
    exit()

# 3. Perform semantic chunking
# The SemanticChunker splits the text into sentences, embeds them, and starts
# a new chunk wherever the embedding distance between neighboring sentences
# exceeds the chosen threshold (here a percentile of all distances).
semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

# 4. Split the document into semantic chunks
semantic_chunks = semantic_chunker.split_documents(docs)

# 5. Print some information to verify the process
print(f"Total number of chunks created: {len(semantic_chunks)}")
print("\n--- First Chunk Content ---")
print(semantic_chunks[0].page_content)


Total number of chunks created: 21

--- First Chunk Content ---
Lecture Notes in Computer Science:
Authors’ Instructions for the Preparation
of Camera-Ready Contributions
to LNCS/LNAI Proceedings
Alfred Hofmann1, Ingrid Beyer1, Anna Kramer1, Erika Siebert-Cole1,
Angelika Bernauer-Budiman2, Martina Wiese2, and Anita B¨urk3
1 Springer-Verlag, Computer Science Editorial III, Postfach 10 52 80,
69042 Heidelberg, Germany
{Hofmann, Beyer, Kramer, Erika.Siebert-Cole, LNCS}@Springer.de
http://www.springer.de/comp/lncs/index.html
2 Springer-Verlag, Computer Science Production, Postfach 10 52 80,
69042 Heidelberg, Germany
{Bernauer, Wiese}@Springer.de
3 Springer-Verlag, Marketing Management, Postfach 10 52 80,
69042 Heidelberg, Germany
Buerk@Springer.de
Abstract. The abstract should summarize the contents of the paper
and should contain at least 70 and at most 150 words. It should be set
in 9-point font size and should be inset 1.0 cm from the right and left
margins. There should be two blank (1
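
To make the `percentile` breakpoint rule more concrete, here is a minimal sketch of the idea using NumPy and hand-made 2-D vectors in place of real sentence embeddings. It is an illustration under assumptions, not LangChain's implementation: the function name `percentile_breakpoints`, the toy vectors, and the percentile value are all chosen here for the example.

```python
import numpy as np

def percentile_breakpoints(sentence_embeddings, percentile=95.0):
    """Return indices i where a chunk boundary falls between sentence i and i+1."""
    vectors = np.asarray(sentence_embeddings, dtype=float)
    # Normalize rows so the dot product of neighbors is their cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    # Cosine distance between each sentence and the next one
    distances = 1.0 - np.sum(unit[:-1] * unit[1:], axis=1)
    # Split wherever the distance is above the chosen percentile
    threshold = np.percentile(distances, percentile)
    return [i for i, d in enumerate(distances) if d > threshold]

# Toy "embeddings" (hypothetical): three sentences on one topic, then two on another
toy = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
print(percentile_breakpoints(toy))
```

Only the jump between the third and fourth vector stands out from the other distances, so the sketch reports a single boundary there. A lower percentile would produce more, smaller chunks.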

In [3]:
print("\n--- Last Chunk Content ---")
print(semantic_chunks[-1].page_content)


--- Last Chunk Content ---
Digit. Libr.1 (1997) 108–121
Appendix: Springer-Author Discount
All authors or editors of Springer books,inparticularauthorscontributingtoany
LNCS or LNAI proceedings volume, are entitled to buy any book published by
Springer-Verlag for personal use at the “Springer-author” discount of one third
oﬀthelistprice.SuchpreferentialorderscanonlybeprocessedthroughSpringer
directly (and not through bookstores); reference to a Springer publication has
to be given withsuchorders. Any Springer oﬃce may be contacted, particularly
those in Heidelberg and New York:
Springer Auslieferungsgesellschaft Springer-Verlag New York Inc. Haberstrasse 7 P.O. Box 2485
69126 Heidelberg Secaucus, NJ 07096-2485
Germany USA
Fax: +49 6221 345-229 Fax: +1 201 348 4505
Phone: +49 6221 345-0 Phone: +1-800-SPRINGER
(+1 800 777 4643), toll-free in USA
Preferential orders can also be placed by sending an email to
orders@springer.de or orders@springer-ny.com. For information about shipping char

In [4]:
print("\n--- Chunk Content ---")
print(semantic_chunks[10].page_content)


--- Chunk Content ---
Lecture Notes in Computer Science 5
2.5 Program Code
Programlistingsorprogramcommandsinthetextarenormallysetintypewriter
font, e.g., CMTT10 or Courier. Example of a Computer Program
program Inflation (Output)
{Assuming annual inflation rates of 7%, 8%, and 10%,... years};
const
MaxYears = 10;
var
Year: 0..MaxYears;
Factor1, Factor2, Factor3: Real;
begin
Year := 0;
Factor1 := 1.0;Factor2 := 1.0;Factor3 := 1.0;
WriteLn(’Year 7% 8% 10%’);WriteLn;
repeat
Year := Year + 1;
Factor1 := Factor1 * 1.07;
Factor2 := Factor2 * 1.08;
Factor3 := Factor3 * 1.10;
WriteLn(Year:5,Factor1:7:3,Factor2:7:3,Factor3:7:3)
until Year = MaxYears
end. (Example from Jensen K., Wirth N.


In [5]:
print("\n--- Third Chunk Content ---")
print(semantic_chunks[2].page_content)


--- Third Chunk Content ---
2 Alfred Hofmann et al. 2Manuscript Preparation
You are strongly encouraged to use LATEX2ε for the preparation of your camera-
ready manuscript together with the corresponding Springer class ﬁlellncs.cls;
see Sect. 3.


In [6]:
print("\n--- Second Chunk Content ---")
print(semantic_chunks[1].page_content)


--- Second Chunk Content ---
1 Introduction
The preparation of manuscripts which are to be reproduced by photo-oﬀset re-
quires special care. Papers submitted in a technically unsuitable form will be
returned for retyping, or canceled if the volume cannot otherwise be ﬁnished on
time. 1.1 LNCS Online
Springer-Verlag now provides the full-text version of the LNCS and LNAI pro-
ceedings online. Therefore please submit to the volume editors (and not to
Springer-Verlag), together with your own single-sided printout of the ﬁnal ver-
sionofyourcontribution(whichcannotbemodiﬁedatalaterstage),yoursource
(input) ﬁles, e.g. TEX ﬁles for the text and PS or EPS ﬁles for ﬁgures, the ﬁnal
DVI ﬁle (for papers prepared using L
ATEXo rTEX), the ﬁnal PS ﬁle4, and, if pos-
sible, a PDF ﬁle of the ﬁnal version of your contribution. If you have prepared
your paper using a text processing system other than L
ATEXo rTEX, please also
submit RTF ﬁles. Make sure that the text isidentical in all cases. 4 When ge

### Late chunking

https://github.com/jina-ai/late-chunking

Compare sentence chunking to late chunking

In [7]:
# For comparison, create span annotations for the LangChain chunks as well
def split_with_spans(text, chunker):
    """Split text with the chunker; return the chunks and their character spans.

    The spans are (start_char, end_char) offsets into text, not token offsets.
    """
    chunks = []
    spans = []
    search_from = 0

    # Use the splitter to get the content of the chunks
    for chunk_content in chunker.split_text(text):
        # Find the starting position of the chunk in the original text
        start_char = text.find(chunk_content, search_from)
        if start_char == -1:
            # Fallback when the splitter changed whitespace and no exact match exists
            start_char = search_from

        end_char = start_char + len(chunk_content)

        chunks.append(chunk_content)
        spans.append((start_char, end_char))

        # Continue the next search after this chunk
        search_from = end_char

    return chunks, spans
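
Spans measured in characters and spans measured in tokens are not interchangeable. The sketch below shows one way to translate between them, assuming a per-token list of `(start_char, end_char)` offsets is available. With a Hugging Face fast tokenizer, such a list can typically be requested via `return_offsets_mapping=True`. The helper name `char_spans_to_token_spans` and the hand-made offsets are hypothetical and only serve the example.

```python
def char_spans_to_token_spans(char_spans, offsets):
    """Map (start_char, end_char) spans to (start_token, end_token) spans.

    offsets holds one (start_char, end_char) pair per token; special tokens
    are expected to carry an empty pair such as (0, 0) and are skipped.
    """
    token_spans = []
    for span_start, span_end in char_spans:
        # Indices of tokens whose characters overlap the chunk
        inside = [
            i for i, (tok_start, tok_end) in enumerate(offsets)
            if tok_end > tok_start and tok_start < span_end and tok_end > span_start
        ]
        if inside:
            # The end index is exclusive, as in Python slicing
            token_spans.append((inside[0], inside[-1] + 1))
    return token_spans

# Hand-made offsets for "Hi there. Bye now." with [CLS]/[SEP]-style special tokens
offsets = [(0, 0), (0, 2), (3, 8), (8, 9), (10, 13), (14, 17), (17, 18), (0, 0)]
print(char_spans_to_token_spans([(0, 9), (10, 18)], offsets))
```

The resulting token spans use the same convention as the spans returned by `chunk_by_sentences`, so they can index into per-token model output.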

In [8]:
# Do some late chunking now
from transformers import AutoModel
from transformers import AutoTokenizer
from chunked_pooling import chunked_pooling, chunk_by_sentences

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

# Text to chunk
input_text = "Berlin is the capital and largest city of Germany, both by area and by population. Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits. The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."

In [9]:
# langchain chunker
lang_chunks,lang_spans = split_with_spans(input_text, semantic_chunker)
print('Chunks:\n- "' + '"\n- "'.join(lang_chunks) + '"')
print(" ")
print(lang_spans)

Chunks:
- "Berlin is the capital and largest city of Germany, both by area and by population. Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."
- "The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."
 
[(0, 216), (217, 328)]


In [10]:
# determine chunks
chunks, span_annotations = chunk_by_sentences(input_text, tokenizer)
print('Chunks:\n- "' + '"\n- "'.join(chunks) + '"')
print(" ")
print(span_annotations)

Chunks:
- "Berlin is the capital and largest city of Germany, both by area and by population."
- " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."
- " The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."
 
[(1, 17), (17, 44), (44, 69)]


Now we encode the chunks in two ways: with the traditional method (chunk before embedding) and with context-sensitive chunked pooling (chunk after embedding).

In [11]:
embeddings_traditional_chunking = model.encode(chunks)
# chunk afterwards (context-sensitive chunked pooling)
inputs = tokenizer(input_text, return_tensors='pt')
model_output = model(**inputs)
embeddings = chunked_pooling(model_output, [span_annotations])[0]
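
As a rough picture of what chunking after embedding means, the sketch below mean-pools per-token vectors over token spans with NumPy. It is a simplified stand-in, not the `chunked_pooling` function from the late-chunking repository. The name `mean_pool_spans` and the toy token vectors are assumptions made for the example.

```python
import numpy as np

def mean_pool_spans(token_embeddings, token_spans):
    """Average the token embeddings inside each (start, end) token span."""
    token_embeddings = np.asarray(token_embeddings, dtype=float)
    pooled = []
    for start, end in token_spans:
        window = token_embeddings[start:end]
        if len(window) == 0:
            # A span outside the sequence has nothing to pool
            raise ValueError(f"span ({start}, {end}) selects no tokens")
        pooled.append(window.mean(axis=0))
    return pooled

# Four toy token embeddings, standing in for one model pass over the whole text
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
chunk_vectors = mean_pool_spans(tokens, [(0, 2), (2, 4)])
print(chunk_vectors)
```

Because the token vectors come from a single pass over the full text, each pooled chunk vector can carry context from outside its own span. This is the property the comparison in this section is meant to probe.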

In [12]:
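# LangChain embeddings
# Caution: lang_spans holds character offsets, while chunked_pooling slices
# the token embeddings, so these spans do not line up with the chunks.
lang_embeddings = chunked_pooling(model_output, [lang_spans])[0]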

Now compare how similar each chunk embedding is to the query "Berlin".

In [13]:
import numpy as np

# cosine similarity -> dot product normalized
cos_sim = lambda x, y: np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

berlin_embedding = model.encode('Berlin')

for chunk, new_embedding, trad_embeddings in zip(chunks, embeddings, embeddings_traditional_chunking):
    print(f'similarity_new("Berlin", "{chunk}"):', cos_sim(berlin_embedding, new_embedding))
    print(f'similarity_trad("Berlin", "{chunk}"):', cos_sim(berlin_embedding, trad_embeddings))

similarity_new("Berlin", "Berlin is the capital and largest city of Germany, both by area and by population."): 0.849546
similarity_trad("Berlin", "Berlin is the capital and largest city of Germany, both by area and by population."): 0.8486219
similarity_new("Berlin", " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."): 0.82489026
similarity_trad("Berlin", " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."): 0.70843387
similarity_new("Berlin", " The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."): 0.8498009
similarity_trad("Berlin", " The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."): 0.75345534


In [14]:
# LangChain embeddings
for chunk, new_embedding in zip(lang_chunks, lang_embeddings):
    print(f'similarity_langchain("Berlin", "{chunk}"):', cos_sim(berlin_embedding, new_embedding))

similarity_langchain("Berlin", "Berlin is the capital and largest city of Germany, both by area and by population. Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."): 0.85809577
similarity_langchain("Berlin", "The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."): nan
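
A `nan` from `cos_sim` is what NumPy produces when one of the vectors has zero norm, because the expression becomes 0 / 0. That would happen here if the span `(217, 328)` selected no tokens from a sequence of roughly 70. The sketch below is a guarded variant that makes this case explicit. The name `safe_cos_sim` and the choice to return `None` are assumptions for illustration.

```python
import numpy as np

def safe_cos_sim(x, y):
    """Cosine similarity that returns None when either vector has zero norm."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0.0:
        # Undefined: at least one vector is all zeros
        return None
    return float(np.dot(x, y) / denom)

print(safe_cos_sim([1.0, 0.0], [2.0, 0.0]))
print(safe_cos_sim([1.0, 0.0], [0.0, 0.0]))
```

Returning `None` instead of `nan` makes it easier to spot that the problem lies in the pooled embedding rather than in the similarity itself.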



In [15]:
## TODO: Add more examples

mj_text = """

Michael Jeffrey Jordan (born February 17, 1963), also known by his initials MJ,[8] is an American businessman, former professional basketball and baseball player, who is a minority owner of the Charlotte Hornets of the National Basketball Association (NBA). He played 15 seasons in the NBA between 1984-2003, winning six NBA championships with the Chicago Bulls. Widely considered to be one of the greatest players of all time,[9][10][11] he was integral in popularizing basketball and the NBA around the world in the 1980s and 1990s,[12] becoming a global cultural icon.[13]

Jordan played college basketball with the North Carolina Tar Heels. As a freshman, he was a member of the Tar Heels' national championship team in 1982.[5] Jordan joined the Bulls in 1984 as the third overall draft pick[5][14] and emerged as a league star, entertaining crowds with his prolific scoring while gaining a reputation as one of the best defensive players.[15] His leaping ability, demonstrated by performing slam dunks from the free-throw line in Slam Dunk Contests, earned him the nicknames "Air Jordan" and "His Airness".[5] Jordan won his first NBA title with the Bulls in 1991 and followed that with titles in 1992 and 1993, securing a three-peat. Citing physical and mental exhaustion from basketball and superstardom, Jordan abruptly retired before the 1993–94 NBA season to play Minor League Baseball in the Chicago White Sox organization. He returned to the Bulls in 1995 and led them to three more championships in 1996, 1997, and 1998, as well as a then-record 72 regular season wins in the 1995–96 NBA season.[5] Jordan retired for the second time in 1999, returning for two NBA seasons from 2001 to 2003 as a member of the Washington Wizards.[5][14] He was selected to play for the United States national team during his college and NBA careers, winning four gold medals—at the 1983 Pan American Games, 1984 Summer Olympics, 1992 Tournament of the Americas and 1992 Summer Olympics—while also being undefeated.[16]

Jordan's individual accolades include six NBA Finals Most Valuable Player (MVP) awards, ten NBA scoring titles (both all-time records), five NBA MVP awards, 10 All-NBA First Team designations, nine All-Defensive First Team honors, fourteen NBA All-Star Game selections, three NBA All-Star Game MVP awards, and three NBA steals titles.[14] He holds the NBA records for career regular season scoring average (30.1 points per game) and career playoff scoring average (33.4 points per game).[17] He is one of only eight players to achieve the basketball Triple Crown. In 1999, Jordan was named the 20th century's greatest North American athlete by ESPN and was second to Babe Ruth on the Associated Press' list of athletes of the century.[5] Jordan was twice inducted into the Naismith Memorial Basketball Hall of Fame, once in 2009 for his individual career,[18] and in 2010 as part of the 1992 United States men's Olympic basketball team ("The Dream Team").[19] He became a member of the United States Olympic Hall of Fame in 2009,[20] an individual member of the FIBA Hall of Fame in 2015 and a "Dream Team" member in 2017.[21][22] Jordan was named to the NBA 75th Anniversary Team in 2021.[23] The trophy for the NBA Most Valuable Player Award is named in his honor.

One of the most effectively marketed athletes ever, Jordan made many product endorsements.[12][24] He fueled the success of Nike's Air Jordan sneakers, which were introduced in 1984 and remain popular.[25] Jordan starred as himself in the live-action/animation hybrid film Space Jam (1996) and was the focus of the Emmy-winning documentary series The Last Dance (2020). He became part-owner and head of basketball operations for the Charlotte Hornets (then named the Bobcats) in 2006 and bought a controlling interest in 2010, before selling his majority stake in 2023. Jordan is a co-owner of 23XI Racing in the NASCAR Cup Series. In 2014, he became the first billionaire player in NBA history.[26] In 2016, President Barack Obama awarded Jordan the Presidential Medal of Freedom.[27] As of 2025, his net worth is estimated at $3.8 billion by Forbes,[28] making him one of the richest celebrities.
"""

In [16]:
# langchain chunker
lang_chunks,lang_spans = split_with_spans(mj_text, semantic_chunker)
print('Chunks:\n- "' + '"\n- "'.join(lang_chunks) + '"')
print(" ")
print(lang_spans)

Chunks:
- "

Michael Jeffrey Jordan (born February 17, 1963), also known by his initials MJ,[8] is an American businessman, former professional basketball and baseball player, who is a minority owner of the Charlotte Hornets of the National Basketball Association (NBA). He played 15 seasons in the NBA between 1984-2003, winning six NBA championships with the Chicago Bulls. Widely considered to be one of the greatest players of all time,[9][10][11] he was integral in popularizing basketball and the NBA around the world in the 1980s and 1990s,[12] becoming a global cultural icon.[13]

Jordan played college basketball with the North Carolina Tar Heels. As a freshman, he was a member of the Tar Heels' national championship team in 1982.[5] Jordan joined the Bulls in 1984 as the third overall draft pick[5][14] and emerged as a league star, entertaining crowds with his prolific scoring while gaining a reputation as one of the best defensive players.[15] His leaping ability, demonstrated by p

In [17]:
# determine sentence chunks (used for both late and traditional chunking)
chunks, span_annotations = chunk_by_sentences(mj_text, tokenizer)
print('Chunks:\n- "' + '"\n- "'.join(chunks) + '"')
print(" ")
print(span_annotations)

Chunks:
- "

Michael Jeffrey Jordan (born February 17, 1963), also known by his initials MJ,[8] is an American businessman, former professional basketball and baseball player, who is a minority owner of the Charlotte Hornets of the National Basketball Association (NBA)."
- " He played 15 seasons in the NBA between 1984-2003, winning six NBA championships with the Chicago Bulls."
- " Widely considered to be one of the greatest players of all time,[9][10][11] he was integral in popularizing basketball and the NBA around the world in the 1980s and 1990s,[12] becoming a global cultural icon.[13]

Jordan played college basketball with the North Carolina Tar Heels."
- " As a freshman, he was a member of the Tar Heels' national championship team in 1982.[5] Jordan joined the Bulls in 1984 as the third overall draft pick[5][14] and emerged as a league star, entertaining crowds with his prolific scoring while gaining a reputation as one of the best defensive players.[15] His leaping ability, de

In [18]:
# traditional embedding
embeddings_traditional_chunking = model.encode(chunks)

# chunk afterwards (context-sensitive chunked pooling), i.e. late chunking
inputs = tokenizer(mj_text, return_tensors='pt')
model_output = model(**inputs)
embeddings = chunked_pooling(model_output, [span_annotations])[0]

In [19]:
# LangChain embeddings
lang_embeddings = chunked_pooling(model_output, [lang_spans])[0]

In [20]:
mj_embedding = model.encode('Michael Jordan')

for chunk, new_embedding, trad_embeddings in zip(chunks, embeddings, embeddings_traditional_chunking):
    print(f'similarity_new("Michael Jordan", "{chunk}"):', cos_sim(mj_embedding, new_embedding))
    print(f'similarity_trad("Michael Jordan", "{chunk}"):', cos_sim(mj_embedding, trad_embeddings))
    print(' ')
    print('-------------------------------------------------------')
    print(' ')

similarity_new("Michael Jordan", "

Michael Jeffrey Jordan (born February 17, 1963), also known by his initials MJ,[8] is an American businessman, former professional basketball and baseball player, who is a minority owner of the Charlotte Hornets of the National Basketball Association (NBA)."): 0.67259985
similarity_trad("Michael Jordan", "

Michael Jeffrey Jordan (born February 17, 1963), also known by his initials MJ,[8] is an American businessman, former professional basketball and baseball player, who is a minority owner of the Charlotte Hornets of the National Basketball Association (NBA)."): 0.74062645
 
-------------------------------------------------------
 
similarity_new("Michael Jordan", " He played 15 seasons in the NBA between 1984-2003, winning six NBA championships with the Chicago Bulls."): 0.64387697
similarity_trad("Michael Jordan", " He played 15 seasons in the NBA between 1984-2003, winning six NBA championships with the Chicago Bulls."): 0.63204616
 
------------

In [21]:
# LangChain embeddings
for chunk, new_embedding in zip(lang_chunks, lang_embeddings):
    print(f'similarity_langchain("MJ", "{chunk}"):', cos_sim(mj_embedding, new_embedding))
    print(' ')
    print('-------------------------------------------------------')
    print(' ')

similarity_langchain("MJ", "

Michael Jeffrey Jordan (born February 17, 1963), also known by his initials MJ,[8] is an American businessman, former professional basketball and baseball player, who is a minority owner of the Charlotte Hornets of the National Basketball Association (NBA). He played 15 seasons in the NBA between 1984-2003, winning six NBA championships with the Chicago Bulls. Widely considered to be one of the greatest players of all time,[9][10][11] he was integral in popularizing basketball and the NBA around the world in the 1980s and 1990s,[12] becoming a global cultural icon.[13]

Jordan played college basketball with the North Carolina Tar Heels. As a freshman, he was a member of the Tar Heels' national championship team in 1982.[5] Jordan joined the Bulls in 1984 as the third overall draft pick[5][14] and emerged as a league star, entertaining crowds with his prolific scoring while gaining a reputation as one of the best defensive players.[15] His leaping ability, 


In [22]:
cs_text = """
Computer science is the study of computation, information, and automation. Computer science spans theoretical disciplines (such as algorithms, theory of computation, and information theory) to applied disciplines (including the design and implementation of hardware and software). Algorithms and data structures are central to computer science. The theory of computation concerns abstract models of computation and general classes of problems that can be solved using them. The fields of cryptography and computer security involve studying the means for secure communication and preventing security vulnerabilities. Computer graphics and computational geometry address the generation of images. Programming language theory considers different ways to describe computational processes, and database theory concerns the management of repositories of data. Human–computer interaction investigates the interfaces through which humans and computers interact, and software engineering focuses on the design and principles behind developing software. Areas such as operating systems, networks and embedded systems investigate the principles and design behind complex systems. Computer architecture describes the construction of computer components and computer-operated equipment. Artificial intelligence and machine learning aim to synthesize goal-orientated processes such as problem-solving, decision-making, environmental adaptation, planning and learning found in humans and animals. Within artificial intelligence, computer vision aims to understand and process image and video data, while natural language processing aims to understand and process textual and linguistic data. The fundamental concern of computer science is determining what can and cannot be automated.The Turing Award is generally recognized as the highest distinction in computer science.
"""

In [23]:
# langchain chunker
lang_chunks,lang_spans = split_with_spans(cs_text, semantic_chunker)
print('Chunks:\n- "' + '"\n- "'.join(lang_chunks) + '"')
print(" ")
print(lang_spans)

Chunks:
- "
Computer science is the study of computation, information, and automation. Computer science spans theoretical disciplines (such as algorithms, theory of computation, and information theory) to applied disciplines (including the design and implementation of hardware and software). Algorithms and data structures are central to computer science. The theory of computation concerns abstract models of computation and general classes of problems that can be solved using them. The fields of cryptography and computer security involve studying the means for secure communication and preventing security vulnerabilities. Computer graphics and computational geometry address the generation of images. Programming language theory considers different ways to describe computational processes, and database theory concerns the management of repositories of data. Human–computer interaction investigates the interfaces through which humans and computers interact, and software engineering focuses o

In [24]:
# determine sentence chunks (used for both late and traditional chunking)
chunks, span_annotations = chunk_by_sentences(cs_text, tokenizer)
print('Chunks:\n- "' + '"\n- "'.join(chunks) + '"')
print(" ")
print(span_annotations)

Chunks:
- "
Computer science is the study of computation, information, and automation."
- " Computer science spans theoretical disciplines (such as algorithms, theory of computation, and information theory) to applied disciplines (including the design and implementation of hardware and software)."
- " Algorithms and data structures are central to computer science."
- " The theory of computation concerns abstract models of computation and general classes of problems that can be solved using them."
- " The fields of cryptography and computer security involve studying the means for secure communication and preventing security vulnerabilities."
- " Computer graphics and computational geometry address the generation of images."
- " Programming language theory considers different ways to describe computational processes, and database theory concerns the management of repositories of data."
- " Human–computer interaction investigates the interfaces through which humans and computers interact,

In [25]:
# traditional embedding
embeddings_traditional_chunking = model.encode(chunks)

# chunk afterwards (context-sensitive chunked pooling), i.e. late chunking
inputs = tokenizer(cs_text, return_tensors='pt')
model_output = model(**inputs)
embeddings = chunked_pooling(model_output, [span_annotations])[0]

In [26]:
# LangChain embeddings
lang_embeddings = chunked_pooling(model_output, [lang_spans])[0]

In [27]:
cs_embedding = model.encode('Computer Science')

for chunk, new_embedding, trad_embeddings in zip(chunks, embeddings, embeddings_traditional_chunking):
    print(f'similarity_new("Computer Science", "{chunk}"):', cos_sim(cs_embedding, new_embedding))
    print(f'similarity_trad("Computer Science", "{chunk}"):', cos_sim(cs_embedding, trad_embeddings))
    print(' ')
    print('-------------------------------------------------------')
    print(' ')

similarity_new("Computer Science", "
Computer science is the study of computation, information, and automation."): 0.77195656
similarity_trad("Computer Science", "
Computer science is the study of computation, information, and automation."): 0.87986636
 
-------------------------------------------------------
 
similarity_new("Computer Science", " Computer science spans theoretical disciplines (such as algorithms, theory of computation, and information theory) to applied disciplines (including the design and implementation of hardware and software)."): 0.8255079
similarity_trad("Computer Science", " Computer science spans theoretical disciplines (such as algorithms, theory of computation, and information theory) to applied disciplines (including the design and implementation of hardware and software)."): 0.8414965
 
-------------------------------------------------------
 
similarity_new("Computer Science", " Algorithms and data structures are central to computer science."): 0.797399
s

In [28]:
# LangChain embeddings
for chunk, new_embedding in zip(lang_chunks, lang_embeddings):
    print(f'similarity_langchain("Computer Science", "{chunk}"):', cos_sim(cs_embedding, new_embedding))
    print(' ')
    print('-------------------------------------------------------')
    print(' ')

similarity_langchain("Computer Science", "
Computer science is the study of computation, information, and automation. Computer science spans theoretical disciplines (such as algorithms, theory of computation, and information theory) to applied disciplines (including the design and implementation of hardware and software). Algorithms and data structures are central to computer science. The theory of computation concerns abstract models of computation and general classes of problems that can be solved using them. The fields of cryptography and computer security involve studying the means for secure communication and preventing security vulnerabilities. Computer graphics and computational geometry address the generation of images. Programming language theory considers different ways to describe computational processes, and database theory concerns the management of repositories of data. Human–computer interaction investigates the interfaces through which humans and computers interact, and


#### Let's look at a Gemini-generated S065-style text with titles and sections

In [29]:
text = """
Generated S065-Style Standard Document Text
TITLE: S065-B/SRD — Automated System Interface and Data Validation Standard, Revision 2.1

1.0 INTRODUCTION AND SCOPE

This document, S065-B/SRD, defines the mandatory technical specifications and operational protocols for the secure, automated transfer of clinical trial data between sponsor systems and the central regulatory body's database. The scope of this standard encompasses all data streams related to Protocol 734-X, specifically focusing on patient-reported outcomes (PROs), adverse event reporting (AERs), and investigational product accountability (IPA) logs. All data exchanges must be formatted in accordance with the HL7 FHIR standard, version R4, with specific profiles as detailed in Section 3.2. Data integrity is paramount, and non-compliance with these specifications will result in an automated rejection of the data submission.

2.0 DATA TRANSFER PROTOCOLS

2.1 Security and Authentication

All connections must be established via a TLS 1.3 encrypted channel. Each data stream must be accompanied by a digital signature using an SHA-256 hash and a Public Key Infrastructure (PKI) certificate issued by the approved Certificate Authority (CA) as designated in Appendix A. Client-side authentication requires a valid client certificate and a rotating API key, which must be refreshed every 24 hours. Failure to authenticate correctly will terminate the connection and log a security event in the central audit trail.

2.2 File Transmission and Archiving

Data payloads shall be transmitted as a compressed JSON file, with a maximum size of 50 MB per transmission. Submissions exceeding this limit must be partitioned into multiple files. Upon successful receipt, the system will generate a unique transaction ID and archive the file for a period of no less than seven (7) years. The sponsor is responsible for retaining a local copy of all submitted data and the corresponding transaction IDs.

3.0 DATA CONTENT AND VALIDATION

3.1 Patient-Reported Outcomes (PROs)

PRO data must adhere to the CDISC SDTM standard, version 1.6. Each record must include a unique patient identifier, the date of response, and the specific instrument used (e.g., EQ-5D-5L, QLQ-C30). A logical check will be performed to ensure that all response values fall within the permissible range for the specified instrument. Any out-of-range values will be flagged, and the entire batch will be returned for correction with a detailed error report.

3.2 Adverse Event Reporting (AERs)

AER data submissions must use the MedDRA coding system for all adverse events, up to the lowest level of term (LLT). Each report must include the event onset date, severity, and a causality assessment in relation to the investigational product. The system will cross-reference all submitted AERs against a predefined list of high-priority events and will automatically trigger a critical alert for immediate manual review if a match is found.

3.3 Investigational Product Accountability (IPA) Logs

IPA logs are required to be submitted on a monthly basis. The data must include the batch number, expiration date, and a detailed record of the product dispensed, returned, or destroyed. All quantities must be reconciled with the initial shipment manifest. Any discrepancy greater than a 2% variance must be justified in a separate discrepancy report submitted alongside the data. Failure to provide a valid justification will result in an official compliance notice.

4.0 GOVERNANCE AND COMPLIANCE

This standard is subject to periodic review and revision. Any changes will be communicated via official channels and will be published with a 90-day grace period before mandatory enforcement. Non-compliance with this standard, including but not limited to repeated failed submissions or data integrity issues, may lead to a formal investigation and potential suspension of the sponsor's data submission privileges. This document supersedes and replaces all previous versions of S065-A/SRD.
"""

In [30]:
# langchain chunker
lang_chunks,lang_spans = split_with_spans(text, semantic_chunker)
print('Chunks:\n- "' + '"\n- "'.join(lang_chunks) + '"')
print(" ")
print(lang_spans)

Chunks:
- "
Generated S065-Style Standard Document Text
TITLE: S065-B/SRD — Automated System Interface and Data Validation Standard, Revision 2.1

1.0 INTRODUCTION AND SCOPE

This document, S065-B/SRD, defines the mandatory technical specifications and operational protocols for the secure, automated transfer of clinical trial data between sponsor systems and the central regulatory body's database. The scope of this standard encompasses all data streams related to Protocol 734-X, specifically focusing on patient-reported outcomes (PROs), adverse event reporting (AERs), and investigational product accountability (IPA) logs. All data exchanges must be formatted in accordance with the HL7 FHIR standard, version R4, with specific profiles as detailed in Section 3.2. Data integrity is paramount, and non-compliance with these specifications will result in an automated rejection of the data submission. 2.0 DATA TRANSFER PROTOCOLS

2.1 Security and Authentication

All connections must be establ

In [31]:
# determine chunks (latent vs traditional sentences)
chunks, span_annotations = chunk_by_sentences(text, tokenizer)
print('Chunks:\n- "' + '"\n- "'.join(chunks) + '"')
print(" ")
print(span_annotations)

Chunks:
- "
Generated S065-Style Standard Document Text
TITLE: S065-B/SRD — Automated System Interface and Data Validation Standard, Revision 2.1

1.0 INTRODUCTION AND SCOPE

This document, S065-B/SRD, defines the mandatory technical specifications and operational protocols for the secure, automated transfer of clinical trial data between sponsor systems and the central regulatory body's database."
- " The scope of this standard encompasses all data streams related to Protocol 734-X, specifically focusing on patient-reported outcomes (PROs), adverse event reporting (AERs), and investigational product accountability (IPA) logs."
- " All data exchanges must be formatted in accordance with the HL7 FHIR standard, version R4, with specific profiles as detailed in Section 3.2."
- " Data integrity is paramount, and non-compliance with these specifications will result in an automated rejection of the data submission."
- "

2.0 DATA TRANSFER PROTOCOLS

2.1 Security and Authentication

All conne

In [32]:
# traditional embedding
embeddings_traditional_chunking = model.encode(chunks)

# chunk afterwards (context-sensitive chunked pooling) - latent 
inputs = tokenizer(text, return_tensors='pt')
model_output = model(**inputs)
embeddings = chunked_pooling(model_output, [span_annotations])[0]

In [33]:
# langchain embeddings
lang_embeddings = chunked_pooling(model_output, [lang_spans])[0]

In [34]:
text_embedding = model.encode('S065 standard')

for chunk, new_embedding, trad_embeddings in zip(chunks, embeddings, embeddings_traditional_chunking):
    print(f'similarity_new("S065 standard", "{chunk}"):', cos_sim(text_embedding, new_embedding))
    print(f'similarity_trad("S065 standard", "{chunk}"):', cos_sim(text_embedding, trad_embeddings))
    print(' ')
    print('-------------------------------------------------------')
    print(' ')

similarity_new("S065 standard", "
Generated S065-Style Standard Document Text
TITLE: S065-B/SRD — Automated System Interface and Data Validation Standard, Revision 2.1

1.0 INTRODUCTION AND SCOPE

This document, S065-B/SRD, defines the mandatory technical specifications and operational protocols for the secure, automated transfer of clinical trial data between sponsor systems and the central regulatory body's database."): 0.763225
similarity_trad("S065 standard", "
Generated S065-Style Standard Document Text
TITLE: S065-B/SRD — Automated System Interface and Data Validation Standard, Revision 2.1

1.0 INTRODUCTION AND SCOPE

This document, S065-B/SRD, defines the mandatory technical specifications and operational protocols for the secure, automated transfer of clinical trial data between sponsor systems and the central regulatory body's database."): 0.8275396
 
-------------------------------------------------------
 
similarity_new("S065 standard", " The scope of this standard encompa

In [35]:
# langchain embeddings
for chunk, new_embedding in zip(lang_chunks, lang_embeddings):
    print(f'similarity_langchain("S065 standard", "{chunk}"):', cos_sim(text_embedding, new_embedding))
    print(' ')
    print('-------------------------------------------------------')
    print(' ')

similarity_langchain("S065 standard", "
Generated S065-Style Standard Document Text
TITLE: S065-B/SRD — Automated System Interface and Data Validation Standard, Revision 2.1

1.0 INTRODUCTION AND SCOPE

This document, S065-B/SRD, defines the mandatory technical specifications and operational protocols for the secure, automated transfer of clinical trial data between sponsor systems and the central regulatory body's database. The scope of this standard encompasses all data streams related to Protocol 734-X, specifically focusing on patient-reported outcomes (PROs), adverse event reporting (AERs), and investigational product accountability (IPA) logs. All data exchanges must be formatted in accordance with the HL7 FHIR standard, version R4, with specific profiles as detailed in Section 3.2. Data integrity is paramount, and non-compliance with these specifications will result in an automated rejection of the data submission. 2.0 DATA TRANSFER PROTOCOLS

2.1 Security and Authentication

Al



## Other similarity measures for search?

Everything I have found so far is cosine similarity or a closely related measure.

Let's add keyword search and compare the methods in isolation and in a hybrid approach.
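
One common way to combine a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below is only an illustration of that idea, not the method used in this notebook: the chunk ids, the two example rankings and the smoothing constant `k=60` are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first rankings of chunk ids into one ranking.

    Each chunk receives 1 / (k + rank) from every ranking it appears in.
    `k` is a smoothing constant; 60 is an assumed, commonly used value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical rankings, best match first.
semantic_ranking = ["c2", "c0", "c3"]
keyword_ranking = ["c0", "c4", "c2"]

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused)
```

Chunks that appear near the top of both lists end up ahead of chunks that only one method found, which is why each individual search can usefully return more candidates than the final result size.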

In [36]:
%pip install scikit-learn

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.


In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from collections import defaultdict
import random

In [228]:
class MockVectorStore:
    def __init__(self, docs, embedding_model):
        self.docs = docs
        # self.doc_ids = [f"doc_{i}" for i in range(len(docs))]
        self.chunks = []
        self.embeddings = []
        self.model = embedding_model
        # chunk afterwards (context-sensitive chunked pooling) - latent
        for text in docs:
            chunks, span_annotations = chunk_by_sentences(text, tokenizer)
            for i in chunks:
                self.chunks.append(i)
            inputs = tokenizer(text, return_tensors='pt')
            model_output = embedding_model(**inputs)
            embeddings = chunked_pooling(model_output, [span_annotations])[0]
            for emb in embeddings:
                # print(np.shape(emb))
                self.embeddings.append(emb)

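    def semantic_search(self, query_text, k=10):
        # A mock semantic search using cosine similarity
        # Encode with the model stored on the instance, not the global `model`
        query_embedding = self.model.encode(query_text)
        similarities = []
        for emb in self.embeddings:
            similarities.append(cos_sim(query_embedding, emb))

        # Chunk indices sorted from most to least similar
        sorted_indices = np.argsort(similarities)[::-1]
        top_k_indices = sorted_indices[:k]

        # Use `idx` so the loop variable does not shadow the `k` parameter
        results = []
        for idx in top_k_indices:
            results.append(self.chunks[idx])

        return results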


    def threshold_search(self, query_text, threshold=0.7):
        # Return every chunk whose cosine similarity to the query is >= threshold
        # Encode with the model stored on the instance, not the global `model`
        query_embedding = self.model.encode(query_text)
        similarities = []
        for emb in self.embeddings:
            similarities.append(cos_sim(query_embedding, emb))

        # Chunk indices sorted from most to least similar
        sorted_indices = np.argsort(similarities)[::-1]

        results = []
        for idx in sorted_indices:
            if similarities[idx] >= threshold:
                results.append(self.chunks[idx])
            else:
                # Scores are in descending order, so nothing further can pass
                break

        return results

# Todo keyword search
# class MockKeywordIndex:
#     def __init__(self, docs):
#         self.docs = docs
#         self.doc_ids = [f"doc_{i}" for i in range(len(docs))]
#         self.vectorizer = TfidfVectorizer()
#         self.doc_vectors = self.vectorizer.fit_transform(docs)

#     def keyword_search(self, query_text, k=10):
#         # A mock keyword search using TF-IDF and cosine similarity
#         query_vector = self.vectorizer.transform([query_text])
#         similarities = cosine_similarity(query_vector, self.doc_vectors).flatten()
        
#         sorted_indices = np.argsort(similarities)[::-1]
#         top_k_indices = sorted_indices[:k]

#         return [(self.doc_ids[i], similarities[i]) for i in top_k_indices]
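
The commented-out index above is still a to-do. As a rough stand-in, here is a minimal literal keyword index using only the standard library. It is a simplified TF-IDF-style scorer, not scikit-learn's `TfidfVectorizer`, and the class name `TinyKeywordIndex`, the tokenizer and the example chunks are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs only (an assumed, very simple tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyKeywordIndex:
    def __init__(self, chunks):
        self.chunks = chunks
        self.term_counts = [Counter(tokenize(c)) for c in chunks]
        n = len(chunks)
        doc_freq = Counter()
        for counts in self.term_counts:
            doc_freq.update(counts.keys())
        # Smoothed inverse document frequency: rarer terms weigh more.
        self.idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in doc_freq.items()}

    def keyword_search(self, query_text, k=10):
        query_terms = tokenize(query_text)
        scored = []
        for idx, counts in enumerate(self.term_counts):
            score = sum(counts[t] * self.idf.get(t, 0.0) for t in query_terms)
            if score > 0:  # keep only chunks that literally contain a query term
                scored.append((score, idx))
        # Best score first; ties keep the original chunk order.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(self.chunks[idx], score) for score, idx in scored[:k]]


# Hypothetical chunks for illustration.
chunks = [
    "Jordan joined the Bulls in 1984.",
    "Computer science studies computation and automation.",
    "Jordan played college basketball with the Tar Heels.",
]
index = TinyKeywordIndex(chunks)
print(index.keyword_search("college"))
print(index.keyword_search("where"))
```

Unlike an embedding-based search, this returns nothing at all when no chunk contains the query term, so a ranking from it can be compared with the semantic ranking chunk by chunk.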

In [229]:
# Sample data (reusing from above)
documents = [
    input_text,
    cs_text,
    mj_text,
    text,
]

In [230]:
# Create mock search systems
vector_store = MockVectorStore(documents, model)
# keyword_index = MockKeywordIndex(documents)

In [231]:
# Define the user query
user_query = "Where did Jordan play?"

In [232]:
# Perform both searches (top 5 chunks)
# The top_k for each search can be higher than the final result size for better fusion
semantic_results = vector_store.semantic_search(user_query, k=5)
# keyword_results = keyword_index.keyword_search(user_query, k=5)
# not sure if this is doing what I want...
print(semantic_results)
# print(keyword_results)

[' Citing physical and mental exhaustion from basketball and superstardom, Jordan abruptly retired before the 1993–94 NBA season to play Minor League Baseball in the Chicago White Sox organization.', ' As a freshman, he was a member of the Tar Heels\' national championship team in 1982.[5] Jordan joined the Bulls in 1984 as the third overall draft pick[5][14] and emerged as a league star, entertaining crowds with his prolific scoring while gaining a reputation as one of the best defensive players.[15] His leaping ability, demonstrated by performing slam dunks from the free-throw line in Slam Dunk Contests, earned him the nicknames "Air Jordan" and "His Airness".[5] Jordan won his first NBA title with the Bulls in 1991 and followed that with titles in 1992 and 1993, securing a three-peat.', ' Widely considered to be one of the greatest players of all time,[9][10][11] he was integral in popularizing basketball and the NBA around the world in the 1980s and 1990s,[12] becoming a global cul

In [234]:
# Search for all relevant data
threshold_results = vector_store.threshold_search(user_query, threshold=0.85)
print(threshold_results)

[]


In [235]:
# Search for all relevant data
threshold_results = vector_store.threshold_search(user_query, threshold=0.8)
print(threshold_results)

[]


In [236]:
# Search for all relevant data
threshold_results = vector_store.threshold_search(user_query, threshold=0.75)
print(threshold_results)

[' Citing physical and mental exhaustion from basketball and superstardom, Jordan abruptly retired before the 1993–94 NBA season to play Minor League Baseball in the Chicago White Sox organization.', ' As a freshman, he was a member of the Tar Heels\' national championship team in 1982.[5] Jordan joined the Bulls in 1984 as the third overall draft pick[5][14] and emerged as a league star, entertaining crowds with his prolific scoring while gaining a reputation as one of the best defensive players.[15] His leaping ability, demonstrated by performing slam dunks from the free-throw line in Slam Dunk Contests, earned him the nicknames "Air Jordan" and "His Airness".[5] Jordan won his first NBA title with the Bulls in 1991 and followed that with titles in 1992 and 1993, securing a three-peat.', ' Widely considered to be one of the greatest players of all time,[9][10][11] he was integral in popularizing basketball and the NBA around the world in the 1980s and 1990s,[12] becoming a global cul

In [237]:
# Search for all relevant data
threshold_results = vector_store.threshold_search(user_query, threshold=0.7)
print(threshold_results)

[' Citing physical and mental exhaustion from basketball and superstardom, Jordan abruptly retired before the 1993–94 NBA season to play Minor League Baseball in the Chicago White Sox organization.', ' As a freshman, he was a member of the Tar Heels\' national championship team in 1982.[5] Jordan joined the Bulls in 1984 as the third overall draft pick[5][14] and emerged as a league star, entertaining crowds with his prolific scoring while gaining a reputation as one of the best defensive players.[15] His leaping ability, demonstrated by performing slam dunks from the free-throw line in Slam Dunk Contests, earned him the nicknames "Air Jordan" and "His Airness".[5] Jordan won his first NBA title with the Bulls in 1991 and followed that with titles in 1992 and 1993, securing a three-peat.', ' Widely considered to be one of the greatest players of all time,[9][10][11] he was integral in popularizing basketball and the NBA around the world in the 1980s and 1990s,[12] becoming a global cul

In [238]:
# Search for all relevant data
threshold_results = vector_store.threshold_search(user_query, threshold=0.65)
print(threshold_results)

[' Citing physical and mental exhaustion from basketball and superstardom, Jordan abruptly retired before the 1993–94 NBA season to play Minor League Baseball in the Chicago White Sox organization.', ' As a freshman, he was a member of the Tar Heels\' national championship team in 1982.[5] Jordan joined the Bulls in 1984 as the third overall draft pick[5][14] and emerged as a league star, entertaining crowds with his prolific scoring while gaining a reputation as one of the best defensive players.[15] His leaping ability, demonstrated by performing slam dunks from the free-throw line in Slam Dunk Contests, earned him the nicknames "Air Jordan" and "His Airness".[5] Jordan won his first NBA title with the Bulls in 1991 and followed that with titles in 1992 and 1993, securing a three-peat.', ' Widely considered to be one of the greatest players of all time,[9][10][11] he was integral in popularizing basketball and the NBA around the world in the 1980s and 1990s,[12] becoming a global cul

In [239]:
# Search for all relevant data
threshold_results = vector_store.threshold_search(user_query, threshold=0.6)
print(threshold_results)

[' Citing physical and mental exhaustion from basketball and superstardom, Jordan abruptly retired before the 1993–94 NBA season to play Minor League Baseball in the Chicago White Sox organization.', ' As a freshman, he was a member of the Tar Heels\' national championship team in 1982.[5] Jordan joined the Bulls in 1984 as the third overall draft pick[5][14] and emerged as a league star, entertaining crowds with his prolific scoring while gaining a reputation as one of the best defensive players.[15] His leaping ability, demonstrated by performing slam dunks from the free-throw line in Slam Dunk Contests, earned him the nicknames "Air Jordan" and "His Airness".[5] Jordan won his first NBA title with the Bulls in 1991 and followed that with titles in 1992 and 1993, securing a three-peat.', ' Widely considered to be one of the greatest players of all time,[9][10][11] he was integral in popularizing basketball and the NBA around the world in the 1980s and 1990s,[12] becoming a global cul
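
The cells above rerun the same search once per threshold. A possible shortcut, sketched here under assumptions, is to compute the similarities once and count how many chunks clear each threshold. The helper name `count_hits_per_threshold` and the similarity values are made up for illustration; they are not the scores produced by this notebook.

```python
def count_hits_per_threshold(similarities, thresholds):
    """Return {threshold: number of similarities >= threshold}."""
    return {t: sum(1 for s in similarities if s >= t) for t in thresholds}


# Made-up cosine similarities between one query and five chunks.
similarities = [0.78, 0.76, 0.71, 0.55, 0.42]
thresholds = [0.85, 0.8, 0.75, 0.7, 0.65, 0.6]

hits = count_hits_per_threshold(similarities, thresholds)
for t in thresholds:
    print(f"threshold={t:.2f}: {hits[t]} chunk(s)")
```

A table like this makes it easier to see where the result set jumps from empty to non-empty before settling on a cut-off.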

## Keyword search

We repeat the same embedding-based search, but use single keywords or short phrases as queries instead of full questions.

In [241]:
class VectorStoreKeyword:
    def __init__(self, docs, embedding_model):
        self.docs = docs
        self.chunks = []
        self.embeddings = []
        self.model = embedding_model
        # chunk afterwards (context-sensitive chunked pooling) - latent
        for text in docs:
            chunks, span_annotations = chunk_by_sentences(text, tokenizer)
            for i in chunks:
                self.chunks.append(i)
            inputs = tokenizer(text, return_tensors='pt')
            model_output = embedding_model(**inputs)
            embeddings = chunked_pooling(model_output, [span_annotations])[0]
            for emb in embeddings:
                self.embeddings.append(emb)

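    def keyword_search(self, query_text, k=10):
        # semantic search using cosine similarity
        # Encode with the model stored on the instance, not the global `model`
        query_embedding = self.model.encode(query_text)
        similarities = []
        for emb in self.embeddings:
            similarities.append(cos_sim(query_embedding, emb))

        # Chunk indices sorted from most to least similar
        sorted_indices = np.argsort(similarities)[::-1]
        top_k_indices = sorted_indices[:k]

        # Use `idx` so the loop variable does not shadow the `k` parameter
        results = []
        for idx in top_k_indices:
            results.append(self.chunks[idx])

        return results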


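    def keyword_search_threshold(self, query_text, threshold=0.7):
        # semantic search using cosine similarity, keeping chunks with score >= threshold
        # Encode with the model stored on the instance, not the global `model`
        query_embedding = self.model.encode(query_text)
        similarities = []
        for emb in self.embeddings:
            similarities.append(cos_sim(query_embedding, emb))

        # Chunk indices sorted from most to least similar
        sorted_indices = np.argsort(similarities)[::-1]

        results = []
        for idx in sorted_indices:
            if similarities[idx] >= threshold:
                results.append(self.chunks[idx])
            else:
                # Scores are in descending order, so nothing further can pass
                break

        return results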


In [242]:
# Sample data (reusing from above)
documents = [
    input_text,
    cs_text,
    mj_text,
    text,
]

In [244]:
# Create mock search systems
vector_store = VectorStoreKeyword(documents, model)

In [245]:
# Perform searches (top k chunks)
topK = 3
keyword_query = "Jordan"
semantic_results = vector_store.keyword_search(keyword_query, k=topK)
print(semantic_results)

[' As a freshman, he was a member of the Tar Heels\' national championship team in 1982.[5] Jordan joined the Bulls in 1984 as the third overall draft pick[5][14] and emerged as a league star, entertaining crowds with his prolific scoring while gaining a reputation as one of the best defensive players.[15] His leaping ability, demonstrated by performing slam dunks from the free-throw line in Slam Dunk Contests, earned him the nicknames "Air Jordan" and "His Airness".[5] Jordan won his first NBA title with the Bulls in 1991 and followed that with titles in 1992 and 1993, securing a three-peat.', ' Widely considered to be one of the greatest players of all time,[9][10][11] he was integral in popularizing basketball and the NBA around the world in the 1980s and 1990s,[12] becoming a global cultural icon.[13]\n\nJordan played college basketball with the North Carolina Tar Heels.', " He returned to the Bulls in 1995 and led them to three more championships in 1996, 1997, and 1998, as well a

In [246]:
# Perform searches (top k chunks)
topK = 3
keyword_query = "Michael"
semantic_results = vector_store.keyword_search(keyword_query, k=topK)
print(semantic_results)

[' Widely considered to be one of the greatest players of all time,[9][10][11] he was integral in popularizing basketball and the NBA around the world in the 1980s and 1990s,[12] becoming a global cultural icon.[13]\n\nJordan played college basketball with the North Carolina Tar Heels.', ' As a freshman, he was a member of the Tar Heels\' national championship team in 1982.[5] Jordan joined the Bulls in 1984 as the third overall draft pick[5][14] and emerged as a league star, entertaining crowds with his prolific scoring while gaining a reputation as one of the best defensive players.[15] His leaping ability, demonstrated by performing slam dunks from the free-throw line in Slam Dunk Contests, earned him the nicknames "Air Jordan" and "His Airness".[5] Jordan won his first NBA title with the Bulls in 1991 and followed that with titles in 1992 and 1993, securing a three-peat.', '\n\nMichael Jeffrey Jordan (born February 17, 1963), also known by his initials MJ,[8] is an American busines

In [247]:
# Perform searches (top k chunks)
topK = 3
keyword_query = "Michael Jordan"
semantic_results = vector_store.keyword_search(keyword_query, k=topK)
print(semantic_results)

[' In 2014, he became the first billionaire player in NBA history.[26] In 2016, President Barack Obama awarded Jordan the Presidential Medal of Freedom.[27] As of 2025, his net worth is estimated at $3.8 billion by Forbes,[28] making him one of the richest celebrities.', ' Widely considered to be one of the greatest players of all time,[9][10][11] he was integral in popularizing basketball and the NBA around the world in the 1980s and 1990s,[12] becoming a global cultural icon.[13]\n\nJordan played college basketball with the North Carolina Tar Heels.', ' In 1999, Jordan was named the 20th century\'s greatest North American athlete by ESPN and was second to Babe Ruth on the Associated Press\' list of athletes of the century.[5] Jordan was twice inducted into the Naismith Memorial Basketball Hall of Fame, once in 2009 for his individual career,[18] and in 2010 as part of the 1992 United States men\'s Olympic basketball team ("The Dream Team").[19] He became a member of the United States

In [248]:
# Perform searches (top k chunks)
topK = 3
keyword_query = "basketball"
semantic_results = vector_store.keyword_search(keyword_query, k=topK)
print(semantic_results)

[" He returned to the Bulls in 1995 and led them to three more championships in 1996, 1997, and 1998, as well as a then-record 72 regular season wins in the 1995–96 NBA season.[5] Jordan retired for the second time in 1999, returning for two NBA seasons from 2001 to 2003 as a member of the Washington Wizards.[5][14] He was selected to play for the United States national team during his college and NBA careers, winning four gold medals—at the 1983 Pan American Games, 1984 Summer Olympics, 1992 Tournament of the Americas and 1992 Summer Olympics—while also being undefeated.[16]\n\nJordan's individual accolades include six NBA Finals Most Valuable Player (MVP) awards, ten NBA scoring titles (both all-time records), five NBA MVP awards, 10 All-NBA First Team designations, nine All-Defensive First Team honors, fourteen NBA All-Star Game selections, three NBA All-Star Game MVP awards, and three NBA steals titles.[14] He holds the NBA records for career regular season scoring average (30.1 po

In [249]:
# Perform searches (top k chunks)
topK = 3
keyword_query = "where"
semantic_results = vector_store.keyword_search(keyword_query, k=topK)
print(semantic_results)

[' Each record must include a unique patient identifier, the date of response, and the specific instrument used (e.g., EQ-5D-5L, QLQ-C30).', ' Any changes will be communicated via official channels and will be published with a 90-day grace period before mandatory enforcement.', ' Upon successful receipt, the system will generate a unique transaction ID and archive the file for a period of no less than seven (7) years.']


In [250]:
# Perform searches (top k chunks)
topK = 3
keyword_query = "team"
semantic_results = vector_store.keyword_search(keyword_query, k=topK)
print(semantic_results)

[' The sponsor is responsible for retaining a local copy of all submitted data and the corresponding transaction IDs.', " Non-compliance with this standard, including but not limited to repeated failed submissions or data integrity issues, may lead to a formal investigation and potential suspension of the sponsor's data submission privileges.", ' Upon successful receipt, the system will generate a unique transaction ID and archive the file for a period of no less than seven (7) years.']


In [251]:
# Perform searches (top k chunks)
topK = 3
keyword_query = "basketball team"
semantic_results = vector_store.keyword_search(keyword_query, k=topK)
print(semantic_results)

[" He returned to the Bulls in 1995 and led them to three more championships in 1996, 1997, and 1998, as well as a then-record 72 regular season wins in the 1995–96 NBA season.[5] Jordan retired for the second time in 1999, returning for two NBA seasons from 2001 to 2003 as a member of the Washington Wizards.[5][14] He was selected to play for the United States national team during his college and NBA careers, winning four gold medals—at the 1983 Pan American Games, 1984 Summer Olympics, 1992 Tournament of the Americas and 1992 Summer Olympics—while also being undefeated.[16]\n\nJordan's individual accolades include six NBA Finals Most Valuable Player (MVP) awards, ten NBA scoring titles (both all-time records), five NBA MVP awards, 10 All-NBA First Team designations, nine All-Defensive First Team honors, fourteen NBA All-Star Game selections, three NBA All-Star Game MVP awards, and three NBA steals titles.[14] He holds the NBA records for career regular season scoring average (30.1 po

In [252]:
# Perform searches (top k chunks)
topK = 3
keyword_query = "Bulls"
semantic_results = vector_store.keyword_search(keyword_query, k=topK)
print(semantic_results)

[' As a freshman, he was a member of the Tar Heels\' national championship team in 1982.[5] Jordan joined the Bulls in 1984 as the third overall draft pick[5][14] and emerged as a league star, entertaining crowds with his prolific scoring while gaining a reputation as one of the best defensive players.[15] His leaping ability, demonstrated by performing slam dunks from the free-throw line in Slam Dunk Contests, earned him the nicknames "Air Jordan" and "His Airness".[5] Jordan won his first NBA title with the Bulls in 1991 and followed that with titles in 1992 and 1993, securing a three-peat.', ' Citing physical and mental exhaustion from basketball and superstardom, Jordan abruptly retired before the 1993–94 NBA season to play Minor League Baseball in the Chicago White Sox organization.', " He returned to the Bulls in 1995 and led them to three more championships in 1996, 1997, and 1998, as well as a then-record 72 regular season wins in the 1995–96 NBA season.[5] Jordan retired for t

In [253]:
# Perform searches (top k chunks)
topK = 3
keyword_query = "college"
semantic_results = vector_store.keyword_search(keyword_query, k=topK)
print(semantic_results)

[' The fundamental concern of computer science is determining what can and cannot be automated.The Turing Award is generally recognized as the highest distinction in computer science.', ' Human–computer interaction investigates the interfaces through which humans and computers interact, and software engineering focuses on the design and principles behind developing software.', ' Within artificial intelligence, computer vision aims to understand and process image and video data, while natural language processing aims to understand and process textual and linguistic data.']


### Keyword threshold search

In [254]:
# Perform searches (over threshold)
threshold = 0.85
keyword_query = "college"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[]


In [255]:
# Perform searches (over threshold)
threshold = 0.80
keyword_query = "college"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[]


In [256]:
# Perform searches (over threshold)
threshold = 0.75
keyword_query = "college"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[]


In [257]:
# Perform searches (over threshold)
threshold = 0.7
keyword_query = "college"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[]


In [258]:
# Perform searches (over threshold)
threshold = 0.65
keyword_query = "college"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[' The fundamental concern of computer science is determining what can and cannot be automated.The Turing Award is generally recognized as the highest distinction in computer science.', ' Human–computer interaction investigates the interfaces through which humans and computers interact, and software engineering focuses on the design and principles behind developing software.', ' Within artificial intelligence, computer vision aims to understand and process image and video data, while natural language processing aims to understand and process textual and linguistic data.', ' Artificial intelligence and machine learning aim to synthesize goal-orientated processes such as problem-solving, decision-making, environmental adaptation, planning and learning found in humans and animals.', ' Areas such as operating systems, networks and embedded systems investigate the principles and design behind complex systems.', ' Computer architecture describes the construction of computer components and co

In [259]:
# Perform searches (over threshold)
threshold = 0.60
keyword_query = "college"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[' The fundamental concern of computer science is determining what can and cannot be automated.The Turing Award is generally recognized as the highest distinction in computer science.', ' Human–computer interaction investigates the interfaces through which humans and computers interact, and software engineering focuses on the design and principles behind developing software.', ' Within artificial intelligence, computer vision aims to understand and process image and video data, while natural language processing aims to understand and process textual and linguistic data.', ' Artificial intelligence and machine learning aim to synthesize goal-orientated processes such as problem-solving, decision-making, environmental adaptation, planning and learning found in humans and animals.', ' Areas such as operating systems, networks and embedded systems investigate the principles and design behind complex systems.', ' Computer architecture describes the construction of computer components and co

In [260]:
# Perform searches (over threshold)
threshold = 0.85
keyword_query = "basketball"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[]


In [262]:
# Perform searches (over threshold)
threshold = 0.80
keyword_query = "basketball"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[]


In [263]:
# Return only the chunks that score above the threshold for this query
threshold = 0.75
keyword_query = "basketball"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[]


In [264]:
# Return only the chunks that score above the threshold for this query
threshold = 0.70
keyword_query = "basketball"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[]


In [265]:
# Return only the chunks that score above the threshold for this query
threshold = 0.65
keyword_query = "basketball"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[" He returned to the Bulls in 1995 and led them to three more championships in 1996, 1997, and 1998, as well as a then-record 72 regular season wins in the 1995–96 NBA season.[5] Jordan retired for the second time in 1999, returning for two NBA seasons from 2001 to 2003 as a member of the Washington Wizards.[5][14] He was selected to play for the United States national team during his college and NBA careers, winning four gold medals—at the 1983 Pan American Games, 1984 Summer Olympics, 1992 Tournament of the Americas and 1992 Summer Olympics—while also being undefeated.[16]\n\nJordan's individual accolades include six NBA Finals Most Valuable Player (MVP) awards, ten NBA scoring titles (both all-time records), five NBA MVP awards, 10 All-NBA First Team designations, nine All-Defensive First Team honors, fourteen NBA All-Star Game selections, three NBA All-Star Game MVP awards, and three NBA steals titles.[14] He holds the NBA records for career regular season scoring average (30.1 po
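
The cells above rerun the same query by hand at 0.85, 0.80, 0.75, 0.7, and 0.65. The same sweep can be written as a loop that records how many chunks come back at each threshold. This is a minimal sketch: `sweep_thresholds`, `first_hit_threshold`, and `toy_search` are hypothetical helpers, and the toy scores are invented so the block runs on its own. It assumes only that the search callable takes `(query, threshold)` and returns a list, which matches how `keyword_search_threshold` is called in this notebook.

```python
from typing import Callable, Dict, List, Optional, Sequence


def sweep_thresholds(
    search: Callable[[str, float], List[str]],
    query: str,
    thresholds: Sequence[float],
) -> Dict[float, int]:
    """Run one query at each threshold and count the chunks returned."""
    return {t: len(search(query, t)) for t in thresholds}


def first_hit_threshold(counts: Dict[float, int]) -> Optional[float]:
    """Highest threshold that returned at least one chunk, or None."""
    hits = [t for t, n in counts.items() if n > 0]
    return max(hits) if hits else None


# Hypothetical stand-in for vector_store.keyword_search_threshold.
# The scores are made up for illustration only.
_TOY_SCORES = {"chunk A": 0.78, "chunk B": 0.72}


def toy_search(query: str, threshold: float) -> List[str]:
    return [text for text, score in _TOY_SCORES.items() if score >= threshold]


counts = sweep_thresholds(toy_search, "example query", [0.80, 0.75, 0.70])
for t, n in counts.items():
    print(f"threshold={t:.2f} -> {n} chunk(s)")
print("first hit at:", first_hit_threshold(counts))
```

With the real store, passing `vector_store.keyword_search_threshold` as `search` should work as long as it accepts the same two positional arguments.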

In [266]:
# Return only the chunks that score above the threshold for this query
threshold = 0.80
keyword_query = "basketball team"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[]


In [267]:
# Return only the chunks that score above the threshold for this query
threshold = 0.70
keyword_query = "basketball team"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[]


In [268]:
# Return only the chunks that score above the threshold for this query
threshold = 0.65
keyword_query = "basketball team"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[" He returned to the Bulls in 1995 and led them to three more championships in 1996, 1997, and 1998, as well as a then-record 72 regular season wins in the 1995–96 NBA season.[5] Jordan retired for the second time in 1999, returning for two NBA seasons from 2001 to 2003 as a member of the Washington Wizards.[5][14] He was selected to play for the United States national team during his college and NBA careers, winning four gold medals—at the 1983 Pan American Games, 1984 Summer Olympics, 1992 Tournament of the Americas and 1992 Summer Olympics—while also being undefeated.[16]\n\nJordan's individual accolades include six NBA Finals Most Valuable Player (MVP) awards, ten NBA scoring titles (both all-time records), five NBA MVP awards, 10 All-NBA First Team designations, nine All-Defensive First Team honors, fourteen NBA All-Star Game selections, three NBA All-Star Game MVP awards, and three NBA steals titles.[14] He holds the NBA records for career regular season scoring average (30.1 po

In [269]:
# Return only the chunks that score above the threshold for this query
threshold = 0.80
keyword_query = "Bulls"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[]


In [271]:
# Return only the chunks that score above the threshold for this query
threshold = 0.75
keyword_query = "Bulls"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[' As a freshman, he was a member of the Tar Heels\' national championship team in 1982.[5] Jordan joined the Bulls in 1984 as the third overall draft pick[5][14] and emerged as a league star, entertaining crowds with his prolific scoring while gaining a reputation as one of the best defensive players.[15] His leaping ability, demonstrated by performing slam dunks from the free-throw line in Slam Dunk Contests, earned him the nicknames "Air Jordan" and "His Airness".[5] Jordan won his first NBA title with the Bulls in 1991 and followed that with titles in 1992 and 1993, securing a three-peat.']


In [272]:
# Return only the chunks that score above the threshold for this query
threshold = 0.70
keyword_query = "Bulls"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[' As a freshman, he was a member of the Tar Heels\' national championship team in 1982.[5] Jordan joined the Bulls in 1984 as the third overall draft pick[5][14] and emerged as a league star, entertaining crowds with his prolific scoring while gaining a reputation as one of the best defensive players.[15] His leaping ability, demonstrated by performing slam dunks from the free-throw line in Slam Dunk Contests, earned him the nicknames "Air Jordan" and "His Airness".[5] Jordan won his first NBA title with the Bulls in 1991 and followed that with titles in 1992 and 1993, securing a three-peat.', ' Citing physical and mental exhaustion from basketball and superstardom, Jordan abruptly retired before the 1993–94 NBA season to play Minor League Baseball in the Chicago White Sox organization.', " He returned to the Bulls in 1995 and led them to three more championships in 1996, 1997, and 1998, as well as a then-record 72 regular season wins in the 1995–96 NBA season.[5] Jordan retired for t

In [273]:
# Return only the chunks that score above the threshold for this query
threshold = 0.80
keyword_query = "Michael Jordan"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[]


In [274]:
# Return only the chunks that score above the threshold for this query
threshold = 0.75
keyword_query = "Michael Jordan"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[]


In [275]:
# Return only the chunks that score above the threshold for this query
threshold = 0.70
keyword_query = "Michael Jordan"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[' In 2014, he became the first billionaire player in NBA history.[26] In 2016, President Barack Obama awarded Jordan the Presidential Medal of Freedom.[27] As of 2025, his net worth is estimated at $3.8 billion by Forbes,[28] making him one of the richest celebrities.', ' Widely considered to be one of the greatest players of all time,[9][10][11] he was integral in popularizing basketball and the NBA around the world in the 1980s and 1990s,[12] becoming a global cultural icon.[13]\n\nJordan played college basketball with the North Carolina Tar Heels.', ' In 1999, Jordan was named the 20th century\'s greatest North American athlete by ESPN and was second to Babe Ruth on the Associated Press\' list of athletes of the century.[5] Jordan was twice inducted into the Naismith Memorial Basketball Hall of Fame, once in 2009 for his individual career,[18] and in 2010 as part of the 1992 United States men\'s Olympic basketball team ("The Dream Team").[19] He became a member of the United States

In [276]:
# Return only the chunks that score above the threshold for this query
threshold = 0.65
keyword_query = "Michael Jordan"
semantic_results = vector_store.keyword_search_threshold(keyword_query, threshold)
print(semantic_results)

[' In 2014, he became the first billionaire player in NBA history.[26] In 2016, President Barack Obama awarded Jordan the Presidential Medal of Freedom.[27] As of 2025, his net worth is estimated at $3.8 billion by Forbes,[28] making him one of the richest celebrities.', ' Widely considered to be one of the greatest players of all time,[9][10][11] he was integral in popularizing basketball and the NBA around the world in the 1980s and 1990s,[12] becoming a global cultural icon.[13]\n\nJordan played college basketball with the North Carolina Tar Heels.', ' In 1999, Jordan was named the 20th century\'s greatest North American athlete by ESPN and was second to Babe Ruth on the Associated Press\' list of athletes of the century.[5] Jordan was twice inducted into the Naismith Memorial Basketball Hall of Fame, once in 2009 for his individual career,[18] and in 2010 as part of the 1992 United States men\'s Olympic basketball team ("The Dream Team").[19] He became a member of the United States