## Test Retrieval Strategies

This notebook explores different chunking and retrieval strategies in detail.

## Formatting code

The next cell calls a function that wraps the output of the code cells so that it is easier to read. It is a workaround for the fact that Jupyter notebooks in Cursor do not wrap output the way they do in Colab.

In [1]:
from dean_utils.format import set_output_wrapping
set_output_wrapping()

## API Key Setup
---

In [2]:
from dean_utils.defaults import set_api_key
# Example usage:
# set_api_key("OPENAI_API_KEY")
# default_llm = ChatOpenAI(model="gpt-4o", api_key=set_api_key("OPENAI_API_KEY"))

## Let's explore different PDF loaders and settings

##### (Here I am just using the TITRE-protocol.pdf file for testing. Later I will need a file input routine so that I can distribute this notebook to others who do not have the TITRE file.)
---

### PyPDFLoader and PyMuPDFLoader both have the same syntax.


---
#### Using default mode = "page"
---

In [3]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the protocol and cut it up into chunks
loader = PyPDFLoader("TITRE-protocol.pdf", mode="page")
pages = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=50)
splits = text_splitter.split_documents(pages)
print(f"Page mode loaded the document in individual pages, {len(pages)} total.")
print(f"Then the text splitter worked on each page, resulting in {len(splits)} chunks.")

Page mode loaded the document in individual pages, 51 total.
Then the text splitter worked on each page, resulting in 73 chunks.


---
### mode = "single" should combine all the text into one document

---

In [4]:
from langchain_community.document_loaders import PyMuPDFLoader

# Load the protocol and cut it up into chunks
loader = PyMuPDFLoader("TITRE-protocol.pdf", mode="single")
pages = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=50)
splits = text_splitter.split_documents(pages)
print(f"Single mode loaded the document as {len(pages)} page of text.")
print(f"Then the text splitter worked on that text, resulting in {len(splits)} chunks.")

Single mode loaded the document as 1 page of text.
Then the text splitter worked on that text, resulting in 43 chunks.


PyMuPDFLoader automatically provides more metadata fields than PyPDFLoader, so the rest of this notebook uses PyMuPDFLoader. Let's look at the default metadata, using pprint to display the fields legibly.

In [5]:
import pprint
pprint.pp(pages[0].metadata)

{'producer': 'Microsoft® Word 2016',
 'creator': 'Microsoft® Word 2016',
 'creationdate': '2024-01-05T10:56:40-05:00',
 'source': 'TITRE-protocol.pdf',
 'file_path': 'TITRE-protocol.pdf',
 'total_pages': 51,
 'format': 'PDF 1.5',
 'title': '',
 'author': 'Thiagarajan, Ravi',
 'subject': '',
 'keywords': '',
 'moddate': '2024-01-05T10:56:40-05:00',
 'trapped': '',
 'modDate': "D:20240105105640-05'00'",
 'creationDate': "D:20240105105640-05'00'"}
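
The metadata above carries each timestamp twice: `creationdate`/`moddate` in ISO 8601 form and `creationDate`/`modDate` in the raw PDF date format. As a minimal sketch, assuming only the full `D:YYYYMMDDHHmmSS+HH'mm'` form seen here needs handling, the raw string can be converted with the standard library. The helper name `parse_pdf_date` is my own and is not part of either loader.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches only the complete form with an explicit UTC offset (an assumption of this sketch).
PDF_DATE = re.compile(r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})([+-])(\d{2})'(\d{2})'?")

def parse_pdf_date(raw: str) -> datetime:
    """Convert a raw PDF date string into a timezone-aware datetime."""
    match = PDF_DATE.fullmatch(raw)
    if match is None:
        raise ValueError(f"unsupported PDF date: {raw!r}")
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    sign = 1 if match.group(7) == "+" else -1
    offset = sign * timedelta(hours=int(match.group(8)), minutes=int(match.group(9)))
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone(offset))

# The modDate value from the metadata printed above
parsed = parse_pdf_date("D:20240105105640-05'00'")
print(parsed.isoformat())
```

For this document the parsed value agrees with the `moddate` field, so either form can be used when filtering on dates.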


### Compare Modes of Loading for PyMuPDFLoader

Note that these modes also apply to PyPDFLoader, so the choice of loader does not matter for this comparison. In the previous cells we saw that loading in page or single mode results in different numbers of chunks. The next cell illustrates why. Remember that a chunk size of 3000 is larger than some of the *small pages* in the original PDF document.


In [6]:
loader = PyMuPDFLoader("TITRE-protocol.pdf",
                       mode="page",)
pages = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=50)
splits = text_splitter.split_documents(pages)
print(f"Page mode number of pages: {len(pages)}, number of chunks: {len(splits)}")
print(f"Size of the first six pages: {', '.join(str(len(pages[i].page_content)) for i in range(6))}")
print(f"Length of first eight chunks: {', '.join(str(len(splits[i].page_content)) for i in range(8))}")

loader = PyMuPDFLoader("TITRE-protocol.pdf",
                       mode="single",)
pages = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=50)
splits = text_splitter.split_documents(pages)
print(f"Single mode number of pages: {len(pages)}, number of chunks: {len(splits)}")
print(f"Length of first five chunks: {', '.join(str(len(splits[i].page_content)) for i in range(5))}")


Page mode number of pages: 51, number of chunks: 73
Size of the first six pages: 1352, 4893, 3836, 1274, 1048, 1612
Length of first eight chunks: 1352, 2921, 1970, 2910, 924, 1274, 1048, 1612
Single mode number of pages: 1, number of chunks: 43
Length of first five chunks: 2927, 2960, 2930, 2993, 2981


### Important conclusion from above

The default mode for PyMuPDFLoader is "page". In page mode the text splitter works on each page separately, so if the chunk size is larger than a page, the entire page becomes a single chunk. The first page of this document had 1352 characters, and the first chunk also had 1352. The next page had 4893 characters and was split into two chunks: the first (2921 characters) was close to the 3000 chunk size setting and the second was the remaining 1970 characters.

This is not usually the desired behavior, because our chunk overlap strategy will not work *between chunks from different pages*. So we need to use the "single" mode, which loads the entire PDF as one document. Then consecutive chunks overlap throughout the text (*if that was the behavior we wanted*).
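
The effect can be reproduced without any PDF at all. The sketch below uses a naive fixed-width splitter of my own (`sliding_chunks`), which is much simpler than `RecursiveCharacterTextSplitter` and ignores separators entirely, together with two made-up "pages". It is only meant to show where overlap can and cannot occur.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Naive splitter: each chunk starts (chunk_size - chunk_overlap) after the previous one."""
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks

# Toy pages: one shorter than chunk_size, one longer.
page_1 = "A" * 40
page_2 = "B" * 130
chunk_size, chunk_overlap = 100, 20

# Page mode: each page is split on its own, so nothing is shared across the page break.
page_mode = [c for page in (page_1, page_2)
             for c in sliding_chunks(page, chunk_size, chunk_overlap)]

# Single mode: the pages are joined first, so a chunk can straddle the page break.
single_mode = sliding_chunks(page_1 + page_2, chunk_size, chunk_overlap)

print("page mode chunk lengths:  ", [len(c) for c in page_mode])
print("single mode chunk lengths:", [len(c) for c in single_mode])
print("single mode overlap kept: ", single_mode[0][-chunk_overlap:] == single_mode[1][:chunk_overlap])
```

In the toy page mode the short page survives as its own small chunk, just as the 1352-character first page did above, and no text is shared between that chunk and the next.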


## Now let's set up experiments

We are going to chunk our test document using different chunk sizes and overlaps, store each of those strategies in a separate Qdrant collection, and then create RAG chains that use each of those collections. Then we will ask questions of each chain and observe which strategies fail outright.

---


In [7]:
from dean_utils.getVectorstore import getVectorstore

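def create_vector_store(file_path, embedding_model, chunk_size, chunk_overlap):
    """Chunk one PDF with the given settings and store the chunks in their own collection."""
    # Load the whole PDF as one document so that overlap also applies across page breaks
    loader = PyMuPDFLoader(file_path, mode="single")
    pages = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    chunks = text_splitter.split_documents(pages)
    # One collection per (chunk_size, chunk_overlap) pair keeps the experiments separate
    collection_name = f"ChunkSize_{chunk_size}_Overlap_{chunk_overlap}"
    return getVectorstore(document=chunks, file_path=file_path, embedding_model=embedding_model, collection_name=collection_name)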

In the upcoming code cell, you will see that I imported two embedding models so I could compare their performance.  The BGE model was vastly inferior to the OpenAI text-embedding-3-small model, and so the code below uses the commercial model.

In [8]:
from dean_utils.defaults import bge_embedding_model, te3_embedding_model
embedding_model = te3_embedding_model
file_path="TITRE-protocol.pdf"
vectorStore1 = create_vector_store(file_path, embedding_model, 1000, 100)
vectorStore2 = create_vector_store(file_path, embedding_model, 1000, 200)
vectorStore3 = create_vector_store(file_path, embedding_model, 1000, 300)
vectorStore4 = create_vector_store(file_path, embedding_model, 1000, 400)
vectorStore5 = create_vector_store(file_path, embedding_model, 1000, 500)
vectorStore6 = create_vector_store(file_path, embedding_model, 2000, 100)
vectorStore7 = create_vector_store(file_path, embedding_model, 2000, 200)
vectorStore8 = create_vector_store(file_path, embedding_model, 2000, 300)
vectorStore9 = create_vector_store(file_path, embedding_model, 2000, 400)
vectorStore10 = create_vector_store(file_path, embedding_model, 2000, 500)
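
The ten calls above form a 2 x 5 grid of chunk sizes and overlaps. As a sketch of how that grid could be generated rather than typed out, the snippet below builds the same settings and collection names with `itertools.product`. The `experiments` list is my own naming, and the call to `create_vector_store` is left commented out so the snippet runs without Qdrant or an API key.

```python
from itertools import product

chunk_sizes = [1000, 2000]
chunk_overlaps = [100, 200, 300, 400, 500]

# product() varies the last argument fastest, matching the order of the ten calls above.
experiments = [
    {
        "chunk_size": size,
        "chunk_overlap": overlap,
        "collection_name": f"ChunkSize_{size}_Overlap_{overlap}",
    }
    for size, overlap in product(chunk_sizes, chunk_overlaps)
]

for exp in experiments:
    print(exp["collection_name"])

# With the notebook's helper available, the stores could then be built as:
# vector_stores = [
#     create_vector_store(file_path, embedding_model, e["chunk_size"], e["chunk_overlap"])
#     for e in experiments
# ]
```

Keeping the settings in one list also means the chain metadata later in the notebook could be derived from the same source instead of being retyped.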

The retriever also takes a parameter k that sets how many chunks are returned for each query, so we can vary k and observe the effect.

When k = 3, the later queries about inclusion criteria fail in the majority of instances.  When k = 5, the queries succeed in almost all cases.  When k = 8, there are still a couple of failures.  When I increase it to 20, 100% of the queries succeed using GPT-4o as the model.
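
One plausible reason a small k fails is that chunks which merely mention the topic (a table of contents entry, for example) can outrank the chunk that actually contains the answer. The sketch below illustrates this with made-up two-dimensional "embeddings" and a plain cosine-similarity ranking. The vectors, chunk names, and the `top_k` helper are all invented for illustration and say nothing about the real embedding space.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks most similar to the query, best first."""
    ranked = sorted(chunk_vecs, key=lambda cid: cosine(query_vec, chunk_vecs[cid]), reverse=True)
    return ranked[:k]

# Invented vectors: the chunk holding the answer is only the 4th closest to the query.
query = [1.0, 0.0]
chunks = {
    "table_of_contents": [1.0, 0.1],
    "synopsis": [1.0, 0.2],
    "screening": [1.0, 0.3],
    "inclusion_criteria": [1.0, 0.4],
    "statistics": [0.0, 1.0],
}

for k in (3, 5):
    hits = top_k(query, chunks, k)
    print(f"k={k}: answer chunk retrieved = {'inclusion_criteria' in hits}")
```

Raising k trades a longer prompt for a better chance that the answer chunk is included.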

In [9]:
k = 5
retrievers = [
    vectorStore1.as_retriever(search_kwargs={"k": k}),
    vectorStore2.as_retriever(search_kwargs={"k": k}),
    vectorStore3.as_retriever(search_kwargs={"k": k}),
    vectorStore4.as_retriever(search_kwargs={"k": k}),
    vectorStore5.as_retriever(search_kwargs={"k": k}),
    vectorStore6.as_retriever(search_kwargs={"k": k}),
    vectorStore7.as_retriever(search_kwargs={"k": k}),
    vectorStore8.as_retriever(search_kwargs={"k": k}),
    vectorStore9.as_retriever(search_kwargs={"k": k}),
    vectorStore10.as_retriever(search_kwargs={"k": k})
]

In [10]:
from operator import itemgetter
from dean_utils.prompts import rag_prompt_template
from dean_utils.defaults import gpt4o_llm
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

rag_prompt = ChatPromptTemplate.from_template(rag_prompt_template)
llm = gpt4o_llm

def create_rag_chains_with_metadata(retrievers, rag_prompt, llm, chunk_sizes, chunk_overlaps):
    """
    Create RAG chains with metadata about chunk sizes and overlaps.
    """
    rag_chains = {}
    for i, (retriever, chunk_size, chunk_overlap) in enumerate(zip(retrievers, chunk_sizes, chunk_overlaps), 1):
        chain_name = f"rag_chain_{i}_chunk{chunk_size}_overlap{chunk_overlap}"
        rag_chains[chain_name] = {
            "chain": (
                {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
                | rag_prompt | llm | StrOutputParser()
            ),
            "metadata": {
                "chunk_size": chunk_size,
                "chunk_overlap": chunk_overlap
            }
        }
    return rag_chains

# Create chains with metadata
chunk_sizes = [1000] * 5 + [2000] * 5  # First 5 are 1000, last 5 are 2000
chunk_overlaps = [100, 200, 300, 400, 500] * 2  # Repeat for both chunk sizes
rag_chains = create_rag_chains_with_metadata(retrievers, rag_prompt, llm, chunk_sizes, chunk_overlaps)

# Example usage:
# response = rag_chains["rag_chain_1_chunk1000_overlap100"]["chain"].invoke({"question": "What is the title?"})

Let's import my four test queries and then run them.

In [11]:
from dean_utils.testQueries import title_query, inclusion_query, exclusion_query, specific_exclusion_query
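
The query helpers live in `dean_utils` and are not shown in this notebook. As a rough sketch of what such a helper might look like, assuming it simply invokes every chain with the same question, the hypothetical `run_query` below walks the `rag_chains` dictionary built earlier. `EchoChain` is a stand-in so that the example runs without an LLM.

```python
class EchoChain:
    """Stand-in for a LangChain runnable; only .invoke() is needed here."""

    def __init__(self, answer):
        self.answer = answer

    def invoke(self, inputs):
        return f"{self.answer} (asked: {inputs['question']})"

def run_query(rag_chains, question):
    """Invoke every chain with the same question and collect the answers by chain name."""
    results = {}
    for name, entry in rag_chains.items():
        meta = entry["metadata"]
        answer = entry["chain"].invoke({"question": question})
        results[name] = answer
        # Label each answer with its settings so failures can be traced to a strategy
        print(f"[chunk={meta['chunk_size']}, overlap={meta['chunk_overlap']}] {answer}")
    return results

# Same structure as the rag_chains dictionary, with a stub in place of a real chain
demo_chains = {
    "rag_chain_1_chunk1000_overlap100": {
        "chain": EchoChain("stub answer"),
        "metadata": {"chunk_size": 1000, "chunk_overlap": 100},
    },
}
results = run_query(demo_chains, "What is the title?")
```

Printing the settings next to each answer would make it easier to see which of the ten strategies produced a failure in the outputs that follow.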


In [12]:
title_query(rag_chains)

TITLE QUERY
The title of the protocol is "TITRE: Trial of Indication-based Transfusion of Red Blood Cells in ECMO."
The title of the protocol is "TITRE: Trial of Indication-based Transfusion of Red Blood Cells in ECMO."
The title of the protocol is "TITRE: Trial of Indication-based Transfusion of Red Blood Cells in ECMO."
The title of the protocol is "TITRE: Trial of Indication-based Transfusion of Red Blood Cells in ECMO."
The title of the protocol is "TITRE: Trial of Indication-based Transfusion of Red Blood Cells in ECMO."
The title of the protocol is "TITRE: Trial of Indication-based Transfusion of Red Blood Cells in ECMO."
The title of the protocol is "TITRE: Trial of Indication-based Transfusion of Red Blood Cells in ECMO."
The title of the protocol is "TITRE: Trial of Indication-based Transfusion of Red Blood Cells in ECMO."
The title of the protocol is "TITRE: Trial of Indication-based Transfusion of Red Blood Cells in ECMO."
The title of the protocol is "TITRE: Trial of Indica

In [13]:
inclusion_query(rag_chains)

INCLUSION CRITERIA QUERY
The inclusion criteria of the protocol are not specified in the provided context.
The context does not provide specific details about the inclusion criteria of the protocol.
The inclusion criteria for the protocol are as follows:

1. Children supported with ECMO for any indication.
2. Age less than 6 years at ECMO cannulation.
3. Veno-arterial (VA) mode of ECMO.
4. First ECMO run during the index hospital admission.
The inclusion criteria of the protocol are that all must be met, but the specific criteria are not provided in the context. Therefore, I can't answer based on the context.
The inclusion criteria of the protocol are as follows:

1. Age less than 6 years at ECMO cannulation.
2. Veno-arterial (VA) mode of ECMO.
3. First ECMO run during the index hospitalization.
The inclusion criteria for the protocol are as follows:

1. Age less than 6 years at ECMO cannulation.
2. Veno-arterial (VA) mode of ECMO.
3. First ECMO run during the index hospitalization.
Th

In [14]:
exclusion_query(rag_chains)

EXCLUSION CRITERIA QUERY
The exclusion criteria of the protocol are as follows:

1. Gestationally-corrected age < 37 weeks.
2. Veno-venous (VV) mode of ECMO.
3. Patients initially started on VV-ECMO and then transitioned to VA ECMO.
4. ECMO used for procedural support (ECMO deployed and decannulated in procedural area with no ICU ECMO care) or ECMO duration expected to be < 24 hours.
5. Limitation of care in place or being discussed.
6. Congenital bleeding disorders.
7. Hemoglobinopathies.
8. Primary residence outside the country of enrollment.
9. Concurrent participation in a separate interventional trial that has the potential to impact the neurodevelopmental status of the patient.
The exclusion criteria of the protocol are as follows:

1. Gestationally-corrected age < 37 weeks
2. Veno-venous (VV) mode of ECMO
3. Patients initially started on VV-ECMO and then transitioned to VA ECMO
4. ECMO used for procedural support (ECMO deployed and decannulated in procedural area with no ICU ECM

In [15]:
specific_exclusion_query(rag_chains)

SPECIFIC EXCLUSION CRITERIA QUERY
The 4th exclusion criterion is: ECMO used for procedural support (ECMO deployed and decannulated in procedural area with no ICU ECMO care) or ECMO duration expected to be < 24 hours.

The 7th exclusion criterion is: Hemoglobinopathies.
The 4th exclusion criterion is "ECMO used for procedural support (ECMO deployed and decannulated in procedural area with no ICU ECMO care) or ECMO duration expected to be < 24 h." The 7th exclusion criterion is "Hemoglobinopathies."
The 4th exclusion criterion is "ECMO used for procedural support (ECMO deployed and decannulated in procedural area with no ICU ECMO care) or ECMO duration expected to be < 24 h." The 7th exclusion criterion is "Hemoglobinopathies."
The 4th exclusion criterion is "ECMO used for procedural support (ECMO deployed and decannulated in procedural area with no ICU ECMO care) or ECMO duration expected to be < 24 h." The 7th exclusion criterion is "Hemoglobinopathies."
The 4th exclusion criterion is 

### Conclusions to this point
---

Chunk size and overlap certainly affect retrieval performance, but the effect is much more pronounced when using the BGE model. This is probably because its embeddings have only 768 dimensions.

Increasing k did not completely solve the issue when using BGE. The lesson for me is to use the richest embedding model we can afford.

With k = 10 and the OpenAI embedding model, all the experiments succeed except for one inclusion criteria query.

The next experiment is to swap out the model and hit a local model on my laptop.  More to come!
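
So far I have been counting failures by reading the outputs. A crude way to automate that, sketched below, is to flag answers that contain typical refusal phrases. The phrase list is my own guess based on the failed answers printed above, and this kind of keyword matching would not catch a confident but wrong answer.

```python
# Phrases taken from the failed answers printed earlier in this notebook
FAILURE_MARKERS = (
    "not specified",
    "not provided",
    "does not provide",
    "can't answer",
    "not explicitly",
)

def looks_like_failure(answer):
    """True if the answer contains one of the refusal phrases."""
    lowered = answer.lower()
    return any(marker in lowered for marker in FAILURE_MARKERS)

def failure_rate(answers):
    """Fraction of answers flagged as failures (0.0 for an empty list)."""
    if not answers:
        return 0.0
    return sum(looks_like_failure(a) for a in answers) / len(answers)

sample = [
    "The inclusion criteria of the protocol are not specified in the provided context.",
    "The context does not provide specific details about the inclusion criteria of the protocol.",
    "The inclusion criteria of the protocol are that all must be met, but the specific "
    "criteria are not provided in the context. Therefore, I can't answer based on the context.",
    'The title of the protocol is "TITRE: Trial of Indication-based Transfusion of Red Blood Cells in ECMO."',
]
print(f"Flagged {failure_rate(sample):.0%} of the sample answers as failures")
```

A stricter check would compare each answer against the known criteria from the protocol, which is what reading the outputs by hand amounts to.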

---

I am pasting code from above so we can skip the GPT-4o tests and start straight from here.

In [40]:
k = 5
retrievers = [
    vectorStore1.as_retriever(search_kwargs={"k": k}),
    vectorStore2.as_retriever(search_kwargs={"k": k}),
    vectorStore3.as_retriever(search_kwargs={"k": k}),
    vectorStore4.as_retriever(search_kwargs={"k": k}),
    vectorStore5.as_retriever(search_kwargs={"k": k}),
    vectorStore6.as_retriever(search_kwargs={"k": k}),
    vectorStore7.as_retriever(search_kwargs={"k": k}),
    vectorStore8.as_retriever(search_kwargs={"k": k}),
    vectorStore9.as_retriever(search_kwargs={"k": k}),
    vectorStore10.as_retriever(search_kwargs={"k": k})
]

## Llama 3.1 8b instruct on LM Studio
The code below is the same as earlier in the notebook, with llama-3.1-instruct running on my laptop substituted for GPT-4o.

In [53]:
from dean_utils.defaults import lmstudio_llama_8b_llm

rag_prompt = ChatPromptTemplate.from_template(rag_prompt_template)
llm = lmstudio_llama_8b_llm

def create_rag_chains_with_metadata(retrievers, rag_prompt, llm, chunk_sizes, chunk_overlaps):
    """
    Create RAG chains with metadata about chunk sizes and overlaps.
    """
    rag_chains = {}
    for i, (retriever, chunk_size, chunk_overlap) in enumerate(zip(retrievers, chunk_sizes, chunk_overlaps), 1):
        chain_name = f"rag_chain_{i}_chunk{chunk_size}_overlap{chunk_overlap}"
        rag_chains[chain_name] = {
            "chain": (
                {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
                | rag_prompt | llm | StrOutputParser()
            ),
            "metadata": {
                "chunk_size": chunk_size,
                "chunk_overlap": chunk_overlap
            }
        }
    return rag_chains

# Create chains with metadata
chunk_sizes = [1000] * 5 + [2000] * 5  # First 5 are 1000, last 5 are 2000
chunk_overlaps = [100, 200, 300, 400, 500] * 2  # Repeat for both chunk sizes
rag_chains = create_rag_chains_with_metadata(retrievers, rag_prompt, llm, chunk_sizes, chunk_overlaps)


In [54]:

from dean_utils.testQueries import title_query, inclusion_query, exclusion_query, specific_exclusion_query
title_query(rag_chains)

TITLE QUERY
I can't answer based on the context.
The title of the protocol is not explicitly mentioned in the provided context. However, it appears to be related to "TITRE: Trial of Indication-based Transfusion of Red Blood Cells in ECMO" and also referred to as "Multicenter Trial of Indication- vs. Threshold-based Red Blood Cell Transfusion in ECMO".
The title of the protocol is Clinical Protocol TITRE-003.
The title of the protocol is not explicitly mentioned in the provided context, but it appears to be related to "TITRE: Trial of Indication-based Transfusion of Red Blood Cells in ECMO" and also referred to as "Clinical Protocol TITRE-003".
The title of the protocol is "TITRE: Trial of Indication-based Transfusion of Red Blood Cells in ECMO".
The title of the protocol is "TITRE: Trial of Indication-based Transfusion of Red Blood Cells in ECMO".
The title of the protocol is "TITRE: Trial of Indication-based Transfusion of Red Blood Cells in ECMO".
The title of the protocol is "TITRE:

In [55]:
inclusion_query(rag_chains)

INCLUSION CRITERIA QUERY
According to the provided context, the inclusion criteria for the study are as follows:

All patients who meet the following criteria will be included in the study:

- Must meet all of the specified inclusion criteria.
- The specific inclusion criteria are not explicitly listed in this part of the document. However, it is mentioned that they can be found under section 4.1 Inclusion Criteria (all must be met) on page 12.

It seems like there might be more information about the inclusion criteria in another part of the document.
Based on the provided context, I found that the inclusion criteria for the study are mentioned in section 4.1 of the protocol. The text states: "INCLUSION CRITERIA (ALL MUST BE MET): ........................................................................... 12". However, it does not explicitly list out the inclusion criteria.

But later in another document, I found a mention of exclusion criteria which includes some information about the

In [20]:
exclusion_query(rag_chains)

EXCLUSION CRITERIA QUERY
The exclusion criteria of the protocol include:

1. Gestationally-corrected age <37 weeks
2. Veno-venous (VV) mode of ECMO
3. Patients initially started on VV-ECMO and then transitioned to VA ECMO 
4. ECMO used for procedural support (ECMO deployed and decannulated in procedural area with no ICU ECMO care) or ECMO duration expected to be < 24 h
5. Limitation of care in-place or being discussed   
6. Congenital bleeding disorders
7. Hemoglobinopathies
8. Primary Residence outside the country of enrollment.
9. Concurrent participation in a separate interventional trial that has the potential to impact neurodevelopmental status of the patient
The exclusion criteria of the protocol include:

1. Gestationally-corrected age < 37 weeks
2. Veno-venous (VV) mode of ECMO
3. Patients initially started on VV-ECMO and then transitioned to VA ECMO
4. ECMO used for procedural support or ECMO duration expected to be < 24 h
5. Limitation of care in-place or being discussed
6. C

In [21]:
specific_exclusion_query(rag_chains)

SPECIFIC EXCLUSION CRITERIA QUERY
The 4th exclusion criterion is: ECMO used for procedural support (ECMO deployed and decannulated in procedural area with no ICU ECMO care) or ECMO duration expected to be < 24 h.

The 7th exclusion criterion is: Congenital bleeding disorders.
The 4th exclusion criterion is: ECMO used for procedural support only, or ECMO duration expected to be < 24 h.

The 7th exclusion criterion is: Congenital bleeding disorders.
Based on the provided context, the 4th Exclusion Criteria is: ECMO used for procedural support (ECMO deployed and decannulated in procedural area with no ICU ECMO care) or ECMO duration expected to be < 24 h.

The 7th Exclusion Criteria is: Congenital bleeding disorders.
Based on the provided context, the 4th exclusion criterion is: ECMO used for procedural support only, or ECMO duration expected to be < 24 h.

The 7th exclusion criterion is: Hemoglobinopathies.
Based on the context, the 4th exclusion criterion is 'ECMO used for procedural su

## Llama 3.1 8b instruct on Ollama
The code below is the same as earlier in the notebook, with llama-3.1-instruct running on my laptop substituted for GPT-4o.

The results here should be identical to those above, but this run uses Ollama as the server.
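
Whether two servers really give the same answers can be checked chain by chain instead of by eye. The sketch below scores each pair of answers with `difflib.SequenceMatcher` from the standard library. The chain names and the pairing of answers are made up for illustration (the strings are borrowed from outputs in this notebook), and `compare_backends` is a hypothetical helper.

```python
from difflib import SequenceMatcher

def compare_backends(results_a, results_b):
    """Return {chain_name: similarity in [0, 1]} for chains present in both result sets."""
    shared = sorted(results_a.keys() & results_b.keys())
    return {
        name: SequenceMatcher(None, results_a[name], results_b[name]).ratio()
        for name in shared
    }

# Hypothetical answers keyed by made-up chain names, for illustration only
lmstudio_results = {
    "chain_a": "The title of the protocol is Clinical Protocol TITRE-003.",
    "chain_b": "I can't answer based on the context.",
}
ollama_results = {
    "chain_a": "The title of the protocol is Clinical Protocol TITRE-003.",
    "chain_b": 'The title of the protocol is "Clinical Protocol TITRE-003".',
}

scores = compare_backends(lmstudio_results, ollama_results)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

A score of 1.0 means the two answers are character-for-character identical; anything lower is worth reading side by side.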

In [37]:
k = 5
retrievers = [
    vectorStore1.as_retriever(search_kwargs={"k": k}),
    vectorStore2.as_retriever(search_kwargs={"k": k}),
    vectorStore3.as_retriever(search_kwargs={"k": k}),
    vectorStore4.as_retriever(search_kwargs={"k": k}),
    vectorStore5.as_retriever(search_kwargs={"k": k}),
    vectorStore6.as_retriever(search_kwargs={"k": k}),
    vectorStore7.as_retriever(search_kwargs={"k": k}),
    vectorStore8.as_retriever(search_kwargs={"k": k}),
    vectorStore9.as_retriever(search_kwargs={"k": k}),
    vectorStore10.as_retriever(search_kwargs={"k": k})
]

In [48]:
from dean_utils.defaults import ollama_llama_8b_llm

rag_prompt = ChatPromptTemplate.from_template(rag_prompt_template)
llm = ollama_llama_8b_llm

def create_rag_chains_with_metadata(retrievers, rag_prompt, llm, chunk_sizes, chunk_overlaps):
    """
    Create RAG chains with metadata about chunk sizes and overlaps.
    """
    rag_chains = {}
    for i, (retriever, chunk_size, chunk_overlap) in enumerate(zip(retrievers, chunk_sizes, chunk_overlaps), 1):
        chain_name = f"rag_chain_{i}_chunk{chunk_size}_overlap{chunk_overlap}"
        rag_chains[chain_name] = {
            "chain": (
                {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
                | rag_prompt | llm | StrOutputParser()
            ),
            "metadata": {
                "chunk_size": chunk_size,
                "chunk_overlap": chunk_overlap
            }
        }
    return rag_chains

# Create chains with metadata
chunk_sizes = [1000] * 5 + [2000] * 5  # First 5 are 1000, last 5 are 2000
chunk_overlaps = [100, 200, 300, 400, 500] * 2  # Repeat for both chunk sizes
rag_chains = create_rag_chains_with_metadata(retrievers, rag_prompt, llm, chunk_sizes, chunk_overlaps)


In [49]:
from dean_utils.testQueries import title_query, inclusion_query, exclusion_query, specific_exclusion_query
title_query(rag_chains)

TITLE QUERY
The title of the protocol is not explicitly stated in the provided text, but it can be inferred from the document metadata and content.

From the first document, we can see that the file path is "TITRE-protocol.pdf", which suggests that the title of the protocol is likely "TITRE Protocol".

However, if we look at the second document, we can see that the title is explicitly stated as "Clinical Protocol TITRE-003". This seems to be a later version or iteration of the protocol.

Therefore, based on the available information, it appears that the title of the protocol is either "TITRE Protocol" or "Clinical Protocol TITRE-003", depending on which document we consider.
The title of the protocol is "Clinical Protocol TITRE-003".
The title of the protocol is not explicitly stated in the provided text, but it can be inferred from the context. The first document contains a section titled "7.1 STUDY TITLE", which reads:

"Title: Indication-Based Red Blood Cell Transfusion vs Instituti

In [50]:
inclusion_query(rag_chains)

INCLUSION CRITERIA QUERY
According to the provided text, the inclusion criteria for the protocol are:

* Treatment of primary disease, location of ECMO cannulation, and care prior to participant enrollment are not specified by this protocol.
* Treatment prior to consent for study participation will not influence qualification for study participation if an eligible participant meets screening inclusion/exclusion criteria.
* All ECMO patients at trial sites will be screened for meeting study inclusion criteria after deployment.
* If the patient meets inclusion criteria, and does not have any exclusion criteria, parent(s) or LAR(s) will be approached for study participation.

Note that these are not specific medical or demographic criteria, but rather a set of conditions that must be met in order to participate in the study.
According to the provided text, the inclusion criteria of the protocol are:

1. Gestationally-corrected age ≥ 37 weeks
2. Veno-arterial (VA) mode of ECMO
3. Patients 

In [51]:
exclusion_query(rag_chains)

EXCLUSION CRITERIA QUERY
According to the provided documents, the exclusion criteria of the protocol are:

1. Gestational age < 36 weeks
2. ECMO cannulation at a non-trial center
3. Concurrent participation in a separate interventional trial that has the potential to impact neurodevelopmental status of the patient
4. Patients who have undergone cardiac surgery or intervention within 30 days prior to enrollment
5. Patients with a history of neurological injury or disease (e.g., stroke, traumatic brain injury)
6. Patients with a known genetic disorder affecting the central nervous system
7. Patients with a known chromosomal abnormality
8. Patients with a known congenital anomaly that may affect neurodevelopmental outcome

Note: These exclusion criteria are not explicitly listed in the provided documents, but rather inferred from the text. The actual exclusion criteria may be different and should be verified by consulting the original protocol document.
The exclusion criteria of the proto

In [52]:
specific_exclusion_query(rag_chains)

SPECIFIC EXCLUSION CRITERIA QUERY
The 4th exclusion criterion is: "ECMO at a non-trial center and"

The 7th exclusion criterion is not explicitly mentioned in the provided text. However, based on the context, it appears that there are only 10 exclusion criteria listed, with the 7th one being:

"Patients cannulated for ECMO at a non-trial center and"
Based on the provided text, the 4th exclusion criterion is:

"ECMO used for procedural support only, or ECMO duration expected to be < 24 h"

The 7th exclusion criterion is:

"Congenital bleeding disorders"
According to the provided documents, the 4th exclusion criterion is:

"Patients initially started on VV-ECMO and then transitioned to VA ECMO"

And the 7th exclusion criterion is not explicitly mentioned in the provided text. However, based on the context, it seems that there are only 6 exclusion criteria listed, so there is no 7th exclusion criterion.

Here are the first 6 exclusion criteria:

1. Gestationally-corrected age < 37 weeks
2

### What about semantic chunking?

In [6]:
from langchain_experimental.text_splitter import SemanticChunker
from dean_utils.defaults import te3_embedding_model
from langchain_community.document_loaders import PyMuPDFLoader

# Load the protocol and cut it up into chunks
loader = PyMuPDFLoader("TITRE-protocol.pdf", mode="single")
pages = loader.load()

semantic_chunker = SemanticChunker(
    te3_embedding_model,
    breakpoint_threshold_type="percentile"
)

In [8]:
semantic_documents = semantic_chunker.split_documents(pages)

In [9]:
len(semantic_documents)

44
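
As I understand it, the "percentile" setting places a chunk boundary wherever the embedding distance between neighbouring sentences is unusually large compared with the rest of the document. The sketch below is a simplified illustration of that idea, not the library's implementation: the sentences and two-dimensional vectors are made up, the 95th percentile is an assumed cutoff, and `semantic_groups` is my own helper.

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)

def semantic_groups(sentences, vectors, pct=95):
    """Start a new group wherever the distance to the previous sentence exceeds the cutoff."""
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    groups, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            groups.append(current)
            current = []
        current.append(sentence)
    groups.append(current)
    return groups

# Invented sentences and vectors: the topic changes after the third sentence.
sentences = ["Age limit.", "ECMO mode.", "First ECMO run.", "Sample size.", "Power analysis."]
vectors = [[1.0, 0.0], [0.99, 0.1], [0.98, 0.15], [0.0, 1.0], [0.1, 0.99]]

groups = semantic_groups(sentences, vectors)
for group in groups:
    print(group)
```

Unlike the fixed-size splitter, the number and length of chunks here depend on the text itself, which is why no chunk size or overlap was passed in the cell above.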

In [10]:
from langchain_qdrant import QdrantVectorStore
semantic_vectorstore = QdrantVectorStore.from_documents(
    semantic_documents,
    te3_embedding_model,
    collection_name="semantic collection",
    url="http://localhost:6333",
)

In [53]:
from dean_utils.defaults import lmstudio_llama_8b_llm, ollama_llama_8b_llm
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from dean_utils.prompts import rag_prompt_template

semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k": 5})
llm = lmstudio_llama_8b_llm
# llm = ollama_llama_8b_llm
rag_prompt = ChatPromptTemplate.from_template(rag_prompt_template)

semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)

In [46]:
semantic_retrieval_chain.invoke({"question":"What is the title of this protocol?"})

'TITRE: Trial of Indication-Based Transfusion of Red Blood Cells in ECMO'

In [47]:
response = semantic_retrieval_chain.invoke({"question":"What are the inclusion criteria of this protocol?"})

In [48]:
print(response)

Children supported with ECMO for any indication who meet all the following:

1. Age < 6 year at ECMO cannulation
2. Veno-arterial (VA) mode of ECMO
3. First ECMO run during the index hospital admission

In the absence of any of the following at the time of screening:

1. Gestationally-corrected age <37 weeks
2. Veno-venous (VV) mode of ECMO
3. VV ECMO transitioned to VA ECMO
4. ECMO used for procedural support only, or ECMO duration expected to be < 24 h
5. Limitation of care in place or being discussed
6. Congenital bleeding disorders
7. Hemoglobinopathies
8. Primary Residence outside Unites States or Canada
9. Concurrent participation in a separate interventional trial that has the potential to impact neurodevelopmental status of the patient
10. Patients cannulated for ECMO at a non-trial center and transferred to a trial site
11. Randomization not possible within 36 h following ECMO cannulation (e.g., due to staffing or delays related to communication with participant family)


In [59]:
response = semantic_retrieval_chain.invoke({"question":"What is the 7th exclusion criterion of this protocol?"})

In [60]:
print(response)

The 7th exclusion criterion of this protocol is "Hemoglobinopathies".


In [69]:
response = semantic_retrieval_chain.invoke({"question":"What are the 7th and 9th exclusion criteria of this protocol?"})

In [70]:
print(response)

The 7th exclusion criterion is "Congenital bleeding disorders" and the 9th exclusion criterion is "Concurrent participation in a separate interventional trial that has the potential to impact neurodevelopmental status of the patient".


In [71]:
response = semantic_retrieval_chain.invoke({"question":"Summarize the hypotheses of this protocol?"})

In [73]:
print(response)

The primary hypothesis of this protocol is that children < 6 years old on ECMO support who are randomized to an indication-based RBC transfusion strategy will have greater improvement in organ function, as measured by change in the Pediatric Sequential Organ Failure Score (pSOFA) from pre-ECMO cannulation to decannulation (or 30 days post-randomization, whichever is earliest), compared with those managed with a center-specific threshold-based RBC transfusion strategy.

The secondary hypothesis is that children < 6 years old who survive to hospital discharge following ECMO support and are randomized to an indication-based RBC transfusion protocol will have better outcomes of neurodevelopment, functional status, and health-related quality of life measured at 12 months post-randomization compared with those managed with a center-specific threshold-based RBC transfusion strategy.
