**RAG using Haystack**

In [None]:
import json
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import BM25Retriever
# Set up Elasticsearch document store
document_store = ElasticsearchDocumentStore(return_embedding=True)


# Remove any documents left over from a previous run
document_store.delete_documents(index="document")
# Read PubMed data
with open("papers.json", "r") as file:
    data = json.load(file)
    print(len(data))

# Upload the data to document store
docs = []
for entry in data:
    doc = {
        "content": entry.get("abstract", {}).get("full_text", ""),
        "meta": {
            "title": entry.get("title", {}).get("full_text", ""),
            "keywords": entry.get("keywords", []),
        }
    }
    docs.append(doc)

document_store.write_documents(docs, index="document")

# Verify that the documents were uploaded
print(f"{document_store.get_document_count()} documents loaded")

# Sample query using BM25Retriever (a sparse-vector retriever)
retriever = BM25Retriever(document_store=document_store)
query = "naturalistic development"
retrieved_docs = retriever.retrieve(query=query, top_k=3)  # filters can also be passed here

# Print the retrieved documents
print("Retrieved Documents:")
for doc in retrieved_docs:
    print(f"Score: {doc.score}, Document: {doc.content}")
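
To make the "sparse retriever" comment above concrete, here is a minimal pure-Python sketch of BM25 scoring on three toy sentences. It is an illustration only, not Haystack's or Elasticsearch's implementation: the function name `bm25_scores`, the whitespace tokenizer, and the values `k1=1.2` and `b=0.75` are assumptions, and the raw scores are not comparable to the scores printed by `BM25Retriever`.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with a basic BM25 formula.

    Hypothetical helper for illustration; tokenization is plain whitespace splitting.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_docs = [
    "naturalistic driving environment for autonomous vehicles",
    "nutrition plays a key role in cognitive functioning",
    "naturalistic development of language in children",
]
scores = bm25_scores("naturalistic development", toy_docs)
ranked = sorted(zip(scores, toy_docs), reverse=True)
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")
```

Only exact term overlap counts here, which is why the second toy sentence scores zero. That is the main limitation a dense retriever is meant to address.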


In [12]:
import json
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import BM25Retriever
# Set up Elasticsearch document store
document_store = ElasticsearchDocumentStore(return_embedding=True)
# Sample query using BM25Retriever (a sparse-vector retriever)
retriever = BM25Retriever(document_store=document_store)
query = "naturalistic development"
retrieved_docs = retriever.retrieve(query=query, top_k=3)  # filters can also be passed here

# Print the retrieved documents
print("Retrieved Documents:")
for doc in retrieved_docs:
    print(f"Score: {doc.score}, Document: {doc.content}")

Retrieved Documents:
Score: 0.8337135188752255, Document: driving intelligence tests are critical to the development and deployment of autonomous vehicles. the prevailing approach tests autonomous vehicles in life-like simulations of the naturalistic driving environment. however, due to the high dimensionality of the environment and the rareness of safety-critical events, hundreds of millions of miles would be required to demonstrate the safety performance of autonomous vehicles, which is severely inefficient. we discover that sparse but adversarial adjustments to the naturalistic driving environment, resulting in the naturalistic and adversarial driving environment, can significantly reduce the required test miles without loss of evaluation unbiasedness. by training the background vehicles to learn when to execute what adversarial maneuver, the proposed environment becomes an intelligent environment for driving intelligence testing. we demonstrate the effectiveness of the proposed env

In [2]:
from haystack.nodes import FARMReader


model_ckpt = "deepset/minilm-uncased-squad2"
max_seq_length, doc_stride = 384, 128
reader = FARMReader(
    model_name_or_path=model_ckpt,
    progress_bar=False,
    max_seq_len=max_seq_length,
    doc_stride=doc_stride,
    return_no_answer=True,
)

Some weights of the model checkpoint at deepset/minilm-uncased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
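
The reader above is configured with `max_seq_len=384` and `doc_stride=128`, so passages longer than one window are split into overlapping windows. The sketch below shows the idea on token counts only. It assumes the stride is the step between successive window starts and ignores the question and special tokens that also occupy part of each window in the real reader; `window_spans` is a hypothetical helper, not part of Haystack.

```python
def window_spans(n_tokens, window_len, stride):
    """Return (start, end) token spans that cover a passage of `n_tokens` tokens.

    Simplified sketch: `stride` is treated as the step between window starts.
    """
    spans = []
    start = 0
    while True:
        end = min(start + window_len, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the last window reaches the end of the passage
        start += stride
    return spans


spans = window_spans(n_tokens=600, window_len=384, stride=128)
print(spans)  # [(0, 384), (128, 512), (256, 600)]
```

Because neighbouring windows overlap, an answer that straddles a window boundary still appears whole in at least one window.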


In [3]:
question = "What is depression?"
print(reader.predict_on_texts(question=question, texts=["depression is an illness"], top_k=1))

{'query': 'What is depression?', 'no_ans_gap': 6.250293254852295, 'answers': [<Answer {'answer': 'illness', 'type': 'extractive', 'score': 0.6270307898521423, 'context': 'depression is an illness', 'offsets_in_document': [{'start': 17, 'end': 24}], 'offsets_in_context': [{'start': 17, 'end': 24}], 'document_ids': ['fd35c5b9a2940724cae84361236a95ec'], 'meta': {}}>]}


In [4]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)

In [5]:
n_answers = 3
query = "What is described as a crucial mechanism of healthy cognitive functioning?"
preds = pipe.run(query=query, params={"Retriever": {"top_k": 3}, 
                                      "Reader": {"top_k": n_answers}})

print(f"Question: {preds['query']} \n")

# Slice instead of indexing so fewer than n_answers results cannot raise IndexError
for idx, answer in enumerate(preds["answers"][:n_answers], start=1):
    print(f"Answer {idx}: {answer.answer}")
    print(f"Review snippet: ...{answer.context}...")
    print("\n\n")

Question: What is described as a crucial mechanism of healthy cognitive functioning? 

Answer 1: 
Review snippet: ...None...



Answer 2: nutrition
Review snippet: ...tive functioning is relevant for potential interventions. among these, nutrition plays a key role. in fact, the link between gut and brain (the gut-br...



Answer 3: opioid maintenance treatment
Review snippet: ... recognition performance. based on our and earlier findings, opioid maintenance treatment may be seen as relatively safe with respect to cognitive dys...
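
Answer 1 above is empty and its context is `None`, which is consistent with the reader having been created with `return_no_answer=True`: a "no answer" prediction can be ranked alongside real spans. If only extracted spans are wanted, the empty predictions can be filtered out before printing. The sketch below uses `SimpleNamespace` stand-ins rather than real Haystack `Answer` objects, and `non_empty_answers` is a hypothetical helper.

```python
from types import SimpleNamespace


def non_empty_answers(answers):
    """Keep only predictions whose `answer` text is non-empty."""
    return [a for a in answers if a.answer]


# Stand-ins that mimic the `answer` and `context` attributes printed above
toy_answers = [
    SimpleNamespace(answer="", context=None),
    SimpleNamespace(answer="nutrition", context="... nutrition plays a key role ..."),
    SimpleNamespace(answer="opioid maintenance treatment", context="..."),
]

kept = non_empty_answers(toy_answers)
for idx, answer in enumerate(kept, start=1):
    print(f"Answer {idx}: {answer.answer}")
```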





In [6]:
from datasets import load_dataset, DatasetDict
dataset = load_dataset("pubmed_qa", "pqa_artificial")
num_test_samples = 1000  # Choose the number of samples for the test set
# Assuming `dataset` is a DatasetDict and 'train' is the key for the training set
training_set = dataset['train']

# Create a test set by taking a subset of samples from the training set
test_set = training_set.shuffle(seed=42).select(range(num_test_samples))
# Remove the selected samples from the training set, to avoid overlap
# (a set makes each membership check constant-time during the filter)
selected_pubids = set(test_set["pubid"])
training_set = training_set.filter(lambda x: x["pubid"] not in selected_pubids)

new_dataset_dict = DatasetDict({
    'train': training_set,
    'test': test_set,
})

dataset = new_dataset_dict
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['pubid', 'question', 'context', 'long_answer', 'final_decision'],
        num_rows: 210269
    })
    test: Dataset({
        features: ['pubid', 'question', 'context', 'long_answer', 'final_decision'],
        num_rows: 1000
    })
})
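
The split above only avoids leakage if no `pubid` ends up in both parts. The following sketch reproduces the same shuffle, select, and filter logic on plain integers and checks that property explicitly. It assumes unique ids; `split_by_id` and the toy id range are hypothetical and independent of the `datasets` library.

```python
import random


def split_by_id(ids, n_test, seed=42):
    """Pick `n_test` ids at random for the test split; the rest stay in train."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    test_ids = set(shuffled[:n_test])
    # Keep the original order for the training ids, as a filter would
    train_ids = [i for i in ids if i not in test_ids]
    return train_ids, sorted(test_ids)


toy_ids = list(range(100))
train_ids, test_ids = split_by_id(toy_ids, n_test=10)

# Sanity checks: sizes add up and the two splits share no id
print(len(train_ids), len(test_ids))
print(set(train_ids).isdisjoint(test_ids))
```

The `datasets` library also provides `Dataset.train_test_split`, which may be a simpler way to get a disjoint split in one call.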


In [7]:
import pandas as pd
dfs = {split: dset.to_pandas() for split, dset in dataset.flatten().items()}

for split, df in dfs.items():
    print(f"Number of questions in {split}: {df['pubid'].nunique()}")

Number of questions in train: 210269
Number of questions in test: 1000


**Customizing our Dense Passage Retriever**

In [8]:
from haystack.nodes import DensePassageRetriever

dpr_retriever = DensePassageRetriever(document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    embed_title=False)



In [10]:
retrieved_docs = dpr_retriever.retrieve(query="", top_k=3)  # filters can also be passed here

# Print the retrieved documents
print("Retrieved Documents:")
for doc in retrieved_docs:
    print(f"Score: {doc.score}, Document: {doc.content}")

No documents with embeddings. Run the document store's update_embeddings() method.


Retrieved Documents:


In [None]:
document_store.update_embeddings(retriever=dpr_retriever)
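
The earlier message "No documents with embeddings" appears because dense retrieval compares a query vector against stored document vectors, so nothing can be returned until those vectors exist. The sketch below illustrates this with toy two-dimensional vectors and dot-product scoring. It is an assumption-laden simplification: `top_k_by_dot_product` is a hypothetical helper, and real DPR embeddings come from the two encoder models configured above.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_by_dot_product(query_vec, doc_vecs, k=3):
    """Rank documents by dot product with the query, skipping missing embeddings."""
    scored = [
        (dot(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
        if vec is not None  # documents without an embedding cannot be scored
    ]
    return sorted(scored, reverse=True)[:k]


# Toy store: document "c" has no embedding yet
toy_doc_vecs = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": None}
hits = top_k_by_dot_product([1.0, 0.0], toy_doc_vecs, k=2)
print(hits)

# With no embeddings at all, the result is empty
print(top_k_by_dot_product([1.0, 0.0], {"c": None}))
```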

In [11]:
from haystack.nodes import PromptNode, PromptTemplate, AnswerParser

rag_prompt = PromptTemplate(
    prompt="""question: {query} context: {join(documents)}""",
    output_parser=AnswerParser(),
)

prompt_node = PromptNode(
    model_name_or_path="MaRiOrOsSi/t5-base-finetuned-question-answering",
    default_prompt_template=rag_prompt,
)

#prompt_node = PromptNode(model_name_or_path="microsoft/BioGPT-Large-PubMedQA", default_prompt_template=rag_prompt)


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [13]:
from haystack.pipelines import Pipeline

pipe = Pipeline()
pipe.add_node(component=retriever, name="retriever", inputs=["Query"])  # swap in dpr_retriever for dense retrieval
pipe.add_node(component=prompt_node, name="prompt_node", inputs=["retriever"])


In [20]:

output1 = pipe.run(query="Does psammaplin A induce Sirtuin 1-dependent autophagic cell death in doxorubicin-resistant MCF-7/adr human breast cancer cells and xenografts?")
output2 = pipe.run(query="What are the benefits of using biopolymer-based films in food packaging?")
output3 = pipe.run(query="Has the use of digital technology in managing AKI been fully explored and implemented in clinical practice?")

print(output3)

The prompt has been truncated from 4889 tokens to 412 tokens so that the prompt length and answer length (100 tokens) fit within the max token limit (512 tokens). Shorten the prompt to prevent it from being cut off
The prompt has been truncated from 2906 tokens to 412 tokens so that the prompt length and answer length (100 tokens) fit within the max token limit (512 tokens). Shorten the prompt to prevent it from being cut off
The prompt has been truncated from 1728 tokens to 412 tokens so that the prompt length and answer length (100 tokens) fit within the max token limit (512 tokens). Shorten the prompt to prevent it from being cut off
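
The truncation warnings follow from simple arithmetic: the model accepts 512 tokens, 100 are reserved for the answer, so 412 remain for the question plus all retrieved documents. One way to avoid silent truncation is to keep only as many documents as fit in that budget (or to lower the retriever's `top_k`). The sketch below works on token counts only; `prompt_budget` and `fit_documents` are hypothetical helpers and the counts in the example are made up.

```python
def prompt_budget(max_length, answer_length):
    """Tokens left for the prompt after reserving room for the answer."""
    return max_length - answer_length


def fit_documents(question_tokens, doc_token_counts, max_length=512, answer_length=100):
    """Greedily keep leading documents while the prompt stays within budget.

    Returns the indices of the documents that fit, in their original order.
    """
    budget = prompt_budget(max_length, answer_length) - question_tokens
    kept, used = [], 0
    for idx, n_tokens in enumerate(doc_token_counts):
        if used + n_tokens > budget:
            break  # this document and all later ones are dropped
        kept.append(idx)
        used += n_tokens
    return kept


print(prompt_budget(512, 100))              # 412, matching the warnings above
print(fit_documents(20, [150, 200, 300]))   # [0, 1]
```

Dropping whole documents keeps each remaining passage intact, whereas truncation cuts the concatenated context at an arbitrary point.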


{'answers': [<Answer {'answer': 'no', 'type': 'generative', 'score': None, 'context': None, 'offsets_in_document': None, 'offsets_in_context': None, 'document_ids': ['6637e755f1b9dbad1abd5198abbd1271', '4af46bed70cb2fdcb3dd3f205c50cb95', '29d3dbaa6f734b793234a18d81ab4bbc', 'e40b51e794081d673fb59d421d402ac3', '777901db3237eecac1544ba14d4c4ecc', 'ff6b279667306523ce5c9f43adbe7d3b', '68689135372f4a207b1828ac3777f58f', '932bf78d0b4d16d0a53012c5aa71413d', '66b0ded26c60517b660da5c227aa9929', 'ef8f0d0b59d91d1c81575983a676141f'], 'meta': {'prompt': 'question: Has the use of digital technology in managing AKI been fully explored and implemented in clinical practice? context: acute kidney injury (aki), which is a common complication of acute illnesses, affects the health of individuals in community, acute care and post-acute care settings. although the recognition, prevention and management of aki has advanced over the past decades, its incidence and related morbidity, mortality and health care b