# Retrievers

This notebook shows how to use the retrievers in the LangChain Europe PMC package.

## EuropePMCRetriever

The `EuropePMCRetriever` retrieves scientific articles from Europe PMC, a repository of biomedical and life sciences literature. It uses the Europe PMC API to search for articles based on a query and returns them as Document objects.

In [1]:
from langchain_europe_pmc.retrievers import EuropePMCRetriever

# Initialize the retriever with default parameters
retriever = EuropePMCRetriever()

# Search for articles about cancer
docs = retriever.invoke("malaria")

# Print the first document
print(f"Found {len(docs)} documents")
if docs:
    print("\nFirst document:")
    print(docs[0].page_content)

Found 3 documents

First document:
# A roadmap of priority evidence gaps for the co-implementation of malaria vaccines and perennial malaria chemoprevention. 

##Abstract

Progress in malaria control will rely on deployment and effective targeting of combinations of interventions, including malaria vaccines and perennial malaria chemoprevention (PMC). Several countries with PMC programmes have introduced malaria vaccination into their essential programmes on immunizations, but empirical evidence on the impact of combining these two interventions and how best to co-implement them are lacking. At the American Society of Tropical Medicine and Hygiene 2023 annual meeting, a stakeholder meeting was convened to identify key policy, operational and research gaps for co-implementation of malaria vaccines and PMC. Participants from 11 endemic countries, including representatives from national malaria and immunization programmes, the World Health Organization, researchers, implementing organizat

### Customizing the Retriever

You can customize the retriever by specifying parameters such as the number of results to return, the maximum query length, and the result type.

In [2]:
# Initialize the retriever with custom parameters
retriever = EuropePMCRetriever(
    top_k_results=5,  # Return 5 results instead of the default 3
)

# Search for articles about CRISPR gene editing
docs = retriever.invoke("CRISPR gene editing")

# Print the number of documents and their titles
print(f"Found {len(docs)} documents\n")
for i, doc in enumerate(docs):
    title = doc.metadata.get("title", "No title available")
    print(f"{i+1}. {title}")

Found 3 documents

1. Vectors in CRISPR Gene Editing for Neurological Disorders: Challenges and Opportunities.
2. Research advances CRISPR gene editing technology generated models in the study of epithelial ovarian carcinoma.
3. CRISPR Gene-Editing Combat: Targeting AIDS for total eradication.


### Using the Retriever in a Chain

You can use the retriever in a LangChain chain to answer questions based on the retrieved documents.

In [3]:
from langchain_europe_pmc.retrievers import EuropePMCRetriever
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_openai import ChatOpenAI
import os
from dotenv import load_dotenv
import json

load_dotenv()
api_key = os.getenv("MISTRAL_API_KEY")
base_url = "https://api.mistral.ai/v1/"

# Your retriever (must support get_relevant_documents(query))
retriever = EuropePMCRetriever(max_k=10)

# Prompts
query_gen_prompt = ChatPromptTemplate.from_template((
    "Generate a query for Europe PMC based on the question provided.\n"
    "answer using a json format with the key 'query' and the value the query.\n"
    "Use the adapted syntax for the Europe PMC API.\n"
    'example: {{"query": "title:XXX AND \\"yyy zzz\\""}}\n\n'
    "---\nQuestion:\n\n{question}\n---\n"
))

final_qa_prompt = ChatPromptTemplate.from_template((
    "Answer the question based only on the context provided.\n\n"
    "Cite the documents used to answer the question.\n"
    "Use the following format [i] for citations: [1], [2], [3], etc.\n"
    "with i being the id <document id=\"i\" ...> of the document in the context.\n"
    "Each citation corresponds to a document in the context.\n"
    "---\nContext:\n\n{context}\n---\n"
    "---\nQuestion:\n\n{question}\n---\n"
))

# Two separate LLMs for two steps
llm_query = ChatOpenAI(
    model="mistral-large-latest",
    openai_api_key=api_key,
    openai_api_base=base_url,
    model_kwargs={"response_format": {"type": "json_object"}}
)
llm_answer = ChatOpenAI(
    model="mistral-large-latest",
    openai_api_key=api_key,
    openai_api_base=base_url
)

def extract_query(json_str):
    try:
        data = json.loads(json_str)
        return data['query']
    except Exception:
        raise ValueError(f"Could not extract query from: {json_str}")

def format_docs(docs):
    format_str = ""
    count = 0
    for doc in docs:
        count += 1
        format_str += f"<document id=\"{count}\""
        for key, value in doc.metadata.items():
            format_str += f" {key}=\"{value}\""
        format_str += ">\n"
        format_str += doc.page_content
        format_str += "\n</document>\n"

    return format_str

# The two independent chains
generate_query_chain = (
    query_gen_prompt
    | llm_query
    | StrOutputParser()
    | RunnableLambda(extract_query)
)

answer_chain = (
    RunnablePassthrough.assign(
        context=lambda d: format_docs(d['docs'])
    )
    | final_qa_prompt
    | llm_answer
    | StrOutputParser()
)

# Compose the full chain
chain = (
    RunnableLambda(lambda question: {"question": question})
    | RunnablePassthrough.assign(
        query=lambda d: generate_query_chain.invoke({"question": d["question"]})
    )
    | RunnablePassthrough.assign(
        docs=lambda d: retriever.invoke(d["query"])
    )
    | answer_chain
)

# Create a chain that returns all intermediate results using LangChain runnables
complete_chain = (
    RunnableLambda(lambda question: {"question": question})
    | RunnablePassthrough.assign(
        query=lambda d: generate_query_chain.invoke({"question": d["question"]})
    )
    | RunnablePassthrough.assign(
        documents=lambda d: retriever.invoke(d["query"])
    )
    | RunnablePassthrough.assign(
        context=lambda d: format_docs(d["documents"])
    )
    | RunnablePassthrough.assign(
        answer=lambda d: llm_answer.invoke(
            final_qa_prompt.format(question=d["question"], context=d["context"])
        ).content
    )
    | RunnableLambda(lambda d: {
        "query": d["query"],
        "documents": [x.model_dump() for x in d["documents"]],
        "answer": d["answer"]
    })
)

# Fix the typo in the question
question = "What are the hallmarks of cancer ?"

# Run the chain that returns all outputs
results = complete_chain.invoke(question)

# Print the results
print("=== QUERY ===")
print(results["query"])

print("\n=== DOCUMENTS ===")
print(f"Number of documents: {len(results['documents'])}")
for i, doc in enumerate(results["documents"]):
    print(f"\nDocument {i+1}:")
    print(f"Title: {doc['metadata']['title']}")
    print(f"Authors: {doc['metadata']['authors']}")
    print(f"Journal: {doc['metadata']['journal']}")
    print(f"Year: {doc['metadata']['year']}")
    print(f"PMID: {doc['metadata']['pmid']}")

print("\n=== ANSWER ===")
print(results["answer"])
print("===============")


=== QUERY ===
title:"hallmarks of cancer"

=== DOCUMENTS ===
Number of documents: 10

Document 1:
Title: Complement and the hallmarks of cancer.
Authors: Artero MR, Minery A, Nedelcev L, Radanova M, Roumenina LT.
Journal: Semin Immunol
Year: 2025
PMID: 40179675

Document 2:
Title: The Hallmarks of Cancer as Eco-Evolutionary Processes.
Authors: Bhattacharya R, Avdieiev SS, Bukkuri A, Whelan CJ, Gatenby RA, Tsai KY, Brown JS.
Journal: Cancer Discov
Year: 2025
PMID: 40170539

Document 3:
Title: Probing the physical hallmarks of cancer.
Authors: Nia HT, Munn LL, Jain RK.
Journal: Nat Methods
Year: 2025
PMID: 39815103

Document 4:
Title: AKT and the Hallmarks of Cancer.
Authors: Sementino E, Hassan D, Bellacosa A, Testa JR.
Journal: Cancer Res
Year: 2024
PMID: 39437156

Document 5:
Title: The Epigenetic Hallmarks of Cancer.
Authors: Esteller M, Dawson MA, Kadoch C, Rassool FV, Jones PA, Baylin SB.
Journal: Cancer Discov
Year: 2024
PMID: 39363741

Document 6:
Title: The hallmarks of cancer i