# QA with reference
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pprados/langchain-qa_with_references/blob/master/qa_with_reference_and_verbatim.ipynb)

We believe that hallucinations pose a major problem in the adoption of LLMs (Large Language Models). It is imperative to provide a simple and quick solution that lets users verify the answers to the questions they ask.

The conventional approach is to provide a list of URLs of the documents that helped in answering (see qa_with_source). However, this approach is unsatisfactory in several scenarios:

1. The question is asked about a PDF of over 100 pages. Every fragment comes from the same document, but from where in that document?
2. Some documents do not have URLs (data retrieved from a database or other loaders).

It appears essential to have a means of retrieving all references to the actual data sources used by the model to answer the question. 

This includes:
- The precise list of documents used for the answer (the `Documents`, along with their metadata that may contain page numbers, slide numbers, or any other information allowing the retrieval of the fragment in the original document).
- The excerpts of text used for the answer in each fragment. Even if a fragment is used, the LLM only utilizes a small portion to generate the answer. Access to these verbatim excerpts helps to quickly ascertain the validity of the answer.
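
As an illustration of how such metadata can lead back to the original fragment, here is a minimal sketch that turns a metadata dictionary into a human-readable locator. The helper `describe_reference` and the keys `source`, `id`, `page` and `slide` are assumptions made for this example; the keys actually present depend on the loader that produced the documents.

```python
from typing import Any, Dict


def describe_reference(metadata: Dict[str, Any]) -> str:
    """Build a human-readable locator from whatever metadata is available.

    Hypothetical helper: the keys used here are illustrative assumptions.
    """
    parts = []
    # Prefer a URL or file path when the loader provides one.
    if "source" in metadata:
        parts.append(str(metadata["source"]))
    elif "id" in metadata:
        parts.append(f"document {metadata['id']}")
    else:
        parts.append("unknown document")
    # Append any finer-grained position information that is present.
    for key in ("page", "slide"):
        if key in metadata:
            parts.append(f"{key} {metadata[key]}")
    return ", ".join(parts)


print(describe_reference({"source": "report.pdf", "page": 42}))
print(describe_reference({"id": 7, "slide": 3}))
```

A locator of this kind covers both scenarios listed earlier: a long PDF (page number) and a document without a URL (identifier only).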

For this purpose, we propose a new pipeline: `qa_with_reference`. It is a question-answering pipeline that returns the list of documents used and, in their metadata, the list of verbatim excerpts used to produce the answer.

*At this time, only the `map_reduce` chain type can extract the verbatim excerpts.*


In [1]:
!pip install -q --upgrade pip  'langchain-qa_with_references' openai tiktoken

In [None]:
!pip install -q  python-dotenv
import os
from dotenv import load_dotenv

load_dotenv(override=True)
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = "XXXXX"

In [14]:
from langchain import OpenAI
from langchain.schema import Document

llm = OpenAI(
    max_tokens=1500,
)

In [15]:
from langchain_qa_with_references.chains import QAWithReferencesAndVerbatimsChain

chain_type = "map_reduce"
qa_chain = QAWithReferencesAndVerbatimsChain.from_chain_type(
    llm=llm,
    chain_type=chain_type,
)

question = "what does it eat?"
bodies = [
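    # Adjacent string literals are concatenated into a single document,
    # so each sentence ends with a space to keep them separated.
    "he eats apples and plays football. " "My name is Philippe. " "he eats pears.",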
    "he eats carrots. I like football.",
    "The Earth is round.",
]
docs = [
    Document(page_content=body, metadata={"id": i}) for i, body in enumerate(bodies)
]

answer = qa_chain(
    inputs={
        "docs": docs,
        "question": question,
    },
)


print(f'To answer "{answer["answer"]}", the LLM uses:')
for doc in answer["source_documents"]:
    print(f"Document {doc.metadata['id']}")
    for verbatim in doc.metadata.get("verbatims", []):
        print(f'- "{verbatim}"')

To answer "he eats apples, he eats pears, he eats carrots.", the LLM uses:
Document 0
- "he eats apples"
- "he eats pears."
Document 1
- "he eats carrots."
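
Because the excerpts are meant to be verbatim, a quick sanity check is to confirm that each one really appears in the document it is attached to. The following is a minimal sketch, not part of `langchain-qa_with_references`: `FakeDocument` is a hypothetical stand-in for LangChain's `Document`, and `check_verbatims` is an illustrative helper that assumes the excerpts are stored under `metadata["verbatims"]`, as in the output printed above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class FakeDocument:
    """Hypothetical stand-in for langchain.schema.Document."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def _normalize(text: str) -> str:
    # Collapse whitespace and ignore case so trivial differences do not matter.
    return " ".join(text.split()).lower()


def check_verbatims(docs: List[FakeDocument]) -> List[Tuple[Any, str, bool]]:
    """Return (document id, verbatim, found) for every excerpt."""
    report = []
    for doc in docs:
        content = _normalize(doc.page_content)
        for verbatim in doc.metadata.get("verbatims", []):
            found = _normalize(verbatim) in content
            report.append((doc.metadata.get("id"), verbatim, found))
    return report


docs = [
    FakeDocument(
        "he eats apples and plays football. My name is Philippe. he eats pears.",
        {"id": 0, "verbatims": ["he eats apples", "he eats pears."]},
    ),
    FakeDocument(
        "he eats carrots. I like football.",
        {"id": 1, "verbatims": ["he eats potatoes."]},  # deliberately wrong excerpt
    ),
]

for doc_id, verbatim, found in check_verbatims(docs):
    status = "OK" if found else "NOT FOUND"
    print(f'Document {doc_id}: {status} - "{verbatim}"')
```

An excerpt flagged as not found is a hint that the model paraphrased or invented the text, which is exactly the case a reader would want to inspect by hand.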


In [16]:
!pip install -q chromadb wikipedia
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import WikipediaRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

question = "what is the Machine learning?"

# Fetch the Wikipedia pages related to the question
wikipedia_retriever = WikipediaRetriever()
vectorstore = Chroma(
    embedding_function=OpenAIEmbeddings(),
)
docs = wikipedia_retriever.get_relevant_documents(question)

# Split the pages into small fragments before indexing them
split_docs = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=10
).split_documents(docs)

# Index the fragments and expose the vector store as a retriever
vectorstore.add_documents(split_docs)
retriever = vectorstore.as_retriever()
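
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size splitting with overlap. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which tries to cut on natural separators first; `split_fixed` is a hypothetical helper that only shows how consecutive fragments share a few characters.

```python
from typing import List


def split_fixed(text: str, chunk_size: int = 500, chunk_overlap: int = 10) -> List[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap.

    Simplified illustration only: no attempt is made to cut on separators.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_size must be positive and larger than chunk_overlap")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_fixed("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=2)
print(chunks)
```

With a small overlap, a sentence that straddles a boundary is less likely to be cut in a way that hides it from both neighbouring fragments.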

In [17]:
from langchain_qa_with_references.chains import (
    RetrievalQAWithReferencesAndVerbatimsChain,
)
from typing import Literal, List

chain_type: Literal["stuff", "map_reduce", "map_rerank", "refine"] = "map_reduce"

qa_chain = RetrievalQAWithReferencesAndVerbatimsChain.from_chain_type(
    llm=llm,
    chain_type=chain_type,
    retriever=retriever,
    reduce_k_below_max_tokens=True,
)
result = qa_chain(
    inputs={
        "question": question,
    }
)


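def merge_result_by_urls(result):
    """Group the verbatim excerpts of all source documents by source URL."""
    references = {}
    for doc in result["source_documents"]:
        # Fall back to a hashable placeholder when a document has no source.
        source = doc.metadata.get("source", "unknown")
        # Reuse the list already collected for this source, if any.
        verbatims_for_source: List[str] = references.get(source, [])
        verbatims_for_source.extend(doc.metadata.get("verbatims", []))
        references[source] = verbatims_for_source
    return references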


print(f'For the question "{question}", to answer "{result["answer"]}", the LLM uses:')
references = merge_result_by_urls(result)
# Print each source (in blue) followed by its verbatim excerpts (in green)
for source, verbatims in references.items():
    print(f"Source \033[94m{source}\033[0m")
    for verbatim in verbatims:
        print(f'-  "\033[92m{verbatim}\033[0m"')

For the question "what is the Machine learning?", to answer "Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can effectively generalize and thus perform tasks without explicit instructions.", the LLM uses:
Source https://en.wikipedia.org/wiki/Machine_learning
-  "Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can effectively generalize and thus perform tasks without explicit instructions."
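
When several fragments of the same page are retrieved, the same excerpt may be reported more than once. The sketch below shows one possible way to group excerpts by source while dropping duplicates and keeping their original order. `group_verbatims_by_source` is a hypothetical helper that works on plain metadata dictionaries; the `source` and `verbatims` keys follow the examples in this notebook, and the URLs are placeholders.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def group_verbatims_by_source(metadatas: Iterable[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Group excerpts by source, keeping first-seen order and removing duplicates."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for metadata in metadatas:
        # Use a placeholder when a document carries no source (assumption).
        source = str(metadata.get("source", "unknown"))
        for verbatim in metadata.get("verbatims", []):
            if verbatim not in grouped[source]:
                grouped[source].append(verbatim)
    return dict(grouped)


metadatas = [
    {"source": "https://example.org/a", "verbatims": ["first excerpt"]},
    {"source": "https://example.org/a", "verbatims": ["first excerpt", "second excerpt"]},
    {"source": "https://example.org/b", "verbatims": []},
]
print(group_verbatims_by_source(metadatas))
```

With a chain result, the helper could be called as `group_verbatims_by_source(doc.metadata for doc in result["source_documents"])`. Sources without any excerpt are left out of the returned dictionary.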
