# Data Augmented Question Answering

This notebook uses generic prompts and language models to evaluate a question answering system that draws on data sources beyond what the model itself contains. For example, you can use it to evaluate a question answering system over your proprietary data.

The overall steps to do this are:
1. Define your chain for the Q&A system
2. Define a dataset (as a list of examples)
3. Evaluate the chain on the dataset

## Setup

Let's set up an example using our favorite sample text, the State of the Union address. The setup involves:
1. Loading the text data
2. Chunking and storing data in the vectorstore
3. Creating the retriever from the vectorstore
4. Creating the Q&A chain using an LLM and retriever

First, fetch the example data from the LangChain repository.

In [1]:
import requests

state_of_the_union_url = "https://raw.githubusercontent.com/langchain-ai/langchain/76102971c056bb277bf394068c98fb05ee2fb07d/docs/extras/modules/state_of_the_union.txt"
response = requests.get(state_of_the_union_url)
response.raise_for_status()  # Fail early if the download did not succeed
with open("state_of_the_union.txt", "w") as f:
    f.write(response.text)

#### Chunk the text data

Use the `CharacterTextSplitter` to chunk the text data using naive character-length splitting.

In [2]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

loader = TextLoader("state_of_the_union.txt")
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(loader.load())
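
To make `chunk_size` concrete, here is a minimal pure-Python sketch of separator-based character splitting. It is a simplified illustration and not LangChain's implementation: the function name `split_by_characters` is hypothetical, it assumes a blank-line (`"\n\n"`) separator, and it ignores overlap, which matches `chunk_overlap=0` above.

```python
def split_by_characters(text, chunk_size=1000, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most
    `chunk_size` characters. A single piece longer than `chunk_size`
    is kept whole rather than cut mid-paragraph."""
    chunks = []
    current = ""
    for piece in text.split(separator):
        # Try to append the next piece to the chunk being built.
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # The chunk is full: emit it and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample text, used only for illustration.
sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
for chunk in split_by_characters(sample, chunk_size=40):
    print(len(chunk), repr(chunk))
```

Because the split happens only at separators, chunk boundaries follow paragraph breaks instead of landing at arbitrary character offsets.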

#### Create Retriever

Select the `embeddings` to use for vectorizing the text chunks, and select the vectorstore to drive the retriever used for question answering.

In [3]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma


embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)
retriever = docsearch.as_retriever()
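
Conceptually, a vectorstore-backed retriever embeds the query, compares it with the stored chunk embeddings, and returns the closest chunks. The sketch below illustrates that idea in plain Python. It is an assumption-laden toy, not how Chroma or `OpenAIEmbeddings` work internally: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, and `retrieve` is a hypothetical helper.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


# Hypothetical chunks, used only for illustration.
chunks = [
    "the president nominated a judge to the supreme court",
    "the economy added jobs last year",
    "infrastructure funding will repair roads and bridges",
]
print(retrieve("who was nominated to the supreme court", chunks, k=1))
```

A real embedding model captures meaning rather than exact word overlap, but the ranking step follows the same pattern.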

#### Create QA Chain

We will use `gpt-3.5-turbo` for this example.

In [4]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
qa = RetrievalQA.from_llm(
    llm=llm,
    retriever=retriever,
)

## Examples

Now we need some examples to evaluate. There are two basic ways to do this:

1. Hard code some examples ourselves
2. Generate examples automatically, using a language model

If you have example data from prior usage, that is often the best source. When you're just starting out, you can bootstrap a dataset using the `QAGenerateChain` or your own custom `LLMChain`.

In [17]:
# Hard-coded examples
examples = [
    {
        "query": "What did the president say about Ketanji Brown Jackson",
        "answer": "He praised her legal ability and said he nominated her for the supreme court.",
    },
    {"query": "What did the president say about Michael Jackson", "answer": "Nothing"},
]

In [6]:
# Generated examples
from langchain.evaluation.qa import QAGenerateChain

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
example_gen_chain = QAGenerateChain.from_llm(llm)

In [16]:
new_examples = [
    ex[example_gen_chain.output_key] for ex in example_gen_chain.apply([{"doc": t} for t in texts[:5]])
]

In [19]:
# Combine examples
examples += new_examples
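
Since generated examples come from a language model, it can be worth checking the combined list before evaluating. The following is a minimal sketch, assuming every example should be a dict with non-empty `"query"` and `"answer"` strings, as in the hard-coded examples. The helper name `validate_examples` and the sample data are hypothetical.

```python
def validate_examples(examples):
    """Return a list of (index, problem) pairs.
    An empty list means every example has usable 'query' and 'answer' fields."""
    problems = []
    for i, example in enumerate(examples):
        if not isinstance(example, dict):
            problems.append((i, "not a dict"))
            continue
        for key in ("query", "answer"):
            value = example.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((i, f"missing or empty '{key}'"))
    return problems


# Hypothetical examples, used only for illustration.
sample_examples = [
    {"query": "What did the president say about Michael Jackson", "answer": "Nothing"},
    {"query": "A generated question with no answer"},
]
for index, problem in validate_examples(sample_examples):
    print(f"example {index}: {problem}")
```

Running a check like this first surfaces malformed entries early, before any calls to the chain or the evaluator are spent on them.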

## Evaluate

Now that we have examples, it's time to evaluate the chain. Generate predictions and then use an evaluator to grade its performance.

In [20]:
predictions = qa.apply(examples)

Use the [qa](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.qa.eval_chain.QAEvalChain.html#langchain.evaluation.qa.eval_chain.QAEvalChain) evaluator to grade the correctness of the question answering chain. For more information on evaluators, check out the [reference docs](https://api.python.langchain.com/en/latest/api_reference.html#module-langchain.evaluation).

In [37]:
from langchain.evaluation import load_evaluator

qa_evaluator = load_evaluator("qa")

***Use the `tabulate` package for pretty printing the results.***

In [None]:
# %pip install tabulate

In [45]:
from tqdm import tqdm
from tabulate import tabulate

def truncate(s, n):
    """Truncate `s` to `n` characters."""
    return (s[:n] + '..') if len(s) > n else s

def print_results(examples, predictions, evaluators):
    max_length = 80
    table = [("Example", "Evaluator", "Value", "Score", "Query", "Prediction", "Answer")]
    for i, (eg, pred) in tqdm(enumerate(zip(examples, predictions))):
        for evaluator in evaluators:
            verdict = evaluator.evaluate_strings(
                input=eg['query'],
                prediction=pred['result'],
                reference=eg['answer'],
            )
            table.append(
                (f"{i}",
                f"{evaluator.evaluation_name}",
                f"{verdict['value']}",
                f"{verdict['score']}",
                f"{truncate(eg['query'], max_length)}",
                f"{truncate(pred['result'], max_length)}",
                f"{truncate(eg['answer'], max_length)}")
            )
    print(tabulate(table, headers="firstrow", tablefmt='grid'))


In [46]:
print_results(examples, predictions, [qa_evaluator])

+-----------+-------------+-----------+---------+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
|   Example | Evaluator   | Value     |   Score | Query                                                                              | Prediction                                                                         | Answer                                                                             |
|         0 | correctness | CORRECT   |       1 | What did the president say about Ketanji Brown Jackson                             | The president said that Ketanji Brown Jackson is one of our nation's top legal m.. | He praised her legal ability and said he nominated her for the supreme court.      |
+-----------+-------------+-----------+---------+------------------------------------
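
A table is convenient for inspecting individual rows, but a single summary number is often easier to track across runs. The sketch below assumes each verdict is a dict with `'value'` and `'score'` keys, as read by `print_results` above. The helper `summarize_verdicts` and the sample verdicts are hypothetical and are not output from this notebook.

```python
from collections import Counter


def summarize_verdicts(verdicts):
    """Aggregate evaluator verdicts into label counts and a mean score."""
    # Skip verdicts without a numeric score instead of failing on them.
    scores = [v["score"] for v in verdicts if v.get("score") is not None]
    return {
        "count": len(verdicts),
        "value_counts": dict(Counter(v.get("value") for v in verdicts)),
        "mean_score": sum(scores) / len(scores) if scores else None,
    }


# Hypothetical verdicts, used only for illustration.
sample_verdicts = [
    {"value": "CORRECT", "score": 1},
    {"value": "INCORRECT", "score": 0},
    {"value": "CORRECT", "score": 1},
]
print(summarize_verdicts(sample_verdicts))
```

For a correctness evaluator that scores 1 for correct and 0 for incorrect, the mean score is the accuracy over the dataset.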

## Evaluate with Other Metrics

In addition to predicting whether the answer is correct or incorrect using a language model, we can also use other evaluators, such as the [labeled_criteria](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.criteria.eval_chain.LabeledCriteriaEvalChain.html#langchain.evaluation.criteria.eval_chain.LabeledCriteriaEvalChain) evaluator.

Let's evaluate based on conciseness and a custom 'pedagogical skill' criterion.

In [33]:
evaluators = [
    load_evaluator("labeled_criteria", criteria="conciseness"),
    load_evaluator("labeled_criteria", criteria={
        "pedagogical skill": "Did the submission properly interpret inquiries, generate informative and understandable responses,"
        " and present information in a manner that promotes strong thinking and problem-solving?"
    }),
]

In [47]:
print_results(examples, predictions, evaluators)

+-----------+-------------------+---------+---------+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
|   Example | Evaluator         | Value   |   Score | Query                                                                              | Prediction                                                                         | Answer                                                                             |
|         0 | conciseness       | Y       |       1 | What did the president say about Ketanji Brown Jackson                             | The president said that Ketanji Brown Jackson is one of our nation's top legal m.. | He praised her legal ability and said he nominated her for the supreme court.      |
+-----------+-------------------+---------+---------+--------------------