# Question Answering

This notebook covers how to evaluate generic question answering problems. In this setting, each example contains a question and its corresponding ground truth answer, and you want to measure how well the language model answers those questions.

## Setup

For demonstration purposes, we will evaluate a simple question answering system that relies only on the model's internal knowledge. Please see the [Data Augmented Question Answering](data_augmented_qa.ipynb) guide for an example of evaluating a Q&A system over data sources.

In [4]:
from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI

In [7]:
prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessagePromptTemplate.from_template("You are a helpful AI assistant."),
        HumanMessagePromptTemplate.from_template("{question}"),
    ]
)

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

chain = LLMChain(llm=llm, prompt=prompt)

## Examples
For this purpose, we will just use two simple hardcoded examples, but see other notebooks for tips on how to get and/or generate these examples.

In [8]:
examples = [
    {
        "question": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?",
        "answer": "11",
    },
    {
        "question": 'Is the following sentence plausible? "Joao Moutinho caught the screen pass in the NFC championship."',
        "answer": "No",
    },
]

## Predictions

We can now make and inspect the predictions for these questions.

In [12]:
predictions = chain.apply(examples)
print("\n\n".join([pred['text'] for pred in predictions]))

Roger initially has 5 tennis balls. He buys 2 cans of tennis balls, and each can has 3 tennis balls. So, he has 2 * 3 = <<2*3=6>>6 additional tennis balls.
Therefore, Roger now has a total of 5 + 6 = <<5+6=11>>11 tennis balls.

No, the sentence is not plausible. Joao Moutinho is not a football player, and the NFC championship is a game in American football, not soccer.


## Evaluation

We can see that an exact match against the reference answers (`11` and `No`) would fail, because the language model's responses contain additional explanatory text. However, semantically the language model is correct in both cases. To account for this, we can use a language model itself to evaluate the answers.
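To make the mismatch concrete, here is a minimal sketch of a naive exact-match check. It copies the reference answers and the model outputs shown above into plain strings so that it runs on its own; the helper name `exact_match` and the variable names are hypothetical and not part of LangChain.

```python
# Reference answers and model outputs copied from the cells above.
references = ["11", "No"]
outputs = [
    (
        "Roger initially has 5 tennis balls. He buys 2 cans of tennis balls, "
        "and each can has 3 tennis balls. So, he has 2 * 3 = <<2*3=6>>6 "
        "additional tennis balls.\n"
        "Therefore, Roger now has a total of 5 + 6 = <<5+6=11>>11 tennis balls."
    ),
    (
        "No, the sentence is not plausible. Joao Moutinho is not a football "
        "player, and the NFC championship is a game in American football, "
        "not soccer."
    ),
]


def exact_match(prediction: str, reference: str) -> bool:
    """Hypothetical helper: case-insensitive, whitespace-trimmed equality."""
    return prediction.strip().lower() == reference.strip().lower()


exact_matches = [exact_match(p, r) for p, r in zip(outputs, references)]
print(exact_matches)  # [False, False]
```

Both checks fail even though both responses contain the right answer, which is why a more lenient, semantics-aware grader is useful here.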

In [14]:
from langchain.evaluation import load_evaluator

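# Load the built-in "qa" evaluator, which uses a language model to grade
# each prediction against its reference answer.
evaluator = load_evaluator("qa")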

In [15]:
eval_results = [
    evaluator.evaluate_strings(
        input=eg['question'],
        prediction=pred['text'],
        reference=eg['answer'],
    )
    for eg, pred in zip(examples, predictions)
]

In [17]:
for i, (eval_res, eg, pred) in enumerate(zip(eval_results, examples, predictions)):
    print(f"Example {i}:")
    print("Question: " + eg["question"])
    print("Real Answer: " + eg["answer"])
    print("Predicted Answer: " + pred["text"])
    print("Predicted Result: " + eval_res['value'])
    print()

Example 0:
Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Real Answer: 11
Predicted Answer: Roger initially has 5 tennis balls. He buys 2 cans of tennis balls, and each can has 3 tennis balls. So, he has 2 * 3 = <<2*3=6>>6 additional tennis balls.
Therefore, Roger now has a total of 5 + 6 = <<5+6=11>>11 tennis balls.
Predicted Result: CORRECT

Example 1:
Question: Is the following sentence plausible? "Joao Moutinho caught the screen pass in the NFC championship."
Real Answer: No
Predicted Answer: No, the sentence is not plausible. Joao Moutinho is not a football player, and the NFC championship is a game in American football, not soccer.
Predicted Result: CORRECT
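With more than a handful of examples, reading each verdict by eye becomes tedious. The following is a minimal sketch of tallying the verdicts, assuming each result is a dict with a `value` key as in the loop above. The `sample_results` list is a hypothetical stand-in that mirrors the two results printed above, so the snippet runs without calling a model.

```python
from collections import Counter

# Hypothetical stand-in for eval_results, mirroring the output above.
sample_results = [{"value": "CORRECT"}, {"value": "CORRECT"}]

# Count how often each verdict appears, then compute the share graded CORRECT.
counts = Counter(res["value"] for res in sample_results)
accuracy = counts["CORRECT"] / len(sample_results)

print(dict(counts))
print(f"Accuracy: {accuracy:.0%}")
```

In the notebook itself, the same two lines of tallying could be applied to `eval_results` directly.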

