# Evaluation of generative question answering results

with a view toward syntactical differences in the prompt.

Please note: any non-blog-demo use case would likely make use of some experiment tracking framework. I use DIY code and Jupyter to keep things self-contained.

In [1]:
from typing import List
import difflib
from pprint import pprint
from pathlib import Path
import os
import json

import pyarrow as pa
from omegaconf import OmegaConf

In [2]:
def diff_prompt_config(left_run_config, right_run_config) -> List:
    d = difflib.Differ()
    prompt_config_keys = ['context', 'question', 'answer_format', 'template']
    diffs = {}
    keys_with_diff = []
    for key in prompt_config_keys:

        diff = d.compare(
            left_run_config[key], right_run_config[key]
        )
        # Remove match comparison results; see https://docs.python.org/3/library/difflib.html#difflib.Differ
        diff = [diff_char for diff_char in diff if '  ' != diff_char[:2]]
        diffs[key] = diff
        if diff:
            keys_with_diff.append(key)
    
    # Get summary of diffs
    print(f'Keys with diff: {keys_with_diff}')
    
    for key, value in diffs.items():
        print(key)
        pprint(value)
    return keys_with_diff
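
Because the config values are plain strings, `difflib.Differ.compare` walks them character by character, and every entry it emits starts with a two-character prefix (`'- '`, `'+ '`, or two spaces for a match). The following is a minimal standalone sketch of that filtering step. The helper name `char_diff` and the example strings are assumptions made for illustration, not part of the experiment code.

```python
import difflib


def char_diff(left: str, right: str) -> list:
    """Return only the non-matching entries of a character-level diff."""
    differ = difflib.Differ()
    # Strings are sequences of characters, so Differ compares char by char.
    entries = differ.compare(left, right)
    # Entries starting with two spaces mark characters common to both sides.
    return [entry for entry in entries if not entry.startswith('  ')]


if __name__ == '__main__':
    # Illustrative strings only, not taken from the experiment configs.
    print(char_diff('Männer', 'Maenner'))
    print(char_diff('same', 'same'))
```

An empty list means the two strings are identical, which is what `diff_prompt_config` relies on when it collects `keys_with_diff`.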

In [3]:
expected_table = pa.Table.from_pylist([
    {
        'Platz': 1,
        'Schwimmer': 'Thomas Ehrhardt',
        'JG': 1977,
        'Verein': 'SSKC Poseidon Aschaffenburg',
        'Zeit': '1:01,64',
        'Punkte': 465,
        'Ort': 'Gwangju',
        'Datum': '8/2019'
    },
    {
        'Platz': 2,
        'Schwimmer': 'Jochen Kaminski',
        'JG': 1974,
        'Verein': 'SSF Bonn 05',
        'Zeit': '1:03,91',
        'Punkte': 417,
        'Ort': 'Karlsruhe',
        'Datum': '6/2019'
    },
    {
        'Platz': 3,
        'Schwimmer': 'Paul Larsen',
        'JG': 1977,
        'Verein': 'TSV Haar',
        'Zeit': '1:05,01',
        'Punkte': 397,
        'Ort': 'Kranj',
        'Datum': '9/2018'
    },
    {
        'Platz': 4,
        'Schwimmer': 'Sebastian Kratzenstein',
        'JG': 1978,
        'Verein': 'BSC Robben',
        'Zeit': '1:05,21',
        'Punkte': 393,
        'Ort': 'Karlsruhe',
        'Datum': '6/2019'
    },
    {
        'Platz': 5,
        'Schwimmer': 'Torben Kritzer',
        'JG': 1977,
        'Verein': 'Bad Homburger SC 1927',
        'Zeit': '1:05,64',
        'Punkte': 385,
        'Ort': 'Karlsruhe',
        'Datum': '6/2019'
    }
])

## Group together the experiment runs

into runs that failed to produce the expected schema, ones that produced the expected schema but got the answer wrong in some other way, and runs that yielded a correct answer.

In [4]:
results_dir = Path(os.environ['PROJECT_ROOT']) / 'generative-question-answering' / 'outputs' / 'blog-all-combos'

run_dirs = [elt for elt in results_dir.iterdir() if elt.is_dir()]
schema_successes = []
schema_failures = []

for run_dir in run_dirs:
    with open(run_dir / 'result.json', 'r') as fp:
        result = json.load(fp)
    with open(run_dir / '.hydra' / 'config.yaml') as fp:
        conf = OmegaConf.load(fp)
        conf = OmegaConf.to_object(conf)
    if result['data_contract_success']:
        schema_successes.append({
            'response': result['response'], 'config': conf, 'run_dir': run_dir.as_posix()})
    else:
        schema_failures.append({'response': result['response'], 'config': conf, 'run_dir': run_dir.as_posix()})

In [5]:
expected_result = []
expected_schema_wrong_result = []

for run in schema_successes:
    if pa.Table.from_pylist(run['response']).equals(expected_table):
        expected_result.append(run)
    else:
        expected_schema_wrong_result.append(run)
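
The three-way grouping can also be written as a single function, which makes the decision order explicit. This is only a sketch: the name `classify_run` and the group labels are hypothetical, and plain Python equality on a list of dicts stands in for the `pa.Table.equals` check used above, so it is sensitive to row order but not to key order.

```python
def classify_run(result: dict, expected_rows: list) -> str:
    """Assign a run to one of the three groups used in this post."""
    # The data contract check comes first: without the expected schema,
    # comparing against the expected rows is not meaningful.
    if not result.get('data_contract_success', False):
        return 'schema_failure'
    # Plain equality replaces the pyarrow table comparison (an assumption).
    if result['response'] == expected_rows:
        return 'correct'
    return 'expected_schema_wrong_result'


if __name__ == '__main__':
    expected = [{'Platz': 1, 'Schwimmer': 'Thomas Ehrhardt'}]
    print(classify_run({'data_contract_success': True, 'response': expected}, expected))
    print(classify_run({'data_contract_success': False, 'response': 'Sorry'}, expected))
```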
    

## Look at successful runs

In [6]:
for run in expected_result:
    print(run['config'])
    print('')

{'generate_parent_module': 'openai', 'generate_object': 'ChatCompletion', 'generate_method': 'create', 'model_version': 'gpt-3.5-turbo-0301', 'model_params': {'temperature': 0}, 'response_parser_parent_module': 'mp_blog.llm_utils', 'response_parser_method': 'openai_chat_completion_parser', 'context': 'Quelle: https://www.dsv.de/schwimmen/wettkampf-national/bestenlisten/\n\nAuswahl\nGeschlecht:\nM\tW\tX\nBahn:\n25m\t50m\nStrecke:\n\n100m Schmetterling\nZeitbereich:\n\nSaison 2018/2019\nAltersklasse:\n\nAK 40 - JG 1974 - 1978\nPunkte:\n\nFINA 2022 (25m)\nRegion:\n\nDeutschland\n\n\n25 Einträge\nSuche\nDeutscher Rekord: 0:51,19 von Steffen Deibler (Hamburger SC r.V. von 1879) am 28.04.2013\nPlatz\tSchwimmer\tJG\tVerein\tZeit\tPunkte\tOrt\tDatum\n1\tThomas Ehrhardt\t1977\tSSKC Poseidon Aschaffenburg\t1:01,64\t465\tGwangju\t8/2019\n2\tJochen Kaminski\t1974\tSSF Bonn 05\t1:03,91\t417\tKarlsruhe\t6/2019\n3\tPaul Larsen\t1977\tTSV Haar\t1:05,01\t397\tKranj\t9/2018\n4\tSebastian Kratzenstein\t1

### and the diffs between the two successful runs

In [7]:
diff_prompt_config(expected_result[0]['config'], expected_result[1]['config'])

IndexError: list index out of range

which tells us that `ae` vs `ä` in the word 'Männer' makes no difference for these runs.

## Analyze runs with the correct schema but wrong result

In [None]:
expected_schema_wrong_result

## and finally look at the failed runs

Let's filter out the runs with blank context, as they yield the usual "Sorry, as an AI language model, I don't have access to the specific information you requested" or "I'm sorry, I cannot provide an answer to this question as I am an AI language model and do not have access to current or historical sports data" responses.

In [None]:
failures_sans_context = [run for run in schema_failures if not run['config']['context']]
for run in failures_sans_context[:3]:
    print(run['response'])
    print('\n')

Let's see if there is any clue to the unexpected JSON schema by diffing against the first of the correct results. We'll drill down further below.

In [None]:
failures_w_context = [run for run in schema_failures if run['config']['context']]

diff_keys_per_run = {}
for idx, run in enumerate(failures_w_context):
    print(f'Failure with non-empty context run {idx}')
    keys_with_diff = diff_prompt_config(expected_result[0]['config'], run['config'])
    diff_keys_per_run[idx] = keys_with_diff
    print('\n')

## Failure diffs in detail

Let's first look at the failed runs for which only one of `question`, `answer_format` or `template` was different from the first successful run.

In [None]:
diff_keys_per_run
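
To pick out the runs where exactly one prompt component differs, a small filter over a mapping shaped like `diff_keys_per_run` is enough. The function name `runs_with_single_diff` and the sample mapping below are made up for illustration and do not reproduce the actual experiment output.

```python
def runs_with_single_diff(diff_keys_per_run: dict) -> dict:
    """Map run index to its single differing key, skipping all other runs."""
    return {
        idx: keys[0]
        for idx, keys in diff_keys_per_run.items()
        if len(keys) == 1
    }


if __name__ == '__main__':
    # Hypothetical input; the real mapping comes from the loop above.
    sample = {0: ['question'], 1: ['answer_format'], 2: ['question', 'template']}
    print(runs_with_single_diff(sample))
```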

### question diff failures

The differences from the successful run were that the question was posed in English in run index `0`, and that "Männer" was spelled with `ae` rather than `ä` in run index `5`.

In [None]:
idx = 0

print(f"Incorrect response: {failures_w_context[idx]['response']}\n")
print(f"question: {failures_w_context[idx]['config']['question']}\n")

In [None]:
idx = 5

print(f"Incorrect response: {failures_w_context[idx]['response']}\n")
print(f"question: {failures_w_context[idx]['config']['question']}\n")

At first glance, these responses look correct. They failed our data contract, and were hence deemed incorrect, because of the types of "JG" (Jahrgang, i.e. year of birth) and "Punkte" (i.e. points) when the question is phrased in English.
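
A type problem like this is easy to miss by eye, since `1977` and `'1977'` print almost identically. The sketch below checks a response row against the column types of `expected_table`. The helper `type_mismatches` is hypothetical, the actual data contract code is not shown in this post, and the sample row with string-typed values is an assumption about what such a failure could look like.

```python
# Column types taken from the expected table defined earlier in this post.
EXPECTED_TYPES = {
    'Platz': int, 'Schwimmer': str, 'JG': int, 'Verein': str,
    'Zeit': str, 'Punkte': int, 'Ort': str, 'Datum': str,
}


def type_mismatches(row: dict, expected_types: dict) -> dict:
    """Return {field: actual type name} for fields with an unexpected type."""
    return {
        field: type(row[field]).__name__
        for field, expected in expected_types.items()
        if field in row and type(row[field]) is not expected
    }


if __name__ == '__main__':
    # Hypothetical row: values look right, but two numbers arrive as strings.
    row = {'Platz': 1, 'Schwimmer': 'Thomas Ehrhardt', 'JG': '1977', 'Punkte': '465'}
    print(type_mismatches(row, EXPECTED_TYPES))
```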

That's at least one explanation, but it feels a little far-fetched. It could just be the inherent randomness of generative AI: the same input will yield different outputs with some non-zero probability (unless you force probabilities to zero as with e.g. [context-free grammars](https://matt-rickard.com/context-free-grammar-parsing-with-llms)).

### Question in English: Maybe it's just LLM randomness?

To test the role of mixing an English question with German context in response data type errors, we ran the same prompt 10 times, putting results in [munichpavel.github.io/generative-question-answering/outputs/blog-english-question-repeats](https://munichpavel.github.io/generative-question-answering/outputs/blog-english-question-repeats).

Let's see if there are any that get the correct answer thanks to the randomness of generative AI.

In [None]:
results_english = Path(os.environ['PROJECT_ROOT']) / 'generative-question-answering' / 'outputs' / 'blog-english-question-repeats'
schema_successes = []
schema_failures = []

run_dirs = [elt for elt in results_english.iterdir() if elt.is_dir()]

for run_dir in run_dirs:
    with open(run_dir / 'result.json', 'r') as fp:
        result = json.load(fp)
    with open(run_dir / '.hydra' / 'config.yaml') as fp:
        conf = OmegaConf.load(fp)
        conf = OmegaConf.to_object(conf)
    if result['data_contract_success']:
        schema_successes.append({
            'response': result['response'], 'config': conf, 'run_dir': run_dir.as_posix()})
    else:
        schema_failures.append({'response': result['response'], 'config': conf, 'run_dir': run_dir.as_posix()})
        
print(f'Number of runs with schema successes {len(schema_successes)}')
print(f'Number of runs with schema failures {len(schema_failures)}')

print(f"Response from run 10: {result['response']}")

We have not proved that giving German text as context with a question posed in English leads to worse results ceteris paribus, but if you are building an LLM service for this use case, it seems you should phrase your question in the same language as the context.
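
When repeating one prompt several times, it can help to look beyond the success count and ask how many distinct responses came back. The sketch below assumes each run is a dict with the `data_contract_success` and `response` keys found in `result.json`. The function `summarize_repeats` and the sample runs are illustrative assumptions, not results from the experiment.

```python
import json
from collections import Counter


def summarize_repeats(runs: list) -> dict:
    """Count successes and distinct responses across repeated runs."""
    n_success = sum(1 for run in runs if run['data_contract_success'])
    # Serialize with sorted keys so equal payloads compare equal as strings.
    distinct = Counter(
        json.dumps(run['response'], sort_keys=True, ensure_ascii=False)
        for run in runs
    )
    return {
        'n_runs': len(runs),
        'n_success': n_success,
        'n_distinct_responses': len(distinct),
    }


if __name__ == '__main__':
    # Made-up runs for illustration only.
    runs = [
        {'data_contract_success': False, 'response': [{'JG': '1977'}]},
        {'data_contract_success': False, 'response': [{'JG': '1977'}]},
        {'data_contract_success': True, 'response': [{'JG': 1977}]},
    ]
    print(summarize_repeats(runs))
```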

### Failure 1--answer format does not specify JSON-only

In [None]:
idx = 1
print(f"Incorrect response: {failures_w_context[idx]['response']}\n")
print(f"answer_format: {failures_w_context[idx]['config']['answer_format']}\n")

This failure is the most intuitive one. If you don't lock down the JSON schema enough, the LLM will guess for you. Any software developer would do the same given ambiguity (and an expert one would flag the ambiguity with the business counterpart before going too far).
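
One way to notice this kind of failure early is a strict parse that accepts a response only if the entire text is valid JSON. The following is a minimal sketch under that assumption. `parse_json_only` is a hypothetical helper, not the parser used in these experiments, and the sample responses are invented.

```python
import json


def parse_json_only(response_text: str):
    """Return the parsed payload if the whole response is JSON, else None."""
    try:
        return json.loads(response_text)
    except json.JSONDecodeError:
        # Any surrounding prose makes the full text invalid JSON.
        return None


if __name__ == '__main__':
    # Invented responses for illustration.
    print(parse_json_only('[{"Platz": 1}]'))
    print(parse_json_only('Here is the table:\n[{"Platz": 1}]'))
```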

### Failure 4 diffs--template with extra space

In [None]:
idx = 4

print(f"Incorrect response: {failures_w_context[idx]['response']}\n")

print(f"template: {failures_w_context[idx]['config']['template']}\n")

So just adding an extra space before the context text begins is enough to yield a wrong answer.
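
Since whitespace changes are nearly invisible in a printed template, a dedicated check can flag them. This is a sketch only: the helper `differs_only_in_whitespace` and the two template strings are assumptions for illustration and are not the templates used in the runs above.

```python
def differs_only_in_whitespace(left: str, right: str) -> bool:
    """True if the strings differ, but match once all whitespace is removed."""
    def strip_whitespace(text: str) -> str:
        # str.split() with no arguments splits on any run of whitespace.
        return ''.join(text.split())

    return left != right and strip_whitespace(left) == strip_whitespace(right)


if __name__ == '__main__':
    # Hypothetical templates that differ by one leading space before the context.
    template_a = 'Context:\n{context}\nQuestion: {question}'
    template_b = 'Context:\n {context}\nQuestion: {question}'
    print(differs_only_in_whitespace(template_a, template_b))
```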





## Conclusion


Why exactly the question-answering fails when

* there is an extra space before the context section of the prompt template
* the question is in English while the context is in German
* the question uses `ae` rather than `ä` in the word "Männer"

is beyond me. 

And that's likely OK, as the appeal of LLMs is that you can offload much of what your use-case needs to the LLM without understanding in full detail why or how it gives the results it does.

What these examples show, however, is that it pays to manage the natural-language-syntax-as-configuration well, which in this post we've shown by taking a standard approach from ML: experiment and configuration management.

By using sound configuration management practices, we can get the upside of LLMs while managing the downsides, which in this use case include

1. An extra space in the prompt template leading to wrong results.
1. Asking an English question of German context leading to wrong data types in the response.
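
As one small illustration of treating prompt text as configuration, the prompt-related config values can be hashed so that any syntactic change, even a single space, yields a new identifier to store alongside the results. This is a sketch and not what the runs above used. The function `prompt_fingerprint` is hypothetical, and only the four key names are taken from `diff_prompt_config`.

```python
import hashlib
import json

# The prompt-related keys compared in diff_prompt_config.
PROMPT_KEYS = ('context', 'question', 'answer_format', 'template')


def prompt_fingerprint(config: dict) -> str:
    """Return a SHA-256 hex digest over the prompt-related config values."""
    payload = json.dumps(
        {key: config[key] for key in PROMPT_KEYS},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()


if __name__ == '__main__':
    # Invented config values for illustration.
    base = {'context': 'Platz\tSchwimmer', 'question': 'Wer?',
            'answer_format': 'json', 'template': '{context}\n{question}'}
    changed = dict(base, template=' {context}\n{question}')
    print(prompt_fingerprint(base) == prompt_fingerprint(changed))
```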