# Has GPT-4 or GPT-4-turbo solved the hallucination issue with public Databricks Docs?

Can we remove retrieval from Databricks Docs?

TLDR: 
* No. For an Assistant that provides **accurate** and **up-to-date** knowledge from the public Databricks Docs, an ideal context is still useful for correcting factual errors from GPT-4-turbo, GPT-4, and GPT-3.5.
* Hallucination severity: GPT-3.5 > GPT-4 > GPT-4-turbo
* When the context is bad, GPT-4-turbo with no context often gives a better answer.

## Evaluation Dataset
20 synthetic questions from the [source dataset](https://docs.google.com/spreadsheets/d/1pXpjCFoAfP81m0rZ6Y60Dcfc9Ne3SRwg7W1XCMwjpCg/edit#gid=106621394), which was generated from Databricks Docs website chunks by anirudh.kondaveeti@.

Schema:
**generated_question**: str, **generated_ground_truth_answer**: str, **ground_truth_chunk_text**: str
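
As a minimal sketch, one row of this schema could be modeled as a small record type. The class name `SyntheticExample` is hypothetical and is not part of `trace_utils`; only the three field names come from the schema above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticExample:  # hypothetical name, used for illustration only
    """One row of the evaluation dataset, following the schema above."""

    generated_question: str             # synthetic question derived from a docs chunk
    generated_ground_truth_answer: str  # reference answer generated with the chunk
    ground_truth_chunk_text: str        # the docs chunk the question was generated from


# Placeholder values only; these are not rows from the real dataset.
row = SyntheticExample(
    generated_question="<question text>",
    generated_ground_truth_answer="<answer text>",
    ground_truth_chunk_text="<chunk text>",
)
print(row.generated_question)
```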

Collected answers from GPT-3.5, GPT-4 and GPT-4-turbo using [this notebook](https://e2-dogfood.staging.cloud.databricks.com/?o=6051921418418893#notebook/3639518212483601).

## Manual Grading criteria
Manually graded by liang.zhang@:
- -1: The main point contains **key factual errors**.
- 0: Plausible answer in general, but **incorrect facts** are found among correct facts.
- 1: Correct and easy to read answer, helpful, and **no factual errors** found.
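
To make the three-level scale concrete, the sketch below averages a list of ratings, assuming the reported score is a plain mean of the per-question ratings. The function `average_rating` and the sample ratings are hypothetical; how `Traces.avg_score` is actually implemented in `trace_utils` is not shown in this notebook.

```python
from typing import Sequence

VALID_RATINGS = {-1, 0, 1}


def average_rating(ratings: Sequence[int]) -> float:
    """Mean of manual ratings; each rating must be -1, 0, or 1."""
    if not ratings:
        raise ValueError("no ratings to average")
    invalid = [r for r in ratings if r not in VALID_RATINGS]
    if invalid:
        raise ValueError(f"unexpected ratings: {invalid}")
    return sum(ratings) / len(ratings)


# Hypothetical ratings for four answers (not taken from the dataset).
print(average_rating([1, 1, 0, -1]))  # 0.25
```

Under this assumption, with 20 questions each one-step change in a single rating (for example from 1 to 0) moves the average by 0.05.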



In [68]:
from yaml import safe_load
from pydantic import BaseModel, Field, validator
from typing import List
import ast
import importlib
import pandas as pd
import trace_utils
import yaml
importlib.reload(trace_utils)
from trace_utils import Trace, Traces

In [83]:
traces = Traces.load_from_yaml_file("../data/databricks_docs_synthetic.yaml")

## Ground truth answers

The score is below 1 because there is one trace whose ground truth answer contains a factual error caused by truncated context.

In [84]:
traces.avg_score(responder_name="generated_ground_truth_answer", judge_name="liang.zhang@", criterion="answer")

0.95

## Ideal context + {GPT-4-turbo, GPT-4, GPT-3.5}

With ground truth context, GPT-4-turbo and GPT-4 can generate ideal answers in most cases, and are about the same quality. GPT-3.5 is slightly weaker.

In [86]:
traces.avg_score(responder_name="answered_by_gpt_35_with_ground_truth_context", judge_name="liang.zhang@", criterion="answer")

0.85

In [88]:
traces.avg_score(responder_name="answered_by_gpt_4_with_ground_truth_context", judge_name="liang.zhang@", criterion="answer")

0.95

In [90]:
traces.avg_score(responder_name="answered_by_gpt_4_turbo_with_ground_truth_context", judge_name="liang.zhang@", criterion="answer")


0.95

## No context + {GPT-4-turbo, GPT-4, GPT-3.5}

GPT-4-turbo is much less likely to give terribly wrong (hallucinated) answers, while GPT-3.5 often does so. However, all three models still struggle with accurate factual details.

In [85]:
traces.avg_score(responder_name="directly_answered_by_gpt_35", judge_name="liang.zhang@", criterion="answer")

-0.7

In [87]:
traces.avg_score(responder_name="directly_answered_by_gpt_4", judge_name="liang.zhang@", criterion="answer")

-0.25

In [89]:
traces.avg_score(responder_name="directly_answered_by_gpt_4_turbo", judge_name="liang.zhang@", criterion="answer")

-0.15
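
For a side-by-side view, the sketch below tabulates the average scores reported in the cells above and computes how much each model gains from the ground truth context. The dictionaries and the helper `context_gain` are illustrative only and are not part of `trace_utils`; the numbers are copied from the outputs in this notebook.

```python
# Average scores reported in this notebook (manual ratings in {-1, 0, 1}).
with_context = {"GPT-3.5": 0.85, "GPT-4": 0.95, "GPT-4-turbo": 0.95}
no_context = {"GPT-3.5": -0.7, "GPT-4": -0.25, "GPT-4-turbo": -0.15}


def context_gain(model: str) -> float:
    """Score improvement from adding the ground truth context, rounded to 2 decimals."""
    return round(with_context[model] - no_context[model], 2)


for model in with_context:
    print(
        f"{model:12s} no context: {no_context[model]:+.2f}  "
        f"ideal context: {with_context[model]:+.2f}  "
        f"gain: {context_gain(model):+.2f}"
    )
```

The gain shrinks from GPT-3.5 to GPT-4-turbo, but it stays above 1 point on a scale that spans only 2 points.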

## What questions can be answered well by GPT-4-turbo + no context?

The "hallucination" happens to match the facts about Databricks.

The questions have the following characteristics:

* Facts that have existed for a long time (such as scheduling a notebook, reading binary files, incremental data ingestion).
* General knowledge, concepts (such as open Delta Sharing).

[Sample answers](https://github.com/liangz1/EchoJudge/issues/3)

In [99]:
ts = traces.select(responder_name="directly_answered_by_gpt_4_turbo", judge_name="liang.zhang@", criterion="answer", rating=1)
for s in ts.traces:
    print(s.user_input)

What is a recipient in Unity Catalog?
What is open Delta Sharing?
What is the recommended method for incremental data ingestion in Databricks?
How can you schedule a notebook as a task in Databricks?
How do I read binary files using Databricks Runtime?


## What questions are answered poorly by GPT-4-turbo + no context?

The "hallucination" is plausible but does not match the facts about Databricks.

The questions have the following characteristics:

* New features, new UI components, and newly recommended user journeys (such as Auto Loader vs COPY INTO; Databricks Lakehouse Monitoring, AI functions; schedule query execution).
* Knowledge that involves subtle nuances (such as UDAF vs UDF in Spark SQL).

[Sample answers](https://github.com/liangz1/EchoJudge/issues/4)

In [100]:
ts = traces.select(responder_name="directly_answered_by_gpt_4_turbo", judge_name="liang.zhang@", criterion="answer", rating=-1)
for s in ts.traces:
    print(s.user_input)

What is a share in Delta Sharing?
What are the considerations when choosing between Auto Loader and COPY INTO for data ingestion in Databricks?
What kind of questions can Databricks Lakehouse Monitoring help you answer?
What types of analysis does Databricks Lakehouse Monitoring provide?
What is the purpose of a baseline table in Databricks Lakehouse Monitoring?
What are the AI functions provided by Databricks for SQL users?
How do I schedule a query execution in Azure Databricks?
How do I register the UDAF with Spark SQL?


### Which question is answered better by GPT-4 than GPT-4-turbo?

This might be an outlier, but there is one question where "directly_answered_by_gpt_4" is better than "directly_answered_by_gpt_4_turbo".

- How do I implement a UserDefinedAggregateFunction in Scala for Apache Spark SQL?

The correct answer should use a UserDefinedAggregateFunction with "group by" ([source](https://github.com/liangz1/EchoJudge/blob/456398cefc328bd507d3221925fbd39a0ba4689a/data/databricks_docs_synthetic.yaml#L6118-L6119)). Wrong answers use it as a regular UDF ([source](https://github.com/liangz1/EchoJudge/blob/456398cefc328bd507d3221925fbd39a0ba4689a/data/databricks_docs_synthetic.yaml#L6229-L6230)).

In [107]:
ts = traces.select(responder_name="directly_answered_by_gpt_4", judge_name="liang.zhang@", criterion="answer", rating=1)
gpt_4_rating_1 = {s.user_input for s in ts.traces}

In [108]:
ts = traces.select(responder_name="directly_answered_by_gpt_4_turbo", judge_name="liang.zhang@", criterion="answer", rating=0)
gpt_4_turbo_rating_0 = {s.user_input for s in ts.traces}

In [109]:
gpt_4_rating_1.intersection(gpt_4_turbo_rating_0)

{'How do I implement a UserDefinedAggregateFunction in Scala for Apache Spark SQL?'}

### Reusing this dataset

The dataset https://github.com/liangz1/EchoJudge/blob/main/data/databricks_docs_synthetic.yaml is suitable for evaluating:

- A new LLM: How well can it eliminate hallucination?
- A new RAG: How well can it retrieve the context?