In [2]:
from utils.utils import PNode
import pandas as pd

df_sovanta = pd.read_csv("../eval_sovanta/eval-full-judgement-ref-free-cr-ar-gr-cr.csv")
df_wikieval = pd.read_csv("../eval_wikieval/eval-full-judgement-ref-free-cr-ar-gr.csv")

df_sovanta["retrieved_text"] = df_sovanta.apply(
    lambda row: " ".join([n.text for n in pd.eval(row["nodes"], local_dict={"PNode": PNode})]),
    axis=1,
)
df_sovanta["legacy_metrics"] = df_sovanta[["bleu", "rouge1", "bertscore_recall"]].mean(
    axis=1
)  # the legacy metrics that work best
df_sovanta["rag_triad"] = (
    df_sovanta["mistralai--mistral-large-instruct_answer_relevance"] * 0.5
    + df_sovanta["meta--llama3.1-70b-instruct_groundedness"] * 0.25
    + df_sovanta["meta--llama3.1-70b-instruct_context_relevance_with_cot"] * 0.25
)
df_wikieval["retrieved_text"] = df_wikieval.apply(
    lambda row: " ".join([n.text for n in pd.eval(row["nodes"], local_dict={"PNode": PNode})]),
    axis=1,
)
df_wikieval["legacy_metrics"] = df_wikieval[["bleu", "rouge1", "bertscore_recall"]].mean(
    axis=1
)  # the legacy metrics that work best
df_wikieval["rag_triad"] = (
    df_wikieval["anthropic--claude-3.7-sonnet_answer_relevance_with_cot"] * 0.25
    + df_wikieval["anthropic--claude-3.7-sonnet_groundedness_filter_trivial"] * 0.5
    + df_wikieval["gpt-4o_context_relevance_with_cot"] * 0.25
)
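
The two `rag_triad` columns above are weighted sums with different weights per dataset (0.5 on answer relevance for sovanta, 0.5 on groundedness for WikiEval). A minimal sketch of how such a composite could be built with one helper that also checks that the weights sum to 1; `weighted_score` and the toy column names are hypothetical and not part of the notebook:

```python
import pandas as pd


def weighted_score(df: pd.DataFrame, weights: dict[str, float]) -> pd.Series:
    """Return the weighted sum of the given columns (hypothetical helper)."""
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total}, expected 1.0")
    # sum() starts at 0, so the result is a Series aligned with df's index
    return sum(df[col] * w for col, w in weights.items())


# Toy data with made-up scores, only to show the mechanics
toy = pd.DataFrame(
    {
        "answer_relevance": [1.0, 0.5],
        "groundedness": [1.0, 0.0],
        "context_relevance": [0.0, 1.0],
    }
)
toy["rag_triad"] = weighted_score(
    toy,
    {"answer_relevance": 0.5, "groundedness": 0.25, "context_relevance": 0.25},
)
```

Keeping the weights in a single dictionary makes the difference between the two dataset configurations visible at a glance.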

# 1. WikiEval

In [3]:
print(
    "Correlation with Legacy Metrics: ",
    df_wikieval["mistralai--mistral-large-instruct_judgement"].corr(df_wikieval["legacy_metrics"]),
)
print(
    "Correlation with Reference Free: ",
    df_wikieval["mistralai--mistral-large-instruct_judgement"].corr(
        df_wikieval["mistralai--mistral-large-instruct_judgement_ref_free"]
    ),
)
print(
    "Correlation with RAG Triad: ",
    df_wikieval["mistralai--mistral-large-instruct_judgement"].corr(df_wikieval["rag_triad"]),
)

Correlation with Legacy Metrics:  0.7175458284580102
Correlation with Reference Free:  0.5262217269474444
Correlation with RAG Triad:  0.22513820820080663
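
`Series.corr` defaults to the Pearson coefficient. Because the judgement column only takes a few discrete values (0, 0.25, 0.5, 0.75, 1.0 in the tables of this notebook), a rank-based coefficient may be a useful cross-check. The sketch below uses made-up toy values and computes Spearman's coefficient as Pearson on ranks; it is an optional addition, not something the notebook itself computes:

```python
import pandas as pd

# Made-up toy values: a discrete judge score and a continuous metric
judgement = pd.Series([0.0, 0.25, 0.5, 0.75, 1.0, 1.0])
metric = pd.Series([0.10, 0.12, 0.15, 0.30, 0.90, 0.95])

# Pearson (the default of Series.corr) measures linear association
pearson = judgement.corr(metric)

# Spearman is Pearson computed on ranks (ties receive average ranks),
# so it only measures whether the two scores order the rows alike
spearman = judgement.rank().corr(metric.rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

If the two coefficients diverge strongly on the real data, the relationship is monotonic but not linear, which matters when comparing metrics that live on different scales.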


## 1.1 Legacy Metrics

In [4]:
df_filtered = df_wikieval[
    [
        "chunk_size",
        "top_k",
        "rerank_model",
        "prompt",
        "answer",
        "gold_answer",
        "mistralai--mistral-large-instruct_judgement",
        "legacy_metrics",
    ]
].copy()  # explicit copy, so adding a column does not raise SettingWithCopyWarning
df_filtered["abs_diff"] = (
    df_filtered["mistralai--mistral-large-instruct_judgement"] - df_filtered["legacy_metrics"]
).abs()
# rows where the LLM judgement and the legacy metrics disagree by more than 0.5
highlighted_df = df_filtered[df_filtered["abs_diff"] > 0.5]
highlighted_df


Unnamed: 0,chunk_size,top_k,rerank_model,prompt,answer,gold_answer,mistralai--mistral-large-instruct_judgement,legacy_metrics,abs_diff
36,128,4,,What are some of the controversies surrounding...,Uber has been the subject of controversies inc...,Uber has been involved in a number of controve...,1.0,0.494595,0.505405
398,128,10,hooman650/bge-reranker-v2-m3-onnx-o4,What are some measures for pandemic prevention?,Some measures for pandemic prevention include:...,Some measures for pandemic prevention include ...,1.0,0.443336,0.556664
797,256,8,hooman650/bge-reranker-v2-m3-onnx-o4,What are some measures for pandemic prevention?,Some measures for pandemic prevention include:...,Some measures for pandemic prevention include ...,1.0,0.452297,0.547703
897,256,10,hooman650/bge-reranker-v2-m3-onnx-o4,What are some measures for pandemic prevention?,Some measures for pandemic prevention include:...,Some measures for pandemic prevention include ...,1.0,0.452943,0.547057
998,256,12,hooman650/bge-reranker-v2-m3-onnx-o4,What are some measures for pandemic prevention?,Pandemic prevention measures include:\n\n* R...,Some measures for pandemic prevention include ...,1.0,0.473532,0.526468
...,...,...,...,...,...,...,...,...,...
5935,512,12,,What are some of the controversies surrounding...,Uber has been the subject of several controver...,Uber has been involved in a number of controve...,1.0,0.460485,0.539515
5942,512,12,,When was GPT-4 released and what are some of i...,"GPT-4 was released on March 14, 2023. Some of ...","GPT-4 was released on March 14, 2023. It is a ...",1.0,0.493641,0.506359
5947,512,12,,What are some measures for pandemic prevention?,Measures for pandemic prevention include:\n\n1...,Some measures for pandemic prevention include ...,1.0,0.427995,0.572005
5960,512,12,hooman650/bge-reranker-v2-m3-onnx-o4,What is the purpose of the Rainbow Plaque prog...,The Rainbow Plaque programme in the UK is a sc...,The Rainbow Plaque programme in the UK is a sc...,1.0,0.492526,0.507474


In [5]:
df_filtered = df_filtered[
    ~df_filtered["prompt"].isin(
        [
            "What are some of the controversies surrounding Uber?",
            "What are some measures for pandemic prevention?",
        ]
    )
]
print(
    "Correlation with Legacy Metrics: ",
    df_filtered["mistralai--mistral-large-instruct_judgement"].corr(df_filtered["legacy_metrics"]),
)

Correlation with Legacy Metrics:  0.7565176847381141


Observations:
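- About 50% of the high-discrepancy rows (abs_diff > 0.5) belong to a single question and a further 30% to a second one; removing the two prompts in In [5] raises the correlation from 0.718 to 0.757
- The affected answers contain long lists and/or information that is absent from the gold answer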

Error Classes:
- 1: Answer contains list of detailed steps
- 2: Answer contains additional information not found in gold answer
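
The per-question shares quoted in the observations can be obtained with `value_counts(normalize=True)` on the prompt column of the high-discrepancy rows. A minimal sketch on made-up toy data (the prompts `Q1` to `Q3` and all scores are placeholders, not values from the evaluation):

```python
import pandas as pd

# Made-up toy data standing in for df_filtered
toy = pd.DataFrame(
    {
        "prompt": ["Q1", "Q1", "Q2", "Q3", "Q1", "Q2"],
        "judgement": [1.0, 1.0, 1.0, 0.5, 1.0, 0.75],
        "legacy_metrics": [0.40, 0.45, 0.42, 0.48, 0.70, 0.60],
    }
)
toy["abs_diff"] = (toy["judgement"] - toy["legacy_metrics"]).abs()

# Keep only rows where the two scores disagree by more than 0.5
errors = toy[toy["abs_diff"] > 0.5]

# Fraction of high-discrepancy rows per prompt, largest share first
share_per_prompt = errors["prompt"].value_counts(normalize=True)
print(share_per_prompt)
```

On the real data, the same call on `highlighted_df["prompt"]` would list the prompts that dominate the errors.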

## 1.2 Reference-free

In [6]:
df_filtered = df_wikieval[
    [
        "chunk_size",
        "top_k",
        "rerank_model",
        "prompt",
        "answer",
        "gold_answer",
        "mistralai--mistral-large-instruct_judgement",
        "mistralai--mistral-large-instruct_judgement_ref_free",
    ]
].copy()  # explicit copy, so adding a column does not raise SettingWithCopyWarning
df_filtered["abs_diff"] = (
    df_filtered["mistralai--mistral-large-instruct_judgement"]
    - df_filtered["mistralai--mistral-large-instruct_judgement_ref_free"]
).abs()
# rows where the reference-based and reference-free judgements differ by more than 0.3
highlighted_df = df_filtered[df_filtered["abs_diff"] > 0.3]
highlighted_df


Unnamed: 0,chunk_size,top_k,rerank_model,prompt,answer,gold_answer,mistralai--mistral-large-instruct_judgement,mistralai--mistral-large-instruct_judgement_ref_free,abs_diff
406,128,12,,"What is the taxonomy of Dasypoda radchenkoi, a...",Die Gattung Dasypoda ist in vier Untergattunge...,Dasypoda radchenkoi belongs to the genus Dasyp...,0.75,0.25,0.5
409,128,12,,What was the purpose of designing and building...,The purpose of the Fiat Ecobasic concept car w...,The purpose of designing and building the Fiat...,1.00,0.50,0.5
456,128,12,hooman650/bge-reranker-v2-m3-onnx-o4,"What is the taxonomy of Dasypoda radchenkoi, a...",Die Gattung Dasypoda ist in vier Untergattunge...,Dasypoda radchenkoi belongs to the genus Dasyp...,0.75,0.25,0.5
530,256,4,,When was the 5th Separate Guards Tatsin Red Ba...,The 5th Separate Guards Tatsin Red Banner Orde...,The 5th Separate Guards Tatsin Red Banner Orde...,1.00,0.50,0.5
579,256,4,hooman650/bge-reranker-v2-m3-onnx-o4,When was the 5th Separate Guards Tatsin Red Ba...,The 5th Separate Guards Tatsin Red Banner Orde...,The 5th Separate Guards Tatsin Red Banner Orde...,1.00,0.50,0.5
...,...,...,...,...,...,...,...,...,...
5556,512,4,hooman650/bge-reranker-v2-m3-onnx-o4,What is the main product of Fremantle Octopus ...,"Fremantle Octopus is based in Fremantle, Weste...",The main product of Fremantle Octopus is Octop...,1.00,0.50,0.5
5627,512,6,,Where is the type locality of the Blakistonia ...,The type locality of the Blakistonia plata spi...,The type locality of the Blakistonia plata spi...,1.00,0.50,0.5
5676,512,6,hooman650/bge-reranker-v2-m3-onnx-o4,Where is the type locality of the Blakistonia ...,The type locality of the Blakistonia plata spi...,The type locality of the Blakistonia plata spi...,1.00,0.50,0.5
5705,512,8,,What is the main product of Fremantle Octopus ...,"Fremantle Octopus is based in Fremantle, Weste...",The main product of Fremantle Octopus is Octop...,1.00,0.50,0.5


In [7]:
df_filtered = df_filtered[
    ~df_filtered["prompt"].isin(
        [
            "When was the 5th Separate Guards Tatsin Red Banner Order of Suvorov Tank Brigade formed, and what is its military unit number?",
            "What was the estimated timeline for fully restoring power in Moore County after the shooting attack on the electrical distribution substations?",
        ]
    )
]
print(
    "Correlation with Reference Free: ",
    df_filtered["mistralai--mistral-large-instruct_judgement"].corr(
        df_filtered["mistralai--mistral-large-instruct_judgement_ref_free"]
    ),
)

Correlation with Reference Free:  0.571908762660549


Observations:
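- Errors are characterized by a high judgement (mostly 1.0) and a much lower ref_free judgement
- 52% of the errors belong to a single question and a further 20% to a second one; removing the two prompts in In [7] raises the correlation from 0.526 to 0.572

Error Classes:
- 1: The answer is written in a different language than the gold answer (e.g. the German answers in rows 406 and 456)
- 2: A short answer that closely matches the gold answer but may be too short for the reference-free judge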

## 1.3 RAG Triad

In [8]:
df_filtered = df_wikieval[
    [
        "chunk_size",
        "top_k",
        "rerank_model",
        "prompt",
        "answer",
        "gold_answer",
        "mistralai--mistral-large-instruct_judgement",
        "rag_triad",
    ]
].copy()  # explicit copy, so adding a column does not raise SettingWithCopyWarning
df_filtered["abs_diff"] = (
    df_filtered["mistralai--mistral-large-instruct_judgement"] - df_filtered["rag_triad"]
).abs()
# rows where the LLM judgement and the RAG Triad score disagree by more than 0.5
highlighted_df = df_filtered[df_filtered["abs_diff"] > 0.5]
highlighted_df


Unnamed: 0,chunk_size,top_k,rerank_model,prompt,answer,gold_answer,mistralai--mistral-large-instruct_judgement,rag_triad,abs_diff
20,128,4,,When was the Roanoke and Tar River Railroad fu...,The Roanoke and Tar River Railroad was fully m...,The Roanoke and Tar River Railroad was fully m...,0.25,1.000000,0.750000
66,128,4,hooman650/bge-reranker-v2-m3-onnx-o4,What caused the crash of Yeti Airlines Flight ...,The video taken moments before the crash showe...,The crash of Yeti Airlines Flight 691 in Pokha...,0.25,0.833333,0.583333
76,128,4,hooman650/bge-reranker-v2-m3-onnx-o4,When was the Roanoke and Tar River Railroad fu...,The Roanoke and Tar River Railroad was fully m...,The Roanoke and Tar River Railroad was fully m...,0.25,1.000000,0.750000
116,128,6,,What caused the crash of Yeti Airlines Flight ...,The video taken moments before the crash showe...,The crash of Yeti Airlines Flight 691 in Pokha...,0.25,0.916667,0.666667
123,128,6,,When was the Roanoke and Tar River Railroad fu...,The Roanoke and Tar River Railroad was fully m...,The Roanoke and Tar River Railroad was fully m...,0.25,1.000000,0.750000
...,...,...,...,...,...,...,...,...,...
4613,128,6,,When did Trolleybus Route 20 in Shanghai start...,Trolleybus Route 20 in Shanghai started its op...,Trolleybus Route 20 in Shanghai started its op...,0.25,0.916667,0.666667
4712,128,8,,When did Trolleybus Route 20 in Shanghai start...,Trolleybus Route 20 in Shanghai started its op...,Trolleybus Route 20 in Shanghai started its op...,0.25,0.916667,0.666667
5094,256,4,hooman650/bge-reranker-v2-m3-onnx-o4,What caused the gas supply outage in Sheffield...,An error occurred while calling GenAIHubLLM.mi...,"The gas supply outage in Sheffield, England in...",0.00,0.750000,0.750000
5095,256,4,hooman650/bge-reranker-v2-m3-onnx-o4,What sparked the civil unrest and protests in ...,An error occurred while calling GenAIHubLLM.mi...,The civil unrest and protests in Iran began in...,0.00,0.750000,0.750000


Observations:
- Errors are characterized by a low judgement and a high RAG Triad score
- Errors distributed across questions

Error Classes:
- 1: Gold Answer contains more info, but answer completely answers the question
- 2: Answer does not contain all necessary information
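
The discrepancy tables in this notebook all follow the same pattern: select columns, compute `abs_diff`, filter by a threshold. A minimal sketch of a helper that captures this pattern; the name `high_discrepancy` and its parameters are hypothetical, and the toy data is made up:

```python
import pandas as pd


def high_discrepancy(
    df: pd.DataFrame,
    judge_col: str,
    metric_col: str,
    threshold: float = 0.5,
    extra_cols: tuple[str, ...] = ("prompt", "answer", "gold_answer"),
) -> pd.DataFrame:
    """Return rows where judge and metric differ by more than `threshold`."""
    cols = [*extra_cols, judge_col, metric_col]
    out = df[cols].copy()  # explicit copy avoids SettingWithCopyWarning
    out["abs_diff"] = (out[judge_col] - out[metric_col]).abs()
    return out[out["abs_diff"] > threshold]


# Made-up toy data, only to exercise the helper
toy = pd.DataFrame(
    {
        "prompt": ["Q1", "Q2", "Q3"],
        "answer": ["a1", "a2", "a3"],
        "gold_answer": ["g1", "g2", "g3"],
        "judgement": [0.25, 1.0, 0.0],
        "rag_triad": [1.0, 0.9, 0.75],
    }
)
flagged = high_discrepancy(toy, "judgement", "rag_triad")
print(flagged)
```

Passing a different `threshold` (for example 0.3, as used for the reference-free comparison) reuses the same code path.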

# 2. sovanta

In [9]:
print(
    "Correlation with Legacy Metrics: ",
    df_sovanta["meta--llama3.1-70b-instruct_judgement"].corr(df_sovanta["legacy_metrics"]),
)
print(
    "Correlation with Reference Free: ",
    df_sovanta["meta--llama3.1-70b-instruct_judgement"].corr(
        df_sovanta["meta--llama3.1-70b-instruct_judgement_ref_free"]
    ),
)
print(
    "Correlation with RAG Triad: ",
    df_sovanta["meta--llama3.1-70b-instruct_judgement"].corr(df_sovanta["rag_triad"]),
)

Correlation with Legacy Metrics:  0.510711497449249
Correlation with Reference Free:  0.22283595877910636
Correlation with RAG Triad:  0.18482855243125734


## 2.1 Legacy Metrics

In [10]:
df_filtered = df_sovanta[
    [
        "chunk_size",
        "top_k",
        "rerank_model",
        "prompt",
        "answer",
        "gold_answer",
        "meta--llama3.1-70b-instruct_judgement",
        "legacy_metrics",
    ]
].copy()  # explicit copy, so adding a column does not raise SettingWithCopyWarning
df_filtered["abs_diff"] = (
    df_filtered["meta--llama3.1-70b-instruct_judgement"] - df_filtered["legacy_metrics"]
).abs()
# rows where the LLM judgement and the legacy metrics disagree by more than 0.5
highlighted_df = df_filtered[df_filtered["abs_diff"] > 0.5]
highlighted_df


Unnamed: 0,chunk_size,top_k,rerank_model,prompt,answer,gold_answer,meta--llama3.1-70b-instruct_judgement,legacy_metrics,abs_diff
15,128,4,,Wie beantrage ich als Werkstudent Urlaub?,"Um als Werkstudent Urlaub zu beantragen, musst...",Du kannst mit SAP SuccessFactors einen Urlaubs...,1.0,0.352292,0.647708
22,128,4,,wie bekomme ich eine neue maus für meinen laptop,Die gegebenen Kontextinformationen enthalten k...,Der Kontext enthält keine spezifischen Informa...,1.0,0.427480,0.572520
35,128,4,,Welche Richtlinien gelten für Flüge?,Standard class für Flüge ist coach (economy) c...,Es soll Economy class gebucht werden. Außerdem...,1.0,0.345966,0.654034
46,128,4,,how do i request a miro account?,"To request a Miro account, you need to create ...",There is no specific information in the provid...,1.0,0.361871,0.638129
53,128,4,,write me an agenda for a design thinking works...,Certainly! Here is an agenda for a Design Thin...,Design Thinking Workshop Agenda: Duration 3 ho...,1.0,0.355391,0.644609
...,...,...,...,...,...,...,...,...,...
7943,512,14,hooman650/bge-reranker-v2-m3-onnx-o4,Darf ich bei der sovanta auch flexible Zug tic...,"Nein, laut der Travel Policy von sovanta AG mü...","Nein, Zugtickets müssen für den günstigsten mö...",1.0,0.465804,0.534196
7965,512,14,hooman650/bge-reranker-v2-m3-onnx-o4,wie trainiere ich meine rückenmuskulatur?,"Um deine Rückenmuskulatur zu trainieren, kanns...","Um Ihren Rücken zu trainieren, können sie folg...",1.0,0.382365,0.617635
7975,512,14,hooman650/bge-reranker-v2-m3-onnx-o4,wie mache ich eine Krankmeldung?,Du kannst deine Krankmeldung elektronisch oder...,Du kannst deine Krankmeldung direkt über die Q...,1.0,0.408120,0.591880
7984,512,14,hooman650/bge-reranker-v2-m3-onnx-o4,Was mach ich wenn ich die zugangskarte zum bür...,Wenn du die Zugangskarte zum Büro verloren has...,Der Verlust der Karte(n) ist unverzüglich und...,1.0,0.459627,0.540373


Observations:
- The affected answers contain long lists and/or information that is absent from the gold answer

Error Classes:
- 1: Answer contains a list of detailed steps or elaborates far more than the gold answer
- 2: Answer contains additional information not found in gold answer
- 3: Answer expresses the same content as the gold answer in different words
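
Error class 3 follows from how overlap-based metrics work: they count shared tokens, so a correct paraphrase can score low. The sketch below uses a simplified unigram F1 as a stand-in for ROUGE-1; the whitespace tokenization and the example sentences are assumptions for illustration and do not reproduce the metric implementation used for the `rouge1` column:

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified unigram-overlap F1 (whitespace tokens, lowercased)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection: each token counts at most as often as in both
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Made-up example sentences
gold = "book the cheapest train ticket"
verbatim = "book the cheapest train ticket"
paraphrase = "reserve the least expensive rail fare"

print(unigram_f1(verbatim, gold))
print(unigram_f1(paraphrase, gold))
```

The paraphrase shares only the token "the" with the gold answer, so its overlap score is low even though an LLM judge would likely rate both answers as equivalent.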

## 2.2 Reference-free

In [11]:
df_filtered = df_sovanta[
    [
        "cluster",
        "chunk_size",
        "top_k",
        "rerank_model",
        "prompt",
        "answer",
        "gold_answer",
        "meta--llama3.1-70b-instruct_judgement",
        "meta--llama3.1-70b-instruct_judgement_ref_free",
    ]
].copy()  # explicit copy, so adding a column does not raise SettingWithCopyWarning
# df_filtered = df_filtered[df_filtered["cluster"] != "UNKNOWN"]
df_filtered["abs_diff"] = (
    df_filtered["meta--llama3.1-70b-instruct_judgement"]
    - df_filtered["meta--llama3.1-70b-instruct_judgement_ref_free"]
).abs()
# rows where the reference-based and reference-free judgements differ by more than 0.5
highlighted_df = df_filtered[df_filtered["abs_diff"] > 0.5]
highlighted_df


Unnamed: 0,cluster,chunk_size,top_k,rerank_model,prompt,answer,gold_answer,meta--llama3.1-70b-instruct_judgement,meta--llama3.1-70b-instruct_judgement_ref_free,abs_diff
7,UNKNOWN,128,4,,Was ist das Zertikiat?,Das Zertifikat ist nicht in den gegebenen Kont...,In den Kontextinformationen wird kein Zertifik...,1.0,0.00,1.00
8,CONTRACTS,128,4,,WIe lange läuft der Vertrag für EInbruchmelden...,Der Vertrag für Einbruchmeldeanlagen läuft 3 J...,Der Vertrag für Einbruchmeldeanlagen läuft 3 J...,1.0,0.25,0.75
11,CONTRACTS,128,4,,Wie lange läuft der Vertrag für Brandmeldeanla...,Der Vertrag für Brandmeldeanlagen läuft 4 Jahre.,Der Vertrag für Brandmeldeanlagen läuft für 4 ...,1.0,0.25,0.75
19,TECH,128,4,,What is Fiori?,Fiori ist die benutzerfreundliche „Designsprac...,SAP Fiori is the user-friendly “design languag...,1.0,0.25,0.75
22,UNKNOWN,128,4,,wie bekomme ich eine neue maus für meinen laptop,Die gegebenen Kontextinformationen enthalten k...,Der Kontext enthält keine spezifischen Informa...,1.0,0.25,0.75
...,...,...,...,...,...,...,...,...,...,...
7906,UNKNOWN,512,14,hooman650/bge-reranker-v2-m3-onnx-o4,Wie buche eine Dienstreise bei der sovanta?,Es gibt keine spezifischen Informationen im Ko...,Die Kontextinformationen enthalten keine spez...,1.0,0.25,0.75
7937,HR,512,14,hooman650/bge-reranker-v2-m3-onnx-o4,can i fly business class to the US?,"No, according to the travel policy, the standa...","No, according to the policy, the standard clas...",1.0,0.25,0.75
7938,UNKNOWN,512,14,hooman650/bge-reranker-v2-m3-onnx-o4,How do i delete a project,There is no information in the provided contex...,The context information does not provide speci...,1.0,0.25,0.75
7940,UNKNOWN,512,14,hooman650/bge-reranker-v2-m3-onnx-o4,can you tell me anything about the TK project?,No information is available about the TK proje...,"There is no information about a ""TK project"" i...",1.0,0.00,1.00


Observations:
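- Errors are characterized by judgement = 1.0 and a much lower ref_free judgement
- 40% of the errors fall into the UNKNOWN cluster; these can be excluded with the commented-out cluster filter in In [11]

Error Classes:
- 1: The answer is written in a different language than the gold answer (e.g. row 19, a German answer with an English gold answer)
- 2: A short answer that closely matches the gold answer but may be too short for the reference-free judge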
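
The share of errors per cluster, and the effect of dropping the UNKNOWN cluster, can be checked with `value_counts(normalize=True)` and a boolean mask. A minimal sketch on made-up toy rows (the values are placeholders and are not taken from the evaluation data):

```python
import pandas as pd

# Made-up toy rows standing in for highlighted_df
toy_errors = pd.DataFrame(
    {
        "cluster": ["UNKNOWN", "CONTRACTS", "UNKNOWN", "TECH", "HR"],
        "abs_diff": [1.0, 0.75, 0.75, 0.75, 0.75],
    }
)

# Fraction of high-discrepancy rows per cluster
share_per_cluster = toy_errors["cluster"].value_counts(normalize=True)
print(share_per_cluster)

# Same filter as the commented-out line in the cell above
without_unknown = toy_errors[toy_errors["cluster"] != "UNKNOWN"]
print(len(without_unknown), "rows remain after excluding UNKNOWN")
```

Applying the mask before computing the correlation would show how much of the disagreement is driven by questions the knowledge base cannot answer.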

## 2.3 RAG Triad

In [12]:
df_filtered = df_sovanta[
    [
        "cluster",
        "chunk_size",
        "top_k",
        "rerank_model",
        "prompt",
        "answer",
        "gold_answer",
        "meta--llama3.1-70b-instruct_judgement",
        "rag_triad",
    ]
].copy()  # explicit copy, so adding a column does not raise SettingWithCopyWarning
# df_filtered = df_filtered[df_filtered["cluster"] != "UNKNOWN"]
df_filtered["abs_diff"] = (
    df_filtered["meta--llama3.1-70b-instruct_judgement"] - df_filtered["rag_triad"]
).abs()
# rows where the LLM judgement and the RAG Triad score disagree by more than 0.5
highlighted_df = df_filtered[df_filtered["abs_diff"] > 0.5]
highlighted_df


Unnamed: 0,cluster,chunk_size,top_k,rerank_model,prompt,answer,gold_answer,meta--llama3.1-70b-instruct_judgement,rag_triad,abs_diff
2,OTHER,128,4,,Was ist Experience Managemenet,"Experience Management ist ein Bereich, der von...",Experience Management ist eine breite Diszipli...,0.00,0.666667,0.666667
4,TECH,128,4,,Was ist das GenAI Starter?,"Das GenAI Starter ist ein Paket, das Unternehm...","Das GenAI Starter ist ein Starterpaket, das Un...",0.25,0.875000,0.625000
7,UNKNOWN,128,4,,Was ist das Zertikiat?,Das Zertifikat ist nicht in den gegebenen Kont...,In den Kontextinformationen wird kein Zertifik...,1.00,0.250000,0.750000
13,HR,128,4,,Wie fülle ich als STudent meinen Stundenzettel...,"Du füllst deinen Stundenzettel aus, indem du d...","Dafür verwendest du das \""Template_Stundenzett...",0.25,1.000000,0.750000
22,UNKNOWN,128,4,,wie bekomme ich eine neue maus für meinen laptop,Die gegebenen Kontextinformationen enthalten k...,Der Kontext enthält keine spezifischen Informa...,1.00,0.166667,0.833333
...,...,...,...,...,...,...,...,...,...,...
7965,OTHER,512,14,hooman650/bge-reranker-v2-m3-onnx-o4,wie trainiere ich meine rückenmuskulatur?,"Um deine Rückenmuskulatur zu trainieren, kanns...","Um Ihren Rücken zu trainieren, können sie folg...",1.00,0.280303,0.719697
7966,HR,512,14,hooman650/bge-reranker-v2-m3-onnx-o4,how to get money back when traveling by car,"When traveling by car, you can get money back ...","To get money back when traveling by car, you ...",0.25,0.916667,0.666667
7970,CONTRACTS,512,14,hooman650/bge-reranker-v2-m3-onnx-o4,How can I create a project,"To create a project, follow these steps:\n\n1....",Do you want to create a ZEP or Jira Project?,0.25,0.928571,0.678571
7981,OTHER,512,14,hooman650/bge-reranker-v2-m3-onnx-o4,development stages,"According to the Tucman Model, the development...",There is no explicit mention of development st...,0.00,0.906250,0.906250


Observations:
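- Errors are mostly characterized by a low judgement and a high RAG Triad score; the reverse case (judgement 1.0 with a low RAG Triad score) also occurs, e.g. rows 7, 22 and 7965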
- Errors distributed across clusters, no focus on UNKNOWN

Error Classes:
- 1: Gold Answer contains more info, but answer completely answers the question
- 2: Answer does not contain all necessary information
- 3: Answer gives an explanation from a different angle

In [13]:
example_case = highlighted_df.iloc[0]
print("Prompt: ", example_case["prompt"])
print("-----------------------------\nGold Answer:\n", example_case["gold_answer"])
print("-----------------------------\nAnswer:\n", example_case["answer"])

Prompt:  Was ist Experience Managemenet
-----------------------------
Gold Answer:
 Experience Management ist eine breite Disziplin, welche die Interkationen eines Kunden mit seinen Kunden ("CX" oder "Customer Experience"), Mitarbeitern ("EX" oder "Employee Experience") und IT Transformation ("ETX" oder "Employee Technology Experience") beinhaltet.
-----------------------------
Answer:
 Experience Management ist ein Bereich, der von Andrew White (Head of Experience Management) geleitet wird und unter der Aufsicht von Christian Amendt (Program Manager, Program D) steht. Andrew setzt die strategische Richtung und ist unser Technologieexperte und Thought Leader.
