### Evaluating your solution on the test set
On the second day of the hackathon, we will release the test set so that each team can run its hallucination detection method on it.
The test set comprises: <br>1) A set of documents, which will be concatenated to form a long context of fewer than 128k tokens <br>2) A list of 20 questions that test a model's ability to extract, process and operate on information from a long context <br>3) A golden answer to each question <br>4) A specific model's outputs, which will be revealed when the test set is released <br>5) A golden hallucination label for each output <br>Your task is to apply your hallucination detection solution to the model outputs and provide a classification for each one. <br>The accuracy of your solution will decide your score for the Performance (30%) aspect of the Hackathon.

In [1]:
import pandas as pd
from pathlib import Path
from typing import List, Tuple

In [2]:
def load_documents(folder_path: str) -> List[Tuple[str, str]]:
    docs = []
    for file in sorted(Path(folder_path).glob("*.txt")):
        with open(file, "r", encoding="utf-8") as f:
            docs.append((file.name, f.read()))
    return docs
    
def concat_documents(docs: List[Tuple[str, str]]) -> str:
    combined_text = ""
    for name, text in docs:
        header = f"\n\n===== Document: {name} =====\n\n"
        combined_text += header + text
    return combined_text
    
long_context = concat_documents(load_documents("dummy_files/hard"))
print(long_context[15198:15930])

DSO National Laboratories

“Red Hat is a valuable partner in accelerating DSO’s software development and deployment efforts. This collaboration will significantly advance our knowledge and capabilities in the latest DevSecOps practices as we deliver solutions from enterprise IT systems to mission-critical systems deployed in the field.”

 

Additional Resources

Learn more about Red Hat Device Edge
Red Hat OpenShift
Red Hat Ansible Automation Platform
Red Hat training services
 

Connect with Red Hat

Learn more about Red Hat
Get more news in the Red Hat newsroom
Read the Red Hat blog
Follow Red Hat on Twitter
Join Red Hat on Facebook
Watch Red Hat videos on YouTube
Follow Red Hat on LinkedIn
The 3rd clue is "I love IEL!"
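
Before running a detector, it can help to check that the concatenated context stays under the 128k limit. The sketch below is only a rough estimate: it assumes the limit is counted in tokens and uses a heuristic of about 4 characters per token, which is not a real tokenizer. The names `estimate_tokens` and `fits_context` are hypothetical helpers, not part of the hackathon code.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. chars_per_token is an assumed heuristic, not a tokenizer."""
    return int(len(text) / chars_per_token)


def fits_context(text: str, limit: int = 128_000, chars_per_token: float = 4.0) -> bool:
    """Return True if the estimated token count is strictly below the limit."""
    return estimate_tokens(text, chars_per_token) < limit


sample = "word " * 1000  # 5000 characters of toy text
print(estimate_tokens(sample), fits_context(sample))
```

For an exact count, use the tokenizer of the model you are testing with, since token counts differ between model families.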



In [3]:
testset = pd.read_csv("dummy_testset.csv")
testset

Unnamed: 0,id,question,golden,thoughts,answers,hallucination
0,1,What does the first clue say?,People Passion Innovation from article1.txt,,"The first clue says: ""People Passion Innovation""",No
1,2,What is the 2nd clue?,DSO53 from article5.txt,,"The second clue is: ""People Passion Innovation.""",Yes
2,3,What is Clue #3?,I love IEL! From article3.txt,,Clue #3 is hidden within the description of DS...,Yes


In [4]:
### Your solution here! Please feel free to be creative! ###
def hallu_det(long_context, testset):
    ### Assume your solution returns some classification 
    return ["Yes", "Yes", "Yes"]
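
As a starting point, here is a deliberately naive baseline, not the expected solution. It assumes that a grounded answer quotes its evidence in straight double quotes, and it flags an answer as a hallucination when it contains no quoted span or when any quoted span is missing from the context. The helper names `quoted_spans` and `naive_hallu_det`, and the toy strings, are made up for illustration.

```python
import re
from typing import List


def quoted_spans(text: str) -> List[str]:
    """Return the non-empty substrings enclosed in straight double quotes."""
    return [s.strip() for s in re.findall(r'"([^"]+)"', text) if s.strip()]


def naive_hallu_det(long_context: str, answers: List[str]) -> List[str]:
    """Label each answer "No" if every quoted span occurs in the context, else "Yes"."""
    # Lowercase and collapse whitespace so line breaks do not block a match.
    context = " ".join(long_context.lower().split())
    labels = []
    for answer in answers:
        spans = [
            " ".join(s.lower().split()).rstrip(".,;:!?")
            for s in quoted_spans(answer)
        ]
        spans = [s for s in spans if s]
        grounded = bool(spans) and all(s in context for s in spans)
        labels.append("No" if grounded else "Yes")
    return labels


# Toy data, invented for this example.
toy_context = 'Intro text. The first clue is "Alpha Beta Gamma". More text.'
toy_answers = [
    'The first clue says: "Alpha Beta Gamma."',
    'The clue is "Delta Epsilon"',
    "I could not find the clue.",
]
print(naive_hallu_det(toy_context, toy_answers))
```

On the real data this could be called as `naive_hallu_det(long_context, testset["answers"].tolist())`. Note its blind spot: it cannot catch a genuine quote given for the wrong question, as in row 2 of the dummy test set, where the answer to the second clue quotes the first clue.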

In [5]:
import numpy as np

def calculate_score(y_true, y_pred):
    # Convert to numpy arrays for easier manipulation
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    # Accuracy
    accuracy = sum(1 for x,y in zip(y_pred,y_true) if x == y) / len(y_true)
    print(f"Accuracy: {accuracy*100}%")
    
    # True Positives, False Positives, False Negatives
    tp = np.sum((y_true == "Yes") & (y_pred == "Yes"))
    fp = np.sum((y_true == "No") & (y_pred == "Yes"))
    fn = np.sum((y_true == "Yes") & (y_pred == "No"))
    
    # Precision and Recall
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    print(f"Precision: {precision*100}%")
    print(f"Recall: {recall*100}%")
    
    # F1 Score
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    print(f"F1 Score: {f1*100}%")
    return f1

The performance of your solution can be determined by running this: 

In [6]:
calculate_score(testset['hallucination'],hallu_det(long_context, testset))

Accuracy: 66.66666666666666%
Precision: 66.66666666666666%
Recall: 100.0%
F1 Score: 80.0%


0.8
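
To see where these numbers come from, the output above can be reproduced by hand. The golden labels are (No, Yes, Yes) and the placeholder predicts (Yes, Yes, Yes), so two of three predictions match, with $TP = 2$, $FP = 1$ and $FN = 0$:

$$
\text{Accuracy} = \frac{2}{3} \approx 66.7\%, \qquad
\text{Precision} = \frac{TP}{TP + FP} = \frac{2}{3}, \qquad
\text{Recall} = \frac{TP}{TP + FN} = \frac{2}{2} = 1
$$

$$
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
    = \frac{2 \cdot \tfrac{2}{3} \cdot 1}{\tfrac{2}{3} + 1}
    = \frac{4/3}{5/3} = 0.8
$$

Always predicting "Yes" gives perfect recall, but precision drops with every non-hallucinated answer that gets flagged.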

Please do approach the H.O.T guys to demonstrate your solution once you have run this notebook! We are very interested in learning how to combat hallucinations :)

### Here is a hint for you:

The model we will use for testing will be: <br> 1) A small model that can support 128k context length <br> 2) From one of the following model families: Llama, Qwen, Phi or Mistral<br> Note: Recall that your solution should be model agnostic. Remember to demonstrate that in your presentation!