### Does a long context lead to increased hallucinations in LLMs?
In this hackathon, we will investigate whether it is possible to detect hallucinations in LLMs given a long context.
<br>In this case, we define an LLM to be hallucinating if its output is unfaithful to the given information.
<br>Your task is to: <br>1) Create a model-agnostic solution that can detect any hallucination(s) in the answer that deviate from the context. <br>2) Provide research insights on the problem and lessons learnt from this hackathon. <br>This is a starter kit for you to start tinkering with the models and observing their outputs.
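
As a starting point for task 1, here is a minimal sketch of one possible model-agnostic baseline: flag any quoted span in the answer that does not appear verbatim in the context. The function names `normalize` and `unsupported_quotes` and the toy context are hypothetical, introduced only for illustration. This is not a complete detector, just a cheap first check.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_quotes(answer: str, context: str) -> List[str]:
    """Return the double-quoted spans in `answer` that are not found in `context`."""
    quotes = re.findall(r'"([^"]+)"', answer)
    ctx = normalize(context)
    unsupported = []
    for quote in quotes:
        # Models often pull trailing punctuation inside the quotation marks.
        candidate = normalize(quote).rstrip(".,;:")
        if candidate and candidate not in ctx:
            unsupported.append(quote)
    return unsupported


# Toy context, hypothetical and much shorter than the real concatenated documents.
context = "Clue 1: People Passion Innovation. Clue 2: DSO53."
print(unsupported_quotes('The second clue is "DSO53."', context))  # []
print(unsupported_quotes('The second clue is "DSO99."', context))  # ['DSO99.']
```

A check like this only catches fabricated quotations. A quote that exists somewhere in the context but answers the wrong question would still pass, so it needs to be combined with other signals.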

In [1]:
import os
import json
import time
from pathlib import Path
from typing import List, Tuple
from tqdm import tqdm
import pandas as pd
from transformers import AutoTokenizer, AutoProcessor
from vllm import LLM, SamplingParams



INFO 09-23 09:55:18 [__init__.py:235] Automatically detected platform cuda.


In [2]:
# arguments
max_length = 128000 # can only be as long as your VRAM allows
model_path = "/home/jovyan/dummy_model"
docs_folder = "dummy_files/easy"
system_prompt = "You are a helpful assistant. You will be given a long context of concatenated documents with clues hidden in them."
question_list = "dummy_questions.csv"
output_file = "dummy_model.csv"

In [3]:
# split a generation into (thinking, answer); supports thinking models that wrap their reasoning in <think>...</think>
def extract_thinking_and_summary(text: str, bot: str = "<think>", eot: str = "</think>") -> Tuple[str, str]:
    if bot in text and eot not in text:
        return "", text
    if eot in text:
        if bot in text:
            return (
                text[text.index(bot) + len(bot):text.index(eot)].strip(),
                text[text.index(eot) + len(eot):].strip(),
            )
        else:
            return (
                text[:text.index(eot)].strip(),
                text[text.index(eot) + len(eot):].strip(),
            )
    return "", text
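
To make the four cases explicit, here is a small sanity check. The function body is a condensed restatement of the cell above (same branches, fewer lines) so that the block runs on its own, and the sample strings are made up for illustration.

```python
from typing import Tuple


def extract_thinking_and_summary(text: str, bot: str = "<think>", eot: str = "</think>") -> Tuple[str, str]:
    # Condensed version of the cell above: without a closing tag, everything is the answer.
    if eot not in text:
        return "", text
    start = text.index(bot) + len(bot) if bot in text else 0
    end = text.index(eot)
    return text[start:end].strip(), text[end + len(eot):].strip()


# 1) Both tags present: thought and answer are separated.
print(extract_thinking_and_summary("<think>check article 3</think>The clue is X."))
# 2) Only the closing tag (some chat templates emit the opening tag in the prompt).
print(extract_thinking_and_summary("check article 3</think>The clue is X."))
# 3) No tags at all: non-thinking model.
print(extract_thinking_and_summary("The clue is X."))
# 4) Opening tag only: generation was cut off mid-thought, raw text is returned as the answer.
print(extract_thinking_and_summary("<think>still reasoning"))
```

Case 4 is worth keeping in mind when scoring: a generation that hits `max_tokens` while still thinking ends up in the `answers` column with the `<think>` tag included.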

In [4]:
# load (file name, contents) of every document (.txt file) in docs_folder, sorted by name
def load_documents(folder_path: str) -> List[Tuple[str, str]]:
    docs = []
    for file in sorted(Path(folder_path).glob("*.txt")):
        with open(file, "r", encoding="utf-8") as f:
            docs.append((file.name, f.read()))
            # print(docs[-1]) # for debugging
    return docs

In [5]:
# combine documents into one long context
def concat_documents(docs: List[Tuple[str, str]], tokenizer, max_tokens=max_length) -> Tuple[str, int]:
    combined_text = ""
    token_count = 0
    for name, text in docs:
        header = f"\n\n===== Document: {name} =====\n\n"
        combined_text += header + text
        tokens = tokenizer.encode(combined_text, add_special_tokens=False)
        token_count = len(tokens)
        print(f"Added {name}: cumulative tokens = {token_count}")  # shows how much each document adds to the running total
        if token_count > max_tokens:
            # Only warns: the context is not truncated, so an oversized prompt is still returned
            print(f"⚠️ Warning: Context size exceeded {max_tokens} tokens!")
    return combined_text, token_count
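
The function above warns when the budget is exceeded but still returns the full text. Below is a minimal sketch of a budget-aware variant that tokenizes each document once and skips documents that do not fit. `concat_within_budget` and `WhitespaceTokenizer` are hypothetical names, and the whitespace tokenizer is only a stand-in so the example runs without a model. With a real subword tokenizer, the sum of per-document counts may differ slightly from the count of the joined string.

```python
from typing import List, Tuple


class WhitespaceTokenizer:
    """Hypothetical stand-in exposing the same encode() call used above."""

    def encode(self, text: str, add_special_tokens: bool = False) -> List[str]:
        return text.split()


def concat_within_budget(
    docs: List[Tuple[str, str]], tokenizer, max_tokens: int
) -> Tuple[str, int, List[str]]:
    """Concatenate documents in order, skipping any that would exceed max_tokens."""
    parts, used, skipped = [], 0, []
    for name, text in docs:
        chunk = f"\n\n===== Document: {name} =====\n\n" + text
        # Each document is tokenized once, instead of re-encoding the growing string.
        n_tokens = len(tokenizer.encode(chunk, add_special_tokens=False))
        if used + n_tokens > max_tokens:
            skipped.append(name)
            continue
        parts.append(chunk)
        used += n_tokens
    return "".join(parts), used, skipped


docs = [("a.txt", "one two three"), ("b.txt", "w " * 50), ("c.txt", "four five")]
text, used, skipped = concat_within_budget(docs, WhitespaceTokenizer(), max_tokens=15)
print(used, skipped)  # 13 ['b.txt']
```

Recording which documents were skipped matters for this task: a clue that never entered the context cannot be answered faithfully.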

In [6]:
def run_inference(model_path: str, docs_folder: str, system_prompt: str, question_list: str, output_file: str):
    start_time = time.time()

    # Load documents
    docs = load_documents(docs_folder)
    if not docs:
        print(f"No text files found in {docs_folder}.")
        return

    # Initialize tokenizer & processor
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    processor = AutoProcessor.from_pretrained(model_path)

    # Concatenate documents with headers
    combined_text, total_tokens = concat_documents(docs, tokenizer)
    print(f"Concatenated context snippet: {combined_text[:1000]}")
    print(f"Total tokens in concatenated context: {total_tokens}")

    # Prepare questions from .csv file
    df = pd.read_csv(question_list)
    
    thoughts = []
    answers = []
    
    # Initialize vLLM model
    llm = LLM(
        model=model_path,
        trust_remote_code=True,
        tensor_parallel_size=1,
        gpu_memory_utilization=0.9,
        max_model_len=max_length
    )

    # Set sampling parameters
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=1,
        repetition_penalty=1.05,
        max_tokens=8196,
        stop_token_ids=[]
    )
    
    for i, row in tqdm(df.iterrows(), total=len(df), desc="Inference"):
        try:
            # Build the chat messages: system prompt, then the long context followed by the question
            qn = row['question']
            messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": combined_text + qn},
            ]
            
            prompt = processor.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True,
            )
           
            llm_inputs = {"prompt": prompt}

            outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
            generated_text = outputs[0].outputs[0].text

            thought, answer = extract_thinking_and_summary(generated_text)
            print("thought: ", thought)
            print("answer: ", answer)
            thoughts.append(thought)
            answers.append(answer)

        except Exception as e:
            print(f"[!] Error on row {i}: {e}")
            thoughts.append("")
            answers.append("")

    df["thoughts"] = thoughts
    df["answers"] = answers

    # Save result as .csv
    model_name = os.path.basename(model_path).replace("/", "_")
    output_path = Path(output_file or f"{model_name}_results.csv")
    df.to_csv(output_path)
    print(f"Results saved to {output_path}")
    elapsed = time.time() - start_time
    print(f"⏱️ Time taken: {elapsed:.2f} seconds")
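
In the loop above, the user message is `combined_text + qn`, so the question starts immediately after the last character of the last document. Here is a minimal sketch of a helper that separates the two explicitly. The name `build_messages` and the `===== Question =====` delimiter are assumptions chosen to mirror the document headers, and whether this changes the model's answers has not been tested here.

```python
from typing import Dict, List


def build_messages(system_prompt: str, context: str, question: str) -> List[Dict[str, str]]:
    """Return chat messages with a visible delimiter between context and question."""
    user_content = f"{context.rstrip()}\n\n===== Question =====\n\n{question.strip()}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


messages = build_messages(
    "You are a helpful assistant.",
    "\n\n===== Document: article1.txt =====\n\nSome article text.\n",
    "What does the first clue say?",
)
print(messages[1]["content"])
```

The returned list has the same shape as `messages` in `run_inference`, so it could be passed to `apply_chat_template` in the same way.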

Let's see if the model can extract 3 hidden clues in the concatenated text.

In [7]:
run_inference(model_path, docs_folder, system_prompt, question_list, output_file)

Added article1.txt: cumulative tokens = 1575
Added article2.txt: cumulative tokens = 2347
Added article3.txt: cumulative tokens = 3420
Added article4.txt: cumulative tokens = 3858
Added article5.txt: cumulative tokens = 4621
Added article6.txt: cumulative tokens = 5461
Added article7.txt: cumulative tokens = 6666
Added article8.txt: cumulative tokens = 7472
Added article9.txt: cumulative tokens = 8299
Concatenated context snippet: 

===== Document: article1.txt =====

Pushing Boundaries in Flight: The 16th Singapore Amazing Flying Machine Competition Honours Innovation with Thrilling Demonstrations and a Celebration of Creativity

Singapore, 5 April 2025 – The Awards Ceremony for the 16th Singapore Amazing Flying Machine Competition (SAFMC) 2025 took place today at the Singapore University of Technology and Design (SUTD). Co-organised by DSO National Laboratories (DSO) and Science Centre Singapore (SCS), this year’s competition brought together 1,865 participants across 584 teams, comp

2025-09-23 09:55:22,729	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
`torch_dtype` is deprecated! Use `dtype` instead!


INFO 09-23 09:55:29 [config.py:1604] Using max model len 128000
INFO 09-23 09:55:29 [config.py:2434] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 09-23 09:55:34 [__init__.py:235] Automatically detected platform cuda.
INFO 09-23 09:55:36 [core.py:572] Waiting for init message from front-end.
INFO 09-23 09:55:36 [core.py:71] Initializing a V1 LLM engine (v0.10.0) with config: model='/home/jovyan/dummy_model', speculative_config=None, tokenizer='/home/jovyan/dummy_model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_pr

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:02<00:02,  2.65s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:04<00:00,  2.38s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:04<00:00,  2.42s/it]



INFO 09-23 09:55:43 [default_loader.py:262] Loading weights took 4.93 seconds
INFO 09-23 09:55:44 [gpu_model_runner.py:1892] Model loading took 7.1694 GiB and 5.127595 seconds
INFO 09-23 09:55:49 [backends.py:530] Using cache directory: /home/jovyan/.cache/vllm/torch_compile_cache/60cf5f0370/rank_0_0/backbone for vLLM's torch.compile
INFO 09-23 09:55:49 [backends.py:541] Dynamo bytecode transform time: 4.95 s
INFO 09-23 09:55:57 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 7.244 s
INFO 09-23 09:55:57 [monitor.py:34] torch.compile takes 4.95 s in total
INFO 09-23 09:55:59 [gpu_worker.py:255] Available KV cache memory: 30.86 GiB
INFO 09-23 09:55:59 [kv_cache_utils.py:833] GPU KV cache size: 252,784 tokens
INFO 09-23 09:55:59 [kv_cache_utils.py:837] Maximum concurrency for 128,000 tokens per request: 1.97x


Capturing CUDA graph shapes: 100%|██████████| 67/67 [00:02<00:00, 22.41it/s]


INFO 09-23 09:56:02 [gpu_model_runner.py:2485] Graph capturing finished in 3 secs, took 0.57 GiB
INFO 09-23 09:56:02 [core.py:193] init engine (profile, create kv cache, warmup model) took 18.33 seconds


Inference:   0%|          | 0/3 [00:00<?, ?it/s]
Adding requests: 100%|██████████| 1/1 [00:00<00:00, 29.23it/s]

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.09it/s, est. speed input: 9121.94 toks/s, output: 10.95 toks/s]
Inference:  33%|███▎      | 1/3 [00:00<00:01,  1.03it/s]

thought:  
answer:  The first clue says "People Passion Innovation."



Adding requests: 100%|██████████| 1/1 [00:00<00:00, 37.49it/s]

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  2.91it/s, est. speed input: 24510.70 toks/s, output: 41.18 toks/s]
Inference:  67%|██████▋   | 2/3 [00:01<00:00,  1.61it/s]

thought:  
answer:  The second clue provided in the text is "DSO53."



Adding requests: 100%|██████████| 1/1 [00:00<00:00, 34.46it/s]

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  3.71it/s, est. speed input: 31385.28 toks/s, output: 41.43 toks/s]
Inference: 100%|██████████| 3/3 [00:01<00:00,  1.82it/s]


thought:  
answer:  Clue #3 is "I love IEL!"
Results saved to dummy_model.csv
⏱️ Time taken: 44.20 seconds




It seems that a context length of about 8k tokens is not much of a problem for the model. What if we extend the context length with more documents?

In [8]:
pd.read_csv("dummy_model.csv")[["question", "answers"]]

Unnamed: 0,question,answers
0,What does the first clue say?,"The first clue says ""People Passion Innovation."""
1,What is the 2nd clue?,"The second clue provided in the text is ""DSO53."""
2,What is Clue #3?,"Clue #3 is ""I love IEL!"""


In [9]:
run_inference(model_path, "dummy_files/hard", system_prompt, question_list, output_file)

Added article1.txt: cumulative tokens = 1575
Added article2.txt: cumulative tokens = 2347
Added article3.txt: cumulative tokens = 3420
Added article4.txt: cumulative tokens = 3858
Added article5.txt: cumulative tokens = 4621
Added article6.txt: cumulative tokens = 5461
Added article7.txt: cumulative tokens = 6666
Added article8.txt: cumulative tokens = 7472
Added article9.txt: cumulative tokens = 8299
Added paper1.txt: cumulative tokens = 25430
Added paper2.txt: cumulative tokens = 42257
Concatenated context snippet: 

===== Document: article1.txt =====

Pushing Boundaries in Flight: The 16th Singapore Amazing Flying Machine Competition Honours Innovation with Thrilling Demonstrations and a Celebration of Creativity

Singapore, 5 April 2025 – The Awards Ceremony for the 16th Singapore Amazing Flying Machine Competition (SAFMC) 2025 took place today at the Singapore University of Technology and Design (SUTD). Co-organised by DSO National Laboratories (DSO) and Science Centre Singapore (

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.46s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.23s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.27s/it]



INFO 09-23 09:56:31 [default_loader.py:262] Loading weights took 2.68 seconds
INFO 09-23 09:56:32 [gpu_model_runner.py:1892] Model loading took 7.1694 GiB and 2.868716 seconds
INFO 09-23 09:56:37 [backends.py:530] Using cache directory: /home/jovyan/.cache/vllm/torch_compile_cache/60cf5f0370/rank_0_0/backbone for vLLM's torch.compile
INFO 09-23 09:56:37 [backends.py:541] Dynamo bytecode transform time: 4.98 s
INFO 09-23 09:56:44 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 6.582 s
INFO 09-23 09:56:45 [monitor.py:34] torch.compile takes 4.98 s in total
INFO 09-23 09:56:46 [gpu_worker.py:255] Available KV cache memory: 30.86 GiB
INFO 09-23 09:56:46 [kv_cache_utils.py:833] GPU KV cache size: 252,784 tokens
INFO 09-23 09:56:46 [kv_cache_utils.py:837] Maximum concurrency for 128,000 tokens per request: 1.97x


Capturing CUDA graph shapes: 100%|██████████| 67/67 [00:02<00:00, 22.71it/s]


INFO 09-23 09:56:50 [gpu_model_runner.py:2485] Graph capturing finished in 3 secs, took 0.57 GiB
INFO 09-23 09:56:50 [core.py:193] init engine (profile, create kv cache, warmup model) took 17.68 seconds


Inference:   0%|          | 0/3 [00:00<?, ?it/s]
Adding requests: 100%|██████████| 1/1 [00:00<00:00,  7.32it/s]

Processed prompts: 100%|██████████| 1/1 [00:07<00:00,  7.40s/it, est. speed input: 5721.32 toks/s, output: 1.49 toks/s]
Inference:  33%|███▎      | 1/3 [00:07<00:15,  7.54s/it]

thought:  
answer:  The first clue says: "People Passion Innovation"



Adding requests: 100%|██████████| 1/1 [00:00<00:00,  8.38it/s]

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.61it/s, est. speed input: 68427.66 toks/s, output: 17.80 toks/s]
Inference:  67%|██████▋   | 2/3 [00:08<00:03,  3.55s/it]

thought:  
answer:  The second clue is: "People Passion Innovation."



Adding requests: 100%|██████████| 1/1 [00:00<00:00,  8.40it/s]

Processed prompts: 100%|██████████| 1/1 [00:12<00:00, 12.98s/it, est. speed input: 3259.45 toks/s, output: 18.34 toks/s]
Inference: 100%|██████████| 3/3 [00:21<00:00,  7.15s/it]


thought:  
answer:  Clue #3 is hidden within the description of DSO National Laboratories and Science Centre Singapore, mentioning their roles and capabilities. The phrase "People Passion Innovation" appears as part of the organization's profile, hinting at a name or slogan. By examining the provided excerpts, we can't find a direct mention of a slogan or motto containing exactly "People Passion Innovation." However, looking at the structure of the documents, we notice repeated mentions of DSO and SCS, possibly representing the acronym or initials for their full titles, Department of Scientific Organizations and Centers of Science and Safety.

Given the format of the articles and the hints provided, it's plausible that "People Passion Innovation" is meant to symbolize something related to the departments involved in organizing the Singapore Amazing Flying Machine Competition, perhaps as their catchphrase or theme. But strictly speaking, none of the texts explicitly confirm this phrase 



Interesting... the model forgot what the second and third clues were! It repeated the first clue as the second, and its answer for the third never states a clue at all. Can you explain why and when hallucinations occur?
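
To put a number on this, here is a minimal sketch that checks whether each answer contains the expected clue. It assumes the answers from the shorter (8k) run are the reference clues. The notebook implies this but does not state it. The answers are copied from the output above, with the third one truncated as in the saved results, and `contains_clue` and `score` are hypothetical helper names.

```python
from typing import List

# Reference clues: assumed to be the answers from the shorter run above.
expected = ["People Passion Innovation", "DSO53", "I love IEL!"]

# Answers from the longer run, as printed above (third one truncated).
hard_answers = [
    'The first clue says: "People Passion Innovation"',
    'The second clue is: "People Passion Innovation."',
    "Clue #3 is hidden within the description of DS...",
]


def contains_clue(answer: str, clue: str) -> bool:
    """Case-insensitive substring match between the expected clue and the answer."""
    return clue.lower() in answer.lower()


def score(answers: List[str], clues: List[str]) -> List[bool]:
    """Pair each answer with its expected clue and report whether it was recovered."""
    return [contains_clue(a, c) for a, c in zip(answers, clues)]


results = score(hard_answers, expected)
print(results)                      # [True, False, False]
print(sum(results) / len(results))  # fraction of clues recovered
```

Substring matching needs ground truth, so it only works for evaluation. For the hackathon task, the detector has to reach a similar verdict from the context and the answer alone.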

In [10]:
pd.read_csv("dummy_model.csv")[["question", "answers"]]

Unnamed: 0,question,answers
0,What does the first clue say?,"The first clue says: ""People Passion Innovation"""
1,What is the 2nd clue?,"The second clue is: ""People Passion Innovation."""
2,What is Clue #3?,Clue #3 is hidden within the description of DS...
