# **Run the experiments!** 🤖

Call the methods to generate and evaluate LLM hallucinations for all experiments.

The experiments are described in **Section X** of the paper, and the results are reported in **Section Y**.

In [None]:
# define the dataset file to use

dataset_file = "data/bigcodebench/bigcodebench_eval.json"

In [None]:
# define the models

final_models = [
    "meta-llama/llama-3.3-70b-instruct-turbo",
    "meta-llama/llama-4-scout-17b-16e-instruct",
    "gpt-4o-mini-2024-07-18",
    "gpt-4.1-mini-2025-04-14",
    "codestral-2501",
    "mistral-medium-2505",
    "qwen/qwen2.5-coder-32b-instruct",
    "qwen/qwen2.5-72b-instruct-turbo",
    # also llama-3.2
]

init_models = [
    "gpt-4o-mini-2024-07-18",
    "ministral-8b-2410",
    "qwen/qwen2.5-coder-32b-instruct",
]

check_models = ["gpt-4o-mini-2024-07-18"]

## **RQ1:** Realistic User Language

How do realistic variations in user descriptions of libraries/members affect the hallucination rates of LLMs during code generation?

In [None]:
# RQ1.1: library description experiments

from src import run_describe_experiment

for run_type in [
    "base",
    "open",
    "free",
    "best",
    "simple",
    "alternative",
    "easy",
    "lightweight",
    "fast",
    "modern",
]:
    run_describe_experiment(
        run_type=run_type,
        run_level="library",
        models=init_models,
        dataset_file=dataset_file,
    )

In [None]:
# RQ1.2: member description experiments

from src import run_describe_experiment

for run_type in [
    "base",
    "best",
    "simple",
    "alternative",
    "easy",
    "lightweight",
    "fast",
    "modern",
]:
    run_describe_experiment(
        run_type=run_type,
        run_level="member",
        models=init_models,
        dataset_file=dataset_file,
    )

In [None]:
# RQ1.3: library year-based description experiments

from src import run_describe_experiment

for run_type in [
    "year_release",
    "year_version",
]:
    for year in [
        2023,
        2024,
        2025,
    ]:
        run_describe_experiment(
            run_type=run_type,
            run_level="library",
            year=year,
            models=init_models,
            dataset_file=dataset_file,
        )

## **RQ2:** Robustness to Mistakes

How often do LLMs attempt to import a user-specified library/member that does not actually exist, whether the name is a 1-character typo, a multi-character typo, or a fabrication?
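
To make the typo categories concrete, here is a minimal sketch of how a 1-character and a multi-character typo could be produced by substituting letters. The helper `make_typo` and its parameters are hypothetical and are not taken from `src`; the repository's own generation method may differ. The sketch assumes plain names without dots.

```python
import random
import string


def make_typo(name: str, n_edits: int, rng: random.Random) -> str:
    """Hypothetical helper: substitute `n_edits` distinct positions of `name`
    with a different lowercase letter."""
    if n_edits > len(name):
        raise ValueError("n_edits cannot exceed the length of the name")
    chars = list(name)
    # Sample distinct positions so that exactly n_edits characters change.
    for pos in rng.sample(range(len(chars)), n_edits):
        choices = [c for c in string.ascii_lowercase if c != chars[pos].lower()]
        chars[pos] = rng.choice(choices)
    return "".join(chars)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
small = make_typo("numpy", 1, rng)        # 1-character typo
medium = make_typo("matplotlib", 3, rng)  # multi-character typo
print(small, medium)
```

A fabrication, by contrast, would be a plausible name that is not derived from any real library, so it cannot be produced by character edits like these.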

In [None]:
# RQ2.1: library typo and fabrication experiments

from src.run_specify import run_specify_experiment

for run_type in [
    "base",
    "typo_small",
    "typo_medium",
    "fabrication",
]:
    run_specify_experiment(
        run_type=run_type,
        run_level="library",
        models=init_models,
        dataset_file=dataset_file,
    )

In [None]:
# RQ2.2: member typo and fabrication experiments

from src.run_specify import run_specify_experiment

for run_type in [
    "base",
    "typo_small",
    "typo_medium",
    "fabrication",
]:
    run_specify_experiment(
        run_type=run_type,
        run_level="member",
        models=init_models,
        dataset_file=dataset_file,
    )

## **RQ3:** Practical Mitigation Strategies

Can practical and widely used prompt engineering strategies help to mitigate hallucinations in the situations described in **RQ1** and **RQ2**?

In [3]:
from src.prompts import (
    POST_PROMPT_CHAIN_OF_THOUGHT,
    POST_PROMPT_REPHRASE_RESPOND,
    POST_PROMPT_SELF_ANALYSIS,
    POST_PROMPT_SELF_ASK,
    POST_PROMPT_STEP_BACK,
)

for post_prompt in [
    POST_PROMPT_CHAIN_OF_THOUGHT,
    POST_PROMPT_REPHRASE_RESPOND,
    POST_PROMPT_SELF_ANALYSIS,
    POST_PROMPT_SELF_ASK,
    POST_PROMPT_STEP_BACK,
]:
    # TODO: implement RQ3 experiments with practical mitigation strategies
    pass
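
The RQ3 experiments are not implemented yet. As a rough illustration of what a "post-prompt" strategy might look like, the sketch below appends a mitigation instruction after the task prompt. The function `with_post_prompt` and both example strings are assumptions made for illustration; they are not the contents of `src.prompts`, and the eventual implementation may combine prompts differently.

```python
def with_post_prompt(task_prompt: str, post_prompt: str) -> str:
    """Hypothetical helper: place a mitigation instruction after the task prompt,
    separated by a blank line."""
    return f"{task_prompt.rstrip()}\n\n{post_prompt.strip()}"


# Placeholder strings for illustration only (not the real POST_PROMPT_* values).
example_task = "Write Python code that parses a CSV file using the best library."
example_post_prompt = "Think step by step before writing the final code."

print(with_post_prompt(example_task, example_post_prompt))
```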

## **Extended Analysis**

Can we find any descriptions that induce hallucinations?

In [None]:
from src import run_describe_experiment

for run_type in [
    "ext_hidden",
    "ext_diamond",
]:
    for run_level in [
        "library",
        "member",
    ]:
        # `year` is not passed here: it is not defined in this cell and
        # these run types are not year-based.
        run_describe_experiment(
            run_type=run_type,
            run_level=run_level,
            models=init_models,
            dataset_file=dataset_file,
        )

## **Evaluation**

Use the code below to re-evaluate results files if necessary.

In [None]:
# define the results files to evaluate

results_files = [
    "output/specify/spec_mem_typo_small_2025-08-04T22:18:57.490823.json",
]
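
When many results files need re-evaluating, the list can be built by scanning the output directory instead of writing each path by hand. This is a minimal sketch: the helper `find_results_files` is hypothetical, and the `output` root is assumed from the example path above.

```python
from pathlib import Path


def find_results_files(root: str = "output", pattern: str = "*.json") -> list[str]:
    """Hypothetical helper: return sorted paths of all files under `root`
    (searched recursively) that match `pattern`."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []  # nothing to evaluate if the directory does not exist
    return sorted(str(p) for p in root_path.rglob(pattern))


results_files = find_results_files("output")
print(f"Found {len(results_files)} results file(s)")
```

Narrowing `root` (for example to `output/specify`) restricts the scan to a single experiment family.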

In [None]:
from src import evaluate_hallucinations

for file in results_files:
    evaluate_hallucinations(
        results_file=file,
    )