# **Run the experiments!** 🤖

Call the methods to generate and evaluate LLM hallucinations for all experiments.

The experiments are described in **Section 3** of the paper, with results presented in **Section 4**.

In [1]:
# import the experiments

from src import run_describe_experiment, run_specify_experiment

In [2]:
# define the dataset file to use

dataset_file = "data/bigcodebench/bigcodebench_eval.json"
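
Before starting long runs, it may help to confirm that the dataset file exists and parses. The sketch below is not part of `src`: `check_dataset` is a hypothetical helper, and it assumes only that the file is a single valid JSON document, as the `.json` extension suggests.

```python
import json
from pathlib import Path


def check_dataset(path):
    """Return the parsed JSON content of `path`, or raise a clear error.

    Hypothetical helper, not part of `src`. It assumes only that the
    dataset is stored as one valid JSON document.
    """
    file = Path(path)
    if not file.is_file():
        raise FileNotFoundError(f"dataset file not found: {file}")
    with file.open(encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `check_dataset(dataset_file)` before the experiment cells should surface a missing or malformed file early, rather than partway through a run.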

In [3]:
# define the models

models = [
    "gpt-4o-mini-2024-07-18",
    "qwen/qwen2.5-coder-32b-instruct",
    "ministral-8b-2410",
    "meta-llama/llama-3.3-70b-instruct-turbo",
    "gpt-5-mini-2025-08-07",
    "deepseek-chat",
]

## **RQ1:** Realistic User Language

How do realistic variations in user descriptions of libraries/members affect the hallucination rates of LLMs during code generation?

This experiment is described in **Section 3.4**, with results presented in **Section 4.1**.

In [None]:
# RQ1: adjective-based library description experiments
for run_type in [
    "base",
    "open",
    "free",
    "best",
    "simple",
    "alternative",
    "easy",
    "lightweight",
    "fast",
    "modern",
]:
    run_describe_experiment(
        run_type=run_type,
        run_level="library",
        models=models,
        dataset_file=dataset_file,
    )

In [None]:
# RQ1: adjective-based member description experiments
for run_type in [
    "base",
    "best",
    "simple",
    "alternative",
    "easy",
    "lightweight",
    "fast",
    "modern",
]:
    run_describe_experiment(
        run_type=run_type,
        run_level="member",
        models=models,
        dataset_file=dataset_file,
    )

In [None]:
# RQ1: year-based library description experiments
for year in [
    2023,
    2024,
    2025,
]:
    run_describe_experiment(
        run_type="year_from",
        run_level="library",
        year=year,
        models=models,
        dataset_file=dataset_file,
    )

## **RQ2:** Robustness to Mistakes

How often do LLMs attempt to import a user-specified library/member that does not actually exist (either a one-character typo, a multi-character typo, or a fabrication)?

This experiment is described in **Section 3.5**, with results presented in **Section 4.2**.

In [None]:
# RQ2: library typo and fabrication experiments
for run_type in [
    "base",
    "typo_small",
    "typo_medium",
    "fabrication",
]:
    run_specify_experiment(
        run_type=run_type,
        run_level="library",
        models=models,
        dataset_file=dataset_file,
        output_dir="output/new_models",
    )

In [None]:
# RQ2: member typo and fabrication experiments
for run_type in [
    "base",
    "typo_small",
    "typo_medium",
    "fabrication",
]:
    run_specify_experiment(
        run_type=run_type,
        run_level="member",
        models=models,
        dataset_file=dataset_file,
    )

## **RQ3:** Practical Mitigation Strategies

Can practical and widely used prompt engineering strategies help to mitigate hallucinations in the situations described in **RQ1** and **RQ2**?

This experiment is described in **Section 3.6**, with results presented in **Section 4.3**.

In [None]:
# RQ3: repeat experiments with mitigation strategies
for mitigation_strategy in [
    "chain_of_thought",
    "self_analysis",
    "step_back",
    "explicit_check",
]:
    # RQ1: repeat year-based library description experiments
    for year in [
        2023,
        2024,
        2025,
    ]:
        run_describe_experiment(
            run_type="year_from",
            run_level="library",
            year=year,
            models=models,
            dataset_file=dataset_file,
            mitigation_strategy=mitigation_strategy,
        )

    # RQ2: repeat library typo and fabrication experiments
    for run_type in [
        "typo_small",
        "typo_medium",
        "fabrication",
    ]:
        run_specify_experiment(
            run_type=run_type,
            run_level="library",
            models=models,
            dataset_file=dataset_file,
            mitigation_strategy=mitigation_strategy,
        )
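
The nested loops above cover a grid of configurations. As a rough planning aid, the sketch below enumerates that grid without calling any experiment code, so the number of runs can be checked beforehand. The variable names (`describe_runs`, `specify_runs`) are illustrative only, and the entries are grouped by experiment type rather than in the interleaved order the loops use.

```python
from itertools import product

mitigation_strategies = [
    "chain_of_thought",
    "self_analysis",
    "step_back",
    "explicit_check",
]
years = [2023, 2024, 2025]
mistake_types = ["typo_small", "typo_medium", "fabrication"]

# One entry per call to run_describe_experiment (RQ1, year-based descriptions).
describe_runs = [
    {
        "run_type": "year_from",
        "run_level": "library",
        "year": year,
        "mitigation_strategy": strategy,
    }
    for strategy, year in product(mitigation_strategies, years)
]

# One entry per call to run_specify_experiment (RQ2, typos and fabrications).
specify_runs = [
    {
        "run_type": run_type,
        "run_level": "library",
        "mitigation_strategy": strategy,
    }
    for strategy, run_type in product(mitigation_strategies, mistake_types)
]

# 4 strategies x (3 years + 3 mistake types) = 24 calls in total.
print(len(describe_runs) + len(specify_runs))  # prints 24
```

Each of these calls receives the full `models` list, so the total work scales with the number of models as well.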

## **Extra:** Induced Hallucinations

Can we find any descriptions that induce hallucinations?

This experiment is presented in **Section 5.1** and **Appendix C**.

In [None]:
# Extra: library description experiments intended to induce hallucinations
for run_type in [
    "ext_lesser",
    "ext_unknown",
    "ext_hidden",
]:
    run_describe_experiment(
        run_type=run_type,
        run_level="library",
        models=models,
        dataset_file=dataset_file,
        start_index=250,
    )

## **Evaluation**

Use the code below to re-evaluate results files if necessary.

In [None]:
# define the results files to evaluate

results_files = []
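
The `results_files` list is left empty and has to be filled in by hand. One possible way to populate it is to collect files from an output directory with `pathlib`. This is only a sketch: `find_results_files` is a hypothetical helper, and the `*.json` pattern and the directory layout are assumptions (this notebook shows only the `output/new_models` path, not the file naming).

```python
from pathlib import Path


def find_results_files(root, pattern="*.json"):
    """Return sorted paths (as strings) of files under `root` matching `pattern`.

    Hypothetical helper, not part of `src`. The default pattern assumes
    that results are stored as JSON files.
    """
    return sorted(
        str(path) for path in Path(root).rglob(pattern) if path.is_file()
    )
```

For example, `results_files = find_results_files("output/new_models")` would gather every matching file below that directory; adjust the root and pattern to match where the results were actually written.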

In [None]:
from src import evaluate_hallucinations

for file in results_files:
    evaluate_hallucinations(
        results_file=file,
    )