# **Run the experiments!** 🤖

Call the methods to generate and analyse LLM responses for all experiments.

Experiments described in **Section 3** of the paper, results given in **Section 4**.

In [None]:
# run this cell to configure models
models = [
    "gpt-4o-mini-2024-07-18",
    "gpt-3.5-turbo-0125",
    "claude-3-5-sonnet-20241022",
    "claude-3-5-haiku-20241022",
    "meta-llama/llama-3.2-3b-instruct-turbo",
    "qwen/qwen2.5-coder-32b-instruct",
    "deepseek-ai/deepseek-llm-67b-chat",
    "mistralai/mistral-7b-instruct-v0.3",
]

## Library Preferences, Benchmark Tasks

Analyse the libraries used by LLMs when solving library-agnostic Python problems from BigCodeBench that require external libraries.

Experiment described in **Section 3.3.1**, results given in **Section 4.1.1** and **Figure 1**.

In [None]:
from src import run_llm_code_bias_experiment

run_llm_code_bias_experiment(
    bias_type="library",
    dataset_file="data/library/benchmark_tasks/bigcodebench_ext.json",
    models=models,
    samples=3,
)
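
To illustrate what "analysing the libraries used" could involve, here is a minimal sketch that counts top-level imports across a few Python responses. It is not the implementation inside `src`; the helper `top_level_imports` and the sample `responses` are hypothetical and exist only for this example.

```python
import ast
from collections import Counter


def top_level_imports(code: str) -> set[str]:
    """Return the top-level module names imported by a piece of Python code."""
    modules = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            # `import a.b as c` contributes the top-level package `a`
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # absolute `from a.b import c` contributes `a`; relative imports are skipped
            modules.add(node.module.split(".")[0])
    return modules


# hypothetical responses standing in for code extracted from LLM output
responses = [
    "import numpy as np\nfrom pandas import DataFrame",
    "import numpy\nimport matplotlib.pyplot as plt",
]

# count how many responses import each library
counts = Counter(lib for code in responses for lib in top_level_imports(code))
print(counts.most_common())
```

Parsing with `ast` rather than matching text avoids counting imports that only appear inside strings or comments, although it does require the response to be syntactically valid Python.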

## Library Preferences, Project Initialisation Tasks

Analyse the libraries used by LLMs when writing the initial structural code for new Python projects that require external libraries.

Experiment described in **Section 3.3.2**, results given in **Section 4.1.2** and **Table 2**.

In [None]:
from src import run_llm_code_bias_experiment

for dataset in [
    "database",
    "deeplearning",
    "distributed",
    "webscraper",
    "webserver",
]:
    run_llm_code_bias_experiment(
        bias_type="library",
        dataset_file=f"data/library/project_tasks/{dataset}.json",
        models=models,
        samples=100,
    )

## Language Preferences, Benchmark Tasks

Analyse the languages used by LLMs when solving language-agnostic coding problems from widely-used benchmark datasets.

Experiment described in **Section 3.4.1**, results given in **Section 4.2.1** and **Table 4**.

In [None]:
from src import run_llm_code_bias_experiment

for dataset in [
    "aixbench",
    "codecontests",
    "conala",
    "leetcode",
    "mbxp",
    "multihumaneval",
]:
    run_llm_code_bias_experiment(
        bias_type="language",
        dataset_file=f"data/language/benchmark_tasks/{dataset}.json",
        models=models,
        samples=3,
    )
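
One simple way to see which language a response uses is to read the language tag on its fenced code blocks. The sketch below assumes responses are markdown with tagged fences; `fenced_languages` and the sample responses are hypothetical, and the analysis inside `src` may work differently.

```python
import re
from collections import Counter

FENCE = "`" * 3  # a markdown code fence, built here to avoid nesting fences

# an opening fence at the start of a line, followed by a language tag
FENCE_PATTERN = re.compile(rf"^{FENCE}([A-Za-z0-9_+#-]+)\s*$", re.MULTILINE)


def fenced_languages(response: str) -> list[str]:
    """Return the lower-cased language tag of each fenced block in a response."""
    return [tag.lower() for tag in FENCE_PATTERN.findall(response)]


# hypothetical responses standing in for LLM output
responses = [
    f"Here is a solution:\n{FENCE}python\nprint('hi')\n{FENCE}\n",
    f"{FENCE}C++\nint main() {{ return 0; }}\n{FENCE}\n",
    f"{FENCE}Python\nx = 1\n{FENCE}\n",
]

# tally the languages seen across all responses
counts = Counter(lang for r in responses for lang in fenced_languages(r))
print(counts.most_common())
```

Closing fences carry no tag, so they are ignored; untagged opening fences are ignored too, which means this approach undercounts responses that omit the tag.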

## Language Preferences, Project Initialisation Tasks

Analyse the languages used by LLMs when writing the initial structural code for new projects.

Experiment described in **Section 3.4.2**, results given in **Section 4.2.2** and **Table 5**.

In [None]:
from src import run_llm_code_bias_experiment

for dataset in [
    "concurrency",
    "graphical",
    "lowlatency",
    "parallel",
    "systemlevel",
]:
    run_llm_code_bias_experiment(
        bias_type="language",
        dataset_file=f"data/language/project_tasks/{dataset}.json",
        models=models,
        samples=100,
    )

## Varying Temperature

Analyse the languages and libraries used for writing initial project code when the temperature parameter is varied.

Investigation done as part of the extended analysis in **Section 5.2.1**, results given in **Table 8**.

In [None]:
from src import run_llm_code_bias_experiment

for dataset in [
    "concurrency",
    "graphical",
    "lowlatency",
    "parallel",
    "systemlevel",
]:
    for temperature in [0.0, 0.5, 1.0, 1.5]:
        run_llm_code_bias_experiment(
            bias_type="language",
            dataset_file=f"data/language/project_tasks/{dataset}.json",
            models=["gpt-4o-mini-2024-07-18"],
            samples=100,
            temperature=temperature,
        )
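
For intuition about what the `temperature` values above change, the sketch below applies the standard temperature-scaled softmax to a toy set of logits. It is a textbook formulation with made-up logits, not the sampling code of any particular model provider, and treating `0.0` as greedy selection is an assumption about how that limit is commonly handled.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # assumed limit case: all probability mass on the highest logit (greedy)
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# made-up logits for three candidate tokens
toy_logits = [2.0, 1.0, 0.0]

for temperature in [0.0, 0.5, 1.0, 1.5]:
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the most likely option, while higher temperatures flatten the distribution, so more varied choices would be expected as the temperature rises.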

## Reasoning via Prompt Engineering

Analyse the languages used by LLMs when writing the initial structural code for new projects, when using a prompt designed to induce reasoning. Does it help to mitigate the internal inconsistencies?

Investigation done as part of the extended analysis in **Section 5.2.2**, results given in **Table 9**.


In [None]:
from src import run_llm_code_bias_experiment
from src.prompts import (
    LANGUAGE_POST_PROMPT_STEP,
    LANGUAGE_POST_PROMPT_CHECK,
    LANGUAGE_POST_PROMPT_LIST,
)

for dataset in [
    "concurrency",
    "graphical",
    "lowlatency",
    "parallel",
    "systemlevel",
]:
    for reasoning_prompt in [
        LANGUAGE_POST_PROMPT_CHECK,
        LANGUAGE_POST_PROMPT_LIST,
        LANGUAGE_POST_PROMPT_STEP,
    ]:
        run_llm_code_bias_experiment(
            bias_type="language",
            dataset_file=f"data/language/project_tasks/{dataset}.json",
            models=["gpt-4o-mini-2024-07-18"],
            samples=100,
            post_prompt=reasoning_prompt,
        )