## MuCoCo Code Generation BigCodeBench Benchmark Testing

This notebook runs MuCoCo code generation experiments on the BigCodeBench benchmark.

In [None]:
import os
import sys

In [None]:
curr_dir = os.getcwd()
parent_dir = os.path.dirname(curr_dir)
proj_dir = os.path.dirname(parent_dir)
sys.path.append(proj_dir)

In [None]:
from code_generation.code_generation_tester import CodeGenerationTester
from code_generation.prompt_templates.prompt_template import OpenEndedPromptTemplate
from utility.constants import BigCodeBench, HumanEval, LexicalMutations, SyntacticMutations, LogicalMutations, PromptTypes, CodeGeneration, ReasoningModels, NonReasoningModels

In [None]:
## Declaring Prompt Type Constants
ZERO_SHOT = PromptTypes.ZERO_SHOT
ONE_SHOT = PromptTypes.ONE_SHOT

## Declaring Mutation Constants
RANDOM_MUTATION = LexicalMutations.RANDOM
SEQUENTIAL_MUTATION = LexicalMutations.SEQUENTIAL
LITERAL_FORMAT = LexicalMutations.LITERAL_FORMAT

## Declaring Benchmark Name Constants
BIGCODEBENCH = BigCodeBench.NAME
HUMANEVAL = HumanEval.NAME

## Declaring Reasoning Model Name Constants
GPT5 = ReasoningModels.GPT5['name']

## Declaring Non-Reasoning Model Name Constants
CODESTRAL = NonReasoningModels.CODESTRAL['name']
GPT4O = NonReasoningModels.GPT4O['name']
DEEPSEEK = NonReasoningModels.DEEPSEEK_CHAT['name']
 

In [None]:
reasoning_models = [getattr(ReasoningModels, model) for model in dir(ReasoningModels) if not model.startswith("_")]
non_reasoning_models = [getattr(NonReasoningModels, model) for model in dir(NonReasoningModels) if not model.startswith("_")]
print('Reasoning models supported by this framework are:')
for idx, model in enumerate(reasoning_models):
    print(f"{idx+1}: '{model['name']}'")
print('=' * 50)
print('Non-reasoning models supported by this framework are:')
for idx, model in enumerate(non_reasoning_models):
    print(f"{idx+1}: '{model['name']}'")

In [None]:
task_set = BIGCODEBENCH

try:
    llmtester = CodeGenerationTester(f"{task_set}_Code_Generation", n=5)
except Exception as e:
    print(f'llmtester could not launch due to the following error: {e}')
    raise  # Later cells depend on llmtester, so stop here instead of hitting a NameError



In [None]:
num_tests = llmtester.question_database.count_documents({})

In [None]:
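valid_mutations = CodeGeneration.MUTATIONS
print("These are the valid mutation names for code generation:")
# Filter out LITERAL_FORMAT first so the printed numbering has no gaps
listed_mutations = [mutation for mutation in valid_mutations if mutation != LITERAL_FORMAT]
for idx, mutation in enumerate(listed_mutations, start=1):
    print(idx, mutation)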

In [None]:
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend (no GUI)

import matplotlib.pyplot as plt
plt.ioff()  # Turn off interactive mode
plt.show = lambda *args, **kwargs: None  # Make plt.show() a no-op

# Run your experiments

The `run_code_generation_test` method runs code generation tests on MuCoCo. It accepts the following parameters:

| Parameter            | Type        | Description                                                                                                                              |
| -------------------- | ----------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| `prompt_helper`      | `str`       | Prompt template string for the chosen prompt type. To switch templates, set the `prompt_type` variable to `ONE_SHOT` or `ZERO_SHOT`.       |
| `output_file_path`   | `str`       | Full path to the CSV where predictions and metrics are saved. The filename is built from the task set, prompt type, and mutation tag.      |
| `num_tests`          | `int`       | Number of test questions to evaluate. Between 1 and 160 questions are available for evaluation.                                           |
| `mutations`          | `List[str]` | Mutation operators to apply (e.g., `["FOR2WHILE"]`, `["CONSTANT_UNFOLD"]`). An empty list means **no_mutation**.                           |
| `model_name`         | `str`       | Identifier of the LLM under test (e.g., `GPT4O`). Used for routing and naming.                                                            |
| `task_set`           | `str`       | Either `BIGCODEBENCH` or `HUMANEVAL` for code generation.                                                                                 |
| `continue_from_task` | `str`       | Optional. Starts evaluation from the given task ID, which must match the task's ID in MongoDB (e.g., `"BigCodeBencho15"`).                 |

The following example runs a code generation test on every task in the BigCodeBench benchmark with no mutations applied. To apply a mutation, add its constant to the `mutations` list; for example, `mutations = [RANDOM_MUTATION]` applies the Random mutation. The mutations available for code generation testing are declared as constants above.
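
The way the `mutations` list feeds into the results filename can be sketched in isolation. The helper name `build_output_path` and the argument values below are hypothetical placeholders, not part of MuCoCo; only the joining rule mirrors the experiment cell in this notebook.

```python
import os

def build_output_path(results_dir, task_set, prompt_type, mutations):
    """Hypothetical helper: build the results CSV path from the run settings."""
    # An empty mutation list is tagged "no_mutation"; otherwise names are joined with "_".
    mutation_str = "_".join(mutations) if mutations else "no_mutation"
    return os.path.join(results_dir, f"{task_set}_{prompt_type}_{mutation_str}.csv")

# Placeholder values, not the real constants from utility.constants.
print(build_output_path("results", "bigcodebench", "one_shot", []))
print(build_output_path("results", "bigcodebench", "one_shot", ["random", "sequential"]))
```

Because the mutation tag is part of the filename, runs with different mutation lists write to different CSV files rather than overwriting each other.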

In [None]:
# %%script false --no-raise-error
mutations = []
prompt_type = ONE_SHOT
model_name = GPT4O  # Change to your desired model.

# Create the results directory for this model
results_dir = os.path.join(proj_dir, "results", "code_generation", model_name)
os.makedirs(results_dir, exist_ok=True)

# Tag the output file with the applied mutations, or "no_mutation" if none
mutation_str = "_".join(mutations) if mutations else "no_mutation"
output_file_path = os.path.join(results_dir, f"{task_set}_{prompt_type}_{mutation_str}.csv")

pass_count = llmtester.run_code_generation_test(
    prompt_helper=OpenEndedPromptTemplate().return_appropriate_prompt(prompt_type),
    num_tests=num_tests,
    mutations=mutations,
    prompt_type=prompt_type,
    output_file_path=output_file_path,
    task_set=task_set,
    model_name=model_name,
)

print(f"Results saved in {output_file_path}")
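
If the returned `pass_count` is the number of tasks whose generated code passed (an assumption based on the variable name, since the return value is not documented here), a pass rate could be derived as in this minimal sketch. The helper `pass_rate` is hypothetical and the numbers are made up.

```python
def pass_rate(pass_count, num_tests):
    """Hypothetical helper: fraction of evaluated tasks that passed."""
    if num_tests <= 0:
        raise ValueError("num_tests must be positive")
    return pass_count / num_tests

# Example with made-up numbers, not real results.
print(f"{pass_rate(3, 4):.1%}")
```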
