## MuCoCo MCQ Inconsistency CodeMMLU Benchmark Testing

This notebook runs MuCoCo MCQ inconsistency experiments on the CodeMMLU benchmark.

In [None]:
import os
import sys
from dotenv import load_dotenv

In [None]:
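# Resolve the project root (two levels above the notebook's working directory)
# and add it to the import path so the project packages can be imported.
curr_dir = os.getcwd()
par_dir = os.path.dirname(curr_dir)
proj_dir = os.path.dirname(par_dir)
sys.path.append(proj_dir)
# Load environment variables (e.g. MONGODB_CODEMMLU_COLLECTION) from a .env file.
load_dotenv()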

In [None]:
from mcq_inconsistency.mcq_inconsistency_tester import LLMMCQInconsistencyTester
from mcq_inconsistency.prompt_templates.prompt_template import MCQInconsistencyPromptTemplate, ReasoningMCQInconsistencyPromptTemplate
from utility.constants import CodeMMLU, LexicalMutations, SyntacticMutations, LogicalMutations, PromptTypes, ReasoningModels, NonReasoningModels

In [None]:
## Declaring Prompt Type Constants
ZERO_SHOT = PromptTypes.ZERO_SHOT
ONE_SHOT = PromptTypes.ONE_SHOT
FEW_SHOT = PromptTypes.FEW_SHOT

## Declaring Benchmark Constants
CODEMMLU = CodeMMLU.NAME
CODEMMLU_TASK = CodeMMLU.Tasks.CODE_COMPLETION

## Declaring Mutation Constants
FOR2WHILE = SyntacticMutations.FOR2WHILE
FOR2ENUMERATE = SyntacticMutations.FOR2ENUMERATE

RANDOM_MUTATION = LexicalMutations.RANDOM
SEQUENTIAL_MUTATION = LexicalMutations.SEQUENTIAL
LITERAL_FORMAT = LexicalMutations.LITERAL_FORMAT

BOOLEAN_LITERAL = LogicalMutations.BOOLEAN_LITERAL
DEMORGAN = LogicalMutations.DEMORGAN
COMMUTATIVE_REORDER = LogicalMutations.COMMUTATIVE_REORDER
CONSTANT_UNFOLD = LogicalMutations.CONSTANT_UNFOLD
CONSTANT_UNFOLD_ADD = LogicalMutations.CONSTANT_UNFOLD_ADD
CONSTANT_UNFOLD_MULT = LogicalMutations.CONSTANT_UNFOLD_MULT

## Declaring Reasoning Model Name Constants
GPT5 = ReasoningModels.GPT5['name']

## Declaring Non-Reasoning Model Name Constants
GPT4O = NonReasoningModels.GPT4O['name']
CODESTRAL = NonReasoningModels.CODESTRAL['name']
DEEPSEEK = NonReasoningModels.DEEPSEEK_CHAT['name']


In [None]:
# Collect every public attribute of the model registries (each entry is a dict with a 'name' key).
reasoning_models = [getattr(ReasoningModels, model) for model in dir(ReasoningModels) if not model.startswith("_")]
non_reasoning_models = [getattr(NonReasoningModels, model) for model in dir(NonReasoningModels) if not model.startswith("_")]
print('Reasoning models supported by this framework are:')
for idx, model in enumerate(reasoning_models, start=1):
    print(f"{idx}: '{model['name']}'")
print('=' * 50)
print('Non-reasoning models supported by this framework are:')
for idx, model in enumerate(non_reasoning_models, start=1):
    print(f"{idx}: '{model['name']}'")

In [None]:
task_set = os.getenv("MONGODB_CODEMMLU_COLLECTION")
llmtester = LLMMCQInconsistencyTester(task_set)

In [None]:
num_tests = llmtester.question_database.count_documents({})

The `run_mcq_inconsistency_test` method runs MuCoCo MCQ inconsistency tests.

| Parameter            | Type        | Description |
| -------------------- | ----------- | ----------- |
| `prompt_helper`      | `str`       | Prompt template string for the chosen prompt type. Set the `prompt_type` variable to `ZERO_SHOT`, `ONE_SHOT`, or `FEW_SHOT` for the `CodeMMLU` benchmark. |
| `output_file_path`   | `str`       | Full path to the CSV where predictions and metrics are saved. In the example below, the filename is built from the task set, prompt type, and mutation tag, and the file is written to a per-model results directory. |
| `num_tests`          | `int`       | Number of test questions to evaluate. Pass the `num_tests` variable computed above to run every question in the collection. |
| `mutations`          | `List[str]` | Mutation operators to apply (e.g., `[FOR2WHILE]`, `[CONSTANT_UNFOLD]`). An empty list means **no_mutation**. |
| `model_name`         | `str`       | Identifier of the LLM under test (e.g., `GPT4O`). Used for routing and naming. |
| `task_set`           | `str`       | Only `CODEMMLU` for MCQ inconsistency. |
| `task_type`          | `str`       | Only `CODEMMLU_TASK` for MCQ inconsistency. |
| `continue_from_task` | `str`       | Optional. Starts evaluation from the given task ID, which corresponds to a task in MongoDB (e.g., `"CodeMMLUMCQ15"`). |

The following example runs an MCQ inconsistency test on all tasks in the CodeMMLU benchmark. To apply a mutation such as the random mutation, add the corresponding constant to the `mutations` list, e.g. `mutations = [RANDOM_MUTATION]`. The mutations available for MCQ inconsistency testing are declared as constants above.
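
As a rough illustration of how the `mutations` list feeds into the result filename, the sketch below mirrors the naming scheme used in the example cell. The string values assigned to `RANDOM_MUTATION` and `FOR2WHILE`, the `build_output_path` helper, and the sample `task_set` and `prompt_type` values are all hypothetical stand-ins; the real constants come from `utility.constants` and may have different values.

```python
import os

# Hypothetical stand-ins for the framework's mutation constants (assumed to be plain strings).
RANDOM_MUTATION = "random"
FOR2WHILE = "for2while"


def build_output_path(results_dir, task_set, prompt_type, mutations):
    """Hypothetical helper: join the mutation names into a tag and build the CSV path."""
    # An empty list falls back to the "no_mutation" tag, as in the example cell.
    mutation_str = "_".join(mutations) if mutations else "no_mutation"
    return os.path.join(results_dir, f"{task_set}_{prompt_type}_{mutation_str}.csv")


# Baseline run: no mutations applied.
baseline = build_output_path("results", "codemmlu", "few_shot", [])
# Mutated run: two mutations, joined with underscores in list order.
mutated = build_output_path("results", "codemmlu", "few_shot", [RANDOM_MUTATION, FOR2WHILE])

print(baseline)
print(mutated)
```

Because the tag is derived from the list, each mutation combination is written to its own CSV, so a baseline run and a mutated run should not overwrite each other.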

In [None]:
# %%script false --no-raise-error
mutations = []
prompt_type = FEW_SHOT
model_name = GPT4O
task_type = CODEMMLU_TASK

# Results are grouped into one directory per model.
results_dir = os.path.join(proj_dir, f'results/mcq_inconsistency/{model_name}')
os.makedirs(results_dir, exist_ok=True)

# Tag the output file with the applied mutations, or "no_mutation" for a baseline run.
mutation_str = "_".join(mutations) if mutations else "no_mutation"
output_file_path = f"{results_dir}/{task_set}_{prompt_type}_{mutation_str}.csv"

pass_count = llmtester.run_mcq_inconsistency_test(
    prompt_helper=ReasoningMCQInconsistencyPromptTemplate().return_appropriate_prompt(prompt_type=prompt_type),
    num_tests=num_tests,
    prompt_type=prompt_type,
    mutations=mutations,
    output_file_path=output_file_path,
    task_type=task_type,
    task_set=CODEMMLU,
    model_name=model_name,
)