## LLM Consistency Testing with Mistral LLM

This notebook contains code for testing code inconsistency in the Mistral LLM.

In [None]:
import os
import sys
from dotenv import load_dotenv

In [None]:
curr_dir = os.getcwd()
par_dir = os.path.dirname(curr_dir)
proj_dir = os.path.dirname(par_dir)
sys.path.append(proj_dir)
load_dotenv()

In [None]:
from prediction_inconsistency.prediction_inconsistency_tester import LLMConsistencyTester
from prediction_inconsistency.prompt_templates.prompt_template import PredictionInconsistencyPromptTemplate, ReasoningPredictionInconsistencyPromptTemplate
from utility.constants import Tasks, PromptTypes, LexicalMutations, SyntacticMutations, LogicalMutations, ReasoningModels, NonReasoningModels, CruxEval, HumanEval

# Declaring constants

In [None]:

## Declaring Task Type Constants
OUTPUT_PREDICTION = Tasks.OutputPrediction.NAME
INPUT_PREDICTION = Tasks.InputPrediction.NAME

## Declaring Benchmark Name Constants
CRUXEVAL = CruxEval.NAME
HUMANEVAL = HumanEval.NAME

## Declaring Prompt Type Constants
ZERO_SHOT = PromptTypes.ZERO_SHOT
ONE_SHOT = PromptTypes.ONE_SHOT

## Declaring Mutation Constants
FOR2WHILE = SyntacticMutations.FOR2WHILE
FOR2ENUMERATE = SyntacticMutations.FOR2ENUMERATE

RANDOM_MUTATION = LexicalMutations.RANDOM
SEQUENTIAL_MUTATION = LexicalMutations.SEQUENTIAL
LITERAL_FORMAT = LexicalMutations.LITERAL_FORMAT

BOOLEAN_LITERAL = LogicalMutations.BOOLEAN_LITERAL
DEMORGAN = LogicalMutations.DEMORGAN
COMMUTATIVE_REORDER = LogicalMutations.COMMUTATIVE_REORDER
CONSTANT_UNFOLD = LogicalMutations.CONSTANT_UNFOLD
CONSTANT_UNFOLD_ADD = LogicalMutations.CONSTANT_UNFOLD_ADD
CONSTANT_UNFOLD_MULT = LogicalMutations.CONSTANT_UNFOLD_MULT

## Declaring Reasoning Model Name Constants
GPT5 = ReasoningModels.GPT5['name']

## Declaring Non-Reasoning Model Name Constants
CODESTRAL = NonReasoningModels.CODESTRAL['name']
GPT4O = NonReasoningModels.GPT4O['name']
DEEPSEEK = NonReasoningModels.DEEPSEEK_CHAT['name']

In [None]:
task_set = CRUXEVAL
database_name = os.getenv('MONGODB_CRUXEVAL_COLLECTION')
llmtester = LLMConsistencyTester(database_name, n=5)

In [None]:
reasoning_models = [getattr(ReasoningModels, model) for model in dir(ReasoningModels) if not model.startswith("_")]
non_reasoning_models = [getattr(NonReasoningModels, model) for model in dir(NonReasoningModels) if not model.startswith("_")]
print('Reasoning models supported by this framework are:')
for idx, model in enumerate(reasoning_models):
    print(f"{idx+1}: '{model['name']}'")
print('=' * 50)
print('Non-reasoning models supported by this framework are:')
for idx, model in enumerate(non_reasoning_models):
    print(f"{idx+1}: '{model['name']}'")

In [None]:
num_tests = llmtester.question_database.count_documents({})

The `run_code_consistency_test` method runs prediction inconsistency tests on MuCoCo.

| Parameter            | Type        | Description |
| -------------------- | ----------- | ----------- |
| `prompt_helper`      | `str`       | String template for the appropriate prompt. Set the `prompt_type` variable to `ZERO_SHOT` for CruxEval. |
| `output_file_path`   | `str`       | Full path to the CSV where predictions and metrics are saved. In this notebook, the path is built from the task type, model, benchmark, prompt type, and mutation tag. |
| `num_tests`          | `int`       | Number of test questions to evaluate. Pass the `num_tests` variable computed above to evaluate all tasks in the benchmark. |
| `prompt_type`        | `str`       | Prompt type, such as `ZERO_SHOT` or `ONE_SHOT`. |
| `mutations`          | `List[str]` | Mutation operators to apply (e.g., `["FOR2WHILE"]`, `["CONSTANT_UNFOLD"]`). An empty list means **no_mutation**. |
| `model_name`         | `str`       | Identifier of the LLM under test (e.g., `GPT4O`). Used for routing and naming. |
| `task_set`           | `str`       | Benchmark to run, either `CRUXEVAL` or `HUMANEVAL`. |
| `task_type`          | `str`       | Either `OUTPUT_PREDICTION` or `INPUT_PREDICTION`. |
| `continue_from_task` | `str`       | Optional. Task ID, as stored in MongoDB, from which to start evaluation (e.g., `"CruxEvalTF15"`). |
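
The framework's resume logic is not shown in this notebook, so the sketch below only illustrates one plausible reading of `continue_from_task`: skip every task before the given ID and evaluate from that ID onward (inclusive). The helper `tasks_from` is hypothetical, and every task ID other than `"CruxEvalTF15"` is made up for illustration.

```python
from typing import Iterable, List, Optional


def tasks_from(task_ids: Iterable[str],
               continue_from_task: Optional[str] = None) -> List[str]:
    """Return the task IDs to evaluate, starting at continue_from_task.

    Hypothetical helper: the inclusive-start behaviour is an assumption,
    not the documented behaviour of run_code_consistency_test.
    """
    ids = list(task_ids)
    if continue_from_task is None:
        # No resume point given: evaluate everything.
        return ids
    if continue_from_task not in ids:
        raise ValueError(f"Unknown task ID: {continue_from_task!r}")
    # Keep the resume point itself and everything after it.
    return ids[ids.index(continue_from_task):]


# Made-up IDs in the style of the example from the table.
example_ids = ["CruxEvalTF14", "CruxEvalTF15", "CruxEvalTF16"]
remaining = tasks_from(example_ids, "CruxEvalTF15")
```

Raising on an unknown ID, rather than silently evaluating nothing, is a design choice of this sketch; it makes a mistyped task ID visible before a long run starts.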

The following example runs an input prediction inconsistency test on every task in the CruxEval benchmark. To apply a mutation such as random mutation, add the corresponding constant to the `mutations` list, for example `mutations = [RANDOM_MUTATION]`. The mutations available for prediction inconsistency testing are declared as constants above.

In [None]:
# %%script false --no-raise-error

mutations = []  # e.g. [RANDOM_MUTATION]; an empty list means no mutation
prompt_type = ZERO_SHOT
model_name = GPT4O
task_type = INPUT_PREDICTION

# Results are grouped by task type and model under <proj_dir>/results
results_dir = os.path.join(proj_dir, 'results', task_type, model_name)
os.makedirs(results_dir, exist_ok=True)

# Tag the output file with the applied mutations
mutation_str = "_".join(mutations) if len(mutations) > 0 else "no_mutation"
output_file_path = os.path.join(results_dir, f"{task_set}_{prompt_type}_{mutation_str}_nw1.csv")

pass_count = llmtester.run_code_consistency_test(
    prompt_helper=PredictionInconsistencyPromptTemplate.return_appropriate_prompt(task_type, prompt_type),
    num_tests=num_tests,
    prompt_type=prompt_type,
    mutations=mutations,
    output_file_path=output_file_path,
    task_set=task_set,
    task_type=task_type,
    model_name=model_name,
)
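
The file-naming logic in the cell above can be isolated into a small helper, which makes it easier to see which CSV a given `mutations` list will produce. This is a minimal sketch: `mutation_tag` and `build_output_path` are hypothetical names, and the string values passed in the example are placeholders, since the actual values of constants such as `CRUXEVAL`, `ZERO_SHOT`, and `RANDOM_MUTATION` are defined in `utility.constants` and not shown here.

```python
import os
from typing import List


def mutation_tag(mutations: List[str]) -> str:
    """Join mutation names with underscores; an empty list maps to 'no_mutation'."""
    return "_".join(mutations) if mutations else "no_mutation"


def build_output_path(results_dir: str, task_set: str,
                      prompt_type: str, mutations: List[str]) -> str:
    """Mirror the notebook's pattern: <task_set>_<prompt_type>_<tag>_nw1.csv."""
    filename = f"{task_set}_{prompt_type}_{mutation_tag(mutations)}_nw1.csv"
    return os.path.join(results_dir, filename)


# Placeholder values; the real constants may hold different strings.
example_path = build_output_path("results", "cruxeval", "zero_shot", ["random"])
print(example_path)
```

Because the tag is derived only from the `mutations` list, two runs with the same benchmark, prompt type, and mutations write to the same file name within a given results directory.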