## 🔄 Workflow Overview: From Data to Cost Estimation

This notebook follows a structured pipeline that preprocesses the benchmark data and estimates the total cost of evaluating it with an LLM. The major steps are:

### 1. 📦 Data Preprocessing
- Unzip and load the provided benchmark dataset.
- Parse and extract relevant information such as question text, source code, and mutant variants.
- Organize the data into structured mappings for easy access during prompt construction.

### 2. 🧬 Source–Mutant Mapping
- Identify and map **original source programs** to their corresponding **mutants**.
- Ensure proper alignment between source and mutated code for consistent evaluation.
- Filter or clean mappings if needed (e.g., to remove duplicates or invalid samples).

### 3. ✏️ Prompt Construction
- Generate evaluation prompts using both the **source** and **mutant** code.
- Follow a consistent prompt template to ensure fair comparison across examples.
- Handle few-shot or zero-shot formatting if required by the model.

### 4. 🔢 Token Counting
- Calculate the number of tokens per prompt and completion using the target LLM’s tokenizer.
- Aggregate token counts to compute total tokens for all examples.

### 5. 💸 Cost Estimation
- Divide the total token count by 1,000 and multiply by the model's **price per 1,000 tokens**.
- Compute costs separately for each model tier (GPT-4 8k, GPT-4 32k, and GPT-4 Turbo).
- Provide a breakdown of token and cost statistics across the dataset.

---

This step-by-step process ensures accurate cost estimation while maintaining transparency in how data is transformed and fed into the language model.


### 📁 Define Working and Data Directories

We start by setting up the file system paths required for processing the benchmark data.

- `working_directory`: The root path of the local NTS Evaluation repository.
- `code_extracted_folder_name`: The name of the folder containing the unzipped benchmark files.
- `data_folder`: The full path to the extracted data, used for loading and preprocessing source and mutant files.

These variables will be used throughout the notebook to load and manipulate the dataset.


In [1]:
working_directory = '/Users/pritamwork/Documents/GitHub/LLM-Verification'
code_extracted_folder_name = 'extracted'
data_folder = working_directory + '/' + code_extracted_folder_name

### 🗂️ Extract Project Names from Folder Structure

Next, we list the contents of the extracted data folder to identify all available benchmark projects.

- `os.listdir(data_folder)`: Lists every entry (files as well as subfolders) in the extracted data directory.
- These subfolders typically follow a naming pattern such as:
  - `<project>_source`
  - `<project>_mutants`
  - `<project>_test_suite`

To isolate the base project names, we:
- Strip suffixes like `_source`, `_mutants`, and `_test_suite`
- Store the cleaned names in a `set` to ensure uniqueness

This gives us a set of the distinct project names included in the benchmark suite.
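Chained `str.replace` calls remove a suffix wherever it occurs in the name, and `os.listdir` also returns stray files such as `.DS_Store`. The sketch below shows one stricter alternative that strips only a trailing suffix and skips anything else. The helper name `project_name` and the sample folder names are hypothetical and are not part of the benchmark.

```python
# Suffixes used by the benchmark folder layout described above.
SUFFIXES = ("_test_suite", "_source", "_mutants")


def project_name(folder_name):
    """Return the base project name, or None if no known suffix matches."""
    for suffix in SUFFIXES:
        if folder_name.endswith(suffix):
            return folder_name[: -len(suffix)]
    return None


# Hypothetical directory listing, including a name that contains "_source"
# in the middle and a stray macOS metadata file.
sample_folders = [
    "cfg_test_source",
    "cfg_test_mutants",
    "cfg_test_test_suite",
    "open_source_lib_mutants",
    ".DS_Store",
]

names = {name for name in map(project_name, sample_folders) if name is not None}
print(sorted(names))
```

With the chained `replace` approach, the hypothetical `open_source_lib_mutants` would lose its inner `_source` as well; matching only at the end of the name avoids that.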


In [2]:
import os

folders = os.listdir(data_folder)
folders
project_names = set([name.replace('_test_suite','').replace('_source','').replace('_mutants','') for name in folders])
project_names

{'Problem1',
 'Problem2',
 'Problem3',
 'adpcm',
 'cfg_test',
 'elevator',
 'merge2BSTree',
 'nextDate1',
 'nsichneu',
 'quicksort'}

### 🧬 Map Source Files to Mutant Variants

We now construct a dictionary that maps each project to its original source file and all corresponding mutant files.

#### Code Breakdown:
- For each project name:
  - Locate the original source file inside the `<project>_source/<project>_source/` directory.
  - List all mutant versions inside `<project>_mutants/<project>_mutants/`.
  - For each mutant version, create a full path to the corresponding mutant file (which shares the same filename as the source).
- Store this information in a dictionary `files_map` using the following structure:

```python
files_map = {
    "project_name": (
        "path/to/source/file",
        ["path/to/mutant1", "path/to/mutant2", ...]
    ),
    ...
}
```


In [3]:
files_map = {}
for name in project_names:
    source_dir = os.path.join(data_folder, name + '_source', name + '_source')
    mutants_dir = os.path.join(data_folder, name + '_mutants', name + '_mutants')
    # The first entry in the source folder is taken as the source file;
    # every mutant version reuses that filename.
    source_file_name = os.listdir(source_dir)[0]
    mutant_locations = [
        os.path.join(mutants_dir, version, source_file_name)
        for version in os.listdir(mutants_dir)
    ]
    files_map[name] = (os.path.join(source_dir, source_file_name), mutant_locations)

### 📏 File Size Helper Function

We define a utility function that returns the size of a file in bytes, or `0` if the path does not point to an existing regular file:


In [4]:
def get_file_size(path):
    return os.path.getsize(path) if os.path.isfile(path) else 0


### 📦 Compute Combined File Sizes for Source–Mutant Pairs

We now construct a mapping called `size_map` that records the combined size (in bytes) of each source–mutant file pair.

#### Logic:
- Iterate over each project in `files_map`.
- For every mutant file associated with the project:
  - Use `get_file_size()` to get the size of both the source file and the mutant file.
  - Sum the sizes to get the total file size for that pair.
  - Use the last two parts of the mutant path (i.e., `mutant_version/filename`) as the key for easy identification.

#### Resulting Structure:
```python
size_map = {
    "mutant_version/filename": (
        "path/to/source/file",
        "path/to/mutant/file",
        combined_file_size_in_bytes
    ),
    ...
}
```


In [5]:
size_map = {}
for key in files_map:
    file_tuple = files_map[key]
    for mutant in file_tuple[1]:
        size_map.setdefault(str('/'.join(mutant.split('/')[-2:])), (file_tuple[0] , mutant, int(get_file_size(file_tuple[0]) + get_file_size(mutant))))
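The `mutant_version/filename` key is unique only while no two projects share a source filename, and `setdefault` keeps the first entry silently when a key repeats. The sketch below illustrates that behavior and one possible project-qualified key. The helper `pair_key` and the paths are hypothetical and do not come from the benchmark.

```python
def pair_key(project, mutant_path):
    """Build a key such as 'projA/v1/main.c' from a project name and a mutant path."""
    version, filename = mutant_path.split("/")[-2:]
    return f"{project}/{version}/{filename}"


# Hypothetical paths: two projects that happen to share a file name.
paths = [
    ("projA", "data/projA_mutants/v1/main.c"),
    ("projB", "data/projB_mutants/v1/main.c"),
]

short_keys = {}
full_keys = {}
for project, path in paths:
    # Same key for both projects, so the second entry is silently dropped.
    short_keys.setdefault("/".join(path.split("/")[-2:]), path)
    # The project-qualified key keeps both entries.
    full_keys.setdefault(pair_key(project, path), path)

print(len(short_keys), len(full_keys))
```

Comparing `len(size_map)` with the total number of mutant paths in `files_map` is a quick way to check whether any pair was dropped.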

In [6]:
size_map

{'v6/merge_2_bst.c': ('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/merge2BSTree_source/merge2BSTree_source/merge_2_bst.c',
  '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/merge2BSTree_mutants/merge2BSTree_mutants/v6/merge_2_bst.c',
  9711),
 'v1/merge_2_bst.c': ('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/merge2BSTree_source/merge2BSTree_source/merge_2_bst.c',
  '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/merge2BSTree_mutants/merge2BSTree_mutants/v1/merge_2_bst.c',
  9710),
 'v8/merge_2_bst.c': ('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/merge2BSTree_source/merge2BSTree_source/merge_2_bst.c',
  '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/merge2BSTree_mutants/merge2BSTree_mutants/v8/merge_2_bst.c',
  9712),
 'v7/merge_2_bst.c': ('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/merge2BSTree_source/merge2BSTree_source/merge_2_bst.c',
  '/Users/pritamwork/Documents/

### 📊 Sorted Source–Mutant Pairs by File Size

We sort all source–mutant pairs in ascending order based on their combined file size. This allows us to:

- Prioritize smaller examples for quicker processing or testing
- Analyze the distribution of file sizes across the dataset
- Optionally filter large files if needed for cost control or model limits
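As a minimal sketch of the optional filtering step, the function below sorts entries shaped like `size_map` values and keeps only pairs at or under a byte limit. The function name `pairs_under_limit`, the sample entries, and the 10,000-byte cutoff are illustrative assumptions.

```python
# Hypothetical sample shaped like `size_map`:
# key -> (source path, mutant path, combined size in bytes).
sample_size_map = {
    "v1/big.c": ("src/big.c", "mut/v1/big.c", 50_000),
    "v1/small.c": ("src/small.c", "mut/v1/small.c", 3_000),
    "v2/small.c": ("src/small.c", "mut/v2/small.c", 3_002),
}


def pairs_under_limit(size_map, max_bytes):
    """Return (key, entry) pairs sorted by combined size, keeping those <= max_bytes."""
    ordered = sorted(size_map.items(), key=lambda item: item[1][2])
    return [(key, entry) for key, entry in ordered if entry[2] <= max_bytes]


small_pairs = pairs_under_limit(sample_size_map, max_bytes=10_000)
print([key for key, _ in small_pairs])
```

File size in bytes is only a rough proxy for prompt length; the token counts computed later in the notebook are the better basis for a hard cutoff.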


In [7]:
sorted_by_size_asc = sorted(size_map.items(), key=lambda item: item[1][2])
sorted_by_size_asc

[('v9/cfg_test.c',
  ('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c',
   '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mutants/v9/cfg_test.c',
   3108)),
 ('v24/cfg_test.c',
  ('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c',
   '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mutants/v24/cfg_test.c',
   3108)),
 ('v13/cfg_test.c',
  ('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c',
   '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mutants/v13/cfg_test.c',
   3108)),
 ('v4/cfg_test.c',
  ('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c',
   '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mut

### 🔢 Total Number of Source–Mutant Pairs

We calculate the total number of source–mutant pairs available for evaluation after mapping and sorting. This gives us the number of distinct test cases that will be used in prompt generation and cost estimation.


In [8]:
len(sorted_by_size_asc)

119

### 📂 Extract Sorted File Tuples

We extract only the file path tuples (source path, mutant path, combined size) from the sorted source–mutant mapping.

This creates a clean list `sorted_tuples_list` that preserves ascending order by file size and is ready for prompt construction and token analysis.


In [9]:
sorted_tuples_list = [entry for _, entry in sorted_by_size_asc]

In [10]:
sorted_tuples_list

[('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c',
  '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mutants/v9/cfg_test.c',
  3108),
 ('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c',
  '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mutants/v24/cfg_test.c',
  3108),
 ('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c',
  '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mutants/v13/cfg_test.c',
  3108),
 ('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c',
  '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mutants/v4/cfg_test.c',
  3108),
 ('/Users/pritamwork/Documents/GitHub/LLM-Verification/extract

### 📝 Prompt Generation for Functional Comparison

We define a function `generate_prompt` to create structured input prompts for the LLM. Each prompt presents:

- The original **source code**
- Its corresponding **mutated version**
- A clear and constrained instruction asking the model to determine if there is any **functional difference** between the two

#### Prompt Structure:
- **System message**: Defines the LLM’s role as a precise coding assistant.
- **User message**: Provides both code snippets and instructs the model to respond with only `"Yes"` or `"No"`, without any explanation.

This strict format helps standardize responses and simplifies downstream evaluation.
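To illustrate how a Yes/No format simplifies downstream evaluation, here is a minimal sketch of a reply parser. The name `parse_verdict` and its tolerance for surrounding quotes and punctuation are assumptions; the notebook does not define how replies are scored.

```python
def parse_verdict(reply):
    """Map a model reply to True (functional difference), False, or None if unparseable."""
    # Tolerate surrounding whitespace, quotes, and trailing punctuation.
    cleaned = reply.strip().strip("\"'.!").lower()
    if cleaned == "yes":
        return True
    if cleaned == "no":
        return False
    return None


print(parse_verdict(" Yes."))
print(parse_verdict('"No"'))
print(parse_verdict("Yes, because the loop bound changed."))
```

Returning `None` for anything other than a bare answer makes replies that ignore the instruction easy to count separately instead of being misread as a verdict.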


In [11]:
def generate_prompt(source, mutant):
    """
    Generates a structured prompt for a language model to compare the functionality
    of source and mutant code snippets.

    Parameters:
        source (str): The original source code.
        mutant (str): The mutated version of the code.

    Returns:
        list: A list of message dictionaries formatted for LLM input.
    """
    message = [
        {
            "role": "system",
            "content": (
                "You are a helpful and precise coding assistant. Your job is to identify "
                "whether two code snippets behave differently in terms of functionality."
            )
        },
        {
            "role": "user",
            "content": (
                f"Here is the original (source) code:\n\n{source}\n\n"
                f"And here is the mutated version of the code:\n\n{mutant}\n\n"
                "Do these two pieces of code have any **functional difference**? "
                "Respond strictly with \"Yes\" or \"No\" only. Do not explain your answer."
            )
        }
    ]
    return message

    

### 🧹 Read and Clean Source Code Files

The `read_file_clean` function is used to read C source code files and clean them for use in prompt generation. It performs the following operations:

#### 🛠️ Functionality:
- Reads a source or mutant file from the given path.
- Removes:
  - Block comments (`/* ... */`)
  - Line comments (`// ...`)
  - Tabs, newlines, and excessive whitespace
- Returns a **compact, single-line** version of the code that's easier to embed in prompts.

#### 🧾 Error Handling:
- Raises `FileNotFoundError` if the file doesn't exist.
- Raises `PermissionError` if the file cannot be accessed.
- Raises `IOError` for any other I/O-related issues.

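This gives the LLM compact, consistently formatted input. Two caveats apply: the regular expressions do not recognize string literals, so a `//` or `/*` inside a string is treated as the start of a comment, and joining all lines places preprocessor directives such as `#include` on the same line as the code that follows, so the cleaned text is no longer compilable C.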
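A stricter cleaning step can be sketched as follows: match string and character literals alongside comments so that only real comments are replaced, and collapse whitespace per line so preprocessor directives stay on their own lines. The names `strip_c_comments` and `compact_keep_lines` are hypothetical; this is an alternative sketch, not the cleaning used for the token counts in this notebook.

```python
import re

# Comments and literals are matched together; the leftmost match wins,
# so a "//" inside a string literal is consumed as part of the string.
_TOKEN = re.compile(
    r"""
    //[^\n]*                 # line comment
  | /\*.*?\*/                # block comment
  | "(?:\\.|[^"\\])*"        # string literal
  | '(?:\\.|[^'\\])*'        # character literal
    """,
    re.DOTALL | re.VERBOSE,
)


def strip_c_comments(code):
    """Replace each C comment with one space and leave literals untouched."""
    def _replace(match):
        text = match.group(0)
        return " " if text.startswith("/") else text
    return _TOKEN.sub(_replace, code)


def compact_keep_lines(code):
    """Strip comments, collapse whitespace within each line, and drop empty lines."""
    lines = (" ".join(line.split()) for line in strip_c_comments(code).splitlines())
    return "\n".join(line for line in lines if line)


sample = '#include <stdio.h>\n\n\tint  main() { puts("http://x"); /* note */ return 0; } // done\n'
print(compact_keep_lines(sample))
```

Keeping line breaks costs a few extra tokens per file, so token totals computed on this output would differ slightly from the single-line version.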


In [12]:
import os
import re

def read_file_clean(location):
    """
    Reads a C source code file, removes comments, newlines, and tabs,
    and returns compact code suitable for functional comparison.

    Parameters:
        location (str): Path to the C source code file.

    Returns:
        str: Cleaned and compact code.

    Raises:
        FileNotFoundError: If the file does not exist.
        PermissionError: If reading is not allowed.
        IOError: For other I/O issues.
    """
    if not os.path.isfile(location):
        raise FileNotFoundError(f"The file at '{location}' does not exist.")

    try:
        with open(location, 'r', encoding='utf-8') as file:
            code = file.read()

        # Remove block comments (/* ... */)
        code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)

        # Remove line comments (// ...)
        code = re.sub(r'//.*', '', code)

        # Remove tabs, newlines, and multiple spaces
        code = code.replace('\n', ' ').replace('\t', ' ')
        code = re.sub(r'\s+', ' ', code).strip()

        return code

    except PermissionError:
        raise PermissionError(f"Permission denied while reading '{location}'.")
    except IOError as e:
        raise IOError(f"An I/O error occurred while reading '{location}': {e}")


### 🔍 Example Prompt Preview

We generate and display a sample prompt by selecting the smallest source–mutant pair (based on file size). This example helps verify:

- That the source and mutant files are read and cleaned correctly
- The prompt structure conforms to expectations
- The LLM will receive clean and consistent input

This preview is useful for debugging, prompt validation, and ensuring alignment before scaling to all examples.


In [13]:
print(generate_prompt(read_file_clean(sorted_tuples_list[0][0]),read_file_clean(sorted_tuples_list[0][1])))

[{'role': 'system', 'content': 'You are a helpful and precise coding assistant. Your job is to identify whether two code snippets behave differently in terms of functionality.'}, {'role': 'user', 'content': 'Here is the original (source) code:\n\n#include <stdio.h> #include <stdlib.h> void f(int); void g(int); void h(int); void i(int); void f(int a) { if (a > 13) { printf("\\ngreater than 13\\n"); } else { printf("\\nnot greater than 13\\n"); } } void g(int a) { h(a); if (a == 7) { printf("\\n7\\n"); } else { printf("\\nnot 7\\n"); } i(a); } void h(int a) { if (a == -4) { printf("\\n-4\\n"); } else { printf("\\nnot -4\\n"); } } void i(int a) { if (a == 100) { printf("\\n100\\n"); } else { printf("\\nnot 100\\n"); } } int main(int argc, int* argv[]) { int a; a=atoi(argv[1]); if (a == 19) { printf("\\n19\\n"); } else { printf("\\nnot 19\\n"); } if (a> 5){ printf("\\nThe value of is greater than 5"); printf("\\nThe value of a is %d", a); f(a); } if (a< 5){ printf("\\nThe value of is less 

### 📦 Install Tokenizer Library (`tiktoken`)

We install `tiktoken`, OpenAI's open-source tokenizer library for models such as GPT-3.5 and GPT-4.

This library allows us to:
- Tokenize prompts and completions with the same encoding the model uses
- Estimate the number of tokens used per LLM call
- Calculate costs based on model-specific pricing per 1,000 tokens

In [15]:
!pip install tiktoken


In [16]:
import tiktoken

### 🔢 Estimate Token Count per Prompt

The `estimate_tokens` function calculates the number of tokens in a given text prompt using the tokenizer appropriate for the selected LLM model.

#### 📌 Function Details:
- **Inputs**:
  - `prompt`: The input text to tokenize.
  - `model`: The OpenAI model name (e.g., `"gpt-4"`, `"gpt-3.5-turbo"`).
- **Behavior**:
  - Uses the `tiktoken` library to load the tokenizer corresponding to the model.
  - Falls back to the default encoding (`cl100k_base`) if the model is not recognized.
- **Output**:
  - Returns the **estimated number of tokens** used by the prompt.

Accurate token estimation is essential for calculating inference cost and ensuring input stays within the model's context limit.


In [17]:
def estimate_tokens(prompt: str, model: str = "gpt-4") -> int:
    """
    Estimate the number of tokens for a given prompt and model.

    Args:
        prompt (str): The input text or prompt to tokenize.
        model (str): The OpenAI model name (e.g., "gpt-3.5-turbo", "gpt-4", etc.).

    Returns:
        int: Estimated number of tokens.
    """
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print(f"Model '{model}' not found. Using default encoding (cl100k_base).")
        encoding = tiktoken.get_encoding("cl100k_base")

    tokens = encoding.encode(prompt)
    return len(tokens)

### 🧪 Sample Token Estimation

We select a source–mutant pair (the smallest one) and generate a prompt to test token estimation for a specific LLM.

- The prompt is generated using `generate_prompt()`
- Cleaned source and mutant code are passed as input
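- The resulting prompt is tokenized using `estimate_tokens()` for the specified model (`gpt-4`). Because the message list is first converted with `str()`, the count also includes the list and dictionary punctuation, the `role` and `content` keys, and escaped characters, so it approximates the billed prompt size rather than reproducing it.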

This serves as a sanity check to verify:
- The tokenizer is functioning correctly
- The prompt is well-formed
- The estimated token count falls within the expected range

This step is critical before scaling token and cost estimation to the full dataset.
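An alternative to tokenizing the string form of the whole message list is to count the role and content of each message and add a fixed per-message overhead. The sketch below does this with a pluggable `encode` callable. The function name `count_message_tokens` is hypothetical, and the overhead values (3 tokens per message, 3 for reply priming) are assumptions based on commonly cited guidance for GPT-3.5/GPT-4 chat formats; they may differ between model versions.

```python
def count_message_tokens(messages, encode, tokens_per_message=3, reply_priming=3):
    """Approximate the prompt tokens of a chat request.

    `encode` is any callable that turns a string into a sequence of tokens.
    The two overhead defaults are assumptions and may vary by model version.
    """
    total = reply_priming
    for message in messages:
        total += tokens_per_message
        total += len(encode(message["role"])) + len(encode(message["content"]))
    return total


messages = [
    {"role": "system", "content": "You are a precise coding assistant."},
    {"role": "user", "content": "Do these snippets differ? Answer Yes or No."},
]

# Whitespace splitting is a stand-in tokenizer so the sketch runs without tiktoken.
approx = count_message_tokens(messages, encode=str.split)
print(approx)
```

In the notebook, the `encode` method of the encoding returned by `tiktoken.encoding_for_model("gpt-4")` could be passed in place of `str.split`.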


In [18]:
model_name = "gpt-4"
prompt_text = str(generate_prompt(read_file_clean(sorted_tuples_list[0][0]),read_file_clean(sorted_tuples_list[0][1])))
token_count = estimate_tokens(prompt_text, model=model_name)
print(f"Estimated tokens for model '{model_name}': {token_count}")

Estimated tokens for model 'gpt-4': 844


### 📊 Bulk Token Estimation Across All Source–Mutant Pairs

We now estimate the total token usage for evaluating **all** source–mutant pairs using the specified model (`gpt-4`).

#### 🧮 Workflow:
- Iterate through all entries in `sorted_tuples_list`
- For each pair:
  - Read and clean the source and mutant files
  - Generate a prompt using `generate_prompt()`
  - Estimate the number of tokens in the prompt using `estimate_tokens()`
- Store individual token counts in `prompt_size_map`, using the mutant identifier as the key
- Accumulate the total token count in `total_tokens`

This step provides the basis for calculating the **overall cost** of evaluating the full benchmark dataset with the selected language model.


In [19]:
total_tokens = 0
prompt_size_map = {}
model_name = "gpt-4"
for val_tuple in sorted_tuples_list:
    print(val_tuple)
    prompt_text = str(generate_prompt(read_file_clean(val_tuple[0]),read_file_clean(val_tuple[1])))
    token_count = estimate_tokens(prompt_text, model=model_name)
    prompt_size_map.setdefault('/'.join(val_tuple[1].split('/')[-2:]) , token_count)
    total_tokens += token_count

('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c', '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mutants/v9/cfg_test.c', 3108)
('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c', '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mutants/v24/cfg_test.c', 3108)
('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c', '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mutants/v13/cfg_test.c', 3108)
('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c', '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mutants/v4/cfg_test.c', 3108)
('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_te

In [20]:
prompt_size_map

{'v9/cfg_test.c': 844,
 'v24/cfg_test.c': 844,
 'v13/cfg_test.c': 845,
 'v4/cfg_test.c': 845,
 'v21/cfg_test.c': 844,
 'v17/cfg_test.c': 844,
 'v6/cfg_test.c': 846,
 'v15/cfg_test.c': 846,
 'v23/cfg_test.c': 846,
 'v22/cfg_test.c': 846,
 'v14/cfg_test.c': 846,
 'v2/cfg_test.c': 846,
 'v5/cfg_test.c': 846,
 'v11/cfg_test.c': 846,
 'v18/cfg_test.c': 846,
 'v19/cfg_test.c': 846,
 'v10/cfg_test.c': 846,
 'v1/cfg_test.c': 846,
 'v7/cfg_test.c': 846,
 'v8/cfg_test.c': 848,
 'v12/cfg_test.c': 847,
 'v3/cfg_test.c': 847,
 'v16/cfg_test.c': 848,
 'v20/cfg_test.c': 848,
 'v3/quicksort.c': 1090,
 'v6/quicksort.c': 1090,
 'v1/quicksort.c': 1090,
 'v2/quicksort.c': 1090,
 'v4/quicksort.c': 1090,
 'v5/quicksort.c': 1092,
 'v1/elevator.c': 1462,
 'v8/elevator.c': 1462,
 'v2/elevator.c': 1462,
 'v5/elevator.c': 1462,
 'v4/elevator.c': 1462,
 'v7/elevator.c': 1463,
 'v3/elevator.c': 1462,
 'v6/elevator.c': 1464,
 'v6/nextdate1.c': 1673,
 'v10/nextdate1.c': 1676,
 'v1/nextdate1.c': 1676,
 'v8/nextdate1.

In [21]:
total_tokens

2247172

### 📦 Install `pandas` Library

We install the `pandas` library, which is widely used for:

- Organizing and analyzing tabular data
- Creating structured DataFrames to store token counts, costs, and metadata
- Exporting results to CSV or Excel for reporting

This will help us visualize and manage the token usage and cost data collected across the benchmark examples.


In [22]:
!pip install pandas



### 🧾 Create DataFrame of Token Counts

We convert the `prompt_size_map` dictionary into a structured pandas DataFrame for easier analysis and visualization.

#### Columns:
- **`mutant`**: Identifier for each mutant (e.g., `v9/cfg_test.c`)
- **`prompt tokens`**: Estimated number of tokens used for the generated prompt

This table allows us to:
- Sort and filter examples by token size
- Identify outliers or unusually large prompts
- Use the token data for detailed cost calculation and reporting


In [23]:
import pandas as pd
prompt_size_df = pd.DataFrame(prompt_size_map.items(), columns=['mutant', 'prompt tokens'])
prompt_size_df


|     | mutant | prompt tokens |
|-----|--------|---------------|
| 0 | v9/cfg_test.c | 844 |
| 1 | v24/cfg_test.c | 844 |
| 2 | v13/cfg_test.c | 845 |
| 3 | v4/cfg_test.c | 845 |
| 4 | v21/cfg_test.c | 844 |
| ... | ... | ... |
| 114 | v1/Problem3.c | 176234 |
| 115 | v2/Problem3.c | 176236 |
| 116 | v5/Problem3.c | 176236 |
| 117 | v4/Problem3.c | 176236 |


### ✅ Feasibility Analysis by Model Token Limits

We evaluate how many NTS benchmark examples are feasible to run on different variants of GPT-4, based on their token limits:

- **`GPT-4 (8k)`**: Supports up to **8,192 tokens**
- **`GPT-4 (32k)`**: Supports up to **32,768 tokens**
- **`GPT-4 Turbo`**: Supports up to **128,000 tokens**
- **Examples exceeding 128k tokens** are considered **not feasible** for current GPT-4 models

#### 📊 Breakdown:
- `no_NTS_example_feasible_for_GPT4`: Number of prompts within the 8k token limit
- `no_NTS_example_feasible_for_GPT4_32k`: Prompts between 8k and 32k tokens
- `no_NTS_example_feasible_for_GPT4_turbo`: Prompts between 32k and 128k tokens
- `no_NTS_example_not_feasible`: Prompts exceeding 128k tokens

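This analysis shows which examples fit each model configuration and informs the strategy for cost-effective scaling. Because the context window covers both the prompt and the completion, a prompt that sits exactly at a limit leaves no room for the reply; with one-word answers the margin needed is small.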
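The tier assignment can also be expressed as a single lookup. The sketch below uses `bisect_left` over the three context limits and optionally reserves room for the completion. The function name `smallest_tier` and the `reserved_output_tokens` parameter are hypothetical; the sample counts are taken from the token table shown earlier in this notebook.

```python
from bisect import bisect_left
from collections import Counter

# Context limits from the list above, in ascending order.
LIMITS = [8_192, 32_768, 128_000]
LABELS = ["GPT-4 (8k)", "GPT-4 (32k)", "GPT-4 Turbo (128k)", "not feasible"]


def smallest_tier(prompt_tokens, reserved_output_tokens=0):
    """Return the smallest tier whose limit covers the prompt plus reserved output."""
    needed = prompt_tokens + reserved_output_tokens
    # bisect_left returns the index of the first limit >= needed.
    return LABELS[bisect_left(LIMITS, needed)]


# A few prompt sizes that appear in the notebook output.
sample_counts = [844, 7_028, 10_350, 21_346, 176_234]
tier_counts = Counter(smallest_tier(count) for count in sample_counts)
print(tier_counts)
```

Applying the same function to every value in `prompt_size_map` with `reserved_output_tokens=0` should reproduce the four counts computed in the next cell.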


In [24]:
no_NTS_example_feasible_for_GPT4 = len(prompt_size_df[prompt_size_df['prompt tokens'] <= 8192])
no_NTS_example_feasible_for_GPT4_32k = len(prompt_size_df[(prompt_size_df['prompt tokens'] > 8192) & (prompt_size_df['prompt tokens'] <= 32768)])
no_NTS_example_feasible_for_GPT4_turbo = len(prompt_size_df[(prompt_size_df['prompt tokens'] > 32768) & (prompt_size_df['prompt tokens'] <= 128000)])
no_NTS_example_not_feasible = len(prompt_size_df[prompt_size_df['prompt tokens'] > 128000])
print(f"Examples feasible for GPT-4 (8k context): {no_NTS_example_feasible_for_GPT4}")
print(f"Examples feasible for GPT-4 (32k context): {no_NTS_example_feasible_for_GPT4_32k}")
print(f"Examples feasible for GPT-4 Turbo (128k context): {no_NTS_example_feasible_for_GPT4_turbo}")
print(f"Examples not feasible for any GPT-4 model (>128k tokens): {no_NTS_example_not_feasible}")

Examples feasible for GPT-4 (8k context): 84
Examples feasible for GPT-4 (32k context): 18
Examples feasible for GPT-4 Turbo (128k context): 12
Examples not feasible for any GPT-4 model (>128k tokens): 5


### 📊 Total Token Count by GPT-4 Model Context

We calculate the total number of tokens required for evaluating all feasible NTS benchmark examples, grouped by model context window:

- **GPT-4 (8k)**: Examples with token count ≤ 8,192
- **GPT-4 (32k)**: Token count between 8,193 and 32,768
- **GPT-4 Turbo (128k)**: Token count between 32,769 and 128,000

This breakdown helps estimate total usage and informs cost modeling based on the number of tokens processed within each model tier.


In [25]:
# Compute total tokens by category
tokens_GPT4_8k = sum(prompt_size_df[prompt_size_df['prompt tokens'] <= 8192]['prompt tokens'])
tokens_GPT4_32k = sum(prompt_size_df[(prompt_size_df['prompt tokens'] > 8192) & (prompt_size_df['prompt tokens'] <= 32768)]['prompt tokens'])
tokens_GPT4_Turbo = sum(prompt_size_df[(prompt_size_df['prompt tokens'] > 32768) & (prompt_size_df['prompt tokens'] <= 128000)]['prompt tokens'])

# Print the results
print(f" Total tokens for GPT-4 (8k context): {tokens_GPT4_8k}")
print(f" Total tokens for GPT-4 (32k context): {tokens_GPT4_32k}")
print(f" Total tokens for GPT-4 Turbo (128k context): {tokens_GPT4_Turbo}")

 Total tokens for GPT-4 (8k context): 244968
 Total tokens for GPT-4 (32k context): 296264
 Total tokens for GPT-4 Turbo (128k context): 824763


## Conclusion: Token Usage & Cost Estimation Summary

After processing the NTS benchmark examples, we summarize the token usage and estimated evaluation cost for each GPT-4 model variant below.

### Estimated Token Usage and Cost:

| GPT-4 Model Variant         | Token Limit | Total Tokens Used | Cost per 1K Tokens (USD) | Estimated Cost (USD) |
|-----------------------------|-------------|--------------------|---------------------------|-----------------------|
| GPT-4 (8k context)          | 8,192       | 244,968            | $0.030                    | $7.35                 |
| GPT-4 (32k context)         | 32,768      | 296,264            | $0.060                    | $17.78                |
| GPT-4 Turbo (128k context)  | 128,000     | 824,763            | $0.010                    | $8.25                 |
| Total                       | –           | 1,365,995          | –                         | $33.38                |

>  **Cost Calculation**:  
> Estimated cost is calculated as:  
> `Cost = (Total Tokens / 1,000) × Cost per 1K Tokens`

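> **Pricing Reference (as of 2024)**, input (prompt) token prices:
> - GPT-4 (8k context): $0.030 per 1K tokens  
> - GPT-4 (32k context): $0.060 per 1K tokens  
> - GPT-4 Turbo (128k context): $0.010 per 1K tokens
>
> Completion tokens are billed separately. Since each reply is a single "Yes" or "No", they add little and are not included here.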
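The table figures can be reproduced with a few lines of arithmetic. This sketch uses `Decimal` to avoid floating-point rounding surprises; the token totals and prices are the ones listed above, while the names `TIERS` and `tier_cost` are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# (total prompt tokens, USD price per 1K tokens) for each tier, from the table above.
TIERS = {
    "GPT-4 (8k context)": (244_968, Decimal("0.030")),
    "GPT-4 (32k context)": (296_264, Decimal("0.060")),
    "GPT-4 Turbo (128k context)": (824_763, Decimal("0.010")),
}


def tier_cost(tokens, price_per_1k):
    """Cost = (tokens / 1,000) x price per 1K tokens, rounded to cents."""
    cost = Decimal(tokens) / 1000 * price_per_1k
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


costs = {name: tier_cost(tokens, price) for name, (tokens, price) in TIERS.items()}
total_rounded = sum(costs.values())
total_exact = sum(Decimal(tokens) / 1000 * price for tokens, price in TIERS.values())

for name, cost in costs.items():
    print(f"{name}: ${cost}")
print(f"Sum of rounded tier costs: ${total_rounded}")
print(f"Sum before rounding: ${total_exact}")
```

The table total of $33.38 is the sum of the rounded per-tier costs; summing before rounding gives $33.37251, a difference of about one cent.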

---

### Summary

- 114 of the 119 NTS examples fit within the context limits of the GPT-4 variants considered; the 5 prompts above 128,000 tokens are excluded from the estimate.
- The estimated prompt-token cost for the 114 feasible prompts is **~$33.38**, which is affordable for benchmarking at this scale.
- GPT-4 Turbo has the lowest per-token price of the three variants and is the only one that accepts the 12 prompts between 32,769 and 128,000 tokens.

In [26]:
prompt_size_df[prompt_size_df['prompt tokens'] <= 8192]

|     | mutant | prompt tokens |
|-----|--------|---------------|
| 0 | v9/cfg_test.c | 844 |
| 1 | v24/cfg_test.c | 844 |
| 2 | v13/cfg_test.c | 845 |
| 3 | v4/cfg_test.c | 845 |
| 4 | v21/cfg_test.c | 844 |
| ... | ... | ... |
| 79 | v20/Problem1.c | 7027 |
| 80 | v21/Problem1.c | 7027 |
| 81 | v19/Problem1.c | 7026 |
| 82 | v17/Problem1.c | 7028 |


In [28]:
prompt_size_df[(prompt_size_df['prompt tokens'] > 8192) & (prompt_size_df['prompt tokens'] <= 32768)]

|     | mutant | prompt tokens |
|-----|--------|---------------|
| 84 | v1/adpcm.c | 10350 |
| 85 | v4/adpcm.c | 10349 |
| 86 | v6/adpcm.c | 10350 |
| 87 | v8/adpcm.c | 10350 |
| 88 | v7/adpcm.c | 10350 |
| 89 | v5/adpcm.c | 10350 |
| 90 | v3/adpcm.c | 10350 |
| 91 | v2/adpcm.c | 10350 |
| 92 | v6/Problem2.c | 21346 |
| 93 | v2/Problem2.c | 21346 |


In [30]:
sorted_tuples_list[83]

('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/Problem1_source/Problem1_source/Problem1.c',
 '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/Problem1_mutants/Problem1_mutants/v4/Problem1.c',
 22122)

In [94]:
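# Strip only the trailing '_mutants' so names containing '_' (e.g. cfg_test) stay intact.
set([tuple_val[1].split('/')[-3].removesuffix('_mutants') for tuple_val in sorted_tuples_list[:84]])

{'Problem1', 'cfg_test', 'elevator', 'merge2BSTree', 'nextDate1', 'quicksort'}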

In [99]:
mutant_file_map = {}
for tuple_val in sorted_tuples_list[:84]:
    mutant_name = '/'.join(tuple_val[1].split('/')[-2:])
    source_file = tuple_val[0]
    mutant_file = tuple_val[1]
    print(mutant_name,source_file,mutant_file)
    mutant_file_map[mutant_name] = (source_file,mutant_file)

v9/cfg_test.c /Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c /Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mutants/v9/cfg_test.c
v24/cfg_test.c /Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c /Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mutants/v24/cfg_test.c
v13/cfg_test.c /Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c /Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mutants/v13/cfg_test.c
v4/cfg_test.c /Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c /Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mutants/v4/cfg_test.c
v21/cfg_test.c /Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg

In [100]:
mutant_file_map

{'v9/cfg_test.c': ('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c',
  '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mutants/v9/cfg_test.c'),
 'v24/cfg_test.c': ('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c',
  '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mutants/v24/cfg_test.c'),
 'v13/cfg_test.c': ('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c',
  '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mutants/v13/cfg_test.c'),
 'v4/cfg_test.c': ('/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_source/cfg_test_source/cfg_test.c',
  '/Users/pritamwork/Documents/GitHub/LLM-Verification/extracted/cfg_test_mutants/cfg_test_mutants/v4/cfg_test.c'),
 'v21/cfg_test.c': ('/Us

In [46]:
import os
from dotenv import load_dotenv

load_dotenv()

True

In [50]:
!pip install openai

Collecting openai
  Downloading openai-1.93.2-py3-none-any.whl.metadata (29 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Downloading jiter-0.10.0-cp313-cp313-macosx_11_0_arm64.whl.metadata (5.2 kB)
Downloading openai-1.93.2-py3-none-any.whl (755 kB)
Downloading jiter-0.10.0-cp313-cp313-macosx_11_0_arm64.whl (318 kB)
Installing collected packages: jiter, openai
Successfully installed jiter-0.10.0 openai-1.93.2


In [51]:
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")
if not openai.api_key:
    raise ValueError("API key not found. Please set it in environment or .env file.")

In [80]:
import re
read_file_clean(sorted_tuples_list[0][0])

'#include <stdio.h> #include <stdlib.h> void f(int); void g(int); void h(int); void i(int); void f(int a) { if (a > 13) { printf("\\ngreater than 13\\n"); } else { printf("\\nnot greater than 13\\n"); } } void g(int a) { h(a); if (a == 7) { printf("\\n7\\n"); } else { printf("\\nnot 7\\n"); } i(a); } void h(int a) { if (a == -4) { printf("\\n-4\\n"); } else { printf("\\nnot -4\\n"); } } void i(int a) { if (a == 100) { printf("\\n100\\n"); } else { printf("\\nnot 100\\n"); } } int main(int argc, int* argv[]) { int a; a=atoi(argv[1]); if (a == 19) { printf("\\n19\\n"); } else { printf("\\nnot 19\\n"); } if (a> 5){ printf("\\nThe value of is greater than 5"); printf("\\nThe value of a is %d", a); f(a); } if (a< 5){ printf("\\nThe value of is less than 5"); printf("\\nThe value of a is %d", a); g(a); } if (a==5){ printf("\\nThe value of is equal to 5"); printf("\\nThe value of a is %d", a); f(a); g(a); } f(a); g(a); if (a != 1) { printf("\\nnot 1\\n"); } else { printf("\\n1\\n"); } return 

In [81]:
print(generate_prompt(read_file_clean(sorted_tuples_list[0][0]),read_file_clean(sorted_tuples_list[0][1])))

[{'role': 'system', 'content': 'You are a helpful and precise coding assistant. Your job is to identify whether two code snippets behave differently in terms of functionality.'}, {'role': 'user', 'content': 'Here is the original (source) code:\n\n#include <stdio.h> #include <stdlib.h> void f(int); void g(int); void h(int); void i(int); void f(int a) { if (a > 13) { printf("\\ngreater than 13\\n"); } else { printf("\\nnot greater than 13\\n"); } } void g(int a) { h(a); if (a == 7) { printf("\\n7\\n"); } else { printf("\\nnot 7\\n"); } i(a); } void h(int a) { if (a == -4) { printf("\\n-4\\n"); } else { printf("\\nnot -4\\n"); } } void i(int a) { if (a == 100) { printf("\\n100\\n"); } else { printf("\\nnot 100\\n"); } } int main(int argc, int* argv[]) { int a; a=atoi(argv[1]); if (a == 19) { printf("\\n19\\n"); } else { printf("\\nnot 19\\n"); } if (a> 5){ printf("\\nThe value of is greater than 5"); printf("\\nThe value of a is %d", a); f(a); } if (a< 5){ printf("\\nThe value of is less 

In [82]:
def get_system_prompt():
    return 'You are a helpful and precise coding assistant. Your job is to identify whether two code snippets behave differently in terms of functionality.'

In [108]:
def get_user_prompt(source,mutant):
    prompt = f"Here is the original (source) code:\n\n{source}\n\nAnd here is the mutated version of the code:\n\n{mutant}\n\nDo these two pieces of code have any **functional difference**? Respond strictly with \"Yes\" or \"No\" only. Explain your answer in single line if required provide code responsible with line number in mutant, if there are multiple discrepancies each will have one line explanation."
    return prompt

In [124]:
# Create an OpenAI client instance (the OpenAI class must be imported explicitly)
from openai import OpenAI

client = OpenAI(api_key=openai.api_key)

# Updated function for a single message input
def send_prompt(message, model="gpt-4o", temperature=0):
    """
    Send a single user prompt to the OpenAI Chat API with a system message.
    
    Parameters:
        message (str): The user's prompt.
        model (str): The OpenAI model to use (e.g., 'gpt-4o').
        temperature (float): Sampling temperature for response randomness.

    Returns:
        reply (str | None): The assistant's response, or None if the API call failed.
        usage (CompletionUsage | None): Token usage details, or None if the API call failed.
    """
    messages = [
        {"role": "system", "content": get_system_prompt()},
        {"role": "user", "content": message}
    ]

    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature
        )
        reply = response.choices[0].message.content
        usage = response.usage
        return reply, usage
    except Exception as e:
        print(model)
        print("OpenAI API error:", str(e))
        return None, None

In [121]:
explaination,usage = send_prompt(get_user_prompt(read_file_clean(sorted_tuples_list[0][0]),read_file_clean(sorted_tuples_list[0][1])))

In [122]:
print(explaination)
print(usage)

Yes. The mutated version always prints "\n7\n" in the `g(int a)` function regardless of the value of `a` (line 15 in the mutated code).
CompletionUsage(completion_tokens=37, prompt_tokens=835, total_tokens=872, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))
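
The `usage` object printed above carries the two numbers the cost estimate needs: `prompt_tokens` and `completion_tokens`. A minimal sketch of the arithmetic follows. The per-million-token rates are placeholder assumptions, not quoted prices, and `estimate_cost` is a hypothetical helper that is not defined elsewhere in this notebook.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  input_rate_per_million, output_rate_per_million):
    """Return the estimated cost of one call, in the same currency as the rates."""
    input_cost = prompt_tokens * input_rate_per_million / 1_000_000
    output_cost = completion_tokens * output_rate_per_million / 1_000_000
    return input_cost + output_cost


# Token counts are taken from the call above; the rates are ASSUMED placeholders
# and should be replaced with the current published prices for the chosen model.
ASSUMED_INPUT_RATE = 2.50
ASSUMED_OUTPUT_RATE = 10.00

single_call_cost = estimate_cost(835, 37, ASSUMED_INPUT_RATE, ASSUMED_OUTPUT_RATE)
print(f"Estimated cost for this call: ${single_call_cost:.6f}")
```

Summing this value over every entry in `usage_list` would give a dataset-level estimate under the same assumed rates.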


In [128]:
program_list = []
mutant_list = []
explaination_list = []
usage_list = []
execution_log = []
for key in mutant_file_map:
    # Derive the benchmark name before the try block so the except branch can always log it
    file_info = mutant_file_map[key]
    benchmark_program_name = file_info[0].split('/')[-2].split('_')[0]
    try:
        source = read_file_clean(file_info[0])
        mutant = read_file_clean(file_info[1])

        prompt = get_user_prompt(source, mutant)
        explaination, usage = send_prompt(prompt)
        # send_prompt returns (None, None) on API errors instead of raising
        if explaination is None:
            raise RuntimeError("send_prompt returned no reply")

        # Append to all four lists together so they stay the same length
        program_list.append(benchmark_program_name)
        mutant_list.append(key)
        explaination_list.append(explaination)
        usage_list.append(usage)
        execution_log.append(benchmark_program_name + ":" + key + ": is successful")
        print(benchmark_program_name + ":" + key + ": is successful")
    except Exception as e:
        print(f"Error on prompt {benchmark_program_name}:{key}: {str(e)}")
        execution_log.append(benchmark_program_name + ":" + key + ": has the following issue: " + str(e))

cfg:v9/cfg_test.c: is successful
cfg:v24/cfg_test.c: is successful
cfg:v13/cfg_test.c: is successful
cfg:v4/cfg_test.c: is successful
cfg:v21/cfg_test.c: is successful
cfg:v17/cfg_test.c: is successful
cfg:v6/cfg_test.c: is successful
cfg:v15/cfg_test.c: is successful
cfg:v23/cfg_test.c: is successful
cfg:v22/cfg_test.c: is successful
cfg:v14/cfg_test.c: is successful
cfg:v2/cfg_test.c: is successful
cfg:v5/cfg_test.c: is successful
cfg:v11/cfg_test.c: is successful
cfg:v18/cfg_test.c: is successful
cfg:v19/cfg_test.c: is successful
cfg:v10/cfg_test.c: is successful
cfg:v1/cfg_test.c: is successful
cfg:v7/cfg_test.c: is successful
cfg:v8/cfg_test.c: is successful
cfg:v12/cfg_test.c: is successful
cfg:v3/cfg_test.c: is successful
cfg:v16/cfg_test.c: is successful
cfg:v20/cfg_test.c: is successful
quicksort:v3/quicksort.c: is successful
quicksort:v6/quicksort.c: is successful
quicksort:v1/quicksort.c: is successful
quicksort:v2/quicksort.c: is successful
quicksort:v4/quicksort.c: is succ
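
The loop above fills four parallel lists, which only stay aligned if every iteration appends to all of them. One alternative, sketched here under assumptions, is to build one record per mutant so that a failure in any step is flagged on the same row. `collect_records` and `evaluate_pair` are hypothetical names: `evaluate_pair` stands in for the read, prompt, and send steps, and the stub data is illustrative only.

```python
def collect_records(file_map, evaluate_pair):
    """Build one dict per mutant; a failed call is recorded rather than skipped."""
    records = []
    for mutant_key, (source_path, mutant_path) in file_map.items():
        record = {"Mutant": mutant_key, "Explanation": None, "Usage": None, "Error": None}
        try:
            record["Explanation"], record["Usage"] = evaluate_pair(source_path, mutant_path)
        except Exception as exc:  # keep the row and note why it failed
            record["Error"] = str(exc)
        records.append(record)
    return records


# Illustrative stub standing in for the real API call (an assumption).
def fake_evaluate(source_path, mutant_path):
    if "bad" in mutant_path:
        raise RuntimeError("simulated failure")
    return "Yes. Example explanation.", {"total_tokens": 0}


demo_map = {
    "v1/demo.c": ("src/demo.c", "mutants/v1/demo.c"),
    "v2/demo.c": ("src/demo.c", "mutants/bad/demo.c"),
}
demo_records = collect_records(demo_map, fake_evaluate)
for row in demo_records:
    print(row["Mutant"], "->", row["Error"] or "ok")
```

A list of dicts in this shape can be passed directly to `pd.DataFrame(...)`, which removes the need for a separate length check.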

In [127]:
explaination_list

['Yes. The mutated code always prints "\\n7\\n" in the `g(int a)` function regardless of the value of `a`, due to the change in the conditional statement.',
 'Yes. The mutated code has a logical error in the last conditional statement (line 54), where it prints "\\n1\\n" regardless of the condition, unlike the source code.',
 'Yes. The mutated version has a change in the `h(int a)` function where both branches of the if-else statement print "\\n-4\\n" (line 19 in the mutated code), which is different from the original code.',
 'Yes. The mutated version has a change in the `f` function where both branches of the `if` statement print "greater than 13" (line 9 in the mutated code), which differs from the original code\'s behavior.']

In [117]:
def create_dataframe_from_lists(col1, col2, col3, col4, headers=None):
    """
    Creates a pandas DataFrame from four equal-length lists.

    Parameters:
        col1, col2, col3, col4 (list): Data columns.
        headers (list of str): Optional list of 4 column names.

    Returns:
        pd.DataFrame: The resulting DataFrame.
    """
    if not (len(col1) == len(col2) == len(col3) == len(col4)):
        raise ValueError("All input lists must have the same length.")

    if headers is None:
        headers = ["Benchmark Program", "Mutant", "Explaination", "Usage"]

    data = list(zip(col1, col2, col3, col4))
    return pd.DataFrame(data, columns=headers)


In [129]:
result_df = create_dataframe_from_lists(program_list,mutant_list,explaination_list,usage_list)

In [130]:
result_df

Unnamed: 0,Benchmark Program,Mutant,Explaination,Usage
0,cfg,v9/cfg_test.c,Yes. The mutated code has a change in the `g(i...,"CompletionUsage(completion_tokens=51, prompt_t..."
1,cfg,v24/cfg_test.c,Yes. The mutated code has a discrepancy in the...,"CompletionUsage(completion_tokens=42, prompt_t..."
2,cfg,v13/cfg_test.c,Yes. The mutated version has a change in the `...,"CompletionUsage(completion_tokens=49, prompt_t..."
3,cfg,v4/cfg_test.c,Yes. The mutated version has a change in the `...,"CompletionUsage(completion_tokens=48, prompt_t..."
4,cfg,v21/cfg_test.c,Yes. The mutated code has a change in the `mai...,"CompletionUsage(completion_tokens=52, prompt_t..."
...,...,...,...,...
79,Problem1,v20/Problem1.c,Yes. The mutated code has a logical error in `...,"CompletionUsage(completion_tokens=76, prompt_t..."
80,Problem1,v21/Problem1.c,Yes. The mutated code has a logical negation i...,"CompletionUsage(completion_tokens=63, prompt_t..."
81,Problem1,v19/Problem1.c,Yes. The mutated code has a typo in `calculate...,"CompletionUsage(completion_tokens=38, prompt_t..."
82,Problem1,v17/Problem1.c,Yes. The condition in `calculate_outputm20` wa...,"CompletionUsage(completion_tokens=82, prompt_t..."


In [132]:
csv_path = "./verification_GPT-4o.csv"
result_df.to_csv(csv_path, index=False)
print(f" CSV saved to: {csv_path}")

 CSV saved to: ./verification_GPT-4o.csv


In [133]:
!pip install openpyxl



In [135]:
excel_path = "./verification_GPT-4o.xlsx"
result_df.to_excel(excel_path, index=False)
print(f"Excel saved to: {excel_path}")

Excel saved to: ./verification_GPT-4o.xlsx


In [136]:
is_function_difference_list = [line.split('.')[0] for line in explaination_list]
is_function_difference_list

['Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'No',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes']

In [141]:
other_explaination_list = [' '.join(line.split('.')[1:]).strip() for line in explaination_list]
other_explaination_list 

['The mutated code has a change in the `g(int a)` function where the condition `if (a == 7)` always prints `"\\n7\\n"` regardless of the value of `a`, which is different from the original code',
 'The mutated code has a discrepancy in the final conditional check: `if (a != 1)` always prints "\\n1\\n" regardless of the condition (line 55 in the mutated code)',
 'The mutated version has a change in the `h(int a)` function where both branches of the if-else statement print "\\n-4\\n" (line 19 in the mutated code), which is different from the original code',
 'The mutated version has a change in the `f` function where both branches of the `if` statement print "greater than 13", which is a functional difference from the original code  (Mutated code line 8)',
 'The mutated code has a change in the `main` function where the condition `if (a == 19)` always prints `"\\n19\\n"` regardless of the condition, unlike the original code  (Line 19 in the mutated version)',
 'The mutated version has a c
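
Splitting on every `.` and re-joining with spaces, as above, removes the periods inside the explanation and would raise an error on a `None` reply. A hedged alternative is to match only the leading verdict with a regular expression and keep the rest of the text untouched. `split_verdict` is a hypothetical helper, and labeling unparseable replies as `"Unknown"` is an assumption about how such cases should be handled.

```python
import re

# Leading "Yes"/"No" (any case), followed by optional punctuation and whitespace.
VERDICT_PATTERN = re.compile(r"^\s*(yes|no)\b[\s.,:;-]*", re.IGNORECASE)


def split_verdict(reply):
    """Return (verdict, explanation) from a model reply without altering the explanation."""
    if not reply:
        return "Unknown", ""
    match = VERDICT_PATTERN.match(reply)
    if match is None:
        return "Unknown", reply.strip()
    verdict = match.group(1).capitalize()
    explanation = reply[match.end():].strip()
    return verdict, explanation


print(split_verdict("Yes. The mutated code differs. (line 8)"))
```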

In [142]:
def create_dataframe_from_lists_v2(col1, col2, col3, col4, col5, headers=None):
    """
    Creates a pandas DataFrame from five equal-length lists.

    Parameters:
        col1, col2, col3, col4, col5 (list): Data columns.
        headers (list of str): Optional list of 5 column names.

    Returns:
        pd.DataFrame: The resulting DataFrame.
    """
    if not (len(col1) == len(col2) == len(col3) == len(col4) == len(col5)):
        raise ValueError("All input lists must have the same length.")

    if headers is None:
        headers = ["Benchmark Program", "Mutant", "is Functionally different", "Explaination", "Usage"]

    data = list(zip(col1, col2, col3, col4, col5))
    return pd.DataFrame(data, columns=headers)

In [144]:
final_result_df = create_dataframe_from_lists_v2(program_list,mutant_list,is_function_difference_list,other_explaination_list,usage_list)

In [145]:
final_result_df

Unnamed: 0,Benchmark Program,Mutant,is Functionally different,Explaination,Usage
0,cfg,v9/cfg_test.c,Yes,The mutated code has a change in the `g(int a)...,"CompletionUsage(completion_tokens=51, prompt_t..."
1,cfg,v24/cfg_test.c,Yes,The mutated code has a discrepancy in the fina...,"CompletionUsage(completion_tokens=42, prompt_t..."
2,cfg,v13/cfg_test.c,Yes,The mutated version has a change in the `h(int...,"CompletionUsage(completion_tokens=49, prompt_t..."
3,cfg,v4/cfg_test.c,Yes,The mutated version has a change in the `f` fu...,"CompletionUsage(completion_tokens=48, prompt_t..."
4,cfg,v21/cfg_test.c,Yes,The mutated code has a change in the `main` fu...,"CompletionUsage(completion_tokens=52, prompt_t..."
...,...,...,...,...,...
79,Problem1,v20/Problem1.c,Yes,The mutated code has a logical error in `calcu...,"CompletionUsage(completion_tokens=76, prompt_t..."
80,Problem1,v21/Problem1.c,Yes,The mutated code has a logical negation in `ca...,"CompletionUsage(completion_tokens=63, prompt_t..."
81,Problem1,v19/Problem1.c,Yes,The mutated code has a typo in `calculate_outp...,"CompletionUsage(completion_tokens=38, prompt_t..."
82,Problem1,v17/Problem1.c,Yes,The condition in `calculate_outputm20` was cha...,"CompletionUsage(completion_tokens=82, prompt_t..."
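
With the verdict in its own column, a per-program tally shows how many mutants the model flagged as functionally different. The sketch below uses only the standard library on a list of `(program, verdict)` pairs. `tally_verdicts` is a hypothetical helper, and the sample pairs are illustrative, not the notebook's results.

```python
from collections import Counter, defaultdict


def tally_verdicts(pairs):
    """Count verdicts per benchmark program from (program, verdict) pairs."""
    tallies = defaultdict(Counter)
    for program, verdict in pairs:
        tallies[program][verdict] += 1
    # Convert to plain dicts so the result prints and compares cleanly.
    return {program: dict(counts) for program, counts in tallies.items()}


# Illustrative data only (an assumption, not taken from the run above).
sample_pairs = [("cfg", "Yes"), ("cfg", "Yes"), ("quicksort", "No"), ("quicksort", "Yes")]
print(tally_verdicts(sample_pairs))
```

In this notebook, the pairs could come from `zip(program_list, is_function_difference_list)`.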


In [146]:
csv_path = "./verification_GPT-4o_final.csv"
final_result_df.to_csv(csv_path, index=False)
print(f" CSV saved to: {csv_path}")

 CSV saved to: ./verification_GPT-4o_final.csv


In [147]:
excel_path = "./verification_GPT-4o_final.xlsx"
final_result_df.to_excel(excel_path, index=False)
print(f"Excel saved to: {excel_path}")

Excel saved to: ./verification_GPT-4o_final.xlsx
