### 1. Setting Up the Environment

This first cell handles all the necessary setup. It installs the required Python libraries for running the language models (`transformers`, `torch`, `bitsandbytes`), data manipulation (`pandas`), and code analysis (`radon`). After installation, it imports all the modules that will be used throughout the notebook.

In [1]:
# --- Section 1: Setup & Installations ---
print("Installing required libraries...")
# Install all necessary packages quietly (-qqq flag)
!pip install transformers torch accelerate bitsandbytes pandas huggingface_hub radon ipywidgets matplotlib seaborn -qqq > /dev/null 2>&1

# Import necessary modules
import ipywidgets as widgets
from IPython.display import display, HTML
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import notebook_login
import ast # Abstract Syntax Trees, used to check for valid Python code
import re # Regular expressions for cleaning text
import time
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from radon.complexity import cc_visit # For Cyclomatic Complexity
from radon.metrics import mi_visit # For Maintainability Index
from radon.raw import analyze # For Lines of Code (LOC)
import warnings
warnings.filterwarnings('ignore')
print("\nSetup Complete! ✅")

Installing required libraries...

Setup Complete! ✅


### 2. Hugging Face Authentication

To download the pre-trained models from the Hugging Face Hub, we need to authenticate. This section retrieves a secret key (stored securely in Kaggle Secrets) and uses it to log in. This step is crucial for accessing gated models or private repositories.

In [2]:
# Import the secret client to access stored secrets
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

# Retrieve the Hugging Face API key
secret_value = user_secrets.get_secret("HUGGINGFACE_KEY")

# Print a confirmation message without exposing the full key
print(secret_value[:5]+'*************************')

hf_WW*************************


In [3]:
# Use the retrieved key to log into the Hugging Face Hub
from huggingface_hub import login
login(token = secret_value)
print("Huggingface Login successful")

Huggingface Login successful
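
Outside Kaggle, the `kaggle_secrets` module is not available. The following is a minimal sketch of a fallback, assuming the token has been exported in an environment variable. The helper names `get_hf_token` and `mask` are hypothetical, and the variable name `HF_TOKEN` is an assumption about how the environment is configured.

```python
import os

def get_hf_token(secret_name="HUGGINGFACE_KEY", env_var="HF_TOKEN"):
    """Return the token from Kaggle Secrets if possible, else from the environment."""
    try:
        # Only importable (and usable) inside a Kaggle kernel
        from kaggle_secrets import UserSecretsClient
        return UserSecretsClient().get_secret(secret_name)
    except Exception:
        # Fallback (assumption): the token was exported as an environment variable
        token = os.environ.get(env_var)
        if not token:
            raise RuntimeError(f"No token found in Kaggle Secrets or ${env_var}")
        return token

def mask(token, visible=5):
    """Show only the first few characters, like the confirmation print above."""
    return token[:visible] + "*" * max(len(token) - visible, 0)
```

The returned value can be passed to `login(token=...)` exactly as in the cell above.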


### 3. Core Engine: Generation and Metrics

This cell contains the backbone of our analysis tool. It defines:
1.  **Models to Test**: A dictionary mapping user-friendly names to their Hugging Face model paths.
2.  **Helper Functions**:
    * `clean_generated_code`: Each model has a unique way of formatting its output (e.g., special tokens, instructions). This function uses regular expressions to strip away the noise and extract only the raw Python code.
    * `is_syntactically_valid`: Checks if the generated code can be parsed, ensuring it's valid Python syntax before we try to analyze it.
    * `calculate_advanced_metrics`: Uses the `radon` library to compute key software metrics: Cyclomatic Complexity (decision complexity), Maintainability Index (ease of maintenance), and Lines of Code.
    * `generate_code`: The main generation function. It formats the prompt for the specific model, generates code, measures the generation time, and then cleans the output.
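
The cleaning and validation steps can be illustrated in isolation. The following is a minimal, self-contained sketch rather than the notebook's own functions: `extract_code` and `is_valid_python` are hypothetical names, and the fence marker is built with string multiplication only so the example stays readable inside a Markdown cell.

```python
import ast
import re

FENCE = "`" * 3  # Three backticks, i.e. a Markdown code fence
FENCE_RE = re.compile(FENCE + r"(?:python)?\n(.*?)\n" + FENCE, re.DOTALL)

def extract_code(text):
    """Return the body of the first fenced block, or the stripped text if there is none."""
    match = FENCE_RE.search(text)
    return (match.group(1) if match else text).strip()

def is_valid_python(code):
    """True if the string is non-empty and parses as Python source."""
    if not code:
        return False
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

# A typical chatty model reply: prose, a fenced block, more prose
raw = (
    "Here is the function:\n"
    + FENCE + "python\n"
    + "def add(a, b):\n    return a + b\n"
    + FENCE + "\nHope this helps!"
)
code = extract_code(raw)
print(code)
print(is_valid_python(code))                          # The extracted function parses
print(is_valid_python("def add(a, b) return a + b"))  # Missing colon, so it does not
```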

In [4]:


# --- Model Configuration ---

# Dictionary mapping friendly names to Hugging Face model identifiers
MODELS_TO_TEST = {
    "DeepSeek-Coder-1.3B": "deepseek-ai/deepseek-coder-1.3b-instruct",
    "Phi-2-2.7B": "microsoft/phi-2",
    "Gemma-2B-IT": "google/gemma-2b-it",
    "Stable-code-3b": "stabilityai/stable-code-3b"
}
# Set the computation device to GPU (cuda) if available, otherwise CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Helper & Generation Functions ---
def clean_generated_code(text, model_path):
    """Cleans the raw model output to extract only the Python code."""
    model_path = model_path.lower()
    # Model-specific cleaning logic
    if "gemma" in model_path:
        text = re.sub(r"<start_of_turn>user\n.*<end_of_turn>\n<start_of_turn>model\n", "", text, flags=re.DOTALL).replace("<end_of_turn>", "")
    elif "phi-2" in model_path:
        text = re.sub(r"Instruct:.*\nOutput:", "", text, flags=re.DOTALL)
        text = text.replace("<|endoftext|>", "")
    elif "stable" in model_path:
        text = re.sub(r'""".*?"""\s*', "", text, flags=re.DOTALL) # Remove docstrings
        match = re.search(r"```(?:python)?\n(.*?)\n```", text, re.DOTALL) # Extract from markdown block
        if match:
            text = match.group(1)
        text = re.sub(r"<\|?endoftext\|?>", "", text, flags=re.IGNORECASE)
    else: # Default cleaning for instruction-tuned models
        text = re.sub(r"### Instruction:\n.*\n\n### Response:", "", text, flags=re.DOTALL)
    
    # General cleaning for code within markdown blocks
    match = re.search(r"```python\n(.*?)\n```", text, re.DOTALL)
    if match: text = match.group(1)
    return text.strip()

def is_syntactically_valid(code_string: str) -> bool:
    """Checks if a string contains valid Python code using AST parsing."""
    if not code_string: return False
    try:
        ast.parse(code_string)
        return True
    except SyntaxError:
        return False

def calculate_advanced_metrics(code_string):
    """Calculates code quality metrics if the code is syntactically valid."""
    if not is_syntactically_valid(code_string):
        return {"complexity": None, "maintainability": None, "loc": None}
    try:
        # Calculate Cyclomatic Complexity, Maintainability Index, and Lines of Code
        blocks = cc_visit(code_string)  # Analyze once and reuse the result
        complexity = sum(block.complexity for block in blocks)  # 0 when no blocks are found
        maintainability = mi_visit(code_string, multi=True)
        loc = analyze(code_string).loc
        return {"complexity": complexity, "maintainability": round(float(maintainability), 2), "loc": loc}
    except Exception:
        return {"complexity": None, "maintainability": None, "loc": None}

def generate_code(model, tokenizer, prompt):
    """Generates code from a model, times the process, and cleans the output."""
    model_path = model.name_or_path.lower()
    
    # Apply model-specific prompt formatting
    if "gemma" in model_path:
        formatted_prompt = f"<start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n"
    elif "phi-2" in model_path:
        formatted_prompt = f"Instruct: {prompt}\nOutput:"
    elif "stable" in model_path:
        formatted_prompt = f'"""\n{prompt}\n"""\n'
    else: # Default instruction format
        formatted_prompt = f"### Instruction:\n{prompt}\n\n### Response:"
        
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Tokenize the prompt and move it to the correct device (GPU/CPU)
    inputs = tokenizer(formatted_prompt, return_tensors="pt", return_attention_mask=True).to(device)
    
    # Generate code and time the inference
    start_time = time.time()
    output_ids = model.generate(
        inputs.input_ids, 
        attention_mask=inputs.attention_mask, 
        max_new_tokens=512, 
        temperature=0.1, # Low temperature for more deterministic, less creative output
        do_sample=True, 
        pad_token_id=tokenizer.pad_token_id
    )
    end_time = time.time()

    # Decode the generated token IDs back to a string
    raw_output = tokenizer.batch_decode(output_ids)[0]
    # Clean the raw output to get just the code
    cleaned_code = clean_generated_code(raw_output, model_path)

    return {"code": cleaned_code, "gen_time": end_time - start_time}

print("Backend engine with advanced metrics is ready.")

Backend engine with advanced metrics is ready.
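
The prompt formats in `generate_code` and the cleanup patterns in `clean_generated_code` have to stay in sync, because each regex strips exactly the wrapper its prompt format adds. One possible way to keep the formats in a single place is a lookup table keyed by a substring of the model path. This is a sketch of an alternative, not the notebook's implementation; `PROMPT_TEMPLATES`, `DEFAULT_TEMPLATE`, and `format_prompt` are hypothetical names.

```python
# Substring of the lower-cased model path -> prompt template
# (the template strings are the same formats used in generate_code)
PROMPT_TEMPLATES = {
    "gemma": "<start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n",
    "phi-2": "Instruct: {prompt}\nOutput:",
    "stable": '"""\n{prompt}\n"""\n',
}
DEFAULT_TEMPLATE = "### Instruction:\n{prompt}\n\n### Response:"

def format_prompt(model_path, prompt):
    """Wrap the prompt in the template whose key appears in the model path."""
    path = model_path.lower()
    for key, template in PROMPT_TEMPLATES.items():
        if key in path:
            # str.replace avoids surprises if a template ever contains other braces
            return template.replace("{prompt}", prompt)
    return DEFAULT_TEMPLATE.replace("{prompt}", prompt)

print(format_prompt("microsoft/phi-2", "a function that adds two numbers"))
```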


### 4. Pre-loading Models

Loading a large language model into memory can be time-consuming. To make the interactive UIs feel responsive, this cell pre-loads all the models and their corresponding tokenizers defined in `MODELS_TO_TEST`. They are stored in a dictionary (`loaded_models`) for quick access later. Using `torch_dtype=torch.bfloat16` and `device_map="auto"` helps optimize memory usage and automatically places the model on the available GPU.
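
As a rough sanity check on why half precision matters, weight memory can be estimated as parameter count times bytes per parameter. The sketch below uses the nominal sizes from the model names, which is an approximation (actual parameter counts differ somewhat), and it ignores activations and the KV cache. `estimate_weight_gb` is a hypothetical helper.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

def estimate_weight_gb(n_params_billion, dtype="bfloat16"):
    """Approximate weight size in GB (10^9 bytes); excludes activations and KV cache."""
    return n_params_billion * BYTES_PER_PARAM[dtype]

# Nominal sizes read off the model names (assumption, not exact parameter counts)
nominal_sizes = {
    "DeepSeek-Coder-1.3B": 1.3,
    "Phi-2-2.7B": 2.7,
    "Gemma-2B-IT": 2.0,
    "Stable-code-3b": 3.0,
}

for name, billions in nominal_sizes.items():
    half = estimate_weight_gb(billions, "bfloat16")
    full = estimate_weight_gb(billions, "float32")
    print(f"{name}: ~{half:.1f} GB in bfloat16 vs ~{full:.1f} GB in float32")
```

Under these assumptions, all four models together need roughly 18 GB of weights in bfloat16 and about twice that in float32.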

In [5]:
# --- Section 4: Pre-Loading All Models ---
loaded_models = {}
print("Starting to pre-load all models...")

# Iterate through the dictionary of models to test
for model_name, model_path in MODELS_TO_TEST.items():
    if model_name not in loaded_models.keys():
        print(f"\n--- Loading {model_name}... ---")
        try:
            # Load the tokenizer
            tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
            # Load the model with optimizations for memory and device placement
            model = AutoModelForCausalLM.from_pretrained(
                model_path, 
                torch_dtype=torch.bfloat16, # Use bfloat16 for reduced memory footprint
                device_map="auto", # Automatically map model layers to available devices (GPU/CPU)
                trust_remote_code=True
            )
            # Store the loaded model and tokenizer
            loaded_models[model_name] = {"model": model, "tokenizer": tokenizer}
            print(f"✅ {model_name} loaded successfully.")
        except Exception as e:
            print(f"✗ FAILED to load {model_name}. Error: {e}")

print("\n" + "="*50 + "\nAll available models are pre-loaded.\n" + "="*50)

Starting to pre-load all models...

--- Loading DeepSeek-Coder-1.3B... ---



✅ DeepSeek-Coder-1.3B loaded successfully.

--- Loading Phi-2-2.7B... ---



✅ Phi-2-2.7B loaded successfully.

--- Loading Gemma-2B-IT... ---



✅ Gemma-2B-IT loaded successfully.

--- Loading Stable-code-3b... ---



✅ Stable-code-3b loaded successfully.

All available models are pre-loaded.


In [6]:
# Verify which models have been loaded by checking the dictionary keys
loaded_models.keys()

dict_keys(['DeepSeek-Coder-1.3B', 'Phi-2-2.7B', 'Gemma-2B-IT', 'Stable-code-3b'])

A separate attempt to load an additional model (`replit-code-v1-3b`) failed, which shows that not every model is compatible with the current setup.

### 5. Interactive UI #1: Benchmark All Models

This section builds the first user interface using `ipywidgets`. It provides a simple way to test a single prompt against all the pre-loaded models in one run; the models are queried one after another.

1.  **UI Elements**: A text area for the user's prompt and a button to trigger the generation.
2.  **Button Logic**: The `on_benchmark_all_clicked` function is attached to the button. When clicked, it takes the prompt, iterates through all models, calls the `generate_code` function for each, calculates metrics, and displays the results neatly in a table (Pandas DataFrame).

In [7]:
# Create the text input area for the prompt
prompt_input_1 = widgets.Textarea(
    placeholder="Enter your code prompt here (e.g., 'a function to calculate the factorial of a number')",
    layout={'width': '90%', 'height': '100px'}
)

# Create the button to start the benchmark
generate_button_1 = widgets.Button(description="Generate & Benchmark All")

# Create an output area to display results
output_1 = widgets.Output()

In [8]:
from IPython.display import clear_output

# Define the function that runs when the 'Generate & Benchmark All' button is clicked
def on_benchmark_all_clicked(b):
    with output_1:
        clear_output() # Clear previous results
        prompt = prompt_input_1.value
        if not prompt:
            print("Please enter a prompt.")
            return

        print(f"🚀 Starting benchmark for prompt: '{prompt}'\n" + "="*50)
        results_this_run = []
        # Loop through each pre-loaded model
        for name in loaded_models:  # Only models that actually loaded, avoiding a KeyError
            print(f"\n--- Generating with {name} ---")
            # Generate code using the model's components
            generated_code = generate_code(loaded_models[name]['model'],loaded_models[name]['tokenizer'], prompt)
            # Calculate quality metrics for the generated code
            metrics = calculate_advanced_metrics(generated_code['code'])
            # Combine all results into a single dictionary entry
            entry = {'Model': name, 'Prompt':prompt, **generated_code, **metrics}
            results_this_run.append(entry)
            
        # Convert the list of results into a Pandas DataFrame for nice formatting
        results_df = pd.DataFrame(results_this_run).round(2)
        # Display the DataFrame as an HTML table
        display(HTML(results_df.to_html().replace('\\n','<br>')))

# Attach the function to the button's click event
generate_button_1.on_click(on_benchmark_all_clicked)

In [9]:
# Display the UI components together
print("UI #1: Benchmark All Models")
display(widgets.VBox([prompt_input_1, generate_button_1, output_1]))

UI #1: Benchmark All Models



### 6. Interactive UI #2: Inspect Selected Models

This section builds a second, more flexible UI. Instead of running on all models, the user can select specific models to compare using checkboxes. This is useful for focusing the analysis on a subset of models that seem promising.

The structure is similar to the first UI, with `ipywidgets` for the prompt, checkboxes, a button, and an output area. The `on_run_selected_clicked` function collects the names of the checked models and runs generation and analysis only for those.

In [10]:
# --- Section 6: UI #2 - Run on Selected Models ---
print("\n\n--- UI #2: Inspect Selected Models ---")

# Define the UI widgets
prompt_input_selected = widgets.Textarea(placeholder='Enter a prompt for selected models...', layout={'width': '95%'})
run_selected_button = widgets.Button(description='Run Selected', button_style='success', icon='play')
output_selected = widgets.Output(layout={'border': '1px solid black', 'padding': '10px', 'overflow': 'scroll'})

# Create a checkbox for each loaded model
model_checkboxes = {name: widgets.Checkbox(value=True, description=name) for name in loaded_models.keys()}
checkbox_container = widgets.VBox(list(model_checkboxes.values()))

# Define the function that runs when the 'Run Selected' button is clicked
def on_run_selected_clicked(b):
    with output_selected:
        output_selected.clear_output(wait=True)
        prompt = prompt_input_selected.value
        if not prompt: print("Please enter a prompt."); return

        # Get the list of models that have been checked by the user
        models_to_run = [name for name, cb in model_checkboxes.items() if cb.value]
        if not models_to_run: print("Please select at least one model."); return

        print(f"Running prompt on {len(models_to_run)} selected models...")
        results_this_run = []
        # Loop only through the selected models
        for model_name in models_to_run:
            print(f"  - Generating with {model_name}...")
            components = loaded_models[model_name]
            result = generate_code(components['model'], components['tokenizer'], prompt)
            metrics = calculate_advanced_metrics(result['code'])

            entry = {'Model': model_name, 'Prompt': prompt, **result, **metrics}
            results_this_run.append(entry)

        print("\n--- Selected Run Complete ---")
        results_df = pd.DataFrame(results_this_run).round(2)
        display(HTML(results_df.to_html().replace('\\n', '<br>')))

# Attach the function to the button's click event
run_selected_button.on_click(on_run_selected_clicked)

# Assemble and display the UI
ui_selected_models = widgets.VBox([prompt_input_selected, widgets.HTML("<h4>Select models to run:</h4>"), checkbox_container, run_selected_button, output_selected])
display(ui_selected_models)



--- UI #2: Inspect Selected Models ---



### 7. Automated Testing and Visualization

This final section automates the entire evaluation process to provide a high-level, objective comparison of the models. It runs a predefined list of 16 common coding prompts against every model.

After collecting data for all prompts and models, it aggregates the results and generates bar plots (pandas plotting on top of `matplotlib`) to visualize each model's average performance on the following metrics:
- **Cyclomatic Complexity**: How complex is the generated code? (Lower is better)
- **Maintainability Index**: How easy is the code to maintain? (Higher is better)
- **Lines of Code (LOC)**: How verbose is the code? (Context-dependent, but often lower is better)
- **Generation Time**: How long does the model take per prompt? (Lower is better, but it depends on how much code was generated)
- **Lines of Code per Second**: How quickly does the model produce code? (Higher is better; because it normalizes time by output length, it is easier to compare across prompts and models)

This provides a comprehensive, at-a-glance summary of each model's strengths and weaknesses.
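
To make the first metric concrete: cyclomatic complexity is roughly one plus the number of decision points in the code. The sketch below is a simplified standard-library approximation for illustration only. It is not `radon`'s implementation and may disagree with `cc_visit` on some constructs (for example, comprehensions are not counted here). `approx_complexity` is a hypothetical helper.

```python
import ast

# Node types that each add one branch (a deliberately simplified set)
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)

def approx_complexity(source):
    """Return 1 plus the number of decision points found in the source."""
    count = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            count += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' adds two extra paths
            count += len(node.values) - 1
    return count

snippet = '''
def positive_evens(numbers):
    result = []
    for n in numbers:
        if n % 2 == 0 and n > 0:
            result.append(n)
    return result
'''
# 1 (base) + for + if + and = 4
print(approx_complexity(snippet))
```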

In [11]:
# A list of standard prompts to test the models against automatically
TEST_PROMPTS = [
    "Write a Python function is_palindrome(s) that returns True if a string is a palindrome, ignoring case and non-alphanumeric characters.",
    "Write a Python function find_common_elements(list1, list2) that returns a new list containing elements that are present in both input lists.",
    "Implement a Stack class in Python with push, pop, peek, and is_empty methods.",
    "Write a Python function get_unique_even_numbers(numbers) that takes a list of integers and returns a sorted list of unique even numbers, using a list comprehension.",
    "Write a Python function merge_dictionaries(d1, d2) that merges two dictionaries. If a key exists in both, the value from the second dictionary should overwrite the first.",
    "Write a Python function count_words_in_file(filepath) that reads a text file and returns the total number of words.",
    "Write a Python function get_bitcoin_price() that uses the requests library to fetch the current Bitcoin price from the CoinDesk API (https://api.coindesk.com/v1/bpi/currentprice.json) and returns the price in USD as a float.",
    "A function that takes a list of numbers and returns the sum.",
    "Implement a binary search algorithm.",
    "A function to find the factorial of a number using recursion.",
    "Write a simple Flask route that returns 'Hello, World!'.",
    "A function to parse a date string 'YYYY-MM-DD' and return a datetime object.",
    "A Python class for a 'Car' with 'make', 'model', and 'year' attributes.",
    "A function to read a CSV file using pandas and return a DataFrame.",
    "A function to write a dictionary to a JSON file.",
    "A function that uses regex to validate an email address."
]

from IPython.display import FileLink  # Needed by download_csv; not imported in the setup cell

# Shared state for the automated test section
results = []
output_3 = widgets.Output()
csv_path = "/tmp/model_test_results.csv"  # Module-level so download_csv can see it


# Define what happens when the download button is clicked
def download_csv(b):
    clear_output(wait=True)
    print("✅ CSV file created successfully!")
    display(FileLink(csv_path, result_html_prefix="Click to download: "))

# Define the main function for automated testing
def run_tests():
    global results
    results = [] # Reset results for a new run
    print("Starting automated testing across all prompts and models...")

    # Outer loop: iterate through each test prompt
    for i, prompt in enumerate(TEST_PROMPTS):
        print(f"\nRunning Prompt {i+1}/{len(TEST_PROMPTS)}: '{prompt}'")
        # Inner loop: iterate through each model that loaded successfully
        for name, components in loaded_models.items():
            print(f"  - Generating with {name}...")
            output_gen_code = generate_code(components['model'], components['tokenizer'], prompt)
            code = output_gen_code['code']
            gen_time = output_gen_code['gen_time']
            metrics = calculate_advanced_metrics(code)
            # Only append results if metrics could be calculated (i.e., code was valid)
            if metrics['complexity'] is not None:
                results.append({
                    "prompt": prompt,
                    "model": name,
                    "generation_time": gen_time,
                    **metrics
                })

    print("\n✅ Automated testing complete.")
    df = pd.DataFrame(results)
    # Check for an empty result set before touching any columns
    if df.empty:
        print("No results to plot.")
        return df
    df['loc_per_sec'] = df['loc'] / df['generation_time']

    # --- CSV Download Button ---

    # Save the DataFrame to a CSV file
    df.to_csv(csv_path, index=False)

    # Create a download button
    download_button = widgets.Button(
        description="📥 Download Results as CSV",
        button_style='success',
        tooltip="Click to download the results CSV"
    )
    download_button.on_click(download_csv)

    # Display the button
    display(download_button)
    return df

results_df = run_tests()

Starting automated testing across all prompts and models...

Running Prompt 1/16: 'Write a Python function is_palindrome(s) that returns True if a string is a palindrome, ignoring case and non-alphanumeric characters.'
  - Generating with DeepSeek-Coder-1.3B...
  - Generating with Phi-2-2.7B...
  - Generating with Gemma-2B-IT...
  - Generating with Stable-code-3b...

Running Prompt 2/16: 'Write a Python function find_common_elements(list1, list2) that returns a new list containing elements that are present in both input lists.'
  - Generating with DeepSeek-Coder-1.3B...
  - Generating with Phi-2-2.7B...
  - Generating with Gemma-2B-IT...
  - Generating with Stable-code-3b...

Running Prompt 3/16: 'Implement a Stack class in Python with push, pop, peek, and is_empty methods.'
  - Generating with DeepSeek-Coder-1.3B...
  - Generating with Phi-2-2.7B...
  - Generating with Gemma-2B-IT...
  - Generating with Stable-code-3b...

Running Prompt 4/16: 'Write a Python function get_unique_even_n


In [12]:
results_df

| | prompt | model | generation_time | complexity | maintainability | loc | loc_per_sec |
|---|---|---|---|---|---|---|---|
| 0 | Write a Python function is_palindrome(s) that ... | DeepSeek-Coder-1.3B | 9.464993 | 3 | 79.01 | 3 | 0.316957 |
| 1 | Write a Python function is_palindrome(s) that ... | Phi-2-2.7B | 4.526302 | 3 | 100.0 | 6 | 1.325585 |
| 2 | Write a Python function is_palindrome(s) that ... | Gemma-2B-IT | 12.950845 | 3 | 76.14 | 17 | 1.312656 |
| 3 | Write a Python function find_common_elements(l... | DeepSeek-Coder-1.3B | 7.598845 | 3 | 88.29 | 2 | 0.263198 |
| 4 | Write a Python function find_common_elements(l... | Phi-2-2.7B | 4.140416 | 3 | 77.88 | 6 | 1.44913 |
| 5 | Write a Python function find_common_elements(l... | Gemma-2B-IT | 14.465382 | 4 | 84.16 | 25 | 1.728264 |
| 6 | Implement a Stack class in Python with push, p... | DeepSeek-Coder-1.3B | 10.855595 | 9 | 62.0 | 20 | 1.842368 |
| 7 | Implement a Stack class in Python with push, p... | Phi-2-2.7B | 8.555139 | 9 | 64.6 | 17 | 1.98711 |
| 8 | Implement a Stack class in Python with push, p... | Gemma-2B-IT | 14.47445 | 9 | 88.66 | 41 | 2.832577 |
| 9 | Implement a Stack class in Python with push, p... | Stable-code-3b | 6.871195 | 7 | 63.84 | 26 | 3.783912 |
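
The per-model averages that feed the plots can be reproduced by hand. The following is a minimal sketch using only the standard library and the ten rows displayed above; `rows`, `by_model`, and `summary` are illustrative names, whereas the notebook itself aggregates with `df.groupby('model')`.

```python
from collections import defaultdict
from statistics import mean

# (model, generation_time in seconds, loc) copied from the rows displayed above
rows = [
    ("DeepSeek-Coder-1.3B", 9.464993, 3),
    ("Phi-2-2.7B", 4.526302, 6),
    ("Gemma-2B-IT", 12.950845, 17),
    ("DeepSeek-Coder-1.3B", 7.598845, 2),
    ("Phi-2-2.7B", 4.140416, 6),
    ("Gemma-2B-IT", 14.465382, 25),
    ("DeepSeek-Coder-1.3B", 10.855595, 20),
    ("Phi-2-2.7B", 8.555139, 17),
    ("Gemma-2B-IT", 14.47445, 41),
    ("Stable-code-3b", 6.871195, 26),
]

# Collect per-model samples of generation time and lines of code per second
by_model = defaultdict(lambda: {"time": [], "loc_per_sec": []})
for model, gen_time, loc in rows:
    by_model[model]["time"].append(gen_time)
    by_model[model]["loc_per_sec"].append(loc / gen_time)

summary = {
    model: {"avg_time": mean(v["time"]), "avg_loc_per_sec": mean(v["loc_per_sec"])}
    for model, v in by_model.items()
}

# Print models from fastest to slowest average generation time
for model, stats in sorted(summary.items(), key=lambda kv: kv[1]["avg_time"]):
    print(f"{model}: {stats['avg_time']:.2f} s, {stats['avg_loc_per_sec']:.2f} LOC/s")
```

Stable-code-3b contributes only one row in this excerpt, because a result is recorded only when the generated code parses, so its average rests on a single sample.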


In [13]:
# Alias used by the plotting function; a `global` statement is unnecessary at module level
df = results_df

In [16]:
def plot_data(b):
    with output_3: 
        # --- Visualization ---
        print("\n📊 Generating performance plots...")
        fig, axes = plt.subplots(5, 1, figsize=(12, 30))  # One axis per metric
        fig.suptitle('Model Performance Metrics Across All Prompts', fontsize=16)

        # (column, sort ascending?, bar color, title, y-axis label) for each plot
        plot_specs = [
            ('complexity', True, 'skyblue', 'Average Cyclomatic Complexity (Lower is Better)', 'Avg. Complexity'),
            ('maintainability', False, 'lightgreen', 'Average Maintainability Index (Higher is Better)', 'Avg. MI'),
            ('loc', True, 'salmon', 'Average Lines of Code (LOC)', 'Avg. LOC'),
            ('generation_time', True, 'purple', 'Average Generation Time (Lower is Better)', 'Avg. Time (seconds)'),
            ('loc_per_sec', True, 'orange', 'Average Lines of Code Generated per Second (Higher is Better)', 'Avg. LOC per second'),
        ]

        # Each metric gets its own axis, so the fifth plot no longer overwrites the fourth
        for ax, (column, ascending, color, title, ylabel) in zip(axes, plot_specs):
            grouped = df.groupby('model')[column]
            avg = grouped.mean().sort_values(ascending=ascending)
            std = grouped.std().reindex(avg.index)  # Error bars in the same model order
            avg.plot(kind='bar', ax=ax, color=color, yerr=std, capsize=4)
            ax.set_title(title)
            ax.set_ylabel(ylabel)
            ax.tick_params(axis='x', rotation=45)

        plt.tight_layout(pad=3.0)
        plt.show()


run_test_button = widgets.Button(description="Generate Plots")

# Link button to function
run_test_button.on_click(plot_data)

# Display UI
print("\n\nAutomated Testing & Visualization")
display(widgets.VBox([run_test_button, output_3]))



Automated Testing & Visualization

