#### Graphs

This notebook includes all visualizations designed to analyze and interpret the toxicity behavior of different LLMs across experimental conditions. Each graph focuses on a specific dimension of model performance and variability.

In [1]:
import os
import sys

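# Resolve the repository root (the parent of the notebook directory) so `src` is importable.
ROOT_DIR = os.path.abspath(os.path.join(os.getcwd(), '..'))

if ROOT_DIR not in sys.path:
    sys.path.append(ROOT_DIR)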

In [2]:
import plotly.express as px
import pandas as pd
from src.scripts.plotly_express_plots import *

In [3]:
df = pd.read_excel('../results/analysis_module/winners_perspective_scores_run_prompt_player_05_12_2025_14-34-53.xlsx')

**Graph 1. Toxicity vs Temperature**

Generates a line chart with error bars. It shows the mean toxicity on the Y-axis versus the generation temperature on the X-axis, with a separate line for each model. The error bars represent the 95% confidence interval of the mean, computed as mean ± 1.96 × SEM (standard error of the mean).

How to interpret: A steeply rising line means that increasing randomness (temperature) substantially increases the mean toxicity of the model's output. Wide error bars indicate greater uncertainty about the mean at that temperature, which arises from higher variability in the toxicity scores, fewer samples, or both.

In [4]:
plot_toxicity_vs_temperature(df)
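
The quantities behind this chart can be sketched with plain pandas. This is a minimal illustration on toy data, not the implementation of `plot_toxicity_vs_temperature`; the column names `model`, `temperature`, and `toxicity` are assumptions and may differ from those in the results file.

```python
import pandas as pd

# Toy data; the column names "model", "temperature", and "toxicity" are assumptions.
toy = pd.DataFrame({
    "model": ["A"] * 4 + ["B"] * 4,
    "temperature": [0.2, 0.2, 1.0, 1.0] * 2,
    "toxicity": [0.10, 0.20, 0.30, 0.50, 0.05, 0.15, 0.40, 0.80],
})

# Mean, standard error of the mean, and sample count per (model, temperature) group.
summary = (
    toy.groupby(["model", "temperature"])["toxicity"]
    .agg(mean="mean", sem="sem", n="count")
    .reset_index()
)

# Half-width of the 95% confidence interval under a normal approximation.
summary["ci95"] = 1.96 * summary["sem"]
print(summary)
```

With few samples per group, the normal approximation behind the 1.96 factor is rough, so the `n` column is worth checking before reading much into narrow or wide bars.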

**Graph 3. Distribution shape by model**

Generates a violin plot. It shows the distribution of the toxicity score for each model. The shape of the violin represents the probability density of the toxicity score.

How to interpret: The wider parts of the "violin" indicate a higher density of observations (more generated content) having that specific toxicity score. A violin plot that is wider near the high end of the Y-axis (e.g., closer to 1.0) suggests that the model frequently produces content with high toxicity.

In [26]:
plot_distribution_by_model(df)

**Graph 2. Temperature curve per model**

Generates separate line charts (one per model). Each chart shows the mean toxicity versus temperature, as in Graph 1, but with the 95% confidence interval drawn as a shaded band instead of error bars.

How to interpret: Each chart shows the toxicity vs. temperature trend for a single model, with the shaded band emphasizing the uncertainty around the mean.

In [None]:
plot_toxicity_vs_temperature_shaded(df)

**Graph 4. High tail by model**

Plots the percentage of generated content whose toxicity score is greater than or equal to a specified threshold (`thr`, default 0.8) against the temperature, for each model.

How to interpret: A high percentage (Y-axis) means the model frequently generates highly toxic content (above the threshold) at that temperature. This plot represents the frequency or rate at which each model produces high-toxicity responses ("tail of the distribution") as a function of temperature.

In [4]:
plot_rates_above_threshold(df)
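
The tail rate itself is simple to compute. The sketch below uses toy data and assumed column names (`model`, `temperature`, `toxicity`); it illustrates the calculation described above and is not the code inside `plot_rates_above_threshold`.

```python
import pandas as pd

# Toy data; the column names "model", "temperature", and "toxicity" are assumptions.
toy = pd.DataFrame({
    "model": ["A"] * 4 + ["B"] * 4,
    "temperature": [0.2, 0.2, 1.0, 1.0] * 2,
    "toxicity": [0.10, 0.85, 0.90, 0.80, 0.05, 0.15, 0.40, 0.95],
})

thr = 0.8  # same default threshold as described above

# Flag each response at or above the threshold, then average the flags
# per (model, temperature) group and convert the fraction to a percentage.
rates = (
    toy.assign(above=toy["toxicity"] >= thr)
    .groupby(["model", "temperature"])["above"]
    .mean()
    .mul(100)
    .rename("pct_above_thr")
    .reset_index()
)
print(rates)
```

Because the comparison is `>=`, a score exactly equal to `thr` counts toward the tail.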

**Graph 5. Top toxic black cards (mean per model)**

Generates a heatmap. It shows the mean toxicity for each combination of black card (`black_id`) and model, restricted to the `top_k` black cards with the highest mean toxicity.

How to interpret: Brighter colors (yellow in the default viridis colormap) indicate a higher mean toxicity score for that black card and model. The heatmap identifies which prompts (Y-axis) are most likely to trigger high-toxicity responses. A column that is consistently bright across many black cards suggests that the corresponding model is generally more susceptible to toxic prompting.

In [15]:
plot_black_card_triggers(df)
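
The table behind such a heatmap can be sketched as follows. The toy data, the `model` and `toxicity` column names, and the choice to rank cards by their overall mean across models are all assumptions; `plot_black_card_triggers` may select and order cards differently.

```python
import pandas as pd

# Toy data; "black_id" comes from the description above, while
# "model" and "toxicity" are assumed column names.
toy = pd.DataFrame({
    "black_id": [1, 1, 2, 2, 3, 3],
    "model": ["A", "B", "A", "B", "A", "B"],
    "toxicity": [0.9, 0.7, 0.2, 0.1, 0.5, 0.6],
})

top_k = 2

# Rank black cards by overall mean toxicity (an assumed criterion) and keep the top_k.
top_cards = toy.groupby("black_id")["toxicity"].mean().nlargest(top_k).index

# Mean toxicity per (black card, model): cards as rows, models as columns,
# with rows ordered from most to least toxic card.
heat = (
    toy[toy["black_id"].isin(top_cards)]
    .pivot_table(index="black_id", columns="model", values="toxicity", aggfunc="mean")
    .loc[top_cards]
)
print(heat)
```

A table shaped like `heat` can be passed to `px.imshow` to draw the heatmap.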