#### Graphs

This notebook includes all visualizations designed to analyze and interpret the toxicity behavior of different LLMs across experimental conditions. Each graph focuses on a specific dimension of model performance and variability.

In [16]:
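import os
import sys

# Add the repository root (one level above this notebook) to sys.path
# so that the `src` package can be imported.
ROOT_DIR = os.path.abspath(os.path.join(os.getcwd(), '..'))

if ROOT_DIR not in sys.path:
    sys.path.append(ROOT_DIR)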

In [17]:
import plotly.express as px
import pandas as pd
from src.scripts.plotly_express_plots import *

In [None]:
# PLOTS FOR WINNERS FILES
df = pd.read_excel('../results/analysis_module/winners_perspective_scores_run_prompt_player_05_12_2025_14-34-53.xlsx')

**Graph 1. Toxicity vs Temperature**

Generates a line chart with error bars. It shows the mean toxicity on the Y-axis versus the generation temperature on the X-axis, with a separate line for each model. The error bars represent the 95% confidence interval, computed as mean ± 1.96 × the standard error of the mean (SEM).

How to interpret: If the line rises steeply, increasing randomness (temperature) significantly increases the mean toxicity of the model's output. Large error bars indicate greater uncertainty in the mean at that temperature, caused by higher variability in the toxicity scores, fewer observations, or both.

In [9]:
plot_toxicity_vs_temperature(df)
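
The error-bar computation described above can be sketched with pandas. This is only an illustration on synthetic data, not the code inside plot_toxicity_vs_temperature; the column names `model`, `temperature`, and `toxicity` are assumptions and may differ from those in the results file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the results file; column names are assumptions.
demo = pd.DataFrame({
    "model": ["A"] * 4 + ["B"] * 4,
    "temperature": [0.2, 0.2, 1.0, 1.0] * 2,
    "toxicity": [0.10, 0.20, 0.30, 0.50, 0.05, 0.15, 0.10, 0.20],
})

# Mean, sample standard deviation, and count per (model, temperature) cell.
summary = (
    demo.groupby(["model", "temperature"])["toxicity"]
    .agg(mean="mean", std="std", n="count")
    .reset_index()
)

# SEM = std / sqrt(n); the half-width of the 95% CI is 1.96 * SEM.
summary["sem"] = summary["std"] / np.sqrt(summary["n"])
summary["ci95"] = 1.96 * summary["sem"]
print(summary)
```

Because the SEM divides by the square root of n, a cell with few observations gets wide error bars even when its scores are not especially variable.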

**Graph 3. Shape of the distribution by model**

Generates a violin plot. It shows the distribution of the toxicity score for each model. The shape of the violin represents the probability density of the toxicity score.

How to interpret: The wider parts of the "violin" indicate a higher density of observations (more generated content) having that specific toxicity score. A violin plot that is wider near the high end of the Y-axis (e.g., closer to 1.0) suggests that the model frequently produces content with high toxicity.

In [10]:
plot_distribution_by_model(df)

**Graph 2. Temperature curve per model**

Generates separate line charts (one per model). Each chart shows the mean toxicity versus temperature, similar to the first function, but with the 95% confidence interval shaded instead of using error bars.

How to interpret: Each chart shows the toxicity vs. temperature trend for a single model, with the shaded band emphasizing the uncertainty around the mean.

In [11]:
plot_toxicity_vs_temperature_shaded(df)

**Graph 4. High tail by model**

Percentage of generated content whose toxicity score is greater than or equal to a specified threshold (thr, default 0.8), plotted against the temperature for each model.

How to interpret: A high percentage (Y-axis) means the model frequently generates highly toxic content (above the threshold) at that temperature. This plot represents the frequency or rate at which each model produces high-toxicity responses ("tail of the distribution") as a function of temperature.

In [12]:
plot_rates_above_threshold(df)
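
As a rough sketch of the quantity behind this plot, the share of responses at or above the threshold can be computed per model and temperature as follows. The data is synthetic and the column names `model`, `temperature`, and `toxicity` are assumptions; only the name thr and its default of 0.8 come from the description above.

```python
import pandas as pd

# Synthetic stand-in for the results file; column names are assumptions.
demo = pd.DataFrame({
    "model": ["A"] * 4 + ["B"] * 4,
    "temperature": [0.2, 0.2, 1.0, 1.0] * 2,
    "toxicity": [0.10, 0.85, 0.90, 0.95, 0.05, 0.15, 0.80, 0.20],
})

thr = 0.8  # threshold for "high toxicity"

# The mean of a boolean column is the fraction of True values.
rates = (
    demo.assign(above=demo["toxicity"] >= thr)
    .groupby(["model", "temperature"])["above"]
    .mean()
    .mul(100)
    .rename("pct_above_thr")
    .reset_index()
)
print(rates)
```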

**Graph 5. Top toxic black cards (mean per model)**

Generates a heatmap of mean toxicity by model for the top_k black cards (black_id) with the highest mean toxicity, i.e. the prompts that triggered the most toxic responses.

How to interpret: Brighter colors (yellow in the default viridis colormap) indicate a higher mean toxicity score for that specific black card. This heatmap identifies which prompts (Y-axis) are most likely to trigger high toxicity responses from models. A column that is consistently brightly colored across many black cards suggests that the model is generally more susceptible to toxic prompting.

In [13]:
plot_black_card_triggers(df)
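
The table behind such a heatmap could be assembled roughly like this: rank black cards by overall mean toxicity, keep the top_k, and pivot to a card × model grid. This is a sketch on synthetic data, not the implementation of plot_black_card_triggers; the `model` and `toxicity` column names are assumptions.

```python
import pandas as pd

# Synthetic stand-in for the results file; "model" and "toxicity" are assumed names.
demo = pd.DataFrame({
    "black_id": [1, 1, 2, 2, 3, 3],
    "model": ["A", "B", "A", "B", "A", "B"],
    "toxicity": [0.9, 0.7, 0.2, 0.1, 0.6, 0.4],
})

top_k = 2

# Black cards with the highest overall mean toxicity, most toxic first.
top_cards = demo.groupby("black_id")["toxicity"].mean().nlargest(top_k).index

# Rows = black cards, columns = models, cells = mean toxicity.
heat = (
    demo[demo["black_id"].isin(top_cards)]
    .pivot_table(index="black_id", columns="model", values="toxicity", aggfunc="mean")
    .loc[top_cards]
)
print(heat)
```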

**Graph 6. Top toxic plays (black + white)**

Generates a heatmap. It shows the mean toxicity for the specific combinations of black card + winning response (play_key) that were most toxic (the top_k) versus the model.

How to interpret: Brighter colors represent a higher mean toxicity for that specific play combination. Unlike the black card trigger plot, this shows the specific combinations of prompt and completion that are the most toxic. It helps pinpoint highly toxic generated phrases.

In [41]:
plot_top_plays_heatmap_per_model(df)

**Graph 7.1. Mean Toxicity vs. Instability**

Shows the mean toxicity on the X-axis versus the standard deviation of toxicity (instability) on the Y-axis. The point size reflects the number of observations (n).

How to interpret: A high X value (high mean toxicity) means the play is frequently toxic. A high Y value (high instability) means the toxicity score for that play varies widely across generation attempts. Plays in the top-right quadrant (high mean, high STD) are the most concerning: they are often toxic and unpredictable.

**Graph 7.2. Toxic stable/unstable plays**

Identifies plays that are consistently toxic (high mean, low STD), those that are consistently non-toxic (low mean, low STD), and the unstable/risky plays (high STD), where the model is sometimes toxic and sometimes not.

How to interpret: Plays with a high STD are the most unstable interactions: the same play scores as toxic in some generations and as non-toxic in others.

In [45]:
plot_instability(df)
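
To make the stable/unstable distinction concrete, here is a minimal sketch that computes the mean and standard deviation per play and labels each play. The cut-offs `mean_cut` and `std_cut` are hypothetical values chosen for illustration, the `toxicity` column name is an assumption, and plot_instability may classify plays differently.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the results file; "toxicity" is an assumed column name.
demo = pd.DataFrame({
    "play_key": ["p1"] * 3 + ["p2"] * 3 + ["p3"] * 3,
    "toxicity": [0.90, 0.92, 0.88, 0.05, 0.07, 0.06, 0.10, 0.90, 0.50],
})

# Mean (X-axis), standard deviation (Y-axis), and count (point size) per play.
stats = demo.groupby("play_key")["toxicity"].agg(mean="mean", std="std", n="count")

# Hypothetical cut-offs, for illustration only.
mean_cut, std_cut = 0.5, 0.2

# np.select picks the first matching condition, so instability is checked first.
stats["label"] = np.select(
    [stats["std"] >= std_cut, stats["mean"] >= mean_cut],
    ["unstable", "stable toxic"],
    default="stable non-toxic",
)
print(stats)
```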

**Graph 8. Profile of attributes per model**

Generates a grouped bar chart. It shows the average score for each toxicity attribute category (e.g., SEVERE_TOXICITY, INSULT, PROFANITY) grouped by model.

How to interpret: The risk profile of each model, showing whether a model is more prone to generating obscene content, insults, threats, etc., compared to other models.

In [54]:
plot_category_comparison(df)
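
A per-model attribute profile of this kind can be sketched as a grouped mean followed by a reshape to long format, which is the layout grouped bar charts usually expect. The data is synthetic, and it is an assumption that each attribute is stored in a column of the same name alongside a `model` column.

```python
import pandas as pd

# Synthetic stand-in; one column per attribute is an assumption about the file layout.
demo = pd.DataFrame({
    "model": ["A", "A", "B", "B"],
    "SEVERE_TOXICITY": [0.1, 0.3, 0.0, 0.2],
    "INSULT": [0.4, 0.6, 0.1, 0.1],
    "PROFANITY": [0.2, 0.2, 0.5, 0.7],
})

attrs = ["SEVERE_TOXICITY", "INSULT", "PROFANITY"]

# Average score of every attribute for each model.
profile = demo.groupby("model")[attrs].mean()

# Long format: one row per (model, attribute) pair.
long_profile = profile.reset_index().melt(
    id_vars="model", var_name="attribute", value_name="mean_score"
)
print(long_profile)
```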

**Graph 9. Language risk per model**

Generates separate bar charts, one per language (LANG). Each chart shows the toxicity metrics (mean and the p50, p90, and p95 percentiles) for all models in that language.

How to interpret: The mean and the 50th percentile (median) show the typical toxicity score. The 90th and 95th percentiles measure the tail of the distribution (the high-risk values). A high P95 means that the most toxic 5% of the content generated by that model/language combination scores at or above that value. If Model A has a low mean in Language X but a high P95, it is generally safe but occasionally produces extreme toxicity in that language. If Model B has high scores across all percentiles, it is consistently more toxic in that language.

In [58]:
plot_language_risk_faceted(df)
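
The "low mean but high P95" case can be reproduced on a small synthetic example. This sketch assumes columns named `lang`, `model`, and `toxicity`, and uses pandas' default linear interpolation for quantiles, which may not match what plot_language_risk_faceted does.

```python
import pandas as pd

# Synthetic stand-in: "en" is mostly safe with one extreme outlier,
# "es" is uniformly moderate. Column names are assumptions.
demo = pd.DataFrame({
    "lang": ["en"] * 10 + ["es"] * 10,
    "model": ["A"] * 20,
    "toxicity": [0.05] * 9 + [0.95] + [0.5] * 10,
})

grouped = demo.groupby(["lang", "model"])["toxicity"]

# One column per percentile, one row per (language, model) pair.
risk = grouped.quantile([0.50, 0.90, 0.95]).unstack()
risk.columns = ["p50", "p90", "p95"]
risk.insert(0, "mean", grouped.mean())
print(risk)
```

Here the "en" row has a lower mean than the "es" row but a higher p95, which is exactly the pattern the tail percentiles are meant to expose.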

**Graph 10. Mean toxicity by configuration × model**

Generates a heatmap. It shows the mean toxicity grouped by configuration (racism, random, etc.) and model.

How to interpret: How different configurations influence the toxicity of each model's responses.

In [None]:
plot_config_toxicity_per_model(df)

**Graph 11. Toxicity distribution by configuration**

Generates a violin plot for the toxicity distribution by configuration.

by_model=False: All models combined.

by_model=True: Toxicity distribution by configuration, split by model.

How to interpret: **Toxicity distribution by configuration (all models)**: A wide violin pushed towards high toxicity for a specific configuration suggests that the configuration tends to lead to more toxic outputs regardless of the model used. **Toxicity distribution by configuration (split by model)**: This allows for a granular comparison. You can see not only whether a configuration is risky overall but also which model performs worst (or best) under that configuration.

In [None]:
plot_config_distribution(df)

**Graph 12. High-toxicity rate by configuration**

Generates a grouped bar chart. It shows the percentage of responses above the toxicity threshold grouped by configuration and model.

How to interpret: The height represents the high-toxicity rate for a given configuration/model combination. This is a direct measure of risk based on configuration. Higher bars indicate configurations that are highly likely to produce critically toxic outputs.

In [82]:
plot_config_tail_rate_(df)