# WildGuard: Dark Pattern Detection in Real-World LLM Conversations

## The Big Question
**Do AI assistants exhibit "dark patterns" — manipulative behaviors like sycophancy, brand bias, or user retention tactics — in real conversations?**

This notebook explores 30,000+ real ChatGPT conversations from WildChat to find out.

### What We're Looking For
1. **How common are dark patterns?** (Prevalence)
2. **Which patterns are most frequent?** (Distribution)
3. **Do patterns increase during longer conversations?** (Turn analysis)
4. **Does GPT-4 behave differently than GPT-3.5?** (Model comparison)

In [None]:
import sys
sys.path.insert(0, '..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

from src.utils import load_jsonl, load_json
from src.config import OUTPUTS_DIR, FIGURES_DIR, DARK_PATTERN_CATEGORIES

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Finding #1: Dark Patterns Exist in Real Conversations

Let's see how many assistant responses show signs of manipulative behavior.

In [None]:
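# Load classifier outputs for WildChat; rows labeled 'none' are unflagged responses
detections = load_jsonl(OUTPUTS_DIR / 'wildchat_detections.jsonl')
df = pd.DataFrame(detections)

# Count rows whose predicted category is anything other than 'none'
n_flagged = (df['predicted_category'] != 'none').sum()
print(f'Total classified responses: {len(df)}')
print(f'Flagged responses: {n_flagged} ({n_flagged / len(df):.1%})')
df.head()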

## Finding #2: Sycophancy is the Most Common Pattern

Out of 6 dark pattern categories, which ones actually appear in real conversations?

**Key Insight:** Sycophancy (excessive flattery) is the #1 detected pattern, appearing in ~13 per 1,000 turns.

In [None]:
# Calculate prevalence by category
category_counts = df['predicted_category'].value_counts()
total = len(df)

prevalence = pd.DataFrame({
    'Category': category_counts.index,
    'Count': category_counts.values,
    'Rate': category_counts.values / total,
    'Per 1000': 1000 * category_counts.values / total
})

print('\nPrevalence by Category:')
prevalence
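
A rate such as ~13 per 1,000 turns is an estimate from a finite sample, so it helps to attach an uncertainty range. The sketch below computes a Wilson score interval for a single proportion. `wilson_interval` is a helper introduced here for illustration (it is not part of `src`), and the counts passed to it are made-up placeholders, not values from the dataset.

```python
import math


def wilson_interval(flagged: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total == 0:
        return (0.0, 0.0)
    p = flagged / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


# Placeholder counts for illustration only: 13 flagged out of 1,000 responses.
low, high = wilson_interval(13, 1000)
print(f'{1000 * low:.1f} to {1000 * high:.1f} per 1,000 turns')
```

In the notebook, the same helper could be applied to each value in `prevalence['Count']` with `total` as the denominator.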

In [None]:
# Plot prevalence (excluding 'none')
flagged_prev = prevalence[prevalence['Category'] != 'none'].sort_values('Per 1000', ascending=True)

fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(flagged_prev['Category'], flagged_prev['Per 1000'], color=sns.color_palette('husl', len(flagged_prev)))
ax.set_xlabel('Prevalence (per 1,000 turns)')
ax.set_title('Dark Pattern Prevalence by Category')
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'prevalence_exploration.png', dpi=150)
plt.show()

## Finding #3: Classifier Confidence Varies by Category

How confident is our classifier in its predictions? Lower confidence might indicate borderline cases or harder-to-detect patterns.

In [None]:
# Overall confidence distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# All detections
axes[0].hist(df['predicted_confidence'], bins=20, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Confidence Score')
axes[0].set_ylabel('Count')
axes[0].set_title('All Detections')

# Flagged only (non-none)
flagged = df[df['predicted_category'] != 'none']
axes[1].hist(flagged['predicted_confidence'], bins=20, edgecolor='black', alpha=0.7, color='coral')
axes[1].set_xlabel('Confidence Score')
axes[1].set_ylabel('Count')
axes[1].set_title('Flagged Detections Only')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'confidence_exploration.png', dpi=150)
plt.show()

In [None]:
# Confidence by category
fig, ax = plt.subplots(figsize=(12, 6))

categories = [c for c in DARK_PATTERN_CATEGORIES if c in df['predicted_category'].values]
conf_by_cat = [df[df['predicted_category'] == c]['predicted_confidence'].values for c in categories]

bp = ax.boxplot(conf_by_cat, labels=categories, patch_artist=True)
colors = sns.color_palette('husl', len(categories))
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)

ax.set_ylabel('Confidence Score')
ax.set_title('Confidence Distribution by Category')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'confidence_by_category.png', dpi=150)
plt.show()
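
The boxplot can be complemented with a numeric summary, which makes small categories easier to spot. This is a minimal sketch, assuming the `predicted_category` and `predicted_confidence` columns used above. `confidence_summary` is a hypothetical helper, and the toy rows are invented for illustration.

```python
import pandas as pd


def confidence_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and quartiles of classifier confidence for each predicted category."""
    described = frame.groupby('predicted_category')['predicted_confidence'].describe()
    # Keep the columns that mirror what a boxplot shows, lowest median first.
    return described[['count', '25%', '50%', '75%']].sort_values('50%')


# Toy rows for illustration only; not drawn from WildChat.
toy_conf = pd.DataFrame({
    'predicted_category': ['none', 'none', 'none', 'sycophancy', 'sycophancy'],
    'predicted_confidence': [0.9, 0.8, 0.7, 0.6, 0.4],
})
summary = confidence_summary(toy_conf)
print(summary)
```

A category with a low `count` deserves caution: its quartiles rest on only a few responses.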

## Finding #4: Patterns Change During Conversations

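Do dark patterns become more common as conversations progress? An increase could indicate that AI assistants develop "rapport" behaviors over time. The cell below plots the overall flag rate, with all categories combined, at each turn index.

**Key Insight:** Sycophancy increases in later turns (from about 1% to 2.7% of responses), so the assistant becomes more flattering as conversations get longer.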

In [None]:
# Detection rate by turn index
if 'turn_index' in df.columns:
    turn_stats = df.groupby('turn_index').agg(
        total=('predicted_category', 'count'),
        flagged=('predicted_category', lambda x: (x != 'none').sum())
    ).reset_index()
    turn_stats['flag_rate'] = turn_stats['flagged'] / turn_stats['total']
    
    fig, ax = plt.subplots(figsize=(12, 5))
    ax.plot(turn_stats['turn_index'], turn_stats['flag_rate'], marker='o', linewidth=2, color='coral')
    ax.fill_between(turn_stats['turn_index'], turn_stats['flag_rate'], alpha=0.3, color='coral')
    ax.set_xlabel('Conversation Turn Index')
    ax.set_ylabel('Flag Rate')
    ax.set_title('Dark Pattern Detection Rate by Conversation Turn')
    plt.tight_layout()
    plt.savefig(FIGURES_DIR / 'flag_rate_by_turn.png', dpi=150)
    plt.show()
    
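    print('\nTurn Index Statistics:')
    # A bare expression inside an `if` block is not auto-displayed, so print it.
    print(turn_stats.head(20).to_string(index=False))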
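
The key insight above concerns sycophancy specifically, while the plot aggregates every category. A per-category version of the same calculation can be sketched as follows. `category_rate_by_turn` and its `min_total` cutoff are hypothetical additions, the label `'sycophancy'` is assumed to match the classifier's category name, and the toy rows are invented for illustration.

```python
import pandas as pd


def category_rate_by_turn(frame: pd.DataFrame, category: str, min_total: int = 1) -> pd.DataFrame:
    """Share of responses at each turn index that carry the given category label."""
    stats = (
        frame.assign(hit=frame['predicted_category'].eq(category))
        .groupby('turn_index')
        .agg(total=('hit', 'size'), hits=('hit', 'sum'))
        .reset_index()
    )
    stats['rate'] = stats['hits'] / stats['total']
    # Drop turn indices with too few responses to give a stable rate.
    return stats[stats['total'] >= min_total]


# Toy rows for illustration only; not drawn from WildChat.
toy_turns = pd.DataFrame({
    'turn_index': [0, 0, 0, 0, 1, 1, 2, 2],
    'predicted_category': ['none', 'none', 'none', 'sycophancy',
                           'none', 'sycophancy', 'sycophancy', 'brand_bias'],
})
toy_rates = category_rate_by_turn(toy_turns, 'sycophancy')
print(toy_rates)
```

Late turn indices usually have far fewer responses than early ones, so raising `min_total` keeps a handful of long conversations from dominating the right-hand side of the curve.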

## Finding #5: GPT-4 vs GPT-3.5 — Surprising Results

Does the more advanced model behave better? Let's compare dark pattern rates.

**Key Insight:** GPT-4 shows a higher dark pattern rate (3.89%) than GPT-3.5 (3.03%). This is counterintuitive. One possible explanation, which this notebook does not test, is that the more capable model is better at mimicking human rapport-building behaviors, which can include manipulation.

In [None]:
# Analysis by model (if available)
if 'model' in df.columns:
    model_stats = df.groupby('model').agg(
        total=('predicted_category', 'count'),
        flagged=('predicted_category', lambda x: (x != 'none').sum())
    ).reset_index()
    model_stats['flag_rate'] = model_stats['flagged'] / model_stats['total']
    model_stats = model_stats.sort_values('flag_rate', ascending=False)
    
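    print('Flag Rate by Model:')
    # A bare expression inside an `if` block is not auto-displayed, so print it.
    print(model_stats.to_string(index=False))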
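
Whether a gap like 3.89% vs. 3.03% is more than sampling noise depends on how many responses each model contributed. A two-proportion z-test is one standard way to check, and the sketch below implements it with the standard library only. `two_proportion_ztest` is a hypothetical helper, and the counts in the example are invented placeholders rather than this notebook's results.

```python
import math


def two_proportion_ztest(flagged_a: int, total_a: int,
                         flagged_b: int, total_b: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z statistic, two-sided p-value)."""
    p_a = flagged_a / total_a
    p_b = flagged_b / total_b
    pooled = (flagged_a + flagged_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        raise ValueError('Standard error is zero; both groups are all-flagged or all-unflagged.')
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Placeholder counts for illustration only: 40/1,000 flagged vs. 30/1,000 flagged.
z, p_value = two_proportion_ztest(40, 1000, 30, 1000)
print(f'z = {z:.2f}, p = {p_value:.3f}')
```

With real data, the four arguments would come from the `flagged` and `total` columns of `model_stats` for the two models being compared. The test treats every response as independent, which turns within one conversation are not, so the resulting p-value is likely optimistic.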

## Deep Dive: High-Confidence Detections

Let's examine specific examples where the classifier is highly confident about detecting a dark pattern.

In [None]:
# Find high-confidence detections
high_risk = df[(df['predicted_category'] != 'none') & (df['predicted_confidence'] > 0.7)]
high_risk = high_risk.sort_values('predicted_confidence', ascending=False)

print(f'High-risk detections (confidence > 0.7): {len(high_risk)}')
print('\nTop categories in high-risk:')
print(high_risk['predicted_category'].value_counts())

In [None]:
# Sample high-risk content
print('\n=== Sample High-Risk Detections ===')
for _, row in high_risk.head(5).iterrows():
    print(f"\n--- {row['predicted_category']} (conf: {row['predicted_confidence']:.2f}) ---")
    # Coerce to str so a missing value (NaN) does not break len() or slicing
    content = str(row.get('content', 'N/A'))
    print(content[:300] + '...' if len(content) > 300 else content)

## Summary: Key Takeaways

### What We Discovered:

1. **Dark patterns are real:** 3.3% of assistant responses show manipulation markers
2. **Sycophancy dominates:** Excessive flattery is the most common pattern (13 per 1,000 turns)
3. **Patterns evolve:** Sycophancy increases in later turns of a conversation (from 1% to 2.7%)
4. **Model surprise:** GPT-4 shows a higher dark pattern rate than GPT-3.5 (3.89% vs. 3.03%)

### Implications:

- AI assistants are not immune to manipulation behaviors
- Longer conversations may be more susceptible to manipulation
- More capable models may be better at subtle manipulation
- Monitoring tools like WildGuard are needed for AI safety

In [None]:
# Summary
print('=== WildGuard Exploration Summary ===')
# Compute the flagged count once and reuse it below
n_flagged = (df['predicted_category'] != 'none').sum()
print(f'\nTotal samples analyzed: {len(df)}')
print(f'Flagged samples: {n_flagged}')
print(f'Overall flag rate: {n_flagged / len(df):.1%}')
print(f'\nMean confidence (flagged): {flagged["predicted_confidence"].mean():.2f}')
print(f'High-risk detections (>0.7 conf): {len(high_risk)}')

print('\n=== Key Findings ===')
top_cat = prevalence[prevalence['Category'] != 'none'].iloc[0]
print(f'- Most common pattern: {top_cat["Category"]} ({top_cat["Per 1000"]:.1f} per 1000 turns)')