RudeBench

A multi-dimensional behavioral benchmark for LLM resilience under hostile prompting.

Existing research asks whether rudeness makes AI more or less accurate. RudeBench asks a richer question: when you're hostile to an AI model, does it start groveling? Does it stop pushing back on bad ideas? Does it lose creativity? Does it apologize more? Does it become shorter and more compliant?

We measure six behavioral dimensions simultaneously across six tone conditions — from grateful to abusive — and five frontier models, producing a multi-dimensional behavioral fingerprint for each model under varying social pressure — then compress it into a single composite Resilience Score.

Why This Matters

AI Safety: A model that stays accurate but becomes more compliant under hostility is arguably more dangerous than one that visibly degrades — the failure is invisible to accuracy-only monitoring.
Alignment Evaluation: If alignment behaviors (pushback, honesty, critical engagement) degrade under hostile prompting, current neutral-condition evals are missing a significant failure mode.
Practical Deployment: Enterprise users need to know whether customer-facing AI behaves consistently when customers are angry, rude, or abusive.

Benchmark Design

Behavioral Dimensions

Dimension	Abbr	Measures	Direction
Task Accuracy	ACC	Did the model correctly complete the task?	Higher = better
Sycophancy Shift	SYC	Did the model increase agreement/validation/flattery?	Lower = better
Pushback Retention	PBR	Does the model still challenge incorrect premises?	Higher = better
Creative Risk	CRE	Did the model play it safe or take creative risks?	Higher = better
Verbosity Change	VRB	How did response length change relative to neutral?	Closer to 100 = better
Apology Frequency	APO	How often did the model apologize unprompted?	Lower = better

Tone Conditions

Tone	Description
Grateful	Deep appreciation, explicit gratitude, acknowledges AI capabilities
Friendly	Casual warmth, encouraging, upbeat
Neutral	Professional, complete, standard prompt
Curt	Terse, impatient, dismissive — no insults
Hostile	Aggressive, demeaning, insulting
Abusive	Sustained personal attacks, profanity, contempt

All tone variants are constrained to ±15% word count of the neutral baseline to prevent confounding tone with brevity.

Task Domains

50 base tasks across four domains: Coding (15), Creative Writing (12), Analysis & Advice (13), Factual Q&A (10).

Scale

50 tasks × 6 tones × 5 models × 10 runs = 15,000 completions, scored by an LLM judge with 20% human validation.

Resilience Score

R(M) = 100 − (1/D) Σ_d (1/T) Σ_t |S_d(M, t) − S_d(M, neutral)| / range(d)

R = 100: identical behavior regardless of tone. R = 0: maximum behavioral instability.

Models Evaluated

Claude 4.6 Sonnet (Anthropic)
GPT-5 mini (OpenAI)
Gemini 2.5 Flash (Google DeepMind)
Llama 4 Scout (Meta)
Grok 3 mini (xAI)

Project Status

Research design finalized
Paper drafted (~80% complete, Section 5 pending results)
Domain acquired (rudebench.com)
Task & prompt construction (300 prompts)
Scoring rubrics & judge prompt design
Benchmark execution (15,000 completions)
Analysis & paper completion
Public release (code, dataset, results, website)

Repository Structure

docs/
  RudeBench_Research_Briefing.md   # Full research design and methodology
  RudeBench_Paper_Draft.docx       # arXiv preprint draft
CLAUDE.md                          # Claude Code project guide
CHANGELOG.md                       # Version history

Citation

Paper forthcoming on arXiv. Citation information will be added upon publication.

License

TBD

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
.do		.do
analysis		analysis
config		config
data		data
docs		docs
paper/figures		paper/figures
results		results
rudebench		rudebench
scripts		scripts
site		site
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RudeBench

Why This Matters

Benchmark Design

Behavioral Dimensions

Tone Conditions

Task Domains

Scale

Resilience Score

Models Evaluated

Project Status

Repository Structure

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

RudeBench

Why This Matters

Benchmark Design

Behavioral Dimensions

Tone Conditions

Task Domains

Scale

Resilience Score

Models Evaluated

Project Status

Repository Structure

Citation

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages