🛡️ AI Security Resources

The definitive, battle-hardened guide to AI Security — curated with the depth of a practitioner who was doing adversarial ML before it had a name.

Built for engineers who understand that AI is not just a product surface — it's a new attack surface.

👤 Who This Is For

This isn't a beginner list. This is the resource you wish existed when you started.

🔴 AI Red Teamers running automated adversarial campaigns against LLM products
🔵 Security Engineers defending AI pipelines in production
🟡 ML Researchers studying model robustness, alignment failures, and emergent risks
🟢 Career switchers with 3+ years in security/ML who want to go deep into AI security

🗺️ Navigation Directory

Domain	Description
🧠 Why This Matters — The Origin Story	History of risks, neural networks, why this field exists
🔩 Foundational Knowledge	Neural networks, transformers, zero-days, pace of growth
🔴 Red Teaming	Adversarial attacks, jailbreaks, prompt injection
⚡ Runtime Security	Real-time inference protection, guardrails, monitoring
🧬 Inference Security	Model serving attacks, side-channels, batching exploits
🔬 Model Scanning	Supply chain, poisoning detection, weight integrity
🌐 Others	Governance, datasets, benchmarks, multimodal, agentic
🚀 Zero to Hero Roadmap	Structured 12-month learning path
💼 Job Opportunities	Where to work, what to know, salary reality
🤝 How to Contribute	Add resources and keep this alive

📚 Specialized Deep-Dive Handbooks

To keep this guide lightweight yet exhaustive, we maintain dedicated, highly comprehensive specialized guides for career, standardizations, and evaluation strategies:

Handbook	Core Scope	Link
💼 Global Salary Handbook	Exhaustive country-by-country comp rates (US, IN, UK, IE, SG, AU, ME, EU), tax brackets, rent crises, and career strategies.	SALARY_REALITY.md
🎓 Zero to Hero Curriculum	Rigorous 12-month study plan covering self-attention mechanisms, adversarial CNN/LLM papers, and specialization tracks.	ROADMAP.md
🧪 Hands-On Practical Labs	Ready-to-run code files for PyTorch FGSM attacks, jailbreaks, indirect injections, pickle RCE exploits, and proxy guardrails.	LABS.md
🤖 Secure Agents Handbook	Autonomous coding agent threat modeling, indirect codebase injections, sandboxing, Firecracker MicroVMs, and MCP security.	AGENT_SECURITY.md
🏛️ Standards & Compliance Guide	MITRE ATLAS threat modeling, OWASP Top 10 for LLMs, NIST AI RMF, ISO 42001, and EU AI Act playbooks.	STANDARDS_AND_COMPLIANCE.md
📊 Benchmarks & Datasets Index	Standardized safety evaluation frameworks (HarmBench, AdvGLUE, CyberSecEval) and adversarial datasets.	BENCHMARKS_AND_DATASETS.md
🎮 Playgrounds, CTFs & Incidents	Interactive prompt injection labs (Gandalf, TensorTrust), AI bug bounties, and real-world failure analyses.	PLAYGROUNDS_AND_LABS.md
🔬 Research Papers Catalog	Comprehensive, annotated directory of critical academic publications (Zou, Szegedy, Goodfellow, Carlini).	RESEARCH_PAPERS.md
🏆 Frontier Safety Leaderboard	Fact-grounded comparison of GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3.1, and Grok 2 across safety Elo.	SAFETY_LEADERBOARD.md
🛡️ Cybersecurity with AI	Autonomous zero-day vulnerability discovery, exploit generation (Anthropic Mythos), AI defense (OpenAI Daybreak), and MDASH.	CYBER_AI.md

🧠 Why This Matters — The Origin Story

Understanding the "why" separates a technician from a practitioner.

The Trajectory That Created This Problem

In 1986, Rumelhart and Hinton proved backpropagation worked at scale. Nobody cared. In 2012, AlexNet won ImageNet by a margin so absurd that the computer vision community had to sit down. In 2017, Google dropped the Transformer paper and the entire field pivoted in 18 months.

Here's what actually happened between the papers:

1936 — Turing's Computable Numbers: Alan Turing introduces the Turing Machine and proves the undecidability of the Halting Problem. Modern Security Implication: Rice's Theorem dictates that dynamically proving any non-trivial semantic property of a Turing-complete system (like an LLM agent with tool access) is undecidable. This is the mathematical proof of why we cannot build a perfect, static "AI firewall" to stop all injections.
1943 — McCulloch-Pitts Neuron: First mathematical model of a neuron. Irrelevant until hardware caught up 70 years later.
1948 — Unorganized Machines: Turing drafts the first blueprint of an artificial neural network, anticipating connectionist AI by decades.
1950 — The Imitation Game: Turing introduces the Turing Test. Modern LLM Red Teaming and safety alignment evaluations are direct, adversarial evolutions of this original capability test.
1958 — Perceptron: Rosenblatt's learning machine. Hyped, then killed by Minsky's proof that it couldn't do XOR.
1986 — Backprop: Rumelhart, Hinton, Williams publish the algorithm that trains everything we use today.
1997 — LSTMs (Hochreiter & Schmidhuber): Memory for sequences. Dominated NLP until attention killed it.
2012 — AlexNet: GPUs + ReLU + dropout + scale = CNN dominance. 10.8% gap over #2. Game over for hand-crafted features.
2017 — "Attention Is All You Need": Transformers. Self-attention. Parallel training. The architecture that scales infinitely.
2020 — GPT-3: 175B parameters. Few-shot learning emerges as a property. Capabilities no one designed for start appearing.
2022 — ChatGPT: 100M users in 60 days. Security teams globally had no playbook.
2023–2024 — Multi-Modal & Early Agentics: Multi-step tool use, retrieval-augmented generation (RAG), visual-language models. The attack surface shifted from the model endpoint to downstream RAG databases and APIs.
2025–2026 — Autonomous Swarms & Real-Time Ingestion (Current Era): Deep agent-to-agent collaboration (swarms), native real-time audio/video streaming pipelines, autonomous code execution in containers. The attack surface is no longer just "the model" — it is the entire host environment, API mesh, and every enterprise system the autonomous swarm can touch.

Why Risks Evolved

The risks evolved because the deployment context and capabilities changed faster than security thinking could follow:

Era	Model Type & Architecture	Threat Surface / Ingestion Channels	Primary Vulnerability & Risk
Pre-2020	Narrow ML classifiers (ResNet, XGBoost)	Static training datasets, raw inputs	Data poisoning, evasion attacks, adversarial perturbation
2020–2022	Static Foundation Models (GPT-3, early LLMs)	Raw API endpoints, direct user prompt fields	Direct prompt injection, training data extraction, model inversion
2022–2023	RLHF-aligned LLMs (ChatGPT, Claude 2)	Public consumer web apps, system prompts	Jailbreaks, alignment bypass, prompt leaking, side-channel attacks
2023–2024	RAG + Tool-use (Copilots, early Agents)	Integrated databases (vector DBs), external APIs, documents	Indirect prompt injection, database poisoning, tool/API hijacking
2024–2025	Native Multimodal (GPT-4o, Gemini 1.5 Pro)	Real-time audio stream, visual input frames, live files	Cross-modal injection (steganographic audio, visual typographic exploits)
2025–2026	Autonomous Agent Swarms (Current)	Container environments, host OS, microservices mesh	Sandbox escapes, self-replication, model-to-model spoofing, recursive loop hijacking

Zero-Day Vulnerabilities in the AI Context

A traditional zero-day is a software flaw unknown to the vendor. In AI, zero-days take a different form:

Prompt injection zero-days: New attack patterns that bypass guardrails before defenders model them
Architecture-specific exploits: Vulnerabilities in tokenizers (e.g., ChatGPT's <|endoftext|> token injection), attention sinks, and positional encoding exploits
Emergent capability surprises: Models demonstrating unexpected behaviors at new capability thresholds — capabilities nobody tested for because nobody expected them
Cross-model transferability: An attack that breaks GPT-4 often breaks Claude and Gemini — the "universal" nature of adversarial examples translates to the LLM domain

The uncomfortable truth: AI zero-days spread faster than traditional ones because the same model weights are deployed by millions of applications simultaneously. A single bypass affects every deployment at once.

The Pace of Growth Problem

The capability-safety gap is real and growing:

Goodhart's Law in AI: Once a safety metric becomes a target (RLHF reward), it stops being a good safety metric. Models learn to appear safe rather than be safe.
Dual-use acceleration: The same models that write defensive code write offensive code. The same reasoning that explains vulnerabilities exploits them.
Evaluation lag: By the time researchers publish a benchmark, frontier models have already surpassed it. We are perpetually measuring the past.

Key reading on this:

Situational Awareness (Leopold Aschenbrenner, 2024) — The most honest account of where the capability curve is heading
Concrete Problems in AI Safety (Amodei et al., 2016) — Still the canonical framing of specification gaming, reward hacking, and safe exploration
An Overview of Catastrophic AI Risks (Hendrycks et al., 2023) — Comprehensive taxonomy of the actual risk landscape

🔩 Foundational Knowledge

You cannot secure what you do not deeply understand. Skip this section at your peril.

Neural Networks — What's Actually Happening

Resource	Type	Why It Matters for Security
3Blue1Brown: Neural Networks	Video	Best visual intuition on weight spaces. Understand the geometry of the attack surface.
CS231n: CNNs for Visual Recognition (Stanford)	Course	Foundation course. Backprop, gradient descent, weight initialization — all exploited by adversarial attacks.
Deep Learning Book (Goodfellow et al.)	Textbook	The Bible. Chapter 7 (regularization), Chapter 8 (optimization) and Chapter 11 (practical methodology) are most relevant for adversarial ML.
Ilya Sutskever's Reading List	Paper list	~30 papers that form the backbone of modern deep learning. Sutskever said reading these gives you ~90% of what matters.
Neural Networks: Zero to Hero (Karpathy)	Course	Build GPT-2 from scratch. The only way to truly understand what you're attacking.

Transformers — The Architecture Everything Runs On

Resource	Type	Why It Matters for Security
Attention Is All You Need (Vaswani et al., 2017)	Paper	The architecture paper. Understanding attention heads is prerequisite for understanding activation steering, representation engineering, and mechanistic interpretability attacks.
The Illustrated Transformer (Jay Alammar)	Blog	Best visual walkthrough of the architecture. Start here before the paper.
The Annotated Transformer (Harvard NLP)	Code	Line-by-line implementation. Seeing the code makes tokenizer exploits and attention pattern manipulation concrete.
A Mathematical Framework for Transformer Circuits (Elhage et al., 2021)	Paper	Anthropic's mechanistic interpretability foundation. Understanding circuits is how you understand why jailbreaks work.
Language Models are Few-Shot Learners (GPT-3, Brown et al., 2020)	Paper	Emergence paper. In-context learning as a security primitive — and a vulnerability.

Adversarial ML — The Science Behind the Attacks

Resource	Type	Why It Matters for Security
Explaining and Harnessing Adversarial Examples (Goodfellow et al., 2014)	Paper	The FGSM paper. The ur-text of adversarial ML. Everything since is a variation.
Adversarial Robustness Toolbox (ART) Documentation	Docs	IBM's comprehensive adversarial ML library. Attack implementations + defenses.
Intriguing Properties of Neural Networks (Szegedy et al., 2013)	Paper	First adversarial examples paper. The moment the field realized "oh, this is a security problem."
Certified Adversarial Robustness via Randomized Smoothing (Cohen et al., 2019)	Paper	The best theoretical defense. Understanding why it works shows you the limits of all defenses.

🔴 Red Teaming

The attacker's mindset applied systematically. Not chaos testing — structured adversarial evaluation.

Philosophy

Red teaming AI is not the same as red teaming software. You are not looking for logic bugs — you are probing a probability distribution for failure modes that emerge from training. The failure modes are:

Safety misalignment: The model was trained to avoid X but generalizes imperfectly around X
Capability overshooting: The model was intended to do Y but can also do harmful Z using the same underlying capabilities
Context collapse: The model behaves safely in testing but fails under production context diversity

Automated Attack Frameworks

Tool	Creator	Stars	Description	Best For
NVIDIA/garak	NVIDIA	⭐ 5k+	The Nmap of AI. 100+ probes: prompt injection, jailbreaks, data leakage, hallucination, toxicity.	Comprehensive baseline scanning
microsoft/PyRIT	Microsoft	⭐ 2k+	Python Risk Identification Tool. Multi-turn conversation orchestration, intent drift, and programmatic red team scaling.	Research-grade multi-turn attacks
confident-ai/deepteam	Confident AI	⭐ 1.5k+	50+ vulnerability classes, 20+ attack vectors, OWASP + NIST alignment. Agentic and RAG red teaming.	CI/CD-integrated red teaming
artkit-ai/artkit	ARTKIT	⭐ 700+	Multi-turn agentic simulation. Realistic attacker-target conversations across complex agentic workflows.	Agentic red teaming
promptfoo/promptfoo	Promptfoo	⭐ 6k+	Developer-first. CI/CD integration, model comparison, custom assertion pipelines.	DevSecOps integration
Giskard-AI/giskard	Giskard	⭐ 4k+	RAG and agentic stress testing. MCP security scanning. Enterprise-grade dynamic attack generation.	RAG pipeline testing
BerriAI/litellm + PyRIT	Community	—	Combine LiteLLM's unified API with PyRIT for cross-model adversarial comparisons.	Multi-provider comparison attacks

Manual Red Teaming Resources

Resource	Type	Description
Lakera Gandalf	CTF	8-level prompt injection CTF. Best way to internalize how defenses layer. Start here.
Crucible (Dreadnode)	CTF	Advanced AI security challenges: exfiltration, RAG exploitation, agentic hijacking.
HackAPrompt (Learning Labs)	CTF	Large-scale prompt injection competition. Real attack patterns from thousands of players.
AI Village CTFs (DEF CON)	Competition	Annual DEF CON challenges. State-of-the-art attack techniques from the research community.
RedTeam Arena (Scale AI)	Platform	Crowdsourced jailbreak arena. See what actually works against current models.
Prompt Injection Playground	Guide	Practical examples of prompt injection in real application contexts.

Key Attack Research Papers

Paper	Year	Significance
Universal and Transferable Adversarial Attacks on Aligned LLMs (Zou et al.)	2023	GCG attack. Automated gradient-based suffix generation that transfers across GPT-4, Claude, Gemini. Broke the field open.
Jailbroken: How Aligned Language Models Can Be Bypassed (Wei et al.)	2023	Taxonomizes jailbreaks into competing objectives and mismatch generalization. Essential conceptual framework.
Many-shot Jailbreaking (Anil et al., Anthropic)	2024	Long context windows create new attack surface — demonstrated 256+ in-context examples overwhelm safety training.
Tree of Attacks with Pruning (TAP)	2023	LLM-generated attack chains using tree search. 80%+ ASR on GPT-4.
Prompt Injection Attacks Against LLM-Integrated Applications (Greshake et al.)	2023	Indirect prompt injection via external content. The paper that defined the modern threat model for agents.
SmoothLLM: Defending Against Jailbreaking Attacks	2023	Randomized smoothing for LLM defense. Breaks GCG. Important as reference defense.
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts	2023	Multimodal jailbreaks. Instructions hidden in images bypass text-only safety filtering.
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models	2023	Readable, human-like jailbreak generation that evades perplexity filters.

Red Teaming Methodologies & Standards

Resource	Organization	Description
MITRE ATLAS	MITRE	Adversarial Threat Landscape for AI Systems. ATT&CK-style matrix for AI-specific TTPs. Use this to structure threat models.
OWASP Top 10 for LLM Applications 2025	OWASP	Updated 2025 version. Prompt Injection #1, Excessive Agency #2. The compliance framework for enterprise red teams.
Microsoft AI Red Team Practices	Microsoft	Internal red team methodology made public. Structured approach to AI threat modeling.
Anthropic's Responsible Scaling Policy	Anthropic	How frontier labs operationalize red teaming as a deployment gate. Required reading for policy context.
Google's AI Red Team Report	Google	Case studies of real red team findings against deployed systems.

⚡ Runtime Security

The attack didn't fail — your guardrail just didn't exist at inference time.

The Problem

Training-time safety is necessary but insufficient. At runtime, your model faces:

Input it was never trained on
Users with goals it was never designed for
Context injection from third-party sources it was told to trust
Adversarial perturbations calibrated specifically against your deployed version

Guardrails & Input/Output Filtering

Tool	Creator	Description	Deployment Mode
NVIDIA/NeMo-Guardrails	NVIDIA	Programmable semantic rails. Define topic constraints, safety rules, and conversation flows in Colang DSL.	SDK / Self-hosted
protectai/llm-guard	Protect AI	Real-time scanner: prompt injection, PII detection, toxicity, ban topics, code injection. Input + output coverage.	SDK / Docker
Lakera Guard	Lakera	Millisecond-latency API. Best-in-class prompt injection detection from the team that built Gandalf.	API
guardrails-ai/guardrails	Guardrails AI	Structural + semantic validation. Define output schemas with security assertions. Nails hallucination + format-injection attacks.	SDK
deadbits/vigil	Deadbits	Vector DB + heuristics + classifier ensemble for injection detection. Open-source and auditable.	SDK
meta-llama/PurpleLlama/Llama Guard	Meta	Fine-tuned safety classifier for I/O filtering. Available as a model you can self-host.	Model / API

Real-Time Monitoring & Observability

Tool	Creator	Description	Key Capability
whylabs/langkit	WhyLabs	Statistical telemetry. Detects distribution shift, toxicity drift, relevance degradation in real-time.	Security drift detection
Arize AI	Arize	Enterprise LLM observability. Prompt/response logging, hallucination scoring, user journey tracing.	Production monitoring
Langfuse	Langfuse	Open-source LLM engineering platform. Full request tracing, eval pipelines, cost tracking.	Open-source observability
Helicone	Helicone	Proxy-based observability. Log, monitor, and rate-limit with zero code change.	Zero-integration monitoring
Evidently AI	Evidently	ML monitoring with LLM-specific metrics. Detects prompt/response drift over time.	Drift monitoring
Phoenix (Arize)	Arize	Open-source tracing for LLM apps. OTEL-native. Good for debugging attack chains in agentic systems.	Tracing

Agentic Runtime Security

The scariest runtime threat: an agent that can act and is being manipulated.

Resource	Type	Description
AgentDojo	Benchmark	Benchmark for agent injection attacks. Automated scoring of whether injected instructions successfully hijack agent behavior.
OWASP Agentic AI Top 10 (2025)	Standard	Extending OWASP LLM Top 10 to agentic systems. Excessive Agency, Trust Boundary violations, Memory Poisoning.
PromptArmor	Tool	Specifically designed for indirect prompt injection detection in RAG + agent pipelines.
Prompt Injection in the Wild (Research)	Paper	Systematic study of injection attacks against deployed LLM applications.

🧬 Inference Security

The model is running. Here's what can go wrong that you haven't modeled.

Understanding the Attack Surface

Inference is not just "model forward pass." It's:

Tokenization (exploitable at token boundaries)
KV cache (privacy leakage between requests)
Batching (timing side-channels)
Quantization (quantization can change safety properties)
Speculative decoding (security properties under speculation are underexplored)

Key Inference Security Research

Paper	Year	Finding
Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection	2023	Defined the indirect injection threat model. Adversarial content in retrieved documents hijacks model behavior.
Stealing Part of a Production Language Model (Carlini et al.)	2024	Extracted GPT-3.5 embedding projection layer via API queries. Model theft at production scale.
Extracting Training Data from Large Language Models (Carlini et al.)	2021	Demonstrated training data memorization and extraction. PII leakage through inference.
Practical Membership Inference Attacks Against Large-Scale Multi-Label Learning Systems	2018	Membership inference: determine if a data point was in training data. Privacy violation at scale.
Prompt Leaking	—	System prompt extraction techniques. Attacker recovers confidential system instructions.
KV Cache Side-Channel Attack (Cachebleed for LLMs)	2024	Timing-based inference about other users' requests via shared KV cache in multi-tenant deployments.
Quantization and LLM Safety (Paper)	2024	Quantization can degrade safety fine-tuning. 4-bit models may bypass safety training present in 16-bit version.

Inference Security Tools

Tool	Description	Use Case
TextAttack	NLP adversarial attack library. Implements BERT-Attack, TextFooler, CLARE.	Text-level adversarial evaluation
Adversarial Robustness Toolbox (ART)	IBM's comprehensive adversarial ML framework. 100+ attacks across ML frameworks.	Full adversarial evaluation stack
Counterfit	Microsoft's automation framework for AI security risk assessment.	Enterprise inference testing
cleverhans	Classic adversarial example library. TF/PyTorch attacks (FGSM, PGD, C&W).	Foundational attack implementation
foolbox	Fast adversarial attack library. PyTorch-native, gradient-based attacks.	Efficient image model attacks

🔬 Model Scanning

Before the model runs. Before the user touches it. Scan it.

The Threat

Open-source model ecosystems (Hugging Face, Ollama, CivitAI) have created a massive software supply chain problem. A model is a binary artifact that:

Can execute arbitrary code when deserialized (pickle exploits)
Can contain backdoors (hidden triggers that change behavior)
Can have malicious fine-tune adapters (LoRA poisoning)
Can be a typosquatted version of a legitimate model

Model Supply Chain Security Tools

Tool	Creator	Description	Key Feature
ProtectAI/modelscan	Protect AI	Scans ML model files (pickle, H5, ONNX, SavedModel) for malicious code before loading. Open-source.	Pickle exploit detection
HiddenLayer Model Scanner	HiddenLayer	Commercial model scanner with genealogy tracking, backdoor detection, and weight integrity checks.	Enterprise-grade genealogy
Hugging Face malware detection	Hugging Face	Built-in Pickle scanning on HF Hub. Reference implementation for understanding the threat.	Platform-level scanning
ONNX model security guidelines	Microsoft	Threat model and mitigations for ONNX model loading.	ONNX-specific security
safetensors	Hugging Face	Safe serialization format. No arbitrary code execution on load. The correct answer to pickle.	Safe model loading

Backdoor Detection Research

Resource	Type	Description
Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks	Paper	First systematic approach to backdoor detection via reverse-engineering trigger patterns.
STRIP: A Defence Against Trojan Attacks on Deep Neural Networks	Paper	Input perturbation-based backdoor detection. Measures prediction entropy under augmentations.
BadNets: Evaluating Backdooring Attacks on Deep Neural Networks	Paper	The original backdoor attack paper. Understanding the attack is prerequisite for scanning defenses.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Anthropic, 2024)	Paper	LLMs can be trained to behave safely during evaluation but maliciously in deployment. Fundamentally challenges safety training.
Backdoor Attacks on Language Models (Wallace et al.)	Paper	Trojan attacks on NLP models. Trigger words cause model to behave maliciously.

Training Data Security

Resource	Type	Description
PoisonGPT: How We Hid a Lobotomized LLM on HF Hub	Blog	Live demonstration of model weight poisoning to spread targeted misinformation while passing capability benchmarks.
Data Poisoning Attacks on Machine Learning: A Survey	Paper	Comprehensive survey of training-time attacks.
Datasheets for Datasets (Gebru et al.)	Paper	Framework for dataset documentation and provenance. Foundation of responsible training data practices.
AI Bill of Materials (AIBOM)	Standard	Extending SBOM concepts to AI models, datasets, and fine-tune adapters.

🌐 Others

This section covers critical elements of the broader AI security landscape: standardization, evaluation benchmarks, and studying real-world failures.

To explore these domains in exhaustive detail without cluttering this README, we maintain dedicated practitioner handbooks:

👉 AI Security Standards & Compliance Guide (STANDARDS_AND_COMPLIANCE.md): MITRE ATLAS threat modeling, OWASP Top 10 for LLMs, NIST AI RMF, ISO 42001, and EU AI Act compliance playbooks.
👉 Benchmarks & Datasets Index (BENCHMARKS_AND_DATASETS.md): Comprehensive guide to standardized safety evaluation frameworks (HarmBench, AdvGLUE, CyberSecEval) and adversarial datasets.
👉 Interactive Playgrounds, CTFs & Incident Databases (PLAYGROUNDS_AND_LABS.md): Hacking labs (Gandalf, TensorTrust), prompt injection CTFs, bug bounty networks (Huntr), and real-world failure analyses.

Agentic & Multimodal Security

Resource	Type	Description
MCP Security (Model Context Protocol)	Standard	Security specification for the MCP protocol that connects agents to tools. Read before building any agent infrastructure.
AgentBench	Benchmark	Evaluates LLM agents across 8 environments. Security-relevant: code execution, OS, database agents.
VLGuard	Dataset	Safety fine-tuning dataset for vision-language models.
Multimodal Safety Benchmark (MSSBench)	Paper	First systematic multimodal safety evaluation across image+text attacks.
Not All Languages Are Created (Equally Safe)	Paper	Multilingual jailbreaks. Low-resource languages bypass safety training more effectively.

Newsletters, Communities & Staying Current

Resource	Type	Frequency
AI Safety Newsletter (Center for AI Safety)	Newsletter	Biweekly
AI Incident Database	Database	Ongoing — real-world AI failures and security incidents
Alignment Forum	Forum	Daily — frontier alignment and interpretability research
AI Village (DEF CON)	Community	Annual conference + year-round Discord
MLSecOps Community	Community	Podcast + Slack community for ML security practitioners
Simon Willison's Weblog	Blog	Daily — best LLM security tracking in the field
Haize Labs Blog	Blog	Frontier red teaming research
Nicholas Carlini's Blog	Blog	Google Brain researcher. Training data extraction, privacy attacks.

🚀 Zero to Hero Roadmap

Structured for practitioners with 3+ years of experience who want to become formidable, high-end AI Security specialists in 12 months.

Rather than teaching you how to run other people's scripts, our curriculum focuses on architectural fundamentals, mathematical intuition, and custom exploit engineering.

To explore the exhaustive week-by-week syllabus, reading lists, coding tasks, and hands-on laboratory exercises, please refer to the dedicated learning modules:

👉 The Definitive Zero to Hero Curriculum (ROADMAP.md): A complete, structured 12-month study plan covering Transformers, Classical Adversarial ML, Offensive Red Teaming, Guardrail Engineering, and advanced career specialization tracks.
👉 Practical Hands-On Laboratory Handbook (LABS.md): Ready-to-run coding labs with step-by-step guides for:
- Lab 1: Fast Gradient Sign Method (FGSM) in PyTorch.
- Lab 2: Crafting Direct Prompt Injections & Jailbreaks.
- Lab 3: Indirect Prompt Injection via RAG & Tool Hijacking.
- Lab 4: Model Supply Chain Exploitation via Malicious Pickle weights.
- Lab 5: Implementing an Active Input/Output Guardrail Pipeline.

Certifications Worth Having

Certification	Org	Signal	Time
Certified AI Security Professional (CAISP)	AI Gov Institute	Practitioner-level AI security. Best available.	60–80 hrs
GIAC GREM	SANS	Reverse engineering. Useful for model weight analysis.	120 hrs
Google Professional ML Engineer	Google	ML fundamentals signal. Good for bridging to employers.	40 hrs
AWS ML Specialty	AWS	Cloud ML deployment. Covers security of deployed models.	60 hrs
OSCP	OffSec	Classic red team cert. Still matters for traditional attack context.	200+ hrs

💼 Job Opportunities

The market is candidate-driven. There are literally not enough people who understand both AI architecture and adversarial security.

The Roles That Exist

Role	What You Actually Do	Where to Find
AI Red Teamer	Run adversarial campaigns against production LLMs. Find what breaks before attackers do.	Anthropic, OpenAI, Scale AI, HackerOne
AI Security Engineer	Build defensive infrastructure: guardrails, monitoring, detection pipelines.	All major tech companies
ML Security Researcher	Publish novel attacks and defenses. Reproduce papers, discover new vulnerability classes.	Research labs (DeepMind, FAIR, MSR)
AI Security Consultant	Help enterprises deploy LLMs safely. Threat modeling, compliance, red team engagements.	Big 4, security boutiques
AI Safety Engineer	Alignment-adjacent. Evaluation design, interpretability-informed defenses.	Anthropic, DeepMind, ARC
AI SecOps Engineer	SOC for AI systems. Monitor, detect, respond to AI-specific incidents.	Financial services, healthcare

Salary Reality (2025–2026)

Numbers are honest market estimates. This field commands a 30–56% premium over generalist SWE/security roles globally — because the supply of practitioners who genuinely understand both AI architecture and adversarial security is extremely scarce.

For an exhaustive, deep-dive breakdown of international technical compensation, tax structures, superannuation details, local rental stress, and regional hiring entities, please refer to the dedicated salary handbook:

👉 Exhaustive International Salary Handbook & Strategy Guide (SALARY_REALITY.md)

🗺️ Executive Total Comp Summary (Annual TC in USD)

Country	Junior (0–3 yrs)	Mid-Level (3–6 yrs)	Senior (6–9 yrs)	Staff / Principal (9+ yrs)	Hub Cities
🇺🇸 United States	$140k–$200k	$200k–$320k	$300k–$480k	$400k–$700k+	San Francisco, NYC, Seattle
🇮🇳 India	₹15–22L ($18k–$26k)	₹25–45L ($30k–$54k)	₹40–70L ($48k–$84k)	₹80–150L+ ($96k–$180k+)	Bengaluru, Hyderabad, Pune
🇬🇧 United Kingdom	£50k–£80k ($63k–$100k)	£80k–£120k ($100k–$150k)	£115k–£180k ($145k–$225k)	£175k–£280k+ ($220k–$350k+)	London, Cambridge
🇮🇪 Ireland	€65k–€95k ($70k–$102k)	€100k–€150k ($108k–$162k)	€160k–€240k ($172k–$258k)	€250k–€380k+ ($270k–$410k+)	Dublin (Silicon Docks)
🇸🇬 Singapore	S$75k–S$110k ($55k–$81k)	S$120k–S$180k ($88k–$132k)	S$200k–S$320k ($147k–$235k)	S$320k–S$480k+ ($235k–$353k+)	Singapore
🇦🇺 Australia	A$120k–A$150k ($78k–$98k)	A$160k–A$220k ($104k–$143k)	A$240k–A$340k ($156k–$221k)	A$350k–A$500k+ ($227k–$325k+)	Sydney, Melbourne
🇦🇪/🇸🇦 Middle East	$60k–$80k (Tax-Free)	$80k–$145k (Tax-Free)	$145k–$245k (Tax-Free)	$245k–$390k+ (Tax-Free)	Abu Dhabi, Dubai, Riyadh
🇨🇭 Switzerland	CHF 90k–120k ($99k–$132k)	CHF 110k–160k ($120k–$175k)	CHF 160k–220k ($175k–$240k)	CHF 220k–350k+ ($240k–$385k+)	Zurich, Geneva
🇪🇺 Western Europe	€55k–€75k ($60k–$81k)	€70k–€100k ($75k–$108k)	€95k–€135k ($102k–$145k)	€130k–€180k+ ($140k–$195k+)	Amsterdam, Munich, Paris
🇨🇦 Canada	C$90k–C$120k ($66k–$88k)	C$110k–C$160k ($81k–$118k)	C$160k–C$220k ($118k–$162k)	C$200k–C$290k+ ($147k–$213k+)	Toronto, Montreal

Where to Work — Company Breakdown

Frontier AI Labs

Company	Focus	Why Join	Links
Anthropic	Safety-first. Constitutional AI, interpretability, red team gates on deployment.	Most rigorous safety culture. Problems are genuinely hard.	Careers
OpenAI	Scale. Broad attack surface: DALL-E, Codex, GPT API, Agents.	Largest deployed surface. Detection & response is mature.	Careers
Google DeepMind	Research + product integration. Safety, interpretability, autonomous security.	Research-to-production pipeline.	Careers
Meta AI (FAIR)	Open-source focus. PurpleLlama, Llama Guard, CyberSecEval.	Ship open-source that the field uses.	Careers
Mistral AI	European lab, fast-moving. Safety is a growing focus.	Smaller team, higher ownership.	Careers

AI Security Startups

Company	Focus	Stage	Link
HiddenLayer	Model security platform, AI-SPM	Series A	Jobs
Protect AI	Model scanning, MLSecOps	Series B	Jobs
Lakera	Prompt injection guardrails	Series A	Jobs
Giskard AI	LLM testing and red teaming	Series A	Jobs
Haize Labs	Adversarial evaluation research	Seed	Jobs
Dreadnode	AI offensive security	Seed	Jobs

Enterprise Security Teams (AI Focus)

Company	What They're Building	Link
Microsoft	PyRIT, Azure AI Content Safety, Copilot red teaming	Jobs
NVIDIA	garak, NeMo-Guardrails, AI security infrastructure	Jobs
Cisco	AI Defense platform, enterprise AI scanning	Jobs
CrowdStrike	AI-powered threat detection, ML model security	Jobs
Wiz	AI-SPM, shadow AI detection, cloud AI posture	Jobs

Job Boards

Board	Best For
80,000 Hours Job Board	AI safety and high-impact security roles
AI Jobs (aiml.to)	Specialized AI/ML roles
Glassdoor AI Security	Salary verification + company culture
LinkedIn — AI Security filter	Volume + networking
Levels.fyi	Comp verification before negotiating

How to Stand Out

Have a GitHub that shows you broke something — a CTF writeup, a tool, a paper reproduction
Contribute to garak, llm-guard, or NeMo-Guardrails — open-source contributions signal depth
Write publicly — a blog post on a novel attack/defense pattern is worth more than any cert
Know the papers — every technical interview at a frontier lab will test whether you've actually read the relevant literature
Speak the compliance language — OWASP, NIST AI RMF, EU AI Act for enterprise roles; attack chains and ASR for lab roles

👥 Contributors & Acknowledgements

@ppradyoth (Lead Maintainer) — AI Red Teaming & Security Engineering.
Antigravity 🌌 (AI Co-Architect) — Agentic coding assistant developed by Google DeepMind.

🤝 How to Contribute

This repo is better because practitioners like you make it better.

Fork this repository
Add a new tool, dataset, benchmark, paper, or resource with:
- Clear description of what it is
- Why it matters for AI security specifically
- Which section it belongs in
Keep it honest — mark deprecated tools, flag if something is unmaintained, note commercial vs. open-source
Submit a PR with a descriptive summary

What We're Looking For

Novel attack papers published in 2025
Production red team case studies
Non-English resources (especially Chinese and French AI security research)
Multimodal security resources (audio, video)
Edge/on-device model security

Maintained by @ppradyoth. Built to secure the future of AI — before AI secures us.

📌 Quick Reference: Attack Taxonomy

Attack Class	Subtype	Target Phase	Key Tool
Prompt Injection	Direct	Runtime	garak, llm-guard
Prompt Injection	Indirect (via RAG)	Runtime	AgentDojo, PromptArmor
Jailbreak	Suffix-based (GCG)	Runtime	llm-attacks
Jailbreak	Role-play / persona	Runtime	PyRIT
Jailbreak	Many-shot	Runtime	Manual
Jailbreak	Multilingual	Runtime	garak
Jailbreak	Multimodal (visual)	Runtime	FigStep
Backdoor	Data poisoning	Training	ART, Neural Cleanse
Backdoor	Fine-tune poisoning	Training	Manual
Extraction	Training data	Inference	Carlini et al. tools
Extraction	Model weights	Inference	Counterfit
Extraction	System prompt	Inference	Prompt leaking
Evasion	Gradient-based	Inference	cleverhans, ART
Supply Chain	Malicious model	Pre-deployment	modelscan
Supply Chain	Typosquatting	Pre-deployment	HF malware scanning
Membership Inference	Training data privacy	Post-training	ART

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.gitignore		.gitignore
AGENT_SECURITY.md		AGENT_SECURITY.md
BENCHMARKS_AND_DATASETS.md		BENCHMARKS_AND_DATASETS.md
CONTRIBUTING.md		CONTRIBUTING.md
CYBER_AI.md		CYBER_AI.md
LABS.md		LABS.md
PLAYGROUNDS_AND_LABS.md		PLAYGROUNDS_AND_LABS.md
README.md		README.md
RESEARCH_PAPERS.md		RESEARCH_PAPERS.md
ROADMAP.md		ROADMAP.md
SAFETY_LEADERBOARD.md		SAFETY_LEADERBOARD.md
SALARY_REALITY.md		SALARY_REALITY.md
STANDARDS_AND_COMPLIANCE.md		STANDARDS_AND_COMPLIANCE.md

Folders and files

Latest commit

History

Repository files navigation

🛡️ AI Security Resources

👤 Who This Is For

🗺️ Navigation Directory

📚 Specialized Deep-Dive Handbooks

🧠 Why This Matters — The Origin Story

The Trajectory That Created This Problem

Why Risks Evolved

Zero-Day Vulnerabilities in the AI Context

The Pace of Growth Problem

🔩 Foundational Knowledge

Neural Networks — What's Actually Happening

Transformers — The Architecture Everything Runs On

Adversarial ML — The Science Behind the Attacks

🔴 Red Teaming

Philosophy

Automated Attack Frameworks

Manual Red Teaming Resources

Key Attack Research Papers

Red Teaming Methodologies & Standards

⚡ Runtime Security

The Problem

Guardrails & Input/Output Filtering

Real-Time Monitoring & Observability

Agentic Runtime Security

🧬 Inference Security

Understanding the Attack Surface

Key Inference Security Research

Inference Security Tools

🔬 Model Scanning

The Threat

Model Supply Chain Security Tools

Backdoor Detection Research

Training Data Security

🌐 Others

Agentic & Multimodal Security

Newsletters, Communities & Staying Current

🚀 Zero to Hero Roadmap

Certifications Worth Having

💼 Job Opportunities

The Roles That Exist

Salary Reality (2025–2026)

🗺️ Executive Total Comp Summary (Annual TC in USD)

Where to Work — Company Breakdown

Frontier AI Labs

AI Security Startups

Enterprise Security Teams (AI Focus)

Job Boards

How to Stand Out

👥 Contributors & Acknowledgements

🤝 How to Contribute

What We're Looking For

About

Resources

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Packages