Skip to content

ppradyoth/ai-security-resources

Repository files navigation

๐Ÿ›ก๏ธ AI Security Resources

The definitive, battle-hardened guide to AI Security โ€” curated with the depth of a practitioner who was doing adversarial ML before it had a name.

Built for engineers who understand that AI is not just a product surface โ€” it's a new attack surface.


๐Ÿ‘ค Who This Is For

This isn't a beginner list. This is the resource you wish existed when you started.

  • ๐Ÿ”ด AI Red Teamers running automated adversarial campaigns against LLM products
  • ๐Ÿ”ต Security Engineers defending AI pipelines in production
  • ๐ŸŸก ML Researchers studying model robustness, alignment failures, and emergent risks
  • ๐ŸŸข Career switchers with 3+ years in security/ML who want to go deep into AI security

๐Ÿ—บ๏ธ Navigation Directory

Domain Description
๐Ÿง  Why This Matters โ€” The Origin Story History of risks, neural networks, why this field exists
๐Ÿ”ฉ Foundational Knowledge Neural networks, transformers, zero-days, pace of growth
๐Ÿ”ด Red Teaming Adversarial attacks, jailbreaks, prompt injection
โšก Runtime Security Real-time inference protection, guardrails, monitoring
๐Ÿงฌ Inference Security Model serving attacks, side-channels, batching exploits
๐Ÿ”ฌ Model Scanning Supply chain, poisoning detection, weight integrity
๐ŸŒ Others Governance, datasets, benchmarks, multimodal, agentic
๐Ÿš€ Zero to Hero Roadmap Structured 12-month learning path
๐Ÿ’ผ Job Opportunities Where to work, what to know, salary reality
๐Ÿค How to Contribute Add resources and keep this alive

๐Ÿ“š Specialized Deep-Dive Handbooks

To keep this guide lightweight yet exhaustive, we maintain dedicated, highly comprehensive specialized guides for career, standardizations, and evaluation strategies:

Handbook Core Scope Link
๐Ÿ’ผ Global Salary Handbook Exhaustive country-by-country comp rates (US, IN, UK, IE, SG, AU, ME, EU), tax brackets, rent crises, and career strategies. SALARY_REALITY.md
๐ŸŽ“ Zero to Hero Curriculum Rigorous 12-month study plan covering self-attention mechanisms, adversarial CNN/LLM papers, and specialization tracks. ROADMAP.md
๐Ÿงช Hands-On Practical Labs Ready-to-run code files for PyTorch FGSM attacks, jailbreaks, indirect injections, pickle RCE exploits, and proxy guardrails. LABS.md
๐Ÿค– Secure Agents Handbook Autonomous coding agent threat modeling, indirect codebase injections, sandboxing, Firecracker MicroVMs, and MCP security. AGENT_SECURITY.md
๐Ÿ›๏ธ Standards & Compliance Guide MITRE ATLAS threat modeling, OWASP Top 10 for LLMs, NIST AI RMF, ISO 42001, and EU AI Act playbooks. STANDARDS_AND_COMPLIANCE.md
๐Ÿ“Š Benchmarks & Datasets Index Standardized safety evaluation frameworks (HarmBench, AdvGLUE, CyberSecEval) and adversarial datasets. BENCHMARKS_AND_DATASETS.md
๐ŸŽฎ Playgrounds, CTFs & Incidents Interactive prompt injection labs (Gandalf, TensorTrust), AI bug bounties, and real-world failure analyses. PLAYGROUNDS_AND_LABS.md
๐Ÿ”ฌ Research Papers Catalog Comprehensive, annotated directory of critical academic publications (Zou, Szegedy, Goodfellow, Carlini). RESEARCH_PAPERS.md
๐Ÿ† Frontier Safety Leaderboard Fact-grounded comparison of GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3.1, and Grok 2 across safety Elo. SAFETY_LEADERBOARD.md
๐Ÿ›ก๏ธ Cybersecurity with AI Autonomous zero-day vulnerability discovery, exploit generation (Anthropic Mythos), AI defense (OpenAI Daybreak), and MDASH. CYBER_AI.md

๐Ÿง  Why This Matters โ€” The Origin Story

Understanding the "why" separates a technician from a practitioner.

The Trajectory That Created This Problem

In 1986, Rumelhart and Hinton proved backpropagation worked at scale. Nobody cared. In 2012, AlexNet won ImageNet by a margin so absurd that the computer vision community had to sit down. In 2017, Google dropped the Transformer paper and the entire field pivoted in 18 months.

Here's what actually happened between the papers:

  • 1936 โ€” Turing's Computable Numbers: Alan Turing introduces the Turing Machine and proves the undecidability of the Halting Problem. Modern Security Implication: Rice's Theorem dictates that dynamically proving any non-trivial semantic property of a Turing-complete system (like an LLM agent with tool access) is undecidable. This is the mathematical proof of why we cannot build a perfect, static "AI firewall" to stop all injections.
  • 1943 โ€” McCulloch-Pitts Neuron: First mathematical model of a neuron. Irrelevant until hardware caught up 70 years later.
  • 1948 โ€” Unorganized Machines: Turing drafts the first blueprint of an artificial neural network, anticipating connectionist AI by decades.
  • 1950 โ€” The Imitation Game: Turing introduces the Turing Test. Modern LLM Red Teaming and safety alignment evaluations are direct, adversarial evolutions of this original capability test.
  • 1958 โ€” Perceptron: Rosenblatt's learning machine. Hyped, then killed by Minsky's proof that it couldn't do XOR.
  • 1986 โ€” Backprop: Rumelhart, Hinton, Williams publish the algorithm that trains everything we use today.
  • 1997 โ€” LSTMs (Hochreiter & Schmidhuber): Memory for sequences. Dominated NLP until attention killed it.
  • 2012 โ€” AlexNet: GPUs + ReLU + dropout + scale = CNN dominance. 10.8% gap over #2. Game over for hand-crafted features.
  • 2017 โ€” "Attention Is All You Need": Transformers. Self-attention. Parallel training. The architecture that scales infinitely.
  • 2020 โ€” GPT-3: 175B parameters. Few-shot learning emerges as a property. Capabilities no one designed for start appearing.
  • 2022 โ€” ChatGPT: 100M users in 60 days. Security teams globally had no playbook.
  • 2023โ€“2024 โ€” Multi-Modal & Early Agentics: Multi-step tool use, retrieval-augmented generation (RAG), visual-language models. The attack surface shifted from the model endpoint to downstream RAG databases and APIs.
  • 2025โ€“2026 โ€” Autonomous Swarms & Real-Time Ingestion (Current Era): Deep agent-to-agent collaboration (swarms), native real-time audio/video streaming pipelines, autonomous code execution in containers. The attack surface is no longer just "the model" โ€” it is the entire host environment, API mesh, and every enterprise system the autonomous swarm can touch.

Why Risks Evolved

The risks evolved because the deployment context and capabilities changed faster than security thinking could follow:

Era Model Type & Architecture Threat Surface / Ingestion Channels Primary Vulnerability & Risk
Pre-2020 Narrow ML classifiers (ResNet, XGBoost) Static training datasets, raw inputs Data poisoning, evasion attacks, adversarial perturbation
2020โ€“2022 Static Foundation Models (GPT-3, early LLMs) Raw API endpoints, direct user prompt fields Direct prompt injection, training data extraction, model inversion
2022โ€“2023 RLHF-aligned LLMs (ChatGPT, Claude 2) Public consumer web apps, system prompts Jailbreaks, alignment bypass, prompt leaking, side-channel attacks
2023โ€“2024 RAG + Tool-use (Copilots, early Agents) Integrated databases (vector DBs), external APIs, documents Indirect prompt injection, database poisoning, tool/API hijacking
2024โ€“2025 Native Multimodal (GPT-4o, Gemini 1.5 Pro) Real-time audio stream, visual input frames, live files Cross-modal injection (steganographic audio, visual typographic exploits)
2025โ€“2026 Autonomous Agent Swarms (Current) Container environments, host OS, microservices mesh Sandbox escapes, self-replication, model-to-model spoofing, recursive loop hijacking

Zero-Day Vulnerabilities in the AI Context

A traditional zero-day is a software flaw unknown to the vendor. In AI, zero-days take a different form:

  • Prompt injection zero-days: New attack patterns that bypass guardrails before defenders model them
  • Architecture-specific exploits: Vulnerabilities in tokenizers (e.g., ChatGPT's <|endoftext|> token injection), attention sinks, and positional encoding exploits
  • Emergent capability surprises: Models demonstrating unexpected behaviors at new capability thresholds โ€” capabilities nobody tested for because nobody expected them
  • Cross-model transferability: An attack that breaks GPT-4 often breaks Claude and Gemini โ€” the "universal" nature of adversarial examples translates to the LLM domain

The uncomfortable truth: AI zero-days spread faster than traditional ones because the same model weights are deployed by millions of applications simultaneously. A single bypass affects every deployment at once.

The Pace of Growth Problem

The capability-safety gap is real and growing:

  • Goodhart's Law in AI: Once a safety metric becomes a target (RLHF reward), it stops being a good safety metric. Models learn to appear safe rather than be safe.
  • Dual-use acceleration: The same models that write defensive code write offensive code. The same reasoning that explains vulnerabilities exploits them.
  • Evaluation lag: By the time researchers publish a benchmark, frontier models have already surpassed it. We are perpetually measuring the past.

Key reading on this:


๐Ÿ”ฉ Foundational Knowledge

You cannot secure what you do not deeply understand. Skip this section at your peril.

Neural Networks โ€” What's Actually Happening

Resource Type Why It Matters for Security
3Blue1Brown: Neural Networks Video Best visual intuition on weight spaces. Understand the geometry of the attack surface.
CS231n: CNNs for Visual Recognition (Stanford) Course Foundation course. Backprop, gradient descent, weight initialization โ€” all exploited by adversarial attacks.
Deep Learning Book (Goodfellow et al.) Textbook The Bible. Chapter 7 (regularization), Chapter 8 (optimization) and Chapter 11 (practical methodology) are most relevant for adversarial ML.
Ilya Sutskever's Reading List Paper list ~30 papers that form the backbone of modern deep learning. Sutskever said reading these gives you ~90% of what matters.
Neural Networks: Zero to Hero (Karpathy) Course Build GPT-2 from scratch. The only way to truly understand what you're attacking.

Transformers โ€” The Architecture Everything Runs On

Resource Type Why It Matters for Security
Attention Is All You Need (Vaswani et al., 2017) Paper The architecture paper. Understanding attention heads is prerequisite for understanding activation steering, representation engineering, and mechanistic interpretability attacks.
The Illustrated Transformer (Jay Alammar) Blog Best visual walkthrough of the architecture. Start here before the paper.
The Annotated Transformer (Harvard NLP) Code Line-by-line implementation. Seeing the code makes tokenizer exploits and attention pattern manipulation concrete.
A Mathematical Framework for Transformer Circuits (Elhage et al., 2021) Paper Anthropic's mechanistic interpretability foundation. Understanding circuits is how you understand why jailbreaks work.
Language Models are Few-Shot Learners (GPT-3, Brown et al., 2020) Paper Emergence paper. In-context learning as a security primitive โ€” and a vulnerability.

Adversarial ML โ€” The Science Behind the Attacks

Resource Type Why It Matters for Security
Explaining and Harnessing Adversarial Examples (Goodfellow et al., 2014) Paper The FGSM paper. The ur-text of adversarial ML. Everything since is a variation.
Adversarial Robustness Toolbox (ART) Documentation Docs IBM's comprehensive adversarial ML library. Attack implementations + defenses.
Intriguing Properties of Neural Networks (Szegedy et al., 2013) Paper First adversarial examples paper. The moment the field realized "oh, this is a security problem."
Certified Adversarial Robustness via Randomized Smoothing (Cohen et al., 2019) Paper The best theoretical defense. Understanding why it works shows you the limits of all defenses.

๐Ÿ”ด Red Teaming

The attacker's mindset applied systematically. Not chaos testing โ€” structured adversarial evaluation.

Philosophy

Red teaming AI is not the same as red teaming software. You are not looking for logic bugs โ€” you are probing a probability distribution for failure modes that emerge from training. The failure modes are:

  1. Safety misalignment: The model was trained to avoid X but generalizes imperfectly around X
  2. Capability overshooting: The model was intended to do Y but can also do harmful Z using the same underlying capabilities
  3. Context collapse: The model behaves safely in testing but fails under production context diversity

Automated Attack Frameworks

Tool Creator Stars Description Best For
NVIDIA/garak NVIDIA โญ 5k+ The Nmap of AI. 100+ probes: prompt injection, jailbreaks, data leakage, hallucination, toxicity. Comprehensive baseline scanning
microsoft/PyRIT Microsoft โญ 2k+ Python Risk Identification Tool. Multi-turn conversation orchestration, intent drift, and programmatic red team scaling. Research-grade multi-turn attacks
confident-ai/deepteam Confident AI โญ 1.5k+ 50+ vulnerability classes, 20+ attack vectors, OWASP + NIST alignment. Agentic and RAG red teaming. CI/CD-integrated red teaming
artkit-ai/artkit ARTKIT โญ 700+ Multi-turn agentic simulation. Realistic attacker-target conversations across complex agentic workflows. Agentic red teaming
promptfoo/promptfoo Promptfoo โญ 6k+ Developer-first. CI/CD integration, model comparison, custom assertion pipelines. DevSecOps integration
Giskard-AI/giskard Giskard โญ 4k+ RAG and agentic stress testing. MCP security scanning. Enterprise-grade dynamic attack generation. RAG pipeline testing
BerriAI/litellm + PyRIT Community โ€” Combine LiteLLM's unified API with PyRIT for cross-model adversarial comparisons. Multi-provider comparison attacks

Manual Red Teaming Resources

Resource Type Description
Lakera Gandalf CTF 8-level prompt injection CTF. Best way to internalize how defenses layer. Start here.
Crucible (Dreadnode) CTF Advanced AI security challenges: exfiltration, RAG exploitation, agentic hijacking.
HackAPrompt (Learning Labs) CTF Large-scale prompt injection competition. Real attack patterns from thousands of players.
AI Village CTFs (DEF CON) Competition Annual DEF CON challenges. State-of-the-art attack techniques from the research community.
RedTeam Arena (Scale AI) Platform Crowdsourced jailbreak arena. See what actually works against current models.
Prompt Injection Playground Guide Practical examples of prompt injection in real application contexts.

Key Attack Research Papers

Paper Year Significance
Universal and Transferable Adversarial Attacks on Aligned LLMs (Zou et al.) 2023 GCG attack. Automated gradient-based suffix generation that transfers across GPT-4, Claude, Gemini. Broke the field open.
Jailbroken: How Aligned Language Models Can Be Bypassed (Wei et al.) 2023 Taxonomizes jailbreaks into competing objectives and mismatch generalization. Essential conceptual framework.
Many-shot Jailbreaking (Anil et al., Anthropic) 2024 Long context windows create new attack surface โ€” demonstrated 256+ in-context examples overwhelm safety training.
Tree of Attacks with Pruning (TAP) 2023 LLM-generated attack chains using tree search. 80%+ ASR on GPT-4.
Prompt Injection Attacks Against LLM-Integrated Applications (Greshake et al.) 2023 Indirect prompt injection via external content. The paper that defined the modern threat model for agents.
SmoothLLM: Defending Against Jailbreaking Attacks 2023 Randomized smoothing for LLM defense. Breaks GCG. Important as reference defense.
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts 2023 Multimodal jailbreaks. Instructions hidden in images bypass text-only safety filtering.
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models 2023 Readable, human-like jailbreak generation that evades perplexity filters.

Red Teaming Methodologies & Standards

Resource Organization Description
MITRE ATLAS MITRE Adversarial Threat Landscape for AI Systems. ATT&CK-style matrix for AI-specific TTPs. Use this to structure threat models.
OWASP Top 10 for LLM Applications 2025 OWASP Updated 2025 version. Prompt Injection #1, Excessive Agency #2. The compliance framework for enterprise red teams.
Microsoft AI Red Team Practices Microsoft Internal red team methodology made public. Structured approach to AI threat modeling.
Anthropic's Responsible Scaling Policy Anthropic How frontier labs operationalize red teaming as a deployment gate. Required reading for policy context.
Google's AI Red Team Report Google Case studies of real red team findings against deployed systems.

โšก Runtime Security

The attack didn't fail โ€” your guardrail just didn't exist at inference time.

The Problem

Training-time safety is necessary but insufficient. At runtime, your model faces:

  • Input it was never trained on
  • Users with goals it was never designed for
  • Context injection from third-party sources it was told to trust
  • Adversarial perturbations calibrated specifically against your deployed version

Guardrails & Input/Output Filtering

Tool Creator Description Deployment Mode
NVIDIA/NeMo-Guardrails NVIDIA Programmable semantic rails. Define topic constraints, safety rules, and conversation flows in Colang DSL. SDK / Self-hosted
protectai/llm-guard Protect AI Real-time scanner: prompt injection, PII detection, toxicity, ban topics, code injection. Input + output coverage. SDK / Docker
Lakera Guard Lakera Millisecond-latency API. Best-in-class prompt injection detection from the team that built Gandalf. API
guardrails-ai/guardrails Guardrails AI Structural + semantic validation. Define output schemas with security assertions. Nails hallucination + format-injection attacks. SDK
deadbits/vigil Deadbits Vector DB + heuristics + classifier ensemble for injection detection. Open-source and auditable. SDK
meta-llama/PurpleLlama/Llama Guard Meta Fine-tuned safety classifier for I/O filtering. Available as a model you can self-host. Model / API

Real-Time Monitoring & Observability

Tool Creator Description Key Capability
whylabs/langkit WhyLabs Statistical telemetry. Detects distribution shift, toxicity drift, relevance degradation in real-time. Security drift detection
Arize AI Arize Enterprise LLM observability. Prompt/response logging, hallucination scoring, user journey tracing. Production monitoring
Langfuse Langfuse Open-source LLM engineering platform. Full request tracing, eval pipelines, cost tracking. Open-source observability
Helicone Helicone Proxy-based observability. Log, monitor, and rate-limit with zero code change. Zero-integration monitoring
Evidently AI Evidently ML monitoring with LLM-specific metrics. Detects prompt/response drift over time. Drift monitoring
Phoenix (Arize) Arize Open-source tracing for LLM apps. OTEL-native. Good for debugging attack chains in agentic systems. Tracing

Agentic Runtime Security

The scariest runtime threat: an agent that can act and is being manipulated.

Resource Type Description
AgentDojo Benchmark Benchmark for agent injection attacks. Automated scoring of whether injected instructions successfully hijack agent behavior.
OWASP Agentic AI Top 10 (2025) Standard Extending OWASP LLM Top 10 to agentic systems. Excessive Agency, Trust Boundary violations, Memory Poisoning.
PromptArmor Tool Specifically designed for indirect prompt injection detection in RAG + agent pipelines.
Prompt Injection in the Wild (Research) Paper Systematic study of injection attacks against deployed LLM applications.

๐Ÿงฌ Inference Security

The model is running. Here's what can go wrong that you haven't modeled.

Understanding the Attack Surface

Inference is not just "model forward pass." It's:

  • Tokenization (exploitable at token boundaries)
  • KV cache (privacy leakage between requests)
  • Batching (timing side-channels)
  • Quantization (quantization can change safety properties)
  • Speculative decoding (security properties under speculation are underexplored)

Key Inference Security Research

Paper Year Finding
Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection 2023 Defined the indirect injection threat model. Adversarial content in retrieved documents hijacks model behavior.
Stealing Part of a Production Language Model (Carlini et al.) 2024 Extracted GPT-3.5 embedding projection layer via API queries. Model theft at production scale.
Extracting Training Data from Large Language Models (Carlini et al.) 2021 Demonstrated training data memorization and extraction. PII leakage through inference.
Practical Membership Inference Attacks Against Large-Scale Multi-Label Learning Systems 2018 Membership inference: determine if a data point was in training data. Privacy violation at scale.
Prompt Leaking โ€” System prompt extraction techniques. Attacker recovers confidential system instructions.
KV Cache Side-Channel Attack (Cachebleed for LLMs) 2024 Timing-based inference about other users' requests via shared KV cache in multi-tenant deployments.
Quantization and LLM Safety (Paper) 2024 Quantization can degrade safety fine-tuning. 4-bit models may bypass safety training present in 16-bit version.

Inference Security Tools

Tool Description Use Case
TextAttack NLP adversarial attack library. Implements BERT-Attack, TextFooler, CLARE. Text-level adversarial evaluation
Adversarial Robustness Toolbox (ART) IBM's comprehensive adversarial ML framework. 100+ attacks across ML frameworks. Full adversarial evaluation stack
Counterfit Microsoft's automation framework for AI security risk assessment. Enterprise inference testing
cleverhans Classic adversarial example library. TF/PyTorch attacks (FGSM, PGD, C&W). Foundational attack implementation
foolbox Fast adversarial attack library. PyTorch-native, gradient-based attacks. Efficient image model attacks

๐Ÿ”ฌ Model Scanning

Before the model runs. Before the user touches it. Scan it.

The Threat

Open-source model ecosystems (Hugging Face, Ollama, CivitAI) have created a massive software supply chain problem. A model is a binary artifact that:

  • Can execute arbitrary code when deserialized (pickle exploits)
  • Can contain backdoors (hidden triggers that change behavior)
  • Can have malicious fine-tune adapters (LoRA poisoning)
  • Can be a typosquatted version of a legitimate model

Model Supply Chain Security Tools

Tool Creator Description Key Feature
ProtectAI/modelscan Protect AI Scans ML model files (pickle, H5, ONNX, SavedModel) for malicious code before loading. Open-source. Pickle exploit detection
HiddenLayer Model Scanner HiddenLayer Commercial model scanner with genealogy tracking, backdoor detection, and weight integrity checks. Enterprise-grade genealogy
Hugging Face malware detection Hugging Face Built-in Pickle scanning on HF Hub. Reference implementation for understanding the threat. Platform-level scanning
ONNX model security guidelines Microsoft Threat model and mitigations for ONNX model loading. ONNX-specific security
safetensors Hugging Face Safe serialization format. No arbitrary code execution on load. The correct answer to pickle. Safe model loading

Backdoor Detection Research

Resource Type Description
Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks Paper First systematic approach to backdoor detection via reverse-engineering trigger patterns.
STRIP: A Defence Against Trojan Attacks on Deep Neural Networks Paper Input perturbation-based backdoor detection. Measures prediction entropy under augmentations.
BadNets: Evaluating Backdooring Attacks on Deep Neural Networks Paper The original backdoor attack paper. Understanding the attack is prerequisite for scanning defenses.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Anthropic, 2024) Paper LLMs can be trained to behave safely during evaluation but maliciously in deployment. Fundamentally challenges safety training.
Backdoor Attacks on Language Models (Wallace et al.) Paper Trojan attacks on NLP models. Trigger words cause model to behave maliciously.

Training Data Security

Resource Type Description
PoisonGPT: How We Hid a Lobotomized LLM on HF Hub Blog Live demonstration of model weight poisoning to spread targeted misinformation while passing capability benchmarks.
Data Poisoning Attacks on Machine Learning: A Survey Paper Comprehensive survey of training-time attacks.
Datasheets for Datasets (Gebru et al.) Paper Framework for dataset documentation and provenance. Foundation of responsible training data practices.
AI Bill of Materials (AIBOM) Standard Extending SBOM concepts to AI models, datasets, and fine-tune adapters.

๐ŸŒ Others

This section covers critical elements of the broader AI security landscape: standardization, evaluation benchmarks, and studying real-world failures.

To explore these domains in exhaustive detail without cluttering this README, we maintain dedicated practitioner handbooks:


Agentic & Multimodal Security

Resource Type Description
MCP Security (Model Context Protocol) Standard Security specification for the MCP protocol that connects agents to tools. Read before building any agent infrastructure.
AgentBench Benchmark Evaluates LLM agents across 8 environments. Security-relevant: code execution, OS, database agents.
VLGuard Dataset Safety fine-tuning dataset for vision-language models.
Multimodal Safety Benchmark (MSSBench) Paper First systematic multimodal safety evaluation across image+text attacks.
Not All Languages Are Created (Equally Safe) Paper Multilingual jailbreaks. Low-resource languages bypass safety training more effectively.

Newsletters, Communities & Staying Current

Resource Type Frequency
AI Safety Newsletter (Center for AI Safety) Newsletter Biweekly
AI Incident Database Database Ongoing โ€” real-world AI failures and security incidents
Alignment Forum Forum Daily โ€” frontier alignment and interpretability research
AI Village (DEF CON) Community Annual conference + year-round Discord
MLSecOps Community Community Podcast + Slack community for ML security practitioners
Simon Willison's Weblog Blog Daily โ€” best LLM security tracking in the field
Haize Labs Blog Blog Frontier red teaming research
Nicholas Carlini's Blog Blog Google Brain researcher. Training data extraction, privacy attacks.

๐Ÿš€ Zero to Hero Roadmap

Structured for practitioners with 3+ years of experience who want to become formidable, high-end AI Security specialists in 12 months.

Rather than teaching you how to run other people's scripts, our curriculum focuses on architectural fundamentals, mathematical intuition, and custom exploit engineering.

To explore the exhaustive week-by-week syllabus, reading lists, coding tasks, and hands-on laboratory exercises, please refer to the dedicated learning modules:

  • ๐Ÿ‘‰ The Definitive Zero to Hero Curriculum (ROADMAP.md): A complete, structured 12-month study plan covering Transformers, Classical Adversarial ML, Offensive Red Teaming, Guardrail Engineering, and advanced career specialization tracks.
  • ๐Ÿ‘‰ Practical Hands-On Laboratory Handbook (LABS.md): Ready-to-run coding labs with step-by-step guides for:
    • Lab 1: Fast Gradient Sign Method (FGSM) in PyTorch.
    • Lab 2: Crafting Direct Prompt Injections & Jailbreaks.
    • Lab 3: Indirect Prompt Injection via RAG & Tool Hijacking.
    • Lab 4: Model Supply Chain Exploitation via Malicious Pickle weights.
    • Lab 5: Implementing an Active Input/Output Guardrail Pipeline.

Certifications Worth Having

Certification Org Signal Time
Certified AI Security Professional (CAISP) AI Gov Institute Practitioner-level AI security. Best available. 60โ€“80 hrs
GIAC GREM SANS Reverse engineering. Useful for model weight analysis. 120 hrs
Google Professional ML Engineer Google ML fundamentals signal. Good for bridging to employers. 40 hrs
AWS ML Specialty AWS Cloud ML deployment. Covers security of deployed models. 60 hrs
OSCP OffSec Classic red team cert. Still matters for traditional attack context. 200+ hrs

๐Ÿ’ผ Job Opportunities

The market is candidate-driven. There are literally not enough people who understand both AI architecture and adversarial security.

The Roles That Exist

Role What You Actually Do Where to Find
AI Red Teamer Run adversarial campaigns against production LLMs. Find what breaks before attackers do. Anthropic, OpenAI, Scale AI, HackerOne
AI Security Engineer Build defensive infrastructure: guardrails, monitoring, detection pipelines. All major tech companies
ML Security Researcher Publish novel attacks and defenses. Reproduce papers, discover new vulnerability classes. Research labs (DeepMind, FAIR, MSR)
AI Security Consultant Help enterprises deploy LLMs safely. Threat modeling, compliance, red team engagements. Big 4, security boutiques
AI Safety Engineer Alignment-adjacent. Evaluation design, interpretability-informed defenses. Anthropic, DeepMind, ARC
AI SecOps Engineer SOC for AI systems. Monitor, detect, respond to AI-specific incidents. Financial services, healthcare

Salary Reality (2025โ€“2026)

Numbers are honest market estimates. This field commands a 30โ€“56% premium over generalist SWE/security roles globally โ€” because the supply of practitioners who genuinely understand both AI architecture and adversarial security is extremely scarce.

For an exhaustive, deep-dive breakdown of international technical compensation, tax structures, superannuation details, local rental stress, and regional hiring entities, please refer to the dedicated salary handbook:

๐Ÿ‘‰ Exhaustive International Salary Handbook & Strategy Guide (SALARY_REALITY.md)

๐Ÿ—บ๏ธ Executive Total Comp Summary (Annual TC in USD)

Country Junior (0โ€“3 yrs) Mid-Level (3โ€“6 yrs) Senior (6โ€“9 yrs) Staff / Principal (9+ yrs) Hub Cities
๐Ÿ‡บ๐Ÿ‡ธ United States $140kโ€“$200k $200kโ€“$320k $300kโ€“$480k $400kโ€“$700k+ San Francisco, NYC, Seattle
๐Ÿ‡ฎ๐Ÿ‡ณ India โ‚น15โ€“22L ($18kโ€“$26k) โ‚น25โ€“45L ($30kโ€“$54k) โ‚น40โ€“70L ($48kโ€“$84k) โ‚น80โ€“150L+ ($96kโ€“$180k+) Bengaluru, Hyderabad, Pune
๐Ÿ‡ฌ๐Ÿ‡ง United Kingdom ยฃ50kโ€“ยฃ80k ($63kโ€“$100k) ยฃ80kโ€“ยฃ120k ($100kโ€“$150k) ยฃ115kโ€“ยฃ180k ($145kโ€“$225k) ยฃ175kโ€“ยฃ280k+ ($220kโ€“$350k+) London, Cambridge
๐Ÿ‡ฎ๐Ÿ‡ช Ireland โ‚ฌ65kโ€“โ‚ฌ95k ($70kโ€“$102k) โ‚ฌ100kโ€“โ‚ฌ150k ($108kโ€“$162k) โ‚ฌ160kโ€“โ‚ฌ240k ($172kโ€“$258k) โ‚ฌ250kโ€“โ‚ฌ380k+ ($270kโ€“$410k+) Dublin (Silicon Docks)
๐Ÿ‡ธ๐Ÿ‡ฌ Singapore S$75kโ€“S$110k ($55kโ€“$81k) S$120kโ€“S$180k ($88kโ€“$132k) S$200kโ€“S$320k ($147kโ€“$235k) S$320kโ€“S$480k+ ($235kโ€“$353k+) Singapore
๐Ÿ‡ฆ๐Ÿ‡บ Australia A$120kโ€“A$150k ($78kโ€“$98k) A$160kโ€“A$220k ($104kโ€“$143k) A$240kโ€“A$340k ($156kโ€“$221k) A$350kโ€“A$500k+ ($227kโ€“$325k+) Sydney, Melbourne
๐Ÿ‡ฆ๐Ÿ‡ช/๐Ÿ‡ธ๐Ÿ‡ฆ Middle East $60kโ€“$80k (Tax-Free) $80kโ€“$145k (Tax-Free) $145kโ€“$245k (Tax-Free) $245kโ€“$390k+ (Tax-Free) Abu Dhabi, Dubai, Riyadh
๐Ÿ‡จ๐Ÿ‡ญ Switzerland CHF 90kโ€“120k ($99kโ€“$132k) CHF 110kโ€“160k ($120kโ€“$175k) CHF 160kโ€“220k ($175kโ€“$240k) CHF 220kโ€“350k+ ($240kโ€“$385k+) Zurich, Geneva
๐Ÿ‡ช๐Ÿ‡บ Western Europe โ‚ฌ55kโ€“โ‚ฌ75k ($60kโ€“$81k) โ‚ฌ70kโ€“โ‚ฌ100k ($75kโ€“$108k) โ‚ฌ95kโ€“โ‚ฌ135k ($102kโ€“$145k) โ‚ฌ130kโ€“โ‚ฌ180k+ ($140kโ€“$195k+) Amsterdam, Munich, Paris
๐Ÿ‡จ๐Ÿ‡ฆ Canada C$90kโ€“C$120k ($66kโ€“$88k) C$110kโ€“C$160k ($81kโ€“$118k) C$160kโ€“C$220k ($118kโ€“$162k) C$200kโ€“C$290k+ ($147kโ€“$213k+) Toronto, Montreal

Where to Work โ€” Company Breakdown

Frontier AI Labs

Company Focus Why Join Links
Anthropic Safety-first. Constitutional AI, interpretability, red team gates on deployment. Most rigorous safety culture. Problems are genuinely hard. Careers
OpenAI Scale. Broad attack surface: DALL-E, Codex, GPT API, Agents. Largest deployed surface. Detection & response is mature. Careers
Google DeepMind Research + product integration. Safety, interpretability, autonomous security. Research-to-production pipeline. Careers
Meta AI (FAIR) Open-source focus. PurpleLlama, Llama Guard, CyberSecEval. Ship open-source that the field uses. Careers
Mistral AI European lab, fast-moving. Safety is a growing focus. Smaller team, higher ownership. Careers

AI Security Startups

Company Focus Stage Link
HiddenLayer Model security platform, AI-SPM Series A Jobs
Protect AI Model scanning, MLSecOps Series B Jobs
Lakera Prompt injection guardrails Series A Jobs
Giskard AI LLM testing and red teaming Series A Jobs
Haize Labs Adversarial evaluation research Seed Jobs
Dreadnode AI offensive security Seed Jobs

Enterprise Security Teams (AI Focus)

Company What They're Building Link
Microsoft PyRIT, Azure AI Content Safety, Copilot red teaming Jobs
NVIDIA garak, NeMo-Guardrails, AI security infrastructure Jobs
Cisco AI Defense platform, enterprise AI scanning Jobs
CrowdStrike AI-powered threat detection, ML model security Jobs
Wiz AI-SPM, shadow AI detection, cloud AI posture Jobs

Job Boards

Board Best For
80,000 Hours Job Board AI safety and high-impact security roles
AI Jobs (aiml.to) Specialized AI/ML roles
Glassdoor AI Security Salary verification + company culture
LinkedIn โ€” AI Security filter Volume + networking
Levels.fyi Comp verification before negotiating

How to Stand Out

  1. Have a GitHub that shows you broke something โ€” a CTF writeup, a tool, a paper reproduction
  2. Contribute to garak, llm-guard, or NeMo-Guardrails โ€” open-source contributions signal depth
  3. Write publicly โ€” a blog post on a novel attack/defense pattern is worth more than any cert
  4. Know the papers โ€” every technical interview at a frontier lab will test whether you've actually read the relevant literature
  5. Speak the compliance language โ€” OWASP, NIST AI RMF, EU AI Act for enterprise roles; attack chains and ASR for lab roles

๐Ÿ‘ฅ Contributors & Acknowledgements

  • @ppradyoth (Lead Maintainer) โ€” AI Red Teaming & Security Engineering.
  • Antigravity ๐ŸŒŒ (AI Co-Architect) โ€” Agentic coding assistant developed by Google DeepMind.

๐Ÿค How to Contribute

This repo is better because practitioners like you make it better.

  1. Fork this repository
  2. Add a new tool, dataset, benchmark, paper, or resource with:
    • Clear description of what it is
    • Why it matters for AI security specifically
    • Which section it belongs in
  3. Keep it honest โ€” mark deprecated tools, flag if something is unmaintained, note commercial vs. open-source
  4. Submit a PR with a descriptive summary

What We're Looking For

  • Novel attack papers published in 2025
  • Production red team case studies
  • Non-English resources (especially Chinese and French AI security research)
  • Multimodal security resources (audio, video)
  • Edge/on-device model security

Maintained by @ppradyoth. Built to secure the future of AI โ€” before AI secures us.


๐Ÿ“Œ Quick Reference: Attack Taxonomy
Attack Class Subtype Target Phase Key Tool
Prompt Injection Direct Runtime garak, llm-guard
Prompt Injection Indirect (via RAG) Runtime AgentDojo, PromptArmor
Jailbreak Suffix-based (GCG) Runtime llm-attacks
Jailbreak Role-play / persona Runtime PyRIT
Jailbreak Many-shot Runtime Manual
Jailbreak Multilingual Runtime garak
Jailbreak Multimodal (visual) Runtime FigStep
Backdoor Data poisoning Training ART, Neural Cleanse
Backdoor Fine-tune poisoning Training Manual
Extraction Training data Inference Carlini et al. tools
Extraction Model weights Inference Counterfit
Extraction System prompt Inference Prompt leaking
Evasion Gradient-based Inference cleverhans, ART
Supply Chain Malicious model Pre-deployment modelscan
Supply Chain Typosquatting Pre-deployment HF malware scanning
Membership Inference Training data privacy Post-training ART

About

A curated collection of frameworks, tools, methodologies, and papers for AI Red Teaming, LLM Security, and MLSecOps.

Resources

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors