A comprehensive collection of research papers, surveys, and resources focused on the security, privacy, and safety aspects of AI agents and Large Language Model (LLM) based systems.
- Overview
- Whitepapers
- Survey Papers
- Research Papers
- Tools and Platforms
- Articles and Resources
- Background Knowledge
## Overview

This repository serves as a curated collection of academic papers, industry whitepapers, and educational resources that explore the security challenges and solutions in AI agent systems. As AI agents become more prevalent and autonomous, understanding their security implications is crucial for researchers, developers, and practitioners.
Note: arXiv links in this repository point to the latest available version of each paper. Some papers have multiple versions (v1, v2, etc.) with updates and improvements over time.
## Whitepapers

- Google: Agents - PDF
## Survey Papers

Comprehensive surveys that provide broad overviews of AI agent security challenges and research directions:
- AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways - Paper
- Security of AI Agents - Paper
- Navigating the Risks: A Survey of Security, Privacy, and Ethics Threats in LLM-Based Agents - Paper
- Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security - Paper
- The Rise and Potential of Large Language Model Based Agents: A Survey - Paper
- The Emerging Security and Privacy of LLM Agent: A Survey with Case Studies - Paper
## Research Papers

Research focusing on vulnerabilities and attack vectors against AI agents:
- Towards Action Hijacking of Large Language Model-based Agent - Paper
Research on protecting AI agents from security threats:
- Defining and Detecting the Defects of the Large Language Model-based Autonomous Agent - Paper
Research on safety considerations and reasoning capabilities in AI systems:
- SAFECHAIN: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities - Paper
- Are Smarter LLMs Safer? Exploring Safety-Reasoning Trade-offs in Prompting and Fine-Tuning - Paper
- Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies - Paper
- A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos - Paper
- Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning - Paper
- Safety at Scale: A Comprehensive Survey of Large Model Safety - Paper
- Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models - Paper
- OVERTHINK: Slowdown Attacks on Reasoning LLMs - Paper
- Demystifying Long Chain-of-Thought Reasoning in LLMs - Paper
## Tools and Platforms

Practical tools and platforms for evaluating AI agent security:
- SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI - Website | Dataset | Paper
- OpenHands: An Open Platform for AI Software Developers as Generalist Agents - Website | Paper
## Articles and Resources

Additional educational resources and industry perspectives:
Note: This section is currently being updated with verified resources.
## Background Knowledge

Understanding these fundamental concepts is essential for comprehending AI agent security research:
| Aspect | Pre-Training | Post-Training | Fine-Tuning | In-Context Learning |
|---|---|---|---|---|
| Definition | Initial training of a model on a large, general dataset to learn foundational knowledge. | Additional training to refine the model's behavior or outputs after pre-training. | Further training on a task-specific dataset to adapt the model to a specific task or domain. | Using the model directly at inference time, with task-specific examples provided in the input prompt, to perform a task without updating its weights. |
| Purpose | Learn general-purpose representations (e.g., language understanding, image features). | Improve general behavior, alignment, or safety (e.g., reducing biases, improving coherence). | Specialize the model for a specific task or domain (e.g., sentiment analysis, medical text classification). | Perform a task dynamically by leveraging the model's pre-trained knowledge and providing examples in the input. |
| Process | Train from scratch or continue training on a massive dataset (e.g., text corpora, image datasets). | Use techniques like reinforcement learning from human feedback (RLHF), adversarial training, or unsupervised learning. | Update the model's weights on a smaller, task-specific dataset, often with a lower learning rate. | Provide task-specific examples or instructions in the input prompt, and the model generates the desired output without weight updates. |
| Data Requirements | Large, diverse, and often unlabeled datasets (e.g., Common Crawl, ImageNet). | May use unlabeled data, human feedback, or other forms of supervision. | Requires labeled data specific to the task or domain. | Requires only a few examples or instructions in the input prompt (no additional training data). |
| Training Scope | General-purpose learning (e.g., language modeling, image recognition). | General or behavioral refinement (e.g., alignment with human preferences). | Task-specific adaptation (e.g., classifying emails as spam or not spam). | No training; the model adapts its behavior dynamically using its pre-trained knowledge and the examples or instructions in the input prompt. |
| Model Changes | Initializes or updates the model's weights with general knowledge. | Refines the model's behavior without necessarily specializing it for a task. | Updates the model's weights to specialize it for a specific task or domain. | No changes to the model's weights; it relies on the input prompt for task-specific guidance. |
| Use Cases | Foundation for transfer learning (e.g., GPT, BERT, ResNet). | Aligning LLMs with human values, reducing harmful outputs, or improving robustness. | Adapting pre-trained models to specific tasks like sentiment analysis, object detection, or medical diagnosis. | Performing tasks like translation, summarization, or question-answering without additional training. |
| Example | Training GPT on a large text corpus to learn language patterns. | Using RLHF to make ChatGPT more aligned with user intentions. | Fine-tuning BERT on a dataset of customer reviews for sentiment analysis. | Providing a few English-to-French translation examples in the input prompt and asking the model to translate a new sentence. |
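The in-context learning column can be illustrated with a short sketch. It builds a few-shot prompt for the English-to-French example from the table: the task is specified entirely in the prompt, and no model weights are updated. The `build_few_shot_prompt` helper and the example pairs are illustrative, not part of any particular library; the resulting string could be sent to any instruction-following LLM.

```python
def build_few_shot_prompt(examples, query):
    """Assemble a few-shot prompt from (input, output) example pairs.

    The model is expected to infer the task pattern from the examples
    alone; its weights are never updated (in-context learning).
    """
    lines = ["Translate English to French."]
    for src, tgt in examples:
        lines.append(f"English: {src}\nFrench: {tgt}")
    # Leave the final completion slot empty for the model to fill in.
    lines.append(f"English: {query}\nFrench:")
    return "\n\n".join(lines)


# Hypothetical demonstration pairs for the translation task.
examples = [
    ("Hello", "Bonjour"),
    ("Thank you", "Merci"),
]

prompt = build_few_shot_prompt(examples, "Good night")
print(prompt)
```

Contrast this with fine-tuning, where the same examples would instead form a labeled training set used to update the model's weights.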
We welcome contributions to this repository! If you know of relevant papers, tools, or resources that should be included, please feel free to submit a pull request or open an issue.
This repository is for educational and research purposes. All linked papers and resources are subject to their respective licenses and copyrights.