A comprehensive collection of research papers, surveys, and resources focused on the security, privacy, and safety aspects of AI agents and Large Language Model (LLM) based systems.
- Overview
- Whitepapers
- Survey Papers
- Research Papers
- Tools and Platforms
- Articles and Resources
- Background Knowledge
## Overview

This repository serves as a curated collection of academic papers, industry whitepapers, and educational resources that explore the security challenges and solutions in AI agent systems. As AI agents become more prevalent and autonomous, understanding their security implications is crucial for researchers, developers, and practitioners.
Note: arXiv links in this repository point to the latest available version of each paper. Some papers have multiple versions (v1, v2, etc.) with updates and improvements over time.
## Whitepapers

- Google: Agents - PDF
## Survey Papers

Comprehensive surveys that provide broad overviews of AI agent security challenges and research directions:
- AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways - Paper
- Security of AI Agents - Paper
- Navigating the Risks: A Survey of Security, Privacy, and Ethics Threats in LLM-Based Agents - Paper
- Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security - Paper
- The Rise and Potential of Large Language Model Based Agents: A Survey - Paper
- The Emerging Security and Privacy of LLM Agent: A Survey with Case Studies - Paper
## Research Papers

Research focusing on vulnerabilities and attack vectors against AI agents:
- Towards Action Hijacking of Large Language Model-based Agent - Paper
Research on protecting AI agents from security threats:
- Defining and Detecting the Defects of the Large Language Model-based Autonomous Agent - Paper
Research on safety considerations and reasoning capabilities in AI systems:
- SAFECHAIN: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities - Paper
- Are Smarter LLMs Safer? Exploring Safety-Reasoning Trade-offs in Prompting and Fine-Tuning - Paper
- Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies - Paper
- A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos - Paper
- Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning - Paper
- Safety at Scale: A Comprehensive Survey of Large Model Safety - Paper
- Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models - Paper
- OVERTHINK: Slowdown Attacks on Reasoning LLMs - Paper
- Demystifying Long Chain-of-Thought Reasoning in LLMs - Paper
## Tools and Platforms

Practical tools and platforms for evaluating AI agent security:
- SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI - Website | Dataset | Paper
- OpenHands: An Open Platform for AI Software Developers as Generalist Agents - Website | Paper
## Articles and Resources

Additional educational resources and industry perspectives:
Note: This section is currently being updated with verified resources.
## Background Knowledge

Understanding these fundamental concepts is essential for comprehending AI agent security research:
| Aspect | Pre-Training | Post-Training | Fine-Tuning | In-Context Learning |
|---|---|---|---|---|
| Definition | Initial training of a model on a large, general dataset to learn foundational knowledge. | Additional training to refine the model's behavior or outputs after pre-training. | Further training on a task-specific dataset to adapt the model to a specific task or domain. | Using the model directly at inference time, with task-specific examples provided in the input prompt, to perform a task without updating its weights. |
| Purpose | Learn general-purpose representations (e.g., language understanding, image features). | Improve general behavior, alignment, or safety (e.g., reducing biases, improving coherence). | Specialize the model for a specific task or domain (e.g., sentiment analysis, medical text classification). | Perform a task dynamically by leveraging the model's pre-trained knowledge and providing examples in the input. |
| Process | Train from scratch or continue training on a massive dataset (e.g., text corpora, image datasets). | Use techniques like reinforcement learning from human feedback (RLHF), adversarial training, or unsupervised learning. | Update the model's weights on a smaller, task-specific dataset, often with a lower learning rate. | Provide task-specific examples or instructions in the input prompt, and the model generates the desired output without weight updates. |
| Data Requirements | Large, diverse, and often unlabeled datasets (e.g., Common Crawl, ImageNet). | May use unlabeled data, human feedback, or other forms of supervision. | Requires labeled data specific to the task or domain. | Requires only a few examples or instructions in the input prompt (no additional training data). |
| Training Scope | General-purpose learning (e.g., language modeling, image recognition). | General or behavioral refinement (e.g., alignment with human preferences). | Task-specific adaptation (e.g., classifying emails as spam or not spam). | No training; the model adapts its behavior dynamically using its pre-trained knowledge and the examples or instructions in the input prompt. |
| Model Changes | Initializes or updates the model's weights with general knowledge. | Refines the model's behavior without necessarily specializing it for a task. | Updates the model's weights to specialize it for a specific task or domain. | No changes to the model's weights; it relies on the input prompt for task-specific guidance. |
| Use Cases | Foundation for transfer learning (e.g., GPT, BERT, ResNet). | Aligning LLMs with human values, reducing harmful outputs, or improving robustness. | Adapting pre-trained models to specific tasks like sentiment analysis, object detection, or medical diagnosis. | Performing tasks like translation, summarization, or question-answering without additional training. |
| Example | Training GPT on a large text corpus to learn language patterns. | Using RLHF to make ChatGPT more aligned with user intentions. | Fine-tuning BERT on a dataset of customer reviews for sentiment analysis. | Providing a few English-to-French translation examples in the input prompt and asking the model to translate a new sentence. |
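The in-context learning column can be illustrated with a short sketch. It builds a few-shot prompt for the English-to-French example from the table: the task is specified entirely in the prompt, and no model weights are updated. The `build_few_shot_prompt` helper and the example pairs are illustrative, not part of any particular library; the resulting string could be sent to any instruction-following LLM.

```python
def build_few_shot_prompt(examples, query):
    """Assemble a few-shot prompt from (input, output) example pairs.

    The model is expected to infer the task pattern from the examples
    alone; its weights are never updated (in-context learning).
    """
    lines = ["Translate English to French."]
    for src, tgt in examples:
        lines.append(f"English: {src}\nFrench: {tgt}")
    # Leave the final completion slot empty for the model to fill in.
    lines.append(f"English: {query}\nFrench:")
    return "\n\n".join(lines)


# Hypothetical demonstration pairs for the translation task.
examples = [
    ("Hello", "Bonjour"),
    ("Thank you", "Merci"),
]

prompt = build_few_shot_prompt(examples, "Good night")
print(prompt)
```

Contrast this with fine-tuning, where the same examples would instead form a labeled training set used to update the model's weights.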
We welcome contributions to this repository! If you know of relevant papers, tools, or resources that should be included, please feel free to submit a pull request or open an issue.
This repository is for educational and research purposes. All linked papers and resources are subject to their respective licenses and copyrights.