AI Agent Security Research Repository

A comprehensive collection of research papers, surveys, and resources focused on the security, privacy, and safety aspects of AI agents and Large Language Model (LLM)-based systems.

Table of Contents

  • Overview
  • Whitepapers
  • Survey Papers
  • Research Papers
  • Tools and Platforms
  • Articles and Resources
  • Background Knowledge
  • Contributing
  • License

Overview

This repository serves as a curated collection of academic papers, industry whitepapers, and educational resources that explore the security challenges and solutions in AI agent systems. As AI agents become more prevalent and autonomous, understanding their security implications becomes crucial for researchers, developers, and practitioners.

Note: ArXiv links in this repository automatically point to the latest available versions of papers. Some papers may have multiple versions (v1, v2, etc.) with updates and improvements over time.

Whitepapers

  • Google: Agents - PDF

Survey Papers

Comprehensive surveys that provide broad overviews of AI agent security challenges and research directions:

  • AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways - Paper
  • Security of AI Agents - Paper
  • Navigating the Risks: A Survey of Security, Privacy, and Ethics Threats in LLM-Based Agents - Paper
  • Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security - Paper
  • The Rise and Potential of Large Language Model Based Agents: A Survey - Paper
  • The Emerging Security and Privacy of LLM Agent: A Survey with Case Studies - Paper

Research Papers

Attack Methods

Research focusing on vulnerabilities and attack vectors against AI agents:

  • Towards Action Hijacking of Large Language Model-based Agent - Paper

Defense Mechanisms

Research on protecting AI agents from security threats:

  • Defining and Detecting the Defects of the Large Language Model-based Autonomous Agent - Paper

Safety and Reasoning

Research on safety considerations and reasoning capabilities in AI systems:

  • SAFECHAIN: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities - Paper
  • Are Smarter LLMs Safer? Exploring Safety-Reasoning Trade-offs in Prompting and Fine-Tuning - Paper
  • Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies - Paper
  • A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos - Paper
  • Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning - Paper
  • Safety at Scale: A Comprehensive Survey of Large Model Safety - Paper
  • Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models - Paper
  • OVERTHINK: Slowdown Attacks on Reasoning LLMs - Paper
  • Demystifying Long Chain-of-Thought Reasoning in LLMs - Paper

Tools and Platforms

Practical tools and platforms for evaluating AI agent security:

  • SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI - Website | Dataset | Paper
  • OpenHands: An Open Platform for AI Software Developers as Generalist Agents - Website | Paper

Articles and Resources

Additional educational resources and industry perspectives:

Note: This section is currently being updated with verified resources.

Background Knowledge

Key Differences Between Pre-Training, Post-Training, Fine-Tuning, and In-Context Learning

Understanding these fundamental concepts is essential for comprehending AI agent security research:

| Aspect | Pre-Training | Post-Training | Fine-Tuning | In-Context Learning |
|---|---|---|---|---|
| Definition | Initial training of a model on a large, general dataset to learn foundational knowledge. | Additional training to refine the model's behavior or outputs after pre-training. | Further training on a task-specific dataset to adapt the model to a specific task or domain. | Using the model directly, with task-specific examples provided in the input prompt, to perform a task without updating its weights. |
| Purpose | Learn general-purpose representations (e.g., language understanding, image features). | Improve general behavior, alignment, or safety (e.g., reducing biases, improving coherence). | Specialize the model for a specific task or domain (e.g., sentiment analysis, medical text classification). | Perform a task dynamically by leveraging the model's pre-trained knowledge and providing examples in the input. |
| Process | Train from scratch or continue training on a massive dataset (e.g., text corpora, image datasets). | Use techniques like reinforcement learning from human feedback (RLHF), adversarial training, or unsupervised learning. | Update the model's weights on a smaller, task-specific dataset, often with a lower learning rate. | Provide task-specific examples or instructions in the input prompt; the model generates the desired output without weight updates. |
| Data Requirements | Large, diverse, and often unlabeled datasets (e.g., Common Crawl, ImageNet). | May use unlabeled data, human feedback, or other forms of supervision. | Requires labeled data specific to the task or domain. | Requires only a few examples or instructions in the input prompt (no additional training data). |
| Training Scope | General-purpose learning (e.g., language modeling, image recognition). | General or behavioral refinement (e.g., alignment with human preferences). | Task-specific adaptation (e.g., classifying emails as spam or not spam). | No training; the model uses its pre-trained knowledge and attention mechanisms to adapt its behavior dynamically based on the examples or instructions in the input prompt. |
| Model Changes | Initializes or updates the model's weights with general knowledge. | Refines the model's behavior without necessarily specializing it for a task. | Updates the model's weights to specialize it for a specific task or domain. | No changes to the model's weights; it relies on the input prompt for task-specific guidance. |
| Use Cases | Foundation for transfer learning (e.g., GPT, BERT, ResNet). | Aligning LLMs with human values, reducing harmful outputs, or improving robustness. | Adapting pre-trained models to specific tasks like sentiment analysis, object detection, or medical diagnosis. | Performing tasks like translation, summarization, or question-answering without additional training. |
| Example | Training GPT on a large text corpus to learn language patterns. | Using RLHF to make ChatGPT more aligned with user intentions. | Fine-tuning BERT on a dataset of customer reviews for sentiment analysis. | Providing a few English-to-French translation examples in the input prompt and asking the model to translate a new sentence. |
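To make the in-context learning column above concrete, here is a minimal Python sketch of few-shot prompt construction for the table's English-to-French translation example. The model's weights are never updated; the task is specified entirely by examples embedded in the prompt. The helper name and translation pairs are illustrative assumptions, not tied to any particular library or API.

```python
def build_few_shot_prompt(examples, query):
    """Assemble a few-shot prompt from (input, output) example pairs.

    The trailing "French:" cue invites the model to complete the
    pattern for the new query, with no training involved.
    """
    blocks = [f"English: {en}\nFrench: {fr}" for en, fr in examples]
    blocks.append(f"English: {query}\nFrench:")
    return "\n\n".join(blocks)


examples = [
    ("Hello", "Bonjour"),
    ("Thank you", "Merci"),
]
prompt = build_few_shot_prompt(examples, "Good morning")
print(prompt)
```

The resulting string would be sent as-is to an LLM; fine-tuning, by contrast, would bake these examples into the weights via gradient updates instead of placing them in the prompt.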

Contributing

We welcome contributions to this repository! If you know of relevant papers, tools, or resources that should be included, please feel free to submit a pull request or open an issue.

License

This repository is for educational and research purposes. All linked papers and resources are subject to their respective licenses and copyrights.
