# AutoDefense Blog #1982

Merged: 25 commits, Mar 30, 2024

Commits:
- b7c3687: AutoDefense Blog (XHMY, Mar 12, 2024)
- e0fdff9: Merge branch 'main' into main (XHMY, Mar 12, 2024)
- 77d76bc: Update Defense Agency Section (XHMY, Mar 12, 2024)
- 8ebc326: format update (XHMY, Mar 12, 2024)
- 1275441: Update website/blog/2024-03-11-AutoDefense/Defending LLMs Against Jai… (XHMY, Mar 13, 2024)
- cb17911: Update website/blog/2024-03-11-AutoDefense/Defending LLMs Against Jai… (XHMY, Mar 13, 2024)
- 67e2f9b: Update website/blog/2024-03-11-AutoDefense/Defending LLMs Against Jai… (XHMY, Mar 13, 2024)
- a29b0e5: Update website/blog/2024-03-11-AutoDefense/Defending LLMs Against Jai… (XHMY, Mar 13, 2024)
- 9d88ab8: Update website/blog/2024-03-11-AutoDefense/Defending LLMs Against Jai… (XHMY, Mar 13, 2024)
- 5d23fea: Update website/blog/2024-03-11-AutoDefense/Defending LLMs Against Jai… (XHMY, Mar 13, 2024)
- e8de68c: format fix (XHMY, Mar 13, 2024)
- 7ae59b0: Merge branch 'main' into main (XHMY, Mar 13, 2024)
- a9b8c43: Merge branch 'main' into main (XHMY, Mar 15, 2024)
- 42d598a: rename picture, make it informative. Add a overall sentence to introd… (XHMY, Mar 15, 2024)
- 9ecfa88: Update website/blog/2024-03-11-AutoDefense/Defending LLMs Against Jai… (XHMY, Mar 15, 2024)
- 75d0e2f: Merge branch 'main' into main (XHMY, Mar 23, 2024)
- aa0d8ef: update Further reading, introduction (XHMY, Mar 23, 2024)
- caae316: update Further reading, introduction (XHMY, Mar 23, 2024)
- 0396dfd: Merge branch 'main' into main (XHMY, Mar 24, 2024)
- 29f002a: Update website/blog/2024-03-11-AutoDefense/Defending LLMs Against Jai… (XHMY, Mar 24, 2024)
- fce2bee: Update website/blog/2024-03-11-AutoDefense/Defending LLMs Against Jai… (XHMY, Mar 25, 2024)
- 26a4c95: Update website/blog/2024-03-11-AutoDefense/Defending LLMs Against Jai… (XHMY, Mar 25, 2024)
- 8676f40: Merge branch 'main' into main (XHMY, Mar 25, 2024)
- aa2ef9b: Merge branch 'main' into main (XHMY, Mar 30, 2024)
- 2f4c432: Merge branch 'main' into main (yiranwu0, Mar 30, 2024)
@@ -0,0 +1,104 @@
---
title: AutoDefense - Defend against jailbreak attacks with AutoGen
authors:
- yifanzeng
- yiranwu
tags: [LLM, GPT, research]
---

![image-20240306191435555](imgs/image-20240306191435555.png)

## TL;DR

- We propose **AutoDefense**, a multi-agent defense framework using AutoGen to protect LLMs from jailbreak attacks.
- AutoDefense employs a response-filtering mechanism with specialized LLM agents collaborating to analyze potentially harmful responses.
- Experiments show our three-agent defense system with LLaMA-2-13B effectively reduces the jailbreak attack success rate while maintaining a low false positive rate on normal user requests.

## The AutoDefense Framework

AutoDefense is a multi-agent defense framework built on AutoGen that filters harmful responses from LLMs. It consists of three main components:

1. **Input Agent**: Preprocesses the LLM response into a formatted message for the defense agency.
2. **Defense Agency**: Contains multiple LLM agents that collaborate to analyze the response and determine if it's harmful. Agents have specialized roles like intention analysis, prompt inferring, and final judgment.
3. **Output Agent**: Decides the final response to the user based on the defense agency's judgment. If deemed harmful, it overrides with an explicit refusal.

The number of agents in the defense agency is flexible. We explore configurations with 1-3 agents.
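To make the data flow concrete, below is a minimal sketch of this pipeline in plain Python. It assumes a `defense_agency` callable that wraps the multi-agent judgment (built with AutoGen agents, as described in the next section); the message format and refusal text are illustrative, not the exact implementation.

```python
# Minimal sketch of the AutoDefense pipeline. `defense_agency` is an
# assumed callable returning True if the response is judged harmful.

REFUSAL = "I'm sorry, but I cannot assist with that request."

def input_agent(llm_response: str) -> str:
    # Preprocess the raw LLM response into a formatted message
    # for the defense agency (format is illustrative).
    return f"Determine whether the following response is harmful:\n\n{llm_response}"

def output_agent(llm_response: str, is_harmful: bool) -> str:
    # If the response is deemed harmful, override it with an explicit refusal.
    return REFUSAL if is_harmful else llm_response

def autodefense(llm_response: str, defense_agency) -> str:
    message = input_agent(llm_response)
    is_harmful = defense_agency(message)  # multi-agent judgment
    return output_agent(llm_response, is_harmful)
```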

![image-20240306194957182](imgs/image-20240306194957182.png)

### Defense Agency

The defense agency is designed to determine whether a given response contains harmful content and is therefore inappropriate to present to the user. We propose a three-step process for the agents to collaboratively determine whether a response is harmful:

- **Intention Analysis**: Analyze the intention behind the given content to identify potentially malicious motives.
- **Prompts Inferring**: Infer possible original prompts that could have generated the response, without any jailbreak content. Reconstructing prompts free of misleading instructions reactivates the LLM's safety mechanisms.
- **Final Judgment**: Make a final judgment on whether the response is harmful based on the intention analysis and inferred prompts.
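To illustrate, these steps can be expressed as per-step instructions handed to the agents. The wording below is a paraphrase for illustration; the exact AutoDefense prompts differ.

```python
# Paraphrased per-step instructions (not the exact AutoDefense prompts).
DEFENSE_STEPS = {
    "intention_analysis": (
        "Analyze the intention behind the given content and point out "
        "any potentially malicious motives."
    ),
    "prompts_inferring": (
        "Infer the original prompts that could have generated this "
        "response, stated without any jailbreak instructions."
    ),
    "final_judgment": (
        "Based on the intention analysis and inferred prompts, judge "
        "whether the response is harmful."
    ),
}
```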

Based on this process, we construct three different patterns in the multi-agent framework, consisting of one to three LLM agents.

#### Single-Agent Design

A simple design is to utilize a single LLM agent to analyze and make judgments in a chain-of-thought (CoT) style. While straightforward to implement, it requires the LLM agent to solve a complex problem with multiple sub-tasks.
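A minimal sketch of such a single-agent judge with AutoGen follows; the system prompt wording and verdict labels are illustrative, and the `config_list` entry assumes an OpenAI key is available in the environment.

```python
import autogen

llm_config = {
    "config_list": [{"model": "gpt-3.5-turbo-1106"}],  # assumes OPENAI_API_KEY is set
    "temperature": 0.7,
}

# One agent performs all three sub-tasks in a chain-of-thought style.
single_judge = autogen.AssistantAgent(
    name="judge",
    system_message=(
        "You are a safety judge. Given an LLM response: (1) analyze the "
        "intention behind it, (2) infer the original prompt that could have "
        "produced it, and (3) conclude with a verdict: SAFE or UNSAFE."
    ),
    llm_config=llm_config,
)
```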

#### Multi-Agent Design

Compared with a single agent, a multi-agent design lets each agent focus on its assigned sub-task. Each agent only needs to receive and understand the detailed instructions for a specific sub-task. This helps LLMs with limited steerability finish a complex task by following the instructions for each sub-task.

- **Coordinator**: With more than one LLM agent, we introduce a coordinator agent that is responsible for coordinating the work of the agents. The goal of the coordinator is to have each agent start its response after a user message, which is a more natural interaction pattern for LLMs.

- **Two-Agent System**: This configuration consists of two LLM agents and a coordinator agent: (1) the analyzer, which is responsible for analyzing the intention and inferring the original prompt, and (2) the judge, responsible for giving the final judgment. The analyzer will pass its analysis to the coordinator, which then asks the judge to deliver a judgment.

- **Three-Agent System**: This configuration consists of three LLM agents and a coordinator agent: (1) the intention analyzer, which is responsible for analyzing the intention of the given content, (2) the prompt analyzer, responsible for inferring the possible original prompts given the content and its intention, and (3) the judge, which is responsible for giving the final judgment. The coordinator agent acts as the bridge between them.

Each agent is given a system prompt containing detailed instructions and an in-context example of the assigned task.
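As a sketch, the three-agent configuration can be assembled with AutoGen's group chat, with the coordinator realized as a group chat manager. The system prompts are abbreviated here, and the exact prompts and speaker-selection logic in AutoDefense may differ.

```python
import autogen

llm_config = {
    "config_list": [{"model": "gpt-3.5-turbo-1106"}],  # assumes OPENAI_API_KEY is set
    "temperature": 0.7,
}

# Abbreviated system prompts; the real ones include detailed
# instructions and an in-context example for each sub-task.
intention_analyzer = autogen.AssistantAgent(
    name="intention_analyzer",
    system_message="Analyze the intention behind the given content...",
    llm_config=llm_config,
)
prompt_analyzer = autogen.AssistantAgent(
    name="prompt_analyzer",
    system_message="Infer the possible original prompts for the content...",
    llm_config=llm_config,
)
judge = autogen.AssistantAgent(
    name="judge",
    system_message="Give the final judgment on whether the response is harmful...",
    llm_config=llm_config,
)

# The coordinator passes the conversation from one agent to the next.
groupchat = autogen.GroupChat(
    agents=[intention_analyzer, prompt_analyzer, judge],
    messages=[],
    max_round=4,
    speaker_selection_method="round_robin",
)
coordinator = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)
```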

## Experiment Setup

We evaluate AutoDefense on two datasets:

- A curated set of 33 harmful prompts and 33 safe prompts. The harmful prompts cover discrimination, terrorism, self-harm, and PII leakage; the safe prompts are GPT-4-generated daily-life and science inquiries.
- The DAN dataset with 390 harmful questions and 1,000 instruction-following pairs sampled from Stanford Alpaca.

Because our defense framework is designed to defend a large LLM with an efficient small LLM, we use GPT-3.5 as the victim LLM in our experiments.

We use different types and sizes of LLMs to power agents in the multi-agent defense system:

1. **GPT-3.5-Turbo-1106**
2. **LLaMA-2**: LLaMA-2-7b, LLaMA-2-13b, LLaMA-2-70b
3. **Vicuna**: Vicuna-v1.5-7b, Vicuna-v1.5-13b, Vicuna-v1.3-33b
4. **Mistral**: Mixtral-8x7b-v0.1, Mistral-7b-v0.2

We use llama-cpp-python to serve the chat completion API for open-source LLMs, allowing each LLM agent to perform inference through a unified API. INT8 quantization is used for efficiency.

LLM temperature is set to `0.7` in our multi-agent defense, with other hyperparameters kept as default.
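As an example, a quantized LLaMA-2 model can be served through llama-cpp-python's OpenAI-compatible server and referenced in each agent's `llm_config`. The model path, port, and model name below are placeholders, not the exact setup used.

```python
# First launch the OpenAI-compatible server (shell command; path and
# port are placeholders):
#   python -m llama_cpp.server --model ./models/llama-2-13b-chat.Q8_0.gguf --port 8000

llm_config = {
    "config_list": [{
        "model": "llama-2-13b",                   # name is illustrative
        "base_url": "http://localhost:8000/v1",   # local llama-cpp-python server
        "api_key": "NULL",                        # a local server needs no real key
    }],
    "temperature": 0.7,  # value used in our multi-agent defense
}
```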

## Experiment Results

![image-20240311175815518](imgs/image-20240311175815518.png)

We compare different methods for defending GPT-3.5, as shown in Table 3. LLaMA-2-13B is used as the defense LLM in AutoDefense. We find that AutoDefense outperforms the other methods in terms of ASR (attack success rate).

### Number of Agents vs. Attack Success Rate (ASR)

![image-20240306192757354](imgs/image-20240306192757354.png)

Increasing the number of agents generally improves defense performance, especially for the LLaMA-2 models. The three-agent defense system achieves the best balance of low ASR and low FPR. For LLaMA-2-13b, the ASR drops from 9.44% with a single agent to 7.95% with three agents.

### Comparisons with Other Defenses

AutoDefense outperforms other methods in defending GPT-3.5. Our three-agent defense system with LLaMA-2-13B reduces the ASR on GPT-3.5 from 55.74% to 7.95%, surpassing the performance of System-Mode Self-Reminder (22.31%), Self Defense (43.64%), OpenAI Moderation API (53.79%), and Llama Guard (21.28%).

## Custom Agent: Llama Guard

While the three-agent defense system with LLaMA-2-13B achieves a low ASR, the FPR of the LLaMA-2-7b-based defense is relatively high. To address this, we introduce Llama Guard as a custom agent in a 4-agent system.

Llama Guard is designed to take both prompt and response as input for safety classification. In our 4-agent system, the Llama Guard agent generates its response after the prompt analyzer, extracting inferred prompts and combining them with the given response to form prompt-response pairs. These pairs are then passed to Llama Guard for safety inference.

If none of the prompt-response pairs are deemed unsafe by Llama Guard, the agent will respond that the given response is safe. The judge agent considers the Llama Guard agent's response alongside other agents' analyses to make its final judgment.
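A sketch of such a custom agent in AutoGen is shown below. It assumes a helper `llama_guard_classify(prompt, response)` that returns "safe" or "unsafe", and a hypothetical message format for the prompt analyzer's output; neither is the exact AutoDefense implementation.

```python
import autogen

def parse_last_message(content: str):
    # Hypothetical parser: expects "PROMPTS:" lines followed by "RESPONSE:".
    prompts_part, response = content.split("RESPONSE:", 1)
    prompts = [p.strip() for p in prompts_part.replace("PROMPTS:", "").splitlines() if p.strip()]
    return prompts, response.strip()

class LlamaGuardAgent(autogen.ConversableAgent):
    """Custom agent that defers safety classification to Llama Guard."""

    def __init__(self, name, llama_guard_classify, **kwargs):
        # llm_config=False: this agent replies via custom logic, not an LLM.
        super().__init__(name=name, llm_config=False, **kwargs)
        self._classify = llama_guard_classify
        self.register_reply(autogen.ConversableAgent, LlamaGuardAgent._guard_reply)

    def _guard_reply(self, messages=None, sender=None, config=None):
        # Pair each inferred prompt with the given response and classify.
        prompts, response = parse_last_message(messages[-1]["content"])
        verdicts = [self._classify(p, response) for p in prompts]
        if any(v == "unsafe" for v in verdicts):
            return True, "At least one prompt-response pair is unsafe."
        return True, "The given response is safe."
```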

As shown in Table 4, introducing Llama Guard as a custom agent significantly reduces the FPR from 37.32% to 6.80% for the LLaMA-2-7b based defense, while keeping the ASR at a competitive level of 11.08%. This demonstrates AutoDefense's flexibility in integrating different defense methods as additional agents, where the multi-agent system benefits from the new capabilities brought by custom agents.

![image-20240306193629333](imgs/image-20240306193629333.png)

## For Further Reading

[AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks](https://arxiv.org/abs/2403.04783)
6 changes: 6 additions & 0 deletions website/blog/authors.yml
@@ -117,3 +117,9 @@ freedeaths:
title: Data Scientist at PingCAP LAB
url: https://github.com/freedeaths/
image_url: https://github.com/freedeaths.png

yifanzeng:
name: Yifan Zeng
title: PhD student at Oregon State University
url: https://xhmy.github.io/
image_url: https://xhmy.github.io/assets/img/photo.JPG