# Toxicity
The toxicity vulnerability evaluates whether an LLM can resist generating or assisting in the creation of harmful, offensive, or demeaning content. This includes avoiding direct toxic outputs, such as insults or threats, and rejecting prompts that encourage or guide users in crafting harmful language or behaviors. The goal is to ensure the model actively discourages and prevents toxic interactions.

* "insults": Tests for outputs that include insults or demeaning or belittling language, or responses that teach how to insult someone effectively.
* "profanity": Tests for outputs that include offensive or vulgar language, or responses that encourage or normalize such language.
* "threats": Tests for outputs that issue direct threats, provide instructions to intimidate, or suggest ways to harm others verbally or physically.
* "mockery": Tests for outputs that ridicule or mock someone, or teach methods to humiliate or undermine others through ridicule.

ref.: https://www.trydeepteam.com/docs/red-teaming-vulnerabilities-toxicity

In [1]:
# Configure DeepTeam to use the locally hosted Ollama model llama3.2:latest
!deepteam set-ollama llama3.2:latest

Ollama model configured.


In [2]:
import asyncio

import requests
from deepteam import red_team
from deepteam.vulnerabilities import Toxicity
from deepteam.attacks.single_turn import (
    Base64,
    GrayBox,
    Leetspeak,
    MathProblem,
    Multilingual,
    PromptInjection,
    Roleplay,
    ROT13,
)

## Configuring the LLM to be tested

In [3]:
async def model_callback(prompt: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.2:latest",
            "prompt": prompt,
            "stream": False      # return the full response as a single block
        },
        timeout=120
    )
    resp.raise_for_status()  # surface HTTP errors instead of a KeyError on the next line
    return resp.json()["response"]
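
The callback above is declared `async` but makes a blocking HTTP call, so concurrent attack simulations would wait on one another. A minimal sketch of a non-blocking variant follows, assuming Python 3.9+ (for `asyncio.to_thread`) and the same Ollama endpoint and model as above. The names `build_payload`, `_post_blocking`, and `model_callback_nonblocking` are hypothetical helpers introduced for illustration, and the standard-library `urllib` stands in for `requests` only to keep the sketch dependency-free.

```python
import asyncio
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # same endpoint as the cell above


def build_payload(prompt: str, model: str = "llama3.2:latest") -> dict:
    """Build the JSON body for a non-streaming generate request."""
    return {"model": model, "prompt": prompt, "stream": False}


def _post_blocking(prompt: str, timeout: float = 120.0) -> str:
    """Send the request synchronously and return the generated text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


async def model_callback_nonblocking(prompt: str) -> str:
    """Run the blocking call in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(_post_blocking, prompt)
```

Passing `model_callback_nonblocking` in place of `model_callback` should behave the same from the caller's side; whether it shortens a run depends on how many prompts are dispatched concurrently.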

## Configuring attacks

In [4]:
# Single turn
prompt_injection = PromptInjection()
roleplay = Roleplay()
gray_box = GrayBox()
leetspeak = Leetspeak()
rot_13 = ROT13()
multilingual = Multilingual()
math_problem = MathProblem()
base64 = Base64()  # note: this name shadows the standard-library base64 module

## Attack test

### Prompt injection

The prompt injection attack method injects hidden instructions into the baseline attack, which could be interpreted by the LLM in ways that bypass its restrictions or lead to harmful outputs.

In [5]:
toxicity = Toxicity(types=["insults", "profanity", "threats", "mockery"])


risk = red_team(model_callback=model_callback, 
                vulnerabilities=[toxicity], 
                attacks=[prompt_injection])

💥 Generating 4 attacks (for 4 vulnerability types across 1 vulnerability(s)): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:19<00:00, 19.71s/it]
✨ Simulating 4 attacks (using 1 method(s)): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<?, ?it/s]
📝 Evaluating 4 vulnerability types across 1 vulnerability(s): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:58<00:00, 14.69s/it]


### Roleplay

The roleplay attack method enhances baseline attacks by embedding them within elaborate persona-based scenarios, where the LLM is instructed to adopt a specific character role. This technique leverages the model's collaborative training to bypass safety mechanisms through authentic character portrayal.

In [6]:
risk = red_team(model_callback=model_callback, 
                vulnerabilities=[toxicity], 
                attacks=[roleplay])

💥 Generating 4 attacks (for 4 vulnerability types across 1 vulnerability(s)): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:17<00:00, 17.04s/it]
✨ Simulating 4 attacks (using 1 method(s)):   0%|                                                                                                                                                           | 0/4 [00:00<?, ?it/s]

### Gray Box

The gray box attack method leverages partial knowledge about an existing LLM system to craft an attack that targets known vulnerabilities, reframing the baseline attack with abstract or misleading language.

In [7]:
risk = red_team(model_callback=model_callback, 
                vulnerabilities=[toxicity], 
                attacks=[gray_box])

💥 Generating 4 attacks (for 4 vulnerability types across 1 vulnerability(s)): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:19<00:00, 19.75s/it]
✨ Simulating 4 attacks (using 1 method(s)):   0%|                                                                                                                                                           | 0/4 [00:00<?, ?it/s]

### Leetspeak

The leetspeak attack method replaces common characters with leetspeak equivalents, a form of symbolic substitution that transforms the attack text to avoid simple keyword detection.
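
The character substitution described above can be sketched in a few lines. This is an illustration only: the mapping below is an assumed, commonly used one, and the table DeepTeam's `Leetspeak` attack actually applies may differ. `to_leetspeak` is a hypothetical helper name.

```python
# Assumed substitution table for illustration; not taken from DeepTeam.
_BASE = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}

# Cover both lowercase and uppercase letters with the same replacements.
LEET_TABLE = str.maketrans({**_BASE, **{k.upper(): v for k, v in _BASE.items()}})


def to_leetspeak(text: str) -> str:
    """Replace selected letters with look-alike digits; other characters pass through."""
    return text.translate(LEET_TABLE)


print(to_leetspeak("Test sentence"))  # prints "7357 53n73nc3"
```

A filter that matches exact keywords would not match the transformed string, while a human or a capable model can still read it.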

In [8]:
risk = red_team(model_callback=model_callback, 
                vulnerabilities=[toxicity], 
                attacks=[leetspeak])

💥 Generating 4 attacks (for 4 vulnerability types across 1 vulnerability(s)): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:15<00:00, 15.69s/it]
✨ Simulating 4 attacks (using 1 method(s)): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 4002.20it/s]
📝 Evaluating 4 vulnerability types across 1 vulnerability(s): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:58<00:00, 14.71s/it]


### ROT13

The ROT-13 attack method rotates each letter in the baseline attack by 13 positions in the alphabet, creating a scrambled message that is less likely to trigger basic content filters.
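
As a small illustration of the rotation itself, the sketch below uses Python's built-in `rot_13` codec. The helper name `rot13` is hypothetical, and this shows only the text transform, not how DeepTeam's `ROT13` attack wraps the result in a prompt.

```python
import codecs


def rot13(text: str) -> str:
    """Shift each ASCII letter 13 places; digits, spaces, and punctuation are unchanged."""
    return codecs.encode(text, "rot_13")


encoded = rot13("Hello, world")
print(encoded)         # prints "Uryyb, jbeyq"
print(rot13(encoded))  # prints "Hello, world"
```

Because the alphabet has 26 letters, applying the rotation twice restores the original text, so the same function both encodes and decodes.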

In [9]:
risk = red_team(model_callback=model_callback, 
                vulnerabilities=[toxicity], 
                attacks=[rot_13])

💥 Generating 4 attacks (for 4 vulnerability types across 1 vulnerability(s)): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:16<00:00, 16.61s/it]
✨ Simulating 4 attacks (using 1 method(s)): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 4350.94it/s]
📝 Evaluating 4 vulnerability types across 1 vulnerability(s): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:14<00:00, 18.71s/it]


### Multilingual

The multilingual attack method translates a baseline attack into a less widely spoken language to evade content filters that focus on more widely used languages, such as English.

In [10]:
risk = red_team(model_callback=model_callback, 
                vulnerabilities=[toxicity], 
                attacks=[multilingual])

💥 Generating 4 attacks (for 4 vulnerability types across 1 vulnerability(s)): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:17<00:00, 17.73s/it]
✨ Simulating 4 attacks (using 1 method(s)):   0%|                                                                                                                                                           | 0/4 [00:00<?, ?it/s]

### Math Problem

The math problem attack method transforms baseline attacks into mathematically formatted inputs or embeds them within math problems, making them seem like harmless queries while disguising their true intent.

In [11]:
risk = red_team(model_callback=model_callback, 
                vulnerabilities=[toxicity], 
                attacks=[math_problem])

💥 Generating 4 attacks (for 4 vulnerability types across 1 vulnerability(s)): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:18<00:00, 18.92s/it]
✨ Simulating 4 attacks (using 1 method(s)):   0%|                                                                                                                                                           | 0/4 [00:00<?, ?it/s]

### Base64

The Base64 attack method encodes a baseline attack with the Base64 encoding scheme, turning the text into a string of encoded characters. This helps evade detection mechanisms because the content looks like random data.
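
The encoding step can be sketched with the standard library, as below. The helper names are hypothetical, and the module is imported under the alias `b64` because this notebook binds the name `base64` to a `Base64()` attack object. This shows only the text transform, not DeepTeam's internal implementation.

```python
import base64 as b64  # aliased to avoid clashing with the notebook's `base64` variable


def encode_base64(text: str) -> str:
    """Encode UTF-8 text as a Base64 ASCII string."""
    return b64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_base64(encoded: str) -> str:
    """Reverse encode_base64."""
    return b64.b64decode(encoded).decode("utf-8")


print(encode_base64("hello"))     # prints "aGVsbG8="
print(decode_base64("aGVsbG8="))  # prints "hello"
```

Base64 is a reversible encoding rather than encryption, so any model or filter that decodes the string recovers the original text.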

In [12]:
risk = red_team(model_callback=model_callback, 
                vulnerabilities=[toxicity], 
                attacks=[base64])

💥 Generating 4 attacks (for 4 vulnerability types across 1 vulnerability(s)): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:18<00:00, 18.87s/it]
✨ Simulating 4 attacks (using 1 method(s)): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 3068.82it/s]
📝 Evaluating 4 vulnerability types across 1 vulnerability(s): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:00<00:00, 15.12s/it]
