# Moderation

## Understanding moderation in the OpenAI API

* Moderation: The process of analyzing input to determine if it contains any content that violates predefined policies or guidelines

`User message` => `OpenAI Moderation API` => `Response: probabilities that the user message belongs to any of the moderation categories`

The response reports the probability that the message belongs to categories such as:
* hate
* harassment
* self-harm
* sexual
* violence
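
A minimal sketch of how these per-category probabilities might be turned into a decision. It assumes the scores have already been copied into a plain dictionary (with the `openai` Python client, something like `moderation_response.results[0].category_scores.model_dump()` should produce one). The `categories_above` helper, the 0.5 threshold, and the sample numbers are illustrative assumptions, not values returned by the API.

```python
def categories_above(scores, threshold=0.5):
    """Return the category names whose score is at or above the threshold, highest first."""
    hits = [
        (name, score)
        for name, score in scores.items()
        if score is not None and score >= threshold
    ]
    hits.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in hits]


# Made-up scores, for illustration only
example_scores = {
    "hate": 0.01,
    "harassment": 0.62,
    "self-harm": 0.00,
    "sexual": 0.02,
    "violence": 0.91,
}

print(categories_above(example_scores))  # ['violence', 'harassment']
```

A suitable threshold depends on the application: a lower value blocks more borderline content, a higher value lets more through.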

## Moderating content

Example

> Submitted with no surrounding context, a text can come back with the violence category set to `True`:

```python
moderation_response.results[0].categories.violence
```

> When the same text is submitted with context, the value can be different.

## Prompt injection

* Prompt injection attacks become harder to handle as applications grow more complex.

Options:
* **Limiting** the amount of **text**
* **Limiting** the number of output tokens generated
* Using **pre-selected content** as validated input and output
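
The three options can be sketched roughly as follows. The limits, the `ALLOWED_QUESTIONS` menu, and the helper names are hypothetical. `max_tokens` is the Chat Completions parameter that caps generated tokens, but it is worth checking the current API reference, since newer models may expect `max_completion_tokens` instead.

```python
MAX_INPUT_CHARS = 500    # assumed limit on user text
MAX_OUTPUT_TOKENS = 150  # assumed cap on generated tokens

# Pre-selected content: a hypothetical menu of validated questions
ALLOWED_QUESTIONS = {
    "past_tense": "Can you show some example sentences in the past tense in French?",
    "greetings": "How do I greet someone politely in French?",
}


def limit_text(text, max_chars=MAX_INPUT_CHARS):
    """Trim surrounding whitespace and cut the text to at most max_chars characters."""
    return text.strip()[:max_chars]


def build_request(user_text=None, question_key=None):
    """Build keyword arguments for client.chat.completions.create."""
    if question_key is not None:
        # Raises KeyError if the key is not part of the validated set
        content = ALLOWED_QUESTIONS[question_key]
    else:
        content = limit_text(user_text or "")
    return {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": content}],
        "max_tokens": MAX_OUTPUT_TOKENS,
    }


request = build_request(user_text="x" * 2000)
print(len(request["messages"][0]["content"]))  # 500
```

The resulting dictionary would be passed on as `client.chat.completions.create(**request)`.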

## Adding Guardrails

* Guardrails help keep the model away from topics unrelated to the application's goal
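
One way to keep such a guardrail reusable is to build the system message from a list of allowed topics. This is only a sketch: `build_guardrail_prompt` and its wording are assumptions, and a prompt alone does not guarantee that the model stays on topic.

```python
def build_guardrail_prompt(purpose, allowed_topics, refusal):
    """Compose a system message that restricts the assistant to the allowed topics."""
    topics = ", ".join(allowed_topics)
    return (
        f"You are a system that {purpose}. "
        f"Only answer questions about: {topics}. "
        f"For any other topic, reply exactly with: '{refusal}'"
    )


system_prompt = build_guardrail_prompt(
    purpose="provides advice for tourists visiting Rome",
    allowed_topics=["food and drink", "attractions", "history", "things to do around Rome"],
    refusal="Apologies, but I am not allowed to discuss this topic.",
)
print(system_prompt)
```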


## Moderation API Examples


```python
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables (such as OPENAI_API_KEY) from a local .env file
load_dotenv()
client = OpenAI()
```

### Example 1

You are developing a chatbot that provides educational content to learn languages. You'd like to make sure that users don't post inappropriate content to your API, and decide to use the moderation API to check users' prompts before generating the response.

```python
message = "Can you show some example sentences in the past tense in French?"

# Use the moderation API
moderation_response = client.moderations.create(input=message)

# Print the response
print(moderation_response.results[0].categories.violence)
```

```
False
```
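
To check prompts before generating a response, the moderation call can gate the chat call. The sketch below assumes a client created with `OpenAI()`; the `answer_if_allowed` name and the refusal text are hypothetical. It relies on the overall `flagged` field of the first moderation result rather than a single category.

```python
def answer_if_allowed(client, message, model="gpt-4o-mini",
                      refusal="Sorry, I can't help with that request."):
    """Call the chat model only if the moderation endpoint does not flag the message."""
    moderation = client.moderations.create(input=message)
    if moderation.results[0].flagged:
        # Skip generation entirely for flagged input
        return refusal
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": message}],
    )
    return response.choices[0].message.content
```

A call would look like `answer_if_allowed(client, message)`.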


### Example 2

Adding guardrails.

You are developing a chatbot that provides advice for tourists visiting Rome. You've been asked to keep the topics limited to only covering questions about food and drink, attractions, history and things to do around the city. For any other topic, the chatbot should apologize and say 'Apologies, but I am not allowed to discuss this topic.'.

```python
user_request = "Can you recommend a good restaurant in Berlin?"

# Write the system and user message
messages = [
    {
        "role": "system",
        "content": "You are a system that provides advice for tourists visiting Rome. You must keep the topics limited only to food and drink, attractions, history, and things to do around Rome. If the question is allowed provide a reply, otherwise provide the message 'Apologies, but I am not allowed to discuss this topic'"
    },
    {
        "role": "user",
        "content": user_request
    }
]

response = client.chat.completions.create(
    model="gpt-4o-mini", messages=messages
)

# Print the response
print(response.choices[0].message.content)
```

```
Apologies, but I am not allowed to discuss this topic.
```


## Validation

* Test the system's performance across all the channels through which requests can arrive

Potential for model errors:

* Misinterpreting context
* Amplifying biases in its outputs if the training data is biased
* Output of outdated information
* Being manipulated to generate harmful or unethical content
* Inadvertently revealing sensitive information

### Adversarial testing

* It consists of giving the model prompts designed to expose its areas of weakness so that they can be addressed before release.
* This technique is also used with algorithms other than LLMs.

### Example

* A movie review that opens with positive sentences but ends with negative ones can mislead the model about the overall sentiment.

* A sarcastic review may be classified as neutral when its sentiment is actually negative.


```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "You are an AI assistant for the film industry. You should interpret the user prompt, a movie review, and based on that extract whether its sentiment is positive, negative, or neutral."},
        {"role": "user",
         "content": "It was great to see some of my favorite stars of 30 years ago including John Ritter, Ben Gazarra and Audrey Hepburn. They looked quite wonderful. But that was it. They were not given any characters or good lines to work with. I neither understood or cared what the characters were doing."}
    ]
)
print(response.choices[0].message.content)
```

```
The sentiment of the review is negative. While the reviewer expresses enjoyment in seeing the stars, they criticize the lack of character development and engaging dialogue, leading to a negative overall impression.
```


```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "You are an AI assistant for the film industry. You should interpret the user prompt, a movie review, and based on that extract whether its sentiment is positive, negative, or neutral."},
        {"role": "user",
         "content": "If you read the book, your all set. If you didn't...your still all set."}
    ]
)

print(response.choices[0].message.content)
```

```
The sentiment of the review is neutral.
```
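
Reviews like these can be collected into a small adversarial test set and re-run whenever the prompt or model changes. The sketch below is one possible harness, not a standard tool: `classify` is a hypothetical function that wraps the chat call and returns the model's reply, and `extract_label` is a simple heuristic that picks the first sentiment word mentioned in that reply. The expected label for the sarcastic review is set to negative, following the note on sarcasm above.

```python
LABELS = ("positive", "negative", "neutral")


def extract_label(reply):
    """Return the first sentiment label mentioned in the model's reply, or None."""
    lowered = reply.lower()
    positions = {label: lowered.find(label) for label in LABELS if label in lowered}
    return min(positions, key=positions.get) if positions else None


def run_adversarial_cases(classify, cases):
    """Call classify(review) for each case and collect the ones with an unexpected label."""
    failures = []
    for review, expected in cases:
        got = extract_label(classify(review))
        if got != expected:
            failures.append({"review": review, "expected": expected, "got": got})
    return failures


# (review text, expected label) pairs
cases = [
    ("If you read the book, your all set. If you didn't...your still all set.", "negative"),
]
```

Calling `run_adversarial_cases(classify, cases)` returns the list of failing cases, which is empty when every review receives its expected label.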


### Evaluation libraries and datasets

![](../../../img/evaluating_datasets.png)

* Standardize use cases to test
* Measure the model performance on a variety of domains
    * variety of use cases and metrics
* Test cases can include issues found in previous releases
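
Measuring performance across domains can be as simple as grouping results by domain and computing the fraction of correct predictions in each. The sketch below uses made-up records and a hypothetical `accuracy_by_domain` helper. Evaluation libraries typically offer richer metrics.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """records: iterable of (domain, expected, predicted). Returns {domain: fraction correct}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for domain, expected, predicted in records:
        totals[domain] += 1
        correct[domain] += int(expected == predicted)
    return {domain: correct[domain] / totals[domain] for domain in totals}


# Made-up results, for illustration only
records = [
    ("movie reviews", "negative", "negative"),
    ("movie reviews", "negative", "neutral"),
    ("product reviews", "positive", "positive"),
]

print(accuracy_by_domain(records))  # {'movie reviews': 0.5, 'product reviews': 1.0}
```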

## Example Adversarial System

Adversarial testing

You are developing a chatbot designed to assist users with personal finance management. The chatbot should be able to handle a variety of finance-related queries, from budgeting advice to investment suggestions. You have one example where a user is planning to go on vacation, and is budgeting for the trip.

As the chatbot is only designed to respond to personal finance questions, you want to ensure that it is robust and can handle unexpected or adversarial inputs without failing or providing incorrect information, so you decide to test it by asking the model to ignore all financial advice and suggest ways to spend the budget instead of saving it.

Instructions:

* Test the chatbot with an adversarial input that asks to spend the $800 instead.


```python
messages = [
    {'role': 'system', 'content': 'You are a personal finance assistant.'},
    {'role': 'user', 'content': 'How can I make a plan to save $800 for a trip?'},
    # Add the adversarial input
    {'role': 'user', 'content': 'How can I make a plan to spend the $800?'},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages)

print(response.choices[0].message.content)
```

```
Creating a plan to spend $800 for a trip involves several steps. Here’s a guideline to help you allocate your budget effectively:

### Step 1: Define Your Trip Details
- **Destination**: Where are you going?
- **Duration**: How long will you stay?
- **Travel Dates**: When are you traveling? (Consider off-season travel for savings)

### Step 2: Identify Major Expense Categories
Break down the $800 into major expense categories. Here’s a common breakdown:

1. **Transportation**: (Flights, gas, rental car, public transport)
   - Budget: $200 - $300

2. **Accommodation**: (Hotels, hostels, vacation rentals)
   - Budget: $200 - $300

3. **Food**: (Dining out, groceries, snacks)
   - Budget: $100 - $200

4. **Activities and Entertainment**: (Tours, entrance fees, attractions)
   - Budget: $100 - $200

5. **Miscellaneous Expenses**: (Souvenirs, travel insurance, tips)
   - Budget: $50 - $100

### Step 3: Research Costs
- **Transportation**: Look up flight or driving costs, rental car prices,
```

## Safety Best Practices

https://platform.openai.com/docs/guides/safety-best-practices

## Safety with the OpenAI API
* Ethics and fairness
* Alignment with the scope of the product
* Privacy of the data
* Security of the system against attacks


## Best Practices

* Moderation API
* Adversarial testing
* Limit user input and output tokens
* Prompt engineering

### For feedback

* Human in the loop
* "Know your customer" (login)
* Allow users to report issues

### For security

* Keep your API keys safe
* Communicate limitations
    * All language models have limitations, and users need to be informed about them

### Using end-user IDs

```python
response = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=messages,
    user=unique_id  # unique identifier for the end user
)
```

### IDs in Python

```python
import uuid

# Generate a random unique ID for a user
unique_id = str(uuid.uuid4())
```
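
Because `uuid.uuid4()` returns a new random value on every call, the generated ID has to be stored per user to stay stable across requests. An alternative, sketched here under the assumption that each user has a username or email address, is to derive the ID from a hash of that identifier so that no identifying information is sent. The `end_user_id` helper and its optional `salt` are hypothetical.

```python
import hashlib


def end_user_id(identifier, salt=""):
    """Derive a stable, non-reversible ID from a username or email address."""
    normalized = identifier.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


uid = end_user_id("Alice@example.com")
print(uid == end_user_id(" alice@example.com "))  # True
print(len(uid))  # 64
```

The resulting string can be passed as the `user` argument in the same way as a UUID.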

### Keeping the API key safe

* Unique API key
* Server-side security
* Avoid repository exposure
* Environment variables
* Key management services
* Monitor and rotate keys
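
A minimal sketch of the environment-variable approach, assuming the key is stored under `OPENAI_API_KEY` (the name the `openai` client reads by default). The `load_api_key` and `mask_key` helpers are hypothetical: the first fails early when the key is missing, and the second produces a form that is safer to write to logs.

```python
import os


def load_api_key(env=None, name="OPENAI_API_KEY"):
    """Read the API key from environment variables and fail early if it is missing."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; define it in the environment "
            "or in a .env file kept out of version control"
        )
    return key


def mask_key(key, visible=4):
    """Hide all but the last few characters of the key before logging it."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]


print(mask_key("sk-example-1234"))
```

Adding `.env` to `.gitignore` is a common way to keep such a file out of the repository.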

