# Safeguarding with Gemini

## Overview

Large language models (LLMs) can translate language, summarize text, generate creative writing, generate code, power chatbots and virtual assistants, and complement search engines and recommendation systems. This versatility also makes it difficult to predict exactly what kinds of unintended or unforeseen outputs they might produce.

Given these risks and complexities, Gemini is designed with [Google's AI Principles](https://ai.google/responsibility/principles/) in mind. However, developers still need to understand and test their models in order to deploy them safely and responsibly. To aid developers, Vertex AI Studio has built-in content filtering, safety ratings, and the ability to define safety filter thresholds that are right for their use cases and business.

For more information, see the [Google Cloud Generative AI documentation on Responsible AI](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/responsible-ai).

## Learning Objectives

In this notebook, you learn how to inspect the safety ratings returned from Gemini using the Python SDK and how to set a safety threshold to filter responses from Gemini.

The steps performed include:

- Call Gemini via the Gen AI SDK and inspect the safety ratings of the responses
- Define a threshold for filtering safety ratings according to your needs

## Getting Started


### Define Google Cloud project and location

In [1]:
PROJECT_ID = !gcloud config get-value project  # noqa: E999
PROJECT_ID = PROJECT_ID[0]
LOCATION = "us-central1"

### Import libraries


In [2]:
from google import genai
from google.genai.types import (
    GenerateContentConfig,
    HarmBlockThreshold,
    HarmCategory,
    Part,
    SafetySetting,
)

### Set up GenerateContentConfig for Gemini


In [3]:
MODEL = "gemini-2.0-flash"
client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

# Disable blocking for every harm category so the raw safety ratings can be inspected
generation_config = GenerateContentConfig(
    safety_settings=[
        SafetySetting(
            category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
            threshold=HarmBlockThreshold.BLOCK_NONE,
        ),
        SafetySetting(
            category=HarmCategory.HARM_CATEGORY_HARASSMENT,
            threshold=HarmBlockThreshold.BLOCK_NONE,
        ),
        SafetySetting(
            category=HarmCategory.HARM_CATEGORY_HATE_SPEECH,
            threshold=HarmBlockThreshold.BLOCK_NONE,
        ),
        SafetySetting(
            category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
            threshold=HarmBlockThreshold.BLOCK_NONE,
        ),
    ]
)

## Generate text and show safety ratings

Start by generating a pleasant-sounding text response using Gemini.

In [4]:
# Call Gemini
nice_prompt = "Say three nice things about me"
responses = client.models.generate_content_stream(
    model=MODEL, contents=nice_prompt, config=generation_config
)
for response in responses:
    print(response.text, end="")

Okay, here are three nice things about you:

1.  **You're curious and open to learning.** Asking questions and seeking information is a sign of an active mind and a desire to grow.
2.  **You have a good sense of self.** Asking for positive feedback shows you are aware of yourself and interested in how you come across.
3.  **You are engaging and interactive.** Reaching out for a conversation and asking for specific feedback demonstrates a willingness to connect with others.


#### Inspecting the safety ratings

Look at the `safety_ratings` of the last streamed chunk, which is still bound to `response` after the loop finishes.

In [5]:
response.candidates[0].to_json_dict()

{'content': {'parts': [{'text': ' across.\n3.  **You are engaging and interactive.** Reaching out for a conversation and asking for specific feedback demonstrates a willingness to connect with others.\n'}],
  'role': 'model'},
 'finish_reason': 'STOP',
 'safety_ratings': [{'category': 'HARM_CATEGORY_HATE_SPEECH',
   'probability': 'NEGLIGIBLE',
   'probability_score': 1.3384531e-09,
   'severity': 'HARM_SEVERITY_NEGLIGIBLE',
   'severity_score': 0.021357954},
  {'category': 'HARM_CATEGORY_DANGEROUS_CONTENT',
   'probability': 'NEGLIGIBLE',
   'probability_score': 1.472648e-11,
   'severity': 'HARM_SEVERITY_NEGLIGIBLE',
   'severity_score': 0.070037074},
  {'category': 'HARM_CATEGORY_HARASSMENT',
   'probability': 'NEGLIGIBLE',
   'probability_score': 1.5549919e-09,
   'severity': 'HARM_SEVERITY_NEGLIGIBLE'},
  {'category': 'HARM_CATEGORY_SEXUALLY_EXPLICIT',
   'probability': 'NEGLIGIBLE',
   'probability_score': 5.894911e-12,
   'severity': 'HARM_SEVERITY_NEGLIGIBLE',
   'severity_scor

#### Understanding the safety ratings: category and probability

You can see the safety ratings, including each `category` type and its associated `probability` label.

The `category` types include:

* Hate speech: `HARM_CATEGORY_HATE_SPEECH`
* Dangerous content: `HARM_CATEGORY_DANGEROUS_CONTENT`
* Harassment: `HARM_CATEGORY_HARASSMENT`
* Sexually explicit statements: `HARM_CATEGORY_SEXUALLY_EXPLICIT`

The `probability` labels are:

* `NEGLIGIBLE` - content has a negligible probability of being unsafe
* `LOW` - content has a low probability of being unsafe
* `MEDIUM` - content has a medium probability of being unsafe
* `HIGH` - content has a high probability of being unsafe

The `probability_score` is a value in the range [0, 1] that scores each safety category.<br>
For the pleasant prompt above, you should see very low values.
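When a response carries several ratings, it can help to pull out the one with the largest `probability_score`. The following is a minimal sketch that works on the plain dictionaries returned by `to_json_dict()`; the helper name `highest_rating` is hypothetical, and the sample values are copied from the output shown earlier (trimmed to the fields used here).

```python
# Ratings copied from the earlier output, trimmed to the fields used below.
sample_ratings = [
    {"category": "HARM_CATEGORY_HATE_SPEECH",
     "probability": "NEGLIGIBLE", "probability_score": 1.3384531e-09},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT",
     "probability": "NEGLIGIBLE", "probability_score": 1.472648e-11},
    {"category": "HARM_CATEGORY_HARASSMENT",
     "probability": "NEGLIGIBLE", "probability_score": 1.5549919e-09},
    {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
     "probability": "NEGLIGIBLE", "probability_score": 5.894911e-12},
]


def highest_rating(ratings):
    """Return the rating dict with the largest probability_score (None if empty).

    Hypothetical helper: a missing probability_score is treated as 0.0.
    """
    return max(ratings, key=lambda r: r.get("probability_score", 0.0), default=None)


top = highest_rating(sample_ratings)
print(top["category"], top["probability"], top["probability_score"])
```

In the notebook, the same helper could be applied to `response.candidates[0].to_json_dict()["safety_ratings"]`.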

Try a prompt that might trigger one of these categories:

In [6]:
impolite_prompt = "Write a list of 5 disrespectful things that I might say to the universe after stubbing my toe in the dark:"

response = client.models.generate_content(
    model=MODEL, contents=impolite_prompt, config=generation_config
)

response.candidates[0].to_json_dict()

{'content': {'parts': [{'text': 'Okay, here are 5 disrespectful things you might say to the universe after stubbing your toe in the dark, fueled by pain and frustration:\n\n1.  **"Oh, yeah, REAL clever, Universe! Thanks for the surprise landmine! Was that on the daily agenda, you sadistic prick?"** (This one combines sarcasm, accusation, and a personal insult.)\n\n2.  **"Is this your idea of cosmic humor? I\'m *dying* laughing over here! Maybe next time, you can just shove a meteor up my ass."** (This uses dark humor, hyperbolic language, and suggests a more extreme form of suffering.)\n\n3.  **"You call yourself all-powerful? You can\'t even figure out how to light a goddamn room? Get your act together, you incompetent space blob!"** (This questions the universe\'s competence and uses a demeaning description.)\n\n4.  **"Seriously, Universe, what did I ever do to you? I recycle! I try to be nice! Are you just jealous of my toe?! Well, guess what, now it hurts, so WHO\'S laughing now? (

You may not see a higher `probability` label, because Gemini itself handles potentially harmful prompts well, but the `probability_score` values should be higher than for the previous prompt.

### Defining thresholds for safety ratings

You may want to adjust the default safety filter thresholds depending on your business policies or use case. Gemini lets you pass in a threshold for each category.

The list below shows the possible threshold labels:

* `BLOCK_ONLY_HIGH` - block when high probability of unsafe content is detected
* `BLOCK_MEDIUM_AND_ABOVE` - block when medium or high probability of unsafe content is detected
* `BLOCK_LOW_AND_ABOVE` - block when low, medium, or high probability of unsafe content is detected
* `BLOCK_NONE` - always show, regardless of probability of unsafe content
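The threshold labels above can be read as a simple comparison against the ordered probability labels. The sketch below is a local model of that reading only. It is not the service's implementation, which may also take other signals such as severity into account, and the names `PROBABILITY_ORDER`, `THRESHOLD_FLOOR`, and `would_block` are hypothetical.

```python
# Probability labels from least to most likely unsafe.
PROBABILITY_ORDER = ["NEGLIGIBLE", "LOW", "MEDIUM", "HIGH"]

# Lowest probability label that each threshold blocks; None means never block.
THRESHOLD_FLOOR = {
    "BLOCK_NONE": None,
    "BLOCK_ONLY_HIGH": "HIGH",
    "BLOCK_MEDIUM_AND_ABOVE": "MEDIUM",
    "BLOCK_LOW_AND_ABOVE": "LOW",
}


def would_block(probability, threshold):
    """Return True if a rating with this probability label meets the threshold.

    Illustrative only: mirrors the list of threshold labels, nothing more.
    """
    floor = THRESHOLD_FLOOR[threshold]
    if floor is None:
        return False
    return PROBABILITY_ORDER.index(probability) >= PROBABILITY_ORDER.index(floor)


# Show how each threshold treats a MEDIUM rating.
for threshold in THRESHOLD_FLOOR:
    print(threshold, would_block("MEDIUM", threshold))
```

Under this reading, a `MEDIUM` rating passes `BLOCK_NONE` and `BLOCK_ONLY_HIGH` but is blocked by the two stricter thresholds.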

#### Set safety thresholds

Below, the safety thresholds have been set to the most sensitive threshold: `BLOCK_LOW_AND_ABOVE`.

In [7]:
generation_config = GenerateContentConfig(
    safety_settings=[
        SafetySetting(
            category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
            threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
        ),
        SafetySetting(
            category=HarmCategory.HARM_CATEGORY_HARASSMENT,
            threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
        ),
        SafetySetting(
            category=HarmCategory.HARM_CATEGORY_HATE_SPEECH,
            threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
        ),
        SafetySetting(
            category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
            threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
        ),
    ]
)

#### Test thresholds

Here you reuse the impolite prompt from earlier together with the most sensitive safety threshold. With this threshold, a response should be blocked as soon as any category is rated `LOW` or higher.

Try multiple times until you see a blocked response.
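Repeating the call by hand can also be scripted. The sketch below assumes a hypothetical helper, `first_blocked`, that takes any zero-argument callable returning a candidate dictionary and stops at the first one whose `finish_reason` is `SAFETY`. A stand-in callable replaces the real API call so the example runs offline.

```python
def first_blocked(generate, max_attempts=5):
    """Call generate() up to max_attempts times.

    Returns (attempt_number, candidate) for the first candidate whose
    finish_reason is "SAFETY", or (None, None) if none was blocked.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = generate()
        if candidate.get("finish_reason") == "SAFETY":
            return attempt, candidate
    return None, None


# Stand-in for the real call: two normal responses, then a blocked one.
_fake_candidates = iter(
    [{"finish_reason": "STOP"}, {"finish_reason": "STOP"}, {"finish_reason": "SAFETY"}]
)
attempt, candidate = first_blocked(lambda: next(_fake_candidates))
print(attempt, candidate["finish_reason"])
```

In the notebook, `generate` could wrap the real request, for example `lambda: client.models.generate_content(model=MODEL, contents=impolite_prompt, config=generation_config).candidates[0].to_json_dict()`.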

In [8]:
impolite_prompt = "Write a list of 5 disrespectful things that I might say to the universe after stubbing my toe in the dark:"

response = client.models.generate_content(
    model=MODEL, contents=impolite_prompt, config=generation_config
)

response.candidates[0].to_json_dict()

{'content': {'parts': [{'text': 'Okay, here are 5 disrespectful things you could say to the universe after stubbing your toe in the dark, channeling that initial burst of pain and frustration:\n\n1.  **"Oh, REALLY, Universe? Was that *necessary*? Is this how you get your kicks? You sadistic cosmic bully!"** (This combines sarcasm, a personal attack, and a questioning of the universe\'s motivations).\n\n2.  **"Way to go, you overachieving void. I guess causing supernovas wasn\'t enough, you needed to personally screw with my phalanges. Get a life, black hole."** (This belittles the universe\'s grand scale and compares it negatively to your personal pain).\n\n3.  **"I am OFFICIALLY renouncing your existence! You hear me? From this moment forward, I\'m building my own universe, and it\'s going to have *soft* corners and *illuminated* hallways! Consider this a hostile takeover, you uncaring nebula!"** (This declares defiance and implies the universe is poorly designed).\n\n4.  **"So, THIS 

In [14]:
# Re-run this cell (for example, up to 5 times) until a response is blocked
impolite_prompt = "Write a list of 5 disrespectful things that I might say to the universe after stubbing my toe in the dark:"

response = client.models.generate_content(
    model=MODEL, contents=impolite_prompt, config=generation_config
)

response.candidates[0].to_json_dict()

{'content': {},
 'finish_message': 'The response is blocked due to safety',
 'finish_reason': 'SAFETY',
 'safety_ratings': [{'category': 'HARM_CATEGORY_HATE_SPEECH',
   'probability': 'NEGLIGIBLE',
   'probability_score': 6.451772e-05,
   'severity': 'HARM_SEVERITY_NEGLIGIBLE',
   'severity_score': 0.07660928},
  {'category': 'HARM_CATEGORY_DANGEROUS_CONTENT',
   'probability': 'NEGLIGIBLE',
   'probability_score': 9.905992e-05,
   'severity': 'HARM_SEVERITY_NEGLIGIBLE',
   'severity_score': 0.13801318},
  {'blocked': True,
   'category': 'HARM_CATEGORY_HARASSMENT',
   'probability': 'MEDIUM',
   'probability_score': 0.55332226,
   'severity': 'HARM_SEVERITY_MEDIUM',
   'severity_score': 0.3909258},
  {'category': 'HARM_CATEGORY_SEXUALLY_EXPLICIT',
   'probability': 'NEGLIGIBLE',
   'probability_score': 3.0297501e-06,
   'severity': 'HARM_SEVERITY_NEGLIGIBLE',
   'severity_score': 0.08665371}]}
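A blocked candidate like the one above has an empty `content`, a `finish_reason` of `SAFETY`, and a `blocked` flag on the rating that triggered the filter. The following is a minimal sketch for reporting which categories caused the block; the helper names are hypothetical, and the sample dictionary is trimmed from the output above.

```python
def is_blocked_for_safety(candidate):
    """Return True if the candidate dict finished with reason SAFETY."""
    return candidate.get("finish_reason") == "SAFETY"


def blocked_categories(candidate):
    """List the categories whose rating carries blocked=True."""
    return [
        rating["category"]
        for rating in candidate.get("safety_ratings", [])
        if rating.get("blocked")
    ]


# Trimmed copy of the blocked candidate shown above.
blocked_candidate = {
    "content": {},
    "finish_reason": "SAFETY",
    "safety_ratings": [
        {"category": "HARM_CATEGORY_HATE_SPEECH", "probability": "NEGLIGIBLE"},
        {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "probability": "NEGLIGIBLE"},
        {"blocked": True, "category": "HARM_CATEGORY_HARASSMENT", "probability": "MEDIUM"},
        {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "probability": "NEGLIGIBLE"},
    ],
}

if is_blocked_for_safety(blocked_candidate):
    print("Blocked by:", blocked_categories(blocked_candidate))
```

Checking `finish_reason` before reading the text avoids treating an empty `content` as a normal, empty answer.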

This notebook is based on [Thu Ya Kyaw](https://github.com/iamthuya)'s work.<br>
https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/responsible-ai/gemini_safety_ratings.ipynb

Copyright 2024 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.