In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Protecting Sensitive Data in Gen AI model responses

## Overview

[Sensitive Data Protection](https://cloud.google.com/security/products/sensitive-data-protection) is a fully managed service designed to discover, classify, and protect your sensitive data wherever it resides. It uses a variety of methods to identify sensitive data, including regular expressions, dictionaries, and contextual elements. Once sensitive data is identified, Sensitive Data Protection (Cloud Data Loss Prevention) can classify, mask, encrypt, or even delete it.

Sensitive Data Protection can be accessed through the Google Cloud console and used to scan data within Cloud Storage, BigQuery, and other Google Cloud services. This notebook demonstrates how to use the [Python Client for Cloud Data Loss Prevention](https://cloud.google.com/python/docs/reference/dlp/latest) to build Sensitive Data Protection capabilities directly into generative AI applications.

With this Python client, you define custom functions that identify and take corrective action on sensitive data in large language model (LLM) responses in real time. Throughout this notebook, you generate example text containing sensitive data and pass the results through custom Python functions that redact that data from Gemini 2.0 Flash model responses, so you can see this functionality in action on example data.

After learning how to work with the Python client, you can adapt these same Python functions for Gen AI applications in your organization to protect sensitive data across your workflows.  

Notebook credit: [Jim Miller, Google](https://github.com/JimMiller-0)

### Objectives

In this lab, you learn how to use Sensitive Data Protection through the Python Client for Cloud Data Loss Prevention and explore how to identify and redact sensitive data within responses from the Gemini 2.0 Flash model.

The steps performed include:

- Installing the Python packages for Vertex AI and Cloud Data Loss Prevention (DLP) API
- Generating examples with sensitive data using Gemini 2.0 Flash model
- Defining and running Python functions to redact different types of sensitive data in Gemini 2.0 Flash model responses using the DLP API

### Costs

This tutorial uses billable components of Google Cloud:

- Vertex AI
- Sensitive Data Protection (Cloud Data Loss Prevention)

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Sensitive Data Protection pricing](https://cloud.google.com/dlp/pricing). Use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.


## Getting started with this notebook

Below are a few steps to get your environment ready, including installing a few key Python packages and setting your environment variables (project ID and region).

Be sure to run each cell in consecutive order using the `Run` button (play arrow) at the top of this notebook. 

### Install necessary packages 

In [1]:
# Install Vertex AI
!pip install google-cloud-aiplatform --upgrade --user

# Install Cloud Data Loss Prevention
! pip install google-cloud-dlp --upgrade --user

Collecting google-cloud-aiplatform
  Downloading google_cloud_aiplatform-1.95.0-py2.py3-none-any.whl.metadata (35 kB)
Collecting google-genai<2.0.0,>=1.0.0 (from google-cloud-aiplatform)
  Downloading google_genai-1.17.0-py3-none-any.whl.metadata (35 kB)
Collecting httpx<1.0.0,>=0.28.1 (from google-genai<2.0.0,>=1.0.0->google-cloud-aiplatform)
  Downloading httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)
Collecting httpcore==1.* (from httpx<1.0.0,>=0.28.1->google-genai<2.0.0,>=1.0.0->google-cloud-aiplatform)
  Downloading httpcore-1.0.9-py3-none-any.whl.metadata (21 kB)
Collecting h11>=0.16 (from httpcore==1.*->httpx<1.0.0,>=0.28.1->google-genai<2.0.0,>=1.0.0->google-cloud-aiplatform)
  Downloading h11-0.16.0-py3-none-any.whl.metadata (8.3 kB)
Downloading google_cloud_aiplatform-1.95.0-py2.py3-none-any.whl (7.6 MB)
Downloading google_genai-1.17.0-py3-none-any.whl (

### Restart current runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [2]:
# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

<div class="alert alert-block alert-warning">
<b><p>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</p> When prompted, click OK to continue. </b>
</div>

### Set your project ID and region

In [1]:
# Set project ID and region for location
# You can find these details on the lab instruction page under Task 2
PROJECT_ID = "qwiklabs-gcp-02-deaf465d82c3" # for example: qwiklabs-gcp-04-b75c09c1eb74
LOCATION = "us-central1" # for example: us-central1

## Generate simple example text with personally identifiable information (full name) using Gemini 2.0 Flash model

The Gemini 2.0 Flash (`gemini-2.0-flash-001`) model is designed to handle natural language tasks, multi-turn text and code chat, and code generation. 

In this section, you use the model to generate examples of text with personally identifiable information (PII) and then define a custom Python function to redact this sensitive data from the model responses.

In [2]:
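# Initialize Vertex AI with the project and region set above,
# then load the model for text generation
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project=PROJECT_ID, location=LOCATION)
model = GenerativeModel("gemini-2.0-flash-001")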

In [3]:
# Write a prompt that generates a simple example of personally identifiable information (full name)
prompt = """Who is the CEO of Google?
  """

# Run model with prompt
response_name = model.generate_content(prompt)

# Print response without deidentification (full name is visible)
response_name

candidates {
  content {
    role: "model"
    parts {
      text: "The CEO of Google is Sundar Pichai.\n"
    }
  }
  finish_reason: STOP
  avg_logprobs: -0.16751534288579767
}
usage_metadata {
  prompt_token_count: 9
  candidates_token_count: 11
  total_token_count: 20
  prompt_tokens_details {
    modality: TEXT
    token_count: 9
  }
  candidates_tokens_details {
    modality: TEXT
    token_count: 11
  }
}
model_version: "gemini-2.0-flash-001"
create_time {
  seconds: 1748527667
  nanos: 197586000
}
response_id: "M2o4aNKHDOjXhMIPkcihuAo"

## Define and run a Python function to deidentify Gemini 2.0 Flash model responses using built-in global infoTypes

Sensitive Data Protection uses information types, or infoTypes, to define what it scans for. An infoType is a type of sensitive data, such as a name, telephone number, or identification number. 

In the cell below, you define a Python function that identifies and redacts the specific infoTypes that you provide as input, chosen from the list of built-in global infoTypes available in Sensitive Data Protection. Global infoTypes are general, globally applicable infoTypes such as names, dates of birth, and credit card numbers.

When you apply the function to model responses, you specify a few key built-in infoTypes to redact, such as `PERSON_NAME`, `DATE_OF_BIRTH`, and `CREDIT_CARD_NUMBER`. You can review the documentation to see the full list of [built-in infoTypes](https://cloud.google.com/sensitive-data-protection/docs/concepts-infotypes).
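
To make the shape of the request concrete before you run the function, here is a minimal sketch that assembles the same dictionaries the function in the next cell sends to the API. The helper name `build_deidentify_request` and the project ID `my-project` are hypothetical and used only for illustration; the sketch builds plain dictionaries and does not contact the API.

```python
from typing import Dict, List


def build_deidentify_request(project: str, text: str, info_types: List[str]) -> Dict:
    """Assemble a replace-with-infoType request as plain dictionaries.

    Hypothetical helper for illustration only: nothing is sent to the API.
    """
    return {
        # Parent resource that the request is billed and authorized against
        "parent": f"projects/{project}",
        # One {"name": ...} entry per infoType to look for
        "inspect_config": {"info_types": [{"name": name} for name in info_types]},
        # Replace every finding with the name of its infoType, e.g. [PERSON_NAME]
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {"primitive_transformation": {"replace_with_info_type_config": {}}}
                ]
            }
        },
        # The text to scan
        "item": {"value": text},
    }


request = build_deidentify_request(
    "my-project", "Date of Birth: 03/15/1968", ["PERSON_NAME", "DATE_OF_BIRTH"]
)
print(request["inspect_config"])
```

Each infoType name becomes its own entry in `inspect_config`, which is why the function accepts a list: adding another infoType to redact only means adding another string.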

Run the code block below without modifications.

In [4]:
# Define function to inspect and deidentify output with Sensitive Data Protection
import google.cloud.dlp  
from typing import List 

def deidentify_with_replace_infotype(
    project: str, item: str, info_types: List[str]
) -> None:
    """Uses the Data Loss Prevention API to deidentify sensitive data in a
    string by replacing it with the info type.
    Args:
        project: The Google Cloud project id to use as a parent resource.
        item: The string to deidentify (will be treated as text).
        info_types: A list of strings representing info types to look for.
            A full list of info type categories can be fetched from the API.
    Returns:
        None; the response from the API is printed to the terminal.
    """

    # Instantiate a client
    dlp = google.cloud.dlp_v2.DlpServiceClient()

    # Convert the project id into a full resource id.
    parent = f"projects/{project}"

    # Construct inspect configuration dictionary
    inspect_config = {"info_types": [{"name": info_type} for info_type in info_types]}

    # Construct deidentify configuration dictionary
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {"primitive_transformation": {"replace_with_info_type_config": {}}}
            ]
        }
    }

    # Call the API
    response = dlp.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": deidentify_config,
            "inspect_config": inspect_config,
            "item": {"value": item},
        }
    )

    # Print results
    print(response.item.value)

In [5]:
# Deidentify model response that includes a person's name (full name is redacted)
deidentify_with_replace_infotype(PROJECT_ID, response_name.text, ["PERSON_NAME"])

The CEO of Google is [PERSON_NAME].



## Generate and de-identify example text with more personally identifiable information (date of birth) using Gemini 2.0 Flash model

In this example, you generate text with more personally identifiable information in the form of a medical visit log, which can include other sensitive data such as date of birth.

When you run the de-identification function, you provide `PERSON_NAME` and `DATE_OF_BIRTH` as the infoTypes to redact. 
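
The function defined earlier prints its result, which is convenient in a notebook. In an application you would more likely want the redacted text returned so it can be passed on to the user. The following is a minimal sketch of that variation, not part of the lab: `redact_text` and `StubDlpClient` are hypothetical names, and the stub only mimics the shape of the API response with a hard-coded replacement so the example runs without credentials. With a real `DlpServiceClient` passed in instead, the call would go to the service.

```python
from types import SimpleNamespace
from typing import Any, List


def redact_text(dlp_client: Any, project: str, text: str, info_types: List[str]) -> str:
    """Return `text` with each finding replaced by its infoType name."""
    response = dlp_client.deidentify_content(
        request={
            "parent": f"projects/{project}",
            "inspect_config": {"info_types": [{"name": n} for n in info_types]},
            "deidentify_config": {
                "info_type_transformations": {
                    "transformations": [
                        {"primitive_transformation": {"replace_with_info_type_config": {}}}
                    ]
                }
            },
            "item": {"value": text},
        }
    )
    # Return the redacted string instead of printing it
    return response.item.value


class StubDlpClient:
    """Offline stand-in (hypothetical). It performs no real detection."""

    def deidentify_content(self, request):
        # Hard-coded replacement that imitates the response shape only
        value = request["item"]["value"].replace("Sundar Pichai", "[PERSON_NAME]")
        return SimpleNamespace(item=SimpleNamespace(value=value))


safe_text = redact_text(
    StubDlpClient(), "my-project", "The CEO of Google is Sundar Pichai.", ["PERSON_NAME"]
)
print(safe_text)
```

Passing the client in as an argument also means it can be created once and reused across calls, rather than instantiated on every call.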

In [6]:
# Write a prompt that generates an example with more personally identifiable information (such as date of birth in a medical visit log)
prompt = """Generate an example medical after-visit log with faux personally identifiable information including name and date of birth
  """

# Run model with prompt
response_visitlog = model.generate_content(prompt)

# Print response without deidentification (full names and date of birth are visible)
response_visitlog

candidates {
  content {
    role: "model"
    parts {
      text: "Okay, here\'s an example of a medical after-visit log, using faux PII.  **Please remember that this is for illustrative purposes only.  Do not use this in any real-world medical context.**  Protecting patient privacy is extremely important.\n\n**Medical After-Visit Log**\n\n**Patient Information:**\n\n*   **Patient Name:**  Eleanor Vance\n*   **Date of Birth:**  03/15/1968\n*   **Medical Record Number:** EV-680315-99\n*   **Address:** 144 Hill House Ln, Richmond, VA 23220\n*   **Phone Number:** (804) 555-1212\n\n**Visit Information:**\n\n*   **Date of Visit:** 2024-01-26\n*   **Time of Visit:** 10:00 AM\n*   **Provider:** Dr. Alistair Cooke, MD\n*   **Department:** Internal Medicine\n\n**Reason for Visit:**\n\n*   Follow-up appointment for hypertension and recent upper respiratory infection.\n\n**Examination and Findings:**\n\n*   **Vitals:**\n    *   Blood Pressure: 142/88 mmHg (slightly elevated)\n    *   Heart Rate:

In [7]:
# Deidentify model response that includes an example medical visit log (full names and date of birth are redacted)
deidentify_with_replace_infotype(PROJECT_ID, response_visitlog.text, ["PERSON_NAME","DATE_OF_BIRTH"])

Okay, here's an example of a medical after-visit log, using faux PII.  **Please remember that this is for illustrative purposes only.  Do not use this in any real-world medical context.**  Protecting patient privacy is extremely important.

**Medical After-Visit Log**

**Patient Information:**

*   **Patient Name:**  [PERSON_NAME]
*   **Date of Birth:**  [DATE_OF_BIRTH]
*   **Medical Record Number:** EV-680315-99
*   **Address:** 144 Hill House Ln, Richmond, VA 23220
*   **Phone Number:** (804) 555-1212

**Visit Information:**

*   **Date of Visit:** 2024-01-26
*   **Time of Visit:** 10:00 AM
*   **Provider:** Dr. [PERSON_NAME], MD
*   **Department:** Internal Medicine

**Reason for Visit:**

*   Follow-up appointment for hypertension and recent upper respiratory infection.

**Examination and Findings:**

*   **Vitals:**
    *   Blood Pressure: 142/88 mmHg (slightly elevated)
    *   Heart Rate: 78 bpm, regular
    *   Temperature: 98.6°F (37°C)
    *   Respiratory Rate: 16 breaths/min

## Generate example text with credit card information using Gemini 2.0 Flash model

In the previous examples, you generated example text with personally identifiable information such as full name and date of birth.

In this example, you start by generating example text with credit card information using the prompt provided below. Then, you apply what you have learned in the previous examples and run the function to redact the credit card information.
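
Replacing a finding with its infoType name is only one possible transformation. As a hedged sketch of an alternative, the de-identification configuration below asks for character masking instead, leaving spaces and hyphens untouched so the grouping of a card number stays visible. The helper name `build_mask_config` is hypothetical, and the field names (`character_mask_config`, `masking_character`, `characters_to_ignore`, `characters_to_skip`) reflect our understanding of the DLP API and should be verified against the API reference before use. The sketch only builds the dictionary; it does not call the service.

```python
from typing import Dict


def build_mask_config(masking_character: str = "#", skip: str = " -") -> Dict:
    """Build a deidentify_config that masks findings character by character.

    Hypothetical helper; `skip` lists characters that should be left as-is.
    """
    if len(masking_character) != 1:
        raise ValueError("masking_character must be a single character")
    return {
        "info_type_transformations": {
            "transformations": [
                {
                    "primitive_transformation": {
                        "character_mask_config": {
                            "masking_character": masking_character,
                            "characters_to_ignore": [{"characters_to_skip": skip}],
                        }
                    }
                }
            ]
        }
    }


mask_config = build_mask_config()
print(mask_config)
```

To try it, you could pass this dictionary as `deidentify_config` in place of the replace-with-infoType configuration used by the function in this notebook.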

In [8]:
# Write a prompt that generates an example with a credit card number
prompt = """Is 4111 1111 1111 1111 an example of a credit card number?
  """

# Run model with prompt
response_creditcard = model.generate_content(prompt)

# Print response without deidentification (credit card number is visible)
response_creditcard

candidates {
  content {
    role: "model"
    parts {
      text: "No, 4111 1111 1111 1111 is not a valid credit card number. While it has the correct number of digits (16) and starts with a \'4\' (common for Visa cards), it doesn\'t pass the Luhn algorithm check, which is a standard validation method for credit card numbers. A valid credit card number requires passing the Luhn algorithm check in addition to having the correct number of digits and a valid prefix.\n"
    }
  }
  finish_reason: STOP
  avg_logprobs: -0.33181018095750076
}
usage_metadata {
  prompt_token_count: 31
  candidates_token_count: 104
  total_token_count: 135
  prompt_tokens_details {
    modality: TEXT
    token_count: 31
  }
  candidates_tokens_details {
    modality: TEXT
    token_count: 104
  }
}
model_version: "gemini-2.0-flash-001"
create_time {
  seconds: 1748527698
  nanos: 750349000
}
response_id: "Umo4aI3mLZ_KhMIPkLCIoAU"

## Test your skills using the built-in global infoType for credit card number

Now it's your turn to call the function `deidentify_with_replace_infotype` with the appropriate inputs to redact credit card numbers from model responses.

__Hint__: you can review the [global infoTypes](https://cloud.google.com/sensitive-data-protection/docs/infotypes-reference#global) in the documentation to identify the appropriate infoType for credit card numbers.

For the full solution, return to the lab instructions and expand the __Hint__ button. 

In [9]:
# Deidentify model response that includes an example credit card number (credit card number is redacted)

# ADD YOUR CODE BELOW
deidentify_with_replace_infotype(PROJECT_ID, response_creditcard.text, ["CREDIT_CARD_NUMBER"])

No, [CREDIT_CARD_NUMBER] is not a valid credit card number. While it has the correct number of digits (16) and starts with a '4' (common for Visa cards), it doesn't pass the Luhn algorithm check, which is a standard validation method for credit card numbers. A valid credit card number requires passing the Luhn algorithm check in addition to having the correct number of digits and a valid prefix.



## Redefine the Python function to block Gemini 2.0 Flash model responses based on specific infoTypes for documents

In addition to its ability to scan and classify information contained within documents, Sensitive Data Protection can classify documents into multiple enterprise-specific categories. When combined with sensitive data inspection, this classification can be useful for document risk assessment, policy enforcement, and similar use cases.

In this section, you redefine the original function to take advantage of this classification functionality and use it to block output for two specific [document infoTypes](https://cloud.google.com/sensitive-data-protection/docs/infotypes-reference#documents): source code and patents.

In the code block below for the function, notice the new code lines after `# Add conditional return to block responses containing document infoTypes for source code and patent`.

Run the code block below without modifications.

In [10]:
# Redefine original function to inspect and deidentify output with Sensitive Data Protection
import google.cloud.dlp  
from typing import List 

def deidentify_with_replace_infotype(
    project: str, item: str, info_types: List[str]
) -> None:
    """Uses the Data Loss Prevention API to deidentify sensitive data in a
    string by replacing it with the info type.
    Args:
        project: The Google Cloud project id to use as a parent resource.
        item: The string to deidentify (will be treated as text).
        info_types: A list of strings representing info types to look for.
            A full list of info type categories can be fetched from the API.
    Returns:
        None; the response from the API is printed to the terminal.
    """

    # Instantiate a client
    dlp = google.cloud.dlp_v2.DlpServiceClient()

    # Convert the project id into a full resource id.
    parent = f"projects/{project}"

    # Construct inspect configuration dictionary
    inspect_config = {"info_types": [{"name": info_type} for info_type in info_types]}

    # Construct deidentify configuration dictionary
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {"primitive_transformation": {"replace_with_info_type_config": {}}}
            ]
        }
    }

    # Call the API for deidentify
    response = dlp.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": deidentify_config,
            "inspect_config": inspect_config,
            "item": {"value": item},
        }
    )

    return_payload = response.item.value
    
    # Add conditional return to block responses containing document infoTypes for source code and patent
    info_types = ["DOCUMENT_TYPE/R&D/SOURCE_CODE","DOCUMENT_TYPE/R&D/PATENT"]
    inspect_config = {"info_types": [{"name": info_type} for info_type in info_types]}

    response = dlp.inspect_content(
        request={
            "parent": parent,
            "inspect_config": inspect_config,
            "item": {"value": item},
        }
    )

    if response.result.findings:
        for finding in response.result.findings:
            if finding.info_type.name == "DOCUMENT_TYPE/R&D/SOURCE_CODE":
                return_payload = '[Blocked due to category: Source Code]'
            elif finding.info_type.name == "DOCUMENT_TYPE/R&D/PATENT":
                return_payload = '[Blocked due to category: Patent Related]'
                
    # Print results
    print(return_payload)

## Generate an example with source code using Gemini 2.0 Flash model and block results

In the previous examples, you generated example text with personally identifiable information.

In this example, you generate examples with document infoTypes including source code and patent information. Then, you apply what you have learned in the previous examples to run the function to block responses based on these document infoTypes. 
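
The blocking decision in the redefined function can be isolated as a small pure function, which makes the rule easy to read and to extend with more document categories. This is a minimal sketch under stated assumptions: the names `BLOCK_MESSAGES`, `block_message`, and `apply_block` are hypothetical, and the input is assumed to be the list of infoType names taken from the inspection findings. Unlike the loop in the notebook function, where the last matching finding determines the message, this sketch returns the message for the first matching finding.

```python
from typing import Iterable, Optional

# Document infoTypes that should block a response, mapped to the message shown instead
BLOCK_MESSAGES = {
    "DOCUMENT_TYPE/R&D/SOURCE_CODE": "[Blocked due to category: Source Code]",
    "DOCUMENT_TYPE/R&D/PATENT": "[Blocked due to category: Patent Related]",
}


def block_message(finding_names: Iterable[str]) -> Optional[str]:
    """Return the block message for the first blocking infoType, or None."""
    for name in finding_names:
        if name in BLOCK_MESSAGES:
            return BLOCK_MESSAGES[name]
    return None


def apply_block(redacted_text: str, finding_names: Iterable[str]) -> str:
    """Return the redacted text unless a blocking document infoType was found."""
    message = block_message(finding_names)
    return redacted_text if message is None else message


print(apply_block("public class Main {}", ["DOCUMENT_TYPE/R&D/SOURCE_CODE"]))
print(apply_block("The CEO of Google is [PERSON_NAME].", []))
```

Because the check uses its own fixed set of document infoTypes, a response is blocked regardless of which infoTypes the caller asked to redact.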

In [11]:
# Create prompt that generates an example of Java code
prompt = """Show me an example of Java code
  """

# Run model with prompt
response_sourcecode = model.generate_content(prompt)

# Print response without blocking it (code is visible)
response_sourcecode

candidates {
  content {
    role: "model"
    parts {
      text: "```java\npublic class Main {\n\n    public static void main(String[] args) {\n        // Print \"Hello, World!\" to the console\n        System.out.println(\"Hello, World!\");\n\n        // Declare and initialize variables\n        int number1 = 10;\n        int number2 = 5;\n\n        // Perform arithmetic operations\n        int sum = number1 + number2;\n        int difference = number1 - number2;\n        int product = number1 * number2;\n        double quotient = (double) number1 / number2; // Cast to double for accurate division\n\n        // Print the results\n        System.out.println(\"Sum: \" + sum);\n        System.out.println(\"Difference: \" + difference);\n        System.out.println(\"Product: \" + product);\n        System.out.println(\"Quotient: \" + quotient);\n\n        // Use an if-else statement\n        if (number1 > number2) {\n            System.out.println(number1 + \" is greater than \" + numbe

In [12]:
# Block a model response that includes source code (response is not available)
# Notice that the infoType you request here is unrelated to source code
# The result is still blocked because the model response is identified as containing code
deidentify_with_replace_infotype(PROJECT_ID, response_sourcecode.text, ["EMAIL_ADDRESS"])

[Blocked due to category: Source Code]


## Test your skills using the built-in document infoType for patents

Now it's your turn to call the function `deidentify_with_replace_infotype` with the appropriate inputs to block patent information in model responses.

__Hint__: review the previous two cells for generating an example with source code and calling the function, and then modify both to block the model response because it contains patent information.

For the full solution, return to the lab instructions and expand the __Hint__ button. 

In [13]:
# Create prompt that generates example patent
# ADD YOUR CODE BELOW
prompt = """Show me an example patent

"""

# Run model with prompt
# Name the output as response_patent
# ADD YOUR CODE BELOW
response_patent = model.generate_content(prompt)

# Print response without blocking it (patent information provided)
# ADD YOUR CODE BELOW
response_patent


candidates {
  content {
    role: "model"
    parts {
      text: "Okay, here\'s a breakdown of how to find and understand a patent, followed by a specific example patent that I\'ll summarize and link to.\n\n**How to Find Patents:**\n\n1.  **Google Patents:** This is the easiest and most common starting point.  Go to [https://patents.google.com/](https://patents.google.com/).  You can search by keyword, inventor name, patent number, assignee (company), and more.\n\n2.  **USPTO (United States Patent and Trademark Office):**  The official source, but can be a bit less user-friendly than Google Patents.  Go to [https://www.uspto.gov/](https://www.uspto.gov/) and look for the \"Patents\" section.\n\n3.  **Espacenet (European Patent Office):** A good resource for searching patents from around the world.  Go to [https://worldwide.espacenet.com/](https://worldwide.espacenet.com/).\n\n**Key Parts of a Patent:**\n\n*   **Patent Number:**  A unique identifier (e.g., US1234567B2).\n*   **Title:*

In [14]:
# Block model response that includes patent information (patent information not provided)

# ADD YOUR CODE BELOW
deidentify_with_replace_infotype(PROJECT_ID, response_patent.text, ["EMAIL_ADDRESS"])


[Blocked due to category: Patent Related]
