In [1]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## Overview


[Sensitive Data Protection (Cloud DLP)](https://cloud.google.com/dlp) is a fully managed service designed to discover, classify, and protect your sensitive data wherever it resides: in databases, text-based content, or even images. It uses a variety of methods to identify sensitive data, including regular expressions, dictionaries, and contextual elements. Once sensitive data is identified, Sensitive Data Protection (Cloud DLP) can classify it, mask it, encrypt it, or even delete it.


Sensitive Data Protection (Cloud DLP) can be accessed through the Cloud Console and used to scan data in Cloud Storage, BigQuery, and other Google Cloud services. This notebook demonstrates using it through the SDK to incorporate Sensitive Data Protection (Cloud DLP) capabilities directly into your generative AI applications.

With Sensitive Data Protection (SDP), you can also redact sensitive data from LLM responses.
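
To make the detection methods mentioned above more concrete, here is a toy, purely local sketch of regular-expression and dictionary matching. It is not how the managed service is implemented; `EMAIL_PATTERN`, `KNOWN_NAMES`, and `find_sensitive_spans` are hypothetical names chosen for illustration only.

```python
import re

# Hypothetical, simplified detectors; the managed service uses far richer logic.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
KNOWN_NAMES = {"Sundar Pichai"}  # stand-in for a dictionary-based detector


def find_sensitive_spans(text):
    """Return (label, matched_text) pairs found by the toy detectors."""
    # Regular-expression detector: collect every email-like substring.
    findings = [("EMAIL_ADDRESS", m.group()) for m in EMAIL_PATTERN.finditer(text)]
    # Dictionary detector: report each known name that appears in the text.
    findings += [("PERSON_NAME", name) for name in sorted(KNOWN_NAMES) if name in text]
    return findings
```

A real deployment would rely on the service's built-in info types rather than hand-written patterns like these.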

### Architecture
See [How Sensitive Data Protection can help secure generative AI workloads](https://cloud.google.com/blog/products/identity-security/how-sensitive-data-protection-can-help-secure-generative-ai-workloads).


### Objectives

In this tutorial, you will learn how to use the Sensitive Data Protection (Cloud DLP) API with the Python SDK and explore how to identify and redact sensitive data in a response from the PaLM 2 LLM.

By the end of the notebook, you should understand the main Sensitive Data Protection (Cloud DLP) request settings, such as `inspect_config`, `deidentify_config`, and `item`, and what each one controls.
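
As a rough preview of how these three settings fit together, the sketch below assembles them into a single request dictionary without calling the API. The helper name `build_deidentify_request` and the project id in the usage note are hypothetical; the dictionary keys mirror the ones used later in this notebook.

```python
def build_deidentify_request(project_id, text, info_types,
                             masking_character="*", number_to_mask=10):
    """Assemble a de-identification request dictionary (no API call is made)."""
    # inspect_config: which kinds of sensitive data to look for.
    inspect_config = {"info_types": [{"name": name} for name in info_types]}
    # deidentify_config: how each finding should be transformed (here: masked).
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    "primitive_transformation": {
                        "character_mask_config": {
                            "masking_character": masking_character,
                            "number_to_mask": number_to_mask,
                        }
                    }
                }
            ]
        }
    }
    # item: the content to scan, treated as plain text.
    return {
        "parent": f"projects/{project_id}",
        "inspect_config": inspect_config,
        "deidentify_config": deidentify_config,
        "item": {"value": text},
    }
```

For example, `build_deidentify_request("my-project", "some text", ["EMAIL_ADDRESS"])` returns a dictionary in the shape that the `deidentify_content` call later in this notebook receives as its `request` argument.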

The steps performed include:

- Install the Python SDKs
- Understand a data leakage scenario
  - Text generation model with `text-bison@001`
    - Understand prompt manipulation that returns sensitive data
- Understand data leakage mitigations
  - Use Sensitive Data Protection (Cloud DLP) with `text-bison@001` responses

### Costs
This tutorial uses billable components of Google Cloud:

* Vertex AI Generative AI Studio
* Sensitive Data Protection (Cloud DLP)

Learn about pricing for [Vertex AI](https://cloud.google.com/vertex-ai/pricing) and [Sensitive Data Protection (Cloud DLP)](https://cloud.google.com/dlp/pricing). Use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

### Data governance and security
For more information, see the documentation on [Data Governance and Generative AI](https://cloud.google.com/vertex-ai/docs/generative-ai/data-governance) on Google Cloud.

### Responsible AI
Large language models (LLMs) can translate language, summarize text, generate creative writing, generate code, power chatbots and virtual assistants, and complement search engines and recommendation systems. At the same time, because LLMs are an early-stage technology, their evolving capabilities and uses create potential for misapplication, misuse, and unintended or unforeseen consequences. Large language models can generate output that you don't expect, including text that's offensive, insensitive, or factually incorrect.

What's more, the incredible versatility of LLMs is also what makes it difficult to predict exactly what kinds of unintended or unforeseen outputs they might produce. Given these risks and complexities, the PaLM API is designed with [Google's AI Principles](https://ai.google/principles/) in mind. However, it is important for developers to understand and test their models to deploy safely and responsibly. To aid developers, the Generative AI Studio has built-in content filtering, and the PaLM API has safety attribute scoring to help customers test Google's safety filters and define confidence thresholds that are right for their use case and business. Please refer to the [Safety filters and attributes](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/responsible-ai#safety_filters_and_attributes) section to learn more.

When the PaLM API is integrated into a customer's unique use case and context, additional responsible AI considerations and [PaLM limitations](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/responsible-ai#palm_limitations) may need to be considered. We encourage customers to leverage fairness, interpretability, privacy and security [recommended practices](https://ai.google/responsibilities/responsible-ai-practices/).
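
The idea of defining confidence thresholds can be sketched as a small filter. This example assumes safety scores are available as a mapping from category name to a confidence between 0 and 1; the category names, the `blocked_categories` helper, and the default threshold are all hypothetical and not taken from the PaLM API.

```python
def blocked_categories(safety_scores, thresholds, default_threshold=0.5):
    """Return the categories whose score meets or exceeds its threshold."""
    return sorted(
        category
        for category, score in safety_scores.items()
        # Fall back to the default when no per-category threshold is given.
        if score >= thresholds.get(category, default_threshold)
    )
```

An application could withhold or regenerate a response whenever this list is non-empty, tuning the thresholds to its own use case.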

## Getting Started

In [2]:
#@title Install Vertex AI and Sensitive Data Protection SDK
# Install Google Cloud Vertex AI
!pip install google-cloud-aiplatform --upgrade --user
# Install DLP
!pip install google-cloud-dlp
!pip install gradio




### Restart Runtime
After installing the necessary Python SDKs, you must restart the Python runtime. There are a few options, depending on your environment:

**Colab**:
1. Click the "Restart Runtime" button in the output of the SDK installs
2. Click "Runtime" on the top toolbar -> Click "Restart Runtime"
3. Run Colab Runtime Restart Code Block

**Vertex AI Workbench**:
1. Click "Kernel" on the top toolbar -> Click "Restart Kernel"

In [3]:
#@title Colab Runtime Restart
#import os
#os.kill(os.getpid(), 9)

In [4]:
#@title Set Project and Location
PROJECT_ID = "dataplex-shared-project-1"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

### Authenticating your notebook environment
* If you are using **Colab** to run this notebook, run the cell below and continue.
* If you are using **Vertex AI Workbench**, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env).

In [5]:
from google.colab import auth
auth.authenticate_user(project_id=PROJECT_ID)
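
Because `google.colab` can only be imported inside Colab, the cell above fails elsewhere. One possible guard is sketched below; it assumes that Colab loads the `google.colab` module at kernel startup, and the helper names `running_in_colab` and `authenticate_if_colab` are hypothetical.

```python
import sys


def running_in_colab() -> bool:
    """Return True if the google.colab module is loaded in this interpreter."""
    return "google.colab" in sys.modules


def authenticate_if_colab(project_id: str) -> bool:
    """Authenticate the user only when running in Colab; report whether it ran."""
    if not running_in_colab():
        return False
    from google.colab import auth  # only importable inside Colab

    auth.authenticate_user(project_id=project_id)
    return True
```

Calling `authenticate_if_colab(PROJECT_ID)` would then be a no-op on Vertex AI Workbench, where the setup instructions linked above apply instead.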


### Set Model for Prediction/Generation (text-bison@001)

In [6]:
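import vertexai
from vertexai.preview.language_models import TextGenerationModel

# Point the Vertex AI SDK at the project and region set above.
vertexai.init(project=PROJECT_ID, location=LOCATION)
generation_model = TextGenerationModel.from_pretrained("text-bison@001")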

## Understanding and Mitigating Data Leakage

In [7]:
#@title Threat Use Case: Extract Sensitive Data & Data Leakage Scenario

prompt = f"""Who is the CEO of Google? What is their email?
  """
#response = generation_model.predict(prompt)
#response

In [8]:
#@title Inspect and Redact PaLM 2 output with Sensitive Data Protection (Cloud DLP)

import google.cloud.dlp  # noqa: F811, E402
from typing import List, Tuple  # noqa: F811, E402

def deidentify_with_replace_infotype(
    prompt: str, project: str, info_types: List[str]
) -> Tuple[str, str]:
    """Sends a prompt to the model, then uses the Data Loss Prevention API
    to de-identify sensitive data in the model's response.

    Despite the function name, matched values are masked with "*" characters
    rather than replaced with the info type name.

    Args:
        prompt: The prompt to send to the text generation model.
        project: The Google Cloud project id to use as a parent resource.
        info_types: A list of strings representing info types to look for.
            A full list of info type categories can be fetched from the API.
    Returns:
        A tuple of (original model response, de-identified response).
    """
    response = generation_model.predict(prompt)
    item = response.text

    # Instantiate a client
    dlp = google.cloud.dlp_v2.DlpServiceClient()

    # Convert the project id into a full resource id.
    parent = f"projects/{project}"

    # Construct inspect configuration dictionary
    inspect_config = {"info_types": [{"name": info_type} for info_type in info_types]}

    # Construct deidentify configuration dictionary: mask the first 10
    # characters of each finding with "*".
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    "primitive_transformation": {
                        "character_mask_config": {
                            "masking_character": "*",
                            "number_to_mask": 10,
                        }
                    }
                }
            ]
        }
    }

    # Call the API
    response_dlp = dlp.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": deidentify_config,
            "inspect_config": inspect_config,
            "item": {"value": item},
        }
    )

    # Return the original and the de-identified text.
    return item, response_dlp.item.value
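
The function's name refers to replacing findings with their info type, while its configuration masks characters instead. As a hedged sketch of the alternative, the dictionary below uses the DLP API's `replace_with_info_type_config` transformation, which substitutes each finding with a placeholder such as `[EMAIL_ADDRESS]`. The helper name `build_replace_with_infotype_config` is hypothetical, and this variant is not what the notebook runs.

```python
def build_replace_with_infotype_config():
    """Return a deidentify_config that swaps each finding for its info type name."""
    return {
        "info_type_transformations": {
            "transformations": [
                {
                    "primitive_transformation": {
                        # An empty config is enough; the API supplies the
                        # info type name as the replacement text.
                        "replace_with_info_type_config": {}
                    }
                }
            ]
        }
    }
```

Passing this dictionary as `deidentify_config` in the request above, in place of the masking configuration, should hide the full value of each finding rather than only its first 10 characters.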


In [9]:
print(prompt)
print(deidentify_with_replace_infotype(prompt, PROJECT_ID, ["PERSON_NAME","EMAIL_ADDRESS"]))

Who is the CEO of Google? What is their email?
  
('Sundar Pichai is the CEO of Google. His email address is sundar@google.com.', '**********hai is the CEO of Google. His email address is **********gle.com.')
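
The partially masked output follows from `number_to_mask` being 10: only the first 10 characters of each finding are replaced. The local sketch below reproduces that behavior for illustration; `mask_leading_characters` is a hypothetical helper that does not call the API and does not model the API's other masking options.

```python
def mask_leading_characters(finding, masking_character="*", number_to_mask=10):
    """Mask up to number_to_mask leading characters of a finding."""
    # Never mask more characters than the finding contains.
    masked = min(number_to_mask, len(finding))
    return masking_character * masked + finding[masked:]


print(mask_leading_characters("Sundar Pichai"))      # **********hai
print(mask_leading_characters("sundar@google.com"))  # **********gle.com
```

Because the tail of each value remains visible, a longer mask (or a different transformation) may be preferable when the full value must be hidden. The DLP documentation describes leaving `number_to_mask` unset as masking every character of a finding.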


In [10]:
'''import gradio as gr

def get_response(prompt):
  return (deidentify_with_replace_infotype(prompt,PROJECT_ID, ["PERSON_NAME","EMAIL_ADDRESS"]))

demo = gr.Interface(fn=get_response, inputs="text", outputs=["text","text"])

demo.launch()'''



In [11]:
# Gradio Blocks version of the Interface demo above.
# Assumes the earlier cells (including deidentify_with_replace_infotype) have been run.
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown("## PaLM 2 with SDP Redaction")
    with gr.Row():
        with gr.Column():
            prompt_input = gr.Textbox(label="Enter your prompt")
            submit_button = gr.Button("Submit")
        with gr.Column():
            original_output = gr.Textbox(label="Original Response")
            redacted_output = gr.Textbox(label="Redacted Response")

    def get_response(prompt):
        original, redacted = deidentify_with_replace_infotype(
            prompt, PROJECT_ID, ["PERSON_NAME", "EMAIL_ADDRESS"]
        )
        return original, redacted

    submit_button.click(
        get_response, inputs=prompt_input, outputs=[original_output, redacted_output]
    )

demo.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://5bb332c0fbd9f3beec.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


