From 617e1314ee42996adf8351d2ecd016bcf5481d05 Mon Sep 17 00:00:00 2001
From: MING KANG
Date: Thu, 26 Oct 2023 11:46:47 -0700
Subject: [PATCH 01/18] create pii operator

---
 ads/opctl/operator/lowcode/pii/README.md   | 102 ++++++++++++++++++
 ads/opctl/operator/lowcode/pii/__init__.py |   5 +
 ads/opctl/operator/lowcode/pii/cmd.py      |  43 ++++++++
 .../operator/lowcode/pii/environment.yaml  |   8 ++
 ads/opctl/operator/lowcode/pii/schema.yaml |   0
 5 files changed, 158 insertions(+)
 create mode 100644 ads/opctl/operator/lowcode/pii/README.md
 create mode 100644 ads/opctl/operator/lowcode/pii/__init__.py
 create mode 100644 ads/opctl/operator/lowcode/pii/cmd.py
 create mode 100644 ads/opctl/operator/lowcode/pii/environment.yaml
 create mode 100644 ads/opctl/operator/lowcode/pii/schema.yaml

diff --git a/ads/opctl/operator/lowcode/pii/README.md b/ads/opctl/operator/lowcode/pii/README.md
new file mode 100644
index 000000000..3970554d5
--- /dev/null
+++ b/ads/opctl/operator/lowcode/pii/README.md
@@ -0,0 +1,102 @@
+# PII Operator
+
+
+Below are the steps to configure and run the PII Operator on different resources.
+
+## 1. Prerequisites
+
+Follow the [CLI Configuration](https://accelerated-data-science.readthedocs.io/en/latest/user_guide/cli/opctl/configure.html) steps from the ADS documentation. This step is mandatory as it sets up default values for different options while running the PII Operator on OCI Data Science jobs.
+
+## 2. Generating configs
+
+To generate starter configs, run the command below. This will create a set of YAML configs and place them in the folder specified by `--output` (`~/pii/` in the example below).
+
+```bash
+ads operator init -t pii --overwrite --output ~/pii/
+```
+
+The most important files expected to be generated are:
+
+- `pii.yaml`: Contains the PII operator configuration.
+- `backend_operator_local_python_config.yaml`: Contains a local backend configuration for running the PII operator in a local environment. The environment should be set up manually before running the operator.
+- `backend_job_python_config.yaml`: Contains Data Science job-related config to run the PII operator in a Data Science job within a conda runtime. The conda environment should be built and published before running the operator.
+
+All generated configurations should be ready to use without the need for any additional adjustments. However, they are provided as starter kit configurations that can be customized as needed.
+
+## 3. Running PII on the local conda environment
+
+To run the PII operator locally, create and activate a new conda environment (`ads-pii`), and install all the required libraries listed in the `environment.yaml` file.
+
+```yaml
+- datapane
+- scrubadub
+- "git+https://github.com/oracle/accelerated-data-science.git@feature/forecasting#egg=oracle-ads"
+```
+
+Please review the `pii.yaml` file generated by the `init` command, and make any necessary adjustments to the input and output file locations. By default, it assumes that the files are located in the same folder from which the `init` command was executed.
+
+Use the command below to verify the PII operator config.
+
+```bash
+ads operator verify -f ~/pii/pii.yaml
+```
+
+Use the following command to run the PII operator within the `ads-pii` conda environment.
+
+```bash
+ads operator run -f ~/pii/pii.yaml -b local
+```
+
+The operator will run in your local environment without requiring any additional modifications.
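+As a point of reference, a minimal `pii.yaml` could look like the sketch below. The field names follow the operator schema introduced later in this patch series; the values (detector names, model, entity list) are illustrative and should be adapted to your data.
+
+```yaml
+kind: operator
+type: pii
+version: v1
+spec:
+  input_data:
+    url: data.csv
+  target_column: target
+  output_directory:
+    url: result/
+  report:
+    report_filename: report.html
+    show_sensitive_content: false
+  redactor:
+    detectors:
+      - email
+    spacy_detectors:
+      model: en_core_web_trf
+      named_entities: ["PERSON"]
+```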
+
+## 4. Running PII in a Data Science job within the conda runtime
+
+To execute the PII operator within a Data Science job using the conda runtime, please follow the steps outlined below:
+
+You can use the following command to build the pii conda environment.
+
+```bash
+ads operator build-conda -t pii
+```
+
+This will create a new `pii_v1` conda environment and place it in the folder specified within the `ads opctl configure` command.
+
+Use the command below to publish the `pii_v1` conda environment to the Object Storage bucket.
+
+```bash
+ads opctl conda publish pii_v1
+```
+
+More details about configuring the CLI can be found here - [Configuring CLI](https://accelerated-data-science.readthedocs.io/en/latest/user_guide/cli/opctl/configure.html)
+
+
+After the conda environment is published to Object Storage, it can be used within the Data Science Jobs service. Check the `backend_job_python_config.yaml` config file. It should contain pre-populated infrastructure and runtime sections. The runtime section should contain a `conda` section.
+
+```yaml
+conda:
+  type: published
+  uri: oci://bucket@namespace/conda_environments/cpu/pii/1/pii_v1
+```
+
+More details about supported options can be found in the ADS Jobs documentation - [Run a Python Workload](https://accelerated-data-science.readthedocs.io/en/latest/user_guide/jobs/run_python.html).
+
+Adjust the `pii.yaml` config with proper input/output folders. When the PII operator runs in a Data Science job, it will not have access to local folders. Therefore, input data and output folders should be placed in the Object Storage bucket. Open the `pii.yaml` and adjust the following fields:
+
+```yaml
+output_directory:
+  url: oci://bucket@namespace/pii/result/
+test_data:
+  url: oci://bucket@namespace/pii/input_data/test.csv
+```
+
+Run the PII operator on a Data Science job using the command below:
+
+```bash
+ads operator run -f ~/pii/pii.yaml --backend-config ~/pii/backend_job_python_config.yaml
+```
+
+The logs can be monitored using the `ads opctl watch` command.
+
+```bash
+ads opctl watch
+```

diff --git a/ads/opctl/operator/lowcode/pii/__init__.py b/ads/opctl/operator/lowcode/pii/__init__.py
new file mode 100644
index 000000000..b8d0460f5
--- /dev/null
+++ b/ads/opctl/operator/lowcode/pii/__init__.py
@@ -0,0 +1,5 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*--
+
+# Copyright (c) 2023 Oracle and/or its affiliates.
+# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/

diff --git a/ads/opctl/operator/lowcode/pii/cmd.py b/ads/opctl/operator/lowcode/pii/cmd.py
new file mode 100644
index 000000000..e119c4c76
--- /dev/null
+++ b/ads/opctl/operator/lowcode/pii/cmd.py
@@ -0,0 +1,43 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*--
+
+# Copyright (c) 2023 Oracle and/or its affiliates.
+# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/
+
+from typing import Dict
+
+import click
+
+from ads.opctl import logger
+from ads.opctl.operator.common.utils import _load_yaml_from_uri
+from ads.opctl.operator.common.operator_yaml_generator import YamlGenerator
+
+
+def init(**kwargs: Dict) -> str:
+    """
+    Generates the operator config based on the schema.
+
+    Parameters
+    ----------
+    kwargs: (Dict, optional).
+        Additional key value arguments.
+
+        - type: str
+            The type of the operator.
+
+    Returns
+    -------
+    str
+        The YAML specification generated based on the schema.
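+
+    Examples
+    --------
+    >>> # Illustrative call; the operator CLI supplies ``type`` when invoking ``init``.
+    >>> yaml_spec = init(type="pii")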
+ """ + logger.info("==== PII related options ====") + + model_type = click.prompt( + "Provide a model type:", + type=click.Choice(SupportedModels.values()), + default=SupportedModels.Prophet, + ) + + return YamlGenerator( + schema=_load_yaml_from_uri(__file__.replace("cmd.py", "schema.yaml")) + ).generate_example_dict(values={"model": model_type, "type": kwargs.get("type")}) diff --git a/ads/opctl/operator/lowcode/pii/environment.yaml b/ads/opctl/operator/lowcode/pii/environment.yaml new file mode 100644 index 000000000..420b5d1e7 --- /dev/null +++ b/ads/opctl/operator/lowcode/pii/environment.yaml @@ -0,0 +1,8 @@ +name: PII +channels: + - conda-forge +dependencies: + - python=3.8 + - pip + - pip: + - datapane diff --git a/ads/opctl/operator/lowcode/pii/schema.yaml b/ads/opctl/operator/lowcode/pii/schema.yaml new file mode 100644 index 000000000..e69de29bb From 7eaa10706bf59b94272f7f7981d78bc81fc1dba2 Mon Sep 17 00:00:00 2001 From: MING KANG Date: Mon, 30 Oct 2023 10:14:30 -0700 Subject: [PATCH 02/18] sync basic setup --- ads/opctl/operator/lowcode/pii/MLoperator | 12 ++ ads/opctl/operator/lowcode/pii/__main__.py | 77 ++++++++++ ads/opctl/operator/lowcode/pii/cmd.py | 8 +- .../operator/lowcode/pii/environment.yaml | 2 + ads/opctl/operator/lowcode/pii/errors.py | 27 ++++ .../operator/lowcode/pii/model/__init__.py | 5 + .../operator/lowcode/pii/operator_config.py | 96 ++++++++++++ ads/opctl/operator/lowcode/pii/schema.yaml | 145 ++++++++++++++++++ 8 files changed, 365 insertions(+), 7 deletions(-) create mode 100644 ads/opctl/operator/lowcode/pii/MLoperator create mode 100644 ads/opctl/operator/lowcode/pii/__main__.py create mode 100644 ads/opctl/operator/lowcode/pii/errors.py create mode 100644 ads/opctl/operator/lowcode/pii/model/__init__.py create mode 100644 ads/opctl/operator/lowcode/pii/operator_config.py diff --git a/ads/opctl/operator/lowcode/pii/MLoperator b/ads/opctl/operator/lowcode/pii/MLoperator new file mode 100644 index 000000000..e4977c778 --- /dev/null +++ b/ads/opctl/operator/lowcode/pii/MLoperator @@ -0,0 +1,12 @@ +type: pii +version: v1 +name: PII Operator +conda_type: published +conda: pii_v1 +gpu: no +keywords: + - PII +backends: + - job +description: | + PII operator. diff --git a/ads/opctl/operator/lowcode/pii/__main__.py b/ads/opctl/operator/lowcode/pii/__main__.py new file mode 100644 index 000000000..aa1c31e38 --- /dev/null +++ b/ads/opctl/operator/lowcode/pii/__main__.py @@ -0,0 +1,77 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*-- + +# Copyright (c) 2023 Oracle and/or its affiliates. 
+# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ + +import json +import os +import sys +from typing import Dict, List + +import yaml + +from ads.opctl import logger +from ads.opctl.operator.common.const import ENV_OPERATOR_ARGS +from ads.opctl.operator.common.utils import _parse_input_args + +from .operator_config import PIIOperatorConfig + + +def operate(operator_config: PIIOperatorConfig) -> None: + """Runs the PII operator.""" + + print("The operator is running...") + + +def verify(spec: Dict, **kwargs: Dict) -> bool: + """Verifies the PII operator config.""" + operator = PIIOperatorConfig.from_dict(spec) + msg_header = ( + f"{'*' * 30} The operator config has been successfully verified {'*' * 30}" + ) + print(msg_header) + print(operator.to_yaml()) + print("*" * len(msg_header)) + + +def main(raw_args: List[str]): + """The entry point of the PII the operator.""" + args, _ = _parse_input_args(raw_args) + if not args.file and not args.spec and not os.environ.get(ENV_OPERATOR_ARGS): + logger.info( + "Please specify -f[--file] or -s[--spec] or " + f"pass operator's arguments via {ENV_OPERATOR_ARGS} environment variable." + ) + return + + logger.info("-" * 100) + logger.info(f"{'Running' if not args.verify else 'Verifying'} the operator...") + + # if spec provided as input string, then convert the string into YAML + yaml_string = "" + if args.spec or os.environ.get(ENV_OPERATOR_ARGS): + operator_spec_str = args.spec or os.environ.get(ENV_OPERATOR_ARGS) + try: + yaml_string = yaml.safe_dump(json.loads(operator_spec_str)) + except json.JSONDecodeError: + yaml_string = yaml.safe_dump(yaml.safe_load(operator_spec_str)) + except: + yaml_string = operator_spec_str + + operator_config = PIIOperatorConfig.from_yaml( + uri=args.file, + yaml_string=yaml_string, + ) + + logger.info(operator_config.to_yaml()) + + # run operator + if args.verify: + verify(operator_config) + else: + operate(operator_config) + + +if __name__ == "__main__": + main(sys.argv[1:]) diff --git a/ads/opctl/operator/lowcode/pii/cmd.py b/ads/opctl/operator/lowcode/pii/cmd.py index e119c4c76..f76b5faaf 100644 --- a/ads/opctl/operator/lowcode/pii/cmd.py +++ b/ads/opctl/operator/lowcode/pii/cmd.py @@ -32,12 +32,6 @@ def init(**kwargs: Dict) -> str: """ logger.info("==== PII related options ====") - model_type = click.prompt( - "Provide a model type:", - type=click.Choice(SupportedModels.values()), - default=SupportedModels.Prophet, - ) - return YamlGenerator( schema=_load_yaml_from_uri(__file__.replace("cmd.py", "schema.yaml")) - ).generate_example_dict(values={"model": model_type, "type": kwargs.get("type")}) + ).generate_example_dict(values={"type": kwargs.get("type")}) diff --git a/ads/opctl/operator/lowcode/pii/environment.yaml b/ads/opctl/operator/lowcode/pii/environment.yaml index 420b5d1e7..a4e2d1dc8 100644 --- a/ads/opctl/operator/lowcode/pii/environment.yaml +++ b/ads/opctl/operator/lowcode/pii/environment.yaml @@ -6,3 +6,5 @@ dependencies: - pip - pip: - datapane + - scrubadub + - "git+https://github.com/oracle/accelerated-data-science.git@feature/ads_pii_operator#egg=oracle-ads" diff --git a/ads/opctl/operator/lowcode/pii/errors.py b/ads/opctl/operator/lowcode/pii/errors.py new file mode 100644 index 000000000..73aadaf46 --- /dev/null +++ b/ads/opctl/operator/lowcode/pii/errors.py @@ -0,0 +1,27 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*-- + +# Copyright (c) 2023 Oracle and/or its affiliates. 
+# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ + + +class PIISchemaYamlError(Exception): + """Exception raised when there is an issue with the schema.""" + + def __init__(self, error: str): + super().__init__( + "Invalid PII operator specification. Check the YAML structure and ensure it " + "complies with the required schema for PII operator. \n" + f"{error}" + ) + + +class PIIInputDataError(Exception): + """Exception raised when there is an issue with input data.""" + + def __init__(self, error: str): + super().__init__( + "Invalid input data. Check the input data and ensure it " + "complies with the validation criteria. \n" + f"{error}" + ) diff --git a/ads/opctl/operator/lowcode/pii/model/__init__.py b/ads/opctl/operator/lowcode/pii/model/__init__.py new file mode 100644 index 000000000..b8d0460f5 --- /dev/null +++ b/ads/opctl/operator/lowcode/pii/model/__init__.py @@ -0,0 +1,5 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*-- + +# Copyright (c) 2023 Oracle and/or its affiliates. +# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ diff --git a/ads/opctl/operator/lowcode/pii/operator_config.py b/ads/opctl/operator/lowcode/pii/operator_config.py new file mode 100644 index 000000000..aa4faa0d7 --- /dev/null +++ b/ads/opctl/operator/lowcode/pii/operator_config.py @@ -0,0 +1,96 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*-- + +# Copyright (c) 2023 Oracle and/or its affiliates. +# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ + +import os +from dataclasses import dataclass, field +from typing import Dict, List +from ads.common.serializer import DataClassSerializable +from ads.opctl.operator.common.utils import _load_yaml_from_uri +from ads.opctl.operator.common.operator_config import OperatorConfig + + +@dataclass(repr=True) +class InputData(DataClassSerializable): + """Class representing operator specification input data details.""" + + format: str = None + columns: List[str] = None + url: str = None + options: Dict = None + limit: int = None + + +@dataclass(repr=True) +class OutputDirectory(DataClassSerializable): + """Class representing operator specification output directory details.""" + + format: str = None + url: str = None + name: str = None + options: Dict = None + + +@dataclass(repr=True) +class Report(DataClassSerializable): + """Class representing operator specification report details.""" + + report_filename: str = None + show_rows: int = 25 + show_sensitive_content: bool = False + + +@dataclass(repr=True) +class Redactor(DataClassSerializable): + """Class representing operator specification redactor directory details.""" + + detectors: list = None + spacy_detectors: Dict = None + anonymization: list = None + + +@dataclass(repr=True) +class PiiOperatorSpec(DataClassSerializable): + """Class representing pii operator specification.""" + + name: str = None + input_data: InputData = field(default_factory=InputData) + output_directory: OutputDirectory = field(default_factory=OutputDirectory) + report: Report = field(default_factory=Report) + target_column: str = None + redactor: Redactor = field(default_factory=Redactor) + + def __post_init__(self): + """Adjusts the specification details.""" + self.report_file_name = self.report_file_name or "report.html" + + +@dataclass(repr=True) +class PiiOperatorConfig(OperatorConfig): + """Class representing pii operator config. 
+
+    Attributes
+    ----------
+    kind: str
+        The kind of the resource. For operators it is always - `operator`.
+    type: str
+        The type of the operator. For pii operator it is always - `pii`.
+    version: str
+        The version of the operator.
+    spec: PiiOperatorSpec
+        The pii operator specification.
+    """
+
+    kind: str = "operator"
+    type: str = "pii"
+    version: str = "v1"
+    spec: PiiOperatorSpec = field(default_factory=PiiOperatorSpec)
+
+    @classmethod
+    def _load_schema(cls) -> str:
+        """Loads operator schema."""
+        return _load_yaml_from_uri(
+            os.path.join(os.path.dirname(os.path.abspath(__file__)), "schema.yaml")
+        )
diff --git a/ads/opctl/operator/lowcode/pii/schema.yaml b/ads/opctl/operator/lowcode/pii/schema.yaml
index e69de29bb..e189aa4a7 100644
--- a/ads/opctl/operator/lowcode/pii/schema.yaml
+++ b/ads/opctl/operator/lowcode/pii/schema.yaml
@@ -0,0 +1,145 @@
+kind:
+  allowed:
+    - operator
+  required: true
+  type: string
+  default: operator
+  meta:
+    description: "Which service are you trying to use? Common kinds: `operator`, `job`"
+
+version:
+  allowed:
+    - "v1"
+  required: true
+  type: string
+  default: v1
+  meta:
+    description: "Operators may change yaml file schemas from version to version, as well as implementation details. Double check the version to ensure compatibility."
+
+type:
+  required: true
+  type: string
+  default: pii
+  meta:
+    description: "Type should always be `pii` when using a pii operator"
+
+
+spec:
+  required: true
+  schema:
+    input_data:
+      required: true
+      type: dict
+      meta:
+        description: "This should be indexed by target column."
+      schema:
+        format:
+          allowed:
+            - csv
+            - json
+          required: false
+          type: string
+        columns:
+          required: false
+          type: list
+          schema:
+            type: string
+        options:
+          nullable: true
+          required: false
+          type: dict
+        url:
+          required: true
+          type: string
+          default: data.csv
+          meta:
+            description: "The url can be local, or remote. For example: `oci://<bucket>@<namespace>/data.csv`"
+        limit:
+          required: false
+          type: integer
+
+    output_directory:
+      required: false
+      schema:
+        format:
+          required: false
+          type: string
+          allowed:
+            - csv
+            - json
+        url:
+          required: true
+          type: string
+          default: result/
+          meta:
+            description: "The url can be local, or remote. For example: `oci://<bucket>@<namespace>/`"
+        name:
+          required: false
+          type: string
+        options:
+          nullable: true
+          required: false
+          type: dict
+      type: dict
+
+    report:
+      required: false
+      schema:
+        report_filename:
+          required: false
+          type: string
+          default: report.html
+          meta:
+            description: "Placed into output_directory location. Defaults to report.html"
+        show_rows:
+          required: false
+          type: integer
+          default: 25
+        show_sensitive_content:
+          required: false
+          default: false
+          type: bool
+      type: dict
+
+    target_column:
+      type: string
+      required: true
+      default: target
+
+    redactor:
+      type: dict
+      required: true
+      schema:
+        detectors:
+          required: false
+          type: list
+          schema:
+            type: string
+          meta:
+            description: "Default detectors supported by scrubadub."
+
+        spacy_detectors:
+          type: dict
+          required: false
+          schema:
+            model:
+              type: string
+              required: true
+              default: en_core_web_trf
+            named_entities:
+              type: list
+              required: true
+              default: ["PERSON"]
+              schema:
+                type: string
+          meta:
+            description: "Apply a spaCy model to detect the target entities."
+
+        anonymization:
+          type: list
+          required: false
+          schema:
+            type: string
+          meta:
+            description: "Anonymize the selected entities."
+ type: dict From 3f929f05df3d27bb9258a52dfb237fa380515b53 Mon Sep 17 00:00:00 2001 From: MING KANG Date: Wed, 1 Nov 2023 17:00:30 -0700 Subject: [PATCH 03/18] updated default schema --- .../common/operator_yaml_generator.py | 4 +- ads/opctl/operator/lowcode/pii/__main__.py | 14 +- .../operator/lowcode/pii/model/guardrails.py | 190 ++++++++ ads/opctl/operator/lowcode/pii/model/pii.py | 190 ++++++++ .../operator/lowcode/pii/model/processor.py | 326 +++++++++++++ .../operator/lowcode/pii/model/report.py | 441 ++++++++++++++++++ ads/opctl/operator/lowcode/pii/model/utils.py | 234 ++++++++++ .../operator/lowcode/pii/operator_config.py | 20 +- ads/opctl/operator/lowcode/pii/schema.yaml | 50 +- 9 files changed, 1418 insertions(+), 51 deletions(-) create mode 100644 ads/opctl/operator/lowcode/pii/model/guardrails.py create mode 100644 ads/opctl/operator/lowcode/pii/model/pii.py create mode 100644 ads/opctl/operator/lowcode/pii/model/processor.py create mode 100644 ads/opctl/operator/lowcode/pii/model/report.py create mode 100644 ads/opctl/operator/lowcode/pii/model/utils.py diff --git a/ads/opctl/operator/common/operator_yaml_generator.py b/ads/opctl/operator/common/operator_yaml_generator.py index b2b9e2823..3e8301693 100644 --- a/ads/opctl/operator/common/operator_yaml_generator.py +++ b/ads/opctl/operator/common/operator_yaml_generator.py @@ -76,7 +76,7 @@ def _check_condition( Returns ------- bool - True if the condition fulfils, false otherwise. + True if the condition fulfills, false otherwise. """ for key, value in condition.items(): if key not in example or example[key] != value: @@ -104,7 +104,7 @@ def _generate_example( """ example = {} for key, value in schema.items(): - # only generate values fro required fields + # only generate values for required fields if ( value.get("required", False) or value.get("dependencies", False) diff --git a/ads/opctl/operator/lowcode/pii/__main__.py b/ads/opctl/operator/lowcode/pii/__main__.py index aa1c31e38..fae0fda83 100644 --- a/ads/opctl/operator/lowcode/pii/__main__.py +++ b/ads/opctl/operator/lowcode/pii/__main__.py @@ -15,18 +15,24 @@ from ads.opctl.operator.common.const import ENV_OPERATOR_ARGS from ads.opctl.operator.common.utils import _parse_input_args -from .operator_config import PIIOperatorConfig +from .operator_config import PiiOperatorConfig -def operate(operator_config: PIIOperatorConfig) -> None: +def operate(operator_config: PiiOperatorConfig) -> None: """Runs the PII operator.""" + # import pdb + # pdb.set_trace() print("The operator is running...") + # from pii.guardrails import PIIGuardrail + + # guard = PIIGuardrail(config_uri="./responsibleai.yaml") + # guard.evaluate() def verify(spec: Dict, **kwargs: Dict) -> bool: """Verifies the PII operator config.""" - operator = PIIOperatorConfig.from_dict(spec) + operator = PiiOperatorConfig.from_dict(spec) msg_header = ( f"{'*' * 30} The operator config has been successfully verified {'*' * 30}" ) @@ -59,7 +65,7 @@ def main(raw_args: List[str]): except: yaml_string = operator_spec_str - operator_config = PIIOperatorConfig.from_yaml( + operator_config = PiiOperatorConfig.from_yaml( uri=args.file, yaml_string=yaml_string, ) diff --git a/ads/opctl/operator/lowcode/pii/model/guardrails.py b/ads/opctl/operator/lowcode/pii/model/guardrails.py new file mode 100644 index 000000000..f32f9b463 --- /dev/null +++ b/ads/opctl/operator/lowcode/pii/model/guardrails.py @@ -0,0 +1,190 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*-- + +# Copyright (c) 2023 Oracle and/or its affiliates. 
+# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/
+
+
+import pandas as pd
+from ads.opctl.operator.lowcode.pii.model.utils import from_yaml
+from ads.opctl.operator.lowcode.pii.model.pii import config_scrubber, scrub, detect
+from ads.opctl.operator.lowcode.pii.model.report import PIIOperatorReport
+from ads.common import auth as authutil
+from datetime import datetime
+import os
+import time
+import datapane as dp
+
+
+def get_output_name(given_name, target_name=None):
+    """Adds the ``_out`` suffix to the src filename."""
+    if not target_name:
+        basename = os.path.basename(given_name)
+        fn, ext = os.path.splitext(basename)
+        target_name = fn + "_out" + ext
+    return target_name
+
+
+class PIIGuardrail:
+    def __init__(self, config_uri: str, auth: dict = None):
+        # load config.yaml for pii
+        self.spec = from_yaml(uri=config_uri).get("spec")
+        self.output_data_name = None
+        # configure metrics
+        for metric in self.spec.get("metrics", []):
+            # TODO: load other metrics
+            # load pii metric
+            if metric.get("name", "") == "pii":
+                pii_load_args = metric.get("load_args")
+                self.scrubber = config_scrubber(**pii_load_args)
+                self.target_col = metric.get("target_col", "text")
+                self.output_data_name = metric.get("output_data_name", None)
+
+        # config spec
+        self.src_data_uri = self.spec.get("test_data").get("url")
+        self.dst_uri = None
+        self.data = None
+        self.report_uri = None
+        self.auth = auth or authutil.default_signer()
+        self.output_directory = self.spec.get("output_directory", {}).get("url", None)
+        if self.output_directory:
+            self.dst_uri = os.path.join(
+                self.output_directory,
+                get_output_name(
+                    target_name=self.output_data_name, given_name=self.src_data_uri
+                ),
+            )
+
+        self.report_spec = self.spec.get("report", {})
+        self.report_uri = (
+            os.path.join(
+                self.report_spec.get("url", "./"),
+                self.report_spec.get("report_file_name", "report.html"),
+            )
+            if self.report_spec
+            else None
+        )
+        self.show_rows = self.report_spec.get("show_rows", 25)
+        self.show_sensitive_content = self.report_spec.get(
+            "show_sensitive_content", False
+        )
+
+    def load_data(self, uri=None, storage_options={}):
+        # POC: only CSV input is supported (csv -> pandas.DataFrame)
+        uri = uri or self.src_data_uri
+        if uri.endswith(".csv"):
+            if uri.startswith("oci://"):
+                storage_options = storage_options or self.auth
+                self.data = pd.read_csv(uri, storage_options=storage_options)
+            else:
+                self.data = pd.read_csv(uri)
+        return self.data
+
+    def evaluate(self, data=None, dst_uri=None, report_uri=None, storage_options={}):
+        run_at = datetime.now()
+        dt_string = run_at.strftime("%d/%m/%Y %H:%M:%S")
+        start_time = time.time()
+        data = data if data is not None else self.data
+        if data is None:
+            data = self.load_data(storage_options=storage_options)
+
+        report_uri = report_uri or self.report_uri
+        dst_uri = dst_uri or self.dst_uri
+
+        data["redacted_text"] = data[self.target_col].apply(
+            lambda x: scrub(x, scrubber=self.scrubber)
+        )
+        elapsed_time = time.time() - start_time
+        # generate pii report
+        if report_uri:
+            data["entities_cols"] = data[self.target_col].apply(
+                lambda x: detect(text=x, scrubber=self.scrubber)
+            )
+            from ads.opctl.operator.lowcode.pii.model.utils import _safe_get_spec
+            from ads.opctl.operator.lowcode.pii.model.pii import DEFAULT_SPACY_MODEL
+
+            selected_spacy_model = []
+            for spec in _safe_get_spec(
+                self.scrubber.redact_spec_file, "spacy_detectors", []
+            ):
+                selected_spacy_model.append(
+                    {
+                        "model": _safe_get_spec(spec, "model", DEFAULT_SPACY_MODEL),
+                        "spacy_entites": [
+                            x.upper() for x in spec.get("named_entities", [])
+                        ],
+                    }
+                )
+
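+            # Collect the entity types targeted in this run: the named entities
+            # from every configured spaCy model plus the built-in detector names.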
selected_entities = [] + for spacy_models in selected_spacy_model: + selected_entities = selected_entities + spacy_models.get( + "spacy_entites", [] + ) + selected_entities = selected_entities + _safe_get_spec( + self.scrubber.redact_spec_file, "detectors", [] + ) + + context = { + "run_summary": { + "total_tokens": 0, + "src_uri": self.src_data_uri, + "total_rows": len(data.index), + "config": self.spec, + "selected_detectors": list(self.scrubber._detectors.values()), + "selected_entities": selected_entities, + "selected_spacy_model": selected_spacy_model, + "timestamp": dt_string, + "elapsed_time": elapsed_time, + "show_rows": self.show_rows, + "show_sensitive_info": self.show_sensitive_content, + }, + "run_details": {"rows": []}, + } + for ind in data.index: + text = data[self.target_col][ind] + ent_col = data["entities_cols"][ind] + idx = data["id"][ind] + page = { + "id": idx, + "total_tokens": len(ent_col), + "entities": ent_col, + "raw_text": text, + } + context.get("run_details").get("rows").append(page) + context.get("run_summary")["total_tokens"] += len(ent_col) + + context = self._process_context(context) + self._generate_report(context, report_uri) + + if dst_uri: + self._save_output(data, ["id", "redacted_text"], dst_uri) + + print("Mission completed!") + + def _generate_report(self, context, report_uri): + report_ = PIIOperatorReport(context=context) + report_sections = report_.make_view() + report_.save_report(report_sections=report_sections, report_path=report_uri) + + def _save_output(self, df, target_col, dst_uri): + # Based on extension of dst_uri call to_csv or to_json. + data_out = df[target_col] + data_out.to_csv(dst_uri) + return dst_uri + + def _process_context(self, context): + """Count different type of filth.""" + statics = {} # statics : count Filth type in total + rows = context.get("run_details").get("rows") + for row in rows: + entities = row.get("entities") + row_statics = {} # count row + for ent in entities: + row_statics[ent.type] = row_statics.get(ent.type, 0) + 1 + statics[ent.type] = statics.get(ent.type, 0) + 1 + + row["statics"] = row_statics.copy() + + context.get("run_summary")["statics"] = statics + return context diff --git a/ads/opctl/operator/lowcode/pii/model/pii.py b/ads/opctl/operator/lowcode/pii/model/pii.py new file mode 100644 index 000000000..b3b68ef65 --- /dev/null +++ b/ads/opctl/operator/lowcode/pii/model/pii.py @@ -0,0 +1,190 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*-- + +# Copyright (c) 2023 Oracle and/or its affiliates. +# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ + + +import scrubadub +import scrubadub_spacy +import os +import re +import logging +import uuid + +from ads.opctl.operator.lowcode.pii.model.utils import ( + load_html, + SupportInputFormat, + from_yaml, + _safe_get_spec, + default_config, + _read_from_file, + load_rtf, + construct_filth_cls_name, + _write_to_file, + _process_pos, + ReportContextKey, +) +from ads.opctl.operator.lowcode.pii.model.processor import POSTPROCESSOR_MAP + +DEFAULT_SPACY_NAMED_ENTITIES = ["DATE", "FAC", "GPE", "LOC", "ORG", "PER", "PERSON"] +DEFAULT_SPACY_MODEL = "en_core_web_trf" + + +def config_post_processor(spec: dict): + """Return class scrubadub.post_processors.base.PostProcessor.""" + name = _safe_get_spec(spec, "name", "").lower() + if not name in POSTPROCESSOR_MAP.keys(): + raise ValueError( + f"Unsupport post processor: {name}. Only support {POSTPROCESSOR_MAP.keys()}." 
+ ) + cls = POSTPROCESSOR_MAP.get(name) + if name == "number_replacer": + cls._ENTITIES = _safe_get_spec(spec, "entities", cls._ENTITIES) + + return cls + + +def config_spacy_detector(spec: dict): + """Return an instance of scrubadub_spacy.detectors.spacy.SpacyEntityDetector.""" + model = _safe_get_spec(spec, "model", DEFAULT_SPACY_MODEL) + + named_entities = [x.upper() for x in spec.get("named_entities", [])] + spacy_entity_detector = scrubadub_spacy.detectors.spacy.SpacyEntityDetector( + named_entities=named_entities, + name=f"spacy_{uuid.uuid4()}", + model=model, + ) + for named_entity in named_entities: + # DEFAULT_SPACY_NAMED_ENTITIES has been registered in filth_cls_map already. + if named_entity in DEFAULT_SPACY_NAMED_ENTITIES: + continue + + filth_cls = type( + construct_filth_cls_name(named_entity), + (scrubadub.filth.Filth,), + {"type": named_entity.upper()}, + ) + spacy_entity_detector.filth_cls_map[named_entity.upper()] = filth_cls + return spacy_entity_detector + + +def config_scrubber( + config: str or dict = None, +): + """ + Returns an instance of srubadub.Scrubber. + + Args: + config: A path to a yaml file or a dict. + + Returns: + An instance of srubadub.Scrubber, which has been configured with the given config. + """ + if not config: + config = default_config() + logging.info(f"Loading config from {config}") + + if isinstance(config, str): + config = from_yaml(uri=config) + + redact_spec_file = config["redactor"] + + detector_list = [] + scrubber = scrubadub.Scrubber() + scrubber.redact_spec_file = redact_spec_file + + # Clean up default detectors + defautls_enable = scrubber._detectors.copy() + for d in defautls_enable: + scrubber.remove_detector(d) + + # Add scrubber built-in detectors + for detector in _safe_get_spec(redact_spec_file, "detectors", []): + detector_list.append(detector) + + # Add spacy detectors + for spec in _safe_get_spec(redact_spec_file, "spacy_detectors", []): + spacy_entity_detector = config_spacy_detector(spec=spec) + detector_list.append(spacy_entity_detector) + + # Add custom detectors + for custom in _safe_get_spec(redact_spec_file, "custom_detectors", []): + patterns = custom.get("patterns", "") + + class CustomFilth(scrubadub.filth.Filth): + type = custom.get("label", "").upper() + + class CustomDetector(scrubadub.detectors.RegexDetector): + filth_cls = CustomFilth + regex = re.compile( + rf"{patterns}", + ) + name = custom.get("name") + + detector_list.append(CustomDetector()) + + for detector in detector_list: + scrubber.add_detector(detector) + + # Add post-processor + for post_processor in _safe_get_spec(redact_spec_file, "anonymization", []): + scrubber.add_post_processor(config_post_processor(post_processor)) + + return scrubber + + +def scrub(text, spec_file=None, scrubber=None): + if not scrubber: + scrubber = config_scrubber(spec_file) + return scrubber.clean(text) + + +def detect(text, spec_file=None, scrubber=None): + if not scrubber: + scrubber = config_scrubber(spec_file) + return list(scrubber.iter_filth(text, document_name=None)) + + +def _get_report_( + input_path, output_path, scrubber=None, report_context=None, subdirectory=None +) -> None: + filename_with_ext = os.path.basename(input_path) + file_name, file_ext = os.path.splitext(filename_with_ext) + + report_text = "" + if file_ext == SupportInputFormat.PLAIN: + report_text = _read_from_file(input_path) + elif file_ext == SupportInputFormat.HTML: + report_text = load_html(uri=input_path) + elif file_ext == SupportInputFormat.RTF: + report_text = load_rtf(uri=input_path) + 
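+    # Fail fast on unrecognized extensions before any scrubbing or output is produced.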
else: + raise ValueError( + f"Unsupport file format: {file_ext}. Only support {SupportInputFormat.get_support_list()}." + ) + + # preprocess src to remove ** + report_text_ = report_text.replace("**", "") + + scrubbed_text = scrub(text=report_text_, scrubber=scrubber) + dst_uri = os.path.join(output_path, file_name + ".txt") + _write_to_file( + uri=dst_uri, + s=scrubbed_text, + encoding="utf-8", + ) + + # Only generate report if report_context is not None + if report_context: + entities = detect(text=report_text_, scrubber=scrubber) + file_summary = { + ReportContextKey.INPUT_FILE_NAME: input_path, + ReportContextKey.OUTPUT_NAME: dst_uri, + ReportContextKey.TOTAL_TOKENS: len(entities), + ReportContextKey.ENTITIES: _process_pos(entities, report_text_), + ReportContextKey.FILE_NAME: file_name, + } + report_context.get(ReportContextKey.FILE_SUMMARY).get(subdirectory).append( + file_summary + ) diff --git a/ads/opctl/operator/lowcode/pii/model/processor.py b/ads/opctl/operator/lowcode/pii/model/processor.py new file mode 100644 index 000000000..1ff204a00 --- /dev/null +++ b/ads/opctl/operator/lowcode/pii/model/processor.py @@ -0,0 +1,326 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*-- + +# Copyright (c) 2023 Oracle and/or its affiliates. +# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ + + +"""Contains post processors for scrubadub +Usage: + +scrubber.add_post_processor(NameReplacer()) +scrubber.add_post_processor(NumberReplacer()) + +To keep the same name replacement mappings across multiple documents, +either use the same scrubber instance to clean all the documents, +or use the same NameReplace() instance for all scrubbers. +""" +import datetime +import random +import re +import string +from typing import Sequence + +import scrubadub +import gender_guesser.detector as gender_detector + +from faker import Faker +from scrubadub.filth import Filth +from nameparser import HumanName + + +class NameReplacer(scrubadub.post_processors.PostProcessor): + name = "name_replacer" + + def __init__(self, name: str = None, mapping: dict = None): + if mapping: + self.mapping = mapping + else: + self.mapping = {} + + self.gender_detector = gender_detector.Detector() + self.fake = Faker() + self.groups = { + "first": self.first_name_generator, + "middle": self.first_name_generator, + "last": self.last_name_generator, + "suffix": lambda x: "", + } + super().__init__(name) + + def first_name_generator(self, name): + detected_gender = self.gender_detector.get_gender(name) + if "female" in detected_gender: + return self.fake.first_name_female() + elif "male" in detected_gender: + return self.fake.first_name_male() + return self.fake.first_name_nonbinary() + + def last_name_generator(self, *args): + return self.fake.last_name() + + def unwrap_filth(self, filth_list): + """Un-merge the filths if they have different types.""" + processed = [] + for filth in filth_list: + # MergedFilths has the property "filths" + # Do nothing if filth has a type already + if filth.type in ["unknown", "", None] and hasattr(filth, "filths"): + filth_types = set([f.type.lower() for f in filth.filths]) + # Do nothing if the filth does not contain a name + if "name" not in filth_types: + processed.append(filth) + continue + if len(filth_types) > 1: + processed.extend(filth.filths) + continue + filth.type = filth.filths[0].type + filth.detector_name = filth.filths[0].detector_name + processed.append(filth) + return processed + + @staticmethod + def has_initial(name: HumanName) 
-> bool: + for attr in ["first", "middle", "last"]: + if len(str(getattr(name, attr)).strip(".")) == 1: + return True + return False + + @staticmethod + def has_non_initial(name: HumanName) -> bool: + for attr in ["first", "middle", "last"]: + if len(str(getattr(name, attr)).strip(".")) > 1: + return True + return False + + @staticmethod + def generate_component(name_component: str, generator): + fake_component = generator(name_component) + if len(name_component.rstrip(".")) == 1: + fake_component = fake_component[0] + if name_component.endswith("."): + fake_component += "." + return fake_component + + def save_name_mapping(self, name: HumanName, fake_name: HumanName): + """Saves the names with initials to the mapping so that a new name will not be generated. + For example, if name is "John Richard Doe", this method will save the following keys to the mapping: + - J Doe + - John D + - J R Doe + - John R D + - John R Doe + """ + # Both first name and last name must be presented + if not name.first or not name.last: + return + # Remove any dot at the end of the name component. + for attr in ["first", "middle", "last"]: + setattr(name, attr, getattr(name, attr).rstrip(".")) + + self.mapping[ + f"{name.first[0]} {name.last}" + ] = f"{fake_name.first[0]} {fake_name.last}" + + self.mapping[ + f"{name.first} {name.last[0]}" + ] = f"{fake_name.first} {fake_name.last[0]}" + + if name.middle: + self.mapping[ + f"{name.first[0]} {name.middle[0]} {name.last}" + ] = f"{fake_name.first[0]} {fake_name.middle[0]} {fake_name.last}" + + self.mapping[ + f"{name.first} {name.middle[0]} {name.last[0]}" + ] = f"{fake_name.first} {fake_name.middle[0]} {fake_name.last[0]}" + + self.mapping[ + f"{name.first} {name.middle[0]} {name.last}" + ] = f"{fake_name.first} {fake_name.middle[0]} {fake_name.last}" + + def replace(self, text): + """Replaces a name with fake name. + + Parameters + ---------- + text : str or HumanName + The name to be replaced. + If text is a HumanName object, the object will be modified to have the new fake names. + + Returns + ------- + str + The replaced name as text. + """ + if isinstance(text, HumanName): + name = text + else: + name = HumanName(text) + skip = [] + # Check if the name is given with initial for one of the first name/last name + key = None + if self.has_initial(name) and self.has_non_initial(name): + if name.middle: + key = f'{name.first.rstrip(".")} {name.middle.rstrip(".")} {name.last.rstrip(".")}' + else: + key = f'{name.first.rstrip(".")} {name.last.rstrip(".")}' + fake_name = self.mapping.get(key) + # If a fake name is found matching the first initial + last name or first name + last initial + # Replace the the initial with the corresponding initial + # and skip processing the first and last name in the replacement. 
+ if fake_name: + fake_name = HumanName(fake_name) + name.first = fake_name.first + name.last = fake_name.last + skip = ["first", "last"] + if name.middle: + name.middle = fake_name.middle + skip.append("middle") + # Replace each component in the name + for attr, generator in self.groups.items(): + if attr in skip: + continue + name_component = getattr(name, attr, None) + if not name_component: + continue + # Check if a fake name has been generated for this name + fake_component = self.mapping.get(name_component) + if not fake_component: + fake_component = self.generate_component(name_component, generator) + # Generate a unique fake name that is not already in the mapping + while fake_component and ( + fake_component in self.mapping.keys() + or fake_component in self.mapping.values() + ): + fake_component = self.generate_component(name_component, generator) + self.mapping[name_component] = fake_component + setattr(name, attr, fake_component) + + # Save name with initials to mapping + original_name = text if isinstance(text, HumanName) else HumanName(text) + self.save_name_mapping(original_name, name) + return str(name) + + def process_filth(self, filth_list: Sequence[Filth]) -> Sequence[Filth]: + filth_list = self.unwrap_filth(filth_list) + + name_filths = [] + # Filter to keep only the names + for filth in filth_list: + if filth.replacement_string: + continue + if filth.type.lower() != "name": + continue + name_filths.append(filth) + + # Sort reverse by last name so that names having a last name will be processed first. + # When a name is referred by last name (e.g. Mr. White), HumanName will parse it as first name. + name_filths.sort(key=lambda x: HumanName(x.text).last, reverse=True) + for filth in name_filths: + filth.replacement_string = self.replace(filth.text) + return filth_list + + +class NumberReplacer(scrubadub.post_processors.PostProcessor): + name = "number_replacer" + _ENTITIES = [ + "number", + "mrn", + "fin", + "phone", + "social_security_number", + ] + + @staticmethod + def replace_digit(obj): + return random.choice("0123456789") + + def match_entity_type(self, filth_types): + if list(set(self._ENTITIES) & set(filth_types)): + return True + return False + + def replace_date(self, text): + date_formats = ["%m-%d-%Y", "%m-%d-%y", "%d-%m-%Y", "%d-%m-%y"] + for date_format in date_formats: + try: + date = datetime.datetime.strptime(text, date_format) + except ValueError: + continue + if date.year < 1900 or date.year > datetime.datetime.now().year: + continue + # Now the date is a valid data between 1900 and now + return text + return None + + def replace(self, text): + # Check dates + date = self.replace_date(text) + if date: + return date + return re.sub(r"\d", self.replace_digit, text) + + def process_filth(self, filth_list: Sequence[Filth]) -> Sequence[Filth]: + for filth in filth_list: + # Do not process it if it already has a replacement. 
+ if filth.replacement_string: + continue + if filth.type.lower() in self._ENTITIES: + filth.replacement_string = self.replace(filth.text) + # Replace the numbers for merged filth + if filth.type.lower() == "unknown" and hasattr(filth, "filths"): + filth_types = set([f.type for f in filth.filths]) + if self.match_entity_type(filth_types): + filth.replacement_string = self.replace(filth.text) + return filth_list + + +class EmailReplacer(scrubadub.post_processors.PostProcessor): + name = "email_replacer" + + def process_filth(self, filth_list: Sequence[Filth]) -> Sequence[Filth]: + for filth in filth_list: + if filth.replacement_string: + continue + if filth.type.lower() != "email": + continue + filth.replacement_string = Faker().email() + return filth_list + + +class HIBNReplacer(scrubadub.post_processors.PostProcessor): + name = "hibn_replacer" + + def process_filth(self, filth_list: Sequence[Filth]) -> Sequence[Filth]: + # TODO: Add support for anomymizing Health insurance beneficiary number ~ Consecutive sequence of alphanumeric characters + pass + + +class MBIReplacer(scrubadub.post_processors.PostProcessor): + name = "mbi_replacer" + CHAR_POOL = "ACDEFGHJKMNPQRTUVWXY" + + def generate_mbi(self): + return "".join(random.choices(self.CHAR_POOL + string.digits, k=11)) + + def process_filth(self, filth_list: Sequence[Filth]) -> Sequence[Filth]: + for filth in filth_list: + if filth.replacement_string: + continue + if filth.type.lower() != "mbi": + continue + filth.replacement_string = self.generate_mbi() + return filth_list + + +POSTPROCESSOR_MAP = { + item.name.lower(): item + for item in [ + NameReplacer, + NumberReplacer, + EmailReplacer, + HIBNReplacer, + MBIReplacer, + ] +} diff --git a/ads/opctl/operator/lowcode/pii/model/report.py b/ads/opctl/operator/lowcode/pii/model/report.py new file mode 100644 index 000000000..2584544ce --- /dev/null +++ b/ads/opctl/operator/lowcode/pii/model/report.py @@ -0,0 +1,441 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*-- + +# Copyright (c) 2023 Oracle and/or its affiliates. +# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ + + +# helper function to make report +import yaml +import plotly.express as px +import pandas as pd +import datapane as dp +import random +import plotly.graph_objects as go +import fsspec + + +PII_REPORT_DESCRIPTION = ( + "This report will offer a comprehensive overview of the redaction of personal identifiable information (PII) from the provided data." + "The `Summary` section will provide an executive summary of this process, including key statistics, configuration, and model usage." + "The `Details` section will offer a more granular analysis of each row of data, including relevant statistics." +) +DETAILS_REPORT_DESCRIPTION = "The following report will show the details on each row. You can view the highlighted named entities and their labels in the text under `TEXT` tab." 
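+
+# The classes below render the `context` dict assembled in
+# PIIGuardrail.evaluate(): `run_summary` carries run-level statistics and the
+# operator configuration, while `run_details.rows` holds one entry per scanned
+# row (id, detected entities, raw text, and per-row entity counts).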
+ + +################ +# Others utils # +################ +def compute_rate(elapsed_time, num_unit): + return elapsed_time / num_unit + + +def human_time_friendly(seconds): + TIME_DURATION_UNITS = ( + ("week", 60 * 60 * 24 * 7), + ("day", 60 * 60 * 24), + ("hour", 60 * 60), + ("min", 60), + ) + if seconds == 0: + return "inf" + accumulator = [] + for unit, div in TIME_DURATION_UNITS: + amount, seconds = divmod(float(seconds), div) + if amount > 0: + accumulator.append( + "{} {}{}".format(int(amount), unit, "" if amount == 1 else "s") + ) + accumulator.append("{} secs".format(round(seconds, 2))) + return ", ".join(accumulator) + + +FLAT_UI_COLORS = [ + "#1ABC9C", + "#2ECC71", + "#3498DB", + "#9B59B6", + "#34495E", + "#16A085", + "#27AE60", + "#2980B9", + "#8E44AD", + "#2C3E50", + "#F1C40F", + "#E67E22", + "#E74C3C", + "#ECF0F1", + "#95A5A6", + "#F39C12", + "#D35400", + "#C0392B", + "#BDC3C7", + "#7F8C8D", +] +LABEL_TO_COLOR_MAP = {} + + +# all spacy model: https://huggingface.co/spacy +# "en_core_web_trf": "https://huggingface.co/spacy/en_core_web_trf/raw/main/README.md", +def make_model_card(model_name="", readme_path=""): + """Make render model_readme.md as model card.""" + readme_path = ( + f"https://huggingface.co/spacy/{model_name}/raw/main/README.md" + if model_name + else readme_path + ) + if not readme_path: + raise NotImplementedError("Does not support other spacy model so far.") + + with fsspec.open(readme_path, "r") as file: + content = file.read() + _, front_matter, text = content.split("---", 2) + data = yaml.safe_load(front_matter) + + try: + eval_res = data["model-index"][0]["results"] + metrics = [] + values = [] + for eval in eval_res: + metric = [x["name"] for x in eval["metrics"]] + value = [x["value"] for x in eval["metrics"]] + metrics = metrics + metric + values = values + value + df = pd.DataFrame({"Metrics": metrics, "Values": values}) + fig = go.Figure( + data=[ + go.Table( + header=dict(values=list(df.columns)), + cells=dict(values=[df.Metrics, df.Values]), + ) + ] + ) + eval_res_tb = dp.Plot(data=fig, caption="Evaluation Results") + except: + eval_res_tb = dp.Text("-") + print( + "The given readme.md doesn't have correct template for Evaluation Results." 
+ ) + + return dp.Group( + dp.Text(text), + eval_res_tb, + columns=2, + ) + + +################ +# Report utils # +################ +def map_label_to_color(labels): + label_to_colors = {} + for label in labels: + label = label.lower() + label_to_colors[label] = LABEL_TO_COLOR_MAP.get( + label, random.choice(FLAT_UI_COLORS) + ) + LABEL_TO_COLOR_MAP[label] = label_to_colors[label] + + return label_to_colors + + +def plot_pie(count_map) -> dp.Plot: + cols = count_map.keys() + cnts = count_map.values() + ent_col_name = "EntityName" + cnt_col_name = "count" + df = pd.DataFrame({ent_col_name: cols, cnt_col_name: cnts}) + + fig = px.pie( + df, + values=cnt_col_name, + names=ent_col_name, + title="The Distribution Of Entities Redacted", + color=ent_col_name, + color_discrete_map=map_label_to_color(cols), + ) + fig.update_traces(textposition="inside", textinfo="percent+label") + return dp.Plot(fig) + + +def build_entity_df(entites, id) -> pd.DataFrame: + text = [ent.text for ent in entites] + types = [ent.type for ent in entites] + # pos = [f"{ent.beg}" + ":" + f"{ent.end}" for ent in entites] + replaced_values = [ + ent.replacement_string or "{{" + ent.placeholder + "}}" for ent in entites + ] + d = { + "rowID": id, + "Entity (Original Text)": text, + "Type": types, + "Redacted To": replaced_values, + # "Beg: End": pos, + } + df = pd.DataFrame(data=d) + if df.size == 0: + # Datapane does not support empty dataframe, append a dummy row + df2 = { + "rowID": id, + "Entity (Original Text)": "-", + "Type": "-", + "Redacted To": "-", + # "Begs: End": "-", + } + df = df.append(df2, ignore_index=True) + return df + + +class RowReportFields: + # TODO: rename class + def __init__(self, context, show_sensitive_info: bool = True): + self.total_tokens = context.get("total_tokens", "unknown") + self.entites_cnt_map = context.get("statics", {}) + self.raw_text = context.get("raw_text", "") + self.id = context.get("id", "") + self.show_sensitive_info = show_sensitive_info + self.entities = context.get("entities") + + def build_report(self) -> dp.Group: + return dp.Group( + dp.Select( + blocks=[ + self._make_stats_card(), + self._make_text_card(), + ], + type=dp.SelectType.TABS, + ), + label="rowId: " + str(self.id), + ) + + def _make_stats_card(self): + stats = [ + dp.Text("## Row Summary Statistics"), + dp.BigNumber( + heading="Total No. 
Of Entites Proceed", + value=self.total_tokens, + ), + dp.Text(f"### Entities Distribution"), + plot_pie(self.entites_cnt_map), + ] + if self.show_sensitive_info: + stats.append(dp.Text(f"### Resolved Entities")) + stats.append( + dp.DataTable( + build_entity_df(self.entities, id=self.id), + label="Resolved Entities", + ) + ) + return dp.Group(blocks=stats, label="STATS") + + def _make_text_card(self): + annotations = [] + labels = set() + for ent in self.entities: + annotations.append((ent.beg, ent.end, ent.type)) + labels.add(ent.type) + + d = {"Content": [self.raw_text], "Annotations": [annotations]} + df = pd.DataFrame(data=d) + + render_html = df.ads.render_ner( + options={ + "default_color": "#D6D3D1", + "colors": map_label_to_color(labels), + }, + return_html=True, + ) + return dp.Group(dp.HTML(render_html), label="TEXT") + + +class PIIOperatorReport: + def __init__(self, context: dict): + # set useful field for generating report from context + summary_context = context.get("run_summary", {}) + self.config = summary_context.get("config", {}) # for generate yaml + self.show_sensitive_info = summary_context.get("show_sensitive_info", True) + self.show_rows = summary_context.get("show_rows", 25) + self.total_rows = summary_context.get("total_rows", "unknown") + self.total_tokens = summary_context.get("total_tokens", "unknown") + self.elapsed_time = summary_context.get("elapsed_time", 0) + self.entites_cnt_map = summary_context.get("statics", {}) + self.selected_entities = summary_context.get("selected_entities", []) + self.spacy_detectors = summary_context.get("selected_spacy_model", []) + self.run_at = summary_context.get("timestamp", "today") + + rows = context.get("run_details", {}).get("rows", []) + rows = rows[0 : self.show_rows] + self.rows_details = [ + RowReportFields(r, self.show_sensitive_info) for r in rows + ] # List[RowReportFields], len=show_rows + + self._validate_fields() + + def _validate_fields(self): + """Check if any fields are empty.""" + # TODO + pass + + def make_view(self): + title_text = dp.Text("# Personally Identifiable Information Operator Report") + time_proceed = dp.BigNumber( + heading="Ran at", + value=self.run_at, + ) + report_description = dp.Text(PII_REPORT_DESCRIPTION) + + structure = dp.Blocks( + dp.Select( + blocks=[ + dp.Group( + self._build_summary_page(), + label="Summary", + ), + dp.Group( + self._build_details_page(), + label="Details", + ), + ], + type=dp.SelectType.TABS, + ) + ) + self.report_sections = [title_text, report_description, time_proceed, structure] + return self.report_sections + + def save_report(self, report_sections, report_path): + dp.save_report( + report_sections or self.report_sections, + path=report_path, + open=False, + ) + return report_path + + def _build_summary_page(self): + summary = dp.Blocks( + dp.Text("# PII Summary"), + dp.Text(self._get_summary_desc()), + dp.Select( + blocks=[ + self._make_summary_stats_card(), + self._make_yaml_card(), + self._make_model_card(), + ], + type=dp.SelectType.TABS, + ), + ) + + return summary + + def _build_details_page(self): + details = dp.Blocks( + dp.Text(DETAILS_REPORT_DESCRIPTION), + dp.Select( + blocks=[ + row.build_report() for row in self.rows_details + ], # RowReportFields + type=dp.SelectType.DROPDOWN, + label="Details", + ), + ) + + return details + + def _make_summary_stats_card(self) -> dp.Group: + """ + Shows summary statics + 1. total rows + 2. total entites + 3. time_spent/row + 4. entities distribution + 5. 
resolved Entities in sample data - optional + """ + summary_stats = [ + dp.Text("## Summary Statistics"), + dp.Group( + dp.BigNumber( + heading="Total No. Of Rows", + value=self.total_rows, + ), + dp.BigNumber( + heading="Total No. Of Entites Proceed", + value=self.total_tokens, + ), + dp.BigNumber( + heading="Rows per second processed", + value=compute_rate(self.elapsed_time, self.total_rows), + ), + dp.BigNumber( + heading="Total Time Spent", + value=human_time_friendly(self.elapsed_time), + ), + columns=2, + ), + dp.Text(f"### Entities Distribution"), + plot_pie(self.entites_cnt_map), + ] + if self.show_sensitive_info: + entites_df = self._build_total_entity_df() + summary_stats.append(dp.Text(f"### Resolved Entities")) + summary_stats.append(dp.DataTable(entites_df)) + return dp.Group(blocks=summary_stats, label="STATS") + + def _make_yaml_card(self) -> dp.Group: + # show pii config yaml + yaml_string = yaml.dump(self.config, Dumper=yaml.SafeDumper) + yaml_appendix_title = dp.Text(f"## Reference: YAML File") + yaml_appendix = dp.Code(code=yaml_string, language="yaml") + return dp.Group(blocks=[yaml_appendix_title, yaml_appendix], label="YAML") + + def _make_model_card(self) -> dp.Group: + # show each model card + model_cards = [ + dp.Group( + make_model_card(model_name=x.get("model")), + label=x.get("model"), + ) + for x in self.spacy_detectors + ] + + if len(model_cards) <= 1: + return dp.Group( + blocks=model_cards, + label="MODEL CARD", + ) + return dp.Group( + dp.Select( + blocks=model_cards, + type=dp.SelectType.TABS, + ), + label="MODEL CARD", + ) + + def _build_total_entity_df(self) -> pd.DataFrame: + frames = [] + for row in self.rows_details: # RowReportFields + frames.append(build_entity_df(entites=row.entities, id=row.id)) + + result = pd.concat(frames) + return result + + def _get_summary_desc(self) -> str: + entities_mark_down = ["**" + ent + "**" for ent in self.selected_entities] + + model_description = "" + for spacy_model in self.spacy_detectors: + model_description = ( + model_description + + f"You chose the **{spacy_model.get('model', 'unknown model')}** model for **{spacy_model.get('spacy_entites', 'unknown entities')}** detection." + ) + if model_description: + model_description = ( + model_description + + "You can view the model details under the ``MODEL CARD`` tab." + ) + + SUMMARY_REPORT_DESCRIPTION_TEMPLATE = f""" + This report will detail the statistics and configuration of the redaction process.The report will contain information such as the number of rows processed, the number of entities redacted, and so on. The report will provide valuable insight into the performance of the PII tool and facilitate any necessary adjustments to improve its performance. + + Based on the configuration file (you can view the YAML details under the ``YAML`` tab), you selected the following entities: {entities_mark_down}. + {model_description} + """ + return SUMMARY_REPORT_DESCRIPTION_TEMPLATE diff --git a/ads/opctl/operator/lowcode/pii/model/utils.py b/ads/opctl/operator/lowcode/pii/model/utils.py new file mode 100644 index 000000000..0e007cca0 --- /dev/null +++ b/ads/opctl/operator/lowcode/pii/model/utils.py @@ -0,0 +1,234 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*-- + +# Copyright (c) 2023 Oracle and/or its affiliates. 
+# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ + + +import os +import yaml +import fsspec +import logging + +from typing import Dict, List +from html2text import html2text +from striprtf.striprtf import rtf_to_text + +YAML_KEYS = [ + "detectors", + "custom_detectors", + "spacy_detectors", + "anonymization", + "name", + "label", + "patterns", + "model", + "named_entities", + "entities", +] + + +class SupportInputFormat: + PLAIN = ".txt" + HTML = ".html" + RTF = ".rtf" + + MAPPING_EXT_TO_FORMAT = {HTML: "html", RTF: "rtf"} + + @classmethod + def get_support_list(cls): + return [cls.PLAIN, cls.HTML, cls.RTF] + + @classmethod + def map_ext_to_format(cls, ext): + return cls.MAPPING_EXT_TO_FORMAT.get(ext) + + +class ReportContextKey: + RUN_SUMMARY = "run_summary" + FILE_SUMMARY = "file_summary" + REPORT_NAME = "report_name" + TOTAL_FILES = "total_files" + ELAPSED_TIME = "elapsed_time" + DATE = "date" + OUTPUT_DIR = "output_dir" + INPUT_DIR = "input_dir" + INPUT = "input" + TOTAL_T = "total_tokens" + INPUT_FILE_NAME = "input_file_name" + OUTPUT_NAME = "output_name" + ENTITIES = "entities" + FILE_NAME = "filename" + INPUT_BASE = "input_base" + + +# def convert_to_html(file_ext, input_path, file_name): +# """Example: +# pandoc -f rtf -t html .rtf -o .html +# """ +# html_path = os.path.join(tempfile.mkdtemp(), file_name + ".html") +# cmd_specify_input_format = ( +# "" +# if file_ext == SupportInputFormat.PLAIN +# else f"-f {SupportInputFormat.map_ext_to_format(file_ext)}" +# ) +# cmd = f"pandoc {cmd_specify_input_format} -t html {input_path} -o {html_path}" +# os.system(cmd) +# assert os.path.exists( +# html_path +# ), f"Failed to convert {input_path} to html. You can run `{cmd}` in terminal to see the error." +# return html_path + + +def load_html(uri: str): + """Convert the given html file to text. + + Args: + uri (str): uri of the html file. + + Returns: + str: plain text of the html file. + """ + fs = open(uri, "rb") + html = fs.read().decode("utf-8", errors="ignore") + return html2text(html) + + +def load_rtf(uri: str, **kwargs): + """Convert the given rtf file to text. + + Args: + uri (str): uri of the rtf file. + + Returns: + str: plain text of the rtf file. + """ + fsspec_kwargs = kwargs.pop("fsspec_kwargs", {}) + content = _read_from_file(uri, **fsspec_kwargs) + return rtf_to_text(content) + + +def get_files(input_dir: str) -> List: + """Returns all files in the given directory.""" + files = [] + for dirpath, dirnames, filenames in os.walk(input_dir): + if dirpath.endswith(".ipynb_checkpoints"): + continue + for f in filenames: + if not f.endswith(".DS_Store"): + files.append(os.path.join(dirpath, f)) + return files + + +def _read_from_file(uri: str, **kwargs) -> str: + """Returns contents from a file specified by URI + + Parameters + ---------- + uri : str + The URI of the file. + + Returns + ------- + str + The content of the file as a string. + """ + with fsspec.open(uri, "r", **kwargs) as f: + return f.read() + + +def from_yaml( + yaml_string: str = None, + uri: str = None, + loader: callable = yaml.SafeLoader, + **kwargs, +) -> Dict: + """Loads yaml from given yaml string or uri of the yaml. 
+ + Raises + ------ + ValueError + Raised if neither string nor uri is provided + """ + if yaml_string: + return yaml.load(yaml_string, Loader=loader) + if uri: + return yaml.load(_read_from_file(uri=uri, **kwargs), Loader=loader) + + raise ValueError("Must provide either YAML string or URI location") + + +def _safe_get_spec(spec_file, key, default): + try: + return spec_file[key] + except KeyError as e: + if not key in YAML_KEYS: + logging.warning(f"key: `{key}` is not supported.") + return default + + +def default_config() -> str: + """Returns the default config file which intended to process UMHC notes. + + Returns: + str: uri of the default config file. + """ + curr_dir = os.path.dirname(os.path.abspath(__file__)) + return os.path.abspath(os.path.join(curr_dir, "config", "umhc2.yaml")) + + +def construct_filth_cls_name(name: str) -> str: + """Constructs the filth class name from the given name. + For example, "name" -> "NameFilth". + + Args: + name (str): filth class name. + + Returns: + str: The filth class name. + """ + return "".join([s.capitalize() for s in name.split("_")]) + "Filth" + + +def _write_to_file(s: str, uri: str, **kwargs) -> None: + """Writes the given string to the given uri. + + Args: + s (str): The string to be written. + uri (str): The uri of the file to be written. + kwargs (dict ): keyword arguments to be passed into open(). + """ + with open(uri, "w", **kwargs) as f: + f.write(s) + + +def _count_tokens(file_summary): + """Counts the total number of tokens in the given file summary. + + Args: + file_summary (dict): file summary. + e.g. { + "root1": [ + {..., "total_t": 10, ...}, + {..., "total_t": 3, ...}, + ], + ... + } + + Returns: + int: total number of tokens. + """ + total_tokens = 0 + for _, files in file_summary.items(): + for file in files: + total_tokens += file.get("total_tokens") + return total_tokens + + +def _process_pos(entities, text) -> List: + """Processes the position of the given entities.""" + for entity in entities: + count_line_delimiter = text[: entity.beg].split("\n") + entity.pos = len(count_line_delimiter) + entity.line_beg = len(count_line_delimiter[-1]) + return entities diff --git a/ads/opctl/operator/lowcode/pii/operator_config.py b/ads/opctl/operator/lowcode/pii/operator_config.py index aa4faa0d7..78de6e08d 100644 --- a/ads/opctl/operator/lowcode/pii/operator_config.py +++ b/ads/opctl/operator/lowcode/pii/operator_config.py @@ -17,20 +17,15 @@ class InputData(DataClassSerializable): """Class representing operator specification input data details.""" format: str = None - columns: List[str] = None url: str = None - options: Dict = None - limit: int = None @dataclass(repr=True) class OutputDirectory(DataClassSerializable): """Class representing operator specification output directory details.""" - format: str = None url: str = None name: str = None - options: Dict = None @dataclass(repr=True) @@ -46,9 +41,10 @@ class Report(DataClassSerializable): class Redactor(DataClassSerializable): """Class representing operator specification redactor directory details.""" - detectors: list = None + detectors: List[str] = None + # TODO: spacy_detectors: Dict = None - anonymization: list = None + anonymization: List[str] = None @dataclass(repr=True) @@ -64,7 +60,15 @@ class PiiOperatorSpec(DataClassSerializable): def __post_init__(self): """Adjusts the specification details.""" - self.report_file_name = self.report_file_name or "report.html" + # self.report_file_name = self.report_file_name or "report.html" + self.target_column = self.target_column 
or "target" + self.report.report_filename = self.report.report_filename or "report.html" + self.report.show_rows = self.report.show_rows or 25 + self.report.show_sensitive_content = self.report.show_sensitive_content or False + self.output_directory.url = self.output_directory.url or "result/" + self.output_directory.name = self.output_directory.name or os.path.basename( + self.input_data.url + ) @dataclass(repr=True) diff --git a/ads/opctl/operator/lowcode/pii/schema.yaml b/ads/opctl/operator/lowcode/pii/schema.yaml index e189aa4a7..1f08a1014 100644 --- a/ads/opctl/operator/lowcode/pii/schema.yaml +++ b/ads/opctl/operator/lowcode/pii/schema.yaml @@ -33,40 +33,16 @@ spec: meta: description: "This should be indexed by target column." schema: - format: - allowed: - - csv - - json - required: false - type: string - columns: - required: false - type: list - schema: - type: string - options: - nullable: true - required: false - type: dict url: required: true type: string default: data.csv meta: description: "The url can be local, or remote. For example: `oci://@/data.csv`" - limit: - required: false - type: integer output_directory: - required: false + required: true schema: - format: - required: false - type: string - allowed: - - csv - - json url: required: true type: string @@ -74,47 +50,47 @@ spec: meta: description: "The url can be local, or remote. For example: `oci://@/`" name: - required: false + required: true type: string - options: - nullable: true - required: false - type: dict + default: data-out.csv type: dict report: - required: false + required: true schema: report_filename: - required: false + required: true type: string default: report.html meta: description: "Placed into output_directory location. Defaults to report.html" show_rows: required: false - type: integer + type: number default: 25 show_sensitive_content: - required: false + required: true default: false - type: bool + type: boolean type: dict target_column: type: string required: true default: target + meta: + description: "Column with user data." redactor: type: dict required: true schema: detectors: - required: false + required: true type: list schema: type: string + default: ["phone", "social_security_number"] meta: description: "default detectors supported by scrubadub" @@ -133,7 +109,7 @@ spec: schema: type: string meta: - description: "Apply spacy model to detect the target entities." + description: "Apply spacy NER model to detect the target entities." 
anonymization:
        type: list

From a6bd2390828327eabaf6d6ea4321beb725ee7183 Mon Sep 17 00:00:00 2001
From: MING KANG
Date: Sun, 12 Nov 2023 22:05:35 -0800
Subject: [PATCH 04/18] wip

---
 ads/opctl/operator/lowcode/pii/MLoperator     |   5 +-
 ads/opctl/operator/lowcode/pii/README.md      | 116 +++++++-
 ads/opctl/operator/lowcode/pii/__main__.py    |  11 +-
 ads/opctl/operator/lowcode/pii/cmd.py         |  10 +-
 .../operator/lowcode/pii/environment.yaml     |   5 +-
 .../operator/lowcode/pii/model/constant.py    |  52 ++++
 .../operator/lowcode/pii/model/factory.py     |  83 ++++++
 .../operator/lowcode/pii/model/guardrails.py  |  84 +++---
 ads/opctl/operator/lowcode/pii/model/pii.py   | 247 +++++++-----------
 .../lowcode/pii/model/processor/__init__.py   |  34 +++
 .../pii/model/processor/email_replacer.py     |  24 ++
 .../pii/model/processor/mbi_replacer.py       |  29 ++
 .../name_replacer.py}                         | 125 +--------
 .../pii/model/processor/number_replacer.py    |  67 +++++
 .../lowcode/pii/model/processor/remover.py    |  22 ++
 .../operator/lowcode/pii/model/report.py      |  79 +++---
 ads/opctl/operator/lowcode/pii/model/utils.py | 141 +---------
 .../operator/lowcode/pii/operator_config.py   |  26 +-
 ads/opctl/operator/lowcode/pii/schema.yaml    |  58 ++--
 .../with_extras/operator/pii/__init__.py      |   4 +
 .../with_extras/operator/pii/mytest.py        |  43 +++
 .../with_extras/operator/pii/test_factory.py  |  47 ++++
 .../operator/pii/test_files/__init__.py       |   4 +
 .../operator/pii/test_files/pii_test.yaml     |  18 ++
 .../operator/pii/test_guardrail.py            |   4 +
 .../with_extras/operator/pii/test_pii.py      |  24 ++
 26 files changed, 776 insertions(+), 586 deletions(-)
 create mode 100644 ads/opctl/operator/lowcode/pii/model/constant.py
 create mode 100644 ads/opctl/operator/lowcode/pii/model/factory.py
 create mode 100644 ads/opctl/operator/lowcode/pii/model/processor/__init__.py
 create mode 100644 ads/opctl/operator/lowcode/pii/model/processor/email_replacer.py
 create mode 100644 ads/opctl/operator/lowcode/pii/model/processor/mbi_replacer.py
 rename ads/opctl/operator/lowcode/pii/model/{processor.py => processor/name_replacer.py} (67%)
 create mode 100644 ads/opctl/operator/lowcode/pii/model/processor/number_replacer.py
 create mode 100644 ads/opctl/operator/lowcode/pii/model/processor/remover.py
 create mode 100644 tests/unitary/with_extras/operator/pii/__init__.py
 create mode 100644 tests/unitary/with_extras/operator/pii/mytest.py
 create mode 100644 tests/unitary/with_extras/operator/pii/test_factory.py
 create mode 100644 tests/unitary/with_extras/operator/pii/test_files/__init__.py
 create mode 100644 tests/unitary/with_extras/operator/pii/test_files/pii_test.yaml
 create mode 100644 tests/unitary/with_extras/operator/pii/test_guardrail.py
 create mode 100644 tests/unitary/with_extras/operator/pii/test_pii.py

diff --git a/ads/opctl/operator/lowcode/pii/MLoperator b/ads/opctl/operator/lowcode/pii/MLoperator
index e4977c778..49dafdb5a 100644
--- a/ads/opctl/operator/lowcode/pii/MLoperator
+++ b/ads/opctl/operator/lowcode/pii/MLoperator
@@ -6,7 +6,10 @@ conda: pii_v1
 gpu: no
 keywords:
   - PII
+  - Spacy
 backends:
   - job
 description: |
-  PII operator.
+  PII operator, which detects and redacts Personally Identifiable Information
+  (PII) in datasets by combining pattern matching and machine learning.
+  Use `ads operator info -t pii` to get more details about the pii operator.
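The description above pairs rule-based detectors with a spaCy NER model. As a rough illustration of that combination, a minimal sketch built directly on the `scrubadub`/`scrubadub_spacy` APIs the operator uses might look like the following (the small `en_core_web_sm` model is assumed here purely to keep the example light; the operator itself defaults to `en_core_web_trf`):

```python
import scrubadub
from scrubadub_spacy.detectors.spacy import SpacyEntityDetector

# The default Scrubber ships with pattern-based detectors (email, phone, url, ...);
# the spaCy detector layers model-based named-entity recognition on top.
scrubber = scrubadub.Scrubber()
scrubber.add_detector(
    SpacyEntityDetector(named_entities=["PERSON"], model="en_core_web_sm")
)

print(scrubber.clean("Reach John Doe at john.doe@example.com."))
# Output is roughly: "Reach {{NAME}} at {{EMAIL}}."
```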
diff --git a/ads/opctl/operator/lowcode/pii/README.md b/ads/opctl/operator/lowcode/pii/README.md
index 3970554d5..12a25dd60 100644
--- a/ads/opctl/operator/lowcode/pii/README.md
+++ b/ads/opctl/operator/lowcode/pii/README.md
@@ -1,6 +1,7 @@
 # PII Operator

+The PII Operator aims to detect and redact Personally Identifiable Information (PII) in datasets. PII data includes information such as names, addresses, and social security numbers, which can be used to identify individuals. This operator combines pattern matching and machine learning solutions to identify PII, and then redacts or anonymizes it to protect the privacy of individuals.

 Below are the steps to configure and run the PII Operator on different resources.

@@ -19,18 +20,23 @@ ads operator init -t pii --overwrite --output ~/pii/
 The most important files expected to be generated are:

 - `pii.yaml`: Contains pii-related configuration.
-- `backend_operator_local_python_config.yaml`: This includes a local backend configuration for running pii in a local environment. The environment should be set up manually before running the operator.
-- `backend_job_python_config.yaml`: Contains Data Science job-related config to run pii in a Data Science job within a conda runtime. The conda should be built and published before running the operator.
+- `backend_operator_local_python_config.yaml`: This includes a local backend configuration for running the pii operator in a local environment. The environment should be set up manually before running the operator.
+- `backend_operator_local_container_config.yaml`: This includes a local backend configuration for running the pii operator within a local container. The container should be built before running the operator. Please refer to the instructions below for details on how to accomplish this.
+- `backend_job_container_config.yaml`: Contains Data Science job-related config to run the pii operator in a Data Science job within a container (BYOC) runtime. The container should be built and published before running the operator. Please refer to the instructions below for details on how to accomplish this.
+- `backend_job_python_config.yaml`: Contains Data Science job-related config to run the pii operator in a Data Science job within a conda runtime. The conda should be built and published before running the operator.

 All generated configurations should be ready to use without the need for any additional adjustments. However, they are provided as starter kit configurations that can be customized as needed.

-## 3. Running PII on the local conda environment
+## 3. Running pii on the local conda environment

 To run the pii operator locally, create and activate a new conda environment (`ads-pii`). Install all the required libraries listed in the `environment.yaml` file.

 ```yaml
 - datapane
 - scrubadub
+- gender_guesser
+- nameparser
+- scrubadub_spacy
 - "git+https://github.com/oracle/accelerated-data-science.git@feature/forecasting#egg=oracle-ads"
 ```

@@ -50,9 +56,105 @@ ads operator run -f ~/pii/pii.yaml -b local

 The operator will run in your local environment without requiring any additional modifications.

-## 4. Running PII in the Data Science job within conda runtime
+## 4. Running pii on the local container

-To execute the forecasting operator within a Data Science job using conda runtime, please follow the steps outlined below:
+To run the pii operator within a local container, follow these steps:
+
+Use the command below to build the pii container. 
+
+```bash
+ads operator build-image -t pii
+```
+
+This will create a new `pii:v1` image, with `/etc/operator` as the designated working directory within the container.
+
+
+Check the `backend_operator_local_container_config.yaml` config file. By default, it should have a `volume` section with the `.oci` configs folder mounted.
+
+```yaml
+volume:
+  - "/Users//.oci:/root/.oci"
+```
+
+Mounting the OCI configs folder is only required if an OCI Object Storage bucket will be used to store the input data or the output result. The input/output folders can also be mounted to the container.
+
+```yaml
+volume:
+  - /Users//.oci:/root/.oci
+  - /Users//pii/data:/etc/operator/data
+  - /Users//pii/result:/etc/operator/result
+```
+
+The full config can look like:
+```yaml
+kind: operator.local
+spec:
+  image: pii:v1
+  volume:
+  - /Users//.oci:/root/.oci
+  - /Users//pii/data:/etc/operator/data
+  - /Users//pii/result:/etc/operator/result
+type: container
+version: v1
+```
+
+Run the pii operator within a container using the command below:
+
+```bash
+ads operator run -f ~/pii/pii.yaml --backend-config ~/pii/backend_operator_local_container_config.yaml
+```
+
+## 5. Running pii in the Data Science job within container runtime
+
+To execute the pii operator within a Data Science job using container runtime, please follow the steps outlined below:
+
+You can use the following command to build the pii container. This step can be skipped if you have already done this for running the operator within a local container.
+
+```bash
+ads operator build-image -t pii
+```
+
+This will create a new `pii:v1` image, with `/etc/operator` as the designated working directory within the container.
+
+Publish the `pii:v1` container to the [Oracle Container Registry](https://docs.public.oneportal.content.oci.oraclecloud.com/en-us/iaas/Content/Registry/home.htm). To become familiar with OCI, read the documentation links posted below.
+
+- [Access Container Registry](https://docs.public.oneportal.content.oci.oraclecloud.com/en-us/iaas/Content/Registry/Concepts/registryoverview.htm#access)
+- [Create repositories](https://docs.public.oneportal.content.oci.oraclecloud.com/en-us/iaas/Content/Registry/Tasks/registrycreatingarepository.htm#top)
+- [Push images](https://docs.public.oneportal.content.oci.oraclecloud.com/en-us/iaas/Content/Registry/Tasks/registrypushingimagesusingthedockercli.htm#Pushing_Images_Using_the_Docker_CLI)
+
+To publish `pii:v1` to OCR, use the command posted below:
+
+```bash
+ads operator publish-image pii:v1 --registry 
+```
+
+After the container is published to OCR, it can be used within the Data Science jobs service. Check the `backend_job_container_config.yaml` config file. It should contain pre-populated infrastructure and runtime sections. The runtime section should contain an image property, something like `image: iad.ocir.io//pii:v1`. More details about supported options can be found in the ADS Jobs documentation - [Run a Container](https://accelerated-data-science.readthedocs.io/en/latest/user_guide/jobs/run_container.html).
+
+Adjust the `pii.yaml` config with proper input/output folders. When the pii operator is run in the Data Science job, it will not have access to local folders. Therefore, input data and output folders should be placed in the Object Storage bucket. 
Open the `pii.yaml` and adjust the following fields: + +```yaml +input_data: + url: oci://bucket@namespace/pii/input_data/data.csv +output_directory: + url: oci://bucket@namespace/pii/result/ +``` + +Run the pii operator on the Data Science jobs using the command posted below: + +```bash +ads operator run -f ~/pii/pii.yaml --backend-config ~/pii/backend_job_container_config.yaml +``` + +The logs can be monitored using the `ads opctl watch` command. + +```bash +ads opctl watch +``` + + +## 6. Running pii in the Data Science job within conda runtime + +To execute the pii operator within a Data Science job using conda runtime, please follow the steps outlined below: You can use the following command to build the pii conda environment. @@ -83,10 +185,10 @@ More details about supported options can be found in the ADS Jobs documentation Adjust the `pii.yaml` config with proper input/output folders. When the pii is run in the Data Science job, it will not have access to local folders. Therefore, input data and output folders should be placed in the Object Storage bucket. Open the `pii.yaml` and adjust the following fields: ```yaml +input_data: + url: oci://bucket@namespace/pii/input_data/data.csv output_directory: url: oci://bucket@namespace/pii/result/ -test_data: - url: oci://bucket@namespace/pii/input_data/test.csv ``` Run the pii on the Data Science jobs using the command posted below: diff --git a/ads/opctl/operator/lowcode/pii/__main__.py b/ads/opctl/operator/lowcode/pii/__main__.py index fae0fda83..a914edc0a 100644 --- a/ads/opctl/operator/lowcode/pii/__main__.py +++ b/ads/opctl/operator/lowcode/pii/__main__.py @@ -15,19 +15,14 @@ from ads.opctl.operator.common.const import ENV_OPERATOR_ARGS from ads.opctl.operator.common.utils import _parse_input_args +from .model.guardrails import PIIGuardrail from .operator_config import PiiOperatorConfig def operate(operator_config: PiiOperatorConfig) -> None: """Runs the PII operator.""" - # import pdb - - # pdb.set_trace() - print("The operator is running...") - # from pii.guardrails import PIIGuardrail - - # guard = PIIGuardrail(config_uri="./responsibleai.yaml") - # guard.evaluate() + guard = PIIGuardrail(config=operator_config) + guard.evaluate() def verify(spec: Dict, **kwargs: Dict) -> bool: diff --git a/ads/opctl/operator/lowcode/pii/cmd.py b/ads/opctl/operator/lowcode/pii/cmd.py index f76b5faaf..1098b390b 100644 --- a/ads/opctl/operator/lowcode/pii/cmd.py +++ b/ads/opctl/operator/lowcode/pii/cmd.py @@ -6,11 +6,9 @@ from typing import Dict -import click - from ads.opctl import logger -from ads.opctl.operator.common.utils import _load_yaml_from_uri from ads.opctl.operator.common.operator_yaml_generator import YamlGenerator +from ads.opctl.operator.common.utils import _load_yaml_from_uri def init(**kwargs: Dict) -> str: @@ -32,6 +30,10 @@ def init(**kwargs: Dict) -> str: """ logger.info("==== PII related options ====") + default_detector = [{"name": "default.phone", "action": "anonymize"}] + return YamlGenerator( schema=_load_yaml_from_uri(__file__.replace("cmd.py", "schema.yaml")) - ).generate_example_dict(values={"type": kwargs.get("type")}) + ).generate_example_dict( + values={"type": kwargs.get("type"), "detectors": default_detector} + ) diff --git a/ads/opctl/operator/lowcode/pii/environment.yaml b/ads/opctl/operator/lowcode/pii/environment.yaml index a4e2d1dc8..b542e1d6d 100644 --- a/ads/opctl/operator/lowcode/pii/environment.yaml +++ b/ads/opctl/operator/lowcode/pii/environment.yaml @@ -1,4 +1,4 @@ -name: PII +name: pii channels: - 
conda-forge dependencies: @@ -7,4 +7,7 @@ dependencies: - pip: - datapane - scrubadub + - gender_guesser + - nameparser + - scrubadub_spacy - "git+https://github.com/oracle/accelerated-data-science.git@feature/ads_pii_operator#egg=oracle-ads" diff --git a/ads/opctl/operator/lowcode/pii/model/constant.py b/ads/opctl/operator/lowcode/pii/model/constant.py new file mode 100644 index 000000000..5569a8021 --- /dev/null +++ b/ads/opctl/operator/lowcode/pii/model/constant.py @@ -0,0 +1,52 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*-- + +# Copyright (c) 2023 Oracle and/or its affiliates. +# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ + + +YAML_KEYS = [ + "detectors", + "custom_detectors", + "spacy_detectors", + "anonymization", + "name", + "label", + "patterns", + "model", + "named_entities", + "entities", +] + +################ +# Report Const # +################ +PII_REPORT_DESCRIPTION = ( + "This report will offer a comprehensive overview of the redaction of personal identifiable information (PII) from the provided data." + "The `Summary` section will provide an executive summary of this process, including key statistics, configuration, and model usage." + "The `Details` section will offer a more granular analysis of each row of data, including relevant statistics." +) +DETAILS_REPORT_DESCRIPTION = "The following report will show the details on each row. You can view the highlighted named entities and their labels in the text under `TEXT` tab." + +FLAT_UI_COLORS = [ + "#1ABC9C", + "#2ECC71", + "#3498DB", + "#9B59B6", + "#34495E", + "#16A085", + "#27AE60", + "#2980B9", + "#8E44AD", + "#2C3E50", + "#F1C40F", + "#E67E22", + "#E74C3C", + "#ECF0F1", + "#95A5A6", + "#F39C12", + "#D35400", + "#C0392B", + "#BDC3C7", + "#7F8C8D", +] diff --git a/ads/opctl/operator/lowcode/pii/model/factory.py b/ads/opctl/operator/lowcode/pii/model/factory.py new file mode 100644 index 000000000..542e8af0b --- /dev/null +++ b/ads/opctl/operator/lowcode/pii/model/factory.py @@ -0,0 +1,83 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*-- + +# Copyright (c) 2023 Oracle and/or its affiliates. +# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ + +import uuid +import scrubadub +from scrubadub_spacy.detectors.spacy import SpacyEntityDetector + +from ads.common.extended_enum import ExtendedEnumMeta +from ads.opctl.operator.lowcode.pii.model.utils import construct_filth_cls_name + + +class SupportedDetector(str, metaclass=ExtendedEnumMeta): + """Supported pii detectors.""" + + Default = "default" + Spacy = "spacy" + + +class UnSupportedDetectorError(Exception): + def __init__(self, dtype: str): + super().__init__( + f"Detector: `{dtype}` " + f"is not supported. 
Supported models: {SupportedDetector.values}" + ) + + +class PiiBaseDetector: + @classmethod + def construct(cls, **kwargs): + raise NotImplementedError + + +class BuiltInDetector(PiiBaseDetector): + @classmethod + def construct(cls, entity, **kwargs): + return entity + + +class SpacyDetector(PiiBaseDetector): + DEFAULT_SPACY_NAMED_ENTITIES = ["DATE", "FAC", "GPE", "LOC", "ORG", "PER", "PERSON"] + DEFAULT_SPACY_MODEL = "en_core_web_trf" + + @classmethod + def construct(cls, entity, model, **kwargs): + spacy_entity_detector = SpacyEntityDetector( + named_entities=[entity], + name=f"spacy_{uuid.uuid4()}", + model=model, + ) + if entity.upper() not in cls.DEFAULT_SPACY_NAMED_ENTITIES: + filth_cls = type( + construct_filth_cls_name(entity), + (scrubadub.filth.Filth,), + {"type": entity.upper()}, + ) + spacy_entity_detector.filth_cls_map[entity.upper()] = filth_cls + return spacy_entity_detector + + +class PiiDetectorFactory: + """ + The factory class helps to instantiate proper detector object based on the detector config. + """ + + _MAP = { + SupportedDetector.Default: BuiltInDetector, + SupportedDetector.Spacy: SpacyDetector, + } + + @classmethod + def get_detector( + cls, + detector_type, + entity, + model=None, + ): + if detector_type not in cls._MAP: + raise UnSupportedDetectorError(detector_type) + + return cls._MAP[detector_type].construct(entity=entity, model=model) diff --git a/ads/opctl/operator/lowcode/pii/model/guardrails.py b/ads/opctl/operator/lowcode/pii/model/guardrails.py index f32f9b463..a138a59e8 100644 --- a/ads/opctl/operator/lowcode/pii/model/guardrails.py +++ b/ads/opctl/operator/lowcode/pii/model/guardrails.py @@ -4,16 +4,15 @@ # Copyright (c) 2023 Oracle and/or its affiliates. # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ - +import os +import time import pandas as pd -from ads.opctl.operator.lowcode.pii.model.utils import from_yaml -from ads.opctl.operator.lowcode.pii.model.pii import config_scrubber, scrub, detect +from ads.opctl import logger +from ads.opctl.operator.lowcode.pii.operator_config import PiiOperatorConfig +from ads.opctl.operator.lowcode.pii.model.pii import Scrubber, scrub, detect from ads.opctl.operator.lowcode.pii.model.report import PIIOperatorReport from ads.common import auth as authutil from datetime import datetime -import os -import time -import datapane as dp def get_output_name(given_name, target_name=None): @@ -26,52 +25,35 @@ def get_output_name(given_name, target_name=None): class PIIGuardrail: - def __init__(self, config_uri: str, auth: dict = None): - # load config.yaml for pii - self.spec = from_yaml(uri=config_uri).get("spec") - self.output_data_name = None - # config metric - for metric in self.spec.get("metrics", []): - # TODO: load other metric - # load pii metric - if metric.get("name", "") == "pii": - pii_load_args = metric.get("load_args") - self.scrubber = config_scrubber(**pii_load_args) - self.target_col = metric.get("target_col", "text") - self.output_data_name = metric.get("output_data_name", None) - - # config spec - self.src_data_uri = self.spec.get("test_data").get("url") - self.dst_uri = None - self.data = None - self.report_uri = None + def __init__(self, config: PiiOperatorConfig, auth: dict = None): + self.spec = config.spec + self.data = None # saving loaded data self.auth = auth or authutil.default_signer() - self.output_directory = self.spec.get("output_directory", {}).get("url", None) - if self.output_directory: - self.dst_uri = os.path.join( - 
self.output_directory, - get_output_name( - target_name=self.output_data_name, given_name=self.src_data_uri - ), - ) - - self.report_spec = self.spec.get("report", {}) - self.report_uri = ( - os.path.join( - self.report_spec.get("url", "./"), - self.report_spec.get("report_file_name", "report.html"), - ) - if self.report_spec - else None + self.scrubber = Scrubber(config=config).config_scrubber() + self.target_col = self.spec.target_column + self.output_data_name = self.spec.output_directory.name + # input attributes + self.src_data_uri = self.spec.input_data.url + + # output attributes + self.output_directory = self.spec.output_directory.url + self.dst_uri = os.path.join( + self.output_directory, + get_output_name( + target_name=self.output_data_name, given_name=self.src_data_uri + ), ) - self.show_rows = self.report_spec.get("show_rows", 25) - self.show_sensitive_content = self.report_spec.get( - "show_sensitive_content", False + + # Report attributes + self.report_uri = os.path.join( + self.spec.output_directory.url, + self.spec.report.report_filename, ) + self.show_rows = self.spec.report.show_rows or 25 + self.show_sensitive_content = self.spec.report.show_sensitive_content or False def load_data(self, uri=None, storage_options={}): - # POC: Only csv support - # csv -> pandas.DataFrame + # TODO: Support more format of input data uri = uri or self.src_data_uri if uri.endswith(".csv"): if uri.startswith("oci://"): @@ -101,8 +83,8 @@ def evaluate(self, data=None, dst_uri=None, report_uri=None, storage_options={}) data["entities_cols"] = data[self.target_col].apply( lambda x: detect(text=x, scrubber=self.scrubber) ) - from pii.utils import _safe_get_spec - from pii.pii import DEFAULT_SPACY_MODEL + from ads.opctl.operator.lowcode.pii.model.utils import _safe_get_spec + from ads.opctl.operator.lowcode.pii.model.pii import DEFAULT_SPACY_MODEL selected_spacy_model = [] for spec in _safe_get_spec( @@ -160,15 +142,13 @@ def evaluate(self, data=None, dst_uri=None, report_uri=None, storage_options={}) if dst_uri: self._save_output(data, ["id", "redacted_text"], dst_uri) - print("Mission completed!") - def _generate_report(self, context, report_uri): report_ = PIIOperatorReport(context=context) report_sections = report_.make_view() report_.save_report(report_sections=report_sections, report_path=report_uri) def _save_output(self, df, target_col, dst_uri): - # Based on extension of dst_uri call to_csv or to_json. + # TODO: Based on extension of dst_uri call to_csv or to_json. 
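+        # One hypothetical extension-based dispatch (sketch only, not implemented):
+        #     if dst_uri.endswith(".json"):
+        #         data_out.to_json(dst_uri, orient="records")
+        #     else:
+        #         data_out.to_csv(dst_uri, index=False)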
data_out = df[target_col] data_out.to_csv(dst_uri) return dst_uri diff --git a/ads/opctl/operator/lowcode/pii/model/pii.py b/ads/opctl/operator/lowcode/pii/model/pii.py index b3b68ef65..3f149b893 100644 --- a/ads/opctl/operator/lowcode/pii/model/pii.py +++ b/ads/opctl/operator/lowcode/pii/model/pii.py @@ -6,185 +6,120 @@ import scrubadub -import scrubadub_spacy -import os -import re -import logging -import uuid - -from ads.opctl.operator.lowcode.pii.model.utils import ( - load_html, - SupportInputFormat, - from_yaml, - _safe_get_spec, - default_config, - _read_from_file, - load_rtf, - construct_filth_cls_name, - _write_to_file, - _process_pos, - ReportContextKey, -) -from ads.opctl.operator.lowcode.pii.model.processor import POSTPROCESSOR_MAP - -DEFAULT_SPACY_NAMED_ENTITIES = ["DATE", "FAC", "GPE", "LOC", "ORG", "PER", "PERSON"] -DEFAULT_SPACY_MODEL = "en_core_web_trf" - -def config_post_processor(spec: dict): - """Return class scrubadub.post_processors.base.PostProcessor.""" - name = _safe_get_spec(spec, "name", "").lower() - if not name in POSTPROCESSOR_MAP.keys(): - raise ValueError( - f"Unsupport post processor: {name}. Only support {POSTPROCESSOR_MAP.keys()}." - ) - cls = POSTPROCESSOR_MAP.get(name) - if name == "number_replacer": - cls._ENTITIES = _safe_get_spec(spec, "entities", cls._ENTITIES) - - return cls - - -def config_spacy_detector(spec: dict): - """Return an instance of scrubadub_spacy.detectors.spacy.SpacyEntityDetector.""" - model = _safe_get_spec(spec, "model", DEFAULT_SPACY_MODEL) - - named_entities = [x.upper() for x in spec.get("named_entities", [])] - spacy_entity_detector = scrubadub_spacy.detectors.spacy.SpacyEntityDetector( - named_entities=named_entities, - name=f"spacy_{uuid.uuid4()}", - model=model, - ) - for named_entity in named_entities: - # DEFAULT_SPACY_NAMED_ENTITIES has been registered in filth_cls_map already. - if named_entity in DEFAULT_SPACY_NAMED_ENTITIES: - continue - - filth_cls = type( - construct_filth_cls_name(named_entity), - (scrubadub.filth.Filth,), - {"type": named_entity.upper()}, - ) - spacy_entity_detector.filth_cls_map[named_entity.upper()] = filth_cls - return spacy_entity_detector - - -def config_scrubber( - config: str or dict = None, -): - """ - Returns an instance of srubadub.Scrubber. - - Args: - config: A path to a yaml file or a dict. +from ads.opctl import logger +from ads.opctl.operator.common.utils import _load_yaml_from_uri +from ads.opctl.operator.lowcode.pii.model.factory import PiiDetectorFactory +from ads.opctl.operator.lowcode.pii.model.processor import ( + POSTPROCESSOR_MAP, + SUPPORTED_REPLACER, + Remover, +) - Returns: - An instance of srubadub.Scrubber, which has been configured with the given config. 
- """ - if not config: - config = default_config() - logging.info(f"Loading config from {config}") +SUPPORT_ACTIONS = ["mask", "remove", "anonymize"] - if isinstance(config, str): - config = from_yaml(uri=config) - redact_spec_file = config["redactor"] +class DetectorType: + DEFAULT = "default" - detector_list = [] - scrubber = scrubadub.Scrubber() - scrubber.redact_spec_file = redact_spec_file - # Clean up default detectors - defautls_enable = scrubber._detectors.copy() - for d in defautls_enable: - scrubber.remove_detector(d) +class Scrubber: + def __init__(self, config: str or "PiiOperatorConfig" or dict): + logger.info(f"Loading config from {config}") + if isinstance(config, str): + config = _load_yaml_from_uri(config) - # Add scrubber built-in detectors - for detector in _safe_get_spec(redact_spec_file, "detectors", []): - detector_list.append(detector) + self.config = config + self.scrubber = scrubadub.Scrubber() - # Add spacy detectors - for spec in _safe_get_spec(redact_spec_file, "spacy_detectors", []): - spacy_entity_detector = config_spacy_detector(spec=spec) - detector_list.append(spacy_entity_detector) + self.detectors = [] + self.spacy_model_detectors = [] + self.post_processors = {} # replacer_name -> replacer_obj - # Add custom detectors - for custom in _safe_get_spec(redact_spec_file, "custom_detectors", []): - patterns = custom.get("patterns", "") + self._reset_scrubber() - class CustomFilth(scrubadub.filth.Filth): - type = custom.get("label", "").upper() + def _reset_scrubber(self): + # Clean up default detectors + defautls_enable = self.scrubber._detectors.copy() + for d in defautls_enable: + self.scrubber.remove_detector(d) - class CustomDetector(scrubadub.detectors.RegexDetector): - filth_cls = CustomFilth - regex = re.compile( - rf"{patterns}", + def _register(self, name, dtype, model, action, mask_with: str = None): + if action not in SUPPORT_ACTIONS: + raise ValueError( + f"Not supported `action`: {action}. Please select from {SUPPORT_ACTIONS}." ) - name = custom.get("name") - detector_list.append(CustomDetector()) + detector = PiiDetectorFactory.get_detector( + detector_type=dtype, entity=name, model=model + ) + self.scrubber.add_detector(detector) - for detector in detector_list: - scrubber.add_detector(detector) + if action == "anonymize": + entity = ( + detector + if isinstance(detector, str) + else detector.filth_cls_map[name.upper()].type + ) + if entity in SUPPORTED_REPLACER.keys(): + replacer_name = SUPPORTED_REPLACER.get(entity).name + replacer = self.post_processors.get( + replacer_name, POSTPROCESSOR_MAP.get(replacer_name)() + ) + if hasattr(replacer, "_ENTITIES"): + replacer._ENTITIES.append(name) + self.post_processors[replacer_name] = replacer + else: + raise ValueError( + f"Not supported `action` {action} for this entity {name}. Please try with other action." 
+                )
+
+        if action == "remove":
+            remover = self.post_processors.get("remover", Remover())
+            remover._ENTITIES.append(name)
+            self.post_processors["remover"] = remover
+
+    def config_scrubber(self):
+        """Returns a configured instance of scrubadub.Scrubber."""
+        spec = (
+            self.config["spec"] if isinstance(self.config, dict) else self.config.spec
+        )
+        detectors = spec["detectors"] if isinstance(spec, dict) else spec.detectors
+
+        self.scrubber.redact_spec_file = spec
+
+        for detector in detectors:
+            # example format for detector["name"]: default.phone or spacy.en_core_web_trf.person
+            d = detector["name"].split(".")
+            dtype = d[0]
+            dname = d[1] if len(d) == 2 else d[2]
+            model = None if len(d) == 2 else d[1]
+
+            action = detector.get("action", "mask")
+            # mask_with = detector.get("mask_with", None)
+            self._register(
+                name=dname,
+                dtype=dtype,
+                model=model,
+                action=action,
+                # mask_with=mask_with,
+            )

-    # Add post-processor
-    for post_processor in _safe_get_spec(redact_spec_file, "anonymization", []):
-        scrubber.add_post_processor(config_post_processor(post_processor))
+        self._register_post_processor()
+        return self.scrubber

-    return scrubber
+    def _register_post_processor(self):
+        for _, v in self.post_processors.items():
+            self.scrubber.add_post_processor(v)


 def scrub(text, spec_file=None, scrubber=None):
     if not scrubber:
-        scrubber = config_scrubber(spec_file)
+        scrubber = Scrubber(config=spec_file).config_scrubber()
     return scrubber.clean(text)


 def detect(text, spec_file=None, scrubber=None):
     if not scrubber:
-        scrubber = config_scrubber(spec_file)
+        scrubber = Scrubber(config=spec_file).config_scrubber()
     return list(scrubber.iter_filth(text, document_name=None))
-
-
-def _get_report_(
-    input_path, output_path, scrubber=None, report_context=None, subdirectory=None
-) -> None:
-    filename_with_ext = os.path.basename(input_path)
-    file_name, file_ext = os.path.splitext(filename_with_ext)
-
-    report_text = ""
-    if file_ext == SupportInputFormat.PLAIN:
-        report_text = _read_from_file(input_path)
-    elif file_ext == SupportInputFormat.HTML:
-        report_text = load_html(uri=input_path)
-    elif file_ext == SupportInputFormat.RTF:
-        report_text = load_rtf(uri=input_path)
-    else:
-        raise ValueError(
-            f"Unsupport file format: {file_ext}. Only support {SupportInputFormat.get_support_list()}."
-        )
-
-    # preprocess src to remove **
-    report_text_ = report_text.replace("**", "")
-
-    scrubbed_text = scrub(text=report_text_, scrubber=scrubber)
-    dst_uri = os.path.join(output_path, file_name + ".txt")
-    _write_to_file(
-        uri=dst_uri,
-        s=scrubbed_text,
-        encoding="utf-8",
-    )
-
-    # Only generate report if report_context is not None
-    if report_context:
-        entities = detect(text=report_text_, scrubber=scrubber)
-        file_summary = {
-            ReportContextKey.INPUT_FILE_NAME: input_path,
-            ReportContextKey.OUTPUT_NAME: dst_uri,
-            ReportContextKey.TOTAL_TOKENS: len(entities),
-            ReportContextKey.ENTITIES: _process_pos(entities, report_text_),
-            ReportContextKey.FILE_NAME: file_name,
-        }
-        report_context.get(ReportContextKey.FILE_SUMMARY).get(subdirectory).append(
-            file_summary
-        )
diff --git a/ads/opctl/operator/lowcode/pii/model/processor/__init__.py b/ads/opctl/operator/lowcode/pii/model/processor/__init__.py
new file mode 100644
index 000000000..062a61aa7
--- /dev/null
+++ b/ads/opctl/operator/lowcode/pii/model/processor/__init__.py
@@ -0,0 +1,34 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*--
+
+# Copyright (c) 2023 Oracle and/or its affiliates. 
+# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ + +from .email_replacer import EmailReplacer +from .mbi_replacer import MBIReplacer +from .name_replacer import NameReplacer +from .number_replacer import NumberReplacer +from .remover import Remover + +POSTPROCESSOR_MAP = { + item.name.lower(): item + for item in [ + NameReplacer, + NumberReplacer, + EmailReplacer, + MBIReplacer, + Remover, + ] +} + +# Currently only support anonymization for the following entity. +SUPPORTED_REPLACER = { + "name": NameReplacer, + "number": NumberReplacer, + "phone": NumberReplacer, + "social_security_number": NumberReplacer, + "fin": NumberReplacer, + "mrn": NumberReplacer, + "email": EmailReplacer, + "mbi": MBIReplacer, +} diff --git a/ads/opctl/operator/lowcode/pii/model/processor/email_replacer.py b/ads/opctl/operator/lowcode/pii/model/processor/email_replacer.py new file mode 100644 index 000000000..ce77dc8ec --- /dev/null +++ b/ads/opctl/operator/lowcode/pii/model/processor/email_replacer.py @@ -0,0 +1,24 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*-- + +# Copyright (c) 2023 Oracle and/or its affiliates. +# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ + +from typing import Sequence + +from faker import Faker +from scrubadub.filth import Filth +from scrubadub.post_processors import PostProcessor + + +class EmailReplacer(PostProcessor): + name = "email_replacer" + + def process_filth(self, filth_list: Sequence[Filth]) -> Sequence[Filth]: + for filth in filth_list: + if filth.replacement_string: + continue + if filth.type.lower() != "email": + continue + filth.replacement_string = Faker().email() + return filth_list diff --git a/ads/opctl/operator/lowcode/pii/model/processor/mbi_replacer.py b/ads/opctl/operator/lowcode/pii/model/processor/mbi_replacer.py new file mode 100644 index 000000000..8aa4f5e66 --- /dev/null +++ b/ads/opctl/operator/lowcode/pii/model/processor/mbi_replacer.py @@ -0,0 +1,29 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*-- + +# Copyright (c) 2023 Oracle and/or its affiliates. 
+# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ + +import random +import string +from typing import Sequence + +from scrubadub.filth import Filth +from scrubadub.post_processors import PostProcessor + + +class MBIReplacer(PostProcessor): + name = "mbi_replacer" + CHAR_POOL = "ACDEFGHJKMNPQRTUVWXY" + + def generate_mbi(self): + return "".join(random.choices(self.CHAR_POOL + string.digits, k=11)) + + def process_filth(self, filth_list: Sequence[Filth]) -> Sequence[Filth]: + for filth in filth_list: + if filth.replacement_string: + continue + if filth.type.lower() != "mbi": + continue + filth.replacement_string = self.generate_mbi() + return filth_list diff --git a/ads/opctl/operator/lowcode/pii/model/processor.py b/ads/opctl/operator/lowcode/pii/model/processor/name_replacer.py similarity index 67% rename from ads/opctl/operator/lowcode/pii/model/processor.py rename to ads/opctl/operator/lowcode/pii/model/processor/name_replacer.py index 1ff204a00..9cb96f0ae 100644 --- a/ads/opctl/operator/lowcode/pii/model/processor.py +++ b/ads/opctl/operator/lowcode/pii/model/processor/name_replacer.py @@ -5,31 +5,16 @@ # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ -"""Contains post processors for scrubadub -Usage: - -scrubber.add_post_processor(NameReplacer()) -scrubber.add_post_processor(NumberReplacer()) - -To keep the same name replacement mappings across multiple documents, -either use the same scrubber instance to clean all the documents, -or use the same NameReplace() instance for all scrubbers. -""" -import datetime -import random -import re -import string from typing import Sequence -import scrubadub import gender_guesser.detector as gender_detector - from faker import Faker -from scrubadub.filth import Filth from nameparser import HumanName +from scrubadub.filth import Filth +from scrubadub.post_processors import PostProcessor -class NameReplacer(scrubadub.post_processors.PostProcessor): +class NameReplacer(PostProcessor): name = "name_replacer" def __init__(self, name: str = None, mapping: dict = None): @@ -220,107 +205,3 @@ def process_filth(self, filth_list: Sequence[Filth]) -> Sequence[Filth]: for filth in name_filths: filth.replacement_string = self.replace(filth.text) return filth_list - - -class NumberReplacer(scrubadub.post_processors.PostProcessor): - name = "number_replacer" - _ENTITIES = [ - "number", - "mrn", - "fin", - "phone", - "social_security_number", - ] - - @staticmethod - def replace_digit(obj): - return random.choice("0123456789") - - def match_entity_type(self, filth_types): - if list(set(self._ENTITIES) & set(filth_types)): - return True - return False - - def replace_date(self, text): - date_formats = ["%m-%d-%Y", "%m-%d-%y", "%d-%m-%Y", "%d-%m-%y"] - for date_format in date_formats: - try: - date = datetime.datetime.strptime(text, date_format) - except ValueError: - continue - if date.year < 1900 or date.year > datetime.datetime.now().year: - continue - # Now the date is a valid data between 1900 and now - return text - return None - - def replace(self, text): - # Check dates - date = self.replace_date(text) - if date: - return date - return re.sub(r"\d", self.replace_digit, text) - - def process_filth(self, filth_list: Sequence[Filth]) -> Sequence[Filth]: - for filth in filth_list: - # Do not process it if it already has a replacement. 
- if filth.replacement_string: - continue - if filth.type.lower() in self._ENTITIES: - filth.replacement_string = self.replace(filth.text) - # Replace the numbers for merged filth - if filth.type.lower() == "unknown" and hasattr(filth, "filths"): - filth_types = set([f.type for f in filth.filths]) - if self.match_entity_type(filth_types): - filth.replacement_string = self.replace(filth.text) - return filth_list - - -class EmailReplacer(scrubadub.post_processors.PostProcessor): - name = "email_replacer" - - def process_filth(self, filth_list: Sequence[Filth]) -> Sequence[Filth]: - for filth in filth_list: - if filth.replacement_string: - continue - if filth.type.lower() != "email": - continue - filth.replacement_string = Faker().email() - return filth_list - - -class HIBNReplacer(scrubadub.post_processors.PostProcessor): - name = "hibn_replacer" - - def process_filth(self, filth_list: Sequence[Filth]) -> Sequence[Filth]: - # TODO: Add support for anomymizing Health insurance beneficiary number ~ Consecutive sequence of alphanumeric characters - pass - - -class MBIReplacer(scrubadub.post_processors.PostProcessor): - name = "mbi_replacer" - CHAR_POOL = "ACDEFGHJKMNPQRTUVWXY" - - def generate_mbi(self): - return "".join(random.choices(self.CHAR_POOL + string.digits, k=11)) - - def process_filth(self, filth_list: Sequence[Filth]) -> Sequence[Filth]: - for filth in filth_list: - if filth.replacement_string: - continue - if filth.type.lower() != "mbi": - continue - filth.replacement_string = self.generate_mbi() - return filth_list - - -POSTPROCESSOR_MAP = { - item.name.lower(): item - for item in [ - NameReplacer, - NumberReplacer, - EmailReplacer, - HIBNReplacer, - MBIReplacer, - ] -} diff --git a/ads/opctl/operator/lowcode/pii/model/processor/number_replacer.py b/ads/opctl/operator/lowcode/pii/model/processor/number_replacer.py new file mode 100644 index 000000000..7e79a2f3b --- /dev/null +++ b/ads/opctl/operator/lowcode/pii/model/processor/number_replacer.py @@ -0,0 +1,67 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*-- + +# Copyright (c) 2023 Oracle and/or its affiliates. +# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ + +import datetime +import random +import re +from typing import Sequence + +from scrubadub.filth import Filth +from scrubadub.post_processors import PostProcessor + + +class NumberReplacer(PostProcessor): + name = "number_replacer" + _ENTITIES = [ + "number", + "mrn", + "fin", + "phone", + "social_security_number", + ] + + @staticmethod + def replace_digit(obj): + return random.choice("0123456789") + + def match_entity_type(self, filth_types): + if list(set(self._ENTITIES) & set(filth_types)): + return True + return False + + def replace_date(self, text): + date_formats = ["%m-%d-%Y", "%m-%d-%y", "%d-%m-%Y", "%d-%m-%y"] + for date_format in date_formats: + try: + date = datetime.datetime.strptime(text, date_format) + except ValueError: + continue + if date.year < 1900 or date.year > datetime.datetime.now().year: + continue + # Now the date is a valid data between 1900 and now + return text + return None + + def replace(self, text): + # Check dates + date = self.replace_date(text) + if date: + return date + return re.sub(r"\d", self.replace_digit, text) + + def process_filth(self, filth_list: Sequence[Filth]) -> Sequence[Filth]: + for filth in filth_list: + # Do not process it if it already has a replacement. 
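+            # (A replacement may already have been set by an earlier post-processor
+            # in the chain, e.g. NameReplacer; the first replacement wins.)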
+ if filth.replacement_string: + continue + if filth.type.lower() in self._ENTITIES: + filth.replacement_string = self.replace(filth.text) + # Replace the numbers for merged filth + if filth.type.lower() == "unknown" and hasattr(filth, "filths"): + filth_types = set([f.type for f in filth.filths]) + if self.match_entity_type(filth_types): + filth.replacement_string = self.replace(filth.text) + return filth_list diff --git a/ads/opctl/operator/lowcode/pii/model/processor/remover.py b/ads/opctl/operator/lowcode/pii/model/processor/remover.py new file mode 100644 index 000000000..53d90dba3 --- /dev/null +++ b/ads/opctl/operator/lowcode/pii/model/processor/remover.py @@ -0,0 +1,22 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*-- + +# Copyright (c) 2023 Oracle and/or its affiliates. +# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ + +from typing import Sequence + +from scrubadub.filth import Filth +from scrubadub.post_processors import PostProcessor + + +class Remover(PostProcessor): + name = "remover" + _ENTITIES = [] + + def process_filth(self, filth_list: Sequence[Filth]) -> Sequence[Filth]: + for filth in filth_list: + if filth.type.lower() in self._ENTITIES: + filth.replacement_string = "" + + return filth_list diff --git a/ads/opctl/operator/lowcode/pii/model/report.py b/ads/opctl/operator/lowcode/pii/model/report.py index 2584544ce..b89b5be98 100644 --- a/ads/opctl/operator/lowcode/pii/model/report.py +++ b/ads/opctl/operator/lowcode/pii/model/report.py @@ -5,15 +5,14 @@ # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ -# helper function to make report -import yaml -import plotly.express as px -import pandas as pd -import datapane as dp import random -import plotly.graph_objects as go -import fsspec +import datapane as dp +import fsspec +import pandas as pd +import plotly.express as px +import plotly.graph_objects as go +import yaml PII_REPORT_DESCRIPTION = ( "This report will offer a comprehensive overview of the redaction of personal identifiable information (PII) from the provided data." @@ -22,9 +21,33 @@ ) DETAILS_REPORT_DESCRIPTION = "The following report will show the details on each row. You can view the highlighted named entities and their labels in the text under `TEXT` tab." +FLAT_UI_COLORS = [ + "#1ABC9C", + "#2ECC71", + "#3498DB", + "#9B59B6", + "#34495E", + "#16A085", + "#27AE60", + "#2980B9", + "#8E44AD", + "#2C3E50", + "#F1C40F", + "#E67E22", + "#E74C3C", + "#ECF0F1", + "#95A5A6", + "#F39C12", + "#D35400", + "#C0392B", + "#BDC3C7", + "#7F8C8D", +] +LABEL_TO_COLOR_MAP = {} + ################ -# Others utils # +# Report utils # ################ def compute_rate(elapsed_time, num_unit): return elapsed_time / num_unit @@ -50,35 +73,12 @@ def human_time_friendly(seconds): return ", ".join(accumulator) -FLAT_UI_COLORS = [ - "#1ABC9C", - "#2ECC71", - "#3498DB", - "#9B59B6", - "#34495E", - "#16A085", - "#27AE60", - "#2980B9", - "#8E44AD", - "#2C3E50", - "#F1C40F", - "#E67E22", - "#E74C3C", - "#ECF0F1", - "#95A5A6", - "#F39C12", - "#D35400", - "#C0392B", - "#BDC3C7", - "#7F8C8D", -] -LABEL_TO_COLOR_MAP = {} - - -# all spacy model: https://huggingface.co/spacy -# "en_core_web_trf": "https://huggingface.co/spacy/en_core_web_trf/raw/main/README.md", def make_model_card(model_name="", readme_path=""): - """Make render model_readme.md as model card.""" + """Make render model_readme.md as model_card tab. 
+ All spacy model: https://huggingface.co/spacy + For example: "en_core_web_trf": "https://huggingface.co/spacy/en_core_web_trf/raw/main/README.md", + + """ readme_path = ( f"https://huggingface.co/spacy/{model_name}/raw/main/README.md" if model_name @@ -124,9 +124,6 @@ def make_model_card(model_name="", readme_path=""): ) -################ -# Report utils # -################ def map_label_to_color(labels): label_to_colors = {} for label in labels: @@ -161,7 +158,6 @@ def plot_pie(count_map) -> dp.Plot: def build_entity_df(entites, id) -> pd.DataFrame: text = [ent.text for ent in entites] types = [ent.type for ent in entites] - # pos = [f"{ent.beg}" + ":" + f"{ent.end}" for ent in entites] replaced_values = [ ent.replacement_string or "{{" + ent.placeholder + "}}" for ent in entites ] @@ -170,7 +166,6 @@ def build_entity_df(entites, id) -> pd.DataFrame: "Entity (Original Text)": text, "Type": types, "Redacted To": replaced_values, - # "Beg: End": pos, } df = pd.DataFrame(data=d) if df.size == 0: @@ -180,14 +175,12 @@ def build_entity_df(entites, id) -> pd.DataFrame: "Entity (Original Text)": "-", "Type": "-", "Redacted To": "-", - # "Begs: End": "-", } df = df.append(df2, ignore_index=True) return df class RowReportFields: - # TODO: rename class def __init__(self, context, show_sensitive_info: bool = True): self.total_tokens = context.get("total_tokens", "unknown") self.entites_cnt_map = context.get("statics", {}) diff --git a/ads/opctl/operator/lowcode/pii/model/utils.py b/ads/opctl/operator/lowcode/pii/model/utils.py index 0e007cca0..94481560d 100644 --- a/ads/opctl/operator/lowcode/pii/model/utils.py +++ b/ads/opctl/operator/lowcode/pii/model/utils.py @@ -5,43 +5,10 @@ # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ -import os -import yaml -import fsspec import logging - from typing import Dict, List -from html2text import html2text -from striprtf.striprtf import rtf_to_text - -YAML_KEYS = [ - "detectors", - "custom_detectors", - "spacy_detectors", - "anonymization", - "name", - "label", - "patterns", - "model", - "named_entities", - "entities", -] - - -class SupportInputFormat: - PLAIN = ".txt" - HTML = ".html" - RTF = ".rtf" - MAPPING_EXT_TO_FORMAT = {HTML: "html", RTF: "rtf"} - - @classmethod - def get_support_list(cls): - return [cls.PLAIN, cls.HTML, cls.RTF] - - @classmethod - def map_ext_to_format(cls, ext): - return cls.MAPPING_EXT_TO_FORMAT.get(ext) +from .constant import YAML_KEYS class ReportContextKey: @@ -62,102 +29,6 @@ class ReportContextKey: INPUT_BASE = "input_base" -# def convert_to_html(file_ext, input_path, file_name): -# """Example: -# pandoc -f rtf -t html .rtf -o .html -# """ -# html_path = os.path.join(tempfile.mkdtemp(), file_name + ".html") -# cmd_specify_input_format = ( -# "" -# if file_ext == SupportInputFormat.PLAIN -# else f"-f {SupportInputFormat.map_ext_to_format(file_ext)}" -# ) -# cmd = f"pandoc {cmd_specify_input_format} -t html {input_path} -o {html_path}" -# os.system(cmd) -# assert os.path.exists( -# html_path -# ), f"Failed to convert {input_path} to html. You can run `{cmd}` in terminal to see the error." -# return html_path - - -def load_html(uri: str): - """Convert the given html file to text. - - Args: - uri (str): uri of the html file. - - Returns: - str: plain text of the html file. - """ - fs = open(uri, "rb") - html = fs.read().decode("utf-8", errors="ignore") - return html2text(html) - - -def load_rtf(uri: str, **kwargs): - """Convert the given rtf file to text. 
- - Args: - uri (str): uri of the rtf file. - - Returns: - str: plain text of the rtf file. - """ - fsspec_kwargs = kwargs.pop("fsspec_kwargs", {}) - content = _read_from_file(uri, **fsspec_kwargs) - return rtf_to_text(content) - - -def get_files(input_dir: str) -> List: - """Returns all files in the given directory.""" - files = [] - for dirpath, dirnames, filenames in os.walk(input_dir): - if dirpath.endswith(".ipynb_checkpoints"): - continue - for f in filenames: - if not f.endswith(".DS_Store"): - files.append(os.path.join(dirpath, f)) - return files - - -def _read_from_file(uri: str, **kwargs) -> str: - """Returns contents from a file specified by URI - - Parameters - ---------- - uri : str - The URI of the file. - - Returns - ------- - str - The content of the file as a string. - """ - with fsspec.open(uri, "r", **kwargs) as f: - return f.read() - - -def from_yaml( - yaml_string: str = None, - uri: str = None, - loader: callable = yaml.SafeLoader, - **kwargs, -) -> Dict: - """Loads yaml from given yaml string or uri of the yaml. - - Raises - ------ - ValueError - Raised if neither string nor uri is provided - """ - if yaml_string: - return yaml.load(yaml_string, Loader=loader) - if uri: - return yaml.load(_read_from_file(uri=uri, **kwargs), Loader=loader) - - raise ValueError("Must provide either YAML string or URI location") - - def _safe_get_spec(spec_file, key, default): try: return spec_file[key] @@ -167,16 +38,6 @@ def _safe_get_spec(spec_file, key, default): return default -def default_config() -> str: - """Returns the default config file which intended to process UMHC notes. - - Returns: - str: uri of the default config file. - """ - curr_dir = os.path.dirname(os.path.abspath(__file__)) - return os.path.abspath(os.path.join(curr_dir, "config", "umhc2.yaml")) - - def construct_filth_cls_name(name: str) -> str: """Constructs the filth class name from the given name. For example, "name" -> "NameFilth". 
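For reference, the naming helper kept above turns a snake_case entity name into the CamelCase `Filth` class name that gets registered with scrubadub. A quick sketch of its behavior:

```python
from ads.opctl.operator.lowcode.pii.model.utils import construct_filth_cls_name

# Snake-case entity names become CamelCase Filth class names.
assert construct_filth_cls_name("name") == "NameFilth"
assert construct_filth_cls_name("social_security_number") == "SocialSecurityNumberFilth"
```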
diff --git a/ads/opctl/operator/lowcode/pii/operator_config.py b/ads/opctl/operator/lowcode/pii/operator_config.py index 78de6e08d..47e65b84a 100644 --- a/ads/opctl/operator/lowcode/pii/operator_config.py +++ b/ads/opctl/operator/lowcode/pii/operator_config.py @@ -7,9 +7,10 @@ import os from dataclasses import dataclass, field from typing import Dict, List + from ads.common.serializer import DataClassSerializable -from ads.opctl.operator.common.utils import _load_yaml_from_uri from ads.opctl.operator.common.operator_config import OperatorConfig +from ads.opctl.operator.common.utils import _load_yaml_from_uri @dataclass(repr=True) @@ -38,37 +39,28 @@ class Report(DataClassSerializable): @dataclass(repr=True) -class Redactor(DataClassSerializable): +class Detector(DataClassSerializable): """Class representing operator specification redactor directory details.""" - detectors: List[str] = None - # TODO: - spacy_detectors: Dict = None - anonymization: List[str] = None + name: str = None + action: str = None @dataclass(repr=True) class PiiOperatorSpec(DataClassSerializable): """Class representing pii operator specification.""" - name: str = None input_data: InputData = field(default_factory=InputData) output_directory: OutputDirectory = field(default_factory=OutputDirectory) report: Report = field(default_factory=Report) target_column: str = None - redactor: Redactor = field(default_factory=Redactor) + # TODO: adjust from_dict to accept List[Detector] + detectors: List[Dict] = field(default_factory=list) def __post_init__(self): """Adjusts the specification details.""" - # self.report_file_name = self.report_file_name or "report.html" + self.target_column = self.target_column or "target" - self.report.report_filename = self.report.report_filename or "report.html" - self.report.show_rows = self.report.show_rows or 25 - self.report.show_sensitive_content = self.report.show_sensitive_content or False - self.output_directory.url = self.output_directory.url or "result/" - self.output_directory.name = self.output_directory.name or os.path.basename( - self.input_data.url - ) @dataclass(repr=True) @@ -80,7 +72,7 @@ class PiiOperatorConfig(OperatorConfig): kind: str The kind of the resource. For operators it is always - `operator`. type: str - The type of the operator. For pii operator it is always - `forecast` + The type of the operator. For pii operator it is always - `pii` version: str The version of the operator. spec: PiiOperatorSpec diff --git a/ads/opctl/operator/lowcode/pii/schema.yaml b/ads/opctl/operator/lowcode/pii/schema.yaml index 1f08a1014..9599ed78c 100644 --- a/ads/opctl/operator/lowcode/pii/schema.yaml +++ b/ads/opctl/operator/lowcode/pii/schema.yaml @@ -63,15 +63,19 @@ spec: type: string default: report.html meta: - description: "Placed into output_directory location. Defaults to report.html" + description: "Placed into `output_directory` location. Defaults to `report.html`" show_rows: required: false type: number - default: 25 + default: 10 + meta: + description: "The number of rows that shows in the report. Defaults to `10`" show_sensitive_content: required: true default: false type: boolean + meta: + description: "Whether to show sensitive content in the report. Defaults to `False`" type: dict target_column: @@ -81,41 +85,25 @@ spec: meta: description: "Column with user data." 
- redactor: - type: dict + detectors: + type: list required: true schema: - detectors: - required: true - type: list - schema: + type: dict + schema: + name: + required: true type: string - default: ["phone", "social_security_number"] - meta: - description: "default detectors supported by scrubadub" - - spacy_detectors: - type: dict - required: false - schema: - model: - type: string - required: true - default: en_core_web_trf - named_entities: - type: list - required: true - default: ["PERSON"] - schema: - type: string - meta: - description: "Apply spacy NER model to detect the target entities." - - anonymization: - type: list - required: false - schema: + meta: + description: "The name of the detector. THe format is `.`." + action: + required: true type: string - meta: - description: "Anonylze the selected entities." + default: mask + allowed: + - anonymize + - mask + - remove + meta: + description: "The way to process the detected entity. Default to `mask`." type: dict diff --git a/tests/unitary/with_extras/operator/pii/__init__.py b/tests/unitary/with_extras/operator/pii/__init__.py new file mode 100644 index 000000000..fe904ad27 --- /dev/null +++ b/tests/unitary/with_extras/operator/pii/__init__.py @@ -0,0 +1,4 @@ +#!/usr/bin/env python + +# Copyright (c) 2023 Oracle and/or its affiliates. +# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ diff --git a/tests/unitary/with_extras/operator/pii/mytest.py b/tests/unitary/with_extras/operator/pii/mytest.py new file mode 100644 index 000000000..e20418620 --- /dev/null +++ b/tests/unitary/with_extras/operator/pii/mytest.py @@ -0,0 +1,43 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*-- + +# Copyright (c) 2023 Oracle and/or its affiliates. +# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ + +from ads.opctl.operator.lowcode.pii.model.pii import Scrubber + +from ads.opctl.operator.common.utils import _load_yaml_from_uri + + +test_yaml_uri = "/Users/mingkang/workspace/github/accelerated-data-science/tests/unitary/with_extras/operator/pii/test_files/pii_test.yaml" + +# config = _load_yaml_from_uri(uri=test_yaml_uri) + +# print(config) + +import scrubadub +from ads.opctl.operator.lowcode.pii.model.processor import Remover +from ads.opctl.operator.lowcode.pii.model.factory import PiiDetectorFactory +from ads.opctl.operator.lowcode.pii.model.pii import Scrubber + +text = """ +This is John Doe. My number is (213)275-8452. +""" + +scrubber = Scrubber(config=test_yaml_uri) + + +# scrubber = scrubadub.Scrubber() +# print(scrubber._post_processors) +print(scrubber.scrubber._detectors) +# scrubber.add_detector("phone") +# # remover = Remover() +# # remover._ENTITIES.append("phone") +# # scrubber.add_post_processor(remover) +# scrubber._detectors["phone"].filth_cls.replacement_string = "***" +# print(scrubber._detectors["phone"].filth_cls.replacement_string) +# out = scrubber.clean(text) +print(scrubber.scrubber._post_processors) +out = scrubber.scrubber.clean(text) + +print(out) diff --git a/tests/unitary/with_extras/operator/pii/test_factory.py b/tests/unitary/with_extras/operator/pii/test_factory.py new file mode 100644 index 000000000..e72321c95 --- /dev/null +++ b/tests/unitary/with_extras/operator/pii/test_factory.py @@ -0,0 +1,47 @@ +#!/usr/bin/env python + +# Copyright (c) 2023 Oracle and/or its affiliates. 
+# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ +import unittest + +from parameterized import parameterized +from scrubadub_spacy.detectors.spacy import SpacyEntityDetector + +from ads.opctl.operator.lowcode.pii.model.factory import ( + PiiDetectorFactory, + UnSupportedDetectorError, +) + + +class TestPiiDetectorFactory(unittest.TestCase): + def test_get_default_detector(self): + detector_type = "default" + entity = "phone" + model = None + expected_result = "phone" + detector = PiiDetectorFactory.get_detector( + detector_type=detector_type, entity=entity, model=model + ) + assert detector == expected_result + + @parameterized.expand( + [ + ("spacy", "person", "en_core_web_trf"), + ("spacy", "other", "en_core_web_trf"), + ] + ) + def test_get_spacy_detector(self, detector_type, entity, model): + detector = PiiDetectorFactory.get_detector( + detector_type=detector_type, entity=entity, model=model + ) + assert isinstance(detector, SpacyEntityDetector) + assert entity.upper() in detector.filth_cls_map + + def test_get_detector_fail(self): + detector_type = "unknow" + entity = "myentity" + model = None + with self.assertRaises(UnSupportedDetectorError): + PiiDetectorFactory.get_detector( + detector_type=detector_type, entity=entity, model=model + ) diff --git a/tests/unitary/with_extras/operator/pii/test_files/__init__.py b/tests/unitary/with_extras/operator/pii/test_files/__init__.py new file mode 100644 index 000000000..fe904ad27 --- /dev/null +++ b/tests/unitary/with_extras/operator/pii/test_files/__init__.py @@ -0,0 +1,4 @@ +#!/usr/bin/env python + +# Copyright (c) 2023 Oracle and/or its affiliates. +# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ diff --git a/tests/unitary/with_extras/operator/pii/test_files/pii_test.yaml b/tests/unitary/with_extras/operator/pii/test_files/pii_test.yaml new file mode 100644 index 000000000..aa224d6d6 --- /dev/null +++ b/tests/unitary/with_extras/operator/pii/test_files/pii_test.yaml @@ -0,0 +1,18 @@ +kind: operator +spec: + detectors: + - action: anonymize + name: default.phone + - action: anonymize + name: spacy.en_core_web_trf.person + input_data: + url: data.csv + output_directory: + name: data-out.csv + url: result/ + report: + report_filename: report.html + show_sensitive_content: false + target_column: target +type: pii +version: v1 diff --git a/tests/unitary/with_extras/operator/pii/test_guardrail.py b/tests/unitary/with_extras/operator/pii/test_guardrail.py new file mode 100644 index 000000000..fe904ad27 --- /dev/null +++ b/tests/unitary/with_extras/operator/pii/test_guardrail.py @@ -0,0 +1,4 @@ +#!/usr/bin/env python + +# Copyright (c) 2023 Oracle and/or its affiliates. +# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ diff --git a/tests/unitary/with_extras/operator/pii/test_pii.py b/tests/unitary/with_extras/operator/pii/test_pii.py new file mode 100644 index 000000000..6e13ece62 --- /dev/null +++ b/tests/unitary/with_extras/operator/pii/test_pii.py @@ -0,0 +1,24 @@ +#!/usr/bin/env python + +# Copyright (c) 2023 Oracle and/or its affiliates. 
+# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ +import unittest +from unittest.mock import MagicMock, patch + +from ads.opctl.operator.lowcode.pii.model.pii import Scrubber +from ads.opctl.operator.lowcode.pii.operator_config import PiiOperatorConfig + + +class TestScrubber(unittest.TestCase): + test_yaml_uri = "/Users/mingkang/workspace/github/accelerated-data-science/tests/unitary/with_extras/operator/pii/test_files/pii_test.yaml" + operator_config = PiiOperatorConfig.from_yaml(uri=test_yaml_uri) + config_dict = {} + + def test_init_with_yaml_file(self): + scrubber = Scrubber(config=self.test_yaml_uri) + + def test_init_with_piiOperatorConfig(self): + scrubber = Scrubber(config=self.operator_config) + + def test_init_with_config_dict(self): + scrubber = Scrubber(config=self.config_dict) From fdfda6064023b8ace1e5ef231a73dc9c223bc48b Mon Sep 17 00:00:00 2001 From: MING KANG Date: Sun, 12 Nov 2023 22:06:37 -0800 Subject: [PATCH 05/18] fixed: change type name used in YAML to list --- ads/opctl/operator/common/operator_yaml_generator.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/ads/opctl/operator/common/operator_yaml_generator.py b/ads/opctl/operator/common/operator_yaml_generator.py index 3e8301693..1bbc1ae03 100644 --- a/ads/opctl/operator/common/operator_yaml_generator.py +++ b/ads/opctl/operator/common/operator_yaml_generator.py @@ -103,6 +103,7 @@ def _generate_example( The result config. """ example = {} + for key, value in schema.items(): # only generate values for required fields if ( @@ -125,7 +126,8 @@ def _generate_example( example[key] = 1 elif data_type == "boolean": example[key] = True - elif data_type == "array": + elif data_type == "list": + # TODO: Handle list of dict example[key] = ["item1", "item2"] elif data_type == "dict": example[key] = self._generate_example( From 2fa52c16f02d0495ed50399e9bcd8abf3683c525 Mon Sep 17 00:00:00 2001 From: MING KANG Date: Mon, 13 Nov 2023 00:52:41 -0800 Subject: [PATCH 06/18] added documentation --- docs/source/index.rst | 1 + .../common/yaml_schema/piiOperator.yaml | 109 ++++++++++++++++++ .../operators/pii_operator/examples.rst | 51 ++++++++ .../pii_operator/getting_started.rst | 63 ++++++++++ .../operators/pii_operator/index.rst | 37 ++++++ .../operators/pii_operator/install.rst | 13 +++ .../user_guide/operators/pii_operator/pii.rst | 48 ++++++++ .../operators/pii_operator/yaml_schema.rst | 9 ++ 8 files changed, 331 insertions(+) create mode 100644 docs/source/user_guide/operators/common/yaml_schema/piiOperator.yaml create mode 100644 docs/source/user_guide/operators/pii_operator/examples.rst create mode 100644 docs/source/user_guide/operators/pii_operator/getting_started.rst create mode 100644 docs/source/user_guide/operators/pii_operator/index.rst create mode 100644 docs/source/user_guide/operators/pii_operator/install.rst create mode 100644 docs/source/user_guide/operators/pii_operator/pii.rst create mode 100644 docs/source/user_guide/operators/pii_operator/yaml_schema.rst diff --git a/docs/source/index.rst b/docs/source/index.rst index 0aee74570..ca4e6b4d2 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -43,6 +43,7 @@ Oracle Accelerated Data Science (ADS) user_guide/operators/index user_guide/operators/common/index user_guide/operators/forecasting_operator/index + user_guide/operators/pii_operator/index .. 
toctree::
   :hidden:
diff --git a/docs/source/user_guide/operators/common/yaml_schema/piiOperator.yaml b/docs/source/user_guide/operators/common/yaml_schema/piiOperator.yaml
new file mode 100644
index 000000000..9599ed78c
--- /dev/null
+++ b/docs/source/user_guide/operators/common/yaml_schema/piiOperator.yaml
@@ -0,0 +1,109 @@
+kind:
+  allowed:
+    - operator
+  required: true
+  type: string
+  default: operator
+  meta:
+    description: "Which service are you trying to use? Common kinds: `operator`, `job`"
+
+version:
+  allowed:
+    - "v1"
+  required: true
+  type: string
+  default: v1
+  meta:
+    description: "Operators may change yaml file schemas from version to version, as well as implementation details. Double check the version to ensure compatibility."
+
+type:
+  required: true
+  type: string
+  default: pii
+  meta:
+    description: "Type should always be `pii` when using a pii operator."
+
+
+spec:
+  required: true
+  schema:
+    input_data:
+      required: true
+      type: dict
+      meta:
+        description: "Details of the input data. The column named in `target_column` is the one that gets processed."
+      schema:
+        url:
+          required: true
+          type: string
+          default: data.csv
+          meta:
+            description: "The url can be local, or remote. For example: `oci://<bucket_name>@<namespace>/data.csv`"
+
+    output_directory:
+      required: true
+      schema:
+        url:
+          required: true
+          type: string
+          default: result/
+          meta:
+            description: "The url can be local, or remote. For example: `oci://<bucket_name>@<namespace>/`"
+        name:
+          required: true
+          type: string
+          default: data-out.csv
+      type: dict
+
+    report:
+      required: true
+      schema:
+        report_filename:
+          required: true
+          type: string
+          default: report.html
+          meta:
+            description: "The report file name, placed into the `output_directory` location. Defaults to `report.html`."
+        show_rows:
+          required: false
+          type: number
+          default: 10
+          meta:
+            description: "The number of rows shown in the report. Defaults to `10`."
+        show_sensitive_content:
+          required: true
+          default: false
+          type: boolean
+          meta:
+            description: "Whether to show sensitive content in the report. Defaults to `False`."
+      type: dict
+
+    target_column:
+      type: string
+      required: true
+      default: target
+      meta:
+        description: "The column with user data."
+
+    detectors:
+      type: list
+      required: true
+      schema:
+        type: dict
+        schema:
+          name:
+            required: true
+            type: string
+            meta:
+              description: "The name of the detector, in the format `<detector_type>.<entity>`, for example `default.phone` or `spacy.en_core_web_trf.person`."
+          action:
+            required: true
+            type: string
+            default: mask
+            allowed:
+              - anonymize
+              - mask
+              - remove
+            meta:
+              description: "The way to process the detected entity. Defaults to `mask`."
+  type: dict
diff --git a/docs/source/user_guide/operators/pii_operator/examples.rst b/docs/source/user_guide/operators/pii_operator/examples.rst
new file mode 100644
index 000000000..0a300abb6
--- /dev/null
+++ b/docs/source/user_guide/operators/pii_operator/examples.rst
@@ -0,0 +1,51 @@
+========
+Examples
+========
+
+**Simple Example**
+
+The simplest yaml file is generated by running ``ads operator init --type pii`` and looks like the following:
+
+.. code-block:: yaml
+
+    kind: operator
+    type: pii
+    version: v1
+    spec:
+      input_data:
+        url: mydata.csv
+      target_column: target
+      detectors:
+        - name: default.phone
+          action: mask
+
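+The same specification can also be exercised from Python. Below is a minimal
+sketch based on the operator classes introduced in this patch series; the
+``pii.yaml`` path is illustrative:
+
+.. code-block:: python
+
+    from ads.opctl.operator.lowcode.pii.model.guardrails import PIIGuardrail
+    from ads.opctl.operator.lowcode.pii.operator_config import PiiOperatorConfig
+
+    # Load the operator spec and run detection/redaction end to end.
+    config = PiiOperatorConfig.from_yaml(uri="pii.yaml")
+    PIIGuardrail(config=config).process()
+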
+**Complex Example**
+
+The yaml can also be maximally stated as follows:
+
+.. code-block:: yaml
+
+    kind: operator
+    type: pii
+    version: v1
+    spec:
+      output_directory:
+        url: oci://my-bucket@my-tenancy/results
+        name: mydata-out.csv
+      report:
+        report_filename: report.html
+        show_rows: 10
+        show_sensitive_content: true
+      input_data:
+        url: oci://my-bucket@my-tenancy/mydata.csv
+      target_column: target
+      detectors:
+        - name: default.phone
+          action: mask
+        - name: default.social_security_number
+          action: remove
+        - name: spacy.en_core_web_trf.person
+          action: anonymize
diff --git a/docs/source/user_guide/operators/pii_operator/getting_started.rst b/docs/source/user_guide/operators/pii_operator/getting_started.rst
new file mode 100644
index 000000000..120a673b3
--- /dev/null
+++ b/docs/source/user_guide/operators/pii_operator/getting_started.rst
@@ -0,0 +1,63 @@
+===============
+Getting Started
+===============
+
+Configure
+---------
+
+After having set up ``ads opctl`` on your desired machine using ``ads opctl configure``, you are ready to begin using the PII operator. At a bare minimum, you will need to provide the following details about your task:
+
+- Path to the input data (``input_data``)
+- Name of the column with user data (``target_column``)
+- Name of the detectors to be used by the operator (``detectors``)
+
+
+These details exactly match the initial pii.yaml file generated by running ``ads operator init --type pii``:
+
+.. code-block:: yaml
+
+    kind: operator
+    type: pii
+    version: v1
+    spec:
+      input_data:
+        url: mydata.csv
+      target_column: target
+      detectors:
+        - name: default.phone
+          action: anonymize
+
+
+Optionally, you are able to specify much more. The most common additions are:
+
+- Whether to show sensitive content in the report (``show_sensitive_content``)
+- Path to the output directory, where the operator will place the processed data and the report.html produced from the run (``output_directory``)
+- The way to process each detected entity (``action``)
+
+An extensive list of parameters can be found in the ``YAML Schema`` section.
+
+
+Run
+---
+
+After you have your pii.yaml written, you simply run the operator using:
+
+.. code-block:: bash
+
+    ads operator run -f pii.yaml
+
+
+Interpret Results
+-----------------
+
+The PII operator produces two output files: the processed dataset (by default named after the input file with an ``_out`` suffix, for example ``mydata_out.csv``, unless overridden via ``output_directory.name``) and ``report.html``.
+
+We will go through each of these output files in turn.
+
+**mydata_out.csv**
+
+The name of this file can be customized through the ``output_directory`` parameters in the configuration yaml. This file contains the processed dataset.
+
+**report.html**
+
+The report.html file is customized based on the ``report`` parameters in the configuration yaml. It contains a summary of statistics, a plot of the entity distribution, details of the resolved entities, and details about any model used. By default, sensitive information is not shown in the report, but for debugging purposes you can reveal it with ``show_sensitive_content``. The report also includes a copy of the YAML file, providing a fully detailed version of the original specification.
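+
+A quick way to inspect the redacted output is to read it back with pandas. This
+is a sketch; the ``redacted_text`` column name comes from this operator's
+implementation, and the file name assumes the default ``_out`` suffix:
+
+.. code-block:: python
+
+    import pandas as pd
+
+    # The operator writes the redacted text alongside the remaining columns.
+    df = pd.read_csv("mydata_out.csv")
+    print(df["redacted_text"].head())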
diff --git a/docs/source/user_guide/operators/pii_operator/index.rst b/docs/source/user_guide/operators/pii_operator/index.rst
new file mode 100644
index 000000000..cdf5d962b
--- /dev/null
+++ b/docs/source/user_guide/operators/pii_operator/index.rst
@@ -0,0 +1,37 @@
+============
+PII Operator
+============
+
+The PII operator detects and redacts Personally Identifiable Information (PII) in datasets by combining pattern matching with machine-learning solutions.
+
+Overview
+--------
+
+**Introduction to PII**
+
+Personally Identifiable Information (PII) refers to any information that can identify an individual, encompassing financial, medical, educational, and employment records. Failure to protect PII can lead to identity theft, financial loss, and reputational damage for individuals and businesses alike, highlighting the importance of taking appropriate measures to safeguard sensitive information. The Operators framework is OCI's most extensible, low-code, managed ecosystem for detecting and redacting PII in datasets.
+
+This technical documentation introduces using ``ads opctl`` for PII detection and redaction tasks. This module is engineered with the principles of low-code development in mind, making it accessible to users with varying degrees of technical expertise. It operates on managed infrastructure, ensuring reliability and scalability, while its configurability through YAML allows users to customize redaction to their specific needs.
+
+**Automated Detection and Classification**
+
+By leveraging pattern matching and AI-powered solutions, the ADS PII Operator efficiently identifies sensitive data in free-form text.
+
+**Intelligent Co-reference Resolution**
+
+A standout feature of the ADS PII Operator is its ability to maintain co-reference entity relationships even after anonymization; this not only anonymizes the data but also preserves its statistical properties.
+
+**PII Operator Documentation**
+
+This documentation explores the key concepts and capabilities of the PII operator, providing examples and practical guidance on how to use its various functions and modules. By the end of this guide, users will have a solid understanding of the PII operator and its capabilities, as well as the knowledge and tools needed to make informed decisions when designing solutions tailored to their specific requirements.
+
+.. versionadded:: 2.9.0
+
+.. toctree::
+  :maxdepth: 1
+
+  ./install
+  ./getting_started
+  ./pii
+  ./examples
+  ./yaml_schema
diff --git a/docs/source/user_guide/operators/pii_operator/install.rst b/docs/source/user_guide/operators/pii_operator/install.rst
new file mode 100644
index 000000000..ae581315b
--- /dev/null
+++ b/docs/source/user_guide/operators/pii_operator/install.rst
@@ -0,0 +1,13 @@
+===========================
+Installing the PII Operator
+===========================
+
+The PII Operator can be installed from PyPI.
+
+
+.. code-block:: bash
+
+    python3 -m pip install oracle_ads[pii]
+
+
+After that, the Operator is ready to go!
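+
+If you want to confirm the extra dependencies landed, a small smoke test is to
+import the libraries the operator relies on (names taken from this operator's
+``environment.yaml``; treat this as an illustrative check):
+
+.. code-block:: python
+
+    # Each of these should import without error after the install above.
+    import scrubadub         # pattern-based PII detection engine
+    import scrubadub_spacy   # spaCy-backed named-entity detectors
+    import datapane          # used to render report.html
+
+    print("PII operator dependencies are available")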
diff --git a/docs/source/user_guide/operators/pii_operator/pii.rst b/docs/source/user_guide/operators/pii_operator/pii.rst
new file mode 100644
index 000000000..af9a50c77
--- /dev/null
+++ b/docs/source/user_guide/operators/pii_operator/pii.rst
@@ -0,0 +1,48 @@
+=============
+Configure PII
+=============
+
+Let's explore each line of the pii.yaml so we can better understand options for extending and customizing the operator to our use case.
+
+Here is an example pii.yaml with every parameter specified:
+
+.. code-block:: yaml
+
+    kind: operator
+    type: pii
+    version: v1
+    spec:
+      output_directory:
+        url: oci://my-bucket@my-tenancy/results
+        name: mydata-out.csv
+      report:
+        report_filename: report.html
+        show_rows: 10
+        show_sensitive_content: true
+      input_data:
+        url: oci://my-bucket@my-tenancy/mydata.csv
+      target_column: target
+      detectors:
+        - name: default.phone
+          action: anonymize
+
+
+* **Kind**: The yaml file always starts with ``kind: operator``. There are many other kinds of yaml files that can be run by ``ads opctl``, so we need to specify this is an operator.
+* **Type**: The type of operator is ``pii``.
+* **Version**: The only available version is ``v1``.
+* **Spec**: Spec contains the bulk of the information for the specific problem.
+    * **input_data**: This dictionary contains the details for how to read the input data.
+        * **url**: Insert the uri for the dataset if it's on object storage, using the URI pattern ``oci://<bucket_name>@<namespace>/path/to/data.csv``.
+    * **target_column**: This string specifies the name of the column within the input data that holds the user data.
+    * **detectors**: This list contains the details for each detector and the action that will be taken.
+        * **name**: The string specifies the name of the detector, in the format ``<detector_type>.<entity>``, for example ``default.phone`` or ``spacy.en_core_web_trf.person``.
+        * **action**: The string specifies the way to process the detected entity. Defaults to ``mask``.
+
+    * **output_directory**: (optional) This dictionary contains the details for where to put the output artifacts. The directory need not exist, but it must be accessible by the Operator during runtime.
+        * **url**: Insert the uri for the output location if it's on object storage, using the URI pattern ``oci://<bucket_name>@<namespace>/subfolder/``.
+        * **name**: The string specifies the name of the processed data file.
+
+    * **report**: (optional) This dictionary specifies details for the generated report.
+        * **report_filename**: Placed into the ``output_directory`` location. Defaults to ``report.html``.
+        * **show_sensitive_content**: Whether to show sensitive content in the report. Defaults to ``false``.
+        * **show_rows**: The number of rows shown in the report.
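+
+To get a feel for what a configured detector does at runtime, the scrubber can
+be driven directly from Python. This is a sketch mirroring this operator's unit
+tests; the exact replacement text depends on the chosen ``action``:
+
+.. code-block:: python
+
+    from ads.opctl.operator.lowcode.pii.model.pii import PiiScrubber
+
+    # Build a scrubadub scrubber from the YAML spec above and clean one string.
+    scrubber = PiiScrubber(config="pii.yaml").config_scrubber()
+    print(scrubber.clean("This is John Doe. My number is (800) 223-1711."))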
diff --git a/docs/source/user_guide/operators/pii_operator/yaml_schema.rst b/docs/source/user_guide/operators/pii_operator/yaml_schema.rst
new file mode 100644
index 000000000..10cdb58ce
--- /dev/null
+++ b/docs/source/user_guide/operators/pii_operator/yaml_schema.rst
@@ -0,0 +1,9 @@
+===========
+YAML Schema
+===========
+
+The following is the YAML schema for validating the YAML using `Cerberus `_:
+
+.. literalinclude:: ../common/yaml_schema/piiOperator.yaml
+   :language: yaml
+   :linenos:

From 7461c945c3d0d3daf88d9b42442a81d3319d7897 Mon Sep 17 00:00:00 2001
From: MING KANG
Date: Mon, 13 Nov 2023 03:19:44 -0800
Subject: [PATCH 07/18] wip

---
 .../operator/lowcode/pii/model/guardrails.py  | 111 +++++++++---------
 ads/opctl/operator/lowcode/pii/model/pii.py   |  44 ++++---
 ads/opctl/operator/lowcode/pii/model/utils.py |  63 ++++++++++
 .../with_extras/operator/pii/mytest.py        |  43 -------
 .../with_extras/operator/pii/test_factory.py  |  13 +-
 .../operator/pii/test_files/pii_test.yaml     |   4 +-
 .../with_extras/operator/pii/test_pii.py      |  53 +++++++--
 7 files changed, 191 insertions(+), 140 deletions(-)
 delete mode 100644 tests/unitary/with_extras/operator/pii/mytest.py

diff --git a/ads/opctl/operator/lowcode/pii/model/guardrails.py b/ads/opctl/operator/lowcode/pii/model/guardrails.py
index a138a59e8..455edad47 100644
--- a/ads/opctl/operator/lowcode/pii/model/guardrails.py
+++ b/ads/opctl/operator/lowcode/pii/model/guardrails.py
@@ -6,81 +6,85 @@

 import os
 import time
-import pandas as pd

 from ads.opctl import logger
 from ads.opctl.operator.lowcode.pii.operator_config import PiiOperatorConfig
 from ads.opctl.operator.lowcode.pii.model.pii import Scrubber, scrub, detect
 from ads.opctl.operator.lowcode.pii.model.report import PIIOperatorReport
-from ads.common import auth as authutil
-from datetime import datetime
-
-
-def get_output_name(given_name, target_name=None):
-    """Add ``-out`` suffix to the src filename."""
-    if not target_name:
-        basename = os.path.basename(given_name)
-        fn, ext = os.path.splitext(basename)
-        target_name = fn + "_out" + ext
-    return target_name
+
+from datetime import datetime
+from ads.opctl.operator.lowcode.pii.model.utils import (
+    _load_data,
+    _write_data,
+    get_output_name,
+)
+from ads.opctl.operator.lowcode.pii.model.utils import default_signer
+from ads.common.object_storage_details import ObjectStorageDetails


 class PIIGuardrail:
     def __init__(self, config: PiiOperatorConfig, auth: dict = None):
         self.spec = config.spec
-        self.data = None  # saving loaded data
-        self.auth = auth or authutil.default_signer()
         self.scrubber = Scrubber(config=config).config_scrubber()
-        self.target_col = self.spec.target_column
-        self.output_data_name = self.spec.output_directory.name
-        # input attributes
-        self.src_data_uri = self.spec.input_data.url
-        # output attributes
-        self.output_directory = self.spec.output_directory.url
         self.dst_uri = os.path.join(
-            self.output_directory,
+            self.spec.output_directory.url,
             get_output_name(
-                target_name=self.output_data_name, given_name=self.src_data_uri
+                target_name=self.spec.output_directory.name,
+                given_name=self.spec.input_data.url,
             ),
         )
-        # Report attributes
         self.report_uri = os.path.join(
             self.spec.output_directory.url,
             self.spec.report.report_filename,
         )
-        self.show_rows = self.spec.report.show_rows or 25
-        self.show_sensitive_content = self.spec.report.show_sensitive_content or False
-
-    def load_data(self, uri=None, storage_options={}):
-        # TODO: Support more format of input data
-        uri = uri or self.src_data_uri
-        if uri.endswith(".csv"):
-            if uri.startswith("oci://"):
-                storage_options = storage_options or self.auth
-                self.data = pd.read_csv(uri, storage_options=storage_options)
-            else:
-                self.data = pd.read_csv(uri)
-        return self.data
-
-    def evaluate(self, data=None, dst_uri=None, report_uri=None, storage_options={}):
+
+        try:
+            self.load_data()
+        except Exception as e:
+            logger.warning(f"Failed to load data from 
`{self.spec.input_data.url}`.") + logger.debug(f"Full traceback: {e}") + + def load_data(self, uri=None, storage_options=None): + """Loads input data.""" + input_data_uri = uri or self.spec.input_data.url + logger.info(f"Loading input data from `{input_data_uri}` ...") + + self.datasets = _load_data( + filename=input_data_uri, + storage_options=storage_options or default_signer(), + ) + + def process(self, **kwargs): + """Process input data.""" run_at = datetime.now() dt_string = run_at.strftime("%d/%m/%Y %H:%M:%S") start_time = time.time() - data = data or self.data - if data is None: - data = self.load_data(storage_options) - report_uri = report_uri or self.report_uri - dst_uri = dst_uri or self.dst_uri + data = kwargs.pop("input_data", None) or self.datasets + report_uri = kwargs.pop("report_uri", None) or self.report_uri + dst_uri = kwargs.pop("dst_uri", None) or self.dst_uri - data["redacted_text"] = data[self.target_col].apply( + # process user data + data["redacted_text"] = data[self.spec.target_column].apply( lambda x: scrub(x, scrubber=self.scrubber) ) elapsed_time = time.time() - start_time - # generate pii report + + if dst_uri: + logger.info(f"Saving data into `{dst_uri}` ...") + + _write_data( + data=data.loc[:, data.columns != self.spec.target_column], + filename=dst_uri, + storage_options=default_signer() + if ObjectStorageDetails.is_oci_path(dst_uri) + else {}, + ) + + # prepare pii report if report_uri: - data["entities_cols"] = data[self.target_col].apply( + data["entities_cols"] = data[self.spec.target_column].apply( lambda x: detect(text=x, scrubber=self.scrubber) ) from ads.opctl.operator.lowcode.pii.model.utils import _safe_get_spec @@ -110,7 +114,7 @@ def evaluate(self, data=None, dst_uri=None, report_uri=None, storage_options={}) context = { "run_summary": { "total_tokens": 0, - "src_uri": self.src_data_uri, + "src_uri": self.spec.input_data.url, "total_rows": len(data.index), "config": self.spec, "selected_detectors": list(self.scrubber._detectors.values()), @@ -118,13 +122,13 @@ def evaluate(self, data=None, dst_uri=None, report_uri=None, storage_options={}) "selected_spacy_model": selected_spacy_model, "timestamp": dt_string, "elapsed_time": elapsed_time, - "show_rows": self.show_rows, - "show_sensitive_info": self.show_sensitive_content, + "show_rows": self.spec.report.show_rows, + "show_sensitive_info": self.spec.report.show_sensitive_content, }, "run_details": {"rows": []}, } for ind in data.index: - text = data[self.target_col][ind] + text = data[self.spec.target_column][ind] ent_col = data["entities_cols"][ind] idx = data["id"][ind] page = { @@ -139,20 +143,11 @@ def evaluate(self, data=None, dst_uri=None, report_uri=None, storage_options={}) context = self._process_context(context) self._generate_report(context, report_uri) - if dst_uri: - self._save_output(data, ["id", "redacted_text"], dst_uri) - def _generate_report(self, context, report_uri): report_ = PIIOperatorReport(context=context) report_sections = report_.make_view() report_.save_report(report_sections=report_sections, report_path=report_uri) - def _save_output(self, df, target_col, dst_uri): - # TODO: Based on extension of dst_uri call to_csv or to_json. 
- data_out = df[target_col] - data_out.to_csv(dst_uri) - return dst_uri - def _process_context(self, context): """Count different type of filth.""" statics = {} # statics : count Filth type in total diff --git a/ads/opctl/operator/lowcode/pii/model/pii.py b/ads/opctl/operator/lowcode/pii/model/pii.py index 3f149b893..3762a9e22 100644 --- a/ads/opctl/operator/lowcode/pii/model/pii.py +++ b/ads/opctl/operator/lowcode/pii/model/pii.py @@ -7,6 +7,7 @@ import scrubadub +from ads.common.extended_enum import ExtendedEnumMeta from ads.opctl import logger from ads.opctl.operator.common.utils import _load_yaml_from_uri from ads.opctl.operator.lowcode.pii.model.factory import PiiDetectorFactory @@ -16,20 +17,31 @@ Remover, ) -SUPPORT_ACTIONS = ["mask", "remove", "anonymize"] +class SupportedAction(str, metaclass=ExtendedEnumMeta): + """Supported action to process detected entities.""" -class DetectorType: - DEFAULT = "default" + MASK = "mask" + REMOVE = "remove" + ANONYMIZE = "anonymize" -class Scrubber: - def __init__(self, config: str or "PiiOperatorConfig" or dict): +class PiiScrubber: + def __init__(self, config): logger.info(f"Loading config from {config}") if isinstance(config, str): config = _load_yaml_from_uri(config) self.config = config + self.spec = ( + self.config["spec"] if isinstance(self.config, dict) else self.config.spec + ) + self.detector_spec = ( + self.spec["detectors"] + if isinstance(self.spec, dict) + else self.spec.detectors + ) + self.scrubber = scrubadub.Scrubber() self.detectors = [] @@ -45,9 +57,9 @@ def _reset_scrubber(self): self.scrubber.remove_detector(d) def _register(self, name, dtype, model, action, mask_with: str = None): - if action not in SUPPORT_ACTIONS: + if action not in SupportedAction.values(): raise ValueError( - f"Not supported `action`: {action}. Please select from {SUPPORT_ACTIONS}." + f"Not supported `action`: {action}. Please select from {SupportedAction.values()}." ) detector = PiiDetectorFactory.get_detector( @@ -71,7 +83,7 @@ def _register(self, name, dtype, model, action, mask_with: str = None): self.post_processors[replacer_name] = replacer else: raise ValueError( - f"Not supported `action` {action} for this entity {name}. Please try with other action." + f"Not supported `action` {action} for this entity `{name}`. Please try with other action." 
) if action == "remove": @@ -81,14 +93,10 @@ def _register(self, name, dtype, model, action, mask_with: str = None): def config_scrubber(self): """Returns an instance of srubadub.Scrubber.""" - spec = ( - self.config["spec"] if isinstance(self.config, dict) else self.config.spec - ) - detectors = spec["detectors"] if isinstance(spec, dict) else spec.detector - self.scrubber.redact_spec_file = spec + self.scrubber.redact_spec_file = self.spec - for detector in detectors: + for detector in self.detector_spec: # example format for detector["name"]: default.phone or spacy.en_core_web_trf.person d = detector["name"].split(".") dtype = d[0] @@ -113,13 +121,13 @@ def _register_post_processor(self): self.scrubber.add_post_processor(v) -def scrub(text, spec_file=None, scrubber=None): +def scrub(text, config=None, scrubber=None): if not scrubber: - scrubber = Scrubber(config=spec_file).config_scrubber() + scrubber = PiiScrubber(config=config).config_scrubber() return scrubber.clean(text) -def detect(text, spec_file=None, scrubber=None): +def detect(text, config=None, scrubber=None): if not scrubber: - scrubber = Scrubber(config=spec_file).config_scrubber() + scrubber = PiiScrubber(config=config).config_scrubber() return list(scrubber.iter_filth(text, document_name=None)) diff --git a/ads/opctl/operator/lowcode/pii/model/utils.py b/ads/opctl/operator/lowcode/pii/model/utils.py index 94481560d..9f7525e50 100644 --- a/ads/opctl/operator/lowcode/pii/model/utils.py +++ b/ads/opctl/operator/lowcode/pii/model/utils.py @@ -6,9 +6,72 @@ import logging +import os +import pandas as pd from typing import Dict, List from .constant import YAML_KEYS +from ads.common.object_storage_details import ObjectStorageDetails +import fsspec +from ..errors import PIIInputDataError + + +def default_signer(**kwargs): + os.environ["EXTRA_USER_AGENT_INFO"] = "Pii-Operator" + from ads.common.auth import default_signer + + return default_signer(**kwargs) + + +def _call_pandas_fsspec(pd_fn, filename, storage_options, **kwargs): + if fsspec.utils.get_protocol(filename) == "file": + return pd_fn(filename, **kwargs) + + storage_options = storage_options or ( + default_signer() if ObjectStorageDetails.is_oci_path(filename) else {} + ) + + return pd_fn(filename, storage_options=storage_options, **kwargs) + + +def _load_data(filename, format, storage_options=None, columns=None, **kwargs): + if not format: + _, format = os.path.splitext(filename) + format = format[1:] + if format in ["json", "csv"]: + read_fn = getattr(pd, f"read_{format}") + data = _call_pandas_fsspec(read_fn, filename, storage_options=storage_options) + elif format in ["tsv"]: + data = _call_pandas_fsspec( + pd.read_csv, filename, storage_options=storage_options, sep="\t" + ) + else: + raise PIIInputDataError(f"Unrecognized format: {format}") + if columns: + # keep only these columns, done after load because only CSV supports stream filtering + data = data[columns] + return data + + +def _write_data(data, filename, format, storage_options, index=False, **kwargs): + if not format: + _, format = os.path.splitext(filename) + format = format[1:] + if format in ["json", "csv"]: + write_fn = getattr(data, f"to_{format}") + return _call_pandas_fsspec( + write_fn, filename, index=index, storage_options=storage_options + ) + raise PIIInputDataError(f"Unrecognized format: {format}") + + +def get_output_name(given_name, target_name=None): + """Add ``-out`` suffix to the src filename.""" + if not target_name: + basename = os.path.basename(given_name) + fn, ext = 
os.path.splitext(basename) + target_name = fn + "_out" + ext + return target_name class ReportContextKey: diff --git a/tests/unitary/with_extras/operator/pii/mytest.py b/tests/unitary/with_extras/operator/pii/mytest.py deleted file mode 100644 index e20418620..000000000 --- a/tests/unitary/with_extras/operator/pii/mytest.py +++ /dev/null @@ -1,43 +0,0 @@ -#!/usr/bin/env python -# -*- coding: utf-8 -*-- - -# Copyright (c) 2023 Oracle and/or its affiliates. -# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ - -from ads.opctl.operator.lowcode.pii.model.pii import Scrubber - -from ads.opctl.operator.common.utils import _load_yaml_from_uri - - -test_yaml_uri = "/Users/mingkang/workspace/github/accelerated-data-science/tests/unitary/with_extras/operator/pii/test_files/pii_test.yaml" - -# config = _load_yaml_from_uri(uri=test_yaml_uri) - -# print(config) - -import scrubadub -from ads.opctl.operator.lowcode.pii.model.processor import Remover -from ads.opctl.operator.lowcode.pii.model.factory import PiiDetectorFactory -from ads.opctl.operator.lowcode.pii.model.pii import Scrubber - -text = """ -This is John Doe. My number is (213)275-8452. -""" - -scrubber = Scrubber(config=test_yaml_uri) - - -# scrubber = scrubadub.Scrubber() -# print(scrubber._post_processors) -print(scrubber.scrubber._detectors) -# scrubber.add_detector("phone") -# # remover = Remover() -# # remover._ENTITIES.append("phone") -# # scrubber.add_post_processor(remover) -# scrubber._detectors["phone"].filth_cls.replacement_string = "***" -# print(scrubber._detectors["phone"].filth_cls.replacement_string) -# out = scrubber.clean(text) -print(scrubber.scrubber._post_processors) -out = scrubber.scrubber.clean(text) - -print(out) diff --git a/tests/unitary/with_extras/operator/pii/test_factory.py b/tests/unitary/with_extras/operator/pii/test_factory.py index e72321c95..431034bda 100644 --- a/tests/unitary/with_extras/operator/pii/test_factory.py +++ b/tests/unitary/with_extras/operator/pii/test_factory.py @@ -2,9 +2,7 @@ # Copyright (c) 2023 Oracle and/or its affiliates. 
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ -import unittest - -from parameterized import parameterized +import pytest from scrubadub_spacy.detectors.spacy import SpacyEntityDetector from ads.opctl.operator.lowcode.pii.model.factory import ( @@ -13,7 +11,7 @@ ) -class TestPiiDetectorFactory(unittest.TestCase): +class TestPiiDetectorFactory: def test_get_default_detector(self): detector_type = "default" entity = "phone" @@ -24,11 +22,12 @@ def test_get_default_detector(self): ) assert detector == expected_result - @parameterized.expand( + @pytest.mark.parametrize( + "detector_type, entity, model", [ ("spacy", "person", "en_core_web_trf"), ("spacy", "other", "en_core_web_trf"), - ] + ], ) def test_get_spacy_detector(self, detector_type, entity, model): detector = PiiDetectorFactory.get_detector( @@ -41,7 +40,7 @@ def test_get_detector_fail(self): detector_type = "unknow" entity = "myentity" model = None - with self.assertRaises(UnSupportedDetectorError): + with pytest.raises(UnSupportedDetectorError): PiiDetectorFactory.get_detector( detector_type=detector_type, entity=entity, model=model ) diff --git a/tests/unitary/with_extras/operator/pii/test_files/pii_test.yaml b/tests/unitary/with_extras/operator/pii/test_files/pii_test.yaml index aa224d6d6..8ab3d656a 100644 --- a/tests/unitary/with_extras/operator/pii/test_files/pii_test.yaml +++ b/tests/unitary/with_extras/operator/pii/test_files/pii_test.yaml @@ -3,8 +3,8 @@ spec: detectors: - action: anonymize name: default.phone - - action: anonymize - name: spacy.en_core_web_trf.person + - action: mask + name: default.text_blob_name input_data: url: data.csv output_directory: diff --git a/tests/unitary/with_extras/operator/pii/test_pii.py b/tests/unitary/with_extras/operator/pii/test_pii.py index 6e13ece62..df2929a06 100644 --- a/tests/unitary/with_extras/operator/pii/test_pii.py +++ b/tests/unitary/with_extras/operator/pii/test_pii.py @@ -2,23 +2,52 @@ # Copyright (c) 2023 Oracle and/or its affiliates. # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ -import unittest -from unittest.mock import MagicMock, patch +import os -from ads.opctl.operator.lowcode.pii.model.pii import Scrubber +import pytest + +from ads.opctl.operator.common.utils import _load_yaml_from_uri +from ads.opctl.operator.lowcode.pii.model.pii import PiiScrubber from ads.opctl.operator.lowcode.pii.operator_config import PiiOperatorConfig -class TestScrubber(unittest.TestCase): - test_yaml_uri = "/Users/mingkang/workspace/github/accelerated-data-science/tests/unitary/with_extras/operator/pii/test_files/pii_test.yaml" +class TestPiiScrubber: + test_yaml_uri = os.path.join( + os.path.dirname(os.path.abspath(__file__)), "test_files", "pii_test.yaml" + ) operator_config = PiiOperatorConfig.from_yaml(uri=test_yaml_uri) - config_dict = {} + config_dict = _load_yaml_from_uri(uri=test_yaml_uri) + + name_entity = "John Doe" + phone_entity = "(800) 223-1711" + text = f""" + This is {name_entity}. My number is {phone_entity}. 
+ """ + + @pytest.mark.parametrize( + "config", + [ + test_yaml_uri, + operator_config, + config_dict, + ], + ) + def test_init(self, config): + pii_scrubber = PiiScrubber(config=config) + + assert isinstance(pii_scrubber.detector_spec, list) + assert len(pii_scrubber.detector_spec) == 2 + assert pii_scrubber.detector_spec[0]["name"] == "default.phone" + + assert len(pii_scrubber.scrubber._detectors) == 0 + + def test_config_scrubber(self): + scrubber = PiiScrubber(config=self.test_yaml_uri).config_scrubber() - def test_init_with_yaml_file(self): - scrubber = Scrubber(config=self.test_yaml_uri) + assert len(scrubber._detectors) == 2 + assert len(scrubber._post_processors) == 1 - def test_init_with_piiOperatorConfig(self): - scrubber = Scrubber(config=self.operator_config) + processed_text = scrubber.clean(self.text) - def test_init_with_config_dict(self): - scrubber = Scrubber(config=self.config_dict) + assert self.name_entity not in processed_text + assert self.phone_entity not in processed_text From 0fb1a4fe92e00dc847d69419d82d4c7f26d4e13f Mon Sep 17 00:00:00 2001 From: MING KANG Date: Tue, 14 Nov 2023 02:07:38 -0800 Subject: [PATCH 08/18] added tests --- .../operator/pii/test_files/pii_test.yaml | 10 +- .../operator/pii/test_files/test_data.csv | 3 + .../operator/pii/test_guardrail.py | 116 ++++++++++++++++++ .../pii/{test_pii.py => test_pii_scrubber.py} | 0 4 files changed, 122 insertions(+), 7 deletions(-) create mode 100644 tests/unitary/with_extras/operator/pii/test_files/test_data.csv rename tests/unitary/with_extras/operator/pii/{test_pii.py => test_pii_scrubber.py} (100%) diff --git a/tests/unitary/with_extras/operator/pii/test_files/pii_test.yaml b/tests/unitary/with_extras/operator/pii/test_files/pii_test.yaml index 8ab3d656a..b9ef962b4 100644 --- a/tests/unitary/with_extras/operator/pii/test_files/pii_test.yaml +++ b/tests/unitary/with_extras/operator/pii/test_files/pii_test.yaml @@ -6,13 +6,9 @@ spec: - action: mask name: default.text_blob_name input_data: - url: data.csv + url: ./test_data.csv output_directory: - name: data-out.csv - url: result/ - report: - report_filename: report.html - show_sensitive_content: false - target_column: target + url: ./test_result/ + target_column: text type: pii version: v1 diff --git a/tests/unitary/with_extras/operator/pii/test_files/test_data.csv b/tests/unitary/with_extras/operator/pii/test_files/test_data.csv new file mode 100644 index 000000000..250e24577 --- /dev/null +++ b/tests/unitary/with_extras/operator/pii/test_files/test_data.csv @@ -0,0 +1,3 @@ +id,text +00001cee341fdb12,"Hi, this is John Doe, my number is (805) 555-1234." +00097b6214686db5,"John has a beautiful puppy." diff --git a/tests/unitary/with_extras/operator/pii/test_guardrail.py b/tests/unitary/with_extras/operator/pii/test_guardrail.py index fe904ad27..ae8c7be60 100644 --- a/tests/unitary/with_extras/operator/pii/test_guardrail.py +++ b/tests/unitary/with_extras/operator/pii/test_guardrail.py @@ -2,3 +2,119 @@ # Copyright (c) 2023 Oracle and/or its affiliates. 
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ + +import os +import tempfile +from io import StringIO + +import yaml + +from ads.opctl.operator.lowcode.pii.constant import DEFAULT_REPORT_FILENAME +from ads.opctl.operator.lowcode.pii.model.guardrails import PIIGuardrail +from ads.opctl.operator.lowcode.pii.operator_config import PiiOperatorConfig + + +class TestPiiGuardrail: + test_files_uri = os.path.join( + os.path.dirname(os.path.abspath(__file__)), "test_files" + ) + + def yaml_content_simple(self): + content = StringIO( + f""" +kind: operator +spec: + detectors: + - action: anonymize + name: default.phone + input_data: + url: {self.test_files_uri}/test_data.csv + output_directory: + url: {self.test_files_uri} + target_column: text +type: pii +version: v1 + +""" + ) + return content + + def yaml_content_complex(self): + content = StringIO( + """ +kind: operator +spec: + detectors: + - action: anonymize + name: default.phone + - action: mask + name: default.social_security_number + input_data: + url: oci://my-bucket@my-tenancy/input_data/mydata.csv + output_directory: + name: myProcesseData.csv + url: oci://my-bucket@my-tenancy/result/ + report: + report_filename: myreport.html + show_sensitive_content: true + show_rows: 10 + target_column: text +type: pii +version: v1 + +""" + ) + return content + + def test_init(self): + conf = yaml.load(self.yaml_content_complex(), yaml.SafeLoader) + operator_config = PiiOperatorConfig.from_yaml( + yaml_string=self.yaml_content_complex() + ) + guardrail = PIIGuardrail(config=operator_config) + + assert guardrail.dst_uri == os.path.join( + conf["spec"]["output_directory"]["url"], + conf["spec"]["output_directory"]["name"], + ) + assert guardrail.report_uri == os.path.join( + conf["spec"]["output_directory"]["url"], + conf["spec"]["report"]["report_filename"], + ) + assert len(guardrail.scrubber._detectors) == 2 + assert not guardrail.storage_options == {} + + def test_load_data(self): + conf = yaml.load(self.yaml_content_simple(), yaml.SafeLoader) + + operator_config = PiiOperatorConfig.from_yaml( + yaml_string=self.yaml_content_simple() + ) + guardrail = PIIGuardrail(config=operator_config) + guardrail.load_data() + + assert guardrail.datasets is not None + assert guardrail.storage_options == {} + assert guardrail.dst_uri == os.path.join( + conf["spec"]["output_directory"]["url"], + "test_data_out.csv", + ) + assert guardrail.report_uri == os.path.join( + conf["spec"]["output_directory"]["url"], + DEFAULT_REPORT_FILENAME, + ) + + def test_process(self): + operator_config = PiiOperatorConfig.from_yaml( + yaml_string=self.yaml_content_simple() + ) + guardrail = PIIGuardrail(config=operator_config) + with tempfile.TemporaryDirectory() as temp_dir: + dst_uri = os.path.join(temp_dir, "test_out.csv") + report_uri = os.path.join(temp_dir, DEFAULT_REPORT_FILENAME) + guardrail.process( + dst_uri=dst_uri, + report_uri=report_uri, + ) + assert os.path.exists(dst_uri) + assert os.path.exists(report_uri) diff --git a/tests/unitary/with_extras/operator/pii/test_pii.py b/tests/unitary/with_extras/operator/pii/test_pii_scrubber.py similarity index 100% rename from tests/unitary/with_extras/operator/pii/test_pii.py rename to tests/unitary/with_extras/operator/pii/test_pii_scrubber.py From d9a5af1010731838e554a676e48703cdd338e398 Mon Sep 17 00:00:00 2001 From: MING KANG Date: Tue, 14 Nov 2023 02:08:38 -0800 Subject: [PATCH 09/18] added support for return html --- ads/data_labeling/mixin/data_labeling.py | 5 
++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/ads/data_labeling/mixin/data_labeling.py b/ads/data_labeling/mixin/data_labeling.py index 56f85f3a9..e2c65eb20 100644 --- a/ads/data_labeling/mixin/data_labeling.py +++ b/ads/data_labeling/mixin/data_labeling.py @@ -1,7 +1,7 @@ #!/usr/bin/env python # -*- coding: utf-8; -*- -# Copyright (c) 2021, 2022 Oracle and/or its affiliates. +# Copyright (c) 2021, 2023 Oracle and/or its affiliates. # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ from typing import Dict, List @@ -188,6 +188,7 @@ def render_ner( content_column: str = "Content", annotations_column: str = "Annotations", limit: int = ROWS_TO_RENDER_LIMIT, + return_html: bool = False, ) -> None: """Renders NER dataset. Displays only first 50 rows. @@ -223,6 +224,8 @@ def render_ner( annotations_column=annotations_column, ) result_html = text_visualizer.render(items=items, options=options) + if return_html: + return result_html from IPython.core.display import HTML, Markdown, display From cffa536da2ec5809d9561ad0da1c8ee0941e17c1 Mon Sep 17 00:00:00 2001 From: MING KANG Date: Tue, 14 Nov 2023 02:09:05 -0800 Subject: [PATCH 10/18] implements pii operator --- ads/opctl/operator/lowcode/pii/__main__.py | 2 +- ads/opctl/operator/lowcode/pii/cmd.py | 2 +- .../lowcode/pii/{model => }/constant.py | 33 +++ .../operator/lowcode/pii/environment.yaml | 5 +- .../operator/lowcode/pii/model/factory.py | 16 +- .../operator/lowcode/pii/model/guardrails.py | 181 ++++++------ ads/opctl/operator/lowcode/pii/model/pii.py | 40 ++- .../operator/lowcode/pii/model/report.py | 277 ++++++++++-------- .../operator/lowcode/pii/operator_config.py | 18 +- ads/opctl/operator/lowcode/pii/schema.yaml | 5 +- .../operator/lowcode/pii/{model => }/utils.py | 117 +++----- 11 files changed, 375 insertions(+), 321 deletions(-) rename ads/opctl/operator/lowcode/pii/{model => }/constant.py (64%) rename ads/opctl/operator/lowcode/pii/{model => }/utils.py (53%) diff --git a/ads/opctl/operator/lowcode/pii/__main__.py b/ads/opctl/operator/lowcode/pii/__main__.py index a914edc0a..111b7ed3f 100644 --- a/ads/opctl/operator/lowcode/pii/__main__.py +++ b/ads/opctl/operator/lowcode/pii/__main__.py @@ -22,7 +22,7 @@ def operate(operator_config: PiiOperatorConfig) -> None: """Runs the PII operator.""" guard = PIIGuardrail(config=operator_config) - guard.evaluate() + guard.process() def verify(spec: Dict, **kwargs: Dict) -> bool: diff --git a/ads/opctl/operator/lowcode/pii/cmd.py b/ads/opctl/operator/lowcode/pii/cmd.py index 1098b390b..67bf14d27 100644 --- a/ads/opctl/operator/lowcode/pii/cmd.py +++ b/ads/opctl/operator/lowcode/pii/cmd.py @@ -30,7 +30,7 @@ def init(**kwargs: Dict) -> str: """ logger.info("==== PII related options ====") - default_detector = [{"name": "default.phone", "action": "anonymize"}] + default_detector = [{"name": ".", "action": "mask"}] return YamlGenerator( schema=_load_yaml_from_uri(__file__.replace("cmd.py", "schema.yaml")) diff --git a/ads/opctl/operator/lowcode/pii/model/constant.py b/ads/opctl/operator/lowcode/pii/constant.py similarity index 64% rename from ads/opctl/operator/lowcode/pii/model/constant.py rename to ads/opctl/operator/lowcode/pii/constant.py index 5569a8021..5c75ae74c 100644 --- a/ads/opctl/operator/lowcode/pii/model/constant.py +++ b/ads/opctl/operator/lowcode/pii/constant.py @@ -3,6 +3,39 @@ # Copyright (c) 2023 Oracle and/or its affiliates. 
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/
 
+from ads.common.extended_enum import ExtendedEnumMeta
+
+DEFAULT_SHOW_ROWS = 25
+DEFAULT_TIME_OUT = 5
+DEFAULT_COLOR = "#D6D3D1"
+DEFAULT_REPORT_FILENAME = "report.html"
+DEFAULT_TARGET_COLUMN = "target"
+
+
+class SupportedAction(str, metaclass=ExtendedEnumMeta):
+    """Supported action to process detected entities."""
+
+    MASK = "mask"
+    REMOVE = "remove"
+    ANONYMIZE = "anonymize"
+
+
+class SupportedDetector(str, metaclass=ExtendedEnumMeta):
+    """Supported pii detectors."""
+
+    DEFAULT = "default"
+    SPACY = "spacy"
+
+
+class DataFrameColumn(str, metaclass=ExtendedEnumMeta):
+    REDACTED_TEXT: str = "redacted_text"
+    ENTITIES: str = "entities_cols"
+
+
+class YamlKey(str, metaclass=ExtendedEnumMeta):
+    """Yaml key used in pii.yaml."""
+
+    pass
 
 YAML_KEYS = [
diff --git a/ads/opctl/operator/lowcode/pii/environment.yaml b/ads/opctl/operator/lowcode/pii/environment.yaml
index b542e1d6d..e45cb2530 100644
--- a/ads/opctl/operator/lowcode/pii/environment.yaml
+++ b/ads/opctl/operator/lowcode/pii/environment.yaml
@@ -2,7 +2,7 @@ name: pii
 channels:
   - conda-forge
 dependencies:
-  - python=3.8
+  - python=3.9
   - pip
   - pip:
       - datapane
@@ -10,4 +10,7 @@ dependencies:
       - gender_guesser
       - nameparser
       - scrubadub_spacy
+      - requests
+      - aiohttp
+      - plotly
      - "git+https://github.com/oracle/accelerated-data-science.git@feature/ads_pii_operator#egg=oracle-ads"
diff --git a/ads/opctl/operator/lowcode/pii/model/factory.py b/ads/opctl/operator/lowcode/pii/model/factory.py
index 542e8af0b..d5d0de7ae 100644
--- a/ads/opctl/operator/lowcode/pii/model/factory.py
+++ b/ads/opctl/operator/lowcode/pii/model/factory.py
@@ -5,18 +5,12 @@
 # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/
 
 import uuid
+
 import scrubadub
 from scrubadub_spacy.detectors.spacy import SpacyEntityDetector
 
-from ads.common.extended_enum import ExtendedEnumMeta
-from ads.opctl.operator.lowcode.pii.model.utils import construct_filth_cls_name
-
-
-class SupportedDetector(str, metaclass=ExtendedEnumMeta):
-    """Supported pii detectors."""
-
-    Default = "default"
-    Spacy = "spacy"
+from ads.opctl.operator.lowcode.pii.constant import SupportedDetector
+from ads.opctl.operator.lowcode.pii.utils import construct_filth_cls_name
 
 
 class UnSupportedDetectorError(Exception):
@@ -66,8 +60,8 @@ class PiiDetectorFactory:
     """
 
     _MAP = {
-        SupportedDetector.Default: BuiltInDetector,
-        SupportedDetector.Spacy: SpacyDetector,
+        SupportedDetector.DEFAULT: BuiltInDetector,
+        SupportedDetector.SPACY: SpacyDetector,
     }
 
     @classmethod
diff --git a/ads/opctl/operator/lowcode/pii/model/guardrails.py b/ads/opctl/operator/lowcode/pii/model/guardrails.py
index 455edad47..41dc3514b 100644
--- a/ads/opctl/operator/lowcode/pii/model/guardrails.py
+++ b/ads/opctl/operator/lowcode/pii/model/guardrails.py
@@ -6,44 +6,67 @@
 
 import os
 import time
+from datetime import datetime
+
+from ads.common.object_storage_details import ObjectStorageDetails
 from ads.opctl import logger
+from ads.opctl.operator.lowcode.pii.constant import DataFrameColumn
+from ads.opctl.operator.lowcode.pii.model.pii import PiiScrubber, detect, scrub
+from ads.opctl.operator.lowcode.pii.model.report import (
+    PIIOperatorReport,
+    PiiReportPageSpec,
+    PiiReportSpec,
+)
 from ads.opctl.operator.lowcode.pii.operator_config import PiiOperatorConfig
-from ads.opctl.operator.lowcode.pii.model.pii import Scrubber, scrub, detect
-from ads.opctl.operator.lowcode.pii.model.report 
import PIIOperatorReport - -from datetime import datetime -from ads.opctl.operator.lowcode.pii.model.utils import ( +from ads.opctl.operator.lowcode.pii.utils import ( _load_data, _write_data, + default_signer, get_output_name, ) -from ads.opctl.operator.lowcode.pii.model.utils import default_signer -from ads.common.object_storage_details import ObjectStorageDetails class PIIGuardrail: - def __init__(self, config: PiiOperatorConfig, auth: dict = None): + def __init__(self, config: PiiOperatorConfig): + self.config = config self.spec = config.spec - self.scrubber = Scrubber(config=config).config_scrubber() + self.pii_scrubber = PiiScrubber(config=config) + self.scrubber = self.pii_scrubber.config_scrubber() - self.dst_uri = os.path.join( - self.spec.output_directory.url, - get_output_name( - target_name=self.spec.output_directory.name, - given_name=self.self.spec.input_data.url, - ), + output_filename = get_output_name( + target_name=self.spec.output_directory.name, + given_name=self.spec.input_data.url, ) + self.dst_uri = os.path.join(self.spec.output_directory.url, output_filename) + self.config.spec.output_directory.name = output_filename self.report_uri = os.path.join( self.spec.output_directory.url, self.spec.report.report_filename, ) - try: - self.datasets = self.load_data() - except Exception as e: - logger.warning(f"Failed to load data from `{self.spec.input_data.url}`.") - logger.debug(f"Full traceback: {e}") + self.report_context: PiiReportSpec = PiiReportSpec.from_dict( + { + "run_summary": { + "config": self.config, + "selected_detectors": self.pii_scrubber.detectors, + "selected_entities": self.pii_scrubber.entities, + "selected_spacy_model": self.pii_scrubber.spacy_model_detectors, + "show_rows": self.spec.report.show_rows, + "show_sensitive_info": self.spec.report.show_sensitive_content, + "src_uri": self.spec.input_data.url, + "total_tokens": 0, + }, + "run_details": {"rows": []}, + } + ) + + self.storage_options = ( + default_signer() + if ObjectStorageDetails.is_oci_path(self.spec.output_directory.url) + else {} + ) + self.datasets = None def load_data(self, uri=None, storage_options=None): """Loads input data.""" @@ -52,114 +75,90 @@ def load_data(self, uri=None, storage_options=None): self.datasets = _load_data( filename=input_data_uri, - storage_options=storage_options or default_signer(), + storage_options=storage_options or self.storage_options, ) + return self def process(self, **kwargs): """Process input data.""" - run_at = datetime.now() - dt_string = run_at.strftime("%d/%m/%Y %H:%M:%S") + self.report_context.run_summary.timestamp = datetime.now().strftime( + "%d/%m/%Y %H:%M:%S" + ) start_time = time.time() data = kwargs.pop("input_data", None) or self.datasets report_uri = kwargs.pop("report_uri", None) or self.report_uri dst_uri = kwargs.pop("dst_uri", None) or self.dst_uri + if not data: + try: + self.load_data() + data = self.datasets + except Exception as e: + logger.warning( + f"Failed to load data from `{self.spec.input_data.url}`." 
+ ) + raise e + # process user data - data["redacted_text"] = data[self.spec.target_column].apply( + data[DataFrameColumn.REDACTED_TEXT] = data[self.spec.target_column].apply( lambda x: scrub(x, scrubber=self.scrubber) ) - elapsed_time = time.time() - start_time + self.report_context.run_summary.elapsed_time = time.time() - start_time + self.report_context.run_summary.total_rows = len(data.index) + # save output data if dst_uri: logger.info(f"Saving data into `{dst_uri}` ...") _write_data( data=data.loc[:, data.columns != self.spec.target_column], filename=dst_uri, - storage_options=default_signer() - if ObjectStorageDetails.is_oci_path(dst_uri) - else {}, + storage_options=kwargs.pop("storage_options", None) + or self.storage_options, ) # prepare pii report if report_uri: - data["entities_cols"] = data[self.spec.target_column].apply( + logger.info(f"Generating report to `{report_uri}` ...") + + data[DataFrameColumn.ENTITIES] = data[self.spec.target_column].apply( lambda x: detect(text=x, scrubber=self.scrubber) ) - from ads.opctl.operator.lowcode.pii.model.utils import _safe_get_spec - from ads.opctl.operator.lowcode.pii.model.pii import DEFAULT_SPACY_MODEL - - selected_spacy_model = [] - for spec in _safe_get_spec( - self.scrubber.redact_spec_file, "spacy_detectors", [] - ): - selected_spacy_model.append( + + for i in data.index: + text = data[self.spec.target_column][i] + ent_col = data[DataFrameColumn.ENTITIES][i] + page = PiiReportPageSpec.from_dict( { - "model": _safe_get_spec(spec, "model", DEFAULT_SPACY_MODEL), - "spacy_entites": [ - x.upper() for x in spec.get("named_entities", []) - ], + "id": i, + "total_tokens": len(ent_col), + "entities": ent_col, + "raw_text": text, } ) - selected_entities = [] - for spacy_models in selected_spacy_model: - selected_entities = selected_entities + spacy_models.get( - "spacy_entites", [] - ) - selected_entities = selected_entities + _safe_get_spec( - self.scrubber.redact_spec_file, "detectors", [] + self.report_context.run_details.rows.append(page) + self.report_context.run_summary.total_tokens += len(ent_col) + + self._process_context() + PIIOperatorReport( + report_spec=self.report_context, report_uri=report_uri + ).make_view().save_report( + storage_options=kwargs.pop("storage_options", None) + or self.storage_options ) - context = { - "run_summary": { - "total_tokens": 0, - "src_uri": self.spec.input_data.url, - "total_rows": len(data.index), - "config": self.spec, - "selected_detectors": list(self.scrubber._detectors.values()), - "selected_entities": selected_entities, - "selected_spacy_model": selected_spacy_model, - "timestamp": dt_string, - "elapsed_time": elapsed_time, - "show_rows": self.spec.report.show_rows, - "show_sensitive_info": self.spec.report.show_sensitive_content, - }, - "run_details": {"rows": []}, - } - for ind in data.index: - text = data[self.spec.target_column][ind] - ent_col = data["entities_cols"][ind] - idx = data["id"][ind] - page = { - "id": idx, - "total_tokens": len(ent_col), - "entities": ent_col, - "raw_text": text, - } - context.get("run_details").get("rows").append(page) - context.get("run_summary")["total_tokens"] += len(ent_col) - - context = self._process_context(context) - self._generate_report(context, report_uri) - - def _generate_report(self, context, report_uri): - report_ = PIIOperatorReport(context=context) - report_sections = report_.make_view() - report_.save_report(report_sections=report_sections, report_path=report_uri) - - def _process_context(self, context): + def _process_context(self): 
"""Count different type of filth.""" statics = {} # statics : count Filth type in total - rows = context.get("run_details").get("rows") + rows = self.report_context.run_details.rows for row in rows: - entities = row.get("entities") + entities = row.entities row_statics = {} # count row for ent in entities: row_statics[ent.type] = row_statics.get(ent.type, 0) + 1 statics[ent.type] = statics.get(ent.type, 0) + 1 - row["statics"] = row_statics.copy() + row.statics = row_statics.copy() - context.get("run_summary")["statics"] = statics - return context + self.report_context.run_summary.statics = statics diff --git a/ads/opctl/operator/lowcode/pii/model/pii.py b/ads/opctl/operator/lowcode/pii/model/pii.py index 3762a9e22..0bb0d77ea 100644 --- a/ads/opctl/operator/lowcode/pii/model/pii.py +++ b/ads/opctl/operator/lowcode/pii/model/pii.py @@ -7,10 +7,13 @@ import scrubadub -from ads.common.extended_enum import ExtendedEnumMeta from ads.opctl import logger from ads.opctl.operator.common.utils import _load_yaml_from_uri from ads.opctl.operator.lowcode.pii.model.factory import PiiDetectorFactory +from ads.opctl.operator.lowcode.pii.constant import ( + SupportedAction, + SupportedDetector, +) from ads.opctl.operator.lowcode.pii.model.processor import ( POSTPROCESSOR_MAP, SUPPORTED_REPLACER, @@ -18,15 +21,9 @@ ) -class SupportedAction(str, metaclass=ExtendedEnumMeta): - """Supported action to process detected entities.""" - - MASK = "mask" - REMOVE = "remove" - ANONYMIZE = "anonymize" - - class PiiScrubber: + """Class used for config scrubber and count the detectors in use.""" + def __init__(self, config): logger.info(f"Loading config from {config}") if isinstance(config, str): @@ -45,8 +42,9 @@ def __init__(self, config): self.scrubber = scrubadub.Scrubber() self.detectors = [] + self.entities = [] self.spacy_model_detectors = [] - self.post_processors = {} # replacer_name -> replacer_obj + self.post_processors = {} self._reset_scrubber() @@ -66,8 +64,9 @@ def _register(self, name, dtype, model, action, mask_with: str = None): detector_type=dtype, entity=name, model=model ) self.scrubber.add_detector(detector) + self.entities.append(name) - if action == "anonymize": + if action == SupportedAction.ANONYMIZE: entity = ( detector if isinstance(detector, str) @@ -86,7 +85,7 @@ def _register(self, name, dtype, model, action, mask_with: str = None): f"Not supported `action` {action} for this entity `{name}`. Please try with other action." 
) - if action == "remove": + if action == SupportedAction.REMOVE: remover = self.post_processors.get("remover", Remover()) remover._ENTITIES.append(name) self.post_processors["remover"] = remover @@ -103,17 +102,28 @@ def config_scrubber(self): dname = d[1] if len(d) == 2 else d[2] model = None if len(d) == 2 else d[1] - action = detector.get("action", "mask") - # mask_with = detector.get("mask_with", None) + action = detector.get("action", SupportedAction.MASK) self._register( name=dname, dtype=dtype, model=model, action=action, - # mask_with=mask_with, ) + if dtype == SupportedDetector.SPACY: + exist = False + for spacy_detectors in self.spacy_model_detectors: + if spacy_detectors["model"] == model: + spacy_detectors["spacy_entites"].append(dname) + exist = True + break + if not exist: + self.spacy_model_detectors.append( + {"model": model, "spacy_entites": [dname]} + ) self._register_post_processor() + + self.detectors = list(self.scrubber._detectors.values()) return self.scrubber def _register_post_processor(self): diff --git a/ads/opctl/operator/lowcode/pii/model/report.py b/ads/opctl/operator/lowcode/pii/model/report.py index b89b5be98..44b7c6752 100644 --- a/ads/opctl/operator/lowcode/pii/model/report.py +++ b/ads/opctl/operator/lowcode/pii/model/report.py @@ -5,79 +5,90 @@ # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ +import os import random +import tempfile +from dataclasses import dataclass, field +from typing import Dict, List import datapane as dp import fsspec import pandas as pd import plotly.express as px import plotly.graph_objects as go +import requests import yaml -PII_REPORT_DESCRIPTION = ( - "This report will offer a comprehensive overview of the redaction of personal identifiable information (PII) from the provided data." - "The `Summary` section will provide an executive summary of this process, including key statistics, configuration, and model usage." - "The `Details` section will offer a more granular analysis of each row of data, including relevant statistics." +from ads.common.serializer import DataClassSerializable +from ads.opctl import logger +from ads.opctl.operator.lowcode.pii.constant import ( + DEFAULT_SHOW_ROWS, + DEFAULT_TIME_OUT, + DETAILS_REPORT_DESCRIPTION, + FLAT_UI_COLORS, + PII_REPORT_DESCRIPTION, + DEFAULT_COLOR, ) -DETAILS_REPORT_DESCRIPTION = "The following report will show the details on each row. You can view the highlighted named entities and their labels in the text under `TEXT` tab." 
- -FLAT_UI_COLORS = [ - "#1ABC9C", - "#2ECC71", - "#3498DB", - "#9B59B6", - "#34495E", - "#16A085", - "#27AE60", - "#2980B9", - "#8E44AD", - "#2C3E50", - "#F1C40F", - "#E67E22", - "#E74C3C", - "#ECF0F1", - "#95A5A6", - "#F39C12", - "#D35400", - "#C0392B", - "#BDC3C7", - "#7F8C8D", -] -LABEL_TO_COLOR_MAP = {} +from ads.opctl.operator.lowcode.pii.utils import ( + block_print, + compute_rate, + enable_print, + human_time_friendly, +) +from ads.opctl.operator.lowcode.pii.operator_config import PiiOperatorConfig -################ -# Report utils # -################ -def compute_rate(elapsed_time, num_unit): - return elapsed_time / num_unit +@dataclass(repr=True) +class PiiReportPageSpec(DataClassSerializable): + """Class representing each page under Run Details in pii operator report.""" + entities: list = field(default_factory=list) + id: int = None + raw_text: str = None + statics: dict = field(default_factory=dict) + total_tokens: int = None -def human_time_friendly(seconds): - TIME_DURATION_UNITS = ( - ("week", 60 * 60 * 24 * 7), - ("day", 60 * 60 * 24), - ("hour", 60 * 60), - ("min", 60), - ) - if seconds == 0: - return "inf" - accumulator = [] - for unit, div in TIME_DURATION_UNITS: - amount, seconds = divmod(float(seconds), div) - if amount > 0: - accumulator.append( - "{} {}{}".format(int(amount), unit, "" if amount == 1 else "s") - ) - accumulator.append("{} secs".format(round(seconds, 2))) - return ", ".join(accumulator) + +@dataclass(repr=True) +class RunDetails(DataClassSerializable): + """Class representing Run Details Page in pii operator report.""" + + rows: list = field(default_factory=list) + + +@dataclass(repr=True) +class RunSummary(DataClassSerializable): + """Class representing Run Summary Page in pii operator report.""" + + config: PiiOperatorConfig = None + elapsed_time: str = None + selected_detectors: list = field(default_factory=list) + selected_entities: List[str] = field(default_factory=list) + selected_spacy_model: List[Dict] = field(default_factory=list) + show_rows: int = None + show_sensitive_info: bool = False + src_uri: str = None + statics: dict = None + timestamp: str = None + total_rows: int = None + total_tokens: int = None + + +@dataclass(repr=True) +class PiiReportSpec(DataClassSerializable): + """Class representing pii operator report.""" + + run_details: RunDetails = field(default_factory=RunDetails) + run_summary: RunSummary = field(default_factory=RunSummary) + + +LABEL_TO_COLOR_MAP = {} def make_model_card(model_name="", readme_path=""): """Make render model_readme.md as model_card tab. All spacy model: https://huggingface.co/spacy - For example: "en_core_web_trf": "https://huggingface.co/spacy/en_core_web_trf/raw/main/README.md", - + For example: "en_core_web_trf": "https://huggingface.co/spacy/en_core_web_trf/raw/main/README.md". """ readme_path = ( f"https://huggingface.co/spacy/{model_name}/raw/main/README.md" @@ -87,10 +98,20 @@ def make_model_card(model_name="", readme_path=""): if not readme_path: raise NotImplementedError("Does not support other spacy model so far.") - with fsspec.open(readme_path, "r") as file: - content = file.read() - _, front_matter, text = content.split("---", 2) - data = yaml.safe_load(front_matter) + try: + requests.get(readme_path, timeout=DEFAULT_TIME_OUT) + with fsspec.open(readme_path, "r") as file: + content = file.read() + _, front_matter, text = content.split("---", 2) + data = yaml.safe_load(front_matter) + except requests.ConnectionError: + logger.warning( + "You don't have internet connection. 
Therefore, we are not able to generate model card." + ) + return dp.Group( + dp.Text("-"), + columns=1, + ) try: eval_res = data["model-index"][0]["results"] @@ -113,7 +134,7 @@ def make_model_card(model_name="", readme_path=""): eval_res_tb = dp.Plot(data=fig, caption="Evaluation Results") except: eval_res_tb = dp.Text("-") - print( + logger.warning( "The given readme.md doesn't have correct template for Evaluation Results." ) @@ -125,6 +146,7 @@ def make_model_card(model_name="", readme_path=""): def map_label_to_color(labels): + """Pair label with corresponding color.""" label_to_colors = {} for label in labels: label = label.lower() @@ -162,7 +184,7 @@ def build_entity_df(entites, id) -> pd.DataFrame: ent.replacement_string or "{{" + ent.placeholder + "}}" for ent in entites ] d = { - "rowID": id, + "Row ID": id, "Entity (Original Text)": text, "Type": types, "Redacted To": replaced_values, @@ -171,7 +193,7 @@ def build_entity_df(entites, id) -> pd.DataFrame: if df.size == 0: # Datapane does not support empty dataframe, append a dummy row df2 = { - "rowID": id, + "Row ID": id, "Entity (Original Text)": "-", "Type": "-", "Redacted To": "-", @@ -181,13 +203,9 @@ def build_entity_df(entites, id) -> pd.DataFrame: class RowReportFields: - def __init__(self, context, show_sensitive_info: bool = True): - self.total_tokens = context.get("total_tokens", "unknown") - self.entites_cnt_map = context.get("statics", {}) - self.raw_text = context.get("raw_text", "") - self.id = context.get("id", "") + def __init__(self, row_spec: PiiReportPageSpec, show_sensitive_info: bool = True): + self.spec = row_spec self.show_sensitive_info = show_sensitive_info - self.entities = context.get("entities") def build_report(self) -> dp.Group: return dp.Group( @@ -198,7 +216,7 @@ def build_report(self) -> dp.Group: ], type=dp.SelectType.TABS, ), - label="rowId: " + str(self.id), + label="Row Id: " + str(self.spec.id), ) def _make_stats_card(self): @@ -206,16 +224,16 @@ def _make_stats_card(self): dp.Text("## Row Summary Statistics"), dp.BigNumber( heading="Total No. 
Of Entites Proceed", - value=self.total_tokens, + value=self.spec.total_tokens or 0, ), dp.Text(f"### Entities Distribution"), - plot_pie(self.entites_cnt_map), + plot_pie(self.spec.statics), ] if self.show_sensitive_info: stats.append(dp.Text(f"### Resolved Entities")) stats.append( dp.DataTable( - build_entity_df(self.entities, id=self.id), + build_entity_df(self.spec.entities, id=self.spec.id), label="Resolved Entities", ) ) @@ -224,16 +242,18 @@ def _make_stats_card(self): def _make_text_card(self): annotations = [] labels = set() - for ent in self.entities: + for ent in self.spec.entities: annotations.append((ent.beg, ent.end, ent.type)) labels.add(ent.type) - d = {"Content": [self.raw_text], "Annotations": [annotations]} - df = pd.DataFrame(data=d) + if len(annotations) == 0: + annotations.append((0, 0, "No entity detected")) + d = {"Content": [self.spec.raw_text], "Annotations": [annotations]} + df = pd.DataFrame(data=d) render_html = df.ads.render_ner( options={ - "default_color": "#D6D3D1", + "default_color": DEFAULT_COLOR, "colors": map_label_to_color(labels), }, return_html=True, @@ -242,38 +262,25 @@ def _make_text_card(self): class PIIOperatorReport: - def __init__(self, context: dict): + def __init__(self, report_spec: PiiReportSpec, report_uri: str): # set useful field for generating report from context - summary_context = context.get("run_summary", {}) - self.config = summary_context.get("config", {}) # for generate yaml - self.show_sensitive_info = summary_context.get("show_sensitive_info", True) - self.show_rows = summary_context.get("show_rows", 25) - self.total_rows = summary_context.get("total_rows", "unknown") - self.total_tokens = summary_context.get("total_tokens", "unknown") - self.elapsed_time = summary_context.get("elapsed_time", 0) - self.entites_cnt_map = summary_context.get("statics", {}) - self.selected_entities = summary_context.get("selected_entities", []) - self.spacy_detectors = summary_context.get("selected_spacy_model", []) - self.run_at = summary_context.get("timestamp", "today") - - rows = context.get("run_details", {}).get("rows", []) + self.report_spec = report_spec + self.show_rows = report_spec.run_summary.show_rows or DEFAULT_SHOW_ROWS + + rows = report_spec.run_details.rows rows = rows[0 : self.show_rows] self.rows_details = [ - RowReportFields(r, self.show_sensitive_info) for r in rows - ] # List[RowReportFields], len=show_rows - - self._validate_fields() + RowReportFields(r, report_spec.run_summary.show_sensitive_info) + for r in rows + ] - def _validate_fields(self): - """Check if any fields are empty.""" - # TODO - pass + self.report_uri = report_uri def make_view(self): title_text = dp.Text("# Personally Identifiable Information Operator Report") time_proceed = dp.BigNumber( heading="Ran at", - value=self.run_at, + value=self.report_spec.run_summary.timestamp or "today", ) report_description = dp.Text(PII_REPORT_DESCRIPTION) @@ -293,15 +300,27 @@ def make_view(self): ) ) self.report_sections = [title_text, report_description, time_proceed, structure] - return self.report_sections + return self + + def save_report(self, report_sections=None, report_uri=None, storage_options={}): + with tempfile.TemporaryDirectory() as temp_dir: + report_local_path = os.path.join(temp_dir, "___report.html") + block_print() + dp.save_report( + report_sections or self.report_sections, + path=report_local_path, + open=False, + ) + enable_print() - def save_report(self, report_sections, report_path): - dp.save_report( - report_sections or self.report_sections, - 
path=report_path, - open=False, - ) - return report_path + report_uri = report_uri or self.report_uri + with open(report_local_path) as f1: + with fsspec.open( + report_uri, + "w", + **storage_options, + ) as f2: + f2.write(f1.read()) def _build_summary_page(self): summary = dp.Blocks( @@ -342,51 +361,69 @@ def _make_summary_stats_card(self) -> dp.Group: 4. entities distribution 5. resolved Entities in sample data - optional """ + try: + process_rate = compute_rate( + self.report_spec.run_summary.elapsed_time, + self.report_spec.run_summary.total_rows, + ) + except Exception as e: + logger.warning("Failed to compute processing rate.") + logger.debug(f"Full traceback: {e}") + process_rate = "-" + summary_stats = [ dp.Text("## Summary Statistics"), dp.Group( dp.BigNumber( heading="Total No. Of Rows", - value=self.total_rows, + value=self.report_spec.run_summary.total_rows or "unknown", ), dp.BigNumber( heading="Total No. Of Entites Proceed", - value=self.total_tokens, + value=self.report_spec.run_summary.total_tokens, ), dp.BigNumber( heading="Rows per second processed", - value=compute_rate(self.elapsed_time, self.total_rows), + value=process_rate, ), dp.BigNumber( heading="Total Time Spent", - value=human_time_friendly(self.elapsed_time), + value=human_time_friendly( + self.report_spec.run_summary.elapsed_time + ), ), columns=2, ), dp.Text(f"### Entities Distribution"), - plot_pie(self.entites_cnt_map), + plot_pie(self.report_spec.run_summary.statics), ] - if self.show_sensitive_info: + if self.report_spec.run_summary.show_sensitive_info: entites_df = self._build_total_entity_df() summary_stats.append(dp.Text(f"### Resolved Entities")) summary_stats.append(dp.DataTable(entites_df)) return dp.Group(blocks=summary_stats, label="STATS") def _make_yaml_card(self) -> dp.Group: - # show pii config yaml - yaml_string = yaml.dump(self.config, Dumper=yaml.SafeDumper) + """Shows the full pii config yaml.""" + yaml_string = self.report_spec.run_summary.config.to_yaml() yaml_appendix_title = dp.Text(f"## Reference: YAML File") yaml_appendix = dp.Code(code=yaml_string, language="yaml") return dp.Group(blocks=[yaml_appendix_title, yaml_appendix], label="YAML") def _make_model_card(self) -> dp.Group: - # show each model card + """Generates model card.""" + if len(self.report_spec.run_summary.selected_spacy_model) == 0: + return dp.Group( + dp.Text("No model used."), + label="MODEL CARD", + ) + model_cards = [ dp.Group( make_model_card(model_name=x.get("model")), label=x.get("model"), ) - for x in self.spacy_detectors + for x in self.report_spec.run_summary.selected_spacy_model ] if len(model_cards) <= 1: @@ -405,16 +442,18 @@ def _make_model_card(self) -> dp.Group: def _build_total_entity_df(self) -> pd.DataFrame: frames = [] for row in self.rows_details: # RowReportFields - frames.append(build_entity_df(entites=row.entities, id=row.id)) + frames.append(build_entity_df(entites=row.spec.entities, id=row.spec.id)) result = pd.concat(frames) return result def _get_summary_desc(self) -> str: - entities_mark_down = ["**" + ent + "**" for ent in self.selected_entities] + entities_mark_down = [ + "**" + ent + "**" for ent in self.report_spec.run_summary.selected_entities + ] model_description = "" - for spacy_model in self.spacy_detectors: + for spacy_model in self.report_spec.run_summary.selected_spacy_model: model_description = ( model_description + f"You chose the **{spacy_model.get('model', 'unknown model')}** model for **{spacy_model.get('spacy_entites', 'unknown entities')}** detection." 
diff --git a/ads/opctl/operator/lowcode/pii/operator_config.py b/ads/opctl/operator/lowcode/pii/operator_config.py index 47e65b84a..d70e8770b 100644 --- a/ads/opctl/operator/lowcode/pii/operator_config.py +++ b/ads/opctl/operator/lowcode/pii/operator_config.py @@ -11,13 +11,17 @@ from ads.common.serializer import DataClassSerializable from ads.opctl.operator.common.operator_config import OperatorConfig from ads.opctl.operator.common.utils import _load_yaml_from_uri +from ads.opctl.operator.lowcode.pii.constant import ( + DEFAULT_SHOW_ROWS, + DEFAULT_REPORT_FILENAME, + DEFAULT_TARGET_COLUMN, +) @dataclass(repr=True) class InputData(DataClassSerializable): """Class representing operator specification input data details.""" - format: str = None url: str = None @@ -34,7 +38,7 @@ class Report(DataClassSerializable): """Class representing operator specification report details.""" report_filename: str = None - show_rows: int = 25 + show_rows: int = None show_sensitive_content: bool = False @@ -54,13 +58,19 @@ class PiiOperatorSpec(DataClassSerializable): output_directory: OutputDirectory = field(default_factory=OutputDirectory) report: Report = field(default_factory=Report) target_column: str = None - # TODO: adjust from_dict to accept List[Detector] detectors: List[Dict] = field(default_factory=list) def __post_init__(self): """Adjusts the specification details.""" - self.target_column = self.target_column or "target" + self.target_column = self.target_column or DEFAULT_TARGET_COLUMN + self.report = self.report or Report.from_dict( + { + "report_filename": DEFAULT_REPORT_FILENAME, + "show_rows": DEFAULT_SHOW_ROWS, + "show_sensitive_content": False, + } + ) @dataclass(repr=True) diff --git a/ads/opctl/operator/lowcode/pii/schema.yaml b/ads/opctl/operator/lowcode/pii/schema.yaml index 9599ed78c..ff295c7fa 100644 --- a/ads/opctl/operator/lowcode/pii/schema.yaml +++ b/ads/opctl/operator/lowcode/pii/schema.yaml @@ -50,13 +50,13 @@ spec: meta: description: "The url can be local, or remote. For example: `oci://@/`" name: - required: true + required: false type: string default: data-out.csv type: dict report: - required: true + required: false schema: report_filename: required: true @@ -67,7 +67,6 @@ spec: show_rows: required: false type: number - default: 10 meta: description: "The number of rows that shows in the report. Defaults to `10`" show_sensitive_content: diff --git a/ads/opctl/operator/lowcode/pii/model/utils.py b/ads/opctl/operator/lowcode/pii/utils.py similarity index 53% rename from ads/opctl/operator/lowcode/pii/model/utils.py rename to ads/opctl/operator/lowcode/pii/utils.py index 9f7525e50..50f28eed9 100644 --- a/ads/opctl/operator/lowcode/pii/model/utils.py +++ b/ads/opctl/operator/lowcode/pii/utils.py @@ -4,16 +4,15 @@ # Copyright (c) 2023 Oracle and/or its affiliates. 
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ - -import logging import os +import sys + +import fsspec import pandas as pd -from typing import Dict, List -from .constant import YAML_KEYS from ads.common.object_storage_details import ObjectStorageDetails -import fsspec -from ..errors import PIIInputDataError + +from .errors import PIIInputDataError def default_signer(**kwargs): @@ -34,7 +33,7 @@ def _call_pandas_fsspec(pd_fn, filename, storage_options, **kwargs): return pd_fn(filename, storage_options=storage_options, **kwargs) -def _load_data(filename, format, storage_options=None, columns=None, **kwargs): +def _load_data(filename, format=None, storage_options=None, columns=None, **kwargs): if not format: _, format = os.path.splitext(filename) format = format[1:] @@ -53,7 +52,9 @@ def _load_data(filename, format, storage_options=None, columns=None, **kwargs): return data -def _write_data(data, filename, format, storage_options, index=False, **kwargs): +def _write_data( + data, filename, format=None, storage_options=None, index=False, **kwargs +): if not format: _, format = os.path.splitext(filename) format = format[1:] @@ -74,33 +75,6 @@ def get_output_name(given_name, target_name=None): return target_name -class ReportContextKey: - RUN_SUMMARY = "run_summary" - FILE_SUMMARY = "file_summary" - REPORT_NAME = "report_name" - TOTAL_FILES = "total_files" - ELAPSED_TIME = "elapsed_time" - DATE = "date" - OUTPUT_DIR = "output_dir" - INPUT_DIR = "input_dir" - INPUT = "input" - TOTAL_T = "total_tokens" - INPUT_FILE_NAME = "input_file_name" - OUTPUT_NAME = "output_name" - ENTITIES = "entities" - FILE_NAME = "filename" - INPUT_BASE = "input_base" - - -def _safe_get_spec(spec_file, key, default): - try: - return spec_file[key] - except KeyError as e: - if not key in YAML_KEYS: - logging.warning(f"key: `{key}` is not supported.") - return default - - def construct_filth_cls_name(name: str) -> str: """Constructs the filth class name from the given name. For example, "name" -> "NameFilth". @@ -114,45 +88,38 @@ def construct_filth_cls_name(name: str) -> str: return "".join([s.capitalize() for s in name.split("_")]) + "Filth" -def _write_to_file(s: str, uri: str, **kwargs) -> None: - """Writes the given string to the given uri. - - Args: - s (str): The string to be written. - uri (str): The uri of the file to be written. - kwargs (dict ): keyword arguments to be passed into open(). - """ - with open(uri, "w", **kwargs) as f: - f.write(s) - - -def _count_tokens(file_summary): - """Counts the total number of tokens in the given file summary. +################ +# Report utils # +################ +def compute_rate(elapsed_time, num_unit): + return elapsed_time / num_unit - Args: - file_summary (dict): file summary. - e.g. { - "root1": [ - {..., "total_t": 10, ...}, - {..., "total_t": 3, ...}, - ], - ... - } - Returns: - int: total number of tokens. 
- """ - total_tokens = 0 - for _, files in file_summary.items(): - for file in files: - total_tokens += file.get("total_tokens") - return total_tokens - - -def _process_pos(entities, text) -> List: - """Processes the position of the given entities.""" - for entity in entities: - count_line_delimiter = text[: entity.beg].split("\n") - entity.pos = len(count_line_delimiter) - entity.line_beg = len(count_line_delimiter[-1]) - return entities +def human_time_friendly(seconds): + TIME_DURATION_UNITS = ( + ("week", 60 * 60 * 24 * 7), + ("day", 60 * 60 * 24), + ("hour", 60 * 60), + ("min", 60), + ) + if seconds == 0: + return "inf" + accumulator = [] + for unit, div in TIME_DURATION_UNITS: + amount, seconds = divmod(float(seconds), div) + if amount > 0: + accumulator.append( + "{} {}{}".format(int(amount), unit, "" if amount == 1 else "s") + ) + accumulator.append("{} secs".format(round(seconds, 2))) + return ", ".join(accumulator) + + +# Disable +def block_print(): + sys.stdout = open(os.devnull, "w") + + +# Restore +def enable_print(): + sys.stdout = sys.__stdout__ From 434e2860015a1939585940e6240f9c6837948cf9 Mon Sep 17 00:00:00 2001 From: MING KANG Date: Tue, 14 Nov 2023 02:32:49 -0800 Subject: [PATCH 11/18] fixed file name in readme --- ads/opctl/operator/lowcode/pii/README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/ads/opctl/operator/lowcode/pii/README.md b/ads/opctl/operator/lowcode/pii/README.md index 12a25dd60..d4e4c3021 100644 --- a/ads/opctl/operator/lowcode/pii/README.md +++ b/ads/opctl/operator/lowcode/pii/README.md @@ -20,10 +20,10 @@ ads operator init -t pii --overwrite --output ~/pii/ The most important files expected to be generated are: - `pii.yaml`: Contains pii-related configuration. -- `backend_operator_local_python_config.yaml`: This includes a local backend configuration for running pii operator in a local environment. The environment should be set up manually before running the operator. -- `backend_operator_local_container_config.yaml`: This includes a local backend configuration for running pii operator within a local container. The container should be built before running the operator. Please refer to the instructions below for details on how to accomplish this. -- `backend_job_container_config.yaml`: Contains Data Science job-related config to run pii operator in a Data Science job within a container (BYOC) runtime. The container should be built and published before running the operator. Please refer to the instructions below for details on how to accomplish this. -- `backend_job_python_config.yaml`: Contains Data Science job-related config to run pii operator in a Data Science job within a conda runtime. The conda should be built and published before running the operator. +- `pii_operator_local_python.yaml`: This includes a local backend configuration for running pii operator in a local environment. The environment should be set up manually before running the operator. +- `pii_operator_local_container.yaml`: This includes a local backend configuration for running pii operator within a local container. The container should be built before running the operator. Please refer to the instructions below for details on how to accomplish this. +- `pii_job_container.yaml`: Contains Data Science job-related config to run pii operator in a Data Science job within a container (BYOC) runtime. The container should be built and published before running the operator. Please refer to the instructions below for details on how to accomplish this. 
+- `pii_job_python.yaml`: Contains Data Science job-related config to run pii operator in a Data Science job within a conda runtime. The conda should be built and published before running the operator. All generated configurations should be ready to use without the need for any additional adjustments. However, they are provided as starter kit configurations that can be customized as needed. From 6a304d4e005c1933931681edbc748513146541e8 Mon Sep 17 00:00:00 2001 From: MING KANG Date: Tue, 14 Nov 2023 02:39:35 -0800 Subject: [PATCH 12/18] adjust documentation for pii operator --- .../operators/common/yaml_schema/piiOperator.yaml | 5 ++--- docs/source/user_guide/operators/pii_operator/examples.rst | 4 +++- .../user_guide/operators/pii_operator/getting_started.rst | 7 ++++--- docs/source/user_guide/operators/pii_operator/pii.rst | 3 +-- 4 files changed, 10 insertions(+), 9 deletions(-) diff --git a/docs/source/user_guide/operators/common/yaml_schema/piiOperator.yaml b/docs/source/user_guide/operators/common/yaml_schema/piiOperator.yaml index 9599ed78c..ff295c7fa 100644 --- a/docs/source/user_guide/operators/common/yaml_schema/piiOperator.yaml +++ b/docs/source/user_guide/operators/common/yaml_schema/piiOperator.yaml @@ -50,13 +50,13 @@ spec: meta: description: "The url can be local, or remote. For example: `oci://@/`" name: - required: true + required: false type: string default: data-out.csv type: dict report: - required: true + required: false schema: report_filename: required: true @@ -67,7 +67,6 @@ spec: show_rows: required: false type: number - default: 10 meta: description: "The number of rows that shows in the report. Defaults to `10`" show_sensitive_content: diff --git a/docs/source/user_guide/operators/pii_operator/examples.rst b/docs/source/user_guide/operators/pii_operator/examples.rst index 0a300abb6..037bee176 100644 --- a/docs/source/user_guide/operators/pii_operator/examples.rst +++ b/docs/source/user_guide/operators/pii_operator/examples.rst @@ -15,6 +15,8 @@ The simplest yaml file is generated by the ``ads operator init --type pii`` and input_data: url: mydata.csv target_column: target + output_directory: + url: result/ detectors: - name: default.phone action: mask @@ -34,7 +36,7 @@ The yaml can also be maximally stated as follows: spec: output_directory: url: oci://my-bucket@my-tenancy/results - name: mydata-out.csv + name: myProcessedData.csv report: report_filename: report.html show_rows: 10 diff --git a/docs/source/user_guide/operators/pii_operator/getting_started.rst b/docs/source/user_guide/operators/pii_operator/getting_started.rst index 120a673b3..a8c455ded 100644 --- a/docs/source/user_guide/operators/pii_operator/getting_started.rst +++ b/docs/source/user_guide/operators/pii_operator/getting_started.rst @@ -8,10 +8,10 @@ Configure After having set up ``ads opctl`` on your desired machine using ``ads opctl configure``, you are ready to begin using pii operator. At a bare minimum, you will need to provide the following details about your tasks: - Path to the input data (input_data) +- Path to the output directory, where the operator will place the processed data and report.html produced from the run (output_directory) - Name of the column with user data (target_column) - Name of the detector will be used in the operator (detectors) - These details exactly match the initial pii.yaml file generated by running ``ads operator init --type pii``: .. 
code-block:: yaml @@ -23,15 +23,16 @@ These details exactly match the initial pii.yaml file generated by running ``ads input_data: url: mydata.csv target_column: target + output_directory: + url: result/ detectors: - name: default.phone - action: anonymize + action: mask Optionally, you are able to specify much more. The most common additions are: - Whether to show sensitive content in the report. (show_sensitive_content) -- Path to the output directory, where the operator will place the processed data and report.html produced from the run (output_directory) - Way to process the detected entity. (action) An extensive list of parameters can be found in the ``YAML Schema`` section. diff --git a/docs/source/user_guide/operators/pii_operator/pii.rst b/docs/source/user_guide/operators/pii_operator/pii.rst index af9a50c77..617467e8b 100644 --- a/docs/source/user_guide/operators/pii_operator/pii.rst +++ b/docs/source/user_guide/operators/pii_operator/pii.rst @@ -37,8 +37,7 @@ Here is an example pii.yaml with every parameter specified: * **detectors**: This list contains the details for each detector and action that will be taken. * **name**: The string specifies the name of the detector. The format should be ``.``. * **action**: The string specifies the way to process the detected entity. Default to mask. - - * **output_directory**: (optional) This dictionary contains the details for where to put the output artifacts. The directory need not exist, but must be accessible by the Operator during runtime. + * **output_directory**: This dictionary contains the details for where to put the output artifacts. The directory need not exist, but must be accessible by the Operator during runtime. * **url**: Insert the uri for the dataset if it's on object storage using the URI pattern ``oci://@/subfolder/``. * **name**: The string specifies the name of the processed data file. From 3fc9da682f6bb55e74c59fd15f4f97043b0d2bb8 Mon Sep 17 00:00:00 2001 From: MING KANG Date: Tue, 14 Nov 2023 02:43:50 -0800 Subject: [PATCH 13/18] fixed file name in readme --- ads/opctl/operator/lowcode/pii/README.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/ads/opctl/operator/lowcode/pii/README.md b/ads/opctl/operator/lowcode/pii/README.md index d4e4c3021..0798cf725 100644 --- a/ads/opctl/operator/lowcode/pii/README.md +++ b/ads/opctl/operator/lowcode/pii/README.md @@ -69,7 +69,7 @@ ads operator build-image -t pii This will create a new `pii:v1` image, with `/etc/operator` as the designated working directory within the container. -Check the `backend_operator_local_container_config.yaml` config file. By default, it should have a `volume` section with the `.oci` configs folder mounted. +Check the `pii_operator_local_container.yaml` config file. By default, it should have a `volume` section with the `.oci` configs folder mounted. ```yaml volume: @@ -101,7 +101,7 @@ version: v1 Run the pii operator within a container using the command below: ```bash -ads operator run -f ~/pii/pii.yaml --backend-config ~/pii/backend_operator_local_container_config.yaml +ads operator run -f ~/pii/pii.yaml --backend-config ~/pii/pii_operator_local_container.yaml ``` ## 5. 
Running pii in the Data Science job within container runtime @@ -142,7 +142,7 @@ output_directory: Run the pii operator on the Data Science jobs using the command posted below: ```bash -ads operator run -f ~/pii/pii.yaml --backend-config ~/pii/backend_job_container_config.yaml +ads operator run -f ~/pii/pii.yaml --backend-config ~/pii/pii_job_container.yaml ``` The logs can be monitored using the `ads opctl watch` command. @@ -172,7 +172,7 @@ ads opctl conda publish pii_v1 More details about configuring CLI can be found here - [Configuring CLI](https://accelerated-data-science.readthedocs.io/en/latest/user_guide/cli/opctl/configure.html) -After the conda environment is published to Object Storage, it can be used within Data Science jobs service. Check the `backend_job_python_config.yaml` config file. It should contain pre-populated infrastructure and runtime sections. The runtime section should contain a `conda` section. +After the conda environment is published to Object Storage, it can be used within Data Science jobs service. Check the `pii_job_python.yaml` config file. It should contain pre-populated infrastructure and runtime sections. The runtime section should contain a `conda` section. ```yaml conda: @@ -194,7 +194,7 @@ output_directory: Run the pii on the Data Science jobs using the command posted below: ```bash -ads operator run -f ~/pii/pii.yaml --backend-config ~/pii/backend_job_python_config.yaml +ads operator run -f ~/pii/pii.yaml --backend-config ~/pii/pii_job_python.yaml ``` The logs can be monitored using the `ads opctl watch` command. From a5e0eb448ca5dc024456fd3feaab925d5ecfadd3 Mon Sep 17 00:00:00 2001 From: MING KANG Date: Tue, 14 Nov 2023 03:00:55 -0800 Subject: [PATCH 14/18] fixed file name in readme --- ads/opctl/operator/lowcode/pii/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ads/opctl/operator/lowcode/pii/README.md b/ads/opctl/operator/lowcode/pii/README.md index 0798cf725..186324f10 100644 --- a/ads/opctl/operator/lowcode/pii/README.md +++ b/ads/opctl/operator/lowcode/pii/README.md @@ -37,7 +37,7 @@ To run forecasting locally, create and activate a new conda environment (`ads-pi - gender_guesser - nameparser - scrubadub_spacy -- "git+https://github.com/oracle/accelerated-data-science.git@feature/forecasting#egg=oracle-ads" +- "git+https://github.com/oracle/accelerated-data-science.git@feature/ads_pii_operator#egg=oracle-ads" ``` Please review the previously generated `pii.yaml` file using the `init` command, and make any necessary adjustments to the input and output file locations. By default, it assumes that the files should be located in the same folder from which the `init` command was executed. 
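The `pii.yaml` reviewed above maps one-to-one onto the dataclasses introduced in `operator_config.py` earlier in the series. As a rough illustration — assuming `PiiOperatorSpec`, `InputData`, `OutputDirectory`, and `Report` keep the field layout shown in that diff, and reusing the sample values from the docs examples — the same spec could be assembled programmatically:

```python
from ads.opctl.operator.lowcode.pii.operator_config import (
    InputData,
    OutputDirectory,
    PiiOperatorSpec,
    Report,
)

# Mirrors the minimal pii.yaml from the examples: mask phone numbers found
# in the `target` column of mydata.csv and write results under result/.
spec = PiiOperatorSpec(
    input_data=InputData(url="mydata.csv"),
    output_directory=OutputDirectory(url="result/"),
    target_column="target",
    report=Report(report_filename="report.html", show_rows=10),
    detectors=[{"name": "default.phone", "action": "mask"}],
)
```

Note that `__post_init__` only backfills `target_column` and a missing `report`, so the report settings are passed explicitly here rather than relied on as defaults.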
From 486165bbaaeb948cc116a1587e696b2a05232a03 Mon Sep 17 00:00:00 2001 From: MING KANG Date: Tue, 14 Nov 2023 03:01:16 -0800 Subject: [PATCH 15/18] added optional dependency for pii operator --- pyproject.toml | 3 +++ 1 file changed, 3 insertions(+) diff --git a/pyproject.toml b/pyproject.toml index a0c47e3da..643938077 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -163,6 +163,9 @@ forecast = [ "statsmodels", "sktime", ] +pii = [ + "datapane", +] [project.urls] "Github" = "https://github.com/oracle/accelerated-data-science" From 4601ddf0f73ff490629144e7f51eda4c7db7f1dc Mon Sep 17 00:00:00 2001 From: MING KANG Date: Tue, 14 Nov 2023 14:54:26 -0800 Subject: [PATCH 16/18] updated optional dependency for pii operator --- ads/common/decorator/runtime_dependency.py | 1 + .../operator/lowcode/pii/environment.yaml | 7 ++-- .../operator/lowcode/pii/model/factory.py | 11 +++-- ads/opctl/operator/lowcode/pii/model/pii.py | 8 ++-- .../pii/model/processor/email_replacer.py | 22 +++++++--- .../pii/model/processor/mbi_replacer.py | 16 +++++--- .../pii/model/processor/name_replacer.py | 40 ++++++++++++++----- .../pii/model/processor/number_replacer.py | 16 +++++--- .../lowcode/pii/model/processor/remover.py | 16 +++++--- .../operator/lowcode/pii/model/report.py | 26 +++++++++--- pyproject.toml | 9 ++++- 11 files changed, 122 insertions(+), 50 deletions(-) diff --git a/ads/common/decorator/runtime_dependency.py b/ads/common/decorator/runtime_dependency.py index 08ae48e78..27473ae9a 100644 --- a/ads/common/decorator/runtime_dependency.py +++ b/ads/common/decorator/runtime_dependency.py @@ -65,6 +65,7 @@ class OptionalDependency: SPARK = "oracle-ads[spark]" HUGGINGFACE = "oracle-ads[huggingface]" FORECAST = "oracle-ads[forecast]" + PII = "oracle-ads[pii]" def runtime_dependency( diff --git a/ads/opctl/operator/lowcode/pii/environment.yaml b/ads/opctl/operator/lowcode/pii/environment.yaml index e45cb2530..ca5b65680 100644 --- a/ads/opctl/operator/lowcode/pii/environment.yaml +++ b/ads/opctl/operator/lowcode/pii/environment.yaml @@ -5,12 +5,11 @@ dependencies: - python=3.9 - pip - pip: + - aiohttp - datapane - - scrubadub - gender_guesser - nameparser + - plotly + - scrubadub - scrubadub_spacy - - requests - - aiohttp - - ploty - "git+https://github.com/oracle/accelerated-data-science.git@feature/ads_pii_operator#egg=oracle-ads" diff --git a/ads/opctl/operator/lowcode/pii/model/factory.py b/ads/opctl/operator/lowcode/pii/model/factory.py index d5d0de7ae..102204ea3 100644 --- a/ads/opctl/operator/lowcode/pii/model/factory.py +++ b/ads/opctl/operator/lowcode/pii/model/factory.py @@ -6,9 +6,10 @@ import uuid -import scrubadub -from scrubadub_spacy.detectors.spacy import SpacyEntityDetector - +from ads.common.decorator.runtime_dependency import ( + OptionalDependency, + runtime_dependency, +) from ads.opctl.operator.lowcode.pii.constant import SupportedDetector from ads.opctl.operator.lowcode.pii.utils import construct_filth_cls_name @@ -38,8 +39,10 @@ class SpacyDetector(PiiBaseDetector): DEFAULT_SPACY_MODEL = "en_core_web_trf" @classmethod + @runtime_dependency(module="scrubadub", install_from=OptionalDependency.PII) + @runtime_dependency(module="scrubadub_spacy", install_from=OptionalDependency.PII) def construct(cls, entity, model, **kwargs): - spacy_entity_detector = SpacyEntityDetector( + spacy_entity_detector = scrubadub_spacy.detectors.spacy.SpacyEntityDetector( named_entities=[entity], name=f"spacy_{uuid.uuid4()}", model=model, diff --git a/ads/opctl/operator/lowcode/pii/model/pii.py 
b/ads/opctl/operator/lowcode/pii/model/pii.py index 0bb0d77ea..ba036d05e 100644 --- a/ads/opctl/operator/lowcode/pii/model/pii.py +++ b/ads/opctl/operator/lowcode/pii/model/pii.py @@ -4,9 +4,10 @@ # Copyright (c) 2023 Oracle and/or its affiliates. # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ - -import scrubadub - +from ads.common.decorator.runtime_dependency import ( + OptionalDependency, + runtime_dependency, +) from ads.opctl import logger from ads.opctl.operator.common.utils import _load_yaml_from_uri from ads.opctl.operator.lowcode.pii.model.factory import PiiDetectorFactory @@ -24,6 +25,7 @@ class PiiScrubber: """Class used for config scrubber and count the detectors in use.""" + @runtime_dependency(module="scrubadub", install_from=OptionalDependency.PII) def __init__(self, config): logger.info(f"Loading config from {config}") if isinstance(config, str): diff --git a/ads/opctl/operator/lowcode/pii/model/processor/email_replacer.py b/ads/opctl/operator/lowcode/pii/model/processor/email_replacer.py index ce77dc8ec..69a9d92ef 100644 --- a/ads/opctl/operator/lowcode/pii/model/processor/email_replacer.py +++ b/ads/opctl/operator/lowcode/pii/model/processor/email_replacer.py @@ -4,17 +4,27 @@ # Copyright (c) 2023 Oracle and/or its affiliates. # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ -from typing import Sequence +from ads.common.decorator.runtime_dependency import ( + OptionalDependency, + runtime_dependency, +) -from faker import Faker -from scrubadub.filth import Filth -from scrubadub.post_processors import PostProcessor +try: + import scrubadub +except ImportError: + raise ModuleNotFoundError( + f"`scrubadub` module was not found. Please run " + f"`pip install {OptionalDependency.PII}`." + ) -class EmailReplacer(PostProcessor): +class EmailReplacer(scrubadub.post_processors.PostProcessor): name = "email_replacer" - def process_filth(self, filth_list: Sequence[Filth]) -> Sequence[Filth]: + @runtime_dependency(module="faker", install_from=OptionalDependency.PII) + def process_filth(self, filth_list): + from faker import Faker + for filth in filth_list: if filth.replacement_string: continue diff --git a/ads/opctl/operator/lowcode/pii/model/processor/mbi_replacer.py b/ads/opctl/operator/lowcode/pii/model/processor/mbi_replacer.py index 8aa4f5e66..013526cad 100644 --- a/ads/opctl/operator/lowcode/pii/model/processor/mbi_replacer.py +++ b/ads/opctl/operator/lowcode/pii/model/processor/mbi_replacer.py @@ -6,20 +6,26 @@ import random import string -from typing import Sequence -from scrubadub.filth import Filth -from scrubadub.post_processors import PostProcessor +from ads.common.decorator.runtime_dependency import OptionalDependency +try: + import scrubadub +except ImportError: + raise ModuleNotFoundError( + f"`scrubadub` module was not found. Please run " + f"`pip install {OptionalDependency.PII}`." 
+ ) -class MBIReplacer(PostProcessor): + +class MBIReplacer(scrubadub.post_processors.PostProcessor): name = "mbi_replacer" CHAR_POOL = "ACDEFGHJKMNPQRTUVWXY" def generate_mbi(self): return "".join(random.choices(self.CHAR_POOL + string.digits, k=11)) - def process_filth(self, filth_list: Sequence[Filth]) -> Sequence[Filth]: + def process_filth(self, filth_list): for filth in filth_list: if filth.replacement_string: continue diff --git a/ads/opctl/operator/lowcode/pii/model/processor/name_replacer.py b/ads/opctl/operator/lowcode/pii/model/processor/name_replacer.py index 9cb96f0ae..2c7dde747 100644 --- a/ads/opctl/operator/lowcode/pii/model/processor/name_replacer.py +++ b/ads/opctl/operator/lowcode/pii/model/processor/name_replacer.py @@ -5,19 +5,29 @@ # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ -from typing import Sequence +from ads.common.decorator.runtime_dependency import ( + OptionalDependency, + runtime_dependency, +) -import gender_guesser.detector as gender_detector -from faker import Faker -from nameparser import HumanName -from scrubadub.filth import Filth -from scrubadub.post_processors import PostProcessor +try: + import scrubadub +except ImportError: + raise ModuleNotFoundError( + f"`scrubadub` module was not found. Please run " + f"`pip install {OptionalDependency.PII}`." + ) -class NameReplacer(PostProcessor): +class NameReplacer(scrubadub.post_processors.PostProcessor): name = "name_replacer" + @runtime_dependency(module="faker", install_from=OptionalDependency.PII) + @runtime_dependency(module="gender_guesser", install_from=OptionalDependency.PII) def __init__(self, name: str = None, mapping: dict = None): + import gender_guesser.detector as gender_detector + from faker import Faker + if mapping: self.mapping = mapping else: @@ -65,14 +75,14 @@ def unwrap_filth(self, filth_list): return processed @staticmethod - def has_initial(name: HumanName) -> bool: + def has_initial(name: "nameparser.HumanName") -> bool: for attr in ["first", "middle", "last"]: if len(str(getattr(name, attr)).strip(".")) == 1: return True return False @staticmethod - def has_non_initial(name: HumanName) -> bool: + def has_non_initial(name: "nameparser.HumanName") -> bool: for attr in ["first", "middle", "last"]: if len(str(getattr(name, attr)).strip(".")) > 1: return True @@ -87,7 +97,9 @@ def generate_component(name_component: str, generator): fake_component += "." return fake_component - def save_name_mapping(self, name: HumanName, fake_name: HumanName): + def save_name_mapping( + self, name: "nameparser.HumanName", fake_name: "nameparser.HumanName" + ): """Saves the names with initials to the mapping so that a new name will not be generated. For example, if name is "John Richard Doe", this method will save the following keys to the mapping: - J Doe @@ -124,6 +136,7 @@ def save_name_mapping(self, name: HumanName, fake_name: HumanName): f"{name.first} {name.middle[0]} {name.last}" ] = f"{fake_name.first} {fake_name.middle[0]} {fake_name.last}" + @runtime_dependency(module="nameparser", install_from=OptionalDependency.PII) def replace(self, text): """Replaces a name with fake name. @@ -138,6 +151,8 @@ def replace(self, text): str The replaced name as text. 
""" + from nameparser import HumanName + if isinstance(text, HumanName): name = text else: @@ -187,7 +202,10 @@ def replace(self, text): self.save_name_mapping(original_name, name) return str(name) - def process_filth(self, filth_list: Sequence[Filth]) -> Sequence[Filth]: + @runtime_dependency(module="nameparser", install_from=OptionalDependency.PII) + def process_filth(self, filth_list): + from nameparser import HumanName + filth_list = self.unwrap_filth(filth_list) name_filths = [] diff --git a/ads/opctl/operator/lowcode/pii/model/processor/number_replacer.py b/ads/opctl/operator/lowcode/pii/model/processor/number_replacer.py index 7e79a2f3b..5bf678991 100644 --- a/ads/opctl/operator/lowcode/pii/model/processor/number_replacer.py +++ b/ads/opctl/operator/lowcode/pii/model/processor/number_replacer.py @@ -7,13 +7,19 @@ import datetime import random import re -from typing import Sequence -from scrubadub.filth import Filth -from scrubadub.post_processors import PostProcessor +from ads.common.decorator.runtime_dependency import OptionalDependency +try: + import scrubadub +except ImportError: + raise ModuleNotFoundError( + f"`scrubadub` module was not found. Please run " + f"`pip install {OptionalDependency.PII}`." + ) -class NumberReplacer(PostProcessor): + +class NumberReplacer(scrubadub.post_processors.PostProcessor): name = "number_replacer" _ENTITIES = [ "number", @@ -52,7 +58,7 @@ def replace(self, text): return date return re.sub(r"\d", self.replace_digit, text) - def process_filth(self, filth_list: Sequence[Filth]) -> Sequence[Filth]: + def process_filth(self, filth_list): for filth in filth_list: # Do not process it if it already has a replacement. if filth.replacement_string: diff --git a/ads/opctl/operator/lowcode/pii/model/processor/remover.py b/ads/opctl/operator/lowcode/pii/model/processor/remover.py index 53d90dba3..0e014fe80 100644 --- a/ads/opctl/operator/lowcode/pii/model/processor/remover.py +++ b/ads/opctl/operator/lowcode/pii/model/processor/remover.py @@ -3,18 +3,22 @@ # Copyright (c) 2023 Oracle and/or its affiliates. # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/ +from ads.common.decorator.runtime_dependency import OptionalDependency -from typing import Sequence +try: + import scrubadub +except ImportError: + raise ModuleNotFoundError( + f"`scrubadub` module was not found. Please run " + f"`pip install {OptionalDependency.PII}`." 
+ ) -from scrubadub.filth import Filth -from scrubadub.post_processors import PostProcessor - -class Remover(PostProcessor): +class Remover(scrubadub.post_processors.PostProcessor): name = "remover" _ENTITIES = [] - def process_filth(self, filth_list: Sequence[Filth]) -> Sequence[Filth]: + def process_filth(self, filth_list): for filth in filth_list: if filth.type.lower() in self._ENTITIES: filth.replacement_string = "" diff --git a/ads/opctl/operator/lowcode/pii/model/report.py b/ads/opctl/operator/lowcode/pii/model/report.py index 44b7c6752..42167ba87 100644 --- a/ads/opctl/operator/lowcode/pii/model/report.py +++ b/ads/opctl/operator/lowcode/pii/model/report.py @@ -11,31 +11,40 @@ from dataclasses import dataclass, field from typing import Dict, List -import datapane as dp import fsspec import pandas as pd -import plotly.express as px -import plotly.graph_objects as go import requests import yaml +from ads.common.decorator.runtime_dependency import ( + OptionalDependency, + runtime_dependency, +) from ads.common.serializer import DataClassSerializable from ads.opctl import logger from ads.opctl.operator.lowcode.pii.constant import ( + DEFAULT_COLOR, DEFAULT_SHOW_ROWS, DEFAULT_TIME_OUT, DETAILS_REPORT_DESCRIPTION, FLAT_UI_COLORS, PII_REPORT_DESCRIPTION, - DEFAULT_COLOR, ) +from ads.opctl.operator.lowcode.pii.operator_config import PiiOperatorConfig from ads.opctl.operator.lowcode.pii.utils import ( block_print, compute_rate, enable_print, human_time_friendly, ) -from ads.opctl.operator.lowcode.pii.operator_config import PiiOperatorConfig + +try: + import datapane as dp +except ImportError: + raise ModuleNotFoundError( + f"`datapane` module was not found. Please run " + f"`pip install {OptionalDependency.PII}`." + ) @dataclass(repr=True) @@ -85,11 +94,13 @@ class PiiReportSpec(DataClassSerializable): LABEL_TO_COLOR_MAP = {} +@runtime_dependency(module="plotly", install_from=OptionalDependency.PII) def make_model_card(model_name="", readme_path=""): """Make render model_readme.md as model_card tab. All spacy model: https://huggingface.co/spacy For example: "en_core_web_trf": "https://huggingface.co/spacy/en_core_web_trf/raw/main/README.md". 
""" + readme_path = ( f"https://huggingface.co/spacy/{model_name}/raw/main/README.md" if model_name @@ -114,6 +125,8 @@ def make_model_card(model_name="", readme_path=""): ) try: + import plotly.graph_objects as go + eval_res = data["model-index"][0]["results"] metrics = [] values = [] @@ -158,7 +171,10 @@ def map_label_to_color(labels): return label_to_colors +@runtime_dependency(module="plotly", install_from=OptionalDependency.PII) def plot_pie(count_map) -> dp.Plot: + import plotly.express as px + cols = count_map.keys() cnts = count_map.values() ent_col_name = "EntityName" diff --git a/pyproject.toml b/pyproject.toml index b5635a17b..384aad795 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -54,6 +54,7 @@ classifiers = [ # In dependencies se "; platform_machine == 'aarch64'" to specify ARM underlying platform # Copied from install_requires list in setup.py, setup.py got removed in favor of this config file dependencies = [ + "PyYAML>=6", # pyyaml 5.4 is broken with cython 3 "asteval>=0.9.25", "cerberus>=1.3.4", "cloudpickle>=1.6.0", @@ -67,7 +68,6 @@ dependencies = [ "pandas>1.2.1,<2.1", "psutil>=5.7.2", "python_jsonschema_objects>=0.3.13", - "PyYAML>=6", # pyyaml 5.4 is broken with cython 3 "requests", "scikit-learn>=1.0", "tabulate>=0.8.9", @@ -173,7 +173,14 @@ forecast = [ "rich", ] pii = [ + "aiohttp", "datapane", + "gender_guesser", + "nameparser", + "oracle_ads[opctl]", + "plotly", + "scrubadub", + "scrubadub_spacy", ] [project.urls] From 79d48cad59e47fcccbe8619f507f307937e84467 Mon Sep 17 00:00:00 2001 From: MING KANG Date: Tue, 14 Nov 2023 15:07:08 -0800 Subject: [PATCH 17/18] fixed typo --- ads/opctl/operator/lowcode/pii/README.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/ads/opctl/operator/lowcode/pii/README.md b/ads/opctl/operator/lowcode/pii/README.md index 186324f10..156646ef4 100644 --- a/ads/opctl/operator/lowcode/pii/README.md +++ b/ads/opctl/operator/lowcode/pii/README.md @@ -27,9 +27,9 @@ The most important files expected to be generated are: All generated configurations should be ready to use without the need for any additional adjustments. However, they are provided as starter kit configurations that can be customized as needed. -## 3. Running pii on the local conda environment +## 3. Running Pii on the local conda environment -To run forecasting locally, create and activate a new conda environment (`ads-pii`). Install all the required libraries listed in the `environment.yaml` file. +To run pii operator locally, create and activate a new conda environment (`ads-pii`). Install all the required libraries listed in the `environment.yaml` file. ```yaml - datapane @@ -48,7 +48,7 @@ Use the command below to verify the pii config. ads operator verify -f ~/pii/pii.yaml ``` -Use the following command to run the forecasting within the `ads-pii` conda environment. +Use the following command to run the pii operator within the `ads-pii` conda environment. ```bash ads operator run -f ~/pii/pii.yaml -b local @@ -76,7 +76,7 @@ volume: - "/Users//.oci:/root/.oci" ``` -Mounting the OCI configs folder is only required if an OCI Object Storage bucket will be used to store the input forecasting data or output forecasting result. The input/output folders can also be mounted to the container. +Mounting the OCI configs folder is only required if an OCI Object Storage bucket will be used to store the input data or output result. The input/output folders can also be mounted to the container. 
From 79d48cad59e47fcccbe8619f507f307937e84467 Mon Sep 17 00:00:00 2001
From: MING KANG
Date: Tue, 14 Nov 2023 15:07:08 -0800
Subject: [PATCH 17/18] fixed typo

---
 ads/opctl/operator/lowcode/pii/README.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/ads/opctl/operator/lowcode/pii/README.md b/ads/opctl/operator/lowcode/pii/README.md
index 186324f10..156646ef4 100644
--- a/ads/opctl/operator/lowcode/pii/README.md
+++ b/ads/opctl/operator/lowcode/pii/README.md
@@ -27,9 +27,9 @@ The most important files expected to be generated are:
 
 All generated configurations should be ready to use without the need for any additional adjustments. However, they are provided as starter kit configurations that can be customized as needed.
 
-## 3. Running pii on the local conda environment
+## 3. Running PII on the local conda environment
 
-To run forecasting locally, create and activate a new conda environment (`ads-pii`). Install all the required libraries listed in the `environment.yaml` file.
+To run the PII operator locally, create and activate a new conda environment (`ads-pii`). Install all the required libraries listed in the `environment.yaml` file.
 
 ```yaml
 - datapane
@@ -48,7 +48,7 @@ ads operator verify -f ~/pii/pii.yaml
 ```
 
-Use the following command to run the forecasting within the `ads-pii` conda environment.
+Use the following command to run the PII operator within the `ads-pii` conda environment.
 
 ```bash
 ads operator run -f ~/pii/pii.yaml -b local
 ```
@@ -76,7 +76,7 @@ volume:
   - "/Users/<user>/.oci:/root/.oci"
 ```
 
-Mounting the OCI configs folder is only required if an OCI Object Storage bucket will be used to store the input forecasting data or output forecasting result. The input/output folders can also be mounted to the container.
+Mounting the OCI configs folder is only required if an OCI Object Storage bucket will be used to store the input data or output result. The input/output folders can also be mounted to the container.
 
 ```yaml
 volume:
@@ -130,7 +130,7 @@ ads operator publish-image pii:v1 --registry <registry>
 ```
 
 After the container is published to OCIR, it can be used within Data Science jobs service. Check the `backend_job_container_config.yaml` config file. It should contain pre-populated infrastructure and runtime sections. The runtime section should contain an image property, something like `image: iad.ocir.io/<tenancy>/pii:v1`. More details about supported options can be found in the ADS Jobs documentation - [Run a Container](https://accelerated-data-science.readthedocs.io/en/latest/user_guide/jobs/run_container.html).
 
-Adjust the `pii.yaml` config with proper input/output folders. When the forecasting is run in the Data Science job, it will not have access to local folders. Therefore, input data and output folders should be placed in the Object Storage bucket. Open the `pii.yaml` and adjust the following fields:
+Adjust the `pii.yaml` config with proper input/output folders. When the operator is run in the Data Science job, it will not have access to local folders. Therefore, input data and output folders should be placed in the Object Storage bucket. Open the `pii.yaml` and adjust the following fields:
 
 ```yaml
 input_data:

From 1b17d7761dd0032f6177f1b8a2e6e9a1a99e44ce Mon Sep 17 00:00:00 2001
From: MING KANG
Date: Tue, 14 Nov 2023 15:08:10 -0800
Subject: [PATCH 18/18] added pii into test dependency

---
 dev-requirements.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev-requirements.txt b/dev-requirements.txt
index 2244c5951..038d2bfe2 100644
--- a/dev-requirements.txt
+++ b/dev-requirements.txt
@@ -1,5 +1,5 @@
 -r test-requirements.txt
--e ".[bds,data,geo,huggingface,notebook,onnx,opctl,optuna,spark,tensorflow,text,torch,viz,forecast]"
+-e ".[bds,data,geo,huggingface,notebook,onnx,opctl,optuna,spark,tensorflow,text,torch,viz,forecast,pii]"
 arff
 category_encoders
 dask
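Since the editable install in `dev-requirements.txt` now includes `pii`, the test environment can confirm the extra was resolved. One illustrative way to check, assuming the editable install succeeded and using the `oracle-ads` distribution name from `pyproject.toml`:

```python
from importlib.metadata import requires

# Requirements guarded by the `pii` extra carry an `extra == "pii"` marker.
pii_requirements = [
    req for req in (requires("oracle-ads") or []) if 'extra == "pii"' in req
]
print(pii_requirements)  # expect nameparser, scrubadub, datapane, ...
```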