diff --git a/ads/opctl/operator/lowcode/pii/README.md b/ads/opctl/operator/lowcode/pii/README.md
index f24cda8ce..59b2c43f8 100644
--- a/ads/opctl/operator/lowcode/pii/README.md
+++ b/ads/opctl/operator/lowcode/pii/README.md
@@ -36,11 +36,12 @@ To run pii operator locally, create and activate a new conda environment (`ads-p
 - datapane
 - gender_guesser
 - nameparser
+- oracle_ads[opctl]
 - plotly
-- spacy_transformers
 - scrubadub
 - scrubadub_spacy
-- oracle_ads[opctl]
+- spacy-transformers==1.2.5
+- spacy==3.6.1
 ```

 Please review the previously generated `pii.yaml` file using the `init` command, and make any necessary adjustments to the input and output file locations. By default, it assumes that the files should be located in the same folder from which the `init` command was executed.
diff --git a/ads/opctl/operator/lowcode/pii/environment.yaml b/ads/opctl/operator/lowcode/pii/environment.yaml
index 4f5b75b67..ffd60045e 100644
--- a/ads/opctl/operator/lowcode/pii/environment.yaml
+++ b/ads/opctl/operator/lowcode/pii/environment.yaml
@@ -9,8 +9,9 @@ dependencies:
   - datapane
   - gender_guesser
   - nameparser
+  - oracle_ads[opctl]
   - plotly
-  - spacy_transformers
   - scrubadub
   - scrubadub_spacy
-  - oracle_ads[opctl]
+  - spacy-transformers==1.2.5
+  - spacy==3.6.1
diff --git a/docs/source/user_guide/operators/pii_operator/getting_started.rst b/docs/source/user_guide/operators/pii_operator/getting_started.rst
index a8c455ded..a5ce67d6a 100644
--- a/docs/source/user_guide/operators/pii_operator/getting_started.rst
+++ b/docs/source/user_guide/operators/pii_operator/getting_started.rst
@@ -10,9 +10,9 @@ After having set up ``ads opctl`` on your desired machine using ``ads opctl conf
 - Path to the input data (input_data)
 - Path to the output directory, where the operator will place the processed data and report.html produced from the run (output_directory)
 - Name of the column with user data (target_column)
-- Name of the detector will be used in the operator (detectors)
+- The
detector will be used in the operator (detectors)

-These details exactly match the initial pii.yaml file generated by running ``ads operator init --type pii``:
+You can check :ref:`Configure Detector ` for more details on how to configure ``detectors`` parameter. These details exactly match the initial pii.yaml file generated by running ``ads operator init --type pii``:

 .. code-block:: yaml

@@ -32,10 +32,10 @@ These details exactly match the initial pii.yaml file generated by running ``ads

 Optionally, you are able to specify much more. The most common additions are:

-- Whether to show sensitive content in the report. (show_sensitive_content)
-- Way to process the detected entity. (action)
+- Whether to show sensitive content in the report (show_sensitive_content)
+- Way to process the detected entity (action)

-An extensive list of parameters can be found in the ``YAML Schema`` section.
+An extensive list of parameters can be found in the :ref:`YAML Schema `.


 Run
@@ -57,7 +57,7 @@ We will go through each of these output files in turn.

 **mydata-out.csv**

-The name of this file can be customized based on output_directory parameters in the configuration yaml. This file contains the processed dataset.
+The name of this file can be customized based on ``output_directory`` parameters in the configuration yaml. This file contains the processed dataset.

 **report.html**

diff --git a/docs/source/user_guide/operators/pii_operator/install.rst b/docs/source/user_guide/operators/pii_operator/install.rst
index ae581315b..7386f69cf 100644
--- a/docs/source/user_guide/operators/pii_operator/install.rst
+++ b/docs/source/user_guide/operators/pii_operator/install.rst
@@ -7,7 +7,18 @@ The PII Operator can be installed from PyPi.

 .. code-block:: bash

-    python3 -m pip install oracle_ads[pii]
+    python3 -m pip install oracle_ads[pii]==2.9

 After that, the Operator is ready to go!
+
+In order to run on a job, you will need to create and publish a conda pack with ``oracle_ads[pii]`` installed. The simplest way to do this is from a Notebook Session, running the following commands:
+
+.. code-block:: bash
+
+    odsc conda create -n ads_pii -e
+    conda activate /home/datascience/conda/ads_pii_v1_0
+    python3 -m pip install oracle-ads[pii]==2.9
+    odsc conda publish -s /home/datascience/conda/ads_pii_v1_0
+
+Ensure that you have properly configured your conda pack namespace and bucket in the Launcher -> Settings -> Object Storage Settings. For more details, see :doc:`ADS Conda Set Up <../../cli/opctl/configure>`
diff --git a/docs/source/user_guide/operators/pii_operator/pii.rst b/docs/source/user_guide/operators/pii_operator/pii.rst
index 617467e8b..92cc47254 100644
--- a/docs/source/user_guide/operators/pii_operator/pii.rst
+++ b/docs/source/user_guide/operators/pii_operator/pii.rst
@@ -35,13 +35,98 @@ Here is an example pii.yaml with every parameter specified:
     * **url**: Insert the uri for the dataset if it's on object storage using the URI pattern ``oci://@/path/to/data.csv``.
 * **target_column**: This string specifies the name of the column where the user data is within the input data.
 * **detectors**: This list contains the details for each detector and action that will be taken.
-    * **name**: The string specifies the name of the detector. The format should be ``.``.
+    * **name**: The string specifies the name of the detector. The format should be ``.``. Check :ref:`Configure Detector ` for more details.
     * **action**: The string specifies the way to process the detected entity. Default to mask.
 * **output_directory**: This dictionary contains the details for where to put the output artifacts. The directory need not exist, but must be accessible by the Operator during runtime.
     * **url**: Insert the uri for the dataset if it's on object storage using the URI pattern ``oci://@/subfolder/``.
     * **name**: The string specifies the name of the processed data file.
 * **report**: (optional) This dictionary specific details for the generated report.
-    * **report_filename**: Placed into output_directory location. Defaults to report.html.
-    * **show_sensitive_content**: Whether to show sensitive content in the report. Defaults to false.
+    * **report_filename**: Placed into output_directory location. Defaults to ``report.html``.
+    * **show_sensitive_content**: Whether to show sensitive content in the report. Defaults to ``false``.
     * **show_rows**: The number of rows that shows in the report.
+
+
+.. _config_detector:
+
+Configure Detector
+------------------
+
+A detector consists of ``name`` and ``action``. The **name** parameter defines the detector that will be used, and the **action** parameter defines the way to process the entity.
+
+Configure Name
+~~~~~~~~~~~~~~
+
+We currently support the following type of detectors:
+
+* default
+* spacy
+
+Default
+^^^^^^^
+
+Here scrubadub's pre-defined detector is used. You can designate the name in the format of ``default.`` (e.g., ``default.phone``). Check the supported detectors from `scrubadub `_.
+
+.. note::
+
+    If you want to de-identify `address` by this tool, `scrubadub_address` is required.
+    You will need to follow the `instructions`_ to install the required dependencies.
+
+    .. _instructions: https://scrubadub.readthedocs.io/en/stable/addresses.html/
+
+
+spaCy
+^^^^^
+
+To use spaCy’s NER to identify entity, you can designate the name in the format of ``spacy..`` (e.g., ``spacy.en_core_web_sm.person``).
+The "entity" value can correspond to any entity that spaCy recognizes. For a list of available models and entities, please refer to the `spaCy documentation `_.
+
+
+
+Configure Action
+~~~~~~~~~~~~~~~~
+
+We currently support the following types of actions:
+
+* mask
+* remove
+* anonymize
+
+Mask
+^^^^
+
+The ``mask`` action is used to mask the detected entity with the name of the entity type.
It replaces the entity with a placeholder. For example, with the following configured detector:
+
+.. code-block:: yaml
+
+    name: spacy.en_core_web_sm.person
+    action: mask
+
+After processing, the input text "Hi, my name is John Doe." will become "Hi, my name is {{NAME}}."
+
+Remove
+^^^^^^
+
+The ``remove`` action is used to delete the detected entity from the text. It completely removes the entity without replacement. For example, with the following configured detector:
+
+.. code-block:: yaml
+
+    name: spacy.en_core_web_sm.person
+    action: remove
+
+After processing, the input text "Hi, my name is John Doe." will become "Hi, my name is ."
+
+
+Anonymize
+^^^^^^^^^
+
+The ``anonymize`` action can be used to obfuscate the detected sensitive information.
+Currently, we provide context-aware anonymization for name, email, and number-like entities.
+For example, with the following configured detector:
+
+.. code-block:: yaml
+
+    name: spacy.en_core_web_sm.person
+    action: anonymize
+
+After processing, the input text "Hi, my name is John Doe." will become "Hi, my name is Joe Blow."
diff --git a/docs/source/user_guide/operators/pii_operator/yaml_schema.rst b/docs/source/user_guide/operators/pii_operator/yaml_schema.rst
index ee1318f8c..6a887b5e1 100644
--- a/docs/source/user_guide/operators/pii_operator/yaml_schema.rst
+++ b/docs/source/user_guide/operators/pii_operator/yaml_schema.rst
@@ -1,3 +1,5 @@
+..
_pii-yaml-schema:
+
 ===========
 YAML Schema
 ===========
diff --git a/pyproject.toml b/pyproject.toml
index 32c34574a..c8caf66bb 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -123,8 +123,8 @@ opctl = [
   "nbconvert",
   "nbformat",
   "oci-cli",
-  "rich",
   "py-cpuinfo",
+  "rich",
 ]
 optuna = [
   "optuna==2.9.0",
@@ -154,20 +154,20 @@ viz = [
   "seaborn>=0.11.0",
 ]
 forecast = [
+  "autots[additional]",
   "datapane",
-  "prophet",
-  "pmdarima",
-  "statsmodels",
-  "sktime",
-  "optuna==2.9.0",
-  "oci-cli",
-  "shap",
-  "numpy",
   "holidays==0.21.13",
+  "neuralprophet",
+  "numpy",
+  "oci-cli",
+  "optuna==2.9.0",
   "oracle-ads[opctl]",
   "oracle-automlx==23.2.3",
-  "autots[additional]",
-  "neuralprophet",
+  "pmdarima",
+  "prophet",
+  "shap",
+  "sktime",
+  "statsmodels",
 ]
 pii = [
   "aiohttp",
@@ -176,9 +176,10 @@ pii = [
   "nameparser",
   "oracle_ads[opctl]",
   "plotly",
-  "spacy_transformers",
-  "scrubadub",
+  "scrubadub==2.0.1",
   "scrubadub_spacy",
+  "spacy-transformers==1.2.5",
+  "spacy==3.6.1",
 ]

 [project.urls]
diff --git a/tests/unitary/with_extras/operator/pii/test_factory.py b/tests/unitary/with_extras/operator/pii/test_factory.py
index 431034bda..04e153cd0 100644
--- a/tests/unitary/with_extras/operator/pii/test_factory.py
+++ b/tests/unitary/with_extras/operator/pii/test_factory.py
@@ -25,8 +25,11 @@ def test_get_default_detector(self):
     @pytest.mark.parametrize(
         "detector_type, entity, model",
         [
-            ("spacy", "person", "en_core_web_trf"),
-            ("spacy", "other", "en_core_web_trf"),
+            ("spacy", "person", "en_core_web_sm"),
+            ("spacy", "other", "en_core_web_sm"),
+            # ("spacy", "org", "en_core_web_trf"),
+            # ("spacy", "loc", "en_core_web_md"),
+            # ("spacy", "date", "en_core_web_lg"),
         ],
     )
     def test_get_spacy_detector(self, detector_type, entity, model):