5 changes: 3 additions & 2 deletions ads/opctl/operator/lowcode/pii/README.md
@@ -36,11 +36,12 @@ To run pii operator locally, create and activate a new conda environment (`ads-p
- datapane
- gender_guesser
- nameparser
- oracle_ads[opctl]
- plotly
- spacy_transformers
- scrubadub
- scrubadub_spacy
- oracle_ads[opctl]
- spacy-transformers==1.2.5
- spacy==3.6.1
```

Please review the previously generated `pii.yaml` file using the `init` command, and make any necessary adjustments to the input and output file locations. By default, it assumes that the files should be located in the same folder from which the `init` command was executed.
5 changes: 3 additions & 2 deletions ads/opctl/operator/lowcode/pii/environment.yaml
@@ -9,8 +9,9 @@ dependencies:
- datapane
- gender_guesser
- nameparser
- oracle_ads[opctl]
- plotly
- spacy_transformers
- scrubadub
- scrubadub_spacy
- oracle_ads[opctl]
- spacy-transformers==1.2.5
- spacy==3.6.1
@@ -10,9 +10,9 @@ After having set up ``ads opctl`` on your desired machine using ``ads opctl conf
- Path to the input data (input_data)
- Path to the output directory, where the operator will place the processed data and report.html produced from the run (output_directory)
- Name of the column with user data (target_column)
- Name of the detector will be used in the operator (detectors)
- The detectors that will be used in the operator (detectors)

These details exactly match the initial pii.yaml file generated by running ``ads operator init --type pii``:
See :ref:`Configure Detector <config_detector>` for more details on how to configure the ``detectors`` parameter. These details exactly match the initial pii.yaml file generated by running ``ads operator init --type pii``:

.. code-block:: yaml

@@ -32,10 +32,10 @@ These details exactly match the initial pii.yaml file generated by running ``ads

Optionally, you are able to specify much more. The most common additions are:

- Whether to show sensitive content in the report. (show_sensitive_content)
- Way to process the detected entity. (action)
- Whether to show sensitive content in the report (show_sensitive_content)
- Way to process the detected entity (action)

An extensive list of parameters can be found in the ``YAML Schema`` section.
An extensive list of parameters can be found in the :ref:`YAML Schema <pii-yaml-schema>`.


Run
@@ -57,7 +57,7 @@ We will go through each of these output files in turn.

**mydata-out.csv**

The name of this file can be customized based on output_directory parameters in the configuration yaml. This file contains the processed dataset.
The name of this file can be customized through the ``output_directory`` parameters in the configuration YAML. This file contains the processed dataset.

**report.html**

13 changes: 12 additions & 1 deletion docs/source/user_guide/operators/pii_operator/install.rst
@@ -7,7 +7,18 @@ The PII Operator can be installed from PyPi.

.. code-block:: bash

python3 -m pip install oracle_ads[pii]
python3 -m pip install oracle_ads[pii]==2.9


After that, the Operator is ready to go!

In order to run on a job, you will need to create and publish a conda pack with ``oracle_ads[pii]`` installed. The simplest way to do this is from a Notebook Session, running the following commands:

.. code-block:: bash

odsc conda create -n ads_pii -e
conda activate /home/datascience/conda/ads_pii_v1_0
python3 -m pip install oracle-ads[pii]==2.9
odsc conda publish -s /home/datascience/conda/ads_pii_v1_0

Ensure that you have properly configured your conda pack namespace and bucket in the Launcher -> Settings -> Object Storage Settings. For more details, see :doc:`ADS Conda Set Up <../../cli/opctl/configure>`
91 changes: 88 additions & 3 deletions docs/source/user_guide/operators/pii_operator/pii.rst
@@ -35,13 +35,98 @@ Here is an example pii.yaml with every parameter specified:
* **url**: Insert the uri for the dataset if it's on object storage using the URI pattern ``oci://<bucket>@<namespace>/path/to/data.csv``.
* **target_column**: This string specifies the name of the column where the user data is within the input data.
* **detectors**: This list contains the details for each detector and action that will be taken.
* **name**: The string specifies the name of the detector. The format should be ``<type>.<entity>``.
* **name**: The string specifies the name of the detector. The format should be ``<type>.<entity>``. Check :ref:`Configure Detector <config_detector>` for more details.
* **action**: The string specifies the way to process the detected entity. Defaults to ``mask``.
* **output_directory**: This dictionary contains the details for where to put the output artifacts. The directory need not exist, but must be accessible by the Operator during runtime.
* **url**: Insert the uri for the dataset if it's on object storage using the URI pattern ``oci://<bucket>@<namespace>/subfolder/``.
* **name**: The string specifies the name of the processed data file.

* **report**: (optional) This dictionary specifies details for the generated report.
* **report_filename**: Placed into output_directory location. Defaults to report.html.
* **show_sensitive_content**: Whether to show sensitive content in the report. Defaults to false.
* **report_filename**: Placed into output_directory location. Defaults to ``report.html``.
* **show_sensitive_content**: Whether to show sensitive content in the report. Defaults to ``false``.
* **show_rows**: The number of rows shown in the report.


.. _config_detector:

Configure Detector
------------------

A detector consists of ``name`` and ``action``. The **name** parameter defines the detector that will be used, and the **action** parameter defines the way to process the entity.

Configure Name
~~~~~~~~~~~~~~

We currently support the following types of detectors:

* default
* spacy

Default
^^^^^^^

Here scrubadub's pre-defined detector is used. You can designate the name in the format of ``default.<entity>`` (e.g., ``default.phone``). Check the supported detectors from `scrubadub <https://scrubadub.readthedocs.io/en/stable/api_scrubadub_detectors.html>`_.

.. note::

If you want to de-identify `address` by this tool, `scrubadub_address` is required.
You will need to follow the `instructions`_ to install the required dependencies.

.. _instructions: https://scrubadub.readthedocs.io/en/stable/addresses.html/


spaCy
^^^^^

To use spaCy’s NER to identify entities, designate the name in the format ``spacy.<model>.<entity>`` (e.g., ``spacy.en_core_web_sm.person``).
The "entity" value can correspond to any entity that spaCy recognizes. For a list of available models and entities, please refer to the `spaCy documentation <https://spacy.io/models/en>`_.
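
The two naming schemes above can be illustrated with a small Python sketch. ``parse_detector_name`` and ``DetectorName`` are hypothetical names introduced here for illustration only; they are not part of the operator, and the operator's own parsing may differ. The sketch assumes nothing beyond the documented formats ``default.<entity>`` and ``spacy.<model>.<entity>``.

```python
from typing import NamedTuple, Optional


class DetectorName(NamedTuple):
    """Parsed pieces of a detector name (hypothetical helper type)."""
    detector_type: str
    model: Optional[str]  # only set for spaCy detectors
    entity: str


def parse_detector_name(name: str) -> DetectorName:
    """Split a detector name into type, optional model, and entity.

    Accepts ``default.<entity>`` and ``spacy.<model>.<entity>``;
    anything else raises ValueError.
    """
    parts = name.split(".")
    # Reject empty components such as "default." or "spacy..person".
    if not all(parts):
        raise ValueError(f"Malformed detector name: {name!r}")
    if parts[0] == "default" and len(parts) == 2:
        return DetectorName("default", None, parts[1])
    if parts[0] == "spacy" and len(parts) == 3:
        return DetectorName("spacy", parts[1], parts[2])
    raise ValueError(f"Unsupported detector name: {name!r}")


if __name__ == "__main__":
    print(parse_detector_name("default.phone"))
    print(parse_detector_name("spacy.en_core_web_sm.person"))
```

A name such as ``spacy.person`` is rejected in this sketch because the model component is missing.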



Configure Action
~~~~~~~~~~~~~~~~

We currently support the following types of actions:

* mask
* remove
* anonymize

Mask
^^^^

The ``mask`` action is used to mask the detected entity with the name of the entity type. It replaces the entity with a placeholder. For example, with the following configured detector:

.. code-block:: yaml

name: spacy.en_core_web_sm.person
action: mask

After processing, the input text "Hi, my name is John Doe." will become "Hi, my name is {{NAME}}."

Remove
^^^^^^

The ``remove`` action is used to delete the detected entity from the text. It completely removes the entity without replacement. For example, with the following configured detector:

.. code-block:: yaml

name: spacy.en_core_web_sm.person
action: remove

After processing, the input text "Hi, my name is John Doe." will become "Hi, my name is ."


Anonymize
^^^^^^^^^

The ``anonymize`` action can be used to obfuscate the detected sensitive information.
Currently, we provide context-aware anonymization for name, email, and number-like entities.
For example, with the following configured detector:

.. code-block:: yaml

name: spacy.en_core_web_sm.person
action: anonymize

After processing, the input text "Hi, my name is John Doe." will become "Hi, my name is Joe Blow."
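
The three actions can be compared side by side with a minimal Python sketch. ``apply_action`` is a hypothetical function written for this illustration, not the operator's implementation. It assumes the entity's character span is already known, and for ``anonymize`` it takes the replacement value from the caller instead of generating a context-aware one as the operator does.

```python
from typing import Optional


def apply_action(
    text: str,
    start: int,
    end: int,
    action: str,
    entity_type: str = "NAME",
    replacement: Optional[str] = None,
) -> str:
    """Apply mask, remove, or anonymize to text[start:end] (illustrative only)."""
    before, after = text[:start], text[end:]
    if action == "mask":
        # Replace the entity with a placeholder naming its type.
        return before + "{{" + entity_type + "}}" + after
    if action == "remove":
        # Delete the entity with no replacement.
        return before + after
    if action == "anonymize":
        # Assumption: the caller supplies the substitute value.
        if replacement is None:
            raise ValueError("anonymize needs a replacement value in this sketch")
        return before + replacement + after
    raise ValueError(f"Unknown action: {action!r}")


if __name__ == "__main__":
    text = "Hi, my name is John Doe."
    start = text.index("John Doe")
    end = start + len("John Doe")
    for action in ("mask", "remove"):
        print(apply_action(text, start, end, action))
    print(apply_action(text, start, end, "anonymize", replacement="Joe Blow"))
```

With the example sentence used in this section, the sketch reproduces the three outputs described above for ``mask``, ``remove``, and ``anonymize``.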
2 changes: 2 additions & 0 deletions docs/source/user_guide/operators/pii_operator/yaml_schema.rst
@@ -1,3 +1,5 @@
.. _pii-yaml-schema:

===========
YAML Schema
===========
27 changes: 14 additions & 13 deletions pyproject.toml
@@ -123,8 +123,8 @@ opctl = [
"nbconvert",
"nbformat",
"oci-cli",
"rich",
"py-cpuinfo",
"rich",
]
optuna = [
"optuna==2.9.0",
@@ -154,20 +154,20 @@ viz = [
"seaborn>=0.11.0",
]
forecast = [
"autots[additional]",
"datapane",
"prophet",
"pmdarima",
"statsmodels",
"sktime",
"optuna==2.9.0",
"oci-cli",
"shap",
"numpy",
"holidays==0.21.13",
"neuralprophet",
"numpy",
"oci-cli",
"optuna==2.9.0",
"oracle-ads[opctl]",
"oracle-automlx==23.2.3",
"autots[additional]",
"neuralprophet",
"pmdarima",
"prophet",
"shap",
"sktime",
"statsmodels",
]
pii = [
"aiohttp",
@@ -176,9 +176,10 @@ pii = [
"nameparser",
"oracle_ads[opctl]",
"plotly",
"spacy_transformers",
"scrubadub",
"scrubadub==2.0.1",
"scrubadub_spacy",
"spacy-transformers==1.2.5",
"spacy==3.6.1",
]

[project.urls]
7 changes: 5 additions & 2 deletions tests/unitary/with_extras/operator/pii/test_factory.py
@@ -25,8 +25,11 @@ def test_get_default_detector(self):
@pytest.mark.parametrize(
"detector_type, entity, model",
[
("spacy", "person", "en_core_web_trf"),
("spacy", "other", "en_core_web_trf"),
("spacy", "person", "en_core_web_sm"),
("spacy", "other", "en_core_web_sm"),
# ("spacy", "org", "en_core_web_trf"),
Review comment (PR author): Disabled testing with the other models because of their size; verified successfully in a local run.

# ("spacy", "loc", "en_core_web_md"),
# ("spacy", "date", "en_core_web_lg"),
],
)
def test_get_spacy_detector(self, detector_type, entity, model):