This repository has been archived by the owner on May 29, 2024. It is now read-only.

📟 CLI tool to detect sensitive personal data


okp4/detection-of-personal-data

Detection Of Personal data


Purpose

detection-of-personal-data is a CLI tool to detect sensitive personal data, including names, contact information, health details, identification numbers, and financial details.

Users provide a text file (e.g., .txt, .csv) as input. The tool processes it and writes a JSON file that indicates whether personal information is present and lists tags for the detected data.

Technology

NLTK

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

RE (Regular Expression)

A regular expression is a method used in programming for pattern matching. Regular expressions provide a flexible and concise means to match strings of text.
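As a rough illustration of regex-based detection, the sketch below scans a string for two kinds of contact information and reports the result as JSON. The patterns, the `find_tags` helper, and the output keys are hypothetical and chosen for this example only; they are not taken from the tool's source, and its real patterns and JSON layout may differ.

```python
import json
import re

# Hypothetical patterns for illustration; not the tool's actual rules.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d .-]{7,}\d"),
}


def find_tags(text: str) -> dict:
    """Return the matches for each tag plus an overall presence flag."""
    found = {tag: pattern.findall(text) for tag, pattern in PATTERNS.items()}
    # Keep only the tags that matched at least once.
    found = {tag: hits for tag, hits in found.items() if hits}
    return {"contains_personal_data": bool(found), "tags": found}


sample = "Contact Jane at jane.doe@example.com or +33 6 12 34 56 78."
print(json.dumps(find_tags(sample), indent=2))
```

Regular expressions suit data with a fixed shape, such as email addresses or identification numbers. Names and free-form health details generally need the statistical approaches described in the other subsections.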

Transformers

State-of-the-art Machine Learning for PyTorch, TensorFlow and JAX. Transformers provides APIs to easily download and train state-of-the-art pretrained models.

Usage

Retrieve command help with:

poetry run detection-of-personal-data pii-detect --help
Usage: detection-of-personal-data pii-detect [OPTIONS]

  Represents cli 'pii_detect' command

Options:
  -i, --input TEXT               path to text file  [required]
  -o, --output TEXT              output directory where json file will be
                                 written  [default: .]
  -tr, --thresh <TEXT FLOAT>...  the minimum probability of private data for
                                 labels
  -f, --force                    overwrite existing file
  --dry-run                      passthrough, will not write anything
  --help                         Show this message and exit.
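The `--force` and `--dry-run` options interact with the output file. The sketch below shows one plausible reading of the help text. The `write_result` helper, the `<name>.json` file naming, and the error raised when a file already exists are all assumptions made for illustration, not the tool's actual implementation.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def write_result(result: dict, output_dir: str, name: str,
                 force: bool = False, dry_run: bool = False) -> Optional[Path]:
    """Write `result` as JSON, honoring dry-run and force semantics."""
    target = Path(output_dir) / f"{name}.json"  # assumed naming scheme
    if dry_run:
        # Passthrough: nothing is written.
        return None
    if target.exists() and not force:
        raise FileExistsError(f"{target} exists; pass force=True to overwrite")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return target


with tempfile.TemporaryDirectory() as tmp:
    first = write_result({"tags": {}}, tmp, "sample")
    skipped = write_result({"tags": {}}, tmp, "sample", dry_run=True)
    overwritten = write_result({"tags": {"email": 1}}, tmp, "sample", force=True)
    final_content = json.loads(overwritten.read_text(encoding="utf-8"))
```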

Example:

poetry run detection-of-personal-data pii-detect \
  -tr person 0.3 \
  -tr passport 0.3 \
  -i ./tests/data/inputs_test/text \
  -o ./tests/data/outputs -f
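Each `-tr` option pairs a label with a minimum probability. The following sketch shows how such per-label thresholds could filter candidate detections. The `apply_thresholds` helper, the tuple layout, the sample scores, and the default for labels without a threshold are hypothetical; the README does not describe how the tool applies thresholds internally.

```python
from typing import Dict, List, Tuple

# (label, matched text, probability); layout assumed for this example.
Detection = Tuple[str, str, float]


def apply_thresholds(detections: List[Detection],
                     thresholds: Dict[str, float],
                     default: float = 0.0) -> List[Detection]:
    """Keep detections whose probability reaches the threshold for their label."""
    return [d for d in detections if d[2] >= thresholds.get(d[0], default)]


# Mirrors `-tr person 0.3 -tr passport 0.3` from the example command.
thresholds = {"person": 0.3, "passport": 0.3}
detections = [
    ("person", "Jane Doe", 0.92),
    ("passport", "12AB34567", 0.21),
    ("email", "jane@example.com", 0.80),
]
kept = apply_thresholds(detections, thresholds)
print(kept)
```

In this sketch the passport candidate is dropped because 0.21 is below its 0.3 threshold, while the email candidate is kept because no threshold was given for that label.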

System requirements

Python

The repository targets Python 3.9 and higher.

Poetry

The repository uses Poetry for Python packaging and dependency management. Make sure it is installed before you continue.

curl -sSL https://install.python-poetry.org | python3

Docker

To install and configure Docker on your local machine, follow the official Docker installation documentation.

Everyday activity

Build

The project is built with Poetry. Initialize it using:

poetry install

Quality Assurance

⚠️ Ensure your code complies with our linters to pass CI checks.

Code linting is performed by flake8.

poetry run flake8 --count --show-source --statistics

Static type check is performed by mypy.

poetry run mypy .

To improve code quality, our CI workflows run additional linters. Run them locally as well if you want the CI to succeed.

Markdown linting is performed by markdownlint-cli.

markdownlint "**/*.md"

Dockerfile linting is performed by hadolint.

hadolint Dockerfile

Unit Testing

⚠️ Be sure to write tests, and make sure they succeed, so that CI checks pass.

Unit testing is performed by the pytest testing framework.

poetry run pytest -v

Build & run docker image (locally)

Build a local docker image using the following command line:

docker build -t detection-of-personal-data .

Once built, you can run the container locally with the following command line:

docker run -ti --rm detection-of-personal-data

You want to get involved? 😍

Please check out the OKP4 health files.
