Anonymization

Text anonymization is a Python library for anonymizing sensitive information in text data. It focuses on Swiss French banking data.

It is based on Presidio for PII detection and CamemBERT for NER.

Install

You must have conda and git installed.

Create a conda environment with Python 3.10 and activate it:

conda create -n my_env python=3.10
conda activate my_env

Clone the project and install it:

git clone https://gitlab.idiap.ch/nba/anonymization.git
cd anonymization
pip install -e . # install in editable mode
configure  # Download models
pytest -sv tests  # (optional) run the test suite to make sure everything is working as expected

Quick start

Anonymize a text (.txt), CSV (.csv) or Excel (.xlsx) file, for example /path/to/my_file.xlsx, by calling:

anonymize -f /path/to/my_file.xlsx

This writes the anonymized file to /path/to/my_file_anonymized.xlsx

You can use the test example:

anonymize -f ./tests/example.txt -c ./tests/config.json

Advanced configuration

You can pass a customized configuration to run your anonymization.

To generate a default configuration file (used by default when running anonymize):

gen_config

This creates a .json file with the following fields:

Keyword	Description
entities	List of entities you want to anonymize. By default, it lists all the available entities. For example: "Mon nom est Alfred, voici mon numéro: 079563684" results in "Mon nom est <ANONYM_PER>, voici mon numéro <ANONYM_PHONE>"
flag_only	Boolean. If True, the anonymization only flags the sensitive components of the text and does not remove them. For example: "Mon nom est Alfred, voici mon numéro: 079563684" results in "Mon nom est , voici mon numéro <FLAG 079563684>".
language	Language of the input text, one of "fr", "en", "de". The current version is specialized for French.
process_columns	List of integers. If your input file is an Excel or CSV file, the anonymization is applied only to the specified columns of the data.
pseudonymize	List of entities to pseudonymize, i.e. replace the flagged text with fake values (e.g. fake names). Every entry must already be present in the entities list. Entities that are not pseudonymized are anonymized. For example, if only "PERSON" is given to pseudonymize: "Mon nom est Alfred, voici mon numéro: 079563684" results in "Mon nom est Bernard, voici mon numéro <ANONYM_PHONE>"
use_camembert	Boolean. If true, use the French camembert_ner model for NER. Detectors are cumulative (all are used by default).
use_spacy	Boolean. If true, use spaCy for NER and PII detection. Detectors are cumulative (all are used by default).
use_swiss_ner	Boolean. If true, use a spaCy NER model specialized in Swiss entity recognition. Detectors are cumulative (all are used by default).
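
To make the difference between anonymizing, pseudonymizing and flagging concrete, here is a toy sketch of the three replacement modes. It is not the library's implementation: detection is skipped (the spans are located by hand), the `Span`, `find_span` and `redact` names are hypothetical, the labels `PER` and `PHONE` are taken from the `<ANONYM_...>` tags in the examples, and rendering a flagged name as `<FLAG Alfred>` is assumed by analogy with the phone number.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    label: str  # e.g. "PER" or "PHONE", as in the <ANONYM_...> tags


def find_span(text: str, value: str, label: str) -> Span:
    """Stand-in for a detector: locate a known value in the text."""
    start = text.index(value)
    return Span(start, start + len(value), label)


def redact(text, spans, flag_only=False, pseudonyms=None):
    """Replace each span according to the selected mode."""
    pseudonyms = pseudonyms or {}
    out = text
    # Work right to left so the offsets of earlier spans stay valid.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        original = text[span.start:span.end]
        if flag_only:
            replacement = f"<FLAG {original}>"       # keep the value, mark it
        elif span.label in pseudonyms:
            replacement = pseudonyms[span.label]     # swap in a fake value
        else:
            replacement = f"<ANONYM_{span.label}>"   # remove the value
        out = out[:span.start] + replacement + out[span.end:]
    return out


text = "Mon nom est Alfred, voici mon numéro: 079563684"
spans = [find_span(text, "Alfred", "PER"), find_span(text, "079563684", "PHONE")]

print(redact(text, spans))
print(redact(text, spans, pseudonyms={"PER": "Bernard"}))
print(redact(text, spans, flag_only=True))
```

Replacing from the last span to the first avoids recomputing offsets after each substitution, since a replacement never shifts the text to its left.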

To use a customized config.json configuration file:

anonymize -f /path/to/my_file.xlsx -c config.json
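
A customized file can also be derived from the generated default with a few lines of Python. The sketch below is a suggestion, not part of the package: the function names are hypothetical, and the checks only encode what the field table states (the known keys, the three language codes, and the rule that pseudonymized entities must also be listed in entities).

```python
import json
from pathlib import Path

# Field names documented in the configuration table.
KNOWN_KEYS = {
    "entities", "flag_only", "language", "process_columns",
    "pseudonymize", "use_camembert", "use_spacy", "use_swiss_ner",
}


def customize_config(base: dict, **overrides) -> dict:
    """Return a copy of `base` with overrides applied and basic checks run."""
    unknown = set(overrides) - KNOWN_KEYS
    if unknown:
        raise KeyError(f"unknown configuration keys: {sorted(unknown)}")
    config = {**base, **overrides}
    # Pseudonymized entities must already be present in the entities list.
    missing = set(config.get("pseudonymize", [])) - set(config.get("entities", []))
    if missing:
        raise ValueError(f"not listed in 'entities': {sorted(missing)}")
    if config.get("language") not in (None, "fr", "en", "de"):
        raise ValueError("language must be one of 'fr', 'en', 'de'")
    return config


def write_custom_config(src: str, dst: str, **overrides) -> dict:
    """Load the default config from `src`, customize it, and write it to `dst`."""
    base = json.loads(Path(src).read_text(encoding="utf-8"))
    config = customize_config(base, **overrides)
    Path(dst).write_text(
        json.dumps(config, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return config
```

For instance, `write_custom_config(default_path, "config.json", pseudonymize=["PERSON"])` would produce a file to pass with `-c`, where `default_path` stands for wherever `gen_config` wrote its output.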

For more help:

anonymize -h

DECLARATIONS

Please be advised that the use of this code comes with no guarantees or warranties. Users are responsible for its application, and no liability is assumed by the developer for any consequences arising from its use.

ACKNOWLEDGEMENTS

This package was developed with the support of the Banque Cantonale du Valais (BCVs).

LICENCE

SPDX-FileContributor: Théophile Gentilhomme theophile.gentilhomme@idiap.ch

SPDX-License-Identifier: GPL-3.0-only
