This guide shows how to set up an ingest pipeline that redacts specific information at ingest time, keeping PII from being stored in an Elasticsearch index.
- NER model
  - An NER model is used to identify information (entities) that does not have a standard pattern or structure. The most common entities identified by these models are people, organizations, and locations.
- Grok patterns (similar to regex)
  - A list of Grok patterns can be configured to identify data that has a standard pattern (SSNs, credit card numbers, etc.)
- Elastic Platinum or Enterprise license
  - This is required to run the NER model
- Machine Learning node(s)
You can run this Jupyter Python notebook to load the Hugging Face model and create the ingest pipeline.
Alternatively, see the steps below for an overview of the installation.
A compatible NER model can be loaded from the Hugging Face model hub using eland.
- The model we used in testing is dslim/bert-base-NER.
- Any Elastic-compatible NER model can be used.
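As an example, the model can be imported with eland's `eland_import_hub_model` command (the URL and credentials below are placeholders for your own cluster):

```
eland_import_hub_model \
  --url https://localhost:9200 \
  --hub-model-id dslim/bert-base-NER \
  --task-type ner
```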
An example ingest pipeline configuration is provided here:
```
PUT _ingest/pipeline/pii_script-redact
{
  ... ingest pipeline json from example
}
```
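As a rough sketch of what such a pipeline body can contain (the field names and patterns below are illustrative, not the exact example config), an inference processor runs the NER model and a redact processor masks Grok matches:

```
PUT _ingest/pipeline/pii_script-redact
{
  "processors": [
    {
      "inference": {
        "model_id": "dslim__bert-base-ner",
        "field_map": { "message": "text_field" }
      }
    },
    {
      "redact": {
        "field": "message",
        "patterns": ["%{EMAILADDRESS:EMAIL}", "%{SSN:SSN}"],
        "pattern_definitions": { "SSN": "\\d{3}-\\d{2}-\\d{4}" }
      }
    }
  ]
}
```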
- Inference Processor
  - Set `model_id` to the ID under which the model is stored in Elastic. You can find it in Kibana -> Machine Learning -> Trained Models, listed in the `id` column, or use the GET Trained Models API.
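If you prefer the API over Kibana, the stored model IDs can be listed with:

```
GET _ml/trained_models
```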
- Redact Processor
  - Add new Grok patterns to match the patterns in your data.
  - Create one Grok pattern per value you want to match and give it a name. This name will be used as the mask.
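To illustrate how a named pattern becomes the mask label, here is a minimal Python sketch of the redact behavior (the `PATTERNS` table and `redact` helper are hypothetical stand-ins, not part of Elasticsearch, which does this server-side with Grok):

```python
import re

# Hypothetical patterns mirroring Grok definitions in the redact processor.
# Each name becomes the mask label: matches are replaced with <NAME>.
PATTERNS = {
    "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
    "CREDIT_CARD": r"\b(?:\d{4}[ -]?){3}\d{4}\b",
}

def redact(text: str) -> str:
    """Replace each pattern match with its pattern name in angle brackets."""
    for name, pattern in PATTERNS.items():
        text = re.sub(pattern, f"<{name}>", text)
    return text

print(redact("SSN 123-45-6789 card 4111 1111 1111 1111"))
# → SSN <SSN> card <CREDIT_CARD>
```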
- Configure data to use the pipeline through one of these approaches:
  - Configure the process sending data to Elastic to use the ingest pipeline as part of the indexing request
  - Configure the default pipeline in the index settings
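For the second approach, the default pipeline can be set on the index, for example (the index name is a placeholder):

```
PUT my-index/_settings
{
  "index.default_pipeline": "pii_script-redact"
}
```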
- Start the NER model
- This will deploy the model to ML nodes and make it available for the inference processor
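The deployment can be started from Kibana or via the API, for example (the model ID assumes the dslim/bert-base-NER import above):

```
POST _ml/trained_models/dslim__bert-base-ner/deployment/_start
```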
- Ingest Data
- Data configured to use the ingest pipeline will now be processed