# Hands-On Practice: Data Cleaning with Data-juicer

## Introduction
In this hands-on practice session, you will learn how to perform data cleaning using the Data-juicer library in Python. Data cleaning is an essential step in the data preprocessing pipeline, ensuring that datasets are free from errors, inconsistencies, and missing values.

## Objectives
- Learn how to use the Data-juicer library for efficient data cleaning.
- Apply various data cleaning techniques to a real-world dataset.


## Setup
The dataset is available under the CC0 licence on [Kaggle](https://www.kaggle.com/datasets/venky73/spam-mails-dataset).
To use it in this hands-on, you need to convert it to the JSONL format. <br>
Additionally, you need to have the [data-juicer](https://github.com/alibaba/data-juicer), [presidio_analyzer](https://pypi.org/project/presidio-analyzer/), [presidio_anonymizer](https://pypi.org/project/presidio-anonymizer/) and [presidio_evaluator](https://pypi.org/project/presidio-evaluator/) Python packages installed. <br>
Finally, you can request generated PII to use with faker [here](https://www.fakenamegenerator.com/order.php).
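
The conversion to JSONL mentioned above can be sketched with the standard library only. This is a minimal example, not the script used to prepare the course data: the function name `csv_to_jsonl` and the assumption that the text lives in a column called `text` are illustrative and should be adapted to the file you downloaded.

```python
import csv
import json
from pathlib import Path


def csv_to_jsonl(csv_path, jsonl_path, text_column="text"):
    """Write one JSON object per CSV row; rows with an empty text field are skipped.

    `text_column` is an assumption about the CSV layout, adjust it if needed.
    Returns the number of rows written.
    """
    csv_path, jsonl_path = Path(csv_path), Path(jsonl_path)
    jsonl_path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with csv_path.open(newline="", encoding="utf-8") as src, \
            jsonl_path.open("w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            if not (row.get(text_column) or "").strip():
                continue  # nothing to clean in an empty sample
            dst.write(json.dumps(row, ensure_ascii=False) + "\n")
            written += 1
    return written
```

Calling `csv_to_jsonl("my_download.csv", "data/spam.jsonl")` (both paths are placeholders) produces a file with one JSON object per line.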

## Imports

In [None]:
# General
import json
from pathlib import Path
import pandas as pd
import os
import copy
from visualization import removed_viz, word_dist, char_dist, sample_diff

# Presidio analyzer
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import SpacyNlpEngine
from presidio_anonymizer import AnonymizerEngine
from tqdm import tqdm
from functools import partialmethod

# Presidio evaluator
from presidio_evaluator.data_generator import PresidioDataGenerator
from presidio_evaluator.data_generator.faker_extensions import (
    AddressProviderNew,
    AgeProvider,
    IpAddressProvider,
    NationalityProvider,
    OrganizationProvider,
    PhoneNumberProviderNew,
    RecordsFaker,
    UsDriverLicenseProvider,
)


# Jupyterquiz
from jupyterquiz import display_quiz

with open("questions.json", "r") as file:
    questions=json.load(file)


tqdm.__init__ = partialmethod(tqdm.__init__, disable=True)

tqdm.pandas(disable=False)

# MinHash deduplication

<hr style="border: 1px solid red;">

> <span style="color:red">**Task** </span> : 
The first step is to see how the MinHash deduplication filter is applied to a real-world dataset (paper titles and abstracts). <br>
Data-juicer is already installed. To launch a preprocessing pipeline, use the following command:<br>
`dj-process --config config.yaml`<br>
All the steps are defined in the `config.yaml` file.<br>
You only need to define a single MinHash deduplication filter in `config.yaml`, then run the command above.

An example of a documented config file is available [here](https://github.com/alibaba/data-juicer/blob/main/configs/config_all.yaml).
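
To build some intuition for what the filter computes, here is a small pure-Python sketch of shingling and MinHash similarity estimation. It is not Data-juicer's implementation: the function names, the hashing scheme and the defaults are assumptions chosen for readability. Only the parameter names `window_size` and `num_permutations` mirror the options documented in the config file.

```python
import random
import zlib

_PRIME = (1 << 61) - 1  # large prime used by the toy hash family below


def shingles(text, window_size=5, lowercase=True):
    """Return the set of word n-grams (shingles) of length `window_size`."""
    if lowercase:
        text = text.lower()
    tokens = text.split()
    if len(tokens) < window_size:
        return {" ".join(tokens)}
    return {
        " ".join(tokens[i:i + window_size])
        for i in range(len(tokens) - window_size + 1)
    }


def jaccard(a, b):
    """Exact Jaccard similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def minhash_signature(shingle_set, num_permutations=256, seed=0):
    """One minimum per hash function; the same seed must be used for all documents."""
    rng = random.Random(seed)
    coeffs = [
        (rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
        for _ in range(num_permutations)
    ]
    hashed = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * h + b) % _PRIME for h in hashed) for a, b in coeffs]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents are treated as near-duplicates when this estimate reaches the chosen `jaccard_threshold`. Note how a larger `window_size` makes short texts produce very few shingles, which is worth keeping in mind when tuning the parameters.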

**Ease level 1:**

In [None]:
content = f"""
project_name: "dedup"
dataset_path: '{os.environ.get('ALL_CCFRWORK')}/data_spellm/data_cleaning/data_dedup.jsonl'  # path to your dataset directory or file with weights(0.0-1.0), 1.0 as default.
                                                            # accepted format: 'weight1(optional) dataset1-path weight2(optional) dataset2-path'
export_path: ''                # path to processed result dataset. Supported suffixes 
# Process config example for dataset
np : 1        # number of processes to run in parallel (keep it at 1 for this hands-on)

# process schedule
# a list of several process operators with their arguments
process:  
"""

**Ease level 2:**

In [None]:
content = f"""
project_name: "dedup"
dataset_path: '{os.environ.get('ALL_CCFRWORK')}/data_spellm/data_cleaning/data_dedup.jsonl'  # path to your dataset directory or file with weights(0.0-1.0), 1.0 as default.
                                                            # accepted format: 'weight1(optional) dataset1-path weight2(optional) dataset2-path'
export_path: './outputs/dedup/deduplicated.jsonl'                # path to processed result dataset. Supported suffixes 
# Process config example for dataset
np : 1       # number of processes to run in parallel (keep it at 1 for this hands-on)

# process schedule
# a list of several process operators with their arguments
process:
    - document_minhash_deduplicator:                          # deduplicate text samples using MinHash-LSH method
        tokenization:                                       # tokenization method for text. One of [space, punctuation, character]
        window_size:                                        # window size of shingling
        num_permutations:                                   # number of permutations in minhash computing
        jaccard_threshold:                                  # the min Jaccard similarity threshold in near-duplicate detection. When the Jaccard similarity of two sample texts is >= this threshold, they are regarded as similar samples and this op will only keep one of them after deduplication
        num_bands: null                                     # number of bands in LSH. Defaults to null, in which case it is determined by an optimal-parameter computation that minimizes the weighted sum of the probabilities of false positives and false negatives
        num_rows_per_band: null                             # number of rows in each band in LSH. Defaults to null, in which case it is determined by the same optimal-parameter computation
        lowercase:                                          # whether to convert text to lower case
        ignore_pattern:
"""

**Solution:**

Run the following cell if you want to display the solution.

In [None]:
# %load solutions/solution_data_cleaning_dedup.py

Now we run the deduplication filter on the dataset

In [None]:
# Write the config file
with open("config.yaml", 'w') as f:
    f.write(content)

In [None]:
!dj-process --config config.yaml
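
The `num_bands` and `num_rows_per_band` options control the LSH step that selects candidate pairs before similarities are compared. Under the standard banding analysis, two documents with Jaccard similarity `s` become candidates with probability `1 - (1 - s^r)^b`, where `b` is the number of bands and `r` the number of rows per band. The sketch below evaluates that formula; the function names are illustrative and the example values (32 bands of 8 rows) are an assumption, not Data-juicer's defaults.

```python
def candidate_probability(similarity, num_bands, num_rows_per_band):
    """Probability that two documents share at least one identical LSH band."""
    return 1.0 - (1.0 - similarity ** num_rows_per_band) ** num_bands


def approximate_threshold(num_bands, num_rows_per_band):
    """Similarity around which the S-curve rises steeply: (1 / b) ** (1 / r)."""
    return (1.0 / num_bands) ** (1.0 / num_rows_per_band)


# Illustrative setting: 32 bands of 8 rows uses 256 permutations in total.
for s in (0.3, 0.5, 0.7, 0.9):
    print(f"similarity={s:.1f} -> candidate probability={candidate_probability(s, 32, 8):.3f}")
```

Pairs well below the approximate threshold are rarely compared (few false positives), while pairs well above it are almost always caught (few false negatives).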

<hr style="border: 1px solid red;">

## Visualization
Because this dataset is annotated, you can see the classes of the removed documents.

In [None]:
original_path = Path(f"{os.environ.get('ALL_CCFRWORK')}/data_spellm/data_cleaning/data_dedup.jsonl")
modified_path = Path('./outputs/dedup/deduplicated.jsonl')
removed_viz(original_path,modified_path)
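
If you prefer numbers to a plot, the same question can be answered with a few lines of standard-library Python. This is a sketch and not the implementation of `removed_viz`: the helper names are hypothetical, and the field names `text` and `label` are assumptions about the JSONL schema.

```python
import json
from collections import Counter


def read_jsonl(path):
    """Load a JSONL file into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def removed_by_label(original, kept, text_key="text", label_key="label"):
    """Count, per label, the original samples whose text no longer appears in `kept`.

    Samples are matched on their exact text, so this only works when the
    pipeline removes samples without rewriting the text of the kept ones.
    """
    kept_texts = Counter(sample[text_key] for sample in kept)
    removed = Counter()
    for sample in original:
        text = sample[text_key]
        if kept_texts[text] > 0:
            kept_texts[text] -= 1  # this copy survived deduplication
        else:
            removed[sample.get(label_key, "unknown")] += 1
    return removed
```

For example, `removed_by_label(read_jsonl(original_path), read_jsonl(modified_path))` returns a `Counter` mapping each class to the number of removed documents.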

In [None]:
display_quiz([questions["DC7"]],  border_radius=0, max_width=1000)

You can find statistics and logs about the filtering process in the output folder.
<hr style="border: 1px solid red;">

## A word on data, dimensions and calibration

In [None]:
display_quiz([questions["DC6"]],  border_radius=0, max_width=1000)

In [None]:
display_quiz([questions["DC5"]],  border_radius=0, max_width=1000)

<hr style="border: 1px solid red;">

> <span style="color:red">**Task** </span> : 
Try tuning some of the parameters. Does the number of filtered documents match your intuition? 

You may see some negative effects; these observations can have different causes. One of them is a mismatch in the "dimensions/scales" of the parameters. 

To get a good intuition of which scales to try, a quick look at your data is necessary. For instance, you could look at the lengths of the samples in the dataset: 





In [None]:
%matplotlib widget
# Displays a histogram of the number of characters in each sample
char_dist(original_path)

In [None]:
%matplotlib widget
# Displays a histogram of the number of words in each sample
word_dist(original_path)
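
As a complement to the histograms, a numeric summary makes it easier to compare the data with parameters such as `window_size`. The sketch below is an illustration only: the helper name `length_summary` and the `text` field name are assumptions, and `statistics.quantiles` needs at least two samples.

```python
import json
import statistics


def length_summary(jsonl_path, text_key="text"):
    """Return min, quartiles and max of word and character counts per sample."""
    word_counts, char_counts = [], []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            text = json.loads(line).get(text_key) or ""
            word_counts.append(len(text.split()))
            char_counts.append(len(text))

    def summarize(values):
        q1, median, q3 = statistics.quantiles(values, n=4)
        return {"min": min(values), "q1": q1, "median": median,
                "q3": q3, "max": max(values)}

    return {"words": summarize(word_counts), "chars": summarize(char_counts)}
```

If the first quartile of the word counts is close to the shingle window size, many samples yield only one or two shingles, and their similarity estimates become very coarse.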

> <span style="color:red">**Task** </span> : 
After looking at the length distribution of the messages, change the parameters. What do you think are reasonable values?

<hr style="border: 1px solid red;">

## Text filters

A simplified configuration file with only text filters is available in `config_text.yaml`.

<hr style="border: 1px solid red;">

> <span style="color:red">**Task** </span> : Looking at the available filters, take a moment to choose the ones that could be useful for this dataset (or the dataset you are going to use for your LLM) and see what changed in the processed dataset.
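
To experiment quickly with different combinations, a config string can be assembled from a list of operators. The helper below is a hypothetical convenience, not part of Data-juicer. The operator and parameter names in the usage note are given from memory of the documented config file, so treat them as assumptions and check them against `config_text.yaml` before relying on them; the numeric values are purely illustrative.

```python
def build_config(dataset_path, export_path, operators, project_name="text_filters"):
    """Render a minimal Data-juicer style YAML config as a string.

    `operators` is a list of (operator_name, parameters_dict) pairs, applied
    in the order given. Values are written verbatim, so keep them simple
    (numbers, booleans, plain words).
    """
    lines = [
        f'project_name: "{project_name}"',
        f"dataset_path: '{dataset_path}'",
        f"export_path: '{export_path}'",
        "np: 1",
        "",
        "process:",
    ]
    for name, params in operators:
        lines.append(f"  - {name}:")
        for key, value in params.items():
            lines.append(f"      {key}: {value}")
    return "\n".join(lines) + "\n"
```

For instance, `build_config("./data/dataset.jsonl", "./outputs/text/filtered.jsonl", [("clean_links_mapper", {}), ("text_length_filter", {"min_len": 50, "max_len": 10000}), ("words_num_filter", {"min_num": 10, "max_num": 2000})])` returns a string that can be written to `config.yaml`; the paths here are placeholders.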

In [None]:
content = f"""
project_name: "dedup"
dataset_path: '{os.environ.get('ALL_CCFRWORK')}/data_spellm/data_cleaning/data_dedup.jsonl'  # path to your dataset directory or file with weights(0.0-1.0), 1.0 as default.
                                                            # accepted format: 'weight1(optional) dataset1-path weight2(optional) dataset2-path'
export_path: './outputs/dedup/deduplicated.jsonl'                # path to processed result dataset. Supported suffixes 
# Process config example for dataset
np : 1      # number of processes to run in parallel (keep it at 1 for this hands-on)

# process schedule
# a list of several process operators with their arguments
# TODO: add more process operators to clean the dataset
process:  
"""

In [None]:
# Write the config file
with open("config.yaml", 'w') as f:
    f.write(content)

In [None]:
!dj-process --config config.yaml

## What did my filters change?

The following code lets you see the differences between two versions of the dataset.

In [None]:
original = pd.read_json(original_path, lines=True)
modified = pd.read_json(modified_path, lines=True)

In [None]:
sample_diff(original,modified)
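
To inspect a single sample in more detail, a word-level diff can be produced with `difflib` from the standard library. This is a sketch under the assumption that you compare the text of the same sample before and after processing; it is not how `sample_diff` is implemented, and the `[-...-]` / `{+...+}` markers are an arbitrary convention.

```python
import difflib


def text_diff(before, after):
    """Word-level diff: removed words as [-word-], added words as {+word+}."""
    a, b = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
        if tag in ("delete", "replace"):
            out.extend(f"[-{word}-]" for word in a[i1:i2])
        if tag in ("insert", "replace"):
            out.extend(f"{{+{word}+}}" for word in b[j1:j2])
    return " ".join(out)
```

Applied to a sample where a link-cleaning operator removed a URL, the URL shows up wrapped in `[-...-]` while the rest of the text is unchanged.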

<hr style="border: 1px solid red;">


# Data Anonymization with Presidio

In this next phase of our hands-on practice on data cleaning, we delve into the critical aspect of data anonymization. Data anonymization is the process of transforming personally identifiable information (PII) within datasets into a form where the individual cannot be identified. This is crucial for ensuring data privacy and compliance with regulations such as GDPR, HIPAA, and CCPA.

For this part of the exercise, we will be leveraging the Presidio library, a powerful tool specifically designed for data anonymization tasks. Presidio offers a wide range of anonymization techniques tailored to various types of sensitive information, including names, locations, email addresses, and more.

## Objective

The primary objective of this task is to anonymize sensitive information within our dataset, ensuring that the data remains useful for analysis while protecting the privacy of individuals represented in the data.

## Library presentation 

There are three Presidio libraries: `presidio_analyzer`, `presidio_anonymizer` and `presidio_evaluator`. <br>
The first two are used for anonymization, and the last one is used to insert fake values into the dataset. <br>
The [Presidio documentation](https://microsoft.github.io/presidio/anonymizer/) gives some hints and tips.
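
The split between an analyzer (which detects PII spans and scores them) and an anonymizer (which rewrites the text) can be illustrated without Presidio. The toy sketch below uses two regular expressions as stand-in recognizers. Everything in it (the `Detection` class, the patterns, the scores) is an assumption made for illustration and does not reflect Presidio's recognizers or API.

```python
import re
from dataclasses import dataclass


@dataclass
class Detection:
    entity_type: str
    start: int
    end: int
    score: float  # confidence of the recognizer, between 0 and 1


# Toy recognizers: (pattern, fixed confidence score). Purely illustrative.
PATTERNS = {
    "EMAIL_ADDRESS": (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), 1.0),
    "PHONE_NUMBER": (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), 0.4),
}


def analyze(text):
    """Analyzer step: return every detected span with its entity type and score."""
    return [
        Detection(name, match.start(), match.end(), score)
        for name, (pattern, score) in PATTERNS.items()
        for match in pattern.finditer(text)
    ]


def anonymize(text, detections, threshold):
    """Anonymizer step: replace spans scoring above `threshold` with <ENTITY_TYPE>.

    Spans are replaced from right to left so earlier offsets stay valid.
    Overlapping detections are not handled in this sketch.
    """
    kept = [d for d in detections if d.score > threshold]
    for d in sorted(kept, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + f"<{d.entity_type}>" + text[d.end:]
    return text
```

Raising the threshold keeps only high-confidence detections, which reduces false positives at the cost of leaving some PII in place.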

## Anonymization

<hr style="border: 1px solid red;">

> <span style="color:red">**Task** </span> : Let's create an anonymizer engine and anonymize the dataset. 


**Exercise**

In [None]:
# Parameters
threshold = 0.6

# Load the deduplicated data into a pandas dataframe
dataset = pd.read_json('', lines=True)

# Initialize the NLP engine
nlp_engine = 

# Pass the engine to the analyzer; the dataset is in English only
analyzer = AnalyzerEngine(
    nlp_engine=, 
    supported_languages=
)

# Analyzer results are passed to the AnonymizerEngine for anonymization
anonymizer = 



# Define a function to anonymize a text
def anonymize_text(text: str, threshold: float) -> str:
    # Analyze the text
    analyzer_results = 
    # Keep only the detections whose score is above the threshold
    analyzer_results_filt = [
        result for result in analyzer_results if result.score > threshold
    ]
    # Anonymize the text (replace detected PII with the relevant type, e.g. PERSON, EMAIL_ADDRESS, etc.)
    ano_text = anonymizer.anonymize(
        text=text, analyzer_results=analyzer_results_filt
    ).text
    return ano_text

# Drop samples without text first, then apply the function to the dataframe
dataset.dropna(subset=["text"], inplace=True)
dataset["text"] = dataset["text"].apply(
    lambda x: anonymize_text(x, threshold)
)

# Save the dataframe to a new file
save_path=Path('./outputs/anonymized/anonymized.jsonl')
os.makedirs(save_path.parent.as_posix(),exist_ok=True)
dataset.to_json(save_path.as_posix(), orient='records', lines=True)

**Solution:**

In [None]:
# %load solutions/solution_data_cleaning_presidio.py

Look at the new dataset: PII is now replaced by placeholders. <br>
The dataset is anonymized! Do you see any failures? 
<hr style="border: 1px solid red;">

## Replacing anonymized values with fake ones

The `presidio_evaluator` library relies on [`faker`](https://pypi.org/project/Faker/) to replace removed PII with fake values.<br>
To do so, a file with fake information is needed. You can request one [here](https://www.fakenamegenerator.com/order.php), but one is already available at `./assets/fake_pii.csv`.
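
The idea behind this step is a two-stage substitution: placeholders such as `<PERSON>` are first turned into template fields such as `{{name}}`, and each field is then filled with a value drawn from the fake records. The sketch below reproduces that idea with the standard library only. It does not use `presidio_evaluator` or `faker`, and the records, mapping and function names are invented for illustration.

```python
import random
import re

# Hypothetical subset of the placeholder-to-template mapping
PLACEHOLDER_TO_TEMPLATE = {"<PERSON>": "{{name}}", "<EMAIL_ADDRESS>": "{{email}}"}

# Invented fake records standing in for the rows of a fake-PII file
FAKE_RECORDS = [
    {"name": "Alex Martin", "email": "alex.martin@example.org"},
    {"name": "Sam Dupont", "email": "sam.dupont@example.org"},
]


def to_template(text):
    """Stage 1: turn anonymization placeholders into {{field}} templates."""
    for placeholder, template in PLACEHOLDER_TO_TEMPLATE.items():
        text = text.replace(placeholder, template)
    return text


def fill_template(template, rng):
    """Stage 2: replace every known {{field}} with a value from a random record.

    Each field is drawn independently; unknown fields are left untouched.
    """
    def substitute(match):
        record = rng.choice(FAKE_RECORDS)
        return record.get(match.group(1), match.group(0))

    return re.sub(r"\{\{(\w+)\}\}", substitute, template)
```

Passing a seeded `random.Random` instance as `rng` makes the generated text reproducible, which helps when comparing runs.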

<hr style="border: 1px solid red;">

> <span style="color:red">**Task** </span> : Replace all PII values in the dataset with generated ones 

**Exercise:**

In [None]:
# Load the data into a pandas dataframe
dataset = pd.read_json('./outputs/anonymized/anonymized.jsonl', lines=True)

# Create a new instance of the PresidioDataGenerator from a generated fake names file
fake_name_generator_file = f"{os.environ.get('ALL_CCFRWORK')}/data_spellm/data_cleaning/fake_pii.csv"
fake_name_generator_dataset = pd.read_csv(fake_name_generator_file)

# Update to match existing templates
fake_name_generator_dataset = PresidioDataGenerator.update_fake_name_generator_dataset(
    fake_name_generator_dataset
)

fake = RecordsFaker(records=fake_name_generator_dataset)
# TODO : Add the custom providers to the faker

# Instantiate the data generator with the custom providers
data_generator = PresidioDataGenerator(
)

# To transform presidio placeholders for faker
translator_inv = {
    "<PERSON>": "{{name}}",
    "<IP_ADDRESS>": "{{ip_address}}",
    "<US_DRIVER_LICENSE>": "{{us_driver_license}}",
    "<ORGANIZATION>": "{{organization}}",
    "<STREET_ADDRESS>": "{{address}}",
    "<GPE>": "{{country}}",
    "<CREDIT_CARD>": "{{credit_card_number}}",
    "<IBAN_CODE>": "{{iban}}",
    "<PHONE_NUMBER>": "{{phone_number}}",
    "<DOMAIN_NAME>": "{{url}}",
    "<US_SSN>": "{{ssn}}",
    "<EMAIL_ADDRESS>": "{{email}}",
    "<DATE_TIME>": "{{date_time}}",
    "<TITLE>": "{{prefix}}",
    "<NRP>": "{{nationality}}",
    "<ZIP_CODE>": "{{zipcode}}",
    "<AGE>": "{{age}}",
    "<LOCATION>": "{{address}}",
}

def deanonymize(template: str) -> str :
    return list(data_generator.generate_fake_data(templates=,n_samples=))[0].fake

# Apply the function to the dataframe
dataset.dropna(subset=['text'],inplace=True)
dataset["fake"] = copy.deepcopy(dataset["text"])
dataset["fake"] = dataset["fake"].replace(translator_inv, regex=True)
dataset["text"] = dataset["fake"].progress_apply(deanonymize)
dataset = dataset.drop(columns=['fake'])  # drop() returns a new dataframe, so reassign it

# Save the dataframe to a new file
save_path=Path('./outputs/deanonymized/deanonymized.jsonl')
os.makedirs(save_path.parent.as_posix(),exist_ok=True)
dataset.to_json(save_path.as_posix(), orient='records', lines=True)

**Solution:**

In [None]:
# %load solutions/solution_data_cleaning_faker.py

<hr style="border: 1px solid red;">