# End-to-End Data Cleaning Pipeline with Raha and Baran (Minimal and Sequential)
We build an end-to-end data cleaning pipeline with our configuration-free error detection and correction systems, Raha and Baran.

In [1]:
%load_ext autoreload
%autoreload 2

In [32]:
import pandas
import IPython.display

from detection import Detection
from correction import Correction

## Error Detection with Raha

### 1. Instantiating the Detection Class
We first instantiate the `Detection` class. This step is mandatory.

## Dataset Setup with Raha

In [42]:
app_1 = Detection()
app_1.VERBOSE = True

beer = {
    "name": "beer",
    "path": "../datasets/beer/dirty.csv",
    "clean_path": "../datasets/beer/clean.csv"
}
flights = {
    "name": "flights",
    "path": "../datasets/flights/dirty.csv",
    "clean_path": "../datasets/flights/clean.csv"
}
movies = {
    "name": "movies",
    "path": "../datasets/movies/dirty.csv",
    "clean_path": "../datasets/movies/clean.csv"
}
rayyan = {
    "name": "rayyan",
    "path": "../datasets/rayyan/dirty.csv",
    "clean_path": "../datasets/rayyan/clean.csv"
}
tox = {
    "name": "tox",
    "path": "../datasets/tox/dirty.csv",
    "clean_path": "../datasets/tox/clean.csv"
}
toy = {
    "name": "toy",
    "path": "../datasets/toy/dirty.csv",
    "clean_path": "../datasets/toy/clean.csv"
}
hospital = {
    "name": "hospital",
    "path": "../datasets/hospital/dirty.csv",
    "clean_path": "../datasets/hospital/clean.csv"
}

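datasets = [beer, flights, movies, rayyan, tox, toy, hospital]

# Select the dataset to clean; the outputs below were produced on "hospital".
dataset_dictionary = hospital
d = app_1.initialize_dataset(dataset_dictionary)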
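
The seven dictionaries above all follow one pattern, so they could also be generated from the dataset names. The helper below is a hypothetical convenience, not part of Raha; it assumes every dataset lives under `../datasets/<name>/` with a `dirty.csv` and a `clean.csv`, as in the paths above.

```python
def make_dataset_dictionary(name, root="../datasets"):
    """Build a dataset dictionary with the same three keys used above.

    `make_dataset_dictionary` and `root` are illustrative names, not Raha API.
    """
    return {
        "name": name,
        "path": f"{root}/{name}/dirty.csv",
        "clean_path": f"{root}/{name}/clean.csv",
    }


names = ["beer", "flights", "movies", "rayyan", "tox", "toy", "hospital"]

# Index the dictionaries by name so one dataset can be picked explicitly.
datasets_by_name = {name: make_dataset_dictionary(name) for name in names}
hospital_entry = datasets_by_name["hospital"]
```

Keying the entries by name makes the choice of dataset explicit at the call site, which avoids passing an undefined or stale variable to `initialize_dataset`.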

## Error Correction with Baran

### 1. Instantiating the Correction Class
We first instantiate the `Correction` class.

In [43]:
app_2 = Correction()

# How many tuples would you label?
app_2.LABELING_BUDGET = 20
app_2.VERBOSE = True

### 2. Initializing the Dataset Object
We next initialize the dataset object.

In [44]:
d.detected_cells = dict(d.get_actual_errors_dictionary())
d = app_2.initialize_dataset(d)
#d.dataframe.head()
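
The first line of this cell hands Baran the true error cells rather than the output of a detection run. As a rough mental model, and assuming the dictionary is keyed by `(row, column)` positions with the clean value as its value, such a dictionary could be derived by comparing the dirty and clean tables cell by cell. The function and the toy rows below are made up for illustration and are not Raha's implementation.

```python
def diff_cells(dirty_rows, clean_rows):
    """Return {(row, column): clean_value} for every cell that differs."""
    errors = {}
    for i, (dirty_row, clean_row) in enumerate(zip(dirty_rows, clean_rows)):
        for j, (dirty_value, clean_value) in enumerate(zip(dirty_row, clean_row)):
            if dirty_value != clean_value:
                errors[(i, j)] = clean_value
    return errors


# Toy tables with two injected errors (illustrative values only).
dirty = [["birmingham", "al"], ["birmxngham", "al"], ["dothan", "ax"]]
clean = [["birmingham", "al"], ["birmingham", "al"], ["dothan", "al"]]

actual_errors = diff_cells(dirty, clean)
print(actual_errors)  # {(1, 0): 'birmingham', (2, 1): 'al'}
```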

### 3. Initializing the Error Corrector Models
We then initialize the error corrector models.

In [45]:
app_2.initialize_models(d)

The error corrector models are initialized.


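### 4. Sampling, Labeling, Model Updating, Feature Generation, and Correction Prediction
Until the labeling budget is exhausted, we sample a tuple, label it with the ground truth, update the models, generate features, and predict corrections.

In [46]: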
while len(d.labeled_tuples) < app_2.LABELING_BUDGET:
    app_2.sample_tuple(d)
    if not d.has_ground_truth:
        raise ValueError('We only work with machine-sampled labels')
    app_2.label_with_ground_truth(d)
    app_2.update_models(d)
    app_2.generate_features_synchronously(d)
    app_2.predict_corrections(d)

#for si in d.labeled_tuples:
#    d.sampled_tuple = si
#    app_2.update_models(d)
#    app_2.generate_features(d)
#    app_2.predict_corrections(d)

Tuple 532 is sampled.
Tuple 532 is labeled.
The error corrector models are updated with new labeled tuple 532.
24040 pairs of (a data error, a potential correction) are featurized.
7% (35 / 509) of data errors are corrected.
Tuple 386 is sampled.
Tuple 386 is labeled.
The error corrector models are updated with new labeled tuple 386.
24059 pairs of (a data error, a potential correction) are featurized.
16% (81 / 509) of data errors are corrected.
Tuple 157 is sampled.
Tuple 157 is labeled.
The error corrector models are updated with new labeled tuple 157.
24083 pairs of (a data error, a potential correction) are featurized.
21% (106 / 509) of data errors are corrected.
Tuple 256 is sampled.
Tuple 256 is labeled.
The error corrector models are updated with new labeled tuple 256.
24115 pairs of (a data error, a potential correction) are featurized.
22% (110 / 509) of data errors are corrected.
Tuple 238 is sampled.
Tuple 238 is labeled.
The error corrector models are updated with new lab
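
With `VERBOSE` enabled, each iteration ends with a progress line such as `7% (35 / 509) of data errors are corrected.` If that output is captured as text, the counts can be extracted to see how much each newly labeled tuple adds. This is a small sketch that assumes the line format shown above; `parse_progress` is an illustrative helper, not part of Baran.

```python
import re

# Matches the "(corrected / total)" part of a progress line.
PROGRESS = re.compile(r"\((\d+) / (\d+)\) of data errors are corrected\.")


def parse_progress(log_text):
    """Return a list of (corrected, total) pairs, one per progress line."""
    return [(int(c), int(t)) for c, t in PROGRESS.findall(log_text)]


# The four complete progress lines from the output above.
log_text = """\
7% (35 / 509) of data errors are corrected.
16% (81 / 509) of data errors are corrected.
21% (106 / 509) of data errors are corrected.
22% (110 / 509) of data errors are corrected.
"""

progress = parse_progress(log_text)

# Additional errors corrected after each subsequent labeled tuple.
gains = [later[0] - earlier[0] for earlier, later in zip(progress, progress[1:])]
print(gains)  # [46, 25, 4]
```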

### 5. Storing Results
Baran can also store the error correction results.

In [None]:
app_2.store_results(d)

### 6. Evaluating the Error Correction Task
We can finally evaluate our error correction task.

In [22]:
p, r, f = d.get_data_cleaning_evaluation(d.corrected_cells)[-3:]
print("Baran's performance on {}:\nPrecision = {:.2f}\nRecall = {:.2f}\nF1 = {:.2f}".format(d.name, p, r, f))

Baran's performance on hospital:
Precision = 0.92
Recall = 0.89
F1 = 0.91


In [39]:
p, r, f = d.get_data_cleaning_evaluation(d.corrected_cells)[-3:]
print("Baran's performance on {}:\nPrecision = {:.2f}\nRecall = {:.2f}\nF1 = {:.2f}".format(d.name, p, r, f))

Baran's performance on hospital:
Precision = 0.56
Recall = 0.27
F1 = 0.37
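
For reference, the three numbers can be reproduced from cell-level counts. The sketch below assumes that a correction counts as a true positive only when it equals the clean value, that precision is taken over all proposed corrections, and that recall is taken over all actual errors. The function name and the toy dictionaries are illustrative and are not Raha's API.

```python
def correction_scores(corrections, actual_errors):
    """Compute (precision, recall, F1) for proposed corrections.

    Both arguments are {(row, column): value} dictionaries:
    `corrections` holds the proposed values, `actual_errors` the clean ones.
    """
    true_positives = sum(
        1 for cell, value in corrections.items()
        if actual_errors.get(cell) == value
    )
    precision = true_positives / len(corrections) if corrections else 0.0
    recall = true_positives / len(actual_errors) if actual_errors else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy example: four real errors, three proposed corrections, two of them right.
actual_errors = {(0, 1): "al", (1, 0): "birmingham", (2, 2): "35233", (3, 1): "al"}
corrections = {(0, 1): "al", (1, 0): "birmingham", (2, 2): "35235"}

p, r, f = correction_scores(corrections, actual_errors)
print("Precision = {:.2f}\nRecall = {:.2f}\nF1 = {:.2f}".format(p, r, f))
```

Because F1 is computed from the unrounded precision and recall, recomputing it from the two-decimal values printed above can differ in the last digit.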
