# End-to-End Data Cleaning Pipeline with Raha and Baran (Minimal and Sequential)
We build an end-to-end data cleaning pipeline with our configuration-free error detection and correction systems, Raha and Baran.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas
import IPython.display

from detection import Detection
from correction import Correction

## Error Detection with Raha

### 1. Instantiating the Dataset via the Detection Class
We first instantiate the `Detection` class. It has to be done this way.

## Weird dataset setup with Raha

In [3]:
beers = {
    "name": "beers",
    "path": "../datasets/beers/dirty.csv",
    "clean_path": "../datasets/beers/clean.csv"
}
flights = {
    "name": "flights",
    "path": "../datasets/flights/dirty.csv",
    "clean_path": "../datasets/flights/clean.csv"
}
movies = {
    "name": "movies",
    "path": "../datasets/movies_1/dirty.csv",
    "clean_path": "../datasets/movies_1/clean.csv"
}
rayyan = {
    "name": "rayyan",
    "path": "../datasets/rayyan/dirty.csv",
    "clean_path": "../datasets/rayyan/clean.csv"
}
tax = {
    "name": "tax",
    "path": "../datasets/tax/dirty.csv",
    "clean_path": "../datasets/tax/clean.csv"
}
toy = {
    "name": "toy",
    "path": "../datasets/toy/dirty.csv",
    "clean_path": "../datasets/toy/clean.csv"
}
hospital = {
    "name": "hospital",
    "path": "../datasets/hospital/dirty.csv",
    "clean_path": "../datasets/hospital/clean.csv"
}

datasets = [beers,
            flights,
            hospital,
            rayyan]
            # movies,
            # this dataset keeps crashing for me
            # tax,
            # toy,
experiments = ['adder', 'constant', 'ente']

## Error Correction with Baran

In [6]:
results = []
for e in experiments:
    for dataset_dictionary in datasets:
        for run in range(15):
            app_1 = Detection()
            app_1.VERBOSE = True
            d = app_1.initialize_dataset(dataset_dictionary)
            app_2 = Correction()

            # How many tuples would you label?
            app_2.LABELING_BUDGET = 20
            app_2.VERBOSE = True

            d.detected_cells = dict(d.get_actual_errors_dictionary())
            d = app_2.initialize_dataset(d)

            # Set the experiment
            d.experiment = e

            app_2.initialize_models(d)

            while len(d.labeled_tuples) < app_2.LABELING_BUDGET:
                app_2.sample_tuple(d)
                if not d.has_ground_truth:
                    raise ValueError('We only work with machine-sampled labels')
                app_2.label_with_ground_truth(d)
                app_2.update_models(d)
                app_2.generate_features_synchronously(d)
                #app_2.generate_features(d)
                app_2.predict_corrections(d)

            p, r, f = d.get_data_cleaning_evaluation(d.corrected_cells)[-3:]
            results.append({'dataset': d.name, 'experiment': e, 'run': run, 'precision': p, 'recall': r, 'f1': f})
            print("Baran's performance on {}:\nPrecision = {:.2f}\nRecall = {:.2f}\nF1 = {:.2f}".format(d.name, p, r, f))

The error corrector models are initialized.
Tuple 1287 is sampled.
Tuple 1287 is labeled.
The error corrector models are updated with new labeled tuple 1287.
213728 pairs of (a data error, a potential correction) are featurized.
79% (3432 / 4362) of data errors are corrected.
Tuple 543 is sampled.
Tuple 543 is labeled.
The error corrector models are updated with new labeled tuple 543.
213937 pairs of (a data error, a potential correction) are featurized.
94% (4089 / 4362) of data errors are corrected.
Tuple 386 is sampled.
Tuple 386 is labeled.
The error corrector models are updated with new labeled tuple 386.
215945 pairs of (a data error, a potential correction) are featurized.
94% (4092 / 4362) of data errors are corrected.
Tuple 889 is sampled.
Tuple 889 is labeled.
The error corrector models are updated with new labeled tuple 889.
216080 pairs of (a data error, a potential correction) are featurized.
95% (4130 / 4362) of data errors are corrected.
Tuple 309 is sampled.
Tuple 309 i

## Run with Constant Vicinity
Here I statically set `model[key][value] = 1.0`. This preserves the information about which values occur in the neighborhood of a tuple, but the relative probabilities lose their meaning.
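To make the effect of the constant setting concrete, here is a minimal sketch that contrasts a counting model with a constant one. It is not Baran's actual code: the helper names (`build_counting_model`, `build_constant_model`, `relative_scores`) and the toy pairs are assumptions made up for illustration, and only the `model[key][value]` layout is taken from the note above.

```python
from collections import defaultdict


def build_counting_model(pairs):
    """Hypothetical helper: count how often each value co-occurs with each key."""
    model = defaultdict(dict)
    for key, value in pairs:
        model[key][value] = model[key].get(value, 0.0) + 1.0
    return dict(model)


def build_constant_model(pairs):
    """Hypothetical helper: record co-occurrence only, with a fixed weight of 1.0."""
    model = defaultdict(dict)
    for key, value in pairs:
        model[key][value] = 1.0
    return dict(model)


def relative_scores(model, key):
    """Normalize the weights stored under `key` so that they sum to 1."""
    candidates = model.get(key, {})
    total = sum(candidates.values())
    if total == 0:
        return {}
    return {value: weight / total for value, weight in candidates.items()}


# Toy co-occurrences, invented for this example only.
pairs = [("Berlin", "Germany"), ("Berlin", "Germany"), ("Berlin", "Germny")]

counting = build_counting_model(pairs)
constant = build_constant_model(pairs)

counting_scores = relative_scores(counting, "Berlin")
constant_scores = relative_scores(constant, "Berlin")
```

Both models know the same candidate values for `"Berlin"`, but only the counting model ranks the frequent spelling above the rare one. In the constant model every candidate receives the same normalized score.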

In [7]:
print(results)

[{'dataset': 'beers', 'experiment': 'adder', 'run': 0, 'precision': 0.8912540490513651, 'recall': 0.8830811554332875, 'f1': 0.8871487793643482}, {'dataset': 'beers', 'experiment': 'adder', 'run': 1, 'precision': 0.8960978988686216, 'recall': 0.8897294818890418, 'f1': 0.8929023352122397}, {'dataset': 'beers', 'experiment': 'adder', 'run': 2, 'precision': 0.9755183612290782, 'recall': 0.8952315451627694, 'f1': 0.9336521219366408}, {'dataset': 'beers', 'experiment': 'adder', 'run': 3, 'precision': 0.8766620816139385, 'recall': 0.8766620816139385, 'f1': 0.8766620816139385}, {'dataset': 'beers', 'experiment': 'adder', 'run': 4, 'precision': 0.859009628610729, 'recall': 0.859009628610729, 'f1': 0.8590096286107289}, {'dataset': 'beers', 'experiment': 'adder', 'run': 5, 'precision': 0.9844142785319256, 'recall': 0.8977533241632278, 'f1': 0.9390887290167865}, {'dataset': 'beers', 'experiment': 'adder', 'run': 6, 'precision': 0.9838749083801612, 'recall': 0.9232003668042182, 'f1': 0.952572442341

In [5]:
print(results)

[{'dataset': 'beers', 'experiment': 'adder', 'precision': 0.8915441176470589, 'recall': 0.8895002292526364, 'f1': 0.8905210006885472}, {'dataset': 'flights', 'experiment': 'adder', 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}, {'dataset': 'hospital', 'experiment': 'adder', 'precision': 0.9141716566866267, 'recall': 0.899803536345776, 'f1': 0.906930693069307}, {'dataset': 'rayyan', 'experiment': 'adder', 'precision': 0.6565656565656566, 'recall': 0.34282700421940926, 'f1': 0.45045045045045046}, {'dataset': 'beers', 'experiment': 'constant', 'precision': 0.8725355341586428, 'recall': 0.8725355341586428, 'f1': 0.8725355341586428}, {'dataset': 'flights', 'experiment': 'constant', 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}, {'dataset': 'hospital', 'experiment': 'constant', 'precision': 0.8834355828220859, 'recall': 0.8487229862475442, 'f1': 0.8657314629258517}, {'dataset': 'rayyan', 'experiment': 'constant', 'precision': 0.693089430894309, 'recall': 0.35970464135021096, 'f1': 0.473611111111
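The raw list of dictionaries is hard to read, so a per-experiment summary may help. The following sketch averages one metric over all runs of each (experiment, dataset) pair using only the standard library. The function name `summarize` and the rows in `example_results` are assumptions for illustration; the rows are invented and are not measured results.

```python
from collections import defaultdict
from statistics import mean


def summarize(results, metric="f1"):
    """Average `metric` over all rows sharing the same (experiment, dataset)."""
    grouped = defaultdict(list)
    for row in results:
        grouped[(row["experiment"], row["dataset"])].append(row[metric])
    return {group: mean(values) for group, values in grouped.items()}


# Invented rows with the same keys as the entries of `results` above.
example_results = [
    {"dataset": "toy", "experiment": "adder", "run": 0,
     "precision": 0.5, "recall": 0.5, "f1": 0.5},
    {"dataset": "toy", "experiment": "adder", "run": 1,
     "precision": 1.0, "recall": 1.0, "f1": 1.0},
    {"dataset": "toy", "experiment": "constant", "run": 0,
     "precision": 0.25, "recall": 0.25, "f1": 0.25},
]

summary = summarize(example_results)
for (experiment, dataset), value in sorted(summary.items()):
    print("{} on {}: mean F1 = {:.2f}".format(experiment, dataset, value))
```

Calling `summarize(results)` on the real list should work the same way, since each entry carries the `dataset`, `experiment`, and metric keys.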

### 5. Storing Results
Baran can also store the error correction results.

In [None]:
app_2.store_results(d)

### 6. Evaluating the Error Correction Task
We can finally evaluate our error correction task.

In [None]:
p, r, f = d.get_data_cleaning_evaluation(d.corrected_cells)[-3:]
print("Baran's performance on {}:\nPrecision = {:.2f}\nRecall = {:.2f}\nF1 = {:.2f}".format(d.name, p, r, f))
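As a sanity check on the reported numbers, F1 can be recomputed from precision and recall as their harmonic mean. The sketch below assumes that `get_data_cleaning_evaluation` uses this standard definition. The helper `f1_score` is hypothetical, and the input values are copied from the `beers` / `adder` entry printed earlier.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the 'beers' / 'adder' entry from the printed results.
p = 0.8915441176470589
r = 0.8895002292526364
f = f1_score(p, r)
print("Recomputed F1 = {:.4f}".format(f))
```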