# End-to-End Data Cleaning Pipeline with Raha and Baran (Minimal and Sequential)
We build an end-to-end data cleaning pipeline with our configuration-free error detection and correction systems, Raha and Baran.

In [9]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [10]:
import pandas
import IPython.display

import raha

## Error Detection with Raha

### 1. Instantiating the Detection Class
We first instantiate the `Detection` class.

In [11]:
app_1 = raha.Detection()

# How many tuples would you label?
app_1.LABELING_BUDGET = 20

# Would you like to see the logs?
app_1.VERBOSE = True

### 2. Instantiating the Dataset
We next load and instantiate the dataset object.

In [12]:
dataset_dictionary = {
    "name": "flights",  # NOTE: this label says "flights", but the paths below point to the hospital dataset
    "path": "../datasets/hospital/dirty.csv",
    "clean_path": "../datasets/hospital/clean.csv"
}
d = app_1.initialize_dataset(dataset_dictionary)
d.dataframe.head()

Unnamed: 0,index,provider_number,name,address_1,address_2,address_3,city,state,zip,county,phone,type,owner,emergency_service,condition,measure_code,measure_name,score,sample,state_average
0,1,10018,callahan eye foundation hospital,1720 university blvd,empty,empty,birmingham,al,35233,jefferson,2053258100,acute care hospitals,voluntary non-profit - private,yes,surgical infection prevention,scip-card-2,surgery patients who were taking heart drugs c...,empty,empty,al_scip-card-2
1,2,10018,callahan eye foundation hospital,1720 university blvd,empty,empty,birmingham,al,35233,jefferson,2053258100,acute care hospitals,voluntary non-profit - private,yes,surgical infection prevention,scip-inf-1,surgery patients who were given an antibiotic ...,empty,empty,al_scip-inf-1
2,3,10018,callahan eye foundation hospital,1720 university blvd,empty,empty,birmingham,al,35233,jefferson,2053258100,acute care hospitals,voluntary non-profit - private,yes,surgical infection prevention,scip-inf-2,surgery patients who were given the right kind...,empty,empty,al_scip-inf-2
3,4,10018,callahan eye foundation hospital,1720 university blvd,empty,empty,birminghxm,al,35233,jefferson,2053258100,acute care hospitals,voluntary non-profit - private,yes,surgical infection prevention,scip-inf-3,surgery patients whose preventive antibiotics ...,empty,empty,al_scip-inf-3
4,5,10018,callahan eye foundation hospital,1720 university blvd,empty,empty,birmingham,al,35233,jefferson,2053258100,acute care hospitals,voluntary non-profit - private,yes,surgical infection prevention,scip-inf-4,all heart surgery patients whose blood sugar (...,empty,empty,al_scip-inf-4
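
The `clean_path` entry supplies a ground truth for the dirty table, and row 3 above already shows one dirty value (`birminghxm`). As a rough illustration of what such a ground truth makes possible, the following sketch compares a dirty and a clean table cell by cell. It is not Raha's code: the miniature CSV strings and the helper names `read_rows` and `find_actual_errors` are hypothetical, and only the Python standard library is used.

```python
import csv
import io

# Hypothetical miniature stand-ins for dirty.csv and clean.csv (not the real files).
DIRTY_CSV = """index,city,state
1,birmingham,al
2,birminghxm,al
3,birmingham,al
"""
CLEAN_CSV = """index,city,state
1,birmingham,al
2,birmingham,al
3,birmingham,al
"""


def read_rows(text):
    """Parse CSV text into a header and a list of rows, keeping every value as a string."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return header, [row for row in reader]


def find_actual_errors(dirty_text, clean_text):
    """Return the header and {(row, column): clean_value} for every cell that differs."""
    header, dirty_rows = read_rows(dirty_text)
    _, clean_rows = read_rows(clean_text)
    errors = {}
    for i, (dirty_row, clean_row) in enumerate(zip(dirty_rows, clean_rows)):
        for j, (dirty_value, clean_value) in enumerate(zip(dirty_row, clean_row)):
            if dirty_value != clean_value:
                errors[(i, j)] = clean_value
    return header, errors


header, actual_errors = find_actual_errors(DIRTY_CSV, CLEAN_CSV)
for (i, j), clean_value in actual_errors.items():
    print("row {}, column '{}': should be '{}'".format(i, header[j], clean_value))
```

Keying the errors by `(row, column)` matches how the labeling loop later in this notebook addresses cells, for example `cell = (d.sampled_tuple, j)`.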


## Error Correction with Baran

### 1. Instantiating the Correction Class
We first instantiate the `Correction` class.

In [13]:
app_2 = raha.Correction()

# How many tuples would you label?
app_2.LABELING_BUDGET = 20

# Would you like to see the logs?
app_2.VERBOSE = True

### 2. Initializing the Dataset Object
We next initialize the dataset object.

In [14]:
d = app_2.initialize_dataset(d)
d.dataframe.head()

Unnamed: 0,index,provider_number,name,address_1,address_2,address_3,city,state,zip,county,phone,type,owner,emergency_service,condition,measure_code,measure_name,score,sample,state_average
0,1,10018,callahan eye foundation hospital,1720 university blvd,empty,empty,birmingham,al,35233,jefferson,2053258100,acute care hospitals,voluntary non-profit - private,yes,surgical infection prevention,scip-card-2,surgery patients who were taking heart drugs c...,empty,empty,al_scip-card-2
1,2,10018,callahan eye foundation hospital,1720 university blvd,empty,empty,birmingham,al,35233,jefferson,2053258100,acute care hospitals,voluntary non-profit - private,yes,surgical infection prevention,scip-inf-1,surgery patients who were given an antibiotic ...,empty,empty,al_scip-inf-1
2,3,10018,callahan eye foundation hospital,1720 university blvd,empty,empty,birmingham,al,35233,jefferson,2053258100,acute care hospitals,voluntary non-profit - private,yes,surgical infection prevention,scip-inf-2,surgery patients who were given the right kind...,empty,empty,al_scip-inf-2
3,4,10018,callahan eye foundation hospital,1720 university blvd,empty,empty,birminghxm,al,35233,jefferson,2053258100,acute care hospitals,voluntary non-profit - private,yes,surgical infection prevention,scip-inf-3,surgery patients whose preventive antibiotics ...,empty,empty,al_scip-inf-3
4,5,10018,callahan eye foundation hospital,1720 university blvd,empty,empty,birmingham,al,35233,jefferson,2053258100,acute care hospitals,voluntary non-profit - private,yes,surgical infection prevention,scip-inf-4,all heart surgery patients whose blood sugar (...,empty,empty,al_scip-inf-4


### 3. Initializing the Error Corrector Models
Baran initializes the error corrector models.

In [15]:
app_2.initialize_models(d)

The error corrector models are initialized.


### 4. Interactive Tuple Sampling, Labeling, Model Updating, Feature Generation, and Correction Prediction
Baran then iteratively samples a tuple, and we label the data cells of each sampled tuple. It then updates the models accordingly and generates a feature vector for each pair of a data error and a correction candidate. Finally, it trains and applies a classifier to each data column to predict the final correction of each data error. Since we already labeled tuples for Raha, we use the same labeled tuples and do not label new tuples here.

In [None]:
#while len(d.labeled_tuples) < app_2.LABELING_BUDGET:
while len(d.labeled_tuples) < 2:
    app_2.sample_tuple(d)
    if d.has_ground_truth:
        app_2.label_with_ground_truth(d)
    else:
        print("Label the dirty cells in the following sampled tuple.")
        sampled_tuple = pandas.DataFrame(data=[d.dataframe.iloc[d.sampled_tuple, :]], columns=d.dataframe.columns)
        IPython.display.display(sampled_tuple)
        for j in range(d.dataframe.shape[1]):
            cell = (d.sampled_tuple, j)
            value = d.dataframe.iloc[cell]
            correction = input("What is the correction for value '{}'? Type in the same value if it is not erroneous.\n".format(value))
            user_label = 1 if value != correction else 0
            d.labeled_cells[cell] = [user_label, correction]
        d.labeled_tuples[d.sampled_tuple] = 1
    app_2.update_models(d)
    app_2.generate_features_synchronously(d)
    app_2.predict_corrections(d)

for si in d.labeled_tuples:
    d.sampled_tuple = si
    app_2.update_models(d)
    app_2.generate_features(d)
    app_2.predict_corrections(d)

Tuple 763 is sampled.
Tuple 763 is labeled.
The error corrector models are updated with new labeled tuple 763.
> /home/philipp/code/raha/raha/correction.py(338)_vicinity_based_corrector()
    336                 sum_scores = sum(models[j][ed["column"]][cv].values())
    337                 set_trace()
--> 338                 for new_value in models[j][ed["column"]][cv]:
    339

ipdb>  print(models[j])


{0: {}, 1: {'1': {'10018': 1.0}, '2': {'10018': 1.0}, '3': {'10018': 1.0}, '4': {'10018': 1.0}, '5': {'10018': 1.0}, '6': {'10018': 1.0}, '7': {'10018': 1.0}, '8': {'10018': 1.0}, '9': {'10019': 1.0}, '10': {'10019': 1.0}, '11': {'10019': 1.0}, '12': {'10019': 1.0}, '13': {'10019': 1.0}, '14': {'1xx19': 1.0}, '15': {'10019': 1.0}, '16': {'10019': 1.0}, '17': {'10019': 1.0}, '18': {'10019': 1.0}, '19': {'10019': 1.0}, '20': {'10001': 1.0}, '21': {'10001': 1.0}, '22': {'10001': 1.0}, '23': {'10001': 1.0}, '24': {'10001': 1.0}, '25': {'10001': 1.0}, '26': {'10001': 1.0}, '27': {'10001': 1.0}, '28': {'10001': 1.0}, '29': {'10001': 1.0}, '30': {'10001': 1.0}, '31': {'10001': 1.0}, '32': {'10001': 1.0}, '33': {'10001': 1.0}, '34': {'10001': 1.0}, '35': {'10001': 1.0}, '36': {'10001': 1.0}, '37': {'10001': 1.0}, '38': {'10001': 1.0}, '39': {'10001': 1.0}, '40': {'10001': 1.0}, '41': {'10001': 1.0}, '42': {'10001': 1.0}, '43': {'10001': 1.0}, '44': {'10001': 1.0}, '45': {'10005': 1.0}, '46': {

ipdb>  print(j)


0


ipdb>  print(ed)


{'column': 10, 'old_value': '33422284xx', 'vicinity': ['764', '10036', 'andalusia regional hospital', '849 south three notch street', 'empty', 'empty', 'andalusia', 'al', '36420', 'covington', '33422284xx', 'acute care hospitals', 'proprietary', 'no', 'surgical infection prevention', 'scip-inf-6', 'surgery patients needing hair removed from the surgical area before surgery who had hair removed using a safer method (electric clippers or hair removal cream c not a razor)', '100%', '204 patients', 'al_scip-inf-6']}
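
The debugger output above suggests that a vicinity model is a nested dictionary of the shape context value → candidate value → score, and that the scores of one context value are summed (`sum_scores`) before the candidates are iterated. The sketch below mimics that shape on toy data to show how co-occurrence counts can be turned into ranked correction candidates. It is an illustration under these assumptions, not Baran's implementation: the function names `build_vicinity_model` and `suggest_corrections` and the sample rows are hypothetical.

```python
from collections import defaultdict


def build_vicinity_model(rows, context_column, target_column):
    """Count how often each target value co-occurs with each context value."""
    counts = defaultdict(lambda: defaultdict(float))
    for row in rows:
        counts[row[context_column]][row[target_column]] += 1.0
    return counts


def suggest_corrections(model, context_value):
    """Return candidate values for a context value, scored by relative frequency."""
    candidates = model.get(context_value, {})
    sum_scores = sum(candidates.values())
    if sum_scores == 0:
        return {}
    return {value: score / sum_scores for value, score in candidates.items()}


# Toy (provider_number, city) pairs; the fourth row contains a typo.
rows = [
    ("10018", "birmingham"),
    ("10018", "birmingham"),
    ("10018", "birmingham"),
    ("10018", "birminghxm"),
    ("10019", "sheffield"),
]

model = build_vicinity_model(rows, context_column=0, target_column=1)
suggestions = suggest_corrections(model, "10018")
best = max(suggestions, key=suggestions.get)
print(suggestions)
print("Most likely city for provider 10018:", best)
```

Normalizing by the summed scores keeps the candidates for different context values comparable, which matters once such scores are used as features for a classifier.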


### 5. Storing Results
Baran can also store the error correction results.

In [34]:
app_2.store_results(d)

The results are stored in ../datasets/hospital/raha-baran-results-flights/error-correction/correction.dataset.


### 6. Evaluating the Error Correction Task
We can finally evaluate our error correction task.

In [35]:
p, r, f = d.get_data_cleaning_evaluation(d.corrected_cells)[-3:]
print("Baran's performance on {}:\nPrecision = {:.2f}\nRecall = {:.2f}\nF1 = {:.2f}".format(d.name, p, r, f))

Baran's performance on flights:
Precision = 0.86
Recall = 0.61
F1 = 0.71
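
The internals of `get_data_cleaning_evaluation` are not shown in this notebook. Assuming the usual definitions, precision is the fraction of proposed corrections that are right, recall is the fraction of actual errors that were fixed, and F1 is their harmonic mean. The reported numbers are consistent with that: 2 × 0.86 × 0.61 / (0.86 + 0.61) ≈ 0.71. A minimal sketch on toy data, with the hypothetical helper `evaluate_corrections`:

```python
def evaluate_corrections(corrected_cells, actual_errors):
    """
    Compute precision, recall, and F1 for a set of proposed corrections.

    corrected_cells: {(row, column): proposed_value}
    actual_errors:   {(row, column): true_value}
    """
    # A correction counts only if it hits a real error and restores the true value.
    correct = sum(
        1 for cell, value in corrected_cells.items()
        if actual_errors.get(cell) == value
    )
    precision = correct / len(corrected_cells) if corrected_cells else 0.0
    recall = correct / len(actual_errors) if actual_errors else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy ground truth: four erroneous cells and their true values.
actual_errors = {
    (0, 1): "birmingham",
    (1, 2): "al",
    (2, 3): "35233",
    (3, 1): "sheffield",
}
# Toy system output: three corrections, two of which are right.
corrected_cells = {
    (0, 1): "birmingham",
    (1, 2): "al",
    (2, 3): "35234",
}

p, r, f = evaluate_corrections(corrected_cells, actual_errors)
print("Precision = {:.2f}\nRecall = {:.2f}\nF1 = {:.2f}".format(p, r, f))
```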
