# DataWig Examples

## Installation

Clone the repository and create a virtual environment in the root directory of the package:

```
python3 -m venv venv
```

Install the package from local sources:

```
./venv/bin/pip install -e .
```

## Running DataWig
The DataWig API expects your data as a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). Here is an example of how the dataframe might look:

|Product Type | Description           | Size | Color |
|-------------|-----------------------|------|-------|
|   Shoe      | Ideal for Running     | 12UK | Black |
| SDCards     | Best SDCard ever ...  | 8GB  | Blue  |
| Dress       | This **yellow** dress | M    | **?** |
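
As a minimal sketch, the table above could be built as a DataFrame like this. Encoding the unknown color as `None` (which pandas treats as a missing value) is an assumption about how the `?` cell would be represented.

```python
import pandas as pd

# The example table from above; the unknown color is stored as a missing value.
df = pd.DataFrame(
    {
        "Product Type": ["Shoe", "SDCards", "Dress"],
        "Description": ["Ideal for Running", "Best SDCard ever ...", "This yellow dress"],
        "Size": ["12UK", "8GB", "M"],
        "Color": ["Black", "Blue", None],
    }
)

# Rows where the column we would like to impute is missing
missing_color = df[df["Color"].isna()]
print(missing_color)
```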

DataWig lets you impute missing values in two ways:
  * A `.complete` functionality inspired by [`fancyimpute`](https://github.com/iskandr/fancyimpute)
  * A `sklearn`-like API with `.fit` and `.predict` methods

## Quickstart Example

### Using `AutoGluonImputer.complete`


In [1]:
# Add a parent directory to sys.path so that the local datawig package can be imported
from pathlib import Path
import sys,os
path_root = Path(os.getcwd()).parents[2]
sys.path.append(str(path_root))

In [2]:
import os, random, warnings
import numpy as np
import datawig

random.seed(0)
np.random.seed(0)  # the mask below is drawn with NumPy, so seed it as well
warnings.filterwarnings("ignore")

# generate some data with simple nonlinear dependency
df = datawig.utils.generate_df_numeric()
# mask roughly 20% of the values (every cell whose random draw exceeds 0.8)
df_with_missing = df.mask(np.random.rand(*df.shape) > .8)

# impute missing values
df_with_missing_imputed = datawig.AutoGluonImputer.complete(df_with_missing)

df['f(x) with_missing'] = df_with_missing['f(x)']
df['f(x) imputed'] = df_with_missing_imputed['f(x)']
df[-5:]

TypeError: __init__() got an unexpected keyword argument 'precision_threshold'
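
To see what the masking step does, here is a small sketch on a toy DataFrame (the column contents and the sample size are made up for illustration). Each cell is replaced by NaN independently whenever its random draw exceeds 0.8, so roughly 20% of the cells end up missing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator, for reproducibility

# Toy data standing in for the DataWig example frame
toy = pd.DataFrame({"x": rng.normal(size=1000), "f(x)": rng.normal(size=1000)})

# True where a cell should be hidden; P(draw > 0.8) = 0.2
hide = rng.random(toy.shape) > 0.8
toy_with_missing = toy.mask(hide)

# Fraction of missing cells, which should be close to 0.2
missing_fraction = toy_with_missing.isna().to_numpy().mean()
print(f"missing fraction: {missing_fraction:.3f}")
```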

### Using `AutoGluonImputer.fit` and `.predict`

You can also impute values in a specific column only (called `output_column` below) using the values in other columns (called `input_columns` below). DataWig currently supports imputation of categorical and numeric columns. Column types are inferred with `pandas`.
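
The following sketch illustrates what dtype-based type detection with pandas can look like. It is an assumption about the general approach, not DataWig's actual code, and the helper `column_kind` is hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def column_kind(series: pd.Series) -> str:
    """Hypothetical helper: classify a column by its pandas dtype."""
    return "numeric" if is_numeric_dtype(series) else "categorical"


df = pd.DataFrame({"size_cm": [10.5, 12.0, 9.8], "color": ["black", "blue", "red"]})
kinds = {name: column_kind(df[name]) for name in df.columns}
print(kinds)
```

A numeric column stored as strings would be treated as categorical under this check, so it can be worth casting columns explicitly before fitting.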

#### Imputation of categorical columns

Let's first generate some random strings hidden in longer random strings:

In [4]:
df = datawig.utils.generate_df_string(
    num_samples=200,
    data_column_name='sentences',
    label_column_name='label')
df.head(n=2)

|   | sentences                           | label |
|---|-------------------------------------|-------|
| 0 | cMbm9 j7c1f RebhO BctvV m6Kop NQqEe | m6Kop |
| 1 | 7zOfb O2NiT RwL85 Rz1TH G7Fgt m6Kop | m6Kop |


In [5]:
df_train, df_test = datawig.utils.random_split(df)

imputer = datawig.AutoGluonImputer(
    input_columns=['sentences'], # column(s) containing information about the column we want to impute
    output_column='label' # the column we'd like to impute values for
    )

#Fit an imputer model on the train data
imputer.fit(train_df=df_train, time_limit=100)

#Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)
imputed.head(n=5)

Unable to import dependency mxnet. A quick tip is to install via `pip install mxnet --upgrade`, or `pip install mxnet_cu101 --upgrade`

|     | sentences                           | label | label_imputed |
|-----|-------------------------------------|-------|---------------|
| 57  | PFTP5 IripW Wa0RH lm0lc Z9jZI DOBx3 | Z9jZI | Z9jZI         |
| 31  | O67IM lm0lc DdZ04 RwL85 n5RL0 Z9jZI | Z9jZI | Z9jZI         |
| 65  | CH4F6 2V9SM Cffu4 Z9jZI zfx1h Rn9Xd | Z9jZI | Z9jZI         |
| 140 | vvYxT lm0lc Z9jZI wy1Qq NQqEe OCyT4 | Z9jZI | Z9jZI         |
| 89  | ERA5K YkvB0 IlnyL Svkpo Z9jZI RwL85 | Z9jZI | Z9jZI         |


#### Imputation of numerical columns

Imputation of numerical values works just like for categorical values.

Let's first generate some numeric values with a quadratic dependency:


In [5]:
import datawig

df = datawig.utils.generate_df_numeric(
    num_samples=200,
    data_column_name='x',
    label_column_name='y')
df.head(n=5)

|   | x         | y        |
|---|-----------|----------|
| 0 | 1.895813  | 3.617395 |
| 1 | -1.008764 | 1.024857 |
| 2 | 1.978105  | 3.919697 |
| 3 | -2.638216 | 6.96594  |
| 4 | 2.480706  | 6.151376 |
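
The rows shown above are consistent with `y` being roughly `x ** 2` plus a small amount of noise. A comparable dataset can be sketched with NumPy and pandas alone. The value range and noise scale below are assumptions, not the parameters `generate_df_numeric` uses, and `make_quadratic_df` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def make_quadratic_df(num_samples=200, noise=0.1, seed=0):
    """Hypothetical stand-in: y = x**2 + Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-3, 3, size=num_samples)
    y = x ** 2 + rng.normal(scale=noise, size=num_samples)
    return pd.DataFrame({"x": x, "y": y})


toy = make_quadratic_df()
print(toy.head(n=5))
```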


In [6]:
df_train, df_test = datawig.utils.random_split(df)

imputer = datawig.AutoGluonImputer(
    input_columns=['x'], # column(s) containing information about the column we want to impute
    output_column='y', # the column we'd like to impute values for
    )

#Fit an imputer model on the train data
imputer.fit(train_df=df_train, time_limit=100)

#Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)
imputed.head(n=5)

|     | x         | y        |
|-----|-----------|----------|
| 57  | 1.464692  | 2.149859 |
| 31  | -2.687957 | 7.225748 |
| 65  | 2.226667  | 4.958026 |
| 140 | 2.124441  | 4.502884 |
| 89  | -0.434246 | 0.176235 |


In [15]:

import pandas as pd
from sklearn.metrics import classification_report

from sklearn.datasets import (
    load_diabetes,
    load_wine,
    make_hastie_10_2
)

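def get_data(data_fn, noise=3e-1):
    # Sample a dataset, add Gaussian noise to the features, and return the
    # features plus the label (last column) in one DataFrame with string column names
    X, y = data_fn(n_samples=10000)
    X = X + np.random.randn(*X.shape) * noise
    columns = [str(i) for i in range(X.shape[-1] + 1)]
    return pd.DataFrame(np.column_stack([X, y]), columns=columns)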




In [16]:

X = get_data(make_hastie_10_2)
label = X.columns[-1]
X[label] = X[label].astype(str)
features = X.columns[:-1]
df_train, df_test = datawig.utils.random_split(X.copy())

imputer = datawig.AutoGluonImputer(
    input_columns=[x for x in X.columns if x != label], # column(s) containing information about the column we want to impute
    output_column=label, # the column we'd like to impute values for
    verbosity=2,
    
)


In [17]:
#Fit an imputer model on the train data
imputer.fit(train_df=df_train, time_limit=10)


Beginning AutoGluon training ... Time limit = 10s
AutoGluon will save models to "AutogluonModels/ag-20211130_205043/"
AutoGluon Version:  0.3.1
Train Data Rows:    7200
Train Data Columns: 10
Preprocessing data ...
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Train Data Class Count: 2
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    6111.98 MB
	Train Data (Original)  Memory Usage: 0.58 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Types of features in original data (raw dtype, special dtypes):
		('float', []) : 10 | ['0', '1', '2', '3', '4

<datawig.autogluon_imputer.AutoGluonImputer at 0x7fe2661e2070>

In [41]:
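features = X.columns[:-1]
df_test_features = df_test[features].copy(deep=True)
# With precision_threshold set, rows whose prediction does not meet the
# threshold are left empty in the imputed column (see the output below)
imputed = imputer.predict(df_test_features, precision_threshold=.9, inplace=True)
imputed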

|      | 0         | 1         | 2         | 3         | 4         | 5         | 6         | 7         | 8         | 9         | 10_imputed |
|------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|------------|
| 1121 | 0.735197  | 0.712926  | 0.677537  | 0.446614  | 0.447349  | -0.597103 | -1.025037 | -1.020038 | -1.814596 | -0.194096 | -1.0       |
| 2877 | -0.338972 | -1.593443 | 0.584711  | -0.310341 | 0.204547  | 1.566984  | 0.813121  | -0.818627 | 0.449865  | 1.788736  |            |
| 1785 | -1.910418 | 2.056198  | 1.126950  | 0.831595  | -2.170040 | 0.654610  | -1.919194 | -2.352304 | -1.861600 | -0.765301 | 1.0        |
| 9806 | 0.470405  | 0.425514  | 0.417396  | 0.742000  | 0.865707  | 0.335370  | 2.048923  | -0.784050 | -1.364779 | -0.927087 | -1.0       |
| 2232 | 0.143909  | 0.146218  | 0.146692  | -1.601094 | 0.060779  | 0.522951  | 2.445150  | -0.709292 | 1.639438  | -1.330495 | 1.0        |
| ...  | ...       | ...       | ...       | ...       | ...       | ...       | ...       | ...       | ...       | ...       | ...        |
| 9372 | -0.686623 | -1.211837 | 1.069371  | -1.356219 | -1.286132 | -0.697720 | 0.115574  | -0.559375 | -1.887478 | -1.982212 | 1.0        |
| 7291 | 1.311047  | -0.893888 | 1.069276  | 1.142341  | 0.155184  | -0.479877 | -1.275557 | 0.250859  | -1.807859 | 2.170328  | 1.0        |
| 1344 | 1.672363  | -0.131861 | 1.355230  | 1.459562  | -0.469457 | -0.376764 | -2.754493 | 0.916328  | 0.464703  | -0.647731 | 1.0        |
| 7293 | -0.472197 | 1.765242  | -0.043818 | 0.665194  | -0.106228 | -0.013310 | 0.923962  | 0.177422  | -1.919143 | -0.145779 | -1.0       |


In [42]:
imputed['10_imputed'].fillna('').value_counts()

1.0     901
-1.0    887
        212
Name: 10_imputed, dtype: int64
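
The empty entries in the counts above are rows for which no value was imputed. The idea can be sketched as follows: a prediction is kept only if its class probability reaches the threshold for that class. The probabilities and thresholds below are made up, and the code illustrates the principle rather than DataWig's implementation.

```python
import pandas as pd

# Made-up class probabilities for four rows
probas = pd.DataFrame({"-1.0": [0.95, 0.55, 0.10, 0.40],
                       "1.0":  [0.05, 0.45, 0.90, 0.60]})
# Made-up per-class probability thresholds
thresholds = {"-1.0": 0.8, "1.0": 0.7}

predicted = probas.idxmax(axis=1)                  # most likely class per row
confidence = probas.max(axis=1)                    # probability of that class
required = predicted.map(thresholds)               # threshold of the predicted class
imputed = predicted.where(confidence >= required)  # missing where not confident enough

print(imputed.fillna("").value_counts())
```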

In [43]:
print(classification_report(df_test[label],imputed[label+"_imputed"].fillna("")))

              precision    recall  f1-score   support

                   0.00      0.00      0.00         0
        -1.0       0.89      0.80      0.84       991
         1.0       0.88      0.79      0.83      1009

    accuracy                           0.79      2000
   macro avg       0.59      0.53      0.56      2000
weighted avg       0.89      0.79      0.84      2000
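
Because rows without an imputed value enter the report as an empty-string class, the numbers above mix two effects: how often the imputer returns a value, and how often the returned values are right. A sketch that separates the two on toy data (all values made up):

```python
import numpy as np
import pandas as pd

truth = pd.Series(["1.0", "-1.0", "1.0", "-1.0", "1.0"])
imputed = pd.Series(["1.0", np.nan, "-1.0", "-1.0", np.nan])  # NaN = no value imputed

answered = imputed.notna()
coverage = answered.mean()  # share of rows that received an imputed value
accuracy_answered = (truth[answered] == imputed[answered]).mean()

print(f"coverage: {coverage:.2f}, accuracy on imputed rows: {accuracy_answered:.2f}")
```

Raising the precision threshold typically trades coverage for accuracy on the rows that are imputed.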



In [19]:
imputer.precision_thresholds

{'-1.0': {'precisions': array([0.58625954, 0.58562691, 0.58652374, 0.58742331, 0.58832565,
         0.58923077, 0.59013867, 0.59104938, 0.59196291, 0.59287926,
         0.59379845, 0.5947205 , 0.59564541, 0.59657321, 0.5975039 ,
         0.5984375 , 0.59937402, 0.60031348, 0.60125589, 0.60220126,
         0.60314961, 0.60410095, 0.60505529, 0.60601266, 0.60697306,
         0.60793651, 0.60890302, 0.60987261, 0.6108453 , 0.61182109,
         0.6128    , 0.61378205, 0.61476726, 0.61575563, 0.61674718,
         0.61774194, 0.6187399 , 0.6197411 , 0.62074554, 0.62175325,
         0.62276423, 0.6237785 , 0.62479608, 0.62581699, 0.62684124,
         0.62622951, 0.6272578 , 0.62828947, 0.62932455, 0.63036304,
         0.63140496, 0.63245033, 0.63349917, 0.6345515 , 0.63560732,
         0.63666667, 0.63772955, 0.63879599, 0.639866  , 0.6409396 ,
         0.64201681, 0.64309764, 0.64418212, 0.64527027, 0.6463621 ,
         0.64745763, 0.64855688, 0.64965986, 0.65076661, 0.65187713,
         0.6

In [25]:
precision_threshold = 0.99
precisions = imputer.precision_thresholds['1.0']['precisions']
thresholds = imputer.precision_thresholds['1.0']['thresholds']
precision_above = (precisions >= precision_threshold).nonzero()[0][0]

In [26]:
thresholds[min(precision_above, len(thresholds)-1)]

0.6890873908996582

In [15]:
min(precision_above, len(thresholds)-1)

74

In [16]:
len(thresholds)

75
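
The clamp `min(precision_above, len(thresholds)-1)` keeps the selected index inside the `thresholds` array. One way arrays of unequal length arise is scikit-learn's `precision_recall_curve`, which returns one more precision value than thresholds. Whether DataWig computes `precision_thresholds` this way is an assumption; the sketch below uses toy scores and a hypothetical helper, `threshold_for_precision`.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.5, 0.65, 0.9, 0.7])

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)


def threshold_for_precision(precisions, thresholds, target):
    """Hypothetical helper: lowest threshold whose precision reaches target."""
    # The final precision entry is always 1.0, so a match exists for target <= 1
    first_above = (precisions >= target).nonzero()[0][0]
    # precisions has one more entry than thresholds, hence the clamp
    return thresholds[min(first_above, len(thresholds) - 1)]


chosen = threshold_for_precision(precisions, thresholds, target=0.99)
print(len(precisions), len(thresholds), chosen)
```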