# Bring Your Own Data (BYOD) Outlier Detection

This notebook shows a simple use case of our system on an [OECD](https://data.oecd.org/) dataset. We detect three types of outliers in the data:
* Global outliers: values that rarely appear in real-world data.
* Local outliers: values that differ from the other values in the same attribute.
* Null outliers: values that have no meaning.
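
To make these categories concrete, here is a toy sketch. It is not the service's algorithm. It assumes a hypothetical `signature` helper that reduces each value to its shape (runs of digits become `9`, runs of letters become `a`), an assumed set of null markers, and an arbitrary `min_share` threshold for calling a shape rare within a column. Global outliers would additionally need shape frequencies from a large reference corpus, which this sketch does not model.

```python
from collections import Counter

# Assumed null markers; the real service may use a different set.
NULL_MARKERS = {"", "..", "NA", "null"}


def signature(value: str) -> str:
    """Reduce a value to its shape: digit runs -> '9', letter runs -> 'a'."""
    out = []
    for ch in value.strip():
        if ch.isdigit():
            sym = "9"
        elif ch.isalpha():
            sym = "a"
        else:
            sym = ch
        # Collapse repeated digits/letters into one symbol; keep punctuation as-is.
        if not out or out[-1] != sym or sym not in "9a":
            out.append(sym)
    return "".join(out)


def is_null(value: str) -> bool:
    """Null outlier: the cell carries no usable value."""
    return value.strip() in NULL_MARKERS


def local_outliers(column, min_share=0.3):
    """Local outliers: values whose shape is rare within this column."""
    values = [v for v in column if not is_null(v)]
    counts = Counter(signature(v) for v in values)
    return [v for v in values if counts[signature(v)] / len(values) < min_share]


column = ["80.7 |", "80.5", "80.7", "81.4", "81.1", ".."]
print([v for v in column if is_null(v)])
print(local_outliers(column))
```

Here `80.7 |` has the shape `9.9 |` while the other non-null values share the shape `9.9`, so it is the only value reported as locally rare.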

## Setup

* Set the `__HOME__` directory
* Set pandas options to display full dataframes

In [None]:
%load_ext autoreload
%autoreload 2

from pathlib import Path

__HOME__ = Path("../byod-cleaning-api")

In [2]:
import pandas as pd

# options to display full dataframe
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

In [3]:
# Read the input CSV file into a pandas DataFrame; dtype=str keeps every cell as raw text
input_file = __HOME__ / "data/oecd_cropped.csv"

df = pd.read_csv(input_file, dtype=str)
df.head(5)

Unnamed: 0,GDP per capita,Gross national income (GNI) per capita,Household disposable income,Real GDP growth,Net saving rate in household disposable income,Gross fixed capital formation,"Agriculture, forestry, fishing",Industry including energy,Construction,"Trade, repairs, transport, accomm., food services","Information, communication",Finance and insurance,Real estate,"Professional, scientific, support services","Public admin., defence, education, health, social work",Other services (ISIC Rev.4 R - U),Government deficit,General government debt,General government revenues,General government expenditures,Government/compulsory expenditure on health,Voluntary expenditure on health,Public social expenditure,Private social expenditure,Public pension expenditure,Private pension expenditure,Net official development assistance (Aid),Total tax revenue,Tax on personal income,Tax on corporate profits,Taxes on goods and services,Taxes on the average worker,Imports of goods and services,Exports of goods and services,Goods trade balance: exports minus imports of goods,Imports of goods,Exports of goods,Service trade balance: exports minus imports of services,Imports of services,Exports of services,Current account balance of payments,Outward FDI stocks,Inward FDI stocks,Inflows of foreign direct investment,Outflows of foreign direct investment,Inflation rate: all items,Inflation rate: all items non food non energy,Inflation rate: food,Inflation rate: energy,Producer Price Indices (PPI): manufacturing,Long-term interest rates,Purchasing power parities,Exchange rates,Indices of price levels,Total primary energy supply (TPES),TPES per unit of GDP at 2000 prices and PPPs,Renewable energy,Crude oil import prices,Households with access to computers,Households with access to internet,Water abstactions,National fish landings in domestic ports,National fish landings in foreign ports,Aquaculture,Municipal waste total,Municipal waste total per capita,CO2 emissions from fuel 
combustion,Tertiary attainment in population aged 25-64,"Expenditure per student: primary, 2010 prices","Expenditure per student: secondary, 2010 prices","Expenditure per student: tertiary, 2010 prices",Youths 15-19 not in education nor employment,Youths 20-24 not in education nor employment,Employment rate in population aged 15-24,Employment rate in population aged 25-54,Employment rate in population aged 55-64,Incidence of part-time employment,Self-employment rate: total employment,Self-employment rate: male employment,Self-employment rate: female employment,Unemployment rate: total labour force,Unemployment rate: male labour force,Unemployment rate: female labour force,Long-term unemployment: total unemployed,"Labour compensation per unit labour input, total economy",Average time worked per person in employment,Gross domestic expenditure on R&D,Researchers: full-time equivalent,Total population,Population growth rates,Total fertitity rates,Youth population aged less than 15,Elderly population aged 65 and over,Foreign-born population,Foreign population,Unemployment rate in population of native-born men,Unemployment rate in population of foreign-born men,Unemployment rate in population of native-born women,Unemployment rate in population of foreign-born women,Life expectancy at birth,Life expectancy at birth: men,Life expectancy at birth: women,Infant mortality,"Overweight or obese, % of population aged 15 and over",Suicide Rates,Goods transport,Passenger transport
0,41 450,41 868,-0.5,1.8,6.5,4.2,0.7,17.3,5.7,20.2,4.2,6.3,8.7,12.8,21.8,2.2,-4.2,110.6,50.3,54.5,7.9,2.3,28.7,2.0,10.0,3.6,0.54,43.1,12.2,2.8,10.7,56.1,81.1,81.6,-13.5,337.0,323.5,10.1,95.2,105.3,-1.1,..,..,78 329,46 413,3.5,1.8,2.7,16.9,6.8,4.23,0.83,0.72,110,56.2,0.13,3 119.8,110.5,78.9,76.5,5 297,17,3,0 e,5 003,454,93,35,..,..,15 666,6.1,17.1,26.0,79.4,38.7,18.8,14.3,17.5,10.5,7.1,7.1,7.2,48.3,2.4,1 560,9 575,9.4,11 048,1.2,1.8,17.0,17.2,15.0,10.6,5.7,15.5,6.0,14.6,80.7 |,78.0 |,83.3 |,3.4,..,18.1,50 506 e,138 643
1,42 585,43 627,0.2,0.2,5.7,0.2,0.9,16.8,5.7,20.0,4.3,6.3,8.7,12.9,22.3,2.2,-4.2,120.5,51.6,55.9,8.0,2.2,28.7,2.0,10.0,1.2,0.47,44.2,12.3,3.0,11.1,56.0,81.7,82.3,-12.4,313.2,300.8,8.4,98.0,106.4,-0.1,419 640,486 226,6 518,33 834,2.8,2.2,3.1,6.1,2.8,3.0,0.82,0.78,104,53.8,0.12,3 364.7,110.83,80.3,77.7,5 301,18,4,0 e,4 944,446,92,35,9 818,12 323,15 953,8.3,17.5,25.3,79.3,39.5,18.7,14.3,17.6,10.5,7.5,7.6,7.4,44.7,3.1,1 560,10 123,10.0,11 128,0.7,1.8,17.0,17.5,15.3,10.8,5.8,17.6,5.9,15.9,80.5,77.8,83.1,3.8,..,17.4,..,..
2,43 746,44 467,0.2,0.2,5.1,-1.5,0.8,16.7,5.6,19.8,4.1,6.1,8.6,13.4,22.7,2.2,-3.1,118.5,52.7,55.8,8.1,2.3,29.2,1.9,10.3,1.2,0.45,45.1,12.9,3.1,10.9,55.7,80.5,81.7,-8.8,321.4,312.6,9.1,104.1,113.2,-0.3,465 528,482 946,25 188,29 480,1.1,1.4,3.4,-3.8,-0.2,2.41,0.81,0.75,109,56.0,0.13,3 505.0,108.45,81.9,80.0,4 829,16,6,0 e,4 867,436,94,36,10 072,12 911,16 697,6.7,18.7,23.6,79.1,41.7,18.2,15.1,18.8,10.7,8.4,8.6,8.2,46.0,2.7,1 558,10 413,10.2,11 178,0.5,1.7,17.0,17.7,15.5,10.9,6.8,18.2,6.8,16.0,80.7,78.1,83.2,3.5,..,16.2,..,132 125
3,44 720,45 029,1.1,1.3,5.1,5.8,0.7,16.5,5.5,19.7,4.1,6.0,8.7,13.8,22.7,2.3,-3.1,131.1,52.2,55.3,8.1,2.3,29.4,1.8,10.4,..,0.46,45.1,13.0,3.1,10.8,55.6,82.0,82.7,-7.0,317.3,310.3,7.1,117.8,124.9,-0.9,554 624,496 519,-12 392,-3 681,0.3,1.6,-0.4,-6.2,-2.5,1.71,0.8,0.75,109,53.0,0.12,3 398.2,98.49,..,82.8,..,20,5,0 e,4 762,424,87,37,10 191,13 086,17 002,5.4,18.9,23.2,79.1,42.7,18.1,14.6,17.9,10.9,8.5,9.0,7.9,49.9,1.1,1 545,10 785,11.1,11 227,0.4,1.7,17.0,17.9,..,..,7.2,18.7,6.5,16.3,81.4,78.8,83.9,3.4,51.0,16.1,..,134 954 e
4,45 739,45 480,0.4,1.7,4.3,2.7,0.8,16.7,5.3,19.7,4.1,6.0,8.7,14.2,22.2,2.3,-2.4,127.8,51.3,53.7,8.0,2.3 |,29.2,1.9,10.7,..,0.42,44.8,12.6,3.3,10.7,55.3,79.4,80.8,-1.1,254.9,253.9,6.4,107.1,113.4,-1.0,582 582,520 992,23 536,39 896,0.6,1.4,1.2,-5.8,-6.8,0.84,0.8,0.9,100,53.3,0.12,3 664.3,51.65,82.1,81.8,..,18,4,0 e,4 643,411,93,37,10 211,13 070,17 320,4.3,15.9,23.4,78.5,44.0,18.2,15.2,18.7,11.1,8.5,9.1,7.8,51.7,0.2,1 545,11 313,11.6,..,..,1.7,..,..,..,..,7.4,17.9,6.2,16.0,81.1,78.7,83.4,3.3,..,15.8,..,132 573


## Outlier Detection
---------------------------------
The BYOD outlier detection service is deployed at https://bclean.mint.isi.edu/detect.

The service accepts a `POST` request whose JSON body has the following form:
```json
{
    "table":{
        "column1": ["val1", "val2"],
        "column2": ["val3", "val4"]
    }
}
```
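
A body of this shape can be built from any DataFrame. The sketch below uses a made-up two-column frame and a hypothetical `to_payload` helper. Replacing missing cells with empty strings is an assumption about what the service accepts; it is done here only because `NaN` is not valid JSON.

```python
import json

import pandas as pd


def to_payload(frame: pd.DataFrame) -> dict:
    """Build the {"table": {column: [values...]}} request body from a DataFrame."""
    # Assumption: missing cells are sent as "" so the body is valid JSON.
    cleaned = frame.fillna("").astype(str)
    return {"table": cleaned.to_dict(orient="list")}


toy = pd.DataFrame({"column1": ["val1", None], "column2": ["val3", "val4"]})
payload = to_payload(toy)

# allow_nan=False makes json.dumps raise if a NaN slipped through.
print(json.dumps(payload, indent=4, allow_nan=False))
```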

In [4]:
data = df.to_dict(orient="list")

# show one column as an example
data["GDP per capita"]

['41 450', '42 585', '43 746', '44 720', '45 739', '47 366', '49 526']

----------------------------------------
The response data has the following form:
```json
{
    "table":{
        "column1": ["[[[val1]]]", "val2"],
        "column2": ["val3", "val4"]
    }
}
```
where `[[[value]]]` marks a value detected as an outlier.
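
Because the markers are embedded in the cell text, it can help to separate them from the values. The following is a minimal sketch using the toy response above; `split_markers` is a hypothetical helper, not part of the service, and it assumes that a flagged cell is wrapped in exactly one `[[[` ... `]]]` pair.

```python
import re

import pandas as pd

# A cell is flagged when its whole text is wrapped in [[[ ... ]]].
MARKER = re.compile(r"^\[\[\[(.*)\]\]\]$", re.DOTALL)


def split_markers(annotated: pd.DataFrame):
    """Return (values, mask): cell text without markers, and True where flagged."""
    mask = annotated.apply(lambda col: col.map(lambda v: bool(MARKER.match(v))))
    values = annotated.apply(lambda col: col.map(lambda v: MARKER.sub(r"\1", v)))
    return values, mask


toy_response = {
    "table": {
        "column1": ["[[[val1]]]", "val2"],
        "column2": ["val3", "val4"],
    }
}
annotated = pd.DataFrame(toy_response["table"])
values, mask = split_markers(annotated)

print(values)
print(mask.sum())  # number of flagged cells per column
```

The boolean mask can then be used to count, filter, or highlight flagged cells without touching the original values.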

In [7]:
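import requests
from requests.auth import HTTPBasicAuth

# HTTP basic-auth credentials for the service
auth = HTTPBasicAuth('mint', 'asf12jkj!%&')

response = requests.post("https://bclean.mint.isi.edu/detect", json={"table": data}, auth=auth)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error body

# rebuild a DataFrame from the annotated {"column": [values...]} table
result_df = pd.DataFrame.from_dict(response.json()["table"], orient="index").transpose()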

--------------------------------
In the result, outliers are annotated as `[[[value]]]`. For example, every value in the `GDP per capita` column is flagged as a global outlier because values matching the regex pattern `[0-9]+ [0-9]+` (digits, a space, digits) rarely appear in real-world data.
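
The pattern can be checked directly against the `GDP per capita` values shown earlier. The sketch below also converts them to integers under the assumption that the space is a thousands separator; that interpretation is ours and is not something the service reports.

```python
import re

gdp = ['41 450', '42 585', '43 746', '44 720', '45 739', '47 366', '49 526']

# The pattern from the text: digits, one space, digits.
pattern = re.compile(r"[0-9]+ [0-9]+")
all_match = all(pattern.fullmatch(v) is not None for v in gdp)
print(all_match)

# Assumption: the space is a thousands separator, so removing it yields the number.
as_int = [int(v.replace(" ", "")) for v in gdp]
print(as_int)
```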

In [8]:
# show the result in the same column order as the original file
result_df[df.columns]

Unnamed: 0,GDP per capita,Gross national income (GNI) per capita,Household disposable income,Real GDP growth,Net saving rate in household disposable income,Gross fixed capital formation,"Agriculture, forestry, fishing",Industry including energy,Construction,"Trade, repairs, transport, accomm., food services","Information, communication",Finance and insurance,Real estate,"Professional, scientific, support services","Public admin., defence, education, health, social work",Other services (ISIC Rev.4 R - U),Government deficit,General government debt,General government revenues,General government expenditures,Government/compulsory expenditure on health,Voluntary expenditure on health,Public social expenditure,Private social expenditure,Public pension expenditure,Private pension expenditure,Net official development assistance (Aid),Total tax revenue,Tax on personal income,Tax on corporate profits,Taxes on goods and services,Taxes on the average worker,Imports of goods and services,Exports of goods and services,Goods trade balance: exports minus imports of goods,Imports of goods,Exports of goods,Service trade balance: exports minus imports of services,Imports of services,Exports of services,Current account balance of payments,Outward FDI stocks,Inward FDI stocks,Inflows of foreign direct investment,Outflows of foreign direct investment,Inflation rate: all items,Inflation rate: all items non food non energy,Inflation rate: food,Inflation rate: energy,Producer Price Indices (PPI): manufacturing,Long-term interest rates,Purchasing power parities,Exchange rates,Indices of price levels,Total primary energy supply (TPES),TPES per unit of GDP at 2000 prices and PPPs,Renewable energy,Crude oil import prices,Households with access to computers,Households with access to internet,Water abstactions,National fish landings in domestic ports,National fish landings in foreign ports,Aquaculture,Municipal waste total,Municipal waste total per capita,CO2 emissions from fuel 
combustion,Tertiary attainment in population aged 25-64,"Expenditure per student: primary, 2010 prices","Expenditure per student: secondary, 2010 prices","Expenditure per student: tertiary, 2010 prices",Youths 15-19 not in education nor employment,Youths 20-24 not in education nor employment,Employment rate in population aged 15-24,Employment rate in population aged 25-54,Employment rate in population aged 55-64,Incidence of part-time employment,Self-employment rate: total employment,Self-employment rate: male employment,Self-employment rate: female employment,Unemployment rate: total labour force,Unemployment rate: male labour force,Unemployment rate: female labour force,Long-term unemployment: total unemployed,"Labour compensation per unit labour input, total economy",Average time worked per person in employment,Gross domestic expenditure on R&D,Researchers: full-time equivalent,Total population,Population growth rates,Total fertitity rates,Youth population aged less than 15,Elderly population aged 65 and over,Foreign-born population,Foreign population,Unemployment rate in population of native-born men,Unemployment rate in population of foreign-born men,Unemployment rate in population of native-born women,Unemployment rate in population of foreign-born women,Life expectancy at birth,Life expectancy at birth: men,Life expectancy at birth: women,Infant mortality,"Overweight or obese, % of population aged 15 and over",Suicide Rates,Goods transport,Passenger transport
0,[[[41 450]]],[[[41 868]]],[[[-0.5]]],1.8,6.5,4.2,0.7,17.3,5.7,20.2,4.2,6.3,8.7,12.8,21.8,2.2,[[[-4.2]]],110.6,50.3,54.5,7.9,2.3,28.7,2.0,10.0,3.6,0.54,43.1,12.2,2.8,10.7,56.1,81.1,81.6,[[[-13.5]]],[[[337.0]]],323.5,10.1,95.2,105.3,[[[-1.1]]],[[[ ..]]],[[[ ..]]],[[[78 329]]],[[[46 413]]],3.5,1.8,2.7,16.9,6.8,4.23,0.83,0.72,110,56.2,0.13,[[[3 119.8]]],110.50,78.9,76.5,[[[5 297]]],17,3,0 e,[[[5 003]]],454,93,35,[[[ ..]]],[[[ ..]]],[[[15 666]]],6.1,17.1,26.0,79.4,38.7,18.8,14.3,17.5,10.5,7.1,7.1,7.2,48.3,2.4,[[[1 560]]],[[[9 575]]],9.4,[[[11 048]]],1.2,1.8,17.0,17.2,15.0,10.6,5.7,15.5,6.0,14.6,[[[80.7 |]]],[[[78.0 |]]],[[[83.3 |]]],3.4,[[[ ..]]],18.1,[[[50 506 e]]],[[[138 643]]]
1,[[[42 585]]],[[[43 627]]],0.2,0.2,5.7,0.2,0.9,16.8,5.7,20.0,4.3,6.3,8.7,12.9,22.3,2.2,[[[-4.2]]],120.5,51.6,55.9,8.0,2.2,28.7,2.0,10.0,1.2,0.47,44.2,12.3,3.0,11.1,56.0,81.7,82.3,[[[-12.4]]],313.2,300.8,8.4,98.0,106.4,[[[-0.1]]],[[[419 640]]],[[[486 226]]],[[[6 518]]],[[[33 834]]],2.8,2.2,3.1,6.1,2.8,3.0,0.82,0.78,104,53.8,0.12,[[[3 364.7]]],[[[110.83]]],80.3,77.7,[[[5 301]]],18,4,0 e,[[[4 944]]],446,92,35,[[[9 818]]],[[[12 323]]],[[[15 953]]],8.3,17.5,25.3,79.3,39.5,18.7,14.3,17.6,10.5,7.5,7.6,7.4,44.7,3.1,[[[1 560]]],[[[10 123]]],10.0,[[[11 128]]],0.7,1.8,17.0,17.5,15.3,10.8,5.8,17.6,5.9,15.9,80.5,77.8,83.1,3.8,[[[ ..]]],17.4,[[[ ..]]],[[[ ..]]]
2,[[[43 746]]],[[[44 467]]],0.2,0.2,5.1,[[[-1.5]]],0.8,16.7,5.6,19.8,4.1,6.1,8.6,13.4,22.7,2.2,[[[-3.1]]],118.5,52.7,55.8,8.1,2.3,29.2,1.9,10.3,1.2,0.45,45.1,12.9,3.1,10.9,55.7,80.5,81.7,[[[-8.8]]],321.4,312.6,9.1,104.1,113.2,[[[-0.3]]],[[[465 528]]],[[[482 946]]],[[[25 188]]],[[[29 480]]],1.1,1.4,3.4,[[[-3.8]]],[[[-0.2]]],2.41,0.81,0.75,109,56.0,0.13,[[[3 505.0]]],108.45,81.9,80.0,[[[4 829]]],16,6,0 e,[[[4 867]]],436,94,36,[[[10 072]]],[[[12 911]]],[[[16 697]]],6.7,18.7,23.6,79.1,41.7,18.2,15.1,18.8,10.7,8.4,8.6,8.2,46.0,2.7,[[[1 558]]],[[[10 413]]],10.2,[[[11 178]]],0.5,1.7,17.0,17.7,15.5,10.9,6.8,18.2,6.8,16.0,80.7,78.1,83.2,3.5,[[[ ..]]],16.2,[[[ ..]]],[[[132 125]]]
3,[[[44 720]]],[[[45 029]]],1.1,1.3,5.1,5.8,0.7,16.5,5.5,19.7,4.1,6.0,8.7,13.8,22.7,2.3,[[[-3.1]]],131.1,52.2,55.3,8.1,2.3,29.4,1.8,10.4,[[[ ..]]],0.46,45.1,13.0,3.1,10.8,55.6,82.0,82.7,[[[-7.0]]],317.3,310.3,7.1,117.8,124.9,[[[-0.9]]],[[[554 624]]],[[[496 519]]],[[[-12 392]]],[[[-3 681]]],0.3,1.6,[[[-0.4]]],[[[-6.2]]],[[[-2.5]]],1.71,0.8,0.75,109,53.0,0.12,[[[3 398.2]]],98.49,[[[ ..]]],82.8,[[[ ..]]],20,5,0 e,[[[4 762]]],424,87,37,[[[10 191]]],[[[13 086]]],[[[17 002]]],5.4,18.9,23.2,79.1,42.7,18.1,14.6,17.9,10.9,8.5,9.0,7.9,49.9,1.1,[[[1 545]]],[[[10 785]]],11.1,[[[11 227]]],0.4,1.7,17.0,17.9,[[[ ..]]],[[[ ..]]],7.2,18.7,6.5,16.3,81.4,78.8,83.9,3.4,51.0,16.1,[[[ ..]]],[[[134 954 e]]]
4,[[[45 739]]],[[[45 480]]],0.4,1.7,4.3,2.7,0.8,16.7,5.3,19.7,4.1,6.0,8.7,14.2,22.2,2.3,[[[-2.4]]],127.8,51.3,53.7,8.0,[[[2.3 |]]],29.2,1.9,10.7,[[[ ..]]],0.42,44.8,12.6,3.3,10.7,55.3,79.4,80.8,[[[-1.1]]],254.9,253.9,6.4,107.1,113.4,[[[-1.0]]],[[[582 582]]],[[[520 992]]],[[[23 536]]],[[[39 896]]],0.6,1.4,1.2,[[[-5.8]]],[[[-6.8]]],0.84,0.8,0.9,100,53.3,0.12,[[[3 664.3]]],51.65,82.1,81.8,[[[ ..]]],18,4,0 e,[[[4 643]]],411,93,37,[[[10 211]]],[[[13 070]]],[[[17 320]]],4.3,15.9,23.4,78.5,44.0,18.2,15.2,18.7,11.1,8.5,9.1,7.8,51.7,0.2,[[[1 545]]],[[[11 313]]],11.6,[[[ ..]]],[[[ ..]]],1.7,[[[ ..]]],[[[ ..]]],[[[ ..]]],[[[ ..]]],7.4,17.9,6.2,16.0,81.1,78.7,83.4,3.3,[[[ ..]]],15.8,[[[ ..]]],[[[132 573]]]
5,[[[47 366]]],[[[47 420]]],1.3,1.5,3.9,3.8,0.7,16.6,5.3,19.7,4.1,6.4,8.6,14.3,22.1,2.2,[[[-2.4]]],128.9,50.7,53.0,7.9,2.4,29.2,[[[ ..]]],[[[ ..]]],[[[ ..]]],0.5,44.1,12.2,3.5,10.8,53.9,81.4,82.7,0.1,274.6,274.7,5.1,108.7,113.8,[[[-0.6]]],[[[594 584]]],[[[499 567]]],[[[50 971]]],[[[20 968]]],2.0,2.0,2.2,0.7,[[[-2.7]]],0.48,0.79,0.9,100,56.5,0.12,[[[3 916.0]]],42.06,[[[ ..]]],84.8,[[[ ..]]],17,8,0,[[[4 746]]],418,92,38,[[[ ..]]],[[[ ..]]],[[[ ..]]],4.3,14.4,[[[22.8 |]]],[[[79.1 |]]],[[[45.4 |]]],17.8,[[[14.8 |]]],[[[18.4 |]]],[[[10.6 |]]],[[[7.8 |]]],[[[8.1 |]]],[[[7.6 |]]],51.6,0.3,[[[1 545]]],[[[11 871]]],11.7,[[[ ..]]],[[[ ..]]],1.7,[[[ ..]]],[[[ ..]]],[[[ ..]]],[[[ ..]]],6.5,15.9,6.0,15.5,81.5,79.0,84.0,3.2,[[[ ..]]],15.9,[[[ ..]]],[[[ ..]]]
6,[[[49 526]]],[[[50 109]]],1.2,1.7,4.0,1.8,0.7,16.7,5.2,19.5,4.2,6.2,8.6,14.5,22.1,2.2,[[[-0.8]]],122.3,51.3,52.1,8.0,2.4,29.2,[[[ ..]]],[[[ ..]]],[[[ ..]]],0.45,44.6,12.1,4.1,10.8,53.8,84.6,85.8,0.4,303.6,304.1,4.5,115.3,119.8,0.7,[[[675 488]]],[[[564 314]]],[[[-5 765]]],[[[24 208]]],2.1,1.5,1.3,8.2,8.4,0.72,0.78,0.89,101,55.6,0.12,[[[4 129.0]]],[[[53.01]]],85.1,86.0,[[[ ..]]],17,[[[ ..]]],[[[ ..]]],[[[4 659]]],408,[[[ ..]]],40,[[[ ..]]],[[[ ..]]],[[[ ..]]],4.0,15.2,22.7,79.5,48.2,16.5,14.3,17.4,10.7,7.1,7.1,7.1,48.8,1.7,[[[1 545]]],[[[12 308]]],12.0,[[[ ..]]],[[[ ..]]],1.6,[[[ ..]]],[[[ ..]]],[[[ ..]]],[[[ ..]]],5.8,13.1,5.7,13.8,81.6,79.2,83.9,3.6,[[[ ..]]],[[[ ..]]],[[[ ..]]],[[[ ..]]]
