# Detect and Delete outliers with Optimus

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal. Before abnormal observations can be singled out, it is necessary to characterize normal observations.

Be careful when studying outliers: it is often hard to know whether an extreme value is the result of a data glitch or a real data point, in which case it may not be an outlier at all.

* http://colingorrie.github.io/outlier-detection.html
* http://blog.madhukaraphatak.com/statistical-data-exploration-spark-part-3/
* http://blog.caseystella.com/pyspark-openpayments-analysis-part-4.html
* http://rstudio-pubs-static.s3.amazonaws.com/228345_4c20226b21714b7e8fa3782e6c8a1779.html

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append("..")

In [3]:
from optimus import Optimus

In [4]:
# Create optimus
op = Optimus()


                             ____        __  _                     
                            / __ \____  / /_(_)___ ___  __  _______
                           / / / / __ \/ __/ / __ `__ \/ / / / ___/
                          / /_/ / /_/ / /_/ / / / / / / /_/ (__  ) 
                          \____/ .___/\__/_/_/ /_/ /_/\__,_/____/  
                              /_/                                  
                              
Transform and Roll out...
Just checking that all necessary environments vars are present...
-----
HADOOP_HOME=C:\opt\spark\spark-2.3.1-bin-hadoop2.7
PYSPARK_PYTHON=python
SPARK_HOME=C:\opt\spark\spark-2.3.1-bin-hadoop2.7
JAVA_HOME=C:\Program Files\Java\jdk1.8.0_181
Pyarrow Installed
-----
Starting or getting SparkSession and SparkContext...
Optimus successfully imported. Have fun :).


In [5]:
from pyspark.sql.types import StringType

# Build a small example dataframe: column specs first, then the rows
df = op.create.df(
    [
        ("words", "str", True),
        ("num", "int", True),
        ("animals", "str", True),
        ("thing", StringType(), True),
        ("two strings", StringType(), True),
        ("filter", StringType(), True),
        ("num 2", "int", True),
        ("date", "string", True),
        ("num 3", "str", True)
    ],
    [
        ("  I like     fish  ", 1, "dog", "&^%$#housé", "cat-car", "a", 1, "20150510", '3'),
        ("    zombies", 2, "cat", "tv", "dog-tv", "b", 2, "20160510", '3'),
        ("simpsons   cat lady", 2, "frog", "table", "eagle-tv-plus", "1", 3, "20170510", '4'),
        (None, 3, "eagle", "glass", "lion-pc", "c", 4, "20180510", '5'),
        (None, 5, "eagle", "glass", "lion-pc", "c", 4, "20180510", '5'),
        (None, 6, "eagle", "glass", "lion-pc", "c", 4, "20180510", '5'),
        (None, 7, "eagle", "glass", "lion-pc", "c", 4, "20180510", '5'),
        (None, 20, "eagle", "glass", "lion-pc", "c", 4, "20180510", '5')
    ]
)

df.table()

words  (string),num  (int),animals  (string),thing  (string),two strings  (string),filter  (string),num 2  (int),date  (string),num 3  (string)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱,1,dog,&^%$#housé,cat-car,a,1,20150510,3
⸱⸱⸱⸱zombies,2,cat,tv,dog-tv,b,2,20160510,3
simpsons⸱⸱⸱cat⸱lady,2,frog,table,eagle-tv-plus,1,3,20170510,4
,3,eagle,glass,lion-pc,c,4,20180510,5
,5,eagle,glass,lion-pc,c,4,20180510,5
,6,eagle,glass,lion-pc,c,4,20180510,5
,7,eagle,glass,lion-pc,c,4,20180510,5
,20,eagle,glass,lion-pc,c,4,20180510,5


From a quick inspection of the dataframe we can guess that the value 20 in the column `num` may be an outlier. You can run a much more thorough investigation to check whether it really is an outlier; if you need something like that, please check out [these articles and tutorials](http://www.datasciencecentral.com/profiles/blogs/11-articles-and-tutorials-about-outliers).

With Optimus you can also run several analyses to check whether a value is an outlier. First, let's run some visual analysis. Remember to check the [Main Example](https://github.com/ironmussa/Optimus/blob/master/examples/Optimus_Example.ipynb) for more.

## Outlier detection

One of the most common ways of finding outliers in one-dimensional data is to mark as a potential outlier any point that lies more than, say, two standard deviations from the mean (I am referring to sample means and sample standard deviations here and in what follows). But outliers themselves are likely to have a strong effect on the mean and the standard deviation, which makes this technique unreliable.
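To make the rule concrete, here is a minimal plain-Python sketch (not the Optimus implementation) that computes sample z-scores for the `num` column of the example dataframe. The variable names and the threshold of 1 are chosen only for illustration, mirroring the cells further down.

```python
import statistics

# Values of the `num` column from the example dataframe
nums = [1, 2, 2, 3, 5, 6, 7, 20]

mean = statistics.mean(nums)   # sample mean
std = statistics.stdev(nums)   # sample standard deviation (n - 1 denominator)

# z-score: distance from the mean, measured in standard deviations
z_scores = [(x - mean) / std for x in nums]

threshold = 1  # illustrative cut-off, not a recommended default
flagged = [x for x, z in zip(nums, z_scores) if abs(z) > threshold]
print(flagged)
```

Note that the value 20 is itself included in the mean and the standard deviation it is judged against, which is the weakness described above.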

That's why Optimus implements the median absolute deviation from the median, commonly shortened to the median absolute deviation (MAD). It is the median of the set comprising the absolute values of the differences between the median and each data point. If you want more information on the subject, please read the excellent article by Leys et al. about detecting outliers [here](http://www.sciencedirect.com/science/article/pii/S0022103113000668).
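As a rough illustration of that definition, the sketch below computes the MAD in plain Python with the standard library. The helper `mad` is a hypothetical name introduced here, not an Optimus function, and it uses the exact median, whereas a Spark-based implementation may approximate it.

```python
import statistics

def mad(values):
    """Median of the absolute deviations from the median."""
    med = statistics.median(values)
    return statistics.median([abs(x - med) for x in values])

nums = [1, 2, 2, 3, 5, 6, 7, 20]   # `num` column of the example dataframe
without_extreme = nums[:-1]         # same data with the value 20 dropped

# The standard deviation changes a lot when the extreme value is removed...
print(statistics.stdev(nums), statistics.stdev(without_extreme))
# ...while the MAD stays the same for this data.
print(mad(nums), mad(without_extreme))
```

The comparison shows why the MAD is considered a robust measure of spread: a single extreme value barely moves it.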

In [6]:
from optimus.outliers.outliers import OutlierDetector
od = OutlierDetector()

### Z-score

In [7]:
od.z_score(df, "num", threshold=1).table()

words  (string),num  (int),animals  (string),thing  (string),two strings  (string),filter  (string),num 2  (int),date  (string),num 3  (string)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱,1,dog,&^%$#housé,cat-car,a,1,20150510,3
⸱⸱⸱⸱zombies,2,cat,tv,dog-tv,b,2,20160510,3
simpsons⸱⸱⸱cat⸱lady,2,frog,table,eagle-tv-plus,1,3,20170510,4
,3,eagle,glass,lion-pc,c,4,20180510,5
,5,eagle,glass,lion-pc,c,4,20180510,5
,6,eagle,glass,lion-pc,c,4,20180510,5
,7,eagle,glass,lion-pc,c,4,20180510,5


In [8]:
od.z_score(df, ["num", "num 2"],1).table()

words  (string),num  (int),animals  (string),thing  (string),two strings  (string),filter  (string),num 2  (int),date  (string),num 3  (string)
⸱⸱⸱⸱zombies,2,cat,tv,dog-tv,b,2,20160510,3
simpsons⸱⸱⸱cat⸱lady,2,frog,table,eagle-tv-plus,1,3,20170510,4
,3,eagle,glass,lion-pc,c,4,20180510,5
,5,eagle,glass,lion-pc,c,4,20180510,5
,6,eagle,glass,lion-pc,c,4,20180510,5
,7,eagle,glass,lion-pc,c,4,20180510,5


### IQR

In [9]:
od.iqr(df, "num").table()

words  (string),num  (int),animals  (string),thing  (string),two strings  (string),filter  (string),num 2  (int),date  (string),num 3  (string)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱,1,dog,&^%$#housé,cat-car,a,1,20150510,3
⸱⸱⸱⸱zombies,2,cat,tv,dog-tv,b,2,20160510,3
simpsons⸱⸱⸱cat⸱lady,2,frog,table,eagle-tv-plus,1,3,20170510,4
,3,eagle,glass,lion-pc,c,4,20180510,5
,5,eagle,glass,lion-pc,c,4,20180510,5
,6,eagle,glass,lion-pc,c,4,20180510,5
,7,eagle,glass,lion-pc,c,4,20180510,5
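
The cell above does not spell out the rule, so here is a rough sketch. The usual interquartile-range approach (Tukey's fences) flags values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The plain-Python version below assumes that convention and uses `statistics.quantiles` (Python 3.8+). Optimus may estimate the quartiles differently on Spark, so treat the exact fence values as illustrative.

```python
import statistics

# Values of the `num` column from the example dataframe
nums = [1, 2, 2, 3, 5, 6, 7, 20]

# Quartiles: the three cut points that split the data into four groups
q1, _, q3 = statistics.quantiles(nums, n=4)
iqr = q3 - q1

# Tukey's fences with the conventional 1.5 multiplier (an assumption here)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = [x for x in nums if x < lower or x > upper]
kept = [x for x in nums if lower <= x <= upper]
print(outliers, kept)
```

Under these assumptions only the value 20 falls outside the fences, which is consistent with the row missing from the output above.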


### MAD

In [10]:
od.mad(df, "num", 1).table()

words  (string),num  (int),animals  (string),thing  (string),two strings  (string),filter  (string),num 2  (int),date  (string),num 3  (string)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱,1,dog,&^%$#housé,cat-car,a,1,20150510,3
⸱⸱⸱⸱zombies,2,cat,tv,dog-tv,b,2,20160510,3
simpsons⸱⸱⸱cat⸱lady,2,frog,table,eagle-tv-plus,1,3,20170510,4
,3,eagle,glass,lion-pc,c,4,20180510,5
,5,eagle,glass,lion-pc,c,4,20180510,5


In [89]:
od.mad(df, "num", 1).table()

words  (string),num  (int),animals  (string),thing  (string),two strings  (string),filter  (string),num 2  (int),date  (string),num 3  (string),m_z_score  (double)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱,1,dog,&^%$#housé,cat-car,a,1,20150510,3,0.6745
⸱⸱⸱⸱zombies,2,cat,tv,dog-tv,b,2,20160510,3,0.33725
simpsons⸱⸱⸱cat⸱lady,2,frog,table,eagle-tv-plus,1,3,20170510,4,0.33725
,3,eagle,glass,lion-pc,c,4,20180510,5,0.0
,5,eagle,glass,lion-pc,c,4,20180510,5,0.6745


### Modified Z-score

In [91]:
od.modified_z_score(df, "num", 1).table()

words  (string),num  (int),animals  (string),thing  (string),two strings  (string),filter  (string),num 2  (int),date  (string),num 3  (string),m_z_score  (double)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱,1,dog,&^%$#housé,cat-car,a,1,20150510,3,0.6745
⸱⸱⸱⸱zombies,2,cat,tv,dog-tv,b,2,20160510,3,0.33725
simpsons⸱⸱⸱cat⸱lady,2,frog,table,eagle-tv-plus,1,3,20170510,4,0.33725
,3,eagle,glass,lion-pc,c,4,20180510,5,0.0
,5,eagle,glass,lion-pc,c,4,20180510,5,0.6745
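
The `m_z_score` column can be reproduced by hand with the usual modified z-score formula, 0.6745 * |x - median| / MAD. The sketch below is an assumption about what happens under the hood, not code taken from Optimus. It uses `statistics.median_low` so that the median is always an actual data point, which is what makes the numbers agree with the table above (the row with `num` = 3 gets a score of 0).

```python
import statistics

# Values of the `num` column from the example dataframe
nums = [1, 2, 2, 3, 5, 6, 7, 20]

# Low median: for an even number of values, take the lower middle element
median = statistics.median_low(nums)
abs_dev = [abs(x - median) for x in nums]
mad = statistics.median_low(abs_dev)

# Modified z-score: absolute deviation scaled by the MAD and the constant 0.6745
m_z_scores = [0.6745 * d / mad for d in abs_dev]

threshold = 1  # same threshold as in the cell above
kept = [(x, round(s, 5)) for x, s in zip(nums, m_z_scores) if s <= threshold]
print(kept)
```

With the exact median (`statistics.median`, which averages the two middle values) the scores would differ, so the choice of median estimator matters on small samples like this one.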
