# Pandas use case:

# Part II - Outliers and Merging

Let's continue working on the three 'xxx_nmv' DataFrames. 

### Exercise II.1

Load the DataFrames from the previous part. A typical next step is to look at the distribution of each individual attribute in each of these three DataFrames. We will treat every value that lies more than three standard deviations from the mean of its attribute as an outlier ("z-score outlier"). Which values in which DataFrame are outliers? Based on the 'xxx_nmv' DataFrames, create three new DataFrames 'xxx_clean' from which the tuples containing outliers are removed.
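
Written out, the outlier rule might be stated as follows. The symbols are introduced here for illustration only: $x_{ij}$ is the value of attribute $j$ in tuple $i$, $\bar{x}_j$ is the mean of attribute $j$, and $s_j$ is its standard deviation. Note that pandas' `.std()` computes the sample standard deviation (divisor $n-1$) by default, which is what the formula assumes.

$$
|x_{ij} - \bar{x}_j| > 3\, s_j, \qquad s_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_{ij} - \bar{x}_j\right)^2}
$$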

In [2]:
%matplotlib notebook
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

In [3]:
## ---------- SOLUTIONS

In [4]:
setosa_nmv = pd.read_csv('output/setosa_nmv.csv', index_col=0)
versicolor_nmv = pd.read_csv('output/versicolor_nmv.csv', index_col=0)
virginica_nmv = pd.read_csv('output/virginica_nmv.csv', index_col=0)

In [5]:
# comparing min and max with mean +/- 3*std in the .describe() output shows
# that there are no outliers in versicolor_nmv and virginica_nmv
versicolor_nmv.describe().T

,count,mean,std,min,25%,50%,75%,max
sepal length,50.0,5.936,0.516171,4.9,5.6,5.9,6.3,7.0
sepal width,50.0,2.77,0.313798,2.0,2.525,2.8,3.0,3.4
petal length,50.0,4.26,0.469911,3.0,4.0,4.35,4.6,5.1
petal width,50.0,1.326,0.197753,1.0,1.2,1.3,1.5,1.8


In [6]:
versicolor_clean = versicolor_nmv

In [7]:
virginica_nmv.describe().T

,count,mean,std,min,25%,50%,75%,max
sepal length,50.0,6.588,0.63588,4.9,6.225,6.5,6.9,7.9
sepal width,50.0,2.974,0.322497,2.2,2.8,3.0,3.175,3.8
petal length,50.0,5.552,0.551895,4.5,5.1,5.55,5.875,6.9
petal width,50.0,2.026,0.27465,1.4,1.8,2.0,2.3,2.5


In [8]:
virginica_clean = virginica_nmv

In [9]:
# alternatively, we could check for the "z-score outlier condition" explicitly:
((versicolor_nmv - versicolor_nmv.mean()).abs() <= versicolor_nmv.std()*3).all() 
# add another .all() to get ONE true/false value over all columns

sepal length    True
sepal width     True
petal length    True
petal width     True
dtype: bool

In [10]:
((virginica_nmv - virginica_nmv.mean()).abs() <= virginica_nmv.std()*3).all()

sepal length    True
sepal width     True
petal length    True
petal width     True
dtype: bool
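
Because the same condition is applied to each DataFrame, it could be wrapped in a small helper. The following is only a sketch: the function name `zscore_outlier_mask`, its parameter `k`, and the toy data are assumptions made for illustration and are not part of the exercise files.

```python
import pandas as pd

def zscore_outlier_mask(df, k=3.0):
    """Return a boolean DataFrame that is True where a value lies more than
    k sample standard deviations away from the mean of its column."""
    return (df - df.mean()).abs() > k * df.std()

# toy data (made up): column "a" holds one extreme value, column "b" holds none
toy = pd.DataFrame({
    "a": [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0, 1.05, 50.0],
    "b": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9, 2.0, 2.05, 2.0],
})

mask = zscore_outlier_mask(toy)
print(mask.any())        # per column: does it contain at least one outlier?
print(mask.any().any())  # one True/False value for the whole DataFrame
```

Returning the mask rather than the filtered data keeps both uses open: inspecting where the outliers are, and removing the affected rows afterwards.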

In [11]:
# however, we do see outliers in all four attributes of setosa_nmv
# (but only values that are too large)
setosa_nmv.describe().T

,count,mean,std,min,25%,50%,75%,max
sepal length,54.0,5.785185,5.863556,4.3,4.8,5.0,5.2,48.0
sepal width,54.0,4.457407,5.36669,2.3,3.2,3.4,3.7,32.0
petal length,54.0,2.044444,2.991287,1.0,1.4,1.5,1.6,19.0
petal width,54.0,0.257407,0.146148,0.1,0.2,0.2,0.3,1.0


In [12]:
# create a boolean mask that is True wherever a value is an outlier
setosa_nmv_outlier_mask = ((setosa_nmv - setosa_nmv.mean()).abs() > setosa_nmv.std()*3)

In [13]:
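# .all() asks whether EVERY value in a column is an outlier (no column qualifies);
# to list the columns that contain at least one outlier, use .any() instead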
setosa_nmv_outlier_mask.all()

sepal length    False
sepal width     False
petal length    False
petal width     False
dtype: bool

In [14]:
# how many outliers are there in each column (note: these are not necessarily in different rows!)
setosa_nmv_outlier_mask.sum(axis=0)

sepal length    1
sepal width     2
petal length    2
petal width     1
dtype: int64
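
The column counts above add up to six flagged cells, but several of them can share a row. A minimal sketch of the difference between counting cells and counting rows, using a made-up mask (the names `mask`, `n_cells` and `n_rows` are illustrative only):

```python
import pandas as pd

# hypothetical boolean mask: True marks an outlier cell
mask = pd.DataFrame({
    "x": [False, True, False, False],
    "y": [False, True, False, True],
})

cells_per_column = mask.sum(axis=0)   # outlier cells in each column
n_cells = int(mask.sum().sum())       # outlier cells in total
n_rows = int(mask.any(axis=1).sum())  # rows holding at least one outlier

print(cells_per_column)
print(n_cells, n_rows)
```

Here row 1 carries two outlier cells, so the number of affected rows is smaller than the number of flagged cells.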

In [15]:
# show the mask for only the rows that contain at least one outlier
setosa_nmv_outlier_mask[setosa_nmv_outlier_mask.any(axis=1)]

,sepal length,sepal width,petal length,petal width
13,False,True,True,True
31,False,False,True,False
33,True,False,False,False
52,False,True,False,False


In [16]:
# return the data rows that contain at least one outlier
setosa_nmv[setosa_nmv_outlier_mask.any(axis=1)]

,sepal length,sepal width,petal length,petal width
13,4.9,31.0,15.0,1.0
31,4.8,3.4,19.0,0.2
33,48.0,3.4,1.9,0.2
52,4.4,32.0,1.3,0.3


In [17]:
# so let's remove all tuples containing at least one outlier
setosa_clean = setosa_nmv[~setosa_nmv_outlier_mask.any(axis=1)]  # ~ inverts the boolean row mask
setosa_clean

,sepal length,sepal width,petal length,petal width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
4,4.6,3.1,1.5,0.2
5,5.0,3.6,1.4,0.2
6,5.4,3.9,1.7,0.4
7,4.6,3.4,1.4,0.3
8,5.0,3.4,1.5,0.2
10,4.4,2.9,1.4,0.2
11,4.9,3.1,1.5,0.1
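
Removing rows comes down to selecting with an inverted boolean row mask. As a small sketch on made-up data (the names `df`, `is_outlier`, `kept` and `same` are illustrative), the element-wise logical NOT operator `~` turns "is an outlier row" into "is a row to keep":

```python
import pandas as pd

df = pd.DataFrame({"v": [1.0, 2.0, 100.0, 3.0]})                     # made-up values
is_outlier = pd.Series([False, False, True, False], index=df.index)  # hypothetical row mask

kept = df[~is_outlier]      # ~ flips True/False element-wise
same = df.loc[~is_outlier]  # equivalent selection with explicit .loc

print(kept)
```

Note that the surviving rows keep their original index labels, which is why the index of the filtered DataFrame has gaps.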


### Exercise II.2

Combine the three 'xxx_clean' DataFrames into one DataFrame 'iris' by adding a new column 'class' which is set to 'Iris-setosa', 'Iris-versicolor' or 'Iris-virginica' respectively. Make sure the index of 'iris' consists of unique numbers from 0 up to the number of tuples minus one. Save your result in a file 'output/iris_clean.csv'.


In [18]:
## ---------- SOLUTIONS

In [19]:
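# versicolor_clean and virginica_clean are just second names for the xxx_nmv objects,
# and setosa_clean is a selection from setosa_nmv, so adding a column could modify the
# originals or trigger a SettingWithCopyWarning. So let's make fresh copies of these dfs!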
setosa_clean = setosa_clean.copy()
versicolor_clean = versicolor_clean.copy()
virginica_clean = virginica_clean.copy()

setosa_clean['class'] = 'Iris-setosa'
versicolor_clean['class'] = 'Iris-versicolor'
virginica_clean['class'] = 'Iris-virginica'
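
Why the copies matter can be seen on a tiny made-up example (all names below are illustrative): plain assignment only creates a second name for the same DataFrame, whereas `.copy()` creates an independent object.

```python
import pandas as pd

original = pd.DataFrame({"v": [1, 2, 3]})   # made-up data
alias = original                             # second name for the SAME object
alias["class"] = "demo"                      # the column also shows up in `original`
print("class" in original.columns)

original2 = pd.DataFrame({"v": [1, 2, 3]})
independent = original2.copy()               # independent copy
independent["class"] = "demo"                # `original2` stays untouched
print("class" in original2.columns)
```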

In [20]:
iris = pd.concat([setosa_clean, versicolor_clean, virginica_clean])
# you may have come across DataFrame.append; it was deprecated in pandas 1.4 and
# removed in pandas 2.0, so use pd.concat instead
iris

,sepal length,sepal width,petal length,petal width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
4,4.6,3.1,1.5,0.2,Iris-setosa
5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
46,6.7,3.0,5.2,2.3,Iris-virginica
47,6.3,2.5,5.0,1.9,Iris-virginica
48,6.5,3.0,5.2,2.0,Iris-virginica
49,6.2,3.4,5.4,2.3,Iris-virginica


In [21]:
iris = iris.reset_index(drop=True)  # renumber the rows 0..len(iris)-1 and discard the old index
iris

,sepal length,sepal width,petal length,petal width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
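
Concatenating and renumbering can also be done in a single step. The sketch below uses two tiny stand-in DataFrames with made-up values (`a`, `b` and `combined` are illustrative names) and the `ignore_index=True` option of `pd.concat`:

```python
import pandas as pd

# tiny stand-ins for the cleaned frames; note the gap and the repeated label in the indexes
a = pd.DataFrame({"petal width": [0.2, 0.4], "class": "Iris-setosa"}, index=[0, 2])
b = pd.DataFrame({"petal width": [1.3], "class": "Iris-versicolor"}, index=[0])

# ignore_index=True discards the old labels and numbers the rows 0..n-1
combined = pd.concat([a, b], ignore_index=True)
print(combined)
```

Without `ignore_index=True` the result would carry the labels 0, 2, 0, so the index would be neither unique nor contiguous.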


In [22]:
iris.to_csv('output/iris_clean.csv')
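
A quick way to check a saved file is to read it back. The sketch below is self-contained and writes to a temporary directory; the file name `iris_demo.csv` and the two-row DataFrame are made up for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "sepal length": [5.1, 6.2],
    "class": ["Iris-setosa", "Iris-virginica"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "iris_demo.csv")   # hypothetical file name
    df.to_csv(path)                              # the index is written as the first column
    restored = pd.read_csv(path, index_col=0)    # read that column back as the index

print(restored)
```

Passing `index_col=0` mirrors how the 'xxx_nmv' files are loaded at the start of this part; without it, the saved index would come back as an extra data column.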

------