<h2><u>Statistical Foundations</u></h2>

# Module 2 – Exploratory Data Analysis
<h2>Demo 1: Detecting and Removing Outliers</h2>

In this demo, you will be shown how to detect and remove outliers using Z-score and IQR score.

In [3]:
#Import the required libraries
import pandas as pd
from sklearn import datasets
from scipy import stats
import numpy as np

In [7]:
#Load the Boston House Pricing Dataset which is included in the sklearn dataset API
boston = datasets.load_breast_cancer()
x = boston.data
y = boston.target
columns = boston.feature_names

In [8]:
#Create the dataframe
boston_df = pd.DataFrame(boston.data)
boston_df.columns = columns
boston_df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


### Using Z-Score

In [14]:
#Step1: Use Z-score function defined in scipy library to detect the outliers
boston_df_z = boston_df
z = np.abs(stats.zscore(boston_df))
print(z)

     mean radius  mean texture  mean perimeter  mean area  mean smoothness   
0       1.097064      2.073335        1.269934   0.984375         1.568466  \
1       1.829821      0.353632        1.685955   1.908708         0.826962   
2       1.579888      0.456187        1.566503   1.558884         0.942210   
3       0.768909      0.253732        0.592687   0.764464         3.283553   
4       1.750297      1.151816        1.776573   1.826229         0.280372   
..           ...           ...             ...        ...              ...   
564     2.110995      0.721473        2.060786   2.343856         1.041842   
565     1.704854      2.085134        1.615931   1.723842         0.102458   
566     0.702284      2.045574        0.672676   0.577953         0.840484   
567     1.838341      2.336457        1.982524   1.735218         1.525767   
568     1.808401      1.221792        1.814389   1.347789         3.112085   

     mean compactness  mean concavity  mean concave points  mea

Looking at the code and the output above, it is difficult to say which data point is an outlier.
So let’s define a threshold to identify an outlier.

In [15]:
#Step2: Define a threshold
threshold = 3
x_new=np.where(z > 3)
print(np.where(z > 3))

(array([  0,   3,   3,   3,   3,   3,   3,   3,   3,   9,   9,   9,  12,
        12,  12,  12,  12,  14,  14,  23,  25,  31,  31,  35,  42,  42,
        42,  60,  68,  68,  68,  68,  71,  71,  71,  71,  72,  78,  78,
        78,  78,  78,  82,  82,  82,  82,  82,  82,  82,  83, 105, 105,
       108, 108, 108, 108, 108, 108, 112, 112, 116, 119, 119, 122, 122,
       122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 122,
       138, 138, 146, 146, 146, 151, 151, 152, 152, 152, 152, 152, 152,
       176, 176, 180, 180, 180, 180, 180, 180, 180, 181, 181, 190, 190,
       190, 190, 190, 192, 202, 203, 212, 212, 212, 212, 212, 212, 212,
       213, 213, 213, 213, 213, 219, 219, 232, 236, 236, 239, 239, 258,
       258, 258, 259, 259, 265, 265, 265, 265, 265, 288, 288, 290, 290,
       314, 314, 318, 323, 339, 339, 345, 351, 352, 352, 352, 352, 352,
       352, 352, 352, 368, 368, 370, 376, 376, 376, 379, 379, 379, 388,
       389, 400, 416, 417, 417, 430, 461, 461, 461, 461, 461, 4

The first array contains the list of row numbers and second array contains the respective column numbers, which means that <b><i>z[55][1]</i> has a z-score higher than 3</b>.

In [20]:
#Step3: Print the z-score of z[1][2]
print(x_new[1][2])

5


So, the data point — 1sth record on column ZN is an outlier.

In [22]:
#Step4: Remove the outliers using the z-score
boston_df_z = boston_df_z[(z < 3).all(axis=1)]

print("The no. of rows before outlier filtering was: ", boston_df.shape)
print("The no. of rows after outlier filtering is: ", boston_df_z.shape)

The no. of rows before outlier filtering was:  (569, 30)
The no. of rows after outlier filtering is:  (495, 30)


  boston_df_z = boston_df_z[(z < 3).all(axis=1)]


Hence, we filtered out around 90+ rows from the dataset i.e. outliers have been removed.

### Using IQR Score

In [23]:
#Step1: Calculate the IQR
boston_df_iqr = boston_df
Q1 = boston_df_iqr.quantile(0.25)
Q3 = boston_df_iqr.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

mean radius                  4.080000
mean texture                 5.630000
mean perimeter              28.930000
mean area                  362.400000
mean smoothness              0.018930
mean compactness             0.065480
mean concavity               0.101140
mean concave points          0.053690
mean symmetry                0.033800
mean fractal dimension       0.008420
radius error                 0.246500
texture error                0.640100
perimeter error              1.751000
area error                  27.340000
smoothness error             0.002977
compactness error            0.019370
concavity error              0.026960
concave points error         0.007072
symmetry error               0.008320
fractal dimension error      0.002310
worst radius                 5.780000
worst texture                8.640000
worst perimeter             41.290000
worst area                 568.700000
worst smoothness             0.029400
worst compactness            0.191900
worst concav

In [24]:
#Step2: Detect the outliers
print(boston_df_iqr < (Q1 - 1.5 * IQR)) |(boston_df_iqr > (Q3 + 1.5 * IQR))

     mean radius  mean texture  mean perimeter  mean area  mean smoothness   
0          False         False           False      False            False  \
1          False         False           False      False            False   
2          False         False           False      False            False   
3          False         False           False      False            False   
4          False         False           False      False            False   
..           ...           ...             ...        ...              ...   
564        False         False           False      False            False   
565        False         False           False      False            False   
566        False         False           False      False            False   
567        False         False           False      False            False   
568        False         False           False      False             True   

     mean compactness  mean concavity  mean concave points  mea

TypeError: Cannot perform 'ror_' with a dtyped [bool] array and scalar of type [NoneType]

The data point where we have False that means these values are valid whereas <b><i>True</i> indicates presence of an outlier</b>.

In [25]:
#Step3: Remove the outliers using the IQR score
boston_df_out = boston_df_iqr[~((boston_df_iqr < (Q1 - 1.5 * IQR)) |(boston_df_iqr > (Q3 + 1.5 * IQR))).any(axis=1)]

print("The no. of rows before outlier filtering was: ", boston_df_iqr.shape)
print("The no. of rows after outlier filtering is: ", boston_df_out.shape)

The no. of rows before outlier filtering was:  (569, 30)
The no. of rows after outlier filtering is:  (398, 30)


Hence, the outliers have been removed.