# Finding outliers in data
#### What is an outlier?
* Any data point that lies far outside the usual spread (distribution) of a dataset is known as an outlier.
* Put another way, a data point that behaves very differently from the rest of the set is an outlier.


#### Why do they exist?
Outliers exist for two main reasons:
* Variance in data - anomalies and ambiguities in the data, which can be quite different from the bulk of the distribution
* Entry error - human error while preparing the dataset or entering values

## How to identify outliers?
There are two common ways to identify outliers in a dataset:
#### 1. Visualization plot
Plot the data in a scatter plot, histogram or box plot; outliers stand out because they lie far from the center of the data.
#### 2. Inter-Quartile Range (IQR) method
Uses the IQR and the quartiles to define upper and lower bounds for the data.
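
For the visualization route, a box plot can be drawn with matplotlib. This is a minimal sketch, assuming matplotlib is installed; the sample values and the names `values` and `flier_points` are made up for illustration. By default `boxplot` draws whiskers at 1.5 times the IQR and marks anything beyond them as "fliers", so the plot and the IQR method agree on what looks like an outlier.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample: most values sit near 10-13, one value is far away
values = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95], dtype=float)

fig, ax = plt.subplots()
box = ax.boxplot(values)  # default whiskers reach 1.5 * IQR past the quartiles
ax.set_title("Box plot of sample values")

# Points drawn beyond the whiskers are the candidate outliers
flier_points = box["fliers"][0].get_ydata()
print(flier_points)

plt.close(fig)
```

In a notebook, dropping the `matplotlib.use("Agg")` line and calling `plt.show()` instead of `plt.close(fig)` displays the figure inline.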

### How to use the Inter-Quartile Range method?
#### Steps
* Sort the data in the list in ascending order
* Calculate the first quartile (Q1) and the third quartile (Q3) and the IQR (Q3-Q1)
* Calculate the lower bound (Q1 - (IQR * 1.5))
* Calculate the upper bound (Q3 + (IQR * 1.5))
* Any value that lies below the lower bound or above the upper bound is a potential outlier
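
The steps can be sketched as a pair of small helper functions. This is only an illustration: `iqr_bounds`, `find_outliers` and the `sample` list are hypothetical names and data, and the quartiles come from NumPy's `percentile` with its default linear interpolation. Note that the upper bound adds 1.5 times the IQR to Q3.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) IQR bounds for a 1-D sequence of numbers."""
    data = np.sort(np.asarray(values, dtype=float))  # sort in ascending order
    q1, q3 = np.percentile(data, [25, 75])           # first and third quartiles
    iqr = q3 - q1                                    # inter-quartile range
    lower = q1 - k * iqr                             # lower bound
    upper = q3 + k * iqr                             # upper bound
    return lower, upper

def find_outliers(values, k=1.5):
    """Return the values that fall below the lower or above the upper bound."""
    lower, upper = iqr_bounds(values, k)
    data = np.asarray(values, dtype=float)
    return data[(data < lower) | (data > upper)]

sample = [2, 3, 4, 5, 6, 7, 8, 9, 50]
lower, upper = iqr_bounds(sample)
outliers = find_outliers(sample)
print(lower, upper)
print(outliers)
```

Sorting is not strictly required by `np.percentile`, which orders the data internally; it is kept here only to mirror the first step in the list.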

Pandas has a built-in method, `quantile`, to find the quartiles. The IQR and the upper and lower bounds have to be calculated from them.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df=pd.DataFrame(np.random.randn(900,3))

In [3]:
df.head()

          0         1         2
0 -0.131694  1.350226  0.416394
1 -1.098554 -0.069028  1.257827
2  0.876352 -0.141910 -0.261031
3 -1.292246 -1.116471 -1.236922
4 -0.957594  0.740286  3.131848


In [6]:
quantiles_df = (df.quantile([0.25,0.75]))

In [7]:
print("The 1st and 3rd Quartile of all columns\n")
print(quantiles_df)

The 1st and 3rd Quartile of all columns

             0         1         2
0.25 -0.668095 -0.668929 -0.636122
0.75  0.684575  0.638832  0.687872


In [8]:
# Rows of quantiles_df are the quantile levels, columns are the data columns
Q1 = quantiles_df.loc[0.25, 0]
Q3 = quantiles_df.loc[0.75, 0]

In [9]:
IQR = Q3 - Q1

In [10]:
lowerbound = (Q1 - (IQR * 1.5))
upperbound = (Q3 + (IQR * 1.5))

In [11]:
print("The lower bound for first column\n")
print(lowerbound)
print("\n The upper bound for first column\n")
print(upperbound)

The lower bound for first column

-2.697100451184453

 The upper bound for first column

2.7135803524632545


In [13]:
# Work on an explicit copy so later assignments do not trigger SettingWithCopyWarning
col1 = df[0].copy()

In [14]:
print("The outliers in the first column below the lower bound are\n")
print(col1[col1 < lowerbound])

The outliers in the first column below the lower bound are

273   -2.903475
720   -2.846870
786   -2.890979
869   -2.707135
Name: 0, dtype: float64


In [15]:
print("The outliers in the first column above the upper bound are\n")
print(col1[col1 > upperbound])

The outliers in the first column above the upper bound are

324    3.287915
436    3.376404
572    3.052397
787    2.817193
Name: 0, dtype: float64
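
The same check can be run on every column at once instead of one column at a time. The following is a sketch under assumptions: `demo_df` is a fresh random frame with a fixed seed (not the `df` used in the cells above), and `outlier_mask` is an illustrative name. Comparing a DataFrame with a Series aligns the Series index with the column labels, so each column is tested against its own bounds.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable
demo_df = pd.DataFrame(rng.standard_normal((900, 3)))

# quantile() on a DataFrame returns one value per column
q1 = demo_df.quantile(0.25)
q3 = demo_df.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# True wherever a value falls outside the bounds of its own column
outlier_mask = (demo_df < lower) | (demo_df > upper)

print(outlier_mask.sum())                 # number of potential outliers per column
print(demo_df[outlier_mask.any(axis=1)])  # rows with at least one potential outlier
```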


#### Now that we have identified the outliers, let's see how to handle them
#### Assign the lower bound to the outliers below it and the upper bound to the outliers above it

In [16]:
col1[col1 < lowerbound] = lowerbound
col1[col1 > upperbound] = upperbound

In [17]:
print("After handling the outliers\n")
print("The outliers in the first column below the lower bound are\n", col1[col1 < lowerbound])
print("The outliers in the first column above the upper bound are\n", col1[col1 > upperbound])

After handling the outliers

The outliers in the first column below the lower bound are
 Series([], Name: 0, dtype: float64)
The outliers in the first column above the upper bound are
 Series([], Name: 0, dtype: float64)


In [18]:
print(col1)

0     -0.131694
1     -1.098554
2      0.876352
3     -1.292246
4     -0.957594
         ...   
895    0.423038
896    0.049663
897    0.141169
898    1.643089
899   -1.685683
Name: 0, Length: 900, dtype: float64


In [19]:
df[0]=col1

In [20]:
print(df)

            0         1         2
0   -0.131694  1.350226  0.416394
1   -1.098554 -0.069028  1.257827
2    0.876352 -0.141910 -0.261031
3   -1.292246 -1.116471 -1.236922
4   -0.957594  0.740286  3.131848
..        ...       ...       ...
895  0.423038 -0.365692  0.888563
896  0.049663 -1.622724  1.617429
897  0.141169 -0.280255  0.126374
898  1.643089 -0.376239  0.635775
899 -1.685683  0.228994 -1.098492

[900 rows x 3 columns]
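
Capping values at the bounds can also be done for all columns in one call with `DataFrame.clip`. This is a sketch rather than a replacement for the cells above: `demo_df` and `capped_df` are illustrative names, and the data is a fresh random frame with a fixed seed. Passing `axis=1` makes pandas match each bound in the `lower` and `upper` Series to the column with the same label.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable
demo_df = pd.DataFrame(rng.standard_normal((900, 3)))

# Per-column bounds from the IQR method
q1 = demo_df.quantile(0.25)
q3 = demo_df.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values below the lower bound become the lower bound, values above the
# upper bound become the upper bound; everything else is left unchanged
capped_df = demo_df.clip(lower=lower, upper=upper, axis=1)

# After capping, no value should remain outside its column's bounds
remaining = int(((capped_df < lower) | (capped_df > upper)).sum().sum())
print(remaining)
```

`clip` returns a new DataFrame and leaves `demo_df` untouched, which makes it easy to compare the data before and after capping.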
