# Handling Outliers
* Statistically, outliers are data points that significantly differ from other observations.
* Outliers can distort statistical analysis, for example by shifting the mean and inflating the standard deviation.
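
As a small illustration of the second point, the sketch below compares summary statistics for a sample with and without one extreme value. The numbers are made up for this example and do not come from the data used later in this notebook.

```python
import pandas as pd

# Hypothetical sample: five typical values, then the same values plus one extreme point
clean = pd.Series([10, 12, 11, 13, 12])
with_outlier = pd.Series([10, 12, 11, 13, 12, 100])

for name, s in [("clean", clean), ("with outlier", with_outlier)]:
    print(f"{name}: mean={s.mean():.2f}, std={s.std():.2f}, median={s.median():.2f}")
```

In this sample the median stays the same, while the mean and the standard deviation both change substantially once the extreme value is included.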

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Hands-On

In [2]:
import pandas as pd
import numpy as np

# Checking for Outliers
* 1. Values outside the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
* 2. Values outside the 5th to 95th percentile range.

In [3]:
# random data including outliers
x = pd.Series([23, 1, 3, 5, 34, 6, 32, 7, 45, 34, 78])
x.describe()

count    11.000000
mean     24.363636
std      23.589289
min       1.000000
25%       5.500000
50%      23.000000
75%      34.000000
max      78.000000
dtype: float64

## 1. Using IQR
Values outside the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR], where IQR = Q3 - Q1 is the interquartile range, are typically considered outliers.

In [4]:
# Return the positions of values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
def idx_of_outliers(x):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return np.where((x > upper_bound) | (x < lower_bound))

In [5]:
idx_of_outliers(x.values)

(array([10]),)
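
To see why only position 10 is flagged, it can help to print the fences themselves. The following is a minimal sketch using pandas quantiles on the same series; the variable names are chosen for this example.

```python
import pandas as pd

x = pd.Series([23, 1, 3, 5, 34, 6, 32, 7, 45, 34, 78])

q1 = x.quantile(0.25)
q3 = x.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"lower fence={lower}, upper fence={upper}")

# Boolean mask marking values outside the fences
outlier_mask = (x < lower) | (x > upper)
print(x[outlier_mask])
```

With Q1 = 5.5 and Q3 = 34, the fences are -37.25 and 76.75, so 78 is the only value that falls outside them.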

## 2. Using the 5th to 95th Percentile Range

In [6]:
print("5th percentile: ", x.quantile(q=0.05))
print("95th percentile: ", x.quantile(q=0.95))

5th percentile:  2.0
95th percentile:  61.5


In [7]:
x[(x < x.quantile(q=0.05)) | (x > x.quantile(q=0.95))]

1      1
10    78
dtype: int64
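
The same check can be wrapped in a small helper so the cut-offs are easy to change. This is only a sketch: the function name `outliers_by_percentile` and its parameters `lower` and `upper` are hypothetical names introduced here.

```python
import pandas as pd

def outliers_by_percentile(s, lower=0.05, upper=0.95):
    """Return the entries of s that lie outside the [lower, upper] quantile range."""
    low, high = s.quantile(lower), s.quantile(upper)
    # between() includes both endpoints, so values equal to a cut-off are kept
    return s[~s.between(low, high)]

x = pd.Series([23, 1, 3, 5, 34, 6, 32, 7, 45, 34, 78])
result = outliers_by_percentile(x)
print(result)
```

Because `Series.between` is inclusive by default, this selects the same rows as the strict `<` and `>` comparisons in the cell above.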

# Handling Outliers
* Remove outlier values.
* Transform the values, for example with the natural logarithm, to reduce their magnitude.


In [9]:
# given random house data
houses = pd.DataFrame()
houses["Price"] = [534433, 392333, 293222, 4322032]
houses["Bedrooms"] = [2, 3.5, 2, 116]
houses["Square_Feets"] = [1500, 2500, 1500, 48000]
houses

     Price  Bedrooms  Square_Feets
0   534433       2.0          1500
1   392333       3.5          2500
2   293222       2.0          1500
3  4322032     116.0         48000


In [10]:
houses.describe()

           Price   Bedrooms  Square_Feets
count        4.0        4.0           4.0
mean   1385505.0     30.875       13375.0
std    1960186.0  56.754405  23088.146309
min     293222.0        2.0        1500.0
25%     367555.2        2.0        1500.0
50%     463383.0       2.75        2000.0
75%    1481333.0     31.625       13875.0
max    4322032.0      116.0       48000.0


## 1. The simplest method to handle outliers is to remove them.

In [None]:
q1 = houses["Bedrooms"].quantile(0.25)
q3 = houses["Bedrooms"].quantile(0.75)
iqr = q3 - q1
# Keep only the rows whose Bedrooms value lies within the IQR fences
# (named `mask` to avoid shadowing the built-in `filter`)
mask = (houses["Bedrooms"] >= q1 - 1.5 * iqr) & (houses["Bedrooms"] <= q3 + 1.5 * iqr)
houses.loc[mask]

    Price  Bedrooms  Square_Feets
0  534433       2.0          1500
1  392333       3.5          2500
2  293222       2.0          1500
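
If the same filter is needed for several columns, it could be wrapped in a function. The sketch below is one possible way to do that; `remove_outliers_iqr` and the multiplier parameter `k` are hypothetical names introduced for illustration.

```python
import pandas as pd

def remove_outliers_iqr(df, column, k=1.5):
    """Return the rows of df whose value in `column` lies within the k * IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    keep = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df.loc[keep]

houses = pd.DataFrame({
    "Price": [534433, 392333, 293222, 4322032],
    "Bedrooms": [2, 3.5, 2, 116],
    "Square_Feets": [1500, 2500, 1500, 48000],
})

cleaned = remove_outliers_iqr(houses, "Bedrooms")
print(cleaned)
```

The function returns a filtered copy of the rows and leaves the original `houses` DataFrame unchanged.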


## 2. Transform the column to reduce the impact of outliers.

In [12]:
# Convert to log scale (natural logarithm), applied to the whole column at once
houses["Log_Square_Feets"] = np.log(houses["Square_Feets"])
houses

     Price  Bedrooms  Square_Feets  Log_Square_Feets
0   534433       2.0          1500           7.31322
1   392333       3.5          2500          7.824046
2   293222       2.0          1500           7.31322
3  4322032     116.0         48000         10.778956
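
One rough way to see the effect of the transform is to compare how far the largest value sits from the median before and after taking the logarithm. The ratio used below is just an illustrative measure chosen for this sketch, not a standard statistic.

```python
import numpy as np
import pandas as pd

sq_ft = pd.Series([1500, 2500, 1500, 48000], name="Square_Feets")
log_sq_ft = np.log(sq_ft)  # natural logarithm; requires strictly positive values

ratio_before = sq_ft.max() / sq_ft.median()
ratio_after = log_sq_ft.max() / log_sq_ft.median()
print("max / median before:", ratio_before)
print("max / median after: ", round(ratio_after, 3))

# The transform is reversible with np.exp
restored = np.exp(log_sq_ft)
```

The extreme row is still the largest after the transform, but it is much closer to the rest of the data. Note that `np.log` is undefined for zero and negative values; for columns that contain zeros, `np.log1p` is a commonly used alternative.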


# Conclusion
* There is no single correct method for identifying and handling outliers.  
* Consider several approaches and choose the one that best fits the data and the goal of the analysis.
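
As a quick check of the first point, the sketch below applies both detection rules from this notebook to the same series and compares which positions each one flags. The variable names are chosen for this example.

```python
import pandas as pd

x = pd.Series([23, 1, 3, 5, 34, 6, 32, 7, 45, 34, 78])

# Rule 1: 1.5 * IQR fences
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
iqr_flagged = set(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)].index)

# Rule 2: outside the 5th to 95th percentile range
pct_flagged = set(x[(x < x.quantile(0.05)) | (x > x.quantile(0.95))].index)

print("IQR rule:        ", sorted(iqr_flagged))
print("Percentile rule: ", sorted(pct_flagged))
print("Flagged by only one rule:", sorted(iqr_flagged ^ pct_flagged))
```

On this data the percentile rule also flags the smallest value (position 1), which the IQR rule does not, so the two rules give different answers for the same series.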
