# Outlier Detection and Handling

This notebook covers several techniques for detecting and removing outliers in a dataset. The examples use the California housing dataset from the scikit-learn library.

<hr>

The first technique uses the interquartile range (IQR).

## Outlier Detection - IQR

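As a reminder, the IQR is the difference between the third quartile (Q3) and the first quartile (Q1): IQR = Q3 - Q1. Once the IQR is calculated, lower and upper bounds can be defined to identify outliers. A common choice, used in this notebook, is Q1 - 1.5 * IQR for the lower bound and Q3 + 1.5 * IQR for the upper bound.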
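To make the bounds concrete, here is a minimal sketch on a hypothetical toy series (the values are invented for illustration and are not part of the housing data). It assumes pandas' default linear interpolation for quantiles.

```python
import pandas as pd

# Hypothetical toy data: nine ordinary values and one extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Bounds using the conventional 1.5 * IQR multiplier
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values outside the bounds are flagged as outliers
toy_outliers = values[(values < lower_bound) | (values > upper_bound)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"bounds: [{lower_bound}, {upper_bound}]")
print("flagged:", toy_outliers.tolist())
```

Only the value 100 falls outside the bounds here. The same arithmetic is applied column by column when the quantiles are computed on a whole dataframe.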

In [4]:
# Import packages needed 
from sklearn.datasets import fetch_california_housing
import pandas as pd 

# Read California housing data as a dataframe
data = fetch_california_housing(as_frame = True)
df = data["data"]
df["MedHouseVal"] = data["target"]

In [5]:
# First we need to identify Q1 and Q3
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)

# Calculate the IQR 
IQR = Q3 - Q1 

# Keep the rows where at least one column falls outside the lower or upper bound
outliers = df[((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis = 1)]
outliers

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
41,1.2852,51.0,3.759036,1.248996,517.0,2.076305,37.83,-122.26,1.500
57,0.8172,52.0,6.102459,1.372951,728.0,2.983607,37.82,-122.28,0.853
59,2.5625,2.0,2.771930,0.754386,94.0,1.649123,37.82,-122.29,0.600
...,...,...,...,...,...,...,...,...,...
20608,1.7167,24.0,5.400000,1.273171,768.0,3.746341,39.10,-121.59,0.488
20620,4.5625,40.0,4.125000,0.854167,151.0,3.145833,39.05,-121.48,1.000
20621,2.3661,37.0,7.923567,1.573248,484.0,3.082803,39.01,-121.47,0.775
20629,2.0943,28.0,5.519802,1.020902,6912.0,3.801980,39.12,-121.39,1.083


Using the IQR technique, 4,328 rows contain at least one outlying value. To remove them, negate the filter:

In [7]:
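# Keep only the rows where no column falls outside the bounds
df_no_outliers = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis = 1)]
# Renumber the index from 0 after dropping rows
df_no_outliers = df_no_outliers.reset_index(drop = True)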
df_no_outliers

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
1,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
2,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
3,4.0368,52.0,4.761658,1.103627,413.0,2.139896,37.85,-122.25,2.697
4,3.6591,52.0,4.931907,0.951362,1094.0,2.128405,37.84,-122.25,2.992
...,...,...,...,...,...,...,...,...,...
16307,3.7125,28.0,6.779070,1.148256,1041.0,3.026163,39.27,-121.56,1.168
16308,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
16309,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
16310,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


Once outliers are removed, there are 16,312 rows left in the dataset.
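Because the filter uses `.any(axis = 1)`, a row is dropped as soon as a single column is out of bounds, so the row count is not the same as the number of outlying cells. The sketch below shows the difference on a hypothetical toy dataframe (the column names `a`, `b`, `c` and their values are invented for illustration).

```python
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
    "c": [2, 3, 4, 5, 6, 7, 8, 9, 10, -50],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Boolean frame: True where a cell lies outside its own column's bounds
is_outlier = (toy < (q1 - 1.5 * iqr)) | (toy > (q3 + 1.5 * iqr))

# Number of outlying cells in each column
flagged_per_column = is_outlier.sum()

# Number of rows with at least one outlying cell
flagged_rows = is_outlier.any(axis=1).sum()

print(flagged_per_column)
print("rows flagged:", flagged_rows)
```

In this toy frame the last row is extreme in both `a` and `c`, so two cells are flagged but only one row is removed. Inspecting the per-column counts on real data can show which features drive most of the removals.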

<hr>

## Outlier Detection - Z-scores

Another statistical measure for detecting outliers is the Z-score, which represents the number of standard deviations a data point lies from the mean: z = (x - mean) / standard deviation. A common threshold is 3: any data point with a Z-score below -3 or above +3 is considered an outlier.

Keep the following in mind when using Z-scores:

- The data is assumed to follow a normal distribution.
- Z-scores are sensitive to extreme values: these influence the mean and standard deviation, which can distort outlier detection.
- A threshold of 3 is commonly used, but the appropriate threshold depends on your dataset.
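The sensitivity to extreme values can be seen in a small sketch. It computes Z-scores by hand with NumPy on two hypothetical samples (invented for illustration), using the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import numpy as np

def zscores(x):
    """Population z-scores (ddof=0): (x - mean) / standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Hypothetical small sample: the extreme value inflates the mean and the
# standard deviation so much that its own z-score stays below 3
small = [10, 11, 9, 10, 12, 10, 11, 9, 10, 1000]
z_small = zscores(small)
small_flagged = np.abs(z_small) > 3

# Hypothetical larger sample: the same extreme value now stands out clearly
large = [9, 10, 11] * 33 + [1000]
z_large = zscores(large)
large_flagged = np.abs(z_large) > 3

print("small sample, z-score of 1000:", z_small[-1])
print("small sample, any flagged:", small_flagged.any())
print("large sample, z-score of 1000:", z_large[-1])
print("large sample, number flagged:", large_flagged.sum())
```

In the ten-point sample, the value 1000 is not flagged at a threshold of 3 because it dominates the standard deviation. With a population standard deviation, no Z-score in a sample of n points can exceed sqrt(n - 1), which is exactly 3 for n = 10. This is one reason the threshold should be chosen with the dataset in mind.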

Use the Python library SciPy to calculate Z-scores.

In [15]:
# Use libraries scipy and numpy to calculate z-scores
from scipy import stats
import numpy as np 

# Reread data into a dataframe
data = fetch_california_housing(as_frame = True)
df = data["data"]

# Calculate z-scores using the California housing dataset
z_scores = stats.zscore(df)
z_scores

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,2.344766,0.982143,0.628559,-0.153758,-0.974429,-0.049597,1.052548,-1.327835
1,2.332238,-0.607019,0.327041,-0.263336,0.861439,-0.092512,1.043185,-1.322844
2,1.782699,1.856182,1.155620,-0.049016,-0.820777,-0.025843,1.038503,-1.332827
3,0.932968,1.856182,0.156966,-0.049833,-0.766028,-0.050329,1.038503,-1.337818
4,-0.012881,1.856182,0.344711,-0.032906,-0.759847,-0.085616,1.038503,-1.337818
...,...,...,...,...,...,...,...,...
20635,-1.216128,-0.289187,-0.155023,0.077354,-0.512592,-0.049110,1.801647,-0.758826
20636,-0.691593,-0.845393,0.276881,0.462365,-0.944405,0.005021,1.806329,-0.818722
20637,-1.142593,-0.924851,-0.090318,0.049414,-0.369537,-0.071735,1.778237,-0.823713
20638,-1.054583,-0.845393,-0.040211,0.158778,-0.604429,-0.091225,1.778237,-0.873626


In [22]:
# Get absolute values of z-scores
absolute_zscores = np.abs(z_scores)

# Select rows where every column's z-score is strictly between -3 and +3
filtered_points = (absolute_zscores < 3).all(axis = 1)

# Filter the dataframe 
filtered_df = df[filtered_points].reset_index(drop = True)
filtered_df


Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
19789,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
19790,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
19791,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
19792,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


After calculating Z-scores and applying the threshold, 846 rows containing at least one outlier were removed from the data.

Some other ways to detect outliers are the following: 

- Histograms
- Box plots
- Scatter plots
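As a rough, text-only stand-in for a histogram plot, the sketch below uses `numpy.histogram` to obtain bin counts for a hypothetical toy sample (values invented for illustration). An isolated bin separated from the rest by empty bins is the pattern one would look for in the plotted version.

```python
import numpy as np

# Hypothetical toy data: nine ordinary values and one extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Ten equal-width bins spanning the range of the data
counts, bin_edges = np.histogram(values, bins=10)

# Print a crude text histogram, one line per bin
for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"[{left:6.1f}, {right:6.1f}) {'#' * int(count)}")
```

Here nine values share the first bin, eight bins are empty, and the last bin holds a single value, which suggests that value deserves a closer look. Plotting libraries show the same counts graphically.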