# 🚨 Day-3: Outlier Detection
---
## 🌟 What are Outliers?

**Outliers** are data points that are significantly different from the rest of the data.  
They can be unusually **high** or **low** compared to other values.

Example:  
If most students scored between **60 to 90 marks**, but one student scored **5 marks** and another scored **100 marks**, these two are considered **outliers**.

---
## 🔥 Why Detect Outliers?

Outliers can:
- Distort summary statistics (Mean, Standard Deviation, etc.)
- Affect visualizations
- Mislead Machine Learning models
- Impact data distribution and model accuracy
---
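
To make the first point concrete, here is a minimal sketch of how a single extreme value shifts the mean and standard deviation while the median barely moves. The marks are hypothetical and are not taken from any dataset in this notebook.

```python
from statistics import mean, median, stdev

# Hypothetical marks: seven typical scores plus one extreme low score
typical = [60, 65, 70, 75, 80, 85, 90]
with_outlier = typical + [5]

# The mean and standard deviation react strongly to the extra point
print(f"Mean without outlier:    {mean(typical):.2f}")
print(f"Mean with outlier:       {mean(with_outlier):.2f}")
print(f"Std dev without outlier: {stdev(typical):.2f}")
print(f"Std dev with outlier:    {stdev(with_outlier):.2f}")

# The median is far less sensitive to it
print(f"Median without outlier:  {median(typical)}")
print(f"Median with outlier:     {median(with_outlier)}")
```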

## 🎯 How to Detect Outliers?

| Method | Description |
|---|---|
| **IQR Method** | Values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are considered outliers |
| **Z-Score Method** | Data points with a Z-score > 3 or < -3 are considered outliers |
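
Written out as formulas, the two rules look roughly like this. The multiplier 1.5 and the threshold 3 are the conventional defaults used in this notebook, and the symbols (the sample mean $\bar{x}$ and sample standard deviation $s$) are notation assumed here for illustration.

```latex
\mathrm{IQR} = Q_3 - Q_1, \qquad
\text{lower bound} = Q_1 - 1.5 \cdot \mathrm{IQR}, \qquad
\text{upper bound} = Q_3 + 1.5 \cdot \mathrm{IQR}

z_i = \frac{x_i - \bar{x}}{s}, \qquad
x_i \text{ is flagged as an outlier when } |z_i| > 3
```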

In [22]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
#sample data
data = {'Value': [5, 7, 9, 10, 15, 18, 21, 25, 30, 35, 40, 100, 150]}
df = pd.DataFrame(data)

### IQR Method

In [34]:
# Compute Q1, Q3
Q1 = df['Value'].quantile(0.25)  # First Quartile (25th percentile)
Q3 = df['Value'].quantile(0.75)  # Third Quartile (75th percentile)

# Compute IQR
IQR = Q3 - Q1  

# Compute lower and upper bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Print results
print(f"IQR: {IQR}")
print(f"Lower Bound: {lower_bound}")
print(f"Upper Bound: {upper_bound}")

# Find outliers
outliers = df[(df['Value'] < lower_bound) | (df['Value'] > upper_bound)]
print("\nOutliers:")
print(outliers)



IQR: 25.0
Lower Bound: -27.5
Upper Bound: 72.5

Outliers:
    Value
11    100
12    150
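
The same steps can be wrapped in a small reusable helper. This is a dependency-free sketch using Python's `statistics` module. The function names `iqr_bounds` and `iqr_outliers` are hypothetical, and `method="inclusive"` is chosen because it interpolates linearly between data points, the same convention as the default of pandas' `quantile()`.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # n=4 yields the three quartile cut points; the middle one is the median
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]

# Same sample data as the DataFrame above
values = [5, 7, 9, 10, 15, 18, 21, 25, 30, 35, 40, 100, 150]
lower, upper = iqr_bounds(values)
flagged = iqr_outliers(values)
print(lower, upper)
print(flagged)
```

Raising `k` (for example to 3) makes the rule stricter, so only more extreme points are flagged.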


In [None]:
#df with no outliers
df_cleaned = df[(df['Value'] >= lower_bound) & (df['Value'] <= upper_bound)]
print(df_cleaned)

    Value
0       5
1       7
2       9
3      10
4      15
5      18
6      21
7      25
8      30
9      35
10     40


### Capping Outliers

Capping means replacing extreme values with the nearest acceptable limit instead of removing them. This is useful when the rows containing outliers carry important information that you want to keep.

In [36]:
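# Cap on a copy so the original df stays intact for later steps
df_capped = df.copy()
# clip() replaces values below lower_bound or above upper_bound with the bound itself
df_capped['Value'] = df_capped['Value'].clip(lower=lower_bound, upper=upper_bound)
print(df_capped)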

    Value
0     5.0
1     7.0
2     9.0
3    10.0
4    15.0
5    18.0
6    21.0
7    25.0
8    30.0
9    35.0
10   40.0
11   72.5
12   72.5
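
Capping is just clamping each value into the interval between the two bounds. A minimal sketch without pandas, assuming the bounds computed by the IQR step (-27.5 and 72.5) and a hypothetical helper name `cap_values`:

```python
def cap_values(values, lower, upper):
    """Clamp every value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

# Same sample data as above; bounds taken from the IQR step
values = [5, 7, 9, 10, 15, 18, 21, 25, 30, 35, 40, 100, 150]
capped = cap_values(values, lower=-27.5, upper=72.5)
print(capped)
```

Unlike removal, the result has the same number of rows as the input. Only the two extreme values change.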


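### Z-Score Method

In [None]:
# Start again from the original, uncapped sample data
df = pd.DataFrame(data)
mean = df['Value'].mean()
std_dev = df['Value'].std()  # sample standard deviation (ddof=1)
print(f"Mean: {mean:.2f}")
print(f"Std Dev: {std_dev:.2f}")

Mean: 35.77
Std Dev: 42.29


In [None]:
# Z-score formula: (value - mean) / standard deviation, rounded for display
df['zscore'] = ((df['Value'] - mean) / std_dev).round(2)
df

Unnamed: 0,Value,zscore
0,5,-0.73
1,7,-0.68
2,9,-0.63
3,10,-0.61
4,15,-0.49
5,18,-0.42
6,21,-0.35
7,25,-0.25
8,30,-0.14
9,35,-0.02
10,40,0.1
11,100,1.52
12,150,2.7


In [None]:
# Rows with a Z-score below -3 or above 3 are flagged as outliers
outliers = df[(df['zscore'] < -3) | (df['zscore'] > 3)]
outliers

Unnamed: 0,Value,zscore

No row is flagged here. With only 13 points, the two extreme values (100 and 150) inflate the standard deviation to about 42.29, so even 150 only reaches a Z-score of 2.7. The Z-score rule works best on larger, roughly normal samples, while the IQR method above still catches both values.


In [None]:
# df with no outliers (all 13 rows are kept, since none were flagged)
df_no_outliers = df[(df['zscore'] >= -3) & (df['zscore'] <= 3)]
df_no_outliers

Unnamed: 0,Value,zscore
0,5,-0.73
1,7,-0.68
2,9,-0.63
3,10,-0.61
4,15,-0.49
5,18,-0.42
6,21,-0.35
7,25,-0.25
8,30,-0.14
9,35,-0.02
10,40,0.1
11,100,1.52
12,150,2.7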
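
The Z-score rule can also be sketched as a small helper. The function name `zscore_outliers` and the `readings` list are hypothetical and used only for illustration. `statistics.stdev` computes the sample standard deviation (dividing by n - 1), the same default as pandas' `.std()`.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1)
    return [v for v in values if abs((v - mu) / sigma) > threshold]

# Hypothetical readings: nineteen typical values and one extreme value
readings = [10] * 19 + [100]
flagged = zscore_outliers(readings)
print(flagged)
```

Because the extreme value itself inflates the standard deviation, this rule tends to need a reasonably large sample before any point can cross the threshold.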


## 🎯 Pro Tip

Always analyze whether to:

- Remove outliers
- Cap outliers (replace extreme values)
- Keep them (if they carry important information)
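
The three options can be compared side by side with one helper. This is only a sketch: the function name `handle_outliers` and its `strategy` parameter are hypothetical, and the bounds are assumed to come from a detection step such as the IQR method.

```python
def handle_outliers(values, lower, upper, strategy="keep"):
    """Apply one of three outlier strategies to a list of numbers."""
    if strategy == "remove":
        # Drop every value outside [lower, upper]
        return [v for v in values if lower <= v <= upper]
    if strategy == "cap":
        # Replace values outside the bounds with the nearest bound
        return [min(max(v, lower), upper) for v in values]
    if strategy == "keep":
        # Leave the data untouched
        return list(values)
    raise ValueError(f"Unknown strategy: {strategy!r}")

# A few values from the sample data, with the IQR bounds from above
values = [5, 21, 40, 100, 150]
for strategy in ("remove", "cap", "keep"):
    print(strategy, handle_outliers(values, -27.5, 72.5, strategy))
```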