#### What is an outlier?

In [2]:
data = [10, 22, 12, 11, 13, 300]
# 300 is an outlier because it lies far outside the range of the other values (10 to 22).

##### Why handle outliers?
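
One reason, illustrated on the sample list from the first cell: a single extreme value can pull the mean far away from the bulk of the data, while the median barely moves. The sketch below is only an illustration; the variable names (`values`, `without_outlier`) are assumptions and not part of the original notebook.

```python
import pandas as pd

# Sample list from the first cell; 300 is the outlier.
values = pd.Series([10, 22, 12, 11, 13, 300])
without_outlier = values[values != 300]

# The mean is pulled strongly toward the outlier...
mean_with = values.mean()
mean_without = without_outlier.mean()

# ...while the median stays close to the typical values.
median_with = values.median()
median_without = without_outlier.median()

print("Mean with / without 300:", round(mean_with, 2), "/", round(mean_without, 2))
print("Median with / without 300:", median_with, "/", median_without)
```

Statistics and models that rely on the mean or on squared distances are affected in the same way, which is why outliers are usually inspected before analysis.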

##### How to detect outliers

In [3]:
# 1) Using the IQR (interquartile range)
import pandas as pd

data = pd.DataFrame({'Marks': [50, 52, 49, 51, 100]})

# The IQR is the spread of the middle 50% of the values.
Q1 = data['Marks'].quantile(0.25)
Q3 = data['Marks'].quantile(0.75)
IQR = Q3 - Q1

# Values more than 1.5 * IQR beyond the quartiles are flagged as outliers.
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR

print("Lower Limit:", lower_limit)
print("Upper Limit:", upper_limit)

outliers = data[(data['Marks'] < lower_limit) | (data['Marks'] > upper_limit)]
print("Outliers:\n", outliers)

Lower Limit: 47.0
Upper Limit: 55.0
Outliers:
    Marks
4    100
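
The same IQR rule can be wrapped in a small reusable function so it can be applied to any numeric column. This is a minimal sketch: `iqr_bounds` and its parameter `k` are hypothetical names introduced here, and `k=1.5` simply mirrors the multiplier used in the cell above.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) limits for the IQR rule (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

marks = pd.Series([50, 52, 49, 51, 100], name='Marks')
low, high = iqr_bounds(marks)

# Boolean mask: True where a value falls outside the limits.
is_outlier = (marks < low) | (marks > high)
print(marks[is_outlier])
```

A larger `k` flags fewer points and a smaller `k` flags more, so the multiplier is a judgment call rather than a fixed rule.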


##### How to handle outliers

##### 1. Remove outliers

Delete the extreme values, but only if they are known to be errors.

In [4]:
filtered_data = data[(data['Marks'] >= lower_limit) & (data['Marks'] <= upper_limit)]
print(filtered_data)

   Marks
0     50
1     52
2     49
3     51
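
An equivalent way to write the same filter, sketched here with `Series.between`, which is inclusive on both ends by default. The frame name `df` and the hard-coded limits (taken from the printed IQR output above) are assumptions made so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'Marks': [50, 52, 49, 51, 100]})
low, high = 47.0, 55.0  # limits printed by the IQR cell

# between() keeps rows with low <= Marks <= high; reset_index renumbers the rows.
kept = df[df['Marks'].between(low, high)].reset_index(drop=True)
removed_count = len(df) - len(kept)
print(removed_count, "row(s) removed")
```

Reporting how many rows were dropped makes it easier to notice when a filter removes more data than expected.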


##### 2. Capping / Flooring (Winsorization)

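Replace each extreme value with the nearest limit: values above the upper limit become the upper limit, and values below the lower limit become the lower limit.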

In [5]:
import numpy as np

# Cap values above upper_limit and floor values below lower_limit.
data['Marks'] = np.where(
    data['Marks'] > upper_limit, upper_limit,
    np.where(data['Marks'] < lower_limit, lower_limit, data['Marks'])
)
print(data)

   Marks
0   50.0
1   52.0
2   49.0
3   51.0
4   55.0
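
pandas also provides `Series.clip`, which expresses capping and flooring in a single call. The sketch below assumes the same limits as above and uses a fresh frame named `df` (an illustrative name) so that the original column is left untouched.

```python
import pandas as pd

df = pd.DataFrame({'Marks': [50, 52, 49, 51, 100]})
low, high = 47.0, 55.0  # limits printed by the IQR cell

# clip() replaces values below `low` with `low` and values above `high` with `high`.
capped = df['Marks'].clip(lower=low, upper=high)
print(capped.tolist())
```

Because the result is assigned to a new variable, the uncapped values remain available for comparison.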


##### 3. Replace with Mean or Median

Use this when the dataset is small or when you want to keep the same number of rows.

In [6]:
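# 'Marks' was already capped above, so no value is outside the limits and this
# cell changes nothing; rerun on the original data to see the replacement.
median_value = data['Marks'].median()
outside = (data['Marks'] > upper_limit) | (data['Marks'] < lower_limit)
data['Marks'] = np.where(outside, median_value, data['Marks'])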
print(data)

   Marks
0   50.0
1   52.0
2   49.0
3   51.0
4   55.0
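
To see the replacement take effect, the same idea can be applied to the original, uncapped marks. This is a sketch under that assumption: `marks`, `outside`, and `replaced` are illustrative names, and `Series.mask` is used in place of `np.where`.

```python
import pandas as pd

marks = pd.Series([50, 52, 49, 51, 100], name='Marks')
low, high = 47.0, 55.0  # limits printed by the IQR cell

# The median is computed on the original data; it is barely affected by the outlier.
median_value = marks.median()

# mask() replaces the entries where the condition is True.
outside = (marks < low) | (marks > high)
replaced = marks.mask(outside, median_value)
print(replaced.tolist())
```

The median is usually preferred over the mean here, since the mean is itself distorted by the value being replaced.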


##### 4. Transformation Method

Apply a log or square-root transformation to compress large values and reduce the effect of extremes.

In [8]:
# Natural log; it requires strictly positive values.
data['Log_Marks'] = np.log(data['Marks'])
print(data)

   Marks  Log_Marks
0   50.0   3.912023
1   52.0   3.951244
2   49.0   3.891820
3   51.0   3.931826
4   55.0   4.007333
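
The effect is easier to see on the original, uncapped marks. The sketch below compares the ratio of the largest to the smallest value before and after each transformation; the frame name `df`, the `Sqrt_Marks` column, and the `spread` measure are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Marks': [50, 52, 49, 51, 100]})
df['Log_Marks'] = np.log(df['Marks'])    # natural log
df['Sqrt_Marks'] = np.sqrt(df['Marks'])  # square root

# Ratio of the largest to the smallest value in each column:
# a smaller ratio means the extreme value stands out less.
spread = df.max() / df.min()
print(spread.round(2))
```

On this data the log compresses the extreme value more strongly than the square root, and both keep every row. The transformed values are on a different scale, so results have to be interpreted on that scale.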
