# HANDLING OUTLIERS

There are several techniques for handling outliers, such as:

1) Removing outliers:

--This can be done by specifying a threshold value and removing data points that fall outside of this threshold.
For example, removing all data points that are more than three standard deviations away from the mean.
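
The three-standard-deviation rule described above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the array `sample`, its synthetic values, and the two injected extreme points are assumptions made up for the example.

```python
import numpy as np

# Hypothetical data: 200 roughly normal values plus two injected extremes
rng = np.random.default_rng(0)
sample = np.append(rng.normal(loc=50.0, scale=5.0, size=200), [120.0, -40.0])

# z-score: distance from the mean measured in standard deviations
z_scores = (sample - sample.mean()) / sample.std()

# keep points within three standard deviations of the mean
mask = np.abs(z_scores) <= 3
filtered = sample[mask]
removed = sample[~mask]

print("removed:", removed)
```

The cutoff of 3 is the value used in the text; a different cutoff only changes the comparison in `mask`.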


--This can also be done by using the interquartile range (IQR) to calculate the thresholds. The IQR is the difference between the 75th and 25th percentiles. The lower threshold is the 25th percentile minus the IQR multiplied by a scale factor (usually 1.5 or 3), and the upper threshold is the 75th percentile plus the IQR multiplied by the same scale factor.
Data points that fall outside of these thresholds are considered outliers.

In [None]:
import numpy as np

# `data` is assumed to be a 1-D NumPy array of numeric values
# calculate the interquartile range
q75, q25 = np.percentile(data, [75, 25])
iqr = q75 - q25

# calculate the threshold values
min_val = q25 - (iqr*1.5)
max_val = q75 + (iqr*1.5)

# keep the outliers for inspection, then drop them from data
outliers = data[(data < min_val) | (data > max_val)]
data = data[(data >= min_val) & (data <= max_val)]


----------------------------------------------------------------

2) Transforming the data:

--This can be done by using mathematical functions such as log, square root or reciprocal to transform the data.

--This technique can be used to reduce the impact of outliers by making the data more symmetrical or by making the outliers less extreme.

In [None]:
import numpy as np

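# np.log is only defined for strictly positive values;
# zeros give -inf and negative values give nan
data = np.log(data)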
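
To see how such transformations make data more symmetrical, the sketch below compares the skewness of a right-skewed sample before and after a square-root and a log transform. The synthetic log-normal array `skewed` and the helper `skewness` are assumptions introduced for illustration; they are not part of the original notebook.

```python
import numpy as np

def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Hypothetical right-skewed, strictly positive data
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

skew_original = skewness(skewed)
skew_sqrt = skewness(np.sqrt(skewed))
skew_log = skewness(np.log(skewed))

print(f"original: {skew_original:.2f}")
print(f"sqrt:     {skew_sqrt:.2f}")
print(f"log:      {skew_log:.2f}")
```

A skewness near zero indicates a roughly symmetrical distribution. For this kind of data the log transform pulls the long right tail in more strongly than the square root does.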


----------------------------------------------------------------

3) Using robust algorithms:

--This can be done by using algorithms that are less sensitive to outliers.
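For example, linear regression can be made more robust to outliers by using "least absolute deviations" instead of "least squares". Least absolute deviations minimizes the sum of the absolute residuals rather than the sum of the squared residuals, so a single large residual does not dominate the fit.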
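
As a rough illustration of the difference, the sketch below fits a straight line to synthetic data containing one gross outlier, once by least squares and once by least absolute deviations. The helper `lad_fit` is hypothetical: it approximates the least-absolute-deviations fit with iteratively reweighted least squares, which is one of several ways to compute it, and the data are made up for the example.

```python
import numpy as np

def lad_fit(x, y, n_iter=50, eps=1e-8):
    """Approximate least-absolute-deviations fit of y = slope * x + intercept.

    Uses iteratively reweighted least squares: each pass solves a weighted
    least-squares problem with weights 1 / |residual|.
    """
    X = np.column_stack([x, np.ones_like(x)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # start from ordinary least squares
    for _ in range(n_iter):
        resid = y - X @ beta
        weights = 1.0 / np.maximum(np.abs(resid), eps)  # eps avoids division by zero
        sw = np.sqrt(weights)
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta  # array [slope, intercept]

# Hypothetical data: true line y = 2x + 1 with small noise and one gross outlier
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
y[-1] += 100.0

ols_slope, ols_intercept = np.polyfit(x, y, 1)
lad_slope, lad_intercept = lad_fit(x, y)

print(f"least squares slope:             {ols_slope:.2f}")
print(f"least absolute deviations slope: {lad_slope:.2f}")
```

The least-squares slope is pulled toward the outlier, while the least-absolute-deviations slope stays close to the true value of 2.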

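--Another example is a decision tree, which splits the data by comparing a feature value against a threshold rather than by computing means over the feature values. Only the ordering of the feature values determines where a split can fall, so an outlier in a feature has little influence on the splits.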
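
The point about threshold splits can be illustrated with a single split of the kind a regression tree makes at each node. This is a simplified sketch, not a full decision tree: the helper `best_split` and the toy arrays are assumptions introduced for the example.

```python
import numpy as np

def best_split(feature, target):
    """Find the threshold on one feature that minimizes the summed squared error
    of the target around the mean of each side. Returns (threshold, left_mask)."""
    order = np.argsort(feature)
    xs, ys = feature[order], target[order]
    best_sse, best_thr = np.inf, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold separates equal feature values
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_thr = sse, (xs[i] + xs[i - 1]) / 2
    return best_thr, feature <= best_thr

# Hypothetical toy data with a clear jump in the target between 4 and 5
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
target = np.array([1.1, 0.9, 1.0, 1.2, 5.0, 5.1, 4.9, 5.2])

thr_clean, left_clean = best_split(feature, target)

# Turn the largest feature value into an extreme outlier
feature_outlier = feature.copy()
feature_outlier[-1] = 8000.0
thr_outlier, left_outlier = best_split(feature_outlier, target)

print(thr_clean, thr_outlier)
```

In this toy data both calls place the threshold at 4.5 and send the same points to each side, because making the largest value more extreme does not change the ordering of the points.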