##### Quartile

1. Interquartile Range (IQR) Method
The IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile) of the data. It is a measure of statistical dispersion. Outliers are typically defined as data points that fall below the first quartile minus 1.5 times the IQR or above the third quartile plus 1.5 times the IQR.

Steps to Identify Outliers Using IQR:

1. Calculate the first quartile (Q1) and the third quartile (Q3).
2. Compute the IQR as IQR = Q3 - Q1.
3. Determine the lower bound as Q1 - 1.5 * IQR.
4. Determine the upper bound as Q3 + 1.5 * IQR.
5. Flag any data points below the lower bound or above the upper bound as outliers.

In [None]:
import pandas as pd
import numpy as np

# Creating a DataFrame with a column that has outliers
data = {'Values': [10, 12, 14, 15, 18, 22, 24, 100, 110, 120]}
df = pd.DataFrame(data)

# Calculating Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df['Values'].quantile(0.25)
Q3 = df['Values'].quantile(0.75)

# Calculating the IQR
IQR = Q3 - Q1

# Determining the bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identifying outliers
outliers_iqr = df[(df['Values'] < lower_bound) | (df['Values'] > upper_bound)]

print("IQR Method:")
print(f"Lower Bound: {lower_bound}, Upper Bound: {upper_bound}")
print("Outliers identified using IQR method:")
print(outliers_iqr)
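# Output:
# IQR Method:
# Lower Bound: -85.875, Upper Bound: 181.125
# Outliers identified using IQR method:
# Empty DataFrame
# Columns: [Values]
# Index: []
#
# Q1 = 14.25 and Q3 = 81.0, so IQR = 66.75. The values 100, 110 and 120 make up
# 30% of this sample and pull Q3 upward, which widens the bounds until none of
# them is flagged.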
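
The same steps can be wrapped in a small helper. This is a minimal sketch: the function name `iqr_outliers`, its parameter `k`, and the sample data are assumptions made for illustration, not part of the original notebook. The sample contains a single extreme value, so the quartiles stay close to the bulk of the data.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries of `values` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical sample with one extreme value
sample = pd.Series([10, 12, 14, 15, 18, 22, 24, 26, 28, 120])
flagged = iqr_outliers(sample)
print(flagged.tolist())
```

Here Q1 = 14.25 and Q3 = 25.5, so the upper bound is 42.375 and only 120 is returned. Keeping `k` as a parameter makes it easy to try a stricter or looser multiplier than 1.5.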


#### Standard deviation

The Standard Deviation Method is a commonly used technique to identify outliers in datasets. It assumes that the data follows a normal distribution (bell-shaped curve). In this method, we set thresholds based on the standard deviation to determine which data points fall significantly far from the mean.

Key Concepts:
Mean:

The mean is the average value of all the data points in the dataset.
It represents the "center" of the data.
Standard Deviation (std):

The standard deviation is a measure of how much the data points in the dataset vary from the mean.
A small standard deviation means that most of the data points are close to the mean, while a large standard deviation indicates that the data points are spread out over a wide range of values.
Normal Distribution:

If data follows a normal distribution (or a bell curve), most of the data points cluster around the mean.
A large portion of the data (approximately 68%) lies within 1 standard deviation of the mean.
About 95% of the data lies within 2 standard deviations of the mean.
Roughly 99.7% of the data lies within 3 standard deviations of the mean.
Why Set Thresholds?
The threshold is used to define a cutoff point that separates "normal" data from potential outliers.

If a data point is more than a certain number of standard deviations away from the mean, it is considered unusual or an outlier.
Commonly, we set the threshold at 2 or 3 standard deviations from the mean, based on how strict we want to be in identifying outliers:
2 standard deviations: Identifies points in the outer 5% of the data (more aggressive in identifying outliers).
3 standard deviations: Identifies points in the outer 0.3% of the data (more conservative in identifying outliers).
Why 2 or 3 Standard Deviations?
This choice comes from the empirical rule (also known as the 68-95-99.7 rule):

1 standard deviation from the mean: Covers 68% of the data.
2 standard deviations from the mean: Covers 95% of the data.
3 standard deviations from the mean: Covers 99.7% of the data.
Interpretation:
If you choose a 2 standard deviation threshold, you are saying: "Any data point that falls outside the range of the mean ± 2 standard deviations is unusual and may be an outlier."
If you choose a 3 standard deviation threshold, you flag only the most extreme values, so fewer data points are labeled as outliers.
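
The percentages in the empirical rule can be checked numerically. The sketch below assumes an exactly normal distribution and uses only the standard library; the helper name `coverage_within` is made up for this example.

```python
import math


def coverage_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    inside = coverage_within(k)
    print(f"±{k} std: {inside:.4f} inside, {1 - inside:.4f} outside")
```

For `k = 2` roughly 4.6% of a normal population falls outside the bounds, and for `k = 3` roughly 0.27%, which is where the "outer 5%" and "outer 0.3%" figures come from.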

In [None]:
import pandas as pd

# Example Data
data = {'Values': [10, 12, 14, 15, 18, 22, 24, 100, 110, 120]}
df = pd.DataFrame(data)

# Calculating the mean and standard deviation
mean = df['Values'].mean()  # Mean of the data
std = df['Values'].std()  # Standard deviation of the data

# Setting a threshold of 2 standard deviations
threshold = 2

# Calculating the lower and upper bounds
lower_bound_std = mean - threshold * std
upper_bound_std = mean + threshold * std

# Identifying outliers
outliers_std = df[(df['Values'] < lower_bound_std) | (df['Values'] > upper_bound_std)]

print(f"Lower Bound: {lower_bound_std}, Upper Bound: {upper_bound_std}")
print("Outliers identified using Standard Deviation method:")
print(outliers_std)


Mean and Standard Deviation:

For this dataset the mean (mean) is 44.5 and the standard deviation (std) is approximately 45.64. Note that pandas `std()` computes the sample standard deviation (ddof=1) by default.
These numbers summarize where the "center" of the data is (mean) and how spread out the data is (standard deviation).
Threshold:

In this example, we set the threshold to 2 standard deviations. This means we are interested in data points that are more than 2 standard deviations away from the mean.
Lower and Upper Bounds:

We calculate the bounds as:
Lower Bound = mean - 2 * std = 44.5 - 2 * 45.64 = -46.78
Upper Bound = mean + 2 * std = 44.5 + 2 * 45.64 = 135.78
Any data points outside this range are considered outliers.
Identifying Outliers:

In the given data, every point lies between -46.78 and 135.78, so no value is flagged, even though 100, 110, and 120 sit far from the rest. These three values account for 30% of the sample and inflate both the mean and the standard deviation, which widens the bounds. This masking effect is a known weakness of the method on small samples.

#### Z-score

The Z-score method is another commonly used technique for detecting outliers. The Z-score tells you how many standard deviations a data point is from the mean of the dataset. The further a data point is from the mean, the more likely it is to be an outlier.

What is a Z-score?
The Z-score, also known as the standard score, is calculated by the following formula:

$$Z = \frac{X - \mu}{\sigma}$$

Where:

$X$ is the value of the data point.
$\mu$ is the mean of the dataset.
$\sigma$ is the standard deviation of the dataset.
Interpretation of Z-score:
A Z-score of 0 indicates that the data point is exactly at the mean of the dataset.
A positive Z-score indicates that the data point is above the mean.
A negative Z-score indicates that the data point is below the mean.
The magnitude of the Z-score tells you how far away (in terms of standard deviations) the data point is from the mean.
Why Set a Z-score Threshold?
In most datasets, if the data is normally distributed (i.e., follows a bell-shaped curve):

About 68% of the data falls within 1 standard deviation of the mean (Z-scores between -1 and 1).
About 95% of the data falls within 2 standard deviations of the mean (Z-scores between -2 and 2).
About 99.7% of the data falls within 3 standard deviations of the mean (Z-scores between -3 and 3).
Therefore, any data point that has a Z-score greater than 3 or less than -3 is considered an extreme value or outlier, as it lies far beyond the range of the vast majority of the data.

Steps to Identify Outliers Using Z-score:
Calculate the Z-score for each data point.
Set a Z-score threshold (typically ±3) to determine what you consider an outlier.
Identify data points whose Z-scores lie beyond the threshold (above +3 or below -3).

In [None]:
import pandas as pd
from scipy.stats import zscore

# Creating a DataFrame with a column that has potential outliers
data = {'Values': [10, 12, 14, 15, 18, 22, 24, 100, 110, 120]}
df = pd.DataFrame(data)

# Calculating the Z-scores for the 'Values' column
df['Z-score'] = zscore(df['Values'])

# Identifying outliers with Z-score above 3 or below -3
outliers_zscore = df[(df['Z-score'] > 3) | (df['Z-score'] < -3)]

print("Z-score Method:")
print("Outliers identified using Z-score method:")
print(outliers_zscore)


Explanation:
Calculating Z-scores:

The Z-score for each data point is calculated using the zscore function from the scipy.stats library. This function automatically computes the Z-score for every value in the 'Values' column. The Z-scores are added to a new column, 'Z-score', in the DataFrame.

Here’s what the Z-score calculation does:

For each data point, it subtracts the mean of the data (μ) from the value (X) and divides by the standard deviation (σ).
This standardizes the data, transforming the values into a scale where the mean is 0 and the standard deviation is 1.
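
The standardization just described can be reproduced by hand. The sketch below is illustrative and the column names `z_ddof0` and `z_ddof1` are made up for it. It highlights one detail that is easy to miss: `scipy.stats.zscore` divides by the population standard deviation (`ddof=0`) by default, while pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`), so the two give slightly different Z-scores.

```python
import pandas as pd

values = pd.Series([10, 12, 14, 15, 18, 22, 24, 100, 110, 120])

# Population standard deviation (ddof=0), the default used by scipy.stats.zscore
z_population = (values - values.mean()) / values.std(ddof=0)

# Sample standard deviation (ddof=1), the default of pandas Series.std()
z_sample = (values - values.mean()) / values.std(ddof=1)

comparison = pd.DataFrame({
    "Values": values,
    "z_ddof0": z_population.round(3),
    "z_ddof1": z_sample.round(3),
})
print(comparison)
```

The difference shrinks as the sample grows, but on ten points it is visible in the second decimal place, so it is worth stating which convention a reported Z-score uses.
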
Here's what happens to each value in the Values column:

Small values like 10 and 12 get negative Z-scores (about -0.80 and -0.75) because they lie below the mean of 44.5.
Large values like 100, 110, and 120 get the highest Z-scores (about 1.28, 1.51, and 1.74), but none of them comes close to 3.
Setting the Z-score Threshold:

The threshold of ±3 is chosen because, according to the properties of a normal distribution, almost all data (99.7%) should lie within ±3 standard deviations from the mean. Any data point with a Z-score beyond this range (greater than +3 or less than -3) is considered an outlier because it is far from the central cluster of data.

Identifying Outliers:

After calculating the Z-scores, we filter the DataFrame to find rows where the Z-score is greater than 3 or less than -3. In this case the filter returns an empty DataFrame: the largest Z-score is about 1.74 (for the value 120). The three large values inflate the mean and the standard deviation they are measured against, so none of them looks extreme. With only 10 points the ±3 rule cannot flag anything at all, because with the population standard deviation used by zscore, |Z| can never exceed √(n - 1) = 3.

### Capping and Flooring

Capping and flooring are methods used to handle outliers by limiting extreme values to a specific range, rather than removing or transforming them completely. These techniques are common in machine learning and statistics when you want to mitigate the influence of extreme values without discarding any data points.

Capping: Limiting extreme high values (outliers above the upper bound) to a maximum threshold.
Flooring: Limiting extreme low values (outliers below the lower bound) to a minimum threshold.
By capping or flooring, you replace extreme values with predefined thresholds, so that statistics and models fitted to the data are less sensitive to them.

Methods for Capping and Flooring
You can choose thresholds for capping and flooring using:

Percentiles: For example, capping values at the 95th percentile and flooring values at the 5th percentile.
Statistical Rules: Using the IQR (Interquartile Range) or standard deviation methods to define thresholds.

In [None]:
import pandas as pd
import numpy as np

# Creating a DataFrame with outliers
data = {'Values': [10, 12, 14, 15, 18, 22, 24, 100, 110, 120]}
df = pd.DataFrame(data)
# Calculate the 5th and 95th percentiles
lower_cap = df['Values'].quantile(0.05)
upper_cap = df['Values'].quantile(0.95)

# Apply flooring and capping
df['Capped_Values'] = df['Values'].apply(lambda x: upper_cap if x > upper_cap else (lower_cap if x < lower_cap else x))

print("DataFrame with percentile-capped values:")
print(df)
print(f"Lower Cap (5th percentile): {lower_cap}")
print(f"Upper Cap (95th percentile): {upper_cap}")
# With this data the caps are about 10.9 (5th percentile) and 115.5 (95th percentile).
# Only two rows change: 10 is raised to the lower cap and 120 is lowered to the upper cap.
# 100 and 110 lie below the upper cap, so they are left as they are.
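
The element-wise `apply` in the cell above can also be written with the vectorized `Series.clip` method. This is a sketch of an alternative, not the notebook's own approach; the variable names `capped` and `changed` are introduced here for illustration.

```python
import pandas as pd

values = pd.Series([10, 12, 14, 15, 18, 22, 24, 100, 110, 120], dtype="float64")

# Same percentile thresholds as in the cell above
lower_cap = values.quantile(0.05)
upper_cap = values.quantile(0.95)

# clip() floors values below lower_cap and caps values above upper_cap in one call
capped = values.clip(lower=lower_cap, upper=upper_cap)

# Show which original values were modified
changed = values[values != capped]
print(changed.tolist())
```

Only 10 and 120 are reported as changed. `clip` avoids a Python-level function call per row, which matters on large columns.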


Example: Capping and Flooring Based on IQR

In [None]:
# Calculate the first quartile (Q1) and the third quartile (Q3)
Q1 = df['Values'].quantile(0.25)
Q3 = df['Values'].quantile(0.75)

# Calculate the IQR
IQR = Q3 - Q1

# Define lower and upper thresholds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Apply flooring and capping
df['IQR_Capped_Values'] = df['Values'].apply(lambda x: upper_bound if x > upper_bound else (lower_bound if x < lower_bound else x))

print("DataFrame with IQR-based capped values:")
print(df)
print(f"Lower Bound (IQR): {lower_bound}")
print(f"Upper Bound (IQR): {upper_bound}")
# Output of the last two lines:
# Lower Bound (IQR): -85.875
# Upper Bound (IQR): 181.125
# Every value lies inside these bounds, so IQR_Capped_Values is identical to Values
# and nothing is capped or floored for this dataset.
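
The list of threshold choices earlier also mentions the standard deviation, which the cells above do not demonstrate. The following is a minimal sketch under stated assumptions: the sample is hypothetical (one extreme value, not the dataset used above) and the cutoff of 2 standard deviations is chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample with one extreme value (not the dataset used above)
sample = pd.Series([10, 12, 14, 15, 18, 22, 24, 26, 28, 120], dtype="float64")

k = 2  # assumed cutoff in standard deviations
mean = sample.mean()
std = sample.std()  # sample standard deviation (ddof=1), the pandas default

lower_bound_std = mean - k * std
upper_bound_std = mean + k * std

# Floor at the lower bound and cap at the upper bound
std_capped = sample.clip(lower=lower_bound_std, upper=upper_bound_std)

print(f"Bounds: {lower_bound_std:.2f} to {upper_bound_std:.2f}")
print(std_capped.tolist())
```

With this sample the upper bound is about 94, so 120 is pulled down to that value and every other point is unchanged. Because the extreme value itself inflates the standard deviation, the cap lands well above the bulk of the data, which is one reason percentile or IQR thresholds are often preferred for capping.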
