# **`Outliers` in Data Preprocessing**
> Outliers are data points in a dataset that are distant from other observations. They may occur due to variability in the data or due to measurement errors. In statistics and machine learning, they are often considered as "noise" that can negatively impact the performance of certain models and are therefore sometimes removed during the data preprocessing stage.

**Outliers are also known as:**

- Anomalies
- Aberrations
- Deviations
- Exceptions
- Peculiarities

**Types of outliers:**

- **Point outliers**: These are single data points that lie far from the rest of the distribution.
- **Contextual outliers**: These are data points that deviate significantly only within a specific context. For example, high ice cream sales are normal in summer (the context), but the same sales figure in winter would be an outlier.
- **Collective outliers**: These are collections of data points that, as a group, deviate significantly from the rest of the dataset, even if the individual data points are not outliers on their own.

**Causes of outliers:**

- **Measurement error**: Outliers can be caused by errors in data collection, recording, or entry.
- **Data processing error**: Mistakes in data processing can also lead to outliers.
- **Sampling error**: Sometimes, outliers can be the result of a flaw in the sampling process.
- **Natural outlier**: An outlier that is not caused by an error is a natural outlier. It is a genuine but unusual data point in your dataset.
- **Changes in behavior of the observed system**: In time-series data, this could be a sudden change in trend or seasonality.

**Top Methods for detecting and removing outliers:**

1. Z-Score Method
2. IQR Method
3. Clustering Method (e.g., K-Means)
4. Isolation Forest
5. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
6. LOF (Local Outlier Factor)
7. Robust Random Cut Forest
8. Elliptic Envelope
9. One-Class SVM
10. Median Absolute Deviation Method
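
As a quick illustration of the last item in the list, here is a minimal sketch of the Median Absolute Deviation (MAD) approach. It assumes the commonly used modified Z-score, `0.6745 * (x - median) / MAD`, with a cutoff of 3.5. The function name `mad_outliers` and the default cutoff are illustrative choices, not a library API.

```python
import numpy as np

def mad_outliers(values, cutoff=3.5):
    """Return a boolean mask marking values whose modified Z-score exceeds cutoff."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # MAD is zero (e.g. most values are identical), so the score is
        # undefined; this sketch simply flags nothing in that case.
        return np.zeros(values.shape, dtype=bool)
    modified_z = 0.6745 * (values - median) / mad
    return np.abs(modified_z) > cutoff

ages = [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50]
mask = mad_outliers(ages)
print("Outliers:", [age for age, flagged in zip(ages, mask) if flagged])
```

Because the median and MAD are barely affected by extreme values, this score tends to be more robust than the mean-based Z-score on small samples.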

#### **1. Import Libraries:**

In [1]:
# Step 1: Import the required libraries
import pandas as pd
import numpy as np

#### **2. Create the Sample Data:**

In [10]:
# Step 2: Create the data
data = pd.DataFrame({'Age': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50]})
data

|    | Age |
|---:|----:|
| 0  | 20  |
| 1  | 21  |
| 2  | 22  |
| 3  | 23  |
| 4  | 24  |
| 5  | 25  |
| 6  | 26  |
| 7  | 27  |
| 8  | 28  |
| 9  | 29  |
| 10 | 30  |
| 11 | 50  |


#### **3. Calculate the 'Mean' & 'Standard Deviation':**

In [8]:
# Step 3: Calculate the mean and standard deviation
mean = np.mean(data['Age'])
std = np.std(data['Age'])
print('Mean:', mean)
print('Standard Deviation:', std)

Mean: 27.083333333333332
Standard Deviation: 7.543853274171114
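
Note that `np.std` computes the population standard deviation (`ddof=0`) by default, while pandas' `Series.std` defaults to the sample standard deviation (`ddof=1`). The short sketch below reuses the same `Age` values to show that the two differ, which in turn changes the Z-scores. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

ages = pd.Series([20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50])

population_std = np.std(ages.to_numpy())          # ddof=0, the NumPy default
sample_std = ages.std()                           # pandas default is ddof=1
same_as_sample = np.std(ages.to_numpy(), ddof=1)  # NumPy with ddof=1 requested

print("Population std (ddof=0):", population_std)
print("Sample std (ddof=1):    ", sample_std)
```

Whichever convention you pick, use it consistently when comparing Z-scores against a threshold.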


#### **4. Calculate the 'Z-Score':**

In [6]:
# Step 4: Calculate the Z-Score
data['Z-Score'] = (data['Age'] - mean) / std
data

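|    | Age | Z-Score   |
|---:|----:|----------:|
| 0  | 20  | -0.938954 |
| 1  | 21  | -0.806396 |
| 2  | 22  | -0.673838 |
| 3  | 23  | -0.541280 |
| 4  | 24  | -0.408721 |
| 5  | 25  | -0.276163 |
| 6  | 26  | -0.143605 |
| 7  | 27  | -0.011047 |
| 8  | 28  | 0.121512  |
| 9  | 29  | 0.254070  |
| 10 | 30  | 0.386628  |
| 11 | 50  | 3.037793  |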


#### **5. Print the Data:**

In [9]:
# Step 5: Print the data
print("----------------------------------------")
print(f"Here is the data with outliers:\n {data}")
print("----------------------------------------")


----------------------------------------
Here is the data with outliers:
     Age   Z-Score
0    20 -0.938954
1    21 -0.806396
2    22 -0.673838
3    23 -0.541280
4    24 -0.408721
5    25 -0.276163
6    26 -0.143605
7    27 -0.011047
8    28  0.121512
9    29  0.254070
10   30  0.386628
11   50  3.037793
----------------------------------------


### **1. Detecting & Removing Outliers using Z-SCORE Method:**

The Z-score is a statistical measurement that describes a value's relationship to the mean of a group of values. It is measured in terms of standard deviations from the mean. If a Z-score is 0, it indicates that the data point's score is identical to the mean score. A Z-score of 1.0 would indicate a value that is one standard deviation from the mean.

Z-scores may be positive or negative, with a positive value indicating the score is above the mean and a negative score indicating it is below the mean.

For outlier detection, a data point is flagged when the absolute value of its Z-score exceeds a threshold (commonly 3), so that unusually low values are caught as well as unusually high ones. This is based on the empirical rule that nearly all of the data (about 99.7%) lies within three standard deviations of the mean in a normal distribution.
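
The 99.7% figure can be checked numerically. This small sketch, assuming SciPy is installed, measures how much of a standard normal distribution falls within 1, 2, and 3 standard deviations of the mean:

```python
from scipy.stats import norm

# Probability mass of a standard normal distribution within k standard deviations
for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sd: {coverage:.4f}")

coverage_3sd = norm.cdf(3) - norm.cdf(-3)
```

Real data is rarely exactly normal, so a threshold of 3 is best read as a convention rather than a guarantee.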

The formula for calculating the Z-score of a data point is:

```
Z = (X - μ) / σ
```

Where:
- `Z` is the Z-score,
- `X` is the value of the data point,
- `μ` is the mean of the dataset, and
- `σ` is the standard deviation of the dataset.

To remove outliers, you can filter your dataset to only include data points where the absolute Z-score is less than the threshold.
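
The filtering step can be wrapped in a small helper. This is a minimal sketch rather than a library function: the name `remove_zscore_outliers` and its parameters are made up for illustration, and it assumes the population standard deviation (`ddof=0`), which is what `np.std` uses by default.

```python
import pandas as pd

def remove_zscore_outliers(df, column, threshold=3.0):
    """Return the rows of df whose |Z-score| in `column` is within threshold."""
    mean = df[column].mean()
    std = df[column].std(ddof=0)  # population standard deviation
    if std == 0:
        # All values are identical, so there is nothing to flag.
        return df.copy()
    z_scores = (df[column] - mean) / std
    return df[z_scores.abs() <= threshold]

ages = pd.DataFrame({'Age': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50]})
cleaned = remove_zscore_outliers(ages, 'Age')
print(cleaned['Age'].tolist())
```

Taking the absolute value means a value far below the mean is removed just like one far above it.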

#### **1.1 Z-SCORE Method using '`Numpy Library`':**

In [23]:
# Step 1: Import the required libraries
import pandas as pd
import numpy as np

# Step 2: Create the data
data = pd.DataFrame({'Age': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50]})

# Step 3: Calculate the mean and standard deviation
mean = np.mean(data['Age'])
std = np.std(data['Age'])

# Step 4: Calculate the Z-Score
data['Z-Score'] = (data['Age'] - mean) / std

# Step 5: Print the data
print("----------------------------------------")
print(f"Here is the data with outliers:\n {data}")
print("----------------------------------------")
# Step 6: Print the outliers (use the absolute Z-score so low values are caught too)
print(f"Here are the outliers based on the z-score threshold, 3:\n {data[data['Z-Score'].abs() > 3]}")
print("----------------------------------------")
# Step 7: Remove the outliers
data = data[data['Z-Score'].abs() <= 3]

# Step 8: Print the data without outliers
print(f"Here is the data without outliers:\n {data}")

----------------------------------------
Here is the data with outliers:
     Age   Z-Score
0    20 -0.938954
1    21 -0.806396
2    22 -0.673838
3    23 -0.541280
4    24 -0.408721
5    25 -0.276163
6    26 -0.143605
7    27 -0.011047
8    28  0.121512
9    29  0.254070
10   30  0.386628
11   50  3.037793
----------------------------------------
Here are the outliers based on the z-score threshold, 3:
     Age   Z-Score
11   50  3.037793
----------------------------------------
Here is the data without outliers:
     Age   Z-Score
0    20 -0.938954
1    21 -0.806396
2    22 -0.673838
3    23 -0.541280
4    24 -0.408721
5    25 -0.276163
6    26 -0.143605
7    27 -0.011047
8    28  0.121512
9    29  0.254070
10   30  0.386628


#### **1.2 Z-SCORE Method using '`Scipy Library`':**

In [16]:
# Import libraries
import numpy as np
from scipy import stats

# Sample data
data = [2.5, 2.7, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 110.0]

# Calculate the Z-score for each data point
z_scores = np.abs(stats.zscore(data))

# Set a threshold for identifying outliers
threshold = 2.5 
outliers = np.where(z_scores > threshold)[0]

# print the data
print("----------------------------------------")
print("Data:", data)
print("----------------------------------------")

# Print the outliers and their values
print("Indices of Outliers:", outliers)
print("Outliers:", [data[i] for i in outliers])

# Remove outliers
data = [data[i] for i in range(len(data)) if i not in outliers]
print("----------------------------------------")
print("Data without outliers:", data)

----------------------------------------
Data: [2.5, 2.7, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 110.0]
----------------------------------------
Indices of Outliers: [9]
Outliers: [110.0]
----------------------------------------
Data without outliers: [2.5, 2.7, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0]
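
The threshold of 2.5 (rather than 3) matters for a sample this small. With the population standard deviation that `stats.zscore` uses by default, the largest possible absolute Z-score among n points is sqrt(n - 1), which is exactly 3 for n = 10, so a strict `> 3` test could never flag anything here. The sketch below checks this on the same data. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

data = [2.5, 2.7, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 110.0]

z_scores = np.abs(stats.zscore(data))   # ddof=0 by default
upper_limit = np.sqrt(len(data) - 1)    # theoretical maximum |Z| for n points

print("Largest |Z| in the data:", z_scores.max())
print("Theoretical maximum:    ", upper_limit)
print("Flagged with threshold 3:  ", int((z_scores > 3).sum()))
print("Flagged with threshold 2.5:", int((z_scores > 2.5).sum()))
```

For small datasets it is therefore worth either lowering the threshold or switching to a method that does not depend on the mean and standard deviation.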


### **2. Detecting Outliers with `IQR-Method`:**
The Interquartile Range (IQR) method is a statistical technique to identify outliers. The IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile) of the data. 

In the IQR method, an outlier is any value that falls below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. Here's the formula:

```
Lower bound = Q1 - 1.5*IQR
Upper bound = Q3 + 1.5*IQR
```

Where:
- Q1 is the first quartile (25th percentile)
- Q3 is the third quartile (75th percentile)
- IQR is the interquartile range (Q3 - Q1)

Any data point that falls below the lower bound or above the upper bound is considered an outlier.

To remove outliers, you can filter your dataset to only include data points where the value is between the lower and upper bounds.
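
The bounds can be computed with a small helper. This sketch uses pandas' `Series.quantile`, whose default linear interpolation can give slightly different quartiles than a midpoint rule. The function name `iqr_bounds` and the `k` parameter are illustrative assumptions.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) outlier bounds using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

ages = pd.Series([20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50])
lower, upper = iqr_bounds(ages)
inliers = ages[ages.between(lower, upper)]  # between() is inclusive on both ends

print("Bounds:", lower, upper)
print("Kept values:", inliers.tolist())
```

Raising `k` (for example to 3) makes the rule more conservative, flagging only the most extreme values.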

#### **2.1 IQR Method using '`Numpy Library`':**

In [22]:
# Step 1: Import the required libraries
import pandas as pd
import numpy as np

# Step 2: Create the data
data = pd.DataFrame({'Age': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50]})

# Step 3: Calculate the first and third quartile
# (NumPy 1.22+ uses `method=`; the older `interpolation=` keyword is deprecated)
Q1 = np.percentile(data['Age'], 25, method='midpoint')
Q3 = np.percentile(data['Age'], 75, method='midpoint')

# Step 4: Calculate the IQR
IQR = Q3 - Q1

# Step 5: Calculate the lower and upper bound
lower_bound = Q1 - (1.5 * IQR)
upper_bound = Q3 + (1.5 * IQR)

# Step 6: Print the data
print("----------------------------------------")
print(f"Here is the data with outliers:\n {data}")
print("----------------------------------------")
# Step 7: Print the outliers
print(f"Here are the outliers based on the IQR threshold:\n {data[(data['Age'] < lower_bound) | (data['Age'] > upper_bound)]}")
print("----------------------------------------")
# Step 8: Remove the outliers
data = data[(data['Age'] >= lower_bound) & (data['Age'] <= upper_bound)]

# Step 9: Print the data without outliers
print(f"Here is the data without outliers:\n {data}")

----------------------------------------
Here is the data with outliers:
     Age
0    20
1    21
2    22
3    23
4    24
5    25
6    26
7    27
8    28
9    29
10   30
11   50
----------------------------------------
Here are the outliers based on the IQR threshold:
     Age
11   50
----------------------------------------
Here is the data without outliers:
     Age
0    20
1    21
2    22
3    23
4    24
5    25
6    26
7    27
8    28
9    29
10   30


### **3. Clustering Method (K-Means):**
The K-Means clustering method is a technique that can be used for outlier detection. K-Means is an iterative algorithm that partitions n data points into k non-overlapping clusters by assigning each point to the cluster with the nearest centroid, where a centroid is the mean of the points in its cluster.

Here's how it can be used for outlier detection:

1. The K-Means algorithm is applied to the dataset and divides the data into k clusters.
2. For each data point, the distance to its cluster centroid is calculated.
3. If the distance of a data point to its cluster centroid is above a certain threshold, it is considered an outlier.

The Euclidean distance between a data point and its cluster centroid is:

```
d = sqrt((x1 - c1)^2 + (x2 - c2)^2 + ... + (xn - cn)^2)
```

Where:
- `d` is the distance,
- `(x1, x2, ..., xn)` are the coordinates of the data point,
- `(c1, c2, ..., cn)` are the coordinates of the centroid.

To remove outliers, you can filter your dataset to only include data points where the distance to the centroid is below the threshold.

Please note that the choice of the number of clusters (k) and the threshold is critical in this method. A poor choice can lead to poor outlier detection.
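
The three steps above can be sketched with scikit-learn as follows. The toy data, the fixed `random_state`, and the threshold of "mean distance plus two standard deviations" are all assumptions made for illustration. In practice the threshold needs tuning for the dataset at hand.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two tight groups plus one stray point (toy data, made up for illustration)
X = np.array([
    [2, 2], [3, 3], [3, 4], [2, 3],
    [30, 30], [31, 31], [32, 32], [31, 30],
    [15, 40],
], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Steps 1-2: distance from each point to the centroid of its own cluster
centroids = kmeans.cluster_centers_[kmeans.labels_]
distances = np.linalg.norm(X - centroids, axis=1)

# Step 3: flag points whose distance exceeds a threshold
# (mean + 2 standard deviations is an assumed rule of thumb)
threshold = distances.mean() + 2 * distances.std()
outliers = X[distances > threshold]
inliers = X[distances <= threshold]

print("Threshold:", round(threshold, 2))
print("Outliers:", outliers.tolist())
```

Unlike labelling a whole cluster as anomalous, this approach keeps both genuine groups and removes only the point that sits far from its assigned centroid.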

In [24]:
# Import library
from sklearn.cluster import KMeans

# Sample data
data = [[2, 2], [3, 3], [3, 4], [30, 30], [31, 31], [32, 32]]

# Create a K-means model with two clusters (normal and outlier)
kmeans = KMeans(n_clusters=2, n_init=10)
kmeans.fit(data)

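# Predict cluster labels
labels = kmeans.predict(data)

# K-Means numbers its clusters arbitrarily, so a fixed label such as 1 may point
# at either group. This toy example assumes the first data point is normal and
# treats every point outside its cluster as an outlier.
normal_label = labels[0]
outliers = [point for point, label in zip(data, labels) if label != normal_label]

# print data
print("Data:", data)
print("Outliers:", outliers)
# Remove outliers
data = [point for point, label in zip(data, labels) if label == normal_label]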
print("Data without outliers:", data)

Data: [[2, 2], [3, 3], [3, 4], [30, 30], [31, 31], [32, 32]]
Outliers: [[30, 30], [31, 31], [32, 32]]
Data without outliers: [[2, 2], [3, 3], [3, 4]]
