## Anomalies and Outliers

The term "outlier" refers to an observation that lies an abnormal distance from other values in a random sample from a population. One common method for identifying outliers is to measure how far each data point lies from the mean, in units of standard deviation.

#### Identifying Outliers Using Standard Deviations
When the data follows a normal distribution (or is approximately normal), about 68% of the data falls within one standard deviation of the mean, about 95% falls within two standard deviations, and about 99.7% falls within three standard deviations. Therefore, a data point more than two standard deviations from the mean is relatively rare: only about 5% of normally distributed data lies that far out.
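The percentages above can be checked numerically. The following is a minimal sketch, assuming an exactly normal distribution; the helper name `coverage_within` is hypothetical and introduced only for illustration. It uses the identity that the probability of falling within k standard deviations of the mean equals erf(k / √2).

```python
import math


def coverage_within(k):
    """Probability that a normal variable lies within k SDs of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    inside = coverage_within(k)
    print(f"within {k} SD: {inside:.4f}, outside: {1 - inside:.4f}")
```

For real data that is only approximately normal, the observed fractions will differ somewhat from these values.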

To identify outliers using the "more than 2 standard deviations from the mean" rule:

- Calculate the mean and standard deviation of the dataset. 
- Determine the **lower boundary** as (Mean - 2 x SD) and the **upper boundary** as (Mean + 2 x SD).
- Any data point that falls outside of these boundaries can be considered an **outlier.**
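The steps above can be sketched in a few lines of NumPy. This is an illustrative example only: the `values` array is made-up data, and the choice of the sample standard deviation (`ddof=1`) is an assumption, since the text does not say which variant to use.

```python
import numpy as np

# Hypothetical sample: values clustered near 50 plus one extreme value
values = np.array([48, 52, 50, 49, 51, 47, 53, 50, 49, 51, 95], dtype=float)

# Step 1: mean and standard deviation (sample SD is an assumption here)
mean = values.mean()
sd = values.std(ddof=1)

# Step 2: lower and upper boundaries at 2 standard deviations
lower = mean - 2 * sd
upper = mean + 2 * sd

# Step 3: anything outside the boundaries is flagged as an outlier
outliers = values[(values < lower) | (values > upper)]

print(f"boundaries: [{lower:.2f}, {upper:.2f}]")
print("outliers:", outliers)
```

Note that the extreme value itself inflates both the mean and the standard deviation, so on small samples this rule can be less sensitive than it first appears.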


<div style="text-align: center;">
    <img src="./img/2/outliers_standard_diviation.gif" width="500" height="400">
</div>

## Cleaning data

"Shifting and scaling" are common data preprocessing techniques in machine learning and statistics. Shifting subtracts a constant (typically the mean) from each feature, and scaling divides each feature by a constant (typically the standard deviation or the range). These transformations can improve the performance of many algorithms by putting features on comparable scales.

#### Why shift and scale?

- Many algorithms, especially those that use gradient descent (like neural networks) or distance-based metrics (like K-means clustering or KNN), benefit from data preprocessing.

- Features with different scales can cause problems. For instance, a feature with a range of [0, 1000] will dominate a feature with a range of [0, 1] in distance-based algorithms.

- Gradient descent can converge faster if the data is well-scaled and centered.

- For algorithms that use regularization, features on different scales are penalized unevenly, because the size of each coefficient depends on its feature's scale.
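To make the point about distance-based algorithms concrete, here is a small sketch with two made-up points. The feature ranges of [0, 1] and [0, 1000] mirror the example above; the points `a` and `b` and the `ranges` array are hypothetical.

```python
import numpy as np

# Hypothetical points: feature 0 spans [0, 1], feature 1 spans [0, 1000]
a = np.array([0.1, 500.0])
b = np.array([0.9, 510.0])

# Raw Euclidean distance: dominated by the large-scale feature,
# even though the points differ by 80% of the range in feature 0
# and only 1% of the range in feature 1
raw_distance = np.linalg.norm(a - b)

# Divide each feature by its (assumed) range before measuring distance
ranges = np.array([1.0, 1000.0])
scaled_distance = np.linalg.norm((a - b) / ranges)

print(f"raw distance:    {raw_distance:.3f}")
print(f"scaled distance: {scaled_distance:.3f}")
```

After rescaling, the distance reflects the large relative difference in the first feature rather than the small relative difference in the second.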

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Generate a sample dataset
data = np.array([
    [1, 2000],
    [2, 2500],
    [3, 2100],
    [4, 2400],
    [5, 2300]
])

# Shifting and scaling using Z-score normalization
scaler = StandardScaler()
data_standardized = scaler.fit_transform(data)
print("Standardized Data:")
print(data_standardized)

# Scaling using Min-Max scaling
min_max_scaler = MinMaxScaler()
data_min_max_scaled = min_max_scaler.fit_transform(data)
print("\nMin-Max Scaled Data:")
print(data_min_max_scaled)
```


```text
Standardized Data:
[[-1.41421356 -1.40182605]
 [-0.70710678  1.29399328]
 [ 0.         -0.86266219]
 [ 0.70710678  0.75482941]
 [ 1.41421356  0.21566555]]

Min-Max Scaled Data:
[[0.   0.  ]
 [0.25 1.  ]
 [0.5  0.2 ]
 [0.75 0.8 ]
 [1.   0.6 ]]
```
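To see what the two scalers compute, the same results can be reproduced with plain NumPy. This is a sketch of the underlying arithmetic with default settings, not a replacement for the scikit-learn classes; the variable names `standardized` and `min_max` are chosen here for illustration.

```python
import numpy as np

data = np.array([
    [1, 2000],
    [2, 2500],
    [3, 2100],
    [4, 2400],
    [5, 2300]
], dtype=float)

# Z-score: shift each column by its mean, scale by its standard deviation.
# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default.
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# Min-Max: shift each column by its minimum, scale by its range
col_min = data.min(axis=0)
col_max = data.max(axis=0)
min_max = (data - col_min) / (col_max - col_min)

print(standardized)
print(min_max)
```

Each column is transformed independently, which is why both features end up on comparable scales despite their very different original ranges.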
