# **Data Transformation:**
> Data transformation in the context of data preprocessing refers to the process of changing the `format`, `structure`, or `values` of data to prepare it for analysis.

This can involve a wide range of activities, including:

**1. Standardization**: Rescaling data to have a `mean` of `0` and a `standard deviation` of `1`. 
- This is often used with algorithms that are sensitive to feature scale or that work best with centered, roughly Gaussian inputs, such as linear regression, logistic regression, and linear discriminant analysis.

**2. Normalization**: Scaling numerical data to fall within a certain range, often `0` to `1`, to allow for fair comparison between different features.

**3. Encoding categorical variables**: `Converting categorical` variables into a format that can be used by machine learning algorithms, such as one-hot encoding or ordinal encoding.

**4. Discretization**: Converting `continuous data` into `discrete bins`, for example, converting age into age groups.

**5. Handling datetime variables**: Extracting components of date-time variables, such as the year, month, day, or day of the week, or calculating the duration between dates.

**6. Feature extraction**: `Creating new features from existing ones`, such as creating a "total income" feature from "monthly income" and "number of months".

These transformations help to make the data more suitable for analysis and can improve the performance of machine learning models.
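A few of the transformations listed above (encoding, discretization, datetime handling, and creating a new feature) can be sketched with pandas. The column names, bin edges, and values below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data; every column name and value is illustrative only
toy = pd.DataFrame({
    "age": [22, 35, 58],
    "region": ["north", "south", "north"],
    "signup": pd.to_datetime(["2023-01-15", "2023-06-01", "2024-02-29"]),
    "monthly_income": [1000, 2500, 4000],
    "months": [12, 6, 3],
})

# Encoding: one-hot encode the categorical column
encoded = pd.get_dummies(toy["region"], prefix="region")

# Discretization: convert age into age groups (bin edges are an assumption)
toy["age_group"] = pd.cut(
    toy["age"], bins=[0, 30, 50, 120], labels=["young", "middle", "senior"]
)

# Datetime handling: extract components of the date
toy["signup_year"] = toy["signup"].dt.year
toy["signup_dow"] = toy["signup"].dt.dayofweek  # Monday=0 ... Sunday=6

# New feature from existing ones: total income over the period
toy["total_income"] = toy["monthly_income"] * toy["months"]
```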

## **1. Linear Transformation:**

### **1.1 Standardization / Standard Scaling:**

`Standard scaling` is a method of scaling the data such that the distribution of the data is centered around 0, with a standard deviation of 1. This is done by subtracting the mean of the data from each data point and then dividing by the standard deviation of the data. This is a very common method of scaling data, and is used in many machine learning algorithms.

The formula is as follows:

Z = (X - μ) / σ

Where:
- Z is the standardized value,
- X is the original value,
- μ is the mean of the feature,
- σ is the standard deviation of the feature.
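As a quick check of the formula, the sketch below applies it by hand with NumPy and compares the result with `StandardScaler`. The sample values are illustrative. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default for `std()`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative feature values
x = np.array([25.0, 30.0, 35.0, 40.0, 45.0])

mu = x.mean()       # mean of the feature
sigma = x.std()     # population standard deviation (ddof=0)
z_manual = (x - mu) / sigma

# StandardScaler expects a 2D array of shape (n_samples, n_features)
z_sklearn = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(z_manual)
print(np.allclose(z_manual, z_sklearn))
```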

**Types**:

There's only one type of standard scaling, but it's worth noting that similar techniques include `Min-Max scaling` (normalization), which scales data to a specified range (usually 0 to 1), `MaxAbsScaler`, which scales each feature by its maximum absolute value, and `Robust scaling`, which scales data according to the interquartile range and is less affected by outliers.

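**Summary:** (for linear data)
- Standardization / Standard Scaling: unbounded, typically about `-3 to +3` (can handle negative values)
- MinMaxScaler: `0 to 1` by default (can handle negative values; the output range is set by `feature_range`)
- MaxAbsScaler: `-1 to +1` (can handle negative values)
- RobustScaler: unbounded; centered on the median and scaled by the interquartile range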

**When to Use**:

- Standard scaling is used with algorithms that are sensitive to feature scale or that work best with centered, roughly Gaussian inputs, such as linear regression, logistic regression, and linear discriminant analysis.
- It's also useful when features in your dataset have different scales but need to be on the same scale for the algorithm to perform well, such as in support vector machines (SVM) or k-nearest neighbors (KNN).
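To illustrate the point about distance-based models, here is a minimal sketch that compares KNN with and without standard scaling on synthetic data. The data, the feature scales, and the choice of `n_neighbors=5` are assumptions made for the example. Putting the scaler in a pipeline means it is fitted on the training split only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: two features on very different scales (illustrative)
X = np.column_stack([rng.normal(0, 1, 400), rng.normal(0, 1000, 400)])
y = (X[:, 0] > 0).astype(int)  # the label depends only on the small-scale feature

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# With scaling: the pipeline fits the scaler on the training data only
scaled_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_acc = scaled_model.fit(X_train, y_train).score(X_test, y_test)

# Without scaling: distances are dominated by the large-scale feature
raw_model = KNeighborsClassifier(n_neighbors=5)
raw_acc = raw_model.fit(X_train, y_train).score(X_test, y_test)

print(f"accuracy with scaling: {scaled_acc:.2f}, without scaling: {raw_acc:.2f}")
```

On data like this, the unscaled model tends to perform close to chance because the large-scale feature carries no information about the label.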

**Limitations**:

- Standard scaling does not normalize the distribution of the data, so it might not be suitable for data that does not follow a Gaussian distribution / Normal Distribution.
- It's sensitive to outliers. If there are outliers in the data, the scaled data will also have outliers.
- The interpretability of the features is lost after scaling. The scaled features are hard to interpret in their original context.
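The loss of interpretability can be partly worked around, because a fitted scaler stores the statistics it learned. The sketch below, using illustrative values, maps scaled data back to the original units with `inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: two features (for example, age and height)
X = np.array([[25.0, 165.0], [35.0, 175.0], [45.0, 185.0]])

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# The learned per-feature mean and standard deviation
print(scaler.mean_, scaler.scale_)

# Map the scaled values back to the original units
X_back = scaler.inverse_transform(X_scaled)
print(np.allclose(X_back, X))
```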

#### **For Non-Parametric Distribution:**
- **Quantile Transformer:** a non-linear transformation that uses the quantiles of each feature to map it onto a uniform or normal distribution.
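A minimal sketch of scikit-learn's `QuantileTransformer` on skewed synthetic data follows. The exponential sample, `n_quantiles=100`, and the normal output distribution are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(42)

# Synthetic, strongly right-skewed feature (illustrative)
X_skewed = rng.exponential(scale=2.0, size=(1000, 1))

# Map the feature onto an approximately normal distribution
qt = QuantileTransformer(n_quantiles=100, output_distribution="normal", random_state=0)
X_normal = qt.fit_transform(X_skewed)

print("median before:", np.median(X_skewed), "median after:", np.median(X_normal))
```

Unlike the scalers in this section, this transformation changes the shape of the distribution, so the relative distances between values are not preserved.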

#### **1.1 Using SKLearn Library:**

```python
# import libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler
```

```python
# make an example dataset
data = {
    'age': [25, 30, 35, 40, 45],
    'height': [165, 170, 175, 180, 185],
    'weight': [55, 60, 65, 70, 75]
}

# convert this data to a pandas DataFrame
df = pd.DataFrame(data)
df.head()
```

|   | age | height | weight |
|---|-----|--------|--------|
| 0 | 25  | 165    | 55     |
| 1 | 30  | 170    | 60     |
| 2 | 35  | 175    | 65     |
| 3 | 40  | 180    | 70     |
| 4 | 45  | 185    | 75     |


#### **1. Standard Scaling:**
(typically about -3 to +3, but not bounded)

> Standard Scaler is a preprocessing technique used in machine learning to standardize the dataset’s features to have a mean of 0 and a standard deviation of 1. It is also known as `Z-score normalization`.

The `formula` for standard scaling is:

Z = (X - μ) / σ

Where:
- Z is the standardized value,
- X is the original value,
- μ is the mean of the feature,
- σ is the standard deviation of the feature.

**When to use?**
- Standard Scaler is used when the features of the input dataset have large differences between their ranges, or when a standard normal distribution is assumed based on the machine learning algorithm used. 
- Algorithms like Support Vector Machines (SVM), Linear Regression, Logistic Regression, and K-Nearest Neighbors (KNN) perform better when data is standardized. 

However, it's important to note that Standard Scaler is sensitive to outliers, so if the dataset contains significant outliers, another scaling method may be more appropriate.

```python
# create the scaler
scaler = StandardScaler()

# fit the scaler on the data and transform it
scaled_df = scaler.fit_transform(df)

# convert the scaled array back into a pandas DataFrame
scaled_df = pd.DataFrame(scaled_df, columns=df.columns)
scaled_df.head()
```

|   | age       | height    | weight    |
|---|-----------|-----------|-----------|
| 0 | -1.414214 | -1.414214 | -1.414214 |
| 1 | -0.707107 | -0.707107 | -0.707107 |
| 2 | 0.0       | 0.0       | 0.0       |
| 3 | 0.707107  | 0.707107  | 0.707107  |
| 4 | 1.414214  | 1.414214  | 1.414214  |


#### **2. MinMaxScaler:**
(0 to 1)
> MinMax Scaler is a data preprocessing technique used to `normalize the range of independent variables` or features of data. It scales and translates each feature individually such that it is in the given range on the training set, typically between 0 and 1.

The `formula` for MinMax Scaler is:

```
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
```

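Where:
- `X` is the original feature vector,
- `X.min(axis=0)` is the minimum value of the feature,
- `X.max(axis=0)` is the maximum value of the feature,
- `X_std` is the feature rescaled to the range 0 to 1 (not a standard deviation),
- `min` and `max` are the bounds of the target range (`feature_range`, which defaults to `(0, 1)`),
- `X_scaled` is the scaled feature.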
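The two-step formula can be checked by hand. The sketch below uses illustrative values and a target range of -1 to 1 (an arbitrary choice, passed to scikit-learn through the `feature_range` parameter) so that the second step is visible.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative single-feature data, shape (n_samples, n_features)
X = np.array([[25.0], [30.0], [35.0], [40.0], [45.0]])
lo, hi = -1.0, 1.0  # target range, chosen only for illustration

# Step 1: rescale to 0..1; Step 2: stretch and shift to lo..hi
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_manual = X_std * (hi - lo) + lo

X_sklearn = MinMaxScaler(feature_range=(lo, hi)).fit_transform(X)

print(X_manual.ravel())
print(np.allclose(X_manual, X_sklearn))
```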

**When to use?**
- MinMax Scaler is used when the distribution is not Gaussian or the standard deviation is very small.
- It preserves zero entries only when a feature's minimum is 0, because it shifts the data by the minimum; for sparse data, MaxAbsScaler is usually the better fit.
- However, it is sensitive to outliers, so if the dataset contains significant outliers, MinMax Scaler might not be the best choice.

```python
# create the scaler
scaler = MinMaxScaler()

# fit the scaler on the data and transform it
scaled_df = scaler.fit_transform(df)

# convert the scaled array back into a pandas DataFrame
scaled_df = pd.DataFrame(scaled_df, columns=df.columns)
scaled_df.head()
```

|   | age  | height | weight |
|---|------|--------|--------|
| 0 | 0.0  | 0.0    | 0.0    |
| 1 | 0.25 | 0.25   | 0.25   |
| 2 | 0.5  | 0.5    | 0.5    |
| 3 | 0.75 | 0.75   | 0.75   |
| 4 | 1.0  | 1.0    | 1.0    |


#### **3. MaxAbsScaler:**
(-1 to 1)
> MaxAbsScaler is a data preprocessing technique that scales each feature by its `maximum absolute value`. This is a type of scaling that does not shift/center the data, and thus it does not destroy any sparsity.

The `formula` for MaxAbsScaler is:


X_scaled = X / max(abs(X))


Where:
- `X_scaled` is the scaled feature,
- `X` is the original feature,
- `max(abs(X))` is the maximum absolute value of the feature.

**When to use?**
- MaxAbsScaler is meant for data that is already centered or sparse data. 
- It does not shift the data, and thus does not destroy any sparsity. 
- This makes it a suitable technique for handling sparse datasets where zero entries need to be preserved. 
- It's also useful when the dataset contains features with varying scales but does not contain large outliers, as MaxAbsScaler does not reduce the impact of outliers.
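The sketch below applies the formula by hand on illustrative data that contains negative values and zeros, then repeats the scaling on a SciPy sparse matrix to show that zero entries stay zero. The data and variable names are assumptions for the example.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Illustrative data with negative values and zeros
X = np.array([[-4.0, 0.0], [2.0, 0.0], [0.0, 5.0], [1.0, -10.0]])

# Divide each feature (column) by its maximum absolute value
X_manual = X / np.abs(X).max(axis=0)
X_sklearn = MaxAbsScaler().fit_transform(X)
print(np.allclose(X_manual, X_sklearn))

# The same scaler accepts sparse input and keeps the result sparse
X_sparse = sparse.csr_matrix(X)
X_sparse_scaled = MaxAbsScaler().fit_transform(X_sparse)
print(X_sparse.nnz, X_sparse_scaled.nnz)  # number of stored non-zero entries
```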

```python
# create the scaler
scaler = MaxAbsScaler()

# fit the scaler on the data and transform it
scaled_df = scaler.fit_transform(df)

# convert the scaled array back into a pandas DataFrame
scaled_df = pd.DataFrame(scaled_df, columns=df.columns)
scaled_df.head()
```

|   | age      | height   | weight   |
|---|----------|----------|----------|
| 0 | 0.555556 | 0.891892 | 0.733333 |
| 1 | 0.666667 | 0.918919 | 0.8      |
| 2 | 0.777778 | 0.945946 | 0.866667 |
| 3 | 0.888889 | 0.972973 | 0.933333 |
| 4 | 1.0      | 1.0      | 1.0      |


#### **4. RobustScaler:**
(not bounded to a fixed range)
> RobustScaler is a preprocessing technique that scales features using statistics that are robust to outliers. This method removes the median and scales the data according to the Interquartile Range (`IQR`). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

The `formula` for RobustScaler is:


X_scaled = (X - median) / (Q3 - Q1)


Where:
- `X_scaled` is the scaled feature,
- `X` is the original feature,
- `median` is the median of the feature,
- `Q1` is the first quartile of the feature,
- `Q3` is the third quartile of the feature.

**When to use?**
- RobustScaler is used when you want to reduce the effects of outliers, as it uses the median and the Interquartile Range, which are far less influenced by outliers than the mean and standard deviation.
- It's a good choice for data that contains outliers or when standardizing the distribution is not the aim.
- It's also useful when the data is not normally distributed. It is still a linear rescaling, so it does not distort the relative distances between the feature values, and outliers remain present after scaling.
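To make the effect on outliers concrete, the sketch below applies median and IQR scaling by hand, checks it against `RobustScaler`, and compares the spread of the non-outlier values with what `StandardScaler` gives. The data, including the outlier value of 500, is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Illustrative data: five ordinary values plus one large outlier
X = np.array([[25.0], [30.0], [35.0], [40.0], [45.0], [500.0]])

# Manual robust scaling: subtract the median, divide by the IQR
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
X_manual = (X - median) / (q3 - q1)

X_robust = RobustScaler().fit_transform(X)
X_standard = StandardScaler().fit_transform(X)
print(np.allclose(X_manual, X_robust))

# Spread (max - min) of the five ordinary values after each scaling
inlier_spread_robust = X_robust[:5].max() - X_robust[:5].min()
inlier_spread_standard = X_standard[:5].max() - X_standard[:5].min()
print(inlier_spread_robust, inlier_spread_standard)
```

With standard scaling, the outlier inflates the mean and standard deviation, so the ordinary values are squeezed into a narrow band; with robust scaling they keep a usable spread, while the outlier itself is still present in the output.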

```python
from sklearn.preprocessing import RobustScaler

# create the scaler
scaler = RobustScaler()

# fit the scaler on the data and transform it
scaled_df = scaler.fit_transform(df)

# convert the scaled array back into a pandas DataFrame
scaled_df = pd.DataFrame(scaled_df, columns=df.columns)
scaled_df.head()
```

|   | age  | height | weight |
|---|------|--------|--------|
| 0 | -1.0 | -1.0   | -1.0   |
| 1 | -0.5 | -0.5   | -0.5   |
| 2 | 0.0  | 0.0    | 0.0    |
| 3 | 0.5  | 0.5    | 0.5    |
| 4 | 1.0  | 1.0    | 1.0    |
