# 🌳 🔄 Complete Guide to Transformation from A to Z

Welcome to this comprehensive guide on data transformation, designed to equip you with the knowledge and skills to effectively preprocess and transform your datasets. Whether you're a budding data scientist or a seasoned professional looking to refine your data transformation techniques, this notebook is tailored for you!

## What Will You Learn?

In this guide, we will explore various methods to normalize, construct, discretize, and aggregate features, ensuring you have the tools to confidently prepare your data for any analysis or modeling task. Here's what we'll cover:

### 1. Feature Normalization

Learn how to scale and adjust the statistical distribution of feature values to improve the performance and accuracy of your models.

- **Min-Max Scaling**: Scale features to a fixed range, typically 0 to 1.
- **Z-Score Standardization**: Transform features to have a mean of 0 and a standard deviation of 1.
- **Robust Scaling**: Scale features using statistics that are robust to outliers, such as the median and interquartile range.
- **Yeo-Johnson Transformation**: Apply a transformation that can handle both positive and negative values to achieve normality.
- **Box-Cox Transformation**: Apply a transformation that works with positive values to achieve normality and reduce skewness.

### 2. Feature Construction

Learn techniques to create new features from existing ones, enhancing the predictive power of your models.

- **Use of Domain Knowledge**: Incorporate insights from the specific field or industry to construct meaningful features.
- **Using Statistical Relationships Between Features**: Identify and utilize correlations and interactions between features.
- **Numerical Coding of Nominal Values**:
  - **One-Hot Encoding**: Convert categorical variables into a series of binary variables.
  - **Ordinal or Label Encoding**: Assign integer values to categories based on their order or labels.
  - **Probability Ratio Encoding**: Replace each category with the ratio of the probability that the target is 1 to the probability that it is 0 within that category.

### 3. Feature Discretization

Learn how to transform continuous features into discrete ones to simplify models and capture nonlinear relationships.

- **Domain Knowledge**: Use expert knowledge to define meaningful bins.
- **Unsupervised Methods**:
  - **Equal-Width Binning**: Divide the range of values into equal-width bins.
  - **Equal-Frequency Binning**: Divide the range of values so that each bin has approximately the same number of observations.
  - **K-Means Binning**: Use k-means clustering to create bins based on feature similarity.
- **Supervised Methods**:
  - **ChiMerge**: Repeatedly merge adjacent bins whose target distributions are similar according to the chi-squared statistic.
  - **Decision Tree Binning**: Use decision trees to create bins based on target variable splits.

### 4. Feature Aggregation

Learn how to combine multiple features into single aggregated features to reduce dimensionality and capture higher-level information.

- **Summarizing Features**: Calculate summary statistics (e.g., mean, median, sum) for groups of features.
- **Hierarchical Aggregation**: Aggregate features based on hierarchical or nested groupings.
- **Temporal Aggregation**: Aggregate features based on time intervals (e.g., daily, monthly averages).

## Why This Guide?

- **Step-by-Step Tutorials**: Each section includes clear explanations followed by practical examples, ensuring you not only learn but also apply your knowledge.
- **Interactive Learning**: Engage with interactive code cells that allow you to see the effects of data transformation methods in real-time.

### How to Use This Notebook

- **Run the Cells**: Follow along with the code examples by running the cells yourself. Modify the parameters to see how the results change.
- **Explore Further**: After completing the guided sections, try applying the methods to your own datasets to reinforce your learning.

Prepare to unlock the full potential of data transformation in data analysis. Let's dive in and transform data into valuable insights!


In [1]:
import pandas as pd

# Load the dataset
file_path = '/kaggle/input/loans-and-liability/LoanData_Preprocessed_v1.1.csv'
data = pd.read_csv(file_path)

# Convert 'ed' and 'default' columns to object type
data['ed'] = data['ed'].astype('object')
data['default'] = data['default'].astype('object')

# Display column names, non-null counts, and dtypes
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       680 non-null    float64
 1   employ    700 non-null    int64  
 2   address   700 non-null    int64  
 3   income    663 non-null    float64
 4   debtinc   700 non-null    float64
 5   creddebt  700 non-null    float64
 6   othdebt   700 non-null    float64
 7   ed        680 non-null    object 
 8   default   700 non-null    object 
dtypes: float64(5), int64(2), object(2)
memory usage: 49.3+ KB


In [2]:
data.describe()

| | age | employ | address | income | debtinc | creddebt | othdebt |
|---|---|---|---|---|---|---|---|
| count | 680.0 | 700.0 | 700.0 | 663.0 | 700.0 | 700.0 | 700.0 |
| mean | 34.75 | 8.388571 | 8.268571 | 45.74359 | 10.260571 | 1.553553 | 3.058209 |
| std | 7.973215 | 6.658039 | 6.821609 | 37.44108 | 6.827234 | 2.117197 | 3.287555 |
| min | 20.0 | 0.0 | 0.0 | 14.0 | 0.4 | 0.011696 | 0.045584 |
| 25% | 28.0 | 3.0 | 3.0 | 24.0 | 5.0 | 0.369059 | 1.044178 |
| 50% | 34.0 | 7.0 | 7.0 | 34.0 | 8.6 | 0.854869 | 1.987567 |
| 75% | 40.0 | 12.0 | 12.0 | 54.5 | 14.125 | 1.901955 | 3.923065 |
| max | 56.0 | 31.0 | 34.0 | 446.0 | 41.3 | 20.56131 | 27.0336 |


## Dataset Overview

The dataset contains information about loan applicants and includes the following columns:

- **age**: The age of the applicant in years.
  - **Range**: 20 - 56
  - **Mean**: 34.75
  - **Skewness**: Slightly skewed to the right (positive skew).


- **employ**: The number of years the applicant has been employed, which can indicate their job stability and experience.
  - **Range**: 0 - 31
  - **Mean**: 8.39
  - **Skewness**: Right-skewed (positive skew).


- **address**: The number of years the applicant has lived at their current address, providing insights into their residential stability.
  - **Range**: 0 - 34
  - **Mean**: 8.27
  - **Skewness**: Right-skewed (positive skew).


- **income**: The annual income of the applicant (in thousands), representing their earning capacity.
  - **Range**: 14 - 446
  - **Mean**: 45.74
  - **Skewness**: Right-skewed (positive skew).


- **debtinc**: The debt-to-income ratio of the applicant, calculated as the percentage of their income that goes towards paying debts. This ratio helps assess their financial burden.
  - **Range**: 0.40 - 41.30
  - **Mean**: 10.26
  - **Skewness**: Right-skewed (positive skew).


- **creddebt**: The amount of credit card debt the applicant has (in thousands), showing their reliance on credit and their debt levels.
  - **Range**: 0.01 - 20.56
  - **Mean**: 1.55
  - **Skewness**: Right-skewed (positive skew).


- **othdebt**: The amount of other debt the applicant has (in thousands), which includes all other forms of debt apart from credit card debt.
  - **Range**: 0.05 - 27.03
  - **Mean**: 3.06
  - **Skewness**: Right-skewed (positive skew).


- **ed**: The education level of the applicant (encoded numerically), where higher numbers may represent higher levels of education.
  - **Unique Values**: 1.0, 2.0, 3.0, 4.0, 5.0
  - **Most Frequent Value (Mode)**: 1.0


- **default**: A binary indicator of whether the applicant defaulted on the loan (1 for default, 0 for no default), indicating their credit risk.
  - **Unique Values**: 0, 1
  - **Most Frequent Value (Mode)**: 0 (majority did not default)
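
The summary figures in this overview can be recomputed directly instead of being typed by hand. The following is a minimal sketch that assumes a small made-up DataFrame named `toy` in place of the loan data; the column names mirror the dataset, but the values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the loan data; the values are illustrative only.
toy = pd.DataFrame({
    "age": [20.0, 28.0, 34.0, 40.0, 56.0],
    "income": [14.0, 24.0, 34.0, 54.0, 446.0],
})

# Range, mean, and sample skewness for each numeric column.
summary = toy.agg(["min", "max", "mean", "skew"]).T

# Number of missing values per column (relevant for age, income, and ed in the real data).
summary["missing"] = toy.isna().sum()

print(summary)
```

A positive value in the `skew` column corresponds to the right skew described above. Running the same two calls on `data` would give the figures for the actual dataset.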


## 1. Feature Normalization

Feature normalization is a crucial step in the data preprocessing pipeline. It involves adjusting the values of numerical features to ensure they have a common scale, which can improve the performance and training stability of machine learning models. In this section, we will explore various normalization techniques and demonstrate how to apply them using practical examples.

### Why Normalize Features?

Normalization can help in:
- **Improving Model Performance**: Models trained with gradient descent typically converge faster on normalized data.
- **Enhancing Accuracy**: Normalization keeps features with larger scales from dominating distance-based or regularized models.
- **Stability**: Training becomes less sensitive to the units in which each feature happens to be measured.

### Techniques Covered:

1. **Min-Max Scaling**: This technique scales the features to a fixed range, typically [0, 1]. The formula is:

   $$
   X' = \frac{X - X_{min}}{X_{max} - X_{min}}
   $$

2. **Z-Score Standardization**: Also known as standardization, this method transforms features to have a mean of 0 and a standard deviation of 1. The formula is:

   $$
   X' = \frac{X - \mu}{\sigma}
   $$

   where \(\mu\) is the mean and \(\sigma\) is the standard deviation of the feature.

3. **Robust Scaling**: This method scales features using statistics that are robust to outliers, such as the median and the interquartile range. The formula is:

   $$
   X' = \frac{X - Q2}{Q3 - Q1}
   $$

   where \(Q1\) and \(Q3\) are the 1st and 3rd quartiles, respectively, and \(Q2\) is the median.

4. **Yeo-Johnson Transformation**: This technique can handle both positive and negative values and transforms the data to be more normally distributed.

5. **Box-Cox Transformation**: This method works with positive values and transforms the data to be more normally distributed, reducing skewness.
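
The first three formulas can be checked by hand on a tiny array. This is a minimal sketch using NumPy only; the array `x` is a made-up example, not data from the loan dataset.

```python
import numpy as np

# Toy feature with one comparatively large value (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# Min-Max Scaling: (x - min) / (max - min)
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-Score Standardization: (x - mean) / std, using the population std (ddof=0)
x_z = (x - x.mean()) / x.std()

# Robust Scaling: (x - median) / (Q3 - Q1)
q1, q2, q3 = np.percentile(x, [25, 50, 75])
x_robust = (x - q2) / (q3 - q1)

print(x_minmax)
print(x_z)
print(x_robust)
```

In the robust version the four typical values land between -1 and 0.5 while the large value is mapped to 3.5, so it stays visible without compressing the rest.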

### Why Split Train and Test Data?

Splitting the data into training and testing sets is a crucial step in the machine learning pipeline. It ensures that the model's performance can be evaluated on unseen data, providing a more realistic estimate of its effectiveness in real-world scenarios. By keeping the test data separate:
- **Avoid Data Leakage**: Ensures that information from the test set does not influence the model during training.
- **Model Evaluation**: Provides an unbiased evaluation metric for how well the model generalizes to new data.
- **Hyperparameter Tuning**: Tune model parameters on a separate validation set or with cross-validation inside the training data, so that the test set stays untouched for the final evaluation.

Let's see how to apply these transformations to our dataset after splitting the data into training and testing sets.
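
To make the leakage point concrete, the sketch below learns the scaling parameters from the training values only and reuses them on the test values. The two arrays are hypothetical and the scaling is written out by hand with NumPy so the mechanics are visible.

```python
import numpy as np

train = np.array([10.0, 20.0, 30.0, 40.0])  # hypothetical training values
test = np.array([5.0, 25.0, 50.0])          # hypothetical test values

# Learn the scaling parameters from the training split only.
lo, hi = train.min(), train.max()

train_scaled = (train - lo) / (hi - lo)
# Reuse the training parameters; the test split contributes nothing to them.
test_scaled = (test - lo) / (hi - lo)

print(train_scaled)
print(test_scaled)
```

Test values may fall outside [0, 1] when they lie outside the training range. That is expected behavior rather than an error: it reflects that the test data was genuinely unseen when the parameters were fitted.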


## 1.1 Min-Max Scaling

Min-Max Scaling transforms features by scaling them to a given range, usually [0, 1]. This technique is useful when the features have different ranges and you want to ensure they contribute equally to the analysis.

### Columns

For our dataset, the following columns are suitable for Min-Max Scaling:
- `age`
- `debtinc`
- `creddebt`

### Applying Min-Max Scaling

Let's apply Min-Max Scaling to these columns.


In [3]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import pandas as pd


# Split the dataset into training and testing sets
train_data, test_data = train_test_split(data.copy(), test_size=0.2, random_state=42)

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Select the columns to scale
columns_to_scale = ['age', 'debtinc', 'creddebt']

# Make copies of the train and test data for transformations
train_data_min_max_scaled = train_data.copy()
test_data_min_max_scaled = test_data.copy()

# Fit the scaler to the training data and transform both training and testing data
train_data_min_max_scaled[columns_to_scale] = scaler.fit_transform(train_data_min_max_scaled[columns_to_scale])
test_data_min_max_scaled[columns_to_scale] = scaler.transform(test_data_min_max_scaled[columns_to_scale])

# Display the first few rows of the transformed training dataset to verify the scaling
print("Min-Max Scaled Data (Train):")
display(train_data_min_max_scaled.head(20))


Min-Max Scaled Data (Train):


| | age | employ | address | income | debtinc | creddebt | othdebt | ed | default |
|---|---|---|---|---|---|---|---|---|---|
| 82 | 0.472222 | 7 | 3 | 32.0 | 0.469438 | 0.167622 | 3.57504 | 1.0 | 0 |
| 51 | 0.722222 | 1 | 12 | 20.0 | 0.332518 | 0.049782 | 1.9908 | 1.0 | 0 |
| 220 | 0.222222 | 11 | 6 | 24.0 | 0.0489 | 0.005958 | 0.468864 | 1.0 | 0 |
| 669 | 0.444444 | 10 | 4 | 43.0 | 0.308068 | 0.059288 | 4.62852 | 3.0 | 0 |
| 545 | 0.638889 | 10 | 24 | 37.0 | 0.198044 | 0.041479 | 2.468825 | 2.0 | 0 |
| 302 | 0.75 | 22 | 19 | 81.0 | 0.124694 | 0.093266 | 2.94921 | 1.0 | 0 |
| 577 | 0.805556 | 22 | 4 | 79.0 | 0.168704 | 0.01727 | 5.47865 | 2.0 | 0 |
| 215 | 0.388889 | 12 | 8 | 47.0 | 0.154034 | 0.080453 | 1.848463 | 3.0 | 0 |
| 235 | 0.111111 | 7 | 0 | 18.0 | 0.149144 | 0.032136 | 0.6435 | 1.0 | 0 |
| 18 | 0.527778 | 6 | 9 | 61.0 | 0.129584 | 0.034431 | 2.913726 | 1.0 | 0 |
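
A fitted `MinMaxScaler` stores the minimum and maximum it learned, can undo the scaling, and can optionally clip unseen values into the target range. The sketch below uses made-up age values; the `clip` parameter is assumed to be available, which requires scikit-learn 0.24 or newer.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[20.0], [30.0], [56.0]])  # hypothetical training ages
X_test = np.array([[18.0], [60.0]])           # hypothetical ages outside the training range

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Parameters learned from the training data.
print(scaler.data_min_, scaler.data_max_)

# Unseen values outside the training range map outside [0, 1] ...
unclipped = scaler.transform(X_test)

# ... unless clipping is requested (scikit-learn >= 0.24).
clipped = MinMaxScaler(clip=True).fit(X_train).transform(X_test)

# The scaling is invertible, which helps when reporting results in original units.
restored = scaler.inverse_transform(X_train_scaled)
```

Whether to clip is a modeling decision: clipping keeps inputs inside the range the model saw during training, at the cost of discarding how far outside that range a value actually was.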


## 1.2 Z-Score Standardization

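Z-Score Standardization, also known simply as standardization, transforms features to have a mean of 0 and a standard deviation of 1. It is less distorted by outliers than Min-Max Scaling, although the mean and standard deviation are themselves affected by extreme values. It puts features on a comparable scale so that no feature dominates purely because of its units.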

### Columns

For our dataset, the following columns are suitable for Z-Score Standardization:
- `age`
- `debtinc`
- `creddebt`

### Applying Z-Score Standardization

Let's apply Z-Score Standardization to these columns.


In [4]:
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Select the columns to scale
columns_to_scale = ['age', 'debtinc', 'creddebt']

# Make copies of the train and test data for transformations
train_data_z_score_scaled = train_data.copy()
test_data_z_score_scaled = test_data.copy()

# Fit the scaler to the training data and transform both training and testing data
train_data_z_score_scaled[columns_to_scale] = scaler.fit_transform(train_data_z_score_scaled[columns_to_scale])
test_data_z_score_scaled[columns_to_scale] = scaler.transform(test_data_z_score_scaled[columns_to_scale])

# Display the first few rows of the transformed training dataset to verify the scaling
print("Z-Score Standardized Data (Train):")
display(train_data_z_score_scaled.head(20))


Z-Score Standardized Data (Train):


| | age | employ | address | income | debtinc | creddebt | othdebt | ed | default |
|---|---|---|---|---|---|---|---|---|---|
| 82 | 0.28411 | 7 | 3 | 32.0 | 1.359338 | 0.586868 | 3.57504 | 1.0 | 0 |
| 51 | 1.398167 | 1 | 12 | 20.0 | 0.539773 | -0.35867 | 1.9908 | 1.0 | 0 |
| 220 | -0.829948 | 11 | 6 | 24.0 | -1.157897 | -0.710318 | 0.468864 | 1.0 | 0 |
| 669 | 0.160326 | 10 | 4 | 43.0 | 0.393423 | -0.282396 | 4.62852 | 3.0 | 0 |
| 545 | 1.026815 | 10 | 24 | 37.0 | -0.265156 | -0.425299 | 2.468825 | 2.0 | 0 |
| 302 | 1.521952 | 22 | 19 | 81.0 | -0.704209 | -0.009763 | 2.94921 | 1.0 | 0 |
| 577 | 1.76952 | 22 | 4 | 79.0 | -0.440778 | -0.619552 | 5.47865 | 2.0 | 0 |
| 215 | -0.087243 | 12 | 8 | 47.0 | -0.528588 | -0.11257 | 1.848463 | 3.0 | 0 |
| 235 | -1.325084 | 7 | 0 | 18.0 | -0.557858 | -0.500268 | 0.6435 | 1.0 | 0 |
| 18 | 0.531678 | 6 | 9 | 61.0 | -0.674939 | -0.481849 | 2.913726 | 1.0 | 0 |
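
When checking standardized values by hand, note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below illustrates the difference on a made-up series.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy values chosen so that the population std is exactly 2 (illustrative only).
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], name="x")

scaled = StandardScaler().fit_transform(s.to_frame()).ravel()

pop_std = s.std(ddof=0)   # what StandardScaler uses
sample_std = s.std()      # pandas default, ddof=1

# Manual z-scores match the scaler only when the population std is used.
manual = ((s - s.mean()) / pop_std).to_numpy()

print(pop_std, sample_std)
print(np.allclose(scaled, manual))
```

On 560 training rows the two estimates differ only marginally, but the mismatch can be confusing when reproducing results on small samples.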


## 1.3 Robust Scaling

Robust Scaling uses statistics that are robust to outliers, such as the median and the interquartile range, to scale features. This method is useful when the dataset contains outliers that could skew the results of standard scaling methods.

### Columns

For our dataset, the following columns are suitable for Robust Scaling:
- `income`
- `othdebt`

### Applying Robust Scaling

Let's apply Robust Scaling to these columns.


In [5]:
from sklearn.preprocessing import RobustScaler

# Initialize the RobustScaler
scaler = RobustScaler()

# Select the columns to scale
columns_to_scale = ['income', 'othdebt']

# Make copies of the train and test data for transformations
train_data_robust_scaled = train_data.copy()
test_data_robust_scaled = test_data.copy()

# Fit the scaler to the training data and transform both training and testing data
train_data_robust_scaled[columns_to_scale] = scaler.fit_transform(train_data_robust_scaled[columns_to_scale])
test_data_robust_scaled[columns_to_scale] = scaler.transform(test_data_robust_scaled[columns_to_scale])

# Display the first few rows of the transformed training dataset to verify the scaling
print("Robust Scaled Data (Train):")
display(train_data_robust_scaled.head(20))



Robust Scaled Data (Train):


| | age | employ | address | income | debtinc | creddebt | othdebt | ed | default |
|---|---|---|---|---|---|---|---|---|---|
| 82 | 37.0 | 7 | 3 | -0.066116 | 19.6 | 2.69696 | 0.555355 | 1.0 | 0 |
| 51 | 46.0 | 1 | 12 | -0.46281 | 14.0 | 0.8092 | 0.014302 | 1.0 | 0 |
| 220 | 28.0 | 11 | 6 | -0.330579 | 2.4 | 0.107136 | -0.505473 | 1.0 | 0 |
| 669 | 36.0 | 10 | 4 | 0.297521 | 13.0 | 0.96148 | 0.915142 | 3.0 | 0 |
| 545 | 43.0 | 10 | 24 | 0.099174 | 8.5 | 0.676175 | 0.177558 | 2.0 | 0 |
| 302 | 47.0 | 22 | 19 | 1.553719 | 5.5 | 1.50579 | 0.34162 | 1.0 | 0 |
| 577 | 49.0 | 22 | 4 | 1.487603 | 7.3 | 0.28835 | 1.205481 | 2.0 | 0 |
| 215 | 34.0 | 12 | 8 | 0.429752 | 6.7 | 1.300537 | -0.034309 | 3.0 | 0 |
| 235 | 24.0 | 7 | 0 | -0.528926 | 6.5 | 0.5265 | -0.445831 | 1.0 | 0 |
| 18 | 39.0 | 6 | 9 | 0.892562 | 5.7 | 0.563274 | 0.329502 | 1.0 | 0 |
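
The effect of an outlier on the two scalers can be seen side by side. This is a small sketch on made-up income values in which the last entry plays the role of an outlier; it is not taken from the training split above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Toy incomes in thousands; the last value is an extreme outlier (illustrative only).
income = np.array([[14.0], [24.0], [34.0], [54.0], [446.0]])

minmax = MinMaxScaler().fit_transform(income).ravel()
robust = RobustScaler().fit_transform(income).ravel()

# Min-Max squeezes the four typical values into a narrow band near 0,
# because the outlier defines the top of the range.
print(minmax)

# Robust scaling centers on the median and divides by the IQR,
# so the typical values keep a usable spread.
print(robust)
```

Robust scaling does not remove the outlier: it still appears as a large scaled value. What changes is that the remaining observations are no longer compressed by it.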


## 1.4 Yeo-Johnson and Box-Cox Transformations

The Yeo-Johnson and Box-Cox transformations are power transformations used to stabilize variance and make the data more normally distributed. Yeo-Johnson can handle positive, zero, and negative values, whereas Box-Cox is only applicable to strictly positive values.
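
For reference, the standard definitions of the two transformations are given below, where $\lambda$ is the power parameter estimated from the data. The Box-Cox transformation is:

$$
x^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\
\ln x & \text{if } \lambda = 0
\end{cases}
$$

The Yeo-Johnson transformation extends this to zero and negative values:

$$
\psi(x, \lambda) =
\begin{cases}
\dfrac{(x + 1)^{\lambda} - 1}{\lambda} & \text{if } x \geq 0,\ \lambda \neq 0 \\
\ln(x + 1) & \text{if } x \geq 0,\ \lambda = 0 \\
-\dfrac{(-x + 1)^{2 - \lambda} - 1}{2 - \lambda} & \text{if } x < 0,\ \lambda \neq 2 \\
-\ln(-x + 1) & \text{if } x < 0,\ \lambda = 2
\end{cases}
$$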

### Columns

For our dataset, we will check the following columns for negative values and apply the appropriate transformation:
- `age`
- `income`
- `debtinc`
- `creddebt`
- `othdebt`

### Applying Transformations

Let's check for negative values in the columns and apply the Yeo-Johnson Transformation to columns with negative values and the Box-Cox Transformation to columns with only positive values.


In [6]:
from sklearn.preprocessing import PowerTransformer
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# Make a copy of the original dataset
data_copy = data.copy()

# Split the copied dataset into training and testing sets
train_data, test_data = train_test_split(data_copy, test_size=0.2, random_state=42)

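# Check whether any column contains zero or negative values.
# Box-Cox requires strictly positive input, so such columns are routed to Yeo-Johnson.
columns_to_check = ['age', 'income', 'debtinc', 'creddebt', 'othdebt']
contains_negative = {col: np.any(train_data[col] <= 0) for col in columns_to_check}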

# Initialize the PowerTransformers
yeo_johnson_transformer = PowerTransformer(method='yeo-johnson')
box_cox_transformer = PowerTransformer(method='box-cox')

# Make copies of the train and test data for transformations
train_data_yeo_johnson = train_data.copy()
test_data_yeo_johnson = test_data.copy()
train_data_box_cox = train_data.copy()
test_data_box_cox = test_data.copy()

# Apply Yeo-Johnson transformation to columns with negative values
columns_to_transform_yeo_johnson = [col for col, has_negative in contains_negative.items() if has_negative]
if columns_to_transform_yeo_johnson:
    train_data_yeo_johnson[columns_to_transform_yeo_johnson] = yeo_johnson_transformer.fit_transform(train_data_yeo_johnson[columns_to_transform_yeo_johnson])
    test_data_yeo_johnson[columns_to_transform_yeo_johnson] = yeo_johnson_transformer.transform(test_data_yeo_johnson[columns_to_transform_yeo_johnson])
    print("Yeo-Johnson Transformed Data (Train):")
    display(train_data_yeo_johnson.head(20))
else:
    print("No columns with negative values for Yeo-Johnson transformation.")

# Apply Box-Cox transformation to columns with only positive values
columns_to_transform_box_cox = [col for col, has_negative in contains_negative.items() if not has_negative]
if columns_to_transform_box_cox:
    train_data_box_cox[columns_to_transform_box_cox] = box_cox_transformer.fit_transform(train_data_box_cox[columns_to_transform_box_cox])
    test_data_box_cox[columns_to_transform_box_cox] = box_cox_transformer.transform(test_data_box_cox[columns_to_transform_box_cox])
    print("Box-Cox Transformed Data (Train):")
    display(train_data_box_cox.head(20))
else:
    print("No columns with only positive values for Box-Cox transformation.")


No columns with negative values for Yeo-Johnson transformation.
Box-Cox Transformed Data (Train):


| | age | employ | address | income | debtinc | creddebt | othdebt | ed | default |
|---|---|---|---|---|---|---|---|---|---|
| 82 | 0.384146 | 7 | 3 | -0.135337 | 1.288082 | 1.02336 | 0.611572 | 1.0 | 0 |
| 51 | 1.328151 | 1 | 12 | -1.155792 | 0.712413 | -0.083387 | -0.004731 | 1.0 | 0 |
| 220 | -0.805779 | 11 | 6 | -0.730787 | -1.533411 | -1.681537 | -1.444962 | 1.0 | 0 |
| 669 | 0.266254 | 10 | 4 | 0.39204 | 0.593028 | 0.06753 | 0.889788 | 3.0 | 0 |
| 545 | 1.03437 | 10 | 24 | 0.133693 | -0.04386 | -0.237989 | 0.219529 | 2.0 | 0 |
| 302 | 1.422093 | 22 | 19 | 1.287804 | -0.619569 | 0.47186 | 0.406806 | 1.0 | 0 |
| 577 | 1.604484 | 22 | 4 | 1.257632 | -0.253484 | -0.937001 | 1.073571 | 2.0 | 0 |
| 215 | 0.020958 | 12 | 8 | 0.53591 | -0.367585 | 0.337895 | -0.081424 | 3.0 | 0 |
| 235 | -1.455091 | 7 | 0 | -1.419953 | -0.407226 | -0.449053 | -1.139282 | 1.0 | 0 |
| 18 | 0.611228 | 6 | 9 | 0.92203 | -0.575032 | -0.392589 | 0.393998 | 1.0 | 0 |
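
The transformed columns are centered around 0 because `PowerTransformer` uses `standardize=True` by default: after the power transform it also applies zero-mean, unit-variance scaling. The sketch below reproduces that two-step behavior by hand on synthetic log-normal data; the random data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic, strictly positive, right-skewed feature (illustrative only).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=3.0, sigma=0.8, size=500).reshape(-1, 1)

pt = PowerTransformer(method="box-cox")  # standardize=True is the default
x_t = pt.fit_transform(x)
lam = pt.lambdas_[0]  # power parameter estimated by maximum likelihood

# Step 1: Box-Cox with the fitted lambda.
manual = np.log(x) if lam == 0 else (x ** lam - 1) / lam
# Step 2: standardize to zero mean and unit variance (population std).
manual_std = (manual - manual.mean()) / manual.std()

print(lam)
print(np.allclose(x_t, manual_std, atol=1e-6))
```

Passing `standardize=False` would return the raw power-transformed values instead, which can be preferable when a separate scaler is applied later in the pipeline.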
