# **Feature Engineering Notebook**

## Objectives

* Evaluate which transformations are beneficial for our dataset

## Inputs

* outputs/datasets/cleaned/TrainSet.csv
* outputs/datasets/cleaned/TestSet.csv

## Outputs

* Generate a list of engineering approaches for each variable


---

# Imports

In [None]:
import os
import pandas as pd
# Render plots inline (needed in VS Code)
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import SmartCorrelatedSelection
from feature_engine import transformation as vt
import scipy.stats as stats
# Ignore FutureWarnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

---

# Change working directory

We need to change the working directory from its current folder, where the notebook is stored, to its parent folder
* First we access the current directory with os.getcwd()

In [None]:
current_dir = os.getcwd()
current_dir

* Then we want to make the parent of the current directory the new current directory
    * os.path.dirname() gets the parent directory
    * os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print(f"You set a new current directory: {current_dir}")
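Note that re-running the cell above moves the working directory up one more level each time. A minimal sketch of a re-runnable alternative follows, assuming the project root can be recognised by a marker folder. The helper name `find_project_root` and the choice of `outputs` as the marker are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start, marker="outputs"):
    """Return the nearest folder at or above `start` that contains `marker`.

    `marker` is a hypothetical choice: any folder that only exists at the
    project root would work. Falls back to `start` if nothing is found.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    return start


# Safe to re-run: the result is the same from the notebook folder or the root
root = find_project_root(Path.cwd())
print(f"Project root: {root}")
# os.chdir(root) would then set it as the working directory
```

Because the search starts at the current folder, calling it again after the directory has already been changed returns the same path.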

---

# Load Cleaned Data

Train Set

In [None]:
file_path = "outputs/datasets/cleaned"

TrainSet = pd.read_csv(f"{file_path}/TrainSet.csv")
TrainSet.head(3)

Test Set

In [None]:
TestSet = pd.read_csv(f"{file_path}/TestSet.csv")
TestSet.head(3)

# Data Exploration

To identify potential transformations, we first revisit the Profile Report generated earlier.
This allows us to assess:

* The distributions of numerical variables (to detect skewness or outliers)
* The cardinality and balance of categorical variables
* Possible data scaling needs due to large differences in magnitude

Based on these insights, we will determine which variables may benefit from transformations such as scaling, normalization or encoding before modeling.

In [None]:
# Convert object columns to categorical so that they are displayed
# properly in the report
TrainSet_cat = TrainSet.copy()
for col in TrainSet_cat.select_dtypes(include='object').columns:
    TrainSet_cat[col] = TrainSet_cat[col].astype('category')
    
pandas_report = ProfileReport(df=TrainSet_cat, minimal=True)
pandas_report.to_notebook_iframe()
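The three points assessed in the report (skewness, cardinality, magnitude) can also be checked numerically with plain pandas. The sketch below uses a tiny made-up frame; the column names mirror the dataset but the values are invented for illustration only.

```python
import pandas as pd

# Synthetic example data (values are made up, not taken from TrainSet)
df = pd.DataFrame({
    "person_income": [20_000, 35_000, 42_000, 58_000, 61_000, 400_000],
    "loan_int_rate": [7.5, 9.9, 11.0, 12.4, 13.1, 15.2],
    "loan_grade": ["A", "B", "B", "C", "A", "D"],
})

# Skewness and value range of the numerical columns
numeric = df.select_dtypes(include="number")
summary = pd.DataFrame({
    "skew": numeric.skew(),
    "min": numeric.min(),
    "max": numeric.max(),
})

# Number of distinct categories in the non-numerical columns
categorical_cardinality = df.select_dtypes(exclude="number").nunique()

print(summary.round(2))
print(categorical_cardinality)
```

A skew well above 0 points to a right-skewed variable that may benefit from a numerical transformation, and very different `min`/`max` ranges indicate a need for scaling.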

The Profile Report suggests different transformations depending on the variable type and distribution.

Categorical variables should be handled differently based on whether they are nominal or ordinal. The numerical variables are mostly skewed, so numerical transformations may help improve model performance. Additionally, correlated features should be identified for possible removal to reduce redundancy. Finally, variables should be scaled to ensure comparable ranges, which benefits many machine learning algorithms. 

Transformation Steps:
1. Categorical variables
    * Nominal Variables: OrdinalEncoder (arbitrary)
        * `person_home_ownership`, `loan_intent`, `cb_person_default_on_file`
    * Ordinal Variables: OrdinalEncoder (ordered)
        * `loan_grade`
2. Numerical Variables
    * All numerical variables: Numerical transformation, since most of them are not normally distributed
3. All Variables: Smart correlated selection, so that redundant correlated features are removed
4. Scaling variables

---

# Feature Engineering

Set the target variable

In [None]:
target_var = "loan_status"

## Categorical Variables

### Nominal Variables: OrdinalEncoder (arbitrary)

* Step 1: Select variables and create a separate DataFrame

In [None]:
variables_engineering= (TrainSet
                        .select_dtypes(include='object')
                        .columns.drop("loan_grade").tolist())
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

* Step 2: Create engineered variables by applying the encoder

In [None]:
encoder = OrdinalEncoder(
    encoding_method='arbitrary', 
    variables=variables_engineering)
df_feat_eng = encoder.fit_transform(df_engineering)
df_feat_eng.head(3)

* Step 3: Assess the transformation by comparing the distributions of the engineered variables with the original ones

In [None]:
for var in variables_engineering:
    # Show encoding order
    print(f"Encoding Dict: {var}")
    print(encoder.encoder_dict_[var])
    
    fig, axes = plt.subplots(1, 2, figsize=(10, 6))
    fig.suptitle(f"Distribution Comparison for '{var}' (Raw vs Encoded)", 
                 fontsize=14, y=1.05)

    # Raw variable countplot
    sns.countplot(data=df_engineering, x=var, palette="Set2", ax=axes[0])
    axes[0].set_title("Raw Variable")
    axes[0].set_xlabel("")
    axes[0].tick_params(axis="x", rotation=90)

    # Encoded variable countplot
    sns.countplot(data=df_feat_eng, x=var, palette="Set2", ax=axes[1])
    axes[1].set_title("Encoded Variable")
    axes[1].set_xlabel("")
    axes[1].tick_params(axis="x", rotation=90)

    plt.tight_layout()
    plt.show()


* For nominal categorical variables, the OrdinalEncoder successfully transformed each category into a numeric code. Therefore, this transformation will be applied in the pipeline

* Step 4: Apply the selected transformation to the Train and Test sets

In [None]:
encoder = OrdinalEncoder(
    encoding_method='arbitrary', 
    variables=variables_engineering)
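# Fit the encoder on the Train set only, then reuse the learned mapping on the
# Test set so both sets share the same category codes
TrainSet = encoder.fit_transform(TrainSet)
TestSet = encoder.transform(TestSet)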

TrainSet.head(3)
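An encoder should learn its mapping from the Train set and reuse it on the Test set. The sketch below mimics an arbitrary integer encoding in plain pandas (numbering categories in order of first appearance, which is an assumption about how the codes are assigned) to show how refitting on the Test set can give the same category a different code. The data and the helper `fit_arbitrary_mapping` are hypothetical.

```python
import pandas as pd

# Made-up example values
train = pd.Series(["RENT", "OWN", "MORTGAGE", "RENT"], name="person_home_ownership")
test = pd.Series(["OWN", "RENT", "OWN"], name="person_home_ownership")


def fit_arbitrary_mapping(series):
    """Map each category to an integer in order of first appearance."""
    return {category: code for code, category in enumerate(series.unique())}


train_mapping = fit_arbitrary_mapping(train)  # learned once, on the Train set
test_mapping = fit_arbitrary_mapping(test)    # what refitting on Test would learn

test_consistent = test.map(train_mapping)     # same codes as in the Train set
test_refit = test.map(test_mapping)           # codes no longer match the Train set

print(train_mapping)
print(test_consistent.tolist(), test_refit.tolist())
```

With the refitted mapping, `OWN` and `RENT` swap codes between the two sets, so a model trained on the Train codes would misread the Test set.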

### Ordinal Variables: OrdinalEncoder (ordered)

* Step 1: Select variables and create a separate DataFrame

In [None]:
variables_engineering= ["loan_grade", target_var]
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

* Step 2: Create engineered variables by applying the encoder

In [None]:
encoder = OrdinalEncoder(encoding_method='ordered', variables="loan_grade")
df_feat_eng["loan_grade"]  = (
    encoder.fit_transform(df_engineering[["loan_grade"]], 
                          df_engineering[target_var])
    )
df_feat_eng.head(3)


* Step 3: Assess the transformation by comparing the distributions of the engineered variables with the original ones

In [None]:
var = variables_engineering[0]

# Show encoding order
print(f"Encoding Dict: {var}")
print(encoder.encoder_dict_[var])
    
fig, axes = plt.subplots(1, 2, figsize=(10, 6))
fig.suptitle(f"Distribution Comparison for '{var}' (Raw vs Encoded)", 
             fontsize=14, y=1.05)

# Raw variable countplot
sns.countplot(data=df_engineering, x=var, palette="Set2", ax=axes[0], 
              order= df_engineering[var].value_counts().index)
axes[0].set_title("Raw Variable (Countplot)")
axes[0].set_xlabel("")
axes[0].tick_params(axis="x", rotation=90)

# Encoded variable countplot
sns.countplot(data=df_feat_eng, x=var, palette="Set2", ax=axes[1])
axes[1].set_title("Encoded Variable (Countplot)")
axes[1].set_xlabel("")
axes[1].tick_params(axis="x", rotation=90)

plt.tight_layout()
plt.show()

* The OrdinalEncoder transformed `loan_grade` into integer codes ordered by the mean of the target (`loan_status`) per category, so the codes carry a meaningful order rather than an arbitrary one. Therefore, this transformation will be applied in the pipeline

* Step 4: Apply the selected transformation to the Train and Test sets

In [None]:
encoder = OrdinalEncoder(encoding_method='ordered', 
                         variables="loan_grade")
TrainSet["loan_grade"] = encoder.fit_transform(TrainSet[["loan_grade"]], 
                                               TrainSet[target_var])
TestSet["loan_grade"] = encoder.transform(pd.DataFrame(TestSet["loan_grade"]))
TrainSet.head(3)
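The idea behind the ordered encoding used above can be sketched in plain pandas: categories are ranked by the mean of the target and numbered in ascending order. This is an illustration on made-up data, not the library's implementation.

```python
import pandas as pd

# Made-up example values
df = pd.DataFrame({
    "loan_grade": ["C", "A", "B", "A", "C", "B", "A", "C", "A"],
    "loan_status": [1, 0, 0, 0, 1, 1, 0, 0, 1],
})

# Mean of the target per category, sorted from lowest to highest
target_mean = df.groupby("loan_grade")["loan_status"].mean().sort_values()

# Number the categories in that order
mapping = {grade: code for code, grade in enumerate(target_mean.index)}
encoded = df["loan_grade"].map(mapping)

print(target_mean.round(2).to_dict())
print(mapping)
```

Because the mapping depends on the target, it has to be learned on the Train set only and then reused unchanged on the Test set.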

## Numerical Variables

### Custom function

The following function gets a DataFrame as input and applies a defined set of numerical
feature engineering transformations. This will help to decide which transformations to apply to the data.


In [None]:
# Customized from Churnometer Walkthrough Project
def FeatureEngineeringAnalysis(df, analysis_type=None):
    """
    Perform quick feature engineering on numerical variables to evaluate 
    transformations that improve distribution shapes.

    Steps:
    - Checks for missing values and valid analysis type.
    - Applies multiple numerical transformations (log, power, etc.).
    - Visualizes each transformed variable’s distribution and normality.
    
    Parameters:
        df (pd.DataFrame): Input dataset.
        analysis_type (str): Currently supports 'numerical' only.
    
    Returns:
        pd.DataFrame: Dataset with additional transformed features.
    """
    check_missing_values(df)
    allowed_types = ['numerical']
    check_user_entry_on_analysis_type(analysis_type, allowed_types)
    list_column_transformers = define_list_column_transformers(analysis_type)

    # Loop through each variable and apply transformations
    df_feat_eng = pd.DataFrame([])
    for column in df.columns:
        # Create duplicate columns for each transformation method
        df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
        for method in list_column_transformers:
            df_feat_eng[f"{column}_{method}"] = df[column]

        # Apply transformations and evaluate
        df_feat_eng, list_applied_transformers = apply_transformers(
            analysis_type, df_feat_eng, column)

        transformer_evaluation(
            column, list_applied_transformers, analysis_type, df_feat_eng)

    return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
    """Validate user-specified analysis type."""
    if analysis_type is None:
        raise ValueError(
            f"Please pass 'analysis_type' as one of: {allowed_types}"
            )
    if analysis_type not in allowed_types:
        raise ValueError(
            f"Invalid 'analysis_type'. Must be one of: {allowed_types}"
            )


def check_missing_values(df):
    """Ensure no missing values exist before applying transformations."""
    if df.isna().sum().sum() != 0:
        raise ValueError(
            "Missing values detected; handle them before feature engineering."
            )


def define_list_column_transformers(analysis_type):
    """Return list of transformations for the given analysis type."""
    if analysis_type == 'numerical':
        list_column_transformers = [
            "log_e", "log_10", "reciprocal", "power", "box_cox", "yeo_johnson"]
    return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column):
    """Dispatch function to apply transformations by type."""
    df_feat_eng, list_applied_transformers = FeatEngineering_Numerical(
        df_feat_eng, column)
    return df_feat_eng, list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, 
                           analysis_type, df_feat_eng):
    """
    Visualize distributions and normality for original and transformed 
    variables.
    """
    print(f"* Variable Analyzed: {column}")
    print(f"* Applied transformations: {list_applied_transformers}\n")
    for col in [column] + list_applied_transformers:
        DiagnosticPlots_Numerical(df_feat_eng, col)
        print("\n")


def DiagnosticPlots_Numerical(df, variable):
    """Plot histogram and Q–Q plot for a numerical variable."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    sns.histplot(data=df, x=variable, kde=True, ax=axes[0])
    stats.probplot(df[variable], dist="norm", plot=axes[1])
    axes[0].set_title('Histogram')
    axes[1].set_title('Q–Q Plot')
    fig.suptitle(f"{variable}", fontsize=16, y=1.05)
    plt.tight_layout()
    plt.show()


def FeatEngineering_Numerical(df_feat_eng, column):
    """
    Apply a set of numerical transformations and track which succeed.
    
    Returns:
        df_feat_eng (pd.DataFrame): Updated dataframe with successful 
        transformations.
        list_methods_worked (list): List of successfully applied 
        transformations.
    """
    list_methods_worked = []

    # Log (base e)
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_e"])
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_e")
    except Exception:
        df_feat_eng.drop([f"{column}_log_e"], axis=1, inplace=True)

    # Log (base 10)
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_10"], base='10')
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_10")
    except Exception:
        df_feat_eng.drop([f"{column}_log_10"], axis=1, inplace=True)

    # Reciprocal
    try:
        rt = vt.ReciprocalTransformer(variables=[f"{column}_reciprocal"])
        df_feat_eng = rt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_reciprocal")
    except Exception:
        df_feat_eng.drop([f"{column}_reciprocal"], axis=1, inplace=True)

    # Power
    try:
        pt = vt.PowerTransformer(variables=[f"{column}_power"])
        df_feat_eng = pt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_power")
    except Exception:
        df_feat_eng.drop([f"{column}_power"], axis=1, inplace=True)

    # Box-Cox
    try:
        bct = vt.BoxCoxTransformer(variables=[f"{column}_box_cox"])
        df_feat_eng = bct.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_box_cox")
    except Exception:
        df_feat_eng.drop([f"{column}_box_cox"], axis=1, inplace=True)

    # Yeo-Johnson
    try:
        yjt = vt.YeoJohnsonTransformer(variables=[f"{column}_yeo_johnson"])
        df_feat_eng = yjt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_yeo_johnson")
    except Exception:
        df_feat_eng.drop([f"{column}_yeo_johnson"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked
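The helper above wraps every transformer in `try`/`except` because some transformations are only defined for part of the number line. The sketch below states those mathematical preconditions explicitly with NumPy. The function name `applicable_transforms` is hypothetical, and the exact conditions under which each library transformer raises an error may differ from these checks.

```python
import numpy as np


def applicable_transforms(values):
    """List the transformations that are mathematically defined for `values`."""
    x = np.asarray(values, dtype=float)
    names = ["yeo_johnson"]            # defined for all real values
    if (x >= 0).all():
        names.append("power_0.5")      # square root needs non-negative values
    if (x != 0).all():
        names.append("reciprocal")     # 1/x is undefined at zero
    if (x > 0).all():
        names += ["log_e", "log_10", "box_cox"]  # need strictly positive values
    return sorted(names)


print(applicable_transforms([0, 1, 5]))   # a column containing zeros
print(applicable_transforms([1, 2, 10]))  # a strictly positive column
```

A column that contains zeros therefore cannot be log-transformed, which narrows the candidate transformations for such a variable.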

### Numerical Transformations

* Step 1: Select variables and create a separate DataFrame

In [None]:
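# Select the original numerical columns from TrainSet_cat, which still holds
# the raw dtypes; in TrainSet the encoded categorical columns are now integers
variables_engineering = (
    TrainSet_cat
    .select_dtypes(include=['int64', 'float64'])
    .columns.drop(target_var).tolist()
    )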
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

* Step 2: Create engineered variables by applying the transformations and assess them by comparing the distributions of the engineered variables with the original ones

In [None]:
df_engineering = FeatureEngineeringAnalysis(
    df=df_engineering, analysis_type='numerical'
    )

* Several numerical transformations were tested, including Log (base e), Log10, Reciprocal, Power, Box-Cox, and Yeo-Johnson 
* For most variables, transformations brought the distributions closer to normal. The transformations below were chosen because their Q-Q plots follow the diagonal most closely:
    - Log (base e): `person_age`, `person_income`, `cb_person_cred_hist_length`  
    - Power: `person_emp_length`, `loan_amnt`, `loan_percent_income`  
    - None: `loan_int_rate`  
* These transformations will be applied in the pipeline to improve feature distributions for modeling, particularly for algorithms sensitive to variable scaling and distribution shape
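The visual Q-Q comparison can be complemented with a number: the correlation between the sorted values and the theoretical normal quantiles, where values closer to 1 mean points closer to the diagonal. The sketch below is an illustration on synthetic right-skewed data. The function `qq_correlation` is hypothetical and uses simple `(i - 0.5) / n` plotting positions, which may differ slightly from the positions used by `scipy.stats.probplot`.

```python
import numpy as np
from statistics import NormalDist


def qq_correlation(values):
    """Correlation between sorted values and theoretical normal quantiles."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    probs = (np.arange(1, n + 1) - 0.5) / n  # simple plotting positions
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return float(np.corrcoef(theoretical, x)[0, 1])


# Synthetic right-skewed data standing in for an income-like variable
rng = np.random.default_rng(0)
income = rng.lognormal(mean=11, sigma=0.6, size=2000)

raw_r = qq_correlation(income)
sqrt_r = qq_correlation(np.sqrt(income))  # power transform with exponent 0.5
log_r = qq_correlation(np.log(income))    # log (base e) transform

print(f"raw: {raw_r:.3f}, power 0.5: {sqrt_r:.3f}, log: {log_r:.3f}")
```

For this synthetic variable the log transform gives the value closest to 1, so it would be the preferred transformation.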

Apply the selected transformations to the Train and Test sets:

In [None]:
num_pipeline = Pipeline([
    ('log_transform', vt.LogTransformer(
        variables=['person_age', 'person_income', 'cb_person_cred_hist_length'])
     ),
    ('power_transform', vt.PowerTransformer(
        variables=['person_emp_length', 'loan_amnt', 'loan_percent_income'])
     )
])

TrainSet = num_pipeline.fit_transform(TrainSet)
TestSet = num_pipeline.transform(TestSet)
TrainSet.head(3)

Show distributions after numerical transformation

In [None]:
pandas_report = ProfileReport(df=TrainSet[variables_engineering], minimal=True)
pandas_report.to_notebook_iframe()

### SmartCorrelatedSelection Variables

Here we look for groups of features that correlate with each other.
Surplus correlated features should be removed, since they add largely the same information to the model.
The transformer finds these groups and drops features according to the correlation method,
threshold and selection method we chose. For every group of correlated features,
it removes all but one feature.
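The selection logic described above can be sketched in plain pandas. This is a simplified pairwise version on synthetic data, not the transformer's actual grouping algorithm: it computes Spearman correlations (as Pearson correlations of the ranks), and for each pair above the threshold it keeps the feature with the higher variance.

```python
import numpy as np
import pandas as pd

# Synthetic data: credit history length is built to depend on age
rng = np.random.default_rng(42)
n = 500
age = rng.uniform(20, 60, n)
df = pd.DataFrame({
    "person_age": age,
    "cb_person_cred_hist_length": (age - 18) * 0.3 + rng.normal(0, 1, n),
    "loan_int_rate": rng.uniform(5, 20, n),
})

threshold = 0.6

# Spearman correlation equals the Pearson correlation of the ranks
corr = df.rank().corr().abs()

to_drop = set()
columns = list(corr.columns)
for i, first in enumerate(columns):
    for second in columns[i + 1:]:
        if first in to_drop or second in to_drop:
            continue
        if corr.loc[first, second] > threshold:
            # Keep the feature with the higher variance, drop the other one
            loser = first if df[first].var() < df[second].var() else second
            to_drop.add(loser)

print(f"Features to drop: {sorted(to_drop)}")
```

Since variance depends on the unit of each feature, the variance-based choice tends to favour features measured on larger scales.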

* Step 1: Create a separate DataFrame

In [None]:
df_engineering = TrainSet.copy()
df_engineering.head(3)

Confirm that all data types are numerical

In [None]:
df_engineering.dtypes

* Step 2: Create engineered variables applying the transformations

In [None]:
corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", 
                                    threshold=0.6, selection_method="variance")

corr_sel.fit_transform(df_engineering)
corr_sel.correlated_feature_sets_

* After applying `SmartCorrelatedSelection` with a Spearman threshold of 0.6, the following variable pairs were flagged as correlated:
    - `cb_person_cred_hist_length` & `person_age`  
    - `loan_grade` & `loan_int_rate`  
    - `loan_amnt` & `loan_percent_income`  

* From each pair, one variable will be dropped to avoid multicollinearity. These correlations were also visible in the PPS heatmap during the exploratory data analysis  

In [None]:
corr_sel.features_to_drop_

* No other features were flagged for removal, confirming that overall multicollinearity is limited. Therefore, we will include the `SmartCorrelatedSelection` step in our pipeline to automatically handle these high correlations during preprocessing

Apply the SmartCorrelatedSelection to the Train and Test set:

In [None]:
corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", 
                                    threshold=0.6, selection_method="variance")

TrainSet = corr_sel.fit_transform(TrainSet)
TestSet = corr_sel.transform(TestSet)

## Scaling

We apply scaling to the dataset so that all variables are on a comparable range. This is important because many machine learning algorithms are sensitive to the magnitude of features. Without scaling, variables with larger ranges could dominate the learning process, leading to biased or suboptimal models.

Note that scaling standardizes every column, including the encoded categorical variables, so the values lose their original interpretability. In return, all inputs contribute to the model on a comparable scale.

In [None]:
df_engineering = TrainSet.copy()
scaler = StandardScaler()
scaled_array  = scaler.fit_transform(df_engineering)
df_engineering = pd.DataFrame(scaled_array, columns=df_engineering.columns)
df_engineering.head(3)

After applying the StandardScaler, all variables have a mean of 0 and a standard deviation of 1.

In [None]:
df_engineering.describe().round(2).T
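Standardization itself is a short calculation: subtract the Train mean and divide by the Train standard deviation, then apply the same two statistics to the Test set. The NumPy sketch below uses made-up numbers to illustrate it.

```python
import numpy as np

# Made-up example values: two features on very different scales
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 600.0]])
test = np.array([[2.0, 300.0]])

# Statistics are learned from the Train set only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as in StandardScaler

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the Train statistics

print(train_scaled.round(2))
print(test_scaled.round(2))
```

`DataFrame.describe()` reports the sample standard deviation (ddof=1), so the `std` column it shows can be marginally above 1 before rounding.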

Apply StandardScaler to the Train and Test set:

In [None]:
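scaler = StandardScaler()

# StandardScaler returns NumPy arrays, so wrap the results to keep column names
TrainSet = pd.DataFrame(scaler.fit_transform(TrainSet),
                        columns=TrainSet.columns, index=TrainSet.index)
TestSet = pd.DataFrame(scaler.transform(TestSet),
                       columns=TestSet.columns, index=TestSet.index)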

---

# Conclusions and Next Steps

Feature Engineering Steps we will apply:
1. Categorical variables
    * Nominal Variables: OrdinalEncoder (arbitrary)  
        - `person_home_ownership`, `loan_intent`, `cb_person_default_on_file`  
    * Ordinal Variables: OrdinalEncoder (ordered)  
        - `loan_grade`
2. Numerical variables   
    - Log (base e): `person_age`, `person_income`, `cb_person_cred_hist_length`  
    - Power: `person_emp_length`, `loan_amnt`, `loan_percent_income`  
    - None: `loan_int_rate` 
3. `SmartCorrelatedSelection` will remove one variable from each highly correlated pair to reduce multicollinearity 
4. Scaling variables

Next Steps:
* Model loan defaults and evaluate model performance

