# **Feature Engineering Notebook**

## Objectives

*  Evaluate which transformations are beneficial for our dataset

## Inputs

* outputs/datasets/cleaned/TrainSet.csv
* outputs/datasets/cleaned/TestSet.csv

## Outputs

* Generate a list of engineering approaches for each variable


---

In [None]:
# Ignore FutureWarnings
# import warnings
# warnings.filterwarnings("ignore", category=FutureWarning)

# Change working directory

We need to change the working directory from its current folder, where the notebook is stored, to its parent folder
* First we access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

* Then we want to make the parent of the current directory the new current directory
    * os.path.dirname() gets the parent directory
    * os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print(f"You set a new current directory: {current_dir}")

---

Explore the variables and decide which approaches could be useful. Assess this from the profile report, because some variables are numerical and some are categorical. For example, `Default`, although categorical, is already in 0/1 format, so it won't need an additional ordinal encoder.
-> Check what discretization and outlier approaches are and whether they need to be tested here.

Use the feature engineering analysis function to apply a set of feature engineering transformers (numerical, ordinal encoding or outlier handling) to the data and show which transformer leads to improved distributions, i.e. which transformer should be applied in the ML pipeline.

* Checking ordinal encoding results: category names will have been transformed to numbers. The QQ plot and boxplot are not informative in this case (maybe change the function so that they are not shown?). If all transformations look fine -> apply the transformations to the train and test sets.

* Numerical transformation results: the function applies a range of transformers to the data, but only the ones that work. For example, some transformers cannot handle negative values, so they are not applied to such variables. Then evaluate whether any numerical transformation made the distribution look more normal: look for a bell shape in the histogram and points along the diagonal in the QQ plot (the boxplot doesn't help in this case). -> If no transformation shows an improvement, conclude that no transformation should be applied to this variable.


Using the results, revisit your data exploration from the beginning. Do you need to apply all the transformations you thought of, or are some of them unnecessary? For example, if normalization doesn't make the distribution more normal, leave the variable as it is.

In the end, run smart correlated selection to determine which features can be dropped due to correlations with other features.




# Load Cleaned Data

Train Set

In [None]:
import pandas as pd
file_path = "outputs/datasets/cleaned"

TrainSet = pd.read_csv(f"{file_path}/TrainSet.csv")
# Only the last expression of a cell is rendered, so display the preview explicitly
display(TrainSet.head(3))
TrainSet.dtypes

Test Set

In [None]:
TestSet = pd.read_csv(f"{file_path}/TestSet.csv")
TestSet.head(3)

# Data Exploration

To identify potential transformations, we first revisit the Profile Report generated earlier.
This allows us to assess:

* The distributions of numerical variables (to detect skewness or outliers)
* The cardinality and balance of categorical variables
* Possible data scaling needs due to large differences in magnitude

Based on these insights, we will determine which variables may benefit from transformations such as scaling, normalization or encoding before modeling.
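
The visual impression from the report can be backed by a few numbers. The following is a minimal sketch, assuming a hypothetical helper `summarize_numeric` and toy data rather than the project's `TrainSet`. It reports the skewness and the number of values outside the 1.5 * IQR fences for each numerical column.

```python
import numpy as np
import pandas as pd


def summarize_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return skewness, IQR-based outlier counts and range per numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # A value is flagged when it lies more than 1.5 * IQR outside the quartiles
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return pd.DataFrame({
        "skew": numeric.skew(),
        "n_outliers": outside.sum(),
        "min": numeric.min(),
        "max": numeric.max(),
    })


# Hypothetical toy data, not the project's TrainSet
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "uniform_like": rng.uniform(0, 100, size=500),
    "right_skewed": rng.exponential(scale=10, size=500),
})
summary = summarize_numeric(toy)
print(summary)
```

Large differences between the `min` and `max` columns across variables hint at a need for scaling, while a skew far from zero or a non-zero outlier count points to candidates for numerical transformation or outlier handling.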

In [None]:
from ydata_profiling import ProfileReport
    
# Convert object columns to categorical so that it can be displayed properly in the report
TrainSet_cat = TrainSet.copy()
for col in TrainSet_cat.select_dtypes(include='object').columns:
    TrainSet_cat[col] = TrainSet_cat[col].astype('category')
    
pandas_report = ProfileReport(df=TrainSet_cat, minimal=True)
pandas_report.to_notebook_iframe()

The ProfileReport suggests different transformations depending on the variable type and distribution.

Categorical variables should be handled differently based on whether they are nominal or ordinal. The numerical variables are mostly uniformly distributed, so numerical transformations may help improve model performance. Additionally, numerical variables should be scaled to ensure comparable ranges, which benefits many machine learning algorithms. There are no outliers that have to be treated in this dataset. Finally, correlated features should be identified for possible removal to reduce redundancy.

Transformation Steps:
1. Categorical variables
    * Nominal Variables: OneHotEncoder
        * `EmploymentType`, `MaritalStatus`, `LoanPurpose`, `HasMortgage`, `HasDependents`, `HasCoSigner`
    * Ordinal Variables: OrdinalEncoder
        * `Education`
2. Numerical Variables
    * `NumCreditLines` and `LoanTerm`: No transformation as they have to be treated as ordinal categorical variables
    * All other numerical variables: Numerical transformation, since they do not have a normal distribution 
3. All Variables: Smart correlated selection, so any correlated features will be removed
4. Scaling of numerical variables
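
Step 3 can be illustrated with pandas alone. This is a simplified stand-in for smart correlated selection, not feature_engine's implementation: the helper name `correlated_features_to_drop`, the 0.8 threshold and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def correlated_features_to_drop(df, threshold=0.8, method="pearson"):
    """Greedy sketch: from each highly correlated pair, drop the later column."""
    corr = df.corr(method=method).abs()
    # Keep only the upper triangle so every pair is inspected once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Hypothetical toy data: "a_copy" is almost a linear function of "a"
rng = np.random.default_rng(1)
base = rng.normal(size=300)
toy = pd.DataFrame({
    "a": base,
    "a_copy": base * 2 + rng.normal(scale=0.01, size=300),
    "independent": rng.normal(size=300),
})
to_drop = correlated_features_to_drop(toy, threshold=0.8)
print(to_drop)
```

The sketch always keeps the first column of a correlated group. A dedicated selector may instead choose which feature to keep by another criterion, so the result here only indicates which groups of features are redundant.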

---

# Feature Engineering

In [None]:
import scipy.stats as stats
# for vs code
%matplotlib inline 
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import warnings
from feature_engine import transformation as vt
from feature_engine.encoding import OneHotEncoder
from feature_engine.encoding import OrdinalEncoder
sns.set(style="whitegrid")
warnings.filterwarnings('ignore')

## Categorical Variables

### Nominal Variables: OneHotEncoder

* Step 1: Select variables and create a separate DataFrame

In [None]:
variables_engineering= ["EmploymentType", "MaritalStatus", "LoanPurpose", "HasMortgage", "HasDependents", "HasCoSigner"]
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

* Step 2: Create engineered variables by applying the encoder

In [None]:
encoder = OneHotEncoder(variables=variables_engineering)
df_feat_eng = encoder.fit_transform(df_engineering)
df_feat_eng.head(3)

* Step 3: Assess transformation by comparing engineered variables distribution to original ones

In [None]:
for var in variables_engineering:
    # plt, sns and pd are already imported in the Feature Engineering setup cell

    # Count tables
    raw_counts = df_engineering[var].value_counts().reset_index()
    raw_counts.columns = [var, "Count"]

    encoded_counts = (
        df_feat_eng.filter(like=var, axis=1)
        .sum()
        .reset_index()
    )
    encoded_counts.columns = ["Encoded Column", "Count"]

    # Create figure with 2 plots side by side
    fig, axes = plt.subplots(1, 2, figsize=(10, 6))
    fig.suptitle(f"Distribution Comparison for '{var}' (Raw vs Encoded)", fontsize=14, y=1.05)

    # Raw variable countplot
    sns.countplot(data=df_engineering, x=var, color="#432371", ax=axes[0])
    axes[0].set_title("Raw Variable (Countplot)")
    axes[0].set_xlabel("")
    axes[0].tick_params(axis="x", rotation=90)

    # Encoded variable bar plot
    df_feat_eng.filter(like=var, axis=1).sum().plot(
        kind="bar", color="#432371", width=0.8, ax=axes[1]
    )
    axes[1].set_title("Encoded Variables (One-Hot Columns)")
    axes[1].tick_params(axis="x", rotation=90)
    axes[1].set_xlabel("")
    axes[1].set_ylabel("Count")

    plt.tight_layout()
    plt.show()

    # --- Show tables below the plots ---
    print("Count Table (Raw Variable):")
    display(raw_counts)

    print("\nCount Table (Encoded Variables):")
    display(encoded_counts)
    
    print("---")


* For nominal categorical variables, the OneHotEncoder successfully transformed each category into separate binary columns. Therefore, this transformation will be applied in the pipeline
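
Besides comparing plots, the encoding can be checked programmatically. The sketch below uses `pandas.get_dummies` as a stand-in for the encoder and a hypothetical toy frame (the category values are invented). It verifies that every row falls into exactly one dummy column per variable and that the column totals reproduce the raw counts.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror two of the notebook's variables
toy = pd.DataFrame({
    "MaritalStatus": ["Single", "Married", "Divorced", "Married"],
    "HasMortgage": ["Yes", "No", "No", "Yes"],
})

encoded = pd.get_dummies(toy, columns=list(toy.columns), dtype=int)

checks = {}
for var in toy.columns:
    dummy_cols = [c for c in encoded.columns if c.startswith(f"{var}_")]
    # Every row belongs to exactly one category of the original variable
    one_per_row = bool((encoded[dummy_cols].sum(axis=1) == 1).all())
    # Column totals must reproduce the raw category counts
    totals = {c[len(var) + 1:]: int(n) for c, n in encoded[dummy_cols].sum().items()}
    raw = {k: int(n) for k, n in toy[var].value_counts().items()}
    checks[var] = one_per_row and totals == raw

print(checks)
```

Matching on the `"{var}_"` prefix is slightly stricter than a substring match, which matters when one variable name is contained in another.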

### Ordinal Variables: OrdinalEncoder

* Step 1: Select variables and create a separate DataFrame

In [None]:
variables_engineering= ["Education", "Default"]
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

* Step 2: Create engineered variables by applying the encoder

In [None]:
# encoder = OrdinalEncoder(encoding_method='arbitrary', variables=variables_engineering)
# df_feat_eng = encoder.fit_transform(df_engineering)
# df_feat_eng.head(3)
encoder = OrdinalEncoder(encoding_method='ordered', variables="Education")
df_feat_eng = encoder.fit_transform(pd.DataFrame(df_engineering["Education"]), pd.DataFrame(df_engineering["Default"]))
df_feat_eng.head(3)


* Step 3: Assess transformation by comparing engineered variables distribution to original ones

In [None]:
#for var in variables_engineering:
import matplotlib.pyplot as plt
import seaborn as sns

var = variables_engineering[0]
# Count tables
raw_counts = df_engineering[var].value_counts().reset_index()
raw_counts.columns = [var, "Count"]

encoded_counts = df_feat_eng[var].value_counts().reset_index()
encoded_counts.columns = [var, "Count"]

# Create figure with 2 plots side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 6))
fig.suptitle(f"Distribution Comparison for '{var}' (Raw vs Encoded)", fontsize=14, y=1.05)

# Raw variable countplot
sns.countplot(data=df_engineering, x=var, color="#432371", ax=axes[0])
axes[0].set_title("Raw Variable (Countplot)")
axes[0].set_xlabel("")
axes[0].tick_params(axis="x", rotation=90)

# Encoded variable countplot (second axis)
sns.countplot(data=df_feat_eng, x=var, color="#432371", ax=axes[1])
axes[1].set_title("Encoded Variable (Countplot)")
axes[1].set_xlabel("")
axes[1].tick_params(axis="x", rotation=90)

plt.tight_layout()
plt.show()

# Show tables below the plots
print("Count Table (Raw Variable):")
display(raw_counts)

print("\nCount Table (Encoded Variables):")
display(encoded_counts)
print("Class Mapping:")
print(encoder.encoder_dict_)

print("---")

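* The OrdinalEncoder successfully transformed `Education` into numeric form. With `encoding_method='ordered'`, the integer codes follow the mean of the target (`Default`) per category, so the class mapping printed above should be compared with the natural ranking of the education levels. This transformation will be applied in the pipeline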
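
To make the printed class mapping easier to interpret, the following sketch reproduces by hand what an encoding ordered by target mean does. It is an illustration under assumptions, not the encoder's source: the toy data and category labels are hypothetical, and the codes are assigned in ascending order of the mean `Default` value per category.

```python
import pandas as pd

# Hypothetical toy data: default rate falls as the education level rises
toy = pd.DataFrame({
    "Education": ["High School"] * 3 + ["Bachelor's"] * 3
                 + ["Master's"] * 3 + ["PhD"] * 3,
    "Default": [1, 1, 1,  1, 1, 0,  1, 0, 0,  0, 0, 0],
})

# Mean of the target per category, sorted in ascending order
target_mean = toy.groupby("Education")["Default"].mean().sort_values()

# Assign 0, 1, 2, ... following that order
mapping = {category: code for code, category in enumerate(target_mean.index)}
toy["Education_encoded"] = toy["Education"].map(mapping)

print(mapping)
```

In this toy example the codes run opposite to the natural education ranking, because the category with the lowest default rate receives code 0. The codes are still monotonic with respect to the target, but they should not be read as "higher number means higher education".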

## Numerical Variables

### Custom function

It gets a DataFrame as input and applies a defined set of numerical
feature engineering transformers. This will help you to decide which transformers to apply to your data.


In [None]:
# TODO: simplify the function?

import scipy.stats as stats
# for vs code
%matplotlib inline 
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import warnings
from feature_engine import transformation as vt
sns.set(style="whitegrid")
warnings.filterwarnings('ignore')


def FeatureEngineeringAnalysis(df, analysis_type=None):
    """
    - Used for quick feature engineering on numerical variables, to decide
    which transformation best improves the shape of the distribution
    - Once transformed, use a reporting tool, like ydata-profiling, to evaluate distributions
    """
    check_missing_values(df)
    allowed_types = ['numerical']
    check_user_entry_on_analysis_type(analysis_type, allowed_types)
    list_column_transformers = define_list_column_transformers(analysis_type)

    # Loop in each variable and engineer the data according to the analysis type
    df_feat_eng = pd.DataFrame([])
    for column in df.columns:
        # create additional columns (column_method) to apply the methods
        df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
        for method in list_column_transformers:
            df_feat_eng[f"{column}_{method}"] = df[column]

        # Apply transformers in respective column_transformers
        df_feat_eng, list_applied_transformers = apply_transformers(
            analysis_type, df_feat_eng, column)

        # For each variable, assess how the transformations perform
        transformer_evaluation(
            column, list_applied_transformers, analysis_type, df_feat_eng)

    return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
    """ Check analysis type """
    if analysis_type is None:
        raise SystemExit(
            f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
    if analysis_type not in allowed_types:
        raise SystemExit(
            f"analysis_type argument should be one of these options: {allowed_types}")


def check_missing_values(df):
    if df.isna().sum().sum() != 0:
        raise SystemExit(
            f"There is a missing value in your dataset. Please handle that before getting into feature engineering.")


def define_list_column_transformers(analysis_type):
    """ Set suffix columns according to analysis_type"""
    if analysis_type == 'numerical':
        list_column_transformers = [
            "log_e", "log_10", "reciprocal", "power", "box_cox", "yeo_johnson"]

    return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column):

    df_feat_eng, list_applied_transformers = FeatEngineering_Numerical(
        df_feat_eng, column)

    return df_feat_eng, list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
    # For each variable, assess how the transformations perform
    print(f"* Variable Analyzed: {column}")
    print(f"* Applied transformation: {list_applied_transformers} \n")
    for col in [column] + list_applied_transformers:

        DiagnosticPlots_Numerical(df_feat_eng, col)

        print("\n")


def DiagnosticPlots_Numerical(df, variable):
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    # sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0])
    sns.histplot(data=df, x=variable, kde=True, ax=axes[0])
    stats.probplot(df[variable], dist="norm", plot=axes[1])

    axes[0].set_title('Histogram')
    axes[1].set_title('QQ Plot')
    fig.suptitle(f"{variable}", fontsize=30, y=1.05)
    plt.tight_layout()
    plt.show()

def FeatEngineering_Numerical(df_feat_eng, column):
    list_methods_worked = []

    # LogTransformer base e
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_e"])
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_e")
    except Exception:
        df_feat_eng.drop([f"{column}_log_e"], axis=1, inplace=True)

    # LogTransformer base 10
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_10"], base='10')
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_10")
    except Exception:
        df_feat_eng.drop([f"{column}_log_10"], axis=1, inplace=True)

    # ReciprocalTransformer
    try:
        rt = vt.ReciprocalTransformer(variables=[f"{column}_reciprocal"])
        df_feat_eng = rt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_reciprocal")
    except Exception:
        df_feat_eng.drop([f"{column}_reciprocal"], axis=1, inplace=True)

    # PowerTransformer
    try:
        pt = vt.PowerTransformer(variables=[f"{column}_power"])
        df_feat_eng = pt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_power")
    except Exception:
        df_feat_eng.drop([f"{column}_power"], axis=1, inplace=True)

    # BoxCoxTransformer
    try:
        bct = vt.BoxCoxTransformer(variables=[f"{column}_box_cox"])
        df_feat_eng = bct.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_box_cox")
    except Exception:
        df_feat_eng.drop([f"{column}_box_cox"], axis=1, inplace=True)

    # YeoJohnsonTransformer
    try:
        yjt = vt.YeoJohnsonTransformer(variables=[f"{column}_yeo_johnson"])
        df_feat_eng = yjt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_yeo_johnson")
    except Exception:
        df_feat_eng.drop([f"{column}_yeo_johnson"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


### Numerical Transformations

* Step 1: Select variables and create a separate DataFrame

In [None]:
variables_engineering= TrainSet.select_dtypes(include=['int64', 'float64']).columns.drop(["NumCreditLines", "LoanTerm", "Default"]).tolist()
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

* Step 2: Create engineered variables by applying the transformers and assess each transformation by comparing the engineered variables' distributions to the original ones

In [None]:
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='numerical')

* Several numerical transformers were tested, including Log, Log base 10, Reciprocal, Power, Box-Cox and Yeo-Johnson. None of them helped to achieve a bell-shaped distribution or values aligned along the diagonal in the QQ plot. Therefore, no numerical transformation will be applied
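
The visual judgement of the QQ plots can be complemented with a single number per candidate. The sketch below is an assumption-laden illustration on synthetic, uniformly distributed data, not a rerun of the analysis above. It uses the correlation coefficient returned by `scipy.stats.probplot`; the helper name `qq_correlation` is hypothetical.

```python
import numpy as np
from scipy import stats


def qq_correlation(values):
    """Correlation coefficient of the normal QQ plot (1.0 = perfectly straight)."""
    (_, _), (_, _, r) = stats.probplot(values, dist="norm")
    return r


rng = np.random.default_rng(42)
uniform = rng.uniform(1, 100, size=2000)  # hypothetical, strictly positive

candidates = {
    "original": uniform,
    "log_e": np.log(uniform),
    # yeojohnson returns (transformed values, fitted lambda)
    "yeo_johnson": stats.yeojohnson(uniform)[0],
}
scores = {name: qq_correlation(vals) for name, vals in candidates.items()}

# Reference value: a sample that really is normally distributed
normal_score = qq_correlation(rng.normal(size=2000))

for name, score in scores.items():
    print(f"{name}: {score:.4f}")
print(f"normal reference: {normal_score:.4f}")
```

A score close to 1 means the points lie near the diagonal. If none of the transformed candidates scores clearly higher than the original, that supports leaving the variable untransformed.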

---

# Push files to Repo

In [None]:
import joblib
import os

# Set the file_path and a version tag, which will be the folder name.
# It's appropriate since it's a form of version control.
version = 'v1'
file_path = f'.../{version}'
variable_to_save = df # or pipeline
filename = "dataset.csv" # or "pipeline.pkl"

# Try to generate output folder
try:
    os.makedirs(name=file_path)
except Exception as e:
    print(e)

# Save the dataset as csv file for further use
variable_to_save.to_csv(f"{file_path}/{filename}", index=False)

# Save the variable as pkl file for further use
joblib.dump(value=variable_to_save ,
            filename=f"{file_path}/{filename}")


%matplotlib inline 
import matplotlib.pyplot as plt

# Save the figure as png file for further use
plt.savefig(f"{file_path}/{filename}", bbox_inches='tight',dpi=150)


---

# Conclusions and Next Steps

* Fill in conclusions and next steps