# **Feature Engineering Notebook**

## Objectives

- Engineer features for Classification, Regression and Cluster models

## Inputs

- outputs/datasets/cleaned/TrainSetCleaned.csv

- outputs/datasets/cleaned/TestSetCleaned.csv

## Outputs

- Generate a list with variables to engineer

## Conclusions
- Feature Engineering Transformers
    - Ordinal categorical encoding: ['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']
    - Numerical transformation: ['GrLivArea', 'LotArea', 'LotFrontage', 'GarageArea', 'MasVnrArea', 'OpenPorchSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF']
    - Outlier winsorizer: ['GarageArea', 'LotArea', 'LotFrontage', 'MasVnrArea', 'OpenPorchSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF']


---

## Change working directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

Change the working directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

## Load Cleaned Dataset

Train Set

In [None]:
import pandas as pd
train_set_path = "outputs/datasets/cleaned/TrainSetCleaned.csv"
TrainSet = pd.read_csv(train_set_path)
TrainSet.head(5)

Check there are no missing values

In [None]:
vars_with_missing_data = TrainSet.columns[TrainSet.isna().sum() > 0].to_list()
vars_with_missing_data

Test Set

In [None]:
test_set_path = "outputs/datasets/cleaned/TestSetCleaned.csv"
TestSet = pd.read_csv(test_set_path)
TestSet.head(5)

Check there are no missing values

In [None]:
vars_with_missing_data = TestSet.columns[TestSet.isna().sum() > 0].to_list()
vars_with_missing_data

## Data Exploration

In [None]:
from ydata_profiling import ProfileReport
# Profile the train set loaded above (no dataframe named `df` exists at this point)
profile = ProfileReport(TrainSet, minimal=True)
profile.to_notebook_iframe()

## Feature Engineering

Custom functions adapted from the customer churn project.
Used to quickly test transformations on numerical and categorical features.
Results can be reviewed using tools like ydata-profiling.

In [None]:
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import warnings
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder

sns.set_theme(style="whitegrid")
warnings.filterwarnings("ignore")


def FeatureEngineeringAnalysis(df, analysis_type=None):
    """
    - Used for quick feature engineering on numerical and categorical variables
      to decide which transformation can better transform the distribution shape.
    - Once transformed, use a reporting tool, like ydata-profiling, to evaluate distributions.
    """
    check_missing_values(df)
    allowed_types = ["numerical", "ordinal_encoder", "outlier_winsorizer"]
    check_user_entry_on_analysis_type(analysis_type, allowed_types)
    list_column_transformers = define_list_column_transformers(analysis_type)

    df_feat_eng = pd.DataFrame([])
    for column in df.columns:
        df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
        for method in list_column_transformers:
            df_feat_eng[f"{column}_{method}"] = df[column]

        df_feat_eng, list_applied_transformers = apply_transformers(
            analysis_type, df_feat_eng, column
        )

        transformer_evaluation(
            column, list_applied_transformers, analysis_type, df_feat_eng
        )

    return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
    if analysis_type is None:
        raise SystemExit(
            f"You should pass analysis_type parameter as one of the following options: {allowed_types}"
        )
    if analysis_type not in allowed_types:
        raise SystemExit(
            f"analysis_type argument should be one of these options: {allowed_types}"
        )


def check_missing_values(df):
    if df.isna().sum().sum() != 0:
        raise SystemExit(
            "There is a missing value in your dataset. Please handle that before getting into feature engineering."
        )


def define_list_column_transformers(analysis_type):
    if analysis_type == "numerical":
        list_column_transformers = [
            "log_e",
            "log_10",
            "reciprocal",
            "power",
            "box_cox",
            "yeo_johnson",
        ]
    elif analysis_type == "ordinal_encoder":
        list_column_transformers = ["ordinal_encoder"]
    elif analysis_type == "outlier_winsorizer":
        list_column_transformers = ["iqr"]
    return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column):
    for col in df_feat_eng.select_dtypes(include="category").columns:
        df_feat_eng[col] = df_feat_eng[col].astype("object")

    if analysis_type == "numerical":
        df_feat_eng, list_applied_transformers = FeatEngineering_Numerical(
            df_feat_eng, column
        )
    elif analysis_type == "outlier_winsorizer":
        df_feat_eng, list_applied_transformers = FeatEngineering_OutlierWinsorizer(
            df_feat_eng, column
        )
    elif analysis_type == "ordinal_encoder":
        df_feat_eng, list_applied_transformers = FeatEngineering_CategoricalEncoder(
            df_feat_eng, column
        )

    return df_feat_eng, list_applied_transformers


def transformer_evaluation(
    column, list_applied_transformers, analysis_type, df_feat_eng
):
    print(f"* Variable Analyzed: {column}")
    print(f"* Applied transformation: {list_applied_transformers} \n")
    for col in [column] + list_applied_transformers:
        if analysis_type != "ordinal_encoder":
            DiagnosticPlots_Numerical(df_feat_eng, col)
        else:
            if col == column:
                DiagnosticPlots_Categories(df_feat_eng, col)
            else:
                DiagnosticPlots_Numerical(df_feat_eng, col)
        print("\n")


def DiagnosticPlots_Categories(df_feat_eng, col):
    plt.figure(figsize=(4, 3))
    sns.countplot(
        data=df_feat_eng,
        x=col,
        color="#423371",  # single colour; `palette` without `hue` is deprecated in recent seaborn
        order=df_feat_eng[col].value_counts().index,
    )
    plt.xticks(rotation=90)
    plt.suptitle(f"{col}", fontsize=30, y=1.05)
    plt.show()
    print("\n")


def DiagnosticPlots_Numerical(df, variable):
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0])
    stats.probplot(df[variable], dist="norm", plot=axes[1])
    sns.boxplot(x=df[variable], ax=axes[2])
    axes[0].set_title("Histogram")
    axes[1].set_title("QQ Plot")
    axes[2].set_title("Boxplot")
    fig.suptitle(f"{variable}", fontsize=30, y=1.05)
    plt.tight_layout()
    plt.show()


def FeatEngineering_CategoricalEncoder(df_feat_eng, column):
    list_methods_worked = []
    try:
        encoder = OrdinalEncoder(
            encoding_method="arbitrary", variables=[f"{column}_ordinal_encoder"]
        )
        df_feat_eng = encoder.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_ordinal_encoder")
    except Exception:
        df_feat_eng.drop([f"{column}_ordinal_encoder"], axis=1, inplace=True)
    return df_feat_eng, list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng, column):
    list_methods_worked = []
    try:
        disc = Winsorizer(
            capping_method="iqr", tail="both", fold=1.5, variables=[f"{column}_iqr"]
        )
        df_feat_eng = disc.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_iqr")
    except Exception:
        df_feat_eng.drop([f"{column}_iqr"], axis=1, inplace=True)
    return df_feat_eng, list_methods_worked


def FeatEngineering_Numerical(df_feat_eng, column):
    list_methods_worked = []
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_e"])
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_e")
    except Exception:
        df_feat_eng.drop([f"{column}_log_e"], axis=1, inplace=True)

    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_10"], base="10")
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_10")
    except Exception:
        df_feat_eng.drop([f"{column}_log_10"], axis=1, inplace=True)

    try:
        rt = vt.ReciprocalTransformer(variables=[f"{column}_reciprocal"])
        df_feat_eng = rt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_reciprocal")
    except Exception:
        df_feat_eng.drop([f"{column}_reciprocal"], axis=1, inplace=True)

    try:
        pt = vt.PowerTransformer(variables=[f"{column}_power"])
        df_feat_eng = pt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_power")
    except Exception:
        df_feat_eng.drop([f"{column}_power"], axis=1, inplace=True)

    try:
        bct = vt.BoxCoxTransformer(variables=[f"{column}_box_cox"])
        df_feat_eng = bct.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_box_cox")
    except Exception:
        df_feat_eng.drop([f"{column}_box_cox"], axis=1, inplace=True)

    try:
        yjt = vt.YeoJohnsonTransformer(variables=[f"{column}_yeo_johnson"])
        df_feat_eng = yjt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_yeo_johnson")
    except Exception:
        df_feat_eng.drop([f"{column}_yeo_johnson"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked

### Feature Engineering Summary

- The transformers to be taken into account are:
    - Categorical Encoding
    - Numerical Transformation
    - Outlier winsorizer
    - Smart Correlation Selection

## Dealing with Feature Engineering

### Ordinal Categorical Encoding 

- Step 1: Select variables

In [None]:
variables_engineering = ["BsmtExposure", "BsmtFinType1", "GarageFinish", "KitchenQual"]
variables_engineering

- Step 2: Create a separate dataframe with the selected variables

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head()

- Step 3: Apply transformations and check distributions to choose the best method for each variable.

In [None]:
df_engineering = FeatureEngineeringAnalysis(
    df=df_engineering, analysis_type="ordinal_encoder"
)

The ordinal encoding has successfully transformed each category into a corresponding numeric value.

- Step 4: Apply the selected transformation to the train and test sets

In [None]:
encoder = OrdinalEncoder(encoding_method="arbitrary", variables=variables_engineering)
TrainSet = encoder.fit_transform(TrainSet)
TestSet = encoder.transform(TestSet)

print("* Categorical encoding - ordinal transformation done!")
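To make the "arbitrary" encoding concrete, the sketch below reproduces the idea with pandas only: each category is numbered in the order it first appears in the train set, and the same mapping is then reused on the test set. The toy data and the helper name `fit_arbitrary_mapping` are illustrative assumptions, not part of this notebook or of feature_engine, whose exact ordering and handling of unseen categories may differ by version.

```python
import pandas as pd

# Toy data (assumption): category labels are illustrative only.
train = pd.DataFrame({"KitchenQual": ["TA", "Gd", "TA", "Ex", "Gd"]})
test = pd.DataFrame({"KitchenQual": ["Gd", "Fa", "TA"]})  # "Fa" never appears in train


def fit_arbitrary_mapping(series):
    """Hypothetical helper: number categories by order of first appearance."""
    return {category: code for code, category in enumerate(series.unique())}


mapping = fit_arbitrary_mapping(train["KitchenQual"])

# Reuse the train mapping on both sets; unseen test categories become NaN.
train_encoded = train["KitchenQual"].map(mapping)
test_encoded = test["KitchenQual"].map(mapping)

print(mapping)
print(test_encoded.isna().sum(), "unseen category in the test set")
```

A check such as `test_encoded.isna().sum()` after encoding is a cheap way to spot categories that occur only in the test set.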

### Numerical Transformation

- Step 1: Select variables

In [None]:
variables_engineering = [
    "GrLivArea",
    "LotArea",
    "LotFrontage",
    "GarageArea",
    "MasVnrArea",
    "OpenPorchSF",
    "TotalBsmtSF",
    "BsmtUnfSF",
    "BsmtFinSF1",
    "1stFlrSF",
    "2ndFlrSF",
]
variables_engineering

- Step 2: Create a separate dataframe with the selected variables

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head()

- Step 3: Apply transformations and check distributions to choose the best method for each variable.

In [None]:
df_engineering = FeatureEngineeringAnalysis(
    df=df_engineering, analysis_type="numerical"
)

Based on the results of the numerical transformations, we made the following decisions:

- For variables like GrLivArea, LotArea, and LotFrontage, which have large values and a wide range, the log_e transformation compresses the long right tail and narrows the gap between the small and the large values, making the distribution easier to work with.

- For TotalBsmtSF, the power, box-cox, and yeo-johnson transformations all gave similar results. We chose the power transformation because it’s simple and effective.

- For variables such as GarageArea, MasVnrArea, and OpenPorchSF, which have many zero values, none of the transformations made a big difference. However, the power transformation did the best job at smoothing out the distribution, so we decided to use it.
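The visual comparison above can be backed by a number. The sketch below computes the skewness before and after a log and a square-root transformation on synthetic data; the two series are stand-ins, not the real housing columns, and the exponent 0.5 is assumed to match the default used by `PowerTransformer`. The log is left out for the zero-inflated series because it is undefined at 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): not the real housing columns.
positive_skewed = pd.Series(rng.lognormal(mean=9, sigma=0.5, size=1000))
with_zeros = pd.Series(
    np.where(rng.random(1000) < 0.4, 0.0, rng.lognormal(5, 0.5, 1000))
)

skew_report = pd.DataFrame(
    {
        "raw": [positive_skewed.skew(), with_zeros.skew()],
        # The log is undefined at 0, so it is skipped for the zero-inflated series.
        "log_e": [np.log(positive_skewed).skew(), np.nan],
        "power_0.5": [np.sqrt(positive_skewed).skew(), np.sqrt(with_zeros).skew()],
    },
    index=["positive_skewed", "with_zeros"],
)
print(skew_report.round(2))
```

Skewness closer to 0 indicates a more symmetric distribution. The figures printed here describe the synthetic data only, so the same table would need to be rebuilt from `TrainSet` to support the choices above.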

In [None]:
lt = vt.LogTransformer(variables=["GrLivArea", "LotArea", "LotFrontage"])

pt = vt.PowerTransformer(
    variables=[
        "GarageArea",
        "MasVnrArea",
        "OpenPorchSF",
        "TotalBsmtSF",
        "1stFlrSF",
        "2ndFlrSF",
    ]
)

transformers = [lt, pt]
for t in transformers:
    # Fit on the train set only, then reuse the fitted transformer on the test set
    TrainSet = t.fit_transform(TrainSet)
    TestSet = t.transform(TestSet)

print("* Numerical transformation done!")

### Winsorizer

- We use the winsorizer (not the trimmer) so that we can keep all the data points, including outliers, but reduce their impact on the model's predictions.
- We then select variables that are likely to contain outliers.
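As a rough illustration of IQR capping, the sketch below learns the bounds from a toy train series and clips both series to them, so no rows are removed. The data is invented, and the bounds follow the usual `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR` rule; the `Winsorizer` may compute its quantiles slightly differently.

```python
import pandas as pd

# Toy data (assumption): one obvious outlier in each set.
train = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
test = pd.Series([0, 5, 50], dtype=float)

# Learn the caps from the train set only.
q1, q3 = train.quantile(0.25), train.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap (do not drop) values that fall outside the learned bounds.
train_capped = train.clip(lower=lower, upper=upper)
test_capped = test.clip(lower=lower, upper=upper)

print(f"bounds: [{lower}, {upper}]")
print(train_capped.tolist())
print(test_capped.tolist())
```

A trimmer would instead delete the rows holding 100 and 50. Capping keeps the row count unchanged, which matters when the other columns of those rows are still informative.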

In [None]:
variables_engineering = [
    "GarageArea",
    "LotArea",
    "LotFrontage",
    "MasVnrArea",
    "OpenPorchSF",
    "TotalBsmtSF",
    "1stFlrSF",
    "2ndFlrSF",
]

- As above, we create a separate dataframe with the selected variables

In [None]:
df_engineering = TrainSet[variables_engineering].copy()

- Apply the transformation and assess the distributions

In [None]:
df_engineering = FeatureEngineeringAnalysis(
    df=df_engineering, analysis_type="outlier_winsorizer"
)

- We apply the Winsorizer to the train and test datasets.

In [None]:
winsoriser = Winsorizer(
    capping_method="iqr", tail="both", fold=1.5, variables=variables_engineering
)
# Learn the caps from the train set only, then apply the same caps to the test set
TrainSet = winsoriser.fit_transform(TrainSet)
TestSet = winsoriser.transform(TestSet)

print("* Outlier winsoriser transformation done!")

## Check for highly correlated features

- After applying categorical encoding, numerical transformations, and outlier handling, we now check for multicollinearity.

SmartCorrelatedSelection Variables

In [None]:
from feature_engine.selection import SmartCorrelatedSelection
# Set up the correlation filter
corr_sel = SmartCorrelatedSelection(
    variables=None, method="spearman", threshold=0.8, selection_method="variance"
)
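# Apply to the transformed training data (not the leftover winsorizer analysis frame,
# which still holds each variable alongside its "_iqr" copy)
df_engineering = TrainSet.copy()
corr_sel.fit_transform(df_engineering)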
corr_sel.correlated_feature_sets_

- View features that will be dropped

In [None]:
corr_sel.features_to_drop_
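The selection logic can be approximated with pandas alone. The sketch below is a simplified pairwise version on synthetic data: it computes absolute Spearman correlations, and for each pair above the threshold it drops the feature with the lower variance. The column names and data are assumptions, and the library groups correlated features into sets rather than walking pairs, so its results can differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Synthetic features (assumption): "a" and "b" are near-duplicates, "c" is independent.
df = pd.DataFrame(
    {
        "a": base,
        "b": 3 * base + rng.normal(scale=0.1, size=200),
        "c": rng.normal(size=200),
    }
)

corr = df.corr(method="spearman").abs()
threshold = 0.8

to_drop = set()
for i, first in enumerate(corr.columns):
    for second in corr.columns[i + 1:]:
        if first in to_drop or second in to_drop:
            continue
        if corr.loc[first, second] > threshold:
            # Keep the higher-variance feature of the pair, drop the other.
            loser = first if df[first].var() < df[second].var() else second
            to_drop.add(loser)

print(sorted(to_drop))
```

Because variance depends on scale, the "variance" rule favours whichever feature has the larger numeric range after the earlier transformations.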

---

## Summary and the next steps

**Summary**

- Feature Engineering Transformers:

    - Ordinal categorical encoding: ['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']

    - Numerical transformation: ['GrLivArea', 'LotArea', 'LotFrontage', 'GarageArea', 'MasVnrArea', 'OpenPorchSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF']

    - Outlier winsorizer: ['GarageArea', 'LotArea', 'LotFrontage', 'MasVnrArea', 'OpenPorchSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF']

**Next Step**:  
Proceed to the Modeling and Evaluation notebook.
