# **04 - FeatureEngineering**

## Objectives

* Feature transformation to enhance the linearity of relationships between predictor variables and the target (dependent variable) in a machine learning model.

## Inputs

* outputs/datasets/cleaned/train_set_cleaned.csv
* outputs/datasets/cleaned/test_set_cleaned.csv

## Outputs

* Identify variables for ordinal categorical encoding.
* Select numerical features requiring transformation for improved model performance.
* Apply Winsorization to cap extreme feature values and mitigate the impact of outliers.
* Detect and address multicollinearity by identifying highly correlated features.



---

# Change working directory

* We assume the notebooks are stored in a subfolder, so the working directory needs to be changed when running a notebook in the editor.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
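
Running the change-directory cell more than once moves one level higher each time. A minimal sketch of a guard against that, assuming the notebooks live in a folder called `jupyter_notebooks` (a hypothetical name; the actual folder name is not stated in this notebook). The helper `resolve_project_root` is also hypothetical:

```python
from pathlib import PurePosixPath


def resolve_project_root(current_dir, notebooks_folder="jupyter_notebooks"):
    """Return the parent folder only when current_dir is the notebooks folder.

    notebooks_folder is a hypothetical name used for illustration.
    """
    path = PurePosixPath(current_dir)
    if path.name == notebooks_folder:
        return str(path.parent)
    # Already at the project root (or somewhere else): leave the path unchanged
    return str(path)


print(resolve_project_root("/workspace/project/jupyter_notebooks"))
print(resolve_project_root("/workspace/project"))
```

The returned path could then be passed to os.chdir(), which makes the cell safe to re-run.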

Imports the Pandas library and reads CSV file train_set_cleaned.csv into DataFrame TrainSet and displays the first 10 rows.

In [None]:
import pandas as pd
TrainSet = pd.read_csv("outputs/datasets/cleaned/train_set_cleaned.csv")
TrainSet.head(10)

Identifies the columns in the TrainSet DataFrame that contain missing (NaN) values and stores the list in the variable vars_with_missing_data. An empty list means the cleaned set has no missing data.

In [None]:
vars_with_missing_data = TrainSet.columns[TrainSet.isna().sum() > 0].to_list()
vars_with_missing_data
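
To illustrate how the expression above works, here is a small sketch on a made-up DataFrame (the values are invented for illustration and are not taken from the dataset). `isna().sum()` counts the NaN values per column, and the boolean mask keeps only the column names with a count above zero:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "LotArea": [8450, 9600, np.nan],
    "KitchenQual": ["Gd", "TA", "Gd"],
    "GarageArea": [548.0, np.nan, np.nan],
})

missing_per_column = toy.isna().sum()          # NaN count for each column
columns_with_missing = toy.columns[missing_per_column > 0].to_list()
print(columns_with_missing)
```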

Reads CSV file test_set_cleaned.csv into DataFrame TestSet and displays the first 10 rows.

In [None]:
TestSet = pd.read_csv("outputs/datasets/cleaned/test_set_cleaned.csv")
TestSet.head(10)

Identifies the columns in the TestSet DataFrame that contain missing (NaN) values and stores the list in the variable vars_with_missing_data, overwriting the list obtained for TrainSet.

In [None]:
vars_with_missing_data = TestSet.columns[TestSet.isna().sum() > 0].to_list()
vars_with_missing_data

Generates an exploratory data analysis (EDA) report for the TrainSet DataFrame using the ydata_profiling library.

In [None]:
from ydata_profiling import ProfileReport
profile = ProfileReport(df=TrainSet, minimal=True)
profile.to_notebook_iframe()

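Defines helper functions for the feature engineering analysis. FeatureEngineeringAnalysis first checks that the dataset has no missing values and that the requested analysis type is valid. For each column it then applies the candidate transformations for that analysis type: log (base e and base 10), reciprocal, power, Box-Cox and Yeo-Johnson for numerical variables, ordinal encoding for categorical variables, or IQR-based Winsorization for outliers. Transformations that fail on a column are dropped, and the remaining ones are evaluated with diagnostic plots (histogram, QQ plot and boxplot, or a count plot for categories).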

In [9]:
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
sns.set_theme(style="whitegrid")
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

def FeatureEngineeringAnalysis(df,analysis_type=None):


  """
  - function to check for feature engineering for numerical and categorical variables
  - when variables are transformed, the distribution can be checked again
  - Change of distribution after transformation verified with further pandas profiling

  """
  check_missing_values(df)
  allowed_types= ['numerical', 'ordinal_encoder',  'outlier_winsorizer']
  check_user_entry_on_analysis_type(analysis_type, allowed_types)
  list_column_transformers = define_list_column_transformers(analysis_type)
  
  
  df_feat_eng = pd.DataFrame([])
  for column in df.columns:
    df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
    for method in list_column_transformers:
      df_feat_eng[f"{column}_{method}"] = df[column]
      
    df_feat_eng,list_applied_transformers = apply_transformers(analysis_type, df_feat_eng, column)

    transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng)

  return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
  # Stop early when analysis_type is missing or is not a supported option
  if analysis_type is None:
    raise SystemExit(f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
  if analysis_type not in allowed_types:
    raise SystemExit(f"analysis_type argument should be one of these options: {allowed_types}")

def check_missing_values(df):
  # The transformers used below expect a dataset without NaN values
  if df.isna().sum().sum() != 0:
    raise SystemExit(
        "There are missing values in your dataset. Please handle them before getting into feature engineering.")



def define_list_column_transformers(analysis_type):
  if analysis_type=='numerical':
    list_column_transformers = ["log_e","log_10","reciprocal", "power","box_cox","yeo_johnson"]
  
  elif analysis_type=='ordinal_encoder':
    list_column_transformers = ["ordinal_encoder"]

  elif analysis_type=='outlier_winsorizer':
    list_column_transformers = ['iqr']

  return list_column_transformers



def apply_transformers(analysis_type, df_feat_eng, column):


  for col in df_feat_eng.select_dtypes(include='category').columns:
    df_feat_eng[col] = df_feat_eng[col].astype('object')


  if analysis_type=='numerical':
    df_feat_eng,list_applied_transformers = FeatEngineering_Numerical(df_feat_eng,column)
  
  elif analysis_type=='outlier_winsorizer':
    df_feat_eng,list_applied_transformers = FeatEngineering_OutlierWinsorizer(df_feat_eng,column)

  elif analysis_type=='ordinal_encoder':
    df_feat_eng,list_applied_transformers = FeatEngineering_CategoricalEncoder(df_feat_eng,column)

  return df_feat_eng,list_applied_transformers



def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
  # For each variable, assess how the transformations perform
  print(f"* Variable Analyzed: {column}")
  print(f"* Applied transformation: {list_applied_transformers} \n")
  for col in [column] + list_applied_transformers:
    
    if analysis_type!='ordinal_encoder':
      DiagnosticPlots_Numerical(df_feat_eng, col)
    
    else:
      if col == column: 
        DiagnosticPlots_Categories(df_feat_eng, col)
      else:
        DiagnosticPlots_Numerical(df_feat_eng, col)

    print("\n")



def DiagnosticPlots_Categories(df_feat_eng, col):
  plt.figure(figsize=(4, 3))
  sns.countplot(data=df_feat_eng, x=col,palette=['#432371'],order = df_feat_eng[col].value_counts().index)
  plt.xticks(rotation=90) 
  plt.suptitle(f"{col}", fontsize=30,y=1.05)        
  plt.show()
  print("\n")



def DiagnosticPlots_Numerical(df, variable):
  fig, axes = plt.subplots(1, 3, figsize=(12, 4))
  sns.histplot(data=df, x=variable, kde=True,element="step",ax=axes[0]) 
  stats.probplot(df[variable], dist="norm", plot=axes[1])
  sns.boxplot(x=df[variable],ax=axes[2])
  
  axes[0].set_title('Histogram')
  axes[1].set_title('QQ Plot')
  axes[2].set_title('Boxplot')
  fig.suptitle(f"{variable}", fontsize=30,y=1.05)
  plt.tight_layout()
  plt.show()


def FeatEngineering_CategoricalEncoder(df_feat_eng,column):
  list_methods_worked = []
  try:  
    encoder= OrdinalEncoder(encoding_method='arbitrary', variables = [f"{column}_ordinal_encoder"])
    df_feat_eng = encoder.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_ordinal_encoder")
  
  except: 
    df_feat_eng.drop([f"{column}_ordinal_encoder"],axis=1,inplace=True)
    
  return df_feat_eng,list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng,column):
  list_methods_worked = []

  try: 
    disc=Winsorizer(
        capping_method='iqr', tail='both', fold=1.5, variables = [f"{column}_iqr"])
    df_feat_eng = disc.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_iqr")
  except: 
    df_feat_eng.drop([f"{column}_iqr"],axis=1,inplace=True)


  return df_feat_eng,list_methods_worked



def FeatEngineering_Numerical(df_feat_eng,column):

  list_methods_worked = []

  try: 
    lt = vt.LogTransformer(variables = [f"{column}_log_e"])
    df_feat_eng = lt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_log_e")
  except: 
    df_feat_eng.drop([f"{column}_log_e"],axis=1,inplace=True)

  try: 
    lt = vt.LogTransformer(variables = [f"{column}_log_10"],base='10')
    df_feat_eng = lt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_log_10")
  except: 
    df_feat_eng.drop([f"{column}_log_10"],axis=1,inplace=True)

  try:
    rt = vt.ReciprocalTransformer(variables = [f"{column}_reciprocal"])
    df_feat_eng =  rt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_reciprocal")
  except:
    df_feat_eng.drop([f"{column}_reciprocal"],axis=1,inplace=True)

  try:
    pt = vt.PowerTransformer(variables = [f"{column}_power"])
    df_feat_eng = pt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_power")
  except:
    df_feat_eng.drop([f"{column}_power"],axis=1,inplace=True)

  try:
    bct = vt.BoxCoxTransformer(variables = [f"{column}_box_cox"])
    df_feat_eng = bct.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_box_cox")
  except:
    df_feat_eng.drop([f"{column}_box_cox"],axis=1,inplace=True)


  try:
    yjt = vt.YeoJohnsonTransformer(variables = [f"{column}_yeo_johnson"])
    df_feat_eng = yjt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_yeo_johnson")
  except:
    df_feat_eng.drop([f"{column}_yeo_johnson"],axis=1,inplace=True)


  return df_feat_eng,list_methods_worked
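
The diagnostic plots are judged by eye. As a numeric complement, the sketch below summarises how close a variable is to normal before and after a transformation, using skewness and the correlation coefficient of the QQ fit returned by `scipy.stats.probplot`. The data is synthetic (log-normal), and the helper `normality_summary` is hypothetical and not part of the functions above:

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed data; not taken from the housing dataset
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=1000)
logged = np.log(raw)


def normality_summary(values):
    """Return (skewness, r), where r is the correlation of the QQ-plot fit."""
    skewness = stats.skew(values)
    # Without a plot argument, probplot returns ((osm, osr), (slope, intercept, r))
    _, (_, _, r) = stats.probplot(values, dist="norm")
    return skewness, r


raw_skew, raw_r = normality_summary(raw)
log_skew, log_r = normality_summary(logged)
print(f"raw:    skew={raw_skew:.2f}, QQ r={raw_r:.3f}")
print(f"logged: skew={log_skew:.2f}, QQ r={log_r:.3f}")
```

A skewness closer to 0 and an r closer to 1 suggest a distribution closer to normal, which matches what a straighter QQ plot shows visually.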

Selects the KitchenQual variable for the ordinal encoding analysis.

In [10]:
variables_engineering= ['KitchenQual']

Creates a copy of the KitchenQual column from TrainSet and stores it in df_engineering, ensuring that modifications do not affect the original dataset.

In [11]:
df_engineering = TrainSet[variables_engineering].copy()

Displays the first five rows of the df_engineering DataFrame.

In [None]:
df_engineering.head()

Applies the FeatureEngineeringAnalysis function to df_engineering with an ordinal_encoder analysis type.

In [None]:
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='ordinal_encoder')

Applies ordinal encoding to KitchenQual using the arbitrary encoding method. The encoder is fitted on TrainSet only, and the learned mapping is then applied to TestSet so that both sets share the same category codes.

In [None]:
encoder = OrdinalEncoder(encoding_method='arbitrary', variables = variables_engineering)
TrainSet = encoder.fit_transform(TrainSet)
TestSet = encoder.transform(TestSet)

print("* Categorical encoding - ordinal transformation done!")
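
The arbitrary method assigns integer codes without reference to the meaning of the categories. If the codes should follow the quality ranking, an explicit mapping is one option. A minimal sketch with pandas, assuming the levels are ordered Po < Fa < TA < Gd < Ex (an assumption; this notebook does not list the levels) and using invented toy rows:

```python
import pandas as pd

# Assumed ranking from worst to best; verify against the dataset's data dictionary
quality_rank = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}

# Toy rows, invented for illustration
train = pd.DataFrame({"KitchenQual": ["Gd", "TA", "Ex", "TA"]})
test = pd.DataFrame({"KitchenQual": ["Fa", "Gd"]})

# The same fixed mapping is applied to both sets, so the codes stay consistent
train["KitchenQual"] = train["KitchenQual"].map(quality_rank)
test["KitchenQual"] = test["KitchenQual"].map(quality_rank)

print(train["KitchenQual"].to_list())
print(test["KitchenQual"].to_list())
```

Any category missing from the mapping becomes NaN after `map`, so a missing-value check after encoding would catch unexpected levels.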

Selects 1stFlrSF, 2ndFlrSF, GrLivArea, LotArea, LotFrontage, GarageArea, MasVnrArea, OpenPorchSF and TotalBsmtSF for the numerical transformation analysis.

In [15]:
variables_engineering = ['1stFlrSF',
                         '2ndFlrSF',
                         'GrLivArea',
                         'LotArea',
                         'LotFrontage',
                         'GarageArea',
                         'MasVnrArea', 
                         'OpenPorchSF',
                         'TotalBsmtSF',
                         ]

Creates a copy of the 1stFlrSF, 2ndFlrSF, GrLivArea, LotArea, LotFrontage, GarageArea, MasVnrArea, OpenPorchSF, TotalBsmtSF columns from TrainSet and stores it in df_engineering, ensuring that modifications do not affect the original dataset.

In [16]:
df_engineering = TrainSet[variables_engineering].copy()

Applies the FeatureEngineeringAnalysis function to df_engineering with a numerical analysis type.

In [None]:
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='numerical')

Log Transformation is applied to the columns GrLivArea, LotArea, and LotFrontage to address skewness in the data by compressing large values. Power Transformation is applied to the columns GarageArea, MasVnrArea, OpenPorchSF, TotalBsmtSF, 1stFlrSF, and 2ndFlrSF to stabilize variance and make the data more normally distributed.

In [None]:
lt = vt.LogTransformer(variables=['GrLivArea', 'LotArea', 'LotFrontage'])
pt = vt.PowerTransformer(variables=['GarageArea', 'MasVnrArea', 'OpenPorchSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF'])

TrainSet = lt.fit_transform(TrainSet)
TestSet = lt.transform(TestSet)

TrainSet = pt.fit_transform(TrainSet)
TestSet = pt.transform(TestSet)

print("* Power Transformation and Log Transformation done!")
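
One practical constraint when choosing between the two transformations is that a logarithm is only defined for strictly positive values, whereas a power transformation with a positive exponent also accepts zeros. The sketch below illustrates that check with NumPy on invented values. The helper `can_use_log` is hypothetical, the exponent 0.5 is chosen for illustration, and this is a possible selection rule rather than the reasoning actually used for the split above:

```python
import numpy as np
import pandas as pd

# Toy values, invented for illustration
toy = pd.DataFrame({
    "GrLivArea": [856.0, 1262.0, 1786.0],  # strictly positive
    "2ndFlrSF": [854.0, 0.0, 866.0],       # contains a zero
})


def can_use_log(series):
    """A log transformation needs every value to be strictly positive."""
    return bool((series > 0).all())


for name in toy.columns:
    if can_use_log(toy[name]):
        toy[name] = np.log(toy[name])
        print(f"{name}: log transformation")
    else:
        # Illustrative exponent; the square root is defined at zero
        toy[name] = np.power(toy[name], 0.5)
        print(f"{name}: power transformation")
```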

Selects 1stFlrSF, 2ndFlrSF, GarageArea, LotArea, LotFrontage, MasVnrArea, OpenPorchSF and TotalBsmtSF for the outlier analysis.

In [19]:
variables_engineering = ['1stFlrSF', '2ndFlrSF', 'GarageArea', 'LotArea', 'LotFrontage', 'MasVnrArea', 'OpenPorchSF', 'TotalBsmtSF']

Creates a copy of the 1stFlrSF, 2ndFlrSF, GarageArea, LotArea, LotFrontage, MasVnrArea, OpenPorchSF, TotalBsmtSF columns from TrainSet and stores it in df_engineering, ensuring that modifications do not affect the original dataset.

In [20]:
df_engineering = TrainSet[variables_engineering].copy()

Applies the FeatureEngineeringAnalysis function to df_engineering with an outlier_winsorizer analysis type.

In [None]:
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='outlier_winsorizer')

Applies Winsorization to handle outliers in the selected variables using the IQR method. Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are capped at those limits. The limits are learned from TrainSet only and then applied to TestSet.

In [None]:
winsoriser = Winsorizer(capping_method='iqr', tail='both', fold=1.5, variables = variables_engineering)
TrainSet = winsoriser.fit_transform(TrainSet)
TestSet = winsoriser.transform(TestSet)

print("* Outlier winsoriser transformation done!")
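
To make the capping rule concrete, the following sketch computes IQR-based limits by hand with pandas and applies the training limits to a test series. The numbers are invented, the helper `iqr_limits` is hypothetical, and the library's internal computation may differ in detail:

```python
import pandas as pd


def iqr_limits(series, fold=1.5):
    """Return (lower, upper) caps: Q1 - fold * IQR and Q3 + fold * IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr


# Toy data, invented for illustration
train = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
test = pd.Series([0, 50, -10], dtype=float)

# Limits come from the training data only and are reused for the test data
lower, upper = iqr_limits(train)
train_capped = train.clip(lower=lower, upper=upper)
test_capped = test.clip(lower=lower, upper=upper)

print(lower, upper)
print(test_capped.to_list())
```

Here Q1 is 3.25 and Q3 is 7.75, so the limits are -3.5 and 14.5, and the training value 100 is capped at 14.5.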

Identifies groups of highly correlated features (Spearman correlation above 0.8) in df_engineering and, within each group, retains the feature with the highest variance. The groups are stored in corr_sel.correlated_feature_sets_.

In [None]:
from feature_engine.selection import SmartCorrelatedSelection
corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.8, selection_method="variance")

corr_sel.fit_transform(df_engineering)
corr_sel.correlated_feature_sets_

Lists the features flagged for removal because of their high correlation with other features.

In [None]:
corr_sel.features_to_drop_
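
The idea behind variance-based correlated selection can be sketched with pandas alone. This is a simplified pairwise version written for illustration, not the library's actual algorithm. The helper `correlated_features_to_drop` and the toy columns are hypothetical:

```python
import pandas as pd


def correlated_features_to_drop(df, threshold=0.8):
    """For each pair correlated above threshold, drop the lower-variance feature."""
    corr = df.corr(method="spearman").abs()
    variances = df.var()
    to_drop = set()
    columns = list(df.columns)
    for i, left in enumerate(columns):
        for right in columns[i + 1:]:
            if left in to_drop or right in to_drop:
                continue  # one side of the pair is already marked for removal
            if corr.loc[left, right] > threshold:
                loser = left if variances[left] < variances[right] else right
                to_drop.add(loser)
    return sorted(to_drop)


# Toy data: b is a scaled copy of a, c is only weakly related to both
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "b": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
    "c": [5, 1, 9, 3, 7, 2, 10, 4, 8, 6],
})
print(correlated_features_to_drop(toy))
```

In this toy case `a` and `b` are perfectly rank-correlated, and `a` is dropped because `b` has the larger variance.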

## Conclusions and Next Steps

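* The categorical variable KitchenQual was converted to integer codes using ordinal encoding. Because the arbitrary encoding method was used, the codes are assigned without regard to the quality ranking and do not necessarily reflect the order of kitchen quality levels.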
* Log and power transformations were applied to numerical features (GrLivArea, LotArea, LotFrontage, GarageArea, MasVnrArea, OpenPorchSF, TotalBsmtSF, 1stFlrSF, 2ndFlrSF) to stabilize variance and make relationships with the target variable more linear.