# **Feature Engineering Notebook**

## Objectives

* Engineer features for the regression model

## Inputs

* outputs/datasets/cleaned/TrainSetCleaned.csv
* outputs/datasets/cleaned/TestSetCleaned.csv

## Outputs

* Generate a list of feature engineering transformations to be applied to specific variables.

## Conclusions

The following transformations are needed for feature engineering:
* Ordinal Categorical Encoding - ['BsmtExposure', 'BsmtFinType1', 'GarageFinish',  'KitchenQual']
* Log-e Numerical Encoding - ['LotArea', 'LotFrontage']
* Power Numerical Encoding - ['BsmtUnfSF', 'OpenPorchSF']
* Yeo-Johnson Numerical Encoding - 'TotalBsmtSF'
* Smart Correlated Selection features to drop - ['1stFlrSF', 'GrLivArea', 'OverallQual']

---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so the working directory needs to be changed when running a notebook in the editor.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
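
The same parent-folder lookup can also be sketched with `pathlib`. This is a minimal illustration rather than part of the original workflow: `parent_directory` is a hypothetical helper name, and the snippet only reports the folder instead of changing into it.

```python
from pathlib import Path


def parent_directory(path):
    """Return the absolute parent folder of `path` (hypothetical helper)."""
    return Path(path).resolve().parent


notebook_dir = Path.cwd()
project_root = parent_directory(notebook_dir)

print(f"Current directory: {notebook_dir}")
print(f"Parent directory:  {project_root}")
```

Passing `project_root` to `os.chdir()` would have the same effect as the cell above.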

# Load Cleaned Data

Train set

In [None]:
import pandas as pd
train_set_path = "outputs/datasets/cleaned/TrainSetCleaned.csv"
TrainSet = pd.read_csv(train_set_path)
TrainSet.head(3)

Test set

In [None]:
test_set_path = 'outputs/datasets/cleaned/TestSetCleaned.csv'
TestSet = pd.read_csv(test_set_path)
TestSet.head(3)

# Data Exploration

Evaluate potential transformations

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=TrainSet, minimal=True)
pandas_report.to_notebook_iframe()

## Correlation and PPS Analysis

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as pps


def heatmap_corr(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
        mask[abs(df) < threshold] = True

        fig, axes = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                    mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                    linewidth=0.5
                    )
        axes.set_yticklabels(df.columns, rotation=0)
        plt.ylim(len(df.columns), 0)
        plt.show()


def heatmap_pps(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[abs(df) < threshold] = True
        fig, ax = plt.subplots(figsize=figsize)
        ax = sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                         mask=mask, cmap='rocket_r', annot_kws={"size": font_annot},
                         linewidth=0.05, linecolor='grey')
        plt.ylim(len(df.columns), 0)
        plt.show()


def CalculateCorrAndPPS(df):
    df_corr_spearman = df.corr(method="spearman")
    df_corr_pearson = df.corr(method="pearson")

    pps_matrix_raw = pps.matrix(df)
    pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

    pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
    print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
    print(pps_score_stats.round(3))

    return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix, CorrThreshold, PPS_Threshold,
                      figsize=(20, 12), font_annot=8):

    print("\n")
    print("* Analyse how the target variable for your ML models is correlated with other variables (features and target)")
    print("* Analyse multicollinearity, that is, how the features are correlated among themselves")

    print("\n")
    print("*** Heatmap: Spearman Correlation ***")
    print("It evaluates the monotonic relationship between two variables \n")
    heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Pearson Correlation ***")
    print("It evaluates the linear relationship between two continuous variables \n")
    heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Predictive Power Score (PPS) ***")
    print(f"PPS detects linear or non-linear relationships between two columns.\n"
          f"The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
    heatmap_pps(df=pps_matrix, threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)
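
To make the masking logic in `heatmap_corr` easier to follow, here is a small sketch on a toy correlation matrix. The values and the names `corr`, `threshold` and `visible` are illustrative assumptions, not taken from the housing data.

```python
import numpy as np
import pandas as pd

# Toy correlation matrix (illustrative values only)
corr = pd.DataFrame(
    [[1.0, 0.8, 0.2],
     [0.8, 1.0, -0.5],
     [0.2, -0.5, 1.0]],
    index=list("abc"), columns=list("abc"),
)
threshold = 0.4

mask = np.zeros(corr.shape, dtype=bool)
mask[np.triu_indices_from(mask)] = True           # hide the diagonal and upper triangle
mask[(corr.abs() < threshold).to_numpy()] = True  # hide correlations weaker than the threshold

# Cells that would still be drawn by sns.heatmap(..., mask=mask)
visible = corr.where(~mask)
print(visible)
```

Only the lower-triangle cells whose absolute value reaches the threshold remain, which keeps the heatmap readable when there are many variables.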

Calculate correlations and PPScore for train set

In [None]:
ts_corr_pearson, ts_corr_spearman, pps_matrix = CalculateCorrAndPPS(TrainSet)
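
Both Spearman and Pearson are computed because they answer different questions. The sketch below uses made-up data (an exponential curve, not a housing variable) to show a relationship that is perfectly monotonic but not linear.

```python
import numpy as np
import pandas as pd

# Illustrative data: y grows monotonically with x, but not linearly
x = np.linspace(0, 5, 50)
demo = pd.DataFrame({"x": x, "y": np.exp(x)})

spearman = demo.corr(method="spearman").loc["x", "y"]
pearson = demo.corr(method="pearson").loc["x", "y"]

print(f"Spearman: {spearman:.3f}")
print(f"Pearson:  {pearson:.3f}")
```

Spearman works on ranks, so it reaches its maximum for any strictly increasing relationship, while Pearson is pulled below that by the curvature.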

Display heatmaps

In [None]:
DisplayCorrAndPPS(df_corr_pearson = ts_corr_pearson,
                  df_corr_spearman = ts_corr_spearman, 
                  pps_matrix = pps_matrix,
                  CorrThreshold = 0.4, PPS_Threshold =0.3,
                  figsize=(12,10), font_annot=10)

* Compared with the data before cleaning, the cleaned data has a different PPS profile and the heatmaps show some differences, which are explained by the data cleaning process.
* The most significant difference is that dropping 'GarageYrBlt' removed the highest PPS value, which came from the relationship between 'GarageYrBlt' and 'YearBuilt'.
* These differences are not a concern for predicting the target: several other variables remain strongly correlated with it, and there are surplus features that are multicollinear.

## Feature Engineering

* The following custom function is taken from CI's Feature Engine module.
* This code provides a systematic approach to feature engineering analysis, allowing the candidate transformations for each analysis type to be assessed quickly.
* It applies transformation methods intended to improve the distributional characteristics of variables in a DataFrame.

In [None]:
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings('ignore')



def FeatureEngineeringAnalysis(df,analysis_type=None):


  """
  - used for quick feature engineering on numerical and categorical variables
  to decide which transformation can better transform the distribution shape 
  - Once transformed, use a reporting tool, like ydata-profiling, to evaluate distributions

  """
  check_missing_values(df)
  allowed_types= ['numerical', 'ordinal_encoder',  'outlier_winsorizer']
  check_user_entry_on_analysis_type(analysis_type, allowed_types)
  list_column_transformers = define_list_column_transformers(analysis_type)
  
  
  # Loop over each variable and engineer the data according to the analysis type
  df_feat_eng = pd.DataFrame([])
  for column in df.columns:
    # create additional columns (column_method) to apply the methods
    df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
    for method in list_column_transformers:
      df_feat_eng[f"{column}_{method}"] = df[column]
      
    # Apply transformers in respective column_transformers
    df_feat_eng, list_applied_transformers = apply_transformers(analysis_type, df_feat_eng, column)

    # For each variable, assess how the transformations perform
    transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng)

  return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
  ### Check analysis type
  if analysis_type is None:
    raise SystemExit(f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
  if analysis_type not in allowed_types:
    raise SystemExit(f"analysis_type argument should be one of these options: {allowed_types}")

def check_missing_values(df):
  if df.isna().sum().sum() != 0:
    raise SystemExit(
        "There are missing values in your dataset. Please handle them before getting into feature engineering.")



def define_list_column_transformers(analysis_type):
  ### Set column suffixes according to analysis_type
  if analysis_type=='numerical':
    list_column_transformers = ["log_e","log_10","reciprocal", "power","box_cox","yeo_johnson"]
  
  elif analysis_type=='ordinal_encoder':
    list_column_transformers = ["ordinal_encoder"]

  elif analysis_type=='outlier_winsorizer':
    list_column_transformers = ['iqr']

  return list_column_transformers



def apply_transformers(analysis_type, df_feat_eng, column):


  for col in df_feat_eng.select_dtypes(include='category').columns:
    df_feat_eng[col] = df_feat_eng[col].astype('object')


  if analysis_type=='numerical':
    df_feat_eng,list_applied_transformers = FeatEngineering_Numerical(df_feat_eng,column)
  
  elif analysis_type=='outlier_winsorizer':
    df_feat_eng,list_applied_transformers = FeatEngineering_OutlierWinsorizer(df_feat_eng,column)

  elif analysis_type=='ordinal_encoder':
    df_feat_eng,list_applied_transformers = FeatEngineering_CategoricalEncoder(df_feat_eng,column)

  return df_feat_eng,list_applied_transformers



def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
  # For each variable, assess how the transformations perform
  print(f"* Variable Analyzed: {column}")
  print(f"* Applied transformation: {list_applied_transformers} \n")
  for col in [column] + list_applied_transformers:
    
    if analysis_type!='ordinal_encoder':
      DiagnosticPlots_Numerical(df_feat_eng, col)
    
    else:
      if col == column: 
        DiagnosticPlots_Categories(df_feat_eng, col)
      else:
        DiagnosticPlots_Numerical(df_feat_eng, col)

    print("\n")



def DiagnosticPlots_Categories(df_feat_eng, col):
  plt.figure(figsize=(20, 5))
  sns.countplot(data=df_feat_eng, x=col, color='#432371', order=df_feat_eng[col].value_counts().index)
  plt.xticks(rotation=90) 
  plt.suptitle(f"{col}", fontsize=30,y=1.05)        
  plt.show();
  print("\n")



def DiagnosticPlots_Numerical(df, variable):
  fig, axes = plt.subplots(1, 3, figsize=(20, 6))
  sns.histplot(data=df, x=variable, kde=True,element="step",ax=axes[0]) 
  stats.probplot(df[variable], dist="norm", plot=axes[1])
  sns.boxplot(x=df[variable],ax=axes[2])
  
  axes[0].set_title('Histogram')
  axes[1].set_title('QQ Plot')
  axes[2].set_title('Boxplot')
  fig.suptitle(f"{variable}", fontsize=30,y=1.05)
  plt.show();


def FeatEngineering_CategoricalEncoder(df_feat_eng,column):
  list_methods_worked = []
  try:  
    encoder= OrdinalEncoder(encoding_method='arbitrary', variables = [f"{column}_ordinal_encoder"])
    df_feat_eng = encoder.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_ordinal_encoder")
  
  except Exception:
    df_feat_eng.drop([f"{column}_ordinal_encoder"],axis=1,inplace=True)
    
  return df_feat_eng,list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng,column):
  list_methods_worked = []

  ### Winsorizer iqr
  try: 
    disc=Winsorizer(
        capping_method='iqr', tail='both', fold=1.5, variables = [f"{column}_iqr"])
    df_feat_eng = disc.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_iqr")
  except Exception:
    df_feat_eng.drop([f"{column}_iqr"],axis=1,inplace=True)


  return df_feat_eng,list_methods_worked




def FeatEngineering_Numerical(df_feat_eng,column):

  list_methods_worked = []

  # Map each column suffix to the transformer applied to that copy of the variable
  transformers = {
    "log_e": vt.LogTransformer(variables = [f"{column}_log_e"]),
    "log_10": vt.LogTransformer(variables = [f"{column}_log_10"],base='10'),
    "reciprocal": vt.ReciprocalTransformer(variables = [f"{column}_reciprocal"]),
    "power": vt.PowerTransformer(variables = [f"{column}_power"]),
    "box_cox": vt.BoxCoxTransformer(variables = [f"{column}_box_cox"]),
    "yeo_johnson": vt.YeoJohnsonTransformer(variables = [f"{column}_yeo_johnson"]),
  }

  for suffix, transformer in transformers.items():
    try:
      df_feat_eng = transformer.fit_transform(df_feat_eng)
      list_methods_worked.append(f"{column}_{suffix}")
    except Exception:
      # The transformer rejected this variable (for example, log of non-positive values),
      # so drop the placeholder column
      df_feat_eng.drop([f"{column}_{suffix}"],axis=1,inplace=True)

  return df_feat_eng,list_methods_worked
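
The `try`/`except` pattern above exists because not every transformation is defined for every variable. The sketch below is a hypothetical helper (`applicable_transformations` is not part of the notebook or of feature_engine) that checks the mathematical domain of each method, assuming the power transformation is a square root (exponent 0.5).

```python
import pandas as pd


def applicable_transformations(series):
    """Hypothetical helper: list the methods whose mathematical domain covers the data."""
    options = []
    if (series > 0).all():
        options += ["log_e", "log_10", "box_cox"]  # need strictly positive values
    if (series != 0).all():
        options.append("reciprocal")               # 1/x is undefined at zero
    if (series >= 0).all():
        options.append("power")                    # square root needs non-negative values
    options.append("yeo_johnson")                  # defined for all real values
    return options


print(applicable_transformations(pd.Series([1, 2, 3])))
print(applicable_transformations(pd.Series([0, 1, 2])))
```

A variable that contains zeros, for instance, cannot be log-transformed, which is one reason fewer plots appear for some variables in the analysis.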

## Feature Engineering Spreadsheet Summary

The following feature engineering approaches will be tried:
* Categorical Encoding
* Numerical Transformation
* Outliers - Winsorizer Transformation
* Smart Correlation Selection

## Dealing with Feature Engineering 

### Categorical Encoding - Ordinal: replaces categories with ordinal numbers 

* Step 1: Select variable(s)

In [None]:
variables_engineering= ['BsmtExposure', 'BsmtFinType1', 'GarageFinish',  'KitchenQual']

variables_engineering

* Step 2: Create a separate DataFrame, with your variable(s)

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

* Step 3: Create engineered variable(s) by applying the transformation(s), assess the distribution of each engineered variable, and select the most suitable method for each variable.

In [None]:
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='ordinal_encoder')

Categorical Transformation Conclusions:
* For all selected variables, the transformation is effective, since it converted categories to numbers.

* Step 4: Apply the selected transformation to the Train and Test sets

In [None]:
encoder = OrdinalEncoder(encoding_method='arbitrary', variables = variables_engineering)
TrainSet = encoder.fit_transform(TrainSet)
TestSet = encoder.transform(TestSet)

print("* Categorical encoding - ordinal transformation done!")
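
The fit-on-train, transform-on-test pattern can be illustrated with a hand-rolled mapping. This is a simplified approximation on toy data, not feature_engine's implementation; the category values and the names `mapping`, `train_encoded` and `test_encoded` are assumptions for the example.

```python
import pandas as pd

# Toy data (illustrative values only)
train = pd.DataFrame({"KitchenQual": ["TA", "Gd", "TA", "Ex"]})
test = pd.DataFrame({"KitchenQual": ["Gd", "Ex", "Fa"]})

# "Fit": learn one integer per category, in order of appearance in the train set
mapping = {category: code for code, category in enumerate(train["KitchenQual"].unique())}

# "Transform": apply the mapping learned from the train set to both sets
train_encoded = train["KitchenQual"].map(mapping)
test_encoded = test["KitchenQual"].map(mapping)

print(mapping)
print(test_encoded)
```

The category 'Fa' never appears in the toy train set, so it has no code in the mapping. It is therefore worth checking that the test set contains no categories that are absent from the train set.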

### Numerical Encoding - attempt to transform data towards normal distribution.

* Step 1: Select variable(s)

In [None]:
variables_engineering= ['1stFlrSF','2ndFlrSF', 'BedroomAbvGr',
                        'BsmtFinSF1', 'BsmtUnfSF', 'GarageArea',
                        'GrLivArea', 'LotArea', 'LotFrontage',
                        'MasVnrArea', 'OpenPorchSF', 'OverallCond',
                        'OverallQual', 'TotalBsmtSF', 'YearBuilt',
                        'YearRemodAdd'
]
variables_engineering

* Step 2: Create a separate DataFrame, with your variable(s)

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

* Step 3: Create engineered variable(s) by applying the transformation(s), assess the distribution of each engineered variable, and select the most suitable method

In [None]:
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='numerical')

Numerical Transformation Conclusions:

* Transform with log_e transformer
    * The QQ plot was closer to a straight line, and the box plot closer to symmetrical about the median line, when log_e was used to transform the following variables: 
    * ['1stFlrSF', 'GrLivArea', 'LotArea', 'LotFrontage']
* Transform with power transformer
    * Slight improvements in the symmetry of the box plots were observed after transformation for the following variables:
    * ['BsmtUnfSF', 'OpenPorchSF']
* Transform with Yeo-Johnson transformer
    * The box plot was more symmetrical when applied to the following variable:
    * 'TotalBsmtSF'
* No transformation on the remaining variables:
    * '2ndFlrSF'
    * 'BedroomAbvGr'
    * 'BsmtFinSF1'
    * 'GarageArea'
    * 'MasVnrArea'
    * 'OverallCond'
    * 'OverallQual'
    * 'YearBuilt'
    * 'YearRemodAdd'
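
The visual assessment above can be complemented with a numeric check of skewness. The sketch below uses synthetic right-skewed values rather than the housing data, and the square root stands in for the power transformation under the assumption of an exponent of 0.5.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values (not taken from the housing data)
raw = pd.Series(np.exp(np.linspace(0, 5, 200)))

skewness = pd.Series({
    "raw": raw.skew(),
    "log_e": np.log(raw).skew(),
    "power_0.5": np.sqrt(raw).skew(),
})
print(skewness.round(3))
```

A skewness closer to zero indicates a more symmetrical distribution, which is what the QQ plots and box plots are judged on.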

* Step 4: Apply the selected transformation to the Train and Test sets

In [None]:
variables_engineering_log= ['1stFlrSF', 'GrLivArea', 'LotArea', 'LotFrontage']
variables_engineering_power= ['BsmtUnfSF', 'OpenPorchSF']
variables_engineering_yeo_johnson= ['TotalBsmtSF']

encoder_log = vt.LogTransformer(variables = variables_engineering_log, base='e')
TrainSet = encoder_log.fit_transform(TrainSet)
TestSet = encoder_log.transform(TestSet)
print("* Numerical encoding - log_e transformation done!")

encoder_power = vt.PowerTransformer(variables = variables_engineering_power)
TrainSet = encoder_power.fit_transform(TrainSet)
TestSet = encoder_power.transform(TestSet)
print("* Numerical encoding - power transformation done!")

encoder_yeo_johnson = vt.YeoJohnsonTransformer(variables = variables_engineering_yeo_johnson)
TrainSet = encoder_yeo_johnson.fit_transform(TrainSet)
TestSet = encoder_yeo_johnson.transform(TestSet)
print("* Numerical encoding - Yeo Johnson transformation done!")
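
For reference, the Yeo-Johnson transform can be written out directly. Unlike the log transform, it accepts zero and negative values. In this sketch lambda is passed by hand, whereas the transformer estimates it from the training data; `yeo_johnson` and `sample` are illustrative names.

```python
import numpy as np


def yeo_johnson(values, lmbda):
    """Apply the Yeo-Johnson transform for a fixed, user-supplied lambda."""
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    pos = values >= 0

    # Branch for non-negative values
    if lmbda != 0:
        out[pos] = ((values[pos] + 1) ** lmbda - 1) / lmbda
    else:
        out[pos] = np.log1p(values[pos])

    # Branch for negative values
    if lmbda != 2:
        out[~pos] = -(((-values[~pos] + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)
    else:
        out[~pos] = -np.log1p(-values[~pos])
    return out


sample = np.array([-2.0, 0.0, 3.0])
print(yeo_johnson(sample, lmbda=1))    # lambda = 1 leaves the values unchanged
print(yeo_johnson(sample, lmbda=0.5))
```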


### Handle Outliers - assess whether data would benefit from capping outliers

* Step 1: Select variable(s)

In [None]:
variables_engineering = ['1stFlrSF', '2ndFlrSF', 'BsmtFinSF1',
                        'BsmtUnfSF', 'GarageArea', 'GrLivArea',
                        'LotArea', 'LotFrontage', 'MasVnrArea',
                        'OpenPorchSF', 'TotalBsmtSF']
variables_engineering

* Step 2: Create a separate DataFrame, with your variable(s)

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

* Step 3: Create engineered variable(s) by applying the transformation(s)

In [None]:
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='outlier_winsorizer')

Outlier Conclusions:
* Taking into account the business case, and analysing the outlier transformations, it was decided not to perform outlier transformations on any variables.
* This is because the outliers are likely to be real, as opposed to errors in the data, and may indicate changing trends in the data.  
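
For context, the IQR rule assessed above caps values that fall outside Q1 - 1.5 x IQR and Q3 + 1.5 x IQR. The sketch below reproduces that rule with pandas on toy values; `iqr_caps` is a hypothetical helper, not the Winsorizer itself.

```python
import pandas as pd


def iqr_caps(series, fold=1.5):
    """Return the lower and upper caps given by the IQR proximity rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr


# Toy values with one extreme observation (illustrative only)
values = pd.Series([1, 2, 3, 4, 100])
lower, upper = iqr_caps(values)
capped = values.clip(lower=lower, upper=upper)

print(f"Caps: {lower} to {upper}")
print(capped.tolist())
```

Capping replaces the extreme observation with the upper cap instead of removing the row, so genuine high-value records would be flattened, which is the trade-off weighed in the conclusions above.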

### SmartCorrelatedSelection of Variables - keeps one feature from each group of correlated features.

* Step 1: Select variable(s) - not required, as SmartCorrelatedSelection is performed on all variables.

* Step 2: Create a separate DataFrame, with your variable(s)

In [None]:
df_engineering = TrainSet.copy().drop(['SalePrice'],axis=1)
df_engineering.head(3)

* Step 3: Create engineered variable(s) by applying the transformation(s)

In [None]:
from feature_engine.selection import SmartCorrelatedSelection
corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")

corr_sel.fit_transform(df_engineering)
print("* The following groups of correlated variables have been identified:")
corr_sel.correlated_feature_sets_

In [None]:
print("* SmartCorrelatedSelection has identified the following features to drop:")
corr_sel.features_to_drop_
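
The idea behind the selection can be sketched by hand: group features whose absolute Spearman correlation exceeds the threshold, then keep the one with the highest variance in each group. This is a simplified approximation on toy data, not feature_engine's implementation, and `correlated_features_to_drop` is a hypothetical helper.

```python
import pandas as pd


def correlated_features_to_drop(df, threshold=0.6, method="spearman"):
    """Greedy sketch: keep the highest-variance feature of each correlated group."""
    corr = df.corr(method=method).abs()
    variances = df.var()
    assigned, to_drop = set(), []

    for feature in corr.columns:
        if feature in assigned:
            continue
        # Group the feature with every unassigned feature it is strongly correlated with
        group = [feature] + [
            other for other in corr.columns
            if other != feature and other not in assigned
            and corr.loc[feature, other] > threshold
        ]
        assigned.update(group)
        if len(group) > 1:
            keep = variances[group].idxmax()
            to_drop.extend(f for f in group if f != keep)
    return to_drop


# Toy data: 'a' and 'b' move together, 'c' does not (illustrative values only)
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [5, 1, 6, 2, 4, 3],
})
print(correlated_features_to_drop(toy))
```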

# Conclusions

The following transformations are needed for feature engineering:
* Ordinal Categorical Encoding - ['BsmtExposure', 'BsmtFinType1', 'GarageFinish',  'KitchenQual']
* Log-e Numerical Encoding - ['LotArea', 'LotFrontage']
* Power Numerical Encoding - ['BsmtUnfSF', 'OpenPorchSF']
* Yeo-Johnson Numerical Encoding - 'TotalBsmtSF'
* Smart Correlated Selection features to drop - ['1stFlrSF', 'GrLivArea', 'OverallQual']


