# **Feature Engineering**

## Objectives

* Engineer features for classification and cluster models

## Inputs

* outputs/datasets/cleaned/TrainSetCleaned.csv
* outputs/datasets/cleaned/TestSetCleaned.csv

## Outputs

* Generate a list of variables to engineer


---

# Change working directory

We need to change the working directory from the current **jupyter_notebooks** folder to its parent folder in order to access the whole project.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

# Load Cleaned Data

Train Set

In [None]:
import pandas as pd
train_set_path = "outputs/datasets/cleaned/TrainSetCleaned.csv"
TrainSet = pd.read_csv(train_set_path)
TrainSet.head()

Test Set

In [None]:
test_set_path = "outputs/datasets/cleaned/TestSetCleaned.csv"
TestSet = pd.read_csv(test_set_path)
TestSet.head()

---

# Data Exploration

Generate a `ProfileReport` on the train set to investigate which transformations may be applicable

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=TrainSet, minimal=True)
pandas_report.to_notebook_iframe()

# Correlation and PPS Analysis

No expected change compared to data cleaning notebook.

---

# Feature Engineering

## Custom Function

We use an altered form of the custom function from the feature engineering lessons. This project has no numerical features to engineer, so the only feature engineering is done via categorical encoders. Three encoders are compared: `OneHotEncoder`, `OrdinalEncoder`, and `TargetEncoder`.

In [None]:
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import warnings
from feature_engine.encoding import OneHotEncoder
from category_encoders import TargetEncoder
from sklearn.preprocessing import OrdinalEncoder
sns.set(style="whitegrid")
warnings.filterwarnings('ignore')


def FeatureEngineeringAnalysis(df, analysis_type=None):
    """
    Used for quick feature engineering on categorical variables to compare
    how each encoder transforms the distribution shape
    Calls transformer_evaluation on the transformed data to assess the
    distribution of transformed data
    Taken from Walkthrough Project 02 - Churnometer, altered for the feature
    engineering in this project

    Args:
        df: the DataFrame containing the dataset
        analysis_type: a string that declares which encoder should be applied
    
    Returns:
        df_feat_eng: a DataFrame that contains both the original dataset variables
                    and their respective feature engineered variables for comparison
    """
    check_missing_values(df)
    allowed_types = ['target_encoder', 'ordinal_encoder', 'one_hot_encoder']
    check_user_entry_on_analysis_type(analysis_type, allowed_types)
    list_column_transformers = define_list_column_transformers(analysis_type)

    df_feat_eng = pd.DataFrame([])
    for column in df.columns:
        if column == 'edible':
            continue
        else:
            df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
            for method in list_column_transformers:
                df_feat_eng[f"{column}_{method}"] = df[column]

            df_feat_eng, list_applied_transformers = apply_transformers(
                analysis_type, df_feat_eng, column, df['edible'])

            transformer_evaluation(
                column, list_applied_transformers, analysis_type, df_feat_eng)

    return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
    """ 
    Checks analysis type against list of permitted types, raises an error if not allowed
    Taken from Walkthrough Project 02 - Churnometer

    Args:
        analysis_type: a string declaring the desired feature engineering method to use
        allowed_types: a list of strings of the feature engineering types the function 
                        supports
    Returns:
        None
    """
    if analysis_type is None:
        raise SystemExit(
            f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
    if analysis_type not in allowed_types:
        raise SystemExit(
            f"analysis_type argument should be one of these options: {allowed_types}")


def check_missing_values(df):
    """
    Checks the dataset for any null values so they are not passed into the function,
    raises an error if there are any
    Taken from Walkthrough Project 02 - Churnometer

    Args:
        df: the DataFrame containing the dataset

    Returns:
        None
    """
    if df.isna().sum().sum() != 0:
        raise SystemExit(
            "There is a missing value in your dataset. Please handle that before getting into feature engineering.")


def define_list_column_transformers(analysis_type):
    """ 
    Set suffix columns according to analysis_type
    Taken from Walkthrough Project 02 - Churnometer
    
    Args:
        analysis_type: A string declaring which feature engineering 
                    method to use
    
    Returns:
        list_column_transformers: A list declaring which feature 
                                engineering method to use
    """
    if analysis_type == 'target_encoder':
        list_column_transformers = ["target_encoder"]

    elif analysis_type == 'ordinal_encoder':
        list_column_transformers = ["ordinal_encoder"]

    elif analysis_type == 'one_hot_encoder':
        list_column_transformers = ['one_hot_encoder']

    return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column, target):
    """
    Applies the transformers to a variable column by calling the 
    appropriate encoder function
    Taken from Walkthrough Project 02 - Churnometer - altered to support 
    different feature engineering methods

    Args:
        analysis_type: the feature engineering method to apply
        df_feat_eng: the DataFrame containing the dataset variable to be 
                    transformed
        column: the dataset variable to transform
        target: the target variable series from the dataset DataFrame, 
                needed for the target encoder

    Returns:
        df_feat_eng: a DataFrame that has the original column and its
                    feature engineered column(s) appended to it
        list_applied_transformers: a list of the different transformations
                                applied for the column, for use in 
                                transformer_evaluation
    """
    for col in df_feat_eng.select_dtypes(include='category').columns:
        df_feat_eng[col] = df_feat_eng[col].astype('object')

    if analysis_type == 'target_encoder':
        df_feat_eng, list_applied_transformers = FeatEngineering_TargetEncoder(
            df_feat_eng, column, target)

    elif analysis_type == 'ordinal_encoder':
        df_feat_eng, list_applied_transformers = FeatEngineering_OrdinalEncoder(
            df_feat_eng, column)

    elif analysis_type == 'one_hot_encoder':
        df_feat_eng, list_applied_transformers = FeatEngineering_OneHotEncoder(
            df_feat_eng, column)
    
    return df_feat_eng, list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
    """ 
    Assesses how the transformations alter feature distribution for a variable
    Taken from Walkthrough Project 02 - Churnometer - altered to support 
    different feature engineering methods

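    Args:
        column: the variable to assess the effect of the transformation on
        list_applied_transformers: the columns in the DataFrame containing the
                                    transformed variables
        analysis_type: the feature engineering method that was applied
        df_feat_eng: the DataFrame containing the original and transformed
                    variables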
        
    Returns:
        None
    """
    print(f"* Variable Analyzed: {column}")
    print(f"* Applied transformation: {list_applied_transformers} \n")
    for col in [column] + list_applied_transformers:
        if col == column:
            DiagnosticPlots_Categories(df_feat_eng, col)
        else:
            if analysis_type == "one_hot_encoder":
                # Match only the indicator columns generated for this variable
                one_hot_prefix = f"{column}_one_hot_encoder"
                for sub_col in [i for i in df_feat_eng.columns if i.startswith(one_hot_prefix)]:
                    DiagnosticPlots_Numerical(df_feat_eng, sub_col)
            else:
                DiagnosticPlots_Numerical(df_feat_eng, col)

        print("\n")


def DiagnosticPlots_Categories(df_feat_eng, col):
    """
    Provides a diagnostic plot for the distribution of a categorical variable
    Taken from Walkthrough Project 02 - Churnometer 

    Args:
        df_feat_eng: DataFrame containing the variable to plot distribution of
        col: the variable to plot
    
    Returns:
        None
    """
    plt.figure(figsize=(4, 3))
    sns.countplot(data=df_feat_eng, x=col, color='#432371',
                  order=df_feat_eng[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.suptitle(f"{col}", fontsize=30, y=1.05)
    plt.show()
    print("\n")


def DiagnosticPlots_Numerical(df, variable):
    """
    Provides the diagnostic plots for the distribution of an encoded numerical variable
    Taken from Walkthrough Project 02 - Churnometer 

    Args:
        df: DataFrame containing the transformed variable to plot diagnostic plots of
        variable: the variable to plot diagnostic plots of 

    Returns:
        None
    """
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0])
    stats.probplot(df[variable], dist="norm", plot=axes[1])
    sns.boxplot(x=df[variable], ax=axes[2])

    axes[0].set_title('Histogram')
    axes[1].set_title('QQ Plot')
    axes[2].set_title('Boxplot')
    fig.suptitle(f"{variable}", fontsize=30, y=1.05)
    plt.tight_layout()
    plt.show()


def FeatEngineering_TargetEncoder(df_feat_eng, column, target):
    """
    Applies target encoder to dataset column

    Args:
        df_feat_eng: DataFrame containing the variable to apply target encoding to
        column: the variable to perform target encoding on
        target: the ordered data series of target values for encoding

    Returns:
        df_feat_eng: DataFrame with the target encoded variable appended
        list_methods_worked: a list of the encoded variables produced
    """
    list_methods_worked = []
    try:
        encoder = TargetEncoder()
        df_feat_eng[f"{column}_target_encoder"] = encoder.fit_transform(df_feat_eng[f"{column}_target_encoder"].to_frame(), target)
        list_methods_worked.append(f"{column}_target_encoder")
    except Exception:
        df_feat_eng.drop([f"{column}_target_encoder"], axis=1, inplace=True)
    return df_feat_eng, list_methods_worked


def FeatEngineering_OrdinalEncoder(df_feat_eng, column):
    """
    Applies ordinal encoder to dataset column

    Args:
        df_feat_eng: DataFrame containing the variable to apply ordinal encoding to
        column: the variable to perform ordinal encoding on

    Returns:
        df_feat_eng: DataFrame with the ordinal encoded variable appended
        list_methods_worked: a list of the encoded variables produced
    """
    list_methods_worked = []
    try:
        encoder = OrdinalEncoder(categories=[list(df_feat_eng[column].unique())])
        df_feat_eng[f"{column}_ordinal_encoder"] = encoder.fit_transform(df_feat_eng[f"{column}_ordinal_encoder"].to_frame())
        list_methods_worked.append(f"{column}_ordinal_encoder")
    except Exception:
        df_feat_eng.drop([f"{column}_ordinal_encoder"], axis=1, inplace=True)
    return df_feat_eng, list_methods_worked


def FeatEngineering_OneHotEncoder(df_feat_eng, column):
    """
    Applies one hot encoder to dataset column

    Args:
        df_feat_eng: DataFrame containing the variable to apply one hot encoding to
        column: the variable to perform one hot encoding on

    Returns:
        df_feat_eng: DataFrame with the one hot encoded variables appended
        list_methods_worked: a list of the encoded variables produced
    """
    list_methods_worked = []
    try:
        encoder = OneHotEncoder(variables=[f"{column}_one_hot_encoder"], drop_last=False)
        df_feat_eng = encoder.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_one_hot_encoder")
    except Exception:
        df_feat_eng.drop([f"{column}_one_hot_encoder"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked

## Dealing with Feature Engineering

### Categorical Encoding - Ordinal Encoding

Select variables 

In [None]:
variables_engineering = list(TrainSet.columns)
variables_engineering

Create separate dataframe to store variables

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head()

Create engineered variables by applying the transformations, assess variable distribution and select the most suitable method for each variable.

In [None]:
%matplotlib inline
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='ordinal_encoder')

As discussed previously in the Edibility Study notebook, a downside of ordinal encoders is that categories are arbitrarily ordered. For a variable with two possible categories this is unimportant: swapping the numerical order of the categories only flips the sign of the variable's correlation to the target, not its strength, and the sign is physically meaningless for categorical variables. However, for categorical variables with three or more categories (e.g. the `odor` variable, which has 9 distinct categories), the ordering becomes relevant and has an outsized effect on an encoded variable's correlation to the target. This would affect the performance of any model trained on the data in an arbitrary way. Therefore this encoder will not be used.
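The effect of arbitrary ordering can be illustrated with a small sketch. The toy `odor` values, the `target` list, and the `pearson` helper below are hypothetical and written in plain Python for illustration only; they are not taken from the project data. The sketch encodes the same three categories under every possible ordering and compares the resulting correlation with the target.

```python
from itertools import permutations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: three categories and a binary target
odor = ["a", "a", "n", "n", "f", "f"]
target = [1, 1, 1, 0, 0, 0]

# Encode the categories under every possible ordering
correlations = {}
for order in permutations(sorted(set(odor))):
    mapping = {category: code for code, category in enumerate(order)}
    encoded = [mapping[value] for value in odor]
    correlations[order] = pearson(encoded, target)

for order, corr in correlations.items():
    print(order, round(corr, 3))
```

In this toy case the strength of the correlation depends on which ordering happens to be chosen, even though the underlying data never changes.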

### Categorical Encoding - One Hot Encoder

Create separate dataframe to store variables

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head()

Create engineered variables by applying the transformations, assess variable distribution and select the most suitable method for each variable. Warning: this code cell produces a very large output and may take upwards of a minute to run.

In [None]:
%matplotlib inline
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='one_hot_encoder')

As can be seen from the above output, `OneHotEncoder` applied to this dataset generates a very large number of variables, many of them heavily skewed because some categories occur far more rarely or commonly than others. It is unlikely that this encoder will produce well-performing models: many of the variables effectively have a single unique value, and even where distributions are better, the sheer number of variables leads to the ['Curse of Dimensionality'](https://en.wikipedia.org/wiki/Curse_of_dimensionality), where there are so many variables that model performance deteriorates rather than improves. As such, this encoder is not appropriate.
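To make the growth in variables concrete, here is a minimal sketch of one hot encoding in plain Python. The toy rows and the `one_hot` helper are hypothetical and only approximate what `OneHotEncoder` does with `drop_last=False`; the column naming is an assumption of the sketch.

```python
# Hypothetical toy rows standing in for two categorical variables
rows = [
    {"odor": "n", "habitat": "g"},
    {"odor": "n", "habitat": "d"},
    {"odor": "f", "habitat": "d"},
    {"odor": "n", "habitat": "g"},
    {"odor": "a", "habitat": "p"},
    {"odor": "n", "habitat": "d"},
]


def one_hot(rows):
    """Expand every (column, category) pair into its own 0/1 indicator column."""
    indicator_names = sorted(
        {f"{col}_{val}" for row in rows for col, val in row.items()})
    encoded = []
    for row in rows:
        active = {f"{col}_{val}" for col, val in row.items()}
        encoded.append({name: int(name in active) for name in indicator_names})
    return indicator_names, encoded


indicator_names, encoded = one_hot(rows)

# Share of rows in which each indicator equals 1; rare categories give skewed columns
share_of_ones = {
    name: sum(row[name] for row in encoded) / len(encoded)
    for name in indicator_names
}

print(f"{len(rows[0])} original columns -> {len(indicator_names)} indicator columns")
for name, share in share_of_ones.items():
    print(name, round(share, 2))
```

Each original variable contributes one indicator column per distinct category, so the total grows with the sum of the category counts, and an indicator for a rare category is almost entirely zeros.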

### Categorical Encoding - Target Encoder

Create separate dataframe to store variables

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head()

Create engineered variables by applying the transformations, assess variable distribution and select the most suitable method for each variable.

In [None]:
%matplotlib inline
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='target_encoder')

As can be seen in the above output, the encoder successfully transforms the variables into numerical values, which in most cases appear to have good distributions according to their histograms, QQ plots, and boxplots. This encoding method has neither the arbitrary numerical ordering of ordinal encoding nor the Curse of Dimensionality of one hot encoding.
* For all variables, target encoding will be used, with the target being taken from the TrainSet
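The core idea of target encoding can be sketched in plain Python: each category is replaced by the mean of the target for that category, learned on the train set only. The helper names and toy data below are hypothetical. The library's `TargetEncoder` also applies smoothing, which this sketch leaves out, and the fallback to the overall mean for unseen categories is an assumption of the sketch.

```python
from collections import defaultdict


def fit_target_means(categories, target):
    """Learn the mean target value per category, plus the overall mean."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, value in zip(categories, target):
        sums[category] += value
        counts[category] += 1
    category_means = {category: sums[category] / counts[category]
                      for category in sums}
    global_mean = sum(target) / len(target)
    return category_means, global_mean


def transform(categories, category_means, global_mean):
    """Replace each category by its learned mean; unseen categories get the overall mean."""
    return [category_means.get(category, global_mean) for category in categories]


# Hypothetical toy data: the target is only needed when fitting on the train set
train_odor = ["a", "a", "n", "n", "n", "f"]
train_edible = [1, 1, 1, 1, 0, 0]
test_odor = ["n", "f", "x"]  # "x" never appears in the train set

category_means, global_mean = fit_target_means(train_odor, train_edible)
train_encoded = transform(train_odor, category_means, global_mean)
test_encoded = transform(test_odor, category_means, global_mean)

print(category_means)
print(test_encoded)
```

Because the mapping is learned from the train set alone, the test set is encoded without its own target values ever being used.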

Apply the selected transformation to the Train and Test set

In [None]:
# Fit on the train set only, then apply the learned encoding to the test set
encoder = TargetEncoder()
TrainSet = encoder.fit_transform(TrainSet, TrainSet['edible'])
TestSet = encoder.transform(TestSet)

print("* Categorical encoding - target transformation done!")

### SmartCorrelatedSelection Variables

Create separate DataFrame with variables.

In [None]:
df_engineering = TrainSet.copy()
df_engineering.head()

Create engineered variables applying the transformation.

In [None]:
from feature_engine.selection import SmartCorrelatedSelection
corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")

corr_sel.fit_transform(df_engineering)
corr_sel.correlated_feature_sets_

In [None]:
corr_sel.features_to_drop_
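As a rough illustration of the idea behind this step, the sketch below groups features whose absolute Spearman correlation exceeds a threshold and keeps the feature with the highest variance from each group. It is a simplified greedy version in plain Python, not feature_engine's actual implementation; all helper names and the toy data are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def variance(xs):
    """Sample variance of a numeric sequence."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


def correlated_groups(features, threshold):
    """Greedily group features whose |Spearman| with a seed feature exceeds the threshold."""
    groups, assigned = [], set()
    for seed in features:
        if seed in assigned:
            continue
        group = [seed] + [
            other for other in features
            if other != seed and other not in assigned
            and abs(spearman(features[seed], features[other])) > threshold
        ]
        if len(group) > 1:
            groups.append(group)
            assigned.update(group)
    return groups


def features_to_drop(features, groups):
    """Keep the highest-variance feature of each group and drop the rest."""
    drop = []
    for group in groups:
        keep = max(group, key=lambda name: variance(features[name]))
        drop.extend(name for name in group if name != keep)
    return drop


# Hypothetical toy features: "a" and "b" are perfectly rank-correlated, "c" is not
features = {
    "a": [0.1, 0.2, 0.3, 0.4, 0.5],
    "b": [0.2, 0.4, 0.6, 0.8, 1.0],
    "c": [0.5, 0.1, 0.4, 0.2, 0.3],
}
groups = correlated_groups(features, threshold=0.6)
to_drop = features_to_drop(features, groups)
print(groups, to_drop)
```

In this toy case `a` and `b` carry the same rank information, so only the higher-variance one is retained, while `c` is left untouched.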

---

# Conclusion

The following transformations will be added to the ML Pipeline for feature engineering:

* Target categorical encoding: `['cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape',  'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']` with respect to `edible` as the target
* Smart Correlation Selection: `['bruises', 'odor', 'gill-attachment', 'gill-color', 'stalk-surface-below-ring', 'spore-print-color']`