# Feature Engineering Notebook

## Objectives

* Perform Correlation and PPS Analysis
* Engineer features for ML models

## Inputs

* outputs/datasets/cleaned/FertilityTreatmentDataCleaned.csv
* outputs/datasets/cleaned/TestSetCleaned.csv
* outputs/datasets/cleaned/TrainSetCleaned.csv

## Outputs

* Generate a list of variables to engineer

## Conclusions

Feature Engineering Transformers
  
  * One Hot Encoding:
    `['Patient age at treatment',
    'Partner/Sperm provider age',
    'Patient/Egg provider age',
    'Total number of previous IVF cycles',
    'Fresh eggs collected',
    'Total eggs mixed',
    'Total embryos created',
    'Embryos transferred',
    'Total embryos thawed',
    'Date of embryo transfer',
    'Specific treatment type',
    'Egg source',
    'Sperm source',
    'Patient ethnicity']`

  * Smart Correlation Selection:
    `['Patient age at treatment_18-34',
 'Total embryos thawed_0 - fresh cycle',
 'Fresh cycle',
 'Frozen cycle',
 'Total embryos created_0 - frozen cycle',
 'Total eggs mixed_0 - frozen cycle',
 'Fresh eggs collected_0 - frozen cycle',
 'Total embryos thawed_1-5',
 'Embryos transferred_1e',
 'Patient/Egg provider age_35-37',
 'Patient age at treatment_38-39',
 'Patient/Egg provider age_40-42',
 'Date of embryo transfer_NT']`



---

# Change working directory

Change the working directory from its current folder to its parent folder
* Access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

To make the parent of the current directory the new current directory:
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("A new current directory has been set")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

# Load cleaned data

In [None]:
import pandas as pd
# Read the cleaned dataset from the CSV file into a DataFrame
df = pd.read_csv('outputs/datasets/cleaned/FertilityTreatmentDataCleaned.csv')
df.head(3)

### Correlation and PPS Analysis

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as pps
import warnings

warnings.filterwarnings("ignore")


def heatmap_corr(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
        mask[abs(df) < threshold] = True

        fig, axes = plt.subplots(figsize=figsize)
        sns.heatmap(
            df,
            annot=True,
            xticklabels=True,
            yticklabels=True,
            mask=mask,
            cmap="viridis",
            annot_kws={"size": font_annot},
            ax=axes,
            linewidth=0.5,
        )
        axes.set_yticklabels(df.columns, rotation=0)
        plt.ylim(len(df.columns), 0)
        plt.show()


def heatmap_pps(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[abs(df) < threshold] = True
        fig, ax = plt.subplots(figsize=figsize)
        ax = sns.heatmap(
            df,
            annot=True,
            xticklabels=True,
            yticklabels=True,
            mask=mask,
            cmap="rocket_r",
            annot_kws={"size": font_annot},
            linewidth=0.05,
            linecolor="grey",
        )
        plt.ylim(len(df.columns), 0)
        plt.show()


def CalculateCorrAndPPS(df):
    df_corr_spearman = df.corr(method="spearman", numeric_only=True)
    df_corr_pearson = df.corr(method="pearson", numeric_only=True)

    pps_matrix_raw = pps.matrix(df)
    pps_matrix = pps_matrix_raw.filter(["x", "y", "ppscore"]).pivot(
        columns="x", index="y", values="ppscore"
    )

    pps_score_stats = (
        pps_matrix_raw.query("ppscore < 1").filter(["ppscore"]).describe().T
    )
    print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
    print(pps_score_stats.round(3))

    return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(
    df_corr_pearson,
    df_corr_spearman,
    pps_matrix,
    CorrThreshold,
    PPS_Threshold,
    figsize=(20, 12),
    font_annot=8,
):

    print("\n")
    print(
        "* Analyse how the target variable for your ML models is correlated with other variables (features and target)"
    )
    print(
        "* Analyse multicollinearity, that is, how the features are correlated among themselves"
    )

    print("\n")
    print("*** Heatmap: Spearman Correlation ***")
    print("It evaluates monotonic relationship \n")
    heatmap_corr(
        df=df_corr_spearman,
        threshold=CorrThreshold,
        figsize=figsize,
        font_annot=font_annot,
    )

    print("\n")
    print("*** Heatmap: Pearson Correlation ***")
    print("It evaluates the linear relationship between two continuous variables \n")
    heatmap_corr(
        df=df_corr_pearson,
        threshold=CorrThreshold,
        figsize=figsize,
        font_annot=font_annot,
    )

    print("\n")
    print("*** Heatmap: Predictive Power Score (PPS) ***")
    print(
        "PPS detects linear or non-linear relationships between two columns.\n"
        "The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n"
    )
    heatmap_pps(
        df=pps_matrix, threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot
    )
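The masking step in `heatmap_corr` above hides the upper triangle (including the diagonal) and any cell whose absolute value is below the threshold. The following is a minimal sketch of that logic on a hypothetical 3x3 correlation matrix; the values and the column names `a`, `b`, `c` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical correlation matrix (illustrative values only)
corr = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, -0.5],
     [0.1, -0.5, 1.0]],
    index=list("abc"),
    columns=list("abc"),
)
threshold = 0.4

# Start with nothing hidden, then hide the upper triangle and the diagonal
mask = np.zeros(corr.shape, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Also hide weak correlations (absolute value below the threshold)
mask[(corr.abs() < threshold).to_numpy()] = True

# Cells that would still be drawn on the heatmap
visible = corr.where(~mask)
print(visible)
```

Only the `(b, a)` and `(c, b)` cells remain visible here, because `(c, a)` falls below the threshold.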

Calculate Correlations and Predictive Power Score

In [None]:
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df)

Display Heatmaps

In [None]:
DisplayCorrAndPPS(df_corr_pearson = df_corr_pearson,
                  df_corr_spearman = df_corr_spearman, 
                  pps_matrix = pps_matrix,
                  CorrThreshold = 0.4, PPS_Threshold =0.2,
                  figsize=(12,10), font_annot=7)
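The Spearman and Pearson heatmaps can disagree because Spearman measures any monotonic relationship while Pearson measures only the linear one. A small sketch on made-up data (not taken from the dataset) illustrates this: a cubic relationship is perfectly monotonic but not linear.

```python
import numpy as np
import pandas as pd

# Toy data: y grows monotonically with x, but not linearly
x = np.arange(1, 21)
toy = pd.DataFrame({"x": x, "y": x ** 3})

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")   # below 1, the relationship is not linear
print(f"Spearman: {spearman:.3f}")  # the ranks of x and y agree exactly
```

When the two coefficients differ noticeably for a pair of variables, this suggests a monotonic but non-linear relationship.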

---

## Load Train and Test Sets

Train Set

In [None]:
import pandas as pd
train_set_path = "outputs/datasets/cleaned/TrainSetCleaned.csv"
TrainSet = pd.read_csv(train_set_path)
TrainSet.head(3)

Test Set

In [None]:
test_set_path = 'outputs/datasets/cleaned/TestSetCleaned.csv'
TestSet = pd.read_csv(test_set_path)
TestSet.head(3)

---

## Data Exploration

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=TrainSet, minimal=True)
pandas_report.to_notebook_iframe()

In [None]:
print("Number of empty entries followed by the unique values and data type at each column:\n")

for column in TrainSet.columns:
    # Check how many empty fields there are in each column
    empty_fields_count = TrainSet[column].isnull().sum()
    # Check unique values there are in each column
    unique_values = TrainSet[column].unique()
    # Check data type of each column
    data_type = TrainSet[column].dtype
    
    print (f"- {column}: {empty_fields_count}, {unique_values}, {data_type}\n")
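The same per-column checks can also be collected into a single table, which may be easier to scan than printed lines. This is a sketch on a hypothetical two-column frame; the helper name `summarise_columns` is an assumption for illustration and is not defined elsewhere in this notebook.

```python
import pandas as pd

# Toy frame with one missing value (illustrative values only)
toy = pd.DataFrame({
    "Egg source": ["Patient", "Donor", None],
    "Embryos transferred": [1, 0, 1],
})

def summarise_columns(df):
    """Return missing-value count, number of unique values and dtype per column."""
    return pd.DataFrame({
        "missing": df.isna().sum(),
        "n_unique": df.nunique(dropna=True),
        "dtype": df.dtypes.astype(str),
    })

summary = summarise_columns(toy)
print(summary)
```

Calling `summarise_columns(TrainSet)` would give the equivalent overview for the train set, assuming it has been loaded as above.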

## Feature Engineering

In [None]:
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import warnings
from feature_engine.encoding import OneHotEncoder

sns.set_theme(style="whitegrid")
warnings.filterwarnings("ignore")

def FeatureEngineeringAnalysis(df, analysis_type=None):
    """
    Quick feature engineering analysis for categorical variables:
    applies the selected transformation and prints the resulting columns
    for each variable so that its effect can be reviewed.
    """
    check_missing_values(df)
    allowed_types = ["one_hot_encoder"]
    check_user_entry_on_analysis_type(analysis_type, allowed_types)

    # For one-hot encoding, no additional transformers are needed.
    df_feat_eng, list_applied_transformers = apply_transformers(analysis_type, df)

    # For each variable, assess how the transformations perform
    for column in df.columns:
        transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng)

    return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
    """Check analysis type"""
    if analysis_type is None or analysis_type not in allowed_types:
        raise SystemExit(f"analysis_type argument should be one of these options: {allowed_types}")


def check_missing_values(df):
    if df.isna().sum().sum() != 0:
        raise SystemExit("There is a missing value in your dataset. Please handle that before feature engineering.")


def apply_transformers(analysis_type, df):
    list_applied_transformers = []

    if analysis_type == "one_hot_encoder":
        df, list_applied_transformers = FeatEngineering_OneHotEncoder(df)

    return df, list_applied_transformers


def FeatEngineering_OneHotEncoder(df):
    list_methods_worked = []
    try:
        # Initialize the OneHotEncoder from feature_engine
        encoder = OneHotEncoder(drop_last=True)

        # Fit and transform the dataframe
        df = encoder.fit_transform(df)

        # Record the names of the columns produced by the encoder
        list_methods_worked.extend(df.columns)

    except Exception as e:
        print(f"Error during one-hot encoding: {e}")

    return df, list_methods_worked


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
    # For one-hot encoding, show only the relevant columns that were created in this transformation
    transformed_columns = [col for col in df_feat_eng.columns if column in col]
    print(f"* Variable Analyzed: {column}")
    print(f"* Applied transformation: {list_applied_transformers} \n")
    print(df_feat_eng[transformed_columns].head().to_string())
    print("\n")


## Transformers

* Categorical Encoding (OneHotEncoder)
* Smart Correlation Selection (SmartCorrelatedSelection)


#### OneHotEncoder for features without inherent order:

Select variables

In [None]:
variables_eng_ohe= [
        'Patient age at treatment',
        'Partner/Sperm provider age',
        'Patient/Egg provider age',
        'Total number of previous IVF cycles',
        'Specific treatment type',
        'Egg source',
        'Sperm source',
        'Patient ethnicity',
        'Fresh eggs collected',
        'Total eggs mixed',
        'Total embryos created',
        'Embryos transferred',
        'Total embryos thawed',
        'Date of embryo transfer',
        ]

variables_eng_ohe

Create a separate DataFrame with the selected variables

In [None]:
df_eng_ohe = TrainSet[variables_eng_ohe].copy()
df_eng_ohe.head(3)

Apply the transformation

In [None]:
df_eng_ohe = FeatureEngineeringAnalysis(df=df_eng_ohe, analysis_type='one_hot_encoder')

Conclusion on the effectiveness of the transformation:
  * The transformation is effective for all variables, since it converted each category into a numeric (binary) column.

Apply the selected transformation to the Train and Test Set

In [None]:
ohe_encoder = OneHotEncoder(top_categories=None, drop_last=True, variables = variables_eng_ohe)
TrainSet = ohe_encoder.fit_transform(TrainSet)
TestSet = ohe_encoder.transform(TestSet)

print("One Hot Encoding - transformation completed")
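To make the fit-on-train, transform-on-test pattern concrete, here is a pandas-only sketch of one-hot encoding with the last category dropped. It is an illustration of the idea, not `feature_engine`'s implementation: the helper names `fit_categories` and `encode`, the toy values, and the choice of "order of first appearance" for the categories are all assumptions.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
train = pd.DataFrame({"Egg source": ["Patient", "Donor", "Patient", "Donor"]})
test = pd.DataFrame({"Egg source": ["Donor", "Patient"]})

def fit_categories(df, column):
    """Learn the categories from the training data and drop the last one."""
    categories = list(pd.unique(df[column]))
    return categories[:-1]

def encode(df, column, kept_categories):
    """Replace the column with one binary column per kept category."""
    out = df.drop(columns=[column]).copy()
    for category in kept_categories:
        out[f"{column}_{category}"] = (df[column] == category).astype(int)
    return out

# Categories are learned on the train set only, then reused for the test set
kept = fit_categories(train, "Egg source")
train_encoded = encode(train, "Egg source", kept)
test_encoded = encode(test, "Egg source", kept)

print(train_encoded)
print(test_encoded)
```

A variable with k categories yields k-1 binary columns; the dropped category is represented by all of them being 0. Reusing the categories learned on the train set keeps the test set's columns identical to the train set's.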

### SmartCorrelatedSelection Variables

* To be applied to all variables

Create a new DataFrame to apply the transformer to

In [None]:
df_eng_smart_corr_sel = TrainSet.copy()
df_eng_smart_corr_sel.head(3)

Apply the transformation

In [None]:
from feature_engine.selection import SmartCorrelatedSelection
corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.9, selection_method="variance")

corr_sel.fit_transform(df_eng_smart_corr_sel)
corr_sel.correlated_feature_sets_

Check which features should be dropped

In [None]:
corr_sel.features_to_drop_
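The idea behind correlation-based selection with `selection_method="variance"` can be sketched in plain pandas: group features whose absolute Spearman correlation exceeds the threshold, keep the one with the highest variance in each group, and drop the rest. This is a simplified illustration under stated assumptions, not the library's actual algorithm; the helper names and the toy columns `a`, `a_scaled` and `b` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: "a_scaled" is a rescaled copy of "a", "b" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 3 * a,
    "b": rng.normal(size=200),
})

def correlated_groups(df, threshold=0.9, method="spearman"):
    """Group features whose absolute correlation exceeds the threshold."""
    corr = df.corr(method=method).abs()
    groups, seen = [], set()
    for col in corr.columns:
        if col in seen:
            continue
        partners = [
            other for other in corr.columns
            if other != col and other not in seen and corr.loc[col, other] > threshold
        ]
        if partners:
            group = [col] + partners
            groups.append(group)
            seen.update(group)
    return groups

def features_to_drop_by_variance(df, groups):
    """Within each group, keep the highest-variance feature and drop the others."""
    drop = []
    for group in groups:
        keep = df[group].var().idxmax()
        drop.extend(col for col in group if col != keep)
    return drop

groups = correlated_groups(toy)
to_drop = features_to_drop_by_variance(toy, groups)
print(groups)
print(to_drop)
```

Here `a` and `a_scaled` form one correlated group, and `a` is dropped because `a_scaled` has the larger variance.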

---

## Conclusion


Transformations needed for feature engineering:


Feature Engineering Transformers

  * One Hot Encoding:

    `['Patient age at treatment',
    'Partner/Sperm provider age',
    'Patient/Egg provider age',
    'Total number of previous IVF cycles',
    'Fresh eggs collected',
    'Total eggs mixed',
    'Total embryos created',
    'Embryos transferred',
    'Total embryos thawed',
    'Date of embryo transfer',
    'Specific treatment type',
    'Egg source',
    'Sperm source',
    'Patient ethnicity']`

  * Smart Correlation Selection:
    `['Patient age at treatment_18-34',
 'Total embryos thawed_0 - fresh cycle',
 'Fresh cycle',
 'Frozen cycle',
 'Total embryos created_0 - frozen cycle',
 'Total eggs mixed_0 - frozen cycle',
 'Fresh eggs collected_0 - frozen cycle',
 'Total embryos thawed_1-5',
 'Embryos transferred_1e',
 'Patient/Egg provider age_35-37',
 'Patient age at treatment_38-39',
 'Patient/Egg provider age_40-42',
 'Date of embryo transfer_NT']`
    
