# Classification

## Objectives

*   Fit and evaluate a deep learning classification model to predict if a treatment will be successful or not.


## Inputs

* outputs/datasets/cleaned/FertilityTreatmentDataCleaned.csv.gz
* Instructions from the notebooks 02 and 04 on which variables to use for data cleaning and feature engineering.

## Outputs

* Train set (features and target)
* Test set (features and target)
* Data cleaning and Feature Engineering pipeline
* Modeling pipeline
* Machine learning model creation and training
* Learning curve plot for model performance
* Model evaluation saved as a pickle file
* Prediction on random data


---

## Change working directory

Change the working directory from its current folder to its parent folder.
* Access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/Users/patriciahalley/Documents/Code_institute/git/ivf-predictor/jupyter_notebooks'

To make the parent of the current directory the new current directory:
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("A new current directory has been set")

A new current directory has been set


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/Users/patriciahalley/Documents/Code_institute/git/ivf-predictor'

---

## Load Data

In [4]:
import numpy as np
import pandas as pd

# Open dataset
df = pd.read_csv("outputs/datasets/cleaned/FertilityTreatmentDataCleaned.csv.gz")
        
print(df.shape)
df.head(3)

FileNotFoundError: [Errno 2] No such file or directory: 'outputs/datasets/cleaned/FertilityTreatmentDataCleaned.csv'

---

## Set output directory

In [None]:
import joblib
import os

version = "v1"
file_path = f"outputs/ml_pipeline_dl/ivf_success_predictor_dl/{version}"

try:
    # Check if the directories exist
    if os.path.exists(file_path):
        print("Old version is already available. Please create a new version.")
    else:
        # Create the directory if it does not exist
        os.makedirs(name=file_path)
        print(f"Directory {file_path} created successfully.")
except Exception as e:
    print(f"An error occurred: {e}")

---

## ML Pipeline

Custom Transformers:

### Preprocessing Pipeline

For the deep learning model, all categorical variables are encoded with One-Hot Encoding. Unlike ordinal encoding, this does not impose an order on the categories, so the model cannot learn relationships from an ordering that does not exist in the data.
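
As a rough illustration of what one-hot encoding with a dropped last category produces, the sketch below encodes a small list of labels in plain Python. It only mimics the idea behind `drop_last=True` and is not feature_engine's implementation; the helper name `one_hot_drop_last` and the example labels are hypothetical.

```python
def one_hot_drop_last(name, values):
    """One-hot encode a list of labels, dropping the last category.

    Illustrative only: k categories become k-1 indicator columns,
    and the dropped category is represented by all zeros.
    """
    # Categories in order of first appearance
    categories = list(dict.fromkeys(values))
    kept = categories[:-1]
    return [{f"{name}_{c}": int(v == c) for c in kept} for v in values]


# Hypothetical labels, chosen only for this example
egg_source = ["Patient", "Donor", "Patient", "Unknown"]
encoded = one_hot_drop_last("Egg source", egg_source)
for label, row in zip(egg_source, encoded):
    print(label, row)
```

Each row carries at most a single 1, so no category is numerically "larger" than another, which is the property the paragraph above relies on.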

In [None]:
from sklearn.pipeline import Pipeline
from feature_engine.selection import SmartCorrelatedSelection
from feature_engine.encoding import OneHotEncoder


def PreprocessingPipelineDL():
    pipeline_base = Pipeline(
        [
            (
                "one_hot_encoding",
                OneHotEncoder(
                    drop_last=True,
                    variables=[
                        "Patient age at treatment",
                        "Total number of previous IVF cycles",
                        "Patient/Egg provider age",
                        "Partner/Sperm provider age",
                        "Specific treatment type",
                        "Egg source",
                        "Sperm source",
                        "Patient ethnicity",
                        "Fresh eggs collected",
                        "Total eggs mixed",
                        "Total embryos created",
                        "Embryos transferred",
                        "Total embryos thawed",
                        "Date of embryo transfer",
                    ],
                ),
            ),
            (
                "smart_correlation",
                SmartCorrelatedSelection(
                    method="spearman", threshold=0.9
                ),
            ),
        ]
    )

    return pipeline_base


PreprocessingPipelineDL()

---

## Sequential model (Deep Learning)

### Split Train and Test Set

In [None]:
from sklearn.model_selection import train_test_split

# Define the features (X) and target (y)
X = df.drop(columns=["Live birth occurrence"])  # Drop the target column from the features
y = df["Live birth occurrence"]  # Define the target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X,               # Features DataFrame
    y,               # Target Series
    test_size=0.2,   # 20% of the data will be used for testing
    random_state=0   # Set a random state for reproducibility
)


Apply the preprocessing pipeline

In [None]:
X_train.head(3)

In [None]:
pipeline_feat_eng_dl = PreprocessingPipelineDL()
X_train = pipeline_feat_eng_dl.fit_transform(X_train)

In [None]:
X_train.head(3)

Apply the pipeline to the test set

In [None]:
X_test = pipeline_feat_eng_dl.transform(X_test)

In [None]:
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

Check Train Set Target distribution

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

print(y_train.value_counts())

sns.set_style("whitegrid")
y_train.value_counts().plot(kind="bar", title="Train Set Target Distribution")
plt.show()

## Handle Target Imbalance

### Use SMOTE (Synthetic Minority Oversampling TEchnique) to balance Train Set target

In [None]:
from imblearn.over_sampling import SMOTE

oversample = SMOTE(sampling_strategy='minority', random_state=0)
X_train, y_train = oversample.fit_resample(X_train, y_train)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
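
To give an intuition for what SMOTE adds to the Train set, the sketch below builds one synthetic point by interpolating between a minority-class row and one of its minority-class neighbours. It is a simplified picture of the core idea, not imblearn's implementation (which also performs the nearest-neighbour search); the helper name and the two rows are hypothetical.

```python
import random


def smote_like_sample(sample, neighbour, rng):
    """Create one synthetic point on the segment between two minority rows."""
    gap = rng.random()  # random position in [0, 1) along the segment
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]


rng = random.Random(0)
minority_a = [1.0, 0.0, 3.0]  # hypothetical minority-class row
minority_b = [2.0, 1.0, 5.0]  # hypothetical nearest minority neighbour
synthetic = smote_like_sample(minority_a, minority_b, rng)
print(synthetic)
```

Because the new point lies between two existing rows, a one-hot column that differs between them can end up with a fractional value in the synthetic row.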

Check Train Set Target distribution after resampling

In [None]:
import matplotlib.pyplot as plt

print(y_train.value_counts())

y_train.value_counts().plot(kind='bar', title='Train Set Target Distribution')
plt.show()

Split the Train set further into Train and Validation sets

In [None]:
from sklearn.model_selection import train_test_split

# Create new training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
                                    X_train, y_train,
                                    test_size=0.2,
                                    random_state=0
                                    )

print("* Train set:", X_train.shape, y_train.shape)
print("* Validation set:",  X_val.shape, y_val.shape)
print("* Test set:",   X_test.shape, y_test.shape)


---

### Push to Repo

#### Train Set

In [None]:
X_train.to_csv(f"{file_path}/X_train.csv.gz", index=False, compression="gzip")

In [None]:
y_train.to_csv(f"{file_path}/y_train.csv.gz", index=False, compression="gzip")

#### Test Set

In [None]:
X_test.to_csv(f"{file_path}/X_test.csv.gz", index=False, compression="gzip")

In [None]:
y_test.to_csv(f"{file_path}/y_test.csv.gz", index=False, compression="gzip")

#### Validation Set

In [None]:
X_val.to_csv(f"{file_path}/X_val.csv.gz", index=False, compression="gzip")

In [None]:
y_val.to_csv(f"{file_path}/y_val.csv.gz", index=False, compression="gzip")

---

### Scale Features

In [None]:
from sklearn.preprocessing import StandardScaler


def PipelineScale():
    pipeline_base = Pipeline([("feat_scaling", StandardScaler())])

    return pipeline_base

In [None]:
pipeline_scale = PipelineScale() 

# Fit the scaler on the training set only, then transform it
X_train = pipeline_scale.fit_transform(X_train)

# Reuse the fitted scaler to transform the validation and test sets
X_val = pipeline_scale.transform(X_val)
X_test = pipeline_scale.transform(X_test)


X_train[:2,]
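
The reason for fitting on the Train set only can be sketched with a single feature: the mean and standard deviation are learned from the training values and then reused unchanged on the other sets. This is a minimal plain-Python illustration with hypothetical values, using the population standard deviation as `StandardScaler` does.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return mean(column), pstdev(column)


def transform(column, mu, sigma):
    """Standardise values with statistics learned elsewhere."""
    return [(x - mu) / sigma for x in column]


train = [1.0, 2.0, 3.0, 4.0]  # hypothetical training values
test = [2.5, 5.0]             # hypothetical unseen values

mu, sigma = fit_scaler(train)           # statistics come from train only
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)
print(train_scaled, test_scaled)
```

The scaled training values have mean 0 and standard deviation 1, while the test values are simply expressed relative to the training distribution.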

---

# Model creation

## ML model

Define early stopping callback

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    patience=20,
    restore_best_weights=True,  
    monitor='val_accuracy',
    verbose=1, 
    mode='max' 
)
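
The effect of `patience` with `mode='max'` can be sketched as a small loop over per-epoch validation accuracies. This is a simplified model of the callback's logic (it ignores options such as `min_delta`), and the accuracy values are hypothetical.

```python
def early_stopping_epoch(val_accuracy, patience):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops once `patience` consecutive epochs pass
    without a strict improvement over the best value seen so far.
    """
    best_value = float("-inf")
    best_epoch = 0
    wait = 0
    for epoch, value in enumerate(val_accuracy):
        if value > best_value:
            best_value, best_epoch, wait = value, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran through every epoch without triggering the stop
    return len(val_accuracy) - 1, best_epoch


history = [0.60, 0.65, 0.64, 0.66, 0.66, 0.65, 0.64]  # hypothetical values
stop_epoch, best_epoch = early_stopping_epoch(history, patience=3)
print(stop_epoch, best_epoch)  # prints: 6 3
```

With `restore_best_weights=True`, the weights kept after stopping are those from `best_epoch`, not from the last epoch that ran.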

### Define deep learning model

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout

input_shape = X_train.shape[1:]

def create_dl_model(input_shape):
    model = tf.keras.Sequential([
        # First hidden layer with 128 neurons and a dropout rate of 0.5
        Dense(128, activation='relu', input_shape=input_shape),
        Dropout(0.5),
        # Second hidden layer with 64 neurons and a dropout rate of 0.5
        Dense(64, activation='relu'),
        Dropout(0.5),
        # Third hidden layer with 32 neurons and a dropout rate of 0.5
        Dense(32, activation='relu'),
        Dropout(0.5),
        # Output layer for binary classification
        Dense(1, activation='sigmoid')
    ])
    
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

### Model Summary 

In [None]:
dl_model = create_dl_model(input_shape)
dl_model.summary()
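
The parameter counts reported by the summary can be checked by hand: a Dense layer has one weight per input per unit plus one bias per unit, and Dropout layers add no parameters. The sketch below does this arithmetic for the 128-64-32-1 architecture; `n_features = 40` is a hypothetical value standing in for `X_train.shape[1]`.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus biases (n_units) of a Dense layer."""
    return (n_inputs + 1) * n_units


n_features = 40                 # hypothetical; use X_train.shape[1] in practice
layer_units = [128, 64, 32, 1]  # Dense layers of the model above

total = 0
n_inputs = n_features
for units in layer_units:
    params = dense_params(n_inputs, units)
    print(f"Dense({units}): {params} parameters")
    total += params
    n_inputs = units  # the next layer receives this layer's outputs

print("Total:", total)  # prints: Total: 15617
```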

Train deep learning model

In [None]:
dl_model.fit(
    X_train,
    y_train,
    epochs=100,
    batch_size=32,
    validation_data=(X_val, y_val),
    callbacks=[early_stopping],
    verbose=1
)

---

## Model Performance

### Model learning curve

In [None]:
losses = pd.DataFrame(dl_model.history.history)

sns.set_style("whitegrid")
losses[['loss','val_loss']].plot(style='.-')
plt.title("Loss")
plt.savefig(f'{file_path}/dl_model_training_losses.png', bbox_inches='tight', dpi=150)
plt.show()

print("\n")
losses[['accuracy','val_accuracy']].plot(style='.-')
plt.title("Accuracy")
plt.savefig(f'{file_path}/dl_model_training_acc.png', bbox_inches='tight', dpi=150)
plt.show()

### Model Evaluation

Evaluate model on test set

In [None]:
evaluation = dl_model.evaluate(X_test,y_test)
evaluation

### Confusion matrix

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np
import pandas as pd

def confusion_matrix_and_report(X, y, pipeline, label_map):
    # Get predictions from the pipeline
    prediction = pipeline.predict(X).reshape(-1)
    
    # Apply thresholding if prediction outputs probabilities
    prediction = np.where(prediction < 0.5, 0, 1)

    # Print confusion matrix (sklearn puts actual classes in rows, predictions in columns)
    print("---  Confusion Matrix  ---")
    print(
        pd.DataFrame(
            confusion_matrix(y_true=y, y_pred=prediction),
            columns=["Predicted " + label for label in label_map],
            index=["Actual " + label for label in label_map]
        )
    )
    print("\n")

    # Print classification report
    print("---  Classification Report  ---")
    print(classification_report(y, prediction, target_names=label_map), "\n")

def clf_performance(X_train, y_train, X_test, y_test, X_val, y_val, pipeline, label_map):
    # Evaluate on Train Set
    print("#### Train Set #### \n")
    confusion_matrix_and_report(X_train, y_train, pipeline, label_map)

    # Evaluate on Validation Set
    print("#### Validation Set #### \n")
    confusion_matrix_and_report(X_val, y_val, pipeline, label_map)

    # Evaluate on Test Set
    print("#### Test Set ####\n")
    confusion_matrix_and_report(X_test, y_test, pipeline, label_map)


In [None]:
clf_performance(X_train, y_train,
                X_test,y_test,
                X_val, y_val,
                dl_model,
                label_map= ['No Success', 'Success']
                )
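
To help read the printed matrices, the sketch below counts a 2x2 confusion matrix by hand in the same layout scikit-learn uses (rows are actual classes, columns are predicted classes) and derives precision and recall for the "Success" class. The labels and predictions are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """2x2 matrix with rows = actual class and columns = predicted class."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


y_true = [0, 0, 0, 1, 1, 0]  # hypothetical actual outcomes
y_pred = [0, 1, 0, 1, 0, 0]  # hypothetical thresholded predictions

matrix = confusion_counts(y_true, y_pred)
(tn, fp), (fn, tp) = matrix

precision = tp / (tp + fp)  # share of predicted "Success" that was correct
recall = tp / (tp + fn)     # share of actual "Success" that was found
print(matrix, precision, recall)  # prints: [[3, 1], [1, 1]] 0.5 0.5
```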

---

## Predict on new data

Predict class probabilities

Print the first 50 rows of the test-set target to choose samples for manually testing the model. A Live birth occurrence of 0 or 1 should appear as 0 or 1 in the final array, which translates to "No Success" and "Success" respectively.

In [None]:
# Reset the index so the positions line up with the rows of X_test
y_test.reset_index(drop=True).head(50)

Take a sample from the test set and use it as if it was live data.

Choose an index from the dataset that should result in "No Success", i.e. Live birth occurrence 0, to test.

In [None]:
# Set the row to extract
index = 9

# Extract a single row from X_test with a slice.
# The slice runs from `index-1` up to, but not including, `index`,
# so if `index` is 1 the first row of the dataset (position 0) is extracted.
# Slicing, rather than indexing with a single integer, keeps the row in
# 2D array format (a single-row matrix), which is what the model expects.
live_data = X_test[index-1:index,]

# Display the extracted data.
live_data

Use model.predict and pass the data

In [None]:
prediction_proba = dl_model.predict(live_data)
prediction_proba

Set threshold using NumPy function np.where().

The condition `prediction_proba < 0.5` maps probabilities below 0.5 to 0 and all other probabilities to 1.

In [None]:
prediction_class = np.where(prediction_proba<0.5,0,1)
prediction_class

In this case, 0 means "No success" and 1 means "Success".

---

### ML Pipelines (Feat Eng, scaling) and Model

Both pipelines should be used in conjunction to predict Live Data.

* To predict on the Train, Validation and Test sets, only pipeline_scale is needed, since that data has already been through feature engineering.
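
The order in which the two pipelines and the model are applied to live data can be sketched as a single helper. This is a minimal sketch, assuming the objects expose `transform` and `predict` as the fitted pipelines and the Keras model do; the name `predict_live` and the stand-in classes below are hypothetical and exist only so the example runs on its own.

```python
def predict_live(raw_rows, feat_eng, scaler, model, threshold=0.5):
    """Chain feature engineering, scaling and the model on raw input rows."""
    engineered = feat_eng.transform(raw_rows)  # one-hot encoding, correlation filter
    scaled = scaler.transform(engineered)      # same scaling as the Train set
    probabilities = model.predict(scaled)      # one probability per row
    return [1 if row[0] >= threshold else 0 for row in probabilities]


class PassThrough:
    """Stand-in for the fitted feature engineering pipeline."""
    def transform(self, rows):
        return rows


class DivideByTen:
    """Stand-in for the fitted scaling pipeline."""
    def transform(self, rows):
        return [[x / 10 for x in row] for row in rows]


class MeanModel:
    """Stand-in for the trained model; returns one value per row."""
    def predict(self, rows):
        return [[sum(row) / len(row)] for row in rows]


classes = predict_live([[2, 4], [8, 6]], PassThrough(), DivideByTen(), MeanModel())
print(classes)  # prints: [0, 1]
```

In the notebook, the three stand-ins would be replaced by `pipeline_feat_eng_dl`, `pipeline_scale` and `dl_model`, with the raw rows passed as a DataFrame.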

#### Save Pipeline responsible for Feature Engineering (One Hot Encoding and Smart Correlation)

In [None]:
pipeline_feat_eng_dl

In [None]:
import joblib
import gzip

# Save the model directly into a compressed gzip file
with gzip.open(f"{file_path}/clf_pipeline_feat_eng_dl.pkl.gz", 'wb') as f_out:
    joblib.dump(pipeline_feat_eng_dl, f_out)

#### Save Pipeline responsible for Feature Scaling

In [None]:
pipeline_scale

In [None]:

with gzip.open(f"{file_path}/clf_pipeline_scale.pkl.gz", 'wb') as f_out:
    joblib.dump(pipeline_scale, f_out)
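
Reading a saved pipeline back follows the same pattern in reverse: open the gzip file in binary read mode and load from the file object. The sketch below shows the round trip with the standard library's `pickle` standing in for joblib so that it runs without extra dependencies; for the files saved above, `joblib.load(f_in)` would be the matching call. The dictionary is a placeholder for a fitted pipeline.

```python
import gzip
import os
import pickle
import tempfile

# Placeholder object standing in for a fitted pipeline
artefact = {"feat_scaling": "fitted-scaler-placeholder"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clf_pipeline_scale.pkl.gz")

    # Write the object into a gzip-compressed file
    with gzip.open(path, "wb") as f_out:
        pickle.dump(artefact, f_out)

    # Read it back from the same compressed file
    with gzip.open(path, "rb") as f_in:
        restored = pickle.load(f_in)

print(restored == artefact)  # prints: True
```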

### Save the model in Keras format

In [None]:
dl_model.save(f"{file_path}/ivf_success_predictor_dl_model.keras")

---

### Save evaluation pickle

In [None]:
import joblib
import gzip

with gzip.open(f"{file_path}/evaluation.pkl.gz", 'wb') as f_out:
    joblib.dump(evaluation, f_out)