<a href="https://colab.research.google.com/github/iMaasai/DSProjects/blob/master/fetal_health_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pip install pycaret

In [None]:
#importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report, accuracy_score, f1_score
from sklearn import metrics
from sklearn.metrics import roc_curve, auc, roc_auc_score

##Loading Data

In [None]:
#read data from csv
df = pd.read_csv("foetal_health.csv")
df.head()

#Exploratory Data Analysis

In [None]:
#features and Data Types
df.info()

*   The dataset has 2,126 rows and 22 columns, with no null values in any column
*   All the columns are of float64 data type

In [None]:
#number of unique values in each feature
df.nunique()

Many of the features have a low number of unique values (below 100)

In [None]:
#rename columns for consistency and correct typo
df = df.rename(columns={"baseline value":"baseline_value", "prolongued_decelerations":"prolonged_decelerations"})

In [None]:
#display basic summary statistics of numeric features
df.describe().T

In [None]:
# getting value counts for severe_decelerations column (investigate the extreme skew)
df.severe_decelerations.value_counts()

Only 7 observations in the whole dataset have a non-zero severe_decelerations value, which explains the extreme skew.

In [None]:
# checking value counts for target variable
df.fetal_health.value_counts()

In [None]:
#visualizing value counts for target variable
ax = sns.countplot(data= df, x="fetal_health", stat="percent")
ax.bar_label(ax.containers[0])
plt.title('Target Variable Value Counts')
plt.show();

The value counts and the plot above show a highly imbalanced dataset: about 77% of the observations fall in one class. This is expected, since most foetuses should be "Normal", with a few "Suspect" cases and even fewer "Pathological" ones.

In [None]:
#check data skewness
df.skew()

In [None]:
#visualizing distributions of the features
hist_plot = df.hist(figsize = (20,20))


From the skewness values and the histograms above we can observe:

*   Very high positive skew in severe_decelerations
*   High positive skew in fetal_movement, prolonged_decelerations, histogram_variance and histogram_number_of_zeroes

These features might require a transformation to bring their distributions closer to normal.
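As a minimal sketch of the kind of transformation mentioned above, the cell below applies `np.log1p` to a synthetic right-skewed column and compares the skewness before and after. The data and the column names `skewed_feature` and `log_feature` are made up for illustration and are not taken from the fetal health dataset. Whether a log transform suits any particular feature here is an assumption that would need checking.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (hypothetical; not from the fetal health dataset)
rng = np.random.default_rng(42)
demo = pd.DataFrame({"skewed_feature": rng.exponential(scale=2.0, size=1000)})

# log1p computes log(1 + x), so zero values are handled safely
demo["log_feature"] = np.log1p(demo["skewed_feature"])

skew_before = demo["skewed_feature"].skew()
skew_after = demo["log_feature"].skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Tree-based models are largely insensitive to monotonic transformations of individual features, so a step like this matters mostly for linear or distance-based models.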





In [None]:
#heatmap to show correlations in dataset
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr().round(decimals=1), annot=True)
plt.title('Feature Correlations')
plt.show();

Investigating the correlation of the different features with fetal_health:

*   Relatively high correlation: prolonged_decelerations, abnormal_short_term_variability, percentage_of_time_with_abnormal_long_term_variability, accelerations, histogram_mode

These might be important features to consider.

We also observe high multicollinearity between some related features, which might call for multicollinearity handling.
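One simple way to handle multicollinearity is to drop one feature from each highly correlated pair. The sketch below shows the idea on synthetic data. The columns `a`, `b`, `c` and the 0.9 threshold are assumptions chosen for illustration, not values derived from this dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): "b" is nearly a copy of "a", "c" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

# Absolute correlations, keeping only the upper triangle so each pair appears once
corr = demo.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose correlation exceeds the assumed threshold
threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = demo.drop(columns=to_drop)
print(to_drop)
```

PyCaret's `setup` function also exposes `remove_multicollinearity` and `multicollinearity_threshold` parameters, which may be a more convenient route inside the pipeline used later in this notebook.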





In [None]:
#percentage_of_time_with_abnormal_long_term_variability correlation to baseline_value by fetal_health
plt.figure(figsize=(10,6))
sns.scatterplot(x="percentage_of_time_with_abnormal_long_term_variability", y="baseline_value", hue='fetal_health', data=df)
plt.title('baseline_value vs percentage_of_time_with_abnormal_long_term_variability Correlations')
plt.show();

In [None]:
#prolonged_decelerations correlation to baseline_value by fetal_health
plt.figure(figsize=(10,6))
sns.scatterplot(data =df,x="prolonged_decelerations",y="baseline_value", hue="fetal_health")
plt.title('baseline_value vs prolonged_decelerations Correlations')
plt.show();

In [None]:
#abnormal_short_term_variability correlation to baseline_value by fetal_health
plt.figure(figsize=(10,6))
sns.scatterplot(data =df,x="abnormal_short_term_variability",y="baseline_value", hue="fetal_health")
plt.title('baseline_value vs abnormal_short_term_variability Correlations')
plt.show();

In [None]:
#accelerations correlation to baseline_value by fetal_health
plt.figure(figsize=(10,6))
sns.scatterplot(data =df,x="accelerations",y="baseline_value", hue="fetal_health")
plt.title('baseline_value vs accelerations Correlations')
plt.show();

In [None]:
#histogram_mode correlation to baseline_value by fetal_health
plt.figure(figsize=(10,6))
sns.scatterplot(data =df,x="histogram_mode",y="baseline_value", hue="fetal_health")
plt.title('baseline_value vs histogram_mode Correlations')
plt.show();

The scatter plots above visualize the five features with the highest correlation to fetal_health, and they support the view that these features are among the most informative.

In [None]:
#Defining the independent (X) and dependent (y) attributes
X=df.drop(["fetal_health"],axis=1)
y=df["fetal_health"]

In [None]:
#Plotting the input features using box plots
plt.figure(figsize=(20,8))
sns.boxplot(data = X)
plt.xticks(rotation=60)
plt.title('Input features Box Plot')
plt.show();

The input features span very different ranges and there appears to be a substantial number of outliers.

This might call for outlier handling and data normalization (scaling).
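As an illustration of one common outlier-handling approach, the sketch below flags and clips values using the 1.5 × IQR rule, the same convention box-plot whiskers typically follow. The toy values are hypothetical. Whether clipping is appropriate for clinical measurements like these is a judgment call, since extreme readings may be exactly the pathological cases of interest.

```python
import pandas as pd

# Toy values (hypothetical) with one obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])

# Interquartile range and the conventional 1.5 * IQR fences
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences, then clip them to the fence values
outliers = values[(values < lower) | (values > upper)]
clipped = values.clip(lower=lower, upper=upper)
print(outliers.tolist())  # [95]
```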

#Preprocessing

In [None]:
# Standardizing the features with StandardScaler and inspecting the result
col_names = list(X.columns)
s_scaler = StandardScaler()
X_scaled= s_scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=col_names)
X_scaled.describe().T

In [None]:
col_names

In [None]:
#Plotting the scaled features using box plots
plt.figure(figsize=(20,8))
sns.boxplot(data = X_scaled)
plt.xticks(rotation=60)
plt.title('Input features Box Plot after Scaling')
plt.show();

The scaled data is now in a similar range.

In [None]:
df.shape

#Modelling

##PyCaret
PyCaret is an open-source, low-code machine learning library in Python that automates ML workflows, inspired by the caret library in the R programming language. It is an end-to-end machine learning and model management tool that substantially speeds up the experiment cycle.

PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks, such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and a few more.

Credibility (from GitHub):

*   1.7k forks
*   8.3k stars

Links:

*   [PyCaret Home](https://pycaret.org/)
*   [PyCaret GitHub](https://github.com/pycaret/pycaret/)





##Setup

In [None]:
#pycaret setup:initializes the experiment in PyCaret and creates the transformation pipeline based on all the parameters passed in the function
from pycaret.classification import *
s = setup(df, target = 'fetal_health', train_size = 0.8, fix_imbalance = True)

In [None]:
#view transformed train dataset
get_config('X_train')

In [None]:
#view transformed test dataset
get_config('X_test')

##Train

In [None]:
#check available models
models()

In [None]:
#compare baseline models
best = compare_models(sort = 'F1')

The compare_models function trains and evaluates all estimators available in the model library using cross-validation. It returns the top-performing model according to the F1 metric.

Cross-validation is a technique for estimating how well a model performs on unseen data. The available data is divided into multiple folds or subsets (10 in this case); each fold is used once as the validation set while the model is trained on the remaining folds, and the scores are averaged across folds.
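To make the fold mechanics concrete, here is a small sketch of 10-fold stratified cross-validation using scikit-learn directly. The synthetic data, the logistic regression model and the `f1_weighted` scorer are assumptions chosen for illustration; this is not a reproduction of what compare_models does internally.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced 3-class data (hypothetical stand-in for the real dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, weights=[0.78, 0.14, 0.08], random_state=0,
)

# 10 stratified folds: each fold keeps roughly the same class proportions
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    cv=cv, scoring="f1_weighted",
)

# One score per fold; the mean is the cross-validated estimate
print(len(scores), round(scores.mean(), 3))
```

Stratification matters with imbalanced targets, because a plain random split could leave a fold with very few minority-class samples.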

In [None]:
print(best)


The output shows that the lightgbm model is the best performer for our use case. We shall thus explore ways to use this model and optimize it.

In [None]:
#renaming the top performing model for identification
lightgbm = best

In [None]:
print(lightgbm)

##Tune Model
The tune_model function tunes the hyperparameters of the model. Its output is a scoring grid with cross-validated scores by fold. The best model is selected based on the metric defined in the optimize parameter.
By default, PyCaret uses random search (RandomizedSearchCV from scikit-learn).
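As a rough illustration of randomized hyperparameter search, the sketch below runs scikit-learn's `RandomizedSearchCV` on a decision tree with synthetic binary data. The estimator, the candidate values in `param_distributions` and the data are all assumptions for demonstration; PyCaret chooses its own search space for each model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (hypothetical)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate values to sample from (hypothetical grid, 6 * 4 * 2 = 48 combinations)
param_distributions = {
    "max_depth": [2, 3, 4, 5, 6, None],
    "min_samples_split": [2, 5, 10, 20],
    "criterion": ["gini", "entropy"],
}

# Sample 20 of the 48 combinations and score each with 10-fold cross-validation
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=20, cv=10, scoring="f1", random_state=0,
)
search.fit(X_demo, y_demo)

print(len(search.cv_results_["params"]))  # number of sampled candidates
print(search.best_params_)
```

Random search evaluates only a sample of the grid, which keeps the cost fixed at n_iter × folds fits regardless of how large the search space is.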

In [None]:
tuned_lightgbm, tuner = tune_model(lightgbm, n_iter = 20, optimize = 'F1', return_tuner = True, choose_better = True)

With 10 folds for each of 20 iterations, the tuner runs 200 fits in total to produce the output above.

In [None]:
# default model
print(lightgbm)

# tuned model
print(tuned_lightgbm)

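The original model (with the hyperparameters printed above) performs better than the tuned candidates. Because choose_better = True was passed, tune_model returns the original model in this case, so tuned_lightgbm is the same as lightgbm.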

In [None]:
#view tuner attributes used to optimize model
print(tuner)

##Assess Model
PyCaret offers functions for analyzing the performance of a trained model on the hold-out set.

Although widely used, classification accuracy is almost universally inappropriate for imbalanced classification. The reason is that a no-skill model that only predicts the majority class can still achieve high accuracy.

I therefore evaluate on the F1-score, which penalizes both false positives (through precision) and false negatives (through recall). This is critical here because we want to reduce misclassifications in this health use case.

F1-score = (2 * Precision * Recall) / (Precision + Recall)
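The following worked example, using hypothetical toy labels, shows why accuracy misleads on imbalanced data: a no-skill model that always predicts the majority class reaches 90% accuracy but an F1-score of 0 on the minority class. The helper `binary_scores` is written here for illustration only. For the three-class problem in this notebook, F1 is computed per class and then averaged.

```python
# Toy labels (hypothetical): 90 "normal" (0) cases and 10 "at risk" (1) cases
y_true = [0] * 90 + [1] * 10

# A no-skill model that always predicts the majority class
y_pred = [0] * 100

def binary_scores(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Guard against division by zero when a class is never predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

accuracy, precision, recall, f1 = binary_scores(y_true, y_pred)
print(accuracy, f1)  # 0.9 0.0
```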

In [None]:
# plot confusion matrix for model
plot_model(tuned_lightgbm, plot = 'confusion_matrix', plot_kwargs = {'percent' : True})

In [None]:
# plot class report for model
plot_model(tuned_lightgbm, plot = 'class_report')

In [None]:
# plot ROC AUC of model
plot_model(tuned_lightgbm, plot = 'auc')

In [None]:
# plot Precision Recall Curve of model
plot_model(tuned_lightgbm, plot = 'pr')

In [None]:
# plot class prediction error for model
plot_model(tuned_lightgbm, plot = 'error')

In [None]:
# plot final model parameters
plot_model(tuned_lightgbm, plot = 'parameter')

##Interpret Model
This section analyzes the predictions generated by the trained model. Feature importance shows which features are most informative to the model.

In [None]:
# plot feature importance for model
plot_model(tuned_lightgbm, plot = 'feature_all')

#Predictions

In [None]:
# assign labels to the hold-out (test) dataset using the trained model
# the output also includes a score column (probability of the predicted class)
predictions_df = predict_model(tuned_lightgbm)

In [None]:
predictions_df

In [None]:
#save the predictions dataframe into a csv file
predictions_df.to_csv('sample_predictions.csv')

##References
[(PDF) Fetal Health Classification from Cardiotocographic Data Using Machine Learning](https://www.researchgate.net/publication/356126999_Fetal_Health_Classification_from_Cardiotocographic_Data_Using_Machine_Learning)

[Comparison of machine learning algorithms to classify fetal health using cardiotocogram data - ScienceDirect](https://www.sciencedirect.com/science/article/pii/S1877050921023541)

[GitHub - pycaret/pycaret: An open-source, low-code machine learning library in Python](https://github.com/pycaret/pycaret/)

[Home - PyCaret](https://pycaret.org/)

[Tour of Evaluation Metrics for Imbalanced Classification - MachineLearningMastery.com](https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/)

[(PDF) Fetal health classification from cardiotocographic data using machine learning](https://www.researchgate.net/publication/356666279_Fetal_health_classification_from_cardiotocographic_data_using_machine_learning)