# **Predicting Fetal Health: A Comprehensive Classification Study**
---
### **Overview**
In fetal health classification, our objective is to predict the well-being of unborn infants—an endeavor of profound importance that directly impacts expectant mothers and one of the most vulnerable populations. This study is based on a comprehensive fetal health classification dataset comprising 2,126 records derived from Cardiotocogram (CTG) examinations. Each record was labeled by three expert obstetricians into one of three categories: Normal, Suspect, or Pathological, providing a clinically reliable foundation for analysis.

To achieve this objective, we employ three machine learning classifiers: logistic regression, random forest, and decision tree. Each model brings distinct analytical strengths, enabling a robust comparative assessment of their ability to identify patterns associated with fetal health outcomes. Through systematic training and evaluation, these models learn to distinguish between healthy conditions and potential signs of fetal distress. The resulting models support healthcare professionals in making timely, data-driven decisions that promote safer pregnancies and improved neonatal outcomes. Ultimately, this study highlights the transformative role of data-driven methods in healthcare, where analytical insights contribute to protecting and improving the health of future generations.

### **Project Goals**
#### **Overall Goal**
To develop and evaluate machine learning models that accurately classify fetal health status using Cardiotocogram (CTG) data, with the aim of supporting early detection of fetal distress and informed clinical decision-making.

#### **Specific Goals**
- To explore and understand the fetal health dataset by analyzing the distribution of fetal health classes (Normal, Suspect, Pathological)
- To preprocess and prepare the data for modeling by handling missing values and removing duplicates
- To develop multiple classification models: logistic regression, decision tree, and random forest classifiers
- To compare model performance using the accuracy score
- To assess fetal health predictions against the actual labels


### Module 1
#### Task 1: Data Dive for Exploring Fetal Health Insights
Access the fetal health data for cleaning and analysis to ensure we have accurate input for informed decision-making in prenatal care. This task involves loading the fetal health dataset into our working environment. Obtaining the data in a structured format is the first critical step before any cleaning, analysis, or modeling can take place.

In [12]:
#--- Import library ----
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

#--- Load dataset ----
df= pd.read_csv('fetal_health.csv')
#--- Inspect the dataset ----
df.head()

Unnamed: 0,baseline value,accelerations,fetal_movement,uterine_contractions,light_decelerations,severe_decelerations,prolongued_decelerations,abnormal_short_term_variability,mean_value_of_short_term_variability,percentage_of_time_with_abnormal_long_term_variability,...,histogram_min,histogram_max,histogram_number_of_peaks,histogram_number_of_zeroes,histogram_mode,histogram_mean,histogram_median,histogram_variance,histogram_tendency,fetal_health
0,120.0,0.0,0.0,0.0,0.0,0.0,0.0,73.0,0.5,43.0,...,62.0,126.0,2.0,0.0,120.0,137.0,121.0,73.0,1.0,2.0
1,132.0,0.006,0.0,0.006,0.003,0.0,0.0,17.0,2.1,0.0,...,68.0,198.0,6.0,1.0,141.0,136.0,140.0,12.0,0.0,1.0
2,133.0,0.003,0.0,0.008,0.003,0.0,0.0,16.0,2.1,0.0,...,68.0,198.0,5.0,1.0,141.0,135.0,138.0,13.0,0.0,1.0
3,134.0,0.003,0.0,0.008,0.003,0.0,0.0,16.0,2.4,0.0,...,53.0,170.0,11.0,0.0,137.0,134.0,137.0,13.0,1.0,1.0
4,132.0,0.007,0.0,0.008,0.0,0.0,0.0,16.0,2.4,0.0,...,53.0,170.0,9.0,0.0,137.0,136.0,138.0,11.0,1.0,1.0


In [2]:
#--- Inspect the structure of the fetal health dataset ----
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2126 entries, 0 to 2125
Data columns (total 22 columns):
 #   Column                                                  Non-Null Count  Dtype  
---  ------                                                  --------------  -----  
 0   baseline value                                          2126 non-null   float64
 1   accelerations                                           2126 non-null   float64
 2   fetal_movement                                          2126 non-null   float64
 3   uterine_contractions                                    2126 non-null   float64
 4   light_decelerations                                     2126 non-null   float64
 5   severe_decelerations                                    2126 non-null   float64
 6   prolongued_decelerations                                2126 non-null   float64
 7   abnormal_short_term_variability                         2126 non-null   float64
 8   mean_value_of_short_term_variability  

#### Task 2: Managing Duplicates in Fetal Health Data
Ensure data integrity by identifying duplicate records that could distort analysis and lead to inaccurate insights. In this task, we scan the dataset to determine if there are duplicate rows. Duplicate records can introduce bias into our analysis, so it’s important to know their count before proceeding.

In [4]:
duplicates= df.duplicated().sum()
duplicates

13
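
The count alone does not show which records are repeated. As a minimal sketch, assuming a small hypothetical frame (the values below are made up and only borrow column names from the dataset), `keep=False` can be used to list every member of a duplicate group before anything is dropped:

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative, not taken from fetal_health.csv.
toy = pd.DataFrame({
    "baseline value": [120.0, 132.0, 132.0, 133.0, 120.0],
    "accelerations": [0.0, 0.006, 0.006, 0.003, 0.0],
    "fetal_health": [2.0, 1.0, 1.0, 1.0, 2.0],
})

# duplicated() flags only the second and later copies of a row, so its sum
# equals the number of rows that drop_duplicates() would remove.
n_extra = int(toy.duplicated().sum())

# keep=False flags every row that belongs to a duplicate group, which makes
# the repeated records easy to inspect side by side.
dup_groups = toy[toy.duplicated(keep=False)].sort_values(list(toy.columns))

# Dropping duplicates and resetting the index leaves a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_extra)
print(dup_groups)
```

Resetting the index is optional; it only matters if later steps rely on a contiguous row index.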

#### Task 3: Enhancing Data Integrity
Clean the dataset by removing duplicate records to ensure that every observation is unique, which is crucial for reliable analysis and model training. After identifying duplicate records, we now need to remove them from the dataset. This task ensures that each fetal health observation is counted only once.

In [5]:
#--- Remove duplicate rows ----
df.drop_duplicates(inplace=True)
#--- Inspect data ----
df.duplicated().sum()

0

#### Task 4: Managing Missing Values in Fetal Health Dataset
Before using the data for predictive modeling, it's essential to ensure that there are no gaps or missing values that could compromise the analysis. From a business perspective, ensuring data completeness leads to more reliable decision-making. In this task, we count the number of missing values in each column, which helps in understanding the quality and readiness of the dataset for further processing.

In [6]:
null_values= df.isnull().sum()
null_values

baseline value                                            0
accelerations                                             0
fetal_movement                                            0
uterine_contractions                                      0
light_decelerations                                       0
severe_decelerations                                      0
prolongued_decelerations                                  0
abnormal_short_term_variability                           0
mean_value_of_short_term_variability                      0
percentage_of_time_with_abnormal_long_term_variability    0
mean_value_of_long_term_variability                       0
histogram_width                                           0
histogram_min                                             0
histogram_max                                             0
histogram_number_of_peaks                                 0
histogram_number_of_zeroes                                0
histogram_mode                          

#### Task 5: Class Distribution Analysis
For healthcare professionals to trust model predictions, the data should represent every health condition adequately. In this task, we count how often each unique value occurs in the 'fetal_health' column of the DataFrame df, mapping the numeric codes to their category names ("Normal", "Suspect", "Pathological"). Knowing the class distribution reveals any imbalance that could affect model training and performance, and it guides the decision on whether to rebalance the dataset before modeling.

In [8]:
values= df['fetal_health'].map({1.0: 'Normal', 2.0: 'Suspect', 3.0: 'Pathological'}).value_counts()
values

fetal_health
Normal          1646
Suspect          292
Pathological     175
Name: count, dtype: int64
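
To make the imbalance easier to read, the counts above can be turned into proportions. The sketch below re-enters the three counts by hand rather than reading the CSV, so it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"Normal": 1646, "Suspect": 292, "Pathological": 175}, name="count")

total = int(counts.sum())

# Share of each class in the deduplicated dataset.
proportions = (counts / total).round(3)

# Ratio between the largest and smallest class.
imbalance_ratio = counts.max() / counts.min()

# Accuracy obtained on this data by always predicting the majority class;
# a useful floor when reading accuracy scores later.
majority_baseline = counts.max() / total

print(proportions)
print(round(imbalance_ratio, 2), round(majority_baseline, 3))
```

With roughly 78% of the records labeled Normal, a classifier that always answers "Normal" already reaches about 0.78 accuracy on this data, so accuracy scores are best read against that floor.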

### Module 2
#### Task 1: Split Data into Features and Target
For predictive modeling to be effective, the model must be trained on the correct data. Separating the input features from the target variable ensures that the model learns the relationship between predictors and outcomes without the target leaking into its inputs. In this task, we create two datasets from the original DataFrame df. The first, X, consists of all columns except the target column 'fetal_health'. The second, y, contains only the 'fetal_health' column, which serves as the target variable.

In [11]:
#--- Split data into X and y ----
X= df.drop(columns = 'fetal_health')
y= df['fetal_health']
X.head()

Unnamed: 0,baseline value,accelerations,fetal_movement,uterine_contractions,light_decelerations,severe_decelerations,prolongued_decelerations,abnormal_short_term_variability,mean_value_of_short_term_variability,percentage_of_time_with_abnormal_long_term_variability,...,histogram_width,histogram_min,histogram_max,histogram_number_of_peaks,histogram_number_of_zeroes,histogram_mode,histogram_mean,histogram_median,histogram_variance,histogram_tendency
0,120.0,0.0,0.0,0.0,0.0,0.0,0.0,73.0,0.5,43.0,...,64.0,62.0,126.0,2.0,0.0,120.0,137.0,121.0,73.0,1.0
1,132.0,0.006,0.0,0.006,0.003,0.0,0.0,17.0,2.1,0.0,...,130.0,68.0,198.0,6.0,1.0,141.0,136.0,140.0,12.0,0.0
2,133.0,0.003,0.0,0.008,0.003,0.0,0.0,16.0,2.1,0.0,...,130.0,68.0,198.0,5.0,1.0,141.0,135.0,138.0,13.0,0.0
3,134.0,0.003,0.0,0.008,0.003,0.0,0.0,16.0,2.4,0.0,...,117.0,53.0,170.0,11.0,0.0,137.0,134.0,137.0,13.0,1.0
4,132.0,0.007,0.0,0.008,0.0,0.0,0.0,16.0,2.4,0.0,...,117.0,53.0,170.0,9.0,0.0,137.0,136.0,138.0,11.0,1.0


#### Task 2: Split Data into Training and Testing Sets
To develop a robust predictive model that performs well in real-world scenarios, it is essential to evaluate the model on data it has never seen before. By splitting the dataset into training data (used to build the model) and testing data (used to evaluate its performance), we ensure unbiased validation of the model’s predictive capabilities. This separation helps to detect overfitting and provides a realistic estimate of how the model will perform on new data. In this task, we will divide our dataset into two parts: a training set and a testing set. The training set will consist of 70% of the data, while the testing set will contain the remaining 30%. This split is achieved using the train_test_split() function from scikit-learn, with the random_state parameter set to 42 to ensure reproducibility.

In [13]:
#--- Assume X is the feature matrix and y is the target variable ----
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
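
Because the classes are imbalanced, one possible variation is a stratified split, which keeps the class shares similar in both parts. This is a sketch of that option, not what the notebook does: it uses a synthetic stand-in (`X_demo`, `y_demo`) built from the class counts reported in Module 1, and it also works out the split sizes implied by `test_size=0.3`.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

# 2113 rows remain after deduplication (1646 + 292 + 175).
n_rows = 2113

# For a float test_size, scikit-learn rounds the test set up.
n_test_expected = math.ceil(0.3 * n_rows)
n_train_expected = n_rows - n_test_expected

# Synthetic stand-in: labels with the reported class counts and a dummy feature.
y_demo = np.repeat([1.0, 2.0, 3.0], [1646, 292, 175])
X_demo = np.arange(n_rows).reshape(-1, 1)

# stratify=y_demo asks for (approximately) the same class shares in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(len(y_tr), len(y_te))
print((y_te == 3.0).mean())  # share of the rarest class in the test part
```

Without `stratify`, the share of the smallest class in the test set depends on the random draw, which matters most for the rare Pathological cases.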

#### Task 3: Train a Logistic Regression Model
To provide healthcare professionals with actionable insights, we need a model that can accurately classify fetal health. Here, we build and train a logistic regression model, a simple yet powerful classification algorithm, to serve as a baseline for comparison with more complex models. In this task, we initialize a logistic regression model with the liblinear solver and train it on the training data (X_train for features and y_train for the target). The trained model is stored in the variable model1b for later predictions and evaluation.

In [14]:
#--- Initiate Logistic Regression model ----
model1b = LogisticRegression(random_state=42, solver='liblinear')
#--- Fit the model with data ----
model1b.fit(X_train, y_train)
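
The CTG features sit on very different scales (baseline values above 100, accelerations in the thousandths), and regularized logistic regression can be sensitive to that. As a hedged sketch of one alternative, the block below standardizes the features inside a pipeline and uses the default solver, which handles three classes directly. It runs on synthetic data from `make_classification`; whether scaling helps on the real dataset is an assumption that would need to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the CTG features (hypothetical).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The scaler is fitted on the training part only, then applied to the test part,
# because both steps live inside the same pipeline.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
scaled_lr.fit(X_tr, y_tr)

demo_accuracy = scaled_lr.score(X_te, y_te)
print(demo_accuracy)
```

Keeping the scaler in the pipeline avoids fitting it on test rows, which would otherwise leak test information into training.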

#### Task 4: Make Predictions with the Trained Model
Once the logistic regression model is trained, it is critical to evaluate how well it performs on new, unseen data. This step is key to judging whether the model can reliably predict fetal health in real-world scenarios. Predictions on the test dataset let us assess the model's potential to identify cases that require further medical attention. In this task, we use the trained logistic regression model (stored in model1b) to predict the target variable from the testing features (X_test). The resulting predictions are stored in the variable y_predict1b for further evaluation, such as comparison with the actual fetal health labels.

In [16]:
#--- Predict Fetal Health ----
y_predict1b= model1b.predict(X_test)
y_predict1b

array([1., 1., 1., 2., 1., 2., 1., 1., 1., 1., 1., 1., 1., 1., 3., 1., 1.,
       1., 1., 2., 3., 1., 3., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 3., 1., 1., 1., 2., 1., 1., 1., 1., 1., 1.,
       2., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 3., 1., 2., 2., 1., 1., 3., 2., 1., 1., 1., 1.,
       1., 1., 1., 1., 3., 2., 3., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 2., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 1., 1., 1.,
       2., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       3., 1., 1., 1., 1., 1., 1., 3., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 3., 3.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 3., 1., 1., 1., 1., 1., 1.,
       3., 1., 1., 1., 1.

#### Task 5: Calculate Model Accuracy
For healthcare decision-making, it is crucial to have confidence in the model's predictions. Calculating the accuracy gives a first measure of the model's performance: high accuracy means that the predictions closely match the actual labels. In this task, we compare the predictions generated by our logistic regression model (stored in y_predict1b) with the true outcomes from the testing data (y_test). We compute the accuracy score, which represents the proportion of correct predictions, and store it in a variable named accuracy_lr.

In [17]:
#--- Check the model accuracy ----
accuracy_lr= accuracy_score(y_test, y_predict1b)
accuracy_lr

0.889589905362776
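
Accuracy is a single number, so it cannot show which classes the errors fall in. The notebook imports `confusion_matrix` and `classification_report` but does not use them; the sketch below shows how they could complement the accuracy score. The label lists are small hypothetical examples, not the model's actual output.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels: 1 = Normal, 2 = Suspect, 3 = Pathological.
y_true_demo = [1, 1, 1, 1, 1, 1, 2, 2, 3, 3]
y_pred_demo = [1, 1, 1, 1, 1, 1, 1, 2, 3, 1]

# Overall share of correct predictions.
acc = accuracy_score(y_true_demo, y_pred_demo)

# Rows are actual classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 2, 3])

# Per-class precision, recall, and F1.
report = classification_report(
    y_true_demo,
    y_pred_demo,
    labels=[1, 2, 3],
    target_names=["Normal", "Suspect", "Pathological"],
    zero_division=0,
)

print(acc)
print(cm)
print(report)
```

In this toy case the accuracy is 0.8, yet half of the Pathological cases are missed, which is the kind of detail the per-class view exposes.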

#### Task 6: Initialize a Random Forest Classifier
In healthcare, accurate predictions are vital for early detection and intervention. To capture more complex patterns in fetal health data and improve predictive accuracy, we turn to a more flexible model. A Random Forest classifier is a robust ensemble learning method that combines the votes of multiple decision trees, which typically improves generalization over a single tree. In this task, we initialize a Random Forest classifier with 100 trees, a maximum depth of 4, and the Gini criterion.

In [20]:
#--- Initiate with Random Forest ----
model2b= RandomForestClassifier(criterion='gini', n_estimators=100, max_depth=4, random_state=33)

#### Task 7: Train the Random Forest Classifier
To support clinical decision-making, the model must learn to classify fetal health from complex data patterns. In this task, we train the previously initialized Random Forest classifier (stored in the variable model2b) using the training features (X_train) and the corresponding target labels (y_train). Training fits the individual trees to the historical data so that the model can later generalize and make accurate predictions on unseen data.

In [21]:
#--- Fit the model with train data ----
model2b.fit(X_train, y_train)
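
A fitted Random Forest also exposes `feature_importances_`, which can hint at the inputs the trees rely on most. The sketch below uses synthetic data and hypothetical column names (`feature_0` to `feature_5`) with the same hyperparameters as model2b; on the real data, the same pattern would be applied to `model2b` and `X_train.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class data with hypothetical feature names.
X_arr, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=33,
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

# Same hyperparameters as model2b in the notebook.
rf_demo = RandomForestClassifier(
    criterion="gini", n_estimators=100, max_depth=4, random_state=33
)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances, normalized to sum to 1, sorted from most to least used.
importances = pd.Series(
    rf_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

print(importances)
```

Impurity-based importances describe how the forest used the features during training; they are a starting point for discussion rather than evidence of clinical relevance.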

#### Task 8: Make Predictions with the Random Forest Model
To judge whether the Random Forest model is reliable enough for clinical settings, we need to evaluate how well it classifies fetal health outcomes on new, unseen cases. Accurately predicting these outcomes is vital for early risk detection and timely medical intervention. In this task, we apply the trained Random Forest model (stored in model2b) to the testing feature data (X_test) to predict the fetal health status. The predicted results are stored in a variable named y_pred2b for later performance evaluation.

In [22]:
# Predict fetal health
y_pred2b= model2b.predict(X_test)
y_pred2b

array([1., 1., 1., 1., 1., 2., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 2., 3., 1., 3., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 1.,
       1., 1., 1., 1., 1., 1., 2., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       2., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 1., 1., 1., 1., 2.,
       1., 1., 1., 1., 1., 3., 1., 2., 2., 1., 2., 3., 2., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 2., 3., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 2., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 2., 1., 1., 1., 2., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 1., 1., 1., 1., 1., 1., 2.,
       3., 1., 1., 2., 1., 1., 1., 2., 1., 2., 1., 1., 1., 1., 1., 1., 1.,
       2., 2., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 1., 1., 1., 1., 1., 3., 3.,
       2., 1., 1., 1., 1., 1., 1., 1., 1., 1., 3., 1., 1., 1., 1., 1., 1.,
       3., 1., 1., 1., 1.

#### Task 9: Calculate Accuracy of the Random Forest Model
In clinical settings, having a reliable and accurate predictive model is essential for making informed decisions. To ensure that our advanced Random Forest model can be trusted for early risk detection, we need to evaluate its performance by calculating its accuracy. This metric tells us the proportion of correct predictions, providing a quick measure of the model's overall reliability. In this task we will compare the predictions made by the Random Forest model (stored in y_pred2b) against the true fetal health labels in the testing set (y_test). The accuracy score, which ranges between 0 (no correct predictions) and 1 (all predictions correct), will be computed and stored in a variable named accuracy_rf.

In [23]:
#--- Check the model accuracy ----
accuracy_rf= accuracy_score(y_test, y_pred2b)
accuracy_rf

0.9211356466876972

#### Task 10: Initialize a Decision Tree Classifier
In a clinical setting, predictions should be not only accurate but also interpretable. Clinicians often prefer models that are easy to understand and explain. A Decision Tree classifier offers a transparent decision-making process, making it well suited to explaining how predictions are made in fetal health classification. By initializing a Decision Tree with controlled complexity (here, a maximum depth of 2), we can offer clear insights into the factors influencing each prediction. In this task, we initialize a Decision Tree classifier, much as we initialized the Random Forest in Task 6.

In [24]:
#--- Initiate Decision Tree ----
model3b= DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=33)

#### Task 11: Train the Decision Tree Classifier
In clinical settings, it is vital not only to predict fetal health outcomes accurately but also to provide a transparent, interpretable decision-making process. A Decision Tree model is particularly valuable for this purpose because it allows clinicians to see the exact decision rules leading to each prediction. The business goal is to build a model that learns simple, actionable rules from historical patient data to classify fetal health reliably. In this task, we train the Decision Tree model (stored in model3b) on the training dataset by fitting it to the input features (X_train) and the corresponding target values (y_train). In doing so, the model learns the split rules, and hence the decision boundaries, that separate the different fetal health outcomes. This training step prepares the model to make predictions on new, unseen data.

In [25]:
#--- Fit the model with the train data ----
model3b.fit(X_train, y_train)
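
The decision rules mentioned above can be printed as text with `export_text`. The sketch below fits a tree with the same settings as model3b on synthetic data with hypothetical feature names; on the real data, the same call would take `model3b` and `list(X_train.columns)`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic three-class data with hypothetical feature names.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=33,
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

# Same hyperparameters as model3b in the notebook.
dt_demo = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=33)
dt_demo.fit(X_demo, y_demo)

# Plain-text view of the learned if/else rules.
rules = export_text(dt_demo, feature_names=feature_names)
n_leaves = dt_demo.get_n_leaves()

print(rules)
print(n_leaves)
```

With `max_depth=2` the tree has at most three split conditions and four leaves, so the full rule set fits in a few lines.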

#### Task 12: Make Predictions with the Decision Tree Model
In a clinical environment, predictions should be both accurate and understandable. A Decision Tree model, with its transparent decision-making process, offers clinicians clear insights into how predictions are made. The business goal here is to generate fetal health predictions that are easy to interpret, thereby increasing trust and facilitating informed decision-making. In this task, we use the trained Decision Tree model (stored in model3b) to predict fetal health outcomes from the testing feature data (X_test). The predicted outcomes are stored in a variable named y_pred3b. This step lets us evaluate how well the model generalizes to new, unseen data.

In [26]:
#--- Predict fetal health ----
y_pred3b= model3b.predict(X_test)
y_pred3b

array([1., 1., 1., 1., 1., 2., 1., 1., 1., 1., 1., 1., 1., 1., 3., 1., 1.,
       1., 1., 2., 3., 1., 3., 1., 1., 1., 2., 1., 1., 1., 1., 1., 2., 1.,
       1., 1., 1., 1., 1., 1., 2., 1., 1., 1., 1., 1., 2., 1., 1., 2., 1.,
       2., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 1., 1., 1., 1., 2.,
       1., 1., 1., 1., 1., 3., 1., 2., 2., 1., 2., 3., 2., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 2., 3., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 2., 1., 1., 1., 3., 1., 1., 1., 1., 1., 1., 2., 1., 1., 1.,
       1., 1., 1., 1., 2., 1., 1., 1., 2., 1., 1., 1., 1., 1., 1., 3., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 1., 1., 1., 1., 1., 1., 2.,
       3., 1., 1., 2., 1., 2., 1., 2., 1., 2., 1., 1., 1., 1., 1., 1., 1.,
       2., 2., 1., 1., 2., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 1., 2., 1., 1., 1., 3., 3.,
       2., 2., 1., 2., 1., 1., 1., 1., 1., 1., 3., 1., 1., 1., 1., 1., 1.,
       3., 2., 1., 1., 1.

#### Task 13: Calculate Accuracy of the Decision Tree Model
In clinical applications, even simpler, interpretable models must be rigorously evaluated to ensure that they provide reliable predictions. The goal here is to verify that our straightforward Decision Tree model delivers accurate classifications of fetal health outcomes. By measuring its accuracy, we gain a clear metric that shows how well the model's predictions match the actual clinical data, thus supporting its use in decision-making. In this task we will calculate the accuracy of the Decision Tree model by comparing the true labels in the testing set (y_test) with the predictions generated by the model (y_pred3b). The resulting accuracy score—reflecting the proportion of correct predictions—will be stored in a variable named accuracy_dt.

In [27]:
#--- Check the model Accuracy ----
accuracy_dt= accuracy_score(y_test, y_pred3b)
accuracy_dt

0.9037854889589906
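
With all three scores available, a small table makes the comparison explicit. The sketch below copies the accuracy values reported above and, as an assumption, takes the test set to hold 634 rows (30% of the 2,113 deduplicated rows, rounded up) to convert each score into a count of correct predictions.

```python
import pandas as pd

# Accuracy scores copied from the outputs above.
accuracy_lr = 0.889589905362776
accuracy_rf = 0.9211356466876972
accuracy_dt = 0.9037854889589906

comparison = (
    pd.DataFrame({
        "model": ["Logistic Regression", "Random Forest", "Decision Tree"],
        "accuracy": [accuracy_lr, accuracy_rf, accuracy_dt],
    })
    .sort_values("accuracy", ascending=False)
    .reset_index(drop=True)
)

# Assumption: 634 test rows, i.e. ceil(0.3 * 2113).
n_test = 634
comparison["correct"] = (comparison["accuracy"] * n_test).round().astype(int)

print(comparison)
```

Under that assumption, the Random Forest gets 584 test cases right, the Decision Tree 573, and logistic regression 564, so the gap between the best and worst model is 20 cases.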

### Module 3
#### Task 1: Model Performance Unveiled: Assessing Fetal Health Predictions
For effective clinical decision-making, the performance of predictive models should be transparent and easy to interpret. The goal is to compare the actual fetal health outcomes with the predictions made by the Random Forest model. A side-by-side comparison table lets stakeholders quickly see where the model performs well and identify areas that may require further tuning. In this task, we generate predictions using the trained Random Forest classifier (model2b) on the test dataset (X_test). Then, we create a DataFrame (prediction_df) with two columns: one for the actual fetal health labels (y_test, with its index reset so that the rows align with the predictions) and another for the predicted labels (y_pred2b).

In [28]:
#--- Generate predictions using the trained Random Forest classifier ---
y_pred2b = model2b.predict(X_test)

#--- Create a DataFrame to compare actual vs predicted values ----
prediction_df = pd.DataFrame({
    'Actual_Fetal_Health': y_test.reset_index(drop=True),
    'Predicted_Fetal_Health': y_pred2b
})

#--- Inspect data ----
prediction_df

Unnamed: 0,Actual_Fetal_Health,Predicted_Fetal_Health
0,2.0,1.0
1,1.0,1.0
2,1.0,1.0
3,2.0,1.0
4,2.0,1.0
...,...,...
629,3.0,3.0
630,1.0,1.0
631,1.0,1.0
632,2.0,2.0
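
A long two-column table is hard to scan, so one option is to summarize it with a cross-tabulation and to filter the mismatched rows. The sketch below rebuilds only the nine rows visible in the truncated output above (as `visible_rows`, a hypothetical name), so its counts describe that excerpt, not the full test set.

```python
import pandas as pd

# Only the rows shown in the truncated output above, not the full prediction_df.
visible_rows = pd.DataFrame({
    "Actual_Fetal_Health": [2.0, 1.0, 1.0, 2.0, 2.0, 3.0, 1.0, 1.0, 2.0],
    "Predicted_Fetal_Health": [1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 1.0, 1.0, 2.0],
})

# Same code-to-name mapping used in the class distribution task.
labels = {1.0: "Normal", 2.0: "Suspect", 3.0: "Pathological"}
named = visible_rows.apply(lambda col: col.map(labels))

# Rows are actual classes, columns are predicted classes.
crosstab = pd.crosstab(named["Actual_Fetal_Health"], named["Predicted_Fetal_Health"])

# Rows where the prediction disagrees with the actual label.
misclassified = visible_rows[
    visible_rows["Actual_Fetal_Health"] != visible_rows["Predicted_Fetal_Health"]
]

print(crosstab)
print(misclassified)
```

In this excerpt, all three mismatches are Suspect cases predicted as Normal; running the same two steps on the full prediction_df would show whether that pattern holds across the test set.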
