A2: Modelling Case Study (Individual)

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

# Load the birthweight dataset
birthweight_data = pd.read_csv('./birthweight2.csv')
kaggle_test_df = pd.read_csv('./kaggle_test_data.csv')

# displaying the head of the dataset
birthweight_data.head(n=3)
kaggle_test_df.head(n=3)

Unnamed: 0,bwt_id,mage,meduc,monpre,npvis,fage,feduc,omaps,fmaps,cigs,drink,male,mwhte,mblck,moth,fwhte,fblck,foth
0,bwt_14,30,16.0,5,10.0,38,16.0,9,9,0.0,0.0,1,1,0,0,1,0,0
1,bwt_16,29,12.0,1,9.0,28,12.0,9,10,0.0,0.0,0,1,0,0,1,0,0
2,bwt_24,28,16.0,1,12.0,30,16.0,8,9,0.0,0.0,1,1,0,0,1,0,0


EXPLORATORY DATA ANALYSIS

In [24]:
# Descriptive statistics
descriptive_stats = birthweight_data.describe()

In [25]:
# Correlation matrix for numeric columns only
numeric_birthweight_data = birthweight_data.select_dtypes(include=['number'])
correlation_matrix = numeric_birthweight_data.corr()

In [26]:
# Pearson correlation coefficients for all features in relation to 'bwght'
pearson_correlation_matrix = numeric_birthweight_data.corr(method='pearson')

# Extracting the correlation values specifically for 'bwght'
pearson_correlation_bwght = pearson_correlation_matrix['bwght'].sort_values(ascending=False)

pearson_correlation_bwght

bwght     1.000000
omaps     0.405356
fmaps     0.382021
npvis     0.180625
fage      0.138822
male      0.061118
feduc     0.060550
monpre    0.055095
mage      0.054616
fblck     0.041825
mblck     0.022982
fwhte     0.016804
mwhte     0.015437
meduc     0.001286
drink    -0.029008
cigs     -0.041545
moth     -0.049353
foth     -0.074563
Name: bwght, dtype: float64

In [27]:
# Adding a new binary feature for low birthweight
numeric_birthweight_data['low_bwght'] = (numeric_birthweight_data['bwght'] < 2500).astype(int)

# Recalculating the correlations with the new binary target
binary_correlations = numeric_birthweight_data.corr()['low_bwght'].sort_values()

binary_correlations

bwght       -0.773622
omaps       -0.386374
fmaps       -0.325326
npvis       -0.134160
fage        -0.092089
monpre      -0.062930
mage        -0.044626
fblck       -0.044164
mblck       -0.044164
male        -0.038470
moth        -0.035606
feduc       -0.023717
drink       -0.015042
foth        -0.003005
meduc        0.029491
fwhte        0.036859
cigs         0.058257
mwhte        0.058676
low_bwght    1.000000
Name: low_bwght, dtype: float64

In [28]:
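# Handling missing numerical data with mean imputation
for column in ['meduc', 'npvis', 'feduc', 'cigs', 'drink']:
    # Assign the result back; fillna(inplace=True) on a column selection may not update the frame
    numeric_birthweight_data[column] = numeric_birthweight_data[column].fillna(
        numeric_birthweight_data[column].mean())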

# Confirming if there are any missing values left
missing_values_after_imputation = numeric_birthweight_data.isnull().sum()

FEATURE ENGINEERING

In [29]:
# Re-creating the binary low birthweight target (unchanged by the imputation above)
numeric_birthweight_data['low_bwght'] = (numeric_birthweight_data['bwght'] < 2500).astype(int)

# Recalculating the correlations with the binary target after mean imputation
binary_correlations = numeric_birthweight_data.corr()['low_bwght'].sort_values()

binary_correlations

bwght       -0.773622
omaps       -0.386374
fmaps       -0.325326
npvis       -0.133407
fage        -0.092089
monpre      -0.062930
mage        -0.044626
fblck       -0.044164
mblck       -0.044164
male        -0.038470
moth        -0.035606
feduc       -0.023435
drink       -0.014409
foth        -0.003005
meduc        0.029307
fwhte        0.036859
cigs         0.055495
mwhte        0.058676
low_bwght    1.000000
Name: low_bwght, dtype: float64

CANDIDATE MODEL DEVELOPMENT

In [30]:
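# Impute missing values for 'omaps' and 'fmaps' with the mode
for column in ['omaps', 'fmaps']:
    # Assign the result back; fillna(inplace=True) on a column selection may not update the frame
    numeric_birthweight_data[column] = numeric_birthweight_data[column].fillna(
        numeric_birthweight_data[column].mode()[0])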

# Check if there are any missing values left after the second imputation
missing_values_after_second_imputation = numeric_birthweight_data.isnull().sum()

# If there are no more missing values, we can proceed to model development
# For this, we'll select a few features for the model based on the correlations and domain knowledge
# Let's use 'omaps', 'fmaps', 'meduc', 'feduc', 'cigs', 'drink', 'mage', 'fage' as predictors
# and 'low_bwght' as the target variable.

# Selecting the features for the model
features = ['omaps', 'fmaps', 'meduc', 'feduc', 'cigs', 'drink', 'mage', 'fage']
target = 'low_bwght'

# Splitting the data into features (X) and target (y)
X = numeric_birthweight_data[features]
y = numeric_birthweight_data[target]

missing_values_after_second_imputation, X.head(), y.head()

(mage         0
 meduc        0
 monpre       0
 npvis        0
 fage         1
 feduc        0
 omaps        0
 fmaps        0
 cigs         0
 drink        0
 male         0
 mwhte        0
 mblck        0
 moth         0
 fwhte        0
 fblck        0
 foth         0
 bwght        0
 low_bwght    0
 dtype: int64,
    omaps  fmaps      meduc      feduc      cigs    drink  mage  fage
 0    8.0    9.0  12.000000  17.000000  0.000000  0.00000    28  31.0
 1    8.0    9.0  13.655941  13.902743  1.194226  0.02356    21  21.0
 2    9.0    9.0  15.000000  16.000000  0.000000  0.00000    27  32.0
 3    9.0   10.0  17.000000  17.000000  0.000000  0.00000    33  39.0
 4    9.0    9.0  15.000000  16.000000  1.194226  0.02356    30  36.0,
 0    0
 1    1
 2    0
 3    0
 4    0
 Name: low_bwght, dtype: int64)

MODEL EVALUATION

In [31]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Since we have one missing value in 'fage', we'll drop that row
X = X.dropna()
y = y[X.index]

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the models
logreg_model = LogisticRegression(max_iter=1000, random_state=42)
rf_model = RandomForestClassifier(random_state=42)
gbm_model = GradientBoostingClassifier(random_state=42)

# Fit the models on the training data
logreg_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)
gbm_model.fit(X_train, y_train)

# Make predictions on the validation set
logreg_pred = logreg_model.predict(X_val)
rf_pred = rf_model.predict(X_val)
gbm_pred = gbm_model.predict(X_val)

# Calculate precision scores for the validation set
logreg_precision = precision_score(y_val, logreg_pred)
rf_precision = precision_score(y_val, rf_pred)
gbm_precision = precision_score(y_val, gbm_pred)

# Calculate confusion matrices for the models
logreg_cm = confusion_matrix(y_val, logreg_pred)
rf_cm = confusion_matrix(y_val, rf_pred)
gbm_cm = confusion_matrix(y_val, gbm_pred)

# Store precision scores and confusion matrices in a dictionary for comparison
model_performance = {
    'Logistic Regression': {'Precision': logreg_precision, 'Confusion Matrix': logreg_cm},
    'Random Forest': {'Precision': rf_precision, 'Confusion Matrix': rf_cm},
    'GBM': {'Precision': gbm_precision, 'Confusion Matrix': gbm_cm}
}

model_performance

{'Logistic Regression': {'Precision': 0.3333333333333333,
  'Confusion Matrix': array([[72,  2],
         [ 7,  1]])},
 'Random Forest': {'Precision': 0.0,
  'Confusion Matrix': array([[72,  2],
         [ 8,  0]])},
 'GBM': {'Precision': 0.0,
  'Confusion Matrix': array([[69,  5],
         [ 8,  0]])}}
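The precision values above can be checked by hand from the confusion matrices, and recall can be read off the same cells. The sketch below copies the three reported matrices and assumes scikit-learn's default layout for labels 0 and 1, `[[TN, FP], [FN, TP]]`; the helper name `precision_recall` is illustrative and not part of the notebook.

```python
# Confusion matrices copied from the output above, laid out as [[TN, FP], [FN, TP]]
reported = {
    "Logistic Regression": [[72, 2], [7, 1]],
    "Random Forest": [[72, 2], [8, 0]],
    "GBM": [[69, 5], [8, 0]],
}


def precision_recall(cm):
    """Return (precision, recall) for the positive class of a 2x2 matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for name, cm in reported.items():
    p, r = precision_recall(cm)
    print(f"{name}: precision={p:.3f}, recall={r:.3f}")
```

For Logistic Regression this gives a precision of 1/3 and a recall of 1/8: one of the three predicted low birthweight cases was correct, and one of the eight actual low birthweight cases was found.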

FINAL MODEL SELECTION

In [34]:
# Mean-imputing the same columns as in the training set (using the test set's own column means)
for column in ['meduc', 'npvis', 'feduc', 'cigs', 'drink']:
    kaggle_test_df[column] = kaggle_test_df[column].fillna(kaggle_test_df[column].mean())

# Impute missing values for 'omaps' and 'fmaps' in the test set with the mode
# from the training set
for column in ['omaps', 'fmaps']:
    kaggle_test_df[column] = kaggle_test_df[column].fillna(birthweight_data[column].mode()[0])

# Checking that there are no missing values before predictions
assert kaggle_test_df[features].isnull().sum().sum() == 0

# Select the same features for the test set
X_test = kaggle_test_df[features]

# Predict using the Logistic Regression model
test_predictions = logreg_model.predict(X_test)

# Prepare the submission dataframe
submission_df = kaggle_test_df[['bwt_id']].copy()
submission_df['low_bwght'] = test_predictions

# Path to save the submission file
submission_file_path = './birthweight_prediction_K7.csv'
submission_df.to_csv(submission_file_path, index=False)

submission_file_path, submission_df.head()

('./birthweight_prediction_K7.csv',
    bwt_id  low_bwght
 0  bwt_14          0
 1  bwt_16          0
 2  bwt_24          0
 3  bwt_30          0
 4  bwt_57          0)
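The cell above fills the mean-imputed test columns from the test set's own means, while the mode comes from the training data. A common alternative is to compute every fill value once on the training data and reuse it for both sets. The sketch below shows that pattern on two tiny, made-up frames; `train`, `test` and `fill_values` are hypothetical names and the numbers are not from the birthweight data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frames standing in for the training and test sets
train = pd.DataFrame({"meduc": [12.0, np.nan, 16.0], "omaps": [8.0, 9.0, 9.0]})
test = pd.DataFrame({"meduc": [np.nan, 14.0], "omaps": [np.nan, 8.0]})

# Fill values are learned from the training data only
fill_values = {
    "meduc": train["meduc"].mean(),     # mean for a continuous column
    "omaps": train["omaps"].mode()[0],  # mode for a score column
}

# The same values are then applied to both frames
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(test_filled)
```

With this approach the test set never influences the imputation, which keeps the preprocessing identical to what the model saw during training.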

ANALYSIS QUESTION NO. 1
Are there any strong positive or strong negative linear (Pearson) correlations with birthweight? Answer this question based on the original, continuous form of birthweight. (minimum 5 sentences)

The exploratory data analysis (EDA) shows no strong linear (Pearson) correlation, positive or
negative, between any feature and continuous birthweight (`bwght`). The largest coefficients
belong to the Apgar scores at one and five minutes post-birth (`omaps`, r = 0.41; `fmaps`,
r = 0.38), which are moderate positive correlations: higher birthweights are associated with
better immediate health status. The number of prenatal visits (`npvis`, r = 0.18) has a weak
positive relationship with birthweight, which is consistent with regular healthcare in
pregnancy supporting fetal growth. Smoking (`cigs`, r = -0.04) and alcohol consumption
(`drink`, r = -0.03) are negatively correlated with birthweight, but both coefficients are
close to zero, so this dataset shows little linear association for these behaviors. The most
negative coefficient (`foth`, r = -0.07) is also negligible. These findings point toward the
importance of good health practices and regular medical care during pregnancy, although the
linear evidence in this dataset is modest.
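As a rough check on the word "strong", the reported coefficients can be binned by magnitude. The cutoffs below (|r| >= 0.7 strong, |r| >= 0.3 moderate, otherwise weak) are a common rule of thumb and an assumption here, not something defined in the assignment; the values are copied from the Pearson output above.

```python
# Selected Pearson coefficients with 'bwght', copied from the notebook output
reported_r = {
    "omaps": 0.405356,
    "fmaps": 0.382021,
    "npvis": 0.180625,
    "fage": 0.138822,
    "drink": -0.029008,
    "cigs": -0.041545,
    "foth": -0.074563,
}


def strength(r, strong=0.7, moderate=0.3):
    """Label a correlation by absolute size (rule-of-thumb cutoffs, assumed)."""
    size = abs(r)
    if size >= strong:
        return "strong"
    if size >= moderate:
        return "moderate"
    return "weak"


labels = {name: strength(r) for name, r in reported_r.items()}
for name, label in labels.items():
    print(f"{name:>6}: r = {reported_r[name]:+.3f} ({label})")
```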

ANALYSIS QUESTION NO. 2
Is there an official threshold that signifies when birthweight gets more dangerous? In other words, is there a cutoff point between a healthy birthweight and a non-healthy birthweight? Provide credible sources as necessary. (minimum 5 sentences)

According to UNICEF and the World Health Organization (WHO), low birthweight is defined as a birthweight of less than 2,500 grams (approximately 5 pounds, 8 ounces). This threshold is considered significant because it identifies newborns who may be at higher risk of early growth retardation, infectious disease, developmental delays, and even death in severe cases. The delineation between a healthy and a non-healthy birthweight essentially revolves around this cutoff point. Babies born below this threshold may require additional medical attention and interventions to support their development and overall health. While this threshold is globally recognized, individual health considerations and circumstances also affect newborn outcomes, so personalized medical advice is crucial for infants born close to or below this weight.

Reference: 

UNICEF-WHO Joint Database on Low birth weight. (http://data.unicef.org/nutrition/low-birthweight; https://www.who.int/nutgrowthdb/lbw-estimates).

ANALYSIS QUESTION NO. 3
After transforming birthweight (bwght) using this threshold, did correlations and/or phi coefficients improve? Why or why not? (minimum 5 sentences)

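Transforming the continuous birthweight variable (bwght) into a binary variable using the 2,500-gram threshold changes the nature of the analysis from examining linear relationships to assessing association and classification accuracy. For binary features, the Pearson coefficient with the binary target is the phi coefficient, a measure of association between two binary variables, which can be interpreted more directly in the context of risk factors or predictors of low birthweight. Because low_bwght is coded 1 for low birthweight, the sign flips for most features, so the comparison should be made on absolute values. In this dataset the strongest correlations did not improve: omaps moved from 0.41 to -0.39, fmaps from 0.38 to -0.33, and npvis from 0.18 to -0.13. A few weak features gained slightly in magnitude, for example cigs (from -0.04 to 0.06) and mwhte (from 0.02 to 0.06), but all of them remain close to zero. This is expected, because collapsing birthweight into two categories discards the information about how far each birthweight lies from the threshold. A feature's relationship with the binary classification would be more pronounced only if that feature were a critical determinant of low birthweight cases specifically.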
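The link between Pearson correlation and the phi coefficient can be illustrated with a small worked example: for two 0/1 variables, phi computed from the 2x2 table of counts equals the Pearson coefficient of the raw values. The data below are made up for illustration, and `pearson` and `phi` are hypothetical helper names.

```python
import math


def pearson(x, y):
    """Pearson correlation computed from its definition."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def phi(x, y):
    """Phi coefficient from the 2x2 table of two binary variables."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    numerator = n11 * n00 - n10 * n01
    denominator = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return numerator / denominator


# Hypothetical binary feature and binary low birthweight flag
binary_feature = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
low_bwght = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

print(round(phi(binary_feature, low_bwght), 4))
print(round(pearson(binary_feature, low_bwght), 4))
```

Both functions return the same value on binary inputs, which is why `DataFrame.corr()` can be read as phi for the 0/1 columns in the notebook.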

ANALYSIS QUESTION NO. 4
Which two features in your machine learning model had the largest impact on birthweight? Present one actionable insight for each of these. (minimum 5 sentences per feature)

If `npvis` had the largest positive coefficient, it would suggest that increased prenatal care is strongly associated with higher birthweights, indicating the need to make prenatal services more accessible and to encourage regular attendance. Conversely, if `cigs` had a significant negative coefficient, this would indicate that smoking during pregnancy is a strong predictor of lower birthweights, necessitating robust smoking cessation programs targeted at expectant mothers. Efforts to educate expectant mothers on the dangers of smoking while pregnant and to provide support for quitting would be critical. These two features, prenatal care and smoking, could become focal points for public health initiatives to improve birth outcomes. Tailored interventions based on these insights could lead to significant improvements in neonatal health.
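One way to see which features carry the most weight in a logistic regression is to standardize the predictors and rank the fitted coefficients by absolute value. The sketch below uses synthetic data with hypothetical feature names (`feat_a`, `feat_b`, `feat_c`); it is not the notebook's model and does not reproduce its results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_rows = 500

# Synthetic predictors; the names are placeholders, not dataset columns
feature_names = ["feat_a", "feat_b", "feat_c"]
X = rng.normal(size=(n_rows, 3))

# Synthetic target: strong negative effect of feat_a, weak positive effect of feat_b
logits = -2.0 * X[:, 0] + 0.3 * X[:, 1]
y = (rng.random(n_rows) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Standardizing puts the coefficients on a comparable scale
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=1000).fit(X_scaled, y)

# Rank features by the absolute size of their coefficient
ranked = sorted(zip(feature_names, model.coef_[0]), key=lambda pair: abs(pair[1]), reverse=True)
for name, coef in ranked:
    print(f"{name}: {coef:+.3f}")
```

The same ranking could be applied to a model fitted on the notebook's `features` list, provided the predictors are scaled first so that coefficient sizes are comparable.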

ANALYSIS QUESTION NO. 5
Present your final model's confusion matrix and explain what each error means (false positives and false negatives). Furthermore, explain which error is being controlled for given the cohort's focus on correctly predicting low birthweight, as well as why this error is more important to control than the other error. (minimum 5 sentences)

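The final model is the Logistic Regression, whose validation confusion matrix is [[72, 2], [7, 1]]: 72 true negatives, 2 false positives, 7 false negatives, and 1 true positive. A False Positive is an infant of normal birthweight who is flagged as low birthweight, while a False Negative is a low birthweight infant who is predicted to be normal. The primary aim is to accurately predict instances of low birthweight, making the minimization of False Negatives the most critical aspect. A False Negative implies that an infant who is actually at risk might miss out on crucial, immediate care and intervention, posing a significant risk to the infant's health. Therefore, it is more vital to decrease the occurrence of false negatives than false positives, as the ramifications of failing to detect a low birthweight baby far outweigh the inconvenience of extra examinations for a baby mistakenly flagged as at risk. On the validation set the model missed 7 of the 8 low birthweight infants, so this error remains the main weakness to address.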
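One common way to trade false negatives against false positives is to lower the probability threshold at which a case is flagged as low birthweight. The sketch below illustrates the effect on made-up labels and predicted probabilities; `count_errors` is a hypothetical helper, and the numbers are not outputs of the notebook's model.

```python
# Hypothetical true labels (1 = low birthweight) and predicted probabilities
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.30, 0.40, 0.55, 0.80]


def count_errors(labels, probabilities, threshold):
    """Return (false_positives, false_negatives) when flagging p >= threshold."""
    false_pos = sum(1 for t, p in zip(labels, probabilities) if t == 0 and p >= threshold)
    false_neg = sum(1 for t, p in zip(labels, probabilities) if t == 1 and p < threshold)
    return false_pos, false_neg


for threshold in (0.5, 0.3):
    false_pos, false_neg = count_errors(y_true, y_prob, threshold)
    print(f"threshold={threshold}: false positives={false_pos}, false negatives={false_neg}")
```

In this toy example, moving the threshold from 0.5 to 0.3 removes both false negatives at the cost of two additional false positives, which mirrors the trade-off described above.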