# BUILDING A MACHINE LEARNING MODEL

This is the third and final part of the analysis of my low birth weight dataset. The dataset has already been cleaned, new features created, and the data statistically explored and analyzed. This notebook primarily deals with building a machine learning model to predict the likelihood of a woman delivering a low birth weight (LBW) baby.

***To do:***
* Import and preprocess the data for our machine learning model
* Normalize the quantitative columns and one-hot encode the qualitative columns
* Build the model
* Test and evaluate the model

## Importing the dataset

In [2]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC





In [3]:
df = pd.read_csv('Cleaned2 - Low Birth Weight.csv')
df.head()

Unnamed: 0,MATERNALAGE,LEVELOFEDUCATION,OCCUPATION,GRAVIDITY,PARITY,NO.ANTENALVISITS,HB_Delivery,HEPATITISBSTATUS,SYPHILLISSTATUS,RETROSTATUS,...,BABYLENGTH,HEADCIRCUMFERENCE,NICUADMISSION,RESPIRATORYDISTRESS,STILLBIRTH,IUGR,NEONATALOUTCOME,CAT_MATERNALAGE,CAT_GRAVIDITY,CAT_PARITY
0,18.0,Secondary,Self employed,1.0,0.0,11.0,10.598487,Non Reactive,Non Reactive,Non Reactive,...,49.210993,33.162943,No,No,No,No,Alive,0-20,1-1,0-0
1,31.0,Illiterate,Unemployed,3.0,2.0,7.113177,10.598487,Non Reactive,Non Reactive,Non Reactive,...,54.0,33.0,No,Yes,No,No,Alive,21-35,3-9,2-9
2,20.0,Secondary,Unemployed,2.0,0.0,4.0,10.9,Non Reactive,Non Reactive,Non Reactive,...,49.210993,33.162943,No,No,No,No,Alive,0-20,2-2,0-0
3,19.0,Secondary,Self employed,1.0,0.0,2.0,8.6,Non Reactive,Non Reactive,Non Reactive,...,49.0,30.0,No,No,No,No,Alive,0-20,1-1,0-0
4,32.0,Tertiary,Civil Servant,4.0,3.0,8.0,11.5,Non Reactive,Non Reactive,Non Reactive,...,45.0,35.0,No,No,No,No,Alive,21-35,3-9,2-9


## Preprocessing data for machine learning model

Let's get rid of some columns, namely 'NICUADMISSION', 'RESPIRATORYDISTRESS', 'STILLBIRTH', and 'IUGR'.
These columns are recorded after the birth of a child and cannot be used to predict LBW.

Other numerical columns we need to get rid of include 'SBPAFTERDELIVERY', 'DBPAFTERDELIVERY', 'BIRTHWEIGHT', 'APGARAT1MIN', 'APGARAT5MIN', 'BABYLENGTH', and 'HEADCIRCUMFERENCE'.
These measurements are also taken after the birth of a child and cannot be used to predict LBW.


In [4]:
df = df.drop(columns=[
    'NICUADMISSION', 'RESPIRATORYDISTRESS', 'STILLBIRTH', 'IUGR', 'LBW',
    'SBPAFTERDELIVERY', 'DBPAFTERDELIVERY', 'BIRTHWEIGHT', 'APGARAT1MIN',
    'APGARAT5MIN', 'BABYLENGTH', 'HEADCIRCUMFERENCE'
])


In [5]:
df.head()

Unnamed: 0,MATERNALAGE,LEVELOFEDUCATION,OCCUPATION,GRAVIDITY,PARITY,NO.ANTENALVISITS,HB_Delivery,HEPATITISBSTATUS,SYPHILLISSTATUS,RETROSTATUS,...,AntepartumHemorrhage,Postpartumhemorrhage,ECLAMPSIA,SEVEREPREECLAMPSIA,BABYSEX,LOWBIRTHWEIGHT,NEONATALOUTCOME,CAT_MATERNALAGE,CAT_GRAVIDITY,CAT_PARITY
0,18.0,Secondary,Self employed,1.0,0.0,11.0,10.598487,Non Reactive,Non Reactive,Non Reactive,...,No,No,No,No,Male,Normal Birth Weight,Alive,0-20,1-1,0-0
1,31.0,Illiterate,Unemployed,3.0,2.0,7.113177,10.598487,Non Reactive,Non Reactive,Non Reactive,...,No,No,No,No,Male,Normal Birth Weight,Alive,21-35,3-9,2-9
2,20.0,Secondary,Unemployed,2.0,0.0,4.0,10.9,Non Reactive,Non Reactive,Non Reactive,...,No,No,No,No,Male,Normal Birth Weight,Alive,0-20,2-2,0-0
3,19.0,Secondary,Self employed,1.0,0.0,2.0,8.6,Non Reactive,Non Reactive,Non Reactive,...,No,No,No,No,Female,Low Birth Weight,Alive,0-20,1-1,0-0
4,32.0,Tertiary,Civil Servant,4.0,3.0,8.0,11.5,Non Reactive,Non Reactive,Non Reactive,...,No,No,No,No,Male,Normal Birth Weight,Alive,21-35,3-9,2-9


***Let's separate the target label 'LOWBIRTHWEIGHT' from the other features***

In [6]:
# Get your X and Y data
Y = df['LOWBIRTHWEIGHT']

X = df.drop('LOWBIRTHWEIGHT', axis=1)

Let's check and encode the target values to 0 and 1

In [7]:
np.unique(Y, return_counts=True)

(array(['Low Birth Weight', 'Normal Birth Weight'], dtype=object),
 array([ 286, 1070], dtype=int64))

In [8]:
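# Map the labels explicitly so that 1 always means low birth weight,
# regardless of which label happens to appear first in the data
Y = Y.map({'Normal Birth Weight': 0, 'Low Birth Weight': 1})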

In [9]:
Y.unique()

array([0, 1], dtype=int64)

## Normalization and One-Hot Encoding



***Normalizing Categorical Features Through One-Hot Encoding***

In [10]:
numerical_columns = X.select_dtypes(include=['number']).columns
categorical_columns = X.select_dtypes(exclude=['number']).columns

In [11]:
# Let's categorize each column as either categorical or numerical
categorical_columns = ['CAT_MATERNALAGE', 'LEVELOFEDUCATION', 'OCCUPATION', 'CAT_GRAVIDITY', 'CAT_PARITY',
             'HEPATITISBSTATUS', 'SYPHILLISSTATUS', 'RETROSTATUS', 'BLOODGROUP', 
             'PTDlt37WEEKS', 'MODEOFDELIVERY', 'MATERNALOUTCOME', 'AntepartumHemorrhage', 'Postpartumhemorrhage', 
             'ECLAMPSIA', 'SEVEREPREECLAMPSIA', 'BABYSEX', 'NEONATALOUTCOME']

numerical_columns = ['MATERNALAGE', 'GRAVIDITY', 'PARITY', 'NO.ANTENALVISITS', 'HB_Delivery', 'GESTATIONALAGE', 
               'SBPBEFOREDELIVERY', 'DBPBEFOREDELIVERY']


In [12]:
# Let's use get_dummies to create our one-hot encoded values
X_encoded = pd.get_dummies(X, columns=categorical_columns)
X_encoded.head()

Unnamed: 0,MATERNALAGE,GRAVIDITY,PARITY,NO.ANTENALVISITS,HB_Delivery,GESTATIONALAGE,SBPBEFOREDELIVERY,DBPBEFOREDELIVERY,CAT_MATERNALAGE_0-20,CAT_MATERNALAGE_21-35,...,Postpartumhemorrhage_No,Postpartumhemorrhage_Yes,ECLAMPSIA_No,ECLAMPSIA_Yes,SEVEREPREECLAMPSIA_No,SEVEREPREECLAMPSIA_Yes,BABYSEX_Female,BABYSEX_Male,NEONATALOUTCOME_Alive,NEONATALOUTCOME_Dead
0,18.0,1.0,0.0,11.0,10.598487,39,90.0,60.0,1,0,...,1,0,1,0,1,0,0,1,1,0
1,31.0,3.0,2.0,7.113177,10.598487,38,110.0,80.0,0,1,...,1,0,1,0,1,0,0,1,1,0
2,20.0,2.0,0.0,4.0,10.9,38,100.0,70.0,1,0,...,1,0,1,0,1,0,0,1,1,0
3,19.0,1.0,0.0,2.0,8.6,38,100.0,70.0,1,0,...,1,0,1,0,1,0,1,0,1,0
4,32.0,4.0,3.0,8.0,11.5,39,127.0,70.0,0,1,...,1,0,1,0,1,0,0,1,1,0


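One-hot encoding increases the number of columns, because each categorical column is replaced by one indicator column per category. The train/test split further down reports 57 columns after encoding.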
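To make the column expansion concrete, here is a minimal sketch on a toy frame. The values are made up for illustration; only the column names echo the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names echo the dataset
toy = pd.DataFrame({
    "MATERNALAGE": [18.0, 31.0, 20.0],
    "BABYSEX": ["Male", "Male", "Female"],
    "OCCUPATION": ["Self employed", "Unemployed", "Unemployed"],
})

# Each listed column is replaced by one indicator column per category
toy_encoded = pd.get_dummies(toy, columns=["BABYSEX", "OCCUPATION"])

print(toy.shape, "->", toy_encoded.shape)
print(list(toy_encoded.columns))
```

Two categorical columns with two categories each become four indicator columns, while the numerical column passes through unchanged.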

***Normalizing Numerical Features***

In [13]:
# Create a Min-Max scaler and rescale each numerical column to the [0, 1] range
scaler = MinMaxScaler()
X_encoded[numerical_columns] = scaler.fit_transform(X[numerical_columns])

X_encoded.head()

Unnamed: 0,MATERNALAGE,GRAVIDITY,PARITY,NO.ANTENALVISITS,HB_Delivery,GESTATIONALAGE,SBPBEFOREDELIVERY,DBPBEFOREDELIVERY,CAT_MATERNALAGE_0-20,CAT_MATERNALAGE_21-35,...,Postpartumhemorrhage_No,Postpartumhemorrhage_Yes,ECLAMPSIA_No,ECLAMPSIA_Yes,SEVEREPREECLAMPSIA_No,SEVEREPREECLAMPSIA_Yes,BABYSEX_Female,BABYSEX_Male,NEONATALOUTCOME_Alive,NEONATALOUTCOME_Dead
0,0.135135,0.0,0.0,0.6875,0.082588,0.764706,0.417989,0.253012,1,0,...,1,0,1,0,1,0,0,1,1,0
1,0.486486,0.285714,0.25,0.444574,0.082588,0.705882,0.52381,0.373494,0,1,...,1,0,1,0,1,0,0,1,1,0
2,0.189189,0.142857,0.0,0.25,0.085237,0.705882,0.470899,0.313253,1,0,...,1,0,1,0,1,0,0,1,1,0
3,0.162162,0.0,0.0,0.125,0.065026,0.705882,0.470899,0.313253,1,0,...,1,0,1,0,1,0,1,0,1,0
4,0.513514,0.428571,0.375,0.5,0.09051,0.764706,0.613757,0.313253,0,1,...,1,0,1,0,1,0,0,1,1,0
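Min-max scaling maps each value to (x - min) / (max - min). The cell above fits the scaler on the full dataset before splitting. A common alternative, sketched below with hypothetical ages rather than the real data, is to fit on the training rows only and reuse those minimum and maximum values for the test rows, so that no information from the test set reaches the model.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical maternal ages; not taken from the dataset
train_ages = np.array([[15.0], [20.0], [35.0]])
test_ages = np.array([[18.0], [40.0]])

demo_scaler = MinMaxScaler()
train_scaled = demo_scaler.fit_transform(train_ages)  # learns min and max from training rows
test_scaled = demo_scaler.transform(test_ages)        # reuses the training min and max

# The same transformation written out by hand: (x - min) / (max - min)
manual = (train_ages - train_ages.min()) / (train_ages.max() - train_ages.min())

print(train_scaled.ravel())
print(manual.ravel())
print(test_scaled.ravel())
```

Note that a test value outside the training range (40 here) lands outside [0, 1], which is expected when the scaler has not seen the test rows.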


In [14]:
X_train, X_test, Y_train, Y_test = train_test_split(X_encoded, Y, test_size=0.3, random_state=2)

In [15]:
X_train.shape, Y_train.shape, X_test.shape, Y_test.shape

((949, 57), (949,), (407, 57), (407,))

Let's check how many samples of each class ended up in the training and testing labels

In [16]:
np.unique(Y_train, return_counts=True), np.unique(Y_test, return_counts=True)

((array([0, 1], dtype=int64), array([752, 197], dtype=int64)),
 (array([0, 1], dtype=int64), array([318,  89], dtype=int64)))
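The counts above show slightly different positive rates in the two splits (197/949 in training versus 89/407 in testing). If matching proportions are wanted, `train_test_split` accepts a `stratify` argument. The sketch below uses synthetic labels with the same overall class counts as the notebook; the feature column is only a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the notebook (1070 vs 286)
y_demo = np.array([0] * 1070 + [1] * 286)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# stratify keeps the class proportions (nearly) equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2, stratify=y_demo
)

print("overall positive rate:", y_demo.mean())
print("train positive rate:  ", y_tr.mean())
print("test positive rate:   ", y_te.mean())
```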

## Model Training: LogisticRegression

In [17]:
model = LogisticRegression(penalty='l2', C=0.5, max_iter=1000)

In [18]:
model.fit(X_train, Y_train)

In [19]:
model.score(X_train, Y_train), model.score(X_test, Y_test)

(0.8229715489989463, 0.8132678132678133)
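Because the classes are imbalanced, it helps to compare this accuracy with a trivial baseline that predicts class 0 for everyone. The short calculation below uses only the test-set counts and the test score already reported in this notebook.

```python
# Test-set class counts as reported by np.unique(Y_test, return_counts=True)
n_negative, n_positive = 318, 89

# Accuracy of a trivial model that predicts class 0 for every woman
baseline_accuracy = n_negative / (n_negative + n_positive)

# model.score(X_test, Y_test) as reported above
model_accuracy = 0.8132678132678133

print(f"majority-class baseline: {baseline_accuracy:.4f}")
print(f"logistic regression:     {model_accuracy:.4f}")
```

The gap between the two numbers is small, which is one reason to look at sensitivity and specificity rather than accuracy alone.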

### Checking The Model's Sensitivity And Specificity

Let's create a function to check the sensitivity and specificity 

In [20]:
def evaluation_score(labels, predictions):
    """Return (sensitivity, specificity), treating 1 as the positive class."""
    # Count the positive (1) and negative (0) labels explicitly;
    # value_counts() orders by frequency, not by label value
    num_pos = sum(1 for label in labels if label == 1)
    num_neg = sum(1 for label in labels if label == 0)

    # Go through labels and predictions and count the correct ones per class
    true_pos = 0
    true_neg = 0
    for label, prediction in zip(labels, predictions):
        if prediction == label and label == 1:
            true_pos += 1
        elif prediction == label and label == 0:
            true_neg += 1

    # Divide by the class sizes so both sensitivity and specificity lie between 0 and 1
    sensitivity = true_pos / num_pos
    specificity = true_neg / num_neg

    return sensitivity, specificity
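As a cross-check, the same two quantities can be read off scikit-learn's `confusion_matrix`. This is a sketch on hypothetical labels and predictions; the helper name `sensitivity_specificity` is made up for illustration and is not part of the notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def sensitivity_specificity(labels, predictions):
    # labels=[0, 1] fixes the matrix layout to [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(labels, predictions, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels and predictions, for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```

Here two of the four positives are caught (sensitivity 0.5) and five of the six negatives are correctly left alone (specificity 5/6).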

In [21]:
Y_predict = model.predict(X_test)

evaluation_score(Y_test, Y_predict)

(0.30337078651685395, 0.9559748427672956)

Our LogisticRegression model has a sensitivity of 30.34% with a specificity of 95.6% 

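With such a high specificity, the model rarely raises a false alarm: about 96% of the women who delivered a normal-weight baby are classified correctly. Were it to be deployed in hospitals, a positive prediction could help doctors and nurses focus their attention on women who need their help. The low sensitivity, however, means the model misses roughly 70% of LBW cases, so a negative prediction on its own should not be read as low risk.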

In [22]:
Y_train_predict = model.predict(X_train)
evaluation_score(Y_train, Y_train_predict)

(0.22842639593908629, 0.9787234042553191)
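The predictions above come from `model.predict`, which for a binary LogisticRegression corresponds to a 0.5 cut-off on the predicted probability. One option this notebook does not explore is lowering that cut-off, which trades specificity for sensitivity. The sketch below illustrates the trade-off on synthetic data generated with `make_classification`, not on the LBW dataset, so the numbers it prints say nothing about the real model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 20% positives); not the LBW dataset
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=2, stratify=y_syn
)

clf = LogisticRegression(penalty='l2', C=0.5, max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # estimated probability of class 1

def rates(threshold):
    # Predict class 1 whenever the estimated probability reaches the threshold
    pred = (proba >= threshold).astype(int)
    sensitivity = ((pred == 1) & (y_te == 1)).sum() / (y_te == 1).sum()
    specificity = ((pred == 0) & (y_te == 0)).sum() / (y_te == 0).sum()
    return sensitivity, specificity

for t in (0.5, 0.3):
    print(t, rates(t))
```

Lowering the threshold can only add positive predictions, so sensitivity cannot fall and specificity cannot rise; how much each moves depends on the data.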

## Model Training: Support Vector Machine

In [23]:
modelsvm = SVC(kernel='linear')

modelsvm.fit(X_train, Y_train)

In [24]:
modelsvm.score(X_train, Y_train), modelsvm.score(X_test, Y_test)

(0.8071654373024236, 0.8083538083538083)

In [25]:
Y_predict = modelsvm.predict(X_test)

evaluation_score(Y_test, Y_predict)

(0.4157303370786517, 0.9182389937106918)

In [26]:
Y_train_predict = modelsvm.predict(X_train)
evaluation_score(Y_train, Y_train_predict)

(0.29441624365482233, 0.9414893617021277)

Using a support vector machine, the sensitivity of our model rises to 41.57%, while the specificity remains high at 91.82%.

With only a modest reduction in specificity (from 95.6% to 91.82%), I believe this is a better model than the LogisticRegression model.
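One way to put the two models on a single scale is balanced accuracy, the mean of sensitivity and specificity. This metric is my addition for illustration and is not used elsewhere in the notebook; the inputs are the test-set values reported above.

```python
# (sensitivity, specificity) on the test set, copied from the outputs above
results = {
    "LogisticRegression": (0.30337078651685395, 0.9559748427672956),
    "SVC (linear)": (0.4157303370786517, 0.9182389937106918),
}

# Balanced accuracy weights both classes equally, regardless of class size
scores = {
    name: (sensitivity + specificity) / 2
    for name, (sensitivity, specificity) in results.items()
}

for name, score in scores.items():
    print(f"{name}: balanced accuracy = {score:.4f}")
```

On this measure the linear SVC comes out ahead (about 0.667 versus 0.630), which agrees with the comparison above.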

***NOTE*** Both models appear to be less sensitive on the training set than they are on the testing set. The LogisticRegression model's overall accuracy is slightly higher on the training set, while the SVM's accuracy is almost identical on both sets.