# Low Birth Weight (LBW)

### ***This is a statistical analysis with a machine learning algorithm to understand the factors that lead to women delivering low birth weight babies in Ghana.***

The logistic regression machine learning algorithm can help doctors and midwives quickly analyze the data normally recorded for pregnant women and predict the likelihood that a patient will deliver a baby of low birth weight.

## About the data

This dataset was obtained from a graduate nursing student's study to identify the factors that lead to women delivering babies of low birth weight in Ghana. I want to analyze this data because it is meaningful to me. It contains many variables that are normally collected as part of the routine examination of pregnant women before, during, and after delivery. The paper on this study (which is under peer review) contains many interesting details. I want to recreate its statistical findings with Python, and also build a machine learning model to estimate the probability of delivering a child of low birth weight.

## This workbook is in 3 parts
* ***Data Cleaning and Feature Engineering***
* ***Statistical Analysis***
* ***Building the Machine Learning Model***

### To do:
* Data cleaning 
* Feature engineering

Importing dependencies for the statistical analysis.

In [1]:
import numpy as np
import pandas as pd
import random

In [2]:
df = pd.read_excel('14_JUNE_Low Birth Weight.xlsx')
df.head()

Unnamed: 0,MATERNALAGE,LEVELOFEDUCATION,OCCUPATION,GRAVIDITY,PARITY,NO.ANTENALVISITS,HB_Delivery,HEPATITISBSTATUS,SYPHILLISSTATUS,RETROSTATUS,...,BIRTHWEIGHT,APGARAT1MIN,APGARAT5MIN,BABYLENGTH,HEADCIRCUMFERENCE,NICUADMISSION,RESPIRATORYDISTRESS,STILLBIRTH,IUGR,NEONATALOUTCOME
0,18.0,Secondary,Self employed,1.0,0.0,11.0,,Non Reactive,Non Reactive,Non Reactive,...,2.6,8.0,9.0,,,No,No,No,No,Alive
1,31.0,Illiterate,Unemployed,3.0,2.0,,,Non Reactive,Non Reactive,Non Reactive,...,3.2,7.0,8.0,54.0,33.0,No,Yes,No,No,Alive
2,20.0,Secondary,Unemployed,2.0,0.0,4.0,10.9,Non Reactive,,,...,2.8,7.0,8.0,,,No,No,No,No,Alive
3,19.0,Secondary,Self employed,1.0,0.0,2.0,8.6,,,Non Reactive,...,2.4,7.0,8.0,49.0,30.0,No,No,No,No,Alive
4,32.0,Tertiary,Civil Servant,4.0,3.0,8.0,11.5,Non Reactive,Non Reactive,Non Reactive,...,2.9,8.0,9.0,45.0,35.0,No,No,No,No,Alive


In [3]:
df.shape

(1356, 36)

## Data Cleaning

Right from the start, we can see that this dataset has many NULL values. Let's explore.

Let's count the total number of null entries under each variable

In [4]:
df.isnull().sum() 

MATERNALAGE              11
LEVELOFEDUCATION         29
OCCUPATION               48
GRAVIDITY                16
PARITY                   19
NO.ANTENALVISITS        119
HB_Delivery             457
HEPATITISBSTATUS        362
SYPHILLISSTATUS         454
RETROSTATUS             103
BLOODGROUP              416
GESTATIONALAGE            0
PTDlt37WEEKS              0
MODEOFDELIVERY            6
SBPBEFOREDELIVERY       318
DBPBEFOREDELIVERY       318
SBPAFTERDELIVERY        552
DBPAFTERDELIVERY        552
MATERNALOUTCOME           9
AntepartumHemorrhage      7
Postpartumhemorrhage      6
ECLAMPSIA                 6
SEVEREPREECLAMPSIA        7
BABYSEX                  20
LBW                      45
LOWBIRTHWEIGHT           45
BIRTHWEIGHT              45
APGARAT1MIN              24
APGARAT5MIN              24
BABYLENGTH              228
HEADCIRCUMFERENCE       228
NICUADMISSION            30
RESPIRATORYDISTRESS      19
STILLBIRTH               15
IUGR                    131
NEONATALOUTCOME     

Honestly, some of these columns have way too many missing values, especially the blood pressure readings recorded before and after delivery ['SBPBEFOREDELIVERY', 'DBPBEFOREDELIVERY', 'SBPAFTERDELIVERY', 'DBPAFTERDELIVERY'], which are missing in 318 and 552 of the 1356 rows respectively. We could drop some of these columns entirely, but we won't know how that affects the analysis without first trying it.

So, we'll create a function that can be used to drop some columns just in case that becomes important. For now, we'll proceed with the data as is.

In [5]:
# Function to drop columns given the maximum number of missing values allowed. 

def drop_columns(df, num):
    drop_cols = pd.DataFrame(df.isnull().sum(), columns=['Null Count'])
    drop_cols.reset_index(inplace=True)

    drop_cols = drop_cols[drop_cols['Null Count'] > num]
    df = df.drop(columns=drop_cols['index'])

    return df, drop_cols

In [6]:
df2, drop_cols = drop_columns(df, 200)
df2.shape, drop_cols

((1356, 26),
                 index  Null Count
 6         HB_Delivery         457
 7    HEPATITISBSTATUS         362
 8     SYPHILLISSTATUS         454
 10         BLOODGROUP         416
 14  SBPBEFOREDELIVERY         318
 15  DBPBEFOREDELIVERY         318
 16   SBPAFTERDELIVERY         552
 17   DBPAFTERDELIVERY         552
 29         BABYLENGTH         228
 30  HEADCIRCUMFERENCE         228)
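
For comparison, the same filter can be written with a boolean mask instead of an intermediate table. This is only a sketch on a made-up toy frame, not the notebook's method; `drop_sparse_columns` and the column names `a`, `b`, `c` are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing):
    """Return (filtered frame, null counts of the dropped columns)."""
    null_counts = df.isnull().sum()
    too_sparse = null_counts > max_missing
    # Keep only the columns whose null count is within the limit.
    return df.loc[:, ~too_sparse], null_counts[too_sparse]

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],   # 2 missing
    "b": [1.0, 2.0, 3.0, np.nan],      # 1 missing
    "c": ["x", "y", "z", "w"],         # 0 missing
})

kept, dropped = drop_sparse_columns(toy, max_missing=1)
print(list(kept.columns))      # ['b', 'c']
print(dropped.index.tolist())  # ['a']
```

Like `drop_columns`, this returns a new frame and leaves the original untouched, so both versions of the data stay available for comparison.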

***We'll replace the missing values in the remaining columns in two ways:***
* ***Replace all categorical NaN values with the most frequently occurring category***

* ***Replace all numerical NaN values with the mean of that column.***

Also, given that there are many variables with missing entries, it will be best to create a function to facilitate this process. As someone who learned Python from Harvard's CS50 Python course, I feel like writing functions is very important, just in case there's ever a need to reuse a block of code. No need for copy and paste. Lol.

In [7]:
# Function that accepts the dataframe and a list of column names, and replaces every NaN with the mean of its column.

# A function for replacing numerical missing values
def replace_num_nan(df, columns):

    for column in columns:
        average_value = df[column].mean(skipna=True)
        # Assign the result back: calling fillna(inplace=True) on df[column]
        # may act on a temporary copy and is deprecated in recent pandas.
        df[column] = df[column].fillna(average_value)

    return df


In [8]:
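# A function for replacing categorical missing values
def replace_cat_nan(df, columns):

    for column in columns:
        # mode() returns a Series; [0] takes the most frequent category
        most_occurring = df[column].mode()[0]
        # Assign the result back instead of using fillna(inplace=True)
        df[column] = df[column].fillna(most_occurring)

    return df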
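
To see what the two imputation rules do, here is a minimal sketch on a made-up two-column frame. The column names `weight` and `status` and their values are hypothetical, chosen only so the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Toy data with one gap per column (values are made up).
toy = pd.DataFrame({
    "weight": [2.0, np.nan, 4.0],   # numerical
    "status": ["No", "No", None],   # categorical
})

# Numerical gap -> column mean. NaN is skipped, so the mean is (2.0 + 4.0) / 2.
toy["weight"] = toy["weight"].fillna(toy["weight"].mean())

# Categorical gap -> most frequent category. mode() returns a Series,
# and on ties [0] picks the first value in sorted order.
toy["status"] = toy["status"].fillna(toy["status"].mode()[0])

print(toy["weight"].tolist())  # [2.0, 3.0, 4.0]
print(toy["status"].tolist())  # ['No', 'No', 'No']
```

Every filled cell receives the same value, which is why the imputed columns show long runs of identical decimals such as 10.598487.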

In [12]:
# Listing the categorical and numerical column names in separate variables
cat_columns = ['LEVELOFEDUCATION', 'OCCUPATION', 'HEPATITISBSTATUS', 'SYPHILLISSTATUS', 'RETROSTATUS', 'BLOODGROUP', 
             'PTDlt37WEEKS', 'MODEOFDELIVERY', 'MATERNALOUTCOME', 'AntepartumHemorrhage', 'Postpartumhemorrhage', 
             'ECLAMPSIA', 'SEVEREPREECLAMPSIA', 'BABYSEX', 'LBW', 'LOWBIRTHWEIGHT', 'NICUADMISSION', 'RESPIRATORYDISTRESS', 
             'STILLBIRTH', 'IUGR', 'NEONATALOUTCOME']

num_columns = ['MATERNALAGE', 'GRAVIDITY', 'PARITY', 'NO.ANTENALVISITS', 'HB_Delivery', 'GESTATIONALAGE', 
               'SBPBEFOREDELIVERY', 'DBPBEFOREDELIVERY', 'SBPAFTERDELIVERY', 'DBPAFTERDELIVERY', 
              'BIRTHWEIGHT', 'APGARAT1MIN', 'APGARAT5MIN', 'BABYLENGTH', 'HEADCIRCUMFERENCE']

In [13]:
df = replace_num_nan(df, num_columns)
df = replace_cat_nan(df, cat_columns)
df.head()

Unnamed: 0,MATERNALAGE,LEVELOFEDUCATION,OCCUPATION,GRAVIDITY,PARITY,NO.ANTENALVISITS,HB_Delivery,HEPATITISBSTATUS,SYPHILLISSTATUS,RETROSTATUS,...,BIRTHWEIGHT,APGARAT1MIN,APGARAT5MIN,BABYLENGTH,HEADCIRCUMFERENCE,NICUADMISSION,RESPIRATORYDISTRESS,STILLBIRTH,IUGR,NEONATALOUTCOME
0,18.0,Secondary,Self employed,1.0,0.0,11.0,10.598487,Non Reactive,Non Reactive,Non Reactive,...,2.6,8.0,9.0,49.210993,33.162943,No,No,No,No,Alive
1,31.0,Illiterate,Unemployed,3.0,2.0,7.113177,10.598487,Non Reactive,Non Reactive,Non Reactive,...,3.2,7.0,8.0,54.0,33.0,No,Yes,No,No,Alive
2,20.0,Secondary,Unemployed,2.0,0.0,4.0,10.9,Non Reactive,Non Reactive,Non Reactive,...,2.8,7.0,8.0,49.210993,33.162943,No,No,No,No,Alive
3,19.0,Secondary,Self employed,1.0,0.0,2.0,8.6,Non Reactive,Non Reactive,Non Reactive,...,2.4,7.0,8.0,49.0,30.0,No,No,No,No,Alive
4,32.0,Tertiary,Civil Servant,4.0,3.0,8.0,11.5,Non Reactive,Non Reactive,Non Reactive,...,2.9,8.0,9.0,45.0,35.0,No,No,No,No,Alive


***This looks okay. Now let's check our data again for any missing values***

In [14]:
df.isnull().sum()

MATERNALAGE             0
LEVELOFEDUCATION        0
OCCUPATION              0
GRAVIDITY               0
PARITY                  0
NO.ANTENALVISITS        0
HB_Delivery             0
HEPATITISBSTATUS        0
SYPHILLISSTATUS         0
RETROSTATUS             0
BLOODGROUP              0
GESTATIONALAGE          0
PTDlt37WEEKS            0
MODEOFDELIVERY          0
SBPBEFOREDELIVERY       0
DBPBEFOREDELIVERY       0
SBPAFTERDELIVERY        0
DBPAFTERDELIVERY        0
MATERNALOUTCOME         0
AntepartumHemorrhage    0
Postpartumhemorrhage    0
ECLAMPSIA               0
SEVEREPREECLAMPSIA      0
BABYSEX                 0
LBW                     0
LOWBIRTHWEIGHT          0
BIRTHWEIGHT             0
APGARAT1MIN             0
APGARAT5MIN             0
BABYLENGTH              0
HEADCIRCUMFERENCE       0
NICUADMISSION           0
RESPIRATORYDISTRESS     0
STILLBIRTH              0
IUGR                    0
NEONATALOUTCOME         0
dtype: int64

***A little exploration of the data***

In [15]:
df.describe()

Unnamed: 0,MATERNALAGE,GRAVIDITY,PARITY,NO.ANTENALVISITS,HB_Delivery,GESTATIONALAGE,SBPBEFOREDELIVERY,DBPBEFOREDELIVERY,SBPAFTERDELIVERY,DBPAFTERDELIVERY,BIRTHWEIGHT,APGARAT1MIN,APGARAT5MIN,BABYLENGTH,HEADCIRCUMFERENCE
count,1356.0,1356.0,1356.0,1356.0,1356.0,1356.0,1356.0,1356.0,1356.0,1356.0,1356.0,1356.0,1356.0,1356.0,1356.0
mean,26.238662,2.585821,1.406133,7.113177,10.598487,38.553097,115.608863,71.304432,113.631841,71.18408,2.910969,7.100601,8.247748,49.210993,33.162943
std,6.303188,1.56934,1.465588,2.554521,3.091221,2.192099,13.99729,10.644119,22.962878,23.918261,0.770229,1.687925,1.749695,17.96088,2.710699
min,13.0,1.0,0.0,0.0,1.2,26.0,11.0,18.0,11.0,30.0,0.5,0.0,0.0,0.0,0.0
25%,21.0,1.0,0.0,6.0,10.1,38.0,110.0,66.0,110.0,70.0,2.6,7.0,8.0,48.0,32.0
50%,26.0,2.0,1.0,7.113177,10.598487,39.0,115.608863,71.304432,113.631841,71.18408,2.9,8.0,9.0,49.210993,33.162943
75%,30.0,4.0,2.0,9.0,11.0,40.0,120.0,74.0,113.631841,71.18408,3.2,8.0,9.0,50.0,34.0
max,50.0,8.0,8.0,16.0,115.0,43.0,200.0,184.0,861.0,900.0,24.0,10.0,10.0,512.0,50.0


## FEATURE ENGINEERING

***Here, we try to engineer new columns so that we can tell a better statistical story about the data.***

***Given that there are many numerical columns that we want to categorize, it'll be better to create a function to facilitate this process.***

In [16]:
# This is a dynamic function where we can specify how many categories we want to define for each column

def categorize_column(df, column_name, bins):
    
    num_of_categories = len(bins) - 1
    labels = [f"{bins[i]}-{bins[i+1]-1}" for i in range(num_of_categories)]
    
    if column_name in df.columns:
        df['CAT_' + column_name] = pd.cut(df[column_name], bins=bins, labels=labels, right=False)
        
    else:
        print("Column '%s' does not exist in the DataFrame." % column_name)
    
    return df

### Categorizing MATERNALAGE
* 0 = <=20 years 
* 1 = 21 - 35 years 
* 2 = >35 years 


In [17]:
maternalage_bins = [0, 21, 36, 100]

df = categorize_column(df, 'MATERNALAGE', maternalage_bins)

In [18]:
df['CAT_MATERNALAGE'].unique()

['0-20', '21-35', '36-99']
Categories (3, object): ['0-20' < '21-35' < '36-99']

***Now we can see that CAT_MATERNALAGE has three categories: ['0-20', '21-35', '36-99']***
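
The bin edges are easiest to understand at the boundaries. The sketch below uses a few hand-picked ages (not rows from the dataset) to show how `right=False` treats the edges 21 and 36.

```python
import pandas as pd

ages = pd.Series([20, 21, 35, 36])  # hand-picked boundary values
bins = [0, 21, 36, 100]
labels = [f"{bins[i]}-{bins[i + 1] - 1}" for i in range(len(bins) - 1)]

# right=False makes each bin [lower, upper): 21 opens the second bin
# instead of closing the first one.
binned = pd.cut(ages, bins=bins, labels=labels, right=False)
print(binned.tolist())  # ['0-20', '21-35', '21-35', '36-99']
```

Because the intervals are half-open, a value equal to or above the last edge (100 here) falls outside every bin and becomes NaN, so the final edge should sit above the largest value in the column.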

### Modifying the categorize_column function

To categorize the remaining numerical columns more EFFICIENTLY, let's modify our categorize_column function to include a loop that takes multiple columns in the form of a dictionary {column_name : bins}.

In [19]:
# Modified categorizing function

def mod_categorize_column(df, col_dict):
    
    for col in col_dict:
        bins = col_dict[col]
        num_of_categories = len(bins) - 1
        labels = [f"{bins[i]}-{bins[i+1]-1}" for i in range(num_of_categories)]
        
        if col in df.columns:
            df['CAT_' + col] = pd.cut(df[col], bins=bins, labels=labels, right=False)
        else:
            # Use the loop variable `col`; `column_name` is not defined in this function
            print("Column '%s' does not exist in the DataFrame." % col)

    return df

Now, we can create bins for all the different columns and push them to this modified function 

### GRAVIDITY Categories 
* 0 = 1
* 1 = 2
* 2 = >= 3

### PARITY Categories 
* 0 = 0
* 1 = 1 
* 2 = >= 2


In [20]:
GRAVIDITY_bins = [1, 2, 3, 10]

PARITY_bins = [0, 1, 2, 10]

In [21]:
col_dict = {}

col_dict['GRAVIDITY'] = GRAVIDITY_bins

col_dict['PARITY'] = PARITY_bins

In [22]:
df = mod_categorize_column(df, col_dict)

In [23]:
df.head()

Unnamed: 0,MATERNALAGE,LEVELOFEDUCATION,OCCUPATION,GRAVIDITY,PARITY,NO.ANTENALVISITS,HB_Delivery,HEPATITISBSTATUS,SYPHILLISSTATUS,RETROSTATUS,...,BABYLENGTH,HEADCIRCUMFERENCE,NICUADMISSION,RESPIRATORYDISTRESS,STILLBIRTH,IUGR,NEONATALOUTCOME,CAT_MATERNALAGE,CAT_GRAVIDITY,CAT_PARITY
0,18.0,Secondary,Self employed,1.0,0.0,11.0,10.598487,Non Reactive,Non Reactive,Non Reactive,...,49.210993,33.162943,No,No,No,No,Alive,0-20,1-1,0-0
1,31.0,Illiterate,Unemployed,3.0,2.0,7.113177,10.598487,Non Reactive,Non Reactive,Non Reactive,...,54.0,33.0,No,Yes,No,No,Alive,21-35,3-9,2-9
2,20.0,Secondary,Unemployed,2.0,0.0,4.0,10.9,Non Reactive,Non Reactive,Non Reactive,...,49.210993,33.162943,No,No,No,No,Alive,0-20,2-2,0-0
3,19.0,Secondary,Self employed,1.0,0.0,2.0,8.6,Non Reactive,Non Reactive,Non Reactive,...,49.0,30.0,No,No,No,No,Alive,0-20,1-1,0-0
4,32.0,Tertiary,Civil Servant,4.0,3.0,8.0,11.5,Non Reactive,Non Reactive,Non Reactive,...,45.0,35.0,No,No,No,No,Alive,21-35,3-9,2-9


In [24]:
df['CAT_GRAVIDITY'].value_counts().sort_index(ascending=True)

1-1    430
2-2    348
3-9    578
Name: CAT_GRAVIDITY, dtype: int64

## Saving the dataframe as a CSV file for statistical analysis in a different notebook 

In [25]:
df.to_csv('Cleaned2 - Low Birth Weight.csv', index=False)
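
One thing to keep in mind when loading this file in the next notebook: a CSV stores only the label text, so the ordered categories created by `pd.cut` come back as plain strings. The sketch below shows the round trip on a made-up two-row frame using an in-memory buffer instead of a real file; restoring the dtype this way is a suggestion, not a step taken in this notebook.

```python
import io
import pandas as pd

labels = ["0-20", "21-35", "36-99"]
toy = pd.DataFrame({
    "CAT_MATERNALAGE": pd.Categorical(
        ["21-35", "0-20"], categories=labels, ordered=True
    )
})

# Write to and read from an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip the column is no longer categorical.
print(isinstance(reloaded["CAT_MATERNALAGE"].dtype, pd.CategoricalDtype))  # False

# Restore the ordered categories explicitly after loading.
age_dtype = pd.CategoricalDtype(categories=labels, ordered=True)
reloaded["CAT_MATERNALAGE"] = reloaded["CAT_MATERNALAGE"].astype(age_dtype)
print(reloaded["CAT_MATERNALAGE"].cat.ordered)  # True
```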