# Data Preparation for Classification of Maternal Age 

In [1]:
import pandas as pd
import numpy as np

The main dataframe that contains all necessary data must first be loaded.

In [2]:
mainDf = pd.read_csv('C:/Users/Nefeli/Desktop/biomed_project_data/mainDf_int.csv')

<br>The maternal age classes are not chosen arbitrarily. Initially, they follow the methodology of a well-cited paper on the relation between maternal age and adverse pregnancy outcomes, linked below:</br>
<br>https://bmcpregnancychildbirth.biomedcentral.com/articles/10.1186/s12884-019-2400-x</br>

<br>A: <= 17 years old</br>
<br>B: 18-28 years old </br>
<br>C: 29-39 years old </br>
<br>D: >=40 years old </br>

In [3]:
mainDf['Maternal_Age_4'] = [None]*len(mainDf) 

In [4]:
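def maternalAgeClassAssigner4(x):
    # Map an age in years to one of the four classes A-D defined above.
    if x<=17:  # class A includes 17-year-olds (A: <= 17)
        return 'A'
    elif x>=18 and x<=28:
        return 'B'
    elif x>=29 and x<=39:
        return 'C'
    elif x>= 40:
        return 'D'
    else:
        # Missing ages (NaN) fail every comparison above and end up here.
        return np.nan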

In [5]:
mainDf['Maternal_Age_4'] = mainDf['Age'].apply(maternalAgeClassAssigner4)
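
As an alternative to the row-wise function, the four classes listed above could be assigned in one vectorized call with `pd.cut`. This is only a sketch: the `ages` series holds hypothetical values chosen to hit the class boundaries, and the names `bins`, `labels` and `age_class` are illustrative, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ages for illustration only; not taken from the dataset.
ages = pd.Series([16, 17, 18, 28, 29, 39, 40, 45, np.nan], name="Age")

# Right-closed bins: (-inf, 17], (17, 28], (28, 39], (39, inf)
bins = [-np.inf, 17, 28, 39, np.inf]
labels = ["A", "B", "C", "D"]

# Missing ages stay missing in the result.
age_class = pd.cut(ages, bins=bins, labels=labels, right=True)
print(age_class.tolist())
```

For whole-number ages this matches the class definitions (A: <= 17, B: 18-28, C: 29-39, D: >= 40). For fractional ages it differs from the function above: a value such as 28.5 would fall into class C here, whereas the function returns NaN for it.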

In [6]:
y_labels, counts = np.unique(mainDf['Maternal_Age_4'], return_counts=True)
print(y_labels)
print(counts)

['B' 'C' 'D']
[100 118   4]


<br>Class A does not occur in the dataset, and classes B and C are fairly balanced (100 and 118 instances), but class D has only 4 instances. Because such a severe class imbalance is undesirable, the older groups are combined and the task is recast as a binary classification problem: ages 29 and below form class A, and all ages above 29 form class B. Note that this cut-off places 29-year-olds in class A, so the new class B is not exactly the union of the earlier classes C and D.</br>
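
To put the imbalance in numbers, the counts printed above can be converted into class shares. The sketch below copies the three counts by hand; the variable names are illustrative.

```python
# Counts copied from the output printed above.
counts = {"B": 100, "C": 118, "D": 4}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

Class D accounts for under 2% of the 222 rows, and the largest class is 29.5 times its size.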

<br>New Classes:</br>
<br>A: <= 29 years old</br>
<br>B: > 29 years old</br>

In [7]:
def maternalAgeClassAssigner2(x):
    if x<=29:
        return 'A'
    elif x>29:
        return 'B'
    else:
        return np.nan

In [8]:
mainDf['Maternal_Age'] = mainDf['Age'].apply(maternalAgeClassAssigner2)

In [9]:
y_labels, counts = np.unique(mainDf['Maternal_Age'], return_counts=True)
print(y_labels)
print(counts)

['A' 'B']
[121 101]


As expected, the two classes are now much more balanced (121 vs. 101 instances). The dataframe can now be saved as a CSV file for training classification models and evaluating their performance on this problem.

In [10]:
to_drop=['Age','Maternal_Age_4'] 
ageDf= mainDf.drop(columns=to_drop).copy()
#ageDf.info()
ageDf.head(3)

Unnamed: 0,pH,BDecf,pCO2,BE,Apgar1,Apgar5,Gest.weeks,Weight(g),Sex,Gravidity,...,FHR_II_ffill_total_power,FHR_II_ffill_vlf,FHR_II_ffill_haar_stdev,FHR_II_ffill_haar_mean,FHR_II_ffill_samp_entr,FHR_II_ffill_bub_entr,diff_nni20,diff_lf_hf,diff_haar_std,Maternal_Age
0,7.14,8.14,7.7,-10.5,6.0,8.0,37.0,2660.0,2.0,1.0,...,368.077564,210.991854,1.549748,-0.000147,0.031682,0.179575,16,3.059011,0.437721,B
1,7.0,7.92,12.0,-12.0,8.0,8.0,41.0,2900.0,2.0,1.0,...,573.335415,388.247905,3.125196,-0.01974,0.053499,0.15447,2,0.020674,1.175809,A
2,7.2,3.03,8.3,-5.6,7.0,9.0,40.0,3770.0,1.0,1.0,...,285.519025,149.841842,2.459169,0.030445,0.052023,0.205526,8,1.917875,1.699915,B


In [11]:
ageDf.to_csv('C:/Users/Nefeli/Desktop/biomed_project_data/ageDf_bin.csv',index=False)
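
The `index=False` argument matters here: without it, pandas writes the row index as an extra unnamed column, which then shows up as `Unnamed: 0` when the file is read back (as seen in the preview of the loaded data above). The sketch below illustrates the difference with a toy dataframe and in-memory buffers instead of files; all names in it are illustrative.

```python
import io
import pandas as pd

# Toy dataframe for illustration only.
df = pd.DataFrame({"pH": [7.14, 7.00], "Maternal_Age": ["B", "A"]})

# Default (index=True): the row index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```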