# Data Preparation for Classification of pH

In [1]:
import pandas as pd
import numpy as np

The main dataframe that contains all necessary data must first be loaded.

In [2]:
mainDf = pd.read_csv('C:/Users/Nefeli/Desktop/biomed_project_data/mainDf_int.csv')

In [3]:
y_labels, counts = np.unique(mainDf['pH'], return_counts=True)
print(y_labels)
print(counts)

[6.87 6.92 6.93 6.98 6.99 7.   7.02 7.03 7.05 7.07 7.08 7.09 7.1  7.11
 7.12 7.13 7.14 7.15 7.16 7.17 7.18 7.19 7.2  7.21 7.22 7.23 7.24 7.25
 7.26 7.27 7.28 7.29 7.3  7.31 7.32 7.33 7.34 7.35]
[ 1  1  2  2  1  2  1  3  3  3  2  1  4  2  5  9 11  3 15  4 11  8 14  7
  6  7 13 12 15  8  6  9  9  6  7  1  5  3]


Groups of pH values and the total count for each:<br>
{6.87, 6.92, 6.93, 6.98, 6.99, 7.0} -> total count = 9<br>
{7.02, 7.03, 7.05, 7.07, 7.08, 7.09, 7.1} -> total count = 17<br>
{7.11, 7.12, 7.13, 7.14, 7.15, 7.16, 7.17, 7.18, 7.19, 7.2} -> total count = 82<br>
{7.21, 7.22, 7.23, 7.24, 7.25, 7.26, 7.27, 7.28, 7.29, 7.3} -> total count = 92<br>
{7.31, 7.32, 7.33, 7.34, 7.35} -> total count = 22
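The group totals can be reproduced from the output of In [3] rather than added by hand. The following is a minimal sketch: the arrays are copied from that output, and the names `ph_values`, `ph_hundredths` and `upper_edges` are illustrative, not part of the original notebook. Working in integer hundredths is a choice made here to avoid floating-point comparisons at the group edges.

```python
import numpy as np

# Unique pH values and their counts, copied from the output of In [3].
ph_values = np.array([
    6.87, 6.92, 6.93, 6.98, 6.99, 7.00, 7.02, 7.03, 7.05, 7.07,
    7.08, 7.09, 7.10, 7.11, 7.12, 7.13, 7.14, 7.15, 7.16, 7.17,
    7.18, 7.19, 7.20, 7.21, 7.22, 7.23, 7.24, 7.25, 7.26, 7.27,
    7.28, 7.29, 7.30, 7.31, 7.32, 7.33, 7.34, 7.35,
])
counts = np.array([
    1, 1, 2, 2, 1, 2, 1, 3, 3, 3,
    2, 1, 4, 2, 5, 9, 11, 3, 15, 4,
    11, 8, 14, 7, 6, 7, 13, 12, 15, 8,
    6, 9, 9, 6, 7, 1, 5, 3,
])

# Convert to integer hundredths so that edge comparisons are exact.
ph_hundredths = np.round(ph_values * 100).astype(int)

# Inclusive upper edge of each group: 7.00, 7.10, 7.20, 7.30, 7.35.
upper_edges = [700, 710, 720, 730, 735]

group_totals = []
lower = 0
for upper in upper_edges:
    in_group = (ph_hundredths > lower) & (ph_hundredths <= upper)
    group_totals.append(int(counts[in_group].sum()))
    lower = upper

print(group_totals)  # [9, 17, 82, 92, 22]
```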

Looking at the pH values and their counts, two split points lead to balanced classes, and they can be considered equivalent. The first places the boundary at 7.2, so that class A (pH <= 7.2) has 108 elements and class B has 114. The second places it at 7.21, so that class A (pH <= 7.21) has 115 elements and class B has 107.
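The class sizes for the two candidate split points follow directly from the group totals. The short sketch below works through that arithmetic, assuming that class A includes the boundary value; the variable names are illustrative.

```python
# Group totals from the list above, lowest pH group first.
group_totals = [9, 17, 82, 92, 22]
# Number of samples with pH exactly 7.21, taken from the output of In [3].
count_at_7_21 = 7

total = sum(group_totals)

# Option 1: class A holds every sample with pH <= 7.2 (the first three groups).
a_720 = sum(group_totals[:3])
b_720 = total - a_720

# Option 2: class A additionally holds the samples with pH == 7.21.
a_721 = a_720 + count_at_7_21
b_721 = total - a_721

print(total)         # 222
print(a_720, b_720)  # 108 114
print(a_721, b_721)  # 115 107
```

The two options differ from a perfect 111/111 split by 3 and 4 samples respectively, which is why they can be treated as equivalent on balance alone.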

Instead of A and B, the classes can be named risky ('RISK') and non-risky ('NORISK'), since there may be a loose correlation between CTG data and pH values that indicate asphyxiation of the neonate or other pathological phenomena. Scientific opinion is divided on which values indicate which pathological phenomenon or risk, and the ideal threshold is the subject of ongoing research. There is no agreed threshold for separating normal from abnormal cases, although the proposed values are usually lower than 7.2. However, 7.2 is itself mentioned as a split point, so given both the class balance in the available data and the existing information, 7.2 seems to be a reasonable threshold here. A pH of exactly 7.2 is assigned to the risky class.

In [4]:
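def pHClassAssigner(x):
    """Map a pH value to 'RISK' (pH <= 7.2) or 'NORISK' (pH > 7.2)."""
    if x<=7.2:
        return 'RISK'
    elif x>7.2:
        return 'NORISK'
    else:
        # Reached only when both comparisons are False, i.e. x is NaN.
        return np.nan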

In [5]:
mainDf['pH_risk'] = [None]*len(mainDf) 

In [6]:
mainDf['pH_risk'] = mainDf['pH'].apply(pHClassAssigner)
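The same labelling can also be done without a per-row function call. The sketch below is a vectorized alternative, not the notebook's own method; `THRESHOLD` and `assign_ph_risk` are hypothetical names introduced here. It keeps the same rule as `pHClassAssigner`: 7.2 itself is labelled 'RISK' and missing values stay missing.

```python
import numpy as np
import pandas as pd

THRESHOLD = 7.2  # same threshold as pHClassAssigner; 7.2 itself counts as 'RISK'


def assign_ph_risk(ph: pd.Series) -> pd.Series:
    """Label each pH value as 'RISK' (<= THRESHOLD) or 'NORISK' (> THRESHOLD)."""
    # NaN <= THRESHOLD evaluates to False, so NaN rows are first labelled
    # 'NORISK' here and are reset to NaN on the next line.
    labels = pd.Series(
        np.where(ph <= THRESHOLD, 'RISK', 'NORISK'),
        index=ph.index,
        dtype=object,
    )
    return labels.where(ph.notna(), np.nan)


# Small check on the boundary and on a missing value.
demo = pd.Series([7.19, 7.2, 7.21, np.nan])
demo_labels = assign_ph_risk(demo)
print(demo_labels.tolist())
```

On the notebook's data this would be used as `mainDf['pH_risk'] = assign_ph_risk(mainDf['pH'])`.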

In [7]:
y_labels, counts = np.unique(mainDf['pH_risk'], return_counts=True)
print(y_labels)
print(counts)

['NORISK' 'RISK']
[114 108]


In [8]:
to_drop=['pH']
phDf= mainDf.drop(columns=to_drop).copy()
#phDf.info()
phDf.head(3)

Unnamed: 0,BDecf,pCO2,BE,Apgar1,Apgar5,Gest.weeks,Weight(g),Sex,Age,Gravidity,...,FHR_II_ffill_total_power,FHR_II_ffill_vlf,FHR_II_ffill_haar_stdev,FHR_II_ffill_haar_mean,FHR_II_ffill_samp_entr,FHR_II_ffill_bub_entr,diff_nni20,diff_lf_hf,diff_haar_std,pH_risk
0,8.14,7.7,-10.5,6.0,8.0,37.0,2660.0,2.0,32.0,1.0,...,368.077564,210.991854,1.549748,-0.000147,0.031682,0.179575,16,3.059011,0.437721,RISK
1,7.92,12.0,-12.0,8.0,8.0,41.0,2900.0,2.0,23.0,1.0,...,573.335415,388.247905,3.125196,-0.01974,0.053499,0.15447,2,0.020674,1.175809,RISK
2,3.03,8.3,-5.6,7.0,9.0,40.0,3770.0,1.0,31.0,1.0,...,285.519025,149.841842,2.459169,0.030445,0.052023,0.205526,8,1.917875,1.699915,RISK


In [9]:
phDf.to_csv('C:/Users/Nefeli/Desktop/biomed_project_data/phDf_bin.csv',index=False)