# 1. Preprocessing

## 1.1 Introduction to the Chronic Kidney Disease Dataset

The Chronic Kidney Disease (CKD) dataset is a public health dataset that has been widely used in data mining and machine learning experiments. It originates from the University of California, Irvine (UCI) Machine Learning Repository, a popular resource for machine learning researchers and enthusiasts. The data is used to predict whether a patient has CKD, based on various health metrics.

CKD is a condition characterized by a gradual loss of kidney function over time. It is a serious condition because the kidneys are essential for filtering waste and excess fluids from the blood, which are then excreted in urine. When chronic kidney disease reaches an advanced stage, dangerous levels of fluid, electrolytes, and wastes can build up in the body.

The dataset comprises 400 instances and 24 medical predictor attributes plus the class attribute. The predictors include demographic data (age), vital sign measurements such as blood pressure, and laboratory test results such as blood glucose levels, packed cell volume, and white blood cell count. The class attribute indicates whether the individual has CKD or not.

Some attributes are numerical and represent actual health metrics, while others are nominal and represent health classifications. The dataset contains missing values, representing the realistic scenario that not all health data are always available for every patient.
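
As a minimal sketch of how such gaps can be quantified, the snippet below counts missing entries per column on a small hand-made frame. The values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame (illustrative values only): None marks a measurement that was not recorded
toy = pd.DataFrame({
    "age": [48.0, 7.0, None, 51.0],
    "red_blood_cells": [None, None, "normal", "normal"],
    "class": ["ckd", "ckd", "ckd", "notckd"],
})

# isna() flags both None and NaN; summing the flags gives the count per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

The same call on the full DataFrame would show which attributes are most affected before any imputation strategy is chosen.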

## 1.2 Setting Up the Environment

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from matplotlib.pylab import rcParams
# The arff module is provided by the liac-arff package:
# !pip install liac-arff
import arff

## 1.3 Data Collection and Cleaning

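We first open the raw ARFF file and write a cleaned copy to a separate output file, which is then loaded with the `arff` module (from the liac-arff package).

The following chain of replacements is essential:

`line.replace('\t', '').replace(',,',',').replace('ckd,\n', 'ckd\n').replace(' ckd', 'ckd').replace(' yes','yes')`

It strips stray tab characters, collapses doubled commas, removes a trailing comma after the class label, and drops the leading space before `ckd` and `yes`. Without it, the file cannot be parsed.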
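
To make the effect of this chain concrete, the sketch below wraps it in a helper and applies it to a few made-up lines. The function name `clean_line` and the sample lines are assumptions for illustration; they only mimic the kinds of defects the replacements target.

```python
def clean_line(line: str) -> str:
    """Apply the same replacement chain that is used on the raw ARFF file."""
    return (
        line.replace('\t', '')           # drop stray tab characters
            .replace(',,', ',')          # collapse doubled commas
            .replace('ckd,\n', 'ckd\n')  # remove a trailing comma after the class label
            .replace(' ckd', 'ckd')      # drop a leading space before the class label
            .replace(' yes', 'yes')      # drop a leading space before 'yes'
    )

# Hypothetical malformed data lines (not copied from the dataset)
samples = [
    "48,80,\tyes,no,ckd,\n",
    "60,90, yes,no,,ckd\n",
    "35,70,no,no, ckd\t\n",
]
cleaned = [clean_line(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

Because the replacements are plain substring substitutions, their order matters: tabs are removed first so that the later patterns see the line without them.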

In [2]:
# Make sure the output directory exists
os.makedirs("processed", exist_ok=True)

# Copy the raw file line by line, repairing malformed tokens on the way.
# The with-block closes (and flushes) both files before the cleaned copy is read.
with open("data/chronic_kidney_disease_full.arff", "rt") as fin, \
        open("processed/chronic_kidney_disease_full.arff", "wt") as fout:
    for line in fin:
        fout.write(line.replace('\t', '').replace(',,', ',').replace('ckd,\n', 'ckd\n').replace(' ckd', 'ckd').replace(' yes', 'yes'))

# Load the cleaned file with liac-arff
with open("processed/chronic_kidney_disease_full.arff", "r") as f:
    dataset = arff.load(f)
print(dataset['description'])

1. Title: Early stage of Indians Chronic Kidney Disease(CKD)

2. Source Information:
  (a) Source:
Dr.P.Soundarapandian.M.D.,D.M
    (Senior Consultant Nephrologist),
Apollo  Hospitals,
Managiri,
Madurai Main Road,
Karaikudi,
Tamilnadu,
India.
  (b) Creator:
L.Jerlin Rubini(Research Scholar)
Alagappa University
EmailId   :jel.jerlin@gmail.com
ContactNo :+91-9597231281
  (c) Guided by:
Dr.P.Eswaran Assistant Professor,
Department of Computer Science and Engineering,
Alagappa University,
Karaikudi,
Tamilnadu,
India.
Emailid:eswaranperumal@gmail.com
  (d) Date     : july 2015

3.Relevant Information:
age-age
bp-blood pressure
sg-specific gravity
al-   albumin
su-sugar
rbc-red blood cells
pc-pus cell
pcc-pus cell clumps
ba-bacteria
bgr-blood glucose random
bu-blood urea
sc-serum creatinine
sod-sodium
pot-potassium
hemo-hemoglobin
pcv-packed cell volume
wc-white blood cell count
rc-red blood cell count
htn-hypertension
dm-diabetes mellitus
cad-coronary artery disease
appet-appetite
pe-pedal

## 1.4 Parsing the Dataset

Here we map the abbreviated column names to more descriptive ones, split the columns into numeric and categorical, and save both lists to file for later use. The DataFrame itself is built in Section 1.5.
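
The split into numeric and categorical columns relies on how the loaded attributes are represented: as the cell output in this section shows, each attribute is a (name, type) pair whose type is either the string `'NUMERIC'` or a list of allowed nominal values. The sketch below applies that check to a shortened, hand-written attribute list; the helper name `split_by_type` is hypothetical.

```python
def split_by_type(attributes):
    """Return (numeric, categorical) column-name lists from (name, type) pairs."""
    numeric, categorical = [], []
    for name, attr_type in attributes:
        if attr_type == 'NUMERIC':
            numeric.append(name)
        else:
            # Nominal attributes carry a list of allowed values instead of a type string
            categorical.append(name)
    return numeric, categorical

# Shortened attribute list in the same (name, type) layout
attributes = [
    ('age', 'NUMERIC'),
    ('blood_pressure', 'NUMERIC'),
    ('red_blood_cells', ['normal', 'abnormal']),
    ('class', ['ckd', 'notckd']),
]
numeric, categorical = split_by_type(attributes)
print(numeric)
print(categorical)
```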

In [3]:

name_map = {
    'age':      'age',
    'bp':       'blood_pressure',
    'sg':       'specific_gravity',
    'al':       'albumin',
    'su':       'sugar',
    'rbc':      'red_blood_cells',
    'pc':       'pus_cell',
    'pcc':      'pus_cell_clumps',
    'ba':       'bacteria',
    'bgr':      'blood_glucose_random',
    'bu':       'blood_urea',
    'sc':       'serum_creatinine',
    'sod':      'sodium',
    'pot':      'potassium',
    'hemo':     'hemoglobin',
    'pcv':      'packed_cell_volume',
    'wbcc':     'white_blood_cell_count',
    'rbcc':     'red_blood_cell_count',
    'htn':      'hypertension',
    'dm':       'diabetes_mellitus',
    'cad':      'coronary_artery_disease',
    'appet':    'appetite',
    'pe':       'pedal_edema',
    'ane':      'anemia',
    'class':    'class',
}

# rename columns
dataset['attributes'] = [(name_map[attr[0]], attr[1]) for attr in dataset['attributes']]

numeric_columns = []
categorical_columns = []
column_names = []


for attr in dataset['attributes']:
    column_names.append(attr[0])
    if attr[1] == 'NUMERIC':
        numeric_columns.append(attr[0])
    else:
        categorical_columns.append(attr[0])
    print(attr[0], '\n', attr[1])
    

print('\nnumeric:',numeric_columns)
print('categorical:',categorical_columns)

age 
 NUMERIC
blood_pressure 
 NUMERIC
specific_gravity 
 ['1.005', '1.010', '1.015', '1.020', '1.025']
albumin 
 ['0', '1', '2', '3', '4', '5']
sugar 
 ['0', '1', '2', '3', '4', '5']
red_blood_cells 
 ['normal', 'abnormal']
pus_cell 
 ['normal', 'abnormal']
pus_cell_clumps 
 ['present', 'notpresent']
bacteria 
 ['present', 'notpresent']
blood_glucose_random 
 NUMERIC
blood_urea 
 NUMERIC
serum_creatinine 
 NUMERIC
sodium 
 NUMERIC
potassium 
 NUMERIC
hemoglobin 
 NUMERIC
packed_cell_volume 
 NUMERIC
white_blood_cell_count 
 NUMERIC
red_blood_cell_count 
 NUMERIC
hypertension 
 ['yes', 'no']
diabetes_mellitus 
 ['yes', 'no']
coronary_artery_disease 
 ['yes', 'no']
appetite 
 ['good', 'poor']
pedal_edema 
 ['yes', 'no']
anemia 
 ['yes', 'no']
class 
 ['ckd', 'notckd']

numeric: ['age', 'blood_pressure', 'blood_glucose_random', 'blood_urea', 'serum_creatinine', 'sodium', 'potassium', 'hemoglobin', 'packed_cell_volume', 'white_blood_cell_count', 'red_blood_cell_count']
categorical: ['spec

In [4]:
# save numeric and categorical columns to file
np.savetxt('processed/categorical_columns.txt', categorical_columns, fmt='%s')
np.savetxt('processed/numerical_columns.txt', numeric_columns, fmt='%s')
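
A later notebook can read these lists back with plain file handling, since `np.savetxt` with `fmt='%s'` writes one entry per line for a one-dimensional input. The sketch below is an assumption about how that could look; `read_column_list` is a hypothetical helper, and a temporary file stands in for the files under `processed/`.

```python
import os
import tempfile

def read_column_list(path):
    """Read one column name per line, skipping empty lines."""
    with open(path, "r") as f:
        return [line.strip() for line in f if line.strip()]

# Stand-in for a file such as processed/numerical_columns.txt
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "numerical_columns.txt")
    with open(path, "w") as f:
        f.write("age\nblood_pressure\nsodium\n")
    columns = read_column_list(path)

print(columns)
```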

## 1.5 Conversion of Categorical Data

In this step we convert categorical data to boolean and numeric representations. We create three versions of the dataset: a raw version with renamed columns and a boolean class label but otherwise unchanged values, a version with boolean representations for the yes/no columns, and a version in which the binary categorical values are encoded as numbers.
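
The cells below use DataFrame-wide `replace` calls. As an alternative sketch (not the approach used in this notebook), the encoding can be restricted to named columns with `Series.map`, which avoids touching other columns that happen to contain the same strings. The toy values and the `yes_no_columns` list are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "hypertension": ["yes", "no", None],
    "appetite": ["good", "poor", "good"],
})

yes_no_columns = ["hypertension"]  # hypothetical subset of binary columns

# Boolean variant: map only the listed columns; values without a mapping (None) become NaN
as_bool = toy.copy()
for col in yes_no_columns:
    as_bool[col] = as_bool[col].map({"yes": True, "no": False})

# Numeric variant: same idea with 1/0 codes
as_num = toy.copy()
for col in yes_no_columns:
    as_num[col] = as_num[col].map({"yes": 1, "no": 0})

print(as_bool)
print(as_num)
```

The trade-off is that every column to be encoded has to be listed explicitly, whereas a global `replace` needs no column list.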

In [5]:
# Convert the dataset from an arff object to a pandas DataFrame
df = pd.DataFrame(dataset['data'], columns=column_names)
df.replace('ckd', True, inplace=True) 
df.replace('notckd', False, inplace=True)
df.to_csv('processed/df_raw.csv', index=False)  
df

| | age | blood_pressure | specific_gravity | albumin | sugar | red_blood_cells | pus_cell | pus_cell_clumps | bacteria | blood_glucose_random | ... | packed_cell_volume | white_blood_cell_count | red_blood_cell_count | hypertension | diabetes_mellitus | coronary_artery_disease | appetite | pedal_edema | anemia | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48.0 | 80.0 | 1.020 | 1 | 0 | | normal | notpresent | notpresent | 121.0 | ... | 44.0 | 7800.0 | 5.2 | yes | yes | no | good | no | no | True |
| 1 | 7.0 | 50.0 | 1.020 | 4 | 0 | | normal | notpresent | notpresent | | ... | 38.0 | 6000.0 | | no | no | no | good | no | no | True |
| 2 | 62.0 | 80.0 | 1.010 | 2 | 3 | normal | normal | notpresent | notpresent | 423.0 | ... | 31.0 | 7500.0 | | no | yes | no | poor | no | yes | True |
| 3 | 48.0 | 70.0 | 1.005 | 4 | 0 | normal | abnormal | present | notpresent | 117.0 | ... | 32.0 | 6700.0 | 3.9 | yes | no | no | poor | yes | yes | True |
| 4 | 51.0 | 80.0 | 1.010 | 2 | 0 | normal | normal | notpresent | notpresent | 106.0 | ... | 35.0 | 7300.0 | 4.6 | no | no | no | good | no | no | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 344 | 64.0 | 60.0 | 1.020 | 0 | 0 | normal | normal | notpresent | notpresent | 106.0 | ... | 42.0 | 8100.0 | 4.7 | no | no | no | good | no | no | False |
| 345 | 22.0 | 60.0 | 1.025 | 0 | 0 | normal | normal | notpresent | notpresent | 97.0 | ... | 42.0 | 7900.0 | 6.4 | no | no | no | good | no | no | False |
| 346 | 33.0 | 60.0 | | | | normal | normal | notpresent | notpresent | 130.0 | ... | 52.0 | 4300.0 | 5.8 | no | no | no | good | no | no | False |
| 347 | 43.0 | 60.0 | 1.025 | 0 | 0 | normal | normal | notpresent | notpresent | 108.0 | ... | 43.0 | 7200.0 | 5.5 | no | no | no | good | no | no | False |


In [6]:
def preserve_cat(df: pd.DataFrame):
    dataframe = df.copy()
    dataframe.replace('yes', True, inplace=True) # ['htn', 'dm', 'cad', 'pe', 'ane']
    dataframe.replace('no', False, inplace=True) # ['htn', 'dm', 'cad', 'pe', 'ane']

    # dataframe.replace('present', 1, inplace=True) # ['pcc', 'ba']
    # dataframe.replace('notpresent', 0, inplace=True) # ['pcc', 'ba']

    # dataframe.replace('good', 1, inplace=True) # ['appet']
    # dataframe.replace('poor', 0, inplace=True) # ['appet']

    dataframe.replace('ckd', True, inplace=True) # ['class']
    dataframe.replace('notckd', False, inplace=True) # ['class']
    
    # dataframe.replace('normal', 1, inplace=True) # ['rbc', 'pc']
    # dataframe.replace('abnormal', 0, inplace=True) # ['rbc', 'pc']
    
    # Prefix the ordinal codes with an underscore so they are kept as category
    # labels rather than parsed as numbers when the CSV is read back
    prefix_columns = {
        'specific_gravity': ['1.005', '1.010', '1.015', '1.020', '1.025'],
        'albumin': ['0', '1', '2', '3', '4', '5'],
        'sugar': ['0', '1', '2', '3', '4', '5'],
    }
    for col, codes in prefix_columns.items():
        # Assign the result back instead of relying on a chained inplace replace
        dataframe[col] = dataframe[col].replace({code: '_' + code for code in codes})
    
    return dataframe


# Convert the yes/no columns to booleans, prefix the ordinal codes, and store the result in a new DataFrame
df_cat_preserved = preserve_cat(df)


def print_range_of_values(dataframe):
    for col in dataframe.columns:
        print(col, '\n',dataframe[col].unique())


print_range_of_values(df_cat_preserved[categorical_columns])



specific_gravity 
 ['_1.020' '_1.010' '_1.005' '_1.015' None '_1.025']
albumin 
 ['_1' '_4' '_2' '_3' '_0' None '_5']
sugar 
 ['_0' '_3' '_4' '_1' None '_2' '_5']
red_blood_cells 
 [None 'normal' 'abnormal']
pus_cell 
 ['normal' 'abnormal' None]
pus_cell_clumps 
 ['notpresent' 'present' None]
bacteria 
 ['notpresent' 'present' None]
hypertension 
 [True False None]
diabetes_mellitus 
 [True False None]
coronary_artery_disease 
 [False True None]
appetite 
 ['good' 'poor' None]
pedal_edema 
 [False True None]
anemia 
 [False True None]
class 
 [ True False]


In [7]:
df_cat_preserved.to_csv('processed/df.csv', index=False)

In [8]:
def numer(df: pd.DataFrame):
    dataframe = df.copy()
    dataframe.replace('yes', 1, inplace=True) # ['htn', 'dm', 'cad', 'pe', 'ane']
    dataframe.replace('no', 0, inplace=True) # ['htn', 'dm', 'cad', 'pe', 'ane']

    dataframe.replace('present', 1, inplace=True) # ['pcc', 'ba']
    dataframe.replace('notpresent', 0, inplace=True) # ['pcc', 'ba']

    dataframe.replace('good', 1, inplace=True) # ['appet']
    dataframe.replace('poor', 0, inplace=True) # ['appet']

    dataframe.replace('ckd', 1, inplace=True) # ['class']
    dataframe.replace('notckd', 0, inplace=True) # ['class']
    
    dataframe.replace('normal', 1, inplace=True) # ['rbc', 'pc']
    dataframe.replace('abnormal', 0, inplace=True) # ['rbc', 'pc']
    
    return dataframe

# Encode the binary categorical variables as 0/1 and store the result in a new DataFrame
df_numeric = numer(df)

# print_range_of_values is defined in cell 6
print_range_of_values(df_numeric[categorical_columns])

specific_gravity 
 ['1.020' '1.010' '1.005' '1.015' None '1.025']
albumin 
 ['1' '4' '2' '3' '0' None '5']
sugar 
 ['0' '3' '4' '1' None '2' '5']
red_blood_cells 
 [nan  1.  0.]
pus_cell 
 [ 1.  0. nan]
pus_cell_clumps 
 [ 0.  1. nan]
bacteria 
 [ 0.  1. nan]
hypertension 
 [ 1.  0. nan]
diabetes_mellitus 
 [ 1.  0. nan]
coronary_artery_disease 
 [ 0.  1. nan]
appetite 
 [ 1.  0. nan]
pedal_edema 
 [ 0.  1. nan]
anemia 
 [ 0.  1. nan]
class 
 [ True False]


In [9]:
# Save to CSV
df_numeric.to_csv('processed/df_numeric.csv', index=False)

We now have a cleaned and preprocessed dataset ready for exploratory data analysis. We repaired the formatting anomalies in the raw ARFF file and brought the dataset into a usable format for the next steps. Missing values have not been imputed; they remain in the data as None or NaN.

The most important outputs are as follows:

- df_raw: The original dataset with the column names replaced by more readable ones and the class variable converted to a boolean.

- df: Largely the original dataset, except that the columns containing `yes` / `no` values, as well as the class variable, are stored as booleans. The numeric-looking codes of specific_gravity, albumin, and sugar are prefixed with an underscore so that these categories are not treated as numbers.

- df_numeric: The dataset with the binary categorical variables encoded as 0 and 1. The remaining categorical variables (specific_gravity, albumin, and sugar) were already coded by numbers and are left unchanged, and the class variable remains a boolean.
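
One detail is worth checking here: in the output of cell 8, specific_gravity, albumin, and sugar in df_numeric still hold string codes such as '1.020'. Once the frame is written to CSV and read back, pandas infers numeric types for them. The sketch below demonstrates this round trip on a toy frame; an in-memory buffer stands in for `processed/df_numeric.csv`, and the values are illustrative.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "specific_gravity": ["1.020", "1.010", None],
    "albumin": ["1", "4", None],
})

# Write to an in-memory CSV and read it back, as a later notebook would do with the file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(toy.dtypes)       # string codes before the round trip
print(reloaded.dtypes)  # numeric columns after the round trip
```

This is also why the df variant prefixes these codes with an underscore: the prefix keeps them from being parsed as numbers on reload.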

The next chapter will take us through the exploratory data analysis (EDA) process. Now that we have a cleaned and well-structured dataset, we can proceed with analyzing patterns and anomalies, checking assumptions, and testing hypotheses using summary statistics and graphical representations.