### Chronic Kidney Disease Dataset Documentation

<b>Overview</b>

This dataset was collected at a hospital over approximately two months in 2015. It contains 400 patient records with 24 features and one target label, and can be used to predict chronic kidney disease (CKD).

<b>Dataset Characteristics</b>

<b>Feature Categories</b>

<b>Numerical Features (11)</b>

* Age (years)
* Blood Pressure (mmHg)
* Blood Glucose Random (mg/dL)
* Blood Urea (mg/dL)
* Serum Creatinine (mg/dL)
* Sodium (mEq/L)
* Potassium (mEq/L)
* Hemoglobin (g/dL)
* Packed Cell Volume
* White Blood Cell Count (cells/cumm)
* Red Blood Cell Count (millions/cumm)

<b>Categorical Features (14)</b>

* Specific Gravity (values: 1.005, 1.010, 1.015, 1.020, 1.025)
* Albumin (values: 0-5)
* Sugar (values: 0-5)
* Red Blood Cells (normal, abnormal)
* Pus Cell (normal, abnormal)
* Pus Cell Clumps (present, notpresent)
* Bacteria (present, notpresent)
* Hypertension (yes, no)
* Diabetes Mellitus (yes, no)
* Coronary Artery Disease (yes, no)
* Appetite (good, poor)
* Pedal Edema (yes, no)
* Anemia (yes, no)
* Class (ckd, notckd) - Target Variable
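
The feature lists above can be written down as a small schema and used to sanity-check loaded data. The sketch below assumes the short column names that appear in the notebook output (`age`, `bp`, `sg`, and so on); `NUMERICAL`, `ALLOWED_VALUES`, and `find_unexpected_values` are illustrative names, not part of the dataset or of pandas, and the toy rows are made up.

```python
import pandas as pd

# Short column names as they appear in the loaded DataFrame (assumed mapping).
NUMERICAL = ["age", "bp", "bgr", "bu", "sc", "sod", "pot",
             "hemo", "pcv", "wbcc", "rbcc"]

# Documented values for each categorical column, including the target.
ALLOWED_VALUES = {
    "sg": {1.005, 1.010, 1.015, 1.020, 1.025},
    "al": {0, 1, 2, 3, 4, 5},
    "su": {0, 1, 2, 3, 4, 5},
    "rbc": {"normal", "abnormal"},
    "pc": {"normal", "abnormal"},
    "pcc": {"present", "notpresent"},
    "ba": {"present", "notpresent"},
    "htn": {"yes", "no"},
    "dm": {"yes", "no"},
    "cad": {"yes", "no"},
    "appet": {"good", "poor"},
    "pe": {"yes", "no"},
    "ane": {"yes", "no"},
    "class": {"ckd", "notckd"},
}


def find_unexpected_values(df: pd.DataFrame) -> dict:
    """Return {column: values outside the documented set}, ignoring NaN."""
    problems = {}
    for col, allowed in ALLOWED_VALUES.items():
        if col not in df.columns:
            continue
        observed = set(df[col].dropna().unique())
        unexpected = observed - allowed
        if unexpected:
            problems[col] = unexpected
    return problems


# Toy rows for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "sg": [1.020, 1.010, None],
    "htn": ["yes", "no", "maybe"],
    "class": ["ckd", "notckd", "ckd"],
})
print(find_unexpected_values(toy))
```

A check like this can surface labels with stray whitespace or unexpected spellings before any encoding step.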

<b>Data Source</b>

The dataset is available through the UCI Machine Learning Repository, contributed in July 2015. Original source: https://archive.ics.uci.edu/dataset/336/chronic+kidney+disease

<b>Data Quality Notes</b>

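- All 24 feature columns contain missing values, ranging from 1 missing entry (`appet`, `pe`, `ane`) to 152 (`rbc`) out of 400 rows; only the target `class` is complete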
- Both continuous and categorical variables are present
- Medical measurements are provided in standard clinical units
- Features cover a comprehensive range of medical indicators relevant to kidney function
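
To put a number on the missing-value note, a per-column summary and a count of fully complete rows are usually enough. The following is a minimal sketch on a made-up toy frame; `missing_summary` is a hypothetical helper name, and the same function could be applied to the real DataFrame once it is loaded.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing count and percentage, most affected columns first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("n_missing", ascending=False)


# Toy frame for illustration only (values are made up).
toy = pd.DataFrame({
    "age": [48.0, 7.0, np.nan, 51.0],
    "rbc": [np.nan, np.nan, "normal", "normal"],
    "htn": ["yes", "no", "no", "no"],
})

summary = missing_summary(toy)
print(summary)
print("complete rows:", len(toy.dropna()))
```

The complete-row count matters because dropping every row with any missing value can discard far more data than the per-column figures suggest.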

In [2]:
import pandas as pd

# Read the feature table and the target table
features = pd.read_csv('../data/chronic_kidney_disease.csv')
target = pd.read_csv('../data/chronic_kidney_disease_target.csv')

# Combine features and target column-wise (rows are aligned by index)
data = pd.concat([features, target], axis=1)
data.head()

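| | age | bp | sg | al | su | rbc | pc | pcc | ba | bgr | ... | pcv | wbcc | rbcc | htn | dm | cad | appet | pe | ane | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48.0 | 80.0 | 1.02 | 1.0 | 0.0 | NaN | normal | notpresent | notpresent | 121.0 | ... | 44.0 | 7800.0 | 5.2 | yes | yes | no | good | no | no | ckd |
| 1 | 7.0 | 50.0 | 1.02 | 4.0 | 0.0 | NaN | normal | notpresent | notpresent | NaN | ... | 38.0 | 6000.0 | NaN | no | no | no | good | no | no | ckd |
| 2 | 62.0 | 80.0 | 1.01 | 2.0 | 3.0 | normal | normal | notpresent | notpresent | 423.0 | ... | 31.0 | 7500.0 | NaN | no | yes | no | poor | no | yes | ckd |
| 3 | 48.0 | 70.0 | 1.005 | 4.0 | 0.0 | normal | abnormal | present | notpresent | 117.0 | ... | 32.0 | 6700.0 | 3.9 | yes | no | no | poor | yes | yes | ckd |
| 4 | 51.0 | 80.0 | 1.01 | 2.0 | 0.0 | normal | normal | notpresent | notpresent | 106.0 | ... | 35.0 | 7300.0 | 4.6 | no | no | no | good | no | no | ckd |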
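
Because `pd.concat(..., axis=1)` aligns rows by index label rather than by position, the merge is only correct when both files hold the same rows in the same order. The sketch below shows one way to guard against silent misalignment; `concat_aligned` is a hypothetical helper, and the two toy frames stand in for the real CSV files.

```python
import pandas as pd


def concat_aligned(features: pd.DataFrame, target: pd.DataFrame) -> pd.DataFrame:
    """Column-wise concat that fails loudly if the two frames do not line up."""
    if len(features) != len(target):
        raise ValueError(
            f"row count mismatch: {len(features)} vs {len(target)}"
        )
    if not features.index.equals(target.index):
        raise ValueError("index mismatch: rows would be misaligned")
    return pd.concat([features, target], axis=1)


# Toy frames for illustration only.
features = pd.DataFrame({"age": [48.0, 7.0], "bp": [80.0, 50.0]})
target = pd.DataFrame({"class": ["ckd", "ckd"]})

merged = concat_aligned(features, target)
print(merged.shape)
```

Without such a check, mismatched indexes would not raise an error; pandas would pad the gaps with NaN and the labels could end up attached to the wrong rows.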


In [4]:
# Let's run some diagnostics
# info() prints its report directly and returns None, so it is not wrapped in print()
data.info()
print(data.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 25 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     391 non-null    float64
 1   bp      388 non-null    float64
 2   sg      353 non-null    float64
 3   al      354 non-null    float64
 4   su      351 non-null    float64
 5   rbc     248 non-null    object 
 6   pc      335 non-null    object 
 7   pcc     396 non-null    object 
 8   ba      396 non-null    object 
 9   bgr     356 non-null    float64
 10  bu      381 non-null    float64
 11  sc      383 non-null    float64
 12  sod     313 non-null    float64
 13  pot     312 non-null    float64
 14  hemo    348 non-null    float64
 15  pcv     329 non-null    float64
 16  wbcc    294 non-null    float64
 17  rbcc    269 non-null    float64
 18  htn     398 non-null    object 
 19  dm      398 non-null    object 
 20  cad     398 non-null    object 
 21  appet   399 non-null    object 
 22  pe

In [3]:
# Let's start preprocessing by checking for missing values
# Count missing entries per column and keep only columns that have any

missing_values = data.isnull().sum()
missing_values = missing_values[missing_values > 0]
print(missing_values)

age        9
bp        12
sg        47
al        46
su        49
rbc      152
pc        65
pcc        4
ba         4
bgr       44
bu        19
sc        17
sod       87
pot       88
hemo      52
pcv       71
wbcc     106
rbcc     131
htn        2
dm         2
cad        2
appet      1
pe         1
ane        1
dtype: int64
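
One common way to handle the gaps listed above is simple imputation: the median for numerical columns and the most frequent value for categorical ones. This is only a sketch of one option under that assumption, not necessarily the approach this notebook goes on to use; `impute_simple` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_simple(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and other columns with the mode."""
    out = df.copy()  # leave the caller's frame untouched
    for col in out.columns:
        if out[col].isna().sum() == 0:
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            mode = out[col].mode(dropna=True)
            if not mode.empty:  # a column may be entirely missing
                out[col] = out[col].fillna(mode.iloc[0])
    return out


# Toy frame for illustration only (values are made up).
toy = pd.DataFrame({
    "age": [48.0, np.nan, 62.0],
    "rbc": ["normal", np.nan, "normal"],
})
filled = impute_simple(toy)
print(filled)
```

Two caveats apply. Columns such as `sg`, `al`, and `su` are stored as `float64` although they are documented as categorical, so this sketch would fill them with the median; the mode may be more appropriate there. And with `rbc` missing in 152 of 400 rows, it is worth checking how sensitive any downstream model is to the imputation choice.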
