# Problem Statement: Predictive Analytics for Chronic Kidney Disease Detection

## Context
In healthcare operations, early detection of chronic diseases can significantly reduce mortality rates, improve patient outcomes, and decrease healthcare costs. Chronic Kidney Disease (CKD) is a silent epidemic affecting over **850 million people worldwide**, often progressing asymptomatically until advanced stages when treatment options become limited.

Routine laboratory tests provide critical signals for early detection, but interpreting complex patterns across multiple biomarkers requires specialized expertise and time.  
The objective of this project is to design a **predictive analytics system** using the **Chronic Kidney Disease (CKD) Dataset**, which contains multivariate clinical laboratory measurements from real patients, including blood tests, urine analysis, and clinical parameters collected until diagnosis.

---

## Dataset Overview
- **Total Attributes:** 25  
- **Features:** 24  
- **Target Variable:** 1 (Class)  
- **Numerical Attributes:** 11  
- **Nominal Attributes:** 14  

---

## Attribute Information

| No. | Attribute Name | Symbol | Type | Description / Values |
|----:|---------------|--------|------|----------------------|
| 1 | Age | age | Numerical | Age in years |
| 2 | Blood Pressure | bp | Numerical | Blood pressure in mmHg |
| 3 | Specific Gravity | sg | Nominal | (1.005, 1.010, 1.015, 1.020, 1.025) |
| 4 | Albumin | al | Nominal | (0, 1, 2, 3, 4, 5) |
| 5 | Sugar | su | Nominal | (0, 1, 2, 3, 4, 5) |
| 6 | Red Blood Cells | rbc | Nominal | (normal, abnormal) |
| 7 | Pus Cell | pc | Nominal | (normal, abnormal) |
| 8 | Pus Cell Clumps | pcc | Nominal | (present, notpresent) |
| 9 | Bacteria | ba | Nominal | (present, notpresent) |
| 10 | Blood Glucose Random | bgr | Numerical | Blood glucose in mg/dL |
| 11 | Blood Urea | bu | Numerical | Blood urea in mg/dL |
| 12 | Serum Creatinine | sc | Numerical | Serum creatinine in mg/dL |
| 13 | Sodium | sod | Numerical | Sodium in mEq/L |
| 14 | Potassium | pot | Numerical | Potassium in mEq/L |
| 15 | Hemoglobin | hemo | Numerical | Hemoglobin in g/dL |
| 16 | Packed Cell Volume | pcv | Numerical | Volume percentage of red blood cells |
| 17 | White Blood Cell Count | wbcc | Numerical | Cells per mm³ |
| 18 | Red Blood Cell Count | rbcc | Numerical | Millions per mm³ |
| 19 | Hypertension | htn | Nominal | (yes, no) |
| 20 | Diabetes Mellitus | dm | Nominal | (yes, no) |
| 21 | Coronary Artery Disease | cad | Nominal | (yes, no) |
| 22 | Appetite | appet | Nominal | (good, poor) |
| 23 | Pedal Edema | pe | Nominal | (yes, no) |
| 24 | Anemia | ane | Nominal | (yes, no) |
| 25 | Class (Target) | class | Nominal | (ckd, notckd) |
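
The type column of the table can be captured as two plain lists so that later steps can select columns by kind. The sketch below is only an illustration: the constant names are hypothetical, and the spellings `wbcc` and `rbcc` are assumed from the column names in the loaded DataFrame.

```python
# Column groupings taken from the attribute table.
# Assumption: white/red blood cell counts are named 'wbcc' and 'rbcc',
# as in the loaded DataFrame.
NUMERICAL_COLS = [
    "age", "bp", "bgr", "bu", "sc", "sod",
    "pot", "hemo", "pcv", "wbcc", "rbcc",
]
NOMINAL_COLS = [
    "sg", "al", "su", "rbc", "pc", "pcc", "ba",
    "htn", "dm", "cad", "appet", "pe", "ane", "class",
]
TARGET_COL = "class"

# Every attribute except the target is a feature
feature_cols = [c for c in NUMERICAL_COLS + NOMINAL_COLS if c != TARGET_COL]

# Sanity checks against the counts in the Dataset Overview
assert len(NUMERICAL_COLS) == 11
assert len(NOMINAL_COLS) == 14
assert len(feature_cols) == 24
assert not set(NUMERICAL_COLS) & set(NOMINAL_COLS)  # no column in both groups
```

Keeping the groupings in one place means a selection such as `df[NUMERICAL_COLS]` does not depend on the dtypes pandas happens to infer.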

---

## Objective
To build a machine learning–based predictive model that accurately classifies patients as **CKD** or **Non-CKD**, enabling early diagnosis and supporting clinical decision-making.


## Importing Required Libraries

In [4]:
import numpy as np # numerical operations
import pandas as pd # handling datasets
from ucimlrepo import fetch_ucirepo # easily import datasets

In [7]:
# fetch the dataset from the UCI repository
chronic_kidney_disease = fetch_ucirepo(id=336)

# features and target (as pandas DataFrames)
X = chronic_kidney_disease.data.features
y = chronic_kidney_disease.data.targets

# combine the features (X) and target (y) into a single DataFrame
df = pd.concat([X, y], axis=1)

In [8]:
# view first 5 rows from the dataframe
df.head()

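| | age | bp | sg | al | su | rbc | pc | pcc | ba | bgr | ... | pcv | wbcc | rbcc | htn | dm | cad | appet | pe | ane | class |
|---:|---:|---:|---:|---:|---:|---|---|---|---|---:|---|---:|---:|---:|---|---|---|---|---|---|---|
| 0 | 48.0 | 80.0 | 1.02 | 1.0 | 0.0 | NaN | normal | notpresent | notpresent | 121.0 | ... | 44.0 | 7800.0 | 5.2 | yes | yes | no | good | no | no | ckd |
| 1 | 7.0 | 50.0 | 1.02 | 4.0 | 0.0 | NaN | normal | notpresent | notpresent | NaN | ... | 38.0 | 6000.0 | NaN | no | no | no | good | no | no | ckd |
| 2 | 62.0 | 80.0 | 1.01 | 2.0 | 3.0 | normal | normal | notpresent | notpresent | 423.0 | ... | 31.0 | 7500.0 | NaN | no | yes | no | poor | no | yes | ckd |
| 3 | 48.0 | 70.0 | 1.005 | 4.0 | 0.0 | normal | abnormal | present | notpresent | 117.0 | ... | 32.0 | 6700.0 | 3.9 | yes | no | no | poor | yes | yes | ckd |
| 4 | 51.0 | 80.0 | 1.01 | 2.0 | 0.0 | normal | normal | notpresent | notpresent | 106.0 | ... | 35.0 | 7300.0 | 4.6 | no | no | no | good | no | no | ckd |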


In [9]:
# information about the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 25 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     391 non-null    float64
 1   bp      388 non-null    float64
 2   sg      353 non-null    float64
 3   al      354 non-null    float64
 4   su      351 non-null    float64
 5   rbc     248 non-null    object 
 6   pc      335 non-null    object 
 7   pcc     396 non-null    object 
 8   ba      396 non-null    object 
 9   bgr     356 non-null    float64
 10  bu      381 non-null    float64
 11  sc      383 non-null    float64
 12  sod     313 non-null    float64
 13  pot     312 non-null    float64
 14  hemo    348 non-null    float64
 15  pcv     329 non-null    float64
 16  wbcc    294 non-null    float64
 17  rbcc    269 non-null    float64
 18  htn     398 non-null    object 
 19  dm      398 non-null    object 
 20  cad     398 non-null    object 
 21  appet   399 non-null    object 
 22  pe

In [22]:
# convert the nominal columns to the 'category' dtype
nominal_cols = [
    'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba',
    'htn', 'dm', 'cad', 'appet', 'pe', 'ane', 'class'
]
df = df.astype({col: 'category' for col in nominal_cols})


In [23]:
# description of numerical data
df.describe()

| | age | bp | bgr | bu | sc | sod | pot | hemo | pcv | wbcc | rbcc |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| count | 391.0 | 388.0 | 356.0 | 381.0 | 383.0 | 313.0 | 312.0 | 348.0 | 329.0 | 294.0 | 269.0 |
| mean | 51.483376 | 76.469072 | 148.036517 | 57.425722 | 3.072454 | 137.528754 | 4.627244 | 12.526437 | 38.884498 | 8406.122449 | 4.707435 |
| std | 17.169714 | 13.683637 | 79.281714 | 50.503006 | 5.741126 | 10.408752 | 3.193904 | 2.912587 | 8.990105 | 2944.47419 | 1.025323 |
| min | 2.0 | 50.0 | 22.0 | 1.5 | 0.4 | 4.5 | 2.5 | 3.1 | 9.0 | 2200.0 | 2.1 |
| 25% | 42.0 | 70.0 | 99.0 | 27.0 | 0.9 | 135.0 | 3.8 | 10.3 | 32.0 | 6500.0 | 3.9 |
| 50% | 55.0 | 80.0 | 121.0 | 42.0 | 1.3 | 138.0 | 4.4 | 12.65 | 40.0 | 8000.0 | 4.8 |
| 75% | 64.5 | 80.0 | 163.0 | 66.0 | 2.8 | 142.0 | 4.9 | 15.0 | 45.0 | 9800.0 | 5.4 |
| max | 90.0 | 180.0 | 490.0 | 391.0 | 76.0 | 163.0 | 47.0 | 17.8 | 54.0 | 26400.0 | 8.0 |
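
The summary contains a few extremes that may deserve a second look, such as a minimum `sod` of 4.5 and a maximum `pot` of 47.0. One possible way to surface such rows is a simple range check. This is only a sketch: `flag_out_of_range` is a hypothetical helper, the demo frame holds made-up values rather than the CKD data, and the bounds are placeholders, not clinical reference ranges.

```python
import numpy as np
import pandas as pd


def flag_out_of_range(frame, bounds):
    """Mark non-missing values that fall outside [low, high] for each column."""
    flags = {}
    for col, (low, high) in bounds.items():
        values = frame[col]
        # Missing values are not flagged; they are handled separately
        flags[col] = values.notna() & ~values.between(low, high)
    return pd.DataFrame(flags, index=frame.index)


# Made-up demo values (not the CKD data)
demo = pd.DataFrame({
    "sod": [4.5, 138.0, np.nan],
    "pot": [4.4, 47.0, 3.8],
})

# Placeholder bounds chosen for illustration only
bounds = {"sod": (100, 180), "pot": (1.5, 10)}

flags = flag_out_of_range(demo, bounds)
print(flags.sum())  # number of flagged values per column
```

Flagged rows can then be inspected with `demo[flags.any(axis=1)]` before deciding whether they are entry errors or genuine measurements.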


In [16]:
# check for the missing values
df.isna().sum().sort_values(ascending = False)

rbc      152
rbcc     131
wbcc     106
pot       88
sod       87
pcv       71
pc        65
hemo      52
su        49
sg        47
al        46
bgr       44
bu        19
sc        17
bp        12
age        9
ba         4
pcc        4
htn        2
dm         2
cad        2
appet      1
pe         1
ane        1
class      0
dtype: int64
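
Raw counts are easier to compare as a share of the 400 rows; for instance, the 152 missing `rbc` values are 38% of the dataset. A minimal sketch of that conversion follows, using a hypothetical helper `missing_summary` and a small made-up frame in place of the CKD data:

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return missing counts and percentages per column, most missing first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Made-up demo values (not the CKD data)
demo = pd.DataFrame({
    "rbc": ["normal", None, None, "abnormal"],
    "age": [48.0, 7.0, np.nan, 51.0],
    "class": ["ckd", "ckd", "ckd", "notckd"],
})

summary = missing_summary(demo)
print(summary)
```

On the real data the call would be `missing_summary(df)`.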

In [18]:
# check for duplicates
df.duplicated().sum()

0

In [26]:
def categorical_counts(data, cat_cols):
    """Print the value counts, including missing values, for each column."""
    for col in cat_cols:
        print(f"\nColumn: {col}")
        print(data[col].value_counts(dropna=False))
        print("-" * 30)

# nominal columns to summarise
cate_list = [
    'sg','al','su','rbc','pc','pcc','ba',
    'htn','dm','cad','appet','pe','ane','class'
]

categorical_counts(df, cate_list)



Column: sg
sg
1.02     106
1.01      84
1.025     81
1.015     75
NaN       47
1.005      7
Name: count, dtype: int64
------------------------------

Column: al
al
0.0    199
NaN     46
1.0     44
2.0     43
3.0     43
4.0     24
5.0      1
Name: count, dtype: int64
------------------------------

Column: su
su
0.0    290
NaN     49
2.0     18
3.0     14
1.0     13
4.0     13
5.0      3
Name: count, dtype: int64
------------------------------

Column: rbc
rbc
normal      201
NaN         152
abnormal     47
Name: count, dtype: int64
------------------------------

Column: pc
pc
normal      259
abnormal     76
NaN          65
Name: count, dtype: int64
------------------------------

Column: pcc
pcc
notpresent    354
present        42
NaN             4
Name: count, dtype: int64
------------------------------

Column: ba
ba
notpresent    374
present        22
NaN             4
Name: count, dtype: int64
------------------------------

Column: htn
htn
no     251
yes    147
NaN      2
Name: 