# Data Cleaning & Feature Engineering

## Objective
In this notebook, we clean and preprocess the raw UCI Heart Disease dataset.
The goal is to:
- Handle missing values
- Convert the target variable into binary form
- Prepare features for machine learning models
- Save a clean, reusable dataset (`cleaned_data.csv`) for model training

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load Raw Dataset
column_names = [
    "age",
    "sex",
    "cp",
    "trestbps",
    "chol",
    "fbs",
    "restecg",
    "thalach",
    "exang",
    "oldpeak",
    "slope",
    "ca",
    "thal",
    "target"
]

df = pd.read_csv(
    "../data/raw/heart_disease_uci.data",
    header=None,
    names=column_names
)

df.head()

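| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |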


## Handle Missing Values
🔍 Identify Missing Values

In [3]:
(df == "?").sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          4
thal        2
target      0
dtype: int64

### Missing Values Strategy
- The UCI dataset uses `?` to represent missing values.
- Only `ca` (4 values) and `thal` (2 values) contain missing entries.
- Because so few rows are affected, we drop them rather than impute; 297 rows remain.
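
As an alternative to replacing `?` after loading, the marker can be declared at read time. The sketch below is a minimal illustration on a tiny made-up sample (the rows and the reduced column list are assumptions, not the real file); it relies on the `na_values` parameter of `pd.read_csv`.

```python
import io

import numpy as np
import pandas as pd

# Tiny made-up sample in the same comma-separated layout; not real data.
raw = io.StringIO(
    "63.0,1.0,0.0,6.0,0\n"
    "67.0,1.0,?,3.0,2\n"
    "41.0,0.0,0.0,?,0\n"
)
cols = ["age", "sex", "ca", "thal", "target"]

# na_values="?" tells pandas to parse "?" as NaN while reading.
sample = pd.read_csv(raw, header=None, names=cols, na_values="?")

# Count missing entries per column, then drop the affected rows.
missing_per_column = sample.isna().sum()
sample_clean = sample.dropna()

print(missing_per_column)
print(sample_clean.dtypes)
```

One consequence of this approach is that `ca` and `thal` are parsed as floats directly, so they would not end up with `object` dtype after the missing rows are dropped.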

🧹 Remove Rows with Missing Values

In [4]:
df_clean = df.replace("?", np.nan)
df_clean = df_clean.dropna()

df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 297 entries, 0 to 301
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       297 non-null    float64
 1   sex       297 non-null    float64
 2   cp        297 non-null    float64
 3   trestbps  297 non-null    float64
 4   chol      297 non-null    float64
 5   fbs       297 non-null    float64
 6   restecg   297 non-null    float64
 7   thalach   297 non-null    float64
 8   exang     297 non-null    float64
 9   oldpeak   297 non-null    float64
 10  slope     297 non-null    float64
 11  ca        297 non-null    object 
 12  thal      297 non-null    object 
 13  target    297 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 34.8+ KB


### Target Transformation
- Original target values: 0, 1, 2, 3, 4
- We convert this into a binary classification problem:
  - 0 → 0 (no heart disease)
  - 1, 2, 3, 4 → 1 (heart disease)

In [5]:
df_clean["target"] = df_clean["target"].astype(int)

# Convert multi-class target into binary
df_clean["target"] = df_clean["target"].apply(lambda x: 1 if x > 0 else 0)

df_clean["target"].value_counts()

target
0    160
1    137
Name: count, dtype: int64
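
The same binarization can also be written as a vectorized comparison instead of a row-wise `apply`. This is a minimal sketch on a made-up series of target values (the values are illustrative, not taken from the dataset).

```python
import pandas as pd

# Made-up multi-class target values in the original 0-4 coding.
target = pd.Series([0, 2, 1, 0, 4, 3], name="target")

# Any value above 0 becomes 1 (disease); 0 stays 0 (no disease).
binary = (target > 0).astype(int)

counts = binary.value_counts()
print(binary.tolist())
print(counts)
```

Both forms give the same result; the comparison avoids calling a Python function once per row.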

In [6]:
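# Cast every column to float: ca and thal are still object dtype after
# the "?" replacement. Note that target becomes float64 as well.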
df_clean = df_clean.astype(float)

df_clean.dtypes

age         float64
sex         float64
cp          float64
trestbps    float64
chol        float64
fbs         float64
restecg     float64
thalach     float64
exang       float64
oldpeak     float64
slope       float64
ca          float64
thal        float64
target      float64
dtype: object

In [7]:
df_clean.head()

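| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0.0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 1.0 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1.0 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0.0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0.0 |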


In [8]:
df_clean.describe()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 |
| mean | 54.542088 | 0.676768 | 3.158249 | 131.693603 | 247.350168 | 0.144781 | 0.996633 | 149.599327 | 0.326599 | 1.055556 | 1.602694 | 0.676768 | 4.73064 | 0.461279 |
| std | 9.049736 | 0.4685 | 0.964859 | 17.762806 | 51.997583 | 0.352474 | 0.994914 | 22.941562 | 0.469761 | 1.166123 | 0.618187 | 0.938965 | 1.938629 | 0.49934 |
| min | 29.0 | 0.0 | 1.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 |
| 25% | 48.0 | 0.0 | 3.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 |
| 50% | 56.0 | 1.0 | 3.0 | 130.0 | 243.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 2.0 | 0.0 | 3.0 | 0.0 |
| 75% | 61.0 | 1.0 | 4.0 | 140.0 | 276.0 | 0.0 | 2.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 7.0 | 1.0 |
| max | 77.0 | 1.0 | 4.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 3.0 | 3.0 | 7.0 | 1.0 |


### Feature Types
- Continuous: age, trestbps, chol, thalach, oldpeak
- Binary: sex, fbs, exang
- Categorical (ordinal/nominal): cp, restecg, slope, ca, thal

At this stage, we keep categorical features as numerical encodings.
Encoding strategies will be applied during model training using pipelines.
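
To make the deferred encoding step concrete, here is a hedged sketch of how the feature groups above might be wired into a scikit-learn preprocessor. The choice of `StandardScaler` and `OneHotEncoder`, and the names `preprocessor`, `sample`, and `encoded`, are assumptions for illustration; the notebook does not specify which encoders the training stage uses. The sample rows are the five shown by `df_clean.head()`, without the target column.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Feature groups as listed in this notebook.
continuous = ["age", "trestbps", "chol", "thalach", "oldpeak"]
binary = ["sex", "fbs", "exang"]
categorical = ["cp", "restecg", "slope", "ca", "thal"]

# Hypothetical preprocessing step; scaler and encoder choices are assumptions.
preprocessor = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), continuous),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("keep", "passthrough", binary),
    ]
)

feature_names = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal",
]
# The five rows shown by df_clean.head(), without the target column.
sample = pd.DataFrame(
    [
        [63.0, 1.0, 1.0, 145.0, 233.0, 1.0, 2.0, 150.0, 0.0, 2.3, 3.0, 0.0, 6.0],
        [67.0, 1.0, 4.0, 160.0, 286.0, 0.0, 2.0, 108.0, 1.0, 1.5, 2.0, 3.0, 3.0],
        [67.0, 1.0, 4.0, 120.0, 229.0, 0.0, 2.0, 129.0, 1.0, 2.6, 2.0, 2.0, 7.0],
        [37.0, 1.0, 3.0, 130.0, 250.0, 0.0, 0.0, 187.0, 0.0, 3.5, 3.0, 0.0, 3.0],
        [41.0, 0.0, 2.0, 130.0, 204.0, 0.0, 2.0, 172.0, 0.0, 1.4, 1.0, 0.0, 3.0],
    ],
    columns=feature_names,
)

# Fit on the sample and inspect the resulting matrix shape.
encoded = preprocessor.fit_transform(sample)
print(encoded.shape)
```

In a training notebook, a preprocessor like this would typically be the first step of a `Pipeline` and be fitted on the training split only, so that scaling statistics and category lists do not leak from the test split.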

In [9]:
df_clean.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

### Feature Selection
- All 13 input features are retained at this stage; none are dropped.
- Feature importance and further selection will be evaluated during model training.

In [10]:
df_clean.to_csv(
    "../data/processed/cleaned_data.csv",
    index=False
)

print("cleaned_data.csv saved successfully.")

cleaned_data.csv saved successfully.
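
A quick round-trip check can confirm that a saved file reloads cleanly. The sketch below uses a small made-up frame and a temporary directory rather than the real `../data/processed/` path, so the frame contents and the temporary location are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Small made-up frame standing in for the cleaned data.
frame = pd.DataFrame(
    {"age": [63.0, 67.0], "ca": [0.0, 3.0], "target": [0.0, 1.0]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_data.csv")
    # index=False keeps the row index out of the file, as in the cell above.
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded frame should have the same columns and no missing values.
same_columns = list(reloaded.columns) == list(frame.columns)
no_missing = not reloaded.isna().any().any()
print(same_columns, no_missing)
```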


## Summary

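- Rows with missing values (`?` in `ca` or `thal`) were removed, leaving 297 rows.
- The target variable was binarized (0 = no heart disease, 1 = heart disease).
- All columns, including the target, were cast to `float64`.
- The cleaned dataset was saved to `../data/processed/cleaned_data.csv` for downstream tasks.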

📌 **Next Step:**  
Model training, evaluation, and selection of the best classifier.