**Basic Data Pre-Processing using UCI Heart Disease Dataset**

**Libraries Used**

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler


1: Load the Dataset

In [2]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

columns = [
    'age','sex','cp','trestbps','chol','fbs','restecg',
    'thalach','exang','oldpeak','slope','ca','thal','target'
]

# The data file has no header row, so the column names are supplied explicitly
data = pd.read_csv(url, names=columns)
print(data.head())


    age  sex   cp  trestbps   chol  fbs  restecg  thalach  exang  oldpeak  \
0  63.0  1.0  1.0     145.0  233.0  1.0      2.0    150.0    0.0      2.3   
1  67.0  1.0  4.0     160.0  286.0  0.0      2.0    108.0    1.0      1.5   
2  67.0  1.0  4.0     120.0  229.0  0.0      2.0    129.0    1.0      2.6   
3  37.0  1.0  3.0     130.0  250.0  0.0      0.0    187.0    0.0      3.5   
4  41.0  0.0  2.0     130.0  204.0  0.0      2.0    172.0    0.0      1.4   

   slope   ca thal  target  
0    3.0  0.0  6.0       0  
1    2.0  3.0  3.0       2  
2    2.0  2.0  7.0       1  
3    3.0  0.0  3.0       0  
4    1.0  0.0  3.0       0  
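The file marks missing entries with `?`, which is why `ca` and `thal` print as strings rather than numbers. One possible alternative to cleaning this up afterwards, sketched here rather than taken from the notebook, is to pass `na_values='?'` to `pd.read_csv` so those columns are parsed as floats at load time. The in-memory sample below is an assumption for illustration: its first two rows repeat the output above, and the third row is made up to show a `?` entry.

```python
import io
import pandas as pd

columns = [
    'age','sex','cp','trestbps','chol','fbs','restecg',
    'thalach','exang','oldpeak','slope','ca','thal','target'
]

# Illustrative sample only: rows 1-2 repeat the head() output above,
# row 3 is invented to demonstrate a '?' placeholder.
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2\n"
    "50.0,0.0,3.0,120.0,210.0,0.0,0.0,160.0,0.0,0.0,1.0,?,?,0\n"
)

# na_values='?' tells pandas to treat '?' as NaN while parsing
sample_df = pd.read_csv(sample, names=columns, na_values='?')

print(sample_df[['ca', 'thal']].dtypes)
print(sample_df[['ca', 'thal']].isnull().sum())
```

With this option the columns arrive as `float64`, so a separate `replace` and `astype(float)` pass would not be needed for them.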


2: Dataset Overview

In [3]:
print(data.info())
print(data.describe())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    float64
 1   sex       303 non-null    float64
 2   cp        303 non-null    float64
 3   trestbps  303 non-null    float64
 4   chol      303 non-null    float64
 5   fbs       303 non-null    float64
 6   restecg   303 non-null    float64
 7   thalach   303 non-null    float64
 8   exang     303 non-null    float64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    float64
 11  ca        303 non-null    object 
 12  thal      303 non-null    object 
 13  target    303 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB
None
              age         sex          cp    trestbps        chol         fbs  \
count  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000   
mean    54.438944    0.679868    3.15841

3: Handle Missing Values

In [4]:
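# The file marks missing entries with '?'; convert them to NaN
data.replace('?', np.nan, inplace=True)
print(data.isnull().sum())

# 'ca' and 'thal' were read as strings, so cast every column to float
data = data.astype(float)

# Fill each missing value with the mean of its column
data.fillna(data.mean(), inplace=True)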


age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          4
thal        2
target      0
dtype: int64
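The two columns with missing values, `ca` and `thal`, hold discrete codes (the rows printed earlier show `thal` values of 3.0, 6.0, and 7.0), so a column mean can produce a value that is not a valid code. A possible alternative, sketched on a small made-up frame rather than the real data, is to fill those columns with their most frequent value. The `toy` frame and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up toy data with one missing value per column
toy = pd.DataFrame({
    'ca':   [0.0, 0.0, 3.0, np.nan, 2.0],
    'thal': [3.0, 7.0, 3.0, 6.0, np.nan],
})

# Mean imputation, as in the cell above: may yield non-code values
mean_filled = toy.fillna(toy.mean())

# Mode imputation: each gap gets the column's most frequent value
mode_filled = toy.copy()
for col in ['ca', 'thal']:
    most_frequent = mode_filled[col].mode().iloc[0]
    mode_filled[col] = mode_filled[col].fillna(most_frequent)

print(mean_filled)
print(mode_filled)
```

In this toy case the mean fills `ca` with 1.25 and `thal` with 4.75, while the mode fills them with 0.0 and 3.0. Which choice is appropriate depends on the downstream model.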


4: Convert Target Variable

In [10]:
data['target'] = (data['target'] > 0).astype(int)
print(data['target'].head())


0    0
1    1
2    1
3    0
4    0
Name: target, dtype: int64
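The raw target takes several values (0, 2, and 1 appear in the first five rows), and the comparison `> 0` collapses every non-zero value into a single positive class. A minimal sketch of that mapping, using the first five raw values from the earlier `head()` output; the names `raw_target` and `binary_target` are hypothetical:

```python
import pandas as pd

# First five raw target values, copied from the head() output
raw_target = pd.Series([0, 2, 1, 0, 0], name='target')

# True where the value is non-zero, then cast True/False to 1/0
binary_target = (raw_target > 0).astype(int)

print(binary_target.tolist())  # [0, 1, 1, 0, 0]

# Class counts are a quick check for imbalance after the conversion
print(binary_target.value_counts())
```

Running `value_counts()` on the full `data['target']` column after the conversion gives the same kind of check for the whole dataset.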


5: Feature Scaling

In [11]:

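scaler = StandardScaler()

# Separate the 13 features from the binary label
X = data.drop('target', axis=1)
y = data['target']

# Standardize each feature to zero mean and unit variance,
# then wrap the resulting array back into a DataFrame
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)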

print(X_scaled.head())


        age       sex        cp  trestbps      chol       fbs   restecg  \
0  0.948726  0.686202 -2.251775  0.757525 -0.264900  2.394438  1.016684   
1  1.392002  0.686202  0.877985  1.611220  0.760415 -0.417635  1.016684   
2  1.392002  0.686202  0.877985 -0.665300 -0.342283 -0.417635  1.016684   
3 -1.932564  0.686202 -0.165268 -0.096170  0.063974 -0.417635 -0.996749   
4 -1.489288 -1.457296 -1.208521 -0.096170 -0.825922 -0.417635  1.016684   

    thalach     exang   oldpeak     slope        ca      thal  
0  0.017197 -0.696631  1.087338  2.274579 -0.723095  0.655818  
1 -1.821905  1.435481  0.397182  0.649113  2.503851 -0.898522  
2 -0.902354  1.435481  1.346147  0.649113  1.428203  1.173931  
3  1.637359 -0.696631  2.122573  2.274579 -0.723095 -0.898522  
4  0.980537 -0.696631  0.310912 -0.976352 -0.723095 -0.898522  
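The cell above fits the scaler on all 303 rows. If the data is later split into training and test sets, a common precaution is to fit the scaler on the training portion only, so that test-set statistics do not influence the transformation. A minimal sketch on synthetic data follows; `X_demo`, `y_demo`, the 80/20 split, and `random_state=42` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and binary label
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(50, 10, size=(100, 3)), columns=['a', 'b', 'c'])
y_demo = pd.Series(rng.integers(0, 2, size=100), name='target')

# Split first, keeping the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Learn mean and standard deviation from the training rows only
demo_scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    demo_scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)

# Reuse the training statistics to transform the test rows
X_test_scaled = pd.DataFrame(
    demo_scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)

print(X_train_scaled.shape, X_test_scaled.shape)
```

The training columns end up with zero mean and unit variance, while the test columns are only approximately standardized because they reuse the training statistics.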


6: Final Dataset Shape

In [12]:
print(X_scaled.shape)
print(y.shape)


(303, 13)
(303,)


**Conclusion**

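In this experiment, the UCI Heart Disease dataset was pre-processed using Python. Missing values marked with `?` were converted to NaN and filled with column means, the multi-valued target was converted to a binary label, and the 13 features were standardized with `StandardScaler`. The result is a 303 × 13 feature matrix and a 303-element label vector, ready for predictive models such as Logistic Regression, SVM, or KNN.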