### Step 1: Import Necessary Libraries
First, we need to import the required libraries.


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer


### Explanation
pandas and numpy: Libraries for tabular data manipulation and numerical arrays.

train_test_split: Splits the dataset into training and testing sets.

StandardScaler, OneHotEncoder: Scale numerical features and encode categorical ones.

ColumnTransformer: Applies different transformers to different subsets of columns.

SimpleImputer: Fills in missing values.

Pipeline: Chains preprocessing steps so they run in sequence.


### Step 2: Load the Dataset
We'll load the processed Cleveland subset of the Heart Disease dataset from the UCI Machine Learning Repository.

In [2]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
columns = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", 
    "exang", "oldpeak", "slope", "ca", "thal", "num"  # 'num' is the target
]

# Load dataset with '?' as missing values (NaN)
df = pd.read_csv(url, names=columns, na_values="?")


### Explanation
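The dataset contains 303 patient records with 13 clinical features and a diagnosis column (num) used to predict heart disease. The file has no header row, so the column names are supplied through names=columns.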

na_values="?" ensures that any '?' in the data is treated as a missing value.
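The effect of na_values can be checked without downloading anything. The sketch below parses two made-up rows in the same comma-separated layout (the values are illustrative, not quoted from the file) and shows that the '?' becomes NaN while the column stays numeric.

```python
import io
import pandas as pd

columns = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
    "exang", "oldpeak", "slope", "ca", "thal", "num",
]

# Two illustrative rows (hypothetical values); the second has '?' in 'ca'.
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "52.0,1.0,3.0,172.0,199.0,1.0,0.0,162.0,0.0,0.5,1.0,?,7.0,0\n"
)

sample_df = pd.read_csv(sample, names=columns, na_values="?")

# Count NaNs per column and show only the columns that have any.
missing_per_column = sample_df.isna().sum()
print(missing_per_column[missing_per_column > 0])

# Because '?' was parsed as NaN, 'ca' is a float column rather than text.
print(sample_df["ca"].dtype)
```

Running the same isna().sum() on the full df is a quick way to see which columns need imputation.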

### Step 3: Handle Missing Values
We'll impute the missing values in the two columns that contain them, ca and thal, using the median.

In [3]:
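# Assign the result back instead of calling fillna(..., inplace=True) on a
# column selection, which recent pandas versions warn about and may not apply.
df['ca'] = df['ca'].fillna(df['ca'].median())  # 'ca': number of major vessels
df['thal'] = df['thal'].fillna(df['thal'].median())  # 'thal': thalassemia status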


### Explanation
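ca and thal contain a few missing values, which we replace with each column's median. Both columns are treated as categorical in Step 6, so the most frequent value would be an equally reasonable fill.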
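To confirm that the imputation did what was intended, it can help to count NaNs before and after. A minimal sketch on a toy frame (the values are hypothetical, chosen so the medians are easy to verify by hand):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two columns with gaps (hypothetical values).
toy = pd.DataFrame({
    "ca": [0.0, 1.0, np.nan, 3.0],
    "thal": [3.0, np.nan, 7.0, 7.0],
})

missing_before = toy.isna().sum()
print(missing_before)

# Fill each column with its own median, assigning the result back.
for col in ["ca", "thal"]:
    toy[col] = toy[col].fillna(toy[col].median())

missing_after = toy.isna().sum()
print(missing_after)
```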


### Step 4: Convert Target Variable to Binary
The original target variable num takes integer values from 0 to 4, where 0 means no disease and any value above 0 indicates disease. We convert it to binary (0: no heart disease, 1: heart disease).

In [4]:
df['num'] = df['num'].apply(lambda x: 1 if x > 0 else 0)


### Explanation
Heart disease present (original num of 1 to 4): 1

No heart disease (original num of 0): 0
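The same mapping can be written as a vectorized comparison, which avoids calling a Python function per row. The sketch below uses a hypothetical sample covering every value num can take and checks that both forms agree.

```python
import pandas as pd

# Hypothetical sample covering the values 0 to 4.
num = pd.Series([0, 1, 2, 3, 4, 0], name="num")

via_apply = num.apply(lambda x: 1 if x > 0 else 0)  # row-by-row form
via_mask = (num > 0).astype(int)                    # vectorized equivalent

print(via_mask.tolist())
print((via_apply == via_mask).all())
```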

### Step 5: Define Features and Target Variables
We'll separate the features and target variable for model training.

In [5]:
X = df.drop('num', axis=1)  # Features
y = df['num']  # Target


### Explanation
X: All features except the target (num).

y: The binary target variable.

### Step 6: Encode Categorical Variables and Scale Numerical Features
We'll apply one-hot encoding to categorical variables and scale numerical features.

### Step 6.1: Identify Categorical and Numerical Features

In [6]:
categorical_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']
numerical_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']


### Step 6.2: Create Preprocessing Pipelines
Pipeline for Numerical Features

In [7]:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # Fill missing with median
    ('scaler', StandardScaler())  # Scale numerical features
])


Pipeline for Categorical Features

In [8]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Fill missing with mode
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical features
])


### Explanation
Numerical Pipeline:

Impute missing values with the median.

Standardize features with StandardScaler so that each has zero mean and unit variance.

Categorical Pipeline:

Impute missing values with the most frequent value.

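Apply one-hot encoding to categorical variables. With handle_unknown='ignore', a category that was not seen during fitting is encoded as all zeros instead of raising an error.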
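What the two pipelines produce is easiest to see on a tiny input. The following sketch rebuilds them under the hypothetical names num_pipe and cat_pipe and runs each on a one-column toy frame with a single gap (values are made up for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy inputs (hypothetical values), each with one missing entry.
toy_num = pd.DataFrame({"chol": [200.0, 240.0, np.nan, 280.0]})
toy_cat = pd.DataFrame({"cp": [1.0, 4.0, 4.0, np.nan]})

num_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# The NaN becomes the median (240), then the column is standardized.
scaled = num_pipe.fit_transform(toy_num)
print(scaled.ravel())

# The NaN becomes the mode (4.0), then each category gets its own column.
# OneHotEncoder returns a sparse matrix by default, hence toarray().
encoded = cat_pipe.fit_transform(toy_cat).toarray()
print(encoded)
```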

### Step 6.3: Combine Pipelines Using ColumnTransformer

In [9]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])


### Explanation
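The ColumnTransformer applies the numeric pipeline to numerical_features and the categorical pipeline to categorical_features, then concatenates the results column-wise. Columns not listed in any transformer are dropped by default (remainder='drop'); here all 13 features are listed.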

### Step 7: Fit and Transform the Data
We now fit the preprocessing pipelines to the data and transform the features.

In [10]:
X_processed = preprocessor.fit_transform(X)


### Explanation
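The transformed feature matrix is now ready for model training. Note that fit_transform(X) runs on the full dataset before the split in Step 8, so the imputation values and scaling statistics also reflect rows that later land in the test set. For a strict evaluation, fit the preprocessor on the training rows only and call transform on the test rows.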
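One way to keep test rows out of the fitted statistics is to split first and fit on the training rows only. A minimal sketch, assuming synthetic data (X_toy and y_toy are hypothetical stand-ins for X and y) and a bare StandardScaler in place of the full preprocessor:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X and y (hypothetical data).
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({"age": rng.normal(55, 9, size=100)})
y_toy = pd.Series(rng.integers(0, 2, size=100))

# Split the raw data first.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # statistics come from training rows only
X_te_scaled = scaler.transform(X_te)      # test rows reuse those statistics

print(X_tr_scaled.shape, X_te_scaled.shape)
```

The same pattern applies to a ColumnTransformer: fit_transform on the training features, transform on the test features.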

### Step 8: Split the Dataset into Training and Testing Sets
We'll split the processed data into training and testing sets.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    X_processed, y, test_size=0.2, random_state=42, stratify=y
)


### Explanation
80% of the data is used for training, and 20% for testing.

stratify=y ensures that both sets maintain the same class distribution as the original data.
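The effect of stratify can be verified by comparing class proportions in the two splits. The sketch below uses a hypothetical label vector with a 60/40 class balance:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60 negatives and 40 positives.
y_toy = pd.Series([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The mean of a 0/1 vector is the share of positives in that split.
print(y_tr.mean(), y_te.mean())
```

Both splits keep the 40% positive share of the original labels.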

### Step 9: Display Dataset Shapes
Finally, we'll print the shapes of the training and testing sets to verify our preprocessing.

In [12]:
print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
print("Training labels shape:", y_train.shape)
print("Testing labels shape:", y_test.shape)


Training set shape: (242, 28)
Testing set shape: (61, 28)
Training labels shape: (242,)
Testing labels shape: (61,)


### Summary
In this tutorial, we walked through the complete preprocessing pipeline for the Heart Disease dataset. 
The following steps were covered:

>Handling missing values using median and most frequent value imputation.

>Encoding categorical variables using one-hot encoding.

>Scaling numerical variables with StandardScaler.

>Splitting the dataset into training and testing sets using train_test_split.

This preprocessed dataset is now ready to be used with machine learning algorithms for building predictive models.

You can further explore different algorithms like logistic regression, decision trees, or neural networks on this data.