# **Heart Disease Dataset – Full Analysis and Preprocessing**
This notebook provides a complete walkthrough of the data analysis and preparation process for the **Heart Disease Dataset**.  
It includes:
1. Dataset exploration and descriptive statistics  
2. Exploratory Data Analysis (EDA)  
3. Data cleaning, encoding, and normalization  
4. A complete preprocessing pipeline using `scikit-learn`

---

In [10]:
import matplotlib
matplotlib.use('TkAgg')  # force the Tk GUI backend so plots open in a window when run as a script; remove this line inside Jupyter or on headless machines

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import joblib
from sklearn import set_config

# Configure plots
sns.set_theme(style="whitegrid", palette="Set2", font_scale=1.1)

## **1. Load Dataset**
The dataset contains medical and demographic attributes that can be used to predict whether a person is likely to have heart disease.  
It’s a binary classification dataset from Kaggle.

In [11]:
# Load the dataset
df = pd.read_csv("heart.csv")

# Preview first rows
print("=== Preview of the Dataset ===")
df.head()

=== Preview of the Dataset ===


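| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |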


## **2. Basic Dataset Information**
Below we check dataset dimensions, data types, and basic statistics.

In [12]:
print(f"Number of instances (rows): {df.shape[0]}")
print(f"Number of features (columns): {df.shape[1]}")

print("\n=== Data Types ===")
print(df.dtypes)

print("\n=== Descriptive Statistics (Numerical Variables) ===")
df.describe().T

Number of instances (rows): 1025
Number of features (columns): 14

=== Data Types ===
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

=== Descriptive Statistics (Numerical Variables) ===


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 1025.0 | 54.434146 | 9.07229 | 29.0 | 48.0 | 56.0 | 61.0 | 77.0 |
| sex | 1025.0 | 0.69561 | 0.460373 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| cp | 1025.0 | 0.942439 | 1.029641 | 0.0 | 0.0 | 1.0 | 2.0 | 3.0 |
| trestbps | 1025.0 | 131.611707 | 17.516718 | 94.0 | 120.0 | 130.0 | 140.0 | 200.0 |
| chol | 1025.0 | 246.0 | 51.59251 | 126.0 | 211.0 | 240.0 | 275.0 | 564.0 |
| fbs | 1025.0 | 0.149268 | 0.356527 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| restecg | 1025.0 | 0.529756 | 0.527878 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 |
| thalach | 1025.0 | 149.114146 | 23.005724 | 71.0 | 132.0 | 152.0 | 166.0 | 202.0 |
| exang | 1025.0 | 0.336585 | 0.472772 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| oldpeak | 1025.0 | 1.071512 | 1.175053 | 0.0 | 0.0 | 0.8 | 1.8 | 6.2 |


## **3. Missing Values and Categorical Overview**
We check for missing data and inspect the categorical/ordinal variables.

In [13]:
print("\n=== Missing Values per Column ===")
print(df.isnull().sum())

categorical_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']
print("\n=== Unique Values for Categorical / Ordinal Features ===")
for col in categorical_features:
    print(f"{col}: {df[col].unique()}")


=== Missing Values per Column ===
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

=== Unique Values for Categorical / Ordinal Features ===
sex: [1 0]
cp: [0 1 2 3]
fbs: [0 1]
restecg: [1 0 2]
exang: [0 1]
slope: [2 0 1]
ca: [2 0 1 3 4]
thal: [3 2 1 0]


## **4. Target Variable Distribution**
Let's examine how balanced the classes are (presence vs absence of heart disease).

In [14]:
plt.figure(figsize=(5, 4))
sns.countplot(x='target', data=df, hue='target', palette='Set2', legend=False)
plt.title("Distribution of Target Variable (Heart Disease)")
plt.xlabel("Heart Disease (1 = Yes, 0 = No)")
plt.ylabel("Count")
plt.show()

df['target'].value_counts(normalize=True) * 100

target
1    51.317073
0    48.682927
Name: proportion, dtype: float64
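
The class proportions above also give a reference point for later models: a predictor that always outputs the majority class would already be right about 51% of the time. The sketch below illustrates this with `DummyClassifier`. The target column is rebuilt from the counts implied by the reported proportions (526 positives and 499 negatives out of 1025 rows), and `y_demo` and `X_placeholder` are illustrative names that are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Rebuild a target column from the class counts implied by the output above
# (51.317% of 1025 rows = 526 positives, the remaining 499 are negatives).
y_demo = pd.Series([1] * 526 + [0] * 499, name="target")

# A majority-class predictor ignores the features, so a placeholder column is enough.
X_placeholder = np.zeros((len(y_demo), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_demo)
baseline_accuracy = baseline.score(X_placeholder, y_demo)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any trained model should clearly beat this number before its accuracy is considered meaningful.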

## **5. Correlation Analysis**
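We use a heatmap to visualize pairwise Pearson correlations among all columns, including the integer-coded categorical features and the target. Correlations involving nominal codes such as `cp` or `thal` should be read with caution, since the numeric order of the codes is not necessarily meaningful.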

In [15]:
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap of Features")
plt.show()
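
A heatmap of all pairs can be hard to scan, so it may help to pull out only the column of correlations with the target and sort it by absolute value. The sketch below does this on a small synthetic frame; the column names `strong`, `weak`, and `noise` are made up for illustration, and the same expression could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic data: three features with decreasing relation to a binary target.
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "strong": signal + rng.normal(scale=0.3, size=n),
    "weak": signal + rng.normal(scale=3.0, size=n),
    "noise": rng.normal(size=n),
    "target": (signal > 0).astype(int),
})

# Take the target column of the correlation matrix, drop the trivial
# self-correlation, and order features by correlation strength.
target_corr = (
    toy.corr()["target"]
    .drop("target")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```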

# **SECTION 2 – DATA PREPARATION**

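We remove duplicate rows, encode categorical variables, standardize numerical ones,  
and split the dataset for model development. As the output below shows, 723 of the 1,025 rows are exact duplicates, so the exploratory statistics above were computed on data that includes these repeats.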

In [16]:
# --- Data Cleaning ---
print("Missing values per column:\n", df.isnull().sum())

duplicates = df.duplicated().sum()
print(f"Number of duplicated rows: {duplicates}")
df = df.drop_duplicates().reset_index(drop=True)
print(f"Shape after removing duplicates: {df.shape}")

Missing values per column:
 age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64
Number of duplicated rows: 723
Shape after removing duplicates: (302, 14)
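
Because so many rows are dropped here, it can be worth looking at the duplicates before removing them. The sketch below shows, on a made-up five-row frame, how `duplicated()` counts only the repeats while `duplicated(keep=False)` returns every member of a duplicate group.

```python
import pandas as pd

# Toy frame: rows 0, 1 and 3 are identical.
toy = pd.DataFrame({
    "age": [52, 52, 61, 52, 70],
    "chol": [212, 212, 203, 212, 174],
    "target": [0, 0, 1, 0, 1],
})

# duplicated() flags each repeat of an earlier row; the first occurrence is not flagged.
n_duplicates = int(toy.duplicated().sum())

# keep=False flags all members of each duplicate group, which is useful for inspection.
duplicate_groups = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Repeats: {n_duplicates}, rows in duplicate groups: {len(duplicate_groups)}")
print(f"Shape after removing duplicates: {deduped.shape}")
```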


## **Feature Encoding and Normalization**
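We encode categorical variables using `OneHotEncoder` and scale numeric features using `StandardScaler`. With `drop="first"`, each categorical feature contributes one column fewer than its number of categories. Note that in this cell both transformers are fit on the full dataset before it is split, so the test rows influence the scaling statistics; the pipeline in Section 3 avoids this by fitting on the training split only.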

In [17]:
categorical_features = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal"]
numeric_features = ["age", "trestbps", "chol", "thalach", "oldpeak"]

encoder = OneHotEncoder(drop="first", handle_unknown="ignore")
encoded_cat = encoder.fit_transform(df[categorical_features]).toarray()

encoded_df = pd.DataFrame(
    encoded_cat,
    columns=encoder.get_feature_names_out(categorical_features),
    index=df.index
)

df_encoded = pd.concat([df[numeric_features], encoded_df, df["target"]], axis=1)
print(f"\nData shape after encoding: {df_encoded.shape}")

scaler = StandardScaler()
df_encoded[numeric_features] = scaler.fit_transform(df_encoded[numeric_features])
print("\nStandardization applied to numeric features.")


Data shape after encoding: (302, 23)

Standardization applied to numeric features.


## **Dataset Splitting**
We split the dataset 80/20 into training and test sets, stratifying on `target` so that both parts keep the same class ratio.

In [18]:
X = df_encoded.drop("target", axis=1)
y = df_encoded["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")

Training set size: 241 samples
Test set size: 61 samples
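
In the cells above, the scaler was fit before the split. A leakage-free ordering splits first and fits the scaler on the training rows only. The sketch below shows that ordering on synthetic data and also checks that `stratify` preserves the class ratio; `X_demo`, `y_demo`, and the other variable names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data with a 60/40 class ratio.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=130, scale=17, size=(100, 1))
y_demo = np.array([1] * 60 + [0] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, the share of positives is (approximately) equal in both parts.
print(f"Positive share, train: {y_tr.mean():.2f}, test: {y_te.mean():.2f}")

# Fit the scaler on training rows only, then reuse its mean and std for the test rows.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
```

Wrapping the scaler in a `Pipeline`, as done in Section 3, gives the same ordering automatically.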


# **SECTION 3 – DATA PROCESSING PIPELINE**
Now we create a reproducible pipeline that automates all preprocessing steps  
(encoding + scaling) and includes a baseline logistic regression model.

In [19]:
categorical_transformer = OneHotEncoder(drop="first", handle_unknown="ignore")
numeric_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    transformers=[
        ("categorical", categorical_transformer, categorical_features),
        ("numerical", numeric_transformer, numeric_features)
    ]
)

pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000))
])

set_config(display='diagram')
pipeline

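Pipeline diagram (parameters shown by the HTML display; blank values were shown empty):

| Estimator | Parameter | Value |
|---|---|---|
| Pipeline | steps | [('preprocessor', ...), ('classifier', ...)] |
| Pipeline | transform_input | |
| Pipeline | memory | |
| Pipeline | verbose | False |
| ColumnTransformer | transformers | [('categorical', ...), ('numerical', ...)] |
| ColumnTransformer | remainder | 'drop' |
| ColumnTransformer | sparse_threshold | 0.3 |
| ColumnTransformer | n_jobs | |
| ColumnTransformer | transformer_weights | |
| ColumnTransformer | verbose | False |
| ColumnTransformer | verbose_feature_names_out | True |
| ColumnTransformer | force_int_remainder_cols | 'deprecated' |
| OneHotEncoder | categories | 'auto' |
| OneHotEncoder | drop | 'first' |
| OneHotEncoder | sparse_output | True |
| OneHotEncoder | dtype | <class 'numpy.float64'> |
| OneHotEncoder | handle_unknown | 'ignore' |
| OneHotEncoder | min_frequency | |
| OneHotEncoder | max_categories | |
| OneHotEncoder | feature_name_combiner | 'concat' |
| StandardScaler | copy | True |
| StandardScaler | with_mean | True |
| StandardScaler | with_std | True |
| LogisticRegression | penalty | 'l2' |
| LogisticRegression | dual | False |
| LogisticRegression | tol | 0.0001 |
| LogisticRegression | C | 1.0 |
| LogisticRegression | fit_intercept | True |
| LogisticRegression | intercept_scaling | 1 |
| LogisticRegression | class_weight | |
| LogisticRegression | random_state | |
| LogisticRegression | solver | 'lbfgs' |
| LogisticRegression | max_iter | 1000 |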


## **Train and Evaluate Pipeline**
We re-split the raw (unencoded) columns with the same seed, fit the pipeline on the training split only, and evaluate its predictions on the held-out test split.

In [20]:
X_raw = df[categorical_features + numeric_features]
y_raw = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print("=== Classification Report (Baseline Logistic Regression) ===")
print(classification_report(y_test, y_pred, digits=3))

=== Classification Report (Baseline Logistic Regression) ===
              precision    recall  f1-score   support

           0      0.828     0.857     0.842        28
           1      0.875     0.848     0.862        33

    accuracy                          0.852        61
   macro avg      0.851     0.853     0.852        61
weighted avg      0.853     0.852     0.853        61
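
With only 61 test rows, a single split gives a fairly noisy estimate, so cross-validation may give a steadier picture. The sketch below runs stratified 5-fold cross-validation on a small stand-in pipeline with synthetic data from `make_classification`; `demo_pipeline`, `X_demo`, and `y_demo` are illustrative names, and the notebook's own `pipeline`, `X_raw`, and `y_raw` could be passed to `cross_val_score` in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the heart disease dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# The pipeline is re-fit inside every fold, so preprocessing never sees the validation rows.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"Accuracy per fold: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```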



## **Feature Transformation Overview**
Let's inspect the feature names produced by the preprocessor: the 13 raw input columns expand to 22 after one-hot encoding.

In [21]:
feature_names = pipeline.named_steps["preprocessor"].get_feature_names_out()
print("Number of transformed features:", len(feature_names))
print("First 10 transformed features:", feature_names[:10])

Number of transformed features: 22
First 10 transformed features: ['categorical__sex_1' 'categorical__cp_1' 'categorical__cp_2'
 'categorical__cp_3' 'categorical__fbs_1' 'categorical__restecg_1'
 'categorical__restecg_2' 'categorical__exang_1' 'categorical__slope_1'
 'categorical__slope_2']


## **Save the Trained Pipeline**
We save the full preprocessing + model pipeline to a `.joblib` file for reuse (e.g., in Milestone M2).

In [23]:
joblib.dump(pipeline, "heart_pipeline.joblib")
print("Saved pipeline to 'heart_pipeline.joblib'")

Saved pipeline to 'heart_pipeline.joblib'
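
Loading the file back is the mirror image of saving it. The sketch below round-trips a small stand-in pipeline through a temporary directory and checks that the restored object predicts identically; the file name `demo_pipeline.joblib` and the synthetic data are illustrative. As general precautions, a `.joblib` file should be loaded with the same scikit-learn version that wrote it, and only from a trusted source, since it is pickle-based.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data and pipeline (not the heart disease model).
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

# Save to a temporary location, then load the pipeline back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.joblib")
    joblib.dump(demo_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline applies the same preprocessing and model as the original.
same_predictions = np.array_equal(
    demo_pipeline.predict(X_demo), restored.predict(X_demo)
)
print(f"Predictions identical after reload: {same_predictions}")
```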


# **Summary**
In this notebook, we:
- Explored and analyzed the Heart Disease Dataset  
- Cleaned, encoded, and standardized the data  
- Built a complete preprocessing pipeline  
- Evaluated a baseline Logistic Regression model  

This pipeline can now be reused for further modeling and optimization in **Milestone M2**.