# Decision Tree Classification Project


## Introduction
In this project, we build a Decision Tree classifier to predict a heart disease diagnosis from patient features. Decision trees are interpretable machine learning models that can be used for both classification and regression. The dataset used here has some complexity, including missing values and categorical features.
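
To make the interpretability point concrete, here is a minimal sketch that fits a shallow tree and prints its learned rules as text. It uses scikit-learn's bundled iris data rather than this project's dataset, and `max_depth=2` is an arbitrary choice for readability; both are assumptions made only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative data only: the bundled iris dataset, not the heart disease data
iris = load_iris()

# A shallow tree keeps the printed rule set short enough to read
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(iris.data, iris.target)

# export_text renders each split as an indented if/else style rule
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each printed line is a threshold test on one feature, which is what makes a small tree easy to audit by hand.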


## Dataset Description
The dataset contains the following columns (names and encodings as they appear in the CSV loaded below):
- **id:** Row identifier
- **age:** Age of the patient
- **sex:** Sex of the patient (`Male` or `Female`)
- **dataset:** Source of the record (e.g., `Cleveland`)
- **cp:** Chest pain type (4 values: `typical angina`, `atypical angina`, `non-anginal`, `asymptomatic`)
- **trestbps:** Resting blood pressure (in mm Hg)
- **chol:** Serum cholesterol in mg/dl
- **fbs:** Fasting blood sugar > 120 mg/dl (`True` or `False`)
- **restecg:** Resting electrocardiographic results (3 values, e.g., `normal`, `lv hypertrophy`)
- **thalch:** Maximum heart rate achieved (spelled `thalach` in some versions of the data)
- **exang:** Exercise-induced angina (`True` or `False`)
- **oldpeak:** ST depression induced by exercise relative to rest
- **slope:** Slope of the peak exercise ST segment (`upsloping`, `flat`, `downsloping`)
- **ca:** Number of major vessels (0-3) colored by fluoroscopy
- **thal:** Thalassemia (`normal`, `fixed defect`, `reversable defect`)
- **num:** Diagnosis of heart disease, an integer from 0 to 4 (0 = no disease); renamed to `target` below

You can download the dataset from [Kaggle's Heart Disease UCI dataset](https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data).


In [9]:
# Import necessary libraries
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset
path = r"C:\Users\hassa\OneDrive\المستندات\Machine learning files\archive\heart_disease_uci.csv"
df = pd.read_csv(path)

# Display the first few rows of the dataset
df.head()


Unnamed: 0,id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,1,63,Male,Cleveland,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,fixed defect,0
1,2,67,Male,Cleveland,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,normal,2
2,3,67,Male,Cleveland,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,reversable defect,1
3,4,37,Male,Cleveland,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0.0,normal,0
4,5,41,Female,Cleveland,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0.0,normal,0


In [12]:
df.rename(columns = {"num": "target"}, inplace = True)
df.head()

Unnamed: 0,id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,target
0,1,63,Male,Cleveland,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,fixed defect,0
1,2,67,Male,Cleveland,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,normal,2
2,3,67,Male,Cleveland,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,reversable defect,1
3,4,37,Male,Cleveland,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0.0,normal,0
4,5,41,Female,Cleveland,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0.0,normal,0


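The dataset contains various medical features and a target variable describing the heart disease diagnosis. Note that `target` takes integer values from 0 to 4 rather than a simple yes/no flag, so the model built below is a five-class classifier.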
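
The `head()` output above includes `target` values of 1 and 2, so the column is not strictly binary. If a yes/no label is wanted instead, one possible approach is sketched below. It assumes that 0 means no disease and that every value above 0 means disease; the toy frame and the `target_binary` column name are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the renamed `target` column (integer codes, as in the data above)
demo = pd.DataFrame({"target": [0, 2, 1, 0, 4, 3]})

# Assumption: 0 means no disease and any value above 0 means disease
demo["target_binary"] = (demo["target"] > 0).astype(int)

print(demo["target_binary"].tolist())
```

The rest of this notebook keeps the original multi-class target, so the results below are not directly comparable with a binary setup.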


In [10]:
# Display basic information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        920 non-null    int64  
 1   age       920 non-null    int64  
 2   sex       920 non-null    object 
 3   dataset   920 non-null    object 
 4   cp        920 non-null    object 
 5   trestbps  861 non-null    float64
 6   chol      890 non-null    float64
 7   fbs       830 non-null    object 
 8   restecg   918 non-null    object 
 9   thalch    865 non-null    float64
 10  exang     865 non-null    object 
 11  oldpeak   858 non-null    float64
 12  slope     611 non-null    object 
 13  ca        309 non-null    float64
 14  thal      434 non-null    object 
 15  num       920 non-null    int64  
dtypes: float64(5), int64(3), object(8)
memory usage: 115.1+ KB


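The dataset has 920 entries and 16 columns. Ten columns have missing values, and for some the gaps are large: `ca` has only 309 non-null entries, `thal` has 434, and `slope` has 611. These need to be handled before building the model. (The output still lists `num` because this cell was executed before the rename shown above.)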


## Data Cleaning and Preprocessing
We will handle any missing values and encode categorical features to prepare the data for modeling.


In [11]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values[missing_values > 0]

trestbps     59
chol         30
fbs          90
restecg       2
thalch       55
exang        55
oldpeak      62
slope       309
ca          611
thal        486
dtype: int64
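
Raw counts are easier to judge as a share of the 920 rows; for example, 611 missing values in `ca` is roughly 66% of the column. A minimal sketch of that calculation follows, using a small made-up frame in place of `df` (the column names are borrowed from the dataset, the values are not).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; values are invented for illustration
demo = pd.DataFrame({
    "age": [63, 67, 67, 37],
    "ca": [0.0, np.nan, np.nan, np.nan],
    "thal": ["fixed defect", None, "normal", None],
})

# isnull().mean() gives the fraction of missing entries per column
missing_share = demo.isnull().mean().sort_values(ascending=False)
print(missing_share[missing_share > 0])
```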

We will use `SimpleImputer` to fill in missing values: the most frequent value for categorical features and the mean for numerical features.
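
The two imputation strategies can be seen in isolation on a tiny example. This is only a sketch: the column names mirror the dataset, but the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy columns; the names mirror the dataset but the values are invented
num_demo = pd.DataFrame({"chol": [200.0, np.nan, 260.0]})
cat_demo = pd.DataFrame({"slope": ["flat", np.nan, "flat", "upsloping"]}, dtype=object)

# Mean imputation: the gap becomes (200 + 260) / 2 = 230
num_filled = SimpleImputer(strategy="mean").fit_transform(num_demo)

# Most-frequent imputation: the gap becomes the modal category, "flat"
cat_filled = SimpleImputer(strategy="most_frequent").fit_transform(cat_demo)

print(num_filled.ravel())
print(cat_filled.ravel())
```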


In [13]:
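# Separate features and target
# Note: X still contains the `id` column, which is a row identifier rather than
# a medical feature; it is scaled and passed to the model with the numeric columns
X = df.drop(columns=['target'])
y = df['target']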

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object', 'category']).columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns

# Create a preprocessing pipeline for numerical features
numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Create a preprocessing pipeline for categorical features
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine the numerical and categorical pipelines
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_pipeline, numerical_cols),
    ('cat', categorical_pipeline, categorical_cols)
])

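# Apply the preprocessing pipelines to the dataset
# Note: fitting on all rows before the split lets the imputer and scaler
# statistics include test rows, so the test score may be slightly optimistic
X_preprocessed = preprocessor.fit_transform(X)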

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)

# Check the first few rows of the preprocessed dataset
pd.DataFrame(X_train).head()


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,1.583321,0.901224,0.0,-0.2675001,-0.69834,2.014062,-2.050756e-16,0.0,1.0,0.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
1,-0.009413,0.051927,0.969281,-2.609929e-16,-0.618737,-0.834397,-2.050756e-16,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2,1.270799,-0.266559,0.0,1.284405,0.0,0.0,-2.050756e-16,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,-1.636036,-0.372721,-0.658158,0.1824606,0.814108,0.684781,-1.249371,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,-1.413881,-0.160397,-0.658158,1.155845,1.371326,-0.6445,-1.249371,0.0,1.0,1.0,...,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
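
The cell above fits the preprocessor on every row before splitting, so the imputation means and scaling statistics are partly computed from test rows. One common way to avoid this is to split the raw data first and wrap preprocessing and model in a single `Pipeline`. The sketch below shows that pattern on synthetic data; the frame `raw`, the made-up labels, and the variable names are assumptions for illustration, not the notebook's actual setup or results.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the raw data; values are random, names are borrowed
rng = np.random.default_rng(0)
n = 200
raw = pd.DataFrame({
    "age": rng.integers(30, 80, size=n).astype(float),
    "chol": rng.normal(240, 40, size=n),
    "cp": rng.choice(["asymptomatic", "non-anginal", "typical angina"], size=n),
})
raw["cp"] = raw["cp"].astype(object)
raw.loc[::10, "chol"] = np.nan          # inject some missing values
labels = (raw["age"] > 55).astype(int)  # made-up label for the demo

# Split the raw rows first, so test rows never influence the fitted statistics
X_tr, X_te, y_tr, y_te = train_test_split(
    raw, labels, test_size=0.2, random_state=42, stratify=labels
)

preprocess = ColumnTransformer(transformers=[
    ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                      ("scaler", StandardScaler())]), ["age", "chol"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["cp"]),
])

# Chaining preprocessing and model means fit() only ever sees training rows
clf = Pipeline([("preprocess", preprocess),
                ("tree", DecisionTreeClassifier(random_state=42))])
clf.fit(X_tr, y_tr)
test_accuracy = clf.score(X_te, y_te)
print(f"Held-out accuracy on synthetic data: {test_accuracy:.2f}")
```

Here `stratify=labels` keeps the class proportions similar in both splits, which may matter for the real target because its rarest class has very few samples.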


## Model Building and Evaluation
We will build a Decision Tree classifier and evaluate its performance on the test set.


In [15]:
# Build the Decision Tree classifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict the target variable on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:')
print(classification_rep)
print('Confusion Matrix:')
print(conf_matrix)

Accuracy: 0.55
Classification Report:
              precision    recall  f1-score   support

           0       0.72      0.87      0.79        75
           1       0.62      0.46      0.53        54
           2       0.29      0.28      0.29        25
           3       0.20      0.19      0.20        26
           4       0.00      0.00      0.00         4

    accuracy                           0.55       184
   macro avg       0.37      0.36      0.36       184
weighted avg       0.55      0.55      0.54       184

Confusion Matrix:
[[65  2  2  5  1]
 [14 25  8  7  0]
 [ 5  5  7  6  2]
 [ 6  7  6  5  2]
 [ 0  1  1  2  0]]


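The Decision Tree classifier reaches an accuracy of 0.55 on the 184 test samples. Performance is uneven across classes: class 0 has an F1-score of 0.79, while classes 2 and 3 score 0.29 and 0.20, and class 4 (only 4 test samples) is never predicted correctly. The macro-averaged F1-score of 0.36 reflects this imbalance.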
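
The headline numbers in the report can be recomputed directly from the confusion matrix, which helps when reading it. The sketch below copies the matrix from the output above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true, columns = predicted)
cm = np.array([
    [65,  2,  2,  5,  1],
    [14, 25,  8,  7,  0],
    [ 5,  5,  7,  6,  2],
    [ 6,  7,  6,  5,  2],
    [ 0,  1,  1,  2,  0],
])

# Accuracy is the diagonal (correct predictions) over all test samples
accuracy = np.trace(cm) / cm.sum()

# Recall divides each diagonal entry by its row total (true samples per class)
recall = np.diag(cm) / cm.sum(axis=1)

# Precision divides each diagonal entry by its column total (predictions per class)
precision = np.diag(cm) / cm.sum(axis=0)

print(f"accuracy: {accuracy:.2f}")
print("recall:   ", np.round(recall, 2))
print("precision:", np.round(precision, 2))
```

The row totals (75, 54, 25, 26, 4) match the `support` column of the classification report.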


In [16]:
# Reuse the first row of the (already preprocessed) training set as a stand-in for a new input
# This only checks that prediction runs; it is not a test on unseen data
new_input = X_train[0].reshape(1, -1)

# Predict the target variable for the new input
new_prediction = model.predict(new_input)
print(f'Predicted target: {new_prediction[0]}')


Predicted target: 4
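
The prediction above reuses an already preprocessed training row. A genuinely new patient record would arrive as raw values and would first need to pass through the fitted preprocessor. The sketch below shows that step on toy data; `demo_preprocessor`, `demo_model`, and all values are hypothetical and only illustrate the pattern.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy training data; the values and labels are made up for illustration
train = pd.DataFrame({
    "age": [63.0, 67.0, 37.0, 41.0],
    "cp": ["typical angina", "asymptomatic", "non-anginal", "asymptomatic"],
})
train_labels = [0, 1, 0, 1]

demo_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cp"]),
])
demo_model = DecisionTreeClassifier(random_state=42)
demo_model.fit(demo_preprocessor.fit_transform(train), train_labels)

# A new raw record must use the same column names as the training frame
new_record = pd.DataFrame({"age": [58.0], "cp": ["atypical angina"]})

# transform() reuses the fitted scaler and encoder; it does not refit them
new_features = demo_preprocessor.transform(new_record)
demo_prediction = demo_model.predict(new_features)
print(new_features.shape, demo_prediction[0])
```

Because the encoder was created with `handle_unknown="ignore"`, the unseen category `atypical angina` is encoded as all zeros instead of raising an error.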


## Conclusion and Next Steps
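In this project, we built a Decision Tree classifier for the five-level heart disease target. On the held-out test set it reached an accuracy of 0.55 and a macro-averaged F1-score of 0.36, with most errors in the less frequent classes 2 to 4. Possible next steps include fitting the preprocessing on the training split only, dropping the `id` column from the features, limiting the tree depth to reduce overfitting, and comparing against a binary disease/no-disease target.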