# Assignment 4 - Part 1: Predicting Heart Disease Using a Classification Tree (R)

This notebook implements a classification tree model to predict whether a person is likely to have heart disease using R.

In [1]:
# Load necessary libraries
library(rpart)
library(rpart.plot)
library(caret)
library(ggplot2)
library(dplyr)

# Set random seed for reproducibility
set.seed(123)

Loading required package: ggplot2

Loading required package: lattice

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


## 1.1 Data Cleaning (2 points)

In [2]:
# Load the dataset
column_names <- c('age', 'sex', 'cp', 'restbp', 'chol', 'fbs', 'restecg', 
                  'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'hd')

df <- read.csv('../input/processed.cleveland.data', 
               header = FALSE,
               col.names = column_names,
               na.strings = '?')

cat("Original dataset shape:", dim(df), "\n")
head(df)

Original dataset shape: 303 14 


   age   sex   cp    restbp chol  fbs   restecg thalach exang oldpeak slope ca    thal  hd
   <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl>   <dbl> <dbl> <dbl> <int>
1  63    1     1     145    233   1     2       150     0     2.3     3     0     6     0
2  67    1     4     160    286   0     2       108     1     1.5     2     3     3     2
3  67    1     4     120    229   0     2       129     1     2.6     2     2     7     1
4  37    1     3     130    250   0     0       187     0     3.5     3     0     3     0
5  41    0     2     130    204   0     2       172     0     1.4     1     0     3     0
6  56    1     2     120    236   0     0       178     0     0.8     1     0     3     0


In [3]:
# Check for missing values
cat("Missing values per column:\n")
colSums(is.na(df))

# Remove missing values
df <- na.omit(df)
cat("\nDataset shape after removing missing values:", dim(df), "\n")

Missing values per column:



Dataset shape after removing missing values: 297 14 


In [4]:
# Create binary variable y (1 if heart disease, 0 otherwise)
df$y <- ifelse(df$hd > 0, 1, 0)
cat("Distribution of target variable:\n")
table(df$y)

# Remove the original hd column
df$hd <- NULL

Distribution of target variable:



  0   1 
160 137 

In [5]:
# Convert categorical variables to factors
categorical_vars <- c('sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal')
df[categorical_vars] <- lapply(df[categorical_vars], as.factor)

# Convert y to factor for classification
df$y <- as.factor(df$y)

cat("Dataset structure:\n")
str(df)

Dataset structure:
'data.frame':	297 obs. of  14 variables:
 $ age    : num  63 67 67 37 41 56 62 57 63 53 ...
 $ sex    : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 1 1 2 2 ...
 $ cp     : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
 $ restbp : num  145 160 120 130 130 120 140 120 130 140 ...
 $ chol   : num  233 286 229 250 204 236 268 354 254 203 ...
 $ fbs    : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
 $ restecg: Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
 $ thalach: num  150 108 129 187 172 178 160 163 147 155 ...
 $ exang  : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
 $ oldpeak: num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
 $ slope  : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
 $ ca     : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
 $ thal   : Factor w/ 3 levels "3","6","7": 2 1 3 1 1 1 1 1 3 3 ...
 $ y      : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 2 1 2 2 ...
 - attr(*, "na.action")= 'omit' Named int

## 1.2 Data Analysis (8 points)

### (1 point) Split data and plot classification tree

In [6]:
# Split the data into training and test sets
set.seed(123)
train_index <- createDataPartition(df$y, p = 0.7, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]

cat("Training set size:", nrow(train_data), "\n")
cat("Test set size:", nrow(test_data), "\n")

Training set size: 208 
Test set size: 89 


In [7]:
# Train a classification tree without pruning
tree_model <- rpart(y ~ ., data = train_data, method = "class")

# Plot the classification tree
png('../output/classification_tree_before_pruning_R.png', width = 1200, height = 800)
rpart.plot(tree_model, main = "Classification Tree (Before Pruning)",
           extra = 104, box.palette = "RdBu", shadow.col = "gray")
dev.off()

# Display tree info
cat("Tree complexity parameters:\n")
printcp(tree_model)

Tree complexity parameters:

Classification tree:
rpart(formula = y ~ ., data = train_data, method = "class")

Variables actually used in tree construction:
[1] ca      chol    cp      oldpeak thal    thalach

Root node error: 96/208 = 0.46154

n= 208 

        CP nsplit rel error  xerror     xstd
1 0.500000      0   1.00000 1.00000 0.074893
2 0.052083      1   0.50000 0.50000 0.063296
3 0.041667      3   0.39583 0.52083 0.064196
4 0.020833      4   0.35417 0.54167 0.065052
5 0.010417      5   0.33333 0.53125 0.064630
6 0.010000      7   0.31250 0.55208 0.065464


### (2 points) Plot confusion matrix and interpret results

In [8]:
# Make predictions on test set
predictions <- predict(tree_model, test_data, type = "class")

# Calculate confusion matrix
cm <- confusionMatrix(predictions, test_data$y, 
                      dnn = c("Predicted", "Actual"))
print(cm)

# Plot confusion matrix
cm_table <- as.data.frame(cm$table)
colnames(cm_table) <- c("Predicted", "Actual", "Freq")
cm_table$Predicted <- ifelse(cm_table$Predicted == "0", "Does not have HD", "Has HD")
cm_table$Actual <- ifelse(cm_table$Actual == "0", "Does not have HD", "Has HD")

p <- ggplot(cm_table, aes(x = Predicted, y = Actual, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), size = 8) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "Confusion Matrix (Before Pruning)",
       x = "Predicted Label", y = "True Label") +
  theme_minimal() +
  theme(text = element_text(size = 14))

ggsave('../output/confusion_matrix_before_pruning_R.png', p, width = 8, height = 6, dpi = 300)

Confusion Matrix and Statistics

         Actual
Predicted  0  1
        0 34  5
        1 14 36
                                          
               Accuracy : 0.7865          
                 95% CI : (0.6869, 0.8663)
    No Information Rate : 0.5393          
    P-Value [Acc > NIR] : 1.112e-06       
                                          
                  Kappa : 0.5771          
                                          
 Mcnemar's Test P-Value : 0.06646         
                                          
            Sensitivity : 0.7083          
            Specificity : 0.8780          
         Pos Pred Value : 0.8718          
         Neg Pred Value : 0.7200          
             Prevalence : 0.5393          
         Detection Rate : 0.3820          
   Detection Prevalence : 0.4382          
      Balanced Accuracy : 0.7932          
                                          
       'Positive' Class : 0               
                                          


**Interpretation (Before Pruning):**

Based on our classification tree model results:

- **Dataset**: After cleaning, we have 297 samples (160 without heart disease, 137 with heart disease)
- **Split**: 208 training samples and 89 test samples (70-30 split)

**Confusion Matrix Results:**
- **True Negatives (TN)**: 34 - correctly predicted individuals without heart disease
- **True Positives (TP)**: 36 - correctly predicted individuals with heart disease  
- **False Positives (FP)**: 14 - incorrectly predicted as having heart disease (Type I error)
- **False Negatives (FN)**: 5 - incorrectly predicted as not having heart disease (Type II error, more concerning in medical diagnosis)

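**Model Performance:**

caret computed these statistics with class 0 (no heart disease) as the "positive" class, so sensitivity and specificity are swapped relative to the usual clinical convention in which disease is positive.

- **Accuracy**: 78.65% - 70 of the 89 test cases are classified correctly
- **Sensitivity (recall for class 0)**: 70.83% - 34 of the 48 people without heart disease are correctly identified
- **Specificity (recall for class 1)**: 87.80% - 36 of the 41 people with heart disease are correctly identified
- **Positive Predictive Value**: 87.18% - 34 of the 39 "no heart disease" predictions are correct
- **Kappa**: 0.5771 - moderate agreement beyond chance

The model detects most heart disease cases (36 of 41) but flags 14 of the 48 healthy individuals as having the disease. With 14 false positives against 5 false negatives, it errs toward predicting heart disease rather than away from it. Since false negatives are the more concerning error in diagnosis, this is the less harmful direction, but the false-alarm rate of about 29% leaves room for improvement.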
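
The caret statistics above can be reproduced by hand from the four counts, which also makes the effect of the positive-class choice visible. The sketch below is illustrative only: `class_metrics` is a hypothetical helper written for this example, not a caret function, and the counts are copied from the printed confusion matrix.

```r
# Confusion-matrix counts copied from the caret output above
# (rows = predicted, columns = actual; "0" = no HD, "1" = HD).
cm_counts <- matrix(c(34, 14, 5, 36), nrow = 2,
                    dimnames = list(Predicted = c("0", "1"),
                                    Actual    = c("0", "1")))

# Hypothetical helper: metrics for a chosen positive class.
class_metrics <- function(cm, positive) {
  negative <- setdiff(colnames(cm), positive)
  tp <- cm[positive, positive]  # predicted positive, actually positive
  fn <- cm[negative, positive]  # predicted negative, actually positive
  fp <- cm[positive, negative]  # predicted positive, actually negative
  tn <- cm[negative, negative]  # predicted negative, actually negative
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    ppv         = tp / (tp + fp),
    npv         = tn / (tn + fn))
}

caret_view    <- class_metrics(cm_counts, positive = "0")  # caret's default here
clinical_view <- class_metrics(cm_counts, positive = "1")  # disease = positive

round(rbind(caret_view, clinical_view), 4)
```

With `positive = "0"` the values match the caret output (sensitivity 0.7083, specificity 0.8780); with `positive = "1"` the two swap. In caret itself, the same switch can be made through the `positive` argument of `confusionMatrix`.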

### (1.5 points) Fix overfitting using cross-validation

In [9]:
# Generate 50 values of alpha (cp in rpart), equally spaced on a log scale
# from exp(-10) to 0.05
alpha_values <- exp(seq(-10, log(0.05), length.out = 50))

cat("Number of alpha values:", length(alpha_values), "\n")
cat("Alpha range:", min(alpha_values), "to", max(alpha_values), "\n")

Number of alpha values: 50 
Alpha range: 4.539993e-05 to 0.05 


In [10]:
# Perform 4-fold cross-validation for each alpha
set.seed(123)
mean_accuracies <- numeric(length(alpha_values))

for (i in seq_along(alpha_values)) {
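  # Create stratified folds (regenerated for every alpha, so each alpha
  # is scored on a different random partition of the training data)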
  folds <- createFolds(train_data$y, k = 4)
  accuracies <- numeric(4)
  
  for (j in 1:4) {
    # Split into train and validation
    val_idx <- folds[[j]]
    train_cv <- train_data[-val_idx, ]
    val_cv <- train_data[val_idx, ]
    
    # Train model with current alpha
    model_cv <- rpart(y ~ ., data = train_cv, method = "class",
                      control = rpart.control(cp = alpha_values[i]))
    
    # Predict and calculate accuracy
    pred_cv <- predict(model_cv, val_cv, type = "class")
    accuracies[j] <- mean(pred_cv == val_cv$y)
  }
  
  mean_accuracies[i] <- mean(accuracies)
}

# Find optimal alpha
optimal_idx <- which.max(mean_accuracies)
optimal_alpha <- alpha_values[optimal_idx]
optimal_accuracy <- mean_accuracies[optimal_idx]

cat("Optimal alpha:", optimal_alpha, "\n")
cat("Optimal cross-validation accuracy:", optimal_accuracy, "\n")

Optimal alpha: 0.0001643517 
Optimal cross-validation accuracy: 0.8076923 
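
Because `createFolds` is called inside the loop above, each alpha is evaluated on a different random partition. A common alternative is to fix the fold assignment once and reuse it for every alpha, so that differences in accuracy reflect only the change in `cp`. The sketch below shows that pattern under stated assumptions: it uses the `kyphosis` data bundled with rpart as a stand-in (not the heart disease data), a shorter grid of 10 values, and plain `sample` for unstratified folds instead of caret's stratified `createFolds`.

```r
library(rpart)

set.seed(123)
data(kyphosis)  # stand-in data shipped with rpart; NOT the heart disease set

k <- 4
# Assign every row to one of k folds ONCE, before looping over alpha.
fold_id <- sample(rep(seq_len(k), length.out = nrow(kyphosis)))

# Shorter illustrative grid, log-spaced like the one in the notebook.
alpha_grid <- exp(seq(-10, log(0.05), length.out = 10))

cv_accuracy <- sapply(alpha_grid, function(a) {
  fold_acc <- sapply(seq_len(k), function(j) {
    # Fit on the other k - 1 folds, score on fold j.
    fit  <- rpart(Kyphosis ~ ., data = kyphosis[fold_id != j, ],
                  method = "class",
                  control = rpart.control(cp = a))
    pred <- predict(fit, kyphosis[fold_id == j, ], type = "class")
    mean(pred == kyphosis$Kyphosis[fold_id == j])
  })
  mean(fold_acc)
})

data.frame(alpha = alpha_grid, cv_accuracy = cv_accuracy)
```

With the folds held fixed, two alphas that produce the same tree on every fold get exactly the same score, which makes plateaus in the accuracy curve easier to read.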


### (1.5 points) Plot Inaccuracy Rate vs Alpha

In [11]:
# Calculate inaccuracy rate
inaccuracy_rates <- 1 - mean_accuracies

# Create data frame for plotting
plot_df <- data.frame(alpha = alpha_values, inaccuracy = inaccuracy_rates)

# Plot Inaccuracy Rate vs Alpha
p <- ggplot(plot_df, aes(x = alpha, y = inaccuracy)) +
  geom_line() +
  geom_point(size = 2) +
  geom_vline(xintercept = optimal_alpha, color = "red", linetype = "dashed",
             linewidth = 1) +
  annotate("text", x = optimal_alpha * 2, y = max(inaccuracy_rates) * 0.9,
           label = paste("Optimal α =", round(optimal_alpha, 6)), color = "red") +
  scale_x_log10() +
  labs(title = "Inaccuracy Rate vs Alpha",
       x = "Alpha (log scale)",
       y = "Inaccuracy Rate (1 - Accuracy)") +
  theme_minimal() +
  theme(text = element_text(size = 12))

ggsave('../output/inaccuracy_vs_alpha_R.png', p, width = 10, height = 6, dpi = 300)


### (2 points) Plot pruned tree and confusion matrix with optimal alpha

In [12]:
# Train a classification tree with optimal alpha
tree_pruned <- rpart(y ~ ., data = train_data, method = "class",
                     control = rpart.control(cp = optimal_alpha))

# Plot the pruned classification tree
png('../output/classification_tree_after_pruning_R.png', width = 1200, height = 800)
rpart.plot(tree_pruned, 
           main = paste("Classification Tree (After Pruning with α =", round(optimal_alpha, 6), ")"),
           extra = 104, box.palette = "RdBu", shadow.col = "gray")
dev.off()

cat("Pruned tree complexity parameters:\n")
printcp(tree_pruned)

Pruned tree complexity parameters:

Classification tree:
rpart(formula = y ~ ., data = train_data, method = "class", control = rpart.control(cp = optimal_alpha))

Variables actually used in tree construction:
[1] ca      chol    cp      oldpeak thal    thalach

Root node error: 96/208 = 0.46154

n= 208 

          CP nsplit rel error  xerror     xstd
1 0.50000000      0   1.00000 1.00000 0.074893
2 0.05208333      1   0.50000 0.55208 0.065464
3 0.04166667      3   0.39583 0.54167 0.065052
4 0.02083333      4   0.35417 0.58333 0.066637
5 0.01041667      5   0.33333 0.59375 0.067007
6 0.00016435      7   0.31250 0.62500 0.068062


In [13]:
# Make predictions with pruned tree
predictions_pruned <- predict(tree_pruned, test_data, type = "class")

# Calculate confusion matrix for pruned tree
cm_pruned <- confusionMatrix(predictions_pruned, test_data$y,
                             dnn = c("Predicted", "Actual"))
print(cm_pruned)

# Plot confusion matrix
cm_table_pruned <- as.data.frame(cm_pruned$table)
colnames(cm_table_pruned) <- c("Predicted", "Actual", "Freq")
cm_table_pruned$Predicted <- ifelse(cm_table_pruned$Predicted == "0", "Does not have HD", "Has HD")
cm_table_pruned$Actual <- ifelse(cm_table_pruned$Actual == "0", "Does not have HD", "Has HD")

p <- ggplot(cm_table_pruned, aes(x = Predicted, y = Actual, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), size = 8) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "Confusion Matrix (After Pruning)",
       x = "Predicted Label", y = "True Label") +
  theme_minimal() +
  theme(text = element_text(size = 14))

ggsave('../output/confusion_matrix_after_pruning_R.png', p, width = 8, height = 6, dpi = 300)

Confusion Matrix and Statistics

         Actual
Predicted  0  1
        0 34  5
        1 14 36
                                          
               Accuracy : 0.7865          
                 95% CI : (0.6869, 0.8663)
    No Information Rate : 0.5393          
    P-Value [Acc > NIR] : 1.112e-06       
                                          
                  Kappa : 0.5771          
                                          
 Mcnemar's Test P-Value : 0.06646         
                                          
            Sensitivity : 0.7083          
            Specificity : 0.8780          
         Pos Pred Value : 0.8718          
         Neg Pred Value : 0.7200          
             Prevalence : 0.5393          
         Detection Rate : 0.3820          
   Detection Prevalence : 0.4382          
      Balanced Accuracy : 0.7932          
                                          
       'Positive' Class : 0               
                                          


**Discussion:**

After implementing cross-validation with pruning using the optimal alpha value:

**Cross-Validation Results:**
- **Optimal Alpha (cp)**: 0.0001643517 (selected from 50 values ranging from ~4.54e-05 to 0.05)
- **Optimal Cross-Validation Accuracy**: 80.77% (from 4-fold cross-validation)

**Comparison: Before vs After Pruning:**

The pruning process using the optimal alpha value identified through cross-validation shows that:

1. **Model Performance**: Both the unpruned and pruned models achieved the same test accuracy of **78.65%**, with identical confusion matrices:
   - True Negatives: 34
   - True Positives: 36
   - False Positives: 14
   - False Negatives: 5

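2. **Why Same Performance?**: The selected alpha (0.0001643517) is smaller than rpart's default `cp` of 0.01, which the first tree used. A lower `cp` permits more splits, not fewer, so refitting with it cannot prune the default tree. The two `printcp` tables confirm this: both trees use the same six variables and stop at 7 splits with a relative error of 0.3125, presumably because rpart's other stopping rules (such as the default `minsplit`) prevent further growth. The two models are therefore the same tree, and identical test predictions follow.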

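3. **Cross-Validation Benefit**: The cross-validation accuracy (80.77%) is slightly higher than the test accuracy (78.65%). Some gap is expected, because 80.77% is the maximum over 50 candidate alphas and is therefore optimistically biased. In addition, the folds are regenerated for every alpha, so differences between alphas partly reflect fold-to-fold noise rather than tree complexity.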

4. **Key Insights**:
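   - **Tree Complexity**: The selected alpha leaves the tree unchanged at 7 splits. This alone does not show that the tree is free of overfitting: in both `printcp` tables the lowest `xerror` belongs to a smaller tree (1 split in the first table, 3 splits in the second), which hints that a simpler tree might generalize about as well.
   - **Generalization**: Because both models are the same tree, their matching test results say nothing about the effect of pruning; the 78.65% test accuracy is the single available estimate of performance on unseen data.
   - **Clinical Relevance**: The model correctly identifies 87.80% of heart disease cases (36 of 41; caret reports this as specificity because class 0 is treated as positive), which is crucial for medical screening.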

5. **Model Reliability**: The Kappa statistic of 0.5771 indicates moderate agreement beyond chance, and the balanced accuracy of 79.32% shows the model performs reasonably well on both classes, which are only mildly imbalanced (160 vs. 137 after cleaning).
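
Whether a given `cp` changes a tree at all can be checked by counting its splits. The sketch below uses rpart's bundled `kyphosis` data as a stand-in (not the heart disease data) to compare the default `cp` with a smaller and a larger value, and to show `prune()` applied to an already fitted tree. `n_splits` is a hypothetical helper defined here for illustration.

```r
library(rpart)

data(kyphosis)  # stand-in data shipped with rpart; NOT the heart disease set

# rpart's default complexity parameter
default_cp <- rpart.control()$cp

fit_default <- rpart(Kyphosis ~ ., data = kyphosis, method = "class")
fit_small   <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
                     control = rpart.control(cp = 1e-4))
fit_large   <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
                     control = rpart.control(cp = 0.05))

# Hypothetical helper: number of internal (non-leaf) nodes in a fitted tree.
n_splits <- function(fit) sum(fit$frame$var != "<leaf>")

# Cost-complexity pruning of an existing tree, without refitting.
pruned <- prune(fit_small, cp = 0.05)

c(default_cp = default_cp,
  default    = n_splits(fit_default),
  small_cp   = n_splits(fit_small),
  large_cp   = n_splits(fit_large),
  pruned     = n_splits(pruned))
```

A `cp` below the default of 0.01 can only keep or grow the tree, while a larger value can only keep or shrink it. Applying the same check to `tree_model` and `tree_pruned` would show directly whether any pruning took place.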

**Conclusion**: Cross-validation selected an alpha that leaves the default tree unchanged, so the final model is the original tree. It reaches about 79% test accuracy, detecting 87.80% of heart disease cases while correctly clearing 70.83% of individuals without the disease.