# Confusion Matrix

YouTube video: https://www.youtube.com/watch?v=Kdsp6soqA7o&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=3

A confusion matrix is a table that shows how well a classification model performs by comparing its predicted values with the actual values. For a binary classifier, it counts four types of predictions:

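True Positive (TP): the model predicted positive and the actual value is positive.

True Negative (TN): the model predicted negative and the actual value is negative.

False Positive (FP): the model predicted positive but the actual value is negative.

False Negative (FN): the model predicted negative but the actual value is positive.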
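As a minimal sketch of how these four counts arise, the snippet below tallies them by hand. The two label lists are hypothetical values made up for illustration (1 = positive, 0 = negative); they are not taken from any dataset in this notebook.

```python
# Hypothetical labels for eight cases: 1 = positive, 0 = negative
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(actual, predicted))

# Each (actual, predicted) pair falls into exactly one of the four cells
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # predicted positive, actually positive
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # predicted negative, actually negative
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # predicted positive, actually negative
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # predicted negative, actually positive

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

For these made-up labels the script prints `TP=3 TN=3 FP=1 FN=1`, and the four counts always add up to the number of cases.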

### Confusion Matrix Using Cross-Validation

In [None]:
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier

# Create simple heart disease data
np.random.seed(42)

# Generate 100 people with random ages and cholesterol levels
age = np.random.normal(55, 15, 100)
cholesterol = np.random.normal(200, 50, 100)

# Simple rule: high age + high cholesterol = heart disease
risk = (age - 50) / 20 + (cholesterol - 200) / 100
actual = (risk > 0.5).astype(int) # 1 = has heart disease, 0 = no heart disease

# Create features for the model
X = np.column_stack((age, cholesterol))

# Use cross validation to get predictions (5-fold CV)
model = RandomForestClassifier(n_estimators=10, random_state=42)
predicted = cross_val_predict(model, X, actual, cv=5)

# Create confusion matrix
cm = confusion_matrix(actual, predicted)

# Display confusion matrix with predicted on top, actual on side
print("Confusion Matrix (with Cross-Validation):")
print("                Predicted")
print("Actual    | No Disease | Disease")
print("----------|------------|--------")
print(f"No Disease|     {cm[0,0]:3d}     |   {cm[0,1]:3d}")
print(f"Disease   |     {cm[1,0]:3d}     |   {cm[1,1]:3d}")

# Extract values
tn, fp, fn, tp = cm.ravel()

print("\nMetrics:")
print(f"True Negatives (TN): {tn} - Correctly predicted no disease")
print(f"False Positives (FP): {fp} - Incorrectly predicted disease")
print(f"False Negatives (FN): {fn} - Incorrectly predicted no disease")
print(f"True Positives (TP): {tp} - Correctly predicted disease")

# Calculate accuracy
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy: {accuracy:.3f} ({accuracy*100:.1f}%)")





If we have N classes/categories to predict, the confusion matrix will have N rows and N columns, creating an N×N matrix.
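To illustrate the N×N case, here is a small sketch that builds a 3×3 matrix in plain Python. The class names and labels are hypothetical and unrelated to the heart disease data. Rows are actual classes and columns are predicted classes, the same orientation that `confusion_matrix` from scikit-learn uses in the cells of this notebook.

```python
# Hypothetical three-class labels (not from the heart disease data)
classes   = ["cat", "dog", "bird"]
actual    = ["cat", "dog", "bird", "cat", "dog",  "bird", "cat", "dog"]
predicted = ["cat", "dog", "cat",  "cat", "bird", "bird", "dog", "dog"]

# Map each class name to its row/column position
index = {name: i for i, name in enumerate(classes)}
n = len(classes)

# matrix[i][j] = number of cases with actual class i predicted as class j
matrix = [[0] * n for _ in range(n)]
for a, p in zip(actual, predicted):
    matrix[index[a]][index[p]] += 1

for name, row in zip(classes, matrix):
    print(f"{name:>5}: {row}")

# Correct predictions sit on the diagonal
correct = sum(matrix[i][i] for i in range(n))
accuracy = correct / len(actual)
print(f"Accuracy: {accuracy:.3f}")
```

Every off-diagonal cell names a specific confusion, for example `matrix[2][0]` counts birds that were predicted as cats.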

### Comparing Confusion Matrices Across Different Machine Learning Models

Logistic Regression vs Random Forest

In [1]:
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Create simple heart disease data (fake data)
np.random.seed(42)

# Generate 100 people with random ages and cholesterol
age = np.random.normal(55, 15, 100)
cholesterol = np.random.normal(200, 50, 100)

# Simple rule: high age + high cholesterol = heart disease
# Risk calculated with a simple formula
risk = (age - 50) / 20 + (cholesterol - 200) / 100
actual = (risk > 0.5).astype(int)  # If risk > 0.5 = heart disease, otherwise no heart disease

# Combine features for model input into 2D array
X = np.column_stack([age, cholesterol])

# Function to display confusion matrix
def show_confusion_matrix(cm, model_name):
    print(f"\n{model_name} Confusion Matrix:")
    print("                Predicted")
    print("Actual    | No Disease | Disease")
    print("----------|------------|--------")
    print(f"No Disease|     {cm[0,0]:3d}     |   {cm[0,1]:3d}")
    print(f"Disease   |     {cm[1,0]:3d}     |   {cm[1,1]:3d}")
    
    # Calculate metrics
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    
    print(f"\n{model_name} Metrics:")
    print(f"True Negatives (TN): {tn}")
    print(f"False Positives (FP): {fp}")
    print(f"False Negatives (FN): {fn}")
    print(f"True Positives (TP): {tp}")
    print(f"Accuracy: {accuracy:.3f} ({accuracy*100:.1f}%)")
    
    return accuracy

# Test Logistic Regression
print("=== LOGISTIC REGRESSION ===")
lr_model = LogisticRegression(random_state=42)
lr_predicted = cross_val_predict(lr_model, X, actual, cv=5)
lr_cm = confusion_matrix(actual, lr_predicted)
lr_accuracy = show_confusion_matrix(lr_cm, "Logistic Regression")

# Test Random Forest
print("\n=== RANDOM FOREST ===")
rf_model = RandomForestClassifier(n_estimators=10, random_state=42)
rf_predicted = cross_val_predict(rf_model, X, actual, cv=5)
rf_cm = confusion_matrix(actual, rf_predicted)
rf_accuracy = show_confusion_matrix(rf_cm, "Random Forest")

# Compare results
print("\n=== COMPARISON ===")
print(f"Logistic Regression Accuracy: {lr_accuracy:.3f} ({lr_accuracy*100:.1f}%)")
print(f"Random Forest Accuracy: {rf_accuracy:.3f} ({rf_accuracy*100:.1f}%)")

# Pick the best model once so the summary line stays consistent on a tie
if lr_accuracy > rf_accuracy:
    best_model = "Logistic Regression"
    print("Logistic Regression performed better!")
elif rf_accuracy > lr_accuracy:
    best_model = "Random Forest"
    print("Random Forest performed better!")
else:
    best_model = "Tie"
    print("Both models performed equally!")

print(f"\nBest Model: {best_model}")

=== LOGISTIC REGRESSION ===

Logistic Regression Confusion Matrix:
                Predicted
Actual    | No Disease | Disease
----------|------------|--------
No Disease|      69     |     0
Disease   |       0     |    31

Logistic Regression Metrics:
True Negatives (TN): 69
False Positives (FP): 0
False Negatives (FN): 0
True Positives (TP): 31
Accuracy: 1.000 (100.0%)

=== RANDOM FOREST ===

Random Forest Confusion Matrix:
                Predicted
Actual    | No Disease | Disease
----------|------------|--------
No Disease|      67     |     2
Disease   |       9     |    22

Random Forest Metrics:
True Negatives (TN): 67
False Positives (FP): 2
False Negatives (FN): 9
True Positives (TP): 22
Accuracy: 0.890 (89.0%)

=== COMPARISON ===
Logistic Regression Accuracy: 1.000 (100.0%)
Random Forest Accuracy: 0.890 (89.0%)
Logistic Regression performed better!

Best Model: Logistic Regression


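Confusion matrices help compare different ML models on the same dataset. By testing multiple algorithms, we can see not only which one is most accurate but also what types of errors each one makes. This matters especially in medical applications, where a false negative (a missed disease) is usually more dangerous than a false positive (a false alarm). In the run above, Random Forest missed 9 of the 31 disease cases while Logistic Regression missed none. Note that the labels in this synthetic dataset come from a linear rule on age and cholesterol, which is exactly the kind of boundary logistic regression fits, so its perfect score should not be expected on real data. The confusion matrix guides us to choose the model whose error pattern best suits our specific problem.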
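One way to put a number on these error patterns is to compute recall (the share of actual disease cases the model caught) and precision (the share of disease predictions that were correct) from the counts printed above. The helper `recall_and_precision` is a hypothetical name introduced for this sketch, not part of scikit-learn.

```python
def recall_and_precision(tn, fp, fn, tp):
    """Return (recall, precision) from the four confusion matrix counts."""
    # Recall: of all actual positives, how many did the model find?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: of all predicted positives, how many were correct?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# Counts copied from the printed output of the comparison cell
lr_recall, lr_precision = recall_and_precision(tn=69, fp=0, fn=0, tp=31)
rf_recall, rf_precision = recall_and_precision(tn=67, fp=2, fn=9, tp=22)

print(f"Logistic Regression: recall={lr_recall:.3f}, precision={lr_precision:.3f}")
print(f"Random Forest:       recall={rf_recall:.3f}, precision={rf_precision:.3f}")
```

With these counts, Random Forest has a recall of about 0.710 (22 of 31 disease cases found) and a precision of about 0.917 (22 of 24 disease predictions correct), so its weakness here is missed cases rather than false alarms.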