# Naive Bayes Classifier Implementation
## Lesson 21 Homework - Machine Learning

**Objective:** Implement a Gaussian Naive Bayes classifier from scratch and compare it with the scikit-learn implementation.

**Dataset:** Iris Dataset (Fisher's Iris data set)
- Features: 4 continuous numerical features (Sepal Length, Sepal Width, Petal Length, Petal Width)
- Target: 3 classes of Iris species
- Split: 75% Training, 25% Test

## 1. Import Required Libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

## 2. Load and Explore the Iris Dataset

In [None]:
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create a DataFrame for better visualization
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(y, iris.target_names)

print("Dataset Shape:", X.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nSummary Statistics:")
print(df.describe())
print("\nClass Distribution:")
print(df['species'].value_counts())

## 3. Data Preparation - Train/Test Split

In [None]:
# Split the dataset: 75% training, 25% test (randomized)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, shuffle=True
)

print(f"Training set size: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Test set size: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print("\nTraining set class distribution:")
print(pd.Series(y_train).value_counts().sort_index())
print("\nTest set class distribution:")
print(pd.Series(y_test).value_counts().sort_index())

## 4. Feature A: Manual Implementation (From Scratch)

### Mathematical Foundation

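**Gaussian Naive Bayes** assumes that, within each class, every feature follows a Gaussian (Normal) distribution, and that the features are conditionally independent given the class, so $P(X|y) = \prod_i P(x_i|y)$.

**Bayes' Theorem:**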
$$P(y|X) = \frac{P(X|y) \cdot P(y)}{P(X)}$$

Where:
- $P(y|X)$ = Posterior probability of class $y$ given features $X$
- $P(X|y)$ = Likelihood of features $X$ given class $y$
- $P(y)$ = Prior probability of class $y$
- $P(X)$ = Evidence (constant for all classes)
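
As a small worked example of the rule above, the sketch below plugs hypothetical priors and likelihoods (made-up numbers, not taken from the Iris data) into Bayes' theorem. It illustrates that dividing by the evidence $P(X)$ rescales the scores so they sum to 1 but leaves the winning class unchanged.

```python
# Hypothetical values for a single sample X and three classes (illustrative only).
priors = [0.5, 0.3, 0.2]          # P(y)
likelihoods = [0.02, 0.10, 0.05]  # P(X|y)

# Numerator of Bayes' theorem for each class: P(X|y) * P(y)
joint = [lik * p for lik, p in zip(likelihoods, priors)]

# The evidence P(X) is the sum of the numerators over all classes.
evidence = sum(joint)

# Posterior P(y|X) for each class.
posteriors = [j / evidence for j in joint]

# The arg max is the same with or without the normalization.
best_unnormalized = max(range(len(joint)), key=joint.__getitem__)
best_normalized = max(range(len(posteriors)), key=posteriors.__getitem__)

print(posteriors)
print(best_unnormalized, best_normalized)
```

This is why a classifier that only needs the most probable class can skip computing $P(X)$ altogether.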

**Gaussian likelihood of feature $x_i$:**
$$P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp\left(-\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}\right)$$

Where:
- $\mu_{y,i}$ = Mean of feature $x_i$ for class $y$
- $\sigma_{y,i}^2$ = Variance of feature $x_i$ for class $y$
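
Because `exp` underflows to zero for values far from the mean, it can help to evaluate the logarithm of the density directly instead of computing the density and then taking its log. The sketch below uses only the standard library; `gaussian_pdf` and `gaussian_log_pdf` are hypothetical helper names, and the numbers are illustrative rather than fitted to the data.

```python
import math

def gaussian_pdf(x, mean, var):
    """Gaussian density, written as in the formula above."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def gaussian_log_pdf(x, mean, var):
    """Log of the Gaussian density, computed without calling exp."""
    return -0.5 * math.log(2 * math.pi * var) - ((x - mean) ** 2) / (2 * var)

# Near the mean, both routes agree.
near = gaussian_log_pdf(1.5, mean=1.4, var=0.03)
assert math.isclose(near, math.log(gaussian_pdf(1.5, mean=1.4, var=0.03)))

# Far from the mean, the density underflows to 0.0 (so its log is undefined),
# while the log-density is still a finite number.
far_pdf = gaussian_pdf(10.0, mean=1.4, var=0.03)
far_log_pdf = gaussian_log_pdf(10.0, mean=1.4, var=0.03)
print(far_pdf, far_log_pdf)
```

Summing such log-densities over the features gives the log-likelihood term directly, with no intermediate value that can collapse to zero.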

In [None]:
class GaussianNaiveBayesManual:
    """
    Gaussian Naive Bayes classifier implemented from scratch using NumPy.
    
    This implementation assumes features follow a Gaussian (Normal) distribution
    and uses Bayes theorem for classification.
    """
    
    def __init__(self):
        self.classes = None
        self.priors = None
        self.means = None
        self.variances = None
    
    def fit(self, X, y):
        """
        Train the Gaussian Naive Bayes model.
        
        Parameters:
        -----------
        X : array-like, shape (n_samples, n_features)
            Training data
        y : array-like, shape (n_samples,)
            Target values
        """
        n_samples, n_features = X.shape
        self.classes = np.unique(y)
        n_classes = len(self.classes)
        
        # Initialize arrays for means, variances, and priors
        self.means = np.zeros((n_classes, n_features))
        self.variances = np.zeros((n_classes, n_features))
        self.priors = np.zeros(n_classes)
        
        # Calculate mean, variance, and prior for each class
        for idx, c in enumerate(self.classes):
            X_c = X[y == c]
            
            # Calculate mean (μ) for each feature
            self.means[idx, :] = X_c.mean(axis=0)
            
            # Calculate variance (σ²) for each feature
            self.variances[idx, :] = X_c.var(axis=0)
            
            # Calculate prior probability P(y)
            self.priors[idx] = X_c.shape[0] / n_samples
        
        return self
    
    def _calculate_likelihood(self, x, mean, var):
        """
        Calculate the Gaussian probability density function (PDF).
        
        Formula: P(x|y) = (1 / sqrt(2π * σ²)) * exp(-(x - μ)² / (2σ²))
        
        Parameters:
        -----------
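        x : array-like, shape (n_features,)
            Feature values of a single sample
        mean : array-like, shape (n_features,)
            Per-feature means (μ) for a specific class
        var : array-like, shape (n_features,)
            Per-feature variances (σ²) for a specific class
        
        Returns:
        --------
        array, shape (n_features,) : Probability density of each feature value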
        """
        eps = 1e-6  # Small constant to avoid division by zero
        numerator = np.exp(-((x - mean) ** 2) / (2 * (var + eps)))
        denominator = np.sqrt(2 * np.pi * (var + eps))
        return numerator / denominator
    
    def _calculate_posterior(self, x):
        """
        Calculate the unnormalized log-posterior for each class.
        
        Log probabilities are used to avoid numerical underflow. The evidence
        P(X) is the same for every class, so it is dropped as a constant:
        log(P(y|X)) = log(P(y)) + Σ log(P(x_i|y)) + const
        
        Parameters:
        -----------
        x : array-like, shape (n_features,)
            Single sample
        
        Returns:
        --------
        array : Unnormalized log-posterior for each class
        """
        posteriors = []
        
        for idx, c in enumerate(self.classes):
            # Start with log prior: log(P(y))
            prior = np.log(self.priors[idx])
            
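            # Sum of per-feature log-likelihoods: Σ log(P(x_i|y)).
            # Caveat: the density is computed first and logged afterwards, so a value
            # very far from the class mean can underflow to 0 and give log(0) = -inf.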
            likelihood = np.sum(
                np.log(self._calculate_likelihood(x, self.means[idx, :], self.variances[idx, :]))
            )
            
            # Posterior = Prior + Likelihood (in log space)
            posterior = prior + likelihood
            posteriors.append(posterior)
        
        return np.array(posteriors)
    
    def predict(self, X):
        """
        Predict class labels for samples in X.
        
        Parameters:
        -----------
        X : array-like, shape (n_samples, n_features)
            Test data
        
        Returns:
        --------
        array : Predicted class labels
        """
        predictions = []
        
        for x in X:
            # Calculate posterior for all classes
            posteriors = self._calculate_posterior(x)
            
            # Select class with highest posterior (ArgMax)
            predicted_class = self.classes[np.argmax(posteriors)]
            predictions.append(predicted_class)
        
        return np.array(predictions)
    
    def get_params(self):
        """
        Return the learned parameters.
        
        Returns:
        --------
        dict : Dictionary containing priors, means, and variances
        """
        return {
            'priors': self.priors,
            'means': self.means,
            'variances': self.variances
        }
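
The per-sample loop in `predict` can also be expressed with NumPy broadcasting. The following is a minimal sketch, not a replacement for the class above: `joint_log_likelihood` is a hypothetical helper, the parameter arrays are made-up toy values, and the same `eps` of `1e-6` is assumed.

```python
import numpy as np

def joint_log_likelihood(X, priors, means, variances, eps=1e-6):
    """Return log P(y) + sum_i log P(x_i|y) with shape (n_samples, n_classes)."""
    var = variances + eps                       # (n_classes, n_features)
    diff = X[:, np.newaxis, :] - means          # (n_samples, n_classes, n_features)
    log_pdf = -0.5 * (np.log(2 * np.pi * var) + diff ** 2 / var)
    return np.log(priors) + log_pdf.sum(axis=2)

# Toy parameters for 2 classes and 2 features (illustrative values only).
priors = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
variances = np.array([[1.0, 1.0], [1.0, 1.0]])

X_new = np.array([[0.2, -0.1], [4.8, 5.3]])
scores = joint_log_likelihood(X_new, priors, means, variances)
labels = np.argmax(scores, axis=1)  # index of the highest-scoring class per row
print(labels)
```

The sketch computes the log-density directly, which also avoids taking the log of a density that has already underflowed.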

### Train and Evaluate Manual Implementation

In [None]:
# Create and train the manual Naive Bayes classifier
manual_nb = GaussianNaiveBayesManual()
manual_nb.fit(X_train, y_train)

# Display learned parameters
params = manual_nb.get_params()
print("=" * 60)
print("MANUAL IMPLEMENTATION - LEARNED PARAMETERS")
print("=" * 60)

print("\nPrior Probabilities P(y):")
for idx, c in enumerate(manual_nb.classes):
    print(f"  Class {c} ({iris.target_names[c]}): {params['priors'][idx]:.4f}")

print("\nMeans (μ) for each class and feature:")
means_df = pd.DataFrame(
    params['means'],
    columns=iris.feature_names,
    index=[iris.target_names[i] for i in manual_nb.classes]
)
print(means_df)

print("\nVariances (σ²) for each class and feature:")
var_df = pd.DataFrame(
    params['variances'],
    columns=iris.feature_names,
    index=[iris.target_names[i] for i in manual_nb.classes]
)
print(var_df)

In [None]:
# Make predictions on test set
y_pred_manual = manual_nb.predict(X_test)

# Calculate accuracy
accuracy_manual = accuracy_score(y_test, y_pred_manual)

print("=" * 60)
print("MANUAL IMPLEMENTATION - RESULTS")
print("=" * 60)
print(f"\nAccuracy: {accuracy_manual:.4f} ({accuracy_manual*100:.2f}%)")

print("\nConfusion Matrix:")
cm_manual = confusion_matrix(y_test, y_pred_manual)
print(cm_manual)

print("\nClassification Report:")
print(classification_report(y_test, y_pred_manual, target_names=iris.target_names))

## 5. Feature B: Scikit-Learn Implementation

In [None]:
# Create and train the scikit-learn Gaussian Naive Bayes classifier
sklearn_nb = GaussianNB()
sklearn_nb.fit(X_train, y_train)

# Display learned parameters
print("=" * 60)
print("SCIKIT-LEARN IMPLEMENTATION - LEARNED PARAMETERS")
print("=" * 60)

print("\nPrior Probabilities P(y):")
for idx, c in enumerate(sklearn_nb.classes_):
    print(f"  Class {c} ({iris.target_names[c]}): {np.exp(sklearn_nb.class_log_prior_[idx]):.4f}")

print("\nMeans (μ) for each class and feature:")
sklearn_means_df = pd.DataFrame(
    sklearn_nb.theta_,
    columns=iris.feature_names,
    index=[iris.target_names[i] for i in sklearn_nb.classes_]
)
print(sklearn_means_df)

print("\nVariances (σ²) for each class and feature:")
sklearn_var_df = pd.DataFrame(
    sklearn_nb.var_,
    columns=iris.feature_names,
    index=[iris.target_names[i] for i in sklearn_nb.classes_]
)
print(sklearn_var_df)

In [None]:
# Make predictions on test set
y_pred_sklearn = sklearn_nb.predict(X_test)

# Calculate accuracy
accuracy_sklearn = accuracy_score(y_test, y_pred_sklearn)

print("=" * 60)
print("SCIKIT-LEARN IMPLEMENTATION - RESULTS")
print("=" * 60)
print(f"\nAccuracy: {accuracy_sklearn:.4f} ({accuracy_sklearn*100:.2f}%)")

print("\nConfusion Matrix:")
cm_sklearn = confusion_matrix(y_test, y_pred_sklearn)
print(cm_sklearn)

print("\nClassification Report:")
print(classification_report(y_test, y_pred_sklearn, target_names=iris.target_names))

## 6. Comparison & Analysis

In [None]:
# Compare the two implementations
print("=" * 60)
print("COMPARISON: MANUAL vs SCIKIT-LEARN")
print("=" * 60)

print("\nAccuracy Comparison:")
print(f"  Manual Implementation:      {accuracy_manual:.4f} ({accuracy_manual*100:.2f}%)")
print(f"  Scikit-Learn Implementation: {accuracy_sklearn:.4f} ({accuracy_sklearn*100:.2f}%)")
print(f"  Difference:                  {abs(accuracy_manual - accuracy_sklearn):.4f}")

# Check if predictions are identical
predictions_match = np.array_equal(y_pred_manual, y_pred_sklearn)
print(f"\nPredictions Identical: {predictions_match}")

if not predictions_match:
    n_differences = np.sum(y_pred_manual != y_pred_sklearn)
    print(f"Number of different predictions: {n_differences}/{len(y_test)}")

# Compare learned parameters
print("\nParameter Comparison:")
print("\nPrior Probabilities - Max Difference:")
prior_diff = np.max(np.abs(params['priors'] - np.exp(sklearn_nb.class_log_prior_)))
print(f"  {prior_diff:.10f}")

print("\nMeans - Max Difference:")
means_diff = np.max(np.abs(params['means'] - sklearn_nb.theta_))
print(f"  {means_diff:.10f}")

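# sklearn_nb.var_ already includes scikit-learn's variance smoothing term
# (sklearn_nb.epsilon_), so a tiny nonzero difference is expected here.
print("\nVariances - Max Difference:")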
var_diff = np.max(np.abs(params['variances'] - sklearn_nb.var_))
print(f"  {var_diff:.10f}")

print("\n" + "=" * 60)
print("CONCLUSION")
print("=" * 60)
if predictions_match:
    print("\n✓ SUCCESS: Manual implementation produces predictions IDENTICAL to scikit-learn's!")
elif abs(accuracy_manual - accuracy_sklearn) < 0.01:
    print("\n✓ SUCCESS: Manual implementation is within 1 percentage point of scikit-learn's accuracy.")
else:
    print("\n⚠ WARNING: Significant differences detected between implementations.")

print(f"\nTest accuracy: manual {accuracy_manual*100:.2f}%, scikit-learn {accuracy_sklearn*100:.2f}%.")
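
One source of small parameter differences is variance smoothing. scikit-learn's `GaussianNB` documents a `var_smoothing` parameter (default `1e-9`): a portion of the largest feature variance that is added to all variances for numerical stability, whereas the manual class adds a fixed `1e-6` inside the density. The sketch below imitates that documented behaviour with NumPy on synthetic data. It is an approximation for illustration, not scikit-learn's actual code, and it compares with a tolerance instead of exact equality.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[5.0, 3.0], scale=[0.8, 0.4], size=(50, 2))  # synthetic data

# Plain per-feature variance, as computed by X_c.var(axis=0) in the manual class.
var_plain = X_demo.var(axis=0)

# Assumed smoothing rule: var_smoothing times the largest feature variance.
var_smoothing = 1e-9
epsilon = var_smoothing * X_demo.var(axis=0).max()
var_smoothed = var_plain + epsilon

# The two are not bit-for-bit equal, but agree within a small tolerance.
print(np.array_equal(var_plain, var_smoothed))
print(np.allclose(var_plain, var_smoothed, rtol=0, atol=1e-8))
```

A tolerance-based check such as `np.allclose` is therefore a more forgiving way to compare learned parameters than testing for exact equality.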

## 7. Visualization

In [None]:
# Create visualizations
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion Matrix - Manual Implementation
sns.heatmap(cm_manual, annot=True, fmt='d', cmap='Blues', 
            xticklabels=iris.target_names, 
            yticklabels=iris.target_names, ax=axes[0])
axes[0].set_title(f'Manual Implementation\nAccuracy: {accuracy_manual:.4f}')
axes[0].set_ylabel('True Label')
axes[0].set_xlabel('Predicted Label')

# Confusion Matrix - Scikit-Learn Implementation
sns.heatmap(cm_sklearn, annot=True, fmt='d', cmap='Greens', 
            xticklabels=iris.target_names, 
            yticklabels=iris.target_names, ax=axes[1])
axes[1].set_title(f'Scikit-Learn Implementation\nAccuracy: {accuracy_sklearn:.4f}')
axes[1].set_ylabel('True Label')
axes[1].set_xlabel('Predicted Label')

plt.tight_layout()
plt.show()

# Accuracy Comparison Bar Chart
fig, ax = plt.subplots(figsize=(8, 5))
implementations = ['Manual\nImplementation', 'Scikit-Learn\nImplementation']
accuracies = [accuracy_manual, accuracy_sklearn]
colors = ['#3498db', '#2ecc71']

bars = ax.bar(implementations, accuracies, color=colors, alpha=0.7, edgecolor='black')
ax.set_ylabel('Accuracy', fontsize=12)
ax.set_title('Gaussian Naive Bayes: Accuracy Comparison', fontsize=14, fontweight='bold')
ax.set_ylim([0, 1.15])  # Headroom so the value labels above the bars are not clipped
ax.axhline(y=1.0, color='gray', linestyle='--', alpha=0.5)

# Add value labels on bars
for bar, acc in zip(bars, accuracies):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{acc:.4f}\n({acc*100:.2f}%)',
            ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

## 8. Summary

### Key Accomplishments:

1. **Data Preparation**: Successfully loaded the Iris dataset and split it into 75% training and 25% test sets with random shuffling.

2. **Manual Implementation**: Implemented Gaussian Naive Bayes from scratch using only NumPy:
   - Calculated prior probabilities P(y) for each class
   - Computed means (μ) for each feature per class
   - Computed variances (σ²) for each feature per class
   - Implemented Gaussian probability density function
   - Applied Bayes theorem for prediction using ArgMax

3. **Scikit-Learn Implementation**: Used the standard GaussianNB classifier from scikit-learn for comparison.

4. **Validation**: Compared the predictions, accuracy, and learned parameters (priors, means, variances) of both implementations to check that the manual version agrees with the library version.

### Mathematical Concepts Demonstrated:
- Probability theory (Bayes Theorem)
- Statistical measures (Mean, Variance)
- Gaussian distribution
- Maximum a posteriori (MAP) classification

### Technical Stack:
- Python 3.x
- NumPy (for mathematical operations)
- Pandas (for data manipulation)
- Scikit-learn (for dataset, splitting, and reference implementation)
- Matplotlib & Seaborn (for visualization)