# **Supervised Learning on Iris Dataset**

In this assignment, we apply different supervised learning algorithms to classify the Iris dataset.

Algorithms covered:
- Q1: Naive Bayes
- Q2: Decision Trees
- Q3: k-Nearest Neighbors
- Q4: Logistic Regression
- Q5: Support Vector Machines

In [33]:
# Import libraries
import sys, os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tabulate import tabulate
from matplotlib.colors import ListedColormap

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.datasets import make_friedman1, load_iris
from sklearn.preprocessing import StandardScaler

# Custom scripts
# sys.path.append("../scripts")
# from naiveBayes import naiveBayes
# import BasicTree

# Sklearn classifiers
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


In [34]:
# Define the naive Bayes function
def naiveBayes(classes, learner, parameterised_function, train_data):
    """
    Train a naive Bayes model by fitting a 
    parameterised distribution to each feature within each class, 
    assuming independence between features.

    Args:
        classes (list): List of possible class labels.
        learner (function): Function that estimates parameters from training data 
                            (e.g., mean and std for Gaussian).
        parameterised_function (function): Function that takes parameters and 
                                           returns a probability density function.
        train_data (np.ndarray): Training data with features in columns and 
                                 class labels in the last column.

    Returns:
        dict: A dictionary of functions, where each function g[class_value](test_data) 
              computes the unscaled likelihood for a batch of test points belonging to that class.
    """
    f = {}           # feature-wise likelihood functions for each class
    parameters = {}  # learned parameters per feature per class
    g = {}           # class likelihood functions

    for class_value in classes:
        # Store parameters and functions for each feature of this class
        parameters[class_value] = {}
        f[class_value] = {}

        # Extract only training samples belonging to the current class
        train_x = train_data[train_data[:, -1] == class_value][:, :-1]

        # Learn distribution parameters for each feature
        for feature in range(train_x.shape[1]): 
            parameters[class_value][feature] = learner(train_x[:,feature])
            f[class_value][feature] = parameterised_function(parameters[class_value][feature])

        # Define a class-specific likelihood function
        def create_g(class_value):     
            def g(test_data):
                # For each test point, compute likelihood of each feature independently
                unscaled_feature_likelihoods = np.array([
                    [f[class_value][feature](test_data[point, feature]) 
                     for feature in range(test_data.shape[1])]
                    for point in range(test_data.shape[0])
                ])
                # Multiply feature likelihoods (independence assumption)
                unscaled_point_likelihood = np.prod(unscaled_feature_likelihoods, axis=1).reshape(-1, 1)
                return unscaled_point_likelihood
            return g
        
        # Store the likelihood function for this class
        g[class_value] = create_g(class_value)

    return g


#  Example Usage 
# classes = [0,1]
# def learner(train):
#     mu = np.mean(train)
#     sig = np.std(train)
#     return [mu,sig]
# def parameterised_function(parameters):
#     mu, sig = parameters
#     return lambda x: np.exp(-0.5*(x - mu)**2/(sig**2))
# train_data = np.array([[2.0, 4.0, 0.0], [1.0, 5.0, 0.0],
#                        [4.0, 2.0, 1.0], [6.0, 0.0, 1.0]])
# g = naiveBayes(classes, learner, parameterised_function, train_data)
# test_data = np.array([[2.0, 5.0], [3.0,3.0]])
# for class_value in classes:
#     print(g[class_value](test_data)) 


## **Question 1.1**
### **What does the naive Bayes classifier actually return?**
The `naiveBayes` function does not return predictions or probabilities. It returns a dictionary `g` that maps each class label to a function, and `g[class_value](test_data)` evaluates the unscaled likelihood of every test point under that class as an `(n_points, 1)` array. Each score is the product of the feature-wise likelihoods under the independence assumption; it is not multiplied by a class prior or normalised across classes. To classify a point, evaluate every class function and take the class with the highest score.
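
As a minimal sketch of that last step, the snippet below turns a dictionary of class score functions into predicted labels. The helper `predict_from_scores` and the stand-in score functions built by `make_score` are hypothetical names introduced for illustration; they only mimic the shape of what `naiveBayes` returns.

```python
import numpy as np

def predict_from_scores(g, test_data):
    """Pick, for each test point, the class whose score function is largest.

    `g` is assumed to be a dict {class_label: function}, where each function maps
    an (n_points, n_features) array to an (n_points, 1) array of unscaled likelihoods.
    """
    labels = list(g.keys())
    # Stack the per-class score columns side by side: shape (n_points, n_classes)
    scores = np.hstack([g[label](test_data) for label in labels])
    # argmax along the class axis, then map column indices back to labels
    return np.array(labels)[np.argmax(scores, axis=1)]

# Stand-in score functions (hypothetical), shaped like the ones naiveBayes returns
def make_score(mus, sigs):
    mus, sigs = np.asarray(mus), np.asarray(sigs)
    def score(test_data):
        per_feature = np.exp(-0.5 * (test_data - mus) ** 2 / sigs ** 2)
        return np.prod(per_feature, axis=1).reshape(-1, 1)
    return score

g_demo = {0: make_score([1.5, 4.5], [0.5, 0.5]),
          1: make_score([5.0, 1.0], [1.0, 1.0])}
test_points = np.array([[2.0, 5.0], [5.0, 1.5]])
predictions = predict_from_scores(g_demo, test_points)
print(predictions)
```

Because the scores carry no class prior, taking the largest one matches the most probable class only when the classes are equally frequent in the training data.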

### **What do the functions do that are defined inside the main function?**
Two functions are defined inside `naiveBayes`. `create_g(class_value)` is a factory: it binds the current class label in a closure, so each returned function keeps its own class instead of all of them seeing the final value of the loop variable. The inner function `g(test_data)` is the class-specific likelihood calculator: it applies the learnt feature distributions (from the training phase) to each test point and multiplies the results to score how likely the point is if it were generated by that class.
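
The reason for wrapping the inner function in a factory can be seen in a small, self-contained sketch. The names `demo` and `make_getter` are hypothetical; the point is only that a function created in a loop looks up the loop variable when it is called, not when it is defined.

```python
def demo():
    labels = ["a", "b", "c"]

    # Without a factory: each lambda reads `label` when called, so all see "c"
    late_bound = {}
    for label in labels:
        late_bound[label] = lambda: label

    # With a factory: each call gets its own `bound_label`, fixed at creation time
    def make_getter(bound_label):
        return lambda: bound_label

    bound = {}
    for label in labels:
        bound[label] = make_getter(label)

    late_results = [late_bound[k]() for k in labels]
    bound_results = [bound[k]() for k in labels]
    return late_results, bound_results

late_results, bound_results = demo()
print(late_results)   # ['c', 'c', 'c']
print(bound_results)  # ['a', 'b', 'c']
```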

### **What is the role of the inputs to the naiveBayes function, in particular the learner and the parameterised function?**
The `learner` is the procedure that estimates parameters from the training data for a single feature of a given class, for example, computing the mean and standard deviation if we assume Gaussian features. The `parameterised_function` then takes those parameters and returns an actual density function that can score feature values. Together, these inputs define the modelling choice: `learner` extracts parameters, and `parameterised_function` turns them into usable feature-wise likelihood functions. The remaining inputs supply the data: `classes` lists the labels to fit, and `train_data` holds the features in columns with the class label in the last column.
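
One possible pair of inputs is sketched below, assuming Gaussian features. The names `gaussian_learner` and `gaussian_pdf` are illustrative and not part of the assignment code. Unlike the commented example in the cell above, this version keeps the normalising factor, so likelihoods from classes with different standard deviations are on the same scale.

```python
import numpy as np

def gaussian_learner(feature_values):
    """Estimate the mean and standard deviation of one feature within one class."""
    return [np.mean(feature_values), np.std(feature_values)]

def gaussian_pdf(parameters):
    """Return a normalised Gaussian density for the given [mean, std]."""
    mu, sig = parameters
    norm_const = 1.0 / (sig * np.sqrt(2.0 * np.pi))
    return lambda x: norm_const * np.exp(-0.5 * (x - mu) ** 2 / sig ** 2)

# Hypothetical values of one feature for one class
feature_values = np.array([4.0, 5.0, 6.0])
params = gaussian_learner(feature_values)
density = gaussian_pdf(params)

# Numerical check that the density integrates to roughly 1
xs = np.linspace(-10.0, 20.0, 20001)
area = np.sum(density(xs)) * (xs[1] - xs[0])
print(params, area)
```

Both functions have the signatures that `naiveBayes` expects, so they could be passed as `learner` and `parameterised_function` directly. A feature with zero variance within a class would make `sig` zero and needs separate handling.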

## **Question 1.2: Where does the independence assumption made by the Naive Bayes approach come into the calculation?**
The independence assumption appears when the classifier multiplies together the feature-wise likelihoods. Instead of modelling the full joint distribution of all features at once, it assumes each feature contributes independently. That’s why, in the code, the per-feature likelihood values are computed separately and then combined using a product across features.
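
In symbols, the assumption is $p(\mathbf{x} \mid c) = \prod_j p(x_j \mid c)$. The short sketch below uses made-up per-feature likelihoods to show the product that `np.prod(..., axis=1)` computes, together with the equivalent sum of logarithms, which is a common way to avoid numerical underflow when there are many features.

```python
import numpy as np

# Hypothetical per-feature likelihoods for 2 test points and 3 features
feature_likelihoods = np.array([[0.20, 0.50, 0.10],
                                [0.90, 0.80, 0.70]])

# Independence assumption: the joint likelihood is the product over features
joint = np.prod(feature_likelihoods, axis=1)

# Same quantity via logs: a sum over features instead of a product
log_joint = np.sum(np.log(feature_likelihoods), axis=1)

print(joint)              # product form
print(np.exp(log_joint))  # matches the product form up to rounding
```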

## **Question 1.3: What class of functions does the parameterised function in the example represent?**
In the given example, the parameterised function represents the Gaussian (normal) family: each feature within a class is modelled by a bell curve with its own mean and standard deviation. As written, the example returns `np.exp(-0.5*(x - mu)**2/(sig**2))`, which omits the normalising factor `1/(sig*sqrt(2*pi))`, so it is a Gaussian density only up to a scale that depends on `sig`. More generally, the parameterised function defines the family of distributions used to describe feature likelihoods (Gaussian in this case), but it could just as well be Bernoulli, multinomial, or any other parametric density depending on the data.
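
To illustrate swapping the family, here is a minimal sketch of a Bernoulli pair for binary (0/1) features. The names `bernoulli_learner` and `bernoulli_pmf` are hypothetical, and the sketch applies no smoothing, so an estimated probability of exactly 0 or 1 would produce zero likelihoods.

```python
import numpy as np

def bernoulli_learner(feature_values):
    """Estimate P(feature = 1) for one binary feature within one class."""
    return [np.mean(feature_values)]

def bernoulli_pmf(parameters):
    """Return the Bernoulli probability mass function for the given [p]."""
    (p,) = parameters
    return lambda x: np.where(x == 1, p, 1.0 - p)

# Hypothetical binary feature values for one class
binary_feature = np.array([1, 1, 0, 1])
pmf = bernoulli_pmf(bernoulli_learner(binary_feature))
p_one = float(pmf(1))
p_zero = float(pmf(0))
print(p_one, p_zero)
```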


In [35]:
# Load Iris dataset with first column as index
iris = pd.read_csv("../data/iris.csv", index_col=0)

# === Column headers, dtypes, missing values ===
col_info = pd.DataFrame({
    "Column": iris.columns,
    "Dtype": iris.dtypes.astype(str),
    "Missing": iris.isnull().sum().values
})

print(" Column Overview")
print(tabulate(col_info, headers="keys", tablefmt="psql"))

# === Quick statistical summary ===
print("\n Summary Statistics")
print(tabulate(iris.describe().reset_index(), headers="keys", tablefmt="psql"))

# === Value counts for Species (categorical) ===
print("\n Species Value Counts")
# rename_axis + reset_index(name=...) gives the same column names across pandas versions
species_counts = iris['Species'].value_counts().rename_axis('Species').reset_index(name='Count')
print(tabulate(species_counts, headers="keys", tablefmt="psql"))

print("\n First 5 Rows of the Dataset")
print(tabulate(iris.head(), headers="keys", tablefmt="psql"))



 Column Overview
+--------------+--------------+---------+-----------+
|              | Column       | Dtype   |   Missing |
|--------------+--------------+---------+-----------|
| Sepal.Length | Sepal.Length | float64 |         0 |
| Sepal.Width  | Sepal.Width  | float64 |         0 |
| Petal.Length | Petal.Length | float64 |         0 |
| Petal.Width  | Petal.Width  | float64 |         0 |
| Species      | Species      | object  |         0 |
+--------------+--------------+---------+-----------+

 Summary Statistics
+----+---------+----------------+---------------+----------------+---------------+
|    | index   |   Sepal.Length |   Sepal.Width |   Petal.Length |   Petal.Width |
|----+---------+----------------+---------------+----------------+---------------|
|  0 | count   |     150        |    150        |       150      |    150        |
|  1 | mean    |       5.84333  |      3.05733  |         3.758  |      1.19933  |
|  2 | std     |       0.828066 |      0.435866 |         1.7

## **Question 1.4** 
We train a Naive Bayes classifier on the Iris dataset, using a train/test split to evaluate its performance on unseen data.

In [36]:
# Features (X) and target (y)
X = iris.drop(columns=["Species"])
y = iris["Species"]

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train Gaussian Naive Bayes
nb = GaussianNB()
nb.fit(X_train, y_train)

# Predictions
y_pred = nb.predict(X_test)

# Accuracy
print("Accuracy")
print(round(accuracy_score(y_test, y_pred), 4))

# Classification Report
print("\n Classification Report")
report = classification_report(y_test, y_pred, output_dict=True)
print(tabulate(pd.DataFrame(report).T, headers="keys", tablefmt="psql"))

# Confusion Matrix
print("\n Confusion Matrix")
cm = confusion_matrix(y_test, y_pred, labels=nb.classes_)
cm_df = pd.DataFrame(cm, index=nb.classes_, columns=nb.classes_)
print(tabulate(cm_df, headers="keys", tablefmt="psql"))

Accuracy
0.9667

 Classification Report
+--------------+-------------+----------+------------+-----------+
|              |   precision |   recall |   f1-score |   support |
|--------------+-------------+----------+------------+-----------|
| setosa       |    1        | 1        |   1        | 10        |
| versicolor   |    1        | 0.9      |   0.947368 | 10        |
| virginica    |    0.909091 | 1        |   0.952381 | 10        |
| accuracy     |    0.966667 | 0.966667 |   0.966667 |  0.966667 |
| macro avg    |    0.969697 | 0.966667 |   0.966583 | 30        |
| weighted avg |    0.969697 | 0.966667 |   0.966583 | 30        |
+--------------+-------------+----------+------------+-----------+

 Confusion Matrix
+------------+----------+--------------+-------------+
|            |   setosa |   versicolor |   virginica |
|------------+----------+--------------+-------------|
| setosa     |       10 |            0 |           0 |
| versicolor |        0 |            9 |           

### **Question 1.4 Observation**
The classifier performs very well, reaching an accuracy of about 96.7% (29 of the 30 test samples). Setosa is perfectly separated, with no misclassifications at all. Versicolor has one sample misclassified as virginica (recall 0.90), which is also why the precision for virginica drops to about 0.91 even though every virginica sample is recovered. Precision and recall are at least 0.90 for every class. Overall, the naive Bayes approach is effective for this dataset, with only minor overlap between versicolor and virginica. With only 30 test samples, a single error shifts accuracy by about 3.3 percentage points, so this figure is indicative rather than precise.

In [37]:
# Define a function that generates synthetic regression data
def makedata():
    """
    Generate synthetic regression dataset.

    Returns:
        tuple: X_train, X_test, y_train, y_test split into training and testing sets.
    """
    n_points = 500  # number of data points
 
    X, y = make_friedman1(n_samples=n_points, n_features=5, 
                          noise=1.0, random_state=100)
         
    return train_test_split(X, y, test_size=0.5, random_state=3)

# Define the main function to run the regression tree  
def main():
    """
    Train a manually implemented regression tree and print its test-set MSE.
    The comparison with sklearn's DecisionTreeRegressor is run separately below.
    """
    X_train, X_test, y_train, y_test = makedata()    
    maxdepth = 10 # maximum tree depth             
    
    # Create tree root node
    treeRoot = TNode(0, X_train, y_train) 
       
    # Build the regression tree recursively
    Construct_Subtree(treeRoot, maxdepth) 
    
    # Predict on test set
    y_hat = np.zeros(len(X_test))
    for i in range(len(X_test)):
        y_hat[i] = Predict(X_test[i], treeRoot)          
    
    MSE = np.mean(np.power(y_hat - y_test, 2))    
    print("Basic tree: tree loss = ",  MSE)

# Define the tree node class
class TNode:
    """
    Class representing a node in the regression tree.
    """
    def __init__(self, depth, X, y): 
        """
        Initialize tree node.

        Args:
            depth (int): Depth of the node in the tree.
            X (np.ndarray): Matrix of explanatory variables.
            y (np.ndarray): Vector of response values.
        """
        self.depth = depth
        self.X = X
        self.y = y
        self.j = None   # index of splitting variable
        self.xi = None  # split threshold
        self.left = None
        self.right = None
        self.g = None   # regional predictor (mean of target values)
      
    def CalculateLoss(self):
        """
        Compute sum of squared deviations from mean (impurity measure).

        Returns:
            float: Loss value for this node.
        """
        if len(self.y) == 0:
            return 0
        return np.sum(np.power(self.y - self.y.mean(), 2))
                    
# Define function to construct the regression sub-tree
def Construct_Subtree(node, max_depth):  
    """
    Recursively construct regression tree.

    Args:
        node (TNode): Current node.
        max_depth (int): Maximum depth allowed.

    Returns:
        TNode: The constructed subtree rooted at this node.
    """
    if node.depth == max_depth or len(node.y) == 1:
        node.g = node.y.mean()  # Leaf node prediction
    else:
        j, xi = CalculateOptimalSplit(node)               
        node.j = j
        node.xi = xi
        Xt, yt, Xf, yf = DataSplit(node.X, node.y, j, xi)
              
        if len(yt) > 0:
            node.left = TNode(node.depth + 1, Xt, yt)
            Construct_Subtree(node.left, max_depth)
        
        if len(yf) > 0:        
            node.right = TNode(node.depth + 1, Xf, yf)
            Construct_Subtree(node.right, max_depth)      
     
    return node

# Define function to split dataset at a node
def DataSplit(X, y, j, xi):
    """
    Split dataset into left and right subsets at feature j and threshold xi.

    Args:
        X (np.ndarray): Feature matrix.
        y (np.ndarray): Target vector.
        j (int): Feature index.
        xi (float): Threshold value.

    Returns:
        tuple: (Xt, yt, Xf, yf) left and right splits.
    """
    ids = X[:, j] <= xi      
    Xt  = X[ids, :]
    Xf  = X[~ids, :]
    yt  = y[ids]
    yf  = y[~ids]
    return Xt, yt, Xf, yf             

# Define function to calculate the optimal split at a node
def CalculateOptimalSplit(node):
    """
    Find the best split for a node by minimizing loss.

    Args:
        node (TNode): Node to split.

    Returns:
        tuple: (best_var, best_xi) best splitting feature index and threshold.
    """
    X, y = node.X, node.y
    best_var = 0
    best_xi = X[0, best_var]          
    best_split_val = node.CalculateLoss()
    
    m, n = X.shape
    
    for j in range(n):
        for i in range(m):
            xi = X[i, j]
            Xt, yt, Xf, yf = DataSplit(X, y, j, xi)
            tmpt = TNode(0, Xt, yt) 
            tmpf = TNode(0, Xf, yf) 
            loss_t = tmpt.CalculateLoss()
            loss_f = tmpf.CalculateLoss()    
            curr_val = loss_t + loss_f
            if curr_val < best_split_val:
                best_split_val = curr_val
                best_var = j
                best_xi = xi
    return best_var, best_xi

# Define function to predict response for a single data point
def Predict(X, node):
    """
    Predict response for a single data point.

    Args:
        X (np.ndarray): Feature vector.
        node (TNode): Root of the decision tree.

    Returns:
        float: Predicted value.
    """
    if node.right is None and node.left is not None:
        return Predict(X, node.left)
    
    if node.right is not None and node.left is None:
        return Predict(X, node.right)
    
    if node.right is None and node.left is None:
        return node.g
    else:
        if X[node.j] <= node.xi:
            return Predict(X, node.left)
        else:
            return Predict(X, node.right)
    
# Run the main function
main()  # run manual tree

# Compare with sklearn
from sklearn.tree import DecisionTreeRegressor
X_train, X_test, y_train, y_test = makedata()    
regTree = DecisionTreeRegressor(max_depth=10, random_state=0)
regTree.fit(X_train, y_train)
y_hat = regTree.predict(X_test)
MSE2 = np.mean(np.power(y_hat - y_test, 2))    
print("DecisionTreeRegressor: tree loss = ",  MSE2)     


Basic tree: tree loss =  9.067077996170276
DecisionTreeRegressor: tree loss =  10.197991295531748
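
The split criterion used by `CalculateOptimalSplit` above can be checked by hand on a tiny example. The sketch below is a simplified one-feature version with made-up data; `sse` is a hypothetical helper that mirrors `CalculateLoss`. It tries each observed value as a threshold and keeps the one with the smallest combined sum of squared deviations.

```python
import numpy as np

def sse(y):
    """Sum of squared deviations from the mean (0 for an empty array)."""
    return 0.0 if len(y) == 0 else float(np.sum((y - y.mean()) ** 2))

# Hypothetical one-feature data with an obvious jump between x = 2 and x = 3
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.2, 5.0, 5.2])

# Try each observed value as a threshold, as CalculateOptimalSplit does
losses = {}
for xi in x:
    left = x <= xi
    losses[float(xi)] = sse(y[left]) + sse(y[~left])

best_xi = min(losses, key=losses.get)
print(losses)
print("best threshold:", best_xi)
```

The threshold `2.0` wins because it separates the two low responses from the two high ones, leaving almost no variation inside either group.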
