# Final Q20: Build Generative Model with Iris Data
## Author: Matthew Stickle
## Generative model: Gaussian
Note: Unlike with the MNIST dataset, the densities here are large enough that underflow is not a concern, so we use `pdf` rather than `logpdf`. We also do not need to smooth the covariance matrices, since the unsmoothed sample covariance matrix of each class is non-singular.
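To illustrate why dimensionality decides whether `logpdf` is needed, here is a minimal sketch in plain Python. It assumes independent standard-normal features, which is a simplification and not the full-covariance model used in this notebook, and the helper names (`normal_pdf`, `joint_density`, and so on) are hypothetical. It compares a 4-feature point (Iris-sized) with a 784-feature point (MNIST-sized).

```python
import math

def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a univariate normal at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def normal_logpdf(x, mean=0.0, std=1.0):
    """Log-density of a univariate normal at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def joint_density(point):
    """Product of per-feature densities (assumes independent features)."""
    prob = 1.0
    for x in point:
        prob *= normal_pdf(x)
    return prob

def joint_log_density(point):
    """Sum of per-feature log-densities (same model, in log space)."""
    return sum(normal_logpdf(x) for x in point)

iris_like = [1.0] * 4     # 4 features, as in Iris
mnist_like = [1.0] * 784  # 784 features, as in MNIST

# With 4 features the product is a small but ordinary float
print(joint_density(iris_like), joint_log_density(iris_like))
# With 784 features the product underflows, while the log-density stays finite
print(joint_density(mnist_like), joint_log_density(mnist_like))
```

With only four features the plain density is safe to compare across classes, which is the situation in this notebook. In the 784-feature case only the log-space value is usable.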

In [1]:
import numpy as np

from pandas import read_csv

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from scipy.stats import multivariate_normal

In [2]:
# load in data
data = read_csv('./iris.data', header=None)

# Prepare data to be split
string_labels = data.iloc[:,-1]
string_labels = string_labels.tolist()

# remap strings to ints for convenience
mapping = {l:i for i, l in enumerate(np.unique(string_labels))}
labels = [mapping[x] for x in string_labels]
labels = np.array(labels)

# Drop labels from original dataframe, leaving only numerical data to find params with
data = data.drop(columns=4)
data = data.to_numpy()

In [3]:
# Split data
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=15, random_state=0)
# Verify types and dimensions
print(f"x_train type: {type(x_train)} x_train shape: {x_train.shape}\ny_train type: {type(y_train)} y_train shape: {y_train.shape}")
print(f"x_test type: {type(x_test)} x_test shape: {x_test.shape}\ny_test type: {type(y_test)} y_test shape: {y_test.shape}")
print(f"unique train class: {np.unique(y_train)}\nunique test class: {np.unique(y_test)}")

x_train type: <class 'numpy.ndarray'> x_train shape: (135, 4)
y_train type: <class 'numpy.ndarray'> y_train shape: (135,)
x_test type: <class 'numpy.ndarray'> x_test shape: (15, 4)
y_test type: <class 'numpy.ndarray'> y_test shape: (15,)
unique train class: [0 1 2]
unique test class: [0 1 2]


In [4]:
# determine class probabilities
train_size = x_train.shape[0]
iris, counts = np.unique(y_train, return_counts=True)
iris_probs = {i: c/train_size for i, c in zip(iris, counts)}
# compare with a tolerance, since exact float equality can fail from rounding
assert np.isclose(sum(iris_probs.values()), 1.0)

In [5]:
# Find mean and cov for each class to build generative gaussian model
g = {}
for i in iris:
    x = x_train[y_train == i, :]
    m = np.mean(x, axis = 0)
    cov = np.cov(x, rowvar=False)
    assert m.shape[0] == cov.shape[0]
    g[i] = {
        'm':   m,
        'cov': cov
    }
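The note at the top says no covariance smoothing was needed here. For reference, a minimal sketch of what smoothing could look like if a class covariance did turn out singular is shown below. The function name `smooth_covariance` and the constant `c=1e-3` are hypothetical choices for illustration, and the toy array has deliberately too few samples (3) for its 4 features.

```python
import numpy as np

def smooth_covariance(cov, c=1e-3):
    """Return cov + c * I. The constant c is an assumed smoothing strength."""
    return cov + c * np.eye(cov.shape[0])

# Toy data: 3 samples of 4 features, too few for a full-rank covariance
x = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
])

raw_cov = np.cov(x, rowvar=False)          # 4x4, rank-deficient
smoothed_cov = smooth_covariance(raw_cov)  # 4x4, full rank

print("rank before smoothing:", np.linalg.matrix_rank(raw_cov))
print("rank after smoothing: ", np.linalg.matrix_rank(smoothed_cov))
```

Adding a small multiple of the identity raises every eigenvalue by `c`, so the smoothed matrix is positive definite and can be passed to `multivariate_normal.pdf`. In practice `c` would be tuned on held-out data.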

In [6]:
# With mean and cov for each class, score the test data with the generative gaussian.
# Column i of res holds prior * likelihood for class i; the prediction is the best-scoring class.
res = np.full((x_test.shape[0], len(g)), -np.inf)
for i, params in g.items():
    pdf = multivariate_normal.pdf(x_test, mean=params['m'], cov=params['cov'])
    res[:, i] = pdf * iris_probs[i]
y_pred = np.argmax(res, axis=1)
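Written out, the prediction step above amounts to the rule below. The symbols are my own shorthand rather than anything defined in the notebook: $p_j$ stands for `iris_probs[j]`, $\mu_j$ for `g[j]['m']`, $\Sigma_j$ for `g[j]['cov']`, and $d = 4$ is the number of features.

$$
\hat{y}(x) = \arg\max_{j \in \{0, 1, 2\}} \; p_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j),
\qquad
\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \, |\Sigma|}} \exp\!\left( -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)
$$

Because the normalizing term $P(x)$ is the same for every class, maximizing $p_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)$ picks the same class as maximizing the posterior $P(y = j \mid x)$.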

In [7]:
acc = accuracy_score(y_test, y_pred)
error_rate = 1 - acc
print(f"error rate of generative gaussian model is {error_rate}")

error rate of generative gaussian model is 0.0


## Other observations
An error rate of 0% is suspicious, especially with such a small test set: with only 15 test samples, a single misclassification would move the error rate by about 6.7 percentage points. Increasing the test size and shrinking the training set (i.e., 100 test samples instead of 15) did produce a higher error rate. This makes sense, since a smaller training set gives less accurate estimates of the class probabilities, means, and covariances for each class. With the test set at 100 samples, I found an error rate of 5%.
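One way to put a number on how little a 0% result on 15 samples shows is an exact binomial bound. The sketch below assumes the test samples are independent draws from the same distribution. The function names are hypothetical and this calculation was not part of the original analysis. It asks: what is the largest true error rate that would still produce zero errors in `n_test` trials with probability at least 5%?

```python
def error_rate_resolution(n_test):
    """Smallest nonzero error rate that n_test samples can register."""
    return 1.0 / n_test

def zero_error_upper_bound(n_test, confidence=0.95):
    """Upper bound on the true error rate after observing 0 errors.

    Solves (1 - p) ** n_test = 1 - confidence for p, assuming
    independent test samples.
    """
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_test)

for n in (15, 100):
    print(
        f"n_test={n}: resolution={error_rate_resolution(n):.4f}, "
        f"95% upper bound after 0 errors={zero_error_upper_bound(n):.4f}"
    )
```

For 15 test samples the bound comes out at roughly 18%, so a perfect score on this split is still compatible with a model that misclassifies a noticeable share of unseen flowers.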