1. Implement a diagonal Gaussian parametric density estimator. It will
have to work for data of arbitrary dimension d. As seen in the labs, it
should have a train() method to learn the parameters and a method
predict() which calculates the log density.

In [87]:
import numpy as np
import math 

class diagonal_gaussian_parametric:
    def __init__(self):
        pass
    
    def train(self, train_inputs):
        
        # if only one training dimension is passed, add dummy dimension
        if len(np.shape(train_inputs)) == 1:
            train_inputs = np.expand_dims(train_inputs, axis=1).T

        self.train_inputs = train_inputs
        self.n, self.d = np.shape(self.train_inputs)

        self.mu = np.sum(self.train_inputs, axis=0) / self.n
        self.sigma = np.cov(self.train_inputs.T) * np.eye(self.d)

    def predict(self, test_data):

        # if only one test_data is passed, add dummy dimension
        if len(np.shape(test_data)) == 1:
            test_data = np.expand_dims(test_data, axis=1).T

        self.test_data = test_data
        n_inputs = np.shape(test_data)[0]
        densities = np.zeros(n_inputs)
        
        normalizer = 1 / ((2* np.pi)**(self.d / 2) * np.sqrt(sigma_det))

        # we treat each input test_data independently
        for i in range(n_inputs):
            
            sigma_inv = np.linalg.inv(self.sigma)
            sigma_det = np.linalg.det(self.sigma)

            exponent = (-0.5) * (self.test_data[i, :] - self.mu).T.dot(sigma_inv).dot(self.test_data[i, :] - self.mu)
            p = normalizer*np.exp(exponent)
            
            # handle edge case where p(x)=0
            if p == 0:
                p = np.finfo(float).eps
                
            densities[i] = -np.log(p)
        
        return(densities)
    
        


In [89]:
iris = np.loadtxt("iris.txt")
test = diagonal_gaussian_parametric()
test.train(iris)

test_data = np.array([[1,2,3,4,5], [2,30,40,50,60]])
print(test.predict(test_data))

test_data = np.array([1,2,3,4,5])
print(test.predict(test_data))

[37.26788439 36.04365339]
[37.26788439]


2. Implement a Parzen density estimator with an isotropic Gaussian kernel.
It will have to work for data of arbitrary dimension d. Likewise it
should have a train() method and a predict() method that computes
the log density.

In [123]:
class parzen_density_estimator:
    def __init__(self):
        pass
    
    def train(self, train_inputs, sigma=0):
        self.train_data = train_inputs
        self.d = len(self.train_data[0])
        
        if sigma == 0:
            self.sigma = np.std(self.train_data) # std because isotropic Gaussian
        else:
            self.sigma = sigma
            
    
    def predict(self, test_data):
        
        # if only one test_data is passed, add dummy dimension
        if len(np.shape(test_data)) == 1:
            test_data = np.expand_dims(test_data, axis=1).T

        self.test_data = test_data
        n_inputs = np.shape(test_data)[0]
        n_train = np.shape(self.train_data)[0]
        densities = np.zeros(n_inputs)
        
        normalizer = 1 / ((2*np.pi)**(self.d / 2) * self.sigma**self.d)
            
        for i in range(n_inputs): 
            
            # calculate average distance between this training point and all test points
            p = 0 
            for j in range(n_train):
            
                # we're using euclidean distance
                distance = np.sum(self.test_data[i, :] - self.train_data[j, :], axis=0)**2
                exponent = (-0.5) * (distance**2 / self.sigma**2)
                p += normalizer*np.exp(exponent)
            
            # handle edge case where p(x)=0
            if p == 0:
                p = np.finfo(float).eps
                
            # save the average kernel values across all training points
            densities[i] = -np.log(p/n_train)
        
            
        return(densities)


In [124]:
iris = np.loadtxt("iris.txt")
test = parzen_density_estimator()
test.train(iris)


test_data = np.array([[1,2,3,4,5], [2,30,40,50,60]])
print(test.predict(test_data))

test_data = np.array([1,2,3,4,5])
print(test.predict(test_data))

[ 9.57305544 41.05428868]
[9.57305544]


1D densities: From the Iris dataset examples, choose a subset corresponding
to one of the classes (of your choice), and one of the characteristic
features, so that we will be in dimension d = 1 and produce a
single graph (using the plot function) including:

(a) the data points of the subset (displayed on the x axis).
(b) a plot of the density estimated by your parametric Gaussian estimator.
(c) a plot of the density estimated by the Parzen estimator with a
hyper-parameter σ (standard deviation) too small.
(d) a plot of the density estimated by the Parzen estimator with the
hyper-parameter σ being a little too big.
(e) a plot of the density estimated by the Parzen estimator with the
hyper-parameter σ that you consider more appropriate. Use a
different color for each plot, and provide your graph with a clear
legend.
(f) Explain how you chose your hyper-parameter σ.

In [None]:
iris = np.leadtxt("iris.txt")
iris_subset = []


2D densities: Now add a second characteristic feature of Iris, in order
to have entries in d = 2 and produce 4 plots, each displaying the points
of the subset of the data (with the plot function ), and the contour
lines of the density estimated (using the contour function):
    
(a) by the diagonal Gaussian parametric estimator.
(b) by the Parzen estimator with the hyper-parameter σ (standard
deviation ) being too small.
(c) by the Parzen estimator with the hyper-parameter σ being a little
too big.
(d) by the Parzen estimator with the hyper-parameter σ that you
consider more appropriate.
(e) Explain how you chose your hyper-parameter σ