###### ### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2021 Semester 1

## Assignment 1: Pose classification with naive Bayes

###### Submission deadline: 7 pm, Tuesday 6 Apr 2021

**Student ID(s):**     997351


This iPython notebook is a template which you will use for your Assignment 1 submission.

Marking will be applied on the four functions that are defined in this notebook, and to your responses to the questions at the end of this notebook (Submitted in a separate PDF file).

**NOTE: YOU SHOULD ADD YOUR RESULTS, DIAGRAMS AND IMAGES FROM YOUR OBSERVATIONS IN THIS FILE TO YOUR REPORT (the PDF file).**

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find.

**Adding proper comments to your code is MANDATORY. **

In [27]:
# Code from COMP30027 (Machine Learning, 2021, Semester 2) Assignment 1 by Justin Kelley, 997351, (22/03/2021)
# Import required modules.
from collections import defaultdict
import math
import random
import copy

# Indicates a missing value.
MISSING_VALUE = "9999"
NULL = "NULL"
NA = None

# Penalty for missing probablity such as a class with no instances or missing attribute-class pair.
# Cannot have 0 for these cases as log(x) values will be negative for x between 0 and 1.
PENALTY = -9999999

# Kernel Bandwidths to check.
KBs = [5, 7.5, 10, 12.5, 15, 17.5, 20, 22.5, 25]
KBs_EXTENEDED = [1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 50, 100, 200, 500, 1000, 2000]

# Different cross validation partition sizes.
DEFAULT_M = 10
Ms = [1, 2, 5, 10, 20, 50, 100, 200, 400]

# Dividers
DIVIDER = "*******************************************************************"
PRIMARY_DIVIDER = "######################################################################"

In [28]:
def extractData(filename):
    """Takes data from file, filename, and splits it into two lists.
    One list stores the concept, while the other contains the attributes.
    Returns these two lists."""
    
    # List of instances for each attribute.
    x = []
    
    # List of instances for the concept.
    y = []
    
    # Open file and add data to attribute and concept lists.
    with open(filename, mode='r') as file:
        for line in file:
            atts = line.strip().split(",")
            
            # Flag missing values are removable by marking them as null.
            for i in range(1, len(atts)):
                if atts[i] == MISSING_VALUE:
                    atts[i] = NULL
                else:
                    atts[i] = float(atts[i])
            
            # Add values to lists.
            x.append(atts[1:])
            y.append(atts[0])
            
    return x, y



#*************************** PRIMARY FUNCTION START *************************#
def preprocess(trainFile, testFile):
    """Prepares the data by reading in data from a training and testing file
    and converts that data into a useful format for training and testing
    purposes. Returns the data in this format as seperate test and training sets."""
    
    # Test and training sets attributes.
    xTrain = []
    xTest = []
    
    # Test and training sets concepts.
    yTrain = []
    yTest = []
    
    # Extract data from test and training files.
    xTrain, yTrain = extractData(trainFile)
    xTest, yTest = extractData(testFile)
    
    # Returns testing sets (xTest, yTest), concepts (yTrain) as lists and
    # the training attribute set (xTrain) as a dictionary group by class levels.
    return xTrain, yTrain, xTest, yTest
#**************************** PRIMARY FUNCTION END **************************#


In [29]:
def groupByClass(xTrain, yTrain):
    """Takes the attributes in xTrain and groups them by the class levels
    stored in yTrain. Returns this grouping as a dictionary that is a list
    of lists."""
    
    # Stores attributes by class.
    attributesByClass = defaultdict(list)
    
    # Get each class level of the concept.
    for i in range(0, len(yTrain)):
        attributesByClass[yTrain[i]] = [[] for i in range(0, len(xTrain[0]))]
    
    # Obtain the instances for each attribute for each class level.
    for i in range(0, len(yTrain)):
        # For each instance for every attribute, add it to the appropriate list
        # via class level.
        features = attributesByClass[yTrain[i]]
        for j in range(0, len(features)):
            features[j].append(xTrain[i][j])
        attributesByClass[yTrain[i]] = features
        
    return attributesByClass



def computeMeanAndStandardDeviation(values):
    """Takes a list of floats, computes and returns their mean and standard
    deviation. Assumes values has length 2 or greater."""
    
    # Compute mean.
    mean = sum(values) / len(values)
    
    # Compute standard deviation.
    sd = 0
    for value in values:
        sd = sd + (value - mean) ** 2
    sd = math.sqrt(sd / (len(values) - 1))
    
    return mean, sd



#*************************** PRIMARY FUNCTION START *************************#
def train(xTrain, yTrain):
    """This function calculates the prior probabilities and likelihoods from the training data
    and uses it to build a Gaussian naive Bayes model. Returns dictionaries storing
    the class priors and likelihoods (mean and standard deviation for each attribute
    for each class)."""
    
    # Stores the probablity of each class level occuring among the instances.
    priors = defaultdict(int)
    
    # Stores the mean and standard deviation for each attribute for each class
    # level which are used to calculate likelihoods using the Gaussian distribution.
    likelihoods = defaultdict(list)
    
    
    # Group the attributes in training set by class.
    xTrain = groupByClass(xTrain, yTrain)
    
    # For each dictionary, get all the classes.
    for instance in yTrain:
        priors[instance] += 1
        likelihoods[instance] = []
    
    
    # Compute data for priors and likelihoods.
    for classLevel in priors:
        # Compute prior probablity for each class (class level).
        priors[classLevel] = priors[classLevel] / len(yTrain)
        
        # Compute mean and standard deviation for each attribute for each
        # class (class level).
        features = []
        for attribute in xTrain[classLevel]:
            # Remove missing values.
            values = []
            for instance in attribute:
                if instance != NULL:
                    values.append(float(instance))
            
            # Compute mean and standard deviation. Gaussian distribution needs sd > 0.
            if len(values):
                mean, sd = computeMeanAndStandardDeviation(values)
                if sd == 0.0:
                    features.append([NA, NA])
                else:
                    features.append([mean, sd])
            else:
                features.append([NA, NA])
        
        # Add attribute mean and standard deviation data for each class (class level).
        likelihoods[classLevel] = features
    
    return priors, likelihoods

In [30]:
def GaussianDensity(value, mean, standardDeviation):
    """Computes the probablity at the point, value, of a Gaussian distribution
    with mean, mean, and standard deviation, standardDeviation, using the
    density function. Returns this probablity."""
    
    # Compute the expotent for the expotential.
    expotent = -0.5 * (((value - mean) / standardDeviation) ** 2)
    
    # Compute the factor in front of the expotential.
    factor = 1 / (standardDeviation * math.sqrt(2 * math.pi))
    
    # Compute the probablity.
    return factor * math.exp(expotent)



def GaussianNaiveBayes(likelihoods, classLevel, instance, index):
    """Computes the likelihood probablity using Gaussian naive bayes."""
    
    # Get parameters for the Gaussian distribution.
    value = float(instance[index])
    mean = likelihoods[classLevel][index][0]
    std = likelihoods[classLevel][index][1]
    
    # Check if values were computable.
    if mean is None or std is None:
        return 0
    
    # Compute and return likelihood probablity through Gaussian distribution.
    return GaussianDensity(value, mean, std)



#************************ USED FOR QUESTION 3 START *************************#
def KDENaiveBayes(likelihoods, classLevel, instance, index, std):
    """Computes the likelihood probablity using KDE naive bayes.
    The standard deviation, std, is the kernel bandwidth."""
    
    # Value from test set.
    value = float(instance[index])
    
    # Use data points from training set as means for Gaussian distributions.
    means = []
    for mean in likelihoods[classLevel][index]:
        # Ignore missing values.
        if mean != NULL:
            means.append(float(mean))
    
    # Number of data points.
    n = len(means)
    if n == 0:
        return 0
    
    # Compute and return kernel density estimate (KDE).
    result = 0
    for mean in means:
        result = result + GaussianDensity(value, mean, std)
    return result / n
#************************* USED FOR QUESTION 3 END **************************#



def getMax(probablities):
    """For a list of probablities in a dictionary, probablities, for
    class levels will find and return the largest one."""
    
    # Class and probablity value for most likely class.
    maxClassLevel = None
    maxProbablity = None
    
    # Find highest probablity class.
    for classLevel in probablities:
        if maxClassLevel is None:
            maxClassLevel = classLevel
            maxProbablity = probablities[classLevel]
        elif maxProbablity < probablities[classLevel]:
            maxProbablity = probablities[classLevel]
            maxClassLevel = classLevel
    
    return maxClassLevel



#*************************** PRIMARY FUNCTION START *************************#
def predict(xTest, yTest, priors, likelihoods, method, kernelBandwidth = 5):
    """Uses the prior and likelihood probablities to make predictions for the class
    level for each instance in xTest. Will either perform Gaussian naive bayes or KDE
    naive bayes. Returns a list of the predictions."""

    # List of predictions made for each instance.
    predictions = []
    
    
    # Make a prediction for each instance.
    for instance in xTest:
        
        # For each instance, compute the probablity for each class.
        # Use logathirm for calculations to handle floating point issues.
        probablities = defaultdict(float)
        for classLevel in set(yTest):
            
            # Get prior probablity. Assign zero posterior probablity for classes with no instances.
            if priors[classLevel] == 0:
                probablities[classLevel] = PENALTY
                break
            probablity = math.log(priors[classLevel])
            
            # Compute and add likelihood probablity to prior probablity.
            for i in range(0, len(likelihoods[classLevel])):
                if instance[i] != NULL:
                    # Compute likelihood using Gaussian naive bayes.  (Basic implamentation prior to questions.)
                    if method == "Gaussian":
                        result = GaussianNaiveBayes(likelihoods, classLevel, instance, i)
                    # Compute likelihood using KDE naive bayes. (Question 3 and Question 4.)
                    elif method == "KDE":
                        result = KDENaiveBayes(likelihoods, classLevel, instance, i, kernelBandwidth)
                    if result > 0:
                        probablity = probablity + math.log(result)
                    else:
                        probablity = probablity + PENALTY
            
            probablities[classLevel] = probablity
         
        # Find class with highest probablity.
        predictions.append(getMax(probablities))    
    
    return predictions
#**************************** PRIMARY FUNCTION END **************************#

In [31]:
#*************************** PRIMARY FUNCTION START *************************#
def evaluate(predictions, yTest):
    """Computes and returns the accuracy score."""
    
    # Assign zero score if no predictions made.
    if len(predictions) == 0:
        return 0
    
    # Compute accuracy score.
    score = 0
    for i in range(0, len(predictions)):
        if (predictions[i] == yTest[i]):
            score += 1
    score = score / len(predictions)
    
    return score
#**************************** PRIMARY FUNCTION END **************************#

In [32]:
#************ SOME OF THE FUNCTIONS USED FOR QUESTION 4 START ***************#
def randomCrossValidation(x, y, m, shuffle = True):
    """Takes the traing data, x and y, shuffles it randomly to ensure greater chance of
    class level diversity in each partition for purposes of forming m partitions
    for cross-validation. Also allows for no random shuffling."""
    
    # Partitions of attributes and concepts.
    xPartitions = []
    yPartitions = []
    
    # All possible pairings of partitions to form training and test sets from m-1 and 1
    # partition respectively.
    sets = []
    
    # Partition size. Number of partitions is set to the number of instances.
    increment = int(len(y) / m)
    
    
    # Shuffle instances if requested.
    if shuffle:
        # Group concept together with attributes to ensure consistency when shuffling list.
        newList = x
        for i in range(0, len(x)):
            newList[i].append(y[i])

        # Shuffle the list.
        random.shuffle(newList)
        random.randrange(len(y))

        # Seperate list back into attribute and concept compotents.
        x = []
        y = []
        for i in newList:
            x.append(i[:-1])
            y.append(i[-1])
  
    
    # Form partitions of the data.
    last = 0
    for i in range(1, m + 1):
        if i != m:
            xPartitions.append(x[last: i * increment])
            yPartitions.append(y[last: i * increment])
        else:
            xPartitions.append(x[last:])
            yPartitions.append(y[last:])
        last = i * increment
    
    
    # Join partitions to form different training and test sets.
    for i in range(0, len(yPartitions)):
        xTrain = []
        yTrain = []
        
        # Test set is made from one partition.
        xTest = xPartitions[i]
        yTest = yPartitions[i]
        
        # Training set is made from remaining partitions.
        for j in range(0, len(yPartitions)):
            if j != i:
                xTrain += xPartitions[j]
                yTrain += yPartitions[j]
        sets.append([xTrain, yTrain, xTest, yTest])
    
    return sets
    
    

def selectKernelBandwidth(xTrain, yTrain, m, showInfo = False, shuffle = True):
    """Test the performance of KDE naive bayes with a range of kernel bandwidths using
    cross-validation with or without random shuffling of instances for testing in order
    to select the best one for the model."""
    
    # Split the training data into several training/test pairs via cross validation.
    # Data is shuffled to improve diversity of class levels in each pair.
    sets = randomCrossValidation(xTrain, yTrain, m, shuffle)
    
    
    # The kernel bandwidth with the highest score.
    maxScore = -1
    maxKB = -1
    
    
    # Determine the best kernel bandwidth.
    if showInfo:
        print(PRIMARY_DIVIDER)
        print(f"Showing accuracy score for each different KB (m = {m}):")
    for KB in KBs:
        # Testing sets from each run are to be joined together.
        predictions = []
        testSet = []
        # Run through get combination of test/training pairs formed.
        for selection in sets:
            # Stores the probablity of each class level occuring among the instances.
            priors = defaultdict(int)

            # Compute frequency of each class level.
            for instance in selection[1]:
                priors[instance] += 1

            # Compute prior probablity for each class level.
            for classLevel in priors:
                priors[classLevel] = priors[classLevel] / len(selection[1])
            
            # Prepare attributes for KDE likelihood estimations.
            likelihoods = groupByClass(selection[0], selection[1])
            
            # Predict class level and add to running list.
            predictions += predict(selection[2], selection[3], priors, likelihoods, "KDE", KB)
            testSet += selection[3]
            
        # Compute score and check if it is the highest.
        score = evaluate(predictions, testSet)
        if score > maxScore:
            maxScore = score
            maxKB = KB
        if showInfo:
            print(f"KB = {KB}, has accuracy score of {score}.")
    
    
    if showInfo:
        print(DIVIDER)
        print(f"KB = {maxKB} has highest accuracy score of {maxScore}.")
        print(PRIMARY_DIVIDER)
    
    return maxKB
#************** SOME OF THE FUNCTIONS USED FOR QUESTION 4 END ****************#

In [33]:
def runBasic():
    """Runs the Gaussian naive bayes models
    using the normal train and testing data sets."""
    
    # Get training and testing data sets from files.
    xTrain, yTrain, xTest, yTest = preprocess("train.csv", "test.csv")
    
    # Compute priors and likelihoods.
    priors, likelihoods = train(xTrain, yTrain)
    
    # Run naive bayes using Gaussian method.
    predictionsGaussian = predict(xTest, yTest, priors, likelihoods, "Gaussian")
    
    # Compute accuracy score.
    scoreGaussian = evaluate(predictionsGaussian, yTest)

    # Display results.
    print(PRIMARY_DIVIDER)  
    print(f"Gaussian naive bayes with default data set has accuracy score {round(scoreGaussian, 3)}.")
    print(PRIMARY_DIVIDER) 
    print()
    print()
    
    return 0

    
    
def runDefault():
    """Runs the Gaussian and KDE (with kernal bandwidth of 5) naive bayes models
    using the normal train and testing data sets."""
    
    # Get training and testing data sets from files.
    xTrain, yTrain, xTest, yTest = preprocess("train.csv", "test.csv")
    
    # Compute priors and likelihoods.
    priors, likelihoods = train(xTrain, yTrain)
    
    # Run naive bayes using Gaussian and KDE with kernel bandwidth of 5.
    predictionsGaussian = predict(xTest, yTest, priors, likelihoods, "Gaussian")
    predictionskDE = predict(xTest, yTest, priors, groupByClass(xTrain, yTrain), "KDE")
    
    # Compute accuracy scores for each method.
    scoreGaussian = evaluate(predictionsGaussian, yTest)
    scoreKDE = evaluate(predictionskDE, yTest)

    # Display results.
    print(PRIMARY_DIVIDER)  
    print(f"Gaussian naive bayes with default data set has accuracy score {round(scoreGaussian, 3)}.")
    print(DIVIDER)
    print(f"KDE naive bayes with default data set and kernel bandwidth of 5 has accuracy score {round(scoreKDE, 3)}.")
    print(PRIMARY_DIVIDER)
    print()
    print()
    
    return 0
    
    
    
def runDefaultWithCrossValidation():
    """Runs the Gaussian and KDE (with kernal bandwidth selected via cross validation)
    naive bayes models using the normal train and testing data sets."""
    
    # Get training and testing data sets from files.
    xTrain, yTrain, xTest, yTest = preprocess("train.csv", "test.csv")
    
    # Compute priors and likelihoods.
    priors, likelihoods = train(xTrain, yTrain)
    
    # Compute best kernel bandwidth using randomised cross validation.
    KB = selectKernelBandwidth(copy.deepcopy(xTrain), copy.deepcopy(yTrain), DEFAULT_M, shuffle = True)
    
    # Run naive bayes using Gaussian and KDE with kernel bandwidth of 5.
    predictionsGaussian = predict(xTest, yTest, priors, likelihoods, "Gaussian")
    predictionskDE = predict(xTest, yTest, priors, groupByClass(xTrain, yTrain), "KDE", KB)
    
    # Compute accuracy scores for each method.
    scoreGaussian = evaluate(predictionsGaussian, yTest)
    scoreKDE = evaluate(predictionskDE, yTest)

    # Display results.
    print(PRIMARY_DIVIDER)
    print(f"Gaussian naive bayes with default data set has accuracy score {round(scoreGaussian, 3)}.")
    print(DIVIDER)
    print("KDE naive bayes with default data set and kernel bandwidth selected by\n" +
          f"cross validation has accuracy score {round(scoreKDE, 3)}.")
    print(PRIMARY_DIVIDER)
    print()
    print()
    
    return 0

    
    
def runUsingWholeDataSet():
    """Runs the Gaussian and KDE (with kernal bandwidth of 5) naive bayes models
    using the whole data set as testing."""
    
    # Get training and testing data sets from files.
    xTrain, yTrain, xTest, yTest = preprocess("train.csv", "combined.csv")
    
    # Compute priors and likelihoods.
    priors, likelihoods = train(xTrain, yTrain)
    
    # Run naive bayes using Gaussian and KDE with kernel bandwidth of 5.
    predictionsGaussian = predict(xTest, yTest, priors, likelihoods, "Gaussian")
    predictionskDE = predict(xTest, yTest, priors, groupByClass(xTrain, yTrain), "KDE")
    
    # Compute accuracy scores for each method.
    scoreGaussian = evaluate(predictionsGaussian, yTest)
    scoreKDE = evaluate(predictionskDE, yTest)

    # Display results.
    print(PRIMARY_DIVIDER)  
    print(f"Gaussian naive bayes with combined data set has accuracy score {round(scoreGaussian, 3)}.")
    print(DIVIDER)
    print(f"KDE naive bayes with combined data set and kernel bandwidth of 5 has accuracy score {round(scoreKDE, 3)}.")
    print(PRIMARY_DIVIDER)
    print()
    print()
    
    return 0
    
    
    
def runUsingWholeDataSetWithCrossValidation():
    """Runs the Gaussian and KDE (with kernal bandwidth selected via cross validation)
    naive bayes models using the whole data set as testing."""
    
    # Get training and testing data sets from files.
    xTrain, yTrain, xTest, yTest = preprocess("train.csv", "combined.csv")
    
    # Compute priors and likelihoods.
    priors, likelihoods = train(xTrain, yTrain)
    
    # Compute best kernel bandwidth using randomised cross validation.
    KB = selectKernelBandwidth(copy.deepcopy(xTrain), copy.deepcopy(yTrain), DEFAULT_M, shuffle = True)
    
    # Run naive bayes using Gaussian and KDE with kernel bandwidth of 5.
    predictionsGaussian = predict(xTest, yTest, priors, likelihoods, "Gaussian")
    predictionskDE = predict(xTest, yTest, priors, groupByClass(xTrain, yTrain), "KDE", KB)
    
    # Compute accuracy scores for each method.
    scoreGaussian = evaluate(predictionsGaussian, yTest)
    scoreKDE = evaluate(predictionskDE, yTest)

    # Display results.
    print(PRIMARY_DIVIDER)
    print(f"Gaussian naive bayes with combined data set has accuracy score {round(scoreGaussian, 3)}.")
    print(DIVIDER)
    print("KDE naive bayes with combined data set and kernel bandwidth selected by\n" +
          f"cross validation has accuracy score {round(scoreKDE, 3)}.")
    print(PRIMARY_DIVIDER)
    print()
    print()
    
    return 0
    
    
    
def runKDEWithDifferentKBs():
    """Runs the Gaussian naive bayes model using a range of KBs
    using both training and testing data sets."""
    
    # Get training and testing data sets from files.
    xTrain, yTrain, xTest, yTest = preprocess("train.csv", "test.csv")
    
    # Compute priors and likelihoods.
    priors, likelihoods = train(xTrain, yTrain)
    
    # Run naive bayes using a range of KBs.
    print(PRIMARY_DIVIDER)
    print("Running KDE naive bayes with normal training and testing sets.")
    for KB in KBs_EXTENEDED:
        # Run naive bayes using KDE.
        predictionskDE = predict(xTest, yTest, priors, groupByClass(xTrain, yTrain), "KDE", KB)

        # Compute accuracy scores for each method.
        scoreKDE = evaluate(predictionskDE, yTest)

        # Display results.
        print(f"When KB = {KB}, accuracy score is {round(scoreKDE, 3)}.")
        print(PRIMARY_DIVIDER)
        print(PRIMARY_DIVIDER)
    print(PRIMARY_DIVIDER)
    print()
    print()
    
    return 0



def runKDEWithCrossValidationWithInfo():
    """Runs the Gaussian naive bayes model using a range of Ms
    using both training and testing data sets."""
    
    # Get training and testing data sets from files.
    xTrain, yTrain, xTest, yTest = preprocess("train.csv", "test.csv")
    
    # Compute priors and likelihoods.
    priors, likelihoods = train(xTrain, yTrain)
    
    # Run naive bayes using a range of Ms using cross validation.
    print(PRIMARY_DIVIDER)
    print("Running KDE naive bayes with normal training and testing sets.")
    print("Running it with different values for m.")
    for m in Ms:
        # Compute best kernel bandwidth using randomised cross validation.
        KB = selectKernelBandwidth(copy.deepcopy(xTrain), copy.deepcopy(yTrain), m, showInfo = True, shuffle = True)
        
        # Make predictions using KDE naive bayes.
        predictionskDE = predict(xTest, yTest, priors, groupByClass(xTrain, yTrain), "KDE", KB)
    
        # Compute accuracy scores for the method.
        scoreKDE = evaluate(predictionskDE, yTest)

        # Display results.
        print(f"When m = {m}, KB = {KB}, accuracy score is {round(scoreKDE, 3)}.")
        print(PRIMARY_DIVIDER)
        print(PRIMARY_DIVIDER)
    print(PRIMARY_DIVIDER)
    print()
    print()
    
    return 0
    

    
def runKDECrossValidationWithNoShuffling():
    """Runs the Gaussian naive bayes model using a range of Ms
    using both training and testing data sets without random
    shuffling to show the result of this."""
    
    # Get training and testing data sets from files.
    xTrain, yTrain, xTest, yTest = preprocess("train.csv", "test.csv")
    
    # Compute priors and likelihoods.
    priors, likelihoods = train(xTrain, yTrain)
    
    # Run naive bayes using a range of Ms using cross validation.
    print(PRIMARY_DIVIDER)
    print("Running KDE naive bayes with normal training and testing sets.")
    print("Running it with different values for m.")
    for m in Ms:
        # Compute best kernel bandwidth using normal cross validation.
        KB = selectKernelBandwidth(copy.deepcopy(xTrain), copy.deepcopy(yTrain), m, showInfo = True, shuffle = False)
        
        # Make predictions using KDE naive bayes.
        predictionskDE = predict(xTest, yTest, priors, groupByClass(xTrain, yTrain), "KDE", KB)
    
        # Compute accuracy scores for the method.
        scoreKDE = evaluate(predictionskDE, yTest)

        # Display results.
        print(f"When m = {m}, KB = {KB}, accuracy score is {round(scoreKDE, 3)}.")
        print(PRIMARY_DIVIDER)
        print(PRIMARY_DIVIDER)
    print(PRIMARY_DIVIDER)
    print()
    print()
    
    return 0
    
    
    
#**************************** DRIVER FUNCTION END ***************************#
def runModel():
    """This functions runs the naive bayes model under different circumstances."""
    # Each of these functions will run the naive bayes classifer under different
    # situations. Comment or uncomment the function to run it. Use the README file
    # for further details.
    
    # Run the Gaussian using the normal train and testing data sets.
    runBasic()
    
    # Run the Gaussian and KDE (with kernal bandwidth of 5) naive bayes models
    # using the normal train and testing data sets.
    runDefault()
    
    # Run the Gaussian and KDE (with kernal bandwidth selected via cross validation)
    # naive bayes models using the normal train and testing data sets.
    runDefaultWithCrossValidation()
    
    # Run the Gaussian and KDE (with kernal bandwidth of 5) naive bayes models
    # using the whole data set as testing.
    runUsingWholeDataSet()
    
    # Run the Gaussian and KDE (with kernal bandwidth selected via cross validation)
    # naive bayes models using the whole data set as testing."""
    runUsingWholeDataSetWithCrossValidation()
    
    # Run the Gaussian naive bayes model using a range of KBs
    # using both training and testing data sets.
    runKDEWithDifferentKBs()
    
    # Run the Gaussian naive bayes model using a range of Ms
    # using both training and testing data sets.
    runKDEWithCrossValidationWithInfo()
    
    # Run the Gaussian naive bayes model using a range of Ms
    # using both training and testing data sets without random
    # shuffling to demonstrate the consquences.
    runKDECrossValidationWithNoShuffling()
    
    return 0
#**************************** DRIVER FUNCTION END ***************************#
    
runModel()

######################################################################
Gaussian naive bayes with default data set has accuracy score 0.716.
######################################################################


######################################################################
Gaussian naive bayes with default data set has accuracy score 0.716.
*******************************************************************
KDE naive bayes with default data set and kernel bandwidth of 5 has accuracy score 0.767.
######################################################################


######################################################################
Gaussian naive bayes with default data set has accuracy score 0.716.
*******************************************************************
KDE naive bayes with default data set and kernel bandwidth selected by
cross validation has accuracy score 0.759.
######################################################################


#####################

KB = 5, has accuracy score of 0.7951807228915663.
KB = 7.5, has accuracy score of 0.8032128514056225.
KB = 10, has accuracy score of 0.7991967871485943.
KB = 12.5, has accuracy score of 0.7965194109772423.
KB = 15, has accuracy score of 0.7925033467202142.
KB = 17.5, has accuracy score of 0.7951807228915663.
KB = 20, has accuracy score of 0.7938420348058902.
KB = 22.5, has accuracy score of 0.7831325301204819.
KB = 25, has accuracy score of 0.7777777777777778.
*******************************************************************
KB = 7.5 has highest accuracy score of 0.8032128514056225.
######################################################################
When m = 5, KB = 7.5, accuracy score is 0.767.
######################################################################
######################################################################
######################################################################
Showing accuracy score for each different KB (m = 10):
KB = 5, has accuracy s

When m = 2, KB = 5, accuracy score is 0.767.
######################################################################
######################################################################
######################################################################
Showing accuracy score for each different KB (m = 5):
KB = 5, has accuracy score of 0.5488621151271754.
KB = 7.5, has accuracy score of 0.5488621151271754.
KB = 10, has accuracy score of 0.5488621151271754.
KB = 12.5, has accuracy score of 0.5488621151271754.
KB = 15, has accuracy score of 0.5488621151271754.
KB = 17.5, has accuracy score of 0.5488621151271754.
KB = 20, has accuracy score of 0.5488621151271754.
KB = 22.5, has accuracy score of 0.5488621151271754.
KB = 25, has accuracy score of 0.5488621151271754.
*******************************************************************
KB = 5 has highest accuracy score of 0.5488621151271754.
######################################################################
When m = 5, KB = 5, accurac

0

## Questions 


If you are in a group of 1, you will respond to **two** questions of your choosing.

If you are in a group of 2, you will respond to **four** questions of your choosing.

A response to a question should take about 100–250 words, and make reference to the data wherever possible.

#### NOTE: you may develope codes or functions to help respond to the question here, but your formal answer should be submitted separately as a PDF.

### Q1
Since this is a multiclass classification problem, there are multiple ways to compute precision, recall, and F-score for this classifier. Implement at least two of the methods from the "Model Evaluation" lecture and discuss any differences between them. (The implementation should be your own and should not just call a pre-existing function.)

### Q2
The Gaussian naıve Bayes classifier assumes that numeric attributes come from a Gaussian distribution. Is this assumption always true for the numeric attributes in this dataset? Identify some cases where the Gaussian assumption is violated and describe any evidence (or lack thereof) that this has some effect on the classifier’s predictions.

### Q3
Implement a kernel density estimate (KDE) naive Bayes classifier and compare its performance to the Gaussian naive Bayes classifier. Recall that KDE has kernel bandwidth as a free parameter -- you can choose an arbitrary value for this, but a value in the range 5-25 is recommended. Discuss any differences you observe between the Gaussian and KDE naive Bayes classifiers. (As with the Gaussian naive Bayes, this KDE naive Bayes implementation should be your own and should not just call a pre-existing function.)

### Q4
Instead of using an arbitrary kernel bandwidth for the KDE naive Bayes classifier, use random hold-out or cross-validation to choose the kernel bandwidth. Discuss how this changes the model performance compared to using an arbitrary kernel bandwidth.

### Q5
Naive Bayes ignores missing values, but in pose recognition tasks the missing values can be informative. Missing values indicate that some part of the body was obscured and sometimes this is relevant to the pose (e.g., holding one hand behind the back). Are missing values useful for this task? Implement a method that incorporates information about missing values and demonstrate whether it changes the classification results.

### Q6
Engineer your own pose features from the provided keypoints. Instead of using the (x,y) positions of keypoints, you might consider the angles of the limbs or body, or the distances between pairs of keypoints. How does a naive Bayes classifier based on your engineered features compare to the classifier using (x,y) values? Please note that we are interested in explainable features for pose recognition, so simply putting the (x,y) values in a neural network or similar to get an arbitrary embedding will not receive full credit for this question. You should be able to explain the rationale behind your proposed features. Also, don't forget the conditional independence assumption of naive Bayes when proposing new features -- a large set of highly-correlated features may not work well.