# ID2214 Assignment 2 Group no. [7]
### Project members: 
[Cecilia Battinelly, cbat@kth.se]
[Valgerdur Tryggvadottir, vtry@kth.se]
[Parastu Rahgozar, parastu@kth.se]
[Zhiwu Dong, zhiwu@kth.se]

### Declaration:
By submitting this solution, it is hereby declared that all individuals listed above have contributed to the solution, either with code that appears in the final solution below, or with code that has been evaluated and compared to the final solution, but for some reason has been excluded. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214) and no part of the solution has been provided by someone not listed as project member above.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy, pandas and time may be used in the solution for this assignment.


### Instructions
All assignments starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of assignments starting with 
number 2 below, then the assignment will receive 2 points (in total).

It is highly recommended that you do not develop the code directly within the notebook
but that you copy the comments and test cases to your regular development environment
and only when everything works as expected, that you paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above,
and thereby 


## Load NumPy, pandas and time

In [15]:
import numpy as np
import pandas as pd
import time


## Reused functions from Assignment 1

In [16]:
# Copy and paste functions from Assignment 1 here that you need for this assignment

# ACCURACY FUNCTION
def accuracy(df, correctlabels):
    dff = df.copy()
    if len(correctlabels) != len(dff.index):
        print("Error, mismatching dimensions!")
        return None
 
    
    # Create a series where the class corresponding to the largest element
    # in a column are collected. NOTE: in case of a tie, the first class is
    # selected.
    
    maximum = dff.idxmax(axis=1)
    num = 0
    
    for i in range(len(maximum)):
        if maximum[i] == correctlabels[i]:
            num += 1
    den = len(maximum)
    
    frac = num / den

    return frac


# BRIER SCORE FUNCTION
def brier_score(df,correctlabels):
    # Create an empty list where all partial sums will be stored
    score = []
    
    # Iterate over each row according to the definition:
    # for each prediction do (pi-oi)^2 where pi is the predicted
    # probability and oi is the real one, i.e. it is 1 iff class of i is the
    # correct label o/w it is 0
    
    for i in df.index:
        # Create partial sum for each row
        summ = 0
        row = np.array(df.loc[i])
        
        # Initialize the o vector for the current row to 0 and set the
        # element in position indx to 1, because indx is the index
        # corresponding to the correct class
        
        o = np.zeros(len(row))
        indx = np.where(df.columns==correctlabels[i])[0]
        o[indx] = 1
        # Use a separate loop variable so the row index i is not shadowed
        for j in range(len(row)):
            summ += (row[j] - o[j])**2
        score.append(summ)
    
    # Normalize the score dividing by the number of instances = number of rows 
    # = number of prediction vectors!
    brier = sum(score)/len(df.index)
    
    return brier


# AUC FUNCTION
def auc(df,correctlabels):
    #Create a vector where to store partial AUCs for each class (i.e. column) 
    AUCT = [] 
    for col in df.columns:
         # For each class create a dictionary to store {score: tp, fp}
        d = {}
        values = df[col].values
        # Create array of scores values in decreasing order
        val = sorted(values, reverse = True)  
        # For each element in val iterate and set the threshold equal to the 
        # current value and label the instances accordingly to this 
        for elem in val:
            # Current score = threshold = each value in the column iteratively
            s = elem 
            tp = 0
            fp = 0
            # For each score in the unordered list of scores check if the label is correct
            # If so, increase by 1 tp; o/w increase by 1 fp
            for j in range(len(values)):
                if values[j] == s:
                    label = col
                    if label == correctlabels[j]:
                        tp += 1
                    else:
                        fp += 1
            # Append the values of fp, tp to the corresponding score
            d[s] = [tp,fp]
        # Implementation of AUC area algorithm for each class
        Tot_tp = 0
        Tot_fp = 0
        for elem in d:
            Tot_tp += d[elem][0]
            Tot_fp += d[elem][1]
        AUC = 0
        Cov_tp = 0
        for elem in d:
            fp = d[elem][1]
            tp = d[elem][0]
            if fp == 0:
                Cov_tp += tp
            elif tp == 0:
                AUC += (Cov_tp/Tot_tp)*(fp/Tot_fp)
            else:
                AUC += (Cov_tp/Tot_tp)*(fp/Tot_fp)+(tp/Tot_tp)*(fp/Tot_fp)/2
                Cov_tp += tp
        AUCT.append(AUC)
    
    
    # Compute the frequency vector of the classes' occurrence w.r.t. the correct labels
    fr = []
    for col in df.columns:
        freq = 0
        for val in correctlabels:
            if val == col:
                freq += 1
        freq = freq /len(correctlabels)
        fr.append(freq)
    
    # Return weighted auc = return scalar product AUCT & frequency vector
    return np.dot(fr,AUCT)





# CREATE_NORMALIZATION FUNCTION
def create_normalization(df,normalizationtype="minmax"):
    d = {}
    df1 = df.copy()
    # Go through the columns different from ID and CLASS
    # pick the max and min value and append them (tog. with the norm_type name)
    # to a dictionary whose key is the column name.
    # Then go through all the elements in the column and normalize them
    # finally change the column values using the normalized ones
    if normalizationtype=="minmax":
        for elem in df1.columns:
            if elem!='ID' and elem!='CLASS':
                minn = df1[elem].min()
                maxx = df1[elem].max()
                d[elem] = ("minmax", minn, maxx)
                df1[elem] = [(x-minn)/(maxx-minn) for x in df[elem]]
    else:
    # Go through the columns different from ID and CLASS
    # find the mean and std value and append them (tog. with the norm_type name)
    # to a dictionary whose key is the column name.
    # Then go through all the elements in the column and normalize them-
    # Finally change the column values using the normalized ones
        for elem in df1.columns:
            if elem!='ID' and elem!='CLASS':
                mean = df1[elem].mean()
                std = df1[elem].std()
                d[elem] = ("zscore",mean,std)
                df1[elem] = df1[elem].apply(lambda x: (x-mean)/std)
    return df1, d


# APPLY NORMALIZATION
def apply_normalization(df,normalization):
    df1 = df.copy()
    # Go through the columns different from ID and CLASS
    # and normalize the column's value, according to the chosen method,
    # using the referring values from mapping dictionary normalization
    # (i.e. min & max for minmax & mean, std for zscore)
    # NOTE: for minmax normalization type normalized values > 1 are set to 1
    # & normalized values < 0 are set to 0 (Normalization minmax must range 0-1)
    # Finally change the column's values using the normalized ones
    for elem in df1.columns:
        if elem!='ID' and elem!='CLASS':
            tup_elem = normalization[elem]
            # Work on whole columns so integer columns are not truncated by
            # in-place assignment into the underlying array
            if tup_elem[0] == "minmax":
                normalized = (df1[elem] - tup_elem[1])/(tup_elem[2]-tup_elem[1])
                # Here we restrict the values to the 0-1 interval
                df1[elem] = normalized.clip(0.0, 1.0)
            elif tup_elem[0] == "zscore":
                df1[elem] = (df1[elem] - tup_elem[1]) / tup_elem[2]

    return df1


# CREATE_IMPUTATION FUNCTION
def create_imputation(df1):
    df = df1.copy()
    d = {}
    # Go through all the columns whose name is different from ID & CLASS
    # First we analyze numerical columns and if all the values are missing
    # we fill them with 0, otherwise we fill the missing values with the mean.
    # We append to the dictionary the mean value of the column referring to the column
    # name as key.

    # Later(else) we go through object and category columns. If all values are missing
    # we fill them with "" for object column, or first element for category column
    # otherwise we fill with the first element of mode array.
    # We append to the dictionary the mode value [0] of the column referring to the column
    # name as key.
    
    for elem in df.columns:
        if elem!='ID' and elem!='CLASS':
            if df[elem].dtype == "float64" or df[elem].dtype == "int64":
                values = df[elem].values     
                if np.all(np.isnan(values)):
                    df[elem].fillna(0, inplace = True)
                else:
                    df[elem].fillna(df[elem].mean(), inplace = True)
                    

                value = df[elem].mean() 
                d[elem] = value
            else:
                # A comparison with np.nan is always False, so test for
                # missing values explicitly
                if df[elem].isna().all():
                    if df[elem].dtype == "object":
                        df[elem].fillna("", inplace = True)
                    elif df[elem].dtype == "category":
                        df[elem].fillna(df[elem].cat.categories[0], inplace = True)
                else:
                    df[elem].fillna(df[elem].mode()[0], inplace = True)

                value2 = df[elem].mode()[0]
                d[elem] = value2

    return df, d


# APPLY IMPUTATION
def apply_imputation(df1,imputation):
    df = df1.copy()
    
    # Go through all columns whose name is not ID or CLASS and fill the missing
    # values using the imputation dictionary, i.e., given the column name, we fill
    # the missing values with the value stored in the dictionary (elem: value).

    for elem in df1.columns:
        if elem!='ID' and elem!='CLASS':
            tup_elem = imputation[elem]
            df[elem].fillna(tup_elem, inplace = True)

    return df

def create_bins(dff,nobins=10,bintype="equal-width"):
    d = {}
    df = dff.copy()
    if bintype=="equal-width":
        for elem in df.columns:
            if elem!='ID' and elem!='CLASS':
                values = df[elem].values
                res, bins = pd.cut(values,nobins,retbins=True,labels=False)
                bins=list(bins)
                bins[0] = -np.inf
                bins[-1] = np.inf
                d[elem] = bins #mapping
                df[elem]= res
                df[elem] = df[elem].astype("category")
                df[elem].cat.categories
                lab = list(range(0,nobins))
                df[elem] = df[elem].cat.set_categories(lab)
    else:
        for elem in df.columns:
            if elem!='ID' and elem!='CLASS':
                values = df[elem].values
                res, bins = pd.qcut(values,nobins,retbins=True,labels=False, duplicates='drop')
                bins = list(bins)
                bins[0] = -np.inf
                bins[-1]= np.inf
                d[elem] =  bins #mapping
                df[elem] = res
                df[elem] = df[elem].astype("category")
                df[elem].cat.categories
                lab = list(range(0,nobins))
                df[elem] = df[elem].cat.set_categories(lab)


    return df, d

def apply_bins(dff,binning):
    df = dff.copy()
    for elem in df.columns:
        if elem!='ID' and elem!='CLASS':
                values = df[elem].values
                bins = binning[elem]
                res = pd.cut(values,bins,labels=False)
                df[elem] = res
                df[elem] = df[elem].astype("category")
                df[elem].cat.categories
    return df




## 1. Define the class kNN

In [17]:
# Define the class kNN with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self: the object itself
#
# Output from __init__:
# nothing
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# imputation, normalization, one_hot, labels, training_labels, training_data
#
# Input to fit:
# self: the object itself
# df: a dataframe (where the column names "CLASS" and "ID" have special meaning)
# normalizationtype: "minmax" (default) or "zscore"
#
# Output from fit:
# nothing
#
# The result of applying this function should be:
#
# self.imputation should be an imputation mapping (see Assignment 1) from df
# self.normalization should be a normalization mapping (see Assignment 1), using normalizationtype from the imputed df
# self.one_hot should be a one-hot mapping (see Assignment 1; can be excluded if this function was not completed)
# self.training_labels should be a pandas series corresponding to the "CLASS" column, set to be of type "category" 
# self.labels should be the categories of the previous series
# self.training_data should be the values (an ndarray) of the transformed dataframe, i.e., after employing imputation, 
# normalization, and possibly one-hot encoding, and also after removing the "CLASS" and "ID" columns 
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Input to predict:
# self: the object itself
# df: a dataframe
# k: an integer >= 1 (default = 5)
# 
# Output from predict:
# predictions: a dataframe with class labels as column names and the rows corresponding to
#              predictions with estimated class probabilities for each row in df, where the class probabilities
#              are estimated by the relative class frequencies in the set of class labels from the k nearest 
#              (with respect to Euclidean distance) neighbors in training_data
#
# Hint 1: Drop any "CLASS" and "ID" columns first and then apply imputation, normalization and (possibly) one-hot
# Hint 2: Get the numerical values (as an ndarray) from the resulting dataframe and iterate over the rows 
#         calling some sub-function, e.g., get_nearest_neighbor_predictions(x_test,k), which for a test row
#         (numerical input feature values) finds the k nearest neighbors and calculate the class probabilities.
# Hint 3: This sub-function may first find the distances to all training instances, e.g., pairs consisting of
#         training instance index and distance, and then sort them according to distance, and then (using the indexes
#         of the k closest instances) find the corresponding labels and calculate the relative class frequencies
class kNN:
        
    def __init__(self):
        self.imputation = None
        self.normalization = None
        self.one_hot = None
        self.labels = None
        self.training_labels = None
        self.training_data = None
        self.training_time = None


    def fit(self, df, normalizationtype="minmax"):
        norm = normalizationtype
        # imputation mapping
        df1, map_imp = create_imputation(df)
        self.imputation = map_imp
        
        # normalization mapping
        df3, map_norm = create_normalization(df1,norm)
        self.normalization = map_norm
        
        
        
        classes = df["CLASS"].astype("category")  # Gives the column "CLASS", each column in DataFrame is of type Series
        self.training_labels = classes
        
        # labels: the categories of the previous series
        self.labels = classes.cat.categories # gives the unique categories 

        df4 = df3.drop(["CLASS", "ID"], axis = 1)
        self.training_data = df4



    def predict(self, df, k=5):
        
        # First drop "CLASS" and "ID" if df has these columns
        # (errors="ignore" skips any column that is not present)
        df_copy = df.drop(columns=["CLASS", "ID"], errors="ignore")

            
        
        # Apply imputation, normalization
        df1 = apply_imputation(df_copy,self.imputation)
        df3 = apply_normalization(df1,self.normalization)

        
        # Define the subfunction get_nearest neighbors
        def get_nearest_neighbors(x_test, k):
            
            distances = []

            for i in range(len(self.training_data)):
                # First we compute the euclidean distance 
                # distance between x_test (which is a new point) to all training instances
                distance = np.sqrt(np.sum(np.square(x_test - self.training_data.iloc[i, :])))
                # add it to list of distances
                distances.append([distance, i])          

            # The distances in increasing order, then we can choose the k closest ones
            distances = sorted(distances)

            
            # Make a list of the k neighbors' targets
            labels = []
            for i in range(k):
                index = distances[i][1]
                
                # Use the positions of the k nearest neighbors to find the corresponding labels
                labels.append(self.training_labels.iloc[index])
            
            
            # Calculate class frequency
            k_nearest = labels
            
            k_unique = set(list(k_nearest))
            d = {}
            for elem in k_unique:
                freq = k_nearest.count(elem)
                freq = freq /len(k_nearest)
                d[elem] = freq
            
            probability = np.zeros(len(self.labels))
           
            for i in range(len(self.labels)):
                if self.labels[i] in k_unique:
                    probability[i] = d[self.labels[i]]
            
            return k_nearest, probability
        
        
            
        values = df3.values  
        
        predictions_list = []
        # Iterate over rows
        for i in range(len(values)):
            x_test = values[i]
            k_nearest, probability = get_nearest_neighbors(x_test, k)
            cols = self.labels
            predictions_list.append(probability)
        # Create a dataframe for the voting classes. There are #row=#instances and #cols=classes
        predictions = pd.DataFrame(predictions_list, columns=cols)
       
           
        return predictions

In [18]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

knn_model = kNN()

t0 = time.perf_counter()
knn_model.fit(glass_train_df)
print("Training time: {0:.2f} s.".format(time.perf_counter()-t0))

test_labels = glass_test_df["CLASS"]

k_values = [1,3,5,7,9]
results = np.empty((len(k_values),3))

for i in range(len(k_values)):
    t0 = time.perf_counter()
    predictions = knn_model.predict(glass_test_df,k=k_values[i])
    print("Testing time (k={0}): {1:.2f} s.".format(k_values[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=k_values,columns=["Accuracy","Brier score","AUC"])

results


Training time: 0.01 s.
Testing time (k=1): 3.74 s.
Testing time (k=3): 3.19 s.
Testing time (k=5): 3.24 s.
Testing time (k=7): 3.22 s.
Testing time (k=9): 3.19 s.


| k | Accuracy | Brier score | AUC |
|---|----------|-------------|-----|
| 1 | 0.747664 | 0.504673 | 0.81035 |
| 3 | 0.663551 | 0.488058 | 0.815859 |
| 5 | 0.579439 | 0.471028 | 0.833843 |
| 7 | 0.598131 | 0.471867 | 0.833481 |
| 9 | 0.616822 | 0.482981 | 0.827727 |


In [19]:
train_labels = glass_train_df["CLASS"]
predictions = knn_model.predict(glass_train_df,k=1)
print("Accuracy on training set (k=1): {0:.2f}".format(accuracy(predictions,train_labels)))
print("AUC on training set (k=1): {0:.2f}".format(auc(predictions,train_labels)))
print("Brier score on training set (k=1): {0:.2f}".format(brier_score(predictions,train_labels)))


Accuracy on training set (k=1): 1.00
AUC on training set (k=1): 1.00
Brier score on training set (k=1): 0.00
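
The testing times reported earlier (roughly 3 s per value of k) are likely dominated by the Python-level loop over training rows in `get_nearest_neighbors`. As a minimal sketch, and assuming the training data is available as a NumPy array rather than a dataframe, the distances for one test row can be computed in a single vectorized step. The helper name `knn_class_probabilities` and the toy data are hypothetical and not part of the submitted solution.

```python
import numpy as np

def knn_class_probabilities(x_test, training_data, training_labels, labels, k=5):
    """Relative class frequencies among the k nearest training rows (hypothetical helper)."""
    # Euclidean distance from x_test to every training row in one vectorized step
    distances = np.sqrt(((training_data - x_test) ** 2).sum(axis=1))
    # Indices of the k smallest distances; a stable sort keeps the lower index on ties
    nearest = np.argsort(distances, kind="stable")[:k]
    nearest_labels = training_labels[nearest]
    # Fraction of the k neighbours that belong to each class, in the order of `labels`
    return np.array([(nearest_labels == label).mean() for label in labels])

# Toy data, made up for illustration
train = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
train_labels = np.array(["a", "a", "b", "b"])
probs = knn_class_probabilities(np.array([0.0, 0.1]), train, train_labels, ["a", "b"], k=3)
print(probs)  # two of the three nearest neighbours are "a", one is "b"
```

The predicted probabilities should match those of the loop-based version whenever distances are not tied; only the way the distances are computed differs.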


In the end, one-hot encoding (`one_hot`) was not used: it converts categorical feature values into numerical ones, and the given dataset is already numerical.
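
For a dataset that did contain categorical features, one-hot encoding would be needed before computing Euclidean distances. The following is a minimal sketch using `pd.get_dummies` rather than the one-hot functions from Assignment 1; the dataframe and its `COLOR` column are made up for illustration.

```python
import pandas as pd

# Toy dataframe with one categorical feature (hypothetical data)
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "COLOR": ["red", "green", "red"],
    "CLASS": ["x", "y", "x"],
})

# One binary column per observed category; "ID" and "CLASS" are left untouched
encoded = pd.get_dummies(df, columns=["COLOR"], dtype=float)
print(encoded.columns.tolist())
```

Each category becomes its own 0/1 column, so the encoded features can be fed to a distance computation like any other numerical column.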

## 2. Define the class NaiveBayes

In [20]:
# Define the class NaiveBayes with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self: the object itself
#
# Output from __init__:
# nothing
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# binning, class_priors, feature_class_value_counts, feature_class_counts
#
# Input to fit:
# self: the object itself
# df: a dataframe (where the column names "CLASS" and "ID" have special meaning)
# nobins: no. of bins (default = 10)
# bintype: either "equal-width" (default) or "equal-size" 
#
# Output from fit:
# nothing
#
# The result of applying this function should be:
#
# self.binning should be a discretization mapping (see Assignment 1) from df
# self.class_priors should be a mapping (dictionary) from the labels (categories) of the "CLASS" column of df,
# to the relative frequencies of the labels
# self.feature_class_value_counts should be a mapping from a feature (column name) to another mapping, which
# given a feature value and class label provides the number of training instances with this specific combination
# self.feature_class_counts should be a mapping from the feature (column name) and class label to the number of
# training instances with this specific class label and any (non-missing) value for the feature
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Hint 1: feature_class_value_counts can be a dictionary, which given a feature f returns a mapping obtained 
#         by pandas groupby and size (see lecture slides), which given a feature value v and class label c 
#         returns the number of instances, e.g., using get((c,v),0)
#
# Input to predict:
# self: the object itself
# df: a dataframe
# 
# Output from predict:
# predictions: a dataframe with class labels as column names and the rows corresponding to
# predictions with estimated class probabilities for each row in df, where the class probabilities
# are estimated by the naive approximation of Bayes rule (see lecture slides)
#
# Hint 1: First apply discretization
# Hint 2: Iterating over either columns or rows, and for each possible class label, calculate the relative
#         frequency of the observed feature value given the class (using feature_class_value_counts and 
#         feature_class_counts) 
# Hint 3: Calculate the non-normalized estimated class probabilities by multiplying the class priors to the
#         product of the relative frequencies
# Hint 4: Normalize the probabilities by dividing by the sum of the non-normalized probabilities; in case
#         this sum is zero, then set the probabilities to the class priors
class NaiveBayes: 
    
    def __init__(self):
        self.binning = None
        self.class_priors = None
        self.feature_class_value_counts = None
        self.feature_class_counts = None
        self.labels = None

    def fit(self, df, nobins = 10, bintype = "equal-width"):
        df1 = df.copy()
        df1, self.binning = create_bins(df, nobins, bintype)
        classColumn = df["CLASS"].astype("category")
        classes = df["CLASS"]
        freq = classes.value_counts()/len(classes) #value_counts works for series and counts instances of each element    
        self.class_priors = freq.to_dict()
        self.labels = classColumn.cat.categories
        dictCount, dictValueCount = {},{}
        
        for col in df1.columns:
            if(col not in ['CLASS', 'ID']):
                df2 = df1.dropna(axis = 0,subset = ['CLASS', col])
                df3 = df2.groupby(['CLASS',col]).size()
                dictCount[col] = dict(df2.loc[:,'CLASS'].value_counts())
                dictValueCount[col] = dict(df3)

        self.feature_class_value_counts = dictValueCount
          
        self.feature_class_counts = dictCount
        
    def predict(self, df):
        df1 = df.copy()
        df1 = apply_bins(df1, self.binning)

        # errors='ignore' lets predict accept dataframes without "CLASS" or "ID"
        df1 = df1.drop(columns = ['CLASS', 'ID'], errors = 'ignore')
        # Training labels
        labels = self.labels
        # #Rows, columns and labels from test dataset
        nrow, ncol, nlabel = df1.shape[0], df1.shape[1], len(labels)
        matrix = np.zeros([nlabel, nrow, ncol])
        
        # Matrix with a coefficient that is the relative frequency for a specific (classLabel, Row, and feature)
        # 
        for col_num in range(ncol):
            col = df1.columns[col_num]
            elem = list(set(df1[col]))
            k  = len(elem)
            # This k (the number of distinct values observed in the column) is needed if one wants to apply
            # Laplace correction. To do so, uncomment the commented relat_frequency lines and comment out the active ones
            for label_num in range(nlabel):
                    label = labels[label_num]
    
                    for row_num in range(nrow):
                        # Value of bin in this location
                        value = df1.iloc[row_num, col_num]
                        # Search for this value given this label in self.feature_class_value_counts
                        if((label, value) in self.feature_class_value_counts[col].keys()):
                            features_value_count = self.feature_class_value_counts[col][(label, value)]
                            feature_count = self.feature_class_counts[col][label]
                            relat_frequency = (features_value_count)/ (feature_count)
                            #relat_frequency = (features_value_count + 1)/ (feature_count + k)
                        else:
                            relat_frequency = 0
                            # relat_frequency = 1 / (self.feature_class_counts[col].get(label, 0) + k)
                        matrix[label_num, row_num, col_num] = relat_frequency
                        # Save the relat_frequencies into a 3D matrix
                        # The dimensions are class, instances, column name
                        # so we can think that for each class we get a 2D matrix of the same size as df1,
                        # where the values in each cell will be the relat_frequency for the bin value that was in df1 
                        # given this class, so we can think that we have 6 (because 6 classes) 2D matrices on top of each other


        # For each (class label, row) pair, multiply the relative frequencies of all features to obtain
        # the product term of the naive Bayes numerator. The resulting matrix has shape (nlabel, nrow):
        # each column holds, for one instance, the product of relative frequencies given each class.
        mat_non_normalized = matrix.prod(axis = 2)
        
        # We create a matrix of the class prior value to then multiply one-by-one the terms of the two matrices
        classesVect = np.array([self.class_priors[labels[i]] for i in range(nlabel)])
        # Create a matrix of classes and then transpose it
        matClasses = np.tile(classesVect, nrow).reshape([nrow, nlabel]).T
        mat_non_normalized = (mat_non_normalized * matClasses)
        
        # Create the normalization matrix: for each instance, sum the non-normalized probabilities over all
        # classes, so that after division the probabilities of each instance sum to 1.
        normalization = np.sum(mat_non_normalized, axis = 0)
        # When this sum is 0 (i.e. all non-normalized probabilities are 0) we cannot divide by it,
        # so we set it to 1, which leaves the (all-zero) values unchanged by the division.
        # We also keep a mask of which sums were 0, to later replace those probabilities
        # with the class priors
        normalizing_mat = np.tile(normalization, nlabel).reshape([nlabel, nrow])
        normalizing_mat_zero = normalizing_mat==0
        normalizing_mat += normalizing_mat_zero.astype('float')
        
        # Then we normalize to get the final matrix.
        normalized_mat = mat_non_normalized / normalizing_mat
        # And we replace the probabilities where the sum were 0 by the class priors.
        # We are doing that by adding the values 
        # (because if the sum was 0 then the value was 0 also so adding is same as replacing)
        normalizing_adding_priors = normalizing_mat_zero.astype('float')*matClasses
        normalized_mat += normalizing_adding_priors

        result = pd.DataFrame(normalized_mat.T, columns = labels)
        return(result)

In [21]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

nb_model = NaiveBayes()

test_labels = glass_test_df["CLASS"]

nobins_values = [3,5,10]
bintype_values = ["equal-width","equal-size"]
parameters = [(nobins,bintype) for nobins in nobins_values for bintype in bintype_values]

results = np.empty((len(parameters),3))

for i in range(len(parameters)):
    t0 = time.perf_counter()
    nb_model.fit(glass_train_df,nobins=parameters[i][0],bintype=parameters[i][1])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = nb_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=pd.MultiIndex.from_product([nobins_values,bintype_values]),
                       columns=["Accuracy","Brier score","AUC"])

results


Training time (3, 'equal-width'): 0.07 s.
Testing time (3, 'equal-width'): 0.09 s.
Training time (3, 'equal-size'): 0.06 s.
Testing time (3, 'equal-size'): 0.08 s.
Training time (5, 'equal-width'): 0.05 s.
Testing time (5, 'equal-width'): 0.09 s.
Training time (5, 'equal-size'): 0.05 s.
Testing time (5, 'equal-size'): 0.08 s.
Training time (10, 'equal-width'): 0.05 s.
Testing time (10, 'equal-width'): 0.08 s.
Training time (10, 'equal-size'): 0.05 s.
Testing time (10, 'equal-size'): 0.08 s.


| nobins | bintype | Accuracy | Brier score | AUC |
|--------|---------|----------|-------------|-----|
| 3 | equal-width | 0.616822 | 0.622116 | 0.724335 |
| 3 | equal-size | 0.607477 | 0.554782 | 0.780163 |
| 5 | equal-width | 0.64486 | 0.551101 | 0.771688 |
| 5 | equal-size | 0.598131 | 0.581556 | 0.796675 |
| 10 | equal-width | 0.654206 | 0.527569 | 0.812887 |
| 10 | equal-size | 0.588785 | 0.741668 | 0.751165 |


In [22]:
train_labels = glass_train_df["CLASS"]
nb_model.fit(glass_train_df)
predictions = nb_model.predict(glass_train_df)
print("Accuracy on training set: {0:.2f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.2f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.2f}".format(brier_score(predictions,train_labels)))

Accuracy on training set: 0.85
AUC on training set: 0.97
Brier score on training set: 0.23


Laplace correction is implemented but was not used for the results above. To test it, uncomment the two commented-out `relat_frequency` lines in `predict` and comment out the active assignments above them.
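
To illustrate what the correction changes, here is a minimal sketch of the two estimates of the relative frequency of a feature value given a class, assuming the standard add-one form. The function `relative_frequency` and its parameter names are hypothetical and do not appear in the class above.

```python
def relative_frequency(value_count, class_count, n_values, laplace=False):
    """Estimate P(feature value | class) from counts (hypothetical helper).

    value_count: training instances with this feature value and class
    class_count: training instances with this class and any non-missing value
    n_values:    number of possible values (bins) for the feature
    """
    if laplace:
        # Add one pseudo-count for each possible feature value
        return (value_count + 1) / (class_count + n_values)
    if class_count == 0:
        return 0.0
    return value_count / class_count

# A value never seen together with the class: 0 without correction, small but non-zero with it
print(relative_frequency(0, 20, 10))
print(relative_frequency(0, 20, 10, laplace=True))
```

Without the correction, a single unseen feature value sets the whole product for that class to zero; with it, the class keeps a small non-zero probability.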