# Assignment 2 Group no. 6
### Project members: 
Mar Balibrea Rull, marbr@kth.se

Marcos Fernández Carbonell, marcosfc@kth.se

Tiger Zha, tigerz2@illinois.edu

### Declaration:
By submitting this solution, it is hereby declared that all individuals listed above have contributed to the solution, either with code that appears in the final solution below, or with code that has been evaluated and compared to the final solution, but for some reason has been excluded. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214) and no part of the solution has been provided by someone not listed as project member above.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy, pandas and time may be used in the solution for this assignment.


### Instructions
All assignments starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of assignments starting with 
number 2 below, then the assignment will receive 2 points (in total).

It is highly recommended that you do not develop the code directly within the notebook
but that you copy the comments and test cases to your regular development environment
and only when everything works as expected, that you paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above,
and thereby 


## Load NumPy, pandas and time

In [1]:
import numpy as np
import pandas as pd
import time

## Reused functions from Assignment 1

In [2]:
# Copy and paste functions from Assignment 1 here that you need for this assignment

# Imputation functions ----------------------------------------------------------------------
def create_imputation(df):
    df_tmp = df.copy()
    
    df_tmp_num = df_tmp.drop(["CLASS","ID"], axis=1, errors="ignore").select_dtypes(include=["int","float"]) # Select only numeric columns
    df_mean = pd.DataFrame(df_tmp_num.mean())
    df_tmp_num[(df_tmp_num.isna().all()).index[df_tmp_num.isna().all()]] = 0 # When all values are missing in a numeric column, all values are replaced by 0

    df_tmp_obj = df_tmp.drop(["CLASS","ID"], axis=1, errors="ignore").select_dtypes(include=["object","category"]) # Select only object and categorical columns
    df_mode = df_tmp_obj.mode().transpose()
    df_tmp_obj[(df_tmp_obj.isna().all()).index[df_tmp_obj.isna().all()]] = "" # When all values are missing in an object or categorical column, all values are replaced by ""

    imputation = ((pd.concat([df_mean[0],df_mode[0]], axis=0)).fillna(0)).to_dict() # Create a mapping from column name to new value
    
    df_out = apply_imputation(df, imputation) # Apply imputation
    return df_out,imputation

def apply_imputation(df, imputation):
    df_out = df.copy()
    df_out.fillna(value=imputation, inplace=True) # Replace NaN values according to the imputation dictionary
    return df_out
# End of Imputation functions ---------------------------------------------------------------


# Normalization functions -------------------------------------------------------------------
def create_normalization(df,normalizationtype = "minmax"):
    df_tmp = df.select_dtypes(include=["int","float"]).drop(["ID", "CLASS"], axis=1, errors="ignore") # Consider only columns of type float or int and drop the ID and CLASS columns (if they are among them)

    if normalizationtype == "minmax":
        normalization = {col:(normalizationtype,df_tmp[col].min(),df_tmp[col].max()) for col in df_tmp.columns} # Mapping from each column to a tuple("minmax",min,max)
    elif (normalizationtype == "zscore"):
        normalization = {col:(normalizationtype,df_tmp[col].mean(),df_tmp[col].std()) for col in df_tmp.columns} # Mapping from each column to a tuple("zscore",mean,std)
    
    df_out = apply_normalization(df,normalization) # Apply normalization
    
    return df_out, normalization

def apply_normalization(df,normalization):
    df_out = df.copy()
    for col in normalization:
        normalizationtype = normalization[col][0] # Get normalization type
        arg1 = normalization[col][1] # Get first argument (min or mean)
        arg2 = normalization[col][2] # Get second argument (max or std)
        if (normalizationtype == "minmax"):
            df_out[col] = (df_out[col]-arg1)/(arg2-arg1) # (COL - min(COL))/(max(COL)-min(COL))
            df_out[col] = df_out[col].clip(0,1) # Limit the output range to [0,1]
        elif (normalizationtype == "zscore"):
            df_out[col] = (df_out[col]-arg1)/arg2 # (COL - mean)/std
    return df_out
# End of Normalization functions ------------------------------------------------------------


# One hot functions -------------------------------------------------------------------------
def create_one_hot(df):
    df_tmp = df.drop(["CLASS","ID"], axis=1, errors="ignore").select_dtypes(include=["object","category"]) # Copy, drop and select from input dataframe
    df_tmp = df_tmp.astype("category") # Change to categorical (b,o,x)
    one_hot = {col:tuple(df_tmp[col].cat.categories) for col in df_tmp} # Mapping from column name to a set of categories
    df_out = apply_one_hot(df,one_hot) # Apply one hot according to the input dictionary
    return df_out,one_hot

def apply_one_hot(df,one_hot):
    df_out = df.copy() # Copy dataframe
    for col in one_hot: # For each column
        for cat in one_hot[col]: # For each categorical value      
            df_out[col+"-"+str(cat)] = (df[col] == str(cat)).astype("float") # Fill new column "column - cat"
        df_out.drop(columns=col, inplace=True) # Drop analyzed column
    return df_out
# End of One hot functions ------------------------------------------------------------------


# Brier score function ----------------------------------------------------------------------
def brier_score(df,correctlabels):
    cat = df.columns # Get categories (columns)
    target = [(lab == cat).astype(float) for lab in correctlabels] # Get a matrix of target values from a list of correct labels
    df_target = pd.DataFrame(target,columns=cat) # Matrix to dataframe
    return ((((df - df_target).pow(2)).sum(axis=1)).sum())/len(df) # Calculate brier score from dataframes
# End of brier score function ----------------------------------------------------------------


# Accuracy function -------------------------------------------------------------------------
def accuracy(df,correctlabels):
    predicted = df.idxmax(axis=1, skipna=True) # Index (column name) of first occurrence of maximum value
    acc = sum(predicted == correctlabels)/len(correctlabels)
    return acc
# End of Accuracy function ------------------------------------------------------------------


# AUC function ------------------------------------------------------------------------------
def auc(df,correctlabels):
    cat = df.columns
    target = [(lab == cat).astype(float) for lab in correctlabels]
    df_target = pd.DataFrame(target, columns=cat)

    AUC = 0
    for col in df.columns:                       # For each class create a dictionary with a mapping from each score score:[num_pos,num_neg]
        dic = {score:[0,0] for score in df[col]} # Initialize dictionary

        for n in range(len(df[col])):            # Check for each row of the dataframe and update the dictionary
            score = df[col][n]
            if df_target[col][n] == 1:           # Check if positive
                dic[score] = [dic[score][0]+1, dic[score][1]] # Increment positive instances
            else:                                # Negative
                dic[score] = [dic[score][0], dic[score][1]+1] # Increment negative instances

        dic_sorted = {}
        sorted_list = np.array([dic[key] for key in sorted(dic, reverse=True)]) # Create a reversely sorted list of positive and negative instances
        
        tp = sorted_list[:,0] # Get true positives from the list
        fp = sorted_list[:,1] # Get false positives from the list

        AUC_tmp = 0 # Initialize AUC
        cov_tp = 0 # Initialize cov_tp
        tot_tp = sum(tp) # Get total number of TP
        tot_fp = sum(fp) # Get total number of FP
        for i in range(len(tp)):
            if fp[i] == 0:
                cov_tp += tp[i]
            elif tp[i] == 0:
                AUC_tmp += (cov_tp/tot_tp)*(fp[i]/tot_fp) # Update AUC_tmp
            else:
                AUC_tmp += (cov_tp/tot_tp)*(fp[i]/tot_fp)+(tp[i]/tot_tp)*(fp[i]/tot_fp)/2 # Update AUC_tmp
                cov_tp += tp[i]

        #print('AUC_tmp: '+str(AUC_tmp))
        AUC += sum(df_target[col])/len(df_target)*AUC_tmp # Add weighted AUC_tmp to AUC 
    return AUC
# End of AUC function -----------------------------------------------------------------------


# Binning functions -------------------------------------------------------------------------
def create_bins(df, nobins, bintype="equal-width"):
    df_tmp = df.drop(["CLASS","ID"], axis=1, errors="ignore").select_dtypes(include=["int","float"])

    binning = {}
    df_out = pd.DataFrame()
    for col in df_tmp.columns:
        if (bintype == "equal-width"):
            res_tmp, bins_tmp = pd.cut(df_tmp[col], nobins, retbins=True, labels=False, duplicates="drop")
        elif (bintype == "equal-size"):
            res_tmp, bins_tmp = pd.qcut(df_tmp[col], nobins, retbins=True, labels=False, duplicates="drop")
            
        bins_tmp[0] = -np.inf # Set the first element of the binning list to -np.inf
        bins_tmp[-1] = np.inf # Set the last element of the binning list to np.inf
        
        binning[col] = tuple(bins_tmp)
        
    df_out = apply_bins(df,binning) # Apply binning
    return df_out,binning

def apply_bins(df,binning):
    df_out = df.copy()
    for col in binning:
        df_out[col] = pd.cut(df_out[col],binning[col],labels=False, duplicates="drop").astype("category") # Replace each value by its bin index
    return df_out
# End of Binning functions ------------------------------------------------------------------
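
The create/apply pairs above all follow the same pattern: a mapping is learned from the training data and then reused unchanged on new data. A minimal sketch of that pattern for min-max normalization follows; the toy frames and the column name `x` are assumptions for illustration only and are not part of the assignment data.

```python
import pandas as pd

# Hypothetical toy data; the column name "x" is an assumption for illustration.
train = pd.DataFrame({"x": [2.0, 4.0, 6.0]})
test = pd.DataFrame({"x": [1.0, 5.0, 8.0]})

# Learn the mapping (min and max) on the training data only.
col_min, col_max = train["x"].min(), train["x"].max()

# Apply the same mapping to the test data; values outside the
# training range end up outside [0, 1] and are therefore clipped.
test_scaled = ((test["x"] - col_min) / (col_max - col_min)).clip(0, 1)
print(test_scaled.tolist())
```

The test values 1.0 and 8.0 lie outside the training range [2.0, 6.0], which is why the clipping step matters when the mapping is applied to unseen data.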

## 1. Define the class kNN

In [36]:
# Define the class kNN with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self: the object itself
#
# Output from __init__:
# nothing
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# imputation, normalization, one_hot, labels, training_labels, training_data
#
# Input to fit:
# self: the object itself
# df: a dataframe (where the column names "CLASS" and "ID" have special meaning)
# normalizationtype: "minmax" (default) or "zscore"
#
# Output from fit:
# nothing
#
# The result of applying this function should be:
#
# self.imputation should be an imputation mapping (see Assignment 1) from df
# self.normalization should be a normalization mapping (see Assignment 1), using normalizationtype from the imputed df
# self.one_hot should be a one-hot mapping (see Assignment 1; can be excluded if this function was not completed)
# self.training_labels should be a pandas series corresponding to the "CLASS" column, set to be of type "category" 
# self.labels should be the categories of the previous series
# self.training_data should be the values (an ndarray) of the transformed dataframe, i.e., after employing imputation, 
# normalization, and possibly one-hot encoding, and also after removing the "CLASS" and "ID" columns 
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Input to predict:
# self: the object itself
# df: a dataframe
# k: an integer >= 1 (default = 5)
# 
# Output from predict:
# predictions: a dataframe with class labels as column names and the rows corresponding to
#              predictions with estimated class probabilities for each row in df, where the class probabilities
#              are estimated by the relative class frequencies in the set of class labels from the k nearest 
#              (with respect to Euclidean distance) neighbors in training_data
#
# Hint 1: Drop any "CLASS" and "ID" columns first and then apply imputation, normalization and (possibly) one-hot
# Hint 2: Get the numerical values (as an ndarray) from the resulting dataframe and iterate over the rows 
#         calling some sub-function, e.g., get_nearest_neighbor_predictions(x_test,k), which for a test row
#         (numerical input feature values) finds the k nearest neighbors and calculate the class probabilities.
# Hint 3: This sub-function may first find the distances to all training instances, e.g., pairs consisting of
#         training instance index and distance, and then sort them according to distance, and then (using the indexes
#         of the k closest instances) find the corresponding labels and calculate the relative class frequencies
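
The sub-function described in Hints 2 and 3 could look roughly like the sketch below for a single test row. It is not the solution used further down; the function name `nearest_neighbor_probabilities` and the toy arrays are hypothetical, and the training data is assumed to be a plain NumPy array.

```python
import numpy as np
import pandas as pd

def nearest_neighbor_probabilities(x_test, training_data, training_labels, labels, k):
    """Relative class frequencies among the k training rows closest to x_test."""
    # Euclidean distance from the test row to every training row
    distances = np.sqrt(((training_data - x_test) ** 2).sum(axis=1))
    # Indexes of the k smallest distances (stable sort keeps ties in row order)
    nearest = np.argsort(distances, kind="stable")[:k]
    # Count the labels of those neighbors, including classes with zero votes
    counts = pd.Series(training_labels[nearest]).value_counts()
    return counts.reindex(labels, fill_value=0).to_numpy() / k

# Hypothetical toy training set with two classes
training_data = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
training_labels = np.array(["a", "a", "b", "b"])
labels = ["a", "b"]

probs = nearest_neighbor_probabilities(np.array([0.2, 0.1]), training_data,
                                       training_labels, labels, k=3)
print(probs)
```

With k=3 the two closest rows carry label "a" and the third carries label "b", so the estimated probabilities are 2/3 and 1/3.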


In [3]:
class kNN:
    def __init__(self):
        self.imputation = None       # Initialize attribute imputation
        self.normalization = None    # Initialize attribute normalization
        self.one_hot = None          # Initialize attribute one_hot (one-hot encoding is not applied in fit or predict)
        self.labels = None           # Initialize attribute labels
        self.training_labels = None  # Initialize attribute training_labels
        self.training_data = None    # Initialize attribute training_data
        
    def fit(self, df, normalizationtype="minmax"):
        self.training_data, self.imputation = create_imputation(df) # Get imputation mapping
        self.training_data, self.normalization = create_normalization(self.training_data,normalizationtype) # Get normalization mapping
        self.training_labels = df["CLASS"].astype("category") # "CLASS" column as pandas series, set to be of type "category" 
        self.labels = self.training_labels.cat.categories # Categories of the "CLASS" series
        self.training_data = self.training_data.drop(columns=["CLASS","ID"], errors="ignore") # Remove CLASS and ID columns (kept as a dataframe)
         
    def predict(self, df, k=5):
                
        def get_nearest_neighbor_predictions(x_test, k):
            # 1. Compute the distances to all the training instances
            d = [] # One list per training instance, holding the distances to every test instance
            for i in range(len(self.training_data)):
                d.append([np.sqrt(sum((x_test.iloc[n] - self.training_data.iloc[i])**2)) for n in range(len(x_test))]) # Compute Euclidean distances and append them to d
            d = np.asarray(d) # List of lists to ndarray of shape (no. training instances, no. test instances)
            
            # 2. Get the k nearest neighbors
            k_nearest_idx = np.argsort(d,axis=0)[:k,:] # Positions of the k nearest training instances for each test instance
            
            # 3. Calculate class probabilities
            predictions = pd.DataFrame(0,index=np.arange(len(x_test)), columns=self.labels)
            for i in range(k_nearest_idx.shape[1]): # For each test instance
                for neigh in range(k_nearest_idx.shape[0]): # Equals range(k)
                    label = self.training_labels.iloc[k_nearest_idx[neigh,i]] # Label of the neighbor (looked up by position)
                    predictions.loc[i,label] += 1 # Update frequency
            return predictions/k # Divide by k to get the class probabilities and return the result
        
        
        df_tmp = df.drop(columns=["CLASS","ID"], errors="ignore") # Remove CLASS and ID columns if present
        df_tmp = apply_imputation(df_tmp, self.imputation) # Apply imputation
        df_tmp = apply_normalization(df_tmp, self.normalization) # Apply normalization
        
        return get_nearest_neighbor_predictions(df_tmp, k) # Predict

In [4]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

knn_model = kNN()

t0 = time.perf_counter()
knn_model.fit(glass_train_df)
print("Training time: {0:.2f} s.".format(time.perf_counter()-t0))

test_labels = glass_test_df["CLASS"]

k_values = [1,3,5,7,9]
results = np.empty((len(k_values),3))

for i in range(len(k_values)):
    t0 = time.perf_counter()
    predictions = knn_model.predict(glass_test_df,k=k_values[i])
    print("Testing time (k={0}): {1:.2f} s.".format(k_values[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=k_values,columns=["Accuracy","Brier score","AUC"])

results


Training time: 0.10 s.
Testing time (k=1): 5.89 s.
Testing time (k=3): 5.89 s.
Testing time (k=5): 5.86 s.
Testing time (k=7): 5.92 s.
Testing time (k=9): 5.92 s.


| k | Accuracy | Brier score | AUC |
|---|----------|-------------|-----|
| 1 | 0.747664 | 0.504673 | 0.81035 |
| 3 | 0.663551 | 0.488058 | 0.815859 |
| 5 | 0.579439 | 0.471028 | 0.833843 |
| 7 | 0.598131 | 0.471867 | 0.833481 |
| 9 | 0.616822 | 0.482981 | 0.827727 |
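
To make the Accuracy and Brier score columns easier to interpret, here is a small worked example on made-up predictions. The two-class frame and the labels are hypothetical; the formulas are the same as in `accuracy` and `brier_score` from Assignment 1.

```python
import pandas as pd

# Hypothetical class probabilities for two test instances and two classes
predictions = pd.DataFrame({"a": [0.8, 0.4], "b": [0.2, 0.6]})
correct = ["a", "a"]

# One-hot target matrix: 1.0 in the column of the correct label, 0.0 elsewhere
target = pd.DataFrame(
    [[float(lab == c) for c in predictions.columns] for lab in correct],
    columns=predictions.columns,
)

# Brier score: squared error summed over classes, averaged over instances
brier = ((predictions - target) ** 2).sum(axis=1).mean()

# Accuracy: fraction of instances whose most probable class is the correct one
acc = (predictions.idxmax(axis=1) == pd.Series(correct)).mean()
print(brier, acc)
```

The first row contributes 0.04 + 0.04 = 0.08 and the second 0.36 + 0.36 = 0.72, so the Brier score is about 0.4. Only the first row is classified correctly, which gives an accuracy of 0.5. A lower Brier score is better, whereas a higher accuracy is better.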


In [5]:
train_labels = glass_train_df["CLASS"]
predictions = knn_model.predict(glass_train_df,k=1)
print("Accuracy on training set (k=1): {0:.2f}".format(accuracy(predictions,train_labels)))
print("AUC on training set (k=1): {0:.2f}".format(auc(predictions,train_labels)))
print("Brier score on training set (k=1): {0:.2f}".format(brier_score(predictions,train_labels)))


Accuracy on training set (k=1): 1.00
AUC on training set (k=1): 1.00
Brier score on training set (k=1): 0.00
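
A perfect score on the training set with k=1 is what one would expect whenever no two training instances share identical features but different labels, since every training row is at distance 0 from itself. The sketch below illustrates this on hypothetical toy features; it also computes all pairwise distances with NumPy broadcasting instead of nested Python loops, which is offered as a possible alternative and is not the method used in the class above.

```python
import numpy as np

# Hypothetical toy feature matrix (3 instances, 2 features)
X = np.array([[0.0, 1.0], [2.0, 0.5], [1.0, 3.0]])

# Pairwise Euclidean distances via broadcasting: result has shape (3, 3)
diff = X[:, None, :] - X[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

# For each row, the position of its nearest neighbor among all rows
nearest = dist.argmin(axis=1)
print(nearest)
```

Each row's nearest neighbor is the row itself, so a 1-nearest-neighbor prediction on the training set simply returns the stored label.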


### Comment on assumptions, things that do not work properly, etc.

We assume that k is greater than or equal to 1.

## 2. Define the class NaiveBayes

In [38]:
# Define the class NaiveBayes with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self: the object itself
#
# Output from __init__:
# nothing
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# binning, class_priors, feature_class_value_counts, feature_class_counts
#
# Input to fit:
# self: the object itself
# df: a dataframe (where the column names "CLASS" and "ID" have special meaning)
# nobins: no. of bins (default = 10)
# bintype: either "equal-width" (default) or "equal-size" 
#
# Output from fit:
# nothing
#
# The result of applying this function should be:
#
# self.binning should be a discretization mapping (see Assignment 1) from df
# self.class_priors should be a mapping (dictionary) from the labels (categories) of the "CLASS" column of df,
# to the relative frequencies of the labels
# self.feature_class_value_counts should be a mapping from a feature (column name) to another mapping, which
# given a feature value and class label provides the number of training instances with this specific combination
# self.feature_class_counts should be a mapping from the feature (column name) and class label to the number of
# training instances with this specific class label and any (non-missing) value for the feature
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Hint 1: feature_class_value_counts can be a dictionary, which given a feature f returns a mapping obtained 
#         by pandas groupby and size (see lecture slides), which given a feature value v and class label c 
#         returns the number of instances, e.g., using get((c,v),0)
#
# Input to predict:
# self: the object itself
# df: a dataframe
# 
# Output from predict:
# predictions: a dataframe with class labels as column names and the rows corresponding to
# predictions with estimated class probabilities for each row in df, where the class probabilities
# are estimated by the naive approximation of Bayes rule (see lecture slides)
#
# Hint 1: First apply discretization
# Hint 2: Iterating over either columns or rows, and for each possible class label, calculate the relative
#         frequency of the observed feature value given the class (using feature_class_value_counts and 
#         feature_class_counts) 
# Hint 3: Calculate the non-normalized estimated class probabilities by multiplying the class priors to the
#         product of the relative frequencies
# Hint 4: Normalize the probabilities by dividing by the sum of the non-normalized probabilities; in case
#         this sum is zero, then set the probabilities to the class priors
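
The counting scheme from Hint 1 can be sketched on a toy frame as follows. The feature name `f` and the data are hypothetical; the group sizes are converted to plain dictionaries so that unseen combinations can be looked up with a default of 0.

```python
import pandas as pd

# Hypothetical discretized feature "f" and class labels
df = pd.DataFrame({
    "f": [0, 0, 1, 1, 1],
    "CLASS": ["a", "a", "a", "b", "b"],
})

# Number of instances for every observed (class, value) combination
value_counts = df.groupby(["CLASS", "f"]).size().to_dict()

# Number of instances per class with a non-missing value for the feature
class_counts = df.groupby("CLASS")["f"].count().to_dict()

seen = value_counts.get(("a", 0), 0)      # combination present in the data
unseen = value_counts.get(("b", 0), 0)    # combination never observed
rel_freq = seen / class_counts["a"]       # estimate of P(f = 0 | CLASS = a)
print(seen, unseen, rel_freq)
```

Here class "a" has three instances, two of which have value 0, so the relative frequency is 2/3, while the unseen combination ("b", 0) falls back to a count of 0.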


In [80]:
class NaiveBayes:
    def __init__(self):
        self.binning = None
        self.class_priors = None
        self.feature_class_value_counts = None
        self.feature_class_counts = None
    
    def fit(self,df,nobins=10,bintype="equal-width"):
        
        df_tmp,self.binning = create_bins(df, nobins, bintype) # Get a discretization mapping
        self.class_priors = dict(df_tmp["CLASS"].value_counts(normalize=True)) # Mapping (dictionary) from the labels of "CLASS" to the relative frequencies of the labels
        features = df_tmp.columns.drop(["CLASS","ID"], errors="ignore") # Feature columns
        self.feature_class_value_counts = {col:{(n[0],n[1]):len(g)
                                                for (n,g) in df_tmp.groupby(["CLASS",col])}
                                                for col in features} # Mapping from a feature (column name) to another mapping, which given a (class label, feature value) pair provides the number of training instances with this specific combination
        self.feature_class_counts = {col:df_tmp[col].notna().groupby(df_tmp["CLASS"]).sum().to_dict()
                                     for col in features} # Mapping from the feature (column name) and class label to the number of training instances with this class label and a non-missing value for the feature
    
    def predict(self,df):
        # 1. Apply discretization
        df_tmp = apply_bins(df,self.binning)

        # 2. Calculate the relative frequency of each feature value given the class (using feature_class_value_counts and feature_class_counts).
        #    New dictionaries are built so that the counts stored by fit are left untouched and predict can be called repeatedly
        relative_freq = {feat:{class_value:count/self.feature_class_counts[feat][class_value[0]]
                               for (class_value,count) in value_counts.items()}
                         for (feat,value_counts) in self.feature_class_value_counts.items()}
        
        # 3. Calculate the non-normalized estimated class probabilities by multiplying the class priors with the
        #    product of the relative frequencies
        classes = list(self.class_priors.keys())
        non_normalized_estimated_class_prob = pd.DataFrame(1.,index=np.arange(len(df_tmp)), columns=classes)
        for i in range(df_tmp.shape[0]):
            for feat in relative_freq:
                value = df_tmp[feat].iloc[i]
                for c in classes:
                    non_normalized_estimated_class_prob.loc[i,c] *= relative_freq[feat].get((c,value),0)
            non_normalized_estimated_class_prob.iloc[i] *= pd.Series(self.class_priors)
                
        # 4. Normalize the probabilities by dividing by the sum of the non-normalized probabilities.
        #    Rows whose sum is zero give NaN after the division and are set to 0 here, whereas Hint 4 asks for the class priors in that case
        normalized_estimated_class_prob = non_normalized_estimated_class_prob.div(non_normalized_estimated_class_prob.sum(axis=1), axis=0).fillna(0)
                                                                                                            
        return normalized_estimated_class_prob
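
The fallback described in Hint 4 (use the class priors when the non-normalized probabilities of a row sum to zero) could be sketched as follows. This is only an illustration on hypothetical scores and priors, not the code of the class above.

```python
import numpy as np
import pandas as pd

# Hypothetical class priors and non-normalized scores;
# the second row has no support in any class.
priors = pd.Series({"a": 0.75, "b": 0.25})
scores = pd.DataFrame({"a": [0.03, 0.0], "b": [0.01, 0.0]})

row_sums = scores.sum(axis=1)
probs = scores.div(row_sums, axis=0)       # zero-sum rows become NaN here

# Replace the zero-sum rows by the priors, ordered like the columns
zero_rows = row_sums == 0
probs.loc[zero_rows, :] = priors[scores.columns].to_numpy()
print(probs)
```

In this toy case both rows end up as (0.75, 0.25): the first through normalization of 0.03 and 0.01, the second through the fallback to the priors.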

In [24]:
glass_train_df = pd.read_csv("glass_train.txt")
glass_train_df

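|   | ID | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | CLASS |
|---|----|----|----|----|----|----|---|----|----|----|-------|
| 0 | 202 | 1.51653 | 11.95 | 0.00 | 1.19 | 75.18 | 2.70 | 8.93 | 0.00 | 0.00 | 7 |
| 1 | 124 | 1.51707 | 13.48 | 3.48 | 1.71 | 72.52 | 0.62 | 7.99 | 0.00 | 0.00 | 2 |
| 2 | 152 | 1.52127 | 14.32 | 3.90 | 0.83 | 71.50 | 0.00 | 9.49 | 0.00 | 0.00 | 3 |
| 3 | 197 | 1.51556 | 13.87 | 0.00 | 2.54 | 73.23 | 0.14 | 9.41 | 0.81 | 0.01 | 7 |
| 4 | 144 | 1.51709 | 13.00 | 3.47 | 1.79 | 72.72 | 0.66 | 8.18 | 0.00 | 0.00 | 2 |
| 5 | 99 | 1.51689 | 12.67 | 2.88 | 1.71 | 73.21 | 0.73 | 8.54 | 0.00 | 0.00 | 2 |
| 6 | 207 | 1.51645 | 14.94 | 0.00 | 1.87 | 73.11 | 0.00 | 8.67 | 1.38 | 0.00 | 7 |
| 7 | 36 | 1.51567 | 13.29 | 3.45 | 1.21 | 72.74 | 0.56 | 8.57 | 0.00 | 0.00 | 1 |
| 8 | 50 | 1.51898 | 13.58 | 3.35 | 1.23 | 72.08 | 0.59 | 8.91 | 0.00 | 0.00 | 1 |
| 9 | 117 | 1.51829 | 13.24 | 3.90 | 1.41 | 72.33 | 0.55 | 8.31 | 0.00 | 0.10 | 2 |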


In [86]:
glass_train_df = pd.read_csv("glass_train.txt")
df = glass_train_df.copy()
class_priors = dict(glass_train_df["CLASS"].value_counts())
#print(class_priors)

feature_class_value_counts = {}
for feature in df.drop(columns=["ID","CLASS"]).columns:
    for cla in class_priors.keys():
        series_tmp = df.loc[df["CLASS"]==cla][feature].value_counts()
        key = series_tmp.index
        key = list(zip(2*np.ones(len(key)),key))
        values = series_tmp.values
        feature_class_value_counts[feature] = dict(zip(key,values))
#df.loc[df["CLASS"] == some_value]
#glass_train_df[["RI",""]].value_counts()
print(feature_class_value_counts)

{'RI': {(2.0, 1.51969): 1, (2.0, 1.51916): 1, (2.0, 1.51937): 1, (2.0, 1.51829): 1, (2.0, 1.51888): 1, (2.0, 1.51905): 1}, 'Na': {(2.0, 14.46): 1, (2.0, 14.15): 1, (2.0, 14.99): 1, (2.0, 14.56): 1, (2.0, 13.79): 1, (2.0, 14.0): 1}, 'Mg': {(2.0, 0.0): 2, (2.0, 0.78): 1, (2.0, 2.24): 1, (2.0, 2.39): 1, (2.0, 2.41): 1}, 'Al': {(2.0, 1.56): 1, (2.0, 1.62): 1, (2.0, 1.19): 1, (2.0, 0.56): 1, (2.0, 1.74): 1, (2.0, 2.09): 1}, 'Si': {(2.0, 72.37): 1, (2.0, 72.76): 1, (2.0, 73.48): 1, (2.0, 72.5): 1, (2.0, 72.74): 1, (2.0, 72.38): 1}, 'K': {(2.0, 0.0): 6}, 'Ca': {(2.0, 9.57): 1, (2.0, 9.95): 1, (2.0, 10.88): 1, (2.0, 9.77): 1, (2.0, 9.26): 1, (2.0, 11.22): 1}, 'Ba': {(2.0, 0.0): 6}, 'Fe': {(2.0, 0.0): 6}}


In [79]:
glass_train_df = pd.read_csv("glass_train.txt")
df_tmp = glass_train_df.copy()
nobins = 3
bintype = "equal-width"

df_tmp,binning = create_bins(df_tmp, nobins, bintype) # Get a discretization mapping
class_priors = dict(df_tmp["CLASS"].value_counts()) 
#print(class_priors)
feature_class_value_counts = {col:{(n[0],n[1]):len(g)
             for (n,g) in df_tmp.groupby(["CLASS",col])}
             for col in df_tmp.columns.drop(["CLASS","ID"])}
#print(feature_class_value_counts)
feature_class_counts = {col:{n:len(g) for (n,g) in df_tmp.groupby("CLASS")} for col in df_tmp.columns.drop(["CLASS","ID"])} # Mapping from the feature (column name) and class label to the number of training instances with this specific class label and any (non-missing) value for the feature
#print(feature_class_counts)



glass_test_df = pd.read_csv("glass_test.txt")
df_tmp = glass_test_df.copy()

df_tmp = apply_bins(df_tmp,binning)
relative_freq = feature_class_value_counts.copy()

for feat in relative_freq.keys():
    for class_value in relative_freq[feat].keys():
            relative_freq[feat][class_value] /= feature_class_counts[feat][class_value[0]]
            #break
    #break        

#print(relative_freq["RI"])
#print(feature_class_value_counts["RI"])
non_normalized_estimated_class_prob = pd.DataFrame(1.,index=np.arange(len(df_tmp)), columns=class_priors.keys()).fillna(1.)
print(non_normalized_estimated_class_prob)
for i in range(df_tmp.shape[0]):
    for feat in df_tmp.columns.drop(["CLASS","ID"]):
        value = df_tmp[feat].iloc[i]
        for c in class_priors.keys():
            non_normalized_estimated_class_prob[c][i] *= relative_freq[feat].get((c,value),0)
            print(relative_freq[feat].get((c,value),1.))
            
        break
    print(non_normalized_estimated_class_prob)
    print(pd.Series(class_priors))
    non_normalized_estimated_class_prob.iloc[i] *= pd.Series(class_priors)
    print(non_normalized_estimated_class_prob)
    break
#print(non_normalized_estimated_class_prob)        
normalized_estimated_class_prob = non_normalized_estimated_class_prob.div(non_normalized_estimated_class_prob.sum(axis=1), axis=0).fillna(0)
#print(normalized_estimated_class_prob)

predictions = normalized_estimated_class_prob

            
#print(predictions)

       2    1    7    5    3    6
0    1.0  1.0  1.0  1.0  1.0  1.0
1    1.0  1.0  1.0  1.0  1.0  1.0
2    1.0  1.0  1.0  1.0  1.0  1.0
3    1.0  1.0  1.0  1.0  1.0  1.0
4    1.0  1.0  1.0  1.0  1.0  1.0
5    1.0  1.0  1.0  1.0  1.0  1.0
6    1.0  1.0  1.0  1.0  1.0  1.0
7    1.0  1.0  1.0  1.0  1.0  1.0
8    1.0  1.0  1.0  1.0  1.0  1.0
9    1.0  1.0  1.0  1.0  1.0  1.0
10   1.0  1.0  1.0  1.0  1.0  1.0
11   1.0  1.0  1.0  1.0  1.0  1.0
12   1.0  1.0  1.0  1.0  1.0  1.0
13   1.0  1.0  1.0  1.0  1.0  1.0
14   1.0  1.0  1.0  1.0  1.0  1.0
15   1.0  1.0  1.0  1.0  1.0  1.0
16   1.0  1.0  1.0  1.0  1.0  1.0
17   1.0  1.0  1.0  1.0  1.0  1.0
18   1.0  1.0  1.0  1.0  1.0  1.0
19   1.0  1.0  1.0  1.0  1.0  1.0
20   1.0  1.0  1.0  1.0  1.0  1.0
21   1.0  1.0  1.0  1.0  1.0  1.0
22   1.0  1.0  1.0  1.0  1.0  1.0
23   1.0  1.0  1.0  1.0  1.0  1.0
24   1.0  1.0  1.0  1.0  1.0  1.0
25   1.0  1.0  1.0  1.0  1.0  1.0
26   1.0  1.0  1.0  1.0  1.0  1.0
27   1.0  1.0  1.0  1.0  1.0  1.0
28   1.0  1.0 

In [10]:
glass_train_df = pd.read_csv("glass_train.txt")
df_tmp = glass_train_df.copy()
nobins = 3
bintype = "equal-width"

df_tmp,binning = create_bins(df_tmp, nobins, bintype) # Get a discretization mapping
class_priors = dict(df_tmp["CLASS"].value_counts()) 
#print(class_priors)
feature_class_value_counts = {col:{(n[0],n[1]):len(g)
             for (n,g) in df_tmp.groupby(["CLASS",col])}
             for col in df_tmp.columns.drop(["CLASS","ID"])}
#print(feature_class_value_counts)
feature_class_counts = {col:{n:len(g) for (n,g) in df_tmp.groupby("CLASS")} for col in df_tmp.columns.drop(["CLASS","ID"])} # Mapping from the feature (column name) and class label to the number of training instances with this specific class label and any (non-missing) value for the feature
#print(feature_class_counts)



glass_test_df = pd.read_csv("glass_test.txt")
df_tmp = glass_test_df.copy()

df_tmp = apply_bins(df_tmp,binning)
relative_freq = feature_class_value_counts.copy()

for feat in relative_freq.keys():
    for class_value in relative_freq[feat].keys():
            relative_freq[feat][class_value] /= feature_class_counts[feat][class_value[0]]
            #break
    #break        

#print(relative_freq["RI"])
#print(feature_class_value_counts["RI"])
non_normalized_estimated_class_prob = pd.DataFrame(1.,index=np.arange(len(df_tmp)), columns=class_priors.keys()).fillna(1.)
#print(non_normalized_estimated_class_prob)
for i in range(df_tmp.shape[0]):
    for feat in df_tmp.columns.drop(["CLASS","ID"]):
        value = df_tmp[feat].iloc[i]
        for c in class_priors.keys():
            non_normalized_estimated_class_prob[c][i] *= relative_freq[feat].get((c,value),0)
    non_normalized_estimated_class_prob.iloc[i] *= pd.Series(class_priors)

#print(non_normalized_estimated_class_prob)
normalized_estimated_class_prob.loc[normalized_estimated_class_prob.sum(axis=1)==0] = pd.Series(class_priors)
normalized_estimated_class_prob = non_normalized_estimated_class_prob.div(non_normalized_estimated_class_prob.sum(axis=1), axis=0).fillna(0)
#print(normalized_estimated_class_prob)

predictions = normalized_estimated_class_prob

            
#print(predictions)

In [19]:
normalized_estimated_class_prob.index[normalized_estimated_class_prob.sum(axis=1)==0].tolist()

[20]

In [30]:
#pd.DataFrame(class_priors.values(),columns=class_priors.keys())
pd.Series(class_priors)

2    34
1    31
7    20
5     8
3     8
6     6
dtype: int64
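
The values printed above are absolute class counts (107 instances in total), whereas the specification of `class_priors` asks for relative frequencies. A minimal sketch of the conversion, using the counts shown above, could be:

```python
import pandas as pd

# Class counts copied from the output above
class_counts = pd.Series({2: 34, 1: 31, 7: 20, 5: 8, 3: 8, 6: 6})

# Relative frequencies: each count divided by the total number of instances
class_priors = (class_counts / class_counts.sum()).to_dict()
print(class_priors)
```

pandas can also produce the relative frequencies directly with `value_counts(normalize=True)`. Since the probabilities are normalized at the end, using counts instead of relative frequencies only matters when the priors themselves are returned as a fallback.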

In [81]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

nb_model = NaiveBayes()

test_labels = glass_test_df["CLASS"]

nobins_values = [3,5,10]
bintype_values = ["equal-width","equal-size"]
parameters = [(nobins,bintype) for nobins in nobins_values for bintype in bintype_values]

results = np.empty((len(parameters),3))

for i in range(len(parameters)):
    t0 = time.perf_counter()
    nb_model.fit(glass_train_df,nobins=parameters[i][0],bintype=parameters[i][1])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = nb_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=pd.MultiIndex.from_product([nobins_values,bintype_values]),
                       columns=["Accuracy","Brier score","AUC"])

results


Training time (3, 'equal-width'): 0.12 s.
Testing time (3, 'equal-width'): 0.80 s.
Training time (3, 'equal-size'): 0.12 s.
Testing time (3, 'equal-size'): 0.70 s.
Training time (5, 'equal-width'): 0.10 s.
Testing time (5, 'equal-width'): 0.68 s.
Training time (5, 'equal-size'): 0.10 s.
Testing time (5, 'equal-size'): 0.68 s.
Training time (10, 'equal-width'): 0.09 s.
Testing time (10, 'equal-width'): 0.64 s.
Training time (10, 'equal-size'): 0.10 s.
Testing time (10, 'equal-size'): 0.66 s.


| nobins | bintype | Accuracy | Brier score | AUC |
|--------|---------|----------|-------------|-----|
| 3 | equal-width | 0.616822 | 0.621325 | 0.72356 |
| 3 | equal-size | 0.607477 | 0.554782 | 0.780163 |
| 5 | equal-width | 0.64486 | 0.559282 | 0.750656 |
| 5 | equal-size | 0.598131 | 0.581556 | 0.796675 |
| 10 | equal-width | 0.654206 | 0.57027 | 0.747255 |
| 10 | equal-size | 0.588785 | 0.743837 | 0.746409 |
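
The two bin types compared above can behave quite differently on skewed features. The sketch below contrasts `pd.cut` (equal-width) with `pd.qcut` (equal-size) on a hypothetical feature with one outlier; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical skewed feature: seven small values and one outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-width: the value range is split into intervals of the same length
width_bins = pd.cut(values, 2, labels=False)

# Equal-size: the intervals are chosen so that each holds about as many values
size_bins = pd.qcut(values, 2, labels=False)

width_counts = width_bins.value_counts().sort_index().tolist()
size_counts = size_bins.value_counts().sort_index().tolist()
print(width_counts, size_counts)
```

With equal-width binning the outlier gets a bin of its own and the remaining seven values share the other, while equal-size binning puts four values in each bin. This may be one reason why the two settings give different results, although the glass features were not inspected here.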


In [82]:
train_labels = glass_train_df["CLASS"]
nb_model.fit(glass_train_df)
predictions = nb_model.predict(glass_train_df)
print("Accuracy on training set: {0:.2f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.2f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.2f}".format(brier_score(predictions,train_labels)))

Accuracy on training set: 0.85
AUC on training set: 0.97
Brier score on training set: 0.23


### Comment on assumptions, things that do not work properly, etc.