# Project 1: What is labelled data worth to Naive Bayes?
---

## Initialisation

In [1]:
# Library
import pandas as pd
import numpy as np
import random
from IPython.display import display

In [33]:
# Data Path Constant
BREAST_CANCER = "2018S1-proj1_data/breast-cancer-dos.csv"
CAR = "2018S1-proj1_data/car-dos.csv"
HYPOTHYROID = "2018S1-proj1_data/hypothyroid-dos.csv"
MUSHROOM = "2018S1-proj1_data/mushroom-dos.csv"

# Column name for each data set
BREAST_CANCER_COLUMN = ["age", "menopause", "tumor-size", "inv-nodes", "node-caps", "deg-malig", "breast", "breast-quad", "irradiat", "class"]
CAR_COLUMN = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
HYPOTHYROID_COLUMN = ["sex", "on_thyroxine", "query_on_thyroxine", "on_antithyroid_medication", "thyroid_surgery", "query_hypothyroid", "query_hyperthyroid", "pregnant", "sick", "tumor", "lithium", "goitre", "TSH_measured", "T3_measured", "TT4_measured", "T4U_measured", "FTI_measured", "TBG_measured", "class"]
MUSHROOM_COLUMN = ["cap-shape", "cap-surface", "cap-color", "bruises", "odor", "gill-attachment", "gill-spacing", "gill-size", "gill-color", "stalk-shape", "stalk-root", "stalk-surface-above-ring", "stalk-surface-below-ring", "stalk-color-above-ring", "stalk-color-below-ring", "veil-type", "veil-color", "ring-number", "ring-type", "spore-print-color", "population", "habitat", "class"]

# Other Constant
PRIOR_INDEX = 0
POSTERIOR_INDEX = 1
EPSILON = 0.000001 # Epsilon smoothing
ITERATION = 10 # Number of iterations in unsupervised naive bayes

## Preprocess

In [4]:
'''
This function should open a data file in csv, and transform it into a usable format 
@param data = path of the csv file that will be opened
@param columns = column names to use as the header
@param eliminate = drop instances with missing ("?") values (recommended if there are only a few such instances)
@return df = clean pandas dataframe object
'''
def preprocess(data, columns, eliminate=True):
    # Read and add a header to the data frame
    df = pd.read_csv(data, header=None)
    df.columns = columns
    
    # If eliminate is True, drop every row that contains a missing value ("?")
    if (eliminate):
        # Build a boolean mask of the rows with at least one "?" and keep the others
        has_missing = (df == "?").any(axis=1)
        df = df[~has_missing].copy()
    
    # Return the clean data
    return df

## Train Supervised

In [5]:
'''
This function should build a supervised NB model and return the raw counts
@param train_data = training data used to create the supervised NB classifier
@param class_label = name of the column that holds the class ("class" in this notebook)
@return count_prior = dictionary of class counts in the training data
@return count_posterior = nested dictionary count_posterior[attribute][class][value] of attribute-value counts per class
(these are the counts behind the likelihood P(value | class); this notebook calls them "posterior")
'''
def train_count_supervised(train_data, class_label):
    # Calculate the prior counts (count_prior)
    # Initialise a dictionary with every class in the training data as its keys
    count_prior = {}
    for unique_class in train_data[class_label].unique():
        count_prior[unique_class] = 0

    # Loop through the training data and count the instances of every class.
    # These class counts are later turned into the prior probabilities used for prediction
    for index, row in train_data.iterrows():
        count_prior[row[class_label]] += 1

    # Calculate the conditional counts (count_posterior); the data structure is a
    # dictionary of dictionaries of dictionaries: attribute -> class -> value -> count
    count_posterior = {}
    
    # Setup the dictionary component (initialise)
    column_name = list(train_data.columns)
    column_name.remove(class_label)
    for col in column_name:
        count_posterior[col] = {}
        for unique_class in train_data[class_label].unique():
            count_posterior[col][unique_class] = {}
            for unique_col in train_data[col].unique():
                count_posterior[col][unique_class][unique_col] = 0
    
    # Now use the training data to perform count calculation
    for index, row in train_data.iterrows():
        for col in column_name:
            count_posterior[col][row[class_label]][row[col]] += 1
            
    return((count_prior, count_posterior))

In [6]:
'''
This function should build supervised NB model and return a dictionary of probability
@param train_data = training data that are used to create the supervised NB classifier
@param class_label = column name of the class that we want to classify
@return probability_prior = dictionary describing prior probability of the class in training data,
@return probability_posterior = dictionary of dictionaries posterior probability
'''
def train_probability_supervised(train_data, class_label):
    (count_prior, count_posterior) = train_count_supervised(train_data, class_label)
    
    # Now turn the counts into probabilities, e.g. 'Cough': {'flu': {'yes': 3, 'no': 0}, 'cold': {'yes': 1, 'no': 1}}
    # gives P(cough = yes | flu) = 3/3, P(cough = no | flu) = 0/3, P(cough = yes | cold) = 1/2 and P(cough = no | cold) = 1/2
    # First calculate the prior probability of the class P(c)
    probability_prior = {}
    sum_instance = sum(count_prior.values())
    for unique_class in train_data[class_label].unique():
        probability_prior[unique_class] = count_prior[unique_class] / sum_instance
        
        # Perform epsilon smoothing
        if (count_prior[unique_class] == 0):
            probability_prior[unique_class] = EPSILON
    
    # Calculate the posterior probability
    probability_posterior = count_posterior
    column_name = list(train_data.columns)
    column_name.remove(class_label)
                
    # Now calculate the posterior probability
    for col in column_name:
        for unique_class in train_data[class_label].unique():
            sum_instance = sum(probability_posterior[col][unique_class].values())
            for unique_col in train_data[col].unique():
                probability_posterior[col][unique_class][unique_col] /= sum_instance
                
                # Perform epsilon smoothing
                if (probability_posterior[col][unique_class][unique_col] == 0):
                    probability_posterior[col][unique_class][unique_col] = EPSILON
                
            
    return((probability_prior, probability_posterior))
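
The cough example from the comment above can be worked through on its own. The following is a minimal sketch, not the notebook's function: `to_probabilities` is a hypothetical helper introduced only for illustration, and it assumes the same `EPSILON` value as the constant defined earlier.

```python
EPSILON = 0.000001  # assumed to match the notebook constant

# Counts copied from the comment in train_probability_supervised
counts = {"flu": {"yes": 3, "no": 0}, "cold": {"yes": 1, "no": 1}}

def to_probabilities(class_counts, epsilon=EPSILON):
    """Hypothetical helper: normalise the counts of one class and
    replace zero probabilities with epsilon."""
    total = sum(class_counts.values())
    probabilities = {}
    for value, count in class_counts.items():
        probability = count / total
        probabilities[value] = probability if probability > 0 else epsilon
    return probabilities

cough = {label: to_probabilities(c) for label, c in counts.items()}
print(cough)
# {'flu': {'yes': 1.0, 'no': 1e-06}, 'cold': {'yes': 0.5, 'no': 0.5}}
```

Note that after epsilon smoothing the values for `flu` sum to slightly more than 1; the smoothing only prevents `log(0)` during prediction and does not renormalise.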

## Predict Supervised

In [7]:
'''
This function should predict the class for a set of instances, based on a trained model 
@param test_data = data to be tested
@param train_data = data used for setup such as finding all possible classes
@param class_label = attribute that we want to classify using naive bayes
@param model = tuple consisting of probability_prior and probability_posterior. 
Use the output of train_probability_supervised, not train_count_supervised
@return test_class = list containing the class predicted by the naive bayes classifier for every test instance
'''
def predict_supervised(test_data, train_data, class_label, model):
    prior_probability = model[PRIOR_INDEX]
    posterior_probability = model[POSTERIOR_INDEX]
    test_class = [] # used to capture test result
    
    # Used for calculation purposes
    column_name = list(train_data.columns)
    column_name.remove(class_label)
    
    # Get the answer for every test instance
    for index, row in test_data.iterrows():
        # Initiate dictionary capturing the values calculated by naive bayes model
        test_value = {}
        for unique_class in train_data[class_label].unique():
            test_value[unique_class] = 0

        # Calculate for each class using the naive bayes model (log model for multiplication)
        for unique_class in train_data[class_label].unique():
            test_value[unique_class] = np.log(prior_probability[unique_class])
            for col in column_name:
                test_value[unique_class] += np.log(posterior_probability[col][unique_class][row[col]])
            
        # After calculating all of the possible class, we want to choose the maximum
        maximum_class = (train_data[class_label].unique())[0]
        maximum_value = test_value[maximum_class]
        for key, value in test_value.items():
            if (value > maximum_value):
                maximum_value = value
                maximum_class = key
    
        # Append result
        test_class.append(maximum_class)
    
    # Return the classifier for the class
    return test_class
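
The prediction above sums logarithms instead of multiplying probabilities. A small sketch of why this matters, using made-up likelihood values (400 attributes with probability 0.01 each are an assumption for illustration, not figures from the datasets):

```python
import numpy as np

# Hypothetical likelihoods: 400 attributes, each with probability 0.01
likelihoods = np.full(400, 0.01)

# The true product is 1e-800, far below the float64 range, so it underflows to 0.0
product = np.prod(likelihoods)

# The sum of logs stays finite and can still be compared between classes
log_score = np.sum(np.log(likelihoods))

print(product, log_score)
```

With only a few dozen attributes the plain product would probably still be representable, but the log form stays safe as the number of attributes grows.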

## Evaluate Supervised

In [8]:
'''
This function should evaluate a set of predictions, in a supervised context.
@param true_test_result = list of actual classes
@param predicted_test_result = list of predicted classes
@return accuracy = correct predictions / all predictions, which equals (TP+TN) / (TP+TN+FP+FN) in the two-class case
'''
def evaluate_supervised(true_test_result, predicted_test_result):
    # Fail loudly on mismatched input instead of returning an undefined accuracy
    if (len(true_test_result) != len(predicted_test_result)):
        raise ValueError("Error, different length.")

    # Measure accuracy
    correct = 0
    for i in range(len(true_test_result)):
        if (true_test_result[i] == predicted_test_result[i]):
            correct += 1

    accuracy = correct / len(true_test_result)

    return accuracy
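
The same accuracy can be read off a confusion matrix as the diagonal divided by the total. A short sketch using the supervised breast cancer matrix printed in the Main Program section (the array is typed in by hand here, and the rows = actual, columns = predicted layout follows `confusion_matrix_supervised`):

```python
import numpy as np

# Supervised breast cancer confusion matrix reported in this notebook
# (rows = actual, columns = predicted)
confusion = np.array([[48, 33],
                      [31, 165]])

# Correct predictions sit on the diagonal
accuracy = np.trace(confusion) / confusion.sum()
print(accuracy)  # 213 / 277, about 0.769
```

The diagonal form also works for more than two classes, such as the four-class car dataset.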

In [9]:
'''
Create confusion matrix based on the actual and predicted class
@param true_test_result = array class actual
@param predicted_test_result = array class predicted
@param class_column = all possible classes in the dataset
@return confusion_df = confusion matrix (dataframe)
'''
def confusion_matrix_supervised(true_test_result, predicted_test_result, class_column):
    if (len(true_test_result) != len(predicted_test_result)):
        print("Error, different length.")
    else:
        # Create a pandas dataframe actual is the row, predicted is the column
        confusion_df = pd.DataFrame()
        
        for unique_class in class_column:
            confusion_df[unique_class] = [0 for i in range(len(class_column))]
        
        # Change index for df
        confusion_df.index = class_column
        
        # Calculate the confusion matrix
        for i in range(len(true_test_result)):
            confusion_df.loc[true_test_result[i], predicted_test_result[i]] += 1
            
        # Add actual and predicted description on the table to make it easier to see
        predicted_column = []
        for string in confusion_df.columns:
            string += " predicted"
            predicted_column.append(string.title())
       
        actual_row = []
        for string in class_column:
            string += " actual"
            actual_row.append(string.title())
        
        confusion_df.columns = predicted_column
        confusion_df.index = actual_row
        
        return confusion_df

## Main Program

In [10]:
# Using the breast cancer data
df_breast_cancer = preprocess(BREAST_CANCER, BREAST_CANCER_COLUMN, eliminate=True)
model_main = train_probability_supervised(df_breast_cancer, "class")
predicted_test_result = predict_supervised(df_breast_cancer, df_breast_cancer, "class", model_main)
confusion_df = confusion_matrix_supervised(list(df_breast_cancer["class"]), predicted_test_result, df_breast_cancer["class"].unique())
display(confusion_df)
print("The accuracy for breast cancer dataset is {}.".format(evaluate_supervised(list(df_breast_cancer["class"]), predicted_test_result)))

|  | Recurrence-Events Predicted | No-Recurrence-Events Predicted |
|---|---|---|
| Recurrence-Events Actual | 48 | 33 |
| No-Recurrence-Events Actual | 31 | 165 |


The accuracy for breast cancer dataset is 0.7689530685920578.


In [11]:
# Using car data
df_car = preprocess(CAR, CAR_COLUMN, eliminate=True)
model_main = train_probability_supervised(df_car, "class")
predicted_test_result = predict_supervised(df_car, df_car, "class", model_main)
confusion_df = confusion_matrix_supervised(list(df_car["class"]), predicted_test_result, df_car["class"].unique())
display(confusion_df)
print("The accuracy for car dataset is {}.".format(evaluate_supervised(list(df_car["class"]), predicted_test_result)))

|  | Unacc Predicted | Acc Predicted | Vgood Predicted | Good Predicted |
|---|---|---|---|---|
| Unacc Actual | 1161 | 47 | 0 | 2 |
| Acc Actual | 85 | 289 | 0 | 10 |
| Vgood Actual | 0 | 26 | 39 | 0 |
| Good Actual | 0 | 46 | 2 | 21 |


The accuracy for car dataset is 0.8738425925925926.


In [12]:
# Using the hypothyroid data
df_hypo = preprocess(HYPOTHYROID, HYPOTHYROID_COLUMN, eliminate=True)
model_main = train_probability_supervised(df_hypo, "class")
predicted_test_result = predict_supervised(df_hypo, df_hypo, "class", model_main)
confusion_df = confusion_matrix_supervised(list(df_hypo["class"]), predicted_test_result, df_hypo["class"].unique())
display(confusion_df)
print("The accuracy for hypothyroid dataset is {}.".format(evaluate_supervised(list(df_hypo["class"]), predicted_test_result)))

|  | Hypothyroid Predicted | Negative Predicted |
|---|---|---|
| Hypothyroid Actual | 0 | 149 |
| Negative Actual | 0 | 2941 |


The accuracy for hypothyroid dataset is 0.9517799352750809.


In [13]:
# Using the mushroom data
df_mushroom = preprocess(MUSHROOM, MUSHROOM_COLUMN, eliminate=True)
model_main = train_probability_supervised(df_mushroom, "class")
predicted_test_result = predict_supervised(df_mushroom, df_mushroom, "class", model_main)
confusion_df = confusion_matrix_supervised(list(df_mushroom["class"]), predicted_test_result, df_mushroom["class"].unique())
display(confusion_df)
print("The accuracy for mushroom dataset is {}.".format(evaluate_supervised(list(df_mushroom["class"]), predicted_test_result)))

|  | P Predicted | E Predicted |
|---|---|---|
| P Actual | 2156 | 0 |
| E Actual | 16 | 3472 |


The accuracy for mushroom dataset is 0.997165131112686.


## Train Unsupervised

In [21]:
'''
Initialise the dataset with a random class distribution for every instance
@param dataset = dataframe of the dataset
@param class_label = column name that we want to classify
@return unsupervised_dataset = dataset without the class column, plus one column per class holding a random probability
'''
def initialise_unsupervised_naive_bayes(dataset, class_label):
    row_number = dataset.shape[0]
    class_names = list(dataset[class_label].unique())
    class_number = len(class_names)
    # Drop the true labels, using class_label rather than a hard-coded column name
    unsupervised_dataset = dataset.drop([class_label], axis=1)

    # Sample from a uniform distribution and normalise every row so it sums to 1
    sample_matrix = np.random.uniform(0, 1, (row_number, class_number))
    sample_matrix /= sample_matrix.sum(axis=1, keepdims=True)

    # Add one column per class with the random distribution (initialisation phase).
    # Numpy arrays are assigned positionally, so the dataframe index does not matter
    for position, unique_class in enumerate(class_names):
        unsupervised_dataset[unique_class] = sample_matrix[:, position]

    return unsupervised_dataset

In [22]:
'''
This function should build an unsupervised NB model and return a count
@param class_column = possible class name (weak unsupervised model)
@param attribute_column = attributes that are used for calculation
@param dataset = data that are used to create the unsupervised NB classifier (format after running initialise_unsupervised_naive_bayes function)
@param class_label = column name of the class that we want to classify
@return count_prior = dictionary describing prior count of the class in training data
@return count_posterior = dictionary of dictionaries posterior count
'''
def train_count_unsupervised(class_column, attribute_column, dataset, class_label):
    # Calculate the prior counts (count_prior)
    # Initialise a dictionary with every possible class as its keys
    count_prior = {}
    for unique_class in class_column:
        count_prior[unique_class] = 0
    
    # Loop through the training data and sum the probability
    for index, row in dataset.iterrows():
        for unique_class in class_column:
            count_prior[unique_class] += row[unique_class]
    
    # Calculate the conditional counts (count_posterior); the data structure is a
    # dictionary of dictionaries of dictionaries: attribute -> class -> value -> fractional count
    count_posterior = {}
    
    # Setup the dictionary component
    for col in attribute_column:
        count_posterior[col] = {}
        for unique_class in class_column:
            count_posterior[col][unique_class] = {}
            for unique_col in dataset[col].unique():
                count_posterior[col][unique_class][unique_col] = 0
    
    # Now use the training data to perform calculation
    for index, row in dataset.iterrows():
        for col in attribute_column:
            for unique_class in class_column:
                count_posterior[col][unique_class][row[col]] += row[unique_class]
   
    return((count_prior, count_posterior))

In [23]:
'''
This function should build unsupervised NB model and return a probability
@param class_column = possible class name (weak unsupervised model)
@param attribute_column = attributes that are used for calculation
@param dataset = data that are used to create the unsupervised NB classifier (format after running initialise_unsupervised_naive_bayes function)
@param class_label = column name of the class that we want to classify
@return probability_prior = dictionary describing prior probability of the class in training data,
@return probability_posterior = dictionary of dictionaries posterior probability
'''
def train_probability_unsupervised(class_column, attribute_column, dataset, class_label):
    (count_prior, count_posterior) = train_count_unsupervised(class_column, attribute_column, dataset, class_label)
    
    # Now turn the fractional counts into probabilities, e.g. 'Cough': {'flu': {'yes': 0.3, 'no': 0}, 'cold': {'yes': 0.1, 'no': 0.1}}
    # gives P(cough = yes | flu) = 0.3/0.3, P(cough = no | flu) = 0/0.3, P(cough = yes | cold) = 0.1/0.2 and P(cough = no | cold) = 0.1/0.2
    
    # First calculate the prior probability of the class P(c)
    probability_prior = {}
    sum_instance = sum(count_prior.values())
    for unique_class in class_column:
        probability_prior[unique_class] = count_prior[unique_class] / sum_instance
        
        # Perform epsilon smoothing
        if (count_prior[unique_class] == 0.0):
            probability_prior[unique_class] = EPSILON
    
    # Calculate the posterior probability
    probability_posterior = count_posterior
                
    # Now calculate the posterior probability
    for col in attribute_column:
        for unique_class in class_column:
            sum_instance = sum(probability_posterior[col][unique_class].values())
            for unique_col in dataset[col].unique():
                probability_posterior[col][unique_class][unique_col] /= sum_instance
                
                # Perform epsilon smoothing
                if (probability_posterior[col][unique_class][unique_col] == 0):
                    probability_posterior[col][unique_class][unique_col] = EPSILON
    
    return((probability_prior, probability_posterior))
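
The fractional counting used in the unsupervised training can be sketched with a pandas `groupby`. The toy dataframe below is hypothetical (one attribute and two soft class columns with invented weights); it only illustrates the idea and is not the notebook's implementation.

```python
import pandas as pd

# Hypothetical toy data: one attribute plus soft class weights that sum to 1 per row
toy = pd.DataFrame({
    "cough": ["yes", "yes", "no", "yes"],
    "flu":   [0.9, 0.6, 0.2, 0.3],
    "cold":  [0.1, 0.4, 0.8, 0.7],
})

# Fractional counts: sum the class weights of the rows sharing each attribute value
soft_counts = toy.groupby("cough")[["flu", "cold"]].sum()

# P(cough = v | class): normalise every class column so it sums to 1
conditional = soft_counts / soft_counts.sum(axis=0)

# Class priors: total weight per class divided by the number of rows
prior = toy[["flu", "cold"]].sum() / len(toy)

print(conditional)
print(prior)
```

Each row contributes to every class in proportion to its current weight, which is the only difference from the hard counts of the supervised case.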

## Predict Unsupervised

In [25]:
'''
This function should predict the class for a set of instances, based on a trained model 
@param class_column = possible class name (weak unsupervised model)
@param attribute_column = attributes that are used for calculation
@param dataset = data that are used to calculate prediction
@param class_label = attribute that we want to classify using naive bayes
@param model = tuple consisting of probability_prior and probability_posterior. 
@return test_class = list of classes predicted by the naive bayes classifier. As a side effect, the class columns of dataset are overwritten in place with the normalised class probabilities of every instance, ready for the next iteration.
'''
def predict_unsupervised(class_column, attribute_column, dataset, class_label, model):
    prior_probability = model[PRIOR_INDEX]
    posterior_probability = model[POSTERIOR_INDEX]
    test_class = [] # used to capture test result
    
    # Get the answer for every test instance
    for index, row in dataset.iterrows():
        # Initiate dictionary capturing the values calculated by naive bayes model
        test_value = {}
        for unique_class in class_column:
            test_value[unique_class] = 0

        # Calculate for each class using the naive bayes model (log model for multiplication)
        for unique_class in class_column:
            test_value[unique_class] = np.log(prior_probability[unique_class])
            for col in attribute_column:
                test_value[unique_class] += np.log(posterior_probability[col][unique_class][row[col]])
            
        # After calculating all of the possible class, we want to choose the maximum
        maximum_class = class_column[0]
        maximum_value = test_value[maximum_class]
        for key, value in test_value.items():
            if (value > maximum_value):
                maximum_value = value
                maximum_class = key
    
        # Append result
        test_class.append(maximum_class)
        # Change the dataset structure for the instance to prepare for the next iteration
        # First take the exponent of that to get the real probability calculation value
        for unique_class in class_column:
            test_value[unique_class] = np.exp(test_value[unique_class])
        
        # Calculate the new probability
        denominator_new = sum(test_value.values())
        
        for unique_class in class_column:
            dataset.loc[index, unique_class] = test_value[unique_class] / denominator_new
    
    # Return the classifier for the class
    return test_class
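
The function above exponentiates the log scores directly before normalising them. With many attributes those scores can become very negative, so a common alternative is to subtract the largest score first. The sketch below shows that variant; `normalise_log_scores` is a hypothetical helper, not part of the notebook, and the input scores are invented.

```python
import numpy as np

def normalise_log_scores(log_scores):
    """Hypothetical helper: turn per-class log scores into probabilities that sum to 1."""
    log_scores = np.asarray(log_scores, dtype=float)
    # Shift so the largest score becomes 0; at least one exp() is then exactly 1
    shifted = log_scores - log_scores.max()
    weights = np.exp(shifted)
    return weights / weights.sum()

# Invented scores for which a plain np.exp would return 0.0 for both classes
probabilities = normalise_log_scores([-1000.0, -1001.0])
print(probabilities)
```

The shift cancels in the ratio, so the result is mathematically the same as normalising the unshifted exponentials whenever those are representable.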

## Evaluate Unsupervised

In [26]:
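'''
This function calculates the accuracy from the given confusion matrix.
For every actual class (row) it takes the largest cell, i.e. the predicted label that the class is most often assigned to,
and divides the sum of these maxima by the total number of instances.
@param confusion_matrix = the confusion matrix for unsupervised (rows = actual, columns = predicted)
@return accuracy = accuracy of the unsupervised model
'''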
def evaluate_unsupervised(confusion_matrix):
    total_instance = 0
    true_positive = 0
    column = list(confusion_matrix.columns)
    
    for index, row in confusion_matrix.iterrows():
        current_max = row[column[0]]
        for col in column:
            if (row[col] > current_max):
                current_max = row[col]
            total_instance += row[col]
        
        true_positive += current_max
    
    return true_positive/total_instance
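
As a worked example of this measure, the sketch below recomputes the score for the iteration 1 breast cancer matrix printed later in this notebook (the array is typed in by hand):

```python
import numpy as np

# Iteration 1 breast cancer confusion matrix from this notebook
# (rows = actual, columns = predicted)
confusion = np.array([[29, 52],
                      [71, 125]])

# Take the largest cell of every row, then divide by the total number of instances
score = confusion.max(axis=1).sum() / confusion.sum()
print(score)  # (52 + 125) / 277, about 0.639
```

In this matrix both rows take their maximum from the same predicted column, so the measure does not enforce a one-to-one mapping between clusters and classes.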

In [27]:
'''
This function creates a confusion matrix for unsupervised
@param true_test_result = list of the actual classes
@param predicted_test_result = list of the predicted classes
@param class_column = all possible classes
@return confusion_df = confusion matrix (dataframe), rows = actual, columns = predicted
'''
def confusion_matrix_unsupervised(true_test_result, predicted_test_result, class_column):
    if (len(true_test_result) != len(predicted_test_result)):
        print("Error, different length.")
    else:
        # Create a pandas dataframe actual is the row, predicted is the column
        confusion_df = pd.DataFrame()
        
        for unique_class in class_column:
            confusion_df[unique_class] = [0 for i in range(len(class_column))]
        
        # Change index for df
        confusion_df.index = class_column
        
        # Calculate the confusion matrix
        for i in range(len(true_test_result)):
            confusion_df.loc[true_test_result[i], predicted_test_result[i]] += 1
            
        # Add actual and predicted description on the table to make it easier to see
        predicted_column = []
        for string in confusion_df.columns:
            string += " predicted"
            predicted_column.append(string.title())
       
        actual_row = []
        for string in class_column:
            string += " actual"
            actual_row.append(string.title())
        
        confusion_df.columns = predicted_column
        confusion_df.index = actual_row
        
        return confusion_df

## Main Program

In [37]:
# Using the breast cancer dataset
print("Breast cancer dataset".title())
attribute_column = BREAST_CANCER_COLUMN
breast_df = preprocess(BREAST_CANCER, BREAST_CANCER_COLUMN)
unsupervised_df = initialise_unsupervised_naive_bayes(breast_df, "class")
for i in range(ITERATION):
    # Train and give prediction and calculate accuracy
    print("Iteration {}".format(i+1))
    model = train_probability_unsupervised(breast_df["class"].unique(), attribute_column[:-1], unsupervised_df, "class")
    predicted_test_result = predict_unsupervised(breast_df["class"].unique(), attribute_column[:-1], unsupervised_df, "class", model)
    confusion_matrix = confusion_matrix_unsupervised(list(breast_df["class"]), predicted_test_result, breast_df["class"].unique())
    display(confusion_matrix)
    print("The accuracy of breast cancer dataset based on confusion matrix is {}.".format(evaluate_unsupervised(confusion_matrix)))
    print("\n\n")

Breast Cancer Dataset
Iteration 1


|  | Recurrence-Events Predicted | No-Recurrence-Events Predicted |
|---|---|---|
| Recurrence-Events Actual | 29 | 52 |
| No-Recurrence-Events Actual | 71 | 125 |


The accuracy of breast cancer dataset based on confusion matrix is 0.6389891696750902.



Iteration 2


|  | Recurrence-Events Predicted | No-Recurrence-Events Predicted |
|---|---|---|
| Recurrence-Events Actual | 26 | 55 |
| No-Recurrence-Events Actual | 76 | 120 |


The accuracy of breast cancer dataset based on confusion matrix is 0.631768953068592.



Iteration 3


|  | Recurrence-Events Predicted | No-Recurrence-Events Predicted |
|---|---|---|
| Recurrence-Events Actual | 24 | 57 |
| No-Recurrence-Events Actual | 98 | 98 |


The accuracy of breast cancer dataset based on confusion matrix is 0.5595667870036101.



Iteration 4


|  | Recurrence-Events Predicted | No-Recurrence-Events Predicted |
|---|---|---|
| Recurrence-Events Actual | 20 | 61 |
| No-Recurrence-Events Actual | 113 | 83 |


The accuracy of breast cancer dataset based on confusion matrix is 0.628158844765343.



Iteration 5


|  | Recurrence-Events Predicted | No-Recurrence-Events Predicted |
|---|---|---|
| Recurrence-Events Actual | 23 | 58 |
| No-Recurrence-Events Actual | 125 | 71 |


The accuracy of breast cancer dataset based on confusion matrix is 0.6606498194945848.



Iteration 6


|  | Recurrence-Events Predicted | No-Recurrence-Events Predicted |
|---|---|---|
| Recurrence-Events Actual | 22 | 59 |
| No-Recurrence-Events Actual | 131 | 65 |


The accuracy of breast cancer dataset based on confusion matrix is 0.6859205776173285.



Iteration 7


|  | Recurrence-Events Predicted | No-Recurrence-Events Predicted |
|---|---|---|
| Recurrence-Events Actual | 26 | 55 |
| No-Recurrence-Events Actual | 143 | 53 |


The accuracy of breast cancer dataset based on confusion matrix is 0.7148014440433214.



Iteration 8


|  | Recurrence-Events Predicted | No-Recurrence-Events Predicted |
|---|---|---|
| Recurrence-Events Actual | 28 | 53 |
| No-Recurrence-Events Actual | 147 | 49 |


The accuracy of breast cancer dataset based on confusion matrix is 0.7220216606498195.



Iteration 9


|  | Recurrence-Events Predicted | No-Recurrence-Events Predicted |
|---|---|---|
| Recurrence-Events Actual | 28 | 53 |
| No-Recurrence-Events Actual | 150 | 46 |


The accuracy of breast cancer dataset based on confusion matrix is 0.7328519855595668.



Iteration 10


|  | Recurrence-Events Predicted | No-Recurrence-Events Predicted |
|---|---|---|
| Recurrence-Events Actual | 32 | 49 |
| No-Recurrence-Events Actual | 152 | 44 |


The accuracy of breast cancer dataset based on confusion matrix is 0.7256317689530686.





In [45]:
# Using the car dataset
print("car dataset".title())
attribute_column = CAR_COLUMN
car_df = preprocess(CAR, CAR_COLUMN)
unsupervised_df = initialise_unsupervised_naive_bayes(car_df, "class")
for i in range(ITERATION):
    # Train and give prediction and calculate accuracy
    print("Iteration {}".format(i+1))
    model = train_probability_unsupervised(car_df["class"].unique(), attribute_column[:-1], unsupervised_df, "class")
    predicted_test_result = predict_unsupervised(car_df["class"].unique(), attribute_column[:-1], unsupervised_df, "class", model)
    confusion_matrix = confusion_matrix_unsupervised(list(car_df["class"]), predicted_test_result, car_df["class"].unique())
    display(confusion_matrix)
    print("The accuracy of car dataset based on confusion matrix is {}.".format(evaluate_unsupervised(confusion_matrix)))
    print("\n\n")

Car Dataset
Iteration 1


|  | Unacc Predicted | Acc Predicted | Vgood Predicted | Good Predicted |
|---|---|---|---|---|
| Unacc Actual | 180 | 426 | 257 | 347 |
| Acc Actual | 20 | 111 | 100 | 153 |
| Vgood Actual | 11 | 24 | 4 | 26 |
| Good Actual | 1 | 22 | 7 | 39 |


The accuracy of car dataset based on confusion matrix is 0.3726851851851852.



Iteration 2


|  | Unacc Predicted | Acc Predicted | Vgood Predicted | Good Predicted |
|---|---|---|---|---|
| Unacc Actual | 180 | 426 | 257 | 347 |
| Acc Actual | 20 | 111 | 100 | 153 |
| Vgood Actual | 11 | 24 | 4 | 26 |
| Good Actual | 1 | 22 | 7 | 39 |


The accuracy of car dataset based on confusion matrix is 0.3726851851851852.



Iteration 3


|  | Unacc Predicted | Acc Predicted | Vgood Predicted | Good Predicted |
|---|---|---|---|---|
| Unacc Actual | 180 | 426 | 257 | 347 |
| Acc Actual | 20 | 111 | 100 | 153 |
| Vgood Actual | 11 | 24 | 4 | 26 |
| Good Actual | 1 | 22 | 7 | 39 |


The accuracy of car dataset based on confusion matrix is 0.3726851851851852.



Iteration 4


|  | Unacc Predicted | Acc Predicted | Vgood Predicted | Good Predicted |
|---|---|---|---|---|
| Unacc Actual | 180 | 426 | 257 | 347 |
| Acc Actual | 20 | 111 | 100 | 153 |
| Vgood Actual | 11 | 24 | 4 | 26 |
| Good Actual | 1 | 22 | 7 | 39 |


The accuracy of car dataset based on confusion matrix is 0.3726851851851852.



Iteration 5


|  | Unacc Predicted | Acc Predicted | Vgood Predicted | Good Predicted |
|---|---|---|---|---|
| Unacc Actual | 180 | 426 | 257 | 347 |
| Acc Actual | 20 | 111 | 100 | 153 |
| Vgood Actual | 11 | 24 | 4 | 26 |
| Good Actual | 1 | 22 | 7 | 39 |


The accuracy of car dataset based on confusion matrix is 0.3726851851851852.



Iteration 6


|  | Unacc Predicted | Acc Predicted | Vgood Predicted | Good Predicted |
|---|---|---|---|---|
| Unacc Actual | 180 | 426 | 257 | 347 |
| Acc Actual | 20 | 111 | 100 | 153 |
| Vgood Actual | 11 | 24 | 4 | 26 |
| Good Actual | 1 | 22 | 7 | 39 |


The accuracy of car dataset based on confusion matrix is 0.3726851851851852.



Iteration 7


|  | Unacc Predicted | Acc Predicted | Vgood Predicted | Good Predicted |
|---|---|---|---|---|
| Unacc Actual | 180 | 426 | 257 | 347 |
| Acc Actual | 21 | 111 | 99 | 153 |
| Vgood Actual | 11 | 24 | 4 | 26 |
| Good Actual | 1 | 22 | 7 | 39 |


The accuracy of car dataset based on confusion matrix is 0.3726851851851852.



Iteration 8


|  | Unacc Predicted | Acc Predicted | Vgood Predicted | Good Predicted |
|---|---|---|---|---|
| Unacc Actual | 180 | 428 | 255 | 347 |
| Acc Actual | 21 | 111 | 99 | 153 |
| Vgood Actual | 11 | 24 | 4 | 26 |
| Good Actual | 1 | 22 | 7 | 39 |


The accuracy of car dataset based on confusion matrix is 0.3738425925925926.



Iteration 9


|  | Unacc Predicted | Acc Predicted | Vgood Predicted | Good Predicted |
|---|---|---|---|---|
| Unacc Actual | 180 | 428 | 255 | 347 |
| Acc Actual | 21 | 111 | 99 | 153 |
| Vgood Actual | 11 | 24 | 4 | 26 |
| Good Actual | 1 | 22 | 7 | 39 |


The accuracy of car dataset based on confusion matrix is 0.3738425925925926.



Iteration 10


|  | Unacc Predicted | Acc Predicted | Vgood Predicted | Good Predicted |
|---|---|---|---|---|
| Unacc Actual | 180 | 428 | 255 | 347 |
| Acc Actual | 21 | 111 | 99 | 153 |
| Vgood Actual | 11 | 24 | 4 | 26 |
| Good Actual | 1 | 22 | 7 | 39 |


The accuracy of car dataset based on confusion matrix is 0.3738425925925926.





In [41]:
# Using the hypothyroid dataset
print("hypothyroid dataset".title())
attribute_column = HYPOTHYROID_COLUMN
hypo_df = preprocess(HYPOTHYROID, HYPOTHYROID_COLUMN)
unsupervised_df = initialise_unsupervised_naive_bayes(hypo_df, "class")
for i in range(ITERATION):
    # Train and give prediction and calculate accuracy
    print("Iteration {}".format(i+1))
    model = train_probability_unsupervised(hypo_df["class"].unique(), attribute_column[:-1], unsupervised_df, "class")
    predicted_test_result = predict_unsupervised(hypo_df["class"].unique(), attribute_column[:-1], unsupervised_df, "class", model)
    confusion_matrix = confusion_matrix_unsupervised(list(hypo_df["class"]), predicted_test_result, hypo_df["class"].unique())
    display(confusion_matrix)
    print("The accuracy of hypothyroid dataset based on confusion matrix is {}.".format(evaluate_unsupervised(confusion_matrix)))
    print("\n\n")

Hypothyroid Dataset
Iteration 1


Unnamed: 0,Hypothyroid Predicted,Negative Predicted
Hypothyroid Actual,88,61
Negative Actual,1614,1327


The accuracy of hypothyroid dataset based on confusion matrix is 0.5508090614886731.



Iteration 2


Unnamed: 0,Hypothyroid Predicted,Negative Predicted
Hypothyroid Actual,90,59
Negative Actual,1567,1374


The accuracy of hypothyroid dataset based on confusion matrix is 0.5362459546925566.



Iteration 3


Unnamed: 0,Hypothyroid Predicted,Negative Predicted
Hypothyroid Actual,137,12
Negative Actual,2340,601


The accuracy of hypothyroid dataset based on confusion matrix is 0.8016181229773462.



Iteration 4


Unnamed: 0,Hypothyroid Predicted,Negative Predicted
Hypothyroid Actual,143,6
Negative Actual,2466,475


The accuracy of hypothyroid dataset based on confusion matrix is 0.844336569579288.



Iteration 5


Unnamed: 0,Hypothyroid Predicted,Negative Predicted
Hypothyroid Actual,143,6
Negative Actual,2463,478


The accuracy of hypothyroid dataset based on confusion matrix is 0.8433656957928802.



Iteration 6


Unnamed: 0,Hypothyroid Predicted,Negative Predicted
Hypothyroid Actual,144,5
Negative Actual,2466,475


The accuracy of hypothyroid dataset based on confusion matrix is 0.8446601941747572.



Iteration 7


Unnamed: 0,Hypothyroid Predicted,Negative Predicted
Hypothyroid Actual,145,4
Negative Actual,2524,417


The accuracy of hypothyroid dataset based on confusion matrix is 0.8637540453074434.



Iteration 8


Unnamed: 0,Hypothyroid Predicted,Negative Predicted
Hypothyroid Actual,145,4
Negative Actual,2552,389


The accuracy of hypothyroid dataset based on confusion matrix is 0.8728155339805825.



Iteration 9


Unnamed: 0,Hypothyroid Predicted,Negative Predicted
Hypothyroid Actual,149,0
Negative Actual,2659,282


The accuracy of hypothyroid dataset based on confusion matrix is 0.9087378640776699.



Iteration 10


Unnamed: 0,Hypothyroid Predicted,Negative Predicted
Hypothyroid Actual,149,0
Negative Actual,2688,253


The accuracy of hypothyroid dataset based on confusion matrix is 0.9181229773462783.





In [44]:
# Using the mushroom dataset
print("mushroom dataset".title())
attribute_column = MUSHROOM_COLUMN
mushroom_df = preprocess(MUSHROOM, MUSHROOM_COLUMN)
unsupervised_df = initialise_unsupervised_naive_bayes(mushroom_df, "class")
for i in range(ITERATION):
    # Train and give prediction and calculate accuracy
    print("Iteration {}".format(i+1))
    model = train_probability_unsupervised(mushroom_df["class"].unique(), attribute_column[:-1], unsupervised_df, "class")
    predicted_test_result = predict_unsupervised(mushroom_df["class"].unique(), attribute_column[:-1], unsupervised_df, "class", model)
    confusion_matrix = confusion_matrix_unsupervised(list(mushroom_df["class"]), predicted_test_result, mushroom_df["class"].unique())
    display(confusion_matrix)
    print("The accuracy of mushroom dataset based on confusion matrix is {}.".format(evaluate_unsupervised(confusion_matrix)))
    print("\n\n")

Mushroom Dataset
Iteration 1


Unnamed: 0,P Predicted,E Predicted
P Actual,1547,609
E Actual,1281,2207


The accuracy of mushroom dataset based on confusion matrix is 0.6651311126860383.



Iteration 2


Unnamed: 0,P Predicted,E Predicted
P Actual,1445,711
E Actual,431,3057


The accuracy of mushroom dataset based on confusion matrix is 0.797661233167966.



Iteration 3


Unnamed: 0,P Predicted,E Predicted
P Actual,1424,732
E Actual,38,3450


The accuracy of mushroom dataset based on confusion matrix is 0.8635719347980156.



Iteration 4


Unnamed: 0,P Predicted,E Predicted
P Actual,1302,854
E Actual,2,3486


The accuracy of mushroom dataset based on confusion matrix is 0.848334514528703.



Iteration 5


Unnamed: 0,P Predicted,E Predicted
P Actual,1296,860
E Actual,0,3488


The accuracy of mushroom dataset based on confusion matrix is 0.8476257973068746.



Iteration 6


Unnamed: 0,P Predicted,E Predicted
P Actual,1296,860
E Actual,0,3488


The accuracy of mushroom dataset based on confusion matrix is 0.8476257973068746.



Iteration 7


Unnamed: 0,P Predicted,E Predicted
P Actual,1296,860
E Actual,0,3488


The accuracy of mushroom dataset based on confusion matrix is 0.8476257973068746.



Iteration 8


Unnamed: 0,P Predicted,E Predicted
P Actual,1296,860
E Actual,0,3488


The accuracy of mushroom dataset based on confusion matrix is 0.8476257973068746.



Iteration 9


Unnamed: 0,P Predicted,E Predicted
P Actual,1296,860
E Actual,0,3488


The accuracy of mushroom dataset based on confusion matrix is 0.8476257973068746.



Iteration 10


Unnamed: 0,P Predicted,E Predicted
P Actual,1296,860
E Actual,0,3488


The accuracy of mushroom dataset based on confusion matrix is 0.8476257973068746.





## Questions (you may respond in a cell or cells below):

1. Since we’re starting off with random guesses, it might be surprising that the unsupervised NB works at all. Explain what characteristics of the data cause it to work pretty well (say, within 10% Accuracy of the supervised NB) most of the time; also, explain why it utterly fails sometimes.
2. When evaluating supervised NB across the four different datasets, you will observe some variation in effectiveness (e.g. Accuracy). Explain what causes this variation. Describe and explain any particularly surprising results.
3. Evaluating the model on the same data that we use to train the model is considered to be a major mistake in Machine Learning. Implement a hold-out (hint: check out numpy.random.shuffle()) or cross-validation evaluation strategy. How does your estimate of Accuracy change, compared to testing on the training data? Explain why. (The result might surprise you!)
4. Implement one of the advanced smoothing regimes (add-k, Good-Turing). Do you notice any variation in the predictions made by either the supervised or unsupervised NB classifiers? Explain why, or why not.
5. The lecture suggests that deterministically labelling the instances in the initialisation phase of the unsupervised NB classifier “doesn’t work very well”. Confirm this for yourself, and then demonstrate why.
6. Rather than evaluating the unsupervised NB classifier by assigning a class deterministically, instead calculate how far away the probabilistic estimate of the true class is from 1 (where we would be certain of the correct class), and take the average over the instances. Does this performance estimate change, as we alter the number of iterations in the method? Explain why.
7. Explore what causes the unsupervised NB classifier to converge: what proportion of instances change their prediction from the random assignment, to the first iteration? From the first to the second? What is the latest iteration where you observe a prediction change? Make some conjecture(s) as to what is occurring here.

Don't forget that groups of 1 student should respond to question (1), and one other question. Groups of 2 students should respond to question (1), and three other questions. Your responses should be about 100-200 words each.

### Question 1

### Question 2

- As with other supervised algorithms, Naive Bayes still works with few training instances, but more data improves it up to a point, after which accuracy plateaus. This explains why accuracy is around 80% with few instances, as in the first two datasets, and improves to around 97% with thousands of instances, as in the last two datasets. One reason is that Naive Bayes needs a sufficient amount of data to estimate the conditional probabilities of the attributes reliably.
- Naive Bayes assumes that the attributes are conditionally independent given the class, which is generally untrue. In most cases the classifier still works despite this, because some P(xi|c) are overestimated and others underestimated, so the errors tend to cancel out. However, we cannot always rely on this: depending on the feature engineering, a dataset may or may not be suitable for Naive Bayes. With feature engineering we can maximise the accuracy of Naive Bayes by choosing features that are not strongly correlated with one another. To choose features we can use information gain, or perform a correlation analysis on the features and remove the highly correlated ones.
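
As a rough illustration of the information gain criterion mentioned above, here is a minimal sketch in plain Python. The helper names `entropy` and `information_gain` and the toy data are assumptions made for this example; they are not part of the notebook's code or of the four datasets.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy obtained by splitting on one attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(values):
        # Class labels of the instances that take this attribute value
        subset = [c for v, c in zip(values, labels) if v == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data, not taken from any of the four datasets.
labels = ["yes", "yes", "no", "no"]
informative = ["a", "a", "b", "b"]    # determines the class exactly
uninformative = ["a", "b", "a", "b"]  # independent of the class

gain_informative = information_gain(informative, labels)
gain_uninformative = information_gain(uninformative, labels)
print(gain_informative, gain_uninformative)  # 1.0 and 0.0 for this toy data
```

Information gain scores each feature against the class, so on its own it does not detect two features that duplicate each other; that redundancy is what the correlation analysis is meant to catch.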

### Question 3

### Question 5

- Deterministically labelling the instances in the initialisation phase of the unsupervised NB classifier is a bad idea because it amounts to choosing a very strong prior, with the result that the class labels do not change at all in subsequent iterations. The idea behind Bayesian methods is to choose a prior that the data can then update into a posterior. When the prior is too strong, however, the data no longer has any influence, which is the same as not using the data at all.
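
For concreteness, the two initialisation schemes being contrasted can be sketched as follows. This only shows what the starting class distributions look like; it does not reproduce `initialise_unsupervised_naive_bayes`, which is defined earlier in the notebook, and the function names and class labels here are assumptions made for illustration.

```python
import random


def random_distribution(classes, rng):
    """Random class distribution for one instance; the weights sum to 1."""
    weights = [rng.random() for _ in classes]
    total = sum(weights)
    return {c: w / total for c, w in zip(classes, weights)}


def deterministic_distribution(classes, rng):
    """One-hot distribution: all of the mass on one randomly chosen class."""
    chosen = rng.choice(classes)
    return {c: 1.0 if c == chosen else 0.0 for c in classes}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
classes = ["p", "e"]    # illustrative two-class labels (assumption)

# Probabilistic start: every class keeps some mass for every instance.
soft = [random_distribution(classes, rng) for _ in range(5)]
# Deterministic start: each instance is fully committed to one class.
hard = [deterministic_distribution(classes, rng) for _ in range(5)]

for s, h in zip(soft, hard):
    print(s, h)
```

In the deterministic case every instance starts with probability 1 on a single class and 0 on the rest, which is the "too strong" starting point the answer above refers to.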

### Question 6

### Question 7