# Project 1: What is labelled data worth to Naive Bayes?
---

## Initialisation

In [1]:
# Library
import pandas as pd
import numpy as np
import random
from IPython.display import display

In [126]:
# Data Path Constant
BREAST_CANCER = "2018S1-proj1_data/breast-cancer-dos.csv"
CAR = "2018S1-proj1_data/car-dos.csv"
HYPOTHYROID = "2018S1-proj1_data/hypothyroid-dos.csv"
MUSHROOM = "2018S1-proj1_data/mushroom-dos.csv"

# Column name for each data set
BREAST_CANCER_COLUMN = ["age", "menopause", "tumor-size", "inv-nodes", "node-caps", "deg-malig", "breast", "breast-quad", "irradiat", "class"]
CAR_COLUMN = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
HYPOTHYROID_COLUMN = ["sex", "on_thyroxine", "query_on_thyroxine", "on_antithyroid_medication", "thyroid_surgery", "query_hypothyroid", "query_hyperthyroid", "pregnant", "sick", "tumor", "lithium", "goitre", "TSH_measured", "T3_measured", "TT4_measured", "T4U_measured", "FTI_measured", "TBG_measured", "class"]
MUSHROOM_COLUMN = ["cap-shape", "cap-surface", "cap-color", "bruises", "odor", "gill-attachment", "gill-spacing", "gill-size", "gill-color", "stalk-shape", "stalk-root", "stalk-surface-above-ring", "stalk-surface-below-ring", "stalk-color-above-ring", "stalk-color-below-ring", "veil-type", "veil-color", "ring-number", "ring-type", "spore-print-color", "population", "habitat", "class"]

# Other Constant
PRIOR_INDEX = 0
POSTERIOR_INDEX = 1
EPSILON = 0.000001 # Epsilon smoothing
ITERATION = 5 # Number of iteration in unsupervised naive bayes

In [136]:
# Used to check algorithm correctness
training_df = pd.DataFrame(data={"Headache": ["severe", "no", "mild", "mild", "severe", "no", "mild"], "Sore": ["mild", "severe", "mild", "no", "severe", "severe", "mild"], "Temperature":["high", "normal", "normal", "normal", "normal", "high", "normal"], "Cough": ["yes", "yes", "yes", "no", "yes", "no", "no"], "class":["flu", "cold", "flu", "cold", "flu", "fever", "fever"]})
display(training_df)
# training2_df = pd.read_csv("cars-sample.csv")
# training2_df.drop(["Unnamed: 0"], axis=1, inplace=True)
# training2_df.columns = CAR_COLUMN
# display(training2_df)

training2_df = pd.read_csv("cars-sample.csv")
training2_df.drop(["Unnamed: 0"], axis=1, inplace=True)
training2_df.columns = CAR_COLUMN
display(training2_df)

Unnamed: 0,Cough,Headache,Sore,Temperature,class
0,yes,severe,mild,high,flu
1,yes,no,severe,normal,cold
2,yes,mild,mild,normal,flu
3,no,mild,no,normal,cold
4,yes,severe,severe,normal,flu
5,no,no,severe,high,fever
6,no,mild,mild,normal,fever


Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,high,high,3,4,small,low,unacc
1,high,vhigh,4,2,small,med,unacc
2,low,high,5more,more,big,med,acc
3,vhigh,low,5more,4,med,high,acc
4,vhigh,med,2,4,big,low,unacc
5,low,med,3,4,big,med,good
6,low,med,5more,4,big,high,vgood


## Preprocess

In [12]:
# This function should open a data file in csv, and transform it into a usable format 
# @param data = csv data that will be opened
# @param columns = new column name for header
# @param eliminate = eliminate the missing/ ? instances (recommended if there are only few missing instances)
# @return df = clean pandas dataframe object
def preprocess(data, columns, eliminate=True):
    # Read and add a header to the data frame
    df = pd.read_csv(data, header=None)
    df.columns = columns
    
    # If the parameter ignore is set to be false then we don't ignore
    if (eliminate):
        # Iterate through the dataframe and only append without missing value
        # Capture the index of one with the missing values
        for index, row in df.iterrows():
            for att in row:
                # If encounter missing values in the data, don't use that
                if (att == "?"):
                    df.drop(index, inplace=True)
                    break
    
    # Return the clean data
    return df

## Train Supervised

In [5]:
# This function should build a supervised NB model and return a count
# @param train_data = training data that are used to create the supervised NB classifier
# @param class_label = column name of the class that we want to classify
# @return count_prior = dictionary describing prior count of the class in training data
# @return count_posterior = dictionary of dictionaries posterior count
def train_count_supervised(train_data, class_label):
    # Calculate prior (dictionary_prior)
    # Initiate python dictionary with the number of class in the training data as it's key
    count_prior = {}
    for unique_class in train_data[class_label].unique():
        count_prior[unique_class] = 0
    
    # Loop through the training data and get how many for every classes instance.
    # Now we have the count prior class that are used for prediction
    for index, row in train_data.iterrows():
        count_prior[row[class_label]] += 1
    
    # Calculate count posterior (dictionary_posterior), the data structure used are dictionary
    # of dictionary of dictionaries
    count_posterior = {}
    
    # Setup the dictionary component
    column_name = list(train_data.columns)
    column_name.remove(class_label)
    for col in column_name:
        count_posterior[col] = {}
        for unique_class in train_data[class_label].unique():
            count_posterior[col][unique_class] = {}
            for unique_col in train_data[col].unique():
                count_posterior[col][unique_class][unique_col] = 0
    
    # Now use the training data to perform calculation
    for index, row in train_data.iterrows():
        for col in column_name:
            count_posterior[col][row[class_label]][row[col]] += 1
            
    return((count_prior, count_posterior))

In [6]:
# This function should build supervised NB model and return a probability
# @param train_data = training data that are used to create the supervised NB classifier
# @param class_label = column name of the class that we want to classify
# @return probability_prior = dictionary describing prior probability of the class in training data,
# @return probability_posterior = dictionary of dictionaries posterior probability
def train_probability_supervised(train_data, class_label):
    (count_prior, count_posterior) = train_count_supervised(train_data, class_label)
    
    # Now calculate the probability of each instances, (i.e. 'Cough': {'flu': {'yes': 3, 'no': 0}, 'cold': {'yes': 1, 'no': 1}}
    # will have P(cough = yes | flu) = 3/3, P(cough = no | flu) = 0/3 and P(cough = yes | cold) = 1/2, P(cough = no | cold) = 1/2
    # First calculate the prior probability of the class P(c)
    probability_prior = {}
    sum_instance = sum(count_prior.values())
    for unique_class in train_data[class_label].unique():
        probability_prior[unique_class] = count_prior[unique_class] / sum_instance
        
        # Perform epsilon smoothing
        if (count_prior[unique_class] == 0):
            probability_prior[unique_class] = EPSILON
    
    # Calculate the posterior probability
    probability_posterior = count_posterior
    column_name = list(train_data.columns)
    column_name.remove(class_label)
                
    # Now calculate the posterior probability
    for col in column_name:
        for unique_class in train_data[class_label].unique():
            sum_instance = sum(probability_posterior[col][unique_class].values())
            for unique_col in train_data[col].unique():
                probability_posterior[col][unique_class][unique_col] /= sum_instance
                
                # Perform epsilon smoothing
                if (probability_posterior[col][unique_class][unique_col] == 0):
                    probability_posterior[col][unique_class][unique_col] = EPSILON
                
            
    return((probability_prior, probability_posterior))

## Predict Supervised

In [8]:
# This function should predict the class for a set of instances, based on a trained model 
# @param test_data = data to be tested
# @param train_data = data used for setup such as finding all possible classes
# @param class_label = attribute that we want to classify using naive bayes
# @param model = tuple consisting probability_prior and probability_posterior. 
# Mainly use the train_probability_supervised instead of train_count_supervised
# @return test_class = the class predicted by the naive bayes classifier
def predict_supervised(test_data, train_data, class_label, model):
    prior_probability = model[PRIOR_INDEX]
    posterior_probability = model[POSTERIOR_INDEX]
    test_class = [] # used to capture test result
    
    # Used for calculation purposes
    column_name = list(train_data.columns)
    column_name.remove(class_label)
    
    # Get the answer for every test instance
    for index, row in test_data.iterrows():
        # Initiate dictionary capturing the values calculated by naive bayes model
        test_value = {}
        for unique_class in train_data[class_label].unique():
            test_value[unique_class] = 0

        # Calculate for each class using the naive bayes model (log model for multiplication)
        for unique_class in train_data[class_label].unique():
            test_value[unique_class] = np.log(prior_probability[unique_class])
            for col in column_name:
                test_value[unique_class] += np.log(posterior_probability[col][unique_class][row[col]])
            
        # After calculating all of the possible class, we want to choose the maximum
        maximum_class = (train_data[class_label].unique())[0]
        maximum_value = test_value[maximum_class]
        for key, value in test_value.items():
            if (value > maximum_value):
                maximum_value = value
                maximum_class = key
    
        # Append result
        test_class.append(maximum_class)
    
    # Return the classifier for the class
    return test_class

# Evaluate Supervised

In [9]:
# This function should evaluate a set of predictions, in a supervised context 
def evaluate_supervised(true_test_result, predicted_test_result):
    if (len(true_test_result) != len(predicted_test_result)):
        print("Error, different length.")
    else:
        # Measure accuracy
        correct = 0
        for i in range(len(true_test_result)):
            if (true_test_result[i] == predicted_test_result[i]):
                correct += 1
            
        accuracy = correct / len(true_test_result)
        
    return accuracy

In [10]:
def confusion_matrix_supervised(true_test_result, predicted_test_result, class_column):
    if (len(true_test_result) != len(predicted_test_result)):
        print("Error, different length.")
    else:
        # Create a pandas dataframe actual is the row, predicted is the column
        confusion_df = pd.DataFrame()
        
        for unique_class in class_column:
            confusion_df[unique_class] = [0 for i in range(len(class_column))]
        
        # Change index for df
        confusion_df.index = class_column
        
        # Calculate the confusion matrix
        for i in range(len(true_test_result)):
            confusion_df.loc[true_test_result[i], predicted_test_result[i]] += 1
            
        # Add actual and predicted description on the table to make it easier to see
        predicted_column = []
        for string in confusion_df.columns:
            string += " predicted"
            predicted_column.append(string.title())
       
        actual_row = []
        for string in class_column:
            string += " actual"
            actual_row.append(string.title())
        
        confusion_df.columns = predicted_column
        confusion_df.index = actual_row
        
        return confusion_df

## Main Program

In [14]:
# Using the breast cancer data
df_breast_cancer = preprocess(BREAST_CANCER, BREAST_CANCER_COLUMN, eliminate=True)
model_main = train_probability_supervised(df_breast_cancer, "class")
predicted_test_result = predict_supervised(df_breast_cancer, df_breast_cancer, "class", model_main)
confusion_df = confusion_matrix_supervised(list(df_breast_cancer["class"]), predicted_test_result, df_breast_cancer["class"].unique())
display(confusion_df)
print("The accuracy for breast cancer dataset is {}.".format(evaluate_supervised(list(df_breast_cancer["class"]), predicted_test_result)))
print("\n\n")

# Using car data
df_car = preprocess(CAR, CAR_COLUMN, eliminate=True)
model_main = train_probability_supervised(df_car, "class")
predicted_test_result = predict_supervised(df_car, df_car, "class", model_main)
confusion_df = confusion_matrix_supervised(list(df_car["class"]), predicted_test_result, df_car["class"].unique())
display(confusion_df)
print("The accuracy for car dataset is {}.".format(evaluate_supervised(list(df_car["class"]), predicted_test_result)))
print("\n\n")

# Using the breast cancer data
df_hypo = preprocess(HYPOTHYROID, HYPOTHYROID_COLUMN, eliminate=True)
model_main = train_probability_supervised(df_hypo, "class")
predicted_test_result = predict_supervised(df_hypo, df_hypo, "class", model_main)
confusion_df = confusion_matrix_supervised(list(df_hypo["class"]), predicted_test_result, df_hypo["class"].unique())
display(confusion_df)
print("The accuracy for hypothyroid dataset is {}.".format(evaluate_supervised(list(df_hypo["class"]), predicted_test_result)))
print("\n\n")

# Using the breast cancer data
df_mushroom = preprocess(MUSHROOM, MUSHROOM_COLUMN, eliminate=True)
model_main = train_probability_supervised(df_mushroom, "class")
predicted_test_result = predict_supervised(df_mushroom, df_mushroom, "class", model_main)
confusion_df = confusion_matrix_supervised(list(df_mushroom["class"]), predicted_test_result, df_mushroom["class"].unique())
display(confusion_df)
print("The accuracy for mushroom dataset is {}.".format(evaluate_supervised(list(df_mushroom["class"]), predicted_test_result)))
print("\n\n")

# Using training data
df_training = training_df
model_main = train_probability_supervised(df_training, "class")
predicted_test_result = predict_supervised(df_training, df_training, "class", model_main)
confusion_df = confusion_matrix_supervised(list(df_training["class"]), predicted_test_result, df_training["class"].unique())
display(confusion_df)
print("The accuracy for the training cold flu dataset is {}.".format(evaluate_supervised(list(df_training["class"]), predicted_test_result)))

Unnamed: 0,Recurrence-Events Predicted,No-Recurrence-Events Predicted
Recurrence-Events Actual,48,33
No-Recurrence-Events Actual,31,165


The accuracy for breast cancer dataset is 0.7689530685920578.





Unnamed: 0,Unacc Predicted,Acc Predicted,Vgood Predicted,Good Predicted
Unacc Actual,1161,47,0,2
Acc Actual,85,289,0,10
Vgood Actual,0,26,39,0
Good Actual,0,46,2,21


The accuracy for car dataset is 0.8738425925925926.





Unnamed: 0,Hypothyroid Predicted,Negative Predicted
Hypothyroid Actual,0,149
Negative Actual,0,2941


The accuracy for hypothyroid dataset is 0.9517799352750809.





Unnamed: 0,P Predicted,E Predicted
P Actual,2156,0
E Actual,16,3472


The accuracy for mushroom dataset is 0.997165131112686.





Unnamed: 0,Flu Predicted,Cold Predicted,Fever Predicted
Flu Actual,3,0,0
Cold Actual,0,2,0
Fever Actual,0,0,2


The accuracy for the training cold flu dataset is 1.0.


## Train Unsupervised

In [117]:
# This function initialise the distribution randomly to the dataframe to begin the unsupervised
# calculation.
# @param dataset = dataframe of the dataset
# @param class_label = class that we will use the classification on
# @return unsupervised_dataset = dataset that have been appended by the distribution columns
def initialise_unsupervised_naive_bayes(dataset, class_label):
    # First remove the class label and put that on the columns so that we can assign a distribution
    class_column = list(dataset[class_label].unique())
    last_class = class_column[-1]
    unsupervised_dataset = dataset.drop(["class"], axis=1)
    
    # Add a column to the dataset according to random distribution (initialisation phase)
    row_instance = unsupervised_dataset.shape[0]
    for unique_class in class_column:
        unsupervised_dataset[unique_class] = [0 for i in range(row_instance)]
    
    # Add random value to the dataset
    for index, row in unsupervised_dataset.iterrows():
        max_probability = 1
        for unique_class in class_column:
            # Assign the remaining probability to the last class
            if (unique_class == last_class):
                unsupervised_dataset.loc[index, unique_class] = max_probability
            else:
                unsupervised_dataset.loc[index, unique_class] = random.uniform(0, max_probability)
                max_probability -= unsupervised_dataset.loc[index, unique_class]
    
    return unsupervised_dataset

car_df = preprocess(CAR, CAR_COLUMN)
display(car_df.head())
unsupervised_df = initialise_unsupervised_naive_bayes(car_df, "class")
display(unsupervised_df)

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,unacc,acc,vgood,good
0,vhigh,vhigh,2,2,small,low,0.275685,0.209520,0.437981,0.076814
1,vhigh,vhigh,2,2,small,med,0.800529,0.132982,0.053636,0.012853
2,vhigh,vhigh,2,2,small,high,0.176241,0.210448,0.201689,0.411622
3,vhigh,vhigh,2,2,med,low,0.933781,0.053483,0.000775,0.011962
4,vhigh,vhigh,2,2,med,med,0.097150,0.451131,0.246554,0.205165
5,vhigh,vhigh,2,2,med,high,0.910338,0.016199,0.017718,0.055745
6,vhigh,vhigh,2,2,big,low,0.791692,0.047520,0.015593,0.145195
7,vhigh,vhigh,2,2,big,med,0.291803,0.641368,0.050477,0.016352
8,vhigh,vhigh,2,2,big,high,0.582601,0.152595,0.071705,0.193099
9,vhigh,vhigh,2,4,small,low,0.487156,0.008787,0.386409,0.117648


In [170]:
def initialise_unsupervised_naive_bayes(dataset, class_label):
        no_of_rows = dataset.shape[0]
        no_of_classes = len(dataset[class_label].unique())

        # sample no_of_rows times from a uniform distribution (uniform prior)
        sample_matrix = np.zeros((no_of_rows, no_of_classes))
        for i in range(no_of_rows):
            samples = np.random.uniform(0, 1, no_of_classes)
            samples /= sum(samples)  # normalise so it sums to 1
            sample_matrix[i] = samples

        # create new columns
        sample_matrix_transpose = np.transpose(sample_matrix)
        for i in range(no_of_classes):
            # add a new column, using a vector from the matrix as the data
            class_ = dataset[class_label].unique()[i]
            row_values = sample_matrix_transpose[i]
            dataset[class_] = row_values

In [152]:
# This function should build an unsupervised NB model and return a count
# @param class_column = possible class name (weak unsupervised model)
# @param attribute_column = attributes that are used for calculation
# @param dataset = data that are used to create the unsupervised NB classifier (format after running initialise_unsupervised_naive_bayes function)
# @param class_label = column name of the class that we want to classify
# @return count_prior = dictionary describing prior count of the class in training data
# @return count_posterior = dictionary of dictionaries posterior count
def train_count_unsupervised(class_column, attribute_column, dataset, class_label):
    # Calculate prior (dictionary_prior)
    # Initiate python dictionary with the number of class in the training data as it's key
    count_prior = {}
    for unique_class in class_column:
        count_prior[unique_class] = 0
    
    # Loop through the training data and sum the probability
    for index, row in dataset.iterrows():
        for unique_class in class_column:
            count_prior[unique_class] += row[unique_class]
    
    # Calculate count posterior (dictionary_posterior), the data structure used are dictionary
    # of dictionary of dictionaries
    count_posterior = {}
    
    # Setup the dictionary component
    for col in attribute_column:
        count_posterior[col] = {}
        for unique_class in class_column:
            count_posterior[col][unique_class] = {}
            for unique_col in dataset[col].unique():
                count_posterior[col][unique_class][unique_col] = 0
    
    # Now use the training data to perform calculation
    for index, row in dataset.iterrows():
        for col in attribute_column:
            for unique_class in class_column:
                count_posterior[col][unique_class][row[col]] += row[unique_class]
   
    return((count_prior, count_posterior))

display(training2_df)
unsupervised_df = initialise_unsupervised_naive_bayes(training2_df, "class")
unsupervised_df['unacc'] = [0.09899670531767234,
                    0.32932573681422983,
                    0.33410576906864276,
                    0.12666018046468697,
                    0.01841670419573832,
                    0.25575616577323973,
                    0.3574805722747498]
unsupervised_df['vgood'] = [0.4222035619811808,
                    0.135427490167598,
                    0.4410694638512566,
                    0.47875032937958656,
                    0.17592750857716227,
                    0.36930076206948437,
                    0.36509255048732786]
unsupervised_df['good'] = [0.323166195312893,
                    0.36364365470065596,
                    0.10710165369961186,
                    0.29369226646774543,
                    0.5121058802348358,
                    0.13377844815404982,
                    0.1543820041187137]
unsupervised_df['acc'] =   [0.15563353738825392,
                    0.17160311831751635,
                    0.11772311338048874,
                    0.10089722368798107,
                    0.2935499069922637,
                    0.24116462400322605,
                    0.12304487311920863]
display(unsupervised_df)
test_model = train_probability_unsupervised(training2_df["class"].unique(), CAR_COLUMN[:-1], unsupervised_df, "class")

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,high,high,3,4,small,low,unacc
1,high,vhigh,4,2,small,med,unacc
2,low,high,5more,more,big,med,acc
3,vhigh,low,5more,4,med,high,acc
4,vhigh,med,2,4,big,low,unacc
5,low,med,3,4,big,med,good
6,low,med,5more,4,big,high,vgood


Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,unacc,acc,good,vgood
0,high,high,3,4,small,low,0.098997,0.155634,0.323166,0.422204
1,high,vhigh,4,2,small,med,0.329326,0.171603,0.363644,0.135427
2,low,high,5more,more,big,med,0.334106,0.117723,0.107102,0.441069
3,vhigh,low,5more,4,med,high,0.12666,0.100897,0.293692,0.47875
4,vhigh,med,2,4,big,low,0.018417,0.29355,0.512106,0.175928
5,low,med,3,4,big,med,0.255756,0.241165,0.133778,0.369301
6,low,med,5more,4,big,high,0.357481,0.123045,0.154382,0.365093


In [120]:
# This function should build unsupervised NB model and return a probability
# @param class_column = possible class name (weak unsupervised model)
# @param attribute_column = attributes that are used for calculation
# @param dataset = data that are used to create the unsupervised NB classifier (format after running initialise_unsupervised_naive_bayes function)
# @param class_label = column name of the class that we want to classify
# @return probability_prior = dictionary describing prior probability of the class in training data,
# @return probability_posterior = dictionary of dictionaries posterior probability
def train_probability_unsupervised(class_column, attribute_column, dataset, class_label):
    (count_prior, count_posterior) = train_count_unsupervised(class_column, attribute_column, dataset, class_label)
    
    # Now calculate the probability of each instances, (i.e. 'Cough': {'flu': {'yes': 0.3, 'no': 0}, 'cold': {'yes': 0.1, 'no': 0.1}}
    # will have P(cough = yes | flu) = 0.3/0.3, P(cough = no | flu) = 0/0.3 and P(cough = yes | cold) = 0.1/0.2, P(cough = no | cold) = 0.1/0.2
    
    # First calculate the prior probability of the class P(c)
    probability_prior = {}
    sum_instance = sum(count_prior.values())
    for unique_class in class_column:
        probability_prior[unique_class] = count_prior[unique_class] / sum_instance
        
        # Perform epsilon smoothing
        if (count_prior[unique_class] == 0.0):
            probability_prior[unique_class] = EPSILON
    
    # Calculate the posterior probability
    probability_posterior = count_posterior
                
    # Now calculate the posterior probability
    for col in attribute_column:
        for unique_class in class_column:
            sum_instance = sum(probability_posterior[col][unique_class].values())
            for unique_col in dataset[col].unique():
                probability_posterior[col][unique_class][unique_col] /= sum_instance
                
                # Perform epsilon smoothing
                if (probability_posterior[col][unique_class][unique_col] == 0):
                    probability_posterior[col][unique_class][unique_col] = EPSILON
    
    return((probability_prior, probability_posterior))

(a,b) = train_probability_unsupervised(car_df["class"].unique(), CAR_COLUMN[:-1], unsupervised_df, "class")
display(a)
display(b)

{'acc': 0.24955344019011938,
 'good': 0.12401181345097781,
 'unacc': 0.5048090690797541,
 'vgood': 0.12162567727914882}

{'buying': {'acc': {'high': 0.25109091849297427,
   'low': 0.2580010336567641,
   'med': 0.24994049682515582,
   'vhigh': 0.2409675510251058},
  'good': {'high': 0.24530848037278,
   'low': 0.2360132170724615,
   'med': 0.2662454120877231,
   'vhigh': 0.25243289046703543},
  'unacc': {'high': 0.25075500854917565,
   'low': 0.2527038484031028,
   'med': 0.24540776953472648,
   'vhigh': 0.25113337351299514},
  'vgood': {'high': 0.2494115241901702,
   'low': 0.23662218052441458,
   'med': 0.2526180803431106,
   'vhigh': 0.2613482149423045}},
 'doors': {'acc': {'2': 0.24643009761614507,
   '3': 0.24754849503184564,
   '4': 0.25451104917605355,
   '5more': 0.2515103581759558},
  'good': {'2': 0.2430821371834849,
   '3': 0.2484476658869288,
   '4': 0.23610681048580837,
   '5more': 0.27236338644377805},
  'unacc': {'2': 0.2522313915919078,
   '3': 0.2509571075786552,
   '4': 0.25112075705295434,
   '5more': 0.2456907437764827},
  'vgood': {'2': 0.2551169410586158,
   '3': 0.25264033622080306,

## Predict Unsupervised

In [160]:
# This function should predict the class for a set of instances, based on a trained model 
# @param dataset = data that are used to calculate prediction
# @param class_column = possible class name (weak unsupervised model)
# @param attribute_column = attributes that are used for calculation
# @param class_label = attribute that we want to classify using naive bayes
# @param model = tuple consisting probability_prior and probability_posterior. 
# @return test_class = the class predicted by the naive bayes classifier. The predict class will change the structure of dataset to be used for the next iteration.
def predict_unsupervised(dataset, class_column, attribute_column, class_label, model):
    prior_probability = model[PRIOR_INDEX]
    posterior_probability = model[POSTERIOR_INDEX]
    test_class = [] # used to capture test result
    
    # Get the answer for every test instance
    for index, row in dataset.iterrows():
        # Initiate dictionary capturing the values calculated by naive bayes model
        test_value = {}
        for unique_class in class_column:
            test_value[unique_class] = 0

        # Calculate for each class using the naive bayes model (log model for multiplication)
        for unique_class in class_column:
            test_value[unique_class] = np.log(prior_probability[unique_class])
            for col in attribute_column:
                test_value[unique_class] += np.log(posterior_probability[col][unique_class][row[col]])
            
        # After calculating all of the possible class, we want to choose the maximum
        maximum_class = class_column[0]
        maximum_value = test_value[maximum_class]
        for key, value in test_value.items():
            if (value > maximum_value):
                maximum_value = value
                maximum_class = key
    
        # Append result
        test_class.append(maximum_class)
        # Change the dataset structure for the instance to prepare for the next iteration
        # First take the exponent of that to get the real probability calculation value
        for unique_class in class_column:
            test_value[unique_class] = np.exp(test_value[unique_class])
        
        # Calculate the new probability
        denominator_new = sum(test_value.values())
        
        for unique_class in class_column:
            dataset.loc[index, unique_class] = test_value[unique_class] / denominator_new
    
    # Return the classifier for the class
    return test_class

test_result = predict_unsupervised(unsupervised_df, ["vgood", "unacc", "acc", "good"], CAR_COLUMN[:-1], "class", test_model)
display(unsupervised_df)
display(confusion_matrix_unsupervised(training2_df["class"], test_result,["vgood", "unacc", "acc", "good"]))

test_model = train_probability_unsupervised(training2_df["class"].unique(), CAR_COLUMN[:-1], unsupervised_df, "class")
test_result2 = predict_unsupervised(unsupervised_df, ["vgood", "unacc", "acc", "good"], CAR_COLUMN[:-1], "class", test_model)
display(unsupervised_df)
display(confusion_matrix_unsupervised(training2_df["class"], test_result2,["vgood", "unacc", "acc", "good"]))

test_model = train_probability_unsupervised(training2_df["class"].unique(), CAR_COLUMN[:-1], unsupervised_df, "class")
test_result3 = predict_unsupervised(unsupervised_df, ["vgood", "unacc", "acc", "good"], CAR_COLUMN[:-1], "class", test_model)
display(unsupervised_df)
display(confusion_matrix_unsupervised(training2_df["class"], test_result3,["vgood", "unacc", "acc", "good"]))

# print(training_df)

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,unacc,acc,good,vgood
0,high,high,3,4,small,low,4.488917e-05,0.004841778,0.9951131,2.138274e-07
1,high,vhigh,4,2,small,med,0.940898,2.155676e-09,0.05910195,2.442198e-26
2,low,high,5more,more,big,med,0.97947,1.891122e-05,5.253066e-09,0.02051113
3,vhigh,low,5more,4,med,high,1.387297e-18,2.219895e-11,4.236005e-09,1.0
4,vhigh,med,2,4,big,low,3.221029e-16,0.9774314,0.02240915,0.0001594701
5,low,med,3,4,big,med,0.04411001,0.7999923,0.002000365,0.1538974
6,low,med,5more,4,big,high,0.003273888,0.0008653209,3.104657e-07,0.9958605


Unnamed: 0,Vgood Predicted,Unacc Predicted,Acc Predicted,Good Predicted
Vgood Actual,1,0,0,0
Unacc Actual,0,1,1,1
Acc Actual,1,1,0,0
Good Actual,0,0,1,0


Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,unacc,acc,good,vgood
0,high,high,3,4,small,low,3.526425e-09,1.137173e-08,1.0,1.312937e-21
1,high,vhigh,4,2,small,med,0.9998045,2.121938e-31,0.0001955169,4.910483e-92
2,low,high,5more,more,big,med,0.9999364,7.760326e-11,5.1640299999999995e-20,6.356113e-05
3,vhigh,low,5more,4,med,high,1.792177e-56,2.028965e-28,1.573645e-31,1.0
4,vhigh,med,2,4,big,low,1.24629e-39,0.9999993,7.242867e-07,5.21263e-09
5,low,med,3,4,big,med,4.268972e-05,0.9883932,3.107869e-07,0.01156376
6,low,med,5more,4,big,high,9.13272e-07,6.774383e-07,2.9624229999999996e-19,0.9999984


Unnamed: 0,Vgood Predicted,Unacc Predicted,Acc Predicted,Good Predicted
Vgood Actual,1,0,0,0
Unacc Actual,0,1,1,1
Acc Actual,1,1,0,0
Good Actual,0,0,1,0


Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,unacc,acc,good,vgood
0,high,high,3,4,small,low,2.0529949999999998e-19,9.367204e-26,1.0,4.035424e-58
1,high,vhigh,4,2,small,med,1.0,6.289664e-109,2.341679e-14,1.15327e-318
2,low,high,5more,more,big,med,1.0,6.018153e-25,1.80857e-53,4.669206e-11
3,vhigh,low,5more,4,med,high,2.0404150000000002e-162,4.948324000000001e-69,7.559739e-105,1.0
4,vhigh,med,2,4,big,low,1.287422e-96,1.0,2.219949e-24,6.713196e-18
5,low,med,3,4,big,med,2.078147e-14,0.9999652,2.667183e-22,3.480235e-05
6,low,med,5more,4,big,high,2.146863e-16,4.537449e-13,1.355675e-55,1.0


Unnamed: 0,Vgood Predicted,Unacc Predicted,Acc Predicted,Good Predicted
Vgood Actual,1,0,0,0
Unacc Actual,0,1,1,1
Acc Actual,1,1,0,0
Good Actual,0,0,1,0


## Evaluate Unsupervised

In [110]:
# This function calculate the accuracy based on the confusion matrix that are given
# @param confusion_matrix = the confusion matrix for unsupervised
# @return accuracy = accuracy of the unsupervised
def evaluate_unsupervised(confusion_matrix):
    total_instance = 0
    true_positive = 0
    column = list(confusion_matrix.columns)
    
    for index, row in confusion_matrix.iterrows():
        current_max = row[column[0]]
        for col in column:
            if (row[col] > current_max):
                current_max = row[col]
            total_instance += row[col]
        
        true_positive += current_max
    
    return true_positive/total_instance

In [111]:
# This function create a confusion matrix for unsupervised
# @param true_test_result = list displaying the real value of the test result
# @param predicted_test_result = list displaying the prediction
# @param class_column = all possible classes
def confusion_matrix_unsupervised(true_test_result, predicted_test_result, class_column):
    if (len(true_test_result) != len(predicted_test_result)):
        print("Error, different length.")
    else:
        # Create a pandas dataframe actual is the row, predicted is the column
        confusion_df = pd.DataFrame()
        
        for unique_class in class_column:
            confusion_df[unique_class] = [0 for i in range(len(class_column))]
        
        # Change index for df
        confusion_df.index = class_column
        
        # Calculate the confusion matrix
        for i in range(len(true_test_result)):
            confusion_df.loc[true_test_result[i], predicted_test_result[i]] += 1
            
        # Add actual and predicted description on the table to make it easier to see
        predicted_column = []
        for string in confusion_df.columns:
            string += " predicted"
            predicted_column.append(string.title())
       
        actual_row = []
        for string in class_column:
            string += " actual"
            actual_row.append(string.title())
        
        confusion_df.columns = predicted_column
        confusion_df.index = actual_row
        
        return confusion_df

In [175]:
# Using the car dataset
print("Car dataset analysis".title())
unsupervised_df = pd.read_csv("debug_out.csv")
unsupervised_df.drop(["Unnamed: 0", "6"], axis=1, inplace=True)
unsupervised_df.columns = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "good", "vgood", "acc", "unacc"]
# display(unsupervised_df)
# print("Iteration 0")
for i in range(ITERATION):
    # Train and give prediction and calculate accuracy
    print("Iteration {}".format(i+1))
    test_model = train_probability_unsupervised(car_df["class"].unique(), CAR_COLUMN[:-1], unsupervised_df, "class")
#     display(test_model)
    test_result2 = predict_unsupervised(unsupervised_df, ["vgood", "unacc", "acc", "good"], CAR_COLUMN[:-1], "class", test_model)
    
    confusion_matrix = confusion_matrix_unsupervised(car_df["class"], test_result2,["vgood", "unacc", "acc", "good"])
    display(confusion_matrix)
#     display(unsupervised_df)
    print("The accuracy of car dataset based on confusion matrix is {}.".format(evaluate_unsupervised(confusion_matrix)))
    print("\n")

Car Dataset Analysis
Iteration 1


Unnamed: 0,Vgood Predicted,Unacc Predicted,Acc Predicted,Good Predicted
Vgood Actual,1,5,32,27
Unacc Actual,231,517,255,207
Acc Actual,24,78,195,87
Good Actual,1,30,20,18


The accuracy of car dataset based on confusion matrix is 0.4479166666666667.
Iteration 2


Unnamed: 0,Vgood Predicted,Unacc Predicted,Acc Predicted,Good Predicted
Vgood Actual,1,5,32,27
Unacc Actual,230,516,257,207
Acc Actual,24,78,195,87
Good Actual,1,30,20,18


The accuracy of car dataset based on confusion matrix is 0.44733796296296297.
Iteration 3


Unnamed: 0,Vgood Predicted,Unacc Predicted,Acc Predicted,Good Predicted
Vgood Actual,1,5,32,27
Unacc Actual,230,516,257,207
Acc Actual,24,78,195,87
Good Actual,1,30,20,18


The accuracy of car dataset based on confusion matrix is 0.44733796296296297.
Iteration 4


Unnamed: 0,Vgood Predicted,Unacc Predicted,Acc Predicted,Good Predicted
Vgood Actual,1,5,32,27
Unacc Actual,230,516,257,207
Acc Actual,24,78,195,87
Good Actual,1,30,20,18


The accuracy of car dataset based on confusion matrix is 0.44733796296296297.
Iteration 5


Unnamed: 0,Vgood Predicted,Unacc Predicted,Acc Predicted,Good Predicted
Vgood Actual,1,5,32,27
Unacc Actual,230,516,257,207
Acc Actual,24,78,195,87
Good Actual,1,30,20,18


The accuracy of car dataset based on confusion matrix is 0.44733796296296297.


## Main Program

In [177]:
# Using the breast cancer dataset
print("Breast cancer dataset".title())
attribute_column = BREAST_CANCER_COLUMN
breast_df = preprocess(BREAST_CANCER, BREAST_CANCER_COLUMN)
unsupervised_df = initialise_unsupervised_naive_bayes(breast_df, "class")
# print("Iteration 0")
# display(unsupervised_df)
for i in range(ITERATION):
    # Train and give prediction and calculate accuracy
    print("Iteration {}".format(i+1))
    model = train_probability_unsupervised(breast_df["class"].unique(), attribute_column[:-1], unsupervised_df, "class")
    predicted_test_result = predict_unsupervised(unsupervised_df, breast_df["class"].unique(), attribute_column[:-1], "class", model)
    confusion_matrix = confusion_matrix_unsupervised(list(breast_df["class"]), predicted_test_result, breast_df["class"].unique())
#     display(unsupervised_df)
    display(confusion_matrix)
    print("The accuracy of breast cancer dataset based on confusion matrix is {}.".format(evaluate_unsupervised(confusion_matrix)))
    print("\n")

# Using the car dataset
print("Car dataset analysis".title())
unsupervised_df = pd.read_csv("debug_out.csv")
unsupervised_df.drop(["Unnamed: 0", "6"], axis=1, inplace=True)
unsupervised_df.columns = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "good", "vgood", "acc", "unacc"]
# display(unsupervised_df)
# print("Iteration 0")
for i in range(ITERATION):
    # Train and give prediction and calculate accuracy
    print("Iteration {}".format(i+1))
    test_model = train_probability_unsupervised(car_df["class"].unique(), CAR_COLUMN[:-1], unsupervised_df, "class")
#     display(test_model)
    test_result2 = predict_unsupervised(unsupervised_df, ["vgood", "unacc", "acc", "good"], CAR_COLUMN[:-1], "class", test_model)
    
    confusion_matrix = confusion_matrix_unsupervised(car_df["class"], test_result2,["vgood", "unacc", "acc", "good"])
    display(confusion_matrix)
#     display(unsupervised_df)
    print("The accuracy of car dataset based on confusion matrix is {}.".format(evaluate_unsupervised(confusion_matrix)))
    print("\n")

# Using the hypothyroid dataset
print("Hypothyroid dataset analysis".title())
attribute_column = HYPOTHYROID_COLUMN
hypo_df = preprocess(HYPOTHYROID, HYPOTHYROID_COLUMN)
unsupervised_df = initialise_unsupervised_naive_bayes(hypo_df, "class")
# print("Iteration 0")
# display(unsupervised_df)
for i in range(ITERATION):
    # Train and give prediction and calculate accuracy
    print("Iteration {}".format(i+1))
    model = train_probability_unsupervised(hypo_df["class"].unique(), attribute_column[:-1], unsupervised_df, "class")
    predicted_test_result = predict_unsupervised(unsupervised_df, hypo_df["class"].unique(), attribute_column[:-1], "class", model)
    confusion_matrix = confusion_matrix_unsupervised(list(hypo_df["class"]), predicted_test_result, hypo_df["class"].unique())
#     display(unsupervised_df)
    display(confusion_matrix)
    print("The accuracy of hypothyroid dataset based on confusion matrix is {}.".format(evaluate_unsupervised(confusion_matrix)))
    print("\n")

# Using the mushroom dataset
print("Mushroom dataset analysis".title())
attribute_column = MUSHROOM_COLUMN
mushroom_df = preprocess(MUSHROOM, MUSHROOM_COLUMN)
unsupervised_df = initialise_unsupervised_naive_bayes(mushroom_df, "class")
# print("Iteration 0")
# display(unsupervised_df)
for i in range(ITERATION):
    # Train and give prediction and calculate accuracy
    print("Iteration {}".format(i+1))
    model = train_probability_unsupervised(mushroom_df["class"].unique(), attribute_column[:-1], unsupervised_df, "class")
    predicted_test_result = predict_unsupervised(unsupervised_df, mushroom_df["class"].unique(), attribute_column[:-1], "class", model)
    confusion_matrix = confusion_matrix_unsupervised(list(mushroom_df["class"]), predicted_test_result, mushroom_df["class"].unique())
#     display(unsupervised_df)
    display(confusion_matrix)
    print("The accuracy of mushroom dataset based on confusion matrix is {}.".format(evaluate_unsupervised(confusion_matrix)))
    print("\n")

Breast Cancer Dataset
Iteration 1


AttributeError: 'NoneType' object has no attribute 'iterrows'

Questions (you may respond in a cell or cells below):

1. Since we’re starting off with random guesses, it might be surprising that the unsupervised NB works at all. Explain what characteristics of the data cause it to work pretty well (say, within 10% Accuracy of the supervised NB) most of the time; also, explain why it utterly fails sometimes.
2. When evaluating supervised NB across the four different datasets, you will observe some variation in effectiveness (e.g. Accuracy). Explain what causes this variation. Describe and explain any particularly suprising results.
3. Evaluating the model on the same data that we use to train the model is considered to be a major mistake in Machine Learning. Implement a hold–out (hint: check out numpy.shuffle()) or cross–validation evaluation strategy. How does your estimate of Accuracy change, compared to testing on the training data? Explain why. (The result might surprise you!)
4. Implement one of the advanced smoothing regimes (add-k, Good-Turing). Do you notice any variation in the predictions made by either the supervised or unsupervised NB classifiers? Explain why, or why not.
5. The lecture suggests that deterministically labelling the instances in the initialisation phase of the unsupervised NB classifier “doesn’t work very well”. Confirm this for yourself, and then demonstrate why.
6. Rather than evaluating the unsupervised NB classifier by assigning a class deterministically, instead calculate how far away the probabilistic estimate of the true class is from 1 (where we would be certain of the correct class), and take the average over the instances. Does this performance estimate change, as we alter the number of iterations in the method? Explain why.
7. Explore what causes the unsupervised NB classifier to converge: what proportion of instances change their prediction from the random assignment, to the first iteration? From the first to the second? What is the latest iteration where you observe a prediction change? Make some conjecture(s) as to what is occurring here.

Don't forget that groups of 1 student should respond to question (1), and one other question. Groups of 2 students should respond to question (1), and three other questions. Your responses should be about 100-200 words each.