# Project 1: What is labelled data worth to Naive Bayes?
---
###### Student Name(s): Maleakhi Agung Wijaya, Chirag Rao Sahib

## Initialisation

In [1]:
# Library
import pandas as pd
import numpy as np

In [22]:
# Data path constants
BREAST_CANCER = "2018S1-proj1_data/breast-cancer-dos.csv"
CAR = "2018S1-proj1_data/car-dos.csv"
HYPOTHYROID = "2018S1-proj1_data/hypothyroid-dos.csv"
MUSHROOM = "2018S1-proj1_data/mushroom-dos.csv"

# Column names for each data set
BREAST_CANCER_COLUMN = ["age", "menopause", "tumor-size", "inv-nodes", "node-caps", "deg-malig", "breast", "breast-quad", "irradiat", "class"]
CAR_COLUMN = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
HYPOTHYROID_COLUMN = ["sex", "on_thyroxine", "query_on_thyroxine", "on_antithyroid_medication", "thyroid_surgery", "query_hypothyroid", "query_hyperthyroid", "pregnant", "sick", "tumor", "lithium", "goitre", "TSH_measured", "T3_measured", "TT4_measured", "T4U_measured", "FTI_measured", "TBG_measured", "class"]
MUSHROOM_COLUMN = ["cap-shape", "cap-surface", "cap-color", "bruises", "odor", "gill-attachment", "gill-spacing", "gill-size", "gill-color", "stalk-shape", "stalk-root", "stalk-surface-above-ring", "stalk-surface-below-ring", "stalk-color-above-ring", "stalk-color-below-ring", "veil-type", "veil-color", "ring-number", "ring-type", "spore-print-color", "population", "habitat", "class"]

In [105]:
# Toy data set used to check algorithm correctness
df = pd.DataFrame(data={
    "Headache": ["severe", "no", "mild", "mild", "severe"],
    "Sore": ["mild", "severe", "mild", "no", "severe"],
    "Temperature": ["high", "normal", "normal", "normal", "normal"],
    "Cough": ["yes", "yes", "yes", "no", "yes"],
    "class": ["flu", "cold", "flu", "cold", "flu"],
})
df

,Cough,Headache,Sore,Temperature,class
0,yes,severe,mild,high,flu
1,yes,no,severe,normal,cold
2,yes,mild,mild,normal,flu
3,no,mild,no,normal,cold
4,yes,severe,severe,normal,flu


## Preprocess

In [58]:
# Open a data file in CSV format and transform it into a usable format
# @param data = path of the CSV file to open
# @param columns = list of column names to use as the header
# @param eliminate = if True, drop every instance that contains a missing value ("?")
#                    (recommended when only a few instances have missing values)
# @return df = clean pandas DataFrame object
def preprocess(data, columns, eliminate=True):
    # Read the file (it has no header row) and attach the column names
    df = pd.read_csv(data, header=None)
    df.columns = columns

    # When eliminate is False, missing values are kept as they are
    if eliminate:
        # Keep only the rows in which no attribute equals "?"
        has_missing = (df == "?").any(axis=1)
        df = df[~has_missing]

    # Return the clean data
    return df

## Train Supervised

In [101]:
# Build a supervised NB model and return the raw counts
# @param train_data = training data used to create the supervised NB classifier
# @param class_label = name of the column that holds the class to predict
# @return count_prior = dictionary mapping each class to its number of training instances,
#         count_posterior = nested dictionary {attribute: {class: {value: count}}} of class-conditional counts
def train_count_supervised(train_data, class_label):
    classes = train_data[class_label].unique()

    # Prior counts: initialise a dictionary with each class in the training data as its key
    count_prior = {}
    for unique_class in classes:
        count_prior[unique_class] = 0

    # Loop through the training data and count the instances of every class
    for index, row in train_data.iterrows():
        count_prior[row[class_label]] += 1

    # Class-conditional counts, stored as a dictionary of dictionaries of dictionaries.
    # These counts estimate P(attribute = value | class), i.e. the likelihoods;
    # the name count_posterior is kept so that existing callers keep working.
    count_posterior = {}

    # Set up the nested dictionaries with every count at zero
    column_name = list(train_data.columns)
    column_name.remove(class_label)
    for col in column_name:
        count_posterior[col] = {}
        for unique_class in classes:
            count_posterior[col][unique_class] = {}
            for unique_col in train_data[col].unique():
                count_posterior[col][unique_class][unique_col] = 0

    # Now fill in the counts from the training data
    for index, row in train_data.iterrows():
        for col in column_name:
            count_posterior[col][row[class_label]][row[col]] += 1

    return count_prior, count_posterior

# Build a supervised NB model and return probabilities
# @param train_data = training data used to create the supervised NB classifier
# @param class_label = name of the column that holds the class to predict
# @return probability_prior = dictionary mapping each class c to its prior probability P(c),
#         probability_posterior = nested dictionary {attribute: {class: {value: P(attribute = value | class)}}}
def train_probability_supervised(train_data, class_label):
    (count_prior, count_posterior) = train_count_supervised(train_data, class_label)

    # Worked example: the counts 'Cough': {'flu': {'yes': 3, 'no': 0}, 'cold': {'yes': 1, 'no': 1}}
    # give P(cough = yes | flu) = 3/3, P(cough = no | flu) = 0/3,
    # P(cough = yes | cold) = 1/2 and P(cough = no | cold) = 1/2

    # Prior probability of each class: P(c) = count(c) / number of instances
    probability_prior = {}
    sum_instance = sum(count_prior.values())
    for unique_class, count in count_prior.items():
        probability_prior[unique_class] = count / sum_instance

    # Class-conditional probabilities: normalise the counts of each attribute within each class.
    # A new dictionary is built so that count_posterior is left unmodified.
    probability_posterior = {}
    for col, class_counts in count_posterior.items():
        probability_posterior[col] = {}
        for unique_class, value_counts in class_counts.items():
            sum_instance = sum(value_counts.values())
            probability_posterior[col][unique_class] = {
                value: count / sum_instance for value, count in value_counts.items()
            }

    return probability_prior, probability_posterior
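
The estimates computed by `train_probability_supervised` are plain relative frequencies. As a compact summary, writing $N$ for the number of training instances and $\mathrm{count}(\cdot)$ for the number of training instances that match a condition (both symbols are notation introduced here, not names from the code), and assuming no attribute value is missing:

$$
\hat{P}(c) = \frac{\mathrm{count}(c)}{N},
\qquad
\hat{P}(x_j = v \mid c) = \frac{\mathrm{count}(x_j = v,\; c)}{\mathrm{count}(c)}
$$

For the toy data set above, this gives $\hat{P}(\text{flu}) = 3/5$ and $\hat{P}(\text{Cough} = \text{yes} \mid \text{flu}) = 3/3$.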

In [None]:
# This function should predict the class for a set of instances, based on a trained model 
def predict_supervised():
    return
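
A minimal sketch of how the prediction step could look, assuming the model is the `(probability_prior, probability_posterior)` pair returned by `train_probability_supervised` and that each instance is a dictionary mapping attribute names to values. The name `predict_supervised_sketch` and its signature are illustrative assumptions, not part of the project skeleton. Scores are summed in log space, which avoids numerical underflow when there are many attributes.

```python
import math


def predict_supervised_sketch(instances, probability_prior, probability_posterior):
    """Return the most probable class for each instance (hypothetical helper).

    instances: list of dicts {attribute: value}, without the class column.
    probability_prior: {class: P(class)}
    probability_posterior: {attribute: {class: {value: P(value | class)}}}
    """
    predictions = []
    for instance in instances:
        best_class, best_score = None, -math.inf
        for c, prior in probability_prior.items():
            # log P(c) + sum of log P(value | c); a zero probability maps to -inf
            score = math.log(prior) if prior > 0 else -math.inf
            for att, value in instance.items():
                # Values never seen for this attribute get probability 0 (no smoothing)
                p = probability_posterior[att][c].get(value, 0.0)
                score += math.log(p) if p > 0 else -math.inf
            if best_class is None or score > best_score:
                best_class, best_score = c, score
        predictions.append(best_class)
    return predictions
```

Because no smoothing is applied here, a single unseen attribute value rules a class out entirely. If every class is ruled out, the sketch falls back to the first class it examined.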

In [None]:
# This function should evaluate a set of predictions, in a supervised context 
def evaluate_supervised():
    return
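
One simple way to evaluate supervised predictions is Accuracy, the metric named in the questions of this notebook. The sketch below assumes the predictions and the true labels are equal-length sequences in the same order; the name `accuracy_sketch` is a hypothetical helper, not part of the project skeleton.

```python
def accuracy_sketch(predicted, actual):
    """Fraction of instances whose predicted class equals the true class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(actual) == 0:
        # Assumption: report 0.0 rather than dividing by zero on empty input
        return 0.0
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)
```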

In [None]:
# This function should build an unsupervised NB model 
def train_unsupervised():
    return
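
The questions in this notebook describe the unsupervised classifier as starting from random guesses and then iterating. The sketch below is one possible reading of that description, not a confirmed version of the lecture's method: every instance starts with a random class distribution, the model is re-estimated from the resulting fractional counts, and the distributions are then recomputed from the model. The name `train_unsupervised_sketch` and the parameters `n_classes`, `n_iterations`, and `seed` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def train_unsupervised_sketch(data, n_classes, n_iterations=5, seed=0):
    """Soft-assignment NB on a DataFrame of categorical attributes (no class column).

    Returns (prior, likelihood, weights), where weights[i, c] is the estimated
    probability that instance i belongs to class c.
    """
    if n_iterations < 1:
        raise ValueError("n_iterations must be at least 1")
    rng = np.random.default_rng(seed)
    n = len(data)

    # Random initial class distribution for every instance; each row sums to 1
    weights = rng.random((n, n_classes))
    weights /= weights.sum(axis=1, keepdims=True)

    for _ in range(n_iterations):
        # Re-estimate the model from the fractional (soft) counts
        class_totals = weights.sum(axis=0)
        safe_totals = np.where(class_totals > 0, class_totals, 1.0)
        prior = class_totals / n
        likelihood = {}
        for col in data.columns:
            values = data[col].to_numpy()
            likelihood[col] = {
                v: weights[values == v].sum(axis=0) / safe_totals
                for v in pd.unique(values)
            }

        # Re-estimate every instance's class distribution from the model
        scores = np.tile(prior, (n, 1))
        for col in data.columns:
            scores *= np.vstack([likelihood[col][v] for v in data[col]])
        row_totals = scores.sum(axis=1, keepdims=True)
        safe_rows = np.where(row_totals > 0, row_totals, 1.0)
        # Rows whose scores are all zero keep their previous distribution
        weights = np.where(row_totals > 0, scores / safe_rows, weights)

    return prior, likelihood, weights
```

The class indices in `weights` are arbitrary, so some mapping from indices to the true class labels is still needed before the result can be compared with the labelled data.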

In [None]:
# This function should predict the class distribution for a set of instances, based on a trained model
def predict_unsupervised():
    return

In [None]:
# This function should evaluate a set of predictions, in an unsupervised manner
def evaluate_unsupervised():
    return
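
Question 6 in this notebook suggests a probabilistic evaluation: measure how far the estimated probability of the true class is from 1 and average over the instances. A minimal sketch follows, assuming the class distributions form an array with one row per instance and one column per class, and that `class_names` gives the class label of each column. The function name and all three parameters are hypothetical.

```python
import numpy as np


def mean_distance_from_certainty(class_distributions, class_names, true_labels):
    """Average of (1 - P(true class)) over all instances; 0 means always certain and correct."""
    class_distributions = np.asarray(class_distributions, dtype=float)
    # Map every class label to its column (assumed ordering, supplied by the caller)
    column_of = {name: i for i, name in enumerate(class_names)}
    true_columns = np.array([column_of[label] for label in true_labels])
    # Pick, for each row, the probability assigned to that row's true class
    p_true = class_distributions[np.arange(len(true_columns)), true_columns]
    return float(np.mean(1.0 - p_true))
```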

Questions (you may respond in a cell or cells below):

1. Since we’re starting off with random guesses, it might be surprising that the unsupervised NB works at all. Explain what characteristics of the data cause it to work pretty well (say, within 10% Accuracy of the supervised NB) most of the time; also, explain why it utterly fails sometimes.
2. When evaluating supervised NB across the four different datasets, you will observe some variation in effectiveness (e.g. Accuracy). Explain what causes this variation. Describe and explain any particularly surprising results.
3. Evaluating the model on the same data that we use to train the model is considered to be a major mistake in Machine Learning. Implement a hold-out (hint: check out numpy.random.shuffle()) or cross-validation evaluation strategy. How does your estimate of Accuracy change, compared to testing on the training data? Explain why. (The result might surprise you!)
4. Implement one of the advanced smoothing regimes (add-k, Good-Turing). Do you notice any variation in the predictions made by either the supervised or unsupervised NB classifiers? Explain why, or why not.
5. The lecture suggests that deterministically labelling the instances in the initialisation phase of the unsupervised NB classifier “doesn’t work very well”. Confirm this for yourself, and then demonstrate why.
6. Rather than evaluating the unsupervised NB classifier by assigning a class deterministically, instead calculate how far away the probabilistic estimate of the true class is from 1 (where we would be certain of the correct class), and take the average over the instances. Does this performance estimate change, as we alter the number of iterations in the method? Explain why.
7. Explore what causes the unsupervised NB classifier to converge: what proportion of instances change their prediction from the random assignment, to the first iteration? From the first to the second? What is the latest iteration where you observe a prediction change? Make some conjecture(s) as to what is occurring here.
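
For the hold-out strategy mentioned in question 3, one possible sketch is shown below. It assumes the data is a pandas DataFrame such as the one returned by `preprocess`; the name `holdout_split_sketch` and the parameters `test_fraction` and `seed` are illustrative assumptions. It shuffles row positions with a seeded NumPy generator rather than shuffling the data in place.

```python
import numpy as np
import pandas as pd


def holdout_split_sketch(data, test_fraction=0.2, seed=0):
    """Randomly split a DataFrame into (train, test) parts."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    rng = np.random.default_rng(seed)
    # Random ordering of the row positions 0 .. len(data) - 1
    shuffled = rng.permutation(len(data))
    n_test = int(round(len(data) * test_fraction))
    test_positions, train_positions = shuffled[:n_test], shuffled[n_test:]
    return data.iloc[train_positions], data.iloc[test_positions]
```

The model would then be trained on the first part only and evaluated on the second, so that no test instance contributes to the counts.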

Don't forget that groups of 1 student should respond to question (1), and one other question. Groups of 2 students should respond to question (1), and three other questions. Your responses should be about 100-200 words each.
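
For the add-k smoothing named in question 4, a minimal sketch is given below. It assumes the input is the innermost dictionary produced by `train_count_supervised`, that is, the counts of each value of one attribute within one class. The name `add_k_probabilities` and the parameter `k` are hypothetical.

```python
def add_k_probabilities(value_counts, k=1.0):
    """Turn {value: count} into add-k smoothed probabilities that sum to 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Every value receives k extra pseudo-counts, so the total grows by k * (number of values)
    total = sum(value_counts.values()) + k * len(value_counts)
    return {value: (count + k) / total for value, count in value_counts.items()}
```

With `k=1`, the toy counts `{'yes': 3, 'no': 0}` for Cough given flu become probabilities 4/5 and 1/5, so an attribute value unseen for a class no longer forces that class's score to zero.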