
### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2020 Semester 1

## Assignment 1: Naive Bayes Classifiers

###### Submission deadline: 7 pm, Monday 20 Apr 2020

**Student Name(s):**    Andy Chen, Miow Fong Sim

**Student ID(s):**     903370, 881623

This iPython notebook is a template which you will use for your Assignment 1 submission.

Marking will be applied on the four functions that are defined in this notebook, and to your responses to the questions at the end of this notebook (Submitted in a separate PDF file).

**NOTE: YOU SHOULD ADD YOUR RESULTS, DIAGRAMS AND IMAGES FROM YOUR OBSERVATIONS IN THIS FILE TO YOUR REPORT (the PDF file).**

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find.

**Adding proper comments to your code is MANDATORY.**

# Function Definition

In [352]:
# For pre-processing
import pandas as pd
import numpy as np

# For smoothing
from collections import defaultdict

# For Gaussian pdf
import scipy.stats as stats

# For train/test split
import random

# For visualisation
import matplotlib.pyplot as plt
import seaborn as sns

In [353]:
###GIVEN DATASETS###

f1 = 'breast-cancer-wisconsin.data'
a1 = 'Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,class'.split(',')

f2 = 'mushroom.data'
a2 = 'class,cap-shape,cap-surface,cap-color,bruises?,odor,gill-attachment,gil-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat'.split(',')

f3 = 'lymphography.data'
a3 = 'class,lymphatics,block of affere,bl. of lymph. c,bl. of lymph. s,by pass,extravasates,regeneration of,early uptake in,lym.nodes dimin,lym.nodes enlar,changes in lym.,defect in node,changes in node,changes in stru,special forms,dislocation of,exclusion of no,no. of nodes in'.split(',')

f4 = 'wdbc.data'
a4 = 'ID number,class,Mean radius,Mean texture,Mean perimeter,Mean area,Mean smoothness,Mean compactness,Mean concavity,Mean concave points,Mean symmetry,Mean fractal dimension,Radius SE,Texture SE,Perimeter SE,Area SE,Smoothness SE,Compactness SE,Concavity SE,Concave points SE,Symmetry SE,Fractal dimension SE,Worst radius,Worst texture,Worst perimeter,Worst area,Worst smoothness,Worst compactness,Worst concavity,Worst concave points,Worst symmetry,Worst fractal dimension'.split(',')

f5 = 'wine.data'
a5 = 'class,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline'.split(',')

f6 = 'car.data'
a6 = 'buying,maint,doors,persons,lug_boot,safety,class'.split(',')

f7 = 'nursery.data'
a7 = 'parents,has_nurs,form,children,housing,finance,social,health,class'.split(',')

f8 = 'somerville.data'
a8 = 'class,city information availability,housing cost,public schools overall quality,trust in the local police,streets and sidewalks maintenance,social community events availability'.split(',')

f9 = 'adult.data'
a9 = 'age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class'.split(',')

f10 = 'bank.data'
a10 = 'age,job,marital,education,default,balance,housing,loan,contact,day,campaign,pdays,previous,poutcome,class'.split(',')

#Since there is no pre-defined class, let expenses be the class.
f11 = 'university.data'
a11 = 'University-name,State,Control,number-of-students,male:female (ratio),student:faculty (ratio),sat-verbal,sat-math,expenses,percent-financial-aid,number-of-applicants,percent-admittance,percent-enrolled,academics,social,quality-of-life,academic-emphasis'.split(',')

### DATASETS DICTIONARY ###
data = [f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11]
data_attributes = [a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11]

datasets_dictionary = {}
for i in range(len(data)):
    datasets_dictionary.update({data[i] : data_attributes[i]})

In [354]:
# This function should prepare the data by reading it from a file and converting it into a useful format for training and testing

#dataframe is created
#missing values are replaced with NaN

def preprocess(filename, cross_validation = False):
    '''
    Creates dataframe, replacing missing values
    
    Returns:
        - dataframe, if 10-fold cross validation is not performed
        - cross_val_dict, if 10-fold cross validation is performed  
    '''
    # Import file
    dataframe = pd.read_csv(filename,encoding = 'ISO-8859-1', header = None, names = datasets_dictionary[filename])
    
    # Fix missing values: replace the file's missing-value marker with NaN
    # (f11 uses '0' as its marker here; every other file uses '?')
    missing_marker = '0' if filename == f11 else '?'
    dataframe.replace(missing_marker, np.nan, inplace=True)

    # Report the number of instances (rows) that contain at least one missing value
    print(dataframe.shape[0] - dataframe.dropna().shape[0], 'missing values found.')
    
    # No cross validation
    if cross_validation == False:
        return dataframe
    
    # Cross validation
    else:
    
    ############### QUESTION 4: 10- FOLD CROSS VALIDATION ###############
    
        # Split data into 10 parts and store partitions in a list
        cross_val_parts = np.array_split(dataframe, 10)

        # 1st partition will be testing set
        # 2nd - 10th will be training set
        # 2nd partition will be testing set 
        # 1st, 3rd- 10th partitions will be training set 
        # ...
    
        # Initialising keys to cross_val_dict
        cross_val_dict = {'training': [], 'testing': []}
    
        k = 10
        
        for i in range(k):
            
            # cross_val_testing
            cross_val_testing = cross_val_parts[i]
            
            # cross_val_training
            left_training = cross_val_parts[:i]
            right_training = cross_val_parts[i+1:]
            
            cross_val_training = pd.concat(left_training + right_training)

            cross_val_dict['training'].append(cross_val_training)
            cross_val_dict['testing'].append(cross_val_testing)
        
        return cross_val_dict
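
As an alternative to replacing the marker after loading, `pd.read_csv` can convert it while parsing through its `na_values` parameter. The sketch below uses a small in-memory sample (the column names `x` and `y` and the three rows are made up for illustration) and counts rows and cells with missing values separately. Note one side effect, which is an assumption to check against each dataset: a numeric column that contains `?` is then parsed as a float column rather than as strings, which may change what `isNumericAttribute` reports for it.

```python
import io
import pandas as pd

# Hypothetical two-column sample; '?' marks a missing value
raw = io.StringIO("a,1\n?,2\nb,?\n")
sample = pd.read_csv(raw, header=None, names=["x", "y"], na_values="?")

# Rows containing at least one missing cell, and the total number of missing cells
rows_with_missing = int(sample.isna().any(axis=1).sum())
cells_missing = int(sample.isna().sum().sum())

print(rows_with_missing, "rows contain", cells_missing, "missing cells")
```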

In [355]:
# This function should calculate prior probabilities and likelihoods from the training data and use
# them to build a naive Bayes model

def train(data, smooth = (None, 0)):
    '''
    Calculates prior probabilities and likelihoods from training data.
    
    Arguments:
        data: dataFrame
            - rows are instances
            - columns are attributes / class label
            - instances with missing values have been omitted
        smooth: tuple of (dictionary associating each attribute with its number of unique values, smoothing parameter k; k = 1 gives Laplace smoothing)
    
    Returns: 
        Model, a single dictionary containing all necessary values for the naive-bayes classifier.
    '''
    
    Model = {}
    
    for classlabel in data['class'].unique():
        
        # Stores priors and likelihoods for a class
        Model[classlabel] = {}
        
        # Select all values of a single class, excluding the class column
        class_data = data[data['class'] == classlabel].drop('class', axis=1)
        
        # Calculate prior probability of the class label
        Model[classlabel]['prior'] = np.log(class_data.shape[0])-np.log(data.shape[0]) 
        
        # Note: Log probabilities are used to prevent problems associated with underflow
        
        # Calculate likelihood of data: P(attribute=x|c)
        Model[classlabel]['likelihood'] = {}
        
        for attribute in class_data.columns:
            
            # Numeric attribute which requires Gaussian pdf
            # Note: isNumericAttribute is a dataset-specific function to identify continuous attributes.
            if isNumericAttribute(class_data[attribute]):
                
                mu = class_data[attribute].mean()
                sig = class_data[attribute].std()
                assert(sig>0)
                
                # Store the mean and std of the distribution
                Model[classlabel]['likelihood'][attribute] = (mu, sig)
            
            # Nominal/binned data which uses relative frequencies
            else:
                # Smoothing: retrieve number of unique values for the attribute
                numUniqueValues = smooth[0][attribute]
                smoothingparameter = smooth[1]
                
                # If smoothing parameter is 0, do not smooth
                if smoothingparameter == 0:
                    
                    # Default probability close to 0, to avoid infinity.
                    likelihood_attribute = defaultdict(lambda: np.log(10**(-24))) 
                    
                    # Calculate relative frequency as an estimate to the likelihood
                    for unique_value in class_data[attribute].unique():
                        likelihood_attribute[unique_value] = np.log(sum(class_data[attribute] == unique_value))\
                        -np.log(class_data.shape[0])
                    
                # If smoothing parameter > 0, do add-k smoothing
                else:
                    
                    # Default value is set to k /(N + k*d), where d = num. possible values and N = total number of class instances.
                    likelihood_attribute = defaultdict(lambda: np.log(smoothingparameter)\
                                                       -np.log(class_data.shape[0]+ smoothingparameter*numUniqueValues))
               
                    # Calculate relative frequency as an estimate to the likelihood
                    for unique_value in class_data[attribute].unique():
                        likelihood_attribute[unique_value] = np.log(sum(class_data[attribute] == unique_value)+smoothingparameter)\
                        -np.log(class_data.shape[0] + smoothingparameter* numUniqueValues)
                    
                Model[classlabel]['likelihood'][attribute] = likelihood_attribute
                
    return Model
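
The note in `train` about log probabilities can be illustrated with a short sketch. The 400 likelihoods of 0.01 are made-up numbers chosen only to force the problem: their product is far below the smallest positive float64 and collapses to zero, while the sum of their logarithms stays finite and can still be compared across classes.

```python
import numpy as np

# 400 hypothetical per-attribute likelihoods of 0.01 each
likelihoods = np.full(400, 0.01)

# Multiplying directly: 1e-800 cannot be represented in float64 and underflows to 0.0
direct_product = np.prod(likelihoods)

# Summing logs instead: 400 * log(0.01), a finite negative number
log_sum = np.sum(np.log(likelihoods))

print(direct_product, log_sum)
```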
           

In [356]:
def cross_val_train(cross_val_dict):
    
    '''
    Trains cross-validation training set
    
    Returns modelCV, a list of dictionaries containing all necessary values for Naive Bayes Classifier
    '''
    
    #modelCV = [{}, {}, {}...]
    modelCV = []
    
    #training_set should contain 10 partitions
    training_set = cross_val_dict['training']
    
    # Train each partition
    # Append Rule of each partition to modelCV
    for i in range(len(training_set)):
        
        # Smoothing
        smoothparam = 0
        num_unique = {}
        for f in (training_set[i]):
            num_unique[f] = len(training_set[i][f].unique()) # counts values seen in this fold only, not in the whole dataset (not what I intended)
        
        modelCV.append(train(training_set[i], (num_unique, smoothparam)))
        
    return modelCV

In [357]:
# This function should predict classes for new items in a test dataset (for the purposes of this assignment, you 
# can re-use the training data as a test set)

def predict(testset, Model):
    '''
    Produces a list of predicted class labels from a naive-bayes classifier.
    Model: dictionary of priors and likelihoods
    returns: list of class labels
    '''
    
    
    # Contains predicted classes for testset instances
    predictions = []
    
    # For each instance, try to predict what the class label should be
    for rowindex in range(testset.shape[0]):
        
        posteriors = {}
        
        # Find the posterior probability for each class: product of prior and likelihoods = sum of all log probabilities
        for classlabel in Model:
            
            # Log Prior
            posteriors[classlabel] = Model[classlabel]['prior']
            
            # Sum of log likelihoods
            for attribute in Model[classlabel]['likelihood']:

                # If missing, exclude from calculation.
                # Equivalent to multiplying every class's posterior by one, so the ranking of classes is unaffected.
                # NaN never compares equal to anything (including itself), so pd.isna is used rather than ==
                if pd.isna(testset.iloc[rowindex,:][attribute]):
                    print('missing value identified') # should print once for each class
                    
                
                # If numeric data, calculate log likelihood from pdf and add
                elif isNumericAttribute(testset[attribute]):
                    
                    mu, sig = Model[classlabel]['likelihood'][attribute]
                    
                    # Gaussian probability distribution
                    posteriors[classlabel] += np.log(stats.norm.pdf(testset.iloc[rowindex,:][attribute],
                                                                    loc = mu, scale = sig))
              
                # If relative frequency: search for log likelihood and add
                else:
                    posteriors[classlabel] += Model[classlabel]['likelihood'][attribute][testset.iloc[rowindex,:][attribute]]

        # Select greatest posterior probability
        predictedclass = None
        
        for classlabel in posteriors:
            if (predictedclass is None) or (posteriors[classlabel] > posteriors[predictedclass]):
                predictedclass = classlabel
        
        # Add prediction of an instance to the set of prediction
        predictions.append(predictedclass)
    
    return predictions
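
One numerical detail of the Gaussian branch in `predict` is worth a sketch: taking `np.log` of `stats.norm.pdf` returns negative infinity once the density underflows, whereas `stats.norm.logpdf` computes the log density directly. The mean, standard deviation and test value below are hypothetical and chosen to be extreme; whether this ever occurs on the assignment datasets is not established here.

```python
import numpy as np
import scipy.stats as stats

mu, sig = 0.0, 1.0   # hypothetical class-conditional mean and standard deviation
x = 40.0             # a test value very far from the mean

# The density underflows to 0.0, so its log is -inf (the warning is silenced for the demo)
with np.errstate(divide="ignore"):
    via_pdf = np.log(stats.norm.pdf(x, loc=mu, scale=sig))

# The log density computed directly stays finite
via_logpdf = stats.norm.logpdf(x, loc=mu, scale=sig)

print(via_pdf, via_logpdf)
```

If every class scores negative infinity for an instance, the comparison in `predict` cannot separate them, so the finite value is the safer quantity to accumulate.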

In [358]:
def cross_val_predict(cross_val_dict, modelCV):    
    '''Returns predictions for cross-validation'''
    
    testing_set = cross_val_dict['testing']
    
    # predictionsCV = []
    predictionsCV = []
    
    for i in range(len(testing_set)):
        predictionsCV.append(predict(testing_set[i], modelCV[i]))
        
    return predictionsCV

In [359]:
# This function should evaluate the prediction performance by comparing your model’s class outputs to ground truth labels

def evaluate(groundtruth, predictedlabels):
    '''
    groundtruth: array of true class labels for a dataset 
    predictedlabels: array of class labels predicted by the naive Bayes model
    
    Returns: accuracy, and a confusion matrix in a tuple
    '''
    
    # Verify that the number of predictions matches the size of the test set
    assert(len(groundtruth) == len(predictedlabels))
    
    # Elementwise comparison to find matches (True) and sum to get total number of matches
    correct = 0
    for i in range(len(groundtruth)):
        correct += (groundtruth[i] == predictedlabels[i])
    
    confusion = pd.DataFrame({'groundtruth': groundtruth, 'predicted': predictedlabels})
    
    confusion = pd.crosstab(confusion['groundtruth'], confusion['predicted'], 
                            rownames=['groundtruth'], colnames = ['predicted'])
    
    #return accuracy and confusion matrix
    return correct/len(predictedlabels), confusion
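
`pd.crosstab` only creates a column for labels that actually appear among the predictions, so a class that is never predicted has no column in the confusion matrix. The sketch below, on made-up labels, reindexes the table to a square matrix over all labels and recovers accuracy from its diagonal; `labels` and `square` are illustrative names, not part of the functions above.

```python
import pandas as pd

# Hypothetical labels; class 1 occurs in the ground truth but is never predicted
groundtruth = [1, 2, 2, 3, 3, 4]
predicted = [3, 2, 3, 3, 2, 4]

labels = sorted(set(groundtruth) | set(predicted))

confusion = pd.crosstab(pd.Series(groundtruth, name="groundtruth"),
                        pd.Series(predicted, name="predicted"))

# Add the missing rows/columns, filled with zero counts
square = confusion.reindex(index=labels, columns=labels, fill_value=0)

# Accuracy is the sum of the diagonal divided by the number of instances
accuracy = sum(square.loc[c, c] for c in labels) / len(groundtruth)

print(square)
print(accuracy)
```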

In [16]:
def cross_val_evaluate(cross_val_dict, predictionsCV):
    
    '''
    Compares cross-validation testing set labels (groundtruth)
    to the predicted labels for each fold
    
    Returns: the mean accuracy across folds
    '''
    
    #list to store cross validation accuracies
    # accuraciesCV = [0.99, 0.98, ....]
    accuraciesCV = []
    
    testing_set = cross_val_dict['testing']
    
    # Extract actual class labels (cross val groundtruth) from testing set
    # groundtruthsCV = [[groundtruth from 1st testing set], [groundtruth from 2nd testing set],...]
    
    groundtruthsCV = []
    for i in range(len(testing_set)):
        
        groundtruth = cross_val_dict['testing'][i]['class']
        groundtruth_list = groundtruth.tolist()
        groundtruthsCV.append(groundtruth_list)        
        
    # Appends accuracies to accuraciesCV
    for i in range(len(groundtruthsCV)):
        accuracy = evaluate(groundtruthsCV[i], predictionsCV[i])
        accuraciesCV.append(accuracy[0])
    
    #return mean of accuraciesCV
    return np.mean(accuraciesCV)

## Questions 


If you are in a group of 1, you will respond to question (1), and **one** other of your choosing (two responses in total).

If you are in a group of 2, you will respond to question (1) and question (2), and **two** others of your choosing (four responses in total). 

A response to a question should take about 100–250 words, and make reference to the data wherever possible.

#### NOTE: you may develop code or functions in response to the question, but your formal answer should be added to a separate file.

# Q1
Try discretising the numeric attributes in these datasets and treating them as discrete variables in the naïve Bayes classifier. You can use a discretisation method of your choice and group the numeric values into any number of levels (but around 3 to 5 levels would probably be a good starting point). Does discretizing the variables improve classification performance, compared to the Gaussian naïve Bayes approach? Why or why not?

# Code Instructions for Q1

Define the functions. The discretize function may be customized to achieve equal-width or equal-frequency binning.

In the main code:

* Select the dataset to be used in the naive-bayes model.
* By default, the model will use Gaussian NB unless the discretize function is used. The number of bins can be varied by the user.
* If smoothing is desired (not used in Q1), the smoothing parameter can be changed. Only relevant for discretized numerical data.
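
The difference between the two binning options can be seen on a small skewed sample. This is a minimal sketch with made-up values: `pd.cut` splits the value range into bins of equal width, while `pd.qcut` places roughly the same number of instances in each bin.

```python
import pandas as pd

# Hypothetical skewed attribute: most values are small, one is large
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# labels=False returns the integer bin index of each value
equal_width = pd.cut(values, 3, labels=False)
equal_frequency = pd.qcut(values, 3, labels=False)

# Number of instances in each non-empty bin, ordered by bin index
width_counts = equal_width.value_counts().sort_index().tolist()
frequency_counts = equal_frequency.value_counts().sort_index().tolist()

print(width_counts, frequency_counts)
```

With equal-width bins the single outlier stretches the range, so nearly all instances share one bin and the attribute carries little information; equal-frequency bins avoid that at the cost of bins with very different widths.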

In [360]:
def isNumericAttribute(series):
    ''' 
    Datasets: wine, wdbc, bank
    
    Ensures that the train function uses the Gaussian pdf.
    
    Returns True if the series contains a numeric attribute, and false otherwise.
    '''
    if series.dtype == object:
        return False
    if len(series.unique())>9:
        #print(series.name, 'is numeric')
        return True

    return False

def discretize(data, bins = 5):
    '''
    Datasets: wine, wdbc, bank
    
    Assumes data is a dataFrame of features only
    Discretizes only the numeric attributes, which are determined by number of unique values.
    
    Returns: the same dataframe (modified in place) with numeric features grouped into equal-frequency bins
    '''
    
    discretizeddata = data
    for feature in data.columns:
        if isNumericAttribute(data[feature]):
            # create bins and group
            
            # Equal-frequency bins (each bin holds roughly the same number of instances)
            discretizeddata[feature]= pd.qcut(discretizeddata[feature], bins, duplicates = 'drop')
            
            # Equal-width bins
            #discretizeddata[feature]= pd.cut(discretizeddata[feature], bins, duplicates = 'drop')
            
            print(feature, 'was discretized')
    
    return discretizeddata

In [5]:
for bins in [0,3,5]: # Include 0 at the front of the list to include GaussianNB

    # Uncomment to select a dataset

    #dataset = preprocess('bank.data')
    dataset = preprocess('wine.data')
    #dataset = preprocess('wdbc.data')

    # Check if GaussianNB is required (needs to be first or the data will already be discretized)
    if bins != 0:
        # === Discretize ===
        print(dataset.head(1))
        dataset = discretize(dataset, bins)
        # ==================

    print('shape:', dataset.shape)

    # Verify that all features are correctly formatted
    print(dataset.head(1))

    for iteration in range(1000,1003): # Seed

        results = []
        split = 0.5

        random.seed(iteration)
        indices = [x for x in range(0,dataset.shape[0])]
        random.shuffle(indices)

        trainingindex = indices[0:int(dataset.shape[0]*split)]        
        testindex = indices[int(dataset.shape[0]*split):dataset.shape[0]]

        trainingset = dataset.iloc[trainingindex,:]
        testset = dataset.iloc[testindex,:]

        groundtruth_in = trainingset['class']
        groundtruth_out = testset['class']

        trainingset_noclass = trainingset.drop('class', axis=1)
        testset_noclass = testset.drop('class', axis=1)

        # Smoothing
        smoothparam = 0
        num_unique = {}
        for f in dataset:
            num_unique[f] = len(dataset[f].unique())

        # Naive-Bayes Model: No Smoothing
        modelNB_nosmooth = train(trainingset.dropna(), (num_unique, smoothparam))

        trainingfitNB_nosmooth = predict(trainingset_noclass, modelNB_nosmooth)
        trainingevaluationNB_nosmooth = evaluate(groundtruth_in.tolist(), trainingfitNB_nosmooth)

        predictionNB_nosmooth = predict(testset_noclass, modelNB_nosmooth)
        evaluationNB_nosmooth = evaluate(groundtruth_out.tolist(), predictionNB_nosmooth)

        # Print results
        print('===================================')
        print('No Smoothing, bins:', bins)
        print('In-sample accuracy:', round(trainingevaluationNB_nosmooth[0],4))
        print('Out-of-sample accuracy:', round(evaluationNB_nosmooth[0],4))
        print(evaluationNB_nosmooth[1])


# Q2
Implement a baseline model (e.g., random or 0R) and compare the performance of the naïve Bayes classifier to this baseline on multiple datasets. Discuss why the baseline performance varies across datasets, and to what extent the naïve Bayes classifier improves on the baseline performance.

## Q2 Instructions:
* Run the isNumericAttribute and baseline functions.
* Run the main code. Edit the code to customize: dataset, smoothing parameter.

In the assignment, no smoothing was used for this question, and the entire dataset was used in training and testing (instances with missing values were removed during training).
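
Because 0R always predicts the most frequent class, its accuracy when tested on the training data equals the proportion of instances belonging to that class. A minimal sketch with two made-up label lists (the helper `zero_r_accuracy` is illustrative and separate from `baseline_train` and `baseline_predict`):

```python
from collections import Counter

# Hypothetical class labels for two datasets of ten instances each
balanced = ['a'] * 5 + ['b'] * 5
skewed = ['a'] * 9 + ['b'] * 1

def zero_r_accuracy(labels):
    """Accuracy of always predicting the most common label, tested on the same labels."""
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)

print(zero_r_accuracy(balanced), zero_r_accuracy(skewed))
```

This is one reason the baseline varies between datasets: the more skewed the class distribution, the higher the 0R accuracy that naive Bayes has to beat.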

In [361]:
def isNumericAttribute(series):
    ''' 
    Datasets: wine, wdbc, bank
    
    Ensures that the train function uses the Gaussian pdf.
    
    Returns True if the series contains a numeric attribute, and false otherwise.
    '''
    if series.dtype == object:
        return False
    if len(series.unique())>9:
        #print(series.name, 'is numeric')
        return True

    return False

def discretize(data, bins = 5):
    '''
    Datasets: wine, wdbc, bank
    
    Assumes data is a dataFrame of features only
    Discretizes only the numeric attributes, which are determined by number of unique values.
    
    Returns: the same dataframe (modified in place) with numeric features grouped into equal-frequency bins
    '''
    
    discretizeddata = data
    for feature in data.columns:
        if isNumericAttribute(data[feature]):
            # create bins and group
            
            # Uncomment to select: equal-frequency bins
            discretizeddata[feature]= pd.qcut(discretizeddata[feature], bins, duplicates = 'drop')
            
            # Uncomment to select: Equal-width bins
            #discretizeddata[feature]= pd.cut(discretizeddata[feature], bins, duplicates = 'drop')
            
            print(feature, 'was discretized')
    
    return discretizeddata

def baseline_train(data):
    ''' Returns the class which appears the most frequently in the training set'''
    maxclass = data['class'].value_counts().idxmax()
    return maxclass

def baseline_predict(data, maxclass):
    ''' Returns the predicted labels on a testset given a baseline model'''
    return [maxclass for x in range(data.shape[0])]


In [4]:
# Uncomment to select a dataset
#dataset = preprocess('wine.data')
dataset = preprocess('wdbc.data')
#dataset = preprocess('bank.data')

print('shape:', dataset.shape)
print(dataset.head(1))

results = []

groundtruth = dataset['class']
dataset_noclass = dataset.drop('class', axis=1)

# Baseline model: 0R
modelBaseline = baseline_train(dataset)
predictionBaseline = baseline_predict(dataset, modelBaseline) # one prediction per instance, so the length matches groundtruth
evaluationBaseline = evaluate(groundtruth, predictionBaseline)

# Smoothing
smoothparam = 0
num_unique = {}
for f in dataset:
    num_unique[f] = len(dataset[f].unique())

# Naive-Bayes Model
modelNB = train(dataset.dropna(), (num_unique, smoothparam))
predictionNB = predict(dataset_noclass, modelNB)
evaluationNB = evaluate(groundtruth, predictionNB)

# Print results
print('===================================')
print('Baseline Accuracy = ', round(evaluationBaseline[0],4))
print(evaluationBaseline[1])

print('===================================')
# Smoothing parameter used by the naive Bayes model
print('Smoothing parameter = ', smoothparam)
print('NB Accuracy = ', round(evaluationNB[0],4))

print(evaluationNB[1])


# Q4
Evaluating the model on the same data that we use to train the model is considered to be a major mistake in Machine Learning. Implement a hold-out or cross-validation evaluation strategy (you should implement this yourself and do not simply call existing implementations from `scikit-learn`). How does your estimate of effectiveness change, compared to testing on the training data? Explain why. (The result might surprise you!)

## Q4 Instructions:

* Nominal datasets (f2, f3) and numerical datasets (f4, f5) will be used.
* Run testing on the training data and 10-fold cross-validation.
* The accuracy from testing on the training data will be printed.
* Average accuracy will be printed for 10-fold Cross Validation.
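
The partitioning in `preprocess` splits the rows in file order. If a file happens to be ordered by class, some folds may then contain a class that the corresponding training set barely covers. The sketch below is one possible variation, not the approach used above: it shuffles the row positions before splitting. The function name `shuffled_folds` and its `seed` parameter are assumptions introduced for illustration.

```python
import numpy as np

def shuffled_folds(n_instances, k=10, seed=0):
    """Yield (train_idx, test_idx) arrays of row positions for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_instances)   # shuffle row positions before partitioning
    folds = np.array_split(order, k)       # k nearly equal-sized partitions
    for i in range(k):
        test_idx = folds[i]
        # Every partition except the i-th forms the training set
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx

# Example: 25 instances split into 10 folds
splits = list(shuffled_folds(25, k=10, seed=0))
print(len(splits), [len(test) for _, test in splits])
```

Each pair could then select rows with `dataframe.iloc[train_idx]` and `dataframe.iloc[test_idx]` in place of the entries stored in `cross_val_dict`.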

In [None]:
#################### 10-FOLD CROSS VALIDATION ####################

''' Compares accuracies of Testing on Training Data and Cross Validation'''

nominal_numerical_datasets = [f2,f3,f4,f5]
# Nominal Datasets = f2,f3
# Numerical Datasets = f4,f5

for data in nominal_numerical_datasets:
    print('=' * 50)
    print(f"{data}")
    print('=' * 50)
    
    #################### TESTING ON TRAINING DATA ####################
    dataframe = preprocess(data, cross_validation = False)
    
    groundtruth = dataframe['class']
    data_noclass = dataframe.drop('class', axis=1)
    
    # Smoothing
    smoothparam = 0
    num_unique = {}
    for f in dataframe:
        num_unique[f] = len(dataframe[f].unique())


    # Naive-Bayes Model
    modelNB = train(dataframe.dropna(), (num_unique, smoothparam))
    predictionNB = predict(data_noclass, modelNB)
    evaluationNB = evaluate(groundtruth, predictionNB)
    print('Testing on Training Data Accuracy = ', round(evaluationNB[0],4))
    
    # 10-fold Cross Validation
    cross_val_dict = preprocess(data, cross_validation = True)
    modelCV = cross_val_train(cross_val_dict)
    predictionsCV = cross_val_predict(cross_val_dict, modelCV)
    evaluationCV = cross_val_evaluate(cross_val_dict, predictionsCV)
    print('Cross Validation Accuracy Average= ', round(evaluationCV, 4))
    print('=' * 50)

# Q5
Implement one of the advanced smoothing regimes (add-k, Good-Turing). Does changing the smoothing regime (or indeed, not smoothing at all) affect the effectiveness of the naïve Bayes classifier? Explain why, or why not.

## Q5 Instructions:
* Run the isNumericAttribute Function to ensure that all attributes are read as nominal data.
* Run the main code. Edit the code to customize: dataset, random seed, dataset split, smoothing parameter.
    
In the assignment, the data was split 50/50, the random seeds used were 1000 to 1002, and the smoothing parameters used were 0,1 and 5.
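
The add-k estimate used in `train` is (count + k) / (N + k*d), where N is the number of training instances in the class and d is the number of possible values of the attribute. A small worked sketch with made-up counts (the helper `add_k_log_likelihood` is illustrative and not part of the notebook's functions):

```python
import math

def add_k_log_likelihood(count, class_total, num_values, k):
    """log P(value | class) under add-k smoothing: (count + k) / (class_total + k * num_values)."""
    return math.log(count + k) - math.log(class_total + k * num_values)

# Hypothetical attribute with 4 possible values, in a class with 20 training instances
counts = [8, 7, 5, 0]   # the last value was never seen with this class

probabilities = [math.exp(add_k_log_likelihood(c, 20, 4, k=1)) for c in counts]

seen = probabilities[0]      # (8 + 1) / (20 + 4) = 9/24
unseen = probabilities[3]    # (0 + 1) / (20 + 4) = 1/24

print(probabilities, sum(probabilities))
```

The unseen value receives a small non-zero probability instead of zero, and the smoothed probabilities still sum to one because d counts every possible value of the attribute.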

In [363]:
def isNumericAttribute(series):
    '''
    Datasets: breast_cancer_wisconsin, lymphography (all features assumed to be Nominal)
    
    Returns True if the series contains a numeric attribute, and false otherwise.
    '''
    return False

In [383]:
# Uncomment to select a dataset
#dataset = preprocess('breast-cancer-wisconsin.data')
dataset = preprocess('lymphography.data')

print('shape:', dataset.shape)

# Verify that all features are correctly formatted
print(dataset.head(1))

for iteration in range(1000,1003):

    results = []
    split = 0.5

    random.seed(iteration)
    indices = [x for x in range(0,dataset.shape[0])]
    random.shuffle(indices)

    trainingindex = indices[0:int(dataset.shape[0]*split)]        
    testindex = indices[int(dataset.shape[0]*split):dataset.shape[0]]

    trainingset = dataset.iloc[trainingindex,:]
    testset = dataset.iloc[testindex,:]

    groundtruth_in = trainingset['class']
    groundtruth_out = testset['class']
    
    trainingset_noclass = trainingset.drop('class', axis=1)
    testset_noclass = testset.drop('class', axis=1)

    # Smoothing
    smoothparam = 0
    num_unique = {}
    for f in dataset:
        num_unique[f] = len(dataset[f].unique())

    # Naive-Bayes Model: No Smoothing
    modelNB_nosmooth = train(trainingset.dropna(), (num_unique, smoothparam))
    
    trainingfitNB_nosmooth = predict(trainingset_noclass, modelNB_nosmooth)
    trainingevaluationNB_nosmooth = evaluate(groundtruth_in.tolist(), trainingfitNB_nosmooth)
    
    predictionNB_nosmooth = predict(testset_noclass, modelNB_nosmooth)
    evaluationNB_nosmooth = evaluate(groundtruth_out.tolist(), predictionNB_nosmooth)

    # Naive-Bayes Model: Smoothing with k=1
    smoothparam1 = 1 # Change smoothing parameter here
    
    modelNB_smooth1 = train(trainingset.dropna(), (num_unique, smoothparam1))
    
    trainingfitNB_smooth1 = predict(trainingset_noclass, modelNB_smooth1)
    trainingevaluationNB_smooth1 = evaluate(groundtruth_in.tolist(), trainingfitNB_smooth1)
    
    predictionNB_smooth1 = predict(testset_noclass, modelNB_smooth1)
    evaluationNB_smooth1 = evaluate(groundtruth_out.tolist(), predictionNB_smooth1)
    
    # Naive-Bayes Model: Smoothing with k=5
    smoothparam5 = 5 # Change smoothing parameter here
    
    modelNB_smooth5 = train(trainingset.dropna(), (num_unique, smoothparam5))
    
    trainingfitNB_smooth5 = predict(trainingset_noclass, modelNB_smooth5)
    trainingevaluationNB_smooth5 = evaluate(groundtruth_in.tolist(), trainingfitNB_smooth5)
    
    predictionNB_smooth5 = predict(testset_noclass, modelNB_smooth5)
    evaluationNB_smooth5 = evaluate(groundtruth_out.tolist(), predictionNB_smooth5)

    # Print results
    print('===================================')
    print('No Smoothing:')
    print('In-sample accuracy:', round(trainingevaluationNB_nosmooth[0],4))
    print('Out-of-sample accuracy:', round(evaluationNB_nosmooth[0],4))
    print(evaluationNB_nosmooth[1])

    print('======')
    print('Smoothing parameter = ', smoothparam1)
    print('In-sample accuracy:', round(trainingevaluationNB_smooth1[0],4))
    print('Out-of-sample accuracy:', round(evaluationNB_smooth1[0],4))
    print(evaluationNB_smooth1[1])
    
    print('======')
    print('Smoothing parameter = ', smoothparam5)
    print('In-sample accuracy:', round(trainingevaluationNB_smooth5[0],4))
    print('Out-of-sample accuracy:', round(evaluationNB_smooth5[0],4))
    print(evaluationNB_smooth5[1])
    
    

0 missing values found.
shape: (148, 19)
   class  lymphatics  block of affere  bl. of lymph. c  bl. of lymph. s  \
0      3           4                2                1                1   

   by pass  extravasates  regeneration of  early uptake in  lym.nodes dimin  \
0        1             1                1                2                1   

   lym.nodes enlar  changes in lym.  defect in node  changes in node  \
0                2                2               2                4   

   changes in stru  special forms  dislocation of  exclusion of no  \
0                8              1               1                2   

   no. of nodes in  
0                2  
No Smoothing:
In-sample accuracy: 0.8649
Out-of-sample accuracy: 0.8378
predicted     2   3  4
groundtruth           
1             0   1  0
2            35   3  0
3             8  26  0
4             0   0  1
Smoothing parameter =  1
In-sample accuracy: 0.8649
Out-of-sample accuracy: 0.8514
predicted     2   3  4
groun