# Classification with probability

### Part 1: Naive Bayes

Every classifier so far has been linear. Now we'll move up a gear and implement some non-linear classifiers. The first, as we saw in class, is Naive Bayes, which uses probability to make predictions.

We make use of Bayes' theorem, which lets us calculate the probability that a piece of data belongs to a given class, given our prior knowledge. Bayes' theorem is stated as:

P(class|data) = (P(data|class) * P(class)) / P(data)

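where P(class|data) is the probability of the class given the provided data, P(data|class) is the probability of the data given the class, P(class) is the prior probability of the class, and P(data) is the probability of the data.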
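
As a quick numeric illustration of the formula above, here is a minimal sketch with made-up likelihoods and priors for two classes. The numbers and the name `posterior` are hypothetical, chosen only for illustration. P(data) is obtained by summing P(data|class) * P(class) over the classes.

```python
def posterior(likelihoods, priors):
    """Apply Bayes' theorem to every class.

    likelihoods: dict mapping class -> P(data|class)
    priors:      dict mapping class -> P(class)
    Returns a dict mapping class -> P(class|data).
    """
    # P(data) is the sum over classes of P(data|class) * P(class)
    evidence = sum(likelihoods[c] * priors[c] for c in priors)
    return {c: likelihoods[c] * priors[c] / evidence for c in priors}


# Hypothetical numbers, for illustration only
example = posterior({0: 0.3, 1: 0.1}, {0: 0.5, 1: 0.5})
print(example)
```

Because the same P(data) divides every class, the posteriors always sum to 1.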

We're going to break this down into several steps. Again, I've given you a contrived data set to test your functions on.

#### (a) Separate by class

Just as in class, we need to calculate probabilities separately for each class, so we first need to separate our data by class. Create a dictionary where each key is a class value and each value is a list of all instances with that class value.

In [1]:
# Contrived data set

dataset = [[3.393533211,2.331273381,0],
    [3.110073483,1.781539638,0],
    [1.343808831,3.368360954,0],
    [3.582294042,4.67917911,0],
    [2.280362439,2.866990263,0],
    [7.423436942,4.696522875,1],
    [5.745051997,3.533989803,1],
    [9.172168622,2.511101045,1],
    [7.792783481,3.424088941,1],
    [7.939820817,0.791637231,1]]


# implement separateByClass(dataset) here
def separateByClass(dataset):
    dictionary = {}
    classes = [dataset[0][-1]]
    for ins in dataset:
        if ins[-1] not in classes:
            classes.append(ins[-1])
    for c in classes:
        instance = []
        for ins in dataset:
            if ins[-1] == c:
                instance.append(tuple(ins))
        dictionary[c] = instance
    return dictionary


print(separateByClass(dataset))

{0: [(3.393533211, 2.331273381, 0), (3.110073483, 1.781539638, 0), (1.343808831, 3.368360954, 0), (3.582294042, 4.67917911, 0), (2.280362439, 2.866990263, 0)], 1: [(7.423436942, 4.696522875, 1), (5.745051997, 3.533989803, 1), (9.172168622, 2.511101045, 1), (7.792783481, 3.424088941, 1), (7.939820817, 0.791637231, 1)]}


#### (b) Summarize the data

We need two statistics from the data: the mean and the standard deviation. You should have these functions from a previous assignment; remember that the standard deviation is simply the square root of the variance. We need the mean and standard deviation for each of our attributes, i.e. for each column of our data. Create a function that summarizes a given data set by gathering the values in each column and calculating the mean and standard deviation of that column's data. We'll collect this information into a tuple, one per column, comprising the mean, the standard deviation and the number of elements in the column. Return a list of these tuples.
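
One way to sanity-check hand-written mean and standard deviation helpers is to compare them with Python's standard `statistics` module. The sketch below assumes the sample standard deviation (dividing by n - 1), which is what `statistics.stdev` computes and what the expected output for the contrived data in part (c) reflects. The variable names are illustrative.

```python
import statistics

# First attribute of the five class-0 rows in the contrived data set
column = [3.393533211, 3.110073483, 1.343808831, 3.582294042, 2.280362439]

col_mean = statistics.mean(column)
col_std = statistics.stdev(column)  # sample standard deviation (n - 1)

print(col_mean, col_std, len(column))
```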

In [2]:
import math

# implement summarizeDataset(dataset) here, and copy across any functions you need to help you
def mean(listOfValues):
    return sum(listOfValues)/len(listOfValues)

def std_dev(listOfValues):
    meanValue = mean(listOfValues)
    var = 0
    for value in listOfValues:
        var += (value - meanValue)**2
    return math.sqrt(var/(len(listOfValues)-1))

def summarizeDataset(dataset):
    summary = []
    for i in range(len(dataset[0])-1):
        val = [ins[i] for ins in dataset]
        summary.append((mean(val), std_dev(val), len(val)))
    return summary

#### (c) Summarize data by class

We now need to combine the functions from (a) and (b) above. Create a summarizeByClass function that splits the data by class, and then calculates statistics for each column of the data for each class. The results - the list of tuples of statistics, one per column - should then be stored in a dictionary by their class value. summarizeByClass should return such a dictionary.

In [3]:
# implement summarizeByClass(dataset) here
def summarizeByClass(dataset): 
    summary = {}
    dictionary = separateByClass(dataset)
    for key in dictionary.keys():
        values = dictionary[key]
        summary[key] = summarizeDataset(values)
    return summary

print(summarizeByClass(dataset))

# The dictionary for the contrived data should look like:
# {0: [(2.7420144012, 0.9265683289298018, 5), (3.0054686692, 1.1073295894898725, 5)], 1: [(7.6146523718, 1.2344321550313704, 5), (2.9914679790000003, 1.4541931384601618, 5)]}

{0: [(2.7420144012, 0.9265683289298018, 5), (3.0054686692, 1.1073295894898725, 5)], 1: [(7.6146523718, 1.2344321550313704, 5), (2.9914679790000003, 1.4541931384601618, 5)]}


#### (d) Gaussian Probability Density

We're working with numerical data here, so we need to implement the Gaussian probability density function (PDF) we talked about in class, so we can attach probabilities to real values. A Gaussian distribution can be summarized by two values - guess which two? If you guessed mean and standard deviation, you were correct. The Gaussian PDF is calculated as follows:

probability(x) = (1 / (sqrt(2 * pi) * std_dev)) * exp(-((x - mean) ** 2) / (2 * std_dev ** 2))

Hopefully, you can see why we're going to need the mean and the std_dev from function (c).

Create a function that:
- takes a value
- takes a mean
- takes a standard deviation

and returns the probability of seeing that value, given that distribution, using the formula above.
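
If you want something to compare your function against, Python 3.8+ ships `statistics.NormalDist`, whose `pdf` method evaluates the same Gaussian density. The sketch below also evaluates the density directly; `gaussianPdf` is an illustrative name, not the function the assignment asks for.

```python
import math
from statistics import NormalDist


def gaussianPdf(x, mean, std_dev):
    """Gaussian density at x for the given mean and standard deviation."""
    exponent = -((x - mean) ** 2) / (2 * std_dev ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * std_dev)


# Compare the direct calculation with the standard library
direct = gaussianPdf(1.0, 1.0, 1.0)
reference = NormalDist(mu=1.0, sigma=1.0).pdf(1.0)
print(direct, reference)
```

Note that a density is not itself a probability, so values above 1 are possible when the standard deviation is small.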

In [4]:
# Implement calcProb(value, mean, std_dev) here
def calcProb(value, mean, std_dev):
    return (math.exp(-((value-mean) ** 2) / (2 * (std_dev ** 2 ))))/ (math.sqrt(2 * math.pi) * std_dev)

#### (e) Class Probabilities

We can now use probabilities calculated from our training data to calculate probabilities for an instance of new data, by creating a function called calcClassProbs. Probabilities have to be calculated separately for each possible class in our data, so for each class we have to calculate the likelihood that the new instance of data belongs to that class. The score for a piece of data belonging to a class is calculated by:

P(class|data) = P(X|class) * P(class)

The division by P(data) has been removed because P(data) is the same for every class, so it doesn't change which class scores highest. (Strictly, the result is then proportional to P(class|data) rather than equal to it.) The class with the largest value is the one we assign. Each input value is treated separately, so in the case where we have TWO input values in our data (X1 and X2), the score for an instance belonging to class 0 is calculated by:

P(class=0|X1,X2) = P(X1|class=0) * P(X2|class=0) * P(class=0)

We have to repeat this for each class, and then choose the class with the highest score. We should not assume a fixed number of input features, X; the above was just an illustration.
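
The product above generalises to any number of features. As a minimal sketch, the snippet below multiplies a list of per-feature likelihoods by the prior using `math.prod` (Python 3.8+). The two likelihood values are the Gaussian densities of the first contrived instance under class 0, and the prior of 0.5 reflects 5 of the 10 rows belonging to that class. `classScore` is an illustrative name.

```python
import math


def classScore(likelihoods, prior):
    """Multiply the per-feature likelihoods together, then by the prior."""
    return math.prod(likelihoods) * prior


# Densities of the first contrived instance under class 0, one per feature
likelihoods_class0 = [0.3362559189806222, 0.2993212841354971]
score = classScore(likelihoods_class0, prior=0.5)
print(score)
```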

We'll start by creating a function that will return the probabilities of predicting each class for a given instance. This function will take a dictionary of summaries (as returned by (c), above) and an instance, and will generate a dictionary of probabilities, with one entry per class. The steps are as follows:

- We need to calculate the total number of training instances, by summing the counts stored in the summary statistics. So if there are 9 instances with one label, and 5 with another (as in the weather data) then we need to know there are 14 instances.

- This will help us calculate the probability of a given class, the prior probability P(class), as the number of rows with a given class divided by the number of rows in the training data.

- Next, probabilities are calculated for each input value in the instance, using the Gaussian PDF and the statistics for that column and that class. Probabilities are multiplied together as they are accumulated, following the formula given above.

- The process is repeated for each class in the data

- Return the dictionary of probabilities for each class for the new instance

Some things that might help with implementation:

- Dictionaries are your friend here
- The data returned by (c) above is already divided by class. You can:
    - discover the prior probability from this data (how many instances for this class, divided by the total instances)
    - iterate over the tuples, which give you the information (mean, std_dev, count) on a per column basis
    - calculate probability given the attribute value corresponding to that column using your function from (d)

Try this out on the contrived data. 

NOTE: If you want to output ACTUAL probabilities by class, divide each score in the dictionary for an instance by the sum of the scores. You don't need to do this, it's just a reminder.


In [5]:
# Implement calcClassProbs(summaries, instance) here
def calcClassProbs(summaries, instance):
    classes = summaries.keys()
    dict_prob = {}
    count = sum([summaries[key][0][-1] for key in classes])

    for key in classes:
        class_prob = summaries[key][0][-1]/count
        for i in range(len(instance)-1):
            mean = summaries[key][i][0]
            std_dev = summaries[key][i][1]
            class_prob = class_prob*calcProb(instance[i], mean, std_dev)
        dict_prob[key] = class_prob
    return dict_prob


# Test it out here
summaries = summarizeByClass(dataset)
probabilities = calcClassProbs(summaries, dataset[0])
print('Probabilities are:',probabilities)

# I think if everything works, it should be:
# {0: 0.05032427673372075, 1: 0.00011557718379945765}
# which according to the percentage calculation given above should be:
# 99.77% in favour of class 0 

sumProbs = sum([v for _,v in probabilities.items()])
for k,v in probabilities.items():
    print('The probability of the instance belonging to class %d is %.2f' % (k,v/sumProbs*100))

Probabilities are: {0: 0.05032427673372075, 1: 0.00011557718379945766}
The probability of the instance belonging to class 0 is 99.77
The probability of the instance belonging to class 1 is 0.23


#### (f) Tying it all together

You need to create a predict function. This function works very much like the example above, in that it takes a dictionary of summaries and a single row, and uses calcClassProbs to get the dictionary of probabilities. From this dictionary, find the largest value and its corresponding class. Return this class.

You also need a naiveBayes function, that takes a training set and a test set. It needs to generate summary statistics from the training set (using (c), above), then make predictions for each instance in the test set, by calling your predict function above for each instance, using the summaries generated. Append these predictions to a list you return.
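
Picking the class with the largest score can be done compactly with `max` and a `key` function. This is a sketch of that one step only; `bestClass` is a hypothetical helper name, and the scores are the ones reported for the first contrived instance.

```python
def bestClass(scores):
    """Return the key whose value is largest in a dict of class scores."""
    return max(scores, key=scores.get)


scores = {0: 0.05032427673372075, 1: 0.00011557718379945765}
print(bestClass(scores))
```

If two classes tie, `max` returns the first one encountered in the dictionary's iteration order.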

In [6]:
# Implement predict(summaries,instance) here
def predict(summaries,instance):
    class_prob = calcClassProbs(summaries, instance)
    keys = list(class_prob.keys())
    chosen = keys[0]
    for key in keys[1:]:
        if class_prob[key] > class_prob[chosen]:
            chosen = key    
    return chosen

# Implement naiveBayes(train,test) here
def naiveBayes(train, test):
    prediction = []
    summaries = summarizeByClass(train)
    for instance in test:
        prediction.append(predict(summaries,instance))
    return prediction

### Applying to real data

You've seen bits of the iris dataset in class. It's one of the most well known data sets in machine learning and data mining. So you might as well have a go at it! You can find out more about it here: http://archive.ics.uci.edu/ml/datasets/Iris

You'll need to:

- Load the data
- convert all but the last column to floats
- convert the last column to an int. There are THREE classes, so convert them to 0, 1 and 2 accordingly
- call evaluate algorithm, using a 5-fold cross-validation
- print the mean, min and max scores
- compare this to some reasonable baseline
- give me a very short write up of the results
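
For reference, a k-fold split can be sketched by shuffling a copy of the rows and dealing them into k parts. This is one possible approach, not necessarily the one used earlier in the course; `kFoldSplit` and the fixed `seed` are assumptions made for the example.

```python
import random


def kFoldSplit(rows, k, seed=0):
    """Shuffle a copy of rows and deal them into k folds of near-equal size."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th row starting at position i
    return [shuffled[i::k] for i in range(k)]


folds = kFoldSplit(list(range(150)), 5)
print([len(fold) for fold in folds])
```

Each row lands in exactly one fold, so every instance is used for testing exactly once across the k rounds.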

In [7]:
import csv

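def load_data(filename):
    # Read every non-empty row of the CSV file into a list of lists,
    # closing the file once reading is done
    with open(filename, newline='') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        return [row for row in csv_reader if row]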

data_iris = load_data("iris.csv")

# Convert the features from strings to floats 
def column2Float(dataset, column):
    for row in dataset:
        row[column] = float(row[column])

for i in range(len(data_iris[0])-1):  
    column2Float(data_iris, i)
    
for row in data_iris:
    if row[-1] == "Iris-setosa":
        row[-1] = 0
    elif row[-1] == "Iris-versicolor":
        row[-1] = 1 
    else:
        row[-1] = 2
        
#==============================================
import random
import copy

def cross_validation_data(dataset, folds):
    new_list = []
    copy_list = copy.deepcopy(dataset)
    fold_len = len(dataset)/folds
    for i in range(folds):
        current_fold = []
        while len(current_fold) < fold_len and len(copy_list)!=0:
            random_inst = random.choice(copy_list) 
            current_fold.append(random_inst)
            copy_list.remove(random_inst)
        new_list.append(current_fold)
    return new_list

def evaluate_algorithm(dataset, algorithm, folds, metric, *args):
    new_data = cross_validation_data(dataset, folds)  
    scores = []
    for fold in new_data:
        train = copy.deepcopy(new_data)
        train.remove(fold)
        train = [element for sublist in train for element in sublist]
        test = [instance[:-1] + [None] for instance in fold]
        
        predicted = algorithm(train,test, *args)
        actual = [instance[-1] for instance in fold]
        result = metric(actual,predicted)
        scores.append(result)

    return scores

def accuracy(actual, predicted):
    length = len(actual)
    counter = 0
    for i in range(length):
        if actual[i] == predicted[i]:
            counter += 1
    return (counter/length)*100

from collections import Counter

def zeroRC(train, test):
    valueY = [instance[-1] for instance in train]
    most_occur = Counter(valueY).most_common(1)[0][0] 
    return [most_occur for i in range(len(test))]

#==============================================
folds = 5

naiveBayes_scores = evaluate_algorithm(data_iris, naiveBayes, folds, accuracy)
zeroRC_result = evaluate_algorithm(data_iris, zeroRC, folds, accuracy)

print("\nNumber of instances:", len(data_iris))
print("Number of features :", len(data_iris[0])-1, "\n")

print("naiveBayes highest score:", max(naiveBayes_scores))
print("naiveBayes lowest score:", min(naiveBayes_scores))
print("naiveBayes mean score:", mean(naiveBayes_scores), "\n")

print("zeroRC highest score:", max(zeroRC_result))
print("zeroRC lowest score:", min(zeroRC_result))
print("zeroRC mean score:", mean(zeroRC_result))


Naive Bayes clearly outperforms the zeroRC baseline: its mean accuracy is above 90%, compared with around 24% for zeroRC.


