In [None]:
import numpy as np
import pandas
import matplotlib.pyplot as plt
%matplotlib notebook

In [None]:
# read in our data - taking care to check that it is tab-delimited!
data = pandas.read_csv('data/dataWeather.txt',delimiter='\t')

# 1R implementation
Here, we want to implement the 1R algorithm on the weather data.

So, for each attribute we go through its values, predict the majority decision for each value, and count the errors that these rules make. We then select the attribute whose rules give the fewest errors in total.

In [None]:
# this will hold the errors that I make for each attribute
# (one column per attribute, i.e. every column except the last one)
err=pandas.DataFrame(np.zeros((1,len(data.columns)-1)),columns=data.columns[:-1])

# loop through all columns
for c in data.columns[:-1]:
    # for each of the values in the column
    for a in data[c].unique():
        # check how many times "yes" and "no" occur
        # in the Play column for that attribute value
        tmp=data['Play'][data[c]==a].value_counts()
        # if that array has two entries
        if(tmp.size==2):
            # if there are at least as many "yes"s as "no"s (ties go to "yes")
            # we should decide that this gives us evidence for "yes"
            if(tmp['Yes']>=tmp['No']):
                print(c,': if',a,'-->','Yes',tmp['No'],'errors')
                err[c]+=tmp['No']
            # if the number of "no"s outnumber the "yes"s
            # we should decide that this gives us evidence for "no"
            else:
                print(c,': if',a,'-->','No',tmp['Yes'],'errors')
                err[c]+=tmp['Yes']
        # if that array has only one entry, we decide for that single class
        # and of course we make no error in this case
        else:
            print(c,': if',a,'-->',tmp.index[0],'0 errors')
print('\n\n\n',err)    
print('\n\n\nfound minimum error for attribute',err.idxmin(axis=1)[0])

## Results
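We found two attributes ("Outlook" and "Humidity") that give us only four errors each. The other two attributes ("Temperature" and "Windy") give us five errors each, so we can arbitrarily choose either "Outlook" or "Humidity" for our 1R learner. The code above reports whichever of the two comes first in the column order, since `idxmin` returns the first occurrence of the minimum.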
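As a cross-check of these counts, here is a compact sketch of the same 1R error computation using `pandas.crosstab`. It assumes that `data/dataWeather.txt` contains the standard 14-row nominal weather data, which is embedded below so that the block runs on its own; the names `weather` and `one_r_errors` are made up for this illustration.

```python
import pandas as pd

# ASSUMPTION: these 14 rows are the standard nominal weather data and
# stand in for data/dataWeather.txt, which is not reproduced here.
rows = [
    ("Sunny", "Hot", "High", False, "No"),
    ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Rainy", "Mild", "High", True, "No"),
]
weather = pd.DataFrame(
    rows, columns=["Outlook", "Temperature", "Humidity", "Windy", "Play"]
)

def one_r_errors(df, target):
    """Return the total number of 1R errors for every attribute."""
    errors = {}
    for col in df.columns.drop(target):
        # rows: attribute values, columns: decisions, cells: counts
        counts = pd.crosstab(df[col], df[target])
        # predicting the majority decision for each value gets
        # (row total - row maximum) instances wrong
        errors[col] = int((counts.sum(axis=1) - counts.max(axis=1)).sum())
    return errors

errors = one_r_errors(weather, "Play")
print(errors)

# min() keeps the first attribute among ties, like idxmin does
best = min(errors, key=errors.get)
print("attribute with the fewest errors:", best)
```

Under that assumption about the data, the dictionary reproduces the counts discussed above: four errors for "Outlook" and "Humidity", five for "Temperature" and "Windy".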

# Naive Bayes
Let's try to implement Naive Bayes for the weather data, similarly to how it is presented in the slides.

For this, we will need to estimate the likelihoods

$P(\text{Attribute Value} \mid Yes)$ and $P(\text{Attribute Value} \mid No)$

as well as the priors $P(Yes)$ and $P(No)$.
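For reference, these estimates are combined as follows. The notation is introduced here for illustration only: $d$ stands for a decision ("Yes" or "No") and $a_1,\dots,a_n$ for the attribute values of the day being classified. Under the naive assumption that the attributes are independent given the decision, Bayes' rule gives

$$P(d \mid a_1,\dots,a_n) = \frac{P(d)\,\prod_{i=1}^{n} P(a_i \mid d)}{\sum_{d'} P(d')\,\prod_{i=1}^{n} P(a_i \mid d')}$$

The denominator is the same for every decision, so comparing the numerators is enough to find the most probable decision.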

In [None]:
# Naive Bayes implementation
# let's make a dictionary of dataFrames for the storage
# of the probabilities
bayesData={}

# this holds the different decisions we have
# I'm assuming they live in the last column of the data
dV = data.columns[-1]
decisions = data[dV].unique()
# how many occurrences of each decision do we have?
# this one is used to measure P(yes) and P(no)
decisionsCount = data[dV].value_counts()

# this holds the independent variables (attributes)
iV=data.columns[:-1]

# now loop through all independent variables (attributes)
for c in iV:
    # get the different values we have for each attribute 
    values = data[c].unique()    
    # make an entry in the dictionary with columns consisting
    # of the different decisions, indexed by the values of
    # the attribute
    bayesData[c]=pandas.DataFrame(columns=decisions,index=values)
    # now loop through all values of the attribute
    for v in values:
        # find out the decisions for that value
        tmp=data[dV][data[c]==v]
        # loop through all decision values 
        for d in decisions:
            # determine the likelihood of that combination
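            # (use .loc instead of chained indexing, which newer
            # pandas versions may silently ignore on assignment)
            bayesData[c].loc[v,d]=len(tmp[tmp==d])/decisionsCount[d]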

# print out the likelihoods
for c in iV:
    print(c,'\n',bayesData[c],'\n\n')

# test with a new day
newDay = ['Sunny','Cool','High',True]

# test all decision categories
for d in decisions:
    # initialize likelihood
    p=1
    # collect all likelihoods from table
    for n,c in enumerate(iV):
        p=p*bayesData[c][d][newDay[n]]
    # multiply by the prior, i.e. the probability of the decision itself
    p=p*decisionsCount[d]/decisionsCount.sum()
    print(d,':',p)

## Results
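As expected, our new day was classified as a "No". These scores are not yet normalized, so they are not proper probabilities. Since normalization divides every score by the same constant, it does not change which decision has the maximum, so we don't actually need to do it to make the decision.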
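To see this concretely, here is a minimal sketch of the normalization step. The two scores are written out by hand under the assumption that the file holds the standard 14-row weather data (9 "Yes" days and 5 "No" days); the helper `normalize` is a made-up name for this illustration.

```python
import numpy as np

def normalize(scores):
    """Scale non-negative scores so that they sum to 1."""
    scores = np.asarray(scores, dtype=float)
    return scores / scores.sum()

# ASSUMPTION: unnormalized scores for the day Sunny/Cool/High/True,
# taken from the standard 14-row weather data, in the order (Yes, No):
# product of the four likelihoods times the prior
scores = [
    2/9 * 3/9 * 3/9 * 3/9 * 9/14,   # Yes
    3/5 * 1/5 * 4/5 * 3/5 * 5/14,   # No
]

posterior = normalize(scores)
print(posterior)

# the winning index is the same before and after normalization
print(np.argmax(scores), posterior.argmax())
```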

## DIY
So now I would like you to write a function that tests a given day in the format of `newDay` above and returns both the normalized probability and the decision made by the Naive Bayes algorithm.

You can assume that the data variables are known from the outside, so you do not need to pass them to the function. In addition, the function has a verbose input, which (if true) outputs some information about the decision probabilities it has calculated...

In [None]:
# function definition
def testDay(newDay,verbose=False):
    # initialize probabilities to a numpy array of ones of the correct size
    p = np.ones(len(decisions))
    
    # loop through all possible decisions and get index as well
    for ... in enum...:
        # collect all probabilities
        for n,c in enumerate(iV):
            p[nd]=...
        # multiply by the prior probability of the decision itself
        p[nd]=...
        if (verbose):
            print('{:s}: {:.5f}'.format(d,p[nd]))
    # normalize the array so that probabilities sum up to 1
    p=...
    # return the highest probability and the decision to the user
    return(...,...)

If you test this function with our `newDay` variable, you should get the following output:

`In [NN]: testDay(newDay)`

`Out[NN]: (0.79541734860883795, 'No')`

In [None]:
testDay(newDay)

## DIY - Results and comparison with 1R
We would like to see whether Naive Bayes is actually better than 1R.

For this, implement a for-loop that goes through all rows of the input data `data` and tests each day. How many errors do you get with Naive Bayes?

In [None]:
for ... in rows of data:
    # print both the result of testDay and the actual TRUE result from the data array!
    print(testDay(day),...)