# Assignment 2: Naive Bayes

The goal of this assignment is to learn about Naive Bayes through implementing a simple
naive Bayes classifier.  You will also learn a bit about how to load data files (i.e. not just relying on data built into scikit-learn), though in this assignment that part has been done for you.  Read through it and try to understand what's going on.

##### When complete, submit your notebook file on Blackboard as usual.

## Read the data

The first thing we need to do is read the data from disk.  We'll use a library called Pandas to do that; it's really good for reading things like excel spreadsheets and comma-separated-value (CSV) data files.

In [150]:
# this time we'll use Pandas for reading our data file (it's great for CSVs)
import pandas as pd
# we'll also still want numpy
import numpy as np
from collections import Counter

### The Dataset
#### For this assignment, you'll be using a dataset covering congressional voting behavior.  The original version is downloadable from the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records), but you can just grab a copy from Blackboard.  Here's the dataset description:

1. Title: 1984 United States Congressional Voting Records Database

2. Source Information:
    (a) Source:  Congressional Quarterly Almanac, 98th Congress, 
                 2nd session 1984, Volume XL: Congressional Quarterly Inc. 
                 Washington, D.C., 1985.
    (b) Donor: Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
    (c) Date: 27 April 1987 

3. Past Usage
   - Publications
     1. Schlimmer, J. C. (1987).  Concept acquisition through 
        representational adjustment.  Doctoral dissertation, Department of 
        Information and Computer Science, University of California, Irvine, CA.
        -- Results: about 90%-95% accuracy appears to be STAGGER's asymptote
     - Predicted attribute: party affiliation (2 classes)

4. Relevant Information:
      This data set includes votes for each of the U.S. House of
      Representatives Congressmen on the 16 key votes identified by the
      CQA.  The CQA lists nine different types of votes: voted for, paired
      for, and announced for (these three simplified to yea), voted
      against, paired against, and announced against (these three
      simplified to nay), voted present, voted present to avoid conflict
      of interest, and did not vote or otherwise make a position known
      (these three simplified to an unknown disposition).

5. Number of Instances: 435 (267 democrats, 168 republicans)

6. Number of Attributes: 16 + class name = 17 (all Boolean valued)

7. Attribute Information:
   1. Class Name: 2 (democrat, republican)
   2. handicapped-infants: 2 (y,n)
   3. water-project-cost-sharing: 2 (y,n)
   4. adoption-of-the-budget-resolution: 2 (y,n)
   5. physician-fee-freeze: 2 (y,n)
   6. el-salvador-aid: 2 (y,n)
   7. religious-groups-in-schools: 2 (y,n)
   8. anti-satellite-test-ban: 2 (y,n)
   9. aid-to-nicaraguan-contras: 2 (y,n)
  10. mx-missile: 2 (y,n)
  11. immigration: 2 (y,n)
  12. synfuels-corporation-cutback: 2 (y,n)
  13. education-spending: 2 (y,n)
  14. superfund-right-to-sue: 2 (y,n)
  15. crime: 2 (y,n)
  16. duty-free-exports: 2 (y,n)
  17. export-administration-act-south-africa: 2 (y,n)

8. Missing Attribute Values: Denoted by "?"

   NOTE: It is important to recognize that "?" in this database does 
         not mean that the value of the attribute is unknown.  It 
         means simply, that the value is not "yea" or "nay" (see 
         "Relevant Information" section above).

In [151]:
# since the data file has no header, we need to define "names" for the columns
# since the class label (political party in this case) is first, we'll assign
# that name, and then just label the various votes as vN, where N is a number
colNames = ['party']
for i in range(16):
    colNames.append('v'+str(i))
    
# actually read the data, then take a look at it
raw = pd.read_csv('house-votes-84.data', header=None, names=colNames )
raw


Unnamed: 0,party,v0,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
5,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
6,democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
7,republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
8,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
9,democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?


In [152]:
# as usual, length will give us the number of examples
len(raw)


435

## Split the data into training and testing sets

Next, we need to split the data into train and test sets, and then break off the labels from the feature vectors.  We'll also convert from Pandas style tables to numpy style arrays.

In [153]:
# let's shuffle this and split it into train and test sets:
shuffled = raw.sample(frac=1) # randomly re-order the examples# frac=1 => % of rows 
trainFrac = 0.7 # 70%/30% train/test split
trainCount = int(len(raw) * trainFrac)
train = shuffled[:trainCount]
test = shuffled[trainCount:]

In [154]:
print("train example count:", len(train), ", test example count:", len(test))

train example count: 304 , test example count: 131


In [155]:
# split off class labels and convert to numpy array
trainLabels = train['party'].values
testLabels = test['party'].values
len(trainLabels)

304

In [156]:
# let's take a look at what we got
trainLabels

array(['republican', 'democrat', 'republican', 'democrat', 'republican',
       'democrat', 'republican', 'democrat', 'republican', 'democrat',
       'republican', 'republican', 'democrat', 'democrat', 'republican',
       'democrat', 'democrat', 'democrat', 'democrat', 'republican',
       'democrat', 'democrat', 'democrat', 'republican', 'democrat',
       'democrat', 'democrat', 'republican', 'democrat', 'republican',
       'democrat', 'republican', 'republican', 'democrat', 'democrat',
       'democrat', 'republican', 'democrat', 'democrat', 'democrat',
       'democrat', 'republican', 'republican', 'democrat', 'democrat',
       'democrat', 'republican', 'democrat', 'democrat', 'democrat',
       'republican', 'democrat', 'democrat', 'democrat', 'republican',
       'republican', 'democrat', 'democrat', 'republican', 'democrat',
       'democrat', 'democrat', 'democrat', 'democrat', 'democrat',
       'republican', 'republican', 'republican', 'republican',
       'republican', '

In [157]:
# find all the distinct options:
classLabels = np.unique(trainLabels)
classLabels

array(['democrat', 'republican'], dtype=object)

In [158]:
# now remove the labels from the feature vectors and convert to numpy array
trainFeatures = train.drop(['party'], axis=1).values
testFeatures = test.drop(['party'], axis=1).values
trainFeatures

array([['n', 'y', 'n', ..., 'y', 'n', 'y'],
       ['y', 'n', 'y', ..., 'n', 'y', 'y'],
       ['y', 'y', 'n', ..., 'y', 'n', 'y'],
       ...,
       ['n', 'y', 'n', ..., 'y', 'n', 'y'],
       ['y', 'n', 'y', ..., 'y', 'y', 'y'],
       ['y', 'n', 'y', ..., 'n', 'n', 'y']], dtype=object)

In [159]:
# we can look at a row by index
trainFeatures[0]

array(['n', 'y', 'n', 'y', 'y', 'y', 'n', 'n', 'n', 'y', 'y', 'y', 'y',
       'y', 'n', 'y'], dtype=object)

In [160]:
# we can look at a single feature with a double index (row, col)
trainFeatures[0][0]

'n'

In [161]:
# total number of training examples
numExamples = len(trainFeatures)
numExamples

304

In [162]:
# number of features in the first example (should be the same for all examples)
numFeatures = len(trainFeatures[0])
numFeatures

16

In [163]:
# now let's find the unique feature values
featureValues = np.unique(trainFeatures)
featureValues

array(['?', 'n', 'y'], dtype=object)

## Count occurrences

Now we'll need to do all of our occurrence counting, so we can estimate our probabilities.  We'll need to estimate the unconditional liklihood for each class first (i.e. how many times each class label occurs in the training data divided by the total number of training examples).

In [164]:
# now we just need to build up our table of probability estimates
# we'll use python dictionaries for this, as it lets us use strings as our "index" values
labelCounts = {}
# initialize to 0s
for c in classLabels:# republican, democract
    labelCounts[c] = 0
print(labelCounts)
# then count occurrences
for i in range(len(trainLabels)):#304
    labelCounts[trainLabels[i]] += 1

{'democrat': 0, 'republican': 0}


### Compute unconditional probability estimates

In [165]:
# this lets us estimate the unconditional class probabilities, P(C)
# we just need to divide by the total number of examples
# we can do that here and store the result, or just do it on-the-fly
# in our final classifier algorithm
print(labelCounts)

labelProbs = {}
for c in classLabels:
    labelProbs[c] = labelCounts[c] / numExamples
    
print(labelProbs)

{'democrat': 190, 'republican': 114}
{'democrat': 0.625, 'republican': 0.375}


## Compute Conditional counts

Now we'll need to compute the conditional probabilities, for which the first step is to compute how often each feature value corresponds with each class.  You'll need to write this part yourself, but it should be similar to the above ones, only now you're going to need at 3-tuple (class, featureIndex, featureValue) as your dictionary key rather than a 2-tuple

In [166]:
# finally, we'll do the counts for the conditional probabilities, P(x|C)
featureCounts = {}
for c in featureValues:# y,n,?
    featureCounts[c] = 0

                    # then count occurrences
        #Counter(trainFeatures[0]).values()
for i in range(numExamples):
    featureCounts[trainFeatures[i][0]] += 1 # trainFeatures-y,y,y,y,n,/....
print(featureCounts)
featureprobs = {}
for c in featureValues: 
    featureprobs[c]= featureCounts[c] / numExamples#
print(featureprobs)

    


{'?': 10, 'n': 165, 'y': 129}
{'?': 0.03289473684210526, 'n': 0.5427631578947368, 'y': 0.4243421052631579}


In [172]:
data={}
data=np.array(raw)
data
featureVal={}
featureVal=([row[0] for row in trainFeatures])
d = {}
# if d['j']:
#     print('y')
# else:
#     print('n')
for i in range(304):
    for j in range(16):
        key=(data[i][0],j,data[i][j+1])
        if key in d:
            d[key] += 1
        else:
            d[key]=1
   

## Naive Bayes Classifier

Finally, we're ready to write a function that takes in a novel example and tries to classify it using the Naive Bayes algorithm.  You can either have your function take in your count tables from before, or just treat them as global variables.  Writing this function is left to you.

In [176]:
# now we can define a function to take an example and
# classify it using Naive Bayes (note that we treat the occurence counts as globals)
# NOTE: this function should return a class label
def naiveBayesClassify(example):
    probr=1
    probd=1
    for i in range(len(example)):
        probr=(probr*d[('republican',i,example[i])])/labelProbs['republican']
    for i in range(len(example)):
        probd=(probd*d[('democrat',i,example[i])])/labelProbs['democrat']
    probRepublic= labelProbs['republican']*probr
    probRepublic= probRepublic/labelProbs['republican']
    probDemo= labelProbs['democrat']*probr
    probDemo= probDemo/labelProbs['democrat']
    if probRepublic > probDemo:
        return "republican"
    else:
        return "democrat"


In [177]:
# let's try classifying one example
print("features:",trainFeatures[0])
print("true label:",trainLabels[0])
print("predicted label:", naiveBayesClassify(trainFeatures[0]))

features: ['n' 'y' 'n' 'y' 'y' 'y' 'n' 'n' 'n' 'y' 'y' 'y' 'y' 'y' 'n' 'y']
true label: republican
predicted label: democrat


In [179]:
# finally, let's apply it to the full test set and calculate an accuracy
correct = 0
for i in range(len(testFeatures)):
    guess = naiveBayesClassify(testFeatures[i])
    #print(guess)
    if guess == testLabels[i]:
        correct += 1
print("correct:", correct, ", accuracy:", correct / len(testFeatures))

correct: 76 , accuracy: 0.5801526717557252
