<a href="https://colab.research.google.com/github/peterjsadowski/ics235/blob/master/naive_bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 1: Naive Bayes Classifier

In this homework you will implement the Naive Bayes classifier on a data set of votes in the U.S. House of Representatives, with the goal of predicting the party affiliation of each representative. The input data $X$ is given by an $N$-by-$D$ matrix, where $N$ is the number of examples and $D=16$ is the number of input features. Each feature is binary (yes/no). The targets are given by a length-$N$ sequence of classes, $Y$, that are also binary (democrat/republican). More information on the data set can be found at https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records.
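
For reference, the Naive Bayes assumption is that the features are conditionally independent given the class. Writing $x_j$ for the $j$-th vote of one example and $y$ for its party label (lowercase notation chosen here for illustration), the model factorizes as:

$$
p(x, y) = p(y) \prod_{j=1}^{D} p(x_j \mid y), \qquad
p(y \mid x) = \frac{p(y) \prod_{j=1}^{D} p(x_j \mid y)}{\sum_{y'} p(y') \prod_{j=1}^{D} p(x_j \mid y')}
$$

Because each $x_j$ is binary, each factor $p(x_j \mid y)$ is a Bernoulli distribution with one parameter per feature and class.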






First, we need to download the data. The following code uses the urllib library to request the data file from the UCI repository. The pandas library is a powerful library for data analysis; we use its read_csv function to automatically parse the comma-separated values (CSV) file.

In [0]:
import pandas 
import urllib.request  
import numpy   # Numerical python.

# Download the data.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data"
response = urllib.request.urlopen(url)

# Interpret text data into pandas data frame. Interpret 'abstain' votes as 'no'.
dataset  = pandas.read_csv(response, header=None, true_values=['y'], false_values=['n','?'])

# Set the column names.
names = ['label'] + [f'vote_{i}' for i in range(16)]
dataset.columns = names

# Tells pandas that this is a categorical feature.
dataset['label'] = pandas.Categorical(dataset['label'])

print("Dataset shape: ", dataset.shape)
dataset.head() # Prints first 5 examples from the data set.

Dataset shape:  (435, 17)


| | label | vote_0 | vote_1 | vote_2 | vote_3 | vote_4 | vote_5 | vote_6 | vote_7 | vote_8 | vote_9 | vote_10 | vote_11 | vote_12 | vote_13 | vote_14 | vote_15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | republican | False | True | False | True | True | True | False | False | False | True | False | True | True | True | False | True |
| 1 | republican | False | True | False | True | True | True | False | False | False | False | False | True | True | True | False | False |
| 2 | democrat | False | True | True | False | True | True | False | False | False | False | True | False | True | True | False | False |
| 3 | democrat | False | True | True | False | False | True | False | False | False | False | True | False | True | False | False | True |
| 4 | democrat | True | True | True | False | True | True | False | False | False | False | True | False | True | True | True | True |


NumPy is a powerful library for mathematical operations on vectors and matrices. Here we convert the pandas data into a 2-dimensional NumPy array (a matrix) and split it into a training set and a test set.

In [0]:
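X = numpy.array(dataset.iloc[:,1:]) # Convert input features into NumPy array.
Y = numpy.array(dataset['label'].cat.codes) # Convert string labels to integer codes (democrat=0, republican=1).

# Split data into train and test sets: the first 100 examples are for training, the rest for testing.
Xtrain, Ytrain = X[0:100, :], Y[0:100]
Xtest,  Ytest  = X[100:, :], Y[100:]
print(Xtrain.shape, Xtest.shape)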

(100, 16) (335, 16)
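
To see how the label conversion above behaves, here is a small self-contained sketch on a toy label column. The toy values are made up for illustration and are not taken from the data set. pandas sorts string categories alphabetically, so `democrat` should be coded as 0 and `republican` as 1.

```python
import pandas

# Toy labels, made up for illustration only.
toy_labels = pandas.Series(
    pandas.Categorical(['republican', 'democrat', 'democrat', 'republican']))

# Categories are inferred from the data and sorted, so 'democrat' comes first.
categories = list(toy_labels.cat.categories)

# Each label is replaced by the index of its category.
codes = toy_labels.cat.codes.to_numpy()

print(categories)
print(codes)
```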


You are asked to implement the following functions.

In [0]:

def generative_model(Xtrain, Ytrain):
    '''
    Fits a generative model to binary data.
    Inputs
        Xtrain: NxD matrix of binary features.
        Ytrain: Length N vector of class labels (0 or 1).
    
    Returns
        p_label: Length 2 vector of class probabilities.
        p_votes: 2xD matrix where entry i,j is p(x_j=1|y=i).
    '''
    # WRITE ME 
    pass
    return p_label, p_votes
  
def discriminative_model(p_label, p_votes, Xtest):
    '''
    Implements Naive Bayes Classification.
    Inputs
      p_label, p_votes: From generative_model.
      Xtest: NxD matrix of binary features.
    '''
    # WRITE ME
    pass

def accuracy(y_true, y_predicted):
    ''' Calculates the fraction of correct predictions.
    '''
    assert(len(y_true) == len(y_predicted))
    # WRITE ME
    pass
    

## To turn in:
1) Implement the Naive Bayes classifier using the starter code above.

2) Compute the log probability of the test set under your model.

3) Compare the NB classifier to a baseline model that predicts a 50-50 chance for each vote, in terms of accuracy and log probability. Which model is better and why? Describe two situations in which the Naive Bayes classifier will fail.
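
As a starting point for the comparison in item 3, here is a minimal sketch of the 50-50 baseline. It assumes that "log probability of the test set" means the sum of the log probabilities of all individual votes, which is only one possible reading and should be checked against the course definition. The function name `baseline_log_probability` and the toy matrix are hypothetical and not part of the starter code.

```python
import numpy

def baseline_log_probability(X):
    '''Log probability of a binary matrix X under a model in which every
    vote is an independent coin flip with p(yes) = 0.5.
    (Hypothetical helper, not part of the starter code.)
    '''
    X = numpy.asarray(X)
    # Every entry has probability 0.5 regardless of its value,
    # so the total is (number of entries) * log(0.5).
    return X.size * numpy.log(0.5)

# Toy example: 3 examples with 4 votes each (made-up values).
X_toy = numpy.array([[True, False, True, True],
                     [False, False, True, False],
                     [True, True, False, False]])
print(baseline_log_probability(X_toy))
```

Under this reading, the baseline score depends only on the number of votes, not on their values, so any model that captures real structure in the votes should assign the test set a higher log probability.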

