# Naive Bayes

In [4]:
import pandas as pd

To reduce my email load, I decide to implement a machine learning algorithm that decides whether I should read an email or simply file it away. To train my model, I obtain the following data set of binary-valued features about each email, including whether I know the author, whether the email is long or short, and whether it contains any of several keywords, along with my final decision about whether to read it (y = +1 for “read”, y = -1 for “discard”).

In [2]:
D = [  # one row per email: author, long, research, grade, lottery, read
    [0,0,1,1,0,-1], [1,1,0,1,0,-1], [0,1,1,1,1,-1], [1,1,1,1,0,-1], [0,1,0,0,0,-1],
    [1,0,1,1,1,1], [0,0,1,0,0,1], [1,0,0,0,0,1], [1,0,1,1,0,1], [1,1,1,1,1,-1],
]
df = pd.DataFrame(data=D, columns=['author','long','research','grade','lottery','read'])

df

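|   | author | long | research | grade | lottery | read |
|---|--------|------|----------|-------|---------|------|
| 0 | 0 | 0 | 1 | 1 | 0 | -1 |
| 1 | 1 | 1 | 0 | 1 | 0 | -1 |
| 2 | 0 | 1 | 1 | 1 | 1 | -1 |
| 3 | 1 | 1 | 1 | 1 | 0 | -1 |
| 4 | 0 | 1 | 0 | 0 | 0 | -1 |
| 5 | 1 | 0 | 1 | 1 | 1 | 1 |
| 6 | 0 | 0 | 1 | 0 | 0 | 1 |
| 7 | 1 | 0 | 0 | 0 | 0 | 1 |
| 8 | 1 | 0 | 1 | 1 | 0 | 1 |
| 9 | 1 | 1 | 1 | 1 | 1 | -1 |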



In the case of any ties, we will prefer to predict class +1. I decide to try a naïve Bayes classifier to make my decisions and compute my uncertainty.

#### Q: Compute all the probabilities necessary for a naïve Bayes classifier, i.e., the class probability p(y) and all the individual feature probabilities p(xi | y), for each class y and feature xi.

P(author) = 6/10  
P(no author) = 4/10  
P(author|read) = 3/4  
P(author|not read) = 3/6  
P(no author|read) = 1/4   
P(no author|not read) = 3/6   

P(long) = 5/10   
P(not long) = 5/10   
P(long|read) = 0/4   
P(long|not read) = 5/6   
P(not long|read) = 4/4    
P(not long|not read) = 1/6   

P(research) = 7/10   
P(no research) = 3/10    
P(research|read) = 3/4     
P(research|not read) = 4/6      
P(no research|read) = 1/4     
P(no research|not read) = 2/6    

P(grade) = 7/10    
P(no grade) = 3/10    
P(grade|read) = 2/4   
P(grade|not read) = 5/6   
P(no grade|read) = 2/4  
P(no grade|not read) = 1/6  

P(lottery) = 3/10    
P(no lottery) = 7/10    
P(lottery|read) = 1/4      
P(lottery|not read) = 2/6   
P(no lottery|read) = 3/4     
P(no lottery|not read) = 4/6      

P(read) = 4/10      
P(not read) = 6/10      
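
The counts above can be cross-checked programmatically. The following is a minimal sketch, assuming the same data set as the `D` list in this notebook; the helper name `estimate_parameters` is hypothetical and introduced only for illustration. It uses `Fraction` so the results can be compared with the hand-computed ratios (note that `Fraction` reduces them, e.g. 3/6 prints as 1/2).

```python
from fractions import Fraction

import pandas as pd

# Same rows as the notebook's D: five binary features followed by the label.
D = [[0,0,1,1,0,-1],[1,1,0,1,0,-1],[0,1,1,1,1,-1],[1,1,1,1,0,-1],[0,1,0,0,0,-1],
     [1,0,1,1,1,1],[0,0,1,0,0,1],[1,0,0,0,0,1],[1,0,1,1,0,1],[1,1,1,1,1,-1]]
features = ['author', 'long', 'research', 'grade', 'lottery']
df = pd.DataFrame(D, columns=features + ['read'])


def estimate_parameters(df, features, label='read'):
    """Return (prior, likelihood): prior[y] = p(y), likelihood[y][f] = p(f = 1 | y)."""
    prior, likelihood = {}, {}
    for y, group in df.groupby(label):
        y = int(y)  # plain Python int so the dict keys are easy to look up
        prior[y] = Fraction(len(group), len(df))
        likelihood[y] = {
            f: Fraction(int(group[f].sum()), len(group)) for f in features
        }
    return prior, likelihood


prior, likelihood = estimate_parameters(df, features)
for y in (1, -1):
    print(f"p(y={y:+d}) = {prior[y]}")
    for f in features:
        # p(f = 0 | y) is simply 1 - p(f = 1 | y), so only one value is stored.
        print(f"  p({f}=1 | y={y:+d}) = {likelihood[y][f]}")
```

Only p(feature = 1 | y) is stored for each class, since the complementary probability follows from it.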

#### Q: Which class would be predicted for x = (0 0 0 0 0)? What about for x = (1 1 0 1 0)?



**x = (0,0,0,0,0)**  
**P(read, x) = P(read)P(no author|read)P(not long|read)P(no research|read)P(no grade|read)P(no lottery|read)**  
                  = (4/10)(1/4)(4/4)(1/4)(2/4)(3/4)  
                  = (96/10,240)  
                  = 0.009375  

**P(not read, x) = P(not read)P(no author|not read)P(not long|not read)P(no research|not read)P(no grade|not read)P(no lottery|not read)**  
                     = (6/10)(3/6)(1/6)(2/6)(1/6)(4/6)  
                     = (144/77,760)  
                     = 0.00185185  

**These products are the joint probabilities P(y, x), which are proportional to the posteriors P(y|x). Because the value for read (0.009375) is higher than the value for not read (0.00185), the predicted class is read.**


**x = (1,1,0,1,0)**  
**P(read, x) = P(read)P(author|read)P(long|read)P(no research|read)P(grade|read)P(no lottery|read)**  
                  = (4/10)(3/4)(0/4)(1/4)(2/4)(3/4)  
                  = (0/10,240)  
                  = 0  

**P(not read, x) = P(not read)P(author|not read)P(long|not read)P(no research|not read)P(grade|not read)P(no lottery|not read)**  
                      = (6/10)(3/6)(5/6)(2/6)(5/6)(4/6)  
                      = (3,600/77,760)  
                      = 0.0463  

**Because the joint value for read is zero, while the joint value for not read is 0.046, the predicted class for these features is not read.**
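
The two predictions can be reproduced with a short script. This is a minimal sketch in plain Python, assuming the same ten rows as the notebook's `D`; the function names `joint_score` and `predict` are hypothetical. It computes the product p(y) · ∏ p(xi | y) for each class by counting, and breaks ties in favor of +1 as stated earlier.

```python
from fractions import Fraction

# Same rows as the notebook's D: five binary features followed by the label.
D = [[0,0,1,1,0,-1],[1,1,0,1,0,-1],[0,1,1,1,1,-1],[1,1,1,1,0,-1],[0,1,0,0,0,-1],
     [1,0,1,1,1,1],[0,0,1,0,0,1],[1,0,0,0,0,1],[1,0,1,1,0,1],[1,1,1,1,1,-1]]
X = [row[:-1] for row in D]
Y = [row[-1] for row in D]


def joint_score(x, y):
    """Naive Bayes estimate of p(y) * prod_i p(x_i | y), obtained by counting rows."""
    rows = [xr for xr, yr in zip(X, Y) if yr == y]
    score = Fraction(len(rows), len(D))  # class prior p(y)
    for i, xi in enumerate(x):
        matches = sum(1 for r in rows if r[i] == xi)
        score *= Fraction(matches, len(rows))  # p(x_i | y)
    return score


def predict(x):
    """Pick the class with the larger joint score; ties go to +1."""
    return 1 if joint_score(x, 1) >= joint_score(x, -1) else -1


for x in [(0, 0, 0, 0, 0), (1, 1, 0, 1, 0)]:
    print(x, float(joint_score(x, 1)), float(joint_score(x, -1)), predict(x))
```

Because no smoothing is applied, a single zero count such as P(long|read) = 0/4 forces the whole product for that class to zero.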

#### Q: Compute the posterior probability that y = +1 given the observation x = (1 1 0 1 0).



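By Bayes' rule, P(y = +1 | x) = P(read)∏P(xi|read) / [P(read)∏P(xi|read) + P(not read)∏P(xi|not read)]. For x = (1,1,0,1,0) the numerator is 0, because P(long|read) = 0/4, while the denominator is 0 + 0.0463. The posterior probability that y = +1 is therefore 0/0.0463 = 0.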
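
The normalization step can be written out explicitly. The sketch below assumes the per-class products from the previous question as inputs; the `posterior` helper is a hypothetical name, and it assumes at least one class has a nonzero joint score (otherwise the division is undefined).

```python
from fractions import Fraction as F

# Joint scores p(y) * prod_i p(x_i | y), copied from the hand computation.
joint_11010 = {
    +1: F(4, 10) * F(3, 4) * F(0, 4) * F(1, 4) * F(2, 4) * F(3, 4),
    -1: F(6, 10) * F(3, 6) * F(5, 6) * F(2, 6) * F(5, 6) * F(4, 6),
}
joint_00000 = {
    +1: F(4, 10) * F(1, 4) * F(4, 4) * F(1, 4) * F(2, 4) * F(3, 4),
    -1: F(6, 10) * F(3, 6) * F(1, 6) * F(2, 6) * F(1, 6) * F(4, 6),
}


def posterior(joint):
    """Normalize joint scores p(x, y) into posteriors p(y | x)."""
    evidence = sum(joint.values())  # p(x) = sum over classes of p(x, y)
    return {y: score / evidence for y, score in joint.items()}


print(posterior(joint_11010))  # posterior for x = (1, 1, 0, 1, 0)
print(posterior(joint_00000))  # posterior for x = (0, 0, 0, 0, 0)
```

The same function also gives the posterior for x = (0,0,0,0,0), which is not zero for either class.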

#### Q: Why should we probably not use a “joint” Bayes classifier (using the joint probability of the features x, as opposed to a naïve Bayes classifier) for these data?



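We should not use a joint Bayes classifier for this data set because it would have to estimate a probability for each of the 2^5 = 32 feature combinations in each class (31 free parameters per class), and with only 10 training emails most of those combinations are never observed. A naïve Bayes classifier needs only one probability per feature per class, 10 in total, because under the conditional independence assumption the number of parameters grows by adding the feature possibilities instead of multiplying them.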
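
To make the comparison concrete, the following sketch counts free parameters for binary features and checks how many distinct feature combinations the ten training emails actually cover. The function names are hypothetical, and the counts exclude the class prior, which both models share.

```python
# Same rows as the notebook's D: five binary features followed by the label.
D = [[0,0,1,1,0,-1],[1,1,0,1,0,-1],[0,1,1,1,1,-1],[1,1,1,1,0,-1],[0,1,0,0,0,-1],
     [1,0,1,1,1,1],[0,0,1,0,0,1],[1,0,0,0,0,1],[1,0,1,1,0,1],[1,1,1,1,1,-1]]


def joint_parameter_count(n_features, n_classes=2):
    """Full table over 2**n configurations per class; one is fixed by sum-to-one."""
    return n_classes * (2 ** n_features - 1)


def naive_parameter_count(n_features, n_classes=2):
    """One Bernoulli parameter p(x_i = 1 | y) per feature per class."""
    return n_classes * n_features


def observed_configurations(data):
    """Number of distinct feature vectors seen in the data for each class."""
    seen = {}
    for *x, y in data:
        seen.setdefault(y, set()).add(tuple(x))
    return {y: len(configs) for y, configs in seen.items()}


print("joint:", joint_parameter_count(5))
print("naive:", naive_parameter_count(5))
print("distinct configurations observed per class:", observed_configurations(D))
```

Any configuration that never appears for a class would get an estimated joint probability of zero, which is why the full joint model is impractical with so little data.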

#### Q: Suppose that, before we make our predictions, we lose access to my address book, so that we cannot tell whether the email author is known. Should we re-train the model, and if so, how? (e.g., how does the model, and its parameters, change in this new situation?) Hint: what will the naïve Bayes model over only features x2 ... x5 look like, and what will its parameters be?



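The model should omit the author feature and keep the other four features, since author is the only feature that is no longer available. In a naïve Bayes model each feature's probabilities are estimated separately given the class, so re-training on x2 through x5 produces exactly the same P(y) and P(xi|y) as before. In effect, the only change is that the P(author|y) factor is dropped from the product, so no new estimation is actually required.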
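
This can be checked directly by fitting the model twice, once with all five features and once without author. The sketch below is plain Python over the same rows as the notebook's `D`; the `fit` helper and its column-index argument are hypothetical names introduced for illustration.

```python
from fractions import Fraction

# Same rows as the notebook's D: five binary features followed by the label.
D = [[0,0,1,1,0,-1],[1,1,0,1,0,-1],[0,1,1,1,1,-1],[1,1,1,1,0,-1],[0,1,0,0,0,-1],
     [1,0,1,1,1,1],[0,0,1,0,0,1],[1,0,0,0,0,1],[1,0,1,1,0,1],[1,1,1,1,1,-1]]
names = ['author', 'long', 'research', 'grade', 'lottery']


def fit(columns):
    """Estimate p(y) and p(x_i = 1 | y) using only the given feature columns."""
    params = {}
    for y in (1, -1):
        rows = [r for r in D if r[-1] == y]
        params[y] = {
            'prior': Fraction(len(rows), len(D)),
            'likelihood': {
                names[c]: Fraction(sum(r[c] for r in rows), len(rows))
                for c in columns
            },
        }
    return params


full = fit([0, 1, 2, 3, 4])   # all five features
reduced = fit([1, 2, 3, 4])   # author column left out

for y in (1, -1):
    same_prior = reduced[y]['prior'] == full[y]['prior']
    same_likelihood = all(
        full[y]['likelihood'][f] == p for f, p in reduced[y]['likelihood'].items()
    )
    print(y, same_prior, same_likelihood)
```

The reduced model's parameters match the corresponding entries of the full model, which is what the conditional independence assumption predicts.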

## 2. Statement of Collaboration

#### a. Whom did I work with?

I mainly worked with Tucker, Kolby, and Matt on this lab, but I did most of it on my own.

## 3. Extra Credit

#### a. Frequentist vs. Bayesian Argument

The frequentist school only uses conditional distributions, given specific data. A frequentist believes that probabilities represent the long-run frequencies with which events occur. Bayesian probability is more an indication of the plausibility of a proposition or a situation.

I agree with the Bayesian side of the argument because I think that probability represents plausibility rather than how often an event occurs.