# Introduction


In this lesson, we'll review Bayes' Theorem in the context of a 1984 Congressional Voting dataset from the University of California, Irvine. You can access the dataset from the [UCI repository here](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records).


# Conditional Probability

A congressman voted no on providing aid to El Salvador. Given that 61% of Congress were Democrats, that 74.9% of Democrats voted no on the aid, and that only 4.8% of Republicans voted no, what is the conditional probability that this individual is a Democrat? Assume every member is either a Democrat or a Republican.


## Teacher Notes

Use this question as an opportunity to gauge students' understanding of Bayes' Theorem and probability concepts in general. From here, this question can lead naturally into formulating Naive Bayes for classification.


## Answer

$P(A|B) = \frac{P(B|A)\cdot P(A)}{P(B)}$   
  
$P(\text{Democrat|Voted No to El Salvador Aid}) = \frac{P(\text{Voted No to El Salvador Aid}|\text{Democrat})\cdot P(\text{Democrat})}{P(\text{Voted No to El Salvador Aid})}$
  
$P(\text{Democrat|Voted No to El Salvador Aid}) = \frac{.749\cdot .61}{P(\text{Voted No to El Salvador Aid})}$

The denominator here is a little tricky. By the law of total probability, it sums the probability of a no vote over both parties:  

$P(\text{Voted No to El Salvador Aid}) = P(\text{Voted No to El Salvador Aid}| \text{Democrat})\cdot P(\text{Democrat}) + P(\text{Voted No to El Salvador Aid}| \text{Republican})\cdot P(\text{Republican})$

$P(\text{Voted No to El Salvador Aid}) = .749 \cdot .61 + .048 \cdot (1-.61)$

$P(\text{Voted No to El Salvador Aid}) = 0.47561$


Substituting into our original formula...

$P(\text{Democrat|Voted No to El Salvador Aid}) = \frac{.749\cdot .61}{0.47561}$

$P(\text{Democrat|Voted No to El Salvador Aid}) = 0.96064$
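
The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the variable names are illustrative, and the inputs are the percentages stated in the word problem.

```python
# Values taken from the word problem above
p_dem = 0.61             # P(Democrat)
p_no_given_dem = 0.749   # P(Voted No | Democrat)
p_no_given_rep = 0.048   # P(Voted No | Republican)

# Assumes every member is either a Democrat or a Republican
p_rep = 1 - p_dem

# Law of total probability for the denominator
p_no = p_no_given_dem * p_dem + p_no_given_rep * p_rep

# Bayes' Theorem
posterior_dem = p_no_given_dem * p_dem / p_no

print(round(p_no, 5))
print(round(posterior_dem, 5))
```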

# From Scenario to Problem Formulation

Your boss asks you to write a classifier to determine if a politician is Republican or Democrat based on their voting record. Given the dataset below, how would you formulate this problem? What is the dependent variable? What are the independent variables?

In [1]:
!head house-votes-84.data

republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?


In [2]:
columns = ["Class Name",
"handicapped-infants",
"water-project-cost-sharing",
"adoption-of-the-budget-resolution",
"physician-fee-freeze",
"el-salvador-aid",
"religious-groups-in-schools",
"anti-satellite-test-ban",
"aid-to-nicaraguan-contras",
"mx-missile",
"immigration",
"synfuels-corporation-cutback",
"education-spending",
"superfund-right-to-sue",
"crime",
"duty-free-exports",
"export-administration-act-south-africa"]

In [3]:
import pandas as pd

df = pd.read_csv('house-votes-84.data', header=None, names=columns)
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 435 entries, 0 to 434
Data columns (total 17 columns):
Class Name                                435 non-null object
handicapped-infants                       435 non-null object
water-project-cost-sharing                435 non-null object
adoption-of-the-budget-resolution         435 non-null object
physician-fee-freeze                      435 non-null object
el-salvador-aid                           435 non-null object
religious-groups-in-schools               435 non-null object
anti-satellite-test-ban                   435 non-null object
aid-to-nicaraguan-contras                 435 non-null object
mx-missile                                435 non-null object
immigration                               435 non-null object
synfuels-corporation-cutback              435 non-null object
education-spending                        435 non-null object
superfund-right-to-sue                    435 non-null object
crime                      

Unnamed: 0,Class Name,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [11]:
#Problem Formulation

X = df[[col for col in df.columns if col!='Class Name']]
y = df['Class Name']

# Implementing a Statistical Learning Algorithm

How could you extend Bayes' Theorem from the word problem above to create a Naive Bayes classifier?

# Review Lab Solutions

[Lab Solutions](https://github.com/learn-co-curriculum/dsc-gaussian-naive-bayes-lab/tree/solution)

Questions for the class:

* How could you extend Bayes' Theorem from the word problem above to create a Naive Bayes classifier?
    * For each class (Dem/Rep), start from the class prior and multiply it by the conditional probability of each of the individual's votes given that class. The product is the overall relative belief we have that this individual belongs to that class. Choose the class with the higher of these two relative beliefs.
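
The answer above can be written compactly. The notation here is an assumption introduced for illustration: $C$ is the class (Democrat or Republican) and $v_1, \dots, v_n$ are the individual's votes on the $n$ issues, treated as conditionally independent given the class (the "naive" assumption).

$P(C|v_1, \dots, v_n) \propto P(C)\cdot \prod_{i=1}^{n} P(v_i|C)$  
  
$\hat{C} = \arg\max_{C} \; P(C)\cdot \prod_{i=1}^{n} P(v_i|C)$

The shared denominator $P(v_1, \dots, v_n)$ is dropped because it is the same for both classes and does not change which class scores higher.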

After discussing implementing a Naive Bayes Classifier for this dataset....


* How does this differ from the [Gaussian Naive Bayes - Lab](https://github.com/learn-co-curriculum/dsc-gaussian-naive-bayes-lab) where you wrote a classifier for predicting whether an individual has heart disease? You can see the original dataset [here](https://www.kaggle.com/ronitf/heart-disease-uci#heart.csv). Here's a preview of some of the data:
<img src="images/uci-heart-data-preview.png">
* What kind of implementation would you use here? [Given that you are working with categorical variables]
* Gaussian Naive Bayes vs Multinomial Naive Bayes
    * How did the classifier implementations differ for the Gaussian Naive Bayes Lab versus the Document Classification Lab? Why were alternative formulations used? (Continuous data versus discrete)
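
To make the continuous-versus-discrete distinction concrete, here is a minimal sketch of the two kinds of per-feature likelihood. The helper names and the sample values are hypothetical and are not taken from either lab or dataset.

```python
import math
from collections import Counter

def gaussian_likelihood(x, values):
    """Density of x under a normal distribution fitted to `values` (continuous feature)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def categorical_likelihood(x, values):
    """Relative frequency of category x among `values` (discrete feature)."""
    return Counter(values)[x] / len(values)

# Made-up illustrative data for a single class
ages_in_class = [52, 61, 58, 49, 65]          # continuous, e.g. a patient's age
votes_in_class = ['y', 'y', 'n', 'y', '?']    # categorical, e.g. votes on one issue

age_density = gaussian_likelihood(57, ages_in_class)
vote_probability = categorical_likelihood('y', votes_in_class)
print(age_density, vote_probability)
```

For the voting data every feature is categorical, so the frequency-based estimate is the natural fit. A Gaussian density only makes sense for numeric measurements.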

# Comparing Naive Bayes Classifiers

* How does this differ from the [Gaussian Naive Bayes - Lab](https://github.com/learn-co-curriculum/dsc-gaussian-naive-bayes-lab) where you wrote a classifier for predicting whether an individual has heart disease? You can see the original dataset [here](https://www.kaggle.com/ronitf/heart-disease-uci#heart.csv). Here's a preview of some of the data:
<img src="images/uci-heart-data-preview.png">

In [4]:
df['Class Name'].unique()

array(['republican', 'democrat'], dtype=object)

In [5]:
df['Class Name'].value_counts(normalize=True)['democrat']

0.6137931034482759

In [6]:
v = df.iloc[0]
v.index

Index(['Class Name', 'handicapped-infants', 'water-project-cost-sharing',
       'adoption-of-the-budget-resolution', 'physician-fee-freeze',
       'el-salvador-aid', 'religious-groups-in-schools',
       'anti-satellite-test-ban', 'aid-to-nicaraguan-contras', 'mx-missile',
       'immigration', 'synfuels-corporation-cutback', 'education-spending',
       'superfund-right-to-sue', 'crime', 'duty-free-exports',
       'export-administration-act-south-africa'],
      dtype='object')

In [8]:
def classify_rep(row, train, return_posteriors=False):
    classes = list(train['Class Name'].unique())
    priors = dict(train['Class Name'].value_counts(normalize=True).map(np.log))
    for issue in row.index:
        if issue == 'Class Name':
            continue
        else:
            congressman_vote = row[issue]
            training_probs = train.groupby('Class Name')[issue].value_counts(normalize=True)
            democrat_conditional_prob = training_probs['democrat'][congressman_vote]
            republican_conditional_prob = training_probs['republican'][congressman_vote]
            priors['democrat'] += np.log(democrat_conditional_prob)
            priors['republican'] += np.log(republican_conditional_prob)
    posteriors = [priors[class_] for class_ in classes]
    if return_posteriors:
        return (posteriors, classes[np.argmax(posteriors)])
    return classes[np.argmax(posteriors)]

# Scaffolding / Additional Analysis Questions

* Explain line 3 of the code block above.
    > A: The prior probability for each class is that class's relative frequency in the training data. Taking the log lets later updates be added rather than multiplied.
* Explain lines 12 and 13.
    > A: Update each party's running log-probability by adding the log of the conditional probability of this vote given that party.
* Explain line 14.
    > A: Create a list of the updated log-probabilities, ordered to match `classes`, so that we can easily select the class with the maximum relative likelihood.
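
One common reason to accumulate log-probabilities instead of multiplying raw probabilities is numerical underflow. The sketch below uses a hypothetical list of 400 small probabilities to show the effect. With only 16 votes per congressman underflow is unlikely to matter in this dataset, so treat this as general background rather than a statement of why the function was written this way.

```python
import math

# Hypothetical example: 400 conditional probabilities of 0.01 each
probs = [0.01] * 400

# Multiplying directly underflows to exactly 0.0 in floating point
product = 1.0
for p in probs:
    product *= p

# Summing logs keeps a finite, comparable score
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```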

In [17]:
import numpy as np
from sklearn.model_selection import train_test_split




X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=22)


train = pd.concat([X_train, y_train], axis=1)
test = pd.concat([X_test, y_test], axis=1)

train.head()

Unnamed: 0,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa,Class Name
409,n,n,n,y,y,y,n,n,n,n,n,y,y,y,n,n,republican
12,n,y,y,n,n,n,y,y,y,n,n,n,y,n,?,?,democrat
365,n,y,n,n,y,y,n,n,n,y,y,n,y,y,n,n,democrat
65,y,y,n,y,y,y,y,n,n,n,n,y,y,y,n,y,republican
384,y,y,y,y,y,y,n,n,n,n,y,y,y,y,n,y,democrat


In [15]:
y_train_pred = []
y_test_pred = []
y_train_residuals = []
y_test_residuals = []

train = pd.concat([X_test, y_test], axis=1)

for row_idx in train.index:
    row = train.loc[row_idx]
    pred = classify_rep(row, train)
    y_train_pred.append(pred)
    # Record 1 for a correct prediction, 0 otherwise
    y_train_residuals.append(int(pred == row['Class Name']))
for row_idx in test.index:
    row = test.loc[row_idx]
    pred = classify_rep(row, train)
    y_test_pred.append(pred)
    y_test_residuals.append(int(pred == row['Class Name']))
print('Training Accuracy: ', np.mean(y_train_residuals))
print('Testing Accuracy: ', np.mean(y_test_residuals))

KeyError: '?'

# Analyzing Code

Investigate the `classify_rep()` function written above. Why does it throw an error when run in the cell above? What implications does this have for the Naive Bayes classifier? How should cases like this be handled in your opinion?



## Answer

The code above throws a `KeyError` because the lookup found no matching examples in the training set. In other words, while the individual voted `?` on an issue, no member of one of the classes cast a `?` vote on that issue in the training data. This is a tricky scenario. If you assign zero probability to this case because there were no corresponding observations, then that class's entire product of probabilities becomes zero regardless of the other votes (in log space, `np.log(0)` is negative infinity). As such, an appropriate adaptation might be to assign a small nonzero probability instead. However, the magnitude chosen for this small probability is itself arbitrary.
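
One standard, less arbitrary way to pick that small probability is additive (Laplace) smoothing: add a pseudo-count to every possible vote before normalizing. The sketch below illustrates the idea in isolation. It is a swapped-in technique rather than what this lesson's `classify_rep()` does, and the helper name, the `alpha` parameter, and the sample votes are all hypothetical.

```python
from collections import Counter

def smoothed_vote_probs(votes, categories=('y', 'n', '?'), alpha=1):
    """Add-alpha (Laplace) estimate of P(vote | class) from one class's votes."""
    counts = Counter(votes)
    total = len(votes) + alpha * len(categories)
    return {c: (counts[c] + alpha) / total for c in categories}

# Made-up votes for one class on one issue; nobody in this class voted '?'
probs = smoothed_vote_probs(['y', 'y', 'n', 'y', 'n', 'y', 'y'])
print(probs)
```

Every category now receives a nonzero probability, and the probabilities still sum to one.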

In [18]:
# Updated Code

def classify_rep(row, train, return_posteriors=False):
    classes = list(train['Class Name'].unique())
    # Log-priors: each class's relative frequency in the training data
    priors = dict(train['Class Name'].value_counts(normalize=True).map(np.log))
    for issue in row.index:
        if issue == 'Class Name':
            continue
        congressman_vote = row[issue]
        training_probs = train.groupby('Class Name')[issue].value_counts(normalize=True)
        # Fall back to a small nonzero probability when a class never cast this vote
        try:
            democrat_conditional_prob = training_probs['democrat'][congressman_vote]
        except KeyError:
            democrat_conditional_prob = 5e-3
        try:
            republican_conditional_prob = training_probs['republican'][congressman_vote]
        except KeyError:
            republican_conditional_prob = 5e-3
        priors['democrat'] += np.log(democrat_conditional_prob)
        priors['republican'] += np.log(republican_conditional_prob)
    posteriors = [priors[class_] for class_ in classes]
    if return_posteriors:
        return (posteriors, classes[np.argmax(posteriors)])
    return classes[np.argmax(posteriors)]

y_train_pred = []
y_test_pred = []
y_train_residuals = []  # 1 for a correct prediction, 0 otherwise
y_test_residuals = []

# Rebuild the training frame from the training split
train = pd.concat([X_train, y_train], axis=1)

for row_idx in train.index:
    row = train.loc[row_idx]
    pred = classify_rep(row, train)
    y_train_pred.append(pred)
    y_train_residuals.append(int(pred == row['Class Name']))
for row_idx in test.index:
    row = test.loc[row_idx]
    pred = classify_rep(row, train)
    y_test_pred.append(pred)
    y_test_residuals.append(int(pred == row['Class Name']))
print('Training Accuracy: ', np.mean(y_train_residuals))
print('Testing Accuracy: ', np.mean(y_test_residuals))