# <center>Group Project 1: Supervised Learning</center>
## <center>Josh Melton and Ivan Benitez</center>  

### Part 2: Congressional Voting Data
#### a) Data Preparation

First, read the voting data CSV into a pandas DataFrame.
- The data contain yes/no voting records on various issues: 'y' is converted to 1 and 'n' to 0.
- The class label column contains each congressperson's party affiliation. For simplicity, 'republican' is converted to 1 and 'democrat' to 0.
- The data also contain missing values, encoded as '?'. These values are converted to NumPy NaN.

In [13]:
import pandas as pd
import numpy as np

voting_data = pd.read_csv('voting_data.csv',
                          header=0,
                          index_col=False)

voting_data.replace({'y': 1, 'n': 0, '?': np.nan,
                     'republican': 1, 'democrat': 0}, inplace=True)

# voting_data.info()

Three different methods for handling the missing data were used to create three versions of the dataset:
- Version 1: Remove rows with missing values
- Version 2: Replace missing values with a third label (2)
- Version 3: Replace missing values with the mode

In [14]:
from sklearn.impute import SimpleImputer

# Version 1: Drop rows with NaN
voting_data_v1 = voting_data.dropna(axis=0, how='any')
labels_v1 = voting_data_v1['Class Name']
features_v1 = voting_data_v1.drop('Class Name', axis=1)

# Version 2: Replace NaN with third category (2)
voting_data_v2 = voting_data.fillna(2)
labels_v2 = voting_data_v2['Class Name']
features_v2 = voting_data_v2.drop('Class Name', axis=1)

# Version 3: Replace NaN with mode (most frequent)
labels_v3 = voting_data['Class Name']
features_v3 = voting_data.drop('Class Name', axis=1)

imp = SimpleImputer(strategy='most_frequent')
features_v3 = imp.fit_transform(features_v3)

features = [features_v1, features_v2, features_v3]
labels = [labels_v1, labels_v2, labels_v3]
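
To make the effect of each strategy concrete, the following is a minimal sketch on a small made-up frame. The column names `vote_a` and `vote_b` and the toy values are assumptions for illustration only and are not part of the voting data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two vote columns plus a label, with NaN for missing votes
toy = pd.DataFrame({
    'Class Name': [1, 0, 0, 1],
    'vote_a': [1.0, np.nan, 0.0, 1.0],
    'vote_b': [0.0, 0.0, np.nan, 1.0],
})

# Version 1: drop every row that contains a missing value
toy_v1 = toy.dropna(axis=0, how='any')

# Version 2: treat "missing" as its own category, encoded as 2
toy_v2 = toy.fillna(2)

# Version 3: fill each feature column with its most frequent value
toy_features = toy.drop('Class Name', axis=1)
toy_v3 = SimpleImputer(strategy='most_frequent').fit_transform(toy_features)

print(toy_v1.shape)   # rows containing NaN are gone
print(toy_v2)
print(toy_v3)         # a NumPy array, not a DataFrame
```

Two practical differences show up even at this size: Version 1 shrinks the dataset, and Version 3 comes back from `SimpleImputer` as a NumPy array, so the column names are no longer attached.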

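Two helper functions print formatted metrics: `print_metrics` for a single train/test evaluation and `print_cv_scores` for cross-validation results.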

In [15]:
from sklearn import metrics

def print_metrics(labels, preds):
    """
    Print the confusion matrix and per-class metric scores for a binary classification.
    """
    scores = metrics.precision_recall_fscore_support(labels, preds)
    conf = metrics.confusion_matrix(labels, preds)
    print(' ' * 4 + 'Confusion Matrix')
    print(' ' * 17 + 'Predict Positive    Predict Negative')
    print('Actual Positive         {}                 {}'.format(conf[1, 1], conf[1, 0]))
    print('Actual Negative         {}                 {}'.format(conf[0, 1], conf[0, 0]))
    print()
    print('Accuracy: {:.3f}'.format(metrics.accuracy_score(labels, preds)))
    print()
    print(' ' * 4 + 'Classification Report')
    print(' ' * 11 + 'Positive    Negative')
    print('Num cases    {}           {}'.format(scores[3][1], scores[3][0]))
    print('Precision    {:.2f}       {:.2f}'.format(scores[0][1], scores[0][0]))
    print('Recall       {:.2f}       {:.2f}'.format(scores[1][1], scores[1][0]))
    print('F1 Score     {:.2f}       {:.2f}'.format(scores[2][1], scores[2][0]))

def print_cv_scores(results):
    """
    Print per-fold and mean scoring metrics from cross-validation.
    """
    f1 = results['test_f1']
    precision = results['test_precision']
    recall = results['test_recall']
    accuracy = results['test_accuracy']

    print(' ' * 4 + 'Cross Validation Scores')
    print(' ' * 9 + 'F1     Precision    Recall    Accuracy')
    for i, (f, p, r, a) in enumerate(zip(f1, precision, recall, accuracy)):
        print('Fold {}   {:.3f}    {:.3f}      {:.3f}     {:.3f}'.format(i+1, f, p, r, a))
    print()
    print('Mean F1: {:.3f}'.format(f1.mean()))
    print('Mean Precision: {:.3f}'.format(precision.mean()))
    print('Mean Recall: {:.3f}'.format(recall.mean()))
    print('Mean Accuracy: {:.3f}'.format(accuracy.mean()))
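
The indexing in `print_metrics` relies on how scikit-learn orders its outputs: the confusion matrix has actual classes as rows and predicted classes as columns, and `precision_recall_fscore_support` returns one array per metric with one entry per class. A small sketch with made-up labels (the `y_true` and `y_pred` values are assumptions for illustration) shows the layout:

```python
from sklearn import metrics

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]

conf = metrics.confusion_matrix(y_true, y_pred)
# Rows are actual classes, columns are predicted classes, both ordered [0, 1]
tn, fp, fn, tp = conf.ravel()
print(conf)

# Four arrays (precision, recall, F1, support), each indexed by class label
precision, recall, f1, support = metrics.precision_recall_fscore_support(y_true, y_pred)
print(precision[1], recall[1], support[1])
```

With this ordering, `conf[1, 1]` is the true-positive count and `conf[1, 0]` is the false-negative count, which matches the "Actual Positive" row printed by `print_metrics`.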

#### b) Decision Tree and Naive Bayes Models

For each version of the data, hold out 20% of the rows as a test set, then initialize and fit a Decision Tree classifier and a Bernoulli Naive Bayes model on the remaining 80%.  
Evaluate each model on the held-out test data and then run 5-fold cross-validation on the whole data set.  
F1, precision, recall, and accuracy scores are reported for each fold, along with the mean score across all folds.
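
As a rough illustration of what the cross-validation step returns, the sketch below runs `cross_validate` with the same four scorers on synthetic data. The data from `make_classification` is an assumption standing in for the voting features, so the printed scores say nothing about the real results.

```python
import sklearn.model_selection as ms
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the voting features
X, y = make_classification(n_samples=100, n_features=8, random_state=0)

scoring = ['f1', 'precision', 'recall', 'accuracy']
results = ms.cross_validate(DecisionTreeClassifier(random_state=0), X, y,
                            cv=5, scoring=scoring)

# One array of length cv per metric, keyed as 'test_<metric name>'
for name in scoring:
    print(name, results['test_' + name].round(3))
```

The returned dictionary also carries `fit_time` and `score_time` arrays. When `cv` is an integer and the estimator is a classifier, scikit-learn uses stratified folds, so each fold keeps roughly the same class balance as the full data set.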

In [16]:
import sklearn.model_selection as ms
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import BernoulliNB

scoring = ['f1', 'precision', 'recall', 'accuracy']
for i, (feat, lab) in enumerate(zip(features, labels)):

    X_train, X_test, y_train, y_test = ms.train_test_split(feat, lab,
                                                           test_size=0.2,
                                                           random_state=1776)

    # Decision Tree model
    print('Version {}: Decision Tree'.format(i+1))
    tree = DecisionTreeClassifier(criterion='gini',
                                  class_weight='balanced',
                                  random_state=1916)
    tree.fit(X_train, y_train)   

    y_pred_tree = tree.predict(X_test)
    print_metrics(y_test, y_pred_tree)
    print()

    tree_cv_scores = ms.cross_validate(tree, feat, lab,
                                       cv=5, scoring=scoring)
    print_cv_scores(tree_cv_scores)
    print('-' * 25)

    # Naive Bayes model
    print('Version {}: Naive Bayes'.format(i+1))
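    # Note: BernoulliNB binarizes features at its default threshold of 0.0,
    # so the Version 2 value 2 (missing) is treated the same as 1 ('y')
    bnb = BernoulliNB()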
    bnb.fit(X_train, y_train)

    y_pred_nb = bnb.predict(X_test)
    print_metrics(y_test, y_pred_nb)
    print()

    bnb_cv_scores = ms.cross_validate(bnb, feat, lab,
                                      cv=5, scoring=scoring)
    print_cv_scores(bnb_cv_scores)
    print('#' * 50)

Version 1: Decision Tree
    Confusion Matrix
                 Predict Positive    Predict Negative
Actual Positive         16                 0
Actual Negative         1                 30

Accuracy: 0.979

    Classification Report
           Positive    Negative
Num cases    16           31
Precision    0.94       1.00
Recall       1.00       0.97
F1 Score     0.97       0.98

    Cross Validation Scores
         F1     Precision    Recall    Accuracy
Fold 1   0.933    0.913      0.955     0.936
Fold 2   1.000    1.000      1.000     1.000
Fold 3   0.977    1.000      0.955     0.979
Fold 4   0.950    1.000      0.905     0.957
Fold 5   0.884    0.864      0.905     0.889

Mean F1: 0.949
Mean Precision: 0.955
Mean Recall: 0.944
Mean Accuracy: 0.952
-------------------------
Version 1: Naive Bayes
    Confusion Matrix
                 Predict Positive    Predict Negative
Actual Positive         14                 2
Actual Negative         4                 27

Accuracy: 0.872

    Cl