### Classification Error Metric Challenges

**Settings:** Where applicable, use `test_size=0.30` and `random_state=4444` so that results can be compared across users.

*These reference the Classification Challenges.*

In [60]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, roc_curve, auc
from sklearn.model_selection import GridSearchCV

#### Challenge 1

For the House of Representatives data set, calculate the accuracy, precision, recall, and F1 score of each classifier you built (on the test set).

In [32]:
votes = pd.read_csv('voting.csv')
names = ['party','handicapped','water','adoption',
         'physician','aid_el_salvador','religion',
         'satellite','aid_nicaraguan','missile',
         'immigration','synfuels','education','superfund',
         'crime','duty_free','south_africa']
votes.columns = names
votes.replace({'n':0,'y':1,'?':np.nan},inplace=True)
for col in votes.columns:
    if col != 'party':
        # Impute missing votes with the most common vote for that bill.
        votes[col] = votes[col].fillna(votes[col].mode()[0])

In [33]:
y = votes.pop('party')
X = votes
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.30,random_state=4444)

Recomputing the logistic model.

In [35]:
param_grid = {'C':.1*np.arange(1,100)}
rgr = GridSearchCV(LogisticRegression(),param_grid,cv=5)
rgr.fit(X_train,y_train)
rgr.best_params_

{'C': 0.5}

In [36]:
rgr = LogisticRegression(C=.5)
rgr.fit(X_train,y_train)
print(rgr.score(X_test,y_test))
print('')
print(precision_recall_fscore_support(y_test,rgr.predict(X_test)))

0.954198473282

(array([ 0.93975904,  0.97916667]), array([ 0.98734177,  0.90384615]), array([ 0.96296296,  0.94      ]), array([79, 52]))
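
The four arrays returned by `precision_recall_fscore_support` are precision, recall, F1, and support, one entry per class in sorted label order (assumed here to be `democrat`, then `republican`). As a sanity check, the sketch below recomputes the logistic regression numbers by hand. The per-class counts (78 of 79 democrats and 47 of 52 republicans correct) are inferred from the printed recall and support, not taken from a confusion matrix shown in the notebook.

```python
# Counts inferred from the printed recall and support (an assumption;
# the notebook does not print a confusion matrix).
support = {'democrat': 79, 'republican': 52}
correct = {'democrat': 78, 'republican': 47}

def class_metrics(label, other):
    """Return (precision, recall, f1) treating `label` as the positive class."""
    tp = correct[label]                   # true positives for this class
    fn = support[label] - tp              # members of this class that were missed
    fp = support[other] - correct[other]  # other class predicted as this one
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = sum(correct.values()) / float(sum(support.values()))
dem = class_metrics('democrat', 'republican')
rep = class_metrics('republican', 'democrat')
print(accuracy)
print(dem)
print(rep)
```

The values agree with the output above: accuracy 125/131, and for each class the precision, recall, and F1 in the same positions as the printed arrays.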


Recomputing the kNN model.

In [38]:
param_grid = {'n_neighbors':np.arange(1,21)}
knn = GridSearchCV(KNeighborsClassifier(),param_grid,cv=5)
knn.fit(X_train,y_train)
knn.best_params_

{'n_neighbors': 2}

In [39]:
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train,y_train)
print(knn.score(X_test,y_test))
print('')
print(precision_recall_fscore_support(y_test,knn.predict(X_test)))

0.923664122137

(array([ 0.90588235,  0.95652174]), array([ 0.97468354,  0.84615385]), array([ 0.93902439,  0.89795918]), array([79, 52]))


#### Challenge 2

For each classifier, draw the ROC curve and calculate the AUC.

In [58]:
rgr_probs = rgr.predict_proba(X_test)
# Columns follow rgr.classes_ (sorted labels), so column 0 is P(democrat).
rgr_probs_isdem = rgr_probs[:, 0]
rgr_fpr, rgr_tpr, rgr_thresholds = roc_curve(y_test,rgr_probs_isdem,pos_label='democrat')

In [59]:
knn_probs = knn.predict_proba(X_test)
# Columns follow knn.classes_ (sorted labels), so column 0 is P(democrat).
knn_probs_isdem = knn_probs[:, 0]
knn_fpr, knn_tpr, knn_thresholds = roc_curve(y_test,knn_probs_isdem,pos_label='democrat')
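
The cells above stop at `roc_curve`, so the AUC and the plot are still missing. A minimal sketch of those two steps follows, using four made-up labels and scores instead of the voting data so that it runs on its own. The helper `plot_roc` and the toy values are illustrative assumptions, not part of the original notebook.

```python
from sklearn.metrics import roc_curve, auc

# Toy labels and scores (hypothetical, not the voting data).
y_true = ['republican', 'republican', 'democrat', 'democrat']
p_democrat = [0.1, 0.4, 0.35, 0.8]   # predicted P(democrat) per sample

fpr, tpr, thresholds = roc_curve(y_true, p_democrat, pos_label='democrat')
roc_auc = auc(fpr, tpr)              # trapezoidal area under the (fpr, tpr) curve
print(roc_auc)

def plot_roc(fpr, tpr, roc_auc, label):
    """Draw one ROC curve plus a chance-level diagonal for reference."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    plt.plot(fpr, tpr, label='%s (AUC = %.3f)' % (label, roc_auc))
    plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.legend(loc='lower right')
```

With the notebook's variables, the corresponding calls would be `auc(rgr_fpr, rgr_tpr)` and `auc(knn_fpr, knn_tpr)`, then one `plot_roc` call per classifier followed by `matplotlib.pyplot.show()`.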

#### Challenge 3

Calculate the same metrics as in Challenge 1, but this time under cross-validation with the `cross_val_score` function (as in Challenge 9).
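
One possible approach is to call `cross_val_score` once per metric through its `scoring` argument. The sketch below uses synthetic 0/1 data from `make_classification` as a stand-in for the voting features (an assumption, so the block runs without `voting.csv`), with `C=0.5` taken from the grid search result above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary data standing in for the voting features (assumption).
X, y = make_classification(n_samples=200, n_features=16, random_state=4444)

clf = LogisticRegression(C=0.5)
results = {}
for metric in ['accuracy', 'precision', 'recall', 'f1']:
    # One score per fold; the mean summarizes the metric over the 5 folds.
    scores = cross_val_score(clf, X, y, cv=5, scoring=metric)
    results[metric] = scores.mean()
    print('%-9s %.3f' % (metric, results[metric]))
```

With string labels such as `'democrat'`, the plain `'precision'`, `'recall'`, and `'f1'` scorers need to know the positive class. Two options are a custom scorer such as `make_scorer(precision_score, pos_label='democrat')` or the macro-averaged variants (`'precision_macro'`, `'recall_macro'`, `'f1_macro'`).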

#### Challenge 4

For your movie classifiers, calculate the precision and recall for each class.
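
For a multi-class target, `precision_recall_fscore_support` returns one value per class when given an explicit `labels` list. The sketch below uses ten made-up true and predicted ratings. The class names and predictions are hypothetical placeholders, not results from the movie classifiers.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical true and predicted classes for ten movies.
y_true = ['PG', 'PG', 'PG-13', 'PG-13', 'PG-13', 'R', 'R', 'R', 'R', 'R']
y_pred = ['PG', 'PG-13', 'PG-13', 'PG-13', 'R', 'R', 'R', 'R', 'R', 'PG-13']
labels = ['PG', 'PG-13', 'R']

# Each returned array has one entry per label, in the order given in `labels`.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels)
for name, p, r, n in zip(labels, precision, recall, support):
    print('%-6s precision=%.2f recall=%.2f (n=%d)' % (name, p, r, n))
```

In practice, `y_pred` would come from `classifier.predict(X_test)` and `labels` from the classifier's `classes_` attribute.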

#### Challenge 5

Draw the ROC curve (and calculate the AUC) for the logistic regression classifier from Challenge 12.