### Logistic Regression  (Vicent Liu)

Logistic regression is the classification counterpart to linear regression. The logistic function maps each prediction to a value between 0 and 1, so predictions can be interpreted as class probabilities.
The models themselves are still "linear," so they work well when your classes are linearly separable (i.e. they can be separated by a single linear decision surface, a hyperplane). Logistic regression can also be regularized by penalizing coefficients with a tunable penalty strength.
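
As a minimal sketch of the mapping described above (not code from this notebook), the logistic function squashes a linear score into the interval (0, 1). The weights `w`, intercept `b`, and sample `x` are made-up values used only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and one sample, chosen only for illustration.
w = np.array([0.5, -1.0])
b = 0.25
x = np.array([2.0, 1.0])

score = w @ x + b          # linear part: 0.5*2 - 1.0*1 + 0.25 = 0.25
prob = sigmoid(score)      # interpreted as the probability of the positive class
label = int(prob >= 0.5)   # thresholding at 0.5 is the same as score >= 0
```

Because the threshold on the probability corresponds to a threshold on the linear score, the decision boundary is a hyperplane, which is why the model is still called linear.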

Strengths: Outputs have a nice probabilistic interpretation, and the algorithm can be regularized to avoid overfitting. Logistic models can be updated easily with new data using stochastic gradient descent.
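
To illustrate the point about updating with new data, here is a rough sketch of a single stochastic gradient step on the log loss, written in plain NumPy. The function `sgd_update`, the learning rate, and the toy data stream are all assumptions for illustration and are not part of this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_update(w, b, x, y, lr=0.1, l2=0.0):
    """One stochastic gradient step on the log loss for a single sample (x, y)."""
    error = sigmoid(w @ x + b) - y      # gradient of the log loss w.r.t. the score
    w = w - lr * (error * x + l2 * w)   # optional L2 penalty shrinks the weights
    b = b - lr * error
    return w, b

# Toy stream of samples (hypothetical): the label is 1 when the first feature is positive.
rng = np.random.default_rng(0)
w, b = np.zeros(2), 0.0
for _ in range(2000):
    x = rng.normal(size=2)
    y = float(x[0] > 0)
    w, b = sgd_update(w, b, x, y)   # the model is refined one sample at a time
```

Each new sample only needs one cheap update, so the model never has to be refit from scratch. scikit-learn's `SGDClassifier` exposes the same idea through `partial_fit`.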
    
Weaknesses: Logistic regression tends to underperform when there are multiple or non-linear decision boundaries. It is not flexible enough to naturally capture more complex relationships.
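
A small XOR-style example can make this weakness concrete. The four data points below are hypothetical and unrelated to the HMDA data; the sketch only shows that a plain linear boundary cannot fit them, while a hand-added interaction feature can.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# XOR-style toy data (hypothetical): class 1 when the two features have opposite signs.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# No single straight line separates XOR, so at most 3 of the 4 points can be right.
linear_acc = LogisticRegression().fit(X, y).score(X, y)

# Adding the interaction term x1*x2 makes the classes linearly separable.
X_inter = np.column_stack([X, X[:, 0] * X[:, 1]])
inter_acc = LogisticRegression().fit(X_inter, y).score(X_inter, y)

print(linear_acc, inter_acc)
```

The model itself is unchanged in the second fit; the extra flexibility comes entirely from the engineered feature.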

Reference: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

In [9]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pickle as cPickle  # public module; it uses the C implementation automatically
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.model_selection import cross_val_score

In [3]:
df = pd.read_csv('./data/ny_hmda_2015_minmax.csv')

x = np.array(df.drop(['action_taken'], axis=1))
y = np.array(df['action_taken'])
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2)

logreg = linear_model.LogisticRegression(C=1e4)
logreg.fit(x_train, y_train)

LogisticRegression(C=10000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [8]:
#Z = logreg.predict(X)
logreg.score(x_test, y_test)

0.82411292892807819

In [10]:
#test different C
for i in range(10):
    logreg = linear_model.LogisticRegression(C=10**i)
    print(logreg.fit(x_train, y_train).score(x_test, y_test))

0.824019972379
0.82419260597
0.824112928928
0.824112928928
0.824112928928
0.824099649421
0.824126208435
0.824099649421
0.824112928928
0.824112928928


In [11]:
#test penalty = l1
for i in range(10):
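    # liblinear supports the l1 penalty; newer scikit-learn defaults to a solver that does not
    logreg = linear_model.LogisticRegression(C=10**i, penalty='l1', solver='liblinear')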
    print(logreg.fit(x_train, y_train).score(x_test, y_test))

0.824046531393
0.824152767449
0.824152767449
0.824139487942
0.824152767449
0.82419260597
0.824152767449
0.824166046956
0.824166046956
0.824166046956


In [18]:
#TEST NORM DATASET
df_norm = pd.read_csv('./data/ny_hmda_2015_normalize.csv')

x_norm = np.array(df_norm.drop(['action_taken'], axis=1))
y_norm = np.array(df_norm['action_taken'])
x_train_norm, x_test_norm, y_train_norm, y_test_norm = train_test_split(x_norm,y_norm,test_size = 0.2)

logreg = linear_model.LogisticRegression()
print(logreg.fit(x_train_norm, y_train_norm).score(x_test_norm, y_test_norm))

0.819743971104


In [19]:
#TEST ROBUST DATASET
df_robust = pd.read_csv('./data/ny_hmda_2015_robust.csv')

x_robust = np.array(df_robust.drop(['action_taken'], axis=1))
y_robust = np.array(df_robust['action_taken'])
x_train_robust, x_test_robust, y_train_robust, y_test_robust = train_test_split(x_robust,y_robust,test_size = 0.2)

logreg = linear_model.LogisticRegression()
print(logreg.fit(x_train_robust, y_train_robust).score(x_test_robust, y_test_robust))

0.823329438011


In [20]:
#Cross Validation
scores = cross_val_score(linear_model.LogisticRegression(), x_robust, y_robust, scoring='accuracy', cv=10)
print(scores)

[ 0.81326357  0.83097843  0.85190694  0.84667481  0.81884096  0.81517582
  0.82999575  0.8089825   0.68816233  0.88937583]
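
The ten fold scores are easier to compare once summarized. The sketch below simply copies the accuracies printed above into an array and reports their mean and spread; the variable names are illustrative.

```python
import numpy as np

# Fold accuracies copied from the cross-validation output above.
scores = np.array([0.81326357, 0.83097843, 0.85190694, 0.84667481, 0.81884096,
                   0.81517582, 0.82999575, 0.8089825, 0.68816233, 0.88937583])

mean, std = scores.mean(), scores.std()
print(f"accuracy: {mean:.3f} +/- {std:.3f}")
print(f"worst fold: {scores.min():.3f}, best fold: {scores.max():.3f}")
```

The ninth fold (0.688) sits well below the others. With an integer `cv`, `cross_val_score` splits a classifier's data in row order without shuffling, so one possible explanation is that the rows are ordered in some way. Passing a shuffled `StratifiedKFold` as `cv` would be one way to check that assumption.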


In [12]:
# Use a context manager so the file is closed after the model is written.
with open('models/logreg_model.p', 'wb') as f:
    cPickle.dump(logreg, f)
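
For completeness, a saved model can be restored and used for prediction. The sketch below is self-contained and therefore trains a tiny stand-in model on hypothetical data and round-trips it through bytes rather than reading `models/logreg_model.p`.

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model (hypothetical data) so the example runs on its own.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# Serialize to bytes and restore. With a file, use pickle.dump / pickle.load
# inside `with open(path, 'wb')` / `with open(path, 'rb')` blocks instead.
restored = pickle.loads(pickle.dumps(model))

# The restored model should reproduce the original predictions.
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```

Pickled scikit-learn models are generally only safe to load with the same scikit-learn version that created them, and pickle files from untrusted sources should not be loaded at all.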

In [14]:
x_test[0]

array([ 0.00730146,  2.        ,  1.        ,  1.        ,  2.        ,
        1.        ,  3.        ,  1.        ,  7.        ,  1.        ,
        0.00728592,  0.28884682,  0.29711717,  0.53991365,  0.20210518,
        0.28043419,  0.        ,  0.        ,  0.        ,  0.        ,
        1.        ,  0.        ,  0.        ,  1.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  1.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  1.        ,
        0.        ,  0.        ])