In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

## 3.3.4 Logistic Regression

**Task**<br>
Now that you have two new regression methods at your fingertips, it's time to give them a spin. In fact, for this challenge, let's put them together! Pick a dataset of your choice with a binary outcome and the potential for at least 15 features. If you're drawing a blank, the crime rates in 2013 dataset has a lot of variables that could be made into a modelable binary outcome.

Engineer your features, then create three models. Each model will be run on a training set and a test-set (or multiple test-sets, if you take a folds approach). The models should be:

* Vanilla logistic regression
* Ridge logistic regression
* Lasso logistic regression

In your report, evaluate all three models and decide on your best. Be clear about the decisions you made that led to these models (feature selection, regularization parameter selection, model evaluation criteria) and why you think that particular model is the best of the three. Also reflect on the strengths and limitations of regression as a modeling approach. Were there things you couldn't do but you wish you could have done?

**Data**<br>
Predicting credit card fraud.

Data Source: https://www.kaggle.com/mlg-ulb/creditcardfraud/data

Data has been anonymized through PCA. The result is 28 principal components, a time variable, the transaction amount, and a classification as fraud (1) or not (0).

In [3]:
# Load in data
df = pd.read_csv('creditcardfraud.csv')
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


### Vanilla Logistic Regression

In [4]:
# Import model
from sklearn.linear_model import LogisticRegression

# Instantiate with a very large C (inverse regularization strength), which effectively turns regularization off
lr = LogisticRegression(C=1e9)

# Define variables
x = df.drop(columns=['Class'])
y = df['Class']

# Fit model
fit = lr.fit(x, y)

# Get results
print('Coefficient')
coef = fit.coef_
print(coef)
print(fit.intercept_)

# Get predictions
pred_y = lr.predict(x)

print('\nAccuracy by fraud status')
print(pd.crosstab(pred_y, y))

print('\nPercentage accuracy')
print(lr.score(x,y))


Coefficient
[[-7.12205661e-05  3.19001931e-01 -4.84125628e-01 -7.93512825e-01
   1.20293566e-01  5.74886414e-02 -5.40509683e-02  3.35307692e-01
  -3.74349559e-01 -3.88608914e-01 -2.07048719e-01 -2.86743952e-01
   1.86455372e-02 -3.06674966e-01 -6.94622189e-01 -4.27799496e-01
  -2.94742322e-01 -4.39989190e-01  3.10683059e-02  2.65188972e-02
   9.20024094e-02  2.48887616e-01  3.51030078e-01  6.77173252e-02
  -2.44442971e-02 -3.56185269e-01  6.07207628e-02 -8.88572729e-02
   2.77995748e-02 -5.58259375e-03]]
[-1.62885062]

Accuracy by fraud status
Class       0    1
row_0             
0      284240  203
1          75  289

Percentage accuracy
0.9990239003957065


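* Large class imbalance (only 492 of the 284,807 transactions are fraud), so accuracy is very high even though many fraud cases are missed
* 203 fraudulent cases mislabeled as not fraud (false negatives)
* 75 legitimate cases miscategorized as fraud (false positives)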
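
To see why accuracy alone is misleading here, the counts in the crosstab above can be compared against a trivial baseline that labels every transaction as not fraud. This is only a small arithmetic sketch; the variable names are introduced for illustration and the counts are copied by hand from the printed output.

```python
# Counts copied from the crosstab above (rows = predicted, columns = actual).
tn, fn = 284240, 203   # predicted 0: actually 0, actually 1
fp, tp = 75, 289       # predicted 1: actually 0, actually 1

total = tn + fn + fp + tp

# Share of all transactions labeled correctly by the model.
accuracy = (tp + tn) / total

# Accuracy of a trivial model that labels every transaction "not fraud".
baseline_accuracy = (tn + fp) / total

# Share of actual fraud cases that the model catches.
fraud_recall = tp / (tp + fn)

print(f"accuracy:          {accuracy:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"fraud recall:      {fraud_recall:.4f}")
```

The baseline already scores about 0.998, so the model's 0.999 accuracy says little by itself, while the recall shows that only about 59% of fraud cases are caught.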

### Ridge Logistic Regression
L2 regularization

In [32]:
# Generate a range of C values (inverse regularization strength; smaller C = stronger penalty)
# and record the training accuracy for each
alphas = np.arange(0.1, 10, 1)
lr_ridge = LogisticRegression(penalty='l2')
ridge_r_squared = []  # holds accuracy scores (not r-squared), despite the name

# Train model with different regularization values
for a in alphas:
    lr_ridge.set_params(C=a, fit_intercept=False)
    lr_ridge.fit(x, y)
    # score() returns mean accuracy for classifiers
    ridge_r_squared.append(lr_ridge.score(x, y))

In [33]:
# Get values
ridge_r_squared

[0.9984831833487239,
 0.9985393617432156,
 0.9985253171445927,
 0.9985253171445927,
 0.9986271404846089,
 0.9985218059949369,
 0.9985218059949369,
 0.9985218059949369,
 0.9985218059949369,
 0.9985218059949369]

In [34]:
# Get the C value corresponding to the highest accuracy
alphas[np.argmax(ridge_r_squared)]

4.1

In [35]:
# Not much variation observed by changing penalization coefficient.  Select best one.

# Instantiate and set regularization coefficient 
lr_ridge = LogisticRegression(penalty='l2', C=4.1, fit_intercept=False)

# Fit model
lr_ridge.fit(x, y)

# Get results
print('Coefficient')
print(lr_ridge.coef_)
print(lr_ridge.intercept_)

# Get predictions
pred_y_r = lr_ridge.predict(x)

print('\nAccuracy by fraud status')
print(pd.crosstab(pred_y_r, y))

print('\nPercentage accuracy')
print(lr_ridge.score(x,y))

Coefficient
[[-1.00363009e-04  4.95539317e-01 -8.64729681e-01 -1.41665871e+00
   2.11182323e-01 -1.48737343e-01 -2.81149840e-02  7.70208490e-01
  -4.84613149e-01 -5.64425923e-01 -4.41858733e-01 -5.07158083e-01
  -4.80217284e-02 -3.81347402e-01 -6.59168776e-01 -9.39193951e-01
  -4.37884633e-01 -6.15370299e-01  1.57018671e-01 -5.82967583e-02
   5.13752137e-01  6.88499318e-01  8.80571737e-01  3.58571084e-01
   2.03330059e-02 -1.15529186e+00  2.83262888e-01 -1.99524115e-01
   8.88575969e-02 -1.24274188e-02]]
0.0

Accuracy by fraud status
Class       0    1
row_0             
0      284113  189
1         202  303

Percentage accuracy
0.9986271404846089


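* Overall accuracy actually went down
* Improved classification of fraud (what we're really after): false negatives fell from 203 to 189, at the expense of more false positives (75 to 202)
* Not the expected results. Could it be because the data has already been transformed with PCA? Or is it how the penalization coefficient is being determined? Note also that this model was fit with `fit_intercept=False`, unlike the vanilla model, so the two are not directly comparable.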
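
One way to probe the question about how the penalization coefficient is chosen is to select `C` by cross-validation on held-out folds, scored with a metric that focuses on the rare class, instead of by training accuracy. The sketch below is a hedged illustration, not the notebook's method: it uses a synthetic imbalanced dataset from `make_classification` as a stand-in for the fraud data, and the names `X_demo`, `y_demo`, `param_grid`, and `search` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the fraud data (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=20, n_informative=5,
    weights=[0.97, 0.03], random_state=0,
)

# Candidate inverse-regularization strengths on a log scale.
param_grid = {"C": np.logspace(-3, 2, 6)}

# Stratified folds keep the rare positive class present in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# LogisticRegression uses an L2 (ridge) penalty by default.
search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid,
    scoring="average_precision",  # summarizes precision/recall on the positive class
    cv=cv,
)
search.fit(X_demo, y_demo)

print("best C:", search.best_params_["C"])
print("mean held-out average precision:", round(search.best_score_, 3))
```

Swapping `X_demo` and `y_demo` for `x` and `y` would apply the same search to the fraud data, at the cost of a longer run time.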

### LASSO Logistic Regression

L1 Regularization

In [45]:
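# Repeat iterative process to find best value of penalization coefficient

# Generate a range of C values (inverse regularization strength; smaller C = stronger penalty)
alphas = np.arange(0.01, 1, 0.1)
# liblinear supports the L1 penalty (the default solver in recent scikit-learn versions does not)
lr_lasso = LogisticRegression(penalty='l1', solver='liblinear')
lasso_r_squared = []  # holds accuracy scores (not r-squared), despite the name

# Train model with different regularization values
for a in alphas:
    lr_lasso.set_params(C=a, fit_intercept=False)
    lr_lasso.fit(x, y)
    # score() returns mean accuracy for classifiers
    lasso_r_squared.append(lr_lasso.score(x, y))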

In [46]:
# Get values
lasso_r_squared

[0.9984902056480354,
 0.9988202537156742,
 0.9988413206136085,
 0.9988448317632642,
 0.9988378094639527,
 0.9988378094639527,
 0.9988413206136085,
 0.9988413206136085,
 0.9988448317632642,
 0.9988448317632642]

In [47]:
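# Instantiate with C=0.01 (the strongest penalty in the range searched above)
# Unlike the search loop, this fit keeps the default intercept (fit_intercept=True)
lr_lasso = LogisticRegression(penalty='l1', C=0.01, solver='liblinear')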

# Fit model
lr_lasso.fit(x, y)

# Get results
print('Coefficient')
print(lr_lasso.coef_)
print(lr_lasso.intercept_)

# Get predictions
pred_y_l = lr_lasso.predict(x)

print('\nAccuracy by fraud status')
print(pd.crosstab(pred_y_l, y))

print('\nPercentage accuracy')
print(lr_lasso.score(x,y))

Coefficient
[[-1.52403584e-05  0.00000000e+00  0.00000000e+00 -2.56184060e-02
   2.85757773e-01  4.21347000e-02  0.00000000e+00  0.00000000e+00
  -1.34171116e-01 -1.15931038e-02 -2.83624536e-01  0.00000000e+00
  -6.93929718e-02  0.00000000e+00 -7.00286781e-01  0.00000000e+00
  -3.27052109e-02 -1.02405851e-01  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00 -2.13878307e-04]]
[-6.03523999]

Accuracy by fraud status
Class       0    1
row_0             
0      284278  211
1          37  281

Percentage accuracy
0.9991292348853785


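* This accuracy doesn't match the iteration results for C=0.01, most likely because the search loop used `fit_intercept=False` while this final model uses the default `fit_intercept=True` (note the nonzero intercept above).
* Overall accuracy improved, but false negatives increased (211 vs. 203 for the vanilla model). Overall worst performance considering the goal of catching fraud.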
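
Since the goal is catching fraud, the three models are easier to compare on precision and recall for the fraud class than on accuracy. The sketch below recomputes both from the three crosstabs printed in this notebook; the counts are copied by hand, and the helper `precision_recall` is introduced only for this example.

```python
# Confusion-matrix counts copied from the three crosstabs in this notebook,
# as (true negatives, false negatives, false positives, true positives).
results = {
    "vanilla": (284240, 203, 75, 289),
    "ridge":   (284113, 189, 202, 303),
    "lasso":   (284278, 211, 37, 281),
}

def precision_recall(tn, fn, fp, tp):
    """Return (precision, recall) for the fraud class."""
    return tp / (tp + fp), tp / (tp + fn)

scores = {name: precision_recall(*counts) for name, counts in results.items()}

for name, (precision, recall) in scores.items():
    print(f"{name:8s} precision={precision:.3f} recall={recall:.3f}")

# Model that catches the largest share of actual fraud cases.
best_recall = max(scores, key=lambda name: scores[name][1])
print("highest fraud recall:", best_recall)
```

On these training-set counts the ridge model has the highest recall and the lowest precision, while the lasso model is the reverse.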

### Conclusions
Strengths and limitations:
* **Ridge regression** keeps all predictor variables. Good if they're all relevant for predicting the target variable. Otherwise, it makes the model more complicated than it needs to be.
* **LASSO regression** performs feature selection, which reduces model complexity. Undesirable if you want to keep all features in the model.
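
The difference between the two penalties can be seen by counting coefficients that are exactly zero. The following is a minimal sketch under stated assumptions: the data is synthetic (`make_classification` with 4 informative features out of 20), `C=0.01` is an arbitrary strong penalty chosen for illustration, and names such as `X_demo` and `n_zero_l1` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 4 informative features and 16 pure-noise features (assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=4, n_redundant=0,
    random_state=0,
)

# Same penalty strength for both models; only the penalty type differs.
l1_model = LogisticRegression(penalty="l1", C=0.01, solver="liblinear").fit(X_demo, y_demo)
l2_model = LogisticRegression(penalty="l2", C=0.01, solver="liblinear").fit(X_demo, y_demo)

# Count coefficients driven exactly to zero by each penalty.
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print("zero coefficients with L1 (LASSO):", n_zero_l1)
print("zero coefficients with L2 (ridge):", n_zero_l2)
```

The L1 penalty typically removes the noise features outright, whereas the L2 penalty only shrinks them toward zero.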

Data set specifics:
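* Ridge and LASSO did not produce substantially better results than the regular logistic regression. Is this because of the data structure and how it was already preprocessed? Or is it something to do with hyperparameter tuning?
* All three models were scored on the same data they were trained on, so these comparisons do not yet say anything about out-of-sample performance.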