## 3.3.4 Challenge

Pick a dataset of your choice with a binary outcome and the potential for at least 15 features.

Engineer your features, then create three models. Each model will be run on a training set and a test set (or multiple test sets, if you take a folds approach). The models should be:

- Vanilla logistic regression
- Ridge logistic regression
- Lasso logistic regression
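
As a rough sketch of what the challenge asks for, the three models can be fit side by side on one train/test split. The data here is synthetic (`make_classification`), the names `X_demo`, `y_demo`, `models` and `scores` are made up for illustration, and a very large `C` is only an approximation of an unpenalized ("vanilla") fit.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 600 rows, 15 features, binary outcome.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=15, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# A very large C makes the penalty negligible, which approximates a vanilla fit.
models = {
    "vanilla": LogisticRegression(C=1e9, solver="liblinear"),
    "ridge": LogisticRegression(C=0.1, solver="liblinear"),  # l2 is the default penalty
    "lasso": LogisticRegression(penalty="l1", C=0.1, solver="liblinear"),
}

scores = {}
for name, clf in models.items():
    # Scaling first keeps the penalty from favoring large-valued columns.
    pipe = make_pipeline(StandardScaler(), clf)
    pipe.fit(X_tr, y_tr)
    scores[name] = (pipe.score(X_tr, y_tr), pipe.score(X_te, y_te))
    print(f"{name:8s} train={scores[name][0]:.3f} test={scores[name][1]:.3f}")
```

Comparing the train and test numbers for each model is what shows whether regularization is helping or whether the model is simply memorizing the training rows.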

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import math
import seaborn as sns
import sklearn
import scipy.stats as stats
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (brier_score_loss, precision_score, recall_score,
                             f1_score)
from sklearn import svm

%matplotlib inline
sns.set_style('white')

In [2]:
#data loading and cleaning from a previous notebook
da = pd.read_csv(
    r'C:\Users\jmfra\OneDrive\Documents\Thinkful Data Science Files\3.3.4 data\texas.csv'
)
da = da.fillna(value=0)
da = da[:604]
da.head()

Unnamed: 0,City,Population,Violent crime,Murder and nonnegligent manslaughter,Rape (revised definition)1,Rape (legacy definition)2,Robbery,Aggravated assault,Property crime,Burglary,Larceny- theft,Motor vehicle theft,Arson3,Unnamed: 13
0,Abernathy,2821,0,0.0,0.0,0.0,0,0,12,12,0,0,1.0,0.0
1,Abilene,119401,477,1.0,0.0,37.0,125,314,4769,1055,3460,254,16.0,0.0
2,Addison,15961,51,1.0,0.0,4.0,11,35,784,129,593,62,1.0,0.0
3,Alamo,18876,164,0.0,0.0,11.0,27,126,1336,203,1052,81,1.0,0.0
4,Alamo Heights,7443,9,0.0,0.0,2.0,2,5,235,36,194,5,0.0,0.0


In [3]:
#adding relevant categories to a new frame
dp = pd.DataFrame()
dp['Population'] = da[['Population']]
dp['Violent_Crime'] = da[['Violent\ncrime']]
dp['Murder'] = da[['Murder and\nnonnegligent\nmanslaughter']]
dp['Rape1'] = da[['Rape\n(revised\ndefinition)1']]
dp['Rape2'] = da[['Rape\n(legacy\ndefinition)2']]
dp['Robbery'] = da[['Robbery']]
dp['Burglary'] = da[['Burglary']]
dp['Aggravated_Assault'] = da[['Aggravated\nassault']]
dp['Larceny_Theft'] = da[['Larceny-\ntheft']]
dp['Motor_Vehicle_Theft'] = da[['Motor\nvehicle\ntheft']]
dp['Arson'] = da[['Arson3']]
dp['Property_Crime'] = da[['Property\ncrime']]

In [4]:
#making all the columns into integers and cleaning
dp['Rape'] = dp['Rape1'] + dp['Rape2']
dp['Arson'] = dp['Arson'].astype(str)
dp = dp.replace({r',': ''}, regex=True)
dp['Arson'] = dp.Arson.str.replace('nan','0')
dp = dp.apply(pd.to_numeric)
dp = dp.drop(['Rape1', 'Rape2', 'Property_Crime'], axis=1)

In [5]:
dp.head()

Unnamed: 0,Population,Violent_Crime,Murder,Robbery,Burglary,Aggravated_Assault,Larceny_Theft,Motor_Vehicle_Theft,Arson,Rape
0,2821,0,0.0,0,12,0,0,0,1.0,0.0
1,119401,477,1.0,125,1055,314,3460,254,16.0,37.0
2,15961,51,1.0,11,129,35,593,62,1.0,4.0
3,18876,164,0.0,27,203,126,1052,81,1.0,11.0
4,7443,9,0.0,2,36,5,194,5,0.0,2.0


In [6]:
#making a few of the columns into binary variables based on if they occur or not
dp['Violent_Crimec'] = np.where(dp['Violent_Crime'] >= 1, 1, 0)
dp['Murderc'] = np.where(dp['Murder'] >= 1, 1, 0)
dp['Robberyc'] = np.where(dp['Robbery'] >= 1, 1, 0)
dp['Aggravated_Assaultc'] = np.where(dp['Aggravated_Assault'] >= 1, 1, 0)
dp['Arsonc'] = np.where(dp['Arson'] >= 1, 1, 0)
dp['Rapec'] = np.where(dp['Rape'] >= 1, 1, 0)
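
Before picking one of these flags as the outcome, it can help to look at how the 1s and 0s are split. A minimal sketch, using only the five robbery counts visible in the `dp.head()` output (the frame name `demo` is hypothetical, and five rows say nothing about the full data):

```python
import numpy as np
import pandas as pd

# Robbery counts from the five rows shown in dp.head() above.
demo = pd.DataFrame({"Robbery": [0, 125, 11, 27, 2]})
demo["Robberyc"] = np.where(demo["Robbery"] >= 1, 1, 0)

# Share of each class; values near 0.5 indicate a balanced outcome.
class_share = demo["Robberyc"].value_counts(normalize=True)
print(class_share)
```

Running the same `value_counts(normalize=True)` call on each flag column of the full frame would show which outcomes are balanced enough to model.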

In [7]:
#Robbery has a reasonably balanced mix of 1s and 0s (393 vs. 211 cities), so we will use it as our dependent variable. We will try
#to predict whether or not a robbery will occur in a city based on its other statistics
lr = LogisticRegression(C=.1)
y = dp['Robberyc']
X = dp[['Population', 'Violent_Crime', 'Murder', 'Burglary', 'Aggravated_Assault', 'Larceny_Theft', 'Motor_Vehicle_Theft', 'Arson', 'Rape']]

# Fit the model.
fit = lr.fit(X, y)

# Display.
print('Coefficients')
print(fit.coef_)
print(fit.intercept_)
pred_y_sklearn = lr.predict(X)

print('\n Accuracy by admission status')
print(pd.crosstab(pred_y_sklearn, y))

print('\n Percentage accuracy')
print(lr.score(X, y))

Coefficients
[[ -2.89033293e-05   1.18096235e-02  -1.91567060e-04  -4.43183077e-04
    4.14587209e-03   9.40909555e-03   4.70846003e-03  -9.27872293e-05
    4.06699022e-04]]
[-0.00313571]

 Accuracy by admission status
Robberyc    0    1
row_0             
0          28    1
1         183  392

 Percentage accuracy
0.695364238411
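
The coefficients above are on the raw scale of each column, so a feature measured in tens of thousands (such as Population) will tend to get a tiny coefficient regardless of how useful it is. One hedged way to make the sizes comparable is to standardize the features first. The sketch below uses hypothetical toy data, not the Texas data, and the outcome rule is invented purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Population": rng.integers(1_000, 200_000, size=300),
    "Murder": rng.integers(0, 5, size=300),
})
# Invented outcome that depends on both features (illustration only).
signal = toy["Population"] / 200_000 + toy["Murder"] / 5
target = (signal + rng.normal(0, 0.2, size=300) > 1.0).astype(int)

# Rescale every column to mean 0 and standard deviation 1.
scaled = StandardScaler().fit_transform(toy)
clf = LogisticRegression(C=0.1).fit(scaled, target)

# Coefficients are now per standard deviation, so magnitudes can be compared.
coef_table = pd.Series(clf.coef_[0], index=toy.columns).sort_values(
    key=abs, ascending=False
)
print(coef_table)
```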


In [8]:
#The base model is pretty poor. The coefficients suggest that some variables are weaker predictors than others, so we will
#get rid of the ones with a value much closer to 0
lr = LogisticRegression(C=.1)
y = dp['Robberyc']
X = dp[['Violent_Crime', 'Murder', 'Aggravated_Assault', 'Burglary', 'Larceny_Theft', 'Motor_Vehicle_Theft', 'Rape']]

# Fit the model.
fit = lr.fit(X, y)

# Display.
print('Coefficients')
print(fit.coef_)
print(fit.intercept_)
pred_y_sklearn = lr.predict(X)

print('\n Accuracy by admission status')
print(pd.crosstab(pred_y_sklearn, y))

print('\n Percentage accuracy')
print(lr.score(X, y))

Coefficients
[[ 1.36403801 -0.55936375 -1.34388299  0.00288484  0.00143466  0.02833138
  -1.24179708]]
[-1.49594141]

 Accuracy by admission status
Robberyc    0    1
row_0             
0         207    8
1           4  385

 Percentage accuracy
0.980132450331


In [9]:
#This vastly improved the model, but again some of the predictors seem to be much less useful than the others, so let's
#cut them again
lr = LogisticRegression(C=.1)
y = dp['Robberyc']
X = dp[['Violent_Crime', 'Murder', 'Aggravated_Assault', 'Rape']]

# Fit the model.
fit = lr.fit(X, y)

# Display.
print('Coefficients')
print(fit.coef_)
print(fit.intercept_)
pred_y_sklearn = lr.predict(X)

print('\n Accuracy by admission status')
print(pd.crosstab(pred_y_sklearn, y))

print('\n Percentage accuracy')
print(lr.score(X, y))

Coefficients
[[ 1.43257012 -0.55044377 -1.39908202 -1.26631682]]
[-1.43743333]

 Accuracy by admission status
Robberyc    0    1
row_0             
0         207    5
1           4  388

 Percentage accuracy
0.985099337748
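
One possible reason this four-feature model scores so well: in the five rows shown in the `dp.head()` output, Violent_Crime equals Murder + Rape + Robbery + Aggravated_Assault, so the four features together encode the robbery count itself. The fitted coefficients (positive on Violent_Crime, negative on the other three) are consistent with that reading. The check below only covers those five rows; whether it holds for all 604 cities is an assumption that would need the same comparison run on the full frame.

```python
import pandas as pd

# First five rows copied from the dp.head() output above.
sample = pd.DataFrame({
    "Violent_Crime": [0, 477, 51, 164, 9],
    "Murder": [0.0, 1.0, 1.0, 0.0, 0.0],
    "Robbery": [0, 125, 11, 27, 2],
    "Aggravated_Assault": [0, 314, 35, 126, 5],
    "Rape": [0.0, 37.0, 4.0, 11.0, 2.0],
})

# If violent crime is the sum of its parts, robbery can be recovered exactly.
implied_robbery = (
    sample["Violent_Crime"]
    - sample["Murder"]
    - sample["Aggravated_Assault"]
    - sample["Rape"]
)
leak_found = bool((implied_robbery == sample["Robbery"]).all())
print(leak_found)  # True for these five rows
```

If the same identity holds across the data, the high accuracy says more about arithmetic than about predicting robbery from independent information.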


In [10]:
#now that we have a working model, let's check how it holds up on a held-out test set
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25)

In [11]:
# Fit the model.
fit = lr.fit(xtrain, ytrain)

# Display.
print('Coefficients')
print(fit.coef_)
print(fit.intercept_)
pred_y_sklearn = lr.predict(xtrain)

print('\n Accuracy by admission status')
print(pd.crosstab(pred_y_sklearn, ytrain))

print('\n Percentage accuracy')
print(lr.score(xtrain, ytrain))

Coefficients
[[ 1.24884352 -0.38382156 -1.21640515 -1.09955832]]
[-1.35980321]

 Accuracy by admission status
Robberyc    0    1
row_0             
0         167   11
1           4  271

 Percentage accuracy
0.966887417219


In [12]:
# Display.
print('Coefficients')
print(fit.coef_)
print(fit.intercept_)
pred_y_sklearn = lr.predict(xtest)

print('\n Accuracy by admission status')
print(pd.crosstab(pred_y_sklearn, ytest))

print('\n Percentage accuracy')
print(lr.score(xtest, ytest))

Coefficients
[[ 1.24884352 -0.38382156 -1.21640515 -1.09955832]]
[-1.35980321]

 Accuracy by admission status
Robberyc   0    1
row_0            
0         40    7
1          0  104

 Percentage accuracy
0.953642384106
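
A single random split gives one train score and one test score, and both will shift from run to run because no `random_state` is set. The folds approach mentioned in the challenge gives a steadier picture. A minimal sketch on synthetic data (the names `X_cv`, `y_cv`, `folds` and `cv_scores` are illustrative, not from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a four-feature problem (assumption).
X_cv, y_cv = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=1
)

# Five stratified folds keep the 1/0 ratio similar in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_scores = cross_val_score(LogisticRegression(C=0.1), X_cv, y_cv, cv=folds)

# The spread across folds hints at how much a single split can mislead.
print(cv_scores.mean(), cv_scores.std())
```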


In [13]:
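#these values are very similar, so let's test lasso regression and see how it compares to the ridge above
#(liblinear is specified because the default solver in newer scikit-learn versions, lbfgs, does not support the l1 penalty)
lr = LogisticRegression(penalty='l1', solver='liblinear', C=.1)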
y = dp['Robberyc']
X = dp[['Population', 'Violent_Crime', 'Murder', 'Burglary', 'Aggravated_Assault', 'Larceny_Theft', 'Motor_Vehicle_Theft', 'Arson', 'Rape']]
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25)

In [14]:
# Fit the model.
fit = lr.fit(xtrain, ytrain)

# Display.
print('Coefficients')
print(fit.coef_)
print(fit.intercept_)
pred_y_sklearn = lr.predict(xtrain)

print('\n Accuracy by admission status')
print(pd.crosstab(pred_y_sklearn, ytrain))

print('\n Percentage accuracy')
print(lr.score(xtrain, ytrain))

Coefficients
[[  5.23502863e-05   7.48398847e-03   0.00000000e+00   2.48961894e-03
    2.72600150e-02   4.45690609e-04   4.99721643e-03   0.00000000e+00
   -1.49061552e-03]]
[-0.34615965]

 Accuracy by admission status
Robberyc   0    1
row_0            
0         98   26
1         56  273

 Percentage accuracy
0.818984547461
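
The coefficient array above is easier to read when paired with the column names. The sketch below copies the printed values by hand and lists which features the L1 penalty set to exactly zero; `feature_names` follows the column order used for `X` in the lasso cell, and `coef_by_name` is a made-up helper name.

```python
import numpy as np
import pandas as pd

# Column order of X in the lasso cell.
feature_names = [
    "Population", "Violent_Crime", "Murder", "Burglary", "Aggravated_Assault",
    "Larceny_Theft", "Motor_Vehicle_Theft", "Arson", "Rape",
]
# Coefficients copied from the lasso output above.
lasso_coef = np.array([
    5.23502863e-05, 7.48398847e-03, 0.0, 2.48961894e-03, 2.72600150e-02,
    4.45690609e-04, 4.99721643e-03, 0.0, -1.49061552e-03,
])

coef_by_name = pd.Series(lasso_coef, index=feature_names)
dropped = coef_by_name[coef_by_name == 0].index.tolist()
kept = coef_by_name[coef_by_name != 0].index.tolist()
print("zeroed:", dropped)  # zeroed: ['Murder', 'Arson']
```

Because the train/test split is random, which coefficients get zeroed can differ between runs, so this reading applies only to the split that produced the output above.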


In [15]:
#looks like changing the penalty type changes the important variables
X = dp[['Aggravated_Assault', 'Motor_Vehicle_Theft', 'Arson', 'Rape']]
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25)

In [16]:
# Fit the model.
fit = lr.fit(xtrain, ytrain)

# Display.
print('Coefficients')
print(fit.coef_)
print(fit.intercept_)
pred_y_sklearn = lr.predict(xtrain)

print('\n Accuracy by admission status')
print(pd.crosstab(pred_y_sklearn, ytrain))

print('\n Percentage accuracy')
print(lr.score(xtrain, ytrain))

Coefficients
[[ 0.00926865  0.08741887  0.02859147  0.10781651]]
[-0.35420339]

 Accuracy by admission status
Robberyc    0    1
row_0             
0         107   43
1          49  254

 Percentage accuracy
0.796909492274


In [17]:
# Display.
print('Coefficients')
print(fit.coef_)
print(fit.intercept_)
pred_y_sklearn = lr.predict(xtest)

print('\n Accuracy by admission status')
print(pd.crosstab(pred_y_sklearn, ytest))

print('\n Percentage accuracy')
print(lr.score(xtest, ytest))

Coefficients
[[ 0.00926865  0.08741887  0.02859147  0.10781651]]
[-0.35420339]

 Accuracy by admission status
Robberyc   0   1
row_0           
0         42  13
1         13  83

 Percentage accuracy
0.827814569536


In [18]:
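# Vanilla (unpenalized) logistic regression with statsmodels, fit on the full data using the current X
model = sm.Logit(y, X)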
model.fit(maxiter=1000)

Optimization terminated successfully.
         Current function value: 0.439018
         Iterations 12


<statsmodels.discrete.discrete_model.BinaryResultsWrapper at 0x1b59ce1dac8>

In [19]:
# Same model, fit on the training split only
model = sm.Logit(ytrain, xtrain)
model.fit(maxiter=1000)

Optimization terminated successfully.
         Current function value: 0.441491
         Iterations 11


<statsmodels.discrete.discrete_model.BinaryResultsWrapper at 0x1b59ce1de10>

In [20]:
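# Same model, fit on the test split only (a separate fit, not an evaluation of the training-split model)
model = sm.Logit(ytest, xtest)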
model.fit(maxiter=1000)

Optimization terminated successfully.
         Current function value: 0.423553
         Iterations 13


<statsmodels.discrete.discrete_model.BinaryResultsWrapper at 0x1b59ce1d0f0>

The value of C was set to .1 because ridge regression had a perfect score using the default, and since our purpose in this notebook was to compare models, we wanted to use a value that would guarantee comparable results.

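In this situation, ridge regression is by far the best of the regularized models, with about .95 accuracy on the test set against about .83 for lasso. The vanilla logistic regression output is harder to place: the value of about .44 that statsmodels reports is the average negative log-likelihood (where lower is better), not an accuracy, so it cannot be compared directly with the scores above. It is important to note that feature selection in ridge regression raised accuracy on the full data by almost 30 percentage points (from about .70 to about .99), while feature selection in lasso didn't improve the model at all (training accuracy went from about .82 to about .80).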
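
The "Current function value" that statsmodels prints for `Logit` is the average negative log-likelihood, which is a different quantity from the accuracy that `lr.score` returns. To put a vanilla model on the same footing as the others, both metrics can be computed on held-out data. This is a sketch only: it uses synthetic data, and a scikit-learn model with a very large `C` stands in for the unpenalized statsmodels fit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X_ll, y_ll = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=2
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_ll, y_ll, test_size=0.25, random_state=2
)

# Large C approximates an unpenalized fit (assumption: stands in for sm.Logit).
plain = LogisticRegression(C=1e9).fit(X_tr, y_tr)

# Fit on the training split, then evaluate on the untouched test split.
proba = plain.predict_proba(X_te)[:, 1]
test_log_loss = log_loss(y_te, proba)                      # lower is better
test_accuracy = accuracy_score(y_te, plain.predict(X_te))  # higher is better

# The same log loss computed by hand from the predicted probabilities.
manual = -np.mean(y_te * np.log(proba) + (1 - y_te) * np.log(1 - proba))
print(test_log_loss, manual, test_accuracy)
```

Reporting accuracy for all three models on the same test split is what makes the comparison between vanilla, ridge and lasso meaningful.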