Now that you have two new regression methods at your fingertips, it's time to give them a spin. In fact, for this challenge, let's put them together! Pick a dataset of your choice with a binary outcome and the potential for at least 15 features. If you're drawing a blank, the 2013 crime rates dataset has many variables that could be turned into a modelable binary outcome.

Engineer your features, then create three models. Each model will be run on a training set and a test set (or multiple test sets, if you take a folds approach). The models should be:

Vanilla logistic regression
Ridge logistic regression
Lasso logistic regression

If you're stuck on how to begin combining your two new modeling skills, here's a hint: scikit-learn's LogisticRegression class has a "penalty" argument that takes either 'l1' or 'l2' as a value.
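As a rough illustration of the hint, the sketch below builds all three models from the same class by varying only the penalty settings. The synthetic data from `make_classification`, the `liblinear` solver, and the value `C=0.1` are assumptions made for the example, not part of the challenge, and it assumes a scikit-learn version in which `penalty` accepts 'l1' and 'l2'.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with 15 features (an assumption; not the crime dataset).
X, y = make_classification(n_samples=200, n_features=15, n_informative=5,
                           random_state=0)

# liblinear is used throughout because it supports both 'l1' and 'l2'.
models = {
    # A very large C makes the penalty negligible ("vanilla").
    'vanilla': LogisticRegression(C=1e9, solver='liblinear'),
    # Smaller C means stronger regularization.
    'ridge': LogisticRegression(penalty='l2', C=0.1, solver='liblinear'),
    'lasso': LogisticRegression(penalty='l1', C=0.1, solver='liblinear'),
}

for name, model in models.items():
    model.fit(X, y)
    # Count coefficients that were shrunk exactly to zero.
    n_zero = int(np.sum(model.coef_ == 0))
    print(name, 'coefficients set to zero:', n_zero)
```

The L1 penalty is the one that can set coefficients exactly to zero, which is why the loop reports that count for each model.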

In your report, evaluate all three models and decide on your best. Be clear about the decisions you made that led to these models (feature selection, regularization parameter selection, model evaluation criteria) and why you think that particular model is the best of the three. Also reflect on the strengths and limitations of regression as a modeling approach. Were there things you couldn't do but you wish you could have done?
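One possible way to make the regularization parameter selection and the evaluation criterion explicit is a small cross-validated search. The sketch below assumes synthetic data, accuracy as the criterion, five folds, and an arbitrary grid of `C` values; all of these are illustrative choices rather than requirements.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the crime dataset).
X, y = make_classification(n_samples=200, n_features=15, n_informative=5,
                           random_state=0)

results = {}
for penalty in ('l1', 'l2'):
    for C in (0.01, 0.1, 1.0, 10.0):  # hypothetical grid of penalty strengths
        model = LogisticRegression(penalty=penalty, C=C, solver='liblinear')
        # Mean accuracy over 5 held-out folds.
        scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
        results[(penalty, C)] = scores.mean()

best = max(results, key=results.get)
print('best (penalty, C):', best)
print('mean held-out accuracy:', round(results[best], 3))
```

Scoring on held-out folds rather than on the training data gives a fairer basis for comparing the three models.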

Record your work and reflections in a notebook to discuss with your mentor.

In [87]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import math
import seaborn as sns
import sklearn
from sklearn import linear_model
from sklearn import preprocessing
%matplotlib inline
sns.set_style('white')

types = {'Population':int, 'Violent\ncrime':int, 'Murder and\nnonnegligent\nmanslaughter':int, 'Robbery':int}

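# Load the dataset. skipfooter is only supported by the Python parser engine,
# so request it explicitly; thousands=',' parses values such as "44,744" as numbers.
data = pd.read_csv('C:\\Users\\maken\\table_8_offenses_known_to_law_enforcement_mississippi_by_city_2013.csv', skiprows = 4, skipfooter = 2, thousands = ',', engine = 'python')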



In [88]:
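# Cast the count columns to int. read_csv(thousands=',') has already removed
# the comma separators, so this only enforces the expected dtypes.
for column, dtype in types.items():
    data[column] = data[column].astype(dtype)
print(data[list(types)].dtypes)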

In [89]:
data.head()

Unnamed: 0,City,Population,Violent crime,Murder and nonnegligent manslaughter,Rape (revised definition)1,Rape (legacy definition)2,Robbery,Aggravated assault,Property crime,Burglary,Larceny- theft,Motor vehicle theft,Arson,Unnamed: 13
0,Aberdeen,5473,9,0,,1.0,4,4,172,40,127,5,0,
1,Amory,7167,5,0,,1.0,2,2,265,81,181,3,2,
2,Batesville,7417,18,2,,1.0,5,10,447,72,360,15,1,
3,Biloxi,44744,213,3,36.0,,90,84,2316,703,1482,131,6,
4,Byhalia,1270,10,0,,0.0,0,10,71,19,51,1,0,


In [90]:
data.tail()

Unnamed: 0,City,Population,Violent crime,Murder and nonnegligent manslaughter,Rape (revised definition)1,Rape (legacy definition)2,Robbery,Aggravated assault,Property crime,Burglary,Larceny- theft,Motor vehicle theft,Arson,Unnamed: 13
34,Southaven,50801,168,0,,9.0,20,139,1455,210,1189,56,3,
35,Starkville,24519,42,1,,1.0,11,29,608,125,472,11,0,
36,Vicksburg,23318,110,5,,23.0,14,68,1260,275,915,70,6,
37,West Point,11179,54,1,,5.0,14,34,392,162,229,1,0,
38,Wiggins,4535,8,0,,3.0,2,3,116,29,84,3,0,


In [91]:
# Create binary features (three so far; fifteen are planned)
data['pop_over_twentyk'] = data['Population'] > 20000
data['violent_crime_over_fifty'] = data['Violent\ncrime'] > 50
# NaN compares as False, so cities with no legacy-definition count get False
data['rape_over_ten'] = data['Rape\n(legacy\ndefinition)2'] > 10
# TODO: the remaining features still need to be engineered

In [92]:
data.head()

Unnamed: 0,City,Population,Violent crime,Murder and nonnegligent manslaughter,Rape (revised definition)1,Rape (legacy definition)2,Robbery,Aggravated assault,Property crime,Burglary,Larceny- theft,Motor vehicle theft,Arson,Unnamed: 13,pop_over_twentyk,violent_crime_over_fifty,rape_over_ten
0,Aberdeen,5473,9,0,,1.0,4,4,172,40,127,5,0,,False,False,False
1,Amory,7167,5,0,,1.0,2,2,265,81,181,3,2,,False,False,False
2,Batesville,7417,18,2,,1.0,5,10,447,72,360,15,1,,False,False,False
3,Biloxi,44744,213,3,36.0,,90,84,2316,703,1482,131,6,,True,True,False
4,Byhalia,1270,10,0,,0.0,0,10,71,19,51,1,0,,False,False,False


In [93]:
from sklearn.linear_model import LogisticRegression

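# Vanilla logistic regression. C is the inverse of the regularization strength,
# so a very large C makes the default L2 penalty negligible.
lr = LogisticRegression(C=1e9)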

y = data['violent_crime_over_fifty']
X = data['pop_over_twentyk'].values.reshape(-1,1)

fit = lr.fit(X, y)

# Display.
print('Coefficients')
print(fit.coef_)
print(fit.intercept_)
pred_y_sklearn = lr.predict(X)

print('\n Accuracy')
print(pd.crosstab(pred_y_sklearn, y))

print('\n Percentage accuracy')
print(lr.score(X, y))

Coefficients
[[ 2.30255882]]
[-1.3862682]

 Accuracy
violent_crime_over_fifty  False  True 
row_0                                 
False                        20      5
True                          4     10

 Percentage accuracy
0.769230769231
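To see what the fitted coefficients mean, they can be passed through the logistic function. The sketch below copies the two printed values by hand and uses only the standard library; the helper name `sigmoid` is introduced here for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Values copied from the printed output above.
coef = 2.30255882
intercept = -1.3862682

# Predicted probability of violent_crime_over_fifty for each feature value.
p_small = sigmoid(intercept)         # pop_over_twentyk == False
p_large = sigmoid(intercept + coef)  # pop_over_twentyk == True

print(round(p_small, 3), round(p_large, 3))
```

The two probabilities come out near 0.20 and 0.71, which match the crosstab rows: 5 of 25 cities under 20,000 people and 10 of 14 cities over 20,000 people exceed fifty violent crimes. With a single binary feature and almost no penalty, the model simply recovers those two group rates.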


In [94]:
from sklearn import linear_model
from sklearn import preprocessing

y = data['violent_crime_over_fifty']
X = data['pop_over_twentyk'].values.reshape(-1,1)

# Ridge regression. Note: linear_model.Ridge is linear regression with an L2
# penalty, not logistic regression, so it treats the boolean outcome as 0/1.
ridgeregr = linear_model.Ridge(alpha=10, fit_intercept=False)
ridgeregr.fit(X, y)

"""
# Display.
print('Coefficients')
print(ridgeregr.coef_)
print(ridgeregr.intercept_)

rpred_y_sklearn = ridgeregr.predict(X)
print('\n Accuracy')
print(pd.crosstab(rpred_y_sklearn, y))
"""
# Ridge.score returns R-squared, not classification accuracy.
print('\n R-squared')
print(ridgeregr.score(X, y))


 R-squared
0.0144675925926
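The score of about 0.014 printed above is the R-squared of a linear ridge fit, not a classification accuracy. The sketch below is a check under one assumption: that the data can be rebuilt from the counts in the earlier crosstab (20, 5, 4, 10), whose rows correspond to `pop_over_twentyk`. The 0.5 cut-off used to turn the continuous predictions into classes is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Rebuild the single feature and the outcome from the crosstab: (x, y, count).
cells = [(0, 0, 20), (0, 1, 5), (1, 0, 4), (1, 1, 10)]
x = np.concatenate([np.full(n, xv, dtype=float) for xv, _, n in cells])
y = np.concatenate([np.full(n, yv, dtype=float) for _, yv, n in cells])
X = x.reshape(-1, 1)

ridge = Ridge(alpha=10, fit_intercept=False).fit(X, y)

# With one feature and no intercept, ridge has the closed form x.y / (x.x + alpha).
closed_form = (x @ y) / (x @ x + 10)
r_squared = ridge.score(X, y)  # R-squared of the linear fit

# Thresholding the continuous predictions gives an actual accuracy figure.
accuracy = np.mean((ridge.predict(X) >= 0.5) == (y == 1))

print('coefficient:', ridge.coef_[0], 'closed form:', closed_form)
print('R-squared:', r_squared)
print('accuracy at 0.5 cut-off:', accuracy)
```

Because the largest prediction is the coefficient itself (10/24), no city crosses the 0.5 cut-off, so a classifier built this way predicts False everywhere.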


In [105]:
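# Lasso regression. Note: linear_model.Lasso is linear regression with an L1
# penalty, not logistic regression, so it treats the boolean outcome as 0/1.

y = data['violent_crime_over_fifty']
X = data['pop_over_twentyk'].values.reshape(-1,1)

lass = linear_model.Lasso(alpha=.35)
lassfit = lass.fit(X, y)

# Display.
print('Coefficients')
print(lass.coef_)
print(lass.intercept_)

# The predictions are continuous values, so this crosstab is not a confusion matrix.
lpred_y_sklearn = lassfit.predict(X)
print('\n Predicted value vs. outcome')
print(pd.crosstab(lpred_y_sklearn, y))


# Lasso.score returns R-squared, not classification accuracy.
print('\n R-squared')
print(lass.score(X, y))

Coefficients
[ 0.]
0.384615384615

 Predicted value vs. outcome
violent_crime_over_fifty  False  True 
row_0                                 
0.384615                     24     15

 R-squared
0.0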
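The zero coefficient and the R-squared of 0.0 follow from the size of the penalty. With a single feature and an intercept, scikit-learn's Lasso objective shrinks the coefficient to exactly zero whenever `alpha` is at least the absolute covariance between the feature and the outcome. The sketch below checks this under the assumption that the data can be rebuilt from the crosstab counts printed for the first model (20, 5, 4, 10).

```python
import numpy as np
from sklearn.linear_model import Lasso

# Rebuild the single feature and the outcome from the crosstab: (x, y, count).
cells = [(0, 0, 20), (0, 1, 5), (1, 0, 4), (1, 1, 10)]
x = np.concatenate([np.full(n, xv, dtype=float) for xv, _, n in cells])
y = np.concatenate([np.full(n, yv, dtype=float) for _, yv, n in cells])
X = x.reshape(-1, 1)

alpha = 0.35
lasso = Lasso(alpha=alpha).fit(X, y)

# Population covariance between feature and outcome: the soft-threshold level.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

print('covariance:', cov_xy, 'alpha:', alpha)
print('coefficient:', lasso.coef_[0])
print('intercept (mean of y):', lasso.intercept_)
```

The covariance is about 0.118, well under 0.35, so the model reduces to predicting the mean of the outcome (15/39) for every city. An `alpha` below that covariance would be needed for the feature to keep a nonzero weight.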
