**Thinkful - 3.3.4 - Challenge - Logistic, Ridge and Lasso Models**

In [5]:
import math
import warnings
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
sns.set(style="white", context="talk")

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

# Suppress annoying harmless error.
warnings.filterwarnings(
    action="ignore",
    module="scipy",
    message="^internal gelsd")
 
#Read file and remove commas from numbers (currently strings)
df2 = pd.read_csv('Data/Table_8_Offenses_Known_to_Law_Enforcement_by_State_by_City_2013_v2.csv')
print(len(df2))
#df = df2.iloc[:,:].dropna()
df = df2.dropna()
print(len(df))

variables = ['Population','Murder','Violent crime','Robbery','Aggravated assault','Arson',
            'Property crime','Burglary','Larceny-theft','Motor vehicle theft','Rape']
'''df[variables] = df[variables].astype(int)
df['State'] = df['State'].map(lambda x: x.title())
df.head()'''

9289
9289


"df[variables] = df[variables].astype(int)\ndf['State'] = df['State'].map(lambda x: x.title())\ndf.head()"

**Create An Outcome Variable**

In [7]:
population = df['Population']
df['Property Crime Per Capita'] = df['Property crime']/population
prop_crime_pc = df['Property Crime Per Capita']

# Binary outcome: 0 when property crime per capita is below 0.05, 1 otherwise.
# Note that, despite the name, 1 marks the higher-crime cities.
df['Is Safe'] = (~(prop_crime_pc < 0.05)).astype(int)
is_safe = df['Is Safe'].tolist()

**Creating New Features**

We currently have 9 features, so I created 6 more.

In [19]:
# Create per-capita and interaction features and append them to the dataframe
df['Robbery Per Capita'] = df['Robbery']/population
df['Murder Per Capita'] = df['Murder']/population
df['Robbery-Murder'] = df['Robbery'] * df['Murder']
df['Robbery-Larceny'] = df['Robbery'] * df['Larceny-theft']

# Create feature for region
Region = []
Northeast = ['Connecticut','Maine','Massachusetts','New Hampshire','Rhode Island', 
             'Vermont','New Jersey','New York','Pennsylvania']
Midwest = ['Illinois','Indiana','Michigan','Ohio','Wisconsin','Iowa','Kansas', 
           'Minnesota', 'Missouri', 'Nebraska', 'North Dakota', 'South Dakota']
South = ['Delaware','Florida','Georgia','Maryland','North Carolina','South Carolina',
        'Virginia','District of Columbia','West Virginia','Alabama','Kentucky', 
        'Mississippi','Tennessee','Arkansas','Louisiana','Oklahoma','Texas']
West = ['Arizona','Colorado','Idaho','Montana','Nevada','New Mexico','Utah','Wyoming', 
        'Alaska','California','Hawaii','Oregon','Washington']

for i in range(len(df['State'])):
    if df['State'][i] in Northeast:
        Region.append(0)
    elif df['State'][i] in Midwest:
        Region.append(1)
    elif df['State'][i] in South:
        Region.append(2)
    else:
        Region.append(3)

df['Region'] = pd.Series(Region, index=df.index) 

# Create feature for city size
Large_City = []
for j in range(len(df['Population'])):
    if df['Population'][j]<30000:
        Large_City.append(0)
    elif df['Population'][j]<60000:
        Large_City.append(1)
    else:
        Large_City.append(2)
df['Large City'] = pd.Series(Large_City, index=df.index)
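The two loops above index rows by position, which only works while the dataframe index runs from 0 to n-1. A minimal sketch of a vectorized alternative follows, using a small made-up frame (`toy`) and a shortened lookup (`region_codes`); both names are hypothetical and only stand in for the full table and state lists.

```python
import pandas as pd

# Hypothetical toy frame standing in for the FBI table used above.
toy = pd.DataFrame({
    "State": ["Maine", "Ohio", "Texas", "Oregon", "Ohio"],
    "Population": [12000, 45000, 250000, 30000, 59999],
})

# Shortened state -> region code lookup (the real one would cover every state list).
region_codes = {"Maine": 0, "Ohio": 1, "Texas": 2, "Oregon": 3}

# map() looks each state up; unknown states become NaN and then default to 3,
# mirroring the final else branch of the loop.
toy["Region"] = toy["State"].map(region_codes).fillna(3).astype(int)

# right=False gives the bins [0, 30000), [30000, 60000), [60000, inf),
# matching the "< 30000" and "< 60000" comparisons.
toy["Large City"] = pd.cut(
    toy["Population"],
    bins=[0, 30000, 60000, float("inf")],
    labels=[0, 1, 2],
    right=False,
).astype(int)

print(toy)
```

Because `map` and `pd.cut` align on the index, this version does not depend on how the rows are numbered.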

In [20]:
features = ['Murder','Violent crime','Robbery','Aggravated assault','Arson',
            'Burglary','Larceny-theft','Motor vehicle theft','Rape',
            'Robbery Per Capita','Murder Per Capita','Large City','Region',
           'Robbery-Murder','Robbery-Larceny']
# Need more features + Region excluded because keyerror could not be converted to float

**Logistic Regression**

In [24]:
# Declare a logistic regression classifier.
# C is the inverse of the regularization strength. Penalty L2 = ridge, L1 = lasso.
# A very large C such as 1e9 makes the regularization negligible.
lr = LogisticRegression(C=1e9)
y1 = is_safe
X = df[features]

# Fit the model.
fit = lr.fit(X, y1)

# Display.
print('Coefficients')
print(fit.coef_)
print(fit.intercept_)
pred_y_sklearn = lr.predict(X)

# For a classifier, score() returns mean accuracy, so the "R^2" label below is really accuracy.
print("R^2 = ",lr.score(X, y1))
score_log = cross_val_score(lr, X, y1, cv=10)
print("Cross Validation, CV = 10:")
print(score_log)
print("Unweighted Accuracy: %0.2f (+/- %0.2f)" % (score_log.mean(), score_log.std() * 2))

Coefficients
[[  1.47582212e-04   8.43159639e-05   2.25665878e-03  -9.22608131e-04
   -2.07096847e-04   3.33940722e-04  -6.06692024e-04   7.17723451e-04
   -1.44598575e-03  -1.16462262e-07  -1.99549685e-08  -2.11465752e-04
   -5.81230526e-03  -6.51500577e-06   3.76759897e-08]]
[-0.00193744]
R^2 =  0.89955861772
Cross Validation, CV = 10:
[ 0.71182796  0.89247312  0.90204521  0.89666308  0.90312164  0.90204521
  0.90419806  0.90193966  0.90517241  0.90409483]
Unweighted Accuracy: 0.88 (+/- 0.11)
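The class balance of `Is Safe` is not shown above, so it is hard to judge an accuracy near 0.90 in isolation. One possible check, sketched here on synthetic data (`X_demo` and `y_demo` are made up and are not the FBI table), compares the classifier against a baseline that always predicts the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic, imbalanced stand-in data: roughly 90% of rows are class 1,
# and the features carry no information about the label.
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.9).astype(int)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline_acc = cross_val_score(baseline, X_demo, y_demo, cv=10).mean()

# Logistic regression on the same uninformative features.
model_acc = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=10).mean()

print("Majority-class baseline accuracy: %0.2f" % baseline_acc)
print("Logistic regression accuracy:     %0.2f" % model_acc)
```

If the two numbers are close on the real data, the model is adding little beyond the class frequencies.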


**Selection of regularization parameter for ridge and lasso regression**

The cell below sweeps alpha from 0.1 to 9.9 and records the training R^2 for both ridge and lasso regression. The R^2 for the ridge model decreases as alpha increases, which is expected because a stronger penalty shrinks the coefficients further, whereas the R^2 for lasso regression stays essentially at zero across the whole range.

In [None]:
y = prop_crime_pc  # continuous target; defined here so the cell runs on its own

alphaStep = []
R2_ridge = []
R2_lasso = []

# Sweep alpha over 0.1, 0.2, ..., 9.9 (linspace avoids drift from repeated += .1)
for Alph_1 in np.linspace(0.1, 9.9, 99):

    # Ridge Regression
    ridgeregr = linear_model.Ridge(alpha=Alph_1, fit_intercept=False)
    ridgeregr.fit(X, y)
    R2_ridge.append(ridgeregr.score(X, y))

    # Lasso Regression
    lass = linear_model.Lasso(alpha=Alph_1)
    lass.fit(X, y)
    R2_lasso.append(lass.score(X, y))

    # Record the alpha used for this pair of scores
    alphaStep.append(Alph_1)

# Plot Results
plt.scatter(alphaStep,R2_ridge,color='blue')
plt.scatter(alphaStep,R2_lasso,color='red')
plt.legend(['Ridge', 'Lasso'])
plt.title('R-squared vs. Alpha')
plt.xlabel('Alpha')
plt.ylabel('R-squared')
plt.show()
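The sweep above scores each alpha on the same data used for fitting, where the training R^2 of ridge can only go down as alpha grows. A common alternative is to pick alpha by cross-validation on standardized features. The sketch below does this with `RidgeCV` on synthetic data (`X_demo`, `y_demo` are assumptions, not the notebook's `X` and `y`).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for X and y.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=15, noise=10.0, random_state=0
)

# Same grid as the sweep: 0.1, 0.2, ..., 9.9.
alphas = np.linspace(0.1, 9.9, 99)

# Standardize the features, then let RidgeCV choose alpha by 10-fold cross-validation.
pipe = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=10))
pipe.fit(X_demo, y_demo)

best_alpha = pipe.named_steps["ridgecv"].alpha_
print("Alpha chosen by cross-validation:", best_alpha)
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the held-out fold does not leak into the scaling.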

**Ridge Regression**

In [50]:
Alph_1 = .5
y = prop_crime_pc

#Ridge Regression
ridgeregr = linear_model.Ridge(alpha=Alph_1, fit_intercept=False) 
fit_r = ridgeregr.fit(X, y)
print('Coefficients')
print(fit_r.coef_)
print("R^2 = ",ridgeregr.score(X, y))
score_ridge = cross_val_score(ridgeregr, X, y, cv=10)
print("Cross Validation, CV = 10:")
print(score_ridge)
# For a regressor, cross_val_score returns R^2 per fold, so the "Accuracy" below is a mean R^2.
print("Unweighted Accuracy: %0.2f (+/- %0.2f)" % (score_ridge.mean(), score_ridge.std() * 2))

Coefficients
[ -1.53813819e-03   1.34210820e-04  -2.71213286e-04  -1.57613821e-04
  -7.15126030e-05  -6.90500574e-06   1.82438809e-05   2.56619242e-06
  -2.25874993e-04   3.97334761e+01  -7.56341929e-03  -1.88510070e-02
   4.90981114e-03   7.49902434e-07  -1.31392911e-09]
R^2 =  0.389602411358
Cross Validation, CV = 10:
[ 0.78300504  0.00592372 -0.28292164 -0.14900114 -1.54136874 -1.075824
 -0.2061249  -2.16813826 -1.10920803 -0.00594016]
Unweighted Accuracy: -0.57 (+/- 1.66)
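The fold scores above vary widely. One possible contributor, which is an assumption here since the row order is not shown, is that the table is grouped by state: `cross_val_score` with an integer `cv` splits a regressor's data into contiguous blocks without shuffling. The sketch below uses synthetic sorted data to show how much that can matter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic rows sorted by x, with a curved relationship a straight line cannot follow.
x = np.linspace(0, 1, 500)
X_demo = x.reshape(-1, 1)
y_demo = x ** 2

model = LinearRegression()

# Integer cv on a regressor uses KFold without shuffling: each fold is a contiguous block.
unshuffled = cross_val_score(model, X_demo, y_demo, cv=10)

# Shuffling spreads every part of the range across all folds.
shuffled = cross_val_score(
    model, X_demo, y_demo, cv=KFold(n_splits=10, shuffle=True, random_state=0)
)

print("Unshuffled mean R^2: %0.2f" % unshuffled.mean())
print("Shuffled mean R^2:   %0.2f" % shuffled.mean())
```

Passing a shuffled `KFold` as `cv` in the ridge and lasso cells would show whether ordering plays a role on the real data.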


**Lasso Regression**

In [51]:
Alph_1 = 0.1
# Lasso Regression
lass = linear_model.Lasso(alpha=Alph_1)
lassfit = lass.fit(X, y)
print('Coefficients')
print(lassfit.coef_)
print("R^2 = ",lass.score(X, y))
score_lass = cross_val_score(lass, X, y, cv=10)
print("Cross Validation, CV = 10:")
print(score_lass)
# For a regressor, cross_val_score returns R^2 per fold, so the "Accuracy" below is a mean R^2.
print("Unweighted Accuracy: %0.2f (+/- %0.2f)" % (score_lass.mean(), score_lass.std() * 2))

Coefficients
[ -0.00000000e+00   0.00000000e+00  -0.00000000e+00  -0.00000000e+00
  -0.00000000e+00  -2.49093442e-05   1.16757278e-05  -1.45370276e-06
  -0.00000000e+00   0.00000000e+00   0.00000000e+00  -0.00000000e+00
   0.00000000e+00   4.14032423e-08  -5.61461841e-10]
R^2 =  0.00012842426557
Cross Validation, CV = 10:
[-0.00495614 -0.00119499 -0.22466246 -0.15923292 -0.28307245 -0.65880693
 -0.15633098 -0.77624722 -0.44986865 -0.15902156]
Unweighted Accuracy: -0.29 (+/- 0.50)


**Discussion**

After reviewing all three models, logistic regression appears to perform best, based on its higher training and average cross-validation scores. A summary of these values is below (+/- values are two standard deviations across the 10 folds):
* Logistic Regression: Accuracy = 0.90, Avg. CV accuracy = 0.88 +/- 0.11
* Ridge Regression: R^2 = 0.39, Avg. CV R^2 = -0.57 +/- 1.66
* Lasso Regression: R^2 = 0.000128, Avg. CV R^2 = -0.29 +/- 0.50

It should be noted that the logistic regression score is classification accuracy on a categorical outcome, whereas the ridge and lasso scores are R^2 on a continuous outcome, so the numbers are not directly comparable.

To build these models, I chose features that I expected to be related to the outcome: whether or not the city was safe (for the logistic regression model), or the property crime per capita (for the ridge and lasso regression models).

For the ridge regression model, I chose alpha = 0.5. This decreased the overall R^2 value but increased the average cross-validation score. Even when the cross-validation score was optimized, however, its mean was -0.57. Because this score is an R^2, a negative value means the model predicts held-out data worse than simply predicting the mean.
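A negative score can look surprising for something labeled R^2. The short sketch below computes R^2 = 1 - SS_res / SS_tot by hand on made-up values (the arrays are illustrative only) to show when it drops below zero.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the quantity a regressor's score() reports."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Made-up per-capita values, for illustration only.
y_true = np.array([0.01, 0.02, 0.03, 0.04, 0.05])

mean_pred = np.full_like(y_true, y_true.mean())       # always predict the mean
bad_pred = np.array([0.05, 0.04, 0.03, 0.02, 0.01])   # systematically wrong

print("perfect:", r_squared(y_true, y_true))
print("mean:   ", r_squared(y_true, mean_pred))
print("bad:    ", r_squared(y_true, bad_pred))
```

Whenever the squared error of the predictions exceeds that of the constant mean prediction, the ratio is above 1 and R^2 is negative.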

For the lasso regression model, I chose alpha = 0.1. However, the R^2 is essentially 0 for all values of alpha tested, which calls into question how useful tuning alpha is here (and therefore how appropriate this model is for this dataset). The model also assigned zero or near-zero coefficients to almost all of the variables, so its predictions are nearly constant regardless of the inputs.
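One plausible reason for the zeroed coefficients, offered as an assumption rather than a diagnosis of this dataset, is that alpha = 0.1 is large relative to a target measured in hundredths. For scikit-learn's `Lasso` with an intercept, all coefficients are zero once alpha reaches max|Xc^T yc| / n for mean-centered `Xc` and `yc`. The sketch below illustrates this on synthetic data (`X_demo`, `y_demo` are made up).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data with a small-valued target, loosely like a per-capita rate.
X_demo = rng.normal(size=(200, 5))
y_demo = 0.03 + 0.01 * X_demo[:, 0] + 0.001 * rng.normal(size=200)

# Smallest alpha at which every lasso coefficient is exactly zero.
Xc = X_demo - X_demo.mean(axis=0)
yc = y_demo - y_demo.mean()
alpha_max = np.max(np.abs(Xc.T @ yc)) / len(y_demo)

too_strong = Lasso(alpha=0.1).fit(X_demo, y_demo)           # same alpha as the cell above
gentler = Lasso(alpha=alpha_max / 10).fit(X_demo, y_demo)   # well below the threshold

print("alpha_max:", alpha_max)
print("alpha=0.1 coefficients:", too_strong.coef_)
print("alpha=alpha_max/10 coefficients:", gentler.coef_)
```

With every coefficient at zero, the model predicts only its intercept, which matches an R^2 close to zero on the training data.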

Regression can be great for making predictions, but the explanatory power of the models becomes questionable as the number of features increases.