# Lesson 04 Assignment

## Background

    You are working for a data science consulting company. A client approaches your company and asks you to analyze crime data across the United States. At first glance, you notice that the data has 128 attributes and cannot be examined manually. The data combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR. Your task is to identify the features or attributes that contribute most to crime.

    Generally, such data might be used for predictive policing, where police departments can predict potential criminal activity so they can ensure they are properly staffed and the areas of concern are patrolled accordingly.

## Instructions

    It is recommended you complete the lab exercises for this lesson before beginning the assignment.

    Using the Communities and Crime dataset, create a new notebook, perform each of the following tasks, and answer the related questions:

    (1) Read data.
    (2) Apply three techniques for feature selection: Filter methods, Wrapper methods, Embedded methods.
    (3) Describe your findings.

In [None]:
# Import packages

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sms  # aliased as sms throughout this notebook
from sklearn import linear_model
from sklearn.feature_selection import RFE  # Recursive Feature Elimination
from sklearn.linear_model import LinearRegression

#Plot styling

import seaborn as sns; sns.set()  # for plot styling
%matplotlib inline

## (1) Read Data

In [None]:
# Read the data from a local CSV file and assign the column names

data = pd.read_csv("/Users/matt.denko/Downloads/data.csv") 
data.columns = ['state',
'county',
'community',
'communityname',
'fold',
'population',
'householdsize',
'racepctblack',
'racePctWhite',
'racePctAsian',
'racePctHisp',
'agePct12t21',
'agePct12t29',
'agePct16t24',
'agePct65up',
'numbUrban',
'pctUrban',
'medIncome',
'pctWWage',
'pctWFarmSelf',
'pctWInvInc',
'pctWSocSec',
'pctWPubAsst',
'pctWRetire',
'medFamInc',
'perCapInc',
'whitePerCap',
'blackPerCap',
'indianPerCap',
'AsianPerCap',
'OtherPerCap',
'HispPerCap',
'NumUnderPov',
'PctPopUnderPov',
'PctLess9thGrade',
'PctNotHSGrad',
'PctBSorMore',
'PctUnemployed',
'PctEmploy',
'PctEmplManu',
'PctEmplProfServ',
'PctOccupManu',
'PctOccupMgmtProf',
'MalePctDivorce',
'MalePctNevMarr',
'FemalePctDiv',
'TotalPctDiv',
'PersPerFam',
'PctFam2Par',
'PctKids2Par',
'PctYoungKids2Par',
'PctTeen2Par',
'PctWorkMomYoungKids',
'PctWorkMom',
'NumIlleg',
'PctIlleg',
'NumImmig',
'PctImmigRecent',
'PctImmigRec5',
'PctImmigRec8',
'PctImmigRec10',
'PctRecentImmig',
'PctRecImmig5',
'PctRecImmig8',
'PctRecImmig10',
'PctSpeakEnglOnly',
'PctNotSpeakEnglWell',
'PctLargHouseFam',
'PctLargHouseOccup',
'PersPerOccupHous',
'PersPerOwnOccHous',
'PersPerRentOccHous',
'PctPersOwnOccup',
'PctPersDenseHous',
'PctHousLess3BR',
'MedNumBR',
'HousVacant',
'PctHousOccup',
'PctHousOwnOcc',
'PctVacantBoarded',
'PctVacMore6Mos',
'MedYrHousBuilt',
'PctHousNoPhone',
'PctWOFullPlumb',
'OwnOccLowQuart',
'OwnOccMedVal',
'OwnOccHiQuart',
'RentLowQ',
'RentMedian',
'RentHighQ',
'MedRent',
'MedRentPctHousInc',
'MedOwnCostPctInc',
'MedOwnCostPctIncNoMtg',
'NumInShelters',
'NumStreet',
'PctForeignBorn',
'PctBornSameState',
'PctSameHouse85',
'PctSameCity85',
'PctSameState85',
'LemasSwornFT',
'LemasSwFTPerPop',
'LemasSwFTFieldOps',
'LemasSwFTFieldPerPop',
'LemasTotalReq',
'LemasTotReqPerPop',
'PolicReqPerOffic',
'PolicPerPop',
'RacialMatchCommPol',
'PctPolicWhite',
'PctPolicBlack',
'PctPolicHisp',
'PctPolicAsian',
'PctPolicMinor',
'OfficAssgnDrugUnits',
'NumKindsDrugsSeiz',
'PolicAveOTWorked',
'LandArea',
'PopDens',
'PctUsePubTrans',
'PolicCars',
'PolicOperBudg',
'LemasPctPolicOnPatr',
'LemasGangUnitDeploy',
'LemasPctOfficDrugUn',
'PolicBudgPerPop',
'ViolentCrimesPerPop']
print(data.columns)
data.describe()
data.head()

In [None]:
# Treat missing values: the dataset marks them with "?", which is replaced with 0 here

data = data.replace(to_replace="?", value=float(0))
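
Before replacing the placeholder, it can help to see how many values each column is missing. The following is a minimal sketch on a made-up two-column frame (the values are illustrative, not taken from the dataset). It assumes every column is meant to be numeric, and it coerces `?` to NaN before filling with 0 rather than calling `replace` directly.

```python
import pandas as pd

# Hypothetical toy frame standing in for the crime data (values are made up)
toy = pd.DataFrame({
    "population": ["0.25", "0.5", "0.75"],
    "LemasSwornFT": ["0.5", "?", "?"],
})

# Count the "?" placeholders in each column
missing_counts = (toy == "?").sum()
print(missing_counts)

# Coerce every column to numeric ("?" becomes NaN), then fill NaN with 0.
# For numeric columns this gives the same result as replacing "?" with 0.
toy_clean = toy.apply(pd.to_numeric, errors="coerce").fillna(0)
print(toy_clean)
```

Filling with 0 treats a missing value as a real measurement of zero, so columns with many placeholders may deserve a closer look before they are used as features.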

## (2) Apply three techniques for feature selection: Filter methods, Wrapper methods, Embedded methods.

### Filter Methods

### Comments:

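    For my filter method, I will choose the top 5 variables by the strength of their relationship with y. Rather than ranking raw correlations, I fit a multiple linear regression model on all features and keep the 5 features with the lowest p-values.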
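
For comparison, a filter that ranks features by raw correlation with the target can be sketched as follows. This is a minimal example on synthetic data, not the notebook's result: the frame `toy`, the feature names `f0` to `f5`, and the coefficients are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: 6 candidate features, only f0 and f3 drive the target
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"f{i}" for i in range(6)])
toy["target"] = 3.0 * toy["f0"] - 2.0 * toy["f3"] + rng.normal(scale=0.1, size=n)

# Absolute Pearson correlation of each feature with the target
correlations = toy.drop(columns="target").corrwith(toy["target"]).abs()

# Keep the k features with the largest absolute correlation
k = 2
top_features = correlations.sort_values(ascending=False).head(k).index.tolist()
print(top_features)
```

On the notebook's data, the same pattern would apply to the numeric feature columns and `ViolentCrimesPerPop` with `k = 5`. Each feature is scored on its own, so a filter like this ignores interactions and redundancy between features.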

In [None]:
# Define the target and features:

target_label = 'ViolentCrimesPerPop'
non_features = ['state','communityname','fold']
feature_labels = [x for x in data.columns if x not in [target_label] + non_features]

# One-hot encode inputs

data_expanded = pd.get_dummies(data, drop_first=True)
print('DataFrame one-hot-expanded shape: {}'.format(data_expanded.shape))

# Get target and original x-matrix

y = data[target_label]
x = data[feature_labels].to_numpy()  # DataFrame.as_matrix was removed from pandas

In [None]:
x = x.astype(float)
#x.head()

In [None]:
# Model initialization

regression_model = LinearRegression()

# Fit the data(train the model)

regression_model.fit(x, y)

# Predict

y_predicted = regression_model.predict(x)

#Summary Statistics

X = sms.add_constant(x)

# Note the difference in argument order: OLS takes (y, X), while scikit-learn's fit takes (X, y)

model = sms.OLS(y, X).fit()
predictions = model.predict(X) # make the predictions by the model

# Print out the statistics

model.summary()

### Comments:

    The 5 features with the lowest p-values were NumStreet, PctRecImmig10, PctFam2Par, PctForeignBorn, and OwnOccMedVal. These are the features selected via the filter method.

### Wrapper Methods

### Comments:

    For my wrapper method, I will be using recursive feature elimination to select 5 features.

In [None]:
# Recursive Feature Elimination

estimator = LinearRegression()
selector = RFE(estimator, n_features_to_select=5, step=1)  # select 5 features; step=1 removes one feature per iteration
selector = selector.fit(X, y)  # X comes from add_constant above, so column 0 is the constant and feature indices are shifted by one
print(selector.support_)  # boolean mask of the selected features
print(selector.ranking_)  # selected features have rank 1; the higher the rank, the earlier the feature was eliminated
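
The mask in `support_` is positional, so it has to be lined up with the column names to read off the selected features. The following is a minimal sketch on synthetic data. The names `a` to `d` and the coefficients are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 4 candidate features, only "b" and "c" drive the target
rng = np.random.default_rng(0)
names = np.array(["a", "b", "c", "d"])
X_toy = rng.normal(size=(100, 4))
y_toy = 5.0 * X_toy[:, 1] - 4.0 * X_toy[:, 2] + rng.normal(scale=0.01, size=100)

# Eliminate one feature per iteration until 2 remain
rfe = RFE(LinearRegression(), n_features_to_select=2, step=1).fit(X_toy, y_toy)

# Index the name array with the boolean mask to get the selected names
selected = names[rfe.support_].tolist()
print(selected)

# Pair every name with its rank (1 means selected)
print(dict(zip(names.tolist(), rfe.ranking_.tolist())))
```

If the matrix passed to `fit` has an extra leading column, such as the constant added by `add_constant`, the name array needs a matching leading entry. Otherwise every name is off by one position.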

### Comments:

    The 5 features selected via recursive feature elimination were communityname, PctOccupManu, MalePctNevMarr, PctSameState85, and LemasTotalReq.

### Embedded Methods

### Comments:

    For my embedded method, I will be using ridge regression.

In [None]:
# Ridge Regression

alpha = 5 
clf = linear_model.Ridge(alpha=alpha)
clf.fit(X, y)
print(clf.coef_)
print(clf.intercept_)

# Increasing alpha shrinks the L2 norm of the coefficients toward 0, but ridge does not set
# coefficients exactly to 0, so it does not select variables

print("Sum of square of coefficients = %.2f"%np.sum(clf.coef_**2)) 

### Comments:

    With an alpha of 5, ridge regression produced a model whose sum of squared coefficients was 0.26.
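
The point that ridge shrinks coefficients without selecting variables can be illustrated on synthetic data. The sketch below contrasts ridge with lasso (L1 regularization). Lasso is not used in this notebook and appears only for contrast. The data and both alpha values are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 5 candidate features, only the first drives the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=5).fit(X_toy, y_toy)
lasso = Lasso(alpha=0.1).fit(X_toy, y_toy)

# Count the coefficients that are exactly zero under each penalty
ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))

print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)
print("Exact zeros: ridge = %d, lasso = %d" % (ridge_zeros, lasso_zeros))
```

Ridge keeps every feature with a small nonzero weight, while the L1 penalty drives the coefficients of uninformative features to exactly zero. This is why lasso is commonly used when an embedded method needs to return a feature subset.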

## (3) Describe your findings.

### Comments:

    For my filter method, I ran a multiple linear regression model and chose the 5 features with the lowest p-values. For my wrapper method, I used recursive feature elimination to select 5 features. For my embedded method, I used ridge regression. The filter and wrapper methods produced entirely different lists of 5 features, which illustrates how strongly the selected features can depend on the selection method. The embedded method differs in that the regularization acts while the model is being fit. Ridge regression, however, only shrinks coefficients toward zero and does not remove any features, so it does not by itself produce a list of 5 selected features.