This is a Thinkful assignment. The goal is to validate the FBI Property Crime model built previously. The assignment has three sections. The first creates holdout groups from the original data set and checks the R-squared value and the F-test results. The second validates the model against crime data from another state, or from the same state in a different year. The third revises the model based on the first two sections and compares the old and new models.

In [1]:
import math
import warnings

from IPython.display import display
from matplotlib import pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import linear_model
import statsmodels.formula.api as smf

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

## Code from Previous FBI Property Crime Model

In [2]:
# The CSV has header and footer rows around the data, so skip them on import.
# skipfooter is only supported by the Python parser engine, so request it explicitly.
df = pd.read_csv('https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/master/New_York_offenses/NEW_YORK-Offenses_Known_to_Law_Enforcement_by_City_2013%20-%2013tbl8ny.csv',
                 skiprows=4,skipfooter=3,header=0,na_values='nan',engine='python')



In [3]:
#Fix column name format issue and change column name. 
df.rename(columns={'Violent\ncrime':'ViolentCrime'},inplace=True)
df.rename(columns={'Murder and\nnonnegligent\nmanslaughter':'Murder'},inplace=True)
df.rename(columns={'Rape\n(revised\ndefinition)1':'Rape_1'},inplace=True)
df.rename(columns={'Rape\n(legacy\ndefinition)2':'Rape_2'},inplace=True)
df.rename(columns={'Aggravated\nassault':'AggravatedAssault'},inplace=True)
df.rename(columns={'Property\ncrime':'PropertyCrime'},inplace=True)
df.rename(columns={'Larceny-\ntheft':'LarcenyTheft'},inplace=True)
df.rename(columns={'Motor\nvehicle\ntheft':'MotorVehicleTheft'},inplace=True)

In [4]:
#Fix data format, and fill na
df['Population'] = pd.to_numeric(df['Population'].str.replace(',',''))
df['ViolentCrime'] = pd.to_numeric(df['ViolentCrime'].str.replace(',',''))
df['Rape_2'] = pd.to_numeric(df['Rape_2'].str.replace(',',''))
df['Robbery'] = pd.to_numeric(df['Robbery'].str.replace(',',''))
df['AggravatedAssault'] = pd.to_numeric(df['AggravatedAssault'].str.replace(',',''))
df['PropertyCrime'] = pd.to_numeric(df['PropertyCrime'].str.replace(',',''))
df['Burglary'] = pd.to_numeric(df['Burglary'].str.replace(',',''))
df['LarcenyTheft'] = pd.to_numeric(df['LarcenyTheft'].str.replace(',',''))
df['MotorVehicleTheft'] = pd.to_numeric(df['MotorVehicleTheft'].str.replace(',',''))
df['Arson3_fillna'] = df['Arson3'].fillna(0)

In [5]:
#Fit our model
regr = linear_model.LinearRegression()
Y = df['PropertyCrime']
X = df[['ViolentCrime','Murder','Rape_2','Robbery','AggravatedAssault']]
regr.fit(X, Y)

# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
print(regr.score(X, Y))


Coefficients: 
 [ 17.15909077  10.10484153  39.05466462 -16.88634685 -15.11406854]

Intercept: 
 152.35194272557874

R-squared:
0.9984225484405618


### Section 1.1 - Holdout Group and Cross Validation

In [6]:
#Test with 30% holdout group
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=30)
print('With 30% Holdout the R-Squared Value is: ' + str(regr.fit(X_train, Y_train).score(X_test, Y_test)))
print('Testing on Sample the R-Squared Value is: ' + str(regr.fit(X, Y).score(X, Y)))

With 30% Holdout the R-Squared Value is: 0.9180386702443953
Testing on Sample the R-Squared Value is: 0.9984225484405618
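A single 30% holdout depends on which rows happen to land in the test set. The sketch below repeats the split with several seeds to show how much the held-out R-squared can move. It uses synthetic stand-in data rather than the crime data, and `holdout_r2`, `X_demo`, and `y_demo` are hypothetical names introduced only for this illustration. It roughly mirrors what `train_test_split` followed by `fit` and `score` does, using plain NumPy least squares.

```python
import numpy as np

def holdout_r2(X, y, test_size=0.3, seed=0):
    """Fit OLS on a random training split and return R-squared on the held-out rows."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(round(test_size * len(y)))
    test, train = idx[:n_test], idx[n_test:]

    # Add an intercept column and solve the least-squares problem on the training rows.
    A_train = np.column_stack([np.ones(len(train)), X[train]])
    coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)

    # Score on the rows the model has not seen.
    A_test = np.column_stack([np.ones(len(test)), X[test]])
    resid = y[test] - A_test @ coef
    total = y[test] - y[test].mean()
    return 1.0 - (resid @ resid) / (total @ total)

# Synthetic stand-in data (an assumption, not the FBI data set).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

scores = [holdout_r2(X_demo, y_demo, seed=s) for s in range(5)]
print(np.round(scores, 3))
```

If the scores differ noticeably between seeds, a single holdout figure should be read with caution.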


In [7]:
#Cross Validation
from sklearn.model_selection import cross_val_score
cross_val_score(regr, X, Y, cv=10)

array([0.7973143 , 0.92878715, 0.30792562, 0.4008504 , 0.55112141,
       0.52865333, 0.99981743, 0.90115369, 0.97698505, 0.81141256])

The holdout test and the cross-validation show that our model does not return a consistent R-squared value: the ten fold scores range from about 0.31 to 1.00. This may be because the model is overfitting the data.
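To put a number on that inconsistency, the fold scores can be summarized by their mean and spread. This is a minimal sketch that copies the ten scores printed above into a list. The name `cv_scores` is introduced here for illustration.

```python
import statistics

# The ten fold scores reported by cross_val_score above.
cv_scores = [0.7973143, 0.92878715, 0.30792562, 0.4008504, 0.55112141,
             0.52865333, 0.99981743, 0.90115369, 0.97698505, 0.81141256]

mean_score = statistics.mean(cv_scores)
spread = statistics.stdev(cv_scores)  # sample standard deviation

print(f"mean R-squared: {mean_score:.3f}")
print(f"sample std dev: {spread:.3f}")
print(f"min / max: {min(cv_scores):.3f} / {max(cv_scores):.3f}")
```

A mean well below the in-sample R-squared, together with a large spread, is the pattern described above.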

### Section 1.2 - F-test and P-value for each feature. 
The holdout test and the cross-validation show that our model's R-squared value varies widely across folds. We may need to change the model, so we use the F-test and the P-value of each feature to find out which features could be dropped.

In [8]:
linear_formula = 'PropertyCrime ~ ViolentCrime + Murder + Rape_2 + Robbery + AggravatedAssault'
lm = smf.ols(formula=linear_formula, data=df).fit()

In [9]:
print('\nCoefficients:')
print(lm.params)
print('\nP-Values:')
print(lm.pvalues)
print('\nR-squared:')
print(lm.rsquared)


Coefficients:
Intercept           152.352
ViolentCrime         17.159
Murder               10.105
Rape_2               39.055
Robbery             -16.886
AggravatedAssault   -15.114
dtype: float64

P-Values:
Intercept           0.000
ViolentCrime        0.000
Murder              0.535
Rape_2              0.000
Robbery             0.000
AggravatedAssault   0.000
dtype: float64

R-squared:
0.9984225484405618


Looking at the P-value of each feature, we see that the P-value for Murder is greater than 0.05, so we cannot reject the hypothesis that its coefficient is zero. Dropping it should therefore have little effect on our R-squared value, and we can consider leaving this feature out of the new model.
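The cells above print the per-feature P-values but not the overall F-statistic. In statsmodels the fitted results object exposes it as `lm.fvalue`, with its P-value in `lm.f_pvalue`. As a sketch of what that number measures, the function below computes the overall F-statistic from R-squared for a model with an intercept. The name `f_statistic` and the example inputs are hypothetical and are not taken from the crime data.

```python
def f_statistic(r_squared, n_obs, n_features):
    """Overall F-statistic for an OLS model that includes an intercept.

    F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
    """
    explained = r_squared / n_features
    unexplained = (1.0 - r_squared) / (n_obs - n_features - 1)
    return explained / unexplained

# Hypothetical inputs chosen for easy arithmetic, not the notebook's values.
example_f = f_statistic(r_squared=0.5, n_obs=12, n_features=1)
print(example_f)
```

A large F-statistic says the features jointly explain the target better than an intercept-only model. It does not say that every individual feature is needed, which is why the per-feature P-values are inspected separately.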

### Section 2 - Validate the same model on another data set

In [10]:
test_df = pd.read_csv('FBI Crime Record NYS 2014.csv',skiprows=4,skipfooter=7,header=0,na_values='nan',engine='python')


In [11]:
#Fix column name format issue and change column name. 
test_df.rename(columns={'Violent\ncrime':'ViolentCrime'},inplace=True)
test_df.rename(columns={'Murder and\nnonnegligent\nmanslaughter':'Murder'},inplace=True)
test_df.rename(columns={'Rape\n(revised\ndefinition)1':'Rape_1'},inplace=True)
test_df.rename(columns={'Rape\n(legacy\ndefinition)2':'Rape_2'},inplace=True)
test_df.rename(columns={'Aggravated\nassault':'AggravatedAssault'},inplace=True)
test_df.rename(columns={'Property\ncrime':'PropertyCrime'},inplace=True)
test_df.rename(columns={'Larceny-\ntheft':'LarcenyTheft'},inplace=True)
test_df.rename(columns={'Motor\nvehicle\ntheft':'MotorVehicleTheft'},inplace=True)

In [12]:
#Fix data format, and fill na
test_df['Population'] = pd.to_numeric(test_df['Population'].str.replace(',',''))
test_df['ViolentCrime'] = pd.to_numeric(test_df['ViolentCrime'].str.replace(',',''))
test_df['Rape_2'] = pd.to_numeric(test_df['Rape_2'])
test_df['Robbery'] = pd.to_numeric(test_df['Robbery'].str.replace(',',''))
test_df['AggravatedAssault'] = pd.to_numeric(test_df['AggravatedAssault'].str.replace(',',''))
test_df['PropertyCrime'] = pd.to_numeric(test_df['PropertyCrime'].str.replace(',',''))
test_df['Burglary'] = pd.to_numeric(test_df['Burglary'].str.replace(',',''))
test_df['LarcenyTheft'] = pd.to_numeric(test_df['LarcenyTheft'].str.replace(',',''))
test_df['MotorVehicleTheft'] = pd.to_numeric(test_df['MotorVehicleTheft'].str.replace(',',''))
test_df['Arson3_fillna'] = test_df['Arson3'].fillna(0)
test_df['Rape_2'] = test_df['Rape_2'].fillna(0)
test_df['PropertyCrime'] = test_df['PropertyCrime'].fillna(0)

In [13]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 369 entries, 0 to 368
Data columns (total 15 columns):
City                 369 non-null object
Population           369 non-null int64
ViolentCrime         369 non-null int64
Murder               369 non-null int64
Rape_1               227 non-null object
Rape_2               369 non-null float64
Robbery              369 non-null int64
AggravatedAssault    369 non-null int64
PropertyCrime        369 non-null float64
Burglary             369 non-null int64
LarcenyTheft         368 non-null float64
MotorVehicleTheft    369 non-null int64
Arson3               365 non-null float64
Unnamed: 13          0 non-null float64
Arson3_fillna        369 non-null float64
dtypes: float64(6), int64(7), object(2)
memory usage: 43.3+ KB


In [14]:
Y_2014 = test_df['PropertyCrime']
X_2014 = test_df[['ViolentCrime','Murder','Rape_2','Robbery','AggravatedAssault']]

In [15]:
print('Test with FBI Crime Record for NYS at 2014 the R-Squared Value is: ' + str(regr.fit(X, Y).score(X_2014, Y_2014)))

Test with FBI Crime Record for NYS at 2014 the R-Squared Value is: 0.9749877917315228


### Section 3 - Revised model
From the previous two sections, we know that our model did not perform well in cross-validation. From the P-value of each feature, we found that the Murder feature has a relatively high P-value compared with the others. Therefore, our revised model drops the Murder feature.

In [16]:
regr_revised = linear_model.LinearRegression()
Y = df['PropertyCrime']
X_revised = df[['ViolentCrime','Rape_2','Robbery','AggravatedAssault']]
regr_revised.fit(X_revised, Y)

# Inspect the results.
print('\nCoefficients: \n', regr_revised.coef_)
print('\nIntercept: \n', regr_revised.intercept_)
print('\nR-squared:')
print(regr_revised.score(X_revised, Y))


Coefficients: 
 [ 27.2639323   28.94982309 -26.99118839 -25.21891007]

Intercept: 
 152.351942725581

R-squared:
0.9984225484405618


In [17]:
X_revised_train, X_revised_test, Y_train, Y_test = train_test_split(X_revised, Y, test_size=0.3, random_state=30)
print('With 30% Holdout the R-Squared Value for Revised Model is: ' + str(regr_revised.fit(X_revised_train, Y_train).score(X_revised_test, Y_test)))
print('Testing on Sample the R-Squared Value for the Revised Model is: ' + str(regr_revised.fit(X_revised, Y).score(X_revised, Y)))

With 30% Holdout the R-Squared Value for Revised Model is: 0.9180386702443957
Testing on Sample the R-Squared Value for the Revised Model is: 0.9984225484405618


In [18]:
cross_val_score(regr_revised, X_revised, Y, cv=10)

array([0.7973143 , 0.92878715, 0.30792562, 0.4008504 , 0.55112141,
       0.52865333, 0.99981743, 0.90115369, 0.97698505, 0.81141256])

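According to the statistics of our revised model, the R-squared value did not change, which matches our earlier conclusion that Murder could be dropped. However, the cross-validation test is still not performing well. The holdout and fold scores are essentially identical to those of the original model, which suggests that Murder added no information beyond what the remaining features already carry.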
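One possible explanation for scores that do not move at all, stated as an assumption rather than something checked against the CSV here, is that ViolentCrime is the sum of the four component counts. In that case the full and revised feature sets span the same space and produce the same fitted values. The sketch below illustrates the effect on synthetic data. All variable names and the generated counts are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic component counts (not the FBI data).
murder = rng.poisson(2, n).astype(float)
rape = rng.poisson(5, n).astype(float)
robbery = rng.poisson(20, n).astype(float)
assault = rng.poisson(30, n).astype(float)

# Assumption being illustrated: the total is exactly the sum of its parts.
violent = murder + rape + robbery + assault

y = 150 + 10 * murder + 40 * rape + 3 * robbery + 2 * assault + rng.normal(0, 5, n)

def r_squared(features, target):
    """In-sample R-squared of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(target))] + features)
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    total = target - target.mean()
    return 1.0 - (resid @ resid) / (total @ total)

r2_full = r_squared([violent, murder, rape, robbery, assault], y)
r2_reduced = r_squared([violent, rape, robbery, assault], y)
print(r2_full, r2_reduced)
```

If the same relationship holds in the real data, improving the cross-validation results would require changing the feature set in a way that alters the fitted values, not removing a feature that the others already determine.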