# Validating a linear regression model

In a previous lesson, a linear regression model was built to predict property crimes in New York State from population and various violent-crime counts. This model was trained on an [FBI dataset for 2013](https://ucr.fbi.gov/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/table-8/table-8-state-cuts/table_8_offenses_known_to_law_enforcement_new_york_by_city_2013.xls).

We will now take that trained model, test it on subsequent years of New York crime data, iterate if necessary, and ultimately validate a single model.

In [16]:
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn import linear_model
from sklearn.metrics import r2_score
%matplotlib inline

In [2]:
# Loading 2013 NY crime data from FBI as training set
train = pd.read_excel("table_8_offenses_known_to_law_enforcement_new_york_by_city_2013.xls", 
                    skiprows=[0,1,2,3], skipfooter=3)

# Loading 2014 NY crime data from FBI as test set
test = pd.read_excel('Table_8_Offenses_Known_to_Law_Enforcement_by_New_York_by_City_2014.xls',
                    skiprows=[0,1,2,3], skipfooter=5)


In [3]:
train.head(1)

Unnamed: 0,City,Population,Violent crime,Murder and nonnegligent manslaughter,Rape (revised definition)1,Rape (legacy definition)2,Robbery,Aggravated assault,Property crime,Burglary,Larceny- theft,Motor vehicle theft,Arson3
0,Adams Village,1861,0,0,,0,0,0,12,2,10,0,0.0


In [4]:
test.head(1)

Unnamed: 0,City,Population,Violent crime,Murder and nonnegligent manslaughter,Rape (revised definition)1,Rape (legacy definition)2,Robbery,Aggravated assault,Property crime,Burglary,Larceny- theft,Motor vehicle theft,Arson3
0,Adams Village,1851.0,0.0,0.0,,0.0,0.0,0.0,11.0,1.0,10.0,0.0,0.0


In [5]:
# Running through identical data cleaning steps for each dataset

train.columns = ['City', 'Population', 'ViolentCrime', 'MurderManslaughter', 'Rape_DROP', 'Rape', 'Robbery', 
               'AggAssault', 'PropertyCrime', 'Burglary', 'LarcenyTheft', 'MotorVehicleTheft', 'Arson']
test.columns = ['City', 'Population', 'ViolentCrime', 'MurderManslaughter', 'Rape_DROP', 'Rape', 'Robbery', 
               'AggAssault', 'PropertyCrime', 'Burglary', 'LarcenyTheft', 'MotorVehicleTheft', 'Arson']

train = train.drop(['Rape_DROP'], axis=1)
test = test.drop(['Rape_DROP'], axis=1)

train = train.fillna(0)
test = test.fillna(0)

train['Arson'] = train['Arson'].astype('int32')
test['Arson'] = test['Arson'].astype('int32')
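
Because the same cleaning steps are applied to both years, they could be wrapped in one helper so the two datasets cannot drift apart. The sketch below is one possible way to do that, not the notebook's own code: `clean_crime_table`, `COLUMNS`, and the one-row example frame are hypothetical names and made-up data used only for illustration.

```python
import pandas as pd

# Column names copied from the cleaning cell above.
COLUMNS = ['City', 'Population', 'ViolentCrime', 'MurderManslaughter', 'Rape_DROP', 'Rape', 'Robbery',
           'AggAssault', 'PropertyCrime', 'Burglary', 'LarcenyTheft', 'MotorVehicleTheft', 'Arson']


def clean_crime_table(df):
    """Apply the notebook's cleaning steps to one 13-column crime table (hypothetical helper)."""
    out = df.copy()                                # leave the caller's frame untouched
    out.columns = COLUMNS                          # rename to short identifiers
    out = out.drop(columns=['Rape_DROP'])          # drop the unused rape-definition column
    out = out.fillna(0)                            # treat missing counts as zero, as above
    out['Arson'] = out['Arson'].astype('int32')    # arson arrives as float because of NaNs
    return out


# Made-up single row with the same 13-column layout, for illustration only.
raw = pd.DataFrame([['Example Village', 1000, 1, 0, float('nan'), 0, 1, 0, 10, 2, 8, 0, float('nan')]])
cleaned = clean_crime_table(raw)
print(cleaned.shape, cleaned['Arson'].dtype)
```

With such a helper, `train = clean_crime_table(train)` and `test = clean_crime_table(test)` would replace the paired statements above.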

# Feature Engineering & Model Instantiation


The model takes the form:

$$PropertyCrime = \alpha + \beta_1LargeCity + \beta_2Population + \beta_3RobberySqrt$$


The following cells will perform the necessary feature engineering, instantiate the model, and confirm that the parameters are identical to previous work.

In [7]:
# Vectorized square root of the robbery counts
train['Robbery_sqrt'] = np.sqrt(train['Robbery'])
test['Robbery_sqrt'] = np.sqrt(test['Robbery'])

In [8]:
def large_cities(x):
    """Return 1 if the population exceeds 120,000, else 0 (binary indicator)."""
    return 1 if x > 120000 else 0
    
train['LargeCity'] = train['Population'].apply(large_cities)
test['LargeCity'] = test['Population'].apply(large_cities)

In [10]:
# Instantiating the model

model_formula = 'PropertyCrime ~ LargeCity + Population + Robbery_sqrt'
smf_model = smf.ols(formula=model_formula, data=train).fit()

print('Model 1 Results')
print('Coef:\n', smf_model.params, '\n')
print('p-values:\n', smf_model.pvalues, '\n')
print('R^2:\n', smf_model.rsquared, '\n')


Model 1 Results
Coef:
 Intercept       -108.391760
LargeCity       1750.666246
Population         0.014713
Robbery_sqrt     121.124107
dtype: float64 

p-values:
 Intercept        6.387808e-06
LargeCity        1.241263e-10
Population      2.695147e-280
Robbery_sqrt     1.834173e-38
dtype: float64 

R^2:
 0.9976399654632164 



**Model results match the previous runs**
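
To make the fitted coefficients concrete, the prediction for a single city can be computed by hand. The sketch below hard-codes the coefficients as they are printed above (so they are rounded) and applies them to the Adams Village row shown by `train.head(1)`; `predict_property_crime` is a hypothetical helper, not part of the original notebook.

```python
import math

# Coefficients as printed in the Model 1 output above (rounded as displayed).
INTERCEPT = -108.391760
COEF_LARGE_CITY = 1750.666246
COEF_POPULATION = 0.014713
COEF_ROBBERY_SQRT = 121.124107


def predict_property_crime(population, robbery, large_city_threshold=120000):
    """Hand-compute the fitted linear prediction for one city (hypothetical helper)."""
    large_city = 1 if population > large_city_threshold else 0
    return (INTERCEPT
            + COEF_LARGE_CITY * large_city
            + COEF_POPULATION * population
            + COEF_ROBBERY_SQRT * math.sqrt(robbery))


# Adams Village, 2013: population 1861, 0 robberies (from train.head(1)).
adams = predict_property_crime(population=1861, robbery=0)
print(round(adams, 2))
```

For this row the prediction comes out at roughly -81, against an observed value of 12. A high overall R^2 therefore does not rule out implausible (here negative) predictions for very small towns.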

In [14]:
# Instantiating same model in sklearn

X_train = train[['LargeCity', 'Population', 'Robbery_sqrt']]
y_train = train['PropertyCrime']

lr_model = linear_model.LinearRegression()
lr_model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [19]:
# Testing model on the 2014 data

X_test = test[['LargeCity', 'Population', 'Robbery_sqrt']]
y_test = test['PropertyCrime']

y_pred = lr_model.predict(X_test)

print('R^2:', r2_score(y_test, y_pred))

R^2: 0.9951285328152686
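
For reference, the score reported by `r2_score` is the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$, with the mean taken over the true values passed in (here the 2014 targets). The sketch below reproduces that formula in plain Python on made-up numbers; `r_squared` and the demo lists are hypothetical and are not the crime data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation around the mean
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
y_true_demo = [10.0, 20.0, 30.0, 40.0]
y_pred_demo = [12.0, 18.0, 33.0, 37.0]

print(r_squared(y_true_demo, y_pred_demo))
```

Because the baseline is the mean of the held-out targets, this score can be negative on test data when a model predicts worse than that mean, which is not possible for an ordinary least squares fit with an intercept scored on its own training data.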
