# Challenge: Validating a Linear Regression

In this case, your goal is to achieve a model with a consistent R2 and only statistically significant parameters across multiple samples.
You'll need to validate it using some of the other crime datasets available at the FBI:UCR website.

Based on the results of your validation test, create a revised model, and then test both old and new models on a new holdout or set of folds.

Include your model(s) and a brief writeup of the reasoning behind the validation method you chose and the changes you made to submit and review with your mentor.

In [8]:
#Imports
import math
import warnings
import numpy as np
import pandas as pd

from sklearn import linear_model
import statsmodels.api as sm
import statsmodels.formula.api as smf

#Plotting
from IPython.display import display
from matplotlib import pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format


# Suppress annoying harmless error.
warnings.filterwarnings(
    action="ignore",
    module="scipy",
    message="^internal gelsd"
)



In [2]:
#Bring in Training Data
data = pd.read_csv('ny_offenses.csv')
data.head()

Unnamed: 0,City,Population,Violent crime,Murder,Rape,Robbery,Aggravated Assault,Property Crime,Burglary,Larceny-theft,Motor vehicle theft,Arson
0,Adams Village,1861,0,0,0,0,0,12,2,10,0,0.0
1,Addison Town and Village,2577,3,0,0,0,3,24,3,20,1,0.0
2,Akron Village,2846,3,0,0,0,3,16,1,15,0,0.0
3,Albany,97956,791,8,30,227,526,4090,705,3243,142,
4,Albion Village,6388,23,0,3,4,16,223,53,165,5,


In [3]:
#Clean Up Data
#Create Variables
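def convert_number(number):
    """Convert a string such as '1,234' to a float; return other values unchanged."""
    try:
        converted = float(number.replace(',', ''))
    except (AttributeError, ValueError):
        #Non-string values (already numeric or NaN) and non-numeric text are kept as-is
        converted = number

    return converted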
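
As a quick check of the helper above, the sketch below restates it in self-contained form and applies it to a few made-up values. The sample inputs are illustrative and are not taken from the dataset.

```python
def convert_number(number):
    """Convert a string such as '1,234' to a float; return other values unchanged."""
    try:
        return float(number.replace(',', ''))
    except (AttributeError, ValueError):
        # Non-strings have no .replace (AttributeError); non-numeric
        # strings cannot be parsed by float (ValueError).
        return number

# Hypothetical inputs: formatted strings, an int, a missing value, and text.
samples = ['1,234', '56', 789, None, 'n/a']
converted = [convert_number(value) for value in samples]
print(converted)  # [1234.0, 56.0, 789, None, 'n/a']
```

Values that cannot be converted pass through unchanged, so a column may still hold mixed types afterwards and is worth checking with `info()`.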

In [15]:
#Clean up the data
#Missing values are left as NaN here: a bare data.fillna(0) returns a copy and changes nothing.
#Convert the count columns to floats and store them under names without spaces or hyphens.
data['Population'] = data['Population'].apply(convert_number)
data['Murder'] = data['Murder'].apply(convert_number)
data['Robbery'] = data['Robbery'].apply(convert_number)
data['Property_Crime'] = data['Property Crime'].apply(convert_number)
data['Violent_crime'] = data['Violent crime'].apply(convert_number)
data['Arson'] = data['Arson'].apply(convert_number)
data['Motor_vehicle_theft'] = data['Motor vehicle theft'].apply(convert_number)
data['Larceny_theft'] = data['Larceny-theft'].apply(convert_number)
data['Burglary'] = data['Burglary'].apply(convert_number)
data['Aggravated_Assault'] = data['Aggravated Assault'].apply(convert_number)
data['Rape'] = data['Rape'].apply(convert_number)

In [16]:
#Limit the number of cities to deal with outliers
data = data[data['Population']<120000]

In [17]:
property_crime = data['Property_Crime']
population = data['Population']
violent_crime = data['Violent_crime']
robbery = data['Robbery']
burglary = data['Burglary']
motor_theft = data['Motor_vehicle_theft']
aggravated_assault = data['Aggravated_Assault']


In [11]:
#Stats Models
x = np.column_stack((population, violent_crime, robbery, burglary, motor_theft, aggravated_assault))
x = sm.add_constant(x, prepend=True)

results = sm.OLS(property_crime,x).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:         Property Crime   R-squared:                       0.935
Model:                            OLS   Adj. R-squared:                  0.934
Method:                 Least Squares   F-statistic:                     810.5
Date:                Mon, 26 Feb 2018   Prob (F-statistic):          1.79e-196
Time:                        12:59:23   Log-Likelihood:                -2150.4
No. Observations:                 343   AIC:                             4315.
Df Residuals:                     336   BIC:                             4342.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -26.8845      9.075     -2.962      0.0

In [29]:
X = data[['Population','Violent crime','Robbery','Burglary',
              'Motor vehicle theft','Aggravated_Assault']]

In [30]:
correlation_matrix = X.corr()
display(correlation_matrix)

Unnamed: 0,Population,Violent crime,Robbery,Burglary,Motor vehicle theft,Aggravated_Assault
Population,1.0,0.625,0.621,0.71,0.722,0.609
Violent crime,0.625,1.0,0.977,0.898,0.921,0.994
Robbery,0.621,0.977,1.0,0.858,0.935,0.95
Burglary,0.71,0.898,0.858,1.0,0.879,0.897
Motor vehicle theft,0.722,0.921,0.935,0.879,1.0,0.898
Aggravated_Assault,0.609,0.994,0.95,0.897,0.898,1.0


Removed Aggravated Assault so that all remaining parameters are statistically significant. It was also highly collinear with the other predictors (correlation of 0.994 with Violent crime).
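
One way to put a number on collinearity, beyond reading the correlation matrix, is the variance inflation factor (VIF). The sketch below is illustrative only: it uses synthetic predictors rather than the crime data, and relies on the identity that the VIF of predictor j is the j-th diagonal entry of the inverse of the predictor correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic predictors (assumed for illustration): x2 is almost a copy of x1,
# while x3 is independent of both.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
X = np.column_stack((x1, x2, x3))

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry of the
# inverse of the correlation matrix of the predictors.
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(['x1', 'x2', 'x3'], vif):
    print(f'{name}: VIF = {value:.1f}')
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that a predictor is largely explained by the others. Here `x1` and `x2` should land far above that range, while `x3` stays near 1.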

In [31]:
#Stats Models
x = np.column_stack((population, violent_crime, robbery, burglary, motor_theft))
x = sm.add_constant(x, prepend=True)

results = sm.OLS(property_crime,x).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:         Property_Crime   R-squared:                       0.935
Model:                            OLS   Adj. R-squared:                  0.934
Method:                 Least Squares   F-statistic:                     964.3
Date:                Mon, 26 Feb 2018   Prob (F-statistic):          3.34e-197
Time:                        13:40:45   Log-Likelihood:                -2152.2
No. Observations:                 343   AIC:                             4316.
Df Residuals:                     337   BIC:                             4339.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -26.6542      9.110     -2.926      0.0

#### Build Model

Statsmodels is used to inspect the parameters, while sklearn is used to actually train and validate the model.

In [40]:
#Sklearn

#GridSearchCV lives in sklearn.model_selection (sklearn.grid_search is deprecated)
from sklearn.model_selection import GridSearchCV

# Instantiate and fit our model.
regr = linear_model.LinearRegression()

parameters = {'normalize':[True,False]}

#Note: regression problems need a regression scoring method such as R2
grid = GridSearchCV(regr, parameters, scoring='r2', cv=5, verbose=0)


#Use the cleaned column names so the training and test features match
Y = data['Property_Crime'].values.reshape(-1, 1)
X = data[['Population','Violent_crime','Robbery','Burglary',
              'Motor_vehicle_theft']]

#Fit the Data
grid.fit(X, Y)

GridSearchCV(cv=5, error_score='raise',
       estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'normalize': [True, False]}, pre_dispatch='2*n_jobs',
       refit=True, scoring='r2', verbose=0)
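
The grid search reports a single best score, but judging whether R2 is consistent across samples is easier with the individual fold scores. The following is a minimal sketch on synthetic stand-in data, not the crime dataset; the coefficients, noise level, and random seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
n = 300

# Synthetic stand-in data (not the crime dataset): two predictors plus noise.
X = rng.normal(size=(n, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Shuffle before splitting so that any ordering in the rows
# does not end up concentrated in a single fold.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, scoring='r2', cv=folds)

print('R2 per fold:', np.round(scores, 3))
print('mean: %.3f, std: %.3f' % (scores.mean(), scores.std()))
```

A small standard deviation across folds suggests the fit does not depend heavily on which rows were held out. A large spread would point to a few influential rows or to overfitting.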

#### Test Model on New Data

In [92]:
test_data = pd.read_csv('ny_offenses_2014.csv')
test_data.head()

Unnamed: 0,City,Population,Violent crime,Murder,Robbery,Aggravated assault,Property crime,Burglary,Larceny-theft,Motor vehicle theft,Arson,Unnamed: 11
0,Adams Village,1851,0,0.0,0,0,11,1,10,0,0.0,
1,Addison Town and Village,2568,2,0.0,1,1,49,1,47,1,0.0,
2,Afton Village4,820,0,0.0,0,0,1,0,1,0,0.0,
3,Akron Village,2842,1,0.0,0,1,17,0,17,0,0.0,
4,Albany4,98595,802,8.0,237,503,3888,683,3083,122,12.0,


In [93]:
#Clean up the data
#Missing values are kept as NaN for now: a bare test_data.fillna(0) returns a copy and
#changes nothing, and the NaNs are what the dropna call below relies on.
test_data['Population'] = test_data['Population'].apply(convert_number)
test_data['Murder'] = test_data['Murder'].apply(convert_number)
test_data['Robbery'] = test_data['Robbery'].apply(convert_number)
test_data['Property_Crime'] = test_data['Property crime'].apply(convert_number)
test_data['Violent_crime'] = test_data['Violent crime'].apply(convert_number)
test_data['Arson'] = test_data['Arson'].apply(convert_number)
test_data['Motor_vehicle_theft'] = test_data['Motor vehicle theft'].apply(convert_number)
test_data['Larceny_theft'] = test_data['Larceny-theft'].apply(convert_number)
test_data['Burglary'] = test_data['Burglary'].apply(convert_number)
test_data['Aggravated_Assault'] = test_data['Aggravated assault'].apply(convert_number)
#Drop the unnamed column (it has fewer than 10 non-null values)
test_data.dropna(axis=1, thresh=10, inplace=True)
#Keep the first 369 rows; the rows after that hold text, not numeric NaNs
test_data = test_data[:369]
test_data.head()

Unnamed: 0,City,Population,Violent crime,Murder,Robbery,Aggravated assault,Property crime,Burglary,Larceny-theft,Motor vehicle theft,Arson,Property_Crime,Violent_crime,Motor_vehicle_theft,Larceny_theft,Aggravated_Assault
0,Adams Village,1851.0,0,0.0,0.0,0,11,1.0,10,0,0.0,11.0,0.0,0.0,10.0,0.0
1,Addison Town and Village,2568.0,2,0.0,1.0,1,49,1.0,47,1,0.0,49.0,2.0,1.0,47.0,1.0
2,Afton Village4,820.0,0,0.0,0.0,0,1,0.0,1,0,0.0,1.0,0.0,0.0,1.0,0.0
3,Akron Village,2842.0,1,0.0,0.0,1,17,0.0,17,0,0.0,17.0,1.0,0.0,17.0,1.0
4,Albany4,98595.0,802,8.0,237.0,503,3888,683.0,3083,122,12.0,3888.0,802.0,122.0,3083.0,503.0


In [94]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 369 entries, 0 to 368
Data columns (total 16 columns):
City                   369 non-null object
Population             369 non-null float64
Violent crime          369 non-null object
Murder                 369 non-null float64
Robbery                369 non-null float64
Aggravated assault     369 non-null object
Property crime         368 non-null object
Burglary               369 non-null float64
Larceny-theft          368 non-null object
Motor vehicle theft    369 non-null object
Arson                  365 non-null float64
Property_Crime         368 non-null float64
Violent_crime          369 non-null float64
Motor_vehicle_theft    369 non-null float64
Larceny_theft          368 non-null float64
Aggravated_Assault     369 non-null float64
dtypes: float64(10), object(6)
memory usage: 46.2+ KB


In [95]:
#Remove null from Property Crime
test_data=test_data[~test_data['Property_Crime'].isnull()]
#pd.isnull(test_data['Property_Crime'])

In [90]:
#Build the test target and features. Rows with a missing Property_Crime must be removed first.
Y = test_data['Property_Crime'].values.reshape(-1, 1)
X = test_data[['Population','Violent_crime','Robbery','Burglary',
              'Motor_vehicle_theft']]

In [91]:
#Get the R2 score. 
print(grid.score(X, Y))

0.999454571454




An R2 of 0.999 on the 2014 data shows that the model explains the new data very well.
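
For reference, with `scoring='r2'` the score is the coefficient of determination of the refit model's predictions on the data passed in. The sketch below computes it by hand on a few made-up numbers; `r_squared` is a hypothetical helper name, not part of sklearn.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Made-up numbers for illustration only.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
score = r_squared(y_true, y_pred)
print(round(score, 3))  # 0.948
```

Here the squared residuals sum to 26 and the total variation around the mean of 25 is 500, so the score is 1 - 26/500 = 0.948.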

### Fixing Multicollinearity

In [None]:
#PCA could be used to reduce the multicollinearity further.
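
A possible shape for that PCA step is sketched below as a principal component regression pipeline. It is an assumption about how the idea could be carried out, not the approach used in this notebook, and it runs on synthetic correlated predictors rather than the crime data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 400

# Synthetic stand-in data: three predictors driven by one shared factor,
# so they are strongly correlated with each other.
factor = rng.normal(size=n)
X = np.column_stack([factor + rng.normal(scale=0.2, size=n) for _ in range(3)])
y = 5.0 * factor + rng.normal(scale=1.0, size=n)

# Standardize, project onto the leading principal component, then regress.
pcr = make_pipeline(StandardScaler(), PCA(n_components=1), LinearRegression())
pcr.fit(X, y)

explained = pcr.named_steps['pca'].explained_variance_ratio_
print('variance explained by component 1: %.3f' % explained[0])
print('R2 on the training data: %.3f' % pcr.score(X, y))
```

The components are uncorrelated by construction, which removes the collinearity. The trade-off is that the coefficients then describe components instead of the original crime categories, so they are harder to interpret.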


### Write-Up

I used the results summary from statsmodels because it gives a good statistical analysis of the things I care about and learned in this module. From it, I could see that Aggravated Assault was not adding any value to my model. I also validated the summary approach using the method outlined in the course, and checked the coefficients against the sklearn linear model; they were very close.

Additionally, I used k-fold cross-validation within the grid search to build the model and select the best parameters, using R2 as the scoring method.

As the summary pointed out, there are some multicollinearity issues, but these could be addressed with PCA.