### Validating regression models for prediction
Statistical tests are useful for checking that a model fits the data it was trained on and that every feature contributes to the model. They do not, however, show how well the model predicts new data. To establish predictive validity, you need to assess the model's performance on data it has not seen.

The procedure is the same as in the Naive Bayes lesson: both the holdout method and cross-validation are available. You've already written code to run these kinds of validation for Naive Bayes; now you can try it again with linear regression. In this case, your goal is a model with a consistent $R^2$ and only statistically significant parameters across multiple samples.

We'll use the property crime model you've been working on, based on the FBI:UCR data. Since your model formulation to date has used the entire New York State 2013 dataset, you'll need to validate it using some of the other crime datasets available at the FBI:UCR website. Options include other states' crime rates in 2013, crime rates in New York State in other years, or a combination of these.

### Iterate
Based on the results of your validation test, create a revised model, and then test both old and new models on a new holdout or set of folds.
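
One way to compare an old and a revised model fairly is to score both on exactly the same folds. The sketch below does this with synthetic data; the variable names (`X_old`, `X_new`, `folds`) and the choice of a squared term as the "revision" are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data with a genuine quadratic component (illustrative assumption).
rng = np.random.default_rng(1)
n = 400
x = rng.uniform(0, 10, size=n)
y = 3.0 * x + 0.5 * x**2 + rng.normal(scale=2.0, size=n)

X_old = x.reshape(-1, 1)            # "old" model: linear term only
X_new = np.column_stack([x, x**2])  # "revised" model: adds a squared term

# A KFold with a fixed random_state yields the same splits on every call,
# so both models are evaluated on identical train/test partitions.
folds = KFold(n_splits=5, shuffle=True, random_state=42)
old_scores = cross_val_score(LinearRegression(), X_old, y, cv=folds)
new_scores = cross_val_score(LinearRegression(), X_new, y, cv=folds)

print("old model R^2 per fold:", np.round(old_scores, 3))
print("new model R^2 per fold:", np.round(new_scores, 3))
print("mean improvement:", round(new_scores.mean() - old_scores.mean(), 4))
```

Reusing one splitter object keeps differences in score attributable to the models rather than to which rows happened to land in each fold.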

Include your model(s) and a brief writeup of the reasoning behind the validation method you chose and the changes you made to submit and review with your mentor.

In [1]:
import math
import warnings

from IPython.display import display
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import linear_model
import statsmodels.formula.api as smf

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

# Suppress a harmless SciPy warning about the internal gelsd driver.
warnings.filterwarnings(
    action="ignore",
    module="scipy",
    message="^internal gelsd"
)

In [2]:
import xlrd
df=pd.read_excel('Lesson5-2.xlsx',
                 sheet_name='13tbl8ga')

In [3]:
df.head()

Unnamed: 0,City,Population,Violent crime,Murder and nonnegligent manslaughter,Rape (revised definition),Rape (legacy definition),Robbery-Count,Aggravated assault,Property crime,Burglary,Larceny-theft,Motor vehicle theft,Arson
0,Abbeville,2888,3,0,,0,2,1,22,3,16,3,0.0
1,Adairsville,4686,13,0,,0,1,12,52,15,31,6,0.0
2,Adel,5240,18,0,,5,5,8,189,64,121,4,0.0
3,Adrian,656,0,0,,0,0,0,2,0,2,0,0.0
4,Alapaha,646,5,0,,0,1,4,6,0,6,0,0.0


In [4]:
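# Keep only the columns used by the model; .copy() avoids SettingWithCopyWarning
# when new columns are added to this subset later.
df=df[['City','Population','Robbery-Count','Murder and nonnegligent manslaughter','Property crime']].copy()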

In [5]:
df.head()

Unnamed: 0,City,Population,Robbery-Count,Murder and nonnegligent manslaughter,Property crime
0,Abbeville,2888,2,0,22
1,Adairsville,4686,1,0,52
2,Adel,5240,5,0,189
3,Adrian,656,0,0,2
4,Alapaha,646,1,0,6


In [6]:
df['Population_Squared']=None
df['Murder']=1
df['Robbery']=1

In [7]:
df.head()

Unnamed: 0,City,Population,Robbery-Count,Murder and nonnegligent manslaughter,Property crime,Population_Squared,Murder,Robbery
0,Abbeville,2888,2,0,22,,1,1
1,Adairsville,4686,1,0,52,,1,1
2,Adel,5240,5,0,189,,1,1
3,Adrian,656,0,0,2,,1,1
4,Alapaha,646,1,0,6,,1,1


In [8]:
df['Population_Squared']=df['Population']*df['Population']
df.loc[df['Murder and nonnegligent manslaughter']==0,'Murder']=0
df.loc[df['Robbery-Count']==0,'Robbery']=0

In [9]:
df.head()

Unnamed: 0,City,Population,Robbery-Count,Murder and nonnegligent manslaughter,Property crime,Population_Squared,Murder,Robbery
0,Abbeville,2888,2,0,22,8340544,0,1
1,Adairsville,4686,1,0,52,21958596,0,1
2,Adel,5240,5,0,189,27457600,0,1
3,Adrian,656,0,0,2,430336,0,0
4,Alapaha,646,1,0,6,417316,0,1
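
The engineered columns shown above could also be built in a single step each, without first filling them with placeholder values. The sketch below is one possible alternative, applied to the five rows printed by `df.head()`; the `sample` frame is an assumption introduced so that the example runs on its own.

```python
import pandas as pd

# The five rows shown by df.head() above, re-entered by hand so that this
# example is self-contained (the full dataset comes from Lesson5-2.xlsx).
sample = pd.DataFrame({
    "City": ["Abbeville", "Adairsville", "Adel", "Adrian", "Alapaha"],
    "Population": [2888, 4686, 5240, 656, 646],
    "Robbery-Count": [2, 1, 5, 0, 1],
    "Murder and nonnegligent manslaughter": [0, 0, 0, 0, 0],
    "Property crime": [22, 52, 189, 2, 6],
})

# Squared population as a single vectorized expression.
sample["Population_Squared"] = sample["Population"] ** 2

# Indicator columns: 1 where the count is non-zero, 0 where it is zero.
# Comparing with != 0 mirrors the original "start at 1, zero out the zeros" logic.
sample["Murder"] = (sample["Murder and nonnegligent manslaughter"] != 0).astype(int)
sample["Robbery"] = (sample["Robbery-Count"] != 0).astype(int)

print(sample[["City", "Population_Squared", "Murder", "Robbery"]])
```

The resulting values match the `df.head()` output above for these rows.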


In [10]:
df.isnull().sum()

City                                    0
Population                              0
Robbery-Count                           0
Murder and nonnegligent manslaughter    0
Property crime                          0
Population_Squared                      0
Murder                                  0
Robbery                                 0
dtype: int64

## Holdout Group

### 1st Model

In [11]:
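# Features for the 1st model: population, its square, and the two crime indicators.
data1 = df[['Population','Population_Squared','Murder','Robbery']]
target = df['Property crime'].values.reshape(-1,1)

# Fit on the full dataset; y_pred holds the in-sample predictions.
# linear_model was already imported in the first cell.
regr=linear_model.LinearRegression()
y_pred = regr.fit(data1, target).predict(data1)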


In [12]:
# Test your model with different holdout groups.

from sklearn.model_selection import train_test_split
# Use train_test_split to create the necessary training and test groups
X_train, X_test, y_train, y_test = train_test_split(data1, target, test_size=0.5, random_state=20)
print('With 50% Holdout (R^2): ' + str(regr.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample (R^2): ' + str(regr.fit(data1, target).score(data1, target)))

With 50% Holdout (R^2): 0.6522427466823812
Testing on Sample (R^2): 0.9408621731579462
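
A single 50% split depends heavily on which rows land in the test half. To test the model with several different holdout groups, one option is to repeat the split with different seeds and look at the spread of scores. The sketch below uses synthetic data, and the helper `holdout_scores` is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption; substitute data1 and target in the notebook).
rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.5, size=n)


def holdout_scores(X, y, seeds, test_size=0.5):
    """Hypothetical helper: held-out R^2 for one train/test split per seed."""
    scores = []
    for seed in seeds:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        # Fit a fresh estimator on each training half, score on the other half.
        model = LinearRegression().fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    return np.array(scores)


scores = holdout_scores(X, y, seeds=range(10))
print("R^2 per holdout:", np.round(scores, 3))
print("mean:", round(scores.mean(), 3), "std:", round(scores.std(ddof=1), 3))
```

A large standard deviation across seeds suggests that any single holdout score, including the one reported above, should be read with caution.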


### 2nd Model

In [13]:
data2 = df[['Population','Murder','Robbery']]

In [14]:
X_train, X_test, y_train, y_test = train_test_split(data2, target, test_size=0.5, random_state=20)
print('With 50% Holdout (R^2): ' + str(regr.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample (R^2): ' + str(regr.fit(data2, target).score(data2, target)))

With 50% Holdout (R^2): 0.6363788753591277
Testing on Sample (R^2): 0.9118989441934389


## Cross Validation

### 1st Model

In [15]:
from sklearn.model_selection import cross_val_score
cross_val_score(regr, data1, target, cv=10)

array([0.97795542, 0.72254198, 0.87297665, 0.61844714, 0.82935473,
       0.57229756, 0.80974488, 0.37703676, 0.82353561, 0.84443674])
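
Ten separate scores are hard to compare at a glance, so it can help to reduce them to a few summary numbers. The sketch below does this for the fold scores printed above; the `summary` dictionary and its keys are illustrative choices, not part of the lesson.

```python
import numpy as np

# Fold scores copied from the cross_val_score output above (1st model, cv=10).
fold_r2 = np.array([
    0.97795542, 0.72254198, 0.87297665, 0.61844714, 0.82935473,
    0.57229756, 0.80974488, 0.37703676, 0.82353561, 0.84443674,
])

summary = {
    "mean": fold_r2.mean(),
    "std": fold_r2.std(ddof=1),            # sample standard deviation across folds
    "min": fold_r2.min(),
    "max": fold_r2.max(),
    "worst_fold": int(fold_r2.argmin()),   # zero-based index of the weakest fold
}

for key, value in summary.items():
    print(f"{key}: {value:.3f}" if isinstance(value, float) else f"{key}: {value}")
```

The spread between the best and worst folds is a direct measure of how consistent the model's $R^2$ is across samples.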

### 2nd Model

In [16]:
from sklearn.model_selection import cross_val_score
# Score the 2nd model, so pass data2 (Population, Murder, Robbery) rather than data1.
cross_val_score(regr, data2, target, cv=10)