## OLS Regression vs. KNN Regression

This is a Thinkful assignment comparing OLS regression with KNN regression. The data set is the FBI crime data set from the previous lesson, and the OLS regression model also comes from the previous assignment. Here we fit a KNN regression model and compare the two regression models. The link to the data set is below.
https://ucr.fbi.gov/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/table-8/table-8-state-cuts/table_8_offenses_known_to_law_enforcement_new_york_by_city_2013.xls

In [1]:
import math
import warnings

from IPython.display import display
from matplotlib import pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf

from sklearn import neighbors
from sklearn import linear_model

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

## The code below is from the previous assignment

In [2]:
# The file has extra header and footer rows, so skip them and import only the data.
# skipfooter is only supported by the Python parser engine, so request it explicitly.
df = pd.read_csv('https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/master/New_York_offenses/NEW_YORK-Offenses_Known_to_Law_Enforcement_by_City_2013%20-%2013tbl8ny.csv',
                 skiprows=4, skipfooter=3, header=0, na_values='nan', engine='python')



### Data Cleaning

In [3]:
# Let's see what the imported data set looks like.
df.head(10)

Unnamed: 0,City,Population,Violent crime,Murder and nonnegligent manslaughter,Rape (revised definition)1,Rape (legacy definition)2,Robbery,Aggravated assault,Property crime,Burglary,Larceny- theft,Motor vehicle theft,Arson3
0,Adams Village,1861,0,0,,0,0,0,12,2,10,0,0.0
1,Addison Town and Village,2577,3,0,,0,0,3,24,3,20,1,0.0
2,Akron Village,2846,3,0,,0,0,3,16,1,15,0,0.0
3,Albany,97956,791,8,,30,227,526,4090,705,3243,142,
4,Albion Village,6388,23,0,,3,4,16,223,53,165,5,
5,Alfred Village,4089,5,0,,0,3,2,46,10,36,0,
6,Allegany Village,1781,3,0,,0,0,3,10,0,10,0,0.0
7,Amherst Town,118296,107,1,,7,31,68,2118,204,1882,32,3.0
8,Amityville Village,9519,9,0,,2,4,3,210,16,188,6,1.0
9,Amsterdam,18182,30,0,,0,12,18,405,99,291,15,0.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 348 entries, 0 to 347
Data columns (total 13 columns):
City                                    348 non-null object
Population                              348 non-null object
Violent
crime                           348 non-null object
Murder and
nonnegligent
manslaughter    348 non-null int64
Rape
(revised
definition)1              0 non-null float64
Rape
(legacy
definition)2               348 non-null object
Robbery                                 348 non-null object
Aggravated
assault                      348 non-null object
Property
crime                          348 non-null object
Burglary                                348 non-null object
Larceny-
theft                          348 non-null object
Motor
vehicle
theft                     348 non-null object
Arson3                                  187 non-null float64
dtypes: float64(2), int64(1), object(10)
memory usage: 35.4+ KB


There are some null values in our data set, for example in the "Rape (revised definition)1" and "Arson3" columns. We will look at the data set description later and clean up the null values. From df.info, we also see that some of the column names are long and contain line breaks, which makes them inconvenient to type when building the model. Therefore, let's replace the long column names with shorter ones.

In [5]:
#List all the column names. 
list(df)

['City',
 'Population',
 'Violent\ncrime',
 'Murder and\nnonnegligent\nmanslaughter',
 'Rape\n(revised\ndefinition)1',
 'Rape\n(legacy\ndefinition)2',
 'Robbery',
 'Aggravated\nassault',
 'Property\ncrime',
 'Burglary',
 'Larceny-\ntheft',
 'Motor\nvehicle\ntheft',
 'Arson3']

In [6]:
# Some column names contain line breaks; rename them to short, clean names.
df.rename(columns={
    'Violent\ncrime': 'ViolentCrime',
    'Murder and\nnonnegligent\nmanslaughter': 'Murder',
    'Rape\n(revised\ndefinition)1': 'Rape_1',
    'Rape\n(legacy\ndefinition)2': 'Rape_2',
    'Aggravated\nassault': 'AggravatedAssault',
    'Property\ncrime': 'PropertyCrime',
    'Larceny-\ntheft': 'LarcenyTheft',
    'Motor\nvehicle\ntheft': 'MotorVehicleTheft',
}, inplace=True)

In [7]:
#Let's check our column names again. 
list(df)

['City',
 'Population',
 'ViolentCrime',
 'Murder',
 'Rape_1',
 'Rape_2',
 'Robbery',
 'AggravatedAssault',
 'PropertyCrime',
 'Burglary',
 'LarcenyTheft',
 'MotorVehicleTheft',
 'Arson3']
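
As an aside, the line breaks in the column names can also be removed programmatically instead of listing every name by hand. The sketch below uses a small empty DataFrame (`toy`, a made-up stand-in, not the real crime table) to show the idea; it only replaces the newline characters and does not produce the short names chosen above.

```python
import pandas as pd

# Hypothetical stand-in with a few of the awkward column names.
toy = pd.DataFrame(columns=['City', 'Violent\ncrime', 'Larceny-\ntheft'])

# Replace the literal newline in every column name with a space.
toy.columns = toy.columns.str.replace('\n', ' ', regex=False)

print(list(toy.columns))
```

An explicit rename, as done in this notebook, is still the better choice when the target names need to be short identifiers.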

From the earlier df.info output, we can see that the entire "Rape (revised definition)1" column is null. The data declaration does not explain why the whole column is empty. Therefore, we will ignore this column while building our regression models.

The "Arson3" column has some null values. The data declaration states that "Arson is not included in the property crime total in this table; however, if complete arson data were provided, they will appear in the arson column" and that "The FBI does not publish arson data unless it receives data from either the agency or the state for all 12 months of the calendar year." A null arson value therefore means that the agency or state did not provide arson data to the FBI. However, I would like to see the difference between models with and without arson data, so for practice purposes the null arson values will be replaced with 0.

Also, df.info shows that only 3 columns contain floats or ints. The rest are objects, which means some of the numbers are stored as str. This will cause trouble for our models, so we need to fix it as well.

In [8]:
# Strip thousands separators and convert the count columns from str to numbers.
count_columns = ['Population', 'ViolentCrime', 'Rape_2', 'Robbery', 'AggravatedAssault',
                 'PropertyCrime', 'Burglary', 'LarcenyTheft', 'MotorVehicleTheft']
for col in count_columns:
    df[col] = pd.to_numeric(df[col].str.replace(',', ''))

# Missing arson counts are treated as 0 for practice purposes.
df['Arson3_fillna'] = df['Arson3'].fillna(0)
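
As an alternative to stripping commas after loading, pandas can parse thousands separators at read time through the `thousands` parameter of `read_csv`. The sketch below runs on a tiny inline CSV (`sample_csv`, a made-up sample that borrows a few values from the table above); whether it behaves the same on the full FBI file is an assumption that would need checking.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics the comma-formatted counts.
sample_csv = io.StringIO(
    'City,Population,PropertyCrime\n'
    'Albany,"97,956","4,090"\n'
    'Adams Village,"1,861",12\n'
)

# thousands=',' tells the parser to treat commas inside numbers as separators.
parsed = pd.read_csv(sample_csv, thousands=',')

print(parsed.dtypes)
print(parsed)
```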

### OLS Regression Model

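We have cleaned our data and looked at our features. Next, we build the regression models: the OLS model from the previous assignment, plus unweighted and distance-weighted KNN models for comparison.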

In [9]:
Y = df['PropertyCrime']
X = df[['Population','ViolentCrime','Murder','Rape_2','Robbery','AggravatedAssault']]

regr = linear_model.LinearRegression()
regr.fit(X, Y)

knn = neighbors.KNeighborsRegressor(n_neighbors=10)
knn.fit(X, Y)

knn_w = neighbors.KNeighborsRegressor(n_neighbors=10, weights='distance')
knn_w.fit(X, Y)

# Inspect the results.
print('\nThe R-squared for OLS:')
print(regr.score(X, Y))

print('\nThe R-squared for KNN:')
print(knn.score(X, Y))

print('\nThe R-squared for Weighted KNN:')
print(knn_w.score(X, Y))


The R-squared for OLS:
0.9990024878038867

The R-squared for KNN:
0.24299855717902097

The R-squared for Weighted KNN:
1.0


### Crime per Capita
After trying these two models, let's try one more thing. The data contain several outliers: a few cities with very high property crime counts. Those cities usually have large populations as well. However, we cannot simply remove these outliers, because they are not "bad" data points. Therefore, I would like to create some new features and construct a new model based on crime per capita instead of total crime counts. The target variable also changes to property crime per capita to match.

In [10]:
# Divide each crime count by population to get per-capita rates.
crime_columns = ['ViolentCrime', 'Murder', 'Rape_2', 'Robbery', 'AggravatedAssault',
                 'PropertyCrime', 'Burglary', 'LarcenyTheft', 'MotorVehicleTheft']
for col in crime_columns:
    df[col + '_PC'] = df[col] / df['Population']

In [11]:
Y_PC = df['PropertyCrime_PC']
X_PC = df[['ViolentCrime_PC','Murder_PC','Rape_2_PC','Robbery_PC','AggravatedAssault_PC']]

regr_PC = linear_model.LinearRegression()
regr_PC.fit(X_PC, Y_PC)

knn_PC = neighbors.KNeighborsRegressor(n_neighbors=10)
knn_PC.fit(X_PC, Y_PC)

knn_PC_w = neighbors.KNeighborsRegressor(n_neighbors=10, weights='distance')
knn_PC_w.fit(X_PC, Y_PC)

print('\nThe R-squared for OLS:')
print(regr_PC.score(X_PC, Y_PC))

print('\nThe R-squared for KNN:')
print(knn_PC.score(X_PC, Y_PC))

print('\nThe R-squared for Weighted KNN:')
print(knn_PC_w.score(X_PC, Y_PC))


The R-squared for OLS:
0.32521132383967044

The R-squared for KNN:
0.4468998197108943

The R-squared for Weighted KNN:
0.9282738052156343


In [12]:
from sklearn import preprocessing

normalized_X = preprocessing.normalize(X)
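
It may help to check what `preprocessing.normalize` does compared with `preprocessing.scale`, since the two are easy to confuse. By default, `normalize` rescales each row (sample) to unit L2 norm, while `scale` standardizes each column (feature) to mean 0 and standard deviation 1. The sketch below uses a made-up two-row matrix `toy`, not the crime data.

```python
import numpy as np
from sklearn import preprocessing

# Toy matrix: the second sample is exactly 10x the first.
toy = np.array([[3.0, 4.0],
                [30.0, 40.0]])

row_normalized = preprocessing.normalize(toy)  # each ROW scaled to unit L2 norm
col_standardized = preprocessing.scale(toy)    # each COLUMN to mean 0, std 1

print(row_normalized)    # both rows become identical: magnitude is discarded
print(col_standardized)  # the two samples remain distinguishable
```

After row normalization the small and the large sample look the same. For count data like this, where the size of a city carries much of the signal, that is a plausible reason for the normalized models to score poorly.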

In [13]:
normalized_regr = linear_model.LinearRegression()
normalized_regr.fit(normalized_X, Y)

normalized_knn = neighbors.KNeighborsRegressor(n_neighbors=10)
normalized_knn.fit(normalized_X, Y)

normalized_knn_w = neighbors.KNeighborsRegressor(n_neighbors=10, weights='distance')
normalized_knn_w.fit(normalized_X, Y)

# Inspect the results.
print('\nThe R-squared for Normalized_OLS:')
print(normalized_regr.score(normalized_X, Y))

print('\nThe R-squared for Normalized KNN:')
print(normalized_knn.score(normalized_X, Y))

print('\nThe R-squared for Normalized Weighted KNN:')
print(normalized_knn_w.score(normalized_X, Y))


The R-squared for Normalized_OLS:
0.06976113240000348

The R-squared for Normalized KNN:
0.11316559553630368

The R-squared for Normalized Weighted KNN:
0.9999987885457838


In [14]:
standard_X = preprocessing.scale(X)

In [15]:
standard_regr = linear_model.LinearRegression()
standard_regr.fit(standard_X, Y)

standard_knn = neighbors.KNeighborsRegressor(n_neighbors=10)
standard_knn.fit(standard_X, Y)

standard_knn_w = neighbors.KNeighborsRegressor(n_neighbors=10, weights='distance')
standard_knn_w.fit(standard_X, Y)

# Inspect the results.
print('\nThe R-squared for Standardized OLS:')
print(standard_regr.score(standard_X, Y))

print('\nThe R-squared for Standardized KNN:')
print(standard_knn.score(standard_X, Y))

print('\nThe R-squared for Standardized Weighted KNN:')
print(standard_knn_w.score(standard_X, Y))


The R-squared for Standardized OLS:
0.9990024878038867

The R-squared for Standardized KNN:
0.2447082787584106

The R-squared for Standardized Weighted KNN:
1.0


In [16]:
normalized_X_PC = preprocessing.normalize(X_PC)

In [17]:
normalized_regr_PC = linear_model.LinearRegression()
normalized_regr_PC.fit(normalized_X_PC, Y_PC)

normalized_knn_PC = neighbors.KNeighborsRegressor(n_neighbors=10)
normalized_knn_PC.fit(normalized_X_PC, Y_PC)

normalized_knn_PC_w = neighbors.KNeighborsRegressor(n_neighbors=10, weights='distance')
normalized_knn_PC_w.fit(normalized_X_PC, Y_PC)

# Inspect the results.
print('\nThe R-squared for Normalized_OLS:')
print(normalized_regr_PC.score(normalized_X_PC, Y_PC))

print('\nThe R-squared for Normalized KNN:')
print(normalized_knn_PC.score(normalized_X_PC, Y_PC))

print('\nThe R-squared for Normalized Weighted KNN:')
print(normalized_knn_PC_w.score(normalized_X_PC, Y_PC))


The R-squared for Normalized_OLS:
0.14690142117355665

The R-squared for Normalized KNN:
0.27032376440990813

The R-squared for Normalized Weighted KNN:
0.7766677774072643


In [18]:
standard_X_PC = preprocessing.scale(X_PC)

In [19]:
standard_regr_PC = linear_model.LinearRegression()
standard_regr_PC.fit(standard_X_PC, Y_PC)

standard_knn_PC = neighbors.KNeighborsRegressor(n_neighbors=10)
standard_knn_PC.fit(standard_X_PC, Y_PC)

standard_knn_PC_w = neighbors.KNeighborsRegressor(n_neighbors=10, weights='distance')
standard_knn_PC_w.fit(standard_X_PC, Y_PC)

# Inspect the results.
print('\nThe R-squared for Standardized OLS:')
print(standard_regr_PC.score(standard_X_PC, Y_PC))

print('\nThe R-squared for Standardized KNN:')
print(standard_knn_PC.score(standard_X_PC, Y_PC))

print('\nThe R-squared for Standardized Weighted KNN:')
print(standard_knn_PC_w.score(standard_X_PC, Y_PC))


The R-squared for Standardized OLS:
0.32521132383967044

The R-squared for Standardized KNN:
0.4428539838177446

The R-squared for Standardized Weighted KNN:
0.9281146467279509


In [20]:
from sklearn.model_selection import cross_val_score

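# For regressors, cross_val_score reports R-squared on each held-out fold,
# so the "Accuracy" label in the printed output is really a cross-validated R-squared.
score_w = cross_val_score(knn_w, X, Y, cv=5)
print("Weighted Accuracy: %0.2f (+/- %0.2f)" % (score_w.mean(), score_w.std() * 2))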

score_w_PC = cross_val_score(knn_PC_w, X_PC, Y_PC, cv=5)
print("Weighted Accuracy for Crime per Capita: %0.2f (+/- %0.2f)" % (score_w_PC.mean(), score_w_PC.std() * 2))

score_w_normalized = cross_val_score(normalized_knn_w, normalized_X, Y, cv=5)
print("Weighted Accuracy for Normalized KNN: %0.2f (+/- %0.2f)" % (score_w_normalized.mean(), score_w_normalized.std() * 2))

score_w_standard = cross_val_score(standard_knn_w, standard_X, Y, cv=5)
print("Weighted Accuracy for Standardized KNN: %0.2f (+/- %0.2f)" % (score_w_standard.mean(), score_w_standard.std() * 2))

score_w_normalized_PC = cross_val_score(normalized_knn_PC_w, normalized_X_PC, Y_PC, cv=5)
print("Weighted Accuracy for Normalized Per Capita KNN: %0.2f (+/- %0.2f)" % (score_w_normalized_PC.mean(), score_w_normalized_PC.std() * 2))

score_w_standard_PC = cross_val_score(standard_knn_PC_w, standard_X_PC, Y_PC, cv=5)
print("Weighted Accuracy for Standardized Per Capita KNN: %0.2f (+/- %0.2f)" % (score_w_standard_PC.mean(), score_w_standard_PC.std() * 2))

Weighted Accuracy: 0.46 (+/- 0.49)
Weighted Accuracy for Crime per Capita: 0.29 (+/- 0.18)
Weighted Accuracy for Normalized KNN: -44.82 (+/- 158.33)
Weighted Accuracy for Standardized KNN: 0.55 (+/- 0.53)
Weighted Accuracy for Normalized Per Capita KNN: 0.03 (+/- 0.18)
Weighted Accuracy for Standardized Per Capita KNN: 0.30 (+/- 0.17)
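
One caveat about the cross-validation above: the standardized features were computed from the full data set before splitting, so each fold's scaling has already seen the held-out rows. A common way to avoid this is to put the scaler and the model in a pipeline, so the scaler is refitted on the training part of every fold. The sketch below shows the pattern on synthetic data (`X_demo` and `y_demo` are made up for illustration, and the resulting score says nothing about the crime data).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales; the target depends on the first one.
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 3)) * np.array([1.0, 100.0, 10000.0])
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 0.5, size=200)

# The scaler is fitted inside each training fold, never on the held-out fold.
pipe = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=10, weights='distance'),
)

scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print("Cross-validated R-squared: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```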


### Written
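We tried several models above: fitting the original data set, dividing all crime counts by population to get crime per capita, normalizing the data, and standardizing the data. On the original data set, without standardizing or normalizing, the OLS model and the weighted KNN model have much higher R-squared values than the unweighted KNN model. However, those R-squared values are computed on the same data the models were fitted on. With distance weighting, each training point is its own nearest neighbor at distance zero, so unless several rows share identical features the model simply reproduces the training targets, and a training R-squared of 1.0 is expected rather than informative. The cross-validation score of the weighted KNN model tells a different story: the mean R-squared (labeled "Weighted Accuracy" in the output) is only 0.46, with a spread of +/- 0.49 (two standard deviations across folds), which means the perfect R-squared of the original weighted model reflects overfitting.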
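
The gap between training R-squared and cross-validated R-squared for distance-weighted KNN can be reproduced on synthetic data. In the sketch below the target is pure noise, unrelated to the features (`X_noise` and `y_noise` are made up, not the crime data), yet the training score is still essentially perfect because every training point is its own zero-distance neighbor.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(42)
X_noise = rng.normal(size=(100, 3))
y_noise = rng.normal(size=100)  # target has no relationship to the features

knn_demo = KNeighborsRegressor(n_neighbors=10, weights='distance')
knn_demo.fit(X_noise, y_noise)

# Score on the training data versus score on held-out folds.
train_r2 = knn_demo.score(X_noise, y_noise)
cv_r2 = cross_val_score(knn_demo, X_noise, y_noise, cv=5).mean()

print("Training R-squared:        %0.2f" % train_r2)
print("Cross-validated R-squared: %0.2f" % cv_r2)
```

This is why the training R-squared of a distance-weighted KNN model should not be compared directly with that of OLS; held-out scores are the fairer comparison.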

Comparing the two regression models on the same data, normalizing or standardizing the features does not improve performance. Standardizing leaves the OLS R-squared unchanged (0.999), and normalizing, which here rescales each row to unit norm rather than each column, lowers it to 0.07. The OLS model works well on the original data set. However, on the crime per capita data, the KNN regression model does better than the OLS model (training R-squared of 0.45 versus 0.33). I believe this is because the OLS model itself still needs improvement for the per capita data, and because dividing by population is a kind of rescaling, which suits the distance-based KNN model.

The weighted KNN regression has the highest training R-squared regardless of which data set is used. This is because KNN regression makes few assumptions about the shape of the relationship, so it can fit almost any training set closely. As the cross-validation scores show, however, a close fit to the training data does not guarantee good predictions on unseen data. KNN regression also has to store the full training set and needs more computation when making predictions, since each prediction requires a search for the nearest neighbors. An OLS regression model requires less storage once it is fitted, because only the coefficients are kept, and less computation when predicting.
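
The storage and computation difference can be made concrete with a hand-written sketch. The two helper functions below (`knn_predict` and `ols_predict`, hypothetical names written for this illustration, not scikit-learn APIs) show that a brute-force KNN prediction touches every training row, while an OLS prediction only needs the fitted coefficients. The toy data follow y = 2x.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Unweighted brute-force KNN regression for a single query point."""
    # Distances to ALL training rows are needed, so the training set must be kept.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    return float(y_train[nearest].mean())

def ols_predict(coef, intercept, x_query):
    """OLS prediction: a dot product with the stored coefficients."""
    return float(np.dot(coef, x_query) + intercept)

X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0.0, 2.0, 4.0, 6.0, 8.0])  # y = 2x

knn_value = knn_predict(X_train, y_train, np.array([2.0]), k=3)
ols_value = ols_predict(np.array([2.0]), 0.0, np.array([2.0]))

print(knn_value, ols_value)
```

In practice scikit-learn can speed up the neighbor search with tree structures, but the training data still have to be stored alongside the model.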