# Part 1: KNN

> * *How does K-nearest-neighbors work? Explain in your own words.*

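K-nearest neighbors (KNN) is a prediction model that predicts the value of a variable for a new point from the values of its nearest neighbors. It finds the K training points closest to the new point and returns their majority label (for classification) or their average value (for regression).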
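
To make the idea concrete, here is a minimal sketch of a KNN classifier using only the standard library. The function name `knn_predict` and the toy points are made up for illustration; it assumes Euclidean distance and a simple majority vote, which is only one of several possible choices.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(point, query))), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the labels of the k closest points
    nearest_labels = [label for _, label in sorted(distances, key=lambda d: d[0])[:k]]
    # The most common label among the neighbors wins
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data (made up for illustration): two clusters of labeled points
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ['A', 'A', 'A', 'B', 'B', 'B']

prediction_near_a = knn_predict(points, labels, (0.15, 0.15), k=3)
prediction_near_b = knn_predict(points, labels, (0.95, 1.0), k=3)
```

In the exercise further down, the points would be (latitude, longitude) pairs and the labels would be crime categories.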

> * *Explain in your own words: What is the curse of dimensionality?*

The curse of dimensionality is the problem data scientists face when applying the K-nearest neighbors model to high-dimensional datasets. In high-dimensional spaces the data become sparse: points tend to lie far apart, and the distances to the nearest and to the farthest neighbor become similar, so the "nearest" neighbors are no longer meaningfully close and KNN becomes hard to apply. The lower the dimensionality, the better the results tend to be. For high-dimensional data, a dimensionality reduction step can be applied before running KNN.
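
The effect can be illustrated with a small experiment. This is only a sketch under simple assumptions (points drawn uniformly at random from the unit hypercube, Euclidean distance); the helper `relative_contrast` is a hypothetical name. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.sqrt(sum((p - q) ** 2 for p, q in zip(point, query))))
    return (max(distances) - min(distances)) / min(distances)

# Compare a low-dimensional space with increasingly high-dimensional ones
contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim in sorted(contrasts):
    print('dim=%4d  relative contrast=%.2f' % (dim, contrasts[dim]))
```

The contrast should shrink as the number of dimensions grows, which is another way of saying that all points start to look about equally far away.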

> *Exercise: K-nearest-neighbors map.*

> *The goal of this exercise is to create a useful real-world version of the example on pp153 in DSFS. We know from last week's exercises that the focus crimes PROSTITUTION, DRUG/NARCOTIC and DRIVING UNDER THE INFLUENCE tend to be concentrated in certain neighborhoods, so we focus on those crime types since they will make the most sense on a KNN map.*
> * *Begin by using geoplotlib to plot all incidents of the three crime types on their own map using `geoplotlib.kde()`. This will give you an idea of how the various crimes are distributed across the city.*

Let's start by copying some content from the Week 3 notebook:

In [27]:
import requests
from matplotlib import pyplot as plt
import numpy as np
import csv
import pandas as pd
from pandas import DataFrame
%matplotlib inline
import geoplotlib
from geoplotlib.utils import BoundingBox
from geoplotlib.colors import ColorMap
import sklearn
from sklearn import neighbors, datasets
from sklearn.neighbors import KNeighborsClassifier

In [6]:
#Storing the dataset as a list of dictionaries (might be useful later on)
#Raw string so the backslashes in the Windows path are never read as escape sequences
with open(r'..\data\sfpd_incidents.csv','rb') as f:
    reader = csv.DictReader(f)
    data_dict = [line for line in reader]
    
#Storing the dataset as a Pandas DataFrame
#Easily create DataFrame from list of dictionaries
data_dataframe = DataFrame(data_dict)
del data_dict
data_dataframe.head()

|   | Address | Category | Date | DayOfWeek | Descript | IncidntNum | Location | PdDistrict | PdId | Resolution | Time | X | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18TH ST / VALENCIA ST | NON-CRIMINAL | 01/19/2015 | Monday | LOST PROPERTY | 150060275 | (37.7617007179518, -122.42158168137) | MISSION | 15006027571000 | NONE | 14:00 | -122.42158168137 | 37.7617007179518 |
| 1 | 300 Block of LEAVENWORTH ST | ROBBERY | 02/01/2015 | Sunday | ROBBERY, BODILY FORCE | 150098210 | (37.7841907151119, -122.414406029855) | TENDERLOIN | 15009821003074 | NONE | 15:45 | -122.414406029855 | 37.7841907151119 |
| 2 | 300 Block of LEAVENWORTH ST | ASSAULT | 02/01/2015 | Sunday | AGGRAVATED ASSAULT WITH BODILY FORCE | 150098210 | (37.7841907151119, -122.414406029855) | TENDERLOIN | 15009821004014 | NONE | 15:45 | -122.414406029855 | 37.7841907151119 |
| 3 | 300 Block of LEAVENWORTH ST | SECONDARY CODES | 02/01/2015 | Sunday | DOMESTIC VIOLENCE | 150098210 | (37.7841907151119, -122.414406029855) | TENDERLOIN | 15009821015200 | NONE | 15:45 | -122.414406029855 | 37.7841907151119 |
| 4 | LOMBARD ST / LAGUNA ST | VANDALISM | 01/27/2015 | Tuesday | MALICIOUS MISCHIEF, VANDALISM OF VEHICLES | 150098226 | (37.8004687042875, -122.431118543788) | NORTHERN | 15009822628160 | NONE | 19:00 | -122.431118543788 | 37.8004687042875 |


Now that we have imported the dataset into a Pandas DataFrame, we can proceed with the exercise:

In [17]:
#Create a subset of the main dataframe with only the KNN crimes
knn_crimes = ['PROSTITUTION', 'DRUG/NARCOTIC', 'DRIVING UNDER THE INFLUENCE']
knn_data = data_dataframe[data_dataframe['Category'].isin(knn_crimes)]

def kde_plot(crime,lats,longs):
    #geoplotlib expects a dict-like object with 'lat' and 'lon' keys
    geo_data_for_plotting = {"lat": lats,
                             "lon": longs}
    #Ready for plotting
    print(crime+': KDE Map')
    geoplotlib.kde(geo_data_for_plotting,bw=3)
    #Zoom the map to the extent of the plotted points
    bbox = BoundingBox(north=max(lats),
                       west=min(longs),
                       south=min(lats),
                       east=max(longs))
    geoplotlib.set_bbox(bbox)
    geoplotlib.inline()

for crime in knn_crimes:
    #Select one crime type and filter the outliers with Y=90 (the values are still strings here)
    crime_data = knn_data[(knn_data['Y']!='90')&(knn_data['Category']==crime)]
    lats = [float(el) for el in crime_data['Y']]
    longs = [float(el) for el in crime_data['X']]
    kde_plot(crime,lats,longs)

PROSTITUTION: KDE Map
('smallest non-zero count', 1.9901719039787654e-09)
('max count:', 21.749158532433501)


DRUG/NARCOTIC: KDE Map
('smallest non-zero count', 1.9901719039787654e-09)
('max count:', 71.481904923832047)


DRIVING UNDER THE INFLUENCE: KDE Map
('smallest non-zero count', 3.0077073102663826e-08)
('max count:', 1.7514932450777021)


> *Next, it's time to set up your model based on the actual data.*

> * *You don't have to think a lot about testing/training and accuracy for this exercise. We're mostly interested in creating a map that's not too problematic. But do calculate the number of observations of each crime type. You'll find that the levels of each crime vary (lots of drug arrests, an intermediate amount of prostitution registered, and very little drunk driving in the dataset). Since the algorithm classifies each point according to its neighbors, what could be a consequence of this imbalance in the number of examples from each class for your map?*

In [23]:
print 'Total number of observations:',len(knn_data.index)

for crime in knn_crimes:
    print 'Total number of ',crime,'observations:',len(knn_data[knn_data['Category']==crime].index)
    


Total number of observations: 136596
Total number of  PROSTITUTION observations: 16163
Total number of  DRUG/NARCOTIC observations: 115131
Total number of  DRIVING UNDER THE INFLUENCE observations: 5302


> * *You can make the dataset 'balanced' by grabbing an equal number of examples from each crime category. How do you expect that will change the KNN result? In which situations is the balanced map useful, and when is the map that uses data in proportion to occurrences useful? Choose which map you will work on in the following.*
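
One possible way to balance the data is to downsample every category to the size of the rarest one. The sketch below is an assumption about how this could be done, not the course's prescribed method: `balance_by_category` and `toy_records` are hypothetical names, and the toy records only mimic the imbalance seen in the counts above. Since KNN decides by majority vote, the most frequent category would otherwise be expected to win most votes wherever the categories overlap.

```python
import random
from collections import defaultdict

def balance_by_category(records, key='Category', seed=0):
    """Downsample so that every category keeps as many records as the rarest one."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for record in records:
        by_category[record[key]].append(record)
    # The rarest category sets the sample size for all of them
    n_per_category = min(len(group) for group in by_category.values())
    balanced = []
    for category in sorted(by_category):
        balanced.extend(rng.sample(by_category[category], n_per_category))
    return balanced

# Toy records (made up) that mimic the imbalance in the real counts
toy_records = (
    [{'Category': 'DRUG/NARCOTIC'}] * 20
    + [{'Category': 'PROSTITUTION'}] * 6
    + [{'Category': 'DRIVING UNDER THE INFLUENCE'}] * 2
)
balanced = balance_by_category(toy_records)

# Count how many records of each category survived the downsampling
balanced_counts = {}
for record in balanced:
    balanced_counts[record['Category']] = balanced_counts.get(record['Category'], 0) + 1
```

Applied to the real data, this approach would keep at most 5302 observations per category, the number of DRIVING UNDER THE INFLUENCE records.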

> * *Now create an approximately square grid of points that runs over SF. You get to decide the grid-size, but I recommend somewhere between 50×50 and 100×100 points. I recommend plotting using geoplotlib.dot().*

In [32]:
#geoplotlib.dot needs a dict with 'lat' and 'lon' keys, not a plain list
dui = knn_data[(knn_data['Y']!='90')&(knn_data['Category']=='DRIVING UNDER THE INFLUENCE')]
geoplotlib.dot({"lat": [float(el) for el in dui['Y']], "lon": [float(el) for el in dui['X']]})
geoplotlib.inline()
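
For the grid itself, a minimal sketch could look like the following. The helper `make_grid` is a hypothetical name and the bounding box values are rough assumptions for San Francisco, not values taken from the dataset; in practice they could be derived from the minimum and maximum coordinates of the filtered data. The result is a dict with `lat` and `lon` keys, which is the layout geoplotlib layers look up.

```python
def make_grid(south, north, west, east, n_lat=50, n_lon=50):
    """Return a dict of 'lat'/'lon' lists covering the bounding box with n_lat x n_lon points."""
    lat_step = (north - south) / float(n_lat - 1)
    lon_step = (east - west) / float(n_lon - 1)
    lats, lons = [], []
    for i in range(n_lat):
        for j in range(n_lon):
            lats.append(south + i * lat_step)
            lons.append(west + j * lon_step)
    return {'lat': lats, 'lon': lons}

# Rough bounding box for San Francisco (assumed values, adjust to the data)
grid = make_grid(south=37.70, north=37.82, west=-122.52, east=-122.35)

# With geoplotlib installed, the grid could then be drawn with:
# geoplotlib.dot(grid)
# geoplotlib.inline()
```

Each grid point can later be classified by the KNN model, so that the map shows the predicted crime type at every location.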
