## K-means clustering

In this exercise we explore $K$-means clustering and try it out on the locations of the PROSTITUTION crime type. Applying a clustering method makes sense because we know from our earlier work that this crime type tends to happen in only a few locations. We'll also talk a little bit about model selection and overfitting in unsupervised models.
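
As a reminder of what $K$-means does under the hood, here is a minimal sketch of the standard Lloyd iteration (assign every point to its nearest mean, then move every mean to the average of its points). The function name `lloyd_kmeans` and the toy coordinates are made up for illustration; the exercise itself uses scikit-learn's `KMeans`, which adds smarter initialisation and several restarts.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Illustrative Lloyd iteration; returns (centers, labels)."""
    rng = np.random.RandomState(seed)
    points = np.asarray(points, dtype=float)
    # Initialise the centers with k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (an empty cluster simply keeps its previous center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return centers, labels

# Made-up toy data: two well-separated groups of 2-D points.
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
toy_centers, toy_labels = lloyd_kmeans(toy, k=2)
print(toy_centers)
print(toy_labels)
```

Which group ends up with label 0 depends on the random initialisation, so only the grouping itself is meaningful.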

### Exercise: $K$-means
Visualize the prostitution data (e.g. by plotting it on a map)

In [6]:
import pandas as pd
import matplotlib.pyplot as plt
import geoplotlib
import numpy as np
from geoplotlib.utils import BoundingBox, DataAccessObject
from sklearn.cluster import KMeans



In [7]:
#Load data from csv to pandas
crimeData = pd.read_csv("SFPD_Incidents_-_from_1_January_2003.csv")

In [8]:
# Keep only the PROSTITUTION incidents and pull out their coordinates
prostCrime = crimeData[crimeData['Category'] == 'PROSTITUTION']
lon = prostCrime['X'].astype(float).tolist()
lat = prostCrime['Y'].astype(float).tolist()
geoPlotData = {'PROSTITUTION': {"lon": lon, "lat": lat}}

In [9]:
geoplotlib.inline()
print('PROSTITUTION')
# Kernel density estimate of the incident locations
geoplotlib.kde(geoPlotData['PROSTITUTION'], bw=10, cut_below=1e-4, cmap='Blues')
# Bounding box around the mean incident location (offsets in degrees)
meanLatAll = np.mean(geoPlotData['PROSTITUTION']["lat"])
meanLonAll = np.mean(geoPlotData['PROSTITUTION']["lon"])
bbox = BoundingBox(north=meanLatAll + 0.04, west=meanLonAll - 0.01,
                   south=meanLatAll - 0.07, east=meanLonAll + 0.01)

geoplotlib.set_bbox(bbox)
geoplotlib.inline()

PROSTITUTION
('smallest non-zero count', 9.8168492680594421e-08)
('max count:', 4.2386914193370524)


Train models of $K = 2,\ldots,10$ on the prostitution data.

In [25]:
# Drop incidents recorded with a latitude of 90 degrees, which cannot be in San Francisco
prostOutlier = [(x, y) for (x, y) in zip(lon, lat) if y != 90]

In [46]:
lonOutlier = [x[0] for x in prostOutlier]
latOutlier = [x[1] for x in prostOutlier]
meanLon = np.mean(lonOutlier)
meanLat = np.mean(latOutlier)

In [31]:
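# Fit one model per K. Note that range(2, 10) covers K = 2, ..., 9;
# use range(2, 11) to include K = 10 as well.
kmeans = []
for k in range(2, 10):
    kmeans.append(KMeans(n_clusters=k, random_state=0).fit(prostOutlier))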

Explore how the total squared error changes as a function of $K$ and identify what you think is the right number of clusters based on the knee-point in the squared error plot.
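
For reference, the total squared error of a fitted model is $\sum_{i} \min_{k} \lVert x_i - \mu_k \rVert^2$: the squared distance from every point $x_i$ to its closest mean $\mu_k$, summed over all points. The sketch below, on made-up toy points rather than the crime data, checks that scikit-learn stores this value in `inertia_` and that `score()` returns its negative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up toy data: two tight groups of 2-D points (not the crime data).
X = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
              [4.0, 4.0], [4.2, 4.0], [4.0, 4.2]])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Total squared error computed by hand: each point against its own center.
assigned_centers = model.cluster_centers_[model.labels_]
manual_sse = ((X - assigned_centers) ** 2).sum()

print(manual_sse)        # hand-computed total squared error
print(model.inertia_)    # the same quantity as stored by scikit-learn
print(model.score(X))    # the negative of that quantity
```

So a curve built from `score()` increases towards zero as $K$ grows, while one built from `inertia_` decreases towards zero; the knee sits at the same $K$ either way.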

In [32]:
# inertia_ is the total squared distance of the samples to their closest center;
# score() would return the negative of this quantity
squaredErrors = [model.inertia_ for model in kmeans]

In [33]:
# Read K back from the fitted models so the x-axis always matches the errors
clusters = [model.n_clusters for model in kmeans]
plt.plot(clusters,squaredErrors)
plt.title('Total squared error as a function of number of clusters')
plt.xlabel('Number of clusters')
plt.ylabel('Squared Error')


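**And by the way: The fit only gets better when we add more means - why not keep adding more of them: Explain in your own words why it makes sense to stop around a knee-point.** Adding means always lowers the squared error on the data we fit: with one mean per data point the error would be zero. Beyond the knee-point, however, each extra mean buys only a marginal improvement and mostly fits noise in this particular sample rather than real structure, so the model overfits and would describe new data poorly. We therefore conclude that $K=4$ is the optimal number of clusters.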
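
To make the knee-point idea concrete, here is a small sketch on made-up data with three well-separated blobs (not the crime data). The squared error keeps shrinking as $K$ grows, but the drop flattens out after $K=3$. Picking the knee as the largest second difference of the error curve is just one simple heuristic assumed here for illustration; in the exercise the knee was read off the plot by eye.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: three well-separated blobs at the corners of a triangle.
rng = np.random.RandomState(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.66]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

# Total squared error for K = 1, ..., 6.
ks = list(range(1, 7))
sse = np.array([KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in ks])

# Illustrative knee heuristic: largest second difference of the error curve.
# second_diff[j] belongs to ks[j + 1], since the end points have no neighbours.
second_diff = sse[:-2] - 2 * sse[1:-1] + sse[2:]
knee_k = ks[1 + int(np.argmax(second_diff))]

for k, err in zip(ks, sse):
    print("K = {}: squared error = {:.1f}".format(k, err))
print("knee at K = {}".format(knee_k))
```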

In [34]:
centers = []
for i in range(len(kmeans)):
    centers.append(kmeans[i].cluster_centers_.tolist())
centers

[[[-122.41721258127922, 37.78739426221802],
  [-122.41924311718914, 37.76000421665213]],
 [[-122.41582476469686, 37.761346056903406],
  [-122.41709742374232, 37.78742454987841],
  [-122.47811474903897, 37.73890648569841]],
 [[-122.41708247002195, 37.78742711884176],
  [-122.4157933283197, 37.76144681116217],
  [-122.46632498052548, 37.718814247089576],
  [-122.48639782848089, 37.75857230467054]],
 [[-122.41584224261476, 37.76142569868439],
  [-122.41876997704011, 37.78765447103969],
  [-122.46632498052548, 37.718814247089576],
  [-122.48639782848089, 37.75857230467054],
  [-122.4045346858759, 37.78553068672912]],
 [[-122.41584104423832, 37.7614223162626],
  [-122.41734574019395, 37.78626408225128],
  [-122.48653563661559, 37.7584929825487],
  [-122.46632498052548, 37.718814247089576],
  [-122.40357635664827, 37.785479856299915],
  [-122.42168637567067, 37.790816208502285]],
 [[-122.42168787770677, 37.790846125985354],
  [-122.41599717453238, 37.761708702352564],
  [-122.46896682755931,

### Save the prostitution data as a CSV file

In [None]:
import pandas as pd

In [None]:
locations1 = pd.DataFrame({ 'lon': lon, 'lat': lat })

In [None]:
locations1.to_csv('locations-WHATUP.csv')

In [None]:
locations1.to_json('locations.json')

In [None]:
locations1.reset_index().to_json('locations.json', orient='records')

In [None]:
locations1
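
The two `to_json` calls above write different layouts to the same file, so the second one overwrites the first. A minimal sketch of the difference, using made-up coordinates and the hypothetical names `toy`, `by_column` and `by_row`:

```python
import json
import pandas as pd

# Made-up coordinates standing in for the lon/lat lists above.
toy = pd.DataFrame({'lon': [-122.4, -122.5], 'lat': [37.78, 37.76]})

# Default orientation for a DataFrame: one object per column, keyed by row index.
by_column = json.loads(toy.to_json())

# orient='records': one object per row; reset_index() keeps the row number.
by_row = json.loads(toy.reset_index().to_json(orient='records'))

print(by_column)
print(by_row)
```

The records layout (a list with one object per incident) is usually the easier one to iterate over when the file is consumed elsewhere.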