# Data Extraction for Assignment 2B

This notebook shows how the data used for the visualization in exercise 2B is generated.

First, we import the necessary libraries. We also set a random seed so that the exact same dataset can be regenerated (KMeans from sklearn initializes its centroids randomly).

In [2]:
import pandas as pd
from sklearn.cluster import KMeans
import numpy as np
import json

np.random.seed(seed=1337)
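As a side note, the seeding idea can also be sketched with the `random_state` parameter of `KMeans`, which fixes the randomness per model instead of relying on the global NumPy seed. This is only an illustration, not what the notebook itself does; the toy `points` array and the explicit `n_init=10` are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated blobs of 2-D points
rng = np.random.RandomState(0)
points = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
                    rng.normal(5.0, 0.1, size=(20, 2))])

# Two independent fits with the same random_state give identical labels
labels_a = KMeans(n_clusters=2, n_init=10, random_state=1337).fit(points).labels_
labels_b = KMeans(n_clusters=2, n_init=10, random_state=1337).fit(points).labels_

print((labels_a == labels_b).all())  # True
```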

Now we are ready to load the data into a pandas dataframe. We only need three attributes: Category, X (longitude) and Y (latitude). Note that we filter out everything that is not in the target category "PROSTITUTION".

In [3]:
categories = ["Category", "X", "Y"]

# Load the data into a pandas dataframe
df = pd.read_csv('../data/Map__Crime_Incidents_-_from_1_Jan_2003.csv', header=0, usecols = categories)

# Filter the data on the categories which we focus on
df = df.loc[df['Category'].isin(["PROSTITUTION"])]
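To see what `usecols` and the `isin` filter do without needing the full crime file, the same two steps can be tried on a tiny in-memory CSV. The rows below are made up for illustration; only the column names Category, X and Y are taken from the code above.

```python
import io
import pandas as pd

# Hypothetical stand-in for the crime CSV (values are invented)
csv_text = u"""Category,Descript,X,Y
PROSTITUTION,EXAMPLE,-122.41,37.78
ASSAULT,EXAMPLE,-122.42,37.77
PROSTITUTION,EXAMPLE,-122.40,37.79
"""

# usecols drops every column that is not listed (here: Descript)
sample = pd.read_csv(io.StringIO(csv_text), header=0, usecols=["Category", "X", "Y"])

# isin keeps only the rows whose Category is in the given list
sample = sample.loc[sample["Category"].isin(["PROSTITUTION"])]

print(sample.shape)  # (2, 3)
```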


The dataset has two outliers with coordinates that are not in San Francisco. We remove these.

In [5]:
lat = df['Y'].tolist()
lon = df['X'].tolist()

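# Latitudes above 80 degrees cannot be in San Francisco (roughly 37.8 N)
max_valid_lat = 80

# Remove outliers; iterate backwards so that deleting an entry
# does not shift the indices that are still to be visited
for i in reversed(range(len(lat))):
    if lat[i] > max_valid_lat:
        print("Found an outlier")
        del lon[i]
        del lat[i]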

Found an outlier
Found an outlier
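The same cleanup could also be done on the dataframe itself with a boolean mask, before converting the columns to lists. The following is a minimal sketch on invented coordinates, assuming the same latitude threshold of 80; the names `sample`, `cleaned`, `sample_lat` and `sample_lon` are hypothetical.

```python
import pandas as pd

# Hypothetical sample: the third row has an impossible latitude
sample = pd.DataFrame({"X": [-122.41, -122.42, -120.5],
                       "Y": [37.78, 37.77, 90.0]})

# Keep only the rows whose latitude is plausible
cleaned = sample.loc[sample["Y"] <= 80]

sample_lat = cleaned["Y"].tolist()
sample_lon = cleaned["X"].tolist()

print(len(sample) - len(cleaned))  # number of rows dropped: 1
```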


We chose to save the relevant data in CSV format. The header of the file will be: lat,lon,k2,k3,k4,k5,k6. We save the coordinates of each crime, and the other columns (k2,k3,...,k6) contain the corresponding classifications from training the different K-means models.

We will also need to save the centroids of each model. Since the number of centroids depends on the k-parameter of the K-means model, we store the centroids in dictionaries. Dictionaries are easily converted to JSON format, which can be loaded directly into JavaScript.

In [8]:
predictions = []
centroids = {"k2" : [], "k3" : [], "k4" : [], "k5" : [], "k6" : []}
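To make the intended layout concrete, here is a small sketch of what one entry of the dictionary might look like once it is filled, and how it maps to JSON. The centroid values are invented; only the keys ("class", "lat", "lon") follow the code further down.

```python
import json

# Hypothetical centroid values, only to illustrate the layout
example = {"k2": [{"class": "0", "lat": 37.76, "lon": -122.42},
                  {"class": "1", "lat": 37.79, "lon": -122.41}]}

# Serialize to a JSON string and parse it back
text = json.dumps(example, sort_keys=True)
restored = json.loads(text)

print(text)
```

Each key holds a list with one object per centroid, so the list under "k2" has two entries, the list under "k3" has three, and so on.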

Now, we will train 5 different K-means models.

In [9]:
# KMeans expects one row per sample: shape (n_samples, 2) with columns lat, lon
X = np.column_stack((lat, lon))

# Train for k = 2..6
for i in range(2,7):
    # Train
    model = KMeans(n_clusters=i)
    model.fit(X)
    # Append predictions to our list 
    predictions.append(model.labels_)
    
    # Append the centroid data to the dictionary
    for j in range(0, len(model.cluster_centers_)):
        centroid = model.cluster_centers_[j].tolist()
        cur_class = model.predict([centroid])[0]
        centroids["k{0}".format(i)].append({"class" : "{0}".format(cur_class), "lat" : centroid[0], "lon" : centroid[1]})
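The call to `model.predict` on each centroid recovers the class label that the centroid belongs to. Since every centroid is closest to itself, row j of `cluster_centers_` should receive label j. This can be checked with a small sketch on synthetic data; the `points` array and the fixed `random_state` are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three well-separated blobs of 2-D points
rng = np.random.RandomState(0)
points = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
                    rng.normal(5.0, 0.1, size=(20, 2)),
                    rng.normal(10.0, 0.1, size=(20, 2))])

model = KMeans(n_clusters=3, n_init=10, random_state=1337).fit(points)

# Predict the cluster of every centroid at once
centroid_labels = model.predict(model.cluster_centers_)

print(centroid_labels.tolist())  # [0, 1, 2]
```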


Now we have generated all the data, and we simply need to save it to the correct files. First, we save all of the classifications from the different k-means models. We create a pandas dataframe initially consisting of 0's and then we add all of the data.  

In [17]:
# Create a 0's array for initializing the df
n_samples = len(lat)
zero_array = np.zeros(shape=(n_samples,7))

# Init df for csv
df_k = pd.DataFrame(zero_array, columns=["lat", "lon", "k2", "k3", "k4", "k5", "k6"])

# Add all the generated data to the dataframe
for i in range(n_samples):
    # The coordinates
    df_k.at[i, "lat"] = lat[i]
    df_k.at[i, "lon"] = lon[i]
    
    # The classifications from each of the five models
    for j in range(0,5):
        col = "k{0}".format((j+2))
        df_k.at[i, col] = predictions[j][i]

# Save to file
df_k.to_csv("classifications.csv", index=False)
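As an alternative to filling the dataframe cell by cell, the columns could be assembled in a dictionary and handed to pandas in one go. The sketch below uses invented stand-ins for the coordinates and label arrays and only two models; note that with this approach the labels stay integers instead of being stored as floats.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for lat, lon and the per-model label arrays
sample_lat = [37.78, 37.77, 37.79]
sample_lon = [-122.41, -122.42, -122.40]
sample_predictions = [np.array([0, 1, 0]), np.array([2, 0, 1])]  # k = 2 and k = 3

# One dictionary entry per column
columns = {"lat": sample_lat, "lon": sample_lon}
for offset, labels in enumerate(sample_predictions):
    columns["k{0}".format(offset + 2)] = labels

# The columns argument fixes the column order in the output
df_sample = pd.DataFrame(columns, columns=["lat", "lon", "k2", "k3"])

# Without a path, to_csv returns the CSV text as a string
csv_text = df_sample.to_csv(index=False)
print(csv_text.splitlines()[0])  # lat,lon,k2,k3
```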

Now we save the centroids to a JSON file.

In [18]:
with open('centroids.json', 'w') as fp:
    json.dump(centroids, fp)