# Lab 9: A simple example of Machine Learning

Welcome to lab 9! In this exercise, you will step through a simple example using *scikit-learn*, one of the most popular machine learning libraries for the Python programming language. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting and k-means.

<img src="scikit-learn-logo.png" />

If you have launched this notebook in `binder` the `sklearn` library should already have been installed on the underlying virtual machine that the copy of Jupyter Notebook is running in. If you have downloaded this notebook to use on your own computer (or in the CTR), you might need to install scikit-learn first before running this notebook. To install scikit-learn, open up the command-line (Command prompt on Windows, or Git Bash in the CTR) and type:

```
conda install scikit-learn
```

and press Enter. Some text should whiz by indicating that it's installing various things. Once that is done, please restart Jupyter Notebook.
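
If you want to confirm that the installation worked before continuing, a quick check along these lines should be enough. It only assumes that the package imports under the name `sklearn`, as used in the rest of this notebook.

```python
import sklearn

# Print the installed version to confirm that the import works.
print("scikit-learn version:", sklearn.__version__)
```

If the import fails with a `ModuleNotFoundError`, the install step above probably did not complete in the environment that Jupyter is using.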

For full documentation on how to use scikit-learn, see http://scikit-learn.org

In this lab session, we will simply try out some [k-means clustering](https://en.wikipedia.org/wiki/K-means_clustering) by providing some data to the k-means clustering algorithm built into scikit-learn. To use it in Python code, you import from `sklearn`.

In [1]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

In [2]:
# might take a while to load on a slow Internet connection - file is 70MB
sfpd = pd.read_csv('http://s3.eu-west-2.amazonaws.com/ox-p4ds-assets/datasets/sfpd.csv.gz')
                   
# Complete the line below to calculate and print the number of rows, in millions, rounded to 3 decimal places
print('{num_rows} million rows'.format(num_rows=round(sfpd.shape[0] / 10**6, 3)))

1.835 million rows


That's a **lot** of rows in this CSV file! Let's work with just a sample of this dataset, otherwise your notebook may become very slow or crash.

Complete the code below to create a new DataFrame `sfpd_sample` with a sample of only 1000 rows from the original `sfpd` DataFrame.

In [3]:
sfpd_sample = sfpd.sample(1000)
sfpd_sample.shape

(1000, 13)
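
Because `.sample()` draws rows at random, each run of the cell above selects different rows, so your plots and cluster centres will differ slightly from the ones shown here. If you want repeatable results, one option is to pass a `random_state`. The sketch below illustrates this on a small made-up DataFrame (`toy` is a hypothetical stand-in for `sfpd`, and the seed `42` is arbitrary).

```python
import pandas as pd

# Toy stand-in for the much larger sfpd DataFrame (hypothetical data).
toy = pd.DataFrame({"X": range(100), "Y": range(100, 200)})

# Passing the same random_state selects the same rows on every run.
first = toy.sample(10, random_state=42)
second = toy.sample(10, random_state=42)

print(first.equals(second))
```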

Now let's take a look at the structure of the CSV file we have loaded.

In [4]:
sfpd_sample.head()

| | IncidntNum | Category | Descript | DayOfWeek | Date | Time | PdDistrict | Resolution | Address | X | Y | Location | PdId |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1570835 | 41141188 | MISSING PERSON | MISSING JUVENILE | Tuesday | 10/05/2004 | 06:15 | PARK | LOCATED | 100 Block of BELVEDERE ST | -122.449329 | 37.767774 | (37.7677738874748, -122.449328648219) | 4114118874010 |
| 808272 | 106034642 | VANDALISM | MALICIOUS MISCHIEF, VANDALISM | Thursday | 03/25/2010 | 15:45 | SOUTHERN | NONE | 100 Block of NATOMA ST | -122.399115 | 37.786858 | (37.7868583013919, -122.39911538091) | 10603464228150 |
| 112806 | 150114963 | OTHER OFFENSES | TRAFFIC VIOLATION | Friday | 02/06/2015 | 18:10 | RICHMOND | NONE | FUNSTON AV / CLEMENT ST | -122.471916 | 37.782595 | (37.7825949606971, -122.471916495608) | 15011496365015 |
| 1148349 | 71063546 | NON-CRIMINAL | AIDED CASE, MENTAL DISTURBED | Wednesday | 10/17/2007 | 02:18 | PARK | PSYCHOPATHIC CASE | 2100 Block of OFARRELL ST | -122.440157 | 37.782397 | (37.7823970703886, -122.440157274543) | 7106354664020 |
| 438395 | 126202261 | LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | Friday | 12/14/2012 | 11:15 | SOUTHERN | NONE | HARRISON ST / HAWTHORNE ST | -122.396266 | 37.78344 | (37.7834404321177, -122.396265814237) | 12620226106244 |


We are going to use `bokeh` to visualize data in this dataset and also to visualize the k-means clustering that we will do using `sklearn`. We will first import and initialize `bokeh`, and then create a Google Maps plot of the San Francisco city area. You will need a Google developer API key, which you can obtain from the link shown in the comment below. Note that if you do not already have a Google account, you will have to sign up for one to obtain an API key.

In [5]:
from bokeh.io import output_notebook, show
output_notebook()

In [6]:
# Enter the latitude and longitude coordinates for San Francisco into the GMapOptions below.
# Note that for latitudes, degrees North are positive values and degrees South are negative.
# Note that for longitudes, degrees East are positive and degrees West are negative.
from bokeh.models import (
  GMapPlot, GMapOptions, ColumnDataSource, Circle, Range1d, PanTool, WheelZoomTool, BoxSelectTool
)

map_options = GMapOptions(lat=37.773010 , lng=-122.418640 , map_type="roadmap", zoom=12)

plot = GMapPlot(x_range=Range1d(), y_range=Range1d(), map_options=map_options)
plot.title.text = "San Francisco"

# For GMaps to function, Google requires you obtain and enable an API key:
#
# https://developers.google.com/maps/documentation/javascript/get-api-key
#
# To do this, you will need a Google account.
#
# Replace the GOOGLE_API_KEY below with your personal API key: 
api_key = "GOOGLE_API_KEY"
plot.api_key = api_key
show(plot)

Let's plot some data points on the Google Map plot that we just created. We can do that by selecting the relevant longitude and latitude columns from the data set. You should be able to work out which are the relevant columns when you peeked at the head of the DataFrame earlier.

In [7]:
source = ColumnDataSource(
    data=dict(
        lat= sfpd_sample['Y'].values,  # provide an array with the latitude values
        lon= sfpd_sample['X'].values   # provide an array with the longitude values
    )
)

circle = Circle(x="lon", y="lat", size=5, fill_color="blue", fill_alpha=0.8, line_color=None)
plot.add_glyph(source, circle)
show(plot)

The previous step should have plotted a sample of the distribution of all SFPD incidents reported in the San Francisco area. We can observe that some parts of the city have more incidents than others. In fact, you can compare these results with the heatmaps produced by Trulia (an American real-estate website that also uses data to provide "neighborhood insights") here: http://www.trulia.com/blog/trends/trulia-local/

Something we might want to do is to understand where to focus policing resources, based on this data. We could use k-means clustering to produce a set of cluster centre points of interest, perhaps to decide where in the city there might be more or less crime. First, let's set up our `KMeans` estimator object.

In [8]:
estimator = KMeans(n_clusters=5, max_iter=300, init='random', verbose=1)

The `KMeans` object takes as parameters the number of clusters to fit the data to (`n_clusters`), the maximum number of iterations of the k-means algorithm for a single run (`max_iter`), and the method for initializing the algorithm (`init`). `init` can take the values `'k-means++'` or `'random'`, or an `np.ndarray`. `'k-means++'` selects initial cluster centers for k-means clustering in a smart way to speed up convergence, while `'random'` chooses `n_clusters` observations (rows) at random from the data as the initial centroids. If an `ndarray` (n-dimensional array; we haven't looked at these in this course) is passed, it should be of shape (`n_clusters`, `n_features`) and gives the initial centers. We set `verbose` to `1` so the algorithm outputs some logging messages.
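
To see how the `init` options behave, you can try them on a small synthetic dataset. The following is only a sketch: the data comes from `make_blobs` rather than the SFPD file, and the values chosen for `n_init`, `random_state` and the blob settings are arbitrary assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three well-separated groups (hypothetical).
points, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# Compare the two named initialization methods.
for init in ("k-means++", "random"):
    model = KMeans(n_clusters=3, init=init, n_init=10, max_iter=300, random_state=0)
    model.fit(points)
    print(init, "iterations:", model.n_iter_, "inertia:", round(model.inertia_, 2))

# An ndarray of shape (n_clusters, n_features) can also serve as the initial centres.
start = points[:3]
manual = KMeans(n_clusters=3, init=start, n_init=1, random_state=0).fit(points)
print(manual.cluster_centers_.shape)
```

On data this clean, both methods usually end up with very similar clusters. The differences tend to show up on messier data, where a poor random start can need more iterations or settle on a worse solution.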

Next, we provide some data to the estimator object. 

In [10]:
# estimator.fit() expects 2-D, array-like input of shape (n_samples, n_features). Slice your sample
# DataFrame by selecting the longitude and latitude columns only to provide as input below.
kmeans = estimator.fit(sfpd_sample.loc[:,['X','Y']])

Initialization complete
start iteration
done sorting
end inner loop
Iteration 0, inertia 0.3872168083303606
start iteration
done sorting
end inner loop
Iteration 1, inertia 0.2989286539555569
start iteration
done sorting
end inner loop
Iteration 2, inertia 0.281727017806903
start iteration
done sorting
end inner loop
Iteration 3, inertia 0.2761128084411967
start iteration
done sorting
end inner loop
Iteration 4, inertia 0.2756563812787215
start iteration
done sorting
end inner loop
Iteration 5, inertia 0.27558354441357563
start iteration
done sorting
end inner loop
Iteration 6, inertia 0.2754640427514047
start iteration
done sorting
end inner loop
Iteration 7, inertia 0.27543064354377245
start iteration
done sorting
end inner loop
Iteration 8, inertia 0.27543064354377245
center shift 0.000000e+00 within tolerance 6.123952e-08
Initialization complete
start iteration
done sorting
end inner loop
Iteration 0, inertia 0.36125035252613724
start iteration
done sorting
end inner loop
Iteration

The algorithm tries to partition the data points into `n_clusters` groups. We can see where these groups are centred by looking at the `.cluster_centers_` attribute:

In [11]:
kmeans.cluster_centers_

array([[-122.41233772,   37.78696939],
       [-122.41923189,   37.76158266],
       [-122.39693422,   37.72697197],
       [-122.4546007 ,   37.72506589],
       [-122.46526933,   37.77226796]])
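
To make the idea of "identifying groups" more concrete, here is a minimal NumPy sketch of the standard k-means iteration (often called Lloyd's algorithm): assign each point to its nearest centre, then move each centre to the mean of its assigned points, and repeat. This is not scikit-learn's actual implementation. `lloyd_kmeans` is a hypothetical helper name and the toy data is made up.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: assign to nearest centre, then recompute centres."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centre, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # New centre = mean of assigned points; keep the old centre if a cluster is empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centres stopped moving, so the assignment is stable
        centers = new_centers
    return centers, labels

# Two obvious groups, one around (0, 0) and one around (10, 10).
rng = np.random.default_rng(1)
toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
centers, labels = lloyd_kmeans(toy, k=2)
print(centers)
```

The stopping test plays the same role as the "center shift ... within tolerance" message in the verbose log above: once the centres no longer move, further iterations change nothing.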

In other words, the k-means clustering algorithm has identified 5 possible clusters, which we can refer to by numbered group labels as follows:

In [12]:
for x, cluster_center in enumerate(kmeans.cluster_centers_):
    print('Group {group_number} has center {cluster_center}'.format(group_number=x, cluster_center=cluster_center))

Group 0 has center [-122.41233772   37.78696939]
Group 1 has center [-122.41923189   37.76158266]
Group 2 has center [-122.39693422   37.72697197]
Group 3 has center [-122.4546007    37.72506589]
Group 4 has center [-122.46526933   37.77226796]


Extract the `lat` and `lon` arrays from the cluster centers generated by the k-means algorithm.

*Hint: You can use a list comprehension, or a `map()` function to create the new lists that can be cast into arrays.*

In [13]:
source = ColumnDataSource(
    data=dict(
        lat= [x[1] for x in kmeans.cluster_centers_], # latitude values from the cluster centers results
        lon= [x[0] for x in kmeans.cluster_centers_]  # longitude values from the cluster centers results
    )
)
circle = Circle(x="lon", y="lat", size=100, fill_color="orange", fill_alpha=0.5, line_color=None)
plot.add_glyph(source, circle)  # this adds to the GMapPlot we created earlier
show(plot)
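
As an alternative to the list comprehension suggested in the hint, NumPy arrays such as `cluster_centers_` can be sliced by column directly. The sketch below uses a small made-up array (`centers` is hypothetical) in the same (longitude, latitude) column order as the data we fitted.

```python
import numpy as np

# Hypothetical cluster centres in (longitude, latitude) order.
centers = np.array([[-122.41, 37.78], [-122.45, 37.72]])

lon = centers[:, 0]  # first column: longitudes
lat = centers[:, 1]  # second column: latitudes

# The list-comprehension form gives the same values.
assert lat.tolist() == [row[1] for row in centers]
print(lon, lat)
```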

Finally, note that we have done more than identify the clusters and their centre points: our `KMeans` model was *trained* against the input data when we executed the `.fit()` function. This is an example of *unsupervised learning* because the input data is not labeled. The group numbers we used earlier are arbitrary labels, and you might decide later to add human-interpretable labels to these groups.

As we have something that has been trained/fit against the input data, we can also make predictions based on our trained model. To do this, we use the `.predict()` function, providing as input a list of coordinates. For example, we can take a new sample from our original full-sized `sfpd` DataFrame.

In [14]:
new_sample = sfpd.sample(5)
new_sample

| | IncidntNum | Category | Descript | DayOfWeek | Date | Time | PdDistrict | Resolution | Address | X | Y | Location | PdId |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 919098 | 90569533 | NON-CRIMINAL | AIDED CASE, MENTAL DISTURBED | Tuesday | 06/02/2009 | 19:00 | NORTHERN | PSYCHOPATHIC CASE | 1500 Block of GREENWICH ST | -122.425185 | 37.800254 | (37.8002544025518, -122.425184766564) | 9056953364020 |
| 1194138 | 70623961 | MISSING PERSON | FOUND PERSON | Wednesday | 06/20/2007 | 07:30 | BAYVIEW | LOCATED | 0 Block of LEDYARD ST | -122.402117 | 37.733235 | (37.7332354676556, -122.402116556292) | 7062396175000 |
| 520491 | 120417985 | LARCENY/THEFT | GRAND THEFT FROM PERSON | Friday | 05/18/2012 | 17:15 | NORTHERN | NONE | FILLMORE ST / HAYES ST | -122.431201 | 37.775833 | (37.7758330026111, -122.431201289001) | 12041798506152 |
| 1580899 | 41040029 | NON-CRIMINAL | LOST PROPERTY | Thursday | 09/09/2004 | 14:00 | SOUTHERN | NONE | 800 Block of BRYANT ST | -122.403405 | 37.775421 | (37.775420706711, -122.403404791479) | 4104002971000 |
| 1394608 | 60006731 | LARCENY/THEFT | PETTY THEFT FROM LOCKED AUTO | Monday | 01/02/2006 | 18:55 | SOUTHERN | NONE | 200 Block of 7TH ST | -122.408649 | 37.777311 | (37.7773110274033, -122.40864873517) | 6000673106241 |


We then apply the `.predict()` function to this sample, which consists of the same kind of data that the algorithm was trained on. In the example below, we create a new DataFrame to illustrate the mapping between the coordinates and the group labels.

In [15]:
long_lats = new_sample.loc[:,['X','Y']]
predictions = kmeans.predict(long_lats)
prediction_results = long_lats.copy()  # copy so the new column is added to an independent DataFrame
prediction_results['Group'] = predictions
prediction_results

| | X | Y | Group |
|---|---|---|---|
| 919098 | -122.425185 | 37.800254 | 0 |
| 1194138 | -122.402117 | 37.733235 | 2 |
| 520491 | -122.431201 | 37.775833 | 1 |
| 1580899 | -122.403405 | 37.775421 | 0 |
| 1394608 | -122.408649 | 37.777311 | 0 |
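
For k-means, predicting a group for a new point amounts to finding the closest cluster centre. The sketch below checks this by hand on made-up data. The arrays `train` and `new_points` are hypothetical, and the `n_init` and `random_state` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical training data: two groups of 2-D points.
train = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train)

new_points = np.array([[0.5, -0.2], [7.5, 8.1], [3.0, 3.0]])
predicted = model.predict(new_points)

# Recompute the assignment by hand: index of the closest centre for each point.
dists = np.linalg.norm(new_points[:, None, :] - model.cluster_centers_[None, :, :], axis=2)
by_hand = dists.argmin(axis=1)

print(predicted, by_hand)
```

The two arrays should match, which is a useful reminder that the group numbers carry no meaning beyond "nearest to centre number so-and-so".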


You can explore this dataset further. For example, by looking at the different categories of crime in the data, you could rank them by severity and use k-means to look for possible hot spots of different types of crime. You can also try using different numbers of clusters to look for, or try running the estimator on the full `sfpd` DataFrame that we loaded at the beginning. You may need to create a new `GMapPlot` rather than adding to the existing `plot`.

**Do not try and plot all 1.8M data points (it will cause all sorts of trouble in your Jupyter Notebook!).**
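
If you experiment with different numbers of clusters, one common way to compare them is to look at the fitted model's `inertia_` (the sum of squared distances from each point to its nearest centre) for each value of `n_clusters`. The sketch below does this on synthetic data from `make_blobs`. The data, the range of `k` values and the other settings are assumptions made for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for a coordinate sample (hypothetical data with 4 groups).
points, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=0)

inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    # inertia_ is the sum of squared distances from each point to its nearest centre.
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Inertia keeps falling as `k` grows, so the lowest value is not automatically the best choice. People often look for the point where the improvement levels off, sometimes called the "elbow".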

In [16]:
# For example, here are all of the categories found in the full dataset
np.array(sfpd['Category'].drop_duplicates())

array(['NON-CRIMINAL', 'OTHER OFFENSES', 'ASSAULT', 'LARCENY/THEFT',
       'SUSPICIOUS OCC', 'BURGLARY', 'VEHICLE THEFT', 'DRUG/NARCOTIC',
       'VANDALISM', 'FORGERY/COUNTERFEITING', 'WARRANTS', 'ROBBERY',
       'FRAUD', 'SEX OFFENSES, FORCIBLE', 'STOLEN PROPERTY',
       'WEAPON LAWS', 'RUNAWAY', 'TRESPASS', 'SECONDARY CODES',
       'MISSING PERSON', 'LIQUOR LAWS', 'DISORDERLY CONDUCT', 'ARSON',
       'DRUNKENNESS', 'FAMILY OFFENSES', 'KIDNAPPING', 'LOITERING',
       'EMBEZZLEMENT', 'DRIVING UNDER THE INFLUENCE', 'PROSTITUTION',
       'GAMBLING', 'SUICIDE', 'EXTORTION', 'BAD CHECKS',
       'SEX OFFENSES, NON FORCIBLE', 'BRIBERY', 'PORNOGRAPHY/OBSCENE MAT',
       'TREA', 'RECOVERED VEHICLE'], dtype=object)

In [17]:
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier

## First step - Turning Category column from categorical to numerical data

In [24]:
lables = preprocessing.LabelEncoder()
lables.fit(sfpd_sample['Category'])
y=lables.transform(sfpd_sample['Category'])
list(lables.classes_)

['ARSON',
 'ASSAULT',
 'BURGLARY',
 'DISORDERLY CONDUCT',
 'DRIVING UNDER THE INFLUENCE',
 'DRUG/NARCOTIC',
 'DRUNKENNESS',
 'EMBEZZLEMENT',
 'FAMILY OFFENSES',
 'FORGERY/COUNTERFEITING',
 'FRAUD',
 'KIDNAPPING',
 'LARCENY/THEFT',
 'MISSING PERSON',
 'NON-CRIMINAL',
 'OTHER OFFENSES',
 'PROSTITUTION',
 'ROBBERY',
 'RUNAWAY',
 'SECONDARY CODES',
 'SEX OFFENSES, FORCIBLE',
 'SEX OFFENSES, NON FORCIBLE',
 'STOLEN PROPERTY',
 'SUSPICIOUS OCC',
 'TRESPASS',
 'VANDALISM',
 'VEHICLE THEFT',
 'WARRANTS',
 'WEAPON LAWS']
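
To see what `LabelEncoder` does on a smaller scale, here is a sketch using just a few category names from the dataset. The list `categories` and the variable names are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# A few category names taken from the dataset above.
categories = ["ROBBERY", "ASSAULT", "ROBBERY", "ARSON"]

encoder = LabelEncoder()
codes = encoder.fit_transform(categories)

# classes_ is sorted, so ARSON -> 0, ASSAULT -> 1, ROBBERY -> 2.
print(list(encoder.classes_))
print(codes)
print(list(encoder.inverse_transform(codes)))
```

Note that the integer codes simply follow the sorted order of the class names. They do not express severity or any other ranking.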

## Fitting KNN classifier

In [22]:
KNN = KNeighborsClassifier(n_neighbors=len(list(lables.classes_)))

In [25]:
KNN.fit(sfpd_sample.loc[:,['X','Y']],y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=29, p=2,
           weights='uniform')

## Trying to predict category of crime based just on lat,long (very naive)

When you're finished with lab 9 (or have completed what you can), choose **Save and Checkpoint** from the **File** menu, then choose **Download as Notebook** and save it to your computer or USB stick. You can then send a copy to the lecturer via Slack or email to check over.

In [31]:
y_precited=KNN.predict(new_sample.loc[:,['X','Y']])
y_precited

array([12, 15, 15, 12,  1], dtype=int64)

## Turning results from KNN predict back to Category names

In [32]:
lables.inverse_transform(y_precited)

array(['LARCENY/THEFT', 'OTHER OFFENSES', 'OTHER OFFENSES',
       'LARCENY/THEFT', 'ASSAULT'], dtype=object)
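
Predicting on five rows does not tell us how good the classifier is. A common way to get a rough estimate is to hold back part of the data and measure accuracy on it. The sketch below shows the pattern on made-up two-class data. The arrays `features` and `targets`, the 25% test split and `n_neighbors=5` are all assumptions for illustration, not settings taken from this lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Hypothetical data: two classes of 2-D points with some overlap.
features = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
targets = np.array([0] * 200 + [1] * 200)

# Hold back a quarter of the rows so the score is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.25, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)  # fraction of held-out points classified correctly
print("held-out accuracy:", accuracy)
```

The same pattern could be applied to `sfpd_sample` with the encoded categories as targets, which would show how naive a location-only prediction of crime category really is.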