# Part 3: K-means clustering

### Dataset
In this exercise we will use the dataset of Uber pickups in the New York City area during September 2014, available at https://github.com/fivethirtyeight/uber-tlc-foil-response/tree/master/uber-trip-data

In [None]:
import pandas as pd
df = pd.read_csv('data/uber-raw-data-sep14.csv')
print(len(df))
print(df.head())
print(df.dtypes)

We can drop the pickups that fall outside a rough bounding box around Manhattan.

In [None]:
df = df[(df['Lon'] > -74.02) & (df['Lon'] < -73.94)]
df = df[(df['Lat'] > 40.7) & (df['Lat'] < 40.8)]
len(df)

Split the `Date/Time` column into separate `Date` and `Time` columns.

In [None]:
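# expand=True returns one column per piece; tuple-unpacking the .str accessor is not supported in recent pandas
df[['Date', 'Time']] = df['Date/Time'].str.split(' ', expand=True)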
df = df.drop(labels=['Date/Time', 'Base'], axis=1)

Convert the string columns into datetime-specific types.

In [None]:
df['Date'] = pd.to_datetime(df['Date'], format="%m/%d/%Y")
df['Time'] = pd.to_timedelta(df['Time'])

Create a new column with the weekday number based on the date.

In [None]:
df['Weekday'] = df['Date'].dt.dayofweek  # 0 = Monday, ..., 6 = Sunday
print(df.dtypes)
df.head()
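
As a quick sanity check of the weekday numbering, the sketch below applies `dt.dayofweek` to three hand-picked dates from September 2014 (chosen here for illustration, not read from the dataset). pandas counts Monday as 0, so Saturday and Sunday come out as 5 and 6.

```python
import pandas as pd

# Three illustrative dates: a Monday, a Saturday and a Sunday in September 2014
dates = pd.to_datetime(pd.Series(['9/1/2014', '9/6/2014', '9/7/2014']), format="%m/%d/%Y")

weekday_numbers = dates.dt.dayofweek.tolist()
day_names = dates.dt.day_name().tolist()
print(list(zip(day_names, weekday_numbers)))
```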

Now we can filter the dataset to get pickups from specific days, hours or weekdays.

For example, to get the pickups that happened on weekend mornings (between 08:00 and 10:00 on Saturday or Sunday):

In [None]:
X = df[(df['Time'] > '08:00:00') & (df['Time'] < '10:00:00') & (df['Weekday'] >= 5)]
X = X[['Lon', 'Lat']]
len(X)
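
The same time-window-plus-weekday pattern is reused several times in this notebook, so it can be wrapped in a small function. The sketch below is one possible way to do that; `filter_pickups` and the `sample` frame are hypothetical names introduced here, and the bounds are converted with `pd.Timedelta` explicitly instead of relying on string comparison.

```python
import pandas as pd

def filter_pickups(frame, start, end, weekdays):
    """Return rows with start < Time < end whose Weekday is in `weekdays`.

    Hypothetical helper, not part of the original notebook.
    """
    start, end = pd.Timedelta(start), pd.Timedelta(end)
    in_window = (frame['Time'] > start) & (frame['Time'] < end)
    return frame[in_window & frame['Weekday'].isin(weekdays)]

# Tiny hand-made frame with the same columns as the prepared dataset
sample = pd.DataFrame({
    'Lat': [40.75, 40.76, 40.72, 40.73],
    'Lon': [-73.98, -73.99, -74.00, -73.97],
    'Time': pd.to_timedelta(['08:30:00', '09:15:00', '12:00:00', '09:45:00']),
    'Weekday': [5, 6, 5, 2],
})

weekend_mornings = filter_pickups(sample, '08:00:00', '10:00:00', weekdays=[5, 6])
print(len(weekend_mornings))
```

Only the first two rows pass both conditions: the third is outside the time window and the fourth falls on a Wednesday.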

With the data prepared, we can train a model to cluster the pickup points. Clustering algorithms cannot be scored the same way as supervised learning algorithms because we do not have labelled data. Therefore we do not split our data into training and testing sets.

In this case, we can skip feature preprocessing because latitude and longitude are expressed in the same unit (degrees). Otherwise we would need to scale the features to avoid bias, since features with larger numerical ranges would have a greater impact on the distances that k-means computes.
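
Both coordinates are in degrees, but a degree does not cover the same ground distance in each direction. The rough calculation below, which assumes a spherical Earth with a mean radius of 6371 km, estimates how long one degree of latitude and one degree of longitude are at about 40.75° N.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical approximation (an assumption)
latitude_deg = 40.75      # roughly mid-Manhattan

# Length of one degree along a meridian, and along the parallel at this latitude
km_per_deg_lat = math.pi * EARTH_RADIUS_KM / 180
km_per_deg_lon = km_per_deg_lat * math.cos(math.radians(latitude_deg))

print(f"1 degree of latitude  ~ {km_per_deg_lat:.1f} km")
print(f"1 degree of longitude ~ {km_per_deg_lon:.1f} km")
print(f"ratio ~ {km_per_deg_lon / km_per_deg_lat:.2f}")
```

A degree of longitude is about three quarters as long as a degree of latitude here, so Euclidean distance on raw degrees gives east-west separation somewhat more weight than north-south separation. If that matters for an analysis, one option is to multiply `Lon` by the cosine of the latitude before clustering.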

There is no perfect method for selecting the number of clusters, so we can start with an arbitrary number and then adjust it after visualizing the results.
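
Besides visual inspection, two common heuristics can support the choice: the inertia (within-cluster sum of squared distances, which shrinks as clusters are added) and the silhouette score (higher means better-separated clusters). The sketch below computes both for a range of `k` on synthetic points, not on the Uber data; the three blob centres, the noise scale and the fixed `random_state` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for pickup coordinates: three blobs around made-up centres
rng = np.random.default_rng(0)
true_centers = np.array([[-73.99, 40.72], [-73.98, 40.76], [-73.96, 40.78]])
points = np.vstack([c + rng.normal(scale=0.003, size=(200, 2)) for c in true_centers])

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    results[k] = (km.inertia_, silhouette_score(points, km.labels_))
    print(k, f"inertia={results[k][0]:.5f}", f"silhouette={results[k][1]:.3f}")

# Pick the k with the highest silhouette score
best_k = max(results, key=lambda k: results[k][1])
print("k with the highest silhouette:", best_k)
```

On real pickup data the picture is usually less clear-cut than on well-separated blobs, so these scores are a guide to narrow down candidates, not a replacement for looking at the plots.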

In [None]:
from sklearn.cluster import KMeans

model = KMeans(n_clusters=7)
model.fit(X)

# Column order must match the training data: Lon first, then Lat
coords = pd.DataFrame.from_records([(-73.98847, 40.72143)], columns=['Lon', 'Lat'])
print(coords, model.predict(coords))
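
`predict` assigns a point to the cluster whose center is closest in Euclidean distance. The sketch below reproduces that by hand with NumPy and compares the result with `predict`; the training points are random stand-ins for the pickup data, and `km`, `points` and `query` are names introduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up (Lon, Lat) points standing in for the pickup data
rng = np.random.default_rng(1)
points = rng.uniform(low=[-74.02, 40.70], high=[-73.94, 40.80], size=(500, 2))
km = KMeans(n_clusters=7, n_init=10, random_state=0).fit(points)

query = np.array([[-73.98847, 40.72143]])  # Lon first, then Lat, matching the fit

# Euclidean distance from the query to every centroid; the closest one wins
distances = np.linalg.norm(km.cluster_centers_ - query, axis=1)
nearest = int(np.argmin(distances))

print(nearest, km.predict(query)[0])
```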

Enable inline matplotlib visualizations.

In [None]:
%matplotlib inline

Import matplotlib and set the plot style.

In [None]:
import matplotlib.pyplot as plt
import matplotlib

matplotlib.style.use('ggplot')

Create a scatter plot with the cluster centers and the pickup points colored by cluster.

In [None]:
centers = model.cluster_centers_
# To color each point by its cluster, build a list with one color per point, indexed by cluster label (7 colors for 7 clusters)
colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']
point_colors = [colors[label] for label in model.labels_]

# to display two datasets on one plot we must create the plot as subplot on the figure
# and then create two scatter plots using the same axes
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111)
ax.scatter(X.Lon, X.Lat, marker='.', c=point_colors, alpha=0.8)
ax.scatter(centers[:, 0], centers[:, 1], marker='x', c='k', alpha=0.8, linewidths=3, s=169)
plt.show()

Now we can create a function to calculate clustering and visualize cluster centers for different timeframes.

In [None]:
def show_clusters(X, num_clusters=7):
    X = X[['Lon', 'Lat']]
    print(len(X), "records")
    model = KMeans(n_clusters=num_clusters)
    model.fit(X)
    centers = model.cluster_centers_
    # We can change the default size of generated plot
    fig = plt.figure(figsize=(10, 7))
    ax = fig.add_subplot(111)
    ax.scatter(X.Lon, X.Lat, marker='.', c='b', alpha=0.8)
    ax.scatter(centers[:, 0], centers[:, 1], marker='x', c='k', alpha=0.8, linewidths=3, s=169)
    plt.show()

Saturday night

In [None]:
show_clusters(df[(df['Time'] > '22:00:00') & (df['Weekday'] == 5)])

Monday afternoon

In [None]:
show_clusters(df[(df['Time'] > '16:00:00') & (df['Time'] < '18:00:00') & (df['Weekday'] == 0)])

To help us understand the results, we can display the cluster centers on a map using the Folium package.

The following function computes the clustering and shows the cluster centers for a given timeframe on an interactive New York City map.

In [None]:
import folium
from IPython.display import display

def show_clusters_on_nyc_map(X, num_clusters=7):
    X = X[['Lon', 'Lat']]
    print(len(X), "records")
    model = KMeans(n_clusters=num_clusters)
    model.fit(X)
    centers = model.cluster_centers_
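    # Initialize the map.
    # The original used tiles='Stamen Toner', which recent folium releases no longer bundle;
    # 'CartoDB positron' is a built-in light basemap used here as a substitute.
    nyc_map = folium.Map(location=[40.75, -73.98], tiles='CartoDB positron', zoom_start=12)
    # Add a marker for each centroid (folium expects [lat, lon]; centers are stored as [lon, lat])
    for centroid in centers:
        folium.Marker([centroid[1], centroid[0]],
                      icon=folium.Icon(color='red', icon='flag'),
                      popup=str(centroid)).add_to(nyc_map)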
    # display map inline
    display(nyc_map)

Display Monday afternoon cluster centers

In [None]:
show_clusters_on_nyc_map(df[(df['Time'] > '16:00:00') & (df['Time'] < '18:00:00') & (df['Weekday'] == 0)])

Display Saturday night cluster centers

In [None]:
show_clusters_on_nyc_map(df[(df['Time'] > '22:00:00') & (df['Weekday'] == 5)])

Display weekend morning cluster centers

In [None]:
show_clusters_on_nyc_map(df[(df['Time'] > '08:00:00') & (df['Time'] < '10:00:00') & (df['Weekday'] >= 5)])