## Machine Learning Exercise 3 - K-Means Clustering

For this exercise, you'll be working with a cleaned-up version of the full WeGo data, meaning that you have all rows for each trip.

Your goal in this exercise is to find groupings of similar time points (as identified by TIME_POINT_ABBR).

1. Read in the CSV into a DataFrame named `wego`. First, we need some features to compare time points. One strategy is to create features that describe the distribution of a variable, for example the adherence values. Use the following code to find the 0.05, 0.25, 0.5, 0.75, and 0.95 quantiles of ADHERENCE values for each time point.

```
time_point_quantiles = (
    wego
    .groupby('TIME_POINT_ABBR')
    ['ADHERENCE']
    .quantile([0.05, 0.25, 0.5, 0.75, 0.95])
    .reset_index()
    .rename(columns = {'level_1': 'quantile'})
    .pivot_table(index = 'TIME_POINT_ABBR', 
                 columns = 'quantile', 
                 values = 'ADHERENCE')
)
```

What is each step doing in this code?

2. When performing k-means clustering, we usually want to standardize our features so that we can compare across multiple dimensions. This means that we are going to convert our original values into z-scores. Create a Pipeline object whose first step employs a StandardScaler to standardize the features and whose second step performs KMeans clustering with 2 clusters.

3. How many points end up in each cluster? How do the points in each cluster compare?

4. Try a range of different values for the number of clusters and choose one which you think is appropriate. Inspect the results and compare the resulting clusters.

**Bonus:** Perform clustering on operators (identified by the OPERATOR variable). You'll need to create some features on which to compare operators. Think about what types of aggregate values you could calculate which might be useful for this task.
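As a starting point for the bonus, here is a minimal sketch of building one feature row per operator. It uses a randomly generated stand-in for `wego` (only the `OPERATOR` and `ADHERENCE` column names come from the real data), and the lateness rule (negative adherence means late, with a 6-minute threshold) is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wego DataFrame: the column names match the
# real data, but the values are randomly generated for illustration.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "OPERATOR": rng.choice([1040, 1133, 2088, 2374], size=200),
    "ADHERENCE": rng.normal(loc=-3, scale=5, size=200),
})

# One row per operator, one column per aggregate feature.
operator_features = (
    toy
    .groupby("OPERATOR")["ADHERENCE"]
    .agg(
        mean_adherence="mean",
        std_adherence="std",
        median_adherence="median",
        # Assumption: adherence below -6 minutes counts as late.
        pct_late=lambda s: (s < -6).mean(),
    )
)
print(operator_features)
```

The resulting frame has the same shape as `time_point_quantiles` (one row per entity, one column per feature), so it could be passed to the same kind of scaler-plus-KMeans pipeline.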

In [1]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Load the dataset. It is already cleaned, so we probably don't need to remove missing values.

wego = pd.read_csv('../data/headway_data_clean.csv')

wego.head(2)

Unnamed: 0,CALENDAR_ID,SERVICE_ABBR,ADHERENCE_ID,DATE,ROUTE_ABBR,BLOCK_ABBR,OPERATOR,TRIP_ID,OVERLOAD_ID,ROUTE_DIRECTION_NAME,...,ADJUSTED_LATE_COUNT,ADJUSTED_ONTIME_COUNT,STOP_CANCELLED,PREV_SCHED_STOP_CANCELLED,IS_RELIEF,BLOCK_STOP_ORDER,DWELL_IN_MINS,NextDay_Scheduled,NextDay_Actual_Arrival,NextDay_Actual_Departure
0,120230801,1,99457890,2023-08-01,22,2200,1040,345104,0,TO DOWNTOWN,...,0,1,0,0.0,0,2,6.5,0,0,0
1,120230801,1,99457891,2023-08-01,22,2200,1040,345104,0,TO DOWNTOWN,...,0,1,0,0.0,0,9,0.0,0,0,0


In [3]:
# Build one feature row per time point: the 0.05, 0.25, 0.5, 0.75, and 0.95
# quantiles of ADHERENCE. The comments answer "What is each step doing in this code?"
time_point_quantiles = (
    wego
    .groupby('TIME_POINT_ABBR')                 # one group per time point
    ['ADHERENCE']                               # keep only the adherence values
    .quantile([0.05, 0.25, 0.5, 0.75, 0.95])    # five quantiles per group, indexed by (TIME_POINT_ABBR, quantile)
    .reset_index()                              # index levels become columns; the unnamed quantile level is called 'level_1'
    .rename(columns = {'level_1': 'quantile'})  # give that column a meaningful name
    .pivot_table(index = 'TIME_POINT_ABBR',     # reshape to wide form: one row per time point,
                 columns = 'quantile',          # one column per quantile,
                 values = 'ADHERENCE')          # cells hold the adherence quantile values
)
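To see the intermediate shapes, the chain can be run in two halves on a tiny frame. This is a sketch with made-up adherence values for two hypothetical time points, `A` and `B`; only the column names are taken from the exercise.

```python
import pandas as pd

# Made-up data: two time points with five adherence values each.
toy = pd.DataFrame({
    "TIME_POINT_ABBR": ["A"] * 5 + ["B"] * 5,
    "ADHERENCE": [-4, -2, 0, 2, 4, -10, -5, 0, 5, 10],
})

# First half: long form, one row per (time point, quantile) pair.
long_form = (
    toy
    .groupby("TIME_POINT_ABBR")["ADHERENCE"]
    .quantile([0.25, 0.5, 0.75])
    .reset_index()
)
print(long_form.columns.tolist())  # the quantile level shows up as 'level_1'

# Second half: wide form, one row per time point and one column per quantile.
wide_form = (
    long_form
    .rename(columns={"level_1": "quantile"})
    .pivot_table(index="TIME_POINT_ABBR", columns="quantile", values="ADHERENCE")
)
print(wide_form)
```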


In [4]:
time_point_quantiles

TIME_POINT_ABBR,0.05,0.25,0.50,0.75,0.95
21BK,-14.221666,-5.850000,-2.966666,-0.883333,0.733333
25ACLARK,-3.550000,-0.608333,0.850000,4.475000,6.500000
28&CHARL,-9.254167,-4.000000,-2.016666,-0.550000,0.666666
ARTS,-11.368333,-6.458333,-3.316666,-2.200000,0.950000
BRCJ,-13.336666,-5.083333,-2.433333,-0.850000,0.733333
...,...,...,...,...,...
WE23,-12.200000,-5.683333,-2.816666,-0.824999,0.866666
WE31,-12.486666,-5.616666,-2.750000,-0.733333,0.750000
WHBG,-12.505833,-4.900000,-2.416666,-0.666666,0.966666
WMRT,-15.310833,-4.500000,-1.666666,-0.362499,0.744167


2. When performing k-means clustering, we usually want to standardize our features so that we can compare across multiple dimensions. This means that we are going to convert our original values into z-scores. Create a Pipeline object whose first step employs a StandardScaler to standardize the features and whose second step performs KMeans clustering with 2 clusters.

In [5]:
n_clusters = 2

# Build a pipeline: standardize the features, then run k-means
kmeans = Pipeline(
    steps = [
        ('scaler', StandardScaler()),
        ('cluster', KMeans(n_clusters = n_clusters, n_init = 'auto'))
    ]
)
# Fit both steps on the quantile features
kmeans.fit(time_point_quantiles)
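The z-score conversion performed by the first pipeline step can be checked by hand. The sketch below uses a small made-up array (not the WeGo data) and compares `StandardScaler` with the formula (x - mean) / std applied column by column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: 4 rows, 2 columns on different scales.
X = np.array([[-14.0, 0.7], [-3.5, 6.5], [-9.0, 0.6], [-11.0, 1.0]])

scaled = StandardScaler().fit_transform(X)

# Manual z-scores. StandardScaler uses the population standard deviation
# (ddof=0), which is also NumPy's default for .std().
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no single quantile dominates the Euclidean distances that k-means relies on.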



In [6]:
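# Assign each time point to a cluster.
# .named_steps gives access to the fitted steps of the pipeline by name.
# .predict assigns each row to its nearest fitted centroid.
# Note: calling predict on the 'cluster' step directly skips the scaler, so the raw
# quantiles are compared against centroids that were fitted on standardized values.
# kmeans.predict(time_point_quantiles) would apply both steps.
cluster_labels = kmeans.named_steps['cluster'].predict(time_point_quantiles)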
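The call above goes through the `'cluster'` step only, so the input never passes through the scaler. As a hedged illustration on synthetic data (not the WeGo quantiles), the sketch below contrasts three ways of getting labels: `Pipeline.predict`, the `labels_` attribute stored during fitting, and the step-level `predict` on unscaled input.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: two well-separated groups with features on different scales.
X = np.vstack([
    rng.normal(loc=[-12, 1], scale=[1.0, 0.1], size=(30, 2)),
    rng.normal(loc=[-4, 5], scale=[1.0, 0.1], size=(30, 2)),
])

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("cluster", KMeans(n_clusters=2, n_init=10, random_state=0)),
])
pipe.fit(X)

labels_via_pipeline = pipe.predict(X)                     # scales X, then assigns clusters
labels_from_fit = pipe.named_steps["cluster"].labels_     # labels stored during fit
labels_unscaled = pipe.named_steps["cluster"].predict(X)  # skips the scaler

print(np.array_equal(labels_via_pipeline, labels_from_fit))
print(np.bincount(labels_unscaled, minlength=2))
```

The first two always describe the fitted clustering. The third can differ, because raw values are measured against centroids that live in z-score space.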

3. How many points end up in each cluster? How do the points in each cluster compare?

In [7]:
# Count the time points assigned to each cluster
cluster_counts = np.bincount(cluster_labels)

In [8]:
cluster_counts[0]

8

In [9]:
cluster_counts[1]

53

In [10]:
# Print the counts in a readable format
for cluster_idx, count in enumerate(cluster_counts):
    print(f"Cluster {cluster_idx}: {count} points")

Cluster 0: 8 points
Cluster 1: 53 points
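One way to approach "How do the points in each cluster compare?" is to attach the labels to the feature table and average each quantile within a cluster. The sketch below uses a hypothetical six-row stand-in for `time_point_quantiles`, with made-up values and string column names (`q05`, `q50`, `q95`) chosen for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for time_point_quantiles: 6 toy time points, 3 quantiles.
toy_quantiles = pd.DataFrame(
    {
        "q05": [-14.0, -13.0, -12.5, -3.5, -4.0, -3.0],
        "q50": [-3.0, -2.5, -2.8, 0.9, 1.0, 0.5],
        "q95": [0.7, 0.8, 0.9, 6.5, 6.0, 7.0],
    },
    index=pd.Index(["A", "B", "C", "D", "E", "F"], name="TIME_POINT_ABBR"),
)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("cluster", KMeans(n_clusters=2, n_init=10, random_state=0)),
])
labels = pipe.fit_predict(toy_quantiles)  # fit both steps and return the labels

labeled = toy_quantiles.assign(cluster=labels)
summary = labeled.groupby("cluster").mean()   # average quantile profile per cluster
sizes = labeled.groupby("cluster").size()     # number of time points per cluster

print(summary)
print(sizes)
```

Reading the summary row by row shows what distinguishes the groups, for instance whether one cluster tends to run early while the other has a long late tail.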


4. Try a range of different values for the number of clusters and choose one which you think is appropriate. Inspect the results and compare the resulting clusters.

In [15]:
n_clusters = 3

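# Build a pipeline: standardize the features, then run k-means
kmeans2 = Pipeline(
    steps = [
        ('scaler', StandardScaler()),
        ('cluster', KMeans(n_clusters = n_clusters, n_init = 'auto'))
    ]
)
# Fit both steps on the quantile features
kmeans2.fit(time_point_quantiles)

# Assign each time point to a cluster using the 'cluster' step directly.
# Note: this skips the scaler; kmeans2.predict(time_point_quantiles) would apply both steps.
cluster_labels2 = kmeans2.named_steps['cluster'].predict(time_point_quantiles)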



In [16]:
cluster_counts2 = np.bincount(cluster_labels2)

for cluster_idx, count in enumerate(cluster_counts2):
    print(f"Cluster {cluster_idx}: {count} points")


Cluster 0: 0 points
Cluster 1: 4 points
Cluster 2: 57 points
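Rather than refitting by hand for each value, a loop can collect a score for every candidate number of clusters. This is a sketch on synthetic data with three obvious groups, not the WeGo features; it records the k-means inertia (within-cluster sum of squares) and the silhouette score, two common but not definitive guides for choosing the number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: three well-separated groups of 25 points each.
X = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(25, 2))
    for center in ([0, 0], [6, 6], [0, 6])
])

results = {}
for k in range(2, 7):
    pipe = Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("cluster", KMeans(n_clusters=k, n_init=10, random_state=0)),
    ])
    labels = pipe.fit_predict(X)
    # Score in the same (scaled) space that k-means clustered in.
    X_scaled = pipe.named_steps["scaler"].transform(X)
    results[k] = {
        "inertia": pipe.named_steps["cluster"].inertia_,
        "silhouette": silhouette_score(X_scaled, labels),
    }

for k, scores in results.items():
    print(k, round(scores["inertia"], 2), round(scores["silhouette"], 3))

best_k = max(results, key=lambda k: results[k]["silhouette"])
print("Highest silhouette at k =", best_k)
```

Inertia always tends to fall as k grows, so look for the point where the drop levels off; the silhouette score, by contrast, peaks when clusters are compact and well separated.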


In [17]:
n_clusters = 5

# Build a pipeline: standardize the features, then run k-means
kmeans3 = Pipeline(
    steps = [
        ('scaler', StandardScaler()),
        ('cluster', KMeans(n_clusters = n_clusters, n_init = 'auto'))
    ]
)
# Fit both steps on the quantile features
kmeans3.fit(time_point_quantiles)






In [18]:
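# Assign each time point to a cluster using the 'cluster' step directly.
# Note: this skips the scaler; kmeans3.predict(time_point_quantiles) would apply both steps.
cluster_labels3 = kmeans3.named_steps['cluster'].predict(time_point_quantiles)
cluster_counts3 = np.bincount(cluster_labels3)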

for cluster_idx, count in enumerate(cluster_counts3):
    print(f"Cluster {cluster_idx}: {count} points")

Cluster 0: 0 points
Cluster 1: 4 points
Cluster 2: 0 points
Cluster 3: 57 points
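The output above lists only four clusters even though `n_clusters = 5`. That is because `np.bincount` returns counts only up to the largest label it sees, so a trailing empty cluster is silently dropped. A small sketch with made-up labels shows how the `minlength` argument keeps one entry per cluster.

```python
import numpy as np

# Made-up labels from a hypothetical 5-cluster model in which
# clusters 0, 2, and 4 received no points.
labels = np.array([1, 3, 3, 1, 3])
n_clusters = 5

short_counts = np.bincount(labels)                       # stops at the largest label seen
full_counts = np.bincount(labels, minlength=n_clusters)  # one entry per cluster

for cluster_idx, count in enumerate(full_counts):
    print(f"Cluster {cluster_idx}: {count} points")
```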
