# 04: Using a K-Means Model to cluster the London Bicycle Hires dataset
This project uses a k-means model in BigQuery ML to identify clusters of data in the London Bicycle Hires public dataset.


### Key Concepts: 
- K-Means 
- Unsupervised models
- Geospatial analysis 
- Davies-Bouldin Index

## Objective 

- Create a k-means clustering model
- Make data-driven decisions based on the BigQuery ML visualization of the clusters


## Steps
1. Create the dataset to store the model 
1. Examine the training data 
1. Use the CREATE MODEL statement to create the k-means model
1. Use the ML.PREDICT function to predict the station cluster
1. Use the model to make data-driven decisions

### Step 1: Create the dataset to store the model 
The data comes from the London Bicycle Hires public dataset, which covers 2011 to the present and includes timestamps, station names, and ride durations. The dataset that stores the model was created in the EU region, because it must be in the same location as the public dataset.
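The dataset can also be created with a DDL statement instead of the console. The following is a minimal sketch, assuming the dataset name is the one referenced by the CREATE MODEL statement in Step 3; adjust it if your dataset is named differently.

```sql
-- Sketch: create the dataset that will hold the model, in the EU location.
-- The dataset name is an assumption taken from the CREATE MODEL statement
-- in Step 3; it is backtick-quoted because it starts with a digit.
CREATE SCHEMA IF NOT EXISTS `12_bqml_exporting_model_prediction`
OPTIONS (
  location = 'EU'
);
```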

### Step 2: Examine the training data
The `london_bicycles` dataset contains the tables needed: `cycle_hire` and `cycle_stations`. Because k-means is an unsupervised learning technique, model training requires neither labels nor a split of the data into training and evaluation sets. The following query compiles the training data.

```sql
WITH
 hs AS (
 SELECT
   h.start_station_name AS station_name,
   IF
   (EXTRACT(DAYOFWEEK
     FROM
       h.start_date) = 1
     OR EXTRACT(DAYOFWEEK
     FROM
       h.start_date) = 7,
     "weekend",
     "weekday") AS isweekday,
   h.duration,
   ST_DISTANCE(ST_GEOGPOINT(s.longitude,
       s.latitude),
     ST_GEOGPOINT(-0.1,
       51.5))/1000 AS distance_from_city_center
 FROM
   `bigquery-public-data.london_bicycles.cycle_hire` AS h
 JOIN
   `bigquery-public-data.london_bicycles.cycle_stations` AS s
 ON
   h.start_station_id = s.id
 WHERE
   h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
   AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
 stationstats AS (
 SELECT
   station_name,
   AVG(duration) AS duration,
   COUNT(duration) AS num_trips,
   MAX(distance_from_city_center) AS distance_from_city_center
 FROM
   hs
 GROUP BY
   station_name )
SELECT
 *
FROM
 stationstats
ORDER BY
 distance_from_city_center ASC
```

#### Results
![01](assets/01.png "results")
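The two derived columns in the query above can be checked in isolation. The sketch below uses three sample timestamps and one sample coordinate, all chosen only for illustration. It shows that `DAYOFWEEK` returns 1 for Sunday and 7 for Saturday, and that `ST_DISTANCE` returns meters, which is why the query divides by 1000 to get kilometers.

```sql
-- Sketch: sanity-check the weekend flag and the distance calculation.
-- The sample timestamps and the sample point (-0.2, 51.5) are illustrative only.
SELECT
  d AS sample_date,
  EXTRACT(DAYOFWEEK FROM d) AS day_of_week,  -- 1 = Sunday ... 7 = Saturday
  IF(EXTRACT(DAYOFWEEK FROM d) IN (1, 7), "weekend", "weekday") AS isweekday,
  -- ST_GEOGPOINT takes (longitude, latitude); ST_DISTANCE returns meters.
  ST_DISTANCE(ST_GEOGPOINT(-0.2, 51.5),
              ST_GEOGPOINT(-0.1, 51.5)) / 1000 AS sample_distance_km
FROM
  UNNEST([
    TIMESTAMP '2015-01-03 12:00:00',  -- a Saturday
    TIMESTAMP '2015-01-04 12:00:00',  -- a Sunday
    TIMESTAMP '2015-01-05 12:00:00'   -- a Monday
  ]) AS d
ORDER BY
  sample_date
```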


### Step 3: Use the CREATE MODEL statement to create the k-means model
The entity being clustered is the station (`station_name`), and the clustering is based on station attributes, for example the distance of the station from the city center. The station name itself is excluded from the training features.

Clustering of bike stations was based on the following attributes:
- Duration of rentals
- Number of trips (counted per station and day type over 2015)
- Distance from city center

```sql
CREATE OR REPLACE MODEL
 12_bqml_exporting_model_prediction.london_station_clusters OPTIONS(model_type='kmeans',
   num_clusters=4) AS
WITH
 hs AS (
 SELECT
   h.start_station_name AS station_name,
 IF
   (EXTRACT(DAYOFWEEK
     FROM
       h.start_date) = 1
     OR EXTRACT(DAYOFWEEK
     FROM
       h.start_date) = 7,
     "weekend",
     "weekday") AS isweekday,
   h.duration,
   ST_DISTANCE(ST_GEOGPOINT(s.longitude,
       s.latitude),
     ST_GEOGPOINT(-0.1,
       51.5))/1000 AS distance_from_city_center
 FROM
   `bigquery-public-data.london_bicycles.cycle_hire` AS h
 JOIN
   `bigquery-public-data.london_bicycles.cycle_stations` AS s
 ON
   h.start_station_id = s.id
 WHERE
   h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
   AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
 stationstats AS (
 SELECT
   station_name,
   isweekday,
   AVG(duration) AS duration,
   COUNT(duration) AS num_trips,
   MAX(distance_from_city_center) AS distance_from_city_center
 FROM
   hs
 GROUP BY
   station_name, isweekday)
SELECT
 * EXCEPT(station_name, isweekday)
FROM
 stationstats

```

#### Results 

The evaluation results show the **Davies-Bouldin Index** and the **Mean Squared Distance**.

*The Davies-Bouldin index is a validation metric that is often used to choose the number of clusters. It is based on the ratio of within-cluster scatter to between-cluster separation, so a lower value indicates better clustering.*

*The mean squared distance measures the intra-cluster variance: the average squared distance from each point to its assigned centroid. It is closely related to the WCSS (within-cluster sum of squares), and a lower value means tighter, more compact clusters.*
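For reference, the two metrics can be written out using their standard textbook definitions. This is a sketch only: the symbols are introduced here for illustration, and the exact computation inside BigQuery ML (for example, on standardized features) may differ in detail.

```math
\mathrm{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d_{ij}},
\qquad
\mathrm{MSD} = \frac{1}{n} \sum_{m=1}^{n} \lVert x_m - c(x_m) \rVert^2
```

Here $k$ is the number of clusters, $s_i$ is the average distance from the points in cluster $i$ to its centroid, $d_{ij}$ is the distance between the centroids of clusters $i$ and $j$, $n$ is the number of training rows, and $c(x_m)$ is the centroid assigned to point $x_m$.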

![02](assets/02.png "results")
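The same two metrics can also be retrieved with a query rather than from the console. A minimal sketch, assuming the model name used in the CREATE MODEL statement above:

```sql
-- Sketch: fetch the evaluation metrics of the trained k-means model.
-- For k-means models, ML.EVALUATE needs no input data and returns
-- the Davies-Bouldin index and the mean squared distance.
SELECT
  davies_bouldin_index,
  mean_squared_distance
FROM
  ML.EVALUATE(MODEL `12_bqml_exporting_model_prediction.london_station_clusters`)
```

Running this after retraining with a different `num_clusters` value is one way to compare candidate cluster counts.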

The Numerical Features tab displays visualizations of the clusters identified by the k-means model. Under Numerical features, bar graphs display up to 10 of the most important numerical feature values for each centroid. 

![03](assets/03.png "results")
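The centroid values behind these bar graphs can be read as a table with `ML.CENTROIDS`. A minimal sketch, again assuming the model name from Step 3:

```sql
-- Sketch: list each centroid's value for every numerical feature.
-- ML.CENTROIDS returns one row per (centroid_id, feature) pair.
SELECT
  centroid_id,
  feature,
  ROUND(numerical_value, 2) AS numerical_value
FROM
  ML.CENTROIDS(MODEL `12_bqml_exporting_model_prediction.london_station_clusters`)
ORDER BY
  centroid_id,
  feature
```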

### Step 4: Use the ML.PREDICT function to predict the station cluster

The ML.PREDICT function predicts the cluster for a given set of stations. The following query predicts clusters for all stations whose names contain the string `Kennington`.

```sql
WITH
 hs AS (
 SELECT
   h.start_station_name AS station_name,
   IF
   (EXTRACT(DAYOFWEEK
     FROM
       h.start_date) = 1
     OR EXTRACT(DAYOFWEEK
     FROM
       h.start_date) = 7,
     "weekend",
     "weekday") AS isweekday,
   h.duration,
   ST_DISTANCE(ST_GEOGPOINT(s.longitude,
       s.latitude),
     ST_GEOGPOINT(-0.1,
       51.5))/1000 AS distance_from_city_center
 FROM
   `bigquery-public-data.london_bicycles.cycle_hire` AS h
 JOIN
   `bigquery-public-data.london_bicycles.cycle_stations` AS s
 ON
   h.start_station_id = s.id
 WHERE
   h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
   AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
 stationstats AS (
 SELECT
   station_name,
   AVG(duration) AS duration,
   COUNT(duration) AS num_trips,
   MAX(distance_from_city_center) AS distance_from_city_center
 FROM
   hs
 GROUP BY
   station_name )
SELECT
 * EXCEPT(nearest_centroids_distance)
FROM
 ML.PREDICT( MODEL 12_bqml_exporting_model_prediction.london_station_clusters,
   (
   SELECT
     *
   FROM
     stationstats
   WHERE
     REGEXP_CONTAINS(station_name, 'Kennington')))
```
#### Results: Understanding the Data 

![04](assets/04.png "results")

### Step 5: Use the model to make data-driven decisions

Lastly, the model was used to make data-driven decisions. For example, based on the model results, you can determine which stations would benefit from extra capacity.

![05](assets/05.png "results")

Cluster #3 shows a busy city station that is close to the city center, as indicated by its `num_trips` value. Cluster #2 shows a second city station that is less busy.

Cluster #1 shows a less busy suburban station with longer rental durations. Cluster #4 shows another suburban station with shorter trips. Based on these results, you can use the data to inform your decisions.

For example:
1. Assume that you need to experiment with a new type of lock. Which cluster of stations should you choose as the subject for this experiment?
1. Assume that you want to stock some stations with racing bikes. Which stations should you choose?
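Once a cluster has been chosen, the stations that belong to it can be listed by running ML.PREDICT over all stations and filtering on `centroid_id`. The query below is a sketch that reuses the station statistics from Step 4. The cluster number in the `WHERE` clause is a hypothetical choice, and it assumes the cluster numbers discussed above correspond to the model's `centroid_id` values.

```sql
-- Sketch: list the stations assigned to one cluster, busiest first.
WITH
  hs AS (
    SELECT
      h.start_station_name AS station_name,
      h.duration,
      ST_DISTANCE(ST_GEOGPOINT(s.longitude, s.latitude),
                  ST_GEOGPOINT(-0.1, 51.5)) / 1000 AS distance_from_city_center
    FROM
      `bigquery-public-data.london_bicycles.cycle_hire` AS h
    JOIN
      `bigquery-public-data.london_bicycles.cycle_stations` AS s
    ON
      h.start_station_id = s.id
    WHERE
      h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
      AND CAST('2016-01-01 00:00:00' AS TIMESTAMP)
  ),
  stationstats AS (
    SELECT
      station_name,
      AVG(duration) AS duration,
      COUNT(duration) AS num_trips,
      MAX(distance_from_city_center) AS distance_from_city_center
    FROM
      hs
    GROUP BY
      station_name
  )
SELECT
  station_name,
  centroid_id,
  num_trips,
  duration,
  distance_from_city_center
FROM
  ML.PREDICT(MODEL `12_bqml_exporting_model_prediction.london_station_clusters`,
    (SELECT * FROM stationstats))
WHERE
  centroid_id = 1  -- hypothetical target cluster; replace as needed
ORDER BY
  num_trips DESC
```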


