## 3.0 Cluster Analysis

**Tasks**: Based on the taxi trip patterns, can you identify clusters of trip types and/or customer types? How would you label these clusters? 

**Methods**: Identify clusters with soft-clustering and visualize your results. Compare your results to a hard-clustering method of your choice. You can use additional features like “distance to city center”, expressive hourly resolutions (e.g., “bar hours”, “morning commuting”), or even land-use/POI data.
Furthermore, can you identify spatial hot spots for trip demand using Gaussian Mixture Models (i.e., using Spatial Kernel Density Estimation)?
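
To make the difference between soft and hard clustering concrete, here is a minimal sketch on synthetic two-dimensional data. The blobs and all variable names are illustrative assumptions, not the taxi features. `GaussianMixture.predict_proba` returns one membership probability per cluster, whereas `KMeans` assigns exactly one label per row.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data: two overlapping blobs (not the taxi data)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[3.0, 3.0], scale=1.0, size=(200, 2)),
])

# Soft clustering: each row gets a probability for every component
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_demo)
soft_probs = gmm.predict_proba(X_demo)   # shape (400, 2), rows sum to 1

# Hard clustering: each row gets exactly one label
kmeans_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
hard_labels = kmeans_demo.labels_        # shape (400,)

# Rows whose largest membership probability is low sit between clusters
ambiguous = int((soft_probs.max(axis=1) < 0.9).sum())
print(f"Ambiguous rows under the GMM: {ambiguous}")
```

The 0.9 threshold is an arbitrary choice for the illustration; rows below it are the ones for which a hard label hides the most uncertainty.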

#### Outline of this notebook:
1. Helper Functions
2. Adding Additional Features
3. Feature Selection
4. ...

In [70]:
# Hexagon resolution to work with for the rest of the notebook
RES = 8

In [42]:
# Standard libraries - run pip install if necessary
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from datetime import datetime

# Geospatial libraries
from h3 import h3 
import geopandas as gp
import folium
from shapely.ops import unary_union
from shapely.geometry.polygon import Polygon
# Color maps for folium
import branca
import branca.colormap as cm

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

In [54]:
taxi_df = pd.read_csv("data/prepped/prep_taxidata.csv")

In [55]:
taxi_df.head(3)

Unnamed: 0.1,Unnamed: 0,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,fare,tips,tolls,extras,trip_total,...,pickup_area_number,dropoff_community,dropoff_area_number,h3_res7_pickup,h3_res7_dropoff,h3_res8_pickup,h3_res8_dropoff,payment_type_encoded,company_encoded,taxi_id_encoded
0,0,2081.0,4.42,,,20.5,0.0,0.0,0.0,20.5,...,2,UPTOWN,3,872664d8effffff,872664d89ffffff,882664d8e1fffff,882664d897fffff,0,0,0
1,1,812.0,0.0,,,13.84,2.73,0.0,0.0,16.57,...,8,WEST TOWN,24,872664c1effffff,872664cacffffff,882664c1edfffff,882664cac3fffff,1,0,1
2,2,600.0,0.9,,,7.0,2.0,0.0,3.0,12.0,...,8,NEAR NORTH SIDE,8,872664c1effffff,872664c1effffff,882664c1edfffff,882664c1edfffff,2,1,2


In [56]:
taxi_df.isna().sum()

Unnamed: 0                    0
trip_seconds                  0
trip_miles                    0
pickup_census_tract     2884556
dropoff_census_tract    2884556
fare                          0
tips                          0
tolls                         0
extras                        0
trip_total                    0
trip_hours                    0
hour                          0
4_hour_window                 0
6_hour_window                 0
weekday                       0
month                         0
pickup_community              0
pickup_area_number            0
dropoff_community             0
dropoff_area_number           0
h3_res7_pickup                0
h3_res7_dropoff               0
h3_res8_pickup                0
h3_res8_dropoff               0
payment_type_encoded          0
company_encoded               0
taxi_id_encoded               0
dtype: int64

In [61]:
# Note: the results are not assigned back, so the columns in taxi_df keep their int64 dtype
pd.to_numeric(taxi_df["pickup_area_number"], downcast='integer')
pd.to_numeric(taxi_df["dropoff_area_number"], downcast='integer')

0           3
1          24
2           8
3           8
4           8
           ..
5320304    22
5320305    32
5320306    77
5320307     3
5320308     8
Name: dropoff_area_number, Length: 5320309, dtype: int8

In [64]:
taxi_df.dtypes

Unnamed: 0                int64
trip_seconds            float64
trip_miles              float64
pickup_census_tract     float64
dropoff_census_tract    float64
fare                    float64
tips                    float64
tolls                   float64
extras                  float64
trip_total              float64
trip_hours              float64
hour                      int64
4_hour_window             int64
6_hour_window             int64
weekday                  object
month                    object
pickup_community         object
pickup_area_number        int64
dropoff_community        object
dropoff_area_number       int64
h3_res7_pickup           object
h3_res7_dropoff          object
h3_res8_pickup           object
h3_res8_dropoff          object
payment_type_encoded      int64
company_encoded           int64
taxi_id_encoded           int64
dtype: object

### 1.0 Helper functions

In [9]:
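# Scaling: standardize every column to zero mean and unit variance
def scale_df(df):
    scaler = StandardScaler()
    scaled = scaler.fit_transform(df)
    # fit_transform returns a NumPy array, so restore column names and index from the input
    return pd.DataFrame(scaled, columns=df.columns, index=df.index)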

In [7]:
# Feature Selection
def feature_selection_kmeans(df, maxvars=3, kmin=2, kmax=8, cut_off=0.5, random_state=1984):
    """
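    Greedy forward feature selection using K-means clustering and the silhouette score.
    For every k in [kmin, kmax], features are added one at a time (up to maxvars), each time
    keeping the feature that yields the highest silhouette score.
    Returns a tuple of the selected feature names, the optimal number of clusters, and the cluster assignments.
    Note: cut_off is currently unused.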
    """
    kmeans_kwargs = {
        "init": "random",
        "n_init": 20,
        "max_iter": 1000,
        "random_state": random_state
    }

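    results_for_each_k = []
    vars_for_each_k = {}

    for k in range(kmin, kmax + 1):
        # Reset the candidate pool for every k; otherwise features removed
        # for one k would no longer be available for the next
        cols = list(df.columns)
        selected_variables = []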
        while len(selected_variables) < maxvars:
            results = []
            for col in cols:
                scols = selected_variables + [col]
                kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
                kmeans.fit(df[scols])
                results.append(silhouette_score(df[scols], kmeans.predict(df[scols])))
            
            selected_var = cols[np.argmax(results)]
            selected_variables.append(selected_var)
            cols.remove(selected_var)
        
        results_for_each_k.append(max(results))
        vars_for_each_k[k] = selected_variables

    best_k = np.argmax(results_for_each_k) + kmin
    selected_variables = vars_for_each_k[best_k]

    kmeans = KMeans(n_clusters=best_k, **kmeans_kwargs)
    kmeans.fit(df[selected_variables])
    clusters = kmeans.predict(df[selected_variables])

    return selected_variables, best_k, clusters
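
The silhouette score relies on pairwise distances, which becomes expensive for millions of trips. One option, sketched here on synthetic data, is the `sample_size` argument of `silhouette_score`, which scores a random subsample instead of all rows. The data, the sample size, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: two well-separated blobs
rng = np.random.default_rng(1984)
X_blobs = np.vstack([
    rng.normal(loc=-5.0, scale=0.5, size=(1000, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(1000, 2)),
])

blob_labels = KMeans(n_clusters=2, n_init=10, random_state=1984).fit_predict(X_blobs)

# Score only 500 randomly drawn rows instead of all 2000
approx_score = silhouette_score(
    X_blobs, blob_labels, sample_size=500, random_state=1984
)
print(f"Approximate silhouette score: {approx_score:.3f}")
```

Alternatively, the whole feature selection could be run on a random sample of the trips (for example via `DataFrame.sample`), at the cost of some variance in the selected features.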

In [65]:
taxi_df.head()

Unnamed: 0.1,Unnamed: 0,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,fare,tips,tolls,extras,trip_total,...,pickup_area_number,dropoff_community,dropoff_area_number,h3_res7_pickup,h3_res7_dropoff,h3_res8_pickup,h3_res8_dropoff,payment_type_encoded,company_encoded,taxi_id_encoded
0,0,2081.0,4.42,,,20.5,0.0,0.0,0.0,20.5,...,2,UPTOWN,3,872664d8effffff,872664d89ffffff,882664d8e1fffff,882664d897fffff,0,0,0
1,1,812.0,0.0,,,13.84,2.73,0.0,0.0,16.57,...,8,WEST TOWN,24,872664c1effffff,872664cacffffff,882664c1edfffff,882664cac3fffff,1,0,1
2,2,600.0,0.9,,,7.0,2.0,0.0,3.0,12.0,...,8,NEAR NORTH SIDE,8,872664c1effffff,872664c1effffff,882664c1edfffff,882664c1edfffff,2,1,2
3,3,546.0,0.85,,,6.5,0.0,0.0,0.0,6.5,...,8,NEAR NORTH SIDE,8,872664c1effffff,872664c1effffff,882664c1edfffff,882664c1edfffff,3,2,3
4,4,574.0,0.33,,,6.25,0.0,0.0,0.0,6.25,...,8,NEAR NORTH SIDE,8,872664c1effffff,872664c1effffff,882664c1edfffff,882664c1edfffff,3,3,4


### 2.0 Adding Additional Features 
You can use additional features like “distance to city center”, expressive hourly resolutions (e.g., “bar hours”, “morning commuting”), or even land-use/POI data.
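
As a minimal sketch of such an expressive hourly resolution, the `hour` column could be binned into named periods with `pd.cut`. The bin edges and labels below are hypothetical and would need to be adjusted to the demand patterns actually observed in the data.

```python
import pandas as pd

# Hypothetical bin edges and labels (assumptions, not taken from the data)
period_bins = [-1, 5, 9, 15, 19, 23]
period_labels = ["night", "morning_commute", "midday", "evening_commute", "bar_hours"]

hours_demo = pd.DataFrame({"hour": [0, 7, 12, 17, 22]})

# Intervals are right-inclusive: (-1, 5], (5, 9], (9, 15], (15, 19], (19, 23]
hours_demo["day_period"] = pd.cut(
    hours_demo["hour"], bins=period_bins, labels=period_labels
)
print(hours_demo)
```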

#### 2.0.1 Distance to city center

In [90]:
# City center = hexagon with the most pickups
center_hex = taxi_df[f'h3_res{RES}_pickup'].value_counts().idxmax()

print(f'Center hexagon: {center_hex}')
taxi_df[taxi_df[f'h3_res{RES}_pickup'] == center_hex].iloc[0].pickup_community

Center hexagon: 882664c1edfffff


'NEAR NORTH SIDE'

In [73]:
# Calculate grid distances
def calculate_h3_grid_distances(df, center_hex):
    def h3_grid_distance(h1, h2):
        return h3.h3_distance(h1, h2)

    df['dist_from_center_pickup'] = df[f'h3_res{RES}_pickup'].apply(lambda x: h3_grid_distance(x, center_hex))
    df['dist_from_center_dropoff'] = df[f'h3_res{RES}_dropoff'].apply(lambda x: h3_grid_distance(x, center_hex))

    return df

In [75]:
taxi_df = calculate_h3_grid_distances(taxi_df, center_hex)
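
Because many trips share the same hexagon, the distance only needs to be computed once per unique hexagon and can then be mapped back to the rows. The sketch below shows this pattern with a stand-in distance function. Both `map_distance_by_unique` and `toy_distance` are hypothetical helpers; in the notebook, the distance function would be the H3 grid distance used above.

```python
import pandas as pd

def map_distance_by_unique(values, center, distance_fn):
    """Compute distance_fn once per unique value and map it back to every row."""
    lookup = {v: distance_fn(v, center) for v in values.unique()}
    return values.map(lookup)

# Stand-in for the H3 grid distance, only so the sketch runs without h3
def toy_distance(a, b):
    return abs(len(a) - len(b))

hexes_demo = pd.Series(["aa", "a", "aaaa", "aa", "a"])
dist_demo = map_distance_by_unique(hexes_demo, "a", toy_distance)
print(dist_demo.tolist())
```

With only a few thousand distinct hexagons but millions of trips, this avoids calling the distance function once per row.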

#### 2.0.2 Trip Direction

In [85]:
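# Get the direction of the trip relative to the city center
# moving_towards_center > 0: the trip ends closer to the center than it started
# moving_towards_center < 0: the trip ends farther from the center than it started
# moving_towards_center == 0: pickup and dropoff are equally far from the center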
taxi_df['moving_towards_center'] = taxi_df['dist_from_center_pickup'] - taxi_df['dist_from_center_dropoff'] 
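
For clustering or labeling, the sign of this difference can also be turned into a categorical feature. A minimal sketch on a small demo frame; the column name `trip_direction` and the label names are hypothetical.

```python
import numpy as np
import pandas as pd

direction_demo = pd.DataFrame({
    "dist_from_center_pickup": [13, 0, 0, 6],
    "dist_from_center_dropoff": [8, 4, 0, 23],
})
direction_demo["moving_towards_center"] = (
    direction_demo["dist_from_center_pickup"]
    - direction_demo["dist_from_center_dropoff"]
)

# Hypothetical categorical label derived from the sign of the difference
direction_demo["trip_direction"] = np.sign(
    direction_demo["moving_towards_center"]
).map({1: "towards_center", 0: "same_ring", -1: "away_from_center"})
print(direction_demo)
```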

In [93]:
taxi_df[taxi_df['moving_towards_center'] < 0].head(3)

Unnamed: 0.1,Unnamed: 0,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,fare,tips,tolls,extras,trip_total,...,h3_res7_pickup,h3_res7_dropoff,h3_res8_pickup,h3_res8_dropoff,payment_type_encoded,company_encoded,taxi_id_encoded,dist_from_center_pickup,dist_from_center_dropoff,moving_towards_center
1,1,812.0,0.0,,,13.84,2.73,0.0,0.0,16.57,...,872664c1effffff,872664cacffffff,882664c1edfffff,882664cac3fffff,1,0,1,0,4,-4
6,6,876.0,8.91,,,23.75,0.0,0.0,0.0,23.75,...,872664c1bffffff,872664cd8ffffff,882664c1b5fffff,882664cd89fffff,0,0,6,6,23,-17
7,8,540.0,0.7,17031080000.0,17031840000.0,7.25,0.0,0.0,0.0,7.25,...,872664c1effffff,872664c1affffff,882664c1e7fffff,882664c1a9fffff,3,4,7,2,3,-1


In [94]:
taxi_df.columns

Index(['Unnamed: 0', 'trip_seconds', 'trip_miles', 'pickup_census_tract',
       'dropoff_census_tract', 'fare', 'tips', 'tolls', 'extras', 'trip_total',
       'trip_hours', 'hour', '4_hour_window', '6_hour_window', 'weekday',
       'month', 'pickup_community', 'pickup_area_number', 'dropoff_community',
       'dropoff_area_number', 'h3_res7_pickup', 'h3_res7_dropoff',
       'h3_res8_pickup', 'h3_res8_dropoff', 'payment_type_encoded',
       'company_encoded', 'taxi_id_encoded', 'dist_from_center_pickup',
       'dist_from_center_dropoff', 'moving_towards_center'],
      dtype='object')

In [108]:
# Flag weekdays: 1 for Monday to Friday, 0 for Saturday and Sunday
taxi_df['is_weekday'] = taxi_df['weekday'].isin(["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]).astype(int)

# Analyze weekday impact
weekday_impact = taxi_df.groupby('is_weekday')[['trip_miles', 'trip_seconds', 'trip_total', 'dist_from_center_pickup']].mean()
print("Weekday Impact:")
print(weekday_impact)
taxi_df.head()

Weekday Impact:
            trip_miles  trip_seconds  trip_total  dist_from_center_pickup
is_weekday                                                               
0             5.497319    1049.49831   23.697390                 8.662318
1             5.330449    1111.13201   22.736041                 8.464165


Unnamed: 0.1,Unnamed: 0,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,fare,tips,tolls,extras,trip_total,...,h3_res7_dropoff,h3_res8_pickup,h3_res8_dropoff,payment_type_encoded,company_encoded,taxi_id_encoded,dist_from_center_pickup,dist_from_center_dropoff,moving_towards_center,is_weekday
0,0,2081.0,4.42,,,20.5,0.0,0.0,0.0,20.5,...,872664d89ffffff,882664d8e1fffff,882664d897fffff,0,0,0,13,8,5,0
1,1,812.0,0.0,,,13.84,2.73,0.0,0.0,16.57,...,872664cacffffff,882664c1edfffff,882664cac3fffff,1,0,1,0,4,-4,0
2,2,600.0,0.9,,,7.0,2.0,0.0,3.0,12.0,...,872664c1effffff,882664c1edfffff,882664c1edfffff,2,1,2,0,0,0,0
3,3,546.0,0.85,,,6.5,0.0,0.0,0.0,6.5,...,872664c1effffff,882664c1edfffff,882664c1edfffff,3,2,3,0,0,0,0
4,4,574.0,0.33,,,6.25,0.0,0.0,0.0,6.25,...,872664c1effffff,882664c1edfffff,882664c1edfffff,3,3,4,0,0,0,0


### 3.0 Weather Data

In [116]:
weather_df = pd.read_csv("data/weather_data_2022.csv")
weather_df.shape

(10194, 11)

In [117]:
# TODO: move this missing-value check elsewhere
weather_df.isna().sum()

Date           0
Time           0
Temperature    0
Dew Point      0
Humidity       0
Wind           6
Wind Speed     0
Wind Gust      0
Pressure       0
Precip.        0
Condition      0
dtype: int64

In [119]:
weather_df = weather_df.dropna()
weather_df.head(3)

Unnamed: 0,Date,Time,Temperature,Dew Point,Humidity,Wind,Wind Speed,Wind Gust,Pressure,Precip.,Condition
0,2022-01-01,6:15 PM,33°F,28°F,82°%,NNE,21°mph,31°mph,29.17°in,0.0°in,Light Snow / Windy
1,2022-01-01,6:29 PM,32°F,27°F,82°%,NNE,22°mph,31°mph,29.17°in,0.0°in,Light Snow / Windy
2,2022-01-01,6:53 PM,32°F,27°F,82°%,NNE,21°mph,33°mph,29.18°in,0.0°in,Light Snow / Windy
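
The measurement columns are stored as strings with units attached (for example `33°F` or `21°mph`), so they have to be converted before they can serve as numeric features. A minimal sketch, assuming the format shown above holds for every row; `weather_demo` is a hypothetical stand-in for `weather_df`.

```python
import pandas as pd

weather_demo = pd.DataFrame({
    "Date": ["2022-01-01", "2022-01-01"],
    "Time": ["6:15 PM", "6:29 PM"],
    "Temperature": ["33°F", "32°F"],
    "Wind Speed": ["21°mph", "22°mph"],
    "Pressure": ["29.17°in", "29.17°in"],
})

# Keep the leading number (optional sign and decimals) and drop the unit
for col in ["Temperature", "Wind Speed", "Pressure"]:
    weather_demo[col] = (
        weather_demo[col]
        .str.extract(r"(-?\d+(?:\.\d+)?)", expand=False)
        .astype(float)
    )

# Combine the date and the 12-hour clock time into one timestamp
weather_demo["datetime"] = pd.to_datetime(
    weather_demo["Date"] + " " + weather_demo["Time"],
    format="%Y-%m-%d %I:%M %p",
)

# Hour of day, e.g. as a key for joining with the hour of a trip
weather_demo["hour"] = weather_demo["datetime"].dt.hour
print(weather_demo)
```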


### 4.0 POI Data

In [121]:
poi_df = pd.read_csv("data/prepped/poi_df.csv")
poi_df.head()

Unnamed: 0,osmid,amenity,name,geometry,public_transport,latitude,longitude,count,h3_res7,h3_res8,poly_res7,poly_res8,poi_density_res7,poi_density_res8,category
0,20217109,ferry_terminal,Shoreline Sightseeing,POINT (-87.6225172 41.8891445),station,41.889145,-87.622517,45,872664c1effffff,882664c1e3fffff,POLYGON ((-87.63048927308355 41.90755371098675...,POLYGON ((-87.62100146715468 41.89255291202126...,1621,328,Transportation
1,20217442,ferry_terminal,Union Station/Willis Tower - Shoreline Water T...,POINT (-87.6377402 41.8790618),station,41.879062,-87.63774,45,872664c1affffff,882664c1adfffff,POLYGON ((-87.63912440648137 41.88713767856642...,POLYGON ((-87.63912440648137 41.88713767856642...,1822,417,Transportation
2,269449042,parking_entrance,,POINT (-87.6150579 41.8586894),,41.858689,-87.615058,10,872664c1bffffff,882664c1b1fffff,"POLYGON ((-87.61820944356228 41.8710984903598,...",POLYGON ((-87.61347038938916 41.86360034164272...,349,36,Transportation
3,269450074,parking_entrance,,POINT (-87.5841968 41.7917424),,41.791742,-87.584197,10,872664cc5ffffff,882664cc59fffff,POLYGON ((-87.5937067068291 41.798228672497444...,POLYGON ((-87.58229602405846 41.79718226760902...,385,47,Transportation
4,269450344,parking_entrance,,POINT (-87.6119452 41.8496767),,41.849677,-87.611945,10,872664c1bffffff,882664c1b3fffff,"POLYGON ((-87.61820944356228 41.8710984903598,...",POLYGON ((-87.60873281770539 41.85610258554422...,349,26,Transportation


In [124]:
# Step 1: Keep only the H3 index and the POI category
columns_to_keep = [f'h3_res{RES}', 'category']
poi_df_cluster = poi_df[columns_to_keep].copy()

# Step 2: One-hot encode the categories
poi_df_cluster = pd.get_dummies(poi_df_cluster, columns=['category'], prefix='', prefix_sep='')

# Step 3: Group by H3 index and sum the one-hot encoded categories
poi_df_cluster = poi_df_cluster.groupby(f'h3_res{RES}').sum().reset_index()
poi_df_cluster.head()

Unnamed: 0,h3_res8,Accommodation,Animal Services,Education,Financial Services,Food and Drink,Healthcare,Miscellaneous,Personal Care,Public Services,Recreation and Entertainment,Religious and Community,Shopping and Retail,Sports and Fitness,Transportation,Utilities
0,8826641903fffff,0,0,0,0,0,0,0,0,0,0,0,0,0,11,0
1,8826641907fffff,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
2,8826641909fffff,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
3,882664190bfffff,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0
4,882664190dfffff,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
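
One way to use these per-hexagon counts as trip features is a left join on the pickup hexagon. This is a sketch under the assumption that the join is the intended next step; the frames and most counts below are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for taxi_df and poi_df_cluster (counts are illustrative)
trips_demo = pd.DataFrame({
    "h3_res8_pickup": ["882664c1edfffff", "8826641903fffff", "882664d8e1fffff"],
})
poi_demo = pd.DataFrame({
    "h3_res8": ["8826641903fffff", "882664c1edfffff"],
    "Food and Drink": [0, 12],
    "Transportation": [11, 3],
})

# Left join keeps every trip, even if its pickup hexagon has no POI row
merged_demo = trips_demo.merge(
    poi_demo, how="left", left_on="h3_res8_pickup", right_on="h3_res8"
).drop(columns="h3_res8")

# Hexagons without any POI get a count of zero instead of NaN
poi_cols = ["Food and Drink", "Transportation"]
merged_demo[poi_cols] = merged_demo[poi_cols].fillna(0).astype(int)
print(merged_demo)
```

The same join could be repeated on the dropoff hexagon, with a suffix on the column names to keep pickup and dropoff counts apart.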


### 5.0 Clustering