# Clustering Neighbourhoods based on their hourly pickup profiles

The idea here is to cluster neighbourhoods based on their 24-hour pickup profiles, in order to find similar neighbourhoods to recommend to green taxi drivers in NYC. My initial thinking is that if taxi drivers like working similar hours each week, I can find new neighbourhoods for them to work in that have similar pickup profiles.

In [2]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

The data used for this task has already been slightly preprocessed. To build profiles for each neighbourhood, the pickup and dropoff lat/longs had to be mapped to neighbourhoods. This was done with a script that can be found in the code folder and was run on Digital Ocean.

The data has also been reduced to a single month, in this case August 2015, to help speed up the overall analysis.

In [3]:
# Data Import
df = pd.read_csv('../data/processed/zones_data_August.csv')
df.head()

Unnamed: 0,pu_time,do_time,pu_nbrhood,do_nbrhood,pass_count,distance,fare,tip,total,payment_type,trip_type
0,08/29/2015 05:16:18 PM,08/29/2015 05:23:48 PM,MN33,MN11,1,1.9,8.0,0.0,8.8,1,1.0
1,08/28/2015 08:05:28 PM,08/28/2015 08:15:25 PM,BK73,BK75,1,2.3,9.5,2.16,12.96,1,1.0
2,08/01/2015 01:07:39 PM,08/01/2015 01:21:50 PM,BK42,BK34,1,2.7,12.0,2.55,15.35,1,1.0
3,08/10/2015 05:35:00 PM,08/10/2015 05:50:06 PM,QN18,QN21,1,3.08,13.0,2.96,17.76,1,1.0
4,08/07/2015 09:18:39 PM,08/07/2015 09:21:57 PM,QN68,QN70,1,0.69,4.5,0.0,5.8,2,1.0


In [4]:
df.shape

(1532343, 11)

During the mapping process, any lat/longs found outside the neighbourhood boundaries of NYC were marked as XX00. As we won't need these for this analysis, they can be dropped.

In [6]:
# Drop XX00 neighbourhoods
df = df[df.pu_nbrhood != 'XX00']

In [7]:
df.shape

(1492169, 11)

As we can see from the shape of the dataframe before and after dropping the unmapped rows, there were **40,174 XX00 pickups**.
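The same figure can be read off directly with a boolean mask instead of comparing shapes. A minimal sketch on a made-up frame (the toy rows are assumptions; only the `pu_nbrhood` column name and the `XX00` code come from the notebook):

```python
import pandas as pd

# Toy stand-in for the trips frame; the values are invented for illustration
toy = pd.DataFrame({'pu_nbrhood': ['MN33', 'XX00', 'BK73', 'XX00', 'QN18']})

# Boolean mask marking pickups that fell outside every neighbourhood boundary
unmapped = toy['pu_nbrhood'] == 'XX00'

n_unmapped = int(unmapped.sum())   # number of rows that would be dropped
mapped = toy[~unmapped]            # same rows as toy[toy.pu_nbrhood != 'XX00']
print(n_unmapped, len(mapped))
```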

Now let's group all neighbourhoods and count how many pickups were recorded in each neighbourhood over the month of August.

In [101]:
pu_group = df.groupby('pu_nbrhood').count()
pu_group['total'].head()

pu_nbrhood
BK09    30284
BK17     4907
BK19      429
BK21     2530
BK23     1118
Name: total, dtype: int64

In [102]:
pu_group['total'].describe()

count       174.000000
mean       8575.683908
std       18252.850243
min           1.000000
25%         100.500000
50%        1102.000000
75%        6292.000000
max      111866.000000
Name: total, dtype: float64

In [103]:
most_pu = pu_group[pu_group.total == 111866]
most_pu['total']

pu_nbrhood
BK73    111866
Name: total, dtype: int64

In [104]:
least_pu = pu_group[pu_group.total == 1]
least_pu['total']

pu_nbrhood
MN21    1
MN25    1
SI32    1
SI54    1
Name: total, dtype: int64

This grouping shows us that pickups were recorded in **174 neighbourhoods** in the month of August. The most active neighbourhood had **111,866 recorded pickups (BK73 - Williamsburg)**, while **four neighbourhoods had only one registered pickup each (MN21 - Gramercy, MN25 - Lower Manhattan, SI32 - Rossville/Woodrow and SI54 - Great Kills)**.
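The busiest and quietest neighbourhoods can also be looked up without hard-coding the counts taken from `describe()`. A small sketch, assuming a toy frame with the notebook's `pu_nbrhood` column (the rows are invented):

```python
import pandas as pd

toy = pd.DataFrame({'pu_nbrhood': ['BK73', 'BK73', 'BK73', 'MN21', 'SI32', 'QN18', 'QN18']})

# Pickups per neighbourhood
counts = toy['pu_nbrhood'].value_counts()

# Label with the highest count
busiest = counts.idxmax()
# Every label tied for the lowest count, sorted for a stable order
quietest = sorted(counts[counts == counts.min()].index)
print(busiest, quietest)
```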

### Data Prep
For the purposes of clustering, I want an hourly profile of pickup numbers per neighbourhood.

What do I need to do:
- get a list of the neighbourhoods
- convert pickup times to pandas datetime and isolate hours
- create a function to cycle through each hour and each neighbourhood and create a unique column for each hour
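The steps above can also be sketched without an explicit loop. This is an alternative shown for comparison, not the code this notebook runs: `pd.crosstab` counts pickups per neighbourhood and hour, and `reindex` guarantees one column for each of the 24 hours. The toy timestamps and the `profile` name are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    'pu_time': ['08/29/2015 05:16:18 PM', '08/29/2015 05:40:00 PM',
                '08/01/2015 12:07:39 AM', '08/10/2015 09:35:00 AM'],
    'pu_nbrhood': ['MN33', 'MN33', 'BK73', 'BK73'],
})

# Parse the 12-hour timestamps and isolate the hour of day (0-23)
toy['pu_time'] = pd.to_datetime(toy['pu_time'], format='%m/%d/%Y %I:%M:%S %p')
toy['hour'] = toy['pu_time'].dt.hour

# One row per neighbourhood, one column per hour, cells hold pickup counts
profile = pd.crosstab(toy['pu_nbrhood'], toy['hour'])
# Add any hours with no pickups at all as zero-filled columns
profile = profile.reindex(columns=range(24), fill_value=0)
profile.columns = ['hour_{}'.format(h) for h in profile.columns]
print(profile.loc['MN33', 'hour_17'], profile.loc['BK73', 'hour_0'])
```

Column names in this sketch run from `hour_0` to `hour_23`, following the values that `dt.hour` produces.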

In [267]:
# Create list of NYC neighbourhoods
vector_df = df.pu_nbrhood.unique()
vector_df

array(['MN33', 'BK73', 'BK42', 'QN18', 'QN68', 'BK61', 'MN34', 'BK09',
       'MN11', 'BK60', 'QN31', 'BK33', 'BK37', 'BK72', 'QN29', 'BX31',
       'BK68', 'MN04', 'QN02', 'BK63', 'BK35', 'MN03', 'QN28', 'BX05',
       'QN71', 'BK69', 'BK77', 'MN36', 'BK32', 'BK78', 'MN09', 'QN63',
       'BX35', 'BK90', 'MN35', 'BK75', 'QN61', 'BK38', 'BK31', 'QN17',
       'BX34', 'QN70', 'BX63', 'QN22', 'BK64', 'QN21', 'BK96', 'QN50',
       'BK21', 'BK76', 'BK81', 'QN72', 'BX39', 'QN60', 'BK82', 'BX28',
       'QN52', 'BX01', 'QN26', 'BX26', 'BK17', 'BX55', 'QN54', 'MN06',
       'BX14', 'BX43', 'QN27', 'BX37', 'BK58', 'BK83', 'BK91', 'BK79',
       'QN01', 'BX27', 'MN40', 'QN35', 'BX17', 'QN53', 'BX07', 'BX75',
       'QN19', 'BX40', 'QN25', 'QN62', 'BK41', 'BX46', 'BX30', 'BX41',
       'QN37', 'BX06', 'BX29', 'BX36', 'BK40', 'BK95', 'QN30', 'QN55',
       'MN31', 'BK45', 'BX08', 'BK34', 'BK44', 'BK46', 'QN34', 'BK23',
       'BK88', 'BX33', 'BK27', 'QN20', 'QN06', 'BK29', 'BX49', 'BX59',
      

In [167]:
# Find total trips per neighbourhood
count_df = df.groupby('pu_nbrhood').count()
count_df = count_df.loc[:, ['pu_time']]
count_df.columns = ['total']
count_df.head(2)

pu_nbrhood,total
BK09,30284
BK17,4907


In [168]:
# Convert pickup time to datetime datatype
nbrhood_df = df.loc[:,['pu_time', 'pu_nbrhood']]
nbrhood_df['pu_time'] = pd.to_datetime(nbrhood_df['pu_time'], format='%m/%d/%Y %I:%M:%S %p')

In [169]:
# Isolate day and hour from datetime into their own separate columns
nbrhood_df['day'] = nbrhood_df['pu_time'].dt.day
nbrhood_df['hour'] = nbrhood_df['pu_time'].dt.hour
nbrhood_df.head()

Unnamed: 0,pu_time,pu_nbrhood,day,hour
0,2015-08-29 17:16:18,MN33,29,17
1,2015-08-28 20:05:28,BK73,28,20
2,2015-08-01 13:07:39,BK42,1,13
3,2015-08-10 17:35:00,QN18,10,17
4,2015-08-07 21:18:39,QN68,7,21
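It is worth checking which values `dt.hour` can take before using it as a column key. A quick sketch with hand-picked timestamps (invented for illustration) shows how `%I` and `%p` map 12-hour times onto a 24-hour clock:

```python
import pandas as pd

stamps = pd.Series(['08/01/2015 12:05:00 AM',   # just after midnight
                    '08/01/2015 12:05:00 PM',   # just after noon
                    '08/01/2015 11:59:59 PM'])  # last second of the day

parsed = pd.to_datetime(stamps, format='%m/%d/%Y %I:%M:%S %p')
hours = parsed.dt.hour.tolist()
print(hours)
```

Hours therefore run from 0 to 23; there is no hour 24.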


Now that we have the data we want, let's split the count into hourly columns per neighbourhood.

In [170]:
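# Note: dt.hour takes values 0-23, so this range never visits hour 0 (midnight to 1am)
# and hour_24 matches no rows
for i in range(1, 25):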
    # Isolate each hour and split them into their own separate dataframes
    df_temp = nbrhood_df[nbrhood_df.hour == i]
    
    # Group the smaller dataframes by their neighbourhood
    df_temp = df_temp.groupby('pu_nbrhood').count()

    # Create temporary string to name each column
    temp_name = 'hour_{}'.format(i)
    
    # Rename temporary series with its unique hour name
    df_temp[temp_name] = df_temp['hour']
    df_temp = df_temp.loc[:, temp_name]
    
    # Convert series to dataframe
    df_temp = df_temp.to_frame()
    
    # Join the temporary dataframe onto the large final dataframe
    count_df = count_df.join(df_temp, how='left')

As there are not pickups in every hour in every neighbourhood, the above loop left a lot of NA values that will not work for our algorithm. Let's replace all NAs with zeros.

In [171]:
# Replace NA values with zeros
count_df.fillna(0, inplace=True)
count_df.head()

pu_nbrhood,total,hour_1,hour_2,hour_3,hour_4,hour_5,hour_6,hour_7,hour_8,hour_9,...,hour_15,hour_16,hour_17,hour_18,hour_19,hour_20,hour_21,hour_22,hour_23,hour_24
BK09,30284,511.0,296.0,137.0,107.0,163.0,336.0,842.0,1666.0,1658.0,...,1620.0,1966.0,2103.0,2261.0,2211.0,1986.0,1864.0,1645.0,1145.0,0.0
BK17,4907,151.0,96.0,52.0,69.0,32.0,49.0,143.0,237.0,234.0,...,225.0,292.0,399.0,485.0,306.0,264.0,280.0,222.0,204.0,0.0
BK19,429,17.0,6.0,5.0,2.0,8.0,1.0,4.0,12.0,24.0,...,38.0,38.0,20.0,17.0,22.0,13.0,11.0,8.0,14.0,0.0
BK21,2530,114.0,68.0,32.0,15.0,15.0,15.0,56.0,85.0,69.0,...,124.0,142.0,168.0,177.0,167.0,172.0,204.0,202.0,191.0,0.0
BK23,1118,85.0,61.0,25.0,14.0,2.0,3.0,4.0,6.0,8.0,...,44.0,48.0,61.0,70.0,66.0,90.0,100.0,122.0,127.0,0.0


OK, this should give us enough to work with for an initial trial of the KMeans clustering algorithm.

In [96]:
# Drop total column
count_df.drop('total', axis='columns', inplace=True)

In [97]:
from sklearn.cluster import KMeans

In [98]:
kmeans = KMeans(n_clusters=12, random_state=0).fit(count_df)

In [112]:
np.unique(kmeans.labels_, return_counts=True)

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11], dtype=int32),
 array([130,   2,  11,   1,   2,  17,   2,   3,   1,   1,   3,   1]))

In [105]:
pu_group['kmean_clusters'] = kmeans.labels_
pu_group.drop(['pu_time', 'do_time', 'do_nbrhood', 'pass_count', 
               'distance', 'fare', 'tip', 'payment_type', 'trip_type'],
             axis='columns', inplace=True)

In [106]:
pu_group.reset_index().head()

Unnamed: 0,pu_nbrhood,total,kmean_clusters
0,BK09,30284,2
1,BK17,4907,0
2,BK19,429,0
3,BK21,2530,0
4,BK23,1118,0


In [107]:
for k, v in pu_group.groupby('kmean_clusters').groups.items():
    print(k)
    print(v)
    print("\n")

0
Index(['BK17', 'BK19', 'BK21', 'BK23', 'BK25', 'BK26', 'BK27', 'BK28', 'BK29',
       'BK30',
       ...
       'SI08', 'SI11', 'SI14', 'SI22', 'SI24', 'SI32', 'SI35', 'SI36', 'SI37',
       'SI54'],
      dtype='object', name='pu_nbrhood', length=130)


1
Index(['QN29', 'QN70'], dtype='object', name='pu_nbrhood')


2
Index(['BK09', 'BK33', 'BK61', 'BK69', 'BK75', 'BK76', 'MN04', 'MN36', 'QN17',
       'QN63', 'QN72'],
      dtype='object', name='pu_nbrhood')


3
Index(['BK73'], dtype='object', name='pu_nbrhood')


4
Index(['MN11', 'MN34'], dtype='object', name='pu_nbrhood')


5
Index(['BK35', 'BK60', 'BK64', 'BX14', 'BX34', 'BX39', 'BX63', 'MN06', 'MN35',
       'QN18', 'QN22', 'QN26', 'QN50', 'QN60', 'QN61', 'QN68', 'QN71'],
      dtype='object', name='pu_nbrhood')


6
Index(['MN03', 'MN09'], dtype='object', name='pu_nbrhood')


7
Index(['BK37', 'BK68', 'QN31'], dtype='object', name='pu_nbrhood')


8
Index(['MN33'], dtype='object', name='pu_nbrhood')


9
Index(['BK38'], dtype='obje

If we break the dataset into 12 clusters, we can see that there is one large cluster of 130 neighbourhoods and 11 small clusters. This may be because the algorithm has not been restarted from enough different initialisations (`n_init`). Let's test this assumption.

In [113]:
kmeans100 = KMeans(n_clusters=12, random_state=0, n_init=100).fit(count_df)

In [114]:
np.unique(kmeans100.labels_, return_counts=True)

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11], dtype=int32),
 array([130,   2,  11,   1,   2,  17,   2,   3,   1,   1,   3,   1]))

Comparing both label counts, they look exactly the same, so the number of runs (`n_init`) is not what is causing the clustering to be so lopsided.
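For reference, `n_init` controls how many times KMeans is restarted from different initial centroids (the run with the lowest inertia is kept), while `max_iter` caps the iterations within a single run. A minimal sketch on synthetic data (the blobs and variable names are assumptions, not the taxi data) compares two `n_init` settings by inertia and by cluster sizes:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Three synthetic, well-separated blobs in 24 dimensions
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 24)) for c in (0.0, 5.0, 10.0)])

few = KMeans(n_clusters=3, random_state=0, n_init=1).fit(X)
many = KMeans(n_clusters=3, random_state=0, n_init=20).fit(X)

# inertia_ is the within-cluster sum of squared distances of the run that was kept
print(few.inertia_, many.inertia_)
# Cluster sizes, sorted so that label permutations do not matter
print(np.sort(np.bincount(few.labels_)), np.sort(np.bincount(many.labels_)))
```

If both settings give the same sizes and nearly the same inertia, extra restarts are unlikely to change a lopsided result.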

My next theory is that neighbourhoods with very low or very high pickup numbers are being pushed into the small outlier clusters. To address this, let's scale each neighbourhood's hourly counts to between 0 and 1 by dividing them by that neighbourhood's total pickups.
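A minimal sketch of this kind of scaling, assuming a small made-up count table (the numbers are invented): dividing each row by its own sum turns raw counts into shares, so a busy and a quiet neighbourhood with the same daily shape end up with identical rows.

```python
import pandas as pd

counts = pd.DataFrame(
    {'hour_1': [10, 1000, 5], 'hour_2': [30, 3000, 5], 'hour_3': [60, 6000, 10]},
    index=['quiet', 'busy', 'flat'],
)

# Divide every row by that row's total so each profile sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares)
```

Because KMeans uses Euclidean distance, raw counts let sheer volume dominate; shares leave only the shape of the day to compare.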

In [218]:
# Copy the columns so later assignments do not write to a slice of count_df
new_final_df = count_df[['total', 'hour_1']].copy()

In [219]:
# Divide each hourly count by the neighbourhood's total, two columns per pass
for i in range(1, 25, 2):
    temp_name1 = 'hour_{}'.format(i)
    temp_name2 = 'hour_{}'.format(i+1)

    new_final_df[[temp_name1, temp_name2]] = count_df[[temp_name1, temp_name2]].div(count_df.total, axis=0)


In [220]:
new_final_df.head(2)

pu_nbrhood,total,hour_1,hour_2,hour_3,hour_4,hour_5,hour_6,hour_7,hour_8,hour_9,...,hour_15,hour_16,hour_17,hour_18,hour_19,hour_20,hour_21,hour_22,hour_23,hour_24
BK09,30284,0.016874,0.009774,0.004524,0.003533,0.005382,0.011095,0.027803,0.055013,0.054748,...,0.053494,0.064919,0.069443,0.07466,0.073009,0.065579,0.061551,0.054319,0.037809,0.0
BK17,4907,0.030772,0.019564,0.010597,0.014062,0.006521,0.009986,0.029142,0.048298,0.047687,...,0.045853,0.059507,0.081312,0.098838,0.06236,0.053801,0.057061,0.045241,0.041573,0.0


In [221]:
# new_final_df.drop(list(vector_df), axis='columns', inplace=True)
new_final_df.drop('total', axis='columns', inplace=True)

In [275]:
# # Reorder columns
# cols = final_df.columns.tolist()
# cols = cols[-22:] + cols[:-22]
# final_df = final_df[cols]
# final_df.head()

In [223]:
# Zone code/name mapping
zone_names = pd.read_csv('../code/zones.csv', index_col=0)
zone_dict = dict(zip(zone_names.nta_code, zone_names.zone))

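# Rename index values with real zone names
new_final_df = new_final_df.reset_index()
# Assign the result back rather than calling replace in place on a selected column
new_final_df['pu_nbrhood'] = new_final_df['pu_nbrhood'].replace(zone_dict)
new_final_df.head()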

Unnamed: 0,pu_nbrhood,hour_1,hour_2,hour_3,hour_4,hour_5,hour_6,hour_7,hour_8,hour_9,...,hour_15,hour_16,hour_17,hour_18,hour_19,hour_20,hour_21,hour_22,hour_23,hour_24
0,Cobble Hill,0.016874,0.009774,0.004524,0.003533,0.005382,0.011095,0.027803,0.055013,0.054748,...,0.053494,0.064919,0.069443,0.07466,0.073009,0.065579,0.061551,0.054319,0.037809,0.0
1,Sheepshead Bay,0.030772,0.019564,0.010597,0.014062,0.006521,0.009986,0.029142,0.048298,0.047687,...,0.045853,0.059507,0.081312,0.098838,0.06236,0.053801,0.057061,0.045241,0.041573,0.0
2,Brighton Beach,0.039627,0.013986,0.011655,0.004662,0.018648,0.002331,0.009324,0.027972,0.055944,...,0.088578,0.088578,0.04662,0.039627,0.051282,0.030303,0.025641,0.018648,0.032634,0.0
3,Coney Island,0.045059,0.026877,0.012648,0.005929,0.005929,0.005929,0.022134,0.033597,0.027273,...,0.049012,0.056126,0.066403,0.06996,0.066008,0.067984,0.080632,0.079842,0.075494,0.0
4,BK23,0.076029,0.054562,0.022361,0.012522,0.001789,0.002683,0.003578,0.005367,0.007156,...,0.039356,0.042934,0.054562,0.062612,0.059034,0.080501,0.089445,0.109123,0.113596,0.0


In [232]:
new_final_df.set_index('pu_nbrhood', inplace=True)

In [248]:
kmeans = KMeans(n_clusters=12, random_state=0).fit(new_final_df)

In [249]:
np.unique(kmeans.labels_, return_counts=True)

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11], dtype=int32),
 array([ 4, 74,  1,  1,  2,  2,  1,  1,  1, 83,  3,  1]))

In [260]:
kmeans100 = KMeans(n_clusters=12, random_state=0, n_init=100).fit(new_final_df)

In [261]:
np.unique(kmeans100.labels_, return_counts=True)

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11], dtype=int32),
 array([ 3,  3,  3,  1,  1,  1, 74,  1,  1,  2,  1, 83]))

Scaling each row and increasing `n_init` gives us two large clusters (74 and 83 neighbourhoods) and a number of small clusters. While not a massive improvement, it does give us something to work with.

## Final Scenario: Suggesting similar neighbourhoods to a NYC Green Taxi Driver

OK, let's randomly choose the neighbourhood the NYC Green Taxi driver is from...

In [263]:
from random import randint

In [265]:
randint(1,19)

13

In [270]:
# Pair each neighbourhood code with its position in vector_df
nbrhoods = list(enumerate(vector_df))

In [277]:
# The taxi driver is from...
# randint includes both endpoints, so the upper bound must be len(nbrhoods) - 1
taxi_divers_nbrh = nbrhoods[randint(0, len(nbrhoods) - 1)]
taxi_divers_nbrh

(111, 'BX59')

In [279]:
zone_dict[taxi_divers_nbrh[1]]

'Westchester Village/Unionport'

In [280]:
new_final_df['kmean_clusters'] = kmeans100.labels_

In [284]:
new_final_df[new_final_df.index == zone_dict[taxi_divers_nbrh[1]]].loc[:,'kmean_clusters']

pu_nbrhood
Westchester Village/Unionport    11
Name: kmean_clusters, dtype: int32

So the taxi driver lives within one of the two large clusters, number 11. Let's see if we can narrow the suggestions down further.

Ideally the NYC Green Taxi driver would like to work in a neighbourhood with similar working hours to Westchester Village but with better tips and, if possible, less distance to travel (to save on gas!).

In [292]:
westchester = df[df['pu_nbrhood'] == taxi_divers_nbrh[1]]
westchester.groupby('pu_nbrhood').agg(['count', 'mean'])

,pass_count,pass_count,distance,distance,fare,fare,tip,tip,total,total,payment_type,payment_type,trip_type,trip_type
pu_nbrhood,count,mean,count,mean,count,mean,count,mean,count,mean,count,mean,count,mean
BX59,2100,1.274286,2100,3.166681,2100,12.264929,2100,0.388852,2100,13.640848,2100,1.805714,2100,1.219524


So, looking at this info, we see that Westchester Village has:
* a total of 2,100 pickups in August (pretty low),
* taxi drivers there take in a mean total fare of 13.64 dollars, and
* they travel a mean distance of 3.17 miles
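The three figures above act as thresholds when searching for alternatives. A sketch of such a filter as one reusable function, assuming a per-neighbourhood summary table with `distance`, `total`, `count` and `clusters` columns; the function name `suggest_similar` and the toy rows are hypothetical:

```python
import pandas as pd

def suggest_similar(stats, cluster, max_distance, min_total, min_count):
    """Keep neighbourhoods in `cluster` with shorter trips, higher totals and more pickups."""
    mask = (
        (stats['clusters'] == cluster)
        & (stats['distance'] < max_distance)
        & (stats['total'] > min_total)
        & (stats['count'] > min_count)
    )
    return stats[mask]

# Invented summary rows: only 'A' passes all four conditions
toy_stats = pd.DataFrame(
    {'distance': [2.9, 3.5, 2.8, 3.0],
     'total': [14.0, 15.0, 13.0, 14.5],
     'count': [2500, 9000, 4000, 3000],
     'clusters': [11, 11, 11, 6]},
    index=['A', 'B', 'C', 'D'],
)

picks = suggest_similar(toy_stats, cluster=11, max_distance=3.17,
                        min_total=13.64, min_count=2100)
print(list(picks.index))
```

Combining the conditions in one mask gives the same rows as applying them one after another, and keeps the thresholds visible in a single call.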

In [317]:
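# Mean of each numeric column per pickup neighbourhood (timestamps and codes are skipped)
totals_df = df.groupby('pu_nbrhood').mean(numeric_only=True)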
totals_count = df.groupby('pu_nbrhood').count()
totals_df['count'] = totals_count.total

totals_df['clusters'] = kmeans100.labels_

totals_df = totals_df[totals_df.clusters == 11]

In [321]:
# Remove any neighbourhoods with distance higher than 3.17 miles
totals_df = totals_df[totals_df.distance < 3.17]
len(totals_df)

35

In [322]:
# Remove any neighbourhoods with total fare lower than 13.64 dollars
totals_df = totals_df[totals_df.total > 13.64]
len(totals_df)

14

In [330]:
# Remove any neighbourhoods with total number of pickups lower than 2100
totals_df = totals_df[totals_df['count'] > 2100]
len(totals_df)

9

So, taking into account all the criteria we laid out above, we are left with 9 suggestions:

In [338]:
final_nbrhoods = totals_df.reset_index()
# Swap NTA codes for neighbourhood names, assigning the result back to the column
final_nbrhoods['pu_nbrhood'] = final_nbrhoods['pu_nbrhood'].replace(zone_dict)
final_nbrhoods.set_index('pu_nbrhood')

pu_nbrhood,pass_count,distance,fare,tip,total,payment_type,trip_type,count,clusters
Belmont,1.347596,2.903305,12.312433,0.642372,13.982074,1.735126,1.140587,2454,11
East Tremont,1.236364,3.042269,12.494624,0.516521,14.02376,1.680165,1.157025,2420,11
Spuyten Duyvil/Kingsbridge,1.210696,2.905067,12.162656,0.737782,14.037446,1.717842,1.08852,2169,11
Van Nest/Morris Park,1.27828,3.073842,12.187434,0.541388,13.724373,1.745211,1.198771,2767,11
Hamilton Heights,1.229236,2.921072,11.8278,1.023102,14.078917,1.587225,1.031938,23984,11
Manhattanville,1.249103,2.867386,11.696914,0.992775,13.914921,1.580956,1.032622,11710,11
Morningside Heights,1.31428,2.758586,11.739215,1.276069,14.342717,1.490644,1.007886,46920,11
Upper East Side South,1.356322,2.662498,11.676133,1.479648,14.388796,1.382691,1.003719,2958,11
Rego Park,1.525585,3.127697,13.0362,0.710555,15.050502,1.739967,1.005769,11960,11
