# UBER Pickups 📇

Uber already has data about pickups in major cities. The objective is to build algorithms that identify the hot zones where drivers should be.

Uber wants hot zones per hour and per day of the week. Start small: pick one day at a given hour, then generalize the approach.

Purposes:

- Create an algorithm to find hot zones
- Visualize results on a nice dashboard

##### Summary :

- EDA

- K-Means (Elbow + Silhouette Methods)

- DBSCAN

- Conclusion


In [2]:
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import  OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import  silhouette_score

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

import plotly.express as px
import plotly.io as pio

In [3]:
#read the Dataset
df = pd.read_csv('uber-raw-data-apr14.csv')
df.head()

Unnamed: 0,Date/Time,Lat,Lon,Base
0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512


- ### EDA

In [4]:
# Basic statistics
print("Number of rows : {}".format(df.shape[0]))
print()


print("Basic Stats: ")
data_desc = df.describe(include='all')
display(data_desc)
print()

print("Percentage of missing values : ")
display(100*df.isnull().sum()/df.shape[0])

Number of rows : 564516

Basic Stats: 


Unnamed: 0,Date/Time,Lat,Lon,Base
count,564516,564516.0,564516.0,564516
unique,41999,,,5
top,4/7/2014 20:21:00,,,B02682
freq,97,,,227808
mean,,40.740005,-73.976817,
std,,0.036083,0.050426,
min,,40.0729,-74.7733,
25%,,40.7225,-73.9977,
50%,,40.7425,-73.9848,
75%,,40.7607,-73.97,



Percentage of missing values : 


Date/Time    0.0
Lat          0.0
Lon          0.0
Base         0.0
dtype: float64

In [5]:
# show the dataset
df.head()

Unnamed: 0,Date/Time,Lat,Lon,Base
0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512


In [6]:
# import datetime to manage the time
import datetime

df['Date/Time'] = pd.to_datetime(df['Date/Time']) # transform Date/Time column into a timestamp
df['Hour'] = df['Date/Time'].dt.hour # get hours
df['Time'] = df['Date/Time'].dt.time # get time
df['Day'] = df['Date/Time'].dt.day # get the day
df['DayOfWeek'] = df['Date/Time'].dt.day_of_week # get the day of the week
df['DayName'] = df['Date/Time'].dt.day_name() # get the day name
df['Date'] = df['Date/Time'].dt.date # get the date

df = df.drop('Date/Time', axis=1) # drop Date/Time

df.head()

Unnamed: 0,Lat,Lon,Base,Hour,Time,Day,DayOfWeek,DayName,Date
0,40.769,-73.9549,B02512,0,00:11:00,1,1,Tuesday,2014-04-01
1,40.7267,-74.0345,B02512,0,00:17:00,1,1,Tuesday,2014-04-01
2,40.7316,-73.9873,B02512,0,00:21:00,1,1,Tuesday,2014-04-01
3,40.7588,-73.9776,B02512,0,00:28:00,1,1,Tuesday,2014-04-01
4,40.7594,-73.9722,B02512,0,00:33:00,1,1,Tuesday,2014-04-01


In [None]:
fig = px.histogram(df, x='Hour',
                      title = 'Hours with the highest number of pickups',
                      barmode ='group',
                      width= 1000,
                      height = 600
                      ) 
fig.update_layout(title_x = 0.5, 
                      margin=dict(l=50,r=50,b=50,t=50,pad=4),
                      xaxis_title = '',
                      yaxis_title = '',
                      template = 'plotly_dark'
                      )
fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)',
                      'paper_bgcolor': 'rgba(0, 0, 0, 0)'}
                      )       
fig.show()

In April 2014, we can observe that:

- Requests start rising around 5 am, then level off from 9 am to noon
- The busiest time slot is between 3 pm and 10 pm

In [None]:
fig = px.histogram(df, x=df.DayName,
                      title = 'Days with the highest number of pickups',
                      barmode ='group',
                      width= 700,
                      height = 400
                      ) 
fig.update_layout(title_x = 0.5, 
                      margin=dict(l=50,r=50,b=50,t=50,pad=10),
                      xaxis_title = '',
                      yaxis_title = '',
                      template = 'plotly_dark'
                      )
fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)',
                      'paper_bgcolor': 'rgba(0, 0, 0, 0)'}
                      )       
fig.show()

The day of the week with the highest total number of requests is Wednesday.
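One caveat when comparing weekday totals: April 1, 2014 was a Tuesday and the month has 30 days, so Tuesday and Wednesday each occur five times while every other weekday occurs four times. A minimal sketch of a fairer comparison follows, averaging the daily totals per weekday. The `toy` frame and its counts are made up for illustration; only the `Date` and `DayName` column names come from the notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's df: one row per pickup.
# The row counts are invented for illustration only.
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2014-04-01"] * 3 + ["2014-04-02"] * 4 + ["2014-04-09"] * 2 + ["2014-04-03"] * 5
    )
})
toy["DayName"] = toy["Date"].dt.day_name()

# Raw totals per weekday: weekdays that occur more often are favored.
raw_totals = toy["DayName"].value_counts()

# Pickups per calendar date, then the mean of those daily totals per weekday.
daily = toy.groupby(["DayName", "Date"]).size().rename("pickups").reset_index()
avg_per_weekday = daily.groupby("DayName")["pickups"].mean().sort_values(ascending=False)

print(raw_totals)
print(avg_per_weekday)
```

In this toy data, Wednesday wins on raw totals only because it appears on two dates; once averaged per date, Thursday comes out on top.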

In [9]:
# let's check which Wednesday of the month has the highest number of pickups
df_wednesday = df[df['DayName'] == 'Wednesday']
# value_counts() sorts by count in descending order; keys are days of the month
wednesday = df_wednesday['Day'].value_counts().to_dict()

In [None]:
fig = px.histogram(df, x=df.Base,
                      title = 'Base with the highest number of pickups',
                      barmode ='group',
                      width= 700,
                      height = 400
                      ) 
fig.update_layout(title_x = 0.5, 
                      margin=dict(l=50,r=50,b=50,t=50,pad=10),
                      xaxis_title = '',
                      yaxis_title = '',
                      template = 'plotly_dark'
                      )
fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)',
                      'paper_bgcolor': 'rgba(0, 0, 0, 0)'}
                      )       
fig.show()

- B02682 is the base with the highest number of pickups

In [None]:
# one slice per Wednesday of the month (names are days of the month)
fig = px.pie(values=list(wednesday.values()), names=list(wednesday.keys()),
            title= "Proportion of pickups on each Wednesday of April 2014")
fig.update_traces(textposition = 'outside', textfont_size = 15)             
fig.update_layout(title_x = 0.5, 
                    margin=dict(l=50,r=50,b=50,t=50,pad=4), 
                    template = 'plotly_dark'
                    )   
fig.show()

Among the Wednesdays, April 30 has the highest number of requests.

To work with a reasonable amount of data, we will keep only April 30, 2014.

In [12]:
# select data on April 30
data = df[df['Day']==30]
data.shape

(36251, 9)

Now, we will display the hours with the highest number of pickups in our final dataset

In [None]:
fig = px.histogram(data, x='Hour',
                      title = 'Hours with the highest number of pickups',
                      barmode ='group',
                      width= 1000,
                      height = 600
                      ) 
fig.update_layout(title_x = 0.5, 
                      margin=dict(l=50,r=50,b=50,t=50,pad=4),
                      xaxis_title = '',
                      yaxis_title = '',
                      template = 'plotly_dark'
                      )
fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)',
                      'paper_bgcolor': 'rgba(0, 0, 0, 0)'}
                      )       
fig.show()

- As we can see, on April 30 the number of requests increases from 3 pm to 8 pm.

- Early in the morning the number of requests stays low until 5 am, when it roughly doubles.

- From 9 am to 1 pm, requests stagnate.

In [14]:
# Keep only late afternoon and evening pickups (hours 15 to 20 inclusive)
data_evening = data[data['Hour'].between(15, 20)]
data_evening.shape

(16501, 9)

- ## K-Means

#### Elbow Method

In [15]:
# Elbow method to find the optimal number of clusters 
X = data_evening[['Lat', 'Lon']]

wcss =  []
for i in range (2,11): 
    kmeans = KMeans(n_clusters= i)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
    
print(wcss)

[24.644497910895513, 17.43925960526812, 13.804553543658097, 10.84161038401813, 8.780203220965008, 7.017771207616099, 5.847649549317733, 5.047997185565777, 4.579710831733294]


In [None]:
# a graph can help us to choose the number of clusters
fig = px.line(x = range(2,11), y = wcss)
fig.show()

K-Means Elbow method suggests that 3 is the optimal number of clusters.
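Reading an elbow off a plot is subjective. As a rough cross-check, one common heuristic (not necessarily how the choice above was made) is to take the k where the inertia curve bends the most, that is, where its discrete second difference is largest. The sketch below applies it to the `wcss` values printed above; `elbow_k` is a name introduced here for illustration.

```python
import numpy as np

ks = list(range(2, 11))
# Inertia values copied from the output printed above
wcss = [24.644497910895513, 17.43925960526812, 13.804553543658097,
        10.84161038401813, 8.780203220965008, 7.017771207616099,
        5.847649549317733, 5.047997185565777, 4.579710831733294]

# second_diff[i] = wcss[i] - 2 * wcss[i + 1] + wcss[i + 2], centered on ks[i + 1]
second_diff = np.diff(wcss, n=2)

# The largest second difference marks the sharpest bend in the curve
elbow_k = ks[int(np.argmax(second_diff)) + 1]
print(elbow_k)
```

With these values the heuristic lands on the same k as the visual reading, but it remains a heuristic: on smoother curves the bend can be ambiguous.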

In [17]:
# Fit K-Means with the number of clusters suggested by the elbow method
kmeans = KMeans(n_clusters= 3)
kmeans.fit(X)

# add a new column with the predicted cluster; assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the filtered slice
data_evening = data_evening.assign(Cluster_Elbow_KMeans=kmeans.predict(X))

In [None]:
# show it
fig = px.scatter_mapbox(
        data_evening, 
        lat="Lat", 
        lon="Lon",
        color="Cluster_Elbow_KMeans",
        mapbox_style="carto-positron"
)

fig.update_layout(width = 1000,
                  height = 800,
                  title_x = 0.5, 
                  template = 'plotly_dark',
                  margin = {"l": 0, "r": 0, "b": 0, "t": 80},
                  )

fig.show()

#### Silhouette Method

In [19]:
# Now use the Silhouette score to choose the optimal number of clusters

s_score = []
for i in range (2,11): 
    kmeans = KMeans(n_clusters= i)
    kmeans.fit(X)
    s_score.append(silhouette_score(X, kmeans.predict(X)))

print(s_score)

[0.7876237123521267, 0.44762895634566596, 0.4621852337308618, 0.4708449135479972, 0.4885728282284252, 0.49096380200963763, 0.4127858358004401, 0.42310988167595026, 0.4331425561283819]


In [None]:
# Show the scores depending on clusters
fig = px.bar(x = range(2,11), y = s_score)
fig.show()

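However, the silhouette score is highest for 2 clusters (about 0.79, against less than 0.50 for every other value), so the silhouette method points to only two clusters.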
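The selection can also be read directly from the scores rather than from the bar chart. A small sketch using the values printed above (`best_k` and `ranked` are names introduced here):

```python
ks = list(range(2, 11))
# Silhouette scores copied from the output printed above
s_score = [0.7876237123521267, 0.44762895634566596, 0.4621852337308618,
           0.4708449135479972, 0.4885728282284252, 0.49096380200963763,
           0.4127858358004401, 0.42310988167595026, 0.4331425561283819]

# Rank the candidate k values from best to worst silhouette score
ranked = sorted(zip(ks, s_score), key=lambda pair: pair[1], reverse=True)
best_k, best_score = ranked[0]
runner_up_k, runner_up_score = ranked[1]

print(best_k, round(best_score, 3))
print(runner_up_k, round(runner_up_score, 3))
```

The large gap between the best and second-best score is why the silhouette criterion favors two clusters so clearly here.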

In [21]:
# Fit K-Means with the number of clusters suggested by the silhouette method
kmeans = KMeans(n_clusters= 2)
kmeans.fit(X)

# add a new column with the predicted cluster; assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the filtered slice
data_evening = data_evening.assign(Cluster_Silhouette_KMeans=kmeans.predict(X))

In [None]:
# show it
fig = px.scatter_mapbox(
        data_evening, 
        lat="Lat", 
        lon="Lon",
        color="Cluster_Silhouette_KMeans",
        mapbox_style="carto-positron"
)

fig.update_layout(width = 1000,
                  height = 800,
                  title_x = 0.5, 
                  template = 'plotly_dark',
                  margin = {"l": 0, "r": 0, "b": 0, "t": 80},
                  )

fig.show()

K-Means is useful for partitioning the pickups, but it says nothing about the density of requests.

Let's try DBSCAN.

- ## DBSCAN

In [23]:
# Let's try DBSCAN with the settings chosen for this dataset
import numpy as np

# instantiate DBSCAN; eps is expressed in degrees of latitude/longitude
db = DBSCAN(n_jobs=-1, eps = 0.008, min_samples = 22, metric = 'manhattan')

db.fit(X)
# use numpy.unique to show the distinct cluster labels (-1 marks noise points)
print(np.unique(db.labels_))
# add a new column with the cluster labels; assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the filtered slice
data_evening = data_evening.assign(cluster=db.labels_)

[-1  0  1  2  3  4  5  6  7]


With these parameters, DBSCAN finds 8 clusters (labels 0 to 7); the label -1 marks points treated as noise.
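Because the model is fitted on raw `Lat`/`Lon` values, `eps` is measured in degrees, and at New York's latitude a degree of longitude covers less ground than a degree of latitude. One alternative, shown here only as a sketch and not as the approach used in this notebook, is scikit-learn's `haversine` metric, which lets `eps` be expressed as a ground distance. The synthetic points, the 0.3 km radius and `min_samples=10` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

rng = np.random.default_rng(0)

def blob(lat, lon, n=30, jitter=0.0005):
    """Synthetic pickups scattered within a small box around (lat, lon)."""
    return np.column_stack([
        lat + rng.uniform(-jitter, jitter, n),
        lon + rng.uniform(-jitter, jitter, n),
    ])

# Two dense synthetic zones plus three isolated points (all made up)
coords = np.vstack([
    blob(40.75, -73.98),
    blob(40.64, -73.78),
    [[40.90, -74.20], [40.55, -73.60], [41.00, -73.50]],
])

eps_km = 0.3  # neighborhood radius in kilometers (assumed value)
db_hav = DBSCAN(
    eps=eps_km / EARTH_RADIUS_KM,  # haversine distances are in radians
    min_samples=10,
    algorithm="ball_tree",
    metric="haversine",
)
# The haversine metric expects [latitude, longitude] in radians
labels = db_hav.fit_predict(np.radians(coords))
print(np.unique(labels, return_counts=True))
```

With this formulation, tuning `eps` means choosing a radius in kilometers, which is easier to reason about than a fraction of a degree.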

In [None]:
# show it
fig = px.scatter_mapbox(data_evening[data_evening.cluster != -1], # keep every row except the noise points (label -1)
        lat="Lat", 
        lon="Lon",
        color="cluster",
        mapbox_style="carto-positron",
)

fig.update_layout(width = 1000,
                  height = 800,
                  title_x = 0.5, 
                  template = 'plotly_dark',
                  margin = {"l": 0, "r": 0, "b": 0, "t": 80},
                  )

fig.show()

From 3 pm to 8 pm, we can observe that:
- Most user requests come from Manhattan.
- The second busiest area is Brooklyn.
- Some users request drivers from the JFK and LaGuardia airports.
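Observations like these can be backed by numbers by counting pickups per cluster and locating each cluster's center. The sketch below uses a made-up `toy` frame with the same `Lat`, `Lon` and `cluster` columns as `data_evening`; the coordinates and labels are illustrative, not results from the dataset.

```python
import pandas as pd

# Toy stand-in for data_evening after DBSCAN (values are invented)
toy = pd.DataFrame({
    "Lat": [40.75, 40.76, 40.74, 40.68, 40.69, 40.64, 41.20],
    "Lon": [-73.98, -73.97, -73.99, -73.95, -73.96, -73.78, -74.50],
    "cluster": [0, 0, 0, 1, 1, 2, -1],
})

# Drop noise points, then rank clusters by number of pickups
summary = (
    toy[toy["cluster"] != -1]
    .groupby("cluster")
    .agg(
        pickups=("Lat", "size"),
        center_lat=("Lat", "mean"),
        center_lon=("Lon", "mean"),
    )
    .sort_values("pickups", ascending=False)
)
print(summary)
```

The cluster centers can then be compared with a map to name each zone, and the pickup counts give an explicit ranking of the hot zones.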

#### Let's try our algorithm on the whole day of April 30

In [25]:
april = data[['Lat', 'Lon']] # select Latitude and Longitude from April 30 dataset

# instantiate DBSCAN with the parameters chosen for the full day
db2 = DBSCAN(n_jobs=-1, eps = 0.005, min_samples = 20, metric = 'manhattan')

db2.fit(april)
# use numpy.unique to show the distinct cluster labels (-1 marks noise points)
import numpy as np
print(np.unique(db2.labels_))
# add a new column with the cluster labels; assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the filtered slice
data = data.assign(cluster=db2.labels_)

[-1  0  1  2  3  4  5  6  7  8  9 10 11]



In [None]:
# show it
fig = px.scatter_mapbox(data[data.cluster != -1],
        lat="Lat", 
        lon="Lon",
        color="cluster",
        mapbox_style="carto-positron",
)

fig.update_layout(width = 1000,
                  height = 800,
                  title_x = 0.5, 
                  template = 'plotly_dark',
                  margin = {"l": 0, "r": 0, "b": 0, "t": 80},
                  )

fig.show()

Over the whole day of April 30:
- The hottest zone is still Manhattan.
- A significant number of user requests come from Brooklyn.
- Uber drivers should also keep an eye on the airport areas and New Jersey.

### Let's look at our clusters hour by hour

In [None]:
# show hot zone areas with hours animation frame
fig = px.scatter_mapbox(data[data.cluster != -1],
        lat="Lat", 
        lon="Lon",
        color="cluster",
        mapbox_style="carto-positron",
        animation_frame='Hour'
)

fig.update_layout(width = 1000,
                  height = 800,
                  title_x = 0.5, 
                  template = 'plotly_dark',
                  margin = {"l": 0, "r": 0, "b": 0, "t": 80},
                  )

fig.show()

- On April 30, user requests are at their lowest from 1 am to 6 am.
- From 7 am to 8 pm, Uber has to increase the number of drivers to cover user requests.
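The animation above reuses clusters fitted on the whole day and only filters them by hour. To move toward true per-hour hot zones, one option is to fit a separate DBSCAN for each hour. The following is a minimal sketch under that assumption; `hot_zones_by_hour` is a hypothetical helper, and the toy data and default parameters are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

def hot_zones_by_hour(pickups, eps=0.005, min_samples=20):
    """Fit one DBSCAN per hour; return labels aligned with `pickups`.

    Labels are only meaningful within their own hour. Hours with fewer
    than `min_samples` pickups are left entirely as noise (-1).
    """
    labels = pd.Series(-1, index=pickups.index, dtype=int, name="cluster_hourly")
    for hour, group in pickups.groupby("Hour"):
        if len(group) < min_samples:
            continue
        model = DBSCAN(eps=eps, min_samples=min_samples, metric="manhattan")
        labels.loc[group.index] = model.fit_predict(group[["Lat", "Lon"]])
    return labels

# Toy data: a dense zone at 8 am and a handful of scattered pickups at 3 am
rng = np.random.default_rng(0)
busy = pd.DataFrame({
    "Lat": 40.75 + rng.uniform(-0.001, 0.001, 25),
    "Lon": -73.98 + rng.uniform(-0.001, 0.001, 25),
    "Hour": 8,
})
quiet = pd.DataFrame({
    "Lat": [40.60, 40.70, 40.80, 40.90, 41.00],
    "Lon": [-74.10, -73.90, -73.70, -74.00, -73.80],
    "Hour": 3,
})
toy = pd.concat([busy, quiet], ignore_index=True)
toy["cluster_hourly"] = hot_zones_by_hour(toy)
print(toy.groupby("Hour")["cluster_hourly"].unique())
```

The same loop could be keyed on both `DayOfWeek` and `Hour` once more than one day is included, at the cost of having to tune `eps` and `min_samples` for sparser groups.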

## Conclusion

Our sample was based on April 30, and we observed:

- Manhattan is the area to __monitor most closely__ to avoid making users wait more than 7 minutes.

- Some drivers also have to cover Brooklyn, New Jersey and the airport areas.


DBSCAN seems well suited to this kind of problem.

To improve its service, Uber could notify its drivers in real time about hot zones and, using each driver's GPS position, advise them on where to be.
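As a rough illustration of that last idea, the sketch below finds the hot zone closest to a driver's GPS position using the haversine distance. The function names, the zone names and their coordinates are all hypothetical; in practice the zone centers would come from the clusters computed above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def nearest_hot_zone(driver_lat, driver_lon, zones):
    """Return (zone name, distance in km) of the closest zone center."""
    name, (lat, lon) = min(
        zones.items(),
        key=lambda item: haversine_km(driver_lat, driver_lon, *item[1]),
    )
    return name, haversine_km(driver_lat, driver_lon, lat, lon)

# Hypothetical zone centers (approximate coordinates, for illustration only)
zones = {
    "midtown": (40.7549, -73.9840),
    "jfk": (40.6413, -73.7781),
}

name, dist_km = nearest_hot_zone(40.7580, -73.9855, zones)
print(name, round(dist_km, 2))
```

A production system would also need to weigh how many drivers are already in each zone, which this sketch ignores.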