<a href="https://colab.research.google.com/github/harnalashok/Clustering/blob/master/uber_trips_clusters_II.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
r"""
Students Exercise Solution:

Last amended: 23rd May, 2022
My folder: D:\data\OneDrive\Documents\uber
           /home/ashok/Documents/4.clustering
           
DataSource:
             https://github.com/fivethirtyeight/uber-tlc-foil-response/tree/master/uber-trip-data   
             https://github.com/caroljmcdonald/spark-ml-kmeans-uber  
                   

Ref: KDnuggets
    https://www.kdnuggets.com/2020/07/clustering-rideshare-data-uber.html
    Aurélien Géron's book: https://github.com/ageron/handson-ml2/blob/master/09_unsupervised_learning.ipynb


Reference file: uber_trips_clusters_I on GitHub

Students Exercise: Discover whether Uber taxi hubs differ
                   between office hours and non-office hours.
                   Assume six clusters for each period.

"""

# Problem
Uber Technologies Inc. is a peer-to-peer ride-sharing platform. Its platform connects customers with drivers who can drive to the customer's location. Uber uses machine learning for tasks ranging from calculating pricing to finding the optimal positioning of cars to maximize profits. We use the public Uber trip dataset to discuss building a real-time example for analysis and monitoring of car GPS data.

The Uber trip dataset contains data generated by Uber from New York City. The data is freely available on [FiveThirtyEight](https://data.fivethirtyeight.com/).

### Software install & Call libraries

In [None]:
# 0.1 If the installed versions are not correct,
#     install the desired versions
!pip install yellowbrick==1.4
!pip install folium==0.12.1

In [2]:
# 1.0 Call libraries
%reset -f
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

# 1.1 Visualization
import matplotlib.pyplot as plt
# conda install -c conda-forge folium

import folium
# conda install -c districtdatalabs yellowbrick
#from yellowbrick.cluster import SilhouetteVisualizer
#from yellowbrick.cluster import silhouette_visualizer

# 1.2
import os,time,gc


In [3]:
# 1.3 Display output from multiple commands
#     in a jupyter cell:

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [4]:
# 1.4 Check
import yellowbrick
yellowbrick.__version__   # 1.4

# 1.4.1
import folium
folium.__version__        # 0.12.1

'1.4'

'0.12.1'

In [5]:
# 1.5 Connect to your google drive
#     so that the Uber data files on
#     gdrive are accessible from the colab VM

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
# 2.0 Read data:

#pathToFolder = "D:\\data\\OneDrive\\Documents\\uber"
pathToFolder = "/content/drive/MyDrive/Colab_data_files/uber"
os.chdir(pathToFolder)
os.listdir(pathToFolder)

['uber_raw_data_apr_sep2014.csv.zip', 'clus_mem.npy']

### Read data

In [7]:
# 2.1 

data = pd.read_csv(
                     "uber_raw_data_apr_sep2014.csv.zip",
                     names =["dtime","lat","long","base"],
                     parse_dates = ['dtime']                     # Make dtype of dtime datetime
                   )

# 2.1.1 About data:

data.shape    # (4534327, 4)
data.head()
data.dtypes

(4534327, 4)

Unnamed: 0,dtime,lat,long,base
0,2014-04-01 00:11:00,40.769,-73.9549,B02512
1,2014-04-01 00:17:00,40.7267,-74.0345,B02512
2,2014-04-01 00:21:00,40.7316,-73.9873,B02512
3,2014-04-01 00:28:00,40.7588,-73.9776,B02512
4,2014-04-01 00:33:00,40.7594,-73.9722,B02512


dtime    datetime64[ns]
lat             float64
long            float64
base             object
dtype: object

The dataset has about 45 lakh (4.5 million) observations and four attributes:

> **Date/Time**: The date and time of the Uber pickup.<br>
**Lat (Latitude)**: The latitude of the Uber pickup.<br>
**Lon (Longitude)**: The longitude of the Uber pickup.<br>
**Base**: The TLC base company code affiliated with the Uber pickup.<br>

In [8]:
# 2.2 Take a sample of data:
data = data.sample(n = 100000)

In [9]:
# 2.3 Parse dates. We just want 'hours':

data['hour'] = data['dtime'].dt.hour

In [10]:
# 2.4 Divide the day into periods, based on the pickup hour:

data['period'] = pd.cut(
                       data['hour'],
                       bins = [-1, 7, 11, 16, 24 ],
                       labels = ['earlymorning', 'officetime', 'slacktime', 'evening']
                     )


# 2.5 Delete 'hour' column
#      and release memory:

del data['hour']
gc.collect()

129
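
By default `pd.cut` uses right-closed intervals, so the bins `[-1, 7, 11, 16, 24]` in step 2.4 put hours 0 to 7 in 'earlymorning', 8 to 11 in 'officetime', 12 to 16 in 'slacktime' and 17 to 23 in 'evening'. The sketch below reproduces that mapping with the standard library only, as a quick way to check the bin edges. It is not part of the original notebook, and `hour_to_period` is a hypothetical helper name.

```python
from bisect import bisect_left

# Same edges and labels as in step 2.4
EDGES = [-1, 7, 11, 16, 24]
LABELS = ['earlymorning', 'officetime', 'slacktime', 'evening']

def hour_to_period(hour):
    """Hypothetical helper: mimic pd.cut's default right-closed bins (lo, hi]."""
    if not EDGES[0] < hour <= EDGES[-1]:
        raise ValueError(f"hour {hour} is outside the bins")
    # bisect_left gives the index of the first edge >= hour;
    # that edge is the closed right end of the bin containing hour.
    return LABELS[bisect_left(EDGES, hour) - 1]

# Print the period for the hours on either side of each bin edge
for h in (0, 7, 8, 11, 12, 16, 17, 23):
    print(h, hour_to_period(h))
```

Note that 'officetime' here covers only the morning hours 8 to 11; widening it would mean changing the edges in step 2.4.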

In [11]:
# 3.0 Group data by 'period'
grpd = data.groupby('period')

In [14]:
# 3.1 And extract data subsets by period:

df = []
tperiod = []
for i,j in grpd:
    tperiod.append(i)
    df.append(j)  

In [29]:
# 3.2 Here are two subsets
df[0].head()
df[1].head()
print()
tperiod[0], tperiod[1]

Unnamed: 0,dtime,lat,long,base,period
1473687,2014-06-28 05:22:00,40.738,-73.9851,B02598,earlymorning
2084890,2014-07-23 07:12:00,40.7665,-73.9158,B02598,earlymorning
2636686,2014-07-27 01:03:00,40.7257,-73.9512,B02682,earlymorning
3898696,2014-09-10 06:22:00,40.7496,-73.9815,B02617,earlymorning
1737230,2014-06-10 07:59:00,40.6843,-73.9961,B02682,earlymorning


Unnamed: 0,dtime,lat,long,base,period
1309871,2014-06-08 08:01:00,40.683,-73.9811,B02598,officetime
2832560,2014-08-17 08:04:00,40.7218,-73.9957,B02598,officetime
3246204,2014-08-28 11:32:00,40.7579,-73.9677,B02617,officetime
635847,2014-05-05 11:17:00,40.7621,-73.974,B02598,officetime
458731,2014-04-18 11:22:00,40.7192,-73.9897,B02682,officetime





('earlymorning', 'officetime')

In [17]:
# 4.0 We need only two columns
#     to perform clustering:
#     clus and clus1 are (lat, long)
#     subsets of the sampled data.

clus =  df[0][["lat","long"]]    # For earlymorning
clus1 = df[1][["lat", "long"]]   # For office hours

In [18]:
# 4.1 Quickly create clustering models for both periods:

# 4.2 Instantiate class with parameter values
model = KMeans(
               n_clusters = 6,
               max_iter = 300
               )

# 4.3 Train the object on data:

model.fit(clus)    

KMeans(n_clusters=6)

In [19]:
# 4.4 Instantiate class with parameter values
model1 = KMeans(
               n_clusters = 6,
               max_iter = 300
               )

# 4.5 Train the object on data:

model1.fit(clus1)    

KMeans(n_clusters=6)

#### Centroids

In [30]:
# 4.6 Where are cluster centers?

print("--Early morning\n")
model.cluster_centers_ 
print("\n--Office hours")
model1.cluster_centers_

--Early morning



array([[ 40.67064758, -73.79219079],
       [ 40.73669062, -73.9943287 ],
       [ 40.84531077, -73.58533846],
       [ 40.78249563, -73.95300167],
       [ 40.68988042, -73.96086635],
       [ 40.70012353, -74.21555529]])


--Office hours


array([[ 40.73338935, -73.99562313],
       [ 40.77170865, -73.96808687],
       [ 40.79585347, -73.87000193],
       [ 40.6799099 , -73.75493652],
       [ 40.69864593, -74.19764593],
       [ 40.68683583, -73.96349024]])
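
KMeans numbers its clusters arbitrarily, so cluster 0 in the early-morning model need not correspond to cluster 0 in the office-hours model. One way to compare the two sets of hubs is to match each early-morning centroid to its nearest office-hours centroid and look at the distance between them. The sketch below does this with the haversine (great-circle) distance, using the centroid values printed above, rounded to four decimals. It is an illustrative approach rather than part of the original exercise solution, and `haversine_km` is a hypothetical helper name.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, long) points in degrees."""
    lat1, lon1 = radians(p[0]), radians(p[1])
    lat2, lon2 = radians(q[0]), radians(q[1])
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))   # 6371 km: mean Earth radius

# Centroids copied (rounded) from the output above
early = [(40.6706, -73.7922), (40.7367, -73.9943), (40.8453, -73.5853),
         (40.7825, -73.9530), (40.6899, -73.9609), (40.7001, -74.2156)]
office = [(40.7334, -73.9956), (40.7717, -73.9681), (40.7959, -73.8700),
          (40.6799, -73.7549), (40.6986, -74.1976), (40.6868, -73.9635)]

# For each early-morning hub, find the nearest office-hours hub
matches = []
for i, e in enumerate(early):
    j = min(range(len(office)), key=lambda k: haversine_km(e, office[k]))
    matches.append((i, j, haversine_km(e, office[j])))
    print(f"early {i} -> office {j}: {matches[-1][2]:.2f} km")
```

A small distance suggests a hub that persists across both periods. A large distance, or an office-hours centroid that no early-morning centroid maps to, suggests a hub that shifts. Because the data is a random sample and KMeans starts from random centers, the exact values change from run to run.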

#### SSE: Sum of Squared Errors

In [31]:
# 4.7 What is the sum of squares from each pt to respective cluster center?

model.inertia_   # Early morning; value varies with the random sample
print()
model1.inertia_

21.872902041727762




17.011977889914768
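
`inertia_` is the sum of squared distances from each point to its nearest cluster center. Here the inputs are latitude and longitude in degrees, so the result is in squared degrees, which is why the values are small. The sketch below computes the same quantity in plain Python on a few made-up points. The points, the centers and the helper name `sse` are illustrative assumptions, not values from the notebook.

```python
def sse(points, centers):
    """Hypothetical helper: sum of squared Euclidean distances
    from each point to its nearest center."""
    total = 0.0
    for p in points:
        # Squared distance to the closest center
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c))
            for c in centers
        )
    return total

# Made-up (lat, long) points and two centers, for illustration only
points = [(40.70, -74.00), (40.72, -74.00), (40.80, -73.90)]
centers = [(40.71, -74.00), (40.80, -73.90)]
print(sse(points, centers))
```

Inertia values from different subsets are not directly comparable when the subsets differ in size, since more points add more terms to the sum.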

#### Show Cluster centers on folium map
Steps:
* Get latitude and longitude where map will be centered
* Draw folium map with only the above location parameters
* Add, one by one, <i>markers</i> to map. Each marker specifies <i>(lat,long)</i>
* Optionally, specify what to display when a marker is clicked

In [32]:
# 5.0 Centroids for the two periods?

centroids = model.cluster_centers_
centroids1 = model1.cluster_centers_

In [33]:
# 5.1 Transform  cluster-centroids to a DataFrame:
clocation = pd.DataFrame(
                         centroids,
                         columns = ["Latitude", "Longitude"]
                         )

# 5.2
clocation

# 5.3
clocation1 = pd.DataFrame(
                         centroids1,
                         columns = ["Latitude", "Longitude"]
                         )

# 5.4
clocation1

Unnamed: 0,Latitude,Longitude
0,40.670648,-73.792191
1,40.736691,-73.994329
2,40.845311,-73.585338
3,40.782496,-73.953002
4,40.68988,-73.960866
5,40.700124,-74.215555


Unnamed: 0,Latitude,Longitude
0,40.733389,-73.995623
1,40.771709,-73.968087
2,40.795853,-73.870002
3,40.67991,-73.754937
4,40.698646,-74.197646
5,40.686836,-73.96349


In [24]:
# 5.5 What are the mean centroid values in each case?
#      Our drawn maps will be centered
#      at this point:

# 5.5.1 Earlymorning
lat_mean = clocation['Latitude'].mean()
long_mean = clocation['Longitude'].mean()

# 5.5.2 Office time
lat_mean1 = clocation1['Latitude'].mean()
long_mean1 = clocation1['Longitude'].mean()

In [25]:
# 5.6 Converting two centroid DataFrames into lists:

centroid = clocation.values.tolist()
centroid1 = clocation1.values.tolist()

In [26]:
# 6.0  Plotting the centroids on a map using the Folium library
#      Just specify the location of pt where map will be centered

# Early morning
map = folium.Map( location=[lat_mean,long_mean] ) # Center the map here.

# Office time
map1 = folium.Map( location=[lat_mean1,long_mean1] ) # Center the map here.


# 6.1 Add markers at each centroid on the map: 
#       https://python-visualization.github.io/folium/quickstart.html#Markers

# Early morning
tooltip = "Taxi hub-earlymorning"
for point in range(0, len(centroid)):
    abc=folium.Marker(
                      location =    centroid[point],             # Where to draw on the map
                      tooltip  =    tooltip +": " + str(point),  # What to display when mouse hovers on marker
                      popup    =    centroid[point]              # What to display when you click on marker
                      ).add_to(map)  
    

# Office time    
tooltip1 = "Taxi hub--Office hours"
for point in range(0, len(centroid1)):
    abc=folium.Marker(
                      location =    centroid1[point],             # Where to draw on the map
                      tooltip  =    tooltip1 +": " + str(point),  # What to display when mouse hovers on marker
                      popup    =    centroid1[point]              # What to display when you click on marker
                      ).add_to(map1)      

In [28]:
# 6.2 Draw the map
#      Early morning
map


In [27]:
# Office hour
map1

In [None]:
####### I am done ################