Bay Wheels is a regional public bicycle-sharing system in California's San Francisco Bay Area. This dataset comes from https://www.lyft.com/bikes/bay-wheels/system-data and represents trips taken by members of the service during June 2020.

The data is anonymized, and each trip includes:

- Start Time and Date
- End Time and Date
- Trip duration (seconds)
- Rideable Type
- Start Station ID
- Start Station Name
- Start Station Latitude
- Start Station Longitude
- End Station ID
- End Station Name
- End Station Latitude
- End Station Longitude
- Ride ID
- User Type

#### Load the data

In [1]:
import pandas as pd
df = pd.read_csv('baywheeljune2020.csv')

In [2]:
df.head()  # returns the first 5 rows when no number is specified

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,trip_duration
0,172957A20160D568,electric_bike,2020-06-03 15:16:06,2020-06-03 16:06:10,Church St at Duboce Ave,85.0,48th Ave at Cabrillo St,521.0,37.769841,-122.42921,37.772894,-122.509079,casual,3004.0
1,AC29BDD9051D1827,electric_bike,2020-06-03 12:13:30,2020-06-03 12:36:27,Cesar Chavez St at Dolores St,140.0,4th St at 16th St,104.0,37.747758,-122.425121,37.767008,-122.390851,casual,1377.0
2,7E0C4C5917A9EEC2,electric_bike,2020-06-02 19:18:23,2020-06-02 19:46:05,The Embarcadero at Vallejo St,8.0,Hyde St at Post St,369.0,37.799943,-122.398562,37.787527,-122.41683,casual,1662.0
3,6B0E4BF2BBD49A9D,electric_bike,2020-06-03 10:06:26,2020-06-03 10:38:15,Green St at Van Ness Ave,496.0,Green St at Van Ness Ave,496.0,37.797636,-122.423418,37.797653,-122.423335,casual,1909.0
4,27C607CB14528333,electric_bike,2020-06-03 13:09:05,2020-06-03 13:31:33,4th St at 16th St,104.0,Cesar Chavez St at Dolores St,140.0,37.767064,-122.3909,37.747827,-122.425056,casual,1348.0


In [3]:
df.shape  # (number of rows, number of columns) of the DataFrame

(79858, 14)

In [4]:
df.columns  # list the available columns

Index(['ride_id', 'rideable_type', 'started_at', 'ended_at',
       'start_station_name', 'start_station_id', 'end_station_name',
       'end_station_id', 'start_lat', 'start_lng', 'end_lat', 'end_lng',
       'member_casual', 'trip_duration'],
      dtype='object')

- How many rideable types are there in the dataset?

In [5]:
df["rideable_type"].nunique()

2

- Compute the speed (km/h) using the following function.

In [6]:
from geopy.distance import geodesic  # geodesic distance between two (lat, lng) points

def distance(row):
    """Geodesic distance in km between the start and end coordinates of a trip."""
    start = (row['start_lat'], row['start_lng'])
    end = (row['end_lat'], row['end_lng'])
    return geodesic(start, end).km

# Scale the straight-line distance by 1.2 (factor taken from the original notebook)
df['distance'] = df.apply(distance, axis=1) * 1.2

In [7]:
df['speed']=df['distance']/(df['trip_duration']/3600)
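As a quick sanity check of the unit conversion, the sketch below recomputes the speed of the first trip from its `distance` (km) and `trip_duration` (seconds) values. The helper name `speed_kmh` and the guard against non-positive durations are assumptions added for illustration; they are not part of the notebook.

```python
def speed_kmh(distance_km, duration_s):
    """Average speed in km/h from a distance in km and a duration in seconds."""
    if duration_s <= 0:
        # Hypothetical guard: a zero or negative duration has no meaningful speed.
        return float("nan")
    return distance_km / (duration_s / 3600.0)

# First trip of the dataset: 8.453977 km in 3004 s
row0 = speed_kmh(8.453977, 3004.0)
print(round(row0, 4))
```

The result agrees with the `speed` value reported for that row (about 10.13 km/h).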


### Statistical analysis

#### Compute the z-score of speed

In [12]:
# method 1
from scipy import stats
df['zscore']=stats.zscore(df['speed'])

- Use another method to compute the z-score

In [13]:
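# method 2: manual computation
# pandas' .std() defaults to ddof=1 while stats.zscore uses ddof=0; pass ddof=0 for an exact match
df["zscore"] = (df['speed'] - df['speed'].mean()) / df['speed'].std()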
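The two methods differ only in the standard deviation they divide by. The following NumPy sketch, on made-up speeds, shows the z-score with the population standard deviation (`ddof=0`) and with the sample standard deviation (`ddof=1`). The function name `zscores` and the toy values are illustrative assumptions.

```python
import numpy as np

def zscores(values, ddof=0):
    """Standardize values: subtract the mean, divide by the standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=ddof)

# Toy speeds (km/h), not taken from the dataset
speeds = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

z_pop = zscores(speeds, ddof=0)     # population standard deviation
z_sample = zscores(speeds, ddof=1)  # sample standard deviation
print(z_pop)
print(z_sample)
```

With tens of thousands of rows the two variants are nearly identical, so the choice matters mostly for small samples.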


- How many outliers are there? Set the threshold to 2.5.

In [15]:
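# Flag both tails: a trip is an outlier when |z| > 2.5; df['outlier'].sum() then gives the count
df['outlier'] = df["zscore"].abs() > 2.5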

In [16]:
df

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,trip_duration,distance,speed,zscore,outlier
0,172957A20160D568,electric_bike,2020-06-03 15:16:06,2020-06-03 16:06:10,Church St at Duboce Ave,85.0,48th Ave at Cabrillo St,521.0,37.769841,-122.429210,37.772894,-122.509079,casual,3004.0,8.453977,10.131264,0.049131,False
1,AC29BDD9051D1827,electric_bike,2020-06-03 12:13:30,2020-06-03 12:36:27,Cesar Chavez St at Dolores St,140.0,4th St at 16th St,104.0,37.747758,-122.425121,37.767008,-122.390851,casual,1377.0,4.439166,11.605663,0.246157,False
2,7E0C4C5917A9EEC2,electric_bike,2020-06-02 19:18:23,2020-06-02 19:46:05,The Embarcadero at Vallejo St,8.0,Hyde St at Post St,369.0,37.799943,-122.398562,37.787527,-122.416830,casual,1662.0,2.542207,5.506585,-0.568872,False
3,6B0E4BF2BBD49A9D,electric_bike,2020-06-03 10:06:26,2020-06-03 10:38:15,Green St at Van Ness Ave,496.0,Green St at Van Ness Ave,496.0,37.797636,-122.423418,37.797653,-122.423335,casual,1909.0,0.009054,0.017074,-1.302443,False
4,27C607CB14528333,electric_bike,2020-06-03 13:09:05,2020-06-03 13:31:33,4th St at 16th St,104.0,Cesar Chavez St at Dolores St,140.0,37.767064,-122.390900,37.747827,-122.425056,casual,1348.0,4.428411,11.826616,0.275683,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79853,0B71759604CAC23A,docked_bike,2020-06-03 17:36:00,2020-06-03 17:48:03,Sanchez St at 15th St,95.0,11th St at Natoma St,77.0,37.766218,-122.431059,37.773507,-122.416040,casual,723.0,1.861156,9.267169,-0.066340,False
79854,F9B05C6AF19DDFA6,docked_bike,2020-06-09 16:11:52,2020-06-09 16:22:46,MacArthur BART Station,176.0,Miles Ave at Cavour St,205.0,37.828409,-122.266314,37.838800,-122.258732,casual,654.0,1.599047,8.802093,-0.128489,False
79855,7CD57741868F792F,docked_bike,2020-06-20 21:23:17,2020-06-20 21:38:26,Bay Pl at Vernon St,195.0,Bay Pl at Vernon St,195.0,37.812314,-122.260778,37.812314,-122.260779,casual,909.0,0.000059,0.000234,-1.304694,False
79856,B63CC04A5F7D3245,docked_bike,2020-06-13 11:40:08,2020-06-13 11:55:17,Alcatraz Ave at Shattuck Ave,168.0,Rockridge BART Station,171.0,37.849594,-122.265568,37.844279,-122.251900,casual,909.0,1.607790,6.367486,-0.453829,False


#### Compute the kurtosis and skewness

In [29]:
stats.kurtosis(df['speed'])

3813.534316978821

In [11]:
stats.skew(df['speed'])

32.03774309725752
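Both statistics are ratios of central moments. The sketch below computes the moment-based skewness and the excess (Fisher) kurtosis with NumPy, which is what `stats.skew` and `stats.kurtosis` return with their default arguments. The function name and the toy samples are assumptions for illustration.

```python
import numpy as np

def skew_and_excess_kurtosis(values):
    """Skewness m3 / m2**1.5 and excess kurtosis m4 / m2**2 - 3 from central moments."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    m4 = np.mean(centered ** 4)
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

symmetric = skew_and_excess_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0])
right_tailed = skew_and_excess_kurtosis([1.0, 1.0, 1.0, 1.0, 10.0])
print(symmetric)
print(right_tailed)
```

A symmetric sample has zero skewness, while a single large value pulls the skewness upward. The very large positive values obtained for `speed` point to a long right tail, that is, a few trips with extremely high computed speeds.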

#### Hypothesis Testing

In [18]:
df.groupby('rideable_type')['speed'].agg(['mean','std','count'])

rideable_type,mean,std,count
docked_bike,8.189214,7.65067,48677
electric_bike,12.221408,6.49118,31181


We want to compare the sample means for both docked and electric bikes. 
- Which test shall we perform to test whether the two means are significantly different? 
- Write a function 'test' which takes two sample groups and returns the appropriate test statistic and its degrees of freedom.

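In [20]:
# `stats` was imported above (from scipy import stats); equal_var=False gives Welch's t-test
docked_bike = df.loc[df['rideable_type'] == 'docked_bike', 'speed']
electric_bike = df.loc[df['rideable_type'] == 'electric_bike', 'speed']
stats.ttest_ind(docked_bike, electric_bike, equal_var=False)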

- Are the two means statistically different from each other at a significance level of 5%? For a two-sided test with large samples, the critical value is $t_{0.025} \approx 1.96$.

In [19]:
x = df[df['rideable_type'] == 'docked_bike']['speed']
y = df[df['rideable_type'] == 'electric_bike']['speed']

#test(x,y)

### Clustering methods

#### K-means

Pseudocode:
1. Select K points as the initial centroids
2. Repeat:
3.  $\;\;\;\;$ Form K clusters by assigning each point to its closest centroid
4.  $\;\;\;\;$ Recompute the centroid of each cluster
5.  Until the centroids stop changing
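The pseudocode above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used by scikit-learn: the function name `kmeans_sketch`, the random initialization from data points, and the handling of empty clusters are all assumptions.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Plain k-means: alternate assignment and centroid update until stable."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Step 1: pick k distinct data points as initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute centroids (an empty cluster keeps its old centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups
toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
```

The result depends on the initial centroids, which is one reason library implementations restart from several initializations.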

In [87]:
X = df[df['start_station_name'] == 'Church St at Duboce Ave']  # keep the trips starting at one specific station

In [88]:
#Scale data

from sklearn.cluster import KMeans
from sklearn import preprocessing
Z = preprocessing.scale(X[['speed','trip_duration','distance']])

- Why do we use scaling before clustering? 
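To make the question concrete, the sketch below standardizes three rows of (speed, trip_duration, distance) taken from the first trips shown earlier. It assumes the default behavior of `preprocessing.scale`, which centers each column and divides it by its standard deviation. The helper name `standardize` is hypothetical.

```python
import numpy as np

def standardize(features):
    """Center each column to zero mean and scale it to unit variance."""
    features = np.asarray(features, dtype=float)
    return (features - features.mean(axis=0)) / features.std(axis=0)

# speed (km/h), trip_duration (s), distance (km) for the first three trips
raw = np.array([
    [10.131264, 3004.0, 8.453977],
    [11.605663, 1377.0, 4.439166],
    [5.506585, 1662.0, 2.542207],
])
scaled = standardize(raw)

# Share of the squared Euclidean distance between trips 0 and 1 due to trip_duration
raw_share = (raw[0, 1] - raw[1, 1]) ** 2 / ((raw[0] - raw[1]) ** 2).sum()
scaled_share = (scaled[0, 1] - scaled[1, 1]) ** 2 / ((scaled[0] - scaled[1]) ** 2).sum()
print(raw_share, scaled_share)
```

Before scaling, the duration in seconds accounts for almost all of the distance between two trips; after scaling, the three variables contribute on comparable terms.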

In [90]:
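score = []  # WCSS (inertia) for each candidate number of clusters
for cluster in range(1, 11):
    kmeans = KMeans(n_clusters=cluster)
    kmeans.fit(Z)  # fit() returns the fitted estimator, not predictions
    score.append(kmeans.inertia_)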

In [None]:
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
plt.plot(range(1,11), score)
plt.title('The Elbow Method')
plt.xlabel('no of clusters')
plt.ylabel('wcss')
plt.show()

- What is the purpose of using the Elbow method? What does WCSS refer to? How do we interpret the result?
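WCSS stands for within-cluster sum of squares, the quantity scikit-learn exposes as `inertia_`: the sum of squared distances between each point and the centroid of its cluster. A small sketch on toy points, where the function name `wcss` and the data are assumptions, shows how it drops when a second cluster is added.

```python
import numpy as np

def wcss(points, labels, centroids):
    """Sum of squared distances between each point and its assigned centroid."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = np.asarray(labels)
    diffs = points - centroids[labels]
    return float((diffs ** 2).sum())

toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# One cluster centered on the overall mean versus two clusters, one per group
one_cluster = wcss(toy, [0, 0, 0, 0], [toy.mean(axis=0)])
two_clusters = wcss(toy, [0, 0, 1, 1], [[0.0, 1.0], [10.0, 1.0]])
print(one_cluster, two_clusters)
```

WCSS always decreases as clusters are added, so the elbow method looks for the number of clusters after which the decrease becomes marginal.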

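Silhouette analysis measures how well the data points are clustered by comparing, for each point, its average distance to the other points of its own cluster ($a$) with its average distance to the points of the nearest neighboring cluster ($b$). The silhouette coefficient $s = (b - a) / \max(a, b)$ lies between -1 and 1: values close to 1 indicate a point far from the neighboring clusters, values near 0 a point on the boundary between two clusters, and negative values a point that may have been assigned to the wrong cluster.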
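As a rough illustration of what `silhouette_score` averages, the sketch below computes the silhouette coefficient by hand on toy one-dimensional data: for each point, `a` is the mean distance to the other points of its cluster and `b` the mean distance to the nearest other cluster. The function name is hypothetical, and the code assumes at least two clusters.

```python
import numpy as np

def silhouette_sketch(points, labels):
    """Mean of (b - a) / max(a, b) over all points (requires >= 2 clusters)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            scores.append(0.0)  # usual convention for a singleton cluster
            continue
        # a: mean distance to the other members of the same cluster
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

toy = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette_sketch(toy, [0, 0, 1, 1])  # natural grouping
bad = silhouette_sketch(toy, [0, 1, 0, 1])   # groups mixed up
print(good, bad)
```

The natural grouping scores close to 1, while the mixed-up assignment gets a negative score.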

In [93]:
from sklearn.metrics import silhouette_score
silhouette_coefficients = []

for k in range(2, 10):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(Z)
    # use a distinct name so the `score` list from the elbow loop is not overwritten
    sil = silhouette_score(Z, kmeans.labels_)
    silhouette_coefficients.append(sil)

In [None]:
plt.style.use("fivethirtyeight")
plt.plot(range(2, 10), silhouette_coefficients)
plt.xticks(range(2, 10))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()

- Perform k-means with the best number of clusters (based on the silhouette index).

- We plot the speed and the distance of the trips, colored by assigned cluster. Interpret the clusters, also projecting them onto other variables.

In [None]:
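# Refit with the k that maximizes the silhouette coefficient (the loop above leaves `kmeans` at k=9)
best_k = 2 + silhouette_coefficients.index(max(silhouette_coefficients))
kmeans = KMeans(n_clusters=best_k).fit(Z)
y_kmeans = kmeans.predict(Z)  # assign clusters to observations
plt.scatter(X['speed'], X['distance'], c=y_kmeans, s=50, cmap='viridis', alpha=0.5)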

#### Agglomerative Hierarchical clustering

The algorithm is initialized by computing a matrix of distances (or dissimilarities) between the data points to be classified. It starts from the trivial partition into N singletons (each observation is its own cluster) and, at each step, merges the two closest clusters of the partition obtained at the previous step. The algorithm stops when a single cluster remains. The successive merges are represented as a dendrogram.
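The procedure can be illustrated with a naive NumPy sketch. It assumes average linkage (the mean pairwise distance between two clusters) as the merge criterion, which differs from the Ward linkage used in the cell below, and the function name `agglomerate` is hypothetical. The recorded merge history is the information a dendrogram displays.

```python
import numpy as np

def agglomerate(points, n_clusters=1):
    """Naive agglomerative clustering with average linkage.

    Returns the final clusters (lists of point indices) and the merge history.
    """
    points = np.asarray(points, dtype=float)
    # Pairwise distance matrix between all points
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    clusters = [[i] for i in range(len(points))]  # start from singletons
    merges = []
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest average distance
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist[np.ix_(clusters[i], clusters[j])].mean()
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges

toy = [[0.0], [1.0], [5.0], [6.0], [20.0]]
clusters, merges = agglomerate(toy, n_clusters=2)
print(clusters)
```

Stopping at `n_clusters=1` reproduces the full hierarchy described above; stopping earlier corresponds to cutting the dendrogram at a given height.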

In [121]:
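from sklearn.cluster import AgglomerativeClustering
silhouette_coefficients = []
# silhouette coefficient for each candidate number of clusters
for k in range(2, 10):
    # Ward linkage works on Euclidean distances, which is the default;
    # the `affinity` argument is dropped because recent scikit-learn versions no longer accept it
    model = AgglomerativeClustering(n_clusters=k, linkage='ward')
    model = model.fit(Z)
    sil = silhouette_score(Z, model.labels_)
    silhouette_coefficients.append(sil)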

In [None]:
plt.style.use("fivethirtyeight")
plt.plot(range(2, 10), silhouette_coefficients)
plt.xticks(range(2, 10))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()

- Perform hierarchical clustering with the most suitable number of clusters and interpret the result.