# Indego Bikes: Do electric bikes tempt riders to go faster?

<center><b>Abstract</b></center>

Bike-share programs are increasingly popular in cities around the world. Following successful programs in Amsterdam, Paris, and Copenhagen, the City of Philadelphia launched Indego Bikes in 2015. Indego currently maintains a fleet of over 2,000 bicycles, more than half of which are electric bicycles. Urban cycling can be dangerous. To investigate whether riders travel faster on electric bicycles, we compared the duration of trips made using standard and electric bicycles between a pair of stations two miles apart in downtown Philadelphia. Based on a sample of 142 trips in the third quarter of 2023 (94 electric, 48 standard), trips on electric bicycles averaged 0.64 minutes shorter (12.34 versus 12.98 minutes) than trips on standard bicycles. A simulation-based hypothesis test indicates we can reject the null hypothesis at the 95% confidence level—but not at 99%—suggesting a modest but statistically significant difference in mean trip duration. However, the effect size is small (about 40 seconds on a two-mile ride) and likely of limited practical importance; these results do not support the conclusion that riders of electric bicycles are at greater risk solely due to speed.


![Indego riders](Indego_riders.jpg)
You are probably all familiar with [Indego Bikes](https://www.rideindego.com/), the Philly bike-share program. What you may not know is that Indego makes ridership data available on their [trip data](https://www.rideindego.com/about/data/) website. 

In this sample project, we will:
* Explore trip data for the third quarter of 2023
* Make data visualizations
* Formulate a hypothesis
* Test the hypothesis using simulation

In [None]:
import numpy as np
from datascience import *
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

## Load the data

In [None]:
# Note: this is a large data set, hence the low_memory flag.
indego = Table.read_table('indego-trips-2023-q3-2.csv', low_memory=False)

## Exploratory Data Analysis

In [None]:
indego.show(3)

In [None]:
indego.num_rows

**That is a lot of bike trips!**

### Trip Duration
The data set includes the trip duration, which is measured in minutes according to the [website](https://www.rideindego.com/about/data/). Let's see how this variable is distributed. 

In [None]:
min_duration = min(indego.column('duration'))
max_duration = max(indego.column('duration'))
print(f" The longest trip was {max_duration} minutes.")
print(f" The shortest trip was {min_duration} minutes.")

The Indego site says trips below 1 minute are removed. Probably, the rider changed their mind and put the bike back. **But honestly, can you go anywhere in one minute?  Let's see.**

There is a field "trip_route_category" that is either "Round Trip" or "One Way." If a 1-minute trip is a round trip, clearly the bike never left the station. Are there any one-way trips?

In [None]:
quick_trips = indego.where('duration', 1).where('trip_route_category', 'One Way')
quick_trips.show(5)

**Wow!** Yes, there are some really fast trips! How is that possible? Let's look at the start and end stations for the first of these trips in the table above:

start_station is 3061

end_station is 3161

We need to know the station names. Fortunately, Indego provides this in a separate CSV file.

### Load Station Name Data

In [None]:
station_names = Table.read_table('indego-stations-2023-10-01.csv', low_memory=False)
station_names.show(3)

In [None]:
# We just need the first two columns
station_names = station_names.select('Station_ID', 'Station_Name')
print('Start Station')
print(station_names.where('Station_ID', 3061))
print()
print('End Station')
print(station_names.where('Station_ID', 3161))

![Quick bike trip](quick_trip.jpg)

If you put these addresses into Google Maps, it says this is a two-minute ride. I can imagine a Penn or Drexel student getting off the train at 30th Street Station and making this ride in double-quick time, as it is only a third of a mile. Alternatively, the clocks at the two stations might not be perfectly in sync. Or a ride of a minute and 29 seconds could have been rounded down. Anyway, it appears feasible.

Now let's look at the overall distribution of ride durations.

In [None]:
indego.hist('duration', bins=np.arange(0, 80, 5))

We see most trips are short, the peak being 5-10 minutes. Very few are over an hour.

### Trips by Time of Day

Let's look at the distribution of rides by time of day. This requires parsing the hour from strings such as the following.

In [None]:
st = indego.column("start_time")
example = st[100]
example

To extract the hour, we can first split the date and time on the space.

In [None]:
example.split()

Then we keep the second term and split on the colon.

In [None]:
example.split()[1].split(":")

Hour is the first element after this second split. Finally, we convert the hour from a string to an integer so we can plot the distribution.

In [None]:
hour = example.split()[1].split(":")[0]
hour = int(hour)
hour

Put this in a function.

In [None]:
def extract_hour(date_time_string):
    '''
    This function expects a datetime string in the form
    '7/1/2023 1:29' and returns just the hour.
    '''
    hour = date_time_string.split()[1].split(":")[0]
    return int(hour)

In [None]:
# Test our function
test_datetime = st[5000]
print(test_datetime)
print(extract_hour(test_datetime))
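
As a cross-check on the string-splitting approach, the same hour can be pulled out with the standard library's `datetime.strptime`. This is only a sketch: the function name `extract_hour_strptime` and the format string are assumptions based on the example `'7/1/2023 1:29'`, not something taken from the Indego documentation.

```python
from datetime import datetime

def extract_hour_strptime(date_time_string, fmt="%m/%d/%Y %H:%M"):
    """Return the hour (0-23) from a string such as '7/1/2023 1:29'.

    The default format string is an assumption inferred from the
    example above; adjust it if the file uses a different layout.
    """
    return datetime.strptime(date_time_string, fmt).hour

print(extract_hour_strptime("7/1/2023 1:29"))
print(extract_hour_strptime("9/30/2023 18:05"))
```

One possible advantage of this version is that `strptime` raises a `ValueError` when a string does not match the expected layout, so malformed rows surface immediately instead of being parsed incorrectly.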

Now use our function to add a "start_hour" column to our data table.

In [None]:
indego = indego.with_column('start_hour', indego.apply(extract_hour, 'start_time'))
indego.show(3)

In [None]:
# Use one bin per hour: bin edges at 0, 1, ..., 24.
indego.hist('start_hour', bins=np.arange(25))

So the least likely time to start an Indego ride is 3 AM (Hmmm, I wonder why :-), and peak ridership is in the late afternoon and evening, with the maximum at about 6 PM (18:00 hours).
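
To read the busiest hour off as a number rather than from the histogram, one option is to count rides per hour directly. The sketch below uses a small made-up array of start hours; in the notebook the values would presumably come from `indego.column('start_hour')`.

```python
import numpy as np

# Made-up start hours for illustration only; in the notebook these
# would come from indego.column('start_hour').
start_hours = np.array([8, 17, 18, 18, 3, 12, 17, 18, 9, 18])

# With minlength=24, bincount returns one count for each hour 0..23,
# including hours that never occur in the data.
rides_per_hour = np.bincount(start_hours, minlength=24)

busiest_hour = int(np.argmax(rides_per_hour))
print(f"Busiest hour: {busiest_hour}:00 with {rides_per_hour[busiest_hour]} rides")
```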

### Rides by Type of Bicycle
Indego offers two types of bikes: standard and electric. Let's see which constitutes the majority of the rides.

In [None]:
indego_grp = indego.group('bike_type')
indego_grp

Wow! The ebikes, which Indego introduced only a few years earlier, account for slightly more than half of the rides.

This raises the question: are more riders choosing ebikes because they prefer them, or is it a simple matter of availability? To answer this, we need to know what fraction of the bike fleet is ebikes. We can get this using the bike ID in concert with the bike type.

In [None]:
ebike = indego.where('bike_type', 'electric')
ebike_id = np.unique(ebike.column('bike_id'))
num_ebikes = len(ebike_id)

standard = indego.where('bike_type', 'standard')
standard_id = np.unique(standard.column('bike_id'))
num_standard = len(standard_id)

print("The number of ebikes is:", num_ebikes)
print("The number of standard bikes is:", num_standard)

So there are more ebikes in the fleet. Next, compare the number of rides per bike for each type.

In [None]:
rides_per_ebike = indego_grp.column('count').item(0) / num_ebikes
rides_per_ebike

In [None]:
rides_per_standard = indego_grp.column('count').item(1) / num_standard
rides_per_standard

So even though there are more ebikes, riders show a slight preference for standard bikes. Why? Probably simple economics. The ebikes cost extra to ride. Here is the current [pricing information.](https://www.rideindego.com/buy-a-pass/#/)

## Hypothesis
While thinking about ebikes and standard bikes, an interesting hypothesis occurred to me. I wondered whether riders rode faster on ebikes than on standard bikes. If so, this could make ebikes more dangerous, particularly in an urban environment. 

I did some background research. Here are some references on this topic:

```
Langford, B. C., Chen, J., & Cherry, C. R. (2015). Risky riding: Naturalistic methods comparing safety behavior from conventional bicycle riders and electric bike riders. Accident Analysis & Prevention, 82, 220-226.

Gogola, M. (2018, April). Are the e-bikes more dangerous than traditional bicycles?. In 2018 XI International Science-Technical Conference Automotive Safety (pp. 1-4). IEEE.

Siman-Tov, M. (2018). A look at electric bike casualties: do they differ from the mechanical bicycle? J Transp Heal 11 (October): 176–182.
```




**Hypothesis: Riders bike at higher speeds on ebikes.**

**Null Hypothesis: Any difference in mean ride speed by type of bike can be explained by natural variability in ride times.**

To test this hypothesis, we can compare mean trip duration by bicycle type where all of the trips start and end at the same two bike stations. Presumably, the time taken to check out and return the bikes would be roughly the same, so any difference would be attributable to ride speed. Clearly, we need stations reasonably far apart so the ride isn't too short, and a pair of stations with a lot of trips to furnish an adequate sample size.

### Finding rides of reasonable duration
If the ride is too short, it will be hard to tell if there is a speed difference between standard and ebike riders. Let's find rides between 10 and 20 minutes long.

In [None]:
rides_10to20min = indego.where('duration', are.between(10, 20))

### Find the most popular starting station

In [None]:
rides_10to20min.select('start_station', 'end_station').group('start_station').sort('count', descending=True).show(5)

### Find the most popular destinations from the most popular starting station, which is 3010

In [None]:
station_names.where('Station_ID', 3010)

In [None]:
starts = rides_10to20min.where('start_station', 3010)
starts.group('end_station').sort('count', descending=True).show(5)

The most popular destination not counting round-trips (which would be back to station 3010) is:

In [None]:
station_names.where('Station_ID', 3053)

We want to include trips in both directions to enlarge the sample size.

In [None]:
common_trip = rides_10to20min.where('start_station', 3010).where('end_station', 3053)
common_trip.num_rows

In [None]:
reverse_trip = rides_10to20min.where('start_station', 3053).where('end_station', 3010)
reverse_trip.num_rows

In [None]:
trip = common_trip.append(reverse_trip)
trip.num_rows

So we have a data set with 142 trips between these two stations. The possible routes are shown below, again courtesy of Google Maps.

![bike_route from Google Maps](bike_route.jpg)

### Check for Outliers
Make a Box Plot (You can learn more about Box Plots [here.](https://statisticsbyjim.com/graphs/box-plot/))

In [None]:
trip.select('duration').boxplot()

According to Google Maps, this ride is 2.0 miles and should take about 13 minutes by bike. Long rides may mean the rider went somewhere in between stations, but our longest ride is about 19 minutes. It is debatable whether or not to keep this point, but I'll keep it assuming one slow rider.

Notice that the median ride duration is 12 minutes, a minute shorter than Google predicts, but as any frequent bike rider will tell you, Google Maps tends to be conservative.
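
The box plot above flags outliers visually. The same check can be sketched numerically with the common 1.5 × IQR rule, which I believe matches matplotlib's default whisker setting. The durations below are made up for illustration and are not the Indego sample.

```python
import numpy as np

# Made-up trip durations in minutes, for illustration only.
durations = np.array([10, 11, 11, 12, 12, 12, 13, 13, 14, 19])

# Quartiles and interquartile range.
q1, q3 = np.percentile(durations, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = durations[(durations < lower_fence) | (durations > upper_fence)]
print(f"Fences: [{lower_fence}, {upper_fence}]; flagged: {outliers}")
```

A flagged point is only a candidate for removal. As with the slow rider above, whether to keep it remains a judgment call.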

### Compare standard and ebike histograms

In [None]:
type_count = trip.group('bike_type')
type_count

In [None]:
# group() sorts by bike_type, so 'electric' is row 0 and 'standard' is row 1.
num_ebike = type_count.column('count').item(0)
num_standard = type_count.column('count').item(1)

In [None]:
print(f"Our total sample size is {num_standard + num_ebike} trips.")

In [None]:
trip.where('bike_type', 'standard').hist('duration')
plt.title("Standard bikes");

In [None]:
trip.where('bike_type', 'electric').hist('duration')
plt.title("Electric Bikes");

It certainly appears the ebike riders are a bit quicker, but is the difference statistically significant?

Our test statistic will be the difference in the means.

In [None]:
standard_mean_duration = np.mean(trip.where('bike_type', 'standard').column('duration'))
ebike_mean_duration = np.mean(trip.where('bike_type', 'electric').column('duration'))
print(f"The average trip on a standard bike takes {np.round(standard_mean_duration, 2)} minutes.")
print(f"The average trip on an electric bike takes {np.round(ebike_mean_duration, 2)} minutes.")
print()
print(f"The difference in means is {np.round(standard_mean_duration - ebike_mean_duration, 2)} minutes")

In [None]:
test_statistic = np.abs(standard_mean_duration - ebike_mean_duration)

So the average difference is less than a minute. Could this be random? Run a simulation!

### Simulation Under the Null Hypothesis
If the null hypothesis is true, then all of the trip durations come from the same distribution. So we repeatedly sample this distribution and calculate the difference in means.

In [None]:
def compare_means(tbl, num_standard, num_electric):
    # Draw both groups (with replacement) from the pooled trips,
    # as the null hypothesis assumes a single distribution.
    sample1 = tbl.sample(num_standard)
    sample2 = tbl.sample(num_electric)
    sample1_mean = np.mean(sample1.column('duration'))
    sample2_mean = np.mean(sample2.column('duration'))
    # Standard minus electric, the same direction as the observed test statistic.
    return sample1_mean - sample2_mean

In [None]:
# Test function, using the trip counts for this station pair
# (num_ebike), not the fleet size (num_ebikes).
test = compare_means(trip, num_standard, num_ebike)
test

So for a single simulation the difference is small, but we need to run the simulation many times to find the distribution.

In [None]:
def simulate(num_sim):
    means = make_array()
    for i in np.arange(num_sim):
        m = compare_means(trip, num_standard, num_ebike)
        means = np.append(means, m)
    return means

In [None]:
num_simulations = 5000
means = simulate(num_simulations)
results = Table().with_column("Means", means)

In [None]:
results.hist('Means')
plt.scatter(test_statistic, 0, color='red', s=200);
plt.title('Simulated Differences in Means');

In [None]:
p = np.count_nonzero(results.column('Means') >= test_statistic) / num_simulations
p
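
The simulation above draws each group with replacement from the pooled trips. A closely related alternative, not the method used in this notebook, is a permutation test: shuffle the pooled durations and split them into two groups of the original sizes. The sketch below runs on synthetic data (normally distributed durations that I made up), borrowing only the group sizes of 48 and 94 from this study.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic durations (minutes), NOT the Indego data: 48 "standard"
# and 94 "electric" trips, matching only the sample sizes reported above.
standard = rng.normal(loc=13.0, scale=2.0, size=48)
electric = rng.normal(loc=12.3, scale=2.0, size=94)

observed = standard.mean() - electric.mean()
pooled = np.concatenate([standard, electric])

def permuted_difference(pooled, n_first, rng):
    """Shuffle the pooled values and return mean(first n) - mean(rest)."""
    shuffled = rng.permutation(pooled)
    return shuffled[:n_first].mean() - shuffled[n_first:].mean()

num_simulations = 5000
diffs = np.array([permuted_difference(pooled, len(standard), rng)
                  for _ in range(num_simulations)])

# One-sided p-value: how often shuffling alone produces a gap
# at least as large as the observed one.
p_value = np.count_nonzero(diffs >= observed) / num_simulations
print(f"observed difference = {observed:.2f} min, p = {p_value:.4f}")
```

Because every shuffle reuses each observed duration exactly once, the permutation approach keeps the pooled data fixed and varies only the group labels, which is one way to express "bike type makes no difference."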

## Conclusions

The distributions hinted that the riders might be slightly quicker on electric bikes. Our simulations show that we can reject the null hypothesis with 95% confidence, but not with 99% confidence. Thus, we conclude the difference is statistically significant at the 95% level: riders appear to be slightly faster on electric bikes, at least for a two-mile ride in the city.

Is it possible the difference would prove significant with 99% confidence if we had more data? Absolutely! We could easily obtain more data by analyzing more station pairs and by downloading more data from the Indego site. (An exercise left to the reader ;-)

The **effect size**, however, appears to be small. Does anyone care if we shave 40 seconds off of a two-mile commuter ride? Unlikely. Sometimes even results that are "statistically significant" do not have practical importance. It does not appear the ebike riders are taking greater risks, because they are not going that much faster.
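
To put the effect size in terms of speed, the mean durations can be converted to average miles per hour. This sketch uses the figures quoted in this write-up (a 2.0-mile route, 12.98 and 12.34 minutes) and assumes the whole recorded duration is spent riding, which overstates the time on the road because it includes checking the bike out and docking it.

```python
# Figures quoted in this write-up: a 2.0-mile route, with mean durations
# of 12.98 minutes (standard) and 12.34 minutes (electric).
route_miles = 2.0
standard_minutes = 12.98
electric_minutes = 12.34

def average_mph(miles, minutes):
    """Average speed in miles per hour over the whole trip."""
    return miles / (minutes / 60)

standard_mph = average_mph(route_miles, standard_minutes)
electric_mph = average_mph(route_miles, electric_minutes)

print(f"standard: {standard_mph:.2f} mph")
print(f"electric: {electric_mph:.2f} mph")
print(f"difference: {electric_mph - standard_mph:.2f} mph")
```

Under these assumptions both groups average a little over 9 mph, and the gap between them is roughly half a mile per hour.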

## Future Work
Any good scientific study ends with a reflection on possibilities for further study. There are many ideas we could pursue:

* How does ridership change with the weather? We could download Philly weather data and compare ridership on warm, sunny days versus cold, rainy ones.
* How does ridership change with the academic calendar? Do students dominate the ridership near universities? 
* Which stations still show activity in the wee hours of the morning? Where are people riding at 3 AM?
* How has the balance between standard and ebikes changed over the last few years?
* Are there parts of the city that are under-served and could use more stations?

**Just a few of many ideas for follow-up research!**
