## Bike Sharing Trips Data Preparation

__Requirements__

Before running this notebook, you should place the `la_2019.csv` and the `weather_hourly_la.csv` files in the `00_data` folder and run the `00_bike_sharing_stations.ipynb` notebook.

In [1]:
import pandas as pd
import numpy as np

In [2]:
bikesharing_df = pd.read_csv('../00_data/la_2019.csv')

  exec(code_obj, self.user_global_ns, self.user_ns)


In [3]:
bikesharing_df.head(2)

Unnamed: 0,start_time,end_time,start_station_id,end_station_id,bike_id,user_type,start_station_name,end_station_name
0,2019-01-01 00:07:00,2019-01-01 00:14:00,3046,3051,6468,Walk-up,2nd & Hill,7th & Broadway
1,2019-01-01 00:08:00,2019-01-01 00:14:00,3046,3051,12311,Walk-up,2nd & Hill,7th & Broadway


In [4]:
bikesharing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 290342 entries, 0 to 290341
Data columns (total 8 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   start_time          290342 non-null  object
 1   end_time            290342 non-null  object
 2   start_station_id    290342 non-null  int64 
 3   end_station_id      290342 non-null  int64 
 4   bike_id             290342 non-null  object
 5   user_type           290342 non-null  object
 6   start_station_name  290342 non-null  object
 7   end_station_name    290342 non-null  object
dtypes: int64(2), object(6)
memory usage: 17.7+ MB


As seen from the column `Non-Null`, all values are the same (=290342). So there are no missing values in this dataset. In the next step, we will explore data ranges for continuous variables and possible values for discrete variables.

In [5]:
# first, we will convert the start and end time of the trip into a new more convenient for future use data format
bikesharing_df['start_time'] = pd.to_datetime(bikesharing_df['start_time'])
bikesharing_df['end_time'] = pd.to_datetime(bikesharing_df['end_time'])
print(f"New datatype for 'start_time' - {bikesharing_df['start_time'].dtype} and 'end_time' - {bikesharing_df['end_time'].dtype}.")

New datatype for 'start_time' - datetime64[ns] and 'end_time' - datetime64[ns].


In [6]:
datetime_format = '%d.%m.%Y %H:%M:%S'
print(f"Earliest observation: {format(bikesharing_df['start_time'].min(), datetime_format)}")
print(f"Latest observation: {format(bikesharing_df['end_time'].max(), datetime_format)}")

Earliest observation: 01.01.2019 00:07:00
Latest observation: 06.01.2020 09:50:52


There are trips ending in the year 2020. Our dataset should, however, be only for the year 2019. As we cannot be sure that we have complete data between 01.01.2020 and 06.01.2020, we will simply remove those trips from our dataset.

In [7]:
bikesharing_df = bikesharing_df[
    (bikesharing_df["start_time"] >= "2019-01-01 00:00:00")
    & (bikesharing_df["end_time"] <= "2019-12-31 23:59:59")
]

In [8]:
# add new column duration and determine its minimum and maximum values
bikesharing_df['duration'] = (bikesharing_df['end_time'] - bikesharing_df['start_time'])

print(f"Shortest trip: {bikesharing_df['duration'].min()}")
print(f"Longest trip: {bikesharing_df['duration'].max()}")

Shortest trip: 0 days 00:00:00
Longest trip: 61 days 15:47:00


In [9]:
# determine how many trips lasted longer than one day
n_trips_above_1d = (
    bikesharing_df["duration"]
    .apply(lambda duration: duration > pd.Timedelta("1d"))
    .sum()
)
print(
    f"Number of trips longer than 1 day: {n_trips_above_1d}"
    + f" ({n_trips_above_1d / len(bikesharing_df) * 100:.2f}%)"
)


Number of trips longer than 1 day: 1265 (0.44%)


Metro Bike Share specified the maximum rental time of 24 hours [(link)](https://bikeshare.metro.net/user-agreement/#:~:text=1%20Maximum%20rental%20time%20is%2024%20hours.). Therefore, we will remove trips above this threshold from our dataset.

Trips that lasted 0 days 00:00:00 can be interpreted as erroneous data. However, there are also cases, where it's the user's mistake. For example, a user rents a bike and decides instantly that they don't actually need a bike. Or there could be a problem with a bike lock. It is often the case that even after the rental has started the lock won't open and the user has to start a new rental. 

We consider the second explanation more plausible. That's why we decided to make an assumption that 0 days 00:00:00 trips represent actual rentals and keep them in our dataset.

In [10]:
bikesharing_df = bikesharing_df[bikesharing_df['duration'] <= pd.Timedelta("1d")]

In [11]:
# next, we will explore what user types exist
bikesharing_df['user_type'].unique()

array(['Walk-up', 'Monthly Pass', 'Annual Pass', 'One Day Pass',
       'Flex Pass', 'Testing'], dtype=object)

While most user types are self-explanatory, `Testing` could mean a couple of things. First, those could be trips conducted by workers of the Metro Bike Share for some testing purposes. Second, those could be test trips granted to new users or by using coupons, etc. In this case, we make an assumption it is the second type of trip, because it sounds more plausible to us.

In the next step, we will import the bike stations dataset, merge it with the trips data and check whether there are any missing data.

In [12]:
stations_df = pd.read_pickle('../00_data/stations.pkl')

In [13]:
stations_df = stations_df.set_index("station_id")
stations_small_df = stations_df[["latitude", "longitude"]]

bikesharing_df = bikesharing_df.join(stations_small_df, on="start_station_id")
bikesharing_df.rename(
    columns={"latitude": "start_latitude", "longitude": "start_longitude"}, inplace=True
)
bikesharing_df = bikesharing_df.join(stations_small_df, on="end_station_id")
bikesharing_df.rename(
    columns={"latitude": "end_latitude", "longitude": "end_longitude"}, inplace=True
)


In [14]:
bikesharing_df[
    ["start_longitude", "start_latitude", "end_longitude", "end_latitude"]
].isna().sum()

start_longitude    55590
start_latitude     55590
end_longitude      58111
end_latitude       58111
dtype: int64

There are some stations in our trip data that cannot be linked to the station data. This is most likely due to the fact that our trip data is from 2019 and our station data is from 2020. Some stations have probably been removed. However, as we don't expect to require stations data in all of the coming parts of our project, we will remove those trips on demand.

Next, we calculate the distance covered and the speed of the trips. To calculate the distance of the trip we use haversine distance instead of the direct line because of its higher accuracy.

In [15]:
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    
    author: derricw (https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas/29546836#29546836)
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

In [16]:
bikesharing_df["distance"] = haversine(
    bikesharing_df["start_latitude"],
    bikesharing_df["start_longitude"],
    bikesharing_df["end_latitude"],
    bikesharing_df["end_longitude"],
)

print(f"Smallest distance: {bikesharing_df['distance'].min()} km")
print(f"Greatest distance: {bikesharing_df['distance'].max()} km")


Smallest distance: 0.0 km
Greatest distance: 28.47503197477015 km


In [17]:
bikesharing_df["speed"] = bikesharing_df[
    "distance"
] / bikesharing_df["duration"].apply(
    lambda duration: duration.total_seconds() / (60 * 60)
)

print(f"Lowest speed: {bikesharing_df['speed'].min()}")
print(f"Highest speed: {bikesharing_df['speed'].max()}")

Lowest speed: 0.0
Highest speed: 40.969097401067344


We will omit all trips that exceed the speed of 20mph, which is the limit for e-bikes in the U.S. [(link)](https://electricbikereport.com/fast-electric-bike/#:~:text=You'll%20most%20likely%20know,throttle)
This seems plausible as trips that exceed this limit are very likely to be faulty because they would need to cycle faster than the maximum speed of e-bikes without any stops during the trip.
Also, our `distance` column is calculated as the distance between the start and end station, which is a lower bound on the actual distance traveled. 
Therefore the actual distance traveled is most likely longer and the actual speed is most likely lower.

In [18]:
max_allowed_kmh = 20 * 1.60934 # 20mp/h in km/h
n_trips_too_fast = len(bikesharing_df[bikesharing_df["speed"] > max_allowed_kmh])
print(
    f"Number of trips faster than {max_allowed_kmh} km/h: {n_trips_too_fast}"
    + f" ({n_trips_too_fast / len(bikesharing_df) * 100:.4f}%)"
)


Number of trips faster than 32.1868 km/h: 10 (0.0035%)


In [19]:
bikesharing_df = bikesharing_df[bikesharing_df['speed'] < max_allowed_kmh]

In [20]:
pd.to_pickle(bikesharing_df, '../00_data/trips.pkl')