# To eBike or Not to eBike?

This notebook consolidates the work of the separate notebooks for combining, cleaning, preliminary feature engineering, geolocation, and final feature engineering into a single pass, which is more efficient now that the workflow has been established.

## Specifically for eBike Study

**This notebook creates a combined dataset of NYC CitiBike rides for a time frame shifted one month later than that of the general study in the capstone project, because better data on electric bike usage were gathered during this time frame.**

<a id=toc></a>
## Table of Contents

<ul>
    <li><a href=#01-import-packages>Import Packages</a>
    <li><a href=#02-load-dataset>Load Datasets and Clean Data</a>
    <li><a href=#03-conv-stamps>Convert Timestamps to pandas Format</a>
    <li><a href=#04-geoloc-stations>Geolocate Stations</a>
    <li><a href=#05-calc-values>Calculate Distance and Speed</a>
    <li><a href=#06-feat-eng>Perform Feature Engineering</a>
    <li><a href=#07-final-check>Perform Final Check</a>
    <li><a href=#08-save-file>Save Fully Cleaned DataFrame to .parquet</a>
</ul>

<a id=01-import-packages></a>
## Import Packages

Import necessary packages.

In [1]:
# Apache parquet files (to save space)
import pyarrow as pa
import pyarrow.parquet as pq

# Dataframes and numerical
import pandas as pd
import numpy as np

# Geolocation
import geopandas as gpd
import matplotlib.pyplot as plt

# Increase pandas default display 
pd.options.display.max_rows = 250
pd.options.display.max_columns = 250

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

<a href=#toc>Back to the top</a>

<a id=02-load-dataset></a>
## Load Datasets and Clean Data

Load the monthly CSV files into their own dataframes and combine them.

In [2]:
CB_01 = pd.read_csv('CitiBike_data/Raw/New_Format/202106-citibike-tripdata.csv', low_memory=False)
CB_02 = pd.read_csv('CitiBike_data/Raw/New_Format/202107-citibike-tripdata.csv', low_memory=False)
CB_03 = pd.read_csv('CitiBike_data/Raw/New_Format/202108-citibike-tripdata.csv', low_memory=False)
CB_04 = pd.read_csv('CitiBike_data/Raw/New_Format/202109-citibike-tripdata.csv', low_memory=False)
CB_05 = pd.read_csv('CitiBike_data/Raw/New_Format/202110-citibike-tripdata.csv', low_memory=False)
CB_06 = pd.read_csv('CitiBike_data/Raw/New_Format/202111-citibike-tripdata.csv', low_memory=False)
CB_07 = pd.read_csv('CitiBike_data/Raw/New_Format/202112-citibike-tripdata.csv', low_memory=False)
CB_08 = pd.read_csv('CitiBike_data/Raw/New_Format/202201-citibike-tripdata.csv', low_memory=False)
CB_09 = pd.read_csv('CitiBike_data/Raw/New_Format/202202-citibike-tripdata.csv', low_memory=False)
CB_10 = pd.read_csv('CitiBike_data/Raw/New_Format/202203-citibike-tripdata.csv', low_memory=False)
CB_11 = pd.read_csv('CitiBike_data/Raw/New_Format/202204-citibike-tripdata.csv', low_memory=False)
CB_12 = pd.read_csv('CitiBike_data/Raw/New_Format/202205-citibike-tripdata.csv', low_memory=False)

months = [CB_01, CB_02, CB_03, CB_04, CB_05, CB_06, CB_07, CB_08, CB_09, CB_10, CB_11, CB_12]

CB_Data = pd.concat(months, ignore_index=True, sort=False)
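The twelve near-identical `read_csv` calls could also be generated from the month range. The following is a minimal sketch, assuming the same folder and file-naming pattern as above; the helper name `monthly_paths` is hypothetical and not part of the original workflow.

```python
import pandas as pd

def monthly_paths(start, end, folder='CitiBike_data/Raw/New_Format'):
    """Build one CSV path per month between start and end (inclusive)."""
    months = pd.period_range(start=start, end=end, freq='M')
    return [f'{folder}/{m.strftime("%Y%m")}-citibike-tripdata.csv' for m in months]

paths = monthly_paths('2021-06', '2022-05')
print(len(paths), paths[0], paths[-1])

# Reading and combining would then look like this (not executed here,
# since it needs the raw files on disk):
# CB_Data = pd.concat((pd.read_csv(p, low_memory=False) for p in paths),
#                     ignore_index=True, sort=False)
```

Generating the paths keeps the month range in one place, so shifting the study window only means changing the two arguments.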

Drop columns that are not relevant.

In [3]:
drop_col = ['ride_id', 'start_station_id', 'end_station_id']
CB_Data = CB_Data.drop(axis = 1, columns = drop_col)

Count the total number of rides during that time frame (June 2021 through May 2022).

In [4]:
no_rides = len(CB_Data)

Check for null values.

In [5]:
# "Ghost bikes" coming in from unknown locations
bike_ghos = CB_Data.loc[pd.isnull(CB_Data.start_station_name)]
no_bike_ghos = len(bike_ghos)

# Bikes that are lost, i.e. not docked at the end
bike_lost = CB_Data.loc[pd.isnull(CB_Data.end_station_name)]
no_bike_lost = len(bike_lost)

# Bikes that are docked at the same station they are picked up, for joyride, rider changing mind, defective bike, etc.
bike_joyr = CB_Data.loc[CB_Data.start_station_name == CB_Data.end_station_name]
no_bike_joyr = len(bike_joyr)

print(f'Total number of Citibike rides from June 2021 through May 2022: {no_rides}')
print(f'Total number of "ghost bikes" in that time frame: {no_bike_ghos}')
print(f'Total number of lost bikes in that time frame: {no_bike_lost}')
print(f'Total number of bikes being docked at the same location in that time frame: {no_bike_joyr}')
print(f'Total number of "dud rides" to be removed from data: {no_bike_ghos + no_bike_lost + no_bike_joyr}')

print(f'Percentage of rides from June 2021 through May 2022 with bikes missing: \
      {100*(no_bike_ghos + no_bike_lost)/no_rides}')
print(f'Percentage of rides from June 2021 through May 2022 with bikes docked at the same location: \
      {100*no_bike_joyr/no_rides}')
print(f'Percentage of "dud rides" to be removed from data: \
      {100*(no_bike_ghos + no_bike_lost + no_bike_joyr)/no_rides}')

Total number of Citibike rides from June 2021 through May 2022: 29032983
Total number of "ghost bikes" in that time frame: 245
Total number of lost bikes in that time frame: 102300
Total number of bikes being docked at the same location in that time frame: 1549785
Total number of "dud rides" to be removed from data: 1652330
Percentage of rides from June 2021 through May 2022 with bikes missing:       0.35320173610820493
Percentage of rides from June 2021 through May 2022 with bikes docked at the same location:       5.338015043097707
Percentage of "dud rides" to be removed from data:       5.691216779205912


Now, eliminate these "dud rides" once and for all.

In [6]:
dud_rides = bike_ghos.index.tolist() + bike_lost.index.tolist() + bike_joyr.index.tolist()

CB_Data = CB_Data.drop(axis = 0, index = dud_rides)
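Adding the three counts can count a ride twice when both its start and end station names are missing, whereas a single combined boolean mask counts each row once. The sketch below illustrates this on made-up toy data that only borrows the column names of **CB_Data**; it is an alternative formulation, not the code that produced the numbers above.

```python
import pandas as pd

# Toy data (made up for illustration), same column names as CB_Data
rides = pd.DataFrame({
    'start_station_name': ['A', None, 'B', None, 'C'],
    'end_station_name':   ['B', 'A',  'B', None, None],
})

ghost = rides['start_station_name'].isna()   # unknown start station
lost = rides['end_station_name'].isna()      # unknown end station
# Same start and end station; rows with a missing name are already
# covered by the two masks above, so they are excluded here explicitly
joyride = (rides['start_station_name'] == rides['end_station_name']) & ~(ghost | lost)

dud = ghost | lost | joyride                 # each row flagged at most once
summed = int(ghost.sum() + lost.sum() + joyride.sum())
n_dud = int(dud.sum())

cleaned = rides.loc[~dud]
print(f'sum of counts: {summed}, distinct dud rides: {n_dud}, kept: {len(cleaned)}')
```

Filtering with `~dud` also avoids building a long Python list of index labels before dropping.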

In [7]:
print(f'Total number of rides to work with after cleaning: {len(CB_Data)}')

Total number of rides to work with after cleaning: 27380897


Make sure that all the values of **rideable_type** are consistent. In other words, treat **docked_bike** as **classic_bike**.

In [8]:
# Assign through .loc on the dataframe itself; chained assignment may write to a copy
CB_Data.loc[CB_Data.rideable_type == 'classic_bike', 'rideable_type'] = 'Classic Bike'
CB_Data.loc[CB_Data.rideable_type == 'docked_bike', 'rideable_type'] = 'Classic Bike'
CB_Data.loc[CB_Data.rideable_type == 'electric_bike', 'rideable_type'] = 'Electric Bike'

Capitalize **member_casual** entries.

In [9]:
# Same pattern: select rows and column in a single .loc call
CB_Data.loc[CB_Data.member_casual == 'member', 'member_casual'] = 'Member'
CB_Data.loc[CB_Data.member_casual == 'casual', 'member_casual'] = 'Casual'
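The relabeling can also be written as a lookup table, which keeps all label pairs in one place. This is a small sketch on made-up rows; the dictionary name `type_labels` is hypothetical.

```python
import pandas as pd

# Toy rows (made up), using the raw labels found in the CitiBike files
rides = pd.DataFrame({
    'rideable_type': ['classic_bike', 'docked_bike', 'electric_bike'],
    'member_casual': ['member', 'casual', 'member'],
})

# Hypothetical lookup table: docked bikes are treated as classic bikes
type_labels = {
    'classic_bike': 'Classic Bike',
    'docked_bike': 'Classic Bike',
    'electric_bike': 'Electric Bike',
}

rides['rideable_type'] = rides['rideable_type'].replace(type_labels)
rides['member_casual'] = rides['member_casual'].str.capitalize()
print(rides)
```

`replace` leaves any label that is not in the table untouched, so an unexpected new bike type would stay visible rather than turn into a missing value.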

<a href=#toc>Back to the top</a>

<a id=03-conv-stamps></a>
## Convert Timestamp Strings to pandas Format

Convert the timestamps from strings to actual timestamp data types so that date and time information can be extracted from them as features and ride durations can be calculated.

In [10]:
# From https://dataindependent.com/pandas/pandas-to-datetime-string-to-date-pd-to_datetime/
CB_Data.started_at = pd.to_datetime(CB_Data.started_at, format="%Y-%m-%d %H:%M:%S")
CB_Data.ended_at = pd.to_datetime(CB_Data.ended_at, format="%Y-%m-%d %H:%M:%S")
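As a quick illustration of what the conversion enables, the sketch below parses two rows and computes a duration in minutes. The second row contains a deliberately malformed string, and `errors='coerce'` is an optional addition (not used in the cell above) that turns such values into `NaT` instead of raising an error.

```python
import pandas as pd

# Toy rows: the first uses timestamps from a sample ride, the second is deliberately broken
rides = pd.DataFrame({
    'started_at': ['2021-06-01 23:12:34', 'not a timestamp'],
    'ended_at':   ['2021-06-01 23:14:46', '2021-06-01 23:20:00'],
})

fmt = '%Y-%m-%d %H:%M:%S'
for col in ['started_at', 'ended_at']:
    # errors='coerce' replaces unparseable strings with NaT
    rides[col] = pd.to_datetime(rides[col], format=fmt, errors='coerce')

n_bad = int(rides['started_at'].isna().sum())
duration_min = (rides['ended_at'] - rides['started_at']).dt.total_seconds() / 60
print(n_bad, duration_min.iloc[0])
```

Counting the `NaT` values after conversion is a cheap way to find out whether any month uses a different timestamp layout.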

<a href=#toc>Back to the top</a>

<a id=04-geoloc-stations></a>
## Geolocate Stations

Since the heavy lifting of creating a master file of all stations in the CitiBike system has already been completed, just import it from its .parquet file at this point.

In [11]:
CB_Stations = pq.read_table('CitiBike_data/202206-citibike-stations.parquet').to_pandas()

To normalize the coordinates of the distinct stations, and with them the travel distances, durations, and speeds between stations, the averaged latitudes and longitudes in the dataframe **CB_Stations** shall replace those provided in **CB_Data**. The same join also assigns borough and neighborhood associations.

In [12]:
# Join CB_Data with CB_Stations
# Replace old coordinates with new ones
CB_Data = CB_Data.join(CB_Stations, on='start_station_name', how='right')
CB_Data = CB_Data.drop(columns=['start_lat', 'start_lng'])
CB_Data = CB_Data.rename(columns={'lat': 'start_lat', 'lng': 'start_lng', 'boro': 'start_boro', 'hood': 'start_hood'})
CB_Data = CB_Data.join(CB_Stations, on='end_station_name', how='right')
CB_Data = CB_Data.drop(columns=['end_lat', 'end_lng'])
CB_Data = CB_Data.rename(columns={'lat': 'end_lat', 'lng': 'end_lng', 'boro': 'end_boro', 'hood': 'end_hood'})
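The join logic can be tried out on a toy example. The sketch below assumes that the station table is indexed by station name and has the columns `lat`, `lng`, `boro`, and `hood` (inferred from the renames above); all values are made up. It uses a left join followed by `dropna`, which is a variation on the cell above that keeps the original integer index of the rides.

```python
import pandas as pd

# Toy stand-in for CB_Stations (made-up values), assumed to be indexed by station name
stations = pd.DataFrame(
    {'lat': [40.0, 41.0], 'lng': [-74.0, -73.0], 'boro': ['X', 'Y'], 'hood': ['x1', 'y1']},
    index=pd.Index(['A', 'B'], name='station_name'),
)

# Toy stand-in for CB_Data; station 'Z' does not exist in the station table
rides = pd.DataFrame({
    'start_station_name': ['A', 'B', 'Z'],
    'end_station_name':   ['B', 'A', 'A'],
    'start_lat': [40.01, 41.02, 0.0], 'start_lng': [-74.01, -73.02, 0.0],
    'end_lat':   [41.01, 40.02, 40.0], 'end_lng':  [-73.01, -74.02, -74.0],
})

for side in ['start', 'end']:
    # Drop the per-ride coordinates, then attach the normalized station values
    rides = rides.drop(columns=[f'{side}_lat', f'{side}_lng'])
    rides = rides.join(stations.add_prefix(f'{side}_'), on=f'{side}_station_name', how='left')

# Rides whose station is missing from the station table end up with NaN and are removed
rides = rides.dropna(axis=0)
print(rides)
```

Prefixing the station columns before the join avoids the intermediate rename step and any clash between the start and end versions of `lat`, `lng`, `boro`, and `hood`.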

Sort by index values.

In [13]:
CB_Data = CB_Data.sort_index(ascending=True)

Drop any **NaN** values in the new dataframe.

In [14]:
CB_Data = CB_Data.dropna(axis=0)

Reconvert index in dataframe to integer.

In [15]:
CB_Data.index = CB_Data.index.astype('int64')

<a href=#toc>Back to the top</a>

<a id=05-calc-values></a>
## Calculate Distance and Speed

Calculate (Manhattan) distance and speed with normalized coordinates.

In [16]:
# Conversion factor here:
# https://www.usgs.gov/faqs/how-much-distance-does-degree-minute-and-second-cover-your-maps#:~:text=One%20degree%20of%20latitude%20equals,one%20second%20equals%2080%20feet.
CB_Data['distance_mi'] = 69 * ( abs( CB_Data.start_lat - CB_Data.end_lat ) 
                                  + abs( CB_Data.start_lng - CB_Data.end_lng ) )
CB_Data['duration_min'] = (CB_Data.ended_at - CB_Data.started_at)/np.timedelta64(1,'m')
CB_Data['speed_mph'] = CB_Data.distance_mi / (CB_Data.duration_min / 60)
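The factor of 69 miles per degree holds for latitude; a degree of longitude is shorter by roughly the cosine of the latitude, so applying 69 to both terms overstates east-west distances at New York's latitude. The sketch below shows one possible correction, assuming a simple cosine scaling at the mean latitude of the two stations. The function name `manhattan_distance_mi` is hypothetical, and this is not the formula used for the values reported in this notebook.

```python
import numpy as np

MI_PER_DEG_LAT = 69.0  # same conversion factor as in the cell above

def manhattan_distance_mi(lat1, lng1, lat2, lng2):
    """Taxicab distance in miles, shrinking east-west degrees by cos(latitude)."""
    mean_lat = np.radians((lat1 + lat2) / 2)
    north_south = MI_PER_DEG_LAT * np.abs(lat1 - lat2)
    east_west = MI_PER_DEG_LAT * np.cos(mean_lat) * np.abs(lng1 - lng2)
    return north_south + east_west

# Coordinates of one sample ride: Driggs Ave & N 9 St -> Bayard St & Leonard St
start_lat, start_lng = 40.71817, -73.955201
end_lat, end_lng = 40.719156, -73.948855

corrected = manhattan_distance_mi(start_lat, start_lng, end_lat, end_lng)
flat = 69 * (abs(start_lat - end_lat) + abs(start_lng - end_lng))
print(f'flat factor: {flat:.4f} mi, cosine-corrected: {corrected:.4f} mi')
```

Because the function only uses NumPy operations, it also accepts whole dataframe columns, for example `manhattan_distance_mi(CB_Data.start_lat, CB_Data.start_lng, CB_Data.end_lat, CB_Data.end_lng)`.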

<a href=#toc>Back to the top</a>

<a id=06-feat-eng></a>
## Perform Feature Engineering

With the lengthy computations completed, extract calendar features (year, month, week of year, day of week, and hour of day) from the start timestamp.

In [17]:
CB_Data['year'] = CB_Data.started_at.dt.year
CB_Data['month'] = CB_Data.started_at.dt.month
# .dt.week was removed in pandas 2.0; isocalendar().week returns the ISO week number
CB_Data['week_of_year'] = CB_Data.started_at.dt.isocalendar().week
CB_Data['day_of_week'] = CB_Data.started_at.dt.day_of_week
CB_Data['hour_of_day'] = CB_Data.started_at.dt.hour
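The extraction can be checked on a couple of timestamps. The sketch below uses two start times taken from sample rides in this dataset and obtains the week number through `dt.isocalendar().week`, which follows the ISO 8601 calendar; note that under this convention the first days of January can belong to the last week of the previous year.

```python
import pandas as pd

started = pd.Series(pd.to_datetime(['2021-06-01 23:12:34', '2022-05-15 07:57:48']))

features = pd.DataFrame({
    'year': started.dt.year,
    'month': started.dt.month,
    'week_of_year': started.dt.isocalendar().week,  # ISO 8601 week number
    'day_of_week': started.dt.day_of_week,          # Monday = 0, Sunday = 6
    'hour_of_day': started.dt.hour,
}).astype(int)
print(features)

# ISO weeks do not reset on January 1: this Saturday still belongs to the last week of 2021
new_year_week = pd.Timestamp('2022-01-01').isocalendar().week
print(new_year_week)
```

For a study window that spans a year boundary, this means `week_of_year` combined with `year` does not always identify a unique calendar week.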

Ensure that the extracted calendar features are stored as integers.

In [18]:
CB_Data = CB_Data.astype({'year': int, 'month': int, 'week_of_year': int,
                                  'day_of_week': int, 'hour_of_day': int})

Reindex columns for readability.

In [19]:
col_names = ['member_casual', 'rideable_type',
             'started_at', 'start_station_name', 'start_lat', 'start_lng', 'start_boro', 'start_hood',
             'ended_at', 'end_station_name', 'end_lat', 'end_lng', 'end_boro', 'end_hood',
             'year', 'month', 'week_of_year', 'day_of_week', 'hour_of_day',
             'duration_min', 'distance_mi', 'speed_mph']
CB_Data = CB_Data.reindex(columns=col_names)

<a href=#toc>Back to the top</a>

<a id=07-final-check></a>
## Perform Final Check

Perform a final check of the structure and integrity of the dataframe before finally writing it to a .parquet file.

In [20]:
CB_Data.head()

Unnamed: 0,member_casual,rideable_type,started_at,start_station_name,start_lat,start_lng,start_boro,start_hood,ended_at,end_station_name,end_lat,end_lng,end_boro,end_hood,year,month,week_of_year,day_of_week,hour_of_day,duration_min,distance_mi,speed_mph
0,Member,Classic Bike,2021-06-01 23:12:34,Driggs Ave & N 9 St,40.71817,-73.955201,Brooklyn,Williamsburg,2021-06-01 23:14:46,Bayard St & Leonard St,40.719156,-73.948855,Brooklyn,Greenpoint,2021,6,22,1,23,2.2,0.50594,13.798369
1,Casual,Classic Bike,2021-06-16 17:14:56,Fulton St & Broadway,40.711066,-74.009447,Manhattan,Financial District,2021-06-16 17:29:15,Mercer St & Spring St,40.723627,-73.999496,Manhattan,SoHo,2021,6,24,2,17,14.316667,1.553374,6.510067
2,Casual,Classic Bike,2021-06-07 19:41:55,Devoe St & Lorimer St,40.713352,-73.949103,Brooklyn,Williamsburg,2021-06-07 19:51:28,Manhattan Av & Leonard St,40.72084,-73.94844,Brooklyn,Greenpoint,2021,6,23,0,19,9.55,0.56241,3.533465
3,Member,Electric Bike,2021-06-17 15:13:15,Driggs Ave & N 9 St,40.71817,-73.955201,Brooklyn,Williamsburg,2021-06-17 15:33:25,Greenwich Ave & Charles St,40.735238,-74.000271,Manhattan,West Village,2021,6,24,3,15,20.166667,4.287502,12.756204
4,Member,Electric Bike,2021-06-18 08:27:03,Graham Ave & Conselyea St,40.715143,-73.944507,Brooklyn,Williamsburg,2021-06-18 08:53:37,E 30 St & Park Ave S,40.744449,-73.983035,Manhattan,Midtown,2021,6,24,4,8,26.566667,4.680605,10.571003


In [21]:
CB_Data.tail()

Unnamed: 0,member_casual,rideable_type,started_at,start_station_name,start_lat,start_lng,start_boro,start_hood,ended_at,end_station_name,end_lat,end_lng,end_boro,end_hood,year,month,week_of_year,day_of_week,hour_of_day,duration_min,distance_mi,speed_mph
29032978,Member,Classic Bike,2022-05-15 07:57:48,Broadway & W 36 St,40.750977,-73.987654,Manhattan,Midtown,2022-05-15 08:12:55,West End Ave & W 60 St,40.77237,-73.99005,Manhattan,Upper West Side,2022,5,19,6,7,15.116667,1.641419,6.515002
29032979,Member,Classic Bike,2022-05-05 18:13:05,Crescent St & 30 Ave,40.768692,-73.924957,Queens,Astoria,2022-05-05 18:20:10,Vernon Blvd & 31 Ave,40.769247,-73.935451,Queens,Astoria,2022,5,18,3,18,7.083333,0.762329,6.457375
29032980,Member,Classic Bike,2022-05-28 00:12:09,45 Ave & 21 St,40.747371,-73.947774,Queens,Long Island City,2022-05-28 00:30:00,Vernon Blvd & 31 Ave,40.769247,-73.935451,Queens,Astoria,2022,5,21,5,0,17.85,2.359807,7.932125
29032981,Member,Classic Bike,2022-05-19 13:06:36,Crescent St & 30 Ave,40.768692,-73.924957,Queens,Astoria,2022-05-19 13:18:02,46 St & 28 Ave,40.763328,-73.908782,Queens,Astoria,2022,5,20,3,13,11.433333,1.486234,7.799481
29032982,Member,Classic Bike,2022-05-09 18:47:28,W 50 St & 9 Ave,40.763605,-73.98918,Manhattan,Hell's Kitchen,2022-05-09 18:52:38,West End Ave & W 60 St,40.77237,-73.99005,Manhattan,Upper West Side,2022,5,19,0,18,5.166667,0.664846,7.720793


In [22]:
CB_Data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27374552 entries, 0 to 29032982
Data columns (total 22 columns):
 #   Column              Dtype         
---  ------              -----         
 0   member_casual       object        
 1   rideable_type       object        
 2   started_at          datetime64[ns]
 3   start_station_name  object        
 4   start_lat           float64       
 5   start_lng           float64       
 6   start_boro          object        
 7   start_hood          object        
 8   ended_at            datetime64[ns]
 9   end_station_name    object        
 10  end_lat             float64       
 11  end_lng             float64       
 12  end_boro            object        
 13  end_hood            object        
 14  year                int64         
 15  month               int64         
 16  week_of_year        int64         
 17  day_of_week         int64         
 18  hour_of_day         int64         
 19  duration_min        float64       
 20  distance_mi         float64       
 21  speed_mph           float64       

In [23]:
CB_Data.dtypes

member_casual                 object
rideable_type                 object
started_at            datetime64[ns]
start_station_name            object
start_lat                    float64
start_lng                    float64
start_boro                    object
start_hood                    object
ended_at              datetime64[ns]
end_station_name              object
end_lat                      float64
end_lng                      float64
end_boro                      object
end_hood                      object
year                           int64
month                          int64
week_of_year                   int64
day_of_week                    int64
hour_of_day                    int64
duration_min                 float64
distance_mi                  float64
speed_mph                    float64
dtype: object

Check for null values.

In [24]:
CB_Data.isnull().sum()

member_casual         0
rideable_type         0
started_at            0
start_station_name    0
start_lat             0
start_lng             0
start_boro            0
start_hood            0
ended_at              0
end_station_name      0
end_lat               0
end_lng               0
end_boro              0
end_hood              0
year                  0
month                 0
week_of_year          0
day_of_week           0
hour_of_day           0
duration_min          0
distance_mi           0
speed_mph             0
dtype: int64
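A null check does not catch every problematic value: a ride with a duration of zero minutes, for instance, produces an infinite speed rather than a null. The sketch below shows a few additional counts that could be inspected before saving; the helper `check_rides` is hypothetical and the toy values are made up.

```python
import numpy as np
import pandas as pd

def check_rides(df):
    """Return simple integrity counts for a rides dataframe."""
    return {
        'nulls': int(df.isnull().sum().sum()),
        'non_positive_duration': int((df['duration_min'] <= 0).sum()),
        'non_finite_speed': int((~np.isfinite(df['speed_mph'])).sum()),
    }

# Toy data (made up): the second ride has zero duration
toy = pd.DataFrame({'distance_mi': [0.5, 0.3], 'duration_min': [2.2, 0.0]})
toy['speed_mph'] = toy['distance_mi'] / (toy['duration_min'] / 60)

report = check_rides(toy)
print(report)
```

On the real data the same call would be `check_rides(CB_Data)`; whether rows flagged this way should be removed is a separate decision for the EDA.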

<a href=#toc>Back to the top</a>

<a id=08-save-file></a>
## Save Fully Cleaned DataFrame to .parquet

The dataframe has no null values and contains all the values needed for the EDA, so it is time to export it to a .parquet file.

In [25]:
CB_Data_arrow = pa.Table.from_pandas(CB_Data)
pq.write_table(CB_Data_arrow, 'CitiBike_data/202106-202205-citibike-tripdata.parquet')

<a href=#toc>Back to the top</a>