# To eBike or Not to eBike?

## Final Feature Engineering

<a id=toc></a>
## Table of Contents

<ul>
    <li><a href=#01-import-packages>Import Packages</a>
    <li><a href=#02-load-dataset>Load Datasets and Check Properties</a>
    <li><a href=#03-norm-data>Normalize Coordinate Data Between DataFrames</a>
    <li><a href=#04-clean-data>Clean Data</a>
        <ul>
            <li><a href=#04-a-extra-row>Remove Extraneous Row</a>
            <li><a href=#04-b-reorg-df>Reorganize DataFrame</a>
            <li><a href=#04-c-conv-dtypes>Convert Data Types</a>
            <li><a href=#04-d-recalc-data>Recalculate Distance and Speed</a>
        </ul>
    <li><a href=#05-final-check>Final Check</a>
    <li><a href=#06-save-file>Save Feature Engineered File</a>
</ul>

<a id=01-import-packages></a>
## Import Packages

Import necessary packages.

In [1]:
# Dataframes and numerical
import pandas as pd
import numpy as np

# Apache parquet files (to save space)
import pyarrow as pa
import pyarrow.parquet as pq

# Increase pandas display limits for rows and columns
pd.options.display.max_rows = 250
pd.options.display.max_columns = 250

# Counter to measure progress of long script
#from tqdm import tqdm_notebook as tqdm

<a href=#toc>Back to the top</a>

<a id=02-load-dataset></a>
## Load Datasets and Check Properties

Load .parquet files into dataframes.

In [2]:
CB_Data = pq.read_table('CitiBike_data/202106-202205-citibike-tripdata.parquet').to_pandas()
CB_Stations = pq.read_table('CitiBike_data/202206-citibike-stations.parquet').to_pandas()

Check the shape, columns, data types, and first rows of each dataframe.

In [3]:
print(CB_Data.shape)
print(CB_Stations.shape)

(27380897, 18)
(1682, 4)


In [4]:
print(CB_Data.columns)
print(CB_Stations.columns)

Index(['rideable_type', 'started_at', 'ended_at', 'start_station_name',
       'end_station_name', 'start_lat', 'start_lng', 'end_lat', 'end_lng',
       'member_casual', 'year', 'month', 'week_of_year', 'day_of_week',
       'hour_of_day', 'duration_min', 'distance_mi', 'speed_mph'],
      dtype='object')
Index(['lat', 'lng', 'boro', 'hood'], dtype='object')


In [5]:
print(CB_Data.dtypes)
print(CB_Stations.dtypes)

rideable_type                 object
started_at            datetime64[ns]
ended_at              datetime64[ns]
start_station_name            object
end_station_name              object
start_lat                    float64
start_lng                    float64
end_lat                      float64
end_lng                      float64
member_casual                 object
year                           int64
month                          int64
week_of_year                   int64
day_of_week                    int64
hour_of_day                    int64
duration_min                 float64
distance_mi                  float64
speed_mph                    float64
dtype: object
lat     float64
lng     float64
boro     object
hood     object
dtype: object


In [6]:
CB_Data.head()

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,end_station_name,start_lat,start_lng,end_lat,end_lng,member_casual,year,month,week_of_year,day_of_week,hour_of_day,duration_min,distance_mi,speed_mph
0,Classic Bike,2021-06-01 23:12:34,2021-06-01 23:14:46,Driggs Ave & N 9 St,Bayard St & Leonard St,40.718169,-73.955201,40.719156,-73.948854,Member,2021,6,22,1,23,2.2,0.506033,13.800891
1,Classic Bike,2021-06-16 17:14:56,2021-06-16 17:29:15,Fulton St & Broadway,Mercer St & Spring St,40.711066,-74.009447,40.723627,-73.999496,Casual,2021,6,24,2,17,14.316667,1.553328,6.509873
2,Classic Bike,2021-06-07 19:41:55,2021-06-07 19:51:28,Devoe St & Lorimer St,Manhattan Av & Leonard St,40.713352,-73.949103,40.72084,-73.94844,Casual,2021,6,23,0,19,9.55,0.562419,3.533523
3,Electric Bike,2021-06-17 15:13:15,2021-06-17 15:33:25,Driggs Ave & N 9 St,Greenwich Ave & Charles St,40.718169,-73.955201,40.735238,-74.000271,Member,2021,6,24,3,15,20.166667,4.287591,12.756469
4,Electric Bike,2021-06-18 08:27:03,2021-06-18 08:53:37,Graham Ave & Conselyea St,E 30 St & Park Ave S,40.715143,-73.944507,40.744449,-73.983035,Member,2021,6,24,4,8,26.566667,4.680581,10.570947


In [7]:
CB_Stations.head()

Unnamed: 0,lat,lng,boro,hood
1 Ave & E 110 St,40.792327,-73.9383,Manhattan,East Harlem
1 Ave & E 16 St,40.732219,-73.981655,Manhattan,Stuyvesant Town
1 Ave & E 18 St,40.733812,-73.980544,Manhattan,Stuyvesant Town
1 Ave & E 30 St,40.741444,-73.975361,Manhattan,Gramercy
1 Ave & E 39 St,40.74714,-73.97113,Manhattan,Tudor City


<a href=#toc>Back to the top</a>

<a id=03-norm-data></a>
## Normalize Coordinate Data Between DataFrames

To make station coordinates consistent across trips, and with them the distances and speeds derived from those coordinates, the averaged latitudes and longitudes in **CB_Stations** replace the ones provided in **CB_Data**. The same join assigns a borough and neighborhood to each start and end station.

In [8]:
# Join CB_Data with CB_Stations into a new dataframe, CB_Data_New,
# replacing the old coordinates with the station-level ones
CB_Data_New = CB_Data.join(CB_Stations, on='start_station_name', how='right')
CB_Data_New = CB_Data_New.drop(columns=['start_lat', 'start_lng'])
CB_Data_New = CB_Data_New.rename(columns={'lat': 'start_lat', 'lng': 'start_lng', 'boro': 'start_boro', 'hood': 'start_hood'})
CB_Data_New = CB_Data_New.join(CB_Stations, on='end_station_name', how='right')
CB_Data_New = CB_Data_New.drop(columns=['end_lat', 'end_lng'])
CB_Data_New = CB_Data_New.rename(columns={'lat': 'end_lat', 'lng': 'end_lng', 'boro': 'end_boro', 'hood': 'end_hood'})

# The joins convert the index to float. Sorting by index and converting it
# back to integers are done as separate steps in the Clean Data section.
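
The effect of `how='right'` on the row count can be reproduced on a toy example. The sketch below is illustrative only: `trips`, `stations`, and the station names are made-up stand-ins for **CB_Data** and **CB_Stations**. It shows that a right join keeps a station with no matching trip as a row of missing values, while a left join keeps exactly the original trips. Whether a left join suits this notebook is an assumption, since it would also keep any trip whose station is missing from the station table.

```python
import pandas as pd

# Toy stand-ins for CB_Data and CB_Stations (all values are made up)
trips = pd.DataFrame({
    'start_station_name': ['A St', 'B St', 'A St'],
    'duration_min': [5.0, 12.0, 7.5],
})
stations = pd.DataFrame(
    {'lat': [40.70, 40.71, 40.72], 'lng': [-74.00, -73.99, -73.98]},
    index=['A St', 'B St', 'C St'],  # 'C St' has no trips
)

# how='right' keeps every station, so 'C St' produces a row with no trip data
right = trips.join(stations, on='start_station_name', how='right')

# how='left' keeps exactly the original trips and only looks up coordinates
left = trips.join(stations, on='start_station_name', how='left')

print(len(trips), len(right), len(left))
print(right[right['duration_min'].isna()])
```

Comparing `len()` before and after each join is a quick way to catch this kind of silent row gain.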

<a href=#toc>Back to the top</a>

<a id=04-clean-data></a>
## Clean Data

<a id=04-a-extra-row></a>
### Remove Extraneous Row

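The joins appear to have added one row. Compare the shape against the original 27,380,897 rows, then inspect the head and tail of the dataframe to find where the extra row came from.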

In [9]:
CB_Data_New.shape

(27380898, 22)

Sort by index values so that any row without a valid index ends up last.

In [10]:
CB_Data_New = CB_Data_New.sort_index(ascending=True)

In [11]:
CB_Data_New.head()

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,end_station_name,member_casual,year,month,week_of_year,day_of_week,hour_of_day,duration_min,distance_mi,speed_mph,start_lat,start_lng,start_boro,start_hood,end_lat,end_lng,end_boro,end_hood
0.0,Classic Bike,2021-06-01 23:12:34,2021-06-01 23:14:46,Driggs Ave & N 9 St,Bayard St & Leonard St,Member,2021.0,6.0,22.0,1.0,23.0,2.2,0.506033,13.800891,40.71817,-73.955201,Brooklyn,Williamsburg,40.719156,-73.948855,Brooklyn,Greenpoint
1.0,Classic Bike,2021-06-16 17:14:56,2021-06-16 17:29:15,Fulton St & Broadway,Mercer St & Spring St,Casual,2021.0,6.0,24.0,2.0,17.0,14.316667,1.553328,6.509873,40.711066,-74.009447,Manhattan,Financial District,40.723627,-73.999496,Manhattan,SoHo
2.0,Classic Bike,2021-06-07 19:41:55,2021-06-07 19:51:28,Devoe St & Lorimer St,Manhattan Av & Leonard St,Casual,2021.0,6.0,23.0,0.0,19.0,9.55,0.562419,3.533523,40.713352,-73.949103,Brooklyn,Williamsburg,40.72084,-73.94844,Brooklyn,Greenpoint
3.0,Electric Bike,2021-06-17 15:13:15,2021-06-17 15:33:25,Driggs Ave & N 9 St,Greenwich Ave & Charles St,Member,2021.0,6.0,24.0,3.0,15.0,20.166667,4.287591,12.756469,40.71817,-73.955201,Brooklyn,Williamsburg,40.735238,-74.000271,Manhattan,Greenwich Village
4.0,Electric Bike,2021-06-18 08:27:03,2021-06-18 08:53:37,Graham Ave & Conselyea St,E 30 St & Park Ave S,Member,2021.0,6.0,24.0,4.0,8.0,26.566667,4.680581,10.570947,40.715143,-73.944507,Brooklyn,Williamsburg,40.744449,-73.983035,Manhattan,Flatiron District


In [12]:
CB_Data_New.tail()

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,end_station_name,member_casual,year,month,week_of_year,day_of_week,hour_of_day,duration_min,distance_mi,speed_mph,start_lat,start_lng,start_boro,start_hood,end_lat,end_lng,end_boro,end_hood
29032979.0,Classic Bike,2022-05-05 18:13:05,2022-05-05 18:20:10,Crescent St & 30 Ave,Vernon Blvd & 31 Ave,Member,2022.0,5.0,18.0,3.0,18.0,7.083333,0.762346,6.457523,40.768692,-73.924957,Queens,Astoria,40.769247,-73.935451,Queens,Astoria
29032980.0,Classic Bike,2022-05-28 00:12:09,2022-05-28 00:30:00,45 Ave & 21 St,Vernon Blvd & 31 Ave,Member,2022.0,5.0,21.0,5.0,0.0,17.85,2.359753,7.931944,40.747371,-73.947774,Queens,Hunters Point,40.769247,-73.935451,Queens,Astoria
29032981.0,Classic Bike,2022-05-19 13:06:36,2022-05-19 13:18:02,Crescent St & 30 Ave,46 St & 28 Ave,Member,2022.0,5.0,20.0,3.0,13.0,11.433333,1.486219,7.799398,40.768692,-73.924957,Queens,Astoria,40.763328,-73.908782,Queens,Astoria
29032982.0,Classic Bike,2022-05-09 18:47:28,2022-05-09 18:52:38,W 50 St & 9 Ave,West End Ave & W 60 St,Member,2022.0,5.0,19.0,0.0,18.0,5.166667,0.664866,7.721026,40.763605,-73.98918,Manhattan,Columbus Circle,40.77237,-73.99005,Manhattan,Upper West Side
,,NaT,NaT,,Clinton St & Newark St,,,,,,,,,,,,,,40.73743,-74.03571,New Jersey,Hoboken


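The last row has no trip data: it holds only an end station (Clinton St & Newark St, Hoboken) that no trip in the dataset ends at, and it was presumably introduced by the right join on `end_station_name`. Drop this row of **NaN** values from the new dataframe.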

In [13]:
CB_Data_New = CB_Data_New.dropna(axis=0)
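
Because `dropna(axis=0)` removes every row that has a missing value in any column, it can help to look at the affected rows first. The following is a minimal sketch on a toy dataframe (`df` and its values are hypothetical, loosely modeled on the rows shown above), not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Toy dataframe: the last row mimics a station with no matching trip
df = pd.DataFrame({
    'rideable_type': ['Classic Bike', 'Electric Bike', None],
    'duration_min': [2.2, 20.2, np.nan],
    'end_station_name': ['Bayard St & Leonard St',
                         'Greenwich Ave & Charles St',
                         'Clinton St & Newark St'],
})

# Rows that contain at least one missing value
incomplete = df[df.isna().any(axis=1)]
print(incomplete)

# Drop them once they have been reviewed
cleaned = df.dropna(axis=0)
print(len(df), len(cleaned))
```

If `incomplete` held more rows than expected, that would be a sign that the join lost or mismatched data rather than just adding an unused station.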

<a href=#toc>Back to the top</a>

<a id=04-b-reorg-df></a>
### Reorganize DataFrame

Convert the dataframe index back to integers; the joins had converted it to float.

In [14]:
CB_Data_New.index = CB_Data_New.index.astype('int64')

In [15]:
CB_Data_New.head()

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,end_station_name,member_casual,year,month,week_of_year,day_of_week,hour_of_day,duration_min,distance_mi,speed_mph,start_lat,start_lng,start_boro,start_hood,end_lat,end_lng,end_boro,end_hood
0,Classic Bike,2021-06-01 23:12:34,2021-06-01 23:14:46,Driggs Ave & N 9 St,Bayard St & Leonard St,Member,2021.0,6.0,22.0,1.0,23.0,2.2,0.506033,13.800891,40.71817,-73.955201,Brooklyn,Williamsburg,40.719156,-73.948855,Brooklyn,Greenpoint
1,Classic Bike,2021-06-16 17:14:56,2021-06-16 17:29:15,Fulton St & Broadway,Mercer St & Spring St,Casual,2021.0,6.0,24.0,2.0,17.0,14.316667,1.553328,6.509873,40.711066,-74.009447,Manhattan,Financial District,40.723627,-73.999496,Manhattan,SoHo
2,Classic Bike,2021-06-07 19:41:55,2021-06-07 19:51:28,Devoe St & Lorimer St,Manhattan Av & Leonard St,Casual,2021.0,6.0,23.0,0.0,19.0,9.55,0.562419,3.533523,40.713352,-73.949103,Brooklyn,Williamsburg,40.72084,-73.94844,Brooklyn,Greenpoint
3,Electric Bike,2021-06-17 15:13:15,2021-06-17 15:33:25,Driggs Ave & N 9 St,Greenwich Ave & Charles St,Member,2021.0,6.0,24.0,3.0,15.0,20.166667,4.287591,12.756469,40.71817,-73.955201,Brooklyn,Williamsburg,40.735238,-74.000271,Manhattan,Greenwich Village
4,Electric Bike,2021-06-18 08:27:03,2021-06-18 08:53:37,Graham Ave & Conselyea St,E 30 St & Park Ave S,Member,2021.0,6.0,24.0,4.0,8.0,26.566667,4.680581,10.570947,40.715143,-73.944507,Brooklyn,Williamsburg,40.744449,-73.983035,Manhattan,Flatiron District


In [16]:
CB_Data_New.tail()

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,end_station_name,member_casual,year,month,week_of_year,day_of_week,hour_of_day,duration_min,distance_mi,speed_mph,start_lat,start_lng,start_boro,start_hood,end_lat,end_lng,end_boro,end_hood
29032978,Classic Bike,2022-05-15 07:57:48,2022-05-15 08:12:55,Broadway & W 36 St,West End Ave & W 60 St,Member,2022.0,5.0,19.0,6.0,7.0,15.116667,1.641414,6.514984,40.750977,-73.987654,Manhattan,Garment District,40.77237,-73.99005,Manhattan,Upper West Side
29032979,Classic Bike,2022-05-05 18:13:05,2022-05-05 18:20:10,Crescent St & 30 Ave,Vernon Blvd & 31 Ave,Member,2022.0,5.0,18.0,3.0,18.0,7.083333,0.762346,6.457523,40.768692,-73.924957,Queens,Astoria,40.769247,-73.935451,Queens,Astoria
29032980,Classic Bike,2022-05-28 00:12:09,2022-05-28 00:30:00,45 Ave & 21 St,Vernon Blvd & 31 Ave,Member,2022.0,5.0,21.0,5.0,0.0,17.85,2.359753,7.931944,40.747371,-73.947774,Queens,Hunters Point,40.769247,-73.935451,Queens,Astoria
29032981,Classic Bike,2022-05-19 13:06:36,2022-05-19 13:18:02,Crescent St & 30 Ave,46 St & 28 Ave,Member,2022.0,5.0,20.0,3.0,13.0,11.433333,1.486219,7.799398,40.768692,-73.924957,Queens,Astoria,40.763328,-73.908782,Queens,Astoria
29032982,Classic Bike,2022-05-09 18:47:28,2022-05-09 18:52:38,W 50 St & 9 Ave,West End Ave & W 60 St,Member,2022.0,5.0,19.0,0.0,18.0,5.166667,0.664866,7.721026,40.763605,-73.98918,Manhattan,Columbus Circle,40.77237,-73.99005,Manhattan,Upper West Side


Reorder the columns for readability.

In [17]:
col_names = ['member_casual', 'rideable_type',
             'started_at', 'start_station_name', 'start_lat', 'start_lng', 'start_boro', 'start_hood',
             'ended_at', 'end_station_name', 'end_lat', 'end_lng', 'end_boro', 'end_hood',
             'year', 'month', 'week_of_year', 'day_of_week', 'hour_of_day',
             'duration_min', 'distance_mi', 'speed_mph']
CB_Data_New = CB_Data_New.reindex(columns=col_names)

<a href=#toc>Back to the top</a>

<a id=04-c-conv-dtypes></a>
### Convert Data Types

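Convert the date-part columns (`year`, `month`, `week_of_year`, `day_of_week`, `hour_of_day`) back to integers; the joins had converted them to float.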

In [18]:
CB_Data_New = CB_Data_New.astype({'year': int, 'month': int, 'week_of_year': int,
                                  'day_of_week': int, 'hour_of_day': int})

In [19]:
CB_Data_New.dtypes

member_casual                 object
rideable_type                 object
started_at            datetime64[ns]
start_station_name            object
start_lat                    float64
start_lng                    float64
start_boro                    object
start_hood                    object
ended_at              datetime64[ns]
end_station_name              object
end_lat                      float64
end_lng                      float64
end_boro                      object
end_hood                      object
year                           int64
month                          int64
week_of_year                   int64
day_of_week                    int64
hour_of_day                    int64
duration_min                 float64
distance_mi                  float64
speed_mph                    float64
dtype: object
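
The float conversion happens because a NumPy integer column cannot hold a missing value, so pandas upcasts it to `float64` as soon as a **NaN** appears. The toy series below (hypothetical values) sketches that round trip: integer, then float once a gap is introduced, then integer again after the gap is dropped.

```python
import numpy as np
import pandas as pd

# An integer column, as 'year' was before the join
year = pd.Series([2021, 2022], name='year')
print(year.dtype)

# Adding a missing value forces the whole column to float64
with_gap = pd.concat([year, pd.Series([np.nan])], ignore_index=True)
print(with_gap.dtype)

# Once the missing value is dropped, the column can be cast back
restored = with_gap.dropna().astype('int64')
print(restored.dtype)
```

Passing `'int64'` rather than `int` makes the target width explicit, which can matter on platforms where the default integer is narrower.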

<a href=#toc>Back to the top</a>

<a id=04-d-recalc-data></a>
### Recalculate Distance and Speed

Recalculate distance and speed with normalized coordinates.

In [20]:
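# Conversion factor (about 69 miles per degree of latitude) from:
# https://www.usgs.gov/faqs/how-much-distance-does-degree-minute-and-second-cover-your-maps#:~:text=One%20degree%20of%20latitude%20equals,one%20second%20equals%2080%20feet.
# Distance is approximated as the sum of the north-south and east-west offsets.
# Note: the same factor is applied to longitude, which overstates east-west
# distance at New York's latitude.
CB_Data_New['distance_mi'] = 69 * (abs(CB_Data_New['start_lat'] - CB_Data_New['end_lat'])
                                   + abs(CB_Data_New['start_lng'] - CB_Data_New['end_lng']))

# Speed in miles per hour: distance divided by duration in hours
CB_Data_New['speed_mph'] = CB_Data_New['distance_mi'] / (CB_Data_New['duration_min'] / 60)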
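
As a worked check, the formula can be applied by hand to the first trip in the dataframe (Driggs Ave & N 9 St to Bayard St & Leonard St, 2.2 minutes). The sketch below uses the rounded coordinates displayed above. The helper names are hypothetical, and `grid_distance_cos_mi` is a variant that is not part of the notebook: it scales the longitude term by the cosine of the latitude, since a degree of longitude covers roughly 52 miles at New York's latitude rather than 69.

```python
import math

MILES_PER_DEGREE = 69.0  # approximate miles per degree of latitude

def grid_distance_mi(lat1, lng1, lat2, lng2):
    """The notebook's formula: sum of absolute offsets in degrees times 69."""
    return MILES_PER_DEGREE * (abs(lat1 - lat2) + abs(lng1 - lng2))

def grid_distance_cos_mi(lat1, lng1, lat2, lng2):
    """Variant (an assumption, not the notebook's method): shrink the
    longitude offset by cos(mean latitude) before scaling."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    return MILES_PER_DEGREE * (abs(lat1 - lat2)
                               + math.cos(mean_lat) * abs(lng1 - lng2))

# First trip, using the rounded coordinates shown in the dataframe
start = (40.71817, -73.955201)
end = (40.719156, -73.948855)
duration_min = 2.2

distance = grid_distance_mi(*start, *end)
speed = distance / (duration_min / 60)
adjusted = grid_distance_cos_mi(*start, *end)

print(round(distance, 4), round(speed, 2), round(adjusted, 4))
```

The first two values should land close to the recalculated `distance_mi` and `speed_mph` for that trip (about 0.506 mi and 13.8 mph), with small differences due to coordinate rounding. The adjusted distance is noticeably shorter because this trip runs mostly east-west.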

<a href=#toc>Back to the top</a>

<a id=05-final-check></a>
## Final Check

In [21]:
CB_Data_New.shape

(27380897, 22)

In [22]:
CB_Data_New.head()

Unnamed: 0,member_casual,rideable_type,started_at,start_station_name,start_lat,start_lng,start_boro,start_hood,ended_at,end_station_name,end_lat,end_lng,end_boro,end_hood,year,month,week_of_year,day_of_week,hour_of_day,duration_min,distance_mi,speed_mph
0,Member,Classic Bike,2021-06-01 23:12:34,Driggs Ave & N 9 St,40.71817,-73.955201,Brooklyn,Williamsburg,2021-06-01 23:14:46,Bayard St & Leonard St,40.719156,-73.948855,Brooklyn,Greenpoint,2021,6,22,1,23,2.2,0.50594,13.798369
1,Casual,Classic Bike,2021-06-16 17:14:56,Fulton St & Broadway,40.711066,-74.009447,Manhattan,Financial District,2021-06-16 17:29:15,Mercer St & Spring St,40.723627,-73.999496,Manhattan,SoHo,2021,6,24,2,17,14.316667,1.553374,6.510067
2,Casual,Classic Bike,2021-06-07 19:41:55,Devoe St & Lorimer St,40.713352,-73.949103,Brooklyn,Williamsburg,2021-06-07 19:51:28,Manhattan Av & Leonard St,40.72084,-73.94844,Brooklyn,Greenpoint,2021,6,23,0,19,9.55,0.56241,3.533465
3,Member,Electric Bike,2021-06-17 15:13:15,Driggs Ave & N 9 St,40.71817,-73.955201,Brooklyn,Williamsburg,2021-06-17 15:33:25,Greenwich Ave & Charles St,40.735238,-74.000271,Manhattan,Greenwich Village,2021,6,24,3,15,20.166667,4.287502,12.756204
4,Member,Electric Bike,2021-06-18 08:27:03,Graham Ave & Conselyea St,40.715143,-73.944507,Brooklyn,Williamsburg,2021-06-18 08:53:37,E 30 St & Park Ave S,40.744449,-73.983035,Manhattan,Flatiron District,2021,6,24,4,8,26.566667,4.680605,10.571003


In [23]:
CB_Data_New.tail()

Unnamed: 0,member_casual,rideable_type,started_at,start_station_name,start_lat,start_lng,start_boro,start_hood,ended_at,end_station_name,end_lat,end_lng,end_boro,end_hood,year,month,week_of_year,day_of_week,hour_of_day,duration_min,distance_mi,speed_mph
29032978,Member,Classic Bike,2022-05-15 07:57:48,Broadway & W 36 St,40.750977,-73.987654,Manhattan,Garment District,2022-05-15 08:12:55,West End Ave & W 60 St,40.77237,-73.99005,Manhattan,Upper West Side,2022,5,19,6,7,15.116667,1.641419,6.515002
29032979,Member,Classic Bike,2022-05-05 18:13:05,Crescent St & 30 Ave,40.768692,-73.924957,Queens,Astoria,2022-05-05 18:20:10,Vernon Blvd & 31 Ave,40.769247,-73.935451,Queens,Astoria,2022,5,18,3,18,7.083333,0.762329,6.457375
29032980,Member,Classic Bike,2022-05-28 00:12:09,45 Ave & 21 St,40.747371,-73.947774,Queens,Hunters Point,2022-05-28 00:30:00,Vernon Blvd & 31 Ave,40.769247,-73.935451,Queens,Astoria,2022,5,21,5,0,17.85,2.359807,7.932125
29032981,Member,Classic Bike,2022-05-19 13:06:36,Crescent St & 30 Ave,40.768692,-73.924957,Queens,Astoria,2022-05-19 13:18:02,46 St & 28 Ave,40.763328,-73.908782,Queens,Astoria,2022,5,20,3,13,11.433333,1.486234,7.799481
29032982,Member,Classic Bike,2022-05-09 18:47:28,W 50 St & 9 Ave,40.763605,-73.98918,Manhattan,Columbus Circle,2022-05-09 18:52:38,West End Ave & W 60 St,40.77237,-73.99005,Manhattan,Upper West Side,2022,5,19,0,18,5.166667,0.664846,7.720793


In [28]:
CB_Data_New.isna().sum()

member_casual         0
rideable_type         0
started_at            0
start_station_name    0
start_lat             0
start_lng             0
start_boro            0
start_hood            0
ended_at              0
end_station_name      0
end_lat               0
end_lng               0
end_boro              0
end_hood              0
year                  0
month                 0
week_of_year          0
day_of_week           0
hour_of_day           0
duration_min          0
distance_mi           0
speed_mph             0
dtype: int64

There are no null values, and the dataframe contains every column needed for the EDA. The completed dataframe can now be exported to a .parquet file.

<a href=#toc>Back to the top</a>

<a id=06-save-file></a>
## Save Feature Engineered File

In [29]:
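# Convert the dataframe to an Arrow table and write it to disk.
# Note: this is the same path that was read in the Load Datasets step,
# so the input file is overwritten.
CB_Data_arrow = pa.Table.from_pandas(CB_Data_New)
pq.write_table(CB_Data_arrow, 'CitiBike_data/202106-202205-citibike-tripdata.parquet')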

<a href=#toc>Back to the top</a>