Since the user experienced problems with R, the remaining data cleaning is performed in Python for efficiency.

First, import relevant packages:

In [1]:
import pandas as pd
import numpy as np

Next, load the CSV file into a dataframe:

In [2]:
# CB_Data = pd.read_csv('CitiBike_data/202105-202204-citibike-trip-data.csv', low_memory=False)

In [4]:
CB_Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27170850 entries, 0 to 27170849
Data columns (total 18 columns):
 #   Column              Dtype  
---  ------              -----  
 0   Unnamed: 0          int64  
 1   rideable_type       object 
 2   start_station_name  object 
 3   end_station_name    object 
 4   start_lat           float64
 5   start_lng           float64
 6   end_lat             float64
 7   end_lng             float64
 8   member_casual       object 
 9   year                int64  
 10  month               int64  
 11  duration            float64
 12  CB_start_hood       object 
 13  CB_start_boro       object 
 14  CB_end_hood         object 
 15  CB_end_boro         object 
 16  distance            float64
 17  speed               float64
dtypes: float64(7), int64(3), object(8)
memory usage: 3.6+ GB


In [3]:
CB_Data.columns

Index(['Unnamed: 0', 'rideable_type', 'start_station_name', 'end_station_name',
       'start_lat', 'start_lng', 'end_lat', 'end_lng', 'member_casual', 'year',
       'month', 'duration', 'CB_start_hood', 'CB_start_boro', 'CB_end_hood',
       'CB_end_boro', 'distance', 'speed'],
      dtype='object')

In [20]:
CB_Data.head()

Unnamed: 0,rideable_type,start_station_name,end_station_name,start_lat,start_lng,end_lat,end_lng,member_casual,year,month,duration,CB_start_hood,CB_start_boro,CB_end_hood,CB_end_boro,distance,speed
0,Classic Bike,Broadway & W 60 St,1 Ave & E 78 St,40.769155,-73.981918,40.771404,-73.953517,member,2021,5,20.4,Midtown West,Manhattan,Upper East Side,Manhattan,2.114896,6.220281
1,Classic Bike,Broadway & W 25 St,E 2 St & Avenue B,40.742868,-73.989186,40.722174,-73.983688,member,2021,5,19.483333,Midtown East,Manhattan,East Village,Manhattan,1.807232,5.565471
2,Classic Bike,46 Ave & 5 St,34th Ave & Vernon Blvd,40.74731,-73.95451,40.765354,-73.939863,member,2021,5,15.566667,Long Island City,Queens,Long Island City,Queens,2.255655,8.694175
3,Classic Bike,46 Ave & 5 St,34th Ave & Vernon Blvd,40.74731,-73.95451,40.765354,-73.939863,member,2021,5,16.216667,Long Island City,Queens,Long Island City,Queens,2.255655,8.345693
4,Classic Bike,46 Ave & 5 St,34th Ave & Vernon Blvd,40.74731,-73.95451,40.765354,-73.939863,member,2021,5,14.566667,Long Island City,Queens,Long Island City,Queens,2.255655,9.291029


Drop the extraneous column 'Unnamed: 0', which is most likely the old index written out by an earlier `to_csv` call.

In [9]:
# CB_Data = CB_Data.drop('Unnamed: 0', axis=1)
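As a hedged illustration of where such a column usually comes from: `to_csv` writes the index by default, and `read_csv` then surfaces it as `Unnamed: 0`. The toy frame below is made up for the example and is not the CitiBike data; it shows the effect and two common ways to avoid it.

```python
import io

import pandas as pd

# Toy frame standing in for the trip data (values are illustrative only).
toy = pd.DataFrame({"rideable_type": ["Classic Bike", "Electric Bike"],
                    "duration": [20.4, 12.5]})

# Default to_csv writes the index as an unnamed first column...
with_index = toy.to_csv()
# ...which read_csv then surfaces as 'Unnamed: 0'.
reloaded = pd.read_csv(io.StringIO(with_index))

# Option 1: tell read_csv that the first column is the index.
reloaded_idx = pd.read_csv(io.StringIO(with_index), index_col=0)

# Option 2: do not write the index in the first place.
reloaded_clean = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(reloaded.columns))
print(list(reloaded_idx.columns))
print(list(reloaded_clean.columns))
```

Either option removes the need for a separate `drop` step after loading.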

It turns out that the distance between the start and end coordinates was improperly calculated. **Haversine** distance was used instead of **block distance**, which better reflects how people move within a city laid out on a rectangular grid. This will fix that:

In [16]:
# Conversion factor here:
# https://www.usgs.gov/faqs/how-much-distance-does-degree-minute-and-second-cover-your-maps#:~:text=One%20degree%20of%20latitude%20equals,one%20second%20equals%2080%20feet.
# CB_Data.distance = 69 * ( abs( CB_Data.start_lat - CB_Data.end_lat ) 
#                                   + abs( CB_Data.start_lng - CB_Data.end_lng ) )
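One caveat about the conversion factor: 69 miles per degree holds for latitude, while a degree of longitude is shorter by a factor of cos(latitude). The sketch below is one possible refinement, not the calculation used in this notebook. The function name `block_distance_miles` is hypothetical, and the street grid is assumed to be aligned with north and east, which is a simplification.

```python
import numpy as np
import pandas as pd

MILES_PER_DEG_LAT = 69.0  # approximate, per the USGS figure cited above


def block_distance_miles(start_lat, start_lng, end_lat, end_lng):
    """Approximate block (Manhattan) distance in miles between coordinate pairs.

    Assumption: longitude degrees are scaled by cos(mean latitude), and the
    grid is treated as aligned with north/east.
    """
    mean_lat = np.radians((start_lat + end_lat) / 2)
    north_south = MILES_PER_DEG_LAT * np.abs(start_lat - end_lat)
    east_west = MILES_PER_DEG_LAT * np.cos(mean_lat) * np.abs(start_lng - end_lng)
    return north_south + east_west


# Coordinates from the first row of the head() output above.
trips = pd.DataFrame({"start_lat": [40.769155], "start_lng": [-73.981918],
                      "end_lat": [40.771404], "end_lng": [-73.953517]})
trips["distance"] = block_distance_miles(trips.start_lat, trips.start_lng,
                                         trips.end_lat, trips.end_lng)
print(trips["distance"].round(2))
```

For this row the unscaled formula gives about 2.11 miles (the value in the table above); the scaled version comes out shorter because the trip is mostly east-west.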

In [19]:
# CB_Data.speed = CB_Data.distance / (CB_Data.duration / 60)

Now, eliminate all instances of "joy rides", in which the rider docked the bike at the same station where the ride started, either because they were literally riding for fun or because they had a problem with the bike.

In [42]:
# joy_rides = CB_Data.loc[CB_Data['start_station_name'] == CB_Data['end_station_name']]

Int64Index([     258,      259,      260,      261,      262,      263,
                 264,      265,      266,      267,
            ...
            28814993, 28814994, 28814995, 28814996, 28814997, 28814999,
            28815001, 28815002, 28815003, 28815005],
           dtype='int64', length=1542162)

In [43]:
# CB_Data = CB_Data.drop(joy_rides.index)

Rule out the bikes that seem to come out of nowhere (rides with an unknown start station), if applicable.

In [74]:
# ghost_bikes = CB_Data.loc[CB_Data['start_station_name'] == 'Unknown']

In [76]:
# CB_Data = CB_Data.drop(ghost_bikes.index)

Now, rule out the missing bikes (rides with an unknown end station).

In [77]:
# missing_bikes = CB_Data.loc[CB_Data['end_station_name'] == 'Unknown']

In [79]:
# CB_Data = CB_Data.drop(missing_bikes.index)

Finally, rule out the rides that were never completed (zero duration), most likely due to a rider changing their mind.

In [53]:
# no_rides = CB_Data.loc[CB_Data['duration'] == 0]

In [55]:
# CB_Data = CB_Data.drop(no_rides.index)
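The four filters above can also be expressed as boolean masks and applied in one pass, which avoids building intermediate dataframes on a dataset of this size. This is a minimal sketch on a toy frame with made-up station names; the column names and the `'Unknown'` sentinel follow the cells above.

```python
import pandas as pd

# Toy frame with one row per case described above (values are illustrative).
trips = pd.DataFrame({
    "start_station_name": ["A St", "A St", "Unknown", "B St", "C St"],
    "end_station_name":   ["B St", "A St", "B St", "Unknown", "A St"],
    "duration":           [10.0, 5.0, 7.0, 8.0, 0.0],
})

joy_ride = trips["start_station_name"] == trips["end_station_name"]
ghost_bike = trips["start_station_name"] == "Unknown"
missing_bike = trips["end_station_name"] == "Unknown"
no_ride = trips["duration"] == 0

# Keep only rows that match none of the exclusion rules.
cleaned = trips.loc[~(joy_ride | ghost_bike | missing_bike | no_ride)]
print(len(trips), "->", len(cleaned))
```

Keeping each mask as a named variable also makes it easy to count how many rows each rule removes, for example with `joy_ride.sum()`.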

The issue in the R data analysis was the inability to calculate speed because of null values in the duration. Therefore, as a final check, the speed is recalculated here to assess the cleanliness of the dataframe so far:

In [57]:
# CB_Data['speed'] = CB_Data['distance'] / (CB_Data['duration'] / 60) # in mph; duration is in minutes, so divide by 60 to get hours
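One way to turn this check into something measurable is to count the rows where the computed speed is not a finite number. The sketch below uses a toy frame with made-up values and assumes, as above, that `duration` is in minutes and `distance` in miles.

```python
import numpy as np
import pandas as pd

# Toy data: one valid trip, one with a missing duration, one with zero duration.
trips = pd.DataFrame({"distance": [2.0, 1.5, 1.0],
                      "duration": [20.0, np.nan, 0.0]})  # duration in minutes

trips["speed"] = trips["distance"] / (trips["duration"] / 60)  # mph

# NaN comes from a missing duration, inf from a zero duration.
bad_speed = ~np.isfinite(trips["speed"])
print(int(bad_speed.sum()), "rows with unusable speed")
```

A count of zero on the real dataframe would suggest that the null and zero durations have been removed.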

So far, so good. Now, the stations that lack a neighborhood label need cleanup through an iterative process:

In [87]:
# nulls = CB_Data.loc[pd.isnull(CB_Data['CB_start_hood']) | pd.isnull(CB_Data['CB_end_hood'])]

In [88]:
# nulls['start_station_name'].value_counts()

2 Ave & E 29 St                63472
8 Ave & W 16 St                63362
W 44 St & 11 Ave               56702
Forsyth St\t& Grand St         53086
6 Ave & W 45 St                53012
                               ...  
Lab - NYC                          7
E 6 St 2 Ave                       6
Avenue D & E 8 St                  3
Grand Concourse & E 161  St        2
Yankee Ferry Terminal              1
Name: start_station_name, Length: 1592, dtype: int64

In [89]:
# nulls['end_station_name'].value_counts()

8 Ave & W 16 St           63465
2 Ave & E 29 St           62081
Forsyth St\t& Grand St    53520
6 Ave & W 45 St           53259
W 44 St & 11 Ave          52587
                          ...  
Brunswick St                  1
Hudson St & 4 St              1
StuyTown Depot                1
Jackson Square                1
Adams St & 2 St               1
Name: end_station_name, Length: 1672, dtype: int64
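Some of the names in these counts contain a tab (`Forsyth St\t& Grand St`) or a doubled space (`Grand Concourse & E 161  St`). Assuming these are unintended and that `\t` in the output is a real tab character, a whitespace normalization pass along the following lines may help names match before neighborhoods are assigned. The third name, with padding, is added only for illustration.

```python
import pandas as pd

# Names modelled on the value_counts output above; the tab and double space
# are assumed to be unintended.
names = pd.Series(["Forsyth St\t& Grand St",
                   "Grand Concourse & E 161  St",
                   " 8 Ave & W 16 St "])

# Collapse any run of whitespace into a single space and trim the ends.
normalized = names.str.replace(r"\s+", " ", regex=True).str.strip()
print(normalized.tolist())
```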

In [86]:
# ques = ['Broadway & E 21 St','E 13 St & 2 Ave','Broadway & W 58 St','E 20 St & 2 Ave','5 Ave & E 72 St']
# hood = ['Midtown East','East Village','Midtown West','Midtown East','Upper East Side']
# boro = ['Manhattan','Manhattan','Manhattan','Manhattan','Manhattan']
# zipped = zip(ques, hood, boro)

# for q, h, b in zipped:
#     CB_Data.CB_start_hood.loc[CB_Data.start_station_name == q] = h
#     CB_Data.CB_start_boro.loc[CB_Data.start_station_name == q] = b
#     CB_Data.CB_end_hood.loc[CB_Data.end_station_name == q] = h
#     CB_Data.CB_end_boro.loc[CB_Data.end_station_name == q] = b

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
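The warning above is triggered by chained indexing (`CB_Data.CB_start_hood.loc[...] = h`), where pandas cannot guarantee that the assignment reaches the original dataframe. A minimal sketch of the same loop with a single `.loc[rows, column]` call per assignment is shown below. The toy frame, `Other St`, and `Elsewhere` are placeholders for illustration; the station and neighborhood names come from the lists above.

```python
import pandas as pd

# Toy frame; 'Other St' and 'Elsewhere' are placeholder values.
trips = pd.DataFrame({
    "start_station_name": ["Broadway & E 21 St", "E 13 St & 2 Ave", "Other St"],
    "end_station_name":   ["E 13 St & 2 Ave", "Other St", "Broadway & E 21 St"],
    "CB_start_hood": [None, None, "Elsewhere"],
    "CB_start_boro": [None, None, "Elsewhere"],
    "CB_end_hood":   [None, "Elsewhere", None],
    "CB_end_boro":   [None, "Elsewhere", None],
})

ques = ["Broadway & E 21 St", "E 13 St & 2 Ave"]
hood = ["Midtown East", "East Village"]
boro = ["Manhattan", "Manhattan"]

for q, h, b in zip(ques, hood, boro):
    start = trips["start_station_name"] == q
    end = trips["end_station_name"] == q
    # A single .loc[row mask, column] call writes to the frame itself.
    trips.loc[start, "CB_start_hood"] = h
    trips.loc[start, "CB_start_boro"] = b
    trips.loc[end, "CB_end_hood"] = h
    trips.loc[end, "CB_end_boro"] = b

print(trips[["CB_start_hood", "CB_end_hood"]])
```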


In [21]:
# CB_Data.to_csv('CitiBike_data/202105-202204-citibike-trip-data.csv', index=False)  # index=False keeps the index from coming back as 'Unnamed: 0'