Since the user experienced problems with R, further data cleaning shall be performed in Python for efficiency.

First, import relevant packages:

In [35]:
import pandas as pd
import numpy as np

Next, load the CSV file into a dataframe:

In [2]:
CB_Data = pd.read_csv('CitiBike_data/202105-202204-citibike-trip-data.csv', low_memory=False)

Now, eliminate all instances of "joy rides", in which the rider docked the bike at the same station, either due to literally riding for fun or having a problem with the bike.

In [42]:
joy_rides = CB_Data.loc[CB_Data['start_station_name'] == CB_Data['end_station_name']]

Int64Index([     258,      259,      260,      261,      262,      263,
                 264,      265,      266,      267,
            ...
            28814993, 28814994, 28814995, 28814996, 28814997, 28814999,
            28815001, 28815002, 28815003, 28815005],
           dtype='int64', length=1542162)

In [43]:
CB_Data = CB_Data.drop(joy_rides.index)

Rule out the bikes that seem to come out of nowhere, if applicable

In [74]:
ghost_bikes = CB_Data.loc[CB_Data['start_station_name'] == 'Unknown']

In [76]:
CB_Data = CB_Data.drop(ghost_bikes.index)

Now, rule out the missing bikes.

In [77]:
missing_bikes = CB_Data.loc[CB_Data['end_station_name'] == 'Unknown']

In [79]:
CB_Data = CB_Data.drop(missing_bikes.index)

Finally, rule out the rides that were never completed, most likely to a rider changing their decision.

In [53]:
no_rides = CB_Data.loc[CB_Data['duration'] == 0]

In [55]:
CB_Data = CB_Data.drop(no_rides.index)

The issue in the R data analysis was the inability to calculate speed, due to null values in the duration. Therefore, as a final check, such a calculation shall be performed in order to assess the cleanliness of this dataframe so far:

In [57]:
CB_Data['speed'] = CB_Data['distance'] / (0.0166667 * CB_Data['duration']) # in mph, convert min to hr at denom

So far, so good. Now, the station names need cleanup through an iterative process

In [80]:
nulls = CB_Data.loc[pd.isnull(CB_Data['CB_start_hood']) | pd.isnull(CB_Data['CB_end_hood'])]

In [81]:
nulls['start_station_name'].value_counts()

Broadway & E 21 St             73210
E 13 St & 2 Ave                72727
Broadway & W 58 St             72268
E 20 St & 2 Ave                71916
5 Ave & E 72 St                63593
                               ...  
Lab - NYC                          7
E 6 St 2 Ave                       6
Avenue D & E 8 St                  5
Yankee Ferry Terminal              5
Grand Concourse & E 161  St        2
Name: start_station_name, Length: 1592, dtype: int64

In [82]:
nulls['end_station_name'].value_counts()

Broadway & E 21 St                               73558
E 13 St & 2 Ave                                  73436
Broadway & W 58 St                               70765
E 20 St & 2 Ave                                  70750
5 Ave & E 72 St                                  64006
                                                 ...  
7 Ave & Bleecker St                                  1
S 5th St & Kent Ave (Domino Park Movie Shoot)        1
Hilltop                                              1
JCBS Depot                                           1
Adams St & 2 St                                      1
Name: end_station_name, Length: 1672, dtype: int64

In [85]:
ques = ['Broadway & E 21 St','E 13 St & 2 Ave','Broadway & W 58 St','E 20 St & 2 Ave','5 Ave & E 72 St']
hood = ['Midtown East','East Village','Midtown West','Midtown East','Upper East Side']
boro = ['Manhattan','Manhattan','Manhattan','Manhattan','Manhattan']
zipped = zip(ques, hood, boro)

for q, h, b in zipped:
    CB_Data.CB_start_hood.loc[CB_Data.start_station_name == q] = h
    CB_Data.CB_start_boro.loc[CB_Data.start_station_name == q] = b
    CB_Data.CB_end_hood.loc[CB_Data.end_station_name == q] = h
    CB_Data.CB_end_boro.loc[CB_Data.end_station_name == q] = b

5902496     NaN
5902500     NaN
5902940     NaN
5902941     NaN
5904070     NaN
           ... 
28813905    NaN
28813906    NaN
28813921    NaN
28813922    NaN
28813924    NaN
Name: CB_start_hood, Length: 73210, dtype: object

In [None]:
[]