<h2> Part 1 </h2>

<h3> Overview of Part 1 </h3>

- In Part 1, we clean and wrangle data that spans the years 2011 and 2012 and is divided into a daily data set and an hourly data set. 
- The data includes information on time (daily or hourly), number of rides by casual (unregistered) users, number of rides by registered users, as well as various weather-related information.
- The datasets were downloaded from: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

<h4> Data source provides the following information on the attributes </h4>

- instant: record index
- dteday : date
- season : season (1:winter, 2:spring, 3:summer, 4:fall)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23)
- holiday : whether day is holiday or not (extracted from [Web Link])
- weekday : day of the week
- workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
- weathersit :
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
- atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
- hum: Normalized humidity. The values are divided by 100 (max)
- windspeed: Normalized wind speed. The values are divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
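
As a quick worked example of the min-max formula quoted above, the sketch below inverts it for the first hourly record (temp = 0.24, atemp = 0.2879). The helper name `denormalise` is ours for illustration and is not part of the data set.

```python
def denormalise(value, t_min, t_max):
    """Invert min-max scaling: value = (t - t_min) / (t_max - t_min)."""
    return value * (t_max - t_min) + t_min

# First row of hour.csv: temp = 0.24, atemp = 0.2879
temp_c = denormalise(0.24, t_min=-8.0, t_max=39.0)
app_temp_c = denormalise(0.2879, t_min=-16.0, t_max=50.0)

print(round(temp_c, 1), round(app_temp_c, 1))
```

Rounded to one decimal, this gives 3.3 °C and 3.0 °C.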

In [124]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import copy
import math

import warnings
warnings.filterwarnings('ignore')


<h5> Load data sets </h5>

In [125]:
df_daily = pd.read_csv('./data/day.csv')
df_hourly = pd.read_csv('./data/hour.csv')

<h5> 1) Basic cleaning for daily data set </h5>

In [126]:
df_daily.head(2)

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801


In [127]:
df_daily.tail(2)

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
729,730,2012-12-30,1,1,12,0,0,0,1,0.255833,0.2317,0.483333,0.350754,364,1432,1796
730,731,2012-12-31,1,1,12,0,1,1,2,0.215833,0.223487,0.5775,0.154846,439,2290,2729


- The data spans the dates 2011-01-01 to 2012-12-31
- 2012 was a leap year and thus had 366 days. 
- We have the correct number of days (365 + 366 = 731)
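
The day count can also be checked programmatically. A minimal sketch, assuming only the first and last dates shown above (the name `expected_days` is illustrative):

```python
import pandas as pd

# Every calendar day from the first to the last date in the daily data
expected_days = pd.date_range(start="2011-01-01", end="2012-12-31", freq="D")

print(len(expected_days))                      # total number of days
print(pd.Timestamp("2012-01-01").is_leap_year)  # leap-year check for 2012
```

Comparing `expected_days` against the `dteday` column (for example with `expected_days.difference(...)`) would additionally confirm that no individual date is missing.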

In [128]:
df_daily.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     731 non-null    int64  
 1   dteday      731 non-null    object 
 2   season      731 non-null    int64  
 3   yr          731 non-null    int64  
 4   mnth        731 non-null    int64  
 5   holiday     731 non-null    int64  
 6   weekday     731 non-null    int64  
 7   workingday  731 non-null    int64  
 8   weathersit  731 non-null    int64  
 9   temp        731 non-null    float64
 10  atemp       731 non-null    float64
 11  hum         731 non-null    float64
 12  windspeed   731 non-null    float64
 13  casual      731 non-null    int64  
 14  registered  731 non-null    int64  
 15  cnt         731 non-null    int64  
dtypes: float64(4), int64(11), object(1)
memory usage: 91.5+ KB


In [129]:
# Drop redundant column in daily data
df_daily.drop('instant', axis = 1, inplace = True)

# Rename column labels in df_daily to make them more readable
columns = {'dteday':'date',
            'yr': 'year',
            'mnth': 'month',
            'weekday':'day_of_week',
            'workingday': 'work_day',
            'weathersit': 'weather_sit',
            'atemp': 'app_temp',
            'hum': 'humidity',
            'windspeed': 'wind_speed',
            'cnt': 'total'
            }

df_daily.rename(columns = columns, inplace = True)

In [130]:
# check for NaN values
print (df_daily.isna().sum())

date           0
season         0
year           0
month          0
holiday        0
day_of_week    0
work_day       0
weather_sit    0
temp           0
app_temp       0
humidity       0
wind_speed     0
casual         0
registered     0
total          0
dtype: int64


In [131]:
# check for duplicate values
print(df_daily.duplicated().sum())

0


- Great. No missing values and no duplicates.

<h5> 2) Basic cleaning for hourly data set </h5>

In [132]:
df_hourly.head(2)

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40


In [133]:
# Drop redundant column in hourly data
df_hourly.drop('instant', axis = 1, inplace = True)

In [134]:
df_hourly.shape

(17379, 16)

In [135]:
# Calculate how many rows are missing from df_hourly
# For every row in df_daily, there should be 24 rows in df_hourly
(df_daily.shape[0] * 24) - (df_hourly.shape[0])

165

- We have 165 rows missing from df_hourly (about 0.9% of the expected 17,544). That is a small enough number that we can live with it.
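
To see which hours are absent rather than just how many, one option is to compare the observed timestamps with a complete hourly range. The sketch below uses a toy frame in place of the hourly data; the names `toy`, `observed`, `expected` and `missing` are illustrative.

```python
import pandas as pd

# Toy stand-in for the hourly data: hour 2 of 2011-01-01 is absent
toy = pd.DataFrame({
    "date": pd.to_datetime(["2011-01-01"] * 3),
    "hour": [0, 1, 3],
})

# Build one timestamp per record, then compare with the full hourly range
observed = toy["date"] + pd.to_timedelta(toy["hour"], unit="h")
expected = pd.date_range(observed.min(), observed.max(), freq=pd.Timedelta(hours=1))
missing = expected.difference(pd.DatetimeIndex(observed))

print(missing)
```

Note that this only finds gaps between the first and last observed timestamps; hours missing at either end of the period would need the range to be built from known start and end dates instead.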

In [136]:
# Rename column labels in df_hourly to make them more readable

columns = {'dteday':'date',
            'yr': 'year',
            'mnth': 'month',
            'weekday':'day_of_week',
            'workingday': 'work_day',
            'weathersit': 'weather_sit',
            'atemp': 'app_temp',
            'hum': 'humidity',
            'windspeed': 'wind_speed',
            'cnt': 'total',
            'hr':'hour'
            }

df_hourly.rename(columns = columns, inplace = True)

In [137]:
df_hourly.head(2)

Unnamed: 0,date,season,year,month,hour,holiday,day_of_week,work_day,weather_sit,temp,app_temp,humidity,wind_speed,casual,registered,total
0,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40


In [138]:
# check for NaN values
print(df_hourly.isna().sum())

date           0
season         0
year           0
month          0
hour           0
holiday        0
day_of_week    0
work_day       0
weather_sit    0
temp           0
app_temp       0
humidity       0
wind_speed     0
casual         0
registered     0
total          0
dtype: int64


In [139]:
# check for duplicate values
print(df_hourly.duplicated().sum())

0


- Great. No NaN or duplicate values.

<h5> 3) Additional cleaning for both daily and hourly data sets </h5>

In [140]:
# Change "holiday" column values to "yes" or "no" in both dataframes
df_daily['holiday'] = df_daily['holiday'].apply(lambda x: 'no' if x == 0 else 'yes')
df_hourly['holiday'] = df_hourly['holiday'].apply(lambda x: 'no' if x == 0 else 'yes')


# Change "day_of_week" column values to "mon", "tue", etc. 
# Count starts from 0 = sunday

df_daily ['day_of_week'] = df_daily['day_of_week'].apply(lambda x: 'sun' if x == 0
                                                            else 'mon' if x == 1
                                                            else 'tue' if x == 2
                                                            else 'wed' if x == 3
                                                            else 'thu' if x == 4
                                                            else 'fri' if x == 5
                                                            else 'sat')

df_hourly ['day_of_week'] = df_hourly['day_of_week'].apply(lambda x: 'sun' if x == 0
                                                            else 'mon' if x == 1
                                                            else 'tue' if x == 2
                                                            else 'wed' if x == 3
                                                            else 'thu' if x == 4
                                                            else 'fri' if x == 5
                                                            else 'sat')

# Denormalise temperature
# Normalised temperatures are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 
# So to denormalise: t = temp * (t_max - t_min) + t_min

t_min = -8.0
t_max = 39.0

df_daily['temp'] = df_daily['temp'].apply(lambda x: round (x * (t_max - t_min) + t_min, 1) )
df_hourly['temp'] = df_hourly['temp'].apply(lambda x: round (x * (t_max - t_min) + t_min, 1))


# Denormalise apparent temperature. 
# Normalised apparent temperatures are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 
# So to denormalise: t = app_temp * (t_max - t_min) + t_min

t_min = -16.0
t_max = 50.0

df_daily['app_temp'] = df_daily['app_temp'].apply(lambda x: round (x * (t_max - t_min) + t_min, 1) )
df_hourly['app_temp'] = df_hourly['app_temp'].apply(lambda x: round (x * (t_max - t_min) + t_min, 1))


# Denormalise humidity. The values were divided by 100 (the max value), so reverse that.
df_daily['humidity'] = df_daily['humidity'].apply(lambda x: round (x * 100, 1) )
df_hourly['humidity'] = df_hourly['humidity'].apply(lambda x: round (x * 100, 1))


# Denormalise wind speed. The values were divided by 67 (the max value), so reverse that.
df_daily['wind_speed'] = df_daily['wind_speed'].apply(lambda x: round (x * 67, 1) )
df_hourly['wind_speed'] = df_hourly['wind_speed'].apply(lambda x: round (x * 67, 1))

# convert 'date' columns to date-time format
df_daily['date'] = pd.to_datetime(df_daily['date'])
df_hourly['date'] = pd.to_datetime(df_hourly['date'])

# convert "work_day" column values to "yes" or "no"
df_daily['work_day'] = df_daily['work_day'].apply(lambda x: 'yes' if x == 1 else 'no')
df_hourly['work_day'] = df_hourly['work_day'].apply(lambda x: 'yes' if x == 1 else 'no')

# Replace the 0/1 codes in the 'year' column with the actual calendar year
df_daily['year'] = df_daily['date'].dt.year
df_hourly['year'] = df_hourly['date'].dt.year


# Change weather situation column values as follows: 

# 1: Clear
# 2: Misty
# 3: Bit wet
# 4: Very wet

df_daily['weather_sit'] = df_daily['weather_sit'].apply(lambda x: 'clear' if x == 1
                                                                else 'misty' if x == 2
                                                                else 'bit wet' if x == 3
                                                                else 'very wet')

df_hourly['weather_sit'] = df_hourly['weather_sit'].apply(lambda x: 'clear' if x == 1
                                                                else 'misty' if x == 2
                                                                else 'bit wet' if x == 3
                                                                else 'very wet')

# Convert "season" values to strings
# (1:winter, 2:spring, 3:summer, 4:autumn)

df_daily['season'] = df_daily['season'].apply(lambda x: 'winter' if x == 1
                                                    else 'spring' if x == 2
                                                    else 'summer' if x == 3
                                                    else 'autumn')

df_hourly['season'] = df_hourly['season'].apply(lambda x: 'winter' if x == 1
                                                    else 'spring' if x == 2
                                                    else 'summer' if x == 3
                                                    else 'autumn')
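
The chained conditionals above can also be written as dictionary lookups. This is a sketch of an alternative, not what the notebook ran; `toy`, `day_names` and `season_names` are illustrative names.

```python
import pandas as pd

# Toy frame with the integer codes used in the raw data
toy = pd.DataFrame({"day_of_week": [6, 0, 1], "season": [1, 1, 4]})

day_names = {0: "sun", 1: "mon", 2: "tue", 3: "wed", 4: "thu", 5: "fri", 6: "sat"}
season_names = {1: "winter", 2: "spring", 3: "summer", 4: "autumn"}

# Series.map looks each code up in the dict; codes not in the dict become NaN
toy["day_of_week"] = toy["day_of_week"].map(day_names)
toy["season"] = toy["season"].map(season_names)

print(toy)
```

One behavioural difference is worth keeping in mind: the lambda chains send any unexpected code to the final label (for example 'sat'), whereas `map` leaves it as NaN, which makes bad codes easier to spot.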

<h5> 4) Seasonal boundaries </h5>

In [141]:
# Check the boundaries for the seasons in 2011

spring_first = df_daily [(df_daily['date'].dt.year == 2011) & (df_daily['season'] == 'spring')].iloc[0]['date']
spring_last = df_daily [(df_daily['date'].dt.year == 2011) & (df_daily['season'] == 'spring')].iloc[-1]['date']

summer_first = df_daily [(df_daily['date'].dt.year == 2011) & (df_daily['season'] == 'summer')].iloc[0]['date']
summer_last = df_daily [(df_daily['date'].dt.year == 2011) & (df_daily['season'] == 'summer')].iloc[-1]['date']

autumn_first = df_daily [(df_daily['date'].dt.year == 2011) & (df_daily['season'] == 'autumn')].iloc[0]['date']
autumn_last = df_daily [(df_daily['date'].dt.year == 2011) & (df_daily['season'] == 'autumn')].iloc[-1]['date']

spring_first, spring_last, summer_first, summer_last, autumn_first, autumn_last

(Timestamp('2011-03-21 00:00:00'),
 Timestamp('2011-06-20 00:00:00'),
 Timestamp('2011-06-21 00:00:00'),
 Timestamp('2011-09-22 00:00:00'),
 Timestamp('2011-09-23 00:00:00'),
 Timestamp('2011-12-20 00:00:00'))

- Seasonal boundaries for 2011:

- spring: 03-21 to 06-20
- summer: 06-21 to 09-22
- autumn: 09-23 to 12-20
- winter: 12-21 to 03-20

- These cut-off dates roughly follow the astronomical seasons (equinoxes and solstices), which seems a strange choice here. I think most people would agree that early December is definitely already winter.
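
The first/last lookups can be condensed into a single grouped aggregation. A minimal sketch on a toy frame (the names `toy` and `bounds` are illustrative):

```python
import pandas as pd

# Toy stand-in for df_daily: a few dates with their season label
toy = pd.DataFrame({
    "date": pd.to_datetime(["2011-03-21", "2011-06-20", "2011-06-21",
                            "2012-03-21", "2012-06-20"]),
    "season": ["spring", "spring", "summer", "spring", "spring"],
})

# First and last date of each season, per calendar year
bounds = (toy.groupby([toy["date"].dt.year.rename("year"), "season"])["date"]
             .agg(["min", "max"]))

print(bounds)
```

Winter needs separate handling with this approach, because it straddles the year boundary: its per-year minimum and maximum would simply be 01-01 and 12-31.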

In [142]:
# Check the boundaries for the seasons in 2012

spring_first = df_daily [(df_daily['date'].dt.year == 2012) & (df_daily['season'] == 'spring')].iloc[0]['date']
spring_last = df_daily [(df_daily['date'].dt.year == 2012) & (df_daily['season'] == 'spring')].iloc[-1]['date']

summer_first = df_daily [(df_daily['date'].dt.year == 2012) & (df_daily['season'] == 'summer')].iloc[0]['date']
summer_last = df_daily [(df_daily['date'].dt.year == 2012) & (df_daily['season'] == 'summer')].iloc[-1]['date']

autumn_first = df_daily [(df_daily['date'].dt.year == 2012) & (df_daily['season'] == 'autumn')].iloc[0]['date']
autumn_last = df_daily [(df_daily['date'].dt.year == 2012) & (df_daily['season'] == 'autumn')].iloc[-1]['date']

spring_first, spring_last, summer_first, summer_last, autumn_first, autumn_last

(Timestamp('2012-03-21 00:00:00'),
 Timestamp('2012-06-20 00:00:00'),
 Timestamp('2012-06-21 00:00:00'),
 Timestamp('2012-09-22 00:00:00'),
 Timestamp('2012-09-23 00:00:00'),
 Timestamp('2012-12-20 00:00:00'))

- Good. The boundary dates are the same for both years. 
- How about in the hourly data set?

In [143]:
# Check season boundaries for df_hourly, in year 2011

spring_first = df_hourly [(df_hourly['date'].dt.year == 2011) & (df_hourly['season'] == 'spring')].iloc[0]['date']
spring_last = df_hourly [(df_hourly['date'].dt.year == 2011) & (df_hourly['season'] == 'spring')].iloc[-1]['date']

summer_first = df_hourly [(df_hourly['date'].dt.year == 2011) & (df_hourly['season'] == 'summer')].iloc[0]['date']
summer_last = df_hourly [(df_hourly['date'].dt.year == 2011) & (df_hourly['season'] == 'summer')].iloc[-1]['date']

autumn_first = df_hourly [(df_hourly['date'].dt.year == 2011) & (df_hourly['season'] == 'autumn')].iloc[0]['date']
autumn_last = df_hourly [(df_hourly['date'].dt.year == 2011) & (df_hourly['season'] == 'autumn')].iloc[-1]['date']

spring_first, spring_last, summer_first, summer_last, autumn_first, autumn_last

(Timestamp('2011-03-21 00:00:00'),
 Timestamp('2011-06-20 00:00:00'),
 Timestamp('2011-06-21 00:00:00'),
 Timestamp('2011-09-22 00:00:00'),
 Timestamp('2011-09-23 00:00:00'),
 Timestamp('2011-12-20 00:00:00'))

In [144]:
# Check season boundaries for df_hourly, in year 2012

spring_first = df_hourly [(df_hourly['date'].dt.year == 2012) & (df_hourly['season'] == 'spring')].iloc[0]['date']
spring_last = df_hourly [(df_hourly['date'].dt.year == 2012) & (df_hourly['season'] == 'spring')].iloc[-1]['date']

summer_first = df_hourly [(df_hourly['date'].dt.year == 2012) & (df_hourly['season'] == 'summer')].iloc[0]['date']
summer_last = df_hourly [(df_hourly['date'].dt.year == 2012) & (df_hourly['season'] == 'summer')].iloc[-1]['date']

autumn_first = df_hourly [(df_hourly['date'].dt.year == 2012) & (df_hourly['season'] == 'autumn')].iloc[0]['date']
autumn_last = df_hourly [(df_hourly['date'].dt.year == 2012) & (df_hourly['season'] == 'autumn')].iloc[-1]['date']

spring_first, spring_last, summer_first, summer_last, autumn_first, autumn_last

(Timestamp('2012-03-21 00:00:00'),
 Timestamp('2012-06-20 00:00:00'),
 Timestamp('2012-06-21 00:00:00'),
 Timestamp('2012-09-22 00:00:00'),
 Timestamp('2012-09-23 00:00:00'),
 Timestamp('2012-12-20 00:00:00'))

- OK. Seasonal boundaries are the same for both the daily and hourly data sets. Let's change the cut-off dates to more reasonable ones.

In [145]:
# Change date boundaries for seasons in "df_daily" to whole months:
# spring: 03-01 to 05-31
# summer: 06-01 to 08-31
# autumn: 09-01 to 11-30
# winter: 12-01 to 02-28 (02-29 in leap years)
#
# The boundaries fall on month ends, so the month number alone decides the season.
# (Extra day-of-month conditions would wrongly exclude dates such as 10-31 or 01-31.)

spring_mask = df_daily['date'].dt.month.isin([3, 4, 5])
summer_mask = df_daily['date'].dt.month.isin([6, 7, 8])
autumn_mask = df_daily['date'].dt.month.isin([9, 10, 11])
winter_mask = df_daily['date'].dt.month.isin([12, 1, 2])

df_daily.loc[spring_mask, 'season'] = 'spring'
df_daily.loc[summer_mask, 'season'] = 'summer'
df_daily.loc[autumn_mask, 'season'] = 'autumn'
df_daily.loc[winter_mask, 'season'] = 'winter'

# Apply the same month-based boundaries to "df_hourly"

spring_mask = df_hourly['date'].dt.month.isin([3, 4, 5])
summer_mask = df_hourly['date'].dt.month.isin([6, 7, 8])
autumn_mask = df_hourly['date'].dt.month.isin([9, 10, 11])
winter_mask = df_hourly['date'].dt.month.isin([12, 1, 2])

df_hourly.loc[spring_mask, 'season'] = 'spring'
df_hourly.loc[summer_mask, 'season'] = 'summer'
df_hourly.loc[autumn_mask, 'season'] = 'autumn'
df_hourly.loc[winter_mask, 'season'] = 'winter'
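
Since every new boundary sits at a month end, the season is a function of the month alone. As a minimal sketch of an alternative to boolean masks (the names `month_to_season` and `dates` are illustrative, not from the notebook), a dictionary lookup also covers edge dates such as 10-31, 01-31 and 02-29:

```python
import pandas as pd

# Meteorological seasons keyed by month number
month_to_season = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "autumn", 10: "autumn", 11: "autumn"}

# Edge cases: month ends, the leap day, and the first day of a season
dates = pd.Series(pd.to_datetime(["2011-10-31", "2012-01-31", "2012-02-29",
                                  "2012-03-01", "2012-12-01"]))
seasons = dates.dt.month.map(month_to_season)

print(seasons.tolist())
```

Checking a handful of boundary dates like this is a cheap way to confirm that no day falls outside all four seasons.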

<h5> 5) Divide apparent temperature into categories </h5>

In [146]:
# Use following categories for apparent temperature (degrees Celsius):
# 'below -5'
# '-5 to 5'
# '5 to 15'
# '15 to 25'
# '25 to 35'
# '35 or higher'

df_daily['app_temp_cat'] = df_daily['app_temp'].apply(lambda x: 'below -5' if x <= -5
                                                            else '-5 to 5' if x <= 5
                                                            else '5 to 15' if x <= 15
                                                            else '15 to 25' if x <= 25
                                                            else '25 to 35' if x <= 35
                                                            else '35 or higher')

df_hourly['app_temp_cat'] = df_hourly['app_temp'].apply(lambda x: 'below -5' if x <= -5
                                                            else '-5 to 5' if x <= 5
                                                            else '5 to 15' if x <= 15
                                                            else '15 to 25' if x <= 25
                                                            else '25 to 35' if x <= 35
                                                            else '35 or higher')
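
The same binning can be expressed with `pd.cut`. This is a sketch of an alternative under the same thresholds, not the code the notebook ran; `app_temp`, `bins` and `labels` are illustrative names.

```python
import numpy as np
import pandas as pd

app_temp = pd.Series([-7.0, -5.0, 3.0, 15.0, 15.1, 36.0])

bins = [-np.inf, -5, 5, 15, 25, 35, np.inf]
labels = ["below -5", "-5 to 5", "5 to 15", "15 to 25", "25 to 35", "35 or higher"]

# right=True makes each bin include its upper edge, matching the `<=` tests
app_temp_cat = pd.cut(app_temp, bins=bins, labels=labels, right=True)

print(app_temp_cat.tolist())
```

`pd.cut` returns an ordered categorical, so the categories sort by temperature rather than alphabetically, which can be convenient for plotting.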

<h5> 6) Add columns for different granularities of the dates </h5>

In [147]:
# Add "year_month" column
df_daily['year_month'] = df_daily['date'].dt.to_period('M')
df_hourly['year_month'] = df_hourly['date'].dt.to_period('M')

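# Add "calendar_week" column (ISO week number)
# Series.dt.week is deprecated and was removed in pandas 2.0; use isocalendar() instead
df_daily['calendar_week'] = df_daily['date'].dt.isocalendar().week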
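
The calendar week here is the ISO 8601 week number, which is why the first two days of 2011 fall in week 52. A minimal sketch using `Series.dt.isocalendar()`, available since pandas 1.1 (the names `dates` and `iso` are illustrative):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"]))

# isocalendar() returns a frame with 'year', 'week' and 'day' columns
iso = dates.dt.isocalendar()

print(iso["year"].tolist(), iso["week"].tolist())
```

2011-01-01 was a Saturday, so in ISO terms it still belongs to the last week of 2010; week 1 of 2011 starts on Monday 2011-01-03. Grouping by week number alone would therefore mix early-January days with late-December ones.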


<h5> 7) Save dataframes as CSV files </h5>

In [148]:
# Check dataframes one last time

df_daily.head()

Unnamed: 0,date,season,year,month,holiday,day_of_week,work_day,weather_sit,temp,app_temp,humidity,wind_speed,casual,registered,total,app_temp_cat,year_month,calendar_week
0,2011-01-01,winter,2011,1,no,sat,no,misty,8.2,8.0,80.6,10.7,331,654,985,5 to 15,2011-01,52
1,2011-01-02,winter,2011,1,no,sun,no,misty,9.1,7.3,69.6,16.7,131,670,801,5 to 15,2011-01,52
2,2011-01-03,winter,2011,1,no,mon,yes,clear,1.2,-3.5,43.7,16.6,120,1229,1349,-5 to 5,2011-01,1
3,2011-01-04,winter,2011,1,no,tue,yes,clear,1.4,-2.0,59.0,10.7,108,1454,1562,-5 to 5,2011-01,1
4,2011-01-05,winter,2011,1,no,wed,yes,clear,2.7,-0.9,43.7,12.5,82,1518,1600,-5 to 5,2011-01,1


In [149]:
df_hourly.head()

Unnamed: 0,date,season,year,month,hour,holiday,day_of_week,work_day,weather_sit,temp,app_temp,humidity,wind_speed,casual,registered,total,app_temp_cat,year_month
0,2011-01-01,winter,2011,1,0,no,sat,no,clear,3.3,3.0,81.0,0.0,3,13,16,-5 to 5,2011-01
1,2011-01-01,winter,2011,1,1,no,sat,no,clear,2.3,2.0,80.0,0.0,8,32,40,-5 to 5,2011-01
2,2011-01-01,winter,2011,1,2,no,sat,no,clear,2.3,2.0,80.0,0.0,5,27,32,-5 to 5,2011-01
3,2011-01-01,winter,2011,1,3,no,sat,no,clear,3.3,3.0,75.0,0.0,3,10,13,-5 to 5,2011-01
4,2011-01-01,winter,2011,1,4,no,sat,no,clear,3.3,3.0,75.0,0.0,0,1,1,-5 to 5,2011-01


In [150]:
df_daily.to_csv('cleaned_data/df_daily.csv', index = False)
df_hourly.to_csv('cleaned_data/df_hourly.csv', index = False)

<h2> Part 2 </h2>

<h3> Part 2 Overview </h3>

- In Part 2, I clean and wrangle data for the years 2010 to 2019, downloaded directly from the Capital Bike Share website.
- The data was downloaded from the following link: https://s3.amazonaws.com/capitalbikeshare-data/index.html

In [28]:
import glob
import os  # needed for os.path.join below

<h3> 1) Load data </h3>

In [29]:
# path to directory
path = "./data/years_2010_to_2019"

# use glob to retrieve all csv files in directory
all_files = glob.glob(os.path.join(path, "*.csv"))

# glob does not guarantee any particular order, so sort the paths alphabetically
all_files.sort() 

# create generator object for all csv files
df_from_each_file = (pd.read_csv(f) for f in all_files)

# concatenate all dataframes into one
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)

- run time: 50 s
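
The loading pattern can be tried out on a small scale before running it on the full download. The sketch below writes two tiny stand-in files to a temporary directory; the file names and contents are made up for illustration.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write two tiny stand-in files (hypothetical names and contents)
    pd.DataFrame({"Duration": [1012]}).to_csv(
        os.path.join(tmp_dir, "2010-trips.csv"), index=False)
    pd.DataFrame({"Duration": [61, 2690]}).to_csv(
        os.path.join(tmp_dir, "2011-trips.csv"), index=False)

    # sorted() makes the order deterministic; glob itself gives no ordering guarantee
    csv_files = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))

    # Read one file at a time through a generator and stack the rows
    combined = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)

print(combined["Duration"].tolist())
```

With `ignore_index=True` the result gets a fresh 0..n-1 index instead of repeating each file's own row numbers.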

In [30]:
# save as a deep copy to retain original dataframe
df_all_years = concatenated_df.copy(deep = True)

- run time: 14 s

In [31]:
# check columns
df_all_years.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26058744 entries, 0 to 26058743
Data columns (total 9 columns):
 #   Column                Dtype 
---  ------                ----- 
 0   Duration              int64 
 1   Start date            object
 2   End date              object
 3   Start station number  int64 
 4   Start station         object
 5   End station number    int64 
 6   End station           object
 7   Bike number           object
 8   Member type           object
dtypes: int64(3), object(6)
memory usage: 1.7+ GB


In [32]:
# Convert date columns into date-time format
df_all_years['Start date'] = pd.to_datetime(df_all_years['Start date'])
df_all_years['End date'] = pd.to_datetime(df_all_years['End date'])

<h3> 2) Create aggregated dataframes </h3>

- Create aggregated dataframes to track the number of rides by casual vs registered users

<h4> 2.1) Aggregate on daily basis </h4>

In [33]:
# select for rides by registered users -- 'Member type' == 'Member'
# then count rides on daily basis
df_agg_registered = df_all_years [df_all_years['Member type'] == 'Member'].groupby(pd.Grouper(key = 'Start date', freq = 'D')).count().rename(columns={'Member type': 'Registered'})
df_agg_registered.reset_index(inplace = True)
df_agg_registered = df_agg_registered[['Start date', 'Registered']]

# select rides for casual users -- 'Member type' == 'Casual'
# then count rides on daily basis
df_agg_casual = df_all_years [df_all_years['Member type'] == 'Casual'].groupby(pd.Grouper(key = 'Start date', freq = 'D')).count().rename(columns={'Member type': 'Casual'})
df_agg_casual = df_agg_casual.reset_index()
df_agg_casual = df_agg_casual [['Start date', 'Casual']]

# then merge into single dataframe
df_all_agg_daily = pd.merge(df_agg_registered, df_agg_casual, how = 'outer', on = ['Start date'])
df_all_agg_daily = df_all_agg_daily.rename(columns = {'Start date': 'date', 
                                            'Registered': 'registered',
                                            'Casual': 'casual'})

# Add "year", "month", and "day" columns
df_all_agg_daily['year'] = df_all_agg_daily['date'].dt.year
df_all_agg_daily['month'] = df_all_agg_daily['date'].dt.month
df_all_agg_daily['day'] = df_all_agg_daily['date'].dt.day_name()
df_all_agg_daily['day'] = df_all_agg_daily['day'].apply(lambda x: x[0:3])

# Add "year_month" column
df_all_agg_daily ['year_month'] = df_all_agg_daily['date'].dt.to_period('M')


- run time: 20 s
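
The two filtered counts and the merge can also be done in a single grouped pass. This is a sketch of an alternative on toy data, not the code timed above; `toy` and `daily_counts` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "Start date": pd.to_datetime(["2010-09-20 11:27:04", "2010-09-20 12:05:37",
                                  "2010-09-20 14:10:00", "2010-09-21 08:00:00"]),
    "Member type": ["Member", "Member", "Casual", "Member"],
})

# Count rides per day and member type, then spread member types into columns
daily_counts = (toy.groupby([pd.Grouper(key="Start date", freq="D"), "Member type"])
                   .size()
                   .unstack(fill_value=0)
                   .rename(columns={"Member": "registered", "Casual": "casual"}))

print(daily_counts)
```

`unstack(fill_value=0)` puts a zero wherever a member type has no rides on a given day, so no NaN handling is needed afterwards.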

In [34]:
df_all_agg_daily.head()

Unnamed: 0,date,registered,casual,year,month,day,year_month
0,2010-09-20,178,34,2010,9,Mon,2010-09
1,2010-09-21,215,109,2010,9,Tue,2010-09
2,2010-09-22,260,117,2010,9,Wed,2010-09
3,2010-09-23,249,124,2010,9,Thu,2010-09
4,2010-09-24,206,156,2010,9,Fri,2010-09


In [35]:
df_all_agg_daily.isna().sum()

date          0
registered    0
casual        0
year          0
month         0
day           0
year_month    0
dtype: int64

In [36]:
df_all_agg_daily.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3390 entries, 0 to 3389
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   date        3390 non-null   datetime64[ns]
 1   registered  3390 non-null   int64         
 2   casual      3390 non-null   int64         
 3   year        3390 non-null   int64         
 4   month       3390 non-null   int64         
 5   day         3390 non-null   object        
 6   year_month  3390 non-null   period[M]     
dtypes: datetime64[ns](1), int64(4), object(1), period[M](1)
memory usage: 211.9+ KB


<h4> 2.2) Aggregate on hourly basis </h4>

In [37]:
# Create deep copy to preserve original
df_all_years_hourly = concatenated_df.copy(deep = True)

# Convert date columns into date-time format
df_all_years_hourly['Start date'] = pd.to_datetime(df_all_years_hourly['Start date'])
df_all_years_hourly['End date'] = pd.to_datetime(df_all_years_hourly['End date'])

# Create "hour" column
df_all_years_hourly['hour'] = df_all_years_hourly['Start date'].dt.hour

# Count rides by registered users on hourly basis and save as new dataframe
df_agg_registered_hourly = df_all_years_hourly [df_all_years_hourly['Member type'] == 'Member'].groupby(pd.Grouper(key = 'Start date', freq = 'H')).count().rename(columns={'Member type': 'Registered'})
df_agg_registered_hourly = df_agg_registered_hourly[['Registered']]

# Count rides by unregistered users on hourly basis and save as new dataframe
df_agg_casual_hourly = df_all_years_hourly [df_all_years_hourly['Member type'] == 'Casual'].groupby(pd.Grouper(key = 'Start date', freq = 'H')).count().rename(columns={'Member type': 'Casual'})
df_agg_casual_hourly = df_agg_casual_hourly [['Casual']]

# Merge dataframes for registered and unregistered users
df_all_agg_hourly = pd.merge(df_agg_registered_hourly, df_agg_casual_hourly, how = 'outer', on = ['Start date'])
df_all_agg_hourly.reset_index(inplace = True)

# Create "hour" column
df_all_agg_hourly['hour'] = df_all_agg_hourly['Start date'].dt.hour

# After the outer merge, NaN means no rides of that member type were recorded in that hour.
# So replace NaN with 0:
df_all_agg_hourly['Registered'] = df_all_agg_hourly['Registered'].fillna(0)
df_all_agg_hourly['Casual'] = df_all_agg_hourly['Casual'].fillna(0)

# Create 'year', 'month' and 'day' columns
df_all_agg_hourly['year'] = df_all_agg_hourly['Start date'].dt.year
df_all_agg_hourly['month'] = df_all_agg_hourly['Start date'].dt.month
df_all_agg_hourly['day'] = df_all_agg_hourly['Start date'].dt.day_name()
df_all_agg_hourly['day'] = df_all_agg_hourly['day'].apply(lambda x: x[0:3])

# rename columns
df_all_agg_hourly = df_all_agg_hourly.rename(columns = {'Start date': 'date',
                                    'Registered': 'registered',
                                    'Casual': 'casual'})

# Add "year_month" column

df_all_agg_hourly ['year_month'] = df_all_agg_hourly['date'].dt.to_period('M')

- run time: 43 s

In [38]:
df_all_agg_hourly.head()

Unnamed: 0,date,registered,casual,hour,year,month,day,year_month
0,2010-09-20 11:00:00,2,0.0,11,2010,9,Mon,2010-09
1,2010-09-20 12:00:00,17,0.0,12,2010,9,Mon,2010-09
2,2010-09-20 13:00:00,11,0.0,13,2010,9,Mon,2010-09
3,2010-09-20 14:00:00,4,2.0,14,2010,9,Mon,2010-09
4,2010-09-20 15:00:00,10,2.0,15,2010,9,Mon,2010-09


In [39]:
df_all_agg_hourly.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81349 entries, 0 to 81348
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   date        81349 non-null  datetime64[ns]
 1   registered  81349 non-null  int64         
 2   casual      81349 non-null  float64       
 3   hour        81349 non-null  int64         
 4   year        81349 non-null  int64         
 5   month       81349 non-null  int64         
 6   day         81349 non-null  object        
 7   year_month  81349 non-null  period[M]     
dtypes: datetime64[ns](1), float64(1), int64(4), object(1), period[M](1)
memory usage: 5.0+ MB
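
The `casual` column shows up as float64 because the outer merge introduced NaN values before they were filled. If integer counts are preferred, the column can be cast back after filling. A minimal sketch on toy data (all names are illustrative):

```python
import pandas as pd

registered = pd.DataFrame({"date": pd.to_datetime(["2010-09-20 11:00", "2010-09-20 12:00"]),
                           "registered": [2, 17]})
casual = pd.DataFrame({"date": pd.to_datetime(["2010-09-20 12:00"]),
                       "casual": [3]})

merged = (pd.merge(registered, casual, how="outer", on="date")
            .sort_values("date", ignore_index=True))

# The 11:00 row has no casual rides, so 'casual' is NaN and the column is float64
merged["casual"] = merged["casual"].fillna(0).astype("int64")

print(merged.dtypes)
```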


<h4> 2.3) Save aggregated dataframes as csv files </h4>

In [40]:
# Export dataframes

df_all_agg_daily.to_csv('./cleaned_data/df_all_agg_daily.csv', index = False)
df_all_agg_hourly.to_csv('./cleaned_data/df_all_agg_hourly.csv', index = False)

<h3> 3) Clean + wrangle full data set (df_all_years) </h3>

- Now we return to the full data set that we downloaded from the Capital Bike Share website

<h4> 3.1) Divide full dataframe into separate ones based on year </h4>

In [41]:
df_all_years.head()

Unnamed: 0,Duration,Start date,End date,Start station number,Start station,End station number,End station,Bike number,Member type
0,1012,2010-09-20 11:27:04,2010-09-20 11:43:56,31208,M St & New Jersey Ave SE,31108,4th & M St SW,W00742,Member
1,61,2010-09-20 11:41:22,2010-09-20 11:42:23,31209,1st & N St SE,31209,1st & N St SE,W00032,Member
2,2690,2010-09-20 12:05:37,2010-09-20 12:50:27,31600,5th & K St NW,31100,19th St & Pennsylvania Ave NW,W00993,Member
3,1406,2010-09-20 12:06:05,2010-09-20 12:29:32,31600,5th & K St NW,31602,Park Rd & Holmead Pl NW,W00344,Member
4,1413,2010-09-20 12:10:43,2010-09-20 12:34:17,31100,19th St & Pennsylvania Ave NW,31201,15th & P St NW,W00883,Member


In [42]:
# Rename columns for better legibility
df_all_years = df_all_years.rename(columns = {'Duration': 'duration',
                                'Start date': 'start_date',
                                'End date': 'end_date',
                                'Start station number': 'start_station_number',
                                'Start station': 'start_station',
                                'End station number': 'end_station_number',
                                'End station': 'end_station',
                                'Bike number': 'bike_number',
                                'Member type': 'member_type'
                                })

# Add "registered" and "casual" columns
df_all_years['registered'] = df_all_years['member_type'].apply(lambda x: 1 if x == 'Member' else 0)
df_all_years['casual'] = df_all_years['member_type'].apply(lambda x: 1 if x == 'Casual' else 0)

# Divide into separate dataframes, based on year
df_2011 = df_all_years[df_all_years['start_date'].dt.year == 2011]
df_2012 = df_all_years[df_all_years['start_date'].dt.year == 2012]
df_2013 = df_all_years[df_all_years['start_date'].dt.year == 2013]
df_2014 = df_all_years[df_all_years['start_date'].dt.year == 2014]
df_2015 = df_all_years[df_all_years['start_date'].dt.year == 2015]
df_2016 = df_all_years[df_all_years['start_date'].dt.year == 2016]
df_2017 = df_all_years[df_all_years['start_date'].dt.year == 2017]
df_2018 = df_all_years[df_all_years['start_date'].dt.year == 2018]
df_2019 = df_all_years[df_all_years['start_date'].dt.year == 2019]

Run time: 22 s
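
- The nine per-year filters follow one pattern, so the split can also be done in a single pass. The sketch below is a minimal illustration on a small made-up frame (the `toy` data and the `frames_by_year` name are assumptions, not part of the notebook): it groups rows by the year of "start_date" and keeps one dataframe per year in a dictionary.

```python
import pandas as pd

# Toy stand-in for df_all_years (hypothetical data, for illustration only)
toy = pd.DataFrame({
    'start_date': pd.to_datetime(['2011-03-01 08:00:00',
                                  '2011-07-15 17:30:00',
                                  '2012-01-02 09:10:00']),
    'member_type': ['Member', 'Casual', 'Member'],
})

# Group rows by the calendar year of "start_date"; keep one dataframe per year
frames_by_year = {int(year): group
                  for year, group in toy.groupby(toy['start_date'].dt.year)}

print(sorted(frames_by_year))
print(len(frames_by_year[2011]))
```

- With this layout, the 2011 trips would be accessed as `frames_by_year[2011]` instead of a separate `df_2011` variable.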

In [43]:
df_2019.head(2)

Unnamed: 0,duration,start_date,end_date,start_station_number,start_station,end_station_number,end_station,bike_number,member_type,registered,casual
22660327,230,2019-01-01 00:04:48,2019-01-01 00:08:39,31203,14th & Rhode Island Ave NW,31200,Massachusetts Ave & Dupont Circle NW,E00141,Member,1,0
22660328,1549,2019-01-01 00:06:37,2019-01-01 00:32:27,31321,15th St & Constitution Ave NW,31114,18th St & Wyoming Ave NW,W24067,Casual,0,1


<h4> 3.2) Export "df_all_years" as csv file </h4>

In [44]:
df_all_years.to_csv('cleaned_data/df_all_years.csv', encoding = 'utf-8', index = False)

Run time: 2 m 6 s

<h4> 3.3) Export "df_2011" to "df_2019" as csv files </h4>

In [45]:
df_2011.to_csv('cleaned_data/df_2011.csv', encoding = 'utf-8', index = False)
df_2012.to_csv('cleaned_data/df_2012.csv', encoding = 'utf-8', index = False)
df_2013.to_csv('cleaned_data/df_2013.csv', encoding = 'utf-8', index = False)
df_2014.to_csv('cleaned_data/df_2014.csv', encoding = 'utf-8', index = False)
df_2015.to_csv('cleaned_data/df_2015.csv', encoding = 'utf-8', index = False)
df_2016.to_csv('cleaned_data/df_2016.csv', encoding = 'utf-8', index = False)
df_2017.to_csv('cleaned_data/df_2017.csv', encoding = 'utf-8', index = False)
df_2018.to_csv('cleaned_data/df_2018.csv', encoding = 'utf-8', index = False)
df_2019.to_csv('cleaned_data/df_2019.csv', encoding = 'utf-8', index = False)

Run time: 2 m 5 s
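
- The export calls differ only in the year, so they can be written as a loop. A minimal sketch, assuming the per-year dataframes are held in a dictionary keyed by year (the `frames_by_year` dictionary, the toy data, and the temporary output folder are all assumptions for illustration; the notebook itself writes to 'cleaned_data/'):

```python
import os
import tempfile

import pandas as pd

# Hypothetical per-year dataframes standing in for df_2011 ... df_2019
frames_by_year = {
    2011: pd.DataFrame({'duration': [3548, 346]}),
    2012: pd.DataFrame({'duration': [120]}),
}

# Write to a temporary folder so the example does not touch real files
out_dir = tempfile.mkdtemp()

written = []
for year, frame in frames_by_year.items():
    path = os.path.join(out_dir, f'df_{year}.csv')
    frame.to_csv(path, encoding='utf-8', index=False)
    written.append(path)

print([os.path.basename(p) for p in written])
```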

<h2> Part 3 </h2>

- I did Part 3 mainly as an exercise.
- From the Capital Bikeshare GitHub page, I downloaded additional station-related data, listing all currently active bike stations along with their address and geo-coordinates (latitude and longitude).
- However, some bike stations that were active in 2011 have since been phased out and are not included in the station location data, so we don't know their geo-coordinates.
- I therefore used geopy to automate the search for the geo-coordinates of these bike stations based on their address.

In [48]:
import geopy
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

import json
from difflib import SequenceMatcher


<h3> 1) Load station data + data wrangling </h3>

In [49]:
with open('station_information.json', 'r') as json_file:
    json_load = json.load(json_file)

    station_info = pd.DataFrame(json_load['data']['stations'])

In [50]:
station_info.head(2)

Unnamed: 0,lon,external_id,rental_uris,capacity,station_id,rental_methods,name,eightd_station_services,short_name,eightd_has_key_dispenser,electric_bike_surcharge_waiver,lat,legacy_id,region_id,station_type,has_kiosk
0,-77.05323,082469cc-1f3f-11e7-bf6b-3863bb334450,"{'ios': 'https://dc.lft.to/lastmile_qr_scan', ...",15,1,"[KEY, CREDITCARD]",Eads St & 15th St S,[],31000,False,False,38.858971,1,41,classic,True
1,-77.049232,08246c35-1f3f-11e7-bf6b-3863bb334450,"{'ios': 'https://dc.lft.to/lastmile_qr_scan', ...",17,3,"[KEY, CREDITCARD]",Crystal Dr & 20th St S,[],31002,False,False,38.856425,3,41,classic,True


- Select just the "name", "lat", "lon", and "region_id" columns

In [52]:
# Create "station_loc" dataframe with just the relevant info
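# .copy() makes "station_loc" an independent dataframe, so columns can be added later without modifying a slice of "station_info"
station_loc = station_info[['name','lat','lon','region_id']].copy()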

In [53]:
station_loc.head(2)

Unnamed: 0,name,lat,lon,region_id
0,Eads St & 15th St S,38.858971,-77.05323,41
1,Crystal Dr & 20th St S,38.856425,-77.049232,41


- Add "region" column
- The mapping from "region_id" to "region" was found on the Capital Bikeshare GitHub page

In [62]:
# Dictionary mapping region_id to region name
# (pulled from the Capital Bikeshare GitHub page)
regions_dict = {
40: "Alexandria, VA",
41: "Arlington, VA", 
42: "Washington, DC",
43: "Montgomery County, MD (North)",
44: "Montgomery County, MD (South)",
48: "Test & Operations",
104: "Fairfax, VA",
128: "8D",
133: "Prince George's County",
152: "Falls Church, VA"
}

# Add "region" column in "station_loc".
station_loc['region_id'] = station_loc['region_id'].fillna(0)
station_loc['region_id'] = station_loc['region_id'].astype('int64')
station_loc['region'] = station_loc['region_id'].map(regions_dict)
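
- Note that `map` returns NaN for any "region_id" that is not a key of the dictionary, which includes the placeholder 0 introduced by `fillna(0)`. A small sketch on made-up rows (the toy frame, the shortened dictionary, and the "Unknown" label are assumptions for illustration):

```python
import pandas as pd

# Shortened, illustrative version of the region dictionary
regions = {41: "Arlington, VA", 42: "Washington, DC"}

# Toy stations; the second one has no region_id
toy = pd.DataFrame({'name': ['A', 'B', 'C'],
                    'region_id': [41.0, None, 42.0]})

toy['region_id'] = toy['region_id'].fillna(0).astype('int64')
toy['region'] = toy['region_id'].map(regions)

# Rows whose region_id had no dictionary entry are NaN after map()
n_unmapped = int(toy['region'].isna().sum())

# Optionally give them an explicit label
toy['region_filled'] = toy['region'].fillna('Unknown')

print(n_unmapped)
print(toy['region_filled'].tolist())
```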

In [64]:
station_loc.head(2)

Unnamed: 0,name,lat,lon,region_id,region
0,Eads St & 15th St S,38.858971,-77.05323,41,"Arlington, VA"
1,Crystal Dr & 20th St S,38.856425,-77.049232,41,"Arlington, VA"


- Save dataframe station_loc as csv file

In [65]:
station_loc.to_csv('cleaned_data/station_loc.csv', index = False)

<h3> 2) Find 2011 stations whose coordinates are missing </h3>

- Load dataframe for 2011

In [66]:
df_2011 = pd.read_csv('cleaned_data/df_2011.csv', parse_dates = ['start_date', 'end_date'])
df_2011.head(2)

Unnamed: 0,duration,start_date,end_date,start_station_number,start_station,end_station_number,end_station,bike_number,member_type,registered,casual
0,3548,2011-01-01 00:01:29,2011-01-01 01:00:37,31620,5th & F St NW,31620,5th & F St NW,W00247,Member,1,0
1,346,2011-01-01 00:02:46,2011-01-01 00:08:32,31105,14th & Harvard St NW,31101,14th & V St NW,W00675,Casual,0,1


- Check that the set of all start stations is the same as the set of all end stations

In [67]:
start_stations_2011 = df_2011['start_station'].unique()
end_stations_2011 = df_2011['end_station'].unique()

# Number of start stations, number of end stations, size of their intersection
len(start_stations_2011), len(end_stations_2011), len(np.intersect1d(start_stations_2011, end_stations_2011))

(144, 144, 144)
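
- Since both arrays contain unique names, equal counts plus an equally large intersection mean the two sets are identical. The same check can be written directly with Python sets; a sketch using a few station names as toy input (the lists below are illustrative, not the full data):

```python
# Toy lists standing in for start_stations_2011 and end_stations_2011
start = ['5th & F St NW', '10th & U St NW', '14th & V St NW']
end = ['14th & V St NW', '5th & F St NW', '10th & U St NW']

# True when both lists contain exactly the same station names
same_stations = set(start) == set(end)

# Names that appear in only one of the two lists (empty if they agree)
only_in_one = set(start) ^ set(end)

print(same_stations, only_in_one)
```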

- Create "stations_2011", a dataframe listing all stations from 2011, along with "region", "lat" and "lon"

In [70]:
# Retrieve just the station name
stations_2011 = pd.DataFrame(start_stations_2011).rename(columns = {0: 'station'})

# Add "region", "lat", and "lon" columns using station_loc
stations_2011['region'] = np.zeros(stations_2011.shape[0])
stations_2011['lat'] = np.zeros(stations_2011.shape[0])
stations_2011['lon'] = np.zeros(stations_2011.shape[0])

for i in stations_2011.index:
    for j in station_loc.index:
        if stations_2011.loc[i,'station'] == station_loc.loc[j, 'name']:
            stations_2011.loc[i, 'lat'] = station_loc.loc[j, 'lat']
            stations_2011.loc[i, 'lon'] = station_loc.loc[j, 'lon']
            stations_2011.loc[i, 'region'] = station_loc.loc[j, 'region']
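
- The nested loop compares every 2011 station with every row of "station_loc". One alternative is a left merge on the station name; the sketch below uses toy frames (the data and the `lookup`, `merged`, and `missing_names` names are assumptions). Unlike the loop, unmatched stations end up with NaN rather than 0, so they would be detected with `isna()`, and duplicate names in the lookup table would produce duplicate rows.

```python
import pandas as pd

# Toy stand-ins for stations_2011 and station_loc
stations = pd.DataFrame({'station': ['5th & F St NW', '21st & M St NW']})
lookup = pd.DataFrame({
    'name': ['5th & F St NW'],
    'region': ['Washington, DC'],
    'lat': [38.897222],
    'lon': [-77.019347],
})

# Left merge keeps every station; unmatched ones get NaN in region/lat/lon
merged = (stations
          .merge(lookup, how='left', left_on='station', right_on='name')
          .drop(columns='name'))

# Stations without coordinates
missing_names = merged.loc[merged['lat'].isna(), 'station'].tolist()

print(missing_names)
```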

In [71]:
stations_2011.head(5)

Unnamed: 0,station,region,lat,lon
0,5th & F St NW,"Washington, DC",38.897222,-77.019347
1,14th & Harvard St NW,"Washington, DC",38.9268,-77.0322
2,Georgia & New Hampshire Ave NW,"Washington, DC",38.936684,-77.024181
3,10th & U St NW,"Washington, DC",38.9172,-77.0259
4,Adams Mill & Columbia Rd NW,"Washington, DC",38.922925,-77.042581


- Create a list of the missing station names (i.e. stations for which 'lat' and 'lon' are not available)

In [72]:
missing_stations = []
for i in stations_2011.index:
    if stations_2011.loc[i,'lat'] == 0 or stations_2011.loc[i,'lon'] == 0:
        missing_stations.append(stations_2011.loc[i,'station'])

In [73]:
missing_stations

['Crystal City Metro / 18th & Bell St',
 '21st & M St NW',
 'Eastern Market Metro / Pennsylvania Ave & 7th St SE',
 'Connecticut Ave & Newark St NW / Cleveland Park',
 '18th & Eads St.',
 '19th & L St NW',
 '23rd & Crystal Dr',
 'Aurora Hills Community Ctr/18th & Hayes St',
 'S Joyce & Army Navy Dr',
 'Georgia Ave and Fairmont St NW',
 '20th & Crystal Dr',
 'S Glebe & Potomac Ave',
 'USDA / 12th & Independence Ave SW',
 '27th & Crystal Dr',
 'Pentagon City Metro / 12th & S Hayes St',
 '12th & Army Navy Dr',
 '26th & S Clark St',
 '15th & Crystal Dr',
 'Eads & 22nd St S',
 '1st & N St  SE',
 'Lynn & 19th St North',
 'N Rhodes & 16th St N',
 'Rosslyn Metro / Wilson Blvd & Ft Myer Dr',
 'Wilson Blvd & Franklin Rd',
 '11th & H St NE']

<h3> 3) Use geopy to retrieve the coordinates of the missing stations </h3>

In [113]:
# Create dataframe for just missing stations 

missing = pd.DataFrame(missing_stations)
missing = missing.rename(columns = {0: 'station'})
missing.head(2)

Unnamed: 0,station
0,Crystal City Metro / 18th & Bell St
1,21st & M St NW


In [114]:
# If the address in "missing" is very close to the address in "station_loc", then use the region name from "station_loc"
for i in range(len(missing)):
    for j in range (len(station_loc)):
        if (SequenceMatcher(isjunk = None, a = missing.loc[i,'station'], b = station_loc.loc[j, 'name'])).ratio()>= 0.9:
            missing.loc[i,'region'] = station_loc.loc[j,'region']

# fill NaN with zero
missing['region'] = missing['region'].fillna(0)

# Initialise 'lat' and 'lon' columns with zeros
missing['lat'] = 0.0
missing['lon'] = 0.0

# Initialise geolocator and geocode
geolocator = Nominatim(user_agent="bike_search")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# Add a space at the beginning of the "region" entries + make sure format of "region" column is "string"
missing['region'] = " " + missing['region'].astype(str)
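
- For reference, `SequenceMatcher.ratio()` returns a similarity between 0 and 1, so a threshold of 0.9 only accepts near-identical names. The sketch below uses `difflib.get_close_matches`, which applies the same ratio with a cutoff; the helper name `closest_name` and the candidate list are assumptions for illustration.

```python
from difflib import SequenceMatcher, get_close_matches

def closest_name(name, candidates, cutoff=0.9):
    """Return the most similar candidate with ratio >= cutoff, or None."""
    matches = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Made-up candidate names standing in for station_loc['name']
candidates = ['18th & Eads St', 'Crystal Dr & 20th St S']

# A trailing period barely lowers the ratio, so this pair still matches
print(SequenceMatcher(None, '18th & Eads St.', '18th & Eads St').ratio())
print(closest_name('18th & Eads St.', candidates))

# An unrelated name has no candidate above the cutoff
print(closest_name('S Glebe & Potomac Ave', candidates))
```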
            

In [115]:
missing

Unnamed: 0,station,region,lat,lon
0,Crystal City Metro / 18th & Bell St,"Arlington, VA",0.0,0.0
1,21st & M St NW,"Washington, DC",0.0,0.0
2,Eastern Market Metro / Pennsylvania Ave & 7th ...,"Washington, DC",0.0,0.0
3,Connecticut Ave & Newark St NW / Cleveland Park,0,0.0,0.0
4,18th & Eads St.,0,0.0,0.0
5,19th & L St NW,"Washington, DC",0.0,0.0
6,23rd & Crystal Dr,0,0.0,0.0
7,Aurora Hills Community Ctr/18th & Hayes St,0,0.0,0.0
8,S Joyce & Army Navy Dr,0,0.0,0.0
9,Georgia Ave and Fairmont St NW,"Washington, DC",0.0,0.0


In [116]:
# Now perform geocoding for stations with missing coordinates

for i in range(len(missing)):

    try:
        # Check case where the "region" is missing
        # Note: "region" was cast to str in the previous cell, so rows without a region
        # hold the string " 0" and this integer comparison does not match them
        if missing.loc[i, 'region'] == 0:
            query = missing.loc[i, 'station']

        # If "region" name is available, add it to the geocode search
        else:
            query = missing.loc[i, 'station'] + missing.loc[i, 'region']

        # Query the geocoder once; it returns None when nothing is found,
        # in which case the attribute access below raises AttributeError
        location = geocode(query, timeout = 15)
        dummy_lat = location.latitude
        dummy_lon = location.longitude

        # Make sure coordinates are in or around Washington DC
        if (dummy_lon > -79 and dummy_lon < -76) and (dummy_lat > 38 and dummy_lat < 40):
            missing.loc[i, 'lat'] = dummy_lat
            missing.loc[i, 'lon'] = dummy_lon

    except AttributeError:
        pass
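
- Each geocoding pass repeats the same steps: run a query, check that a result came back, and accept it only if it falls inside a bounding box around Washington, DC. A minimal sketch of that pattern as a helper; `geocode_within_bounds` is a hypothetical name, and `fake_geocode` stands in for the rate-limited `geocode` callable so the example runs without network access.

```python
from types import SimpleNamespace

def geocode_within_bounds(query, geocode_fn, lat_range=(38, 40), lon_range=(-79, -76)):
    """Return (lat, lon) if geocode_fn finds a result strictly inside the box, else None."""
    location = geocode_fn(query)          # one request per query
    if location is None:                  # nothing found
        return None
    lat, lon = location.latitude, location.longitude
    if lat_range[0] < lat < lat_range[1] and lon_range[0] < lon < lon_range[1]:
        return lat, lon
    return None                           # found, but outside the box

def fake_geocode(query):
    """Offline stand-in for a geocoder: returns an object with latitude/longitude, or None."""
    known = {
        '21st & M St NW Washington, DC': SimpleNamespace(latitude=38.905107, longitude=-77.057402),
        'Main St': SimpleNamespace(latitude=51.5, longitude=-0.1),
    }
    return known.get(query)

print(geocode_within_bounds('21st & M St NW Washington, DC', fake_geocode))
print(geocode_within_bounds('Main St', fake_geocode))
print(geocode_within_bounds('No such place', fake_geocode))
```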


In [117]:
missing

Unnamed: 0,station,region,lat,lon
0,Crystal City Metro / 18th & Bell St,"Arlington, VA",0.0,0.0
1,21st & M St NW,"Washington, DC",38.905107,-77.057402
2,Eastern Market Metro / Pennsylvania Ave & 7th ...,"Washington, DC",38.884056,-76.995262
3,Connecticut Ave & Newark St NW / Cleveland Park,0,0.0,0.0
4,18th & Eads St.,0,0.0,0.0
5,19th & L St NW,"Washington, DC",38.903799,-77.053958
6,23rd & Crystal Dr,0,0.0,0.0
7,Aurora Hills Community Ctr/18th & Hayes St,0,0.0,0.0
8,S Joyce & Army Navy Dr,0,0.0,0.0
9,Georgia Ave and Fairmont St NW,"Washington, DC",38.9249,-77.0222


In [118]:
for i in range(len(missing)):

    try:
        # When region name is available, try searching just the address without the region name
        if missing.loc[i, 'lat'] == 0 and missing.loc[i, 'region'] != 0:
            # Query the geocoder once and read both coordinates from the result
            location = geocode(missing.loc[i,'station'], timeout = 15)
            dummy_lat = location.latitude
            dummy_lon = location.longitude

            if (dummy_lon > -78 and dummy_lon < -76) and (dummy_lat > 38.5 and dummy_lat < 39.5):
                missing.loc[i, 'lat'] = dummy_lat
                missing.loc[i, 'lon'] = dummy_lon

    except AttributeError:
        pass

In [119]:
missing

Unnamed: 0,station,region,lat,lon
0,Crystal City Metro / 18th & Bell St,"Arlington, VA",0.0,0.0
1,21st & M St NW,"Washington, DC",38.905107,-77.057402
2,Eastern Market Metro / Pennsylvania Ave & 7th ...,"Washington, DC",38.884056,-76.995262
3,Connecticut Ave & Newark St NW / Cleveland Park,0,38.934267,-77.057979
4,18th & Eads St.,0,0.0,0.0
5,19th & L St NW,"Washington, DC",38.903799,-77.053958
6,23rd & Crystal Dr,0,0.0,0.0
7,Aurora Hills Community Ctr/18th & Hayes St,0,0.0,0.0
8,S Joyce & Army Navy Dr,0,0.0,0.0
9,Georgia Ave and Fairmont St NW,"Washington, DC",38.9249,-77.0222


In [120]:
# Even after the geocode search, some station coordinates are still missing.
# We notice that some station names have a "/" in them, which might be confusing the geocode searcher.
# Try a geocode search for the part that comes before the "/"

for i in range(len(missing)):

    try:
        if "/" in missing.loc[i, 'station']:
            dummy_string = missing.loc[i, 'station'].split('/')[0]          # Check part that comes before separator '/'

            # Query the geocoder once and read both coordinates from the result
            location = geocode(dummy_string, timeout = 15)
            dummy_lat = location.latitude
            dummy_lon = location.longitude

            if (dummy_lon > -79 and dummy_lon < -76) and (dummy_lat > 38 and dummy_lat < 40):
                missing.loc[i, 'lat'] = dummy_lat
                missing.loc[i, 'lon'] = dummy_lon

    except AttributeError:
        pass
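
- Taken together, the geocoding passes try progressively simpler query strings for each station. A sketch of that ordering as a small generator; the function name `candidate_queries` is an assumption, no geocoding is performed here, and unlike the cell above the part before the "/" is stripped of surrounding spaces.

```python
def candidate_queries(station, region=None):
    """Yield query strings from most to least specific, without duplicates."""
    options = []
    if region:
        options.append(f'{station} {region}')             # address plus region name
    options.append(station)                                # address only
    if '/' in station:
        options.append(station.split('/')[0].strip())      # part before the "/"

    seen = set()
    for query in options:
        if query not in seen:
            seen.add(query)
            yield query

print(list(candidate_queries('Crystal City Metro / 18th & Bell St', 'Arlington, VA')))
print(list(candidate_queries('23rd & Crystal Dr')))
```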

In [121]:
missing

Unnamed: 0,station,region,lat,lon
0,Crystal City Metro / 18th & Bell St,"Arlington, VA",38.857756,-77.051196
1,21st & M St NW,"Washington, DC",38.905107,-77.057402
2,Eastern Market Metro / Pennsylvania Ave & 7th ...,"Washington, DC",38.884056,-76.995262
3,Connecticut Ave & Newark St NW / Cleveland Park,0,38.934267,-77.057979
4,18th & Eads St.,0,0.0,0.0
5,19th & L St NW,"Washington, DC",38.903799,-77.053958
6,23rd & Crystal Dr,0,0.0,0.0
7,Aurora Hills Community Ctr/18th & Hayes St,0,38.857792,-77.059103
8,S Joyce & Army Navy Dr,0,0.0,0.0
9,Georgia Ave and Fairmont St NW,"Washington, DC",38.9249,-77.0222


- Good, we were able to fill in the coordinates for 16 of the 25 missing stations.