This notebook is intended mostly as an exercise. <br>
The aim is to find the geocoordinates of places using their address. <br>
To this end, I use the geopy library.

<h3> Outline for Notebook </h3>
<li> Part 1: Load the JSON file with all station data hosted on the Capital Bikeshare website. Note that this data is incomplete: it only includes stations that were in use at the time of the last update, not stations that were active in 2011 but have since been retired. </li>
<li> Part 2: Use the above data to add geocoordinates to the 2011 bike station dataframe. </li>
<li> Part 3: Some rows in the 2011 dataframe will still be missing coordinates. For these rows, run a geocode search to fill them in. </li>

In [5]:
import numpy as np
import pandas as pd
import json
from difflib import SequenceMatcher

<h3> Part 1 </h3>

Load json file as pandas dataframe

In [6]:
with open('station_information.json', 'r') as json_file:
    json_load = json.load(json_file)

    station_info = pd.DataFrame(json_load['data']['stations'])
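
The loading step can be tried without the real file. This is only a sketch: it assumes the same nesting as the cell above (`data` then `stations`) and uses two records copied from the `head(2)` output in this notebook, trimmed to four fields. The names `raw`, `payload` and `sample_info` are mine, and storing `region_id` as a string is an assumption.

```python
import json
import pandas as pd

# Two station records taken from the head(2) output, trimmed to four fields.
# Storing region_id as a string is an assumption about the raw feed.
raw = '''
{"data": {"stations": [
  {"name": "Eads St & 15th St S", "lat": 38.858971, "lon": -77.05323, "region_id": "41"},
  {"name": "Crystal Dr & 20th St S", "lat": 38.856425, "lon": -77.049232, "region_id": "41"}
]}}
'''

payload = json.loads(raw)

# Each dict in the "stations" list becomes one row; the keys become columns.
sample_info = pd.DataFrame(payload["data"]["stations"])
print(sample_info.shape)
```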

In [7]:
station_info.shape

(680, 16)

In [8]:
station_info.head(2)

Unnamed: 0,lon,external_id,rental_uris,capacity,station_id,rental_methods,name,eightd_station_services,short_name,eightd_has_key_dispenser,electric_bike_surcharge_waiver,lat,legacy_id,region_id,station_type,has_kiosk
0,-77.05323,082469cc-1f3f-11e7-bf6b-3863bb334450,"{'ios': 'https://dc.lft.to/lastmile_qr_scan', ...",15,1,"[KEY, CREDITCARD]",Eads St & 15th St S,[],31000,False,False,38.858971,1,41,classic,True
1,-77.049232,08246c35-1f3f-11e7-bf6b-3863bb334450,"{'ios': 'https://dc.lft.to/lastmile_qr_scan', ...",17,3,"[KEY, CREDITCARD]",Crystal Dr & 20th St S,[],31002,False,False,38.856425,3,41,classic,True


Select just the "name", "lat", "lon", and "region_id" columns

In [9]:
# Create "station_loc" dataframe with just the relevant info
station_loc = station_info[['name','lat','lon','region_id']]

In [10]:
station_loc.head(2)

Unnamed: 0,name,lat,lon,region_id
0,Eads St & 15th St S,38.858971,-77.05323,41
1,Crystal Dr & 20th St S,38.856425,-77.049232,41


Capital Bikeshare defines its own set of region ids. <br>
The list below was pulled from the Capital Bikeshare GitHub. <br>

In [11]:
# Dictionary of region_id and region names
# Pulled from the Capital Bikeshare github


regions = [
{
"name": "Alexandria, VA",
"region_id": "40"
},

{
"name": "Arlington, VA",
"region_id": "41"
},
{
"name": "Washington, DC",
"region_id": "42"
},
{
"name": "Montgomery County, MD (North)",
"region_id": "43"
},
{
"name": "Montgomery County, MD (South)",
"region_id": "44"
},
{
"name": "Test & Operations",
"region_id": "48"
},
{
"name": "Fairfax, VA",
"region_id": "104"
},
{
"name": "8D",
"region_id": "128"
},
{
"name": "Prince George's County",
"region_id": "133"
},
{
"name": "Falls Church, VA",
"region_id": "152"
}
]


# Create dictionary for region_id and region
regions_dict = {
40: "Alexandria, VA",
41: "Arlington, VA", 
42: "Washington, DC",
43: "Montgomery County, MD (North)",
44: "Montgomery County, MD (South)",
48: "Test & Operations",
104: "Fairfax, VA",
128: "8D",
133: "Prince George's County",
152: "Falls Church, VA"
}
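
The dictionary above repeats the `regions` list by hand. As a possible alternative, it could be derived from the list with a comprehension. The sketch below uses a trimmed copy of the list; `regions_sample` and `regions_lookup` are illustrative names, not part of the notebook.

```python
# Trimmed copy of the "regions" list; ids are strings, as in the source.
regions_sample = [
    {"name": "Alexandria, VA", "region_id": "40"},
    {"name": "Arlington, VA", "region_id": "41"},
    {"name": "Washington, DC", "region_id": "42"},
]

# Convert each id to int so the keys match an integer "region_id" column.
regions_lookup = {int(r["region_id"]): r["name"] for r in regions_sample}
print(regions_lookup[41])
```

Deriving the dictionary this way keeps the two structures from drifting apart if a region is added later.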

In [12]:
# Add "region" column in "station_loc".
# Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning.
station_loc = station_loc.copy()
station_loc['region_id'] = station_loc['region_id'].fillna(0).astype('int64')
station_loc['region'] = station_loc['region_id'].map(regions_dict)


In [13]:
station_loc.head(2)

Unnamed: 0,name,lat,lon,region_id,region
0,Eads St & 15th St S,38.858971,-77.05323,41,"Arlington, VA"
1,Crystal Dr & 20th St S,38.856425,-77.049232,41,"Arlington, VA"


In [15]:
station_loc.to_csv('cleaned_data/station_loc.csv', index = False)

<h3> Part 2 </h3>

In [16]:
df_2011 = pd.read_csv('cleaned_data/df_2011.csv', parse_dates = ['start_date', 'end_date'])

In [17]:
df_2011.head(2)

Unnamed: 0,duration,start_date,end_date,start_station_number,start_station,end_station_number,end_station,bike_number,member_type,registered,casual
0,3548,2011-01-01 00:01:29,2011-01-01 01:00:37,31620,5th & F St NW,31620,5th & F St NW,W00247,Member,1,0
1,346,2011-01-01 00:02:46,2011-01-01 00:08:32,31105,14th & Harvard St NW,31101,14th & V St NW,W00675,Casual,0,1


Check that the set of all start stations is the same as the set of all end stations

In [18]:
start_stations_2011 = df_2011['start_station'].unique()
end_stations_2011 = df_2011['end_station'].unique()

In [19]:
len(start_stations_2011), len(end_stations_2011), len(np.intersect1d(start_stations_2011, end_stations_2011))

(144, 144, 144)

OK. Both columns have 144 unique stations and their intersection also has 144, so the start stations and end stations are the same.
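
The same check can be written more directly with Python sets. This is a small sketch on three station names taken from this notebook, not the full 2011 data; `start`, `end`, `same_stations` and `only_start` are illustrative names.

```python
import numpy as np

# Toy arrays standing in for the unique start and end station names.
start = np.array(["5th & F St NW", "14th & V St NW", "10th & U St NW"])
end = np.array(["10th & U St NW", "5th & F St NW", "14th & V St NW"])

# Set equality ignores order and answers the question in one step.
same_stations = set(start) == set(end)

# If the sets differed, setdiff1d would list the names found only in "start".
only_start = np.setdiff1d(start, end)
print(same_stations, len(only_start))
```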

In [20]:
stations_2011 = pd.DataFrame(start_stations_2011).rename(columns = {0: 'station'})

In [21]:
stations_2011

Unnamed: 0,station
0,5th & F St NW
1,14th & Harvard St NW
2,Georgia & New Hampshire Ave NW
3,10th & U St NW
4,Adams Mill & Columbia Rd NW
...,...
139,3000 Connecticut Ave NW / National Zoo
140,Benning Rd & East Capitol St NE / Benning Rd M...
141,Anacostia Ave & Benning Rd NE / River Terrace
142,15th St & Massachusetts Ave SE


In [24]:
# Add "region", "lat", and "lon" columns to stations_2011 using station_loc

stations_2011['region'] = np.zeros(stations_2011.shape[0])
stations_2011['lat'] = np.zeros(stations_2011.shape[0])
stations_2011['lon'] = np.zeros(stations_2011.shape[0])

In [25]:
# For each station in "stations_2011", add the latitude, longitude, and region using "station_loc"

for i in stations_2011.index:
    for j in station_loc.index:
        if stations_2011.loc[i,'station'] == station_loc.loc[j, 'name']:
            stations_2011.loc[i, 'lat'] = station_loc.loc[j, 'lat']
            stations_2011.loc[i, 'lon'] = station_loc.loc[j, 'lon']
            stations_2011.loc[i, 'region'] = station_loc.loc[j, 'region']
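
The nested loop compares every 2011 station with every current station. A left merge on the station name should give the same lookup in one call. The sketch below is an alternative I did not use in this notebook, shown on toy frames (`stations_demo`, `loc_demo`) built from values that appear in the outputs here.

```python
import pandas as pd

# Toy stand-ins for stations_2011 and station_loc.
stations_demo = pd.DataFrame({"station": ["5th & F St NW", "21st & M St NW"]})
loc_demo = pd.DataFrame({
    "name": ["5th & F St NW"],
    "lat": [38.897222],
    "lon": [-77.019347],
    "region": ["Washington, DC"],
})

# Left merge keeps every 2011 station; unmatched rows get NaN instead of 0.
merged = (
    stations_demo
    .merge(loc_demo, how="left", left_on="station", right_on="name")
    .drop(columns="name")
)
print(merged)
```

Two differences from the loop are worth noting: missing coordinates show up as NaN rather than 0, and a station name that appears more than once in the lookup table would produce duplicate rows.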
       


In [26]:
stations_2011.head(5)

Unnamed: 0,station,region,lat,lon
0,5th & F St NW,"Washington, DC",38.897222,-77.019347
1,14th & Harvard St NW,"Washington, DC",38.9268,-77.0322
2,Georgia & New Hampshire Ave NW,"Washington, DC",38.936684,-77.024181
3,10th & U St NW,"Washington, DC",38.9172,-77.0259
4,Adams Mill & Columbia Rd NW,"Washington, DC",38.922925,-77.042581


Create a list of the missing station names (i.e. stations for which 'lat' and 'lon' are not available)

In [27]:
missing_stations = []
for i in stations_2011.index:
    if stations_2011.loc[i,'lat'] == 0 or stations_2011.loc[i,'lon'] == 0:
        missing_stations.append(stations_2011.loc[i,'station'])
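
The list of missing stations can also be built without an explicit loop, using a boolean mask. This is a sketch on a two-row toy frame (`demo`), assuming that missing coordinates are stored as 0 as in the cell above.

```python
import pandas as pd

# Toy frame: one station with coordinates, one still at the 0 placeholder.
demo = pd.DataFrame({
    "station": ["5th & F St NW", "21st & M St NW"],
    "lat": [38.897222, 0.0],
    "lon": [-77.019347, 0.0],
})

# A row counts as missing if either coordinate is still 0.
mask = (demo["lat"] == 0) | (demo["lon"] == 0)
missing_demo = demo.loc[mask, "station"].tolist()
print(missing_demo)
```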

In [28]:
missing_stations

['Crystal City Metro / 18th & Bell St',
 '21st & M St NW',
 'Eastern Market Metro / Pennsylvania Ave & 7th St SE',
 'Connecticut Ave & Newark St NW / Cleveland Park',
 '18th & Eads St.',
 '19th & L St NW',
 '23rd & Crystal Dr',
 'Aurora Hills Community Ctr/18th & Hayes St',
 'S Joyce & Army Navy Dr',
 'Georgia Ave and Fairmont St NW',
 '20th & Crystal Dr',
 'S Glebe & Potomac Ave',
 'USDA / 12th & Independence Ave SW',
 '27th & Crystal Dr',
 'Pentagon City Metro / 12th & S Hayes St',
 '12th & Army Navy Dr',
 '26th & S Clark St',
 '15th & Crystal Dr',
 'Eads & 22nd St S',
 '1st & N St  SE',
 'Lynn & 19th St North',
 'N Rhodes & 16th St N',
 'Rosslyn Metro / Wilson Blvd & Ft Myer Dr',
 'Wilson Blvd & Franklin Rd',
 '11th & H St NE']

<h3> Part 3 </h3>

In [29]:
import geopandas as gpd
import geopy
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from difflib import SequenceMatcher
# Note: geopandas.tools.geocode is not imported, since "geocode" is redefined below with RateLimiter

In [32]:
missing = pd.DataFrame(missing_stations)
missing = missing.rename(columns = {0: 'station'})

In [34]:
missing.head(2)

Unnamed: 0,station
0,Crystal City Metro / 18th & Bell St
1,21st & M St NW


In [35]:
# If the address in "missing" is very close to the address in "station_loc", then use the region name from "station_loc"

for i in range(len(missing)):
    for j in range (len(station_loc)):
        if (SequenceMatcher(None, missing.loc[i,'station'], station_loc.loc[j, 'name'])).ratio()>= 0.9:
            missing.loc[i,'region'] = station_loc.loc[j,'region']

missing['region'] = missing['region'].fillna(0)
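
To make the fuzzy matching step easier to inspect, the comparison can be wrapped in a small helper. This is a sketch, not the code used above: `best_match` is a hypothetical function, and the candidate names in the usage line are illustrative. Unlike the loop, which keeps the last candidate above the threshold, the helper keeps the most similar one.

```python
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.9):
    """Return the candidate most similar to `name`, or None if none reaches `threshold`."""
    best, best_ratio = None, threshold
    for cand in candidates:
        # ratio() is 1.0 for identical strings and falls toward 0 as they diverge.
        ratio = SequenceMatcher(None, name, cand).ratio()
        if ratio >= best_ratio:
            best, best_ratio = cand, ratio
    return best

# "and" versus "&" is a small spelling difference, so this pair should still match.
print(best_match("Georgia Ave and Fairmont St NW",
                 ["Georgia Ave & Fairmont St NW", "Eads St & 15th St S"]))
```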
            

In [36]:
# Initialise 'lat' and 'lon' columns
missing['lat'] = np.zeros((len(missing),1))
missing['lon'] = np.zeros((len(missing),1))

In [37]:
# Initialise geolocator and geocode

geolocator = Nominatim(user_agent="bike_search")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

In [38]:
# Add a space at the beginning of the "region" entries + make sure format of "region" column is "string"
missing['region'] = " " + missing['region'].astype(str)

In [39]:
# Now perform geocoding for stations with missing coordinates

for i in range(len(missing)):

    try:
        # "region" is now a string with a leading space, so a missing region reads " 0" rather than 0
        if missing.loc[i, 'region'].strip() == '0':  # If the "region" is missing, then do a bare geocode search
            location = geocode(missing.loc[i,'station'], timeout = 15)    # One request per station
            dummy_lat = location.latitude
            dummy_lon = location.longitude

            if (dummy_lon > -78 and dummy_lon <-76) and (dummy_lat > 38.5 and dummy_lat<39.5):     # Want coordinates to be in or around Washington DC 
                missing.loc[i, 'lat'] = dummy_lat
                missing.loc[i, 'lon'] = dummy_lon

        else:                                       # If "region" name is available, add it to the geocode search
            location = geocode(missing.loc[i,'station'] + missing.loc[i, 'region'], timeout = 15)
            dummy_lat = location.latitude
            dummy_lon = location.longitude

            if (dummy_lon > -79 and dummy_lon <-76) and (dummy_lat > 38 and dummy_lat<40):
                missing.loc[i, 'lat'] = dummy_lat
                missing.loc[i, 'lon'] = dummy_lon

    except AttributeError:                          # geocode returned None (no match found)
        pass
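
The geocode-then-check pattern repeats in several cells, so it could be factored into one helper. The sketch below is only an illustration: `geocode_in_bounds` is a hypothetical name, and `FakeLocation` and `fake_geocode` are stand-ins so the example runs without contacting Nominatim. In the notebook, the rate-limited `geocode` would be passed in instead.

```python
from collections import namedtuple

# Stand-in for a geopy Location; only the two attributes used here.
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])

def geocode_in_bounds(geocode_fn, query, lat_range=(38.5, 39.5), lon_range=(-78, -76)):
    """Call geocode_fn once; return (lat, lon) if the hit lies inside the box, else None."""
    location = geocode_fn(query)
    if location is None:          # No match found
        return None
    lat, lon = location.latitude, location.longitude
    if lat_range[0] < lat < lat_range[1] and lon_range[0] < lon < lon_range[1]:
        return lat, lon
    return None                   # Match found, but outside the Washington DC area

def fake_geocode(query):
    """Hypothetical geocoder that knows a single station (coordinates from this notebook)."""
    table = {"21st & M St NW": FakeLocation(38.905107, -77.057402)}
    return table.get(query)

print(geocode_in_bounds(fake_geocode, "21st & M St NW"))
```

Checking for `None` explicitly replaces the `try` / `except AttributeError` pattern, and each query is sent only once.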





In [42]:
# Retry with a bare geocode search (no region) for stations that still have no coordinates
for i in range(len(missing)):

    try:
        if missing.loc[i, 'lat'] == 0:
            location = geocode(missing.loc[i,'station'], timeout = 15)    # One request per station
            dummy_lat = location.latitude
            dummy_lon = location.longitude

            if (dummy_lon > -78 and dummy_lon <-76) and (dummy_lat > 38.5 and dummy_lat<39.5):
                missing.loc[i, 'lat'] = dummy_lat
                missing.loc[i, 'lon'] = dummy_lon

    except AttributeError:                          # geocode returned None (no match found)
        pass

In [40]:
# Even after the geocode search, some station coordinates are still missing.
# Some of those station names have a "/" in them.
# For those, try a geocode search for the part that comes before the "/"

for i in range(len(missing)):

    try:
        # Only touch stations that are still missing, so earlier results are not overwritten
        if missing.loc[i, 'lat'] == 0 and "/" in missing.loc[i, 'station']:
            dummy_string = missing.loc[i, 'station'].split('/')[0].strip()  # Check part that comes before separator '/'

            location = geocode(dummy_string, timeout = 15)                  # One request per station
            dummy_lat = location.latitude
            dummy_lon = location.longitude

            if (dummy_lon > -78 and dummy_lon <-76) and (dummy_lat > 38.5 and dummy_lat<39.5):
                missing.loc[i, 'lat'] = dummy_lat
                missing.loc[i, 'lon'] = dummy_lon

    except AttributeError:                          # geocode returned None (no match found)
        pass
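
The three passes (with region, bare, and the part before the "/") amount to trying a list of queries from most to least specific. The sketch below builds such a list for one station. `candidate_queries` is a hypothetical helper, and also trying the part after the "/" is my own extension, not something the cells above do.

```python
def candidate_queries(station, region=None):
    """Return search strings for one station, ordered from most to least specific."""
    queries = []
    if region:
        queries.append(f"{station} {region}")    # Station name plus region
    queries.append(station)                       # Bare station name
    parts = [p.strip() for p in station.split("/")]
    if len(parts) > 1:
        queries.extend(parts)                     # Each side of the "/" on its own
    return queries

print(candidate_queries("USDA / 12th & Independence Ave SW"))
```

Each query would then be tried in order, stopping at the first result that falls inside the bounding box.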
            
            
            

In [43]:
missing

Unnamed: 0,station,region,lat,lon
0,Crystal City Metro / 18th & Bell St,"Arlington, VA",38.857756,-77.051196
1,21st & M St NW,"Washington, DC",38.905107,-77.057402
2,Eastern Market Metro / Pennsylvania Ave & 7th ...,"Washington, DC",38.884056,-76.995262
3,Connecticut Ave & Newark St NW / Cleveland Park,0,38.934267,-77.057979
4,18th & Eads St.,0,0.0,0.0
5,19th & L St NW,"Washington, DC",38.903799,-77.053958
6,23rd & Crystal Dr,0,38.853166,-77.050493
7,Aurora Hills Community Ctr/18th & Hayes St,0,38.857792,-77.059103
8,S Joyce & Army Navy Dr,0,38.86571,-77.061773
9,Georgia Ave and Fairmont St NW,"Washington, DC",38.9249,-77.0222


Nice. After doing the geocode search, we were able to find the coordinates for 23 of the 25 missing stations.