<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Load-County-County-HCV-Trip-Counts" data-toc-modified-id="Load-County-County-HCV-Trip-Counts-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load County-County HCV Trip Counts</a></span></li><li><span><a href="#Map-Nodes-to-County" data-toc-modified-id="Map-Nodes-to-County-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Map Nodes to County</a></span><ul class="toc-item"><li><span><a href="#Load-nodes" data-toc-modified-id="Load-nodes-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Load nodes</a></span></li><li><span><a href="#latlon-→-county,-for-each-node" data-toc-modified-id="latlon-→-county,-for-each-node-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>latlon → county, for each node</a></span></li><li><span><a href="#County-name-→-county_id,-for-each-node" data-toc-modified-id="County-name-→-county_id,-for-each-node-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>County name → county_id, for each node</a></span></li></ul></li><li><span><a href="#Split-county-county-trips-among-nodes-inside" data-toc-modified-id="Split-county-county-trips-among-nodes-inside-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Split county-county trips among nodes inside</a></span></li><li><span><a href="#How-many-trips-are-considered-in-the-network-graph?" data-toc-modified-id="How-many-trips-are-considered-in-the-network-graph?-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>How many trips are considered in the network graph?</a></span></li></ul></div>

In [1]:
import pandas as pd

## Load County-County HCV Trip Counts

***Note***: Run "`01 CSTDM County-County CV Trips.. Extract HD from All CVs.ipynb`" before this notebook.

In [2]:
hcv_trips = pd.read_csv('../scratch/HCV_TripsOD_2040.csv')

In [3]:
hcv_trips.head()

Unnamed: 0,O_County,D_County,Trips
0,1,1,81881.0
1,1,2,0.0
2,1,3,6.0
3,1,4,68.0
4,1,5,8.0


## Map Nodes to County

### Load nodes

In [4]:
nodes = pd.read_excel('../input/Nodes_manual_input.xlsx')

In [5]:
display(nodes.head())
display(nodes.info())

Unnamed: 0,node_id,node_name,rank,lon,lat,latlon_str,latlon_delim_pos,faf_zone,OD
0,1,Redding,1,-122.360642,40.58545,"40.58545,-122.360642",9,Rest of CA,1
1,2,Red Bluff,1,-122.224084,40.179209,"40.179209,-122.224084",10,Rest of CA,1
2,3,Dunnigan,2,-121.953458,38.860841,"38.860841,-121.953458",10,,0
3,4,Woodland,2,-121.755159,38.689872,"38.689872,-121.755159",10,,0
4,5,SMF,1,-121.59364,38.67109,"38.67109,-121.59364",9,Sacramento CA,1


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 9 columns):
node_id             75 non-null int64
node_name           75 non-null object
rank                75 non-null int64
lon                 75 non-null float64
lat                 75 non-null float64
latlon_str          75 non-null object
latlon_delim_pos    75 non-null int64
faf_zone            75 non-null object
OD                  75 non-null int64
dtypes: float64(2), int64(4), object(3)
memory usage: 5.4+ KB


None

### latlon → county, for each node

Define a function that finds the county for a pair of lat-lon coordinates.

Note: the `Nominatim.reverse()` function of `geopy` returns a `Location` object.  
Raw `Location` data (`Location.raw`) is a dictionary that looks like this:
```
{'address': {'city': 'SF',
  'country': 'United States of America',
  'country_code': 'us',
  'county': 'SF',
  'neighbourhood': 'West SoMa',
  'postcode': '94114',
  'road': 'James Lick Freeway',
  'state': 'California'},
 'boundingbox': ['37.7719252', '37.776113', '-122.4068281', '-122.4050958'],
 'display_name': 'James Lick Freeway, West SoMa, SF, California, 94114, United States of America',
 'lat': '37.7743586',
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright',
 'lon': '-122.4066586',
 'osm_id': '31129732',
 'osm_type': 'way',
 'place_id': '78309211'}
```
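
A reverse-geocoding result may not always carry a `county` key inside `address` (the sample above does, but that is worth guarding against). Below is a minimal sketch of a defensive lookup on a plain dictionary shaped like `Location.raw`, so it runs without any network call. The helper name `county_from_raw` is hypothetical and not part of `geopy`.

```python
def county_from_raw(raw, default=None):
    """Return the county from a Location.raw-style dict, or `default` if absent.

    Hypothetical helper for illustration; not part of geopy.
    """
    return raw.get('address', {}).get('county', default)


# Trimmed copy of the sample dictionary shown above
sample_raw = {
    'address': {'city': 'SF', 'county': 'SF', 'state': 'California'},
    'lat': '37.7743586',
    'lon': '-122.4066586',
}

print(county_from_raw(sample_raw))                       # SF
print(county_from_raw({'address': {}}, default='n/a'))   # n/a
```

Returning a default instead of raising `KeyError` keeps a long `apply` run from aborting on a single node; the missing values can then be inspected afterwards.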

In [6]:
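def getCounty(latlon, max_retries=5):
    # Refs:
    # - TimeOutError: https://gis.stackexchange.com/questions/173569/avoid-time-out-error-nominatim-geopy-open-street-maps
    from geopy.geocoders import Nominatim
    from geopy.exc import GeocoderTimedOut
    import time
    # Recent geopy versions require an explicit user_agent; this name is a placeholder
    geolocator = Nominatim(user_agent='hcv-node-county-lookup')
    for attempt in range(max_retries):
        try:
            return geolocator.reverse(latlon).raw['address']['county']
        except GeocoderTimedOut:
            time.sleep(0.01)    # wait and try again
    # Give up after max_retries timeouts instead of recursing without limit
    raise GeocoderTimedOut('No response for %s after %d attempts' % (latlon, max_retries))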

(Expect a running time of roughly 2 to 3 minutes for the county matching, since each node requires one network request.)

In [7]:
%%time
nodes['county_long'] = nodes.latlon_str.apply(getCounty)

CPU times: user 1.18 s, sys: 119 ms, total: 1.3 s
Wall time: 1min 41s


Re-format county name

In [8]:
def shortenCounty(county_long):
    '''
    Strip the word 'County' from a county name string.
    E.g. shortenCounty('Alameda County') → 'Alameda'
    Acronyms like 'SF' are spelled out.
    E.g. shortenCounty('SF County') → 'San Francisco'
    '''
    import re
    acronyms = {'SF': 'San Francisco'}
    # Remove 'county' (any case) as a whole word, then trim surrounding whitespace
    short = re.sub(pattern=r'\bcounty\b', repl='', string=county_long, flags=re.IGNORECASE).strip()
    # Spell out known acronyms; leave every other name unchanged
    return acronyms.get(short, short)

In [9]:
nodes['county_short'] = nodes['county_long'].apply(shortenCounty)

In [10]:
nodes.head()

Unnamed: 0,node_id,node_name,rank,lon,lat,latlon_str,latlon_delim_pos,faf_zone,OD,county_long,county_short
0,1,Redding,1,-122.360642,40.58545,"40.58545,-122.360642",9,Rest of CA,1,Shasta County,Shasta
1,2,Red Bluff,1,-122.224084,40.179209,"40.179209,-122.224084",10,Rest of CA,1,Tehama County,Tehama
2,3,Dunnigan,2,-121.953458,38.860841,"38.860841,-121.953458",10,,0,Yolo County,Yolo
3,4,Woodland,2,-121.755159,38.689872,"38.689872,-121.755159",10,,0,Yolo County,Yolo
4,5,SMF,1,-121.59364,38.67109,"38.67109,-121.59364",9,Sacramento CA,1,Sacramento County,Sacramento


Unique county names

In [11]:
nodes.county_short.unique()

array(['Shasta', 'Tehama', 'Yolo', 'Sacramento', 'Placer', 'San Joaquin',
       'Solano', 'Contra Costa', 'Alameda', 'San Francisco', 'San Mateo',
       'Santa Clara', 'Merced', 'Kern', 'Stanislaus', 'Fresno',
       'Los Angeles', 'Orange', 'San Bernardino', 'San Diego', 'Kings',
       'Madera', 'Tulare', 'Glenn'], dtype=object)

### County name → county_id, for each node

In [12]:
county_lookup = pd.read_csv('../input/CaliforniaCountyRegionLookup.csv')

In [13]:
county_lookup = county_lookup.filter(regex='County')

In [14]:
county_lookup.head()

Unnamed: 0,County_No,County/Gateway
0,1,Alameda
1,2,Alpine
2,3,Amador
3,4,Butte
4,5,Calaveras


In [15]:
nodes = pd.merge(left=nodes, right=county_lookup, 
                 how='left', left_on='county_short', right_on='County/Gateway')\
        .drop('County/Gateway', axis=1)
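
A left merge keeps nodes whose `county_short` has no match in the lookup table and leaves their `County_No` empty. Below is a minimal sketch of how such rows could be spotted, using the `indicator` option of `pd.merge` on toy data. The table names `toy_nodes` and `toy_lookup` and the misspelled county are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for `nodes` and `county_lookup`; values are illustrative only
toy_nodes = pd.DataFrame({
    'node_id': [1, 2, 3],
    'county_short': ['Shasta', 'Yolo', 'Yollo'],   # 'Yollo' is a deliberate typo
})
toy_lookup = pd.DataFrame({
    'County_No': [45, 57],
    'County/Gateway': ['Shasta', 'Yolo'],
})

merged = pd.merge(left=toy_nodes, right=toy_lookup, how='left',
                  left_on='county_short', right_on='County/Gateway',
                  indicator=True)

# Rows found only in the left table did not receive a county ID
unmatched = merged.loc[merged['_merge'] == 'left_only', 'county_short'].tolist()
print(unmatched)   # ['Yollo']
```

An empty `unmatched` list would indicate that every node received a county ID.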

In [16]:
nodes.head()

Unnamed: 0,node_id,node_name,rank,lon,lat,latlon_str,latlon_delim_pos,faf_zone,OD,county_long,county_short,County_No
0,1,Redding,1,-122.360642,40.58545,"40.58545,-122.360642",9,Rest of CA,1,Shasta County,Shasta,45
1,2,Red Bluff,1,-122.224084,40.179209,"40.179209,-122.224084",10,Rest of CA,1,Tehama County,Tehama,52
2,3,Dunnigan,2,-121.953458,38.860841,"38.860841,-121.953458",10,,0,Yolo County,Yolo,57
3,4,Woodland,2,-121.755159,38.689872,"38.689872,-121.755159",10,,0,Yolo County,Yolo,57
4,5,SMF,1,-121.59364,38.67109,"38.67109,-121.59364",9,Sacramento CA,1,Sacramento County,Sacramento,34


Save the processed `nodes` table to the `../scratch/` folder

In [17]:
nodes.to_csv('../scratch/Nodes_info.csv', index=False)

## Split county-county trips among nodes inside

In [18]:
nodes.head()

Unnamed: 0,node_id,node_name,rank,lon,lat,latlon_str,latlon_delim_pos,faf_zone,OD,county_long,county_short,County_No
0,1,Redding,1,-122.360642,40.58545,"40.58545,-122.360642",9,Rest of CA,1,Shasta County,Shasta,45
1,2,Red Bluff,1,-122.224084,40.179209,"40.179209,-122.224084",10,Rest of CA,1,Tehama County,Tehama,52
2,3,Dunnigan,2,-121.953458,38.860841,"38.860841,-121.953458",10,,0,Yolo County,Yolo,57
3,4,Woodland,2,-121.755159,38.689872,"38.689872,-121.755159",10,,0,Yolo County,Yolo,57
4,5,SMF,1,-121.59364,38.67109,"38.67109,-121.59364",9,Sacramento CA,1,Sacramento County,Sacramento,34


In [19]:
# Keep only nodes with OD > 0, selecting rows and columns in a single .loc call
OD_nodes = nodes.loc[nodes.OD > 0, ['node_id', 'County_No']]
display(OD_nodes.head())

Unnamed: 0,node_id,County_No
0,1,45
1,2,52
4,5,34
5,6,34
6,7,31


In [20]:
## node-node trips
nn_trips = pd.merge(left=hcv_trips, right=OD_nodes, 
                    how='inner', left_on='O_County', right_on='County_No')\
                .drop('County_No', axis=1)\
                .rename(columns={'node_id':'orig_node_id'})
nn_trips = pd.merge(left=nn_trips, right=OD_nodes, 
                    how='inner', left_on='D_County', right_on='County_No')\
                .drop('County_No', axis=1)\
                .rename(columns={'node_id':'dest_node_id'})

In [21]:
nn_trips = nn_trips[nn_trips.orig_node_id != nn_trips.dest_node_id]

In [22]:
nn_trips.head()

Unnamed: 0,O_County,D_County,Trips,orig_node_id,dest_node_id
1,1,1,81881.0,13,18
2,1,1,81881.0,18,13
4,7,1,6895.0,12,13
5,7,1,6895.0,12,18
6,10,1,490.0,25,13


In [23]:
od_count = nn_trips.groupby(['O_County', 'D_County']).size().to_frame(name='same_county_OD_count').reset_index()

In [24]:
od_count.head()

Unnamed: 0,O_County,D_County,same_county_OD_count
0,1,1,2
1,1,7,2
2,1,10,2
3,1,19,6
4,1,30,2


In [25]:
nn_trips = pd.merge(left=nn_trips, right=od_count, on=['O_County','D_County'])

In [26]:
nn_trips['NodeNode_trips'] = nn_trips['Trips']/nn_trips['same_county_OD_count']

In [27]:
display(nn_trips.head())
display(nn_trips.tail())

Unnamed: 0,O_County,D_County,Trips,orig_node_id,dest_node_id,same_county_OD_count,NodeNode_trips
0,1,1,81881.0,13,18,2,40940.5
1,1,1,81881.0,18,13,2,40940.5
2,7,1,6895.0,12,13,2,3447.5
3,7,1,6895.0,12,18,2,3447.5
4,10,1,490.0,25,13,2,245.0


Unnamed: 0,O_County,D_County,Trips,orig_node_id,dest_node_id,same_county_OD_count,NodeNode_trips
375,38,52,0.0,15,2,2,0.0
376,39,52,9.0,19,2,1,9.0
377,41,52,3.0,16,2,1,3.0
378,43,52,10.0,17,2,1,10.0
379,45,52,867.0,1,2,1,867.0
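
The cells above divide each county-pair total equally among all node pairs that connect the two counties, after dropping pairs in which origin and destination are the same node. Below is a small self-contained sketch of that logic on toy data (the trip totals are invented). It also checks that the node-level trips add back up to the county-level totals.

```python
import pandas as pd

# Toy inputs: county 1 contains nodes 13 and 18, county 7 contains node 12
toy_trips = pd.DataFrame({'O_County': [1, 7], 'D_County': [1, 1],
                          'Trips': [100.0, 60.0]})
toy_od_nodes = pd.DataFrame({'node_id': [13, 18, 12], 'County_No': [1, 1, 7]})

# Attach every origin node, then every destination node, to each county pair
pairs = (toy_trips
         .merge(toy_od_nodes, left_on='O_County', right_on='County_No')
         .drop('County_No', axis=1).rename(columns={'node_id': 'orig_node_id'})
         .merge(toy_od_nodes, left_on='D_County', right_on='County_No')
         .drop('County_No', axis=1).rename(columns={'node_id': 'dest_node_id'}))
pairs = pairs[pairs.orig_node_id != pairs.dest_node_id]   # drop node-to-itself pairs

# Count the remaining node pairs per county pair, then give each an equal share
n_pairs = pairs.groupby(['O_County', 'D_County'])['Trips'].transform('count')
pairs['NodeNode_trips'] = pairs['Trips'] / n_pairs

print(pairs[['orig_node_id', 'dest_node_id', 'NodeNode_trips']])

# Node-level trips summed per county pair should reproduce the original totals
totals = pairs.groupby(['O_County', 'D_County'])['NodeNode_trips'].sum()
print(totals)
```

Because the pair count is taken after the self-pairs are removed, the shares within each county pair still sum to the county-level total.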


Save the node-node trips as a CSV in the `../scratch/` folder. The file holds the origin node, destination node, and trip count columns plus a `path_id`; because of `index=False`, the `OD_id` index is not written.

In [38]:
OD_info = nn_trips[['orig_node_id', 'dest_node_id', 'NodeNode_trips']].copy()
OD_info['path_id'] = OD_info.index
OD_info['OD_id'] = OD_info.index
OD_info.set_index('OD_id', inplace=True)
OD_info.rename(columns={'NodeNode_trips':'trip_count'}, inplace=True)
OD_info.to_csv('../scratch/OD_info.csv', index=False)

In [39]:
OD_info.head()

OD_id,orig_node_id,dest_node_id,trip_count,path_id
0,13,18,40940.5,0
1,18,13,40940.5,1
2,12,13,3447.5,2
3,12,18,3447.5,3
4,25,13,245.0,4


## How many trips are considered in the network graph?

In [29]:
OD_counties = OD_nodes.County_No.unique()

In [30]:
OD_counties.size

15

In [32]:
considered_trips_count = hcv_trips[hcv_trips.O_County.isin(OD_counties) 
                                   & hcv_trips.D_County.isin(OD_counties)]\
                                  ['Trips']\
                                  .sum()

In [33]:
considered_trips_count

2308760.0

In [34]:
c2c_trips_count = hcv_trips.Trips.sum()

In [35]:
considered_trips_count/c2c_trips_count

0.62434287098853836
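
The ratio above is the share of all county-county HCV trips whose origin and destination counties both contain at least one OD node. A toy version of the same calculation (counties and trip counts are invented) could look like this:

```python
import pandas as pd

toy_trips = pd.DataFrame({
    'O_County': [1, 1, 2, 3],
    'D_County': [1, 2, 1, 3],
    'Trips':    [50.0, 30.0, 10.0, 10.0],
})
toy_od_counties = [1, 2]   # counties that contain at least one OD node

# True where both the origin and the destination county are covered
both_ends_covered = (toy_trips.O_County.isin(toy_od_counties)
                     & toy_trips.D_County.isin(toy_od_counties))
share = toy_trips.loc[both_ends_covered, 'Trips'].sum() / toy_trips['Trips'].sum()
print(share)
```

Note that this filter also counts intra-county trips for counties that have a single OD node, whereas the node-node split drops those pairs because origin and destination would be the same node. The share may therefore be somewhat higher than the trips actually assigned to node pairs.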