# Demo -- Walkable Accessibility Score (WAS)

### Date: July 25, 2024

### Compute a Walkable Accessibility Score (WAS) with a small dataset

This notebook computes a Walkable Accessibility Score (WAS) from the distances between businesses (points) and the centroids of block groups (points). The goal is to show, through a worked example, how to compute an access metric in a way that practitioners and scholars can adapt for their own purposes. The businesses can easily be replaced with other points of interest, such as schools or parks. Likewise, the polygons (here, block groups) can be replaced with other geographies, such as tracts or blocks.

### 1. Load libraries needed

In [1]:
# Load libraries for creating the score.
from sklearn.neighbors import BallTree # Estimate nearest neighbors
import numpy as np # Manipulate data and create functions
import pandas as pd # Manipulate data frames (tables)
import geopandas as gpd # Manipulate geographic data frames (tables with a geometry such as lines, polygons, or points.)

# Libraries to download and format data from web URLs.
import json      # Working with JSON-formatted text strings
import requests  # Accessing content from web URLs



In [27]:
# If you don't have these libraries installed, uncomment the lines you need:
#! pip install scikit-learn  # provides the sklearn module
#! pip install numpy
#! pip install pandas
#! pip install geopandas
#! pip install requests
# json is part of the Python standard library and needs no installation.

### 2. Download data

We need data that contain latitude and longitude as columns of the table.

These could be points or centroids of polygons.
In this example, we download business data through the API of the City of Chicago open data portal, [data.cityofchicago.org](https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Current-Active/uupf-x98q/about_data), and use the block group geometries from [IPUMS NHGIS](https://data2.nhgis.org/).

In [9]:
# Download data
endpoint_url = "https://data.cityofchicago.org/resource/uupf-x98q.json" # Returns up to 1,000 rows by default
response = requests.get(endpoint_url)
results = response.text

# Parse the string into a Python list of dictionaries (loads = "load string")
data = json.loads(results)

# Convert list of dictionaries to Pandas dataframe (easier to read and format)
df = pd.DataFrame.from_records(data)

df.head()

Unnamed: 0,license_description,zip_code,license_id,location,date_issued,city,ward_precinct,address,license_status,conditional_approval,...,license_number,license_approved_for_issuance,expiration_date,account_number,site_number,license_code,legal_name,id,payment_date,ssa
0,Valet Parking Operator,60611,2977242,"{'latitude': '41.89019627869941', 'human_addre...",2024-07-17T00:00:00.000,CHICAGO,42-25,30 E HUBBARD ST,AAI,N,...,2917753,2024-07-16T00:00:00.000,2025-06-30T00:00:00.000,488915,10,2101,ADVANCE PARKING SERVICE INC.,2917753-20241116,,
1,Limited Business License,60607,2978227,"{'latitude': '41.876855139603904', 'human_addr...",2024-07-16T00:00:00.000,CHICAGO,34-16,550 W VAN BUREN ST 3 300,AAI,N,...,2896069,2024-07-15T00:00:00.000,2026-09-15T00:00:00.000,4553,4,1010,SUPERIOR GRAPHITE CO,2896069-20240916,2024-07-15T00:00:00.000,
2,Limited Business License,60601,2980225,"{'latitude': '41.88670773797291', 'human_addre...",2024-07-17T00:00:00.000,CHICAGO,42-9,77 W WACKER DR 7TH 700,AAI,N,...,1893713,2024-07-16T00:00:00.000,2026-09-15T00:00:00.000,325308,2,1010,"ALYESKA INVESTMENT GROUP, L.P.",1893713-20240916,2024-07-16T00:00:00.000,
3,Limited Business License,60606,2980276,"{'latitude': '41.88053924995021', 'human_addre...",2024-07-18T00:00:00.000,CHICAGO,42-9,227 W MONROE ST 21ST,AAI,N,...,2972492,2024-07-17T00:00:00.000,2026-09-15T00:00:00.000,327365,2,1010,THE CHICAGO HIRE COMPANY,2972492-20240916,2024-07-17T00:00:00.000,
4,Retail Food Establishment,60610,2980530,"{'latitude': '41.89816036918342', 'human_addre...",2024-07-16T00:00:00.000,CHICAGO,42-45,850 N STATE ST,AAI,N,...,2856308,2024-07-15T00:00:00.000,2026-09-15T00:00:00.000,349928,19,1006,RDK VENTURES LLC,2856308-20240916,2024-07-15T00:00:00.000,
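
The request above returns only the first 1,000 rows. Socrata-style endpoints such as this one accept `$limit` and `$offset` query parameters for paging, and `$order` to keep the row order stable between requests. The following is a minimal sketch of how the page URLs could be built. `page_urls` is a hypothetical helper that is not part of the original notebook, and the total of 2,500 rows is an arbitrary number chosen for illustration.

```python
from urllib.parse import urlencode

def page_urls(endpoint, total_rows, page_size=1000):
    """Build one URL per page of results (hypothetical helper for illustration)."""
    urls = []
    for offset in range(0, total_rows, page_size):
        # "$order" asks for a stable row order so pages do not overlap.
        params = {"$limit": page_size, "$offset": offset, "$order": ":id"}
        # Keep "$" and ":" readable instead of percent-encoding them.
        urls.append(endpoint + "?" + urlencode(params, safe="$:"))
    return urls

# total_rows=2500 is an assumed figure, not the real size of the dataset.
urls = page_urls("https://data.cityofchicago.org/resource/uupf-x98q.json", total_rows=2500)
for url in urls:
    print(url)
```

Each URL could then be fetched with `requests.get(url).json()` and the resulting lists combined before building the data frame.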


### 3. Know your data!

#### Check how large your data is and what information it contains.

In [11]:
print("There are " + str(len(df)) + " rows in your dataset.")

There are 1000 rows in your dataset.


The table contains the following information:

In [9]:
sorted(df.columns.tolist()) # Make sure there are "latitude" and "longitude" columns

['account_number',
 'address',
 'application_requirements_complete',
 'application_type',
 'business_activity',
 'business_activity_id',
 'city',
 'conditional_approval',
 'date_issued',
 'doing_business_as_name',
 'expiration_date',
 'id',
 'latitude',
 'legal_name',
 'license_approved_for_issuance',
 'license_code',
 'license_description',
 'license_id',
 'license_number',
 'license_start_date',
 'license_status',
 'location',
 'longitude',
 'payment_date',
 'police_district',
 'precinct',
 'site_number',
 'ssa',
 'state',
 'ward',
 'ward_precinct',
 'zip_code']
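
Because the data arrive as JSON, the coordinate values are text (note the quoted latitude in the `location` column above), and some rows may have no coordinates at all. A minimal sketch of how to check this, using a small made-up table in place of the downloaded one:

```python
import pandas as pd

# Toy table standing in for the downloaded data (values are illustrative).
toy = pd.DataFrame({
    "legal_name": ["A", "B", "C"],
    "latitude": ["41.8902", None, "41.8769"],
    "longitude": ["-87.6267", None, "-87.6415"],
})

# Convert text to numbers; anything that cannot be parsed becomes NaN.
for col in ["latitude", "longitude"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Count rows that lack a usable coordinate pair.
n_missing = toy[["latitude", "longitude"]].isna().any(axis=1).sum()
print(f"{n_missing} of {len(toy)} rows have no coordinates.")
```

Running the same two steps on `df` would show how many businesses cannot be placed on the map before any geometry is created.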

### 4. Bring in the spatial!

In [12]:
# Create a geodataframe from coordinates (latitude and longitude)
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df.longitude, df.latitude),
    crs='epsg:4326') # EPSG:4326 is WGS 84, i.e. coordinates in degrees of longitude and latitude

In [13]:
# Note that a geometry column is added at the end of the table
gdf.head(3)

Unnamed: 0,license_description,zip_code,license_id,location,date_issued,city,ward_precinct,address,license_status,conditional_approval,...,license_approved_for_issuance,expiration_date,account_number,site_number,license_code,legal_name,id,payment_date,ssa,geometry
0,Valet Parking Operator,60611,2977242,"{'latitude': '41.89019627869941', 'human_addre...",2024-07-17T00:00:00.000,CHICAGO,42-25,30 E HUBBARD ST,AAI,N,...,2024-07-16T00:00:00.000,2025-06-30T00:00:00.000,488915,10,2101,ADVANCE PARKING SERVICE INC.,2917753-20241116,,,POINT (-87.62674 41.8902)
1,Limited Business License,60607,2978227,"{'latitude': '41.876855139603904', 'human_addr...",2024-07-16T00:00:00.000,CHICAGO,34-16,550 W VAN BUREN ST 3 300,AAI,N,...,2024-07-15T00:00:00.000,2026-09-15T00:00:00.000,4553,4,1010,SUPERIOR GRAPHITE CO,2896069-20240916,2024-07-15T00:00:00.000,,POINT (-87.64154 41.87686)
2,Limited Business License,60601,2980225,"{'latitude': '41.88670773797291', 'human_addre...",2024-07-17T00:00:00.000,CHICAGO,42-9,77 W WACKER DR 7TH 700,AAI,N,...,2024-07-16T00:00:00.000,2026-09-15T00:00:00.000,325308,2,1010,"ALYESKA INVESTMENT GROUP, L.P.",1893713-20240916,2024-07-16T00:00:00.000,,POINT (-87.63079 41.88671)


In [14]:
# Change the Coordinate Reference System (CRS) to a projected one with units in meters
# Check for different projections here: https://epsg.io/
gdf = gdf.to_crs('esri:102003')

In [15]:
# Check that the CRS actually changed
gdf.crs

<Projected CRS: ESRI:102003>
Name: USA_Contiguous_Albers_Equal_Area_Conic
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: United States (USA) - CONUS onshore - Alabama; Arizona; Arkansas; California; Colorado; Connecticut; Delaware; Florida; Georgia; Idaho; Illinois; Indiana; Iowa; Kansas; Kentucky; Louisiana; Maine; Maryland; Massachusetts; Michigan; Minnesota; Mississippi; Missouri; Montana; Nebraska; Nevada; New Hampshire; New Jersey; New Mexico; New York; North Carolina; North Dakota; Ohio; Oklahoma; Oregon; Pennsylvania; Rhode Island; South Carolina; South Dakota; Tennessee; Texas; Utah; Vermont; Virginia; Washington; West Virginia; Wisconsin; Wyoming.
- bounds: (-124.79, 24.41, -66.91, 49.38)
Coordinate Operation:
- name: USA_Contiguous_Albers_Equal_Area_Conic
- method: Albers Equal Area
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
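
The reason for reprojecting is that Euclidean distances computed directly on degrees are not meaningful: the ground length of one degree of longitude shrinks as you move away from the equator. A rough illustration, assuming a spherical Earth with a mean radius of 6,371 km (an approximation used only for this sketch):

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical approximation

def km_per_degree_longitude(latitude_deg):
    """Approximate east-west length of one degree of longitude, in km."""
    km_per_degree_at_equator = 2 * math.pi * EARTH_RADIUS_KM / 360
    return km_per_degree_at_equator * math.cos(math.radians(latitude_deg))

at_equator = km_per_degree_longitude(0.0)
at_chicago = km_per_degree_longitude(41.88)  # roughly the latitude of the sample rows
print(round(at_equator, 1), round(at_chicago, 1))
```

In a projected CRS with metre units, such as the one set above, a coordinate difference of 1 is intended to represent 1 m in either direction, which is what the distance calculation later in the notebook relies on.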

In [16]:
# Keep only the rows whose geometry is not empty
gdf = gdf[~gdf.is_empty]

In [17]:
print("The data contains " + str(len(gdf)) + " rows.")

The data contains 923 rows.


#### 4.3 Add more data: schools and parks

In [None]:

# I skipped this one now (Irene)

# Add 2011 GreatSchools school data (can use other sources)
sch = gpd.read_file('GreatSchools_2011_us48.shp') 
sch = sch.to_crs('esri:102003')
#2021 ESRI parks data (centroids)
prk = gpd.read_file('Centroids_for_USA_Parks_2021_Buffer2.shp') 
prk = prk.to_crs('esri:102003')

In [None]:

#I skipped this one now (Irene)

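# NOTE: gbis is not defined in this notebook; it presumably refers to the business GeoDataFrame (gdf above).
lst = [gbis, sch, prk]
am = pd.concat(lst, ignore_index=True, axis=0)
am["ID"] = am.index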

In [18]:
# Change this one back from gdf to am (Irene)

# Isolate the geometry of the data
am_id = gdf[['geometry']]
am_id

Unnamed: 0,geometry
0,POINT (689027.03 522216.763)
1,POINT (687942.698 520622.685)
2,POINT (688728.2 521798.565)
3,POINT (688477.137 521083.604)
4,POINT (688813.081 523092.326)
...,...
995,POINT (678490.812 524274.869)
996,POINT (672704.208 531505.752)
997,POINT (688649.216 521117.42)
998,POINT (689722 518075.108)


### 5. Load the geography!

#### 5.1. In this case, we load block groups

In [23]:
# Block group file used in this case: one spatial definition of demand units for all time periods
s_v = gpd.read_file('WAS_USA/data/US_blck_grp_2015.shp') # Load the geography (often stored as a shapefile).

# Check the data
s_v.head()

Unnamed: 0,STATEFP,COUNTYFP,TRACTCE,BLKGRPCE,GEOID,NAMELSAD,MTFCC,FUNCSTAT,ALAND,AWATER,INTPTLAT,INTPTLON,GISJOIN,Shape_Leng,Shape_Area,geometry
0,6,1,400100,1,60014001001,Block Group 1,G5030,S,6894340.0,0.0,37.8676275,-122.231946,G06000104001001,14302.720874,6894336.0,"POLYGON ((-2255602.272 353149.335, -2255597.39..."
1,6,1,400200,1,60014002001,Block Group 1,G5030,S,288960.0,0.0,37.8497418,-122.2488605,G06000104002001,2970.286365,288961.4,"POLYGON ((-2258184.246 353217.527, -2258186.81..."
2,6,1,400200,2,60014002002,Block Group 2,G5030,S,298490.0,0.0,37.8465865,-122.2503095,G06000104002002,3162.343955,298488.7,"POLYGON ((-2258439.13 352894.146, -2258619.651..."
3,6,1,400300,1,60014003001,Block Group 1,G5030,S,265695.0,0.0,37.8439848,-122.2486668,G06000104003001,2553.074982,265694.8,"POLYGON ((-2258662.984 352641.307, -2258755.16..."
4,6,1,400300,2,60014003002,Block Group 2,G5030,S,269098.0,0.0,37.836255,-122.2516875,G06000104003002,3529.914115,269099.5,"POLYGON ((-2259955.644 352133.337, -2259945.80..."


In [24]:
# Clean the geometry data
s_v = s_v.set_crs('esri:102003', allow_override=True) # Declare the Coordinate Reference System (this labels the coordinates; it does not reproject them)
s_v.rename(columns={'GEOID': 'ID'}, inplace=True) # Rename the columns for convenience

In [26]:
#Size of the dataset
len(s_v)

219768

#### 5.2 Create subsets of data to *avoid* computing irrelevant distances.

In this case, we split the continental US block groups into subsets to avoid estimating distances between, for example, a business in California and a block group in New York.

In [27]:
# Kevin -- not sure how you defined these cutoffs? (Irene)
s_v1 = s_v.iloc[0:43167]
s_v1 = s_v1.reset_index(drop=True)

s_v2 = s_v.iloc[43167:86333]
s_v2 = s_v2.reset_index(drop=True)

s_v3 = s_v.iloc[86333:129500]
s_v3 = s_v3.reset_index(drop=True)

s_v4 = s_v.iloc[129500:172666]
s_v4 = s_v4.reset_index(drop=True)

s_v5 = s_v.iloc[172666:]
s_v5 = s_v5.reset_index(drop=True)
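
The cell above slices the table by row position with hand-written cutoffs. As a sketch of an alternative, the same kind of positional split can be derived from the length of the table, so the cutoffs do not need to be edited when the input changes. `split_rows` is a hypothetical helper and the 11-row table is toy data. Note that this only splits rows in file order; it does not group block groups by location.

```python
import numpy as np
import pandas as pd

def split_rows(table, n_chunks):
    """Split a table into n_chunks consecutive pieces of near-equal length."""
    positions = np.array_split(np.arange(len(table)), n_chunks)
    return [table.iloc[p].reset_index(drop=True) for p in positions]

# Toy table with 11 rows standing in for the block groups.
toy = pd.DataFrame({"ID": [f"bg{i}" for i in range(11)]})
chunks = split_rows(toy, 3)
print([len(c) for c in chunks])  # chunk sizes: [4, 4, 3]
```

With the real data this would be called as `split_rows(s_v, 5)`, giving a list of five tables in place of `s_v1` to `s_v5`.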

### 6. We have the data ready, so let's create the access score!

#### 6.1. Find the k nearest POI points to each block group

In [28]:
# This cell creates a function for estimating nearest neighbors from point to point.

def get_nearest_neighbors(gdf1, gdf2, k_neighbors=2):
    '''Find k nearest neighbors for all source points from a set of candidate points
    modified from: https://automating-gis-processes.github.io/site/notebooks/L3/nearest-neighbor-faster.html
    Parameters
    ----------
    gdf1 : geopandas.GeoDataFrame
        Point geometries to search from.
    gdf2 : geopandas.GeoDataFrame
        Point geometries to be searched.
    k_neighbors : int, optional
        Number of nearest neighbors. The default is 2.
    Returns
    -------
    gdf_final : geopandas.GeoDataFrame
        gdf1 with distance, index and all other columns from gdf2.'''

    src_points = [(x,y) for x,y in zip(gdf1.geometry.x , gdf1.geometry.y)]
    candidates =  [(x,y) for x,y in zip(gdf2.geometry.x , gdf2.geometry.y)]

    # Create tree from the candidate points
    tree = BallTree(candidates, leaf_size=15, metric='euclidean')

    # Find closest points and distances
    distances, indices = tree.query(src_points, k=k_neighbors)

    # Transpose to get distances and indices into arrays
    distances = distances.transpose()
    indices = indices.transpose()

    closest_gdfs = []
    for k in np.arange(k_neighbors):
        gdf_new = gdf2.iloc[indices[k]].reset_index()
        gdf_new['distance'] =  distances[k]
        gdf_new = gdf_new.add_suffix(f'_{k+1}')
        closest_gdfs.append(gdf_new)
    
    closest_gdfs.insert(0,gdf1)    
    gdf_final = pd.concat(closest_gdfs,axis=1)

    return gdf_final
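
To see what `tree.query` returns inside the function, here is a minimal sketch on made-up coordinates (three candidate points and two source points, chosen only for illustration):

```python
import numpy as np
from sklearn.neighbors import BallTree

candidates = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]])  # e.g. amenities
sources = np.array([[0.0, 0.0], [9.0, 0.0]])                  # e.g. centroids

# Build the tree on the candidates, then query it with the sources.
tree = BallTree(candidates, leaf_size=15, metric="euclidean")
distances, indices = tree.query(sources, k=2)

print(distances)  # row i holds the distances from source i, nearest first
print(indices)    # row i holds the matching row positions in `candidates`
```

Both arrays have one row per source point and `k` columns. The function transposes them so that each loop iteration handles the first neighbor of every source, then the second, and so on.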

In [29]:
# Find the closest k amenities for each block group (BG), with Euclidean distances.
# get_nearest_neighbors needs point geometries, so each block group polygon
# is represented by its centroid.
# Whole US subsets
closest_am1 = get_nearest_neighbors(s_v1.set_geometry(s_v1.centroid), am_id, k_neighbors=150)
closest_am2 = get_nearest_neighbors(s_v2.set_geometry(s_v2.centroid), am_id, k_neighbors=150)
closest_am3 = get_nearest_neighbors(s_v3.set_geometry(s_v3.centroid), am_id, k_neighbors=150)
closest_am4 = get_nearest_neighbors(s_v4.set_geometry(s_v4.centroid), am_id, k_neighbors=150)
closest_am5 = get_nearest_neighbors(s_v5.set_geometry(s_v5.centroid), am_id, k_neighbors=150)

In [None]:
# Reshape from wide to long (one row per block group and neighbor)
# Whole US subsets
closest_am1["ID2"] = closest_am1.index
closest_l1 = pd.wide_to_long(closest_am1, ["distance_","index_","geometry_"], i="ID2", j="neighbor")

In [None]:
closest_am2["ID2"] = closest_am2.index
closest_l2 = pd.wide_to_long(closest_am2, ["distance_","index_","geometry_"], i="ID2", j="neighbor")

In [None]:
closest_am3["ID2"] = closest_am3.index
closest_l3 = pd.wide_to_long(closest_am3, ["distance_","index_","geometry_"], i="ID2", j="neighbor")

In [None]:
closest_am4["ID2"] = closest_am4.index
closest_l4 = pd.wide_to_long(closest_am4, ["distance_","index_","geometry_"], i="ID2", j="neighbor")

In [None]:
closest_am5["ID2"] = closest_am5.index
closest_l5 = pd.wide_to_long(closest_am5, ["distance_","index_","geometry_"], i="ID2", j="neighbor")
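
The reshaping done by `pd.wide_to_long` in the cells above is easier to follow on a small table. This sketch uses two made-up block groups with two neighbors each; the column names mirror those produced by `get_nearest_neighbors`, and the values are illustrative.

```python
import pandas as pd

# Wide layout: one row per block group, one column set per neighbor.
wide = pd.DataFrame({
    "ID": ["bg_a", "bg_b"],
    "distance_1": [120.0, 300.0], "index_1": [7, 2],
    "distance_2": [450.0, 310.0], "index_2": [3, 7],
})
wide["ID2"] = wide.index

# Long layout: one row per (block group, neighbor) pair.
long = pd.wide_to_long(wide, ["distance_", "index_"], i="ID2", j="neighbor")
print(long.sort_index())
```

The numeric suffix of each column becomes the `neighbor` level of the index, and the stub names (`distance_`, `index_`) become the column names.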

In [None]:
# Rename to 'euclidean', 'origin', 'dest'
# Whole US subsets
closest_l1['origin'] = closest_l1['ID']
closest_l1['dest'] = closest_l1['index_']
closest_l1['euclidean'] = closest_l1['distance_']
closest_l1 = closest_l1.reset_index(level=("neighbor",))
# Sort into a new DataFrame rather than in place on a slice (avoids SettingWithCopyWarning)
cost1 = closest_l1[['euclidean', 'origin', 'dest', 'neighbor']].sort_values(by=['origin', 'euclidean'])

In [None]:
closest_l2['origin'] = closest_l2['ID']
closest_l2['dest'] = closest_l2['index_']
closest_l2['euclidean'] = closest_l2['distance_']
closest_l2= closest_l2.reset_index(level=("neighbor",))
cost2 = closest_l2[['euclidean', 'origin', 'dest', 'neighbor']].sort_values(by=['origin', 'euclidean'])

In [None]:
closest_l3['origin'] = closest_l3['ID']
closest_l3['dest'] = closest_l3['index_']
closest_l3['euclidean'] = closest_l3['distance_']
closest_l3= closest_l3.reset_index(level=("neighbor",))
cost3 = closest_l3[['euclidean', 'origin', 'dest', 'neighbor']].sort_values(by=['origin', 'euclidean'])

In [None]:
closest_l4['origin'] = closest_l4['ID']
closest_l4['dest'] = closest_l4['index_']
closest_l4['euclidean'] = closest_l4['distance_']
closest_l4= closest_l4.reset_index(level=("neighbor",))
cost4 = closest_l4[['euclidean', 'origin', 'dest', 'neighbor']].sort_values(by=['origin', 'euclidean'])

In [None]:
closest_l5['origin'] = closest_l5['ID']
closest_l5['dest'] = closest_l5['index_']
closest_l5['euclidean'] = closest_l5['distance_']
closest_l5= closest_l5.reset_index(level=("neighbor",))
cost5 = closest_l5[['euclidean', 'origin', 'dest', 'neighbor']].sort_values(by=['origin', 'euclidean'])

#### 6.2. Calculate accessibility measure

In [None]:
# Reference: https://doi.org/10.1177/0265813516641685
# Convert distance (meters) into walking time (seconds) at a rate of 5 km/h
cost1['time'] = (cost1.euclidean*3600)/5000
cost2['time'] = (cost2.euclidean*3600)/5000
cost3['time'] = (cost3.euclidean*3600)/5000
cost4['time'] = (cost4.euclidean*3600)/5000
cost5['time'] = (cost5.euclidean*3600)/5000

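# Both parameters must be defined before the next cell runs. The first listed
# option of each is enabled here only as a placeholder; pick the values to test.

# Choose the 'upper' parameter (for testing)
upper = 800
# upper = 1600
# upper = 2400

# Choose the decay rate
decay = .005
# decay = .008
# decay = .01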

In [None]:
cost1['LogitT_5'] = 1-(1/(np.e**((upper/180)-decay*cost1.time)+1))
cost2['LogitT_5'] = 1-(1/(np.e**((upper/180)-decay*cost2.time)+1))
cost3['LogitT_5'] = 1-(1/(np.e**((upper/180)-decay*cost3.time)+1))
cost4['LogitT_5'] = 1-(1/(np.e**((upper/180)-decay*cost4.time)+1))
cost5['LogitT_5'] = 1-(1/(np.e**((upper/180)-decay*cost5.time)+1))
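
To make the weighting concrete, the sketch below applies the same two expressions (the time conversion and the logistic weight) to a few example distances. The values `upper = 800` and `decay = 0.005` are one combination from the options listed for testing above, used here only for illustration; the function names are mine.

```python
import math

def walk_time_seconds(distance_m, speed_kph=5):
    """Walking time in seconds for a distance in meters."""
    return distance_m * 3600 / (speed_kph * 1000)

def decay_weight(time_s, upper, decay):
    """Logistic weight, same expression as the LogitT_5 column."""
    return 1 - 1 / (math.exp(upper / 180 - decay * time_s) + 1)

upper, decay = 800, 0.005  # one of the listed test combinations

for distance_m in (0, 500, 1000, 2000):
    t = walk_time_seconds(distance_m)
    print(distance_m, t, round(decay_weight(t, upper, decay), 3))

# The weight equals 0.5 where upper / 180 == decay * time.
half_time_s = upper / (180 * decay)
print(half_time_s)
```

The weight is close to 1 for nearby amenities and falls toward 0 as walking time grows. With these example parameters the midpoint (weight 0.5) is at about 889 seconds, roughly 15 minutes or 1.2 km at 5 km/h. A larger `upper` moves the midpoint farther out, and a larger `decay` brings it closer and makes the drop steeper.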

In [None]:
# Optional: inspect the distribution of the weights (requires `import matplotlib.pyplot as plt`)
# plt.hist(cost1.LogitT_5, bins=50)

In [None]:
# Sum the distance-decay weights by block group (origin) ID
cost_sum1 = cost1.groupby("origin").sum()
cost_sum1['ID'] = cost_sum1.index
cost_sum2 = cost2.groupby("origin").sum()
cost_sum2['ID'] = cost_sum2.index
cost_sum3 = cost3.groupby("origin").sum()
cost_sum3['ID'] = cost_sum3.index
cost_sum4 = cost4.groupby("origin").sum()
cost_sum4['ID'] = cost_sum4.index
cost_sum5 = cost5.groupby("origin").sum()
cost_sum5['ID'] = cost_sum5.index
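
The aggregation step can be illustrated on a toy table. The cell above sums every column of `cost`; the variant sketched here sums only the weight column, on the assumption that totals of `dest` or `neighbor` are not needed downstream. The output column name `weight_sum` is a hypothetical label, and the values are made up.

```python
import pandas as pd

cost = pd.DataFrame({
    "origin": ["bg_a", "bg_a", "bg_b"],
    "dest": [7, 3, 2],
    "LogitT_5": [0.9, 0.4, 0.1],
})

# Sum only the weight column; a total of "dest" (a row position) has no meaning.
score = cost.groupby("origin")["LogitT_5"].sum().rename("weight_sum").reset_index()
score = score.rename(columns={"origin": "ID"})
print(score)
```

Each block group ends up with a single number: the sum of the weights of its nearest amenities, so block groups with many close amenities receive higher values.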

In [None]:
cost_merge1 = s_v1.merge(cost_sum1, how='inner', on='ID')
cost_merge2 = s_v2.merge(cost_sum2, how='inner', on='ID')
cost_merge3 = s_v3.merge(cost_sum3, how='inner', on='ID')
cost_merge4 = s_v4.merge(cost_sum4, how='inner', on='ID')
cost_merge5 = s_v5.merge(cost_sum5, how='inner', on='ID')
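
An inner merge silently drops any block group whose `ID` has no match, for example when the two `ID` columns have different types. A minimal sketch of one way to check for this, using toy tables and the `indicator` option of `merge`:

```python
import pandas as pd

geography = pd.DataFrame({"ID": ["bg_a", "bg_b", "bg_c"]})
scores = pd.DataFrame({"ID": ["bg_a", "bg_b"], "LogitT_5": [1.3, 0.1]})

# A left merge keeps every block group; "_merge" records where each row came from.
check = geography.merge(scores, how="left", on="ID", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "ID"].tolist()
print(unmatched)
```

If the list of unmatched IDs is empty, the inner merge used above keeps every block group.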

In [None]:
# Export for a given year
# cost_merge1.to_file('us_walkability_access_score_2019_1.shp')
# cost_merge2.to_file('us_walkability_access_score_2019_2.shp')
# cost_merge3.to_file('us_walkability_access_score_2019_3.shp')
# cost_merge4.to_file('us_walkability_access_score_2019_4.shp')
# cost_merge5.to_file('us_walkability_access_score_2019_5.shp')