# Walkable Accessibility Score (WAS)

### Last update: November 14, 2024

### Compute a Walkable Accessibility Score (WAS) at the block group scale using InfoUSA POI data

This notebook computes a Walkable Accessibility Score (WAS) from the distances between businesses (points) and the centroids of block groups (points). The goal is to show, through a worked example, how to compute an access metric, and to make the method accessible enough for practitioners and scholars to adapt for their own purposes. Businesses can easily be swapped for other data of interest, such as schools or parks. Likewise, the polygons (in this case, block groups) can be replaced with other geographies, such as tracts or blocks.

In this example, we use business data from InfoUSA and the block group geometries from [IPUMS NHGIS](https://data2.nhgis.org/).

It takes approximately 13 minutes to run the entire notebook.

In [1]:
# Add this cell to time how long it takes to run the notebook

import timeit
start_time = timeit.default_timer()

### 1. Load libraries needed

In [2]:
# Load libraries
from sklearn.neighbors import BallTree
import numpy as np
import pandas as pd
import geopandas as gpd
from scipy import stats # for correlation

In [3]:
import sys  
sys.path.insert(1, '/users/ifarah/appdata/roaming/python/python39/site-packages')

We must specify the correct directory path to load some packages. In the code above, replace 'ifarah' with your own username. The correct path should appear in the directories reported when the packages were installed.
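
If the hard-coded path is the per-user site-packages directory (an assumption; it looks like one), the standard library can locate it without editing the username by hand. A minimal sketch:

```python
import site
import sys

# Per-user site-packages directory for the running interpreter.
user_site = site.getusersitepackages()

# Add it to the import path only if it is not already there.
if user_site not in sys.path:
    sys.path.insert(1, user_site)

print(user_site)
```

Some older virtual environments ship a modified `site` module, so this may not work everywhere; the hard-coded path remains a valid fallback.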

### 2. Load data

Load data that contain latitude and longitude as columns. These can be point locations or centroids of polygons.
In this case, we use InfoUSA data, which is proprietary. However, you can use any business data you have access to by adding it to the `data` folder.

In [4]:
# Load 2011 InfoUSA data - other data can be used
# Takes ~2 min to run
# df = pd.read_csv('../data/1997_Business_Academic_QCQ.txt', sep=",", encoding='latin-1')
df = pd.read_csv('../data/2011_Business_Academic_QCQ.txt', sep=",", encoding='latin-1')



### 3. Know your data!

#### Check how large your data is and what information it contains.

In [5]:
"Your data contains " + str(len(df)) + " rows."

'Your data contains 13468613 rows.'

The table contains the following information:

In [6]:
sorted(df.columns)

['ABI',
 'Address Line 1',
 'Address Type Indicator',
 'Archive Version Year',
 'Area Code',
 'Business Status Code',
 'CBSA Code',
 'CBSA Level',
 'CSA Code',
 'Census Block',
 'Census Tract',
 'City',
 'Company',
 'Company Holding Status',
 'County Code',
 'Employee Size (5) - Location',
 'FIPS Code',
 'IDCode',
 'Industry Specific First Byte',
 'Latitude',
 'Location Employee Size Code',
 'Location Sales Volume Code',
 'Longitude',
 'Match Code',
 'NAICS8 Descriptions',
 'Office Size Code',
 'Parent Actual Employee Size',
 'Parent Actual Sales Volume',
 'Parent Employee Size Code',
 'Parent Number',
 'Parent Sales Volume Code',
 'Population Code',
 'Primary NAICS Code',
 'Primary SIC Code',
 'SIC Code',
 'SIC Code 1',
 'SIC Code 2',
 'SIC Code 3',
 'SIC Code 4',
 'SIC6_Descriptions',
 'SIC6_Descriptions (SIC)',
 'SIC6_Descriptions (SIC1)',
 'SIC6_Descriptions(SIC2)',
 'SIC6_Descriptions(SIC3)',
 'SIC6_Descriptions(SIC4)',
 'Sales Volume (9) - Location',
 'Site Number',
 'State',
 'S

### 4. Clean data of interest

#### 4.1. Filter data

The amenities we select are: groceries, restaurants, coffee shops, banks, parks, schools, bookstores, entertainment, and general shopping establishments.

School data come from the 2011 [Great Schools](https://www.greatschools.org/catalog/pdf/GreatSchools-2011-AR-final.pdf) data, and park locations are centroids extracted from open 2021 [ArcGIS data](https://www.arcgis.com/home/item.html?id=f092c20803a047cba81fbf1e30eff0b5).
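
NAICS codes are hierarchical: the first two digits identify a sector, and longer prefixes identify progressively narrower industries. The sketch below illustrates, in plain Python, the prefix matching that the filter in this notebook relies on. The helper name `is_amenity` and the last sample code are hypothetical; the prefixes are taken from the notebook's filter.

```python
# Prefixes taken from the notebook's filter: one 2-digit sector,
# fourteen 4-digit groups, and two 6-digit industries.
AMENITY_PREFIXES = (
    '72',
    '4421', '4431', '4451', '4453', '4461', '4481', '4482', '4483',
    '4511', '4523', '4531', '4532', '4539', '5221',
    '311811', '451211',
)

def is_amenity(naics_code):
    """Return True if the code starts with any selected prefix (hypothetical helper)."""
    return str(naics_code).startswith(AMENITY_PREFIXES)

# Sample 8-digit codes; the last one is a made-up non-amenity code.
for code in ['45311001', '72251117', '45391003', '23610000']:
    print(code, is_amenity(code))
```

Matching on prefixes of different lengths lets a single list mix broad categories (everything under `72`) with very specific ones (only `451211`).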


In [7]:
# Create string versions of the NAICS code so it is easier to filter the categories of interest.
df['NAICS'] = df['Primary NAICS Code'].astype(str)
df['NAICS2'] = df.NAICS.str[:2]
df['NAICS4'] = df.NAICS.str[:4]
df['NAICS6'] = df.NAICS.str[:6]
df.NAICS4.value_counts()

# Filter by specific amenity NAICS codes
naics4_codes = ['4421', '4431', '4451', '4453', '4461', '4481', '4482', '4483',
                '4511', '4523', '4531', '4532', '4539', '5221']
naics6_codes = ['311811', '451211']
filtered = df.loc[(df['NAICS2'] == '72')
                  | df['NAICS4'].isin(naics4_codes)
                  | df['NAICS6'].isin(naics6_codes)]

# Remove Puerto Rico, Alaska, Hawaii, and the US Virgin Islands: we measure distances, and non-contiguous states and territories would distort the analysis
filtered = filtered[~filtered['State'].isin(['PR', 'AK', 'HI', 'VI'])]

# Drop rows whose longitude or latitude holds the malformed value '-000.000-76'
filtered = filtered[filtered.Longitude != '-000.000-76']
filtered = filtered[filtered.Latitude != '-000.000-76']
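
A more general way to catch malformed coordinates is to try parsing each value as a number and check that it falls in a valid range. The following is a minimal sketch in plain Python, not the notebook's method; the helper `parse_coordinate` and the sample values (other than `'-000.000-76'` and the first longitude) are hypothetical.

```python
def parse_coordinate(value, lower, upper):
    """Return the value as a float, or None if it is malformed or out of range."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    # NaN fails the range comparison, so it is rejected here as well.
    if lower <= number <= upper:
        return number
    return None

# '-000.000-76' is the malformed value removed in the cell above.
samples = ['-72.620332', '-000.000-76', '', None, '999']
longitudes = [parse_coordinate(s, -180.0, 180.0) for s in samples]
print(longitudes)
```

Rows whose parsed longitude or latitude is `None` could then be dropped before building the geometry.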

#### Check your data: how large is the filtered data, and what does it look like?

In [8]:
"Your filtered data contains " + str(len(filtered)) + " rows."

'Your filtered data contains 1968832 rows.'

In [9]:
filtered.head(3)

| | Company | Address Line 1 | City | State | ZipCode | Zip4 | County Code | Area Code | IDCode | Location Employee Size Code | ... | Longitude | Match Code | CBSA Code | CBSA Level | CSA Code | FIPS Code | NAICS | NAICS2 | NAICS4 | NAICS6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 101581 | AMGARI HOME & GARDEN | 127 LEALAND AVE | AGAWAM | MA | 1001.0 | 2413.0 | 13.0 | 413 | 2 | A | ... | -72.620332 | P | 44140.0 | 2.0 | 521.0 | 25013.0 | 45311001.0 | 45 | 4531 | 453110 |
| 101590 | ISTANBUL MEDITERRENEAN GRILL | 365 WALNUT STREET EXT | AGAWAM | MA | 1001.0 | 1523.0 | 13.0 | 413 | 2 | C | ... | -72.628956 | P | 44140.0 | 2.0 | 521.0 | 25013.0 | 72251117.0 | 72 | 7225 | 722511 |
| 101603 | DAVE'S SODA & PET CITY INC | 151 SPRINGFIELD ST | AGAWAM | MA | 1001.0 | 1553.0 | 13.0 | 413 | 2 | D | ... | -72.633365 | P | 44140.0 | 2.0 | 521.0 | 25013.0 | 45391003.0 | 45 | 4539 | 453910 |


#### 4.2 Bring in the spatial!

In [10]:
# Create a geodataframe from coordinates (latitude and longitude)
gdf = gpd.GeoDataFrame(
    filtered,
    geometry=gpd.points_from_xy(filtered.Longitude, filtered.Latitude),
    crs='epsg:4326') # EPSG:4326 is WGS 84 (longitude/latitude in degrees)

# Reproject to a Coordinate Reference System (CRS) with units in meters,
# so that distances between points can be measured.
# Check for different projections here: https://epsg.io/
gdf = gdf.to_crs('esri:102003')

In [11]:
# Check that the CRS actually changed
gdf.crs

<Projected CRS: ESRI:102003>
Name: USA_Contiguous_Albers_Equal_Area_Conic
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: United States (USA) - CONUS onshore - Alabama; Arizona; Arkansas; California; Colorado; Connecticut; Delaware; Florida; Georgia; Idaho; Illinois; Indiana; Iowa; Kansas; Kentucky; Louisiana; Maine; Maryland; Massachusetts; Michigan; Minnesota; Mississippi; Missouri; Montana; Nebraska; Nevada; New Hampshire; New Jersey; New Mexico; New York; North Carolina; North Dakota; Ohio; Oklahoma; Oregon; Pennsylvania; Rhode Island; South Carolina; South Dakota; Tennessee; Texas; Utah; Vermont; Virginia; Washington; West Virginia; Wisconsin; Wyoming.
- bounds: (-124.79, 24.41, -66.91, 49.38)
Coordinate Operation:
- name: USA_Contiguous_Albers_Equal_Area_Conic
- method: Albers Equal Area
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
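
Reprojecting matters because Euclidean distances are only meaningful in a planar CRS with linear units. In latitude/longitude, the ground length of one degree of longitude shrinks as latitude increases, as this rough sketch shows. It assumes a spherical Earth with a mean radius of 6,371 km, so the figures are approximations; the function name is hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation

def meters_per_degree_longitude(latitude_deg):
    """Approximate ground length of one degree of longitude at a given latitude."""
    return math.radians(1) * EARTH_RADIUS_M * math.cos(math.radians(latitude_deg))

# Latitudes at the southern and northern bounds of the CRS area of use.
south = meters_per_degree_longitude(24.41)
north = meters_per_degree_longitude(49.38)
print(round(south), round(north))
```

Because the two values differ substantially, a distance computed directly on degrees would mean different things in different parts of the country.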

In [12]:
# Make sure that the geometry for each row has a value
gdf = gdf[~gdf.is_empty]

In [13]:
"The data contains " + str(len(gdf)) + " rows."

'The data contains 1968809 rows.'

#### 4.3 Add more data: schools and parks

In [14]:
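# Add 2011 GreatSchools school data (other sources can be used)
sch = gpd.read_file('../data/GreatSchools_2011_us48/GreatSchools_2011_us48.shp') 
sch = sch.to_crs('esri:102003')
# Add 2021 ESRI parks data (centroids)
prk = gpd.read_file('../data/Centroids_for_USA_Parks_2021/parks.shp') 
prk = prk.to_crs('esri:102003')

# Stack businesses, schools, and parks into one amenities table
lst = [gdf, sch, prk]
am = pd.concat(lst, ignore_index=True, axis=0)
am["ID"] = am.index

# Keep only the geometry
# Note: this selects from gdf (businesses only), not from the combined table am;
# use am[['geometry']] instead if schools and parks should be included.
am_id = gdf[['geometry']]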
am_id.head(3)

| | geometry |
|---|---|
| 101581 | POINT (1902104.651 747129.973) |
| 101590 | POINT (1900945.730 748835.840) |
| 101603 | POINT (1900558.840 748889.230) |


### 5. Load the geography!

#### 5.1. In this case, we load block groups

In [15]:
# Load the geography (often a shapefile).
# Here we read the block group file from IPUMS: one spatial definition of demand units for all time periods.

s_v = gpd.read_file('../data/2015_US_BG/BG_mainland.shp')

# Set the Coordinate Reference System.
# Note: set_crs only assigns the CRS and does not reproject; this assumes the coordinates are already in ESRI:102003.
s_v = s_v.set_crs('esri:102003', allow_override=True)
s_v.rename(columns={'GEOID': 'ID'}, inplace=True) # Rename the ID column for convenience

# Extract the centroids of the polygons.
# Replace the column "geometry" with the centroids of geography.
# This will change the geometry from "polygon" to "point" geometry.
s_v['geometry'] = s_v.centroid

# Check that the geometry is indeed in point form
s_v[['geometry']].head(3)

| | geometry |
|---|---|
| 0 | POINT (-2256868.242 354675.748) |
| 1 | POINT (-2258832.974 353148.920) |
| 2 | POINT (-2259050.925 352843.123) |
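
For intuition, the centroid of a simple polygon can be computed from its vertices with the shoelace formula. The sketch below is a plain-Python illustration for a single ring without holes; it is not necessarily how geopandas computes `centroid` internally, and the function name `polygon_centroid` is hypothetical.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon with zero area")
    return cx / (3 * area2), cy / (3 * area2)

# A 4 x 2 rectangle: the centroid is its center.
rectangle_centroid = polygon_centroid([(0, 0), (4, 0), (4, 2), (0, 2)])
print(rectangle_centroid)
```

For strongly concave shapes, the centroid can fall outside the polygon, which is worth keeping in mind when a centroid stands in for a whole block group.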


### Create the Functions for Running the Access Score

In [16]:
# This cell creates a function for estimating nearest neighbors from point to point.
def get_nearest_neighbors(gdf1, gdf2, k_neighbors=2):
    '''Find k nearest neighbors for all source points from a set of candidate points.

    Modified from: https://automating-gis-processes.github.io/site/notebooks/L3/nearest-neighbor-faster.html

    Parameters
    ----------
    gdf1 : geopandas.GeoDataFrame
        Geometries to search from.
    gdf2 : geopandas.GeoDataFrame
        Geometries to be searched.
    k_neighbors : int, optional
        Number of nearest neighbors. The default is 2.

    Returns
    -------
    gdf_final : geopandas.GeoDataFrame
        gdf1 with distance, index and all other columns from gdf2.'''

    src_points = [(x,y) for x,y in zip(gdf1.geometry.x , gdf1.geometry.y)]
    candidates =  [(x,y) for x,y in zip(gdf2.geometry.x , gdf2.geometry.y)]

    # Create tree from the candidate points
    tree = BallTree(candidates, leaf_size=15, metric='euclidean')

    # Find closest points and distances
    distances, indices = tree.query(src_points, k=k_neighbors)

    # Transpose to get distances and indices into arrays
    distances = distances.transpose()
    indices = indices.transpose()

    closest_gdfs = []
    for k in np.arange(k_neighbors):
        gdf_new = gdf2.iloc[indices[k]].reset_index()
        gdf_new['distance'] =  distances[k]
        gdf_new = gdf_new.add_suffix(f'_{k+1}')
        closest_gdfs.append(gdf_new)
    
    closest_gdfs.insert(0,gdf1)    
    gdf_final = pd.concat(closest_gdfs,axis=1)

    return gdf_final

def clean_dataframe(df):
    # Create the ID2 column (a unique row identifier needed for reshaping)
    df["ID2"] = df.index

    # Reshape the dataframe from wide to long format: one row per (origin, neighbor) pair
    long_df = pd.wide_to_long(df, stubnames=["distance_", "index_", "geometry_"], i="ID2", j="neighbor")

    # Copy the columns under clearer names
    long_df.loc[:, 'origin'] = long_df['ID']
    long_df.loc[:, 'dest'] = long_df['index_']
    long_df.loc[:, 'euclidean'] = long_df['distance_']

    # Reset index and keep necessary columns
    long_df = long_df.reset_index(level="neighbor")

    # Sort by origin and euclidean distance.
    # Returning a new dataframe (instead of sorting a slice in place) avoids a SettingWithCopyWarning.
    cost_df = long_df[['euclidean', 'origin', 'dest', 'neighbor']].sort_values(by=['origin', 'euclidean'])

    return cost_df

def access_measure(df_cost, df_sv, upper, decay):
    # Convert euclidean distance (meters) into walking time (seconds) at 5 km/h
    # https://journals.sagepub.com/doi/10.1177/0265813516641685
    df_cost['time'] = (df_cost['euclidean'] * 3600) / 5000
    
    # Logistic distance-decay weight for each origin-amenity pair
    df_cost['LogitT_5'] = 1 - (1 / (np.exp((upper / 180) - decay * df_cost['time']) + 1))
    
    # Sum the columns (including the decay weights) by origin ID (block groups in this example)
    cost_sum = df_cost.groupby("origin").sum()
    cost_sum['ID'] = cost_sum.index
    
    # Merge with the corresponding smaller sv original dataframe
    cost_merge = df_sv.merge(cost_sum, how='inner', on='ID')
    
    return cost_merge
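
`get_nearest_neighbors` returns one row per origin with suffixed columns (`distance_1`, `index_1`, `distance_2`, and so on), and `clean_dataframe` reshapes this into one row per origin-neighbor pair. The reshaping idea can be sketched without pandas as follows; the sample values and the helper name `wide_to_long_rows` are made up for illustration.

```python
# One wide row per origin, with suffixed columns for each neighbor (made-up values).
wide_rows = [
    {'ID': 'A', 'index_1': 10, 'distance_1': 120.0, 'index_2': 11, 'distance_2': 450.0},
    {'ID': 'B', 'index_1': 11, 'distance_1': 80.0, 'index_2': 12, 'distance_2': 300.0},
]

def wide_to_long_rows(rows, k_neighbors):
    """Return one record per (origin, neighbor) pair, sorted by origin and distance."""
    long_rows = []
    for row in rows:
        for k in range(1, k_neighbors + 1):
            long_rows.append({
                'origin': row['ID'],
                'dest': row[f'index_{k}'],
                'euclidean': row[f'distance_{k}'],
                'neighbor': k,
            })
    return sorted(long_rows, key=lambda r: (r['origin'], r['euclidean']))

long_rows = wide_to_long_rows(wide_rows, k_neighbors=2)
for record in long_rows:
    print(record)
```

The long format is what makes the later `groupby("origin")` aggregation straightforward.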

# Estimate Access Metrics

### Run the k nearest neighbors using 10, 50, and 150 nearest neighbors

In [17]:
#For 10 NN:
closest10 = get_nearest_neighbors(s_v, am_id, k_neighbors=10)
#For 50 NN: #10 seconds
closest50 = get_nearest_neighbors(s_v, am_id, k_neighbors=50)
#For 150NN: #30 s approx
closest150 = get_nearest_neighbors(s_v, am_id, k_neighbors=150)
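
Conceptually, each call returns, for every block group centroid, the `k` closest amenities and their straight-line distances. `BallTree` does this efficiently; the brute-force sketch below only illustrates the result on a handful of made-up points and would be far too slow for the full dataset. The function name `k_nearest` is hypothetical.

```python
import heapq
import math

def k_nearest(source, candidates, k):
    """Return [(distance, candidate_index), ...] for the k closest candidates."""
    distances = (
        (math.dist(source, point), i) for i, point in enumerate(candidates)
    )
    return heapq.nsmallest(k, distances)

# Made-up projected coordinates in meters.
centroid = (0.0, 0.0)
amenities = [(300.0, 400.0), (0.0, 100.0), (1000.0, 0.0), (-60.0, 80.0)]

nearest = k_nearest(centroid, amenities, k=2)
print(nearest)
```

Requesting more neighbors (50 or 150 instead of 10) lets distant amenities contribute to the score, at the cost of a larger intermediate table.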

### Clean the dataframe

In [18]:
cost10 = clean_dataframe(closest10)
cost50 = clean_dataframe(closest50)
cost150 = clean_dataframe(closest150)



### Estimate the access metric

This function takes the nearest-neighbor dataframe, the spatial geometry, the upper distance limit (in meters), and the parameter of the distance decay function. The decay parameter reflects how willing a person is to travel: it down-weights amenities that are farther away.
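
The weight that `access_measure` gives to each amenity is a logistic function of walking time, and a block group's score is the sum of these weights over its nearest amenities. The sketch below reproduces the formula for a single distance using only the standard library; the function name `decay_weight` is hypothetical.

```python
import math

def decay_weight(distance_m, upper, decay):
    """Logistic weight from access_measure, for a single distance in meters."""
    time_s = distance_m * 3600 / 5000  # walking time in seconds at 5 km/h
    return 1 - 1 / (math.exp(upper / 180 - decay * time_s) + 1)

# Weights for upper=800 and decay=0.008 at a few distances.
weights = {d: decay_weight(d, upper=800, decay=0.008) for d in (0, 400, 800, 1600)}
for d, w in weights.items():
    print(d, round(w, 3))
```

With these parameters the weight equals one half where `upper / 180` equals `decay * time`, which works out to roughly 770 meters, so amenities beyond the upper limit contribute little to the score.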

In [19]:
result10_800 = access_measure(cost10, s_v, upper=800, decay=.008)
result50_800 = access_measure(cost50, s_v, upper=800, decay=.008)
result150_800 = access_measure(cost150, s_v, upper=800, decay=.008)

result10_1600 = access_measure(cost10, s_v, upper=1600, decay=.008)
result50_1600 = access_measure(cost50, s_v, upper=1600, decay=.008)
result150_1600 = access_measure(cost150, s_v, upper=1600, decay=.008)

result10_2400 = access_measure(cost10, s_v, upper=2400, decay=.008)
result50_2400 = access_measure(cost50, s_v, upper=2400, decay=.008)
result150_2400 = access_measure(cost150, s_v, upper=2400, decay=.008)
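
The nine calls above follow a regular pattern, so they could also be generated with a loop and stored in a dictionary keyed by `(k, upper)`. In this sketch, `fake_access_measure` is a hypothetical stand-in so the example runs on its own; in the notebook, the call would be `access_measure(cost, s_v, upper=upper, decay=.008)`.

```python
from itertools import product

def fake_access_measure(cost, upper, decay):
    """Hypothetical stand-in for access_measure; returns a label instead of a dataframe."""
    return f"{cost} upper={upper} decay={decay}"

# Stand-ins for cost10, cost50, and cost150.
costs = {10: 'cost10', 50: 'cost50', 150: 'cost150'}
uppers = (800, 1600, 2400)

results = {
    (k, upper): fake_access_measure(costs[k], upper=upper, decay=0.008)
    for k, upper in product(costs, uppers)
}

print(len(results))
print(results[(50, 800)])
```

A dictionary also makes the later comparison cells easier to write as a loop.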

## Estimate the Correlation between the Official Walkscore and the Walkable Accessibility Score

In [20]:
#  Check out correlations with walkscore
ws = gpd.read_file('../data/2011_walkscore/2011ws.shp')
#ws = ws.set_crs('esri:102003', allow_override=True)

In [21]:
ws.rename(columns={'GEOID': 'ID'}, inplace=True)

In [22]:
ws.head(2)

| | STATEFP | COUNTYFP | TRACTCE | BLKGRPCE | ID | NAMELSAD | MTFCC | FUNCSTAT | ALAND | AWATER | ... | 12LogitT_5 | 13LogitT_5 | 14LogitT_5 | 15LogitT_5 | 16LogitT_5 | 17LogitT_5 | 18LogitT_5 | 19LogitT_5 | AVG_Logit | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 45 | 11300 | 1 | 60450113001 | Block Group 1 | G5030 | S | 5189669.0 | 83111.0 | ... | 17.129483 | 17.25817 | 18.99555 | 18.065838 | 18.743008 | 19.757786 | 18.651607 | 18.05101 | 16.560474 | POLYGON ((-2296258.565 519630.635, -2296247.36... |
| 1 | 6 | 45 | 11500 | 5 | 60450115005 | Block Group 5 | G5030 | S | 1541883.0 | 13997.0 | ... | 18.68489 | 18.532505 | 18.634044 | 18.673034 | 17.754464 | 16.413181 | 13.568248 | 15.265348 | 16.153387 | POLYGON ((-2295930.431 515135.162, -2295948.97... |


### Upper threshold = 800

In [23]:
# Estimate score for 10 amenities, 800 upper threshold
df_ws_access = ws.merge(result10_800, how='inner', on='ID')
df=(df_ws_access[['ssws2use_m','LogitT_5']]).dropna()
stats.spearmanr(df['LogitT_5'],df['ssws2use_m'])

SignificanceResult(statistic=0.9033093261815489, pvalue=0.0)
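
Spearman's coefficient is the Pearson correlation of the ranks of the two variables, so it measures whether the two scores order the block groups similarly rather than whether they agree in value. A plain-Python sketch with average ranks for ties follows; the sample values and helper names are made up, and `scipy.stats.spearmanr` remains the function used in this notebook.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up scores with identical ordering but very different scales.
access = [2.1, 5.4, 9.8, 14.0]
walkscore = [10, 35, 60, 95]
rho = spearman(access, walkscore)
print(rho)
```

Because only ranks matter, the access score does not need to be rescaled to the 0 to 100 range of the Walkscore before comparing them.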

In [24]:
# Estimate score for 50 amenities, 800 upper threshold
df_ws_access = ws.merge(result50_800, how='inner', on='ID')
df=(df_ws_access[['ssws2use_m','LogitT_5']]).dropna()
stats.spearmanr(df['LogitT_5'],df['ssws2use_m'])

SignificanceResult(statistic=0.9101014119708172, pvalue=0.0)

In [25]:
# Estimate score for 150 amenities, 800 upper threshold
df_ws_access = ws.merge(result150_800, how='inner', on='ID')
df=(df_ws_access[['ssws2use_m','LogitT_5']]).dropna()
stats.spearmanr(df['LogitT_5'],df['ssws2use_m'])

SignificanceResult(statistic=0.9069446479480093, pvalue=0.0)

### Changing upper threshold to 1600

In [26]:
# Estimate score for 10 amenities, 1600 upper threshold
df_ws_access = ws.merge(result10_1600, how='inner', on='ID')
df=(df_ws_access[['ssws2use_m','LogitT_5']]).dropna()
stats.spearmanr(df['LogitT_5'],df['ssws2use_m'])

SignificanceResult(statistic=0.904987729749101, pvalue=0.0)

In [27]:
# Estimate score for 50 amenities, 1600 upper threshold
df_ws_access = ws.merge(result50_1600, how='inner', on='ID')
df=(df_ws_access[['ssws2use_m','LogitT_5']]).dropna()
stats.spearmanr(df['LogitT_5'],df['ssws2use_m'])

SignificanceResult(statistic=0.8858973401070748, pvalue=0.0)

In [28]:
# Estimate score for 150 amenities, 1600 upper threshold
df_ws_access = ws.merge(result150_1600, how='inner', on='ID')
df=(df_ws_access[['ssws2use_m','LogitT_5']]).dropna()
stats.spearmanr(df['LogitT_5'],df['ssws2use_m'])

SignificanceResult(statistic=0.8635614934226934, pvalue=0.0)

### Changing upper threshold to 2400

In [29]:
# Estimate score for 10 amenities, 2400 upper threshold
df_ws_access = ws.merge(result10_2400, how='inner', on='ID')
df=(df_ws_access[['ssws2use_m','LogitT_5']]).dropna()
stats.spearmanr(df['LogitT_5'],df['ssws2use_m'])

SignificanceResult(statistic=0.90400020509866, pvalue=0.0)

In [30]:
# Estimate score for 50 amenities, 2400 upper threshold
df_ws_access = ws.merge(result50_2400, how='inner', on='ID')
df=(df_ws_access[['ssws2use_m','LogitT_5']]).dropna()
stats.spearmanr(df['LogitT_5'],df['ssws2use_m'])

SignificanceResult(statistic=0.8739985591639057, pvalue=0.0)

In [31]:
# Estimate score for 150 amenities, 2400 upper threshold
df_ws_access = ws.merge(result150_2400, how='inner', on='ID')
df=(df_ws_access[['ssws2use_m','LogitT_5']]).dropna()
stats.spearmanr(df['LogitT_5'],df['ssws2use_m'])

SignificanceResult(statistic=0.8320930721172721, pvalue=0.0)

The highest correlation with the original Walkscore (a Spearman coefficient of about 0.910) comes from the model with an upper threshold of 800 meters and the 50 nearest amenities, so we choose that model to estimate the historical data.
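
The comparison can be made explicit by collecting the coefficients reported above and taking the maximum. The values below are copied from the outputs of the nine cells, rounded to four decimals.

```python
# Spearman coefficients from the cells above, keyed by (neighbors, upper threshold).
correlations = {
    (10, 800): 0.9033, (50, 800): 0.9101, (150, 800): 0.9069,
    (10, 1600): 0.9050, (50, 1600): 0.8859, (150, 1600): 0.8636,
    (10, 2400): 0.9040, (50, 2400): 0.8740, (150, 2400): 0.8321,
}

best = max(correlations, key=correlations.get)
print(best, correlations[best])
```

The three 800-meter models differ by less than 0.01, so the ranking among them is close.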

In [32]:
elapsed = timeit.default_timer() - start_time
elapsed
# Approximately 13 minutes

823.5288582080001