## Baseline model

This section builds the baseline model and reports the results of placing one new grocery store. It represents the analysis that a policy analyst with no background in data science would perform. We assume the analyst would simply download the data, calculate the percentage of individuals identified as food insecure within each census tract, and sort the tracts in descending order of that percentage. We assume the analyst would then weigh the most food insecure tracts against population and place the new grocery store in the most food insecure area with the largest population. Although the analyst would most likely perform this analysis in Excel, we use Python for consistency with the other models.

Inputs: 
- relevant_buildings.shp
- usda_lowincomelowaccess.csv


Pre-optimization setup: 
1. Use helper_population_allocation.py to allocate a population count to each residential building.
2. Use helper_distance_calculation.py to calculate the distance, and the resulting access, between each residential building and each commercial building.
3. Use helper_distance_calculation.py to calculate each residential building's existing access to grocery stores (within 0.5 mile).
4. Once steps 1-3 are done, all parameters are ready.

Note: in this case, the pre-optimization setup lets us compute the baseline results and keeps the baseline consistent with models 1 and 2, so that the results can be compared. Not all of these steps are needed for the bare-bones, policymaker implementation of the baseline model.
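
The helper modules are not shown in this notebook, so the following is only a minimal sketch of what a distance-and-access calculation like steps 2 and 3 could look like. It assumes great-circle (haversine) distance in miles and a 0.5 mile threshold; the function name `pairwise_haversine_miles` and the coordinates are hypothetical and are not taken from `helper_distance_calculation.py` or from the dataset.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, assumed for this sketch


def pairwise_haversine_miles(origins, destinations):
    """Great-circle distance in miles between every origin and every destination.

    Both inputs are sequences of (latitude, longitude) pairs in degrees.
    Returns an array of shape (len(origins), len(destinations)).
    """
    o = np.radians(np.asarray(origins, dtype=float))[:, None, :]
    d = np.radians(np.asarray(destinations, dtype=float))[None, :, :]
    dlat = d[..., 0] - o[..., 0]
    dlon = d[..., 1] - o[..., 1]
    a = np.sin(dlat / 2) ** 2 + np.cos(o[..., 0]) * np.cos(d[..., 0]) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


# Hypothetical coordinates, for illustration only
residences = [(40.4400, -79.9600), (40.4600, -79.9200)]
stores = [(40.4410, -79.9590)]

distances = pairwise_haversine_miles(residences, stores)
has_access = distances <= 0.5  # True where a store lies within half a mile

print(distances.round(3))
print(has_access)
```

Here the first residence is well under half a mile from the store and the second is more than a mile away, so only the first is flagged as having access.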

Baseline model methodology and output:
1. We find the neighborhood with the highest population experiencing food apartheid according to USDA data: 'Central Oakland'.
2. We process the buildings data, filter to the relevant census tract for Central Oakland, and draw a random commercial building.
3. For the chosen commercial building, we calculate the marginal new access created by placing a grocery store in that building. This is the output of the baseline model.
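
The three steps above can be sketched end to end on toy data. This is an illustration under stated assumptions, not the notebook's implementation: the tract labels, column names (`fi_share`, `new_access`) and all values are hypothetical, and the per-building new-access figures are taken as already computed.

```python
import pandas as pd

# Hypothetical tract-level data (column names and values are illustrative)
tracts = pd.DataFrame({
    "tract": ["A", "B", "C"],
    "population": [1000, 2000, 500],
    "food_insecure": [400, 500, 10],
})
tracts["fi_share"] = tracts["food_insecure"] / tracts["population"]

# Step 1: pick the tract with the highest share of food insecure residents
target_tract = tracts.sort_values("fi_share", ascending=False)["tract"].iloc[0]

# Hypothetical commercial buildings with a precomputed new-access figure
commercial = pd.DataFrame({
    "building_id": ["1-C", "2-C", "3-C", "4-C"],
    "tract": ["A", "A", "B", "C"],
    "new_access": [120.0, 80.5, 300.0, 15.0],
})

# Step 2: draw one commercial building at random within the chosen tract
candidates = commercial[commercial["tract"] == target_tract]
chosen = candidates.sample(n=1, random_state=29).iloc[0]

# Step 3: report the marginal new access for the chosen building
print(f"Tract {target_tract}, building {chosen['building_id']}: "
      f"{chosen['new_access']} people gain access")
```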

In [14]:
# Import libraries
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import haversine as hs
import gurobipy as gp
from gurobipy import GRB

# Helper modules: includes functions to calculate distance and estimate populations
import helper_population_allocation as pa
import helper_distance_calculation as dc

# Avoid printing set copy warnings
import warnings
warnings.filterwarnings("ignore")


### PRE-OPTIMIZATION SETUP

In [15]:
# Get the main buildings dataset 
buildings_df = gpd.read_file('../processed_data/relevant_buildings.shp')

# Create building ID variable
buildings_df.reset_index(drop=True, inplace=True)
buildings_df['building_id'] = buildings_df.index + 1
# Append the building class to the ID (e.g. '106749-C'), using vectorized string operations
buildings_df['building_id'] = buildings_df['building_id'].astype(str) + '-' + buildings_df['CLASS'].astype(str)

In [16]:
# Create arrays to track ordering - map between numpy and dataframe (residential)
res_buildings = buildings_df[buildings_df['class_reco'].str.contains('Residential')]
res_buildings = res_buildings.sort_values('building_id')
res_buildings = dc.get_geocoordinate(res_buildings, 'geometry')

res_buildings_array = np.array(res_buildings['building_id'])    # ith element represents the building id of ith residential building
res_buildings_coordinates_array = np.array(res_buildings['coordinates'])    # ith element represents the coordinates of the ith residential building

In [17]:
# Create arrays to track ordering - map between numpy and dataframe (Commercial)
comm_buildings = buildings_df[buildings_df['class_reco'].str.contains('commercial')]
comm_buildings = comm_buildings.sort_values('building_id')
comm_buildings = dc.get_geocoordinate(comm_buildings, 'geometry')

comm_buildings_array = np.array(comm_buildings['building_id'])  # ith element represents the building id of ith commercial building
comm_buildings_coordinates_array = np.array(comm_buildings['coordinates'])  # ith element represents the coordinates of the ith commercial building


In [18]:
# Create arrays to track ordering (grocery stores)
grocery_stores = buildings_df[buildings_df['class_reco'].str.contains('Grocery')]
grocery_stores = grocery_stores.sort_values('building_id')
grocery_stores = dc.get_geocoordinate(grocery_stores, 'geometry')

grocery_stores_array = np.array(grocery_stores['building_id'])  # ith element represents the building id of ith grocery store
grocery_stores_coordinates_array = np.array(grocery_stores['coordinates'])  # ith element represents the coordinates of the ith grocery store

In [19]:
### Calculate pairwise distances ###

# DO NOT RUN THIS AGAIN
# We have already run this and stored the matrices in processed_data
# This code block takes about 66 minutes (possibly more, depending on the CPU)

# Create parameter matrices (Res comm access matrix - Bij)
# [i,j] value indicates whether residential building i is within access distance of commercial building j
# res_comm_distance_matrix, res_comm_access_matrix = dc.calculate_access(res_buildings_coordinates_array, comm_buildings_coordinates_array)

# # Save file
# np.save('../processed_data/res_comm_distance_matrix', res_comm_distance_matrix)

In [20]:
# Importing Distance Matrix (calculated and stored in previous cell)
res_comm_distance_matrix = np.load('../processed_data/res_comm_distance_matrix.npy')

# Creating a binary access matrix
# [i,j] indicates whether residential building i and commercial building j are within 0.5 miles of each other
# Thresholding directly (rather than overwriting distances in place) ensures a distance of exactly 1.0 is not mistaken for access
res_comm_access_matrix_half_mile = (res_comm_distance_matrix <= 0.5).astype(res_comm_distance_matrix.dtype)

In [21]:
# Create parameter matrices (Res groc access array)
# i-th entry indicates whether i-th residential building currently has access to grocery store within 0.5 miles
res_groc_distance_matrix, res_groc_access_matrix = dc.calculate_access(res_buildings_coordinates_array, grocery_stores_coordinates_array)

# Threshold directly so that a distance of exactly 1.0 is not mistaken for access
res_groc_access_matrix_half_mile = (res_groc_distance_matrix <= 0.5).astype(res_groc_distance_matrix.dtype)

# A residential building has access if at least one grocery store is within 0.5 miles (row-wise maximum)
res_access_array_half_mile = np.amax(res_groc_access_matrix_half_mile, axis=1)
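
As a small illustration of the two operations used above (thresholding a distance matrix at 0.5 miles, then reducing each row to a single access flag), here is a sketch on a toy matrix. The distances are made up for the example.

```python
import numpy as np

# Toy distances in miles: 3 residential buildings (rows) x 2 grocery stores (columns)
distance_miles = np.array([
    [0.3, 1.2],
    [0.8, 0.9],
    [1.0, 0.5],
])

# 1 where the store is within half a mile of the residence, 0 otherwise
access_matrix = (distance_miles <= 0.5).astype(int)

# A residence has existing access if any store in its row is within range
has_existing_access = access_matrix.max(axis=1)

print(access_matrix)
print(has_existing_access)
```

Thresholding into a fresh array leaves the distances untouched and avoids any confusion between a distance of exactly 1.0 mile and the flag value 1.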


In [22]:
# Create parameter matrices (Res Population - Pi)
# ith value indicates the population of the ith residential building
res_population = pa.get_population(buildings_df) 
res_population = res_population.drop_duplicates('building_id') # drop duplicates

res_population = res_population.sort_values('building_id') # Just to be safe
res_population_array = np.array(res_population['population'])


### Baseline model: select building and get results

In [23]:
# Which census tract/neighborhood should we place the store in?

#Loading in USDA data, food insecurity measure 
data = pd.read_csv("../input_data/usda_lowincomelowaccess.csv")
data.describe()

#Create the food insecure feature from LALOWI05_10,
#which is defined as the count of low-income people in each census tract who live more than 0.5 mile from a supermarket
#(.copy() so that adding columns later does not modify a slice of the original data)
foodinsecure = data[["Allegheny_Tracts_GEOID", "USDA_Data_Pop2010", "USDA_Data_LALOWI05_10"]].copy()
print(foodinsecure)

#Review dataset for NaN and drop null values
foodinsecure.isnull().sum()
foodinsecure = foodinsecure.dropna()

#Calculate percentage of individuals identified as food insecure within each tract as a new feature
foodinsecure["FI_percent"] =  foodinsecure["USDA_Data_LALOWI05_10"] / foodinsecure["USDA_Data_Pop2010"]

#Drop infinity values (produced when dividing by a tract population of 0)
foodinsecure = foodinsecure.replace([np.inf], np.nan)
foodinsecure = foodinsecure.dropna()

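#Sort by the new feature in descending order, breaking ties by population
foodinsecure = foodinsecure.sort_values(by=["FI_percent", "USDA_Data_Pop2010"], ascending = False)
#Take the highest percentage among tracts with more than 25 food insecure people,
#then the first (most populous) tract with that percentage
maxP = np.max(foodinsecure[foodinsecure["USDA_Data_LALOWI05_10"] > 25]["FI_percent"])
result = foodinsecure[foodinsecure["FI_percent"] == maxP]["Allegheny_Tracts_GEOID"].iloc[0]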
print("The most food insecure area with the highest population is census tract: ", result)

print("This corresponds to neighborhood Central Oakland")

     Allegheny_Tracts_GEOID  USDA_Data_Pop2010  USDA_Data_LALOWI05_10
0               42003050900             1202.0                   84.0
1               42003070300             2197.0                    NaN
2               42003120700              818.0                  418.0
3               42003140400             2306.0                  239.0
4               42003180700             2011.0                  505.0
..                      ...                ...                    ...
173             42003111300             2639.0                  370.0
174             42003130300             1153.0                  113.0
175             42003141000              928.0                   37.0
176             42003202300             4144.0                  284.0
177             42003320600             2261.0                  496.0

[178 rows x 3 columns]
The most food insecure area with the highest population is census tract:  42003040500
This corresponds to neighborhood Central Oakland


In [24]:
# Get the relevant geoid for Central Oakland that exists in the data
# Note: we can't use the USDA tract IDs directly because they don't align exactly (i.e. the tract above doesn't exist in our data)
# Hence, we take the neighborhood of the tract identified above and use the neighborhood name to find the relevant tract
# in our dataset
oakland_geoid = buildings_df[buildings_df["hood"] == 'Central Oakland']['geoid10'].iloc[0]

# Subset the commercial buildings to central oakland
comm_buildings_geoid = comm_buildings[comm_buildings["geoid10"] == str(oakland_geoid)]

# Pick a random building within oakland
random_build = comm_buildings_geoid.sample(n=1, random_state= 29)['building_id'].iloc[0]

print(f"Chosen commercial building to place store in: {random_build}")

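# Get the corresponding index in the commercial buildings array
# (uses the sampled ID rather than a hard-coded value, so it stays correct if the sample changes)
chosen_building_index = np.where(comm_buildings_array == random_build)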


Chosen commercial building to place store in: 106749-C


In [25]:
# Check how many residential buildings currently lack access (the same subsetting is repeated in the next cell)
existing_access_indices = res_access_array_half_mile.nonzero()[0] # These are indices of residential buildings that currently have access
res_comm_access_matrix_subset = np.delete(res_comm_access_matrix_half_mile, existing_access_indices, axis=0 )

# Remove the same rows from res_population_array so that the ordering matches
res_population_array_sub = np.delete(res_population_array, existing_access_indices, axis=0)

len(res_population_array_sub)

18248

In [26]:
# Calculating new access generated by each commercial building

###########################
# STEP 1: Take the res_comm_access_matrix, remove those rows (each row represents a residential building) which have existing access
###########################
existing_access_indices = res_access_array_half_mile.nonzero()[0] # These are indices of residential buildings that currently have access
res_comm_access_matrix_subset = np.delete(res_comm_access_matrix_half_mile, existing_access_indices, axis=0 )

###########################
# STEP 2: Do the same thing for res_population_array so that the ordering matches
###########################
res_population_array_sub = np.delete(res_population_array, existing_access_indices, axis=0)

###########################
# STEP 3: Do a matrix multiplication between res_population_array_sub and res_comm_access_matrix_subset
###########################

# How this works:
# 1. Reshape res_population_array_sub into a (1 x 18248) 2D array
# 2. res_comm_access_matrix_subset is (18248 x 6895)
# 3. Multiplying 1 by 2 gives a (1 x 6895) array
# 4. Each element of this array is the sum, over residential buildings without existing access, of the building's population
#    multiplied by the 0/1 flag for whether that residential building is within the access region of a particular commercial building.
#    For example, the first element of the result is P0 * (whether res building 0 and comm building 0 are within access)
#    + P1 * (whether res building 1 and comm building 0 are within access), and so on.
#    So each element of the result is the total new population that would gain access if a grocery store were placed in the commercial building at that index

res_population_array_sub = np.reshape(res_population_array_sub, (-1, len(res_population_array_sub)))
new_access_array = np.matmul(res_population_array_sub, res_comm_access_matrix_subset) # The result contains the marginal population access for each commercial building
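
The matrix multiplication in STEP 3 can be checked by hand on a tiny example. The sketch below uses made-up populations and access flags for four residential buildings and two commercial buildings; a boolean mask stands in for the index-based row removal, which is an equivalent way to drop the same rows.

```python
import numpy as np

# Toy inputs: 4 residential buildings, 2 candidate commercial buildings
population = np.array([10.0, 20.0, 30.0, 40.0])
has_existing_access = np.array([0, 1, 0, 0])  # building 1 already has a grocery store nearby
access = np.array([
    [1, 0],
    [1, 1],
    [0, 1],
    [1, 1],
])  # [i, j] = 1 if residence i is within range of commercial building j

# Keep only residences without existing access (same row order in both arrays)
no_access_rows = has_existing_access == 0
population_sub = population[no_access_rows]  # [10, 30, 40]
access_sub = access[no_access_rows]          # rows 0, 2 and 3

# Population-weighted column sums: people newly served by each candidate building
new_access = population_sub @ access_sub

print(new_access)  # building 0: 10 + 40 = 50, building 1: 30 + 40 = 70
```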

In [27]:
# Look up the building chosen by the baseline method in this array to get the final result
new_people_given_access = new_access_array[0,chosen_building_index][0][0]

print(f"The baseline model gives access to {new_people_given_access} new people.")

The baseline model gives access to 1976.5699999999913 new people.
