#### Introduction

The purpose of this notebook, along with 01_data_setup_example.ipynb and 03_demo_explore_results.ipynb, is to provide a demo of how you may want to use the popexposure package functions.

Please see 01_data_setup_example.ipynb before you work through this notebook!

#### Outline

To recap, our goal was to demo all the options available in popexposure. In this notebook we'll do five separate things, which align with the five options available in the package. 

We'll start with:

0. Setup

and then work through:

1. Find the total number of people residing within 10 km of one or more California wildfire
disasters in 2016, 2017, and 2018.
2. Find the total number of people residing within 10 km of each unique California wildfire
disaster in 2016, 2017, and 2018.
3. Find the total number of people residing within 10 km of one or more California wildfire
disasters in 2016, 2017, and 2018 by 2020 ZCTA.
4. Find the total number of people residing within 10 km of each unique California wildfire
disaster in 2016, 2017, and 2018 by 2020 ZCTA.
5. Find the population of all 2020 ZCTAs. 

In the last notebook, we prepared the wildfire disaster exposure data and ZCTA data to pass to the popexposure functions so we could complete these computations. Here, we'll complete each of them in this order. 

#### 0. Setup

We need to import some libraries and also install and import popexposure. If you haven't installed popexposure in the environment you're working in now, go ahead and activate that environment, and pip install popexposure in the terminal. popexposure is included in the pop_exp environment for this demo. We can then import the functions within popexposure. 
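Before running the imports, it can help to confirm that the package is visible to the kernel you are using. The following is a minimal sketch using only the standard library; the helper name `is_installed` is ours and is not part of popexposure.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if `package_name` can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


# A standard-library module is always found; a made-up name is not.
print(is_installed("json"))
print(is_installed("surely_not_a_real_package_xyz"))
# For this demo you would check: is_installed("popexposure")
```

If the check for popexposure comes back False, the kernel is probably not running in the environment where you installed the package.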

In [1]:
# We start by importing necessary libraries.
import pathlib
import sys
import glob
import pandas as pd
import geopandas as gpd
# Here's the popexposure import 
from popexposure import PopEstimator 

We'll also set some paths to make it easy to access the data we cleaned for 
this demo. 

To find the number of people affected by one or more CA wildfire disasters in each year from 2016 to 2018, both overall and by ZCTA, we need the paths to the wildfire files we made in the data setup notebook.

The glob pattern below selects all the files in the interim data directory that have 'fire' in the name.

In [2]:
# Paths 
base_path = pathlib.Path.cwd().parent
data_dir = base_path / "demo_data"
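# Wildfire paths (glob pattern). Sorted because the loops below assume the
# files are in year order (2016, 2017, 2018); check this holds for your file names.
wildfire_paths = sorted(glob.glob(str(data_dir / "02_interim_data" / "*fire*")))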
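
The loops later in this notebook pick files by position and treat position 0 as 2016, so the results depend on the order of this list. A more explicit option is to key each path by the year in its file name. The sketch below uses hypothetical file names (wildfire_2016.parquet and so on, created in a temporary directory); the real names come from the data setup notebook and may differ, and the helper `paths_by_year` is ours.

```python
import glob
import pathlib
import re
import tempfile


def paths_by_year(pattern: str) -> dict[int, str]:
    """Map the first four-digit year in each matching file name to its path."""
    out = {}
    for path in sorted(glob.glob(pattern)):
        match = re.search(r"(19|20)\d{2}", pathlib.Path(path).name)
        if match:
            out[int(match.group(0))] = path
    return out


# Hypothetical file names, created only to make the example runnable.
with tempfile.TemporaryDirectory() as tmp:
    for year in (2018, 2016, 2017):
        (pathlib.Path(tmp) / f"wildfire_{year}.parquet").touch()
    (pathlib.Path(tmp) / "zcta.parquet").touch()  # not matched by *fire*

    found = paths_by_year(str(pathlib.Path(tmp) / "*fire*"))
    years_found = sorted(found)
    print(years_found)
```

With a mapping like this, a loop can iterate over the years directly and look up each path, instead of relying on list position.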

We also need the path to the population raster we're using, and the ZCTA file.

In [3]:
# GHSL pop raster
ghsl_path = data_dir / "01_raw_data" / "GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0_R5_C8" / "GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0_R5_C8.tif"

# ZCTA path 
zcta_path = glob.glob(str(data_dir / "02_interim_data" / "*zcta*"))[0]
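
Indexing a glob result with [0] raises a bare IndexError when nothing matches, which can be hard to diagnose. Here is a small sketch of a helper that fails with a clearer message. The name `first_match` and the file name used below are hypothetical.

```python
import glob
import pathlib
import tempfile


def first_match(directory: pathlib.Path, pattern: str) -> str:
    """Return the first file in `directory` matching `pattern`, or raise a clear error."""
    matches = sorted(glob.glob(str(directory / pattern)))
    if not matches:
        raise FileNotFoundError(f"No file matching {pattern!r} in {directory}")
    return matches[0]


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = pathlib.Path(tmp)
    try:
        first_match(tmp_dir, "*zcta*")
    except FileNotFoundError as err:
        missing_message = str(err)  # raised because the directory is empty
    (tmp_dir / "ca_zcta_2020.parquet").touch()  # hypothetical file name
    found_name = pathlib.Path(first_match(tmp_dir, "*zcta*")).name
    print(found_name)
```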

We're now set up to run the five cases we're interested in. 

#### 1. Find the total number of people residing within 10 km of one or more California wildfire disasters in 2016, 2017, and 2018.

Our first goal was to find the total number of people residing within 10 km of one or more California wildfire disasters in 2016, 2017, and 2018.

To do this, we create a PopEstimator with our population raster and then call est_exposed_pop with the parameter hazard_specific = False.

Because we're looping over three years, we'll initialize an empty list first, and then store the results in this list. We're also adding a year variable to the result as we go. In total, this takes around 5 seconds.


In [4]:
est = PopEstimator(pop_data=ghsl_path)

num_exposed_list = []
start_year = 2016

for i in range(0, 3):
    num_exposed = est.est_exposed_pop(hazard_data=wildfire_paths[i], hazard_specific=False)
    num_exposed['year'] = start_year + i
    num_exposed_list.append(num_exposed)

Because we added a year variable to each output, we can concatenate these dataframes together, and then look at the output.

In [5]:
# Join those dataframes together. 
num_exposed_df = pd.concat(num_exposed_list, axis=0)
num_exposed_df.head()

Unnamed: 0,ID_hazard,exposed_10,year
0,merged_geoms,1891751,2016
0,merged_geoms,5296487,2017
0,merged_geoms,2091309,2018


Our output has three columns: ID_hazard, exposed_10, and year. 
We added year, but the other two are output from est_exposed_pop.
exposed_10 is called that because we named our buffer_dist column buffer_dist_10, so that suffix got carried through to our results.
Because we ran the function with hazard_specific = False, our ID_hazard column has changed. It now says 'merged_geoms', which means that we got one number representing the count of everyone exposed to one or more wildfire disasters in each year, so the IDs are no longer wildfire IDs.
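
The column naming rule described above can be sketched in a few lines. This only illustrates the convention as this notebook describes it (a buffer_dist_10 column producing an exposed_10 column). It is not popexposure's own code, and the helper name is hypothetical.

```python
def exposed_column_name(buffer_column: str) -> str:
    """Derive the result column name from a buffer distance column name."""
    prefix = "buffer_dist_"
    if not buffer_column.startswith(prefix):
        raise ValueError(f"expected a column starting with {prefix!r}")
    # Keep whatever follows the prefix and attach it to "exposed_".
    return "exposed_" + buffer_column[len(prefix):]


print(exposed_column_name("buffer_dist_10"))
```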

We wanted to count the number of people residing within 10 km of *any* California wildfire disaster. There are some people who live within 10 km of two or more wildfire disasters. Because we just wanted the total, we did not want to double count those people. When computing a total, rather than the number of people affected by each unique hazard, est_exposed_pop takes the unary union of any buffered hazards that are overlapping, and finds the total of everyone residing within that area.
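
To see why a union avoids double counting, here is a toy version of the idea on a tiny population grid. It is a sketch under simplifying assumptions (made-up cell counts, hazards as points, straight-line distance in grid units), not how popexposure itself is implemented. Every cell within the buffer of any hazard goes into one set, so a cell covered by two buffers contributes once.

```python
import math

# Made-up population counts on a 4x4 grid; cell (row, col) -> people.
population = {(r, c): 10 for r in range(4) for c in range(4)}

# Two hypothetical hazards, represented as points, and a buffer radius in cells.
hazards = {"fire_a": (1, 1), "fire_b": (1, 2)}
buffer_radius = 1.0


def cells_within(center, radius):
    """Return the set of grid cells lying within `radius` of `center`."""
    return {cell for cell in population if math.dist(cell, center) <= radius}


# Union of all buffered areas: overlapping cells appear once in the set.
merged_cells = set().union(*(cells_within(p, buffer_radius) for p in hazards.values()))
merged_total = sum(population[cell] for cell in merged_cells)
print(len(merged_cells), merged_total)
```

Each buffer covers 5 cells, but the two buffers share 2 cells, so the merged area has 8 cells and 80 people rather than 10 cells and 100 people.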

We can save the output.

In [6]:
num_exposed_df.to_parquet(data_dir / "03_results" / "num_people_affected_by_wildfire.parquet")

#### 2. Find the total number of people residing within 10 km of each unique California wildfire disaster in 2016, 2017, and 2018.

We also wanted to find the total number of people residing within 10 km of each unique California wildfire disaster in 2016, 2017, and 2018.

To do this, we also need to use est_exposed_pop, with all the same arguments except for hazard_specific. In this case, we set hazard_specific to True. This means that we will count the number of people within 10km of each wildfire disaster boundary, regardless of whether two or more exposed areas overlap. 

In [7]:
num_exposed_list = []
for i in range(0, 3):
    num_exposed = est.est_exposed_pop(hazard_data=wildfire_paths[i], hazard_specific=True)
    num_exposed['year'] = start_year + i
    num_exposed_list.append(num_exposed)

num_exposed_unique = pd.concat(num_exposed_list, axis=0)
num_exposed_unique.head()

Unnamed: 0,ID_hazard,exposed_10,year
0,2016-06-23_ERSKINE_CA_KERN_4254,14909,2016
1,2016-06-24_MARINA_CA_MONO_4255,451,2016
2,2016-06-19_BORDER 3_CA_SAN DIEGO_4256,0,2016
3,2016-06-15_SHERPA_CA_SANTA BARBARA_4287,5095,2016
4,2016-07-08_FIDDLER_CA_SHASTA_4288,257,2016


Again, our output has three columns: ID_hazard, exposed_10, and year. 

This time, the ID_hazard column is the same as the one we passed to this
function. If people lived within 10 km of more than one fire, they
are counted in the total for each of those fires. This means that est_exposed_pop
returns a dataframe with one row per hazard ID, and the same people may
be counted two, three, or more times across rows.
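
One practical consequence is that summing the per-hazard rows does not give the number of distinct people exposed. The toy sketch below uses made-up sets of resident IDs, one per hypothetical fire, to show the gap between the two quantities.

```python
# Made-up resident IDs within 10 km of each hypothetical fire.
exposed_by_fire = {
    "fire_a": {1, 2, 3, 4},
    "fire_b": {3, 4, 5},
    "fire_c": {6},
}

# Per-hazard counts: one number per fire, overlaps counted for each fire.
per_fire_counts = {fire: len(people) for fire, people in exposed_by_fire.items()}

# Distinct people exposed to at least one fire.
distinct_exposed = len(set().union(*exposed_by_fire.values()))

sum_of_rows = sum(per_fire_counts.values())
print(per_fire_counts)
print(sum_of_rows, distinct_exposed)
```

Residents 3 and 4 are counted for both fire_a and fire_b, so the row sum (8) exceeds the distinct count (6). When you need a total, the hazard_specific = False output is the one to use.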

We can save the output.

In [8]:
num_exposed_unique.to_parquet(data_dir / "03_results" / "num_aff_by_unique_wildfire.parquet")

There were two more ways we wanted to define exposure. 

3. Find the total number of people residing within 10 km of one or more California wildfire
disasters in 2016, 2017, and 2018 by 2020 ZCTA. 
4. Find the total number of people residing within 10 km of each unique California wildfire
disaster in 2016, 2017, and 2018 by 2020 ZCTA.

These are analogous to the two quantities we just computed, but this time, we want to know these exposures by ZCTA. 

To do this, we need to run est_exposed_pop again but this time with additional administrative geographies: ZCTAs.
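
Conceptually, adding administrative units splits each exposed count by the unit people live in. Below is a toy sketch with made-up cell populations and hypothetical ZCTA labels. It illustrates the grouping only and is not popexposure's implementation.

```python
from collections import defaultdict

# Made-up grid cells, each tagged with a hypothetical ZCTA and a flag saying
# whether the cell falls inside a hazard buffer.
cells = [
    {"zcta": "00001", "people": 120, "in_buffer": True},
    {"zcta": "00001", "people": 80, "in_buffer": False},
    {"zcta": "00002", "people": 50, "in_buffer": True},
    {"zcta": "00002", "people": 30, "in_buffer": True},
    {"zcta": "00003", "people": 200, "in_buffer": False},
]

# Sum the people in exposed cells, grouped by ZCTA.
exposed_by_zcta = defaultdict(int)
for cell in cells:
    if cell["in_buffer"]:
        exposed_by_zcta[cell["zcta"]] += cell["people"]

print(dict(exposed_by_zcta))
```

In this toy version, a ZCTA with no exposed cells simply does not appear in the result.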


#### 3. 
To do "3. Find the total number of people residing within 10 km of one or more California wildfire disasters in 2016, 2017, and 2018 by 2020 ZCTA", we'll run est_exposed_pop with hazard_specific = False, and we'll pass the 2020 ZCTA file to PopEstimator through the optional argument 'admin_data'.

We only need to supply the ZCTA data once. We pass it when we create the estimator, and then reuse that estimator in every iteration of the loop over years 2016-2018.


In [9]:
est = PopEstimator(pop_data=ghsl_path, admin_data=zcta_path)

num_exposed_zcta_list = []
for i in range(0, 3):
    num_exposed_zcta = est.est_exposed_pop(hazard_data=wildfire_paths[i], hazard_specific=False)
    num_exposed_zcta['year'] = start_year + i
    num_exposed_zcta_list.append(num_exposed_zcta)

This computation took about 9 seconds - a little bit longer than when we weren't looking for ZCTA-specific estimates. 

In [10]:
# putting all years into one dataframe
num_affected_zcta_df = pd.concat(num_exposed_zcta_list, axis=0)
num_affected_zcta_df.head()

Unnamed: 0,ID_hazard,ID_admin_unit,exposed_10,year
0,merged_geoms,91301,8394,2016
1,merged_geoms,91352,1349,2016
2,merged_geoms,91731,1710,2016
3,merged_geoms,91739,1631,2016
4,merged_geoms,91791,13943,2016


In [11]:
# And we can save 
num_affected_zcta_df.to_parquet(data_dir / "03_results" / "num_people_affected_by_wildfire_by_zcta.parquet")
num_affected_zcta_df.head()

Unnamed: 0,ID_hazard,ID_admin_unit,exposed_10,year
0,merged_geoms,91301,8394,2016
1,merged_geoms,91352,1349,2016
2,merged_geoms,91731,1710,2016
3,merged_geoms,91739,1631,2016
4,merged_geoms,91791,13943,2016


#### 4. 
For our final case of counting exposed people, we wanted to find the number of people living near each unique hazard by each ZCTA. For this, we need to use est_exposed_pop in the same way that we just did, but with hazard_specific = True. The estimator we created in the previous step already has the ZCTA data, so we can reuse it.

In [12]:
num_exposed_zcta_unique_list = []
for i in range(0, 3):
    num_exposed_zcta_unique = est.est_exposed_pop(hazard_data=wildfire_paths[i], hazard_specific=True)
    num_exposed_zcta_unique['year'] = start_year + i
    num_exposed_zcta_unique_list.append(num_exposed_zcta_unique)

# all years in one dataframe
num_exposed_df_zcta_unique = pd.concat(num_exposed_zcta_unique_list, axis=0)

# and we can save
num_exposed_df_zcta_unique.to_parquet(data_dir / "03_results" / "num_people_affected_by_wildfire_zcta_unique.parquet")

In [13]:
num_exposed_df_zcta_unique

Unnamed: 0,ID_hazard,ID_admin_unit,exposed_10,year
0,2016-06-23_ERSKINE_CA_KERN_4254,93205,2414,2016
1,2016-06-23_ERSKINE_CA_KERN_4254,93255,503,2016
2,2016-06-23_ERSKINE_CA_KERN_4254,93283,2442,2016
3,2016-06-23_ERSKINE_CA_KERN_4254,93518,24,2016
4,2016-06-23_ERSKINE_CA_KERN_4254,93238,238,2016
...,...,...,...,...
301,2018-08-09_HOLY_CA_RIVERSIDE_5594,92679,22713,2018
302,2018-08-09_HOLY_CA_RIVERSIDE_5594,92881,13864,2018
303,2018-08-09_HOLY_CA_RIVERSIDE_5594,92883,33243,2018
304,2018-08-09_HOLY_CA_RIVERSIDE_5594,92503,0,2018


We will explore some of the output from these runs in the next section of the demo.  

#### 5. Find the population of all 2020 ZCTAs. 

Finally, let's use the function est_total_pop to get some denominators for our dataset. This function uses the same gridded population data we used for the hazard buffers to find the number of people residing in each ZCTA. This is useful if we think the gridded population dataset is a big improvement over other population counts for our admin units, or if we just want to be consistent.

To call this function, all we need is the estimator we already created with the ZCTA data:

In [None]:
num_residing_by_zcta = est.est_total_pop()
num_residing_by_zcta.to_parquet(data_dir / "03_results" / "num_people_residing_by_zcta.parquet")
num_residing_by_zcta.head()


Unnamed: 0,ID_admin_unit,population
0,95641,2279
1,95680,162
2,95919,1560
3,95920,289
4,95930,209


This time, we have a column for admin unit and a column for the number of people living in that admin unit. 
Please continue to part 3 of this demo to explore the output of these functions! 
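
As a preview of how these denominators might be used, the sketch below divides exposed counts by ZCTA population to get a proportion exposed. The numbers and ZCTA labels are made up for illustration, and the helper name is ours. The column names in the comments follow the outputs shown above.

```python
def proportion_exposed(exposed, population):
    """Return {admin_unit: exposed / population}, skipping units with zero population."""
    out = {}
    for unit, pop in population.items():
        if pop > 0:
            # Units missing from `exposed` are treated as having no exposed people.
            out[unit] = exposed.get(unit, 0) / pop
    return out


# Made-up values: ID_admin_unit -> exposed_10, and ID_admin_unit -> population.
exposed_10 = {"00001": 250, "00002": 0}
population = {"00001": 1000, "00002": 400, "00003": 0}

shares = proportion_exposed(exposed_10, population)
print(shares)
```

Skipping zero-population units avoids a division by zero. Whether to drop them or report them as missing is a choice that depends on the analysis.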


In [17]:
# Sanity check: the result should have one row per ZCTA in the admin data
num_residing_by_zcta.shape

(1802, 2)

In [18]:
est.admin_data.shape

(1802, 3)