#### Introduction

The purpose of this notebook, along with 01_data_setup_example.ipynb and 03_demo_explore_results.ipynb, is to provide a tutorial on how you might use the pop_exp package functions.

Please see 01_data_setup_example.ipynb before you work through this notebook!

#### Goals

To recap: to demo all the functions available in pop_exp, this notebook works
through five separate tasks, which correspond to the five options available across the package's three functions. 

1. Find the total number of people residing within 10km of *any* US wildfire 
disaster in 2016, 2017, and 2018. 
2. Find the total number of people residing within 10 km of *EACH* US wildfire
disaster in 2016, 2017, and 2018.
3. Find the total number of people residing within 10km of *any* US wildfire 
disaster in 2016, 2017, and 2018 by 2020 ZCTA. 
4. Find the total number of people residing within 10 km of *EACH* US wildfire
disaster in 2016, 2017, and 2018 by 2020 ZCTA.
5. Find the population of all 2020 ZCTAs. 

In the last notebook, we prepared the wildfire disaster exposure data and ZCTA
data to pass to the pop_exp functions so we could complete these computations. 

#### Let's do it

We need to import some libraries, and also install and import pop_exp. If you haven't installed pop_exp in the environment you're working in, activate that environment and run pip install pop_exp in the terminal. We can then import the functions from pop_exp. 

In [2]:
# We start by importing necessary libraries.
import pathlib
import sys
import matplotlib.pyplot as plt
import glob
import pandas as pd
from pop_exp import pop_ex_helpers as px

We'll also set some paths to make it easy to access the data we cleaned for 
this tutorial. 

To find the number of people affected by one or more wildfire disaster by year 2016-2018, and by ZCTA, we need to get the paths to each of our wildfire files that we made in the data setup notebook.

The glob pattern below selects all the files in the interim data directory that have 'fire' in the name. 

In [3]:
# Paths 
base_path = pathlib.Path.cwd().parent
data_dir = base_path / "demo_data"
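# Wildfire paths, matched with a glob pattern. We sort them so the order is
# predictable (this assumes the file names sort chronologically by year).
wildfire_paths = sorted(glob.glob(str(data_dir / "02_interim_data" / "*fire*")))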
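
As a side note, `glob.glob` matches shell-style wildcard patterns and does not guarantee the order of its results. The minimal sketch below uses a throwaway directory and made-up file names (assumptions for illustration, not the tutorial's actual files) to show what `*fire*` matches and how sorting makes the order predictable.

```python
import glob
import pathlib
import tempfile

# Build a temporary directory with made-up file names (hypothetical,
# not the real files from the data setup notebook).
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = pathlib.Path(tmp)
    for name in ["wildfire_2017.parquet", "wildfire_2016.parquet", "zcta.parquet"]:
        (tmp_dir / name).touch()

    # "*fire*" matches any file name that contains "fire".
    matches = sorted(glob.glob(str(tmp_dir / "*fire*")))
    matched_names = [pathlib.Path(m).name for m in matches]

print(matched_names)
```

Sorting matters later in this notebook because the year is assigned from each file's position in the list.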

We also need the path to the population raster we're using, and the ZCTA file.

In [4]:
# GHSL pop raster
ghsl_path = data_dir / "01_raw_data" / "GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0.tif"

# ZCTA path 
zcta_path = glob.glob(str(data_dir / "02_interim_data" / "*zcta*"))[0]

We're now set up to run the five cases we're interested in. 

Our first goal was:
1. Find the total number of people residing within 10km of *any* US wildfire 
disaster in 2016, 2017, and 2018. 

To do this, we can run find_num_people_affected with the parameter 
by_unique_hazard = False. 

Because we're looping over three years, we'll initialize an empty list first, 
and then store the results in this list. We're also adding a year variable to 
the result as we go. In total, this takes around 24 seconds. 


In [5]:
num_affected_list = []

for i in range(3):  # one iteration per year: 2016, 2017, 2018
    num_affected = px.find_num_people_affected(
        path_to_hazards=wildfire_paths[i],
        raster_path=ghsl_path,
        by_unique_hazard=False # setting by unique hazard to false 
    )
    num_affected['year'] = 2016 + i
    num_affected_list.append(num_affected)

Running the function
Reading data and finding best UTM projection for hazard geometries (1/4)



  ch_shp["centroid_lon"] = ch_shp.centroid.x

  ch_shp["centroid_lat"] = ch_shp.centroid.y

Because we added a year variable to each output, we can concatenate these dataframes, and then look at the output. 

In [None]:
# Ok, now we'll join those dataframes together. 
num_affected_df = pd.concat(num_affected_list, axis=0)
num_affected_df.head()

Our output has three columns: ID_hazard, num_people_affected, and year. 
We added year, but the other two are output from find_num_people_affected.

To find the total number of people affected by any wildfire disaster by year, 
we could group this output dataframe by year and sum. But, we're about to do two more interesting examples, so we'll leave this output as is, and dive into the output from the later examples in the third section of this tutorial. 
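
For reference, the group-and-sum step mentioned above might look like the following minimal sketch. The column names come from the output described above, but the rows are invented for illustration and are not real results.

```python
import pandas as pd

# Made-up rows shaped like the output described above (values are invented).
toy_df = pd.DataFrame(
    {
        "ID_hazard": ["a", "b_c", "d"],
        "num_people_affected": [100, 250, 40],
        "year": [2016, 2016, 2017],
    }
)

# Sum the people affected across hazard groups within each year.
total_by_year = (
    toy_df.groupby("year", as_index=False)["num_people_affected"].sum()
)
print(total_by_year)
```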

In our output, one thing has changed: our ID_hazard column is not the same as 
the ID_hazard column that we started with. 

We wanted to count the number of people residing within 10km of *any* 
US wildfire disaster. There are some people who live within 10km of two or
more wildfire disasters. Because we just wanted the total, we did not want to 
double count those people. When computing a total, rather than the number of 
people affected by each unique hazard, find_num_people_affected takes the unary 
union of any buffered hazards that are overlapping, and finds the total of 
everyone residing within that area. In the output, the hazard IDs of any 
overlapping hazards are concatenated. This avoids double counting, while still 
giving the user as much information about how many people lived near each 
hazard or group of hazards as possible. 
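
To make the double-counting point concrete, here is a toy sketch in plain Python. It is not how the package is implemented: it stands in for the geometry with sets of grid cells, and all names and numbers are invented. It only shows why taking the union of overlapping areas first gives a smaller total than adding up each hazard separately.

```python
# Toy illustration: each buffered hazard covers a set of grid cells, and each
# cell holds some number of residents. All values are invented.
cell_population = {"c1": 10, "c2": 20, "c3": 30, "c4": 5}
hazard_cells = {
    "fire_A": {"c1", "c2"},
    "fire_B": {"c2", "c3"},  # overlaps fire_A in cell c2
}

# Counting each hazard separately counts the residents of c2 twice.
per_hazard = {
    hazard: sum(cell_population[c] for c in cells)
    for hazard, cells in hazard_cells.items()
}
naive_total = sum(per_hazard.values())

# Taking the union of the overlapping areas first counts c2 only once.
union_cells = set().union(*hazard_cells.values())
union_total = sum(cell_population[c] for c in union_cells)

print(per_hazard, naive_total, union_total)
```

The gap between the two totals is exactly the population of the overlapping cell.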

We can save the output.

In [None]:
# Save the combined dataframe (all years), not just the last year's result.
num_affected_df.to_csv(data_dir / "03_results" / "num_people_affected_by_wildfire.csv")

We also wanted to find the total number of people residing within 10 km of 
*EACH* US wildfire disaster in 2016, 2017, and 2018. 

To do this, we also need to use find_num_people_affected, with all the same 
arguments except for by_unique_hazard. In this case, we set by_unique_hazard to 
True. This means that we will count the number of people within 10km of each 
wildfire disaster boundary, regardless of whether two or more exposed areas 
overlap. 

In [None]:
# NOTE: NEED to suppress those warnings about centroids somehow, so that the 
# output is nicer.

In [None]:
num_affected_list_unique_hazard = []

for i in range(3):  # one iteration per year: 2016, 2017, 2018
    num_affected_unique = px.find_num_people_affected(
        path_to_hazards=wildfire_paths[i],
        raster_path=ghsl_path,
        by_unique_hazard=True # setting by unique hazard to true 
    )
    num_affected_unique['year'] = 2016 + i
    num_affected_list_unique_hazard.append(num_affected_unique)

num_affected_unique = pd.concat(num_affected_list_unique_hazard, axis=0)
num_affected_unique.head()

Again, our output has three columns: ID_hazard, num_people_affected, and year. 

This time, the ID_hazard column is the same as the one we passed to the
function. If people lived within 10 km of two or more fires, they are counted 
in the total for each of those fires. This means some people may be counted 
two, three, or more times across rows. 

We can save the output.

In [None]:
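# Note: the exposure results in this notebook are all saved under this same
# file name, so each save overwrites the last. Use a different name to keep each one.
num_affected_unique.to_csv(data_dir / "03_results" / "num_people_affected_by_wildfire.csv")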

There were two more ways we wanted to define exposure. These are analogous to 
the two quantities we just computed, but this time, we want to know these
exposures by ZCTA. 

3. Find the total number of people residing within 10km of *any* US wildfire 
disaster in 2016, 2017, and 2018 by 2020 ZCTA. 
4. Find the total number of people residing within 10 km of *EACH* US wildfire
disaster in 2016, 2017, and 2018 by 2020 ZCTA.

To do this, we need to run find_num_people_affected_by_geo. 

First, we'll find the total number of people affected by ZCTA 
(by_unique_hazard = False).

This time, because we need to read in the large ZCTA file, the run will take slightly longer. It should take about 3 minutes, most of which is spent reading in the ZCTA file. If you're running these functions on very large datasets, we recommend parallelizing over space or time. 
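
One way to parallelize over time is to run one job per year with Python's standard `concurrent.futures` module. The sketch below shows only the pattern: `count_for_year` is a hypothetical placeholder, not a pop_exp function, and the paths are made up. Whether the package's functions run faster this way is an assumption we have not tested here.

```python
from concurrent.futures import ThreadPoolExecutor


def count_for_year(job):
    """Stand-in for one year's computation (hypothetical placeholder).

    In practice, a single call to the exposure function for one hazard
    file would go here.
    """
    year, hazard_path = job
    return {"year": year, "path": hazard_path}


# Made-up paths, one per year.
jobs = [(2016, "fires_2016"), (2017, "fires_2017"), (2018, "fires_2018")]

# map() returns results in the same order as the input jobs.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(count_for_year, jobs))

print([r["year"] for r in results])
```

For CPU-heavy work, `ProcessPoolExecutor` is a common alternative with the same interface, at the cost of each process loading its own copy of the data.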

In [None]:
num_affected_list = []

for i in range(3):  # one iteration per year: 2016, 2017, 2018
    num_affected = px.find_num_people_affected_by_geo(
        path_to_hazards=wildfire_paths[i],
        path_to_additional_geos=zcta_path,
        raster_path=ghsl_path,
        by_unique_hazard=False # setting by unique hazard to false 
    )
    num_affected['year'] = 2016 + i
    num_affected_list.append(num_affected)




As in the case where we were not computing by ZCTA, the results come with 
hazard ids for groups of overlapping hazards.

Since we're interested in the total number of people affected by ZCTA, we'll 
group the output dataframe by ZCTA and year and sum over any hazard IDs.

In [None]:
# putting all years into one dataframe
num_affected_df = pd.concat(num_affected_list, axis=0)

In [None]:
agg_num_af = num_affected_df.groupby(["ID_spatial_unit", "year"]).agg(
     {"num_people_affected": "sum"}
).reset_index()

# And we can save 
agg_num_af.to_csv(data_dir / "03_results" / "num_people_affected_by_wildfire.csv")

For our final case of counting exposed people, in which we aim to find the number of people affected by each hazard by each ZCTA, we do the same as we just did, but with by_unique_hazard = True. 

In [None]:
num_affected_list = []

for i in range(3):  # one iteration per year: 2016, 2017, 2018
    num_affected = px.find_num_people_affected_by_geo(
        path_to_hazards=wildfire_paths[i],
        path_to_additional_geos=zcta_path,
        raster_path=ghsl_path,
        by_unique_hazard=True # setting by unique hazard to true 
    )
    num_affected['year'] = 2016 + i
    num_affected_list.append(num_affected)

# all years in one dataframe
num_affected_df = pd.concat(num_affected_list, axis=0)
# and we can save the by-hazard, by-ZCTA results we just combined
num_affected_df.to_csv(data_dir / "03_results" / "num_people_affected_by_wildfire.csv")

We will explore some of the output from these runs in the next section of the tutorial.  

Finally, let's use the function find_number_of_people_residing_by_geo to get some denominators for our dataset. This function can help us use the gridded population data we used to find the number of people residing within the hazard buffers to also find the number of people residing in each ZCTA. This is useful if we're using a gridded population dataset that we think is a big improvement over other population counts in our additional spatial units, or we just want to be consistent. 

To call this function, all we need to do is use the same paths we've used previously:

In [None]:
num_residing_by_zcta = px.find_number_of_people_residing_by_geo(
    path_to_additional_geos=zcta_path,
    raster_path=ghsl_path
)

num_residing_by_zcta.to_csv(data_dir / "03_results" / "num_people_residing_by_zcta.csv")
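
As a rough sketch of how these denominators might be used, the example below joins exposure counts to ZCTA populations and computes the proportion of each ZCTA's residents who were exposed. All values are invented. The column name `num_people_residing` and the presence of `ID_spatial_unit` in the denominator output are assumptions, so check the actual column names in your output before adapting this.

```python
import pandas as pd

# Invented exposure counts by ZCTA and year.
affected = pd.DataFrame(
    {
        "ID_spatial_unit": ["00001", "00002"],
        "year": [2016, 2016],
        "num_people_affected": [50, 10],
    }
)

# Invented denominators; "num_people_residing" is a hypothetical column name.
residing = pd.DataFrame(
    {
        "ID_spatial_unit": ["00001", "00002", "00003"],
        "num_people_residing": [200, 100, 400],
    }
)

# Attach each ZCTA's population to its exposure rows, then divide.
merged = affected.merge(residing, on="ID_spatial_unit", how="left")
merged["prop_affected"] = (
    merged["num_people_affected"] / merged["num_people_residing"]
)
print(merged)
```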

Please continue to part 3 of this tutorial to explore the output of these functions! 