INTRODUCTION:

The purpose of this notebook, along with 01_data_setup_example.ipynb, is to 
provide a tutorial on how you might use the pop_exp package functions.

Please see 01_data_setup_example.ipynb before you work through this notebook!

GOALS:

To recap: to demo all the functions available in pop_exp, we'll use them in 
this notebook to do five separate things, which align with the five options 
available in the package. 

1. Find the total number of people residing within 10km of *any* US wildfire 
disaster in 2016, 2017, and 2018. 
2. Find the total number of people residing within 10 km of *EACH* US wildfire
disaster in 2016, 2017, and 2018.
3. Find the total number of people residing within 10km of *any* US wildfire 
disaster in 2016, 2017, and 2018 by 2020 ZCTA. 
4. Find the total number of people residing within 10 km of *EACH* US wildfire
disaster in 2016, 2017, and 2018 by 2020 ZCTA.
5. Find the population of all 2020 ZCTAs. 

In the last notebook, we prepared the wildfire disaster exposure data and ZCTA
data to pass to the pop_exp functions so we could complete these computations. 

LET'S DO IT:

We need to import some libraries and also install and import pop_exp. 

In [None]:
# We start by importing necessary libraries.
import pathlib
import sys
import matplotlib.pyplot as plt
import glob
import pandas as pd
#import pop_exp 

In [None]:
# PLACEHOLDER IMPORT STATEMENT 
code_path = '/Users/heathermcbrien/Documents/Documents/GitHub.nosync/casey_lab_shared_functions/pop_ex/code'
# these statements together import all the functions defined in the Pop Ex 
# module.
sys.path.append(str(code_path))
from pop_ex_helpers import *

In [None]:
print(code_path)

We'll also set some paths to make it easy to access the data we cleaned for 
this tutorial. 

To find the number of people affected by one or more wildfire disasters 
by ZCTA for each year 2016-2018, we need the paths to each of the wildfire 
files that we made in the data setup notebook.

The glob pattern below lists all the files in the interim data directory. 
After sorting, we select the first three files, which are the wildfire 
disaster data files.

In [None]:
# Paths 
base_path = pathlib.Path.cwd().parent
data_dir = base_path / "demo_data"
# wildfire paths: glob the interim directory, sort, and keep the first three files
wildfire_paths = sorted(glob.glob(str(data_dir / "02_interim_data" /  "*")))[0:3]
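The selection above depends on the alphabetical order of whatever sits in the interim directory, so it can be worth checking what the glob returned before running anything slow. Here is a minimal sketch of the sort-and-slice pattern. It uses a throwaway temporary directory and hypothetical file names; the real names come from the data setup notebook.

```python
import glob
import pathlib
import tempfile

# Hypothetical file names, used only to illustrate the sort-and-slice pattern.
example_names = [
    "wildfire_2016.parquet",
    "wildfire_2017.parquet",
    "wildfire_2018.parquet",
    "zcta_2020.parquet",
]

with tempfile.TemporaryDirectory() as tmp:
    interim_dir = pathlib.Path(tmp) / "02_interim_data"
    interim_dir.mkdir()
    for name in example_names:
        (interim_dir / name).touch()

    # sorted() makes the order deterministic; glob alone does not guarantee one.
    all_paths = sorted(glob.glob(str(interim_dir / "*")))
    example_wildfire_paths = all_paths[0:3]  # first three files
    example_zcta_path = all_paths[3]         # fourth file

    selected_names = [pathlib.Path(p).name for p in example_wildfire_paths]
    zcta_name = pathlib.Path(example_zcta_path).name

print(selected_names)
print(zcta_name)
```

If the printed names are not the three wildfire files followed by the ZCTA file, adjust the slice or the glob pattern before continuing.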

We also need the path to the population raster we're using, and the ZCTA file.
The path below is a placeholder for where I downloaded the GHSL dataset. 
Replace it with your local path. 

In [None]:
ghsl_path = 'MY_LOCAL_PATH/GHSL/100 m/GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0/GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0.tif'

# ZCTA path 
zcta_path = sorted(glob.glob(str(data_dir / "02_interim_data" /  "*")))[3]

We're now set up to run the pop_exp functions we're interested in. 

Our first goal was:
1. Find the total number of people residing within 10km of *any* US wildfire 
disaster in 2016, 2017, and 2018. 

To do this, we can run find_num_people_affected with the parameter 
by_unique_hazard = False. 

Because we're looping over three years, we'll initialize an empty list first, 
and then store the results in this list. We're also adding a year variable to 
the result as we go.

It takes about 1 minute to run this function per year of wildfire data, for a 
total of 3 minutes.


In [None]:
num_affected_list = []

for i in range(len(wildfire_paths)):  # one iteration per year of wildfire data
    num_affected = find_num_people_affected(
        path_to_hazards=wildfire_paths[i],
        raster_path=ghsl_path,
        by_unique_hazard=False # setting by unique hazard to false 
    )
    num_affected['year'] = 2016 + i
    num_affected_list.append(num_affected)

Because we saved each output with a year variable, we can concatenate these 
dataframes together, and then look at the output. 

In [None]:
# Ok, now we'll join those dataframes together. 
num_affected_df = pd.concat(num_affected_list, axis=0)
num_affected_df.head()

Our output has three columns: ID_hazard, num_people_affected, and year. 
We added year, but the other two are output from find_num_people_affected.

To find the total number of people affected by any wildfire disaster by year, 
we could group this output dataframe by year and sum. 
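The group-and-sum step described above can be sketched on a small made-up dataframe. The column names match the output described here, but the hazard IDs and counts are invented for illustration and are not real results.

```python
import pandas as pd

# Made-up numbers with the same three columns as the output described above.
example_df = pd.DataFrame(
    {
        "ID_hazard": ["a", "b_c", "d", "e"],
        "num_people_affected": [100, 250, 40, 10],
        "year": [2016, 2016, 2017, 2018],
    }
)

# One row per year, with the counts summed across hazard groups.
total_by_year = (
    example_df.groupby("year", as_index=False)["num_people_affected"].sum()
)
print(total_by_year)
```

Summing across rows is only safe for output produced with by_unique_hazard = False, because overlapping hazards have already been merged into a single row.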

In our output, one thing has changed: our ID_hazard column is not the same as 
the ID_hazard column that we started with. 

We wanted to count the number of people residing within 10km of *any* 
US wildfire disaster. There are some people who live within 10km of two or
more wildfire disasters. Because we just wanted the total, we did not want to 
double count those people. When computing a total, rather than the number of 
people affected by each unique hazard, find_num_people_affected takes the unary 
union of any buffered hazards that are overlapping, and finds the total of 
everyone residing within that area. In the output, the hazard IDs of any 
overlapping hazards are concatenated. This avoids double counting, while still 
giving the user as much information as possible about how many people lived 
near each hazard or group of hazards.
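The idea of merging overlapping buffers before counting can be shown without any geospatial libraries. The sketch below is a toy illustration and not pop_exp's implementation. Each buffer is represented as the set of grid cells it covers, populations are made up, and joining IDs with an underscore is an assumption, since the separator used in the real output is not specified here.

```python
# Toy illustration only: each "buffer" is the set of grid cells it covers,
# and each cell holds a made-up population. This is not pop_exp's code.
cell_population = {(0, 0): 10, (0, 1): 20, (0, 2): 30, (5, 5): 5}

buffers = {
    "fire_A": {(0, 0), (0, 1)},
    "fire_B": {(0, 1), (0, 2)},  # overlaps fire_A at cell (0, 1)
    "fire_C": {(5, 5)},          # overlaps nothing
}


def merge_overlapping(buffers):
    """Merge buffers that share any cell; return {joined_ids: union_of_cells}."""
    groups = []  # list of (ids, cells) pairs, kept pairwise disjoint
    for hazard_id, cells in buffers.items():
        ids, merged = [hazard_id], set(cells)
        remaining = []
        for group_ids, group_cells in groups:
            if group_cells & merged:
                # This existing group touches the new buffer: absorb it.
                ids = group_ids + ids
                merged |= group_cells
            else:
                remaining.append((group_ids, group_cells))
        groups = remaining + [(ids, merged)]
    # Assumed convention: join the sorted IDs of a group with "_".
    return {"_".join(sorted(ids)): cells for ids, cells in groups}


merged_buffers = merge_overlapping(buffers)

# Each cell is counted once per merged group, so nobody is double counted.
people_by_group = {
    ids: sum(cell_population[c] for c in cells)
    for ids, cells in merged_buffers.items()
}
print(people_by_group)
```

In this toy example the people in the shared cell are counted once, under an ID that names both fires.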

We can save the output.

In [None]:
# save the combined dataframe (all years), not just the last year's result
num_affected_df.to_csv(data_dir / "03_results" / "num_people_affected_by_wildfire.csv")

We also wanted to find the total number of people residing within 10 km of 
*EACH* US wildfire disaster in 2016, 2017, and 2018. 

To do this, we also need to use find_num_people_affected, with all the same 
arguments except for by_unique_hazard. In this case, we set by_unique_hazard to 
True. This means that we will count the number of people within 10km of each 
wildfire disaster boundary, regardless of whether two or more exposed areas 
overlap. 

In [None]:
num_affected_list_unique_h = []

for i in range(len(wildfire_paths)):  # one iteration per year of wildfire data
    num_affected_unique = find_num_people_affected(
        path_to_hazards=wildfire_paths[i],
        raster_path=ghsl_path,
        by_unique_hazard=True # setting by unique hazard to true 
    )
    num_affected_unique['year'] = 2016 + i
    num_affected_list_unique_h.append(num_affected_unique)

num_affected_unique = pd.concat(num_affected_list_unique_h, axis=0)
num_affected_unique.head()

Again, our output has three columns: ID_hazard, num_people_affected, and year. 

This time, the ID_hazard column is the same as the one we passed to the 
function. If people lived within 10 km of two or more fires, they are counted 
in the total for each of those fires. This means some people may be counted 
two, three, or more times across hazards. 
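To see how large the double counting can be, compare per-hazard counts with a count over the union of the exposed areas. This is a toy sketch with made-up grid cells and populations; it is not pop_exp output.

```python
# Toy illustration only (made-up cells and populations, not pop_exp output).
cell_population = {(0, 0): 10, (0, 1): 20, (0, 2): 30}

buffers = {
    "fire_A": {(0, 0), (0, 1)},
    "fire_B": {(0, 1), (0, 2)},  # shares cell (0, 1) with fire_A
}

# One count per hazard: the shared cell contributes to both fires.
people_by_hazard = {
    hazard_id: sum(cell_population[c] for c in cells)
    for hazard_id, cells in buffers.items()
}

# Count over the union of both buffers: each cell contributes once.
union_cells = set().union(*buffers.values())
people_in_union = sum(cell_population[c] for c in union_cells)

# The gap between the two is the number of people counted more than once.
double_counted = sum(people_by_hazard.values()) - people_in_union
print(people_by_hazard, people_in_union, double_counted)
```

For this reason, per-hazard counts should not be summed to get a total number of people exposed.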

We can save the output.

In [None]:
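# save the per-hazard results under their own file name so the earlier output is not overwritten
num_affected_unique.to_csv(data_dir / "03_results" / "num_people_affected_by_unique_wildfire.csv")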

There were two more ways we wanted to define exposure. These are analogous to 
the two quantities we just computed, but this time, we want to know these
exposures by ZCTA. 

3. Find the total number of people residing within 10km of *any* US wildfire 
disaster in 2016, 2017, and 2018 by 2020 ZCTA. 
4. Find the total number of people residing within 10 km of *EACH* US wildfire
disaster in 2016, 2017, and 2018 by 2020 ZCTA.

To do this, we need to run find_num_people_affected_by_geo. 

First, we'll find the total number of people affected by ZCTA 
(by_unique_hazard = False).

In [None]:
num_affected_list = []

for i in range(len(wildfire_paths)):  # one iteration per year of wildfire data
    num_affected = find_num_people_affected_by_geo(
        path_to_hazards=wildfire_paths[i],
        path_to_additional_geos=zcta_path,
        raster_path=ghsl_path,
        by_unique_hazard=False # setting by unique hazard to false 
    )
    num_affected['year'] = 2016 + i
    num_affected_list.append(num_affected)

# putting all years into one dataframe
num_affected_df = pd.concat(num_affected_list, axis=0)


As in the case where we were not computing by ZCTA, the results come with 
hazard IDs for groups of overlapping hazards.

Since we're interested in the total number of people affected by ZCTA, we'll 
group the output dataframe by ZCTA and year and sum over any hazard IDs.

In [None]:
agg_num_af = num_affected_df.groupby(["ID_spatial_unit", "year"]).agg(
     {"num_people_affected": "sum"}
).reset_index()

# And we can save 
# (using a separate file name so the earlier, non-ZCTA output is not overwritten)
agg_num_af.to_csv(data_dir / "03_results" / "num_people_affected_by_wildfire_by_zcta.csv")

To find the number of people affected by each hazard by each ZCTA, we do the 
same, but with by_unique_hazard = True. 

In [None]:
num_affected_list = []

for i in range(len(wildfire_paths)):  # one iteration per year of wildfire data
    num_affected = find_num_people_affected_by_geo(
        path_to_hazards=wildfire_paths[i],
        path_to_additional_geos=zcta_path,
        raster_path=ghsl_path,
        by_unique_hazard=True # setting by unique hazard to true 
    )
    num_affected['year'] = 2016 + i
    num_affected_list.append(num_affected)

# all years in one dataframe
num_affected_df = pd.concat(num_affected_list, axis=0)
# and we can save the per-hazard, per-ZCTA results under their own file name
num_affected_df.to_csv(data_dir / "03_results" / "num_people_affected_by_unique_wildfire_by_zcta.csv")

If you want to run on a lot of data, we recommend parallelizing over time or 
space.

In [None]:
# We can look at the data if we want - but that's that!
print(agg_num_af)
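The recommendation to parallelize over time for larger datasets could look something like the sketch below. It is only a sketch under stated assumptions: count_people_stub and run_all_years are hypothetical names, the path strings are made up, and the stub stands in for a slow per-year call such as find_num_people_affected_by_geo so that the example runs without the package or any data.

```python
from concurrent.futures import ThreadPoolExecutor


def count_people_stub(path_to_hazards):
    """Hypothetical stand-in for one slow per-year call.

    In the notebook this would wrap find_num_people_affected_by_geo with the
    raster and ZCTA paths filled in; here it returns a fake record so the
    sketch runs anywhere.
    """
    return {"path": path_to_hazards, "num_people_affected": len(path_to_hazards)}


def run_all_years(paths, worker, first_year=2016, max_workers=3):
    """Run worker once per path concurrently and tag each result with its year."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with the years.
        results = list(pool.map(worker, paths))
    for offset, result in enumerate(results):
        result["year"] = first_year + offset
    return results


example_paths = ["wf_2016", "wf_2017", "wf_2018"]  # made-up path strings
results = run_all_years(example_paths, count_people_stub)
print([r["year"] for r in results])
```

Threads only help when the heavy work releases Python's global interpreter lock. For CPU-bound work, concurrent.futures.ProcessPoolExecutor offers the same map interface and may be a better fit, provided the worker function and its arguments can be pickled.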