#### Introduction 

This tutorial is from a Jupyter notebook available for download on GitHub at https://github.com/heathermcb/popexposure/tree/main/docs/tutorials. If you're reading this on the `popexposure` website, consider downloading and running the tutorials for yourself!

This is a tutorial demonstrating how you may want to use the package `popexposure` to find people living near environmental hazards. 

This notebook explains what `popexposure` is, how it can be used, and then cleans some raw data files for use with `popexposure`. There are two additional notebooks which run `popexposure` methods and explore the results generated by those methods.

We assume that you have a version of Python compatible with the requirements of `popexposure`, that you have an IDE, and that you can open and run Jupyter notebooks and Python scripts and activate a virtual environment in which to run this notebook and `popexposure`.

#### Outline
1. Purpose of tutorial
2. Activating the pop_exp environment
3. What does popexposure do?
4. Data used in this tutorial
5. Data preparation

#### Purpose of tutorial

This tutorial (along with notebooks 2 and 3) will explain what `popexposure` does, and teach you how to use the package to find the number of people residing near California wildfire disasters, as well as the number of people residing near California wildfire disasters by ZCTA, across the years 2016-2018. We will discuss the details of how `popexposure` allows you to define exposure to environmental hazards shortly. 

#### Activating the pop_exp environment

We've provided an environment file containing the requirements of `popexposure` in the same GitHub folder as this tutorial, and you can run this tutorial in that environment. If you're running a script from the command line that loads and runs `popexposure` methods, you can install and activate this environment before you run that script. If you want to run this tutorial or your own notebook that uses `popexposure`, you can install this environment, make a Jupyter kernel from it, and run the notebook in that kernel.

Briefly, if you wanted to run a script using `popexposure` from the command line, you could:

1. Open a terminal window and navigate to this repository using `cd`.
2. Create the environment by running: `conda env create -f pop_exp.yml`
3. Activate this environment using: `conda activate pop_exp`
4. Run your script.

To create a kernel, you need to run:

`python -m ipykernel install --user --name pop_exp --display-name "Python (pop_exp)"`

#### What does popexposure do?

`popexposure` allows the user to estimate either

(a) the number of people living within a buffer distance of each unique hazard (e.g., the number of people living within 10 km of each individual wildfire disaster burned area in 2018 in California) or 
(b) the number of people living within the buffer distance of any of the cumulative set of hazards (e.g., the number of people living within 10 km of one or more wildfire disaster burned areas in 2018 in California). 

These estimates can be broken down by additional administrative units such as ZCTAs; for example, `popexposure` can find the number of people living within 10 km of any wildfire disaster burned area in 2018 by ZCTA, and calculate administrative unit denominators such as the number of residents in each ZCTA. 

This means there are five distinct computations the package can do:

1. Find the total number of people who reside within a buffer distance (which can vary by hazard or be 0) of one or more hazards for a set of environmental hazards.
2. Find the total number of people who reside within a buffer distance (which can vary by hazard or be 0) for each unique environmental hazard in a set of hazards.
3. Find the total number of people who reside within a buffer distance (which can vary by hazard or be 0) of one or more hazards for a set of environmental hazards, by additional administrative unit (e.g., the total number of people who resided within 10 km of any wildfire disaster in 2018 by ZCTA).
4. Find the total number of people who reside within a buffer distance (which can vary by hazard or be 0) for each unique environmental hazard in a set of hazards, by additional administrative unit.
5. Find the number of people living within each administrative unit according to a gridded population dataset. 
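The difference between the cumulative counts (1) and the hazard-specific counts (2) can be illustrated with a toy example. The sketch below is purely conceptual and does not use `popexposure` or show how it works internally: the population cells, hazard locations, and IDs are made up, and hazards are simplified to points with a circular buffer.

```python
import math

# Toy population grid: (x, y) cell center in meters -> residents in that cell.
# All values are invented for illustration.
cells = {
    (0, 0): 100,
    (8_000, 0): 50,
    (20_000, 0): 30,
    (40_000, 0): 70,
}

# Toy hazards as points with a buffer distance in meters (hypothetical IDs).
hazards = {
    "fire_A": {"xy": (0, 0), "buffer_m": 10_000},
    "fire_B": {"xy": (15_000, 0), "buffer_m": 10_000},
}


def within(cell_xy, hazard):
    """True if the cell center lies within the hazard's buffer distance."""
    dx = cell_xy[0] - hazard["xy"][0]
    dy = cell_xy[1] - hazard["xy"][1]
    return math.hypot(dx, dy) <= hazard["buffer_m"]


# Hazard-specific: one count per hazard; a cell can be counted for several hazards.
hazard_specific = {
    hid: sum(pop for xy, pop in cells.items() if within(xy, hz))
    for hid, hz in hazards.items()
}

# Cumulative: each cell is counted once if it is near one or more hazards.
cumulative = sum(
    pop for xy, pop in cells.items()
    if any(within(xy, hz) for hz in hazards.values())
)

print(hazard_specific)
print(cumulative)
```

Here the hazard-specific counts sum to more than the cumulative count, because the cell at 8 km lies within 10 km of both toy fires and is counted once for each of them.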

The fifth option is meant to provide denominators for computations (3) and (4). For example, you may want to find the total number of people who lived within 10km of any wildfire disaster in 2018 by ZCTA, and then calculate the proportion of the ZCTA population that was exposed. To do this, you could use a method in `popexposure` to find the ZCTA population according to the same gridded population raster you used to determine exposure. 
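As a rough sketch of that proportion calculation, suppose the exposed counts and the ZCTA populations are available as two tables. The column names `exposed_10` and `population` and the values below are assumptions made for illustration; they are not necessarily what `popexposure` returns.

```python
import pandas as pd

# Hypothetical tables: exposed counts and total population by ZCTA.
exposed = pd.DataFrame({
    "ID_admin_unit": ["90001", "90002"],
    "exposed_10": [250.0, 0.0],
})
totals = pd.DataFrame({
    "ID_admin_unit": ["90001", "90002", "90003"],
    "population": [1000.0, 500.0, 0.0],
})

# Left-join onto the totals so ZCTAs without an exposed count are kept.
prop = totals.merge(exposed, on="ID_admin_unit", how="left")
prop["exposed_10"] = prop["exposed_10"].fillna(0.0)

# Leave the proportion missing where the ZCTA population is zero.
prop["prop_exposed"] = (prop["exposed_10"] / prop["population"]).where(
    prop["population"] > 0
)
print(prop)
```

Using the same gridded population raster for both the numerator and the denominator keeps the proportion between 0 and 1.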

This tutorial will demonstrate all of these options. 

To demo all the options, we will do the following five separate computations:

1. Find the total number of people residing within 10 km of one or more California wildfire disasters in 2016, 2017, and 2018.
2. Find the total number of people residing within 10 km of each unique California wildfire disaster in 2016, 2017, and 2018.
3. Find the total number of people residing within 10 km of one or more California wildfire disasters in 2016, 2017, and 2018 by 2020 ZCTA.
4. Find the total number of people residing within 10 km of each unique California wildfire disaster in 2016, 2017, and 2018 by 2020 ZCTA.
5. Find the population of all 2020 California ZCTAs. 


#### Data used in this tutorial

We'll use a publicly available dataset of US wildfire disaster boundaries for the years 2016-2018 filtered to California as our hazard data.

To create these estimates, `popexposure` requires up to four inputs: (1) a geospatial dataset of environmental hazards, (2) a gridded population dataset, (3) a parameter indicating whether the estimates are hazard-specific or cumulative (i.e. one count for the number of people affected for each unique hazard or one count for the total people living near one or more hazards), and (4) an optional additional geospatial dataset of administrative geographies such as postal codes, census tracts, or counties.

If you want to run this tutorial yourself, you can run the script available in the tutorial directory called 00_download_data.sh to create the necessary directory structure and automatically download these datasets. Or, you can create directories manually and download the files below (and unzip them). Note that the raster dataset may take a few minutes to download, so you may want to start the download before you intend to work through the tutorial. 

You need the following directories:

...tutorials/demo_data
...tutorials/demo_data/01_raw_data
...tutorials/demo_data/02_interim_data
...tutorials/demo_data/03_results
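If you'd rather create these folders from Python than by hand, something like the following should work. The helper name `make_demo_dirs` is our own invention for this sketch, and the demo call uses a throwaway temporary directory rather than your real tutorials folder.

```python
import pathlib
import tempfile


def make_demo_dirs(tutorials_dir):
    """Create demo_data and its three subfolders under tutorials_dir."""
    data_dir = pathlib.Path(tutorials_dir) / "demo_data"
    for name in ["01_raw_data", "02_interim_data", "03_results"]:
        # parents=True also creates demo_data; exist_ok=True makes reruns safe.
        (data_dir / name).mkdir(parents=True, exist_ok=True)
    return data_dir


# Demo in a temporary directory that is deleted afterwards.
with tempfile.TemporaryDirectory() as tmp:
    created = sorted(p.name for p in make_demo_dirs(tmp).iterdir())
print(created)
```

To create the real folders, you would call `make_demo_dirs(pathlib.Path.cwd())` from inside the tutorials directory.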

And the following data should be downloaded into 01_raw_data and unzipped.

Wildfire dataset (hazard data):

Description:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/R73R85
Download link:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DWILBW#

ZCTAs (additional administrative units):

Description:
https://www.census.gov/programs-surveys/geography/guidance/geo-areas/zctas.html
Download link:
https://www2.census.gov/geo/tiger/TIGER2020/ZCTA520/tl_2020_us_zcta520.zip


Global Human Settlement Layer (Gridded population data):
Residential population at 100 m resolution for 2020, California tile. 
(We used the file with Mollweide coordinate reference system, but any would work, since this package can handle any CRS.)

Description:
https://human-settlement.emergency.copernicus.eu/download.php?ds=pop
Download:
https://jeodpp.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/GHS_POP_GLOBE_R2023A/GHS_POP_E2020_GLOBE_R2023A_54009_100/V1-0/tiles/GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0_R5_C8.zip


To use `popexposure`, we need to create an instance of the class PopEstimator (the main class in the `popexposure` package) with the data we want to use to find exposed populations. The PopEstimator class provides the method `est_exposed_pop`, which actually does the computations. We need to instantiate the class with gridded population data describing the residential population of the area we're interested in. The PopEstimator method `est_exposed_pop` will then take in hazard data and find the number of people exposed to those hazards.

If we want to break counts down by ZCTA, we'll also need to pass ZCTA data to the PopEstimator class. These ZCTA data are optional. When no ZCTA data are passed, `est_exposed_pop` will by default return counts without additional administrative units. When administrative unit data are passed, counts will automatically be broken down by administrative unit.

In the rest of this tutorial, we'll prepare data to give to the PopEstimator class and to `est_exposed_pop`, for finding people exposed to wildfire in California in 2016-2018.

#### Data preparation

First we import some libraries. 

In [None]:
import pathlib
import sys
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
import os

The gridded population data doesn't need to be prepared for use with `popexposure`.
We'll start by preparing the ZCTA data. The PopEstimator class requires that the data be in a certain format. 

We've chosen to use 2020 ZCTA data, since the time period 2016-2018 is closer to the 2020 census than the 2010 census. 

We need to select and rename the columns to the right format: we need to rename the ZCTA ID to 'ID_admin_unit', and select 'ID_admin_unit' and the geometry column.

In [None]:
# Paths 
base_path = pathlib.Path.cwd()
data_dir = base_path / "demo_data"

In [None]:
# Read in the raw ZCTA data. 
zctas = gpd.read_file(data_dir / "01_raw_data" / "tl_2020_us_zcta520" / "tl_2020_us_zcta520.shp")

# Rename 
zctas.rename(columns={"ZCTA5CE20": "ID_admin_unit"}, inplace=True)
# select ID_admin_unit and geometry
zctas = zctas[["ID_admin_unit", "geometry"]].copy()
zctas.head()

In [None]:
# Filter to ZCTAs in CA (California ZIP codes run from 90001 to 96162)
zctas = zctas[pd.to_numeric(zctas['ID_admin_unit']).between(90000, 96162)].copy()
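To see what a numeric range filter like this does, here is a small sketch on made-up ZCTA codes. It assumes California ZIP codes run from 90001 to 96162, and it only approximates "ZCTAs in California", since it selects by code range rather than by state boundary. The upper bound matters: a code such as 96150 would be dropped by a cutoff of 96100.

```python
import pandas as pd

# Toy ZCTA codes stored as strings, as they are in the Census file.
ids = pd.Series(["89501", "90001", "94110", "96150", "97201"], name="ID_admin_unit")

# Assumption: California ZIP codes run from 90001 to 96162.
as_number = pd.to_numeric(ids)
in_ca_range = as_number.between(90001, 96162)  # inclusive on both ends

# Keep the original string IDs so leading zeros elsewhere would not be lost.
ca_ids = ids[in_ca_range].tolist()
print(ca_ids)
```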


Then, after selecting just ZCTAs in California, we'll save this as a GeoJSON file. The PopEstimator class requires this file to be in either GeoJSON or Parquet format.

In [None]:
# This will take a few seconds. 
zctas_path = data_dir / "02_interim_data" / "zctas_CA_2020.geojson"
zctas.to_file(zctas_path, driver = 'GeoJSON')

Now we can move on to the hazard data: the wildfire data. We'll read in the raw wildfire data as downloaded from Harvard Dataverse. Note that the raw data contains wildfire disasters for many more years than we need (the file read in below is named for 2000-2025), so we're going to filter down to only 2016-2018 for this tutorial. As with the admin data, we need to format this dataset correctly for `popexposure`.

In [None]:
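# Read in the US wildfire dataset
fires = gpd.read_file(data_dir / "01_raw_data" / "dataverse_files" / "wfbz_disasters_2000-2025.geojson")
# Filter to only CA fires: wildfire_states has to contain CA.
# na=False treats a missing state as a non-match; .copy() lets us add columns later without a pandas warning.
fires = fires[fires['wildfire_states'].str.contains('CA', na=False)].copy()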

We'll preview and plot the data to make sure the dataset was read in correctly.

In [None]:
fires.head()

In [None]:
fires.plot()

This data will be passed to the method `est_exposed_pop`, which expects hazard data with at least 3 columns: ID_hazard, at least one column whose name starts with buffer_dist, and a geometry column. We need to decide on one or more buffer distances and create those columns, and rename the other columns to the correct names.

For this tutorial, we've decided to consider people exposed to a wildfire if they live within 10 km of the boundaries of the wildfire disasters specified in this dataset. The distance could be different if we thought another distance from our hazards was more relevant. We could also assign a buffer of 0 to our hazards, or a different buffer to each hazard in the dataset; they don't all have to be the same. We could even assign buffers based on the hazard area, or on another characteristic of each hazard.

The buffer distance is in meters, so we'll specify a 10,000 m buffer distance. 

If we wanted to also see how many people were within 20 km of the wildfire boundaries, we could add another column with a 20,000 m buffer distance. This might be useful for an environmental epidemiology study, for example, if we wanted to run sensitivity analyses on the chosen buffer distance ;).
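The options described above (one shared buffer, a second buffer for sensitivity analysis, a zero buffer, or a buffer that depends on a hazard characteristic) could be set up along these lines. This is only a sketch on a toy table: the `area_km2` column and the area thresholds are invented for illustration and are not part of the wildfire dataset or of `popexposure`.

```python
import pandas as pd

# Toy hazard table; area_km2 is a hypothetical column used only for illustration.
hazards = pd.DataFrame({
    "ID_hazard": ["fire_A", "fire_B", "fire_C"],
    "area_km2": [5.0, 50.0, 500.0],
})

# Same buffer for every hazard: a scalar is broadcast to all rows.
hazards["buffer_dist_10"] = 10_000  # meters
hazards["buffer_dist_20"] = 20_000  # meters, e.g. for a sensitivity analysis

# Hazard-specific buffer: here, larger fires get a larger buffer (arbitrary rule).
hazards["buffer_dist_by_area"] = hazards["area_km2"].apply(
    lambda a: 5_000 if a < 10 else 10_000 if a < 100 else 20_000
)

# No buffer: only people inside the hazard boundary itself would count.
hazards["buffer_dist_0"] = 0

# Every buffer column name starts with "buffer_dist".
buffer_cols = [c for c in hazards.columns if c.startswith("buffer_dist")]
print(hazards[["ID_hazard"] + buffer_cols])
```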

We'll call our column buffer_dist_10 since it is a 10km buffer distance. 

In [None]:
fires["buffer_dist_10"] = 10000 # buffer distance in meters
fires.head() # Checking what columns I have in the data 

We've created a buffer distance column. We need to select and rename the remaining columns we need, but we also need to select the years we're interested in. 

Here, we're interested in years 2016-2018, and we want to determine exposure by year. We want to compute the total number of people affected by any fire in 2016, 2017, and 2018, as well as apply the three other exposure definitions we wrote out above yearly.

There is no option in `popexposure` to indicate which hazards belong to which year or time period. If we want to know the total number of people affected by hazards in 2016 but not 2017, we need to feed `popexposure` the hazard data for 2016 along with a gridded population dataset that represents the population in 2016. If we wanted monthly exposure for 2016, we'd need to split the hazard data up by month and call the `est_exposed_pop` method separately on each month.
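For instance, a monthly split might look something like the sketch below. It assumes the hazard table has a date column, here called `start_date`, which is a hypothetical name (check the wildfire dataset's documentation for the actual date fields), and it runs on a toy table rather than the real data.

```python
import pandas as pd

# Toy hazard table with a hypothetical start_date column.
hazards = pd.DataFrame({
    "ID_hazard": ["fire_A", "fire_B", "fire_C"],
    "start_date": pd.to_datetime(["2016-06-03", "2016-06-20", "2016-08-15"]),
    "buffer_dist_10": 10_000,
})

# One sub-table per calendar month, keyed by a "YYYY-MM" label.
months = hazards["start_date"].dt.to_period("M")
hazards_by_month = {
    str(period): group.drop(columns="start_date")
    for period, group in hazards.groupby(months)
}

for label, group in hazards_by_month.items():
    # Each group would go to its own est_exposed_pop call.
    print(label, group["ID_hazard"].tolist())
```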

In this tutorial, we'll use the Global Human Settlement Layer data from 2020 for each year 2016-2018, since it's close enough, but we'll split up the hazard data by year because we want yearly counts.

So before selecting just the hazard ID, buffer distance, and geometry columns, we're going to select and split up the years we're interested in.

In [None]:
# Select fires in 2016, 2017, 2018
fires = fires[fires["wildfire_year"].isin([2016, 2017, 2018])]
# Split this into a list of dataframes by year
fires_by_year = [fires[fires["wildfire_year"] == year] for year in [2016, 2017, 2018]]

Now that we have our exposure datasets, we'll select and rename the columns appropriately: ID_hazard, buffer_dist_10, and geometry. 

In [None]:
# First, select cols
fires_by_year = [fire[["ics_id", "buffer_dist_10", "geometry"]] for fire in fires_by_year]
# Then rename the wildfire ID col
fires_by_year = [fire.rename(columns={"ics_id": "ID_hazard"}) for fire in fires_by_year]
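Before writing anything to disk, it can be worth checking that each table has the column names described earlier. The helper below is a hypothetical convenience function of our own, not part of `popexposure`; it only inspects column names and is shown on empty toy tables.

```python
import pandas as pd


def check_hazard_columns(df):
    """Return a list of problems with the hazard table's column names."""
    problems = []
    if "ID_hazard" not in df.columns:
        problems.append("missing ID_hazard")
    if "geometry" not in df.columns:
        problems.append("missing geometry")
    if not any(str(c).startswith("buffer_dist") for c in df.columns):
        problems.append("no column starting with buffer_dist")
    return problems


# Toy tables: one correctly named, one still using the raw column names.
good = pd.DataFrame(columns=["ID_hazard", "buffer_dist_10", "geometry"])
bad = pd.DataFrame(columns=["ics_id", "geometry"])
print(check_hazard_columns(good))
print(check_hazard_columns(bad))
```

In this notebook, you could run `check_hazard_columns(fire)` on each element of `fires_by_year` and expect an empty list for each.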

Finally, we can write these out to the interim data folder for use with `est_exposed_pop`, which accepts either a dataframe or a path to a data file. We're using GeoJSON files because `est_exposed_pop` requires a file to be in either GeoJSON or Parquet format.

In [None]:
for i, fire in enumerate(fires_by_year):
    fire.to_file(data_dir / "02_interim_data" / f"wildfires_{2016 + i}.geojson", driver="GeoJSON")

Our data is ready! Proceed to 02_demo_example_run.ipynb to continue the tutorial. 