<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Pre-requisites" data-toc-modified-id="Pre-requisites-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Pre-requisites</a></span></li><li><span><a href="#Instructions" data-toc-modified-id="Instructions-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Instructions</a></span></li><li><span><a href="#Imports-and-Constants" data-toc-modified-id="Imports-and-Constants-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Imports and Constants</a></span></li><li><span><a href="#Constants" data-toc-modified-id="Constants-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Constants</a></span></li><li><span><a href="#Export-Images" data-toc-modified-id="Export-Images-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Export Images</a></span></li></ul></div>

## Pre-requisites
Register a Google account for Earth Engine access at [https://code.earthengine.google.com](https://code.earthengine.google.com). Approval may take a couple of days. Without registration, the `ee.Initialize()` call below will raise an error.

Create the DHS labels file. See `1_create_dhs_labels.R`.

## Instructions

This notebook exports satellite image composites from Google Earth Engine as gzipped TFRecord files (`*.tfrecord.gz`). The exported images take up a significant amount of storage space, so make sure you have enough available before exporting.

- Storage space needed for exported TFRecords: ~97.7 GiB
- Expected export time: ~48h

*Note*: The storage space listed above is for all locations, including locations that were later excluded due to missing imagery.

By default, this notebook exports images to Google Drive. If you instead prefer to export images to Google Cloud Storage (GCS), change the `EXPORT` constant below to `'gcs'` and set `BUCKET` to the desired GCS bucket name. The images are exported to the following locations:

- Google Drive (default): `sustainbench_dhs_tfrecords_raw`
- GCS: `{BUCKET}/sustainbench_dhs_tfrecords_raw/`

Once the images have finished exporting, download the exported TFRecord files from Google Drive (or from your GCS bucket, if you exported there) to `dhs/dhs_tfrecords_raw/`. After downloading, this directory should look as follows, where each `YYY` depends on the `CHUNK_SIZE` parameter used:

```
dhs/dhs_tfrecords_raw/
    AL_2008_clust001_toYYY_of450.tfrecord.gz
    ...
    ZW_2015_clustYYY_to400_of400.tfrecord.gz
```
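
As a quick sanity check after downloading, the file names can be parsed to recover the survey and cluster range of each file. The sketch below assumes the naming pattern shown in the listing above and a two-letter country code; the helper name `parse_tfrecord_name` is hypothetical and not part of this repo.

```python
import re

# Pattern assumed from the directory listing:
# {country}_{year}_clust{start}_to{end}_of{max}.tfrecord.gz
FNAME_RE = re.compile(
    r'^(?P<country>[A-Z]{2})_(?P<year>\d{4})'
    r'_clust(?P<start>\d+)_to(?P<end>\d+)_of(?P<max>\d+)\.tfrecord\.gz$')


def parse_tfrecord_name(fname: str) -> dict:
    """Split an exported file name into its survey and cluster-range fields."""
    match = FNAME_RE.match(fname)
    if match is None:
        raise ValueError(f'unexpected file name: {fname}')
    fields = match.groupdict()
    return {
        'country': fields['country'],
        'year': int(fields['year']),
        'start': int(fields['start']),
        'end': int(fields['end']),
        'max': int(fields['max']),
    }


print(parse_tfrecord_name('AL_2008_clust001_to450_of450.tfrecord.gz'))
```

A file whose name does not match raises `ValueError`, which makes stray or partially downloaded files easy to spot.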

After finishing this notebook, move on to [3_process_tfrecords.ipynb](./3_process_tfrecords.ipynb) for next steps.

## Imports and Constants

In [None]:
%load_ext autoreload
%autoreload 2

# change directory to repo root, and verify
# %cd '../'
!pwd

In [None]:
from __future__ import annotations

import math

import ee
import pandas as pd

import ee_utils

Before using the Earth Engine API, you must perform a one-time authentication that authorizes access to Earth Engine on behalf of the Google account you registered at [https://code.earthengine.google.com](https://code.earthengine.google.com). The authentication process saves a credentials file to `$HOME/.config/earthengine/credentials` for future use.

The command `ee.Authenticate()` runs the authentication process. Once you successfully authenticate, you should not need to authenticate again in the future, unless you delete the credentials file. If you do not authenticate, the subsequent `ee.Initialize()` command will fail.

For more information, see [https://developers.google.com/earth-engine/python_install-conda.html](https://developers.google.com/earth-engine/python_install-conda.html).

In [None]:
try:
    # if already authenticated, can directly initialize the Earth Engine API
    ee.Initialize()
except Exception:
    # otherwise, authenticate first, then initialize
    ee.Authenticate()
    ee.Initialize()

## Constants

In [None]:
# ========== ADAPT THESE PARAMETERS ==========

# To export to Google Drive (default), keep the next 2 lines uncommented
EXPORT = 'drive'
BUCKET = None

# To export to Google Cloud Storage (GCS) instead, comment out the 2 lines above,
# uncomment the next 2 lines, and set BUCKET to the desired bucket name
# EXPORT = 'gcs'
# BUCKET = 'mybucket'

# export location parameters
EXPORT_FOLDER = 'sustainbench_dhs_tfrecords_raw'

# CHUNK_SIZE sets the maximum number of records (images) in each TFRecord file
# for a DHS survey. Set CHUNK_SIZE to None to export a single TFRecord file per survey.
# If an export fails because it exceeds Google Earth Engine memory limits,
# decrease CHUNK_SIZE by a factor of 10 at a time (down to as small as 1)
# until Google Earth Engine stops reporting memory errors.
CHUNK_SIZE = None
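
To see how `CHUNK_SIZE` affects the number of files exported per survey, the chunking arithmetic can be sketched in isolation. This is a minimal illustration that mirrors the ceiling division used in `export_images`; the helper name `chunk_bounds` is hypothetical.

```python
import math
from typing import List, Optional, Tuple


def chunk_bounds(num_records: int,
                 chunk_size: Optional[int] = None) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) row positions for each exported file."""
    if num_records == 0:
        return []
    if chunk_size is None:
        # a single file holding every record of the survey
        chunk_size = num_records
    num_chunks = math.ceil(num_records / chunk_size)
    return [(i * chunk_size, min((i + 1) * chunk_size, num_records) - 1)
            for i in range(num_chunks)]


print(chunk_bounds(450))        # one file for the whole survey
print(chunk_bounds(450, 200))   # three files; the last one is shorter
```

For a survey with 450 clusters, `CHUNK_SIZE = 200` yields three files covering rows 0 to 199, 200 to 399, and 400 to 449.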

In [None]:
# ========== DO NOT MODIFY THESE ==========

# input data paths
CSV_PATH = 'output_labels/merged.csv'

# band names
MS_BANDS = ['BLUE', 'GREEN', 'RED', 'NIR', 'SWIR1', 'SWIR2', 'TEMP1']

# image parameters
PROJECTION = 'EPSG:3857'  # see https://epsg.io/3857
SCALE = 30                # export resolution: 30m/px
EXPORT_TILE_RADIUS = 127  # image dimension = (2*EXPORT_TILE_RADIUS) + 1 = 255px
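
The image parameters above imply the following nominal tile footprint. This is plain arithmetic on the constants; it does not account for how projected meters in `EPSG:3857` relate to ground distance at a given latitude.

```python
SCALE = 30                # meters per pixel, as set above
EXPORT_TILE_RADIUS = 127  # pixels on each side of the center pixel

tile_px = 2 * EXPORT_TILE_RADIUS + 1  # center pixel plus the radius on both sides
tile_m = tile_px * SCALE              # nominal side length in meters

print(f'{tile_px} px per side, about {tile_m / 1000:.2f} km nominal')
```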

## Export Images

In [None]:
def export_images(df: pd.DataFrame,
                  country: str,
                  year: int,
                  export_folder: str,
                  chunk_size: int | None = None
                  ) -> dict[tuple[str, str, int, int], ee.batch.Task]:
    '''
    Args
    - df: pd.DataFrame, contains columns ['lat', 'lon', 'country_code', 'year', 'cluster_id']
    - country: str, together with `year` determines the survey to export
    - year: int, together with `country` determines the survey to export
    - export_folder: str, name of folder for export
    - chunk_size: int, optionally set a limit to the # of images exported per TFRecord file
        - set to a small number (<= 50) if Google Earth Engine reports memory errors

    Returns: dict, maps task name tuple (export_folder, country, year, chunk) to ee.batch.Task
    '''
    subset_df = df[(df['country_code'] == country) & (df['year'] == year)].reset_index(drop=True)
    if chunk_size is None:
        chunk_size = len(subset_df)
    num_chunks = int(math.ceil(len(subset_df) / chunk_size))
    tasks = {}

    id_max = subset_df['cluster_id'].max()
    digits_in_id = len(str(id_max))
    id_template = '{:0' + str(digits_in_id) + 'd}'
    id_max = id_template.format(id_max)

    for i in range(num_chunks):
        chunk_slice = slice(i * chunk_size, (i+1) * chunk_size - 1)  # df.loc[] is inclusive
        fc = ee_utils.df_to_fc(subset_df.loc[chunk_slice, :])
        start_date, end_date = ee_utils.surveyyear_to_range(year)

        # create 3-year Landsat composite image
        imgcol = ee_utils.LandsatSR(start_date=start_date, end_date=end_date).merged
        imgcol = imgcol.map(ee_utils.mask_qaclear).select(MS_BANDS)
        img = imgcol.median()

        # add nightlights, latitude, and longitude bands
        img = ee_utils.add_latlon(img)
        img = img.addBands(ee_utils.composite_nl(year))

        # create unique filename for export
        stop = min(chunk_slice.stop, len(subset_df) - 1)
        id_start = subset_df.loc[chunk_slice.start, 'cluster_id']
        id_end = subset_df.loc[stop, 'cluster_id']
        assert id_end >= id_start
        id_start = id_template.format(id_start)
        id_end = id_template.format(id_end)
        fname = f'{country}_{year}_clust{id_start}_to{id_end}_of{id_max}'
        print(f'- {fname}')

        tasks[(export_folder, country, year, i)] = ee_utils.get_array_patches(
            img=img, scale=SCALE, ksize=EXPORT_TILE_RADIUS,
            points=fc, export=EXPORT,
            prefix=export_folder, fname=fname,
            bucket=BUCKET)
    return tasks
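
The zero-padding of cluster IDs in the export file names can be illustrated on its own. The sketch below follows the naming pattern listed in the Instructions section (for example `AL_2008_clust001_toYYY_of450`); the helper name `build_fname` is hypothetical and the inputs are made-up values.

```python
def build_fname(country: str, year: int,
                id_start: int, id_end: int, id_max: int) -> str:
    """Build an export file name with cluster IDs padded to the width of id_max."""
    width = len(str(id_max))

    def pad(cluster_id: int) -> str:
        # left-pad with zeros so every ID in a survey has the same width
        return f'{cluster_id:0{width}d}'

    return (f'{country}_{year}_clust{pad(id_start)}'
            f'_to{pad(id_end)}_of{pad(id_max)}')


print(build_fname('AL', 2008, 1, 450, 450))
```

Padding every ID to the width of the largest ID keeps the files of one survey in cluster order when sorted by name.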

In [None]:
tasks: dict[tuple[str, str, int, int], ee.batch.Task] = {}

In [None]:
dhs_df = pd.read_csv(CSV_PATH, float_precision='high', index_col=False).sort_values('DHSID_EA')

# exclude adm1* because of NaNs
dhs_df = dhs_df[['DHSID_EA', 'country_code', 'cluster_id', 'urban', 'lat', 'lon', 'year']]
display(dhs_df.head())

dhs_surveys = list(dhs_df.groupby(['country_code', 'year']).groups.keys())
for country, year in dhs_surveys:
    print(country, year)
    new_tasks = export_images(
        df=dhs_df, country=country, year=year,
        export_folder=EXPORT_FOLDER, chunk_size=CHUNK_SIZE)
    tasks.update(new_tasks)

Check on the status of each export task at [https://code.earthengine.google.com/](https://code.earthengine.google.com/), or run the following cell, which polls the tasks repeatedly at the interval set by `poll_interval`. Once all tasks have completed, download the DHS TFRecord files to `dhs/dhs_tfrecords_raw/`.

In [None]:
ee_utils.wait_on_tasks(tasks, poll_interval=10)
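
For readers who want to see what such polling involves, here is a minimal sketch. It is not the implementation of `ee_utils.wait_on_tasks`; it only relies on `ee.batch.Task.status()` returning a dict with a `'state'` entry. The function name `poll_tasks` and the injectable `sleep` argument are assumptions made for illustration.

```python
import time

# Earth Engine task states after which a task will not change again
TERMINAL_STATES = {'COMPLETED', 'FAILED', 'CANCELLED'}


def poll_tasks(tasks: dict, poll_interval: float = 60, sleep=time.sleep) -> dict:
    """Block until every task reaches a terminal state; return final state per key."""
    final = {}
    pending = dict(tasks)
    while pending:
        for key, task in list(pending.items()):
            state = task.status()['state']
            if state in TERMINAL_STATES:
                final[key] = state
                del pending[key]
        if pending:
            # wait before asking Earth Engine again
            sleep(poll_interval)
    return final
```

Calling `poll_tasks(tasks, poll_interval=60)` on the `tasks` dict built above would return a dict mapping each task key to its final state, which makes failed exports easy to list and resubmit.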