<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Pre-requisites" data-toc-modified-id="Pre-requisites-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Pre-requisites</a></span></li><li><span><a href="#Instructions" data-toc-modified-id="Instructions-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Instructions</a></span></li><li><span><a href="#Imports-and-Constants" data-toc-modified-id="Imports-and-Constants-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Imports and Constants</a></span></li><li><span><a href="#Validate-and-Split-Exported-TFRecords" data-toc-modified-id="Validate-and-Split-Exported-TFRecords-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Validate and Split Exported TFRecords</a></span></li><li><span><a href="#Calculate-Mean-and-Std-Dev-for-Each-Band" data-toc-modified-id="Calculate-Mean-and-Std-Dev-for-Each-Band-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Calculate Mean and Std-Dev for Each Band</a></span></li></ul></div>

## Pre-requisites

Go through the [`preprocessing/0_export_tfrecords.ipynb`](./0_export_tfrecords.ipynb) notebook.

Before running this notebook, you should have the following structure under the `data/` directory:

```
data/
    dhs_tfrecords_raw/
        angola_2011_00.tfrecord.gz
        ...
        zimbabwe_2015_XX.tfrecord.gz
    dhsnl_tfrecords_raw/
        angola_2010_00.tfrecord.gz
        ...
        zimbabwe_2016_XX.tfrecord.gz
    lsms_tfrecords_raw/
        ethiopia_2011_00.tfrecord.gz
        ...
        uganda_2013_XX.tfrecord.gz
```
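
The layout above can be checked before running the rest of the notebook. The following is a minimal sketch, assuming the exported files follow the `<country>_<year>_<XX>.tfrecord.gz` naming shown in the listing; the helper name `group_raw_exports` is hypothetical and not part of this repo.

```python
from __future__ import annotations

import os
import re
from collections import defaultdict
from glob import glob

# Assumed naming scheme, taken from the listing above:
# <country>_<year>_<shard>.tfrecord.gz
_RAW_NAME = re.compile(r'^(?P<survey>.+_\d{4})_(?P<shard>[^_]+)\.tfrecord\.gz$')


def group_raw_exports(raw_dir: str) -> dict[str, list[str]]:
    '''Groups exported TFRecord paths in raw_dir by their country_year prefix.'''
    groups = defaultdict(list)
    for path in sorted(glob(os.path.join(raw_dir, '*.tfrecord.gz'))):
        match = _RAW_NAME.match(os.path.basename(path))
        if match is None:
            raise ValueError(f'Unexpected file name: {path}')
        groups[match.group('survey')].append(path)
    return dict(groups)
```

Calling `group_raw_exports('data/dhs_tfrecords_raw')` would return a dict keyed by survey (for example `angola_2011`), which makes a missing or misnamed export easy to spot.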

## Instructions

This notebook processes the exported TFRecords as follows:
1. Verifies that the fields in the TFRecords match the original CSV files.
2. Splits each monolithic TFRecord file exported from Google Earth Engine into one file per record.
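
The verification in step 1 can be sketched in plain Python. This is an illustration only, assuming the record's features have already been decoded into a dict; `validate_features` and `to_float32` are hypothetical helpers, not functions from this repo. CSV floats are rounded to float32 before comparison because TFRecord float features are stored as 32-bit values.

```python
from __future__ import annotations

import struct

REQUIRED_BANDS = ['BLUE', 'GREEN', 'LAT', 'LON', 'NIGHTLIGHTS', 'NIR', 'RED',
                  'SWIR1', 'SWIR2', 'TEMP1']


def to_float32(x: float) -> float:
    '''Rounds a Python float to the nearest float32 value.'''
    return struct.unpack('f', struct.pack('f', x))[0]


def validate_features(features: dict, csv_row: dict) -> None:
    '''Raises ValueError if a required band is missing or a shared field differs.

    features: hypothetical dict of values already decoded from one TFRecord
    csv_row: the matching CSV row as a dict
    '''
    missing = [band for band in REQUIRED_BANDS if band not in features]
    if missing:
        raise ValueError(f'Missing bands: {missing}')
    for name, expected in csv_row.items():
        if name not in features:
            continue  # field was not exported to the TFRecord; nothing to compare
        actual = features[name]
        if isinstance(expected, float):
            # compare at float32 precision on both sides
            expected, actual = to_float32(expected), to_float32(actual)
        if actual != expected:
            raise ValueError(
                f'Field {name!r}: TFRecord has {actual!r}, CSV has {expected!r}')
```

Skipping CSV columns that are absent from the record is a design choice made for this sketch; a stricter check could treat them as errors.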

After running this notebook, you should have three new folders (`dhs_tfrecords`, `dhsnl_tfrecords`, and `lsms_tfrecords`) under `data/`:

```
data/
    dhs_tfrecords/
        angola_2011/
            00000.tfrecord.gz
            ...
            00229.tfrecord.gz
        ...
        zimbabwe_2015/
            00000.tfrecord.gz
            ...
            00399.tfrecord.gz
    dhsnl_tfrecords/
        angola_2010/
            00000.tfrecord.gz
            ...
            07734.tfrecord.gz
        zimbabwe_2016/
            00000.tfrecord.gz
            ...
            03584.tfrecord.gz
    lsms_tfrecords/
        ethiopia_2011/
            00000.tfrecord.gz
            ...
            00326.tfrecord.gz
        uganda_2013/
            00000.tfrecord.gz
            ...
            00164.tfrecord.gz
```
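
The per-record file names in the listing above can be generated as follows. This is a minimal sketch assuming records are numbered sequentially within each survey; `split_record_path` is a hypothetical helper name.

```python
import os


def split_record_path(processed_dir: str, country: str, year: int, index: int) -> str:
    '''Returns the output path for the index-th record of a country-year survey.'''
    # 5-digit zero padding matches the listing above (e.g. 00000.tfrecord.gz)
    return os.path.join(processed_dir, f'{country}_{year}', f'{index:05d}.tfrecord.gz')
```

For example, `split_record_path('data/dhs_tfrecords', 'angola', 2011, 229)` points at `00229.tfrecord.gz` inside the `angola_2011` folder.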

This notebook also calculates the mean and standard deviation of each band for each of the 3 datasets.

## Imports and Constants

In [26]:
%load_ext autoreload
%autoreload 2

# change directory to repo root, and verify
# %cd '../'
!pwd

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
/media/matthieu/LaCie/2-mpa


In [27]:
from __future__ import annotations

from collections.abc import Iterable
from glob import glob
from pprint import pprint
import os
from typing import Optional

import numpy as np
import pandas as pd
import tensorflow as tf
from tqdm.auto import tqdm

from batchers import batcher, tfrecord_paths_utils
from helper import (
    analyze_tfrecord_batch,
    per_band_mean_std,
    print_analysis_results)

In [28]:
REQUIRED_BANDS = [
    'BLUE', 'GREEN', 'LAT', 'LON', 'NIGHTLIGHTS', 'NIR', 'RED',
    'SWIR1', 'SWIR2', 'TEMP1']

BANDS_ORDER = [
    'BLUE', 'GREEN', 'RED', 'SWIR1', 'SWIR2', 'TEMP1', 'NIR',
    'DMSP', 'VIIRS']


DHSNL_EXPORT_FOLDER = 'data/landsat_7_raw'
DHSNL_PROCESSED_FOLDER = 'data/landsat_7_records'

In [29]:
# def map_filename(row):
#     row.filename=str(row.lat).replace(".","")+"_"+str(row.lon).replace(".","")+".tif"
#     return row

In [None]:
CSV = os.path.join('.', 'data', 'wealth_index.csv')
csv = pd.read_csv(CSV)
# csv = csv.apply(map_filename, axis=1)
# csv.to_csv(CSV,index=False)

In [4]:
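def tensor_to_string(data, variable):
    '''Decodes the first entry of data[variable] (a sequence of character
    codes) into a str, with all '.' characters removed.'''
    filename = (data[variable].numpy())[0]
    return "".join([chr(item) for item in filename]).replace('.','')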

In [39]:
def process_dataset(csv_path: str, input_dir: str, processed_dir: str) -> None:
    '''
    Args
    - csv_path: str, path to CSV of DHS or LSMS clusters
    - input_dir: str, path to TFRecords exported from Google Earth Engine
    - processed_dir: str, folder where to save processed TFRecords
    '''
    df = pd.read_csv(csv_path, float_precision='high', index_col=False)
    surveys = list(df.groupby(['country', 'year']).groups.keys())  # (country, year) tuples

    for country, year in surveys:
        country_year = f'{country}_{year}'
        print('Processing:', country_year)

        tfrecord_paths = glob(os.path.join(input_dir, country_year + '*'))
        out_dir = os.path.join(processed_dir, country_year)
        os.makedirs(out_dir, exist_ok=True)
        subset_df = df[(df['country'] == country) & (df['year'] == year)].reset_index(drop=True)
        validate_and_split_tfrecords(
            tfrecord_paths=tfrecord_paths, out_dir=out_dir, df=subset_df, country=country, year=year)


def validate_and_split_tfrecords(
        tfrecord_paths: Iterable[str],
        out_dir: str,
        df: pd.DataFrame,
        country,
        year
        ) -> None:
    '''Validates and splits a list of exported TFRecord files (for a
    given country-year survey) into individual TFRecords, one per cluster.

    "Validating" a TFRecord consists of 2 parts:
    1) verifying that it contains the required bands
    2) verifying that its other features match the values from the dataset CSV
    Note: only part 1 is currently implemented below.

    Args
    - tfrecord_paths: list of str, paths to exported TFRecords files
    - out_dir: str, path to dir to save processed individual TFRecords
    - df: pd.DataFrame, index is sequential and starts at 0
    - country: country of the survey, used in the output file names
    - year: year of the survey, used in the output file names
    '''
    # Exported TFRecords are GZIP-compressed; the same options are used
    # for reading the exports and writing the per-record files.
    options = tf.io.TFRecordOptions(compression_type='GZIP')

    # cast float64 => float32 and str => bytes
    for col in df.columns:
        if df[col].dtype == np.float64:
            df[col] = df[col].astype(np.float32)
        elif df[col].dtype == object:  # pandas uses 'object' type for str
            df[col] = df[col].astype(bytes)

   
    progbar = tqdm(total=len(df))

    for tfrecord_path in tfrecord_paths:
        # Iterate over the TFRecord file. The iterator yields the binary
        # representations of Example messages as strings.
        iterator = tf.compat.v1.io.tf_record_iterator(tfrecord_path, options=options)
        for record_str in iterator:
            # parse into an actual Example message
            ex = tf.train.Example.FromString(record_str)
            feature_map = ex.features.feature
            # name each output file after the survey and the first 5 characters
            # of the record's wealthpooled value (with the '.' removed)
            wealth = str(feature_map['wealthpooled'].float_list.value[0]).replace('.', '')[:5]
            index = f'{country}_{year}_{wealth}'

            for band in REQUIRED_BANDS:
                assert band in feature_map, f'Band "{band}" not in record {index} of {tfrecord_path}'
            # serialize to string and write to file
            out_path = os.path.join(out_dir, f'{index}.tfrecord.gz')
            with tf.io.TFRecordWriter(out_path, options=options) as writer:
                writer.write(ex.SerializeToString())

            progbar.update(1)
    progbar.close()

In [40]:
csv = pd.read_csv('data/wealth_index.csv')
csv.drop(["households", "wealthpooled","geometry","filename","bounding_box"], axis=1, inplace=True)
csv.to_csv('data/dhsnl_locs.csv',index=False)

In [41]:
process_dataset(
    csv_path='data/dhsnl_locs.csv',
    input_dir=DHSNL_EXPORT_FOLDER,
    processed_dir=DHSNL_PROCESSED_FOLDER)

Processing: angola_2011



100%|██████████| 229/229 [00:23<00:00,  9.83it/s]


Processing: angola_2015




KeyboardInterrupt: 

## Calculate Mean and Std-Dev for Each Band

The means and standard deviations calculated here are saved in `batchers/dataset_constants.py` as the constants `_MEANS_DHS`, `_STD_DEVS_DHS`, `_MEANS_LSMS`, and `_STD_DEVS_LSMS`.
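
A per-band mean and standard deviation can be accumulated batch by batch from a running sum and sum of squares, using the identity std = sqrt(E[x^2] - mean^2). The following is a minimal single-band sketch in plain Python, not the implementation in `helper`; the function name `streaming_mean_std` is hypothetical. It raises an error on empty input instead of returning `nan`.

```python
from __future__ import annotations

import math
from collections.abc import Iterable, Sequence


def streaming_mean_std(batches: Iterable[Sequence[float]]) -> tuple[float, float]:
    '''Returns (mean, population std) of all values seen across batches.'''
    count, total, total_sq = 0, 0.0, 0.0
    for batch in batches:
        for value in batch:
            count += 1
            total += value
            total_sq += value * value
    if count == 0:
        raise ValueError('No values to summarize; check that the list of '
                         'TFRecord paths is not empty.')
    mean = total / count
    # clamp at 0 to absorb tiny negative values caused by floating-point rounding
    variance = max(total_sq / count - mean * mean, 0.0)
    return mean, math.sqrt(variance)
```

Because only three running totals are kept, the full dataset never has to fit in memory. The sum-of-squares form can lose precision when the mean is large relative to the spread, so a numerically sturdier method such as Welford's algorithm may be preferable in that case.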

In [9]:
def calculate_mean_std(tfrecord_paths):
    '''Calculates and prints the per-band means and std-devs'''
    iter_init, batch_op = batcher.Batcher(
        tfrecord_files=tfrecord_paths,
        label_name=None,
        ls_bands='ms',
        nl_band='merge',
        batch_size=128,
        shuffle=False,
        augment=False,
        clipneg=False,
        normalize=None).get_batch()

    stats = analyze_tfrecord_batch(
        iter_init, batch_op, total_num_images=len(tfrecord_paths),
        nbands=len(BANDS_ORDER), k=10)
    means, stds = per_band_mean_std(stats=stats, band_order=BANDS_ORDER)

    print('Means:')
    pprint(means)
    print()

    print('Std Devs:')
    pprint(stds)

    print('\n========== Additional Per-band Statistics ==========\n')
    print_analysis_results(stats, BANDS_ORDER)

In [20]:
tfrecord_paths_utils.dhsnl()

array([], dtype=float64)

In [19]:
tf.compat.v1.disable_eager_execution()
calculate_mean_std(tfrecord_paths_utils.dhsnl())

Finished. Processed 0 images.
Time per batch - mean: nans, std: nans
Time to process each batch - mean: nans, std: nans
Total time: 0.163s, Num batches: 0
Means:
{'BLUE': nan,
 'DMSP': nan,
 'GREEN': nan,
 'NIR': nan,
 'RED': nan,
 'SWIR1': nan,
 'SWIR2': nan,
 'TEMP1': nan,
 'VIIRS': nan}

Std Devs:
{'BLUE': nan,
 'DMSP': nan,
 'GREEN': nan,
 'NIR': nan,
 'RED': nan,
 'SWIR1': nan,
 'SWIR2': nan,
 'TEMP1': nan,
 'VIIRS': nan}


Statistics including bad pixels
Band BLUE     - mean:        nan, std:       nan, min:         inf, max:    0.000000
Band GREEN    - mean:        nan, std:       nan, min:         inf, max:    0.000000
Band RED      - mean:        nan, std:       nan, min:         inf, max:    0.000000
Band SWIR1    - mean:        nan, std:       nan, min:         inf, max:    0.000000
Band SWIR2    - mean:        nan, std:       nan, min:         inf, max:    0.000000
Band TEMP1    - mean:        nan, std:       nan, min:         inf, max:    0.000000
Band NIR      - mean:    

2023-04-26 14:34:34.955014: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22211 MB memory:  -> device: 0, name: Quadro RTX 6000, pci bus id: 0000:04:00.0, compute capability: 7.5
2023-04-26 14:34:34.958775: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:357] MLIR V1 optimization pass is not enabled
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
  arrmean = um.true_divide(arrmean, div, out=arrmean, casting='unsafe',
  ret = ret.dtype.type(ret / rcount)
  means = sums / float(num_total_pixels)
  stds = np.sqrt(sum_sqs/float(num_total_pixels) - means**2)
  means = sums / float(total_pixels_per_band)
  stds = np.sqrt(sum_sqs/float(total_pixels_per_band) - means**2)
  means = sums / nz_pixels
  stds = np.sqrt(sum_sqs/nz_pixels - means**2)
  avg_nz_pixels = nz_pixels.astype(np.float32) / images_count
  mea