<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Pre-requisites" data-toc-modified-id="Pre-requisites-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Pre-requisites</a></span></li><li><span><a href="#Instructions" data-toc-modified-id="Instructions-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Instructions</a></span></li><li><span><a href="#Imports-and-Constants" data-toc-modified-id="Imports-and-Constants-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Imports and Constants</a></span></li><li><span><a href="#Validate-and-Split-Exported-TFRecords" data-toc-modified-id="Validate-and-Split-Exported-TFRecords-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Validate and Split Exported TFRecords</a></span></li><li><span><a href="#Calculate-Mean-and-Std-Dev-for-Each-Band" data-toc-modified-id="Calculate-Mean-and-Std-Dev-for-Each-Band-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Calculate Mean and Std-Dev for Each Band</a></span></li></ul></div>

## Pre-requisites

Go through the [`preprocessing/0_export_tfrecords.ipynb`](./0_export_tfrecords.ipynb) notebook.

Before running this notebook, you should have the following structure under the `data/` directory:

```
data/
    dhs_tfrecords_raw/
        angola_2011_00.tfrecord.gz
        ...
        zimbabwe_2015_XX.tfrecord.gz
    dhsnl_tfrecords_raw/
        angola_2010_00.tfrecord.gz
        ...
        zimbabwe_2016_XX.tfrecord.gz
    lsms_tfrecords_raw/
        ethiopia_2011_00.tfrecord.gz
        ...
        uganda_2013_XX.tfrecord.gz
```

## Instructions

This notebook processes the exported TFRecords as follows:
1. Verifies that the fields in the TFRecords match the original CSV files.
2. Splits each monolithic TFRecord file exported from Google Earth Engine into one file per record.

After running this notebook, you should have three new folders (`dhs_tfrecords`, `dhsnl_tfrecords`, and `lsms_tfrecords`) under `data/`:

```
data/
    dhs_tfrecords/
        angola_2011/
            00000.tfrecord.gz
            ...
            00229.tfrecord.gz
        ...
        zimbabwe_2015/
            00000.tfrecord.gz
            ...
            00399.tfrecord.gz
    dhsnl_tfrecords/
        angola_2010/
            00000.tfrecord.gz
            ...
            07734.tfrecord.gz
        zimbabwe_2016/
            00000.tfrecord.gz
            ...
            03584.tfrecord.gz
    lsms_tfrecords/
        ethiopia_2011/
            00000.tfrecord.gz
            ...
            00326.tfrecord.gz
        uganda_2013/
            00000.tfrecord.gz
            ...
            00164.tfrecord.gz
```

This notebook also calculates the mean and standard deviation of each band across each of the 3 datasets.
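
As a rough illustration of what a per-band mean and standard deviation involves, the sketch below accumulates running sums per band over a stream of pixel values. It is not the `per_band_mean_std` helper imported in this notebook. The function name `running_band_stats` and the input layout (an iterable of dicts mapping band name to a flat list of pixel values) are assumptions made for the example, and the toy values are made up.

```python
import math
from collections import defaultdict


def running_band_stats(images):
    """Return {band: (mean, std)} over all pixels of all images.

    images: iterable of dicts mapping band name -> flat list of pixel values
    (a hypothetical layout chosen for this sketch).
    """
    count = defaultdict(int)
    total = defaultdict(float)
    total_sq = defaultdict(float)
    for image in images:
        for band, pixels in image.items():
            count[band] += len(pixels)
            total[band] += sum(pixels)
            total_sq[band] += sum(p * p for p in pixels)

    stats = {}
    for band, n in count.items():
        mean = total[band] / n
        # population variance: E[x^2] - E[x]^2, clamped at 0 against round-off
        var = max(total_sq[band] / n - mean * mean, 0.0)
        stats[band] = (mean, math.sqrt(var))
    return stats


# made-up pixel values for illustration only
toy_images = [
    {'RED': [1.0, 2.0], 'NIR': [0.0, 4.0]},
    {'RED': [3.0, 4.0], 'NIR': [4.0, 8.0]},
]
band_stats = running_band_stats(toy_images)
```

Keeping only a count, a sum, and a sum of squares per band means the images never need to be held in memory at once, which matters when a dataset spans many thousands of TFRecords.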

## Imports and Constants

In [1]:
%load_ext autoreload
%autoreload 2

# change directory to repo root, and verify
%cd '../'
!pwd

/media/matthieu/LaCie/2-mpa
/media/matthieu/LaCie/2-mpa


In [2]:
from __future__ import annotations

from collections.abc import Iterable
from glob import glob
from pprint import pprint
import os
from typing import Optional

import numpy as np
import pandas as pd
import tensorflow as tf
from tqdm.auto import tqdm

from batchers import batcher, tfrecord_paths_utils
from helper import (
    analyze_tfrecord_batch,
    per_band_mean_std,
    print_analysis_results)



In [3]:
REQUIRED_BANDS = [
    'BLUE', 'GREEN', 'LAT', 'LON', 'NIGHTLIGHTS', 'NIR', 'RED',
    'SWIR1', 'SWIR2', 'TEMP1']

BANDS_ORDER = [
    'BLUE', 'GREEN', 'RED', 'SWIR1', 'SWIR2', 'TEMP1', 'NIR',
    'DMSP', 'VIIRS']


DHSNL_EXPORT_FOLDER = 'data/landsat_7less_archives'
DHSNL_PROCESSED_FOLDER = 'data/landsat_7less'

In [4]:
# def map_filename(row):
#     row.filename=str(row.lat).replace(".","")+"_"+str(row.lon).replace(".","")+".tif"
#     return row

In [5]:
CSV = os.path.join('.', 'data', 'dataset.csv')
csv = pd.read_csv(CSV)

In [6]:
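def tensor_to_string(data, variable):
    '''Decodes the first entry of data[variable] (a sequence of character
    codes) into a str, with all '.' characters removed.'''
    filename = (data[variable].numpy())[0]
    return "".join([chr(item) for item in filename]).replace('.','')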

In [7]:
def process_dataset(csv_path: str, input_dir: str, processed_dir: str) -> None:
    '''
    Args
    - csv_path: str, path to CSV of DHS or LSMS clusters
    - input_dir: str, path to TFRecords exported from Google Earth Engine
    - processed_dir: str, folder where to save processed TFRecords
    '''
    df = pd.read_csv(csv_path, float_precision='high', index_col=False)
    surveys = list(df.groupby(['country', 'year']).groups.keys())  # (country, year) tuples

    for country, year in surveys:
        country_year = f'{country}_{year}'
        print('Processing:', country_year)

        # Checking inside potential subfolders
        tfrecord_paths = glob(os.path.join(input_dir, country_year + '*.tfrecord.gz'))
        tfrecord_paths += glob(os.path.join(input_dir, "*", country_year + '*.tfrecord.gz'))
        tfrecord_paths += glob(os.path.join(input_dir, "*","*", country_year + '*.tfrecord.gz'))

        out_dir = os.path.join(processed_dir, country_year)
        os.makedirs(out_dir, exist_ok=True)
        subset_df = df[(df['country'] == country) & (df['year'] == year)].reset_index(drop=True)
        validate_and_split_tfrecords(
            tfrecord_paths=tfrecord_paths, out_dir=out_dir, df=subset_df, country=country, year=year)


def validate_and_split_tfrecords(
        tfrecord_paths: Iterable[str],
        out_dir: str,
        df: pd.DataFrame,
        country,
        year
        ) -> None:
    '''Validates and splits a list of exported TFRecord files (for a
    given country-year survey) into individual TFRecords, one per cluster.

    "Validating" a TFRecord consists of 2 parts:
    1) verifying that it contains the required bands
    2) verifying that its other features match the values from the dataset CSV
    Note: only part 1 is implemented below; df is used only to size the
    progress bar.

    Args
    - tfrecord_paths: list of str, paths to exported TFRecords files
    - out_dir: str, path to dir to save processed individual TFRecords
    - df: pd.DataFrame, index is sequential and starts at 0
    - country: country of the survey (currently unused)
    - year: year of the survey (currently unused)
    '''
    # The exported files are GZIP-compressed. These options are used both by the
    # iterator below, which yields the binary representations of Example
    # messages, and by the writer for the per-cluster files.
    options = tf.io.TFRecordOptions(compression_type='GZIP')

    # cast float64 => float32 and str => bytes
    for col in df.columns:
        if df[col].dtype == np.float64:
            df[col] = df[col].astype(np.float32)
        elif df[col].dtype == object:  # pandas uses 'object' type for str
            df[col] = df[col].astype(bytes)

    progbar = tqdm(total=len(df))

    for tfrecord_path in tfrecord_paths:
        iterator = tf.compat.v1.io.tf_record_iterator(tfrecord_path, options=options)
        for record_str in iterator:
            # parse into an actual Example message
            ex = tf.train.Example.FromString(record_str)
            feature_map = ex.features.feature
            # for k in feature_map: print(k)
            index = str(int(feature_map["cluster"].float_list.value[0]))

            for band in REQUIRED_BANDS:
                assert band in feature_map, f'Band "{band}" not in record {index} of {tfrecord_path}'
            # serialize to string and write to file
            out_path = os.path.join(out_dir, f'{index}.tfrecord.gz')  # file is named after the cluster ID
            with tf.io.TFRecordWriter(out_path, options=options) as writer:
                writer.write(ex.SerializeToString())

            progbar.update(1)
    progbar.close()
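
The docstring above lists a second validation step, checking a record's scalar features against the dataset CSV, which the loop does not perform. A minimal sketch of such a check follows, written against plain dicts so that it stays independent of TensorFlow. The helper name `mismatched_fields`, the tolerance, and the example field names and values are assumptions, not part of the original code.

```python
import math


def mismatched_fields(record, csv_row, rel_tol=1e-5):
    """Return the sorted names of csv_row fields that record lacks or contradicts.

    record and csv_row are plain dicts of scalar values (hypothetical layout).
    Numeric fields are compared with a relative tolerance because values
    stored as float32 do not round-trip exactly.
    """
    bad = []
    for name, expected in csv_row.items():
        if name not in record:
            bad.append(name)
            continue
        actual = record[name]
        both_numeric = (
            isinstance(expected, (int, float)) and isinstance(actual, (int, float)))
        if both_numeric:
            if not math.isclose(actual, expected, rel_tol=rel_tol):
                bad.append(name)
        elif actual != expected:
            bad.append(name)
    return sorted(bad)


# made-up values for illustration only
csv_row = {'country': 'angola', 'year': 2011, 'lat': -12.350257, 'lon': 13.534922}
good_record = {'country': 'angola', 'year': 2011.0, 'lat': -12.3502569, 'lon': 13.534922}
bad_record = {'country': 'angola', 'year': 2015.0, 'lat': -12.350257}

good_result = mismatched_fields(good_record, csv_row)  # no mismatches
bad_result = mismatched_fields(bad_record, csv_row)    # wrong year, missing lon
```

Returning the list of offending field names, rather than a bare boolean, makes it easier to report which column disagreed when an export turns out to be stale.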

In [8]:
# write a copy of the dataset CSV without the households and wealthpooled columns
csv = pd.read_csv('data/dataset.csv')
csv.drop(["households", "wealthpooled"], axis=1, inplace=True)
csv.to_csv('data/dhsnl_locs.csv', index=False)

In [9]:
process_dataset(
    csv_path='data/dhsnl_locs.csv',
    input_dir=DHSNL_EXPORT_FOLDER,
    processed_dir=DHSNL_PROCESSED_FOLDER
)

Processing: angola_2011


  0%|          | 0/229 [00:00<?, ?it/s]

Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`


100%|██████████| 229/229 [00:18<00:00, 12.55it/s]


Processing: angola_2015


  0%|          | 0/624 [00:00<?, ?it/s]


Processing: benin_2012


  0%|          | 0/745 [00:00<?, ?it/s]


Processing: benin_2017


  0%|          | 0/539 [00:00<?, ?it/s]


Processing: burkina_faso_1999


  0%|          | 0/43 [00:00<?, ?it/s]


Processing: burkina_faso_2010


  0%|          | 0/540 [00:00<?, ?it/s]


Processing: burkina_faso_2014


  0%|          | 0/247 [00:00<?, ?it/s]


Processing: burkina_faso_2017


  0%|          | 0/223 [00:00<?, ?it/s]


Processing: cameroon_2004


  0%|          | 0/82 [00:00<?, ?it/s]


Processing: cameroon_2011


  0%|          | 0/575 [00:00<?, ?it/s]


Processing: cameroon_2018


  0%|          | 0/429 [00:00<?, ?it/s]


Processing: cote_d_ivoire_1998


  0%|          | 0/53 [00:00<?, ?it/s]


Processing: cote_d_ivoire_2012


  0%|          | 0/340 [00:00<?, ?it/s]


Processing: democratic_republic_of_congo_2013


  0%|          | 0/491 [00:00<?, ?it/s]


Processing: ethiopia_2005


  0%|          | 0/134 [00:00<?, ?it/s]


Processing: ethiopia_2011


  0%|          | 0/570 [00:00<?, ?it/s]


Processing: ethiopia_2016


  0%|          | 0/621 [00:00<?, ?it/s]


Processing: ethiopia_2019


  0%|          | 0/304 [00:00<?, ?it/s]


Processing: ghana_1998


  0%|          | 0/44 [00:00<?, ?it/s]


Processing: ghana_2014


  0%|          | 0/421 [00:00<?, ?it/s]


Processing: ghana_2016


  0%|          | 0/191 [00:00<?, ?it/s]


Processing: ghana_2019


  0%|          | 0/191 [00:00<?, ?it/s]


Processing: guinea_1999


  0%|          | 0/48 [00:00<?, ?it/s]


Processing: guinea_2012


  0%|          | 0/299 [00:00<?, ?it/s]


Processing: guinea_2018


  0%|          | 0/400 [00:00<?, ?it/s]


Processing: kenya_2003


  0%|          | 0/250 [00:00<?, ?it/s]


Processing: kenya_2014


  0%|          | 0/1584 [00:00<?, ?it/s]


Processing: kenya_2015


  0%|          | 0/244 [00:00<?, ?it/s]


Processing: lesotho_2009


  0%|          | 0/394 [00:00<?, ?it/s]


Processing: lesotho_2014


  0%|          | 0/398 [00:00<?, ?it/s]


Processing: madagascar_1997


  0%|          | 0/30 [00:00<?, ?it/s]


Processing: madagascar_2011


  0%|          | 0/265 [00:00<?, ?it/s]


Processing: madagascar_2013


  0%|          | 0/273 [00:00<?, ?it/s]


Processing: madagascar_2016


  0%|          | 0/357 [00:00<?, ?it/s]


Processing: malawi_2004


  0%|          | 0/164 [00:00<?, ?it/s]


Processing: malawi_2010


  0%|          | 0/826 [00:00<?, ?it/s]


Processing: malawi_2012


  0%|          | 0/139 [00:00<?, ?it/s]


Processing: malawi_2014


  0%|          | 0/139 [00:00<?, ?it/s]


Processing: malawi_2015


  0%|          | 0/849 [00:00<?, ?it/s]


Processing: malawi_2017


  0%|          | 0/147 [00:00<?, ?it/s]


Processing: mali_1996


  0%|          | 0/36 [00:00<?, ?it/s]


Processing: mali_2006


  0%|          | 0/404 [00:00<?, ?it/s]


Processing: mali_2012


  0%|          | 0/413 [00:00<?, ?it/s]


Processing: mali_2015


  0%|          | 0/176 [00:00<?, ?it/s]


Processing: mali_2018


  0%|          | 0/327 [00:00<?, ?it/s]


Processing: mozambique_2009


  0%|          | 0/269 [00:00<?, ?it/s]


Processing: mozambique_2011


  0%|          | 0/608 [00:00<?, ?it/s]


Processing: mozambique_2015


  0%|          | 0/305 [00:00<?, ?it/s]


Processing: mozambique_2018


  0%|          | 0/221 [00:00<?, ?it/s]


Processing: nigeria_2003


  0%|          | 0/117 [00:00<?, ?it/s]


Processing: nigeria_2010


  0%|          | 0/238 [00:00<?, ?it/s]


Processing: nigeria_2013


  0%|          | 0/888 [00:00<?, ?it/s]


Processing: nigeria_2015


  0%|          | 0/321 [00:00<?, ?it/s]


Processing: nigeria_2018


  0%|          | 0/1382 [00:00<?, ?it/s]


Processing: rwanda_2010


  0%|          | 0/491 [00:00<?, ?it/s]


Processing: rwanda_2015


  0%|          | 0/491 [00:00<?, ?it/s]


Processing: rwanda_2019


  0%|          | 0/499 [00:00<?, ?it/s]


Processing: senegal_2010


  0%|          | 0/384 [00:00<?, ?it/s]


Processing: senegal_2012


  0%|          | 0/199 [00:00<?, ?it/s]


Processing: senegal_2014


  0%|          | 0/196 [00:00<?, ?it/s]


Processing: senegal_2015


  0%|          | 0/213 [00:00<?, ?it/s]


Processing: senegal_2016


  0%|          | 0/213 [00:00<?, ?it/s]


Processing: senegal_2017


  0%|          | 0/399 [00:00<?, ?it/s]


Processing: senegal_2018


  0%|          | 0/213 [00:00<?, ?it/s]


Processing: sierra_leone_2013


  0%|          | 0/434 [00:00<?, ?it/s]


Processing: sierra_leone_2016


  0%|          | 0/335 [00:00<?, ?it/s]


Processing: sierra_leone_2019


  0%|          | 0/556 [00:00<?, ?it/s]


Processing: tanzania_2003


  0%|          | 0/107 [00:00<?, ?it/s]


Processing: tanzania_2010


  0%|          | 0/457 [00:00<?, ?it/s]


Processing: tanzania_2012


  0%|          | 0/573 [00:00<?, ?it/s]


Processing: tanzania_2015


  0%|          | 0/607 [00:00<?, ?it/s]


Processing: tanzania_2017


  0%|          | 0/435 [00:00<?, ?it/s]


Processing: togo_2013


  0%|          | 0/329 [00:00<?, ?it/s]


Processing: togo_2017


  0%|          | 0/170 [00:00<?, ?it/s]


Processing: uganda_2006


  0%|          | 0/335 [00:00<?, ?it/s]


Processing: uganda_2009


  0%|          | 0/169 [00:00<?, ?it/s]


Processing: uganda_2011


  0%|          | 0/869 [00:00<?, ?it/s]


Processing: uganda_2014


  0%|          | 0/207 [00:00<?, ?it/s]


Processing: uganda_2016


  0%|          | 0/684 [00:00<?, ?it/s]


Processing: uganda_2018


  0%|          | 0/315 [00:00<?, ?it/s]


Processing: zambia_2013


  0%|          | 0/718 [00:00<?, ?it/s]


Processing: zambia_2018


  0%|          | 0/534 [00:00<?, ?it/s]


Processing: zimbabwe_2005


  0%|          | 0/396 [00:00<?, ?it/s]


Processing: zimbabwe_2010


  0%|          | 0/392 [00:00<?, ?it/s]


Processing: zimbabwe_2015


  0%|          | 0/399 [00:00<?, ?it/s]


In [10]:
tfrecord_archives = glob(os.path.join(DHSNL_PROCESSED_FOLDER, '*', '*.gz'))
tfrecord_archives[:5]

['data/landsat_7less/angola_2011/211.tfrecord.gz',
 'data/landsat_7less/angola_2011/212.tfrecord.gz',
 'data/landsat_7less/angola_2011/213.tfrecord.gz',
 'data/landsat_7less/angola_2011/214.tfrecord.gz',
 'data/landsat_7less/angola_2011/215.tfrecord.gz']
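
The listing above starts at cluster 211 rather than at the lowest cluster ID, since `glob` returns paths in arbitrary order and the file names are not zero-padded. If a deterministic order matters downstream, one option is to sort by survey folder and then by numeric cluster ID. The helper name `survey_cluster_key` and the example paths other than `211.tfrecord.gz` are assumptions for this sketch.

```python
import os


def survey_cluster_key(path):
    """Sort key: (survey folder name, cluster ID as an integer)."""
    survey = os.path.basename(os.path.dirname(path))
    cluster = os.path.basename(path).split('.')[0]
    return survey, int(cluster)


# hypothetical paths in the same layout as the listing above
paths = [
    'data/landsat_7less/angola_2011/211.tfrecord.gz',
    'data/landsat_7less/angola_2011/9.tfrecord.gz',
    'data/landsat_7less/angola_2011/30.tfrecord.gz',
    'data/landsat_7less/angola_2015/2.tfrecord.gz',
]
sorted_paths = sorted(paths, key=survey_cluster_key)
```

A plain string sort would place `211` before `30` and `9`, which is why the key converts the cluster ID to an integer.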