# Partition Map Creation

This notebook uses the DHS cluster data to partition the clusters into training and validation sets.


## File System Structure

## Input

DHS data is used as the basis for creating partition maps for each country based on the location of clusters. 

<pre style="font-family: monospace;">
./GIS-Image-Stack-Processing
    /DHS
        /Country-specific folders containing DHS files
</pre>

## Output
<pre style="font-family: monospace;">
./GIS-Image-Stack-Processing
    /AOI/
        Partitions/
            PK/
                <span style="color: blue;">PK_all.json</span> 
                <span style="color: blue;">PK_train.json</span> 
                <span style="color: blue;">PK_valid.json</span> 
            TD/
                <span style="color: blue;">TD_all.json</span> 
                <span style="color: blue;">TD_train.json</span> 
                <span style="color: blue;">TD_valid.json</span> 
</pre>


## Required Configurations

The following configuration is required for each execution of this notebook: the two-letter country code.

<pre style="font-family: monospace;">
<span style="color: blue;">country_code  = 'PK'</span>      # Set the country code
</pre>

In [1]:
#-------------------------------------------------
# REQUIRED CONFIGURATIONS HERE
#-------------------------------------------------
country_code  = 'PK'      # Set the country code

In [2]:
import os
import sys
import json

In [3]:
sys.path.append('./GIS-Image-Stack-Processing')  # Adjust path if `gist_utils` is moved
# Import module that contains several convenience functions (e.g., gdal wrappers)
from gist_utils import *

from gist_utils.aoi_configurations import aoi_configurations

In [4]:
GIS_ROOT = './GIS-Image-Stack-Processing'
PRT_ROOT = './GIS-Image-Stack-Processing/AOI/Partitions'

In [5]:
train_partition = os.path.join(PRT_ROOT, f'{country_code}', f'{country_code}_train.json')
valid_partition = os.path.join(PRT_ROOT, f'{country_code}', f'{country_code}_valid.json')
all_partition   = os.path.join(PRT_ROOT, f'{country_code}', f'{country_code}_all.json')

## DHS Data Configuration

In [6]:
shapefile_path = os.path.join(GIS_ROOT, aoi_configurations[country_code]['shapefile'])
recode_hr_path = os.path.join(GIS_ROOT, aoi_configurations[country_code]['recode_hr'])
recode_kr_path = os.path.join(GIS_ROOT, aoi_configurations[country_code]['recode_kr'])

# DHS Column Headings
dhs_cluster_field  = 'DHSCLUST'
dhs_lat_field      = 'LATNUM'
dhs_lon_field      = 'LONGNUM'

# New names for the DHS column headings
cluster_id   = 'cluster_id'
cluster_lat  = 'lat'
cluster_lon  = 'lon'

# The following mappings are used to rename DHS column headings to more meaningful names
cluster_column_mapping = {
    dhs_cluster_field: cluster_id,
    dhs_lat_field: cluster_lat,
    dhs_lon_field: cluster_lon
}

# DHS Household recode column name mapping
hr_column_mapping = {
    'HV001': cluster_id,
    'HV201': 'water_access',
    'HV206': 'electricity_access',
    'HV208': 'radio_access',
    'HV209': 'television_access',
    'HV270': 'wealth_index'
}

# DHS Child recode column name mapping
kr_column_mapping = {
    'V001': cluster_id,
    'H7': 'dpt1',
    'H8': 'dpt2',
    'H9': 'dpt3'
}
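
The mappings above are meant to select and rename DHS columns. As a minimal sketch of that pattern, the snippet below applies a household-recode-style mapping to a small made-up DataFrame. The helper name `apply_column_mapping` and the sample values are hypothetical and are not part of `gist_utils`.

```python
import pandas as pd

# Hypothetical helper (not part of gist_utils): keep only the mapped
# columns and rename them to the friendlier names.
def apply_column_mapping(df, column_mapping):
    missing = [col for col in column_mapping if col not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in DataFrame: {missing}")
    return df[list(column_mapping)].rename(columns=column_mapping)

# Made-up rows standing in for a household recode table
sample_hr = pd.DataFrame({
    'HV001': [1, 1, 2],
    'HV206': [1, 0, 1],
    'HV270': [3, 2, 5],
    'HV000': ['PK7', 'PK7', 'PK7'],   # extra column that is not in the mapping
})

sample_mapping = {
    'HV001': 'cluster_id',
    'HV206': 'electricity_access',
    'HV270': 'wealth_index',
}

renamed = apply_column_mapping(sample_hr, sample_mapping)
print(list(renamed.columns))
```

Raising on missing columns is a design choice for this sketch: it surfaces a mismatched recode file early instead of silently producing a table with fewer columns.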

## Extract DHS Cluster Data

In [7]:
cluster_df, erroneous_cluster_ids = extract_cluster_data(shapefile_path, dhs_cluster_field, dhs_lat_field, dhs_lon_field)

Erroneous clusters detected and removed: [535]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cluster_data[cluster_field] =         cluster_data[cluster_field].astype(float).astype(int)


In [8]:
print(erroneous_cluster_ids)

[535]
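
How `extract_cluster_data` decides that a cluster is erroneous is not shown in this notebook. One plausible check, sketched here purely as an assumption, is to flag clusters whose coordinates are (0, 0), a placeholder that DHS GPS datasets commonly use for clusters without a valid location. The helper `find_zero_coordinate_clusters` and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical check (an assumption, not the gist_utils implementation):
# flag clusters whose latitude and longitude are both exactly zero.
def find_zero_coordinate_clusters(df, id_col, lat_col, lon_col):
    mask = (df[lat_col] == 0) & (df[lon_col] == 0)
    return df.loc[mask, id_col].tolist()

# Made-up rows using the DHS column headings from this notebook
toy_clusters = pd.DataFrame({
    'DHSCLUST': [1, 2, 3],
    'LATNUM': [36.45, 0.0, 35.17],
    'LONGNUM': [72.57, 0.0, 71.83],
})

flagged = find_zero_coordinate_clusters(toy_clusters, 'DHSCLUST', 'LATNUM', 'LONGNUM')
print(flagged)  # [2]
```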


In [9]:
# Use the mapping to select and rename columns
cluster_df = cluster_df[list(cluster_column_mapping.keys())].rename(columns=cluster_column_mapping)

print(cluster_df.head())
print(cluster_df.shape[0])

   cluster_id        lat        lon
0           1  36.449918  72.571558
1           2  35.891914  71.726873
2           3  35.169566  71.834458
3           4  35.424729  72.163931
4           5  35.005696  71.776478
560


## Create Partition Maps

The function below writes three partition map files (train, valid, and all), each listing the cluster IDs that belong to that partition. Clusters are currently split by an input longitude threshold: those with a longitude greater than the threshold go to train, and the rest go to validation.
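
To make the threshold rule concrete, here is a minimal in-memory sketch on made-up coordinates. The helper `split_by_longitude` and the toy values are illustrative assumptions. It uses the same strict comparison as `generate_partition_maps`, so a cluster exactly on the threshold lands in validation.

```python
import pandas as pd

# Hypothetical in-memory version of the threshold rule (no files are written)
def split_by_longitude(df, id_col, lon_col, threshold):
    is_train = df[lon_col] > threshold          # strictly greater goes to train
    train_ids = df.loc[is_train, id_col].tolist()
    valid_ids = df.loc[~is_train, id_col].tolist()
    return train_ids, valid_ids

# Made-up clusters; cluster 2 sits exactly on the threshold
toy = pd.DataFrame({
    'cluster_id': [1, 2, 3, 4],
    'lon': [68.5, 70.0, 71.2, 74.9],
})

train_ids, valid_ids = split_by_longitude(toy, 'cluster_id', 'lon', threshold=70.0)
print(train_ids)  # [3, 4]
print(valid_ids)  # [1, 2]
```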

In [16]:
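def generate_partition_maps(cluster_df, cluster_id, cluster_lon, country_code, longitude_threshold,
                            output_train='train.json', output_valid='valid.json', output_all='all.json'):
    """Write train, valid, and all partition maps as JSON files.

    Clusters with a longitude greater than `longitude_threshold` are assigned
    to the train partition; all other clusters are assigned to the valid
    partition. Each file maps `country_code` to a list of cluster IDs.
    """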
    # Extract the cluster IDs and longitudes from the DataFrame
    cluster_ids = cluster_df[cluster_id].tolist()
    longitudes = cluster_df[cluster_lon].tolist()

    # Initialize lists for the training, validation, and all partition maps
    train_partition = []
    valid_partition = []
    all_partition = cluster_ids  # This will include all cluster IDs

    # Assign cluster IDs to the appropriate partition based on the longitude threshold
    for cid, lon in zip(cluster_ids, longitudes):
        if lon > longitude_threshold:
            train_partition.append(cid)
        else:
            valid_partition.append(cid)

    # Prepare dictionary structures for JSON
    train_partition_map = {f"{country_code}": train_partition}
    valid_partition_map = {f"{country_code}": valid_partition}
    all_partition_map = {f"{country_code}": all_partition}

    # Ensure the output directories exist before saving the JSON files
    for output_file in [output_train, output_valid, output_all]:
        output_dir = os.path.dirname(output_file)
        if output_dir:
            os.makedirs(output_dir, exist_ok=True)

    # Save the train partition map to a JSON file
    with open(output_train, 'w') as f:
        json.dump(train_partition_map, f, indent=4)
    print(f"Train partition map saved to: {output_train}")

    # Save the valid partition map to a JSON file
    with open(output_valid, 'w') as f:
        json.dump(valid_partition_map, f, indent=4)
    print(f"Valid partition map saved to: {output_valid}")

    # Save the all partition map to a JSON file
    with open(output_all, 'w') as f:
        json.dump(all_partition_map, f, indent=4)
    print(f"All partition map saved to:   {output_all}")


In [17]:
# Use the CRS longitude for the AOI as the threshold for partitioning the clusters. This could be improved
# to partition the data more appropriately, but it is sufficient for prototyping the capability.
longitude_threshold = aoi_configurations[country_code]['crs_lon']

generate_partition_maps(cluster_df, 
                        cluster_id, 
                        cluster_lon, 
                        country_code, 
                        longitude_threshold, 
                        train_partition, 
                        valid_partition,
                        all_partition)

Train partition map saved to: ./GIS-Image-Stack-Processing/AOI/Partitions/PK/PK_train.json
Valid partition map saved to: ./GIS-Image-Stack-Processing/AOI/Partitions/PK/PK_valid.json
All partition map saved to:   ./GIS-Image-Stack-Processing/AOI/Partitions/PK/PK_all.json
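
Because a single longitude threshold can produce an uneven split, it may be worth checking the saved files. The sketch below is one possible sanity check, not part of the notebook: the helpers `load_partition_ids` and `check_partitions` are hypothetical, and the demonstration uses made-up files with a placeholder country code `XX` written to a temporary directory.

```python
import json
import os
import tempfile

# Hypothetical helper: load a partition map and return its cluster IDs as a set
def load_partition_ids(path, country_code):
    with open(path) as f:
        return set(json.load(f)[country_code])

# Hypothetical check: train and valid should not overlap, and together
# they should cover exactly the IDs in the "all" partition.
def check_partitions(train_path, valid_path, all_path, country_code):
    train_ids = load_partition_ids(train_path, country_code)
    valid_ids = load_partition_ids(valid_path, country_code)
    all_ids = load_partition_ids(all_path, country_code)
    return {
        'disjoint': train_ids.isdisjoint(valid_ids),
        'complete': (train_ids | valid_ids) == all_ids,
        'train_fraction': len(train_ids) / len(all_ids) if all_ids else 0.0,
    }

# Demonstration on made-up files written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_maps = {'train': [3, 4], 'valid': [1, 2], 'all': [1, 2, 3, 4]}
    paths = {}
    for name, ids in toy_maps.items():
        paths[name] = os.path.join(tmp_dir, f'XX_{name}.json')
        with open(paths[name], 'w') as f:
            json.dump({'XX': ids}, f, indent=4)

    report = check_partitions(paths['train'], paths['valid'], paths['all'], 'XX')

print(report)
```

For the files written by this notebook, the equivalent call would be `check_partitions(train_partition, valid_partition, all_partition, country_code)`.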
