# Create AOI Image Tiles 

This notebook is used to create image tiles at DHS survey locations for a given AOI using resampled and spatially aligned GeoTiff raster files that were previously created for the specific AOI using `02_prep_geospatial_data.ipynb`. The following processing steps are performed:

1. **Define configurations**: Required input files, AOI parameters, file naming conventions, etc.
2. **Extract survey data**: Retrieve Cluster IDs and GPS coordinates (lat, lon) from the Pakistan DHS shapefile
3. **Load AOI Raster Data**
4. **Convert GPS locations**: Transform each survey GPS location to the same CRS as the AOI image stack
5. **Identify points within the AOI**: Loop through all GPS locations to find which ones are within the AOI image stack
6. **Crop image tiles**:

    - For each GPS location, find the nearest vertex in the image stack
    - Use that vertex as the center of a bounding box to crop an image tile (224 x 224) for each data type
    - Save each cropped image tile as a data sample with a clearly identifiable file name
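
The cropping step (item 6) comes down to simple grid arithmetic. The following is a minimal sketch, assuming a north-up raster with square pixels and a known upper-left origin. `tile_window` and its parameters are hypothetical names used only for illustration; the notebook's actual cropping is done by `crop_raster_rasterio`, whose implementation is not shown here.

```python
def tile_window(x, y, origin_x, origin_y, pixel_size=400.0, tile_size=224):
    """Return (col_off, row_off, width, height) for a tile centered near (x, y).

    Assumes a north-up raster: (origin_x, origin_y) is the upper-left corner,
    x grows to the east and y shrinks as the row index increases.
    """
    # Nearest grid vertex (pixel corner) to the projected survey location
    col = round((x - origin_x) / pixel_size)
    row = round((origin_y - y) / pixel_size)

    # With an even tile size, centering on a vertex splits the tile evenly
    half = tile_size // 2
    return col - half, row - half, tile_size, tile_size


# Example: a point 50 km east and 50 km south of the raster origin
window = tile_window(50_000.0, 50_000.0, origin_x=0.0, origin_y=100_000.0)
print(window)  # (13, 13, 224, 224)
```

At 400 m pixels, a 224 x 224 tile covers 89.6 km on a side. A negative offset in the returned window would mean the tile extends past the raster edge.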
    
## File System Structure

The top-level file structure includes six folders and three notebooks:

<pre style="font-family: monospace;">
<span style="color: blue;">./AOI       </span>  <span style="color: gray;"># AOI Image Stacks and Image Tiles</span>  
<span style="color: blue;">./DHS       </span>  <span style="color: gray;"># DHS survey data</span>
<span style="color: blue;">./gist_utils</span>  <span style="color: gray;"># Python package with convenience functions</span>
<span style="color: gray;">./Nightlights</span>
<span style="color: gray;">./Population</span>
<span style="color: gray;">./Rainfall</span>

<span style="color: gray;">./01_prep_rainfall_gpm.ipynb</span>
<span style="color: gray;">./02_prep_geospatial_data.ipynb</span>
<span style="color: blue;">./03_prep_aoi_image_tiles.ipynb (this notebook)</span>
</pre>


## Input

The input GeoTiff files for this notebook are generated by `02_prep_geospatial_data.ipynb` and follow this file structure:

<pre style="font-family: monospace;">
./Nightlights/
    output/PK/
        N_VNL_v22_npp-j01_2022_global_vcmslcfg_median_PK_4_resampled_bilinear.tif

./Population/
    output/PK/
        P_landscan-global-2022_PK_4_resampled_nearest.tif
            
./Rainfall/
    output/PK/
        R_GPM_2001-2022.01.V07B_PK_avg_PK_4_resampled_bilinear.tif      
</pre>

The first code cell in this notebook copies these resampled files into a new file structure to create an Image Stack for the specified country.


## Output

This notebook produces the following parallel file structure, containing image tiles for each DHS survey location and data type for the specified country. Additionally, a GDAL virtual raster (VRT) file is created for each data type, referencing all of that data type's image tiles. These VRT files provide a convenient way to load the raster image tiles in QGIS for visual inspection.

<pre style="font-family: monospace;">
./AOI/
    PK/
        Image_Tiles/
            Nightlights/
                # Cropped image tiles at each DHS cluster location.
                PK_1_C-2_Nightlights_2022_400m.tif
                PK_2_C-3_Nightlights_2022_400m.tif
                    :
                PK_265_C-415_Nightlights_2022_400m.tif
                
            Population/
                PK_1_C-2_Population_2022_400m.tif
                    :
                
            Rainfall/
                PK_1_C-2_Rainfall_2001-2022_400m.tif
                    :
            
        PK_Nightlights_2022_400m.vrt
        PK_Population_2022_400m.vrt
        PK_Rainfall_2001-2022_400m.vrt
</pre>
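
The tile names above appear to follow the pattern `<country>_<index>_C-<cluster id>_<suffix>.tif`. The sketch below builds and parses names of that form. It is an illustration only: the meaning of the first number (a running index over the retained survey points) is inferred from the example names, and `tile_filename` and `parse_tile_filename` are hypothetical helpers, not functions from the project's utilities package.

```python
import re


def tile_filename(country_code, index, cluster_id, suffix):
    """Build a tile name such as PK_1_C-2_Nightlights_2022_400m.tif."""
    return f"{country_code}_{index}_C-{cluster_id}_{suffix}.tif"


# Assumed layout: two-letter country code, running index, cluster ID, free-form suffix
TILE_PATTERN = re.compile(
    r"^(?P<country>[A-Z]{2})_(?P<index>\d+)_C-(?P<cluster>\d+)_(?P<suffix>.+)\.tif$"
)


def parse_tile_filename(name):
    """Split a tile name back into (country, index, cluster_id, suffix)."""
    match = TILE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected tile file name: {name}")
    return match["country"], int(match["index"]), int(match["cluster"]), match["suffix"]


print(tile_filename("PK", 1, 2, "Nightlights_2022_400m"))
print(parse_tile_filename("PK_265_C-415_Nightlights_2022_400m.tif"))
```

Parsing the cluster ID back out of the file name is useful when joining tiles to DHS survey metrics by cluster.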

## File Prep [One Time Copy]

Each of the resampled GeoTiff files generated by `02_prep_geospatial_data.ipynb` for each data type and specified country should be copied to the corresponding Image_Stack folder, as shown below. This file structure constitutes the "Image Stack" for the specified country.

The following code cell automates this copying process. Once the Image Stack in the file structure below has been populated, the remainder of this notebook can be executed to create Image Tiles for the specified country.
<pre style="font-family: monospace;">
./AOI/
    PK/
        Image_Stack/
            # Resampled, spatially aligned image stack.
            N_VNL_v22_npp-j01_2022_global_vcmslcfg_median_PK_4_resampled_bilinear.tif
            P_landscan-global-2022_PK_4_resampled_nearest.tif
            R_GPM_2001-2022.01.V07B_PK_avg_PK_4_resampled_bilinear.tif
            
</pre>

## Required Configurations

Once the desired AOI is specified using its two-letter country code, the notebook can be executed to produce image tiles for each of the three data types.

<pre style="font-family: monospace;">
<span style="color: blue;">country_code = 'PK'</span>  # Set the country code to one of the available AOIs in the list below

Available AOIs: AM (Armenia)
                JO (Jordan), but not for use with ResNet18 due to lack of DHS metrics
                MA (Morocco)
                MB (Moldova)
                ML (Mali)
                MR (Mauritania)
                NI (Niger)
                PK (Pakistan)
                SN (Senegal)
                TD (Chad)
</pre>
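
Because every downstream path is derived from `country_code`, it can help to validate the value before running the rest of the notebook. The following is a minimal sketch: the `AVAILABLE_AOIS` dictionary simply restates the list above, and `validate_country_code` is a hypothetical helper that is not part of the project's utilities.

```python
# Country codes restated from the list of available AOIs
AVAILABLE_AOIS = {
    "AM": "Armenia",
    "JO": "Jordan",
    "MA": "Morocco",
    "MB": "Moldova",
    "ML": "Mali",
    "MR": "Mauritania",
    "NI": "Niger",
    "PK": "Pakistan",
    "SN": "Senegal",
    "TD": "Chad",
}


def validate_country_code(code):
    """Return the normalized two-letter code, or raise if it is not a known AOI."""
    normalized = code.strip().upper()
    if normalized not in AVAILABLE_AOIS:
        options = ", ".join(sorted(AVAILABLE_AOIS))
        raise ValueError(f"Unknown country code {code!r}; expected one of: {options}")
    return normalized


country_code = validate_country_code("pk")
print(country_code, AVAILABLE_AOIS[country_code])  # PK Pakistan
```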


In [1]:
import os
import shutil
import glob as glb
from osgeo import gdal
from dataclasses import dataclass

cache_dir = 'project_utils/__pycache__'
if os.path.exists(cache_dir):
    shutil.rmtree(cache_dir)

# Import module that contains several convenience functions (e.g., gdal wrappers)
from project_utils import *

#----------------------------------------------------------------------------------------
# *** IMPORTANT: SYSTEM PATH TO SET ***
#----------------------------------------------------------------------------------------
# The following path is required, as it contains GDAL binaries used for several 
# pre-processing functions. The pathname corresponds to the Conda virtual environment 
# created for this project (e.g., "py39-pt").
#
# Note: GDAL was adopted as a benchmark to compare the original GIS data produced by 
# another team. However, similar functionality could be implemented using the Rasterio 
# Python package. If Rasterio is used, it would eliminate the need for GDAL binaries 
# and this system path specification.
#----------------------------------------------------------------------------------------

os.environ['PATH'] += ':/Users/billk/miniforge3/envs/py39-pt/bin/' 
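
The hard-coded path above is specific to one machine. A more portable alternative is sketched below, assuming the notebook kernel runs inside the same Conda environment that holds the GDAL binaries and that those binaries live in `<env>/bin` (the usual POSIX layout). `add_env_bin_to_path` is a hypothetical helper, not part of the project.

```python
import os
import shutil
import sys


def add_env_bin_to_path(env=None):
    """Append the active environment's bin directory to PATH if it is missing."""
    env = os.environ if env is None else env
    # Assumption: GDAL command-line tools live in <env>/bin (POSIX Conda layout)
    bin_dir = os.path.join(sys.prefix, "bin")
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if bin_dir not in parts:
        parts.append(bin_dir)
        env["PATH"] = os.pathsep.join(parts)
    return bin_dir


bin_dir = add_env_bin_to_path()
# shutil.which returns None when the tool cannot be found on PATH
print(bin_dir, shutil.which("gdalbuildvrt"))
```

Using `os.pathsep` instead of a literal `:` keeps the separator correct across platforms, and the membership check avoids appending the same directory each time the cell is re-run.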



## 1 Specify the Country Code for the AOI

In [2]:
#-------------------------------------------------
# REQUIRED CONFIGURATIONS HERE
#-------------------------------------------------
country_code = 'PK'   # Set the country code
#-------------------------------------------------

# Set to True to copy re-sampled data to create the Image Stack for the specified country.
#-----------------------------------------------------------------------------------------
make_image_stack = True  # Recommended setting: True
#-----------------------------------------------------------------------------------------

In [3]:
if make_image_stack:

    # Data types to copy, plus the source and destination base paths
    data_types = ['Rainfall', 'Nightlights', 'Population']
    source_base = './'
    destination_base = './AOI/'

    # Function to create directory if it doesn't exist
    def ensure_dir(directory):
        if not os.path.exists(directory):
            os.makedirs(directory)

    # Scan each data type in the source directory
    for data_type in data_types:
        source_path = os.path.join(source_base, data_type, 'output', country_code)

        # Look for TIFF files directly in the source path and its immediate contents
        file_search_path = os.path.join(source_path, '*resampled*.tif')
        
        # Use glob to find files that match the pattern
        for file_path in glb.glob(file_search_path):
            file_name = os.path.basename(file_path)
            dir_name = os.path.basename(os.path.dirname(file_path))

            # Build the destination path
            destination_path = os.path.join(destination_base, country_code, 'Image_Stack')
            # Ensure the destination directory exists
            ensure_dir(destination_path)
            # Copy the file to the destination
            shutil.copy(file_path, destination_path)
            print(f"Copied {file_path} to {destination_path}")

    print("File copying completed.")


Copied ./Rainfall/output/PK/R_GPM_2001-2022.01.V07B_PK_avg_PK_4_resampled_bilinear.tif to ./AOI/PK/Image_Stack
Copied ./Nightlights/output/PK/N_VNL_v22_npp-j01_2022_global_vcmslcfg_median_PK_4_resampled_bilinear.tif to ./AOI/PK/Image_Stack
Copied ./Population/output/PK/P_landscan-global-2022_PK_4_resampled_nearest.tif to ./AOI/PK/Image_Stack
File copying completed.


### Define CRS Based on AOI

In [4]:
shapefile_path = aoi_configurations[country_code]['shapefile']

crs_lat = aoi_configurations[country_code]['crs_lat']
crs_lon = aoi_configurations[country_code]['crs_lon']

#------------------------------------------------------------------------------------------------------------
# A Lambert Azimuthal Equal Area (LAEA) projection CRS is used, which requires the definition of a CRS 
# origin (crs_lat, crs_lon). Each AOI defined in the aoi_configurations.py module contains these coordinates.
#------------------------------------------------------------------------------------------------------------
expected_crs = f'+proj=laea +lat_0={crs_lat} +lon_0={crs_lon} +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs'

case = country_code
print(shapefile_path)

DHS/PK_2017-18_DHS/PKGE71FL/PKGE71FL.shp
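
For reference, the lookups above imply that each `aoi_configurations` entry holds at least a shapefile path and a CRS origin. The sketch below shows one way such an entry and the resulting PROJ string could fit together. The origin values are placeholders chosen for illustration, not the project's actual settings, and `example_configurations` and `laea_proj_string` are hypothetical names.

```python
# Hypothetical configuration entry; only the shapefile path is taken from the notebook output
example_configurations = {
    "PK": {
        "shapefile": "DHS/PK_2017-18_DHS/PKGE71FL/PKGE71FL.shp",
        "crs_lat": 30.0,  # placeholder origin latitude, not the project's value
        "crs_lon": 70.0,  # placeholder origin longitude, not the project's value
    }
}


def laea_proj_string(lat_0, lon_0):
    """Build a PROJ string for a Lambert Azimuthal Equal Area CRS centered on (lat_0, lon_0)."""
    return (
        f"+proj=laea +lat_0={lat_0} +lon_0={lon_0} "
        "+x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs"
    )


cfg = example_configurations["PK"]
print(laea_proj_string(cfg["crs_lat"], cfg["crs_lon"]))
```

Centering the projection on the AOI keeps distortion low near the survey locations, and the meter units are what make a fixed 400 m pixel size meaningful.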


### Set Pathnames

In [5]:
# Shape file fields
cluster_field  = 'DHSCLUST'
lat_field      = 'LATNUM'
lon_field      = 'LONGNUM'

expected_pixel_size  = (400, 400)    # This should match the pixel size in the input rasters

# Set the resolution for programmatic file naming below
res = expected_pixel_size[0]

# Build a list of the rasters (produced by: 02_prep_geospatial_data.ipynb)
aoi_image_stack_folder = f'./AOI/{country_code}/Image_Stack/'

# List all files in the image stack
aoi_image_stack_paths = sorted([os.path.join(aoi_image_stack_folder, file) for file in os.listdir(aoi_image_stack_folder) if file.endswith('.tif')])

# Image tile outputs (folders where image tiles are stored for each data type)
image_tile_folders = []

image_tile_folders.append(f'./AOI/{country_code}/Image_Tiles/Nightlights/')
image_tile_folders.append(f'./AOI/{country_code}/Image_Tiles/Population/')
image_tile_folders.append(f'./AOI/{country_code}/Image_Tiles/Rainfall/')

image_tile_suffixes = []
image_tile_suffixes.append(f'Nightlights_2022_{res}m')
image_tile_suffixes.append(f'Population_2022_{res}m')
image_tile_suffixes.append(f'Rainfall_2001-2022_{res}m')

# VRT filename suffix
vrt_file_suffixes = []
vrt_file_suffixes.append(f'Nightlights_2022_{res}m')
vrt_file_suffixes.append(f'Population_2022_{res}m')
vrt_file_suffixes.append(f'Rainfall_2001-2022_{res}m')

In [6]:
print(aoi_image_stack_paths[0])
print(aoi_image_stack_paths[1])
print(aoi_image_stack_paths[2])
print('\n')
print(image_tile_folders)

./AOI/PK/Image_Stack/N_VNL_v22_npp-j01_2022_global_vcmslcfg_median_PK_4_resampled_bilinear.tif
./AOI/PK/Image_Stack/P_landscan-global-2022_PK_4_resampled_nearest.tif
./AOI/PK/Image_Stack/R_GPM_2001-2022.01.V07B_PK_avg_PK_4_resampled_bilinear.tif


['./AOI/PK/Image_Tiles/Nightlights/', './AOI/PK/Image_Tiles/Population/', './AOI/PK/Image_Tiles/Rainfall/']


In [7]:
if os.path.exists(shapefile_path):
    print("File exists.")
else:
    print("File does not exist.")

File exists.


## 2 Load AOI Image Stacks

In [8]:
# Initialize a list to store the results
results = []

# Loop through each raster path and call the function
for path in aoi_image_stack_paths:
    
    raster, crs_match, pixel_size_match = load_raster(path, expected_crs, expected_pixel_size)
    result = {
        'path': path,
        'raster': raster, 
        'crs_match': crs_match,
        'pixel_size_match': pixel_size_match
    }
    results.append(result)
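
The `pixel_size_match` flag returned by `load_raster` is not explained in this notebook. One plausible check is sketched below, assuming the raster's georeferencing is available as a GDAL-style geotransform tuple. `pixel_size_matches` is a hypothetical helper and may differ from what `load_raster` actually does.

```python
import math


def pixel_size_matches(geotransform, expected_pixel_size, tol=1e-6):
    """Compare a raster's pixel size against the expected (width, height) in CRS units.

    GDAL geotransform layout:
    (x_origin, pixel_width, row_rotation, y_origin, column_rotation, pixel_height)
    where pixel_height is negative for north-up rasters.
    """
    width = abs(geotransform[1])
    height = abs(geotransform[5])
    expected_width, expected_height = expected_pixel_size
    return (
        math.isclose(width, expected_width, abs_tol=tol)
        and math.isclose(height, expected_height, abs_tol=tol)
    )


# Example: a north-up raster with 400 m pixels
print(pixel_size_matches((0.0, 400.0, 0.0, 100_000.0, 0.0, -400.0), (400, 400)))  # True
```

Comparing with a tolerance rather than `==` avoids false mismatches from floating-point noise introduced during resampling.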

### Check Image Stack Metadata 

In [9]:
# Check 
# !gdalinfo {aoi_image_stack_paths[0]}
# !gdalinfo {aoi_image_stack_paths[1]}
# !gdalinfo {aoi_image_stack_paths[2]}

## 3 Extract Cluster Data from Shapefile

In [10]:
cluster_data, erroneous_cluster_ids = extract_cluster_data(shapefile_path, cluster_field, lat_field, lon_field)

Erroneous clusters detected:
Cluster ID: 535, Latitude: 0.0, Longitude: 0.0


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cluster_data[cluster_field] = cluster_data[cluster_field].astype(float).astype(int)


In [11]:
print(erroneous_cluster_ids)

[535]
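
Cluster 535 is flagged because its coordinates are (0.0, 0.0), which lies far outside the AOI and typically indicates a missing GPS reading. A minimal sketch of that kind of filter is shown below using plain tuples. `split_valid_clusters` is a hypothetical helper, and the actual `extract_cluster_data` function may apply different or additional checks.

```python
def split_valid_clusters(records):
    """Separate (cluster_id, lat, lon) records into valid rows and erroneous IDs.

    A record is treated as erroneous when both coordinates are exactly zero.
    """
    valid = []
    erroneous_ids = []
    for cluster_id, lat, lon in records:
        if lat == 0.0 and lon == 0.0:
            erroneous_ids.append(cluster_id)
        else:
            valid.append((cluster_id, lat, lon))
    return valid, erroneous_ids


# Example records: one usable location and one placeholder at (0, 0)
records = [(2, 35.89, 71.73), (535, 0.0, 0.0)]
valid, erroneous_ids = split_valid_clusters(records)
print(valid)          # [(2, 35.89, 71.73)]
print(erroneous_ids)  # [535]
```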


### Optional Data Inspection

In [12]:
# Print a few records
# for idx in range(0,9):
#     print(cluster_data[idx])   
    
# Print data for a specific row (positional index, not the cluster ID)
row_index = 1
row = cluster_data.iloc[row_index]
cluster_id, lat, lon = row[cluster_field], row[lat_field], row[lon_field]

print(f"({cluster_id}, {lat:.2f}, {lon:.2f})")

(2.0, 35.89, 71.73)


In [13]:
# Iterating over a DataFrame yields its column labels, so this prints the column names
for item in cluster_data[:10]:
    print(item)


DHSCLUST
LATNUM
LONGNUM


In [14]:
# Initialize a flag to check if all CRS matches and a list to collect mismatched CRS details
all_crs_match = True
mismatched_details = []

for result in results:
    if not result['crs_match']:
        all_crs_match = False
        # Collecting detailed information about the mismatch, including the file path and the actual CRS
        mismatched_details.append(f"Path: {result['path']}, Raster CRS: {result['raster'].crs}")

if all_crs_match:
    
    # Assuming cluster_data is a DataFrame returned from extract_cluster_data
    cluster_data_tuples = list(cluster_data.to_records(index=False))

    crs_coordinates = convert_cluster_coordinates(cluster_data_tuples, src_crs='EPSG:4326', dst_crs=expected_crs)
    print(len(crs_coordinates))
        
else:
    print("*** Error: CRS does not match.")
    for detail in mismatched_details:
        print(detail)

560


## 4 Find DHS Points that Fall within the Input Raster Files
The input raster files were previously cropped, projected to the same CRS, and resampled to the same pixel size. The cropping was extended beyond the AOI so that image tiles can be created for DHS locations near the AOI border.

In [15]:
all_pixel_match = all(result['pixel_size_match'] for result in results)

if all_pixel_match:
   
    # Collect number of points in each dataset for comparison
    number_of_points_per_dataset = []

    for result in results:
        raster = result['raster']
        
        points_within_raster = find_points_within_raster(raster, crs_coordinates, expected_crs)
        
        # Store the points within raster into the results dictionary for each raster
        result['points_within_raster'] = points_within_raster
        number_of_points_per_dataset.append(len(points_within_raster))
        print(f"Points w/in bounds: {result['path']}: {len(points_within_raster)}\n")

    # Check if all datasets have the same number of points
    if len(set(number_of_points_per_dataset)) == 1:
        print("All datasets have the same number of points.")
    else:
        print("Warning: Datasets have varying numbers of points. Here are the counts per dataset:", number_of_points_per_dataset)

else:
    print("Pixel size does not match for one or more rasters.")


Points w/in bounds: ./AOI/PK/Image_Stack/N_VNL_v22_npp-j01_2022_global_vcmslcfg_median_PK_4_resampled_bilinear.tif: 560

Points w/in bounds: ./AOI/PK/Image_Stack/P_landscan-global-2022_PK_4_resampled_nearest.tif: 560

Points w/in bounds: ./AOI/PK/Image_Stack/R_GPM_2001-2022.01.V07B_PK_avg_PK_4_resampled_bilinear.tif: 560

All datasets have the same number of points.
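
The bounds test performed by `find_points_within_raster` is not shown in this notebook. The sketch below illustrates two possible variants under stated assumptions: a plain point-in-bounds test, and a stricter test that requires the full 224 x 224 tile to fit inside the raster. Both functions are hypothetical, and which behavior the project's function implements is not confirmed here.

```python
def point_in_bounds(x, y, bounds):
    """Return True if (x, y) lies inside bounds = (left, bottom, right, top)."""
    left, bottom, right, top = bounds
    return left <= x <= right and bottom <= y <= top


def tile_fits_in_raster(x, y, bounds, pixel_size=400.0, tile_size=224):
    """Return True if a tile centered on (x, y) lies entirely inside the raster bounds."""
    left, bottom, right, top = bounds
    # Half the tile extent in CRS units (112 pixels * 400 m = 44.8 km by default)
    margin = (tile_size / 2) * pixel_size
    return (left + margin <= x <= right - margin) and (bottom + margin <= y <= top - margin)


bounds = (0.0, 0.0, 200_000.0, 200_000.0)
print(point_in_bounds(100.0, 100.0, bounds))              # True
print(tile_fits_in_raster(100.0, 100.0, bounds))          # False
print(tile_fits_in_raster(100_000.0, 100_000.0, bounds))  # True
```

The margin explains why the input rasters were cropped beyond the AOI: a survey point close to the border still needs roughly 45 km of raster on every side to yield a complete tile.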


## 5 Crop Image Tiles from AOI Image Stack and Build VRT File
Loop over each data type, stored in memory as a raster, and crop an image tile of the specified size for each of the survey points in the `points_within_raster` list. Additionally, build a VRT file that references the image tiles for each data type. The VRT facilitates loading a large number of image tiles in QGIS for visualization purposes.

In [16]:
def build_vrt(image_tile_folder, vrt_file):
   
    # Get a list of all .tif files in the directory
    tif_files = glb.glob(os.path.join(image_tile_folder, "*.tif"))

    # Create a new VRT dataset
    vrt_options = gdal.BuildVRTOptions(VRTNodata=-999)
    vrt = gdal.BuildVRT(vrt_file, tif_files, options=vrt_options)

    # Check if the VRT dataset was created successfully
    if vrt is None:
        print("Failed to build VRT")
    else:
        vrt.FlushCache()  # Write to disk
        print(f"VRT built successfully at {vrt_file}")

In [17]:
# Loop over each result with its corresponding tile folder and file name suffixes
for result, image_tile_folder, image_tile_suffix, vrt_file_suffix in zip(
        results, image_tile_folders, image_tile_suffixes, vrt_file_suffixes):

    raster = result['raster']
    aoi_name = country_code

    # Use the points stored for this raster rather than the variable left over from the previous cell
    crop_raster_rasterio(raster, result['points_within_raster'], aoi_name, image_tile_suffix,
                         image_tile_folder, tile_size=224, debug=False)

    # Construct the VRT filename
    vrt_file = f"./AOI/{country_code}/{country_code}_{vrt_file_suffix}.vrt"

    build_vrt(image_tile_folder, vrt_file)

Crops are saved in ./AOI/PK/Image_Tiles/Nightlights/




VRT built successfully at ./AOI/PK/PK_Nightlights_2022_400m.vrt
Crops are saved in ./AOI/PK/Image_Tiles/Population/
VRT built successfully at ./AOI/PK/PK_Population_2022_400m.vrt
Crops are saved in ./AOI/PK/Image_Tiles/Rainfall/
VRT built successfully at ./AOI/PK/PK_Rainfall_2001-2022_400m.vrt
