# Data Preprocessing

The analysis is split across three notebooks:
1) data_collection.ipynb: downloads data from Google Earth Engine (can be slow)
2) data_processing.ipynb (this notebook): merges the data into a dataframe that is ready to use for regression
3) regression.ipynb: performs the regression and prediction of the aboveground biomass and calculates the carbon stored

In this notebook, we:
* Load the data from part 1
* Combine them into a geodataframe
* Save the dataframe

## Imports

In [9]:
import rasterio
import geopandas as gpd
from shapely.geometry import Point
import numpy as np
import glob

## Data Merging

In [None]:
# Get a list of all the tif files in the data folder
tif_files = glob.glob("./data/*.tif")
# Print the list to check that all required bands are present
print(tif_files)

# Extract the band name from each file name
band_name = [
    f.split("_", 1)[1].removesuffix(".tif")  # Keep everything after the first "_" and drop the ".tif" suffix
    for f in tif_files
]

print(band_name)
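
The printed list above has to be checked by eye. As a minimal sketch, that check could be automated by comparing the band names found on disk against the bands that appear in the merge log of this notebook. The helper names `band_from_path` and `missing_bands` and the `site_` file prefix are hypothetical, introduced only for illustration; the real file prefix is not shown here.

```python
from pathlib import Path

# Band names taken from the "added band" log in this notebook
REQUIRED_BANDS = {
    "agbd", "agbd_se", "B2", "B3", "B4", "B5", "B6", "B7", "B8", "B8A",
    "B9", "B11", "B12", "DEM", "Map", "slope", "VH", "VV",
}


def band_from_path(tif_path):
    """Return everything after the first underscore in the file stem."""
    stem = Path(tif_path).stem  # drops the directory and the ".tif" suffix
    _prefix, sep, band = stem.partition("_")
    if not sep:
        raise ValueError(f"no underscore in file name: {tif_path}")
    return band


def missing_bands(tif_paths, required=REQUIRED_BANDS):
    """Return the sorted list of required bands that have no matching file."""
    found = {band_from_path(p) for p in tif_paths}
    return sorted(set(required) - found)


# Hypothetical file names, for illustration only
example = ["./data/site_agbd.tif", "./data/site_agbd_se.tif", "./data/site_B8A.tif"]
print([band_from_path(p) for p in example])
print(missing_bands(example, required={"agbd", "agbd_se", "B8A", "VV"}))
```

Working on the file stem rather than the full path means an underscore in a directory name would not affect the result.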

In [12]:
# Build one GeoDataFrame with one column per band
gdf = None
for tif_path, b in zip(tif_files, band_name):
    # 1. Open the raster
    with rasterio.open(tif_path) as src:
        band = src.read(1)  # read the first band
        transform = src.transform
        crs = src.crs  # keep the CRS so it can be used after the file is closed
        nodata = src.nodata

    # Mask out nodata (a NaN nodata value cannot be matched with !=)
    if nodata is None or np.isnan(nodata):
        mask = ~np.isnan(band)
    else:
        mask = band != nodata

    # 2. Get row/col indices of valid pixels
    rows, cols = np.where(mask)

    # 3. Convert pixel indices to coordinates (centers)
    xs, ys = rasterio.transform.xy(transform, rows, cols, offset='center')

    # 4. Create geometries and values
    geometries = [Point(x, y) for x, y in zip(xs, ys)]
    values = band[rows, cols]

    # 5. Create a GeoDataFrame for this band
    gdf_band = gpd.GeoDataFrame({b: values}, geometry=geometries, crs=crs)
    if gdf is None:
        # The first band starts the combined GeoDataFrame
        gdf = gdf_band
    else:
        # Inner join: only points that are valid in every band are kept
        gdf = gdf.merge(gdf_band, on='geometry')
    print('added band: {}'.format(b))

# Done!
gdf.head()

added band: agbd
added band: agbd_se
added band: B11
added band: B12
added band: B2
added band: B3
added band: B4
added band: B5
added band: B6
added band: B7
added band: B8
added band: B8A
added band: B9
added band: DEM
added band: Map
added band: slope
added band: VH
added band: VV


,agbd,geometry,agbd_se,B11,B12,B2,B3,B4,B5,B6,B7,B8,B8A,B9,DEM,Map,slope,VH,VV
0,0.0,POINT (1149650 6626150),0.0,2049.0,1146.0,541.5,811.5,919.0,1314.5,2826.0,3361.0,3360.0,3505.0,3514.5,327.73822,40.0,0.0,-19.844246,-12.075358
1,0.0,POINT (1149750 6626150),7.693663,2202.5,1446.0,563.869836,901.5,790.5,1383.0,2717.0,3160.0,3183.5,3281.5,3310.5,329.029572,40.0,0.0,-20.196077,-12.001568
2,0.0,POINT (1149850 6626150),0.0,1878.0,1062.0,563.329912,785.5,697.5,1199.0,2759.0,3394.0,3493.0,3532.0,3493.0,330.489929,40.0,0.0,-19.692963,-11.77757
3,23.708748,POINT (1149950 6626150),0.0,1990.0,1164.0,560.608821,805.0,680.0,1263.0,2723.0,3371.0,3466.0,3560.0,3577.0,331.260406,40.0,0.0,-19.450946,-11.32432
4,0.0,POINT (1150050 6626150),0.0,2102.0,1263.0,544.206777,840.0,718.0,1251.0,2766.0,3321.0,3439.0,3512.0,3612.0,331.604492,40.0,0.0,-19.274971,-10.915781
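
The point coordinates in the table come from mapping pixel indices to pixel centers through the raster's affine transform. The sketch below shows that arithmetic in plain Python, as an illustration of what the `offset='center'` conversion does rather than a replacement for `rasterio.transform.xy`. The function name `pixel_center` is hypothetical, and the example grid (100 m pixels, north-up, upper-left corner at (1149600, 6626200)) is an assumption chosen so that the first pixel center matches the first point in the table; it is not read from the rasters.

```python
def pixel_center(transform, row, col):
    """Map a (row, col) pixel index to the x/y of the pixel center.

    `transform` holds the six affine coefficients (a, b, c, d, e, f), where
    x = a*col + b*row + c and y = d*col + e*row + f.
    """
    a, b, c, d, e, f = transform
    col_c, row_c = col + 0.5, row + 0.5  # shift from the pixel corner to its center
    x = a * col_c + b * row_c + c
    y = d * col_c + e * row_c + f
    return x, y


# Assumed grid for illustration: 100 m pixels, north-up,
# upper-left corner at (1149600, 6626200). Not read from the data.
example_transform = (100.0, 0.0, 1149600.0, 0.0, -100.0, 6626200.0)

print(pixel_center(example_transform, 0, 0))  # (1149650.0, 6626150.0)
print(pixel_center(example_transform, 0, 1))  # (1149750.0, 6626150.0)
```

Because every band is converted with the same rule, rasters that share a grid produce identical points, which is what lets the merge on `geometry` line the bands up.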


In [13]:
# Quick check of how many GEDI measurements there are
gdf[gdf['agbd']>0]

,agbd,geometry,agbd_se,B11,B12,B2,B3,B4,B5,B6,B7,B8,B8A,B9,DEM,Map,slope,VH,VV
3,23.708748,POINT (1149950 6626150),0.0,1990.000000,1164.0,560.608821,805.0,680.0,1263.00,2723.000000,3371.000000,3466.000000,3560.000000,3577.000000,331.260406,40.0,0.0,-19.450946,-11.324320
12,24.602869,POINT (1150850 6626150),0.0,2405.000000,1238.0,427.827603,714.0,507.0,1258.00,3213.000000,3838.000000,4003.000000,4204.000000,4220.000000,331.996826,30.0,0.0,-20.328034,-11.988618
13,27.554802,POINT (1150950 6626150),0.0,2537.000000,1331.0,368.188948,724.0,567.0,1311.00,3196.000000,3756.000000,3961.000000,4095.000000,4298.000000,328.666260,30.0,0.0,-21.366247,-14.345321
14,24.813965,POINT (1151050 6626150),0.0,2577.000000,1417.0,442.013166,746.5,587.5,1356.00,3181.000000,3621.500000,3902.500000,4078.000000,4356.000000,327.028748,30.0,0.0,-20.210041,-13.898526
15,26.184776,POINT (1151150 6626150),0.0,2582.500000,1349.0,452.419690,755.0,595.5,1333.00,3392.000000,3948.500000,4133.000000,4338.000000,4426.500000,324.636414,30.0,0.0,-18.751516,-11.692136
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70410,24.111988,POINT (1172650 6655150),0.0,3042.000000,2623.0,489.771630,920.0,1173.0,1435.00,2201.500000,2378.400000,2575.600000,2673.000000,2715.500000,188.362442,40.0,0.0,-21.165814,-12.660450
70411,24.038567,POINT (1172750 6655150),0.0,2469.666667,1791.5,507.601167,978.5,1018.0,1436.25,2193.500000,2478.500000,2591.500000,2725.333333,2913.000000,188.588898,40.0,0.0,-20.746647,-12.225602
70412,29.834408,POINT (1172850 6655150),0.0,2439.500000,1530.5,588.658185,873.0,810.0,1421.00,2665.333333,3015.500000,3216.750000,3310.000000,3671.666667,190.553406,40.0,0.0,-19.850170,-12.385821
70413,23.845720,POINT (1172950 6655150),0.0,2906.500000,1671.5,427.449142,974.5,962.5,1638.00,3137.750000,3593.666667,3833.000000,3992.500000,4031.000000,190.315063,30.0,0.0,-20.238022,-13.453694


## Saving

In [14]:
# Save data for regression
gdf.to_file('data/full_data.json', driver='GeoJSON')

In [15]:
# Save a smaller version with just the required matching points for training
matched_df = gdf[gdf['agbd']>0]

In [None]:
# Save the training data
matched_df.to_file('data/train_data.json', driver='GeoJSON')
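
Since GeoJSON is plain JSON, a saved file can be sanity-checked with the standard library alone. The sketch below counts the features in a FeatureCollection and how many of them have `agbd` greater than zero. The function name `summarize_feature_collection` is hypothetical, and the inline example is a tiny hand-written stand-in for the saved file, using two rows copied from the table above.

```python
import json


def summarize_feature_collection(fc, column="agbd"):
    """Return (number of features, number of features with column > 0)."""
    if fc.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    features = fc["features"]
    positive = sum(
        1 for feat in features
        if (feat["properties"].get(column) or 0) > 0
    )
    return len(features), positive


# Hand-written stand-in for a saved file. For the real data, load it with:
#     with open("data/train_data.json") as fh:
#         fc = json.load(fh)
example = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"agbd": 23.708748, "agbd_se": 0.0},
     "geometry": {"type": "Point", "coordinates": [1149950.0, 6626150.0]}},
    {"type": "Feature",
     "properties": {"agbd": 0.0, "agbd_se": 0.0},
     "geometry": {"type": "Point", "coordinates": [1149650.0, 6626150.0]}}
  ]
}
""")

print(summarize_feature_collection(example))  # (2, 1)
```

For the training file, the two counts should be equal, because only rows with `agbd` greater than zero are kept.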

# End Part 2

Alright, this was a very short section. But I like to keep my regression analysis in a separate notebook that reads its input data from a file, which avoids Jupyter Notebook pitfalls and keeps the analysis reproducible.