# Process Sites

In this notebook we preprocess our lidar sites. This includes:
- Reprojection
- Cropping to the site geometry
- (Optional) classifying ground points
- Computing height above ground
- Flagging statistical outliers (noise)
- Saving the cloud as a Cloud Optimized Point Cloud (COPC)

To process the data we will use PDAL pipelines.
We will also use dask to run the processing in parallel.

In [1]:
from pathlib import Path
import json

from jinja2 import Template
import geopandas as gpd
import pdal
import pandas as pd

## PDAL Pipeline Template

Each PDAL run is defined by a pipeline, represented as a JSON string. Our processing is mostly the same for every site; only a few site variables change between runs (e.g. the name of the input file and the polygon to clip to). To substitute these variables we'll use the Jinja template engine, where variables are written as `{{ variable_name }}`.
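As a small illustration of the substitution step, the sketch below renders a one-stage template and shows what happens when a variable is missing. The single-stage template and the file name `example.laz` are made up for the example. By default Jinja renders an undefined variable as an empty string, which would silently produce a pipeline with an empty filename; passing `undefined=StrictUndefined` is one way to turn that into an error instead.

```python
import json

from jinja2 import StrictUndefined, Template
from jinja2.exceptions import UndefinedError

# A one-stage template, for illustration only
template_text = json.dumps([{"type": "readers.las", "filename": "{{ input_path }}"}])

# Normal case: the variable is supplied and the result is valid JSON
rendered = Template(template_text).render({"input_path": "example.laz"})
stages = json.loads(rendered)

# Default behaviour: a missing variable silently becomes an empty string
silent = json.loads(Template(template_text).render({}))

# Strict behaviour: a missing variable raises an error at render time
try:
    Template(template_text, undefined=StrictUndefined).render({})
    missing_detected = False
except UndefinedError:
    missing_detected = True

print(stages[0]["filename"], repr(silent[0]["filename"]), missing_detected)
```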

In [None]:
# Note: we could write this as a plain JSON string, but building it as a Python list of dicts lets us add comments
pipeline_template_dict = [
    # Read the input LAS file
    {
        "type": "readers.las",
        "filename": "{{ input_path }}"
    },
    # Reproject to MGA2020 + Aus Height Datum
    {
        "type": "filters.reprojection",
        "out_srs": "EPSG:7855+5711"
    },
    # Crop to our site polygon
    # This is optional, but useful as our source sites are quite large
    # and we don't need the whole thing
    {
        "type": "filters.crop",
        "polygon": "{{ polygon_wkt }}"
    },
    # Note: if you want to calculate your own ground classification,
    # do so here. We'll keep the ground classification provided by VirtualTas.
    #
    # Calculate height above ground
    {
        "type": "filters.hag_nn"
    },
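    # Flag noise points. Note: filters.outlier only classifies outliers as noise
    # (Classification 7 by default); it does not drop them from the cloud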
    {
        "type": "filters.outlier",
        "method": "statistical",
        "mean_k": 6,
        "multiplier": 10
    },
    # Save as a COPC file
    {
        "type": "writers.copc",
        "filename": "{{ output_path }}",
        "forward": "scale,offset",
        "extra_dims": "all"
    }
]

pipeline_template = json.dumps(pipeline_template_dict, indent=2)

# Function to replace variables
def replace_pipeline_variables(pipeline_template: str, context: dict):
    template = Template(pipeline_template)
    return template.render(context)
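To give a feel for what the `mean_k` and `multiplier` options in the `filters.outlier` stage control, here is a rough, brute-force sketch of the general statistical outlier idea: compute each point's mean distance to its `mean_k` nearest neighbours, then flag points whose value exceeds the global mean plus `multiplier` standard deviations. This is not PDAL's implementation; the function name, the toy 2D points, and the smaller parameter values (`mean_k=3`, `multiplier=3.0`) are assumptions chosen so the example stays tiny.

```python
import math
from statistics import mean, pstdev


def flag_statistical_outliers(points, mean_k, multiplier):
    """Return indices of points flagged as outliers (illustrative, O(n^2))."""
    mean_dists = []
    for i, p in enumerate(points):
        # Distances from this point to every other point, nearest first
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_dists.append(mean(dists[:mean_k]))
    # Global threshold: mean of the per-point values plus multiplier * std dev
    threshold = mean(mean_dists) + multiplier * pstdev(mean_dists)
    return [i for i, d in enumerate(mean_dists) if d > threshold]


# A regular 5 x 5 grid plus one isolated point far away from it
grid = [(float(x), float(y)) for x in range(5) for y in range(5)]
points = grid + [(100.0, 100.0)]

outliers = flag_statistical_outliers(points, mean_k=3, multiplier=3.0)
print(outliers)
```

A larger `multiplier`, such as the 10 used in the template, makes the threshold more permissive, so only points that are very isolated relative to the rest of the cloud get flagged.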

### Site pipelines

We use some data from our previously created `sites.geojson` to create the pdal pipelines.

In [None]:
import geopandas as gpd

sites_gdf = gpd.read_file("../data/outputs/sites/sites.geojson")
sites_gdf.head()

| | site | id | geometry |
|---|---|---|---|
| 0 | AGG_O_01 | AGG_O_01 | POLYGON ((463303.79 5259716.755, 463123.329 52... |
| 1 | AGG_O_05 | AGG_O_05 | POLYGON ((455430.465 5284117.991, 455191.829 5... |
| 2 | AGG_O_07 | AGG_O_07 | POLYGON ((464747.381 5299156.759, 464706.037 5... |
| 3 | AGG_Y_02 | AGG_Y_02 | POLYGON ((491855.984 5230960.243, 491825.641 5... |
| 4 | AGG_Y_03 | AGG_Y_03 | POLYGON ((490748.752 5208804.286, 490664.016 5... |


There are three variables in the pipeline above:
- `input_path` - Where to source the lidar file for that site
- `output_path` - Where to save the processed lidar file
- `polygon_wkt` - The polygon for that site in well-known text (WKT) format
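Because `input_path` is derived from the site id, it can be worth checking that every source file exists before any pipelines are built. The sketch below is one way to do that, assuming source files are named `<site id>.laz` as in the cell that follows. The helper name `find_missing_inputs` is hypothetical, and the temporary directory only stands in for the real source folder.

```python
import tempfile
from pathlib import Path


def find_missing_inputs(site_ids, source_dir):
    """Return the site ids that have no '<site id>.laz' file in source_dir."""
    source_dir = Path(source_dir)
    return [site_id for site_id in site_ids if not (source_dir / f"{site_id}.laz").exists()]


# Demonstration with a throwaway directory containing a single (empty) file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "AGG_O_01.laz").touch()
    missing = find_missing_inputs(["AGG_O_01", "AGG_O_05"], tmp)

print(missing)
```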

In [28]:
data_dir = Path("../data")
lidar_source_dir = data_dir / "source" / "cycle-2"  # cycle-2 has best coverage
lidar_output_dir = data_dir / "outputs" / "sites" / "lidar"
lidar_output_dir.mkdir(parents=True, exist_ok=True)

def create_pipeline_from_site(site_row):
    site_id = site_row['id']

    context = {
        "input_path": str(lidar_source_dir / f"{site_id}.laz"),
        "output_path": str(lidar_output_dir / f"{site_id}.copc.laz"),
        "polygon_wkt": site_row.geometry.wkt
    }
    return replace_pipeline_variables(pipeline_template, context)

pipelines = sites_gdf.apply(create_pipeline_from_site, axis=1).to_list()
print(pipelines[0])


[
  {
    "type": "readers.las",
    "filename": "../data/source/cycle-2/AGG_O_01.laz"
  },
  {
    "type": "filters.reprojection",
    "out_srs": "EPSG:7855+5711"
  },
  {
    "type": "filters.crop",
    "polygon": "POLYGON ((463303.78970947 5259716.755331744, 463123.3287796102 5259746.770813609, 463046.07269726234 5259761.7233045045, 462950.04695010127 5259805.900973975, 462980.706530192 5259890.517699114, 463221.4219030455 5259835.555207977, 463327.6009013069 5259803.548345104, 463303.78970947 5259716.755331744))"
  },
  {
    "type": "filters.hag_nn"
  },
  {
    "type": "filters.outlier",
    "method": "statistical",
    "mean_k": 6,
    "multiplier": 10
  },
  {
    "type": "writers.copc",
    "filename": "../data/outputs/sites/lidar/AGG_O_01.copc.laz",
    "forward": "scale,offset",
    "extra_dims": "all"
  }
]
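Since each rendered pipeline is plain JSON, it can be sanity-checked before it is handed to PDAL. The sketch below parses a rendered pipeline and reports string options that are empty or still contain `{{`, plus a first stage that is not a reader. The function name `check_rendered_pipeline` and the two small example pipelines are assumptions for illustration; the checks say nothing about whether PDAL itself will accept the pipeline.

```python
import json


def check_rendered_pipeline(pipeline_json: str) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    stages = json.loads(pipeline_json)  # raises json.JSONDecodeError on invalid JSON
    problems = []
    for index, stage in enumerate(stages):
        for key, value in stage.items():
            # An empty string or leftover '{{' suggests a variable was not substituted
            if isinstance(value, str) and ("{{" in value or value == ""):
                problems.append(
                    f"stage {index} ({stage.get('type')}): '{key}' looks unrendered or empty"
                )
    if not stages or not str(stages[0].get("type", "")).startswith("readers."):
        problems.append("first stage is not a reader")
    return problems


good = json.dumps([
    {"type": "readers.las", "filename": "in.laz"},
    {"type": "writers.copc", "filename": "out.copc.laz"},
])
bad = json.dumps([
    {"type": "readers.las", "filename": ""},
    {"type": "writers.copc", "filename": "out.copc.laz"},
])

print(check_rendered_pipeline(good))
print(check_rendered_pipeline(bad))
```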


## Processing Pipelines

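PDAL is built around processing these pipelines. With the Python bindings, we build a `pdal.Pipeline` from the JSON string and call `execute()`, which returns the number of points processed.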

In [37]:
def process_pdal_pipeline(pipeline: str, return_data: bool = False):
    """
    Process a PDAL pipeline string.

    Args:
        pipeline (str): The PDAL pipeline JSON string.
        return_data (bool): If True, also return the executed PDAL Pipeline object.
            It holds the metadata and every point processed by the pipeline, so it
            can be large. Defaults to False.

    Returns:
        tuple: (point count, Pipeline object or None).
    """
    pipeline_obj = pdal.Pipeline(pipeline)
    count = pipeline_obj.execute()  # Run the pipeline; returns the number of points processed
    return (count, pipeline_obj if return_data else None)

Processing a single pipeline can take some time. In the run below, the first site (about 3.6 million points) took roughly 34 seconds.

In [None]:
%%time

(count, pl) = process_pdal_pipeline(pipelines[0], return_data=True)
print(f"Processed {count} points.")

points = pl.arrays[0]  # NumPy structured array of the processed points
points_df = pd.DataFrame(points)
points_df.head()

Processed 3572396 points.
CPU times: user 39 s, sys: 490 ms, total: 39.5 s
Wall time: 34.1 s


| | X | Y | Z | Intensity | ReturnNumber | NumberOfReturns | ScanDirectionFlag | EdgeOfFlightLine | Classification | Synthetic | ... | ScanAngleRank | UserData | PointSourceId | GpsTime | ScanChannel | Red | Green | Blue | Infrared | HeightAboveGround |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 462980.355861 | 5259889.0 | 507.670006 | 30398 | 2 | 2 | 0 | 0 | 0 | 0 | ... | -0.054 | 5 | 1 | 411879500.0 | 0 | 23901 | 23644 | 25186 | 32023 | 0.041999 |
| 1 | 462980.235861 | 5259889.0 | 509.570006 | 29123 | 1 | 2 | 0 | 0 | 0 | 0 | ... | -0.054 | 14 | 1 | 411879500.0 | 0 | 20046 | 19275 | 21074 | 34147 | 1.942 |
| 2 | 462980.500862 | 5259890.0 | 507.506004 | 31053 | 3 | 3 | 0 | 0 | 2 | 0 | ... | -0.054 | 4 | 1 | 411879500.0 | 0 | 20560 | 20817 | 21588 | 35447 | 0.0 |
| 3 | 462980.264861 | 5259889.0 | 511.261006 | 27495 | 2 | 3 | 0 | 0 | 0 | 0 | ... | -0.054 | 23 | 1 | 411879500.0 | 0 | 20817 | 20817 | 21845 | 40629 | 3.632999 |
| 4 | 462980.632862 | 5259890.0 | 507.461003 | 29995 | 2 | 2 | 0 | 0 | 2 | 0 | ... | -0.054 | 4 | 1 | 411879500.0 | 0 | 15420 | 15420 | 17219 | 33208 | 0.0 |


### Parallel processing with Dask

In [36]:
from dask.distributed import Client

client = Client()  # Start a Dask client
client

Client (connection method: Cluster object), cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status
Workers: 4, total threads: 8, total memory: 16.00 GiB
Status: running, using processes: True
Each worker: 2 threads, 4.00 GiB memory


In [40]:
futures = client.map(process_pdal_pipeline, pipelines, key=sites_gdf['id'].tolist())

In [46]:
# TODO - better error management and reporting
print(futures[0].result())

(3572396, None)
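One possible direction for the error-handling TODO is to wrap the processing function so that a failure in one site is recorded rather than raised, giving a per-site report at the end. The sketch below is only an outline: `run_site` and `fake_runner` are hypothetical names, and `fake_runner` stands in for `process_pdal_pipeline` so the example runs without PDAL or Dask.

```python
def run_site(site_id, pipeline, runner):
    """Run one pipeline and return a small result record instead of raising."""
    try:
        count, _ = runner(pipeline)  # runner returns (count, pipeline object or None)
        return {"site": site_id, "count": count, "error": None}
    except Exception as exc:
        # Keep the error type and message so failures can be reported per site
        return {"site": site_id, "count": None, "error": f"{type(exc).__name__}: {exc}"}


def fake_runner(pipeline):
    """Stand-in for process_pdal_pipeline, used only in this example."""
    if "bad" in pipeline:
        raise RuntimeError("could not open file")
    return (100, None)


jobs = [("site_a", "ok pipeline"), ("site_b", "bad pipeline")]
results = [run_site(site_id, pipeline, fake_runner) for site_id, pipeline in jobs]

failed = [r["site"] for r in results if r["error"] is not None]
print(failed)
```

With Dask, the same wrapper could presumably be submitted as `client.map(run_site, sites_gdf['id'].tolist(), pipelines, runner=process_pdal_pipeline)`, so that each future resolves to a record like the ones above even when a site fails.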


# Special Cases - Cloud Noise