Add spatial stratification algorithm for splitting datasets into training and testing #433

RaczeQ · 2024-04-12T11:00:07Z

Add an algorithm for splitting the dataset based on spatial location instead of random sampling.

RaczeQ · 2024-04-12T11:09:53Z

Sources:

sabman · 2024-04-14T10:16:23Z

@RaczeQ Thanks for creating this issue. I'd like to see if I can contribute. I am assuming this is in reference to the training loop for the embedding models? If so, can you also reference the code module where this might be used? I am guessing it's this:

srai/srai/embedders/hex2vec/neighbour_dataset.py

Line 154 in 3e7a787

def _get_random_negative_df_loc(self, input_df_loc: int) -> int:

RaczeQ · 2024-04-21T18:30:19Z

Hello @sabman, thank you for showing interest in expanding the library 😊

I've created this issue specifically with end-tasks in mind, and I was planning on leaving the training of the embedding models (hex2vec, geovex, etc.) unchanged - those will still be fitted on the whole provided dataset.

However, now that you've mentioned it, I can see a potential use case in combination with the existing embedders, just for benchmarking purposes:

1. Prepare regions / features geodataframes.
2. Split them into training and validation data.
3. Train embedder on training data.
4. Transform validation data (with both encoder and decoder) and calculate the loss between the decoded and original values.
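
The benchmarking steps above could be sketched roughly as follows. This is only an illustration of the evaluation loop: `TruncatingEmbedder`, its `fit` / `encode` / `decode` methods and `reconstruction_mse` are hypothetical names made up for this sketch and are not part of the srai API, and plain lists of floats stand in for the features geodataframe.

```python
from statistics import fmean


class TruncatingEmbedder:
    """Toy stand-in for an embedder (hypothetical, NOT an srai class)."""

    def __init__(self, embedding_size: int) -> None:
        self.embedding_size = embedding_size
        self.means: list[float] = []

    def fit(self, rows: list[list[float]]) -> None:
        # Learn per-column means from the training rows only.
        self.means = [fmean(column) for column in zip(*rows)]

    def encode(self, row: list[float]) -> list[float]:
        # Centre the row and keep only the first `embedding_size` columns.
        centred = [value - mean for value, mean in zip(row, self.means)]
        return centred[: self.embedding_size]

    def decode(self, embedding: list[float]) -> list[float]:
        # Columns dropped by `encode` are reconstructed as the training mean.
        padded = embedding + [0.0] * (len(self.means) - len(embedding))
        return [value + mean for value, mean in zip(padded, self.means)]


def reconstruction_mse(embedder: TruncatingEmbedder, rows: list[list[float]]) -> float:
    """Mean squared error between original rows and their decoded embeddings."""
    errors = [
        (original - decoded) ** 2
        for row in rows
        for original, decoded in zip(row, embedder.decode(embedder.encode(row)))
    ]
    return fmean(errors)


# Steps 1-2: features are assumed to be prepared and already split.
train_rows = [[1.0, 10.0], [3.0, 30.0]]
validation_rows = [[2.0, 25.0]]

# Step 3: fit on the training part only.
embedder = TruncatingEmbedder(embedding_size=1)
embedder.fit(train_rows)

# Step 4: encode + decode the validation part and measure the loss.
validation_loss = reconstruction_mse(embedder, validation_rows)
```

The point of the sketch is only that the embedder never sees the validation rows during fitting, so the loss measures how well it generalises to held-out regions.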

Currently we don't have any specific examples with downstream tasks in the documentation, but there is one in our dedicated tutorial repository (https://github.com/kraina-ai/srai-tutorial).
I think of this functionality as a future utility for taking a given geodataframe and assigning a stratification class based on a geometry (or a more sophisticated scenario with class column AND geometry).
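
To make the "class based on a geometry" idea a bit more concrete, here is a minimal sketch of one simple option, a spatial block split: every geometry is mapped to a square grid cell and whole cells are assigned to classes. This is not the algorithm chosen in this issue, just an illustration. The function name `assign_spatial_blocks` and its parameters are hypothetical, and the sketch assumes centroids were already extracted as `(x, y)` tuples (for example from `regions_gdf.geometry.centroid`).

```python
import math
import random


def assign_spatial_blocks(
    centroids: list[tuple[float, float]],
    cell_size: float,
    split_fractions: list[float],
    seed: int = 0,
) -> list[int]:
    """Return one class index per centroid; points in the same grid cell share a class."""
    # Map every centroid to the integer id of the square grid cell containing it.
    cells = [(math.floor(x / cell_size), math.floor(y / cell_size)) for x, y in centroids]

    # Shuffle the distinct cells reproducibly.
    unique_cells = sorted(set(cells))
    random.Random(seed).shuffle(unique_cells)

    # Cumulative cut points, expressed in "number of cells".
    total = sum(split_fractions)
    boundaries = []
    cumulative = 0.0
    for fraction in split_fractions:
        cumulative += fraction / total
        boundaries.append(cumulative * len(unique_cells))

    # A cell gets the first class whose cut point lies beyond its position;
    # the fallback to the last class guards against float rounding.
    last_class = len(split_fractions) - 1
    cell_to_class = {
        cell: next((i for i, b in enumerate(boundaries) if position < b), last_class)
        for position, cell in enumerate(unique_cells)
    }
    return [cell_to_class[cell] for cell in cells]


# Six points falling into four 1x1 cells; 3 cells go to class 0, 1 cell to class 1.
points = [(0.1, 0.1), (0.2, 0.3), (1.5, 0.2), (1.7, 0.4), (0.5, 1.5), (1.5, 1.5)]
labels = assign_spatial_blocks(points, cell_size=1.0, split_fractions=[0.75, 0.25])
```

Because the split is done per cell and not per row, the row fractions are only approximate, and a `class_column` is not taken into account at all here.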

My previous comment lists the materials I've gathered on this topic. If there is a good off-the-shelf solution for this use case, we can just add it as a dependency and wrap it within the srai API. If you have more ideas, examples, or sources, I'd be thankful if you shared them 🙇🏻.

RaczeQ · 2024-04-21T19:08:30Z

# just pseudo-coding here
from typing import Optional

import geopandas as gpd
import pandas as pd


def spatial_stratification(
    regions_gdf: gpd.GeoDataFrame,
    no_output_classes: int = 2,
    split_values: Optional[list[float]] = None,
    class_column: Optional[str] = None,
) -> pd.Series:
    """
    Generate a Series of stratification class values indexed like the provided GeoDataFrame.

    Args:
        regions_gdf (gpd.GeoDataFrame): The regions that are being stratified.
        no_output_classes (int, optional): How many classes should be in the result series.
            Defaults to 2.
        split_values (Optional[list[float]], optional): Relative size of each class. Values
            are normalized to sum to 1. When not provided, rows will be stratified equally.
            Defaults to None.
        class_column (Optional[str], optional): Name of the column used to additionally take into
            consideration when stratifying geometries. Defaults to None.

    Returns:
        pd.Series: Stratification class for each row of regions_gdf.
    """
    if no_output_classes < 1:
        raise ValueError("Number of output classes should be positive.")

    if not split_values:
        # Default to equally sized classes.
        split_values = [1 / no_output_classes for _ in range(no_output_classes)]

    # Normalize so that the fractions sum to 1.
    total = sum(split_values)
    normalized_split_values = [split_value / total for split_value in split_values]
    ...
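
One step that would follow the normalization is turning the fractions into whole row counts per class. A minimal sketch, assuming the largest-remainder method is acceptable here (the issue does not specify a rounding rule, and `allocate_counts` is a hypothetical helper name):

```python
import math


def allocate_counts(n_rows: int, split_values: list[float]) -> list[int]:
    """Split n_rows into integer counts proportional to split_values."""
    total = sum(split_values)
    exact = [n_rows * value / total for value in split_values]

    # Start from the rounded-down share of every class.
    counts = [math.floor(share) for share in exact]

    # Hand the leftover rows to the classes with the largest fractional parts.
    leftover = n_rows - sum(counts)
    by_fraction = sorted(
        range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


train_count, test_count = allocate_counts(5, [2, 1])
```

The counts always sum to `n_rows`, so no row is left without a class even when the fractions do not divide the dataset evenly.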
