# Memory-Efficient Visualization Strategies for Large Datasets

When working with 6M+ points, memory management becomes crucial. Here are several strategies to optimize Lonboard performance:

## Memory-Efficient Approaches

1. **Smart Sampling**: Use statistical sampling to maintain data representativeness
2. **Spatial Decimation**: Reduce point density based on zoom level
3. **Data Streaming**: Load data progressively based on viewport
4. **Hierarchical Display**: Show different detail levels at different zoom levels
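
As a rough illustration of spatial decimation (strategy 2), the sketch below bins point coordinates into a regular grid and keeps one point per occupied cell. It works on plain coordinate arrays with NumPy and pandas. The function name `grid_decimate` and its `cell_size` parameter are hypothetical, not part of Lonboard or GeoPandas, and the demo points are synthetic.

```python
import numpy as np
import pandas as pd


def grid_decimate(x, y, cell_size):
    """Return positional indices that keep one point per grid cell.

    Hypothetical helper for illustration: x and y are coordinate arrays,
    cell_size is the cell edge length in the same units as the coordinates.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Integer cell coordinates for every point (vectorized, no Python loop)
    col = np.floor((x - x.min()) / cell_size).astype(np.int64)
    row = np.floor((y - y.min()) / cell_size).astype(np.int64)
    cells = pd.DataFrame({"col": col, "row": row})
    # drop_duplicates keeps the first point seen in each occupied cell
    return cells.drop_duplicates().index.to_numpy()


# Synthetic demo data: 100,000 random points in a 10 x 10 square
rng = np.random.default_rng(42)
xs = rng.uniform(0, 10, 100_000)
ys = rng.uniform(0, 10, 100_000)

keep = grid_decimate(xs, ys, cell_size=0.5)
print(f"{len(xs):,} points reduced to {len(keep):,}")
```

With a GeoDataFrame of points, `gdf.geometry.x` and `gdf.geometry.y` could be passed in and the result used as `gdf.iloc[keep]`. A larger `cell_size` at low zoom and a smaller one at high zoom gives a zoom-dependent point density.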

In [1]:
import numpy as np
import pandas as pd
import geopandas as gpd
from typing import Union, Optional

def smart_sample_geodataframe(
    gdf: gpd.GeoDataFrame, 
    target_size: int = 100000,
    method: str = 'stratified',
    random_state: int = 42
) -> gpd.GeoDataFrame:
    """
    Intelligently sample a large GeoDataFrame to reduce memory usage while maintaining representativeness.
    
    Parameters:
    -----------
    gdf : GeoDataFrame
        The input geodataframe to sample
    target_size : int
        Target number of points to retain
    method : str
        Sampling method: 'random', 'stratified', 'spatial_grid', or 'adaptive'
    random_state : int
        Random seed for reproducibility
    
    Returns:
    --------
    GeoDataFrame
        Sampled geodataframe
    """
    if len(gdf) <= target_size:
        return gdf.copy()
    
    print(f"Sampling {len(gdf):,} points down to {target_size:,} ({target_size/len(gdf)*100:.1f}%)")
    
    if method == 'random':
        # Simple random sampling
        return gdf.sample(n=target_size, random_state=random_state)
    
    elif method == 'stratified':
        # Stratified sampling by source_collection to maintain proportions
        if 'source_collection' in gdf.columns:
            # Calculate sample sizes for each stratum
            proportions = gdf['source_collection'].value_counts(normalize=True)
            sample_sizes = (proportions * target_size).round().astype(int)
            
            # Ensure we don't exceed available data in any stratum
            sample_sizes = sample_sizes.clip(upper=gdf['source_collection'].value_counts())
            
            # Sample from each stratum
            sampled_parts = []
            for source, size in sample_sizes.items():
                if size > 0:
                    stratum = gdf[gdf['source_collection'] == source]
                    if len(stratum) > 0:
                        sample_size = min(size, len(stratum))
                        sampled_parts.append(stratum.sample(n=sample_size, random_state=random_state))
            
            return pd.concat(sampled_parts, ignore_index=True) if sampled_parts else gdf.sample(n=target_size, random_state=random_state)
        else:
            return gdf.sample(n=target_size, random_state=random_state)
    
    elif method == 'spatial_grid':
        # Grid-based spatial sampling to ensure geographic coverage
        bounds = gdf.total_bounds  # minx, miny, maxx, maxy
        
        # Calculate grid size based on target sample size
        grid_size = int(np.sqrt(target_size))
        x_step = (bounds[2] - bounds[0]) / grid_size
        y_step = (bounds[3] - bounds[1]) / grid_size
        
        sampled_points = []
        for i in range(grid_size):
            for j in range(grid_size):
                # Define grid cell bounds
                minx = bounds[0] + i * x_step
                maxx = bounds[0] + (i + 1) * x_step
                miny = bounds[1] + j * y_step
                maxy = bounds[1] + (j + 1) * y_step
                
                # Find points in this grid cell
                mask = ((gdf.geometry.x >= minx) & (gdf.geometry.x < maxx) & 
                       (gdf.geometry.y >= miny) & (gdf.geometry.y < maxy))
                cell_points = gdf[mask]
                
                # Sample one point from this cell (if any exist)
                if len(cell_points) > 0:
                    sampled_points.append(cell_points.sample(n=1, random_state=random_state + i*grid_size + j))
        
        return pd.concat(sampled_points, ignore_index=True) if sampled_points else gdf.sample(n=target_size, random_state=random_state)
    
    elif method == 'adaptive':
        # Adaptive sampling: weight each point by the inverse of its local density,
        # so sparse areas are over-represented relative to dense clusters.
        # This is a simplified version - could be much more sophisticated
        bounds = gdf.total_bounds
        grid_size = 50  # Use a finer grid for density calculation
        
        x_step = (bounds[2] - bounds[0]) / grid_size
        y_step = (bounds[3] - bounds[1]) / grid_size
        
        # Calculate density for each grid cell
        densities = np.zeros((grid_size, grid_size))
        for i in range(grid_size):
            for j in range(grid_size):
                minx = bounds[0] + i * x_step
                maxx = bounds[0] + (i + 1) * x_step
                miny = bounds[1] + j * y_step
                maxy = bounds[1] + (j + 1) * y_step
                
                mask = ((gdf.geometry.x >= minx) & (gdf.geometry.x < maxx) & 
                       (gdf.geometry.y >= miny) & (gdf.geometry.y < maxy))
                densities[i, j] = mask.sum()
        
        # Normalize densities and use as sampling probabilities
        total_density = densities.sum()
        if total_density > 0:
            # Create sampling weights based on inverse density (more samples from sparser areas)
            weights = np.zeros(len(gdf))
            for idx, (x, y) in enumerate(zip(gdf.geometry.x, gdf.geometry.y)):
                i = int((x - bounds[0]) / x_step)
                j = int((y - bounds[1]) / y_step)
                i = min(i, grid_size - 1)  # Handle edge case
                j = min(j, grid_size - 1)
                weights[idx] = 1.0 / (densities[i, j] + 1)  # +1 to avoid division by zero
            
            # Normalize weights
            weights = weights / weights.sum()
            
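            # Sample based on weights, using a seeded generator so that
            # random_state makes the result reproducible
            rng = np.random.default_rng(random_state)
            indices = rng.choice(len(gdf), size=min(target_size, len(gdf)),
                                 replace=False, p=weights)
            return gdf.iloc[indices].copy()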
        else:
            return gdf.sample(n=target_size, random_state=random_state)
    
    else:
        raise ValueError(f"Unknown sampling method: {method}")

def get_memory_usage_mb(gdf: gpd.GeoDataFrame) -> float:
    """Calculate approximate memory usage of a GeoDataFrame in MB."""
    return gdf.memory_usage(deep=True).sum() / 1024 / 1024

# Example usage and comparison
print("Available sampling methods:")
method_descriptions = {
    'random': 'simple random sampling',
    'stratified': 'proportional sampling by source_collection',
    'spatial_grid': 'one point per grid cell for geographic coverage',
    'adaptive': 'inverse-density weighting that favors sparse areas',
}
for method, description in method_descriptions.items():
    print(f"  - {method}: {description}")

Available sampling methods:
  - random: simple random sampling
  - stratified: proportional sampling by source_collection
  - spatial_grid: one point per grid cell for geographic coverage
  - adaptive: inverse-density weighting that favors sparse areas


In [2]:
def create_zoom_adaptive_layer(
    gdf: gpd.GeoDataFrame,
    zoom_levels: Optional[dict] = None,
    color_map: Optional[dict] = None,
    radius_base: int = 300
) -> dict:
    """
    Create multiple ScatterplotLayers with different levels of detail for zoom-based display.
    
    Parameters:
    -----------
    gdf : GeoDataFrame
        The input geodataframe
    zoom_levels : dict
        Dictionary mapping zoom ranges to sample sizes
        e.g., {(0, 3): 1000, (4, 6): 10000, (7, 18): 100000}
    color_map : dict
        Color mapping for different categories
    radius_base : int
        Base radius for points
    
    Returns:
    --------
    dict
        Dictionary with zoom ranges as keys and layers as values
    """
    if zoom_levels is None:
        zoom_levels = {
            (0, 3): 1000,    # World view: 1K points
            (4, 6): 10000,   # Continental view: 10K points  
            (7, 9): 50000,   # Regional view: 50K points
            (10, 18): 200000  # Local view: 200K points
        }
    
    layers = {}
    
    for (min_zoom, max_zoom), target_size in zoom_levels.items():
        print(f"Creating layer for zoom {min_zoom}-{max_zoom} with {target_size:,} points...")
        
        # Sample the data for this zoom level
        if len(gdf) > target_size:
            sampled_gdf = smart_sample_geodataframe(gdf, target_size, method='stratified')
        else:
            sampled_gdf = gdf.copy()
        
        # Create colors if color_map is provided
        if color_map:
            colors = create_color_map(sampled_gdf, color_map)
        else:
            colors = [100, 150, 200, 255]  # Default blue
        
        # Scale the radius with the width of the zoom range (wider ranges get larger points)
        radius = radius_base * (1 + (max_zoom - min_zoom) * 0.1)
        
        # Create the layer
        from lonboard import ScatterplotLayer
        layer = ScatterplotLayer.from_geopandas(
            sampled_gdf,
            get_fill_color=colors,
            get_radius=radius,
            radius_units='meters',
            pickable=True,
            min_zoom=min_zoom,
            max_zoom=max_zoom
        )
        
        layers[(min_zoom, max_zoom)] = {
            'layer': layer,
            'data': sampled_gdf,
            'point_count': len(sampled_gdf),
            'memory_mb': get_memory_usage_mb(sampled_gdf)
        }
    
    return layers

def monitor_memory_usage():
    """Monitor current memory usage of the Python process."""
    import psutil
    import os
    
    process = psutil.Process(os.getpid())
    memory_info = process.memory_info()
    
    print(f"Current memory usage:")
    print(f"  RSS (Resident Set Size): {memory_info.rss / 1024 / 1024:.1f} MB")
    print(f"  VMS (Virtual Memory Size): {memory_info.vms / 1024 / 1024:.1f} MB")
    
    # Get system memory info
    system_memory = psutil.virtual_memory()
    print(f"System memory:")
    print(f"  Total: {system_memory.total / 1024 / 1024 / 1024:.1f} GB")
    print(f"  Available: {system_memory.available / 1024 / 1024 / 1024:.1f} GB")
    print(f"  Used: {system_memory.percent:.1f}%")

def optimize_gdf_memory(gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """
    Optimize memory usage of a GeoDataFrame by converting data types.
    """
    gdf_optimized = gdf.copy()
    
    # Convert string columns to categorical where beneficial
    for col in gdf_optimized.select_dtypes(include=['object']).columns:
        if col != 'geometry':  # Don't convert geometry column
            unique_ratio = gdf_optimized[col].nunique() / len(gdf_optimized)
            if unique_ratio < 0.5:  # If less than 50% unique values, convert to categorical
                gdf_optimized[col] = gdf_optimized[col].astype('category')
                print(f"Converted {col} to categorical (unique ratio: {unique_ratio:.3f})")
    
    # Downcast numeric types where possible
    for col in gdf_optimized.select_dtypes(include=['int64']).columns:
        gdf_optimized[col] = pd.to_numeric(gdf_optimized[col], downcast='integer')
    
    for col in gdf_optimized.select_dtypes(include=['float64']).columns:
        gdf_optimized[col] = pd.to_numeric(gdf_optimized[col], downcast='float')
    
    original_memory = get_memory_usage_mb(gdf)
    optimized_memory = get_memory_usage_mb(gdf_optimized)
    
    print(f"Memory optimization:")
    print(f"  Original: {original_memory:.1f} MB")
    print(f"  Optimized: {optimized_memory:.1f} MB")
    print(f"  Savings: {original_memory - optimized_memory:.1f} MB ({(1 - optimized_memory/original_memory)*100:.1f}%)")
    
    return gdf_optimized

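# Install psutil if not available
try:
    import psutil
except ImportError:
    print("Installing psutil for memory monitoring...")
    import subprocess
    import sys
    # Use the current interpreter's pip so the package lands in this environment
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'psutil'], check=True)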
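
The effect of the categorical conversion in `optimize_gdf_memory` can be checked on a small synthetic column. The sketch below is an illustration under assumptions: the data is randomly generated from the four collection names, and `size_mb` is a hypothetical helper that mirrors `get_memory_usage_mb` for a single Series.

```python
import numpy as np
import pandas as pd

# Synthetic column: 200,000 rows drawn from four collection names
rng = np.random.default_rng(0)
names = np.array(["SESAR", "OPENCONTEXT", "GEOME", "SMITHSONIAN"], dtype=object)
as_object = pd.Series(rng.choice(names, size=200_000), dtype=object)
as_category = as_object.astype("category")


def size_mb(series: pd.Series) -> float:
    """Deep memory footprint of a Series in MB."""
    return series.memory_usage(deep=True) / 1024 / 1024


print(f"object:   {size_mb(as_object):.2f} MB")
print(f"category: {size_mb(as_category):.2f} MB")
```

A low-cardinality column such as `source_collection` is the best case for this conversion, because each row is stored as a small integer code instead of a full Python string.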

In [3]:
# Practical example: Apply memory-efficient techniques to your dataset

print("=== Memory Usage Analysis ===")
monitor_memory_usage()

print("\n=== Dataset Information ===")
# Check current dataset size (assuming you have gdf_valid from earlier)
if 'gdf_valid' in locals():
    print(f"Current dataset: {len(gdf_valid):,} points")
    print(f"Memory usage: {get_memory_usage_mb(gdf_valid):.1f} MB")
    
    # Optimize memory usage
    print("\n=== Memory Optimization ===")
    gdf_optimized = optimize_gdf_memory(gdf_valid)
    
    # Test different sampling strategies
    print("\n=== Sampling Strategy Comparison ===")
    target_sizes = [10000, 50000, 100000]
    methods = ['random', 'stratified', 'spatial_grid']
    
    sampling_results = {}
    
    for target_size in target_sizes:
        print(f"\nTarget size: {target_size:,} points")
        for method in methods:
            try:
                sampled = smart_sample_geodataframe(gdf_optimized, target_size, method=method)
                memory_mb = get_memory_usage_mb(sampled)
                
                # Check source collection distribution
                if 'source_collection' in sampled.columns:
                    dist = sampled['source_collection'].value_counts(normalize=True)
                    print(f"  {method:12s}: {len(sampled):6,} points, {memory_mb:5.1f} MB, collections: {len(dist)}")
                else:
                    print(f"  {method:12s}: {len(sampled):6,} points, {memory_mb:5.1f} MB")
                
                sampling_results[(target_size, method)] = {
                    'data': sampled,
                    'memory_mb': memory_mb,
                    'point_count': len(sampled)
                }
            except Exception as e:
                print(f"  {method:12s}: Error - {str(e)}")
    
    print("\n=== Recommended Approach ===")
    print("For 6M+ points with Lonboard:")
    print("1. Use stratified sampling to maintain data representativeness")
    print("2. Start with 50K-100K points for interactive exploration")
    print("3. Use zoom-adaptive layers for better performance")
    print("4. Consider spatial decimation for very dense areas")
    
    # Demonstrate zoom-adaptive approach
    print("\n=== Creating Zoom-Adaptive Layers ===")
    # Use a smaller subset for demonstration
    demo_data = smart_sample_geodataframe(gdf_optimized, 25000, method='stratified')
    
    # Create zoom-adaptive layers
    zoom_layers = create_zoom_adaptive_layer(
        demo_data,
        zoom_levels={
            (0, 4): 1000,   # World view
            (5, 8): 5000,   # Regional view  
            (9, 18): 15000  # Local view
        }
    )
    
    print("\nZoom layers created:")
    total_memory = 0
    for (min_zoom, max_zoom), layer_info in zoom_layers.items():
        memory = layer_info['memory_mb']
        total_memory += memory
        print(f"  Zoom {min_zoom:2d}-{max_zoom:2d}: {layer_info['point_count']:5,} points, {memory:5.1f} MB")
    
    print(f"Total memory for all zoom layers: {total_memory:.1f} MB")
    
else:
    print("Please run the data loading cells first to create gdf_valid")

print("\n=== Memory Management Tips ===")
print("1. Monitor memory usage regularly with monitor_memory_usage()")
print("2. Use sampling for initial exploration, full dataset for final analysis")
print("3. Consider using DuckDB/Ibis for aggregations before visualization")
print("4. Clear unused variables with del and gc.collect()")
print("5. For production, consider pre-processing data at different zoom levels")

=== Memory Usage Analysis ===
Current memory usage:
  RSS (Resident Set Size): 142.6 MB
  VMS (Virtual Memory Size): 402881.2 MB
System memory:
  Total: 128.0 GB
  Available: 34.8 GB
  Used: 72.8%

=== Dataset Information ===
Please run the data loading cells first to create gdf_valid

=== Memory Management Tips ===
1. Monitor memory usage regularly with monitor_memory_usage()
2. Use sampling for initial exploration, full dataset for final analysis
3. Consider using DuckDB/Ibis for aggregations before visualization
4. Clear unused variables with del and gc.collect()
5. For production, consider pre-processing data at different zoom levels
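
The zoom-range dictionaries used above are keyed by integer ranges such as `(0, 3)` and `(4, 6)`, so a fractional map zoom like 3.5 falls between two keys. A minimal sketch of a lookup that tolerates such gaps follows. It assumes a hypothetical helper named `target_size_for_zoom`, which is not part of Lonboard.

```python
def target_size_for_zoom(zoom, zoom_levels):
    """Return the sample size for a (possibly fractional) zoom value.

    Hypothetical helper: picks the range with the largest lower bound
    that does not exceed `zoom`, so gaps between ranges are tolerated.
    """
    # Sort the ranges by their lower bound
    ordered = sorted(zoom_levels.items(), key=lambda item: item[0][0])
    # Walk from the most detailed range down to the coarsest one
    for (min_zoom, _max_zoom), size in reversed(ordered):
        if zoom >= min_zoom:
            return size
    # Zoom is below every range: fall back to the coarsest level
    return ordered[0][1]


zoom_levels = {(0, 3): 1000, (4, 6): 10000, (7, 9): 50000, (10, 18): 200000}
for z in (2, 3.5, 7, 12.25, 20):
    print(z, target_size_for_zoom(z, zoom_levels))
```

Treating only the lower bound as significant means every zoom value maps to exactly one level, including values above the last upper bound.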



# Visualizing iSamples Data with Lonboard



This notebook demonstrates how to use the `lonboard` library to visualize iSamples data. It covers loading the data, cleaning it, and creating an interactive map with various controls.



## Table of Contents

1. [Setup and Imports](#Setup-and-Imports)
2. [Load Data](#Load-Data)
3. [Data Exploration with Ibis](#Data-Exploration-with-Ibis)
4. [Data Cleaning and Preparation](#Data-Cleaning-and-Preparation)
5. [Interactive Map](#Interactive-Map)


In [4]:
from pathlib import Path

import requests

import numpy as np
import matplotlib.pyplot as plt
import geopandas as gpd
import pandas as pd
import shapely
from palettable.colorbrewer.diverging import BrBG_10
# from sidecar import Sidecar

from lonboard import Map, ScatterplotLayer
from lonboard.colormap import apply_continuous_cmap

import ibis

import ipywidgets as widgets
from IPython.display import display
from ipywidgets import Layout, Button, HBox, VBox, HTML
from ipywidgets import Output, HTMLMath


<a id='Setup-and-Imports'></a>
## 1. Setup and Imports


Two files to analyze:

* [iSamples Complete Export Dataset - April 2025](https://zenodo.org/records/15278211)

* [Open Context Database SQL Dump and Parquet Exports](https://zenodo.org/records/15732000)



In [5]:
# local_path = Path("/Users/raymondyee/Data/iSample/2025_02_20_10_30_49/isamples_export_2025_02_20_10_30_49_geo.parquet")
# local_path = Path("/Users/raymondyee/Data/iSample/OPENCONTEXT.parquet")
# local_path = Path("/Users/raymondyee/Data/iSample/pqg_refining/oc_isamples_pqg.parquet")
# LOCAL_PATH = "isamples_export_2025_04_21_16_23_46_geo.parquet"
LOCAL_PATH = "/Users/raymondyee/Data/iSample/2025_04_21_16_23_46/isamples_export_2025_04_21_16_23_46_geo.parquet"
local_path = Path(LOCAL_PATH)
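if not local_path.exists():
    remote_url = "https://zenodo.org/records/15278211/files/isamples_export_2025_04_21_16_23_46_geo.parquet"
    # download the file in chunks (it is a few hundred MB) and store it at local_path
    local_path.parent.mkdir(parents=True, exist_ok=True)
    with requests.get(remote_url, stream=True, timeout=60) as response:
        response.raise_for_status()  # fail loudly on HTTP errors instead of saving an error page
        with open(local_path, 'wb') as file:
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                file.write(chunk)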

<a id='Load-Data'></a>
## 2. Load Data


In [6]:
# write out some info about the local file
# how big is it?
print(f"Local file: {local_path}")
print(f"File size: {local_path.stat().st_size / 1024 / 1024:.2f} MB")

Local file: /Users/raymondyee/Data/iSample/2025_04_21_16_23_46/isamples_export_2025_04_21_16_23_46_geo.parquet
File size: 283.28 MB


In [7]:
# read the column names from the Parquet schema without loading the whole file
import pyarrow.parquet as pq
schema = pq.read_schema(local_path)
print(f"Columns in geoparquet file: {schema.names}")

Columns in geoparquet file: ['sample_identifier', '@id', 'label', 'description', 'source_collection', 'has_sample_object_type', 'has_material_category', 'has_context_category', 'informal_classification', 'keywords', 'produced_by', 'last_modified_time', 'curation', 'registrant', 'related_resource', 'sampling_purpose', 'sample_location_longitude', 'sample_location_latitude', 'geometry']



A GeoParquet file contains standard Parquet data types as well as a special geometry column. Here's a breakdown of what you can expect:

*   **Standard Data Types**: These are the fundamental data types from the Apache Parquet format. They include:
    *   `BOOLEAN`: True or false values.
    *   `INT32`, `INT64`: 32-bit and 64-bit signed integers.
    *   `FLOAT`, `DOUBLE`: 32-bit and 64-bit floating-point numbers.
    *   `BYTE_ARRAY`: Variable-length byte arrays, which can be used to store strings (with UTF-8 encoding), binary data, or complex types.
    *   `FIXED_LEN_BYTE_ARRAY`: Fixed-length byte arrays.

*   **Logical Types**: These are annotations that add semantic meaning to the underlying primitive types. For example, a `BYTE_ARRAY` can be annotated as a `STRING` or `JSON`. Common logical types include:
    *   `STRING`: UTF-8 encoded character strings.
    *   `DECIMAL`: Arbitrary-precision signed decimal numbers.
    *   `DATE`, `TIME`, `TIMESTAMP`: Date and time values with various precisions.
    *   `UUID`: Universally unique identifiers.

*   **Geometry Column**: This is the defining feature of a GeoParquet file. It's a column that stores geographic features in a binary format, typically Well-Known Binary (WKB). This column is what allows geospatial libraries like `geopandas` to interpret the data as points, lines, or polygons. The schema metadata will specify which column is the geometry column.

*   **GeoParquet Metadata**: In addition to the standard Parquet metadata, a GeoParquet file includes specific metadata that describes the geospatial information. This includes:
    *   The name of the geometry column.
    *   The Coordinate Reference System (CRS) of the geometries (e.g., WGS84).
    *   The bounding box of the data.
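
In GeoParquet, this information is stored as a JSON document under the key `geo` in the Parquet file's key-value metadata. The sketch below parses such a document with the standard library only. `summarize_geo_metadata` is a hypothetical helper, and the example metadata is made up for illustration rather than read from the iSamples file.

```python
import json


def summarize_geo_metadata(file_metadata):
    """Summarize the GeoParquet `geo` entry of a Parquet metadata mapping.

    `file_metadata` maps bytes keys to bytes values, which is the shape
    pyarrow exposes as `schema.metadata`.
    """
    geo = json.loads(file_metadata[b"geo"].decode("utf-8"))
    primary = geo["primary_column"]
    column = geo["columns"][primary]
    return {
        "primary_column": primary,
        "encoding": column.get("encoding"),
        "geometry_types": column.get("geometry_types"),
        "bbox": column.get("bbox"),
        # In the GeoParquet spec, a missing "crs" means OGC:CRS84 (lon/lat)
        "has_explicit_crs": "crs" in column,
    }


# Made-up metadata for illustration (not taken from the iSamples file)
example = {
    b"geo": json.dumps({
        "version": "1.0.0",
        "primary_column": "geometry",
        "columns": {
            "geometry": {
                "encoding": "WKB",
                "geometry_types": ["Point"],
                "bbox": [-180.0, -90.0, 180.0, 90.0],
            }
        },
    }).encode("utf-8")
}

print(summarize_geo_metadata(example))
```

With pyarrow, the same function could be applied to `pq.read_schema(local_path).metadata`, assuming the file was written with GeoParquet metadata.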


In [8]:
all_columns = ['sample_identifier',
 'label',
 'description',
 'source_collection',
 'has_sample_object_type',
 'has_material_category',
 'has_context_category',
 'informal_classification',
 'keywords',
 'produced_by',
 'curation',
 'registrant',
 'related_resource',
 'sampling_purpose',
 'sample_location_longitude',
 'sample_location_latitude',
 'geometry']

# read a subset of columns
columns = ['sample_identifier', 'source_collection', 'geometry']
# columns = all_columns




In [9]:
set(schema.names) - set(all_columns)  # which schema columns are missing from all_columns?

{'@id', 'last_modified_time'}


The fields `@id` and `last_modified_time` are part of the Parquet file's schema, which is why `schema.names` includes them; they were simply left out of `all_columns`. Here's a breakdown of what they likely represent:

*   **`@id`**: This is a convention often used in Linked Data formats like JSON-LD. It typically represents a unique identifier for each record, often in the form of a URI (Uniform Resource Identifier). This allows each sample in the dataset to be referenced unambiguously, which is useful for data integration and interoperability. Given that one of the data sources is Open Context, which relies heavily on Linked Open Data principles, this is a very likely explanation. In this file the values look like relative paths, e.g. `metadata/21547/DSz2757`.

*   **`last_modified_time`**: This field records when the data for that specific record was last modified. It supports data provenance and helps in understanding the version and history of the data.

In short, while they may not be "data" columns in the same way as `sample_location_latitude` or `description`, they are metadata fields stored as ordinary columns in the Parquet file. `pyarrow.parquet.read_schema` gives you a raw look at the schema, so it lists every column. When you load the data with `geopandas` and specify a subset of columns, the remaining columns are simply not read, but they are still present in the file.


In [10]:
if local_path.exists():
    gdf = gpd.read_parquet(local_path, columns=columns)
    # Get a sample if the dataset is too large


In [11]:
# confirm that the columns are as expected
assert set(gdf.columns) == set(columns)

In [12]:
# use ibis to read the parquet file and compute some basic stats

table = ibis.read_parquet(local_path)
result = table["source_collection"].value_counts().execute()
print(result)


  source_collection  source_collection_count
0             GEOME                   605554
1       SMITHSONIAN                   322161
2       OPENCONTEXT                  1064831
3             SESAR                  4688386



<a id='Data-Exploration-with-Ibis'></a>
## 3. Data Exploration with Ibis


Ibis uses DuckDB as its default backend, so `ibis.read_parquet` can query the Parquet file directly without first loading all of it into pandas. That makes it efficient and convenient for handling large datasets.

In [13]:
# Get all column names
print(table.columns)

# Display table schema/structure with data types
print(table.schema())

# Get number of rows
print(table.count().execute())

# Preview first few rows (similar to pandas head())
print(table.limit(5).execute())

('sample_identifier', '@id', 'label', 'description', 'source_collection', 'has_sample_object_type', 'has_material_category', 'has_context_category', 'informal_classification', 'keywords', 'produced_by', 'last_modified_time', 'curation', 'registrant', 'related_resource', 'sampling_purpose', 'sample_location_longitude', 'sample_location_latitude', 'geometry')
ibis.Schema {
  sample_identifier          string
  @id                        string
  label                      string
  description                string
  source_collection          string
  has_sample_object_type     array<struct<identifier: string>>
  has_material_category      array<struct<identifier: string>>
  has_context_category       array<struct<identifier: string>>
  informal_classification    array<string>
  keywords                   array<struct<keyword: string>>
  produced_by                struct<description: string, has_feature_of_interest: string, identifier: string, label: string, responsibility: array<struct<

In [14]:
# Value counts for categorical columns
print("Source collections:")
print(table["source_collection"].value_counts().execute())

print("Sample object types:")
print(table["has_sample_object_type"].value_counts().limit(10).execute())

print("Material categories:")
print(table["has_material_category"].value_counts().limit(10).execute())

# Check for null values in important columns
null_counts = {col: table[col].isnull().sum().execute() for col in table.columns}
print("Null counts per column:")
for col, count in null_counts.items():
    print(f"{col}: {count}")

Source collections:
  source_collection  source_collection_count
0             SESAR                  4688386
1       SMITHSONIAN                   322161
2             GEOME                   605554
3       OPENCONTEXT                  1064831
Sample object types:
                              has_sample_object_type  \
0  [{'identifier': 'https://w3id.org/isample/voca...   
1  [{'identifier': 'https://w3id.org/isample/voca...   
2  [{'identifier': 'https://w3id.org/isample/open...   
3  [{'identifier': 'https://w3id.org/isample/open...   
4  [{'identifier': 'https://w3id.org/isample/open...   
5  [{'identifier': 'https://w3id.org/isample/open...   
6  [{'identifier': 'https://w3id.org/isample/voca...   
7  [{'identifier': 'https://w3id.org/isample/voca...   
8  [{'identifier': 'https://w3id.org/isample/open...   
9  [{'identifier': 'https://w3id.org/isample/open...   

   has_sample_object_type_count  
0                           645  
1                           230  
2              


### Column Analysis

Here is a breakdown of the columns in the dataset, their data types as interpreted by Ibis, and some notes on their content.

| Column Name | Data Type | Notes |
|---|---|---|
| `sample_identifier` | `string` | Unique identifier for the sample (e.g., `ark:/21547/DSz2757`). |
| `label` | `string` | A human-readable label for the sample. |
| `description` | `string` | A description of the sample. |
| `source_collection` | `string` | The collection that the sample belongs to (e.g., SESAR, OPENCONTEXT). |
| `has_sample_object_type` | `array<struct<identifier: string>>` | The type of object that was sampled, given as vocabulary URIs (e.g., a URI ending in `wholeorganism`). |
| `has_material_category` | `array<struct<identifier: string>>` | The category of material that the sample is composed of, given as vocabulary URIs (e.g., a URI ending in `organicmaterial`). |
| `has_context_category` | `array<struct<identifier: string>>` | The context from which the sample was taken, given as vocabulary URIs (e.g., a URI ending in `Animalia`). |
| `informal_classification` | `array<string>` | An informal classification of the sample (e.g., `Taricha`, `granulosa`). |
| `keywords` | `array<struct<keyword: string>>` | Keywords associated with the sample (e.g., `California`, `USA`). |
| `produced_by` | `struct` | Information about how the sample was produced; fields include `description`, `identifier`, `label`, and `responsibility`. |
| `curation` | `string` | Information about the curation of the sample. |
| `registrant` | `string` | The person or organization that registered the sample. |
| `related_resource` | `string` | Links to related resources. |
| `sampling_purpose` | `string` | The purpose for which the sample was collected. |
| `sample_location_longitude` | `float64` | The longitude of the sample location. |
| `sample_location_latitude` | `float64` | The latitude of the sample location. |
| `geometry` | `binary` | The geographic coordinates of the sample, stored in WKB format. This is the primary geometry column. |
| `@id` | `string` | A unique Linked Data identifier for the record. |
| `last_modified_time` | `string` | Timestamp of when the record was last modified. |


In [15]:

# pull out the first 100 rows as a pandas DataFrame
k = table.limit(100).to_pandas()
# pandas: boolean-mask filter (not displayed, since only the last expression in a cell is shown)
k[k['sample_identifier'] == 'ark:/21547/DSz2757']

# the equivalent filter in Ibis, run against the full table
table.filter(table['sample_identifier'] == 'ark:/21547/DSz2757').execute()

Unnamed: 0,sample_identifier,@id,label,description,source_collection,has_sample_object_type,has_material_category,has_context_category,informal_classification,keywords,produced_by,last_modified_time,curation,registrant,related_resource,sampling_purpose,sample_location_longitude,sample_location_latitude,geometry
0,ark:/21547/DSz2757,metadata/21547/DSz2757,757,basisOfRecord: PreservedSpecimen,GEOME,[{'identifier': 'https://w3id.org/isample/voca...,[{'identifier': 'https://w3id.org/isample/voca...,[{'identifier': 'https://w3id.org/isample/biol...,"[Taricha, granulosa]","[{'keyword': 'California'}, {'keyword': 'USA'}]",{'description': 'expeditionCode: newts | proje...,1894-01-01 00:00:00+00:00,,,,,-122.57861,38.578888,b'\x01\x01\x00\x00\x00\xde\xc8<\xf2\x07\xa5^\x...


In [16]:
dir(ibis.backends.sql.compilers)

['AthenaCompiler',
 'BigQueryCompiler',
 'ClickHouseCompiler',
 'DataFusionCompiler',
 'DatabricksCompiler',
 'DruidCompiler',
 'DuckDBCompiler',
 'ExasolCompiler',
 'FlinkCompiler',
 'ImpalaCompiler',
 'MSSQLCompiler',
 'MySQLCompiler',
 'OracleCompiler',
 'PostgresCompiler',
 'PySparkCompiler',
 'RisingWaveCompiler',
 'SQLiteCompiler',
 'SnowflakeCompiler',
 'TrinoCompiler',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'annotations',
 'athena',
 'base',
 'bigquery',
 'clickhouse',
 'databricks',
 'datafusion',
 'druid',
 'duckdb',
 'exasol',
 'flink',
 'impala',
 'mssql',
 'mysql',
 'oracle',
 'postgres',
 'pyspark',
 'risingwave',
 'snowflake',
 'sqlite',
 'trino']

### Querying Nested Data: Pandas vs. Ibis

When working with columns that contain nested, JSON-like data (such as dictionaries or structs), both pandas and Ibis provide powerful tools for querying. However, their approaches and the underlying performance can differ significantly.

#### The Pandas Approach

Let's assume we have a pandas DataFrame `k` with a column `produced_by` that contains dictionaries.

**1. The `apply` method (less idiomatic):**

A common but often inefficient approach is to use `apply` with a `lambda` function. This works but can be slow on large datasets as it's not a vectorized operation.

```python
k[k['produced_by'].apply(lambda x: x['identifier'] == 'ark:/21547/DSz2757' if x is not None else False)]
```

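**2. The `.str.get()` accessor (more concise):**

A more "pandasonic" way is to use the `.str.get()` accessor, which also works on dictionaries and gracefully handles missing values. It still processes the column element by element under the hood, so expect tidier code rather than a large speedup.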

```python
k[k['produced_by'].str.get('identifier') == 'ark:/21547/DSz2757']
```
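
Both forms can be compared on a toy frame. The data below is synthetic: it is shaped like the `produced_by` column, includes one missing value, and the second identifier and the labels are made up for illustration.

```python
import pandas as pd

# Synthetic stand-in for `k`: dicts with an 'identifier' key, plus a missing value
k_demo = pd.DataFrame({
    "produced_by": [
        {"identifier": "ark:/21547/DSz2757", "label": "newts"},
        None,
        {"identifier": "ark:/21547/DSz2758", "label": "other"},
    ]
})
target = "ark:/21547/DSz2757"

# Element-wise lambda with an explicit None check
mask_apply = k_demo["produced_by"].apply(
    lambda x: x["identifier"] == target if x is not None else False
)

# .str.get looks up the key in each dict; the missing row yields a missing
# value, which compares as False
mask_get = k_demo["produced_by"].str.get("identifier") == target

print(mask_apply.tolist())
print(mask_get.tolist())
print(k_demo[mask_get])
```

The two masks select the same rows, so the choice between them is mostly a matter of readability.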

#### The Ibis Approach

Now, let's consider an Ibis table `table` connected to a database like DuckDB. Ibis translates your Python code into efficient SQL.

**1. Direct Filtering:**

Ibis allows you to access fields in a struct-like column directly. This is clean and highly readable.

```python
e = table.filter(table.produced_by['identifier'] == 'ark:/21547/DSz2757')
```

**2. Filtering with `lambda` (more idiomatic):**

To avoid repeating the table variable, you can pass a `lambda` function to `filter`. This is especially useful in chained expressions, where intermediate tables have no variable name to refer to, and it tends to make longer data pipelines easier to read and maintain.

```python
e = table.filter(lambda t: t.produced_by['identifier'] == 'ark:/21547/DSz2757')
```

Both Ibis expressions produce the same efficient SQL query. For DuckDB, the generated SQL would look something like this, using dot notation to access the nested field:

```sql
SELECT *
FROM my_table t0
WHERE "t0"."produced_by"."identifier" = 'ark:/21547/DSz2757'
```

This demonstrates how Ibis lets you write high-level, Pythonic code while leveraging the full power of the underlying database engine for scalable, high-performance queries on complex data.

In [17]:
rows = table.limit(1).execute()
(rows.loc[0].to_dict())

{'sample_identifier': 'ark:/21547/DSz2757',
 '@id': 'metadata/21547/DSz2757',
 'label': '757',
 'description': 'basisOfRecord: PreservedSpecimen',
 'source_collection': 'GEOME',
 'has_sample_object_type': [{'identifier': 'https://w3id.org/isample/vocabulary/materialsampleobjecttype/1.0/wholeorganism'}],
 'has_material_category': [{'identifier': 'https://w3id.org/isample/vocabulary/material/1.0/organicmaterial'}],
 'has_context_category': [{'identifier': 'https://w3id.org/isample/biology/biosampledfeature/1.0/Animalia'}],
 'informal_classification': ['Taricha', 'granulosa'],
 'keywords': [{'keyword': 'California'}, {'keyword': 'USA'}],
 'produced_by': {'description': 'expeditionCode: newts | projectId: 244',
  'has_feature_of_interest': '',
  'identifier': 'ark:/21547/DSz2757',
  'label': 'a22d568d303a95c622a9409871e562d7 newts',
  'responsibility': [{'name': 'Vance Vredenburg', 'role': 'collector '},
   {'name': ' Vance Vredenburg', 'role': 'principalInvestigator'}],
  'result_time': '

In [18]:
# compute the value counts for has_material_category
result = table["has_material_category"].value_counts().execute()
print(result)

                                has_material_category  \
0   [{'identifier': 'https://w3id.org/isample/voca...   
1   [{'identifier': 'https://w3id.org/isample/voca...   
2   [{'identifier': 'https://w3id.org/isample/voca...   
3                                                None   
4   [{'identifier': 'https://w3id.org/isample/voca...   
5   [{'identifier': 'https://w3id.org/isample/voca...   
6   [{'identifier': 'https://w3id.org/isample/voca...   
7   [{'identifier': 'https://w3id.org/isample/voca...   
8   [{'identifier': 'https://w3id.org/isample/voca...   
9   [{'identifier': 'https://w3id.org/isample/voca...   
10  [{'identifier': 'https://w3id.org/isample/voca...   
11  [{'identifier': 'https://w3id.org/isample/voca...   
12  [{'identifier': 'https://w3id.org/isample/voca...   
13  [{'identifier': 'https://w3id.org/isample/voca...   
14  [{'identifier': 'https://w3id.org/isample/voca...   
15  [{'identifier': 'https://w3id.org/isample/voca...   
16  [{'identifier': 'https://w3

In [19]:
# Summary statistics for numeric columns
print("Latitude statistics:")
lat_stats = table.aggregate([
    table["sample_location_latitude"].count().name('count'),
    table["sample_location_latitude"].min().name('min'),
    table["sample_location_latitude"].max().name('max'),
    table["sample_location_latitude"].mean().name('mean'),
    table["sample_location_latitude"].std().name('std'),
]).execute()
print(lat_stats)

print("Longitude statistics:")
lon_stats = table.aggregate([
    table["sample_location_longitude"].count().name('count'),
    table["sample_location_longitude"].min().name('min'),
    table["sample_location_longitude"].max().name('max'),
    table["sample_location_longitude"].mean().name('mean'),
    table["sample_location_longitude"].std().name('std'),
]).execute()
print(lon_stats)

# For percentiles, you can use quantile:
print("Latitude percentiles:")
lat_percentiles = table.aggregate([
    table["sample_location_latitude"].quantile(0.25).name('25%'),
    table["sample_location_latitude"].quantile(0.50).name('50%'),
    table["sample_location_latitude"].quantile(0.75).name('75%')
]).execute()
print(lat_percentiles)

Latitude statistics:
     count     min     max       mean        std
0  5980282 -89.983  89.981  16.281101  33.070944
Longitude statistics:
     count    min    max      mean        std
0  5980282 -180.0  180.0 -8.264868  92.460269
Latitude percentiles:
      25%        50%      75%
0 -0.6798  29.970606  38.9346


In [20]:
# Group by source collection and count records
collection_summary = (
    table.group_by("source_collection")
    .aggregate(count=table.count())
    .order_by(ibis.desc("count"))
    .execute()
)
print("Records per source collection:")
print(collection_summary)

# Find records with geographic information
geography_stats = (
    table.group_by("source_collection")
    .aggregate(
        total=table.count(),
        with_coords=((~table["geometry"].isnull()).sum()),
        coord_percentage=(100 * (~table["geometry"].isnull()).mean())
    )
    .execute()
)
print("Geographic data availability by collection:")
print(geography_stats)

Records per source collection:
  source_collection    count
0             SESAR  4688386
1       OPENCONTEXT  1064831
2             GEOME   605554
3       SMITHSONIAN   322161
Geographic data availability by collection:
  source_collection    total  with_coords  coord_percentage
0       OPENCONTEXT  1064831      1064831             100.0
1             GEOME   605554       605554             100.0
2       SMITHSONIAN   322161       322161             100.0
3             SESAR  4688386      4688386             100.0



<a id='Data-Cleaning-and-Preparation'></a>
## 4. Data Cleaning and Preparation



### Ibis vs. Pandas: Selecting the First Row

The Ibis counterpart of `df.loc[0]` (or `df.head(1)`) in pandas is `table.limit(1)`, followed by `.execute()` to materialize the row.

The key difference lies in their execution models:

*   **Pandas (Eager Execution)**: When you have a pandas DataFrame (`df`), the data is already loaded into memory. `df.loc[0]` or `df.head(1)` directly accesses this in-memory data to retrieve the first row.

*   **Ibis (Lazy Execution)**: Ibis works differently. When you create an Ibis table, you are creating a *pointer* to the data, not loading it. The code you write builds a query plan. 
    *   `table.limit(1)`: This adds a "limit" operation to the query plan. No data has been read yet.
    *   `.execute()`: This is the command that sends the completed query plan to the backend (in this case, DuckDB reading the Parquet file) to actually retrieve the data.

This lazy approach is what makes Ibis so powerful for large datasets, as you only pull the data you explicitly ask for into memory.

Here is an example:


In [21]:
gdf

Unnamed: 0,sample_identifier,source_collection,geometry
0,ark:/21547/DSz2757,GEOME,POINT (-122.57861 38.57889)
1,ark:/21547/DSz2779,GEOME,POINT (-122.37306 37.38528)
2,ark:/21547/DSz2806,GEOME,POINT (-122.11705 37.36549)
3,ark:/21547/DSz2807,GEOME,POINT (-122.11705 37.36549)
4,ark:/21547/DSz2759,GEOME,POINT (-122.57861 38.57889)
...,...,...,...
6680927,ark:/65665/3fffcea63-19cd-478d-84fe-9914c6f55157,SMITHSONIAN,POINT EMPTY
6680928,ark:/65665/3fffe3e56-ec61-4892-9237-497340ad56ae,SMITHSONIAN,POINT EMPTY
6680929,ark:/65665/3fffe639f-69f4-451d-8aad-af6c9a0265d8,SMITHSONIAN,POINT (-95.4615 30.3353)
6680930,ark:/65665/3fffebe64-0849-4803-9cbc-a4129a927bf8,SMITHSONIAN,POINT EMPTY


In [22]:
list(gdf.columns)

['sample_identifier', 'source_collection', 'geometry']

In [23]:
# print out the first few rows of table
ibis.options.interactive = False
k = table.head().execute()
type(k)


pandas.core.frame.DataFrame

In [24]:
# Filter out null and empty geometries
gdf_valid = gdf[~gdf.geometry.isna() & ~gdf.geometry.is_empty]

print(f"Original dataframe: {len(gdf):,} records")
print(f"After removing empty geometries: {len(gdf_valid):,} records")
print(f"Removed: {len(gdf) - len(gdf_valid):,} records ({(len(gdf) - len(gdf_valid))/len(gdf)*100:.2f}%)")


Original dataframe: 6,680,932 records
After removing empty geometries: 5,980,282 records
Removed: 700,650 records (10.49%)


In [25]:
# reduce the size of gdf to make it easier to plot

# Europe
# gdf = gdf.cx[-11.83:25.5, 34.9:59]
# USA
# gdf = gdf.cx[-125:-66, 24:50]
# WORLD
# gdf = gdf.cx[-180:180, -90:90]

In [26]:
gdf.columns

Index(['sample_identifier', 'source_collection', 'geometry'], dtype='object')

In [27]:
default_color = [128, 128, 128, 255]  # Gray for unknown sources
# Define color map 
color_map = {
    "SESAR": [51, 102, 204, 255],       # Vibrant blue (#3366CC)
    "OPENCONTEXT": [220, 57, 18, 255],  # Crimson red (#DC3912)
    "GEOME": [16, 150, 24, 255],        # Forest green (#109618)
    "SMITHSONIAN": [255, 153, 0, 255]   # Deep orange (#FF9900)
}

# Get selected collections
selected_collections = ['SESAR', 'OPENCONTEXT', 'GEOME', 'SMITHSONIAN']

def create_color_map_0(gdf, color_map, selected_collections=None, default_color=[128, 128, 128, 255]):
    # Pre-compute colors for each point
    colors = np.zeros((len(gdf), 4), dtype=np.uint8)
    for i, source in enumerate(gdf['source_collection']):
        if (selected_collections is None or source in selected_collections) and source in color_map:
            colors[i] = color_map[source]
        else:
            colors[i] = default_color
    return colors


# function to create a color map with selected collections (which has default of all collections)
# use faster vectorized approach
def create_color_map(gdf, color_map, selected_collections=None, default_color=[128, 128, 128, 255]):
    # Pre-compute colors for each point
    colors = np.zeros((len(gdf), 4), dtype=np.uint8)
    
    # Create a mapping dictionary once
    color_lookup = {cat: np.array(color_map.get(cat, default_color)) for cat in gdf['source_collection'].cat.categories}
    
    # Apply the mapping using categorical codes
    for cat_code, cat in enumerate(gdf['source_collection'].cat.categories):
        mask = gdf['source_collection'].cat.codes == cat_code
        # Only apply color if the category is in selected_collections (if provided)
        if selected_collections is None or cat in selected_collections:
            colors[mask] = color_lookup.get(cat, default_color)
        else:
            colors[mask] = default_color
    
    return colors
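
The per-category masking above still loops once per category. One possible alternative, sketched here under assumptions, builds a small lookup table and indexes it with the categorical codes in a single NumPy operation. `colors_from_codes` and the demo data are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


def colors_from_codes(source, color_map, selected=None, default_color=(128, 128, 128, 255)):
    """Map a Series of collection names to an (n, 4) uint8 RGBA array."""
    source = source.astype("category")
    # One RGBA row per category, falling back to the default color
    # for unknown or unselected collections.
    rows = [
        color_map.get(cat, default_color)
        if (selected is None or cat in selected)
        else default_color
        for cat in source.cat.categories
    ]
    # Extra last row: missing values have code -1, which indexes this row.
    rows.append(default_color)
    lut = np.array(rows, dtype=np.uint8)
    return lut[source.cat.codes.to_numpy()]


# Hypothetical demo data: one missing value and one collection without a color.
demo = pd.Series(["SESAR", "GEOME", None, "OTHER", "SESAR"])
demo_map = {"SESAR": [51, 102, 204, 255], "GEOME": [16, 150, 24, 255]}
demo_colors = colors_from_codes(demo, demo_map)
demo_colors_selected = colors_from_codes(demo, demo_map, selected=["GEOME"])
```

Unlike `create_color_map`, this sketch also assigns the default color to rows with a missing `source_collection`, because code `-1` deliberately lands on the last lookup row.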


In [28]:


# write this comparison as a test
# pass arguments to the function

def test_color_map():
    # Test with full dataset (no selections)
    colors0 = create_color_map_0(gdf_sample, color_map)
    colors1 = create_color_map(gdf_sample, color_map)
    assert np.array_equal(colors0, colors1), "Full dataset color mapping failed"

    # Test with selected collections
    selected_collections = ['SESAR', 'OPENCONTEXT']
    colors0_selected = create_color_map_0(gdf_sample, color_map, selected_collections)
    colors1_selected = create_color_map(gdf_sample, color_map, selected_collections)
    assert np.array_equal(colors0_selected, colors1_selected), "Selected collections color mapping failed"

In [30]:
from lonboard import ScatterplotLayer, Map, BitmapTileLayer
import numpy as np

# First, ensure source_collection is categorical
gdf['source_collection'] = gdf['source_collection'].astype('category')

# Filter out null and empty geometries
gdf_valid = gdf[~gdf.geometry.isna() & ~gdf.geometry.is_empty]

# Take a 10% random sample to keep the map responsive
gdf_sample = gdf_valid.sample(frac=0.1, random_state=42)  # Adjust the fraction as needed

# Create a color map for the sample
colors = create_color_map(gdf_sample, color_map, selected_collections)


# Create a base tile layer with OpenStreetMap
base_layer = BitmapTileLayer(
        data="https://tile.openstreetmap.org/{z}/{x}/{y}.png",
        tile_size=256,
        max_requests=-1,
        min_zoom=0,
        max_zoom=19,
    )

satellite_layer = BitmapTileLayer(
    data="https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{z}/{y}/{x}",
    tile_size=256,
    min_zoom=0,
    max_zoom=19
)

# Create a ScatterplotLayer with the pre-computed colors
layer = ScatterplotLayer.from_geopandas(
    gdf_sample,
    get_fill_color=colors,  # Pass the numpy array of colors
    get_radius=300,
    radius_units='meters',  # Radius is in meters; use 'pixels' for a fixed on-screen size
    pickable=True
)

# Create and display the map
m = Map([base_layer, layer], _height=800)
# m = Map([satellite_layer, layer], _height=800)
display(m)

# example code to manipulate the map
# layer.get_fill_color = [0, 50, 200, 200]

Map(custom_attribution='', layers=(BitmapTileLayer(data='https://tile.openstreetmap.org/{z}/{x}/{y}.png', max_…

In [None]:
# let's play with the map and layer to learn how to use it

# layer.get_fill_color = [0, 50, 200, 200]
# set layer fill color to the color map

layer.get_fill_color = colors

# Just update zoom
# Correct way to update the view state
new_view_state = {
    "longitude": m.view_state.longitude,
    "latitude": m.view_state.latitude,
    "zoom": 6,  # Your new zoom level
    "pitch": m.view_state.pitch,
    "bearing": m.view_state.bearing
}

m.view_state = new_view_state


# view_state has the following attributes: longitude, latitude, zoom, pitch, bearing
# m.view_state = {"zoom": 10} 

# dynamically change layers in the map
m.layers = [base_layer, layer]
# m.layers = [satellite_layer, layer]

In [None]:
# Build one checkbox per source collection and report the current selection
import ipywidgets as widgets
gdf_sample['source_collection']

# construct checkboxes for each source collection
source_collections = gdf_sample['source_collection'].unique()
checkboxes = {source: widgets.Checkbox(value=False, description=source) for source in source_collections}

# Create output widget
output = widgets.Output()

# Respond to checkbox changes
def on_checkbox_change(change):
    with output:
        output.clear_output()
        selected_collections = [source for source, checkbox in checkboxes.items() if checkbox.value]
        # Print to the output widget instead of trying to set its value
        print(f"Selected collections: {', '.join(selected_collections)}")
        print(f"Number of rows in selection: {gdf_sample['source_collection'].isin(selected_collections).sum()}")

        # now update the layer and the map
        # Create a color map fo

# Register the callback with all checkboxes
for checkbox in checkboxes.values():
    checkbox.observe(on_checkbox_change, names='value')

# Display the checkboxes and output
display(widgets.VBox(list(checkboxes.values())), output)


In [None]:
gdf_sample['source_collection'].isin(selected_collections).sum()

In [None]:
gdf_sample['source_collection'].value_counts()

## Managing Environment with `pip-tools`

`pip-tools` is used to manage Python package dependencies for reproducible environments. The typical workflow involves two main commands: `pip-compile` and `pip-sync`.

1.  **Define Direct Dependencies (`requirements.in`)**:
    *   List your project's top-level dependencies in a `requirements.in` file. You can specify version constraints if needed.
    *   Example `requirements.in`:
        ```
        pandas>=1.0
        geopandas
        lonboard
        # For local editable installs:
        # -e /path/to/local/package
        ```

2.  **Compile Dependencies (`pip-compile`)**:
    *   Run `pip-compile requirements.in` (or specify input and output files: `pip-compile requirements.in --output-file requirements.txt`).
    *   This generates a `requirements.txt` file, which pins the versions of your direct dependencies and all their sub-dependencies. This file ensures that your environment is reproducible.

3.  **Synchronize Environment (`pip-sync`)**:
    *   Run `pip-sync requirements.txt` (or just `pip-sync` if `requirements.txt` is in the current directory).
    *   This command modifies your current virtual environment to exactly match the packages and versions specified in `requirements.txt`. It will:
        *   Install any missing packages.
        *   Upgrade or downgrade existing packages to their pinned versions.
        *   Uninstall any packages in the environment that are not listed in `requirements.txt`.

**How to "Install" Packages with `pip-tools`**:

`pip-tools` doesn't have a direct `install` subcommand like `pip install <package>`. To add or update packages:
1.  Add or modify the package entry in your `requirements.in` file.
2.  Run `pip-compile requirements.in` to update `requirements.txt`.
3.  Run `pip-sync` to apply the changes to your virtual environment.

This process ensures that your `requirements.txt` always reflects the complete, pinned set of dependencies for your project, leading to more stable and predictable environments.

In [None]:
# Create functions to make the map configurable and update based on user selections

def update_layer_colors(gdf_data, selected_collections=None, radius=300, radius_units='meters'):
    """
    Update the ScatterplotLayer with filtered data and colors based on selected collections
    
    Parameters:
    -----------
    gdf_data : GeoDataFrame
        The geodataframe containing the data to plot
    selected_collections : list, optional
        List of collection names to highlight. If None, all collections are shown
    radius : float, optional
        Radius of the points
    radius_units : str, optional
        Units for the radius ('meters' or 'pixels')
        
    Returns:
    --------
    layer : ScatterplotLayer
        Updated ScatterplotLayer with filtered data and colors
    colors : numpy.ndarray
        Array of colors for the points
    filtered_data : GeoDataFrame
        The rows of gdf_data that belong to the selected collections
    """
    # If selected_collections is empty or None, use all collections
    if not selected_collections:
        selected_collections = gdf_data['source_collection'].unique()
    
    # Filter the data if needed
    if len(selected_collections) < len(gdf_data['source_collection'].unique()):
        filtered_data = gdf_data[gdf_data['source_collection'].isin(selected_collections)]
    else:
        filtered_data = gdf_data
    
    # Create colors based on the selected collections
    colors = create_color_map(filtered_data, color_map, selected_collections)
    
    # Create the layer
    layer = ScatterplotLayer.from_geopandas(
        filtered_data,
        get_fill_color=colors,
        get_radius=radius,
        radius_units=radius_units,
        pickable=True
    )
    
    return layer, colors, filtered_data

def create_map(base_layer_type="osm", layer=None, height=800):
    """
    Create and return a map with the specified base layer and data layer
    
    Parameters:
    -----------
    base_layer_type : str, optional
        Type of base layer to use ('osm' or 'satellite')
    layer : ScatterplotLayer, optional
        Data layer to add to the map
    height : int, optional
        Height of the map in pixels
        
    Returns:
    --------
    m : Map
        Map object with the specified layers
    """
    # Define base layers
    osm_layer = BitmapTileLayer(
        data="https://tile.openstreetmap.org/{z}/{x}/{y}.png",
        tile_size=256,
        max_requests=-1,
        min_zoom=0,
        max_zoom=19,
    )
    
    satellite_layer = BitmapTileLayer(
        data="https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{z}/{y}/{x}",
        tile_size=256,
        min_zoom=0,
        max_zoom=19
    )
    
    # Select the base layer
    if base_layer_type.lower() == "satellite":
        base = satellite_layer
    else:
        base = osm_layer
    
    # Create the map with appropriate layers
    layers = [base]
    if layer is not None:
        layers.append(layer)
    
    m = Map(layers, _height=height)
    return m


<a id='Interactive-Map'></a>
## 5. Interactive Map


In [None]:
# Create interactive widgets for map configuration
from ipywidgets import widgets, interactive, Layout, HBox, VBox, Output

# Create widgets for collection selection
collection_checkboxes = {
    collection: widgets.Checkbox(
        value=True, 
        description=collection,
        layout=Layout(width='auto')
    ) for collection in gdf_sample['source_collection'].unique()
}

# Create a widget for base map selection
base_map_dropdown = widgets.Dropdown(
    options=['OpenStreetMap', 'Satellite'],
    value='OpenStreetMap',
    description='Base Map:',
    layout=Layout(width='200px')
)

# Create a widget for point size
point_size_slider = widgets.IntSlider(
    value=300,
    min=100,
    max=1000,
    step=50,
    description='Point Size:',
    layout=Layout(width='300px')
)

# Create a widget for the units
radius_units_dropdown = widgets.Dropdown(
    options=['meters', 'pixels'],
    value='meters',
    description='Units:',
    layout=Layout(width='200px')
)

# Create a button to update the map
update_button = widgets.Button(
    description='Update Map',
    button_style='primary',
    layout=Layout(width='150px')
)

# Create an output widget for the map and statistics
map_output = widgets.Output()
stats_output = widgets.Output()

# Function to update the map based on widget values
def update_map(b):
    with map_output:
        map_output.clear_output(wait=True)
        
        # Get selected collections
        selected_collections = [
            collection for collection, checkbox in collection_checkboxes.items() 
            if checkbox.value
        ]
        
        # Get base map type
        base_layer_type = 'osm' if base_map_dropdown.value == 'OpenStreetMap' else 'satellite'
        
        # Update layer with selected collections and point size
        layer, colors, filtered_data = update_layer_colors(
            gdf_sample, 
            selected_collections, 
            radius=point_size_slider.value,
            radius_units=radius_units_dropdown.value
        )
        
        # Create and display the map
        m = create_map(base_layer_type=base_layer_type, layer=layer)
        display(m)
        
        # Update statistics
        with stats_output:
            stats_output.clear_output(wait=True)
            print(f"Selected collections: {', '.join(selected_collections)}")
            print(f"Points displayed: {len(filtered_data):,} of {len(gdf_sample):,} ({len(filtered_data)/len(gdf_sample)*100:.1f}%)")
            print(f"Points by collection:")
            for collection in selected_collections:
                count = (filtered_data['source_collection'] == collection).sum()
                print(f"  {collection}: {count:,} points")

# Connect the update function to the button
update_button.on_click(update_map)

# Create the layout for the widgets
collection_box = VBox([widgets.HTML("<b>Data Collections:</b>")] + list(collection_checkboxes.values()))
config_box = VBox([
    widgets.HTML("<b>Map Configuration:</b>"),
    base_map_dropdown,
    point_size_slider,
    radius_units_dropdown,
    update_button
])

# Arrange the widgets in a horizontal layout
control_panel = HBox([collection_box, config_box], layout=Layout(width='100%'))

# Display the widgets and outputs
display(control_panel)
display(stats_output)
display(map_output)

# Initialize the map
update_map(None)

In [None]:
# Add a function to zoom to specific regions
zoom_regions = {
    'World': {'longitude': 0, 'latitude': 0, 'zoom': 1},
    'North America': {'longitude': -100, 'latitude': 40, 'zoom': 3},
    'Europe': {'longitude': 10, 'latitude': 50, 'zoom': 4},
    'Asia': {'longitude': 100, 'latitude': 30, 'zoom': 3},
    'Africa': {'longitude': 20, 'latitude': 0, 'zoom': 3},
    'South America': {'longitude': -60, 'latitude': -20, 'zoom': 3},
    'Australia': {'longitude': 135, 'latitude': -25, 'zoom': 4},
}

# Create a dropdown for region selection
region_dropdown = widgets.Dropdown(
    options=list(zoom_regions.keys()),
    value='World',
    description='Zoom to:',
    layout=Layout(width='200px')
)

# Function to zoom the map to a region
def zoom_to_region(change):
    if not hasattr(zoom_to_region, 'current_map'):
        return
    
    region = change['new']
    view_state = zoom_regions[region].copy()
    # Add missing view state properties
    if 'pitch' not in view_state:
        view_state['pitch'] = 0
    if 'bearing' not in view_state:
        view_state['bearing'] = 0
    
    zoom_to_region.current_map.view_state = view_state

# Function to update the map based on widget values (updated version)
def update_map(b):
    with map_output:
        map_output.clear_output(wait=True)
        
        # Get selected collections
        selected_collections = [
            collection for collection, checkbox in collection_checkboxes.items() 
            if checkbox.value
        ]
        
        # Get base map type
        base_layer_type = 'osm' if base_map_dropdown.value == 'OpenStreetMap' else 'satellite'
        
        # Update layer with selected collections and point size
        layer, colors, filtered_data = update_layer_colors(
            gdf_sample, 
            selected_collections, 
            radius=point_size_slider.value,
            radius_units=radius_units_dropdown.value
        )
        
        # Create and display the map
        m = create_map(base_layer_type=base_layer_type, layer=layer)
        display(m)
        
        # Store the map for later zoom operations
        zoom_to_region.current_map = m
        
        # Update statistics
        with stats_output:
            stats_output.clear_output(wait=True)
            print(f"Selected collections: {', '.join(selected_collections)}")
            print(f"Points displayed: {len(filtered_data):,} of {len(gdf_sample):,} ({len(filtered_data)/len(gdf_sample)*100:.1f}%)")
            print(f"Points by collection:")
            for collection in selected_collections:
                count = (filtered_data['source_collection'] == collection).sum()
                print(f"  {collection}: {count:,} points ({count/len(filtered_data)*100:.1f}%)")

# Connect the region dropdown to the zoom function
region_dropdown.observe(zoom_to_region, names='value')

# Update the control panel to include the region dropdown
config_box = VBox([
    widgets.HTML("<b>Map Configuration:</b>"),
    base_map_dropdown,
    point_size_slider,
    radius_units_dropdown,
    region_dropdown,
    update_button
])

# Recreate the control panel
control_panel = HBox([collection_box, config_box], layout=Layout(width='100%'))

# Display the updated widgets and outputs
display(control_panel)
display(stats_output)
display(map_output)

# Initialize the map
update_map(None)

## Interactive iSamples Map

This interactive map allows you to explore the iSamples dataset with the following features:

1. **Collection Selection**: Choose which data collections to display
2. **Base Map**: Switch between OpenStreetMap and satellite imagery
3. **Point Size**: Adjust the size of the points on the map
4. **Units**: Choose between meters and pixels for point sizing
5. **Region Selection**: Quickly zoom to different regions of the world
6. **Statistics**: View counts and percentages of displayed points

The map is rendered using the Lonboard library, which provides fast visualization of large geospatial datasets directly in the notebook.


It is possible to embed human-readable descriptions of what columns mean directly within a Parquet or GeoParquet file. This makes datasets self-documenting and easier to use.

### Parquet and GeoParquet Metadata

Both Parquet and GeoParquet files can carry key-value metadata at the file level, and Arrow-based tools such as `pyarrow` also preserve key-value metadata attached to individual columns (fields). Neither specification defines a standard key for descriptions, so adding a `description` key to each column's metadata is a convention that readers of the file need to know about.

When you are creating or modifying a Parquet file, you can add this metadata. For example, using `pyarrow`, you can specify the schema with descriptions for each field.

### Example with `pyarrow`

Here is a conceptual example of how you might do this in Python with the `pyarrow` library:

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# Sample data
data = {'col1': [1, 2], 'col2': [3.4, 5.6]}
df = pd.DataFrame(data)

# Create a schema with descriptions
schema = pa.schema([
    pa.field('col1', pa.int64(), metadata={"description": "This is the first column."}),
    pa.field('col2', pa.float64(), metadata={"description": "This is the second column."})
])

# Create a PyArrow Table
table = pa.Table.from_pandas(df, schema=schema)

# Write to a Parquet file (uncomment to run)
# pq.write_table(table, 'data_with_descriptions.parquet')
```

When another user or tool reads this Parquet file, they can inspect the schema and retrieve these descriptions to understand the meaning of each column without needing separate documentation. This is a powerful feature for data sharing and collaboration.


In [None]:
import matplotlib.pyplot as plt

# Efficiently generate a histogram for 'last_modified_time'
# The challenge is the large number of rows and unique timestamps.
# We can use Ibis to offload the heavy lifting to the backend.

# 1. The column is already a timestamp, so we can use it directly.
timestamp_col = table['last_modified_time']

# 2. Extract the year from the timestamp to use as bins for our histogram.
year_col = timestamp_col.year().name('year')

# 3. Group by year and count the number of records in each year.
# This is memory-efficient as only the aggregated result is pulled into pandas.
histogram_data = table.group_by(year_col).agg(count=table.count()).order_by('year').execute()

# 4. Plot the histogram using matplotlib.
plt.figure(figsize=(15, 7))
plt.bar(histogram_data['year'], histogram_data['count'], color='skyblue')
plt.title('Histogram of Records by Last Modified Year')
plt.xlabel('Year')
plt.ylabel('Number of Records')
plt.grid(axis='y', linestyle='--', alpha=0.7)
# Ensure x-axis labels are integers and rotate them for readability
plt.xticks(histogram_data['year'].unique().astype(int), rotation=45)
plt.tight_layout()
plt.show()

# Display the aggregated data
histogram_data

### Interactive Filtering by Date

Below is an example of using `ipywidgets` to create a slider for filtering the data based on the `last_modified_time`. This is much more efficient than loading the entire dataset into pandas, as it uses Ibis to perform the filtering and counting on the backend.

In [None]:
# Get min and max years for the slider
min_year = table['last_modified_time'].year().min().execute()
max_year = table['last_modified_time'].year().max().execute()

# Create a range slider for the years
year_slider = widgets.IntRangeSlider(
    value=[min_year, max_year],
    min=min_year,
    max=max_year,
    step=1,
    description='Filter by Year:',
    disabled=False,
    continuous_update=False,  # Only trigger update on release
    orientation='horizontal',
    readout=True,
    readout_format='d',
    layout=Layout(width='500px')
)

# Create an output widget to display the count
count_output = widgets.Output()

# Function to handle slider changes
def on_slider_change(change):
    with count_output:
        count_output.clear_output(wait=True)
        min_val, max_val = change['new']
        
        # Filter the table based on the selected year range
        filtered_table = table.filter(
            (table['last_modified_time'].year() >= min_val) &
            (table['last_modified_time'].year() <= max_val)
        )
        
        # Get the count of rows in the filtered table
        row_count = filtered_table.count().execute()
        
        print(f"Number of records from {min_val} to {max_val}: {row_count:,}")

# Observe the slider for changes
year_slider.observe(on_slider_change, names='value')

# Display the widgets
display(year_slider, count_output)

# Trigger the initial display
on_slider_change({'new': (min_year, max_year)})

In [None]:

import pyarrow.parquet as pq
from IPython.display import Markdown, display

# Read the schema from the Parquet file
schema = pq.read_schema(local_path)

# Create the markdown table header
md = "| Column Name | Data Type | Description |\n"
md += "|---|---|---|\n"

# Populate the table with schema information
for field in schema:
    # Extract description from metadata, if it exists
    description = field.metadata.get(b'description', b'').decode('utf-8') if field.metadata else ""
    md += f"| {field.name} | {field.type} | {description} |\n"

display(Markdown(md))


In [None]:
import geopandas as gpd
import pyarrow.parquet as pq
from pathlib import Path
from IPython.display import Markdown, display

# Define the path to the GeoParquet file
LOCAL_PATH = "/Users/raymondyee/Data/iSample/2025_04_21_16_23_46/isamples_export_2025_04_21_16_23_46_geo.parquet"
local_path = Path(LOCAL_PATH)

# 1. Define your column descriptions here
# The keys are the column names from your file. 
# Fill in the string values with the description for each column.

column_descriptions = {
    '@id': 'Unique identifier for the sample, often a URI.',
    'sample_identifier': 'A unique identifier for the sample within its source collection.',
    'label': 'The primary name or label assigned to the sample.',
    'description': 'A free-text description of the sample.',
    'source_collection': 'The collection or dataset from which the sample originates (e.g., SESAR, OPENCONTEXT).',
    'has_sample_object_type': 'The type of object that was sampled (e.g., rock, water, artifact).',
    'has_material_category': 'The category of material the sample is composed of (e.g., organic, inorganic).',
    'has_context_category': 'The environmental or cultural context from which the sample was taken (e.g., marine, archaeological).',
    'informal_classification': 'An informal or local classification of the sample.',
    'keywords': 'A list of keywords associated with the sample.',
    'produced_by': 'Information about the agent or process that produced the sample.',
    'curation': 'Information about the curation and stewardship of the sample.',
    'registrant': 'The person or organization that registered the sample.',
    'related_resource': 'Links to related resources or publications.',
    'sampling_purpose': 'The reason or purpose for which the sample was collected.',
    'sample_location_longitude': 'The longitude of the sample location (WGS84).',
    'sample_location_latitude': 'The latitude of the sample location (WGS84).',
    'last_modified_time': 'The date the record was last modified.',
    'geometry': 'The geographic coordinates of the sample location in WKB format.'
}

# 2. Code to generate a markdown documentation file

# Read the schema from the Parquet file
schema = pq.read_schema(local_path)

# Start the markdown string with a header
md_string = f"# Schema for {local_path.name}\n\n"
md_string += "| Column Name | Data Type | Description |\n"
md_string += "|---|---|---|\n"

# Populate the table with schema information
for field in schema:
    col_name = field.name
    col_type = str(field.type)
    # Get description from our dictionary, or a placeholder if not present
    # Escape pipe characters so they do not break the markdown table
    col_desc = column_descriptions.get(col_name, "*No description provided.*").replace('|', '\\|')
    md_string += f"| `{col_name}` | `{col_type}` | {col_desc} |\n"

# Define the output path for the markdown file
output_path = Path(str(local_path).replace('_geo.parquet', '_geo_schema.md'))

# Write the markdown string to the file
with open(output_path, 'w', encoding='utf-8') as f:
    f.write(md_string)

print(f"Successfully created schema documentation at: {output_path}")

# Display the generated markdown in the notebook for review
print("\n--- Schema Documentation ---")
display(Markdown(md_string))

# TODO (long term): possibly figure out how to embed human-friendly descriptions into the GeoParquet file.

In [None]:
table


While Ibis does not have a `query` method that works exactly like the pandas version, you can make your code more concise by passing a `lambda` function to the `filter` method. The lambda receives the table being filtered as its argument, which avoids the need to repeat the table name.



### The EDA System: A Conceptual Roadmap

The central idea is to create a "control panel" of widgets that are dynamically generated based on the schema of your Ibis table. This panel allows a user to build up a complex filter expression interactively, and then Ibis executes the final, filtered query.

Here’s a step-by-step approach to implementing your vision:

---

#### **Step 1: Schema Inspection and Widget Mapping**

The foundation of the system is a function that inspects the Ibis table's schema and decides which widget is appropriate for each column.

```python
import ibis
import ipywidgets as widgets
from ipywidgets import VBox, HBox, Dropdown, Text, IntSlider, Output

def generate_control_panel(table):
    # Get the schema from the Ibis table
    schema = table.schema()
    
    widget_map = {}
    
    for col_name, col_type in schema.items():
        # A factory function decides which widget to create
        widget = create_widget_for_column(table, col_name, col_type)
        if widget:
            widget_map[col_name] = widget
            
    # Arrange widgets in a layout
    return VBox([HBox([widgets.Label(name), w]) for name, w in widget_map.items()]), widget_map
```

---

#### **Step 2: The Widget Factory (The "Brains")**

This is the core logic you described. A function needs to intelligently create the right widget based on the column's type and cardinality.

```python
def create_widget_for_column(table, col_name, col_type):
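    # For nested/JSON-like columns
    if col_type.is_struct():
        # Create a text box to query a specific key within the struct
        # This is a simple starting point; could be more advanced
        return Text(description="Filter by key (e.g., key:value)")

    # For numeric columns: a range slider spanning the column's min and max
    if col_type.is_numeric():
        min_val = table[col_name].min().execute()
        max_val = table[col_name].max().execute()
        if col_type.is_integer():
            lo, hi = int(min_val), int(max_val)
            return widgets.IntRangeSlider(min=lo, max=hi, value=(lo, hi), description="Range")
        # Non-integer numerics (floats, decimals) need a float slider
        lo, hi = float(min_val), float(max_val)
        return widgets.FloatRangeSlider(min=lo, max=hi, value=(lo, hi), description="Range")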

    # For string/categorical columns
    if col_type.is_string():
        # Check the number of unique values (cardinality)
        cardinality = table[col_name].nunique().execute()
        
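        if 1 < cardinality < 25: # Low cardinality -> Dropdown
            # value_counts() returns a table whose first column holds the values,
            # so read that column rather than the DataFrame index
            counts = table[col_name].value_counts().execute()
            options = counts[col_name].dropna().tolist()
            return Dropdown(options=[''] + options) # Add empty option for "no filter"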
        else: # High cardinality -> Text search
            return Text(description="Contains text...")
            
    return None # No widget for this type
```

---

#### **Step 3: Linking Widgets to Ibis (`observe`)**

Once the widgets are displayed, you need to link their changes to an Ibis query. The `.observe()` method of `ipywidgets` is perfect for this. You'll build a list of filter expressions and re-run the query whenever a widget's value changes.

```python
# Assume `table` is your Ibis table
controls_vbox, widget_map = generate_control_panel(table)
output_area = Output() # An area to display the results

def apply_filters(change):
    # Start with the base table
    filtered_table = table
    
    # Collect all active filter conditions
    for col_name, widget in widget_map.items():
        if widget.value: # Apply filter if widget has a value
            # This logic would need to be more robust based on widget type
            if isinstance(widget, Dropdown):
                filtered_table = filtered_table.filter(lambda t: t[col_name] == widget.value)
            elif isinstance(widget, Text):
                 filtered_table = filtered_table.filter(lambda t: t[col_name].contains(widget.value))
            # ... add logic for sliders, etc.

    # Execute the query and display the results
    with output_area:
        output_area.clear_output()
        # Display the filtered data (e.g., as a pandas DataFrame)
        display(filtered_table.limit(100).execute())
        # Also display the generated SQL to see what Ibis is doing!
        print("Generated SQL:")
        print(ibis.to_sql(filtered_table.limit(100)))

# Attach the observer to each widget
for widget in widget_map.values():
    widget.observe(apply_filters, names='value')

# Display the UI
display(controls_vbox, output_area)
# Trigger the initial display
apply_filters(None)
```

### Summary of Your Big Idea

*   **You are right on track.** This is exactly how modern data apps are built. You're essentially creating a semantic layer (the widgets) on top of your data that translates user intent into high-performance queries.
*   **Scalability is Key:** The beauty of this approach is that the expensive work (`value_counts`, `min`, `max`, and the final filtering) is all pushed down to the database via Ibis. Your Jupyter kernel remains light and responsive.
*   **Interactivity:** For the high-cardinality search/autosuggest, you could make the `Text` widget's observer trigger a `... .like('%value%').value_counts()` query to populate a *separate* dropdown, creating a dynamic search experience.

This is a substantial but very achievable project. Starting with the three steps outlined above will give you a solid foundation for a very powerful and reusable EDA tool.


In [None]:
# let's concentrate on last_modified_time

table['last_modified_time'].value_counts().execute()