# ISMN loader

This template follows the tool-specs for reproducible research found at: [https://vforwater.github.io/tool-specs/](https://vforwater.github.io/tool-specs/).

This notebook creates the dataset that the other tools need. To reproduce the dataset, you need a running MetaCatalog
instance loaded with a dump of the ISMN database covering the area of interest. The docker compose service `db` connects to this
instance, and the notebook uses MetaCatalog-API to load the data into a local DuckDB database.

All subsequent notebooks read from the DuckDB database created here, so they do not need a connection to the MetaCatalog
instance.

## Handle parameters directly in the notebook

This is an **alternative** approach to the globally defined parameters. All tool-spec compliant tools written in a supported language include a package called `json2args` in their dependencies. This package can be used to gain finer control over the passed parameters and data.

In [None]:
import numpy as np
import pandas as pd
import polars as pl
import duckdb

In [None]:
database_name = 'database.duckdb'
force_rebuild_duckdb = True
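
The two values in the cell above are hard-coded. As a rough illustration of what reading them from a parameter file could look like, here is a minimal sketch using only the standard library. It is not how `json2args` is implemented: the helper `read_parameters`, the file name `parameters.json` and the flat JSON layout are assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path

# defaults mirror the values hard-coded in this notebook
DEFAULTS = {"database_name": "database.duckdb", "force_rebuild_duckdb": True}


def read_parameters(path, defaults=DEFAULTS):
    """Hypothetical helper: overlay values from a JSON file onto the defaults."""
    params = dict(defaults)
    path = Path(path)
    if path.exists():
        overrides = json.loads(path.read_text())
        # reject keys the notebook does not know, to catch typos early
        unknown = set(overrides) - set(defaults)
        if unknown:
            raise KeyError(f"unknown parameters: {sorted(unknown)}")
        params.update(overrides)
    return params


# demo with a throwaway file; the file name and layout are assumptions
with tempfile.TemporaryDirectory() as tmp:
    param_file = Path(tmp) / "parameters.json"
    param_file.write_text(json.dumps({"force_rebuild_duckdb": False}))
    params = read_parameters(param_file)
    fallback = read_parameters(Path(tmp) / "missing.json")
```

A missing file simply yields the defaults, so the notebook would still run unchanged when no parameters are passed.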

## Load the data

In [None]:
from metacatalog_api import core

Search the MetaCatalog database for all metadata records with 'ISMN' in the title and count them.

In [None]:
metadata = core.entries(filter={'entries.title': '%ISMN%'}, limit=None)

len(metadata)

In [None]:
# get all variables
np.unique([m.variable.name for m in metadata])

In [None]:
# How many soil moisture sensors?
len([m for m in metadata if m.variable.name == 'volumetric water content'])
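
The two cells above list the variable names and count one of them. A per-variable count for all variables at once can be sketched with `collections.Counter`. The records below are stand-ins built with `SimpleNamespace`; they only imitate the `m.variable.name` attribute access used in this notebook and are not real MetaCatalog entries.

```python
from collections import Counter
from types import SimpleNamespace

# stand-in records (made up), exposing only `.variable.name`
names = [
    "volumetric water content",
    "soil temperature",
    "volumetric water content",
    "air temperature",
]
records = [SimpleNamespace(variable=SimpleNamespace(name=n)) for n in names]

# one pass over the records gives the count for every variable
counts = Counter(m.variable.name for m in records)
n_soil_moisture = counts["volumetric water content"]
```

With the real `metadata` list in place of `records`, `counts.most_common()` would give the variables ordered by number of sensors.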

Remove an existing DuckDB file if a rebuild is forced, then define a function that reads the data into a DuckDB database.

In [None]:
from pathlib import Path

# handle recreation
db_path = Path('/out') / database_name
if db_path.exists() and force_rebuild_duckdb:
    print(f"The database {db_path} already exists and will be dropped, because force_rebuild_duckdb is set...")
    db_path.unlink()

In [None]:
# define a function to load the data
def load_data(meta, db_name):
    # get the metadata needed: the source table and the target table name
    table_name = meta.datasource.path
    var_name = meta.variable.name.replace(' ', '_')

    # connect and read the whole source table
    with core.connect() as cur:
        df = pl.read_database(f"SELECT * FROM {table_name};", cur)

    # add the id column, so that records stay distinguishable in the shared table
    # (Int16 only holds ids up to 32767)
    df = df.with_columns(pl.lit(meta.id, dtype=pl.Int16).alias('meta_id'))

    # add to duckdb: create the table for the first record of a variable, append afterwards
    with duckdb.connect(db_name, read_only=False) as db:
        try:
            db.sql(f"CREATE TABLE {var_name} AS SELECT * FROM df;")
        except duckdb.CatalogException:
            db.sql(f"INSERT INTO {var_name} SELECT * FROM df;")
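
The function above creates one table per variable on the first record and appends every later record of the same variable, with `meta_id` telling the sensors apart. The sketch below shows that create-or-append pattern in isolation. It uses the standard-library `sqlite3` module as a stand-in for DuckDB so that it runs without extra dependencies; the helper name `create_or_append`, the column names `tstamp` and `value`, and the rows are made up for illustration.

```python
import sqlite3


def create_or_append(con, table, rows):
    """Create `table` on the first call, then append rows on every call."""
    try:
        con.execute(f'CREATE TABLE "{table}" (tstamp TEXT, value REAL, meta_id INTEGER)')
    except sqlite3.OperationalError:
        # raised here because the table exists from an earlier record
        pass
    con.executemany(f'INSERT INTO "{table}" VALUES (?, ?, ?)', rows)


con = sqlite3.connect(":memory:")

# two sensors measuring the same variable end up in the same table
create_or_append(con, "volumetric_water_content", [("2020-01-01", 0.21, 1)])
create_or_append(con, "volumetric_water_content", [("2020-01-01", 0.18, 2)])

n_rows = con.execute('SELECT COUNT(*) FROM "volumetric_water_content"').fetchone()[0]
n_ids = con.execute(
    'SELECT COUNT(DISTINCT meta_id) FROM "volumetric_water_content"'
).fetchone()[0]
con.close()
```

Appending only works if every source table of a variable has the same columns, which is also an implicit assumption of the loader above.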


**The next cell runs the actual data loading.**

In [None]:
from tqdm import tqdm

for meta in tqdm(metadata):
    load_data(meta, str(db_path))

### Load the metadata

We dump every single metadata entry as JSON to a temporary folder and read the files back into DuckDB.
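
The dump-and-read-back step can be sketched with the standard library alone. The records here are plain made-up dictionaries, not real metadata entries, and `json.loads` stands in for the DuckDB reader. The point of the sketch is the lifetime of the folder: it is deleted when the `with` block ends, so anything that reads the files has to run inside the block.

```python
import json
import tempfile
from pathlib import Path

# made-up stand-ins for the metadata entries
records = [{"id": 1, "title": "ISMN sensor A"}, {"id": 2, "title": "ISMN sensor B"}]

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # one file per record, named after its id
    for rec in records:
        (folder / f"export_{rec['id']}.json").write_text(json.dumps(rec))

    # read everything back while the folder still exists
    files = sorted(folder.glob("export_*.json"))
    loaded = [json.loads(p.read_text()) for p in files]

# the folder and its files are removed on leaving the block
folder_removed = not folder.exists()
```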

In [None]:
# install the spatial extension
with duckdb.connect(str(db_path), read_only=True) as db:
    db.install_extension('spatial')

In [None]:
import tempfile

sql = """
CREATE TABLE raw_metadata AS SELECT * FROM read_json('%s/export_*.json', columns={
    id: 'INTEGER',
    title: 'TEXT',
    abstract: 'TEXT',
    location: 'TEXT',
    details: 'JSON',
    variable: 'JSON'
});
"""

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp)
    print(f"Populating temporary folder: {path}")
    
    # go for all entries
    for meta in tqdm(metadata):
        with open(path / f"export_{meta.id}.json", "w") as f:
            f.write(meta.model_dump_json())

    # add to duckdb
    with duckdb.connect(str(db_path), read_only=False) as db:
        db.load_extension('spatial')
        db.sql(sql % tmp)

# after that, create a nicer overview table
sql = """
CREATE TABLE metadata AS
SELECT 
    id, 
    ST_X(location::Geometry) as lon, 
    ST_Y(location::Geometry) as lat,
    trim(variable.name, '"') as variable,
    list_transform(
        list_filter(details::JSON[], d -> trim(json_value(d, '$.key'), '"') = 'depth_from'),
        d -> trim(json_value(d, '$.value'), '"')::FLOAT
    )[1] AS depth_from,
    list_transform(
        list_filter(details::JSON[], d -> trim(json_value(d, '$.key'), '"') = 'depth_to'),
        d -> trim(json_value(d, '$.value'), '"')::FLOAT
    )[1] AS depth_to,
    list_transform(
        list_filter(details::JSON[], d -> trim(json_value(d, '$.key'), '"') = 'network'),
        d -> trim(json_value(d, '$.value'), '"')::TEXT
    )[1] AS network
FROM raw_metadata;
"""

with duckdb.connect(str(db_path), read_only=False) as db:
    db.load_extension('spatial')
    db.sql(sql)
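
The overview query above picks single values out of the `details` list by filtering on the `key` field and taking the first match. The same logic in plain Python may be easier to follow. This is a sketch that assumes, as the SQL does, that `details` is a list of objects with `key` and `value` fields; the helper `detail_value` and the sample values are made up.

```python
import json


def detail_value(details, key, cast=str):
    """Return the first value whose `key` matches, or None if there is none."""
    matches = [d["value"] for d in details if d.get("key") == key]
    return cast(matches[0]) if matches else None


# made-up sample in the shape the query expects
details = json.loads(
    '[{"key": "depth_from", "value": 0.05},'
    ' {"key": "depth_to", "value": 0.1},'
    ' {"key": "network", "value": "DEMO"}]'
)

depth_from = detail_value(details, "depth_from", float)
depth_to = detail_value(details, "depth_to", float)
network = detail_value(details, "network")
missing = detail_value(details, "sensor_type")
```

A record without the requested key gives `None` here, which corresponds to a missing value in the `metadata` table.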

## Some overviews

In [None]:
import duckdb
from pathlib import Path
db_path = Path('/out') / database_name

# first, make sure the spatial extension is installed
with duckdb.connect(str(db_path), read_only=True) as db:
    db.install_extension('spatial')

In [None]:
# load the spatial extension and read the metadata overview table
with duckdb.connect(str(db_path), read_only=True) as db:
    db.load_extension('spatial')
    df = db.sql('FROM metadata').pl()

df

In [None]:
import folium

template = """
<h3>{v}</h3>
<p>Network: <strong>{net}</strong></p>
<p>Depth: <strong>{f:.2f}</strong> - <strong>{t:.2f}</strong></p>
"""

colors = {
    "soil temperature": "brown",
    "air temperature": "red",
    "rainfall intensity": "purple",
    "volumetric water content": "blue",
}
osm = folium.Map(location=[df['lat'].mean(), df['lon'].mean()], zoom_start=5)

# one circle per sensor, colored by variable; unknown variables fall back to black
for row in df.iter_rows(named=True):
    folium.CircleMarker(
        location=[row['lat'], row['lon']], radius=10, fill_color=colors.get(row['variable'], 'black'),
        popup=template.format(v=row['variable'], net=row['network'], f=row['depth_from'], t=row['depth_to'])
    ).add_to(osm)
osm
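
The popup template formats both depths with `:.2f`, which raises a `TypeError` if a depth is `None`, for example when a record has no `depth_from` detail. Whether that occurs depends on the loaded data. A minimal sketch of a guard, with the hypothetical helper `fmt_depth`:

```python
def fmt_depth(value):
    """Format a depth with two decimals, or 'n/a' if it is missing."""
    return f"{value:.2f}" if value is not None else "n/a"


# the template takes pre-formatted strings instead of raw floats
popup_template = "<p>Depth: <strong>{f}</strong> - <strong>{t}</strong></p>"
popup = popup_template.format(f=fmt_depth(0.05), t=fmt_depth(None))
```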

In [None]:
osm.save('/out/ismn_overview.html')