# Test dataset generation
This notebook generates a test dataset that serves as ground truth. It contains GLDs (Groundwater Level Dossiers) that were migrated from DINO to BRO and for which we are certain the migration was successful.

In [None]:
import sys
import pandas as pd

sys.path.append('../../src')
from utils_dino import init_connection_to_dino, get_DINO_data_by_piezometer
from utils_bro import get_bro_data

In [2]:
engine = init_connection_to_dino()

## 1. Get migration id info

In [3]:
sql_identifiers_migration_GLD_additions = """
SELECT
  e.BRO_ID,
  MIN(w.NITG_NR)          AS NITG_NR,
  MIN(w.WELL_DBK)         AS WELL_DBK,
  MIN(p.PIEZOMETER_DBK)   AS PIEZOMETER_DBK
FROM DINO_DBA.GWS_WELL w
INNER JOIN DINO_DBA.GWS_PIEZOMETER p
  ON w.WELL_DBK = p.WELL_DBK
INNER JOIN DINO_DBA.BRO_MIGRATION_EVENT e
  ON p.PIEZOMETER_DBK = e.EVENT_RECORD_DBK
WHERE e.RO_TYPE_CD = 'GLD'
  AND e.EVENT_TYPE_CD = 'ADDITION'
  AND e.TABLE_NM_DBK = (
        SELECT TABLE_NM_DBK
        FROM DINO_DBA.REF_BRO_MIGRATION_TABLE_NM
        WHERE TABLE_NM = 'GWS_PIEZOMETER'
      )
GROUP BY e.BRO_ID
"""

In [4]:
identifiers_migration_GLD_additions = pd.read_sql(sql_identifiers_migration_GLD_additions, engine)

## 2. Get the data and store it in dicts
Here we skip GLDs that
1. are zombies (fetching them raises a `ValueError`),
2. have DINO and BRO time series of different lengths, or
    > _Explanation:_ These are the one-to-many DINO -> BRO cases: something changed in the history of the well/tube, splitting it into two separate GMWs, so the time series are stored in different GLDs.
3. have fewer than two data points.
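
The length-based checks can be sketched as a small predicate. This is a minimal illustration, not the notebook's own code: `keep_pair` and `min_points` are hypothetical names, and the toy rows stand in for the DINO and BRO records. It relies only on `len()`, which also returns the row count for a pandas DataFrame.

```python
def keep_pair(dino_rows, bro_rows, min_points=2):
    """Return True when a DINO/BRO pair passes the length checks (hypothetical helper)."""
    n_dino, n_bro = len(dino_rows), len(bro_rows)
    if n_dino != n_bro:      # one-to-many migration: series lengths differ
        return False
    if n_dino < min_points:  # too few data points to compare
        return False
    return True


# Toy rows standing in for the (date, value) records of each source.
same = [("2020-01-01", 1.0), ("2020-01-02", 1.1)]
shorter = same[:1]

print(keep_pair(same, same))        # True
print(keep_pair(same, shorter))     # False
print(keep_pair(shorter, shorter))  # False
```

Zombie GLDs are not covered by this predicate, since they are skipped earlier, when fetching the data raises `ValueError`.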

In [5]:
samples = 2_000
meta_obs, data_obs = {}, {}
# Iterate over all unique BRO_IDs in the identifiers_migration_GLD_additions dataframe (drop_duplicates is redundant here given the sql query but kept for safety)
glds_piezo_dbks = identifiers_migration_GLD_additions[['bro_id', 'piezometer_dbk']].drop_duplicates()
for i, (gld, piezo_dbk) in enumerate(glds_piezo_dbks.values):
    try:
        dino_df = get_DINO_data_by_piezometer(piezo_dbk, engine)
        bro_df = get_bro_data(gld)
    except ValueError as e:
        print(e)
        continue
    if dino_df.shape[0] != bro_df.shape[0]: # data length mismatch
        continue
    if dino_df.shape[0] < 2 or bro_df.shape[0] < 2: # too few data points
        continue
    # TODO: decide what to do when the BRO record has more than 2 columns (multiple observations):
    # 1. only take 'onbekend' (unknown), which is probably the same as DINO and is the last column in the dataframe (current choice)
    # 2. store them as separate time series
    data_obs[gld] = {'dino': dino_df[['monitor_date', 'value']].values, 'bro': bro_df.iloc[:, [0, -1]].values}
    meta_obs[gld] = {'x': dino_df['x'].values[0], 'y': dino_df['y'].values[0], 'NITG_NR': dino_df['nitg_nr'].values[0]}
    if len(data_obs) >= samples:  # stop once enough pairs have been stored
        break

Error tokenizing data. C error: Expected 1 fields in line 3, saw 2

Er is geen data beschikbaar voor dit dossier GLD000000075026. The response body is empty.
Er is geen data beschikbaar voor dit dossier GLD000000078651. The response body is empty.
Er is geen data beschikbaar voor dit dossier GLD000000073806. The response body is empty.


## 3. Export the data

In [None]:
import pickle
with open(f"../../data/{len(meta_obs)}_sample_migrated_GLD_dino+bro.pkl", 'wb') as f:
    pickle.dump({'meta': meta_obs, 'data': data_obs}, f)
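
As a rough sketch of how the exported file could be read back, the round trip below writes a tiny payload with the same `{'meta': ..., 'data': ...}` layout to a temporary directory and loads it again. The GLD id, coordinates and values are made-up placeholders, and plain lists stand in for the NumPy arrays stored by the notebook.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical one-record payload with the same layout as the exported file.
payload = {
    "meta": {"GLD_EXAMPLE": {"x": 0.0, "y": 0.0, "NITG_NR": "EXAMPLE"}},
    "data": {
        "GLD_EXAMPLE": {
            "dino": [["2020-01-01", 1.0], ["2020-01-02", 1.1]],
            "bro": [["2020-01-01", 1.0], ["2020-01-02", 1.1]],
        }
    },
}

with tempfile.TemporaryDirectory() as tmp:
    # Same naming pattern as the export cell: the record count prefixes the file name.
    path = Path(tmp) / f"{len(payload['meta'])}_sample_migrated_GLD_dino+bro.pkl"
    with open(path, "wb") as f:
        pickle.dump(payload, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Both top-level dicts are keyed by the same GLD ids.
assert loaded["meta"].keys() == loaded["data"].keys()
for gld, series in loaded["data"].items():
    # Pairs kept by the notebook have equal DINO and BRO lengths.
    assert len(series["dino"]) == len(series["bro"])
```

Since unpickling can execute arbitrary code, only load pickle files from a trusted source.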

## 4. Close connection

In [7]:
engine.dispose()