<p align="center">
  <img src="Geo-INQUIRE_logo_2_crop.jpg" alt="Geo-INQUIRE Logo" width="320"/>
</p>

# GEO-INQUIRE Audio Processing Tool (EMSO & EIDA compliant)
**Author:** Silvana Neves <img src="logo-sin-leyenda-color.png" alt="PLOCAN Logo" width="160" style="vertical-align: middle; margin-left: 4px;"/> <img src="OIP.jpeg" alt="EMSO ERIC Logo" width="80" style="vertical-align: middle; margin-left: 8px;"/>

## What this tool does
- Converts WAV to FLAC (lossless) and embeds EMSO metadata.
- Generates MiniSEED and StationXML (EIDA compliant).
- Provides a GUI to load EMSO and StationXML metadata from TXT or JSON.
- Computes file start/end times from the filename and applies UTC offset.
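
As a rough illustration of the last point, the sketch below parses the two explicit filename layouts and shifts the result to UTC. It is a simplified stand-in, not the tool's own function: `start_end_utc` and its parameter names are hypothetical, and it returns `None` when no timestamp is found, whereas the tool falls back to fuzzy parsing.

```python
import re
from datetime import datetime, timedelta

# The two explicit layouts: "2024-05-17_09-25-33" and "20180726_141241".
_PATTERNS = [
    (r"(\d{4}-\d{2}-\d{2})[ _](\d{2}-\d{2}-\d{2})", "%Y-%m-%d %H-%M-%S"),
    (r"(\d{8})[ _](\d{6})", "%Y%m%d %H%M%S"),
]

def start_end_utc(filename, duration_seconds, tz_offset_hours=0):
    """Return (start, end) as naive UTC datetimes, or None if no timestamp matches."""
    stem = re.sub(r"\.\w+$", "", filename)  # drop the file extension
    for pattern, fmt in _PATTERNS:
        match = re.search(pattern, stem)
        if not match:
            continue
        try:
            local_start = datetime.strptime(f"{match.group(1)} {match.group(2)}", fmt)
        except ValueError:
            continue  # digits matched but do not form a valid date/time
        # A recorder clock at UTC+1 is one hour ahead of UTC, so subtract the offset.
        start = local_start - timedelta(hours=tz_offset_hours)
        return start, start + timedelta(seconds=duration_seconds)
    return None

start, end = start_end_utc("hydrophone_20180726_141241.wav", 60, tz_offset_hours=1)
print(start.isoformat(), end.isoformat())
# 2018-07-26T13:12:41 2018-07-26T13:13:41
```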

## Inputs
- WAV files (single or folder).
- EMSO metadata: upload TXT/JSON **or** enter manually in the GUI.
- StationXML metadata: upload TXT/JSON **or** enter manually in the GUI.
- UTC offset (e.g., UTC+0).
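
For orientation, a StationXML metadata file might look like the JSON below. The keys are ones the processing code reads, but the values are placeholders and the flat-object layout is an assumption; the bundled sample files are the authoritative reference. The hypothetical `load_stationxml_metadata` helper applies the same fallbacks the code uses when a field is left empty.

```python
import json

# Placeholder values for illustration only; replace with your own deployment's metadata.
EXAMPLE_JSON = """
{
  "network_code": "XX",
  "station_code": "HYD01",
  "location_code": "",
  "channel_code": "HDH",
  "wav_float_units": "volts",
  "datalogger_nbits": "24",
  "datalogger_ref_voltage": ""
}
"""

def load_stationxml_metadata(text):
    """Parse metadata JSON and fill empty fields with the code's fallback values."""
    cfg = json.loads(text)
    cfg["location_code"] = cfg.get("location_code") or "00"
    cfg["wav_float_units"] = str(cfg.get("wav_float_units", "normalized")).strip().lower()
    cfg["datalogger_nbits"] = int(cfg.get("datalogger_nbits") or 24)
    cfg["datalogger_ref_voltage"] = float(cfg.get("datalogger_ref_voltage") or 2.5)
    return cfg

cfg = load_stationxml_metadata(EXAMPLE_JSON)
print(cfg["location_code"], cfg["datalogger_nbits"], cfg["datalogger_ref_voltage"])
# 00 24 2.5
```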

## Outputs
- FLAC with embedded EMSO metadata.
- MiniSEED file for seismic workflows.
- StationXML metadata file for EIDA.

## Quick start
1. Run the installation cell (cell 2) once to install required packages.
2. Run the main code cell (cell 3) to launch the GUI.
3. Select WAV files or a folder.
4. Load EMSO metadata (TXT/JSON) and StationXML metadata (TXT/JSON).
5. Click **Start Processing**.

## Notes
- Required EMSO fields are marked with * and validated in the GUI.
- Start/end times are computed automatically from filenames.
- Output sample rate is controlled by `FINAL_SAMPLING_RATE` in the code (300 Hz by default).
- Sample metadata files are included for EMSO and StationXML.
- This tool assumes no digitizer decimation in the field; decimation is applied in post-processing by this script. If your digitizer already decimates, set `datalogger_input_sample_rate` to the pre-decimation rate, set `FINAL_SAMPLING_RATE` to the recorded output rate, and update the StationXML response decimation (factor/offset/delay/correction) to match the digitizer stage.
- If your WAV files are FLOAT/DOUBLE, set `wav_float_units` (`normalized`, `volts`, or `counts`; defaults to `normalized`) and `datalogger_nbits`/`datalogger_ref_voltage` so the tool can reconstruct true counts for MiniSEED and StationXML.
- StationXML is written by ObsPy, then post-processed to enforce strict schema order and required element placement (for example, `Source` before `Sender` and ordered decimation fields). This avoids validation failures after edits.
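
The decimation and float-scaling notes above come down to a little arithmetic, sketched here in plain Python. The function names `counts_from_float` and `decimation_factor` are hypothetical; the formulas follow the processing code (full scale of `2**(nbits-1) - 1` for normalized data, `2**(nbits-1) / ref_voltage` counts per volt), and the defaults of 24 bits and 2.5 V match the code's fallbacks.

```python
def counts_from_float(samples, units="normalized", nbits=24, ref_voltage=2.5):
    """Convert float WAV samples to integer ADC counts."""
    if units == "volts":
        scale = (2 ** (nbits - 1)) / ref_voltage   # counts per volt
    elif units == "counts":
        scale = 1.0                                # already in counts
    else:
        scale = float(2 ** (nbits - 1) - 1)        # normalized floats in [-1, 1]
    return [int(round(s * scale)) for s in samples]

def decimation_factor(input_rate, output_rate):
    """Integer decimation factor; 1 when no decimation is needed."""
    if input_rate > output_rate:
        return round(input_rate / output_rate)
    return 1

print(counts_from_float([1.0, 0.0, -1.0], "normalized", nbits=16))
# [32767, 0, -32767]
print(decimation_factor(48000, 300))
# 160
```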


In [1]:
# Run this cell once to install required packages
# After installation, you can skip this cell in future sessions

!pip install numpy scipy soundfile mutagen plotly python-dateutil lxml
!pip install obspy==1.4.2 --only-binary=:all:

Collecting mutagen
  Using cached mutagen-1.47.0-py3-none-any.whl.metadata (1.7 kB)
Collecting obspy
  Using cached obspy-1.4.2.tar.gz (17.0 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting plotly
  Downloading plotly-6.5.2-py3-none-any.whl.metadata (8.5 kB)
Collecting sqlalchemy<2 (from obspy)
  Using cached sqlalchemy-1.4.54-cp314-cp314-win_amd64.whl
Collecting narwhals>=1.15.1 (from plotly)
  Using cached narwhals-2.15.0-py3-none-any.whl.metadata (13 kB)
Using cached mutagen-1.47.0-py3-none-any.whl (194 kB)
Downloading plotly-6.5.2-py3-none-any.whl (9.9 MB)
   ---------------------------------------- 0.0/9.9 MB ? eta -:--:--
   ------- -------------------------------- 1.8/9.9 MB 17.5 MB/s eta

  error: subprocess-exited-with-error
  
  × Building wheel for obspy (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [6042 lines of output]
      !!
      
              ********************************************************************************
              Please consider removing the following classifiers in favor of a SPDX license expression:
      
              License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)
      
              See https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#license for details.
              ********************************************************************************
      
      !!
        self._finalize_license_expression()
      running bdist_wheel
      running build
      running build_py
      creating build\lib.win-amd64-cpython-314\obspy
      copying obspy\conftest.py -> build\lib.win-amd64-cpython-314\obspy
      copying obspy\__init__.py -> build\lib.win-amd64-cpython-314

In [3]:
import os
import tkinter as tk
from tkinter import filedialog, messagebox, ttk, Toplevel, Label
import webbrowser
import json
from datetime import datetime, timedelta
import time
import re
import tempfile
import numpy as np
import soundfile as sf
from scipy.signal import firwin, lfilter, resample
from mutagen.flac import FLAC
import obspy
from obspy.core import UTCDateTime
from obspy.core.inventory import Inventory, Network, Station, Channel, Site
from obspy.core.inventory.response import (
    Response, PolesZerosResponseStage,
    InstrumentSensitivity)
import plotly.graph_objects as go
from dateutil import parser  # Date parsing
from lxml import etree

# Target sample rate (Hz) used for downsampling and XML metadata
# Adjust this value if a different output sample rate is required.
FINAL_SAMPLING_RATE = 300

def extract_datetime_from_filename(filename):
    """
    Automatically extracts a datetime from the filename.
    
    1. First, tries to match a full datetime with explicit separators, e.g.:
       "2024-05-17_09-25-33" -> returns a datetime with date and time.
    2. If that fails, it tries to match a compact format like "20180726_141241".
    3. Otherwise, it falls back to fuzzy parsing of the entire filename.
    4. If fuzzy parsing also fails, the current UTC time is returned.
    
    The returned datetime is naive (tzinfo removed).
    """
    # Remove file extension
    name = re.sub(r'\.\w+$', '', filename)
    
    # 1. Try full datetime with explicit separators: e.g. "2024-05-17_09-25-33"
    pattern_full = r'(\d{4}[-]\d{2}[-]\d{2})[ _](\d{2}[-]\d{2}[-]\d{2})'
    match = re.search(pattern_full, name)
    if match:
        try:
            date_part = match.group(1)
            time_part = match.group(2).replace('-', ':')
            dt_str = f"{date_part} {time_part}"
            dt = datetime.strptime(dt_str, "%Y-%m-%d %H:%M:%S")
            return dt
        except Exception:
            pass

    # 2. Try compact datetime: e.g. "20180726_141241"
    pattern_compact = r'(\d{8})[ _](\d{6})'
    match = re.search(pattern_compact, name)
    if match:
        try:
            date_part = match.group(1)
            time_part = match.group(2)
            dt_str = f"{date_part} {time_part}"
            dt = datetime.strptime(dt_str, "%Y%m%d %H%M%S")
            return dt
        except Exception:
            pass

    # 3. Fallback: fuzzy parse entire filename
    try:
        dt = parser.parse(name, fuzzy=True)
        return dt.replace(tzinfo=None)
    except Exception:
        return datetime.utcnow()

def generate_start_end_time(wav_file_name, duration_seconds):
    """
    Extracts the start time from the WAV filename using filename parsing.
    If no valid timestamp is found, uses the current UTC time.
    Computes the end time as start time plus the file's duration.
    Returns start and end times in ISO format.
    """
    dt = extract_datetime_from_filename(wav_file_name)
    start_time = UTCDateTime(dt.isoformat())
    end_time = start_time + duration_seconds
    return start_time.isoformat(), end_time.isoformat()

# ------------------ Utility Functions ------------------

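def plot_signals(original_signal, filtered_signal, downsampled_signal, original_rate, target_rate):
    """
    Plot the first second of the original, filtered, and downsampled signals
    with Plotly. Each signal is normalized to its own peak amplitude, and dashed
    lines link every downsampled point to the corresponding filtered sample.
    """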
    original_samples_to_plot = original_rate
    downsampled_samples_to_plot = target_rate

    original_signal_norm = (original_signal / np.max(np.abs(original_signal))
                              if np.max(np.abs(original_signal)) != 0 else original_signal)
    filtered_signal_norm = (filtered_signal / np.max(np.abs(filtered_signal))
                            if np.max(np.abs(filtered_signal)) != 0 else filtered_signal)
    downsampled_signal_norm = (downsampled_signal / np.max(np.abs(downsampled_signal))
                               if np.max(np.abs(downsampled_signal)) != 0 else downsampled_signal)

    t_original = np.linspace(0, 1, original_samples_to_plot, endpoint=False)
    t_filtered = np.linspace(0, 1, original_samples_to_plot, endpoint=False)
    t_downsampled = np.linspace(0, 1, downsampled_samples_to_plot, endpoint=False)

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=t_original, y=original_signal_norm[:original_samples_to_plot],
                             mode='lines', name='Original Signal'))
    fig.add_trace(go.Scatter(x=t_filtered, y=filtered_signal_norm[:original_samples_to_plot],
                             mode='lines', name='Filtered Signal'))
    fig.add_trace(go.Scatter(x=t_downsampled, y=downsampled_signal_norm[:downsampled_samples_to_plot],
                             mode='lines+markers', name=f'Downsampled Signal ({FINAL_SAMPLING_RATE} Hz)',
                             marker=dict(color='gold')))
    for i in range(downsampled_samples_to_plot):
        filtered_index = int(i * (original_rate / target_rate))
        fig.add_trace(go.Scatter(x=[t_downsampled[i], t_filtered[filtered_index]],
                                 y=[downsampled_signal_norm[i], filtered_signal_norm[filtered_index]],
                                 mode='lines', line=dict(color='green', dash='dash'),
                                 showlegend=False))
    fig.update_layout(
        title='First Second of the first file: Signal Downsampling and Interpolation',
        xaxis_title='Time [s]',
        yaxis_title='Amplitude',
        legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01),
        template="plotly_white"
    )
    fig.show()

def get_pcm_subtype(nbits):
    if nbits <= 16:
        return "PCM_16"
    if nbits <= 24:
        return "PCM_24"
    return "PCM_32"


def full_scale_counts(nbits):
    return float(2 ** (nbits - 1) - 1)


def read_wav_as_counts(file_path, stationxml_data=None):
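    """
    Read WAV data as integer counts.

    - PCM WAV: read as int32. Note that soundfile scales integer data to the
      full int32 range, so subtypes narrower than 32 bits (e.g. PCM_16, PCM_24)
      are returned left-shifted rather than as native ADC counts.
    - FLOAT WAV: requires wav_float_units to interpret scaling.
      * normalized: float in [-1, 1] mapped to full-scale counts
      * volts: float in volts mapped using ADC counts/volt
      * counts: float already in counts

    Returns (counts, samplerate, info, source_type).
    """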
    info = sf.info(file_path)
    subtype = (info.subtype or "").upper()
    samplerate = info.samplerate

    if subtype.startswith("PCM"):
        data, _ = sf.read(file_path, dtype="int32")
        return data.astype(np.int32), samplerate, info, "pcm"

    if subtype in ("FLOAT", "DOUBLE"):
        data, _ = sf.read(file_path, dtype="float64")
        cfg = stationxml_data or {}
        wav_float_units = str(cfg.get("wav_float_units", "normalized")).strip().lower()
        nbits = int(cfg.get("datalogger_nbits") or 24)
        ref_v = float(cfg.get("datalogger_ref_voltage") or 2.5)

        if wav_float_units == "volts":
            counts_per_v = (2 ** (nbits - 1)) / ref_v
            counts = np.round(data * counts_per_v).astype(np.int32)
            return counts, samplerate, info, "float_volts"
        if wav_float_units == "counts":
            counts = np.round(data).astype(np.int32)
            return counts, samplerate, info, "float_counts"

        # Default: normalized float in [-1, 1]
        counts = np.round(data * full_scale_counts(nbits)).astype(np.int32)
        return counts, samplerate, info, "float_normalized"

    # Fallback: treat as float counts
    data, _ = sf.read(file_path, dtype="float64")
    return np.round(data).astype(np.int32), samplerate, info, "unknown"


def write_flac_from_counts(flac_file_path, data_counts, sample_rate, nbits=24):
    """
    Write integer counts to FLAC using appropriate PCM subtype.
    """
    subtype = get_pcm_subtype(nbits)
    sf.write(
        flac_file_path,
        data_counts.astype(np.int32),
        sample_rate,
        format="FLAC",
        subtype=subtype
    )

def add_metadata_to_flac(flac_file_path, metadata):
    """
    Writes every key/value pair in `metadata` to the FLAC file as a tag.
    The caller supplies the per-file fields (time_coverage_start,
    time_coverage_end, initial sampling rate); this function adds
    date_created, set to the current UTC time.
    """
    audio = FLAC(flac_file_path)
    metadata["date_created"] = datetime.utcnow().isoformat()  # Always in UTC
    for key, value in metadata.items():
        audio[key] = str(value)
    audio.save()

def get_wav_info(file_path):
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"File not found: {file_path}")
    info = sf.info(file_path)
    rate = info.samplerate
    if not (8000 <= rate <= 400000):
        raise ValueError(f"Unrealistic sample rate {rate} detected in file {file_path}.")
    return rate, info.frames, info

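def downsample_wav(data_counts, original_rate, target_rate=FINAL_SAMPLING_RATE, nbits=24):
    """
    Low-pass filter and resample integer counts to target_rate.

    Applies a 101-tap FIR low-pass filter (cutoff: the lower of 150 Hz and the
    target Nyquist frequency), then resamples in chunks with scipy.signal.resample.
    Returns (downsampled int32 counts, filtered float64 signal at the original rate).
    """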
    if original_rate == target_rate:
        return data_counts.astype(np.int32), data_counts.astype(np.float64)

    data_float = data_counts.astype(np.float64)
    cutoff = min(150, 0.5 * target_rate)
    numtaps = 101
    fir_filter = firwin(numtaps, cutoff / (0.5 * original_rate))
    filtered_data = lfilter(fir_filter, 1.0, data_float)
    chunk_size = 1000000
    chunks = []
    for i in range(0, len(filtered_data), chunk_size):
        chunk = filtered_data[i:i + chunk_size]
        chunk_downsampled = resample(chunk, int(len(chunk) * (target_rate / original_rate)))
        chunks.append(chunk_downsampled)
    downsampled = np.concatenate(chunks)

    full_scale = full_scale_counts(nbits)
    downsampled = np.clip(np.round(downsampled), -full_scale, full_scale).astype(np.int32)
    return downsampled, filtered_data

def convert_data_format(data, nbits=24):
    full_scale = full_scale_counts(nbits)
    if data.dtype != np.int32:
        data = np.clip(np.round(data), -full_scale, full_scale).astype(np.int32)
    return data

def extract_times_from_wav(file_path):
    """
    Uses filename parsing to get the start time from the filename.
    Computes the end time as start time plus the file's duration.
    Returns (start_time, end_time) as UTCDateTime objects.
    """
    base_name = os.path.basename(file_path)
    rate, frames, _info = get_wav_info(file_path)
    duration = frames / rate
    if duration > 86400:
        raise ValueError(f"File {file_path} has an implausibly long duration ({duration} seconds).")
    start_iso, end_iso = generate_start_end_time(base_name, duration)
    start_time = UTCDateTime(start_iso)
    end_time = UTCDateTime(end_iso)
    return start_time, end_time

def process_wav_file(file_path, metadata, stationxml_data, plot_first=True, tz_offset=0):
    data_counts, rate, info, source_type = read_wav_as_counts(file_path, stationxml_data)
    duration_seconds = len(data_counts) / rate
    if duration_seconds > 86400:
        raise ValueError(f"File {file_path} has an implausibly long duration ({duration_seconds} seconds).")
    
    # Extract file-specific start/end times and initial sampling rate
    file_start_time, file_end_time = extract_times_from_wav(file_path)
    # Adjust times by the UTC offset so FLAC metadata reflects UTC time
    adjusted_start_time = file_start_time - timedelta(hours=tz_offset)
    adjusted_end_time = file_end_time - timedelta(hours=tz_offset)
    file_metadata = metadata.copy()
    file_metadata["time_coverage_start"] = adjusted_start_time.isoformat()
    file_metadata["time_coverage_end"] = adjusted_end_time.isoformat()
    file_metadata["initial_sampling_rate"] = rate
    file_metadata["input_audio_subtype"] = str(info.subtype)
    file_metadata["input_audio_source_type"] = source_type

    nbits = int(stationxml_data.get("datalogger_nbits") or 24) if stationxml_data else 24

    if source_type.startswith("float"):
        wav_units = str(stationxml_data.get("wav_float_units", "normalized")) if stationxml_data else "normalized"
        print(f"Input WAV is FLOAT; interpreting as '{wav_units}'.")
        if wav_units not in ("normalized", "volts", "counts"):
            print("Warning: Unknown wav_float_units. Expected: normalized | volts | counts")

    if rate > FINAL_SAMPLING_RATE:
        downsampled_data, filtered_data = downsample_wav(data_counts, rate, FINAL_SAMPLING_RATE, nbits)
        if plot_first:
            plot_signals(data_counts, filtered_data, downsampled_data, rate, FINAL_SAMPLING_RATE)
    else:
        downsampled_data = convert_data_format(data_counts, nbits)

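    # Swap only the extension, regardless of its case; str.replace would also
    # alter ".wav" elsewhere in the path and miss mixed-case extensions
    flac_output_path = os.path.splitext(file_path)[0] + '.flac'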
    write_flac_from_counts(flac_output_path, downsampled_data, FINAL_SAMPLING_RATE, nbits)
    add_metadata_to_flac(flac_output_path, file_metadata)
    return flac_output_path

def flac_to_miniseed(flac_file_path, output_path, stationxml_data=None, start_time=None, end_time=None):
    """
    Convert FLAC to MiniSEED with proper network, station, location, and channel codes.
    
    Args:
        flac_file_path: Path to input FLAC file
        output_path: Path to output MiniSEED file
        stationxml_data: Dictionary with network_code, station_code, location_code, channel_code
        start_time: UTCDateTime for start of data (must match inventory)
        end_time: UTCDateTime for end of data (must match inventory)
    """
    try:
        flac_meta = FLAC(flac_file_path)
        if start_time is None:
            if 'time_coverage_start' in flac_meta:
                start_time = UTCDateTime(flac_meta['time_coverage_start'][0])
            elif 'date_created' in flac_meta:
                start_time = UTCDateTime(flac_meta['date_created'][0])
            else:
                start_time = UTCDateTime()
    except Exception:
        if start_time is None:
            start_time = UTCDateTime()
    
    # Read FLAC as integer counts
    samples, sample_rate = sf.read(flac_file_path, dtype="int32")

    # Create ObsPy stream and trace
    stream = obspy.Stream()
    trace = obspy.Trace(data=samples.astype(np.int32))
    trace.stats.sampling_rate = sample_rate
    trace.stats.starttime = start_time
    
    # Add network, station, location, and channel codes if provided
    if stationxml_data:
        trace.stats.network = stationxml_data.get("network_code", "")
        trace.stats.station = stationxml_data.get("station_code", "")
        # Location code defaults to "00" if empty
        loc_code = stationxml_data.get("location_code", "") or ""
        trace.stats.location = loc_code if loc_code else "00"
        trace.stats.channel = stationxml_data.get("channel_code", "")
    
    stream.append(trace)
    stream.write(output_path, format='MSEED')


def generate_stationxml_obspy(wav_file_name, stationxml_data, duration_seconds, tz_offset):
    
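    # Convert a GUI/JSON value to float, returning `default` for empty or invalid input
    def safe_float(value, default=0.0):
        # Check for None/empty explicitly so that a numeric 0 is kept as 0.0
        if value is None or str(value).strip() == "":
            return default
        try:
            return float(value)
        except (ValueError, TypeError):
            return default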

    # 1) compute start/end with offset
    start_iso, end_iso = generate_start_end_time(wav_file_name, duration_seconds)
    start_time = UTCDateTime(start_iso) - timedelta(hours=tz_offset)
    end_time   = UTCDateTime(end_iso)   - timedelta(hours=tz_offset)

    # 2) gather GUI fields
    sender = stationxml_data.get("sender", "")
    source = stationxml_data.get("source", "")
    net_id = stationxml_data.get("network_identifier", "")
    
    # Ensure network_identifier is a valid URI
    net_id = (net_id or "").strip()  # strip once so the prefix checks see clean input
    if not net_id:
        network_identifier = ""
    elif net_id.startswith(("http://", "https://", "urn:", "doi:")):
        network_identifier = net_id
    elif net_id.startswith("10.") and "/" in net_id:  # Bare DOI format
        network_identifier = f"https://doi.org/{net_id}"
    else:
        clean_id = net_id.replace(" ", "_")
        network_identifier = f"urn:network:{clean_id}"

    # 3) Get sensor and datalogger info
    sensor_desc = stationxml_data.get("sensor_description", "")
    sensor_model = stationxml_data.get("sensor_model", "")
    sensor_serial = stationxml_data.get("sensor_serial_number", "")
    sensor_type = stationxml_data.get("sensor_type", "")
    sensor_manufacturer = stationxml_data.get("sensor_manufacturer", "")
    sensor_vendor = stationxml_data.get("sensor_vendor", "")
    
    datalogger_model = stationxml_data.get("datalogger_model", "")
    datalogger_serial = stationxml_data.get("datalogger_serial_number", "")
    datalogger_type = stationxml_data.get("datalogger_type", "")
    datalogger_manufacturer = stationxml_data.get("datalogger_manufacturer", "")
    datalogger_vendor = stationxml_data.get("datalogger_vendor", "")
    datalogger_desc = stationxml_data.get("datalogger_description", "")
    
    input_sample_rate = safe_float(stationxml_data.get("datalogger_input_sample_rate", ""), FINAL_SAMPLING_RATE)
    # Full system sensitivity (for InstrumentSensitivity - Pa -> count)
    full_system_sensitivity = safe_float(stationxml_data.get("sensitivity_value", ""), 0.0)
    
    # Hydrophone-only sensitivity (for Stage 1 - Pa -> V)
    hydrophone_sensitivity_vpa = safe_float(stationxml_data.get("hydrophone_sensitivity_vpa", ""), 0.0)
    
    # Stage 2 gain: V -> normalized_unit conversion (for normalized data)
    # This represents how many normalized units per Volt from the digitizer
    stage2_gain = safe_float(stationxml_data.get("datalogger_stage2_gain", ""), 1.0)
    
    sensitivity_freq = safe_float(stationxml_data.get("sensitivity_frequency", ""), 0.0)
    input_units = stationxml_data.get("input_units_name", "Pa")
    output_units = stationxml_data.get("output_units_name", "count")
    
    
    
    # Calculate decimation (factor = 1 if no decimation)
    if input_sample_rate > FINAL_SAMPLING_RATE:
        decimation_factor = round(input_sample_rate / FINAL_SAMPLING_RATE)
        adjusted_input_sr = decimation_factor * FINAL_SAMPLING_RATE
    else:
        decimation_factor = 1
        adjusted_input_sr = FINAL_SAMPLING_RATE

    # 4) Calculate Depth
    channel_elevation = safe_float(stationxml_data.get("channel_elevation", ""), 0.0)
    water_depth = stationxml_data.get("water_depth", "")
    channel_depth_value = None
    
    if water_depth and str(water_depth).strip() and str(water_depth).strip().lower() != "none":
        try:
            water_depth_float = float(water_depth)
            channel_depth_value = -water_depth_float - channel_elevation
        except (ValueError, TypeError):
            channel_depth_str = stationxml_data.get("channel_depth", "")
            if channel_depth_str and str(channel_depth_str).strip() and str(channel_depth_str).strip().lower() != "none":
                try:
                    channel_depth_value = float(channel_depth_str)
                except (ValueError, TypeError):
                    channel_depth_value = None
    else:
        channel_depth_str = stationxml_data.get("channel_depth", "")
        if channel_depth_str and str(channel_depth_str).strip() and str(channel_depth_str).strip().lower() != "none":
            try:
                channel_depth_value = float(channel_depth_str)
            except (ValueError, TypeError):
                channel_depth_value = None

    # 5) Build Channel with ObsPy
    channel_kwargs = {
        "code": stationxml_data.get("channel_code",""),
        "location_code": stationxml_data.get("location_code","") or "00",
        "latitude": safe_float(stationxml_data.get("channel_latitude", ""), 0.0),
        "longitude": safe_float(stationxml_data.get("channel_longitude", ""), 0.0),
        "elevation": channel_elevation,  
        "azimuth": safe_float(stationxml_data.get("azimuth", ""), 0.0),
        "dip": safe_float(stationxml_data.get("dip", ""), 0.0),
        "sample_rate": FINAL_SAMPLING_RATE,
        "start_date": start_time,
        "end_date": end_time
    }
    # Always include depth (default to 0.0 if not provided)
    channel_kwargs["depth"] = channel_depth_value if channel_depth_value is not None else 0.0
    
    channel = Channel(**channel_kwargs)
    
    if sensor_desc and sensor_desc.strip():
        channel.description = sensor_desc

    # 6) Build Response
    # Single stage: full system (Pa -> output units), modelled as a flat gain
    stage1 = PolesZerosResponseStage(
        stage_sequence_number=1,
        stage_gain=full_system_sensitivity,  # Full system sensitivity (counts/Pa)
        stage_gain_frequency=sensitivity_freq,
        input_units=input_units,  # "Pa"
        output_units=output_units,  # Output units should match MiniSEED data
        pz_transfer_function_type="LAPLACE (RADIANS/SECOND)",
        normalization_factor=1.0,  # No poles/zeros, so the response shape is already unity
        normalization_frequency=sensitivity_freq,
        zeros=[],  # No zeros for simple gain
        poles=[]  # No poles: a pole at s = 0 would describe an integrator, not a flat gain
    )

    response = Response(
        instrument_sensitivity=InstrumentSensitivity(
            value=full_system_sensitivity,  # Full system (counts/Pa) from metadata
            frequency=sensitivity_freq if sensitivity_freq > 0 else 100.0,  # Use actual sensitivity frequency
            input_units=input_units,  # String, e.g. "Pa"
            output_units=output_units  # String
        ),
        response_stages=[stage1]  # Single stage only
    )

    # Attach Response to channel
    channel.response = response

    # 7) Build Station and Network
    site = Site(name=stationxml_data.get("site_name",""))
    station = Station(
        code=stationxml_data.get("station_code",""),
        latitude=safe_float(stationxml_data.get("latitude", ""), 0.0),
        longitude=safe_float(stationxml_data.get("longitude", ""), 0.0),
        elevation=safe_float(stationxml_data.get("elevation", ""), 0.0),
        start_date=start_time,
        end_date=end_time,
        site=site,
        channels=[channel]
    )
    
    station_desc = stationxml_data.get("station_description", "")
    if station_desc and station_desc.strip():
        station.description = station_desc

    net_desc = stationxml_data.get("network_description","")
    if network_identifier:
        net_desc += f" | Identifier: {network_identifier}"
    network = Network(
        code=stationxml_data.get("network_code",""),
        description=net_desc,
        start_date=start_time,
        end_date=end_time,
        stations=[station]
    )

    # Create Inventory and write with ObsPy
    inventory = Inventory(networks=[network], source=sender)
    xml_filename = f"{os.path.splitext(wav_file_name)[0]}.station.xml"
    
    # Write without validation; validate after post-processing
    # ObsPy handles schema ordering; apply post-processing fixes
    inventory.write(xml_filename, format="STATIONXML", validate=False)

    # lxml post-processing for structural fixes
    # Named xml_parser to avoid shadowing the dateutil `parser` imported at module level
    xml_parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.parse(xml_filename, xml_parser)
    root = tree.getroot()
    ns_pref = f"{{{root.nsmap[None]}}}" if None in root.nsmap else ""

    # Order <Source> before <Sender>
    sender_el = root.find(f"{ns_pref}Sender")
    src_el = root.find(f"{ns_pref}Source")
    
    if src_el is not None:
        root.remove(src_el)
    if sender_el is not None:
        root.remove(sender_el)
    
    if source:
        src_el = etree.Element(f"{ns_pref}Source")
        src_el.text = source
        root.insert(0, src_el)
    
    sender_el = etree.Element(f"{ns_pref}Sender")
    sender_el.text = sender
    if source:
        root.insert(1, sender_el)
    else:
        root.insert(0, sender_el)

    # Remove any <EndDate> child elements (collect them first so that removing
    # nodes does not disturb the tree iteration)
    for ed in list(root.iter(f"{ns_pref}EndDate")):
        parent = ed.getparent()
        if parent is not None:
            parent.remove(ed)

    # Order Network elements: Description, Identifier, Station
    for net in root.findall(f"{ns_pref}Network"):
        desc = net.find(f"{ns_pref}Description")
        ident = net.find(f"{ns_pref}Identifier")
        st = net.find(f"{ns_pref}Station")
        
        if ident is not None:
            net.remove(ident)
        else:
            ident = etree.Element(f"{ns_pref}Identifier")
        
        if network_identifier:
            ident.text = network_identifier
        else:
            ident.text = ""
        
        if st is not None:
            net.remove(st)
        
        if desc is not None:
            desc_index = list(net).index(desc)
            net.insert(desc_index + 1, ident)
        else:
            net.insert(0, ident)
        
        if st is not None:
            net.append(st)

    # Ensure Decimation has required elements
    # Required elements: InputSampleRate, Factor, Offset, Delay, Correction
    for decimation in root.findall(f".//{ns_pref}Decimation"):
        # Get existing elements
        input_sr = decimation.find(f"{ns_pref}InputSampleRate")
        factor_el = decimation.find(f"{ns_pref}Factor")
        offset_el = decimation.find(f"{ns_pref}Offset")
        delay_el = decimation.find(f"{ns_pref}Delay")
        correction_el = decimation.find(f"{ns_pref}Correction")
        
        # Remove all to reorder properly
        elements_to_reorder = []
        if input_sr is not None:
            decimation.remove(input_sr)
            elements_to_reorder.append(("InputSampleRate", input_sr))
        if factor_el is not None:
            decimation.remove(factor_el)
            elements_to_reorder.append(("Factor", factor_el))
        if offset_el is not None:
            decimation.remove(offset_el)
            elements_to_reorder.append(("Offset", offset_el))
        if delay_el is not None:
            decimation.remove(delay_el)
            elements_to_reorder.append(("Delay", delay_el))
        if correction_el is not None:
            decimation.remove(correction_el)
            elements_to_reorder.append(("Correction", correction_el))
        
        # Create missing elements with default values
        if not any(e[0] == "Offset" for e in elements_to_reorder):
            offset_el = etree.Element(f"{ns_pref}Offset")
            offset_el.text = "0"  # Default: no sample offset
            elements_to_reorder.append(("Offset", offset_el))
        
        if not any(e[0] == "Delay" for e in elements_to_reorder):
            delay_el = etree.Element(f"{ns_pref}Delay")
            delay_el.text = "0.0"  # Default: no delay in seconds
            elements_to_reorder.append(("Delay", delay_el))
        
        if not any(e[0] == "Correction" for e in elements_to_reorder):
            correction_el = etree.Element(f"{ns_pref}Correction")
            correction_el.text = "0.0"  # Default: no correction in seconds
            elements_to_reorder.append(("Correction", correction_el))
        
        # Re-add in correct order: InputSampleRate, Factor, Offset, Delay, Correction
        order = ["InputSampleRate", "Factor", "Offset", "Delay", "Correction"]
        for element_name in order:
            element = next((e[1] for e in elements_to_reorder if e[0] == element_name), None)
            if element is not None:
                decimation.append(element)

    
    # Description is optional: drop it entirely when it is missing, empty, or a "none"/"null" placeholder
    for desc in root.findall(f".//{ns_pref}Description"):
        desc_text_clean = (desc.text or "").strip().lower()
        if desc_text_clean in ("", "none", "null"):
            parent = desc.getparent()
            if parent is not None:
                parent.remove(desc)

    # Convert Equipment to Sensor/DataLogger if present
    def is_valid_value(val):
        """Return True when val holds real content (not None, empty, or the string 'none')."""
        if val is None:
            return False
        val_str = str(val).strip()
        return val_str != "" and val_str.lower() != "none"

    def build_equipment(tag, fields):
        """Build <tag> from (child_tag, value) pairs; return None if no value is valid."""
        valid = [(name, str(value).strip()) for name, value in fields if is_valid_value(value)]
        if not valid:
            return None
        equipment = etree.Element(f"{ns_pref}{tag}")
        for name, text in valid:
            child = etree.SubElement(equipment, f"{ns_pref}{name}")
            child.text = text
        return equipment

    for ch in root.findall(f".//{ns_pref}Channel"):
        # Remove Equipment wrapper if present
        for eq in ch.findall(f"{ns_pref}Equipment"):
            ch.remove(eq)

        # Insert Sensor/DataLogger before Response (or at the end if there is no Response)
        response_el = ch.find(f"{ns_pref}Response")
        response_index = list(ch).index(response_el) if response_el is not None else len(ch)

        # Build Sensor element if info provided (children are added in the order listed)
        sensor = build_equipment("Sensor", [
            ("Type", sensor_type),
            ("Description", sensor_desc),
            ("Manufacturer", sensor_manufacturer),
            ("Vendor", sensor_vendor),
            ("Model", sensor_model),
            ("SerialNumber", sensor_serial),
        ])

        # Build DataLogger element if info provided
        dl = build_equipment("DataLogger", [
            ("Type", datalogger_type),
            ("Description", datalogger_desc),
            ("Manufacturer", datalogger_manufacturer),
            ("Vendor", datalogger_vendor),
            ("Model", datalogger_model),
            ("SerialNumber", datalogger_serial),
        ])
        
        # Insert Sensor and DataLogger before Response
        # Order: Sensor, then DataLogger, then Response
        if sensor is not None:
            ch.insert(response_index, sensor)
            response_index += 1  # Update index after inserting
        if dl is not None:
            ch.insert(response_index, dl)
            # Response index remains after inserts

    # Write final XML
    tree.write(xml_filename, pretty_print=True, xml_declaration=True, encoding="UTF-8")
    
    # Check that the final XML (after all fixes) can be read back by ObsPy
    try:
        from obspy import read_inventory
        read_inventory(xml_filename, format="STATIONXML")
    except Exception as e:
        # Report the failure and chain the original exception so the traceback is preserved
        raise ValueError(f"The created file fails to validate after post-processing: {e}") from e
    
    return xml_filename
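

# --- Illustrative sketch (not part of the original tool) ---------------------
# The Decimation fix-up above detaches the known children, fills in defaults
# for any that are missing, and re-appends them in schema order. The helper
# below shows the same idea in isolation. It is a minimal sketch that assumes
# un-namespaced tags and uses the standard-library ElementTree instead of lxml;
# `example_reorder_decimation` and the two EXAMPLE_* constants are hypothetical
# names introduced only for this illustration.
import xml.etree.ElementTree as ET

EXAMPLE_DECIMATION_ORDER = ["InputSampleRate", "Factor", "Offset", "Delay", "Correction"]
EXAMPLE_DECIMATION_DEFAULTS = {"Offset": "0", "Delay": "0.0", "Correction": "0.0"}


def example_reorder_decimation(decimation):
    """Reorder the known children of a Decimation element in place, adding defaults."""
    known = {}
    for child in list(decimation):
        if child.tag in EXAMPLE_DECIMATION_ORDER:
            known[child.tag] = child
            decimation.remove(child)
    for tag in EXAMPLE_DECIMATION_ORDER:
        element = known.get(tag)
        if element is None and tag in EXAMPLE_DECIMATION_DEFAULTS:
            # Same defaults as above: no sample offset, no delay, no correction
            element = ET.Element(tag)
            element.text = EXAMPLE_DECIMATION_DEFAULTS[tag]
        if element is not None:
            decimation.append(element)
    return decimation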

stationxml_tooltips = {
    "sender": "Person/tool generating this metadata. Example: 'Silvana Neves, Plocan, Spain'.",
    "source": "Optional name of software, data tool, or institutional source.",
    "network_code": "Two-character FDSN network code. Must be registered with FDSN. See https://www.fdsn.org/networks/ for official codes. Example: '7J'.",
    "network_description": "Full description of the network/project. Can include project scope, partners, time period, etc.",
    "network_identifier": "Globally unique network ID, preferably a DOI. Used for validation and traceability. Must be a valid URI (http://, https://, urn:, or doi:).",
    "station_code": "Up to 5-character station code (alphanumeric). Should be unique within the network and follow naming conventions.",
    "station_description": "Description of station location and deployment.",
    "latitude": "Latitude of station (decimal degrees, WGS84).",
    "longitude": "Longitude of station (decimal degrees, WGS84).",
    "elevation": "Elevation in meters relative to sea level (positive upward). If seafloor-mounted, this is negative. E.g., -17 if 1 m above 18 m depth.",
    "site_name": "Short descriptive name of the deployment site/platform. E.g., 'Plocan OBS Site'.",
    "channel_code": "Three-character SEED channel code (e.g., CDH). Auto-suggested from sampling rate. First = band (e.g., C, D, F), second = instrument ('D' for hydrophone), third = orientation ('H' for hydrophone).",
    "location_code": "Two-character location identifier. Often '00'. Leave blank if not applicable.",
    "channel_latitude": "Latitude of sensor. Often same as station latitude unless offset.",
    "channel_longitude": "Longitude of sensor. Often same as station longitude unless offset.",
    "channel_elevation": "Elevation in meters above sea level. Use same rule as station elevation.",
    "channel_depth": "Sensor depth below surface, in meters (positive down). E.g., if hydrophone is 1 m above 18 m seafloor, depth = 17 m.",
    "water_depth": "Water depth in meters (positive value). Used to calculate channel_depth if channel_depth is not provided directly.",
    "azimuth": "Sensor azimuth in degrees from North (clockwise). Use 0 for vertical or omnidirectional hydrophones.",
    "dip": "Dip angle in degrees. -90 = downward, +90 = upward, 0 = horizontal. Use -90 for hydrophones if pressure increase results in signal increase.",
    "sensor_description": "Free text: model, serial number, mounting orientation, depth offset, sensitivity notes. Mention if response is not flat and frequency-dependent.",
    "sensor_model": "Model name of the sensor (e.g., 'RESON TC4032', 'HiTech HTI-96-MIN').",
    "sensor_serial_number": "Serial number of the sensor unit.",
    "sensor_type": "Type of sensor (e.g., 'Hydrophone', 'Seismometer', 'Pressure Sensor').",
    "sensor_manufacturer": "Manufacturer name of the sensor (e.g., 'RESON', 'HiTech').",
    "sensor_vendor": "Vendor name (often same as manufacturer).",
    "datalogger_model": "Model name of the datalogger/recorder (e.g., 'Raspberry Pi 4', 'Seatronics Logger').",
    "datalogger_serial_number": "Serial number of the datalogger unit.",
    "datalogger_type": "Type of datalogger (e.g., 'A/D converter + digital filter', 'Digital recorder').",
    "datalogger_manufacturer": "Manufacturer name of the datalogger.",
    "datalogger_vendor": "Vendor name (often same as manufacturer).",
    "datalogger_description": "Description of the datalogger (e.g., 'A/D converter + FIR digital filter [config: 240sps]').",
    "datalogger_input_sample_rate": "Input sample rate of the datalogger in Hz. This is the rate before any decimation. Default: 300 Hz (final output rate).",
    "datalogger_nbits": "ADC resolution (bits). Example: 24 for a 24-bit digitizer.",
    "datalogger_ref_voltage": "ADC full-scale input voltage (V). Example: 2.5 or 5.0 V.",
    "wav_float_units": "If WAV subtype is FLOAT/DOUBLE, specify units: 'normalized' (-1..1), 'volts', or 'counts'.",
    "sensitivity_value": "Full system sensitivity in counts/Pa. Use linear value only (not dB).",
    "sensitivity_frequency": "Frequency (Hz) where the sensitivity was measured. Commonly 20000 (20 kHz) for hydrophones like RESON TC4032.",
    "input_units_name": "Unit of the physical quantity measured. For hydrophones, use 'Pa' (Pascal). Do not use uPa - convert to SI base units.",
    "output_units_name": "Output unit of the data. Use 'count' when MiniSEED stores counts."
}
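

# --- Illustrative sketch (not part of the original tool) ---------------------
# The sensitivity tooltips above ask for linear values (V/Pa and counts/Pa),
# whereas hydrophone datasheets commonly quote dB re 1 V/uPa. The two helpers
# below sketch that conversion. Both names are hypothetical. The counts/Pa
# step ASSUMES a bipolar ADC whose 2**nbits codes span -full_scale_volts to
# +full_scale_volts and ignores any preamplifier gain; check your digitizer's
# datasheet before relying on it.
def example_db_to_v_per_pa(sensitivity_db_re_1v_per_upa):
    """Convert a sensitivity in dB re 1 V/uPa to a linear value in V/Pa."""
    v_per_upa = 10.0 ** (sensitivity_db_re_1v_per_upa / 20.0)
    return v_per_upa * 1e6  # 1 Pa = 1e6 uPa


def example_counts_per_pa(v_per_pa, nbits, full_scale_volts):
    """Convert V/Pa to counts/Pa under the bipolar-ADC assumption stated above."""
    counts_per_volt = (2 ** nbits) / (2.0 * full_scale_volts)
    return v_per_pa * counts_per_volt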

def process_files(file_paths, metadata, plot_preference, stationxml_data, tz_offset):
    total_files = len(file_paths)
    successful_files = 0
    start_processing_time = time.time()

    for index, file_path in enumerate(file_paths):
        try:
            print(f"Analyzing file {index + 1} of {total_files} - {os.path.basename(file_path)}")
            flac_output_path = process_wav_file(
                file_path,
                metadata,
                stationxml_data,
                plot_first=(index == 0 and plot_preference),
                tz_offset=tz_offset
            )

            # Get file times for MiniSEED (must match XML inventory dates)
            rate, frames, _info = get_wav_info(file_path)
            duration_seconds = frames / rate
            file_start_time, file_end_time = extract_times_from_wav(file_path)
            # Adjust times by UTC offset to match XML
            adjusted_start_time = file_start_time - timedelta(hours=tz_offset)
            adjusted_end_time = file_end_time - timedelta(hours=tz_offset)
            
            # Generate XML first to get proper dates
            xml_output_path = generate_stationxml_obspy(
                wav_file_name=os.path.basename(file_path),
                stationxml_data=stationxml_data,
                duration_seconds=duration_seconds,
                tz_offset=tz_offset
            )
            
            # Convert FLAC to MiniSEED with proper codes and times (must match XML inventory)
            # Swap only the final extension, whatever its case (.wav, .WAV, .Wav)
            miniseed_output_path = os.path.splitext(file_path)[0] + '.mseed'
            flac_to_miniseed(
                flac_output_path, 
                miniseed_output_path,
                stationxml_data=stationxml_data,
                start_time=adjusted_start_time,
                end_time=adjusted_end_time
            )

            print(
                f"Created files for {file_path}:\n"
                f"   FLAC: {flac_output_path}\n"
                f"   MiniSEED: {miniseed_output_path}\n"
                f"   StationXML (.station.xml): {os.path.abspath(xml_output_path)}"
            )
            successful_files += 1

        except Exception as e:
            print(f"Error processing {file_path}: {e}")

    total_elapsed = time.time() - start_processing_time
    elapsed_str = time.strftime('%H:%M:%S', time.gmtime(total_elapsed))
    print(f"\nProcessed {successful_files} out of {total_files} files successfully.")
    print(f"Total processing time: {elapsed_str}")
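

# --- Illustrative sketch (not part of the original tool) ---------------------
# process_files() shifts the filename-derived local times to UTC by subtracting
# the offset in hours. The helpers below sketch how a GUI string such as
# "UTC+8" or "UTC-05" could be turned into that signed number of hours. Both
# function names are hypothetical, the accepted format (whole hours only) is an
# assumption, and the -24..+24 range mirrors the hint shown in the GUI.
import re
from datetime import timedelta


def example_parse_utc_offset(text):
    """Parse 'UTC+8', 'UTC-05' or 'utc+0' into a signed integer number of hours."""
    match = re.fullmatch(r"\s*UTC\s*([+-])\s*(\d{1,2})\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Unrecognized UTC offset: {text!r}")
    hours = int(match.group(2))
    if match.group(1) == "-":
        hours = -hours
    if not -24 <= hours <= 24:
        raise ValueError(f"UTC offset out of range: {text!r}")
    return hours


def example_local_to_utc(local_time, offset_hours):
    """Shift a naive local datetime to UTC, as done for the MiniSEED start/end times."""
    return local_time - timedelta(hours=offset_hours)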


# ------------------ EMSO Validation ------------------

PLACEHOLDER_VALUES = {"na", "n/a", "none", "null", "void"}
REQUIRED_EMSO_FIELDS = [
    "Conventions",
    "institution_edmo_code",
    "institution_edmo_uri",
    "geospatial_lat_min",
    "geospatial_lat_max",
    "geospatial_lon_min",
    "geospatial_lon_max",
    "geospatial_vertical_min",
    "geospatial_vertical_max",
    "update_interval",
    "site_code",
    "network",
    "title",
    "summary",
    "principal_investigator",
    "principal_investigator_email",
    "license",
    "license_uri",
    "name",
    "long_name",
    "standard_name",
    "units",
    "sdn_parameter_name",
    "sdn_parameter_urn",
    "sdn_uom_name",
    "sdn_uom_urn",
    "sensor_model",
    "sensor_SeaVoX_L22_code",
    "sensor_reference",
    "sensor_manufacturer",
    "sensor_manufacturer_uri",
    "sensor_manufacturer_urn",
    "sensor_serial_number",
    "sensor_mount",
    "sensor_orientation",
    "hydrophone_sensitivity",
    "nbits"
]


def validate_emso_metadata(metadata):
    missing = []
    issues = []

    for field in REQUIRED_EMSO_FIELDS:
        raw_value = metadata.get(field, "")
        value = str(raw_value).strip() if raw_value is not None else ""
        if not value or value.lower() in PLACEHOLDER_VALUES:
            missing.append(field)

    invalid_keys = [key for key in metadata.keys() if str(key).startswith("$")]
    if invalid_keys:
        issues.append(
            "Invalid EMSO keys starting with '$': " + ", ".join(sorted(invalid_keys))
        )

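    fill_value = str(metadata.get("_FillValue", "")).strip().lower()
    # Flag C-style float literals such as "-9999.0f", but not the valid values "inf"/"-inf"
    if fill_value.endswith("f") and not fill_value.endswith("inf"):
        issues.append("_FillValue has a non-standard 'f' suffix.")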

    return missing, issues
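

# --- Illustrative sketch (not part of the original tool) ---------------------
# validate_emso_metadata() returns two lists: the required fields that are
# missing and any formatting issues. The helper below sketches one way to turn
# that pair into a message for the user. `example_format_emso_report` is a
# hypothetical name and the wording of the message is an assumption.
def example_format_emso_report(missing, issues):
    """Build a readable multi-line report from the (missing, issues) lists."""
    lines = []
    if missing:
        lines.append(f"Missing required EMSO fields ({len(missing)}):")
        lines.extend(f"  - {name}" for name in missing)
    lines.extend(f"Warning: {issue}" for issue in issues)
    return "\n".join(lines) if lines else "EMSO metadata passed all checks."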


# ------------------ Tkinter GUI Application ------------------

import tkinter as tk
from tkinter import ttk
from PIL import Image, ImageTk
 
# Main GUI window: logos, file selection, and the EMSO/EIDA metadata editors
class Application(tk.Tk):
    def __init__(self):
        super().__init__()  
        logo_frame = ttk.Frame(self)
        logo_frame.pack(pady=4)
        try:
            # Load Geo-INQUIRE logo 
            geo_image = Image.open("Geo-INQUIRE_logo_2_crop.jpg")
            geo_resized = geo_image.resize((240, 90), Image.LANCZOS)
            self.geo_logo = ImageTk.PhotoImage(geo_resized)
            geo_logo_label = ttk.Label(logo_frame, image=self.geo_logo)
            geo_logo_label.image = self.geo_logo
            geo_logo_label.pack(pady=2)
            
            # Author attribution frame with PLOCAN and EMSO logos
            author_frame = ttk.Frame(logo_frame)
            author_frame.pack(pady=3)
            
            # "Author:" label
            author_label = ttk.Label(author_frame, text="Author:", font=("Arial", 10))
            author_label.pack(side='left', padx=4)
            
            # Load PLOCAN logo - smaller size for attribution, maintain aspect ratio
            plocan_image = Image.open("logo-sin-leyenda-color.png")
            # Convert to RGBA to handle transparency if needed
            if plocan_image.mode != 'RGBA':
                plocan_image = plocan_image.convert('RGBA')
            
            # Calculate size maintaining aspect ratio (smaller for attribution)
            original_width, original_height = plocan_image.size
            aspect_ratio = original_width / original_height
            target_height = 35  # Smaller size for attribution
            target_width = int(target_height * aspect_ratio)
            
            plocan_resized = plocan_image.resize((target_width, target_height), Image.LANCZOS)
            # Create PhotoImage preserving transparency
            self.plocan_logo = ImageTk.PhotoImage(plocan_resized)
            plocan_logo_label = ttk.Label(author_frame, image=self.plocan_logo)
            plocan_logo_label.image = self.plocan_logo
            plocan_logo_label.pack(side='left', padx=4)
            
            # Load EMSO ERIC logo (next to PLOCAN)
            emso_image = Image.open("OIP.jpeg")
            emso_original_w, emso_original_h = emso_image.size
            emso_aspect = emso_original_w / emso_original_h
            emso_target_h = 55
            emso_target_w = int(emso_target_h * emso_aspect)
            emso_resized = emso_image.resize((emso_target_w, emso_target_h), Image.LANCZOS)
            self.emso_logo = ImageTk.PhotoImage(emso_resized)
            emso_logo_label = ttk.Label(author_frame, image=self.emso_logo)
            emso_logo_label.image = self.emso_logo
            emso_logo_label.pack(side='left', padx=8)
        except Exception as e:
            print("Error loading image:", e)
        
        # Auto-fit window to screen
        self.title("Audio Processing Tool")
        self.attributes('-topmost', True)
        
        # Get screen dimensions
        screen_width = self.winfo_screenwidth()
        screen_height = self.winfo_screenheight()
        
        # Calculate window size (65% of screen width, 75% of screen height, with minimum size)
        window_width = int(screen_width * 0.65)
        window_height = int(screen_height * 0.75)
        min_width = 800
        min_height = 550
        window_width = max(window_width, min_width)
        window_height = max(window_height, min_height)
        # Also ensure it doesn't exceed screen size
        window_width = min(window_width, screen_width - 50)
        window_height = min(window_height, screen_height - 50)
        # Center window on screen
        x = (screen_width - window_width) // 2
        y = (screen_height - window_height) // 2
        
        # Set geometry (width x height + x_offset + y_offset)
        self.geometry(f"{window_width}x{window_height}+{x}+{y}")
        
        # Set maximum size to prevent window from being larger than screen
        self.maxsize(screen_width, screen_height)
        main_frame = ttk.Frame(self)
        main_frame.pack(fill='both', expand=True)
        main_canvas = tk.Canvas(main_frame)
        main_scrollbar_y = ttk.Scrollbar(main_frame, orient="vertical", command=main_canvas.yview)
        main_scrollbar_x = ttk.Scrollbar(main_frame, orient="horizontal", command=main_canvas.xview)
        # Pack the scrollbars before the canvas so the expanding canvas does not claim their space
        main_scrollbar_y.pack(side="right", fill="y")
        main_scrollbar_x.pack(side="bottom", fill="x")
        main_canvas.pack(side="left", fill="both", expand=True)
        main_canvas.configure(yscrollcommand=main_scrollbar_y.set, xscrollcommand=main_scrollbar_x.set)
        main_canvas.bind('<Configure>', lambda e: main_canvas.configure(scrollregion=main_canvas.bbox("all")))
        def _on_mousewheel(event):
            main_canvas.yview_scroll(-1 * int(event.delta/120), "units")
        main_canvas.bind("<Enter>", lambda e: main_canvas.bind_all("<MouseWheel>", _on_mousewheel))
        main_canvas.bind("<Leave>", lambda e: main_canvas.unbind_all("<MouseWheel>"))
        
        self.scroll_frame = ttk.Frame(main_canvas)
        main_canvas.create_window((0, 0), window=self.scroll_frame, anchor="nw")
        
        # Tool description (the MiniSEED rate follows FINAL_SAMPLING_RATE instead of a hard-coded value)
        self.description_label = tk.Label(
            self.scroll_frame,
            text=(
                f"This tool converts WAV audio files to FLAC and MiniSEED ({FINAL_SAMPLING_RATE} Hz), "
                "extracts start/end times, and creates EIDA-compliant XML.\n"
                "EMSO/EIDA metadata editors are built in (TXT or JSON). Channel codes are auto-suggested. "
                "Input validation ensures compliance.\n"
            ),
            font=("Arial", 10),
            fg="dark green",
            wraplength=2200,
            justify="left"
        )
        self.description_label.pack(pady=6)

        file_select_frame = ttk.Frame(self.scroll_frame)
        file_select_frame.pack(pady=5)

        self.selection_label = ttk.Label(file_select_frame, text="Select files or a folder:")
        self.selection_label.pack(side='left', padx=6)

        self.file_button = ttk.Button(file_select_frame, text="Browse audio files", command=self.select_files)
        self.file_button.pack(side='left', padx=6)

        self.folder_button = ttk.Button(file_select_frame, text="Select Folder", command=self.select_folder)
        self.folder_button.pack(side='left', padx=6)

        self.path_label = ttk.Label(self.scroll_frame, text="", wraplength=1100)
        self.path_label.pack(pady=5)

        self.tz_frame = ttk.Frame(self.scroll_frame)
        self.tz_frame.pack(pady=5, fill='x')

        tz_label = ttk.Label(self.tz_frame, text="Time Zone Offset of the Files:")
        tz_label.pack(side='left', padx=5)

        self.tz_offset_entry = ttk.Entry(self.tz_frame, width=20)
        self.tz_offset_entry.insert(0, "UTC+0")
        self.tz_offset_entry.pack(side='left', padx=5)

        tz_instruction = ttk.Label(
            self.tz_frame,
            text="Enter the time zone offset in hours (e.g., UTC+8, UTC-05, UTC+10). Must be between -24 and +24."
        )
        tz_instruction.pack(side='left', padx=5)
        
        #EMSO Metadata Frame
        self.metadata_frame = ttk.Frame(self.scroll_frame)
        self.metadata_frame.pack(pady=10, padx=(0, 0), anchor="e")
        emso_bg = "#eef5f9"
        eida_bg = "#f5f0f7"
        ROWS_VISIBLE = 8
        ROW_HEIGHT = 26
        ROW_PADY = 2

        # --- EMSO Metadata Section ---
        self.emso_label_frame = tk.LabelFrame(
            self.metadata_frame,
            text="EMSO FLAC Metadata",
            bg=emso_bg,
            padx=4,
            pady=4,
            width=520,
            height=420
        )
        self.emso_label_frame.grid(row=0, column=0, padx=4, sticky="n")
        self.emso_label_frame.grid_propagate(False)

        self.emso_manual_mode = False  # Track if manual entry mode is active
        
        # Browse/manual buttons (same row)
        emso_buttons_frame = ttk.Frame(self.emso_label_frame)
        emso_buttons_frame.pack(pady=5)

        self.emso_browse_button = ttk.Button(
            emso_buttons_frame, text="Browse metadata files", command=self.select_metadata_file
        )
        self.emso_browse_button.pack(side='left', padx=6)
        
        self.emso_manual_button = ttk.Button(
            emso_buttons_frame, text="Or enter manually", command=self.toggle_emso_manual_input
        )
        self.emso_manual_button.pack(side='left', padx=6)

        emso_info_label = ttk.Label(
            self.emso_label_frame,
            text=("Metadata guidance available. Fields marked with * are mandatory.\n"
                  "EMSO standard: click here to open."),
            foreground="blue",
            cursor="hand2",
            font=("Arial", 9, "italic"),
            wraplength=680,
            justify="left"
        )
        emso_info_label.pack(pady=3, anchor='w')
        emso_info_label.bind("<Button-1>", lambda e: webbrowser.open("https://github.com/emso-eric/emso-metadata-specifications/blob/develop/EMSO_metadata.md"))

        emso_canvas = tk.Canvas(
            self.emso_label_frame,
            width=480,
            height=ROWS_VISIBLE * ROW_HEIGHT,
            bg=emso_bg,
            highlightthickness=0
        )
        emso_canvas.pack(side="left", fill="both", expand=True)

        emso_scrollbar = ttk.Scrollbar(self.emso_label_frame, orient="vertical", command=emso_canvas.yview)
        emso_scrollbar.pack(side="right", fill="y")

        emso_canvas.configure(yscrollcommand=emso_scrollbar.set)
        emso_canvas.bind('<Configure>', lambda e: emso_canvas.configure(scrollregion=emso_canvas.bbox("all")))

        self.emso_scroll_frame = tk.Frame(emso_canvas, bg=emso_bg)
        emso_canvas.create_window((0, 0), window=self.emso_scroll_frame, anchor="nw")

        emso_fields = [
            "Conventions*", "institution_edmo_code*", "institution_edmo_uri*",
            "geospatial_lat_min*", "geospatial_lat_max*", "geospatial_lon_min*", "geospatial_lon_max*",
            "geospatial_vertical_min*", "geospatial_vertical_max*",
            "update_interval*", "site_code*", "emso_facility", "source", "platform_code", "wmo_platform_code",
            "data_type", "format_version", "network*", "data_mode", "title*", "summary*", "keywords",
            "keywords_vocabulary", "project", "principal_investigator*", "principal_investigator_email*", "doi",
            "license*", "license_uri*", "name*", "long_name*", "standard_name*", "units*", "comment",
            "coordinates", "ancillary_variables", "_FillValue", "sdn_parameter_name*",
            "sdn_parameter_urn*", "sdn_parameter_uri", "sdn_uom_name*", "sdn_uom_urn*", "sdn_uom_uri",
            "sensor_model*", "sensor_SeaVoX_L22_code*", "sensor_reference*", "sensor_manufacturer*",
            "sensor_manufacturer_uri*", "sensor_manufacturer_urn*", "sensor_serial_number*", "sensor_mount*",
            "sensor_orientation*", "hydrophone_sensitivity*", "nbits*",
            "initial_sampling_rate", "final_sampling_rate", "cdm_data_type", "creator_name", "creator_url",
            "sourceUrl", "geospatial_bounds", "processing_level", "project_url", "instrument",
            "instrument_vocabulary"
        ]

        self.metadata_entries = {}
        for label in emso_fields:
            frame = ttk.Frame(self.emso_scroll_frame)
            frame.pack(fill='x', pady=ROW_PADY)
            label_clean = label.replace('*', '')
            display_label = f"{label_clean} *" if '*' in label else label_clean
            ttk.Label(frame, text=display_label, width=40, anchor='w').pack(side='left')
            entry = ttk.Entry(frame, width=40)
            entry.pack(side='left', pady=1)
            entry.config(state=tk.DISABLED)  # Start disabled until file loaded or manual mode activated
            self.metadata_entries[label_clean] = entry

        auto_time_label_emso = ttk.Label(
            self.emso_scroll_frame,
            text="Note: Start/End Times and Initial Sampling Rate are extracted automatically.",
            foreground="blue",
            font=("Arial", 9, "italic")
        )
        auto_time_label_emso.pack(pady=4, anchor='w')

        
        # --- EIDA Metadata Section ---
        self.eida_label_frame = tk.LabelFrame(
            self.metadata_frame,
            text="XML EIDA Metadata Requirements",
            bg=eida_bg,
            padx=4,
            pady=4,
            width=520,
            height=420
        )
        self.eida_label_frame.grid(row=0, column=1, padx=4, sticky="n")
        self.eida_label_frame.grid_propagate(False)

        # View buttons (same row)
        view_buttons_frame = ttk.Frame(self.eida_label_frame)
        view_buttons_frame.pack(pady=4)

        rules_button = ttk.Button(
            view_buttons_frame,
            text="StationXML Rules",
            command=lambda: webbrowser.open("https://docs.fdsn.org/projects/stationxml/en/latest/reference.html")
        )
        rules_button.pack(side='left', padx=4)

        self.source_ids_button = ttk.Button(
            view_buttons_frame,
            text="FDSN Source IDs",
            command=lambda: webbrowser.open("https://docs.fdsn.org/projects/source-identifiers/en/v1.0/index.html")
        )
        self.source_ids_button.pack(side='left', padx=4)

        # Metadata input method
        self.xml_manual_mode = False  # Track if manual entry mode is active
        
        # Browse/manual buttons (next row)
        input_buttons_frame = ttk.Frame(self.eida_label_frame)
        input_buttons_frame.pack(pady=5)

        self.xml_browse_button = ttk.Button(
            input_buttons_frame,
            text="Browse metadata files",
            command=self.select_stationxml_file
        )
        self.xml_browse_button.pack(side='left', padx=6)
        
        self.xml_manual_button = ttk.Button(
            input_buttons_frame,
            text="Or enter manually",
            command=self.toggle_xml_manual_input
        )
        self.xml_manual_button.pack(side='left', padx=6)

        # Canvas and scrollable frame for entry fields
        eida_canvas = tk.Canvas(
            self.eida_label_frame,
            width=480,
            height=ROWS_VISIBLE * ROW_HEIGHT,
            bg=eida_bg,
            highlightthickness=0
        )
        eida_canvas.pack(side="left", fill="both", expand=True)

        eida_scrollbar = ttk.Scrollbar(self.eida_label_frame, orient="vertical", command=eida_canvas.yview)
        eida_scrollbar.pack(side="right", fill="y")

        eida_canvas.configure(yscrollcommand=eida_scrollbar.set)
        eida_canvas.bind('<Configure>', lambda e: eida_canvas.configure(scrollregion=eida_canvas.bbox("all")))

        self.eida_scroll_frame = tk.Frame(eida_canvas, bg=eida_bg)
        eida_canvas.create_window((0, 0), window=self.eida_scroll_frame, anchor="nw")

        # Tooltip utility
        def create_tooltip(widget, text):
            def on_enter(event):
                # Avoid stacking tooltips if <Enter> fires again before <Leave>
                if hasattr(widget, 'tooltip'):
                    return
                top = tk.Toplevel(widget)
                top.wm_overrideredirect(True)
                # Keep the tooltip above the always-on-top main window
                top.attributes('-topmost', True)
                top.geometry(f"+{event.x_root + 10}+{event.y_root + 10}")
                label = tk.Label(top, text=text, background="light yellow", relief="solid", borderwidth=1,
                                 wraplength=400, justify="left")
                label.pack(ipadx=1)
                widget.tooltip = top

            def on_leave(event):
                if hasattr(widget, 'tooltip'):
                    widget.tooltip.destroy()
                    del widget.tooltip

            widget.bind("<Enter>", on_enter)
            widget.bind("<Leave>", on_leave)

        # Field definitions
        eida_fields = {
            "sender": "Sender*",
            "source": "Source",
            "network_code": "Network Code*",
            "network_description": "Network Description",
            "network_identifier": "Network Identifier*",
            "station_code": "Station Code*",
            "station_description": "Station Description",
            "latitude": "Latitude (Degrees)*",
            "longitude": "Longitude (Degrees)*",
            "elevation": "Elevation (Meters)*",
            "site_name": "Site Name*",
            "channel_code": "Channel Code*",
            "location_code": "Location Code",
            "channel_latitude": "Channel Latitude (Degrees)*",
            "channel_longitude": "Channel Longitude (Degrees)*",
            "channel_elevation": "Channel Elevation (Meters)*",
            "channel_depth": "Channel Depth (Meters)*",
            "water_depth": "Water Depth (Meters)",
            "azimuth": "Azimuth*",
            "dip": "Dip*",
            "sensor_description": "Sensor Description",
            "sensor_model": "Sensor Model",
            "sensor_serial_number": "Sensor Serial Number",
            "sensor_type": "Sensor Type",
            "sensor_manufacturer": "Sensor Manufacturer",
            "sensor_vendor": "Sensor Vendor",
            "datalogger_model": "DataLogger Model",
            "datalogger_serial_number": "DataLogger Serial Number",
            "datalogger_type": "DataLogger Type",
            "datalogger_manufacturer": "DataLogger Manufacturer",
            "datalogger_vendor": "DataLogger Vendor",
            "datalogger_description": "DataLogger Description",
            "datalogger_input_sample_rate": "DataLogger Input Sample Rate (Hz)",
            "datalogger_nbits": "DataLogger ADC Bits",
            "datalogger_ref_voltage": "DataLogger Ref Voltage (V)",
            "wav_float_units": "WAV Float Units (normalized/volts/counts)",
            "sensitivity_value": "Instrument Sensitivity Value*",
            "sensitivity_frequency": "Instrument Sensitivity Frequency*",
            "input_units_name": "Input Units Name*",
            "output_units_name": "Output Units Name*",
            "hydrophone_sensitivity_vpa": "Hydrophone Sensitivity (V/Pa)*"
        }

        self.stationxml_entries = {}
        for key, label in eida_fields.items():
            frame = ttk.Frame(self.eida_scroll_frame)
            frame.pack(fill='x', pady=ROW_PADY)

            ttk.Label(frame, text=label, width=40, anchor='w').pack(side='left')

            entry = ttk.Entry(frame, width=45)
            entry.pack(side='left', pady=2)
            entry.config(state=tk.DISABLED)  # Start disabled until file loaded or manual mode activated
            self.stationxml_entries[key] = entry

            if key in stationxml_tooltips:
                create_tooltip(entry, stationxml_tooltips[key])

        # Extra info label
        auto_time_label_xml = ttk.Label(
            self.eida_scroll_frame,
            text="Note: Start/End Dates are auto-calculated. Final Sampling Rate = {} Hz.".format(FINAL_SAMPLING_RATE),
            foreground="blue",
            font=("Arial", 9, "italic")
        )
        auto_time_label_xml.pack(pady=4, anchor='w')

        # Add plotting checkbox
        self.plot_var = tk.BooleanVar()
        self.plot_check = ttk.Checkbutton(
            self.scroll_frame,
            text="Plot original vs filtered vs downsampled signal (first file only; see notebook output)",
            variable=self.plot_var
        )
        self.plot_check.pack(pady=5)

        # Start processing button
        self.start_button = ttk.Button(
            self.scroll_frame,
            text="Start Processing",
            command=self.start_processing
        )
        self.start_button.pack(pady=20)

        # Resize window to show full frames
        self.update_idletasks()
        content_width = self.scroll_frame.winfo_reqwidth()
        content_height = self.scroll_frame.winfo_reqheight()
        window_width = min(screen_width - 50, max(content_width + 60, 1100))
        window_height = min(screen_height - 80, max(content_height + 30, 550))
        x = (screen_width - window_width) // 2
        y = (screen_height - window_height) // 2
        self.geometry(f"{window_width}x{window_height}+{x}+{y}")
        main_canvas.configure(width=window_width - 30)


    def select_files(self):
        self.file_paths = filedialog.askopenfilenames(
            title="Select WAV files",
            filetypes=[("WAV files", "*.wav"), ("WAV files", "*.WAV")]
        )
        if self.file_paths:
            self.path_label.config(text=f"Selected files: {len(self.file_paths)} file(s) selected")
        

    def select_folder(self):
        folder_path = filedialog.askdirectory(title="Select Folder of WAV files")
        if not folder_path:
            # User cancelled: keep the current selection.
            return
        try:
            # Sorted so the processing order, and the first file used for the
            # time-coverage metadata, does not depend on os.listdir() ordering.
            self.file_paths = [
                os.path.join(folder_path, f)
                for f in sorted(os.listdir(folder_path))
                if f.lower().endswith('.wav')
            ]
            self.path_label.config(
                text=f"Selected folder: {folder_path} ({len(self.file_paths)} WAV files)"
            )
        except Exception as e:
            messagebox.showerror("Error", f"Could not read folder: {e}")
            self.file_paths = []
       

    def select_metadata_file(self):
        metadata_file = filedialog.askopenfilename(
            title="Select Metadata File",
            filetypes=[
                ("Metadata files", "*.txt *.json"),
                ("Text files", "*.txt"),
                ("JSON files", "*.json")
            ]
        )
        if metadata_file:
            try:
                if metadata_file.lower().endswith(".json"):
                    with open(metadata_file, 'r', encoding='utf-8') as f:
                        metadata = json.load(f)
                    if not isinstance(metadata, dict):
                        raise ValueError("EMSO metadata JSON must be an object with key/value pairs.")
                else:
                    # Plain-text files hold one key=value pair per line;
                    # lines without '=' are skipped.
                    metadata = {}
                    with open(metadata_file, 'r', encoding='utf-8') as f:
                        for line in f:
                            if '=' in line:
                                key, value = line.split('=', 1)
                                metadata[key.strip()] = value.strip()
            except (OSError, ValueError) as e:
                # json.JSONDecodeError and UnicodeDecodeError are ValueError subclasses.
                # Without this handler the error is raised inside a Tk callback
                # and never reaches the user.
                messagebox.showerror("EMSO Metadata Error", f"Could not load metadata file: {e}")
                return
            metadata = {str(k): "" if v is None else str(v) for k, v in metadata.items()}
            if "$name" in metadata and "name" not in metadata:
                metadata["name"] = metadata.pop("$name")
            # Disabled entries ignore insert/delete, so enable them before
            # populating. They stay enabled so loaded values can be edited.
            for entry in self.metadata_entries.values():
                entry.config(state=tk.NORMAL)
            self.populate_metadata_fields(metadata)

    def select_stationxml_file(self):
        metadata_file = filedialog.askopenfilename(
            title="Select XML Metadata File",
            filetypes=[
                ("Metadata files", "*.txt *.json *.JSON"),
                ("Text files", "*.txt"),
                ("JSON files", "*.json *.JSON")
            ]
        )
        if metadata_file:
            try:
                if metadata_file.lower().endswith(".json"):
                    with open(metadata_file, 'r', encoding='utf-8') as f:
                        metadata = json.load(f)
                    if not isinstance(metadata, dict):
                        raise ValueError("StationXML metadata JSON must be an object with key/value pairs.")
                else:
                    # Plain-text files hold one key=value pair per line;
                    # lines without '=' are skipped.
                    metadata = {}
                    with open(metadata_file, 'r', encoding='utf-8') as f:
                        for line in f:
                            if '=' in line:
                                key, value = line.split('=', 1)
                                metadata[key.strip()] = value.strip()
            except (OSError, ValueError) as e:
                # json.JSONDecodeError and UnicodeDecodeError are ValueError subclasses.
                # Without this handler the error is raised inside a Tk callback
                # and never reaches the user.
                messagebox.showerror("StationXML Metadata Error", f"Could not load metadata file: {e}")
                return
            metadata = {str(k): "" if v is None else str(v) for k, v in metadata.items()}
            # Disabled entries ignore insert/delete, so enable them before
            # populating. They stay enabled so loaded values can be edited.
            for entry in self.stationxml_entries.values():
                entry.config(state=tk.NORMAL)
            self.populate_stationxml_fields(metadata)
            
    def populate_metadata_fields(self, metadata):
        for key, entry in self.metadata_entries.items():
            if key in metadata:
                entry.delete(0, tk.END)
                entry.insert(0, metadata[key])

    def populate_stationxml_fields(self, metadata):
        def norm_key(value):
            return str(value).strip().lower()

        normalized = {norm_key(k): v for k, v in metadata.items()}

        for key, entry in self.stationxml_entries.items():
            # Prefer an exact key match; otherwise fall back to a
            # case-insensitive, whitespace-trimmed match.
            value = metadata[key] if key in metadata else normalized.get(norm_key(key))
            if value is not None:
                entry.delete(0, tk.END)
                entry.insert(0, str(value))


    def toggle_emso_manual_input(self):
        self.emso_manual_mode = not self.emso_manual_mode
        state = tk.NORMAL if self.emso_manual_mode else tk.DISABLED
        for entry in self.metadata_entries.values():
            entry.config(state=state)
    
    def toggle_xml_manual_input(self):
        self.xml_manual_mode = not self.xml_manual_mode
        state = tk.NORMAL if self.xml_manual_mode else tk.DISABLED
        for entry in self.stationxml_entries.values():
            entry.config(state=state)

    def get_metadata(self):
        metadata = {}
        for key, entry in self.metadata_entries.items():
            value = entry.get().strip()
            if value:
                metadata[key] = value
        return metadata

    def get_stationxml_data(self):
        stationxml_data = {}
        for key, entry in self.stationxml_entries.items():
            stationxml_data[key] = entry.get().strip()
        return stationxml_data

    def start_processing(self):
        if not hasattr(self, 'file_paths') or not self.file_paths:
            messagebox.showerror("Error", "No files or folder selected.")
            return


        try:
            tz_str = self.tz_offset_entry.get().strip().upper()
            pattern = r'^UTC([+-])(\d{1,2})$'
            m = re.match(pattern, tz_str)
            if not m:
                raise ValueError("Time zone offset must be in the format: UTC±X or UTC±XX (e.g., UTC+8, UTC-05, UTC+10).")
            sign, digits = m.groups()
            tz_offset = int(digits) if sign == '+' else -int(digits)
            if abs(tz_offset) >= 24:
                # The check rejects 24 itself, so the message states the accepted range.
                raise ValueError("Time zone offset (in hours) must be between -23 and +23.")

            local_start_time, local_end_time = extract_times_from_wav(self.file_paths[0])
            utc_start_time = local_start_time - timedelta(hours=tz_offset)
            utc_end_time = local_end_time - timedelta(hours=tz_offset)
            utc_start_str = utc_start_time.isoformat()
            utc_end_str = utc_end_time.isoformat()

            metadata = self.get_metadata()
            stationxml_data = self.get_stationxml_data()

            missing_emso, emso_issues = validate_emso_metadata(metadata)
            if missing_emso:
                messagebox.showerror(
                    "EMSO Metadata Error",
                    "Missing or invalid EMSO fields:\n" + ", ".join(missing_emso)
                )
                return
            if emso_issues:
                messagebox.showwarning(
                    "EMSO Metadata Warnings",
                    "Please review EMSO metadata:\n" + "\n".join(emso_issues)
                )

            metadata['time_coverage_start'] = utc_start_str
            metadata['time_coverage_end'] = utc_end_str
            stationxml_data['time_coverage_start'] = utc_start_str
            stationxml_data['time_coverage_end'] = utc_end_str

            self.start_button.config(state=tk.DISABLED)
            # Read the checkbox value while the window still exists, then
            # close the GUI and run processing in the notebook output.
            plot_enabled = self.plot_var.get()
            self.destroy()
            process_files(self.file_paths, metadata, plot_enabled, stationxml_data, tz_offset)
        except Exception as e:
            messagebox.showerror("Processing Error", str(e))
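

# --- Illustrative sketch (not part of the original tool) ---------------------
# The UTC-offset handling in Application.start_processing can be exercised
# without opening the GUI. The helper names below (parse_utc_offset and
# local_to_utc) are hypothetical and are not called anywhere in this script.
# They only mirror the regular expression and the "local time minus offset"
# rule used in start_processing, assuming whole-hour offsets as the GUI does.
import re
from datetime import datetime, timedelta


def parse_utc_offset(tz_str):
    """Return the whole-hour offset encoded in a string such as 'UTC+8' or 'UTC-05'."""
    m = re.match(r'^UTC([+-])(\d{1,2})$', tz_str.strip().upper())
    if not m:
        raise ValueError("Time zone offset must look like UTC+8, UTC-05 or UTC+10.")
    sign, digits = m.groups()
    hours = int(digits) if sign == '+' else -int(digits)
    if abs(hours) >= 24:
        raise ValueError("Time zone offset (in hours) must be between -23 and +23.")
    return hours


def local_to_utc(local_time, tz_offset_hours):
    """Convert a naive local datetime to UTC by subtracting the offset in hours."""
    return local_time - timedelta(hours=tz_offset_hours)


# Example: a recording stamped 01:40:14 local time at UTC+1 starts at 00:40:14 UTC.
#   local_to_utc(datetime(2020, 6, 27, 1, 40, 14), parse_utc_offset("UTC+1"))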
   
if __name__ == "__main__":
    app = Application()
    app.mainloop()


Analyzing file 1 of 2 - channelA_2020-06-27_01-40-14.wav


Created files for C:/Users/silvana.neves/Plataforma Oceánica de Canarias/BIOACOUSTICS - General/ANALISIS DATOS/GEO-INQUIRE\channelA_2020-06-27_01-40-14.wav:
   FLAC: C:/Users/silvana.neves/Plataforma Oceánica de Canarias/BIOACOUSTICS - General/ANALISIS DATOS/GEO-INQUIRE\channelA_2020-06-27_01-40-14.flac
   MiniSEED: C:/Users/silvana.neves/Plataforma Oceánica de Canarias/BIOACOUSTICS - General/ANALISIS DATOS/GEO-INQUIRE\channelA_2020-06-27_01-40-14.mseed
   StationXML (.station.xml): c:\Users\silvana.neves\Plataforma Oceánica de Canarias\BIOACOUSTICS - General\ANALISIS DATOS\GEO-INQUIRE\channelA_2020-06-27_01-40-14.station.xml
Analyzing file 2 of 2 - channelA_2020-06-30_01-06-28.wav
Created files for C:/Users/silvana.neves/Plataforma Oceánica de Canarias/BIOACOUSTICS - General/ANALISIS DATOS/GEO-INQUIRE\channelA_2020-06-30_01-06-28.wav:
   FLAC: C:/Users/silvana.neves/Plataforma Oceánica de Canarias/BIOACOUSTICS - General/ANALISIS DATOS/GEO-INQUIRE\channelA_2020-06-30_01-06-28.flac
   M