# Using onset/duration format

Many datasets store interval annotations as onset + duration rather than explicit interval boundaries. This notebook shows how to use `DimensionInterval` with onset/duration coordinates directly, without needing to construct `pd.IntervalIndex` objects.

## Why onset/duration?

Annotation tools such as Praat (with its TextGrid files) and many others commonly export intervals as:
- `onset`: when the interval starts
- `duration`: how long the interval lasts
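
For comparison, here is a minimal sketch of the manual route that the onset/duration format avoids: building a `pd.IntervalIndex` from onsets and durations with plain pandas. The two intervals mirror the "hello" and "world" intervals quoted later in this notebook; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative values: "hello" [0.5, 1.7) and "world" [2.1, 3.9)
onsets = np.array([0.5, 2.1])
durations = np.array([1.2, 1.8])

# Each interval ends at onset + duration
intervals = pd.IntervalIndex.from_arrays(onsets, onsets + durations, closed="left")

# The onset/duration view can be recovered from the explicit boundaries
recovered_onsets = intervals.left
recovered_durations = intervals.length
print(intervals)
```

Both representations carry the same information, so the choice is mostly about which one matches your files.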

The `linked_indices` library provides helper functions to convert annotation DataFrames directly to xarray coordinates with proper naming conventions.

In [None]:
import xarray as xr
from linked_indices import DimensionInterval, example_data
from linked_indices.example_data import (
    intervals_from_dataframe,
    intervals_from_long_dataframe,
)

## Loading annotation data

Annotation data typically comes as a pandas DataFrame with onset, duration, and label columns. Let's load some example speech annotation data:

In [None]:
# Load example speech annotations (simulating data from Praat, TextGrid, etc.)
annotations = example_data.speech_annotations()
annotations

Notice that the annotations have **gaps** between them - this is common in real speech data where there are pauses between words. For example, "hello" ends at 1.7s but "world" doesn't start until 2.1s.
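
A quick way to spot such gaps is to compare each interval's end with the next interval's onset. The sketch below uses plain Python and only the two intervals mentioned in this notebook ("hello" [0.5, 1.7) and "world" [2.1, 3.9)); it does not use the library.

```python
# (label, onset, duration) tuples for the two intervals quoted in the text
words = [("hello", 0.5, 1.2), ("world", 2.1, 1.8)]

gaps = []
for (prev_label, prev_onset, prev_dur), (next_label, next_onset, _) in zip(words, words[1:]):
    prev_end = prev_onset + prev_dur
    if next_onset > prev_end:
        # Round to avoid floating-point noise in the reported gap length
        gaps.append((prev_label, next_label, round(next_onset - prev_end, 6)))

print(gaps)
```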

## Converting DataFrame to xarray coordinates

The `intervals_from_dataframe` function converts annotation DataFrames to xarray Datasets with properly named coordinates (`{dim}_onset`, `{dim}_duration`):

In [None]:
# Convert annotations DataFrame to xarray coordinates
word_coords = intervals_from_dataframe(annotations, dim_name="word", label_col="word")
word_coords

The helper automatically creates:
- `word` as the dimension coordinate (from `label_col`)
- `word_onset` and `word_duration` as coordinates (named `{dim}_onset`, `{dim}_duration`)
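
To make the naming convention concrete, the sketch below builds coordinates with the same names by hand, using only pandas and xarray. It follows the manual construction used later in this notebook and is an assumption about the layout, not the helper's actual implementation; the small DataFrame is illustrative.

```python
import pandas as pd
import xarray as xr

# Illustrative annotations with the column names used in this notebook
annotations = pd.DataFrame(
    {"word": ["hello", "world"], "onset": [0.5, 2.1], "duration": [1.2, 1.8]}
)

# Hand-built coordinates following the {dim}_onset / {dim}_duration convention
manual_coords = xr.Dataset(
    coords={
        "word": ("word", annotations["word"].to_numpy()),
        "word_onset": ("word", annotations["onset"].to_numpy()),
        "word_duration": ("word", annotations["duration"].to_numpy()),
    }
)
print(manual_coords)
```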

## Adding audio data

Now we can add our continuous audio signal and merge with the annotation coordinates:

In [None]:
# Generate a simulated audio signal
times, audio_signal = example_data.generate_audio_signal(duration=10.0)

# Create dataset by merging annotation coordinates with audio data
ds = word_coords.copy()
ds["audio"] = (("time",), audio_signal)
ds = ds.assign_coords(time=times)
ds

## Applying the DimensionInterval index

To link the time and word dimensions, apply `DimensionInterval` with the `onset_duration_coords` option mapping dimension names to `(onset_coord, duration_coord)` tuples:

In [None]:
ds = ds.drop_indexes(["time", "word"]).set_xindex(
    ["time", "word_onset", "word_duration", "word"],
    DimensionInterval,
    onset_duration_coords={"word": ("word_onset", "word_duration")},
)
ds

In [None]:
ds.xindexes["word"]

In [None]:
ds.coord_viz()

In [None]:
ds.coord_inspector["word"]

Notice that:
- The `word_onset` and `word_duration` coordinates remain visible
- All coordinates are linked under a single `DimensionInterval` index
- No manual coordinate creation was needed - the helper handled naming conventions

## Selecting data

Selection works exactly the same as with the IntervalIndex format. When you select on any dimension, all other dimensions are automatically constrained.
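
As a rough mental model for time-range selection, you can think of the word dimension being restricted to intervals that overlap the requested range. The sketch below illustrates that idea with plain pandas. It assumes overlap semantics and an inclusive query range, which this notebook does not spell out, and the third interval (label `"third"`) is a made-up placeholder.

```python
import pandas as pd

labels = ["hello", "world", "third"]  # "third" is a hypothetical placeholder
intervals = pd.IntervalIndex.from_arrays(
    [0.5, 2.1, 4.5], [1.7, 3.9, 5.5], closed="left"
)

# Query range corresponding to slice(2, 5), assumed inclusive on both ends
query = pd.Interval(2.0, 5.0, closed="both")

mask = intervals.overlaps(query)
overlapping = [label for label, hit in zip(labels, mask) if hit]
print(overlapping)
```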

In [None]:
# Select by word label - time is automatically constrained
ds.sel(word="hello")

In [None]:
# Select by time range - words are automatically constrained
ds.sel(time=slice(2, 5))

In [None]:
# Select by onset value
ds.sel(word_onset=4.5)

## Handling gaps

Our word annotations have gaps between them (silence between words). Let's see what happens when we select time in a gap:

In [None]:
# Time 1.75 to 2.0 falls in the gap between "hello" (ends at 1.7) and "world" (starts at 2.1)
ds.sel(time=slice(1.75, 2.0))

When you use `isel` to select multiple words that have gaps between them, the time dimension covers the **full span** of their intervals, from the earliest onset to the latest end, including the gap. Here we select "hello" [0.5, 1.7) and "world" [2.1, 3.9):
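
The arithmetic behind that time range is simple enough to check by hand. This plain-Python sketch computes the earliest onset and the latest end for the two intervals above; it illustrates the expected bounds rather than the library's internals.

```python
# (onset, duration) for "hello" and "world"
selected = [(0.5, 1.2), (2.1, 1.8)]

span_start = min(onset for onset, _ in selected)
span_end = max(onset + duration for onset, duration in selected)

# A sample in the gap lies inside the span but inside neither interval
t = 1.9
in_span = span_start <= t < span_end
in_any_interval = any(onset <= t < onset + duration for onset, duration in selected)
print(span_start, span_end, in_span, in_any_interval)
```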

In [None]:
# Select first two words - time spans from 0.5 to 3.9, including the gap
ds.isel(word=slice(0, 2))

## Multiple onset/duration dimensions

You can have multiple interval dimensions, each with its own onset/duration coordinates. This is common for hierarchical annotations like words and phonemes. The helper function makes it easy to convert each level:
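
To see how two annotation levels relate, the sketch below finds the phonemes whose intervals overlap a given word, using plain pandas. The phoneme labels and timings are made up for illustration and are not taken from `example_data`; whether the library links levels by overlap or by containment is an assumption here.

```python
import pandas as pd

# "hello" as quoted earlier in this notebook
word = pd.Interval(0.5, 1.7, closed="left")

# Hypothetical phoneme annotations (labels and timings are invented)
phoneme_labels = ["h", "eh", "l", "ow", "w"]
phoneme_onsets = [0.5, 0.8, 1.1, 1.4, 2.1]
phoneme_durations = [0.3, 0.3, 0.3, 0.3, 0.4]

phonemes = pd.IntervalIndex.from_arrays(
    phoneme_onsets,
    [o + d for o, d in zip(phoneme_onsets, phoneme_durations)],
    closed="left",
)

# Keep the phonemes that overlap the word interval
inside = [label for label, hit in zip(phoneme_labels, phonemes.overlaps(word)) if hit]
print(inside)
```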

In [None]:
# Load multi-level annotations (words and phonemes)
word_annotations, phoneme_annotations = example_data.multi_level_annotations()

display(word_annotations)
display(phoneme_annotations)

# Convert each DataFrame to xarray coordinates using helpers
word_ds = intervals_from_dataframe(word_annotations, dim_name="word", label_col="word")
phoneme_ds = intervals_from_dataframe(
    phoneme_annotations, dim_name="phoneme", label_col="phoneme"
)

In [None]:
# Merge annotation coordinates and add audio data
times, audio = example_data.generate_audio_signal(duration=10.0)

ds_multi = xr.merge([word_ds, phoneme_ds])
ds_multi["audio"] = (("time",), audio)
ds_multi = ds_multi.assign_coords(time=times)

# Apply index with both onset/duration mappings
ds_multi = ds_multi.drop_indexes(["time", "word", "phoneme"]).set_xindex(
    [
        "time",
        "word_onset",
        "word_duration",
        "word",
        "part_of_speech",
        "phoneme_onset",
        "phoneme_duration",
        "phoneme",
    ],
    DimensionInterval,
    onset_duration_coords={
        "word": ("word_onset", "word_duration"),
        "phoneme": ("phoneme_onset", "phoneme_duration"),
    },
)
ds_multi

In [None]:
# Select word "hello" - both time AND phonemes are constrained
ds_multi.sel(word="hello")

In [None]:
# Select by part of speech - finds all nouns
ds_multi.sel(part_of_speech="noun")

## Controlling interval closedness

By default, intervals are left-closed `[onset, onset+duration)`. You can change this with the `interval_closed` option:
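
Closedness only matters at the boundaries. The sketch below uses `pd.Interval` from plain pandas (not the library's index) to show which endpoint belongs to the interval under the `"left"` and `"right"` settings, using the "world" interval from earlier.

```python
import pandas as pd

onset, duration = 2.1, 1.8
end = onset + duration

left_closed = pd.Interval(onset, end, closed="left")    # [onset, end)
right_closed = pd.Interval(onset, end, closed="right")  # (onset, end]

print(onset in left_closed, end in left_closed)    # True False
print(onset in right_closed, end in right_closed)  # False True
```

With adjacent intervals, a half-open setting such as `"left"` ensures that a shared boundary belongs to exactly one interval.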

In [None]:
# Reload data for fresh example
annotations = example_data.speech_annotations()
times, audio = example_data.generate_audio_signal()

# Create with right-closed intervals (onset, onset+duration]
ds_right = xr.Dataset(
    {"audio": (("time",), audio)},
    coords={
        "time": times,
        "word_onset": ("word", annotations["onset"].values),
        "word_duration": ("word", annotations["duration"].values),
        "word": ("word", annotations["word"].values),
    },
)

ds_right = ds_right.drop_indexes(["time", "word"]).set_xindex(
    ["time", "word_onset", "word_duration", "word"],
    DimensionInterval,
    onset_duration_coords={"word": ("word_onset", "word_duration")},
    interval_closed="right",  # Options: "left", "right", "both", "neither"
)
print("Created dataset with right-closed intervals (onset, onset+duration]")

## Summary

The onset/duration format provides a convenient way to work with interval data without manually constructing `pd.IntervalIndex` objects:

1. **Load annotations** as a pandas DataFrame (from TextGrid, Praat, CSV, etc.)
2. **Convert to coordinates** using `intervals_from_dataframe()` or `intervals_from_long_dataframe()`
3. **Merge and add data** - combine annotation coordinates with your continuous data
4. **Apply the index** with `onset_duration_coords` mapping
5. **Select data** - all selection operations work identically to IntervalIndex format

### Helper functions

| Function | Use case |
|----------|----------|
| `intervals_from_dataframe()` | Convert a single-event-type DataFrame |
| `intervals_from_long_dataframe()` | Convert a multi-event-type DataFrame with a category column |

### Key features

- **Natural representation**: Use onset + duration directly from your data files
- **Library helpers**: Handle coordinate naming conventions automatically
- **Visible coordinates**: onset and duration remain as regular coordinates
- **Full functionality**: All selection operations work identically
- **Multiple dimensions**: Support for multiple onset/duration pairs
- **Gap support**: Non-contiguous intervals work correctly
- **Mixed events**: Handle DataFrames with multiple event types

## Handling multiple event types in one DataFrame

Sometimes annotation data comes as a single "long format" DataFrame with multiple event types (words, phonemes, stimuli, etc.) distinguished by a category column. The `intervals_from_long_dataframe` function handles this case:
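
For orientation, a long-format table might look like the sketch below, which also splits it into one DataFrame per event type with a plain pandas `groupby`. The `event_type` and `label` column names match those used later in this notebook; the `onset` and `duration` column names and all row values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical long-format annotations: one row per event, any event type
long_df = pd.DataFrame(
    {
        "event_type": ["word", "word", "phoneme", "stimulus"],
        "label": ["hello", "world", "h", "image_A"],
        "onset": [0.5, 2.1, 0.5, 0.0],
        "duration": [1.2, 1.8, 0.3, 5.0],
    }
)

# Split into one DataFrame per event type, dropping the category column
per_type = {
    event_type: group.drop(columns="event_type").reset_index(drop=True)
    for event_type, group in long_df.groupby("event_type")
}
print(per_type["word"])
```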

In [None]:
# Load example mixed-event annotations
mixed_df = example_data.mixed_event_annotations()
mixed_df

In [None]:
# Convert all event types at once
intervals_from_long_dataframe(mixed_df)

In [None]:
# Add time/audio and apply DimensionInterval
times, audio = example_data.generate_audio_signal(duration=10.0)
interval_ds = intervals_from_long_dataframe(mixed_df)

ds_mixed = interval_ds.copy()
ds_mixed["audio"] = (("time",), audio)
ds_mixed = ds_mixed.assign_coords(time=times)

# Apply the index with all three event types
ds_mixed = ds_mixed.drop_indexes(["time", "word", "phoneme", "stimulus"]).set_xindex(
    [
        "time",
        "word_onset",
        "word_duration",
        "word",
        "phoneme_onset",
        "phoneme_duration",
        "phoneme",
        "stimulus_onset",
        "stimulus_duration",
        "stimulus",
    ],
    DimensionInterval,
    onset_duration_coords={
        "word": ("word_onset", "word_duration"),
        "phoneme": ("phoneme_onset", "phoneme_duration"),
        "stimulus": ("stimulus_onset", "stimulus_duration"),
    },
)
ds_mixed

In [None]:
# Selecting a stimulus constrains words and phonemes too
ds_mixed.sel(stimulus="image_A")

### Manual iteration for selective event types

If you only want some event types, you can filter and apply `intervals_from_dataframe` iteratively:

In [None]:
# Only include words and phonemes (exclude stimuli)
datasets = []
for event_type in ["word", "phoneme"]:
    subset = mixed_df[mixed_df["event_type"] == event_type].drop(columns=["event_type"])
    ds_subset = intervals_from_dataframe(subset, dim_name=event_type, label_col="label")
    datasets.append(ds_subset)

xr.merge(datasets)