## Event Detection

This notebook shows an example analysis of a molecular dynamics (MD) simulation of a
Lennard-Jones (LJ) system using `dupin`.
We use the Minkowski structure metrics and the Voronoi polytope volumes as features.
The notebook goes through the entire data generation pipeline. In addition, we
showcase the logging infrastructure, which listens in on the data generation so that we
can recover the region of the simulation where nucleation occurs.

### Import All the Things

In [None]:
import os

import freud  # analysis toolkit
import gsd.fl  # trajectory reader
import numpy as np
import pandas as pd
import ruptures as rpt  # change point detection library
import scipy as sp
import sklearn as sk
from sklearn.preprocessing import MinMaxScaler
from tqdm.notebook import tqdm

import dupin as du

FILENAME = "lj-sim.gsd"
FRAMES = range(3, 80)

## Generate the Data

### Data Definition

`dupin` data generation follows a pipeline approach, which is most clearly
seen with decorators. Below we create a `steinhardt_generator` which computes the
Minkowski structure metrics (MSM): `freud.order.Steinhardt` with `weighted=True` and
a Voronoi neighbor list computes the MSM. We use the function `data_generator` to
add the Voronoi polytope volume to the feature dictionary of `steinhardt_generator`.

<div class="alert alert-warning">
    <p><strong>Note:</strong></p>
    <p>
        Because the <code>freud.locality.Voronoi</code> object is computed for each frame to provide
        the neighbor list, its volumes are already up to date when <code>data_generator</code> reads them.
    </p>
</div>

We then use `DataMap` and `DataReducer` subclasses to keep both the raw and the
spatially averaged value of each feature and to reduce each to its first and 100th
greatest and least values.
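
To make the map and reduce steps concrete, here is a minimal NumPy sketch of the idea. It is not `dupin`'s implementation: the helpers `neighbor_average` and `nth_greatest` are hypothetical, the average excludes the particle itself, and the sign convention (positive `n` counts from the largest value, negative `n` from the smallest) is an assumption made for illustration.

```python
import numpy as np


def neighbor_average(values, neighbor_pairs):
    """Average a per-particle feature over each particle's neighbors.

    ``neighbor_pairs`` holds (i, j) index pairs meaning "j is a neighbor of i".
    Assumes every particle has at least one neighbor.
    """
    values = np.asarray(values, dtype=float)
    pairs = np.asarray(neighbor_pairs)
    sums = np.zeros_like(values)
    counts = np.zeros_like(values)
    np.add.at(sums, pairs[:, 0], values[pairs[:, 1]])  # sum neighbor values
    np.add.at(counts, pairs[:, 0], 1.0)  # count neighbors per particle
    return sums / counts


def nth_greatest(values, n):
    """Return the n-th greatest value for n > 0, the |n|-th least for n < 0."""
    ordered = np.sort(np.asarray(values, dtype=float))  # ascending order
    return ordered[-n] if n > 0 else ordered[-n - 1]


# Hypothetical feature values for four particles and their neighbor pairs.
q6 = np.array([0.1, 0.5, 0.3, 0.9])
pairs = [(0, 1), (0, 2), (1, 0), (1, 3), (2, 0), (2, 3), (3, 1), (3, 2)]

# "Tee": keep the raw feature and add its spatially averaged counterpart.
features = {"raw": q6, "averaged": neighbor_average(q6, pairs)}

# "Reduce": collapse each per-particle array to a few extreme values.
reduced = {
    name: {n: nth_greatest(values, n) for n in (1, 2, -1, -2)}
    for name, values in features.items()
}
```

Each frame is thereby reduced from one array per feature to a handful of scalars, which is what allows the features to be tracked as signals over time.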

In the next cell we set up some base objects, and then show two methods for creating our final data
pipeline.

In [None]:
ls = [2, 4, 6, 8, 10]
steinhardt_compute = freud.order.Steinhardt(l=ls, weighted=True)
voronoi_nlist_generator = freud.locality.Voronoi()

steinhardt_generator = du.data.freud.FreudDescriptor(
    compute=steinhardt_compute, attrs={"particle_order": [f"$Q_{{{i}}}$" for i in ls]}
)

### Decorator Method

The decorator method uses the `wraps()` method, which turns any of `dupin`'s
built-in data generation and manipulation classes into a decorator.
Decorators in Python are always applied from bottom to top, so the top-most
decorator is the last step in the pipeline. This method can be useful when
a custom function generates the data.
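
The bottom-to-top ordering can be checked with plain Python. The sketch below uses a hypothetical `add_step` decorator factory that has nothing to do with `dupin`; it only records the order in which stacked decorators act on the wrapped function's output.

```python
def add_step(label):
    """Return a decorator that appends ``label`` after the wrapped call runs."""

    def decorator(func):
        def wrapper(*args, **kwargs):
            steps = func(*args, **kwargs)
            return steps + [label]

        return wrapper

    return decorator


@add_step("reduce")  # applied last, so it acts last
@add_step("map")  # applied first, so it acts right after generation
def generate():
    return ["generate"]


order = generate()
print(order)  # ['generate', 'map', 'reduce']
```

Reading the stack from the function upward therefore gives the order of the pipeline stages.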

In [None]:
@du.data.reduce.NthGreatest.wraps([1, 100, -1, -100])
@du.data.map.Tee.wraps(
    [
        du.data.spatial.NeighborAveraging.wraps("neighbors", False),
        du.data.map.Identity.wraps(),
    ]
)
def data_generator(
    system: "tuple[freud.box.Box, np.ndarray[float]]",
    neighbors: "freud.locality.NeighborList",
):
    """Combine the MSM and Voronoi polytope volumes."""
    data = steinhardt_generator(system, neighbors=neighbors)
    data["$V_{vor}$"] = voronoi_nlist_generator.volumes
    return data

### Pipeline Method

Another way to build the final data pipeline is the pipeline
syntax. This approach still uses `wraps`, which is necessary to create
the pipeline all at once, but it reads left to right in the order the
data is processed, which can appear more natural to many people.
Note that the cell below rebinds `data_generator`, so this definition is
the one used in the rest of the notebook.
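
As a rough illustration of left-to-right chaining, the sketch below implements a tiny `pipe` method on a hypothetical `Step` class. It is a simplified stand-in: here `pipe` takes a plain function of the previous output, which is an assumption and differs from what `dupin`'s `pipe` receives from `wraps`.

```python
class Step:
    """A callable pipeline stage that can be chained with ``pipe``."""

    def __init__(self, func):
        self._func = func

    def __call__(self, *args, **kwargs):
        return self._func(*args, **kwargs)

    def pipe(self, next_func):
        """Return a new Step that feeds this step's output into ``next_func``."""
        return Step(lambda *args, **kwargs: next_func(self(*args, **kwargs)))


generate = Step(lambda n: list(range(n)))

# Stages are listed in the order they run: generate, square, take the maximum.
pipeline = generate.pipe(lambda xs: [x * x for x in xs]).pipe(max)
result = pipeline(5)  # squares of 0..4, then the largest of them
```

Because every call to `pipe` returns a new object, the original `generate` step stays usable on its own.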

In [None]:
@du.data.base.CustomGenerator
def data_generator(
    system: "tuple[freud.box.Box, np.ndarray[float]]",
    neighbors: "freud.locality.NeighborList",
):
    """Combine the MSM and Voronoi polytope volumes."""
    data = steinhardt_generator(system, neighbors=neighbors)
    data["$V_{vor}$"] = voronoi_nlist_generator.volumes
    return data


data_generator = data_generator.pipe(
    du.data.map.Tee.wraps(
        [
            du.data.spatial.NeighborAveraging.wraps("neighbors", False),
            du.data.map.Identity.wraps(),
        ]
    )
).pipe(du.data.reduce.NthGreatest.wraps([1, 100, -1, -100]))

### Data Computation

Wrap the generator in a `SignalAggregator` object
to collect the feature set across the trajectory.
We also specify a logger to record information throughout the pipeline.
Given the many layers of the pipeline, exposing properties on each of the
different objects would be cumbersome, so we instead listen in on the pipeline
through the logging system.
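
The idea of listening in on a pipeline can be sketched in plain Python. Everything below is hypothetical (`RecordingLogger`, `greatest`, and the toy data) and is not `dupin`'s logger; it only shows how a reduction step can record which particle produced each reduced value so that the particle's position can be looked up afterwards.

```python
class RecordingLogger:
    """Collect per-frame notes written by pipeline stages (hypothetical)."""

    def __init__(self):
        self.frames = []
        self._current = {}

    def write(self, stage, key, value):
        self._current.setdefault(stage, {})[key] = value

    def end_frame(self):
        self.frames.append(self._current)
        self._current = {}


def greatest(values, logger):
    """Reduce to the largest value and log which particle supplied it."""
    index = max(range(len(values)), key=values.__getitem__)
    logger.write("greatest", "index", index)
    return values[index]


logger = RecordingLogger()
features = [[0.2, 0.7, 0.4], [0.9, 0.1, 0.3]]  # one feature, two frames
positions = [
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0)],
    [(0.1, 0.0, 0.0), (1.1, 1.0, 1.0), (2.1, 2.0, 2.0)],
]

signal = []
for frame_values in features:
    signal.append(greatest(frame_values, logger))
    logger.end_frame()

# Use the logged indices to recover where each reduced value came from.
recovered = [
    frame_positions[log["greatest"]["index"]]
    for frame_positions, log in zip(positions, logger.frames)
]
```

The reduced signal carries no particle identity on its own, so the logged indices are what connect a change in the signal back to a location in the simulation box.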

In [None]:
signal_aggregator = du.data.SignalAggregator(data_generator, du.data.logging.Logger())

Let's actually compute the features now, and get a
`pandas.DataFrame` object to work with and view.

In [None]:
with gsd.fl.open(FILENAME, "rb") as traj:
    for frame in FRAMES:
        system = (
            traj.read_chunk(frame, "configuration/box"),
            traj.read_chunk(frame, "particles/position"),
        )
        voronoi_nlist_generator.compute(system)
        signal_aggregator.accumulate(
            system, neighbors=voronoi_nlist_generator.nlist
        )
df = signal_aggregator.to_dataframe()
df.to_hdf("./lj-data.h5", key="data")
df.head()

In [None]:
log_df = signal_aggregator.logger.to_dataframe()
log_df.to_hdf("./lj-log.h5", key="data")
log_df.head()

In [None]:
def position_generator(filename, frames):
    with gsd.fl.open(filename, "rb") as fh:
        for frame in frames:
            yield fh.read_chunk(frame, "particles/position")


f_positions = du.postprocessing.retrieve_positions(
    log_df, position_generator(FILENAME, FRAMES)
)
f_positions.head()

### Find Rough Nucleation Region

Using the recovered positions, we will find the area of the simulation where nucleation
occurs. This uses the knowledge that nucleation occurs around the 14th frame.
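
The estimate amounts to a per-coordinate mean and standard deviation over a few frames. The NumPy sketch below uses made-up positions to show the arithmetic; it assumes the positions do not straddle a periodic boundary, where a plain mean would be misleading.

```python
import numpy as np

# Hypothetical recovered (x, y, z) positions of the logged particle in four
# consecutive frames.
positions = np.array(
    [
        [1.0, 2.0, 3.0],
        [1.2, 2.1, 2.9],
        [0.8, 1.9, 3.1],
        [1.0, 2.0, 3.0],
    ]
)

center = positions.mean(axis=0)  # mean position over the frames
spread = positions.std(axis=0, ddof=1)  # sample std, the pandas default
```

A small spread relative to the box size suggests that the logged particles stay in the same region over those frames.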

In [None]:
nucl_positions = f_positions["$Q_{6}$"]["NthGreatest"]["100th_greatest"].iloc[14:18]
mean_pos = nucl_positions.mean().to_numpy()
std_pos = nucl_positions.std().to_numpy()
print(f"Nucleation likely around {mean_pos} with standard deviation {std_pos}")

In [None]:
nucl_positions = f_positions["$Q_{6}$"]["NthGreatest"]["100th_greatest"].iloc[14]
print(f"Nucleation location {nucl_positions.to_numpy()}")

## Computing Array Features for Later Analysis

`SignalAggregator` can also be used to aggregate per-particle features
and convert the data to an `xarray.DataArray` object. Notice that the
pipeline below has no reduction step, so the array data keeps its
per-particle dimension.
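
Without a reduction step, each frame contributes one array per feature, and the trajectory becomes a three-dimensional block. The sketch below builds such a block with NumPy from made-up data; the feature names and the (frame, feature, particle) axis order are assumptions for illustration, not a statement about the layout `dupin` produces.

```python
import numpy as np

n_particles = 4
frames = []
for frame in range(3):
    # Hypothetical per-particle features for one frame.
    features = {
        "Q6": np.full(n_particles, 0.1 * frame),
        "V_vor": np.full(n_particles, 1.0 + frame),
    }
    frames.append(np.stack(list(features.values())))  # (feature, particle)

array = np.stack(frames)  # (frame, feature, particle)
```

A labeled container such as `xarray.DataArray` adds names to these axes, so a feature can be selected by name rather than by integer position.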

In [None]:
@du.data.base.CustomGenerator
def data_generator(
    system: "tuple[freud.box.Box, np.ndarray[float]]",
    neighbors: "freud.locality.NeighborList",
):
    """Combine the MSM and Voronoi polytope volumes."""
    data = steinhardt_generator(system, neighbors=neighbors)
    data["$V_{vor}$"] = voronoi_nlist_generator.volumes
    return data


data_generator = data_generator.pipe(
    du.data.map.Tee.wraps(
        [
            du.data.spatial.NeighborAveraging.wraps("neighbors", False),
            du.data.map.Identity.wraps(),
        ]
    )
)

In [None]:
signal_aggregator = du.data.SignalAggregator(data_generator)

In [None]:
with gsd.fl.open(FILENAME, "rb") as traj:
    for frame in tqdm(FRAMES):
        system = (
            traj.read_chunk(frame, "configuration/box"),
            traj.read_chunk(frame, "particles/position"),
        )
        voronoi_nlist_generator.compute(system)
        signal_aggregator.accumulate(
            system, neighbors=voronoi_nlist_generator.nlist
        )
xarr = signal_aggregator.to_xarray(third_dim_name="particle")
xarr