# Reading & writing data

In ERLabPy, data are represented as {class}`xarray.DataArray`, {class}`xarray.Dataset`, and {class}`xarray.DataTree` objects.

- {class}`xarray.DataArray` objects are similar to waves in Igor Pro, but are much more flexible. Unlike Igor waves, which are limited to 4 dimensions, a {class}`xarray.DataArray` can have as many dimensions as you want (up to 64). Another advantage is that the coordinates of the dimensions do not have to be evenly spaced. In fact, they are not limited to numbers but can be any type of data, such as date and time representations.

- {class}`xarray.Dataset` is a collection of {class}`xarray.DataArray` objects. It is used to store multiple data arrays that are related to each other, such as a set of measurements.

- {class}`xarray.DataTree` is a hierarchical data structure that can store multiple {class}`xarray.Dataset` objects, just like an Igor experiment file with multiple waves within nested folders.
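
As a minimal sketch of the flexibility described above, the following builds a {class}`xarray.DataArray` with one unevenly spaced numeric coordinate and one date-time coordinate. The dimension names and all values are made up for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical 2D data: 3 time stamps x 4 unevenly spaced positions
times = np.array(
    ["2024-01-01T00:00", "2024-01-01T06:00", "2024-01-02T12:00"],
    dtype="datetime64[ns]",
)
positions = np.array([0.0, 0.1, 0.5, 2.0])  # spacing does not have to be uniform

darr = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=("time", "x"),
    coords={"time": times, "x": positions},
    name="intensity",
)

# Label-based selection works for both coordinate types
row = darr.sel(time=np.datetime64("2024-01-01T06:00"))
value = darr.sel(time=np.datetime64("2024-01-01T06:00"), x=0.45, method="nearest")
```

Here `row` is the 1D slice at the second time stamp, and `value` is the element whose `x` coordinate is closest to 0.45.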

See [Data Structures](https://docs.xarray.dev/en/latest/user-guide/data-structures.html) in the xarray documentation for a general introduction to xarray data structures.

This guide will introduce you to reading and writing data from and to various file formats, and how to implement a custom plugin for an experimental setup.

:::{note}

If you are not familiar with {mod}`xarray`, it is recommended to read the [xarray tutorial](https://tutorial.xarray.dev/) and the [xarray user guide](https://docs.xarray.dev/en/stable/user-guide/index.html) first.

:::

Skip to the [corresponding section](loading-arpes-data) for guides on loading ARPES data.

## Reading data with `xarray`

{mod}`xarray` provides basic support for reading and writing NetCDF and HDF5 files into {mod}`xarray` objects. See the {mod}`xarray` documentation on [I/O operations](https://docs.xarray.dev/en/stable/user-guide/io.html) for more information.

Here, we will focus on working with data exported from Igor Pro and other commonly used file formats.

### From Igor Pro

Installing ERLabPy automatically registers a backend for xarray that allows reading `.pxt`, `.pxp`, and `.ibw` files. This means that you can load these files directly into xarray using {func}`xarray.open_dataset` or {func}`xarray.open_dataarray` as if they were NetCDF files.

In most cases, xarray will automatically detect the file format. For example, to load an `.ibw` file into a {class}`xarray.DataArray`, use the following code:

```python
import xarray as xr

data = xr.open_dataarray("path/to/wave.ibw")
```

Loading an experiment file to a {class}`xarray.DataTree` is also possible:

```python
data = xr.open_datatree("path/to/experiment.pxp")
```

Along with the Igor Pro file formats, the backend also supports loading HDF5 files exported from Igor Pro. For such files, the engine must be specified explicitly with `engine="erlab-igor"`.

:::{warning}

Loading waves from complex ``.pxp`` files may fail or produce unexpected results. It is recommended to export the waves to a ``.ibw`` file to load them in ERLabPy. If you encounter any problems, please let us know by opening an issue.

:::

### From arbitrary formats

There are many Python libraries that can read and write data in various formats. Some
common file formats and how to read them are listed below:

* Spreadsheet data can be read using {func}`pandas.read_csv` and {func}`pandas.read_excel`.
  
  The resulting DataFrame can be converted to an xarray object using {meth}`pandas.DataFrame.to_xarray` or {meth}`xarray.Dataset.from_dataframe`.

* When reading HDF5 files with arbitrary groups and metadata, you must first explore the group structure using [h5netcdf](https://h5netcdf.org/). More conveniently, you can use {func}`xarray.open_groups` to inspect the group structure.

* FITS files can be read with [astropy](https://docs.astropy.org/en/stable/io/fits/index.html).

  In the near future, ERLabPy will provide a loader for FITS files.

* For working with NeXus files, see {mod}`erlab.io.nexusutils`.
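
As a sketch of the spreadsheet route listed above, the following reads a small CSV table with {func}`pandas.read_csv` and converts it with {meth}`pandas.DataFrame.to_xarray`. The column names and values are hypothetical, and an in-memory string stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for a file on disk
csv_text = """temperature,resistance
10.0,1.5
20.0,1.7
30.0,2.1
"""

df = pd.read_csv(io.StringIO(csv_text))

# Columns used as the index become coordinates of the resulting Dataset
ds = df.set_index("temperature").to_xarray()
resistance = ds["resistance"]  # a DataArray with dimension "temperature"
```

Setting the index before the conversion is a design choice: without it, the data would be indexed by the default integer row index instead of a physically meaningful coordinate.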


## Writing ``xarray`` objects to a file

Since the state and variables of a Python interpreter are not preserved once the session ends, it is important to save your data in a format that can be easily read and written.

While it is possible to save and load entire Python interpreter sessions using [pickle](https://docs.python.org/3/library/pickle.html) or the more versatile [dill](https://github.com/uqfoundation/dill), it is out of the scope of this guide. Instead, we recommend saving your data in a format that is easy to read, write, and share, such as HDF5 or NetCDF. To save and load xarray objects to such formats, see the xarray documentation on [I/O operations](https://docs.xarray.dev/en/stable/user-guide/io.html).

### To Igor Pro

As an experimental feature, {func}`save_as_hdf5 <erlab.io.save_as_hdf5>` can save certain DataArrays in a format that is compatible with the Igor Pro HDF5 loader. An [accompanying Igor procedure](https://github.com/kmnhan/erlabpy/blob/main/PythonInterface.ipf) is available in the repository. If loading in Igor Pro fails, try saving again with all attributes removed.

Alternatively, [igorwriter](https://github.com/t-onoz/igorwriter) can be used to write numpy arrays to ``.ibw`` and ``.itx`` files directly.

(loading-arpes-data)=
## ARPES data

ARPES data from synchrotron endstations and laboratory setups worldwide are saved in diverse formats. ERLabPy’s data loading framework strives to offer a unified interface for loading ARPES data from various sources.

To ensure seamless integration with common analysis procedures like momentum conversion and Fermi edge fitting, the data loaded into xarray objects must adhere to specific conventions.


(data-conventions)=

### Conventions

:::{note}

These conventions are not strictly enforced, but adhering to them will simplify the use of the provided analysis tools.

Generally, any type of xarray object will be compatible with analysis routines that aren’t specific to ARPES, such as plotting, masking, transformations, curve fitting, interpolation, and so on.

:::

These are some rules that loaded ARPES data should follow to ensure compatibility with analysis procedures such as momentum conversion and Fermi edge fitting:

- Information about the experimental geometry is stored in the `'configuration'` attribute as an integer from 1 to 4. See [Nomenclature](nomenclature) and {class}`AxesConfiguration <erlab.constants.AxesConfiguration>` for more information.

- Angles are stored in coordinates that are named according to the conventions in [Nomenclature](nomenclature).

- The energy (binding or kinetic) is stored in a coordinate named `'eV'`. The sign of binding energies should be negative for occupied states.

- The photon energy must be stored in a coordinate named `'hv'`.

- The sample temperature, if available, is stored in an attribute or coordinate named `'sample_temp'`.

- The work function of the system, if available, is stored in an attribute named `'sample_workfunction'`.

- The angular resolution of the experiment, if available, is stored in an attribute named `'angle_resolution'`. This is only used to estimate momentum grid sizes when converting to momentum space.



In addition, the following units are used:

| Quantity         | Unit            |
|:----------------:|:---------------:|
| Energy           | eV              |
| Angle            | deg             |
| Temperature      | K               |
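
To make the conventions concrete, here is a minimal sketch of a synthetic cut that follows them. All numerical values are placeholders, and the data itself is just an array of zeros; only the coordinate and attribute names are taken from the list above.

```python
import numpy as np
import xarray as xr

# Synthetic cut; all values are placeholders
alpha = np.linspace(-15.0, 15.0, 5)  # analyzer angle in degrees
eV = np.linspace(-0.5, 0.1, 4)  # binding energy in eV, negative = occupied

cut = xr.DataArray(
    np.zeros((alpha.size, eV.size)),
    dims=("alpha", "eV"),
    coords={
        "alpha": alpha,
        "eV": eV,
        "hv": 50.0,  # photon energy in eV, stored as a scalar coordinate
        "sample_temp": 20.0,  # sample temperature in K
    },
    attrs={
        "configuration": 1,  # experimental geometry, integer from 1 to 4
        "sample_workfunction": 4.3,  # work function in eV
        "angle_resolution": 0.1,  # angular resolution in degrees
    },
)
```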


### Loading

ERLabPy's data loading framework consists of various plugins, or *loaders*, each
designed to load data from a different beamline or laboratory. Each *loader* is a class
instance that has a `load` method which takes a file path or sequence number and returns
data.

Let's see the list of available loaders:

In [None]:
import erlab

erlab.io.loaders

In [None]:
%config InlineBackend.figure_formats = ["svg", "pdf"]
import matplotlib.pyplot as plt
import xarray as xr

plt.rcParams["figure.dpi"] = 96
plt.rcParams["image.cmap"] = "viridis"

xr.set_options(display_expand_data=False)

You can access each loader using its name as an attribute or an item. For example, to
access the loader for the ALS beamline 4.0.3 (MERLIN), you can use any of the following
methods:

In [None]:
erlab.io.loaders["merlin"]
erlab.io.loaders.merlin

Data loading is done by calling the {meth}`load <erlab.io.dataloader.LoaderBase.load>` method of the loader. It requires an `identifier` parameter, which can be a path to a file or a sequence number. It also accepts a `data_dir` parameter, which specifies the directory where the data is stored.

- If `identifier` is a sequence number, `data_dir` must be provided.

- If `identifier` is a string and `data_dir` is provided, the path is constructed by
  joining `data_dir` and `identifier`.

- If `identifier` is a string and `data_dir` is not provided, `identifier` should be a
  valid path to a file.

Suppose we have data from the ALS beamline 4.0.3 stored as `/path/to/data/f_001.pxt`, `/path/to/data/f_002.pxt`, etc. To load `f_001.pxt`, all three of the following are valid:

```python
loader = erlab.io.loaders["merlin"]

loader.load("/path/to/data/f_001.pxt")
loader.load("f_001.pxt", data_dir="/path/to/data")
loader.load(1, data_dir="/path/to/data")
```
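
The three rules can be sketched in plain Python. This is an illustration only: the helper name `resolve_identifier` and the hard-coded `f_###.pxt` pattern are assumptions made for this example, whereas a real loader resolves sequence numbers with its own file-matching logic.

```python
from pathlib import PurePosixPath


def resolve_identifier(identifier, data_dir=None):
    """Illustrative only: mimic the three rules for files named like f_001.pxt."""
    if isinstance(identifier, int):
        # Rule 1: a sequence number cannot be resolved without a directory
        if data_dir is None:
            raise ValueError("data_dir is required for a sequence number")
        return PurePosixPath(data_dir) / f"f_{identifier:03d}.pxt"
    if data_dir is not None:
        # Rule 2: join the directory and the file name
        return PurePosixPath(data_dir) / identifier
    # Rule 3: the string is already a full path
    return PurePosixPath(identifier)


paths = {
    resolve_identifier("/path/to/data/f_001.pxt"),
    resolve_identifier("f_001.pxt", data_dir="/path/to/data"),
    resolve_identifier(1, data_dir="/path/to/data"),
}
```

All three calls resolve to the same path, so `paths` contains a single element.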

### Setting the default loader and data directory

In practice, a loader and a single directory will be used repeatedly in a session to load different data from the same experiment.

Instead of explicitly specifying the loader and directory each time, a default loader and data directory can be set with {func}`erlab.io.set_loader` and {func}`erlab.io.set_data_dir`. All subsequent calls to the shortcut function {func}`erlab.io.load` will use the specified loader and data directory.

```python
erlab.io.set_loader("merlin")
erlab.io.set_data_dir("/path/to/data")
data_1 = erlab.io.load(1)
data_2 = erlab.io.load(2)
```

The loader and data directory can also be controlled with a context manager:

```python
with erlab.io.loader_context("merlin", data_dir="/path/to/data"):
    data_1 = erlab.io.load(1)
```

### Data across multiple files

For setups like the ALS beamline 4.0.3, some scans are stored over multiple files like
`f_003_S001.pxt`, `f_003_S002.pxt`, and so on. In this case, the loader will
automatically concatenate all files in the same scan. For example, *all of the
following* will return the same concatenated data:

```python
erlab.io.load(3)
erlab.io.load("f_003_S001.pxt")
erlab.io.load("f_003_S002.pxt")
```

If you want to cherry-pick a single file, you can pass ``single=True`` to {meth}`load
<erlab.io.dataloader.LoaderBase.load>`:

```python
erlab.io.load("f_003_S001.pxt", single=True)
```

If you don't want automatic concatenation to happen, you can suppress it with `combine=False`. The following code will return a list of DataArrays:
```python
erlab.io.load(3, combine=False)
```

### Handling multiple data directories

If you call {func}`erlab.io.set_loader` or {func}`erlab.io.set_data_dir` multiple times, the last call will override the previous ones. While this is useful for changing the loader or data directory, it makes data loading *dependent on execution order*. This may lead to unexpected behavior in notebooks.

If you plan to use multiple loaders or data directories in the same session, it is recommended to use the context manager {func}`erlab.io.loader_context`:

```python
with erlab.io.loader_context("merlin", data_dir="/path/to/data"):
    data = erlab.io.load(identifier)
```

It may also be convenient to define functions that set the loader and data directory and
call {func}`erlab.io.load` with the appropriate arguments.
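
The difference between a persistent default and a scoped one can be illustrated with a toy model in plain Python. This is not ERLabPy's implementation; the names `set_data_dir`, `data_dir_context`, and `current_dir` below are hypothetical stand-ins, and the sketch assumes that the context manager restores the previous default on exit, which is the usual contract for this pattern.

```python
from contextlib import contextmanager

# Toy module-level default, standing in for a session-wide data directory
_state = {"data_dir": None}


def set_data_dir(path):
    _state["data_dir"] = path  # persists until the next call


@contextmanager
def data_dir_context(path):
    previous = _state["data_dir"]
    _state["data_dir"] = path
    try:
        yield
    finally:
        _state["data_dir"] = previous  # restore, even if an error occurred


def current_dir():
    return _state["data_dir"]


set_data_dir("/path/to/data")
with data_dir_context("/path/to/other"):
    inside = current_dir()  # the override is only visible inside the block
after = current_dir()  # the earlier default is back in effect
```

Because the override is limited to the `with` block, re-running notebook cells in a different order cannot silently change which directory a later cell reads from.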

### Summarizing data

Some supported loaders can generate a {class}`pandas.DataFrame` containing an overview of the data in a given directory. The generated summary can be viewed as a table with the {meth}`summarize <erlab.io.dataloader.LoaderBase.summarize>` method.

If `ipywidgets` is installed, an interactive widget is also displayed. This is useful for quickly skimming through the data.

Just like {meth}`load <erlab.io.dataloader.LoaderBase.load>`, {meth}`summarize <erlab.io.dataloader.LoaderBase.summarize>` can also be accessed with the shortcut function {func}`erlab.io.summarize`. For example, to display a summary of the data available in the directory `/path/to/data` using the `'merlin'` loader:

```python
erlab.io.set_loader("merlin")
erlab.io.summarize("/path/to/data")
```
If the path is not specified, the current data directory is used.

To see what the generated summary looks like, see the [example below](summary example).

:::{note}

If the [ImageTool manager](imagetool-manager-guide) is running, a button to open the data in ImageTool is shown in the interactive summary.

:::

:::{note}

Alternatively, a Qt-based GUI for browsing and loading data is also available. See {mod}`erlab.interactive.explorer` for more information.

:::


(implementing-plugins)=
## Implementing a data loader plugin 

Implementing a new loader plugin to support an ARPES setup can be done by subclassing {class}`LoaderBase <erlab.io.dataloader.LoaderBase>` and inheriting or overriding some of its methods and attributes. Any subclass of {class}`LoaderBase <erlab.io.dataloader.LoaderBase>` is automatically registered as a loader.

At the bare minimum, a loader must override the {attr}`name <erlab.io.dataloader.LoaderBase.name>` attribute and the {meth}`load_single <erlab.io.dataloader.LoaderBase.load_single>` method. Other additional attributes and methods can be implemented to provide more functionality.

Before we dive into the details, let's first understand the data loading flow.


### Data loading flow

The core method of a loader is the {meth}`load_single <erlab.io.dataloader.LoaderBase.load_single>` method, which is given a path to a single file and must return the data as an xarray object. In most cases, this will be a {class}`xarray.DataArray`. In cases where the data is more complex, e.g., multiple region scans with different axes, returning a {class}`xarray.Dataset` or {class}`xarray.DataTree` is also possible. In {meth}`load_single <erlab.io.dataloader.LoaderBase.load_single>`, post-processing steps such as renaming and reordering dimensions should not be included, as this can be handled automatically by setting some class attributes that we will discuss later.

ARPES data files from a single experiment usually follow a fixed naming scheme, e.g., `file_0001.h5`, `file_0002.h5`, and so on. If the naming scheme is well-defined, it is possible to infer the file path from a sequence number so that the user can use the sequence number directly to load the data. This can be accomplished by implementing the {meth}`identify <erlab.io.dataloader.LoaderBase.identify>` method, which should infer the full path to a data file given an integer sequence number (`identifier`) and the path to a folder (`data_dir`).

The following flowchart shows the process of loading data from a single scan, given the path to the directory (`data_dir`) and the sequence number or file name (`identifier`):

```{image} ../images/flowchart_single.pdf
:align: center
:alt: Flowchart for loading data from a single file
```

If only all data formats were as simple as this! Unfortunately, there are some setups where data that belongs to a single scan is saved over multiple files. In this case, the files will look like `file_0001_0001.h5`, `file_0001_0002.h5`, etc., and we can no longer uniquely identify a single file with a sequence number. For these kinds of setups, an additional method {meth}`infer_index <erlab.io.dataloader.LoaderBase.infer_index>` must be implemented. The following flowchart shows the process of loading data from multiple files:

```{image} ../images/flowchart_multiple.pdf
:align: center
:alt: Flowchart for loading data from multiple files
```

In this case, the method {meth}`identify <erlab.io.dataloader.LoaderBase.identify>` should resolve *all* files that belong to the given sequence number, and return a *list* of file paths along with a dictionary of coordinates that are varied across the files. For example, if there are three files for a scan taken at three different `beta` angles, the method should return a list of three file paths and a dictionary with `'beta'` as the sole key and an array of length 3 containing the angle as the value. An empty dictionary should be returned if there are no varying coordinates.

The method {meth}`infer_index <erlab.io.dataloader.LoaderBase.infer_index>` must infer the sequence number from a bare file name (without the extension and directory name). For example, given `file_0003_0123`, the method should infer `3`.
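
A minimal standalone sketch of this inference, assuming the hypothetical `file_<scan>_<index>` naming scheme used above. It is written as a plain function rather than a method, and it follows the two-value return (sequence number plus a dictionary of extra keyword arguments) used by the example loader later in this guide.

```python
import re

# Hypothetical naming scheme: file_<scan>_<index>, e.g. file_0003_0123
_PATTERN = re.compile(r"file_(\d{4})(?:_\d{4})?")


def infer_index(name):
    """Return the scan number and an empty dict of extra keyword arguments.

    Returns (None, None) if the name does not follow the scheme.
    """
    match = _PATTERN.fullmatch(name)
    if match is None:
        return None, None
    return int(match.group(1)), {}
```

With this pattern, both `file_0003_0123` and `file_0003` map to sequence number 3, while unrelated names are rejected.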

### A minimal example

Consider a setup that saves data into `.csv` files named `data_0001.csv`, `data_0002.csv`, and so on. A simple implementation of a loader for the setup will look something like this:

In [None]:
import os

import pandas as pd

from erlab.io.dataloader import LoaderBase


class MyLoader(LoaderBase):
    name = "my_loader"
    description = "Barebones loader for CSV files"
    extensions = {".csv"}
    skip_validate = False
    always_single = True

    def identify(self, num, data_dir):
        file = os.path.join(data_dir, f"data_{str(num).zfill(4)}.csv")
        return [file], {}

    def load_single(self, file_path, without_values=False):
        return pd.read_csv(file_path).to_xarray()

Some class attributes and methods have been implemented. For a detailed explanation of each attribute and method, see the {class}`LoaderBase <erlab.io.dataloader.LoaderBase>` documentation.

We can see that the loader has been properly registered:

In [None]:
erlab.io.loaders

In [None]:
erlab.io.loaders["my_loader"]

The loader can be used just like the built-in loaders:

```python
data = erlab.io.loaders.my_loader.load(1, data_dir="/path/to/data")
```

### Handling metadata

Unlike the previous example, real ARPES data is more than just a simple array of numbers. It contains metadata such as the experimental geometry, sample temperature, and so on. It is important to store this metadata in the xarray object in a consistent manner as defined [here](data-conventions).

To obtain a consistent representation of the data, data loaded by {meth}`load_single <erlab.io.dataloader.LoaderBase.load_single>` must be post-processed to adhere to the conventions. Typically, this involves manipulating coordinate and attribute names, which is automatically performed based on the following class attributes:

- {attr}`name_map <erlab.io.dataloader.LoaderBase.name_map>`

- {attr}`coordinate_attrs <erlab.io.dataloader.LoaderBase.coordinate_attrs>`

- {attr}`average_attrs <erlab.io.dataloader.LoaderBase.average_attrs>`

- {attr}`additional_attrs <erlab.io.dataloader.LoaderBase.additional_attrs>`

- {attr}`overridden_attrs <erlab.io.dataloader.LoaderBase.overridden_attrs>`

- {attr}`additional_coords <erlab.io.dataloader.LoaderBase.additional_coords>`

- {attr}`overridden_coords <erlab.io.dataloader.LoaderBase.overridden_coords>`
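
To give a feel for what a name mapping does, here is a sketch in plain Python. It is not ERLabPy's implementation: the helper `build_rename_dict` is hypothetical, and it only assumes that `name_map` maps each standard name to one raw name or to a list of aliases, of which at most one may be present in the data.

```python
def build_rename_dict(name_map, present):
    """Map raw names found in `present` to their standard names."""
    rename = {}
    for standard, raw in name_map.items():
        aliases = [raw] if isinstance(raw, str) else list(raw)
        found = [alias for alias in aliases if alias in present]
        if len(found) > 1:
            # Two aliases of the same standard name would collide
            raise ValueError(f"Multiple aliases for {standard!r} present: {found}")
        if found:
            rename[found[0]] = standard
    return rename


name_map = {
    "eV": "BindingEnergy",
    "beta": ["Polar", "Polar Compens"],
    "hv": "PhotonEnergy",
}
rename = build_rename_dict(name_map, present={"BindingEnergy", "Polar", "TB"})
```

The resulting dictionary has the form accepted by `xarray.DataArray.rename`, mapping each raw name that was found to its standard name; names without an entry in `name_map`, such as `TB` here, are left alone.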

Any post-processing steps that reach beyond renaming and reordering dimensions can be implemented in the {meth}`post_process <erlab.io.dataloader.LoaderBase.post_process>` method:

```python
def post_process(self, data: xr.DataArray) -> xr.DataArray:
    data = super().post_process(data)
    # Perform additional post-processing steps here
    return data
```

The loaders perform a basic check for some of the [conventions](data-conventions) using {meth}`validate <erlab.io.dataloader.LoaderBase.validate>` for every data file loaded. A warning is issued if some are missing. This behavior can be controlled with loader class attributes {attr}`skip_validate <erlab.io.dataloader.LoaderBase.skip_validate>` and {attr}`strict_validation <erlab.io.dataloader.LoaderBase.strict_validation>`.

### Data spanning multiple files

Next, let's write a more realistic loader for a hypothetical setup that saves data as HDF5 files named `data_001.h5`, `data_002.h5`, and so on. Scans that span multiple files are named like `data_001_S001.h5`, `data_001_S002.h5`, etc., with the scan axis information stored in a separate file named `data_001_axis.csv`.

Let us first generate a data directory and place some synthetic data in it. Before saving, we rename and set some attributes that resemble real ARPES data.

In [None]:
import csv
import datetime
import tempfile

import numpy as np

import erlab
from erlab.io.exampledata import generate_data_angles


def make_data(beta=5.0, temp=20.0, hv=50.0, bandshift=0.0):
    data = generate_data_angles(
        shape=(250, 1, 300),
        angrange={"alpha": (-15, 15), "beta": (beta, beta)},
        hv=hv,
        configuration=1,
        temp=temp,
        bandshift=bandshift,
        assign_attributes=False,
        seed=1,
    ).T

    # Rename coordinates. The loader must rename them back to the original names.
    data = data.rename(
        {
            "alpha": "ThetaX",
            "beta": "Polar",
            "eV": "BindingEnergy",
            "hv": "PhotonEnergy",
            "xi": "Tilt",
            "delta": "Azimuth",
        }
    )
    dt = datetime.datetime.now()

    # Assign some attributes that real data would have
    data = data.assign_attrs(
        {
            "LensMode": "Angular30",  # Lens mode of the analyzer
            "SpectrumType": "Fixed",  # Acquisition mode of the analyzer
            "PassEnergy": 10,  # Pass energy of the analyzer
            "UndPol": 0,  # Undulator polarization
            "Date": dt.strftime(r"%d/%m/%Y"),  # Date of the measurement
            "Time": dt.strftime("%I:%M:%S %p"),  # Time of the measurement
            "TB": temp,
            "X": 0.0,
            "Y": 0.0,
            "Z": 0.0,
        }
    )
    return data


# Create a temporary directory
tmp_dir = tempfile.TemporaryDirectory()

# Define coordinates for the scan
beta_coords = np.linspace(2, 7, 10)

# Generate and save cuts with different beta values
for i, beta in enumerate(beta_coords):
    data = make_data(beta=beta, temp=20.0, hv=50.0)
    filename = f"{tmp_dir.name}/data_001_S{str(i + 1).zfill(3)}.h5"
    data.to_netcdf(filename, engine="h5netcdf")

# Write scan coordinates to a csv file
with open(f"{tmp_dir.name}/data_001_axis.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["Index", "Polar"])

    for i, beta in enumerate(beta_coords):
        writer.writerow([i + 1, beta])

# Generate some cuts with different band shifts
for i in range(4):
    data = make_data(beta=5.0, temp=20.0, hv=50.0, bandshift=-i * 0.05)
    filename = f"{tmp_dir.name}/data_{str(i + 2).zfill(3)}.h5"
    data.to_netcdf(filename, engine="h5netcdf")

Now, we have generated a folder that resembles typical data from an ARPES experiment. Let's list the contents of the folder:

In [None]:
sorted(os.listdir(tmp_dir.name))

Each HDF5 file represents a single ARPES cut. Together, `data_001_S001.h5` to
`data_001_S010.h5` represent an ARPES map with 10 cuts, with the scan axis recorded in
`data_001_axis.csv`. Let's check what the raw data looks like.

In [None]:
xr.load_dataarray(f"{tmp_dir.name}/data_002.h5")

The data has been properly loaded, but the coordinates and attributes have names that
are specific to the beamline.

Our loader should do three things: rename the coordinates and attributes to standard
names, add metadata to the dataset, and combine related cuts into a single DataArray
that contains the ARPES mapping.
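
The last of these steps, combining cuts into a map, can be sketched with plain xarray. This only illustrates the idea with synthetic arrays and made-up `beta` values; it is not how ERLabPy combines files internally.

```python
import numpy as np
import xarray as xr

# Three synthetic cuts, each taken at a different (hypothetical) beta angle
betas = [2.0, 3.0, 4.0]
cuts = [
    xr.DataArray(
        np.full((4, 5), float(i)),
        dims=("alpha", "eV"),
        coords={"alpha": np.linspace(-1, 1, 4), "eV": np.linspace(-0.4, 0.0, 5)},
    ).assign_coords(beta=beta)  # scalar coordinate on each cut
    for i, beta in enumerate(betas)
]

# Concatenating along "beta" turns the scalar coordinate into a dimension
arpes_map = xr.concat(cuts, dim="beta")
```

Storing the varying angle as a coordinate on each cut is what allows it to become a proper dimension of the combined array.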

:::{note}

Here, we easily loaded the data into an xarray object directly, but that is not the case for most experimental setups. Properly loading raw data into an xarray object is a complex process that requires knowledge of the data format and the experimental setup, and this is what must be implemented in the {meth}`load_single <erlab.io.dataloader.LoaderBase.load_single>` method.

ERLabPy provides convenient functions to ease this process. See [implementations of existing data loaders](https://github.com/kmnhan/erlabpy/tree/main/src/erlab/io/plugins) for examples.

:::

Now that we have the data, let's implement the loader. The biggest difference from the previous example is that we need to handle multiple files for a single scan in {meth}`identify <erlab.io.dataloader.LoaderBase.identify>`. Also, we have to implement {meth}`infer_index <erlab.io.dataloader.LoaderBase.infer_index>` to extract the scan number from the file name.

In [None]:
import pathlib
import re

import erlab


class ExampleLoader(erlab.io.dataloader.LoaderBase):
    name = "example"
    description = "Example loader for multiple files"
    extensions = {".h5"}

    name_map = {
        "eV": "BindingEnergy",
        "alpha": "ThetaX",
        "beta": ["Polar", "Polar Compens"],
        # Can have multiple names assigned to the same name
        # If both are present in the data, a ValueError will be raised
        "delta": "Azimuth",
        "xi": "Tilt",
        "hv": "PhotonEnergy",
        "polarization": "UndPol",
        "sample_temp": "TB",
    }
    # Map the names of the coordinates or attributes in the resulting data to the names
    # present in the data returned by `load_single`. Note that the order of
    # non-dimension coordinates in the output data will follow the order of the keys in
    # this dictionary.

    coordinate_attrs: tuple[str, ...] = (
        "beta",
        "delta",
        "xi",
        "hv",
        "X",
        "Y",
        "Z",
        "polarization",
        "photon_flux",
        "sample_temp",
    )
    # Attributes to be used as coordinates. Place all attributes that we don't want to
    # lose when merging multiple file scans here.

    additional_attrs = {
        "configuration": 1,  # Experimental geometry. Required for momentum conversion
        "sample_workfunction": 4.3,
    }
    # Any additional metadata you want to add to the data. Note that attributes defined
    # here will not be transformed into coordinates. If you wish to promote some fixed
    # attributes to coordinates, add them to additional_coords.

    additional_coords = {}
    # Additional non-dimension coordinates to be added to the data, for instance the
    # photon energy for lab-based ARPES.

    always_single = False

    def identify(self, num, data_dir):
        data_dir = pathlib.Path(data_dir)

        coord_dict = {}

        # Look for scans with data_###_S###.h5, and sort them
        files = sorted(data_dir.glob(f"data_{str(num).zfill(3)}_S*.h5"))

        if len(files) == 0:
            # If no files found, look for data_###.h5
            files = sorted(data_dir.glob(f"data_{str(num).zfill(3)}.h5"))
            if len(files) > 1:
                # More than one file found with the same scan number, show warning
                erlab.utils.misc.emit_user_level_warning(
                    f"Multiple files found for scan {num}, using {files[0]}"
                )
                files = files[:1]
        else:
            # Multi-file scan: read coordinate values from the axis CSV file
            axis_file = data_dir / f"data_{str(num).zfill(3)}_axis.csv"
            with axis_file.open("r") as f:
                header = f.readline().strip().split(",")

            # Load the coordinates from the csv file
            coord_arr = np.loadtxt(axis_file, delimiter=",", skiprows=1)

            # Each header entry will contain a dimension name
            for i, hdr in enumerate(header[1:]):
                coord_dict[hdr] = coord_arr[: len(files), i + 1].astype(np.float64)

        if len(files) == 0:
            # If no files found up to this point, return None
            return None

        return files, coord_dict

    def load_single(self, file_path, without_values=False):
        return xr.open_dataarray(file_path, engine="h5netcdf")

    def infer_index(self, name):
        # Get the scan number from file name
        try:
            scan_num: str = re.match(r".*?(\d{3})(?:_S\d{3})?", name).group(1)
        except (AttributeError, IndexError):
            return None, None

        if scan_num.isdigit():
            # The second return value, a dictionary, is reserved for more complex
            # setups. See tips below for a brief explanation.
            return int(scan_num), {}
        return None, None

In [None]:
erlab.io.loaders

We can see that the `example` loader has been registered. Let's test the loader by
loading and plotting some data.

In [None]:
erlab.io.set_loader("example")
erlab.io.set_data_dir(tmp_dir.name)
erlab.io.load(1)

In [None]:
erlab.io.load(5).qplot()

Brilliant! We now have a working loader for our hypothetical setup. 

:::{note}

- There are more class attributes and methods that can be inherited or overridden to customize the loader's behavior.

- For single-file loaders which save data in well-known formats such as outputs from Scienta Omicron DA30 analyzers, SES, or NeXus, the implementation can be much more straightforward. See the implementations of existing data loaders for examples.

:::

However, in order to use {func}`erlab.io.summarize` with our loader, a few more methods and attributes need to be implemented. These are discussed in the next section.

### Summary generation

To enable summary generation, we need to implement two attributes and one method:

- {attr}`formatters <erlab.io.dataloader.LoaderBase.formatters>`: A dictionary that maps attribute or coordinate names in the data to functions that convert the coordinate or attribute value into a human-readable form.

- {attr}`summary_attrs <erlab.io.dataloader.LoaderBase.summary_attrs>`: A dictionary that maps summary column names to attribute or coordinate names in the data. A callable can also be used to generate entries for attributes that are not directly present in the data. 

- {meth}`files_for_summary <erlab.io.dataloader.LoaderBase.files_for_summary>`: A method that takes a path to a directory and returns a list of file paths in the directory that are associated with the loader. 

You can also choose to implement the following attribute to further customize the
summary:

- {attr}`summary_sort <erlab.io.dataloader.LoaderBase.summary_sort>`: A string that determines the column name to sort the summary table with.

  If not provided, the table will respect the order of the files returned by {meth}`files_for_summary <erlab.io.dataloader.LoaderBase.files_for_summary>`.
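
The interplay between these attributes can be pictured with a toy model in plain Python. This is a sketch under stated assumptions, not ERLabPy's implementation: the helper `build_summary_row` is hypothetical, and a plain dictionary stands in for the loaded data.

```python
def build_summary_row(attrs, summary_attrs, formatters):
    """Toy model: `attrs` is a plain dict standing in for loaded data."""
    row = {}
    for column, source in summary_attrs.items():
        if callable(source):
            row[column] = source(attrs)  # computed entry
            continue
        value = attrs.get(source)
        formatter = formatters.get(source)
        if value is not None and formatter is not None:
            value = formatter(value)  # convert to a human-readable form
        row[column] = value
    return row


attrs = {"LensMode": "Angular30", "PassEnergy": 10}
summary_attrs = {
    "Lens Mode": "LensMode",
    "Pass E": "PassEnergy",
    "Kind": lambda a: "cut",  # hypothetical computed column
}
formatters = {"LensMode": lambda x: x.replace("Angular", "A")}
row = build_summary_row(attrs, summary_attrs, formatters)
```

Each key of `summary_attrs` becomes a column of the summary, and a formatter is applied only where one is defined for the underlying name.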

To improve the performance of summary generation, you can optionally implement {meth}`load_single <erlab.io.dataloader.LoaderBase.load_single>` to utilize the `without_values` argument. If it is True, it means that the values in the returned data of {meth}`load_single <erlab.io.dataloader.LoaderBase.load_single>` will not be accessed, so you can return the data with its values set to arbitrary numbers. This is useful when only the metadata is needed for the summary. An example of this will be shown below.


In [None]:
def _format_polarization(val) -> str:
    val = round(float(val))
    return {0: "LH", 2: "LV", -1: "RC", 1: "LC"}.get(val, str(val))


def _parse_time(darr: xr.DataArray) -> datetime.datetime:
    return datetime.datetime.strptime(
        f"{darr.attrs['Date']} {darr.attrs['Time']}", "%d/%m/%Y %I:%M:%S %p"
    )


def _determine_kind(darr: xr.DataArray) -> str:
    data_type = "xps"
    if "alpha" in darr.dims:
        data_type = "cut"
    if "beta" in darr.dims:
        data_type = "map"
    if "hv" in darr.dims:
        data_type = "hvdep"
    return data_type


class ExampleLoaderComplete(ExampleLoader):
    name = "example_complete"
    description = "Example loader that supports summary generation"

    formatters = {
        "polarization": _format_polarization,
        "LensMode": lambda x: x.replace("Angular", "A"),
    }

    summary_attrs = {
        "Time": _parse_time,
        "Type": _determine_kind,
        "Lens Mode": "LensMode",
        "Scan Type": "SpectrumType",
        "T(K)": "sample_temp",
        "Pass E": "PassEnergy",
        "Polarization": "polarization",
        "hv": "hv",
        "x": "X",
        "y": "Y",
        "z": "Z",
        "polar": "beta",
        "tilt": "xi",
        "azi": "delta",
    }

    summary_sort = "Time"

    def load_single(self, file_path, without_values=False):
        darr = xr.open_dataarray(file_path, engine="h5netcdf")

        if without_values:
            # Prevent loading values into memory
            return xr.DataArray(
                np.zeros(darr.shape, darr.dtype),
                coords=darr.coords,
                dims=darr.dims,
                attrs=darr.attrs,
                name=darr.name,
            )

        return darr

    def files_for_summary(self, data_dir):
        return erlab.io.utils.get_files(data_dir, extensions=[".h5"])


erlab.io.loaders

(summary example)=

Let's see what the resulting summary looks like.

:::{note}

- If [ipywidgets](https://github.com/jupyter-widgets/ipywidgets) is not installed, only the DataFrame will be displayed.
- If you are viewing this documentation online, the summary will not be interactive. Run the code locally to try it out.

:::

In [None]:
erlab.io.set_loader("example_complete")
erlab.io.summarize()

Each cell in the summary table is formatted with {meth}`formatter <erlab.io.dataloader.LoaderBase.formatter>` after applying the {attr}`formatters <erlab.io.dataloader.LoaderBase.formatters>`.

### Tips

- The data loading framework is designed to be simple and flexible, but it may not cover all possible setups. If you encounter a setup that cannot be loaded with the existing API, please let us know by opening an issue!

- Before implementing a loader, see {mod}`erlab.io.dataloader` for descriptions of each attribute, and the values and types of the expected outputs. The implementation of existing loaders in the {mod}`erlab.io.plugins` module is a good starting point; see the [source code on GitHub](https://github.com/kmnhan/erlabpy/tree/main/src/erlab/io/plugins).

- If you wish to add general post-processing steps such as fixing the sign of the binding energy coordinates, you can reimplement {meth}`post_process <erlab.io.dataloader.LoaderBase.post_process>` which by default handles coordinate and attribute renaming.

- For complex data structures, constructing a full path from just the sequence number and the data directory can be difficult. In this case, {meth}`identify <erlab.io.dataloader.LoaderBase.identify>` can be implemented to take additional keyword arguments. All additional keyword arguments passed to {meth}`load <erlab.io.dataloader.LoaderBase.load>` are passed to {meth}`identify <erlab.io.dataloader.LoaderBase.identify>`.

  For instance, consider data with different prefixes like `A_001.h5`, `A_002.h5`, `B_001.h5`, etc. stored in the same directory. Here, we can't uniquely infer the file path from the sequence number alone. Instead, {meth}`identify <erlab.io.dataloader.LoaderBase.identify>` can be implemented to take an additional `prefix` argument to eliminate the ambiguity, after which `A_001.h5` can be loaded with `erlab.io.load(1, prefix="A")`.

  If there are multiple file scans in this setup like `A_001_S001.h5`, `A_001_S002.h5`, etc., we would want to pass the `prefix` parameter to {meth}`load <erlab.io.dataloader.LoaderBase.load>` from an identifier given as a file name. This is where the second return value of {meth}`infer_index <erlab.io.dataloader.LoaderBase.infer_index>` comes in handy, where you can return a dictionary which is passed to {meth}`load <erlab.io.dataloader.LoaderBase.load>`.

  For an example of this, see the implementation of {class}`erlab.io.plugins.erpes.ERPESLoader`.

- If you have implemented a new loader or have improved an existing one, consider contributing it to the ERLabPy project by opening a pull request. We are always looking for new loaders to support more experimental setups! See more about contributing [here](../contributing).
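
The `prefix` tip above can be sketched with two standalone functions. The naming scheme, the helper `candidate_path`, and the regular expression are assumptions made for this illustration; see the ERPES loader mentioned above for a real implementation.

```python
import re
from pathlib import PurePosixPath

# Hypothetical scheme: <prefix>_<scan>.h5 or <prefix>_<scan>_S<index>.h5
_NAME = re.compile(r"([A-Za-z]+)_(\d{3})(?:_S\d{3})?")


def candidate_path(num, data_dir, prefix):
    """Build the single-file path that `identify` would look for."""
    return PurePosixPath(data_dir) / f"{prefix}_{num:03d}.h5"


def infer_index(name):
    """Recover the scan number and the extra `prefix` keyword from a file name."""
    match = _NAME.fullmatch(name)
    if match is None:
        return None, None
    return int(match.group(2)), {"prefix": match.group(1)}
```

With these definitions, the bare name `A_001_S002` yields sequence number 1 together with `{"prefix": "A"}`, which is the information needed to find the remaining files of the same scan.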


Don't forget to clean up the temporary directory!

In [None]:
tmp_dir.cleanup()