Reading & writing data
======================

Reading data
------------

```python
import xarray as xr

data = xr.open_dataarray("path/to/wave.ibw")
```

```python
data = xr.open_datatree("path/to/experiment.pxpt")
```

Along with the Igor Pro file formats, the backend also supports loading HDF5 files
exported from Igor Pro. For such files, the engine must be specified explicitly with
`engine="erlab-igor"`.

Writing data
------------
Because the state and variables of a Python interpreter are not preserved between
sessions, it is important to save your data in a format that can be easily read and
written.

While it is possible to save and load entire Python interpreter sessions using
[pickle](https://docs.python.org/3/library/pickle.html) or the more versatile
[dill](https://github.com/uqfoundation/dill), doing so is outside the scope of this guide.
Instead, we recommend saving your data in a format that is easy to read and write, such
as HDF5 or NetCDF. To save and load xarray objects to such formats, see the xarray
documentation on [I/O operations](https://docs.xarray.dev/en/stable/user-guide/io.html).
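
As a minimal sketch of such a round trip, the example below writes a small, made-up
`DataArray` to a NetCDF file and reads it back with xarray's `to_netcdf` and
`load_dataarray`. The array contents, names, and file name are placeholders, and the
example assumes that at least one NetCDF backend (such as `h5netcdf`, `netCDF4`, or
`scipy`) is installed.

```python
import pathlib
import tempfile

import numpy as np
import xarray as xr

# Placeholder data standing in for a real measurement
original = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("alpha", "eV"),
    coords={"alpha": [-1.0, 1.0], "eV": [-0.2, -0.1, 0.0]},
    name="intensity",
    attrs={"sample_temp": 20.0},
)

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "example.nc"
    # No engine is given, so xarray picks one of the installed backends
    original.to_netcdf(path)
    # load_dataarray reads the values into memory and closes the file
    restored = xr.load_dataarray(path)

# Values, dimensions, and coordinates should survive the round trip
print(restored.equals(original))
```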

Loading ARPES data
------------------

ERLabPy's data loading framework consists of various plugins, or *loaders*, each
designed to load data from a different beamline or laboratory. Each *loader* is a class
that has a `load` method which takes a file path or sequence number and returns data.

Let's see the list of loaders available by default:

In [None]:
import erlab.io

erlab.io.loaders

In [None]:
%config InlineBackend.figure_formats = ["svg", "pdf"]
import matplotlib.pyplot as plt
import xarray as xr

plt.rcParams["figure.dpi"] = 96
plt.rcParams["image.cmap"] = "viridis"

xr.set_options(display_expand_data=False)
nb_execution_mode = "cache"

In [None]:
erlab.io.loaders["merlin"]
erlab.io.loaders["bl403"]
erlab.io.loaders.merlin
erlab.io.loaders.bl403

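The `load` method accepts an `identifier`, which is either a file path or a sequence
number, and an optional `data_dir`:

- If `identifier` is a sequence number, `data_dir` must be provided.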

- If `identifier` is a string and `data_dir` is provided, the path is constructed by
  joining `data_dir` and `identifier`.

- If `identifier` is a string and `data_dir` is not provided, `identifier` should be a
  valid path to a file.

Suppose we have data from the ALS beamline 4.0.3 stored as `/path/to/data/f_001.pxt`,
`/path/to/data/f_002.pxt`, etc. To load `f_001.pxt`, all three of the following are
valid:

```python
loader = erlab.io.loaders["merlin"]

loader.load("/path/to/data/f_001.pxt")
loader.load("f_001.pxt", data_dir="/path/to/data")
loader.load(1, data_dir="/path/to/data")
```
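
The three rules above can be sketched with a small stand-alone function. Note that
`resolve_path` is a hypothetical helper written for illustration and is not ERLabPy's
implementation; in particular, the `f_###.pxt` naming scheme used for sequence numbers
is an assumption, since each real loader defines its own way of locating files.

```python
from pathlib import Path


def resolve_path(identifier, data_dir=None):
    """Hypothetical helper mirroring the three rules (not ERLabPy code)."""
    if isinstance(identifier, int):
        # Rule 1: a sequence number cannot be resolved without a directory
        if data_dir is None:
            raise ValueError("data_dir is required for a sequence number")
        # Assumed naming scheme; real loaders define their own
        return Path(data_dir) / f"f_{identifier:03d}.pxt"
    if data_dir is not None:
        # Rule 2: join the directory and the file name
        return Path(data_dir) / identifier
    # Rule 3: the string must already be a valid path
    return Path(identifier)


calls = [
    ("/path/to/data/f_001.pxt", None),
    ("f_001.pxt", "/path/to/data"),
    (1, "/path/to/data"),
]
# All three calls resolve to the same file
for identifier, data_dir in calls:
    print(resolve_path(identifier, data_dir))
```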

```python
erlab.io.set_loader("merlin")
erlab.io.set_data_dir("/path/to/data")
data_1 = erlab.io.load(1)
data_2 = erlab.io.load(2)
```

The loader and data directory can also be controlled with a context manager:

```python
with erlab.io.loader_context("merlin", data_dir="/path/to/data"):
    data_1 = erlab.io.load(1)
```

For setups like the ALS beamline 4.0.3, some scans are stored over multiple files like
`f_003_S001.pxt`, `f_003_S002.pxt`, and so on. In this case, the loader will
automatically concatenate all files in the same scan. For example, *all of the
following* will return the same concatenated data:

```python
erlab.io.load(3)
erlab.io.load("f_003_S001.pxt")
erlab.io.load("f_003_S002.pxt")
```

```python
erlab.io.load("f_003_S001.pxt", single=True)
```

To suppress automatic concatenation, pass `combine=False`. The following code returns a
list of DataArrays, one per file:
```python
erlab.io.load(3, combine=False)
```
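
A list of cuts like this can also be combined by hand. The sketch below uses made-up
arrays in place of loaded files and stacks them with `xr.concat` along a scalar `beta`
coordinate. It illustrates the general idea only and is not necessarily how the loader
combines files internally.

```python
import numpy as np
import xarray as xr

# Three made-up cuts, each tagged with a scalar "beta" coordinate
cuts = [
    xr.DataArray(
        np.full((2, 3), float(i)),
        dims=("alpha", "eV"),
        coords={"alpha": [-1.0, 1.0], "eV": [-0.2, -0.1, 0.0], "beta": beta},
    )
    for i, beta in enumerate([2.0, 3.0, 4.0])
]

# Stack the cuts along a new "beta" dimension built from the scalar coordinate
combined = xr.concat(cuts, dim="beta")

print(combined.sizes)
print(combined["beta"].values)
```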

```python
with erlab.io.loader_context("merlin", data_dir="/path/to/data"):
    data = erlab.io.load(identifier)
```

```python
erlab.io.set_loader("merlin")
erlab.io.summarize("/path/to/data")
```
If the path is not specified, the current data directory is used.

Implementing a data loader plugin
---------------------------------

In [None]:
import os

import pandas as pd

from erlab.io.dataloader import LoaderBase


class MyLoader(LoaderBase):
    name = "my_loader"
    aliases = None
    name_map = {}
    coordinate_attrs = ()
    additional_attrs = {"information": "any metadata you want to load with the data"}
    skip_validate = False
    always_single = True

    def identify(self, num, data_dir):
        file = os.path.join(data_dir, f"data_{str(num).zfill(4)}.csv")
        return [file], {}

    def load_single(self, file_path, without_values=False):
        return pd.read_csv(file_path).to_xarray()

Here, the `without_values` argument to `load_single` is unused; it will be explained later.

In [None]:
erlab.io.loaders

In [None]:
erlab.io.loaders["my_loader"]

In [None]:
import csv
import datetime
import tempfile

import numpy as np

import erlab.io
from erlab.io.exampledata import generate_data_angles


def make_data(beta=5.0, temp=20.0, hv=50.0, bandshift=0.0):
    data = generate_data_angles(
        shape=(250, 1, 300),
        angrange={"alpha": (-15, 15), "beta": (beta, beta)},
        hv=hv,
        configuration=1,
        temp=temp,
        bandshift=bandshift,
        assign_attributes=False,
        seed=1,
    ).T

    # Rename coordinates. The loader must rename them back to the original names.
    data = data.rename(
        {
            "alpha": "ThetaX",
            "beta": "Polar",
            "eV": "BindingEnergy",
            "hv": "PhotonEnergy",
            "xi": "Tilt",
            "delta": "Azimuth",
        }
    )
    dt = datetime.datetime.now()

    # Assign some attributes that real data would have
    data = data.assign_attrs(
        {
            "LensMode": "Angular30",  # Lens mode of the analyzer
            "SpectrumType": "Fixed",  # Acquisition mode of the analyzer
            "PassEnergy": 10,  # Pass energy of the analyzer
            "UndPol": 0,  # Undulator polarization
            "Date": dt.strftime(r"%d/%m/%Y"),  # Date of the measurement
            "Time": dt.strftime("%I:%M:%S %p"),  # Time of the measurement
            "TB": temp,
            "X": 0.0,
            "Y": 0.0,
            "Z": 0.0,
        }
    )
    return data


# Create a temporary directory
tmp_dir = tempfile.TemporaryDirectory()

# Define coordinates for the scan
beta_coords = np.linspace(2, 7, 10)

# Generate and save cuts with different beta values
for i, beta in enumerate(beta_coords):
    data = make_data(beta=beta, temp=20.0, hv=50.0)
    filename = f"{tmp_dir.name}/data_001_S{str(i + 1).zfill(3)}.h5"
    data.to_netcdf(filename, engine="h5netcdf")

# Write scan coordinates to a csv file
with open(f"{tmp_dir.name}/data_001_axis.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["Index", "Polar"])

    for i, beta in enumerate(beta_coords):
        writer.writerow([i + 1, beta])

# Generate some cuts with different band shifts
for i in range(4):
    data = make_data(beta=5.0, temp=20.0, hv=50.0, bandshift=-i * 0.05)
    filename = f"{tmp_dir.name}/data_{str(i + 2).zfill(3)}.h5"
    data.to_netcdf(filename, engine="h5netcdf")

Now, we have generated a folder that resembles typical data from an ARPES experiment. Let's list the contents of the folder:

In [None]:
sorted(os.listdir(tmp_dir.name))

Each HDF5 file represents a single ARPES cut. Together, the files `data_001_S001.h5` to
`data_001_S010.h5` represent an ARPES map with 10 cuts, with the scan axis recorded in
`data_001_axis.csv`. Let's check what the raw data looks like.

In [None]:
xr.load_dataarray(f"{tmp_dir.name}/data_002.h5")

The data has been properly loaded, but the coordinates and attributes have names that
are specific to the beamline.

Our loader should do three things: rename the coordinates and attributes to standard
names, add metadata to the dataset, and combine related cuts into a single DataArray
that contains the ARPES mapping.
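
As a rough illustration of the renaming step, the sketch below inverts a mapping from
standard names to beamline-specific names, using the raw names seen above. The helper
`build_rename_map` is hypothetical and not part of ERLabPy. It assumes that a standard
name may list several candidate raw names, and that finding more than one of them in
the same data is an error.

```python
def build_rename_map(name_map, present):
    """Invert {standard: raw or [raw, ...]} into {raw: standard}.

    Hypothetical helper for illustration; not part of ERLabPy.
    """
    rename = {}
    for standard, originals in name_map.items():
        if isinstance(originals, str):
            originals = [originals]
        # Keep only the candidate names that actually occur in the data
        found = [name for name in originals if name in present]
        if len(found) > 1:
            raise ValueError(f"Multiple names found for {standard!r}: {found}")
        if found:
            rename[found[0]] = standard
    return rename


name_map = {
    "eV": "BindingEnergy",
    "alpha": "ThetaX",
    "beta": ["Polar", "Polar Compens"],
}
present = {"BindingEnergy", "ThetaX", "Polar", "LensMode"}

print(build_rename_map(name_map, present))
# → {'BindingEnergy': 'eV', 'ThetaX': 'alpha', 'Polar': 'beta'}
```

The resulting dictionary has the shape expected by xarray's `rename` method, which maps
old names to new ones.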

In [None]:
import pathlib
import re

from erlab.io.dataloader import LoaderBase
from erlab.utils.misc import emit_user_level_warning


class ExampleLoader(LoaderBase):
    name = "example"

    aliases = ["Ex"]

    name_map = {
        "eV": "BindingEnergy",
        "alpha": "ThetaX",
        "beta": [
            "Polar",
            "Polar Compens",
        ],  # Can have multiple names assigned to the same name
        # If both are present in the data, a ValueError will be raised
        "delta": "Azimuth",
        "xi": "Tilt",
        "x": "X",
        "y": "Y",
        "z": "Z",
        "hv": "PhotonEnergy",
        "polarization": "UndPol",
        "sample_temp": "TB",
    }
    # Map the names of the coordinates or attributes in the resulting data to the names
    # present in the data returned by `load_single`. Note that the order of
    # non-dimension coordinates in the output data will follow the order of the keys in
    # this dictionary.

    coordinate_attrs: tuple[str, ...] = (
        "beta",
        "delta",
        "xi",
        "hv",
        "x",
        "y",
        "z",
        "polarization",
        "photon_flux",
        "sample_temp",
    )
    # Attributes to be used as coordinates. Place all attributes that we don't want to
    # lose when merging multiple file scans here.

    additional_attrs = {
        "configuration": 1,  # Experimental geometry. Required for momentum conversion
        "sample_workfunction": 4.3,
    }
    # Any additional metadata you want to add to the data. Note that attributes defined
    # here will not be transformed into coordinates. If you wish to promote some fixed
    # attributes to coordinates, add them to additional_coords.

    additional_coords = {}
    # Additional non-dimension coordinates to be added to the data, for instance the
    # photon energy for lab-based ARPES.

    always_single = False

    def identify(self, num, data_dir):
        data_dir = pathlib.Path(data_dir)

        coord_dict = {}

        # Look for scans with data_###_S###.h5, and sort them
        files = sorted(data_dir.glob(f"data_{str(num).zfill(3)}_S*.h5"))

        if len(files) == 0:
            # If no files found, look for data_###.h5
            files = sorted(data_dir.glob(f"data_{str(num).zfill(3)}.h5"))
            if len(files) > 1:
                # More than one file found with the same scan number, show warning
                emit_user_level_warning(
                    f"Multiple files found for scan {num}, using {files[0]}"
                )
                files = files[:1]
        else:
            # If scan files are found, read the coordinate values from the axis file
            axis_file = data_dir / f"data_{str(num).zfill(3)}_axis.csv"
            with axis_file.open("r") as f:
                header = f.readline().strip().split(",")

            # Load the coordinates from the csv file; ndmin=2 keeps the array
            # two-dimensional even when the file contains a single row
            coord_arr = np.loadtxt(axis_file, delimiter=",", skiprows=1, ndmin=2)

            # Each header entry will contain a dimension name
            for i, hdr in enumerate(header[1:]):
                coord_dict[hdr] = coord_arr[: len(files), i + 1].astype(np.float64)

        if len(files) == 0:
            # If no files found up to this point, return None
            return None

        return files, coord_dict

    def load_single(self, file_path, without_values=False):
        return xr.open_dataarray(file_path, engine="h5netcdf")

    def infer_index(self, name):
        # Get the scan number from file name
        try:
            scan_num: str = re.match(r".*?(\d{3})(?:_S\d{3})?", name).group(1)
        except (AttributeError, IndexError):
            return None, None

        if scan_num.isdigit():
            # The second return value, a dictionary, is reserved for more complex
            # setups. See tips below for a brief explanation.
            return int(scan_num), {}
        return None, None

Note that there are more class attributes and methods that can be inherited or
overridden to customize the loader's behavior.
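
To see what the regular expression in `infer_index` extracts, it can be tried on a few
file names in isolation. The sketch below reuses the same pattern. The wrapper
`scan_number` and the file name `notes.txt` are made up for this illustration.

```python
import re

# Same pattern as in ExampleLoader.infer_index
PATTERN = re.compile(r".*?(\d{3})(?:_S\d{3})?")


def scan_number(name):
    """Return the scan number encoded in a file name, or None if absent."""
    match = PATTERN.match(name)
    return int(match.group(1)) if match else None


for name in ["data_001_S003.h5", "data_005.h5", "notes.txt"]:
    print(name, "->", scan_number(name))
# → data_001_S003.h5 -> 1
# → data_005.h5 -> 5
# → notes.txt -> None
```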

In [None]:
erlab.io.loaders

We can see that the `example` loader has been registered. Let's test the loader by
loading and plotting some data.

In [None]:
erlab.io.set_loader("example")
erlab.io.set_data_dir(tmp_dir.name)
erlab.io.load(1)

In [None]:
erlab.io.load(5).qplot()

In [None]:
def _format_polarization(val) -> str:
    val = round(float(val))
    return {0: "LH", 2: "LV", -1: "RC", 1: "LC"}.get(val, str(val))


def _parse_time(darr: xr.DataArray) -> datetime.datetime:
    return datetime.datetime.strptime(
        f"{darr.attrs['Date']} {darr.attrs['Time']}",
        "%d/%m/%Y %I:%M:%S %p",
    )


def _determine_kind(darr: xr.DataArray) -> str:
    if "scan_type" in darr.attrs and darr.attrs["scan_type"] == "live":
        return "LP" if "beta" in darr.dims else "LXY"

    data_type = "xps"
    if "alpha" in darr.dims:
        data_type = "cut"
    if "beta" in darr.dims:
        data_type = "map"
    if "hv" in darr.dims:
        data_type = "hvdep"
    return data_type


class ExampleLoaderComplete(ExampleLoader):
    name = "example_complete"
    aliases = ["ExC"]

    formatters = {
        "polarization": _format_polarization,
        "LensMode": lambda x: x.replace("Angular", "A"),
    }

    summary_attrs = {
        "Time": _parse_time,
        "Type": _determine_kind,
        "Lens Mode": "LensMode",
        "Scan Type": "SpectrumType",
        "T(K)": "sample_temp",
        "Pass E": "PassEnergy",
        "Polarization": "polarization",
        "hv": "hv",
        "x": "x",
        "y": "y",
        "z": "z",
        "polar": "beta",
        "tilt": "xi",
        "azi": "delta",
    }

    summary_sort = "Time"

    def load_single(self, file_path, without_values=False):
        darr = xr.open_dataarray(file_path, engine="h5netcdf")

        if without_values:
            # Prevent loading values into memory
            return xr.DataArray(
                np.zeros(darr.shape, darr.dtype),
                coords=darr.coords,
                dims=darr.dims,
                attrs=darr.attrs,
            )

        return darr

    def files_for_summary(self, data_dir):
        return erlab.io.utils.get_files(data_dir, extensions=[".h5"])


erlab.io.loaders

In [None]:
erlab.io.set_loader("example_complete")
erlab.io.summarize()
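
The `Time` entry in `summary_attrs` works because `_parse_time` reads the `Date` and
`Time` attributes back with the same format strings that `make_data` used to write
them. The stand-alone sketch below checks this round trip with an arbitrary timestamp.
The constant names `DATE_FMT` and `TIME_FMT` are introduced here for illustration only.

```python
import datetime

DATE_FMT = "%d/%m/%Y"  # Format used for the "Date" attribute
TIME_FMT = "%I:%M:%S %p"  # Format used for the "Time" attribute

stamp = datetime.datetime(2024, 3, 9, 15, 4, 5)  # Arbitrary example time
attrs = {"Date": stamp.strftime(DATE_FMT), "Time": stamp.strftime(TIME_FMT)}

# Join the two attributes and parse them with the combined format string
parsed = datetime.datetime.strptime(
    f"{attrs['Date']} {attrs['Time']}", f"{DATE_FMT} {TIME_FMT}"
)

print(attrs)
print(parsed == stamp)
```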