
SwissFEL Data Files

This module provides an easy way of dealing with SwissFEL data files.

Introduction

The SwissFEL DAQ writes HDF5 data files into a folder structure defined by the instrument (Alvra, Bernina, Cristallina, Diavolezza, Maloja, Furka) and the experiment pgroup (of the form p12345):

/sf/{instrument}/data/{pgroup}/raw/...

In this folder, user-defined subfolders are created, and files are named by a run number and distinguished by a data type (camera images, beam-synchronous data, detector images, EPICS PV data):

.../{user_defined}/run_{number}.{data_type}.h5

For scans, an additional JSON file is written, which describes the scan steps:

.../scan_info/{user_defined}.json
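
For illustration, a scan at Bernina might comprise files like the following (the subfolder and run number here are made up):

/sf/bernina/data/p12345/raw/sample1/run_000041.BSREAD.h5
/sf/bernina/data/p12345/raw/sample1/run_000041.CAMERA.h5
/sf/bernina/data/p12345/raw/scan_info/a_scan.json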

This module simplifies dealing with the different data types, scan data, missing pulses, memory limitations for very large datasets, timing offsets, data format conversion, etc.

Usage example

The following example illustrates the general workflow:

from matplotlib import pyplot as plt
from sfdata import SFDataFiles

with SFDataFiles("/sf/instrument/data/p12345/raw/something/run_000041.*.h5") as data:
    subset = data["SIGNAL_CHANNEL", "BACKGROUND_CHANNEL"]
    subset.drop_missing() # match pulse IDs across both channels
    pids = subset["SIGNAL_CHANNEL"].pids
    sig = subset["SIGNAL_CHANNEL"].data
    bkg = subset["BACKGROUND_CHANNEL"].data

norm = sig - bkg

plt.plot(pids, norm)
plt.show()

Open (and close) files

The main entry point is the SFDataFiles class:

from sfdata import SFDataFiles

which itself is an SFData subclass that handles HDF5 files.

Files can be opened as contexts:

with SFDataFiles("run_000041.BSREAD.h5") as data:
    do_something_with(data)

in which case they do not need to be closed manually. Alternatively, they can be assigned to a variable

data = SFDataFiles("run_000041.BSREAD.h5")
do_something_with(data)
data.close()

in which case they should be closed explicitly at the end.

Besides a single filename as shown above, SFDataFiles also accepts several filenames and/or filenames with wildcards (*, ?), in which case the channels of all given files are merged into a single SFData object:

SFDataFiles("run_000041.*.h5")
SFDataFiles("run_000041.BSREAD.h5", "run_000041.CAMERA.h5")

Note that if a channel occurs in several files, only the last instance will be available in the SFData object. Channels will not be appended along the pulse ID axis. Thus, currently, it only makes sense to open files from one run at the same time.

SFDataFiles is a convenience wrapper which internally creates one SFDataFile (note the missing s) object for each given filename. SFDataFile works identically to SFDataFiles but accepts only a single filename as its argument.

Channels

A list of available channels can be viewed via

data.names

The channels within the HDF5 files are represented by the SFChannel class and can be retrieved from the SFData object like from a dictionary:

ch = data["SLAAR11-LTIM01-EVR0:DUMMY_PV1_NBS"]

Note that, here, the channel name can be tab-completed in IPython or Jupyter.

Regular access

The pulse IDs, data contents and timestamps can be accessed via

ch.pids
ch.data
ch.timestamps

which reads the full arrays from the HDF5 file at once (note that this is currently not cached!). In most cases, this will be the preferred way of reading data.

Mimicking numpy arrays, the following attributes are available:

ch.shape
ch.dtype # data type of the individual elements
ch.ndim  # number of dimensions: len(shape)
ch.size  # number of elements: prod(shape)
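
These attributes also allow estimating the memory footprint of the full data array before reading it. A minimal sketch, assuming ch.dtype is a numpy dtype (the 1 GB threshold is arbitrary):

nbytes = ch.size * ch.dtype.itemsize # estimated size of the full array in bytes
if nbytes < 1e9: # fits comfortably in memory
    arr = ch.data # read everything at once
else:
    ... # too large: use the batched access described below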

Furthermore, slices of valid data can be accessed via numpy array-slicing syntax on the channel object:

ch[:100, 200:300, 400:500] # read only a 100x100 ROI of the first 100 images

Note that this is not identical to

ch.data[:100, 200:300, 400:500] # read all data, then apply the slice

Access in batches

If the full data array is too large to be held in memory at once (which is mainly a concern for camera images), it can be read in batches of valid entries instead:

for indices, batch in ch.in_batches():
    for image in batch:
        do_something_with(image)

To adjust the memory consumption, the batching method accepts the batch size via the size argument:

SFChannel.in_batches(size=100)

Similarly, the number of batches can be limited via the n argument, e.g., for faster debugging of further processing steps:

SFChannel.in_batches(n=3)

Batching yields indices, the current index slice within the whole valid data, and batch, a numpy array containing the current batch of valid data.

In most cases, a reducing operation is applied to the data, and the result is stored in an array whose first axis corresponds to the valid pulse IDs. For this, indices can be put to use. A simple example is summing over each image individually in order to get one intensity value per pulse:

import numpy as np

inten = np.empty(ch.nvalid)
for indices, batch in ch.in_batches():
    inten[indices] = batch.sum(axis=(1, 2))

To simplify this workflow, the convenience method apply_in_batches is available:

def proc(batch):
    return batch.sum(axis=(1, 2))

inten = ch.apply_in_batches(proc)

It should be noted that the processor function does not need to return a 1D array. If there are nvalid entries in the channel and a single processed entry is of the shape single_shape, the result will be of the shape (nvalid, *single_shape).
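
As an illustration of the shape handling (the 512x1024 image shape here is made up), a processor that reduces each camera image to a 1D profile:

def row_projection(batch):
    return batch.sum(axis=2) # collapse each image to a 1D profile

profiles = ch.apply_in_batches(row_projection)
# for 512x1024 images, a single processed entry has the shape (512,),
# so profiles has the shape (ch.nvalid, 512)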

Finally, if the pulse IDs for each batch are needed, the following pattern can be used:

all_pids = ch.pids
for indices, batch in ch.in_batches():
    current_pids = all_pids[indices]

Access via datasets

In case the underlying HDF5 datasets need to be accessed, e.g., for reading only specific parts of the data, channels have a datasets namespace attached:

ch.datasets.pids
ch.datasets.data
ch.datasets.timestamps

In order to actually read the data from the file in this case, standard HDF5 syntax applies:

ch.datasets.pids[:]
ch.datasets.data[100:200]

Special channels

Jungfrau data

If the jungfrau_utils module is installed, its File class will be used automatically to open Jungfrau data files (*.JF*.h5). Instead of SFChannel, these channels are represented by the SFChannelJF class. The difference should be fully transparent to the user, as both channel types behave identically when subsetting, slicing, etc.

The valid entries for Jungfrau data are initialized from the dataset is_good_frame in the respective data file. This dataset is also taken into account when SFChannelJF.reset_valid() is called.

Jungfrau data files do not contain timestamps. Thus, both ch.timestamps and ch.datasets.timestamps are None for SFChannelJF objects.
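
A minimal sketch (the file and channel names below are made up for illustration):

from sfdata import SFDataFiles

with SFDataFiles("run_000041.JF01T03V01.h5") as data:
    ch = data["JF01T03V01"] # an SFChannelJF if jungfrau_utils is installed
    first = ch[:10] # slicing works the same as for any other channel
    print(ch.timestamps) # None: Jungfrau files do not contain timestamps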

File system meta information

All classes directly tied to a single file (SFDataFile, SFChannel* and SFScanInfo) have a member .fs that gives direct access to meta information from the file system:

  • access, change, modification time as datetime object: .fs.atime, .fs.ctime, .fs.mtime
  • filename as string: .fs.name
  • size in bytes: .fs.size
  • group and owner as readable name (will be the pgroup and root for data files): .fs.group, .fs.owner
  • the file as pathlib.Path object: .fs.path
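
For example, for a data file:

from sfdata import SFDataFile

with SFDataFile("run_000041.BSREAD.h5") as data:
    print(data.fs.name) # filename as string
    print(data.fs.size) # size in bytes
    print(data.fs.mtime) # modification time as datetime object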

Subsets

Subsets of the data can be accessed by giving several channel names

subset = data["SLAAR11-LTIM01-EVR0:DUMMY_PV1_NBS", "SLAAR11-LTIM01-EVR0:DUMMY_PV2_NBS"]

which returns an SFData object that contains only the specified channels. SFData works identically to SFDataFile(s). Specifically, further subsets can be created from a subset. All subsets are real subsets of the original data. This means, in particular, that data from a subset also has to be read within the file context (created by the with statement) or before the files are closed.

Statistics

For an overview of which channel has how many missing shots, SFData has a print_stats method:

from sfdata import SFDataFile

channels = (
    "SAR-CVME-TIFALL5:EvtSet",
    "SARES11-SPEC125-M1.roi_background_x_profile",
    "SLAAR11-LTIM01-EVR0:DUMMY_PV1_NBS",
    "SLAAR11-LTIM01-EVR0:DUMMY_PV2_NBS"
)

with SFDataFile("run_000041.BSREAD.h5") as data:
    subset = data[channels]
    subset.print_stats(show_complete=True)

This results in output like this:

[Screenshot: example print_stats output]

The statistics overview can also be directly accessed via the included command line tool sfdstats.

Drop missing pulses

For correlating channels, pulse IDs that are not available in all channels need to be removed. This can be achieved via

subset.drop_missing()

resulting in data where data points that belong to the same pulse are matched. Internally, this is handled by updating each channel's .valid attribute, which can be a boolean index or a list of coordinates (indices) within the datasets (note: due to a limitation of h5py, a boolean index only works for 1D datasets; for 2D or more, only the list of coordinates works!).

The valid marker is per channel, and subsets are real subsets of the original data. Thus, valid markers set on a subset are also set for the larger parent data.

In case all .drop_missing() operations need to be reverted, both SFChannel and SFData have a .reset_valid() method (where the latter loops over the former). These reset the valid marker(s) to all pulse IDs that are in the respective underlying dataset. Note that each .drop_missing() calls .reset_valid() before calculating the new valid marker.
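
A minimal sketch of this round trip:

subset.drop_missing() # keep only pulse IDs that are available in all channels
do_something_with(subset)
subset.reset_valid() # revert to all pulse IDs in the underlying datasets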

Channels with timing offsets

In case one of the channels has a timing offset (i.e., along the pids axis), the .offset attribute can be used to correct for it:

with SFDataFiles("/sf/instrument/data/p12345/raw/something/run_000041.*.h5") as data:
    subset = data["SIGNAL_CHANNEL", "BACKGROUND_CHANNEL"]
    ch_sig = subset["SIGNAL_CHANNEL"]
    ch_bkg = subset["BACKGROUND_CHANNEL"]

    ch_bkg.offset = 1 # channel is delayed by one pid
    subset.drop_missing() # takes offset into account

    sig = ch_sig.data
    bkg = ch_bkg.data
    norm = sig - bkg

Convert to other data formats

For more complex treatment of missing pulse IDs, e.g., imputation, SFData can be converted to pandas DataFrames or xarray Datasets.

Convert to pandas DataFrame

df = subset.to_dataframe()

This way, missing entries will be marked as NaNs, and can be dealt with via, e.g., dropna() or fillna().

Note: NaN is only defined for floats. In order to use NaNs as a missing-value marker for all data types (specifically, also for integers or booleans), the created dataframe has the dtype object. After dealing with the NaNs, the dtypes can be corrected using the infer_objects method.
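
A typical sequence might therefore look like this (whether dropna or fillna is appropriate depends on the analysis):

df = subset.to_dataframe() # object dtype, missing entries marked as NaN
df = df.dropna() # or, e.g., df.fillna(0)
df = df.infer_objects() # recover concrete dtypes from the object columns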

As an alternative to object dataframes, pandas provides a few (currently mostly experimental) extension types to deal with missing data for non-floats, specifically integers and booleans. Note that these use pandas.NA for missing values instead of numpy.nan, which are equivalent for most purposes. In order to get dataframes using these dtypes, to_dataframe has an as_nullable keyword argument (defaulting to False).

Furthermore, dataframe columns can natively only hold 1D data. However, with the object dtype, numpy arrays of larger dimensionality can be stored. A few caveats apply (for instance, df1.equals(df2) will not work due to arr1 == arr2 returning an array of booleans), and it might be easier to deal with regular lists instead of arrays in these cases. Thus, there is a switch as_lists (which defaults to False) to enable a conversion to nested regular lists before insertion. Depending on the use case, these lists might need to be converted back to arrays when taken out of the dataframe.
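
For reference, the two switches discussed above:

df = subset.to_dataframe(as_nullable=True) # use pandas' nullable integer/boolean dtypes
df = subset.to_dataframe(as_lists=True) # store entries with more than 1D as nested lists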

Convert to xarray Dataset

ds = subset.to_xarray()

This way, missing entries will be marked as NaNs, and can be dealt with via, e.g., dropna() or fillna(). For dropna(), the dim parameter is most likely the "pids" axis:

ds.dropna("pids", ...)

Scans

For conveniently working with data from scans, which consist of several steps, each with its own set of data files, SFScanInfo can be used:

from matplotlib import pyplot as plt
from sfdata import SFScanInfo

scan = SFScanInfo("/sf/instrument/data/p12345/raw/scan_info/a_scan.json")
xs = scan.readbacks
ys = []
for step in scan:
    # step is a SFDataFiles object
    subset = step["SIGNAL_CHANNEL", "BACKGROUND_CHANNEL"]
    ...
    norm = sig - bkg
    ys.append(norm)

plt.plot(xs, ys)
plt.show()

Since step is an SFDataFiles object, the usage example from above can be followed at the point where ... is given here.

SFScanInfo gives access to the other contents of the JSON file via attributes. Specifically, values and readbacks are worth mentioning here, as they come already converted to numpy arrays.
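
For example:

from sfdata import SFScanInfo

scan = SFScanInfo("/sf/instrument/data/p12345/raw/scan_info/a_scan.json")
print(scan.values) # the set values for each scan step, as a numpy array
print(scan.readbacks) # the measured readback values, as a numpy array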

Finally, it should be noted that the iteration simply skips over steps that do not contain files that can be opened (it will still print warnings, which can be silenced in the usual way). This simplifies plotting preliminary data from scans that are still running, or from finished scans with broken files. Therefore, the following pattern is probably more versatile than the previous example:

import numpy as np
from matplotlib import pyplot as plt
from sfdata import SFScanInfo

scan = SFScanInfo("/sf/instrument/data/p12345/raw/scan_info/a_scan.json")
xs = scan.readbacks
ys = np.zeros_like(xs) # or np.full_like(xs, np.nan)
for i, step in enumerate(scan):
    # step is a SFDataFiles object
    subset = step["SIGNAL_CHANNEL", "BACKGROUND_CHANNEL"]
    ...
    norm = sig - bkg
    ys[i] = norm

plt.plot(xs, ys)
plt.show()
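
As a side note regarding the warnings mentioned above: assuming they are emitted via Python's standard warnings mechanism, they can be silenced like this (a generic sketch, not specific to sfdata):

import warnings

with warnings.catch_warnings():
    warnings.simplefilter("ignore") # suppress warnings, e.g., from skipped scan steps
    for i, step in enumerate(scan):
        ...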
