# Data formats

Agreeing on data formats (and storage):

* Reduces time spent and errors made
* Lets everyone use common tools
* Makes results easier to reproduce
* Eases collaboration (widely used formats)

Overall, research just runs more smoothly.

Sharing data?

### Types of data:

* Main signal
    * Images
    * Spectroscopic signals (1D)
    * Spectroscopic images (2D x 1D)
    * 4D-STEM (2D x 2D)
* Auxiliary data
    * Metadata (typically as dictionary)
    * Tabular data
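
As a rough illustration of the dimensionalities listed above, the signals can be pictured as NumPy arrays with the shapes below. The sizes, the axis ordering (scan axes first) and the metadata keys are assumptions made for this sketch, not something the formats themselves enforce.

```python
import numpy as np

# Hypothetical sizes, chosen only to keep the example small.
scan_shape = (8, 8)        # probe positions (y, x)
detector_shape = (16, 16)  # detector pixels (ky, kx)
n_channels = 32            # spectral channels

image = np.zeros(detector_shape, dtype=np.float32)                     # 2D
spectrum = np.zeros(n_channels, dtype=np.float32)                      # 1D
spectrum_image = np.zeros(scan_shape + (n_channels,), dtype=np.float32)  # 2D x 1D
four_d_stem = np.zeros(scan_shape + detector_shape, dtype=np.float32)  # 2D x 2D

# Auxiliary data: metadata as a plain dictionary (keys are hypothetical).
metadata = {"beam_energy_keV": 200, "detector": "example"}

for name, array in [
    ("image", image),
    ("spectrum", spectrum),
    ("spectrum_image", spectrum_image),
    ("four_d_stem", four_d_stem),
]:
    print(name, array.shape)
```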

### Filetypes:

* Arrays (of float):
    * tif
    * npy
    * hdf5
    * zarr
    * netcdf
    * dm3/dm4
* Dictionary or tabular
    * json
    * csv

### Libraries:

* NumPy
    * npy
    * backbone of the libraries below
* scikit-image
    * tif, jpeg, png etc.
    * the tifffile library is needed to read embedded metadata
* Nionswift
    * npy, tiff, hdf5
    * json for metadata, except with hdf5
* HyperSpy
    * npy, tiff, hdf5 (hspy), zarr (zspy)
    * json for metadata, except with hdf5 and zarr
* pandas
    * csv
* Xarray
    * zarr, netcdf
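
For the tabular case, a minimal pandas round trip through csv might look like the following. The column names and values are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import io

import pandas as pd

# A small table of per-frame acquisition parameters (made-up values).
table = pd.DataFrame(
    {
        "frame": [0, 1, 2],
        "exposure_s": [0.1, 0.1, 0.2],
    }
)

# Write to csv without the index column, then read it back.
buffer = io.StringIO()
table.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```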

### Example: scan maps

In [1]:
import os
from skimage.io import imread
import dask
import dask.array as da
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter

In [2]:
dirname = "data/map_2022_12_21_15_47/"
fnames = os.listdir(dirname)

In [3]:
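# Keep only files whose names start with four digits; sort them because
# os.listdir returns entries in arbitrary order.
fnames = sorted(fname for fname in fnames if fname[:4].isdigit())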
paths = [os.path.join(dirname, fname) for fname in fnames]

In [4]:
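# Wrap each read in dask.delayed so that no file is loaded until it is needed.
lazy_images = [dask.delayed(imread)(path) for path in paths]

# from_delayed cannot inspect the files, so shape and dtype are declared up
# front and must match the images on disk.
lazy_arrays = [da.from_delayed(image, shape=(2048, 2048), dtype=np.float32) for image in lazy_images]

# Stack into one (n_images, 2048, 2048) array with one chunk per image.
lazy_stack = da.stack(lazy_arrays, axis=0)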

In [5]:
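images = lazy_stack[:]

# Each block holds a single image; sigma=0 along the stack axis keeps the
# blur within each image.
filtered_images = images.map_blocks(gaussian_filter, sigma=(0, 10, 10))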

In [6]:
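# dirname ends with "/", so strip it to write the store next to the data
# directory ("data/map_2022_12_21_15_47.zarr") instead of inside it.
images.to_zarr(dirname.rstrip("/") + ".zarr", overwrite=True)

Writing to `dirname + ".zarr"`, which resolves to a `.zarr` folder inside the data directory, failed on Windows with:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'D:\\Dropbox\\pnm_group_retreat\\data\\map_2022_12_21_15_47\\.zarr\\382.0.0.10d1286e865248498c4289163d4c9c2e.partial'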

In [None]:
plt.imshow(filtered_images[50])