# Intro to the Advanced Scientific Data Format (ASDF)


## Outline

- Why a new data format?
- ASDF Standard features
- Who uses it?
- Working with existing ASDF files
  - Read a file
  - Show the contents of an ASDF file
  - Search for an attribute in an ASDF file
  - Accessing metadata and data
  - Modifying and saving files
  - Exercise
  - *Adding History items*
  - *Compression*
- *Command line interface*


### The Need For a New Data Format

What's wrong with FITS?

  FITS served the astronomical community very well for many years. However, with the advance of new instrumentation, the development of new algorithms, and the growth in data size and volume, it has become more of a hindrance than a help. The issues with FITS are documented in a paper by B. Thomas et al. (Learning from FITS: Limitations in use in modern astronomical research, Astron. Comput. (2015), 10.1016/j.ascom.2015.01.009, arXiv:1502.00996v2).
  
  The specific motivation for developing the standard was that the FITS WCS conventions proved basically unusable for raw HST data, which included complex distortion models and required high accuracy. The experience with HST showed that those conventions would not work with the much more complex JWST WCS transforms.
  

### Main Features of ASDF


- It has a hierarchical metadata structure, made up of basic dynamic data types such as strings, numbers, lists and mappings.

- Attribute names and values are not constrained by size as is the case for FITS header cards.

- It has human-readable metadata that can be edited directly in place in the file.

- The structure of the files can be automatically validated using associated schema files.

- It’s designed for extensibility: new conventions may be used without breaking backward compatibility with tools that do not understand those conventions. Versioning systems are used to prevent conflicts with alternative conventions.

- The binary array data (when compression is not used) is a raw memory dump, and techniques such as memory mapping can be used to efficiently access it.

- It is possible to read and write the file as a stream, without requiring random access.

- It’s built on top of industry standards, such as YAML and JSON Schema, to take advantage of a larger community working on the core problems of data representation. This also makes it easier to support ASDF in new programming languages and environments by building on top of existing libraries.

- Since every ASDF file has the version of the specification to which it is written, it will be possible, through careful planning, to evolve the ASDF format over time, allowing for files that use new features while retaining backward compatibility with older tools.
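
The memory-mapping point above can be illustrated with a rough sketch that uses only the Python standard library. It writes 30 little-endian float64 values to a scratch file as a stand-in for an uncompressed binary block (this is not a real ASDF file, and the helper name `read_element` is made up for this illustration), then reads a single element through a memory map without loading the rest.

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.Struct("<d")  # one little-endian float64


def read_element(path, index, offset=0):
    """Return element `index` of a raw float64 dump starting at byte `offset`."""
    with open(path, "rb") as fh:
        # Map the file read-only; the OS fetches pages on demand.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return ITEM.unpack_from(mm, offset + index * ITEM.size)[0]


# Stand-in for an uncompressed block: a flat dump of 30 values (a 5x6 array).
values = [float(i) for i in range(30)]
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(struct.pack("<30d", *values))
    raw_path = tmp.name

# Row-major indexing: element (row, col) sits at row * ncols + col.
row, col, ncols = 2, 3, 6
element = read_element(raw_path, row * ncols + col)
os.remove(raw_path)
print(element)
```

In practice the **asdf** library takes care of locating the block and interpreting its datatype and shape; the sketch only shows why a raw dump makes this kind of access cheap.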

### Implementation Status

The current version of the standard is **1.5.0**.

There is a Python reference implementation that supports the standard; the library is called **asdf**.

Future plans include a C/C++ implementation (a partial implementation exists) and possibly an IDL implementation.


### Who Uses ASDF?

- The JWST calibration pipeline uses data models based on ASDF to abstract out the serialization format. The WCS describing the unresampled data is serialized using ASDF.

- ASDF will be the data format for the Nancy Grace Roman Space Telescope. 

- The Daniel K. Inouye Solar Telescope (DKIST) is using ASDF for serializing the World Coordinate System.

- The Vera C. Rubin Observatory uses ASDF as a WCS exchange format.

- There are other non-institutional projects using it in astronomy and other fields. 

### Anatomy of an ASDF file

ASDF is a hybrid text and binary format. The general layout of the file is:

- header
- tree (optional): the tree is a dictionary. Most Python types can be serialized directly, using YAML, as {key: value} pairs in the tree.
- binary blocks (optional)
- binary block index (optional)

The header, tree and block index are text, while the blocks are raw binary.
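
Because the header is plain text, it can be inspected without any ASDF library. The sketch below assumes the header starts with a `#ASDF <version>` line followed by a `#ASDF_STANDARD <version>` line, as in the file listing later in this tutorial; the function name `read_asdf_versions` is hypothetical and the sample bytes are a hand-written stand-in for a real file.

```python
import io


def read_asdf_versions(stream):
    """Return (file_format_version, standard_version) from an ASDF byte stream."""
    first = stream.readline().decode("ascii").strip()
    if not first.startswith("#ASDF "):
        raise ValueError("not an ASDF file: missing '#ASDF' line")
    file_format_version = first.split(" ", 1)[1].strip()

    # The standard version is treated as optional in this sketch.
    second = stream.readline().decode("ascii").strip()
    standard_version = None
    if second.startswith("#ASDF_STANDARD "):
        standard_version = second.split(" ", 1)[1].strip()
    return file_format_version, standard_version


# Hand-written stand-in for the first bytes of a file on disk.
sample = io.BytesIO(b"#ASDF 1.0.0\n#ASDF_STANDARD 1.5.0\n%YAML 1.1\n--- \n...\n")
versions = read_asdf_versions(sample)
print(versions)
```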



### Reading an ASDF file

The Python ASDF library is a standalone package distributed through PyPI and conda-forge.


In [1]:
import asdf

To open a file, use the **open** function. It is useful to look up the keyword arguments it accepts; there are many options specifying how a file should be opened or validated during opening. For this example we will use the default behavior and look at the object.

In [2]:
af = asdf.open('example.asdf')

In [3]:
print(af)

<asdf.asdf.AsdfFile object at 0x7f8ae82a93d0>


In [4]:
af.tree

{'asdf_library': {'author': 'The ASDF Developers',
  'homepage': 'http://github.com/asdf-format/asdf',
  'name': 'asdf',
  'version': '2.11.2.dev13+gf9aeb247'},
 'history': {'extensions': [{'extension_class': 'asdf.extension.BuiltinExtension',
    'software': {'name': 'asdf', 'version': '2.11.2.dev13+gf9aeb247'}},
   {'extension_class': 'asdf.extension._manifest.ManifestExtension',
    'extension_uri': 'asdf://asdf-format.org/transform/extensions/transform-1.5.0',
    'software': {'name': 'asdf-astropy', 'version': '0.2.1'}}]},
 'data': <array (unloaded) shape: [5, 6] dtype: float64>,
 'meta': {'date': '2022-06-01T16:34:22.059',
  'instrument': {'detector': 'NRCA', 'filter': 'FW100W', 'name': 'NIRCAM'},
  'model1': <Chebyshev2D(1, 1, c0_0=0.1, c1_0=0.2, c0_1=0.3, c1_1=0.4)>,
  'model2': <Chebyshev2D(1, 1, c0_0=0.1, c1_0=0.2, c0_1=0.3, c1_1=0.4)>,
  'telescope': 'JWST'}}

### Getting information about a file

There are two functions that allow introspecting a file, **info** and **search**. They are available as methods on the **AsdfFile** object or on the command line. Both are configurable through multiple parameters.


In [5]:
af.info?

In [6]:
af.search?

In [7]:
af.info()

root (AsdfObject)
├─asdf_library (Software)
│ ├─author (str): The ASDF Developers
│ ├─homepage (str): http://github.com/asdf-format/asdf
│ ├─name (str): asdf
│ └─version (str): 2.11.2.dev13+gf9aeb247
├─history (dict)
│ └─extensions (list)
│   ├─[0] (ExtensionMetadata)
│   │ ├─extension_class (str): asdf.extension.BuiltinExtension
│   │ └─software (Software) ...
│   └─[1] (ExtensionMetadata) ...
├─data (NDArrayType): shape=(5, 6), dtype=float64
└─meta (dict)
  ├─date (str): 2022-06-01T16:34:22.059
  ├─instrument (dict)
  │ ├─detector (str): NRCA
  │ ├─filter (str): FW100W
  │ └─name (str): NIRCAM
  ├─model1 (Chebyshev2D)
  ├─model2 (Chebyshev2D)
  └─telescope (str): JWST

ASDF is a human readable format, so let's look at the file on disk. There are several things worth pointing out.


- An ASDF file has a header which records the version of the ASDF Standard used to write out the file.

- The information about the instrument configuration is stored in one self-contained section.

- The data array is listed as "unloaded" (shown above in the Python tree). By default asdf uses lazy loading when opening files. Arrays are loaded into memory only when accessed. This behaviour can be changed through a parameter.

- The description of the data array is within the tree but the binary block and the binary block index are at the end of the file.

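- The "date" attribute is stored in ISOT format. In this example it was written as a plain string, so it is read back as a string (**info** above reports it as *str*). An astropy Time object written to the tree would instead be recreated as a Time object when the file is read in with the Python library.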

- (De)serializing astropy models works in the same way: a model is created in memory when the file is read in.

- When the same object is serialized more than once, it is not duplicated in the file. Rather, a reference to it is created using YAML anchors. In this example **&id002** marks the definition of the Chebyshev2D model. ***id002** is serialized as attribute *model2* and is a reference to the original definition of the model.

In [None]:
#!less example.asdf
'''
#ASDF 1.0.0                              
#ASDF_STANDARD 1.5.0                    
%YAML 1.1                         
%TAG ! tag:stsci.edu:asdf/                 
--- !core/asdf-1.1.0                       
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
  name: asdf, version: 2.11.2.dev13+gf9aeb247}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension.BuiltinExtension
    software: !core/software-1.0.0 {name: asdf, version: 2.11.2.dev13+gf9aeb247}
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension._manifest.ManifestExtension
    extension_uri: asdf://asdf-format.org/transform/extensions/transform-1.5.0
    software: !core/software-1.0.0 {name: asdf-astropy, version: 0.2.1}
data: !core/ndarray-1.0.0
  source: 0
  datatype: float64
  byteorder: little
  shape: [5, 6]
meta:
  date: '2022-05-31T13:29:12.748'
  instrument: {detector: NRCA, filter: FW100W, name: NIRCAM}
  model1: &id002 !transform/ortho_polynomial-1.0.0
    coefficients: !core/ndarray-1.0.0
      source: 1
      datatype: float64
      byteorder: little
      shape: [2, 2]
    inputs: [x, y]
    outputs: [z]
    polynomial_type: chebyshev
    window:
    - &id001 [-1, 1]
    - *id001
  model2: *id002
  telescope: JWST
...
<D3>BLK^@0^@^@^@^@^@^@^@^@^@^@^@^@^@^@^H^@^@^@^@^@^@^@^H^@^@^@^@^@^@^@^H^@<BD>Gb
<E1>.<EB>?<FA><B8><C7>i/<C7><E1><BF>
<99><99><99><99><99><C9>?<9A><99><99><99><99><99><D9>?#ASDF BLOCK INDEX
%YAML 1.1
---
- 1215
- 3317
...

'''
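
The block index at the end of the listing is itself a small YAML document, so the offsets can be recovered with very little code. The following sketch uses plain string handling instead of a YAML parser and runs on a hand-written tail modeled on the listing above; `read_block_index` is a hypothetical helper, not part of the **asdf** library.

```python
def read_block_index(data):
    """Return the block offsets listed in a trailing block index, or []."""
    marker = b"#ASDF BLOCK INDEX"
    start = data.rfind(marker)
    if start == -1:
        return []  # the block index is optional
    offsets = []
    for line in data[start:].decode("ascii").splitlines():
        line = line.strip()
        # Each YAML list item holds the byte offset of one binary block.
        if line.startswith("- "):
            offsets.append(int(line[2:]))
    return offsets


# Hand-written stand-in for the last bytes of the file shown above.
tail = b"<binary data>#ASDF BLOCK INDEX\n%YAML 1.1\n---\n- 1215\n- 3317\n...\n"
offsets = read_block_index(tail)
print(offsets)
```

A real reader would use a YAML parser and verify that each offset points at the start of a block; the sketch only shows how simple the index is.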

The **asdf** library has search capabilities. A file can be searched for an attribute by name, type or value.

In [9]:
from astropy.modeling.core import Model

af.search(type=Model)

root (AsdfObject)
└─meta (dict)
  ├─model1 (Chebyshev2D)
  └─model2 (Chebyshev2D)

In [10]:
af.search('model')

root (AsdfObject)
└─meta (dict)
  ├─model1 (Chebyshev2D)
  └─model2 (Chebyshev2D)

#### Accessing and Modifying a file


In [11]:
print(af['meta']['date'])

2022-06-01T16:34:22.059


Reading a custom serialized object creates the object in memory.

For example, the model saved in the file can be directly evaluated.

In [12]:
chebyshev = af['meta']['model1']

In [13]:
type(chebyshev)

<class 'astropy.modeling.polynomial.Chebyshev2D'>
Name: Chebyshev2D (OrthoPolynomialBase)
N_inputs: 2
N_outputs: 1
Fittable parameters: <property object at 0x7f8b1a667360>

In [14]:
chebyshev(1.2, 2.1)

1.9780000000000002

In [15]:
print(af['data'])

[[ 1.00000000e+02  6.82245734e-01 -2.98389550e-01  5.14005486e-01
  -1.10249540e+00 -4.35345418e-01]
 [-3.59395950e-02  5.68040474e-01  8.69830342e-02  1.71662611e+00
  -8.94540622e-02  1.79363333e-01]
 [-1.02320839e-02  8.06388959e-01  6.04275392e-01  3.91118220e-02
   8.00726809e-01 -8.91474365e-01]
 [-4.09661617e-01  4.51147659e-01  1.09377565e+00  5.77306930e-01
   1.00588173e+00  2.44444703e-01]
 [ 2.67706405e-01 -2.15643764e+00  2.76001987e-01 -1.14983575e+00
   2.93941246e-01 -1.19135198e+00]]


In [16]:
af['data'][0,0] = 100

ValueError: assignment destination is read-only

By default a file is opened in **r** mode. Once it's opened in **rw** mode, it can be modified.

In [17]:
af.close()

af_rw = asdf.open('example.asdf', mode='rw')

af_rw['data'][0,0] = 100

print(af_rw['data'])
af_rw.write_to('example_rw.asdf')

[[ 1.00000000e+02  6.82245734e-01 -2.98389550e-01  5.14005486e-01
  -1.10249540e+00 -4.35345418e-01]
 [-3.59395950e-02  5.68040474e-01  8.69830342e-02  1.71662611e+00
  -8.94540622e-02  1.79363333e-01]
 [-1.02320839e-02  8.06388959e-01  6.04275392e-01  3.91118220e-02
   8.00726809e-01 -8.91474365e-01]
 [-4.09661617e-01  4.51147659e-01  1.09377565e+00  5.77306930e-01
   1.00588173e+00  2.44444703e-01]
 [ 2.67706405e-01 -2.15643764e+00  2.76001987e-01 -1.14983575e+00
   2.93941246e-01 -1.19135198e+00]]


### Exercise 1

Using the asdf library, open the file provided with the tutorial (**r0000000.asdf**).
The file is a simulated image from the Nancy Grace Roman WFI instrument (courtesy of the Roman Instrument Team at STScI).

- Use the **info** and **search** methods to look at the contents of the file.
- Find the **wcs** attribute.
- Evaluate the WCS object for some pixel coordinates within the image to calculate the RA and DEC.

  Hint: 
  - The WCS object is represented using the Generalized World Coordinate System (GWCS) library. It can be evaluated by calling it as a function.
  - GWCS supports the *Astropy Common WCS API*. Use *wcs.pixel_to_world()* to evaluate the same coordinates. The result is an *astropy.coordinates.SkyCoord* object. Use it to transform the result to Galactic coordinates.
  


### Compression

ASDF supports array compression. The currently supported compression types are **zlib**, **bzp2** and **lz4**.

To specify which compression algorithm to use, pass the code to the *set_array_compression* method.

In [18]:
import numpy as np

ar_zeros = np.zeros((4000, 4000))
# Add the array to the tree so that it is written out with the file
af_rw['zeros'] = ar_zeros
af_rw.set_array_compression(ar_zeros, 'bzp2')

In [19]:
af_rw.write_to('compressed.asdf')

### Adding *History* items

In [21]:
af_rw.add_history_entry("This file was generated during AAS 240")

### Conclusion

#### Summary

The Python library optionally includes extensions that can serialize certain astropy objects: models, coordinate frames, units and quantities, and tables. It can also serialize GWCS objects.

ASDF can serialize custom types, identified by *tags*. A little work is needed to write *Converters*, which handle the serialization to and deserialization from ASDF files.

The source code for the current extensions is on GitHub: https://github.com/asdf-format and in the GWCS library at https://github.com/spacetelescope/gwcs.

#### What we didn't cover 

ASDF is extensible. It is relatively easy to write an extension which serializes any other Python object. Tutorials are available at https://github.com/asdf-format/tutorials .

ASDF supports compression and there is a mechanism to add custom compression algorithms.

ASDF uses JSON Schema to validate the contents of the files. If used, this is a powerful way to make sure files are correct.

There's a command line tool **asdftool** which does many of the operations we've shown outside the Python interpreter. Check the options using **asdftool --help**.

ASDF supports the so-called **exploded form**. An ASDF file can be split into one file for the YAML content and one for each of the binary blocks it contains, which makes the YAML easier to edit and lets other programs access the binary data independently.

#### Documentation

The ASDF standard is documented at https://asdf-standard.readthedocs.io/en/1.0.2/

Documentation of the Python library is at https://asdf.readthedocs.io/en/stable/

Additional ASDF documentation on various extensions and converters can be found at

https://asdf-astropy.readthedocs.io/en/latest/


#### Future work

- Add support for chunking arrays using **zarr**

- Add support for efficient access of large files in the cloud

- Visualization support

- A C/C++ library, an IDL library?

- Add more compression options


### The code used to create *example.asdf* from scratch

In [22]:
# Get the current time as a string in ISOT format
from astropy import time as atime
t = atime.Time.now().isot

# Create the data array
import numpy as np
data = np.random.randn(30).reshape((5, 6))

# Create a Chebyshev 2D polynomial
from astropy.modeling.models import Chebyshev2D
p = Chebyshev2D(1, 1)
p.parameters = .1, .2, .3, .4


In [23]:
"""
- Create an empty AsdfFile object
- Assign the attributes, choosing to assign metadata under a top level *meta* attribute
- Write the file to disk
"""
jw = asdf.AsdfFile()
jw['meta']={}
jw['meta']['telescope'] = 'JWST'
jw['meta']['instrument'] = {}
jw['meta']['instrument']['name'] = 'NIRCAM'
jw['meta']['instrument']['detector'] = 'NRCA'
jw['meta']['instrument']['filter'] = 'FW100W'
jw['meta']['date'] = t
jw['data'] = data
jw['meta']['model1'] = p
jw['meta']['model2'] = p
jw.write_to('example.asdf')
