# Development

The proposed plan for radio astronomy data reduction software consists of three layers:
- Application Software
- ngCASA functions
- CNGI functions

The application software layer is what most astronomers will ultimately use. This document describes the building blocks that may be assembled to create the application software. Consequently, ngCASA and CNGI are aimed at the following users:

- CASA Developers
- Pipeline Developers
- Advanced users and algorithm developers

## ngCASA Prototype

The following ngCASA functions have working prototypes:
- [make_imaging_weight](https://ngcasa.readthedocs.io/en/latest/_api/api/ngcasa.imaging.make_imaging_weight.html#ngcasa.imaging.make_imaging_weight)
- [make_psf](https://ngcasa.readthedocs.io/en/latest/_api/api/ngcasa.imaging.make_psf.html#ngcasa.imaging.make_psf)
- [make_image](https://ngcasa.readthedocs.io/en/latest/_api/api/ngcasa.imaging.make_image.html#ngcasa.imaging.make_image)

Example notebooks can be found [here](https://ngcasa.readthedocs.io/en/latest/prototypes.html). In the [continuum imaging](https://ngcasa.readthedocs.io/en/latest/prototypes/continuum_imaging_example.html) and [cube imaging](https://ngcasa.readthedocs.io/en/latest/prototypes/cube_imaging_example.html) example notebooks, the images generated by the ngCASA prototype and CASA are compared. A benchmarking comparison between the ngCASA prototype and CASA can be found [here](https://ngcasa.readthedocs.io/en/latest/benchmark.html).

### Chunking

In the zarr and dask model, data is broken up into chunks. The chunk size can be specified in the ```xarray.open_zarr``` call using the ```chunks``` parameter. The dask and zarr chunking do not have to be the same: the [zarr chunking](https://zarr.readthedocs.io/en/stable/tutorial.html) is what is used on disk and the [dask chunking](https://docs.dask.org/en/latest/array-chunks.html) is used during parallel computation. However, it is more efficient for the dask chunk size to be equal to, or a multiple of, the zarr chunk size, since this avoids multiple reads of the same data. This hierarchy of chunking allows for flexible and efficient algorithm development. For example, cube imaging is more memory efficient if chunking is along the channel axis (the [benchmarking](https://ngcasa.readthedocs.io/en/latest/benchmark.html) example demonstrates this). Note that chunking can be done in any combination of dimensions.

When the requested chunking differs from the chunking on disk, the ```overwrite_encoded_chunks``` parameter in the ```xarray.open_zarr``` call should be set to True. The ```storage_parms['chunks_on_disk']``` and ```storage_parms['chunks_return']``` functionality is still experimental; please report any bugs. There should also be no dask chunking on the polarization dimension.
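
The effect of aligning dask chunks with zarr chunks can be illustrated with a small, library-free sketch. It assumes a simplified model in which each dask chunk reads every on-disk chunk it overlaps along one axis; the function name `disk_chunk_reads` and the sizes used are hypothetical and are not part of ngCASA, dask or zarr.

```python
def disk_chunk_reads(axis_len, zarr_chunk, dask_chunk):
    """Count how often each on-disk (zarr) chunk is read along one axis,
    assuming every dask chunk loads all the zarr chunks it overlaps."""
    n_zarr = -(-axis_len // zarr_chunk)  # ceiling division
    reads = [0] * n_zarr
    for start in range(0, axis_len, dask_chunk):
        stop = min(start + dask_chunk, axis_len)
        first = start // zarr_chunk        # first zarr chunk touched
        last = (stop - 1) // zarr_chunk    # last zarr chunk touched
        for i in range(first, last + 1):
            reads[i] += 1
    return reads


# dask chunk is a multiple of the zarr chunk: every disk chunk is read once
aligned = disk_chunk_reads(axis_len=120, zarr_chunk=10, dask_chunk=20)
# dask chunk is not a multiple: some disk chunks are read twice
misaligned = disk_chunk_reads(axis_len=120, zarr_chunk=10, dask_chunk=15)

print(sum(aligned), max(aligned))
print(sum(misaligned), max(misaligned))
```

In this toy model the aligned case needs 12 reads for 12 disk chunks, while the misaligned case needs 16 because four disk chunks straddle a dask chunk boundary.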


### ngCASA Function Template
```python
def ngcasa_func(dataset_1, ..., dataset_n, parms_1, ..., parms_m, storage_parms):
    """
    Description of function.
    Parameters
    ----------
    dataset_1 : xarray.core.dataset.Dataset
    ...
    dataset_n : xarray.core.dataset.Dataset
    parms_1 : dict
    ...
    parms_m : dict
    storage_parms : dict
    storage_parms['to_disk'] : bool, default = False
    storage_parms['append'] : bool, default = False
    storage_parms['outfile'] : str
    storage_parms['chunks_on_disk'] : dict of int, default = {}
    storage_parms['chunks_return'] : dict of int, default = {}
    storage_parms['graph_name'] : str
    storage_parms['compressor'] : numcodecs.blosc.Blosc, default = Blosc(cname='zstd', clevel=2, shuffle=0)
    Returns
    -------
    output_dataset_1 : xarray.core.dataset.Dataset
    ...
    output_dataset_k : xarray.core.dataset.Dataset
    """
```
ngCASA functions will receive and return xarray datasets. By default, calling an ngCASA function will not do any computation, but rather build a graph of the computation ([example of a graph](https://ngcasa.readthedocs.io/en/latest/prototypes/cube_imaging_example.html)). Computation can be triggered by using ```dask.compute(dataset)```. Each ngCASA function will also have the ```storage_parms``` input. By default, ```storage_parms['to_disk']``` is set to False and the other dictionary elements have no effect. However, if ```storage_parms['to_disk']``` is True, a compute is triggered and the data is written to disk ([example of using storage_parms](https://ngcasa.readthedocs.io/en/latest/prototypes/imaging_weights_example.html)). The ```storage_parms``` elements have the following functionality:

- ```storage_parms['to_disk']``` If True the dask graph is executed and saved to disk in the zarr format.
- ```storage_parms['append']``` If True and ```storage_parms['to_disk']``` is also True, only the dask graph associated with the function is executed and the resulting data variables are saved to an existing zarr file on disk. Note that graphs of data unrelated to this function will not be executed or saved. Appending also only works for data variables whose dimensions are already part of the dataset on disk.
- ```storage_parms['outfile']``` The zarr file to create or append to.
- ```storage_parms['chunks_on_disk']``` The chunk size to use when writing to disk. This is ignored if ```storage_parms['append']``` is True.
- ```storage_parms['chunks_return']``` The chunk size of the dataset that is returned. 
- ```storage_parms['graph_name']``` The time to compute and save the data is stored in the attribute section of the dataset and ```storage_parms['graph_name']``` is used in the label.
- ```storage_parms['compressor']``` The compression algorithm to use. Available compression algorithms can be found [here](https://numcodecs.readthedocs.io/en/stable/blosc.html).
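
The defaults listed above could be filled in by a small helper along the following lines. This is only a sketch: `check_storage_parms` is a hypothetical name, the rule that `outfile` is required when writing to disk is an assumption, and the `compressor` default is left out to keep the example free of third-party imports.

```python
def check_storage_parms(storage_parms=None):
    """Hypothetical helper: return a copy of storage_parms with defaults filled in."""
    defaults = {
        'to_disk': False,       # only build the graph, do not compute
        'append': False,
        'chunks_on_disk': {},
        'chunks_return': {},
    }
    parms = dict(defaults)
    parms.update(storage_parms or {})

    # Assumption: an output file name is needed whenever data is written to disk.
    if parms['to_disk'] and 'outfile' not in parms:
        raise ValueError("storage_parms['outfile'] is required when to_disk is True")

    # Documented behaviour: chunks_on_disk is ignored when appending.
    if parms['append']:
        parms['chunks_on_disk'] = {}
    return parms


# Default call: nothing is written to disk.
lazy = check_storage_parms()

# Write to disk with an explicit on-disk chunk size ('chan' is an assumed dimension name).
to_disk = check_storage_parms(
    {'to_disk': True, 'outfile': 'img.zarr', 'chunks_on_disk': {'chan': 4}}
)
```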

The data variables in the dataset that the function uses to build the graph can be specified by the user, although defaults will be provided. For example, the [make_image](https://ngcasa.readthedocs.io/en/latest/_api/api/ngcasa.imaging.make_image.html#ngcasa.imaging.make_image) function has the ```user_grid_parms['data_name']``` parameter. This parameter sets the name of the data variable that contains the visibilities to be gridded, and the default is 'DATA'. This functionality is demonstrated in the [imaging weight example notebook](https://ngcasa.readthedocs.io/en/latest/prototypes/imaging_weights_example.html).


## Future Design Decisions

### Data Structures 
The initial exercise of defining an ngCASA API and writing example usage scripts raised the following questions related to default data structures that CNGI and ngCASA will adopt.

### Visibilities zarr/xarray
A default list of array names is as defined in CNGI. But, to get the ability to have versions of arrays, any application may add arrays as needed. Examples are corrected_data_1, corrected_data_2, flag_original, flag_auto, etc. All additional arrays must conform to the same meta-data and coordinates. But now, all methods (even CNGI) need the ability to specify an array name to use, with the default being from the core definition. 

### Image zarr/xarray
What is the default list of array names (i.e. image products) to store? The current specification in CNGI is tied closely to the casa6 tclean's list of outputs. However, not every application or user will need this full list. Therefore, we should perhaps require that an image dataset contain a minimum of 1 array with a predefined default name, but then allow the appending of more. All image arrays within a set must share the same shape and coordinate metadata. Here too, all methods (even CNGI) that operate on images must have the ability to specify an array name to use.

### A Caltable
We need to define the default structure of a zarr/xarray dataset meant to store calibration solutions. The same requirements as above apply here too.

### Primary Beam Models and Convolution Function Cache
We need a persistent data structure to store and use gridding convolution functions. The imaging prototype is developing one, related to the casa6 CFCache. A mechanism to convert between a Primary-Beam model database and a convolution function cache must also be designed. 

### Flux Component List
We need a definition of a component list as an alternative to a raster image. This is required for compatibility with catalogues and translation between sky models that are not fitted/evaluated on a pixellated grid. 


## List of Future Prototypes
The following are suggestions to add to the list of CNGI and ngCASA prototypes that demonstrate that algorithms and science use cases can be implemented with minimal complexity and that they scale as expected. 

Note : This list is preliminary in content, and has no fixed timelines associated with it. 

### Join Operation

Demonstrate a join operation for zarr/xds datasets for visibilities, images, caltables, CF-caches, etc. 

- Visibilities : A common use case is to combine data from multiple datasets that may not share meta-data or be consistent in shape. Some algorithms must be done on each dataset (or subset) separately, whereas others are to be done on an entire set.   
    - Option1 : Always work with small homogeneous subsets of data and have the application layer manage lists and merging of products (via loops). 
    - Option2 : Implement a join and then use data selection as needed. All applications then strictly can take single datasets as inputs. Algorithms such as imaging can view it as a single large dataset. However, for some algorithms (such as calibration solutions that must pay attention to different meta-data across subsets) the internal diversity will still have to be managed through loops inside the methods.
 
  Current CASA6 has a mix of both of the above options, and this is a significant source of inconsistency. Can the new framework simplify the book-keeping for this use case without sacrificing other aspects such as performance?

- Images : Applications may need to work with multiple image sets that may not share the same shape or meta-data. 
  - Should there be a join operation for images, should there only be custom combination methods, or should the application layer manage lists of image sets ?
  - Example use cases are linear mosaics where a list of small images is combined onto a larger image grid (as an explicit weighted average of images, where the primary beam is used as the weight) and multi-field imaging where main and outlier fields have vastly differing meta-data.
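
The linear mosaic use case above can be sketched as a per-pixel weighted average. The example below is a minimal illustration, assuming the input images have already been regridded onto the common larger grid and using 1-D lists in place of images; `linear_mosaic` and `min_weight` are hypothetical names, and a production mosaicking scheme may well use a different weighting.

```python
def linear_mosaic(images, primary_beams, min_weight=1e-6):
    """Weighted average of co-gridded images, using the primary beam as the weight."""
    n_pix = len(images[0])
    mosaic = []
    for p in range(n_pix):
        weight_sum = sum(pb[p] for pb in primary_beams)
        if weight_sum < min_weight:
            # No pointing covers this pixel: leave it empty.
            mosaic.append(0.0)
        else:
            weighted = sum(pb[p] * im[p] for im, pb in zip(images, primary_beams))
            mosaic.append(weighted / weight_sum)
    return mosaic


# Two overlapping pointings on a three-pixel grid.
images = [[1.0, 1.0, 0.0], [0.0, 3.0, 3.0]]
beams = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]
mosaic = linear_mosaic(images, beams)
print(mosaic)
```

In the overlap pixel both pointings contribute equally, so the mosaic value is the plain mean of the two images there.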

 
### Re-bin and Expand
The reverse operation of time/chan average or rebin/regrid operations. Visibility processing methods should (where mathematically possible) support transformations in both directions, with clear conventions implemented.  

  - E.g. vis.timeaverage() followed by 'edit flags' and then write back to the original dataset? The FLAG value is to be copied during expansion.
  - E.g. Caltable solutions need an interpolation to get back to the original data resolution. Interpolation+Extrapolation.
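
The first example above (average, edit flags, expand back) might look like the following sketch on a single time series. The function names are hypothetical, and combining the expanded flag with the pre-existing flag by logical OR is an assumption about the convention, not a decision recorded in this document.

```python
def time_average(values, flags, n):
    """Average the unflagged samples in consecutive bins of n samples.
    A bin is flagged only if all of its samples are flagged."""
    avg, avg_flags = [], []
    for start in range(0, len(values), n):
        good = [v for v, f in zip(values[start:start + n], flags[start:start + n]) if not f]
        avg.append(sum(good) / len(good) if good else 0.0)
        avg_flags.append(not good)
    return avg, avg_flags


def expand_flags(avg_flags, orig_flags, n):
    """Copy each averaged flag back onto the n original samples of its bin.
    Assumption: pre-existing flags are kept (logical OR)."""
    return [orig or avg_flags[i // n] for i, orig in enumerate(orig_flags)]


values = [1.0, 3.0, 5.0, 7.0, 2.0, 4.0]
flags = [False, False, True, False, False, False]

avg, avg_flags = time_average(values, flags, n=2)
avg_flags[2] = True                      # 'edit flags' on the averaged data
new_flags = expand_flags(avg_flags, flags, n=2)
print(avg, new_flags)
```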


### Imaging

- Step 1 : A [fully functioning imaging prototype](https://ngcasa.readthedocs.io/en/latest/prototypes.html) exists and has demonstrated imaging weight calculations, basic gridding and image formation using a python-numba implementation. Equivalence with casa6's standard gridder has been established along with a performance comparison and demonstrations on a local workstation (with pre-specified resource constraints) as well as AWS. 

- Step 2 : Over the next several months, this prototype will be extended to include mosaic gridding/imaging and the design of a fully-featured convolution function cache (based on the casa6 awproject gridder cfcache). Demonstrations will likely include a large ALMA cube mosaic imaging example.

- Step 3 : Time permitting, this work will be further extended to include support for heterogeneous arrays (dish shapes and pointing offsets) and corresponding demonstrations via ALMA or ngVLA simulations may be written. 

### Flagging 

- Step 1 : Implement manual flag commands using native python-based xarray selection syntax and demonstrate that this scales with realistic flag command counts (several 1000 independent data selections).

### Calibration 
- Step 1 : Basic apply w/ single caltable: establish/exercise basic Jones algebra constructs and topology of the fundamental calibration infrastructure (CNGI vs. ngCASA).
- Step 2 : General apply of many terms: introduce/establish/exercise multi-term topology, including OTF aggregation of calibration.
- Step 3 : Basic solve (no apply): Establish/exercise essential solving mechanics (including pre-averaging).
- Step 4 : Solve w/ pre-applies: Exercise pre-apply aggregation and resource management and staging to the solve.
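
As a point of reference for Step 1, the simplest possible apply uses one scalar complex gain per antenna, so that an observed visibility on baseline (i, j) is the true visibility multiplied by g_i and the complex conjugate of g_j. The sketch below inverts that relation; it is a special case of the general Jones formalism (where the gains become 2x2 matrices) and is not taken from the ngCASA or CNGI code. `apply_gains` is a hypothetical name.

```python
def apply_gains(vis, ant1, ant2, gains):
    """Correct visibilities with per-antenna scalar complex gains:
    V_corrected = V_observed / (g_i * conj(g_j))."""
    return [v / (gains[i] * gains[j].conjugate()) for v, i, j in zip(vis, ant1, ant2)]


# Three antennas, three baselines.
gains = [2 + 0j, 1j, 0.5 - 0.5j]
ant1 = [0, 0, 1]
ant2 = [1, 2, 2]
true_vis = [1 + 0j, 0.5 + 0.5j, 2j]

# Corrupt the true visibilities, then correct them again.
observed = [gains[i] * gains[j].conjugate() * v for v, i, j in zip(true_vis, ant1, ant2)]
corrected = apply_gains(observed, ant1, ant2, gains)
```

Applying the gains to corrupted data should recover the original visibilities up to floating-point rounding.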

### Simulation
- Step 1 : Generate a simulated dataset to mimic a real observed dataset and show that meta-data are accurate. 
- Step 2 : If feasible, demonstrate an ngVLA simulation by integrating the simulator meta-data generation with the imager prototype (preferably after full heterogeneous array support is ready).


### Visualization 

- Step 1 : Use off-the-shelf python libraries to demonstrate interactive plotting of large volumes of data. 

- Step 2 : Demonstrate the use of python visualization tools from within an ngCASA application or user script. The purpose is to support flexible interactive visualization as part of algorithms and applications. Features to be included for evaluation for this use case are as follows : 

  - Array 'versioning' implemented as separately named arrays allows a user to visualize multiple versions (of flags, or corrected_data, or model_data). The visualization tool should allow the user to specify which (non-default) array names to use/plot.

  - On-the-fly visualization must be possible using the currently-available xarray dataset (which may or may not have a zarr counterpart).  
    - Flagging : An example use case is [autoflagging](https://ngcasa.readthedocs.io/en/latest/ngcasa_flagging.html#Autoflag-with-extension-and-pre-existing-flags) where it is often useful to inspect flags, discard them, try new autoflag parameters, re-inspect, and save flags only when satisfied. 
    - Imaging : [Interactive mask drawing and iteration control](https://ngcasa.readthedocs.io/en/latest/ngcasa_imaging.html#Interactive-Clean) editing is another use case.


  - Visualization must apply to zarr/xarray datasets such that they are usable (interchangeably, within reason) for visibility datasets, caltables and images. This merges the roles of what has traditionally been separate in casa6 as plotms and the viewer and will get us (at least) scatter and raster displays for all kinds of datasets that we support. 



### Pipeline Usage Modes

- Step 1 : Demonstrate a pipeline use case of operating on a diverse dataset containing multiple meta-data sub-structures (EBs, pointings, etc), with some algorithmic steps being run separately and some on the joint data. Demonstrate the role of the xarray/dask framework in simplifying the management of data selections/splits, parallelization, and combination for imaging. Demonstrate the use of array/flag versions, etc.

- Step 2 : String together all the above prototypes into a full pipeline demonstration. 





## More Information

Development environment, process and rules are inherited from CNGI and may be found here:

https://cngi-prototype.readthedocs.io/en/latest/development.html