<a href="https://colab.research.google.com/github/casangi/ngcasa/blob/master/docs/ngcasa_development.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Development


Radio interferometry data analysis applications and algorithms may be assembled from CNGI and ngCASA building blocks. A user may implement their own analysis scripts, use a pre-packaged task similar to those in current CASA, or embed ngCASA and CNGI methods in a production pipeline DAG.



### Data Structures 
The initial exercise of defining an ngCASA API and writing example usage scripts raised the following questions related to default data structures that CNGI and ngCASA will adopt.

#### Visibilities zarr/xarray
The default list of array names is defined in CNGI. To support versions of arrays, any application may add arrays as needed, for example corrected_data_1, corrected_data_2, flag_original, flag_auto, etc. All additional arrays must conform to the same meta-data and coordinates as the default arrays. As a consequence, all methods (even those in CNGI) need the ability to specify which array name to use, with the default taken from the core definition.
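
A minimal sketch of the array-versioning idea, assuming an xarray Dataset with illustrative dimension names. The default name `data`, the helper `make_vis_dataset` and the method `scale_data` are hypothetical and are not part of the CNGI or ngCASA API; the point is only that a method takes the array name as a parameter and falls back to a default.

```python
import numpy as np
import xarray as xr

# Hypothetical default array name; the real default comes from the CNGI core definition.
DEFAULT_DATA = "data"


def make_vis_dataset(n_time=4, n_baseline=3, n_chan=2, n_pol=1):
    """Build a tiny visibility-like dataset holding one default data array."""
    dims = ("time", "baseline", "chan", "pol")
    shape = (n_time, n_baseline, n_chan, n_pol)
    coords = {"time": np.arange(n_time), "chan": np.arange(n_chan)}
    data = np.ones(shape, dtype=complex)
    return xr.Dataset({DEFAULT_DATA: (dims, data)}, coords=coords)


def scale_data(xds, factor, data_name=DEFAULT_DATA, out_name=None):
    """Scale the array called `data_name` and store the result as `out_name`.

    When `out_name` is omitted the input array is replaced. The new array
    inherits the dimensions and coordinates of the array it was derived from.
    """
    out_name = out_name or data_name
    return xds.assign({out_name: xds[data_name] * factor})


vis = make_vis_dataset()
# Each call reads one named version and writes another, leaving the default untouched.
vis = scale_data(vis, 2.0, out_name="corrected_data_1")
vis = scale_data(vis, 3.0, data_name="corrected_data_1", out_name="corrected_data_2")
print(list(vis.data_vars))
```

Because every version lives in the same dataset, the versions automatically share coordinates, which is one way to satisfy the conformance requirement.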

#### Image zarr/xarray
What is the default list of array names (i.e. image products) to store here? The current specification in CNGI is tied closely to the list of outputs of casa6 tclean. However, not all applications/users will need this full list. We should therefore perhaps require that an image dataset contain a minimum of one array with a predefined default name, and allow more to be appended. All image arrays within a set must share the same shape and coordinate metadata. Here too, all methods (even those in CNGI) that operate on images must have the ability to specify which array name to use.
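
The two proposed rules (one mandatory default array, all arrays on the same grid) could be checked with a small validator. This is a sketch under stated assumptions: the default name `image`, the dimension names and the function `check_image_set` are hypothetical and only illustrate the requirement.

```python
import numpy as np
import xarray as xr

DEFAULT_IMAGE = "image"  # hypothetical default array name


def check_image_set(xds, default_name=DEFAULT_IMAGE):
    """Return a list of problems; an empty list means the image set conforms."""
    if default_name not in xds.data_vars:
        return [f"missing default array '{default_name}'"]
    ref_dims = xds[default_name].dims
    problems = []
    for name, arr in xds.data_vars.items():
        # Within one Dataset a dimension has a single size, so equal
        # dimension names imply equal shape and shared coordinates.
        if arr.dims != ref_dims:
            problems.append(f"'{name}' has dims {arr.dims}, expected {ref_dims}")
    return problems


dims = ("l", "m", "chan", "pol")  # illustrative dimension names
shape = (8, 8, 2, 1)
good = xr.Dataset({
    "image": (dims, np.zeros(shape)),
    "residual": (dims, np.zeros(shape)),
})
# A 1-D spectrum does not live on the image grid and is reported.
bad = good.assign(profile=(("chan",), np.zeros(2)))

print(check_image_set(good))
print(check_image_set(bad))
```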

#### A Caltable
We need to define the default structure of a zarr/xarray dataset meant to store calibration solutions. The same requirements as above apply here too.

#### Primary Beam Models and Convolution Function Cache
We need a persistent data structure to store and use gridding convolution functions. The imaging prototype is developing one, related to the casa6 CFCache. A mechanism to convert between a Primary-Beam model database and a convolution function cache must also be designed. 

#### Flux Component List
We need a definition of a component list as an alternative to a raster image. This is required for compatibility with catalogues and translation between sky models that are not fitted/evaluated on a pixellated grid. 
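
As an illustration only, a component list could be as simple as a sequence of records plus a translation step onto a pixel grid. The class `Component`, its fields (offsets from the image centre in degrees, flux in Jy) and the function `rasterize` are assumptions made for this sketch, not a proposed definition; only point components are handled and sky projection effects are ignored.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Component:
    """One point-like sky model component (hypothetical fields)."""
    l_deg: float    # offset from the image centre along the first axis
    m_deg: float    # offset from the image centre along the second axis
    flux_jy: float


def total_flux(components: List[Component]) -> float:
    """Sum of component fluxes; no pixel grid is needed for this."""
    return sum(c.flux_jy for c in components)


def rasterize(components: List[Component], npix: int, cell_deg: float):
    """Add each component's flux to its nearest pixel of an npix x npix grid."""
    grid = [[0.0] * npix for _ in range(npix)]
    center = npix // 2
    for c in components:
        ix = center + round(c.l_deg / cell_deg)
        iy = center + round(c.m_deg / cell_deg)
        if 0 <= ix < npix and 0 <= iy < npix:  # drop components off the grid
            grid[iy][ix] += c.flux_jy
    return grid


sky = [Component(0.0, 0.0, 1.0), Component(0.02, -0.01, 0.5)]
image = rasterize(sky, npix=5, cell_deg=0.01)
print(total_flux(sky))
```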

  



      


### List of future prototypes
The following are suggestions to add to the list of CNGI and ngCASA prototypes that demonstrate that algorithms and science use cases can be implemented with minimal complexity and that they scale as expected. 

Note : This list is preliminary in content, and has no fixed timelines associated with it. 

#### Infrastructure : Join Operation

Demonstrate a join operation for zarr/xds datasets for visibilities, images, caltables, CF-caches, etc. 

- Visibilities : A common use case is to combine data from multiple datasets that may not share meta-data or be consistent in shape. Some algorithms must be run on each dataset (or subset) separately, whereas others are to be run on the entire set.
    - Option 1 : Always work with small homogeneous subsets of data and have the application layer manage lists and merging of products (via loops).
    - Option 2 : Implement a join and then use data selection as needed. All applications can then strictly take single datasets as inputs. Algorithms such as imaging can view the result as a single large dataset. However, for some algorithms (such as calibration solutions that must pay attention to different meta-data across subsets) the internal diversity will still have to be managed through loops inside the methods.

  Current CASA6 has a mix of both of the above options, and this is a significant source of inconsistency. Can the new framework simplify the book-keeping for this use case without sacrificing other aspects such as performance?

- Images : Applications may need to work with multiple image sets that may not share the same shape or meta-data. 
  - Should there be a join operation for images, should there only be custom combination methods, or should the application layer manage lists of image sets?
  - Example use cases are linear mosaics where a list of small images is combined onto a larger image grid (as an explicit weighted average of images, where the primary beam is used as the weight) and multi-field imaging where main and outlier fields have vastly differing meta-data.
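
The linear mosaic use case above can be sketched as a custom combination method that takes a list of images. This is a minimal NumPy illustration of a primary-beam-weighted average, assuming each small image comes with a primary beam array of the same shape and an integer pixel offset on the larger grid; the function name `linear_mosaic` and the `pb_cutoff` parameter are hypothetical.

```python
import numpy as np


def linear_mosaic(images, pbs, offsets, mosaic_shape, pb_cutoff=1e-3):
    """Weighted average of small images on a larger grid, with the PB as weight.

    images, pbs : lists of 2-D arrays of matching shape
    offsets     : list of (row, col) positions of each image's corner
    """
    numerator = np.zeros(mosaic_shape)
    weight_sum = np.zeros(mosaic_shape)
    for img, pb, (y0, x0) in zip(images, pbs, offsets):
        ny, nx = img.shape
        numerator[y0:y0 + ny, x0:x0 + nx] += pb * img
        weight_sum[y0:y0 + ny, x0:x0 + nx] += pb
    mosaic = np.zeros(mosaic_shape)
    valid = weight_sum > pb_cutoff  # leave pixels with negligible weight at zero
    mosaic[valid] = numerator[valid] / weight_sum[valid]
    return mosaic, weight_sum


# Two overlapping 4x4 pointings of a flat 2 Jy/beam sky, each with a flat PB of 0.5.
images = [np.full((4, 4), 2.0), np.full((4, 4), 2.0)]
pbs = [np.full((4, 4), 0.5), np.full((4, 4), 0.5)]
mosaic, weight_sum = linear_mosaic(images, pbs, [(0, 0), (0, 2)], (4, 6))
```

In this form the application layer manages the list of image sets; a join operation would instead hide the loop inside the dataset.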

 
#### Infrastructure : Re-bin and Expand
This is the reverse of time/channel averaging or rebin/regrid operations. Visibility processing methods should (where mathematically possible) support transformations in both directions, with clear conventions implemented.

  - E.g. vis.timeaverage() followed by 'edit flags' and then writing back to the original dataset? The FLAG value is to be copied during expansion.
  - E.g. Caltable solutions need interpolation (and extrapolation) to get back to the original data resolution.
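
The two examples above can be sketched with plain NumPy arrays standing in for dataset variables. The functions `time_average` and `expand_flags` are hypothetical, and the conventions used (an averaged sample is flagged only when all its inputs are flagged; solutions are held constant outside their time range) are assumptions chosen for illustration, not decided conventions.

```python
import numpy as np


def time_average(data, flags, width):
    """Average unflagged samples in blocks of `width` along the time axis (axis 0)."""
    n_time = data.shape[0]
    if n_time % width != 0:
        raise ValueError("n_time must be a multiple of width in this sketch")
    n_out = n_time // width
    d = data.reshape(n_out, width, *data.shape[1:])
    f = flags.reshape(n_out, width, *flags.shape[1:])
    weights = (~f).astype(float)          # flagged samples get zero weight
    wsum = weights.sum(axis=1)
    avg = (d * weights).sum(axis=1) / np.maximum(wsum, 1.0)
    return avg, wsum == 0                 # flagged only if every input was flagged


def expand_flags(avg_flags, width):
    """Copy each averaged flag back onto the `width` samples it was built from."""
    return np.repeat(avg_flags, width, axis=0)


data = np.array([[1.0], [3.0], [5.0], [7.0]])
flags = np.array([[False], [False], [True], [True]])
avg, avg_flags = time_average(data, flags, width=2)
restored_flags = expand_flags(avg_flags, width=2)

# Caltable-like case: solutions at two times brought back to the data time grid.
# np.interp interpolates linearly and holds the end values outside the range.
sol_times = np.array([0.0, 10.0])
sol_amp = np.array([1.0, 2.0])
data_times = np.array([-5.0, 0.0, 5.0, 10.0, 15.0])
amp_on_data_grid = np.interp(data_times, sol_times, sol_amp)
```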


#### Applications/Algorithms : Imaging

- Step 1 : A [fully functioning imaging prototype](https://ngcasa.readthedocs.io/en/latest/prototypes.html) exists and has demonstrated imaging weight calculations, basic gridding and image formation using a python-numba implementation. Equivalence with casa6's standard gridder has been established along with a performance comparison and demonstrations on a local workstation (with pre-specified resource constraints) as well as AWS. 

- Step 2 : Over the next several months, this prototype will be extended to include mosaic gridding/imaging and the design of a fully-featured convolution function cache (based on the casa6 awproject gridder cfcache). Demonstrations will likely include a large ALMA cube mosaic imaging example.

- Step 3 : Time permitting, this work will be further extended to include support for heterogeneous arrays (dish shapes and pointing offsets), and corresponding demonstrations via ALMA or ngVLA simulations may be written.

#### Applications/Algorithms : Flagging 

- Step 1 : Implement manual flag commands using native python-based xarray selection syntax and demonstrate that this scales with realistic flag command counts (several thousand independent data selections).
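
A minimal sketch of what manual flag commands might look like with xarray label-based selection. The function `apply_flag_commands`, the array name `flag` and the dimension names are hypothetical; each command is assumed to be a dict mapping a coordinate name to a value, a list of values, or a slice of labels.

```python
import numpy as np
import xarray as xr


def apply_flag_commands(xds, commands, flag_name="flag"):
    """Return a new dataset with flags set to True for every selection."""
    flags = xds[flag_name].copy()   # deep copy, so the input dataset is untouched
    for cmd in commands:
        flags.loc[cmd] = True       # label-based assignment on the selected region
    return xds.assign({flag_name: flags})


dims = ("time", "baseline", "chan")
vis = xr.Dataset(
    {"flag": (dims, np.zeros((6, 3, 4), dtype=bool))},
    coords={"time": np.arange(6), "baseline": np.arange(3), "chan": np.arange(4)},
)

commands = [
    {"time": slice(1, 2)},              # label slices include both end points
    {"baseline": 0, "chan": [0, 3]},    # edge channels of one baseline
]
flagged = apply_flag_commands(vis, commands)
print(int(flagged["flag"].sum()))
```

A scaling test would call the same function with several thousand commands; this sketch makes no claim about how well a simple loop performs at that size.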

#### Applications/Algorithms : Calibration (TBD)

- Step 1 : Implement a gain solve and apply.  

- Step 2 : Demonstrate a complicated sequence of data preprocessing and reshaping steps in between a series of cal solve/apply steps.
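
To make "gain solve and apply" concrete, here is a toy example for a single time and channel. It uses a damped alternating least-squares iteration, which is one textbook-style approach and is not presented as the CASA solver or as the planned ngCASA method. The functions `solve_gains` and `apply_gains`, the noise-free simulation and the unit point-source model are all assumptions of the sketch.

```python
import numpy as np


def solve_gains(vis, model, n_iter=100):
    """Estimate per-antenna gains g so that vis[i, j] ~ g[i] * conj(g[j]) * model[i, j].

    vis, model : (n_ant, n_ant) complex matrices; autocorrelations are ignored.
    The overall phase of g is arbitrary, as no reference antenna is applied.
    """
    n_ant = vis.shape[0]
    off_diag = ~np.eye(n_ant, dtype=bool)
    g = np.ones(n_ant, dtype=complex)
    for _ in range(n_iter):
        # z[i, j] is the predicted visibility with the gain of antenna i left out.
        z = np.conj(g)[np.newaxis, :] * model
        num = np.sum(vis * np.conj(z) * off_diag, axis=1)
        den = np.sum(np.abs(z) ** 2 * off_diag, axis=1)
        g = 0.5 * (g + num / den)   # average with previous estimate to damp oscillation
    return g


def apply_gains(vis, g):
    """Correct visibilities by dividing out the antenna gain product."""
    return vis / np.outer(g, np.conj(g))


# Noise-free simulation: six antennas, 10% amplitude and 0.3 rad phase errors.
rng = np.random.default_rng(0)
n_ant = 6
true_gains = (1 + 0.1 * rng.uniform(-1, 1, n_ant)) * np.exp(1j * 0.3 * rng.uniform(-1, 1, n_ant))
model = np.ones((n_ant, n_ant), dtype=complex)
vis = np.outer(true_gains, np.conj(true_gains)) * model

gains = solve_gains(vis, model)
corrected = apply_gains(vis, gains)
```

The corrected visibilities are insensitive to the arbitrary overall phase, so they can be compared directly with the model.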

#### Applications/Algorithms : Simulation
- Step 1 : Generate a simulated dataset to mimic a real observed dataset and show that meta-data are accurate. 
- Step 2 : If feasible, demonstrate an ngVLA simulation by integrating the simulator meta-data generation with the imager prototype (preferably after full heterogeneous array support is ready).


#### Applications : Visualization 

- Step 1 : Use off-the-shelf python libraries to demonstrate interactive plotting of large volumes of data. 

- Step 2 : Demonstrate the use of python visualization tools from within an ngCASA application or user script. The purpose is to support flexible interactive visualization as part of algorithms and applications. Features to be included for evaluation for this use case are as follows : 

  - Array 'versioning' implemented as separately named arrays allows a user to visualize multiple versions (of flags, or corrected_data, or model_data). The visualization tool should allow the user to specify which (non-default) array names to use/plot.

  - On-the-fly visualization must be possible using the currently-available xarray dataset (which may or may not have a zarr counterpart).
    - Flagging : An example use case is [autoflagging](https://ngcasa.readthedocs.io/en/latest/ngcasa_flagging.html#Autoflag-with-extension-and-pre-existing-flags) where it is often useful to inspect flags, discard them, try new autoflag parameters, re-inspect, and save flags only when satisfied.
    - Imaging : [Interactive mask drawing and iteration control](https://ngcasa.readthedocs.io/en/latest/ngcasa_imaging.html#Interactive-Clean) editing is another use case.


  - Visualization must apply to zarr/xarray datasets such that they are usable (interchangeably, within reason) for visibility datasets, caltables and images. This merges the roles of what has traditionally been separate in casa6 as plotms and the viewer and will get us (at least) scatter and raster displays for all kinds of datasets that we support. 



#### Pipeline Usage Modes

- Step 1 : Demonstrate a pipeline use case of operating on a diverse dataset containing multiple meta-data sub-structures (EBs, pointings, etc), with some algorithmic steps being run separately and some on the joint data. Demonstrate the role of the xarray/dask framework in simplifying the management of data selections/splits, parallelization, and combination for imaging. Demonstrate the use of array/flag versions, etc.

- Step 2 : String together all the above prototypes into a full pipeline demonstration. 
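
The split between per-subset steps and joint steps can be pictured as a small dependency graph. The sketch below uses only the Python standard library (`graphlib`, Python 3.9+) to order hypothetical task names; the task names and the function `build_pipeline` are illustrative only, and in the framework described here the graph would be built and executed by dask rather than by hand.

```python
from graphlib import TopologicalSorter


def build_pipeline(eb_names):
    """Return {task: prerequisites} with per-EB steps feeding one joint imaging step."""
    graph = {}
    for eb in eb_names:
        graph[f"flag:{eb}"] = set()                   # independent per EB
        graph[f"calibrate:{eb}"] = {f"flag:{eb}"}     # still per EB
    graph["join"] = {f"calibrate:{eb}" for eb in eb_names}
    graph["image"] = {"join"}                         # runs on the joint data
    return graph


pipeline = build_pipeline(["eb1", "eb2"])
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

The per-EB branches have no edges between them, which is what allows them to run in parallel before the join.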





## Layers

- Application
- Functional Blocks
- Lower Level

This document will describe the functional block level. This level will be used by:

- CASA Developers
- Pipeline Developers
- Advanced users and algorithm developers

A functional block is the smallest reasonable unit of data reduction work.



## ngCASA Function Design




## Chunking

Zarr Chunking and Dask chunking


## ngCASA Pitfalls

## Loading Data from AWS S3

## Creating Graphs

## Dask Bokeh Dashboard

## More Information

Development environment, process and rules are inherited from CNGI and may be found here:

https://cngi-prototype.readthedocs.io/en/latest/development.html