<a href="https://colab.research.google.com/github/casangi/ngcasa/blob/master/docs/ngcasa_development.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Development


Radio interferometry data analysis applications and algorithms may be assembled from CNGI and ngCASA building blocks. A user may choose to implement their own analysis scripts, use a pre-packaged task similar to those in current CASA, or embed ngCASA and CNGI methods in a production pipeline DAG.

Note : The following examples represent preliminary design ideas that illustrate how ngCASA science applications may be assembled, whether the new infrastructure adequately addresses algorithmic needs, and how the new dask-based infrastructure may be best leveraged for scalable high performance computing. Questions raised via this initial exercise will guide the design of prototypes, continued evaluation of the chosen infrastructure, and the final function hierarchy and APIs.

### Notes/Questions

The initial exercise of defining the ngCASA API and constructing usage examples for Imaging, Calibration, Flagging and Simulation raised the following questions related to code structure, functional hierarchy, data formats, and usage of CNGI. 

#### Data Structures

- Visibilities zarr/xarray : The default list of array names is as defined in CNGI. To support versions of arrays, any application may add arrays as needed. Examples are corrected_data_1, corrected_data_2, flag_original, flag_auto, etc. All additional arrays must conform to the same meta-data and coordinates. As a consequence, all methods (even CNGI) need the ability to specify an array name to use, with the default being from the core definition.

- Image zarr/xarray : What is the default list of array names (i.e. image products) to store here ? The current specification in CNGI is tied closely to the casa6 tclean's list of outputs. However, not all applications/users will need this full list. Therefore, we should perhaps require that an image dataset contain a minimum of 1 array with a predefined default name, but then allow the appending of more. All image arrays within a set must share the same shape and coordinate metadata. Here too, all methods (even CNGI) that operate on images must have the ability to specify an array name to use.

- A Caltable : Need to define the default structure of a zarr/xarray dataset meant to store calibration solutions. Same requirements from above apply here too.

- Primary Beam models and Convolution Function Cache : We need a persistent data structure to store and use gridding convolution functions. The imaging prototype is developing one, related to the casa6 CFCache. A mechanism to convert between a Primary-Beam model database and a convolution function cache must also be designed.

- Flux Component List : We need a definition of a component list as an alternative to a raster image. This is required for compatibility with catalogues and translation between sky models that are not fitted/evaluated on a pixellated grid. 
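The array-versioning idea above can be sketched with plain xarray. This is only an illustration under stated assumptions: the dimension names, the toy dataset, and the helpers `add_version` and `flagged_fraction` are hypothetical and are not part of the CNGI or ngCASA API. It shows extra arrays being required to match an existing array's shape and coordinates, and a method that takes the array name as a parameter with a core default.

```python
import numpy as np
import xarray as xr

# Hypothetical miniature visibility dataset. Dimension names, sizes, and the
# "data"/"flag" array names are illustrative, not the CNGI schema.
dims = ("time", "baseline", "chan")
shape = (4, 3, 8)
rng = np.random.default_rng(0)
vis = xr.Dataset(
    {
        "data": (dims, rng.normal(size=shape)),
        "flag": (dims, np.zeros(shape, dtype=bool)),
    },
    coords={"time": np.arange(4), "baseline": np.arange(3), "chan": np.arange(8)},
)


def add_version(xds, name, values, like):
    """Return a new dataset with an extra array that reuses the dims and coords of `like`."""
    template = xds[like]
    values = np.asarray(values)
    if values.shape != template.shape:
        raise ValueError(f"{name!r} must have shape {template.shape}, got {values.shape}")
    new_array = xr.DataArray(values, dims=template.dims, coords=template.coords)
    return xds.assign({name: new_array})


def flagged_fraction(xds, flag_name="flag"):
    """Hypothetical method: the array to read is a parameter with a core default."""
    return float(xds[flag_name].mean())


# Keep the original flags untouched and store new versions alongside them.
vis = add_version(vis, "flag_auto", np.abs(vis["data"].values) > 2.0, like="flag")
vis = add_version(vis, "corrected_data_1", vis["data"].values * 1.1, like="data")

print(flagged_fraction(vis))                         # reads the default "flag"
print(flagged_fraction(vis, flag_name="flag_auto"))  # reads a named version
```

Rejecting arrays whose shape differs from the template is one simple way to enforce the "same meta-data and coordinates" rule; a real implementation would likely need to validate more than shape.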

#### Visualization requirements
To support flexible interactive visualization as part of algorithms and applications, the following must be evaluated/considered when the visualization module is designed/prototyped.

  - Array 'versioning' implemented as separately named arrays allows a user to visualize multiple versions (of flags, or corrected_data, or model_data). The visualization tool should allow the specification of what arrays to use/plot.

  - On-the-fly visualization must be possible by inserting a call to a plotter in between any sequence of steps in a user application script, using the currently-available xarray dataset (which may or may not have a zarr counterpart).  An example use case is autoflagging where it is often useful to inspect flags, discard them, try new autoflag parameters, re-inspect, and save flags only when satisfied. 
  
  - Visualization must apply to zarr/xarray datasets such that they are usable (interchangeably, within reason) for visibility datasets, caltables and images. This merges the roles of what has traditionally been separate in casa6 as plotms and the viewer and will get us (at least) scatter and raster displays for all kinds of datasets that we support. 

  - Any more ? 
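The inspect, discard, retry, and save cycle described for autoflagging can be sketched on an in-memory xarray dataset with no zarr counterpart. Everything here is an assumption made for illustration: the `amp` array, the simple median-based `autoflag`, and `inspect`, which returns summary numbers as a stand-in for a real plotter call.

```python
import numpy as np
import xarray as xr

# Toy amplitudes with one corrupted time sample (simulated interference).
rng = np.random.default_rng(42)
amp = rng.normal(loc=1.0, scale=0.1, size=(50, 32))
amp[10, :] += 5.0
vis = xr.Dataset({"amp": (("time", "chan"), amp)})


def autoflag(xds, n_sigma):
    """Hypothetical autoflag: flag samples far from the median in robust sigma units."""
    a = xds["amp"]
    median = a.median()
    robust_sigma = 1.4826 * abs(a - median).median()
    return abs(a - median) > n_sigma * robust_sigma


def inspect(flags):
    """Stand-in for a plotter call inserted between processing steps."""
    return {
        "fraction": float(flags.mean()),
        "flagged_per_time": flags.sum("chan").values,
    }


# Try parameters in turn. Trial flags live only in memory and are simply
# discarded when the inspection is not satisfactory.
accepted = None
for n_sigma in (1.0, 3.0, 6.0):
    trial = autoflag(vis, n_sigma)
    summary = inspect(trial)
    if summary["fraction"] < 0.05:  # hypothetical acceptance criterion
        accepted = trial
        break

# Save the flags as a named version only once satisfied.
if accepted is not None:
    vis = vis.assign(flag_auto=accepted)
```

In an interactive session the acceptance test would be the user's judgement after looking at the plot; the fixed threshold only keeps the sketch self-contained.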
  



      


### List of future prototypes
The following are suggestions to add to the list of CNGI and ngCASA prototypes that demonstrate that algorithms and science use cases can be implemented with minimal complexity and that they scale as expected. 

#### Infrastructure

- Join operation for zarr/xds datasets for visibilities, images, caltables, CF-caches, etc. A common use case is to combine data from multiple datasets that may not share meta-data or be consistent in shape. Some algorithms must be done on each dataset (or subset) separately, whereas others are to be done on an entire set. A related situation applies to image datasets.  We need a demonstration of how to achieve this. Also, what bookkeeping complexity can the framework actually eliminate ? Should the rules be the same for visibilities and images or not ? A prototype should result in a recommendation, a demonstration that it simplifies some book-keeping, and that it does not overly bloat the dataset (with NaNs).

 - Visibilities : Current CASA6 uses a mix of both of the following options, which is a significant source of inconsistency.
    - Option1 : Always work with small homogeneous subsets of data and have the application layer manage lists and merging of products (via loops). 
    - Option2 : Implement a join and then use data selection as needed. All applications then strictly can take single datasets as inputs. Algorithms such as imaging can view it as a single large dataset. However, for some algorithms (such as calibration solutions that must pay attention to different meta-data across subsets) the internal diversity will still have to be managed through loops inside the methods.
 
 - Images : What is a join operation for images, and how is it different from an image regrid ?
   - An example use case is the linear mosaic where a list of small images is combined onto a larger image grid (as an explicit weighted average of images, where the primary beam is used as the weight).  This is perhaps just an ngCASA or CNGI method that implements the math, along with a 'regrid' for coordinate re-mapping. 
   - A multi-field imaging run may consist of image sets of different shapes and sizes, all working together in the same imaging run. Should the application layer explicitly maintain lists, or should there be a 'join' that does it internally ? 
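As a starting point for the join question, the following sketch uses `xarray.concat` with an outer join on two toy visibility datasets whose channel coordinates only partly overlap, and measures how much NaN padding results. The helper `make_vis` and the dataset layout are hypothetical; this is not a proposed CNGI join, only a way to make the bloat concern concrete.

```python
import numpy as np
import xarray as xr


def make_vis(times, chans, fill):
    """Hypothetical helper: a tiny (time, chan) dataset filled with one value."""
    data = np.full((len(times), len(chans)), fill, dtype=float)
    return xr.Dataset(
        {"data": (("time", "chan"), data)},
        coords={"time": times, "chan": chans},
    )


# Two subsets with different shapes and only partly overlapping channels.
a = make_vis(times=[0, 1, 2], chans=[0, 1, 2, 3], fill=1.0)
b = make_vis(times=[3, 4], chans=[2, 3, 4, 5, 6, 7], fill=2.0)

# Option 1: the application keeps a list and loops over homogeneous subsets.
subsets = [a, b]
per_subset_mean = [float(s["data"].mean()) for s in subsets]

# Option 2: join once, then rely on data selection. The outer join pads
# channels that a subset does not have with NaN.
joined = xr.concat([a, b], dim="time", join="outer")
nan_fraction = float(joined["data"].isnull().mean())
print(joined["data"].shape, nan_fraction)  # 16 of the 40 cells are NaN padding

# Selecting the second subset back out and dropping all-NaN channels
# recovers its original shape.
recovered = joined.sel(time=[3, 4]).dropna("chan", how="all")
```

Even in this small case 40% of the joined array is padding, which is the kind of number a join prototype would need to report for realistic datasets.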
 
 
- Re-bin and Expand : The reverse operation of time/chan average or rebin/regrid operations. Visibility processing methods should (where mathematically possible) support transformations in both directions, with clear conventions implemented.  

  - E.g. vis.timeaverage() followed by 'edit flags' and then write back to original dataset ? FLAG value is to be copied during expansion.
  - E.g. Caltable solutions need an interpolation to get back to the original data resolution. Interpolation+Extrapolation.
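The two examples above can be sketched as follows, assuming a toy (time, chan) dataset and a fixed averaging width. The rebin uses xarray's `coarsen`; the expansion copies each coarse flag back to the fine samples it was averaged from; the caltable case uses `numpy.interp`, which interpolates linearly and holds the end values constant outside the solution range (one possible extrapolation convention, chosen here only for illustration).

```python
import numpy as np
import xarray as xr

n_time, n_chan, width = 6, 4, 3
vis = xr.Dataset(
    {
        "data": (
            ("time", "chan"),
            np.arange(n_time * n_chan, dtype=float).reshape(n_time, n_chan),
        ),
        "flag": (("time", "chan"), np.zeros((n_time, n_chan), dtype=bool)),
    },
    coords={"time": np.arange(n_time, dtype=float), "chan": np.arange(n_chan)},
)

# Rebin: average every `width` consecutive time samples.
averaged = vis[["data"]].coarsen(time=width).mean()

# Edit flags at the coarse resolution (second time bin, channel 2).
coarse_flag = xr.zeros_like(averaged["data"], dtype=bool)
coarse_flag[1, 2] = True

# Expand: copy each coarse flag to the fine samples that fed its bin,
# then restore the original time coordinate before merging.
bin_index = np.arange(n_time) // width
expanded = coarse_flag.isel(time=bin_index).assign_coords(time=vis["time"].values)
vis["flag"] = vis["flag"] | expanded

# Caltable-like case: solutions on the coarse time grid are interpolated back
# to the data resolution. Gain values are made up for the example.
gain_amp = np.array([1.0, 1.3])
fine_gain = np.interp(vis["time"].values, averaged["time"].values, gain_amp)
```

The integer-division bin index is the convention that ties the two directions together here; a real implementation would need to record that mapping (or the averaging parameters) alongside the averaged dataset.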



#### Algorithms 

- Imaging : Weighting, Gridding, Imaging (with mosaic?). 
 ( For imaging, this is enough.)

- Flagging : Implement manual flags using native python selections and demonstrate that this scales with realistic flag command sizes. (Autoflags are simpler, structurally). 

- Calibration : Implement a gain solve and apply.  Demonstrate a complicated sequence of data preprocessing and reshaping steps in between a series of cal solve/apply steps. 

- Visualization : Use off-the-shelf Python libraries to demonstrate interactive plotting of large volumes of data.

- Simulation :  Generate a simulated dataset to mimic a real observed dataset and show that meta-data are accurate. 

- Pipeline Usage Modes : Generation of outputs, editing of input datasets, array/flag names/versions, pipeline as a DAG.  String together the above prototypes for a full pipeline demo ! 
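For the flagging prototype, manual flags via native Python selections might look like the sketch below. The command format (a list of dicts keyed by coordinate name), the function `apply_manual_flags`, and the toy dataset are assumptions for illustration, not an existing ngCASA interface. Each command is applied with xarray's label-based `.loc` indexing, where slices include both end labels.

```python
import numpy as np
import xarray as xr

vis = xr.Dataset(
    {"flag": (("time", "baseline", "chan"), np.zeros((10, 3, 16), dtype=bool))},
    coords={
        "time": np.arange(10.0),
        "baseline": ["a1-a2", "a1-a3", "a2-a3"],
        "chan": np.arange(16),
    },
)

# Hypothetical manual flag commands expressed as native Python selections.
flag_commands = [
    {"time": slice(2.0, 4.0)},                        # a time range, all data
    {"baseline": "a1-a3", "chan": slice(0, 3)},       # band edge on one baseline
    {"baseline": ["a1-a2", "a2-a3"], "chan": [15]},   # one channel, two baselines
]


def apply_manual_flags(xds, commands, flag_name="flag"):
    """Return a new dataset with every selection in `commands` set to True."""
    flags = xds[flag_name].copy()
    for selection in commands:
        flags.loc[selection] = True
    return xds.assign({flag_name: flags})


flagged = apply_manual_flags(vis, flag_commands)
print(int(flagged["flag"].sum()), "samples flagged")
```

A scaling test would replace the three commands with a realistically sized list and a dask-backed dataset; whether a simple loop over `.loc` assignments remains adequate at that size is exactly what the prototype would need to show.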




## Layers

- Application
- Functional Blocks
- Lower Level

This document will describe the functional block level. This level will be used by:

- CASA Developers
- Pipeline Developers
- Advanced users and algorithm developers

A functional block is the smallest reasonable unit of data reduction work.



## ngCASA Function Design




## Chunking

Zarr Chunking and Dask chunking


## ngCASA Pitfalls

## Loading Data from AWS S3

## Creating Graphs

## Dask Bokeh Dashboard

## More Information

Development environment, process and rules are inherited from CNGI and may be found here:

https://cngi-prototype.readthedocs.io/en/latest/development.html