
# Development

The ngCASA development page is a living document that will continually change as the ***prototype*** is developed. The same functional approach as the [CNGI prototype](https://cngi-prototype.readthedocs.io/en/latest/development.html) is used. The API consists of stateless Python functions only. 

The proposed plan for radio astronomy data reduction software consists of three layers:

- Application Software
- ngCASA functions
- CNGI functions

The application software layer is what most astronomers will ultimately use. This document describes the building blocks that may be assembled to create the application software. Consequently, ngCASA and CNGI are aimed at the following users:

- CASA developers
- Pipeline developers
- Advanced users and algorithm developers

A subset of the algorithms in CASA will be implemented for the prototype. This is done so that the efficacy of the framework can be compared against CASA. 

## Framework 

The figure below shows the framework running on a single machine. Data is stored on disk in the Zarr format. The Dask scheduler manages N workers with M threads each. Each thread applies functions to a set of data chunks. Any code that can be called from Python can be parallelized with Dask, so C++, Numba or other custom HPC (high performance computing) code can be used. The Zarr data chunks (on disk) and the Dask data chunks do not have to be the same size; this is explained further in the chunking section. Data chunks can either be read from disk as they are needed for computation or be persisted in memory. Xarray labels the Dask arrays and wraps Dask functions.

The three core packages that make up the framework are designed to be compatible:

- Zarr
- Dask/Dask Distributed
- Xarray

![title1](https://raw.githubusercontent.com/casangi/ngcasa/master/docs/images/ngCASA_design.png)

 <font color='red'>To Do: Add cluster diagram.</font>


### Zarr
[Zarr](https://zarr.readthedocs.io/en/stable/spec/v2.html) provides an implementation of chunked, compressed, N-dimensional arrays that can be stored on disk. Zarr is the chosen data storage package for ngCASA. Compressed arrays can be organized into a group. For ngCASA the Zarr group hierarchy is structured so that it is compatible with the [Xarray](http://xarray.pydata.org/en/stable/io.html#zarr) dataset convention (a Zarr group contains all the data for an Xarray dataset):

```
Zarr Group 
    |-- .zattrs
    |-- .zgroup
    |-- .zmetadata

    |-- Array_1
    |    |-- .zattrs
    |    |-- .zarray
    |    |-- 0.0.0. ... 0
    |    |-- ... 
    |    |-- C_1.C_2.C_3. ... C_D 
    |-- ... 
    |-- Array_N_c
    |    |-- ...     
```

The group folder consists of three hidden metadata files (```.zattrs```, ```.zgroup``` and ```.zmetadata```) and a collection of folders. Each folder contains the data for a single array along with two hidden metadata files (```.zattrs```, ```.zarray```). The metadata files are used to create the lazily loaded representation of the dataset (only metadata is loaded). The data of each array is chunked and stored in files named ```x_1.x_2.x_3. ... .x_D``` where D is the number of dimensions and x_i is the chunk index for the ith dimension. For example, a three dimensional array with two chunks in the first dimension, three chunks in the second dimension and a single chunk in the third dimension would consist of the following files: ```0.0.0, 1.0.0, 0.1.0, 1.1.0, 0.2.0, 1.2.0```.
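
The chunk file naming can be reproduced with a few lines of Python. This is only a sketch: `chunk_keys` is a hypothetical helper (not part of the Zarr API), and the array shape and chunk shape are made-up values chosen to match the example above.

```python
import itertools
import math


def chunk_keys(shape, chunks):
    """List the chunk file names of an array that uses '.' separated chunk indices.

    Hypothetical helper for illustration; it is not part of the Zarr API.
    """
    # Number of chunks along each dimension (the last chunk may be partial).
    n_chunks = [math.ceil(s / c) for s, c in zip(shape, chunks)]
    # Every combination of per-dimension chunk indices corresponds to one file.
    index_ranges = [range(n) for n in n_chunks]
    return [".".join(map(str, idx)) for idx in itertools.product(*index_ranges)]


# Assumed example: two chunks in the first dimension, three in the second, one in the third.
keys = chunk_keys(shape=(4, 6, 5), chunks=(2, 2, 5))
print(keys)
```

The six names produced are the same set of files listed above, only in a different order.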

Group folder metadata files (encoded using JSON):

- ```.zgroup``` contains an integer defining the version of the storage [specification](https://zarr.readthedocs.io/en/stable/spec/v2.html). For example:
```
{
   "zarr_format": 2
}
```
- ```.zattrs``` describes data attributes that cannot be stored in an array (this file can be empty). For example:
```
{
    "append_zarr_time": 1.8485729694366455,
    "auto_correlations": 0,
    "ddi": 0,
    "freq_group": 0,
    "ref_frequency": 372520022603.63745,
    "total_bandwidth": 234366781.0546875
}
```      
- ```.zmetadata``` contains all the metadata from all other metadata files (both in the group directory and array subdirectories). This file does not have to exist; however, it can decrease the time it takes to create the lazily loaded representation of the dataset, since each metadata file does not have to be opened and read separately. If any of the files are changed or files are added, the ```.zmetadata``` file must be updated with ```zarr.consolidate_metadata(group_folder_name)```.
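
The idea behind consolidated metadata can be sketched with the standard library alone. The helper `collect_metadata` below is hypothetical and is not the Zarr implementation; the output layout (a `metadata` dictionary keyed by relative file path plus a `zarr_consolidated_format` entry) is an assumption modelled on Zarr v2, and the group content is made up.

```python
import json
import os
import tempfile


def collect_metadata(group_dir):
    """Gather every metadata file under group_dir into a single dictionary.

    Illustrative stand-in for the idea behind zarr.consolidate_metadata.
    """
    wanted = {".zgroup", ".zattrs", ".zarray"}
    metadata = {}
    for root, _dirs, files in os.walk(group_dir):
        for name in sorted(files):
            if name in wanted:
                path = os.path.join(root, name)
                # Keys are paths relative to the group folder, e.g. "chan/.zattrs".
                key = os.path.relpath(path, group_dir).replace(os.sep, "/")
                with open(path) as f:
                    metadata[key] = json.load(f)
    return {"zarr_consolidated_format": 1, "metadata": metadata}


# Build a tiny, made-up group on disk: group metadata plus one array folder.
group_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(group_dir, "chan"))
example_files = {
    ".zgroup": {"zarr_format": 2},
    ".zattrs": {"ddi": 0},
    "chan/.zattrs": {"_ARRAY_DIMENSIONS": ["chan"]},
}
for rel_path, content in example_files.items():
    with open(os.path.join(group_dir, *rel_path.split("/")), "w") as f:
        json.dump(content, f)

consolidated = collect_metadata(group_dir)
print(sorted(consolidated["metadata"]))
```

Reading this one dictionary replaces opening each metadata file separately, which is where the time saving comes from.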

Array folder metadata files (encoded using JSON):

- ```.zarray``` describes the data: how it is chunked, the compression used and array properties. For example the ```.zarray``` file for a DATA array (contains the visibility data) would contain:
```
{
    "chunks": [270,210,12,1],
    "compressor": {"blocksize": 0,"clevel": 2,"cname": "zstd","id": "blosc","shuffle": 0},
    "dtype": "<c16",
    "fill_value": null,
    "filters": null,
    "order": "C",
    "shape": [270,210,384,1],
    "zarr_format": 2
}
```
Zarr supports the compression algorithms implemented in [numcodecs](https://numcodecs.readthedocs.io/en/stable/blosc.html). The Blosc compressor used in the example above can be configured to use ‘zstd’, ‘blosclz’, ‘lz4’, ‘lz4hc’, ‘zlib’ or ‘snappy’.
- ```.zattrs``` is used to label the arrays so that an Xarray dataset can be created. The labeling creates three types of arrays:
    - Dimension coordinates are one dimensional arrays that are used for label based indexing and alignment of data variable arrays. The array name is the same as its sole dimension. For example the ```.zattrs``` file in the "chan" array would contain:
    ```
    {
    "_ARRAY_DIMENSIONS": ["chan"]
    }
    ```
    - Coordinates can have any number of dimensions and are a function of dimension coordinates. For example the "declination" coordinate is a function of the d0 and d1 dimension coordinates and its ```.zattrs``` file contains:
    ```
    {
    "_ARRAY_DIMENSIONS": ["d0","d1"]
    }
    ```
    - Data variables contain the data that dimension coordinates and coordinates label. For example the DATA data variable's ```.zattrs``` file contains:
    ```
    {
    "_ARRAY_DIMENSIONS": ["time","baseline","chan","pol"],
    "coordinates": "interval scan field state processor observation"
    }
    ```
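
The three array types can be told apart from the ```.zattrs``` content alone. The sketch below uses a hypothetical helper, `classify_arrays`, and a made-up subset of a visibility dataset; it assumes that non-dimension coordinates are the arrays named in a `coordinates` attribute, as in the DATA example above. Xarray performs the real decoding when a dataset is opened.

```python
def classify_arrays(array_attrs):
    """Split arrays into dimension coordinates, coordinates and data variables.

    array_attrs maps each array name to the content of its .zattrs file.
    Hypothetical helper for illustration only.
    """
    # Names listed in any "coordinates" attribute are non-dimension coordinates.
    referenced = set()
    for attrs in array_attrs.values():
        referenced.update(attrs.get("coordinates", "").split())

    kinds = {}
    for name, attrs in array_attrs.items():
        dims = attrs["_ARRAY_DIMENSIONS"]
        if dims == [name]:
            # One dimensional and named after its sole dimension.
            kinds[name] = "dimension coordinate"
        elif name in referenced:
            kinds[name] = "coordinate"
        else:
            kinds[name] = "data variable"
    return kinds


# Made-up subset of a visibility dataset.
example = {
    "chan": {"_ARRAY_DIMENSIONS": ["chan"]},
    "time": {"_ARRAY_DIMENSIONS": ["time"]},
    "scan": {"_ARRAY_DIMENSIONS": ["time"]},
    "DATA": {
        "_ARRAY_DIMENSIONS": ["time", "baseline", "chan", "pol"],
        "coordinates": "scan",
    },
}
kinds = classify_arrays(example)
print(kinds)
```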

Zarr Advantages:

- Wide variety of compression algorithms supported, see [numcodecs](https://numcodecs.readthedocs.io/en/stable/blosc.html). Each data variable can be compressed using a different compression algorithm. 
- Data can be chunked on any dimension and is out of the box compatible with Dask (the parallelism framework).
- Has a defined cloud interface.

<font color='red'>To Do: Add to list of advantages.</font>

<font color='red'>To Do: Add more detail how zarr and xarray can be used with cloud storage.</font>


### Dask and Dask Distributed

[Dask](https://dask.org/) is a flexible library for parallel computing in Python and [Dask Distributed](https://distributed.dask.org/en/latest/)  provides a centrally managed, distributed, dynamic task scheduler. A Dask task graph describes how tasks will be executed in parallel. The nodes in a task graph are made of Dask collections. Different Dask collections can be used in the same task graph. For the CNGI and ngCASA projects Dask array and Dask delayed collections will predominantly be used. Explanations from the Dask website about Dask arrays and Dask delayed:

- "[Dask array](https://docs.dask.org/en/latest/array.html) implements a subset of the NumPy ndarray interface using blocked algorithms, cutting up the large array into many small arrays. This lets us compute on arrays larger than memory using all of our cores. We coordinate these blocked algorithms using Dask graphs."

- "Sometimes problems don’t fit into one of the collections like dask.array or dask.dataframe. In these cases, users can parallelize custom algorithms using the simpler [dask.delayed](https://docs.dask.org/en/latest/delayed.html) interface. This allows one to create graphs directly with a light annotation of normal python code."
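
A minimal sketch of both collections is given below. The array shape, chunk sizes and the functions `double` and `add` are made up for illustration; in both cases nothing is executed until `compute()` is called.

```python
import dask
import dask.array as da

# Dask array: a 6 x 4 array split into blocks of 2 x 4 (three chunks along axis 0).
x = da.ones((6, 4), chunks=(2, 4))
total = x.sum()  # lazy: this only extends the task graph


# Dask delayed: wrap ordinary Python functions to build a custom graph.
@dask.delayed
def double(value):
    return 2 * value


@dask.delayed
def add(a, b):
    return a + b


custom = add(double(3), double(4))  # lazy as well

# Executing the graphs produces the actual values.
total_value = float(total.compute())
custom_value = custom.compute()
print(x.numblocks, total_value, custom_value)
```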

![title1](https://docs.dask.org/en/latest/_images/dask-overview.svg)

Image from [Dask](https://docs.dask.org/en/latest/) website.

Dask/Dask Distributed Advantages:

- Parallelism can be achieved over any dimension, since it is determined by the data chunking. 
- Data can either be persisted into memory or read from disk as needed. As processing is finished chunks can be saved. This enables the processing of data that is larger than memory.  
- Graphs can easily be combined with ```dask.compute()``` to run multiple functions in parallel. For example a cube and continuum image can be created in parallel.
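
The last point can be sketched as follows. The two results below are simple stand-ins for a cube and a continuum product (a per-channel scaling and a sum over channels), not actual imaging code, and the array sizes are made up.

```python
import dask
import dask.array as da

# Made-up (chan, y, x) data, chunked along the channel axis.
data = da.ones((4, 8, 8), chunks=(1, 8, 8))

# Two independent lazy results built from the same input.
cube = data * 2.0             # stand-in for a per-channel (cube) product
continuum = data.sum(axis=0)  # stand-in for a channel-combined (continuum) product

# A single call executes both graphs together, so shared work is scheduled once.
cube_result, continuum_result = dask.compute(cube, continuum)
print(cube_result.shape, continuum_result.shape)
```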

<font color='red'>To Do: Add more detail about Dask.</font>

<font color='red'>To Do: Add to list of advantages.</font>

### Xarray
[Xarray](http://xarray.pydata.org/en/stable/) provides N-Dimensional labeled arrays and datasets in Python. The Xarray dataset is used to organize and label related arrays. The Xarray website gives the following definition of a dataset:

- "[xarray.Dataset](http://xarray.pydata.org/en/stable/data-structures.html#dataset) is Xarray’s multi-dimensional equivalent of a DataFrame. It is a dict-like container of labeled arrays (DataArray objects) with aligned dimensions."

The Zarr disk format and Xarray dataset specification are compatible if the Xarray ```to_zarr``` and ```open_zarr``` functions are used. The compatibility is achieved by requiring the Zarr group to have labeling information in the metadata files and a depth of one (see the Zarr section above for further explanation).

When ```xarray.open_zarr(zarr_group_file)``` is used the array data is not loaded into memory (there is a parameter to force this); instead the metadata is loaded and a lazily loaded Xarray dataset is created. For example a dataset that consists of three dimension coordinates (dim_), two coordinates (coord_) and three data variables (DATA_) would have the following structure (this is what is displayed if a print command is used on a lazily loaded dataset):
```
<xarray.Dataset>
Dimensions:        (dim_1: d1, dim_2: d2, dim_3: d3)
Coordinates:
    coord_1      (dim_1) data_type_coord_1 dask.array<chunksize=(chunk_d1,), meta=np.ndarray>
    coord_2      (dim_1, dim_2)  data_type_coord_2 dask.array<chunksize=(chunk_d1,chunk_d2), meta=np.ndarray>
  * dim_1        (dim_1) data_type_dim_1 np.array([x_1, x_2, ..., x_d1])
  * dim_2        (dim_2) data_type_dim_2 np.array([y_1, y_2, ..., y_d2])
  * dim_3        (dim_3) data_type_dim_3 np.array([z_1, z_2, ..., z_d3])
  
Data variables:
    DATA_1       (dim_2) data_type_DATA_1 dask.array<chunksize=(chunk_d2,), meta=np.ndarray>
    DATA_2       (dim_1) data_type_DATA_2 dask.array<chunksize=(chunk_d1,), meta=np.ndarray>
    DATA_3       (dim_1, dim_2, dim_3) data_type_DATA_3 dask.array<chunksize=(chunk_d1,chunk_d2,chunk_d3), meta=np.ndarray>
    
Attributes:
    attr_1:        a1
    attr_2:        a2
```

- d1, d2, d3 are integers.
- a1, a2 can be any data type that can be stored in a JSON file.
- data_type_dim_, data_type_coord_ , data_type_DATA_ can be any acceptable NumPy array data type.

Explanations of dimension coordinates, coordinates and data variables from the [Xarray website](http://xarray.pydata.org/en/stable/data-structures.html#coordinates):

- dim_ "Dimension coordinates are one dimensional coordinates with a name equal to their sole dimension (marked by * when printing a dataset or data array). They are used for label based indexing and alignment, like the index found on a pandas DataFrame or Series. Indeed, these dimension coordinates use a pandas.Index internally to store their values."

- coord_ "Coordinates (non-dimension) are variables that contain coordinate data, but are not a dimension coordinate. They can be multidimensional (see Working with Multidimensional Coordinates), and there is no relationship between the name of a non-dimension coordinate and the name(s) of its dimension(s). Non-dimension coordinates can be useful for indexing or plotting; otherwise, xarray does not make any direct use of the values associated with them. They are not used for alignment or automatic indexing, nor are they required to match when doing arithmetic."

- DATA_ The array that dimension coordinates and coordinates label.
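
A small in-memory dataset with the same three kinds of arrays can be built directly. This is a sketch with made-up names and values; it uses NumPy-backed arrays rather than the Dask-backed arrays that ```open_zarr``` returns.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={
        "DATA_1": (("dim_2",), np.zeros(3)),
        "DATA_3": (("dim_1", "dim_2"), np.ones((2, 3))),
    },
    coords={
        "dim_1": [10.0, 20.0],                      # dimension coordinate
        "dim_2": ["a", "b", "c"],                   # dimension coordinate
        "coord_1": (("dim_1",), np.array([0, 1])),  # non-dimension coordinate
    },
    attrs={"attr_1": "a1"},
)

# Label based selection uses the dimension coordinates.
row = ds["DATA_3"].sel(dim_1=20.0)
print(ds)
print(row.values)
```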


<font color='red'>To Do: Add list of advantages.</font>

### Numba

[Numba](http://numba.pydata.org/) is an open source JIT (Just In Time) compiler that translates a subset of Python and NumPy code into fast machine code. Numba is used in ngCASA for functions that have long nested for loops (for example the gridder code). Numba can be used by adding the @jit decorator above a function:

```python
from numba import jit

@jit(nopython=True, cache=True, nogil=True)
def my_func(input_parms):
    # Placeholder body: put the numerical loops to be compiled here.
    return input_parms
```
    
Explanation of the jit arguments from the [Numba documentation](https://numba.pydata.org/numba-doc/latest/user/jit.html):

- nopython "The behaviour of the nopython compilation mode is to essentially compile the decorated function so that it will run entirely without the involvement of the Python interpreter. This is the recommended and best-practice way to use the Numba jit decorator as it leads to the best performance."
- cache "To avoid compilation times each time you invoke a Python program, you can instruct Numba to write the result of function compilation into a file-based cache."
- nogil "Whenever Numba optimizes Python code to native code that only works on native types and variables (rather than Python objects), it is not necessary anymore to hold Python’s global interpreter lock (GIL)"
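
The kind of nested loop that benefits from Numba can be sketched with a toy nearest-neighbour gridder. This is not the ngCASA gridder; the function name `grid_nearest`, the array layout and the sample values are all made up. The import fallback is an assumption added so that the sketch still runs (slowly) when Numba is not installed.

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # Fallback: a do-nothing decorator so the sketch runs without Numba.
    def jit(*args, **kwargs):
        return lambda func: func


@jit(nopython=True, nogil=True)
def grid_nearest(grid, u_pix, v_pix, values):
    """Accumulate each sample onto its nearest grid cell (toy gridder)."""
    n_chan, n_v, n_u = grid.shape
    for i in range(values.shape[0]):       # loop over samples
        u = int(np.floor(u_pix[i] + 0.5))
        v = int(np.floor(v_pix[i] + 0.5))
        # Skip samples that fall outside the grid.
        if u < 0 or u >= n_u or v < 0 or v >= n_v:
            continue
        for c in range(n_chan):            # loop over channels
            grid[c, v, u] += values[i, c]
    return grid


grid = np.zeros((2, 4, 4))
u_pix = np.array([0.2, 2.6, 9.0])  # the last sample lies outside the grid
v_pix = np.array([1.1, 3.4, 0.0])
values = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
grid_nearest(grid, u_pix, v_pix, values)
print(grid.sum())
```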
    
A 5 minute guide to starting with Numba can be found [here](http://numba.pydata.org/numba-doc/latest/user/5minguide.html). 

Numba can also run code on GPUs ([CUDA GPUs](http://numba.pydata.org/numba-doc/latest/cuda/index.html) and [AMD ROC GPUs](http://numba.pydata.org/numba-doc/latest/roc/index.html)); this will be explored in the future.

### Chunking

In the Zarr and Dask framework, data is broken up into chunks. The Dask chunk size can be specified in the ```xarray.open_zarr``` call using the ```chunks``` parameter. The Dask and Zarr chunking do not have to be the same. The [Zarr chunking](https://zarr.readthedocs.io/en/stable/tutorial.html) is what is used on disk and the [Dask chunking](https://docs.dask.org/en/latest/array-chunks.html) is used during parallel computation. However, it is more efficient for the Dask chunk size to be equal to or a multiple of the Zarr chunk size (to stop multiple reads of the same data). This hierarchy of chunking allows for flexible and efficient algorithm development. For example cube imaging is more memory efficient if chunking is along the channel axis (the [benchmarking](https://ngcasa.readthedocs.io/en/latest/benchmark.html) example demonstrates this). Note, chunking can be done in any combination of dimensions.
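
The cost of misaligned chunking can be illustrated by counting reads along one axis. The helper `count_chunk_reads` is hypothetical, and the count assumes that each Dask chunk is loaded independently with no caching of Zarr chunks. The axis length of 384 and Zarr chunk size of 12 are taken from the ```.zarray``` example in the Zarr section.

```python
def count_chunk_reads(length, zarr_chunk, dask_chunk):
    """Count Zarr chunk reads along one axis when every Dask chunk is loaded once.

    Hypothetical helper; assumes no caching between Dask chunks.
    """
    reads = 0
    for start in range(0, length, dask_chunk):
        stop = min(start + dask_chunk, length)
        first = start // zarr_chunk       # first Zarr chunk touched
        last = (stop - 1) // zarr_chunk   # last Zarr chunk touched
        reads += last - first + 1
    return reads


aligned_reads = count_chunk_reads(384, zarr_chunk=12, dask_chunk=24)
misaligned_reads = count_chunk_reads(384, zarr_chunk=12, dask_chunk=8)
print(aligned_reads, misaligned_reads)
```

With a Dask chunk size of 24 (a multiple of 12) each of the 32 Zarr chunks is read once, while a Dask chunk size of 8 doubles the number of reads because some Zarr chunks are shared between neighbouring Dask chunks.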

## Zarr Data Format

Data is stored in Zarr files with Xarray formatting (which is specified in the metadata). The data is stored as N-dimensional arrays that can be chunked on any axis. The ngCASA/CNGI data formats are:

- vis.zarr Visibility data (measurement set).
- img.zarr Images (also used for convolution function cache).
- cal.zarr Calibration tables.
- tel.zarr Telescope layout (used for simulations).
- Other formats will be added as needed. 

CNGI will provide functions to convert between the new Zarr formats and legacy formats (such as the measurement set, FITS files,  ASDM, etc.). The current implementations of the Zarr formats are not fully developed and are only sufficient for prototype development. 

Zarr data format rules:

- Data variable names are always uppercase and dimension coordinates are lowercase.
- Data variable names are not fixed but defaults exist, for example the default name of the data variable that contains the uvw data is UVW. This flexibility allows for the easy inclusion of more advanced algorithms (for example multi-term deconvolution that produces Taylor term images).
- Dimension coordinates and coordinates have fixed names.  
- Dimension coordinates are not chunked.
- Data variables and coordinates are chunked and should have consistent chunking with each other.
- Any number of data variables can be in a dataset but must share a common coordinate and chunking system. 
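
The naming rule in the first item is easy to check programmatically. The helper below, `check_names`, is hypothetical and only covers the upper/lower case convention, not the chunking rules.

```python
def check_names(dim_coords, data_vars):
    """Return a list of messages for names that break the case convention.

    Hypothetical helper for illustration only.
    """
    problems = []
    for name in dim_coords:
        if name != name.lower():
            problems.append(f"dimension coordinate '{name}' is not lowercase")
    for name in data_vars:
        if name != name.upper():
            problems.append(f"data variable '{name}' is not uppercase")
    return problems


ok = check_names(["time", "baseline", "chan", "pol"], ["DATA", "UVW", "WEIGHT"])
bad = check_names(["Time"], ["data"])
print(ok)
print(bad)
```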

<font color='red'>To Decide: Should the cf cache have its own format or just use img.zarr.</font>

<font color='red'>To Decide: Should the number of dimensions be fixed.</font>

<font color='red'>To Do: Formalizing formats and writing formal definition documentation.</font>

### vis.zarr

The vis.zarr format is a replacement for the measurement set (ms). The ms uses tables while vis.zarr uses N-dimensional arrays. For example the DATA data variable (contains the visibilities) is a four dimensional array with axes time, baseline, chan, pol. The vis.zarr is different from the other Zarr formats in that it is a collection of datasets (not just one). Multiple datasets arise since the dimensions of different polarization and spectral window setups differ. The data description identifier (ddi) provides an index that combines the spectral window identifier and polarization identifier. The idea of ddi comes from the [ms v2](https://casacore.github.io/casacore-notes/229.pdf) definition ([ms v3](https://casacore.github.io/casacore-notes/264.pdf) deprecates this). 
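
How a ddi combines the two identifiers can be sketched with a small lookup table. The table content, the key names `spw_id` and `pol_id` and the helper `find_ddi` are all made up for illustration; the only assumption carried over from the text is that each ddi corresponds to one spectral window and polarization setup pair.

```python
# Made-up lookup table: the list index is the ddi.
ddi_table = [
    {"spw_id": 0, "pol_id": 0},
    {"spw_id": 1, "pol_id": 0},
    {"spw_id": 1, "pol_id": 1},
]


def find_ddi(spw_id, pol_id, table):
    """Return the ddi whose spectral window and polarization setup match, or None."""
    for ddi, row in enumerate(table):
        if row["spw_id"] == spw_id and row["pol_id"] == pol_id:
            return ddi
    return None


print(find_ddi(1, 1, ddi_table))  # ddi for spectral window 1 with polarization setup 1
print(ddi_table[1])               # spectral window and polarization setup of ddi 1
```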

Each folder within a vis.zarr file is a dataset:
```
vis.zarr
    |-- 0
    |-- ...
    |-- n_ddi
    |-- global
```
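
The folder layout can be inspected with the standard library. This sketch assumes that ddi datasets are stored in folders named with integers and that the shared dataset is in a folder named `global`, as in the layout above; the helper `list_vis_datasets` and the folder name `example_vis.zarr` are hypothetical.

```python
import os
import tempfile


def list_vis_datasets(vis_path):
    """Return the sorted ddi numbers and whether a global dataset is present.

    Hypothetical helper for illustration only.
    """
    entries = [e for e in os.listdir(vis_path)
               if os.path.isdir(os.path.join(vis_path, e))]
    ddis = sorted(int(e) for e in entries if e.isdigit())
    return ddis, "global" in entries


# Build an empty, made-up vis.zarr folder layout to run the helper on.
vis_path = os.path.join(tempfile.mkdtemp(), "example_vis.zarr")
for name in ("0", "1", "global"):
    os.makedirs(os.path.join(vis_path, name))

ddis, has_global = list_vis_datasets(vis_path)
print(ddis, has_global)
```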

The global dataset contains all the information that is common between all ddis. The CNGI and ngCASA functions can only operate on a single ddi. In the application layer multiple ddis can be processed. This is possible because CNGI/ngCASA functions produce graphs that can be combined. 

Example of a ddi dataset in a vis.zarr file:
```
<xarray.Dataset>
Dimensions:        (baseline: 210, chan: 7, pair: 2, pol: 1, receptor: 2, spw: 1, time: 270, uvw_index: 3)
Coordinates:
    antennas       (baseline, pair) int32 dask.array<chunksize=(210, 2), meta=np.ndarray>
  * baseline       (baseline) int64 0 1 2 3 4 5 6 ... 204 205 206 207 208 209
  * chan           (chan) float64 3.725e+11 3.725e+11 ... 3.725e+11 3.725e+11
    chan_width     (chan) float64 dask.array<chunksize=(1,), meta=np.ndarray>
    corr_product   (receptor, pol) int32 dask.array<chunksize=(2, 1), meta=np.ndarray>
    effective_bw   (chan) float64 dask.array<chunksize=(1,), meta=np.ndarray>
    field          (time) <U6 dask.array<chunksize=(270,), meta=np.ndarray>
    interval       (time) float64 dask.array<chunksize=(270,), meta=np.ndarray>
    observation    (time) <U22 dask.array<chunksize=(270,), meta=np.ndarray>
  * pol            (pol) int32 9
    processor      (time) <U14 dask.array<chunksize=(270,), meta=np.ndarray>
    resolution     (chan) float64 dask.array<chunksize=(1,), meta=np.ndarray>
    scan           (time) int32 dask.array<chunksize=(270,), meta=np.ndarray>
  * spw            (spw) int32 0
    state          (time) <U82 dask.array<chunksize=(270,), meta=np.ndarray>
  * time           (time) datetime64[ns] 2012-11-19T07:56:26.544000626 ... 2012-11-19T09:07:28.607999802
  * uvw_index      (uvw_index) <U2 'uu' 'vv' 'ww'
Dimensions without coordinates: pair, receptor
Data variables:
    ARRAY_ID       (time, baseline) int32 dask.array<chunksize=(270, 210), meta=np.ndarray>
    DATA           (time, baseline, chan, pol) complex128 dask.array<chunksize=(270, 210, 1, 1), meta=np.ndarray>
    EXPOSURE       (time, baseline) float64 dask.array<chunksize=(270, 210), meta=np.ndarray>
    FEED1          (time, baseline) int32 dask.array<chunksize=(270, 210), meta=np.ndarray>
    FEED2          (time, baseline) int32 dask.array<chunksize=(270, 210), meta=np.ndarray>
    FLAG           (time, baseline, chan, pol) bool dask.array<chunksize=(270, 210, 1, 1), meta=np.ndarray>
    FLAG_ROW       (time, baseline) bool dask.array<chunksize=(270, 210), meta=np.ndarray>
    SIGMA          (time, baseline, pol) float64 dask.array<chunksize=(270, 210, 1), meta=np.ndarray>
    TIME_CENTROID  (time, baseline) float64 dask.array<chunksize=(270, 210), meta=np.ndarray>
    UVW            (time, baseline, uvw_index) float64 dask.array<chunksize=(270, 210, 3), meta=np.ndarray>
    WEIGHT         (time, baseline, pol) float64 dask.array<chunksize=(270, 210, 1), meta=np.ndarray>
Attributes:
    assoc_spw_id:       []
    auto_correlations:  0
    bbc_no:             2
    ddi:                0
    freq_group:         0
    freq_group_name:    
    if_conv_chain:      0
    meas_freq_ref:      1
    name:               ALMA_RB_07#BB_2#SW-01#FULL_RES
    net_sideband:       2
    num_chan:           7
    ref_frequency:      372520022603.63745
    total_bandwidth:    4272311.112915039
```

Example of a global dataset in a vis.zarr file:
```
<xarray.Dataset>
Dimensions:                 (antenna: 26, d1: 1, d2: 2, d3: 3, feed: 1, field: 1, observation: 1, processor: 3, receptors: 3, source: 1, spw: 1, state: 20, time_fcmd: 5067, time_hist: 1134)
Coordinates:
  * antenna                 (antenna) int64 0 1 2 3 4 5 6 ... 20 21 22 23 24 25
  * feed                    (feed) int64 0
  * field                   (field) <U6 'TW Hya'
  * observation             (observation) <U22 'uid://A002/X327408/X6f'
  * processor               (processor) <U14 'CORRELATOR (0)' ... 'CORRELATOR (2)'
  * receptors               (receptors) int64 0 1 2
  * source                  (source) int32 0
  * spw                     (spw) int64 0
  * state                   (state) <U82 'CALIBRATE_BANDPASS#ON_SOURCE,CALIBRATE_PHASE#ON_SOURCE,CALIBRATE_WVR#ON_SOURCE (0)' ... 'OBSERVE_TARGET#ON_SOURCE (19)'
  * time_fcmd               (time_fcmd) datetime64[ns] 2012-11-19T07:28:06.096000671 ... 2012-11-19T09:11:31.978034973
  * time_hist               (time_hist) datetime64[ns] 2012-11-28T06:33:06.728000641 ... 2020-04-02T18:32:00.925000193
Dimensions without coordinates: d1, d2, d3
Data variables:
    ANT_DISH_DIAMETER       (antenna) float64 dask.array<chunksize=(26,), meta=np.ndarray>
    ANT_FLAG_ROW            (antenna) bool dask.array<chunksize=(26,), meta=np.ndarray>
    ANT_MOUNT               (antenna) <U16 dask.array<chunksize=(26,), meta=np.ndarray>
    ANT_NAME                (antenna) <U16 dask.array<chunksize=(26,), meta=np.ndarray>
    ANT_OFFSET              (antenna, d3) float64 dask.array<chunksize=(26, 3), meta=np.ndarray>
    ANT_POSITION            (antenna, d3) float64 dask.array<chunksize=(26, 3), meta=np.ndarray>
    ANT_STATION             (antenna) <U16 dask.array<chunksize=(26,), meta=np.ndarray>
    ANT_TYPE                (antenna) <U16 dask.array<chunksize=(26,), meta=np.ndarray>
    FCMD_APPLICATION        (time_hist) <U16 dask.array<chunksize=(1134,), meta=np.ndarray>
    FCMD_APPLIED            (time_fcmd) bool dask.array<chunksize=(5067,), meta=np.ndarray>
    FCMD_COMMAND            (time_fcmd) <U75 dask.array<chunksize=(1267,), meta=np.ndarray>
    FCMD_INTERVAL           (time_fcmd) float64 dask.array<chunksize=(5067,), meta=np.ndarray>
    FCMD_LEVEL              (time_fcmd) int32 dask.array<chunksize=(5067,), meta=np.ndarray>
    FCMD_MESSAGE            (time_hist) <U188 dask.array<chunksize=(284,), meta=np.ndarray>
    FCMD_OBJECT_ID          (time_hist) int32 dask.array<chunksize=(1134,), meta=np.ndarray>
    FCMD_OBSERVATION_ID     (time_hist) int32 dask.array<chunksize=(1134,), meta=np.ndarray>
    FCMD_ORIGIN             (time_hist) <U24 dask.array<chunksize=(1134,), meta=np.ndarray>
    FCMD_PRIORITY           (time_hist) <U16 dask.array<chunksize=(1134,), meta=np.ndarray>
    FCMD_REASON             (time_fcmd) <U47 dask.array<chunksize=(1267,), meta=np.ndarray>
    FCMD_SEVERITY           (time_fcmd) int32 dask.array<chunksize=(5067,), meta=np.ndarray>
    FCMD_TYPE               (time_fcmd) <U16 dask.array<chunksize=(2534,), meta=np.ndarray>
    FEED_BEAM_ID            (spw, antenna, feed) int32 dask.array<chunksize=(1, 26, 1), meta=np.ndarray>
    FEED_BEAM_OFFSET        (spw, antenna, feed, d2, receptors) float64 dask.array<chunksize=(1, 26, 1, 2, 3), meta=np.ndarray>
    FEED_INTERVAL           (spw, antenna, feed) float64 dask.array<chunksize=(1, 26, 1), meta=np.ndarray>
    FEED_NUM_RECEPTORS      (spw, antenna, feed) int32 dask.array<chunksize=(1, 26, 1), meta=np.ndarray>
    FEED_POLARIZATION_TYPE  (spw, antenna, feed, receptors) <U16 dask.array<chunksize=(1, 26, 1, 3), meta=np.ndarray>
    FEED_POL_RESPONSE       (spw, antenna, feed, receptors, receptors) complex128 dask.array<chunksize=(1, 26, 1, 3, 3), meta=np.ndarray>
    FEED_POSITION           (spw, antenna, feed, d3) float64 dask.array<chunksize=(1, 26, 1, 3), meta=np.ndarray>
    FEED_RECEPTOR_ANGLE     (spw, antenna, feed, receptors) float64 dask.array<chunksize=(1, 26, 1, 3), meta=np.ndarray>
    FEED_TIME               (spw, antenna, feed) datetime64[ns] dask.array<chunksize=(1, 26, 1), meta=np.ndarray>
    FIELD_CODE              (field) <U16 dask.array<chunksize=(1,), meta=np.ndarray>
    FIELD_DELAY_DIR         (field, d2, d1) float64 dask.array<chunksize=(1, 2, 1), meta=np.ndarray>
    FIELD_DelayDir_Ref      (field) int32 dask.array<chunksize=(1,), meta=np.ndarray>
    FIELD_FLAG_ROW          (field) bool dask.array<chunksize=(1,), meta=np.ndarray>
    FIELD_NUM_POLY          (field) int32 dask.array<chunksize=(1,), meta=np.ndarray>
    FIELD_PHASE_DIR         (field, d2, d1) float64 dask.array<chunksize=(1, 2, 1), meta=np.ndarray>
    FIELD_PhaseDir_Ref      (field) int32 dask.array<chunksize=(1,), meta=np.ndarray>
    FIELD_REFERENCE_DIR     (field, d2, d1) float64 dask.array<chunksize=(1, 2, 1), meta=np.ndarray>
    FIELD_RefDir_Ref        (field) int32 dask.array<chunksize=(1,), meta=np.ndarray>
    FIELD_SOURCE_ID         (field) int32 dask.array<chunksize=(1,), meta=np.ndarray>
    FIELD_TIME              (field) datetime64[ns] dask.array<chunksize=(1,), meta=np.ndarray>
    OBS_FLAG_ROW            (observation) bool dask.array<chunksize=(1,), meta=np.ndarray>
    OBS_OBSERVER            (observation) <U16 dask.array<chunksize=(1,), meta=np.ndarray>
    OBS_RELEASE_DATE        (observation) float64 dask.array<chunksize=(1,), meta=np.ndarray>
    OBS_SCHEDULE_TYPE       (observation) <U16 dask.array<chunksize=(1,), meta=np.ndarray>
    OBS_TELESCOPE_NAME      (observation) <U16 dask.array<chunksize=(1,), meta=np.ndarray>
    OBS_TIME_RANGE          (observation, d2) datetime64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray>
    PROC_FLAG_ROW           (processor) bool dask.array<chunksize=(3,), meta=np.ndarray>
    PROC_MODE_ID            (processor) int32 dask.array<chunksize=(3,), meta=np.ndarray>
    PROC_SUB_TYPE           (processor) <U21 dask.array<chunksize=(3,), meta=np.ndarray>
    PROC_TYPE_ID            (processor) int32 dask.array<chunksize=(3,), meta=np.ndarray>
    SRC_CALIBRATION_GROUP   (spw, source) int32 dask.array<chunksize=(1, 1), meta=np.ndarray>
    SRC_CODE                (spw, source) <U16 dask.array<chunksize=(1, 1), meta=np.ndarray>
    SRC_DIRECTION           (spw, source, d2) float64 dask.array<chunksize=(1, 1, 2), meta=np.ndarray>
    SRC_INTERVAL            (spw, source) float64 dask.array<chunksize=(1, 1), meta=np.ndarray>
    SRC_NAME                (spw, source) <U16 dask.array<chunksize=(1, 1), meta=np.ndarray>
    SRC_NUM_LINES           (spw, source) int32 dask.array<chunksize=(1, 1), meta=np.ndarray>
    SRC_PROPER_MOTION       (spw, source, d2) float64 dask.array<chunksize=(1, 1, 2), meta=np.ndarray>
    SRC_TIME                (spw, source) datetime64[ns] dask.array<chunksize=(1, 1), meta=np.ndarray>
    STATE_CAL               (state) float64 dask.array<chunksize=(20,), meta=np.ndarray>
    STATE_FLAG_ROW          (state) bool dask.array<chunksize=(20,), meta=np.ndarray>
    STATE_LOAD              (state) float64 dask.array<chunksize=(20,), meta=np.ndarray>
    STATE_REF               (state) bool dask.array<chunksize=(20,), meta=np.ndarray>
    STATE_SIG               (state) bool dask.array<chunksize=(20,), meta=np.ndarray>
    STATE_SUB_SCAN          (state) int32 dask.array<chunksize=(20,), meta=np.ndarray>
Attributes:
    asdm_antennaid:               ['Antenna_0', 'Antenna_1', 'Antenna_2', 'An...
    asdm_antennamake:             ['AEM_12', 'AEM_12', 'AEM_12', 'AEM_12', 'A...
    asdm_antennaname:             ['DA41', 'DA42', 'DA44', 'DA45', 'DA46', 'D...
    asdm_antennatype:             ['GROUND_BASED', 'GROUND_BASED', 'GROUND_BA...
    asdm_assocantennaid:          ['', '', '', '', '', '', '', '', '', '', ''...
    asdm_caldataid:               ['CalData_0', 'CalData_0', 'CalData_0', 'Ca...
    asdm_calreductionid:          ['CalReduction_0', 'CalReduction_0', 'CalRe...
    asdm_chanfreq:                [[[184550000000.0], [186550000000.0], [1888...
    asdm_chanwidth:               [[[1500000000.0], [2500000000.0], [20000000...
    asdm_dishdiameter:            [12.0, 12.0, 12.0, 12.0, 12.0, 12.0, 12.0, ...
    asdm_drypath:                 [[[1.5075339078903198]], [[1.50631928443908...
    asdm_endvalidtime:            [4860030637.2595, 4860030637.2595, 48600306...
    asdm_freqlo:                  [[[97963275000.0]], [[97963275000.0], [9968...
    asdm_frequencyband:           ['UNSPECIFIED', 'ALMA_RB_03', 'ALMA_RB_03',...
    asdm_inputantennanames:       [[['DA41']], [['DA42']], [['DA44']], [['DA4...
    asdm_name:                    ['DA41', 'DA42', 'DA44', 'DA45', 'DA46', 'D...
    asdm_numchan:                 [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ...
    asdm_numinputantennas:        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
    asdm_numlo:                   [1, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3, ...
    asdm_numpoly:                 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
    asdm_offset:                  [[[0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0...
    asdm_pathcoeff:               [[[[[5.696597872884013e-05]], [[2.828374090...
    asdm_polyfreqlimits:          [[[67000000000.0], [90000000000.0]], [[6700...
    asdm_position:                [[[9.9e-05], [-0.000115], [7.500033]], [[3....
    asdm_receiverid:              [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
    asdm_receiversideband:        ['NOSB', 'TSB', 'TSB', 'TSB', 'TSB', 'TSB',...
    asdm_reftemp:                 [[[[119.45744705200195], [48.73897772568922...
    asdm_sidebandlo:              [[['DSB']], [['LSB'], ['LSB'], ['LSB']], [[...
    asdm_spectralwindowid:        ['SpectralWindow_0', 'SpectralWindow_27', '...
    asdm_startvalidtime:          [4860026976.564501, 4860026976.564501, 4860...
    asdm_stationid:               ['Station_0', 'Station_1', 'Station_2', 'St...
    asdm_time:                    [4860026863.941999, 4860026863.941999, 4860...
    asdm_timeinterval:            [[[4860026863.941999], [4363345172.912776]]...
    asdm_type:                    ['ANTENNA_PAD', 'ANTENNA_PAD', 'ANTENNA_PAD...
    asdm_water:                   [0.00043721461364834303, 0.0004537176442467...
    asdm_wetpath:                 [[[0.003352516097947955]], [[0.003476256737...
    asdm_wvrmethod:               ['ATM_MODEL', 'ATM_MODEL', 'ATM_MODEL', 'AT...
    cald_antenna_id:              [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...
    cald_cal_load_names:          [[['AMBIENT_LOAD'], ['HOT_LOAD']], [['AMBIE...
    cald_feed_id:                 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
    cald_interval:                [4363345170.433152, 4363345169.85944, 43633...
    cald_num_cal_load:            [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
    cald_num_receptor:            [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
    cald_spectral_window_id:      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
    cald_temperature_load:        [[[16.850000381469727], [83.25000762939453]...
    cald_time:                    [7041699451.638201, 7041699451.925056, 7041...
    weat_antenna_id:              [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1...
    weat_dew_point:               [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0....
    weat_dew_point_flag:          [False, False, False, False, False, False, ...
    weat_interval:                [0.048, 0.048, 0.048, 0.048, 0.048, 0.048, ...
    weat_ns_wx_station_id:        [26, 27, 26, 27, 26, 27, 26, 27, 26, 27, 26...
    weat_ns_wx_station_position:  [[2225262.12, -5440307.3, -2480962.57], [22...
    weat_pressure:                [555.3649291992188, 555.3472290039062, 555....
    weat_pressure_flag:           [False, False, False, False, False, False, ...
    weat_rel_humidity:            [7.26800012588501, 7.462800025939941, 7.296...
    weat_rel_humidity_flag:       [False, False, False, False, False, False, ...
    weat_temperature:             [269.92529296875, 269.8935852050781, 269.96...
    weat_temperature_flag:        [False, False, False, False, False, False, ...
    weat_time:                    [4860026989.023999, 4860026989.023999, 4860...
    weat_wind_direction:          [2.96705961227417, 1.1519173383712769, 2.44...
    weat_wind_direction_flag:     [False, False, False, False, False, False, ...
    weat_wind_speed:              [1.5, 1.899999976158142, 2.700000047683716,...
    weat_wind_speed_flag:         [False, False, False, False, False, False, ...
```

<font color='red'>To Do: Clean up and formalize the vis.zarr format (taking into account the changes made in ms v3).
</font>

<font color='red'>To Decide: Special consideration should be given to phased array feeds and single dish data.</font>

<font color='red'>To Decide: Should DATA be the default name for the data variable that contains the visibilities.</font>

<font color='red'>To Decide: Remove WEIGHT and WEIGHT_SPECTRUM and have only one array WEIGHT (has full dimensionality time, baseline, chan, pol).</font>

<font color='red'>To Decide: Remove FLAG_ROW.</font>

<font color='red'>To Decide: The top directory of the vis.zarr file is not a Zarr group. Should we make it a group (this would add metadata). We would then create an xr.open_zarr alternative that can open a set of datasets. </font>

<font color='red'>To Decide: Should we remove the idea of a ddi (as was done in ms v3)? The top level folder format would change from ddi_num to pol_num.spectral_window_num </font>

<font color='red'>To Decide: Maybe we should build a dataset of datasets interface.</font>
Possible interface:
```
<ngCASA.Dataset>
Datasets :
    0.0       xarray.dataset
    0.1       xarray.dataset
    global    xarray.dataset
Attributes:
    Telescope :    ALMA
```

<font color='red'>To Decide: Split up global.</font>

<font color='red'>To Decide: A default list of array names is as defined in CNGI. But, to get the ability to have versions of arrays, any application may add arrays as needed. Examples are corrected_data_1, corrected_data_2, flag_original, flag_auto, etc. All additional arrays must conform to the same meta-data and coordinates. But now, all methods (even CNGI) need the ability to specify an array name to use, with the default being from the core definition.</font> 


### img.zarr

Example of an img.zarr file
```
<xarray.Dataset>
Dimensions:          (chan: 7, d0: 200, d1: 400, pol: 1)
Coordinates:
  * chan             (chan) float64 3.725e+11 3.725e+11 ... 3.725e+11 3.725e+11
    declination      (d0, d1) float64 dask.array<chunksize=(200, 400), meta=np.ndarray>
  * pol              (pol) float64 9.0
    right_ascension  (d0, d1) float64 dask.array<chunksize=(200, 400), meta=np.ndarray>
Dimensions without coordinates: d0, d1
Data variables:
    IMAGE            (d0, d1, chan, pol) float64 dask.array<chunksize=(200, 400, 1, 1), meta=np.ndarray>
    IMAGE_PBCOR      (d0, d1, chan, pol) float64 dask.array<chunksize=(200, 400, 1, 1), meta=np.ndarray>
    MASK             (d0, d1, chan, pol) bool dask.array<chunksize=(200, 400, 1, 1), meta=np.ndarray>
    MODEL            (d0, d1, chan, pol) float64 dask.array<chunksize=(200, 400, 1, 1), meta=np.ndarray>
    PB               (d0, d1, chan, pol) float64 dask.array<chunksize=(200, 400, 1, 1), meta=np.ndarray>
    PSF              (d0, d1, chan, pol) float64 dask.array<chunksize=(200, 400, 1, 1), meta=np.ndarray>
    RESIDUAL         (d0, d1, chan, pol) float64 dask.array<chunksize=(200, 400, 1, 1), meta=np.ndarray>
    SUMWT            (chan, pol) float64 dask.array<chunksize=(1, 1), meta=np.ndarray>
Attributes:
    axisunits:            ['rad', 'rad', '', 'Hz']
    commonbeam:           [0.6639984846115112, 0.5052574276924133, -65.900550...
    commonbeam_units:     ['arcsec', 'arcsec', 'deg']
    date_observation:     2012/11/19/07
    direction_reference:  j2000
    imagetype:            Intensity
    incr:                 [-3.878509448876288e-07, 3.878509448876288e-07, 1.0...
    object_name:          tw hya
    observer:             cqi
    pointing_center:      11
    rest_frequency:       3.72522e+11 hz
    restoringbeam:        [[[0.6639984846115112, 0.5052574276924133, -65.9005...
    spectral__reference:  lsrk
    telescope:            alma
    telescope_position:   [2.22514e+06m, -5.44031e+06m, -2.48103e+06m] (itrf)
    unit:                 Jy/beam
    velocity__type:       radio
```

<font color='red'>To Do: Add to img.zarr explanation.</font>


<font color='red'>To Decide: What is the default list of array names (i.e. image products) to store? The current specification in CNGI is tied closely to the casa6 tclean's list of outputs. However, not every application or user will need this full list. Therefore, we perhaps should require that an image dataset contain a minimum of 1 array with a predefined default name, but then allow the appending of more. All image arrays within a set must share the same shape and coordinate metadata. Here too, all methods (even CNGI) that operate on images must have the ability to specify an array name to use.</font>

<font color='red'>To Decide: A mechanism to convert between a Primary-Beam model database and a convolution function cache must also be designed.</font>

<font color='red'>To Decide: We need a definition of a component list as an alternative to a raster image. This is required for compatibility with catalogues and translation between sky models that are not fitted/evaluated on a pixellated grid.</font>


### cal.zarr

<font color='red'>To Decide: Create initial design.</font>

### tel.zarr

The .cfg to tel.zarr conversion script and repository of tel.zarr files can be found [here](https://github.com/casangi/cngi_reference/tree/master/telescope_layout). The repository for .cfg files can be found [here](https://open-bitbucket.nrao.edu/projects/CASA/repos/casa-data/browse/alma/simmos).
    
Example of a tel.zarr file
```
<xarray.Dataset>
Dimensions:        (ant: 193, pos_coord: 3)
Coordinates:
  * ant            (ant) int64 0 1 2 3 4 5 6 7 ... 186 187 188 189 190 191 192
  * pos_coord      (pos_coord) int64 0 1 2
Data variables:
    ANT_NAME       (ant) <U7 dask.array<chunksize=(193,), meta=np.ndarray>
    ANT_POS        (ant, pos_coord) float64 dask.array<chunksize=(193, 3), meta=np.ndarray>
    DISH_DIAMETER  (ant) float64 dask.array<chunksize=(193,), meta=np.ndarray>
Attributes:
    coordinate_system:    WGS84
    elevation_units:      m
    lat_units:            rad
    long_units:           rad
    telescope_elevation:  5056.8
    telescope_lat:        -0.4018251640113072
    telescope_long:       -1.1825465955049892
    telescope_name:       ALMA
```

## ngCASA Prototype

The following ngCASA functions have working prototypes:

- [make_pb]()
- [make_imaging_weight](https://ngcasa.readthedocs.io/en/latest/_api/api/ngcasa.imaging.make_imaging_weight.html#ngcasa.imaging.make_imaging_weight)
- [make_psf](https://ngcasa.readthedocs.io/en/latest/_api/api/ngcasa.imaging.make_psf.html#ngcasa.imaging.make_psf)
- [make_image](https://ngcasa.readthedocs.io/en/latest/_api/api/ngcasa.imaging.make_image.html#ngcasa.imaging.make_image)

Example notebooks can be found [here](https://ngcasa.readthedocs.io/en/latest/prototypes.html). In the [continuum imaging](https://ngcasa.readthedocs.io/en/latest/prototypes/continuum_imaging_example.html) and [cube imaging](https://ngcasa.readthedocs.io/en/latest/prototypes/cube_imaging_example.html)  example notebooks the images generated by the ngCASA prototype and CASA are compared. A benchmarking comparison between the ngCASA prototype and CASA can be found [here](https://ngcasa.readthedocs.io/en/latest/benchmark.html).

### Organization

ngCASA is organized into modules as described below. Each module is responsible for a different functional area.  

- **flagging**         : Generates flags for visibility data.
- **calibration**      : Generates and applies calibration solutions to visibility data.
- **imaging**          : Converts visibility data to images and applies antenna primary beam and w-term corrections.
- **deconvolution**    : Deconvolves PSF from images and combines images. 
- **simulator**        : Simulates vis.zarr and img.zarr files.

<font color='red'>To Decide: Are there other modules to include. The simulator will probably have to be moved to the application layer.</font>

### Architecture
The ngCASA application programming interface (API) is a set of flat, stateless functions that take Xarray datasets as input parameters and return a new Xarray dataset as output. The data variables in the output dataset are associated with task graphs. Compute is only triggered when the user explicitly calls compute on the Xarray dataset or instructs an ngCASA function to save to disk. The term flat means that the functions are not allowed to call each other, and the term stateless means that they may not access any global data outside the parameter list, nor maintain any persistent internal data. There is one exception: the parameter default values that live in the parameter checking files of each module.
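
The lazy behaviour described above can be illustrated with a minimal sketch. The function `scale_visibilities`, its `scale_parms` dictionary and the `SCALED_DATA` variable are hypothetical and not part of ngCASA; only the Xarray and Dask calls are real.

```python
import dask.array as da
import numpy as np
import xarray as xr


def scale_visibilities(vis_dataset, scale_parms):
    """Hypothetical stateless function: returns a new dataset, the input is untouched."""
    scaled = vis_dataset["DATA"] * scale_parms["factor"]  # lazy: only extends the task graph
    return vis_dataset.assign(SCALED_DATA=scaled)


# Small dataset backed by a chunked Dask array (two chunks along time).
data = da.ones((4, 3), chunks=(2, 3))
vis_dataset = xr.Dataset({"DATA": (("time", "chan"), data)})

out_dataset = scale_visibilities(vis_dataset, {"factor": 2.0})

# Nothing has been computed yet: the new variable still wraps a Dask array.
is_lazy = isinstance(out_dataset["SCALED_DATA"].data, da.Array)

# Computation is triggered explicitly by the user.
result = out_dataset["SCALED_DATA"].compute().values
```

Because the function only reads its arguments and returns a new dataset, it keeps no state between calls and can be freely combined with other functions before any compute is triggered.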

The file structure of the [ngCASA prototype](https://github.com/casangi/ngcasa):
```sh
ngcasa
|-- ngcasa
|    |-- module1
|    |     |-- __init__.py  
|    |     |-- file1.py    
|    |     |-- file2.py  
|    |     | ...  
|    |     |-- _module1_utils
|    |     |     |-- __init__.py
|    |     |     |-- _check_module1_parms.py
|    |     |     |-- _file_util1.py
|    |     |     |-- _file_util2.py
|    |     |     | ...
|    |-- module2  
|    |     |-- __init__.py
|    |     |-- file3.py    
|    |     |-- file4.py  
|    |     | ...  
|    |     |-- _module2_utils
|    |     |     |-- __init__.py
|    |     |     |-- _check_module2_parms.py
|    |     |     |-- _file_util3.py
|    |     |     |-- _file_util4.py
|    |     |     | ...
|    | ...
|    |-- _ngcasa_utils
|    |     |-- __init__.py
|    |     |-- _check_parms.py    
|    |     |-- _store.py  
|    |     |-- _file_util5.py
|    |     | ... 
|-- docs  
|    | ...  
|-- tests  
|    | ...  
|-- requirements.txt  
|-- setup.py  
```
File1, file2, file3 and file4 must be documented in the API exactly as they appear. They must not import each other. The utility files (identified by a leading underscore) must not be documented in the API. The utility files may be imported by any related API file. For example, only API functions in module1 can import utility functions in the _module1_utils folder. Utility functions in _ngcasa_utils can be imported by any API function.

There are several important files to be aware of:

*   **\_\_init\_\_.py** : dictates what is seen by the API and importable by other functions.
*   **requirements.txt** : lists all library dependencies for development, used by IDE during setup.
*   **setup.py** : defines how to package the code for pip, including version number and library dependencies for installation.
*   **_check_module_parms.py** : Each module has a _check_module_parms.py file that has functions that check the input parameters of the module's API functions. The parameter defaults are also defined here. 
*   **_check_parms.py** : Provides the _check_parms and _check_storage_parms functions. The _check_parms function is used by all the _check_module_parms.py files to check parameter data types and values and to set defaults. The storage_parms parameter, which is common to all API functions, is checked by _check_storage_parms.
*   **_store.py** : Provides the _store function that stores datasets or appends data variables. All API functions use this function. 
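
To make the role of the parameter checking files more concrete, the following is a minimal sketch of how such a check could work. The names `check_parm` and `check_storage_parms_sketch` and their signatures are hypothetical; they are not the actual `_check_parms` and `_check_storage_parms` implementations. The compressor entry is left out to keep the sketch free of dependencies.

```python
_NO_DEFAULT = object()  # sentinel: the parameter has no default and must be supplied


def check_parm(parm_dict, key, acceptable_types, default=_NO_DEFAULT):
    """Check one entry of parm_dict; insert the default if the key is missing."""
    if key not in parm_dict:
        if default is _NO_DEFAULT:
            print("######### ERROR: parameter", key, "must be specified")
            return False
        parm_dict[key] = default
        return True
    if not isinstance(parm_dict[key], tuple(acceptable_types)):
        print("######### ERROR: parameter", key, "has the wrong data type")
        return False
    return True


def check_storage_parms_sketch(storage_parms, default_outfile, default_graph_name):
    """Check the common storage parameters in place and fill in their defaults."""
    checks = [
        check_parm(storage_parms, "to_disk", [bool], default=False),
        check_parm(storage_parms, "append", [bool], default=False),
        check_parm(storage_parms, "outfile", [str], default=default_outfile),
        check_parm(storage_parms, "chunks_on_disk", [dict], default={}),
        check_parm(storage_parms, "chunks_return", [dict], default={}),
        check_parm(storage_parms, "graph_name", [str], default=default_graph_name),
    ]
    return all(checks)


storage_parms = {"to_disk": True}
parms_ok = check_storage_parms_sketch(storage_parms, "out.img.zarr", "make_image")
bad_parms_ok = check_storage_parms_sketch({"to_disk": "yes"}, "out.img.zarr", "make_image")
```

Keeping the defaults inside the checking function is what allows the API functions themselves to stay free of hard-coded values.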

Rules

1. Each file in a module must have exactly one public function exposed to the external API (by docstring and \_\_init\_\_.py).
2. The exposed function name should match the file name.  
3. Functions must be stateless (no classes). 
4. API files in a module cannot import each other.  
5. API files in separate modules cannot import each other.
6. A module's utility files may only be imported by that module's API files.
7. The functions in _ngcasa_utils files may be imported by any function.
8. No utility functions may be exposed to the external API. 
9. No magic numbers. All numerical values should be defined in a _check_module_parms.py file.
10. Functions do not do any data selection or coordinate conversion in addition to their main purpose. For example, make_image does not do a frame conversion to LSRK. These conversions should be done beforehand using the appropriate Xarray selection syntax and CNGI functions.
11. Functions do not internally check against FLAG/FLAG_ROW values. Flagging should be done a priori using cngi.vis.applyflags.

<font color='red'>To Decide: Is the design deep enough for all the required functionality.</font>

<font color='red'>To Decide: Should we have default assumed units. For example all skycoords are in radians. In the application layer more readable units will be accepted. This will have implications on vis.zarr and img.zarr.</font>


### ngCASA Function Template
```python
def ngcasa_func(dataset_1, ..., dataset_n, parms_1, ..., parms_m, storage_parms):
    """
    Description of function.
    Parameters
    ----------
    dataset_1 : xarray.core.dataset.Dataset
    ...
    dataset_n : xarray.core.dataset.Dataset
    parms_1 : dict
    ...
    parms_m : dict
    storage_parms : dictionary
    storage_parms['to_disk'] : bool, default = False
    storage_parms['append'] : bool, default = False
    storage_parms['outfile'] : str
    storage_parms['chunks_on_disk'] : dict of int, default ={}
    storage_parms['chunks_return'] : dict of int, default = {}
    storage_parms['graph_name'] : str
    storage_parms['compressor'] : numcodecs.blosc.Blosc,default=Blosc(cname='zstd', clevel=2, shuffle=0)
    Returns
    -------
    output_dataset : xarray.core.dataset.Dataset
    """
    
    ### Import Statements
    import copy
    from ngcasa._ngcasa_utils._store import _store
    from ngcasa._ngcasa_utils._check_parms import _check_storage_parms
    from ._module1_utils._check_module1_parms import _check_parms_1
    #...
    from ._module1_utils._check_module1_parms import _check_parms_m
  
    ### Deep copy user given parameters so that private functions can modify the parameters
    _storage_parms = copy.deepcopy(storage_parms)
    _parms_1 = copy.deepcopy(parms_1)
    #...
    _parms_m = copy.deepcopy(parms_m)
    
    ### Parameter Checking
    assert(_check_storage_parms(_storage_parms,default_outfile,default_graph_name)), "######### ERROR: storage_parms checking failed"
    assert(_check_parms_1(_parms_1)), "######### ERROR: parms_1 checking failed"
    #...
    assert(_check_parms_m(_parms_m)), "######### ERROR: parms_m checking failed"

    ### Function code
  
    ### Package data into a dataset and create a list of created or modified Xarray data variables.

    return _store(dataset,list_xarray_data_variables,_storage_parms)
    
```


By default, calling an ngCASA function does not perform any computation; it only builds a graph of the computation ([example of a graph](https://ngcasa.readthedocs.io/en/latest/prototypes/cube_imaging_example.html)). Computation can be triggered by using ```dask.compute(dataset)```. Each ngCASA function also has the ```storage_parms``` input. By default ```storage_parms['to_disk']``` is set to False and the other dictionary elements have no effect. However, if ```storage_parms['to_disk']``` is True, a compute is triggered and the data is written to disk ([example of using storage_parms](https://ngcasa.readthedocs.io/en/latest/prototypes/imaging_weights_example.html)). The ```storage_parms``` elements have the following functionality:

- ```storage_parms['to_disk']``` If True the dask graph is executed and saved to disk in the zarr format.

Only if ```storage_parms['to_disk']``` is True do the following parameters have an effect.

- ```storage_parms['append']```  If True only the dask graph associated with the function is executed and the resulting data variables are saved to an existing zarr file on disk. Note that graphs of unrelated data to this function will not be executed or saved. The data variables that are saved are in the list_xarray_data_variables. The dimension coordinates and coordinates that are shared between the on disk dataset and the data variables to append must be the same.
- ```storage_parms['outfile']``` The zarr file to create or append to.
- ```storage_parms['chunks_on_disk']``` The chunk size to use when writing to disk. This is ignored if ```storage_parms['append']``` is True. The default will use the chunking of the input dataset.
- ```storage_parms['chunks_return']``` The chunk size of the dataset that is returned. The default will use the chunking of the input dataset.
- ```storage_parms['graph_name']``` The time to compute and save the data is stored in the attribute section of the dataset and ```storage_parms['graph_name']``` is used in the label.
- ```storage_parms['compressor']``` The compression algorithm to use. Available compression algorithms can be found [here](https://numcodecs.readthedocs.io/en/stable/blosc.html).

The data variables in the dataset that the function uses to build the graph can be specified by the user. There are, however, defaults. For example, the [make_image](https://ngcasa.readthedocs.io/en/latest/_api/api/ngcasa.imaging.make_image.html#ngcasa.imaging.make_image) function has the ```user_grid_parms['data_name']``` parameter, which sets the name of the data variable that contains the visibilities to be gridded; the default is 'DATA'. This functionality is demonstrated in the [imaging weight example notebook](https://ngcasa.readthedocs.io/en/latest/prototypes/imaging_weights_example.html).
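
The idea of a user-selectable data variable with a default name can be sketched as follows. The function `mean_amplitude` and its `user_parms` dictionary are hypothetical and only mimic the `data_name` convention described above.

```python
import numpy as np
import xarray as xr


def mean_amplitude(vis_dataset, user_parms):
    """Hypothetical function: mean visibility amplitude of a named data variable."""
    parms = dict(user_parms)               # copy so the caller's dictionary is not modified
    parms.setdefault("data_name", "DATA")  # default name of the visibility array
    return float(np.abs(vis_dataset[parms["data_name"]]).mean())


vis = np.full((2, 3), 1 + 1j)
vis_dataset = xr.Dataset(
    {
        "DATA": (("time", "chan"), vis),
        "CORRECTED_DATA": (("time", "chan"), 2 * vis),
    }
)

default_result = mean_amplitude(vis_dataset, {})
corrected_result = mean_amplitude(vis_dataset, {"data_name": "CORRECTED_DATA"})
```

The same function can therefore operate on any version of the visibilities stored in the dataset, as long as all versions share the same dimensions and coordinates.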


<font color='red'>To Decide: Should functions be allowed to return multiple datasets.</font>

### Chunking

In the zarr and dask model, data is broken up into chunks. The chunk size can be specified in the ```xarray.open_zarr``` call using the ```chunks``` parameter. The dask and zarr chunking do not have to be the same. The [zarr chunking](https://zarr.readthedocs.io/en/stable/tutorial.html) is what is used on disk and the [dask chunking](https://docs.dask.org/en/latest/array-chunks.html) is used during parallel computation. However, it is more efficient for the dask chunk size to be equal to or a multiple of the zarr chunk size (this avoids multiple reads of the same data). This hierarchy of chunking allows for flexible and efficient algorithm development. For example, cube imaging is more memory efficient if chunking is along the channel axis (the [benchmarking](https://ngcasa.readthedocs.io/en/latest/benchmark.html) example demonstrates this). Note that chunking can be done in any combination of dimensions.

When the dask chunking differs from the chunking on disk, the ```overwrite_encoded_chunks``` parameter in the ```xarray.open_zarr``` call should be set to True. The ```storage_parms['chunks_on_disk']``` and ```storage_parms['chunks_return']``` functionality is still experimental; please report any bugs. There should also be no dask chunking on the polarization dimension.
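
As a small illustration of the alignment argument, the sketch below rechunks a Dask array and checks whether the Dask chunk boundaries coincide with assumed on-disk chunk boundaries. The chunk sizes and the helper `chunks_aligned` are hypothetical; the array is a stand-in for data opened from a zarr file.

```python
import dask.array as da

zarr_chunks = {"time": 10, "chan": 4}  # assumed chunk sizes on disk

# Stand-in for a (time, chan) array read from disk with its on-disk chunking.
vis = da.zeros((40, 16), chunks=(zarr_chunks["time"], zarr_chunks["chan"]))

# Dask chunks that are multiples of the on-disk chunks.
vis_rechunked = vis.rechunk({0: 20, 1: 8})


def chunks_aligned(dask_chunks, disk_chunk_size):
    """True if every interior Dask chunk boundary falls on an on-disk chunk boundary."""
    boundary = 0
    for size in dask_chunks[:-1]:
        boundary += size
        if boundary % disk_chunk_size != 0:
            return False
    return True


aligned_time = chunks_aligned(vis_rechunked.chunks[0], zarr_chunks["time"])
aligned_chan = chunks_aligned(vis_rechunked.chunks[1], zarr_chunks["chan"])

# A chunk size of 15 along time straddles the on-disk chunks of size 10.
misaligned_time = chunks_aligned(vis.rechunk({0: 15}).chunks[0], zarr_chunks["time"])
```

With misaligned chunks, a single on-disk chunk contributes to more than one Dask chunk and may be read more than once.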

### Logging

This is currently being explored, see [Jira ticket CAS-13058](https://open-jira.nrao.edu/browse/CAS-13058).

### Parallel Code with Dask

Code can be parallelized in three different ways: 

- Built-in Dask array functions. The list of dask.array functions can be found [here](https://docs.dask.org/en/latest/array-api.html). For example, the fast Fourier transform is a built-in parallel function:

```python
import dask.array.fft as dafft

uncorrected_dirty_image = dafft.fftshift(dafft.ifft2(dafft.ifftshift(grids_and_sum_weights[0], axes=(0, 1)),
                                                     axes=(0, 1)), axes=(0, 1))
```
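
A self-contained version of the same pattern is sketched below on a small random grid, which is a stand-in for the gridded visibilities. The grid is chunked by channel only, since Dask requires each FFT axis to be contained in a single chunk. The result is compared against the equivalent NumPy calls.

```python
import dask.array as da
import dask.array.fft as dafft
import numpy as np

rng = np.random.default_rng(0)
shape = (8, 8, 3, 1)  # (u, v, chan, pol)
grid_np = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# One chunk per channel; the two FFT axes are each in a single chunk.
grid = da.from_array(grid_np, chunks=(8, 8, 1, 1))

# Lazy, parallel over the channel chunks.
image = dafft.fftshift(
    dafft.ifft2(dafft.ifftshift(grid, axes=(0, 1)), axes=(0, 1)), axes=(0, 1)
)

# Reference result computed with NumPy on the full array.
image_np = np.fft.fftshift(
    np.fft.ifft2(np.fft.ifftshift(grid_np, axes=(0, 1)), axes=(0, 1)), axes=(0, 1)
)
matches_numpy = np.allclose(image.compute(), image_np)
```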

- Apply a custom function to each Dask data chunk. There are numerous Dask functions, with varying capabilities, that do this: [map_blocks](https://docs.dask.org/en/latest/array-api.html#dask.array.map_blocks), [map_overlap](https://docs.dask.org/en/latest/array-overlap.html#dask.array.map_overlap), [apply_gufunc](https://docs.dask.org/en/latest/array-api.html#dask.array.gufunc.apply_gufunc), [blockwise](https://docs.dask.org/en/latest/array-api.html#dask.array.blockwise). For example, map_blocks is used to divide the image of each channel by the sum of weights and by the correcting image of the gridding convolutional kernel:

```python
import dask.array as da

def correct_image(uncorrected_dirty_image, sum_weights, correcting_cgk):
    # Replace zero weights with one to avoid division by zero.
    sum_weights[sum_weights == 0] = 1
    corrected_image = ((uncorrected_dirty_image / sum_weights[None, None, :, :])
                       / correcting_cgk[:, :, None, None])
    return corrected_image

corrected_dirty_image = da.map_blocks(correct_image, uncorrected_dirty_image,
                                      grids_and_sum_weights[1], correcting_cgk_image)
```
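
The following is a self-contained sketch of the same idea with small synthetic arrays. It differs from the snippet above in one respect, chosen here for clarity: the lower-dimensional arrays are expanded to four dimensions before the `map_blocks` call, so that the blocks of all three inputs line up by position. The array sizes and values are made up.

```python
import dask.array as da
import numpy as np


def correct_image(uncorrected_dirty_image, sum_weights, correcting_cgk):
    # Called once per block; all three blocks are 4-D and broadcast like NumPy arrays.
    safe_weights = np.where(sum_weights == 0, 1, sum_weights)  # avoid division by zero
    return uncorrected_dirty_image / safe_weights / correcting_cgk


n_x, n_y, n_chan, n_pol = 4, 4, 3, 1
image = da.ones((n_x, n_y, n_chan, n_pol), chunks=(n_x, n_y, 1, n_pol))
sum_weights = da.from_array(np.array([[2.0], [0.0], [4.0]]), chunks=(1, 1))  # (chan, pol)
cgk_image = da.full((n_x, n_y), 0.5, chunks=(n_x, n_y))                      # (x, y)

corrected = da.map_blocks(
    correct_image,
    image,
    sum_weights[None, None, :, :],  # expanded to (1, 1, chan, pol)
    cgk_image[:, :, None, None],    # expanded to (x, y, 1, 1)
    dtype=image.dtype,
)
result = corrected.compute()
```

Each channel chunk of the image is corrected independently, so the work is spread over the available Dask workers.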

- Custom parallel functions can be built using [dask.delayed](https://docs.dask.org/en/latest/delayed.html) objects. Any function or object can be delayed. For example the gridder is implemented using dask.delayed:

```python
for c_time, c_baseline, c_chan, c_pol in iter_chunks_indx:
    sub_grid_and_sum_weights = dask.delayed(_standard_grid_numpy_wrap)(
        vis_dataset[grid_parms["data_name"]].data.partitions[c_time, c_baseline, c_chan, c_pol],
        vis_dataset[grid_parms["uvw_name"]].data.partitions[c_time, c_baseline, 0],
        vis_dataset[grid_parms["imaging_weight_name"]].data.partitions[c_time, c_baseline, c_chan, c_pol],
        freq_chan.partitions[c_chan],
        dask.delayed(cgk_1D), dask.delayed(grid_parms))
    grid_dtype = np.complex128
```
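
The delayed pattern can be tried out with the toy example below. The function `grid_chunk` is a hypothetical stand-in for a real gridder: it accumulates weighted values into a tiny one-dimensional grid. One delayed task is created per chunk and a final delayed task sums the partial grids.

```python
import dask
import numpy as np


def grid_chunk(vis_chunk, weight_chunk, n_grid):
    """Toy stand-in for a gridder: accumulate weighted values into a small 1-D grid."""
    grid = np.zeros(n_grid)
    cells = np.arange(vis_chunk.size) % n_grid  # made-up assignment of samples to cells
    np.add.at(grid, cells, vis_chunk * weight_chunk)
    return grid, weight_chunk.sum()


@dask.delayed
def sum_partial(results):
    """Combine the (grid, sum of weights) pairs produced by the individual chunks."""
    grids, sum_weights = zip(*results)
    return np.sum(grids, axis=0), sum(sum_weights)


vis = np.arange(12, dtype=float)
weights = np.ones(12)
chunk_slices = [slice(0, 4), slice(4, 8), slice(8, 12)]

# Build the graph: nothing is executed until compute is called.
partial_results = [dask.delayed(grid_chunk)(vis[s], weights[s], 4) for s in chunk_slices]
total = sum_partial(partial_results)

grid, sum_weight = total.compute()
```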

## List of Future Prototypes
The following are suggestions to add to the list of CNGI and ngCASA prototypes that demonstrate that algorithms and science use cases can be implemented with minimal complexity and that they scale as expected. 

Note : This list is preliminary in content, and has no fixed timelines associated with it. 

### Join Operation

Demonstrate a join operation for datasets for visibilities, images, caltables, CF-caches, etc. 

- Visibilities : A common use case is to combine data from multiple datasets that may not share metadata or be consistent in shape. Some algorithms must be run on each dataset (or subset) separately, whereas others must be run on the entire set. CASA6 currently uses a mix of the two options below, which is a significant source of inconsistency. Can the new framework simplify the book-keeping for this use case without sacrificing other aspects such as performance?

    - Option1 : Always work with small homogeneous subsets of data and have the application layer manage lists and merging of products (via loops). 

    - Option2 : Implement a join and then use data selection as needed. All applications then strictly can take single datasets as inputs. Algorithms such as imaging can view it as a single large dataset. However, for some algorithms (such as calibration solutions that must pay attention to different meta-data across subsets) the internal diversity will still have to be managed through loops inside the methods.

- Images : Applications may need to work with multiple image sets that may not share the same shape or metadata. 

   - Should there be a join operation for images, should there only be custom combination methods, or should the application layer manage lists of image sets ?

   - Example use cases are linear mosaics where a list of small images is combined onto a larger image grid (as an explicit weighted average of images, where the primary beam is used as the weight) and multi-field imaging where main and outlier fields have vastly differing meta-data.
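
As an illustration of the linear mosaic use case, the sketch below places small images on a larger grid and forms a weighted average with the primary beam as the weight. The function `linear_mosaic`, the pixel offsets and the flat primary beams are assumptions made for the example; a real implementation would work with sky coordinates and realistic beam models.

```python
import numpy as np


def linear_mosaic(images, primary_beams, offsets, mosaic_shape):
    """Weighted average of small images on a larger grid, weighted by the primary beam."""
    weighted_sum = np.zeros(mosaic_shape)
    weight_sum = np.zeros(mosaic_shape)
    for image, pb, (row, col) in zip(images, primary_beams, offsets):
        n_row, n_col = image.shape
        target = (slice(row, row + n_row), slice(col, col + n_col))
        weighted_sum[target] += pb * image
        weight_sum[target] += pb
    mosaic = np.zeros(mosaic_shape)
    covered = weight_sum > 0  # pixels that received at least one image
    mosaic[covered] = weighted_sum[covered] / weight_sum[covered]
    return mosaic, weight_sum


pb = np.ones((2, 2))  # flat primary beam, for illustration only
images = [np.full((2, 2), 1.0), np.full((2, 2), 3.0)]
offsets = [(0, 0), (0, 1)]  # the two images overlap in the middle column

mosaic, weight = linear_mosaic(images, [pb, pb], offsets, (2, 3))
```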

 
### Re-bin and Expand
This is the reverse of time/chan averaging or rebin/regrid operations. Visibility processing methods should (where mathematically possible) support transformations in both directions, with clear conventions implemented.  

  - E.g. vis.timeaverage() followed by 'edit flags' and then write back to the original dataset. The FLAG value is to be copied during expansion. 

  - E.g. Caltable solutions need an interpolation to get back to the original data resolution. Interpolation and extrapolation.
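
The flag round trip can be sketched with plain NumPy. The functions `time_average` and `expand_flags` are hypothetical and assume that the number of time steps is an exact multiple of the averaging factor; the convention used here (an averaged sample is flagged only if all of its inputs are flagged) is one possible choice, not a decided one.

```python
import numpy as np


def time_average(data, flags, n_avg):
    """Average unflagged samples in bins of n_avg time steps (n_time must divide evenly)."""
    n_time, n_chan = data.shape
    data_bins = data.reshape(n_time // n_avg, n_avg, n_chan)
    valid_bins = ~flags.reshape(n_time // n_avg, n_avg, n_chan)
    counts = valid_bins.sum(axis=1)
    averaged = (data_bins * valid_bins).sum(axis=1) / np.maximum(counts, 1)
    return averaged, counts == 0  # flagged only if every input sample was flagged


def expand_flags(averaged_flags, n_avg):
    """Copy each averaged flag back onto the n_avg original time steps."""
    return np.repeat(averaged_flags, n_avg, axis=0)


data = np.arange(8.0).reshape(4, 2)  # (time, chan)
flags = np.zeros((4, 2), dtype=bool)

averaged, averaged_flags = time_average(data, flags, n_avg=2)
averaged_flags[1, 0] = True  # 'edit flags' at the averaged resolution
new_flags = flags | expand_flags(averaged_flags, n_avg=2)
```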


### Imaging

- Step 1 : A [fully functioning imaging prototype](https://ngcasa.readthedocs.io/en/latest/prototypes.html) exists and has demonstrated imaging weight calculations, basic gridding and image formation using a python-numba implementation. Equivalence with casa6's standard gridder has been established along with a performance comparison and demonstrations on a local workstation (with pre-specified resource constraints). 

- Step 2 : Over the next several months, this prototype will be extended to include mosaic gridding/imaging and the design of a fully-featured convolution function cache (based on the casa6 awproject gridder cfcache). Demonstrations will likely include a large ALMA cube mosaic imaging example.

- Step 3 : Time permitting, this work will be further extended to include support for heterogeneous arrays (dish shapes and pointing offsets) and corresponding demonstrations via ALMA or ngVLA simulations may be written. 

### Flagging 

- Step 1 : Implement manual flag commands using native python-based xarray selection syntax and demonstrate that this scales with realistic flag command counts (several 1000 independent data selections).
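
A minimal sketch of what such manual flag commands could look like with Xarray label-based selection is given below. The coordinate values and the list of flag commands are made up, and the flags are edited in place only to keep the example short.

```python
import numpy as np
import xarray as xr

flag = xr.DataArray(
    np.zeros((4, 3, 5), dtype=bool),
    dims=("time", "baseline", "chan"),
    coords={
        "time": [0.0, 10.0, 20.0, 30.0],
        "baseline": [0, 1, 2],
        "chan": np.arange(5),
    },
    name="FLAG",
)

# Each flag command is a dictionary of label-based selections.
flag_commands = [
    {"time": slice(10.0, 20.0)},           # label slices include both end points
    {"baseline": 2, "chan": slice(0, 1)},  # channels 0 and 1 of baseline 2
]

for selection in flag_commands:
    flag.loc[selection] = True

n_flagged = int(flag.sum())
```

Scaling this loop to several thousand independent selections is the part that the prototype would need to demonstrate.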

### Calibration 
- Step 1 : Basic apply w/ single caltable: establish/exercise basic Jones algebra constructs and topology of the fundamental calibration infrastructure (CNGI vs. ngCASA).

- Step 2 : General apply of many terms: introduce/establish/exercise multi-term topology, including OTF aggregation of calibration.

- Step 3 : Basic solve (no apply): Establish/exercise essential solving mechanics (including pre-averaging).

- Step 4 : Solve w/ pre-applies: Exercise pre-apply aggregation and resource management and staging to the solve.
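
The basic Jones algebra behind a single-term apply can be written down in a few lines of NumPy. The sketch assumes the usual measurement equation form, in which the observed visibility on baseline (i, j) is J_i V J_j^H, so that the correction is J_i^-1 V_obs (J_j^-1)^H. The random Jones matrices and the function `apply_cal` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)


def random_jones(rng):
    """Made-up antenna Jones matrix close to the identity, so that it is invertible."""
    return np.eye(2) + 0.1 * (rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2)))


def apply_cal(vis, jones, ant1, ant2):
    """Correct 2x2 visibilities: V_corr = J_i^-1 V_obs (J_j^-1)^H for each baseline."""
    inv = np.linalg.inv(jones)
    inv_h = np.conj(np.swapaxes(inv, -1, -2))
    return inv[ant1] @ vis @ inv_h[ant2]


n_ant = 3
jones = np.stack([random_jones(rng) for _ in range(n_ant)])  # (ant, 2, 2)
ant1 = np.array([0, 0, 1])  # first antenna of each baseline
ant2 = np.array([1, 2, 2])  # second antenna of each baseline

vis_true = np.tile(np.eye(2, dtype=complex), (3, 1, 1))  # (baseline, 2, 2)

# Corrupt the true visibilities: V_obs = J_i V_true J_j^H.
jones_h = np.conj(np.swapaxes(jones, -1, -2))
vis_obs = jones[ant1] @ vis_true @ jones_h[ant2]

vis_corrected = apply_cal(vis_obs, jones, ant1, ant2)
```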

### Simulation
- Step 1 : Generate a simulated dataset to mimic a real observed dataset and show that meta-data are accurate. 

- Step 2 : If feasible, demonstrate an ngVLA simulation by integrating the simulator meta-data generation with the imager prototype (preferably after full heterogeneous array support is ready).


### Visualization 

- Step 1 : Use off-the-shelf python libraries to demonstrate interactive plotting of large volumes of data. 

- Step 2 : Demonstrate the use of python visualization tools from within an ngCASA application or user script. The purpose is to support flexible interactive visualization as part of algorithms and applications. 

Features to be included for evaluation for this use case are as follows : 

- Array 'versioning' implemented as separately named arrays allows a user to visualize multiple versions (of flags, corrected_data, or model_data). The visualization tool should allow the user to specify which (non-default) array names to use/plot.

- On-the-fly visualization must be possible using the currently-available xarray dataset (which may or may not have a zarr counterpart).  

    - Flagging : An example use case is [autoflagging](https://ngcasa.readthedocs.io/en/latest/ngcasa_flagging.html#Autoflag-with-extension-and-pre-existing-flags) where it is often useful to inspect flags, discard them, try new autoflag parameters, re-inspect, and save flags only when satisfied. 

    - Imaging : [Interactive mask drawing and iteration control](https://ngcasa.readthedocs.io/en/latest/ngcasa_imaging.html#Interactive-Clean) editing is another use case.


- Visualization must apply to zarr/xarray datasets such that they are usable (interchangeably, within reason) for visibility datasets, caltables and images. This merges the roles of what has traditionally been separate in casa6 as plotms and the viewer and will get us (at least) scatter and raster displays for all kinds of datasets that we support. 



### Pipeline Usage Modes

- Step 1 : Demonstrate a pipeline use case of operating on a diverse dataset containing multiple meta-data sub-structures (EBs, pointings, etc), with some algorithmic steps being run separately and some on the joint data. Demonstrate the role of the xarray/dask framework in simplifying the management of data selections/splits, parallelization, and combination for imaging. Demonstrate the use of array/flag versions, etc.

- Step 2 : String together all the above prototypes into a full pipeline demonstration. 





## Coding Practices 

- Snake case.
- Larger descriptive names (avoid generic names).
- Underscore for private functions.

See [CNGI coding standards](https://cngi-prototype.readthedocs.io/en/latest/development.html#Coding-Standards).

<font color='red'>To do: Expand on coding practices.</font>