# HERA Memo: Creating Auto-Correlations with a Generative Adversarial Neural Network

### Joseph C. Shy, 2021 CHAMP Scholar, 08/13/2021
#### Questions? Contact me: jshy@calpoly.edu or joeyshy883@gmail.com

organize github and include here

define autocorrelation???

## 1. Introduction & Machine Learning Basics

### 1.1. Machine Learning applied to HERA
As the HERA radio antenna array continues to be developed, modified, and built upon, distinguishing working antennas from broken ones becomes a major priority before the received data can be analyzed. Currently, the standard for the HERA collaboration is to visually assess the auto-correlations returned from every observing antenna on a specific night of observation. Each assessment is performed manually by an operator and repeated by a handful of other operators for thoroughness and redundancy. <br><br>
**SHOW EXAMPLE OF BAD AND GOOD AUTOS**<br><br>
This process can prove inconvenient, as it requires a large amount of focused time to gain confidence in the flag given to a certain antenna (the flag states its potential issue or whether it is cleared for use). Additionally, it is close to impossible to visually check every auto-correlation that is computed every night (typically > 100,000). As a result, a good auto-correlation from an antenna flagged as broken may be discarded simply because it was never screened. It is in the collaboration's best interest to preserve as much data from each antenna as possible. <br><br>
Machine learning, and more specifically a generative adversarial neural network (or GAN), has the potential to automate and improve the current system for flagging antennas. The current working networks that are the focus of the sections below are designed to:
1. Screen every auto-correlation produced.
2. Flag an auto-correlation as `PASS` or `NO PASS` <br>

This system has the potential to greatly reduce the amount of time operators need to assess the auto-correlations, as their role would be limited to classifying the issues in the `NO PASS` auto-correlations. Additionally, the ability to screen every auto-correlation should increase the influx of usable data for analysis. <br><br> 
However, full automation of this process could be an opportunity in the future through expanding the scope of the models shown below. Please refer to **Future Work**.

### 1.2. Basics of a GAN
A generative adversarial neural network is a clever combination of two deep learning architectures. These deep learning architectures are ["computing system(s) made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs" ](https://towardsdatascience.com/a-gentle-introduction-to-neural-networks-series-part-1-2b90b87795bc). The basic learning scheme for a neural network involves inputting a data set in which each individual piece of data (e.g. an image or plot) carries a specific flag (or classifier), and allowing the processing elements within the network to update/learn so that the network improves in its ability to classify the data. In this memo, each time a model updates its processing elements (its neurons, whose adjustable parameters are the weights) is called an epoch. The magnitude of each update is based on the loss computed at that training epoch. This is the most basic application of neural networks; they can be manipulated to do much more, as elaborated upon next.<br><br>
The two learning architectures described within this memo are referred to as the detector model and the generator model. The general idea is to make the two models compete. The generator's goal is to incrementally improve in its ability to create fake auto-correlations that look real. The detector's goal is to become better at discriminating between real measurements and fake ones. <br><br>
The detector model operates similarly to the basic classification neural network described above; however, the classifications that the detector trains upon are `REAL` and `FAKE`. It receives an input training set of auto-correlations that are deemed good for analysis. Accompanying these input auto-correlations are classification flags (simply a list of integer 1s, one per auto-correlation in the training set) that communicate to the detector model that the incoming input values are `REAL` auto-correlations. Additionally, the detector receives input `FAKE` auto-correlations produced by the generator model. At each epoch, the detector trains and improves in its ability to discriminate between `REAL` and `FAKE` auto-correlations. <br><br>
However, the generator model trains as well, learning to create `FAKE` auto-correlations that increasingly resemble good auto-correlations. This is done by combining the generator and detector (discriminator) models into a larger, overarching GAN learning architecture. It is important to note that within the GAN model, the detector cannot be trained/updated. The input to the GAN is a [latent space](https://towardsdatascience.com/understanding-latent-space-in-machine-learning-de5a7c687d8d), which is a set of vectors of random numbers drawn from the [standard normal distribution](https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_probability/bs704_probability9.html). The generator receives this latent space and performs a set of hidden mathematical operations on it, which are determined by the model's architecture and weights. The output of the generator is a `FAKE` auto-correlation, which is subsequently fed into the detector model with a `REAL` classification. This is where the generator training occurs: the detector will most likely return large losses, as it is being fed information that contradicts its previous training outside the GAN architecture. The generator uses the loss value output by the detector to update its own weights so that it can create `FAKE` auto-correlations that better trick the detector. Because the detector is restricted from updating its weights within the GAN model, this contradictory information does not influence its own training or its ability to discriminate between `REAL` and `FAKE` data. 
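
The labeling scheme described above can be illustrated with a small NumPy sketch. It is not the memo's training code: the binary cross-entropy loss, the batch sizes, the latent dimension, and the detector output of 0.05 are all assumptions chosen for illustration. It only shows why presenting `FAKE` samples with a `REAL` label produces a large loss for the generator to learn from.

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_cross_entropy(labels, predictions, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    p = np.clip(predictions, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p)))

# Hypothetical sizes, for illustration only
n_real, n_fake, latent_dim = 4, 4, 8

# Detector training: REAL samples carry label 1, FAKE samples carry label 0
real_labels = np.ones(n_real)
fake_labels = np.zeros(n_fake)

# Latent space: random vectors drawn from the standard normal distribution
latent = rng.standard_normal((n_fake, latent_dim))

# Assumed detector output: a well-trained detector gives generated samples
# a low probability of being REAL (0.05 is a made-up value)
detector_on_fake = np.full(n_fake, 0.05)

# Loss when the detector itself trains on FAKE samples (labels of 0): small
detector_loss = binary_cross_entropy(fake_labels, detector_on_fake)

# Loss inside the GAN, where the same samples are labeled REAL (labels of 1): large
generator_loss = binary_cross_entropy(np.ones(n_fake), detector_on_fake)

print('detector loss on FAKE samples:  {:.4f}'.format(detector_loss))
print('generator loss via flipped labels: {:.4f}'.format(generator_loss))
```

In a deep learning framework, the large second loss would be back-propagated through the frozen detector into the generator's weights only.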


## 2. Training Set Selection <br>
When training a GAN, the selection of the training set is vital for reducing training time and preventing training failure. With HERA's correlators constantly being improved and modified, what counts as a good auto-correlation that validates an antenna's use can vary depending on the data set. The trend of the measurement over the frequency range and/or the magnitude of the power measured at each frequency may differ between sets. Additionally, each antenna produces auto-correlations of varying polarizations, which differ from one another as well. <br><br>
As this study is to serve as a validation of GAN architecture for detecting and producing fake auto-correlations, it is important not to make the training set too large and/or too complex. Showing the GAN auto-correlation sets that are all considered valid but differ significantly from one another will most likely result in much longer GAN training times or GAN training failure, as neural networks rely heavily on being able to distinguish overarching features and relationships between multiple auto-correlations in a training set. <br><br>
Therefore, the training set is down-selected to only validated `"ee"` polarized `H4C` auto-correlations from the observation night of `2459122`. This training set is used as a case study in order to learn/create a working GAN architecture that can distinguish between real and fake auto-correlations and produce realistic auto-correlations from random input vectors known as latent space. <br><br>
In order to train the GAN on what valid auto-correlations look like, there can only be valid data in the training set. Fortunately, the data mentioned above was chosen because it was previously validated by members of the HERA collaboration and had more successful/valid antenna measurements than other data sets. The bad antennas (listed under "ex_ants" [here](https://docs.google.com/spreadsheets/d/1xFo2PLVUhXHe-yqHHl0WrRe5pXZF2zC8Z1fLqPZbPZ8/edit#gid=418790055)) were excluded from the training set. 

The auto-correlation data was retrieved from `Lustre` through the filepath `/lustre/aoc/projects/hera/H4C/2459122`. All files within this directory are used, with the good `"ee"` polarized antennas being separated, organized, and saved locally (the code is omitted as it is not the focus of this study). Please use this [reference](https://github.com/HERA-Team/hera_cal/blob/master/scripts/notebooks/io_example.ipynb) if you are interested in how to access `HERAData` files. 

In [2]:
import numpy as np

**NOTE:** 15% of the valid auto-correlations were split into a separate validation set, whose purpose will be described later within this study.

In [2]:
auto_data_train = np.load('2459122_good_auto-corrs_train.npy') # load-in the auto-correlation data to train on
auto_data_valid = np.load('2459122_good_auto-corrs_valid.npy') # load-in the auto-correlation data to use for validation
freqs = np.load('HERA_auto-corr_freqs.npy') # load-in frequency channels (from HERA meta-data) [Hz]

Logistical data of the auto-correlation sets can be seen below.

In [8]:
print('# of frequency channels: {}'.format(len(freqs)))
print('# of auto-correlations to train on: {}'.format(len(auto_data_train)))
print('# of auto-correlations to use for validation: {}'.format(len(auto_data_valid)))

# of frequency channels: 1536
# of auto-correlations to train on: 142443
# of auto-correlations to use for validation: 25137
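
The counts above are consistent with the 15% validation split: 142443 + 25137 = 167580, and 15% of 167580 is 25137. The splitting code itself is omitted from this memo, so the following is only a sketch of one way such a split could be made. The function name `split_train_valid`, the random shuffle, the seed, and the demo array are all hypothetical.

```python
import numpy as np

def split_train_valid(data, valid_fraction=0.15, seed=0):
    """Randomly split rows of `data` into training and validation subsets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))              # shuffled row indices
    n_valid = int(round(valid_fraction * len(data)))
    valid = data[order[:n_valid]]                   # first n_valid shuffled rows
    train = data[order[n_valid:]]                   # remaining rows
    return train, valid

# Hypothetical stand-in for the auto-correlation array (200 spectra, 3 channels)
demo = np.arange(200 * 3, dtype=float).reshape(200, 3)
demo_train, demo_valid = split_train_valid(demo)
print(demo_train.shape, demo_valid.shape)
```

Shuffling before splitting keeps the validation set from being dominated by a single stretch of the night or a single antenna, assuming the rows are stored in observation order.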


## 3. Addressing Auto-Correlation Uncertainty

### 3.1. The Radiometer Equation
Theoretically, two auto-correlations retrieved from two consecutive integrations of the same radio antenna should be identical, as they measure nearly identical patches of the sky. However, in practice, these auto-correlations will never be identical due to the [noise](https://casper.ssl.berkeley.edu/astrobaki/index.php/Noise_Temperature) introduced by certain measurement devices (e.g. receivers or amplifiers). 

The expected noise distribution for the measurement can be understood with the [Radiometer Equation](https://casper.ssl.berkeley.edu/astrobaki/index.php/Radiometer_Equation). This equation quantifies the noise introduced by the measuring equipment from known properties of the measuring devices and experiment. 

<center> $\sigma_{T} = \frac{T_{sky}}{\sqrt{BW \cdot t}}$ </center>

*T<sub>sky</sub>* is the actual "sky temperature" or power that should be measured in an ideal auto-correlation (without losses or noise). *BW* is the integrated bandwidth of the auto-correlation. The integration time, *t*, is the time over which the measurement is averaged from multiple shorter exposures. These values, when applied to the equation above, produce *σ<sub>T</sub>*, which is the residual uncertainty in a sky temperature measurement.
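
As a quick numerical illustration of the Radiometer Equation, the sketch below evaluates it for round, made-up values. The function name `radiometer_sigma` and the inputs (100 K, 100 kHz, 10 s) are assumptions for illustration and are not HERA parameters.

```python
import numpy as np

def radiometer_sigma(t_sky, bandwidth_hz, integration_s):
    """Residual uncertainty sigma_T = T_sky / sqrt(BW * t)."""
    return t_sky / np.sqrt(bandwidth_hz * integration_s)

# Hypothetical inputs: 100 K sky temperature, 100 kHz bandwidth, 10 s integration
sigma = radiometer_sigma(t_sky=100.0, bandwidth_hz=1.0e5, integration_s=10.0)
print('sigma_T = {} K'.format(sigma))  # 100 / sqrt(1e6) = 0.1 K
```

Because the uncertainty scales as the inverse square root of *BW* times *t*, quadrupling the integration time only halves the noise.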

### 3.2. Application to GAN Learning
Due to this random noise introduced by the measuring equipment, every auto-correlation in the training set is unique. To the GAN, this uniqueness appears random, as the neural networks only have access to the raw measurements within the training set. Therefore, the generator would not only have to learn the recurring auto-correlation trends and RFI channel patterns, but it would also be forced to learn how to produce fake auto-correlations with random noise. This proved to be a difficult task for the generator early in model development, as a realistic fake auto-correlation could not be produced even after 10,000 epochs (or iterations in which the GAN weights are updated and the model "learns"). <br> <br>
Therefore, in order to assist the generator in training and shorten the training duration, the noise quantified by the Radiometer Equation was implemented into the models and training function. The motivation behind the use of a noise model is to reduce the number of features the generator is required to learn (i.e. radiometer noise) and instead allow it to focus on learning idealized auto-correlation patterns. <br><br>
The generator produces some power, or *T<sub>sky</sub>*, at each frequency channel. Any time the generated (fake) measurement is input into or used to train the detector, a multiplicative random noise factor is applied to the sky temperature at each frequency. This noise factor is a random value drawn from a Gaussian distribution centered at 1, with a standard deviation given by the Radiometer Equation.

<center> $stddev = \frac{1}{\sqrt{BW \cdot t}}$ </center>

The process of calculating this standard deviation for an `H4C` auto-correlation on the night of `2459122` is shown below. 

In [9]:
from hera_cal.io import HERAData # HERAData required for reading in '.uvh5' files

In [13]:
# load-in random file from designated directory for training data retrieval 
# NOTE: an arbitrary file is used, as only its meta-data is of interest (which remains the same for all H4C data)

hd = HERAData('/lustre/aoc/projects/hera/H4C/2459122/zen.2459122.25108.sum.autos.uvh5') # load-in file
meta = hd.get_metadata_dict() # get meta-data

BW = np.median(np.diff(meta['freqs'])) # calculate bandwidth [Hz]

int_time = np.median(np.diff(meta['times'])) # calculate integration time [Julian days]
int_time *= 24*3600 # convert Julian days to seconds 

stddev = 1/np.sqrt(BW*int_time) # calculate standard deviation from Radiometer Eqn. [unitless]

In [14]:
print('standard deviation of auto-correlation power = {}'.format(stddev))

standard deviation of auto-correlation power = 0.0009207119165799618
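
With this standard deviation, the multiplicative noise factor described in this section could be applied to generated auto-correlations roughly as sketched below. This is a NumPy illustration rather than the memo's model code: the function name `apply_radiometer_noise`, the seed, and the flat 50-unit placeholder spectra are hypothetical, and in practice the same operation would need to live inside the model or training function.

```python
import numpy as np

def apply_radiometer_noise(t_sky, stddev, rng):
    """Multiply each channel by a factor drawn from a Gaussian centered at 1."""
    noise_factor = rng.normal(loc=1.0, scale=stddev, size=np.shape(t_sky))
    return t_sky * noise_factor

rng = np.random.default_rng(0)
stddev = 0.0009207119165799618  # value printed above

# Hypothetical noiseless generator output: 64 flat spectra with 1536 channels each
fake_autos = np.full((64, 1536), 50.0)
noisy_autos = apply_radiometer_noise(fake_autos, stddev, rng)

# The fractional scatter of the noisy spectra should be close to stddev
frac_scatter = np.std(noisy_autos / fake_autos)
print('fractional scatter = {:.3e}'.format(frac_scatter))
```

Since the factor is multiplicative, the absolute noise in each channel scales with the generated power there, matching the *T<sub>sky</sub>* dependence of the Radiometer Equation.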


## 4. Model Building and Training Preparation

Future work
- nn
- semi-supervised