# Downloading Auto-Correlation Data for GAN Training

### Joseph C. Shy, 2021 CHAMP Scholar, 08/13/2021
#### Questions? Contact me: jshy@calpoly.edu or joeyshy883@gmail.com

This notebook downloads the large `*.npy` files that are not hosted on the [GitHub](https://github.com/jshy883/jshy883-2459122-H4C_Machine-Learning_Practice-) repository for this project but are referenced in the documentation and code of the HERA machine-learning project. Running this notebook saves all of the files to the correct location, so that model training can be run from the `HERA-GAN_joseph-shy_memo` notebook.

**Important Note: To run this notebook, you need access to the library 'hera_cal.io' and access to HERA's 'Lustre' storage space.**
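
Before running the cells below, it can help to confirm both prerequisites up front. The following is a minimal sketch and not part of the original workflow: the helper name `check_prerequisites` is hypothetical, and the directory is assumed from the auto-correlation file paths used in this notebook, so adjust it if your data lives elsewhere.

```python
import importlib.util
import os

# Directory assumed from the auto-correlation file paths used in this notebook.
LUSTRE_DIR = "/lustre/aoc/projects/hera/H4C/2459122"

def check_prerequisites(data_dir=LUSTRE_DIR):
    """Report whether hera_cal is importable and the data directory is visible."""
    return {
        "hera_cal": importlib.util.find_spec("hera_cal") is not None,
        "lustre": os.path.isdir(data_dir),
    }

status = check_prerequisites()
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

If either entry reports `MISSING`, the read step later in the notebook will fail, so it is cheaper to find out here than after a long load.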

In [1]:
# import libraries (download and configure them if necessary)
import numpy as np
from hera_cal.io import HERAData, HERACal
import sklearn.model_selection as sk

In [2]:
# read in filepaths
with open("filepaths-autos_2459122.txt", "r") as f: # open file (closed automatically on exit)
    filepath = f.readlines() # read in file line by line
filepath = [k.replace(',\n','') for k in filepath] # correct formatting, so list is ready for indexing and using

In [3]:
filepath # these are the filepaths that contain the auto-correlations for use

['/lustre/aoc/projects/hera/H4C/2459122/zen.2459122.25108.sum.autos.uvh5',
 '/lustre/aoc/projects/hera/H4C/2459122/zen.2459122.45934.sum.autos.uvh5',
 '/lustre/aoc/projects/hera/H4C/2459122/zen.2459122.25131.sum.autos.uvh5',
 '/lustre/aoc/projects/hera/H4C/2459122/zen.2459122.45957.sum.autos.uvh5',
 '/lustre/aoc/projects/hera/H4C/2459122/zen.2459122.25153.sum.autos.uvh5',
 '/lustre/aoc/projects/hera/H4C/2459122/zen.2459122.45979.sum.autos.uvh5',
 '/lustre/aoc/projects/hera/H4C/2459122/zen.2459122.25175.sum.autos.uvh5',
 '/lustre/aoc/projects/hera/H4C/2459122/zen.2459122.46002.sum.autos.uvh5',
 '/lustre/aoc/projects/hera/H4C/2459122/zen.2459122.25198.sum.autos.uvh5',
 '/lustre/aoc/projects/hera/H4C/2459122/zen.2459122.46024.sum.autos.uvh5',
 '/lustre/aoc/projects/hera/H4C/2459122/zen.2459122.25220.sum.autos.uvh5',
 '/lustre/aoc/projects/hera/H4C/2459122/zen.2459122.46046.sum.autos.uvh5',
 '/lustre/aoc/projects/hera/H4C/2459122/zen.2459122.25243.sum.autos.uvh5',
 '/lustre/aoc/projects/he

In [4]:
hd = HERAData(filepath) # load in files
data, flags, nsamples = hd.read() # load in auto-correlation data
# this process takes upwards of 15 minutes (reads in about 4 GB of data)

In [5]:
data

<hera_cal.datacontainer.DataContainer at 0x7fc9a8924160>

In [6]:
# load in good antenna keys
import ast # literal_eval parses the tuples without executing arbitrary code
with open("good-ant-keys_2459122.txt", "r") as f:
    good_key = f.readlines()
good_key = [ast.literal_eval(k.replace(',\n','')) for k in good_key]

In [7]:
good_key

[(36, 36, 'ee'),
 (36, 36, 'nn'),
 (50, 50, 'ee'),
 (50, 50, 'nn'),
 (53, 53, 'ee'),
 (53, 53, 'nn'),
 (58, 58, 'ee'),
 (58, 58, 'nn'),
 (66, 66, 'ee'),
 (66, 66, 'nn'),
 (68, 68, 'ee'),
 (68, 68, 'nn'),
 (82, 82, 'ee'),
 (82, 82, 'nn'),
 (83, 83, 'ee'),
 (83, 83, 'nn'),
 (85, 85, 'ee'),
 (85, 85, 'nn'),
 (91, 91, 'ee'),
 (91, 91, 'nn'),
 (92, 92, 'ee'),
 (92, 92, 'nn'),
 (98, 98, 'ee'),
 (98, 98, 'nn'),
 (99, 99, 'ee'),
 (99, 99, 'nn'),
 (100, 100, 'ee'),
 (100, 100, 'nn'),
 (102, 102, 'ee'),
 (102, 102, 'nn'),
 (103, 103, 'ee'),
 (103, 103, 'nn'),
 (104, 104, 'ee'),
 (104, 104, 'nn'),
 (105, 105, 'ee'),
 (105, 105, 'nn'),
 (108, 108, 'ee'),
 (108, 108, 'nn'),
 (109, 109, 'ee'),
 (109, 109, 'nn'),
 (117, 117, 'ee'),
 (117, 117, 'nn'),
 (118, 118, 'ee'),
 (118, 118, 'nn'),
 (120, 120, 'ee'),
 (120, 120, 'nn'),
 (124, 124, 'ee'),
 (124, 124, 'nn'),
 (127, 127, 'ee'),
 (127, 127, 'nn'),
 (128, 128, 'ee'),
 (128, 128, 'nn'),
 (129, 129, 'ee'),
 (129, 129, 'nn'),
 (130, 130, 'ee'),
 (130, 

In [8]:
# separate by polarization
good_key_ee = [] # initialize list
for key in good_key: # iterate through good antenna keys
    # only append the 'ee' keys
    if key[2] == 'ee':
        good_key_ee.append(key)

In [9]:
good_key_ee

[(36, 36, 'ee'),
 (50, 50, 'ee'),
 (53, 53, 'ee'),
 (58, 58, 'ee'),
 (66, 66, 'ee'),
 (68, 68, 'ee'),
 (82, 82, 'ee'),
 (83, 83, 'ee'),
 (85, 85, 'ee'),
 (91, 91, 'ee'),
 (92, 92, 'ee'),
 (98, 98, 'ee'),
 (99, 99, 'ee'),
 (100, 100, 'ee'),
 (102, 102, 'ee'),
 (103, 103, 'ee'),
 (104, 104, 'ee'),
 (105, 105, 'ee'),
 (108, 108, 'ee'),
 (109, 109, 'ee'),
 (117, 117, 'ee'),
 (118, 118, 'ee'),
 (120, 120, 'ee'),
 (124, 124, 'ee'),
 (127, 127, 'ee'),
 (128, 128, 'ee'),
 (129, 129, 'ee'),
 (130, 130, 'ee'),
 (135, 135, 'ee'),
 (140, 140, 'ee'),
 (141, 141, 'ee'),
 (143, 143, 'ee'),
 (144, 144, 'ee'),
 (156, 156, 'ee'),
 (157, 157, 'ee'),
 (158, 158, 'ee'),
 (160, 160, 'ee'),
 (162, 162, 'ee'),
 (163, 163, 'ee'),
 (164, 164, 'ee'),
 (165, 165, 'ee'),
 (176, 176, 'ee'),
 (178, 178, 'ee'),
 (181, 181, 'ee'),
 (185, 185, 'ee')]

In [10]:
# grab all 'ee' auto-correlations and stack them into one numpy array
# (the antenna keys are dropped, as GAN training is meant to be random)
auto_data = np.vstack([data[key] for key in good_key_ee])
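
To make the effect of the stacking step concrete, here is a small sketch with synthetic data. It assumes that each key maps to a 2D array of shape (Ntimes, Nfreqs), which is my understanding of how the `hera_cal` data container behaves. The dictionary `fake_data` and its sizes are stand-ins for illustration only, not real HERA data.

```python
import numpy as np

# Synthetic stand-in for the data container: 3 antennas, 4 time samples, 5 channels.
n_times, n_freqs = 4, 5
fake_keys = [(36, 36, 'ee'), (50, 50, 'ee'), (53, 53, 'ee')]
fake_data = {key: np.full((n_times, n_freqs), key[0], dtype=complex) for key in fake_keys}

# Stack all per-antenna arrays row-wise, as in the stacking step above.
stacked = np.vstack([fake_data[key] for key in fake_keys])

print(stacked.shape)  # (12, 5): one row per (antenna, time) pair
```

Each row of the stacked array is a single spectrum, and nothing in the array records which antenna it came from.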

In [11]:
# split data randomly into training and validation set
auto_data_train, auto_data_valid = sk.train_test_split(auto_data, test_size = 0.15, random_state=42)
# 15% validation size, as the auto-corr. data is extremely large (>160,000 signals), so a smaller fraction can be used than standard

# split validation set into 75% for validation and 25% for test
auto_data_valid, auto_data_test = sk.train_test_split(auto_data_valid, test_size = 0.25, random_state=42) # split validation to validation and test data
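
Because the second split acts on the 15% held out by the first, the final proportions are not obvious at a glance. The sketch below works them out from the two `test_size` values used above; the variable names are my own and it does not touch the real data.

```python
from fractions import Fraction

# Test-size arguments used in the two train_test_split calls.
first_split = Fraction(15, 100)   # held out from the full data set
second_split = Fraction(25, 100)  # share of the held-out part reserved for testing

train_frac = 1 - first_split
valid_frac = first_split * (1 - second_split)
test_frac = first_split * second_split

for name, frac in [("train", train_frac), ("valid", valid_frac), ("test", test_frac)]:
    print(f"{name}: {float(frac):.2%}")
```

This gives 85% for training, 11.25% for validation, and 3.75% for testing. The actual sample counts can differ slightly, since scikit-learn rounds each split to a whole number of samples.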

In [12]:
# access the frequency channels (standard for all auto-correlations)
freqs = np.array(data.freqs) 

In [19]:
auto_data_train[-1].real

array([2426011., 2469778., 2512869., ..., 6432925., 6393291., 6340315.])

In [20]:
auto_data_valid[-1].real

array([ 4117953.,  4213873.,  4308756., ..., 11929441., 11800194.,
       11729282.])

In [21]:
auto_data_test[-1].real

array([3397256., 3465313., 3542367., ..., 9312810., 9229486., 9144309.])

In [13]:
# save all data; it can now be accessed locally for GAN training in the 'HERA-GAN_joseph-shy_memo' notebook
np.save('HERA_auto-corr_freqs.npy',freqs) # save frequency data
np.save('2459122_good_auto-corrs_valid.npy',auto_data_valid) 
np.save('2459122_good_auto-corrs_test.npy',auto_data_test) 
np.save('2459122_good_auto-corrs_train.npy',auto_data_train) 
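
After saving, a quick round-trip check can confirm that a file loads back unchanged. This is a minimal sketch on a small synthetic complex array written to a temporary directory; the helper `roundtrip_ok` and the demo file name are hypothetical, and the same comparison could be applied to the real arrays if memory allows.

```python
import os
import tempfile
import numpy as np

def roundtrip_ok(array, path):
    """Save `array` to `path`, load it back, and compare shape and values."""
    np.save(path, array)
    loaded = np.load(path)
    return bool(loaded.shape == array.shape and np.array_equal(loaded, array))

# Synthetic complex-valued stand-in for one of the saved splits.
demo = np.arange(12, dtype=float).reshape(3, 4) + 0j

with tempfile.TemporaryDirectory() as tmp:
    ok = roundtrip_ok(demo, os.path.join(tmp, "demo_auto-corrs.npy"))

print(ok)
```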