# Preprocess training and test set for StarNet
This notebook walks through pre-processing the data needed to train StarNet and setting aside a high S/N test set.

Requirements:
- python packages: `numpy h5py vos`
- required data files: `apStar_visits_main.h5`

**Load the file that contains individual visit spectra along with APOGEE data associated with each star**

In [1]:
import numpy as np
import h5py
import os
import GetData

datadir = GetData.getDataDir()
F = h5py.File(datadir + 'apStar_visits_main.h5','r')
data = GetData.get(F, True)




Dataset keys in file: 

['0_FE', '0_FE_ERR', 'ALPHA_M', 'AL_FE', 'AL_FE_ERR', 'CA_FE', 'CA_FE_ERR', 'C_FE', 'C_FE_ERR', 'FE_H', 'FE_H_ERR', 'IDs', 'K_FE', 'K_FE_ERR', 'LOGG', 'LOGG_ERR', 'MG_FE', 'MG_FE_ERR', 'MN_FE', 'MN_FE_ERR', 'NA_FE', 'NA_FE_ERR', 'NI_FE', 'NI_FE_ERR', 'N_FE', 'N_FE_ERR', 'PARAM', 'SI_FE', 'SI_FE_ERR', 'S_FE', 'S_FE_ERR', 'TEFF', 'TEFF_ERR', 'TI_FE', 'TI_FE_ERR', 'VRAD', 'VRAD_ERR', 'VSCATTER', 'V_FE', 'V_FE_ERR', 'aspcap_flag', 'bluegreen_persist', 'error_spectrum', 'greenred_persist', 'num_visits', 'spectrum', 'stacked_snr', 'star_flag', 'star_flag_indiv', 'targ1_flag', 'targ2_flag', 'visit_snr']
['TEFF', 'FE_H', 'ALPHA_M', 'C_FE', 'N_FE', 'stacked_snr', 'LOGG', 'star_flag', 'aspcap_flag', 'VSCATTER']
Obtained data for 143467 stars.
main flags [113956]
snr flags [53135]
FE_H flags [53135]
ALPHA_M flags [53135]
C_FE flags [53131]
N_FE flags [53128]
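
The counts above shrink as each cut is applied in turn. The criteria used inside `GetData.get` are not shown in this notebook, so the following is only a minimal sketch of how successive cuts of this kind can be combined with boolean masks. The function name `select_indices`, the threshold `snr_min=200.0`, and the sentinel `bad_value=-9999.0` are assumptions made for illustration, not values taken from `GetData`.

```python
import numpy as np

def select_indices(snr, labels, snr_min=200.0, bad_value=-9999.0):
    """Return indices that pass an S/N cut and have valid values for every label.

    Hypothetical helper: snr_min and bad_value are illustrative assumptions.
    """
    snr = np.asarray(snr, dtype=float)
    keep = snr >= snr_min  # S/N cut
    for values in labels:
        values = np.asarray(values, dtype=float)
        # Drop entries that are NaN/inf or equal to the sentinel value
        keep &= np.isfinite(values) & (values != bad_value)
    return np.flatnonzero(keep)

# Small synthetic example (not APOGEE data)
snr = [250.0, 50.0, 300.0, 400.0]
fe_h = [0.1, 0.0, -9999.0, -0.3]
c_fe = [0.0, 0.1, 0.2, np.nan]
selected = select_indices(snr, [fe_h, c_fe])
print(selected)
```

Only the first entry survives here: the second fails the S/N cut, the third carries the sentinel value, and the fourth has a NaN label.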


**Select the first `num_ref` visits for the reference set**

We shuffle the selected indices so that the order of the visits in the file does not carry over into the reference set.
Later on, the reference set will be split into training and cross-validation sets.
The remaining high S/N spectra will be used in the test set.
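
The later split into training and cross-validation sets is not part of this notebook. As a rough sketch of the idea, assuming a hypothetical helper `shuffle_and_split` and an illustrative `train_fraction` of 0.9 (neither comes from StarNet), a seeded shuffle followed by a split could look like this:

```python
import numpy as np

def shuffle_and_split(indices, train_fraction=0.9, seed=0):
    """Shuffle indices reproducibly and split them into two disjoint sets.

    Hypothetical helper: train_fraction and seed are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(indices)  # returns a shuffled copy
    n_train = int(round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]

train_idx, cv_idx = shuffle_and_split(np.arange(1000))
print(len(train_idx), len(cv_idx))
```

Unlike `np.random.shuffle`, which reorders its argument in place, `Generator.permutation` leaves the input array untouched, and fixing the seed makes the split repeatable.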

In [2]:
num_ref = 44784 # number of reference spectra

indices_ref = data['indices'][0:num_ref]
np.random.shuffle(indices_ref)

ap_id_ref = data['IDs'][indices_ref]
for p in GetData.getParams():
    data[p] = data[p][indices_ref]

print('Reference set includes '+str(len(ap_id_ref))+' individual visits from '+str(len(set(ap_id_ref)))+' stars.')

Reference set includes 44784 individual visits from 14498 stars.


**Now collect individual visit spectra, normalize each spectrum, and save data**

**Normalize spectra**
1. separate each spectrum into the three detector chips (blue, green, red)
2. divide each chip by its median flux value
3. recombine the chips into a single vector of 7214 flux values
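
The three steps above can be sketched as one vectorized function. This is an illustration on synthetic data, not the notebook's own cell: the name `normalize_by_chip` is hypothetical, the chip edges are the ones this notebook defines, and the input width of 8575 pixels is an assumption about the raw spectra.

```python
import numpy as np

# Chip edges (pixel indices) as defined in this notebook
CHIPS = [(322, 3242), (3648, 6048), (6412, 8306)]

def normalize_by_chip(spectra, chips=CHIPS):
    """Divide each chip by its per-spectrum median and join the chips.

    spectra: array of shape (n_spectra, n_pixels). Hypothetical helper.
    """
    spectra = np.array(spectra, dtype=float)  # copy so the input is not modified
    spectra[np.isnan(spectra)] = 0.0          # replace NaN flux with zero
    pieces = []
    for begin, end in chips:
        chip = spectra[:, begin:end]
        median = np.median(chip, axis=1, keepdims=True)
        pieces.append(chip / median)
    return np.concatenate(pieces, axis=1)

# Synthetic spectra with strictly positive flux (not APOGEE data)
rng = np.random.default_rng(0)
fake = rng.uniform(0.5, 1.5, size=(4, 8575))  # 8575 pixels is an assumption
fake[0, 400] = np.nan                         # one bad pixel inside the blue chip
normed = normalize_by_chip(fake)
print(normed.shape)
```

Like the notebook's loop, this sketch does not guard against a chip whose median is zero, which would produce a division by zero for a spectrum that is mostly NaN or zero in that chip.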

In [3]:
# Define edges of detectors
blue_chip_begin = 322
blue_chip_end = 3242
green_chip_begin = 3648
green_chip_end = 6048   
red_chip_begin = 6412
red_chip_end = 8306 
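
As a quick sanity check, the chip widths implied by these edges should add up to the 7214 flux values stored per spectrum. A small sketch, with the dictionary layout chosen only for illustration:

```python
# Chip edges copied from the cell above
chips = {
    "blue": (322, 3242),
    "green": (3648, 6048),
    "red": (6412, 8306),
}

# Width of each chip in pixels, and the length of the recombined spectrum
widths = {name: end - begin for name, (begin, end) in chips.items()}
total = sum(widths.values())
print(widths, total)
# {'blue': 2920, 'green': 2400, 'red': 1894} 7214
```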

In [4]:
savename = 'training_data.h5'

with h5py.File(datadir + savename, "w") as f:
    
    # Create datasets for your reference data file 
    spectra_ds = f.create_dataset('spectrum', (1,7214), maxshape=(None,7214), dtype="f", chunks=(1,7214))
    ap_id_ds = f.create_dataset('Ap_ID', ap_id_ref.shape, dtype="S18")
    ap_id_ds[:] = ap_id_ref.tolist()
    for p in GetData.getParams():
        p_ds = f.create_dataset(p, data[p].shape, dtype="f")
        p_ds[:] = data[p]
        
    first_entry=True
    
    for i in indices_ref:

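        # Read one visit spectrum, keeping a leading axis of length 1
        spectrum = F['spectrum'][i:i+1]
        # Replace NaN flux values with zero
        spectrum[np.isnan(spectrum)] = 0.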

        # NORMALIZE SPECTRUM
        # Separate spectra into chips
        blue_sp = spectrum[0:1,blue_chip_begin:blue_chip_end]
        green_sp = spectrum[0:1,green_chip_begin:green_chip_end]
        red_sp = spectrum[0:1,red_chip_begin:red_chip_end]

        # Normalize spectra by chips

        blue_sp = (blue_sp.T/np.median(blue_sp, axis=1)).T
        green_sp = (green_sp.T/np.median(green_sp, axis=1)).T
        red_sp = (red_sp.T/np.median(red_sp, axis=1)).T 

        # Recombine spectra

        spectrum = np.column_stack((blue_sp,green_sp,red_sp))
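        # The dataset starts with a single row: fill it first,
        # then grow the dataset by one row for each later spectrum
        if first_entry:
            spectra_ds[0] = spectrum
            first_entry = False
        else:
            spectra_ds.resize(spectra_ds.shape[0]+1, axis=0)
            spectra_ds[-1] = spectrum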

print(savename+' has been saved as the reference set to be used in 4_Train_Model.ipynb')  

training_data.h5 has been saved as the reference set to be used in 4_Train_Model.ipynb
