# Pre-process test data

This notebook walks through the steps for preprocessing a high S/N and a low S/N test set.
* required packages: numpy, h5py, vos
* required data files: apStar_combined_main.h5 and training_data.h5

In [1]:
import numpy as np
import h5py
import os
import GetData

datadir = GetData.getDataDir()
F = h5py.File(datadir + 'apStar_combined_main.h5','r')
data = GetData.get(F, False)

indices = data['indices']
data['IDs'] = data['IDs'][indices]
for col in GetData.getCols():
    data[col] = data[col][indices]

print(str(len(set(data['IDs']))) + ' stars remain.')




Dataset keys in file: 

['0_FE', '0_FE_ERR', 'ALPHA_M', 'AL_FE', 'AL_FE_ERR', 'CA_FE', 'CA_FE_ERR', 'C_FE', 'C_FE_ERR', 'FE_H', 'FE_H_ERR', 'IDs', 'K_FE', 'K_FE_ERR', 'LOGG', 'LOGG_ERR', 'MG_FE', 'MG_FE_ERR', 'MN_FE', 'MN_FE_ERR', 'Main Sequence Flag', 'NA_FE', 'NA_FE_ERR', 'NI_FE', 'NI_FE_ERR', 'N_FE', 'N_FE_ERR', 'PARAM', 'SI_FE', 'SI_FE_ERR', 'S_FE', 'S_FE_ERR', 'TEFF', 'TEFF_ERR', 'TI_FE', 'TI_FE_ERR', 'VRAD', 'VRAD_ERR', 'VSCATTER', 'V_FE', 'V_FE_ERR', 'aspcap_flag', 'error_spectrum', 'num_visits', 'spectrum', 'stacked_snr', 'star_flag', 'targ1_flag', 'targ2_flag']
['TEFF', 'FE_H', 'ALPHA_M', 'C_FE', 'N_FE', 'stacked_snr', 'LOGG', 'star_flag', 'aspcap_flag', 'VSCATTER']
Obtained data for 143482 stars.
main flags [35591]
FE_H flags [35591]
ALPHA_M flags [35591]
C_FE flags [35588]
N_FE flags [35587]
['TEFF', 'FE_H', 'ALPHA_M', 'C_FE', 'N_FE', 'stacked_snr', 'LOGG', 'star_flag', 'aspcap_flag', 'VSCATTER']
35507 stars remain.


**Collect label normalization data**

Create a file containing the mean and standard deviation of each parameter, which are used to normalize the labels during training and testing. Note that these statistics are computed only after the unwanted records have been removed.

In [2]:
params = GetData.getParams()
mean = [np.mean(data[p]) for p in params]
std = [np.std(data[p]) for p in params]
mean_and_std = np.vstack((np.array(mean), np.array(std)))  # row 0: means, row 1: standard deviations
np.save(datadir+'mean_and_std', mean_and_std)

print('mean_and_std.npy saved')

mean_and_std.npy saved
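
As a minimal sketch of how the saved array might be used later, assuming row 0 holds the means and row 1 the standard deviations (the order in which they are stacked above), labels can be normalized and recovered as follows. The label values and the helper names `normalize_labels` and `denormalize_labels` are hypothetical and are not part of `GetData`.

```python
import numpy as np

# Hypothetical label table: 4 stars x 3 parameters (values are made up)
labels = np.array([[4500.0, 2.5, -0.3],
                   [4800.0, 3.0,  0.1],
                   [5100.0, 3.5, -0.1],
                   [4200.0, 1.8, -0.5]])

# Same layout as mean_and_std.npy: row 0 = means, row 1 = standard deviations
mean_and_std = np.vstack((labels.mean(axis=0), labels.std(axis=0)))

def normalize_labels(x, stats):
    """Shift each column to zero mean and scale it to unit standard deviation."""
    return (x - stats[0]) / stats[1]

def denormalize_labels(z, stats):
    """Invert normalize_labels to recover the labels in physical units."""
    return z * stats[1] + stats[0]

normalized = normalize_labels(labels, mean_and_std)
recovered = denormalize_labels(normalized, mean_and_std)
```

Using the same statistics file for both training and testing keeps the predicted labels on a common scale.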


**Load training set APOGEE IDs**

Load the previously created file that contains the training data. This file was created in 2_Preprocessing_of_Training_Data.ipynb. None of the APOGEE IDs used in the training set should appear in the test set.

In [3]:
savename = 'training_data.h5'

with h5py.File(datadir + savename, "r") as f:
    train_ap_id = f['Ap_ID'][:]

**Separate data for High S/N test set**

In [4]:
train_id_set = set(train_ap_id)  # set lookup avoids scanning the whole ID array for every star
indices_test = [i for i, item in enumerate(data['IDs']) if item not in train_id_set]
test_ap_id = data['IDs'][indices_test]
test_combined_snr = data['stacked_snr'][indices_test]
test_data = {}
for p in GetData.getParams():
    test_data[p] = data[p][indices_test]

indices_test_set = indices[indices_test] # These indices will be used to index through the spectra

print('Test set includes '+str(len(test_ap_id))+' combined spectra')

Test set includes 21029 combined spectra
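
The exclusion of training IDs can also be expressed as a vectorized mask. The sketch below uses `np.isin` on small made-up ID arrays; it assumes both arrays hold byte strings of the same kind, which may not match the real files exactly.

```python
import numpy as np

# Hypothetical IDs standing in for data['IDs'] and train_ap_id
all_ids = np.array([b'2M0001', b'2M0002', b'2M0003', b'2M0004'])
train_ids = np.array([b'2M0002', b'2M0004'])

# True where an ID does NOT appear in the training set
is_test = ~np.isin(all_ids, train_ids)

# Integer positions of the test stars, usable to index other arrays
indices_test = np.flatnonzero(is_test)
test_ids = all_ids[indices_test]
```

Here `indices_test` plays the same role as the list built in the cell above.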


**Collect the spectra and error spectra, then normalize each spectrum and save the data**

**Steps taken to normalize spectra:**
1. Separate each spectrum into the three detector chips (blue, green and red).
2. Divide each chip by its median flux value.
3. Recombine the chips into a single vector of 7214 flux values.
4. Normalize the error spectra with the same median values so that they can be used in the error propagation.
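
The steps above can be sketched for a single 1-D spectrum as follows. This is an illustration under stated assumptions, not the code used in the cells below: the function name `normalize_by_chip`, the toy spectrum, and its length of 8500 pixels are hypothetical, while the chip edges are the ones defined in this notebook.

```python
import numpy as np

# Chip edges (begin, end) as defined in this notebook: blue, green, red
CHIPS = [(322, 3242), (3648, 6048), (6412, 8306)]

def normalize_by_chip(spectrum, err_spectrum, chips=CHIPS):
    """Divide each chip of the flux and error spectrum by that chip's median flux."""
    norm_flux, norm_err = [], []
    for begin, end in chips:
        flux = spectrum[begin:end]
        med = np.median(flux)
        norm_flux.append(flux / med)
        # The error spectrum is scaled by the same flux median
        norm_err.append(err_spectrum[begin:end] / med)
    return np.concatenate(norm_flux), np.concatenate(norm_err)

# Toy spectrum: flat continuum near 100 with small noise, constant error of 2
rng = np.random.default_rng(0)
raw = 100.0 + rng.normal(0.0, 1.0, size=8500)
raw_err = np.full(8500, 2.0)

flux, err = normalize_by_chip(raw, raw_err)
```

The three chips contain 2920, 2400 and 1894 pixels, which sum to the 7214 flux values mentioned in step 3.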

In [5]:
# Define edges of detectors
blue_chip_begin = 322
blue_chip_end = 3242
green_chip_begin = 3648
green_chip_end = 6048   
red_chip_begin = 6412
red_chip_end = 8306 

In [6]:
savename = 'test_data.h5'

with h5py.File(datadir + savename, "w") as f:
    
    # Create datasets for your test data file 
    spectra_ds = f.create_dataset('spectrum', (1,7214), maxshape=(None,7214), dtype="f", chunks=(1,7214))
    error_spectra_ds = f.create_dataset('error_spectrum', (1,7214), maxshape=(None,7214), dtype="f", chunks=(1,7214))
    ap_id_ds = f.create_dataset('Ap_ID', test_ap_id.shape, dtype="S18")
    ap_id_ds[:] = test_ap_id.tolist()
    combined_snr_ds = f.create_dataset('combined_snr', test_combined_snr.shape, dtype="f")
    combined_snr_ds[:] = test_combined_snr
    for p in GetData.getParams():
        p_ds = f.create_dataset(p, test_data[p].shape, dtype="f")
        # Save data to data file
        p_ds[:] = test_data[p]
    
    # Collect spectra
    first_entry=True
    
    for i in indices_test_set:

        spectrum = F['spectrum'][i:i+1]
        spectrum[np.isnan(spectrum)]=0.
        
        err_spectrum = F['error_spectrum'][i:i+1]

        # NORMALIZE SPECTRUM
        # Separate spectra into chips
        blue_sp = spectrum[0:1,blue_chip_begin:blue_chip_end]
        green_sp = spectrum[0:1,green_chip_begin:green_chip_end]
        red_sp = spectrum[0:1,red_chip_begin:red_chip_end]
        
        blue_sp_med = np.median(blue_sp, axis=1)
        green_sp_med = np.median(green_sp, axis=1)
        red_sp_med = np.median(red_sp, axis=1)

        # Normalize spectra by chips
        blue_sp = (blue_sp.T/blue_sp_med).T
        green_sp = (green_sp.T/green_sp_med).T
        red_sp = (red_sp.T/red_sp_med).T

        # Recombine spectra
        spectrum = np.column_stack((blue_sp,green_sp,red_sp))
        
        # Normalize error spectrum using the same method
        # Separate error spectra into chips

        blue_sp = err_spectrum[0:1,blue_chip_begin:blue_chip_end]
        green_sp = err_spectrum[0:1,green_chip_begin:green_chip_end]
        red_sp = err_spectrum[0:1,red_chip_begin:red_chip_end]

        # Normalize error spectra by chips
        blue_sp = (blue_sp.T/blue_sp_med).T
        green_sp = (green_sp.T/green_sp_med).T
        red_sp = (red_sp.T/red_sp_med).T

        # Recombine error spectra
        err_spectrum = np.column_stack((blue_sp,green_sp,red_sp))
        
        
        if first_entry:
            spectra_ds[0] = spectrum
            error_spectra_ds[0] = err_spectrum
            
            first_entry=False
        else:
            spectra_ds.resize(spectra_ds.shape[0]+1, axis=0)
            error_spectra_ds.resize(error_spectra_ds.shape[0]+1, axis=0)

            spectra_ds[-1] = spectrum
            error_spectra_ds[-1] = err_spectrum

print(savename+' has been saved as the test set to be used in 5_Test_Model.ipynb')  

test_data.h5 has been saved as the test set to be used in 5_Test_Model.ipynb
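
Because the number of test spectra is known before the loop starts, the datasets could also be allocated at full size up front instead of being resized once per spectrum. The sketch below shows that alternative on a tiny stand-in file and reads it back to check the result. The file name `toy_test_data.h5`, the temporary directory, and the constant-valued rows are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

n_stars, n_pix = 3, 7214
# Hypothetical output path in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'toy_test_data.h5')

with h5py.File(path, 'w') as f:
    # The final size is known, so every row is allocated at once
    spectra_ds = f.create_dataset('spectrum', (n_stars, n_pix), dtype='f',
                                  chunks=(1, n_pix))
    for row in range(n_stars):
        # Stand-in for one normalized spectrum
        spectra_ds[row] = np.full(n_pix, float(row))

# Read the file back to confirm what was written
with h5py.File(path, 'r') as f:
    saved_shape = f['spectrum'].shape
    last_row = f['spectrum'][-1]
```

A read-back check like this is a cheap way to confirm that the number of saved spectra matches the number of test IDs.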
