Formatting Your Data for Eureka! Light Curve Extraction and Fitting
====================================================================

This notebook will give a quick example of how to format an arbitrary dataset for input into Eureka! to make use of its light curve extraction and fitting routines (i.e. Eureka! Stage 4 and Stage 5, respectively). At the moment, this notebook is written specifically in the context of understanding and mimicking the Stage 3 Eureka! output data, which contains extracted 1D spectra.

The Eureka! Data Format
-----------------------
To begin, let's load a Eureka! file to see how things *should* be formatted. 

In [1]:
base_dir = '/Users/acarter/Documents/TRANSITS/EUREKA/ALTERNATE_INPUTS/NIRCam_TESTING/'
eureka_file = base_dir + 'Stage3/S3_2024-01-10_nircam_template_run1/ap8_bg8/S3_nircam_template_ap8_bg8_SpecData.h5'

You'll see that this is a `.h5` (HDF5) file. Many Eureka! outputs are saved in this format, and more specifically as [Xarray](https://docs.xarray.dev/en/latest/index.html) datasets, which bundle labelled, multi-dimensional arrays together with their coordinates and metadata. Reading these files into Python requires a suitable library. There are multiple options, but we'll use the [Astraeus](https://kevin218.github.io/Astraeus/) library created by Kevin Stevenson.

In [2]:
from astraeus.xarrayIO import readXR
eureka_data = readXR(eureka_file)

Finished loading parameters from /Users/acarter/Documents/TRANSITS/EUREKA/ALTERNATE_INPUTS/NIRCam_TESTING/Stage3/S3_2024-01-10_nircam_template_run1/ap8_bg8/S3_nircam_template_ap8_bg8_SpecData.h5


Our file was successfully loaded, so let's take a look at what's inside:

In [3]:
print(eureka_data)

<xarray.Dataset>
Dimensions:      (time: 128, y: 59, x: 1600)
Coordinates:
  * time         (time) float64 5.969e+04 5.969e+04 ... 5.969e+04 5.969e+04
  * y            (y) int64 5 6 7 8 9 10 11 12 13 ... 55 56 57 58 59 60 61 62 63
  * x            (x) int64 100 101 102 103 104 105 ... 1695 1696 1697 1698 1699
Data variables:
    wave_1d      (x) float32 ...
    medflux      (y, x) float64 ...
    centroid_y   (time) float64 ...
    centroid_sy  (time) float64 ...
    stdspec      (time, x) float32 ...
    stdvar       (time, x) float32 ...
    optspec      (time, x) float64 ...
    opterr       (time, x) float64 ...
    optmask      (time, x) bool ...
Attributes: (12/74)
    ncpu:                  1
    nfiles:                1
    max_memory:            0.5
    indep_batches:         False
    suffix:                calints
    calibrated_spectra:    False
    ...                    ...
    bg_y2:                 38
    bg_y1:                 21
    save_fluxdata:         False
    ci

<br>
Phew! There's a lot going on here, so let's take it step by step:
<br>&nbsp;

`Dimensions:` This tells us the *potential* dimensionality of various arrays within the dataset. In this case there are 3 potential dimensions: time (the time steps for each measurement), y (the detector y pixels), and x (the detector x pixels). 
&nbsp;

`Coordinates:` This tells us the specific values for the coordinates that correspond to the previously mentioned `Dimensions`. Here we can see the time is recorded in MJD as a float, and the x- and y-pixel values are recorded as integers. Note that the x and y coordinates don't have to start at zero, as the initial detector array was cropped during spectral extraction. 
&nbsp;

`Data variables:` This tells us the various data arrays that are saved in the dataset. We won't go through them all here, but notice how their dimensionality corresponds to the `Dimensions`. For example, `wave_1d` is a data variable of the x-pixel dimension, whereas `optspec` is a variable of the time *and* x-pixel dimension. This makes sense, as each x-pixel should have a corresponding wavelength (in a case like this, where the x-axis is the dispersion axis). Similarly, all of our extracted 1D spectra, `optspec`, should have a corresponding time at which each spectrum was measured, and a corresponding x-pixel for each pixel column within the extracted region. 
&nbsp;

`Attributes:` Finally, this tells us a range of metadata attributes corresponding to various settings and outputs from the Eureka! processing. For example, the y-pixel limits at which the background flux was estimated, the output filename of the data, and more. 
&nbsp;

Remember this is just the default output from Eureka! for its Stage 3 spectral extraction. When generating a custom file for input into Stage 4, we don't necessarily need to define all of these variables and attributes.
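
To make the walkthrough above a little more concrete, here is a minimal sketch of how individual variables and attributes can be pulled out of a dataset with this layout. It builds a small stand-in directly with Xarray rather than loading a real Stage 3 file, so the array sizes and values are invented; only the names `wave_1d`, `optspec` and `bg_y1` are taken from the printout above.

```python
import numpy as np
import xarray as xr

# Small stand-in for a Stage 3 dataset (sizes and values are invented)
n_time, n_x = 4, 6
demo = xr.Dataset(
    data_vars={
        "wave_1d": (["x"], np.linspace(2.4, 4.0, n_x)),
        "optspec": (["time", "x"], np.ones((n_time, n_x))),
    },
    coords={
        "time": np.linspace(59694.0, 59694.1, n_time),
        "x": np.arange(100, 100 + n_x),  # pixel coordinates need not start at 0
    },
    attrs={"bg_y1": 21},
)

# Data variables are accessed by name; .dims and .shape describe their layout
print(demo["optspec"].dims, demo["optspec"].shape)

# .values returns the underlying NumPy array
wavelengths = demo["wave_1d"].values

# Select the spectrum at the first time step (position-based selection)
first_spectrum = demo["optspec"].isel(time=0).values

# Select by coordinate label instead, here detector column x=102
column_lightcurve = demo["optspec"].sel(x=102).values

# Metadata lives in the .attrs dictionary
bg_y1 = demo.attrs["bg_y1"]
```

The same access pattern should apply to `eureka_data` loaded earlier, since it is also an Xarray dataset.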

Mimicking The Eureka! Format
----------------------------
Okay, so let's say you want to create your own file to feed into Eureka! so that you can make use of its light curve extraction and fitting functionality. Assuming your data are already in the form of an extracted spectral time series, here is what you will need:
&nbsp;

**Required**

* The time steps for each of your spectra (`time`), and units.
* The wavelength array for the spectra (`wave_1d`), and units.
* The extracted spectrum at each time step (`optspec`), and units.
* The error on the extracted spectrum at each time step (`opterr`), and units.

**Optional**

* A 2D mask for specific points to ignore in the spectral time series (`optmask`).
* The shift in the spectral trace centroid and width over time (`centroid_y` and `centroid_sy`, respectively).
* Various other variables and attributes.
&nbsp;

Using this as a rough guide, let's build a mock dataset.

In [4]:
import numpy as np

time = np.linspace(59694.0, 59694.1, 100)
time_units = 'MJD'
wave_1d = np.linspace(1.0, 5.0, 1001)
x = np.arange(len(wave_1d))
wave_units = 'microns'
optspec = np.ones((len(time), len(wave_1d)))
opterr = np.ones((len(time), len(wave_1d)))*0.01
flux_units = 'e-'

Note that these values are entirely made up for the purpose of this notebook - you should be setting these variables to *your* data values!
&nbsp;
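
When swapping in real data, it can help to confirm that the arrays agree with each other before building the dataset. The sketch below is one possible check, not part of Eureka! or Astraeus: the function name `check_spectral_arrays` is hypothetical, and it only tests the shape relationships implied by the requirements listed above (`optspec` and `opterr` indexed by time, then wavelength).

```python
import numpy as np


def check_spectral_arrays(time, wave_1d, optspec, opterr):
    """Raise ValueError if the arrays are mutually inconsistent.

    Hypothetical helper: returns the expected (n_time, n_wave) shape.
    """
    time = np.asarray(time)
    wave_1d = np.asarray(wave_1d)
    optspec = np.asarray(optspec)
    opterr = np.asarray(opterr)

    if time.ndim != 1 or wave_1d.ndim != 1:
        raise ValueError("time and wave_1d must both be 1D arrays")

    expected = (time.size, wave_1d.size)
    if optspec.shape != expected:
        raise ValueError(f"optspec has shape {optspec.shape}, expected {expected}")
    if opterr.shape != expected:
        raise ValueError(f"opterr has shape {opterr.shape}, expected {expected}")
    return expected


# Same mock values as the cell above
shape = check_spectral_arrays(
    np.linspace(59694.0, 59694.1, 100),
    np.linspace(1.0, 5.0, 1001),
    np.ones((100, 1001)),
    np.full((100, 1001), 0.01),
)
print(shape)  # (100, 1001)
```

A transposed `optspec` (wavelength first, time second) is a common slip, and this check would flag it before it reaches Eureka!.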

To build the Xarray dataset, we will again use the [Astraeus](https://kevin218.github.io/Astraeus/) library. Let's start by initialising an empty dataset:

In [5]:
from astraeus.xarrayIO import makeDataset

custom_dataset = makeDataset()

Now we can assign all of the arrays from above to the correct Xarray DataArray format: 

In [6]:
custom_dataset['time'] = time
custom_dataset['time'].attrs['time_units'] = time_units
custom_dataset['x'] = x
custom_dataset['optspec'] = (['time', 'x'], optspec)
custom_dataset['optspec'].attrs['flux_units'] = flux_units
custom_dataset['optspec'].attrs['time_units'] = time_units
custom_dataset['optspec'].attrs['wave_units'] = wave_units
custom_dataset['opterr'] = (['time', 'x'], opterr)
custom_dataset['opterr'].attrs['flux_units'] = flux_units
custom_dataset['opterr'].attrs['time_units'] = time_units
custom_dataset['opterr'].attrs['wave_units'] = wave_units
custom_dataset['wave_1d'] = (['x'], wave_1d)
custom_dataset['wave_1d'].attrs['wave_units'] = wave_units

As you can see, it's relatively straightforward to assign everything. To include additional variables, assign each one to a new key in `custom_dataset`, making sure the name matches a variable that Eureka! will be looking for. Now, let's see what the dataset looks like:

In [7]:
print(custom_dataset)

<xarray.Dataset>
Dimensions:  (time: 100, x: 1001)
Coordinates:
  * time     (time) float64 5.969e+04 5.969e+04 ... 5.969e+04 5.969e+04
  * x        (x) int64 0 1 2 3 4 5 6 7 8 ... 993 994 995 996 997 998 999 1000
Data variables:
    optspec  (time, x) float64 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
    opterr   (time, x) float64 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01
    wave_1d  (x) float64 1.0 1.004 1.008 1.012 1.016 ... 4.988 4.992 4.996 5.0


<br>
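
The optional variables listed earlier can be attached in the same way. The sketch below uses a plain Xarray dataset as a stand-in for `custom_dataset`, and all values are invented. Note one assumption in particular: it treats `True` in `optmask` as "ignore this point", which you should check against the convention used by your Eureka! version.

```python
import numpy as np
import xarray as xr

# Stand-in for custom_dataset, with the same dimensions as the mock data
n_time, n_x = 100, 1001
ds = xr.Dataset(
    coords={
        "time": np.linspace(59694.0, 59694.1, n_time),
        "x": np.arange(n_x),
    }
)
ds["optspec"] = (["time", "x"], np.ones((n_time, n_x)))

# 2D boolean mask with the same (time, x) layout as optspec.
# ASSUMPTION: True marks a point to ignore; verify for your Eureka! version.
optmask = np.zeros((n_time, n_x), dtype=bool)
optmask[10, 500] = True  # flag a single (invented) outlier
ds["optmask"] = (["time", "x"], optmask)

# Trace centroid position and width, one value per time step (invented)
ds["centroid_y"] = (["time"], np.zeros(n_time))
ds["centroid_sy"] = (["time"], np.ones(n_time))
```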

All of the data is now in the correct format, and the variable names match those from the actual Eureka! file. The only thing that remains is to assign any relevant `Attributes`. Advanced users may wish to assign more, but for most use cases only two are mandatory: the instrument the data was obtained with, and a flag to let Eureka! know it's looking at a custom dataset.

In [8]:
# Specify instrument, options are: 'nircam', 'miri', 'nirspec', 'niriss', 'wfc3'
custom_dataset.attrs['inst'] = 'nircam'
# Specify data_format, must be set to 'custom'
custom_dataset.attrs['data_format'] = 'custom'

Finally, we can use Astraeus once more to save the dataset to a Eureka! compatible file.
&nbsp;

<div class="alert alert-warning">

Important!
&nbsp;

To mimic a Stage 3 file, your custom file _**must**_ be named correctly. Specifically, it must follow the format:

<b>
S3_{ecf_event_label}{other_file_text}_SpecData.h5 </b>
<br>&nbsp;

You may replace `other_file_text` with any text you like. The `ecf_event_label` is connected to the filename of the Stage 4 ECF file you are planning to use; if you are unfamiliar with Eureka! event labels, please see the quickstart description **[here](https://eurekadocs.readthedocs.io/en/latest/quickstart.html#customise-the-demo-files)**.

</div>
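
The naming rule above can be written as a small helper so the filename is built the same way every time. This is only a sketch: the function names are hypothetical, and the event label `custom` used in the example is an assumption about what your Stage 4 ECF is called.

```python
def stage3_filename(ecf_event_label, other_file_text=""):
    """Build a filename of the form S3_{label}{other_text}_SpecData.h5."""
    if not ecf_event_label:
        raise ValueError("ecf_event_label must not be empty")
    return f"S3_{ecf_event_label}{other_file_text}_SpecData.h5"


def looks_like_stage3_file(filename, ecf_event_label):
    """Check a filename against the required prefix and suffix."""
    return (filename.startswith(f"S3_{ecf_event_label}")
            and filename.endswith("_SpecData.h5"))


# Hypothetical event label 'custom' with extra text '_dataset'
name = stage3_filename("custom", "_dataset")
print(name)  # S3_custom_dataset_SpecData.h5
print(looks_like_stage3_file(name, "custom"))  # True
```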

In [9]:
from astraeus.xarrayIO import writeXR
import os

# Make sure we have made the directory to save the file to
save_dir = base_dir+'Custom_Input/'
os.makedirs(save_dir, exist_ok=True)
save_file = save_dir+'S3_custom_dataset_SpecData.h5'

# Save the file
writeXR(save_file, custom_dataset, verbose=True)

Finished writing to /Users/acarter/Documents/TRANSITS/EUREKA/ALTERNATE_INPUTS/NIRCam_TESTING/Custom_Input/S3_custom_dataset_SpecData.h5


True

Congratulations! You've learned everything you need to make your own Eureka! compatible dataset. Feel free to repurpose or modify any of the cells in this notebook to suit your actual data. 

If you were hoping to import some photometric data, please read the section below.

If this made sense, but you are still unfamiliar with how to use Eureka!, please see the **[Quickstart Guide](https://eurekadocs.readthedocs.io/en/latest/quickstart.html)**.

Finally, if you encounter any issues or bugs while importing your own data, please contact the Eureka! team directly by making a **[GitHub Issue](https://github.com/kevin218/Eureka/issues)**.

An Aside On Photometry
----------------------

If you would like to import a photometry dataset, some slight adjustments must be made. Let's take a look at what a Eureka! photometry Stage 3 output looks like:

In [10]:
photom_file = base_dir + 'Stage3_Photom/S3_2024-04-30_nircam_template_run2/ap60_bg70_90/S3_nircam_template_ap60_bg70_90_SpecData.h5'
photom_data = readXR(photom_file)
print(photom_data)

Finished loading parameters from /Users/acarter/Documents/TRANSITS/EUREKA/ALTERNATE_INPUTS/NIRCam_TESTING/Stage3_Photom/S3_2024-04-30_nircam_template_run2/ap60_bg70_90/S3_nircam_template_ap60_bg70_90_SpecData.h5
<xarray.Dataset>
Dimensions:      (time: 119, y: 252, x: 1024)
Coordinates:
  * time         (time) float64 5.978e+04 5.978e+04 ... 5.978e+04 5.978e+04
  * y            (y) int64 4 5 6 7 8 9 10 11 ... 248 249 250 251 252 253 254 255
  * x            (x) int64 512 513 514 515 516 517 ... 1531 1532 1533 1534 1535
Data variables: (12/14)
    wave_1d      (x) float32 ...
    centroid_x   (time) float64 ...
    centroid_y   (time) float64 ...
    centroid_sx  (time) float64 ...
    centroid_sy  (time) float64 ...
    aplev        (time) float64 ...
    ...           ...
    skylev       (time) float64 ...
    skyerr       (time) float64 ...
    nskypix      (time) float64 ...
    nskyideal    (time) float64 ...
    status       (time) float64 ...
    betaper      (time) float64 ...


<br>

As you can see, the general structure is similar, but some of the data variables have changed. Specifically, the `optspec`/`opterr` variables that we needed to assign for our custom dataset before are no longer present. Instead, they are replaced by the `aplev`/`aperr` variables, which hold the aperture flux and its error at each time step. Furthermore, since the wavelength is the same for every pixel in photometric data, the `wave_1d` variable should just be set to the effective wavelength of the filter for a single, arbitrary `x` pixel coordinate.

So, let's fake another dataset, but this time for photometry:

In [11]:
# Assign your data to some variables
time = np.linspace(59694.0, 59694.1, 100)
time_units = 'MJD'
aplev = np.ones(len(time))
aperr = np.ones(len(time))*0.01
wave_1d = np.array([2.1,])
wave_units = 'microns'
flux_units = 'e-'

# Create an empty dataset
photom_dataset = makeDataset()

# Assign the variables to the dataset
photom_dataset['time'] = time
photom_dataset['time'].attrs['time_units'] = time_units
photom_dataset['x'] = [0,]
photom_dataset['aplev'] = (['time'], aplev)
photom_dataset['aplev'].attrs['flux_units'] = flux_units
photom_dataset['aplev'].attrs['time_units'] = time_units
photom_dataset['aperr'] = (['time'], aperr)
photom_dataset['aperr'].attrs['flux_units'] = flux_units
photom_dataset['aperr'].attrs['time_units'] = time_units
photom_dataset['wave_1d'] = (['x'], wave_1d)
photom_dataset['wave_1d'].attrs['wave_units'] = wave_units

Like before, we also need to assign some attributes to the dataset. These stay the same as for spectroscopic data, but we also need to add an important attribute to indicate that this is a photometric dataset. Let's do that now, and take a look at everything.

In [12]:
photom_dataset.attrs['inst'] = 'nircam'
photom_dataset.attrs['data_format'] = 'custom'
photom_dataset.attrs['photometry'] = True

print(photom_dataset)

<xarray.Dataset>
Dimensions:  (time: 100, x: 1)
Coordinates:
  * time     (time) float64 5.969e+04 5.969e+04 ... 5.969e+04 5.969e+04
  * x        (x) int64 0
Data variables:
    aplev    (time) float64 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0
    aperr    (time) float64 0.01 0.01 0.01 0.01 0.01 ... 0.01 0.01 0.01 0.01
    wave_1d  (x) float64 2.1
Attributes:
    inst:         nircam
    data_format:  custom
    photometry:   True


<br>
Great! Finally, we can save the file just like before. 

In [13]:
# Make sure we have made the directory to save the file to
save_dir = base_dir+'Custom_Photom_Input/'
os.makedirs(save_dir, exist_ok=True)
save_file = save_dir+'S3_photom_dataset_SpecData.h5'

# Save the file
writeXR(save_file, photom_dataset, verbose=True)

Finished writing to /Users/acarter/Documents/TRANSITS/EUREKA/ALTERNATE_INPUTS/NIRCam_TESTING/Custom_Photom_Input/S3_photom_dataset_SpecData.h5


True
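
Before saving either kind of dataset, a quick self-check against the checklists in this notebook can catch a missing variable or attribute early. The sketch below is not Eureka!'s own validation: the function name `missing_custom_fields` is hypothetical, and it only looks for the names this notebook describes as required. The stand-in dataset is built directly with Xarray and deliberately omits `aperr`.

```python
import numpy as np
import xarray as xr


def missing_custom_fields(ds):
    """Return the names from this notebook's checklist that are absent from ds."""
    photometry = bool(ds.attrs.get("photometry", False))
    flux_names = ["aplev", "aperr"] if photometry else ["optspec", "opterr"]
    required = ["time", "wave_1d", *flux_names]

    missing = [name for name in required if name not in ds.variables]
    missing += [f"attrs['{key}']" for key in ("inst", "data_format")
                if key not in ds.attrs]
    return missing


# Stand-in photometric dataset with invented values, missing 'aperr'
n_time = 100
demo = xr.Dataset(
    coords={"time": np.linspace(59694.0, 59694.1, n_time), "x": [0]}
)
demo["aplev"] = (["time"], np.ones(n_time))
demo["wave_1d"] = (["x"], np.array([2.1]))
demo.attrs.update(inst="nircam", data_format="custom", photometry=True)

print(missing_custom_fields(demo))  # ['aperr']
```

An empty list means every name on the checklist is present; it does not guarantee that Eureka! will accept the file.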