# File Organization

This notebook shows a few ways of using `vampires_dpp` to help organize your data for processing. This notebook can be downloaded as **{nb-download}`file_organization.ipynb`**.

## Setup and Imports

In [1]:
from pathlib import Path
import pandas as pd
from zenodo_get import zenodo_get

datadir = Path("data")

In [2]:
# download example data
zenodo_get(["10.5281/zenodo.7359198", "-o", datadir.absolute()])

Title: VAMPIRES DPP Example Files
Keywords: 
Publication date: 2022-11-24
DOI: 10.5281/zenodo.7359198
Total size: 231.0 MB

Link: https://zenodo.org/api/files/cc9ed3d2-b13e-4d43-a50c-db8c33f84c42/ABAur_01_20190320_750-50_EmptySlot_00_cam1_hdr_calib_FLC1.fits   size: 100.0 MB
ABAur_01_20190320_750-50_EmptySlot_00_cam1_hdr_calib_FLC1.fits is already downloaded correctly.

Link: https://zenodo.org/api/files/cc9ed3d2-b13e-4d43-a50c-db8c33f84c42/ABAur_02__RS___20220224_750-50_LyotStop_00_cam1_fix_calib_FLC1.fits   size: 31.0 MB
ABAur_02__RS___20220224_750-50_LyotStop_00_cam1_fix_calib_FLC1.fits is already downloaded correctly.

Link: https://zenodo.org/api/files/cc9ed3d2-b13e-4d43-a50c-db8c33f84c42/bench_CLC-3_750-50_LyotStop_00_cam1_calib.fits   size: 100.0 MB
bench_CLC-3_750-50_LyotStop_00_cam1_calib.fits is already downloaded correctly.
All files have been downloaded.


## Observation Tables

Raw data often sits in a single directory under STARS frame IDs, which say nothing useful about what each file contains. To organize it, we can gather the FITS header data into a table that we can sort, filter, and save. We have an automated utility, `observation_table`, which scrapes all the FITS headers and stores the data in a `pandas.DataFrame`.

Here is an example of how to load your input data into a table and save it to a CSV file:

In [None]:
from vampires_dpp.headers import observation_table

datadir = Path("data")
filelist = datadir.glob("VMPA*.fits")
table = observation_table(filelist) # sorts by DATE by default
table.to_csv(datadir / "hd32297_20220225_headers.csv")

For this example, we will load a table from observations on 2022/02/25 of HD 32297 in polarimetric imaging mode.

In [41]:
table = pd.read_csv(datadir / "hd32297_20220225_headers.csv")
table.keys()

Index(['Unnamed: 0', 'path', 'SIMPLE', 'BITPIX', 'NAXIS', 'NAXIS1', 'NAXIS2',
       'NAXIS3', 'EXTEND', 'BSCALE',
       ...
       'D_LOOP', 'D_LTTG', 'D_PSUBG', 'D_STTG', 'D_TTCMTX', 'D_TTGAIN',
       'D_WTTG', 'DATE', 'U_FLCSTT', 'COMMENT'],
      dtype='object', length=116)
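
The `Unnamed: 0` column in the listing above is what pandas produces when a CSV written with the default index is read back without saying which column holds that index. Here is a minimal sketch of two ways to avoid it, using a toy table whose values are placeholders rather than real headers:

```python
import io

import pandas as pd

# Toy stand-in for a header table; the values are made up for illustration.
toy = pd.DataFrame({"path": ["a.fits", "b.fits"], "EXPTIME": [0.5, 0.25]})

# Default round trip: the index is written as an unnamed first column.
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
default_cols = list(pd.read_csv(buf).columns)

# Option 1: tell read_csv that the first column is the index.
buf.seek(0)
indexed_cols = list(pd.read_csv(buf, index_col=0).columns)

# Option 2: do not write the index in the first place.
buf_noindex = io.StringIO()
toy.to_csv(buf_noindex, index=False)
buf_noindex.seek(0)
noindex_cols = list(pd.read_csv(buf_noindex).columns)

print(default_cols)
print(indexed_cols)
print(noindex_cols)
```

Either option works; the extra column is harmless for the queries in this notebook, so this is purely a matter of tidiness.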

## Sorting Files

We need to separate the calibration data from the science data, including the reference stars. For this observation, our targets were as follows:

**Science Targets**
- HD 32297

**Reference Targets**
- HD 42352 (Polarized Standard)
- HD 87423 (Unpolarized Standard)
- HD 32909 (PSF Reference)

We will use pandas' [`query`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) method to split our files by target:

In [4]:
sci_table = table.query("OBJECT == 'HD32297'")
pol_table = table.query("OBJECT == 'HD42352'")
unpol_table = table.query("OBJECT == 'HD87423'")
psfref_table = table.query("OBJECT == 'HD32909'")

print(f"Science files: {len(sci_table)}")
print(f"Pol. standard files: {len(pol_table)}")
print(f"Unpol. standard files: {len(unpol_table)}")
print(f"PSF ref. files: {len(psfref_table)}")

Science files: 650
Pol. standard files: 30
Unpol. standard files: 46
PSF ref. files: 36
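
If you don't know in advance exactly how the target names are spelled in the headers, it can help to list the unique `OBJECT` values before writing any queries. The sketch below uses a toy table (the object names are borrowed from the target list above, the file names are placeholders) and shows `value_counts` for a quick overview and `groupby` for splitting every target at once:

```python
import pandas as pd

# Toy table; file names are placeholders, not real frames.
toy = pd.DataFrame({
    "OBJECT": ["HD32297", "HD32297", "HD42352", "HD32909", "HD32297"],
    "path": [f"frame_{i}.fits" for i in range(5)],
})

# How many files were recorded for each target?
counts = toy["OBJECT"].value_counts()
print(counts)

# One sub-table per target, keyed by the OBJECT value.
tables = {name: group for name, group in toy.groupby("OBJECT")}
print(sorted(tables))
```

This is only a convenience; the explicit `query` calls above do the same job and make the intent of each table clearer.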


### Calibration files

Depending on when you observed with VAMPIRES and who your support astronomers were, your calibration files (typically just dark frames) can be found in a few ways.

For example, if your darks were archived in STARS and were taken with the mirror in, you can use the `U_MASK` value to filter the table:

```python
dark_table = table.query("U_MASK == 'Mirror'")
```

If the darks were just taken during slews, you can try filtering `U_OGFNAM` (the filename as saved to the VAMPIRES computer, usually much more descriptive) for "skies":

```python
dark_table = table.loc[table["U_OGFNAM"].str.contains("skies")]
```

In some cases, you may be given the dark files directly if they were not saved in STARS. In that case, you should just store them in your data directory and add the filenames directly to your pipeline configuration.

For our observations, we took dark frames during slews, so we'll use the "skies" method.



In [5]:
dark_table = table.loc[table["U_OGFNAM"].str.contains("skies")]

print(f"Dark frames (skies): {len(dark_table)}")

Dark frames (skies): 26
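
One thing to watch for with the string filter: if any file is missing the `U_OGFNAM` header, the mask returned by `str.contains` can contain missing values, which pandas may refuse to use for indexing (the exact behavior depends on the pandas version). Passing `na=False` treats those rows as non-matches. Here is a small sketch with a toy table; the `U_OGFNAM` values are invented for illustration:

```python
import pandas as pd

# Toy table with one missing U_OGFNAM entry; all values are made up.
toy = pd.DataFrame({
    "path": ["a.fits", "b.fits", "c.fits"],
    "U_OGFNAM": ["skies_750-50_cam1", None, "HD32297_750-50_cam1"],
})

# na=False makes rows without a U_OGFNAM value count as "no match".
mask = toy["U_OGFNAM"].str.contains("skies", na=False)
dark_toy = toy.loc[mask]

print(list(dark_toy["path"]))
```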


Now, we don't need all of these cubes to make our calibration files; in general, you shouldn't need more than a few hundred or a few thousand frames in total (typically one data cube). It is important, however, to match the exposure time and EM gain of the images being calibrated!

The `find_dark_settings` function will automatically sort through the files and find the unique combinations of exposure times and EM gains.

In [6]:
from vampires_dpp.util import find_dark_settings

print("\n".join(f"Science: {t} s / EM {e:.0f}" for t, e in find_dark_settings(sci_table.path)))
print("\n".join(f"Pol. standard: {t} s / EM {e:.0f}" for t, e in find_dark_settings(pol_table.path)))
print("\n".join(f"Unpol. standard: {t} s / EM {e:.0f}" for t, e in find_dark_settings(unpol_table.path)))
print("\n".join(f"PSF ref.: {t} s / EM {e:.0f}" for t, e in find_dark_settings(psfref_table.path)))

Science: 0.5 s / EM 300
Science: 0.25 s / EM 300
Pol. standard: 0.25 s / EM 10
Pol. standard: 0.25 s / EM 300
Unpol. standard: 0.5 s / EM 0
Unpol. standard: 0.5 s / EM 280
PSF ref.: 0.5 s / EM 300
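
If you already have the header table loaded, a similar summary can be read straight from its columns. The sketch below is not how `find_dark_settings` works internally; it is an alternative that assumes the `EXPTIME` and `U_EMGAIN` columns of the table are accurate, and it uses a toy table with placeholder values:

```python
import pandas as pd

# Toy table with two distinct (exposure time, EM gain) settings.
toy = pd.DataFrame({
    "EXPTIME": [0.5, 0.5, 0.25, 0.25, 0.5],
    "U_EMGAIN": [300.0, 300.0, 300.0, 300.0, 300.0],
})

# Keep the first row of each unique combination, in order of appearance.
settings = toy[["EXPTIME", "U_EMGAIN"]].drop_duplicates()
combos = [tuple(row) for row in settings.itertuples(index=False)]

for exptime, gain in combos:
    print(f"{exptime} s / EM {gain:.0f}")
```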


We can see from the EM gain switches that some of the standard stars were observed both with the coronagraph in (high gain) and with the coronagraph out (low gain). We could use this information to further filter the tables with `query`, but for now we're only going to focus on the main science observations, which are those with an exposure time of 0.5 seconds and an EM gain of 300.

In [7]:
sci_table = sci_table.query("EXPTIME == 0.5")

In [8]:
dark_table["EXPTIME"].unique(), dark_table["U_EMGAIN"].unique()

(array([0.5]), array([300.]))

Our sky frames only cover the 0.5 second, 300 EM gain files. In order to fully calibrate our reference data, we would need to reach out to the SCExAO team to record some darks for the other exposure times and gains.
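
To make the gap explicit, we can compare the settings each dataset needs against the settings the darks cover. This sketch uses plain Python sets, with the (exposure time, EM gain) pairs copied by hand from the printed output above:

```python
# (exposure time in seconds, EM gain) pairs copied from the output above
needed = {
    "Science": {(0.5, 300), (0.25, 300)},
    "Pol. standard": {(0.25, 10), (0.25, 300)},
    "Unpol. standard": {(0.5, 0), (0.5, 280)},
    "PSF ref.": {(0.5, 300)},
}
# The only setting covered by the sky frames
available = {(0.5, 300)}

# Settings with no matching dark, per dataset
missing = {label: sorted(combos - available) for label, combos in needed.items()}

for label, combos in missing.items():
    if combos:
        listing = ", ".join(f"{t} s / EM {g}" for t, g in combos)
        print(f"{label}: no darks for {listing}")
    else:
        print(f"{label}: fully covered")
```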

From the skies, we only need one cube per camera, so we'll select the first cube from each camera:

In [9]:
cam1_dark = dark_table.query("U_CAMERA == 1").iloc[0]
cam2_dark = dark_table.query("U_CAMERA == 2").iloc[0]
print(f"Cam 1 dark: {cam1_dark.path}")
print(f"Cam 2 dark: {cam2_dark.path}")

Cam 1 dark: /Volumes/mlucas SSD1/2022/20220225/VMPA00022727.fits
Cam 2 dark: /Volumes/mlucas SSD1/2022/20220225/VMPA00022723.fits


## Preparing for processing

Now that we have our file lists, we can prepare them for processing using the `vampires_dpp` pipeline. We'll show you a few ways of organizing your data so you can choose one that suits your needs and tweak it to your liking!

### 1. File lists

We can take our dataframes and save the paths of our selected files to simple text files, which can be input directly into the pipeline:

In [10]:
sci_files = datadir / "hd32297_files.txt"
with open(sci_files, "w") as fh:
    fh.write("\n".join(sci_table["path"]))

dark_files = datadir / "dark_files.txt"
with open(dark_files, "w") as fh:
    fh.write("\n".join(c.path for c in (cam1_dark, cam2_dark)))

In [11]:
!head data/hd32297_files.txt

/Volumes/mlucas SSD1/2022/20220225/VMPA00022083.fits
/Volumes/mlucas SSD1/2022/20220225/VMPA00022084.fits
/Volumes/mlucas SSD1/2022/20220225/VMPA00022085.fits
/Volumes/mlucas SSD1/2022/20220225/VMPA00022086.fits
/Volumes/mlucas SSD1/2022/20220225/VMPA00022087.fits
/Volumes/mlucas SSD1/2022/20220225/VMPA00022088.fits
/Volumes/mlucas SSD1/2022/20220225/VMPA00022089.fits
/Volumes/mlucas SSD1/2022/20220225/VMPA00022090.fits
/Volumes/mlucas SSD1/2022/20220225/VMPA00022091.fits
/Volumes/mlucas SSD1/2022/20220225/VMPA00022092.fits


In [12]:
!head data/dark_files.txt

/Volumes/mlucas SSD1/2022/20220225/VMPA00022727.fits
/Volumes/mlucas SSD1/2022/20220225/VMPA00022723.fits
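
Because these lists hold absolute paths (here on an external volume, with spaces in the directory name), it can be worth confirming that every listed file is still reachable before starting a long pipeline run. The helpers below are hypothetical and not part of `vampires_dpp`; the demo runs against a temporary directory so it does not depend on the real data:

```python
import tempfile
from pathlib import Path


def read_file_list(list_path):
    """Return the non-empty lines of a text file as Path objects."""
    lines = Path(list_path).read_text().splitlines()
    return [Path(line) for line in lines if line.strip()]


def missing_files(paths):
    """Return the paths that do not exist on disk."""
    return [p for p in paths if not p.exists()]


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    real = tmp / "frame with space.fits"  # spaces in names are handled fine
    real.touch()
    ghost = tmp / "not_there.fits"  # listed but never created
    listing = tmp / "files.txt"
    listing.write_text("\n".join(str(p) for p in (real, ghost)))

    paths = read_file_list(listing)
    missing = missing_files(paths)

print(f"{len(paths)} listed, {len(missing)} missing: {[p.name for p in missing]}")
```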

In [14]:
from vampires_dpp.pipeline import Pipeline

pipeline = Pipeline.from_str("""
version = "0.2.0"
name="HD32297_20220225"
directory="data"
output_directory="data/processed"
filenames="data/hd32297_files.txt"

[calibration]
output_directory="calibrated"

[calibration.darks]
filenames="data/dark_files.txt"

[registration]
output_directory="registered"

[collapsing]
output_directory="collapsed"

[derotate]
output_directory="derotated"
""")

### 2. Sub-directories

If you prefer to organize your data into sub-directories, you can do that straight from our tables using Python's built-in file management tools. Here we will use symlinks to save storage space, but you could use `shutil.copyfile` if you wanted to make full copies instead.

In [33]:
sci_dir = datadir / "hd32297_science"
sci_dir.mkdir(parents=True, exist_ok=True)
for filename in sci_table["path"]:
    path = Path(filename)
    outpath = sci_dir / path.name
    outpath.symlink_to(path)


dark_dir = datadir / "darks"
dark_dir.mkdir(parents=True, exist_ok=True)
for filename in (cam1_dark.path, cam2_dark.path):
    path = Path(filename)
    outpath = dark_dir / path.name
    outpath.symlink_to(path)
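
Note that `symlink_to` raises `FileExistsError` if the link already exists, so re-running the cell above on a populated directory will fail. A minimal sketch of a re-runnable version follows; `link_into` is a hypothetical helper, not part of `vampires_dpp`, and the demo uses a temporary directory with empty placeholder files:

```python
import tempfile
from pathlib import Path


def link_into(directory, sources):
    """Symlink each source file into directory, skipping names already present."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    created = []
    for source in sources:
        source = Path(source)
        outpath = directory / source.name
        # is_symlink() also catches dangling links, which exists() reports as absent
        if outpath.is_symlink() or outpath.exists():
            continue
        # resolve() gives an absolute target, so the link works from any directory
        outpath.symlink_to(source.resolve())
        created.append(outpath)
    return created


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    raw = tmp / "raw"
    raw.mkdir()
    frames = [raw / f"frame_{i}.fits" for i in range(3)]
    for frame in frames:
        frame.touch()

    first = link_into(tmp / "science", frames)
    second = link_into(tmp / "science", frames)  # second run is a no-op
    n_first, n_second = len(first), len(second)
    all_links = all(p.is_symlink() for p in first)

print(n_first, n_second, all_links)
```

On Windows, creating symlinks may require elevated privileges or developer mode, in which case copying the files is the simpler route.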

In [36]:
pipeline = Pipeline.from_str("""
version = "0.2.0"
name="HD32297_20220225"
directory="data"
output_directory="data/processed"
filenames="hd32297_science/VMPA*.fits"

[calibration]
output_directory="calibrated"

[calibration.darks]
filenames="darks/VMPA*.fits"

[registration]
output_directory="registered"

[collapsing]
output_directory="collapsed"

[derotate]
output_directory="derotated"
""")

### 3. Direct configuration

The last way we could do this is by directly creating a configuration dictionary, bypassing the TOML configuration altogether.

```{admonition} Advanced Usage
:class: caution

Despite this method's ability to work in pure Python, the heavily-nested structure is easy to get wrong, leading to silent failures like missing dark subtraction.
```

In [49]:
pipeline = Pipeline({
    "version": "0.2.0",
    "name": "HD32297_20220225",
    "directory": datadir,
    "output_directory": datadir / "processed",
    "filenames": sci_table["path"],
    "calibration": {
        "output_directory": "calibrated",
        # one dark cube per camera, as selected earlier
        "darks": {"filenames": [cam1_dark.path, cam2_dark.path]}
    },
    "registration": {"output_directory": "registered"},
    "collapsing": {"output_directory": "collapsed"},
    "derotate": {"output_directory": "derotated"},
})
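
Given the warning about silent failures, one option is to sanity-check the dictionary yourself before handing it to `Pipeline`. The helpers below are hypothetical and not part of `vampires_dpp`; they only confirm that a `calibration` → `darks` → `filenames` entry exists and is non-empty, mirroring the nesting used above:

```python
def get_nested(config, *keys):
    """Walk nested dictionaries, returning None if any key is absent."""
    node = config
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def dark_filenames(config):
    """Return the configured dark filenames as a list of strings, or raise."""
    filenames = get_nested(config, "calibration", "darks", "filenames")
    if filenames is None:
        raise ValueError("config has no calibration.darks.filenames entry")
    if isinstance(filenames, str):
        filenames = [filenames]  # a single path or glob pattern
    filenames = [str(f) for f in filenames]
    if not filenames:
        raise ValueError("calibration.darks.filenames is empty")
    return filenames


good = {"calibration": {"darks": {"filenames": ["dark_cam1.fits", "dark_cam2.fits"]}}}
# Easy mistake: "darks" placed at the top level instead of under "calibration"
bad = {"calibration": {"output_directory": "calibrated"}, "darks": {"filenames": ["d.fits"]}}

good_result = dark_filenames(good)
try:
    dark_filenames(bad)
    error_message = ""
except ValueError as err:
    error_message = str(err)

print(good_result)
print(error_message)
```

The file names in `good` and `bad` are placeholders. The same idea extends to any other nested option you rely on.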