# The Rubin Scheduler Simulation Archive Data Model

## Scope
### Visits

**This database is not intended to store metadata on individual visits, but rather sequences of visits.**

If there are ever use cases where an application frequently queries small subsets of sets of visits, maybe we can consider adding a table of visits. But, for our present use cases, storing the visit metadata in separate files in an S3 bucket is better.

### Included data

The primary intention of this database is to track the plethora of simulations automatically produced to support the monitoring and progress reports (e.g. the prenight briefing). A few other simulations needed for these reports (e.g. the current baseline) will probably also be included, but initially it is not intended to include most simulations done for strategy evaluation, or even most simulations made "by hand" for other reports. It might someday evolve to replace the summary metric data archive described [here](https://github.com/lsst/rubin_sim_notebooks/blob/main/maf/tutorial/04_Getting_Data.ipynb), and some ideas for that are described below.

## Visit sequences

- The name "visit sequence" is provisional: I'm looking for a better option.
- "visit sequence" is a table of visit metadata that can be read by `rubin_sim.maf.get_sim_data` or `rubin_scheduler.scheduler.utils.SchemaConverter.opsim2obs`.
    - Documentation I can find is out of date: https://lsst-sims.github.io/sims_ocs/tables/summaryallprops.html
- Presently saved as sqlite3 data files with legacy column names
    - I propose supporting hdf5 files as well (or even instead).
        - More standardized and portable.
        - For `baseline_v4.3_10yrs`, the `sqlite3` file is 719M, local read takes 25s.
        - In `hdf5` (uncompressed), the file is 695M, local read takes 2.2s.
          - `hdf5` has optional built-in compression which shrinks the size of the file at the expense of read and write times. If the download from the S3 bucket is the bottleneck rather than the read into python itself, maybe experimenting with a non-0 compression level would be useful.
        - Whether we do this is not important for the purposes of this metadata database.
    - I propose using column names that match those used in `consdb`, when corresponding columns exist in the `consdb`.

## Contents of visit sequence files

Pre-night and progress simulations will typically be completed after pre-loading a set of completed visits.
There is an important question of whether to include the pre-loaded visits in visit sequences placed into the archive, or to limit the rows in the visit table in the archive to newly simulated visits.
Limiting the saved visits to newly simulated visits has several advantages:
1. The pre-night report only uses visits from one night, and this is always one of the newly simulated nights. The pre-night simulation presently simulates three nights, so including the pre-loaded visits would increase the archive space, bandwidth used, and time spent downloading data by a factor of about 1/3 of the number of nights we are into the survey. For example, less than 10 months (about 300 nights) into the survey, the visit sequence would be 100 times larger than necessary if previous visits are included.
2. Averaged over all timesteps, half of all visits in the simulation used for one time sample in progress reports are completed visits. If the completed visits are retrieved only once, storing only the simulated visits will reduce the size of the visit downloads by a factor of two.

The cost of this efficiency is the additional complexity of requiring the client to combine the simulated visits with the parent visits whenever the complete set is needed. This complexity can be hidden in the client, however.
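
The arithmetic behind the first point can be sketched as follows. This is only an illustration: the function name and its parameters are made up for this example, and it assumes that every night contributes roughly the same number of visits.

```python
def prenight_size_factor(completed_nights: int, simulated_nights: int = 3) -> float:
    """Ratio of sizes: (pre-loaded + simulated visits) / (simulated visits only).

    Assumes every night holds about the same number of visits.
    """
    if simulated_nights <= 0:
        raise ValueError("simulated_nights must be positive")
    return (completed_nights + simulated_nights) / simulated_nights


# 297 completed nights plus 3 simulated nights, i.e. about 10 months in:
factor = prenight_size_factor(completed_nights=297)
print(f"Including pre-loaded visits makes the file {factor:.0f} times larger")
```

With no completed nights the factor is 1, and it grows linearly with the time elapsed since the start of the survey.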

## Sample Use Cases

### Query the consdb and add the results to the archive

To query the consdb for visits, add them to an S3 bucket, and record their presence in the archive database from within python:

```
from rubin_sim import visitsarch
from rubin_sim.visitsarch import VisitSequenceArchive

visit_seq_archive = VisitSequenceArchive()

# The query, endpoint, and archive keyword arguments default to the values
# shown below, but are explicitly included in this example to show where
# they can be overridden, and to indicate that other pre-defined values
# might also be included as module-level variables in visitsarch.

last_consdb_dayobs = 20250820
completed_visits, completed_arch_id = visit_seq_archive.query_consdb(
    query=visitsarch.SIMONYI_SCIENCE_CONSDB_QUERY,
    label="Here are some visits from the consdb.",
    telescope="simonyi",
    last_dayobs=last_consdb_dayobs,
    endpoint=visitsarch.USDF_CONSDB,
    archive=visitsarch.USDF_OPSIM_ARCHIVE,
)
```

The same could be done in `bash`, putting the visits into an h5 file rather than returning them:

```
COMPLETED_ARCH_ID=$(visitsarch query_consdb \
    completed_visits.h5 \
    --label="Here is another set of sample visits" \
    --telescope="simonyi" \
    --last_dayobs=20360101
)
```

### Run a simulation and add it to the archive

When running a simulation in python, either with a stand-alone python executable or in a jupyter notebook, the process will look like this:

```
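import rubin_scheduler
from rubin_scheduler.scheduler import sim_runner
from rubin_scheduler.scheduler.utils import SchemaConverter
from rubin_sim.visitsarch import VisitSequenceArchive

visit_seq_archive = VisitSequenceArchive()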

# Get the completed visits following the previous use case, then update the scheduler:
scheduler.add_observations(SchemaConverter().opsim2obs(completed_visits))

# Run the simulation
sim_runner_kwargs = {
    'observatory': observatory,
    'scheduler': scheduler,
    'band_scheduler': band_scheduler,
    'sim_start_mjd': sim_start,
    'sim_duration': sim_end-sim_start,
    'keep_rewards': True}
observatory, scheduler, obs, rewards, obs_rewards = sim_runner(**sim_runner_kwargs)

# Create the entry in the simulations table and copy the visits to the S3 archive:
arch_id = visit_seq_archive.add_simulation(
    visits=obs,
    label="This is my example simulation.",
    telescope="simonyi",
    first_day_obs=sim_start,
    last_day_obs=sim_end,
    scheduler_version=rubin_scheduler.__version__,
    config_url="https://raw.githubusercontent.com/lsst-ts/ts_config_ocs/8ed1ab/Scheduler/feature_scheduler/maintel/fbs_config_sv_survey.py",
    sim_runner_args=sim_runner_kwargs,
    parent_visitseq_uuid=completed_arch_id,  # See previous use case example
    parent_last_dayobs=last_consdb_dayobs,  # See previous use case example
)

# Add appropriate tags:
visit_seq_archive.tag(arch_id, ['example', 'nominal'])

# Save the python environment:
visit_seq_archive.save_env(arch_id)

# Save the rewards and pickles of important objects to the S3 bucket,
# and record what and where they are in the database
visit_seq_archive.add_file(arch_id, 'rewards', [rewards, obs_rewards])
visit_seq_archive.add_file(arch_id, 'scheduler', scheduler)
visit_seq_archive.add_file(arch_id, 'observatory', observatory)
```

Alternately, we could do it in `bash` with files:

```
# Run the simulation, saving visits and rewards to h5 files and scheduler and observatory objects to pickles,
# then:

ARCHID=$(visitsarch add_simulation \
    --label="This is my bash example simulation" \
    --telescope="simonyi" \
    --config_url="https://raw.githubusercontent.com/lsst-ts/ts_config_ocs/8ed1ab/Scheduler/feature_scheduler/maintel/fbs_config_sv_survey.py" \
)

visitsarch tag $ARCHID example nominal
visitsarch save_env $ARCHID
visitsarch add_file $ARCHID rewards rewards.h5
visitsarch add_file $ARCHID scheduler scheduler.p
visitsarch add_file $ARCHID observatory observatory.p
```


### Query the simulation archive for simulations

Any of the tables and views described below can be queried with SQL, returning a `pandas.DataFrame` (in python) or delimited text (in bash).
Direct use of `pd.read_sql` will often be perfectly reasonable, but a helper that creates and drops the connection might be used, for example:

```
from rubin_sim.visitsarch import VisitSequenceArchive
visit_seq_archive = VisitSequenceArchive()
all_tagged_prenight_df = visit_seq_archive.query("SELECT * FROM visitsextra WHERE tags ? 'mytest1'")
```
Note the use of the non-standard `postgresql` operator `?`, which, when applied to a jsonb value (as `tags` is in the `visitsextra` view), tests whether a string is an element of a json array.

The corresponding command in bash would be:

```
visitsarch query "SELECT * FROM visitsextra WHERE tags ? 'mytest1'" > mytesttable.txt
```

In addition, there should be a handful of commands that explicitly implement common use cases.
For example, to retrieve a table of pre-night simulations for a given night, in python:
```
from rubin_sim import visitsarch

day_obs = 20250814
prenight_for_night = visitsarch.prenight_sims(day_obs)
```
or, in bash:
```
visitsarch prenight_sims 20250814
```

These would return tables of simulations suitable for building a pre-night simulation, with all the metadata needed for selecting a simulation from which to make a pre-night briefing, and then obtaining the necessary data (`visitseq_uuid`, `label`, `telescope`, `creation_time`, `parent_last_day_obs`, and URLs for the visits and rewards).
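
The examples above pass the day obs as an integer of the form `YYYYMMDD` (e.g. `20250814`), while the tables described later store day obs values in `DATE` columns. A client would presumably need to convert between the two. A minimal sketch of such a conversion follows; the helper names are hypothetical and not part of any existing API.

```python
from datetime import date, datetime


def dayobs_to_date(day_obs: int) -> date:
    """Convert an integer day_obs in YYYYMMDD form to a datetime.date."""
    return datetime.strptime(str(day_obs), "%Y%m%d").date()


def date_to_dayobs(day: date) -> int:
    """Convert a datetime.date back to an integer day_obs in YYYYMMDD form."""
    return int(day.strftime("%Y%m%d"))


night = dayobs_to_date(20250814)
print(night.isoformat())
```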

### Getting data from the archive

Once the `visitseq_uuid` of interest is discovered through querying the `simulations` or `completed` tables or the `visitsextra` view, the actual visit, reward, or other data should be downloadable from the buckets pointed to.

For example, in python:
```
from rubin_sim.visitsarch import VisitSequenceArchive
visit_seq_archive = VisitSequenceArchive()
visits = visit_seq_archive.download(visitseq_uuid, 'visits')
rewards, obs_rewards = visit_seq_archive.download(visitseq_uuid, 'rewards')
scheduler = visit_seq_archive.download(visitseq_uuid, 'scheduler')
```
or, in bash:
```
visitsarch download $ARCHID visits visits.h5
visitsarch download $ARCHID rewards rewards.h5
visitsarch download $ARCHID scheduler scheduler.p
```

## Tables in the visit sequence metadata database

### `simulations` table

Tracks output of opsim simulations, primarily those generated for the pre-night briefing and progress tracking, but baselines might also be useful to include.

| column | type | description |
| --- | --- | --- |
| visitseq_uuid | UUID PRIMARY KEY UNIQUE | RFC 4122 Universally Unique IDentifier (from python's `uuid.uuid4()`)|
| visitseq_sha256 | BYTEA NOT NULL | hash of visit table |
| visitseq_label | TEXT NOT NULL | label for plots and tables |
| visitseq_url | TEXT | URL of visit sequence (sqlite3, maybe hdf5). It can be NULL, so statistics etc. can be stored in the database even if the actual visits are never saved. |
| telescope | TEXT NOT NULL | "simonyi" or "auxtel" |
| first_day_obs | DATE NOT NULL | day obs of first visit in sequence |
| last_day_obs | DATE NOT NULL | day obs of last visit in sequence |
| creation_time | TIMESTAMP WITH TIME ZONE NOT NULL| when the simulation was run |
| scheduler_version | TEXT | version of `rubin_scheduler` |
| config_url | TEXT | URL for the config script, e.g. https://raw.githubusercontent.com/lsst-ts/ts_config_ocs/${COMMIT_HASH}/Scheduler/feature_scheduler/maintel/fbs_config_sv_survey.py |
| sim_runner_args | JSONB | arguments to sim runner as a json dict |
| conda_env_sha256 | BYTEA | SHA256 hash of output of `conda list --json` |
| parent_visitseq_uuid | UUID | UUID of visitseq loaded into scheduler before running |
| parent_last_day_obs | DATE | day_obs of last visit loaded into scheduler before running |

In principle, `config_repo`, `config_version`, and `config_script` should be enough to exactly specify the config file used. However, `config_sha256` is still useful for identifying when the config script did not change across config repository versions, or for identifying config scripts that were not taken from a git repository (in which case the `config_repo` etc. would be `NULL`.)

PostgreSQL has a native json type that can be used in queries, so we can, for example, query for simulations with a given value of `sim_start_mjd` or `n_visit_limit` using the `sim_runner_args` column in this table.

RFC 4122 UUIDs can be generated with the python standard library with `uuid.uuid4()`. These should be generated by whatever process is inserting new data into the table.
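
A minimal sketch of generating the identifier on the client before the insert, using only the standard library. The variable names are illustrative.

```python
import uuid

# Generate a random (version 4) UUID on the client, before the INSERT.
visitseq_uuid = uuid.uuid4()

# The canonical 36-character text form can be sent to a UUID column.
uuid_text = str(visitseq_uuid)
print(uuid_text)

# Round trip: parsing the text form recovers an equal UUID object.
parsed = uuid.UUID(uuid_text)
print(parsed == visitseq_uuid)
```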

SHA-256 is a fast hash function that python can apply to recarrays, and is stable across versions of python (unlike python's `hash`). If we compute the SHA-256 for the `recarray` representation of the visit table, we can detect if we ever fail to reconstruct it exactly. This code fragment shows how this might be computed:
```
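import hashlib

import numpy as np

# Hash the dtype as well as the data, so a change in column names or types is detected.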
visitseq_hash = hashlib.sha256(str(recs.dtype).encode())
visitseq_hash.update(np.ascontiguousarray(recs).data.tobytes())
hex_digest = visitseq_hash.hexdigest()
```
To insert it into a BYTEA column in postgresql:
```
sql = f"INSERT INTO obstable (visitshash) VALUES (decode('{hex_digest}', 'hex'))"
```

This would, of course, be handled transparently by the python client.
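
An alternative to interpolating the hex digest into the SQL text is to pass the raw 32-byte digest as a bound parameter. The sketch below uses an in-memory `sqlite3` database purely as a runnable stand-in for PostgreSQL (with `BLOB` in place of `BYTEA`); the table name `demo_visitseq` and the sample bytes are made up for the example, and the placeholder syntax of a PostgreSQL driver would differ.

```python
import hashlib
import sqlite3

# Stand-in for the bytes of the visit table; in practice these would come
# from the recarray as in the fragment above.
visit_bytes = b"example visit table contents"
digest = hashlib.sha256(visit_bytes).digest()  # 32 raw bytes

# sqlite3 (BLOB) stands in for PostgreSQL (BYTEA) so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_visitseq (visitseq_sha256 BLOB NOT NULL)")
conn.execute("INSERT INTO demo_visitseq (visitseq_sha256) VALUES (?)", (digest,))

# Reading the column back returns the same bytes, which can be compared
# against a freshly computed hash to detect a corrupted visit table.
(stored,) = conn.execute("SELECT visitseq_sha256 FROM demo_visitseq").fetchone()
print(stored.hex() == hashlib.sha256(visit_bytes).hexdigest())
conn.close()
```

Binding the value as a parameter avoids building SQL from strings, which is a good habit even when the interpolated value is a harmless hex digest.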

### `completed` table

Tracks results of visit sequences representing actually completed visits, primarily (probably entirely) derived from queries to the consdb.

| column | type | description |
| --- | --- | --- |
| visitseq_uuid | UUID PRIMARY KEY | RFC 4122 Universally Unique IDentifier |
| visitseq_sha256 | BYTEA NOT NULL | hash of visit table |
| visitseq_label | TEXT NOT NULL | label for plots and tables |
| visitseq_url | TEXT | URL of visit sequence (sqlite3, maybe hdf5) |
| telescope | TEXT NOT NULL| "simonyi" or "auxtel" |
| first_day_obs | DATE NOT NULL | day obs of first visit in sequence |
| last_day_obs | DATE NOT NULL | day obs of last visit in sequence |
| creation_time | TIMESTAMP WITH TIME ZONE | when the consdb was queried |
| query | TEXT | The query to the consdb used |

Inclusion of the first and last day obs will let us save incremental updates.

Inclusion of the query will let us select subsets (e.g., just one band), but may not be necessary.

We might sometimes want to create entries in this table with the `visitseq_url` set to `NULL`, if we want to record statistics for a set of visits queried from the consdb and want to record how we got them, but do not need to save the visits themselves.

### `mixedvisitseqs` table

| column | type | description |
| --- | --- | --- |
| visitseq_uuid | UUID PRIMARY KEY | RFC 4122 Universally Unique IDentifier |
| visitseq_sha256 | BYTEA NOT NULL | hash of visit table |
| visitseq_label | TEXT NOT NULL | label for plots and tables |
| visitseq_url | TEXT | URL of visit sequence (sqlite3, maybe hdf5) |
| telescope | TEXT NOT NULL | "simonyi" or "auxtel" |
| first_day_obs | DATE NOT NULL | day obs of first visit in sequence |
| last_day_obs | DATE NOT NULL | day obs of last visit in sequence |
| creation_time | TIMESTAMP WITH TIME ZONE | date when sequence was defined |
| last_early_day_obs | DATE | the last day obs drawn from the early parent |
| first_late_day_obs | DATE | the first day obs drawn from the late parent |
| early_parent_uuid | UUID | the UUID of the early parent |
| late_parent_uuid | UUID | the UUID of the late parent |

Note that a client can recover the visit sequence even if the `visitseq_url` column is `NULL`, provided the early and late parents can be retrieved: query the early parent for visits between `first_day_obs` and `last_early_day_obs`, query the late parent for visits between `first_late_day_obs` and `last_day_obs`, and concatenate the results.
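
The reconstruction just described might look something like the following sketch. It represents visits as plain dicts with a `day_obs` key so that it runs without any dependencies; a real client would more likely work on a `pandas.DataFrame`. The function name and the `source` key are hypothetical.

```python
from datetime import date


def reconstruct_mixed(early_visits, late_visits, first_day_obs,
                      last_early_day_obs, first_late_day_obs, last_day_obs):
    """Concatenate the selected date ranges of the two parent sequences.

    Each visit is a dict with at least a 'day_obs' key holding a datetime.date.
    """
    early = [v for v in early_visits
             if first_day_obs <= v["day_obs"] <= last_early_day_obs]
    late = [v for v in late_visits
            if first_late_day_obs <= v["day_obs"] <= last_day_obs]
    return early + late


completed = [{"day_obs": date(2025, 8, d), "source": "completed"} for d in (18, 19, 20)]
simulated = [{"day_obs": date(2025, 8, d), "source": "simulated"} for d in (20, 21, 22)]
mixed = reconstruct_mixed(
    completed, simulated,
    first_day_obs=date(2025, 8, 18), last_early_day_obs=date(2025, 8, 20),
    first_late_day_obs=date(2025, 8, 21), last_day_obs=date(2025, 8, 22),
)
print([(v["day_obs"].day, v["source"]) for v in mixed])
```

In this example the simulated visits on 2025-08-20 are dropped in favor of the completed ones, because the late parent only contributes from `first_late_day_obs` onward.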

I don't know if it's a good idea, but `mixedvisitseqs` visit sequences can in principle be daisy-chained: the parent uuids can themselves refer to other entries in the `mixedvisitseqs` table, allowing for the specification of visit sequences comprised of any number of fragments of other visit sequences. This might be useful as a mechanism for incremental updates to queries of the consdb.

### `visitseq` table

The `simulations`, `completed`, and `mixedvisitseqs` tables all have several fields in common, and we may wish to query a table where we don't have to deal with each separately, for example when joining the parent UUID columns in the `mixedvisitseqs` table to its parents. We could accomplish this with a view, but a better way would be to take advantage of postgresql's inheritance: we can create a parent table with the columns that `simulations`, `completed`, and `mixedvisitseqs` have in common:

| column | type | description |
| --- | --- | --- |
| visitseq_uuid | UUID PRIMARY KEY | RFC 4122 Universally Unique IDentifier |
| visitseq_sha256 | BYTEA NOT NULL | hash of visit table |
| visitseq_label | TEXT NOT NULL | label for plots and tables |
| visitseq_url | TEXT | URL of visit sequence (sqlite3, maybe hdf5) |
| telescope | TEXT NOT NULL | "simonyi" or "auxtel" |
| first_day_obs | DATE NOT NULL | day obs of first visit in sequence |
| last_day_obs | DATE NOT NULL | day obs of last visit in sequence |
| creation_time | TIMESTAMP WITH TIME ZONE | date when sequence was created |

Queries to this table will see rows from all of its children: `simulations`, `completed`, and `mixedvisitseqs`.
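
As a rough illustration of what the inheritance might look like, the sketch below builds the `CREATE TABLE` statements as strings, using the PostgreSQL `INHERITS` clause. The column lists are abbreviated from the tables above, the helper function is hypothetical, and the statements are only printed, not executed. One caveat worth checking against the PostgreSQL documentation: as far as I understand it, primary key and unique constraints declared on the parent are not enforced across the child tables, so uniqueness of `visitseq_uuid` across all children would rely on the client generating proper UUIDs.

```python
# Columns shared by all visit sequence tables (abbreviated from the table above).
COMMON_COLUMNS = [
    ("visitseq_uuid", "UUID PRIMARY KEY"),
    ("visitseq_sha256", "BYTEA NOT NULL"),
    ("visitseq_label", "TEXT NOT NULL"),
    ("visitseq_url", "TEXT"),
]

# Extra columns for one child table (abbreviated).
COMPLETED_COLUMNS = [("query", "TEXT")]


def create_table_sql(name, columns, parent=None):
    """Build a PostgreSQL CREATE TABLE statement, optionally inheriting from parent."""
    body = ",\n".join(f"    {col} {sql_type}" for col, sql_type in columns)
    inherits = f" INHERITS ({parent})" if parent else ""
    return f"CREATE TABLE {name} (\n{body}\n){inherits};"


parent_sql = create_table_sql("visitseq", COMMON_COLUMNS)
child_sql = create_table_sql("completed", COMPLETED_COLUMNS, parent="visitseq")
print(parent_sql)
print(child_sql)
```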

### `tags`

Visit sequences can be marked with tags.
Any given visit sequence can have any number of tags, and any given tag can apply to any number of visit sequences.
Tags will typically be short strings without spaces, and used to retrieve subsets of visit sequences in an automatic way.

| column | type | description |
| --- | --- | --- |
| visitseq_uuid | UUID | RFC 4122 Universally Unique IDentifier |
| tag | TEXT NOT NULL | A tag for the visit sequence |

Note that tags can serve the same function as "run families" in the `rubin_sim.maf.run_comparison` submodule.
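
To illustrate the many-to-many relationship, here is a small sketch that selects the visit sequences carrying every tag in a requested set. It uses an in-memory `sqlite3` database only as a runnable stand-in for PostgreSQL (with `TEXT` in place of `UUID`); the identifiers `uuid-a` etc. and the helper function are made up for the example.

```python
import sqlite3

# In-memory sqlite3 stands in for PostgreSQL; TEXT stands in for the UUID type.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (visitseq_uuid TEXT, tag TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO tags (visitseq_uuid, tag) VALUES (?, ?)",
    [
        ("uuid-a", "prenight"), ("uuid-a", "nominal"),
        ("uuid-b", "prenight"), ("uuid-b", "delayed"),
        ("uuid-c", "baseline"),
    ],
)


def sequences_with_all_tags(connection, wanted):
    """Return the visit sequence ids that carry every tag in `wanted`."""
    placeholders = ", ".join("?" for _ in wanted)
    sql = (
        f"SELECT visitseq_uuid FROM tags WHERE tag IN ({placeholders}) "
        "GROUP BY visitseq_uuid HAVING COUNT(DISTINCT tag) = ? "
        "ORDER BY visitseq_uuid"
    )
    return [row[0] for row in connection.execute(sql, [*wanted, len(wanted)])]


prenight_ids = sequences_with_all_tags(conn, ["prenight"])
nominal_prenight_ids = sequences_with_all_tags(conn, ["prenight", "nominal"])
print(prenight_ids, nominal_prenight_ids)
```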

### `comments`

Comments can be added to visit sequences.
Any given visit sequence can have any number of comments.
Comments are intended to be free-form text to be interpreted by humans.

| column | type | description |
| --- | --- | --- |
| visitseq_uuid | UUID | RFC 4122 Universally Unique IDentifier |
| comment_time | TIMESTAMP WITH TIME ZONE | The date and time at which the comment was added |
| author | TEXT | The user who added the comment |
| comment | TEXT NOT NULL | The text of the comment |

### `files`

The `files` table saves references to files that are associated with a given set of visits.
Typical examples include `rewards` (for recorded rewards), `scheduler` (a pickle of the scheduler instance), and `observatory` (a pickle of the observatory).

| column | type | description |
| --- | --- | --- |
| visitseq_uuid | UUID | RFC 4122 Universally Unique IDentifier |
| file_type | TEXT NOT NULL | Examples include "rewards", "scheduler", and "observatory" |
| file_sha256 | BYTEA NOT NULL | hash of the file |
| file_url | TEXT | URL for the file |


### `visitsextra` view

The `visitsextra` view includes the columns of the `simulations` table, plus extra columns with tags, comments, and files aggregated into json.
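
To make the shape of the aggregated columns concrete, the following pure-python sketch builds the kind of values the `tags`, `comments`, and `files` columns might hold for each visit sequence. In the database itself this would presumably be done inside the view definition with PostgreSQL's json aggregate functions rather than in python; all rows, identifiers, and URLs here are made up.

```python
import json
from collections import defaultdict

# Hypothetical rows from the tags, comments, and files tables.
tag_rows = [("uuid-a", "prenight"), ("uuid-a", "nominal"), ("uuid-b", "baseline")]
comment_rows = [("uuid-a", "2025-08-14T20:00:00+00:00", "Looks reasonable.")]
file_rows = [("uuid-a", "rewards", "s3://example-bucket/uuid-a/rewards.h5")]


def aggregate_extras(tag_rows, comment_rows, file_rows):
    """Collect tags, comments, and files per visit sequence as JSON-ready values."""
    extras = defaultdict(lambda: {"tags": [], "comments": {}, "files": {}})
    for visitseq_uuid, tag in tag_rows:
        extras[visitseq_uuid]["tags"].append(tag)
    for visitseq_uuid, comment_time, comment in comment_rows:
        extras[visitseq_uuid]["comments"][comment_time] = comment
    for visitseq_uuid, file_type, file_url in file_rows:
        extras[visitseq_uuid]["files"][file_type] = file_url
    return dict(extras)


extras = aggregate_extras(tag_rows, comment_rows, file_rows)
print(json.dumps(extras["uuid-a"], indent=2))
```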

| column | type | description |
| --- | --- | --- |
| visitseq_uuid | UUID PRIMARY KEY UNIQUE | RFC 4122 Universally Unique IDentifier (from python's `uuid.uuid4()`)|
| visitseq_label | TEXT NOT NULL | label for plots and tables |
| visitseq_url | TEXT | URL of visit sequence (sqlite3, maybe hdf5). It can be NULL, so statistics etc. can be stored in the database even if the actual visits are never saved. |
| telescope | TEXT NOT NULL | "simonyi" or "auxtel" |
| first_day_obs | DATE NOT NULL | day obs of first visit in sequence |
| last_day_obs | DATE NOT NULL | day obs of last visit in sequence |
| creation_time | TIMESTAMP WITH TIME ZONE NOT NULL| when the simulation was run |
| scheduler_version | TEXT | version of `rubin_scheduler` |
| config_url | TEXT | URL for the config script, e.g. https://raw.githubusercontent.com/lsst-ts/ts_config_ocs/${COMMIT_HASH}/Scheduler/feature_scheduler/maintel/fbs_config_sv_survey.py |
| sim_runner_args | JSONB | arguments to sim runner as a json dict |
| conda_env_sha256 | BYTEA | SHA256 hash of output of `conda list --json` |
| parent_visitseq_uuid | UUID | UUID of visitseq loaded into scheduler before running |
| parent_last_day_obs | DATE | day_obs of last visit loaded into scheduler before running |
| tags | JSONB | the set of tags on the simulation |
| comments | JSONB | the mapping of times to comments on the simulation |
| files | JSONB | the mapping of file type to URL on the simulation |


### `condaenv`

While in many cases saving conda environments will probably not be useful, there may be times when it is, and we can save environments in a separate table.
The simplest thing would be to just store the output of `conda list --json` in a table:

| column | type | description |
| --- | --- | --- |
| conda_env_hash | BYTEA PRIMARY KEY | SHA256 hash of output of `conda list --json` |
| conda_env | JSONB NOT NULL | output of `conda list --json` |

We need not require that all environments used be included in this table. For example, saving the detailed environments for the prenight simulations is unlikely to be useful and will take a lot of space, so we should just skip adding these environments to this table.
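
A sketch of how the hash used as the key might be computed from the text printed by `conda list --json`. The function name and the sample environment (including its version number) are invented for the example, so that the code runs without conda being installed.

```python
import hashlib
import json


def conda_env_hash(conda_list_json: str) -> bytes:
    """Return the raw SHA-256 digest of the text printed by `conda list --json`."""
    return hashlib.sha256(conda_list_json.encode("utf-8")).digest()


# In practice the text might come from something like
#   subprocess.run(["conda", "list", "--json"], capture_output=True, text=True).stdout
# A tiny hand-written stand-in is used here so the sketch runs without conda.
sample_env = json.dumps([{"name": "rubin-scheduler", "version": "0.0.0"}])
env_hash = conda_env_hash(sample_env)
print(env_hash.hex())
```

Because the hash is computed from the exact text, two environments with identical packages but differently ordered or formatted output would hash differently; normalizing the json before hashing might be worth considering.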

### `nightly_stats` table

Statistics of the distributions of visit parameters or other values can be included in the database.

The particular statistics are chosen to support the creation of [box-and-whisker plots](https://en.wikipedia.org/wiki/Box_plot).

For a given sequence of visits, the rows to add to this table might be generated thus:
```
columns = ['s_ra', 's_dec', 'sky_rotation', 'azimuth', 'altitude', 'eff_time_median', 'sky_bg_median', 'psf_area_median']
vs_data = visits.groupby('day_obs')[columns].describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95]).stack(level=0).reset_index()
vs_data['accumulated'] = False
```

| column | type | description |
| --- | --- | --- |
| visitseq_uuid | UUID  | RFC 4122 Universally Unique IDentifier |
| day_obs | DATE | The day obs of the visits included |
| value_name | TEXT | metric or column name |
| accumulated | BOOLEAN | Whether the statistics include all data through night day_obs, or only data on night day_obs |
| count | INTEGER | number of values in distribution |
| mean | DOUBLE PRECISION | mean value of metric |
| std | DOUBLE PRECISION | standard deviation of metric |
| min | DOUBLE PRECISION | min value of metric |
| p05 | DOUBLE PRECISION | 5% quantile of metric distribution |
| q1 | DOUBLE PRECISION | first quartile of metric distribution |
| median | DOUBLE PRECISION | median of metric |
| q3 | DOUBLE PRECISION | third quartile of metric distribution |
| p95 | DOUBLE PRECISION | 95% quantile of metric distribution |
| max | DOUBLE PRECISION | maximum value of metric |

Statistics will not be computed for all visit sequences, and not all available values will be included for sequences that are included.
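
For readers without `pandas` at hand, the statistics columns of this table can also be sketched with the standard library alone. The function name is hypothetical, and the choice of `method="inclusive"` (linear interpolation between data points) is my assumption about what best corresponds to the `describe` call above.

```python
import statistics


def distribution_summary(values):
    """Summarize a sequence of numbers with the statistics columns of `nightly_stats`."""
    data = sorted(values)
    # method="inclusive" interpolates linearly between data points,
    # treating the data as the full population of interest.
    twentieths = statistics.quantiles(data, n=20, method="inclusive")
    quartiles = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "count": len(data),
        "mean": statistics.fmean(data),
        "std": statistics.stdev(data),
        "min": data[0],
        "p05": twentieths[0],
        "q1": quartiles[0],
        "median": quartiles[1],
        "q3": quartiles[2],
        "p95": twentieths[-1],
        "max": data[-1],
    }


row = distribution_summary(range(1, 102))  # the integers 1 through 101
print(row)
```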

Distributions of `maf` metrics could also be added.
For example, the `mean`, `std`, `min`, etc. could be the `mean` etc. of the values of healpixels for a maf metric that is made with a `HealpixSlicer` and returns a healpy array.


### `maf_metrics` table

**The need for this table is currently speculative.**

If we wanted even more extensive `maf` support than supplied by the `nightly_stats` table above, we could create a child table, `maf_stats`, that inherits the columns from `nightly_stats` and adds the following:

| column | type | description |
| --- | --- | --- |
| maf_metric_name | TEXT PRIMARY KEY | A name for the MAF metric |
| rubin_sim_version | TEXT | the version of rubin_sim used |
| constraint | TEXT | constraint imposed in maf |
| metric_class_name | TEXT | class name of the metric |
| metric_args | JSONB | arguments to the metric constructor |
| slicer_class_name | TEXT | class name of the slicer |
| slicer_args | JSONB | arguments to the slicer constructor |

This would support, for example, the creation of box-and-whisker plots of things like area with accumulated depth at different times during the survey, and support tracing exactly how these were computed.

### `maf_summary_metrics` table

**The need for this table is currently speculative.**

| column | type | description |
| --- | --- | --- |
| visitseq_uuid | UUID | RFC 4122 Universally Unique IDentifier |
| maf_metric_name | TEXT REFERENCES maf_stats(maf_metric_name) | The name for the maf metric |
| day_obs | DATE | The last day_obs of visits included |
| accumulated | BOOLEAN | true if all visits through day_obs are included, false if only visits on day_obs are |
| summary_value | DOUBLE PRECISION | the value of the summary metric |


### `maf_metric_sets` table

**The need for this table is currently speculative.**

| column | type | description |
| --- | --- | --- |
| metric_set| TEXT NOT NULL | A name for the metric set |
| maf_metric_name | TEXT REFERENCES maf_stats(maf_metric_name) | the MAF metric name|
| short_name | TEXT | a shorter label to use in plots |
| style | TEXT | matplotlib line style |
| invert | BOOLEAN DEFAULT FALSE | lower is better? |
| mag | BOOLEAN DEFAULT FALSE | value is a magnitude? |


### `maf_summary` view

**The need for this view is currently speculative.**

A view can make it easy to get values and plotting style for all summary metrics in a given metric set,
for runs with a given tag:

| column | type | description |
| --- | --- | --- |
| metric_set| TEXT NOT NULL | A name for the metric set |
| maf_metric_name | TEXT REFERENCES maf_stats(maf_metric_name) | the MAF metric name |
| visitseq_uuid | UUID | RFC 4122 Universally Unique IDentifier for the visit sequence |
| visitseq_label | TEXT | Visit sequence label |
| day_obs | DATE | The last day_obs of visits included |
| accumulated | BOOLEAN | true if all visits through day_obs are included, false if only visits on day_obs are |
| summary_value | DOUBLE PRECISION | the value of the summary metric |
| short_name | TEXT | a shorter label to use in plots |
| style | TEXT | matplotlib line style |
| invert | BOOLEAN DEFAULT FALSE | lower is better? |
| mag | BOOLEAN DEFAULT FALSE | value is a magnitude? |
| tag | TEXT | tag for the run |


### `maf_healpix_stats` table

**The need for this table is currently speculative.**

| column | type | description |
| --- | --- | --- |
| visitseq_uuid | UUID | RFC 4122 Universally Unique IDentifier |
| maf_metric_name | TEXT REFERENCES maf_stats(maf_metric_name) | The name for the maf metric |
| day_obs | DATE | The last day_obs of visits included |
| accumulated | BOOLEAN | true if all visits through day_obs are included, false if only visits on day_obs are |
| nside | INTEGER NOT NULL | the nside of the healpix map used |
| count | INTEGER | number of unmasked values in distribution |
| mean | DOUBLE PRECISION | mean value of metric |
| std | DOUBLE PRECISION | standard deviation of metric |
| min | DOUBLE PRECISION | min value of metric |
| p05 | DOUBLE PRECISION | 5% quantile of metric distribution |
| q1 | DOUBLE PRECISION | first quartile of metric distribution |
| median | DOUBLE PRECISION | median of metric |
| q3 | DOUBLE PRECISION | third quartile of metric distribution |
| p95 | DOUBLE PRECISION | 95% quantile of metric distribution |
| max | DOUBLE PRECISION | maximum value of metric |
| url | TEXT | A url for the healpix values themselves, if saved |


## Thoughts on persistence

Many of the simulations will be of only transient interest, and will not warrant keeping around forever. We may, however, want to keep their records in this database around even after the visit sequence data itself has been deleted. We can indicate this by setting the URLs in the various tables to `NULL`.
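
A minimal sketch of what "forgetting" the visits while keeping the record could look like. As elsewhere, an in-memory `sqlite3` database stands in for PostgreSQL so the example runs; the table is abbreviated, and the function name, identifier, and URL are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE simulations (visitseq_uuid TEXT PRIMARY KEY, "
    "visitseq_label TEXT NOT NULL, visitseq_url TEXT)"
)
conn.execute(
    "INSERT INTO simulations VALUES (?, ?, ?)",
    ("uuid-a", "old prenight simulation", "s3://example-bucket/uuid-a/visits.h5"),
)


def forget_visits(connection, visitseq_uuid):
    """Record that the stored visits are gone, keeping the metadata row."""
    connection.execute(
        "UPDATE simulations SET visitseq_url = NULL WHERE visitseq_uuid = ?",
        (visitseq_uuid,),
    )


# After the file itself has been deleted from the bucket:
forget_visits(conn, "uuid-a")
remaining = conn.execute(
    "SELECT visitseq_label, visitseq_url FROM simulations"
).fetchall()
print(remaining)
```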


## Tools

### Introduction

Most of the simulations to be included in this archive will be handled by automated processes, but there should also be `python` and command line APIs to add "hand-generated" simulations or metrics.

The [`click`](https://click.palletsprojects.com/en/stable/) python module may be used to build a command line app with functions that correspond directly to the python API.
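
The idea of a command line that mirrors the python API one-to-one can be sketched as below. Note that this sketch uses the standard library's `argparse` rather than `click`, only so that it runs without extra dependencies; the `tag` function here is a placeholder that echoes its arguments and is not the real implementation.

```python
import argparse


def tag(visitseq_id, tags):
    """Placeholder for the python API call; here it only echoes its arguments."""
    return {"visitseq_id": visitseq_id, "tags": list(tags)}


def build_parser():
    parser = argparse.ArgumentParser(prog="visitsarch")
    subparsers = parser.add_subparsers(dest="command", required=True)
    tag_parser = subparsers.add_parser("tag", help="add tags to a visit sequence")
    tag_parser.add_argument("visitseq_id")
    tag_parser.add_argument("tags", nargs="+")
    # Each subcommand records the python function it corresponds to.
    tag_parser.set_defaults(func=lambda args: tag(args.visitseq_id, args.tags))
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    return args.func(args)


# Equivalent to running: visitsarch tag uuid-a example nominal
result = main(["tag", "uuid-a", "example", "nominal"])
print(result)
```

Keeping each subcommand a thin wrapper around one python function means the two interfaces cannot drift apart in behavior, only in argument parsing.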

Python functions will be supplied by the `rubin_sim.visitsarch` submodule.

### Adding a simulation: `visitsarch add_simulation`

The `visitsarch.add_simulation` function and `visitsarch add_simulation` bash command add a row to the `simulations` table, creating and returning a `visitseq_uuid` in the process.
In addition, there will be an option to copy a set of visits into an archive (a directory on disk or an S3 bucket).

Required arguments:

| argument | content |
| --- | --- |
| visits | Visits (a `pd.DataFrame` or `np.recarray` in python, or the name of an hdf5 file in bash) to add to the archive |
| label | Value for the `visitseq_label` column in the `simulations` table |

Optional arguments:

An `archive` keyword argument (and corresponding `--archive` bash option) supplies the base URL for the archive into which to copy the visits.
It can be `None` or an empty string if only a database entry should be created, and defaults to a standard S3 bucket.

Some columns will always be computed automatically (`visitseq_uuid`, `visitseq_sha256`).
Other columns can be set using keyword arguments (in python) or with a corresponding `--` option to the bash command.
In cases where it is possible, defaults should be created automatically. For example, `scheduler_version` can be derived automatically from the current environment, if it is not provided.
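
One way such a default could be derived is sketched below, using `importlib.metadata` from the standard library. The function name is hypothetical, and the distribution name `rubin-scheduler` is my assumption about how the package is registered in the environment.

```python
from importlib import metadata


def resolve_version(explicit=None, package="rubin-scheduler"):
    """Return the explicit version if given, else the installed version, else None."""
    if explicit is not None:
        return explicit
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # Leave the column NULL rather than guessing.
        return None


print(resolve_version("3.0.0"))
print(resolve_version(package="surely-not-an-installed-package"))
```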

python example:

```
>>> visits_id = visitsarch.add_simulation(
...     visits,
...     label="the visits I just simulated",
...     telescope="simonyi",
...     config_url="https://raw.githubusercontent.com/lsst-ts/ts_config_ocs/8ed1ab/Scheduler/feature_scheduler/maintel/fbs_config_sv_survey.py",
...     sim_runner_args=sim_runner_kwargs
... )
...
```

bash example:

```
$ visitsarch add_simulation visits.h5 "the visits I just simulated" \
>    --telescope=simonyi \
>    --config_url="https://raw.githubusercontent.com/lsst-ts/ts_config_ocs/8ed1ab/Scheduler/feature_scheduler/maintel/fbs_config_sv_survey.py" \
>    --sim_runner_args=sim_runner_args.json
```

### Add visits from the `consdb`: `visitsarch query_consdb`

Query the consdb, and add the returned visits as an entry in the `completed` table.

Arguments:

| argument | content |
| --- | --- |
| query | An sql query used to query the consdb |
| label | Value for the `visitseq_label` column in the `completed` table |
| telescope | "simonyi" or "auxtel" |
| first_dayobs | dayobs before which there can be no visits. This might be earlier than the dayobs of the first visit, if there were no visits on the night of first_dayobs itself. |
| last_dayobs | dayobs after which there can be no visits |
| endpoint | The connection information for the consdb, defaulting to the USDF consdb |
| archive | An endpoint in which to save the visit table. May be `None`, to simply record a metadata entry without saving the results. It should often be possible to recover the results of the query if they are not saved just by sending the query again, but not if the consdb gets updated. |

### Create a mixed set of visits: `visitsarch add_mixed`

Create an entry in the `mixedvisitseqs` table.

The `visitseq_uuid` and `visitseq_sha256` columns in the table will be determined automatically.

`visitseq_url` can be optional: if the parents exist, the mixed sequence can be reconstructed.

Other columns in the table require corresponding arguments to `visitsarch add_mixed`.

### Add tags to a sequence of visits: `visitsarch tag`

This can be simple.
In python:
```
> visitsarch.tag(visitseq_id, ['tag1', 'tag2', 'tag3'])
```

or in bash:
```
$ visitsarch tag $VISITSEQID tag1 tag2 tag3
```

### Add a comment to a sequence of visits: `visitsarch comment`

This can be simple too.
In python:
```
> visitsarch.comment(visitseq_id, "This is more of a question than a comment...")
```

or in bash:
```
$ visitsarch comment $VISITSEQID "This is more of a question than a comment..."
```

### Save the current environment in the archive: `visitsarch save_env`

No arguments are even needed, just use conda to get the environment.
Note that the `visitsarch.add_simulation` command will already have added the hash of the environment for whatever simulations it applies to.
This command merely saves what is actually in that environment.

In python:
```
> visitsarch.save_env()
```

In bash:
```
$ visitsarch save_env
```

### Add files to the archive associated with a visit sequence: `visitsarch add_file`

In python:

```
> base_url = "s3://rubin:rubin-scheduler-prenight/opsim/"
> visitsarch.add_file(visits_id, "rewards", rewards_fname, archive=base_url)
```

In bash:

```
$ visitsarch add_file $VISITS_ID rewards myrewards.h5 s3://rubin:rubin-scheduler-prenight/opsim/
```

### Query for simulations for an instrument that cover a date: `visitsarch query_night`

In python:

```
> visit_seqs = visitsarch.query_night(dayobs, telescope, earliest_sim, columns=['id', 'label', 'visitseq_url'], tags=['prenight', 'nominal'])
```

In bash:

```
$ visitsarch query_night 20250815 simonyi 2025-08-12 --columns id label visitseq_url --tags prenight nominal
```

Produces a `pandas.DataFrame` (python) or text table (bash) with the columns requested, which could be anything from the `simulations`, `completed`, or `mixedvisitseqs` tables, plus `tags` (a list of tags from the `tags` table) and `comments` (a list of comments from the `comments` table).
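
The selection logic might be something like the sketch below. The semantics are my reading of the arguments (a sequence matches if it is for the requested telescope, its day obs range covers the night, it was created no earlier than `earliest_sim`, and it carries all requested tags) and should be treated as an assumption; the records are invented, and a real implementation would run this as an SQL query rather than filtering in python.

```python
from datetime import date, datetime, timezone


def covers_night(seq, day_obs, telescope, earliest_sim, tags=()):
    """True if a visit sequence record matches the query_night criteria."""
    return (
        seq["telescope"] == telescope
        and seq["first_day_obs"] <= day_obs <= seq["last_day_obs"]
        and seq["creation_time"] >= earliest_sim
        and set(tags) <= set(seq["tags"])
    )


records = [
    {"label": "prenight A", "telescope": "simonyi",
     "first_day_obs": date(2025, 8, 15), "last_day_obs": date(2025, 8, 17),
     "creation_time": datetime(2025, 8, 14, 18, tzinfo=timezone.utc),
     "tags": ["prenight", "nominal"]},
    {"label": "stale prenight", "telescope": "simonyi",
     "first_day_obs": date(2025, 8, 13), "last_day_obs": date(2025, 8, 15),
     "creation_time": datetime(2025, 8, 11, 18, tzinfo=timezone.utc),
     "tags": ["prenight", "nominal"]},
    {"label": "auxtel prenight", "telescope": "auxtel",
     "first_day_obs": date(2025, 8, 15), "last_day_obs": date(2025, 8, 17),
     "creation_time": datetime(2025, 8, 14, 18, tzinfo=timezone.utc),
     "tags": ["prenight"]},
]

matches = [
    r["label"] for r in records
    if covers_night(r, date(2025, 8, 15), "simonyi",
                    datetime(2025, 8, 12, tzinfo=timezone.utc),
                    tags=["prenight", "nominal"])
]
print(matches)
```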

