Store orbit determination results in parquet #134

Closed
16 of 17 tasks
ChristopherRabotin opened this issue Apr 18, 2023 · 0 comments · Fixed by #147
Labels
Kind: Improvement (This is a proposed improvement) · Priority: high · Status: Design (Issue at Design phase of the quality assurance process) · Topic: Orbit Determination

Comments


ChristopherRabotin commented Apr 18, 2023

Coauthors: Claude by Anthropic and GPT-4

High level description

Storing the OD results (estimates and residuals) in CSV files is inefficient and cumbersome. We propose implementing a to_parquet function for the ODProcess structure. The Apache Parquet format is well suited to this type of data: by storing it in Parquet, users can query and analyze the estimates and residuals much more efficiently. The data should be stored in base units; it is currently in kilometers, which leads to error plots in "micro kilometers", which is confusing.

Requirements

  1. The ODProcess structure must have a to_parquet method to export its data to Parquet.
  2. The to_parquet method must take:
    • A path to the Parquet file.
    • (Optional) A list of EventEvaluator to evaluate events. If provided, the events data is also exported.
    • An ExportCfg configuration object to configure the export (consider reusing the one currently used for trajectory export).
  3. The method must export:
    • The estimates (state, state deviation, nominal state)
    • The residuals (prefit and postfit for each measurement type)
    • The events data (if evaluators were provided)
  4. Null values should be used when there is no data for an epoch (e.g. no measurement, no event). (Human: this is already achievable because the FloatType is an Option<f64>.)
  5. The schema of the Parquet file should be flexible to support different measurement and event types between OD processes.
  6. The export process should be efficient and add little overhead. The larger the OD result set, the greater the benefit of using Parquet.
  7. Appropriate error handling should be in place in case of issues exporting or writing to the Parquet file.
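To make the requirements above concrete, here is a hedged, stdlib-only sketch of what the method shape could look like. The `ODProcess`, `ExportCfg`, and `EventEvaluator` types below are stand-in stubs, not the real nyx-space definitions; only the signature pattern (path, optional event evaluators, export config, `Result` with a boxed error per requirement 7) is the point.

```rust
use std::error::Error;
use std::path::{Path, PathBuf};

// Hypothetical stand-ins for the library types named in this issue.
// The real definitions live in nyx-space; these stubs only illustrate
// the proposed method shape, not the actual implementation.
pub struct ExportCfg;
pub struct EventEvaluator;
pub struct ODProcess {
    // One entry per epoch; None marks a missing value (requirement 4).
    pub estimates: Vec<Option<f64>>,
    pub residuals: Vec<Option<f64>>,
}

impl ODProcess {
    /// Proposed export entry point: writes estimates, residuals, and
    /// (optionally) event data to a Parquet file at `path`.
    pub fn to_parquet<P: AsRef<Path>>(
        &self,
        path: P,
        events: Option<&[EventEvaluator]>,
        _cfg: ExportCfg,
    ) -> Result<PathBuf, Box<dyn Error>> {
        // A real implementation would build an Arrow schema and write
        // row groups; this stub only echoes the requested path back,
        // matching the diagram's "return path to exported file" step.
        let _ = events;
        Ok(path.as_ref().to_path_buf())
    }
}

fn main() -> Result<(), Box<dyn Error>> {
    let od = ODProcess {
        estimates: vec![Some(1.0), None],
        residuals: vec![None, Some(0.5)],
    };
    let out = od.to_parquet("od_results.parquet", None, ExportCfg)?;
    println!("{}", out.display());
    Ok(())
}
```

Returning the written path (rather than `()`) matches the sequence diagram below and gives callers something to hand to downstream analysis tooling.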

Test plans

Unit tests:

Integration tests:

  • Export a large OD result set (thousands of epochs) to Parquet and read it back to assert no loss of data or performance issues.
  • ~~Query and analyze the data in various ways to ensure the benefits of using Parquet are achieved.~~ This is evident compared to the current CSV format.

Edge cases:

  • An empty OD result set (no estimates or residuals).
  • An OD result set with only estimates and no residuals. Not possible.
  • A malformed ExportCfg object. Not possible.
  • Invalid path provided. Handled by the path object.
  • Lack of write permissions to the path. Handled by the path object.
  • Corrupted Parquet file as input. Not possible, since a new Parquet file is always created.
  • Out-of-memory issues when exporting very large result sets. These failures should be handled gracefully; they surface as a dyn Error, which is supported. Not sure how to test this on any of my machines, since they all have several GBs of RAM.

Benchmark tests

  • Compare performance of exporting to Parquet vs CSV for large result sets. Parquet should provide major speedups and efficiency gains.

Documentation and examples:

  • Clearly document the to_parquet method and ExportCfg to ensure proper usage.
  • Provide examples of querying and analyzing the exported Parquet data.

Design

Here is a Mermaid JS diagram showing the proposed implementation:

sequenceDiagram
    participant User
    participant ODProcess 
    participant ParquetExporter
    User->>ODProcess: Call to_parquet() method 
    ODProcess->>ParquetExporter: Instantiate exporter
    ODProcess->>ParquetExporter: Provide estimates and residuals data
    ParquetExporter->>ParquetExporter: Validate input data
    ParquetExporter->>ParquetExporter: Create Parquet schema
    ParquetExporter->>ParquetExporter: Write data to Parquet file
    ParquetExporter-->>ODProcess: Return path to exported file
    ODProcess-->>User: Return path to exported file

Consider using the builder pattern from https://github.com/apache/arrow-rs/blob/master/arrow/examples/builders.rs, or parquet_derive directly (though the latter might not work because it requires a custom struct for each variation), cf. https://github.com/apache/arrow-rs/blob/master/parquet_derive/README.md .
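As a stdlib-only sketch of that builder pattern: arrow-rs builders such as `Float64Builder` accumulate values and nulls, then `finish()` into a column array. The mock builder below (a hypothetical `NullableF64Builder`, not an arrow-rs type) mimics that API to show how `Option<f64>` maps onto nullable Parquet columns per requirement 4.

```rust
// Stdlib-only mock of an Arrow-style nullable float builder.
// arrow-rs's Float64Builder works similarly: append values or nulls,
// then finish() the column. A None entry becomes a Parquet null,
// which is how "no data for an epoch" is represented.
struct NullableF64Builder {
    values: Vec<f64>,
    validity: Vec<bool>, // false marks a null slot, like an Arrow validity bitmap
}

impl NullableF64Builder {
    fn new() -> Self {
        Self { values: Vec::new(), validity: Vec::new() }
    }

    fn append_option(&mut self, v: Option<f64>) {
        match v {
            Some(x) => {
                self.values.push(x);
                self.validity.push(true);
            }
            None => {
                // Placeholder value; the validity bit is what matters.
                self.values.push(f64::NAN);
                self.validity.push(false);
            }
        }
    }

    fn finish(self) -> (Vec<f64>, Vec<bool>) {
        (self.values, self.validity)
    }
}

fn main() {
    let mut col = NullableF64Builder::new();
    // Prefit residuals: no measurement at the second epoch, so it is null.
    for r in [Some(0.12), None, Some(-0.07)] {
        col.append_option(r);
    }
    let (_values, validity) = col.finish();
    let non_null = validity.iter().filter(|v| **v).count();
    println!("{} non-null of {}", non_null, validity.len());
}
```

The same per-measurement-type column-building approach would let the schema stay flexible across OD processes (requirement 5), since each run only emits builders for the measurement types it actually has.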
