Store orbit determination results in parquet #134

Closed
16 of 17 tasks
ChristopherRabotin opened this issue Apr 18, 2023 · 0 comments · Fixed by #147
Labels
Kind: Improvement (This is a proposed improvement) · Priority: high · Status: Design (Issue at Design phase of the quality assurance process) · Topic: Orbit Determination

Comments


ChristopherRabotin commented Apr 18, 2023

Coauthors: Claude by Anthropic and GPT-4

High level description

Storing the OD results (estimates and residuals) in CSV files is inefficient and cumbersome. We propose implementing a to_parquet function for the ODProcess structure. The Apache Parquet format is well suited to this type of data: by storing it in Parquet, users can query and analyze the estimates and residuals much more efficiently. The data should be stored in base units; it is currently in kilometers, which leads to error plots in "micro kilometers", which is confusing.

Requirements

  1. The ODProcess structure must have a to_parquet method to export its data to Parquet.
  2. The to_parquet method must take:
    • A path to the Parquet file.
    • (Optional) A list of EventEvaluator to evaluate events. If provided, the events data is also exported.
    • An ExportCfg configuration object to configure the export (consider reusing the one currently used for trajectory export).
  3. The method must export:
    • The estimates (state, state deviation, nominal state)
    • The residuals (prefit and postfit for each measurement type)
    • The events data (if evaluators were provided)
  4. Null values should be used when there is no data for an epoch (e.g. no measurement, no event). (Human: this is already achievable because the FloatType is an Option<f64>.)
  5. The schema of the Parquet file should be flexible to support different measurement and event types between OD processes.
  6. The export process should be efficient and add little overhead. The larger the OD result set, the greater the benefit of using Parquet.
  7. Appropriate error handling should be in place in case of issues exporting or writing to the Parquet file.
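To make the requirements above concrete, here is a hedged, stdlib-only sketch of what the method shape could look like. The `ODProcess`, `ExportCfg`, and `EventEvaluator` types below are stand-in stubs, not the real nyx-space definitions; only the signature pattern (path, optional event evaluators, export config, `Result` with a boxed error per requirement 7) is the point.

```rust
use std::error::Error;
use std::path::{Path, PathBuf};

// Hypothetical stand-ins for the library types named in this issue.
// The real definitions live in nyx-space; these stubs only illustrate
// the proposed method shape, not the actual implementation.
pub struct ExportCfg;
pub struct EventEvaluator;
pub struct ODProcess {
    // One entry per epoch; None marks a missing value (requirement 4).
    pub estimates: Vec<Option<f64>>,
    pub residuals: Vec<Option<f64>>,
}

impl ODProcess {
    /// Proposed export entry point: writes estimates, residuals, and
    /// (optionally) event data to a Parquet file at `path`.
    pub fn to_parquet<P: AsRef<Path>>(
        &self,
        path: P,
        events: Option<&[EventEvaluator]>,
        _cfg: ExportCfg,
    ) -> Result<PathBuf, Box<dyn Error>> {
        // A real implementation would build an Arrow schema and write
        // row groups; this stub only echoes the requested path back,
        // matching the diagram's "return path to exported file" step.
        let _ = events;
        Ok(path.as_ref().to_path_buf())
    }
}

fn main() -> Result<(), Box<dyn Error>> {
    let od = ODProcess {
        estimates: vec![Some(1.0), None],
        residuals: vec![None, Some(0.5)],
    };
    let out = od.to_parquet("od_results.parquet", None, ExportCfg)?;
    println!("{}", out.display());
    Ok(())
}
```

Returning the written path (rather than `()`) matches the sequence diagram below and gives callers something to hand to downstream analysis tooling.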

Test plans

Unit tests:

Integration tests:

  • Export a large OD result set (thousands of epochs) to Parquet and read it back to assert no loss of data or performance issues.
  • ~~Query and analyze the data in various ways to ensure the benefits of using Parquet are achieved.~~ This is evident compared to the current CSV format.

Edge cases:

  • An empty OD result set (no estimates or residuals).
  • An OD result set with only estimates and no residuals. Not possible.
  • A malformed ExportCfg object. Not possible.
  • Invalid path provided. Handled by the path object.
  • Lack of write permissions to the path. Handled by the path object.
  • Corrupted Parquet file as input. Not possible, since a new Parquet file is always created.
  • Out-of-memory issues when exporting very large result sets. These failures should be handled gracefully; they surface as a dyn Error, which is supported. Not sure how to test this on any of my machines, since they all have several GBs of RAM.

Benchmark tests

  • Compare performance of exporting to Parquet vs CSV for large result sets. Parquet should provide major speedups and efficiency gains.

Documentation and examples:

  • Clearly document the to_parquet method and ExportCfg to ensure proper usage.
  • Provide examples of querying and analyzing the exported Parquet data.

Design

Here is a Mermaid JS diagram showing the proposed implementation:

sequenceDiagram
    participant User
    participant ODProcess 
    participant ParquetExporter
    User->>ODProcess: Call to_parquet() method 
    ODProcess->>ParquetExporter: Instantiate exporter
    ODProcess->>ParquetExporter: Provide estimates and residuals data
    ParquetExporter->>ParquetExporter: Validate input data
    ParquetExporter->>ParquetExporter: Create Parquet schema
    ParquetExporter->>ParquetExporter: Write data to Parquet file
    ParquetExporter-->>ODProcess: Return path to exported file
    ODProcess-->>User: Return path to exported file

Consider using the builder pattern from https://github.com/apache/arrow-rs/blob/master/arrow/examples/builders.rs, or parquet_derive directly (though the latter might not work because it requires a custom struct for each variation), cf. https://github.com/apache/arrow-rs/blob/master/parquet_derive/README.md .
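As a stdlib-only sketch of that builder pattern: arrow-rs builders such as `Float64Builder` accumulate values and nulls, then `finish()` into a column array. The mock builder below (a hypothetical `NullableF64Builder`, not an arrow-rs type) mimics that API to show how `Option<f64>` maps onto nullable Parquet columns per requirement 4.

```rust
// Stdlib-only mock of an Arrow-style nullable float builder.
// arrow-rs's Float64Builder works similarly: append values or nulls,
// then finish() the column. A None entry becomes a Parquet null,
// which is how "no data for an epoch" is represented.
struct NullableF64Builder {
    values: Vec<f64>,
    validity: Vec<bool>, // false marks a null slot, like an Arrow validity bitmap
}

impl NullableF64Builder {
    fn new() -> Self {
        Self { values: Vec::new(), validity: Vec::new() }
    }

    fn append_option(&mut self, v: Option<f64>) {
        match v {
            Some(x) => {
                self.values.push(x);
                self.validity.push(true);
            }
            None => {
                // Placeholder value; the validity bit is what matters.
                self.values.push(f64::NAN);
                self.validity.push(false);
            }
        }
    }

    fn finish(self) -> (Vec<f64>, Vec<bool>) {
        (self.values, self.validity)
    }
}

fn main() {
    let mut col = NullableF64Builder::new();
    // Prefit residuals: no measurement at the second epoch, so it is null.
    for r in [Some(0.12), None, Some(-0.07)] {
        col.append_option(r);
    }
    let (_values, validity) = col.finish();
    let non_null = validity.iter().filter(|v| **v).count();
    println!("{} non-null of {}", non_null, validity.len());
}
```

The same per-measurement-type column-building approach would let the schema stay flexible across OD processes (requirement 5), since each run only emits builders for the measurement types it actually has.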
