Resampling MVP #1010

Closed
Tracked by #1009
alexowens90 opened this issue Oct 30, 2023 · 2 comments · Fixed by #1495
Comments

@alexowens90
Collaborator

alexowens90 commented Oct 30, 2023

The aim is to provide an API and minimal feature set that can be extended to cover all of the (useful) functionality provided by the Pandas resample method.

Long term, we will need to support the functionality provided by the arguments rule, closed, label, origin, and offset.

For non-trivial bucket boundaries (e.g. last Thursday of every month) we should leverage Pandas to generate the actual boundaries of interest to pass to the C++ layer. For simpler boundaries (e.g. minute bars) we can have a more compact representation, although this is not required for the MVP.
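As a rough illustration of leaning on Pandas for non-trivial boundaries, the sketch below generates "last Thursday of every month" bucket edges and converts them to pairs of int64 nanoseconds since epoch. The helper name bucket_boundaries_ns is hypothetical, and the exact representation handed to the C++ layer is an assumption.

```python
import pandas as pd


def bucket_boundaries_ns(start, end, freq):
    """Return (start_ns, end_ns) pairs of UTC nanoseconds since epoch.

    Hypothetical helper: Pandas does the calendar arithmetic, and only
    plain integers would need to cross into the C++ layer.
    """
    edges = pd.date_range(start=start, end=end, freq=freq, tz="UTC")
    # Timestamp.value is the number of nanoseconds since the Unix epoch.
    edge_ns = [ts.value for ts in edges]
    # Consecutive edges form one bucket each.
    return list(zip(edge_ns[:-1], edge_ns[1:]))


# weekday=3 is Thursday (Monday is 0).
last_thursday = pd.offsets.LastWeekOfMonth(weekday=3)
pairs = bucket_boundaries_ns("2024-01-01", "2024-03-31", last_thursday)
```

Here pairs holds two buckets, running from 25 January to 29 February 2024 and from 29 February to 28 March 2024.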

Proposed MVP:

  • Do not support upsampling, only downsampling.
  • Present the resampled data back to the user, with no option to write the data back to another symbol directly.
  • Leverage Pandas to convert rule, origin, and offset into a list of pairs of UTC timestamps stored as int64_t nanoseconds since epoch representing the bucket boundaries.
  • Pass the closed and label arguments directly through to the clause constructor.
  • Use the QueryBuilder directly rather than adding syntactic-sugar methods to the Library or NativeVersionStore classes.
  • Use a "data-driven" approach to empty buckets, i.e. only include buckets in the output for which there was an index value in the appropriate range.
  • Have a single clause to handle resampling (as opposed to the 2-stage process for hash-based groupings) since the repartition would always be trivial for resampling.
  • Support static schema only.
  • No "named agg" equivalent, so only one aggregation is possible per input column.
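To make the boundary-pair, closed/label, and "data-driven" points above concrete, here is a minimal pure-Python sketch of a single-pass first/last resample over a sorted int64 nanosecond index. It is not the C++ clause itself; the function name resample_first_last and its signature are hypothetical.

```python
from bisect import bisect_left, bisect_right

NS_PER_MINUTE = 60 * 1_000_000_000


def resample_first_last(index_ns, values, boundaries, closed="left", label="left"):
    """Return (label_ns, first, last) for each non-empty bucket.

    index_ns must be sorted ascending; boundaries is a list of
    (start_ns, end_ns) pairs, as Pandas would produce them.
    """
    out = []
    for start, end in boundaries:
        if closed == "left":   # bucket covers [start, end)
            lo = bisect_left(index_ns, start)
            hi = bisect_left(index_ns, end)
        else:                  # bucket covers (start, end]
            lo = bisect_right(index_ns, start)
            hi = bisect_right(index_ns, end)
        if lo == hi:
            continue           # data-driven: empty buckets are omitted
        out.append((start if label == "left" else end, values[lo], values[hi - 1]))
    return out


# Four one-minute buckets starting at the epoch; the third one has no data.
boundaries = [(i * NS_PER_MINUTE, (i + 1) * NS_PER_MINUTE) for i in range(4)]
index_ns = [0, NS_PER_MINUTE // 2, NS_PER_MINUTE, 200 * 1_000_000_000]
values = [10, 11, 12, 13]
bars = resample_first_last(index_ns, values, boundaries)
```

With closed="left" the index value sitting exactly on the one-minute edge falls into the second bucket; with closed="right" it would fall into the first.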

e.g.

q = QueryBuilder()
q = q.resample(rule="T", closed="left", label="left", origin="epoch", offset=None).agg({"open": "first", "close": "last"})
df = lib.read(sym, date_range=(t1, t2), query_builder=q).data
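For reference, a plain Pandas sketch of what the query above would be expected to return is shown here. Note that Pandas emits a row for every bucket in range, so the empty ones are filtered out to mimic the proposed "data-driven" behaviour. The sample data is made up, and "1min" is used in place of "T" because newer Pandas versions deprecate the "T" alias.

```python
import pandas as pd

idx = pd.to_datetime(
    ["2024-01-01 09:00:10", "2024-01-01 09:00:50", "2024-01-01 09:03:05"]
)
df = pd.DataFrame({"open": [1.0, 2.0, 3.0], "close": [1.5, 2.5, 3.5]}, index=idx)

kwargs = dict(rule="1min", closed="left", label="left", origin="epoch")
bars = df.resample(**kwargs).agg({"open": "first", "close": "last"})
# Keep only buckets that contained at least one index value.
bars = bars[df.resample(**kwargs).size() > 0]
```

The 09:01 and 09:02 buckets contain no rows, so bars is left with the 09:00 and 09:03 bars only.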
@alexowens90 alexowens90 added the enhancement New feature or request label Oct 30, 2023
@alexowens90 alexowens90 self-assigned this Oct 30, 2023
@alexowens90 alexowens90 mentioned this issue Oct 30, 2023
1 task
@DrNickClarke
Collaborator

Supported aggregations: sum, mean, min, max, count, first, last

  • all NaN-correct
  • first, last and count support strings
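One way to read "NaN-correct" is "matches the Pandas defaults", where NaN values are skipped by every aggregation and count only counts non-NaN entries. That reading is an assumption rather than something stated here. Under it, the Pandas reference behaviour for the listed aggregations, including the string-capable ones, can be sketched as:

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(
    ["2024-01-01 09:00:10", "2024-01-01 09:00:20", "2024-01-01 09:00:50"]
)
df = pd.DataFrame(
    {"price": [1.0, np.nan, 3.0], "venue": ["A", "B", "C"]}, index=idx
)

resampler = df.resample("1min")
# Numeric column: NaN is ignored by every aggregation.
numeric = resampler["price"].agg(
    ["sum", "mean", "min", "max", "count", "first", "last"]
)
# String column: only first, last and count are meaningful.
strings = resampler["venue"].agg(["first", "last", "count"])
```

All three rows land in the single 09:00 bucket, so the NaN in price is excluded from the sum, mean, and count, while the venue count covers all three strings.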

@mosaikme

Wow, this would be such an epic addition to this already awesome library (db), especially with the newly (nearly) added first, last, and count. I would be happy to help with testing. All the best.

alexowens90 added a commit that referenced this issue May 30, 2024