Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Moving quantile #426

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open

Moving quantile #426

wants to merge 9 commits into from

Conversation

javiber
Copy link
Collaborator

@javiber javiber commented May 6, 2024

This PR implements a slow but exact (as in not an approximation) of the moving quantile.

Implementation considerations:

The implementation splits the window into smaller and bigger values split into 2 heaps while balancing them to keep the desired value near the top, for instance, if looking for the median the heaps will be balanced, if looking for the 0.25 percentile the smaller heap will containing 20% of values.

It uses custom heaps because we need to remove items from the heap (that may or may not be at the top) as they are removed from the windows, for this a mapping is maintained in memory.

In cases where the desired quantile is between 2 values, the average of the 2 closer values is used.

The result is equivalent to Numpy's "averaged_inverted_cdf" as tested in one of the unit tests.

I'm going to measure memory usage shortly but this could be an issue with long windows.

Timing considerations:

  • moving_quantile is significantly slower than other moving operations (~20X slower than simple_moving_average)
  • it's comparable to the pandas equivalent df.rolling(window=w, center=False).quantile(q). It's slower with longer windows and extreme quantiles (these make on of the heaps bigger which makes inserts more costly). See plots below

Note that our solution also works on non-uniformly sampled data where pandas wouldn't.

image
image

Other changes:

  • had to make several changes to the window operations to accommodate the new parameter quantile
  • Fixed an issue with negative window lengths

@javiber javiber force-pushed the moving_quantile branch 2 times, most recently from a2b4455 to 77d109b Compare May 8, 2024 16:29
@javiber javiber requested a review from achoum May 8, 2024 19:04
@javiber javiber marked this pull request as ready for review May 8, 2024 19:05
temporian/core/event_set_ops.py Outdated Show resolved Hide resolved
values.

The quantile calculated in each window is equivalent to numpy's
`"averaged_inverted_cdf"` method.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comments that the op only work on floating point features (or make some implicit conversion)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently, I also support int as a valid input, however the output is converted to float

float -> float
double -> double
int -> float

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*Added a comment on the doc explaining the point above

temporian/core/operators/window/base.py Outdated Show resolved Hide resolved
temporian/core/operators/window/moving_quantile.py Outdated Show resolved Hide resolved
f"Received {quantile}"
)
self.quantile = quantile
super().__init__(input, window_length, sampling)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to have it at the top? If so, can you add a comment. If not, I would move it at the top.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there was a reason but I had so many backs and forth with this one that I forgot why. Let me try to put at the top as well as change quantile to _quantile and see if anything breaks

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't work, the reason is that the init of the base class runs a self.check() at the end and this needs all the attributes to be defined including the quantile

CustomHeap(std::function<bool(T, T)> compare) : compare(compare) {}

void push(T value) {
heap.push_back(value);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is invalidating all the iterators in "val_to_node" right? Can you add a comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't because heap is a double-linked list so all the existing nodes should remain in the same place. That's the reason why I'm using lists instead of vectors

auto it = std::prev(heap.end());
val_to_node[value] = it;
// TODO: better sorting?
while (it != heap.begin()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks expensive, can you add an explanation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, and I suspect this is where we spend the most time.
I need the heap to have the biggest (according to compare) item at the top and this was the easiest way to ensure that, I'm open to suggestions for a better sorting algorithm or a better data structure for the heap (trees maybe?)

} else {
auto value = heap.back();
heap.pop_back();
auto it = val_to_node.find(value);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if there are multiple time the same value?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be a problem I forgot to add a comment addressing that. It's not a problem in this case because I'm storing the indices as the heap "values" so they are never going to repeat

val_to_node.erase(it);
} else {
// TODO: exception meant for debugging, remove it
throw std::invalid_argument("removing a value that doesn't exists");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please no exceptions

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch forgot to remove this one, it was only there for debugging

void remove(T value) {
auto it = val_to_node.find(value);
if (it != val_to_node.end()) {
heap.erase(it->second);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, the heap is not valid anymore, right?
Can you add a comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the comment above, it shouldn't because this is a double-linked list so this node is gone, then the next of the previous node and the previous of the next node are updated but all the nodes should remain in the same place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants