Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add builders for percentile and median accumulators/window functions #1139

Merged
merged 16 commits into from
Jun 21, 2023

Conversation

vbabanin
Copy link
Member

@vbabanin vbabanin commented Jun 6, 2023

This pull request (PR) introduces the capability to calculate the median and percentile of numeric values in the MongoDB aggregation pipeline for $group and $setWindowFields stages

JAVA-3860

@vbabanin vbabanin self-assigned this Jun 6, 2023
@vbabanin vbabanin requested a review from katcharov June 8, 2023 05:38
@vbabanin vbabanin requested a review from stIncMale June 16, 2023 02:34
@vbabanin vbabanin requested a review from stIncMale June 16, 2023 18:32
Copy link
Contributor

@katcharov katcharov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This is a partial review.)

Comment on lines +96 to +97
* Returns a combination of a computed field and an accumulator that generates a BSON {@link org.bson.BsonType#DOUBLE Double }
* representing the median value computed from the given {@code inExpression} within a group.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am finding these docs surprisingly hard to follow. There appear to be 3 levels of abstraction, and at least 6 complex relations in the same sentence:

  • this combines and returns: computed field + accumulator
  • which generates: BSON
  • which represents: median (which, in turn: computed, from-given, within-group)

I think we should focus on what this method is used for, and break up the relevant pieces of information, starting with the most relevant:

Used to determine the median value for the group. The result is emitted under the specified field name. Each item in a group is a document, so the "input expression" is used to specify the numeric value for each document. This is typically a numeric field on each document, but any expression may be specified.

I think we should also specify what happens to non-numeric values, what the result is when there are no documents, and what MQL value represents the document in cases where a plain field is not used (usually you can just use fields, but it should be possible to get the "median" number of fields per document, and I don't see how to do this ... would we use $$CURRENT?). And, we should have tests that cover these cases.

If we do want to still focus on (or include) certain technical details rather than usage, I think we should be more clear, and still, for example, break up the information so it is easier to understand.

(The same I think applies to the other docs.)

Copy link
Contributor

@katcharov katcharov Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed: we can address this later, in a future PR (or, decide not to).

* @param fieldName The field computed by the accumulator.
* @param inExpression The input expression.
* @param method The method to be used for computing the median.
* @param <InExpression> The type of the input expression.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should link the "type" (or the word "expression" on the param itself) of any new docs to MqlValue (which is in Beta).

Copy link
Member Author

@vbabanin vbabanin Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a good idea. However, I wonder if we should consider including the information that the Mql API can be utilized in situations where an expression is required, in the class level documentation instead? MqlValue can be used allmost in all methods of Accumulator class

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, this would be inconsistent with the rest of the class.

Discussed: we can address this later, in a future PR (or, decide not to).


//when
List<Document> results = getCollectionHelper().aggregate(Collections.singletonList(
group(new Document("gid", "$z"), quantileAccumulator)), DOCUMENT_DECODER);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect you can get rid of DOCUMENT_DECODER everywhere by doing:

List<BsonDocument> results = ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use new Document("gid", "$z") instead of $z?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed: we can address this (DOCUMENT_DECODER) later, in a future PR. ($z fixed)

List<Document> results = getCollectionHelper().aggregate(Collections.singletonList(
group(new Document("gid", "$z"), quantileAccumulator)), DOCUMENT_DECODER);

//then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test does not test the generated BSON, but it could do so as with the existing tests in this class, via:

List<Bson> pipeline = assertPipeline(...

Copy link
Contributor

@katcharov katcharov Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed: we can address this later, in a future PR.

Comment on lines 84 to 94
Object result = results.stream()
.filter(document -> document.get("_id").equals(new Document("gid", true)))
.findFirst().map(document -> document.get("sat_95")).get();

assertEquals(expectedGroup1, result);

result = results.stream()
.filter(document -> document.get("_id").equals(new Document("gid", false)))
.findFirst().map(document -> document.get("sat_95")).get();

assertEquals(expectedGroup2, result);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the above boilerplate could be replaced with, for example:

        assertResults(pipeline, "[\n"
                + "   { _id: true, sat_95: [3, 2] },\n"
                + "   { _id: false, sat_95: [1, 2] }\n"
                + "]");

This seems easier to follow than using a stream to pick out the ids, and then manually matching on the particular field.

So instead of the ~10 lines of boilerplate, plus these ~3 lines per test:

Arguments.of(
    percentile("sat_95", "$x", new double[]{0.95}, QuantileMethod.approximate()),
    asList(3.0), asList(1.0)),

there could be ~6 lines per test:

        List<Bson> pipeline = assertPipeline(
                ...
                group("$z", percentile("sat_95", "$x", ofNumberArray(0.95))));

        assertResults(pipeline, "[\n"
                + "   { _id: true, sat_95: [3, 2] },\n"
                + "   { _id: false, sat_95: [1, 1] }\n"
                + "]");

I count the lines because parameterization is often used to decrease boilerplate and line count, but to me it seems there can often be other ways to deal with that problem which ultimately make the code more clear.

I think it is fine to have a some duplication in the "data" part of a test (for example, trying to remove it separates percentile from group, and the array values from sat_95, which makes things harder to follow).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed: we can address this later, in a future PR (or, decide not to). We should have a consistent style in the tests.

JAVA-3860
Copy link
Contributor

@katcharov katcharov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vbabanin vbabanin merged commit c7558f9 into mongodb:master Jun 21, 2023
54 checks passed
@vbabanin vbabanin deleted the JAVA-3860 branch June 21, 2023 22:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants