Rounding fix #392

Merged
merged 3 commits into main from rounding-fix
Jan 14, 2024

Conversation

gunnarmorling
Owner

No description provided.

@gunnarmorling gunnarmorling merged commit a8fd067 into main Jan 14, 2024
1 check passed
@gunnarmorling gunnarmorling deleted the rounding-fix branch January 14, 2024 10:10
@@ -79,7 +80,7 @@ public static void main(String[] args) throws IOException {
return res;
},
agg -> {
-return new ResultRow(agg.min, agg.sum / agg.count, agg.max);
+return new ResultRow(agg.min, (Math.round(agg.sum * 10.0) / 10.0) / agg.count, agg.max);
Contributor

This is not correct: it just adds another rounding on top of the two roundings already done by toString and println, and thus hides the problem even more.

As I mentioned in #49, the problem cannot be fixed while double is used for the calculation, because not all numbers can be represented exactly as doubles (e.g. 0.1 or 99.9, see https://math.stackexchange.com/questions/2710986/exact-representation-of-floating-point-numbers). Double.parseDouble and the summation are therefore already imprecise, and adding any kind of rounding while calculating the average or printing won't fix that.

Consider:

package sum;

import java.math.BigDecimal;

class Sum {

    public static void main(String[] args) {
        var sum = 0.0;
        var sumD = BigDecimal.ZERO;
        var rowD = new BigDecimal("99.9");

        var count = 1_000_000_000;

        for (int i = 0; i < count; i++) {
            sum += 99.9;
            sumD = sumD.add(rowD);
        }

        System.out.println(sum);
        System.out.println(sumD);
    }
}

prints

$ java Sum.java
9.989999883589902E10
99900000000.0

As you can see, the sum is not precise even before we do any division.
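
To put a number on the gap, here is a minimal sketch that subtracts the two printed sums. It assumes the printed decimal strings are taken at face value (the double's exact binary value differs slightly from its printed form), and the class name `SumError` is only illustrative.

```java
import java.math.BigDecimal;

class SumError {

    // Absolute difference between the two sums printed above,
    // using the printed decimal strings as-is (an assumption).
    static BigDecimal absoluteError() {
        var exact = new BigDecimal("99900000000.0");
        var viaDouble = new BigDecimal("99899998835.89902"); // 9.989999883589902E10
        return exact.subtract(viaDouble);
    }

    public static void main(String[] args) {
        var error = absoluteError();
        System.out.println(error);                                  // 1164.10098
        System.out.println(error.movePointLeft(9).toPlainString()); // 0.00000116410098
    }
}
```

The second line is the same error spread over the 1_000_000_000 rows, i.e. how far the average would be off in this particular run.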

The proper way is either (slow) to use BigDecimal for the row values, calculate the sum, and apply rounding after the average calculation, or (fast) to use integer summation of row*10, which is possible because the input uses a fixed format, and again apply rounding at the end.
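
A minimal sketch of the fast variant, assuming every value has exactly one fractional digit. The names `ScaledSum`, `parseTenths` and `average` are hypothetical, and `RoundingMode.HALF_UP` is only a placeholder for whatever rounding the challenge rules require.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class ScaledSum {

    // Parses a fixed-format value with exactly one fractional digit,
    // e.g. "99.9" or "-0.5", into tenths: 999 or -5.
    static long parseTenths(String s) {
        return Long.parseLong(s.replace(".", ""));
    }

    // Divides the sum of tenths by count and rounds once, at the very end.
    // HALF_UP is an assumption; substitute the mode the rules call for.
    static String average(long sumTenths, long count) {
        return BigDecimal.valueOf(sumTenths, 1)
                .divide(BigDecimal.valueOf(count), 1, RoundingMode.HALF_UP)
                .toPlainString();
    }

    public static void main(String[] args) {
        long count = 1_000_000_000L;
        // Same total as adding 999 tenths a billion times; long arithmetic is exact here.
        long sum = parseTenths("99.9") * count;
        System.out.println(BigDecimal.valueOf(sum, 1)); // 99900000000.0
        System.out.println(average(sum, count));        // 99.9
    }
}
```

Because the accumulator is a long, no error builds up during summation, and the only rounding happens when the average is formatted.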

Contributor

I think it should be something like this:

/*
 *  Copyright 2023 The original authors
 *
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 */
package dev.morling.onebrc;

import static java.util.stream.Collectors.collectingAndThen;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.joining;
import static java.util.stream.Collectors.reducing;

import java.math.BigDecimal;
import java.math.RoundingMode;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.TreeMap;
import java.util.stream.Stream;

public class CalculateAverage_AlexanderYastrebov {

    private static class Measurement {

        final String name;
        private long count;

        // min, max and sum hold actual value scaled by 10
        private long min;
        private long max;
        private long sum;

        static Measurement parse(String line) {
            var parts = line.split(";", 2);
            return new Measurement(parts[0], parseMetric(parts[1]));
        }

        private static long parseMetric(String s) {
            return Long.parseLong(s.replaceFirst("[.]", ""));
        }

        Measurement(String name, long value) {
            this.name = name;
            this.count = 1;
            this.min = this.max = this.sum = value;
        }

        Measurement add(Measurement m) {
            this.min = Math.min(min, m.min);
            this.max = Math.max(max, m.max);
            this.sum += m.sum;
            this.count += m.count;
            return this;
        }

        String getName() {
            return name;
        }

        String format() {
            var smin = BigDecimal.valueOf(min)
                    .divide(BigDecimal.TEN, 1, RoundingMode.UNNECESSARY)
                    .toPlainString();

            var smax = BigDecimal.valueOf(max)
                    .divide(BigDecimal.TEN, 1, RoundingMode.UNNECESSARY)
                    .toPlainString();

            var savg = BigDecimal.valueOf(sum)
                    .divide(BigDecimal.valueOf(count * 10), 1, RoundingMode.CEILING)
                    .toPlainString();

            return String.format("%s=%s/%s/%s", name, smin, savg, smax);
        }
    }

    public static void main(String[] args) throws Exception {
        var input = "./measurements.txt";
        if (args.length == 1) {
            input = args[0];
        }

        try (Stream<String> lines = Files.lines(Paths.get(input))) {
            var result = lines.map(Measurement::parse)
                    .collect(groupingBy(Measurement::getName, TreeMap::new,
                            collectingAndThen(reducing(Measurement::add), Optional::get)));

            var output = result.values().stream()
                    .map(Measurement::format)
                    .collect(joining(", ", "{", "}"));

            System.out.println(output);
        }
    }
}

Owner Author

You are right, the calculation isn't correct, and it's certainly not what I would recommend doing in any real-world application.

But does it matter in any practical sense for the challenge at hand? Specifically, can there be any 1B row dataset with values of one fractional digit where the accumulated error would be so significant that the result with one fractional digit would differ from the result of a correct implementation?
