Rounding fix #392

Merged
merged 3 commits into from
Jan 14, 2024
1 change: 1 addition & 0 deletions README.md
@@ -222,6 +222,7 @@ If you want to use a build not available via these channels, reach out to discus
* There is a maximum of 10,000 unique station names
* Line endings in the file are `\n` characters on all platforms
* Implementations must not rely on specifics of a given data set, e.g. any valid station name as per the constraints above and any data distribution (number of measurements per station) must be supported
* The rounding of output values must be done using the semantics of IEEE 754 rounding-direction "roundTowardPositive"

## Entering the Challenge

19 changes: 19 additions & 0 deletions calculate_average_baseline_original_rounding.sh
@@ -0,0 +1,19 @@
#!/bin/sh
#
# Copyright 2023 The original authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

JAVA_OPTS=""
java $JAVA_OPTS --class-path target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CalculateAverage_baseline_original_rounding
14 changes: 7 additions & 7 deletions evaluate.sh
@@ -139,6 +139,13 @@ failed=()
for fork in "$@"; do
set +e # we don't want prepare.sh, test.sh or hyperfine failing on 1 fork to exit the script early

# Run prepare script
if [ -f "./prepare_$fork.sh" ]; then
print_and_execute source "./prepare_$fork.sh"
else
print_and_execute sdk use java $DEFAULT_JAVA_VERSION
fi

# Run the test suite
print_and_execute $TIMEOUT ./test.sh $fork
if [ $? -ne 0 ]; then
@@ -165,13 +172,6 @@ for fork in "$@"; do
print_and_execute rm -f measurements.txt
print_and_execute ln -s $MEASUREMENTS_FILE measurements.txt

# Run prepare script
if [ -f "./prepare_$fork.sh" ]; then
print_and_execute source "./prepare_$fork.sh"
else
print_and_execute sdk use java $DEFAULT_JAVA_VERSION
fi

# Use hyperfine to run the benchmark for each fork
HYPERFINE_OPTS="--warmup 0 --runs $RUNS --export-json $fork-$filetimestamp-timing.json --output ./$fork-$filetimestamp.out"

@@ -35,6 +35,7 @@ private Measurement(String[] parts) {
}

private static record ResultRow(double min, double mean, double max) {

public String toString() {
return round(min) + "/" + round(mean) + "/" + round(max);
}
@@ -79,7 +80,7 @@ public static void main(String[] args) throws IOException {
return res;
},
agg -> {
-return new ResultRow(agg.min, agg.sum / agg.count, agg.max);
+return new ResultRow(agg.min, (Math.round(agg.sum * 10.0) / 10.0) / agg.count, agg.max);
Contributor


This is not correct: it just adds another rounding on top of the two roundings already done by toString and println, and thus hides the problem even more.

You see, as I mentioned in #49, the problem cannot be fixed while double is used for the calculation, because not all numbers can be represented exactly as doubles (e.g. 0.1 or 99.9, see https://math.stackexchange.com/questions/2710986/exact-representation-of-floating-point-numbers), and therefore Double.parseDouble and the summation are already imprecise. Adding any kind of rounding during the calculation of the average or during printing won't fix that.
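
As a small illustration of the representability point, the sketch below prints the exact decimal value that a double actually stores for a few literals. The class and method names (`RepresentationSketch`, `exactValueOf`, `isExact`) are made up for this example and are not part of the PR; it only relies on the `BigDecimal(double)` constructor, which converts the binary value without rounding.

```java
import java.math.BigDecimal;

public class RepresentationSketch {

    // Exact decimal expansion of the binary value stored in the double.
    static BigDecimal exactValueOf(double d) {
        return new BigDecimal(d);
    }

    // True when the parsed double holds exactly the decimal written in the text.
    static boolean isExact(String text) {
        BigDecimal stored = new BigDecimal(Double.parseDouble(text));
        return stored.compareTo(new BigDecimal(text)) == 0;
    }

    public static void main(String[] args) {
        // 0.5 is a power of two and is stored exactly; 0.1 and 99.9 are not.
        for (String text : new String[]{ "0.1", "99.9", "0.5" }) {
            System.out.println(text + " -> " + exactValueOf(Double.parseDouble(text))
                    + " exact=" + isExact(text));
        }
    }
}
```

Values such as 0.5 survive parsing unchanged, while 0.1 and 99.9 pick up a small error as soon as they are parsed, before any summation takes place.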

Consider:

package sum;

import java.math.BigDecimal;

class Sum {

    public static void main(String[] args) {
        var sum = 0.0;
        var sumD = BigDecimal.ZERO;
        var rowD = new BigDecimal("99.9");

        var count = 1_000_000_000;

        for (int i = 0; i < count; i++) {
            sum += 99.9;
            sumD = sumD.add(rowD);
        }

        System.out.println(sum);
        System.out.println(sumD);
    }
}

prints

$ java Sum.java
9.989999883589902E10
99900000000.0

As you can see, the sum is not precise even before we do any division.

The proper way is either (slow) to use BigDecimal for the row values, calculate the sum, and apply rounding after the average calculation, or (fast) to use integer summation of row*10, which is possible because the input uses a fixed format, and again apply rounding at the end.
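
The two options can be sketched side by side as follows. This is only a minimal illustration under stated assumptions: every value has exactly one fractional digit, rounding toward positive infinity is expressed as `RoundingMode.CEILING` and as a ceiling division, and the names `ExactMeanSketch`, `meanWithBigDecimal` and `meanInTenths` are hypothetical.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.List;

public class ExactMeanSketch {

    // "Slow" variant: exact decimal arithmetic on the textual values,
    // rounded once, after the division.
    static String meanWithBigDecimal(List<String> values) {
        BigDecimal sum = BigDecimal.ZERO;
        for (String v : values) {
            sum = sum.add(new BigDecimal(v));
        }
        return sum.divide(BigDecimal.valueOf(values.size()), 1, RoundingMode.CEILING)
                .toPlainString();
    }

    // "Fast" variant: values scaled by 10 and summed as long.
    // Assumes exactly one fractional digit per value (e.g. "12.3", "-0.5").
    static long meanInTenths(List<String> values) {
        long sum = 0;
        for (String v : values) {
            sum += Long.parseLong(v.replace(".", ""));
        }
        // -floorDiv(-a, b) is a ceiling division for b > 0.
        return -Math.floorDiv(-sum, (long) values.size());
    }

    public static void main(String[] args) {
        List<String> values = List.of("0.1", "0.2");
        System.out.println(meanWithBigDecimal(values)); // mean as a decimal string
        System.out.println(meanInTenths(values));       // same mean, in tenths
    }
}
```

Both variants round only once, at the very end, so no error accumulates during the summation itself.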

Contributor


I think it should be something like this:

/*
 *  Copyright 2023 The original authors
 *
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 */
package dev.morling.onebrc;

import static java.util.stream.Collectors.collectingAndThen;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.joining;
import static java.util.stream.Collectors.reducing;

import java.math.BigDecimal;
import java.math.RoundingMode;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.TreeMap;
import java.util.stream.Stream;

public class CalculateAverage_AlexanderYastrebov {

    private static class Measurement {

        final String name;
        private long count;

        // min, max and sum hold actual value scaled by 10
        private long min;
        private long max;
        private long sum;

        static Measurement parse(String line) {
            var parts = line.split(";", 2);
            return new Measurement(parts[0], parseMetric(parts[1]));
        }

        private static long parseMetric(String s) {
            return Long.parseLong(s.replaceFirst("[.]", ""));
        }

        Measurement(String name, long value) {
            this.name = name;
            this.count = 1;
            this.min = this.max = this.sum = value;
        }

        Measurement add(Measurement m) {
            this.min = Math.min(min, m.min);
            this.max = Math.max(max, m.max);
            this.sum += m.sum;
            this.count += m.count;
            return this;
        }

        String getName() {
            return name;
        }

        String format() {
            var smin = BigDecimal.valueOf(min)
                    .divide(BigDecimal.TEN, 1, RoundingMode.UNNECESSARY)
                    .toPlainString();

            var smax = BigDecimal.valueOf(max)
                    .divide(BigDecimal.TEN, 1, RoundingMode.UNNECESSARY)
                    .toPlainString();

            var savg = BigDecimal.valueOf(sum)
                    .divide(BigDecimal.valueOf(count * 10), 1, RoundingMode.CEILING)
                    .toPlainString();

            return String.format("%s=%s/%s/%s", name, smin, savg, smax);
        }
    }

    public static void main(String[] args) throws Exception {
        var input = "./measurements.txt";
        if (args.length == 1) {
            input = args[0];
        }

        try (Stream<String> lines = Files.lines(Paths.get(input))) {
            var result = lines.map(Measurement::parse)
                    .collect(groupingBy(Measurement::getName, TreeMap::new,
                            collectingAndThen(reducing(Measurement::add), Optional::get)));

            var output = result.values().stream()
                    .map(Measurement::format)
                    .collect(joining(", ", "{", "}"));

            System.out.println(output);
        }
    }
}

Owner Author


You are right, the calculation isn't correct, and it's certainly not what I would recommend doing in any real-world application.

But does it matter in any practical sense for the challenge at hand? Specifically, can there be any 1B row dataset with values of one fractional digit where the accumulated error would be so significant that the result with one fractional digit would differ from the result of a correct implementation?
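
One way to probe this question empirically is to measure how far a double sum drifts from the exact sum for a given value and row count. The sketch below is only an illustration of such a check, not a proof about the worst case; `DriftSketch`, `sumDrift` and `meanDrift` are hypothetical names, and the input is assumed to have exactly one fractional digit.

```java
import java.math.BigDecimal;
import java.math.MathContext;

public class DriftSketch {

    // Exact difference between the double-accumulated sum and the true sum
    // when the same value is added n times.
    static BigDecimal sumDrift(String text, int n) {
        double value = Double.parseDouble(text);
        long tenths = Long.parseLong(text.replace(".", "")); // value scaled by 10
        double doubleSum = 0.0;
        long exactTenths = 0;
        for (int i = 0; i < n; i++) {
            doubleSum += value;
            exactTenths += tenths;
        }
        // BigDecimal.valueOf(unscaled, 1) is exactTenths / 10 without rounding.
        return new BigDecimal(doubleSum).subtract(BigDecimal.valueOf(exactTenths, 1));
    }

    // Drift of the mean: the sum drift spread over n rows.
    static BigDecimal meanDrift(String text, int n) {
        return sumDrift(text, n).divide(BigDecimal.valueOf(n), MathContext.DECIMAL64);
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.println("sum drift:  " + sumDrift("99.9", n));
        System.out.println("mean drift: " + meanDrift("99.9", n));
    }
}
```

A mean drift far below 0.05 still changes the printed result whenever the exact mean sits on or very near a rounding boundary, so a check like this is most informative on datasets constructed to hit such boundaries.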

});

Map<String, ResultRow> measurements = new TreeMap<>(Files.lines(Paths.get(FILE))
@@ -0,0 +1,100 @@
/*
* Copyright 2023 The original authors
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package dev.morling.onebrc;

import static java.util.stream.Collectors.*;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collector;

/**
* This is the original version of the baseline implementation. It contains a
* rounding bug, which can cause calculated mean values to be off by 0.1. See
* {@link CalculateAverage_baseline} for the correct behavior. This version here
* is only kept for reference, in particular for determining whether an
* implementation is valid with the old behavior. Any new or updated entries to
* the challenge must conform to the correct behavior as implemented by
* {@code CalculateAverage_baseline}.
*/
public class CalculateAverage_baseline_original_rounding {

private static final String FILE = "./measurements.txt";

private static record Measurement(String station, double value) {
private Measurement(String[] parts) {
this(parts[0], Double.parseDouble(parts[1]));
}
}

private static record ResultRow(double min, double mean, double max) {
public String toString() {
return round(min) + "/" + round(mean) + "/" + round(max);
}

private double round(double value) {
return Math.round(value * 10.0) / 10.0;
}
};

private static class MeasurementAggregator {
private double min = Double.POSITIVE_INFINITY;
private double max = Double.NEGATIVE_INFINITY;
private double sum;
private long count;
}

public static void main(String[] args) throws IOException {
// Map<String, Double> measurements1 = Files.lines(Paths.get(FILE))
// .map(l -> l.split(";"))
// .collect(groupingBy(m -> m[0], averagingDouble(m -> Double.parseDouble(m[1]))));
//
// measurements1 = new TreeMap<>(measurements1.entrySet()
// .stream()
// .collect(toMap(e -> e.getKey(), e -> Math.round(e.getValue() * 10.0) / 10.0)));
// System.out.println(measurements1);

Collector<Measurement, MeasurementAggregator, ResultRow> collector = Collector.of(
MeasurementAggregator::new,
(a, m) -> {
a.min = Math.min(a.min, m.value);
a.max = Math.max(a.max, m.value);
a.sum += m.value;
a.count++;
},
(agg1, agg2) -> {
var res = new MeasurementAggregator();
res.min = Math.min(agg1.min, agg2.min);
res.max = Math.max(agg1.max, agg2.max);
res.sum = agg1.sum + agg2.sum;
res.count = agg1.count + agg2.count;

return res;
},
agg -> {
return new ResultRow(agg.min, agg.sum / agg.count, agg.max);
});

Map<String, ResultRow> measurements = new TreeMap<>(Files.lines(Paths.get(FILE))
.map(l -> new Measurement(l.split(";")))
.collect(groupingBy(m -> m.station(), collector)));

System.out.println(measurements);
}
}
1 change: 1 addition & 0 deletions src/test/resources/samples/measurements-rounding.out
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
{ham=14.6/25.5/33.6, jel=-9.0/18.0/46.5}