Streaming histogram #2590

Merged
merged 4 commits into locationtech:master from jamesmcclain:streaming-histogram on Mar 29, 2018

2 participants
@jamesmcclain
Member

jamesmcclain commented Mar 25, 2018

Overview

Fixes StreamingHistogram.itemCount. As a side effect, this changes the behavior of StreamingHistogram.binCounts so that bin counts are no longer always zero.

Checklist

  • docs/CHANGELOG.rst updated, if necessary
  • docs guides updated, if necessary
  • New user API has useful Scaladoc strings
  • Unit tests added for bug-fix or new feature

Demo

Previously, binCounts on a streaming histogram returned bins whose counts were all zero. The bin labels were generated by the values function, which produced numbers that did not match the internal bins of the streaming histogram. The underlying cause was that itemCount was incorrect.

New behavior:

scala> val tile = DoubleArrayTile(Array[Double](52, 54, 61, 32, 52, 50, 11, 21, 18), 3, 3)
tile: geotrellis.raster.DoubleArrayTile = DoubleConstantNoDataArrayTile([D@40ed78d,3,3)

scala> val result = tile.histogramDouble(3)
result: geotrellis.raster.histogram.Histogram[Double] = geotrellis.raster.histogram.StreamingHistogram@2f13b4ba

scala> result.binCounts.foreach(println(_))
(16.666666666666668,3)
(32.0,1)
(53.8,5)

scala> println(result.median().get)
34.18

Notes

As stated above, the previous behavior was to use bin labels generated by values, which did not (and still do not) line up with the internal bins used by the streaming histogram. When a count is requested for a value that is not a bucket label, zero is returned, because by construction and intent the streaming histogram does not have access to that information.
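To illustrate why counts come back as zero for values that are not bucket labels, here is a minimal sketch. It hard-codes the three buckets from the demo output; `LabelLookupSketch`, `countAt`, and the probe values are hypothetical names chosen for illustration and are not part of the GeoTrellis API.

```scala
object LabelLookupSketch {
  // Internal buckets (label, count), copied from the demo output in this PR.
  val buckets: Seq[(Double, Long)] =
    Seq((16.666666666666668, 3L), (32.0, 1L), (53.8, 5L))

  // Hypothetical exact-label lookup: the count is known only when `item`
  // is exactly one of the bucket labels; anything else yields zero.
  def countAt(item: Double): Long =
    buckets.collectFirst { case (label, count) if label == item => count }.getOrElse(0L)

  def main(args: Array[String]): Unit = {
    // Evenly spaced probes across the data range, chosen only for illustration.
    // None of them is a bucket label, so each lookup returns 0.
    Seq(11.0, 36.0, 61.0).foreach(p => println(s"$p -> ${countAt(p)}"))
    // Looking up the bucket labels themselves returns the stored counts.
    buckets.foreach { case (label, _) => println(s"$label -> ${countAt(label)}") }
  }
}
```

Probing with labels that the histogram does not hold internally therefore produces all-zero counts, which is the symptom described above.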

The new behavior is simply to return the internal buckets used by the streaming histogram. Note that the interpretation of these bin counts is therefore somewhat different from that of other histogram types.
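For readers wondering where internal buckets such as (53.8,5) come from, the following is a simplified sketch of a bounded-size streaming histogram: each value becomes a bucket, and when the limit is exceeded the two adjacent buckets with the closest labels are merged into their weighted mean. This is an assumption about the general technique, not a copy of the GeoTrellis implementation; `StreamingBucketsSketch`, `add`, and `build` are hypothetical names.

```scala
object StreamingBucketsSketch {
  // A bucket is (label, count); the label is the mean of the values it absorbed.
  type Bucket = (Double, Long)

  // Insert one value into the label-sorted buckets, then merge the two closest
  // adjacent buckets if the bucket limit is exceeded.
  def add(buckets: Vector[Bucket], value: Double, maxBuckets: Int): Vector[Bucket] = {
    require(maxBuckets >= 1, "maxBuckets must be at least 1")
    val (lo, hi) = buckets.span(_._1 < value)
    val inserted = hi.headOption match {
      // The value is already a bucket label: just bump its count.
      case Some((label, count)) if label == value => lo ++ ((label, count + 1) +: hi.tail)
      // Otherwise insert a new single-item bucket, keeping the labels sorted.
      case _ => lo ++ ((value, 1L) +: hi)
    }
    if (inserted.length <= maxBuckets) inserted
    else {
      def gap(j: Int): Double = inserted(j + 1)._1 - inserted(j)._1
      // Index i of the adjacent pair (i, i + 1) with the smallest label gap.
      val i = inserted.indices.init.reduceLeft((a, b) => if (gap(b) < gap(a)) b else a)
      val (l1, c1) = inserted(i)
      val (l2, c2) = inserted(i + 1)
      // The merged label is the count-weighted mean of the two labels.
      val merged: Bucket = ((l1 * c1 + l2 * c2) / (c1 + c2), c1 + c2)
      inserted.patch(i, Vector(merged), 2)
    }
  }

  def build(values: Seq[Double], maxBuckets: Int): Vector[Bucket] =
    values.foldLeft(Vector.empty[Bucket])((bs, v) => add(bs, v, maxBuckets))

  def main(args: Array[String]): Unit = {
    val data = Seq(52.0, 54.0, 61.0, 32.0, 52.0, 50.0, 11.0, 21.0, 18.0)
    // With the demo data and 3 buckets, the counts come out as 3, 1, 5.
    build(data, 3).foreach(println)
  }
}
```

Under these assumptions, the demo data with three buckets yields labels of roughly 16.67, 32.0, and 53.8, so a bin label is the mean of the values merged into it rather than the edge of a fixed-width interval.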

Note that the median value of 34.18 above is "correct" (expected) for an approximation using three buckets. Because not all of the input data are available, the median has to be approximated. If the approximate histogram is viewed as a curve, the median is the value at which half of the area under the curve lies to the left and half to the right.
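As a worked check on the 34.18 figure, the sketch below interpolates linearly between bucket labels along the cumulative fraction of counts. That this matches the library's exact procedure is an assumption; `MedianSketch` and `quantile` are hypothetical names, and the buckets are assumed to be sorted by label.

```scala
object MedianSketch {
  // Approximate the q-quantile by linear interpolation between adjacent bucket
  // labels, using the cumulative fraction of counts reached at each label.
  def quantile(buckets: Seq[(Double, Long)], q: Double): Option[Double] =
    if (buckets.isEmpty) None
    else {
      val total = buckets.map(_._2).sum.toDouble
      // Pairs of (label, cumulative fraction of items up to and including this bucket).
      val cdf = buckets.map(_._1).zip(buckets.map(_._2).scanLeft(0L)(_ + _).tail.map(_ / total))
      cdf.zip(cdf.tail).collectFirst {
        case ((l1, p1), (l2, p2)) if q >= p1 && q <= p2 && p2 > p1 =>
          val x = (q - p1) / (p2 - p1) // position of q within this interval
          (1 - x) * l1 + x * l2
      }.orElse {
        // Simplification: outside the interpolated range, clamp to an end label.
        Some(if (q < cdf.head._2) cdf.head._1 else cdf.last._1)
      }
    }

  def main(args: Array[String]): Unit = {
    val buckets = Seq((16.666666666666668, 3L), (32.0, 1L), (53.8, 5L))
    // Cumulative fractions are 3/9, 4/9, and 9/9, so q = 0.5 falls between
    // labels 32.0 and 53.8 at x = 0.1, giving 0.9 * 32.0 + 0.1 * 53.8.
    println(quantile(buckets, 0.5))
  }
}
```

With the demo buckets this evaluates to approximately 34.18, in line with the value printed by result.median() above.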

Closes #2274

jamesmcclain added some commits Mar 25, 2018

@jamesmcclain jamesmcclain changed the title from Streaming histogram to [WiP] Streaming histogram Mar 26, 2018

@jamesmcclain jamesmcclain changed the title from [WiP] Streaming histogram to Streaming histogram Mar 26, 2018

@echeipesh echeipesh merged commit 133b045 into locationtech:master Mar 29, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@jamesmcclain jamesmcclain deleted the jamesmcclain:streaming-histogram branch May 22, 2018
