
Correct stat uncertainty calculation #24

Closed
matthewfeickert opened this issue Nov 20, 2020 · 15 comments · Fixed by #54

@matthewfeickert (Owner) commented Nov 20, 2020

The stat uncertainty is currently being given as the Poisson uncertainty on the histogram

stat_uncert = np.sqrt(model_hist)

but for a stacked histogram this is incorrect, as it should be the sum in quadrature of the uncertainties on each of the histograms in the stack.

(Thanks to @msneubauer for the reminder on this)
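
A minimal sketch of the intended calculation, with made-up bin counts for illustration (this is not the package's actual implementation):

import numpy as np

# For a stack of histograms, the per-bin stat uncertainty is the sum in
# quadrature of the per-histogram Poisson uncertainties.
stack = [np.array([4.0, 9.0, 1.0]), np.array([16.0, 0.0, 4.0])]
uncerts = [np.sqrt(counts) for counts in stack]
stat_uncert = np.sqrt(np.sum(np.square(uncerts), axis=0))
# For unweighted counts this reduces to np.sqrt(sum(stack))
print(stat_uncert)  # [4.47213595 3.         2.23606798]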

@matthewfeickert matthewfeickert added the bug Something isn't working label Nov 20, 2020
@matthewfeickert (Owner, Author)

As @alexander-held has very helpfully pointed out on Gitter, I want to be able to use storage.Weight():

This storage keeps a sum of weights as well (in CERN ROOT, this is like calling .Sumw2() before filling a histogram). It uses the WeightedSum accumulator.

This can be seen working below:

import numpy as np
import hist
from hist import Hist
import uproot4 as uproot

root_file = uproot.open("example.root")
values, edges = root_file["data"].to_numpy()
stat_uncert = np.sqrt(values)

_hist = Hist(
    hist.axis.Regular(len(edges) - 1, edges[0], edges[-1], name=None),
    storage=hist.storage.Weight(),
)
# Set the bin values and variances directly on the Weight-storage view
_hist[...] = np.stack([values, np.square(stat_uncert)], axis=-1)
print(f"yields: {_hist.view().value}")
print(f"\nstat_uncert: {np.sqrt(_hist.view().variance)}")
yields: [  0.   0.   0.   0.   1.  16.  95. 272. 459. 591. 674. 658. 623. 531.
 456. 389. 321. 254. 194. 161. 120.  95.  74.  60.  53.  38.  27.  19.
  20.  15.  11.  10.  14.   5.   6.   3.   7.   2.   6.   1.   3.   2.
   3.   0.   0.   1.   0.   0.   0.   0.]

stat_uncert: [ 0.          0.          0.          0.          1.          4.
  9.74679434 16.4924225  21.42428529 24.31049156 25.96150997 25.65151068
 24.95996795 23.04343724 21.3541565  19.72308292 17.91647287 15.93737745
 13.92838828 12.68857754 10.95445115  9.74679434  8.60232527  7.74596669
  7.28010989  6.164414    5.19615242  4.35889894  4.47213595  3.87298335
  3.31662479  3.16227766  3.74165739  2.23606798  2.44948974  1.73205081
  2.64575131  1.41421356  2.44948974  1.          1.73205081  1.41421356
  1.73205081  0.          0.          1.          0.          0.
  0.          0.        ]

@henryiii has confirmed this and has mentioned that there is work underway to support this in visualization in the near future. 🚀
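
For completeness, a minimal sketch (assuming hist's standard fill API, not code from this thread) of how the Weight storage accumulates both the sum of weights and the sum of squared weights on fill, i.e. the Sumw2-style bookkeeping described above:

import hist
from hist import Hist

# Filling a Weight-storage histogram tracks the sum of weights (yield)
# and the sum of squared weights (variance) per bin.
h = Hist(hist.axis.Regular(4, 0, 4, name="x"), storage=hist.storage.Weight())
h.fill(x=[0.5, 0.5, 2.5], weight=[2.0, 3.0, 1.5])
print(h.view().value)     # sum of weights per bin: [5.0, 0.0, 1.5, 0.0]
print(h.view().variance)  # sum of squared weights per bin: [13.0, 0.0, 2.25, 0.0]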

@matthewfeickert (Owner, Author)

@alexander-held also sent a nice minimal working example of stat uncertainty propagation being done properly by default. 👍

import boost_histogram as bh
import numpy as np

bins = [1, 2, 3]
yields = np.asarray([3, 4])
stdev = np.asarray([1, 2])
h1 = bh.Histogram(
    bh.axis.Variable(bins, underflow=False, overflow=False),
    storage=bh.storage.Weight(),
)
h1[...] = np.stack([yields, stdev ** 2], axis=-1)
yields = np.asarray([5, 5])
stdev = np.asarray([1, 1])
h2 = bh.Histogram(
    bh.axis.Variable(bins, underflow=False, overflow=False),
    storage=bh.storage.Weight(),
)
h2[...] = np.stack([yields, stdev ** 2], axis=-1)
for hist in [h1, h2, h1 + h2]:
    print(f"yields: {hist.view().value}, stdev: {np.sqrt(hist.view().variance)}")
yields: [3. 4.], stdev: [1. 2.]
yields: [5. 5.], stdev: [1. 1.]
yields: [8. 9.], stdev: [1.41421356 2.23606798]
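
(As expected, the stdev of the summed histogram is the quadrature sum of the inputs: sqrt(1^2 + 1^2) ≈ 1.414 and sqrt(2^2 + 1^2) ≈ 2.236.)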

@matthewfeickert (Owner, Author) commented Mar 17, 2021

This still seems to be an issue, as

$ cat requirements.txt 
hist==2.2.0
uproot==4.0.6
uproot3~=3.14.4
$ pip install -r requirements.txt
$ curl -sL https://raw.githubusercontent.com/matthewfeickert/heputils/master/tests/example_files.py | python
$ file example.root
example.root: ROOT file Version 1061800 (Compression: 256)

and

# example.py
import hist
import numpy as np
import uproot
from hist import Hist


def convert_root_to_hist(in_hist):
    values, edges = in_hist.to_numpy()
    _hist = Hist(
        hist.axis.Regular(len(edges) - 1, edges[0], edges[-1], name=None),
        storage=hist.storage.Weight(),
    )
    # Poisson variance equals values
    variances = values
    _hist[...] = np.stack([values, variances], axis=-1)
    return _hist


if __name__ == "__main__":
    root_file = uproot.open("example.root")
    data_hist = root_file["data"].to_hist()

    print(f"data hist values: {data_hist.values()}")
    print(f"Poisson uncertainty on values: {np.sqrt(data_hist.values())}")
    print("\n# Unexpected results from .variances()...\n")
    print(f"data hist variances: {data_hist.variances()}")
    print(
        f"sqrt of data hist variances: {np.sqrt(data_hist.variances())}"
    )  # This is just the count values again, strangely

    print("\n# Building weighted from hand works...\n")
    _hist = convert_root_to_hist(root_file["data"])
    print(f"yields: {_hist.values()}")
    print(f"\nstat_uncert: {np.sqrt(_hist.variances())}")

gives

$ python example.py
data hist values: [  0.   0.   0.   0.   1.  16.  95. 272. 459. 591. 674. 658. 623. 531.
 456. 389. 321. 254. 194. 161. 120.  95.  74.  60.  53.  38.  27.  19.
  20.  15.  11.  10.  14.   5.   6.   3.   7.   2.   6.   1.   3.   2.
   3.   0.   0.   1.   0.   0.   0.   0.]
Poisson uncertainty on values: [ 0.          0.          0.          0.          1.          4.
  9.74679434 16.4924225  21.42428529 24.31049156 25.96150997 25.65151068
 24.95996795 23.04343724 21.3541565  19.72308292 17.91647287 15.93737745
 13.92838828 12.68857754 10.95445115  9.74679434  8.60232527  7.74596669
  7.28010989  6.164414    5.19615242  4.35889894  4.47213595  3.87298335
  3.31662479  3.16227766  3.74165739  2.23606798  2.44948974  1.73205081
  2.64575131  1.41421356  2.44948974  1.          1.73205081  1.41421356
  1.73205081  0.          0.          1.          0.          0.
  0.          0.        ]

# Unexpected results from .variances()...

data hist variances: [0.00000e+00 0.00000e+00 0.00000e+00 0.00000e+00 1.00000e+00 2.56000e+02
 9.02500e+03 7.39840e+04 2.10681e+05 3.49281e+05 4.54276e+05 4.32964e+05
 3.88129e+05 2.81961e+05 2.07936e+05 1.51321e+05 1.03041e+05 6.45160e+04
 3.76360e+04 2.59210e+04 1.44000e+04 9.02500e+03 5.47600e+03 3.60000e+03
 2.80900e+03 1.44400e+03 7.29000e+02 3.61000e+02 4.00000e+02 2.25000e+02
 1.21000e+02 1.00000e+02 1.96000e+02 2.50000e+01 3.60000e+01 9.00000e+00
 4.90000e+01 4.00000e+00 3.60000e+01 1.00000e+00 9.00000e+00 4.00000e+00
 9.00000e+00 0.00000e+00 0.00000e+00 1.00000e+00 0.00000e+00 0.00000e+00
 0.00000e+00 0.00000e+00]
sqrt of data hist variances: [  0.   0.   0.   0.   1.  16.  95. 272. 459. 591. 674. 658. 623. 531.
 456. 389. 321. 254. 194. 161. 120.  95.  74.  60.  53.  38.  27.  19.
  20.  15.  11.  10.  14.   5.   6.   3.   7.   2.   6.   1.   3.   2.
   3.   0.   0.   1.   0.   0.   0.   0.]

# Building weighted from hand works...

yields: [  0.   0.   0.   0.   1.  16.  95. 272. 459. 591. 674. 658. 623. 531.
 456. 389. 321. 254. 194. 161. 120.  95.  74.  60.  53.  38.  27.  19.
  20.  15.  11.  10.  14.   5.   6.   3.   7.   2.   6.   1.   3.   2.
   3.   0.   0.   1.   0.   0.   0.   0.]

stat_uncert: [ 0.          0.          0.          0.          1.          4.
  9.74679434 16.4924225  21.42428529 24.31049156 25.96150997 25.65151068
 24.95996795 23.04343724 21.3541565  19.72308292 17.91647287 15.93737745
 13.92838828 12.68857754 10.95445115  9.74679434  8.60232527  7.74596669
  7.28010989  6.164414    5.19615242  4.35889894  4.47213595  3.87298335
  3.31662479  3.16227766  3.74165739  2.23606798  2.44948974  1.73205081
  2.64575131  1.41421356  2.44948974  1.          1.73205081  1.41421356
  1.73205081  0.          0.          1.          0.          0.
  0.          0.        ]
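
As a quick sanity check (a sketch, not part of the original report), the unexpected variances above are exactly the squared bin counts, e.g. 16**2 == 256 and 95**2 == 9025:

import numpy as np
import uproot

data_hist = uproot.open("example.root")["data"].to_hist()
# The reported variances match the squares of the counts, suggesting the
# counts are being squared somewhere during writing or conversion.
print(np.allclose(data_hist.variances(), np.square(data_hist.values())))  # True here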

If a Weight storage Hist is needed, that seems very reasonable, but is there any way to create this while using uproot's .to_hist() API?

@henryiii

Can you print a repr of the hist given by data_hist = root_file["data"].to_hist() (or data_hist = Hist(root_file["data"]), which is the same thing)? It should have a Weight storage; this looks buggy to me.

@henryiii

Oh, is there no variance info for the uproot hist? So it's a Double storage but the automatic variances are not working?
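
For reference, a rough sketch of the expected behavior (an assumption based on hist's documented API, not output from this issue): with a plain Double storage and an unweighted fill, .variances() falls back to the bin counts, while a Weight storage tracks the sum of squared weights explicitly.

import hist
from hist import Hist

# Double storage, unweighted fill: variances fall back to the counts.
h_double = Hist(hist.axis.Regular(2, 0, 2, name="x"))
h_double.fill(x=[0.5, 0.5, 1.5])
print(h_double.values(), h_double.variances())  # [2. 1.] [2. 1.]

# Weight storage, weighted fill: variances are the sum of squared weights.
h_weight = Hist(hist.axis.Regular(2, 0, 2, name="x"), storage=hist.storage.Weight())
h_weight.fill(x=[0.5, 0.5, 1.5], weight=[0.5, 0.5, 2.0])
print(h_weight.values(), h_weight.variances())  # [1. 2.] [0.5 4. ]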

@matthewfeickert (Owner, Author) commented Mar 17, 2021

Can you print a repr of the hist given by data_hist = root_file["data"].to_hist() (or data_hist = Hist(root_file["data"]), which is the same thing)? It should have a Weight storage, looks buggy to me.

>>> from hist import Hist
>>> import uproot
>>> root_file = uproot.open("example.root")
>>> data_hist = root_file["data"].to_hist()
>>> print(data_hist)
  regular(50, 0, 1000, metadata={'name': 'xaxis', 'title': ''}, options=underflow | overflow)
  (-1): 0   ( 0): 0   ( 1): 0   ( 2): 0   ( 3): 0   ( 4): 1  
  ( 5): 16  ( 6): 95  ( 7): 272 ( 8): 459 ( 9): 591 (10): 674
  (11): 658 (12): 623 (13): 531 (14): 456 (15): 389 (16): 321
  (17): 254 (18): 194 (19): 161 (20): 120 (21): 95  (22): 74 
  (23): 60  (24): 53  (25): 38  (26): 27  (27): 19  (28): 20 
  (29): 15  (30): 11  (31): 10  (32): 14  (33): 5   (34): 6  
  (35): 3   (36): 7   (37): 2   (38): 6   (39): 1   (40): 3  
  (41): 2   (42): 3   (43): 0   (44): 0   (45): 1   (46): 0  
  (47): 0   (48): 0   (49): 0   (50): 0  

>>> hist_2 = Hist(root_file["data"])
>>> print(hist_2)
  regular(50, 0, 1000, metadata={'name': 'xaxis', 'title': ''}, options=underflow | overflow)
  (-1): 0   ( 0): 0   ( 1): 0   ( 2): 0   ( 3): 0   ( 4): 1  
  ( 5): 16  ( 6): 95  ( 7): 272 ( 8): 459 ( 9): 591 (10): 674
  (11): 658 (12): 623 (13): 531 (14): 456 (15): 389 (16): 321
  (17): 254 (18): 194 (19): 161 (20): 120 (21): 95  (22): 74 
  (23): 60  (24): 53  (25): 38  (26): 27  (27): 19  (28): 20 
  (29): 15  (30): 11  (31): 10  (32): 14  (33): 5   (34): 6  
  (35): 3   (36): 7   (37): 2   (38): 6   (39): 1   (40): 3  
  (41): 2   (42): 3   (43): 0   (44): 0   (45): 1   (46): 0  
  (47): 0   (48): 0   (49): 0   (50): 0  

>>> 

Oh, is there no variance info for the uproot hist? So it's a Double storage but the automatic variances are not working?

@henryiii Correct — the script writing the ROOT file only passes counts and bin edges:

def main():
    np.random.seed(0)
    data_hist = make_data_hist(hists)
    hists["data"] = {"counts": data_hist.tolist(), "bins": _bins}
    with open("example.json", "w") as serialization:
        json.dump(hists, serialization)
    with uproot3.recreate("example.root", compression=uproot3.ZLIB(4)) as outfile:
        for key in hists.keys():
            outfile[key] = (
                np.array(hists[key]["counts"]),
                np.array(hists[key]["bins"]),
            )

but maybe that's my failing in how I'm writing the ROOT file with uproot3. I'll look to see if I can add the variances through the uproot3 writing.

@henryiii commented Mar 17, 2021

Could you print a repr? I can't actually get the class name (Hist or Histogram) or the storage from the str, which @HDembinski forwarded from Boost.Histogram. It also shows things that are not boost-histogram syntax, like options=underflow | overflow, and shows the full metadata, even if it's huge. I thought it was supposed to provide a preview of the bins as a histogram, which it does not seem to be doing above.

@matthewfeickert (Owner, Author) commented Mar 17, 2021

Oh, whoops. I was blanking out and forgot that I could just call repr directly rather than relying on print to get it for me. (I would have thought that the histoprint view that normally shows up from print would have kicked in, but no, I had forgotten that is handled by .show().)

>>> from hist import Hist
>>> import uproot
>>> root_file = uproot.open("example.root")
>>> data_hist = root_file["data"].to_hist()
>>> repr(data_hist)
"Hist(Regular(50, 0, 1000, name='xaxis', label='xaxis'), storage=Weight()) # Sum: WeightedSum(value=6290, variance=2.83e+06)"
>>> hist_2 = Hist(root_file["data"])
>>> repr(hist_2)
"Hist(Regular(50, 0, 1000, name='xaxis', label='xaxis'), storage=Weight()) # Sum: WeightedSum(value=6290, variance=2.83e+06)"
>>> sanity_check = (
...   Hist.new
...   .Reg(10, 0, 1, name="x", label="x-axis")
...   .Var(range(10), name="y", label="y-axis")
...   .Int64()
... )
>>> print(sanity_check)
Hist(
  Regular(10, 0, 1, name='x', label='x-axis'),
  Variable([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
  storage=Int64())
>>> repr(sanity_check)
"Hist(\n  Regular(10, 0, 1, name='x', label='x-axis'),\n  Variable([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),\n  storage=Int64())"

@henryiii

Weird, why is print(sanity_check) showing the repr but print(data_hist) showing something else?

@henryiii

This is a weighted histogram, though, so I think that means something is getting squared when this is being stored. Uproot's variances are fine?

@henryiii

This should be directly converted from fSumw2 from uproot. I need to refactor this code to use the new (0.12+) direct view setting feature.
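
A minimal sketch of what direct view setting could look like (assuming boost-histogram >= 0.12; this is not the refactored uproot code):

import boost_histogram as bh
import numpy as np

# The structured view of a Weight storage exposes "value" and "variance"
# fields that can be assigned in place.
h = bh.Histogram(bh.axis.Regular(3, 0, 3), storage=bh.storage.Weight())
view = h.view()
view["value"] = np.array([1.0, 4.0, 9.0])      # bin contents
view["variance"] = np.array([1.0, 4.0, 9.0])   # e.g. fSumw2 from the ROOT file
print(h.view().value, h.view().variance)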

@matthewfeickert (Owner, Author)

Given that the variances coming directly from uproot are not right

>>> import uproot
>>> root_file = uproot.open("example.root")
>>> root_file["data"].fSumw2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Model_TH1I_v3' object has no attribute 'fSumw2'
>>> root_file["data"].weighted
True
>>> root_file["data"].variances()
array([0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
       2.56000e+02, 9.02500e+03, 7.39840e+04, 2.10681e+05, 3.49281e+05,
       4.54276e+05, 4.32964e+05, 3.88129e+05, 2.81961e+05, 2.07936e+05,
       1.51321e+05, 1.03041e+05, 6.45160e+04, 3.76360e+04, 2.59210e+04,
       1.44000e+04, 9.02500e+03, 5.47600e+03, 3.60000e+03, 2.80900e+03,
       1.44400e+03, 7.29000e+02, 3.61000e+02, 4.00000e+02, 2.25000e+02,
       1.21000e+02, 1.00000e+02, 1.96000e+02, 2.50000e+01, 3.60000e+01,
       9.00000e+00, 4.90000e+01, 4.00000e+00, 3.60000e+01, 1.00000e+00,
       9.00000e+00, 4.00000e+00, 9.00000e+00, 0.00000e+00, 0.00000e+00,
       1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00])

it seems that something is happening with uproot3 on writing. I'll follow up with Jim on Gitter and then here or in the uproot3 Issues if that's reasonable. I think it is safe to say that boost-histogram and hist aren't causing the issues here.

@matthewfeickert (Owner, Author)

Discussion on the variances is moving to scikit-hep/uproot3-methods#97

@matthewfeickert matthewfeickert self-assigned this Mar 18, 2021
@matthewfeickert (Owner, Author)

This is solved in scikit-hep/uproot3-methods#98 and with the release of uproot3-methods v0.10.1.

@matthewfeickert (Owner, Author) commented Mar 18, 2021

In closing, this sort of thing now works, inside of heputils as well, as what follows is just an example of what hist already does:

# example.py
import hist
import numpy as np
from hist import Hist

if __name__ == "__main__":
    bins = [1, 2, 3]
    yields = np.asarray([3, 4])
    stdev = np.asarray([1, 2])

    h1 = Hist(
        hist.axis.Regular(len(bins) - 1, bins[0], bins[-1]),
        storage=hist.storage.Weight(),
    )
    h1[...] = np.stack([yields, stdev ** 2], axis=-1)

    yields = np.asarray([5, 5])
    stdev = np.asarray([1, 1])
    h2 = Hist(
        hist.axis.Regular(len(bins) - 1, bins[0], bins[-1]),
        storage=hist.storage.Weight(),
    )
    h2[...] = np.stack([yields, stdev ** 2], axis=-1)

    # Avoid shadowing the imported hist module with the loop variable
    for histogram in [h1, h2, h1 + h2]:
        print(f"yields: {histogram.values()}, stdev: {np.sqrt(histogram.variances())}")

giving

$ python example.py
yields: [3. 4.], stdev: [1. 2.]
yields: [5. 5.], stdev: [1. 1.]
yields: [8. 9.], stdev: [1.41421356 2.23606798]
