
Correct stat uncertainty calculation #24

Closed
matthewfeickert opened this issue Nov 20, 2020 · 15 comments · Fixed by #54

@matthewfeickert (Owner) commented Nov 20, 2020

The stat uncertainty is currently being given as the Poisson uncertainty on the histogram

stat_uncert = np.sqrt(model_hist)

but for a stacked histogram this is incorrect, as it should be the sum in quadrature of the uncertainties on each of the histograms in the stack.

(Thanks to @msneubauer for the reminder on this)
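
A minimal sketch of the intended calculation, with made-up bin counts for illustration (this is not the package's actual implementation):

import numpy as np

# For a stack of histograms, the per-bin stat uncertainty is the sum in
# quadrature of the per-histogram Poisson uncertainties.
stack = [np.array([4.0, 9.0, 1.0]), np.array([16.0, 0.0, 4.0])]
uncerts = [np.sqrt(counts) for counts in stack]
stat_uncert = np.sqrt(np.sum(np.square(uncerts), axis=0))
# For unweighted counts this reduces to np.sqrt(sum(stack))
print(stat_uncert)  # [4.47213595 3.         2.23606798]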

@matthewfeickert matthewfeickert added the bug Something isn't working label Nov 20, 2020
@matthewfeickert (Owner, Author)

As @alexander-held has very helpfully pointed out on Gitter, I want to be able to use storage.Weight():

This storage keeps a sum of weights as well (in CERN ROOT, this is like calling .Sumw2() before filling a histogram). It uses the WeightedSum accumulator.

This can be seen working below:

import numpy as np
import hist
from hist import Hist
import uproot4 as uproot

root_file = uproot.open("example.root")
values, edges = root_file["data"].to_numpy()
stat_uncert = np.sqrt(values)

_hist = Hist(
    hist.axis.Regular(len(edges) - 1, edges[0], edges[-1], name=None),
    storage=hist.storage.Weight(),
)
# Set the bin values and variances directly on the Weight-storage view
_hist[...] = np.stack([values, np.square(stat_uncert)], axis=-1)
print(f"yields: {_hist.view().value}")
print(f"\nstat_uncert: {np.sqrt(_hist.view().variance)}")
yields: [  0.   0.   0.   0.   1.  16.  95. 272. 459. 591. 674. 658. 623. 531.
 456. 389. 321. 254. 194. 161. 120.  95.  74.  60.  53.  38.  27.  19.
  20.  15.  11.  10.  14.   5.   6.   3.   7.   2.   6.   1.   3.   2.
   3.   0.   0.   1.   0.   0.   0.   0.]

stat_uncert: [ 0.          0.          0.          0.          1.          4.
  9.74679434 16.4924225  21.42428529 24.31049156 25.96150997 25.65151068
 24.95996795 23.04343724 21.3541565  19.72308292 17.91647287 15.93737745
 13.92838828 12.68857754 10.95445115  9.74679434  8.60232527  7.74596669
  7.28010989  6.164414    5.19615242  4.35889894  4.47213595  3.87298335
  3.31662479  3.16227766  3.74165739  2.23606798  2.44948974  1.73205081
  2.64575131  1.41421356  2.44948974  1.          1.73205081  1.41421356
  1.73205081  0.          0.          1.          0.          0.
  0.          0.        ]

@henryiii has confirmed this and has mentioned that there is work underway to support this in visualization in the near future. 🚀
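
For completeness, a minimal sketch (assuming hist's standard fill API, not code from this thread) of how the Weight storage accumulates both the sum of weights and the sum of squared weights on fill, i.e. the Sumw2-style bookkeeping described above:

import hist
from hist import Hist

# Filling a Weight-storage histogram tracks the sum of weights (yield)
# and the sum of squared weights (variance) per bin.
h = Hist(hist.axis.Regular(4, 0, 4, name="x"), storage=hist.storage.Weight())
h.fill(x=[0.5, 0.5, 2.5], weight=[2.0, 3.0, 1.5])
print(h.view().value)     # sum of weights per bin: [5.0, 0.0, 1.5, 0.0]
print(h.view().variance)  # sum of squared weights per bin: [13.0, 0.0, 2.25, 0.0]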

@matthewfeickert (Owner, Author)

@alexander-held also sent a nice minimal working example of stat uncertainty propagation being done properly by default. 👍

import boost_histogram as bh
import numpy as np

bins = [1, 2, 3]
yields = np.asarray([3, 4])
stdev = np.asarray([1, 2])
h1 = bh.Histogram(
    bh.axis.Variable(bins, underflow=False, overflow=False),
    storage=bh.storage.Weight(),
)
h1[...] = np.stack([yields, stdev ** 2], axis=-1)
yields = np.asarray([5, 5])
stdev = np.asarray([1, 1])
h2 = bh.Histogram(
    bh.axis.Variable(bins, underflow=False, overflow=False),
    storage=bh.storage.Weight(),
)
h2[...] = np.stack([yields, stdev ** 2], axis=-1)
for hist in [h1, h2, h1 + h2]:
    print(f"yields: {hist.view().value}, stdev: {np.sqrt(hist.view().variance)}")
yields: [3. 4.], stdev: [1. 2.]
yields: [5. 5.], stdev: [1. 1.]
yields: [8. 9.], stdev: [1.41421356 2.23606798]
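
(As expected, the stdev of the summed histogram is the quadrature sum of the inputs: sqrt(1^2 + 1^2) ≈ 1.414 and sqrt(2^2 + 1^2) ≈ 2.236.)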

@matthewfeickert (Owner, Author) commented Mar 17, 2021

This still seems to be an issue, as

$ cat requirements.txt 
hist==2.2.0
uproot==4.0.6
uproot3~=3.14.4
$ pip install -r requirements.txt
$ curl -sL https://raw.githubusercontent.com/matthewfeickert/heputils/master/tests/example_files.py | python
$ file example.root
example.root: ROOT file Version 1061800 (Compression: 256)

and

# example.py
import hist
import numpy as np
import uproot
from hist import Hist


def convert_root_to_hist(in_hist):
    values, edges = in_hist.to_numpy()
    _hist = Hist(
        hist.axis.Regular(len(edges) - 1, edges[0], edges[-1], name=None),
        storage=hist.storage.Weight(),
    )
    # Poisson variance equals values
    variances = values
    _hist[...] = np.stack([values, variances], axis=-1)
    return _hist


if __name__ == "__main__":
    root_file = uproot.open("example.root")
    data_hist = root_file["data"].to_hist()

    print(f"data hist values: {data_hist.values()}")
    print(f"Poisson uncertainty on values: {np.sqrt(data_hist.values())}")
    print("\n# Unexpected results from .variances()...\n")
    print(f"data hist variances: {data_hist.variances()}")
    print(
        f"sqrt of data hist variances: {np.sqrt(data_hist.variances())}"
    )  # This is just the count values again, strangely

    print("\n# Building weighted from hand works...\n")
    _hist = convert_root_to_hist(root_file["data"])
    print(f"yields: {_hist.values()}")
    print(f"\nstat_uncert: {np.sqrt(_hist.variances())}")

gives

$ python example.py
data hist values: [  0.   0.   0.   0.   1.  16.  95. 272. 459. 591. 674. 658. 623. 531.
 456. 389. 321. 254. 194. 161. 120.  95.  74.  60.  53.  38.  27.  19.
  20.  15.  11.  10.  14.   5.   6.   3.   7.   2.   6.   1.   3.   2.
   3.   0.   0.   1.   0.   0.   0.   0.]
Poisson uncertainty on values: [ 0.          0.          0.          0.          1.          4.
  9.74679434 16.4924225  21.42428529 24.31049156 25.96150997 25.65151068
 24.95996795 23.04343724 21.3541565  19.72308292 17.91647287 15.93737745
 13.92838828 12.68857754 10.95445115  9.74679434  8.60232527  7.74596669
  7.28010989  6.164414    5.19615242  4.35889894  4.47213595  3.87298335
  3.31662479  3.16227766  3.74165739  2.23606798  2.44948974  1.73205081
  2.64575131  1.41421356  2.44948974  1.          1.73205081  1.41421356
  1.73205081  0.          0.          1.          0.          0.
  0.          0.        ]

# Unexpected results from .variances()...

data hist variances: [0.00000e+00 0.00000e+00 0.00000e+00 0.00000e+00 1.00000e+00 2.56000e+02
 9.02500e+03 7.39840e+04 2.10681e+05 3.49281e+05 4.54276e+05 4.32964e+05
 3.88129e+05 2.81961e+05 2.07936e+05 1.51321e+05 1.03041e+05 6.45160e+04
 3.76360e+04 2.59210e+04 1.44000e+04 9.02500e+03 5.47600e+03 3.60000e+03
 2.80900e+03 1.44400e+03 7.29000e+02 3.61000e+02 4.00000e+02 2.25000e+02
 1.21000e+02 1.00000e+02 1.96000e+02 2.50000e+01 3.60000e+01 9.00000e+00
 4.90000e+01 4.00000e+00 3.60000e+01 1.00000e+00 9.00000e+00 4.00000e+00
 9.00000e+00 0.00000e+00 0.00000e+00 1.00000e+00 0.00000e+00 0.00000e+00
 0.00000e+00 0.00000e+00]
sqrt of data hist variances: [  0.   0.   0.   0.   1.  16.  95. 272. 459. 591. 674. 658. 623. 531.
 456. 389. 321. 254. 194. 161. 120.  95.  74.  60.  53.  38.  27.  19.
  20.  15.  11.  10.  14.   5.   6.   3.   7.   2.   6.   1.   3.   2.
   3.   0.   0.   1.   0.   0.   0.   0.]

# Building weighted from hand works...

yields: [  0.   0.   0.   0.   1.  16.  95. 272. 459. 591. 674. 658. 623. 531.
 456. 389. 321. 254. 194. 161. 120.  95.  74.  60.  53.  38.  27.  19.
  20.  15.  11.  10.  14.   5.   6.   3.   7.   2.   6.   1.   3.   2.
   3.   0.   0.   1.   0.   0.   0.   0.]

stat_uncert: [ 0.          0.          0.          0.          1.          4.
  9.74679434 16.4924225  21.42428529 24.31049156 25.96150997 25.65151068
 24.95996795 23.04343724 21.3541565  19.72308292 17.91647287 15.93737745
 13.92838828 12.68857754 10.95445115  9.74679434  8.60232527  7.74596669
  7.28010989  6.164414    5.19615242  4.35889894  4.47213595  3.87298335
  3.31662479  3.16227766  3.74165739  2.23606798  2.44948974  1.73205081
  2.64575131  1.41421356  2.44948974  1.          1.73205081  1.41421356
  1.73205081  0.          0.          1.          0.          0.
  0.          0.        ]
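
As a quick sanity check (a sketch, not part of the original report), the unexpected variances above are exactly the squared bin counts, e.g. 16**2 == 256 and 95**2 == 9025:

import numpy as np
import uproot

data_hist = uproot.open("example.root")["data"].to_hist()
# The reported variances match the squares of the counts, suggesting the
# counts are being squared somewhere during writing or conversion.
print(np.allclose(data_hist.variances(), np.square(data_hist.values())))  # True here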

If a Weight storage Hist is needed, that seems very reasonable, but is there any way to create this while using uproot's .to_hist() API?

@henryiii

Can you print a repr of the hist given by data_hist = root_file["data"].to_hist() (or data_hist = Hist(root_file["data"]), which is the same thing)? It should have a Weight storage; this looks buggy to me.

@henryiii

Oh, is there no variance info for the uproot hist? So it's a Double storage but the automatic variances are not working?
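
For reference, a rough sketch of the expected behavior (an assumption based on hist's documented API, not output from this issue): with a plain Double storage and an unweighted fill, .variances() falls back to the bin counts, while a Weight storage tracks the sum of squared weights explicitly.

import hist
from hist import Hist

# Double storage, unweighted fill: variances fall back to the counts.
h_double = Hist(hist.axis.Regular(2, 0, 2, name="x"))
h_double.fill(x=[0.5, 0.5, 1.5])
print(h_double.values(), h_double.variances())  # [2. 1.] [2. 1.]

# Weight storage, weighted fill: variances are the sum of squared weights.
h_weight = Hist(hist.axis.Regular(2, 0, 2, name="x"), storage=hist.storage.Weight())
h_weight.fill(x=[0.5, 0.5, 1.5], weight=[0.5, 0.5, 2.0])
print(h_weight.values(), h_weight.variances())  # [1. 2.] [0.5 4. ]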

@matthewfeickert (Owner, Author) commented Mar 17, 2021

Can you print a repr of the hist given by data_hist = root_file["data"].to_hist() (or data_hist = Hist(root_file["data"]), which is the same thing)? It should have a Weight storage, looks buggy to me.

>>> from hist import Hist
>>> import uproot
>>> root_file = uproot.open("example.root")
>>> data_hist = root_file["data"].to_hist()
>>> print(data_hist)
  regular(50, 0, 1000, metadata={'name': 'xaxis', 'title': ''}, options=underflow | overflow)
  (-1): 0   ( 0): 0   ( 1): 0   ( 2): 0   ( 3): 0   ( 4): 1  
  ( 5): 16  ( 6): 95  ( 7): 272 ( 8): 459 ( 9): 591 (10): 674
  (11): 658 (12): 623 (13): 531 (14): 456 (15): 389 (16): 321
  (17): 254 (18): 194 (19): 161 (20): 120 (21): 95  (22): 74 
  (23): 60  (24): 53  (25): 38  (26): 27  (27): 19  (28): 20 
  (29): 15  (30): 11  (31): 10  (32): 14  (33): 5   (34): 6  
  (35): 3   (36): 7   (37): 2   (38): 6   (39): 1   (40): 3  
  (41): 2   (42): 3   (43): 0   (44): 0   (45): 1   (46): 0  
  (47): 0   (48): 0   (49): 0   (50): 0  

>>> hist_2 = Hist(root_file["data"])
>>> print(hist_2)
  regular(50, 0, 1000, metadata={'name': 'xaxis', 'title': ''}, options=underflow | overflow)
  (-1): 0   ( 0): 0   ( 1): 0   ( 2): 0   ( 3): 0   ( 4): 1  
  ( 5): 16  ( 6): 95  ( 7): 272 ( 8): 459 ( 9): 591 (10): 674
  (11): 658 (12): 623 (13): 531 (14): 456 (15): 389 (16): 321
  (17): 254 (18): 194 (19): 161 (20): 120 (21): 95  (22): 74 
  (23): 60  (24): 53  (25): 38  (26): 27  (27): 19  (28): 20 
  (29): 15  (30): 11  (31): 10  (32): 14  (33): 5   (34): 6  
  (35): 3   (36): 7   (37): 2   (38): 6   (39): 1   (40): 3  
  (41): 2   (42): 3   (43): 0   (44): 0   (45): 1   (46): 0  
  (47): 0   (48): 0   (49): 0   (50): 0  

>>> 

Oh, is there no variance info for the uproot hist? So it's a Double storage but the automatic variances are not working?

@henryiii Correct — the script writing the ROOT file only passes counts and bin edges:

def main():
    np.random.seed(0)
    data_hist = make_data_hist(hists)
    hists["data"] = {"counts": data_hist.tolist(), "bins": _bins}
    with open("example.json", "w") as serialization:
        json.dump(hists, serialization)
    with uproot3.recreate("example.root", compression=uproot3.ZLIB(4)) as outfile:
        for key in hists.keys():
            outfile[key] = (
                np.array(hists[key]["counts"]),
                np.array(hists[key]["bins"]),
            )

but maybe that's my failing in how I'm writing the ROOT file with uproot3. I'll look to see if I can add the variances through the uproot3 writing.

@henryiii commented Mar 17, 2021

Could you print a repr? I can't actually get the class name (Hist or Histogram) or the storage from the str, which @HDembinski forwarded from Boost.Histogram. It also shows things that are not boost-histogram syntax, like options=underflow | overflow, and shows the full metadata, even if it's huge. I thought it was supposed to provide a preview of the bins as a histogram, which it does not seem to be doing above.

@matthewfeickert (Owner, Author) commented Mar 17, 2021

Oh, whoops. I was blanking out and forgot that I could just call repr directly rather than relying on print to get it for me. (I would have thought that the histoprint view that normally shows up from print would have kicked in, but no, I had forgotten that is handled by .show().)

>>> from hist import Hist
>>> import uproot
>>> root_file = uproot.open("example.root")
>>> data_hist = root_file["data"].to_hist()
>>> repr(data_hist)
"Hist(Regular(50, 0, 1000, name='xaxis', label='xaxis'), storage=Weight()) # Sum: WeightedSum(value=6290, variance=2.83e+06)"
>>> hist_2 = Hist(root_file["data"])
>>> repr(hist_2)
"Hist(Regular(50, 0, 1000, name='xaxis', label='xaxis'), storage=Weight()) # Sum: WeightedSum(value=6290, variance=2.83e+06)"
>>> sanity_check = (
...   Hist.new
...   .Reg(10, 0, 1, name="x", label="x-axis")
...   .Var(range(10), name="y", label="y-axis")
...   .Int64()
... )
>>> print(sanity_check)
Hist(
  Regular(10, 0, 1, name='x', label='x-axis'),
  Variable([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
  storage=Int64())
>>> repr(sanity_check)
"Hist(\n  Regular(10, 0, 1, name='x', label='x-axis'),\n  Variable([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),\n  storage=Int64())"

@henryiii

Weird, why is print(sanity_check) showing the repr but print(data_hist) showing something else?

@henryiii

This is a weighted histogram, though, so I think that means something is getting squared when this is being stored. Uproot's variances are fine?

@henryiii

This should be directly converted from fSumw2 from uproot. I need to refactor this code to use the new (0.12+) direct view setting feature.
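
A minimal sketch of what direct view setting could look like (assuming boost-histogram >= 0.12; this is not the refactored uproot code):

import boost_histogram as bh
import numpy as np

# The structured view of a Weight storage exposes "value" and "variance"
# fields that can be assigned in place.
h = bh.Histogram(bh.axis.Regular(3, 0, 3), storage=bh.storage.Weight())
view = h.view()
view["value"] = np.array([1.0, 4.0, 9.0])      # bin contents
view["variance"] = np.array([1.0, 4.0, 9.0])   # e.g. fSumw2 from the ROOT file
print(h.view().value, h.view().variance)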

@matthewfeickert (Owner, Author)

Given that the variances coming directly from uproot are not right

>>> import uproot
>>> root_file = uproot.open("example.root")
>>> root_file["data"].fSumw2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Model_TH1I_v3' object has no attribute 'fSumw2'
>>> root_file["data"].weighted
True
>>> root_file["data"].variances()
array([0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
       2.56000e+02, 9.02500e+03, 7.39840e+04, 2.10681e+05, 3.49281e+05,
       4.54276e+05, 4.32964e+05, 3.88129e+05, 2.81961e+05, 2.07936e+05,
       1.51321e+05, 1.03041e+05, 6.45160e+04, 3.76360e+04, 2.59210e+04,
       1.44000e+04, 9.02500e+03, 5.47600e+03, 3.60000e+03, 2.80900e+03,
       1.44400e+03, 7.29000e+02, 3.61000e+02, 4.00000e+02, 2.25000e+02,
       1.21000e+02, 1.00000e+02, 1.96000e+02, 2.50000e+01, 3.60000e+01,
       9.00000e+00, 4.90000e+01, 4.00000e+00, 3.60000e+01, 1.00000e+00,
       9.00000e+00, 4.00000e+00, 9.00000e+00, 0.00000e+00, 0.00000e+00,
       1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00])

it seems that something is happening with uproot3 on writing. I'll follow up with Jim on Gitter and then here or in the uproot3 Issues if that's reasonable. I think it is safe to say that boost-histogram and hist aren't causing the issues here.

@matthewfeickert (Owner, Author)

Discussion on the variances is moving to scikit-hep/uproot3-methods#97

@matthewfeickert matthewfeickert self-assigned this Mar 18, 2021
@matthewfeickert (Owner, Author)

This is solved in scikit-hep/uproot3-methods#98 and with the release of uproot3-methods v0.10.1.

@matthewfeickert (Owner, Author) commented Mar 18, 2021

In closing, this sort of thing now works, inside of heputils as well, as what follows is just an example of what hist already does:

# example.py
import hist
import numpy as np
from hist import Hist

if __name__ == "__main__":
    bins = [1, 2, 3]
    yields = np.asarray([3, 4])
    stdev = np.asarray([1, 2])

    h1 = Hist(
        hist.axis.Regular(len(bins) - 1, bins[0], bins[-1]),
        storage=hist.storage.Weight(),
    )
    h1[...] = np.stack([yields, stdev ** 2], axis=-1)

    yields = np.asarray([5, 5])
    stdev = np.asarray([1, 1])
    h2 = Hist(
        hist.axis.Regular(len(bins) - 1, bins[0], bins[-1]),
        storage=hist.storage.Weight(),
    )
    h2[...] = np.stack([yields, stdev ** 2], axis=-1)

    # Avoid shadowing the imported hist module with the loop variable
    for histogram in [h1, h2, h1 + h2]:
        print(f"yields: {histogram.values()}, stdev: {np.sqrt(histogram.variances())}")

giving

$ python example.py
yields: [3. 4.], stdev: [1. 2.]
yields: [5. 5.], stdev: [1. 1.]
yields: [8. 9.], stdev: [1.41421356 2.23606798]
