
Request for rounding exponent to multiple of 3 for "engineering notation" #159

Open
jagerber48 opened this issue Jul 26, 2022 · 20 comments

@jagerber48
Contributor

For many scientific applications it is nice to have the exponent rounded to a multiple of 3, so that a result like (1.2+/-0.1)e+04 appears as (12+/-1)e+03 and can be quickly interpreted, for example, as 12 +/- 1 kHz.

Exactly how the rounding should happen is somewhat ambiguous. For example, should 1.2, 12, 120 or 0.12, 1.2, 12 be the preferred decimal representations? I'd suggest making that choice optional somehow.

@lebigot
Collaborator

lebigot commented Jul 26, 2022

This is a good idea: it looks like this can indeed be useful.

Some work is needed to precisely define a convention that makes sense, as you describe. Maybe there are some standards or common practices?

@jagerber48
Contributor Author

A little bit of research doesn't turn up standards or common practices. I think anything with an exponent divisible by 3 counts as engineering notation, so for 1200 all of 1200e0, 1.2e3, 0.0012e6, and 0.0000012e9 count as engineering notation.

I can continue to look for common practices, but my intuitive expectation would be that the mantissa m in m * 10^e satisfies either 1.0 <= m < 1000 or 0.1 <= m < 100.

I'd prefer the latter, but I could imagine others, or some situations, preferring the former. I could imagine a preference setting, like eng_notation_small_base=True or eng_notation_large_base=True, that toggles between the two modes. I haven't worked much with the uncertainties package beyond ufloat, but for that case I could imagine something like ufloat(value, sigma, eng_notation_small_base=True); or, if there's somewhere to set global preferences/settings, it could appear there.
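The two conventions can be sketched as a small exponent-selection helper. This is pure illustration: `eng_exponent` and its `large_mantissa` flag are hypothetical names, not anything in the uncertainties package.

```python
import math

def eng_exponent(value, large_mantissa=True):
    """Pick a decimal exponent that is a multiple of 3 (hypothetical helper).

    large_mantissa=True  -> mantissa lands in [1, 1000)
    large_mantissa=False -> mantissa lands in [0.1, 100)
    """
    sci_exp = math.floor(math.log10(abs(value)))  # exponent in scientific notation
    shift = 0 if large_mantissa else 1            # shifting by one selects the other range
    return 3 * math.floor((sci_exp + shift) / 3)

for v in (1200.0, 120000.0, 0.12):
    e = eng_exponent(v)
    print(f"{v / 10.0**e:g}e{e:+03d}")  # 1.2e+03, 120e+03, 120e-03
```

With `large_mantissa=False`, 120000.0 instead maps to exponent 6 (mantissa 0.12), matching the second convention.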

@jagerber48
Contributor Author

This is as close as we're going to get to an official standard I think:

https://www.nist.gov/pml/special-publication-811/nist-guide-si-chapter-7-rules-and-style-conventions-expressing-values
See 7.9 on choosing SI prefixes. It suggests the exponent be chosen as a multiple of 3 and that the mantissa lie between 0.1 and 1000. I think this leaves both 0.1 - 100 and 1 - 1000 open as options.

Thinking about this raised another question: is it possible to coerce the exponent to a specific value in uncertainties? I may open a separate issue about that.

@lebigot
Collaborator

lebigot commented Jul 27, 2022

Thanks for the research. I think it's a good idea to use NIST's convention.

The natural place for defining the chosen format is Python's string formatting. We could add a new format letter for this engineering format (uppercase for 1–1000, lowercase for 0.1–100 maybe?).

@jagerber48
Contributor Author

Yes, that sounds great to me! I like uppercase for 1-1000 and lowercase for 0.1-100. Maybe r or R, since e, n, and g from ENGINEERING are already spoken for?

@lebigot
Collaborator

lebigot commented Jul 28, 2022

I like this idea.

There is the question of possible interactions with other format indicators (precision control…). At this stage I'm not seeing any problem: I'm thinking they can simply be applied to the mantissa part.
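The idea that other format indicators simply apply to the mantissa can be sketched in plain Python (this is not the package's implementation; the variable names are illustrative):

```python
import math

value = 12345.6
# Engineering exponent: round the scientific exponent down to a multiple of 3.
eng_exp = 3 * math.floor(math.floor(math.log10(value)) / 3)  # 3
mantissa = value / 10.0**eng_exp                             # ~12.3456
# The precision part of the format spec (here .2f) touches only the mantissa:
print(f"{mantissa:.2f}e{eng_exp:+03d}")  # 12.35e+03
```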

@lebigot
Collaborator

lebigot commented Aug 23, 2022

For reference: the string parsing function ufloat_fromstr() should also be updated so that it understands the new output format.

@jagerber48
Contributor Author

I've begun work on this. It was pretty straightforward to get the r and R formats working in the __format__ method. ufloat_fromstr() will be more challenging for me to tackle, since I don't know the language used for all the "match" stuff (regular expressions?).

Three questions: (1) Should I open a PR now with the partial code updates, or wait until it is more complete and ready for review? (2) I haven't yet looked at updating the documentation. I can update the nearby docstrings, but if it needs updating in other places, I'd need to be pointed to them. (3) Does this change need to reach anywhere in the package other than __format__ and ufloat_fromstr()?

@lebigot
Collaborator

lebigot commented Oct 12, 2022

Thanks.

Yes, ufloat_fromstr() uses regular expressions.

About your questions:

(1) I think it's cleaner to open a Pull Request only once you're satisfied with what you have, so that nobody reviews code or documentation that you may still change.

(2) Both the docstrings and the documentation should be updated (it's at https://github.com/lebigot/uncertainties/blob/master/doc/user_guide.rst#printing).

(3) I don't think that anything besides __format__ and ufloat_fromstr (str_to_number_with_uncert) needs to be updated.

@jagerber48
Contributor Author

Sounds good! I'll continue to work on this and if I have more questions of this nature I'll ask them here!

@jagerber48
Contributor Author

jagerber48 commented Oct 14, 2022

Ok, I learned regex now.

Do you anticipate a specific issue with str_to_number_with_uncert and the new format? Without the new r flag in the format specification, something like ufloat(12300, 1400) is printed as 1.23(14)e+04; with the r flag it is printed as 12.3(14)e+03. But the regex logic for parsing is the same either way: find the symbol e (or whatever it might be) that separates the mantissa + uncertainty from the exponent, extract the exponent, split the mantissa and uncertainty, and then use the exponent to compute floats for the nominal value and the uncertainty. In fact, cases of this sort already show up in the documentation as being captured by the function.

At first blush, and in my first tests, no change to str_to_number_with_uncert is needed, but please let me know if you were anticipating another issue.
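A deliberately simplified stand-in for the kind of regex ufloat_fromstr() uses can illustrate why both exponent conventions parse identically. The pattern and `parse()` below are illustrative toys, far less general than the package's real parser:

```python
import re

# Much-simplified stand-in for the real pattern (illustrative only).
NUM_WITH_UNCERT = re.compile(
    r"(?P<nom>-?\d+(?:\.\d+)?)"    # nominal value, e.g. 12.3
    r"\((?P<unc>\d+)\)"            # shorthand uncertainty digits, e.g. (14)
    r"(?:[eE](?P<exp>[+-]?\d+))?"  # optional exponent, e.g. e+03
)

def parse(s):
    m = NUM_WITH_UNCERT.fullmatch(s)
    exp = int(m.group("exp") or 0)
    nom_str = m.group("nom")
    # "(14)" counts units of the nominal value's last displayed digit.
    frac_digits = len(nom_str.partition(".")[2])
    nom = float(f"{nom_str}e{exp}")
    unc = int(m.group("unc")) * 10.0 ** (exp - frac_digits)
    return nom, unc

# Scientific and engineering exponents go through the same logic:
print(parse("1.23(14)e+04"))  # (12300.0, 1400.0)
print(parse("12.3(14)e+03"))  # (12300.0, 1400.0)
```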

For now I plan to extend the testing script to capture and test the possible edge cases, as is already done thoroughly for existing features. My initial tests show that the new format plays well with other formatting instructions, but I want to confirm that more thoroughly. That plus documentation, and I'll set up a PR.

@jagerber48
Contributor Author

jagerber48 commented Oct 14, 2022

Perhaps you were imagining delimiters other than e/E to separate the mantissa and exponent when using engineering notation, but my idea is to keep the same formatting and only change the rule for selecting the exponent: it is always a multiple of 3, and the mantissa lies between 1 and 999 (R) or 0.1 and 99.9 (r).

@jagerber48
Contributor Author

Consider value 123.456 with uncertainty 0.012.
If we format with .2ur
(0.123456+/-0.000012)e+03
but if we format with .2uR we get
(123.456+/-0.012)E+00

This was the easiest way to implement the r/R distinction, but I could easily modify it so that both r and R modes print the lowercase e exponent indicator. That would be my preference, but I wanted to check whether there's a reason to keep the distinction.

@jagerber48
Contributor Author

Sorry for the barrage of comments. Unfortunately I think it will take me longer to work this out than I realized.

The formatting fails for a pair like 12.3 +/- 456.78 formatted with '.1uR': it comes out as (12+/-457)E+00, but I think it should come out as (0+/-500)E+00. I suspect some of the code downstream of where I resolve the exponent to a multiple of 3 assumes the value or the error is between 1 and 10 or something. I'll need to dig into how formatting is done to resolve this case.
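For what it's worth, the uncertainty-driven rounding that should produce (0+/-500) can be sketched independently of the package (`round_by_uncertainty` is a hypothetical name, not an uncertainties function):

```python
import math

def round_by_uncertainty(nom, std, sig_figs=1):
    """Round std to sig_figs significant digits, then nom to the same place."""
    # Decimal place of the last kept digit of std (2, i.e. hundreds, for 456.78).
    last_digit = math.floor(math.log10(std)) - (sig_figs - 1)
    return round(nom, -last_digit), round(std, -last_digit)

print(round_by_uncertainty(12.3, 456.78))  # (0.0, 500.0)
```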

@jagerber48
Contributor Author

@lebigot Hello, do you have any suggestions for the best place in the __format__ (or maybe format_num) code to include this change so that it plays nicely with other formatting specifications, especially precision? Also, I'm trying to better understand the u flag. Suppose both val and std are non-zero floats, and not nan or inf, to avoid edge cases. Is the idea that, without the u flag, the val part is formatted exactly as the format string dictates, and the std part is formatted essentially as it normally would be but inheriting the precision used for val? And with the u flag present, the logic is the same but the std part is formatted first and val inherits its precision from std?

@lebigot
Collaborator

lebigot commented Nov 19, 2022

Thank you for the last 3 messages (which somehow escaped me until now). Let me respond in order:

  1. Exponent indicated with "e" or "E"

    Good question. An important point is that the lowercase/uppercase format specifications result in lowercase/uppercase in the result for formats x, e, f, g (i.e. all the formats for which that matters, essentially). The principle of least surprise would dictate that the r/R specification does the same (i.e. prints e/E).

    This gives me an idea: for handling the choice of a mantissa in [0.1; 100) or [1; 1000), we can use the # format specification, which triggers an "alternate form". We would need to specify which form is standard and which is the alternate form. I'd vote for [1; 1000) being the standard, because we can then directly see from the exponent that the result is at least 1000, 1 million, etc., but I'm open to other arguments.

  2. Case of 12.3 +/- 456.78. I agree about the output you expect. I was wondering for a second if it should do the same as the .3u format, but with .1uR, the user is asking for a very limited precision indeed, and the output you expect is consistent with the current result of the .1u format. 👍

  3. I am guessing that implementing the r/R specification would be best done by simply calling format_num() with the correct common_exp, which would be calculated as usual in __format__() (I didn't take the time to dig again in these functions, as __format__() is probably the single most complicated function of the whole code!).

    The best place in __format__() would probably be after the handling of the eEfFgG formats. Their code could probably be used as a template.

    As for the general formatting rule that you describe, I think you got it right. 😀

@jagerber48
Contributor Author

@lebigot thanks for the responses.

  1. I can look into the # format specification. Sounds like it might be nice. I agree that [1; 1000) should be "standard". Right now my code uses r and R rather than r and # or R and # or something, but I can sort this detail out later; right now I'm worried about correctness of the algorithm.

2 + 3. OK, so what you suggest in 3 is what I attempted, but it leads to the failure discussed in 2. Here's my code: master...jagerber48:uncertainties:feature/engineering_notation. I've spent more time looking at __format__ than format_num, so I don't know where things might be going wrong. I do treat r and R in the same code blocks as eEfF; maybe I should treat them in the gG code blocks instead? I think the issue is that the stdlib Python formatting used in format_num can't express sig-fig-based precision for mantissas >= 10. The Python formatting machinery is deeply based on the idea of "digits after the decimal point", which is not very useful for scientific uncertainty, where we care about significant digits. In scientific notation there is an easy correspondence between the two, but to render a number like 456.78 with 1 or 2 sig figs without putting it into scientific notation, you have to do something different. Maybe I need to manually round nom_val_mantissa and std_dev_mantissa to the right number of sig figs, since we can't rely on Python formatting to do that.

FWIW, I wrote my own val +/- uncertainty string formatting function, inspired by the code I've been looking at here: https://github.com/jagerber48/strunc/blob/main/strunc/val_unc_formatting.py. It uses a custom "specification language" a little more targeted at val/uncertainty printing than Python's native formatting, which relies heavily on digits-past-the-decimal precision rather than sig figs. Basically, you can explicitly specify (1) whether to use sig figs or digits-past-the-decimal precision, (2) whether val or std drives the precision, i.e. whether (123.45 +/- 67.1) with 1 sig fig should appear as (100 +/- 0) or (120 +/- 70), and (3) whether val or std drives the exponent, i.e. whether (120 +/- 70) should go to (1.2 +/- 0.7)e+2 or (12 +/- 7)e+1. For standard scientific printing you'd always want sig figs driven by the uncertainty (and the number of sig figs should probably be 1 or 2; you can follow the PDG recommendation if you like). I think you'd usually want the exponent driven by the value, but there may be cases where you want it driven by the uncertainty. There are some other features motivated by what I saw here. I don't think this code is any kind of replacement for what's in this package; I'm just sharing it in case it jogs any ideas. I think extending native Python string formatting to support val +/- unc formatting is one of the slickest parts of this package. It's just unfortunate that this approach has to inherit Python's digits-past-the-decimal approach to precision.

  • I know the g format in Python can support sig-fig-based formatting, but there it just converts the number to scientific notation, where sig figs are basically the same thing (modulo an off-by-one, maybe) as digits past the decimal.
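The 'g' behavior described above is easy to verify with plain CPython (standard behavior, nothing package-specific):

```python
# 'g' precision counts significant digits, but once the decimal exponent
# reaches the precision it switches to scientific notation instead of
# staying fixed-point; 'f' precision counts decimal places instead.
print(format(456.78, ".2g"))  # 4.6e+02
print(format(456.78, ".4g"))  # 456.8
print(format(456.78, ".2f"))  # 456.78
```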

@jagerber48
Contributor Author

@lebigot I'm still keen to get the engineering notation feature in, but it's challenging for me to get my head around the expected behavior for various combinations of inputs. I need that information both to understand the code and to write unit tests for the new feature. But existing unit test issues are blocking this.

(1) #162: some unit tests are not actually being run because of a bug in the unit test code.
(2) Even when the unit test code is fixed so that all tests run, some tests fail. See #162 (comment) for a summary.

I address issue (1) in PR #167, but I think issue (2) blocks that PR from passing the build. I see these issues as blocking this one, so I'm curious how you think it best to handle them. Should one PR fix the "unit tests not being run" bug as well as the "some unit tests fail" issue, or should they be handled in separate PRs? If so, how?

@jagerber48
Contributor Author

jagerber48 commented Apr 5, 2023

@lebigot I'm still interested in getting this engineering notation feature into the package if possible, but I'm taking a very long way around. The layers of the onion look like this:

  • Right now the code (I think I'm talking mostly about format_num here) assumes either no scientific notation or scientific notation with exactly one digit before the decimal point. This assumption is built pretty deeply into a lot of the code. Also, format_num relies on Python string formatting, which doesn't really support engineering notation. I think this means a fairly new chunk of logic is going to be needed to get this feature in.
  • Since this will involve a lot of code changes, I think having good unit tests will be helpful.
  • I see there is a nice unit test framework built in. Unfortunately, I found issues with (1) how the unit tests were being listed/run, which meant that not all tests were getting run, and (2) some unit tests that were not previously being run fail when they are run.
  • So I want to clean up the existing unit tests first, but I would like some feedback from you on that (see the links above).
  • Somewhat tangential: I'd like to improve my understanding of exactly what is expected for each format string. There is an ambiguity in the Python format specification mini-language documentation for format strings with no presentation type specified. I'm investigating this in a separate issue.

But I think the last two points can be worked on in parallel. I'm curious whether you could address my comments about unit tests in the other issues and PRs in this repo pointed out in my previous comment.

ALTERNATIVELY, if taking the "long way around" is a big waste of time, and you see an easy way to get engineering notation in, I'd be very open to tips/suggestions for how to do that!

@lebigot
Collaborator

lebigot commented Apr 9, 2023

Thanks for the updates. I'm quite busy right now and cannot dig into this, but I'll see after the current bout of activity if I can whip up something.
