Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parse_expression fails on units with spaces in the name #799

Closed
jpeacock29 opened this issue Apr 23, 2019 · 8 comments
Closed

parse_expression fails on units with spaces in the name #799

jpeacock29 opened this issue Apr 23, 2019 · 8 comments

Comments

@jpeacock29
Copy link

For example,

from pint import UnitRegistry
UnitRegistry().parse_expression('survey mile')
UnitRegistry().parse_expression('fluid ounce')

both raise UndefinedUnitError. Others instances fail to recognize a unit name with spaces and gives unusual interpretations:

UnitRegistry().parse_expression('fl. oz.')

returns 1 femtoliter ounce.

@hgrecco
Copy link
Owner

hgrecco commented Apr 24, 2019

units with spaces are not supported as there is no robust wat to distinguish the user intention and would make the parser much more complex

@jpeacock29
Copy link
Author

Can you provide an non-robust example? "Fluid ounce" and "survey mile" both appear unambiguous. I'm glad to check out the code, but how would a token with a space in it would make the parsing much more difficult? I'm glad to make a pull request.

@hgrecco
Copy link
Owner

hgrecco commented Apr 24, 2019

A space is now interpreted as product. That is why you can write 2 kg instead of 2 * kg. If we allow units with space when the parser finds a b, it will need to decide if it means a * b or the unit a b. This could be done by checking if a, b or a b (and their expanded versions in case they are compatible with prefixes) are in the registry. But then if you encounter a b c, now you need to check all of them. The parsing is clearly more difficult as it requires to look ahead (or look behind).

And then there are examples like the one you gave, it might be an unusual interpretation but is legal. In this cases we tend to follow a simple rule: do not guess the user intentions.

@jpeacock29
Copy link
Author

I see, thank you for explaining. It looks like a space for multiplication is currently handled in string_preprocessor in util.py, along with many other substitutions. Could another step of pre-proccessing handle this case by replacing "fluid ounce" with the canonical form "fluid_ounce"? This would fix the issue pre-tokenization and not require any look ahead in the parser.

@hgrecco
Copy link
Owner

hgrecco commented Apr 24, 2019

That is completely correct. There has been discussions in the past about making the _subs_re (see dictionary here) pluggable in the registry. (maybe not requiring to write a regex but just string replacements).

I think that would be a nice addition.

@jpeacock29
Copy link
Author

jpeacock29 commented Apr 24, 2019

I can take a go at this probably. My thinking is to use UnitRegistry._units to create a dictionary of names with spaces keyed to their canonical forms (eg, {'fluid ounce': 'fluid_ounce'}) and then use this dictionary to do the replacement in string_preprocessor.

However, this wouldn't work for the given example fl. oz. because the periods would not be picked up. I could work around this with regex like fl\.? oz\.? although I'm not sure if this would be too much a hit on performance. Any thoughts?

@hgrecco
Copy link
Owner

hgrecco commented Apr 25, 2019

I think is better to create a new attribute for this (maybe UnitRegistry._replacements) and then do as you suggested.
Regarding the performance hit, we should measure that. It might be ok, I have no feeling for that.

@jules-ch
Copy link
Collaborator

Gonna close this, preprocessors can be used as a work around.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants