New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'H' format string broken in strict parsing #1401

Closed
tp opened this Issue Jan 7, 2014 · 10 comments

Comments

Projects
None yet
3 participants
@tp

tp commented Jan 7, 2014

I hit an unexpected error when using 'H' to parse 24h-time values without a leading zero:

> var moment = require('moment');
undefined
> moment.version
'2.5.0'
> moment().format('H')
'17'
> var parsed = moment('17', 'H');
undefined
> parsed.isValid()
true
> var parsed = moment('17', 'H', true);
undefined
> parsed.isValid()
false

I would expect H to be usable here, so that the code can also work with times before noon.

@ichernev

This comment has been minimized.

Contributor

ichernev commented Jan 9, 2014

You're right. We made strict parsing a bit too strict in the last version.

@icambron

This comment has been minimized.

Member

icambron commented Jan 9, 2014

Hmm, did we? If H eats 2 tokens whenever possible, that will lead to broken parses:

moment('3141113', 'HDDMMYY', true).format(); //=> '2013-11-14T03:00:00-05:00'
moment('3141113', 'HHDDMMYY', true).format(); //=> 'Invalid date'

That's a big part of why we made strict act the way it does. (Of course, the real solution would be to use a global regex so we get backtracking, but that's a separate discussion)

More generally, why even have "H" and "HH" if they're the same?

@ichernev

This comment has been minimized.

Contributor

ichernev commented Jan 9, 2014

Well, we have H and HH for format. I don't think we need them for parsing. For parsing they can mean the same, I'm not sure if other libraries are doing something similar. The same applies for years -- if you use YYYY then you can not parse a year that has different number of digits, which means everybody using YYYY to parse is wrong :) According to the cldr token docs, the number of repetitions (in YYYY -- 4), means pad to that much, meaning bigger years will always be left intact. I guess if you print something with YYYY then you'd expect to parse it back with YYYY (same goes for H).

@icambron

This comment has been minimized.

Member

icambron commented Jan 9, 2014

Meaning, you're proposing that in strict mode, H always parses two characters?

I'm OK with that, but it won't allow you to roundtrip H from the formatter back to the parser. That's never possible to guarantee, because the whole point of H as a formatting token is that it outputs a variable number of characters.

I think maybe the best thing to do would be to eliminate H as a parsing token altogether, but we probably can't because that would break too much for non-strict users.

@ichernev

This comment has been minimized.

Contributor

ichernev commented Jan 14, 2014

@icambron H is NOT intended to produce a specific number of characters. Its intended to NOT pad to a specific number of characters. In the same way that Y doesn't mean a single digit year, it means the year, verbatim. (YY is a special case that always displays 2).

I see two types of issues here. The year family of tokens (/Y+/, /g+/, /G+/). The problem here is that you don't know how many digits they would generate. There is a lower bound (YYYY would produce at least 4 digits), but not an upper one. I can see these use cases:

  • regular years (meaning +/- 100 years from now) displayed with 2 digits (YY)
  • regular years displayed with 4 digits (YYYY)
  • irregular years (negative, too small, too big) displayed with any number of digits (rest of tokens)

So I suggest

  • YYYY to parse exactly 4 digits
  • YY to parse exactly 2 digits
  • Y to parse anything /[-+]?\d+/

This behavior is kind of different than the formatting behavior, but I think it has the least amount of surprise for most users. Because most users use regular years, and YY and YYYY make sense for them.

The second issue is with all other tokens (h, H, d, D, M, s, w, W..., DDD). These can be mostly one or two digits, and only padding is changed with the different formatting tokens. I think if you want rock solid parsing, you'd use HH (or equivalent) for both formatting and parsing. H would mean -- this comes from a user, and it should match any number of digits (until a non-digit is found). In general you don't pad units going to the user, and user won't pad units when entering. But user also won't put date and month numbers glued together, so greedy scanning works in this case.

So I propose:

  • non padded version (h, H, D, DDD, ...) mean eager /\d+/ tokenization
  • padded versions (hh, HH, DD, DDDD) mean strict /\d\d/ tokenization
@tp

This comment has been minimized.

tp commented Jan 14, 2014

@ichernev: I think if you want rock solid parsing, you'd use HH (or equivalent) for both formatting and parsing. H would mean -- this comes from a user, and it should match any number of digits (until a non-digit is found).

But one can neither use H for all 24 hours of the day (see above), nor HH if single-digit hours don't have a leading 0:

var parsed = moment('1', 'HH', true);
parsed.isValid() => false

So what would you suggest?

@ichernev

This comment has been minimized.

Contributor

ichernev commented Jan 14, 2014

/\d+/ means any number of digits.

@icambron

This comment has been minimized.

Member

icambron commented Jan 14, 2014

@ichernev I think that's what I said too. I just wanted to clarify that you won't be able to guarantee round tripping. It sounds like you're OK with that and only intend H (as a parsing token) to be for user-generated strings. I'm just imagining this ticket:

moment(moment().hours(1).format('HDD'), 'HDD', true); //WHY YOU NO WORK?

If you're OK with answering that with "dude, use HH", then I'm good with it too.

@ichernev

This comment has been minimized.

Contributor

ichernev commented Jan 14, 2014

Well, you're shooting yourself in the foot with that code IMHO. So yeah :)

Btw the initial implementation of strict would have worked here -- because it first builds the entire regex, and the second token would be \d\d, so the first \d+ greedy would not match 2 digits -- just one ;-)

@icambron

This comment has been minimized.

Member

icambron commented Jan 14, 2014

Right, the advantage of a global regex is that we get backtracking and we don't have to deal with greed as much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment