Needs canonical examples / reference implementation (including RegExp) #59

Open
coolaj86 opened this Issue Apr 24, 2013 · 19 comments

See mojombo/semver#32

  • RegExp in various languages that correctly parse semver
  • Reference implementations

JavaScript RegExp:

/^((\d+)\.(\d+)\.(\d+))(?:-([\dA-Za-z\-]+(?:\.[\dA-Za-z\-]+)*))?(?:\+([\dA-Za-z\-]+(?:\.[\dA-Za-z\-]+)*))?$/

JavaScript Reference Implementation:

https://github.com/coolaj86/semver-utils

I would suggest having two sections - one for tested Regular Expressions that match and another for modules / reference implementations.

There should be at least one canonical reference implementation of a parser / validator, with an API that would make sense to copy in any language.

ghost commented Jun 24, 2013

The regular expression needs to be updated to detect the presence of leading zeros.

Unfortunately, I haven't found a way to do that, otherwise I would have provided an updated example.

Can you give an example of a valid semver string that has leading zeros, and point to the part of the documentation (a phrase or an example) that suggests this is allowed?

If so I'll add your string to the tests in semver-utils and fix it.

ghost commented Jun 24, 2013

Sorry, I meant to disallow.

The regular expression as used now does not seem to reject version numbers with leading zeros.
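
To make the report concrete, here is a small check. It assumes the pattern from the issue description, with `\d+` for each of major, minor, and patch; the name `LOOSE_SEMVER` is made up for this illustration.

```javascript
// Pattern from the issue description; LOOSE_SEMVER is a name chosen here.
const LOOSE_SEMVER = /^((\d+)\.(\d+)\.(\d+))(?:-([\dA-Za-z\-]+(?:\.[\dA-Za-z\-]+)*))?(?:\+([\dA-Za-z\-]+(?:\.[\dA-Za-z\-]+)*))?$/;

// \d+ matches any run of digits, so leading zeros pass through unchallenged.
const acceptsLeadingZeros = LOOSE_SEMVER.test('01.002.0003');

// An ordinary version is accepted as well, as expected.
const acceptsPlain = LOOSE_SEMVER.test('1.2.3');
```

Both calls return `true`, which is the behaviour being reported.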

tbull commented Jun 24, 2013

We decided only recently that leading zeroes are no longer allowed: mojombo/semver#112
That decision invalidated all regexps developed earlier.

gvlx commented Oct 3, 2014

Hi,

I have been playing with the regex for version 2.0.0 and came up with this on regex101.

Expanded:

/^
(?'MAJOR'
    0|(?:[1-9]\d*)
)
\.
(?'MINOR'
    0|(?:[1-9]\d*)
)
\.
(?'PATCH'
    0|(?:[1-9]\d*)
)
(?:-(?'prerelease'
    (?:0|(?:[1-9A-Za-z-][0-9A-Za-z-]*))
    (?:\.
        (?:0|(?:[1-9A-Za-z-][0-9A-Za-z-]*))
    )*
))?
(?:\+(?'build'
    (?:0|(?:[1-9A-Za-z-][0-9A-Za-z-]*))
    (?:\.
        (?:0|(?:[1-9A-Za-z-][0-9A-Za-z-]*))
    )*
))?
$/

The pre-release and build patterns are very complex because they require the 'no leading zeros' rule.

I can't see any benefit of that over a (very) relaxed pattern such as /[0-9A-Za-z]+([.-][0-9A-Za-z]+)*/, which is just 'dot-or-dash' separated alphanumeric identifiers (e.g. "00000-aaaaa.bbbbb"). For me that would be more useful (I usually have to use UUIDs and other mechanical identifiers).

Notice that according to the railroad diagram and the BNF (boy, isn't that hard to read! 😕) the identifier "0000.0000.0000.0000.------" is valid (leading zeros allowed).

If you can, please supply more edge cases 😄 on the regex101 page (all cases in the first block are valid; those in the second block are invalid).

Happy hacking!

coolaj86 commented Oct 3, 2014

I'm in the camp that would veto the use of leading zeros. In JavaScript (and some other languages) parsers may default to octal, and then your sorting could get all out of whack because suddenly '011' is less than '10' both lexicographically and numerically.

And what about 007 vs 07 vs 7? Numerically they're all the same in base 10 and base 8 so how would you know which version is the "newer" one?

For the love of all that is good on this earth: no leading zeros!!!
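
A rough illustration of the concern, passing the radix explicitly so the result does not depend on the engine (older engines could read a leading `0` as octal when no radix was given). The variable names are made up for this sketch.

```javascript
// The same digits, read in two different bases.
const asOctal = parseInt('011', 8);    // octal reading
const asDecimal = parseInt('011', 10); // decimal reading
const ten = parseInt('10', 10);

// Read as octal, '011' sorts below '10'; plain string comparison agrees.
const octalLess = asOctal < ten;
const lexLess = '011' < '10';

// '007', '07' and '7' all collapse to the same number once parsed,
// so the numeric value alone cannot say which string is "newer".
const sevens = ['007', '07', '7'].map((s) => parseInt(s, 10));
```

Here `asOctal` is 9 and `asDecimal` is 11, and every entry of `sevens` is 7.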

coolaj86 commented Oct 3, 2014

Oh, sorry I missed the part about that being a build number. I'll have to look at the spec, but I don't think the parser I mentioned prohibits this either way.

gvlx commented Oct 3, 2014

Hi,

The requirement is on 2.0.0 for pre-release but not on build version:

9 A pre-release version (...) Numeric identifiers MUST NOT include leading zeroes. (...)

But on the BNF:

<pre-release identifier> ::= <alphanumeric identifier>
                           | <numeric identifier>

<build identifier> ::= <alphanumeric identifier>
                     | <digits>

<alphanumeric identifier> ::= <non-digit>
                            | <non-digit> <identifier characters>
                            | <identifier characters> <non-digit>
                            | <identifier characters> <non-digit> <identifier characters>

<identifier characters> ::= <identifier character>
                          | <identifier character> <identifier characters>

<identifier character> ::= <digit>
                         | <non-digit>

<non-digit> ::= <letter>
              | "-"

<digit> ::= "0"
          | <positive digit>

The railroad diagram is less clear.

So maybe the text requires some correction.
Added pull request #95

So, version 2.0.1? (patterns allowed on 2.0.0 will still work here).

gvlx commented Oct 4, 2014

New regex101 pattern:

^
(?'MAJOR'(?:
    0|(?:[1-9]\d*)
))
\.
(?'MINOR'(?:
    0|(?:[1-9]\d*)
))
\.
(?'PATCH'(?:
    0|(?:[1-9]\d*)
))
(?:-(?'prerelease'
    [0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*
))?
(?:\+(?'build'
    [0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*
))?
$
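
The `(?'NAME'...)` group syntax above is PCRE-style. A possible JavaScript translation, assuming an engine with ES2018 named groups (`(?<name>...)`), might look like the sketch below. The constant name `SEMVER_RELAXED` and the lowercase group names are choices made here, not part of the original pattern.

```javascript
// Hypothetical JavaScript port of the pattern above, built from string
// pieces so each component stays readable.
const SEMVER_RELAXED = new RegExp(
  '^' +
  '(?<major>0|[1-9]\\d*)' +
  '\\.(?<minor>0|[1-9]\\d*)' +
  '\\.(?<patch>0|[1-9]\\d*)' +
  '(?:-(?<prerelease>[0-9A-Za-z-]+(?:\\.[0-9A-Za-z-]+)*))?' +
  '(?:\\+(?<build>[0-9A-Za-z-]+(?:\\.[0-9A-Za-z-]+)*))?' +
  '$'
);

// A full version string: every named group is populated.
const m = SEMVER_RELAXED.exec('1.2.3-rc.1+build.5');

// A leading zero in MAJOR is rejected, so exec returns null.
const rejected = SEMVER_RELAXED.exec('01.2.3');
```

After the match, `m.groups` holds `major`, `minor`, `patch`, `prerelease`, and `build` as strings.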

9 A pre-release version (...) Numeric identifiers MUST NOT include leading zeroes. (...)

As I read this, it means that a pre-release identifier like 0123456789 is simply not interpreted as a numeric identifier but as an alphanumeric one, and is thus compared lexically instead of numerically.


identifiers consisting of only digits are compared numerically and identifiers with letters or hyphens are compared lexically in ASCII sort order. Numeric identifiers always have lower precedence than non-numeric identifiers.

Never mind, it appears that numeric identifiers with leading zeros are not accepted at all or at least have no precedence defined which is pretty much the same.
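
The precedence rule quoted above can be sketched for a single pair of identifiers as follows. `compareIdentifiers` is a hypothetical helper, and it assumes its inputs have already been validated (so no digit-only identifier with a leading zero reaches it).

```javascript
// Hypothetical helper: compares two single pre-release identifiers.
// Returns -1, 0 or 1. Assumes both inputs are already valid identifiers.
function compareIdentifiers(a, b) {
  const aNumeric = /^\d+$/.test(a);
  const bNumeric = /^\d+$/.test(b);

  if (aNumeric && bNumeric) {
    // Digits only on both sides: compare numerically.
    // BigInt avoids precision loss on very long digit runs.
    const x = BigInt(a);
    const y = BigInt(b);
    return x < y ? -1 : x > y ? 1 : 0;
  }
  // A numeric identifier always has lower precedence than a non-numeric one.
  if (aNumeric) return -1;
  if (bNumeric) return 1;

  // Both contain letters or hyphens: compare by code unit,
  // which is ASCII order for these characters.
  return a < b ? -1 : a > b ? 1 : 0;
}
```

For example, `compareIdentifiers('2', '10')` is -1 (numeric comparison), whereas a plain string comparison would put '10' first.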

rugk referenced this issue in mojombo/semver Nov 3, 2015

Closed

Complete RegExp to verify version numbers #279

Collaborator

Haacked commented Aug 5, 2016

I'm cool with adding a new page that has a regex example. We just need to figure out how we'll change the layout of the site to accommodate such links. #57 has the same design issue.

fer-rum commented Jan 4, 2017 edited

The posted regex seems to have a little trouble with the spec: In §9 it states

Numeric identifiers MUST NOT include leading zeroes

So I assume that a version like 1.0.0-0123 should not be valid. However, the provided regex accepts it. I suppose the error is in the prerelease capture group, where
[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)* should be
[1-9A-Za-z-]+(\.[0-9A-Za-z-]+)*.

Also, am I correct that according to spec pre-release identifiers and build metadata behave differently with respect to the leading zero policy, since §10 misses the appropriate statement?
In this case the build capture group can be reduced to
[0-9A-Za-z]+, can't it?

Before I mess with the provided regex, it would be nice if someone could confirm or refute my suggestion.

Edit: Noticed the discrepancy between pre-release and build metadata spec.

Collaborator

Haacked commented Jan 4, 2017

Good catch. However, your change would also make 1.0.0-0abc invalid, but there's no reason that should be invalid.

Taking a step back, an identifier is either a numeric or an alphanumeric. That's a bit tricky to capture in Regex.

numeric : [1-9][0-9*]
alphanumeric: [0-9]*[A-Za-z-]+[0-9A-Za-z-]* At least one character must be non-numeric

Hence it'd combine to be something like ([1-9][0-9*])|([0-9]*[A-Za-z-]+[0-9A-Za-z-]*)

So complicated. 😦 Does that look correct?

FichteFoll commented Jan 9, 2017 edited

Looks good, except that the asterisk in the numeric pattern has to go out of the set.

You could also speed it up if the second set in the alpha was matched exactly once so the engine doesn't have to backtrack.

fer-rum commented Feb 10, 2017

@Haacked would you like to update the regex then (Including the suggestion by @FichteFoll )?

This is probably a regurgitation of previously discussed topics, but can a reference regex be put into the spec?

Another problem: Just 0 is not considered.

numeric : 0|[1-9][0-9]*
alphanumeric: [0-9]*[A-Za-z-][0-9A-Za-z-]* at least one character must be non-numeric
combined: 0|[1-9][0-9]*|[0-9]*[A-Za-z-][0-9A-Za-z-]*
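
A sketch of how the combined identifier pattern could be plugged into a full JavaScript regex. It assumes, as discussed earlier in this thread, that only pre-release identifiers carry the no-leading-zeros rule and that build identifiers stay unrestricted. All constant names are made up for this illustration, and ES2018 named groups are assumed.

```javascript
// Building blocks taken from the comment above.
const NUMERIC = '0|[1-9][0-9]*';
const ALPHANUMERIC = '[0-9]*[A-Za-z-][0-9A-Za-z-]*';
const PRERELEASE_ID = `(?:${NUMERIC}|${ALPHANUMERIC})`;
// Assumption: build identifiers may contain leading zeros.
const BUILD_ID = '[0-9A-Za-z-]+';

const SEMVER_STRICT = new RegExp(
  `^(?<major>${NUMERIC})\\.(?<minor>${NUMERIC})\\.(?<patch>${NUMERIC})` +
  `(?:-(?<prerelease>${PRERELEASE_ID}(?:\\.${PRERELEASE_ID})*))?` +
  `(?:\\+(?<build>${BUILD_ID}(?:\\.${BUILD_ID})*))?$`
);

// The edge cases raised in this thread.
const results = {
  digitsWithLeadingZero: SEMVER_STRICT.test('1.0.0-0123'), // numeric, leading zero
  zeroThenLetters: SEMVER_STRICT.test('1.0.0-0abc'),       // alphanumeric
  justZero: SEMVER_STRICT.test('1.0.0-0'),                 // the number zero itself
  buildLeadingZero: SEMVER_STRICT.test('1.0.0+0123'),      // build metadata
};
```

With this pattern `1.0.0-0123` is rejected, while `1.0.0-0abc`, `1.0.0-0`, and `1.0.0+0123` are accepted.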

fer-rum commented Feb 14, 2017

Why should the term "0" be valid? It is purely numeric, starts with a '0' so it should be invalid.

Side thought:
(Why are leading zeros in a numeric version excluded anyway?
Is it an explicit goal to parse the pattern as natural numbers if applicable?
What about negative numbers then?
Or hex-representation?
Can I introduce a leading '0' if I want the number to be interpreted as octal?
Specs are clear, but the intentions aren't.)

https://en.wikipedia.org/wiki/Leading_zero

Therefore, the usual decimal notation of integers does not use leading zeros except for the zero itself, which would be denoted as an empty string otherwise.

Leading zeroes are excluded for simplification and prevention of ambiguity. Is 0.01.1 bigger or smaller than 0.1.01? Are they equal? If they are equal, why do they not have the same string representation?

For negative numbers, I can only speculate. Considering that, in numbering, you want to "start" at a certain point and increase from that onward, it makes sense to have a generally specified starting point (i.e. lowest member) as zero instead of operating on the entire set of whole numbers, which is infinite in both directions.

It should be obvious to everyone that the spec speaks of decimal numbers in all places, which are the standard numbering system pretty much everywhere in the world, afaik.

fer-rum commented Feb 14, 2017 edited

It should be obvious to everyone that the spec speaks of decimal numbers in all places, which are the standard numbering system pretty much everywhere in the world, afaik.

The thing is that this will often be used by programming people who tend to think in octal sometimes. :)

Leading zeroes are excluded for simplification and prevention of ambiguity. Is 0.01.1 bigger or smaller than 0.1.01? Are they equal? If they are equal, why do they not have the same string representation?

This is not about the version numbers which are clearly decimal natural numbers w/o leading 0s.
In the build information, lexical sorting is the way to go, which states that the first one comes before the second one and both are not identical.

I am in favour of stating explicitly that natural decimal numbers (including 0 per se) without leading 0s are the only accepted form of purely numeric notation in the build information. (Alternatively one could go the C-way and assume that pure numeric expressions with leading 0s are interpreted as an octal number.)

Hexadecimal expressions should be no problem, they are usually prefixed by 0x, # or alphanumeric anyway.

I still am not aware why exactly the contents of the build information is restricted in such a way.
