Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

strutils/isUpperAscii and unicode/isUpper consider space, punctuations, numbers as "lowercase" #7963

Closed
kaushalmodi opened this issue Jun 5, 2018 · 36 comments

Comments

Projects
None yet
@kaushalmodi
Copy link
Contributor

commented Jun 5, 2018

Test:

import strutils except isUpper
import unicode
echo "A B".isUpperAscii()  # Returns false
echo "A B".isUpper()  # From unicode, Returns false
echo "AB?!".isUpper()  # From unicode, Returns false
echo "1, 2, 3 GO!".isUpper() # From unicode, Returns false

Just for precedent, Python, Emacs Lisp return a "true" for upper-case checking of "A B", "AB?!" and "1, 2, 3 GO!".


Python isupper vs Nim isUpper

@kaushalmodi kaushalmodi changed the title strutils/isUpperAscii and unicode/isUpper consider space as "lowercase" strutils/isUpperAscii and unicode/isUpper consider space, punctuations as "lowercase" Jun 5, 2018

@kaushalmodi kaushalmodi changed the title strutils/isUpperAscii and unicode/isUpper consider space, punctuations as "lowercase" strutils/isUpperAscii and unicode/isUpper consider space, punctuations, numbers as "lowercase" Jun 5, 2018

@kaushalmodi

This comment has been minimized.

Copy link
Contributor Author

commented Jun 5, 2018

Just to add, the isUpper documentation says:

Returns true iff s contains all upper case unicode characters.

I believe that all non-alpha characters should be skipped from the check for "upper case unicode characters".

.. because with the current logic.. both of the below return false:

echo "A B".isUpper()
echo "a b".isLower()

.. which seems counterintuitive.

It feels like the isUpper, isLower are conflating their behavior with that of isAlpha.

@andreaferretti

This comment has been minimized.

Copy link
Collaborator

commented Jun 6, 2018

Seems right to me, only alphabetic characters can be either uppoer or lower

@data-man

This comment has been minimized.

Copy link
Contributor

commented Jun 6, 2018

all non-alpha characters should be skipped from the check for "upper case unicode characters".

Why?

echo "A B".isUpper()
echo "a b".isLower()

Your tests contains space, it's non alpha. And this the right logic.

@kaushalmodi

This comment has been minimized.

Copy link
Contributor Author

commented Jun 6, 2018

@data-man That's what I said.. the isUpper/isLower checks are conflating with isAlpha.

Bear with me.. here is the reasoning..

  • Consider the current isUpper be named isUpperX
  • Consider the behavior I am suggesting be named isUpperY

Then currently str.isUpperX is behaving like str.isAlpha and str.isUpperY. We need to remove the isAlpha check out of isUpper. All of this applies to isLower too.

When we are dealing with real world text, parsing docs, web pages, we rarely face all alphabetical characters strings. We normally deal with sentences that usually contain at least a space. In that case both isAlpha and isUpper will always be false which is not very useful.

If the implementation is changed to isUpperY, getting the current behavior is as simple as anding with isAlpha. But deriving isUpperY from isUpperX as I need to do today is very hacky.

@kaushalmodi

This comment has been minimized.

Copy link
Contributor Author

commented Jun 6, 2018

Another way to think about this.. you would do an uppercase/lowercase check only on alphabets (or characters that can have upper and lower cases). You wouldn't do that check on spaces (and numbers, full stops, etc) as they don't have cases. So such characters should not affect the isUpper, isLower, etc.

Crude analogy follows..

It's like asking a 3-year old if he likes Guinness (beer), and if he says no, you don't order that for the whole table of 20 adults.

That space is being that 3-year old in that "A B".isUpper ().. it's giving vote for something it's not eligible for.

@data-man

This comment has been minimized.

Copy link
Contributor

commented Jun 6, 2018

We normally deal with sentences that usually contain at least a space.

No. Passwords and usernames almost always can't contain spaces.
You can implement your MyRightIsUpper for your needs.

@kaushalmodi

This comment has been minimized.

Copy link
Contributor Author

commented Jun 6, 2018

No. Passwords and usernames almost always can't contain spaces.

I agree and also almost is the key there. And consider how few cases of parsing such space-less, digit-less, punctuation-less are out there in the huge text parsing world. As I said, you can still get the current behavior by simply anding the proposed isUpperY with isAlpha.

@data-man

This comment has been minimized.

Copy link
Contributor

commented Jun 6, 2018

In the upcoming unicode module (RFC #7902) will be possible this:

let myRunes = alphas + spaces + digits + '.' + ','
if check(myString, myRunes):
...

@data-man data-man closed this Jun 6, 2018

@data-man data-man added the Unicode label Jun 6, 2018

@Yardanico

This comment has been minimized.

Copy link
Collaborator

commented Jun 6, 2018

@data-man why did you close this issue? New unicode module wasn't implemented yet

@Yardanico Yardanico reopened this Jun 6, 2018

@Araq

This comment has been minimized.

Copy link
Member

commented Jun 6, 2018

Well since these have been designed with Python compat in mind, they should follow Python's behaviour (as weird as that is to me).

@data-man

This comment has been minimized.

Copy link
Contributor

commented Jun 6, 2018

@Yardanico
Do you want to do double work? Well, go ahead!

@Yardanico

This comment has been minimized.

Copy link
Collaborator

commented Jun 6, 2018

@data-man what double-work? You would just make a PR with "fixes #7963", and as I can tell we didn't fully decide for the Unicode module RFC yet.

@kaushalmodi

This comment has been minimized.

Copy link
Contributor Author

commented Jun 6, 2018

@data-man Just from language semantics, wouldn't you agree that the following statement is all uppercase: "I AM USING ALL CAPS, BUT JUST AS AN EXAMPLE, NOT BEING RUDE."

If someone wants to parse space-less passwords, relying on isUpper for "space-less check " sounds wrong.. they should first use isAlpha, isAlphaNumeric or whatever that meets the valid password chars check.

@data-man

This comment has been minimized.

Copy link
Contributor

commented Jun 6, 2018

@kaushalmodi

wouldn't you agree that the following statement is all uppercase: "I AM USING ALL CAPS, BUT JUST AS AN EXAMPLE, NOT BEING RUDE."

Of course, I disagree. Spaces aren't letters.

If someone wants to parse space-less passwords

If someone wants to check whole sentences, he can use split and isUpper.

@data-man

This comment has been minimized.

Copy link
Contributor

commented Jun 6, 2018

The current behavior shouldn't change, this is a breaking change with unpleasant consequences for someone.

@kaushalmodi

This comment has been minimized.

Copy link
Contributor Author

commented Jun 6, 2018

The current behavior shouldn't change, this is a breaking change with unpleasant consequences for someone.

I have some data sample to disprove that! :D

I can of course see only the Nim code that in public on GitHub..

Github search of public repos containing isUpper in Nim files

This returned just 50 hits in all, many of which are false hits like proc g_unichar_isupper*(c: gunichar): gboolean{.cdecl, dynlib: gliblib,, or just hits in Nim forks.

I started documenting and understanding the use of isUpper in all those, and it got boring quite quickly.. below 13 rows cover the use of isUpper in public Nim code on GitHub in the last 2 years.

As you see my proposal won't break any of those.. in fact it's fixing the py2nim behavior.

Line Update year User Will break? Reason
1 regex.nim 2018 @nitely No isUpper used on Rune
2 Nim/pegs.nim 2018 @Araq No isUpper used on Rune
3 Nim/unicode.nim 2018 @Araq No/Yes? This is what we are fixing here!
4 bob/example.nim 2018 @amscotti No isUpper used on Rune
5 unicodeplus.nim 2018 @nitely No N/A - That library implements its own isUpper, again for Rune
6 string_methods.nim 2018 @metacraft-labs It will actually fix it Because that library translates Python isupper to Nim isUpperAscii.
7 builtin_methods.nim 2018 same as above same as above same as above
8 NimMirror/tpegs.nim 2017 @georgiy-pruss N/A -- Nim Mirror
9 NimMirror/unicode.nim 2017 same as above same as above
10 NimMirror/strutils.nim 2017 same as above same as above
11 NimMirror/pegs.nim 2017 same as above same as above
12 uetypes.nim 2017 @pragmagic No isUpper (deprecated version from strutils) used on char
13 another Nim mirror/tpegs.nim 2016 @FedericoCeratto N/A -- Nim Mirror

NOTE: My proposal is to fix isUpper, isUpperAscii, isLower, isLowerAscii only with the string input, not char or Rune input. See this summary: https://scripter.co/notes/string-fns-nim-vs-python/#is-upper -- The Nim implemention of the above 4 deviates from Python only for string input, and that too, only for strings that contain alphabetical chars too.

(Below "WRONG" is with respect to Python implementation.)

echo "[Wrong] ", "A B".isUpperAscii()  # Returns false! BUG https://github.com/nim-lang/Nim/issues/7963
echo "[Wrong] ", "A B".isUpper() # from unicode, Returns false! BUG https://github.com/nim-lang/Nim/issues/7963
echo "[Wrong] ", "AB?!".isUpper() # from unicode, Returns false! BUG https://github.com/nim-lang/Nim/issues/7963
echo "[Wrong] ", "1, 2, 3 GO!".isUpper() # from unicode, Returns false! BUG https://github.com/nim-lang/Nim/issues/7963
echo ' '.isUpperAscii() # checking this proc on a non-alphabet char
echo ' '.Rune.isUpper() # checking this proc on a non-alphabet Rune
echo "(*&#@(^#$ ".isUpper() # checking this proc on a non-alphabet string

Output:

[Wrong] false
[Wrong] false
[Wrong] false
[Wrong] false
false
false
false
@kaushalmodi

This comment has been minimized.

Copy link
Contributor Author

commented Jun 6, 2018

@data-man

Of course, I disagree. Spaces aren't letters.

That's a good excuse for violating general email/forum etiquette of not using all-caps :D

If someone wants to check whole sentences, he can use split and isUpper.

In real-world parsing, sentences don't have just spaces.. you have to consider commas, fullstops, other punc chars, digits, spaces with variety of ascii codes, hyphens with variety of ascii codes, so much more.

My point is plain and simple.. the current isUpper (and others) implementation is too specialized.. my proposal allows isUpper to be better composable with other functions.

  • If you don't care if there are non-alphabetical chars in the parsed string, and just want to check if whatever alphabetical chars are upper cased, use isUpper (my proposed version).
  • If additionally you want to check if the parsed string does not have any non-alphabetical chars, and that with isAlpha.
@data-man

This comment has been minimized.

Copy link
Contributor

commented Jun 6, 2018

I can of course see only the Nim code that in public on GitHub.

What about a private repos, GitLab, BitBucket, etc.?

P.S. I'll not argue anymore, because it's pointless.

@kaushalmodi

This comment has been minimized.

Copy link
Contributor Author

commented Jun 6, 2018

P.S. I'll not argue anymore, because it's pointless.

I have the same feeling. I am hoping that one or more of the folks using isUpper that I pinged in my above analysis comment chimes in with their opinion.

What about a private repos, GitLab, BitBucket, etc.?

I cannot help with that, and so I made it clear about "public GitHub repos". But if anything, at least from this data-sample, I proved my understanding of current isUpper implementation:

  • Folks cannot use isUpper elegantly for strings. So they resort to hacks or using just str.toUpper == str to mimick isUpper. See http://exercism.io/tracks/nim/exercises/bob to see what I mean.
  • My proposal does not break the use of isUpper and family for chars and Runes.
  • Even if this breaks things, the fix is as simple as anding with isAlpha to get back the current behavior.
  • The proposed change fixes this single-case incompatiblity with Python isupper: the case where a string (not char or Rune) contains at least one alphabetical char with optional non-alphabetical chars. But fixing this will make str.isUpper Do The Right Thing (TM), and you will see isUpper being then used for strings too in the wild.
  • It's not that drastic breakage (no breakage at all based on the above small sample set).. we are still pre-1.0 release.

I'll rest my arguments till we get more opinions; I'm now just repeating all the points I made earlier.

@Araq

This comment has been minimized.

Copy link
Member

commented Jun 6, 2018

We collect breaking changes and call the new thing version 0.19. Please fix it to be compatible with Python, these have been added for Python compat in the first place!

Nim's native way is s.allCharsInSet({'a'..'z', ' '}) etc

@dom96

This comment has been minimized.

Copy link
Member

commented Jun 6, 2018

IMO:

doAssert "I AM SHOUTING".isUpper

Should hold. Especially if it's consistent with Python (which from what I can tell it is?)

@kaushalmodi

This comment has been minimized.

Copy link
Contributor Author

commented Jun 6, 2018

@Araq

Please fix it to be compatible with Python, these have been added for Python compat in the first place!

Thank you!

@dom96

doAssert "I AM SHOUTING".isUpper

Should hold

Exactly! Below fails the assertion currently.

import unicode
doAssert "I AM SHOUTING".isUpper
@kaushalmodi

This comment has been minimized.

Copy link
Contributor Author

commented Jun 6, 2018

Here are the tests for isUpperAscii and isLowerAscii that should pass after this behavior change:

    doAssert isLowerAscii("a b")
    doAssert isLowerAscii("ab?!")
    doAssert isLowerAscii("1, 2, 3 go!")
    doAssert(not isLowerAscii("(*&#@(^#$ ")) # None of the string chars are alphabets

    doAssert isUpperAscii("A B")
    doAssert isUpperAscii("AB?!")
    doAssert isUpperAscii("1, 2, 3 GO!")
    doAssert(not isUpperAscii("(*&#@(^#$ ")) # None of the string chars are alphabets

kaushalmodi added a commit to kaushalmodi/Nim that referenced this issue Jun 6, 2018

kaushalmodi added a commit to kaushalmodi/Nim that referenced this issue Jun 7, 2018

Fix isUpperAscii, isLowerAscii for strings with non-alpha chars
This commit changes the behavior of isUpperAscii and isLowerAscii in
the strutils module -- only for the case where the input is a *string*
with a mix of alphabetic and non-alphabetic characters.

The new behavior mimics the Python isupper and islower behavior
i.e. non-alphabetic chars are ignored when checking if the whole
string is upper-case or lower-case.

    Before: doAssert(not "A B".isUpperAscii) passed
    Now:    doAssert "A B".isUpperAscii passes

.. and the similar for isLowerAscii.

To get the old behavior (the old behavior checked if the whole string
contained only alphabetic characters AND if all characters were
upper/lower case), simply do:

    str.isAlphaAscii and str.isUpperAscii
    str.isAlphaAscii and str.isLowerAscii

(where str is a variable of type string)

See nim-lang#7963 for more info.

kaushalmodi added a commit to kaushalmodi/Nim that referenced this issue Jun 7, 2018

Fix isUpper, isLower for strings with non-alpha unicode chars
Fixes nim-lang#7963.

This commit changes the behavior of isUpper and isLower in the
unicode module -- only for the case where the input is a *string*
with a mix of alphabetic and non-alphabetic unicode characters.

The new behavior mimics the Python isupper and islower behavior
i.e. non-alphabetic chars are ignored when checking if the whole
string is upper-case or lower-case.

    Before: doAssert(not "A B".isUpper) passed
    Now:    doAssert "A B".isUpper passes

.. and the similar for isLower.

To get the old behavior (the old behavior checked if the whole string
contained only alphabetic unicode characters AND if all characters
were upper/lower case), simply do:

    str.isAlpha and str.isUpper
    str.isAlpha and str.isLower

(where str is a variable of type string)

kaushalmodi added a commit to kaushalmodi/Nim that referenced this issue Jun 7, 2018

Fix isUpperAscii, isLowerAscii for strings with non-alpha chars
This commit changes the behavior of isUpperAscii and isLowerAscii in
the strutils module -- only for the case where the input is a *string*
with a mix of alphabetic and non-alphabetic characters.

The new behavior mimics the Python isupper and islower behavior
i.e. non-alphabetic chars are ignored when checking if the whole
string is upper-case or lower-case.

    Before: doAssert(not "A B".isUpperAscii) passed
    Now:    doAssert "A B".isUpperAscii passes

.. and the similar for isLowerAscii.

To get the old behavior (the old behavior checked if the whole string
contained only alphabetic characters AND if all characters were
upper/lower case), simply do:

    str.isAlphaAscii and str.isUpperAscii
    str.isAlphaAscii and str.isLowerAscii

(where str is a variable of type string)

See nim-lang#7963 for more info.

kaushalmodi added a commit to kaushalmodi/Nim that referenced this issue Jun 7, 2018

Fix isUpper, isLower for strings with non-alpha unicode chars
Fixes nim-lang#7963.

This commit changes the behavior of isUpper and isLower in the
unicode module -- only for the case where the input is a *string*
with a mix of alphabetic and non-alphabetic unicode characters.

The new behavior mimics the Python isupper and islower behavior
i.e. non-alphabetic chars are ignored when checking if the whole
string is upper-case or lower-case.

    Before: doAssert(not "A B".isUpper) passed
    Now:    doAssert "A B".isUpper passes

.. and the similar for isLower.

To get the old behavior (the old behavior checked if the whole string
contained only alphabetic unicode characters AND if all characters
were upper/lower case), simply do:

    str.isAlpha and str.isUpper
    str.isAlpha and str.isLower

(where str is a variable of type string)
@mratsim

This comment has been minimized.

Copy link
Collaborator

commented Jun 8, 2018

tagging @krux02 since I also had some ideas on where strutils should go forward (the trash? :P).

For me spaces shouldn't false isUpper or isLower, otherwise those are completely useless.

@nitely

This comment has been minimized.

Copy link
Contributor

commented Jun 8, 2018

@kaushalmodi I think a better metric would be checking what other languages do. I doubt many libraries make use of isUpper/isLower, the only time I used them was for case swapping, and that was 'cause there is no swapCase(Rune). That said, I think this should be always valid: assert isUpper(toUpper(mystring)) == true which currently is not true.

@Araq

This comment has been minimized.

Copy link
Member

commented Jun 8, 2018

That said, I think this should be always valid: assert isUpper(toUpper(mystring)) == true which currently is not true.

That is NOT true if we copy Python's design! Because isUpper("") is false.

@nitely

This comment has been minimized.

Copy link
Contributor

commented Jun 8, 2018

@Araq

This comment has been minimized.

Copy link
Member

commented Jun 8, 2018

Same for isUpper(";: "), there are plenty of special cases here...

@nitely

This comment has been minimized.

Copy link
Contributor

commented Jun 8, 2018

I realized that and edited my comment.

kaushalmodi added a commit to kaushalmodi/Nim that referenced this issue Jun 8, 2018

Make isUpper (and variants) work for strings with non-alpha chars
The other variants are isLower, isUpperAscii and isLowerAscii

Fixes nim-lang#7963.

This commit changes the behavior and signatures of:

- isUpper, isLower in the unicode module
- isUpperAscii, isLowerAscii in the strutils module

A second mandatory parameter skipNonAlpha is added to these 4 procs.

(This change affects only for the case where the input is a *string*.)

---

With skipNonAlpha set to true, the behavior mimics the Python isupper and
islower behavior i.e. non-alphabetic chars/runes are ignored when checking if
the string is upper-case or lower-case.

    Before this commit:

      doAssert(not isUpper("A B"))

    After this commit:

      doAssert(not isUpper("A B", false))    <-- old behavior
      doAssert isUpper("A B", true)

      Below two are equivalent:

                           isUpper("A B", true)

        isAlpha("A B") and isUpper("A B", false)

.. and the similar for other 3 procs.

kaushalmodi added a commit to kaushalmodi/Nim that referenced this issue Jun 8, 2018

Make isUpper (and variants) work for strings with non-alpha chars
The other variants are isLower, isUpperAscii and isLowerAscii

Fixes nim-lang#7963.

This commit changes the behavior and signatures of:

- isUpper, isLower in the unicode module
- isUpperAscii, isLowerAscii in the strutils module

A second mandatory parameter skipNonAlpha is added to these 4 procs.

(This change affects only for the case where the input is a *string*.)

---

With skipNonAlpha set to true, the behavior mimics the Python isupper and
islower behavior i.e. non-alphabetic chars/runes are ignored when checking if
the string is upper-case or lower-case.

    Before this commit:

      doAssert(not isUpper("A B"))

    After this commit:

      doAssert(not isUpper("A B", false))    <-- old behavior
      doAssert isUpper("A B", true)

      Below two are equivalent:

                           isUpper("A B", true)

        isAlpha("A B") and isUpper("A B", false)

.. and the similar for other 3 procs.
@hlaaftana

This comment has been minimized.

Copy link
Contributor

commented Jun 9, 2018

see what other languages with a rich stdlib do

Python is the only language I personally know that has isUpper/isLower for strings in its standard library (perhaps it's forced to since it's dynamic). Other languages only have it for chars and codepoints. In Apache Commons Lang however the StringUtils class has isAllUpperCase/isAllLowerCase methods that do what the Nim methods do currently. As bloaty as it is, my suggestion would be to change isUpper/isLower behaviour and add isAllUpper/isAllLower for the old behaviour.

@nitely

This comment has been minimized.

Copy link
Contributor

commented Jun 9, 2018

@hlaaftana I didn't think that would be the case, but you are right, I checked: go, java, rust, swift, ruby, C++, C# and none of them has isUpper for strings, only for code points 😮

@data-man

This comment has been minimized.

Copy link
Contributor

commented Jun 9, 2018

@nitely
Julia has is* for strings.
With the correct behavior. :-)

@hlaaftana

This comment has been minimized.

Copy link
Contributor

commented Jun 9, 2018

@data-man Except they're deprecated now with all(isupper, str) being the alternative.

@GULPF

This comment has been minimized.

Copy link
Member

commented Jun 9, 2018

IMO if those methods are not compatible with python they might as well be removed, they are nearly useless anyway. The fact that not even a single statically typed language has been found that implements this proc should be an indication that it might not be that useful...

No. Passwords and usernames almost always can't contain spaces.

In the entire discussion about this, I think this is the only actual use case that has been mentioned for either of the implementations, and this use case seems strange. Why would a website care if my username only contains uppercase letter? Passwords sometimes has rules, but they are typically much more complex than just "not entirely uppercase" (and fwiw, it's very uncommon that passwords can't contain whitespace).

@data-man Except they're deprecated now with all(isupper, str) being the alternative.

That's what Nim should do as well IMO. There are like ~20 procs in strutils that are just bloat on top of allCharsInSet and the in operator.

@Araq Araq closed this in #8001 Jun 9, 2018

dom96 added a commit that referenced this issue Jun 9, 2018

@krux02

This comment has been minimized.

Copy link
Contributor

commented Jun 10, 2018

Well since I got tagged here, I will also provide my two cents. I don't like the idea to copy Python semantics just for the sake of copying python semantics. If Python has reasonable design decisions behind it's library and those dosign decisions do not stand in conflict with priorities that you would want in a statically typed compiled language, then yes I think it is a good idea to design the nim library around an established system.

So please provide different reasoning than "this is how it is done in python". For me isUpper(' ') == false is totally fine, because a space is neither upper case nor lower case. So also isUpper("A B") is false, because not every character in that string is strictly upper case. What you want to check semantically is "does not contain lower case letters". so maybe something like this is what you are looking for: {'a'..'z'} notin "A B". Maybe you want to have some predefined sets, too. So lowercaseASCII notin "A B".

@timotheecour

This comment has been minimized.

Copy link
Contributor

commented Aug 10, 2018

@krux02

So also isUpper("A B") is false

see #8600 in which I argue we can completely avoid this senseless debate by deprecating isUpper(string) and only have isUpper on char, Rune

Araq added a commit that referenced this issue Oct 14, 2018

krux02 added a commit to krux02/Nim that referenced this issue Oct 15, 2018

narimiran added a commit to narimiran/Nim that referenced this issue Oct 31, 2018

narimiran added a commit to narimiran/Nim that referenced this issue Nov 1, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.