Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: words should drop apostrophes #123

Merged

Conversation

jly36963
Copy link
Contributor

@jly36963 jly36963 commented Dec 2, 2023

Words and Apostophes

Words and related casing utilities have weird behavior for words containing '.
I updated the logic to be more like lodash's (removing apostrophes before splitting words)

Relates to, but is not a fix for: #122

Words

// Old behavior
type test = Expect<Equal<Words<"don'tGo">, ["don", "'", "t", "Go"]>>;
// Updated behavior
type test = Expect<Equal<Words<"don'tGo">, ["dont", "Go"]>>;

Casing

// Old behavior
type test = Expect<Equal<SnakeCase<"don'tGo">, "don_'_t_go">>;
// Updated behavior
type test = Expect<Equal<SnakeCase<"don'tGo">, "dont_go">>;

export type WeirdTextUnion = typeof WEIRD_TEXT | 'dont.distribute unions'
export type WeirdTextUnion =
| typeof WEIRD_TEXT
| "where's the leak ma'am"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where's the leak ma'am?
image

@@ -6,6 +6,7 @@ import type { IsDigit } from './characters/numbers.js'
import type { IsSpecial } from './characters/special.js'
import type { IsStringLiteral } from '../internal/literals.js'

// prettier-ignore
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't want the type to turn into indentation hell, open to suggestions for a better solution

@jly36963 jly36963 marked this pull request as ready for review December 2, 2023 15:48
Copy link
Owner

@gustavoguichard gustavoguichard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good! Appreciate you jumping on it with a good strategy.

@p9f
Copy link
Collaborator

p9f commented Dec 4, 2023

I updated the logic to be more like lodash's

I would like to challenge this. Lodash approach is more complex than just "removing apostrophes before splitting words" because lodash behaviour will be different for where's (treated as one word) and ma'am (treated as two words) (you can see the code here, there is more variations of this logic.).

So with the test sentence "Where's the leak ma'am?", we have:

  • lodash: ["Where's", "the", "leak", "ma", "am"]
  • string-ts (current): ["Where", "'", "s", "the", "leak", "ma", "'", "am"]
  • string-ts (pr): ["Where's", "the", "leak", "ma'am"]
    as you can see these are three different values.

I think I would be in favour of removing all special characters, but still count them as a word separator - i.e. go for:
["Where", "s", "the", "leak", "ma", "am"].


also - do we agree this is a breaking change that should wait for a v2?
if yes, we might want to think about this very carefully, because we don't a new major versions to often, so we might want to think about other breaking changes (like lodash remove all special characters like %, &, ... and we currently don't).

@jly36963
Copy link
Contributor Author

jly36963 commented Dec 4, 2023

I would like to challenge this. Lodash approach is more complex than just "removing apostrophes before splitting words".

_.camelCase("Where's the leak ma'am")
// "wheresTheLeakMaam" 

_.snakeCase("Where's the leak ma'am") 
// "wheres_the_leak_maam" 

I guess they do the apostrophe removal at the casing level, before it is passed into words.


We could move the removal of apostrophes to the casing function like they do here. This might be nice for a couple of reasons:

  • Words and the casing functions would align with lodash
  • We could choose whether or not to do the apostrophe removal for different casing functions
    • Machine readable (snake/camel/kebab): Remove apostrophe
    • Human readable (title): Keep apostrophe

Can you think of scenarios where words should be separated based on an apostrophe?

  • ma'am should definitely be one word
  • Common contractions should be one word
  • Not sure about other languages
    • ballon d'or -- "golden" should be one word, right?

Either way, SnakeCase<"ma'am"> // ma_'_am is definitely the wrong answer

@gustavoguichard
Copy link
Owner

Yes, I agree with @p9f that we should think carefully about (any) change in the words method.
I like @jly36963 proposal from his latest comment.

If we just "fixed" machine readable casing functions, would that be a major version?

I'm also a little afraid of going down the "perfect grammar" route bc it can lead to a much more complex code.
Wdyt?

@jly36963
Copy link
Contributor Author

jly36963 commented Dec 4, 2023

Based on this issue, it might be worth:

  • keeping ' at the words level
    • ["Where's", "the", "leak", "ma'am"]
    • Note that this is still a change, as ' isn't treated as a new word
  • removing ' at the (machine-readable) casing level, before the input is passed into Words
    • wheres_the_leak_maam

@jly36963 jly36963 marked this pull request as draft December 4, 2023 13:41
@gustavoguichard
Copy link
Owner

It does makes sense. I tend to adopt the same positioning of J Dalton there: limiting the scope to avoid creating a lil monster, but with that said if no one can come up with a con for that we can add that behavior.
AFAICT in both English and Portuguese it will work.. not sure about other languages

@p9f
Copy link
Collaborator

p9f commented Dec 4, 2023

@jly36963 ballon d'or -- "golden" should be one word, right?

fwiw in french I think "ballon", "d", "or" would be better since d is the contraction of de, a different word from or (kinda ball [made] o[f] gold).
I don't think that really matter though - surely French is out of scope.


I don't see any downside in your latest solution right now, even though I think we should take our time to consider all the implications. It seems the best of both world.

What do you think of, in the same release, changing the logic for special character so that they don't create artificial word? i.e. converting hello, world into hello, world instead of the current hello, ,, world (and this for all character except letters, digits and single quote.

@jly36963
Copy link
Contributor Author

jly36963 commented Dec 5, 2023

I'm not too sure what to do with the different special characters scenarios.
It's probably beyond the scope of this PR.

Scenarios:

  • abc,def -> abc def
    • Treat comma like separator?
  • best-f$%^ing-birria -> best f$%^ing birria
    • ignore non-separator special chars (don't remove or split word)
  • cão -> cão
    • ignore diacritic/foreign letters
  • abc123%^& -> (same)
    • Don't separate when switching between types, only split on separators and case changes.
    • vscode does this with built-in snake/camel case utilities.

I've been thinking of a future solution, maybe something like:

  • Finite list of "separator characters"
  • Finite list of removable "special characters" (!@#$%^&...)
  • Words only splits on "separator characters" and case changes.
  • Casing functions remove "special characters"
    • Maybe add boolean flag to each casing function, default to true?
  • diacritic/foreign letters and emojis are left alone (not removed or split on)

I could put this in a github discussion

@jly36963 jly36963 marked this pull request as ready for review December 7, 2023 14:04
@gustavoguichard
Copy link
Owner

I'm not too sure what to do with the different special characters scenarios. It's probably beyond the scope of this PR.
Keep in mind that supporting international characters is out of the scope of the library at all.

: IsSeparator<T> extends true
? false
: true
? false
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably disable prettier here too... but for other PR

Copy link
Owner

@gustavoguichard gustavoguichard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Just to confirm.. do we all agree it is a fix or is it a breaking change?

WHat are your opinions @jly36963 @p9f ?

@jly36963
Copy link
Contributor Author

jly36963 commented Dec 7, 2023

@gustavoguichard I'm okay either way.

I think a majority of people would expect the updated behavior (eg: wheres over where_'_s), but there's always a chance that someone is relying on the current behavior.

@gustavoguichard
Copy link
Owner

@jly36963 I'd do a major change ;)

@gustavoguichard gustavoguichard merged commit b378b23 into gustavoguichard:main Dec 8, 2023
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants