Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch slugify regex to support more Unicode character groups #8167

merged 1 commit into from May 14, 2020


Copy link

@swanson swanson commented May 8, 2020

This is a 馃檵 feature or enhancement.


Modifies the Regex to support a wider range of Unicode character groups. The original issue has more context on why the existing Regex was not working as expected for Tamil (or other Unicode characters), including external references to a possible fix.

This was deemed to be a breaking change in the original issue. There are no failing tests to help guide us in confirming that it would break existing sites, but it does seem possible.

The original issue mentioned making this change opt-in. We're happy to support this but we were not sure how to best implement that.

Possible proposal: add a new slug mode option (something like pretty_unicode -- open to ideas on the name) that would apply this more advanced regex instead.

Any ideas on how to proceed? We don't expect this to get merged as-is, happy to continue a discussion here.


See #7973

Paired with @josh-works so CCing him :)

@DirtyF DirtyF requested a review from a team May 8, 2020
@DirtyF DirtyF added the fix label May 8, 2020
@DirtyF DirtyF modified the milestones: 4.1, 4.2 May 8, 2020
mattr- approved these changes May 12, 2020
Copy link

@mattr- mattr- left a comment

Considering the fact that this is a bug fix, I'm inclined to include it in the next release, even though site urls will change because of it.

@jekyll/core I'd appreciate your thoughts on this approach.

Copy link

ashmaroli commented May 12, 2020

even though site urls will change because of it

@mattr- How about adding a configuration option that enables the new regexes..? Opt-in for the change..?

@DirtyF DirtyF modified the milestones: 4.2, 4.1 May 12, 2020
Copy link

josh-works commented May 13, 2020

馃憢 I paired with @swanson, as we put together the fix.

I wonder if making this an opt-in feature might accidentally select against benefiting people most likely to benefit from this fix.

The original bug report came from the slugify method wrongly handling Tamil characters.

I suspect that most of the users of this tool are not impacted by this error, which is why it didn't get reported until recently, in @deepestblue's detailed bug report.

This is certainly an edge-case fix, but I'm afraid that making it an opt in feature, the populations most positioned to benefit from the fix would now be implicitly required to go dig deep through the source code to figure out how to use this fix in a way that targets their use case.

Especially if the bug fix aids in slugifying non-standard characters, certainly all individuals who would use Jekyll for Tamil-language websites, or any other non-English website, would be expected to read source code in English to accommodate their non-English use case.

Could it be possible to package this into some sort of Beta build, or get insights from individuals who are close to the target use case, to see if this fix introduces breaking changes?

This is also a function of me not understanding how to make this an opt-in feature.

@ashmaroli, could you point me in the right direction on how this would be made opt-in, and where one would go to see that this feature would be available?

Copy link
Contributor Author

swanson commented May 13, 2020

I think there is a clear path to making this opt-in in the context of being used as a filter: - we could add and document an additional mode (as how latin mode was added in 3.7.0)

It is less clear how to resolve the use-case for generating post/page URLs (which was the problem outlined in the initial issue). It appears that there is not an existing codepath for passing mode options:

def title
Utils.slugify(["slug"], :mode => "pretty", :cased => true) ||
Utils.slugify(@obj.basename_without_ext, :mode => "pretty", :cased => true)

I'm not against adding such a path (if that is deemed appropriate), but it is fairly far-reaching so wanted to check-in before starting a bigger effort.

Copy link

mattr- commented May 13, 2020

even though site urls will change because of it

@mattr- How about adding a configuration option that enables the new regexes..? Opt-in for the change..?

I don't want yet another configuration option for this.

Copy link

ashmaroli commented May 14, 2020

For the record, I only suggested to use a config option because this has a chance of automatically altering existing URLs (albeit edge-cases).
But clearly, nobody is in favor of that. So, I concede.

Copy link

ashmaroli commented May 14, 2020

@jekyllbot: merge +fix

@jekyllbot jekyllbot merged commit f8286b6 into jekyll:master May 14, 2020
7 checks passed
@jekyllbot jekyllbot added the bug label May 14, 2020
Copy link

ashmaroli commented May 14, 2020

Thank you @swanson and @josh-works for your contribution.
This will be included in Jekyll 4.1

@swanson swanson deleted the tamil-slug-issue branch May 14, 2020
@jekyll jekyll locked and limited conversation to collaborators May 14, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
None yet

Successfully merging this pull request may close these issues.

None yet

6 participants