WIP - Improve permalink generation for URLs with special characters #944

Merged
merged 7 commits into from Aug 28, 2013

Projects

None yet

7 participants

@x3ro
Contributor
x3ro commented Apr 9, 2013

This PR is a WIP attempt to solve #782 😄 Since I'm not sure what the best possible way to proceed is, I'm posting it to get some feedback.

I've already refactored everything I found that is related to URL generation into the URL module, which is now included in Post and Page. Those two only specify what placeholders they'd like to have replaced by means of url_placeholders and the URL#url method does the rest. Note that in the current url_placeholders method there is still some URL escaping going on, which I'll still refactor).

In the current state, all tests still pass, and the next step would be to actually alter/improve the current URL generation, which, as described in #782, has a few shortcomings regarding special characters.

The general question is, how should the generated URLs look, and is it acceptable to make breaking changes in permalink generation. I've already played around a bit, and found Stringex, which seems to generate fairly useful output, but using it would break the current URL generation tests, since some of the output is simply different (for example, it converts everything to lowercase, always). A more obvious downside is the fact that it tries to do stuff like this though:

"10% off if you act now".to_url => "10-percent-off-if-you-act-now"

While awesome, this creates the need to specify a locale for permalink generation, which might one could consider overkill.

I'd really like to hear your opinions on this one.

@x3ro x3ro Refactor URL processing/generation into separate module
This is done to prepare for improved permalink generation
for URLs containing special characters, as proposed in
issue #782
cfcbe1f
@x3ro
Contributor
x3ro commented May 29, 2013

I still think that this could be merged even if the decision on how to modify permalinks will not be made anytime soon. As it currently stands, this PR is basically refactoring, with the advantage of allowing own url_placeholders to be defined 😄

@kelvinst kelvinst commented on the diff Jul 10, 2013
lib/jekyll/url.rb
+ def url
+ @url ||= sanitize_url(permalink || generate_url)
+ end
+
+ # Generate the URL by replacing all placeholders with their respective values
+ #
+ # Returns the _unsanitizied_ String URL
+ def generate_url
+ url_placeholders.inject(template) { |result, token|
+ result.gsub(/:#{token.first}/, token.last)
+ }
+ end
+
+ # Returns a sanitized String URL
+ def sanitize_url(in_url)
+ # Remove all double slashes
@kelvinst
kelvinst Jul 10, 2013

Just a self-opinion, some comments (especially those that only explain what the code does) are unnecessary, like this and the others below...

@mattr-
mattr- Jul 10, 2013 Member

Technically, you're right, they're unnecessary. However, we've chosen to
employ TomDoc for our internal documentation and these comments serve as
TomDoc formatting documentation.

On Wed, Jul 10, 2013 at 1:27 PM, Kelvin Raffael Stinghen <
notifications@github.com> wrote:

In lib/jekyll/url.rb:

  • def url
  •  @url ||= sanitize_url(permalink || generate_url)
    
  • end
  • Generate the URL by replacing all placeholders with their respective values

  • Returns the unsanitizied String URL

  • def generate_url
  •  url_placeholders.inject(template) { |result, token|
    
  •    result.gsub(/:#{token.first}/, token.last)
    
  •  }
    
  • end
  • Returns a sanitized String URL

  • def sanitize_url(in_url)
  •  # Remove all double slashes
    

Just a self-opinion, some comments (especially those that only explain
what the code does) are unnecessary like this and the others below...


Reply to this email directly or view it on GitHubhttps://github.com/mojombo/jekyll/pull/944/files#r5122795
.

@kelvinst
kelvinst Jul 10, 2013

OK, was just an opinion, just when I get a code with too much inline comments (comments like "now I'll do that" and "now that"), my first thought was "this code is as incomprehensible as well?"... But in this case is just a single case, it's just that I've seen worse things, and considered a bad practice...

@x3ro
x3ro Jul 10, 2013 Contributor

In my opinion, it depends on the complexity of the lines being explained. In this concrete case, I like how I can read through the code like this:

  1. Remove all double slashes
  2. Remove every URL segment that consists solely of dots
  3. Append a trailing slash to the URL if the unsanitized URL had one

It is, again in my opinion, way nicer than having to read this:

  1. url = in_url.gsub(/\/\//, "/")
  2. url = url.split('/').reject{ |part| part =~ /^\.+$/ }.join('/')
  3. url += "/" if in_url =~ /\/$/

(especially for the second line)

@kelvinst
kelvinst Jul 11, 2013

Yep, nice... It's pretty more readable, I'm agreed

@parkr
parkr Jul 26, 2013 Member

Maybe each of these could be split into separate descriptive methods and put together to achieve the sanitized URL? String sanitation is a notoriously ugly process and maybe breaking things out into separate private methods could aid in elucidating the process.

@kelvinst
kelvinst Jul 26, 2013

I has thought in this suggestion too, but, in my thoughts, each extracted method would be very tiny... But, thinking a little more now, I prefer three separeted tiny methods that will be used only one time than "inline comments"... Good suggestion (= Thanks @parkr

@maul-esel
maul-esel Jul 26, 2013 Contributor

This goes in a similar direction as the discussion in #1341, partly realized here. It's intended for paths, but it works for URLs as well (there's scarcely any difference between a relative URL and a relative path anyway).

@kelvinst

So @x3ro, I'd take a look on your change, seems like it's only an refactor right? The changes doesn't really change the operation and doesn't change any special characters, right?

I'd like the refactors and it's good enough to merge for me...

@x3ro
Contributor
x3ro commented Jul 10, 2013

Yes, as it stands its only a refactor. It was done with improving permalinks in mind (see #782), but none of that is contained in this PR 😄

@kelvinst

So @mattr- and @parkr, what do you think about merge this? I'd revised the code and don't detected any change to the bahavior too... maybe this will improve code climate a little more

@mattr-
Member
mattr- commented Jul 10, 2013

Can this be updated to the latest version of the master branch? We've made some changes in this area of the code lately and can't merge it without an update. Thanks!

@x3ro
Contributor
x3ro commented Jul 11, 2013

I'll see if I can get that done today :)

@x3ro
Contributor
x3ro commented Jul 11, 2013

Unfortunately, the current master has a couple of tests failing, and bisect points to bd0e45c as the offending commit (ping @parkr)

I've updated my local branch, but I'd rather have those failing tests fix so that I can make sure everything is okay on my side 😄

Edit: Hmm, that commit seems a little old, though, maybe it's the wrong one. Tests currently failing do so because of NameError: uninitialized constant Jekyll::Pager::Pathname, e.g.

      1) Error:
test: Pager should determine the pagination path. (TestPager):
NameError: uninitialized constant Jekyll::Pager::Pathname
@parkr
Member
parkr commented Jul 11, 2013

@x3ro On the latest master, all tests pass for me and seem to be fine on Travis. What version of Ruby are you using? Try adding require 'pathname' to lib/jekyll.rb under # stdlib and try again.

@maul-esel
Contributor

So I'm not the only one. After another merge of master, this recently occured again on my side as well - miraculously fixed the next day. What sorcery is this?

@x3ro
Contributor
x3ro commented Jul 11, 2013

@parkr Fascinatingly enough, the problem went away after upgrading to 1.9.3p448... Should've tried that before bugging you guys ;)

@x3ro x3ro Merge branch 'master' into permalink-special-characters
Conflicts:
	lib/jekyll/page.rb
	lib/jekyll/post.rb
2ac98a7
@x3ro
Contributor
x3ro commented Jul 14, 2013

@mattr- Finally got this done! :shipit:

@mattr-
Member
mattr- commented Jul 15, 2013

LGTM. @parkr?

@maul-esel maul-esel commented on the diff Jul 23, 2013
lib/jekyll/page.rb
@@ -37,7 +38,12 @@ def dir
#
# Returns the String permalink or nil if none has been set.
def permalink
- self.data && self.data['permalink']
+ return nil if self.data.nil? || self.data['permalink'].nil?
+ if site.config['relative_permalinks']
+ File.join(@dir, self.data['permalink'])
+ else
+ self.data['permalink']
+ end
@maul-esel
maul-esel Jul 23, 2013 Contributor

May I ask, what effect do these changes have?

@x3ro
x3ro Jul 23, 2013 Contributor

If only I knew. The logic was introduced by @parkr in 1f23bc4 and 0e82b4e - I just moved it away from its current location.

@maul-esel
maul-esel Jul 23, 2013 Contributor

Oh sorry, I didn't see that.

@parkr
parkr Jul 23, 2013 Member

In jekyll pre-v1.0, you could specify a permalink as relative to its dir. Pretty cool actually

@maul-esel maul-esel commented on an outdated diff Jul 23, 2013
lib/jekyll/url.rb
+#
+#
+
+module Jekyll
+ module URL
+
+ # The generated relative url of this page. e.g. /about.html.
+ #
+ # Returns the String url.
+ def url
+ @url ||= sanitize_url(permalink || generate_url)
+ end
+
+ # Generate the URL by replacing all placeholders with their respective values
+ #
+ # Returns the _unsanitizied_ String URL
@maul-esel
maul-esel Jul 23, 2013 Contributor

Little typo here.

@parkr
Member
parkr commented Jul 23, 2013

I think I'd prefer it if it weren't a module which was included, but rather a class that was instantiated with the various options and a resource.

@kelvinst

Agreed, that way the code will less intrusive, cause doesn't need to remember to implement the url_placeholder method (this placeholders hash can be passed through parameters in the initializer of the instantiated object)

👍 to @parkr idea

@parkr
parkr commented on f5d0be9 Jul 26, 2013

I ❤️ this refactor. Thanks!

@parkr parkr and 1 other commented on an outdated diff Jul 26, 2013
lib/jekyll/page.rb
- # sanitize url
- @url = url.split('/').reject{ |part| part =~ /^\.+$/ }.join('/')
- @url += "/" if url =~ /\/$/
- @url.gsub!(/\A([^\/])/, '/\1')
- @url
+ # See url.rb for an explanation
@parkr
parkr Jul 26, 2013 Member

Maybe this comment could describe what it's providing for the URL class?

@parkr
parkr Jul 26, 2013 Member

We're using TomDoc - please follow the guidelines there. :)

@x3ro
x3ro Jul 26, 2013 Contributor

Wouldn't it be better to keep the docs in the url.rb file in order to avoid duplication?

@parkr
parkr Jul 26, 2013 Member

This method isn't in your url.rb though. If I come across this method, I want to know what it's about, and it's only ever used in #url, there isn't much for me to go from. I want to know what this method is doing and this comment says "go figure it out in this other file, best of luck"

@parkr parkr commented on an outdated diff Jul 26, 2013
lib/jekyll/url.rb
+
+ # The generated relative URL of the resource
+ #
+ # Returns the String URL
+ def to_s
+ sanitize_url(@permalink || generate_url)
+ end
+
+ # Internal: Generate the URL by replacing all placeholders with their
+ # respective values
+ #
+ # Returns the _unsanitizied_ String URL
+ def generate_url
+ @placeholders.inject(@template) { |result, token|
+ result.gsub(/:#{token.first}/, token.last)
+ }
@parkr
parkr Jul 26, 2013 Member

We generally prefer do ... end for blocks on multiple lines. :)

@parkr parkr commented on the diff Jul 26, 2013
lib/jekyll/url.rb
+# }).to_s
+#
+module Jekyll
+ class URL
+
+ # options - One of :permalink or :template must be supplied.
+ # :template - The String used as template for URL generation,
+ # for example "/:path/:basename:output_ext", where
+ # a placeholder is prefixed with a colon.
+ # :placeholders - A hash containing the placeholders which will be
+ # replaced when used inside the template. E.g.
+ # { "year" => Time.now.strftime("%Y") } would replace
+ # the placeholder ":year" with the current year.
+ # :permalink - If supplied, no URL will be generated from the
+ # template. Instead, the given permalink will be
+ # used as URL.
@parkr
parkr Jul 26, 2013 Member

👍 AWESOME comment.

@parkr parkr commented on the diff Jul 26, 2013
lib/jekyll/url.rb
@@ -0,0 +1,67 @@
+# Public: Methods that generate a URL for a resource such as a Post or a Page.
+#
+# Examples
+#
+# URL.new({
+# :template => /:categories/:title.html",
+# :placeholders => {:categories => "ruby", :title => "something"}
@parkr
parkr Jul 26, 2013 Member

Should these Symbol keys be String keys?

:placeholders => { "categories" => "ruby", "title" => "something" }
@x3ro
x3ro Jul 26, 2013 Contributor

Hmm... I used symbols since it lined up nicely with the syntax used in the template string, but you're right that the url_placeholders methods use strings. Is there an advantage of using strings over symbols?

@parkr
parkr Jul 26, 2013 Member

The only difference I can think of is saving memory and CPU cycles to convert. In Ruby < 2.0, it's just easier and more efficient to use Strings.

@jpiasetz
jpiasetz Jul 29, 2013 Contributor

@parkr really? I thought the opposite was true. Strings are fat and symbols are fast. This in 1.9.3 is almost 5 times faster with symbols.

require 'benchmark'

str = Benchmark.measure do
  10_000_000.times do
    "test"
  end
end.total

sym = Benchmark.measure do
  10_000_000.times do
    :test
  end
end.total

puts "String: " + str.to_s
puts "Symbol: " + sym.to_s
puts
@parkr
parkr Jul 29, 2013 Member

Ok, go for symbols!

@x3ro
x3ro Jul 29, 2013 Contributor

I will. Thanks for the clarification @jpiasetz :)

@kelvinst
kelvinst Jul 29, 2013

I had the same question @jpiasetz, thank you to take a time and test this

In lib/jekyll/url.rb:

@@ -0,0 +1,67 @@
+# Public: Methods that generate a URL for a resource such as a Post or a Page.
+#
+# Examples
+#
+# URL.new({
+# :template => /:categories/:title.html",
+# :placeholders => {:categories => "ruby", :title => "something"}
I will. Thanks for the clarification @jpiasetz :)


Reply to this email directly or view it on GitHub.

@parkr
Member
parkr commented Jul 26, 2013

Looks like a mixup between Strings and Symbols as keys for the placeholders Hash. :)

@maul-esel maul-esel commented on the diff Jul 26, 2013
lib/jekyll/post.rb
- "i_day" => date.strftime("%d").to_i.to_s,
- "i_month" => date.strftime("%m").to_i.to_s,
- "categories" => categories.map { |c| URI.escape(c.to_s) }.join('/'),
- "short_month" => date.strftime("%b"),
- "y_day" => date.strftime("%j"),
- "output_ext" => self.output_ext
- }.inject(template) { |result, token|
- result.gsub(/:#{Regexp.escape token.first}/, token.last)
- }.gsub(/\/\//, "/")
- end
+ @url ||= URL.new({
+ :template => template,
+ :placeholders => url_placeholders,
+ :permalink => permalink
+ }).to_s
+ end
@maul-esel
maul-esel Jul 26, 2013 Contributor

Downside of class vs. module is that this code is duplicated now in Page and Post. Any way to avoid this?

@parkr
parkr Jul 26, 2013 Member

Pass in the page/post as the arg and call the necessary methods? I'd prefer to keep the URL class as dumb as possible. I don't mind this duplication. Modules are good but in very small doses. :) I'd prefer to have the logic completely separated here.

@x3ro
x3ro Jul 26, 2013 Contributor

I agree with @parkr here I guess. I initially went with the Module approach, because that was how most of the Jekyll functionality seemed to be implemented.. The duplication in this case is relatively minor, and I think it's worth to put up with it, if the alternative is making the URL class aware of methods it has to call on some Post or Page object 😄

@maul-esel
maul-esel Jul 26, 2013 Contributor

Yeah, passing the post and page is not a good alternative. So the best way seems to leave it that way, at least for now.

@kelvinst
kelvinst Jul 26, 2013

Agreed with @parkr, to avoid black magic into URL, I think that's the best way...

Maybe, if you want to simplify, create a factory or builder class with a method for generate the URL can help, you can pass the page or the post as parameter for it and access the necessary methods, since the both has these three methods.

Buuuut, I didn't like the previous solution because nothing ensure us that the three methods really exists. Some days ago #1300 added a refactor that extracts an main class from Page and Post, with that changes, I think the factory was pretty more "secure".

@x3ro
Contributor
x3ro commented Jul 31, 2013

Updated. I didn't modify the way the URL class is invoked, though. @kelvinst suggested a factory, which is a great idea. However, I think that in this particular case it's overkill, since people trying to get the URL of a page or post will invoke the url method on these classes (at least I cannot think of a scenario where one wouldn't want to do that). For every other use case, the constructor would've to be used anyway 😄

@kelvinst
kelvinst commented Aug 1, 2013

Very nice @x3ro (:
I was thinking in maybe only a method factory, but this is good enough for me too.

@kelvinst
kelvinst commented Aug 1, 2013

And I agreed that a factory class is a big overkill for this.

@parkr
Member
parkr commented Aug 3, 2013

(Travis failure is unrelated.)

@parkr
Member
parkr commented Aug 3, 2013

This PR LGTM. @mattr-?

@x3ro
Contributor
x3ro commented Aug 23, 2013

🍍 ?

@parkr
Member
parkr commented Aug 26, 2013

Just need to hear from @mattr-.

@mattr-
Member
mattr- commented Aug 27, 2013

i'll review it tomorrow. need to 💤 now.

@mattr-
Member
mattr- commented Aug 27, 2013

👍 :shipit:

@parkr parkr merged commit 0d890e4 into jekyll:master Aug 28, 2013

1 check failed

default The Travis CI build failed
Details
@x3ro x3ro deleted the x3ro:permalink-special-characters branch Oct 11, 2013
@parkr parkr referenced this pull request in Shopify/liquid Dec 13, 2013
Closed

Do not support Ruby 1.9 Hash syntax #289

@jekyllbot jekyllbot locked and limited conversation to collaborators Feb 27, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.