
Enable compressed sitemap.xml #1130

Merged
merged 1 commit into mkdocs:master on Feb 2, 2018

Conversation

davidhrbac
Contributor

A small patch to introduce a compressed sitemap.xml.gz.
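For context, the change is essentially a post-build step that writes a gzipped copy next to the generated sitemap. The snippet below is only a minimal sketch of that idea, not the actual patch; write_gzipped_sitemap and sitemap_path are illustrative names, with sitemap_path assumed to point at the built sitemap.xml in the site output directory:

import gzip

def write_gzipped_sitemap(sitemap_path):
    # Write sitemap.xml.gz next to the existing sitemap.xml.
    gz_path = sitemap_path + '.gz'
    with open(sitemap_path, 'rb') as src, gzip.open(gz_path, 'wb') as dst:
        dst.write(src.read())
    return gz_path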

@davidhrbac
Contributor Author

/cc @foxik0070

@d0ugal
Member

d0ugal commented Jan 25, 2017

Don't worry about the CI failure, we need to remove Python 2.6 :)

@waylan
Member

waylan commented Jan 25, 2017

This is good, but do we want to perhaps have a more comprehensive solution which can compress multiple files, not just the sitemap? For that matter, this may be a good candidate for a Plugin? Although, I can see the value in having this feature out-of-the-box.

@d0ugal
Member

d0ugal commented Jan 25, 2017

Hmm, now that I think about it more, do we even need this in MkDocs at all? I think most web servers can be configured to automatically serve gzip files. Wouldn't it be better for them to handle it?

For example, we already have gzip files on MkDocs.org because GitHub pages does it automatically. I tested with:

$ curl -H "Accept-Encoding: gzip" -I http://www.mkdocs.org -s | grep Content-Encoding
Content-Encoding: gzip

@waylan
Member

waylan commented Jan 25, 2017

I think most web servers can be configured to automatically serve gzip files.

True, but most basic shared hosting services don't let the end user configure such things. And what do services like ReadTheDocs or pythonhosted.org do (not sure, I didn't check)?

I seem to recall that some servers only return the gzipped file if it already exists. In other words, MkDocs needs to create it before the server will return it. The client would request sitemap.xml (with the "Accept-Encoding: gzip" header) and the server would check for the gzipped file and only fall back to the non-gzipped file if the gzipped file did not exist.

In other words, this may need some investigation.
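For reference, nginx behaves this way when the ngx_http_gzip_static_module is enabled: the directive below makes it serve a pre-existing sitemap.xml.gz in response to a request for sitemap.xml from clients that send Accept-Encoding: gzip, falling back to the plain file otherwise. A minimal sketch, assuming the module is compiled in:

location / {
    gzip_static on;   # serve foo.gz alongside foo when the client accepts gzip
}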

@davidhrbac
Contributor Author

There's another point of view: you can specify robots.txt like this:

Sitemap: https://site_url/sitemap.xml.gz
User-agent: *
Disallow:

@davidhrbac
Contributor Author

@waylan Content-Encoding: gzip is a completely different issue. It means that the GET response has been gzipped on the fly by the web server. It may remain in the web server cache or it may be gzipped with every GET request.

@waylan
Member

waylan commented Jan 25, 2017

@davidhrbac I'm assuming that is a server config for a specific server (Apache?). If users are using something different (nginx) then the solution would be different. And many users may not actually be able to configure anything.

One of the benefits of a static site generator is that you can upload the output to a cheap shared host. The downside is that those cheap shared hosts offer almost no configurability. Which is my point. We need a solution that works for most users. Requiring server configuration is not such a solution.
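To illustrate how the configuration differs between servers: on Apache the usual route is on-the-fly compression via mod_deflate, typically in an .htaccess override where the host permits it. A sketch, assuming mod_deflate is available:

# .htaccess
<IfModule mod_deflate.c>
    AddOutputFilterByType DEFLATE text/html text/css application/javascript text/xml application/xml
</IfModule>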

@waylan
Member

waylan commented Jan 25, 2017

@waylan Content-Encoding: gzip is a completely different issue. It means that the GET response has been gzipped on the fly by the web server. It may remain in the web server cache or it may be gzipped with every GET request.

I understand, but what I'm saying is that (I think) some servers fake it with static content. They only serve the gzipped response if a gzipped static file exists on the file system next to the non-gzipped one. Therefore, there is a valid reason for MkDocs to generate gzipped content. Unfortunately, I don't recall which servers do that or where I got that idea from.

Although on further reflection, it occurs to me that your request may not be about Content-Encoding: gzip at all. Both @d0ugal and I assumed it was. Do you want a sitemap.xml.gz file to be requested directly rather than sitemap.xml with the "Accept-Encoding: gzip" header? Could you point to the use case for that?

@davidhrbac
Contributor Author

@waylan Yes, we are on the same page. That's why there's this PR. Anyway, this solution reduces resource usage.

The use case is the robots.txt file. We can also extend this to search_index.json once lunr.js supports a gzipped index file.

@d0ugal
Member

d0ugal commented Jan 25, 2017

One of the benefits of a static site generator is that you can upload the output to a cheap shared host. The downside is that those cheap shared hosts offer almost no configurability. Which is my point. We need a solution that works for most users. Requiring server configuration is not such a solution.

Most users don't care about gzip either 😄

@waylan
Member

waylan commented Jan 25, 2017

The use case is the robots.txt file.

Where (or by whom) does this get requested as a gzipped file?

@davidhrbac
Contributor Author

The webmaster is the one who defines the robots.txt content. It's up to the client application whether to use robots.txt or not. You can try it here: https://technicalseo.com/seo-tools/robots-txt/. Use the URL https://docs.it4i.cz/ and hit the "Live robots.txt" button.

Every byte saved on communication counts...

@waylan
Member

waylan commented Jan 25, 2017

Yes, but do web crawlers request the robots.txt (or its alias) specifically with the .gz extension, or without it (but including the "Accept-Encoding: gzip" header)? If the former, then the user needs to configure the alias properly. If the latter, then the server needs to be configured for gzipping on the fly. Sorry, but you still haven't clearly defined your use case and I'm not sure I'm understanding it correctly.

@waylan
Member

waylan commented Jan 25, 2017

Most users don't care about gzip either

True, which is why I suggested this might be a good candidate for a plugin. Those who do care should be able to easily work out how to install and enable a plugin (let alone configure their server), while the rest of the users never even need to give it a thought.

@davidhrbac
Contributor Author

Web crawlers always request robots.txt. There's no robots.txt.gz file and no request for a robots.txt.gz file. The client and server can agree on gzipped communication; that's where the Accept-Encoding: gzip and Content-Encoding: gzip handshake comes in. But that's a completely different story.

You can declare the sitemap file and its type in robots.txt. The sitemap can be plain text or compressed, so you can define robots.txt to point to the gzipped file. As you can see below, I can save 88.9% of the original size. Count all the instances in the wild.

This is the very same reason we minimise JS and CSS files.

‹hrb33-toshiba› 19:51 $ HEAD https://docs.it4i.cz/sitemap.xml.gz
200 OK
Connection: close
Date: Wed, 25 Jan 2017 18:51:26 GMT
Accept-Ranges: bytes
ETag: "5888b126-1d6"
Server: nginx/1.10.2
Content-Length: 470
Content-Type: application/octet-stream
Last-Modified: Wed, 25 Jan 2017 14:07:34 GMT
Client-Date: Wed, 25 Jan 2017 18:51:26 GMT
Client-Peer: 195.113.250.58:443
Client-Response-Num: 1
Client-SSL-Cert-Issuer: /C=NL/ST=Noord-Holland/L=Amsterdam/O=TERENA/CN=TERENA SSL CA 3
Client-SSL-Cert-Subject: /C=CZ/ST=Moravskoslezsk\xC3\xBD kraj/L=Ostrava - Poruba/O=Vysok\xC3\xA1 \xC5\xA1kola b\xC3\xA1\xC5\x88sk\xC3\xA1 - Technick\xC3\xA1 univerzita Ostrava/CN=docs.it4i.cz
Client-SSL-Cipher: ECDHE-RSA-AES128-SHA256
Client-SSL-Socket-Class: IO::Socket::SSL
Strict-Transport-Security: max-age=15768000

✔ ~/Dokumenty/dev/it4i/docs.it4i.git [capitalize {origin/capitalize}|…5⚑ 1] 
‹hrb33-toshiba› 19:51 $ HEAD https://docs.it4i.cz/sitemap.xml
200 OK
Connection: close
Date: Wed, 25 Jan 2017 18:51:30 GMT
Accept-Ranges: bytes
ETag: "5888b11f-1090"
Server: nginx/1.10.2
Content-Length: 4240
Content-Type: text/xml
Last-Modified: Wed, 25 Jan 2017 14:07:27 GMT
Client-Date: Wed, 25 Jan 2017 18:51:30 GMT
Client-Peer: 195.113.250.58:443
Client-Response-Num: 1
Client-SSL-Cert-Issuer: /C=NL/ST=Noord-Holland/L=Amsterdam/O=TERENA/CN=TERENA SSL CA 3
Client-SSL-Cert-Subject: /C=CZ/ST=Moravskoslezsk\xC3\xBD kraj/L=Ostrava - Poruba/O=Vysok\xC3\xA1 \xC5\xA1kola b\xC3\xA1\xC5\x88sk\xC3\xA1 - Technick\xC3\xA1 univerzita Ostrava/CN=docs.it4i.cz
Client-SSL-Cipher: ECDHE-RSA-AES128-SHA256
Client-SSL-Socket-Class: IO::Socket::SSL
Strict-Transport-Security: max-age=15768000
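For reference, the two Content-Length headers above bear out the figure quoted earlier: 1 - 470/4240 ≈ 0.889, so the gzipped sitemap is roughly 88.9% smaller than the plain XML.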

It's also clear that Google tries to get sitemap.xml.gz on the first try. These are log lines from a site without robots.txt and sitemap.xml.gz files:

66.249.73.25 - - [23/Jan/2017:22:59:12 +0100] "GET /sitemap.xml.gz HTTP/1.1" 404 49183 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
66.249.73.25 - - [23/Jan/2017:22:59:14 +0100] "GET /sitemap.xml.gz HTTP/1.1" 404 49183 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"

@waylan
Member

waylan commented Jan 25, 2017

Sorry, it's been a long time since I've looked at how robots.txt works. For some reason I was thinking that a server specific configuration was aliasing robots.txt to the sitemap. Instead, the robots.txt file just needs to point to the sitemap. And that is server agnostic and something any user should be able to do. In the event a server is not configured to return Content-Encoding: gzip for .gz files, the user can just point their robots.txt file at sitemap.xml instead.

I was hung up on why only the sitemap should be gzipped and not any other files (html, css, js, etc.). I suppose in this case, the robots.txt file can point directly to it regardless of whether the requesting client includes the Accept-Encoding: gzip header or whether the server supports (faked) gzipping of the response. For all other files, both of those need to be true before any gzipped content will be served, which makes this uniquely different (the one other exception being the search index if/when the search library includes support for it).

Given the above, I don't see any reason to not accept this as-is.

@waylan
Member

waylan commented Feb 12, 2017

Re status: As the Python 2.6 tests are failing, I'm just waiting for us to officially drop 2.6 support (remove tests, update docs, etc) before merging this. We might do a small bugfix release or two in the 0.16.x series and this should wait till after that.

@d0ugal
Member

d0ugal commented Feb 13, 2017

We can create a branch for 0.16 from the tag and backport if you like. I don't have strong feelings here, but I don't want any progress to be blocked.

@davidhrbac
Contributor Author

davidhrbac commented Feb 13, 2017

No rush. I do not need it to be back-ported. I can create the compressed version in CI script.

@davidhrbac
Contributor Author

@d0ugal correction: I do NOT need...

@waylan
Member

waylan commented Oct 4, 2017

I've updated this to the latest code in master, which includes the Plugin API. It feels really weird to have this in there. Various refactors have removed any 'single page' specific code, so this feels out of place. It would be very easy to do this via a plugin. I'm not so convinced this should be there by default, despite the benefits.

In fact, I think a more general solution for gzipping various files would make for a better default. If someone really wants a gzipped sitemap.xml file, then adding a plugin for that special case should be trivial. After all, that user would need to create their custom robots.txt file to point to the gzipped file anyway.
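To give an idea of how trivial the plugin route would be, here is a minimal sketch against the Plugin API now in master. The GzipSitemapPlugin name and its packaging/entry-point registration are hypothetical, not part of this PR:

import gzip
import os

from mkdocs.plugins import BasePlugin

class GzipSitemapPlugin(BasePlugin):
    # Hypothetical plugin: gzip the generated sitemap after the site is built.
    def on_post_build(self, config, **kwargs):
        sitemap = os.path.join(config['site_dir'], 'sitemap.xml')
        if os.path.exists(sitemap):
            with open(sitemap, 'rb') as src, gzip.open(sitemap + '.gz', 'wb') as dst:
                dst.write(src.read())

A plugin like this would be registered under the mkdocs.plugins entry-point group and enabled via the plugins setting in mkdocs.yml.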

Finally, if this were to be accepted, it should have a test or two first.

The failing AppVeyor tests can be ignored; they are being addressed in #1299.

@waylan waylan added this to the 1.0.0 milestone Feb 2, 2018
@waylan waylan merged commit a991b7a into mkdocs:master Feb 2, 2018