
Enable compressed sitemap.xml #1130

Merged
merged 1 commit into mkdocs:master on Feb 2, 2018

Conversation

davidhrbac
Contributor

A small patch to introduce a compressed sitemap.xml.gz.
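For context, the change is essentially a post-build step that writes a gzipped copy next to the generated sitemap. The snippet below is only a minimal sketch of that idea, not the actual patch; write_gzipped_sitemap and sitemap_path are illustrative names, with sitemap_path assumed to point at the built sitemap.xml in the site output directory:

import gzip

def write_gzipped_sitemap(sitemap_path):
    # Write sitemap.xml.gz next to the existing sitemap.xml.
    gz_path = sitemap_path + '.gz'
    with open(sitemap_path, 'rb') as src, gzip.open(gz_path, 'wb') as dst:
        dst.write(src.read())
    return gz_path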

@davidhrbac
Contributor Author

/cc @foxik0070

@d0ugal
Member

d0ugal commented Jan 25, 2017

Don't worry about the CI failure, we need to remove Python 2.6 :)

@waylan
Member

waylan commented Jan 25, 2017

This is good, but do we want to perhaps have a more comprehensive solution which can compress multiple files, not just the sitemap? For that matter, this may be a good candidate for a Plugin? Although, I can see the value in having this feature out-of-the-box.

@d0ugal
Member

d0ugal commented Jan 25, 2017

Hmm, now that I think about it more, do we even need this in MkDocs at all? I think most web servers can be configured to automatically serve gzip files. Wouldn't it be better for them to handle it?

For example, we already have gzip files on MkDocs.org because GitHub pages does it automatically. I tested with:

$ curl -H "Accept-Encoding: gzip" -I http://www.mkdocs.org -s | grep Content-Encoding
Content-Encoding: gzip

@waylan
Member

waylan commented Jan 25, 2017

I think most web servers can be configured to automatically serve gzip files.

True, but most basic shared hosting services don't let the end user configure such things. And what do services like ReadTheDocs or pythonhosted.org do (not sure, I didn't check)?

I seem to recall that some servers only return the gzipped file if it already exists. In other words, MkDocs needs to create it before the server will return it. The client would request sitemap.xml (with the "Accept-Encoding: gzip" header) and the server would check for the gzipped file and only fall back to the non-gzipped file if the gzipped file did not exist.

In other words, this may need some investigation.
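For reference, nginx behaves this way when the ngx_http_gzip_static_module is enabled: the directive below makes it serve a pre-existing sitemap.xml.gz in response to a request for sitemap.xml from clients that send Accept-Encoding: gzip, falling back to the plain file otherwise. A minimal sketch, assuming the module is compiled in:

location / {
    gzip_static on;   # serve foo.gz alongside foo when the client accepts gzip
}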

@davidhrbac
Contributor Author

There's another point of view: you can specify robots.txt like this:

Sitemap: https://site_url/sitemap.xml.gz
User-agent: *
Disallow:

@davidhrbac
Contributor Author

@waylan Content-Encoding: gzip is a completely different issue. It means that the GET response has been gzipped on the fly by the web server. It may remain in the web server cache or it may be gzipped with every GET request.

@waylan
Member

waylan commented Jan 25, 2017

@davidhrbac I'm assuming that is a server config for a specific server (Apache?). If users are using something different (nginx) then the solution would be different. And many users may not actually be able to configure anything.

One of the benefits of a static site generator is that you can upload the output to a cheap shared host. The downside is that those cheap shared hosts offer almost no configurability. Which is my point. We need a solution that works for most users. Requiring server configuration is not such a solution.
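To illustrate how the configuration differs between servers: on Apache the usual route is on-the-fly compression via mod_deflate, typically in an .htaccess override where the host permits it. A sketch, assuming mod_deflate is available:

# .htaccess
<IfModule mod_deflate.c>
    AddOutputFilterByType DEFLATE text/html text/css application/javascript text/xml application/xml
</IfModule>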

@waylan
Member

waylan commented Jan 25, 2017

@waylan Content-Encoding: gzip is a completely different issue. It means that the GET response has been gzipped on the fly by the web server. It may remain in the web server cache or it may be gzipped with every GET request.

I understand, but what I'm saying is that (I think) some servers fake it with static content. They only serve the gzipped response if a gzipped static file exists on the file system next to the non-gzipped one. Therefore, there is a valid reason for MkDocs to generate gzipped content. Unfortunately, I don't recall which servers do that or where I got that idea from.

Although on further reflection, it occurs to me that your request may not be about Content-Encoding: gzip at all. Both @d0ugal and I assumed it was. Do you want a sitemap.xml.gz file to be requested directly rather than sitemap.xml with the "Accept-Encoding: gzip" header? Could you point to the use case for that?

@davidhrbac
Contributor Author

@waylan Yes, we are on the same page. That's why there's this PR. Anyway, this solution reduces resource usage.

The use case is the robots.txt file. We can also extend this to search_index.json once lunr.js supports a gzipped index file.

@d0ugal
Member

d0ugal commented Jan 25, 2017

One of the benefits of a static site generator is that you can upload the output to a cheap shared host. The downside is that those cheap shared hosts offer almost no configurability. Which is my point. We need a solution that works for most users. Requiring server configuration is not such a solution.

Most users don't care about gzip either 😄

@waylan
Member

waylan commented Jan 25, 2017

The use case is the robots.txt file.

Where (or by whom) does this get requested as a gzipped file?

@davidhrbac
Contributor Author

The webmaster is the one who defines the robots.txt content. It's up to the client application whether to use robots.txt or not. You can try it here: https://technicalseo.com/seo-tools/robots-txt/. Use the URL https://docs.it4i.cz/ and hit the "Live robots.txt" button.

Every byte saved on communication counts...

@waylan
Member

waylan commented Jan 25, 2017

Yes, but do web crawlers request the robots.txt (or its alias) specifically with the .gz extension, or without it (but including the "Accept-Encoding: gzip" header)? If the former, then the user needs to configure the alias properly. If the latter, then the server needs to be configured for gzipping on the fly. Sorry, but you still haven't clearly defined your use case and I'm not sure I'm understanding it correctly.

@waylan
Member

waylan commented Jan 25, 2017

Most users don't care about gzip either

True, which is why I suggested this might be a good candidate for a plugin. Those who do care should be able to easily work out how to install and enable a plugin (let alone configure their server), while the rest of the users never even need to give it a thought.

@davidhrbac
Contributor Author

Web crawlers always request robots.txt. There's no robots.txt.gz file and no request for a robots.txt.gz file. The client and server can agree on gzipped communication; that's where the Accept-Encoding: gzip and Content-Encoding: gzip handshake comes in. But that's a completely different story.

You can declare the sitemap file and its type in robots.txt. The sitemap can be plain text or compressed, so you can define robots.txt to point to the gzipped file. As you can see below, I can save 88.9% of the original size. Count all the instances in the wild.

This is the very same reason we minimise JS and CSS files.

‹hrb33-toshiba› 19:51 $ HEAD https://docs.it4i.cz/sitemap.xml.gz
200 OK
Connection: close
Date: Wed, 25 Jan 2017 18:51:26 GMT
Accept-Ranges: bytes
ETag: "5888b126-1d6"
Server: nginx/1.10.2
Content-Length: 470
Content-Type: application/octet-stream
Last-Modified: Wed, 25 Jan 2017 14:07:34 GMT
Client-Date: Wed, 25 Jan 2017 18:51:26 GMT
Client-Peer: 195.113.250.58:443
Client-Response-Num: 1
Client-SSL-Cert-Issuer: /C=NL/ST=Noord-Holland/L=Amsterdam/O=TERENA/CN=TERENA SSL CA 3
Client-SSL-Cert-Subject: /C=CZ/ST=Moravskoslezsk\xC3\xBD kraj/L=Ostrava - Poruba/O=Vysok\xC3\xA1 \xC5\xA1kola b\xC3\xA1\xC5\x88sk\xC3\xA1 - Technick\xC3\xA1 univerzita Ostrava/CN=docs.it4i.cz
Client-SSL-Cipher: ECDHE-RSA-AES128-SHA256
Client-SSL-Socket-Class: IO::Socket::SSL
Strict-Transport-Security: max-age=15768000

✔ ~/Dokumenty/dev/it4i/docs.it4i.git [capitalize {origin/capitalize}|…5⚑ 1] 
‹hrb33-toshiba› 19:51 $ HEAD https://docs.it4i.cz/sitemap.xml
200 OK
Connection: close
Date: Wed, 25 Jan 2017 18:51:30 GMT
Accept-Ranges: bytes
ETag: "5888b11f-1090"
Server: nginx/1.10.2
Content-Length: 4240
Content-Type: text/xml
Last-Modified: Wed, 25 Jan 2017 14:07:27 GMT
Client-Date: Wed, 25 Jan 2017 18:51:30 GMT
Client-Peer: 195.113.250.58:443
Client-Response-Num: 1
Client-SSL-Cert-Issuer: /C=NL/ST=Noord-Holland/L=Amsterdam/O=TERENA/CN=TERENA SSL CA 3
Client-SSL-Cert-Subject: /C=CZ/ST=Moravskoslezsk\xC3\xBD kraj/L=Ostrava - Poruba/O=Vysok\xC3\xA1 \xC5\xA1kola b\xC3\xA1\xC5\x88sk\xC3\xA1 - Technick\xC3\xA1 univerzita Ostrava/CN=docs.it4i.cz
Client-SSL-Cipher: ECDHE-RSA-AES128-SHA256
Client-SSL-Socket-Class: IO::Socket::SSL
Strict-Transport-Security: max-age=15768000
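For reference, the two Content-Length headers above bear out the figure quoted earlier: 1 - 470/4240 ≈ 0.889, so the gzipped sitemap is roughly 88.9% smaller than the plain XML.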

It's also clear that Google tries to get sitemap.xml.gz on the first try. These are log lines from a site without robots.txt and sitemap.xml.gz files:

66.249.73.25 - - [23/Jan/2017:22:59:12 +0100] "GET /sitemap.xml.gz HTTP/1.1" 404 49183 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
66.249.73.25 - - [23/Jan/2017:22:59:14 +0100] "GET /sitemap.xml.gz HTTP/1.1" 404 49183 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"

@waylan
Member

waylan commented Jan 25, 2017

Sorry, it's been a long time since I've looked at how robots.txt works. For some reason I was thinking that a server specific configuration was aliasing robots.txt to the sitemap. Instead, the robots.txt file just needs to point to the sitemap. And that is server agnostic and something any user should be able to do. In the event a server is not configured to return Content-Encoding: gzip for .gz files, the user can just point their robots.txt file at sitemap.xml instead.

I was hung up on why only the sitemap should be gzipped and not any other files (html, css, js, etc.). I suppose in this case, the robots.txt file can point directly to it regardless of whether the requesting client includes the Accept-Encoding: gzip header or whether the server supports (faked) gzipping of the response. For all other files, both of those need to be true before any gzipped content will be served, which makes this uniquely different (the one other exception being the search index if/when the search library includes support for it).

Given the above, I don't see any reason to not accept this as-is.

@waylan
Member

waylan commented Feb 12, 2017

Re status: As the Python 2.6 tests are failing, I'm just waiting for us to officially drop 2.6 support (remove tests, update docs, etc) before merging this. We might do a small bugfix release or two in the 0.16.x series and this should wait till after that.

@d0ugal
Member

d0ugal commented Feb 13, 2017

We can create a branch for 0.16 from the tag and backport if you like. I don't have strong feelings here, but I don't want any progress to be blocked.

@davidhrbac
Contributor Author

davidhrbac commented Feb 13, 2017

No rush. I do not need it to be back-ported. I can create the compressed version in CI script.

@davidhrbac
Contributor Author

@d0ugal correction: I do NOT need...

@waylan
Member

waylan commented Oct 4, 2017

I've updated this to the latest code in master, which includes the Plugin API. It feels really weird to have this in there. Various refactors have removed any 'single page' specific code, so this feels out of place. It would be very easy to do this via a plugin. I'm not so convinced this should be there by default, despite the benefits.

In fact, I think a more general solution for gzipping various files would make for a better default. If someone really wants a gzipped sitemap.xml file, then adding a plugin for that special case should be trivial. After all, that user would need to create their custom robots.txt file to point to the gzipped file anyway.
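To give an idea of how trivial the plugin route would be, here is a minimal sketch against the Plugin API now in master. The GzipSitemapPlugin name and its packaging/entry-point registration are hypothetical, not part of this PR:

import gzip
import os

from mkdocs.plugins import BasePlugin

class GzipSitemapPlugin(BasePlugin):
    # Hypothetical plugin: gzip the generated sitemap after the site is built.
    def on_post_build(self, config, **kwargs):
        sitemap = os.path.join(config['site_dir'], 'sitemap.xml')
        if os.path.exists(sitemap):
            with open(sitemap, 'rb') as src, gzip.open(sitemap + '.gz', 'wb') as dst:
                dst.write(src.read())

A plugin like this would be registered under the mkdocs.plugins entry-point group and enabled via the plugins setting in mkdocs.yml.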

Finally, if this were to be accepted, it should have a test or two first.

The failing AppVeyor tests can be ignored; they are being addressed in #1299.

@waylan waylan added this to the 1.0.0 milestone Feb 2, 2018
@waylan waylan merged commit a991b7a into mkdocs:master Feb 2, 2018