Added some meta[name=robots] markup to handler crawlers behavior (fix #771) #777

Merged
merged 1 commit into opendatateam:master on Feb 20, 2017
Conversation

noirbizarre
Contributor

This PR adds support for an optional meta[name=robots] in templates
and defines some nofollow/noindex directives on some pages to handle crawler behavior:

  • prevent pagination crawling
  • prevent search results from being indexed, but allow the first page of results to give some weight
  • prevent all user-related pages from being indexed
  • prevent the admin from being indexed

As a side effect, some pages that were missing metadata (title, description...) gain at least a title and a description (for proper previews when sharing on social networks).
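For illustration, a minimal sketch of the pattern (the file names and sample values here are assumptions, not the committed code): a page template declares its meta mapping with an optional 'robots' entry, and the base layout only emits the tag when that entry is present.

{# hypothetical page template: opt out of indexing through the meta mapping #}
{% extends "base.html" %}
{% set meta = {
    'title': _('Example page'),
    'keywords': [_('example')],
    'robots': 'noindex',
} %}

{# in the base template's <head>: render the directive only when a page defines one #}
{% if meta.robots %}<meta name="robots" content="{{ meta.robots }}">{% endif %}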

@@ -12,6 +12,7 @@
<meta name="description" content="{{ description }}" />
<meta property="og:description" content="{{ description }}" />
<meta property="og:image" content="{{ image }}" />
{% if meta.robots %}<meta name="robots" content="{{ meta.robots }}">{% endif %}
Member

Can't we merge that with DISALLOW_INDEXING a few lines above?

Contributor Author

Merged: DISALLOW_INDEXING now overrides the per-page setting.

'title': _('%(topic)s datasets', topic=topic.name),
'description': _("%(site)s %(topic)s related datasets", site=config['SITE_TITLE'], topic=topic.name),
'keywords': [_('search'), _('datasets'), _('topic')] + topic.tags,
'robots': 'noindex',
Member

I'm not sure about deindexing that one.

Contributor Author

The topic page itself is indexed; only the topic's datasets listing is not indexed.

'keywords': [_('user'), _('profile')],
'robots': 'noindex',
} %}

{% block extra_head %}
{{ super() }}
<meta name="robots" content="noindex,follow">
Member

Remove that one?

@davidbgk
Member

That's the moment I wonder if we should instead whitelist pages that we want to index.

Pros:

  • more explicit
  • avoids indexing pages we didn't think about

Cons:

  • might be dangerous if we “forget” to index some critical pages…

I think we already discussed that, but at least if there are counter-arguments we can refer back to this discussion later. Thoughts?

@noirbizarre
Contributor Author

Given that this is an open data portal, I prefer open by default.
In this PR I blacklisted the two cases I identified as unwanted:

  • privacy concerns (user-related pages)
  • pointless indexing: pages that don't need it (paginated results, admin)

The cons I see with a whitelist:

  • it requires thinking of every page we want indexed, and chances are we would never notice if we miss a page to whitelist
  • it's the opposite of the robots.txt behavior => it will be more complex to handle
  • unless we use DISALLOW_INDEXING, it's very defensive and goes against the 'being totally transparent' philosophy we had until now
  • it doesn't cover the inner links case (e.g. pagination), for which we can't have a whitelist behavior => two different strategies to handle the same topic

<meta name="description" content="{{ description }}" />
{% if config.DISALLOW_INDEXING %}<meta name="robots" content="noindex,nofollow" />
{% elif meta.robots %}<meta name="robots" content="{{ meta.robots }}">
Member

Missing the closing / to be consistent?
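For reference, a consistent version of that hunk would presumably self-close the last tag too (a sketch only; the trailing {% endif %} is implied by the condition and not shown in the hunk above):

<meta name="description" content="{{ description }}" />
{% if config.DISALLOW_INDEXING %}<meta name="robots" content="noindex,nofollow" />
{% elif meta.robots %}<meta name="robots" content="{{ meta.robots }}" />
{% endif %}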

@davidbgk
Copy link
Member

Alright, let's keep it indexed by default.

@noirbizarre noirbizarre merged commit 9f3341a into opendatateam:master Feb 20, 2017
@noirbizarre noirbizarre deleted the 771-prevent-search-result-crawling branch February 20, 2017 20:15