Added some meta[name=robots] markup to handler crawlers behavior (fix #771) #777

Merged
merged 1 commit into opendatateam:master on Feb 20, 2017
Conversation

noirbizarre
Contributor

This PR adds support for an optional meta[name=robots] in templates
and defines some nofollow/noindex directives on some pages to handle crawler behavior:

  • prevent pagination crawling
  • prevent search results from being indexed, but allow the first page of results to give some weight
  • prevent all user-related pages from being indexed
  • prevent the admin from being indexed

As a side effect, some pages that were missing metadata (title, description...) gain at least a title and a description (for proper previews when sharing on social networks).
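For illustration, a minimal sketch of the pattern (the file names and sample values here are assumptions, not the committed code): a page template declares its meta mapping with an optional 'robots' entry, and the base layout only emits the tag when that entry is present.

{# hypothetical page template: opt out of indexing through the meta mapping #}
{% extends "base.html" %}
{% set meta = {
    'title': _('Example page'),
    'keywords': [_('example')],
    'robots': 'noindex',
} %}

{# in the base template's <head>: render the directive only when a page defines one #}
{% if meta.robots %}<meta name="robots" content="{{ meta.robots }}">{% endif %}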

@@ -12,6 +12,7 @@
<meta name="description" content="{{ description }}" />
<meta property="og:description" content="{{ description }}" />
<meta property="og:image" content="{{ image }}" />
{% if meta.robots %}<meta name="robots" content="{{ meta.robots }}">{% endif %}
Member

Can't we merge that with DISALLOW_INDEXING a few lines above?

Contributor Author

Merged: DISALLOW_INDEXING now overrides the per-page setting.

'title': _('%(topic)s datasets', topic=topic.name),
'description': _("%(site)s %(topic)s related datasets", site=config['SITE_TITLE'], topic=topic.name),
'keywords': [_('search'), _('datasets'), _('topic')] + topic.tags,
'robots': 'noindex',
Member

I'm not sure about deindexing that one.

Contributor Author

The topic page itself is indexed; only the topic's datasets listing is not indexed.

'keywords': [_('user'), _('profile')],
'robots': 'noindex',
} %}

{% block extra_head %}
{{ super() }}
<meta name="robots" content="noindex,follow">
Member

Remove that one?

@davidbgk
Member

That's the moment I wonder if we should instead whitelist pages that we want to index.

Pros:

  • more explicit
  • avoids indexing pages we didn't think about

Cons:

  • might be dangerous if we “forget” to index some critical pages…

I think we already discussed that, but at least if there are counter-arguments we can refer back to this discussion later. Thoughts?

@noirbizarre
Contributor Author

Given that this is an open data portal, I prefer open by default.
In this PR I blacklisted the two cases I identified as unwanted:

  • privacy concerns (user-related pages)
  • pointless indexing: pages that don't need it (paginated results, admin)

The cons I see with a whitelist:

  • it requires thinking of every page we want indexed, and chances are we would never notice if we miss a page to whitelist
  • it's the opposite of the robots.txt behavior => it will be more complex to handle
  • unless we use DISALLOW_INDEXING, it's very defensive and goes against the 'being totally transparent' philosophy we had until now
  • it doesn't cover the inner links case (e.g. pagination), for which we can't have a whitelist behavior => two different strategies to handle the same topic

<meta name="description" content="{{ description }}" />
{% if config.DISALLOW_INDEXING %}<meta name="robots" content="noindex,nofollow" />
{% elif meta.robots %}<meta name="robots" content="{{ meta.robots }}">
Member

Missing the closing / to be consistent?
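For reference, a consistent version of that hunk would presumably self-close the last tag too (a sketch only; the trailing {% endif %} is implied by the condition and not shown in the hunk above):

<meta name="description" content="{{ description }}" />
{% if config.DISALLOW_INDEXING %}<meta name="robots" content="noindex,nofollow" />
{% elif meta.robots %}<meta name="robots" content="{{ meta.robots }}" />
{% endif %}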

@davidbgk
Copy link
Member

Alright, let's keep it indexed by default.

@noirbizarre noirbizarre merged commit 9f3341a into opendatateam:master Feb 20, 2017
@noirbizarre noirbizarre deleted the 771-prevent-search-result-crawling branch February 20, 2017 20:15