About SEO report and h1 tags #55

ysard · 2022-02-02T04:08:44Z

Problem

I see in the following line that the content of the article is used to count the
h1 tags:

seo/pelican/plugins/seo/seo_report/seo_analyzer/__init__.py

Line 22 in 5ab4c64

self.content_title_analysis = ContentTitleAnalyzer(content=self._content)

You suggest (like everyone) the following Markdown structure in your README:

Title: Page Title
Description: Page Description

# Heading Content

Nevertheless, most (all?) templates already encapsulate the "Title" metadata in an h1 tag.

This processing is independent of the rest of the article: the Markdown body is rendered into the content attribute of the objects and inserted as is in the HTML template.

Therefore such an example poses 2 problems:
- duplication of the h1 tag (that of the template + that of the article content)
- duplication not detected by the current plugin
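To make the two problems concrete, here is a minimal sketch that counts h1 tags first in the article body alone, then in the assembled page. The HTML strings are hypothetical stand-ins for a rendered article and a theme template, and the standard library's html.parser is swapped in for BeautifulSoup only to keep the example dependency-free.

```python
from html.parser import HTMLParser


class H1Counter(HTMLParser):
    """Count opening <h1> tags in an HTML string."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "h1":
            self.count += 1


def count_h1(html):
    parser = H1Counter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical rendered article body (what ends up in the content attribute).
content_fragment = "<h1>Heading Content</h1><p>Some text.</p>"

# Hypothetical template output that wraps the "Title" metadata in its own <h1>.
final_page = (
    "<html><body><article>"
    "<h1>Page Title</h1>"
    + content_fragment
    + "</article></body></html>"
)

print(count_h1(content_fragment))  # 1: what a fragment-only check sees
print(count_h1(final_page))        # 2: what is actually served
```

A check limited to the fragment reports a single h1 and therefore misses the duplicate introduced by the template.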

Currently, as far as I know, the only simple way to get a compliant HTML page is to start the article's headings at h2 (##),
although this is semantically wrong in Markdown and can disturb some plugins (table of contents rendering, etc.).

I personally use a homemade plugin to modify the final HTML without modifying the original Markdown.
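For readers wondering what such a post-processing step could look like, here is a rough sketch that demotes every heading of a rendered HTML fragment by one level. It is not the plugin mentioned above: the function name shift_headings and the sample fragment are assumptions made for illustration, and a regex-based rewrite like this one would also touch heading-like text inside comments or scripts.

```python
import re

# Matches the start of an opening or closing h1..h6 tag,
# capturing the optional slash and the heading level.
HEADING_TAG = re.compile(r"<(/?)h([1-6])\b", re.IGNORECASE)


def shift_headings(html, offset=1):
    """Demote every heading in `html` by `offset` levels, capped at h6."""

    def demote(match):
        slash, level = match.group(1), int(match.group(2))
        return "<{}h{}".format(slash, min(level + offset, 6))

    return HEADING_TAG.sub(demote, html)


# Hypothetical rendered article body written with "# Heading Content".
fragment = '<h1 id="intro">Heading Content</h1><h2>Details</h2>'
print(shift_headings(fragment))
# <h2 id="intro">Heading Content</h2><h3>Details</h3>
```

With this kind of step, the Markdown source keeps its natural # / ## / ### structure while the template remains the only producer of the h1.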

In this case, the SEO report plugin misleads the user by not detecting any h1 title.

I would like to mention that your plugin is very welcome because it allowed me to highlight a problem that I had totally missed.

Proposal

The plugin should be refactored to read finalized pages, like the SEOEnhancer part
called after the content_written signal.
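As a rough illustration of this proposal, the sketch below connects a handler to Pelican's content_written signal, which passes the path of the file just written along with the context. The names H1Counter, check_written_page and the warning format are assumptions, not the plugin's code, and the standard library parser stands in for BeautifulSoup so the helper can be tried on its own.

```python
from html.parser import HTMLParser
from pathlib import Path


class H1Counter(HTMLParser):
    """Count opening <h1> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.count += 1


def check_written_page(path, context):
    """Hypothetical content_written handler: inspect the finalized page."""
    parser = H1Counter()
    parser.feed(Path(path).read_text(encoding="utf-8"))
    parser.close()
    if parser.count != 1:
        print(f"{path}: expected exactly one <h1>, found {parser.count}")
    return parser.count


def register():
    # Imported lazily so the helper above can be tried without Pelican installed.
    from pelican import signals

    signals.content_written.connect(check_written_page)
```

Because the handler reads the file produced by the writer, it sees the h1 tags coming from the template as well as those coming from the article body.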

Notes

- Pandoc has implemented an option to automatically shift the heading level:
  Extension to treat first heading level as title? jgm/pandoc#5615
- HTML5 allows nesting of independent units in tags like <section>, <article>, etc., which allows multiple h1 titles to coexist in a page (outline algorithm). However, Mozilla is very clear about this: it is a non-standard practice and not recommended. Cf.
  https://developer.mozilla.org/fr/docs/orphaned/Web/Guide/HTML/Using_HTML_sections_and_outlines#lalgorithme_outline_html5
  https://developer.mozilla.org/en-US/docs/Web/HTML/Element/Heading_Elements#multiple_h1_elements_on_one_page
- Discussion on Hugo's side:
  https://discourse.gohugo.io/t/option-to-shift-headings/6136


MaevaBrunelles · 2022-02-03T18:48:43Z

Hi @ysard and thanks for your report.

From an SEO point of view, template creators should never encapsulate the title in an h1 tag. Website creators should be able to customize the page title and the content title as they want, and thus respect the Markdown semantics (# ## ###...). So I think this is an issue that should be addressed to template creators.

> The plugin should be refactored to read finalized pages, like the SEOEnhancer part
> called after the content_written signal.

I'm not sure I understand what it would change. In the case you pointed out, the plugin already works on HTML elements (content), as I use BeautifulSoup to analyze the h1 tag:

seo/pelican/plugins/seo/seo_report/seo_analyzer/content_title_analyzer.py

Line 10 in 5ab4c64

self._soup = BeautifulSoup(content, features="html.parser")

ysard · 2022-02-04T01:16:42Z

To clarify the discussion, let's take the example of a template commonly accepted as flawless:
https://github.com/Pelican-Elegant/elegant/blob/master/templates/article.html

On the subject of interest, here is the usage of the title metadata of a pelican article:

The content of the article stricto-sensu is displayed below:
https://github.com/Pelican-Elegant/elegant/blob/2a689472e42e159b3de2ff5b7bfd5434a9cc63d4/templates/article.html#L62

Two options from here:

- either you choose the structure (#, ##, ###, etc.) and the <h1> level title is buried in the body/content of the article;
  it becomes non-customizable, in the sense that other elements managed by the template (gallery, author, number of comments, date, etc.) can no longer appear between the title and the content of the article, unless you insert Jinja tags in each Markdown file (I don't even know if that works);
- or you choose the structure (##, ###, etc.) and let the template manage the header part of the article.
  This is the way it is commonly applied in Pelican themes,
  and for once the common practice is acceptable because it's the only way to properly handle an article header.

Without this, such an example is to my knowledge impossible to achieve (I admit I am not an expert in Jinja templates, maybe the problem is there):

But I'm digressing because as you mention, this is a matter for both the template creator and the article writer.
This is indeed up to everyone's free will to do whatever they want with their tags, or to create valid code or not.

That said, in any case the SEO Report plugin only counts <h1> elements in the content attribute of an article,
i.e. in a blob of HTML that is only a fragment of the final page.
This in no way guarantees that the number of <h1> tags counted is correct.

This is why I point out that ContentTitleAnalyzer(content=xxx) should only work on the final result of the page,
that is, at a later stage than the current all_generators_finalized signal, which only
gives access to the HTML blob of the article body.

MaevaBrunelles · 2022-02-08T23:48:11Z

Ok, got it, you're right.

Moreover, in the template you linked, even without considering the structure of the article content in Markdown/reST, there are already two h1 tags: https://github.com/Pelican-Elegant/elegant/blob/2a689472e42e159b3de2ff5b7bfd5434a9cc63d4/templates/article.html#L35 and https://github.com/Pelican-Elegant/elegant/blob/2a689472e42e159b3de2ff5b7bfd5434a9cc63d4/templates/article.html#L35.

Once the plugin is updated, the SEO report will surely be red on the h1 analysis for all sites using such a template, but at least the analysis will have been done correctly.

From what I quickly tested, and as you suggested, seo_report must be launched when the content_written signal is triggered, which means opening each HTML file, parsing it with BeautifulSoup and running the analysis. I don't know if there is a better way.

Do you want to make the modifications yourself? If not, I will wait for your current PR to be merged before working on this, as my changes would surely conflict with it.

ysard mentioned this issue Feb 4, 2022

Pelican settings #56

Merged
