feature_request(plugin): posthtml-declaring-language #11

Open
Kristinita opened this issue Oct 7, 2020 · 0 comments

Kristinita commented Oct 7, 2020

1. Summary

It would be nice if it were possible to declare natural languages in HTML automatically.

I couldn’t find any tool, in any programming language, that does this.

2. Example of desired behavior

For a Russian article:

2.1. Input

<html>
<p>This is English text in the Russian article</p>

2.2. Output

<html lang="ru">
<p lang="en">This is English text in the Russian article</p>

3. Argumentation

3.1. W3C

From the official World Wide Web Consortium site:

Always use a language attribute on the html tag to declare the default language of the text in the page. When the page contains content in another language, add a language attribute to an element surrounding that content.

3.2. Automation

It would be nice to automate this wherever possible. The W3C validator flags a missing lang attribute on the html tag, so we currently have to set it manually each time, which wastes our time.

A fortiori, manually setting the lang attribute throughout multilingual texts takes a lot of time.

4. Dependencies

To solve this problem, we need a dependency for natural language detection. For example, the Franc Node.js library is intended for this and has a simple API.
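
A minimal sketch of how such a plugin might be structured, assuming a PostHTML-style tree (an array of strings and `{ tag, attrs, content }` nodes). The names `declareLanguage`, `ownText`, and `guessByScript` are hypothetical. `guessByScript` is a crude stand-in that only separates Cyrillic from Latin script, and a real implementation would call a detector such as Franc instead.

```javascript
// Hypothetical stand-in detector: counts Cyrillic vs. Latin letters.
// A real plugin would delegate to a library such as Franc here.
function guessByScript(text) {
  const cyrillic = (text.match(/[\u0400-\u04FF]/g) || []).length;
  const latin = (text.match(/[A-Za-z]/g) || []).length;
  if (cyrillic === 0 && latin === 0) return null; // nothing to judge
  return cyrillic > latin ? 'ru' : 'en';
}

// Collect the text directly inside a node (not inside child elements).
function ownText(node) {
  return [].concat(node.content || [])
    .filter((child) => typeof child === 'string')
    .join(' ');
}

// Hypothetical plugin factory: `defaultLang` goes on <html>, and any element
// whose own text is detected as a different language gets its own `lang`.
// Existing lang attributes are never overwritten.
function declareLanguage({ defaultLang, detect = guessByScript }) {
  const setLang = (node, lang) => {
    node.attrs = { ...node.attrs, lang };
    return lang;
  };
  const visit = (nodes, inherited) => {
    for (const node of nodes) {
      if (!node || typeof node !== 'object' || !node.tag) continue;
      let lang = node.attrs && node.attrs.lang;
      if (!lang && node.tag === 'html') lang = setLang(node, defaultLang);
      if (!lang) {
        const detected = detect(ownText(node));
        if (detected && detected !== inherited) lang = setLang(node, detected);
      }
      visit([].concat(node.content || []), lang || inherited);
    }
  };
  return (tree) => {
    visit(tree, defaultLang);
    return tree;
  };
}

// Usage on a hand-built tree that mirrors the example input from section 2.1.
const tree = [
  { tag: 'html', content: [
    '\n',
    { tag: 'p', content: ['This is English text in the Russian article'] },
  ] },
];
declareLanguage({ defaultLang: 'ru' })(tree);
```

The detector is passed in as a parameter so that the tree traversal stays independent of whichever detection library is eventually chosen.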

5. Possible problems

5.1. Element content and attribute values in different languages

See an example from the official W3C site:

  • ✖ Bad code. Don’t copy!

    <a lang="es" title="Spanish" href="qa-html-language-declarations.es">Español</a>
  • Valid code:

    <span title="Spanish"><a lang="es" href="qa-html-language-declarations.es">Español</a></span>

posthtml-declaring-language should detect bad code like that in the example, where lang="es" also applies to the English title value.
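
One way this check could look, as a sketch only: flag any element that declares `lang` and also carries a `title` whose detected language differs. It assumes PostHTML-style nodes of the form `{ tag, attrs, content }`. The names `findTitleMismatches` and `scriptOf` are hypothetical. `scriptOf` only tells Cyrillic from Latin script, so the sample uses a Russian analogue of the W3C example; telling Spanish from English would need a real detector.

```javascript
// Hypothetical stand-in detector: separates Cyrillic from Latin script only.
function scriptOf(text) {
  if (/[\u0400-\u04FF]/.test(text)) return 'ru';
  if (/[A-Za-z]/.test(text)) return 'en';
  return null;
}

// Return every element that has both `lang` and `title` where the title's
// detected language differs from the declared one.
function findTitleMismatches(nodes, detect = scriptOf, found = []) {
  for (const node of [].concat(nodes || [])) {
    if (!node || typeof node !== 'object') continue; // skip text nodes
    const { lang, title } = node.attrs || {};
    if (lang && title) {
      const titleLang = detect(title);
      // Compare primary subtags only, so "en-GB" matches "en".
      if (titleLang && titleLang !== lang.split('-')[0].toLowerCase()) {
        found.push(node);
      }
    }
    findTitleMismatches(node.content, detect, found);
  }
  return found;
}

const sample = [
  // Bad: lang="ru" also covers the English title.
  { tag: 'a', attrs: { lang: 'ru', title: 'Russian', href: '#ru' }, content: ['Русский'] },
  // Valid: the title sits on a wrapper without the lang attribute.
  { tag: 'span', attrs: { title: 'Russian' }, content: [
    { tag: 'a', attrs: { lang: 'ru', href: '#ru' }, content: ['Русский'] },
  ] },
];
const mismatches = findTitleMismatches(sample);
```

Here `mismatches` contains only the first `a` element. Whether the plugin should merely warn or rewrite the markup into the wrapped form is a separate design decision.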

5.2. Incorrect language detection

5.2.1. Detection quality

I haven’t tested Franc or any other Node.js tool, but I use the Python library cld2-cffi for natural language detection in real books, and I’m getting good results. For example, see my issue and the reply in another repository: cld2-cffi detects the natural language well for physics and chemistry books.

5.2.2. Limiting the number of languages

In my case, in the vast majority of situations I need lang="en" or lang="ru". Tools such as Franc and cld2-cffi may have difficulty determining whether a short text between tags is Russian or Ukrainian. But they shouldn’t have a problem distinguishing Russian from English.

It would be nice to have a languages option in the posthtml-declaring-language plugin. If the values of this option are en and ru, the plugin will automatically add lang="en" and lang="ru" to tags whose text it has identified, with a high degree of probability, as English or Russian. In this case, if we need lang="uk", we have to add it to our HTML markup manually.
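
A possible shape for the languages option, as a sketch: given ranked candidates from a detector, keep only the allowed languages and accept the top one only if it clearly beats the runner-up. The input format of `[code, score]` pairs with ISO 639-3 codes is modeled on what Franc's `francAll` is documented to return, but this is an assumption and not tested against Franc here. The name `pickLanguage`, the `minMargin` parameter, and the mapping table are illustrative.

```javascript
// Illustrative mapping from ISO 639-3 codes (as some detectors report them)
// to the language tags used in the lang attribute.
const ISO3_TO_LANG = { eng: 'en', rus: 'ru', ukr: 'uk' };

// Hypothetical helper: `candidates` is a ranked list of [iso3, score] pairs,
// best first. Returns a lang value, or null when no allowed language is
// present or the best allowed candidate does not win by `minMargin`.
function pickLanguage(candidates, { languages, minMargin = 0.1 }) {
  const allowed = candidates
    .map(([iso3, score]) => [ISO3_TO_LANG[iso3], score])
    .filter(([lang]) => languages.includes(lang));
  if (allowed.length === 0) return null;
  const [[best, bestScore], runnerUp] = allowed;
  // With a single allowed candidate there is no rival to compare against.
  if (runnerUp && bestScore - runnerUp[1] < minMargin) return null;
  return best;
}

const options = { languages: ['en', 'ru'] };

// Russian and Ukrainian score almost the same, but Ukrainian is not in the
// list, so Russian only has to beat English and is accepted.
const clearCase = pickLanguage([['rus', 1], ['ukr', 0.97], ['eng', 0.2]], options);

// English and Russian are too close to call, so no lang is declared.
const closeCase = pickLanguage([['eng', 1], ['rus', 0.95]], options);
```

Restricting the candidate list removes the Russian/Ukrainian ambiguity for short texts, at the cost that Ukrainian passages must be marked up by hand, as described above.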

Thanks.
