MDL crash/fails when markdown file contains UTF-8 characters #502

RochaStratovan · 2024-05-15T22:28:02Z

Description

Running mdl against a markdown file that contains utf-8 characters causes it to fail/crash.

Environment

Ubuntu 20 Linux docker container running in GitLab pipeline.

MDL Version
0.12.0

Expected Behavior

It should process the UTF-8 characters/file without a problem.

Actual Behavior

It fails/crashes with the following error output

$ mdl --git-recurse .
/var/lib/gems/2.7.0/gems/mdl-0.12.0/lib/mdl/doc.rb:39:in `split': invalid byte sequence in UTF-8 (ArgumentError)
	from /var/lib/gems/2.7.0/gems/mdl-0.12.0/lib/mdl/doc.rb:39:in `initialize'
	from /var/lib/gems/2.7.0/gems/mdl-0.12.0/lib/mdl/doc.rb:52:in `new'
	from /var/lib/gems/2.7.0/gems/mdl-0.12.0/lib/mdl/doc.rb:52:in `new_from_file'
	from /var/lib/gems/2.7.0/gems/mdl-0.12.0/lib/mdl.rb:90:in `block in run'
	from /var/lib/gems/2.7.0/gems/mdl-0.12.0/lib/mdl.rb:82:in `each'
	from /var/lib/gems/2.7.0/gems/mdl-0.12.0/lib/mdl.rb:82:in `run'
	from /var/lib/gems/2.7.0/gems/mdl-0.12.0/bin/mdl:10:in `<top (required)>'
	from /usr/local/bin/mdl:23:in `load'
	from /usr/local/bin/mdl:23:in `<main>'

Replication Case

Run mdl against a file such as the following:
README.md

It renders fine as illustrated in the screen shot from GitLab.


nbehrnd · 2024-05-16T07:50:25Z

@RochaStratovan Can you update/request an update to MDL 0.13.0, released in October 2023?

With this version in hand, both your README.md file as well as a toy test file (cf. archive attached below) don't report a problem.

2024-05-16_test_mdl.zip

RochaStratovan · 2024-05-16T14:59:09Z

Will do. Thank you.

RochaStratovan · 2024-05-16T17:05:12Z

Hmmmmmm..... so I agree it doesn't happen for the README.md file I posted. I was able to reproduce the original failure within my environment with that file, and now with MDL 0.13.0 it passes.

However, it is still failing with my full README.md file. I'm trying to figure out more to share with you.

RochaStratovan · 2024-05-16T17:30:45Z

@nbehrnd,

It seems like it's getting a UTF-8 failure on a different README.md file now.

The problem no longer happens for that "small" example, but it's still happening on my larger files. I started the "minification" process again to find the problem.

Updated file:
README_new.md

Updated failure message

rocha@e20c13008e8e:~/JRRTEST2$ mdl README_new.md
Traceback (most recent call last):
        9: from /usr/local/bin/mdl:23:in `<main>'
        8: from /usr/local/bin/mdl:23:in `load'
        7: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/bin/mdl:10:in `<top (required)>'
        6: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl.rb:83:in `run'
        5: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl.rb:83:in `each'
        4: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl.rb:91:in `block in run'
        3: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:52:in `new_from_file'
        2: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:52:in `new'
        1: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:39:in `initialize'
/var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:39:in `split': invalid byte sequence in UTF-8 (ArgumentError)

Updated GitLab Rendered Output with what I think are the problem characters highlighted in yellow

What it looks like in a vi session, with yellow highlights again

nbehrnd · 2024-05-17T16:07:34Z

@RochaStratovan I was able to replicate your findings.

the background of the story:

The cause is character encoding, that is, which code page the operating system or the editor uses. Originally, there was 7-bit ASCII, which can store only 2^7 = 128 characters (some non-visible control characters, A-Z, a-z, 0-9, and a few special characters), enough for US American English. Because that is not enough to cover other languages and other scripts, Unicode encodings are the better choice today. When working with contemporary Python, you may encounter lines like

#!/usr/bin/env python3
# -*- coding: utf-8 -*- 
import os
records = []

with open("example.txt", mode="r", encoding="utf-8") as source:
    records = source.readlines()

to be explicit about the encoding of the Python script file itself (line 2), and/or about the file the script processes (line 6). UTF-8 is very common, but it is not the only Unicode encoding around (there are also UTF-16 and UTF-32, for instance). Between ASCII and Unicode there are many other code tables, which may depend on the language/script as well as on the release and setup of the operating system/editor used. However, I wouldn't consider this a bug related to markdownlint.

The particular character here is the registered sign, (R) / ®.
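To illustrate why this single character is enough to trigger "invalid byte sequence in UTF-8", here is a minimal Python sketch. It is only an illustration and is not taken from mdl's code; the helper name `is_valid_utf8` is hypothetical. ISO-8859-1 stores ® as the single byte 0xAE, whereas UTF-8 stores it as the two bytes 0xC2 0xAE, so a lone 0xAE cannot be decoded as UTF-8.

```python
# The registered sign (U+00AE) as stored by ISO-8859-1 and by UTF-8.
latin1_bytes = "\u00ae".encode("iso-8859-1")  # one byte: 0xAE
utf8_bytes = "\u00ae".encode("utf-8")         # two bytes: 0xC2 0xAE


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without errors (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(latin1_bytes.hex(), is_valid_utf8(latin1_bytes))  # ae False
print(utf8_bytes.hex(), is_valid_utf8(utf8_bytes))      # c2ae True
```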

how to prevent this obstacle with files created in future

Check that the editor you use is set to UTF-8. Judging by your screen photo, I presume Windows is one of the operating systems (or the one) you use. In the case of Notepad++ (project page, entry on portableapps), you can set this parameter here:

In case you prefer the cross-platform Geany (project page, entry on portableapps), go to Edit -> Preferences, tab Files:

The two serve only as examples; feel free to use the editor which suits your needs best. Equally, it might be worth double-checking (presuming you use git from Windows) whether your instance of git is set up to use Linux line endings. (This is one of the questions on an early pane during the installation.)

how to resolve the current obstacle

You have to edit the files in question, which requires i) identifying them in the first place, and ii) adjusting the code page used for them. The following approach requires some basic Unix/Linux commands; in case you don't have access to Ubuntu, Debian, SUSE, Fedora, etc., you can equally resort to the minimal (Bash) shell provided, e.g., by TortoiseGit via its pull-down menu.

step 1: using the minimal git shell, enter the folder with the files to be checked. It may take a couple of cd commands to change into the corresponding directory.
step 2: run, e.g., `file *.md` to check all Markdown files in the present folder (at the present level of the hierarchy only).
```
$ file *.md
backup.md:      ISO-8859 text, with very long lines (456)
no_r_backup.md: ASCII text, with very long lines (456)
out.md:         Unicode text, UTF-8 text, with very long lines (456)
README_new.md:  ISO-8859 text, with very long lines (456)
```
The listing above shows, in addition to your file (note the ISO-8859), an unchanged backup to work with, one copy where I manually removed the ®, and one converted to UTF-8.
step 3: to convert the code page, there is the iconv utility. For each file, you state the current encoding (-f, as in "from ..."), the new encoding you want to convert to (-t), and where to save the resulting output. If you have access to a Linux installation, the pattern is
```
iconv -f ISO-8859-1 -t UTF-8//TRANSLIT input.file -o out.file
```
I also attempted a conversion in an old installation of Windows with the minimal bash shell by TortoiseGit and noticed that the -o flag did not work well. Instead, I had to redirect the result into a new file, i.e. a pattern of
```
iconv -f ISO-8859-1 -t UTF-8//TRANSLIT input.file > out.file
```
Personally, I prefer a conversion that writes a new file first (which can be checked) over an approach that attempts to overwrite the original file automatically (which can cause you to lose the file in question for good). The transliteration (//TRANSLIT) can possibly be dropped if you convert files from/into an encoding of the identical (for instance Latin) script. If there are multiple files to adjust, then the small bash script provided here might be helpful.
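Where iconv is not at hand, the same conversion can be sketched in Python. This is a minimal sketch, assuming the input really is ISO-8859-1; the function name `convert_to_utf8` and its parameters are hypothetical. Like the redirect pattern, it writes a new file and refuses to overwrite an existing one. It has no equivalent of //TRANSLIT, which is not needed when the target is UTF-8.

```python
from pathlib import Path


def convert_to_utf8(source: Path, target: Path,
                    source_encoding: str = "iso-8859-1") -> None:
    """Re-encode source into a new UTF-8 file; the original stays untouched."""
    if target.exists():
        # Never clobber an existing file; the result should be checked first.
        raise FileExistsError(f"refusing to overwrite {target}")
    # Work on raw bytes so that line endings are passed through unchanged.
    text = source.read_bytes().decode(source_encoding)
    target.write_bytes(text.encode("utf-8"))
```

A call such as `convert_to_utf8(Path("README_new.md"), Path("out.md"))` would then mirror the iconv pattern, after which `out.md` can be inspected before it replaces the original.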

RochaStratovan · 2024-05-20T15:56:45Z

Hello @nbehrnd,

Thank you for the detailed analysis and answer. I understand the problem; however, I don't agree with what I think you are proposing as the solution. I believe you are suggesting that in order to prevent MDL from crashing, we should modify the input tools.

First, this isn't really a solution that scales. We have many developers that contribute to the documentation at our company and they use various tools such as:

vi
emacs
visual studio text editor
notepad[++]

just to name a few.

Second, I would categorize this as an issue with MDL. It crashes on text files that standard text editors can handle. When my devs and I see this crash, it's an MDL error. I agree as a workaround they could scan the text file to find the symbols that MDL is crashing on, but that doesn't take away from the fact that this is an issue with the MDL parsing logic.

MDL is a great tool. It just needs a few fixes such as this to be a bit more robust.

nbehrnd · 2024-05-20T19:31:52Z

It is true that I didn't test how various editors react if they usually use one code page (e.g., ISO 8859-1) and now get an input file written in another, for instance UTF-8. That is: after an intentional edit, will the document be consistently saved with their usual ISO 8859-1, or with the UTF-8 code page?

On the other hand, presuming the code base were hosted on GitHub, I speculate that changing the code page of files eventually managed by git could possibly be automated by one of the CI workflows offered, or by one that one can build and tailor: after local work, one would file the pull request to the repository; prior to a merge, the automated workflow would i) determine the code page, and ii) fix it if necessary, with no manual intervention required. Only after successfully passing this automated step could the merge happen: either after a manual/peer review by the code owner(s), or equally automated (with an additional secret key to deposit) by this workflow setup.

Recently, I became aware of such a format checker as an automated action in the avogadro2 project,[1] which can extend to testing and building executables, etc., too.[2] GitHub compiled information on how to use such an action,[3] which may scale well enough for your work. But perhaps a «local GitHub workflow» suits your needs better, to adhere to local IP policy and to manage performance.

I lack the necessary insight into how `markdownlint` itself could become more robust to process markdown syntax regardless of the code page used.

[1] https://github.com/OpenChemistry/avogadroapp/blob/master/.github/workflows/clang-format-check.yml
[2] https://github.com/openmopac/mopac/blob/main/.github/workflows/CI.yaml
[3] https://docs.github.com/en/actions/examples/using-scripts-to-test-your-code-on-a-runner
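The detection half of such an automated step could be sketched as follows. This is only an illustration under assumptions: the names `find_non_utf8` and `main` are hypothetical, the check only tests whether each *.md file decodes as UTF-8 (it does not guess the actual code page), and how to wire it into a CI workflow is left open.

```python
from pathlib import Path
from typing import List


def find_non_utf8(root: Path, pattern: str = "*.md") -> List[Path]:
    """Return the files below root matching pattern that are not valid UTF-8."""
    offenders = []
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            offenders.append(path)
    return offenders


def main(argv: List[str]) -> int:
    """Report offending files; return 1 if any were found, else 0."""
    root = Path(argv[1]) if len(argv) > 1 else Path(".")
    offenders = find_non_utf8(root)
    for path in offenders:
        print(f"not valid UTF-8: {path}")
    return 1 if offenders else 0
```

A runner could call `sys.exit(main(sys.argv))` so that a non-zero exit status fails the pipeline before mdl is invoked on the files.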

RochaStratovan added the bug label (Something isn't working) · May 15, 2024
