MDL crash/fails when markdown file contains UTF-8 characters #502

Open
RochaStratovan opened this issue May 15, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@RochaStratovan

Description

Running mdl against a markdown file that contains utf-8 characters causes it to fail/crash.

Environment

Ubuntu 20 Linux docker container running in GitLab pipeline.

MDL Version
0.12.0

Expected Behavior

It should process the UTF-8 characters/file without a problem.

Actual Behavior

It fails/crashes with the following error output

$ mdl --git-recurse .
/var/lib/gems/2.7.0/gems/mdl-0.12.0/lib/mdl/doc.rb:39:in `split': invalid byte sequence in UTF-8 (ArgumentError)
	from /var/lib/gems/2.7.0/gems/mdl-0.12.0/lib/mdl/doc.rb:39:in `initialize'
	from /var/lib/gems/2.7.0/gems/mdl-0.12.0/lib/mdl/doc.rb:52:in `new'
	from /var/lib/gems/2.7.0/gems/mdl-0.12.0/lib/mdl/doc.rb:52:in `new_from_file'
	from /var/lib/gems/2.7.0/gems/mdl-0.12.0/lib/mdl.rb:90:in `block in run'
	from /var/lib/gems/2.7.0/gems/mdl-0.12.0/lib/mdl.rb:82:in `each'
	from /var/lib/gems/2.7.0/gems/mdl-0.12.0/lib/mdl.rb:82:in `run'
	from /var/lib/gems/2.7.0/gems/mdl-0.12.0/bin/mdl:10:in `<top (required)>'
	from /usr/local/bin/mdl:23:in `load'
	from /usr/local/bin/mdl:23:in `<main>'

Replication Case

Run mdl against a file such as the following:
README.md

It renders fine as illustrated in the screen shot from GitLab.

image

@RochaStratovan RochaStratovan added the bug Something isn't working label May 15, 2024
@nbehrnd
Contributor

nbehrnd commented May 16, 2024

@RochaStratovan Can you update/request an update to MDL 0.13.0, released in October 2023?

With this version in hand, both your README.md file as well as a toy test file (cf. archive attached below) don't report a problem.

2024-05-16_test_mdl.zip

@RochaStratovan
Author

Will do. Thank you.

@RochaStratovan
Author

Hmmmmmm..... so I agree it doesn't happen for the README.md file I posted. I was able to reproduce the failure with that file in my environment, and now with MDL 0.13.0 it passes.

However, it is still failing with my full README.md file. I'm trying to figure out more to share with you.

@RochaStratovan
Author

@nbehrnd,

It seems like it's getting a UTF-8 failure on a different README.md file now.

The problem no longer happens for that "small" example, but it's still happening on my larger files. I started the "minification" process again to find the problem.


Updated file:
README_new.md

Updated failure message

rocha@e20c13008e8e:~/JRRTEST2$ mdl README_new.md
Traceback (most recent call last):
        9: from /usr/local/bin/mdl:23:in `<main>'
        8: from /usr/local/bin/mdl:23:in `load'
        7: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/bin/mdl:10:in `<top (required)>'
        6: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl.rb:83:in `run'
        5: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl.rb:83:in `each'
        4: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl.rb:91:in `block in run'
        3: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:52:in `new_from_file'
        2: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:52:in `new'
        1: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:39:in `initialize'
/var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:39:in `split': invalid byte sequence in UTF-8 (ArgumentError)

Updated GitLab Rendered Output with what I think are the problem characters highlighted in yellow

utf-8-new

What it looks like in a vi session, with yellow highlights again

utf-8-new-vi

@nbehrnd
Contributor

nbehrnd commented May 17, 2024

@RochaStratovan I was able to replicate your findings.

the background of the story:

The cause is the character encoding, i.e. the code page your operating system or editor uses. Originally there was 7-bit ASCII, which can store only 2^7 = 128 characters (some non-visible control characters, A-Z, a-z, 0-9, and a few special characters), enough for US American English. Because that is not enough to cover other languages and scripts, Unicode encodings are the better choice today. When working with contemporary Python, you may encounter lines like

#!/usr/bin/env python3
# -*- coding: utf-8 -*- 
import os
records = []

with open("example.txt", mode="r", encoding="utf-8") as source:
    records = source.readlines()

to be explicit about the encoding of the Python script file itself (line 2), and/or about the file the script processes (line 6). UTF-8 is very common, but it is not the only Unicode encoding around (there are also UTF-16 and UTF-32, for instance). Between ASCII and Unicode there are many other code tables, which may depend on the language/script as well as on the release and setup of the operating system or editor used. However, I wouldn't consider this a bug related to markdownlint.

The character in particular here is the (R) / ®.
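
To make the byte-level difference concrete, here is a small sketch in Python. It assumes the file was saved in ISO-8859-1 (or a compatible Windows code page), where ® is the single byte 0xAE; the helper name `is_valid_utf8` is made up for this illustration and has nothing to do with mdl's own code.

```python
# Illustration only: how the registered sign is stored in two encodings.
registered = "\u00ae"  # the ® character

latin1_bytes = registered.encode("iso-8859-1")  # a single byte: 0xAE
utf8_bytes = registered.encode("utf-8")         # two bytes: 0xC2 0xAE


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(latin1_bytes, is_valid_utf8(latin1_bytes))
print(utf8_bytes, is_valid_utf8(utf8_bytes))
```

A lone 0xAE byte is not a valid UTF-8 sequence, which matches the "invalid byte sequence in UTF-8" message in the traceback.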

how to prevent this obstacle with files created in future

Check the editor you use and switch it to UTF-8. Judging by your screenshot, I presume Windows is one of the operating systems you use. In Notepad++ (project page, entry on portableapps) you can set this parameter here:

npp

In case you prefer cross-platform geany (project page, entry portableapps), go Edit -> Preferences, tab Files:

geany

These two serve only as examples; feel free to use the editor which suits your needs best. Equally, it might be worth double-checking (presuming you use git on Windows) whether your git installation is set up to use Linux line endings. (This is one of the questions on an early pane during the installation.)

how to resolve the current obstacle

You have to edit the files in question, which requires i) identifying them in the first place, and ii) adjusting the code page used for them. The following approach requires some basic Unix/Linux commands; if you don't have access to Ubuntu, Debian, SUSE, Fedora, etc., you can equally resort to the minimal (Bash) shell provided by, e.g., TortoiseGit via its pull-down menu.

  • step 1: using the minimal git shell, enter the folder with the files to be checked. It may take a couple of cd commands to change into the corresponding directory.

  • step 2: run, e.g., file *.md to check all Markdown files in the current folder (subfolders are not searched).

    $ file *.md
    backup.md:      ISO-8859 text, with very long lines (456)
    no_r_backup.md: ASCII text, with very long lines (456)
    out.md:         Unicode text, UTF-8 text, with very long lines (456)
    README_new.md:  ISO-8859 text, with very long lines (456)

    The listing above shows, in addition to your file (note the ISO-8859), an unchanged backup to work with, one copy where I manually removed the ®, and one converted to UTF-8.

  • step 3: to convert the code page, there is the iconv utility. For each file, you state the current encoding (-f, as in "from ..."), the new encoding you want to convert to (-t), and where to save the resulting output. If you have access to a Linux installation, the pattern is

    iconv -f ISO-8859-1 -t UTF-8//TRANSLIT input.file -o out.file

    I also attempted a conversion on an old Windows installation with the minimal Bash shell by TortoiseGit and noticed that the -o flag did not work well. Instead, I had to redirect the result into a new file, i.e., a pattern of

    iconv -f ISO-8859-1 -t UTF-8//TRANSLIT input.file > out.file

    Personally, I prefer a conversion that writes a new file first (which can be checked) over an approach which attempts to overwrite the original file automatically (which can cause you to lose the file in question for good). The transliteration (//TRANSLIT) can possibly be dropped if you convert files between encodings of the same script (for instance Latin). If there are multiple files to adjust, then the small bash script provided here might be helpful.
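
If neither file nor iconv is at hand, steps 2 and 3 can be approximated in Python. This is only a sketch under assumptions: it treats every *.md file that does not decode as UTF-8 as ISO-8859-1 (which is what file suggested above, but may not hold for other files), and the function names and the `_utf8` suffix are invented for this example. As with the iconv pattern, it writes a new file and leaves the original untouched.

```python
from pathlib import Path
from typing import List


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def convert_copy(source: Path, source_encoding: str = "iso-8859-1") -> Path:
    """Write a UTF-8 copy next to `source`; the original is not modified.

    `source_encoding` is an assumption; adjust it to what `file` reports.
    """
    text = source.read_bytes().decode(source_encoding)
    target = source.with_name(source.stem + "_utf8" + source.suffix)
    # Bytes in, bytes out: line endings are carried over unchanged.
    target.write_bytes(text.encode("utf-8"))
    return target


def convert_folder(folder: Path) -> List[Path]:
    """Convert every *.md file in `folder` (not subfolders) that is not valid UTF-8."""
    converted = []
    for path in sorted(folder.glob("*.md")):
        if not is_valid_utf8(path.read_bytes()):
            converted.append(convert_copy(path))
    return converted
```

Calling `convert_folder(Path("."))` in the folder from step 1 returns the list of newly written files, which can then be inspected (and run through mdl) before replacing the originals.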

@RochaStratovan
Author

Hello @nbehrnd,

Thank you for the detailed analysis and answer. I understand the problem; however, I don't agree with what I think you are proposing as the solution. I believe you are suggesting that in order to prevent MDL from crashing, we should modify the input tools.

First, this isn't really a solution that scales. We have many developers that contribute to the documentation at our company and they use various tools such as:

  1. vi
  2. emacs
  3. visual studio text editor
  4. notepad[++]

just to name a few.

Second, I would categorize this as an issue with MDL. It crashes on text files that standard text editors can handle. When my devs and I see this crash, it's an MDL error. I agree as a workaround they could scan the text file to find the symbols that MDL is crashing on, but that doesn't take away from the fact that this is an issue with the MDL parsing logic.

MDL is a great tool. It just needs a few fixes such as this to be a bit more robust.
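
As a rough illustration of what "more robust" could mean, here is a sketch in Python of a reader that tries UTF-8 first and falls back to another encoding instead of raising. This is not mdl's code (mdl is written in Ruby), and both the function name and the choice of ISO-8859-1 as the fallback are assumptions made only for this example.

```python
from typing import Tuple


def decode_with_fallback(data: bytes, fallback: str = "iso-8859-1") -> Tuple[str, str]:
    """Return (text, encoding_used): try UTF-8 first, then the fallback encoding."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a character, so this cannot fail,
        # but the result is only correct if the file really uses that encoding.
        return data.decode(fallback), fallback
```

A tool following this pattern could report the fallback as a warning naming the file, rather than aborting with a stack trace.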

@nbehrnd
Contributor

nbehrnd commented May 20, 2024 via email
