Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable pandoc spellcheck filter #333

Merged
merged 12 commits into from
Apr 29, 2020
Merged

Conversation

agitter
Copy link
Member

@agitter agitter commented Apr 20, 2020

References manubot/manubot#31
Initial implementation from agitter/my-manuscript#15

There are a few open questions to discuss:

  • Where should this be documented (SETUP.md?)
  • Should it be enabled in AppVeyor builds by default (I vote yes)
  • Should it be enabled in Travis CI builds by default?
  • What should be in the initial custom dictionary?

I've noticed that many spelling errors come from 00.front-matter.md. We could automatically add all of the manuscript metadata to the custom dictionary, but then we would ignore typos in affiliations or titles. I would prefer to save that for a separate pull request even if we do want it.

@AppVeyorBot
Copy link

AppVeyor build 1.0.107 for commit 2f30622 is now complete.

Found 74 potential spelling error(s). Preview:adipiscing
aliqua
aliquip
amet
anim
aute
cillum
commodo
consectetur
consequat
culpa
cupidatat
deserunt
DOI
dolore
Duis
eiusmod
elit
enim
eq
esse
et
eu
Excepteur
exercitation
fugiat
incididunt
ipsum
irure
isbn
janeroe
johndoe
labore
laboris
laborum
LaTeX
Lorem
magna
manubot
mollit
nisi
nostrud
nulla
occaecat
officia
ORCID
pariatur
permalink
PMC
pmcid
pmid
proident
PubMed
qui
quis
reprehenderit
rootstock
s
sed
sint
Striketh...
The rendered manuscript from this build is temporarily available for download at:

@agitter
Copy link
Member Author

agitter commented Apr 20, 2020

The first draft of the spellcheck pull request is working in AppVeyor and Travis CI. Travis CI doesn't store the spelling errors as an artifact or add them as a pull request comment. However, it can dump them in the build log.

@dhimmel
Copy link
Member

dhimmel commented Apr 21, 2020

Really great progress! Will help catch spelling errors hopefully.

Did you look into detecting the language set in metadata.lang and using it for aspell?

It'd be nice to show spellcheck output as a github actions step. Then it'd be easy to expand that section of the output to check for spelling errors. Was the motivation for excluding Actions that it doesn't use ci/install.sh?

Should it be enabled in AppVeyor builds by default (I vote yes)

I think so. The only concern I'd have is non-english manuscripts, but it's not too hard to turn off or switch the language.

Should it be enabled in Travis CI builds by default?

Doesn't matter. I view Travis CI support mostly for backwards compatibility at this point and less about new features.

What should be in the initial custom dictionary?

From agitter/my-manuscript#15 (comment), it looks like lot's of manubot related terms cause problems: LaTeX, ORCID, permalink... not sure how many are because this is on delete-me which uses words like DOI outside of citekeys.

@agitter
Copy link
Member Author

agitter commented Apr 24, 2020

Did you look into detecting the language set in metadata.lang and using it for aspell?

I haven't, but I will. I believe this would involve downloading the appropriate aspell dictionary with apt-get and then specifying the language in the ASPELL_CONF variable.

Was the motivation for excluding Actions that it doesn't use ci/install.sh?

I initially planned to only set it up for AppVeyor because it can preview spelling errors in the pull request comment. If we enable it for GitHub Actions, would you recommend a new conditional step between Install Environment and Build Manuscript?

I should be able to come up with a reasonable custom dictionary based on the errors from delete-me.

@agitter
Copy link
Member Author

agitter commented Apr 24, 2020

It might be better to save language detection for a follow up pull request. Aspell dictionary files use ISO 639 language codes. An exception is Portuguese, which has pt_BR and pt_PT, which are not standard. Pandoc follows the BCP 47 standard for languages, and I haven't yet assessed their differences.

The first step would be reading the language from metadata.lang. Then, we could install the requested language dictionary and set that as the main dictionary with Asepll's master setting. However, there would be several special cases, string manipulation, and error handling required because.

@dhimmel
Copy link
Member

dhimmel commented Apr 27, 2020

If we enable it for GitHub Actions, would you recommend a new conditional step between Install Environment and Build Manuscript?

I was thinking a conditional step that writes the spellcheck errors file to standard output (if the file exists. But whatever you think is best (including nothing) is okay.

Copy link
Member

@dhimmel dhimmel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agitter, this looks good to me. Up to you if you want to pursue anything else in this PR or leave it for later.

build/build.sh Outdated
# Spellcheck
if [ "${SPELLCHECK:-}" = "true" ]; then
export ASPELL_CONF="add-extra-dicts $(pwd)/build/assets/custom-dictionary.txt"
pandoc --lua-filter spellcheck.lua output/manuscript.md | sort | uniq > output/spelling-errors.txt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change

pandoc --lua-filter spellcheck.lua output/manuscript.md | sort | uniq > output/spelling-errors.txt

to the following line:

pandoc --lua-filter spellcheck.lua output/manuscript.md | uniq | while read word; do grep -on "\<$word\>" output/manuscript.md; done | sort -h > output/spelling-errors.txt

^ This adds line numbers to each misspelled word. Would be useful to identify the exact spot of the error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea @danich1. I tested that in 14215d3 and the output is below.

Is this line numbering useful enough? The line numbers correspond to output.md, but authors will have to edit the original *.md files in the content subdirectory to correct the spelling errors.

Copy link
Contributor

@danich1 danich1 Apr 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. This fix appends file names at the beginning of each "misspelled" word:

pandoc --lua-filter spellcheck.lua output/manuscript.md | uniq | while read word; do grep -on "\<$word\>" content/*md; done | sort -h -t ":" -k 1b,1 -k2,2 > output/spelling-errors.txt

^ Assuming the desire is to only spell check md files.

Copy link
Member Author

@agitter agitter Apr 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @danich1. I added that in a63502f but the CI build failed. I like this solution, but I don't think I'll be able to debug locally until tomorrow.

Edit: the failure is possibly related to the pipefail option

Co-authored-by: David Nicholson <dnicholson329@gmail.com>
@AppVeyorBot
Copy link

AppVeyor build 1.0.108 for commit 14215d3 is now complete.

Found 118 potential spelling error(s). Preview:46:XXXX
46:XXXX
46:XXXX
46:XXXX
48:johndoe
56:XXXX
56:XXXX
56:XXXX
56:XXXX
118:XXXX
118:XXXX
118:XXXX
118:XXXX
118:XXXX
118:XXXX
118:XXXX
118:XXXX
120:johndoe
120:johndoe
122:johndoe
122:johndoe
125:XXXXXXXX
130:XXXX
130:XXXX
130:XXXX
130:XXXX
130:XXXX
130:XXXX
130:XXXX
130:XXXX
132:janeroe
132:janeroe
187:adipiscing
187:aliqua
187:amet
187:consectetur
187:dolore
187:eiusmod
187:elit
187:incididunt
187:ipsum
187:labore
1...
The rendered manuscript from this build is temporarily available for download at:

agitter and others added 3 commits April 27, 2020 21:47
Co-authored-by: David Nicholson <dnicholson329@gmail.com>
Make dictionary lowercase now that case is ignored
@agitter agitter changed the title Enable pandoc spellcheck filter in AppVeyor Enable pandoc spellcheck filter Apr 28, 2020
@AppVeyorBot
Copy link

AppVeyor build 1.0.110 for commit 24241ee failed.

@agitter
Copy link
Member Author

agitter commented Apr 28, 2020

Up to you if you want to pursue anything else in this PR or leave it for later.

Thanks for re-reviewing @dhimmel. My plan is to get @danich1's line numbering idea to work and then merge this. I added some light documentation to USAGE.md.

I'll close manubot/manubot#31 after merging this and create a new rootstock issue with the remaining spellcheck improvements, including GitHub Actions.

grep exits with 1 if there is no match found,
which makes the pipeline fail due to the pipefail
option
Also make grep case insensitive
@AppVeyorBot
Copy link

AppVeyor build 1.0.111 for commit 311269d failed.

@AppVeyorBot
Copy link

AppVeyor build 1.0.112 for commit a63cb12 is now complete.

Found 58 potential spelling error(s). Preview:content/02.delete-me.md:44:adipiscing
content/02.delete-me.md:44:aliqua
content/02.delete-me.md:44:amet
content/02.delete-me.md:44:consectetur
content/02.delete-me.md:44:dolore
content/02.delete-me.md:44:eiusmod
content/02.delete-me.md:44:elit
content/02.delete-me.md:44:incididunt
content/02.delete-me.md:44:ipsum
content/02.delete-me.md:44:labore
content/02.delete-me.md:44:Lorem
content/02.delete-me.md:44:magna
content/02...
The rendered manuscript from this build is temporarily available for download at:

@agitter
Copy link
Member Author

agitter commented Apr 28, 2020

The build passes now, but I get different output from the Travis CI and AppVeyor builds for a63cb12.

Edit:
This is probably due to different versions of the Aspell English dictionary:
Travis CI: Unpacking aspell-en (7.1-0-1) ...
AppVeyro: Unpacking aspell-en (2017.08.24-0-0.1) ...

I okay with that difference because it isn't due to our usage of the spellcheck filter.

@AppVeyorBot
Copy link

AppVeyor build 1.0.113 for commit 8dee14c is now complete.

Found 58 potential spelling error(s). Preview:content/02.delete-me.md:44:adipiscing
content/02.delete-me.md:44:aliqua
content/02.delete-me.md:44:amet
content/02.delete-me.md:44:consectetur
content/02.delete-me.md:44:dolore
content/02.delete-me.md:44:eiusmod
content/02.delete-me.md:44:elit
content/02.delete-me.md:44:incididunt
content/02.delete-me.md:44:ipsum
content/02.delete-me.md:44:labore
content/02.delete-me.md:44:Lorem
content/02.delete-me.md:44:magna
content/02...
The rendered manuscript from this build is temporarily available for download at:

@agitter agitter merged commit 2c28689 into manubot:master Apr 29, 2020
@agitter
Copy link
Member Author

agitter commented Apr 29, 2020

Using this with a complex manuscript revealed some bugs: greenelab/covid19-review#261

I'll continue testing there. In the meantime, I don't recommend this for general use in other manuscripts.

ploegieku added a commit to ploegieku/2023-functional-homology-paper that referenced this pull request Aug 6, 2024
merges manubot/rootstock#333

Installs and runs aspell via the pandoc spellcheck filter when the
SPELLCHECK environment variable is true. Spellcheck outputs
candidate spelling errors and the filenames and line numbers
in which they occur.  Initially only supported in Travis CI and
AppVeyor builds.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants