Fix word recognition for spell checker, ignore active partial words #651

Merged

gedakc
Collaborator

@gedakc gedakc commented Sep 19, 2019

This PR restores the functionality that prevents spell checking a word that is being actively typed at the end of a paragraph.

History

A problem with the spell checker being invoked on active partial words (e.g., those being typed, but not yet completed) was fixed with issue #166. However, a subsequent fix for issue #283, which made the spell checker ignore underscores in words, broke that earlier fix. This commit is intended to satisfy both fixes at once.

Goals

The goals for the spell check word match regexp are:

A. Words should include those with an apostrophe
      E.g., can't
B. Words should exclude underscore
      E.g., hello_world is two words
C. Words in other languages should be recognized
      E.g., French word familiarisé
D. Spell check should include word at absolute end of line with no trailing space or punctuation
      E.g., tezt
E. Spell check should ignore partial words in progress (user typing)
      E.g., paragr while midway through typing paragraph

Test Strings

Following are some strings to help test the above goals.

A. Words with apostrophe.  Can't for cannot.

B. Words with underscore: _Italic_, _Three word italic_,  hello_world.

C. French words: familiarisé. système a été installé à partir

D. Spell check word at absolute end of line:  Manuskript tezt

E. Spell check ignore active partial word being typed:  Manuskript test.

Test Results

It took a while to craft a regular expression that addresses all five of the above goals.

Note that the regexp being changed is the definition for WORDS as used in the file manuskript/ui/highlighters/basicHighlighter.py.

# Following algorithm would not check words at the end of line.
# This hacks adds a space to every line where the text cursor is not
# So that it doesn't spellcheck while typing, but still spellchecks at
# end of lines. See github's issue #166.
textedText = text
if self.currentBlock().position() + len(text) != \
        self.editor.textCursor().position():
    textedText = text + " "
# Based on http://john.nachtimwald.com/2009/08/22/qplaintextedit-with-in-line-spell-check/
WORDS = r'(?iu)(((?!_)[\w\'])+)'
# (?iu) means case insensitive and Unicode
# (?!_) means perform negative lookahead to exclude "_" from pattern match. See issue #283
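The cursor position check in the excerpt above can be sketched without Qt. This is only an illustration, not the Manuskript code: the function `text_for_spellcheck` and its parameters `block_position` and `cursor_position` are hypothetical stand-ins for `self.currentBlock().position()` and `self.editor.textCursor().position()`.

```python
def text_for_spellcheck(text, block_position, cursor_position):
    """Return the line text, padded with one space unless the cursor
    sits exactly at the end of this line (hypothetical helper)."""
    cursor_at_end_of_line = block_position + len(text) == cursor_position
    # While the user is typing at the end of the line, leave the text
    # unpadded so a regexp that requires a trailing delimiter skips the
    # last, still unfinished word.
    return text if cursor_at_end_of_line else text + " "


# Cursor somewhere else: the line is padded, so its last word gets checked.
padded = text_for_spellcheck("Manuskript tezt", 100, 40)
# Cursor at the end of the line (100 + 15 == 115): no padding is added.
unpadded = text_for_spellcheck("Manuskript tezt", 100, 115)
```

The padding only helps when the word regexp requires a trailing delimiter, which is why the choice of regexp matters for goal E.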

Regexp 1: (?iu)([\w\']+)[^\'\w]

Active in Manuskript 0.5.0 to 0.6.0 - see issue #166.

Fails on goal B. hello_world is recognized as a single word.

[Screenshot: Regexp-1-test]

Note: screenshots were created using https://regex101.com/

Regexp 2: (?iu)(((?!_)[\w\'])+)

Active in Manuskript 0.7.0 to 0.9.0 - see issue #283.

Fails on goal E. Wrongly spell checks actively typed word.

[Screenshot: Regexp-2-test]

Regexp 3: (?iu)((?:[^_\W]|\')+)[^A-Za-z0-9\']

Proposed new regexp for 0.10.0, inspired by the following link:
https://stackoverflow.com/questions/2062169/regex-w-in-utf-8

Succeeds for all five goals.

[Screenshot: Regexp-3-test]
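To reproduce the comparison outside regex101, the three regexps can be run against the test strings in plain Python. This is a rough sketch: `words` and its `cursor_at_end` flag are hypothetical helpers that mimic the trailing-space hack, they are not functions from Manuskript.

```python
import re

# The three regexps discussed above, keyed by their number.
REGEXPS = {
    1: r"(?iu)([\w\']+)[^\'\w]",
    2: r"(?iu)(((?!_)[\w\'])+)",
    3: r"(?iu)((?:[^_\W]|\')+)[^A-Za-z0-9\']",
}


def words(number, text, cursor_at_end=False):
    """Return the words regexp `number` would hand to the spell checker.

    A space is appended unless the cursor is at the end of the line,
    mimicking the hack in basicHighlighter.py (hypothetical helper).
    """
    checked = text if cursor_at_end else text + " "
    return [m.group(1) for m in re.finditer(REGEXPS[number], checked)]


# Goal B: regexp 1 treats hello_world as one word, regexp 3 as two.
goal_b_old = words(1, "hello_world")
goal_b_new = words(3, "hello_world")

# Goal E: with the cursor at the end of the line, regexp 2 still picks up
# the partial word "paragr", while regexp 3 ignores it.
goal_e_old = words(2, "Manuskript paragr", cursor_at_end=True)
goal_e_new = words(3, "Manuskript paragr", cursor_at_end=True)
```

With these inputs, `goal_b_old` holds the single word `hello_world` and `goal_e_old` includes `paragr`, matching the failures noted for regexps 1 and 2.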

Before Fix Applied

Notice that words are spell checked while being typed.

[Screenshot: Manuskript-Spellcheck-Before-PR]

After Fix Applied

Notice that words are NOT spell checked while being typed. Spell check is only invoked after typing has proceeded beyond the word.

[Screenshot: Manuskript-Spellcheck-After-PR]

See PR olivierkes#651

This commit restores the functionality that prevents spell checking a
word that is being actively typed at the end of a paragraph.

The goals for the spell check word match regexp are:

A. Words should include those with an apostrophe
   *E.g., can't*
B. Words should exclude underscore
   *E.g., hello_world is two words*
C. Words in other languages should be recognized
   *E.g., French word familiarisé*
D. Spell check should include word at absolute end of line with no
   trailing space or punctuation
   *E.g., tezt*
E. Spell check should ignore partial words in progress (user typing)
   *E.g., paragr while midway through typing paragraph*

This commit addresses all five of the above goals.

HISTORY:
- See issue olivierkes#166 and commit 6ec0c19 in the 0.5.0 release.
- See issue olivierkes#283 and commit 63b471e in the 0.7.0 release.

Also fix minor incorrect utf-8 encoding at top of source file.
@gedakc gedakc force-pushed the fix-spellcheck-active-on-partial-words-v2 branch from 6425887 to 88b79a2 September 19, 2019 22:44
@gedakc
Collaborator Author

gedakc commented Sep 19, 2019

The spell check regression was reported in #166 (comment).

@vithiri can you test the fix in PR #651?

This PR is ready for review.

@vithiri

vithiri commented Sep 20, 2019

@gedakc I'm actually not sure how to test a PR before it's merged into develop. I do have git installed on Manjaro, so if it's just a simple command to check the change out into my current copy of develop pulled through git and test it I should be able to assist. :)

@vithiri

vithiri commented Sep 20, 2019

@gedakc I figured out how to clone the entire branch, and the spell checking experience is much better again.

I'm not sure about "Words in other languages should be recognized" - the suggested french words (e.g. partir) are marked as misspelled to me, but I would expect this from my spell checker as I'm not writing in French.

@gedakc
Collaborator Author

gedakc commented Sep 20, 2019

Thanks @vithiri for testing and confirming that spell checking is improved.

> I'm not sure about "Words in other languages should be recognized"

This means that the regexp must recognize the whole word as a single word, rather than as multiple words split at characters with accents, circumflexes, etc. This is visible in that the entire word "familiarisé" is highlighted in green for regexp 3 above.
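A small illustration of the difference, assuming an ASCII-only character class as the naive alternative (that pattern is not from Manuskript, it is only here for contrast):

```python
import re

word = "familiaris\u00e9"  # "familiarisé", written with an escape for safety

# ASCII-only class (hypothetical, for contrast): the word is cut off
# before the accented letter.
ascii_only = re.findall(r"[A-Za-z']+", word)

# Word-character class as used in regexp 3: Python's \w is Unicode-aware
# for str patterns, so the accented letter stays part of the word.
unicode_aware = re.findall(r"(?:[^_\W]|')+", word)
```

Here `ascii_only` contains only the truncated `familiaris`, while `unicode_aware` contains the whole word.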

> the suggested french words (e.g. partir) are marked as misspelled to me, but I would expect this from my spell checker as I'm not writing in French.

That is expected behaviour. If one sets Manuskript to use a French dictionary then these words will show as correctly spelled (no red squiggly underlining).

I plan to include this fix with the 0.10.0 release.

@gedakc gedakc added this to the 0.10.0 milestone Sep 20, 2019
@gedakc gedakc merged commit b473ead into olivierkes:develop Sep 22, 2019
@gedakc gedakc deleted the fix-spellcheck-active-on-partial-words-v2 branch September 22, 2019 16:46