Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add non-enchant spellcheck support #505

Closed
kakaroto opened this issue Feb 21, 2019 · 10 comments
Closed

Add non-enchant spellcheck support #505

kakaroto opened this issue Feb 21, 2019 · 10 comments
Milestone

Comments

@kakaroto
Copy link
Contributor

I'm using manuskript on Windows with python 64 bits and there is no pyenchant support for 64 bits unfortunately. Also it appears pyenchant is not maintained anymore and even the link that manuskript opens when enchant isn't installed doesn't work.

My solution is to add support for an alternative spellchecker, and a simple "python spellchecker" search suggests pyspellchecker. I've started to add support for it and it seems to work fine so far. My patches are not yet ready for review/push, but I'm creating this issue to discuss the feature and my proposed changes before I'm done with it.

First, I wanted to keep pyenchant support, so I'm not replacing it but just adding support for something else.
If nothing is installed, the menu will suggest installing either pyenchant or pyspellchecker and clicking on it would open both links.
image
If both are available, then the Tools->Dictionary would show dictionaries from both libraries so the user can make their choice of library to use :
image

If only one is available, mention that another choice is available to the user :
image

So first question: Does that UI look good or you prefer it to be done differently?

I think that's good, but I have one issue: It only works as is because the dictionary names are different from pyspellchecker and pyenchant, I don't know how it would work if pyspellchecker suddenly changed dictionary names to look like pyenchant.
So I'm wondering if I should store the library name as part of the 'dict' setting, or not?
I'm not entirely sure, but I think I can store the library name in the QAction without it appearing as text to the user, I'd have to check, but if not, I don't want to have the library name on every entry...

Right now, the code handles both enchant and pyspellchecker, but I think it would make a lot more sense to create an abstraction layer, a Spellchecker class that does that for us, so the code remains simple within the text editor, highlighter, main window, context menu, etc...
Once I write the abstraction class, it would be easy to add support for other spellchecker libraries, so another question: Are you OK with adding support for pyspellchecker or do you have another library to suggest?
From my initial tests, pyspellchecker seems pretty good, it's rather slow to load the dictionary though but that's only because each class will load the dict, while pyenchant will share the same dict among all instances for the same language (note: this shouldn't be a problem once I add the abstraction class, since it would create a single dictionary to share among all text editors). It also offers less suggestions for corrections but that's because it gives you only 2 letter changes/permutations as maximum, while pyenchant does a lot more.

Final issue I can foresee is the custom dictionary. Enchant will automatically add new words to its own files, pyspellchecker does not, while we can add/remove words from its dictionary, we'd need to keep a separate file to list custom words from the user and pass the file again when we load, which is fine I think, but I'm wondering Should the custom dictionary be shared between pyenchant and pyspellchecker? If it doesn't need to, then it's great, but if it does, we have no way of getting the custom dictionary from pyenchant, unless we poke in the data directory manually, but I also don't think it would be good to share a custom dictionary between technically different languages (switch from english in enchant to french in pyspellchecker shouldn't carry the same words...). Secondary question, should the custom dictionaries be stored in manuskript data directory or in the project directory?

Final question: Anything else? Requests/suggestions/comments/questions of your own?

I should have the feature done within the next couple of days I expect, unless I'm distracted by something. If I don't get answers/suggestions quickly enough, we can always discuss things in the PR itself once I submit it.

@gedakc
Copy link
Collaborator

gedakc commented Feb 21, 2019

Thank you @kakaroto for continuing your work to improve Manuskript.

I'll be away for a long weekend so I won't have a chance to look at this in earnest, or your other PR, until I return next week.

As you might already know you can work around the lack of a 64-bit PyEnchant wheel for Windows by using the 32-bit version of Python and PyEnchant. This is described in Appendix A: Install Required Software Packages on Windows 7 and Higher.

@kakaroto
Copy link
Contributor Author

Thanks gedakc, yes I know about the workaround, but for some reason python-32b is slow for me, and instead of trying to investigate that slowdown (which would probably end with the conclusion of "It's a python problem"), I'm trying to improve the spellcheck support. I'm testing my changes on both 32-bit and 64-bit now which allows to test support for having both spellcheckers available to users.

Enjoy your weekend!

@gedakc
Copy link
Collaborator

gedakc commented Feb 21, 2019

Three suggestions I just thought of before I go:

  1. There is no visual indication that spell check is enabled. Perhaps place a checkbox toggle in Tools -> Spellcheck menu entry?

  2. Spellcheck automatically turns back on if Search or Cheatsheet is used (issue 474).

  3. Some users have reported performance slowdowns with Spellcheck, Markdown hightlight, and Focus mode (issues 474, 387, 280, 394). It would be great if we can improve performance.

@kakaroto
Copy link
Contributor Author

  1. There is a visual indication, it's just kind of hard to see. If spellcheck is enabled, there's a grey square around the spellcheck icon in Tools->spellcheck menu entry, otherwise it's a transparent background. There might be a way to style it so it's more visible, but I'm probably not the best person to think about how to make it more aesthetically pleasing.

  2. I can confirm, should be easy to fix along the way.

  3. I made it print when a word is misspelled and I noticed that even if the only editor open is one with no text, it will print misspelled words from other texts. I didn't investigate this a lot but I'm guessing the issue here is that it's parsing too much text and too often (and regexps are not cheap). I think refactoring how/when/on what the spellchecker works would be needed to alleviate this problem and it's outside the scope of what I'm doing. I might tackle it after I'm done with this though.

kakaroto added a commit to kakaroto/manuskript that referenced this issue Feb 22, 2019
This modifies the Spellchecker abstraction to add a new dictionary support, with
support for pyspellchecker. It also changes the main UI so that multiple libraries
can be supported and dictionaries provided to the user. The custom dictionary of
pyspellchecker has to be handled manually, and the performance and words of this
library isn't on par with PyEnchant, but at least it works with 64 bits.

Fixes olivierkes#505
@DonEdwards
Copy link

This is probably a pipe dream, but I'd love a mechanism where there can be multiple custom dictionaries as properties of a project.

@kakaroto
Copy link
Contributor Author

Not really a pipe dream :) I am now adding support for symspellpy which is so much faster than pyspellchecker (near instantaneous spelling suggestions), but it has one issue in that it doesn't come with a dictionary AND it's very slow to load the dictionary on startup (from 10 to 30 seconds). So I'm thinking of having an interface in the Settings window for configuring spellcheckers, things like the distance in suggestions (how many characters to change/remove to find the right suggestion match), and a way to select/download/add dictionaries for symspellpy (it would take a minute to load the dictionary and save
it in an optimized format which could then be loaded within 2 seconds instead of 30) or to add/manage custom dictionaries so you can see which words you manually added and remove them without having to type them in the editor and right click.
In terms of managing the custom dictionaries, you could have per-project custom dictionaries or a custom dictionary that you could assign to multiple spellchecker languages (i.e: switch from english to french or from one spellchecker library to another, but retain the same custom words).

I'm not much of a UI person though, so I don't think I will/can do that interface, but if someone wants to design it, and finds a way to make it not horrible/complex to the user, I can do the internal code for handling all of that.

@kakaroto
Copy link
Contributor Author

kakaroto commented Feb 23, 2019

@DonEdwards requires a little bit of work, but with the latest changes I just pushed to #507, you can now do what you wanted. See bottom of this comment for details.

I just added support for symspellpy which is a much better spellchecker library than pyspellchecker. Most specifically, it's a lot faster at giving suggestions and the 1 second time to load the dictionary is barely perceptible.
Since it didn't come with a dictionary, I made it use dictionaries from pyspellchecker, but I can also add my own dictionaries by writing them to the manuskript/resources/dictionaries/symspellpy directory. Since pyspellchecker's dictionary is seriously lacking, I used the dictionary from aspell (enchant's dictionary basically) and exported it to symspellpy format.
I'm writing this here because I'll probably forget the procedure and assuming this gets merged, it would need to be written in the wiki or something.
First convert the aspell dictionary format with affixes into a full words dictionary list :

$ dnf install aspell aspell-en
$ aspell -d en_US dump master | aspell -l en expand > en_US.dic

Then enter a python shell and load that dictionary and export it in symspellpy format :

$ python3
> import symspellpy
> s = symspellpy.SymSpell()
> with open("en_US.dic", "r", encoding='utf-8') as infile:
        for line in infile:
            s.create_dictionary_entry(line.strip(), 1)

> s.save_pickle("en_US.sym", False)

Then copy that file into the appropriate directory, since I use manuskript on Windows, it was c:\Users\kakaroto\AppData\Local\manuskript\manuskript\resources\dictionaries\symspellpy\
Now when I open manuskript, I get a "en_US" dictionary entry that I can select to get a much better dictionary, and a "en_US.json.gz" file (the custom dict) gets created alongside the "en_US.sym" dictionary.

Now for @DonEdwards if I wanted a per-project custom dictionary, I would just copy the en_US.sym file into multiple copies, one per project, something like "en_US_projectA.sym" and "en_US_projectB.sym", etc.. then in the Tools->Dictionary I'd see the two entries, and each dictionary would have its own custom dictionary that will come along with it (Note the .sym file doesn't get modified, so copying it and renaming it doesn't actually bring along the custom entries you added before. If you wanted to keep those, you can also copy the .json.gz file).
That sort of gives you a 'per project dictionary' even though it requires a little bit of work (copying the file and renaming it per project), and it wouldn't get stored inside your manuskript project file itself, so if you're working on multiple machines, it wouldn't follow over. But for a "pipe dream", it's not so bad I guess :)

@gedakc
Copy link
Collaborator

gedakc commented Mar 2, 2019

@kakaroto a thought recently came to me. Did you investigate compiling PyEnchant for Window 64-bit?

I came across the following link that indicates the task might not be easy but sounds possible:

StackOverflow - Install pyenchant on a Windows 64-bit machine

If building a 64-bit Enchant and PyEnchant works, then we could create a separate win64 PyInstaller package for Manuskript to complement the current win32 version.

Please note that I unfortunately I haven't had time to look at this issue in earnest as I have lots of other things to attend to, both within and outside of volunteer activities.

EDIT: From reading the following link for the PyEnchant project, it sounds like the compiling task is very challenging and no one has solved it since it was raised in June of 2014. :-(

PyEnchant Issue 42 - Build pre-compiled libs for win64

@kakaroto
Copy link
Contributor Author

kakaroto commented Mar 2, 2019

Yeah, I looked into compiling PyEnchant for Windows 64 bit, but I dropped the idea pretty quickly because I didn't want to try and compile stuff for windows (I only develop under Linux usually, and while I use manuskript on my windows machine, there is no compilation or complex devenv setup involved, so I'm ok with that) and because it didn't seem to be a trivial task anyway.

Also, like I said, PyEnchant is not maintained anymore, its last commit is from over a year ago and it added the "this project is unmaintained" notice to its README. I liked having the option of using something other enchant and I'm happy with the symspellpy implementation I added recently.

@gedakc
Copy link
Collaborator

gedakc commented Mar 2, 2019

I understand. I only develop on Linux now too. It's been well over a decade since I last developed on Windows.

kakaroto added a commit to kakaroto/manuskript that referenced this issue Mar 27, 2019
This modifies the Spellchecker abstraction to add a new dictionary support, with
support for pyspellchecker. It also changes the main UI so that multiple libraries
can be supported and dictionaries provided to the user. The custom dictionary of
pyspellchecker has to be handled manually, and the performance and words of this
library isn't on par with PyEnchant, but at least it works with 64 bits.

Fixes olivierkes#505
kakaroto added a commit to kakaroto/manuskript that referenced this issue Mar 29, 2019
This modifies the Spellchecker abstraction to add a new dictionary support, with
support for pyspellchecker. It also changes the main UI so that multiple libraries
can be supported and dictionaries provided to the user. The custom dictionary of
pyspellchecker has to be handled manually, and the performance and words of this
library isn't on par with PyEnchant, but at least it works with 64 bits.

Fixes olivierkes#505
@gedakc gedakc closed this as completed in 20c5586 Apr 30, 2019
@gedakc gedakc added this to the 0.10.0 milestone Sep 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants