JamSpell

JamSpell is a spell checking library with the following features:

accurate - it considers the surrounding words (context) for better correction
fast - about 5K words per second
multi-language - it is written in C++ and available for many languages through SWIG bindings

Content

Benchmarks
Usage
- Python
- C++
- Other languages
- HTTP API
Train

Benchmarks

	Errors	Top 7 Errors	Fix Rate	Top 7 Fix Rate	Broken	Speed (words/second)
JamSpell	3.25%	1.27%	79.53%	84.10%	0.64%	4854
Norvig	7.62%	5.00%	46.58%	66.51%	0.69%	395
Hunspell	13.10%	10.33%	47.52%	68.56%	7.14%	163
Dummy	13.14%	13.14%	0.00%	0.00%	0.00%	-

The model was trained on 300K Wikipedia sentences and 300K news sentences (English); 95% was used for training and 5% for evaluation. An error model was used to generate text with errors from the original text. The JamSpell corrector was compared with Norvig's corrector, Hunspell, and a dummy corrector (no corrections).
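The 95% / 5% split described above can be sketched in Python as follows. This is only an illustration: the function name, the shuffling step and the fixed seed are assumptions, not a description of how the published benchmark data was actually split.

```python
import random


def split_sentences(sentences, eval_fraction=0.05, seed=0):
    """Split sentences into (train, evaluation) lists.

    Hypothetical helper: shuffling with a fixed seed is an assumption made
    here for reproducibility, not something the README specifies.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(round(len(shuffled) * eval_fraction))
    # The first n_eval shuffled sentences form the evaluation set.
    return shuffled[n_eval:], shuffled[:n_eval]
```

With 100 input sentences this returns 95 training sentences and 5 evaluation sentences.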

We used the following metrics:

Errors - percentage of words that still contain errors after the spell checker has processed the text
Top 7 Errors - percentage of words whose correct form is missing from the top 7 candidates
Fix Rate - percentage of errored words fixed by the spell checker
Top 7 Fix Rate - percentage of errored words fixed by one of the top 7 candidates
Broken - percentage of non-errored words broken by the spell checker
Speed - number of words processed per second
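As a rough illustration of the Errors, Fix Rate and Broken definitions above, here is a minimal Python sketch. It is not the code from evaluate/evaluate.py. The function name and the word-aligned list inputs are assumptions, and the denominators follow one plain reading of the definitions: all words for Errors, errored words for Fix Rate, non-errored words for Broken. The Top 7 metrics are left out because they need candidate lists.

```python
def spelling_metrics(original, errored, corrected):
    """Compute Errors, Fix Rate and Broken as percentages.

    original:  the correct words
    errored:   the same words after the error model was applied
    corrected: the spell checker's output for the errored words
    All three lists must be word-aligned and of equal length.
    """
    if not (len(original) == len(errored) == len(corrected)):
        raise ValueError("word lists must have the same length")
    total = len(original)
    if total == 0:
        raise ValueError("word lists must not be empty")

    triples = list(zip(original, errored, corrected))
    had_error = sum(o != e for o, e, _ in triples)
    fixed = sum(o != e and o == c for o, e, c in triples)
    broken = sum(o == e and o != c for o, e, c in triples)
    still_wrong = sum(o != c for o, _, c in triples)
    clean = total - had_error

    return {
        "errors": 100.0 * still_wrong / total,
        "fix_rate": 100.0 * fixed / had_error if had_error else 0.0,
        "broken": 100.0 * broken / clean if clean else 0.0,
    }
```

The Dummy corrector in the tables corresponds to passing the errored list as corrected, which yields a Fix Rate and a Broken value of 0.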

To check that the model is not overfitted to Wikipedia + news text, we also evaluated it on "The Adventures of Sherlock Holmes":

	Errors	Top 7 Errors	Fix Rate	Top 7 Fix Rate	Broken	Speed (words per second)
JamSpell	3.56%	1.27%	72.03%	79.73%	0.50%	5524
Norvig	7.60%	5.30%	35.43%	56.06%	0.45%	647
Hunspell	9.36%	6.44%	39.61%	65.77%	2.95%	284
Dummy	11.16%	11.16%	0.00%	0.00%	0.00%	-

More details on reproducing these results are available in the "Train" section.

Usage

Python

Install swig3 (usually available through your distribution's package manager)
Install jamspell:

pip install jamspell

Download or train a language model
Use it:

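import jamspell

# Create the corrector and load a language model (downloaded or trained).
corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('model_en.bin')

# Correct a whole text fragment, taking context into account.
corrector.FixFragment('I am the begt spell cherken!')
# u'I am the best spell checker!'

# Candidates for the word at index 3 ('begt').
corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 3)
# (u'best', u'beat', u'belt', u'bet', u'bent', ... )

# Candidates for the word at index 5 ('cherken').
corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 5)
# (u'checker', u'chicken', u'checked', u'wherein', u'coherent', ...)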

C++

Add the jamspell and contrib directories to your project
Use it:

#include <jamspell/spell_corrector.hpp>

int main(int argc, const char** argv) {

    NJamSpell::TSpellCorrector corrector;
    corrector.LoadLangModel("model.bin");

    // Correct a whole text fragment, taking context into account.
    corrector.FixFragment(L"I am the begt spell cherken!");
    // "I am the best spell checker!"

    // Candidates for the word at index 3 ("begt").
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 3);
    // "best", "beat", "belt", "bet", "bent", ...

    // Candidates for the word at index 5 ("cherken").
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 5);
    // "checker", "chicken", "checked", "wherein", "coherent", ...
    return 0;
}

Other languages

You can generate extensions for other languages by following the SWIG tutorial. The SWIG interface file is jamspell.i. Pull requests with build scripts are welcome.

HTTP API

Option 1 - python (via flask)

The server runs on port 80 and by default accepts connections from any host, not just localhost.
It expects the model to be in the same folder as webserver.py and to be named medical_model.bin (this fork targets medical spell checking).
It offers a few more options than the C++ server. The following parameters can be sent with a GET or POST API call:
- limit ... limits the number of items per candidate list in the response from the /candidates endpoint, e.g. /candidates?limit=1&text=blahblah
- html ... if set, returns a human-readable HTML table instead of JSON. Works for /fix and /candidates, e.g. /fix?html=1&text=blahblah

python webserver.py
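
The limit and html parameters described above can be combined with text in a query string. The following Python sketch builds such a URL with the standard library only. The helper name build_url and its arguments are hypothetical; the endpoint and parameter names are the ones listed above, and the base URL assumes the default port 80 on localhost.

```python
from urllib.parse import quote, urlencode


def build_url(base, endpoint, text, limit=None, html=False):
    """Build a GET URL for the /fix or /candidates endpoint.

    Hypothetical helper; only the parameter names (limit, html, text)
    come from the README.
    """
    params = {}
    if limit is not None:
        params["limit"] = limit
    if html:
        params["html"] = 1
    params["text"] = text
    # quote_via=quote encodes spaces as %20 rather than '+'.
    query = urlencode(params, quote_via=quote)
    return "{}/{}?{}".format(base.rstrip("/"), endpoint.strip("/"), query)


print(build_url("http://localhost", "candidates", "blahblah", limit=1))
# http://localhost/candidates?limit=1&text=blahblah
```

Percent-encoding the text matters for real input, since spaces and punctuation are not valid in a raw query string.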

Option 2 - c++

Install cmake
Clone and build medSpellCheck (it includes an HTTP server):

git clone https://github.com/jackneil/medSpellCheck.git
cd medSpellCheck
mkdir build
cd build
cmake ..
make

On Windows, replace the 'make' command with:

cmake --build . --target ALL_BUILD --config Release

Download or train a language model
Run the HTTP server:

./web_server/web_server en.bin localhost 8080

GET Request example:

$ curl "http://localhost:8080/fix?text=I am the begt spell cherken"
I am the best spell checker

POST Request example:

$ curl -d "I am the begt spell cherken" http://localhost:8080/fix
I am the best spell checker

Candidates example:

curl "http://localhost:8080/candidates?text=I am the begt spell cherken"
# or
curl -d "I am the begt spell cherken" http://localhost:8080/candidates

{
    "results": [
        {
            "candidates": [
                "best",
                "beat",
                "belt",
                "bet",
                "bent",
                "beet",
                "beit"
            ],
            "len": 4,
            "pos_from": 9
        },
        {
            "candidates": [
                "checker",
                "chicken",
                "checked",
                "wherein",
                "coherent",
                "cheered",
                "cherokee"
            ],
            "len": 7,
            "pos_from": 20
        }
    ]
}

Here pos_from is the position of the first letter of the misspelled word and len is the length of the misspelled word.
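
To show how pos_from and len relate to the input text, here is a minimal Python sketch that replaces each misspelled word with its first candidate. The function name is hypothetical, and the sketch assumes that pos_from and len count characters of the submitted text, which matches the sample response above.

```python
def apply_top_candidates(text, response):
    """Replace each misspelled span with its first candidate.

    Assumes pos_from and len are character offsets into text.
    """
    # Work from right to left so earlier offsets stay valid even when a
    # replacement has a different length than the misspelled word.
    results = sorted(response["results"],
                     key=lambda r: r["pos_from"], reverse=True)
    for result in results:
        if not result["candidates"]:
            continue  # nothing to substitute for this word
        start = result["pos_from"]
        end = start + result["len"]
        text = text[:start] + result["candidates"][0] + text[end:]
    return text


# Sample response from above, shortened to two candidates per word.
response = {
    "results": [
        {"candidates": ["best", "beat"], "len": 4, "pos_from": 9},
        {"candidates": ["checker", "chicken"], "len": 7, "pos_from": 20},
    ]
}
print(apply_top_candidates("I am the begt spell cherken", response))
# I am the best spell checker
```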

Train

To train a custom model:

Install cmake
Clone and build medSpellCheck:

git clone https://github.com/jackneil/medSpellCheck.git
cd medSpellCheck
mkdir build
cd build
cmake ..
make

Special Windows build instructions:
1. Visual Studio 2019 Community Edition (or later) and the Visual Studio 2019 C++ Build Tools must be installed.
2. Without them, cmake .. will not produce a working .exe.
3. Replace the 'make' command with the following (the jamspell.exe executable will be located in the /build/main/Release/ folder):
  cmake --build . --target ALL_BUILD --config Release

Prepare a UTF-8 text file with sentences to train on (e.g. sherlockholmes.txt) and another file with the language alphabet (e.g. alphabet_en.txt)
Train the model:

./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin

To evaluate the spell checker, you can use the evaluate/evaluate.py script:

python evaluate/evaluate.py -a alphabet_file.txt -jsp your_model.bin -mx 50000 your_test_data.txt

You can use evaluate/generate_dataset.py to generate your train/test data. It supports txt files, the Leipzig Corpora Collection format and fb2 books.

Example request to a running HTTP server, using medical text:

curl "http://localhost:55555/candidates?text=This is a 62 yer old femle with high blod pressur and she has had a lap appendectoy by an aneesthesiologist also she has dibetes mellitus. she takes 50mg of metopfolol per day and an 81mg asprin and 15miligram hydrochlorathiozide plus his mother is a smker and has had a bunch of seezures. they like icee creem and pzza. hx of coranary artery dizease and has had a transeent ishcemic attak"

Download models

Here is our hank.ai medical model pre-trained on a large medical corpus (a few million records):

medical_model.zip (180mb)

Here are a few simple models. They were trained on 300K news + 300K Wikipedia sentences. We strongly recommend training your own model on at least a few million sentences to achieve better quality. See the Train section above.

en.tar.gz (35Mb)
fr.tar.gz (31Mb)
ru.tar.gz (38Mb)
