🙈 assistant for hunting down tpyos

tpyo gif


Why?

  • Saw a couple of typos while reading a fantasy book and wondered why they weren't caught
  • Felt like a good mini-project to improve my Python and programming skills

Idea

  1. Compare a list of dictionary words with words extracted from an e-book using Python code
    • as of now, working on docx/epub formats
  2. The generated output has to be checked manually to validate
    • in-world terms like names, places, etc
    • words not found in reference dictionary
    • hyphenated words
  3. These words can then be added to the reference list of words so that further runs will reveal only typos
  4. Repeat steps 1-3 when input documents change
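
The comparison in steps 1-3 can be sketched roughly as below. This is a minimal illustration, not the code in tpyo_revealo.py: the function names `extract_words` and `find_unknown_words`, the punctuation-stripping tokenizer, and the case-insensitive lookup are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical tokenizer: split on whitespace, strip surrounding punctuation.
def extract_words(text):
    words = []
    for token in text.split():
        word = token.strip('.,;:!?"\'()[]{}')
        if word:
            words.append(word)
    return words

# Hypothetical helper: count words missing from the reference list,
# keeping hyphenated words apart so they can be reviewed separately.
def find_unknown_words(text, reference_words):
    reference = {w.lower() for w in reference_words}  # assumption: ignore case
    unknown = Counter()
    hyphenated = Counter()
    for word in extract_words(text):
        if word.lower() in reference:
            continue
        if '-' in word:
            hyphenated[word] += 1
        else:
            unknown[word] += 1
    return unknown, hyphenated

reference_words = ['this', 'is', 'a', 'sample', 'with']
text = 'This is a samlpe with a tpyo. A full-fledged sample.'
unknown, hyphenated = find_unknown_words(text, reference_words)
for word, count in sorted(unknown.items()):
    print(f'{word}: {count}')
```

Once a word such as `full-fledged` has been reviewed and added to the reference list, it no longer shows up, so later runs leave only the likely typos.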

Caveats

  • Use the program at your own risk
    • files/directories are read/created programmatically, so a bug could corrupt your system
    • I only have Linux, so I don't know how it'll behave on other operating systems
  • at best, the project could be said to be at alpha stage

Instructions

For Linux and Unix-like systems

First, clone the repo or download the zip

$ git clone https://github.com/learnbyexample/tpyo_revealo.git

$ cd tpyo_revealo/
$ mkdir ref_words input_doc
$ # multiple documents and reference lists can be put in these directories
$ cp samples/sample.docx input_doc/
$ cp /usr/share/dict/words ref_words/words.txt

$ # this will create a log directory using current time as directory name
$ python3 tpyo_revealo.py

$ cat 2017-12-20_15_38_07.341621/hyphenated_words.log
en-IN: 1
full-fledged: 1
$ cat 2017-12-20_15_38_07.341621/tpyo_words.log
LibreOffice/5.2.0.4$Linux_X: 1
LibreOffice_project/20m0$Build: 1
rny: 1
samlpe: 1
T15:37:31Z: 1
tpyo: 1
wordswithoutspace: 1

$ # create ignore lists and run again
$ cat > ref_words/ignore.txt
en-IN
LibreOffice/5.2.0.4$Linux_X
LibreOffice_project/20m0$Build
T15:37:31Z
$ echo 'full-fledged' > ref_words/hyphenated_words.txt

$ python3 tpyo_revealo.py
$ cat 2017-12-20_15_40_45.505735/hyphenated_words.log
$ cat 2017-12-20_15_40_45.505735/tpyo_words.log
rny: 1
samlpe: 1
tpyo: 1
wordswithoutspace: 1

Where to get word lists

  • this stackoverflow Q&A might help
  • aspell looked good (mentioned in above link)
    • American/British/Canadian/Australian spellings
    • SCOWL size 95, Variants 3, Diacritic stripped gives 660+K words
      • The script finished in less than 3 seconds for the Oathbringer book (450+K words) against 660+K reference words, so performance is not an issue
    • Can be downloaded for both Windows/Unix
    • See scowl-readme for more details including usage and license
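
A large word list stays cheap to query if it is held in a set, since each lookup is then a hash check rather than a scan. The sketch below shows one way several list files could be merged. It assumes one word per line in `*.txt` files, and `load_reference_words` is a hypothetical helper, not a function from the script.

```python
import tempfile
from pathlib import Path

# Hypothetical helper: merge every *.txt file in a directory into one set.
# Assumes one word per line; blank lines are skipped.
def load_reference_words(directory):
    words = set()
    for path in sorted(Path(directory).glob('*.txt')):
        with open(path, encoding='utf-8', errors='replace') as f:
            for line in f:
                word = line.strip()
                if word:
                    words.add(word)
    return words

# Demo with throwaway files standing in for ref_words/
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'words.txt').write_text('sample\ntypo\n', encoding='utf-8')
    Path(tmp, 'ignore.txt').write_text('en-IN\n\n', encoding='utf-8')
    reference = load_reference_words(tmp)
    print(sorted(reference))
```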

Wishlist

  • Better parsing for xhtml files. As of now xml extraction is used, so things like T<span class="XXX">HOSE words mess things up
  • Code organization - need to break up into different functions, etc
  • Features - repeated words, adverbs repeated in short space, etc
  • Look into NLTK
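
For the xhtml item, one possible direction is to treat inline tags as transparent, so that text split by a `<span>` is rejoined, while block-level tags still separate words. The sketch below uses the standard library's `html.parser`. The `INLINE_TAGS` set and the `TextExtractor` class are assumptions made for illustration and have not been tested against real e-book markup.

```python
from html.parser import HTMLParser

# Assumption: these tags never separate words; every other tag does.
INLINE_TAGS = {'span', 'em', 'strong', 'b', 'i', 'a', 'sup', 'sub', 'small'}

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE_TAGS:
            self.parts.append(' ')  # block-level boundary acts as a space

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self.parts.append(' ')

    def handle_data(self, data):
        self.parts.append(data)  # text around inline tags is concatenated as-is

def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse the runs of whitespace introduced above
    return ' '.join(''.join(parser.parts).split())

sample = '<p>T<span class="XXX">HOSE</span> words</p><p>next para</p>'
print(extract_text(sample))  # THOSE words next para
```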

Contributing

  • Open an issue for suggestions, feature requests, bugs, etc

License

MIT, see LICENSE file
