In [1]:
%load_ext autoreload

In [2]:
import data.merge_data as md
from collections import Counter
import pandas as pd
import re

### A quick example of spellchecking multiple languages with the `pyenchant` package

First, we load the message data and count word frequencies, as in `words_distribution.ipynb`.

In [3]:
data = md.get_data()
msg = data['body']

words = []
for _, text in msg.items():
    lst = re.split(r'\s+', text.lower())  # raw string avoids the invalid-escape warning
    lst = [''.join(e for e in x if e.isalnum()) for x in lst]  # remove special chars
    lst = [x for x in lst if len(x) > 0]  # drop tokens left empty after cleaning
    words.extend(lst)

print('Total words:', len(words))

c = Counter(words)
df = pd.DataFrame(c.most_common())
df.columns = ['word', 'freq']

Total words: 644556


The `pyenchant` library can load external dictionaries, such as the MySpell packages listed at https://packages.ubuntu.com/trusty/myspell-dictionary. English is usually installed by default. To list the available dictionaries:

In [4]:
import enchant
enchant.list_languages()

['ru_RU', 'en_US', 'en', 'en_CA', 'en_GB']

If your language isn't in the list, install the corresponding dictionary package, e.g. for Russian on Ubuntu Linux:

`sudo apt-get install myspell-ru`

In [5]:
# load the dictionary
d = enchant.Dict('ru_RU')

The following code outputs the full list of unrecognized words, ordered by frequency:

In [None]:
df['OK'] = df['word'].apply(d.check)  # True if the dictionary knows the word
df[~df['OK']]
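
Since the messages can mix languages, a word could be treated as correct when any of several dictionaries accepts it. Below is a minimal sketch of that idea, assuming the checkers are passed in as plain callables. The helper name `check_any` is hypothetical and not part of `pyenchant`, and the stand-in word sets exist only so the example runs without installed dictionaries:

```python
def check_any(word, checkers):
    """Return True if at least one checker accepts the word."""
    return any(check(word) for check in checkers)


# Stand-in checkers (assumption: tiny hand-made word sets).
# With pyenchant you would pass e.g. enchant.Dict('ru_RU').check instead.
ru_words = {'привет', 'мир'}
en_words = {'hello', 'world'}
checkers = [ru_words.__contains__, en_words.__contains__]

sample = ['привет', 'hello', 'wrld']
unknown = [w for w in sample if not check_any(w, checkers)]
print(unknown)  # ['wrld']
```

With real dictionaries, e.g. `en = enchant.Dict('en_US')`, the column could be computed as `df['word'].apply(lambda w: check_any(w, [d.check, en.check]))`.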

Next, we can count how many unique misspelled words there are in the dataset

In [7]:
unique_words = len(df)
unknown_words = len(df[df['OK'] == False])
print('Unique words: {0}, Misspelled: {1}, Ratio: {2}'.format(unique_words,
    unknown_words, unknown_words / unique_words))

Unique words: 62987, Misspelled: 19854, Ratio: 0.315207900042866
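
The ratio above counts every unique word once, so a rare typo weighs as much as a very common word. A rough sketch of the frequency-weighted variant, i.e. the share of all word occurrences that the checker rejects, follows. The helper `weighted_unknown_ratio` and the toy counts are made up for illustration:

```python
from collections import Counter


def weighted_unknown_ratio(freqs, check):
    """Share of word occurrences (not unique words) rejected by `check`."""
    total = sum(freqs.values())
    if total == 0:
        return 0.0
    unknown = sum(n for word, n in freqs.items() if not check(word))
    return unknown / total


# Toy data: one frequent known word, two rare unknown ones.
freqs = Counter({'hello': 8, 'helo': 1, 'wrld': 1})
known = {'hello'}
ratio = weighted_unknown_ratio(freqs, known.__contains__)
print(ratio)  # 2 of 10 occurrences are unknown -> 0.2
```

On the DataFrame from this notebook the same number would be `df.loc[~df['OK'], 'freq'].sum() / df['freq'].sum()`.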


For me, roughly 30% of all unique words aren't recognized by the spellchecker :)
Of course, that doesn't mean all of them are real spelling or grammar mistakes. Most are typos, slang words, leftovers from smiley faces that the cleaning step at the beginning didn't remove (e.g. the `P` in `;P`), and lots of other junk I didn't bother to filter out. So this is a very basic example that needs more work before it is of practical use.
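
One possible improvement to the cleaning step, sketched under the assumption that only runs of letters count as words: extract tokens with a regular expression instead of stripping characters afterwards, and drop one-letter leftovers. The function name `tokenize` and the `min_len` threshold are illustrative choices, not something the original procedure defines:

```python
import re

# [^\W\d_]+ matches runs of Unicode letters (word chars minus digits and '_').
WORD_RE = re.compile(r'[^\W\d_]+')


def tokenize(text, min_len=2):
    """Lowercase the text and return letter-only tokens of at least min_len chars."""
    return [w for w in WORD_RE.findall(text.lower()) if len(w) >= min_len]


tokens = tokenize('Привет ;P hello!!! 123 ok_go')
print(tokens)  # ['привет', 'hello', 'ok', 'go']
```

Note the trade-off: unlike the original procedure, this splits words at apostrophes, hyphens and underscores instead of gluing the parts together, so it may need tuning for a given language.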