# Finding Unusual Words in a Given Language

Which words do not belong with the rest of the text?

In [1]:
text = "Truly Kryptic is the best puzzle game. It's browser-based and free. Google it."

# 1. Tokenizing text

In [2]:
from nltk import word_tokenize
text_tokenized = word_tokenize(text.lower())
text_tokenized

['truly',
 'kryptic',
 'is',
 'the',
 'best',
 'puzzle',
 'game',
 '.',
 'it',
 "'s",
 'browser-based',
 'and',
 'free',
 '.',
 'google',
 'it',
 '.']

# 2. Importing and exploring the words corpus
NLTK includes some corpora that are nothing more than wordlists. The Words Corpus provides two lists of English words:
1. **en**: A newline-delimited list of dictionary words. It is the standard wordlist file found on Unix-like operating systems, where some spell checkers use it. We can use it to find unusual or mis-spelt words in a text corpus.
2. **en-basic**: A list of 850 English words. Source: _C.K. Ogden, The ABC of Basic English (1932)_

Reference: Section 4.1 ([Wordlist Corpora](http://www.nltk.org/book/ch02.html#fig-lexicon)), chapter 2 of Natural Language Processing with Python.
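
Because the **en** list is just one word per line, turning such a file into a lookup set takes very little code. The following is a minimal sketch, not how NLTK reads the corpus internally: `sample_wordlist_file` is a hypothetical in-memory stand-in for a real wordlist file, and its contents are made up for illustration.

```python
import io

# Hypothetical stand-in for a newline-delimited wordlist file (one word per line).
sample_wordlist_file = io.StringIO("A\na\naa\naal\naardvark\nAaron\n")

# Strip line endings, skip blank lines, and lowercase for case-insensitive lookup.
wordlist = {line.strip().lower() for line in sample_wordlist_file if line.strip()}

print(sorted(wordlist))
print("aardvark" in wordlist)  # membership test against the set
print("google" in wordlist)
```

A set is used so that each membership test takes roughly constant time, which matters once the list holds hundreds of thousands of words.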

In [3]:
from nltk.corpus import words
words.readme().replace('\n', ' ') # Read the contents of the README file of the corpus

'Wordlists  en: English, http://en.wikipedia.org/wiki/Words_(Unix) en-basic: 850 English words: C.K. Ogden in The ABC of Basic English (1932) '

In [4]:
words.fileids()

['en', 'en-basic']

In [5]:
words.words('en')[:30] # Show just the first 30 words

['A',
 'a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'Aani',
 'aardvark',
 'aardwolf',
 'Aaron',
 'Aaronic',
 'Aaronical',
 'Aaronite',
 'Aaronitic',
 'Aaru',
 'Ab',
 'aba',
 'Ababdeh',
 'Ababua',
 'abac',
 'abaca',
 'abacate',
 'abacay',
 'abacinate',
 'abacination',
 'abaciscus',
 'abacist',
 'aback',
 'abactinal',
 'abactinally']

In [6]:
en_words_length = len(words.words('en'))
print('There are', en_words_length, 'English words in the Words corpus.')

There are 235886 English words in the Words corpus.


In [7]:
words.words('en-basic')[:30] # Show just the first 30 basic English words

['I',
 'a',
 'able',
 'about',
 'account',
 'acid',
 'across',
 'act',
 'addition',
 'adjustment',
 'advertisement',
 'after',
 'again',
 'against',
 'agreement',
 'air',
 'all',
 'almost',
 'among',
 'amount',
 'amusement',
 'and',
 'angle',
 'angry',
 'animal',
 'answer',
 'ant',
 'any',
 'apparatus',
 'apple']

In [8]:
basic_en_words_length = len(words.words('en-basic'))
print('There are', basic_en_words_length, 'English words in the en-basic words corpus.')

There are 850 English words in the en-basic words corpus.


# 3. Finding Unusual Words
Filtering a Text: This part computes the vocabulary of a text, then removes all items that occur in an existing wordlist, leaving just the uncommon or mis-spelt words.

In [9]:
# .isalpha() is True only if every character in the token is a letter.
# The filter therefore drops punctuation tokens such as '.', and also tokens such as
# 'browser-based' and "'s", because the hyphen and apostrophe make .isalpha() return False.
text_vocab = set(w.lower() for w in text_tokenized if w.isalpha())
text_vocab

{'and',
 'best',
 'free',
 'game',
 'google',
 'is',
 'it',
 'kryptic',
 'puzzle',
 'the',
 'truly'}
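
To see why some tokens are missing from the vocabulary above, it helps to check `.isalpha()` on a few of them directly. This is a small illustrative sketch using a hand-picked subset of the tokens; `sample_tokens` and `kept` are names introduced here for the example.

```python
# A few tokens taken from the tokenized text above.
sample_tokens = ["puzzle", ".", "'s", "browser-based", "google"]

# str.isalpha() is True only when every character is a letter,
# so punctuation, apostrophes, and hyphens all cause a token to be rejected.
for token in sample_tokens:
    print(repr(token), token.isalpha())

# The same filter used when building the vocabulary.
kept = [token for token in sample_tokens if token.isalpha()]
print(kept)
```

If hyphenated words should be checked too, one option would be to split them on the hyphen before filtering, at the cost of treating each part as a separate word.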

In [10]:
english_vocab = set(w.lower() for w in words.words('en')) # Lowercase every word and build a set for fast membership tests.
unusual = text_vocab - english_vocab
unusual

{'google'}
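
The two steps above can be wrapped in one reusable function. The sketch below is an assumption about how that might look, not part of NLTK: `find_unusual_words` is a hypothetical helper, and `stand_in_wordlist` is a tiny made-up list used so the example runs without downloading the corpus.

```python
def find_unusual_words(tokens, wordlist):
    """Return the alphabetic tokens that do not appear in wordlist (case-insensitive)."""
    # Vocabulary of the text: lowercase, letters only, duplicates removed.
    text_vocab = {token.lower() for token in tokens if token.isalpha()}
    # Lowercase the reference list so that 'Aaron' and 'aaron' compare equal.
    known_words = {word.lower() for word in wordlist}
    # Set difference leaves only the words missing from the reference list.
    return sorted(text_vocab - known_words)


# Hypothetical stand-in for words.words('en'); the real corpus is far larger.
stand_in_wordlist = ["It", "is", "free"]
tokens = ["It", "'s", "free", ".", "Google", "it", "."]

print(find_unusual_words(tokens, stand_in_wordlist))
```

With the corpus loaded, the same helper could be called as `find_unusual_words(text_tokenized, words.words('en'))`. Returning a sorted list rather than a set keeps the output order stable between runs.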