Determine whether "a" or "an" is more appropriate before a word, symbol, or acronym.

mogsdad/a-vs-an

a-vs-an

Find the English indefinite article ("a" or "an") for a word. The choice is based on real usage patterns extracted from the Wikipedia text dump, so it can handle tricky edge cases such as acronyms (FIAT vs. FAA, NASA vs. NSA) and odd symbols.

The implementations (C# and JavaScript) in this project determine whether "a" or "an" should precede a word. They are efficient and accurate (using the method described in this Stack Overflow response).

You can try the JavaScript implementation of this library online: A-vs-An.

The dataset is based on the Wikipedia article-text dump of July 2014. Additional preprocessing removed as much wiki markup as possible and used regular expressions to extract only text vaguely resembling sentences. If the word following "a" or "an" started with a quote or parenthesis, that initial quote or parenthesis was ignored. The resulting prefix list, together with the code to query it, is less than 10 KB in size; excluding the actual counts would reduce the size further.
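As a rough illustration of how a prefix list with counts can be queried, here is a minimal sketch. It is not the project's actual code or data: the `PREFIX_COUNTS` table, its counts, and the `articleFor` function are all hypothetical, and stripping a leading quote or parenthesis at lookup time is an assumption borrowed from the preprocessing step described above. The idea is that the longest known prefix of the word decides the article, so "N" can favor "an" (NSA) while the longer prefix "NA" overrides it with "a" (NASA).

```javascript
// Hypothetical toy table: prefix -> how often "a" / "an" preceded words
// starting with that prefix. These numbers are invented for illustration
// and are NOT taken from the project's Wikipedia-derived dataset.
const PREFIX_COUNTS = new Map([
  ["",     { a: 1000, an: 100 }], // fallback when no longer prefix matches
  ["a",    { a: 5,    an: 900 }],
  ["e",    { a: 20,   an: 800 }],
  ["i",    { a: 10,   an: 700 }],
  ["o",    { a: 30,   an: 600 }],
  ["u",    { a: 200,  an: 400 }],
  ["uni",  { a: 300,  an: 40  }], // "a university", "a unicorn"
  ["hour", { a: 1,    an: 90  }], // "an hour"
  ["8",    { a: 2,    an: 50  }], // "an 8-bit value"
  ["F",    { a: 10,   an: 60  }], // spelled-out acronyms: "an FAA rule"
  ["FI",   { a: 40,   an: 2   }], // pronounced acronyms: "a FIAT"
  ["N",    { a: 15,   an: 70  }], // "an NSA report"
  ["NA",   { a: 80,   an: 3   }], // "a NASA mission"
]);

// Return "a" or "an" for the given word using the longest matching prefix.
function articleFor(word) {
  // Assumption: ignore one leading quote or parenthesis, mirroring the
  // preprocessing described in the text.
  const stripped = word.replace(/^["'(\u201C\u2018]/, "");

  let best = PREFIX_COUNTS.get("");
  for (let len = 1; len <= stripped.length; len++) {
    const entry = PREFIX_COUNTS.get(stripped.slice(0, len));
    if (entry) best = entry; // a longer known prefix overrides a shorter one
  }
  return best.an > best.a ? "an" : "a";
}

for (const w of ["hour", "house", "NASA", "NSA", "FIAT", "FAA", "(unicorn"]) {
  console.log(`${articleFor(w)} ${w}`);
}
```

With such a small table many ordinary words would be misclassified; a usable table needs the real corpus-derived prefixes. The sketch also shows why the counts could be dropped to save space: the lookup only needs to know which article wins for each prefix.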

The implementations are efficient: on a single thread of a 3.6 GHz i7-4770K, a benchmark classifying all words of an English dictionary achieves about 37 million words per second, which is roughly 100 clock cycles per word. The JavaScript implementation was benchmarked on Chrome 35, Firefox 32.0a1 (2014-05-22), IE 11, and Opera (12 and 21); all are about 7-10 times slower, at approximately 4-5 million classifications per second.
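The figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the quoted numbers, not a benchmark; the variable names are chosen for illustration.

```javascript
// Figures quoted in the text.
const clockHz = 3.6e9;       // 3.6 GHz clock
const wordsPerSecond = 37e6; // C# throughput, single thread

// Clock cycles spent per classified word.
const cyclesPerWord = clockHz / wordsPerSecond;

// Throughput range implied by "7-10 times slower".
const jsSlowest = wordsPerSecond / 10;
const jsFastest = wordsPerSecond / 7;

console.log(cyclesPerWord.toFixed(1));         // cycles per word
console.log((jsSlowest / 1e6).toFixed(1), "M/s");
console.log((jsFastest / 1e6).toFixed(1), "M/s");
```

The result is a little under 100 cycles per word, and the implied JavaScript range of roughly 3.7 to 5.3 million classifications per second is consistent with the quoted 4-5 million.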

Languages

  • C# 94.9%
  • HTML 4.3%
  • Other 0.8%