Simple tables of word frequencies, derived from Google ngram corpora

Simple table of word frequencies, derived from Google Ngram corpora.

words-all.txt is a tab-separated file with one word per line; each word is followed by the total number of times it was seen in Google's scanned books from the past century. Words containing a capital letter were ignored.
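As a rough illustration of the file layout, the following Python sketch parses lines of the form word, tab, count into a dictionary. The function name `read_frequencies` and the sample rows are made up for this example; they are not part of the repository.

```python
import io


def read_frequencies(lines):
    """Parse 'word<TAB>count' lines into a dict mapping word -> int count."""
    freqs = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        word, count = line.split("\t")
        freqs[word] = int(count)
    return freqs


# Made-up sample rows standing in for the real words-all.txt.
sample = io.StringIO("the\t100\ncat\t7\n")
freqs = read_frequencies(sample)
```

To read the real file, pass an open file handle instead, for example `read_frequencies(open("words-all.txt", encoding="utf-8"))`.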

words.txt is the subset of words-all.txt whose words are also found in /usr/share/dict/words on Mac OS X 10.7. Because words containing capital letters were ignored, none of them appear in this file. The program can be used to compile similar subsets.
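The subsetting idea can be sketched in a few lines of Python. This is not the repository's own program; `load_dictionary` and `filter_by_dictionary` are hypothetical names, and the sketch assumes the dictionary file holds one word per line.

```python
def load_dictionary(lines):
    """Collect the non-empty lines of a word list into a set."""
    return {line.strip() for line in lines if line.strip()}


def filter_by_dictionary(freq_lines, dictionary):
    """Yield only the 'word<TAB>count' lines whose word is in the dictionary."""
    for line in freq_lines:
        word = line.split("\t", 1)[0]
        if word in dictionary:
            yield line


# Made-up sample data; capitalized dictionary entries never match,
# because the frequency table contains no words with capital letters.
dictionary = load_dictionary(["cat\n", "Dog\n", "the\n"])
subset = list(filter_by_dictionary(["cat\t7\n", "xyzzy\t2\n", "the\t100\n"], dictionary))
```

The input order is preserved, so a sorted frequency table yields a sorted subset.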

These files are based on the 1-gram files in the 20120701 release[1] of the Google Ngram corpora. Individual files for each letter were created with the freqaz script, which calls the script. The individual files were then sorted and merged with sort -m into words-all.txt.
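The pipeline can be approximated in Python as follows. This is a sketch under stated assumptions, not the freqaz script itself: it assumes each 1-gram row has the form ngram, year, match count, volume count separated by tabs; the `min_year` cut-off is a hypothetical parameter; and `heapq.merge` stands in for `sort -m`, comparing lines as plain strings, which may differ from a locale-aware sort.

```python
import heapq
from collections import Counter


def total_counts(rows, min_year=None):
    """Sum match counts per word, skipping words that contain a capital letter."""
    totals = Counter()
    for row in rows:
        # Assumed row layout: ngram, year, match_count, volume_count.
        ngram, year, match_count, _volumes = row.rstrip("\n").split("\t")
        if any(ch.isupper() for ch in ngram):
            continue
        if min_year is not None and int(year) < min_year:
            continue  # hypothetical year cut-off
        totals[ngram] += int(match_count)
    return totals


def merge_sorted(*sorted_runs):
    """Merge already-sorted runs of lines into one sorted stream."""
    return heapq.merge(*sorted_runs)


# Made-up sample rows and runs for illustration.
rows = ["apple\t1990\t3\t2\n", "apple\t1991\t4\t3\n", "Apple\t1990\t9\t1\n"]
totals = total_counts(rows)
merged = list(merge_sorted(["a\t1", "c\t3"], ["b\t2"]))
```

Merging relies on each per-letter run already being sorted, which is why the individual files are sorted first.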