Simple tables of word frequencies, derived from Google ngram corpora

wordfrequencies

Simple table of word frequencies, derived from Google Ngram corpora.

words-all.txt is a tab-separated file with one word per line, each followed by the total number of times the word was seen in Google's scanned books from the past century. Words containing a capital letter were ignored.
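
As a minimal sketch of reading this format, the Python snippet below parses `word<TAB>count` lines into a dictionary. The function name `parse_frequencies` and the sample counts are hypothetical and chosen only for illustration; this is not part of the repository's Perl scripts.

```python
def parse_frequencies(lines):
    """Parse 'word<TAB>count' lines into a dict mapping word -> int count."""
    freqs = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        word, count = line.split("\t")
        freqs[word] = int(count)
    return freqs


# Illustrative data only: these counts are made up, not taken from words-all.txt.
sample = ["the\t100\n", "of\t60\n"]
freqs = parse_frequencies(sample)
```

To load the real file, an open file handle can be passed instead of the sample list, for example `parse_frequencies(open("words-all.txt", encoding="utf-8"))`.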

words.txt is a subset of words-all.txt, containing only the words that also appear in /usr/share/dict/words on Mac OS X 10.7. Because words-all.txt omits words containing capital letters, those words are absent from this file as well. The program selecter.pl can be used to compile similar subsets.
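
The subsetting idea can be sketched in Python as follows. This is an assumption about the general approach (keep a frequency line only if its word is in a dictionary word list), not a transcription of selecter.pl; the name `select_words` and the sample data are hypothetical.

```python
def select_words(freq_lines, wordlist_lines):
    """Yield the 'word<TAB>count' lines whose word appears in the word list."""
    wanted = {w.strip() for w in wordlist_lines}
    wanted.discard("")  # ignore blank lines in the word list
    for line in freq_lines:
        word = line.split("\t", 1)[0]
        if word in wanted:
            yield line


# Illustrative data only: made-up counts and a tiny stand-in dictionary.
wordlist = ["Apple\n", "apple\n", "pear\n"]
freq = ["apple\t5\n", "zzz\t3\n"]
subset = list(select_words(freq, wordlist))
```

In this sketch the capitalized dictionary entry `Apple` never matches, since the frequency table holds only words without capital letters.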

These files are based on the 1-gram files in the 20120701 release[1] of the Google Ngram corpora. The freqaz script, which calls freq.pl, produced one file per letter. Each of these files was then sorted, and the sorted files were merged with sort -m into words-all.txt.
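
The pipeline described above can be approximated in Python as a rough sketch. It assumes the 1-gram line layout of ngram, year, match count, and volume count, separated by tabs. The `min_year` cutoff is a hypothetical parameter, since the README only says "the past century". The function names are illustrative, and `heapq.merge` stands in for `sort -m`; none of this is the actual logic of freq.pl or freqaz.

```python
import heapq
from collections import Counter


def total_counts(ngram_lines, min_year):
    """Sum match counts per word, skipping old years and capitalized words."""
    totals = Counter()
    for line in ngram_lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or blank lines
        word, year, match_count = fields[0], int(fields[1]), int(fields[2])
        if year < min_year:
            continue  # min_year is an assumed cutoff, not taken from the repo
        if any(ch.isupper() for ch in word):
            continue  # ignore any word with a capital letter
        totals[word] += match_count
    return totals


def sorted_lines(totals):
    """Format totals as sorted 'word<TAB>count' lines."""
    return [f"{word}\t{count}\n" for word, count in sorted(totals.items())]


def merge_sorted(*streams):
    """Merge already-sorted line streams, similar in spirit to `sort -m`."""
    return heapq.merge(*streams)


# Illustrative data only: made-up counts for two per-letter inputs.
a_lines = ["apple\t1950\t3\t2\n", "apple\t1990\t4\t1\n", "Apple\t1990\t9\t1\n"]
b_lines = ["banana\t1800\t7\t1\n", "banana\t1995\t2\t1\n"]
merged = list(merge_sorted(sorted_lines(total_counts(a_lines, 1900)),
                           sorted_lines(total_counts(b_lines, 1900))))
```

Python compares strings by code point, whereas `sort` follows the current locale, so the two orderings can differ for non-ASCII words.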

[1] http://storage.googleapis.com/books/ngrams/books/datasetsv2.html