Simple tables of word frequencies, derived from Google ngram corpora

neilk/wordfrequencies

wordfrequencies

Simple table of word frequencies, derived from Google Ngram corpora.

words-all.txt is a tab-separated file with one word per line. Each word is followed by the total number of times it was seen in Google's scanned books from the past century. Any word containing a capital letter was ignored.
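As an illustration of the format described above, here is a minimal Python sketch that reads such a file into a dictionary. It assumes each line holds exactly two tab-separated fields (word, then count); the function name `load_frequencies` and the sample data are made up for this example and are not part of the repository.

```python
import io


def load_frequencies(lines):
    """Parse 'word<TAB>count' lines into a dict mapping word -> int count.

    Assumes exactly two tab-separated fields per line; blank lines are skipped.
    """
    freqs = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, count = line.split("\t")
        freqs[word] = int(count)
    return freqs


# Made-up sample standing in for words-all.txt
sample = io.StringIO("apple\t120\nbanana\t45\n")
freqs = load_frequencies(sample)
print(freqs["apple"])
```

To read the real file, something like `with open("words-all.txt", encoding="utf-8") as f: freqs = load_frequencies(f)` should work, assuming the file is UTF-8 encoded.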

words.txt is a subset of words-all.txt, containing only the words that also appear in /usr/share/dict/words on Mac OS X 10.7. Note that words containing capital letters will not be found in this file. The program selecter.pl can be used to compile similar subsets.
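The subsetting idea can be sketched in Python as follows. This is not a port of selecter.pl, whose exact behavior is not described here; it is a hypothetical equivalent that keeps only the frequency lines whose word appears verbatim in a dictionary word list. The function name `select_subset` and the sample data are assumptions for illustration.

```python
def select_subset(freq_lines, dictionary_words):
    """Yield the 'word<TAB>count' lines whose word is in dictionary_words.

    Matching is exact and case-sensitive, so capitalized dictionary
    entries never match the all-lowercase frequency table.
    """
    wanted = {w.strip() for w in dictionary_words if w.strip()}
    for line in freq_lines:
        word = line.split("\t", 1)[0]
        if word in wanted:
            yield line


# Made-up sample data
freq_lines = ["apple\t120\n", "zzyzx\t3\n", "banana\t45\n"]
dictionary = ["Apple\n", "apple\n", "banana\n"]
subset = list(select_subset(freq_lines, dictionary))
print("".join(subset), end="")
```

Loading the dictionary into a set keeps each lookup constant-time, which matters when the frequency table is much larger than the word list.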

These files were based on the 1-gram files in the 20120701 release[1] of Google Ngram's corpora. Individual files for each letter were created with the freqaz script, which calls the freq.pl script. The individual files were then sorted and merged with sort -m into words-all.txt.

[1] http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
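The pipeline above (count per letter file, sort, then merge) can be sketched in Python. This is not the code in freq.pl or freqaz. It assumes the 1-gram input lines have the tab-separated fields ngram, year, match_count, volume_count. The `min_year` cutoff is a hypothetical parameter, since the exact year range used is not stated here. `heapq.merge` stands in for `sort -m`, which merges inputs that are already sorted.

```python
import heapq
from collections import Counter


def count_words(ngram_lines, min_year=0):
    """Sum match counts per word, skipping words with capital letters.

    Assumes fields: ngram, year, match_count, volume_count (tab-separated).
    min_year is a hypothetical cutoff, not taken from the original scripts.
    """
    totals = Counter()
    for line in ngram_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines
        ngram, year, match_count = fields[0], int(fields[1]), int(fields[2])
        if year < min_year or any(c.isupper() for c in ngram):
            continue
        totals[ngram] += match_count
    return totals


def format_table(totals):
    """Return sorted 'word<TAB>count' lines for one per-letter table."""
    return sorted(f"{word}\t{count}" for word, count in totals.items())


# Made-up sample input for two letters
a_lines = ["apple\t1950\t5\t3", "apple\t1960\t7\t4",
           "Apple\t1960\t9\t2", "ant\t1890\t2\t1"]
b_lines = ["bat\t1990\t4\t2", "bee\t2000\t6\t1"]

a_table = format_table(count_words(a_lines, min_year=1900))
b_table = format_table(count_words(b_lines, min_year=1900))

# Merge the already-sorted per-letter tables, analogous to `sort -m`
merged = list(heapq.merge(a_table, b_table))
print("\n".join(merged))
```

Python compares strings by code point, while `sort` follows the current locale, so the two orderings may differ for non-ASCII words.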
