Calculation of writers' vocabulary sizes from their corpora, using Porter stemming

================================================================================
word_counts
================================================================================

GitHub repository containing the programs used and the output generated for the
small project described here:

http://zwischenzugs.wordpress.com/2011/03/06/shakespeare_unexceptional_vocabulary

================================================================================
Directory = Description
================================================================================
bin       = Contains scripts used to generate results
files     = Contains files of words generated from authors' works (please note
            that the works themselves have not been included due to copyright 
            concerns. They are freely available online at Project Gutenberg, but
            require editing before analysis).
results   = Results output for examined writers.



================================================================================
File                       = Description
================================================================================
bin/analyse.pl             = Takes stdin, outputs words sorted with numbers,
                             possessives, and "'d"s replaced with "ed"s.
bin/calculate_all.sh       = Takes all the files in
                             "../files/*_files/<writer's name>_complete" and
                             outputs results from calculate.sh to
                             ../results/<writer's name>_results.txt
bin/calculate.sh           = Given a filename representing a writer's corpus, 
                             outputs: number of words in corpus; number of
                             words after analyse.pl is run over it; number of
                             unique words after analyse.pl is run over it; 
                             number of unique words after stemmer.pl is run
                             on corpus that has been through analyse.pl
bin/stemmer.pl             = Takes words as inputs and outputs them in a 
                             canonical Porter-stemmed form.
results/<writer's name>_results.txt
                           = Results of writer's corpus's analysis.
files/<writer's name>_files/<writer's name>_complete_words_all_lc
                           = All words used in corpus in lower case.
files/<writer's name>_files/<writer's name>_complete_words_all_lc_uniq
                           = As above, but each word is unique in the file
                             and in alphabetical order.
files/<writer's name>_files/<writer's name>_complete_stemmed_uniq
                           = As above, but each word is stemmed, unique and in
                             alphabetical order.
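
The counting steps described for bin/calculate.sh and bin/calculate_all.sh can
be sketched roughly as below. This is an illustration only, not the scripts in
bin: the function names (normalise, corpus_counts, all_counts) are
hypothetical, and normalise is a simplified stand-in for analyse.pl that
lower-cases the text, drops numbers and possessive "'s", and turns a trailing
"'d" into "ed". The stemmed count is left out because it needs stemmer.pl.

```shell
#!/bin/sh
# Hypothetical sketch of the counts described for bin/calculate.sh.
# normalise is an assumed stand-in for analyse.pl, not its actual logic.

# Read text on stdin; write one lower-case word per line.
normalise() {
    tr '[:upper:]' '[:lower:]' |
        tr -cs "a-z'" '\n' |
        sed -e "s/'s\$//" -e "s/'d\$/ed/" -e "/^'*\$/d"
}

# Print three of the four counts for a single corpus file.
corpus_counts() {
    input="$1"
    total=$(wc -w < "$input" | tr -d ' ')
    cleaned=$(normalise < "$input" | wc -l | tr -d ' ')
    unique=$(normalise < "$input" | sort -u | wc -l | tr -d ' ')
    echo "words in corpus: $total"
    echo "words after clean-up: $cleaned"
    echo "unique words after clean-up: $unique"
}

# Run corpus_counts over every "<name>_complete" file below a files
# directory and write "<name>_results.txt" into a results directory.
all_counts() {
    files_dir="$1"
    results_dir="$2"
    for corpus in "$files_dir"/*_files/*_complete; do
        [ -f "$corpus" ] || continue
        name=$(basename "$corpus" _complete)
        corpus_counts "$corpus" > "$results_dir/${name}_results.txt"
    done
}
```

With these functions loaded, a call such as "all_counts ../files ../results"
from inside bin would mirror the directory layout in the tables above.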



