Corpus statistics

==========  ====================================
Path        pimlico.modules.corpora.corpus_stats
Executable  yes
==========  ====================================

Computes some basic statistics about tokenized corpora.

Counts the number of tokens, sentences and distinct tokens (types) in a corpus.
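
To make the three counts concrete, here is a minimal sketch in plain Python. It is not the module's actual implementation and does not use the Pimlico API: the function name `corpus_stats`, the dictionary keys, and the representation of a document as a list of sentences (each a list of token strings) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Assumed representation (not Pimlico's): a document is a list of
# sentences, and each sentence is a list of token strings.
Sentence = List[str]
Document = List[Sentence]


def corpus_stats(documents: Iterable[Document]) -> Dict[str, int]:
    """Count tokens, sentences and distinct tokens over a corpus."""
    token_counts: Counter = Counter()
    sentence_count = 0
    for doc in documents:
        for sentence in doc:
            sentence_count += 1
            # Counter.update adds one count per token in the sentence
            token_counts.update(sentence)
    return {
        "tokens": sum(token_counts.values()),   # total token occurrences
        "sentences": sentence_count,
        "distinct_tokens": len(token_counts),   # vocabulary size (types)
    }


# Hypothetical toy corpus of two documents
docs = [
    [["the", "cat", "sat"], ["the", "dog", "ran"]],
    [["a", "cat", "ran"]],
]
stats = corpus_stats(docs)
print(stats)
```

Keeping a `Counter` rather than a plain set means the same pass also yields per-token frequencies, should they be wanted alongside the totals.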

Inputs

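======  ===================================
Name    Type(s)
======  ===================================
corpus  TarredCorpus<TokenizedDocumentType>
======  ===================================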

Outputs

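=====  =================================
Name   Type(s)
=====  =================================
stats  pimlico.datatypes.files.NamedFile
=====  =================================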