
Pre-calculated corpus size lists required #20

Open
gkunter opened this issue Jul 27, 2015 · 1 comment

Comments

@gkunter
Owner

gkunter commented Jul 27, 2015

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Detecting the size of a sub-corpus (e.g. one that contains only sources from a single genre) can be very slow for large corpora because it requires a COUNT(*) query. This is a problem if we want to express relative frequencies (words per million).

SOLUTION:
During corpus creation, produce a data table that stores the corpus size for all combinations of source features. This table can be used as a lookup instead of a SQL query.
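A minimal sketch of the proposed lookup table, assuming a hypothetical `tokens` table with one row per token and two illustrative source-feature columns (`genre`, `register` — both names are assumptions, not the actual schema):

```python
import sqlite3

# Hypothetical schema: one row per corpus token, with its source features.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (word TEXT, genre TEXT, register TEXT)")
conn.executemany(
    "INSERT INTO tokens VALUES (?, ?, ?)",
    [("the", "news", "formal"),
     ("cat", "news", "formal"),
     ("sat", "fiction", "informal")])

# During corpus creation: a single GROUP BY pass over all source features
# produces the size of every sub-corpus combination.
sizes = {
    (genre, register): n
    for genre, register, n in conn.execute(
        "SELECT genre, register, COUNT(*) FROM tokens "
        "GROUP BY genre, register")}

# Later, sub-corpus sizes are dictionary lookups instead of COUNT(*) queries:
news_formal_size = sizes[("news", "formal")]
total_size = sum(sizes.values())
```

The one-off GROUP BY at build time costs the same as a single cross-table query, but every subsequent relative-frequency calculation becomes a constant-time lookup.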


@gkunter
Owner Author

gkunter commented Sep 13, 2015

Original comment by gkunter (Bitbucket: gkunter, GitHub: gkunter):


This issue could be solved by implementing Issue #34 for corpus source columns. The counts could be stored in a collections.Counter object, with a tuple (containing the values of the source columns) as key. This would avoid the slow calculation of cross-tables with COUNT(*) after the corpus has been compiled.

What is needed, then, is simply that a list of all features that are source features (i.e. non-word features, possibly excluding time features) is available to the corpus builder.
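The Counter-based approach described above could be sketched as follows; the feature names and the `add_token` helper are illustrative assumptions, not part of the actual corpus builder:

```python
from collections import Counter

# Assumed list of source features (non-word features, excluding time
# features) made available to the corpus builder.
source_features = ("genre", "register")

# Keys are tuples of source-column values; values are sub-corpus sizes.
subcorpus_sizes = Counter()

def add_token(token_metadata):
    """Hypothetical hook called once per token during corpus compilation.

    token_metadata maps each source feature name to its value.
    """
    key = tuple(token_metadata[f] for f in source_features)
    subcorpus_sizes[key] += 1

# Counting happens incrementally while the corpus is compiled:
for meta in [{"genre": "news", "register": "formal"},
             {"genre": "news", "register": "formal"},
             {"genre": "fiction", "register": "informal"}]:
    add_token(meta)

# Marginal sizes (e.g. all 'news' tokens) come from summing over the
# matching tuple positions, again without any SQL query:
news_size = sum(n for (genre, _), n in subcorpus_sizes.items()
                if genre == "news")
```

Because the Counter is filled during compilation, no cross-table with COUNT(*) ever needs to be computed after the fact.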
