
Pre-calculated corpus size lists required #20

Open
gkunter opened this issue Jul 27, 2015 · 1 comment

Comments

@gkunter
Owner

gkunter commented Jul 27, 2015

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Detecting the size of a sub-corpus (e.g. one that contains only sources from a single genre) can be very slow for large corpora because it requires a COUNT(*) query. This is a problem if we want to express relative frequencies (words per million).

SOLUTION:
During corpus creation, produce a data table that stores the corpus size for all combinations of source features. This table can be used as a lookup instead of a SQL query.
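A minimal sketch of the proposed lookup table, assuming a hypothetical `tokens` table with one row per token and two illustrative source-feature columns (`genre`, `register` — both names are assumptions, not the actual schema):

```python
import sqlite3

# Hypothetical schema: one row per corpus token, with its source features.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (word TEXT, genre TEXT, register TEXT)")
conn.executemany(
    "INSERT INTO tokens VALUES (?, ?, ?)",
    [("the", "news", "formal"),
     ("cat", "news", "formal"),
     ("sat", "fiction", "informal")])

# During corpus creation: a single GROUP BY pass over all source features
# produces the size of every sub-corpus combination.
sizes = {
    (genre, register): n
    for genre, register, n in conn.execute(
        "SELECT genre, register, COUNT(*) FROM tokens "
        "GROUP BY genre, register")}

# Later, sub-corpus sizes are dictionary lookups instead of COUNT(*) queries:
news_formal_size = sizes[("news", "formal")]
total_size = sum(sizes.values())
```

The one-off GROUP BY at build time costs the same as a single cross-table query, but every subsequent relative-frequency calculation becomes a constant-time lookup.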


@gkunter
Owner Author

gkunter commented Sep 13, 2015

Original comment by gkunter (Bitbucket: gkunter, GitHub: gkunter):


This issue could be solved by implementing Issue #34 for corpus source columns. The counts could be stored in a collections.Counter object, with a tuple (containing the values of the source columns) as key. This would avoid the slow calculation of cross-tables with COUNT(*) after the corpus has been compiled.

What is needed, then, is simply that a list of all features that are source features (i.e. non-word features, possibly excluding time features) is available to the corpus builder.
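The Counter-based approach described above could be sketched as follows; the feature names and the `add_token` helper are illustrative assumptions, not part of the actual corpus builder:

```python
from collections import Counter

# Assumed list of source features (non-word features, excluding time
# features) made available to the corpus builder.
source_features = ("genre", "register")

# Keys are tuples of source-column values; values are sub-corpus sizes.
subcorpus_sizes = Counter()

def add_token(token_metadata):
    """Hypothetical hook called once per token during corpus compilation.

    token_metadata maps each source feature name to its value.
    """
    key = tuple(token_metadata[f] for f in source_features)
    subcorpus_sizes[key] += 1

# Counting happens incrementally while the corpus is compiled:
for meta in [{"genre": "news", "register": "formal"},
             {"genre": "news", "register": "formal"},
             {"genre": "fiction", "register": "informal"}]:
    add_token(meta)

# Marginal sizes (e.g. all 'news' tokens) come from summing over the
# matching tuple positions, again without any SQL query:
news_size = sum(n for (genre, _), n in subcorpus_sizes.items()
                if genre == "news")
```

Because the Counter is filled during compilation, no cross-table with COUNT(*) ever needs to be computed after the fact.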
