Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)
ISSUE:
Determining the size of sub-corpora (e.g. a sub-corpus that contains only the sources from one genre) can be very slow for large corpora because of the COUNT(*) clause. This is a problem if we want to express relative frequencies (words per million).
SOLUTION:
During corpus creation, produce a data table that stores the corpus size for all combinations of source features. This table can be used as a lookup instead of a SQL query.
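A minimal sketch of what such a lookup table could look like, assuming a SQLite back-end; the table and column names (`Corpus`, `Sources`, `Genre`, `Year`) are illustrative, not Coquery's actual schema:

```python
# Sketch only: precompute sub-corpus sizes once at corpus creation time,
# so later frequency queries become index lookups instead of COUNT(*).
import sqlite3

def build_size_table(con: sqlite3.Connection) -> None:
    """Store the corpus size for every combination of source features."""
    con.executescript("""
        DROP TABLE IF EXISTS corpus_size;
        CREATE TABLE corpus_size AS
            SELECT Genre, Year, COUNT(*) AS Size
            FROM Corpus
            JOIN Sources USING (SourceId)
            GROUP BY Genre, Year;
        CREATE INDEX idx_corpus_size ON corpus_size (Genre, Year);
    """)

def subcorpus_size(con: sqlite3.Connection, genre: str, year: int) -> int:
    """Fast lookup that replaces the slow COUNT(*) over the whole corpus."""
    row = con.execute(
        "SELECT Size FROM corpus_size WHERE Genre = ? AND Year = ?",
        (genre, year)).fetchone()
    return row[0] if row else 0
```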
Original comment by gkunter (Bitbucket: gkunter, GitHub: gkunter):
This issue could be solved by implementing Issue #34 for the corpus source columns. The counts could be stored in a collections.Counter object, with tuples (containing the values of the source columns) as keys. This would avoid the slow calculation of cross-tables with COUNT(*) after the corpus has been compiled.
All that is needed, then, is that the corpus builder has access to a list of all source features (i.e. non-word features, possibly excluding time features); see the sketch below.
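A sketch of this Counter approach, assuming the builder exposes the list of source features; the feature names and the `register_token()` hook are hypothetical, not Coquery's actual API:

```python
# Sketch only: count tokens under a tuple of their source feature values
# while the corpus is being compiled, instead of querying afterwards.
from collections import Counter

# Source features: non-word features, possibly excluding time features.
SOURCE_FEATURES = ["Genre", "Year"]

subcorpus_sizes = Counter()

def register_token(source):
    """Called once per token; `source` maps feature name -> value."""
    key = tuple(source[feature] for feature in SOURCE_FEATURES)
    subcorpus_sizes[key] += 1

def per_million(freq, key):
    """Relative frequency (words per million) for the sub-corpus at `key`."""
    return freq * 1_000_000 / subcorpus_sizes[key]

# Example: every token from a 1995 news source increments the same cell.
register_token({"Genre": "news", "Year": 1995, "Title": "ignored"})
```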