Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)
ISSUE:
The BNC corpus currently uses the sentence_id to keep track of the source of a token. The sentence_id is then linked to the actual source table when the text is accessed. This makes look-up of some information rather complicated (see Issue #11), and it causes inconsistent behaviour between corpora: in BNC, context is delimited by sentences, but in COCA, by texts (see also Issue #4).
SOLUTION:
The table bnc.element should store the text_id, not the sentence_id. The table bnc.sentence should store token_id as an additional column. This requires changes to tools/create_bnc.py.
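As a rough illustration of the proposed layout, here is a minimal sketch. It uses an in-memory SQLite database as a stand-in for the real BNC database. Everything beyond the names element, sentence, text_id, sentence_id and token_id is hypothetical: the text table, the word and title columns, and the helper functions. The sketch also assumes that the token_id stored in the sentence table marks the first token of each sentence, which the proposal does not state.

```python
import sqlite3


def build_demo():
    """Build a tiny in-memory database that follows the proposed layout."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Hypothetical source table; the column names are assumptions.
        CREATE TABLE text (text_id INTEGER PRIMARY KEY, title TEXT);
        -- Each token stores its text_id directly, not a sentence_id.
        CREATE TABLE element (
            token_id INTEGER PRIMARY KEY,
            word     TEXT,
            text_id  INTEGER REFERENCES text(text_id)
        );
        -- Each sentence stores a token_id (assumed: its first token).
        CREATE TABLE sentence (
            sentence_id INTEGER PRIMARY KEY,
            token_id    INTEGER REFERENCES element(token_id)
        );
    """)
    con.execute("INSERT INTO text VALUES (1, 'A00')")
    words = ["The", "cat", "sat", ".", "It", "slept", "."]
    con.executemany("INSERT INTO element VALUES (?, ?, 1)",
                    list(enumerate(words, start=1)))
    con.executemany("INSERT INTO sentence VALUES (?, ?)", [(1, 1), (2, 5)])
    return con


def source_of(con, token_id):
    """Look up the source of a token with a single join."""
    row = con.execute(
        "SELECT t.title FROM element e "
        "JOIN text t ON t.text_id = e.text_id "
        "WHERE e.token_id = ?", (token_id,)).fetchone()
    return row[0] if row else None


def context(con, token_id, span):
    """Return the words within `span` tokens, limited to the same text."""
    rows = con.execute(
        "SELECT c.word FROM element e "
        "JOIN element c ON c.text_id = e.text_id "
        "WHERE e.token_id = ? "
        "AND c.token_id BETWEEN e.token_id - ? AND e.token_id + ? "
        "ORDER BY c.token_id", (token_id, span, span)).fetchall()
    return [r[0] for r in rows]
```

With this layout, `context(build_demo(), 4, 1)` crosses the sentence boundary, because the context is delimited by the text rather than by the sentence. That matches the COCA behaviour described above.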
Original comment by gkunter (Bitbucket: gkunter, GitHub: gkunter):
The latest version of the BNC builder uses a much flatter database layout. The sentence, source, and speaker tables are linked directly to the corpus table, and the file table is linked to the source table. The lemma table has been removed, and lemma information is now part of the word table.
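The flatter layout described in this comment could look roughly like the sketch below, again with SQLite standing in for the real database. The table names follow the comment. All column names, the direction of the link between the file and source tables, and the helper functions are assumptions made for illustration, not the builder's actual schema.

```python
import sqlite3

# Hypothetical columns; only the table names come from the comment.
FLAT_SCHEMA = """
CREATE TABLE file     (file_id INTEGER PRIMARY KEY, filename TEXT);
CREATE TABLE source   (source_id INTEGER PRIMARY KEY, title TEXT,
                       file_id INTEGER REFERENCES file(file_id));
CREATE TABLE speaker  (speaker_id INTEGER PRIMARY KEY, age TEXT);
CREATE TABLE sentence (sentence_id INTEGER PRIMARY KEY);
-- No separate lemma table: the lemma is a column of the word table.
CREATE TABLE word     (word_id INTEGER PRIMARY KEY, word TEXT, lemma TEXT);
-- The corpus table links directly to sentence, source and speaker.
CREATE TABLE corpus (
    token_id    INTEGER PRIMARY KEY,
    word_id     INTEGER REFERENCES word(word_id),
    sentence_id INTEGER REFERENCES sentence(sentence_id),
    source_id   INTEGER REFERENCES source(source_id),
    speaker_id  INTEGER REFERENCES speaker(speaker_id)
);
"""


def make_flat_db():
    """Create the sketch schema and fill it with two example tokens."""
    con = sqlite3.connect(":memory:")
    con.executescript(FLAT_SCHEMA)
    con.execute("INSERT INTO file VALUES (1, 'A00.xml')")
    con.execute("INSERT INTO source VALUES (1, 'Sample text', 1)")
    con.execute("INSERT INTO speaker VALUES (1, '25-34')")
    con.execute("INSERT INTO sentence VALUES (1)")
    con.executemany("INSERT INTO word VALUES (?, ?, ?)",
                    [(1, "cats", "cat"), (2, "sleep", "sleep")])
    con.executemany("INSERT INTO corpus VALUES (?, ?, 1, 1, 1)",
                    [(1, 1), (2, 2)])
    return con


def token_info(con, token_id):
    """Fetch word, lemma, source title and file name for one token."""
    return con.execute(
        "SELECT w.word, w.lemma, s.title, f.filename FROM corpus c "
        "JOIN word w ON w.word_id = c.word_id "
        "JOIN source s ON s.source_id = c.source_id "
        "JOIN file f ON f.file_id = s.file_id "
        "WHERE c.token_id = ?", (token_id,)).fetchone()
```

In this sketch, source and lemma information for a token are each one join away from the corpus table, and no look-up has to go through the sentence table.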