BNC corpus: sentence_id is used as source_id for tokens #14

gkunter · 2015-05-13T10:36:33Z

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)

ISSUE:
The BNC corpus currently uses the sentence_id to keep track of the source of a token. This sentence_id is then linked to the actual source table when the text is accessed. This makes look-up of some information rather complicated (see Issue #11), and it causes inconsistent behaviour across corpora: in BNC, context is limited to the sentence, whereas in COCA it is limited to the text (see also Issue #4).

SOLUTION:
The table bnc.element should store the text_id, not the sentence_id. The table bnc.sentence should store token_id as an additional column. This requires changes to tools/create_bnc.py.
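
A minimal sketch of the proposed layout, using an in-memory SQLite database so it runs without a BNC installation. Only the table names element and sentence and the columns text_id, sentence_id, and token_id come from this issue. The source table's title column, the word column, the helper names, and the sample rows are hypothetical, and the sketch assumes that the token_id stored in sentence marks the first token of each sentence. It is not the code in tools/create_bnc.py.

```python
import sqlite3


def build_demo():
    """Create a tiny in-memory database with the proposed (assumed) layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Hypothetical source table; 'title' is an assumed column.
        CREATE TABLE source (text_id INTEGER PRIMARY KEY, title TEXT);
        -- element stores text_id directly instead of sentence_id.
        CREATE TABLE element (
            token_id INTEGER PRIMARY KEY,
            word TEXT,
            text_id INTEGER REFERENCES source(text_id)
        );
        -- sentence stores token_id; assumed here to be the sentence's first token.
        CREATE TABLE sentence (
            sentence_id INTEGER PRIMARY KEY,
            token_id INTEGER REFERENCES element(token_id)
        );
    """)
    conn.execute("INSERT INTO source VALUES (1, 'demo text')")
    words = ["The", "cat", "sat", "It", "slept"]  # made-up sample tokens
    conn.executemany(
        "INSERT INTO element VALUES (?, ?, 1)",
        [(i, w) for i, w in enumerate(words, start=1)],
    )
    conn.executemany("INSERT INTO sentence VALUES (?, ?)", [(1, 1), (2, 4)])
    return conn


def source_of(conn, token_id):
    """Look up a token's source with a single join."""
    row = conn.execute(
        "SELECT s.title FROM element e JOIN source s ON s.text_id = e.text_id "
        "WHERE e.token_id = ?",
        (token_id,),
    ).fetchone()
    return row[0] if row else None


def sentence_of(conn, token_id):
    """Find the sentence whose first token is the closest one at or before token_id."""
    row = conn.execute(
        "SELECT sentence_id FROM sentence WHERE token_id <= ? "
        "ORDER BY token_id DESC LIMIT 1",
        (token_id,),
    ).fetchone()
    return row[0] if row else None


def context(conn, token_id, span):
    """Return the words within `span` tokens of token_id, limited to the same text."""
    rows = conn.execute(
        "SELECT word FROM element "
        "WHERE text_id = (SELECT text_id FROM element WHERE token_id = ?) "
        "AND token_id BETWEEN ? AND ? ORDER BY token_id",
        (token_id, token_id - span, token_id + span),
    ).fetchall()
    return [r[0] for r in rows]


conn = build_demo()
print(source_of(conn, 5), sentence_of(conn, 3), context(conn, 4, 1))
```

With this layout the context window is bounded by the text rather than the sentence, which would match the COCA behaviour described above, while the sentence of a token can still be recovered from bnc.sentence.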

Bitbucket: https://bitbucket.org/gkunter/coquery/issue/14

gkunter · 2015-07-27T21:16:26Z

Original comment by gkunter (Bitbucket: gkunter, GitHub: gkunter):

The latest version of the BNC builder uses a much flatter database layout. The sentence, source, and speaker tables are linked directly to the corpus table, and the file table is linked to the source table. The lemma table has been removed, and lemma information is now part of the word table.
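
As a rough illustration of the flatter layout, here is a sketch with an in-memory SQLite database. The table names (corpus, word, sentence, source, speaker, file) follow the description above; every column name, the direction of the file-to-source link, the helper name, and the sample rows are assumptions made for the example and may differ from what the BNC builder actually creates.

```python
import sqlite3


def build_flat_demo():
    """Create a tiny in-memory database mirroring the described flat layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Lemma information lives in the word table; there is no lemma table.
        CREATE TABLE word (word_id INTEGER PRIMARY KEY, word TEXT, lemma TEXT);
        CREATE TABLE source (source_id INTEGER PRIMARY KEY, title TEXT);
        -- Assumed direction: each file row points at its source.
        CREATE TABLE file (
            file_id INTEGER PRIMARY KEY,
            path TEXT,
            source_id INTEGER REFERENCES source(source_id)
        );
        CREATE TABLE speaker (speaker_id INTEGER PRIMARY KEY, label TEXT);
        CREATE TABLE sentence (sentence_id INTEGER PRIMARY KEY);
        -- The corpus table links directly to sentence, source, and speaker.
        CREATE TABLE corpus (
            token_id INTEGER PRIMARY KEY,
            word_id INTEGER REFERENCES word(word_id),
            sentence_id INTEGER REFERENCES sentence(sentence_id),
            source_id INTEGER REFERENCES source(source_id),
            speaker_id INTEGER REFERENCES speaker(speaker_id)
        );
    """)
    # Made-up sample rows, for illustration only.
    conn.executemany(
        "INSERT INTO word VALUES (?, ?, ?)",
        [(1, "cats", "cat"), (2, "slept", "sleep")],
    )
    conn.execute("INSERT INTO source VALUES (1, 'demo text')")
    conn.execute("INSERT INTO file VALUES (1, 'demo.xml', 1)")
    conn.execute("INSERT INTO speaker VALUES (1, 'S1')")
    conn.execute("INSERT INTO sentence VALUES (1)")
    conn.executemany(
        "INSERT INTO corpus VALUES (?, ?, 1, 1, 1)", [(1, 1), (2, 2)]
    )
    return conn


def describe_token(conn, token_id):
    """Return (word, lemma, source title, file path) for one token."""
    return conn.execute(
        "SELECT w.word, w.lemma, s.title, f.path "
        "FROM corpus c "
        "JOIN word w ON w.word_id = c.word_id "
        "JOIN source s ON s.source_id = c.source_id "
        "JOIN file f ON f.source_id = s.source_id "
        "WHERE c.token_id = ?",
        (token_id,),
    ).fetchone()


flat = build_flat_demo()
print(describe_token(flat, 2))
```

Every look-up starts from the corpus table and needs at most one hop per linked table, with one extra hop from source to file.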

gkunter added major enhancement labels Mar 2, 2016

gkunter closed this as completed Mar 2, 2016
