BNC corpus: sentence_id is used as source_id for tokens #14

Closed
gkunter opened this issue May 13, 2015 · 1 comment

gkunter commented May 13, 2015

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
The BNC corpus currently uses the sentence_id to track the source of a token. The sentence_id is then resolved against the actual source table when the text is accessed. This makes looking up some information rather complicated (see Issue #11), and it causes inconsistent behaviour across corpora: in BNC, context is limited to sentences, whereas in COCA it is limited to texts (see also Issue #4).

SOLUTION:
The table bnc.element should store the text_id, not the sentence_id. The table bnc.sentence should store token_id as an additional column. This requires changes to tools/create_bnc.py.
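
A minimal sketch of what the proposed layout might look like, using an in-memory SQLite database rather than the real BNC database. Only text_id, sentence_id, and token_id come from the description above. The `source` table, the `word` and `title` columns, the reading of sentence.token_id as the first token of a sentence, and the helper functions are all assumptions made for illustration; this is not the code in tools/create_bnc.py.

```python
import sqlite3


def build_demo():
    """Build a tiny in-memory database with the proposed (hypothetical) layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Assumed stand-in for the actual source table.
        CREATE TABLE source (text_id INTEGER PRIMARY KEY, title TEXT);
        -- element stores text_id directly instead of sentence_id.
        CREATE TABLE element (
            token_id INTEGER PRIMARY KEY,
            word TEXT,
            text_id INTEGER REFERENCES source(text_id)
        );
        -- sentence gains a token_id column; here it is assumed to hold
        -- the first token of the sentence.
        CREATE TABLE sentence (
            sentence_id INTEGER PRIMARY KEY,
            token_id INTEGER REFERENCES element(token_id)
        );
    """)
    conn.execute("INSERT INTO source VALUES (1, 'A00')")
    tokens = ["The", "cat", "sat", ".", "Dogs", "bark", "."]
    conn.executemany(
        "INSERT INTO element (token_id, word, text_id) VALUES (?, ?, 1)",
        list(enumerate(tokens, start=1)),
    )
    # Sentence 1 starts at token 1, sentence 2 at token 5.
    conn.executemany("INSERT INTO sentence VALUES (?, ?)", [(1, 1), (2, 5)])
    return conn


def source_of_token(conn, token_id):
    """Look up the source of a token with a single join, no detour via sentence."""
    row = conn.execute(
        "SELECT s.title FROM element e "
        "JOIN source s ON s.text_id = e.text_id "
        "WHERE e.token_id = ?",
        (token_id,),
    ).fetchone()
    return row[0]


def sentence_of_token(conn, token_id):
    """Find the sentence whose first token is the closest one at or before token_id."""
    row = conn.execute(
        "SELECT sentence_id FROM sentence "
        "WHERE token_id <= ? ORDER BY token_id DESC LIMIT 1",
        (token_id,),
    ).fetchone()
    return row[0]
```

Under these assumptions, the source of a token is reachable in one join, and sentence boundaries remain recoverable from the token_id column of the sentence table, so context could be delimited by text as in COCA.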



gkunter commented Jul 27, 2015

Original comment by gkunter (Bitbucket: gkunter, GitHub: gkunter):


The latest version of the BNC builder uses a much flatter database layout. The sentence, source, and speaker tables are linked directly to the corpus table, and the file table is linked to the source table. The lemma table has been removed, and lemma information is now part of the word table.
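
The flatter layout could be pictured roughly as follows, again as an in-memory SQLite sketch. The table names follow the comment above; every column name, the direction of the file-to-source link, the sample rows, and the `describe_token` helper are assumptions for illustration and may differ from what the BNC builder actually creates.

```python
import sqlite3


def build_flat_demo():
    """Build a tiny in-memory database with an assumed flat layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Lemma information lives in the word table; there is no lemma table.
        CREATE TABLE word (word_id INTEGER PRIMARY KEY, word TEXT, lemma TEXT);
        CREATE TABLE source (source_id INTEGER PRIMARY KEY, title TEXT);
        -- Assumed direction: each file row points at its source.
        CREATE TABLE file (
            file_id INTEGER PRIMARY KEY,
            filename TEXT,
            source_id INTEGER REFERENCES source(source_id)
        );
        CREATE TABLE sentence (sentence_id INTEGER PRIMARY KEY);
        CREATE TABLE speaker (speaker_id INTEGER PRIMARY KEY, label TEXT);
        -- The corpus table links directly to sentence, source, and speaker.
        CREATE TABLE corpus (
            token_id INTEGER PRIMARY KEY,
            word_id INTEGER REFERENCES word(word_id),
            sentence_id INTEGER REFERENCES sentence(sentence_id),
            source_id INTEGER REFERENCES source(source_id),
            speaker_id INTEGER REFERENCES speaker(speaker_id)
        );
    """)
    # Made-up demo rows, not real BNC data.
    conn.executemany("INSERT INTO word VALUES (?, ?, ?)",
                     [(1, "cats", "cat"), (2, "sat", "sit")])
    conn.execute("INSERT INTO source VALUES (1, 'Demo text')")
    conn.execute("INSERT INTO file VALUES (1, 'demo.xml', 1)")
    conn.execute("INSERT INTO sentence VALUES (1)")
    conn.execute("INSERT INTO speaker VALUES (1, 'unknown')")
    conn.executemany("INSERT INTO corpus VALUES (?, ?, ?, ?, ?)",
                     [(1, 1, 1, 1, 1), (2, 2, 1, 1, 1)])
    return conn


def describe_token(conn, token_id):
    """Return (word, lemma, source title, filename) for one token."""
    return conn.execute(
        "SELECT w.word, w.lemma, s.title, f.filename "
        "FROM corpus c "
        "JOIN word w ON w.word_id = c.word_id "
        "JOIN source s ON s.source_id = c.source_id "
        "JOIN file f ON f.source_id = s.source_id "
        "WHERE c.token_id = ?",
        (token_id,),
    ).fetchone()
```

In this sketch every token row carries its own source_id, so the source lookup no longer has to pass through the sentence table.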
