Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a unified layout for resource descriptions #11

Closed
gkunter opened this issue Apr 27, 2015 · 3 comments
Closed

Create a unified layout for resource descriptions #11

gkunter opened this issue Apr 27, 2015 · 3 comments

Comments

@gkunter
Copy link
Owner

gkunter commented Apr 27, 2015

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Currently, each corpus module can define an arbitrary set of labels that are used in the different sql_string_xxx functions to construct valid MySQL query strings. However, there is no mechanism that can be used to unify access to different tables across corpora.

SOLUTION:
Instead of using simply strings, a corpus module could contain a complete table layout description that also represents the links between the different tables. This layout could be part of a configuration file, so in order to adjust the module to an existing database, only the configuration file would need to be changed.


@gkunter
Copy link
Owner Author

gkunter commented Apr 29, 2015

Original comment by gkunter (Bitbucket: gkunter, GitHub: gkunter):


The suggested solution is inferior to the mechanism currently employed in the BNC corpus module. In this database layout, text sources are referenced indirectly by the sentence_id. The module uses a table alias that constructs the table structure required to solve this indirect reference.

bnc.py defines three variables for the source data:

  • source_table_name (containing the name of the MySQL table 'text')
  • source_table_alias (containing the 'SOURCETABLE' alias that SQL queries use to access the fields)
  • source_table (containing an SQL SELECT that constructs the new table)

This paradigm should be used consistently in corpus.py so that basically any information might come from whatever location.

@gkunter
Copy link
Owner Author

gkunter commented Jun 4, 2015

Original comment by gkunter (Bitbucket: gkunter, GitHub: gkunter):


For Coquery, the table descriptions are essentially the API to the internal logic. Thus, it is essential that there is a clear structure to it before a public release is possible.

@gkunter
Copy link
Owner Author

gkunter commented Jul 18, 2015

Original comment by gkunter (Bitbucket: gkunter, GitHub: gkunter):


There is now a unified layout for table descriptions.

Each table now contains minimally two resource features: one for the name of the table, and another for the column name that contains the unique row identifier. The names of these two features are fixed: xxx_table and xxx_id, where 'xxx' is the name of the table. Any other resource feature contains other columns in the table.

Minimally, a table definition contains one table named 'corpus'. Any other table is linked to this table by using linking resource features. A linking resource feature has a fixed name: xxx_yyy_id, where 'xxx' is the name of the parent table and 'yyy' the name of the linked child table. The two tables are linked so that every row from table 'xxx' is linked to exactly one row from 'yyy', and the other columns from 'yyy' can be displayed together with the matching columns from 'xxx'. A row from 'yyy' can be linked to more than one row from 'xxx' in this way.

One table can be linked to more than one other table, and a linked table can also contain a link to another table. However, querying linked tables increases processing power notably. A denormalized database design is therefore strongly recommended for a corpus! As corpora are typically read-only databases, there is no danger of data inconsistencies, which is usually considered one strong argument for database normalization. However, a flat design strongly decreases query times (and increases the storage requirements of the corpus).

Technically, this is realized by an INNER JOIN of table yyy on table xxx on the two columns xxx.xxx_yyy_id and yyy.yyy_id.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant