Create a unified layout for resource descriptions #11

gkunter · 2015-04-27T12:14:54Z

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)

ISSUE:
Currently, each corpus module can define an arbitrary set of labels that are used in the different sql_string_xxx functions to construct valid MySQL query strings. However, there is no mechanism that can be used to unify access to different tables across corpora.

SOLUTION:
Instead of using plain strings, a corpus module could contain a complete table layout description that also represents the links between the different tables. This layout could be part of a configuration file, so that adjusting the module to an existing database would only require changing the configuration file.

Bitbucket: https://bitbucket.org/gkunter/coquery/issue/11

gkunter · 2015-04-29T17:02:29Z

Original comment by gkunter (Bitbucket: gkunter, GitHub: gkunter):

The suggested solution is inferior to the mechanism currently employed in the BNC corpus module. In this database layout, text sources are referenced indirectly by the sentence_id. The module uses a table alias that constructs the table structure required to solve this indirect reference.

bnc.py defines three variables for the source data:

source_table_name (containing the name of the MySQL table 'text')
source_table_alias (containing the 'SOURCETABLE' alias that SQL queries use to access the fields)
source_table (containing an SQL SELECT that constructs the new table)

This paradigm should be used consistently in corpus.py so that any piece of information can come from any location in the database.
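The alias mechanism might look roughly like the following sketch. Only the three variable names and the values 'text' and 'SOURCETABLE' come from the description above. The tables `sentence` and `corpus`, all column names, and the helpers `build_query()` and `demo_rows()` are hypothetical, and SQLite stands in for MySQL so that the example is self-contained.

```python
import sqlite3

# The three variables described for bnc.py. The SELECT statement and all
# column names are assumptions made for illustration only.
source_table_name = "text"
source_table_alias = "SOURCETABLE"
source_table = (
    "SELECT sentence.sentence_id AS sentence_id, {name}.title AS title "
    "FROM sentence INNER JOIN {name} ON sentence.text_id = {name}.text_id"
).format(name=source_table_name)


def build_query():
    """Join the corpus table with the aliased sub-select (hypothetical helper)."""
    return (
        "SELECT corpus.token_id, {alias}.title "
        "FROM corpus INNER JOIN ({sub}) AS {alias} "
        "ON corpus.sentence_id = {alias}.sentence_id "
        "ORDER BY corpus.token_id"
    ).format(alias=source_table_alias, sub=source_table)


def demo_rows():
    """Run the query against a tiny in-memory database with made-up data."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE text (text_id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE sentence (sentence_id INTEGER PRIMARY KEY, text_id INTEGER);
        CREATE TABLE corpus (token_id INTEGER PRIMARY KEY, sentence_id INTEGER);
        INSERT INTO text VALUES (1, 'A'), (2, 'B');
        INSERT INTO sentence VALUES (10, 1), (11, 2);
        INSERT INTO corpus VALUES (1, 10), (2, 10), (3, 11);
    """)
    rows = con.execute(build_query()).fetchall()
    con.close()
    return rows
```

The outer query only refers to the alias, so the source fields could be moved to a different table by changing `source_table` alone.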

gkunter · 2015-06-04T07:36:12Z

Original comment by gkunter (Bitbucket: gkunter, GitHub: gkunter):

For Coquery, the table descriptions are essentially the API to the internal logic. Thus, it is essential that they have a clear structure before a public release is possible.

gkunter · 2015-07-18T19:39:14Z

Original comment by gkunter (Bitbucket: gkunter, GitHub: gkunter):

There is now a unified layout for table descriptions.

Each table now has at least two resource features: one for the name of the table, and another for the name of the column that contains the unique row identifier. The names of these two features are fixed: xxx_table and xxx_id, where 'xxx' is the name of the table. Every other resource feature describes another column in the table.
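As an illustration of the naming convention, a resource description could be written as a plain Python class. This is only a sketch: the class name `Resource`, the feature values, and the two helper functions are hypothetical and not taken from the Coquery source.

```python
class Resource:
    # Mandatory features for the table 'corpus' (the values are made up).
    corpus_table = "corpus"   # xxx_table: name of the table in the database
    corpus_id = "id"          # xxx_id: column with the unique row identifier
    # Any further feature describes another column of the table.
    corpus_word = "word"


def table_names(resource):
    """Return the names of all tables declared by an xxx_table feature."""
    suffix = "_table"
    return sorted(
        name[:-len(suffix)] for name in dir(resource) if name.endswith(suffix)
    )


def missing_id_features(resource):
    """Return the tables that lack the mandatory xxx_id feature."""
    return [t for t in table_names(resource) if not hasattr(resource, t + "_id")]
```

A check such as `missing_id_features()` could be run when a corpus module is imported to confirm that every declared table also names its identifier column.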

At a minimum, a table definition contains one table named 'corpus'. Any other table is linked to this table by using linking resource features. A linking resource feature has a fixed name: xxx_yyy_id, where 'xxx' is the name of the parent table and 'yyy' the name of the linked child table. The two tables are linked so that every row from table 'xxx' is linked to exactly one row from 'yyy', and the other columns from 'yyy' can be displayed together with the matching columns from 'xxx'. A row from 'yyy' can be linked to more than one row from 'xxx' in this way.

One table can be linked to more than one other table, and a linked table can also contain a link to another table. However, querying linked tables notably increases processing time. A denormalized database design is therefore strongly recommended for a corpus! Corpora are typically read-only databases, so there is little danger of the data inconsistencies that database normalization is usually meant to prevent. A flat design therefore strongly decreases query times, at the cost of increased storage requirements for the corpus.

Technically, a link between two tables is realized by an INNER JOIN of table 'yyy' on table 'xxx' using the two columns xxx.xxx_yyy_id and yyy.yyy_id.
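A minimal sketch of how such a join could be derived from a linking resource feature is given below. The class `Resource`, its feature values, the table `source`, and the helpers `join_clause()` and `demo_rows()` are hypothetical, and SQLite is used instead of MySQL only to keep the example self-contained.

```python
import sqlite3


class Resource:
    # Parent table 'corpus' (feature values are made up for illustration).
    corpus_table = "corpus"
    corpus_id = "id"
    corpus_word = "word"
    corpus_source_id = "source_id"   # linking feature xxx_yyy_id
    # Linked child table 'source'.
    source_table = "source"
    source_id = "id"
    source_title = "title"


def join_clause(resource, parent, child):
    """Build the INNER JOIN for the link parent -> child."""
    link = getattr(resource, "{}_{}_id".format(parent, child))
    return "INNER JOIN {c} ON {p}.{link} = {c}.{cid}".format(
        p=getattr(resource, parent + "_table"),
        c=getattr(resource, child + "_table"),
        link=link,
        cid=getattr(resource, child + "_id"),
    )


def demo_rows():
    """Run the generated join on a tiny in-memory database."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE source (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE corpus (id INTEGER PRIMARY KEY, word TEXT, source_id INTEGER);
        INSERT INTO source VALUES (1, 'Book A'), (2, 'Book B');
        INSERT INTO corpus VALUES (1, 'the', 1), (2, 'cat', 1), (3, 'sat', 2);
    """)
    query = (
        "SELECT corpus.word, source.title FROM corpus "
        + join_clause(Resource, "corpus", "source")
        + " ORDER BY corpus.id"
    )
    rows = con.execute(query).fetchall()
    con.close()
    return rows
```

In this toy data set each corpus row matches exactly one source row, while the row 'Book A' is shared by two corpus rows, which mirrors the many-to-one relation described above.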

gkunter added major enhancement labels Mar 2, 2016

gkunter closed this as completed Mar 2, 2016

This was referenced Mar 2, 2016

Import of corpus module should involve testing for completeness #13

Closed

BNC corpus: sentence_id is used as source_id for tokens #14

Closed
