CoNLL
CoNLL is a yearly conference with an associated shared task. Over the years, these shared tasks have defined a number of versions of the CoNLL format. Because the tasks frequently involved syntactic parsing, the formats are designed for token-level annotations.
The common ground of all CoNLL formats is a tab-separated table with verticalised text in the first column, i.e. each token of a text is at the beginning of a separate line, followed by additional information. Sentences are separated by blank lines. In its most basic form, the format has only two columns, the token and a label (e.g. a PoS tag or an IOB label), which is a typical format accepted and produced by many sequence labeling tools.
The CoNLL format understood by bconv has 4 columns: token, start_offset, end_offset, label.
Additional columns are silently ignored.
When serialising, the two character-offset columns may be skipped.
# doc_id = 354896
Lidocaine 0 9 S-Chemical
- 9 10 O
induced 10 17 O
cardiac 18 25 B-Disease
asystole 26 34 E-Disease
. 34 35 O
Intravenous 36 47 O
administration 48 62 O
of 63 65 O
...
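The layout described above can be read with a few lines of plain Python. The following is a minimal sketch, not bconv's own loader: the names `Token` and `read_conll` are hypothetical, and the sample assumes tab-separated columns with a blank line between the two sentences.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    text: str
    start: int
    end: int
    label: str


Sentence = List[Token]


def read_conll(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], List[Sentence]]]:
    """Yield (doc_id, sentences) pairs from 4-column, tab-separated lines."""
    doc_id: Optional[str] = None
    sentences: List[Sentence] = []
    current: Sentence = []
    seen_doc = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# doc_id ="):
            # A new document starts: flush whatever was collected so far.
            if current:
                sentences.append(current)
                current = []
            if seen_doc or sentences:
                yield doc_id, sentences
            doc_id = line.split("=", 1)[1].strip()
            sentences, seen_doc = [], True
        elif not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        else:
            cols = line.split("\t")
            if len(cols) < 4:
                raise ValueError(f"expected at least 4 columns: {line!r}")
            # Columns beyond the fourth are ignored.
            token, start, end, label = cols[:4]
            current.append(Token(token, int(start), int(end), label))
    if current:
        sentences.append(current)
    if seen_doc or sentences:
        yield doc_id, sentences


sample = (
    "# doc_id = 354896\n"
    "Lidocaine\t0\t9\tS-Chemical\n"
    "-\t9\t10\tO\n"
    "induced\t10\t17\tO\n"
    "cardiac\t18\t25\tB-Disease\n"
    "asystole\t26\t34\tE-Disease\n"
    ".\t34\t35\tO\n"
    "\n"
    "Intravenous\t36\t47\tO\n"
    "administration\t48\t62\tO\n"
    "of\t63\t65\tO\n"
)
docs = list(read_conll(sample.splitlines()))
```

Here `docs` holds one document with the ID "354896" and two sentences, the first with six tokens and the second with three.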
The CoNLL website contains links to past shared tasks, many of which defined a new version of the data format.
- Document structure: The CoNLL format can represent sentence boundaries and, optionally, document boundaries. Sentences are separated by a blank line, and the beginning of a document is marked with a line starting with # doc_id =. The format has no way to mark section boundaries. When loading, bconv interprets the first sentence of a document as its title and puts all remaining sentences into a single "body" section. Text in CoNLL format is also word-tokenised, and bconv preserves this tokenisation.
- Metadata: The CoNLL format understood by bconv supports document identifiers, which are given in a line starting with # doc_id =. When serialising, this line can be suppressed by setting the include_docid option to False.
- Offsets: When loading, bconv requires character offsets for each token, i.e. each non-blank line must have at least four columns (token, start, end, tag). When serialising, the character offsets can be disabled through the include_offsets option.
- Whitespace: Whitespace (between tokens, sentences, sections etc.) is not represented in CoNLL in any way. The amount of whitespace between text units can be inferred from the character offsets if they are given. This is also why character offsets are required when bconv loads a CoNLL file.
- Entity annotations: Annotations are encoded with the IO, IOB, or IOBES tagging scheme (specified by the tagset option when serialising). By default, entities are annotated with their type (e.g. "B-disease"), but any other Entity.metadata entry may be specified through the label option.
- Discontinuous spans: The IOB[ES] tagging scheme cannot represent discontinuous or overlapping spans. When converting from another format to CoNLL, discontinuous and overlapping annotations are subject to entity flattening. Future versions of bconv might implement the DB/DI/HB/HI tag-set extensions as proposed by Zhang et al. (2014), Metke-Jimenez and Karimi (2016), and Dai (2018).
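To illustrate the points on whitespace and entity annotations, the sketch below rebuilds the text of a sentence from its character offsets and collects entity spans from the tags. It is a standalone illustration, not bconv's implementation. The helper names `rebuild_text` and `decode_spans` are hypothetical, and filling gaps with plain spaces is an assumption, since the offsets reveal only how much whitespace there was, not which characters.

```python
from typing import List, Optional, Sequence, Tuple

Row = Tuple[str, int, int, str]  # token, start_offset, end_offset, label


def rebuild_text(rows: Sequence[Row], base: int = 0) -> str:
    """Join tokens, padding each offset gap with spaces (an assumption)."""
    parts: List[str] = []
    pos = base
    for token, start, end, _label in rows:
        parts.append(" " * (start - pos))  # empty if the tokens are adjacent
        parts.append(token)
        pos = end
    return "".join(parts)


def decode_spans(rows: Sequence[Row]) -> List[Tuple[int, int, str]]:
    """Collect (start, end, type) spans from IO, IOB or IOBES tags."""
    spans: List[Tuple[int, int, str]] = []
    current: Optional[list] = None  # [start, end, type] of the open entity

    def close() -> None:
        nonlocal current
        if current is not None:
            spans.append((current[0], current[1], current[2]))
            current = None

    for _token, start, end, label in rows:
        if label == "O":
            close()
            continue
        # Split at the first hyphen only, so types may contain hyphens.
        prefix, _sep, etype = label.partition("-")
        if prefix in ("B", "S") or current is None or current[2] != etype:
            close()
            current = [start, end, etype]
        else:
            current[1] = end  # extend the open entity
        if prefix in ("E", "S"):
            close()
    close()
    return spans


rows = [
    ("Lidocaine", 0, 9, "S-Chemical"),
    ("-", 9, 10, "O"),
    ("induced", 10, 17, "O"),
    ("cardiac", 18, 25, "B-Disease"),
    ("asystole", 26, 34, "E-Disease"),
    (".", 34, 35, "O"),
]
text = rebuild_text(rows)
spans = decode_spans(rows)
```

With the IO scheme, two adjacent entities of the same type cannot be told apart, so a decoder like this one merges them into a single span.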
fmt | conll |
---|---|
native type | Collection |
lazy loading | yes |
supports text | yes |
supports annotations | yes |
stream type | text |
name | type | default | purpose |
---|---|---|---|
label | str | 'type' | key in Entity.metadata for storing the label |
fmt | conll |
---|---|
supports text | yes |
supports annotations | yes |
stream type | text |
name | type | default | purpose |
---|---|---|---|
label | str | 'type' | key in Entity.metadata to use as the label |
tagset | str | 'IOBES' | one of 'IO', 'IOB', 'IOBES' |
include_docid | bool | True | add a document-ID comment at document start |
include_offsets | bool | True | add two columns with character offsets |
avoid_gaps | str | 'split' | suppress discontinuous spans |
avoid_overlaps | str | 'keep-longer' | suppress annotation collisions |
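The difference between the three tagset values can be shown with a small standalone encoder. This is a sketch under stated assumptions, not bconv's serialiser: the function `encode_tags` is hypothetical, entities are given as inclusive token-index ranges that do not overlap, and 'IOB' is taken to mean that every entity starts with a B tag.

```python
from typing import List, Sequence, Tuple

Entity = Tuple[int, int, str]  # first token index, last token index, type


def encode_tags(n_tokens: int, entities: Sequence[Entity],
                tagset: str = "IOBES") -> List[str]:
    """Return one tag per token for non-overlapping, contiguous entities."""
    if tagset not in ("IO", "IOB", "IOBES"):
        raise ValueError(f"unknown tagset: {tagset!r}")
    tags = ["O"] * n_tokens
    for first, last, etype in entities:
        for i in range(first, last + 1):
            if tagset == "IO":
                prefix = "I"
            elif tagset == "IOB":
                prefix = "B" if i == first else "I"
            elif first == last:
                prefix = "S"  # single-token entity (IOBES only)
            elif i == first:
                prefix = "B"
            elif i == last:
                prefix = "E"  # last token of a multi-token entity
            else:
                prefix = "I"
            tags[i] = f"{prefix}-{etype}"
    return tags


# Tokens: Lidocaine - induced cardiac asystole .
entities = [(0, 0, "Chemical"), (3, 4, "Disease")]
iobes = encode_tags(6, entities, "IOBES")
iob = encode_tags(6, entities, "IOB")
io = encode_tags(6, entities, "IO")
```

For the six tokens of the example sentence, only the IOBES variant marks the single-token entity with S and the end of the two-token entity with E; the IO variant labels every entity token with I.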