Lenz Furrer edited this page May 24, 2021 · 8 revisions

CoNLL

CoNLL is a yearly conference with an associated shared task. Over the years, these shared tasks have defined a number of versions of the CoNLL format. Because the tasks frequently involved syntactic parsing, the formats are designed for token-level annotations.

The common ground of all CoNLL formats is a tab-separated table with verticalised text in the first column, i.e. each token of a text is at the beginning of a separate line, followed by additional information. Sentences are separated by blank lines. In its most basic form, the format has only two columns, the token and a label (e.g. a PoS tag or an IOB label), which is a typical format accepted and produced by many sequence labeling tools.

The CoNLL format understood by bconv has 4 columns: token, start_offset, end_offset, label. Additional columns are silently ignored. When serialising, the two character-offset columns may be skipped.

Example

# doc_id = 354896
Lidocaine	0	9	S-Chemical
-	9	10	O
induced	10	17	O
cardiac	18	25	B-Disease
asystole	26	34	E-Disease
.	34	35	O

Intravenous	36	47	O
administration	48	62	O
of	63	65	O
...
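
The four-column layout shown above can be read with a few lines of plain Python. The following is a minimal sketch, not bconv's own loader: the names `Token` and `parse_conll` are hypothetical, and the handling of the `# doc_id =` line is an assumption based on the description on this page.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    text: str
    start: int
    end: int
    label: str


# Assumed marker for the document-ID comment line.
DOCID_PREFIX = "# doc_id ="


def parse_conll(
    lines: Iterable[str],
) -> Iterator[Tuple[Optional[str], List[List[Token]]]]:
    """Yield (doc_id, sentences) pairs from 4-column CoNLL lines."""
    doc_id, sentences, current = None, [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(DOCID_PREFIX):
            # A new document starts: emit the previous one, if any.
            if current:
                sentences.append(current)
            if sentences:
                yield doc_id, sentences
            doc_id, sentences, current = line[len(DOCID_PREFIX):].strip(), [], []
        elif not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        else:
            # Columns beyond the fourth are ignored; fewer than four
            # columns raise a ValueError when unpacking.
            text, start, end, label = line.split("\t")[:4]
            current.append(Token(text, int(start), int(end), label))
    if current:
        sentences.append(current)
    if sentences:
        yield doc_id, sentences


sample = (
    "# doc_id = 354896\n"
    "Lidocaine\t0\t9\tS-Chemical\n"
    "-\t9\t10\tO\n"
    "induced\t10\t17\tO\n"
    "cardiac\t18\t25\tB-Disease\n"
    "asystole\t26\t34\tE-Disease\n"
    ".\t34\t35\tO\n"
    "\n"
    "Intravenous\t36\t47\tO\textra-column\n"
)
docs = list(parse_conll(sample.splitlines()))
```

Here `docs` holds one document with two sentences. The fifth column on the last line is dropped, mirroring the statement that additional columns are ignored.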

Full example

Sources

The CoNLL website contains links to past shared tasks, many of which defined a new version of the data format.

Notes

  • Document structure: The CoNLL format can represent sentence and (optionally) document boundaries. Sentences are separated by a blank line, and the beginning of a document is marked with a line starting with # doc_id =. The format does not have a way to mark section boundaries. When loading, bconv interprets the first sentence of a document as its title and puts all remaining sentences into a single "body" section. Text in CoNLL format is also word-tokenised, which is preserved by bconv.
  • Metadata: The CoNLL format understood by bconv supports document identifiers, which are given in a line starting with # doc_id =. When serialising, this line can be suppressed by setting the include_docid option to False.
  • Offsets: When loading, bconv requires character offsets for each token, i.e. each non-blank line must have at least four columns (token, start, end, tag). When serialising, the character offsets can be disabled through the include_offsets option.
  • Whitespace: Whitespace (between tokens, sentences, sections etc.) is not represented in CoNLL in any way. The amount of whitespace between text units can be inferred from the character offsets if they are given. This is also why character offsets are required when bconv loads a CoNLL file.
  • Entity annotations: Annotations are encoded with the IO, IOB, or IOBES tagging scheme (specified by the tagset option when serialising). By default, entities are annotated with their type (e.g. "B-disease"), but any other Entity.metadata entry may be specified through the label option.
  • Discontinuous spans: The IOB[ES] tagging scheme cannot represent discontinuous or overlapping spans. When converting from another format to CoNLL, discontinuous and overlapping annotations are subject to entity flattening. Future versions of bconv might implement the DB/DI/HB/HI tag-set extensions as proposed by Zhang et al. (2014), Metke-Jimenez and Karimi (2016), and Dai (2018).
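
The note on whitespace can be made concrete: since each token carries its start and end offset, the size of the gap before a token is its start offset minus the previous token's end offset. The sketch below rebuilds a text from such triples. It is an illustration only; the function name `rebuild_text` is hypothetical, and filling every gap with space characters is an assumption, since the format does not record which whitespace characters were originally there.

```python
def rebuild_text(tokens, filler=" "):
    """Rebuild plain text from (token, start, end) triples.

    Gaps between consecutive tokens are filled with `filler`, because
    the original whitespace characters are not stored in the format.
    """
    parts, cursor = [], None
    for text, start, end in tokens:
        if cursor is not None:
            # Gap width = distance from the previous token's end.
            parts.append(filler * (start - cursor))
        parts.append(text)
        cursor = end
    return "".join(parts)


tokens = [
    ("Lidocaine", 0, 9),
    ("-", 9, 10),
    ("induced", 10, 17),
    ("cardiac", 18, 25),
    ("asystole", 26, 34),
    (".", 34, 35),
    ("Intravenous", 36, 47),
]
text = rebuild_text(tokens)
# text == "Lidocaine-induced cardiac asystole. Intravenous"
```

Without the offset columns, there would be no way to tell that "Lidocaine", "-" and "induced" are adjacent while "induced" and "cardiac" are separated by one character.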

Loaders

CoNLLLoader

Properties

fmt                   conll
native type           Collection
lazy loading          yes
supports text         yes
supports annotations  yes
stream type           text

Options

name   type  default  purpose
label  str   'type'   key in Entity.metadata for storing the label
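
To illustrate what reading the label column involves, the following sketch turns per-token tags into entity spans. It is not bconv's implementation: the function name `decode_tags` is hypothetical, and the lenient treatment of ill-formed sequences (for example an I- tag without a preceding B- tag) is an assumption made so that IO, IOB, and IOBES input can all be handled by the same code. The third element of each span corresponds to the value that would be stored under the key named by the label option.

```python
def decode_tags(rows):
    """Turn (token, start, end, tag) rows into (start, end, label) spans."""
    spans, open_span = [], None  # open_span is [start, end, label]
    for _text, start, end, tag in rows:
        prefix, _, label = tag.partition("-")
        continues = (
            prefix in ("I", "E")
            and open_span is not None
            and open_span[2] == label
        )
        if continues:
            # Extend the running span up to this token's end.
            open_span[1] = end
        else:
            # Anything else closes a running span first ...
            if open_span is not None:
                spans.append(tuple(open_span))
                open_span = None
            # ... and every tag except "O" opens a new one.
            if prefix != "O":
                open_span = [start, end, label]
        if prefix in ("E", "S") and open_span is not None:
            # E and S mark the last token of an entity.
            spans.append(tuple(open_span))
            open_span = None
    if open_span is not None:
        spans.append(tuple(open_span))
    return spans


rows = [
    ("Lidocaine", 0, 9, "S-Chemical"),
    ("-", 9, 10, "O"),
    ("induced", 10, 17, "O"),
    ("cardiac", 18, 25, "B-Disease"),
    ("asystole", 26, 34, "E-Disease"),
    (".", 34, 35, "O"),
]
entities = decode_tags(rows)
# entities == [(0, 9, "Chemical"), (18, 34, "Disease")]
```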

Exporters

CoNLLFormatter

Properties

fmt                   conll
supports text         yes
supports annotations  yes
stream type           text

Options

name             type  default        purpose
label            str   'type'         key in Entity.metadata to use as the label
tagset           str   'IOBES'        one of 'IO', 'IOB', 'IOBES'
include_docid    bool  True           add a document-ID comment at document start
include_offsets  bool  True           add two columns with character offsets
avoid_gaps       str   'split'        suppress discontinuous spans
avoid_overlaps   str   'keep-longer'  suppress annotation collisions
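
The effect of the tagset option can be illustrated by tagging the same entities under each of the three schemes. The sketch below is a stand-alone illustration with a hypothetical function name, `encode_tags`, and is not the formatter's actual code. It assumes that spans are contiguous and non-overlapping (i.e. already flattened), and that 'IOB' places a B- tag on the first token of every entity; whether bconv follows exactly this IOB variant is not stated on this page.

```python
def encode_tags(tokens, spans, tagset="IOBES"):
    """Tag (start, end) tokens, given (start, end, label) entity spans."""
    if tagset not in ("IO", "IOB", "IOBES"):
        raise ValueError(f"unknown tagset: {tagset}")
    tags = ["O"] * len(tokens)
    for s_start, s_end, label in spans:
        # Indices of the tokens that lie fully inside the span.
        inside = [
            i for i, (t_start, t_end) in enumerate(tokens)
            if s_start <= t_start and t_end <= s_end
        ]
        for pos, i in enumerate(inside):
            first, last = pos == 0, pos == len(inside) - 1
            if tagset == "IO":
                prefix = "I"
            elif tagset == "IOB":
                prefix = "B" if first else "I"
            elif first and last:
                prefix = "S"  # single-token entity
            elif first:
                prefix = "B"
            elif last:
                prefix = "E"
            else:
                prefix = "I"
            tags[i] = f"{prefix}-{label}"
    return tags


tokens = [(0, 9), (9, 10), (10, 17), (18, 25), (26, 34), (34, 35)]
spans = [(0, 9, "Chemical"), (18, 34, "Disease")]

iobes = encode_tags(tokens, spans, "IOBES")
iob = encode_tags(tokens, spans, "IOB")
io = encode_tags(tokens, spans, "IO")
# iobes == ["S-Chemical", "O", "O", "B-Disease", "E-Disease", "O"]
# iob   == ["B-Chemical", "O", "O", "B-Disease", "I-Disease", "O"]
# io    == ["I-Chemical", "O", "O", "I-Disease", "I-Disease", "O"]
```

Note that IO cannot distinguish two adjacent entities of the same type from one longer entity, which is one reason to prefer IOB or IOBES when that distinction matters.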