CoNLL

CoNLL is a yearly conference with an associated shared task, in the course of which several versions of the CoNLL format have been defined. The shared tasks frequently involved syntactic parsing, and as such the formats are designed for token-level annotations.

The common ground of all CoNLL formats is a tab-separated table with verticalised text in the first column, i.e. each token of a text is at the beginning of a separate line, followed by additional information. Sentences are separated by blank lines. In its most basic form, the format has only two columns, the token and a label (e.g. a PoS tag or an IOB label), which is a typical format accepted and produced by many sequence labeling tools.

The CoNLL format understood by bconv has 4 columns: token, start_offset, end_offset, label. Additional columns are silently ignored. When serialising, the two character-offset columns may be skipped.

Example

# doc_id = 354896
Lidocaine	0	9	S-Chemical
-	9	10	O
induced	10	17	O
cardiac	18	25	B-Disease
asystole	26	34	E-Disease
.	34	35	O

Intravenous	36	47	O
administration	48	62	O
of	63	65	O
...

→ Full example
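
The four-column layout shown above can be read with a few lines of plain Python. The following is a minimal sketch, not bconv's own loader: the names `Token` and `read_conll` are made up for illustration, and the grouping into `(doc_id, sentences)` pairs is an assumption about a convenient in-memory shape.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    start: int
    end: int
    label: str


def read_conll(lines):
    """Return a list of (doc_id, sentences) pairs (hypothetical helper).

    Each sentence is a list of Token tuples. Columns beyond the
    fourth are ignored, mirroring the behaviour described above.
    """
    docs = []
    sentence = []

    def flush():
        # Close the sentence in progress, if any.
        if sentence:
            if not docs:
                docs.append((None, []))  # tokens before any doc_id line
            docs[-1][1].append(list(sentence))
            sentence.clear()

    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("# doc_id ="):
            flush()
            docs.append((line.split("=", 1)[1].strip(), []))
        elif not line.strip():
            flush()  # a blank line ends the sentence
        else:
            cols = line.split("\t")
            if len(cols) < 4:
                raise ValueError(f"expected at least 4 columns: {line!r}")
            text, start, end, label = cols[:4]
            sentence.append(Token(text, int(start), int(end), label))
    flush()
    return docs


SAMPLE = (
    "# doc_id = 354896\n"
    "Lidocaine\t0\t9\tS-Chemical\n"
    "-\t9\t10\tO\n"
    "induced\t10\t17\tO\n"
    "\n"
    "Intravenous\t36\t47\tO\tignored-extra-column\n"
)

docs = read_conll(SAMPLE.splitlines())
```

Here `docs` holds one document with the identifier `"354896"` and two sentences. A line with fewer than four columns raises a `ValueError` in this sketch.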

Sources

The CoNLL website contains links to past shared tasks, many of which defined a new version of the data format.

Notes

Document structure: The CoNLL format can represent sentence and (optionally) document boundaries. Sentences are separated by a blank line, and the beginning of a document is marked with a line starting with # doc_id =. The format does not have a way to mark section boundaries. When loading, bconv interprets the first sentence of a document as its title and puts all remaining sentences into a single "body" section. Text in CoNLL format is also word-tokenised, which is preserved by bconv.
Metadata: The CoNLL format understood by bconv supports document identifiers, which are given in a line starting with # doc_id =. When serialising, this line can be suppressed by setting the include_docid option to False.
Offsets: When loading, bconv requires character offsets for each token, i.e. each non-blank line must have at least four columns (token, start, end, tag). When serialising, the character offsets can be disabled through the include_offsets option.
Whitespace: Whitespace (between tokens, sentences, sections etc.) is not represented in CoNLL in any way. The amount of whitespace between text units can be inferred from the character offsets if they are given. This is also why character offsets are required when bconv loads a CoNLL file.
Entity annotations: Annotations are encoded with the IO, IOB, or IOBES tagging scheme (specified by the tagset option when serialising). By default, entities are annotated with their type (e.g. "B-disease"), but any other Entity.metadata entry may be specified through the label option.
Discontinuous spans: The IOB[ES] tagging scheme cannot represent discontinuous or overlapping spans. When converting from another format to CoNLL, discontinuous and overlapping annotations are subject to entity flattening. Future versions of bconv might implement the DB/DI/HB/HI tag-set extensions as proposed by Zhang et al. (2014), Metke-Jimenez and Karimi (2016), and Dai (2018).
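
Two of the notes above lend themselves to a short illustration: recovering whitespace from character offsets, and turning IO/IOB/IOBES tags back into entity spans. The sketch below is not taken from bconv. `rebuild_text` and `decode_entities` are hypothetical helpers, and filling every gap with plain spaces is an assumption, since the offsets reveal only how much whitespace there was, not which kind.

```python
def rebuild_text(tokens):
    """Join (text, start, end, label) rows, padding offset gaps with spaces."""
    parts = []
    pos = tokens[0][1] if tokens else 0
    for text, start, _end, _label in tokens:
        parts.append(" " * (start - pos))  # gap since the previous token
        parts.append(text)
        pos = start + len(text)
    return "".join(parts)


def decode_entities(tokens):
    """Collect (start, end, type) spans from IO, IOB or IOBES labels."""
    spans = []
    current = None  # [start, end, type] of the entity being built
    for _text, start, end, label in tokens:
        tag, _, etype = label.partition("-")
        continues = (
            current is not None
            and tag in ("I", "E")
            and current[2] == etype
        )
        if continues:
            current[1] = end
        else:
            if current is not None:
                spans.append(tuple(current))
            current = [start, end, etype] if tag != "O" else None
        if current is not None and tag in ("E", "S"):
            # E and S close the entity immediately.
            spans.append(tuple(current))
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans


sentence = [
    ("Lidocaine", 0, 9, "S-Chemical"),
    ("-", 9, 10, "O"),
    ("induced", 10, 17, "O"),
    ("cardiac", 18, 25, "B-Disease"),
    ("asystole", 26, 34, "E-Disease"),
    (".", 34, 35, "O"),
]

text = rebuild_text(sentence)        # "Lidocaine-induced cardiac asystole."
entities = decode_entities(sentence)  # [(0, 9, "Chemical"), (18, 34, "Disease")]
```

With the IO scheme, two adjacent entities of the same type cannot be told apart, so this decoder merges them into a single span.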

Loaders

`CoNLLLoader`

Properties

fmt	`conll`
native type	Collection
lazy loading	yes
supports text	yes
supports annotations	yes
stream type	text

Options

name	type	default	purpose
label	str	`'type'`	key in `Entity.metadata` for storing the label

Exporters

`CoNLLFormatter`

Properties

fmt	`conll`
supports text	yes
supports annotations	yes
stream type	text

Options

name	type	default	purpose
label	str	`'type'`	key in `Entity.metadata` to use as the label
tagset	str	`'IOBES'`	one of `'IO'`, `'IOB'`, `'IOBES'`
include_docid	bool	`True`	add a document-ID comment at document start
include_offsets	bool	`True`	add two columns with character offsets
avoid_gaps	str	`'split'`	suppress discontinuous spans
avoid_overlaps	str	`'keep-longer'`	suppress annotation collisions
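
To make the effect of the tagset, include_docid and include_offsets options more concrete, here is a minimal sketch of how token labels could be derived and written out. It is an illustration only, not bconv's formatter: `tag_tokens` and `to_conll` are hypothetical functions, entities are assumed to be contiguous and non-overlapping (i.e. already flattened), and `'IOB'` is assumed to mean the variant in which every entity starts with a B tag.

```python
def tag_tokens(tokens, entities, tagset="IOBES"):
    """Label (text, start, end) tokens for (start, end, type) entities."""
    if tagset not in ("IO", "IOB", "IOBES"):
        raise ValueError(f"unknown tagset: {tagset!r}")
    labels = ["O"] * len(tokens)
    for e_start, e_end, etype in entities:
        # Indices of all tokens that overlap the entity span.
        covered = [
            i for i, (_text, start, end) in enumerate(tokens)
            if start < e_end and end > e_start
        ]
        for n, i in enumerate(covered):
            first = n == 0
            last = n == len(covered) - 1
            if tagset == "IO":
                prefix = "I"
            elif tagset == "IOB":
                prefix = "B" if first else "I"
            elif first and last:
                prefix = "S"
            elif first:
                prefix = "B"
            elif last:
                prefix = "E"
            else:
                prefix = "I"
            labels[i] = f"{prefix}-{etype}"
    return labels


def to_conll(tokens, labels, doc_id=None, include_offsets=True):
    """Serialise one sentence; doc_id=None stands in for include_docid=False."""
    lines = []
    if doc_id is not None:
        lines.append(f"# doc_id = {doc_id}")
    for (text, start, end), label in zip(tokens, labels):
        if include_offsets:
            cols = [text, str(start), str(end), label]
        else:
            cols = [text, label]
        lines.append("\t".join(cols))
    return "\n".join(lines) + "\n"


tokens = [
    ("Lidocaine", 0, 9), ("-", 9, 10), ("induced", 10, 17),
    ("cardiac", 18, 25), ("asystole", 26, 34), (".", 34, 35),
]
entities = [(0, 9, "Chemical"), (18, 34, "Disease")]

iobes = tag_tokens(tokens, entities, tagset="IOBES")
iob = tag_tokens(tokens, entities, tagset="IOB")
io = tag_tokens(tokens, entities, tagset="IO")
two_column = to_conll(tokens, iob, include_offsets=False)
```

With offsets disabled, the output falls back to the basic two-column form (token and label) that many sequence-labelling tools accept.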
