Entity Flattening
Entity annotations are references to a portion of text, typically a contiguous sequence of one or more words. However, there are complex cases like discontinuous spans (containing gaps) and overlapping annotations, which are not fully supported by all output formats.
In many languages, there is a pattern (elliptic coordination) that can be a frequent source of both discontinuous and overlapping annotations, as in this example:
ES and somatic cells
--             -----  CL:0002322 (embryonic stem cell)
       -------------  CL:0002371 (somatic cell)
Here, the annotation for “ES … cells” is discontinuous (composed of two separated spans) and partially overlaps with the annotation for “somatic cells”. This structure can be easily represented in stand-off formats like BioNLP:
T1 CL:0002322 0 2;15 20 ES ... cells
T2 CL:0002371 7 20 somatic cells
or PubAnnotation JSON:
{
  "text": "ES and somatic cells",
  "denotations": [
    {
      "id": "T1",
      "span": [
        { "begin": 0, "end": 2 },
        { "begin": 15, "end": 20 }
      ],
      "obj": "CL:0002322"
    },
    {
      "id": "T2",
      "span": { "begin": 7, "end": 20 },
      "obj": "CL:0002371"
    }
  ]
}
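To make the two properties concrete, here is a minimal Python sketch that reads the BioNLP offset fields from the example and checks for gaps and overlaps. It does not use bconv; the helper names (`parse_offsets`, `is_discontinuous`, `overlaps`) are hypothetical and exist only for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def parse_offsets(field: str) -> List[Span]:
    """Parse a BioNLP-style offset field such as "0 2;15 20"."""
    spans = []
    for part in field.split(";"):
        begin, end = part.split()
        spans.append((int(begin), int(end)))
    return spans


def is_discontinuous(spans: List[Span]) -> bool:
    """An annotation with more than one sub-span contains a gap."""
    return len(spans) > 1


def overlaps(a: List[Span], b: List[Span]) -> bool:
    """True if any sub-span of a shares a character with any sub-span of b."""
    return any(b1 < e2 and b2 < e1 for b1, e1 in a for b2, e2 in b)


text = "ES and somatic cells"
t1 = parse_offsets("0 2;15 20")   # CL:0002322
t2 = parse_offsets("7 20")        # CL:0002371

# Rebuild the covered text, marking the gap the same way the example does.
t1_text = " ... ".join(text[b:e] for b, e in t1)
t2_text = " ... ".join(text[b:e] for b, e in t2)
```

With these inputs, `t1` is discontinuous, `t2` is not, and the two overlap on the characters 15 to 20 (“cells”).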
Other formats are more limited. For example, PubTator does not allow discontinuous spans, and the CoNLL format, in addition, cannot properly represent overlapping annotations. When serialising complex annotations to a restricted output format, they need to be simplified (flattened) beforehand, i.e. converted into contiguous and/or non-overlapping spans.
Inspired by Sampo Pyysalo's standoff2conll converter, bconv provides two formatter options, avoid_gaps and avoid_overlaps, which control flattening of entity annotations.
These two options are available for all output formats that support annotations, but their default value is not the same for all formats.
The options can be passed to the top-level functions bconv.dump() and bconv.dumps(), just like format-specific options:
bconv.dump(coll, fmt="pubtator", avoid_gaps="fill")
Discontinuous annotations (like “ES ... cells” in the example above) can be converted to contiguous spans with different strategies.
The avoid_gaps option can take one of the following values:
- None: Do not convert (keep gaps). This is the default value for formats that support discontinuous annotations.
- "split": Create a separate annotation for each contiguous sub-span of the original annotation. This is the default value for formats that do not support discontinuous annotations. The above example in BioNLP becomes:
  T1 CL:0002322 0 2 ES
  T2 CL:0002322 15 20 cells
  T3 CL:0002371 7 20 somatic cells
- "fill": Merge the sub-spans by including intervening tokens. Example:
  T1 CL:0002322 0 20 ES and somatic cells
  T2 CL:0002371 7 20 somatic cells
- "first": Keep only the first sub-span. Example:
  T1 CL:0002322 0 2 ES
  T2 CL:0002371 7 20 somatic cells
- "last": Keep only the last sub-span. Example:
  T1 CL:0002322 15 20 cells
  T2 CL:0002371 7 20 somatic cells
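The four strategies can be sketched in a few lines of plain Python. This is an illustration of the behaviour described above, not bconv's actual implementation; the function name `remove_gaps` and the span representation (lists of `(begin, end)` tuples) are assumptions.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def remove_gaps(spans: List[Span], strategy: Optional[str]) -> List[List[Span]]:
    """Turn one annotation into one or more annotations.

    Each returned annotation is a list of sub-spans. For every strategy
    except None, each returned annotation has exactly one sub-span.
    """
    spans = sorted(spans)
    if strategy is None:
        return [spans]                      # keep gaps untouched
    if strategy == "split":
        return [[span] for span in spans]   # one annotation per sub-span
    if strategy == "fill":
        begin = spans[0][0]
        end = max(e for _, e in spans)
        return [[(begin, end)]]             # cover the gap as well
    if strategy == "first":
        return [[spans[0]]]
    if strategy == "last":
        return [[spans[-1]]]
    raise ValueError(f"unknown strategy: {strategy!r}")


# "ES ... cells" from the example above
es_cells = [(0, 2), (15, 20)]
results = {s: remove_gaps(es_cells, s) for s in (None, "split", "fill", "first", "last")}
```

Only "split" can increase the number of annotations. "first" and "last" discard part of the original span, and "fill" adds text that was not annotated.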
Overlapping annotations (as for the span “cells” in the example above) can be avoided by removing all but one of a cluster of partially co-located entity annotations.
The avoid_overlaps option can take one of the following values:
- None: Do not convert (keep overlaps). This is the default value for all formats except CoNLL.
- "keep-longer": For any set of overlapping annotations, keep only the longest. This is the default value for CoNLL output. In the example above, only “somatic cells” is kept, as it is longer (13 characters) than “ES ... cells” (2+5=7 characters):
  T1 CL:0002371 7 20 somatic cells
- "keep-shorter": For any set of overlapping annotations, keep only the shortest. Example:
  T1 CL:0002322 0 2;15 20 ES ... cells
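One possible reading of "cluster" is a group of annotations connected through pairwise overlaps. The sketch below implements that reading in plain Python and measures length as the number of annotated characters, with gaps excluded, as in the 2+5=7 calculation above. It is not bconv's code: the names `suppress_overlaps`, `overlaps` and `length` are hypothetical, and the tie-breaking between equally long annotations is an arbitrary choice.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]          # (begin, end) character offsets, end exclusive
Anno = Tuple[str, List[Span]]   # (label, sub-spans)


def length(anno: Anno) -> int:
    """Total number of annotated characters, gaps excluded."""
    return sum(end - begin for begin, end in anno[1])


def overlaps(a: Anno, b: Anno) -> bool:
    """True if any sub-span of a shares a character with any sub-span of b."""
    return any(b1 < e2 and b2 < e1 for b1, e1 in a[1] for b2, e2 in b[1])


def suppress_overlaps(annos: List[Anno], strategy: Optional[str]) -> List[Anno]:
    if strategy is None:
        return list(annos)
    if strategy not in ("keep-longer", "keep-shorter"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    clusters: List[List[Anno]] = []
    for anno in annos:
        # Merge every existing cluster that the new annotation touches.
        hits, rest = [], []
        for cluster in clusters:
            touching = any(overlaps(anno, other) for other in cluster)
            (hits if touching else rest).append(cluster)
        merged = [member for cluster in hits for member in cluster] + [anno]
        clusters = rest + [merged]
    pick = max if strategy == "keep-longer" else min
    kept = [pick(cluster, key=length) for cluster in clusters]
    return sorted(kept, key=lambda anno: min(anno[1]))


annos = [
    ("CL:0002322", [(0, 2), (15, 20)]),   # ES ... cells
    ("CL:0002371", [(7, 20)]),            # somatic cells
]
longer = suppress_overlaps(annos, "keep-longer")
shorter = suppress_overlaps(annos, "keep-shorter")
```

Here `longer` contains only the CL:0002371 annotation and `shorter` only the CL:0002322 annotation, matching the two examples above.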
Note: Using avoid_overlaps without avoid_gaps is possible, but not typical usage, since there is no format that supports discontinuous annotations without supporting overlapping annotations (as in the last example).
If both options are set, gap removal is performed before overlap suppression. With the example from above, the effect of the two simplification steps can be illustrated for all combinations:
- avoid_gaps="split", avoid_overlaps="keep-longer"
  T1 CL:0002322 0 2 ES
  T2 CL:0002371 7 20 somatic cells
- avoid_gaps="split", avoid_overlaps="keep-shorter"
  T1 CL:0002322 0 2 ES
  T2 CL:0002322 15 20 cells
- avoid_gaps="fill", avoid_overlaps="keep-longer"
  T1 CL:0002322 0 20 ES and somatic cells
- avoid_gaps="fill", avoid_overlaps="keep-shorter"
  T1 CL:0002371 7 20 somatic cells
- avoid_gaps="first", avoid_overlaps=("keep-longer"|"keep-shorter")
  T1 CL:0002322 0 2 ES
  T2 CL:0002371 7 20 somatic cells
- avoid_gaps="last", avoid_overlaps="keep-longer"
  T1 CL:0002371 7 20 somatic cells
- avoid_gaps="last", avoid_overlaps="keep-shorter"
  T1 CL:0002322 15 20 cells
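The order of the two steps can be sketched as a small pipeline. Because gap removal runs first, every annotation is a single contiguous span by the time overlaps are handled, so a left-to-right sweep is enough to find the clusters. This is an illustrative sketch under those assumptions, not bconv's implementation; `flatten` and its helpers are hypothetical names, and the output lines only imitate the BioNLP layout used above.

```python
from typing import List, Tuple

Span = Tuple[int, int]       # (begin, end) character offsets, end exclusive
Flat = Tuple[str, int, int]  # (label, begin, end), always contiguous


def remove_gaps(label: str, spans: List[Span], strategy: str) -> List[Flat]:
    spans = sorted(spans)
    if strategy == "split":
        chosen = spans
    elif strategy == "fill":
        chosen = [(spans[0][0], max(e for _, e in spans))]
    elif strategy == "first":
        chosen = spans[:1]
    elif strategy == "last":
        chosen = spans[-1:]
    else:
        raise ValueError(f"unknown avoid_gaps value: {strategy!r}")
    return [(label, b, e) for b, e in chosen]


def suppress_overlaps(annos: List[Flat], strategy: str) -> List[Flat]:
    pick = {"keep-longer": max, "keep-shorter": min}[strategy]
    size = lambda a: a[2] - a[1]
    kept, cluster, cluster_end = [], [], -1
    for anno in sorted(annos, key=lambda a: (a[1], a[2])):
        if cluster and anno[1] >= cluster_end:
            # The current cluster ends before this annotation starts.
            kept.append(pick(cluster, key=size))
            cluster = []
        cluster.append(anno)
        cluster_end = max(cluster_end, anno[2])
    if cluster:
        kept.append(pick(cluster, key=size))
    return kept


def flatten(text, annos, avoid_gaps: str, avoid_overlaps: str) -> List[str]:
    # Step 1: gap removal. Step 2: overlap suppression.
    contiguous = [flat for label, spans in annos
                  for flat in remove_gaps(label, spans, avoid_gaps)]
    kept = suppress_overlaps(contiguous, avoid_overlaps)
    return [f"T{i} {label} {b} {e} {text[b:e]}"
            for i, (label, b, e) in enumerate(kept, 1)]


text = "ES and somatic cells"
annos = [("CL:0002322", [(0, 2), (15, 20)]), ("CL:0002371", [(7, 20)])]
for gaps in ("split", "fill", "first", "last"):
    for ovl in ("keep-longer", "keep-shorter"):
        print(gaps, ovl, flatten(text, annos, gaps, ovl))
```

Swapping the two steps would change the results: for example, with avoid_gaps="first", the two annotations no longer overlap once the gap is removed, so both survive regardless of the avoid_overlaps value.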