Entity Flattening

Lenz Furrer edited this page May 24, 2021 · 1 revision

Entity annotations are references to a portion of text, typically a contiguous sequence of one or more words. However, there are complex cases like discontinuous spans (containing gaps) and overlapping annotations, which are not fully supported by all output formats.

In many languages, elliptic coordination is a frequent source of both discontinuous and overlapping annotations, as in this example:

ES and somatic cells
--             -----  CL:0002322 (embryonic stem cell)
       -------------  CL:0002371 (somatic cell)

Here, the annotation for “ES … cells” is discontinuous (composed of two separated spans) and partially overlaps with the annotation for “somatic cells”. This structure can be easily represented in stand-off formats like BioNLP:

T1  CL:0002322 0 2;15 20  ES ... cells
T2  CL:0002371 7 20       somatic cells

or PubAnnotation JSON:

{
  "text": "ES and somatic cells",
  "denotations": [
    {
      "id": "T1",
      "span": [
        { "begin": 0, "end": 2 },
        { "begin": 15, "end": 20 }
      ],
      "obj": "CL:0002322"
    },
    {
      "id": "T2",
      "span": { "begin": 7, "end": 20 },
      "obj": "CL:0002371"
    }
  ]
}
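
As a minimal sketch of how the two representations relate, the following standard-library Python reads the PubAnnotation JSON above and prints one BioNLP-style line per denotation. The helper names sub_spans and to_bionlp_offsets are hypothetical and not part of bconv; the only assumption about the data is what the example shows, namely that "span" is either a single object or a list of objects.

```python
import json

doc = json.loads("""
{
  "text": "ES and somatic cells",
  "denotations": [
    {"id": "T1",
     "span": [{"begin": 0, "end": 2}, {"begin": 15, "end": 20}],
     "obj": "CL:0002322"},
    {"id": "T2",
     "span": {"begin": 7, "end": 20},
     "obj": "CL:0002371"}
  ]
}
""")


def sub_spans(denotation):
    """Return a denotation's span(s) as a list of (begin, end) pairs."""
    span = denotation["span"]
    if isinstance(span, dict):  # contiguous annotation: a single object
        span = [span]
    return [(s["begin"], s["end"]) for s in span]


def to_bionlp_offsets(spans):
    """Format spans the BioNLP way: sub-spans separated by ';'."""
    return ";".join(f"{begin} {end}" for begin, end in spans)


for denotation in doc["denotations"]:
    spans = sub_spans(denotation)
    covered = " ... ".join(doc["text"][b:e] for b, e in spans)
    print(denotation["id"], denotation["obj"], to_bionlp_offsets(spans), covered)
```

Normalising every annotation to a list of sub-spans first means that contiguous and discontinuous annotations can be handled by the same code.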

Other formats are more limited. For example, PubTator does not allow discontinuous spans, and the CoNLL format, in addition, cannot properly represent overlapping annotations. Complex annotations therefore need to be simplified (flattened) before they are serialised to a restricted output format, i.e. converted into contiguous and/or non-overlapping spans.

Inspired by Sampo Pyysalo's standoff2conll converter, bconv provides two formatter options, avoid_gaps and avoid_overlaps, which control flattening of entity annotations. These two options are available for all output formats that support annotations, but their default values differ between formats. The options can be passed to the top-level functions bconv.dump() and bconv.dumps(), just like format-specific options:

bconv.dump(coll, fmt="pubtator", avoid_gaps="fill")

Convert Discontinuous Annotations with avoid_gaps

Discontinuous annotations (like “ES ... cells” in the example above) can be converted to contiguous spans with different strategies. The avoid_gaps option can take one of the following values:

  • None: Do not convert (keep gaps). This is the default value for formats that support discontinuous annotations.

  • "split": Create a separate annotation for each contiguous sub-span of the original annotation. This is the default value for formats that do not support discontinuous annotations. The above example in BioNLP becomes:

    T1  CL:0002322 0 2    ES
    T2  CL:0002322 15 20  cells
    T3  CL:0002371 7 20   somatic cells
    
  • "fill": Merge the sub-spans by including intervening tokens. Example:

    T1  CL:0002322 0 20   ES and somatic cells
    T2  CL:0002371 7 20   somatic cells
    
  • "first": Keep only the first sub-span. Example:

    T1  CL:0002322 0 2    ES
    T2  CL:0002371 7 20   somatic cells
    
  • "last": Keep only the last sub-span. Example:

    T1  CL:0002322 15 20   cells
    T2  CL:0002371 7 20    somatic cells
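
The four strategies can be sketched in plain Python as follows. This is an independent illustration, not bconv's implementation: the function flatten_gaps and the (concept, spans) tuple layout are assumptions made for this example, and each spans list is assumed to be sorted by offset.

```python
def flatten_gaps(annotations, strategy):
    """Apply an avoid_gaps-like strategy to (concept, spans) pairs.

    Each `spans` value is a sorted list of (start, end) offsets.
    """
    if strategy is None:  # keep gaps
        return list(annotations)
    result = []
    for concept, spans in annotations:
        if strategy == "split":
            # One annotation per contiguous sub-span.
            result.extend((concept, [span]) for span in spans)
        elif strategy == "fill":
            # From the first start to the last end, gaps included.
            result.append((concept, [(spans[0][0], spans[-1][1])]))
        elif strategy == "first":
            result.append((concept, [spans[0]]))
        elif strategy == "last":
            result.append((concept, [spans[-1]]))
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    return result


text = "ES and somatic cells"
annotations = [
    ("CL:0002322", [(0, 2), (15, 20)]),  # ES ... cells
    ("CL:0002371", [(7, 20)]),           # somatic cells
]

for strategy in ("split", "fill", "first", "last"):
    print(strategy)
    for concept, spans in flatten_gaps(annotations, strategy):
        start, end = spans[0]  # exactly one sub-span is left after flattening
        print(f"  {concept} {start} {end}  {text[start:end]}")
```

Run on the example annotations, the sketch yields the same spans as the four listings above.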
    

Suppress Annotation Overlap with avoid_overlaps

Overlapping annotations (as for the span “cells” in the example above) can be avoided by removing all but one of a cluster of partially co-located entity annotations. The avoid_overlaps option can take one of the following values:

  • None: Do not convert (keep overlaps). This is the default value for all formats except CoNLL.

  • "keep-longer": For any set of overlapping annotations, keep only the longest. This is the default value for CoNLL output. In the example above, only “somatic cells” is kept, as it is longer (13 characters) than “ES ... cells” (2+5=7 characters):

    T1  CL:0002371 7 20   somatic cells
    
  • "keep-shorter": For any set of overlapping annotations, keep only the shortest. Example:

    T1  CL:0002322 0 2;15 20  ES ... cells
    

Note: Using avoid_overlaps without avoid_gaps (as in the last example) is possible, but it is not typical usage, since no format supports discontinuous annotations without also supporting overlapping annotations.
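
The following sketch illustrates overlap suppression on annotations that may still contain gaps. It is not bconv's implementation. It assumes that two annotations overlap when they share at least one character, that clusters are formed transitively, that length is the number of covered characters (as in the 2+5=7 calculation above), and that ties go to the annotation listed first. All function names are hypothetical.

```python
def covered_length(spans):
    """Number of characters covered by all sub-spans together."""
    return sum(end - start for start, end in spans)


def share_text(spans_a, spans_b):
    """True if any sub-span of one annotation intersects one of the other."""
    return any(a_start < b_end and b_start < a_end
               for a_start, a_end in spans_a
               for b_start, b_end in spans_b)


def suppress_overlaps(annotations, strategy):
    """Keep one (concept, spans) pair per cluster of overlapping annotations."""
    if strategy is None:  # keep overlaps
        return list(annotations)
    pick = {"keep-longer": max, "keep-shorter": min}[strategy]
    clusters = []
    for anno in annotations:
        # Merge every existing cluster that the new annotation touches.
        touching, separate = [], []
        for cluster in clusters:
            hit = any(share_text(anno[1], other[1]) for other in cluster)
            (touching if hit else separate).append(cluster)
        merged = [a for cluster in touching for a in cluster] + [anno]
        clusters = separate + [merged]
    winners = [pick(c, key=lambda a: covered_length(a[1])) for c in clusters]
    return sorted(winners, key=lambda a: a[1][0])


annotations = [
    ("CL:0002322", [(0, 2), (15, 20)]),  # ES ... cells, 7 characters
    ("CL:0002371", [(7, 20)]),           # somatic cells, 13 characters
]

for strategy in ("keep-longer", "keep-shorter"):
    for concept, spans in suppress_overlaps(annotations, strategy):
        offsets = ";".join(f"{start} {end}" for start, end in spans)
        print(strategy, concept, offsets)
```

The pairwise comparison is quadratic in the number of annotations, which keeps the sketch short but would be slow for very large documents.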

Combining avoid_gaps and avoid_overlaps

If both options are set, gap removal is performed before overlap suppression. With the example from above, the effect of the two simplification steps can be illustrated for all combinations:

  • avoid_gaps="split", avoid_overlaps="keep-longer"

    T1  CL:0002322 0 2    ES
    T2  CL:0002371 7 20   somatic cells
    
  • avoid_gaps="split", avoid_overlaps="keep-shorter"

    T1  CL:0002322 0 2    ES
    T2  CL:0002322 15 20  cells
    
  • avoid_gaps="fill", avoid_overlaps="keep-longer"

    T1  CL:0002322 0 20   ES and somatic cells
    
  • avoid_gaps="fill", avoid_overlaps="keep-shorter"

    T1  CL:0002371 7 20   somatic cells
    
  • avoid_gaps="first", avoid_overlaps=("keep-longer"|"keep-shorter")

    T1  CL:0002322 0 2    ES
    T2  CL:0002371 7 20   somatic cells
    
  • avoid_gaps="last", avoid_overlaps="keep-longer"

    T1  CL:0002371 7 20    somatic cells
    
  • avoid_gaps="last", avoid_overlaps="keep-shorter"

    T1  CL:0002322 15 20   cells
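
The two-step order can be reproduced with a short, self-contained sketch that removes gaps first and then suppresses overlaps with a single left-to-right sweep, which is sufficient once every annotation is contiguous. The function names and the (concept, spans) layout are assumptions for illustration and do not mirror bconv's internals.

```python
from itertools import product


def avoid_gaps(annos, mode):
    """Step 1: make every (concept, spans) annotation contiguous."""
    out = []
    for concept, spans in annos:
        if mode == "split":
            out += [(concept, [span]) for span in spans]
        elif mode == "fill":
            out.append((concept, [(spans[0][0], spans[-1][1])]))
        else:  # "first" or "last"
            out.append((concept, [spans[0] if mode == "first" else spans[-1]]))
    return out


def avoid_overlaps(annos, mode):
    """Step 2: keep one annotation per group of overlapping contiguous spans."""
    pick = max if mode == "keep-longer" else min
    groups = []
    for anno in sorted(annos, key=lambda a: a[1][0]):
        start, end = anno[1][0]
        if groups and start < groups[-1]["end"]:
            # Starts inside the current group: same cluster.
            groups[-1]["members"].append(anno)
            groups[-1]["end"] = max(groups[-1]["end"], end)
        else:
            groups.append({"end": end, "members": [anno]})
    return [pick(g["members"], key=lambda a: a[1][0][1] - a[1][0][0])
            for g in groups]


def flatten(annos, gaps, overlaps):
    """Gap removal is performed before overlap suppression."""
    return avoid_overlaps(avoid_gaps(annos, gaps), overlaps)


text = "ES and somatic cells"
example = [("CL:0002322", [(0, 2), (15, 20)]), ("CL:0002371", [(7, 20)])]

for gaps, overlaps in product(("split", "fill", "first", "last"),
                              ("keep-longer", "keep-shorter")):
    print(f'avoid_gaps="{gaps}", avoid_overlaps="{overlaps}"')
    for concept, [(start, end)] in flatten(example, gaps, overlaps):
        print(f"  {concept} {start} {end}  {text[start:end]}")
```

For the example annotations, the sketch selects the same spans as the listings above for every combination.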