Lenz Furrer edited this page May 24, 2021 · 9 revisions

BioC

BioC is a lightweight format for structured text, text-bound annotations, and other metadata. It exists in two versions, XML and JSON, which are structurally equivalent.

BioC is particularly well suited for representing annotations over structured documents like scientific articles. Because annotations are encoded as XML/JSON elements rather than inline in the text, BioC keeps the benefits of stand-off annotation (nested and overlapping spans) while storing the annotations in the same file as the annotated text. Most elements in BioC can be mapped 1:1 to bconv's document model.

Examples

BioC XML

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE collection SYSTEM "BioC.dtd">
<collection>
  <source>BC5CDR</source>
  <date></date>
  <key></key>
<document>
  <id>354896</id>
  <passage>
    <infon key="type">Title</infon>
    <offset>0</offset>
    <text>Lidocaine-induced cardiac asystole.
</text>
    <annotation id="2">
      <infon key="type">Disease</infon>
      <infon key="cui">D006323</infon>
      <location offset="18" length="16"/>
      <text>cardiac asystole</text>
    </annotation>
  </passage>
</document>
</collection>

Full example
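
As a rough illustration of how the XML structure nests, the following sketch walks an abridged copy of the example with Python's standard xml.etree.ElementTree module. It does not use bconv; the helper names infons and iter_annotations are made up for this example.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the example above (DOCTYPE line omitted).
BIOC_XML = b"""<?xml version='1.0' encoding='UTF-8'?>
<collection>
  <source>BC5CDR</source>
  <date></date>
  <key></key>
  <document>
    <id>354896</id>
    <passage>
      <infon key="type">Title</infon>
      <offset>0</offset>
      <text>Lidocaine-induced cardiac asystole.
</text>
      <annotation id="2">
        <infon key="type">Disease</infon>
        <infon key="cui">D006323</infon>
        <location offset="18" length="16"/>
        <text>cardiac asystole</text>
      </annotation>
    </passage>
  </document>
</collection>
"""


def infons(element):
    """Collect the direct <infon> children of an element into a dict."""
    return {i.get("key"): i.text for i in element.findall("infon")}


def iter_annotations(xml_bytes):
    """Yield (document ID, passage type, annotation dict) triples."""
    root = ET.fromstring(xml_bytes)
    for doc in root.iter("document"):
        doc_id = doc.findtext("id")
        for passage in doc.iter("passage"):
            passage_type = infons(passage).get("type")
            # iter() also reaches annotations nested in <sentence> elements.
            for anno in passage.iter("annotation"):
                spans = [
                    (int(loc.get("offset")), int(loc.get("length")))
                    for loc in anno.findall("location")
                ]
                yield doc_id, passage_type, {
                    "id": anno.get("id"),
                    "infons": infons(anno),
                    "spans": spans,
                    "text": anno.findtext("text"),
                }


rows = list(iter_annotations(BIOC_XML))
for row in rows:
    print(row)
```

Each annotation keeps a list of spans, so an annotation with several location elements would simply yield more than one (offset, length) pair.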

BioC JSON

{
  "source": "BC5CDR",
  "date": "",
  "key": "",
  "infons": {},
  "documents": [
    {
      "id": "354896",
      "infons": {},
      "passages": [
        {
          "infons": {
            "type": "Title"
          },
          "offset": 0,
          "text": "Lidocaine-induced cardiac asystole.\n",
          "sentences": [],
          "annotations": [
            {
              "id": "2",
              "infons": {
                "type": "Disease",
                "cui": "D006323"
              },
              "text": "cardiac asystole",
              "locations": [
                {
                  "offset": 18,
                  "length": 16
                }
              ]
            }
          ],
          "relations": []
        }
      ],
      "relations": []
    }
  ]
}

Full example
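
To make the meaning of offset and length concrete, here is a minimal sketch that loads an abridged copy of the JSON example with the standard json module and looks up each annotation's location in the passage text. It assumes that location offsets count UTF-8 bytes from the start of the document, so the passage offset is subtracted before slicing. The function name located_spans is hypothetical and not part of bconv.

```python
import json

# Abridged copy of the JSON example above (raw string keeps the \n escape).
BIOC_JSON = r"""
{
  "source": "BC5CDR", "date": "", "key": "", "infons": {},
  "documents": [{
    "id": "354896", "infons": {}, "relations": [],
    "passages": [{
      "infons": {"type": "Title"},
      "offset": 0,
      "text": "Lidocaine-induced cardiac asystole.\n",
      "sentences": [], "relations": [],
      "annotations": [{
        "id": "2",
        "infons": {"type": "Disease", "cui": "D006323"},
        "text": "cardiac asystole",
        "locations": [{"offset": 18, "length": 16}]
      }]
    }]
  }]
}
"""


def located_spans(collection):
    """Yield (annotation ID, stated text, located strings) per annotation."""
    for doc in collection["documents"]:
        for passage in doc["passages"]:
            # Assumption: offsets are UTF-8 byte counts relative to the
            # document start, hence the subtraction of the passage offset.
            data = passage["text"].encode("utf-8")
            base = passage["offset"]
            for anno in passage["annotations"]:
                found = []
                for loc in anno["locations"]:
                    start = loc["offset"] - base
                    end = start + loc["length"]
                    found.append(data[start:end].decode("utf-8"))
                yield anno["id"], anno["text"], found


collection = json.loads(BIOC_JSON)
results = list(located_spans(collection))
for anno_id, stated, found in results:
    print(anno_id, repr(stated), found)
```

For this ASCII-only passage, byte and codepoint offsets coincide, so slicing the Python string directly would give the same result.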

Sources

BioC was originally designed as an XML format by Comeau et al. (2013). Later, a JSON version was defined. A collection of tools and other resources is available on SourceForge.

Notes

  • Document structure: BioC has three mandatory levels of structuring: collection, document, and passage (section), and an optional sentence level. Text content is defined either at the passage or the sentence level. Annotations are anchored at the same level as the text. When parsing a BioC collection with passage-level text, bconv performs sentence splitting and distributes any annotations to the corresponding sentences. During serialisation, the sentence_level option controls whether text and annotations are embedded at the passage or sentence level in the output.
  • Metadata: BioC defines three metadata fields at the top level (collection): source, date, and key. Additionally, arbitrary key-value pairs can be stored in infons elements at all levels of document structuring as well as annotations. All metadata are captured in bconv's representation and can be serialised back to BioC. The metadata option allows adding or overriding collection-level metadata.
  • Offsets: Character offsets in BioC are calculated by counting bytes of the UTF-8-encoded text. Since this is inconvenient for working with (Unicode) strings in Python and also incompatible with offset computation in other formats, bconv recalculates all offsets as codepoint offsets during parsing. For serialisation to BioC, the offsets are converted back to the byte level. This recalculation can be disabled with the byte_offsets option for both parsing and serialisation, which is useful e.g. for processing files that violate the specs in this respect. Note that parsing a BioC file with the wrong byte_offsets value is likely to fail or cause inconsistencies.
  • Discontinuous spans: For annotations with multiple spans, multiple location elements are used.
  • Relations/events: BioC supports relations with any arity (even 0 members are possible). Every member (node) references an entity or another relation by ID. Relation metadata (infons) and member roles are optional and can take arbitrary values. However, converting to other formats may entail certain restrictions; e.g. Brat requires relation members to have a role, whereas PubAnnotation only supports binary relations and needs a relation type.
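
The offset recalculation described in the notes can be sketched in plain Python as follows. This is an illustration of the idea only, not bconv's actual implementation; the function names and the sample text are made up for this example.

```python
def byte_to_codepoint(text, byte_offset):
    """Convert an offset into the UTF-8 encoding of text to a codepoint offset.

    Raises UnicodeDecodeError if the offset falls inside a multi-byte
    character, which is what happens when codepoint offsets are
    mistaken for byte offsets.
    """
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))


def codepoint_to_byte(text, cp_offset):
    """Convert a codepoint offset into text to a UTF-8 byte offset."""
    return len(text[:cp_offset].encode("utf-8"))


def span_bytes_to_codepoints(text, offset, length):
    """Convert a byte-based (offset, length) span to codepoints."""
    start = byte_to_codepoint(text, offset)
    end = byte_to_codepoint(text, offset + length)
    return start, end - start


def span_codepoints_to_bytes(text, offset, length):
    """Convert a codepoint-based (offset, length) span to bytes."""
    start = codepoint_to_byte(text, offset)
    end = codepoint_to_byte(text, offset + length)
    return start, end - start


# Hypothetical passage with two non-ASCII characters (2 bytes each in UTF-8).
text = "Sjögren's syndrome and β-blockers"

# A BioC file would locate "β-blockers" with byte-based values.
cp_span = span_bytes_to_codepoints(text, 24, 11)
start, length = cp_span
print(cp_span, text[start:start + length])

# Converting back for serialisation restores the byte-based values.
byte_span = span_codepoints_to_bytes(text, start, length)
print(byte_span)
```

Encoding the whole text for every lookup is quadratic over many annotations; a real converter would more likely build the offset mapping once per passage.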

Loaders

BioCXMLLoader

Properties

fmt                   bioc_xml
native type           Collection
lazy loading          yes
supports text         yes
supports annotations  yes
stream type           binary

Options

name          type  default  purpose
byte_offsets  bool  True     recalculate offsets from bytes to codepoints

BioCJSONLoader

Properties

fmt                   bioc_json
native type           Collection
lazy loading          no
supports text         yes
supports annotations  yes
stream type           text

Options

name          type  default  purpose
byte_offsets  bool  True     recalculate offsets from bytes to codepoints

Exporters

BioCXMLFormatter

Properties

fmt                   bioc_xml
supports text         yes
supports annotations  yes
stream type           binary

Options

name            type  default  purpose
byte_offsets    bool  True     recalculate offsets from codepoints to bytes
sentence_level  bool  False    anchor text at the sentence level
metadata        dict  None     add or override collection-level metadata
avoid_gaps      str   None     suppress discontinuous spans
avoid_overlaps  str   None     suppress annotation collisions

BioCJSONFormatter

Properties

fmt                   bioc_json
supports text         yes
supports annotations  yes
stream type           text

Options

name            type  default  purpose
byte_offsets    bool  True     recalculate offsets from codepoints to bytes
sentence_level  bool  False    anchor text at the sentence level
metadata        dict  None     add or override collection-level metadata
avoid_gaps      str   None     suppress discontinuous spans
avoid_overlaps  str   None     suppress annotation collisions