BioC
Lenz Furrer edited this page May 24, 2021 · 9 revisions
BioC is a light-weight format for structured text, text-bound annotations and other metadata. It exists in two versions, XML and JSON, which are structurally equivalent.
BioC is particularly well-suited for representing annotations over structured documents like scientific articles.
Because it builds on XML/JSON, BioC offers the benefits of stand-off annotations (nesting and overlaps) while keeping the annotations in a single file together with the annotated text.
Most elements in BioC can be mapped 1:1 to `bconv`'s document model.
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE collection SYSTEM "BioC.dtd">
<collection>
  <source>BC5CDR</source>
  <date></date>
  <key></key>
  <document>
    <id>354896</id>
    <passage>
      <infon key="type">Title</infon>
      <offset>0</offset>
      <text>Lidocaine-induced cardiac asystole.
</text>
      <annotation id="2">
        <infon key="type">Disease</infon>
        <infon key="cui">D006323</infon>
        <location offset="18" length="16"/>
        <text>cardiac asystole</text>
      </annotation>
    </passage>
  </document>
</collection>
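As a rough illustration of how the XML variant is laid out, the following sketch reads the sample collection with Python's standard `xml.etree.ElementTree` module and collects each annotation's infons and locations. The helper name `iter_annotations` is made up for this example and is not part of `bconv` or of any BioC tooling.

```python
import xml.etree.ElementTree as ET

# The sample collection, passed as bytes because the XML declaration names an encoding.
BIOC_XML = b"""<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE collection SYSTEM "BioC.dtd">
<collection>
<source>BC5CDR</source>
<date></date>
<key></key>
<document>
<id>354896</id>
<passage>
<infon key="type">Title</infon>
<offset>0</offset>
<text>Lidocaine-induced cardiac asystole.
</text>
<annotation id="2">
<infon key="type">Disease</infon>
<infon key="cui">D006323</infon>
<location offset="18" length="16"/>
<text>cardiac asystole</text>
</annotation>
</passage>
</document>
</collection>
"""


def iter_annotations(xml_bytes):
    """Yield (doc id, annotation id, infons, locations, text) per annotation.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_bytes)
    for doc in root.iter("document"):
        doc_id = doc.findtext("id")
        for anno in doc.iter("annotation"):
            # Infons are free-form key-value pairs.
            infons = {i.get("key"): i.text for i in anno.findall("infon")}
            # An annotation may carry several location elements.
            locations = [
                (int(loc.get("offset")), int(loc.get("length")))
                for loc in anno.findall("location")
            ]
            yield doc_id, anno.get("id"), infons, locations, anno.findtext("text")


if __name__ == "__main__":
    for record in iter_annotations(BIOC_XML):
        print(record)
```

Collecting the locations as a list keeps the sketch usable for annotations with more than one span.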
{
  "source": "BC5CDR",
  "date": "",
  "key": "",
  "infons": {},
  "documents": [
    {
      "id": "354896",
      "infons": {},
      "passages": [
        {
          "infons": {
            "type": "Title"
          },
          "offset": 0,
          "text": "Lidocaine-induced cardiac asystole.\n",
          "sentences": [],
          "annotations": [
            {
              "id": "2",
              "infons": {
                "type": "Disease",
                "cui": "D006323"
              },
              "text": "cardiac asystole",
              "locations": [
                {
                  "offset": 18,
                  "length": 16
                }
              ]
            }
          ],
          "relations": []
        }
      ],
      "relations": []
    }
  ]
}
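The JSON variant can be inspected with nothing more than the standard `json` module. The sketch below checks that each annotation's stated text matches the span its location points to. It assumes that location offsets are document-relative, so the passage offset is subtracted, and that the text is pure ASCII, so byte and codepoint offsets coincide. The helper name `covered_spans` is hypothetical.

```python
import json

# Condensed copy of the sample collection; a raw string keeps the JSON escape \n intact.
BIOC_JSON = r"""
{
  "source": "BC5CDR", "date": "", "key": "", "infons": {},
  "documents": [{
    "id": "354896", "infons": {}, "relations": [],
    "passages": [{
      "infons": {"type": "Title"},
      "offset": 0,
      "text": "Lidocaine-induced cardiac asystole.\n",
      "sentences": [], "relations": [],
      "annotations": [{
        "id": "2",
        "infons": {"type": "Disease", "cui": "D006323"},
        "text": "cardiac asystole",
        "locations": [{"offset": 18, "length": 16}]
      }]
    }]
  }]
}
"""


def covered_spans(collection):
    """Yield (annotation id, stated text, list of covered spans).

    Hypothetical helper; only valid for ASCII text, where slicing a
    Python string by byte offsets gives the right result.
    """
    for doc in collection["documents"]:
        for passage in doc["passages"]:
            base = passage["offset"]  # assumed: locations are document-relative
            text = passage["text"]
            for anno in passage["annotations"]:
                spans = []
                for loc in anno["locations"]:
                    start = loc["offset"] - base
                    spans.append(text[start:start + loc["length"]])
                yield anno["id"], anno["text"], spans


if __name__ == "__main__":
    for anno_id, stated, spans in covered_spans(json.loads(BIOC_JSON)):
        print(anno_id, stated, spans)
```

For text containing non-ASCII characters, the offsets would first have to be converted from bytes to codepoints.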
BioC was originally designed as an XML format by Comeau et al. (2013). Later, a JSON version was defined. A collection of tools and other resources is available on SourceForge.
- Document structure: BioC has three mandatory levels of structure (collection, document, and passage, i.e. section) and an optional sentence level. Text content is defined either at the passage or the sentence level. Annotations are anchored at the same level as the text. When parsing a BioC collection with passage-level text, `bconv` performs sentence splitting and distributes any annotations to the corresponding sentences. During serialisation, the `sentence_level` option controls whether text and annotations are embedded at the passage or sentence level in the output.
- Metadata: BioC defines three metadata fields at the top level (collection): source, date, and key. Additionally, arbitrary key-value pairs can be stored in `infons` elements at all levels of document structure as well as in annotations. All metadata are captured in `bconv`'s representation and can be serialised back to BioC. The `metadata` option allows adding or overriding collection-level metadata.
- Offsets: Character offsets in BioC are calculated by counting bytes of the UTF-8-encoded text. Since this is inconvenient for working with (Unicode) strings in Python and also incompatible with offset computation in other formats, `bconv` recalculates all offsets as codepoint offsets during parsing. For serialisation to BioC, the offsets are converted back to the byte level. This recalculation can be disabled with the `byte_offsets` option for both parsing and serialisation, which is useful e.g. for processing files that violate the specs in this respect. Note that parsing a BioC file with the wrong `byte_offsets` value is likely to fail or cause inconsistencies.
- Discontinuous spans: For annotations with multiple spans, multiple `location` elements are used.
- Relations/events: BioC supports relations of any arity (even 0 members are possible). Every member (node) references an entity or another relation by ID. Relation metadata (infons) and member roles are optional and can take arbitrary values. However, converting to other formats may entail certain restrictions; e.g. Brat requires relation members to have a role, whereas PubAnnotation only supports binary relations and needs a relation type.
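The difference between byte and codepoint offsets described in the Offsets item can be made concrete with a small sketch. The two helper functions below are hypothetical and only illustrate the arithmetic; they are not how `bconv` implements the recalculation.

```python
def byte_to_codepoint_span(text, byte_offset, byte_length):
    """Convert a UTF-8 byte span into a (start, end) codepoint slice of `text`.

    Raises UnicodeDecodeError if the span cuts through a multi-byte character.
    """
    data = text.encode("utf-8")
    start = len(data[:byte_offset].decode("utf-8"))
    covered = data[byte_offset:byte_offset + byte_length].decode("utf-8")
    return start, start + len(covered)


def codepoint_to_byte_span(text, start, end):
    """Convert a codepoint slice of `text` into a (byte offset, byte length) pair."""
    byte_offset = len(text[:start].encode("utf-8"))
    byte_length = len(text[start:end].encode("utf-8"))
    return byte_offset, byte_length


if __name__ == "__main__":
    # "α" occupies two bytes in UTF-8, so every later offset shifts by one.
    text = "α-synuclein aggregates"
    start, end = byte_to_codepoint_span(text, 13, 10)
    print(start, end, text[start:end])
    print(codepoint_to_byte_span(text, start, end))
```

In this sketch, a byte offset that falls inside a multi-byte character makes the decoding step raise an error, which mirrors the kind of failure to expect when a file is read with the wrong offset convention.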
Loading

fmt | bioc_xml |
---|---|
native type | Collection |
lazy loading | yes |
supports text | yes |
supports annotations | yes |
stream type | binary |

name | type | default | purpose |
---|---|---|---|
byte_offsets | bool | True | recalculate offsets from bytes to codepoints |

fmt | bioc_json |
---|---|
native type | Collection |
lazy loading | no |
supports text | yes |
supports annotations | yes |
stream type | text |

name | type | default | purpose |
---|---|---|---|
byte_offsets | bool | True | recalculate offsets from bytes to codepoints |
Dumping

fmt | bioc_xml |
---|---|
supports text | yes |
supports annotations | yes |
stream type | binary |

name | type | default | purpose |
---|---|---|---|
byte_offsets | bool | True | recalculate offsets from codepoints to bytes |
sentence_level | bool | False | anchor text at the sentence level |
metadata | dict | None | add or override collection-level metadata |
avoid_gaps | str | None | suppress discontinuous spans |
avoid_overlaps | str | None | suppress annotation collisions |

fmt | bioc_json |
---|---|
supports text | yes |
supports annotations | yes |
stream type | text |

name | type | default | purpose |
---|---|---|---|
byte_offsets | bool | True | recalculate offsets from codepoints to bytes |
sentence_level | bool | False | anchor text at the sentence level |
metadata | dict | None | add or override collection-level metadata |
avoid_gaps | str | None | suppress discontinuous spans |
avoid_overlaps | str | None | suppress annotation collisions |
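The options in these tables are passed as keyword arguments. The following sketch converts a BioC XML file to BioC JSON with sentence-level text. It assumes that `bconv.load(source, fmt=..., **options)` and `bconv.dump(content, dest, fmt=..., **options)` are the entry points and that `dump` accepts an open file object; this call shape should be checked against the installed `bconv` version. The function name and its parameters are made up for this example.

```python
def bioc_xml_to_json(src_path, dest_path, source_name=None):
    """Convert a BioC XML file to BioC JSON (hypothetical helper).

    Assumes the bconv.load / bconv.dump call shape described in the lead-in.
    """
    # Third-party package; imported here so the sketch can be read and
    # loaded even where bconv is not installed.
    import bconv

    # Parse with the default offset handling (bytes -> codepoints).
    coll = bconv.load(src_path, fmt="bioc_xml", byte_offsets=True)

    # Optionally override the collection-level "source" field.
    metadata = {"source": source_name} if source_name else None

    # bioc_json has stream type "text", so the file is opened in text mode.
    with open(dest_path, "w", encoding="utf8") as f:
        bconv.dump(
            coll,
            f,
            fmt="bioc_json",
            sentence_level=True,  # anchor text and annotations at sentences
            metadata=metadata,
        )
```

Writing `bioc_xml` instead would require a binary stream, in line with the stream types listed above.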