Lenz edited this page Feb 12, 2021 · 12 revisions

bconv Documentation

bconv has an API modeled after that of Python's pickle and json libraries. In particular, there is a pair of top-level functions load/dump which convert between a format-specific serialisation and a Python representation in memory.

>>> import bconv
>>> with open('path/to/example.json', encoding='utf8') as f:
...     coll = bconv.load(f, fmt='bioc_json')
>>> coll
<Collection with 37 documents at 0x7f1966e4b3c8>
>>> with open('path/to/example.conll', 'w', encoding='utf8') as f:
...     bconv.dump(coll, f, fmt='conll', tagset='IOBES', include_offsets=True)

Unlike json or pickle, bconv is not a parser/serialiser for a single format, but for a whole range of different formats. The loaded contents aren't arbitrary Python types, but instances of a specific document model appropriate for representing annotated text. From this internal representation, the contents can be exported into any of the supported formats.

Loader Functions load, loads, fetch

The functions bconv.load, bconv.loads and bconv.fetch allow loading a document or collection of documents from a specific format into a Python representation.

bconv.load(
    source: str|Path|IO,
    fmt: str = None,
    mode: str = 'native',
    id: int|str = None,
    **options) -> bconv.Collection|bconv.Document

Load a document or collection from a file.

  • source may be a path or a readable file-like object. If it is an open file, its type (text or binary) must match the expectation of the respective format (cf. the stream type format property).
  • fmt specifies the format to use. It must be one of the format names listed here or in bconv.LOADERS. This parameter is semi-optional: if source is a path and the file extension is the same as the format name (e.g. path/to/file.conll), fmt may be omitted.
  • mode determines the return type:
    • "native": a Document or Collection object, depending on the format (cf. the native type format property);
    • "collection": a Collection object wrapping all content;
    • "lazy": an iterator of Document objects, consumed lazily if possible.
  • id can be an arbitrary identifier for the loaded document or collection. It will be accessible as an attribute id on the returned object. Setting this parameter is not particularly important (for some formats it can even be inferred automatically from the content), but it may be convenient in some cases.
  • Any format-specific options can be passed as keyword arguments.
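
The extension-based fallback for fmt can be pictured with a small sketch. The helper guess_format and the set of format names passed to it are hypothetical; they only illustrate the rule described above, and bconv's own lookup may differ in detail.

```python
from pathlib import Path

def guess_format(source, known_formats, fmt=None):
    """Return fmt if given, else infer it from the file extension.

    Hypothetical helper, not part of bconv.
    """
    if fmt is not None:
        return fmt
    # 'path/to/file.conll' -> 'conll'
    ext = Path(source).suffix.lstrip('.')
    if ext in known_formats:
        return ext
    raise ValueError(f'cannot infer format from {str(source)!r}; pass fmt explicitly')

known = {'conll', 'bioc_json', 'txt'}   # illustrative subset of format names
print(guess_format('path/to/file.conll', known))                     # conll
print(guess_format('path/to/example.json', known, fmt='bioc_json'))  # bioc_json
```

In this sketch, a path such as path/to/example.json raises an error when fmt is omitted, because "json" is not itself a format name.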

bconv.loads(
    source: str|bytes,
    fmt: str,
    mode: str = 'native',
    id: int|str = None,
    **options) -> bconv.Collection|bconv.Document

Load a document or collection from a str or bytes object.

This is a mere convenience function that internally wraps source in an io.StringIO or io.BytesIO object and passes it to bconv.load. The type of source (str/bytes) must match the expectation of the respective format (cf. the stream type format property). The fmt parameter is not optional, as there is no obvious way to reliably guess the file format without a file name.
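
The wrapping step can be sketched as follows. The helper as_stream is hypothetical and only mirrors the description above; it is not bconv's actual code.

```python
import io

def as_stream(source):
    """Wrap str in io.StringIO and bytes in io.BytesIO.

    Hypothetical helper, not part of bconv.
    """
    if isinstance(source, str):
        return io.StringIO(source)
    if isinstance(source, (bytes, bytearray)):
        return io.BytesIO(source)
    raise TypeError(f'expected str or bytes, got {type(source).__name__}')

text_stream = as_stream('some text')   # suits a format that expects a text stream
byte_stream = as_stream(b'<xml/>')     # suits a format that expects a binary stream
```

The resulting object behaves like an open file of the matching stream type, which is why passing the wrong one of str/bytes fails in the same way as opening a file in the wrong mode.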

bconv.fetch(
    query: str|Sequence[str],
    fmt: str,
    mode: str = 'native',
    id: int|str = None,
    **options) -> bconv.Collection

Load a collection from a remote service.

Currently, PubMed abstracts and PMC articles can be fetched from NCBI's efetch service. Note: Even though it is not technically enforced, requests to NCBI should include the caller's e-mail address (specify it as a keyword parameter email).

  • query specifies the documents to retrieve. It is a sequence or comma-separated list of PubMed or PMC IDs. Note that nonexistent IDs are silently skipped; bconv does not attempt to check the returned collection for completeness.
  • fmt may be "pubmed" or "pmc".
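
Since skipped IDs are not reported, callers who need completeness have to check for it themselves. A minimal sketch, assuming that the IDs of the returned documents have already been collected into a list and that they correspond to the requested IDs; both helper names and the ID values are made up for illustration.

```python
def normalise_query(query):
    """Turn a comma-separated string or a sequence of IDs into a list of ID strings."""
    if isinstance(query, str):
        query = query.split(',')
    return [str(q).strip() for q in query if str(q).strip()]

def missing_ids(query, returned_ids):
    """Return the requested IDs that are absent from returned_ids, in request order."""
    returned = {str(i) for i in returned_ids}
    return [q for q in normalise_query(query) if q not in returned]

requested = '12345678, 23456789,34567890'   # made-up IDs for illustration
print(missing_ids(requested, ['12345678', '34567890']))   # ['23456789']
```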

Exporter Functions dump, dumps

The functions bconv.dump and bconv.dumps serialise a loaded document or collection to disk or memory.

bconv.dump(
    content: bconv.Collection|bconv.Document,
    dest: str|Path|IO,
    fmt: str = None,
    **options)

Serialise a document or collection to disk.

  • content is a Document or Collection object to be serialised (see the note below for limitations to the choice of the type).
  • dest is the destination for writing the data, given as a path or a writable open file. If it is an open file, its type (text or binary) must match the expectation of the respective format (cf. the stream type format property). If dest is a path to an existing directory, a file name is constructed based on content.id or (if it is None/empty) content.filename.
  • fmt specifies the format to use. It must be one of the format names listed here or in bconv.EXPORTERS. This parameter is semi-optional: if dest is a path and the file extension is the same as the format name (e.g. path/to/file.bionlp), fmt may be omitted.
  • Any format-specific options can be passed as keyword arguments.
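
The handling of a directory destination can be pictured roughly as below. The helper resolve_dest is hypothetical, and using the format name as the file extension is an assumption; the text above only states that the name is based on content.id or content.filename.

```python
from pathlib import Path

def resolve_dest(dest, content_id, filename, fmt):
    """Pick the output path for a path-like dest.

    Hypothetical helper, not part of bconv.
    """
    dest = Path(dest)
    if not dest.is_dir():
        return dest                       # an ordinary file path is used as given
    # Prefer the id; fall back to the filename when the id is None or empty.
    basename = content_id if content_id not in (None, '') else filename
    if not basename:
        raise ValueError('cannot construct a file name: no id or filename')
    return dest / f'{basename}.{fmt}'     # assumed naming scheme
```

For an existing directory out/ and a document with id 42, this sketch yields out/42.conll for fmt='conll'.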

bconv.dumps(
    content: bconv.Collection|bconv.Document,
    fmt: str,
    **options) -> str|bytes

Return the serialisation of a document or collection as a str or bytes object.

The parameters are the same as for bconv.dump. The fmt parameter is not optional, though, as there is no file name from which to guess the format. The type of the return value depends on the format (cf. the stream type format property).
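
Code that handles several formats may therefore receive either type. A minimal sketch of writing such a value to disk without knowing the stream type in advance; write_serialised is a hypothetical helper, and the UTF-8 encoding for text is an assumption.

```python
from pathlib import Path

def write_serialised(data, path):
    """Write a str or bytes serialisation to path and return the path.

    Hypothetical helper, not part of bconv.
    """
    path = Path(path)
    if isinstance(data, bytes):
        path.write_bytes(data)                   # binary formats
    else:
        path.write_text(data, encoding='utf8')   # text formats (assumed UTF-8)
    return path
```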

Note: bconv.dump and bconv.dumps accept both Document and Collection objects. However, not all formats are equally well suited to represent both levels. For example, when serialising a collection to txt plain-text with brat stand-off annotations, the document boundaries are lost.

Loader and Exporter Classes

bconv's functions load, loads, fetch, dump and dumps are high-level wrappers around a hierarchy of loader and exporter classes, which can also be instantiated directly.