Home
bconv has an API modeled after that of Python's pickle and json libraries.
In particular, there is a pair of top-level functions load/dump which convert between a format-specific serialisation and a Python representation in memory.
>>> import bconv
>>> with open('path/to/example.json', encoding='utf8') as f:
...     coll = bconv.load(f, fmt='bioc_json')
>>> coll
<Collection with 37 documents at 0x7f1966e4b3c8>
>>> with open('path/to/example.conll', 'w', encoding='utf8') as f:
...     bconv.dump(coll, f, fmt='conll', tagset='IOBES', include_offsets=True)
Unlike json or pickle, bconv is not a parser/serialiser for a single format, but for a whole range of different formats.
The loaded contents aren't arbitrary Python types, but instances of a specific document model appropriate for representing annotated text.
From this internal representation, the contents can be exported into any of the supported formats.
The functions bconv.load, bconv.loads and bconv.fetch allow loading a document or collection of documents from a specific format into a Python representation.
bconv.load(
source: str|Path|IO,
fmt: str = None,
mode: str = 'native',
id: int|str = None,
**options) -> bconv.Collection|bconv.Document
Load a document or collection from a file.
- source may be a path or a readable file-like object. If it is an open file, its type (text or binary) must match the expectation of the respective format (cf. the stream type format property).
- fmt specifies the format to use. It must be one of the format names listed here or in bconv.LOADERS. This parameter is semi-optional: if source is a path and the file extension is the same as the format name (e.g. path/to/file.conll), fmt may be omitted.
- mode determines the return type:
  - "native": a Document or Collection object, depending on the format (cf. the native type format property);
  - "collection": a Collection object wrapping all content;
  - "lazy": an iterator of Document objects, consumed lazily if possible.
- id can be an arbitrary identifier for the loaded document or collection. It will be accessible as an attribute id on the returned object. It is not particularly important to set this parameter (and it can even be automatically inferred from the content for some formats), but it may be convenient in some cases.
- Any format-specific options can be passed as keyword arguments.
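The "semi-optional" behaviour of fmt can be pictured with a small stand-alone sketch. It does not reproduce bconv's own code: the helper infer_format and the set KNOWN_FORMATS are hypothetical names invented for illustration, with the set standing in for the format names found in bconv.LOADERS.

```python
from pathlib import Path

# Hypothetical stand-in for the format names registered in bconv.LOADERS.
KNOWN_FORMATS = {"bioc_json", "conll", "txt"}


def infer_format(source, fmt=None, known=KNOWN_FORMATS):
    """Return fmt if given; otherwise guess it from a path's file extension."""
    if fmt is not None:
        return fmt
    # Only paths carry a file name; open files and other objects do not.
    if isinstance(source, (str, Path)):
        ext = Path(source).suffix.lstrip(".")
        if ext in known:
            return ext
    raise ValueError("fmt is required: no format can be inferred from this source")
```

Under these assumptions, path/to/file.conll resolves to the format conll without an explicit fmt, whereas path/to/example.json still needs fmt='bioc_json', because the extension json is not itself a format name.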
bconv.loads(
source: str|bytes,
fmt: str,
mode: str = 'native',
id: int|str = None,
**options) -> bconv.Collection|bconv.Document
Load a document or collection from a str or bytes object.
This is a mere convenience function that internally wraps source in an io.StringIO or io.BytesIO object and passes it to bconv.load. The type of source (str/bytes) must match the expectation of the respective format (cf. the stream type format property). The fmt parameter is not optional, as there is no obvious way to reliably guess the file format without a file name.
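The wrapping step described for bconv.loads can be sketched with the standard library alone. The helper name wrap_source is hypothetical and the snippet is only an illustration of the idea, not bconv's actual implementation.

```python
import io


def wrap_source(source):
    """Wrap in-memory data in a file-like object of the matching stream type."""
    if isinstance(source, str):
        return io.StringIO(source)   # text stream, for text-based formats
    if isinstance(source, (bytes, bytearray)):
        return io.BytesIO(source)    # binary stream, for binary formats
    raise TypeError(f"expected str or bytes, got {type(source).__name__}")
```

The resulting object can then be handed to a function that expects an open file, which is why the str/bytes type of source has to agree with the stream type of the chosen format.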
bconv.fetch(
query: str|Sequence[str],
fmt: str,
mode: str = 'native',
id: int|str = None,
**options) -> bconv.Collection
Load a collection from a remote service.
Currently, PubMed abstracts and PMC articles can be fetched from NCBI's efetch service. Note: Even though it is not technically enforced, requests to NCBI should include the caller's e-mail address (specify it as a keyword parameter
- query specifies the documents to retrieve. It is a sequence or comma-separated list of PubMed or PMC IDs. Note that non-existing IDs are silently skipped; bconv does not attempt to check the returned collection for completeness.
- fmt may be "pubmed" or "pmc".
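Since skipped IDs are not reported, a caller who needs a complete result has to compare the request with the response. The following is a minimal sketch of such a check. Both helpers, normalise_query and missing_ids, are hypothetical and not part of bconv; how the returned IDs are collected from the fetched collection is left to the caller.

```python
def normalise_query(query):
    """Turn a comma-separated string or a sequence of IDs into a list of ID strings."""
    if isinstance(query, str):
        query = query.split(",")
    # Strip surrounding whitespace and drop empty entries (e.g. trailing commas).
    return [str(q).strip() for q in query if str(q).strip()]


def missing_ids(query, returned_ids):
    """Return the requested IDs that do not appear among the returned ones."""
    returned = {str(i) for i in returned_ids}
    return [q for q in normalise_query(query) if q not in returned]
```

One way to use it would be to gather the identifiers of the fetched documents (for example from an id attribute on each document, which is an assumption here) and pass them as returned_ids. An empty result means every requested ID was retrieved.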
The functions bconv.dump and bconv.dumps serialise a loaded document or collection to disk or memory.
bconv.dump(
content: bconv.Collection|bconv.Document,
dest: str|Path|IO,
fmt: str = None,
**options)
Serialise a document or collection to disk.
- content is a Document or Collection object to be serialised (see the note below for limitations to the choice of the type).
- dest is the destination for writing the data, given as a path or a writable open file. If it is an open file, its type (text or binary) must match the expectation of the respective format (cf. the stream type format property). If dest is a path to an existing directory, a file name is constructed based on content.id or (if it is None/empty) content.filename.
- fmt specifies the format to use. It must be one of the format names listed here or in bconv.EXPORTERS. This parameter is semi-optional: if dest is a path and the file extension is the same as the format name (e.g. path/to/file.bionlp), fmt may be omitted.
- Any format-specific options can be passed as keyword arguments.
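The directory case for dest can be illustrated roughly as follows. This is a sketch under stated assumptions, not bconv's code: the helper resolve_dest is hypothetical, and appending the format name as the file extension is an assumption made here for illustration, since the text above does not say how the extension is chosen.

```python
from pathlib import Path


def resolve_dest(dest, content_id, filename, fmt):
    """Pick the output path: dest itself, or a constructed name inside a directory."""
    dest = Path(dest)
    if not dest.is_dir():
        return dest  # an explicit file path is used unchanged
    # Prefer the ID; fall back to the file name if the ID is None or empty.
    stem = content_id if content_id not in (None, "") else filename
    if stem in (None, ""):
        raise ValueError("cannot construct a file name without an id or filename")
    # Assumption: the format name doubles as the file extension.
    return dest / f"{stem}.{fmt}"
```

With an existing directory out/ and a document whose id is doc1, this sketch would yield out/doc1.conll for fmt='conll'. A path that is not an existing directory is returned as given.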
bconv.dumps(
content: bconv.Collection|bconv.Document,
fmt: str,
**options) -> str|bytes
Return the serialisation of a document or collection as a str or bytes object.
The parameters are the same as for bconv.dump. The fmt parameter is not optional though, as there is no file name to guess the format from. The type of the return value depends on the format (cf. the stream type format property).
Note: bconv.dump and bconv.dumps accept both Document and Collection objects.
However, not all formats are equally well suited to represent both levels.
For example, when serialising a collection to txt plain-text with brat stand-off annotations, the document boundaries are lost.
bconv's functions load, loads, fetch, dump and dumps are high-level wrappers around a hierarchy of loader and exporter classes, which can also be instantiated directly.