Lenz edited this page Mar 15, 2021 · 3 revisions

Plain Text

Plain text is a limited, yet universal and robust format for storing textual content.

Many of the formats supported by bconv are technically plain-text files (e.g. PubTator, BioC JSON, CoNLL), but use some mark-up to denote document structure, metadata, or annotations. The txt format, however, holds only the contents of a document in plain text, precluding the encoding of metadata and annotations, and supporting document structure only to a very limited extent.

The txt.json format is a simple wrapper for the txt format. It allows representing multiple documents in a single file and supports a document ID.

Examples

txt (single-doc)

Lidocaine-induced cardiac asystole.

Intravenous administration of a single 50-mg bolus of lidocaine in a 67-year-old man ...

Full example

txt.json (multi-doc)

[
  {
    "id": "354896",
    "text": "Lidocaine-induced cardiac asystole.\n\nIntravenous administration of ..."
  }
]

Full example
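
The wrapper structure shown above can be produced with nothing more than a JSON library. The following is a minimal sketch, not bconv's own code: the helper name wrap_txt_json is hypothetical, and only the "id" and "text" keys are taken from the example.

```python
import json


def wrap_txt_json(docs):
    """Serialise (doc_id, text) pairs as a txt.json-style array.

    Hypothetical helper for illustration; not part of bconv.
    """
    records = [{"id": doc_id, "text": text} for doc_id, text in docs]
    # ensure_ascii=False keeps non-ASCII characters readable in the output.
    return json.dumps(records, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    docs = [
        ("354896", "Lidocaine-induced cardiac asystole.\n\nIntravenous administration of ..."),
    ]
    print(wrap_txt_json(docs))
```

Because each document is one object in the array, further documents can be added by appending more (doc_id, text) pairs to the input list.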

Sources

The Wikipedia articles on text files and plain text as a format provide information and further reading about many aspects of the format.

Notes

  • Document structure: Plain-text files are interpreted as a single document. Blank lines are interpreted as section boundaries, unless the single_section option is set, in which case the entire text is read as a single section. With the sentence_split option, line breaks are interpreted as sentence boundaries when loading (in this case, bconv attempts no further sentence splitting) and inserted as sentence boundaries when exporting. Multiple documents per file can only be represented in the txt.json format.
  • Metadata: The filename (if available) is used as a fallback for inferring the document ID, if none was provided to the load() call.
  • Whitespace: Line breaks may be indicative of document structure, depending on the options single_section and sentence_split, as described above. When serialising text alongside stand-off annotations (e.g. bionlp), do not use the sentence_split option, as it is not guaranteed to preserve character offsets.
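
The structure rules in the notes above can be approximated in a few lines. This is an illustrative sketch only, not bconv's implementation: the function split_structure is hypothetical, and its two keyword arguments merely mirror the option names single_section and sentence_split.

```python
import re


def split_structure(text, single_section=False, sentence_split=False):
    """Return a list of sections, each a list of text units.

    Illustrative approximation of the rules described above;
    not bconv's implementation.
    """
    if single_section:
        sections = [text]
    else:
        # One or more blank lines (possibly containing spaces) separate sections.
        sections = re.split(r"\n\s*\n", text)
    result = []
    for section in sections:
        if sentence_split:
            # Each non-empty line is taken as one given sentence.
            units = [line for line in section.split("\n") if line.strip()]
        else:
            # Without sentence_split, the section is kept as one unit.
            units = [section] if section.strip() else []
        if units:
            result.append(units)
    return result


if __name__ == "__main__":
    text = "Title.\n\nFirst sentence.\nSecond sentence."
    print(split_structure(text))
    print(split_structure(text, sentence_split=True))
    print(split_structure(text, single_section=True))
```

In this sketch, sentence_split only decides whether line breaks are treated as boundaries; it does not try to detect sentence ends inside a line.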

Loaders

TXTLoader

Properties

fmt txt
native type Document
lazy loading no
supports text yes
supports annotations no
stream type text

Options

name type default purpose
single_section bool False Conflate all content into a single section
sentence_split bool False Interpret line breaks as given sentence boundaries

TXTJSONLoader

Properties

fmt txt.json
native type Collection
lazy loading no
supports text yes
supports annotations no
stream type text

Options

name type default purpose
single_section bool False Conflate all content into a single section
sentence_split bool False Interpret line breaks as given sentence boundaries

Exporters

TXTFormatter

Properties

fmt txt
supports text yes
supports annotations no
stream type text

Options

name type default purpose
sentence_split bool False Separate sentences with line breaks

TXTJSONFormatter

Properties

fmt txt.json
supports text yes
supports annotations no
stream type text

Options

name type default purpose
sentence_split bool False Separate sentences with line breaks
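
Why separating sentences with line breaks can disturb stand-off annotations is easiest to see with a toy example. The sketch below assumes a naive exporter that joins sentences with a single line break; it illustrates the general effect and does not describe bconv's actual behaviour. The helper join_sentences is hypothetical.

```python
def join_sentences(sentences):
    """Join sentences with one line break each (hypothetical, naive exporter)."""
    return "\n".join(s.strip() for s in sentences)


# Two spaces separate the sentences in the original text.
original = "Lidocaine was given.  Asystole followed."

# Stand-off annotation for "Asystole", as character offsets into `original`.
start = original.index("Asystole")
end = start + len("Asystole")

# Sentences as a splitter might return them, re-joined with line breaks.
exported = join_sentences(["Lidocaine was given.", "Asystole followed."])

if __name__ == "__main__":
    print(repr(original[start:end]))  # the annotated span in the original text
    print(repr(exported[start:end]))  # the same offsets applied to the exported text
```

Here the two-character separator is replaced by a single line break, so every offset after the first sentence is shifted by one character and the annotation no longer covers the intended word.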