Lenz Furrer edited this page May 24, 2021 · 4 revisions

CSV and TSV

Comma- or tab-separated values (CSV/TSV) is a straightforward tabular format for annotations. The data are organised into rows delimited by line breaks and fields separated by a specific character, such as a comma (CSV) or a tab (TSV). For each annotated entity, there is a separate row with information about its position in the document (section type, character offsets) and metadata (entity type, identifier etc.). CSV is well understood by spreadsheet applications like MS Excel, whereas TSV plays well with Unix command-line tools like cut and sort.

Besides the tsv format for annotations only, there is a text_tsv variant which includes the remaining text of the documents. It features a row for every text token that is not part of an annotation, including positional information, but with empty fields for the entity metadata. In case of overlapping entities and sub-word annotations, (parts of) tokens will be repeated. This is the main difference compared to the CoNLL format, where the linearity of the input sequence is preserved, at the cost of annotations potentially being simplified.

The fields are as follows:

doc_id section sent_id entity_id start end term

Additional fields can be added by specifying keys for the entity metadata through the fields parameter.
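As a rough illustration of how the extra columns line up, the sketch below builds one output row from the seven standard fields followed by the requested metadata keys. It does not call bconv: `build_row`, the plain-dict entity, and the metadata keys `type` and `note` (with made-up values) are hypothetical stand-ins, and writing an empty string for a missing key is an assumption.

```python
STANDARD_FIELDS = ("doc_id", "section", "sent_id", "entity_id", "start", "end", "term")


def build_row(entity, fields=()):
    """Return the standard fields followed by the selected metadata values."""
    row = [entity[name] for name in STANDARD_FIELDS]
    metadata = entity.get("metadata", {})
    # Assumption: a key that is missing from the metadata yields an empty field.
    row.extend(metadata.get(key, "") for key in fields)
    return row


# Hypothetical entity; the metadata keys and values are invented for illustration.
entity = {
    "doc_id": "354896", "section": "Title", "sent_id": 1, "entity_id": 1,
    "start": 0, "end": 9, "term": "Lidocaine",
    "metadata": {"type": "chemical"},
}

print(build_row(entity))                           # standard fields only
print(build_row(entity, fields=("type", "note")))  # two extra columns
```

With an empty `fields` sequence the row has exactly seven columns; each requested key adds one more column at the end.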

For formatting the comma-/tab-separated values, bconv relies on the standard-library csv module. The csv and text_csv formats use the default settings, i.e. commas for field delimiting and double quotes for protecting commas inside field values. The tsv and text_tsv formats set the following formatting parameters: lineterminator="\n", delimiter="\t", quotechar=None, i.e. tab delimiters and no protection/escaping mechanism for potential separator characters inside values. Instead, tab and newline characters inside annotations are (irreversibly) replaced with space characters. All formats accept additional formatting parameters as keyword arguments, which are passed directly to the csv.writer() constructor.
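The difference between the two parameter sets can be reproduced with the csv module alone. The sketch below is not bconv's code: the `sanitize` helper is a hypothetical stand-in for the replacement step, the term value with an embedded comma and tab is made up, and `quoting=csv.QUOTE_NONE` is passed explicitly as an assumption so that the writer does not depend on how a missing quote character is handled.

```python
import csv
import io

# Made-up term containing both a comma and a tab, to exercise both formats.
row = ["354896", "Abstract", "2", "3", "90", "99", "lidocaine,\tintravenous"]

# csv / text_csv: default settings, so the comma inside the value triggers quoting.
csv_buf = io.StringIO()
csv.writer(csv_buf).writerow(row)


def sanitize(value):
    """Replace tabs and newlines with spaces (irreversible)."""
    return value.replace("\t", " ").replace("\n", " ")


# tsv / text_tsv: tab delimiter, "\n" line ends, no quoting or escaping.
tsv_buf = io.StringIO()
tsv_writer = csv.writer(
    tsv_buf,
    lineterminator="\n",
    delimiter="\t",
    quotechar=None,
    quoting=csv.QUOTE_NONE,  # assumption: spelled out here for clarity
)
tsv_writer.writerow([sanitize(value) for value in row])

print(repr(csv_buf.getvalue()))
print(repr(tsv_buf.getvalue()))
```

Without the sanitising step, the TSV writer would raise a csv.Error for the embedded tab, since no escape character is configured.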

Examples

csv (entities only)

354896,Title,1,1,0,9,Lidocaine
354896,Title,1,2,18,34,cardiac asystole
354896,Abstract,2,3,90,99,lidocaine
354896,Abstract,2,4,142,152,depression
354896,Abstract,3,5,331,347,bradyarrhythmias
354896,Abstract,3,6,409,418,lidocaine

tsv (entities only)

354896	Title	1	1	0	9	Lidocaine
354896	Title	1	2	18	34	cardiac asystole
354896	Abstract	2	3	90	99	lidocaine
354896	Abstract	2	4	142	152	depression
354896	Abstract	3	5	331	347	bradyarrhythmias
354896	Abstract	3	6	409	418	lidocaine
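Rows like these can be read back with the standard csv module. The following is a minimal sketch, assuming the seven default fields and no header line; the `Row` tuple and the `read_rows` helper are illustrative names, not part of bconv.

```python
import csv
import io
from collections import namedtuple

Row = namedtuple("Row", "doc_id section sent_id entity_id start end term")

# The first three rows of the tsv example, as they would appear in a file.
tsv_data = (
    "354896\tTitle\t1\t1\t0\t9\tLidocaine\n"
    "354896\tTitle\t1\t2\t18\t34\tcardiac asystole\n"
    "354896\tAbstract\t2\t3\t90\t99\tlidocaine\n"
)


def read_rows(stream):
    """Yield one Row per line, with the offsets converted to integers."""
    reader = csv.reader(stream, delimiter="\t", quotechar=None, quoting=csv.QUOTE_NONE)
    for fields in reader:
        row = Row(*fields[:7])  # ignore any additional metadata columns
        yield row._replace(start=int(row.start), end=int(row.end))


rows = list(read_rows(io.StringIO(tsv_data)))
for row in rows:
    print(row.section, row.start, row.end, row.term)
```

Disabling quoting on the reader side mirrors the writer settings, so a double quote inside a term is read as an ordinary character.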

text_tsv (all tokens)

354896	Title	1	1	0	9	Lidocaine
354896	Title	1		9	10	-
354896	Title	1		10	17	induced
354896	Title	1	2	18	34	cardiac asystole
354896	Title	1		34	35	.
354896	Abstract	2		36	47	Intravenous
354896	Abstract	2		48	62	administration
354896	Abstract	2		63	65	of
...
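Because every row carries character offsets, the text covered by a text_tsv listing can be pieced together again. The sketch below uses the Title rows from the example above; `rebuild_text` is a hypothetical helper, and filling the uncovered positions between tokens with spaces is an assumption (the original whitespace is not stored in the rows).

```python
# Title rows from the text_tsv example: entity rows have an entity_id, text rows do not.
rows = [
    ("354896", "Title", "1", "1", 0, 9, "Lidocaine"),
    ("354896", "Title", "1", "", 9, 10, "-"),
    ("354896", "Title", "1", "", 10, 17, "induced"),
    ("354896", "Title", "1", "2", 18, 34, "cardiac asystole"),
    ("354896", "Title", "1", "", 34, 35, "."),
]


def rebuild_text(rows):
    """Place every term at its offsets; pad uncovered positions with spaces."""
    chars = []
    for *_, start, end, term in sorted(rows, key=lambda r: (r[4], r[5])):
        if start > len(chars):
            # Assumption: gaps between tokens are single-width whitespace.
            chars.extend(" " * (start - len(chars)))
        # Repeated (overlapping) tokens simply rewrite the same positions.
        chars[start:end] = term
    return "".join(chars)


entity_rows = [r for r in rows if r[3]]  # rows with a non-empty entity_id
print(rebuild_text(rows))
print(len(entity_rows), "entity rows")
```

The same empty-versus-filled test on the fourth column separates plain text tokens from annotated ones.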

Full example

Sources

The general CSV format is described in an RFC memo (RFC 4180). A very short definition of the general TSV format is given by the IANA. The specific selection of fields used by bconv does not follow any standard.

Notes

  • Document structure: Due to the document ID in the first field, the CSV/TSV format can be used for both single documents and multi-doc collections. The fields "section" (section type) and "sent_id" (sentence counter) provide information on document-internal structuring.
  • Metadata: Only the document IDs and the section type (if available) are represented in the format.
  • Entity annotations: By default, only positional information and an annotation ID are given. Through the fields parameter, additional entity information can be written to the output.
  • Whitespace: If annotations span more than one word, the "term" field may contain whitespace. Since line breaks and tab characters are not allowed in the TSV format, they are replaced with spaces. In the CSV format, they are protected with quoting or escaping as appropriate.
  • Offsets: Character offsets are included in every row.
  • Discontinuous spans: Annotations with multiple spans are subject to entity flattening. By default, sub-spans are split into separate rows, but with a shared entity ID (fourth column).
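The default handling of discontinuous spans can be pictured as follows. This is only a sketch of the idea, not bconv's implementation: `split_spans`, the sample sentence, and its offsets are invented for illustration, and taking each row's term from the text of its own sub-span is an assumption.

```python
def split_spans(doc_id, section, sent_id, entity_id, spans, text):
    """Return one row per sub-span, all sharing the same entity ID."""
    return [
        (doc_id, section, sent_id, entity_id, start, end, text[start:end])
        for start, end in spans
    ]


# Invented example: the entity "left ... ventricle" skips the words in between.
text = "left and right ventricle"
rows = split_spans("doc1", "Title", 1, 1, [(0, 4), (15, 24)], text)

for row in rows:
    print(*row, sep="\t")
```

Rows that belong to the same annotation can later be regrouped by the value in the fourth column.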

Exporters

CSVFormatter

Properties

fmt csv
supports text no
supports annotations yes
stream type text

Options

name            type            default  purpose
fields          Sequence[str]   ()       keys in Entity.metadata for the additional fields
include_header  bool            False    add a header line
avoid_gaps      str             'split'  suppress discontinuous spans
avoid_overlaps  str             None     suppress annotation collisions
**fmtparams     Dict[str, str]  {}       keyword args directly passed to csv.writer()

TSVFormatter

Properties

fmt tsv
supports text no
supports annotations yes
stream type text

Options

name            type            default  purpose
fields          Sequence[str]   ()       keys in Entity.metadata for the additional fields
include_header  bool            False    add a header line
avoid_gaps      str             'split'  suppress discontinuous spans
avoid_overlaps  str             None     suppress annotation collisions
**fmtparams     Dict[str, str]  {}       keyword args directly passed to csv.writer()

TextCSVFormatter

Properties

fmt text_csv
supports text yes
supports annotations yes
stream type text

Options

name            type            default  purpose
fields          Sequence[str]   ()       keys in Entity.metadata for the additional fields
include_header  bool            False    add a header line
avoid_gaps      str             'split'  suppress discontinuous spans
avoid_overlaps  str             None     suppress annotation collisions
**fmtparams     Dict[str, str]  {}       keyword args directly passed to csv.writer()

TextTSVFormatter

Properties

fmt text_tsv
supports text yes
supports annotations yes
stream type text

Options

name            type            default  purpose
fields          Sequence[str]   ()       keys in Entity.metadata for the additional fields
include_header  bool            False    add a header line
avoid_gaps      str             'split'  suppress discontinuous spans
avoid_overlaps  str             None     suppress annotation collisions
**fmtparams     Dict[str, str]  {}       keyword args directly passed to csv.writer()