CSV

CSV and TSV

Comma-separated or tab-separated values (CSV/TSV) is a straightforward tabular format for annotations. The data are organised into rows delimited by line breaks, with fields separated by a specific character, such as a comma (CSV) or a tab (TSV). For each annotated entity, there is a separate row with information about its position in the document (section type, character offsets) and metadata (entity type, identifier etc.). CSV is well understood by spreadsheet applications like MS Excel, whereas TSV plays well with Unix command-line tools like cut and sort.

Besides the csv and tsv formats for annotations only, there are text_csv and text_tsv variants which include the remaining text of the documents. They feature a row for every text token that is not part of an annotation, including positional information, but with empty fields for the entity metadata. In case of overlapping entities and sub-word annotations, (parts of) tokens will be repeated. This is the main difference compared to the CoNLL format, where the linearity of the input sequence is preserved, at the cost of annotations potentially being simplified.

The fields are as follows:

doc_id section sent_id entity_id start end term

Additional fields can be added by specifying keys for the entity metadata through the fields parameter.
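To illustrate the field layout, here is a minimal sketch of reading tsv or text_tsv output back with the standard-library csv module. The function name `read_rows` and the `extra_fields` argument are hypothetical and not part of bconv; the sketch assumes the output was written without a header line and with the seven default fields listed above, followed by any additional fields.

```python
import csv
import io

# The seven default columns, in the order given above.
FIELDS = ["doc_id", "section", "sent_id", "entity_id", "start", "end", "term"]


def read_rows(text, extra_fields=()):
    """Parse TSV text into a list of dicts (hypothetical helper, not bconv API).

    extra_fields: names for any additional columns that were requested
    through the fields parameter at export time.
    """
    reader = csv.DictReader(
        io.StringIO(text),
        fieldnames=FIELDS + list(extra_fields),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # TSV output uses no quoting mechanism
    )
    rows = []
    for row in reader:
        # Offsets are written as text; convert them to integers.
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        rows.append(row)
    return rows


sample = (
    "354896\tTitle\t1\t1\t0\t9\tLidocaine\n"
    "354896\tTitle\t1\t\t9\t10\t-\n"
)
rows = read_rows(sample)
# In text_tsv output, plain text tokens have an empty entity_id field.
entities = [r for r in rows if r["entity_id"]]
```

Filtering on a non-empty `entity_id` is one way to separate annotation rows from plain text tokens in the text_tsv variant.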

For formatting the comma-/tab-separated values, bconv relies on the standard-library csv module. The csv and text_csv formats use the default settings, i.e. commas for field delimiting and double quotes for protecting commas inside field values. The tsv and text_tsv formats set the following formatting parameters: lineterminator="\n", delimiter="\t", quotechar=None, i.e. tab delimiters and no protection/escaping mechanism for potential separator characters inside values. Instead, tab and newline characters inside annotations are (irreversibly) replaced with space characters. All formats accept additional formatting parameters as keyword arguments, which are passed directly to the csv.writer() constructor.
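The difference between the two sets of formatting parameters can be sketched with the csv module alone. This is not bconv's implementation: the helper names `sanitize`, `write_tsv` and `write_csv` are hypothetical, and the explicit `quoting=csv.QUOTE_NONE` is an assumption added here to make the absence of quoting explicit.

```python
import csv
import io


def sanitize(value):
    """Replace tabs and line breaks with spaces (irreversible)."""
    text = str(value)
    for char in ("\t", "\n", "\r"):
        text = text.replace(char, " ")
    return text


def write_tsv(rows, **fmtparams):
    """Write rows with the TSV parameters described above (sketch only)."""
    params = dict(
        lineterminator="\n",
        delimiter="\t",
        quotechar=None,
        quoting=csv.QUOTE_NONE,  # assumption: stated explicitly in this sketch
    )
    params.update(fmtparams)  # extra keyword arguments go to csv.writer()
    buf = io.StringIO()
    writer = csv.writer(buf, **params)
    for row in rows:
        # Without quoting or escaping, separators must not occur in values.
        writer.writerow([sanitize(v) for v in row])
    return buf.getvalue()


def write_csv(rows, **fmtparams):
    """Write rows with the csv module's default settings (sketch only)."""
    buf = io.StringIO()
    csv.writer(buf, **fmtparams).writerows(rows)
    return buf.getvalue()


row = ["354896", "Title", 1, 2, 18, 34, "cardiac\tasystole"]
tsv_text = write_tsv([row])
csv_text = write_csv([["354896", "Title", 1, 2, 18, 34, "asystole, cardiac"]])
```

In the TSV case the embedded tab is replaced by a space before writing, whereas in the CSV case the value containing a comma is wrapped in double quotes by the writer.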

Examples

`csv` (entities only)

354896,Title,1,1,0,9,Lidocaine
354896,Title,1,2,18,34,cardiac asystole
354896,Abstract,2,3,90,99,lidocaine
354896,Abstract,2,4,142,152,depression
354896,Abstract,3,5,331,347,bradyarrhythmias
354896,Abstract,3,6,409,418,lidocaine

`tsv` (entities only)

354896	Title	1	1	0	9	Lidocaine
354896	Title	1	2	18	34	cardiac asystole
354896	Abstract	2	3	90	99	lidocaine
354896	Abstract	2	4	142	152	depression
354896	Abstract	3	5	331	347	bradyarrhythmias
354896	Abstract	3	6	409	418	lidocaine

`text_tsv` (all tokens)

354896	Title	1	1	0	9	Lidocaine
354896	Title	1		9	10	-
354896	Title	1		10	17	induced
354896	Title	1	2	18	34	cardiac asystole
354896	Title	1		34	35	.
354896	Abstract	2		36	47	Intravenous
354896	Abstract	2		48	62	administration
354896	Abstract	2		63	65	of
...

→ Full example

Sources

The general CSV format is described in an RFC memo (RFC 4180). A very short definition of the general TSV format is given by the IANA (media type text/tab-separated-values). The specific selection of fields used by bconv does not follow any standard.

Notes

Document structure: Due to the document ID in the first field, the CSV/TSV format can be used for both single documents and multi-doc collections. The fields "section" (section type) and "sent_id" (sentence counter) provide information on document-internal structuring.
Metadata: Only the document IDs and the section type (if available) are represented in the format.
Entity annotations: By default, only positional information and an annotation ID are given. Through the fields parameter, additional entity information can be written to the output.
Whitespace: If annotations span more than one word, the "term" field may contain whitespace. Since line breaks and tab characters are not allowed in the TSV format, they are replaced with spaces. In the CSV format, they are protected with quoting or escaping as appropriate.
Offsets: Character offsets are included in every row.
Discontinuous spans: Annotations with multiple spans are subject to entity flattening. By default, sub-spans are split into separate rows, but with a shared entity ID (fourth column).
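
As a rough illustration of the note on discontinuous spans, the following sketch regroups rows that share an entity ID. The function name `group_spans` and the sample rows are made up for this example; the sketch assumes rows with the seven default fields and treats the combination of document ID and entity ID as the grouping key.

```python
from collections import defaultdict


def group_spans(rows):
    """Collect the sub-spans of each entity (hypothetical helper).

    rows: iterable of 7-tuples
          (doc_id, section, sent_id, entity_id, start, end, term).
    Returns a dict mapping (doc_id, entity_id) to a sorted list of
    (start, end, term) tuples.
    """
    grouped = defaultdict(list)
    for doc_id, _section, _sent_id, entity_id, start, end, term in rows:
        if entity_id == "":
            continue  # plain text token in a text_* variant, no annotation
        grouped[(doc_id, entity_id)].append((int(start), int(end), term))
    return {key: sorted(spans) for key, spans in grouped.items()}


# Invented rows: entity 7 consists of two separate sub-spans.
example_rows = [
    ("doc1", "Abstract", "1", "7", "10", "15", "renal"),
    ("doc1", "Abstract", "1", "", "16", "19", "and"),
    ("doc1", "Abstract", "1", "7", "28", "35", "failure"),
    ("doc1", "Abstract", "1", "8", "20", "27", "hepatic"),
]
spans = group_spans(example_rows)
```

Entities with more than one entry in the resulting lists are the ones that were split into separate rows.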

Exporters

`CSVFormatter`

Properties

fmt	`csv`
supports text	no
supports annotations	yes
stream type	text

Options

name	type	default	purpose
fields	Sequence[str]	`()`	keys in `Entity.metadata` for the additional fields
include_header	bool	`False`	add a header line
avoid_gaps	str	`'split'`	suppress discontinuous spans
avoid_overlaps	str	`None`	suppress annotation collisions
**fmtparams	Dict[str, str]	`{}`	keyword args directly passed to `csv.writer()`

`TSVFormatter`

Properties

fmt	`tsv`
supports text	no
supports annotations	yes
stream type	text

Options

name	type	default	purpose
fields	Sequence[str]	`()`	keys in `Entity.metadata` for the additional fields
include_header	bool	`False`	add a header line
avoid_gaps	str	`'split'`	suppress discontinuous spans
avoid_overlaps	str	`None`	suppress annotation collisions
**fmtparams	Dict[str, str]	`{}`	keyword args directly passed to `csv.writer()`

`TextCSVFormatter`

Properties

fmt	`text_csv`
supports text	yes
supports annotations	yes
stream type	text

Options

name	type	default	purpose
fields	Sequence[str]	`()`	keys in `Entity.metadata` for the additional fields
include_header	bool	`False`	add a header line
avoid_gaps	str	`'split'`	suppress discontinuous spans
avoid_overlaps	str	`None`	suppress annotation collisions
**fmtparams	Dict[str, str]	`{}`	keyword args directly passed to `csv.writer()`

`TextTSVFormatter`

Properties

fmt	`text_tsv`
supports text	yes
supports annotations	yes
stream type	text

Options

name	type	default	purpose
fields	Sequence[str]	`()`	keys in `Entity.metadata` for the additional fields
include_header	bool	`False`	add a header line
avoid_gaps	str	`'split'`	suppress discontinuous spans
avoid_overlaps	str	`None`	suppress annotation collisions
**fmtparams	Dict[str, str]	`{}`	keyword args directly passed to `csv.writer()`
