VRT annotated text files
========================

.. py:module:: pimlico.modules.input.text_annotations.vrt

.. list-table::
   :widths: 20 80

   * - Path
     - pimlico.modules.input.text_annotations.vrt
   * - Executable
     - yes

Input reader for VRT text collections (VeRticalized Text, as used by Korp). Reads in files from arbitrary locations in the same way as :mod:`pimlico.modules.input.text.raw_text_files`.

This is an input module. It takes no pipeline inputs and is used to read in data.

Inputs
------

No inputs

Outputs
-------

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Name
     - Type(s)
   * - corpus
     - :class:`~pimlico.modules.input.text_annotations.vrt.info.VRTOutputType`

Options
-------

.. list-table::
   :header-rows: 1
   :widths: 15 65 20

   * - Name
     - Description
     - Type
   * - files
     - (required) Comma-separated list of absolute paths to files to include in the collection. Paths may include globs. Place a '?' at the start of a filename to indicate that it's optional. You can specify a line range for the file by adding ':X-Y' to the end of the path, where X is the first line and Y the last to be included. Either X or Y may be left empty. (Line numbers are 1-indexed.)
     - comma-separated list of (line range-limited) file paths
   * - exclude
     - A list of files to exclude. Specified in the same way as ``files`` (except without line ranges). This allows you to specify a glob in ``files`` and then exclude individual files from it (you can use globs here too).
     - comma-separated list of strings
   * - encoding_errors
     - What to do in the case of invalid characters in the input while decoding (e.g. illegal utf-8 chars). One of 'strict' (default), 'ignore' or 'replace'. See Python's str.decode() for details.
     - string
   * - encoding
     - Encoding to assume for input files. Default: utf8
     - string
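
The ``files`` syntax and the ``encoding_errors`` choices described above can be illustrated with a small sketch. This is not Pimlico's own parsing code: ``parse_file_spec``, ``parse_files_option`` and ``decode_with_policy`` are hypothetical helpers written only for illustration, and the sketch assumes that the '?' marker is placed at the very start of each comma-separated entry. Glob expansion is left out.

```python
import re

# Hypothetical helpers for illustration only; they are not part of Pimlico.

# Matches an optional trailing ':X-Y' line range, where X and/or Y may be empty.
_RANGE_RE = re.compile(r"^(?P<path>.*):(?P<start>\d*)-(?P<end>\d*)$")


def parse_file_spec(spec):
    """Split one entry of the `files` option into its parts.

    Returns (path, optional, first_line, last_line). Line numbers are
    1-indexed and are None when left empty or when no range is given.
    Assumption: the '?' marker is the first character of the entry.
    """
    spec = spec.strip()
    optional = spec.startswith("?")
    if optional:
        spec = spec[1:]
    first_line = last_line = None
    match = _RANGE_RE.match(spec)
    if match:
        spec = match.group("path")
        first_line = int(match.group("start")) if match.group("start") else None
        last_line = int(match.group("end")) if match.group("end") else None
    return spec, optional, first_line, last_line


def parse_files_option(value):
    """Parse a whole comma-separated `files` value into a list of specs."""
    return [parse_file_spec(entry) for entry in value.split(",") if entry.strip()]


def decode_with_policy(raw, encoding="utf8", errors="strict"):
    """Decode bytes using one of the error policies named in the options.

    'strict' raises UnicodeDecodeError on invalid input, 'ignore' drops the
    offending bytes, and 'replace' substitutes U+FFFD for them.
    """
    return raw.decode(encoding, errors)
```

Under these assumptions, an entry such as ``?/data/b.vrt:-5`` would be read as an optional file limited to its first five lines, and decoding the Latin-1 bytes ``b"caf\xe9"`` as utf8 would fail under 'strict' but succeed (with data loss) under 'ignore' or 'replace'.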