Developing

The development environment is managed by Poetry. Ensure that Poetry is installed.

Then, optionally, configure Poetry to create the Python virtual environment in-project, which may help your IDE find types more easily.

poetry config virtualenvs.in-project true

Then run the following to install dependencies required for development:

poetry install

To run tests:

poetry run pytest

Building executables

Run the following:

poetry run scripts/dev/build_executables.py VERSION

Where VERSION is a published version number of travdata on PyPI. This will generate a file under dist named travdata_cli, which is an executable for the platform that you ran it on.

VERSION can instead be localdev to build from the local copy, including any local changes. Note that this build also includes development dependencies, so it will be larger than an executable generated from a published version number.
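For example, to build an executable from the local working copy:

poetry run scripts/dev/build_executables.py localdev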

Adding more table configurations for extraction

Overall process:

  1. If you have not already done so, install Tabula. This will be used to create .tabula-template.json files, but is not required for the actual extraction itself.

  2. Access Tabula through its web interface. By default this will be available via localhost:8080.

  3. Import the PDF containing the tables.

  4. Create a template per table in the PDF file.

  5. Use the tool to select individual tables, and define a template for each table. Specific guidance:

    • Only select a single table per template.

    • Create multiple selections within the same template for a table that is split into multiple parts. For example, a table that spans two pages.

    • Where multiple selections are made for a single table, only include the header row once (in the first selection); omit it from subsequent selections.

    • Preview extraction within Tabula. Experiment with both "Stream" and "Lattice" modes.

    • Lattice requires little processing configuration later, but only works well where the table has an outline and interior gridlines.

    • Stream works well as a fallback for other types of table, but requires more processing later. Most tables from Mongoose Traveller books fall into this category.

  6. Export the template file from Tabula, and add it to an appropriate subdirectory within the config directory corresponding to the PDF file.

  7. For each table template, create a table entry within the book.yaml file. See the Extraction section for details of how this works, and the sketch following this list.

  8. Test the extraction by running the tools/extractcsvtables.py tool, adjusting the book.yaml and .tabula-template.json files until the output is correct.
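As a sketch of step 7 (the table name and header values are hypothetical, and the exact placement of the ! type tags may depend on the loader; the fields themselves are described under Per-book configuration below), a minimal table entry in book.yaml might look like:

tables:
  weapons:
    extraction: !TableExtraction
      add_header_row: ["Weapon", "TL", "Cost"]

This assumes a template exported from Tabula as weapons.tabula-template.json alongside book.yaml, and produces weapons.csv in the output directory.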

Per-book configuration

A book-level configuration directory contains a book.yaml file, plus subdirectories and .tabula-template.json files. Together these configure the extraction of data from a PDF file, and provide metadata about the extracted data.

The schema is a typed-YAML file, with types as follows:

!Group

mapping

This is the top-level type, and can contain child groups. A group corresponds to a directory within the configuration directory.

Fields
groups

optional mapping<string, !Group>

A mapping from string (group name - a subdirectory name) to the definition of that group.

tables

optional mapping<string, !Table>

The map key is the filename stem of the table. This is the filename without the .tabula-template.json suffix within the configuration directory, and the filename without the .csv suffix in the output directory.

The map value is the metadata and extraction configuration of the table.

extraction_templates

optional list<!TableExtraction>

A list of table extraction configurations. This can be used with the YAML "anchor" (&) syntax to define common table extraction configurations, which can then be reused elsewhere in the file via the "alias" (*) syntax.
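As an illustrative sketch (the group and table names are hypothetical, and whether the ! type tags must be written explicitly depends on the loader; the field names are those defined above), a !Group using an anchored extraction template might look like:

extraction_templates:
  - &fold-by-first-column !TableExtraction
    row_folding:
      - !EmptyColumn
        column_index: 0
groups:
  equipment: !Group
    tables:
      armour: !Table
        extraction: *fold-by-first-column

Here the anchored !TableExtraction is reused by alias for the armour table, which corresponds to equipment/armour.tabula-template.json in the configuration directory and equipment/armour.csv in the output directory.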

!Table

mapping

Defines metadata and extraction configuration relating to a single table. The "path" of group names and the table name form the path for both the .tabula-template.json file within the configuration directory and the output .csv file in the output directory.

Fields
type

optional string

Name of the row type. This relates to a speculative feature to further translate tables from CSV into YAML files, naming the type of each row. At this time, ignore this field.

extraction

optional !TableExtraction

Configures processing of data extracted by Tabula. If left unset or set to !!null, then no PDF to CSV extraction will be attempted. See the section on Extraction for more information.

Extraction

!TableExtraction

mapping

When present as the value of an extraction field in a !Table, requests extraction of that table. Fields inside this type control how the data emitted by Tabula is transformed into the rows of the final CSV file.

Fields
add_header_row

optional list<string>

Adds the list of strings as the first row in the resulting CSV file. This row is not subject to any configured row_folding.

row_folding

optional list<!StaticRowCounts | !EmptyColumn>

Specifies how to merge sequences of input rows into single rows in the output. For entries in this list that cover a limited number of input rows (such as !StaticRowCounts), later input rows fall through to grouping by the subsequent entry. Any input rows not covered by these entries pass through ungrouped.

!StaticRowCounts

mapping

Groups input rows according to each of the numbers in turn.

Fields
row_counts

list<integer>

Specifies input row counts per output row.

!EmptyColumn

mapping

Groups input rows together with previous input rows when the given column is empty.

Fields
column_index

integer

Specifies the zero-based index of the column that must be empty for an input row to be grouped with the previous input rows.
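As a hedged example combining these types (the header values and row counts are hypothetical), a table whose first two entries each span two input rows, followed by entries whose continuation rows leave the first column empty, might be configured as:

extraction: !TableExtraction
  add_header_row: ["Item", "Description"]
  row_folding:
    - !StaticRowCounts
      row_counts: [2, 2]
    - !EmptyColumn
      column_index: 0

The !StaticRowCounts entry folds the first four input rows into two output rows of two input rows each; remaining input rows are then folded into the preceding output row whenever column 0 is empty.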

Updating README

README.adoc is the source of truth; perform any edits there. When done, run the following command to update the .rst and .md files:

./scripts/dev/convertadoc.sh README.adoc

Explanation: AsciiDoc is the preferred format for documentation in this project; the other formats are for compatibility with PyPI and other sites.