The development environment is managed by Poetry. Ensure that Poetry is installed.
Then, optionally, configure Poetry to create the Python virtual environment inside the project directory, which may help your IDE find types more easily:
poetry config virtualenvs.in-project true
Then run the following to install dependencies required for development:
poetry install
To run tests:
poetry run pytest
To build an executable, run the following:
poetry run scripts/dev/build_executables.py VERSION
Where VERSION is a published version number of travdata on PyPI. This will generate a file under dist named travdata_cli, which is an executable for the platform that you ran it on.
VERSION can instead be localdev to build from the local copy including local changes. Note that this will include development dependencies as well, so will be larger than an executable generated via a qualified version number.
Overall process:
- If you have not already done so, install Tabula. This will be used to create .tabula-template.json files, but is not required for the actual extraction itself.
- Access Tabula through its web interface. By default this will be available via localhost:8080.
- Import the PDF containing the tables.
- Create a template per table in the PDF file.
  - Use the tool to select individual tables, and define a template for each table. Specific guidance:
    - Only select a single table per template.
    - Create multiple selections within the same template for a table that is split into multiple parts, for example a table that spans two pages.
    - Where multiple selections are made for a single table, only include the header row once, in the first selection, and omit it from subsequent selections.
  - Preview extraction within Tabula. Experiment with both "Stream" and "Lattice" modes.
    - Lattice works very well and needs less processing configuration later, but only where the table has an outline and interior gridlines.
    - Stream works well as a fallback for other types of table, but requires more processing later. Most tables from Mongoose Traveller books fall into this category.
- Export the template file from Tabula, and add it to an appropriate subdirectory within the config directory corresponding to the PDF file.
- For each table template, create a table entry within the book.yaml file. See the Extraction section for details of how this works.
- Test the extraction by running the tools/extractcsvtables.py tool. Adjust the config.yaml and .tabula-template.json
files. Remove
A book-level configuration directory, containing a book.yaml file and subdirectories holding .tabula-template.json files, is the configuration used to extract data from a PDF file, together with metadata about the extracted data.
The schema is a typed-YAML file, with types as follows:
!Group (mapping)
This is the top-level type, and can contain child groups. A group corresponds to a directory within the configuration directory.
Fields:
- groups (optional, mapping<string, !Group>): A mapping from string (the group name, which is a subdirectory name) to the definition of that group.
- tables (optional, mapping<string, !Table>): The map key is the filename stem of the table. This is the filename without the .tabula-template.json suffix within the configuration directory, and the filename without the .csv suffix in the output directory. The map value is the metadata and extraction configuration of the table.
- extraction_templates (optional, list<!TableExtraction>): A list of table extraction configurations. This can be used with the YAML "anchor" (&) syntax to define common table extraction configurations elsewhere in the file, which can then be referenced with the "alias" (*) syntax.

!Table (mapping)
Defines metadata and extraction configuration relating to a single table. The "path" of group names and the table name form the path for both the .tabula-template.json file within the configuration directory and the output .csv file in the output directory.
Fields:
- type (optional, string): Name of the type of each row. This relates to a speculative feature to translate tables further from CSV into YAML files. At this time, ignore this field.
- extraction (optional, !TableExtraction): Configures processing of data extracted by Tabula. If left unset or set to !!null, then no PDF to CSV extraction will be attempted. See the section on Extraction for more information.
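
As a rough illustration of how group names and a table name combine into file paths, here is a minimal Python sketch. The function name table_paths and the example directory names are hypothetical and are not part of travdata; only the two filename suffixes come from the description above.

```python
from pathlib import PurePosixPath

TEMPLATE_SUFFIX = ".tabula-template.json"
CSV_SUFFIX = ".csv"


def table_paths(config_dir, output_dir, group_path, table_name):
    """Return (template_path, csv_path) for a table nested within groups.

    group_path is the list of group names leading to the table, outermost
    first. This helper is illustrative only, not the project's own code.
    """
    relative = PurePosixPath(*group_path)
    template = PurePosixPath(config_dir) / relative / (table_name + TEMPLATE_SUFFIX)
    csv_path = PurePosixPath(output_dir) / relative / (table_name + CSV_SUFFIX)
    return template, csv_path


# Hypothetical directory, group and table names.
template, csv_path = table_paths("config/book", "out", ["equipment"], "armour")
print(template)  # config/book/equipment/armour.tabula-template.json
print(csv_path)  # out/equipment/armour.csv
```

The same relative path is used under both directories, so the output tree mirrors the configuration tree.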

!TableExtraction (mapping)
When present as the value of an extraction field in a !Table, requests extraction of that table. Fields inside this type control how the data emitted by Tabula is transformed into the rows of the final CSV file.
Fields:
- add_header_row (optional, list<string>): Adds the list of strings as the first row in the resulting CSV file. This row is not subject to any configured row_folding.
- row_folding (optional, list<!StaticRowCounts | !EmptyColumn>): Specifies how to merge sequences of input rows into single rows in the output. For entries in this list that cover a limited number of input rows (like !StaticRowCounts), the rows that follow are grouped by the next entry in the list. Any input rows not covered by these entries pass through ungrouped.

!StaticRowCounts (mapping)
Groups input rows according to each of the numbers in turn.
Fields:
- row_counts (list<integer>): Specifies input row counts per output row.
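
The grouping described for !StaticRowCounts can be sketched in Python as follows. This is an illustration under assumptions, not the project's implementation: the function names are hypothetical, and joining the cells of merged rows with a single space is an assumed rule that the description above does not specify.

```python
def merge_rows(group):
    """Merge rows column by column, joining non-empty cells with a space.

    The single-space join is an assumption made for this sketch.
    """
    width = max(len(row) for row in group)
    merged = []
    for index in range(width):
        cells = [row[index] for row in group if index < len(row) and row[index]]
        merged.append(" ".join(cells))
    return merged


def fold_static_row_counts(rows, row_counts):
    """Merge rows in runs of the given sizes; leftover rows pass through."""
    rows = list(rows)
    output = []
    position = 0
    for count in row_counts:
        group = rows[position:position + count]
        if not group:
            break  # Ran out of input rows before using every count.
        output.append(merge_rows(group))
        position += count
    # Rows not covered by any count are emitted ungrouped.
    output.extend(rows[position:])
    return output


rows = [
    ["Name", "Cost"],
    ["Laser", "Cr100"],
    ["Pistol", ""],
    ["Knife", "Cr10"],
]
print(fold_static_row_counts(rows, [1, 2]))
# [['Name', 'Cost'], ['Laser Pistol', 'Cr100'], ['Knife', 'Cr10']]
```

With row_counts of 1 then 2, the first input row stands alone, the next two are merged, and the final row passes through unchanged.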

!EmptyColumn (mapping)
Groups input rows together with previous input rows when the given column is empty.
Fields:
- column_index (integer): Specifies the zero-based index of the column that must be empty in order to group it with previous input rows.
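
The !EmptyColumn behaviour can be sketched in the same way. Again this is a hedged illustration, not travdata's own code: the function names are hypothetical, treating whitespace-only cells as empty is an assumption, and so is joining merged cells with a single space.

```python
def merge_rows(group):
    """Merge rows column by column, joining non-empty cells with a space.

    The single-space join is an assumption made for this sketch.
    """
    width = max(len(row) for row in group)
    merged = []
    for index in range(width):
        cells = [row[index] for row in group if index < len(row) and row[index]]
        merged.append(" ".join(cells))
    return merged


def fold_empty_column(rows, column_index):
    """Group each row with the previous rows when its given column is empty."""
    groups = []
    for row in rows:
        cell = row[column_index] if column_index < len(row) else ""
        if not cell.strip() and groups:
            # Empty column: this row continues the previous group.
            groups[-1].append(row)
        else:
            # Non-empty column (or no previous row yet): start a new group.
            groups.append([row])
    return [merge_rows(group) for group in groups]


rows = [
    ["1", "Scout", "Fast"],
    ["", "", "courier"],
    ["2", "Trader", "Cargo"],
]
print(fold_empty_column(rows, 0))
# [['1', 'Scout', 'Fast courier'], ['2', 'Trader', 'Cargo']]
```

This pattern suits tables where a long cell wraps onto a continuation line that leaves the leading column blank.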
README.adoc is the source of truth; perform any edits there. When completed, run the following command to update the .rst and .md files:
./scripts/dev/convertadoc.sh README.adoc
Explanation: Asciidoc is the preferred format for documentation in this project; the other formats are for compatibility with PyPI and other sites.