v0.9.4: Fix PDF parser: embedded digits, ABONO sign, and folio detection (#28)

@mabahamo released this 16 Mar 14:10
· 5 commits to main since this release
4890ea4
* Fix PDF parser: embedded digits, ABONO sign, and folio detection

Three parsing bugs fixed:

1. Digits embedded in alphanumeric description tokens (e.g. "B9", "C2")
   were matched by the number regex, corrupting amounts and descriptions.
   Added negative lookbehind (?<![A-Za-z]) to the number pattern.

2. ABONO POR CAPTACIONES was classified as a debit (egreso) instead of
   a credit (ingreso). Added "ABONO" to the is_ingreso keyword checks.

3. Folio numbers not starting with "0" (e.g. "2005078957") were not
   filtered, causing them to be summed into the transaction amount.
   Changed folio detection to match any number of 10 or more digits
   with no dot separators, since real CLP amounts in the PDFs always
   use dots as thousands separators.

Also recalculates the description boundary after folio filtering so that
channel keywords between the folio and amounts are properly captured.
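
The three fixes above can be illustrated with a minimal sketch. This is not the project's actual parser: `NUMBER_RE`, `is_folio`, `extract_amounts`, and the sample line are hypothetical names and data chosen for illustration. Only the `(?<![A-Za-z])` lookbehind, the "ABONO" keyword, the `is_ingreso` name, and the 10+ digit folio rule come from the notes above.

```python
import re

# Hypothetical number pattern: digits with optional dot thousands separators.
# The negative lookbehind (?<![A-Za-z]) rejects a digit that directly follows
# a letter, so tokens such as "B9" or "C2" no longer produce a number.
NUMBER_RE = re.compile(r"(?<![A-Za-z])\d+(?:\.\d{3})*")

# Assumed keyword tuple; the real check may include other keywords.
INGRESO_KEYWORDS = ("ABONO",)


def is_ingreso(description: str) -> bool:
    """Return True when the description looks like a credit (ingreso)."""
    upper = description.upper()
    return any(keyword in upper for keyword in INGRESO_KEYWORDS)


def is_folio(token: str) -> bool:
    """Treat any run of 10 or more digits with no dots as a folio number."""
    return re.fullmatch(r"\d{10,}", token) is not None


def extract_amounts(line: str) -> list[int]:
    """Collect numeric amounts from a line, skipping folio numbers."""
    amounts = []
    for token in NUMBER_RE.findall(line):
        if is_folio(token):
            continue  # folios must not be summed into the amount
        amounts.append(int(token.replace(".", "")))
    return amounts


# Hypothetical statement line combining all three cases.
line = "ABONO POR CAPTACIONES B9 2005078957 150.000"
print(extract_amounts(line), is_ingreso(line))
```

In this sketch the lookbehind only guards the digit immediately after a letter, which covers the "B9" and "C2" cases named above; longer alphanumeric tokens might need an additional guard.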

* Fix ruff lint: shorten docstring line

* Bump version to 0.9.4