v0.9.4: Fix PDF parser: embedded digits, ABONO sign, and folio detection (#28)

mabahamo released this 16 Mar 14:10

5 commits to main since this release

v0.9.4

4890ea4

* Fix PDF parser: embedded digits, ABONO sign, and folio detection

Three parsing bugs fixed:

1. Digits embedded in alphanumeric description tokens (e.g. "B9", "C2")
   were matched by the number regex, corrupting amounts and descriptions.
   Added a negative lookbehind, (?<![A-Za-z]), to the number pattern.

2. ABONO POR CAPTACIONES was classified as a debit (egreso) instead of a
   credit (ingreso). Added "ABONO" to the is_ingreso keyword checks.

3. Folio numbers not starting with "0" (e.g. "2005078957") were not
   filtered, causing them to be summed into the transaction amount.
   Changed folio detection to match any 10+ digit number without dot
   separators, since real CLP amounts in PDFs always use dots.
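
The three fixes above can be illustrated with a minimal sketch. It is not the project's actual code: only the `(?<![A-Za-z])` lookbehind, the "ABONO" keyword, the `is_ingreso` name, and the "10+ digits without dots" folio rule come from the notes. The rest of the number pattern, the `DEPOSITO` keyword, and the helpers `find_numbers`, `is_folio`, and `amounts` are assumptions made for illustration.

```python
import re

# Assumed number pattern; only the (?<![A-Za-z]) lookbehind is from the notes.
# It matches a digit run optionally followed by dot-separated groups of three.
NUMBER_RE = re.compile(r"(?<![A-Za-z])\d+(?:\.\d{3})*")

# Folio rule from the notes: 10 or more digits with no dot separators.
FOLIO_RE = re.compile(r"\d{10,}")

# Illustrative keyword list; only "ABONO" is named in the notes.
INGRESO_KEYWORDS = ("DEPOSITO", "ABONO")


def find_numbers(line: str) -> list[str]:
    """Return number-like substrings, skipping digits glued to a letter."""
    return NUMBER_RE.findall(line)


def is_folio(token: str) -> bool:
    """True for a long, dot-free digit run that should not count as an amount."""
    return FOLIO_RE.fullmatch(token) is not None


def is_ingreso(description: str) -> bool:
    """True when the description contains a credit (ingreso) keyword."""
    upper = description.upper()
    return any(keyword in upper for keyword in INGRESO_KEYWORDS)


def amounts(line: str) -> list[int]:
    """Parse the amounts on a line as integers, ignoring folio numbers."""
    return [
        int(number.replace(".", ""))
        for number in find_numbers(line)
        if not is_folio(number)
    ]
```

With this sketch, `find_numbers("PAGO B9 C2 12.500")` yields only `"12.500"`, and `amounts("ABONO POR CAPTACIONES 2005078957 150.000")` drops the folio and keeps the single amount. A lookbehind on letters alone would still match the trailing digit of a token such as "B92"; whether the real pattern guards against that case is not stated in the notes.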

Also recalculates the description boundary after folio filtering so that
channel keywords between the folio and amounts are properly captured.
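
One plausible reading of the boundary recalculation, sketched under assumptions: if the description is taken to end at the first numeric token, an unfiltered folio cuts it short and any channel keyword after the folio is lost. Filtering folios first and then locating the boundary on the filtered tokens keeps that keyword. The function `split_description`, the token layout, and the "INTERNET" channel keyword are hypothetical and not taken from the project.

```python
import re

# Assumed patterns: a dot-grouped amount, and a folio of 10+ dot-free digits.
NUMBER_RE = re.compile(r"\d+(?:\.\d{3})*")
FOLIO_RE = re.compile(r"\d{10,}")


def split_description(tokens: list[str]) -> tuple[str, list[int]]:
    """Split a tokenized transaction line into (description, amounts)."""
    # Step 1: drop folio tokens so they cannot act as the boundary.
    kept = [t for t in tokens if not FOLIO_RE.fullmatch(t)]

    # Step 2: recompute the boundary on the filtered list. The description
    # ends at the first remaining numeric token, or at the end if none.
    boundary = next(
        (i for i, t in enumerate(kept) if NUMBER_RE.fullmatch(t)),
        len(kept),
    )

    description = " ".join(kept[:boundary])
    amounts = [
        int(t.replace(".", ""))
        for t in kept[boundary:]
        if NUMBER_RE.fullmatch(t)
    ]
    return description, amounts


# Hypothetical line: description, folio, channel keyword, amount, balance.
line = "ABONO POR CAPTACIONES 2005078957 INTERNET 150.000 1.350.000"
description, parsed = split_description(line.split())
```

Here the channel keyword stays in `description` and the folio contributes nothing to `parsed`.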

* Fix ruff lint: shorten docstring line

* Bump version to 0.9.4
