This commit was created on GitHub.com and signed with GitHub’s verified signature.
* Fix PDF parser: embedded digits, ABONO sign, and folio detection
Three parsing bugs fixed:
1. Digits embedded in alphanumeric description tokens (e.g. "B9", "C2")
were matched by the number regex, corrupting amounts and descriptions.
Added negative lookbehind (?<![A-Za-z]) to the number pattern.
2. ABONO POR CAPTACIONES was classified as a debit (egreso) instead of
credit (ingreso). Added "ABONO" to the is_ingreso keyword checks.
3. Folio numbers not starting with "0" (e.g. "2005078957") were not
filtered, causing them to be summed into the transaction amount.
Changed folio detection to match any number of 10+ digits without dot
separators, since real CLP amounts of that size in PDFs always use dots
as thousands separators.
Also recalculates the description boundary after folio filtering so that
channel keywords between the folio and amounts are properly captured.
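
For illustration only, the three fixes can be sketched as below. This is a minimal sketch, not the parser's actual code: `NUMBER_RE`, `FOLIO_RE`, `INGRESO_KEYWORDS`, `is_folio` and `extract_amounts` are hypothetical names, the sample line is made up, and only the `(?<![A-Za-z])` lookbehind, the "ABONO" keyword, the `is_ingreso` name and the 10+ digit rule come from the commit message. The sketch also adds digits to the lookbehind so that a longer token such as "B92" cannot match partially; that is an assumption, not something the commit states.

```python
import re

# Hypothetical number pattern: digits with optional dot thousands separators,
# not immediately preceded by a letter (the commit's lookbehind). Digits are
# also excluded here (an assumption) so "B92" does not yield a partial "2".
NUMBER_RE = re.compile(r"(?<![A-Za-z0-9])\d[\d.]*")

# Hypothetical folio rule: 10 or more digits with no dot separators.
FOLIO_RE = re.compile(r"\d{10,}")

# Only "ABONO" is taken from the commit; the real keyword list is not shown.
INGRESO_KEYWORDS = ("ABONO",)


def is_folio(token: str) -> bool:
    """Return True for a long, undotted digit run such as '2005078957'."""
    return FOLIO_RE.fullmatch(token) is not None


def is_ingreso(description: str) -> bool:
    """Return True when the description contains a credit keyword."""
    upper = description.upper()
    return any(keyword in upper for keyword in INGRESO_KEYWORDS)


def extract_amounts(line: str) -> list[int]:
    """Collect CLP amounts from a line, skipping folios and embedded digits."""
    amounts = []
    for match in NUMBER_RE.finditer(line):
        token = match.group()
        if is_folio(token):
            continue  # folio numbers must not be summed into the amount
        amounts.append(int(token.replace(".", "")))
    return amounts
```

On the made-up line `"ABONO POR CAPTACIONES B9 2005078957 INTERNET 150.000 1.250.000"`, `extract_amounts` returns `[150000, 1250000]`: the "9" in "B9" is skipped by the lookbehind and the folio is filtered out, while `is_ingreso` returns `True` for the description.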
* Fix ruff lint: shorten docstring line
* Bump version to 0.9.4