Create SelfWealthPDFExtractor.java #2340
Good idea. SelfWealth only has buy and sell transactions, no dividends. Both added to the repo. Also note:
It's a bit hard without knowing German or Java, but with a bit of tidying up this must be pretty close.
This looks like a good start for our ASX (Australian Securities Exchange). Just a few ASX-specific features:
Thanks, I have raised feature request #2351; the topic on the forum is not yet visible. Here is an example of what the PDF looks like: https://www.coursehero.com/file/63931397/Contract-98505191pdf/
Alright, thanks @flywire for the contribution and thanks @Nirus2000 for commenting on the change and helping. I have now picked up the code and fixed it as well as I could understand it. A couple of remarks:
I am happy to merge more contributions. Please make sure the GitHub Actions workflow compiles.
Hmm
It seems the corruption problem is with PDFBox. Portfolio Performance The corrupted dump characters ( I can open the PDF file in Microsoft Edge Version 92.0.902.55 (Official build) (64-bit) on Win10, select, copy, and paste the text into Notepad++, and when viewed as above it has no strange characters (same 20 files).
Interestingly:
Unicode character U+00A0 is the no-break space. In UTF-8, it is represented as the byte sequence C2 A0. In ISO 8859-1 and other (older) one-byte encodings, it is represented as A0.
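To make the two encodings concrete, here is a minimal Java sketch that prints the bytes of U+00A0 under UTF-8 and ISO 8859-1. The class and method names (`NbspEncodingDemo`, `utf8Hex`, `latin1Hex`) are made up for illustration and are not part of Portfolio Performance or PDFBox.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: shows how the no-break space is encoded.
class NbspEncodingDemo {

    // Formats a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    static String utf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    static String latin1Hex(String s) {
        return hex(s.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0";
        System.out.println("UTF-8:      " + utf8Hex(nbsp));   // C2 A0
        System.out.println("ISO 8859-1: " + latin1Hex(nbsp)); // A0
    }
}
```

If a UTF-8 file is read as ISO 8859-1, the extra C2 byte shows up as a stray character in front of each space, which is one common way such text ends up looking corrupted.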
The SelfWealth PDF files have been decoded and the issue investigated (see https://issues.apache.org/jira/browse/PDFBOX-5247). Indeed, every space is a non-breaking space.
@buchen I think this is a good case for post-processing rather than having hidden codes in the test files. What do you think? Add something like this to SelfWealthPDFExtractor.java: String nbsp = "\u00A0";
pdfstream = pdfstream.replaceAll(nbsp, " "); Files with non-breaking spaces are very fiddly to create and error-prone. Updated: SelfWealthBuy01.txt SelfWealthSell01.txt
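The post-processing idea can be sketched as a small standalone helper. This is only an illustration under assumptions: the class `NbspNormalizer`, its methods, and the sample line are invented here, and where such a step would hook into the extractor is not shown. Note that Java strings are immutable, so the result of `replace` has to be assigned or returned; calling it without using the result changes nothing.

```java
// Illustration only: normalizes no-break spaces in extracted PDF text.
class NbspNormalizer {

    private static final char NBSP = '\u00A0';

    // Returns a copy of the text with every U+00A0 replaced by a regular space.
    static String normalize(String text) {
        return text.replace(NBSP, ' ');
    }

    // Counts U+00A0 characters, useful for checking a test file without a hex editor.
    static long countNbsp(String text) {
        return text.chars().filter(c -> c == NBSP).count();
    }

    public static void main(String[] args) {
        // Made-up sample line, not taken from a real SelfWealth document.
        String extracted = "Buy\u00A0100\u00A0ABC";
        System.out.println("before: " + countNbsp(extracted));
        String cleaned = normalize(extracted);
        System.out.println("after:  " + countNbsp(cleaned));
        System.out.println(cleaned);
    }
}
```

Using `String.replace` with `char` arguments avoids regular-expression handling altogether, which is enough when a single known character is being swapped.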
PP has PDFBox version 1.8.16 embedded. If I understand correctly, the "purchase" has regular spaces and the "sale" has non-breaking spaces. Is that also what you see when converting the PDF documents using the "File -> Import -> Debug: Create Text from PDF" feature? From my point of view, there are two options:
No, they both have non-breaking spaces; see the updated SelfWealthBuy01.txt and SelfWealthSell01.txt (above your last post) in a hex editor. My preference is
SelfWealthPDFExtractor in development.
See Issue #2329