Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create SelfWealthPDFExtractor.java #2340

Closed
wants to merge 6 commits into from
Closed

Create SelfWealthPDFExtractor.java #2340

wants to merge 6 commits into from

Conversation

flywire
Copy link
Contributor

@flywire flywire commented Jul 17, 2021

SelfWealthPDFExtractor in development.

See Issue #2329

@buchen buchen marked this pull request as draft July 18, 2021 14:10
@flywire
Copy link
Contributor Author

flywire commented Jul 18, 2021

If you want, I can also create the importer for you. But for this I need more PDF debugs for buying, selling, dividends, and so on.

Good idea. SelfWealth only have buy or sell transactions, no dividends. Both added to repo.

Also note: 2000, AUS Market: ASX "ASX" is Australian exchange.

@flywire flywire marked this pull request as ready for review July 19, 2021 22:09
@flywire
Copy link
Contributor Author

flywire commented Jul 19, 2021

It's a bit hard without knowing German or Java but with a bit of a tidy up this must be pretty close.

@ismogal12
Copy link

This looks good start for our ASX(Australian Stock Exchange).
I hope one can be created for CommSec PDF as well. I am not sure if this is the right place to put feature requests.

Just a few features for ASX specific:

  1. CGT discount if the holding period is more than 12 months.
  2. Ability to calculate LIFO, I think currently it only does FIFO
  3. There is a split event but no Consolidation (which may be specific to ASX - ex: 10 shares are consolidated into 1).

@ismogal12
Copy link

I hope one can be created for CommSec PDF as well. I am not sure if this is the right place to put feature requests.

No, you need a new issue with sample buy/sell/dividend dumps.

#2329 (comment)

For the structur in addBuySellTransaction or other transactions....

  1. security with security currency
  2. date
  3. amount and currency
  4. sometimes forex calculation if security not in amount currency
  5. notes (...)

Raise the ASX specific features on https://forum.portfolio-performance.info/c/english/16

Thanks, I have raised the feature request - #2351, and a topic on the forum is not yet visible.

Here is an example of the PDF how it looks like: https://www.coursehero.com/file/63931397/Contract-98505191pdf/

@flywire flywire mentioned this pull request Jul 24, 2021
@ismogal12 ismogal12 mentioned this pull request Jul 25, 2021
buchen added a commit that referenced this pull request Jul 25, 2021
Issue: #2340 #2329
Co-authored-by: flywire <flywire0@gmail.com>
Co-authored-by: Andreas Buchen <andreas.buchen@gmail.com>
@buchen
Copy link
Member

buchen commented Jul 25, 2021

Alright, thanks @flywire for the contribution and thanks @Nirus2000 for commenting on the change and helping.

I have now picked up the code and fixed it as good as I could understand it.

A couple for remarks:

  • I remove the GST line - it apparently was already included in the other line items
  • The "Sell" file was somehow corrupted - it had strange space characters which led to the patterns not working. I did remove them, but if the PDF actually creates this space characters, then we need updated regex pattern.

Bildschirmfoto 2021-07-25 um 21 56 44

  • The reason I fixed the sample file and not the regex expressions was that I had the impression the "sell" file was manually created. At least in Germany - if you sell something - you get the proceeds without brokerage fees. 😄 Therefore I have updated the file accordingly. If this different in Australia, then PP cannot treat "brokerage fees" as fees we would have to add them to the original proceeds.
Sell Confirmation
[...]
397 WPL WOODSIDE PETROLEUM 21.88 $8,686.36 AUD
Brokerage* $9.50 AUD
Adviser Fee* $0.00 AUD
Net Value $8,695.86 AUD
  • I registered the importer in the PDFImportAssistent because otherwise it is not applied to PDF imports

I am happy to merge more contributions. Please make sure the Github Actions workflow compiles.

@buchen buchen closed this Jul 25, 2021
@flywire
Copy link
Contributor Author

flywire commented Jul 26, 2021

Hmm

A couple of remarks:

  • gst - tax, I think will be revisited another time 😄
  • The "Sell" file was somehow corrupted - yes, I wasn't aware of that problem - see below. (Buy dump unintentionally fixed when deidentified.)
  • in Germany - if you sell something - you get the proceeds without brokerage fees. 😄 I made a bad mistake - correct way is sell for $100.00, fees $10.00 get $90.00.

Seems corruption problem is with PDFBox.

Portfolio Performance
Version: 0.54.0 (Jul. 2021)
Platform: win32, x86_64
Java: 11.0.4+11-LTS, Azul Systems, Inc.
Locale: AU

The corrupted dump characters (c2 a0 instead of 20) occur with any dump of buy or sell transactions (tested 20 files including buy and sell). Notepad++ v8.1.1 (64bit), File, Open, PlugIns, Hex-Editor, View in Hex:

image

I can open pdf file in Microsoft Edge Version 92.0.902.55 (Official build) (64-bit) Win10, select/copy/paste into Notepad++ and viewed as above has no strange characters (same 20 files).

@flywire
Copy link
Contributor Author

flywire commented Jul 26, 2021

Interestingly:

  • java -jar pdfbox-app-1.8.10.jar ExtractText PrivateBuy.pdf uses a0 as space character
  • java -jar pdfbox-app-2.0.24.jar ExtractText PrivateBuy.pdf uses c2 a0 as space character, same v1.8.16

https://issues.apache.org/jira/browse/PDFBOX-5247

@chirlu
Copy link
Member

chirlu commented Jul 28, 2021

Unicode character U+00A0 is the no-break space. In UTF-8, it is represented as the byte sequence C2 A0. In ISO 8859-1 and other (older) one-byte encodings, it is represented as A0.

@flywire
Copy link
Contributor Author

flywire commented Jul 28, 2021

The SelfWealth PDF files have been decoded and issue investigated: https://issues.apache.org/jira/browse/PDFBOX-5247

Indeed every space is a Non-breaking_space

if you don't like it, do your own postprocessing. It's unusual but valid.

@buchen I think this is a good case for post-processing rather than have hidden codes in the test files. What do you think?

Add to SelfWealthPDFExtractor.java something like:

String nbsp = "&nbsp;"
pdfstream.replaceAll("&nbsp;", " ");

Files are very fiddly to create and error-prone with Non-breaking space.

Updated: SelfWealthBuy01.txt SelfWealthSell01.txt

@buchen
Copy link
Member

buchen commented Aug 4, 2021

@buchen I think this is a good case for post-processing rather than have hidden codes in the test files. What do you think?

PP has PDFBox version 1.8.16 embedded.

If I understand it, the "purchase" has regular spaces, the "sale" has non-breaking spaces - is that also what you see when converting the PDF documents using the "File -> Import -> Debug: Create Text from PDF" feature?

From my point of view, there are two options:

  • adopt the regular expressions (currently looking for regular spaces) - but this is cumbersome if one file has regular spaces, the other non-breaking spaces as it duplicates the regular expressions
  • do a post-processing and replace the non-breaking spaces with regular spaces - would we do this on the text output before passing it to any converter, or would this be a pre-processing specific to SelfWealth?

@flywire
Copy link
Contributor Author

flywire commented Aug 4, 2021

If I understand it, the "purchase" has regular spaces, the "sale" has non-breaking spaces - is that also what you see when converting the PDF documents using the "File -> Import -> Debug: Create Text from PDF" feature?

No, they both have non-breaking spaces, see Updated: SelfWealthBuy01.txt SelfWealthSell01.txt (above your last post) using hex editor.

My preference is do a post-processing and replace the non-breaking spaces with regular spaces and given they are not normally visible (unless in hex) I suggest do it on all files. Would also give same result as select/copy/paste from pdf file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants