Extracting content from encrypted file #531

Closed
praveentiru opened this issue Nov 23, 2022 · 3 comments
@praveentiru

Please ensure the following:

  • I built the executable using the latest code (git clone followed by go install)
  • I am using Windows 11
  • Bug detail described below:
    • I was trying to build a tool to extract the financial transactions for all mutual fund investments from a file downloaded from an Indian custodian
    • The download is encrypted with a user password, which I know
    • I used the user password to decrypt the file using pdfcpu
    • Then I extracted the content using pdfcpu extract -mode content xxx.pdf path
    • Many strings in the extracted content consist of non-ASCII characters, as if they were still encrypted

Is this happening because I do not have access to the owner password? Is there a way to work around it?

@hhrutter
Collaborator

Hi there!

This is because you are looking at raw PDF content.
Extracting text out of page content is not supported at this time.

@mengstr

mengstr commented Nov 27, 2022

The best/easiest solution I found was to use Ghostscript to extract the text from PDFs. On production/staging servers I can call Ghostscript directly, but on my development MacBook the Ghostscript installed from Homebrew fails, so there I run the extraction in a small Docker container with Ghostscript.

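// Fragment: assumes filename is set and that os, os/exec and path/filepath are imported.
var (
	outb []byte
	err  error
)
if os.Getenv("devMode") != "yes" {
	// Production/staging: call the locally installed Ghostscript; the txtwrite device prints the text to stdout.
	outb, err = exec.Command("gs", "-sDEVICE=txtwrite", "-sOutputFile=-", "-q", "-dNOPAUSE", "-dBATCH", filename).Output()
} else {
	// Development: run Ghostscript in a Docker container with the PDF directory mounted at /app.
	outb, err = exec.Command("docker", "run",
		"--rm",
		"-v", "/Users/mats/Desktop/HealthManager/hm-storage/labtests/:/app",
		"-w", "/app",
		"minidocks/ghostscript",
		"-sDEVICE=txtwrite", "-sOutputFile=-", "-q", "-dNOPAUSE", "-dBATCH",
		filepath.Base(filename),
	).Output()
}
// Check err here before using the output.
out := string(outb)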

@praveentiru
Author

OK. I have found that, after decrypting the file, even Apache Tika was able to extract the text properly without issue.
