Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix comparations for image colorspace literals #132

Merged
merged 6 commits into from Oct 15, 2019

Conversation

bashkirtsevich
Copy link
Contributor

@bashkirtsevich bashkirtsevich commented Mar 25, 2018

See issue #131
Python: 3.6.3

@tataganesh tataganesh changed the base branch from master to develop November 8, 2018 19:07
eladkehat added a commit to eladkehat/yapdfminer that referenced this pull request May 1, 2019
@pietermarsman
Copy link
Member

Hi @bashkirtsevich, thanks for your contribution. This is a very late response...

We don't have any test that run this code so I am not sure if this breaks any other kind of pdfs. I can imagine that sometimes the colorspace is a list and sometimes it is not. Do you have any idea if that is the case?

If you want this to be merged you should also add a a test. That makes sure that this code can actually be run (which I believe by looking at it) and it will make sure that we don't change it without thinking.

Do you have time to work on this?

@bashkirtsevich
Copy link
Contributor Author

bashkirtsevich commented Aug 4, 2019

In my PR I just fix logical mistakes such as is and in.
We using this library with my patch, and its finally work on a lots of PDF documents.
Merge not necessary. May be this PR can helpful as a knowledge base.

@pietermarsman
Copy link
Member

pietermarsman commented Aug 19, 2019

I've just checked if any of our current tests is using the ImageWriter, and this is not the case.

But we do have some sample pdf's that contain images. This is the result of evaluating image.colorspace at the start of ImageWriter.export_image().

nonfree/dmca.pdf.

  • [/'Indexed', /'DeviceRGB', 255, <PDFObjRef:86>]

nonfree/175.pdf

  • [/'DeviceRGB']
  • [/'DeviceRGB']

Both pdfs have a list of colorspaces instead of a single value. And thus our current way of checking e.g. image.colorspace is LITERAL_DEVICE_RGB is wrong.

@pietermarsman
Copy link
Member

I've checked the code from this PR with nonfree/dmca.pdf and it improves the output! With this PR the code detects that it is a RGB image and should be written to disk as a .bmp. With the old code the image is not recognized as RGB and it is written as a generic .img file that cannot be easily opened.

Copy link
Member

@pietermarsman pietermarsman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to have some tests for this code but @bashkirtsevich indicated that he does not have time to work on this.

I've checked the code with pdf's that contain images and that works!

So, I suggest we merge this into develop.

@pietermarsman pietermarsman merged commit 4df6d4e into pdfminer:develop Oct 15, 2019
@pietermarsman pietermarsman mentioned this pull request Oct 15, 2019
@chengmengli06
Copy link

chengmengli06 commented May 20, 2020

@pietermarsman I have a pdf with some images write as general ".img" files, how could I convert these images to jpg format? I use pdf2txt, the first 3 bytes of the stream are H\x89\xec, the corresponding hex is 4889ec.
image

@pietermarsman
Copy link
Member

pdfminer.six writes to a .img file if it cannot infer the type of the image. It supports jpeg (.jpg), jbig2 (.jb2) and grayscale (.bmp) images. The bytes of an unrecognized images type are written to this extention: '.%d.%dx%d.img' % (image.bits, width, height). From the filename you can infer the number bits, width and height of the image. I don't know how you should open it.

Please share if you find a way to open the images. Perhaps we can add it to pdfminer.six.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants