Calculated clipsrc area is too small for pdf to geojson ogr2ogr conversion #10335

Closed
tiffanywei opened this issue Jun 27, 2024 · 3 comments

Comments

@tiffanywei

tiffanywei commented Jun 27, 2024

What is the bug?

When I convert the attached PDF to GeoJSON using ogr2ogr with -clipsrc, the resulting GeoJSON covers only a very small area of the PDF.

-removed file-

Steps to reproduce the issue

Given the cropbox of the pdf:

$ pdfinfo -box clipsrc-problem.pdf
Creator:         TeX
Producer:        GPL Ghostscript 10.02.1
CreationDate:    Wed May 22 11:29:41 2024 PDT
ModDate:         Wed May 22 11:29:41 2024 PDT
Custom Metadata: no
Metadata Stream: yes
Tagged:          no
UserProperties:  no
Suspects:        no
Form:            none
JavaScript:      no
Pages:           1
Encrypted:       no
Page size:       2588.04 x 1381.28 pts
Page rot:        0
MediaBox:            0.00     0.00  2588.04  1381.28
CropBox:             0.00     0.00  2588.04  1381.28
BleedBox:            0.00     0.00  2588.04  1381.28
TrimBox:             0.00     0.00  2588.04  1381.28
ArtBox:              0.00     0.00  2588.04  1381.28
File size:       1178486 bytes
Optimized:       no
PDF version:     1.7

And a DPI of 264, the PDF should be 9489 px wide and 5065 px high (page size in points × 264 / 72).
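
The pixel figures above follow from the usual conversion of 72 PDF points per inch. A minimal Python sketch of that arithmetic, where the helper name `points_to_pixels` is hypothetical and only the page size and DPI come from the pdfinfo output and the commands in this report:

```python
# Convert a PDF page size in points to pixels at a given DPI.
# Assumption: 1 PDF point = 1/72 inch (the PDF default user unit).
POINTS_PER_INCH = 72


def points_to_pixels(points: float, dpi: float) -> int:
    """Return the rounded pixel length of `points` rendered at `dpi`."""
    return round(points * dpi / POINTS_PER_INCH)


# Page size reported by pdfinfo, and the GDAL_PDF_DPI value used in the issue.
page_width_pts, page_height_pts = 2588.04, 1381.28
dpi = 264

width_px = points_to_pixels(page_width_pts, dpi)
height_px = points_to_pixels(page_height_pts, dpi)
print(width_px, height_px)
```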

When I run the ogr2ogr command without the clipsrc flag:

ogr2ogr --config OGR_PDF_READ_NON_STRUCTURED YES --config GDAL_PDF_DPI 264 -lco COORDINATE_PRECISION=3 without-clipsrc.geojson clipsrc-problem.pdf

I get a correct geojson file of the entire pdf.
-removed file-
But when I run it with the -clipsrc flag, using a min and max that should span the entire PDF:

ogr2ogr --config OGR_PDF_READ_NON_STRUCTURED YES --config GDAL_PDF_DPI 264 -lco COORDINATE_PRECISION=3 -clipsrc 0 0 9489 5065 with-clipsrc.geojson clipsrc-problem.pdf

I get a geojson file that spans just a small portion of the pdf.
-removed file-

I would expect the two commands to return almost equivalent geojson files.

Versions and provenance

OS: MacOS Sonoma 14.5
gdal: GDAL 3.8.4, released 2024/02/08
pdfinfo: pdfinfo version 24.01.0

Additional context

No response

@jratike80
Collaborator

jratike80 commented Jun 27, 2024

If you use ogr2ogr you are working with vector data, not pixels. You can check the extent of the PDF with ogrinfo:

ogrinfo --config OGR_PDF_READ_NON_STRUCTURED YES  clipsrc-problem.pdf -al -so
Extent: (-51.548640, -14146.138422) - (54006.879028, 28800.419032)

I thought that this would return the same output as without -clipsrc:
-clipsrc -51.548640 -14146.138422 54006.879028 28800.419032

However, it does not. There is something in the extents that I do not understand. Ogrinfo shows these extents:

For the pdf:

Feature Count: 63161
Extent: (-51.548640, -14146.138422) - (54006.879028, 28800.419032)

For without-clipsrc.geojson:

Feature Count: 63161
Extent: (-90.717000, -24895.827000) - (95042.892000, 50685.936000)

For with-clipsrc.geojson:

Feature Count: 16590
Extent: (2345.217000, 225.514000) - (54006.879000, 28800.419000)
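
One way to probe the mismatch is to divide the two full extents coordinate by coordinate. The sketch below does that with the numbers listed above. The comparison against 264 / 150 is an assumption, not a confirmed explanation: the ogrinfo call on the PDF did not set GDAL_PDF_DPI, while the ogr2ogr call that produced without-clipsrc.geojson used 264.

```python
# Compare the extent ogrinfo reports for the PDF with the extent of
# without-clipsrc.geojson, both copied from the ogrinfo output above.
pdf_extent = (-51.548640, -14146.138422, 54006.879028, 28800.419032)
geojson_extent = (-90.717000, -24895.827000, 95042.892000, 50685.936000)

# Ratio of each GeoJSON coordinate to the matching PDF coordinate.
ratios = [g / p for g, p in zip(geojson_extent, pdf_extent)]
print([round(r, 4) for r in ratios])

# 264 / 150 = 1.76; the 150 is an assumed DPI for the ogrinfo run on the PDF
# (hypothesis only, the thread does not confirm it).
assumed_ratio = 264 / 150
print(all(abs(r - assumed_ratio) < 0.001 for r in ratios))
```

If the ratios cluster around a single constant, the two extents describe the same drawing in differently scaled units, which would mean a -clipsrc window has to be expressed in the units the reader actually uses.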

gdalinfo works in a rasterized world, but it also finds a different size for the document than the one you calculated:

gdalinfo clipsrc-problem.pdf
Driver: PDF/Geospatial PDF
Files: clipsrc-problem.pdf
Size is 5392, 2878
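
The gdalinfo call above does not set GDAL_PDF_DPI, so its raster size need not match the 9489 x 5065 estimate made for 264 DPI. A small sketch, assuming 72 points per inch, that back-computes the DPI implied by the reported size. Comparing against 150 is an assumption based on what the GDAL PDF driver documentation gives as the GDAL_PDF_DPI default.

```python
# Check which DPI would turn the pdfinfo page size (in points) into the
# raster size printed by gdalinfo. Assumption: 1 point = 1/72 inch.
page_pts = (2588.04, 1381.28)   # page size from pdfinfo
raster_px = (5392, 2878)        # size from gdalinfo, run without GDAL_PDF_DPI

# DPI implied by each axis of the reported raster size.
implied_dpi = [px * 72 / pts for px, pts in zip(raster_px, page_pts)]
print([round(d, 2) for d in implied_dpi])

# Size expected at an assumed 150 DPI, for comparison with gdalinfo.
size_at_150 = tuple(round(pts * 150 / 72) for pts in page_pts)
print(size_at_150)
```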

@rouault
Member

rouault commented Jun 29, 2024

I can't reproduce the issue using GDAL master. The GeoJSON files generated with and without -clipsrc are of similar size (the clipped one is slightly smaller because some invalid geometries, such as lines reduced to a single point, are omitted), and they look identical when opened in QGIS (except that the "frame" line of the unclipped file is removed, as expected).

$ ls -al without-clipsrc.geojson with-clipsrc.geojson
-rw-rw-r-- 1 even even 15017544 juin  29 17:53 with-clipsrc.geojson
-rw-rw-r-- 1 even even 16653048 juin  29 17:52 without-clipsrc.geojson

$ ogrinfo -al -so without-clipsrc.geojson
INFO: Open of `without-clipsrc.geojson'
      using driver `GeoJSON' successful.

Layer name: content
Geometry: Unknown (any)
Feature Count: 63161
Extent: (-9.072000, -2489.583000) - (9504.289000, 5068.594000)

$ ogrinfo -al -so with-clipsrc.geojson
INFO: Open of `with-clipsrc.geojson'
      using driver `GeoJSON' successful.

Layer name: content
Geometry: Unknown (any)
Feature Count: 53658
Extent: (234.522000, 21.891000) - (9482.474000, 5032.952000)

I believe that the fixes for the PDF driver that went into 3.9.1 to solve #9870, in particular 53895c7, were key to fixing the issue.

With 3.8, you can work around the issue (for that particular file) by multiplying your -clipsrc values by 10.
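
A minimal sketch of that workaround in Python, which only builds and prints the command rather than running it. The helper `scale_bbox` is hypothetical. The factor of 10 is taken from the comment above and is only claimed for GDAL 3.8 and this particular file; the other arguments are copied from the original report.

```python
import shlex

# Factor suggested above for GDAL 3.8 and this particular file.
SCALE = 10


def scale_bbox(bbox, factor):
    """Multiply every coordinate of (xmin, ymin, xmax, ymax) by `factor`."""
    return tuple(v * factor for v in bbox)


# Clip window from the original report, in 264 DPI pixel units.
bbox = (0, 0, 9489, 5065)
scaled = scale_bbox(bbox, SCALE)

# Rebuild the ogr2ogr call from the issue with the scaled -clipsrc window.
cmd = [
    "ogr2ogr",
    "--config", "OGR_PDF_READ_NON_STRUCTURED", "YES",
    "--config", "GDAL_PDF_DPI", "264",
    "-lco", "COORDINATE_PRECISION=3",
    "-clipsrc", *map(str, scaled),
    "with-clipsrc.geojson", "clipsrc-problem.pdf",
]
print(shlex.join(cmd))
```

On GDAL 3.9.1 or later the unscaled window from the original report should be used instead, since the scaling only compensates for the 3.8 behaviour.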

rouault closed this as completed Jun 29, 2024
@tiffanywei
Author

@rouault Upgrading gdal fixed it, thank you!
