GDAL errors when reading repeatedly from one GDALRasterSource #3184
Comments
GDAL formats in environment
|
@metasim just to confirm, have you tried GDAL 2.4.4? OSGeo/gdal#1244 |
Not sure... I'll try that later today.
Don't know. Took me a week to get to a repeatable test case, so those sorts of refinements are needed. |
@pomadchin Confirmed the bug occurs under GDAL 2.4.4, released 2020/01/08
|
@metasim perfect (in terms of debugging) :D |
Just ran test against this GeoTIFF: And it does complete successfully. Perhaps it's a GDAL JP2 issue? |
@metasim ‾_(ツ)_/‾ this requires a bit more investigation; it may just be because of the random nature of this issue. I wish we could reproduce it on a laptop :/ |
@metasim sounds really sad, slow, and not too reliable |
Looking to try to reproduce at a lower level. |
Wondering if this might be the cause (fixed in 3.0.2): https://github.com/OSGeo/gdal/blob/ee535a1a3f5b35b0d231e1faac89ac1f889f7988/gdal/NEWS#L232-L238 |
@metasim I think it makes sense to try to use GDAL 3.0.4 |
Working on it. |
@pomadchin
|
I was able to hack together a new
from pyrasterframes.utils import gdal_version
gdal_version()
...
'GDAL 3.0.4, released 2020/01/28'
Bad news is that the bug is still there. 😢
|
BTW, it may be worth trying to run the Test Case on a non-AWS Linux machine or Docker container. My laptop is MacOS, so OS is a variable changed between local vs remote execution. It may not have to do with it being EC2 or a particular instance size. |
* feature/gt-3.0:
  Made JP2 GDAL thread lock configurable.
  Added global thread lock on JP2 GDAL file reading. See locationtech/geotrellis#3184
  Refactor clip to clamp
  Use rf_local_clip and other viz functions in supervised learning doc page
  rf_rescale tests and refinements
  Add Rescale function to Scala, Python and SQL APIs
  Add rf_where and rf_standardize functions
  Add rf_local_min, rf_local_max, and rf_local_clip functions
Test case using custom build
Custom
First create a shell in the environment:
Note: Running this locally does not fail. Maybe 8 or more cores are needed? Edit: With Docker on MacOS configured with all 8 cores, the job above does indeed fail. |
Custom
|
Tweaking Number of Cores in Docker
Edit: I was running this at home over mediocre WiFi. The office environment is 1Gbps wired. |
Update on the script to reproduce it. From within the docker container:
Although I do not reproduce the failure with 8 cores.
|
@vpipkt What happens if you leave out the |
@vpipkt Also, if you re-run it, can you do |
I pulled the image again (image id 26d9771deb79), and ran again omitting the explicit |
Same.... on wired internet at work it's passing. 😠 These results were from running it at home on mediocre WiFi. |
When using my phone's hot spot using 8 cores it fails. |
Bandwidth Limiting on MacOS
The "Additional Tools for Xcode 11" package includes a tool called Network Link Conditioner that simulates slow or error-prone networks. With this tool enabled (remember to flip the "On" switch), the test fails. Edit: If it disappears from your System Preferences after install, do this: https://agilewarrior.wordpress.com/2018/10/31/trouble-installing-link-conditioner/ |
Also added a new issue to make this exception more trackable in the future geotrellis/gdal-warp-bindings#83 |
Still getting this error in our environment, but need to confirm that
|
Confirmed |
@metasim gotcha; I'll move it into in progress and will work on geotrellis/gdal-warp-bindings#83 and geotrellis/gdal-warp-bindings#84 next, so that a unique error is thrown when a locked dataset still cannot be acquired. |
After releasing geotrellis/gdal-warp-bindings#83 I will ask you to run the tests again; if this error happens again, we'll add a parameterized timeout setting (it is |
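The two ideas in the comments above (a distinct error when a locked dataset cannot be acquired, and a configurable limit on how long to keep trying) can be sketched in Python as follows. All names here (`DatasetLockTimeout`, `with_dataset_lock`, `attempts`, `timeout_s`) are hypothetical; the real behaviour lives in gdal-warp-bindings and may differ.

```python
import threading


class DatasetLockTimeout(Exception):
    """Hypothetical error type: a dataset lock could not be acquired in time."""


def with_dataset_lock(lock, action, attempts=3, timeout_s=0.05):
    """Run action() under lock, giving up after a bounded number of tries."""
    for _ in range(attempts):
        # acquire() returns False if the timeout elapses without the lock.
        if lock.acquire(timeout=timeout_s):
            try:
                return action()
            finally:
                lock.release()
    # A dedicated exception makes this failure easy to tell apart from
    # ordinary read errors in logs and stack traces.
    raise DatasetLockTimeout(
        f"lock not acquired after {attempts} attempts of {timeout_s}s"
    )
```

Raising a dedicated exception type, rather than a generic read failure, is what lets a caller tell "the dataset was busy" apart from "the dataset is malformed".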
Edit: misinterpreted error message in this environment. Ignore for now while I revisit. |
To be clear, we are still having the error identified here in our environment: #3184 (comment) Just need to fix the RasterFrames notebook to reproduce. |
@metasim could you also print all the available GDALOptions from the application.conf file? println(geotrellis.raster.gdal.config.GDALOptionsConfig.conf) |
Also: running against GDAL 2.4.4 |
@metasim look into configuration: GDALOptionsConfig(...,1048576) |
Here's where the new setting is defined: |
@metasim it didn't pick up; are you sure that the assembly jar contains an appropriate configuration file? |
Good question... I'll double check. |
crap... the assembly merged the geotrellis reference.conf and the rasterframes reference.conf, with the former overriding the latter. Suggestions on how to override a GT setting?... have an |
@metasim application.conf can be the way, and you can also work on the merging strategies of a reference.conf file probably |
Another option is to keep only your reference.conf ._. Or to exclude the GDAL reference.conf |
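The precedence problem discussed here can be modelled with a toy Python sketch. It only imitates the "highest-priority layer wins per key" rule (in Typesafe Config, application.conf takes precedence over reference.conf); it is not the Typesafe Config API, and the library default values below are made up for illustration.

```python
from collections import ChainMap

# Hypothetical flattened settings. Only GDAL_CACHEMAX -> 512 and
# CPL_DEBUG -> ON appear in the thread; the library defaults are invented.
library_reference = {"GDAL_CACHEMAX": "128", "CPL_DEBUG": "OFF"}
project_reference = {"GDAL_CACHEMAX": "512", "CPL_DEBUG": "ON"}


def resolve(*layers):
    """Merge layers given highest priority first; the first hit wins per key."""
    return dict(ChainMap(*layers))


# If the library's reference layer ends up ahead of the project's,
# the project's overrides are silently lost.
clobbered = resolve(library_reference, project_reference)

# Moving the overrides into a layer that always outranks every reference
# layer (the role application.conf plays) makes the outcome independent
# of how the reference layers were merged.
application = dict(project_reference)
fixed = resolve(application, library_reference, project_reference)
```

Here `clobbered["GDAL_CACHEMAX"]` is `"128"` while `fixed["GDAL_CACHEMAX"]` is `"512"`, which mirrors why printing the resolved configuration is a quick check that an override actually took effect.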
Testing with |
@metasim have you printed the |
I moved the geotrellis overrides to an 'GDALOptionsConfig(Map(CPL_VSIL_CURL_CHUNK_SIZE -> 1000000, CPL_VSIL_CURL_ALLOWED_EXTENSIONS -> .tif,.tiff,.jp2,.mrf,.idx,.lrc,.mrf.aux.xml,.vrt, AWS_REQUEST_PAYER -> requester, GDAL_HTTP_MAX_RETRY -> 10, CPL_DEBUG -> ON, GDAL_PAM_ENABLED -> NO, GDAL_DISABLE_READDIR_ON_OPEN -> YES, GDAL_CACHEMAX -> 512, GDAL_HTTP_RETRY_DELAY -> 2),List(SOURCE, WARPED),2147483647)' |
Sadly, after all this, still appears to be happening:
Py4JJavaError: An error occurred while calling o123.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 135 in stage 1.0 failed 1 times, most recent failure: Lost task 135.0 in stage 1.0 (TID 192, localhost, executor driver): java.lang.IllegalArgumentException: Error fetching data for one of: GDALRasterSource(s3://sentinel-s2-l2a/tiles/22/L/EP/2019/5/31/0/R60m/B08.jp2), GDALRasterSource(s3://sentinel-s2-l2a/tiles/22/L/EP/2019/5/31/0/R60m/B12.jp2), GDALRasterSource(s3://sentinel-s2-l2a/tiles/22/L/EP/2019/9/13/0/R60m/B08.jp2), GDALRasterSource(s3://sentinel-s2-l2a/tiles/22/L/EP/2019/9/13/0/R60m/B12.jp2)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:81)
    at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:95)
    at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:92)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
    at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: geotrellis.raster.gdal.MalformedDataException: Unable to construct a RasterExtent from the Transformation given. GDAL Error Code: 4
    at geotrellis.raster.gdal.GDALDataset$.rasterExtent$extension1(GDALDataset.scala:143)
    at geotrellis.raster.gdal.GDALRasterSource.gridExtent$lzycompute(GDALRasterSource.scala:93)
    at geotrellis.raster.gdal.GDALRasterSource.gridExtent(GDALRasterSource.scala:93)
    at geotrellis.raster.RasterMetadata$class.cols(RasterMetadata.scala:52)
    at geotrellis.raster.RasterSource.cols(RasterSource.scala:44)
    at org.locationtech.rasterframes.ref.SimpleRasterInfo$.apply(SimpleRasterInfo.scala:71)
    at org.locationtech.rasterframes.ref.GDALRasterSource$$anonfun$tiffInfo$1.apply(GDALRasterSource.scala:53)
    at org.locationtech.rasterframes.ref.GDALRasterSource$$anonfun$tiffInfo$1.apply(GDALRasterSource.scala:53)
    at scala.compat.java8.functionConverterImpls.AsJavaFunction.apply(FunctionConverters.scala:262)
    at com.github.benmanes.caffeine.cache.LocalCache.lambda$statsAware$0(LocalCache.java:139)
    at com.github.benmanes.caffeine.cache.UnboundedLocalCache.lambda$computeIfAbsent$2(UnboundedLocalCache.java:238)
    at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
    at com.github.benmanes.caffeine.cache.UnboundedLocalCache.computeIfAbsent(UnboundedLocalCache.java:234)
    at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
    at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
    at com.github.blemale.scaffeine.Cache.get(Cache.scala:40)
    at org.locationtech.rasterframes.ref.SimpleRasterInfo$.apply(SimpleRasterInfo.scala:49)
    at org.locationtech.rasterframes.ref.GDALRasterSource.tiffInfo(GDALRasterSource.scala:53)
    at org.locationtech.rasterframes.ref.GDALRasterSource.extent(GDALRasterSource.scala:57)
    at org.locationtech.rasterframes.ref.RFRasterSource.rasterExtent(RFRasterSource.scala:71)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs$$anonfun$1.apply(RasterSourceToRasterRefs.scala:65)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs$$anonfun$1.apply(RasterSourceToRasterRefs.scala:63)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:63)
    ... 29 more
Reproducible in this notebook environment:
with this notebook and data: gt-3184-test.zip You can run the notebook on the command line with |
Notebook viewable here: https://gist.github.com/metasim/1734fee3eefc4474a0f269aa976394a0 |
@metasim okay, this is another error code after all: error code 4. How do I run this notebook? Do I need ec2 m4.x4large?
I get the error running on my 8 core macbook. To run, unpack gt-3184-test.zip and execute this:
edit: no coredumps |
@pomadchin I have a sneaking suspicion that "GDAL Error Code: 4" here might be triggered by an AWS identity error caused by reading a requester-pays bucket and not having a |
@metasim Error code 4 means GDAL failed to open the dataset, so that could be the case; waiting for confirmation from you, then. Thanks for the update! |
Pretty sure this last round was a false alarm due to:
With all those things addressed, the original test case now completes. I suggest this ticket be closed once an updated GT referencing |
@metasim 👍 |
GDAL 1.0.0 is published! Also look at the CHANGELOG for all the changes that were part of this release. Closing it now; feel free to reopen or open a new issue if something like this happens again! |
This error originated in some RasterFrames work. We have a table where one column is predominantly the same file and the analysis fails with one of a number of errors from GDALDataset, such as: or
(See below for extended output)
I removed RasterFrames from the mix, resulting in the test case below. (At this point I have not further reduced it to get Spark out of the mix with, say, Futures instead.) It should be noted that some of the reads complete successfully.
When I run it on my laptop it completes successfully, but when I run it on a beefier EC2 instance (m5a.2xlarge) it fails. I suspect the concurrency level and I/O throughput set the conditions. It appears to work when setting --master=local[1].
Edit: my laptop is MacOS, whereas the EC2 instance is Linux. That may be the pertinent variable instead of instance size.
Ran in docker locally with 4 cores and the job succeeded.
Edit: Configured docker to run with 8 cores on my laptop and it failed!
Test Case
RSRead.scala
Execution Command
Using Spark 2.4.4, Scala 2.11.12, GDAL 2.4.3 (released 2019/10/28)
Sample Backtrace
Full log output
cc: @vpipkt