Spark parquet writes to s3 fail with "InvalidDigest: The Content-MD5 you specified was invalid" #1152
Comments
@mvpc I am encountering this same issue. Have you found a workaround for this?
@lyle-nel I ended up monkey-patching localstack's
Had the same issue during saving, and solved the problem by setting `sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.fast.upload.buffer", "bytebuffer")`.
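For anyone applying this workaround from Python rather than Scala, the same Hadoop option can be passed when building the session via Spark's standard `spark.hadoop.` prefix. A minimal sketch (the app name is arbitrary, and this is only an illustration of the setting above, not code from the thread):

```python
from pyspark.sql import SparkSession

# Sketch only: the "spark.hadoop." prefix forwards the option to the
# Hadoop configuration, so it reaches the S3A filesystem client.
# "bytebuffer" selects the in-memory upload buffer instead of the
# default disk-buffered path whose requests localstack rejects.
spark = (
    SparkSession.builder
    .appName("localstack-md5-workaround")  # arbitrary name
    .config("spark.hadoop.fs.s3a.fast.upload.buffer", "bytebuffer")
    .getOrCreate()
)
```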
It works, thanks! But one more note:
Hi, I am using localstack s3 in unit tests for code where pyspark reads and writes parquet to s3. Reads work great, but during writes I'm encountering `InvalidDigest: The Content-MD5 you specified was invalid`. The code works just fine with real s3. It looks like the errors happen when localstack's `s3_listener.ProxyListenerS3.forward_request` calls `check_content_md5`. All the failed requests are PUTs into temporary `.snappy.parquet` files, with empty data; here is an example request:

I see a similar problem mentioned in #1140, but that fix did not help with my issue (I did try upgrading to localstack 0.9.0), so it looks like this parquet issue is different.
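As background (not from this thread): for S3, the `Content-MD5` header must be the base64 encoding of the raw 16-byte MD5 digest of the request body, and that is what a check like localstack's `check_content_md5` validates. A quick illustration with the empty body the failing PUTs carry:

```python
import base64
import hashlib

# S3 expects Content-MD5 to be the base64 of the *raw* MD5 digest,
# not of the hex string -- a common source of InvalidDigest errors.
body = b""  # the failing parquet PUTs have empty data
content_md5 = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")
print(content_md5)  # -> 1B2M2Y8AsgTpgAmY7PhCfg==
```

Any request whose header differs from this value for its body would be rejected with `InvalidDigest`.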
To reproduce
I'm using python 2.7.12 on mac.
Install localstack and pyspark into a virtualenv, add jars for communication with s3:
In one tab, start localstack s3:

In another tab, start a pyspark shell:

In the pyspark shell, run the following:
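(The original snippet was not captured above. A minimal sketch of what such a write looks like, assuming a bucket named `test-bucket`, localstack's 0.x-era default S3 endpoint on port 4572, and dummy credentials — all assumptions, not details from the report:)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Point the S3A client at localstack instead of real s3.
# Endpoint, keys, and bucket name below are hypothetical.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.endpoint", "http://localhost:4572")
hconf.set("fs.s3a.access.key", "test")
hconf.set("fs.s3a.secret.key", "test")
hconf.set("fs.s3a.path.style.access", "true")

# A trivial DataFrame; the write triggers the failing parquet PUTs.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df.write.parquet("s3a://test-bucket/out.parquet")
```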
Observe InvalidDigest errors like: