
Cannot access S3 with protocols s3n or s3a #538

Closed
spaghettifunk opened this issue Jan 5, 2018 · 25 comments
Labels
area: configuration Configuring LocalStack type: question Please ask questions on discuss.localstack.cloud

Comments

@spaghettifunk

spaghettifunk commented Jan 5, 2018

I have a Java Spark job that accesses an S3 bucket using the s3a protocol, but with localstack I'm running into access issues. Here are the details of what I've done:

In my pom.xml I imported:

    <dependency>
      <groupId>cloud.localstack</groupId>
      <artifactId>localstack-utils</artifactId>
      <version>0.1.9</version>
    </dependency>

The class I'm trying to test looks like this:

public class Session {

    private static SparkSession sparkSession = null;
    
    public SparkSession getSparkSession() {
        if (sparkSession == null) {
            sparkSession = SparkSession.builder()
                .appName("my-app)
                .master("local)
                .getOrCreate();

            Configuration hadoopConf = sparkSession.sparkContext().hadoopConfiguration();
            hadoopConf.set("fs.s3a.access.key", "test");
            hadoopConf.set("fs.s3a.secret.key", "test");
            hadoopConf.set("fs.s3a.endpoint", "http://test.localhost.atlassian.io:4572");           
        }
        return sparkSession;
    }
   
    public Dataset<Row> readJson(Seq<String> paths) {
        // Paths example: "s3a://testing-bucket/key1", "s3a://testing-bucket/key2", ...
        return this.getSparkSession().read().json(paths);
    }
}

My testing class looks like this:

@RunWith(LocalstackTestRunner.class)
public class BaseTest {

    private static AmazonS3 s3Client = null;

    @BeforeAll
    public static void setup() {
        EndpointConfiguration configuration = new AwsClientBuilder.EndpointConfiguration(
            LocalstackTestRunner.getEndpointS3(),
            LocalstackTestRunner.getDefaultRegion());

        s3Client = AmazonS3ClientBuilder.standard()
            .withEndpointConfiguration(configuration)
            .withCredentials(TestUtils.getCredentialsProvider())
            .withChunkedEncodingDisabled(true)
            .withPathStyleAccessEnabled(true)
            .build();

        String bucketName = "testing-bucket";
        s3Client.createBucket(bucketName);
        
        // load files here: I followed the example under the folder ext/java
        // to put some objects in there
    }

    @Test
    public void testOne() {
         Session s = new Session();
         Seq<String> paths = new Seq<String>("....");
        
         Dataset<Row> files = s.readJson(paths); // <--- error here
    }
}

I receive two kinds of errors depending on the protocol I use. With the s3a protocol I get the following:

org.apache.spark.sql.AnalysisException: Path does not exist: s3a://testing-bucket/key1;

	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:360)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:348)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.immutable.List.flatMap(List.scala:344)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:348)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
	at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:333)
	at com.company.Session.readJson(Session.java:63)
	at test.SessionTest.testSession(SessionTest.java:23)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at cloud.localstack.LocalstackTestRunner.run(LocalstackTestRunner.java:129)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)

If I use s3n (with the appropriate Spark configuration), I receive this:

       .....
	at test.SessionTest.testSession(SessionTest.java:23)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at cloud.localstack.LocalstackTestRunner.run(LocalstackTestRunner.java:129)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Caused by: org.jets3t.service.S3ServiceException: Service Error Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>AllAccessDisabled</Code><Message>All access to this object has been disabled</Message><RequestId>EDDCE84A7470E7EC</RequestId><HostId>pKR4W3eUs6UCc0LU+QGifEvja3xMA4SfaBZRDt8JKoiu450VJtKoOaRhK/B5wD4f1iBu6HwUlWk=</HostId></Error>
	at org.jets3t.service.S3Service.getObject(S3Service.java:1470)
	at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get(Jets3tFileSystemStore.java:164)
	... 51 more

I can't tell whether this is a bug or whether I'm missing something. Thanks!

@whummer
Member

whummer commented Jan 9, 2018

Thanks for reporting @davideberdin. Can you try running your example with S3 path-style access enabled? (This is also briefly mentioned in the Troubleshooting section of the README.)

        hadoopConf.set("fs.s3a.path.style.access", "true");

@whummer whummer added type: question Please ask questions on discuss.localstack.cloud area: configuration Configuring LocalStack labels Jan 9, 2018
@spaghettifunk
Author

spaghettifunk commented Jan 9, 2018

Hi @whummer, thanks for your answer! I tried setting that property, but I still have the same issue:

java.nio.file.AccessDeniedException: s3a://testing-bucket/key1: getFileStatus on s3a://testing-bucket/key1: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 5A704DAC25D2DA09), S3 Extended Request ID: CE3wcJ/I5mF0t3CHoqAAf6fLkFJYVgKzIehHeXvYKWt0FZLOYFg6kFiLRRGTuEFl8SOZ9/h/IGQ=

Do you know if I should somehow configure the ACL of the bucket, or something related to the file in S3? This is how I put the object into my bucket:

AmazonS3 client = TestUtils.getClientS3();
byte[] dataBytes = .... // my content here

ObjectMetadata metaData = new ObjectMetadata();
metaData.setContentType(ContentType.APPLICATION_JSON.toString());
metaData.setContentEncoding(StandardCharsets.UTF_8.name());
metaData.setContentLength(dataBytes.length);

byte[] resultByte = DigestUtils.md5(dataBytes);
String streamMD5 = new String(Base64.encodeBase64(resultByte));
metaData.setContentMD5(streamMD5);

PutObjectRequest putObjectRequest = new PutObjectRequest("testing-bucket", "key1", new ByteArrayInputStream(dataBytes), metaData);

client.putObject(putObjectRequest);

@whummer
Member

whummer commented Jan 9, 2018

One thing I noticed - can you double check the bucket names? The example in the first post uses "testing-bucket" for createBucket, but "test-bucket" for the paths in readJson. Also, please use the s3a protocol (it appears that s3n is deprecated). It should not be necessary to set a bucket ACL, since you're already using a (fake) access key and secret key.

@spaghettifunk
Author

I checked the naming carefully, and in my code I'm using the same name. I made a typo here; let me correct it. Sorry for the inconvenience.

@hilios

hilios commented Feb 6, 2018

I am facing a similar problem, but using the hadoop-aws library directly. My program tries to read a Parquet file (using parquet-avro) from localstack, but it cannot:

// Create bucket
val testBucket = s"test-bucket-${UUID.randomUUID()}"
val localstack = new EndpointConfiguration("http://s3:4572", Regions.US_EAST_1.getName)
val s3 = AmazonS3ClientBuilder.standard()
    .withEndpointConfiguration(localstack)
    .withPathStyleAccessEnabled(true)
    .build()
s3.createBucket(testBucket)
// Upload test file
val parquet = new File(getClass.getResource("/track.parquet").toURI)
val obj = new PutObjectRequest(testBucket, "track.parquet", parquet)
s3.putObject(obj)

// Hadoop configuration
val configuration = new Configuration()
configuration.set("fs.s3a.endpoint", "http://s3:4572")
configuration.set("fs.s3a.access.key", "<empty>")
configuration.set("fs.s3a.secret.key", "<empty>")
configuration.set("fs.s3a.path.style.access", "true")

// Read parquet
val path = new Path(s"s3a://$testBucket/track.parquet")
val reader = AvroParquetReader.builder[GenericRecord](path).withConf(configuration).build()

println(reader.read()) // This line is never executed, but no exception is thrown

I can access the file normally with the AWS CLI and SDK, but I have no clue what is going on with the ParquetReader.

Is there any way to debug the S3 calls made to localstack? That way I think it would be easier to track whether the error is in localstack, in parquet-avro, or in my code.

One last thing: the AvroParquetReader works fine when it is called with a local path.
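
On the debugging question: one option (a sketch, assuming the AWS SDK for Java v1 and log4j 1.x, which Spark 2.x/Hadoop 2.x ship with) is to raise the SDK's request logger to DEBUG so every call and response sent to the endpoint is printed:

```java
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

// Hedged sketch: enable AWS SDK v1 request/response logging (and, optionally, raw HTTP wire logging).
public final class S3DebugLogging {
    public static void enable() {
        Logger.getLogger("com.amazonaws.request").setLevel(Level.DEBUG); // one line per request/response
        Logger.getLogger("org.apache.http.wire").setLevel(Level.DEBUG);  // full wire dump (very verbose)
    }
}
```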

@sam-reh-hs

sam-reh-hs commented Apr 27, 2018

This was broken for me as well. It was related to Content-Length always being returned as 0 on a HEAD request.
I upgraded to localstack 0.8.6 and it works 😸
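
If you suspect the same problem, a quick way to check is to issue the HEAD request yourself via the SDK and inspect the reported length (a minimal sketch; bucket and key are placeholders):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;

// Hedged sketch: getObjectMetadata() performs a HEAD request, so the returned Content-Length
// should match the size of the uploaded object rather than 0.
public final class HeadCheck {
    public static long reportedLength(AmazonS3 s3, String bucket, String key) {
        ObjectMetadata meta = s3.getObjectMetadata(bucket, key);
        return meta.getContentLength();
    }
}
```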

@ghost

ghost commented Sep 25, 2018

Late to the party, but it seems that writing to S3 also isn't working from Spark when using the s3a protocol. It gives me MD5 check errors.

Exception in thread "main" org.apache.hadoop.fs.s3a.AWSClientIOException: innerMkdirs on s3a://bucket-958abef2-b13e-4778-8a89-dc5d0a6aae21/input.csv/_temporary/0: com.amazonaws.SdkClientException: Unable to verify integrity of data upload.  Client calculated content hash (contentMD5: 1B2M2Y8AsgTpgAmY7PhCfg== in base 64) didn't match hash (etag: accd48352b8de701213f0d8fa29bf438 in hex) calculated by Amazon S3.  You may need to delete the data stored in Amazon S3. (metadata.contentMD5: null, md5DigestStream: com.amazonaws.services.s3.internal.MD5DigestCalculatingInputStream@487069c6, bucketName: bucket-958abef2-b13e-4778-8a89-dc5d0a6aae21, key: input.csv/_temporary/0/): Unable to verify integrity of data upload.  Client calculated content hash (contentMD5: 1B2M2Y8AsgTpgAmY7PhCfg== in base 64) didn't match hash (etag: accd48352b8de701213f0d8fa29bf438 in hex) calculated by Amazon S3.  You may need to delete the data stored in Amazon S3. (metadata.contentMD5: null, md5DigestStream: com.amazonaws.services.s3.internal.MD5DigestCalculatingInputStream@487069c6, bucketName: bucket-958abef2-b13e-4778-8a89-dc5d0a6aae21, key: input.csv/_temporary/0/)

Spark Code:

from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import SparkSession
from uuid import uuid4
spark = SparkSession\
        .builder\
        .appName("S3Write")\
        .getOrCreate()
spark.sparkContext.setLogLevel("DEBUG")
sc = spark.sparkContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('file:///genie/datafabric/power/spark_executor/latest/299423.csv')
fname = str(uuid4()) + '.csv'
df.write.csv('s3a://bucket-958abef2-b13e-4778-8a89-dc5d0a6aae21/'+fname)
df.show(5)
print(df.count())
Spark Version: 2.1.0
Hadoop Version: 2.8.0
AWS SDK: 1.11.228
Localstack (Docker): tried with latest, 0.8.7, 0.8.6

P.S. I am able to write to actual S3 with the same code.

@vishal98

vishal98 commented Nov 5, 2018

I am trying to read a CSV file from S3 using the following code:

s3_read_df = spark.read \
        .option("delimiter", ",") \
        .format("csv") \
        .load("s3a://tutorial/test.csv")

I'm getting an XML parsing error for it:


py4j.protocol.Py4JJavaError: An error occurred while calling o36.load.
: com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
        at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738)
        at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3480)
    

@vishal98

vishal98 commented Nov 9, 2018

I was able to read/write on localstack S3 with the Spark 2.3.1 "without Hadoop" build and Hadoop 2.8.4. You will need to set the parameters mentioned in this link to make the whole thing work:
http://spark.apache.org/docs/latest/hadoop-provided.html

bskarda added a commit to bskarda/localstack that referenced this issue Feb 11, 2019
Issue is outlined in localstack#869 and localstack#538
(specifically localstack#538 (comment)).

One of the external symptoms of the bug is that S3 interactions for
objects that are only the null string (zero-length objects) throw
an error like
`Client calculated content hash (contentMD5: 1B2M2Y8AsgTpgAmY7PhCfg== in base 64) didn't match hash (etag: ...`
which is telling, because the `contentMD5` value is the base64-encoded
MD5 of the null string. The correct etag value should be
`d41d8cd98f00b204e9800998ecf8427e`.

On multiple executions of the test case the etag generated by the S3
server is different, so some non-deterministic value is being put
into the hash function. I haven't been able to find where the actual
issue exists in the code, as I don't use Python very often.

One workaround for this issue is to disable MD5 validation
in the S3 client. For Java, that means setting the following properties
to any value:
```
com.amazonaws.services.s3.disableGetObjectMD5Validation
com.amazonaws.services.s3.disablePutObjectMD5Validation
```
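
As a sketch of that workaround, the properties can be set programmatically before the S3 client is built (they only need to be present; the value itself is not significant):

```java
// Hedged sketch: disable client-side MD5 validation in the AWS SDK for Java v1 by defining
// the system properties named above. Do this before the AmazonS3 client is constructed,
// e.g. in a test setup method.
public final class DisableMd5Validation {
    public static void apply() {
        System.setProperty("com.amazonaws.services.s3.disableGetObjectMD5Validation", "true");
        System.setProperty("com.amazonaws.services.s3.disablePutObjectMD5Validation", "true");
    }
}
```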
@i-aggarwal

> Late to the party, but it seems that writing to S3 also isn't working from Spark when using the s3a protocol. It gives me MD5 check errors. […]

@avn4986, were you able to sort this out? I have the same issue.

@risdenk
Contributor

risdenk commented Mar 20, 2019

One of the current issues with s3a and localstack is rename. I am putting together a small test project that shows the issue. Using s3a to rename results in checksum errors.

@risdenk
Contributor

risdenk commented Mar 20, 2019

Created https://github.com/risdenk/s3a-localstack to demonstrate the S3A rename issue. This uses localstack docker with JUnit.

Example output:

testS3ALocalStackFileSystem(io.github.risdenk.hadoop.s3a.TestS3ALocalstack)  Time elapsed: 4.562 sec  <<< ERROR!
org.apache.hadoop.fs.s3a.AWSBadRequestException: copyFile(YQuGJ, lzKBU) on YQuGJ: com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you specified was invalid (Service: Amazon S3; Status Code: 400; Error Code: InvalidDigest; Request ID: null; S3 Extended Request ID: null), S3 Extended Request ID: null:InvalidDigest: The Content-MD5 you specified was invalid (Service: Amazon S3; Status Code: 400; Error Code: InvalidDigest; Request ID: null; S3 Extended Request ID: null)
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:224)
	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:111)
	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:125)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.copyFile(S3AFileSystem.java:2541)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerRename(S3AFileSystem.java:996)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.rename(S3AFileSystem.java:863)
	at io.github.risdenk.hadoop.s3a.TestS3ALocalstack.testS3ALocalStackFileSystem(TestS3ALocalstack.java:123)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at cloud.localstack.docker.LocalstackDockerTestRunner.run(LocalstackDockerTestRunner.java:43)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
	at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
	at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you specified was invalid (Service: Amazon S3; Status Code: 400; Error Code: InvalidDigest; Request ID: null; S3 Extended Request ID: null), S3 Extended Request ID: null
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1640)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1058)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4368)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4315)
	at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1890)
	at com.amazonaws.services.s3.transfer.internal.CopyCallable.copyInOneChunk(CopyCallable.java:146)
	at com.amazonaws.services.s3.transfer.internal.CopyCallable.call(CopyCallable.java:134)
	at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:132)
	at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:43)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

@steveloughran

This looks like a checksum issue. The AWS client builds an MD5 checksum as it uploads data and then compares it with what it got back; if they differ, it assumes corruption en route.
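
This lines up with the values in the errors above: `1B2M2Y8AsgTpgAmY7PhCfg==` is simply the MD5 of zero bytes in Base64, while the same digest in hex is `d41d8cd98f00b204e9800998ecf8427e`. A small self-contained check:

```java
import java.security.MessageDigest;
import java.util.Base64;

// Computes the MD5 of an empty byte array and prints it in hex and Base64, matching the
// "contentMD5" value seen in the failing requests earlier in this thread.
public final class EmptyMd5 {
    public static void main(String[] args) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(new byte[0]);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        System.out.println(hex);                                         // d41d8cd98f00b204e9800998ecf8427e
        System.out.println(Base64.getEncoder().encodeToString(digest));  // 1B2M2Y8AsgTpgAmY7PhCfg==
    }
}
```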

@lyle-nel

@steveloughran Have you found a workaround for this?

@lyle-nel

Related: #1152 #538 #869 #1120

@risdenk
Contributor

risdenk commented May 22, 2019

Based on @mvpc's comment here: #1152 (comment)

Looks like the problem is in this method:
https://github.com/localstack/localstack/blame/master/localstack/services/s3/s3_listener.py#L311

@risdenk
Contributor

risdenk commented May 23, 2019

Problem Statement - s3a rename

I finally had some time to sit down and look at the s3a rename failure. In the s3a case, the AWS Java SDK issues a CopyObjectRequest here [1]. The ObjectMetadata is copied onto this request, and so the Content-MD5 header is sent. It is unclear from the AWS docs [2] whether Content-MD5 should be set on the request.

Localstack assumes that if the Content-MD5 header is set, then the MD5 of the data must be checked [3]. The content ends up being empty on a copy request, and when Localstack calculates the MD5 of '' it doesn't match what the AWS SDK/s3a set as the Content-MD5 [4].

In the case of, say, Minio, there is a separate copy request handler that doesn't check the Content-MD5 header [5].

Potential Solution

I think the best option here is to make Localstack skip the Content-MD5 check for copy requests (as Minio does).

An alternative is to make Apache Hadoop S3AFilesystem not set the Content-MD5 metadata for a copy request. This isn't ideal for 2 reasons:

  • There are existing s3a users out there, so this wouldn't work for existing clients
  • S3AFilesystem works against Amazon S3 currently

[1] https://github.com/apache/hadoop/blob/branch-3.2.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L2541
[2] https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectCOPY.html
[3] https://github.com/localstack/localstack/blob/master/localstack/services/s3/s3_listener.py#L444
[4] https://github.com/localstack/localstack/blame/master/localstack/services/s3/s3_listener.py#L311
[5] https://github.com/minio/minio/blob/master/cmd/object-handlers.go#L623
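
To make the failing request shape concrete, here is a hedged, minimal sketch (not taken from s3a itself; the endpoint, bucket, and keys are placeholders) of a copy whose replacement metadata carries a Content-MD5 value, which is the pattern described in [1] and [4]. Depending on the SDK version it may or may not reproduce the error exactly:

```java
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CopyObjectRequest;
import com.amazonaws.services.s3.model.ObjectMetadata;

// Hedged reproduction sketch: a server-side copy whose new metadata includes Content-MD5.
// The copy body itself is empty, so a listener that validates Content-MD5 against the body
// would reject it with InvalidDigest (the behaviour analysed above).
public final class CopyWithContentMd5 {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
            .withEndpointConfiguration(new EndpointConfiguration("http://localhost:4572", "us-east-1"))
            .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials("test", "test")))
            .withPathStyleAccessEnabled(true)
            .build();

        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentMD5("1B2M2Y8AsgTpgAmY7PhCfg==");    // MD5 of an empty body, as in the errors above

        CopyObjectRequest copy = new CopyObjectRequest("bucket", "src-key", "bucket", "dst-key");
        copy.setNewObjectMetadata(meta);                   // metadata replacement, carrying Content-MD5
        s3.copyObject(copy);                               // assumes src-key already exists in the bucket
    }
}
```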

@steveloughran

An alternative is to make Apache Hadoop S3AFilesystem not set the Content-MD5 metadata for a copy request. This isn't ideal for 2 reasons:

There are existing s3a users out there, so this wouldn't work for existing clients

valid point

S3AFilesystem works against Amazon S3 currently

true, but we don't want to preclude other stores. Looking at the last time anyone touched the metadata propagation code, it actually came from EMC in HADOOP-11687.

One suggestion here is that someone checks out the hadoop source code and runs the hadoop-aws tests against this new s3a endpoint. Then, if there are problems, they can be fixed at either end. If we shouldn't be setting the Content-MD5, then we shouldn't be setting the Content-MD5.

@whummer
Member

whummer commented Aug 25, 2019

Great analysis, thanks @risdenk @steveloughran. How should we proceed with this issue - would you be able to create a pull request with the proposed changes? Thanks!

@steveloughran

  1. I am actually doing some copy-related work in HADOOP-16490 ("Improve S3Guard handling of FNFEs in copy", apache/hadoop#1229), but this should be separate if it stands any chance of being backported.
  2. The probability of me finding the time to work on this is ~0.
  3. But if others were to do a patch and, after complying with our test requirements, repeatedly harass me to look at it, I'll take a look.

My first recommendation would be for you to use the hadoop-aws test suite as part of your own integration testing - "this is what the big data apps expect". Let's see what happens.

@risdenk
Contributor

risdenk commented Aug 28, 2019

I can make a pull request for not checking the MD5 if it is a copy. It is a simple one-line change. Not sure where the tests are. Let me put up the PR and go from there.

My first recommendation would be for you to use the hadoop-aws test suite as part of your own integration testing - "this is what the big data apps expect". Let's see what happens.

I tried to take a crack at this tonight and ran into a bunch of issues with "Change detection policy requires ETag" errors.
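
For reference, a hedged sketch of one way past that particular failure (assumption: Hadoop 3.2+, where the `fs.s3a.change.detection.*` options exist; check the hadoop-aws documentation for your version). It relaxes the change-detection policy so a store that does not return usable ETags is tolerated, at the cost of losing that consistency check:

```java
import org.apache.hadoop.conf.Configuration;

// Hedged sketch: relax s3a change detection for stores that do not return usable ETags.
// Option names assume Hadoop 3.2+; verify against your hadoop-aws version.
public final class RelaxChangeDetection {
    public static Configuration apply(Configuration conf) {
        conf.set("fs.s3a.change.detection.mode", "warn");                    // warn instead of failing
        conf.setBoolean("fs.s3a.change.detection.version.required", false);  // tolerate missing ETags
        return conf;
    }
}
```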

risdenk added a commit to risdenk/localstack that referenced this issue Aug 28, 2019
risdenk added a commit to risdenk/localstack that referenced this issue Aug 28, 2019
Fixes localstack#538

Signed-off-by: Kevin Risden <krisden@apache.org>
@risdenk
Contributor

risdenk commented Aug 28, 2019

@whummer created PR #1510 for the MD5 copy check change.

risdenk added a commit to risdenk/localstack that referenced this issue Aug 29, 2019
Fixes localstack#538

Signed-off-by: Kevin Risden <krisden@apache.org>
@whummer
Member

whummer commented Aug 29, 2019

@risdenk thanks for the PR! With #1510 being merged, can this issue be closed from your perspective?

@risdenk
Contributor

risdenk commented Aug 29, 2019

@whummer - PR #1510 definitely fixes the rename issues that were reported here with the MD5 being invalid. Can't say it fixes every potential problem :)

@whummer
Member

whummer commented Aug 29, 2019

Thanks. Closing this issue for now. Please create follow-up issues with specific details if any new problems come up.
