
Cannot access S3 with protocols s3n or s3a #538

Closed
spaghettifunk opened this issue Jan 5, 2018 · 25 comments
Labels
area: configuration Configuring LocalStack type: question Please ask questions on discuss.localstack.cloud

Comments

@spaghettifunk

spaghettifunk commented Jan 5, 2018

I have a Java Spark job that accesses an S3 bucket using the s3a protocol, but with localstack I'm running into access issues. Here are the details of what I've done:

In my pom.xml I imported:

    <dependency>
      <groupId>cloud.localstack</groupId>
      <artifactId>localstack-utils</artifactId>
      <version>0.1.9</version>
    </dependency>

The class I'm trying to test looks like this:

public class Session {

    private static SparkSession sparkSession = null;
    
    public SparkSession getSparkSession() {
        if (sparkSession == null) {
            sparkSession = SparkSession.builder()
                .appName("my-app)
                .master("local)
                .getOrCreate();

            Configuration hadoopConf = sparkSession.sparkContext().hadoopConfiguration();
            hadoopConf.set("fs.s3a.access.key", "test");
            hadoopConf.set("fs.s3a.secret.key", "test");
            hadoopConf.set("fs.s3a.endpoint", "http://test.localhost.atlassian.io:4572");           
        }
        return sparkSession;
    }
   
    public Dataset<Row> readJson(Seq<String> paths) {
        // Paths example: "s3a://testing-bucket/key1", "s3a://testing-bucket/key2", ...
        return this.getSparkSession().read().json(paths);
    }
}

My testing class looks like this:

@RunWith(LocalstackTestRunner.class)
public class BaseTest {

    private static AmazonS3 s3Client = null;

    @BeforeAll
    public static void setup() {
        EndpointConfiguration configuration = new AwsClientBuilder.EndpointConfiguration(
            LocalstackTestRunner.getEndpointS3(),
            LocalstackTestRunner.getDefaultRegion());

        s3Client = AmazonS3ClientBuilder.standard()
            .withEndpointConfiguration(configuration)
            .withCredentials(TestUtils.getCredentialsProvider())
            .withChunkedEncodingDisabled(true)
            .withPathStyleAccessEnabled(true)
            .build();

        String bucketName = "testing-bucket";
        s3Client.createBucket(bucketName);
        
        // load files here: I followed the example under the folder ext/java
        // to put some objects in there
    }

    @Test
    public void testOne() {
         Session s = new Session();
         Seq<String> paths = new Seq<String>("....");
        
         Dataset<Row> files = s.readJson(paths); // <--- error here
    }
}

I receive two kinds of errors depending on the protocol I use. With the s3a protocol I get the following:

org.apache.spark.sql.AnalysisException: Path does not exist: s3a://testing-bucket/key1;

	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:360)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:348)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.immutable.List.flatMap(List.scala:344)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:348)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
	at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:333)
	at com.company.Session.readJson(Session.java:63)
	at test.SessionTest.testSession(SessionTest.java:23)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at cloud.localstack.LocalstackTestRunner.run(LocalstackTestRunner.java:129)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)

If I use s3n (with the appropriate Spark configuration), I receive this:

       .....
	at test.SessionTest.testSession(SessionTest.java:23)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at cloud.localstack.LocalstackTestRunner.run(LocalstackTestRunner.java:129)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Caused by: org.jets3t.service.S3ServiceException: Service Error Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>AllAccessDisabled</Code><Message>All access to this object has been disabled</Message><RequestId>EDDCE84A7470E7EC</RequestId><HostId>pKR4W3eUs6UCc0LU+QGifEvja3xMA4SfaBZRDt8JKoiu450VJtKoOaRhK/B5wD4f1iBu6HwUlWk=</HostId></Error>
	at org.jets3t.service.S3Service.getObject(S3Service.java:1470)
	at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get(Jets3tFileSystemStore.java:164)
	... 51 more

I can't tell whether this is a bug or whether I'm missing something. Thanks!

@whummer
Member

whummer commented Jan 9, 2018

Thanks for reporting @davideberdin. Can you try running your example with S3 path-style access enabled? (This is also briefly mentioned in the Troubleshooting section of the README.)

        hadoopConf.set("fs.s3a.path.style.access", "true");

@whummer whummer added type: question Please ask questions on discuss.localstack.cloud area: configuration Configuring LocalStack labels Jan 9, 2018
@spaghettifunk
Author

spaghettifunk commented Jan 9, 2018

Hi @whummer, thanks for your answer! I tried setting that property, but I still have the same issue:

java.nio.file.AccessDeniedException: s3a://testing-bucket/key1: getFileStatus on s3a://testing-bucket/key1: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 5A704DAC25D2DA09), S3 Extended Request ID: CE3wcJ/I5mF0t3CHoqAAf6fLkFJYVgKzIehHeXvYKWt0FZLOYFg6kFiLRRGTuEFl8SOZ9/h/IGQ=

Do you know if I should somehow configure the ACL of the bucket, or something related to the file in S3? This is how I put the object into my bucket:

AmazonS3 client = TestUtils.getClientS3();
byte[] dataBytes = .... // my content here

ObjectMetadata metaData = new ObjectMetadata();
metaData.setContentType(ContentType.APPLICATION_JSON.toString());
metaData.setContentEncoding(StandardCharsets.UTF_8.name());
metaData.setContentLength(dataBytes.length);

byte[] resultByte = DigestUtils.md5(dataBytes);
String streamMD5 = new String(Base64.encodeBase64(resultByte));
metaData.setContentMD5(streamMD5);

PutObjectRequest putObjectRequest = new PutObjectRequest("testing-bucket", "key1", new ByteArrayInputStream(dataBytes), metaData);

client.putObject(putObjectRequest);

@whummer
Member

whummer commented Jan 9, 2018

One thing I noticed - can you double check the bucket names? The example in the first post uses "testing-bucket" for createBucket, but "test-bucket" for the paths in readJson. Also, please use the s3a protocol (it appears that s3n is deprecated). It should not be necessary to set a bucket ACL, since you're already using a (fake) access key and secret key.

@spaghettifunk
Author

I checked the naming carefully, and in my code I'm using the same name. I made a typo here; let me correct it. Sorry for the inconvenience.

@hilios

hilios commented Feb 6, 2018

I am facing a similar problem, but using the hadoop-aws library directly. My program tries to read a Parquet file (using parquet-avro) from localstack, but it cannot:

// Create bucket
val testBucket = s"test-bucket-${UUID.randomUUID()}"
val localstack = new EndpointConfiguration("http://s3:4572", Regions.US_EAST_1.getName)
val s3 = AmazonS3ClientBuilder.standard()
    .withEndpointConfiguration(localstack)
    .withPathStyleAccessEnabled(true)
    .build()
s3.createBucket(testBucket)
// Upload test file
val parquet = new File(getClass.getResource("/track.parquet").toURI)
val obj = new PutObjectRequest(testBucket, "track.parquet", parquet)
s3.putObject(obj)

// Hadoop configuration
val configuration = new Configuration()
configuration.set("fs.s3a.endpoint", "http://s3:4572")
configuration.set("fs.s3a.access.key", "<empty>")
configuration.set("fs.s3a.secret.key", "<empty>")
configuration.set("fs.s3a.path.style.access", "true")

// Read parquet
val path = new Path(s"s3a://$testBucket/track.parquet")
val reader = AvroParquetReader.builder[GenericRecord](path).withConf(configuration).build()

println(reader.read()) // This line is never executed, but no exception is thrown

I can access the file normally with the AWS CLI and SDK, but I have no clue what is going on with the ParquetReader.

Is there any way to debug the S3 calls made to localstack? That way I think it would be easier to track whether the error is in localstack, in parquet-avro, or in my code.

One last thing: the AvroParquetReader works fine when it is called with a local path.
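
On the debugging question: one option (a sketch, assuming the AWS SDK for Java v1 and log4j 1.x, which Spark 2.x/Hadoop 2.x ship with) is to raise the SDK's request logger to DEBUG so every call and response sent to the endpoint is printed:

```java
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

// Hedged sketch: enable AWS SDK v1 request/response logging (and, optionally, raw HTTP wire logging).
public final class S3DebugLogging {
    public static void enable() {
        Logger.getLogger("com.amazonaws.request").setLevel(Level.DEBUG); // one line per request/response
        Logger.getLogger("org.apache.http.wire").setLevel(Level.DEBUG);  // full wire dump (very verbose)
    }
}
```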

@sam-reh-hs

sam-reh-hs commented Apr 27, 2018

This was broken for me as well. It was related to Content-Length always being returned as 0 on a HEAD request.
I upgraded to localstack 0.8.6 and it works 😸
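
If you suspect the same problem, a quick way to check is to issue the HEAD request yourself via the SDK and inspect the reported length (a minimal sketch; bucket and key are placeholders):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;

// Hedged sketch: getObjectMetadata() performs a HEAD request, so the returned Content-Length
// should match the size of the uploaded object rather than 0.
public final class HeadCheck {
    public static long reportedLength(AmazonS3 s3, String bucket, String key) {
        ObjectMetadata meta = s3.getObjectMetadata(bucket, key);
        return meta.getContentLength();
    }
}
```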

@ghost

ghost commented Sep 25, 2018

Late to the party, but it seems that writing to S3 also isn't working from Spark when using the s3a protocol. It gives me MD5 check errors.

Exception in thread "main" org.apache.hadoop.fs.s3a.AWSClientIOException: innerMkdirs on s3a://bucket-958abef2-b13e-4778-8a89-dc5d0a6aae21/input.csv/_temporary/0: com.amazonaws.SdkClientException: Unable to verify integrity of data upload.  Client calculated content hash (contentMD5: 1B2M2Y8AsgTpgAmY7PhCfg== in base 64) didn't match hash (etag: accd48352b8de701213f0d8fa29bf438 in hex) calculated by Amazon S3.  You may need to delete the data stored in Amazon S3. (metadata.contentMD5: null, md5DigestStream: com.amazonaws.services.s3.internal.MD5DigestCalculatingInputStream@487069c6, bucketName: bucket-958abef2-b13e-4778-8a89-dc5d0a6aae21, key: input.csv/_temporary/0/): Unable to verify integrity of data upload.  Client calculated content hash (contentMD5: 1B2M2Y8AsgTpgAmY7PhCfg== in base 64) didn't match hash (etag: accd48352b8de701213f0d8fa29bf438 in hex) calculated by Amazon S3.  You may need to delete the data stored in Amazon S3. (metadata.contentMD5: null, md5DigestStream: com.amazonaws.services.s3.internal.MD5DigestCalculatingInputStream@487069c6, bucketName: bucket-958abef2-b13e-4778-8a89-dc5d0a6aae21, key: input.csv/_temporary/0/)

Spark Code:

from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import SparkSession
from uuid import uuid4
spark = SparkSession\
        .builder\
        .appName("S3Write")\
        .getOrCreate()
spark.sparkContext.setLogLevel("DEBUG")
sc = spark.sparkContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('file:///genie/datafabric/power/spark_executor/latest/299423.csv')
fname = str(uuid4()) + '.csv'
df.write.csv('s3a://bucket-958abef2-b13e-4778-8a89-dc5d0a6aae21/'+fname)
df.show(5)
print(df.count())
Spark Version: 2.1.0
Hadoop Version: 2.8.0
AWS SDK: 1.11.228
Localstack (Docker): tried with latest, 0.8.7, 0.8.6

P.S. I am able to write to actual S3 with the same code.

@vishal98

vishal98 commented Nov 5, 2018

I am trying to read a CSV file from S3 using the following code:

s3_read_df = spark.read \
        .option("delimiter", ",") \
        .format("csv") \
        .load("s3a://tutorial/test.csv")

I'm getting an XML parsing error for it:


py4j.protocol.Py4JJavaError: An error occurred while calling o36.load.
: com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
        at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738)
        at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3480)
    

@vishal98

vishal98 commented Nov 9, 2018

I was able to read/write on localstack S3 with the Spark 2.3.1 "without Hadoop" build and Hadoop 2.8.4. You will need to set the parameters mentioned in this link to make the whole thing work:
http://spark.apache.org/docs/latest/hadoop-provided.html

bskarda added a commit to bskarda/localstack that referenced this issue Feb 11, 2019
Issue is outlined in localstack#869 and localstack#538
(specifically localstack#538 (comment)).

One of the external symptoms of the bug is that S3 interactions for
objects that are only the null string (zero-length objects) throw
an error like
`Client calculated content hash (contentMD5: 1B2M2Y8AsgTpgAmY7PhCfg== in base 64) didn't match hash (etag: ...`
which is telling, because the `contentMD5` value is the base64-encoded
MD5 of the null string. The correct etag value should be
`d41d8cd98f00b204e9800998ecf8427e`.

On multiple executions of the test case the etag generated by the S3
server is different, so some non-deterministic value is being put
into the hash function. I haven't been able to find where the actual
issue exists in the code, as I don't use Python very often.

One workaround for this issue is to disable MD5 validation
in the S3 client. For Java, that means setting the following properties
to any value:
```
com.amazonaws.services.s3.disableGetObjectMD5Validation
com.amazonaws.services.s3.disablePutObjectMD5Validation
```
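
As a sketch of that workaround, the properties can be set programmatically before the S3 client is built (they only need to be present; the value itself is not significant):

```java
// Hedged sketch: disable client-side MD5 validation in the AWS SDK for Java v1 by defining
// the system properties named above. Do this before the AmazonS3 client is constructed,
// e.g. in a test setup method.
public final class DisableMd5Validation {
    public static void apply() {
        System.setProperty("com.amazonaws.services.s3.disableGetObjectMD5Validation", "true");
        System.setProperty("com.amazonaws.services.s3.disablePutObjectMD5Validation", "true");
    }
}
```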
@i-aggarwal

> Late to the party, but it seems that writing to S3 also isn't working from Spark when using the s3a protocol. It gives me MD5 check errors. […]

@avn4986, were you able to sort this out? I have the same issue.

@risdenk
Contributor

risdenk commented Mar 20, 2019

One of the current issues with s3a and localstack is rename. I am putting together a small test project that shows the issue. Using s3a to rename results in checksum errors.

@risdenk
Contributor

risdenk commented Mar 20, 2019

Created https://github.com/risdenk/s3a-localstack to demonstrate the S3A rename issue. This uses localstack docker with JUnit.

Example output:

testS3ALocalStackFileSystem(io.github.risdenk.hadoop.s3a.TestS3ALocalstack)  Time elapsed: 4.562 sec  <<< ERROR!
org.apache.hadoop.fs.s3a.AWSBadRequestException: copyFile(YQuGJ, lzKBU) on YQuGJ: com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you specified was invalid (Service: Amazon S3; Status Code: 400; Error Code: InvalidDigest; Request ID: null; S3 Extended Request ID: null), S3 Extended Request ID: null:InvalidDigest: The Content-MD5 you specified was invalid (Service: Amazon S3; Status Code: 400; Error Code: InvalidDigest; Request ID: null; S3 Extended Request ID: null)
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:224)
	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:111)
	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:125)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.copyFile(S3AFileSystem.java:2541)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerRename(S3AFileSystem.java:996)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.rename(S3AFileSystem.java:863)
	at io.github.risdenk.hadoop.s3a.TestS3ALocalstack.testS3ALocalStackFileSystem(TestS3ALocalstack.java:123)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at cloud.localstack.docker.LocalstackDockerTestRunner.run(LocalstackDockerTestRunner.java:43)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
	at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
	at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you specified was invalid (Service: Amazon S3; Status Code: 400; Error Code: InvalidDigest; Request ID: null; S3 Extended Request ID: null), S3 Extended Request ID: null
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1640)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1058)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4368)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4315)
	at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1890)
	at com.amazonaws.services.s3.transfer.internal.CopyCallable.copyInOneChunk(CopyCallable.java:146)
	at com.amazonaws.services.s3.transfer.internal.CopyCallable.call(CopyCallable.java:134)
	at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:132)
	at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:43)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

@steveloughran

This looks like a checksum issue. The AWS client builds an MD5 checksum as it uploads data and then compares it with what it got back; if they differ, it assumes corruption en route.
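
This lines up with the values in the errors above: `1B2M2Y8AsgTpgAmY7PhCfg==` is simply the MD5 of zero bytes in Base64, while the same digest in hex is `d41d8cd98f00b204e9800998ecf8427e`. A small self-contained check:

```java
import java.security.MessageDigest;
import java.util.Base64;

// Computes the MD5 of an empty byte array and prints it in hex and Base64, matching the
// "contentMD5" value seen in the failing requests earlier in this thread.
public final class EmptyMd5 {
    public static void main(String[] args) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(new byte[0]);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        System.out.println(hex);                                         // d41d8cd98f00b204e9800998ecf8427e
        System.out.println(Base64.getEncoder().encodeToString(digest));  // 1B2M2Y8AsgTpgAmY7PhCfg==
    }
}
```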

@lyle-nel

@steveloughran Have you found a workaround for this?

@lyle-nel

Related: #1152 #538 #869 #1120

@risdenk
Contributor

risdenk commented May 22, 2019

Based on @mvpc's comment here: #1152 (comment)

Looks like the problem is in this method:
https://github.com/localstack/localstack/blame/master/localstack/services/s3/s3_listener.py#L311

@risdenk
Contributor

risdenk commented May 23, 2019

Problem Statement - s3a rename

I finally had some time to sit down and look at the s3a rename failure. In the s3a case, the AWS Java SDK issues a CopyObjectRequest here [1]. The ObjectMetadata is copied onto this request, and so the Content-MD5 header is sent. It is unclear from the AWS docs [2] whether Content-MD5 should be set on the request.

Localstack assumes that if the Content-MD5 header is set, then the MD5 of the data must be checked [3]. The content ends up being empty on a copy request, and when Localstack calculates the MD5 of '' it doesn't match what the AWS SDK/s3a set as the Content-MD5 [4].

In the case of, say, Minio, there is a separate copy request handler that doesn't check the Content-MD5 header [5].

Potential Solution

I think the best option here is to make Localstack skip the Content-MD5 check for copy requests (as Minio does).

An alternative is to make Apache Hadoop S3AFilesystem not set the Content-MD5 metadata for a copy request. This isn't ideal for 2 reasons:

  • There are existing s3a users out there, so this wouldn't work for existing clients
  • S3AFilesystem works against Amazon S3 currently

[1] https://github.com/apache/hadoop/blob/branch-3.2.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L2541
[2] https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectCOPY.html
[3] https://github.com/localstack/localstack/blob/master/localstack/services/s3/s3_listener.py#L444
[4] https://github.com/localstack/localstack/blame/master/localstack/services/s3/s3_listener.py#L311
[5] https://github.com/minio/minio/blob/master/cmd/object-handlers.go#L623
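
To make the failing request shape concrete, here is a hedged, minimal sketch (not taken from s3a itself; the endpoint, bucket, and keys are placeholders) of a copy whose replacement metadata carries a Content-MD5 value, which is the pattern described in [1] and [4]. Depending on the SDK version it may or may not reproduce the error exactly:

```java
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CopyObjectRequest;
import com.amazonaws.services.s3.model.ObjectMetadata;

// Hedged reproduction sketch: a server-side copy whose new metadata includes Content-MD5.
// The copy body itself is empty, so a listener that validates Content-MD5 against the body
// would reject it with InvalidDigest (the behaviour analysed above).
public final class CopyWithContentMd5 {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
            .withEndpointConfiguration(new EndpointConfiguration("http://localhost:4572", "us-east-1"))
            .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials("test", "test")))
            .withPathStyleAccessEnabled(true)
            .build();

        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentMD5("1B2M2Y8AsgTpgAmY7PhCfg==");    // MD5 of an empty body, as in the errors above

        CopyObjectRequest copy = new CopyObjectRequest("bucket", "src-key", "bucket", "dst-key");
        copy.setNewObjectMetadata(meta);                   // metadata replacement, carrying Content-MD5
        s3.copyObject(copy);                               // assumes src-key already exists in the bucket
    }
}
```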

@steveloughran

An alternative is to make Apache Hadoop S3AFilesystem not set the Content-MD5 metadata for a copy request. This isn't ideal for 2 reasons:

There are existing s3a users out there, so this wouldn't work for existing clients

valid point

S3AFilesystem works against Amazon S3 currently

true, but we don't want to preclude other stores. Looking at the last time anyone touched the metadata propagation code, it actually came from EMC in HADOOP-11687.

One suggestion here is that someone checks out the hadoop source code and runs the hadoop-aws tests against this new s3a endpoint. Then, if there are problems, they can be fixed at either end. If we shouldn't be setting the Content-MD5, then we shouldn't be setting the Content-MD5.

@whummer
Member

whummer commented Aug 25, 2019

Great analysis, thanks @risdenk @steveloughran. How should we proceed with this issue - would you be able to create a pull request with the proposed changes? Thanks!

@steveloughran

  1. I am actually doing some copy-related work in HADOOP-16490 ("Improve S3Guard handling of FNFEs in copy", apache/hadoop#1229), but this should be separate if it stands any chance of being backported.
  2. The probability of me finding the time to work on this is ~0.
  3. But if others were to do a patch and, after complying with our test requirements, repeatedly harass me to look at it, I'll take a look.

My first recommendation would be for you to use the hadoop-aws test suite as part of your own integration testing - "this is what the big data apps expect". Let's see what happens.

@risdenk
Contributor

risdenk commented Aug 28, 2019

I can make a pull request for not checking the MD5 if it is a copy. It is a simple one-line change. Not sure where the tests are. Let me put up the PR and go from there.

My first recommendation would be for you to use the hadoop-aws test suite as part of your own integration testing - "this is what the big data apps expect". Let's see what happens.

I tried to take a crack at this tonight and ran into a bunch of issues with "Change detection policy requires ETag" errors.
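
For reference, a hedged sketch of one way past that particular failure (assumption: Hadoop 3.2+, where the `fs.s3a.change.detection.*` options exist; check the hadoop-aws documentation for your version). It relaxes the change-detection policy so a store that does not return usable ETags is tolerated, at the cost of losing that consistency check:

```java
import org.apache.hadoop.conf.Configuration;

// Hedged sketch: relax s3a change detection for stores that do not return usable ETags.
// Option names assume Hadoop 3.2+; verify against your hadoop-aws version.
public final class RelaxChangeDetection {
    public static Configuration apply(Configuration conf) {
        conf.set("fs.s3a.change.detection.mode", "warn");                    // warn instead of failing
        conf.setBoolean("fs.s3a.change.detection.version.required", false);  // tolerate missing ETags
        return conf;
    }
}
```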

risdenk added a commit to risdenk/localstack that referenced this issue Aug 28, 2019
risdenk added a commit to risdenk/localstack that referenced this issue Aug 28, 2019
Fixes localstack#538

Signed-off-by: Kevin Risden <krisden@apache.org>
@risdenk
Contributor

risdenk commented Aug 28, 2019

@whummer created PR #1510 for the MD5 copy check change.

risdenk added a commit to risdenk/localstack that referenced this issue Aug 29, 2019
Fixes localstack#538

Signed-off-by: Kevin Risden <krisden@apache.org>
@whummer
Member

whummer commented Aug 29, 2019

@risdenk thanks for the PR! With #1510 being merged, can this issue be closed from your perspective?

@risdenk
Contributor

risdenk commented Aug 29, 2019

@whummer - PR #1510 definitely fixes the rename issues that were reported here with the MD5 being invalid. Can't say it fixes every potential problem :)

@whummer
Member

whummer commented Aug 29, 2019

Thanks. Closing this issue for now. Please create follow-up issues with specific details if any new problems come up.
