Cannot access S3 with protocols s3n or s3a #538
Thanks for reporting @davideberdin. Can you try running your example with S3 path-style access enabled? (This is also briefly mentioned in the Troubleshooting section of the README):
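A minimal sketch of what that looks like with the AWS SDK for Java v1 client builder (the endpoint URL and region are assumptions, not taken from the README):

```scala
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration
import com.amazonaws.services.s3.AmazonS3ClientBuilder

// Path-style access ("http://host/bucket/key" instead of virtual-host style
// "http://bucket.host/key") is what localstack's single-host endpoint expects.
val s3 = AmazonS3ClientBuilder.standard()
  .withEndpointConfiguration(new EndpointConfiguration("http://localhost:4572", "us-east-1"))
  .withPathStyleAccessEnabled(true)
  .build()
```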
Hi @whummer, thanks for your answer! I tried setting that property but I still have the same issue.
Do you know if I should somehow configure the ACL of the bucket, or something related to the file in S3? This is the way I put the object in my bucket:
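The original upload snippet was lost from the thread; a hypothetical sketch of a put with an explicit canned ACL (the bucket name, key, file, and ACL choice are all assumptions, not taken from the post):

```scala
import java.io.File
import com.amazonaws.services.s3.model.{CannedAccessControlList, PutObjectRequest}

// `s3` is an AmazonS3 client configured as in the earlier sketch.
// Upload a local file and grant public-read via a canned ACL.
val request = new PutObjectRequest("testing-bucket", "track.parquet", new File("track.parquet"))
  .withCannedAcl(CannedAccessControlList.PublicRead)
s3.putObject(request)
```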
One thing I noticed - can you double-check the bucket names? The example in the first post uses "testing-bucket" for …
I checked the naming carefully and in my code I'm using the same name. I made a typo here, let me correct it. Sorry for the inconvenience.
I am facing a problem like this as well, but using the …

```scala
import java.io.File
import java.util.UUID

import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration
import com.amazonaws.regions.Regions
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.PutObjectRequest
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader

// Create bucket
val testBucket = s"test-bucket-${UUID.randomUUID()}"
val localstack = new EndpointConfiguration("http://s3:4572", Regions.US_EAST_1.getName)
val s3 = AmazonS3ClientBuilder.standard()
  .withEndpointConfiguration(localstack)
  .withPathStyleAccessEnabled(true)
  .build()
s3.createBucket(testBucket)

// Upload test file
val parquet = new File(getClass.getResource("/track.parquet").toURI)
val obj = new PutObjectRequest(testBucket, "track.parquet", parquet)
s3.putObject(obj)

// Hadoop configuration
val configuration = new Configuration()
configuration.set("fs.s3a.endpoint", "http://s3:4572")
configuration.set("fs.s3a.access.key", "<empty>")
configuration.set("fs.s3a.secret.key", "<empty>")
configuration.set("fs.s3a.path.style.access", "true")

// Read parquet
val path = new Path(s"s3a://$testBucket/track.parquet")
val reader = AvroParquetReader.builder[GenericRecord](path).withConf(configuration).build()
println(reader.read()) // this line is never reached, but no exception is thrown
```

I can access the file normally with the AWS CLI and SDK, but I have no clue what is going on with the ParquetReader. Is there any way to debug the S3 calls made to localstack? That way I think it would be easier to track whether the error is in localstack, in parquet-avro, or in my code. One last thing is that the …
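One way to watch the individual S3 calls, a sketch assuming log4j 1.x is on the classpath (the AWS SDK for Java v1 logs each request and response at DEBUG on the `com.amazonaws.request` logger):

```scala
import org.apache.log4j.{Level, Logger}

// Print every request the SDK sends (headers included) and every response it
// receives, which makes it easier to see where the s3a read goes wrong.
Logger.getLogger("com.amazonaws.request").setLevel(Level.DEBUG)
```

Running localstack with `DEBUG=1` also logs the incoming requests on the server side.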
This was broken for me as well. It was related to …
Late to the party, but it seems that the write-to-S3 part also isn't working from Spark when using the s3a protocol. It gives me MD5 check errors.
Spark Code:
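The snippet itself was lost from the thread; a hypothetical sketch of the kind of Spark write that reproduces this (endpoint, credentials, and bucket name are assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-write-test")
  .master("local[*]")
  .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:4572")
  .config("spark.hadoop.fs.s3a.access.key", "test")
  .config("spark.hadoop.fs.s3a.secret.key", "test")
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  .getOrCreate()

// Writing even a tiny DataFrame trips the MD5 check on upload.
spark.range(10).toDF("id").write.mode("overwrite").parquet("s3a://test-bucket/output/")
```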
P.S. I am able to write to actual S3 with the same code.
I am trying to read a CSV file from S3 using the following code:
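The code block was lost from the thread; a hypothetical sketch of such a read (the path and endpoint settings are assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-csv-read")
  .master("local[*]")
  .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:4572")
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  .getOrCreate()

// Read a CSV object from a localstack bucket over s3a.
val df = spark.read.option("header", "true").csv("s3a://test-bucket/data.csv")
df.show()
```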
I am getting an XML parsing error for the same.
I was able to read/write on localstack S3 with Spark (the 2.3.1 "without Hadoop" build) and Hadoop 2.8.4 installed separately.
Issue is outlined in localstack#869 and localstack#538 (specifically localstack#538 (comment)).

One of the external symptoms of the bug is that S3 interactions for zero-length objects throw an error like `Client calculated content hash (contentMD5: 1B2M2Y8AsgTpgAmY7PhCfg== in base 64) didn't match hash (etag: ...`, which is telling because that `contentMD5` value is the base64-encoded MD5 of the empty string. The correct ETag value should be `d41d8cd98f00b204e9800998ecf8427e`. On multiple executions of the test case, the ETag generated by the S3 server is different, so some non-deterministic value is being fed into the hash function. I haven't been able to find where the actual issue lives in the code, as I don't use Python very often.

One workaround is to disable the MD5 validations in the S3 client. For Java, that means setting the following system properties to any value:

```
com.amazonaws.services.s3.disableGetObjectMD5Validation
com.amazonaws.services.s3.disablePutObjectMD5Validation
```
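A sketch of that workaround applied programmatically, assuming the AWS SDK for Java v1 (the property names come from the comment above; the values are irrelevant, only their presence matters):

```scala
// Setting these system properties before using the S3 client disables the
// client-side MD5 checks on GET and PUT, sidestepping the ETag mismatch.
System.setProperty("com.amazonaws.services.s3.disableGetObjectMD5Validation", "true")
System.setProperty("com.amazonaws.services.s3.disablePutObjectMD5Validation", "true")
```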
@avn4986, were you able to sort this out? I have the same issue.
One of the current issues with s3a and localstack is rename. I am putting together a small test project that shows the issue. Using S3A to rename ends up with checksum errors.
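For context, S3 has no native rename, so S3A implements it as a server-side copy followed by a delete of the source key, which is why the copy path's checksum handling matters here. A sketch of what a rename amounts to at the S3 API level (`s3` and the bucket/key names are assumptions):

```scala
// S3A "rename" at the S3 API level: server-side copy, then delete the source.
// `s3` is an AmazonS3 client configured as in the sketches above.
s3.copyObject("test-bucket", "old/key.parquet", "test-bucket", "new/key.parquet")
s3.deleteObject("test-bucket", "old/key.parquet")
```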
Created https://github.com/risdenk/s3a-localstack to demonstrate the S3A rename issue. It uses the localstack Docker image with JUnit. Example output:
This looks like checksum issues. The AWS client builds an MD5 checksum as it uploads data, and then compares it with what it got back. If they are different, it assumes corruption en route.
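For concreteness, a small sketch of the arithmetic behind the symptom quoted earlier: the same MD5 digest of an empty body appears hex-encoded as an S3 ETag and base64-encoded as the `Content-MD5` header:

```scala
import java.security.MessageDigest
import java.util.Base64

// MD5 of a zero-length body, rendered both ways.
val md5 = MessageDigest.getInstance("MD5").digest(Array.emptyByteArray)
println(md5.map("%02x".format(_)).mkString)    // d41d8cd98f00b204e9800998ecf8427e
println(Base64.getEncoder.encodeToString(md5)) // 1B2M2Y8AsgTpgAmY7PhCfg==
```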
@steveloughran Have you found a workaround for this?
Based on @mvpc's comment here: #1152 (comment), it looks like the problem is in this method: …
Problem Statement - Localstack thinks that if the `Content-MD5` header is present it must be validated against the request body, even when the request is a copy. In the case of, say, Minio, there is a separate copy request handler that doesn't check the `Content-MD5` header.

Potential Solution - Skip the `Content-MD5` validation in localstack when the request is a copy.
An alternative is to make Apache Hadoop's `S3AFileSystem` stop setting the `Content-MD5` header when copying objects [1].
[1] https://github.com/apache/hadoop/blob/branch-3.2.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L2541
valid point
True, but we don't want to preclude other stores. Looking at the last time anyone touched the metadata propagation code, it actually came from EMC in HADOOP-11687. One suggestion here is that someone check out the Hadoop source code and run the hadoop-aws tests against this new s3a endpoint. Then, if there are problems, they can be fixed at either end. If we shouldn't be setting the Content-MD5, then we shouldn't be setting the Content-MD5.
Great analysis, thanks @risdenk @steveloughran. How should we proceed with this issue - would you be able to create a pull request with the proposed changes? Thanks!
My first recommendation would be for you to use the hadoop-aws test suite as part of your own integration testing - "this is what the big data apps expect". Let's see what happens.
I can make a pull request for not checking the MD5 if it is a copy. It is a simple one line change. Not sure where the tests are. Let me put up the PR and go from there.
I tried to take a crack at this tonight and ran into a bunch of issues with ETag …
Fixes localstack#538 Signed-off-by: Kevin Risden <krisden@apache.org>
Thanks. Closing this issue for now. Please create follow-up issues with specific details if any new problems come up.
I have a Java Spark job which accesses an S3 bucket using the s3a protocol, but with localstack I'm having accessibility issues. Here are the details of what I've done:
In my pom.xml I imported:
The class I'm trying to test looks like this:
My testing class looks like this:
I receive two types of error depending on the protocol I am using. If I use the s3a protocol, I get the following:
If I use s3n (with the appropriate Spark configurations), I receive this:
I don't understand if it's a bug or if I am missing something. Thanks!