
Question: How to provide Hadoop Path for reading files from S3 #300

Closed
jeet23 opened this issue May 30, 2023 · 3 comments

Comments


jeet23 commented May 30, 2023

I am trying to use the parquet4s-fs2 function fromParquet to read a file from S3.
I have set the Hadoop configuration below to read from S3 (I am running LocalStack, hence the S3 URL is localhost:4566):

    import com.github.mjakubowski84.parquet4s.{ParquetReader, Path}
    import com.github.mjakubowski84.parquet4s.parquet._
    import org.apache.hadoop.conf.Configuration

    // Hadoop configuration pointing the S3A connector at the LocalStack endpoint
    val hadoopConf: Configuration = new Configuration()
    hadoopConf.set("fs.s3a.path.style.access", "true")
    hadoopConf.set("fs.s3a.endpoint.region", "eu-central-1")
    hadoopConf.set("fs.s3a.connection.ssl.enabled", "false")
    hadoopConf.set("fs.s3a.endpoint", "http://localhost:4566")

    val readerStream =
      fromParquet[F]
        .as[MyCustomCaseClass]
        .options(ParquetReader.Options(hadoopConf = hadoopConf))
        .read(Path(directoryName))

where directoryName is the name of the directory in S3 (in my case, directoryName = local_directory).

However, upon running this, I get a FileNotFound exception on the read call, and as a result the app bombs out with this error:

java.lang.IllegalArgumentException: Inconsistent partitioning.
[error] Parquet files must live in leaf directories.
[error] Every files must contain the same numbers of partitions.
[error] Partition directories at the same level must have the same names.
[error] Check following directories:
[error] 	local_directory

Can you please guide me on why the FileNotFound exception happens even though the file is present in the local S3?

Thanks
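
For reference, Hadoop resolves a path without a scheme against the default filesystem, so an S3A location is usually given as a fully qualified s3a:// URI rather than a bare directory name. A minimal sketch, assuming a hypothetical bucket called my-bucket (not taken from the report above):

    // Sketch: fully qualified S3A path; "my-bucket" is a placeholder bucket name
    val directoryPath = Path("s3a://my-bucket/local_directory")

    val readerStream =
      fromParquet[F]
        .as[MyCustomCaseClass]
        .options(ParquetReader.Options(hadoopConf = hadoopConf))
        .read(directoryPath)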


jeet23 commented May 30, 2023

Update:
I was able to read the S3 files locally from LocalStack.

The following properties were needed:

hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.change.detection.mode", "none")
hadoopConf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")


jeet23 commented May 30, 2023

I am now getting the following exception on an actual AWS account. I'm using an IAM assumed role to connect from Hadoop to AWS.

Failed to initialize fileystem s3a://my-bucket/my-file.parquet: java.nio.file.AccessDeniedException: : org.apache.hadoop.fs.s3a.auth.NoAwsCredentialsException: SimpleAWSCredentialsProvider: No AWS credentials in the Hadoop configuration

Do you have any idea on this?
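
For what it's worth, hadoop-aws also ships an assumed-role credential provider that can be configured instead of SimpleAWSCredentialsProvider; a hedged sketch based on the Hadoop S3A assumed-role documentation (the role ARN below is a placeholder):

    // Sketch: delegate credentials to the S3A assumed-role provider
    // The role ARN is a placeholder, not taken from this issue
    hadoopConf.set(
      "fs.s3a.aws.credentials.provider",
      "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider"
    )
    hadoopConf.set("fs.s3a.assumed.role.arn", "arn:aws:iam::123456789012:role/my-read-role")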

@mjakubowski84
Owner

@jeet23 Your questions are related to Hadoop AWS, not Parquet4S (which only builds on Hadoop indirectly). Please seek support in the Hadoop community or docs.
