
Question: How to provide Hadoop Path for reading files from S3 #300

Closed
jeet23 opened this issue May 30, 2023 · 3 comments

Comments


jeet23 commented May 30, 2023

I am trying to use the parquet4s-fs2 function fromParquet to read a file from S3.
I have set the Hadoop configuration below to read from S3 (I am running LocalStack, hence the S3 URL is localhost:4566):

    import com.github.mjakubowski84.parquet4s.{ParquetReader, Path}
    import com.github.mjakubowski84.parquet4s.parquet._
    import org.apache.hadoop.conf.Configuration

    // Hadoop configuration pointing the S3A connector at the LocalStack endpoint
    val hadoopConf: Configuration = new Configuration()
    hadoopConf.set("fs.s3a.path.style.access", "true")
    hadoopConf.set("fs.s3a.endpoint.region", "eu-central-1")
    hadoopConf.set("fs.s3a.connection.ssl.enabled", "false")
    hadoopConf.set("fs.s3a.endpoint", "http://localhost:4566")

    val readerStream =
      fromParquet[F]
        .as[MyCustomCaseClass]
        .options(ParquetReader.Options(hadoopConf = hadoopConf))
        .read(Path(directoryName))

where directoryName is the name of the directory in S3 (in my case, directoryName = local_directory).

However, upon running this, I get a FileNotFound exception on the read call, and as a result the app bombs out with this error:

java.lang.IllegalArgumentException: Inconsistent partitioning.
[error] Parquet files must live in leaf directories.
[error] Every files must contain the same numbers of partitions.
[error] Partition directories at the same level must have the same names.
[error] Check following directories:
[error] 	local_directory

Can you please guide me on why the FileNotFound exception happens even though the file is present in the local S3?

Thanks
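
For reference, Hadoop resolves a path without a scheme against the default filesystem, so an S3A location is usually given as a fully qualified s3a:// URI rather than a bare directory name. A minimal sketch, assuming a hypothetical bucket called my-bucket (not taken from the report above):

    // Sketch: fully qualified S3A path; "my-bucket" is a placeholder bucket name
    val directoryPath = Path("s3a://my-bucket/local_directory")

    val readerStream =
      fromParquet[F]
        .as[MyCustomCaseClass]
        .options(ParquetReader.Options(hadoopConf = hadoopConf))
        .read(directoryPath)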


jeet23 commented May 30, 2023

Update:
I was able to read the S3 files locally from LocalStack.

The following properties were needed:

hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.change.detection.mode", "none")
hadoopConf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")


jeet23 commented May 30, 2023

I am now getting the following exception on an actual AWS account. I'm using an IAM assumed role to connect from Hadoop to AWS.

Failed to initialize fileystem s3a://my-bucket/my-file.parquet: java.nio.file.AccessDeniedException: : org.apache.hadoop.fs.s3a.auth.NoAwsCredentialsException: SimpleAWSCredentialsProvider: No AWS credentials in the Hadoop configuration

Do you have any idea on this?
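
For what it's worth, hadoop-aws also ships an assumed-role credential provider that can be configured instead of SimpleAWSCredentialsProvider; a hedged sketch based on the Hadoop S3A assumed-role documentation (the role ARN below is a placeholder):

    // Sketch: delegate credentials to the S3A assumed-role provider
    // The role ARN is a placeholder, not taken from this issue
    hadoopConf.set(
      "fs.s3a.aws.credentials.provider",
      "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider"
    )
    hadoopConf.set("fs.s3a.assumed.role.arn", "arn:aws:iam::123456789012:role/my-read-role")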

@mjakubowski84
Owner

@jeet23 Your questions are related to Hadoop AWS, not Parquet4S (which only builds on Hadoop indirectly). Please seek support in the Hadoop community or docs.
