Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update hadoop to 3.2.0 #53

Conversation

rajatrj20
Copy link

@rajatrj20 rajatrj20 commented Dec 20, 2021

When presto runs on Hadoop 3 with dfs.encrypt.data.transfer=true, queries fails with error:

com.facebook.presto.spi.PrestoException: Error committing write to Hive

Caused by: org.apache.hadoop.ipc.RemoteException: File /tmp/presto-hadoop/a38381f4-fcc3-431d-bd73-bb0d872a1572/20211106_073946_00003_k9xuf_070809d7-6c9b-4d5f-ac17-86a998d2958d could only be written to 0 of the 1 minReplication nodes. There are 3 datanode(s) running and 3 node(s) are excluded in this operation.

and on HDFS side following error is thrown:

javax.security.sasl.SaslException: Invalid token in javax.security.sasl.qop:
at com.sun.security.sasl.util.AbstractSaslImpl.parseProp(AbstractSaslImpl.java:242)
at com.sun.security.sasl.util.AbstractSaslImpl.parseQop(AbstractSaslImpl.java:206)
at com.sun.security.sasl.util.AbstractSaslImpl.parseQop(AbstractSaslImpl.java:197)
at com.sun.security.sasl.util.AbstractSaslImpl.(AbstractSaslImpl.java:73)
at com.sun.security.sasl.digest.DigestMD5Base.(DigestMD5Base.java:174)
at com.sun.security.sasl.digest.DigestMD5Server.(DigestMD5Server.java:145)
at com.sun.security.sasl.digest.FactoryImpl.createSaslServer(FactoryImpl.java:109)
at javax.security.sasl.Sasl.createSaslServer(Sasl.java:567)
at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.createServerSaslParticipant(SaslParticipant.java:66)
at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:385)
at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getEncryptedStreams(SaslDataTransferServer.java:186)
at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:118)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
at java.lang.Thread.run(Thread.java:748)

This happens because presto still uses hadoop 2 libraries which results in failure when running on hadoop 3 system. This PR is to update support of hadoop 3 to presto engine by upgrading hadoop to version 3.2.0. This changes are not tested on hadoop 2 environment after updating it to hadoop 3 and if visibility is needed I can update that.

To add some historical context on how PrestoDB moved from hadoop-apache1 to hadoop-apache2, Presto was using hadoop-apache1 and then plugin for hadoop-apache2 was added as well. After sometime plugin for hadoop-apache1 was removed from the dependencies and it started using only hadoop-apache2.
Also, Trino has moved away from hadoop 2 and is now on hadoop 3. Reference: Update to Hadoop 3.2.0 #3

Ideally, we should also look at renaming this repository to presto-hadoop-apache to make it more generic or a new repository presto-hadoop-apache3 should be created for hadoop version 3 the way repo presto-hadoop-apache1 and presto-hadoop-apache2 already exist for version 1 and 2 respectively. Since, I am not sure which route needs to be taken, I didn't create or tried to rename the repository. Please let me know on the steps to be taken and I can follow those steps to do the further work required.

@rajatrj20
Copy link
Author

@highker Could you please review this PR for me?

Copy link

@arunthirupathi arunthirupathi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please elaborate what features are currently broken, because of the old Hadoop version ?

@@ -402,7 +448,7 @@
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.2</version>
<version>3.2.1</version>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the old shaded jar size versus the new size ?

Copy link
Author

@rajatrj20 rajatrj20 Dec 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to add support for Hadoop 3 to presto so presto works when environment has Hadoop 3.

Currently, queries with encryption flag set to true fails (example is updated in description) when it runs on Hadoop 3.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not an expert on Hadoop, so will defer this to Hadoop experts.

@@ -426,10 +472,18 @@
<pattern>org.apache.http</pattern>
<shadedPattern>${shadeBase}.org.apache.http</shadedPattern>
</relocation>
<relocation>
<pattern>mozilla</pattern>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these new dependencies from hadoop upgrade ? Seems like hadoop3 has lot of new dependencies.

@rajatrj20
Copy link
Author

@arunthirupathi @imjalpreet, Can you please help me review this PR. I have updated it with more information since last change.

@arunthirupathi
Copy link

@arunthirupathi @imjalpreet, Can you please help me review this PR. I have updated it with more information since last change.

I work for Meta/Facebook. We don't use the hadoop at all, so not sure, if this would break anything. @imjalpreet do you have concerns ?

My concerns are around the changed dependency and how it impacts other libraries already out there.

@imjalpreet
Copy link
Member

@arunthirupathi @rajatrj20 Thanks for the PR. I will try to review this early next week and let you know.

@imjalpreet
Copy link
Member

@rajatrj20 Apologies for such a long delay. I will complete the review this week. But before I begin, I have a small clarification to make. Since we are taking reference from a Trino PR trinodb/trino-hadoop-apache#3, is there any reason behind picking only partial changes? IMO we should cherry pick the entire commit.

@rajatrj20
Copy link
Author

@imjalpreet I decided to leave classes that will not be used in PrestoDB from the original commit. Also, there are some differences in excluded dependencies between the trino commit and this change, hence the original commit was not cherry-picked as it is.

@imjalpreet
Copy link
Member

@rajatrj20 oh okay, I will take a look.

Also, I see there is already a tag created for 3.2.0-1: https://github.com/prestodb/presto-hadoop-apache2/tree/3.2.0-1 which has all the required changes.

@aweisberg Do you have some context on this tag and can it be used for the upgrade?

@aweisberg
Copy link

@rajatrj20 oh okay, I will take a look.

Also, I see there is already a tag created for 3.2.0-1: https://github.com/prestodb/presto-hadoop-apache2/tree/3.2.0-1 which has all the required changes.

@aweisberg Do you have some context on this tag and can it be used for the upgrade?

You should be able to use it. I think it was already released to maven. I don’t recall why I didn’t complete the upgrade.

@imjalpreet
Copy link
Member

@rajatrj20 oh okay, I will take a look.
Also, I see there is already a tag created for 3.2.0-1: https://github.com/prestodb/presto-hadoop-apache2/tree/3.2.0-1 which has all the required changes.
@aweisberg Do you have some context on this tag and can it be used for the upgrade?

You should be able to use it. I think it was already released to maven. I don’t recall why I didn’t complete the upgrade.

@rajatrj20 I think it should be good as well, what do you think about this tag? It is already released to maven as well (https://mvnrepository.com/artifact/com.facebook.presto.hadoop/hadoop-apache2/3.2.0-1)

Ideally, we should also look at renaming this repository to presto-hadoop-apache to make it more generic or a new repository presto-hadoop-apache3 should be created for hadoop version 3 the way repo presto-hadoop-apache1 and presto-hadoop-apache2 already exist for version 1 and 2 respectively. Since, I am not sure which route needs to be taken, I didn't create or tried to rename the repository. Please let me know on the steps to be taken and I can follow those steps to do the further work required.

@aweisberg we should think about this too. Do we want to rename the repository or create a new for hadoop3?

@Ayan07
Copy link

Ayan07 commented Mar 31, 2022

@imjalpreet Is it possible now to build presto with presto-hadoop-apache 3 version??

@rajatrj20
Copy link
Author

@rajatrj20 I think it should be good as well, what do you think about this tag? It is already released to maven as well (https://mvnrepository.com/artifact/com.facebook.presto.hadoop/hadoop-apache2/3.2.0-1)

@imjalpreet What are you suggesting here - Do you want me to rebase my changes on top of 3.2.0-1 tag? Also, I am not sure if we can release same tag again to maven central please confirm if it can be done.

@aweisberg
Copy link

We can't release the same tag again. You do want to be on the 3.2.x branch. Is this adding anything or have we already released an artifact for this and we just need to update Presto itself?

@rajatrj20
Copy link
Author

I am not updated with 3.2.x branch, @aweisberg is the current branch 3.2.x a complete update for hadoop 3? In that case I can rebase my PR to 3.2.x branch. Also what do you think about creating new repository for hadoop 3 shaded version?

@aweisberg
Copy link

aweisberg commented Mar 31, 2022

My preference is to do what Trino did and rename the repository to hadoop-apache and stop including the version.

Realistically only the version in use in Presto matters. If we had finished the update to hadoop 3 then master would have become the 3.2.x branch since we don't need to release hadoop 2 versions once Presto upgrades.

We should make a new maven artifact that is just hadoop-apache as well.

@rajatrj20
Copy link
Author

Yes, renaming repository to presto-hadoop-apache seems to be the right way to go forward. There are some changes between my PR and 3.2.x branch and also, in my initial analysis, presto is throwing compilation errors with 3.2.x branch. I will rebase my PR to 3.2.x and update this.

If you know what needs to be done to rename this repository and if I can help with it please let me know.

@rajatrj20 rajatrj20 force-pushed the rajatrj/hadoop-version-update-to-3 branch from c01941b to 21db71a Compare April 1, 2022 14:43
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Apr 1, 2022

CLA Missing ID CLA Not Signed

@rajatrj20 rajatrj20 changed the base branch from master to 3.2.x April 1, 2022 14:44
@rajatrj20 rajatrj20 force-pushed the rajatrj/hadoop-version-update-to-3 branch from 21db71a to c01941b Compare April 1, 2022 14:47
@rajatrj20 rajatrj20 changed the base branch from 3.2.x to master April 1, 2022 14:47
@aweisberg
Copy link

aweisberg commented Apr 1, 2022

I can rename it, but the bigger issue is getting it merged into Presto and having a working Presto PR to pair with it.

I had a PR for this prestodb/presto#13311 but I can't recall why I dropped it. Looks like a hive dependency upgrade is also mixed into that PR.

@rajatrj20
Copy link
Author

Closing this PR as it is now failing EasyCLA check. Opened PR against branch 3.2.x Update hadoop to 3.2.0 #56

@rajatrj20 rajatrj20 closed this Apr 1, 2022
@rajatrj20 rajatrj20 mentioned this pull request Apr 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants