New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow HiveQueryRunner to persist data on disk #12137

Open
wants to merge 1 commit into
base: master
from

Conversation

Projects
None yet
5 participants
@yingsu00
Copy link
Contributor

yingsu00 commented Dec 27, 2018

The user can pass a URI string as the single program argument of
HiveQueryRunner to be the place where he or she wants the data to be
persisted on local disk. It needs to be a URI the user has write
permission. For example, if the user pass in /Users/yingsu/, then the
folder to hold the hive data would be /Users/yingsu/hive_data. If no
argument is provided, HiveQueryRunner would remain the existing
behaviour and use a random directory as the baseDir to hold data, e.g.
/var/folders/hj/hpzdcj_9155b8364wbtf23vh262vxn/T/PrestoTest272443592426
3942855/hive_data. This would make local debugging experience more
convenient.

@yingsu00 yingsu00 requested a review from wenleix Dec 27, 2018

@wenleix wenleix assigned electrum and unassigned wenleix Dec 27, 2018

@electrum
Copy link
Contributor

electrum left a comment

Rather than changing all the table creation, let's simplify it to only copy the tables if the database did not exist.

Alternatively, we could do it like RaptorQueryRunner and not create anything when running the main method.

@electrum

This comment has been minimized.

Copy link
Contributor

electrum commented Dec 27, 2018

Another approach is to allow configuring baseDir of the coordinator. That would allow preserving data for any query runner.

@yingsu00

This comment has been minimized.

Copy link
Contributor

yingsu00 commented Dec 28, 2018

Another approach is to allow configuring baseDir of the coordinator. That would allow preserving data for any query runner.

@electrum Are we talking about using a session property to set up the baseDir? In that case the user would have to set the session property when the server is started, then restart the server to honor this property. If we pass in the baseDir from program arguments we don't need to restart the server and it's easier to change it. However if the user forgets the baseDir (e.g. lost the run configuration) then the session property way is better. I'm ok with either way. Which way do you prefer?

@yingsu00 yingsu00 force-pushed the yingsu00:hivequeryrunner branch from ae7e0d7 to d076c96 Dec 28, 2018

@yingsu00

This comment has been minimized.

Copy link
Contributor

yingsu00 commented Dec 28, 2018

@electrum Hi David I updated the commit with your comments addressed. I'm still passing in a program argument from main(). Please let me know if you have preference using a session property instead. Thanks!

@yingsu00 yingsu00 changed the title WIP - Allow HiveQueryRunner to persist data on disk Allow HiveQueryRunner to persist data on disk Dec 29, 2018

@electrum

This comment has been minimized.

Copy link
Contributor

electrum commented Dec 29, 2018

The base directory would be configured as a constructor argument to TestingPrestoServer. The difference is in configuring the directory for the entire server vs just the Hive connector.

@yingsu00 yingsu00 force-pushed the yingsu00:hivequeryrunner branch from d076c96 to 4e55b73 Dec 31, 2018

@yingsu00

This comment has been minimized.

Copy link
Contributor

yingsu00 commented Jan 2, 2019

@electrum Hi David I updated the PR using the way you suggested. Two questions 1) I added one new constructors in DistributedQueryRunner. However I saw the constructors were marked as deprecated. I didn't know why but also added the @deprecated annotation for the new constructor 2) I didn't add a new createTestingPrestoServer function but just modified it to take the baseDirPath. So the workers are updated with this baseDirPath too. I think this is fine because the baseDataDir is not used for workers. Please let me know you think otherwise. Thanks!

Show resolved Hide resolved ...rc/main/java/com/facebook/presto/server/testing/TestingPrestoServer.java
throws Exception
{
this.coordinator = coordinator;
baseDataDir = Files.createTempDirectory("PrestoTest");

if (!baseDataDirPath.isPresent()) {

This comment has been minimized.

@dain

dain Jan 3, 2019

Contributor

Use the Optional map and orElse apis instead of an if/else block

This comment has been minimized.

@yingsu00

yingsu00 Jan 8, 2019

Contributor

@dain createTempDirectory("PrestoTest") throws IOException, and Optional.orElseGet() takes a Supplier<? extends T> parameter which doesn't support throwing checked exceptions. To get around it we need to provide a lambda wrapper of Supplier but then it loses the succinctness of using lambda here. Since we also need to remember isBaseDataDirUserSupplied, I reverted back to the original way as I think it's easier to read and understand. Please let me know if there are other better ways.

@yingsu00 yingsu00 force-pushed the yingsu00:hivequeryrunner branch from 4e55b73 to aa62b60 Jan 3, 2019

@yingsu00

This comment has been minimized.

Copy link
Contributor

yingsu00 commented Jan 4, 2019

Hi @dain updated the PR with your comments fixed. Thanks for reviewing! Happy new year.

{
this.coordinator = coordinator;
baseDataDir = Files.createTempDirectory("PrestoTest");

baseDataDir = baseDataDirPath.orElse(Files.createTempDirectory("PrestoTest"));

This comment has been minimized.

@dain

dain Jan 4, 2019

Contributor

you should use the lambda version of orElse to avoid the creation of the temp directory. Also I'd static import that method.

I'd also perform some basic validation on the path since it is now user supplied. If it already exists, it should be a directory. If not, either throw an exception or possibly create the directory (not sure which is more desirable).

This comment has been minimized.

@electrum

electrum Jan 5, 2019

Contributor

The cleanup code also needs to be changed not to delete the directory if it is user provided.

This comment has been minimized.

@yingsu00

yingsu00 Jan 8, 2019

Contributor

@dain The validation was done in RawLocalFileSystem.mkdirsWithOptionalPermission() and ChecksumFileSystem.create() when creating database. In that case exceptions would be thrown on invalid user supplied path when creating database. I didn't include the validation in HiveQueryRunner.main() originally because I thought the logic would be repeated twice. But doing it in HiveQueryRunner would allow invalid path to fail sooner than later in createDatabase(), so I copied the validation logic to HiveQueryRunner.main() in the new commit.

I'd also perform some basic validation on the path since it is now user supplied. If it already exists, it should be a directory. If not, either throw an exception or possibly create the directory (not sure which is more desirable).

@yingsu00 yingsu00 force-pushed the yingsu00:hivequeryrunner branch 2 times, most recently from 14c678d to a8516d6 Jan 8, 2019

Allow HiveQueryRunner to persist data on disk
The user can pass a URI string as the single program argument of
HiveQueryRunner to be the place where he or she wants the data to be
persisted on local disk. It needs to be a URI the user has write
permission. For example, if the user pass in /Users/yingsu/, then the
folder to hold the hive data would be /Users/yingsu/hive_data. If no
argument is provided, HiveQueryRunner would remain the existing
behaviour and use a random directory as the baseDir to hold data, e.g.
/var/folders/hj/hpzdcj_9155b8364wbtf23vh262vxn/T/PrestoTest272443592426
3942855/hive_data. This would make local debugging experience more
convenient.

@yingsu00 yingsu00 force-pushed the yingsu00:hivequeryrunner branch from a8516d6 to c33d038 Jan 8, 2019

@yingsu00

This comment has been minimized.

Copy link
Contributor

yingsu00 commented Jan 10, 2019

@electrum @dain Any thought on the orElseGet()? I have addressed your other comments. Could you please take a look again? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment