Skip to content

[hailtop] Add hailtop.fs for user-level RouterFS functions #12731

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Apr 21, 2023

Conversation

daniel-goldstein
Copy link
Contributor

@daniel-goldstein daniel-goldstein commented Feb 24, 2023

CHANGELOG: Introduce hailtop.fs that makes public a filesystem module that works for local fs, gs, s3 and abs. This is now used as the Backend.fs for hail query but can be used standalone for Hail Batch users by import hailtop.fs as hfs.

Still have to do the docs but a couple questions remain:

I create a hidden singleton RouterFS object so that is used by functions in hailtop.fs. Should this singleton also be used by the Hail Query backends when they are initialized? How do we propagate configuration information such as requester_pays_bucket to the FS?

danking
danking previously requested changes Apr 6, 2023
@@ -202,7 +203,9 @@ async def create(*,
"MY_BILLING_PROJECT'"
)

async_fs = RouterAsyncFS('file')
if not isinstance(gcs_requester_pays_configuration, str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can also be None, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was initially going to throw this into another PR and then stack this on top. Do you want me to just keep it in here though?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fixed as of now!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Actually it looks like this is still failing service backend tests that explicitly request a handful of requester pays buckets, e.g.

test_requester_pays_with_project_more_than_one_partition

Copy link
Contributor

@danking danking Apr 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK, I follow what your point is. You'll need to change RouterAsyncFS to intelligently use or not use the project based on the buckets and you hoped to do that in a follow up PR?

I think, unfortunately, we have to pull that into this one because folks, at least in principle, could be relying on that behavior.

In particular, IIRC, you get an error if you include a project on a non-requester-pays bucket, right?

@daniel-goldstein daniel-goldstein marked this pull request as ready for review April 6, 2023 21:27
@daniel-goldstein
Copy link
Contributor Author

So I think I'd appreciate a review on this. Would especially appreciate feedback about the question I wrote in the PR body as well as what to do about documentation and testing:

  • We have pretty expansive FS testing, but not for these new shim functions. Should we convert some of our tests to use these functions instead of the FS objects themselves?
  • We don't have hailtop docs, and afaik this is the first module outside of hailtop.batch that would be public. Where should its docs go?

**kwargs):
if not storage_client:
if project is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daniel-goldstein you gotta fix copy.py line 118 as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danking What do you mean here? I changed copy.py line 118 to use gcs_requester_pays_configuration, which I thought was correct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must've been looking at an old test

@daniel-goldstein
Copy link
Contributor Author

Huh, quite confused as to why the local backend test hung in GCP and not azure. As far as I can tell the test that hung is test/hail/utils/test_utils.py::Tests::test_hadoop_ls_glob_2 but it passed in azure and locally. Any ideas?

@danking
Copy link
Contributor

danking commented Apr 13, 2023

Seems to have starved itself of memory? https://batch.hail.is/batches/7260182/jobs/69

@danking
Copy link
Contributor

danking commented Apr 13, 2023

Next thing I would try is running the full split. Maybe we need to run that in a memory limited way?

HAIL_RUN_IMAGE_SPLITS=14 \
HAIL_RUN_IMAGE_SPLIT_INDEX=2 \
HAIL_CLOUD=gcp \
HAIL_TEST_STORAGE_URI=/tmp/ \
HAIL_TEST_RESOURCES_DIR=./../src/test/resources/ \
HAIL_DOCTEST_DATA_DIR=./hail/docs/data \
HAIL_QUERY_BACKEND=local python3 -m pytest -Werror:::hail -Werror:::hailtop --log-cli-level=INFO -s -r A -vv --instafail --durations=50 --ignore=test/hailtop/ test

@danking
Copy link
Contributor

danking commented Apr 13, 2023

Also not the most sophisticated thing but littering the test and the things it calls with print statements, making sure the tests print in real-time, and kicking off another test is something else I'd try.

@danking
Copy link
Contributor

danking commented Apr 13, 2023

Though it does seem like something upstream of hadoop_ls_glob_2 has just sucked all the memory out of the container.

@danking
Copy link
Contributor

danking commented Apr 13, 2023

Also ask Tim about limiting RAM available to Hail & Java, we should probably keep at least a gig dedicated to Python. Maybe that will cause memory errors to appear closer to where they belong.

@danking
Copy link
Contributor

danking commented Apr 13, 2023

One more idea, can you add this https://pypi.org/project/pytest-timestamper/ package?

@daniel-goldstein
Copy link
Contributor Author

Hm, I added timestamper and merged main and the tests passed… getting retested now though

Copy link
Contributor

@danking danking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. This does remind me that we don't currently generate docs for hail top anywhere. We probably need to sit down and rethink the structure of the docs a bit.

@danking
Copy link
Contributor

danking commented Apr 21, 2023

All the tests are passing but something is hanging.

@danking
Copy link
Contributor

danking commented Apr 21, 2023

Like, pytest is hanging.

@danking
Copy link
Contributor

danking commented Apr 21, 2023

CPU use is flat after that test completes successfully.

@danking
Copy link
Contributor

danking commented Apr 21, 2023

Next thing to try is running the split on your laptop to reproduce

@danking
Copy link
Contributor

danking commented Apr 21, 2023

So, there was a test run that got cancelled (probably main branch changed) but which passed the service backend tests.

It confirms that this most recent run ran all the tests, but it has some fishy looking error outputs:

[gw2] PASSED [2023-04-21 18:31:07] test/hail/table/test_table.py::Tests::test_from_pandas_missing_and_nans Exception ignored in: <_io.FileIO name=0 mode='rb' closefd=True>
ResourceWarning: unclosed file <_io.TextIOWrapper name=0 mode='r' encoding='UTF-8'>
Exception ignored in: <_io.FileIO name=0 mode='rb' closefd=True>
ResourceWarning: unclosed file <_io.TextIOWrapper name=0 mode='r' encoding='UTF-8'>
Exception ignored in: <_io.FileIO name=0 mode='rb' closefd=True>
ResourceWarning: unclosed file <_io.TextIOWrapper name=0 mode='r' encoding='UTF-8'>
Exception ignored in: <_io.FileIO name=0 mode='rb' closefd=True>
ResourceWarning: unclosed file <_io.TextIOWrapper name=0 mode='r' encoding='UTF-8'>

Are we not cleaning up files somewhere and that's somehow hanging the system?

@danking
Copy link
Contributor

danking commented Apr 21, 2023

I pushed a commit that will error on resource warning. Hopefully we can figure out where we're leaking, fix the leaks, and stop the hangs.

@danking danking merged commit 53f7bac into hail-is:main Apr 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants