[hailctl] Initialize Hail Environment Interactively #13279
Conversation
In GCP, this code will check that the remote_tmpdir exists and is compatible with the configured regions, and will assign service account permissions. It does the same for the artifact registry.
@daniel-goldstein When you get a chance, I'd appreciate your high-level opinion on this addition before I start thoroughly testing everything. The goal is to have something before the ATGU Welcome Workshop on September 5th. I know the import locations are an issue. I mainly want feedback on whether the code passes the high-level smell test, whether I've over-designed this with too many checks and prompts, and whether I'm forgetting something that would be nice to check for.
This is super cool! I am a big fan of the idea and the overall approach, particularly when it comes to setting up the tmp bucket and getting the permissions on it correct. Here are my high-level thoughts. Sorry for the wall of text, but I found these a little hard to articulate.

Regarding number of prompts

I think this is my primary concern. There's a lot of great automation here, but it's a lot right off the bat. What this is aiming to do is make it quick and simple to start running batches, and every time someone has to stop and ask a question about how they should respond to some prompt, that process gets longer and more complicated. I think it's worth considering what the first batch people run might be and designing for a minimal first experience. IMO, a temp bucket is an absolutely crucial piece of configuration before you can do anything interesting, and configuring a temp bucket is something that belongs in this minimal setup.

Another thing that gives me a little pause is the wording around Google projects. I get that you need one to create a bucket, but I think we should just make sure to steer clear of the implication that you are "selecting a GCP project to use for Hail Batch", because that implies some link or ownership that isn't there. But I think there's a quick fix here: for a given resource that we are creating for Hail use, like the temp bucket, ask for the name first and then ask which project it should be created in, using the projects listed in gcloud as choices with the option to write in your own.

Regarding number of checks

I think it'd be good to avoid warnings when possible. From looking at this I see a pattern of checking, warning, and then prompting the user to confirm.
I think I would prefer instead to ask a leading question and explain in the prompt why the alternative option might be undesirable. Then, when they make a decision, just move on. On a broader note, I think we should focus on having good documentation and linking to it over having perfectly thorough checks.
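A minimal sketch of what that leading-question pattern could look like, using rich's Confirm (the multi-region example and wording are illustrative, not from the PR):

```python
from rich.prompt import Confirm

# Lead with the tradeoff in the prompt itself, then move on once answered,
# instead of checking, warning, and asking whether to continue.
use_multi_region = Confirm.ask(
    'Use a multi-region bucket? Multi-region storage costs more and can incur '
    'cross-region transfer charges when jobs run in a single region.',
    default=False,
)
```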
I'm thinking that I want to have this: a basic setup for the domain, remote_tmpdir, and billing_project that only selects one region. It will prompt for those values, set them, and then print out a message with the default settings, along with instructions on how to change any of the settings and warnings about the configuration.

Then we'll have optional flags for components to configure. These would prompt for the same as above as well as advanced regions settings and query settings, and create the container registry.

And lastly, a command which prints warnings about misspecified regions and tmpdirs.
Actually, we can just pick a random bucket name and create it automatically, and the user doesn't even specify the tmp directory.
This seems good. I have one objection but I might have just misunderstood the following sentence.
I'm cool with having a default random name as an option, but I think there should be a prompt for the bucket name so that the user is aware that we are creating resources on their behalf that will cost them money. I feel like they should always confirm any resource creation that will cost them money.
You're absolutely right. We should pick a default name for the prompt. Actually, I already have the code to do that!
Cool stuff! A few comments. Seems broadly on the right track.
@@ -32,8 +32,8 @@ def _load_user_config():
     user_config.read(config_file)


-def get_user_config() -> configparser.ConfigParser:
-    if user_config is None:
+def get_user_config(reload=False) -> configparser.ConfigParser:
I don't like this. It seems too easy to create races, and the only use case is printing the config after init.
async def grant_service_account_bucket_write_access(storage_client, bucket_info: BucketInfo, service_account: str):
    service_account_member = f'serviceAccount:{service_account}'
    await storage_client.grant_bucket_write_access(bucket_info.name, service_account_member)
Seems like the `grant_bucket_write_access` function is easy to use incorrectly (e.g. forgetting to prepend `serviceAccount:`). Maybe it should have required, mutually exclusive keyword args `user=` or `serviceAccount=`?
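A sketch of that suggestion; the method name comes from the snippet above, while the internal helper is a hypothetical stand-in for whatever actually edits the IAM policy:

```python
from typing import Optional

async def grant_bucket_write_access(self, bucket: str, *,
                                    user: Optional[str] = None,
                                    service_account: Optional[str] = None) -> None:
    # Require exactly one member kind so callers can't forget (or double-apply)
    # the 'serviceAccount:' prefix.
    if (user is None) == (service_account is None):
        raise ValueError('specify exactly one of user= or service_account=')
    member = f'user:{user}' if user is not None else f'serviceAccount:{service_account}'
    await self._grant_bucket_access(bucket, member)  # hypothetical internal helper
```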
    'roles/storage.objectCreator',
)

iam_policy = await storage_client.get_bucket_iam_policy(bucket_info.name)
I don't think you can assume you have permission to read the IAM policy. I don't know where you plan to use `check_service_account_has_bucket_write_access`, but if this is a bucket that Hail might write to anyway, it seems very reasonable to just write to it.
bindings = iam_policy['bindings']
for binding in bindings:
    role = binding['role']
    if role in write_roles:
This isn't quite right. There can be custom roles that grant write access.
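One way to sidestep both problems (needing permission to read the IAM policy, and custom roles): probe the permission by actually writing. A sketch; `insert_object`/`delete_object` are assumed method names and the 403 handling is illustrative:

```python
async def can_write_to_bucket(storage_client, bucket: str) -> bool:
    probe = '.hail-write-access-probe'
    try:
        # Writing (then removing) a zero-byte object exercises the exact
        # permission Hail needs, no matter which role grants it.
        await storage_client.insert_object(bucket, probe, b'')
        await storage_client.delete_object(bucket, probe)
        return True
    except Exception as e:  # ideally narrowed to the client's HTTP error type
        if getattr(e, 'status', None) == 403:
            return False
        raise
```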
await storage_client.insert_bucket(bucket_info.project, body=body)


async def check_service_account_has_bucket_read_access(storage_client, bucket_info: BucketInfo, service_account: str) -> bool:
Same argument as below: just check that you can list the bucket. That's not exactly the same as read, but in practice Hail needs both list and read.
'name': bucket_info.name,
'location': bucket_info.location,
'locationType': bucket_info.location_type,
'lifecycle': {'rule': [{'action': {'type': 'Delete'}, 'condition': {'age': bucket_info.retention_policy_days}}]}
This is not a retention policy, it's a lifecycle policy.
Retention is the opposite: it prevents deletion.
    return self.location_type.lower() != 'region'


async def get_gcp_bucket_information(storage_client, bucket: str) -> Optional[BucketInfo]:
This function is in `hailctl.config.utils`, so I'm going to apply the scrutiny I would to any method made available to all Hail developers. The fact that this doesn't read and set the lifecycle policy information is confusing. It makes me think that `None` can mean either (1) there is no lifecycle policy or (2) we just don't know the lifecycle policy.
async def get_gcp_default_project() -> Optional[str]:
    from hailtop.utils import check_shell_output  # pylint: disable=import-outside-toplevel
    try:
        project, _ = await check_shell_output("gcloud config list --format 'value(core.project)'")
`gcloud config get-value project` feels more direct.
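For concreteness, the same helper using that command (a sketch; it assumes `check_shell_output` returns stdout/stderr as bytes, as the snippet above implies):

```python
from typing import Optional
from hailtop.utils import check_shell_output

async def get_gcp_default_project() -> Optional[str]:
    project, _ = await check_shell_output('gcloud config get-value project')
    return project.decode('utf-8').strip() or None
```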
`project` is defined as the default project by the gcloud config docs: https://cloud.google.com/sdk/gcloud/reference/config/set
f'Does the bucket {bucket} exist? '
f'Have you installed gcloud? For directions see https://cloud.google.com/sdk/docs/install '
f'Do you have admin privileges to grant permissions to resources? '
f'Are you logged into your google account by running `gcloud auth login`?')
We should catch the exception that says "you do not have permission" and specifically suggest that the user request the roles "XXX" and "YYY" from the relevant admin.
remote_tmpdir = f'{bucket_name}/batch/tmp'
errors = False

retention_days = IntPrompt.ask(f'How many days should files be retained in bucket {bucket_name}?', default=30)
Let's not use the words "retained" and "retention", because Google uses those words to talk about prevention of deletion, not automatic deletion. Let's just say directly: "After how many days should files be automatically deleted in your temporary bucket, {bucket_name}?"
And probably you need to give them an option to not do that.
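Putting the two comments together, a sketch of the reworded prompt with an opt-out (wording and defaults are illustrative):

```python
from typing import Optional
from rich.prompt import Confirm, IntPrompt

bucket_name = 'my-hail-tmp-bucket'  # example value

lifecycle_days: Optional[int] = None
if Confirm.ask(f'Should files in {bucket_name} be automatically deleted after some number of days?', default=True):
    lifecycle_days = IntPrompt.ask(
        f'After how many days should files be automatically deleted from bucket {bucket_name}?',
        default=30,
    )
# lifecycle_days is None => create the bucket without a lifecycle rule
```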
Closing to ease CI burden.
I like this level of granularity in terms of the initial configuration. Added some more detailed comments, mostly around managing user expectations.
import typer

from typer import Abort, Exit, Option as Opt
# from rich import print
Remove
def initialize(
    incremental: Ann[bool, Opt('--incremental', '-i', help='Do not destroy the existing configuration before setting new variables.')] = False,
I feel like it's safer to default to not erasing anything from the user's config. How about instead having a `--overwrite` flag that explicitly overwrites any existing configuration?
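In the style of the diff above, the inverted flag might look like this (a sketch; the `Ann`/`Opt` aliases follow the PR's imports and the alias for `Annotated` is an assumption):

```python
from typer import Option as Opt
from typing_extensions import Annotated as Ann  # assumed alias, matching the PR's style

def initialize(
    overwrite: Ann[bool, Opt('--overwrite', help='Destroy the existing configuration before setting new variables.')] = False,
):
    ...
```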
if location != expected_region or bucket_info['locationType'] != 'region':
    typer.secho(f'WARNING: remote temporary directory {remote_tmpdir} is not located in {expected_region} or is multi-regional. '
                f'Found {location} with location type {bucket_info["locationType"]}.',
                fg=typer.colors.MAGENTA)
I feel like a yellow/gold is a more expected warning color.
lifecycle_days = IntPrompt.ask(
    f'After how many days should files be automatically deleted from bucket {bucket_name}?', default=30)
if lifecycle_days <= 0:
    typer.secho(f'invalid value for lifecycle rule in days {lifecycle_days}', fg=typer.colors.RED)
Suggested change:
-    typer.secho(f'invalid value for lifecycle rule in days {lifecycle_days}', fg=typer.colors.RED)
+    typer.secho(f'Invalid value for lifecycle rule in days {lifecycle_days}', fg=typer.colors.RED)
        verbose=verbose,
    )
    typer.secho(f'Created bucket {bucket_name} in project {project} with lifecycle rule set to {lifecycle_days} days.', fg=typer.colors.GREEN)
except InsufficientPermissions as e:
Let's put the minimum amount of code we expect might fail in the `try` block, which AFAICT starts on the `await create_gcp_bucket` line.
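Concretely, something like this (a sketch; `create_gcp_bucket`'s parameters are assumed from the surrounding snippets):

```python
try:
    # only the call we expect might fail lives in the try block
    await create_gcp_bucket(project=project, bucket=bucket_name,
                            lifecycle_days=lifecycle_days, verbose=verbose)
except InsufficientPermissions as e:
    typer.secho(str(e), fg=typer.colors.RED)
    raise typer.Exit(1)

typer.secho(f'Created bucket {bucket_name} in project {project} with lifecycle rule set to {lifecycle_days} days.',
            fg=typer.colors.GREEN)
```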
def get_default_region(supported_regions: List[str], cloud: str) -> Optional[str]:
    if cloud == 'gcp':
        if 'us-central1' in supported_regions:
            return 'us-central1'
Is this supposed to represent the default region in the deploy config? This isn't necessarily true for, e.g., the Australians. How about another endpoint, like the one for supported regions?

If this doesn't represent the default region of the Hail Batch service, then I think we should not have a default and force the user to think about which region they really want.
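If the service doesn't report a default, the "force a choice" variant could be as simple as this sketch (using rich; the region list is illustrative):

```python
from rich.prompt import Prompt

supported_regions = ['us-central1', 'australia-southeast1']  # e.g. from a supported-regions endpoint

# No default= argument: the user must pick a region deliberately.
region = Prompt.ask('Which region should your Hail resources live in?', choices=supported_regions)
```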
labels_str = None

try:
    await check_shell(f'gcloud --project {project} storage buckets create gs://{bucket} --location={location}', echo=verbose)
To use `check_shell` we need to shell-quote, but this isn't really using any shell functionality, just the `gcloud` executable, so this should use `check_exec_output`.
I find writing commands for `check_exec_output` to be significantly harder to read.
Fair point, but I think using the shell when we don't have to is a footgun, because we could forget to properly quote inputs, and shell invocations end up in bash history files (so we have to be careful not to `check_shell` with anything sensitive). What if we just made a wrapper for `check_exec_output` that takes a single string, splits on spaces, and passes it to `check_exec_output`?
I don't like this option. It's too hard to get the splitting correct. I'll just make it use `check_exec_output`.
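The agreed-upon form might look like this (a sketch; it assumes `hailtop.utils.check_exec_output` takes the program and its arguments separately with an `echo` flag, mirroring `check_shell_output`):

```python
from hailtop.utils import check_exec_output

async def create_bucket(project: str, bucket: str, location: str, verbose: bool = False):
    # Arguments are passed separately, so nothing needs shell quoting.
    await check_exec_output(
        'gcloud', '--project', project, 'storage', 'buckets', 'create',
        f'gs://{bucket}', f'--location={location}',
        echo=verbose,
    )
```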
              f'to assign you the StorageIAMAdmin role in Google Cloud Storage or ask them to update the bucket {bucket} on your behalf by giving ' \
              f'service account {service_account} the role "roles/storage.objectCreator".'
        raise InsufficientPermissions(msg) from e
    raise
We can replace these two functions with one function that also accepts a `role`.
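The consolidation could look like this (a sketch; `BucketInfo` comes from the PR, and `grant_bucket_access` is a hypothetical client method taking an explicit role):

```python
async def grant_service_account_bucket_access(
        storage_client, bucket_info: BucketInfo, service_account: str, role: str) -> None:
    member = f'serviceAccount:{service_account}'
    await storage_client.grant_bucket_access(bucket_info.name, role, member)

# write access: role='roles/storage.objectCreator'
# read access:  role='roles/storage.objectViewer'
```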
except CalledProcessError as e:
    if 'does not have storage.buckets.get access to the Google Cloud Storage bucket' in e.stderr.decode('utf-8'):
        msg = f'ERROR: You do not have the necessary permissions to update bucket {bucket} in project {project}. Ask a project administrator ' \
              f'to assign you the StorageAdmin role in Google Cloud Storage or ask them to update the bucket {bucket} on your behalf.'
Users only need that role on that particular bucket, not project-wide.
except CalledProcessError as e:
    if 'does not have storage.buckets.create access to the Google Cloud project' in e.stderr.decode('utf-8'):
        msg = f'ERROR: You do not have the necessary permissions to create buckets in project {project}. Ask a project administrator ' \
              f'to assign you the StorageAdmin role in Google Cloud Storage or ask them to create the bucket {bucket} on your behalf.'
Can we replace this with a narrower role that allows users to create buckets?
I think I addressed most of the comments. I haven't tested the new code -- I don't want to do that until we're happy with it. I don't know that I like how this is turning out. I think we're conflating what
Agreed. On the issue of login, I think we can appease both points of view: if you're not logged in, have init log you in.
I think we should punt on addressing domain. Broad users don't need to interact with it at all. What are the other concerns about how this is developing?
(And to be clear, I'm on board with the idea of a single command which gets you from zero to simple batch pipelines; the jury is out on the Artifact Registry. I think making sure that's configured correctly might be better upstreamed to Sam. Normal users might not have permission to enable/disable things like that.)
Hm, I guess I just don't personally see the value in one command instead of two. If the workshop is meant to teach users how to use batch, and one aspect of using batch is knowing how to log in, shouldn't they go through the motions?
That's a fair counterpoint, but wouldn't it be annoying if you did:
And it was like:
Why not just do it for me?
I suppose this is where I started getting entangled with the domain issue. If the Australians run a workshop, what should happen when their users run it?

While it doesn't work this way today, I imagine that we should ultimately configure
Heck, not even the Australians: what about Azure?
I agree that we never carefully considered switching domains at the user level. I'm mostly concerned with Broadies using the GCP instance for now. I think how to deal with domains really gets into the idea of having config profiles, which just makes all of this more complicated. Ergo, my gut is: if you're not a Broadie using Google, you have to do some extra work, but that set of users is very small.
But let's solve for Aus and Azure as a separate thread of work that doesn't block this PR. I imagine whatever we decide will either make this command take a domain or enhance it to create a profile per domain.
Could you please take one more broad look over this before I start testing again with the new changes?
This looks great, I love how straightforward it is.
This reverts commit 11ee703.
I wanted to get feedback on the code I've written thus far before I start testing everything. I'm worried it might be too complicated / brittle to maintain.
I also chose to blow away the entire existing environment rather than resetting the variables with new values. Not sure if that's what we want.