Fix temp directory permission issue on worker side #5684

Merged
WeichenXu123 merged 5 commits into mlflow:master on Apr 13, 2022

Conversation

WeichenXu123
Collaborator

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

What changes are proposed in this pull request?

(Please fill in changes proposed in this fix)

How is this patch tested?

(Details)

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly by following the steps below.
  1. Check the status of the ci/circleci: build_doc check. If it's successful, proceed to the
    next step, otherwise fix it.
  2. Click Details on the right to open the job page of CircleCI.
  3. Click the Artifacts tab.
  4. Click docs/build/html/index.html.
  5. Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@@ -503,6 +503,7 @@ def get_or_create_tmp_dir():
        os.makedirs(tmp_dir, exist_ok=True)
    else:
        tmp_dir = tempfile.mkdtemp()
        os.chmod(tmp_dir, 0o777)
Member

Do we need the workers to have write permission on this location? Could we use 0o755 instead?

Member

Never mind. I just checked the docs: os.makedirs() defaults to mode 0o777 (before the umask is applied),
while tempfile.mkdtemp() creates the directory with 0o700.
This looks good to me!
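
To illustrate the point about defaults, here is a minimal sketch (not the MLflow code itself) that inspects the mode of a directory created by tempfile.mkdtemp() before and after os.chmod(). It assumes a POSIX system; the variable names are only for illustration.

```python
import os
import stat
import tempfile

# tempfile.mkdtemp() creates the directory readable, writable, and
# searchable only by the creating user (mode 0o700 on POSIX).
tmp_dir = tempfile.mkdtemp()
mode_before = stat.S_IMODE(os.stat(tmp_dir).st_mode)

# os.chmod() sets the mode bits directly; unlike the mode argument of
# os.makedirs(), it is not filtered through the process umask.
os.chmod(tmp_dir, 0o777)
mode_after = stat.S_IMODE(os.stat(tmp_dir).st_mode)

print(oct(mode_before), oct(mode_after))

# Clean up the directory created for this demonstration.
os.rmdir(tmp_dir)
```

Calling os.chmod() after creation, rather than passing a mode at creation time, is what makes the resulting permissions independent of the umask.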

BenWilson2 added the rn/none (List under Small Changes in Changelogs) label on Apr 13, 2022
Member

@BenWilson2 BenWilson2 left a comment

LGTM!

@WeichenXu123
Collaborator Author

Let me add some comments before merging

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
mlflow/utils/file_utils.py (outdated review thread, resolved)
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@@ -526,6 +529,9 @@ def get_or_create_nfs_tmp_dir():
        os.makedirs(tmp_nfs_dir, exist_ok=True)
    else:
        tmp_nfs_dir = tempfile.mkdtemp(dir=nfs_root_dir)
        # mkdtemp creates a directory with permission 0o700
        # change it to be 0o777 to ensure it can be seen in spark UDF
Member

@harupy harupy Apr 13, 2022

Does tmp_nfs_dir have to be writable in spark UDF?

Collaborator Author

Because os.makedirs also uses 0o777 by default, I kept this consistent with it.

Member

@harupy harupy Apr 13, 2022

So it has to be writable in spark UDF?

Collaborator

I don't think it strictly needs to be writeable, but this was the preexisting behavior for the Spark model cache. To avoid regressions, I think it's best to use the same permissions level.

Member

@harupy harupy Apr 13, 2022

Makes sense! I asked because "can be seen" sounds like read permission should be enough.
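
As a rough illustration of the read-versus-write question in this thread, the sketch below decodes what the "other" permission bits of a directory mode allow. The helper name describe_other_access is hypothetical and not part of MLflow; it assumes POSIX permission semantics and that the Spark UDF process may run as a user other than the directory's owner.

```python
import stat


def describe_other_access(mode: int) -> dict:
    """Report what users outside the owner and group may do with a directory."""
    can_read = bool(mode & stat.S_IROTH)     # read bit: list directory entries
    can_search = bool(mode & stat.S_IXOTH)   # execute bit: traverse into the directory
    can_write = bool(mode & stat.S_IWOTH)    # write bit: modify directory entries
    return {
        "list": can_read,
        "traverse": can_search,
        # Creating files requires both the write and the execute bit.
        "create": can_write and can_search,
    }


for mode in (0o700, 0o755, 0o777):
    print(oct(mode), describe_other_access(mode))
```

Under these assumptions, 0o755 would be enough for other users to see the directory, while 0o777 additionally lets them create entries in it, which matches the preexisting behavior the thread chose to keep.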

Collaborator

@dbczumar dbczumar left a comment

LGTM! Thanks @WeichenXu123 !

Collaborator

@dbczumar dbczumar left a comment

@WeichenXu123 Before merging, can we resolve the os.makedirs() permission exceptions thrown when operating on Table ACL clusters?

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Comment on lines +33 to +34
# Test whether the NFS directory is writable.
test_path = os.path.join(nfs_root_dir, uuid.uuid4().hex)
Member

@harupy harupy Apr 13, 2022

Do we remove this line before the merge?

Member

@harupy harupy Apr 13, 2022

I think I understood what we're trying to do.

Collaborator

We'll keep this line because we now test whether we can write to NFS by attempting to create a directory at test_path.

Member

@harupy harupy Apr 13, 2022

Can we use os.access(nfs_root_dir, os.W_OK)?

ref: https://stackoverflow.com/a/2113511/6943581

Or does this approach not necessarily guarantee that nfs_root_dir is writable?

Collaborator

I think we should try to write, as is done in the current PR. The best way to test for write access is to try to write.

From the comments of https://stackoverflow.com/a/2113511/6943581:

"Testing a directory for just the write bit isn't enough if you want to write files to the directory. You will need to test for the execute bit as well if you want to write into the directory: os.access('/path/to/folder', os.W_OK | os.X_OK). With os.W_OK by itself you can only delete the directory (and only if that directory is empty)."

If we're not careful, we might make a mistake.
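
The "try to write" approach described here can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: the helper name is_dir_writable is hypothetical, and a plain file is used as a stand-in for a root that cannot be written into.

```python
import os
import shutil
import tempfile
import uuid


def is_dir_writable(root_dir: str) -> bool:
    """Return True if a new directory can actually be created under root_dir."""
    # A random name avoids colliding with anything that already exists.
    test_path = os.path.join(root_dir, uuid.uuid4().hex)
    try:
        os.makedirs(test_path)
        return True
    except Exception:
        # Any failure (permissions, read-only mount, bad root) means "not usable".
        return False
    finally:
        # Remove the probe directory; ignore_errors covers the case where it was never created.
        shutil.rmtree(test_path, ignore_errors=True)


root = tempfile.mkdtemp()
not_a_dir = os.path.join(root, "plain_file")
open(not_a_dir, "w").close()

writable = is_dir_writable(root)        # a fresh temp dir should accept the probe
blocked = is_dir_writable(not_a_dir)    # a regular file cannot contain directories
leftovers = os.listdir(root)            # the probe should leave nothing behind

print(writable, blocked, leftovers)
shutil.rmtree(root)
```

Compared with os.access(), the probe exercises the same operation the caller will later perform, so it does not depend on interpreting individual permission bits correctly.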

Member

Sounds good!

Collaborator

@dbczumar dbczumar left a comment

LGTM! Thanks a bunch, @WeichenXu123 !

    # directory, in this case, return None representing NFS is not available.
    return None
finally:
    shutil.rmtree(test_path, ignore_errors=True)
Member

@harupy harupy Apr 13, 2022

Does this line fail when test_path doesn't exist, or does ignore_errors=True swallow the exception?

Collaborator

Good catch! Fortunately, it swallows:

In [11]: shutil.rmtree("/tmp/asfsadfsdf", ignore_errors=True)

In [12]: shutil.rmtree("/tmp/asfsadfsdf")
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-12-8bc109b15d91> in <module>
----> 1 shutil.rmtree("/tmp/asfsadfsdf")

~/opt/anaconda3/lib/python3.8/shutil.py in rmtree(path, ignore_errors, onerror)
    704             orig_st = os.lstat(path)
    705         except Exception:
--> 706             onerror(os.lstat, path, sys.exc_info())
    707             return
    708         try:

~/opt/anaconda3/lib/python3.8/shutil.py in rmtree(path, ignore_errors, onerror)
    702         # lstat()/open()/fstat() trick.
    703         try:
--> 704             orig_st = os.lstat(path)
    705         except Exception:
    706             onerror(os.lstat, path, sys.exc_info())

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/asfsadfsdf'

Member

@harupy harupy Apr 13, 2022

Confirmed this line doesn't fail.
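
The behavior confirmed above can be reproduced in a script with a small sketch like the one below. The path is a randomly named entry inside a fresh temporary directory, chosen only so that it is known not to exist.

```python
import os
import shutil
import tempfile
import uuid

# A path that does not exist: a random name inside a newly created temp dir.
parent = tempfile.mkdtemp()
missing = os.path.join(parent, uuid.uuid4().hex)

# With ignore_errors=True, removing a missing path is silently ignored.
shutil.rmtree(missing, ignore_errors=True)

# Without it, the same call raises FileNotFoundError.
try:
    shutil.rmtree(missing)
    raised = False
except FileNotFoundError:
    raised = True

print(raised)

# Clean up the temp dir created for this demonstration.
shutil.rmtree(parent)
```

This is why the cleanup in the finally block is safe even when the probe directory was never created.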

@WeichenXu123 WeichenXu123 merged commit bd8854f into mlflow:master Apr 13, 2022
WeichenXu123 added a commit to WeichenXu123/mlflow that referenced this pull request Apr 13, 2022
* update

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

* update

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

* update

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

* update

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

* update

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
(cherry picked from commit bd8854f)
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
WeichenXu123 added a commit that referenced this pull request Apr 13, 2022
(cherry picked from commit bd8854f)
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Labels
rn/none List under Small Changes in Changelogs.
4 participants