
[BUG] IsADirectoryError while selecting S3 artifact in the UI #3154

Closed
1 task done
amiryi365 opened this issue Jul 22, 2020 · 51 comments
Labels
area/artifacts (Artifact stores and artifact logging) · bug (Something isn't working) · priority/awaiting-more-evidence (Lowest priority. Possibly useful, but not yet enough support to actually get it done.)

Comments

amiryi365 commented Jul 22, 2020

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): centos 7.4
  • MLflow installed from (source or binary): binary
  • MLflow version (run mlflow --version): 1.9.0
  • Python version: 3.7.3
  • npm version, if running the dev UI: NA
  • Exact command to reproduce:
  • S3 packages: botocore 1.14.14, boto3 1.11.14

Describe the problem

My mlflow server runs on CentOS with a PostgreSQL backend store and S3 (minio) artifact storage:
mlflow server --backend-store-uri postgresql://<pg-location-and-credentials> --default-artifact-root s3://mlflow -h 0.0.0.0 -p 8000
I set all the relevant S3 env vars:
MLFLOW_S3_ENDPOINT_URL, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION
I've successfully executed several runs from another machine against this server:
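As a quick sanity check (a hypothetical snippet, not part of the original report), you can verify that these variables are actually visible to the process in question:

```python
import os

# The four variables listed above must be visible to the mlflow server
# process, and to any client that logs artifacts.
required = ["MLFLOW_S3_ENDPOINT_URL", "AWS_ACCESS_KEY_ID",
            "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION"]
missing = [name for name in required if name not in os.environ]
print("missing:", missing or "none")
```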

  • Runs all finished OK, with params, metrics and artifacts.

  • The PostgreSQL mlflow tables were updated accordingly.

  • All artifacts were stored in the minio bucket as expected, and I can display and download them in the minio browser.

However, when I select any artifact in the UI, I get an Internal Server Error in the browser.

Other info / logs

In the mlflow server I see the following error:

ERROR mlflow.server: Exception on /get-artifact [GET]
# I skip most of the traceback
File "<python-path>/site-packages/mlflow/server/handlers.py", line 180, in get_artifact_handler
    return send_file(filename, mimetype='text/plain', as_attachment=True)
File "<python-path>/site-packages/flask/helpers.py", line 629, in send_file
    file = open(filename, "rb")
IsADirectoryError: [Errno 21] Is a directory: '/tmp/<generated-name>/<my-file>'

Indeed, '/tmp/<generated-name>/' really is a directory and not a file!
This folder contains another directory with a generated name, and that one is empty!

I didn't find any similar error regarding mlflow and s3.
What's wrong?
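For context, the final frame of the traceback above is easy to reproduce in isolation: calling open() on a path that is actually a directory raises exactly this error. A minimal sketch, unrelated to mlflow itself:

```python
import errno
import os
import tempfile

# Mimic what flask's send_file does: it simply open()s the resolved path.
tmp = tempfile.mkdtemp()
bogus = os.path.join(tmp, "myfile.log")
os.mkdir(bogus)  # a *directory* named like a file, as in the traceback

caught = None
try:
    with open(bogus, "rb") as f:
        f.read()
except IsADirectoryError as e:
    caught = e

print(caught)  # on Linux: [Errno 21] Is a directory: ...
```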

What component(s), interfaces, languages, and integrations does this bug affect?

Components

  • area/artifacts: Artifact stores and artifact logging
@amiryi365 amiryi365 added the bug Something isn't working label Jul 22, 2020
harupy (Member) commented Jul 22, 2020

@amiryi365 Thanks for filing this issue. I was able to reproduce the same error using a folder named test.txt. Does your folder name contain one of the text file extensions listed below?

_TEXT_EXTENSIONS = ['txt', 'log', 'yaml', 'yml', 'json', 'js', 'py',
                    'csv', 'tsv', 'md', 'rst', MLMODEL_FILE_NAME, MLPROJECT_FILE_NAME]
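The name-based check can be sketched roughly like this (a simplified illustration, not the exact handler code; looks_like_text_file is a hypothetical name, and the two file-name constants are replaced by their literal values):

```python
import posixpath

_TEXT_EXTENSIONS = ['txt', 'log', 'yaml', 'yml', 'json', 'js', 'py',
                    'csv', 'tsv', 'md', 'rst', 'MLmodel', 'MLproject']

def looks_like_text_file(artifact_path):
    # Purely name-based: a *directory* named "test.txt" passes this check
    # just like a real text file would, which is how the repro above works.
    name = posixpath.basename(artifact_path)
    ext = name.rsplit('.', 1)[-1] if '.' in name else name
    return ext in _TEXT_EXTENSIONS

print(looks_like_text_file("test.txt"))       # True
print(looks_like_text_file("config"))         # False
print(looks_like_text_file("logs/foo.json"))  # True
```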

@harupy harupy added the area/artifacts Artifact stores and artifact logging label Jul 22, 2020
amiryi365 (Author):

Thanks for replying, @harupy.
Yes, all my artifacts are text files from the list (log, json, py). I also found this list in the source...
Is there a workaround for this bug?

harupy (Member) commented Jul 23, 2020

@amiryi365 The error indicates your folder name contains a text file extension. Can you share the code you used to log artifacts?

amiryi365 (Author) commented Jul 23, 2020

@harupy the bug is probably not in my code, i.e. I wrote the run code, but it worked well (no exceptions; I can see all run details in the PG tables and in the UI; I can see all the files in the minio browser, download them, and they look fine).
The bug occurs in the mlflow server when the web client requests an artifact's content.
In my code I do the following (only the mlflow-related lines):
At start:

mlflow.set_experiment(experiment_name)
active_run = mlflow.start_run(run_name=run_name, nested=nested)
experiment = mlflow.get_experiment_by_name(experiment_name)
mlflow.set_tags(experiment.tags)
mlflow.log_params(params)
mlflow.log_artifact(config)   # config is a .py file

I don't use "with mlflow.start_run(...) as active_run" because I have my own with on my class; the above code is in its __enter__().
During the run:
I do things like:

mlflow.log_metric(key, value, step)
mlflow.log_artifact(local_path)

Finally I do in my __exit__():

mlflow.log_artifact(log_file_path)
# status is RunStatus.to_string(RunStatus.FAILED) or RunStatus.to_string(RunStatus.FINISHED),
# depending on the __exit__ exc_type arg (i.e. FAILED for an exception)
mlflow.end_run(status)

harupy (Member) commented Jul 23, 2020

@amiryi365 I see. Are you experiencing an error like the one below?

[screenshot: Screen Shot 2020-07-23 at 14 41 20]

amiryi365 (Author):

Exactly!

harupy (Member) commented Jul 23, 2020

@amiryi365 Can you take a screenshot and share it if possible?

amiryi365 (Author):

@harupy I can't. But you say you can reproduce the bug...
It seems to be a server-side bug in reading from S3.
On both the server and client sides I use env vars for S3; maybe it's worth running the server with an AWS configuration file instead?

More info: my client is Windows 10 and I run it directly as Python (from PyCharm), not with mlflow run

harupy (Member) commented Jul 23, 2020

@amiryi365 Got it. Do your artifacts contain a folder like mine in the image above?

amiryi365 (Author):

@harupy I logged all artifacts under 'log' but it didn't help...

amiryi365 (Author):

@harupy In your image I can see you're using a file URI and not an S3 URI

harupy (Member) commented Jul 23, 2020

@amiryi365 Yep, I just wanted to show that a folder named like xxx.txt or log causes the error.

amiryi365 (Author):

I've just used 2 folders: 'logs' and 'config' - the bug is still there for all files!

harupy (Member) commented Jul 23, 2020

So your folder structure looks like:

- config (folder)
  - foo.xxx (file)
  ...

- logs (folder)
  - bar.yyy (file)
  ...

and when you try to open foo.xxx or bar.yyy, the error occurs, correct?

amiryi365 (Author):

@harupy Exactly!

harupy (Member) commented Jul 23, 2020

@amiryi365

Actually, there is a '/tmp/<generated-name>/' which is really a directory and not a file!
This folder contains another directory with a generated name, and inside there's nothing!

Does this mean that you have a folder named like /tmp/path/to/foo.xxx and there is nothing in it?

amiryi365 (Author) commented Jul 23, 2020

@harupy There's a folder named like /tmp/path/to/foo.xxx.
Inside this folder there's another folder with an auto-generated name (a long name of letters and digits).
Inside that folder there's nothing.

harupy (Member) commented Jul 23, 2020

@amiryi365 I have set up a minio server following this doc and tested artifact logging, but wasn't able to reproduce the issue.

[screenshot: minio]

code:

import mlflow

EXPERIMENT_NAME = 'minio'
BUCKET_NAME = 'test'

if not mlflow.get_experiment_by_name(EXPERIMENT_NAME):
    mlflow.create_experiment(EXPERIMENT_NAME, f's3://{BUCKET_NAME}')

mlflow.set_experiment(EXPERIMENT_NAME)

with mlflow.start_run():
    mlflow.log_param('p', 1)
    mlflow.log_metric('m', 1)
    mlflow.log_artifact('minio.py')
    mlflow.log_artifact('minio.py', artifact_path='data')

@harupy harupy added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Jul 23, 2020
amiryi365 (Author):

@harupy I see...
I'm trying to find meaningful differences between our mlflow servers:

  1. What are your botocore and boto3 package versions? (I have botocore 1.14.14, boto3 1.11.14)

  2. Could it be related to the server OS, env, other services, etc.? (my server runs on Linux CentOS 7.4)

  3. How do you define the minio credentials on the server? (I tried env vars and also the ~/.aws/credentials config file)

amiryi365 (Author):

@harupy Hi there!
I think I found a bug in the mlflow code,
in mlflow/store/artifact/s3_artifact_repo.py, function _download_file.
The original code is:

def _download_file(self, remote_file_path, local_path):
    (bucket, s3_root_path) = data.parse_s3_uri(self.artifact_uri)
    s3_full_path = posixpath.join(s3_root_path, remote_file_path)
    s3_client = self._get_s3_client()
    s3_client.download_file(bucket, s3_full_path, local_path)

My patch fix is:

def _download_file(self, remote_file_path, local_path):
    (bucket, s3_root_path) = data.parse_s3_uri(self.artifact_uri)
    s3_full_path = s3_root_path   # CHANGE IS HERE
    s3_client = self._get_s3_client()
    s3_client.download_file(bucket, s3_full_path, local_path)

It seems s3_root_path already includes the filename, e.g.:
s3_root_path = '3/xxxxxxxxxxxxxxxxxxx/artifacts/myfile.log'
and remote_file_path is the file path under the artifacts, e.g.:
remote_file_path = 'myfile.log'
So s3_full_path = posixpath.join(s3_root_path, remote_file_path) doubles the filename, e.g.:
3/xxxxxxxxxxxxxxxxxxx/artifacts/myfile.log/myfile.log
And that causes the bug!
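The doubling described above is easy to verify in isolation with posixpath (path values follow the example above, with the run id shortened):

```python
import posixpath

# If s3_root_path mistakenly already ends with the file name,
# the join duplicates it:
s3_root_path = "3/xxx/artifacts/myfile.log"
remote_file_path = "myfile.log"
full = posixpath.join(s3_root_path, remote_file_path)
print(full)  # 3/xxx/artifacts/myfile.log/myfile.log

# With a correct root, the join yields the expected S3 key:
print(posixpath.join("3/xxx/artifacts", remote_file_path))  # 3/xxx/artifacts/myfile.log
```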
I don't know why it behaves like that on my machine and differently on other machines...
(if I'm right, you should see the same bug in your test)

Any idea?
Thanks :-)

harupy (Member) commented Jul 26, 2020

@amiryi365 Thanks!

It seems s3_root_path includes already the filename, e.g.: s3_root_path = '3/xxx/artifacts/myfile.log'

This indicates that self.artifact_uri is set to something like s3://your_bucket_name/3/xxx/artifacts/myfile.log. Is artifact_uri set correctly?

amiryi365 (Author) commented Jul 27, 2020

@harupy I'm not sure...
Actually I don't know how to debug the server, as it has nested runs...
I also failed to find a way to get a useful debug-level log (using --gunicorn-opts "--log-level debug" didn't help much...)
If you could help me with those it would be great!
I just took code samples from the mlflow package, changed them a bit to make them independent, then ran them with my own params; maybe I was wrong...

harupy (Member) commented Jul 27, 2020

@amiryi365 You can use mlflow.get_artifact_uri, which returns the current artifact_uri:

with mlflow.start_run() as parent_run:
    print(mlflow.get_artifact_uri())

    with mlflow.start_run(nested=True) as child_run:
        print(mlflow.get_artifact_uri())
        ...

amiryi365 (Author) commented Jul 27, 2020

@harupy I get s3://mlflow/3/xxx/artifacts, so I was wrong about the params I put in my test...
Anyway, I'm talking about the server, not the client!
My client seems to work well, and I can see the resulting artifacts in the minio browser, as I told you before.
I meant that the mlflow server consists of nested Python runs (not mlflow runs), so running it with "-m pdb" doesn't help with debugging...
I want to debug the server in order to understand what's wrong and why. Could you help me with that?
I need the right technique to use the debugger inside mlflow/store/artifact/s3_artifact_repo.py, for instance.

harupy (Member) commented Jul 27, 2020

@amiryi365 What do you mean by nested python runs?

amiryi365 (Author) commented Jul 27, 2020

@harupy I ran something like mlflow server -m pdb ... and put breakpoints in s3_artifact_repo.py, but it never stopped there.
Looking at the code, I understand that the server runs an internal python command, and that's probably the reason...
In general, what I'm looking for is the right technique to debug the server!
Also, because it's an installed package, I couldn't change the code (e.g. add prints), because it compiled the original code even after I deleted the pyc file... (I don't have a lot of experience playing with python like this)

harupy (Member) commented Jul 27, 2020

@amiryi365 Did you try inserting prints to check the values passed to _download_file?

amiryi365 (Author):

@harupy see above (I edited it); I already told you what my problem was with adding prints

harupy (Member) commented Jul 27, 2020

@amiryi365 Sorry, I missed the edit. Actually, you can change the installed package's code. The command below shows where s3_artifact_repo.py is located. You can just open it and tweak the code to debug (assuming you have access to the centos machine that mlflow server runs on).

python -c "import mlflow; print(mlflow.store.artifact.s3_artifact_repo.__file__)"

amiryi365 (Author) commented Jul 27, 2020

@harupy that's what I did! But although I changed the py file, and also deleted its pyc file, it didn't run my new file!
It compiled the original py again and ran it.
I just read that the right method is to clone the source code, make the change there, and install it as a package. That's complicated for me because it runs in a closed network...
Do you know an alternative way to do that?

amiryi365 (Author):

@harupy I didn't.
Yesterday I did exactly what you suggested just now, and I failed because it somehow used the original version.

harupy (Member) commented Jul 27, 2020

@amiryi365 Did you run pip install -e . after the change yesterday?

amiryi365 (Author):

@harupy I just tried; it doesn't work on the installed package.
It should work on the source code cloned from github (with the setup.py etc.)

harupy (Member) commented Jul 27, 2020

Just to confirm: what you did yesterday was directly fixing the source code of the installed mlflow?

amiryi365 (Author):

@harupy I added prints to the source code of the installed mlflow, but it didn't "catch" them because it ran the original code.
I also made a fix, but now I think it's wrong...
I can't find a way to check what's going on inside the server...

harupy (Member) commented Jul 27, 2020

@amiryi365

I added prints to the source code of the installed mlflow, but it didn't "catch" them because it ran the original code.

How did you confirm that?

amiryi365 (Author):

@harupy I didn't see my prints. Also, when I deleted the pyc file, it recreated the original one.
See here how to do it right.
But in my case I can't do that...

Other tries:
I also found that pdb cannot attach to an already-running python process.
I also tried running httpry to see the REST messages between the mlflow server and minio, but it displayed nothing (maybe because they all run on the same machine).

harupy (Member) commented Jul 27, 2020

@amiryi365 What did you do after changing the code?

amiryi365 (Author):

@harupy rerun it in the same way: mlflow server ...

harupy (Member) commented Jul 27, 2020

@amiryi365 You added some prints to _download_file in s3_artifact_repo.py, correct?

amiryi365 (Author):

@harupy sure

harupy (Member) commented Jul 27, 2020

@amiryi365 I think just running mlflow server ... doesn't call _download_file. Can you launch the UI and open artifacts? or have you already tried this?

amiryi365 (Author):

@harupy of course it should call _download_file when I click on an artifact in the UI

harupy (Member) commented Jul 27, 2020

@amiryi365 So you clicked an artifact in the UI and nothing printed out. Did you still get the same error?

amiryi365 (Author):

@harupy of course I get the error. If I didn't, I could say I'd solved it...

harupy (Member) commented Jul 27, 2020

@amiryi365 Can you open a file in the error stack trace and edit it to debug?

amiryi365 (Author):

@harupy I found the bug!!!
In my opinion, it's inconsistent behavior of minio (probably of this specific version).
I'll share here the details of how to debug it and exactly what the error is.

How to debug the mlflow server:

  1. Run mlflow server with flags that reduce the number of workers to 1 and increase the timeout, so it won't kill the worker during the debug session: mlflow server --gunicorn-opts "--timeout 1000" --workers 1 --backend-store-uri <backend-uri> ...

  2. Open PyCharm on the same machine, create a project with the same python interpreter, then attach to the worker process: Run -> Attach to Process

  3. Now you can set breakpoints and debug!

Bug: minio listing of contained items

  • In artifact_repo.py:127, download_artifacts() checks whether the artifact path is a file or a dir by calling _is_directory()

  • _is_directory() calls list_artifacts() and returns True if the number of contained artifacts is > 0

  • The S3 implementation of list_artifacts() in s3_artifact_repo.py:94 uses the "list_objects_v2" operation on minio via a paginator. This call usually returns a buggy result list, e.g. for an "index.log" file it returns a list of one item (some auto-generated long name), which makes mlflow believe it's a dir.

  • This makes the caller download_artifacts() go to the internal function download_artifact_dir() instead of download_file(). That function calls list_artifacts() again, which yields the fictional auto-generated item, which is also a dir that contains no items. This is exactly the path it builds under /tmp while downloading this "file" from minio, but because it's a dir (not a file) it has no contents, and send_file() in handlers.py:170 raises IsADirectoryError...
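The file-vs-directory decision described in the bullets above can be sketched as follows (a paraphrase of the mlflow logic, not its exact source; the listing entries are made-up values):

```python
def is_directory(listing):
    # Paraphrase of ArtifactRepository._is_directory: any non-empty
    # listing under the path makes mlflow treat the path as a directory.
    return len(listing) > 0

# Correct minio behavior: a plain file has no children -> treated as a file
print(is_directory([]))  # False

# Buggy behavior described above: a spurious auto-generated entry comes back,
# so "index.log" is treated as a directory (entry name here is hypothetical):
print(is_directory(["index.log/a1b2c3autogen"]))  # True
```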

Inconsistency:
When I found that, I copied the list_artifacts() function into my own test and ran it separately with the same params.
To my surprise it worked well!! (i.e. it returned an empty list for "index.log").
Then I checked the mlflow UI again and noticed that for some runs (where I had the bug before) the artifacts suddenly work and their contents are displayed! For other runs it doesn't...

Possible solution:
I'm using the latest minio version, 2020-07-18T18:48:16Z.
I suspect this version is still unstable, so I'm going to check with an older minio version.

amiryi365 (Author):

I found that this bug disappears with an older release of minio: 2020-04-28T23:56:56Z.
See minio/minio#10157
Actually, although the browser now displays all artifact files correctly, the server still raises IsADirectoryError when I click on a dir in the artifacts tree (in the mlflow run web page). But it doesn't have any external effect, so it doesn't bother me.

harupy (Member) commented Aug 3, 2020

@amiryi365 Thanks for the investigation :)

Subhraj07:

I am getting this issue with mlflow deployed in kubernetes and minio as the artifact store. I get the following error:

[screenshot: Screenshot from 2021-05-29 23-00-44]

sfc-gh-adlee:

@Subhraj07 did you solve this? I'm having the same issue too :(

4 participants