[BUG] UI Large Model Download fills Ephemeral Storage and Crashes Server #10331
Comments
I have experienced a similar issue, but I am not 100% sure whether it has the same cause. I also wanted to set up a scenario similar to Scenario 5 with proxied artifact storage. For the database I opted for Postgres, and for S3 storage we tried MinIO (for a local dev setup). In the end our docker-compose setup looks like the following; maybe it helps.
And the corresponding docker compose:
This helped a lot in setting up MinIO. Now, when trying to upload a large file (>1 GB), I get the following error from the MLflow client:
And the MLflow server logs are the following:
@patrickodpt Try increasing the
@nilutz Yep, that was my first step -- from my logs above This is sufficient gunicorn worker time for my current model size tests.
@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.
Issues Policy acknowledgement
Where did you encounter this bug?
Other
Willingness to contribute
Yes. I can contribute a fix for this bug independently.
MLflow version
System information
Describe the problem
When using Scenario 5 (proxied artifact storage) with an S3 artifact store and an RDS remote host, large downloads via the UI do not clean up their temp directories if streaming a file fails (for example, when a connection drops or a gunicorn worker dies).
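To make the failure mode concrete, here is a minimal sketch of one way temp-directory cleanup can be tied to the lifetime of a streamed response. It is not MLflow's actual handler code: `stream_with_cleanup` and the `fetch_to_dir` callback are hypothetical names used only for illustration. The idea is that the `with` blocks live inside the generator, so they are exited when the stream finishes, raises, or is closed early by the server.

```python
import os
import tempfile
from typing import Callable, Iterator


def stream_with_cleanup(
    fetch_to_dir: Callable[[str], str],
    chunk_size: int = 1024 * 1024,
) -> Iterator[bytes]:
    """Yield a downloaded file in chunks and remove the temp dir afterwards.

    `fetch_to_dir` is a hypothetical callback (an assumption, not an MLflow
    API): it downloads the artifact into the given directory and returns the
    path of the local file.
    """
    # Both context managers are entered inside the generator, so they are
    # exited when the generator is exhausted, raises, or is closed early.
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = fetch_to_dir(tmp_dir)
        with open(local_path, "rb") as f:
            while chunk := f.read(chunk_size):
                yield chunk
```

A WSGI server that calls `close()` on the response iterable after a client disconnect would trigger the cleanup here. A worker process that is killed outright cannot run any Python cleanup, so this sketch would not cover that case on its own; leftover directories would still need to be removed some other way.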
I'm not sure, but I wonder if def _download_artifacts in mlflow/server/handlers.py may be the culprit. If so, using context managers with tempfile.TemporaryDirectory() and file open(), as is done in many other places in the MLflow repo, may resolve the issue.
Tracking information
Code to reproduce issue
Stack trace
Other info / logs
What component(s) does this bug affect?
area/artifacts: Artifact stores and artifact logging
area/build: Build and test infrastructure for MLflow
area/docs: MLflow documentation pages
area/examples: Example code
area/gateway: AI Gateway service, Gateway client APIs, third-party Gateway integrations
area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
area/models: MLmodel format, model serialization/deserialization, flavors
area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
area/projects: MLproject format, project running backends
area/scoring: MLflow Model server, model deployment tools, Spark UDFs
area/server-infra: MLflow Tracking server backend
area/tracking: Tracking Service, tracking client APIs, autologging
What interface(s) does this bug affect?
area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support
What language(s) does this bug affect?
language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages
What integration(s) does this bug affect?
integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations