
[BUG] UI Large Model Download fills Ephemeral Storage and Crashes Server #10331

Open
1 of 23 tasks
patrickodpt opened this issue Nov 8, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@patrickodpt

Issues Policy acknowledgement

  • I have read and agree to submit bug reports in accordance with the issues policy

Where did you encounter this bug?

Other

Willingness to contribute

Yes. I can contribute a fix for this bug independently.

MLflow version

  • Client: 2.7.0
  • Tracking server: 2.7.0

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian 11.4-slim
  • Python version: 3.10

Describe the problem

When using Scenario 5 (proxied artifact storage) with an S3 artifact store and a remote RDS host for the backend store, large downloads through the UI do not clean up their temporary directories if streaming a file fails (for example, when a connection drops or a gunicorn worker dies). The leftover directories eventually fill the server's ephemeral storage and crash it.

I'm not sure, but I wonder if def _download_artifacts in mlflow/server/handlers.py may be the culprit. If so, wrapping the temporary directory and file handle in context managers (tempfile.TemporaryDirectory() and open()), or otherwise guaranteeing cleanup, as is done in many other places in the MLflow repo, may resolve the issue.
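To illustrate what I mean, here is a rough sketch (my own, not the actual MLflow handler; artifact_repo and artifact_path stand in for whatever the real handler resolves) where a try/finally ensures the temporary directory is removed even when the response generator is closed early. This would not help if gunicorn hard-kills the worker, since no Python code runs in that case:

import tempfile

from flask import current_app


def _download_artifact_sketch(artifact_repo, artifact_path):
    # Download the requested artifact into a throwaway directory, then stream it.
    tmp_dir = tempfile.TemporaryDirectory()
    local_path = artifact_repo.download_artifacts(artifact_path, tmp_dir.name)
    file_handle = open(local_path, "rb")

    def stream_and_remove_file():
        try:
            yield from file_handle
        finally:
            # Runs on normal completion, on exceptions, and on GeneratorExit
            # (raised when Flask closes the response iterator after a client
            # disconnect), so the temp directory never outlives the request.
            file_handle.close()
            tmp_dir.cleanup()

    return current_app.response_class(stream_and_remove_file())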

Tracking information

REPLACE_ME

Code to reproduce issue

#  Not Applicable
#  Error is encountered when downloading through MLflow UI

Stack trace

# No stack trace is logged when the OSError occurs

# Server returns:
error_code	"INTERNAL_ERROR"
message	`The following failures occurred while downloading one or more artifacts from s3://my-mlflow-artifacts-us-east-1: {'<RUN_ID>/<UUID>/artifacts/artifacts/model.pkl': "OSError(28, 'No space left on device')"}`

# If you try to hit the tracking server after the crash, requests fail with:
failed with exception HTTPSConnectionPool(host='<remote_tracking_url>', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/<run_id>/<uuid>/artifacts/model.pkl (Caused by ResponseError('too many 500 error responses'))

Other info / logs

11-07-2023 01:40:46 PM [2023-11-07 18:40:46 +0000] [10] [DEBUG] Current configuration:
11-07-2023 01:40:46 PM   config: ./gunicorn.conf.py
11-07-2023 01:40:46 PM   wsgi_app: None
11-07-2023 01:40:46 PM   bind: ['0.0.0.0:8080']
11-07-2023 01:40:46 PM   backlog: 2048
11-07-2023 01:40:46 PM   workers: 1
11-07-2023 01:40:46 PM   worker_class: sync
11-07-2023 01:40:46 PM   threads: 1
11-07-2023 01:40:46 PM   worker_connections: 1000
11-07-2023 01:40:46 PM   max_requests: 0
11-07-2023 01:40:46 PM   max_requests_jitter: 0
11-07-2023 01:40:46 PM   timeout: 120
11-07-2023 01:40:46 PM   graceful_timeout: 30
11-07-2023 01:40:46 PM   keepalive: 2
11-07-2023 01:40:46 PM   limit_request_line: 4094
11-07-2023 01:40:46 PM   limit_request_fields: 100
11-07-2023 01:40:46 PM   limit_request_field_size: 8190
11-07-2023 01:40:46 PM   reload: False
11-07-2023 01:40:46 PM   reload_engine: auto
11-07-2023 01:40:46 PM   reload_extra_files: []
11-07-2023 01:40:46 PM   spew: False
11-07-2023 01:40:46 PM   check_config: False
11-07-2023 01:40:46 PM   print_config: False
11-07-2023 01:40:46 PM   preload_app: False
11-07-2023 01:40:46 PM   sendfile: None
11-07-2023 01:40:46 PM   reuse_port: False
11-07-2023 01:40:46 PM   chdir: /
11-07-2023 01:40:46 PM   daemon: False
11-07-2023 01:40:46 PM   raw_env: []
11-07-2023 01:40:46 PM   pidfile: None
11-07-2023 01:40:46 PM   worker_tmp_dir: None
11-07-2023 01:40:46 PM   user: 0
11-07-2023 01:40:46 PM   group: 0
11-07-2023 01:40:46 PM   umask: 0
11-07-2023 01:40:46 PM   initgroups: False
11-07-2023 01:40:46 PM   tmp_upload_dir: None
11-07-2023 01:40:46 PM   secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
11-07-2023 01:40:46 PM   forwarded_allow_ips: ['127.0.0.1']
11-07-2023 01:40:46 PM   accesslog: None
11-07-2023 01:40:46 PM   disable_redirect_access_to_syslog: False
11-07-2023 01:40:46 PM   access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
11-07-2023 01:40:46 PM   errorlog: -
11-07-2023 01:40:46 PM   loglevel: debug
11-07-2023 01:40:46 PM   capture_output: False
11-07-2023 01:40:46 PM   logger_class: gunicorn.glogging.Logger
11-07-2023 01:40:46 PM   logconfig: None
11-07-2023 01:40:46 PM   logconfig_dict: {}
11-07-2023 01:40:46 PM   logconfig_json: None
11-07-2023 01:40:46 PM   syslog_addr: udp://localhost:514
11-07-2023 01:40:46 PM   syslog: False
11-07-2023 01:40:46 PM   syslog_prefix: None
11-07-2023 01:40:46 PM   syslog_facility: user
11-07-2023 01:40:46 PM   enable_stdio_inheritance: False
11-07-2023 01:40:46 PM   statsd_host: None
11-07-2023 01:40:46 PM   dogstatsd_tags: 
11-07-2023 01:40:46 PM   statsd_prefix: 
11-07-2023 01:40:46 PM   proc_name: None
11-07-2023 01:40:46 PM   default_proc_name: camo_mlflow_auth.interceptor:app
11-07-2023 01:40:46 PM   pythonpath: None
11-07-2023 01:40:46 PM   paste: None
11-07-2023 01:40:46 PM   on_starting: <function OnStarting.on_starting at 0x7ff525b7c820>
11-07-2023 01:40:46 PM   on_reload: <function OnReload.on_reload at 0x7ff525b7ca60>
11-07-2023 01:40:46 PM   when_ready: <function WhenReady.when_ready at 0x7ff525b7cb80>
11-07-2023 01:40:46 PM   pre_fork: <function Prefork.pre_fork at 0x7ff525b7cca0>
11-07-2023 01:40:46 PM   post_fork: <function Postfork.post_fork at 0x7ff525b7cdc0>
11-07-2023 01:40:46 PM   post_worker_init: <function PostWorkerInit.post_worker_init at 0x7ff525b7cee0>
11-07-2023 01:40:46 PM   worker_int: <function WorkerInt.worker_int at 0x7ff525b7d000>
11-07-2023 01:40:46 PM   worker_abort: <function WorkerAbort.worker_abort at 0x7ff525b7d120>
11-07-2023 01:40:46 PM   pre_exec: <function PreExec.pre_exec at 0x7ff525b7d240>
11-07-2023 01:40:46 PM   pre_request: <function PreRequest.pre_request at 0x7ff525b7d360>
11-07-2023 01:40:46 PM   post_request: <function PostRequest.post_request at 0x7ff525b7d3f0>
11-07-2023 01:40:46 PM   child_exit: <function ChildExit.child_exit at 0x7ff525b7d510>
11-07-2023 01:40:46 PM   worker_exit: <function WorkerExit.worker_exit at 0x7ff525b7d630>
11-07-2023 01:40:46 PM   nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7ff525b7d750>
11-07-2023 01:40:46 PM   on_exit: <function OnExit.on_exit at 0x7ff525b7d870>
11-07-2023 01:40:46 PM   ssl_context: <function NewSSLContext.ssl_context at 0x7ff525b7d990>
11-07-2023 01:40:46 PM   proxy_protocol: False
11-07-2023 01:40:46 PM   proxy_allow_ips: ['127.0.0.1']
11-07-2023 01:40:46 PM   keyfile: None
11-07-2023 01:40:46 PM   certfile: None
11-07-2023 01:40:46 PM   ssl_version: 2
11-07-2023 01:40:46 PM   cert_reqs: 0
11-07-2023 01:40:46 PM   ca_certs: None
11-07-2023 01:40:46 PM   suppress_ragged_eofs: True
11-07-2023 01:40:46 PM   do_handshake_on_connect: False
11-07-2023 01:40:46 PM   ciphers: None
11-07-2023 01:40:46 PM   raw_paste_global_conf: []
11-07-2023 01:40:46 PM   strip_header_spaces: False
11-07-2023 01:40:46 PM [2023-11-07 18:40:46 +0000] [10] [INFO] Starting gunicorn 21.2.0
11-07-2023 01:40:46 PM [2023-11-07 18:40:46 +0000] [10] [DEBUG] Arbiter booted
11-07-2023 01:40:46 PM [2023-11-07 18:40:46 +0000] [10] [INFO] Listening at: http://0.0.0.0:8080 (10)
11-07-2023 01:40:46 PM [2023-11-07 18:40:46 +0000] [10] [INFO] Using worker: sync
11-07-2023 01:40:46 PM [2023-11-07 18:40:46 +0000] [11] [INFO] Booting worker with pid: 11
11-07-2023 01:40:46 PM [2023-11-07 18:40:46 +0000] [10] [DEBUG] 1 workers

What component(s) does this bug affect?

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/gateway: AI Gateway service, Gateway client APIs, third-party Gateway integrations
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

What language(s) does this bug affect?

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations
patrickodpt added the bug label on Nov 8, 2023
@nilutz

nilutz commented Nov 10, 2023

I have experienced a similar issue, but I am not 100% sure whether it has the same cause. I also wanted to set up something like Scenario 5 with proxied artifact storage. For the database I opted for Postgres, and for S3 storage we tried MinIO (for a local dev setup). In the end our docker-compose setup looks like the following; maybe it helps.

  • mlflow.Dockerfile:
FROM python:3.11.0

RUN pip install --no-cache mlflow[extra] psycopg2 boto3 minio

EXPOSE 5000

CMD mlflow server \
    --host=0.0.0.0 \
    --port=5000 \
    --backend-store-uri=${BACKEND_STORE_URI} \
    --gunicorn-opts="--log-level=debug" \
    --serve-artifacts \
    --artifacts-destination=${BUCKET}

And the corresponding docker-compose file:

version: "3.4"

services:
  database:
    image: 'postgres:latest'
    container_name: mlflow_db
    restart: unless-stopped
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_DB=mlflowdb
      - POSTGRES_USER=mlflow_user
      - POSTGRES_PASSWORD=123
    networks:
      - mlflow_network
    volumes:
      - pgdata_mlflow:/var/lib/postgresql/data
    command: ["postgres", "-c", "log_statement=all", "-c", "log_destination=stderr"] # this is for logging everything so stdout
  s3:
    restart: always
    image: minio/minio
    container_name: mlflow_minio
    volumes:
      - ./minio_data:/data
    ports:
      - "9000:9000"
      - "9001:9001"
    networks:
      - mlflow_network
    environment:
      - MINIO_ROOT_USER=minio_user
      - MINIO_ROOT_PASSWORD=minio_pwd
      - MINIO_ADDRESS=:9000
      - MINIO_PORT=9000
      - MINIO_STORAGE_USE_HTTPS=False
      - MINIO_CONSOLE_ADDRESS=:9001
    command: server /data
  app:
    build:
      context: .
      dockerfile: mlflow.Dockerfile
    ports:
      - "5000:5000"
    environment:
      - BACKEND_STORE_URI=postgresql://mlflow_user:123@database:5432/mlflowdb
      - BUCKET=s3://mlflow
      - MLFLOW_S3_ENDPOINT_URL=http://s3:9000
      - MLFLOW_S3_IGNORE_TLS=true
      - AWS_ACCESS_KEY_ID=awPdPbLlPaqciiCEDP3H # generated from the MinIO UI
      - AWS_SECRET_ACCESS_KEY=iUf7wQYbtfUq0ssQ2FCiKIbkMKIHPItkHw6vhw45 # generated from the MinIO UI

    networks:
      - mlflow_network
    depends_on:
      - s3
      - database

volumes:
  pgdata_mlflow:

networks:
  mlflow_network:

This helped a lot in setting up MinIO.

Now, when trying to upload a large file (>1 GB), I get the following error from the MLflow client:

mlflow.exceptions.MlflowException: API request to http://0.0.0.0:5000/api/2.0/mlflow-artifacts/artifacts/1/9f5fcd081865478d9f799cdb6cab2089/artifacts/<file>/artifacts/b<file>.bin failed with exception HTTPConnectionPool(host='0.0.0.0', port=5000): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/1/9f5fcd081865478d9f799cdb6cab2089/artifacts/<file>/artifacts/<file>.bin (Caused by ProtocolError('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer')))

And the MLflow server logs show the following:

app_1       | [2023-11-10 12:51:54 +0000] [26] [DEBUG] PUT /api/2.0/mlflow-artifacts/artifacts/1/9f5fcd081865478d9f799cdb6cab2089/artifacts/<file>/artifacts/<file>.bin
app_1       | [2023-11-10 12:52:03 +0000] [77] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
app_1       | [2023-11-10 12:52:18 +0000] [61] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
app_1       | [2023-11-10 12:52:25 +0000] [16] [CRITICAL] WORKER TIMEOUT (pid:26)
app_1       | [2023-11-10 12:52:25 +0000] [26] [INFO] Worker exiting (pid: 26)
app_1       | [2023-11-10 12:52:25 +0000] [16] [ERROR] Worker (pid:26) exited with code 1
app_1       | [2023-11-10 12:52:25 +0000] [16] [ERROR] Worker (pid:26) exited with code 1.

@nilutz

nilutz commented Nov 10, 2023

@patrickodpt Try increasing the --timeout for gunicorn. By default, gunicorn kills a worker that takes more than 30 seconds on a request, so raising the timeout helps!


FROM python:3.11.0

RUN pip install --no-cache mlflow[extra] psycopg2 boto3 minio

EXPOSE 5000

CMD mlflow server \
    --host=0.0.0.0 \
    --port=5000 \
    --app-name basic-auth \
    --backend-store-uri=${BACKEND_STORE_URI} \
    --gunicorn-opts="--log-level=debug --timeout=9000" \
    --serve-artifacts \
    --artifacts-destination=${BUCKET}

@patrickodpt
Author

@nilutz Yep, that was my first step -- see timeout: 120 in my logs above (11-07-2023 01:40:46 PM).

That is more than enough gunicorn worker time for the model sizes I am currently testing.
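If anyone wants to confirm it is the storage leak rather than the timeout, something like this quick check (my own sketch, not MLflow code) run inside the tracking-server container should show the temp directories left behind after a failed download and how much ephemeral storage remains:

import os
import shutil
import tempfile

tmp_root = tempfile.gettempdir()

# Size of each leftover directory under the system temp dir.
for entry in sorted(os.scandir(tmp_root), key=lambda e: e.name):
    if entry.is_dir(follow_symlinks=False):
        size_bytes = sum(
            os.path.getsize(os.path.join(dirpath, name))
            for dirpath, _, filenames in os.walk(entry.path)
            for name in filenames
        )
        print(f"{entry.path}: {size_bytes / 1024 ** 2:.1f} MiB")

# Remaining free space on the filesystem backing the temp dir.
total, used, free = shutil.disk_usage(tmp_root)
print(f"free space on {tmp_root}: {free / 1024 ** 3:.2f} GiB")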


@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.
