
[BUG] UI Large Model Download fills Ephemeral Storage and Crashes Server #10331

Open
1 of 23 tasks
patrickodpt opened this issue Nov 8, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@patrickodpt

Issues Policy acknowledgement

  • I have read and agree to submit bug reports in accordance with the issues policy

Where did you encounter this bug?

Other

Willingness to contribute

Yes. I can contribute a fix for this bug independently.

MLflow version

  • Client: 2.7.0
  • Tracking server: 2.7.0

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian 11.4-slim
  • Python version: 3.10

Describe the problem

When using Scenario 5 (proxied artifact storage) with an S3 artifact store and a remote RDS host for the backend store, large downloads through the UI do not clean up their temporary directories if streaming a file fails (for example, when a connection drops or a gunicorn worker dies). The leftover directories eventually fill the server's ephemeral storage and crash it.

I'm not sure, but I wonder if def _download_artifacts in mlflow/server/handlers.py may be the culprit. If so, wrapping the temporary directory and file handle in context managers (tempfile.TemporaryDirectory() and open()), or otherwise guaranteeing cleanup, as is done in many other places in the MLflow repo, may resolve the issue.
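To illustrate what I mean, here is a rough sketch (my own, not the actual MLflow handler; artifact_repo and artifact_path stand in for whatever the real handler resolves) where a try/finally ensures the temporary directory is removed even when the response generator is closed early. This would not help if gunicorn hard-kills the worker, since no Python code runs in that case:

import tempfile

from flask import current_app


def _download_artifact_sketch(artifact_repo, artifact_path):
    # Download the requested artifact into a throwaway directory, then stream it.
    tmp_dir = tempfile.TemporaryDirectory()
    local_path = artifact_repo.download_artifacts(artifact_path, tmp_dir.name)
    file_handle = open(local_path, "rb")

    def stream_and_remove_file():
        try:
            yield from file_handle
        finally:
            # Runs on normal completion, on exceptions, and on GeneratorExit
            # (raised when Flask closes the response iterator after a client
            # disconnect), so the temp directory never outlives the request.
            file_handle.close()
            tmp_dir.cleanup()

    return current_app.response_class(stream_and_remove_file())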

Tracking information

REPLACE_ME

Code to reproduce issue

#  Not Applicable
#  Error is encountered when downloading through MLflow UI

Stack trace

# No stack trace is logged when the OSError occurs

# Server returns:
error_code	"INTERNAL_ERROR"
message	`The following failures occurred while downloading one or more artifacts from s3://my-mlflow-artifacts-us-east-1: {'<RUN_ID>/<UUID>/artifacts/artifacts/model.pkl': "OSError(28, 'No space left on device')"}`

# If you try to hit the tracking server after the crash, requests fail with:
failed with exception HTTPSConnectionPool(host='<remote_tracking_url>', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/<run_id>/<uuid>/artifacts/model.pkl (Caused by ResponseError('too many 500 error responses'))

Other info / logs

11-07-2023 01:40:46 PM [2023-11-07 18:40:46 +0000] [10] [DEBUG] Current configuration:
11-07-2023 01:40:46 PM   config: ./gunicorn.conf.py
11-07-2023 01:40:46 PM   wsgi_app: None
11-07-2023 01:40:46 PM   bind: ['0.0.0.0:8080']
11-07-2023 01:40:46 PM   backlog: 2048
11-07-2023 01:40:46 PM   workers: 1
11-07-2023 01:40:46 PM   worker_class: sync
11-07-2023 01:40:46 PM   threads: 1
11-07-2023 01:40:46 PM   worker_connections: 1000
11-07-2023 01:40:46 PM   max_requests: 0
11-07-2023 01:40:46 PM   max_requests_jitter: 0
11-07-2023 01:40:46 PM   timeout: 120
11-07-2023 01:40:46 PM   graceful_timeout: 30
11-07-2023 01:40:46 PM   keepalive: 2
11-07-2023 01:40:46 PM   limit_request_line: 4094
11-07-2023 01:40:46 PM   limit_request_fields: 100
11-07-2023 01:40:46 PM   limit_request_field_size: 8190
11-07-2023 01:40:46 PM   reload: False
11-07-2023 01:40:46 PM   reload_engine: auto
11-07-2023 01:40:46 PM   reload_extra_files: []
11-07-2023 01:40:46 PM   spew: False
11-07-2023 01:40:46 PM   check_config: False
11-07-2023 01:40:46 PM   print_config: False
11-07-2023 01:40:46 PM   preload_app: False
11-07-2023 01:40:46 PM   sendfile: None
11-07-2023 01:40:46 PM   reuse_port: False
11-07-2023 01:40:46 PM   chdir: /
11-07-2023 01:40:46 PM   daemon: False
11-07-2023 01:40:46 PM   raw_env: []
11-07-2023 01:40:46 PM   pidfile: None
11-07-2023 01:40:46 PM   worker_tmp_dir: None
11-07-2023 01:40:46 PM   user: 0
11-07-2023 01:40:46 PM   group: 0
11-07-2023 01:40:46 PM   umask: 0
11-07-2023 01:40:46 PM   initgroups: False
11-07-2023 01:40:46 PM   tmp_upload_dir: None
11-07-2023 01:40:46 PM   secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
11-07-2023 01:40:46 PM   forwarded_allow_ips: ['127.0.0.1']
11-07-2023 01:40:46 PM   accesslog: None
11-07-2023 01:40:46 PM   disable_redirect_access_to_syslog: False
11-07-2023 01:40:46 PM   access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
11-07-2023 01:40:46 PM   errorlog: -
11-07-2023 01:40:46 PM   loglevel: debug
11-07-2023 01:40:46 PM   capture_output: False
11-07-2023 01:40:46 PM   logger_class: gunicorn.glogging.Logger
11-07-2023 01:40:46 PM   logconfig: None
11-07-2023 01:40:46 PM   logconfig_dict: {}
11-07-2023 01:40:46 PM   logconfig_json: None
11-07-2023 01:40:46 PM   syslog_addr: udp://localhost:514
11-07-2023 01:40:46 PM   syslog: False
11-07-2023 01:40:46 PM   syslog_prefix: None
11-07-2023 01:40:46 PM   syslog_facility: user
11-07-2023 01:40:46 PM   enable_stdio_inheritance: False
11-07-2023 01:40:46 PM   statsd_host: None
11-07-2023 01:40:46 PM   dogstatsd_tags: 
11-07-2023 01:40:46 PM   statsd_prefix: 
11-07-2023 01:40:46 PM   proc_name: None
11-07-2023 01:40:46 PM   default_proc_name: camo_mlflow_auth.interceptor:app
11-07-2023 01:40:46 PM   pythonpath: None
11-07-2023 01:40:46 PM   paste: None
11-07-2023 01:40:46 PM   on_starting: <function OnStarting.on_starting at 0x7ff525b7c820>
11-07-2023 01:40:46 PM   on_reload: <function OnReload.on_reload at 0x7ff525b7ca60>
11-07-2023 01:40:46 PM   when_ready: <function WhenReady.when_ready at 0x7ff525b7cb80>
11-07-2023 01:40:46 PM   pre_fork: <function Prefork.pre_fork at 0x7ff525b7cca0>
11-07-2023 01:40:46 PM   post_fork: <function Postfork.post_fork at 0x7ff525b7cdc0>
11-07-2023 01:40:46 PM   post_worker_init: <function PostWorkerInit.post_worker_init at 0x7ff525b7cee0>
11-07-2023 01:40:46 PM   worker_int: <function WorkerInt.worker_int at 0x7ff525b7d000>
11-07-2023 01:40:46 PM   worker_abort: <function WorkerAbort.worker_abort at 0x7ff525b7d120>
11-07-2023 01:40:46 PM   pre_exec: <function PreExec.pre_exec at 0x7ff525b7d240>
11-07-2023 01:40:46 PM   pre_request: <function PreRequest.pre_request at 0x7ff525b7d360>
11-07-2023 01:40:46 PM   post_request: <function PostRequest.post_request at 0x7ff525b7d3f0>
11-07-2023 01:40:46 PM   child_exit: <function ChildExit.child_exit at 0x7ff525b7d510>
11-07-2023 01:40:46 PM   worker_exit: <function WorkerExit.worker_exit at 0x7ff525b7d630>
11-07-2023 01:40:46 PM   nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7ff525b7d750>
11-07-2023 01:40:46 PM   on_exit: <function OnExit.on_exit at 0x7ff525b7d870>
11-07-2023 01:40:46 PM   ssl_context: <function NewSSLContext.ssl_context at 0x7ff525b7d990>
11-07-2023 01:40:46 PM   proxy_protocol: False
11-07-2023 01:40:46 PM   proxy_allow_ips: ['127.0.0.1']
11-07-2023 01:40:46 PM   keyfile: None
11-07-2023 01:40:46 PM   certfile: None
11-07-2023 01:40:46 PM   ssl_version: 2
11-07-2023 01:40:46 PM   cert_reqs: 0
11-07-2023 01:40:46 PM   ca_certs: None
11-07-2023 01:40:46 PM   suppress_ragged_eofs: True
11-07-2023 01:40:46 PM   do_handshake_on_connect: False
11-07-2023 01:40:46 PM   ciphers: None
11-07-2023 01:40:46 PM   raw_paste_global_conf: []
11-07-2023 01:40:46 PM   strip_header_spaces: False
11-07-2023 01:40:46 PM [2023-11-07 18:40:46 +0000] [10] [INFO] Starting gunicorn 21.2.0
11-07-2023 01:40:46 PM [2023-11-07 18:40:46 +0000] [10] [DEBUG] Arbiter booted
11-07-2023 01:40:46 PM [2023-11-07 18:40:46 +0000] [10] [INFO] Listening at: http://0.0.0.0:8080 (10)
11-07-2023 01:40:46 PM [2023-11-07 18:40:46 +0000] [10] [INFO] Using worker: sync
11-07-2023 01:40:46 PM [2023-11-07 18:40:46 +0000] [11] [INFO] Booting worker with pid: 11
11-07-2023 01:40:46 PM [2023-11-07 18:40:46 +0000] [10] [DEBUG] 1 workers

What component(s) does this bug affect?

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/gateway: AI Gateway service, Gateway client APIs, third-party Gateway integrations
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

What language(s) does this bug affect?

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations
patrickodpt added the bug label on Nov 8, 2023
@nilutz

nilutz commented Nov 10, 2023

I have experienced a similar issue, but I am not 100% sure whether it has the same cause. I also wanted to set up something like Scenario 5 with proxied artifact storage. For the database I opted for Postgres, and for S3 storage we tried MinIO (for a local dev setup). In the end our docker-compose setup looks like the following; maybe it helps.

  • mlflow.Dockerfile:
FROM python:3.11.0

RUN pip install --no-cache mlflow[extra] psycopg2 boto3 minio

EXPOSE 5000

CMD mlflow server \
    --host=0.0.0.0 \
    --port=5000 \
    --backend-store-uri=${BACKEND_STORE_URI} \
    --gunicorn-opts="--log-level=debug" \
    --serve-artifacts \
    --artifacts-destination=${BUCKET}

And the corresponding docker-compose file:

version: "3.4"

services:
  database:
    image: 'postgres:latest'
    container_name: mlflow_db
    restart: unless-stopped
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_DB=mlflowdb
      - POSTGRES_USER=mlflow_user
      - POSTGRES_PASSWORD=123
    networks:
      - mlflow_network
    volumes:
      - pgdata_mlflow:/var/lib/postgresql/data
    command: ["postgres", "-c", "log_statement=all", "-c", "log_destination=stderr"] # this is for logging everything so stdout
  s3:
    restart: always
    image: minio/minio
    container_name: mlflow_minio
    volumes:
      - ./minio_data:/data
    ports:
      - "9000:9000"
      - "9001:9001"
    networks:
      - mlflow_network
    environment:
      - MINIO_ROOT_USER=minio_user
      - MINIO_ROOT_PASSWORD=minio_pwd
      - MINIO_ADDRESS=:9000
      - MINIO_PORT=9000
      - MINIO_STORAGE_USE_HTTPS=False
      - MINIO_CONSOLE_ADDRESS=:9001
    command: server /data
  app:
    build:
      context: .
      dockerfile: mlflow.Dockerfile
    ports:
      - "5000:5000"
    environment:
      - BACKEND_STORE_URI=postgresql://mlflow_user:123@database:5432/mlflowdb
      - BUCKET=s3://mlflow
      - MLFLOW_S3_ENDPOINT_URL=http://s3:9000
      - MLFLOW_S3_IGNORE_TLS=true
      - AWS_ACCESS_KEY_ID=awPdPbLlPaqciiCEDP3H # generated from the MinIO UI
      - AWS_SECRET_ACCESS_KEY=iUf7wQYbtfUq0ssQ2FCiKIbkMKIHPItkHw6vhw45 # generated from the MinIO UI

    networks:
      - mlflow_network
    depends_on:
      - s3
      - database

volumes:
  pgdata_mlflow:

networks:
  mlflow_network:

This helped a lot in setting up MinIO.

Now, when trying to upload a large file (>1 GB), I get the following error from the MLflow client:

mlflow.exceptions.MlflowException: API request to http://0.0.0.0:5000/api/2.0/mlflow-artifacts/artifacts/1/9f5fcd081865478d9f799cdb6cab2089/artifacts/<file>/artifacts/b<file>.bin failed with exception HTTPConnectionPool(host='0.0.0.0', port=5000): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/1/9f5fcd081865478d9f799cdb6cab2089/artifacts/<file>/artifacts/<file>.bin (Caused by ProtocolError('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer')))

And the MLflow server logs show the following:

app_1       | [2023-11-10 12:51:54 +0000] [26] [DEBUG] PUT /api/2.0/mlflow-artifacts/artifacts/1/9f5fcd081865478d9f799cdb6cab2089/artifacts/<file>/artifacts/<file>.bin
app_1       | [2023-11-10 12:52:03 +0000] [77] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
app_1       | [2023-11-10 12:52:18 +0000] [61] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
app_1       | [2023-11-10 12:52:25 +0000] [16] [CRITICAL] WORKER TIMEOUT (pid:26)
app_1       | [2023-11-10 12:52:25 +0000] [26] [INFO] Worker exiting (pid: 26)
app_1       | [2023-11-10 12:52:25 +0000] [16] [ERROR] Worker (pid:26) exited with code 1
app_1       | [2023-11-10 12:52:25 +0000] [16] [ERROR] Worker (pid:26) exited with code 1.

@nilutz

nilutz commented Nov 10, 2023

@patrickodpt Try increasing the --timeout for gunicorn. By default, gunicorn kills a worker that takes more than 30 seconds on a request, so raising the timeout helps!


FROM python:3.11.0

RUN pip install --no-cache mlflow[extra] psycopg2 boto3 minio

EXPOSE 5000

CMD mlflow server \
    --host=0.0.0.0 \
    --port=5000 \
    --app-name basic-auth \
    --backend-store-uri=${BACKEND_STORE_URI} \
    --gunicorn-opts="--log-level=debug --timeout=9000" \
    --serve-artifacts \
    --artifacts-destination=${BUCKET}

@patrickodpt
Author

@nilutz Yep, that was my first step -- see timeout: 120 in my logs above (11-07-2023 01:40:46 PM).

That is more than enough gunicorn worker time for the model sizes I am currently testing.
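If anyone wants to confirm it is the storage leak rather than the timeout, something like this quick check (my own sketch, not MLflow code) run inside the tracking-server container should show the temp directories left behind after a failed download and how much ephemeral storage remains:

import os
import shutil
import tempfile

tmp_root = tempfile.gettempdir()

# Size of each leftover directory under the system temp dir.
for entry in sorted(os.scandir(tmp_root), key=lambda e: e.name):
    if entry.is_dir(follow_symlinks=False):
        size_bytes = sum(
            os.path.getsize(os.path.join(dirpath, name))
            for dirpath, _, filenames in os.walk(entry.path)
            for name in filenames
        )
        print(f"{entry.path}: {size_bytes / 1024 ** 2:.1f} MiB")

# Remaining free space on the filesystem backing the temp dir.
total, used, free = shutil.disk_usage(tmp_root)
print(f"free space on {tmp_root}: {free / 1024 ** 3:.2f} GiB")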


@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.
