
[FR] Increase parameter length #3931

Open
3 of 23 tasks
ahirner opened this issue Jan 1, 2021 · 10 comments
Labels
area/sqlalchemy Use of SQL alchemy in tracking service or model registry area/tracking Tracking service, tracking client APIs, autologging enhancement New feature or request

Comments

@ahirner

ahirner commented Jan 1, 2021

Willingness to contribute

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the MLflow community. (review migrations)
  • No. I cannot contribute this feature at this time.

Proposal Summary

Increase maximum parameter value length from 250 to 2000 or only limit by request size.

Motivation

We use parameter values longer than 250 characters. These occur in parameters of naturally variable length:

  • URL references
  • Definitions of filtering steps: list of qualifiers
  • Definition of data augmentations: mapping of transformations to kwargs

In #1870 it was proposed to alter VARCHAR limits. It would be nice not to have to deal with those alterations in migrations.
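For reference, the schema change discussed in #1870 amounts to a one-statement migration per backend. A sketch in PostgreSQL syntax (the `params` table and `value` column match the tracking schema queried below):

```sql
-- Sketch only: widen the parameter value column from VARCHAR(250)
-- to the proposed 2000-character limit.
ALTER TABLE params ALTER COLUMN value TYPE VARCHAR(2000);
```

MySQL and SQLite need slightly different DDL, which is exactly the per-backend detail an Alembic migration in MLflow would have to handle.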

Our length distribution has some outliers.

```
> select distinct length(value) from params order by length(value) desc limit 20;
11991
11978
11976
11488
11481
1115
1113
1003
902
431
430
396
394
345
300
296
289
287
280
275
```

One solution for us would be to break apart outliers of >10k and settle on the somewhat historic URL limit of 2k.
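Breaking apart the outliers could even be done client-side, without schema changes. A minimal sketch (the `chunk_param` helper and its `key_0`, `key_1`, … naming scheme are hypothetical, not an MLflow API):

```python
def chunk_param(key: str, value: str, max_len: int = 250) -> dict:
    """Split an over-long parameter value into numbered chunks that each
    fit the current limit, e.g. {"aug_0": ..., "aug_1": ...}."""
    if len(value) <= max_len:
        return {key: value}
    return {
        f"{key}_{i}": value[start:start + max_len]
        for i, start in enumerate(range(0, len(value), max_len))
    }
```

Each chunk could then be logged as an ordinary parameter, but the reader has to reassemble values by hand, which is why a higher server-side limit would still be preferable.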

Alternatively, the VARCHAR limit is dropped while the 1MB request limit stays for sanity. AFAICS a modern database wouldn't have any problem, nor does e.g. wandb.

What component(s), interfaces, languages, and integrations does this feature affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: Local serving, model deployment tools, spark UDFs
  • area/server-infra: MLflow server, JavaScript dev server
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interfaces

  • area/uiux: Front-end, user experience, JavaScript, plotting
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Languages

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations
@ahirner ahirner added the enhancement New feature or request label Jan 1, 2021
@github-actions github-actions bot added area/sqlalchemy Use of SQL alchemy in tracking service or model registry area/tracking Tracking service, tracking client APIs, autologging labels Jan 1, 2021
@nickresnick

hey there, what's the status on this?

@srowen
Contributor

srowen commented Nov 16, 2021

If the data you're logging is substantial, and even has some 'structure', then maybe a parameter isn't the ideal choice. You can always log information as a file, as an artifact. I tend to agree that a higher limit would be nice. 2000 isn't 'huge' for example.
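The artifact route described above can be kept lightweight. A minimal sketch, assuming the long or structured values are JSON-serializable (the `dump_long_params` helper is hypothetical; `mlflow.log_artifact` in the comment is the real MLflow API it would feed):

```python
import json

def dump_long_params(params: dict, path: str) -> str:
    """Write long or structured parameter values to a JSON file so the
    file can be attached to a run, e.g. via mlflow.log_artifact(path)."""
    with open(path, "w") as f:
        json.dump(params, f, indent=2, sort_keys=True)
    return path
```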

@xavierfontaine

xavierfontaine commented May 9, 2022

A comment to support the current issue.

The discussed limitation on parameter length makes MLflow less suited for use cases such as Natural Language Generation. For prompt-based NLG, prompts (strings, often long ones) are hyperparameters that are central to model performance, alongside other numerical/boolean parameters.

For now, my team and I are logging these prompts as artifacts. However, doing so lessens the usefulness of the MLflow Tracking UI, since we cannot use the UI components designed for comparing hyperparameter values. More generally, storing some parameters as parameters and others as artifacts inevitably complicates codebases.

@jinzhang21
Collaborator

@xavierfontaine What are the longest prompts you've used?

@sabaimran

Adding a comment to express support for the feature as well. I have a logged parameter value that's <1000 characters but exceeds the 500-character limit.

@akshara08

+1 this would be really helpful. Any update on this?

@xavierfontaine

xavierfontaine commented Jun 6, 2023

@jinzhang21 I'm sorry for taking so long to respond.

What are the longest prompts you've used?

Regarding logged parameters, what interests us most are prompt templates (e.g., Please summarize the following text: {text_to_summarize}.) The average template length has been increasing along with the input-length limits of Large Language Models. Although 1,000–2,000-character templates are common, much longer templates are no longer unusual. The longest I have seen in a professional context was ~18,000 characters.

It might be interesting to note that OpenAI has a version of gpt-4 that supports prompts up to 32,000 tokens (~124,000 characters). Furthermore, the next generation of open-source/proprietary models might have little to no limitation on prompt length ([1], [2]). Of course, prompt templates will typically remain much shorter than the observed prompts, but we should expect their average size to keep increasing over time nonetheless.

I don't remember which storage backend MLflow uses, but I guess the simplest solution would be to store strings in a format that enforces no limitation on length?

@jinzhang21
Collaborator

c.c. @dbczumar and @sunishsheth2009 who are leading the LLM efforts in MLflow

@getchebarne

+1 for this feature; I can't log parameters such as the selected features from a feature-selection step due to the small maximum param size.

@GeorgePearse

GeorgePearse commented Aug 3, 2023

I can't even log my class_list. I use MMDetection, which automatically logs it as part of the config in its tracking hook.

The only workaround is to replace the class_names with IDs, but then I need to find somewhere else to log the class_names.
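The workaround described above can be sketched as follows (`compress_class_list` is a hypothetical helper, not part of MMDetection or MLflow): replace the class names with compact integer IDs that fit the parameter limit, and keep the name-to-ID mapping around to log through another channel, such as an artifact.

```python
def compress_class_list(class_names: list) -> tuple:
    """Map each class name to an integer ID; returns the ID list (short
    enough to log as a parameter) and the name->ID mapping to log elsewhere."""
    mapping = {name: idx for idx, name in enumerate(class_names)}
    return [mapping[name] for name in class_names], mapping
```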


9 participants