Disable '..' in query string for artifact URI #10653
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
mlflow/utils/uri.py
def _decode(url):
    # Keep decoding until the url stops changing
    while True:
        decoded = urllib.parse.unquote(url)
        if decoded == url:
            return url
        url = decoded
/ai any chance that this function goes into an infinite loop?
Yes, there is a chance that this function could go into an infinite loop if urllib.parse.unquote(url) continually returns a value that is not equal to the original url. However, this is extremely unlikely due to how urllib.parse.unquote() operates. It decodes percent-encoded characters in a URL into their corresponding characters. In practice, this means that with each loop the URL should become less percent-encoded, and eventually it should be fully decoded so that decoded == url and the function exits the loop. But if there is strange behavior in unquote(), or the input url constantly changes outside this function in a multithreaded environment, it might cause an infinite loop.
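To make the termination question concrete, here is a minimal sketch that records every intermediate value of the decode loop. The helper name `decode_steps` and the sample input are assumptions for illustration and are not part of the PR. In this example each pass that changes the string also shortens it, because every `%XX` escape collapses to at most one character, which suggests the loop reaches a fixed point after a finite number of passes.

```python
import urllib.parse


def decode_steps(url):
    """Return every intermediate value produced by repeated unquoting."""
    steps = [url]
    while True:
        decoded = urllib.parse.unquote(url)
        if decoded == url:
            # Fixed point reached: another pass would change nothing.
            return steps
        steps.append(decoded)
        url = decoded


# A doubly percent-encoded "../" (hypothetical sample input).
steps = decode_steps("%252e%252e%252f")
for step in steps:
    print(len(step), step)
```

The printed lengths drop on every line, so the number of passes is bounded by the length of the input.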
/ai Can we use recursion instead?
Yes, it is possible to re-implement this function using recursion. Here is a possible solution:
def _decode(url):
    decoded = urllib.parse.unquote(url)
    if decoded == url:
        return url
    else:
        return _decode(decoded)
This version of the function does essentially the same thing as the original: it decodes the URL and continues to decode the result until decoding no longer produces any changes.
Remember though, Python has a maximum recursion depth (usually 1000). While in normal URL decoding cases you would likely never come near it, malformed or malicious strings may cause a "maximum recursion depth exceeded" error. Proper error handling or a depth check should be added if this is a concern.
I'd use this:
def _decode(url):
    for _ in range(100):
        decoded = urllib.parse.unquote(url)
        if decoded == url:
            return url
        url = decoded
    raise ValueError("Failed to decode")
# No chance for an infinite loop
I feel like 10 is enough 🤔 Who would encode a path to such complex stuff?
Sounds good. 10 should be enough.
Makes sense, will limit it to 10:)
If we reach 10, I think that's some malicious URI.
LGTM!
LGTM
What changes are proposed in this pull request?
Fixes a path traversal vulnerability.
Problem
When users create an experiment with a base artifact URI such as http://foo/bar, MLflow appends the /{{run_id}}/artifact sub path to it when saving or reading each run's artifacts, giving http://foo/bar/{{run_id}}/artifact. Every GET request for artifacts is validated so that the requested path is under this path, which effectively prevents attackers from reading any files outside that run directory.
However, there is one way to bypass this: the query string. When MLflow appends /{{run_id}}/artifact, it is inserted before the query string of the specified artifact root. For example, if the artifact root is http://foo/bar?a=a, the run's artifact URI will be http://foo/bar/{{run_id}}/artifact?a=a.
This allows path traversal by adding a malformed query string such as "../../../../etc", which results in the run's artifact location being http://foo/bar/{{run_id}}/artifact?../../../../etc, which is then resolved to /etc as a local path.
Solution
This PR resolves this by explicitly validating the query string passed as part of the artifact URI. It simply checks whether the query string contains ".." (after decoding).
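As a rough illustration of the problem and the check described above, the sketch below first builds a run artifact URI by inserting the sub path before the query string, then validates the query string after decoding. The function names `append_run_subpath` and `validate_query_string`, the run id `run1`, and the error messages are assumptions for this example and do not reflect MLflow's actual implementation.

```python
import posixpath
import urllib.parse


def append_run_subpath(artifact_root, run_id):
    """Insert '/<run_id>/artifact' into the path, leaving the query string in place."""
    parsed = urllib.parse.urlparse(artifact_root)
    new_path = posixpath.join(parsed.path, run_id, "artifact")
    return urllib.parse.urlunparse(parsed._replace(path=new_path))


def validate_query_string(query):
    """Reject query strings that contain '..' after repeated decoding."""
    for _ in range(10):
        decoded = urllib.parse.unquote(query)
        if decoded == query:
            break
        query = decoded
    else:
        # Still changing after 10 passes: refuse rather than keep decoding.
        raise ValueError("Failed to decode query string")
    if ".." in query:
        raise ValueError("Invalid query string: '..' is not allowed")


# The sub path lands before the query string, so a hostile query survives intact.
print(append_run_subpath("http://foo/bar?a=a", "run1"))
print(append_run_subpath("http://foo/bar?../../../../etc", "run1"))

for candidate in ["a=a", "../../../../etc", "%252e%252e/etc"]:
    try:
        validate_query_string(candidate)
        print("accepted:", candidate)
    except ValueError as exc:
        print("rejected:", candidate, "->", exc)
```

Decoding before the check matters: without it, an encoded form such as `%2e%2e` would pass the plain `".."` test.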
There were some alternatives considered:
{{run_id}}/artifacts
=> I'm afraid that in some cases we need the query string to access the artifact location, e.g. an S3 ARN can have a region as a query string.
Currently users can specify any query string in the artifact URI, for example, "http:///?/../path".
How is this PR tested?
Validated that the fix prevents experiment creation with a malformed query string:
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
area/artifacts: Artifact stores and artifact logging
area/build: Build and test infrastructure for MLflow
area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
area/docs: MLflow documentation pages
area/examples: Example code
area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
area/models: MLmodel format, model serialization/deserialization, flavors
area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
area/projects: MLproject format, project running backends
area/scoring: MLflow Model server, model deployment tools, Spark UDFs
area/server-infra: MLflow Tracking server backend
area/tracking: Tracking Service, tracking client APIs, autologging
Interface
area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support
Language
language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages
Integrations
integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations
How should the PR be classified in the release notes? Choose one:
rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
rn/feature - A new user-facing feature worth mentioning in the release notes
rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
rn/documentation - A user-facing documentation change worth mentioning in the release notes