Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct pyarrow version check #11905

Open
wants to merge 11 commits into
base: master
Choose a base branch
from
Open

Conversation

antbbn
Copy link

@antbbn antbbn commented May 4, 2024

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

pip install git+https://github.com/mlflow/mlflow.git@refs/pull/11905/merge

Checkout with GitHub CLI

gh pr checkout 11905

According to issue #8213 addressed with PR #9878 we should use pyarrow.fs.HadoopFileSystem with pyarrow GREATER THAN 2.0.0 but condition was inverted.

Related Issues/PRs

Issue #8213
PR #9878

What changes are proposed in this pull request?

According to issue #8213 addressed with PR #9878 we should use pyarrow.fs.HadoopFileSystem with pyarrow GREATER THAN 2.0.0 but condition was inverted.

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Users will stop seeing the deprecation warning.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

According to issue mlflow#8213 addressed with PR mlflow#9878 we should use `pyarrow.fs.HadoopFileSystem` with pyarrow GREATER THAN 2.0.0 but condition was inverted.

Signed-off-by: Antonio Bibiano <antbbn@users.noreply.github.com>
@github-actions github-actions bot added patch-2.12.2 area/artifacts Artifact stores and artifact logging rn/none List under Small Changes in Changelogs. labels May 4, 2024
Copy link

github-actions bot commented May 4, 2024

Documentation preview for 1ad3729 will be available when this CircleCI job
completes successfully.

More info

Signed-off-by: Harutaka Kawamura <hkawamura0130@gmail.com>
@WeichenXu123
Copy link
Collaborator

@antbbn Could you check and fix test failures?

@antbbn
Copy link
Author

antbbn commented May 8, 2024 via email

Signed-off-by: harupy <17039389+harupy@users.noreply.github.com>
Signed-off-by: harupy <17039389+harupy@users.noreply.github.com>
Signed-off-by: harupy <17039389+harupy@users.noreply.github.com>
Signed-off-by: harupy <17039389+harupy@users.noreply.github.com>
@antbbn
Copy link
Author

antbbn commented May 19, 2024 via email

Copy link
Collaborator

@dbczumar dbczumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - apologies for my failure to read the PR description. Thanks @antbbn!

@harupy
Copy link
Member

harupy commented May 19, 2024

@antbbn did you have a chance to manually test this?

@antbbn
Copy link
Author

antbbn commented May 22, 2024

@antbbn did you have a chance to manually test this?

Yes, and unfortunately it doesn't work, the legacy pyarrow.HadoopFileSystem and pyarrow.fs.HadoopFileSystem have completely different interfaces.

So a bit more refactoring of this module is necessary, i'm working on it and will report back soon.

@antbbn
Copy link
Author

antbbn commented May 25, 2024

@antbbn did you have a chance to manually test this?

Yes, and unfortunately it doesn't work, the legacy pyarrow.HadoopFileSystem and pyarrow.fs.HadoopFileSystem have completely different interfaces.

So a bit more refactoring of this module is necessary, i'm working on it and will report back soon.

The last two commits are my attempt at a refactor, while at the beginning I started to use the pyarrow API directly I realized that we were re-implementing functionality that is already present in the fsspec package. I think it's a worthy addition and it greatly simplifies the code here, I think eventually many of the repos in there could be refactored to be just thin wrappers around fsspec implementations.

Signed-off-by: Antonio Bibiano <antbbn@gmail.com>
Signed-off-by: Antonio Bibiano <antbbn@gmail.com>
@antbbn antbbn force-pushed the fix-pyarrow-check branch 2 times, most recently from 7b5fa0b to 7c7301f Compare May 29, 2024 16:58
Signed-off-by: Antonio Bibiano <antbbn@gmail.com>
Signed-off-by: Antonio Bibiano <antbbn@gmail.com>
@B-Step62
Copy link
Collaborator

@antbbn

The last two commits are my attempt at a refactor, while at the beginning I started to use the pyarrow API directly I realized that we were re-implementing functionality that is already present in the fsspec package.

Thank you for the simplification suggestion and we appreciate your effort to improve the code quality! However, we are generally not willing to add a new dependency for improving code quality. MLflow is used in a wide range of environment including critical production services, so dependency libraries need to be carefully assessed to avoid any compatibility issues. Also, introducing new dependency can potentially increase maintenance cost.

Would you mind reverting the refactoring change and use pyarrow API directly instead? Please let us know if you have any other concerns of that approach.

Thank you so much for your contribution!

@B-Step62 B-Step62 mentioned this pull request Jun 11, 2024
22 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/artifacts Artifact stores and artifact logging rn/none List under Small Changes in Changelogs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants