Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEST-#6830: Use local s3 server instead of public s3 buckets #6863

Merged
merged 7 commits into from Jan 18, 2024

Conversation

anmyachev
Copy link
Collaborator

@anmyachev anmyachev commented Jan 17, 2024

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Remove access to public s3 buckets #6830
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

…kets

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@@ -44,6 +44,8 @@ def test_line_endings():
if any(i in subdir for i in [".git", ".idea", "__pycache__"]):
continue
for file in files:
if file.endswith(".parquet"):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In binary format there is no need to check these characters (\r\n).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sequence of characters may have a different meaning for binary formats. If, for example, you try to replace \r\n with \n, the arrow will give a message that parquet file is corrupted.

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@anmyachev anmyachev marked this pull request as ready for review January 18, 2024 16:35
path,
columns,
use_threads=False,
storage_options=kwargs["storage_options"],
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use of storage_options parameter was missed

@@ -47,7 +47,9 @@ def _read(cls, path_or_buf, **kwargs):
path_or_buf = stringify_path(path_or_buf)
path_or_buf = cls.get_path_or_buffer(path_or_buf)
if isinstance(path_or_buf, str):
if not cls.file_exists(path_or_buf):
if not cls.file_exists(
path_or_buf, storage_options=kwargs.get("storage_options")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use of storage_options parameter was missed

with OpenFile(
path_or_buf,
"rb",
**(kwargs.get("storage_options", None) or {}),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use of storage_options parameter was missed

path_or_buf,
"rb",
kwargs.get("compression", "infer"),
**(kwargs.get("storage_options", None) or {}),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use of storage_options parameter was missed

@@ -68,7 +68,9 @@ def _read(cls, filepath_or_buffer, **kwargs):
reason=cls._file_not_found_msg(filepath_or_buffer),
**kwargs,
)
filepath_or_buffer = cls.get_path(filepath_or_buffer)
filepath_or_buffer = cls.get_path(
filepath_or_buffer, kwargs.get("storage_options")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use of storage_options parameter was missed

Comment on lines +370 to +376
if storage_options is not None:
new_storage_options = dict(storage_options)
new_storage_options.pop("anon", None)
else:
new_storage_options = {}

fs, _ = fsspec.core.url_to_fs(file_path, **new_storage_options)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made by analogy with other functions

@@ -212,17 +216,19 @@ def _pandas_read_csv_glob(path, storage_options):
reason=f"{Engine.get()} does not have experimental API",
)
@pytest.mark.parametrize(
"storage_options",
[{"anon": False}, {"anon": True}, {"key": "123", "secret": "123"}, None],
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't use storage_options=None (because of our S3 server), but the test coverage hasn't decreased, so I guess there's no problem.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we want to add a test case for storage_options=None in future? What are we supposed to do?

Copy link
Collaborator Author

@anmyachev anmyachev Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least, do not use our own s3 server, since in this case we always need to explicitly specify endpoint.

@anmyachev
Copy link
Collaborator Author

@YarShev @dchigarev ready for review

@YarShev YarShev merged commit 4f91d24 into modin-project:master Jan 18, 2024
36 checks passed
@anmyachev anmyachev deleted the issue6830 branch January 18, 2024 19:46
anmyachev added a commit that referenced this pull request Jan 19, 2024
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove access to public s3 buckets
2 participants