memory: use pymongoarrow to get dataset results as dataframe #2868

Closed
severo opened this issue May 29, 2024 · 1 comment
Labels: improvement / optimization, P1 (Not as needed as P0, but still important/wanted)

Comments

Collaborator

severo commented May 29, 2024

As stated in the code, the current conversion from a list of mongo entries to a dataframe is not necessarily optimal:

def _get_df(entries: list[CacheEntryFullMetadata]) -> pd.DataFrame:
    return pd.DataFrame(
        {
            "kind": pd.Series([entry["kind"] for entry in entries], dtype="category"),
            "dataset": pd.Series([entry["dataset"] for entry in entries], dtype="str"),
            "config": pd.Series([entry["config"] for entry in entries], dtype="str"),
            "split": pd.Series([entry["split"] for entry in entries], dtype="str"),
            "http_status": pd.Series(
                [entry["http_status"] for entry in entries], dtype="category"
            ),  # check if it's working as expected
            "error_code": pd.Series([entry["error_code"] for entry in entries], dtype="category"),
            "dataset_git_revision": pd.Series([entry["dataset_git_revision"] for entry in entries], dtype="str"),
            "job_runner_version": pd.Series([entry["job_runner_version"] for entry in entries], dtype=pd.Int16Dtype()),
            "progress": pd.Series([entry["progress"] for entry in entries], dtype="float"),
            "updated_at": pd.Series(
                [entry["updated_at"] for entry in entries], dtype="datetime64[ns]"
            ),  # check if it's working as expected
            "failed_runs": pd.Series([entry["failed_runs"] for entry in entries], dtype=pd.Int16Dtype()),
        }
    )
    # ^ does not seem optimal at all, but I get the types right

See also:

def _get_df(self, jobs: list[FlatJobInfo]) -> pd.DataFrame:
    return pd.DataFrame(
        {
            "job_id": pd.Series([job["job_id"] for job in jobs], dtype="str"),
            "type": pd.Series([job["type"] for job in jobs], dtype="category"),
            "dataset": pd.Series([job["dataset"] for job in jobs], dtype="str"),
            "revision": pd.Series([job["revision"] for job in jobs], dtype="str"),
            "config": pd.Series([job["config"] for job in jobs], dtype="str"),
            "split": pd.Series([job["split"] for job in jobs], dtype="str"),
            "priority": pd.Categorical(
                [job["priority"] for job in jobs],
                ordered=True,
                categories=[Priority.LOW.value, Priority.NORMAL.value, Priority.HIGH.value],
            ),
            "status": pd.Categorical(
                [job["status"] for job in jobs],
                ordered=True,
                categories=[
                    Status.WAITING.value,
                    Status.STARTED.value,
                ],
            ),
            "created_at": pd.Series([job["created_at"] for job in jobs], dtype="datetime64[ns]"),
        }
    )
    # ^ does not seem optimal at all, but I get the types right

We might benefit from using PyMongoArrow (https://github.com/mongodb-labs/mongo-arrow/tree/main/bindings/python) for that. It is the approach recommended by the MongoDB team:

PyMongoArrow is the recommended way to materialize MongoDB query result sets as contiguous-in-memory, typed arrays suited for in-memory analytical processing applications.
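A minimal sketch of what the switch could look like for the cache entries, not the actual implementation. It assumes a pymongo `Collection` whose documents carry the fields used in `_get_df` above; the Arrow types chosen in the schema (in particular `int32` for `http_status`, `job_runner_version` and `failed_runs`, and millisecond timestamps) are guesses about how the values are stored. The helper names `fetch_cache_entries` and `cast_to_expected_dtypes` are hypothetical. Because the dtypes returned by `find_pandas_all` may not match the ones currently produced (`category`, nullable `Int16`), a cast step is applied afterwards.

```python
import pandas as pd

# Columns whose pandas dtype probably has to be restored after the Arrow round trip.
# The target dtypes are copied from the `_get_df` snippet quoted in this issue.
CAST_DTYPES = {
    "kind": "category",
    "http_status": "category",
    "error_code": "category",
    "job_runner_version": pd.Int16Dtype(),
    "failed_runs": pd.Int16Dtype(),
}


def cast_to_expected_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Cast the columns listed in CAST_DTYPES, ignoring those absent from df."""
    present = {col: dtype for col, dtype in CAST_DTYPES.items() if col in df.columns}
    return df.astype(present)


def fetch_cache_entries(collection, dataset: str) -> pd.DataFrame:
    """Hypothetical replacement for the list-of-dicts path, using PyMongoArrow."""
    # Imported lazily so the casting helper can be tested without MongoDB tooling.
    import pyarrow as pa
    from pymongoarrow.api import Schema, find_pandas_all

    # Assumption: these Arrow types match how the fields are stored in MongoDB.
    schema = Schema(
        {
            "kind": pa.string(),
            "dataset": pa.string(),
            "config": pa.string(),
            "split": pa.string(),
            "http_status": pa.int32(),
            "error_code": pa.string(),
            "dataset_git_revision": pa.string(),
            "job_runner_version": pa.int32(),
            "progress": pa.float64(),
            "updated_at": pa.timestamp("ms"),
            "failed_runs": pa.int32(),
        }
    )
    df = find_pandas_all(collection, {"dataset": dataset}, schema=schema)
    return cast_to_expected_dtypes(df)
```

The string and datetime columns are left as returned in this sketch; whether they come back with the same dtypes as today (`str`, `datetime64[ns]`) would need to be checked against real data.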

Some comments:

  • we have to implement unit tests on these methods before switching to be sure we don't break anything
  • we could add memory tests (@pytest.mark.limit_memory(), see https://bloomberg.github.io/memray/tutorials/additional_features.html#pytest-plugin) to ensure we reduce the memory footprint
  • pymongoarrow requires a specific version of pyarrow (currently ^15, while we use ^14, and the latest pyarrow version is ^16 :) )
  • will it work nicely with the types we rely on (category, nullable Int16, datetime64[ns])?
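The first two points could be covered by a test along these lines. This is only a sketch: `build_df` is a hypothetical three-column stand-in for `_get_df`, and the "50 MB" budget is an arbitrary placeholder, not a measured value. The `limit_memory` marker comes from pytest-memray and is only enforced when that plugin is installed and enabled; otherwise the test still checks the dtypes.

```python
import pandas as pd
import pytest


def build_df(entries: list[dict]) -> pd.DataFrame:
    # Hypothetical stand-in for `_get_df`, reduced to three columns.
    return pd.DataFrame(
        {
            "kind": pd.Series([e["kind"] for e in entries], dtype="category"),
            "progress": pd.Series([e["progress"] for e in entries], dtype="float"),
            "failed_runs": pd.Series([e["failed_runs"] for e in entries], dtype=pd.Int16Dtype()),
        }
    )


@pytest.mark.limit_memory("50 MB")  # placeholder budget, to be tuned on real data
def test_build_df_types_and_memory() -> None:
    entries = [
        {"kind": f"step-{i % 5}", "progress": 1.0, "failed_runs": i % 3}
        for i in range(10_000)
    ]
    df = build_df(entries)
    # Guard the dtypes so a later switch to pymongoarrow cannot silently change them.
    assert len(df) == 10_000
    assert isinstance(df["kind"].dtype, pd.CategoricalDtype)
    assert df["progress"].dtype == "float64"
    assert df["failed_runs"].dtype == pd.Int16Dtype()
```

Writing the dtype assertions first, against the current implementation, gives a baseline that the pymongoarrow version then has to match.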
@severo severo added improvement / optimization P1 Not as needed as P0, but still important/wanted labels May 29, 2024
@AndreaFrancis AndreaFrancis self-assigned this May 29, 2024
@AndreaFrancis
Contributor

Implemented by #2879

2 participants