memory: use pymongoarrow to get dataset results as dataframe #2868

severo · 2024-05-29T09:02:22Z

As stated in the code, the current conversion from a list of mongo entries to a dataframe is not necessarily optimal:

dataset-viewer/libs/libcommon/src/libcommon/simple_cache.py

Lines 813 to 833 in 27edd1f

    
    def _get_df(entries: list[CacheEntryFullMetadata]) -> pd.DataFrame:
        return pd.DataFrame(
            {
                "kind": pd.Series([entry["kind"] for entry in entries], dtype="category"),
                "dataset": pd.Series([entry["dataset"] for entry in entries], dtype="str"),
                "config": pd.Series([entry["config"] for entry in entries], dtype="str"),
                "split": pd.Series([entry["split"] for entry in entries], dtype="str"),
                "http_status": pd.Series(
                    [entry["http_status"] for entry in entries], dtype="category"
                ),  # check if it's working as expected
                "error_code": pd.Series([entry["error_code"] for entry in entries], dtype="category"),
                "dataset_git_revision": pd.Series([entry["dataset_git_revision"] for entry in entries], dtype="str"),
                "job_runner_version": pd.Series([entry["job_runner_version"] for entry in entries], dtype=pd.Int16Dtype()),
                "progress": pd.Series([entry["progress"] for entry in entries], dtype="float"),
                "updated_at": pd.Series(
                    [entry["updated_at"] for entry in entries], dtype="datetime64[ns]"
                ),  # check if it's working as expected
                "failed_runs": pd.Series([entry["failed_runs"] for entry in entries], dtype=pd.Int16Dtype()),
            }
        )
        # ^ does not seem optimal at all, but I get the types right

See also:

dataset-viewer/libs/libcommon/src/libcommon/queue.py

Lines 994 to 1019 in 27edd1f

    
    def _get_df(self, jobs: list[FlatJobInfo]) -> pd.DataFrame:
        return pd.DataFrame(
            {
                "job_id": pd.Series([job["job_id"] for job in jobs], dtype="str"),
                "type": pd.Series([job["type"] for job in jobs], dtype="category"),
                "dataset": pd.Series([job["dataset"] for job in jobs], dtype="str"),
                "revision": pd.Series([job["revision"] for job in jobs], dtype="str"),
                "config": pd.Series([job["config"] for job in jobs], dtype="str"),
                "split": pd.Series([job["split"] for job in jobs], dtype="str"),
                "priority": pd.Categorical(
                    [job["priority"] for job in jobs],
                    ordered=True,
                    categories=[Priority.LOW.value, Priority.NORMAL.value, Priority.HIGH.value],
                ),
                "status": pd.Categorical(
                    [job["status"] for job in jobs],
                    ordered=True,
                    categories=[
                        Status.WAITING.value,
                        Status.STARTED.value,
                    ],
                ),
                "created_at": pd.Series([job["created_at"] for job in jobs], dtype="datetime64[ns]"),
            }
        )
        # ^ does not seem optimal at all, but I get the types right

We might benefit from using PyMongoArrow (https://github.com/mongodb-labs/mongo-arrow/tree/main/bindings/python) for that. It is the approach recommended by the MongoDB team:

PyMongoArrow is the recommended way to materialize MongoDB query result sets as contiguous-in-memory, typed arrays suited for in-memory analytical processing applications.
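To make the idea concrete, here is a minimal sketch of what the cache query could look like with PyMongoArrow. It is not the code merged in the repository. It assumes the MongoDB documents carry fields with the same names as the keys used in `_get_df`; `get_cache_df` and `cast_cache_dtypes` are hypothetical helper names, and `collection` is assumed to be a regular `pymongo` collection. The schema uses the plain Python types that `pymongoarrow.api.Schema` accepts, and the narrower pandas dtypes (`category`, `Int16`) are applied afterwards, on the assumption that they cannot be requested directly in the schema.

```python
import pandas as pd


def cast_cache_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Cast the columns returned by PyMongoArrow to the dtypes `_get_df` builds today."""
    return df.astype(
        {
            "kind": "category",
            "http_status": "category",
            "error_code": "category",
            "job_runner_version": pd.Int16Dtype(),
            "failed_runs": pd.Int16Dtype(),
        }
    )


def get_cache_df(collection, dataset: str) -> pd.DataFrame:
    """Hypothetical helper: load the cache entries of one dataset as a typed DataFrame."""
    # Imported lazily so that the casting helper above stays usable (and testable)
    # in environments where pymongoarrow is not installed.
    from datetime import datetime

    from pymongoarrow.api import Schema, find_pandas_all

    # Field names are assumed to match the keys used in `_get_df`.
    schema = Schema(
        {
            "kind": str,
            "dataset": str,
            "config": str,
            "split": str,
            "http_status": int,
            "error_code": str,
            "dataset_git_revision": str,
            "job_runner_version": int,
            "progress": float,
            "updated_at": datetime,
            "failed_runs": int,
        }
    )
    # The query result is materialized through Arrow, without an intermediate list of dicts.
    df = find_pandas_all(collection, {"dataset": dataset}, schema=schema)
    return cast_cache_dtypes(df)
```

Keeping the dtype casting in its own function means the type mapping can be checked on a hand-built dataframe, independently of a running MongoDB instance.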

Some comments:

- we have to implement unit tests on these methods before switching, to be sure we don't break anything
- we could add memory tests (@pytest.mark.limit_memory(), see https://bloomberg.github.io/memray/tutorials/additional_features.html#pytest-plugin) to ensure we reduce the memory footprint
- pymongoarrow requires a specific version of pyarrow (currently ^15, while we use ^14, and the latest pyarrow version is ^16 :) )
- will it work nicely with the types?
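As an illustration of the first point, a dtype regression test could look like the sketch below. It assumes nothing about the real test suite: `build_df` is a simplified stand-in for `_get_df` with one column per dtype family, and `SAMPLE_ENTRIES`, `EXPECTED_DTYPES` and `dtype_mismatches` are hypothetical names. The same assertions could be run unchanged against a PyMongoArrow-based implementation to check that the types survive the switch.

```python
import datetime

import pandas as pd


def build_df(entries: list[dict]) -> pd.DataFrame:
    """Simplified stand-in for `_get_df`, keeping one column per dtype family."""
    return pd.DataFrame(
        {
            "kind": pd.Series([e["kind"] for e in entries], dtype="category"),
            "dataset": pd.Series([e["dataset"] for e in entries], dtype="str"),
            "job_runner_version": pd.Series(
                [e["job_runner_version"] for e in entries], dtype=pd.Int16Dtype()
            ),
            "progress": pd.Series([e["progress"] for e in entries], dtype="float"),
            "updated_at": pd.Series([e["updated_at"] for e in entries], dtype="datetime64[ns]"),
        }
    )


# Dtypes that must not change when the implementation is swapped.
EXPECTED_DTYPES = {
    "kind": "category",
    "job_runner_version": "Int16",
    "progress": "float64",
    "updated_at": "datetime64[ns]",
}

# Made-up entries; the second one has missing values on purpose.
SAMPLE_ENTRIES = [
    {
        "kind": "dataset-info",
        "dataset": "user/dataset_a",
        "job_runner_version": 3,
        "progress": 1.0,
        "updated_at": datetime.datetime(2024, 5, 29, 9, 2, 22),
    },
    {
        "kind": "dataset-size",
        "dataset": "user/dataset_b",
        "job_runner_version": None,
        "progress": None,
        "updated_at": datetime.datetime(2024, 6, 10, 20, 32, 14),
    },
]


def dtype_mismatches(df: pd.DataFrame, expected: dict[str, str]) -> dict[str, str]:
    """Return the columns whose dtype differs from the expected one, with the actual dtype."""
    return {
        column: str(df[column].dtype)
        for column, wanted in expected.items()
        if str(df[column].dtype) != wanted
    }


def test_build_df_dtypes() -> None:
    df = build_df(SAMPLE_ENTRIES)
    assert dtype_mismatches(df, EXPECTED_DTYPES) == {}
    # Missing integers must stay missing instead of being turned into floats.
    assert df["job_runner_version"].isna().tolist() == [False, True]
```

A memory bound could be layered on top of such a test with the `limit_memory` marker from the pytest plugin linked above, once a realistic number of entries is used.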


AndreaFrancis · 2024-06-10T20:32:14Z

Implemented by #2879

severo added the "improvement / optimization" and "P1" (not as needed as P0, but still important/wanted) labels May 29, 2024

AndreaFrancis self-assigned this May 29, 2024

AndreaFrancis mentioned this issue Jun 3, 2024

Use pymongoarrow to get dataset results as dataframe #2879

Merged

AndreaFrancis closed this as completed Jun 10, 2024
