[Feature Request] Using Polars for loading and dumping data #304

Open
takeyama0 opened this issue Feb 4, 2023 · 3 comments

@takeyama0

takeyama0 commented Feb 4, 2023

Hello, and thank you for developing such a cool tool!

Summary

I have one feature request: use Polars for loading and dumping data.
Polars is a fast DataFrame library implemented in Rust that uses the Apache Arrow columnar format as its memory model.
If gokart supported it, the machine learning cycle could become even faster.

Implementation idea

I have tried a very simple implementation for parquet files here.
The changes are as follows.

  1. Add a config module at gokart/config, with an __init__.py in it.
# gokart/config/__init__.py
from gokart.config import config
from gokart.config.config import (
    get_option,
    set_option,
)
  1. Create config.py in gokart/config. This file contains the "_global_config" variable and the "register_option", "get_option", and "set_option" functions. "_global_config" holds the global settings as a dictionary and is only accessed through those functions. (Currently, "use_polars" is the only option, and it is registered in "_global_config" by config_init.py.)
# gokart/config/config.py
from typing import Any, Dict

_global_config: Dict[str, Any] = {}


def register_option(
    key: str,
    val: object,
    doc: str = "",
) -> None:
    _global_config.update({key: val})


def get_option(
    key: str,
) -> object:
    assert key in _global_config, f"No such keys: {key}"
    return _global_config[key]


def set_option(
    key: str,
    val: object,
    doc: str = "",
) -> None:
    assert key in _global_config, f"No such keys: {key}"
    _global_config.update({key: val})
  1. Create config_init.py in gokart/config. This file initializes "_global_config" by registering the default options.
# gokart/config/config_init.py
import gokart.config.config as cf

use_polars = """
: boolean
    Whether to use polars instead of pandas
"""

cf.register_option(
    "use_polars",
    False,
    use_polars,
)
  1. Modify gokart/__init__.py to import gokart.config.
# gokart/__init__.py
from gokart.config import config_init, get_option, set_option
from gokart.build import build
...
  1. Modify the ParquetFileProcessor class in gokart/file_processor.py to load data with Polars when the "use_polars" option is True, and to dump either a pandas or a Polars DataFrame.
import pandas as pd
import polars as pl

from gokart.config import get_option


class ParquetFileProcessor(FileProcessor):
    ...

    def load(self, file):
        # MEMO: read_parquet only supports a filepath as string (not a file handle)
        if get_option("use_polars"):
            return pl.read_parquet(file.name)
        return pd.read_parquet(file.name)

    def dump(self, obj, file):
        # pl.DataFrame is the public name of the Polars DataFrame class
        assert isinstance(obj, (pd.DataFrame, pl.DataFrame)), \
            f'requires pd.DataFrame or pl.DataFrame, but {type(obj)} is passed.'
        # MEMO: to_parquet only supports a filepath as string (not a file handle)
        if isinstance(obj, pd.DataFrame):
            obj.to_parquet(file.name, index=False, compression=self._compression)
        else:
            # Fall back to 'zstd' for Polars when no compression is configured
            obj.write_parquet(file.name, compression=self._compression if self._compression is not None else 'zstd')

I am not very familiar with the best practices for this kind of option, but if you tell me what needs to be fixed, I can work on it and open a pull request.

@hirosassa
Collaborator

hirosassa commented Feb 4, 2023

@takeyama0 Thanks for your suggestion and implementation idea!
I'm in favor of supporting Polars for its performance, as you suggest.

IMO, I would like to move pandas and polars into Python extras and raise an ImportError when users call pandas/polars features without the corresponding package installed.
My reasoning is that I don't think any application uses both pandas and polars.
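
A minimal sketch of that idea, assuming the libraries are imported lazily at the point of use. The names `require_module` and `load_parquet` and the extras names `gokart[pandas]` / `gokart[polars]` are hypothetical and only for illustration; they are not part of gokart today.

```python
import importlib
from types import ModuleType


def require_module(name: str, extra: str) -> ModuleType:
    """Import `name` on demand, or raise ImportError naming the extra to install."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # The extras name below is an assumption, not an existing gokart extra.
        raise ImportError(
            f"{name} is required for this feature. "
            f"Install it with: pip install 'gokart[{extra}]'"
        ) from exc


def load_parquet(path: str, use_polars: bool = False):
    """Hypothetical loader: only the selected library is ever imported."""
    if use_polars:
        return require_module("polars", "polars").read_parquet(path)
    return require_module("pandas", "pandas").read_parquet(path)
```

Because the import happens inside the function, importing gokart itself would not fail when one of the two libraries is missing; the error only appears when a feature that needs it is called.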

@Hi-king @ujiuji1259 @mski-iksm What do you think about this?

@ujiuji1259
Contributor

@takeyama0 Thanks for your suggestion! I think it’s great to support Polars too.

I basically agree with @hirosassa's idea of minimizing dependencies, but I'm a little concerned about moving pandas to extras because some common methods (like TaskOnKart.load_data_frame) already use pandas.

@takeyama0
Author

@hirosassa, @ujiuji1259 Thank you for your replies! I am glad to hear your positive feedback about supporting Polars.
