Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Take Sequence into consideration in get_data() and add_features_from() methods of Dataset object #4403

Closed
StrikerRUS opened this issue Jun 23, 2021 · 3 comments

Comments

@StrikerRUS
Copy link
Collaborator

This issue is actual only after merging #4089.

Refer to #4089 (comment).

@cyfdecyf
Copy link
Contributor

I've started working on this. As I don't uses these two APIs, I need some guidence.

PR #4472 adds support Sequence input for Dataset.get_data. I'm wondering what's the use case of creating a subset with an existing Dataset.

For Dataset.add_features_from, I'm reluctant to support this naively for two reasons:

  • It defeats the saving memory benefit of Sequence interface
    • The current implementation seems combing raw data together
  • We can combine features by Sequence interface easily
    • Thus don't need to use add_features_from

Here's a simple combine Sequence by columns example:

class CombineSequenceByColumn(lgb.Sequence):
    def __init__(self, seqlst: List[lgb.Sequence], batch_size):
        self.seqlst = seqlst
        self.batch_size = batch_size

    def __getitem__(self, idx):
        if isinstance(idx, numbers.Integral):
            return np.hstack([arr[idx] for arr in self.seqlst])
        elif isinstance(idx, slice):
            return np.hstack([arr[idx.start:idx.stop] for arr in self.seqlst])
        else:
            raise TypeError(f"Sequence Index must be an integer/list/slice, got {type(idx).__name__}")

    def __len__(self):
        return len(self.seqlst[0])

If we really want to support Sequence input for add_features_from, I suggest first create a class CombineSequenceByRow to combine List[Sequence] as a single Sequence object. With those two Sequence combining classes, we can support Sequence input for add_features_from like this:

  • If both datasets' input are Sequence, then create an object of CombineSequenceByColumn
  • If only one dataset is Sequence, then adapt the other as Sequence and then combine as above case
  • For input of List[Sequence], create an object of CombineSequenceByRow and then handle as above cases

This implementation can still retain the benefit of saving memory, but it's a lot of work as we have to adapt various existing supprted types to Sequence. It would be much easier to ask the user to combine features with Sequence interface.

@StrikerRUS
Copy link
Collaborator Author

@cyfdecyf Thanks a lot for taking this issue!

It would be much easier to ask the user to combine features with Sequence interface.

Yeah, given your great explanation, I totally agree with you.

@StrikerRUS
Copy link
Collaborator Author

Closed via #4472.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants