Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python-package] Create Dataset from multiple data files #4089

Merged
merged 70 commits into from
Jul 2, 2021

Conversation

cyfdecyf
Copy link
Contributor

This PR implements creating Dataset from multiple data files.

We add a Sequence interface which requires data providers implement:

  • Return total number of data
  • Random access
    • so we can sample data without going through data in two round
  • Range access
    • used to read data in batch thus avoid loading all data into memory

With this interface, we can create Dataset from List[Sequence]. Also refer this to #2789 as I think this interface allows we use other types of efficient binary data format. I closed #3900 because it's an ugly hack and data loading features are much easier to implement in Python.

An example of creating Dataset from multiple HDF5 files is also given. As I mentioned in #2788, I'm more familiar with HDF5 file so it's choosen as the example.

#2031 also mentions the feature request, guess this is a common feature request.

@Willian-Zhang
Copy link
Contributor

Other potential improvements:
Sequence objects can be easily extended with support of batch reading with row numbers:

seq[[1,2,5,7,9,....]]

such, data provider could provide more efficient reading than current one-row-at-time fashion under the hood.

Copy link
Contributor

@Willian-Zhang Willian-Zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert is_class implementation or use pytypes

@@ -1383,8 +1439,15 @@ def _lazy_init(self, data, label=None, reference=None,
self.__init_from_csc(data, params_str, ref_dataset)
elif isinstance(data, np.ndarray):
self.__init_from_np2d(data, params_str, ref_dataset)
elif isinstance(data, list) and len(data) > 0 and all(isinstance(x, np.ndarray) for x in data):
self.__init_from_list_np2d(data, params_str, ref_dataset)
elif isinstance(data, Sequence):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: revert is_class implementation

This would force any 3rd party lib (presumably Data Access Layer libs) to import LightGBM.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally we need an duck typing implementation.
typechecked and is_of_type would do the job yet requires new dependency pytypes.

Copy link
Contributor

@Willian-Zhang Willian-Zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: fix unwanted memory reference

python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
Copy link
Contributor

@Willian-Zhang Willian-Zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Willian-Zhang
Copy link
Contributor

@shiyu1994 this PR is now ready. Are you still available for reviewing?

return indices

def init_from_sample(self, sample_data, sample_indices, sample_cnt, total_nrow):
"""Get the used parameters in the Dataset.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description is mismatched.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean Get the used parameters in the Dataset. wrongfully describes what this method does?

Yes, I confirm we lost track of where this summary was from. Nevertheless, will fix.


return filtered, filtered_idx

def __init_from_seqs(self, seqs, params_str, ref_dataset):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused parameters params_str and ref_dataset. It seems that ref_dataset is ignored when we constructing from seqs. Does that mean this method only supports constructing training dataset? Because for a validation dataset, a reference training dataset is required in Python API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very much appreciated for making clear on what ref_dataset does.

Does that mean this method only supports constructing training dataset?

Currently, yes. However after your explanation, It seems validation dataset can also be supported via trivial changes to make.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I think now that we've supported creating from seqs for training data, we should also support that for validation data. Otherwise, users can get unexpected behavior if they try to create validation dataset from seqs. Because the only difference between validation and training datasets in Python API is whether reference parameter in lightgbm.Dataset is provided.

I'm willing to provide any help.

Supports random access and access by batch if properly defined by user
"""
total_nrow = sum(len(seq) for seq in seqs)
ncol = len(seqs[0][0])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused variable ncol.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was left out for debugging propose and really should be further extended for data sanity check before loading them.

Copy link
Contributor

@Willian-Zhang Willian-Zhang Apr 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However data check could impact performance since Sequence interface keeps data access unknown to LightGBM.

I suggest adding a new parameter for toggling checking behavior.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we need to check that each seq has exactly the same number of columns? Doing that here is perfectly ok and won't incur efficiency degradation.

Copy link
Contributor

@Willian-Zhang Willian-Zhang Apr 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we need to check that each seq has exactly the same number of columns?

exactly

Doing that here is perfectly ok and won't incur efficiency degradation.

Actually we wouldn't know that for sure.
To exemplify, say our user decided to implement some seqs with some very heavy data provider backend and somehow implemented data prefetching and parallelism as their optimization. (In extreme cases imagine data are generated on the fly from some server instances spawned upon our touching to the data, (or literal cluster of blue-ray discs for data storage shelf moved to some disk readers for preparation @facebook @aws )),

They might would trigger the generation/read for their whole seq when we touch the seq since they would expect data are accessed in ordered fashion.

To generalize, we really cannot make assumptions on how much a single access to data would cost. We hope they would provide best performance for both random data access and by batch access. However in reality, these two are naturally conflicting each other in someway. Assuring batch access indeed in-order, at least giving an option to do so, is the best we can do to that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest adding a new parameter for toggling checking behavior.

Actually some helps to this would be very much helpful.

Is it possible to create a param for python only, in particular, not to pass down to or cause cpp to warn unrecognized parameter?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is possible. We need to pick out the parameter in Python before passing it to CPP, which is not an elegant solution.
Also, we'd better keep the set of parameters in all API consistent.

Copy link
Contributor

@Willian-Zhang Willian-Zhang May 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed via bbd18b9


/*
int num_col = static_cast<int>(sample_values.size());
std::vector<double*> sample_values_ptr(num_col);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Useless code should be deleted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also for debugging. If you like the idea of keep debugging code, I'd move those code into a separarte function that would only be called when debugging is enabled.

@@ -615,10 +615,38 @@ Dataset* DatasetLoader::LoadFromBinFile(const char* data_filename, const char* b
return dataset.release();
}

// To help verify whether sample data is changed when using different language bindings.
static void DumpSampleData(const std::string& sample_filename, double** sample_values,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether this method is necessary to be included in the master branch. Is it for debugging? It seems that all calls to ConstructFromSampleData in the project will pass an empty dump_filename, so this function is not called in the source code actually. And no API is provided for users to access this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is for debugging. I suggest we find some acceptable way to keep these debugging code. It helped a lot to find problems when we develop this feature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, c api exposes the function LGBM_DatasetDumpText for debugging. How about we add an option (say dump_sample) when creating Dataset in Python API?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'd better remove these debug code, which will increase maintenance effort. Unlike dumping a whole Dataset, dumping the sampling part used by construction is seldom needed. And I believe we can manually do so whenever necessary in development.
Thanks again for your contribution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'll remove those debugging code this week.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shiyu1994 Debugging code removed now. Please let me know if there's anything that should be improved.

@pytest.mark.parametrize('include_nan', [False, True])
@pytest.mark.parametrize('num_seq', [1, 3])
def test_sequence(tmpdir, sample_count, batch_size, include_0, include_nan, num_seq):
rm_files(["seq.truth.bin", "seq.seq.bin"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that rm_files removes seq.truth.bin and seq.seq.bin, but subsequently tmpdir/seq.truth.bin and tmpdir/seq.seq.bin will be created. That's a mismatch.
And why do we need to manually remove the files created under tmpdir here, which is not done in other testing cases?

Copy link
Contributor

@Willian-Zhang Willian-Zhang Apr 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to manually remove the files created under tmpdir here, which is not done in other testing cases?

we don't. tmpdir is unique per parameter.

This is indeed a left-out from previous code where tmpdir was not used.

Will fix.

@shiyu1994
Copy link
Collaborator

@cyfdecyf Thanks for your great contribution! I'm comment about my concerns. Overall the implementation is great, but there's seems to be some unnecessary changes.

Willian-Zhang added a commit to cyfdecyf/LightGBM that referenced this pull request Apr 16, 2021
Willian-Zhang added a commit to cyfdecyf/LightGBM that referenced this pull request Apr 16, 2021
@shiyu1994
Copy link
Collaborator

@cyfdecyf Thanks for removing the debugging code. Would you like to continue to complete this PR for the validation datasets? If not, I can help to do that.

@Willian-Zhang
Copy link
Contributor

@cyfdecyf Thanks for removing the debugging code. Would you like to continue to complete this PR for the validation datasets? If not, I can help to do that.

It's on plan, will do in 10 days.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, @cyfdecyf ! I left some comments related to consistency with LightGBM's existing codebase and removing unrelated changes from this PR.

I haven't reviewed the actual functional changes in the PR yet, since it seems like there is still some ongoing discussion about it (e.g. https://github.com/microsoft/LightGBM/pull/4089/files#r598454555, https://github.com/microsoft/LightGBM/pull/4089/files#r612925422, https://github.com/microsoft/LightGBM/pull/4089/files#r612950521)

.ci/test.sh Outdated
Comment on lines 60 to 66
echo "..pycodestyle"
pycodestyle --ignore=E501,W503 --exclude=./.nuget,./external_libs . || exit -1
pydocstyle --convention=numpy --add-ignore=D105 --match-dir="^(?!^external_libs|test|example).*" --match="(?!^test_|setup).*\.py" . || exit -1
echo "..pydocstyle"
pydocstyle --convention=numpy --add-ignore=D105 --match-dir='^(?!^external_libs|test|example).*' --match='(?!^test_|setup).*\.py' . || exit -1
echo "..isort"
isort . --check-only || exit -1
echo "..mypy"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please remove these changes adding new echo statements and moving the order of CI checks around? They don't seem related to this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed via 941dfce

reduce memory usage.
"""

__metaclass__ = abc.ABCMeta
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain what benefit this implementation gets from using abc that it couldn't achieve with just using NotImplementedError for classes that are intended to be abstract? I'd like to understand that before lightgbm takes on this new dependency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Dataset._lazy_init there's many isinstance calls to decide which init function to use. Using abc allows us to keep use isinstance and keep consistency there.

The initial PR does not contain is_class method and thus this is necessary. For now, using abc is not necessary and we can just remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to add some more context:

Generally we need an duck typing implementation.
typechecked and is_of_type would do the job yet requires new dependency pytypes.

Originally posted by @Willian-Zhang in #4089 (comment)

pytypes was originally being used for duck typing implementation.
It is removed in consider to minimize new dependency added to LightGBM.
is_class as along some reordering on is_instances was added as a trade-off.
@cyfdecyf has some good reason to adopt ABC route as stated in this thread, and here we are.

Currently, LightGBM tries to call is_class from this commit:
a8bc7d9#diff-5e0fb2a7b1ec4988ccde5e74a6c4b609841c1143f281da2afc85e8c464d37b3bR625-R630
to determine if a class instance ducks Sequence dues to ABC implementation would result in LightGBM misinterpreting non-subclassing seq object to other types. (convertible to ndarray or csr_matrix, cannot recall exact one.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed via 591fc5e

src/c_api.cpp Outdated Show resolved Hide resolved
@@ -927,7 +927,7 @@ bool Dataset::GetIntField(const char* field_name, data_size_t* out_len,

void Dataset::SaveBinaryFile(const char* bin_filename) {
if (bin_filename != nullptr && std::string(bin_filename) == data_filename_) {
Log::Warning("Bianry file %s already exists", bin_filename);
Log::Warning("Binary file %s already exists", bin_filename);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this! But please remove this unrelated change and make a separate pull request for it.

In general, we have a preference for many small pull requests over a few large ones. A pull request that just changes this typo would be easy to review and merged very quickly, and then the size of the diff for this PR would be reduced.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'll removing this first.

raise NotImplementedError("remove this line if subclassing")

@abc.abstractmethod
def __len__(self): # type: () -> int
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def __len__(self): # type: () -> int
def __len__(self) -> int:

Is there a reason that you prefer to use type hint comments instead of type hints in code? As of #3581 this project no longer supports Python 2.x, so you can use type hints in code directly.

In fact, based on #3756 we have a preference for it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless there is a specific reason to use comments, can you please remove them in this PR and replace them with type hints in code? For consistency with the rest of the project.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's for Python 2.x support. It's a pleasure to drop Python 2.x support.

Copy link
Contributor

@Willian-Zhang Willian-Zhang Apr 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed via a2b0ec9

seqs.append(seq)
ds = lgb.Dataset(seqs, label=Y, params=params)
ds.save_binary(str(tmpdir / "seq.seq.bin"))
assert filecmp.cmp(tmpdir / "seq.truth.bin", tmpdir / "seq.seq.bin")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please use os.path.join() instead of pathlib-style paths in these tests? For consistency with the rest of LightGBM's tests, for example

X_train, y_train = load_svmlight_file(os.path.join(os.path.dirname(os.path.realpath(__file__)),
'../../examples/lambdarank/rank.train'))
X_test, y_test = load_svmlight_file(os.path.join(os.path.dirname(os.path.realpath(__file__)),
'../../examples/lambdarank/rank.test'))
q_train = np.loadtxt(os.path.join(os.path.dirname(os.path.realpath(__file__)),
'../../examples/lambdarank/rank.train.query'))
q_test = np.loadtxt(os.path.join(os.path.dirname(os.path.realpath(__file__)),
'../../examples/lambdarank/rank.test.query'))

If you have a preference for pathlib-style paths over constructing them with os.path you are welcome to open a new issue proposing that and explaining why, but if it's not absolutely necessary for this pull request I'd prefer to keep new code consistent with the existing codebase. That consistency helps a lot for the small team of volunteers maintaining this project.

Copy link
Contributor

@Willian-Zhang Willian-Zhang Apr 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed via 60fe403

@Willian-Zhang
Copy link
Contributor

@cyfdecyf Thanks for removing the debugging code. Would you like to continue to complete this PR for the validation datasets? If not, I can help to do that.

validation dataset is completed with via 92d0b39

tho it seems saved binary file for validation datasets from same data are not guaranteed to stay identical.
I'm not so sure if this is an expected behavior.

Some observations:
image

[LightGBM] [Info] num_features_ 0
[LightGBM] [Info] num_total_features_ 10
[LightGBM] [Info] label_idx_ 0
[LightGBM] [Info] max_bin_ 0
[LightGBM] [Info] bin_construct_sample_cnt_ -1359898024
[LightGBM] [Info] min_data_in_bin_ 32723
[LightGBM] [Info] zero_as_missing_ 248
[LightGBM] [Info] has_raw_ 0

examples/python-guide/dataset_from_multi_hdf5.py Outdated Show resolved Hide resolved
src/c_api.cpp Outdated

auto sample_indices = CreateSampleIndices(config, total_nrow);

static_assert (sizeof(int) == 4, "int size is not 4");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this assert statement?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is useless now and removed.

CreateSampleIndicies returns std::vector<int> at first. It's now returning std::vector<int32_t> by implicit type convertion from Random.Sample's return value.

files = [files]
for file in files:
if os.path.exists(file):
os.remove(file)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned before, this should be removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed via 2d32b87

ds.save_binary(os.path.join(tmpdir, "seq.seq.bin"))

if create_valid:
# TODO: verify validation dataset somehow
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add tests for validation dataset.

@shiyu1994
Copy link
Collaborator

[LightGBM] [Info] num_features_ 0
[LightGBM] [Info] num_total_features_ 10
[LightGBM] [Info] label_idx_ 0
[LightGBM] [Info] max_bin_ 0
[LightGBM] [Info] bin_construct_sample_cnt_ -1359898024
[LightGBM] [Info] min_data_in_bin_ 32723
[LightGBM] [Info] zero_as_missing_ 248
[LightGBM] [Info] has_raw_ 0

Are these attributes from a validation dataset loaded from bin file?
For validation dataset, num_features_ 0 is not expected, since it should be set in Dataset::CreateValid, which is called by LGBM_DatasetCreateByReference.

LightGBM/src/io/dataset.cpp

Lines 726 to 728 in f831808

void Dataset::CreateValid(const Dataset* dataset) {
feature_groups_.clear();
num_features_ = dataset->num_features_;

@shiyu1994
Copy link
Collaborator

shiyu1994 commented Jun 28, 2021

While adding C API LGBM_SampleCount, I noticed rand.Sample may return number of values less than requested sample count.
#4089 (comment)

@shiyu1994 Could you please take a look?

I think Sample only returns K elements in expectation, but does not guarantee to return exactly K elements. We need to figure out an efficient way to Sample exactly K elements.

BTW, thank you all for your hard work and detailed review for this PR. Really appreciate your contribution.

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor cleaning-up comments below.

docs/Python-Intro.rst Outdated Show resolved Hide resolved
docs/Python-Intro.rst Outdated Show resolved Hide resolved
docs/Python-Intro.rst Outdated Show resolved Hide resolved
include/LightGBM/c_api.h Outdated Show resolved Hide resolved
src/c_api.cpp Outdated Show resolved Hide resolved
include/LightGBM/c_api.h Outdated Show resolved Hide resolved
src/c_api.cpp Outdated Show resolved Hide resolved
@StrikerRUS
Copy link
Collaborator

StrikerRUS commented Jun 28, 2021

@shiyu1994

We need to figure out an efficient way to Sample exactly K elements.

Thanks a lot for taking a look! Do you think we need feature request for this?

Also, please update your review for this PR because there have been a lot of code changes at cpp side since your latest approval (now there are 2 new C API functions, for example).

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@cyfdecyf
Copy link
Contributor Author

@StrikerRUS Thanks for the cleanup. I should have done renaming with find and replace..

@shiyu1994
Copy link
Collaborator

@StrikerRUS Write to #2302. Sure, I'll update my review now.

Copy link
Collaborator

@shiyu1994 shiyu1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked again, only a few typos to be fixed and DEFAULT_BIN_CONSTRUCT_SAMPLE_CNT to be removed.

include/LightGBM/c_api.h Outdated Show resolved Hide resolved
python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
@cyfdecyf
Copy link
Contributor Author

@shiyu1994 Thanks for the review. Just addressed all the comments and pushed my changes.

@shiyu1994
Copy link
Collaborator

shiyu1994 commented Jun 30, 2021

@cyfdecyf @StrikerRUS Sorry, after a careful investigation, I found that the Sample function always returns exactly K elements. I should delete the feature request. But I don't know why the sample_cnt should be recalculated as mentioned here #4089 (comment).

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so so much for this awesome contribution!
LGTM! I don't have any other comments. Thanks for your patience!

python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@cyfdecyf
Copy link
Contributor Author

cyfdecyf commented Jul 1, 2021

Commit 8a0f07a introduced sample_cnt assignement from sampled indices vector.

In line 87 in the following code, if NextShort always return a value near max_step, then we may sample less then K values.

int min_step = 1;
int avg_step = N / K;
int max_step = 2 * avg_step - min_step;
int start = -1;
for (int i = 0; i < K; ++i) {
int step = NextShort(min_step, max_step + 1);
start += step;
if (start >= N) { break; }
ret.push_back(start);
}

@shiyu1994
Copy link
Collaborator

I see. But it seems that the latest version of Sample won't have such problem.

} else {
std::set<int> sample_set;
for (int r = N - K; r < N; ++r) {
int v = NextInt(0, r);
if (!sample_set.insert(v).second) {
sample_set.insert(r);
}
}

In the loop above, each iteration a new element will be added into the set. But as #4371 pointed out, there's a bug in line 91, NextInt(0, r) should be NextInt(0, r + 1). I think we can fix this, and remove the recalculation of sample_cnt in the code in another PR.

@shiyu1994
Copy link
Collaborator

@StrikerRUS @jameslamb Thank you both for your careful review. Can we merge this PR now?

@StrikerRUS
Copy link
Collaborator

Yeah, I believe we can continue (#4403 and #4089 (comment)) in a follow-up PRs.

@cyfdecyf Great contribution, many thanks!

@cyfdecyf
Copy link
Contributor Author

cyfdecyf commented Jul 6, 2021

Thanks for all the reviews. I'll start working on #4403 next week.

@StrikerRUS
Copy link
Collaborator

I'll start working on #4403 next week.

@cyfdecyf
Many thanks!

Also, #4450 has been merged, so please check #4089 (comment)

... and remove the recalculation of sample_cnt in the code in another PR.

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants