[python-package] adding max_category_values parameter to create_tree_digraph method (fixes #5687) #5818

Merged
merged 17 commits into from
Jun 10, 2023

Conversation

moziada
Contributor

@moziada moziada commented Apr 4, 2023

fixes #5687

Collaborator

@jmoralez jmoralez left a comment

Thanks for your contribution! I've left some minor comments

python-package/lightgbm/plotting.py (outdated, resolved)
python-package/lightgbm/plotting.py (outdated, resolved)
python-package/lightgbm/plotting.py (outdated, resolved)
python-package/lightgbm/plotting.py (outdated, resolved)
@jameslamb jameslamb changed the title adding max_category_values parameter to create_tree_digraph method [python-package] adding max_category_values parameter to create_tree_digraph method Apr 6, 2023
@moziada
Contributor Author

moziada commented Apr 8, 2023

I can't see why the R-package checks are failing.

@jameslamb
Collaborator

Please merge the latest master into this branch to get the changes from #5821.

python-package/lightgbm/plotting.py (outdated, resolved)
python-package/lightgbm/plotting.py (outdated, resolved)
Collaborator

@jmoralez jmoralez left a comment

LGTM. Thanks a lot for your contribution!

@jmoralez
Collaborator

@jameslamb do you want to review this as well?

Collaborator

@jameslamb jameslamb left a comment

Thanks very much for this!

Can you please just add unit tests covering this functionality? That would give us more confidence that this is working and prevent it from being broken in the future.

Please add two tests, both with datasets that have at least one categorical feature that is informative and used in splits.

  • one where the len(category_values) > max_category_values condition you've added is True
  • one where that condition is false

You could, for example, copy this test:

def test_create_tree_digraph(breast_cancer_split):
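
The two cases requested above can be illustrated with a minimal sketch. It assumes the category set of a split is a '||'-separated string (as in LightGBM's dumped model) and uses a hypothetical helper, format_category_values, that is not part of the PR. The exact truncation format in plotting.py may differ.

```python
def format_category_values(threshold: str, max_category_values: int) -> str:
    """Hypothetical helper: shorten a '||'-separated category list for a node label."""
    category_values = threshold.split("||")
    if len(category_values) > max_category_values:
        # Keep the first `max_category_values` entries and mark the rest as elided.
        category_values = category_values[:max_category_values] + ["..."]
    return "||".join(category_values)


# Condition is True: 5 categories > limit of 3, so the label is shortened.
long_label = format_category_values("0||1||2||3||4", max_category_values=3)
# Condition is False: 2 categories <= limit of 3, so the label is unchanged.
short_label = format_category_values("0||1", max_category_values=3)
```

A test for each case would then assert on the rendered node label, once with a feature whose split uses more categories than the limit and once with fewer.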

@moziada
Contributor Author

moziada commented Apr 18, 2023

After looking at the sklearn datasets, I did not find any classification dataset that contains categorical features. I am thinking of creating multiple base datasets with sklearn.datasets.make_classification, each with a different distribution, then merging them into one dataset and using the label of each base dataset as a categorical feature. What do you think?

@jmoralez
Collaborator

I think that could work. You just need to make sure that the target depends on those features somehow (so that they're chosen for some splits), maybe something like what we do here:

if output == 'dataframe-with-categorical':
    num_cat_cols = 2
    for i in range(num_cat_cols):
        col_name = f"cat_col{i}"
        cat_values = rnd.choice(['a', 'b'], X.shape[0])
        cat_series = pd.Series(
            cat_values,
            dtype='category'
        )
        X_df[col_name] = cat_series
        X = np.hstack((X, cat_series.cat.codes.values.reshape(-1, 1)))
    # make one categorical feature relevant to the target
    cat_col_is_a = X_df['cat_col0'] == 'a'
    if objective == 'regression':
        y = np.where(cat_col_is_a, y, 2 * y)
    elif objective == 'binary-classification':
        y = np.where(cat_col_is_a, y, 1 - y)
    elif objective == 'multiclass-classification':
        n_classes = 3
        y = np.where(cat_col_is_a, y, (1 + y) % n_classes)
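
The snippet above is a fragment of a larger helper. A self-contained sketch of the binary-classification branch, using only NumPy and made-up data (the names x, y_dependent, and n_samples are illustrative assumptions, not from the test suite), might look like this:

```python
import numpy as np

rnd = np.random.RandomState(42)
n_samples = 1000

# A numeric feature and a target that depends on it.
x = rnd.normal(size=n_samples)
y = (x > 0).astype(int)

# A random two-level categorical column, initially unrelated to the target.
cat_values = rnd.choice(['a', 'b'], n_samples)
cat_col_is_a = cat_values == 'a'

# Flip the label wherever the category is not 'a'. The relationship between
# x and the target now reverses by category, so the categorical feature
# carries information a model can use in a split.
y_dependent = np.where(cat_col_is_a, y, 1 - y)
```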

@moziada
Contributor Author

moziada commented May 1, 2023

I have created a quantized version of the breast cancer dataset and used its features as categorical features. I made only one test case, where the condition is false (categorical values should not be compressed). The problem is that the way LightGBM splits on categorical features is not deterministic: for example, a feature could have 30 different categorical values, but a tree at a random index may split on only one categorical value. What do you suggest? I would also appreciate your review.

Collaborator

@jameslamb jameslamb left a comment

Nice idea binning the breast_cancer dataset to create an informative dataset of categorical features! I think that's a great approach for this PR's tests.

Please see my suggestions on how to proceed, and please add at least one test where the number of categories in a feature is greater than the value of max_category_values as well.
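
The binning idea can be sketched as follows. This is a minimal illustration on synthetic data rather than the actual test code: the random matrix stands in for the continuous breast_cancer features, and n_bins is an assumed value.

```python
import numpy as np

rnd = np.random.RandomState(0)
# Stand-in for a matrix of continuous features.
X = rnd.normal(size=(500, 3))

n_bins = 5
X_binned = np.empty_like(X, dtype=np.int64)
for j in range(X.shape[1]):
    # Interior quantile edges give roughly equal-sized bins.
    edges = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
    # Each value is replaced by its bin index, an integer code in [0, n_bins - 1].
    X_binned[:, j] = np.digitize(X[:, j], edges)
```

The integer codes can then be declared as categorical features when training, so that a feature with more distinct codes than max_category_values exercises the truncated label path.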

tests/python_package_test/test_plotting.py (outdated, resolved)
tests/python_package_test/test_plotting.py (outdated, resolved)
tests/python_package_test/test_plotting.py (outdated, resolved)
@jameslamb
Collaborator

@moziada sorry for the failing R CI jobs... those are not related to your PR. We've recently fixed that issue in #5859.

Can you please merge latest master into this branch? Once you do that, I'll review the testing changes you've pushed.

@moziada
Contributor Author

moziada commented May 16, 2023

Are there any updates? @jameslamb

@jameslamb jameslamb changed the title [python-package] adding max_category_values parameter to create_tree_digraph method [python-package] adding max_category_values parameter to create_tree_digraph method (fixes #5687) Jun 4, 2023
Collaborator

@jameslamb jameslamb left a comment

Looks good to me, thanks for the contribution!

@jameslamb jameslamb merged commit 15e3aec into microsoft:master Jun 10, 2023
40 checks passed
@github-actions

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 13, 2023

Successfully merging this pull request may close these issues.

plot tree with high cardinality feature
3 participants