Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

throw error when meet non ascii #2229

Merged
merged 3 commits into from Jul 18, 2019
Merged

throw error when meet non ascii #2229

merged 3 commits into from Jul 18, 2019

Conversation

guolinke
Copy link
Collaborator

to fix #2226

@guolinke guolinke requested a review from StrikerRUS June 17, 2019 01:53
@guolinke
Copy link
Collaborator Author

ping @StrikerRUS for review

@StrikerRUS
Copy link
Collaborator

Is feature_name the only model's field in which users' arbitrary strings are stored?

@guolinke
Copy link
Collaborator Author

@StrikerRUS there are some parameters with string type, e.g. data_filename, could be set by user and store in model file.

@StrikerRUS
Copy link
Collaborator

Maybe then add the same check for them too?

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check works fine with Python 3.

def test_non_ascii_names(self):
    X = np.random.random((100, 3))
    with np.testing.assert_raises_regex(lgb.basic.LightGBMError, "Do not support non-ascii*"): 
        lgb.Dataset(X, feature_name=["测试名称", "тестовая_колонка", "clé à bougie"]).construct()

But fails with Python 2.

string = '\xe6\xb5\x8b\xe8\xaf\x95\xe5\x90\x8d\xe7\xa7\xb0'

    def c_str(string):
        """Convert a Python string to C string."""
>       return ctypes.c_char_p(string.encode('utf-8'))
E       UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

So, I think it's not worth to play around with new tests for such simple check.

@jameslamb
Copy link
Collaborator

@StrikerRUS can this PR be merged?

@StrikerRUS
Copy link
Collaborator

@guolinke What do you think about enabling this check for other user's strings, like different paths as you've mentioned? #2229 (comment)

cc @jameslamb

@guolinke
Copy link
Collaborator Author

@StrikerRUS done

@guolinke guolinke closed this Jul 18, 2019
@guolinke guolinke reopened this Jul 18, 2019
Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@guolinke guolinke merged commit 0d59859 into master Jul 18, 2019
@StrikerRUS StrikerRUS deleted the guolinke-patch-2 branch July 19, 2019 10:48
@lock lock bot locked as resolved and limited conversation to collaborators Mar 11, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrong size of feature_names
3 participants