Improve type detection mechanism for IDs #88

Closed
dorisjlee opened this issue Sep 18, 2020 · 0 comments · Fixed by #234
Labels: bug (Something isn't working), easy (Easy to fix; Good issues for newcomers)

dorisjlee commented Sep 18, 2020

In Lux, we detect attributes that look like an ID and avoid visualizing them.

There are several issues related to the current type detection mechanisms:

  1. The function check_if_id_like needs to be improved so that it does not rely too heavily on the attribute_contain_id check: even if the attribute name does not contain "ID", an attribute that looks like an ID should still be labeled as one. The cardinality check almost_all_vals_unique is a good example, since most ID fields are largely unique. Another check we could implement is whether the values are spaced at a regular interval (e.g., 200, 201, 202, ...). This is a somewhat weak signal, since it is not a necessary property of an ID.
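
To make the two signals concrete, here is a rough sketch of how a cardinality check and a regular-interval check might be combined. This is not Lux's actual code: the helper names `almost_all_unique`, `regularly_spaced`, and `looks_like_id`, the 0.99 threshold, and the way the signals are combined are all assumptions for illustration.

```python
import pandas as pd


def almost_all_unique(series: pd.Series, threshold: float = 0.99) -> bool:
    """Cardinality signal: nearly every non-null value is distinct.

    The 0.99 threshold is an assumed value, not taken from Lux.
    """
    values = series.dropna()
    if values.empty:
        return False
    return bool(values.nunique() / len(values) >= threshold)


def regularly_spaced(series: pd.Series) -> bool:
    """Weak signal: sorted numeric values increase by one constant step."""
    values = series.dropna()
    numeric = pd.api.types.is_numeric_dtype(values)
    boolean = pd.api.types.is_bool_dtype(values)
    if len(values) < 3 or not numeric or boolean:
        return False
    steps = values.sort_values().diff().dropna()
    return bool(steps.nunique() == 1 and steps.iloc[0] != 0)


def looks_like_id(series: pd.Series) -> bool:
    """Hypothetical combination: uniqueness is required, and either the
    name or the spacing must also point to an ID."""
    # Naive substring test; it would also match names such as "width".
    name_hint = "id" in str(series.name).lower()
    return almost_all_unique(series) and (name_hint or regularly_spaced(series))


# A column with no "id" in its name is still flagged because of its spacing.
print(looks_like_id(pd.Series([200, 201, 202, 203], name="record")))
# A unique but irregular measurement column is not flagged.
print(looks_like_id(pd.Series([3.2, 7.9, 1.5, 8.8], name="price")))
```

Requiring uniqueness and treating the spacing only as supporting evidence reflects the point above that regular spacing is not a necessary property of an ID.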

BUG: We currently trigger ID detection only if the data type of the attribute is detected as an integer (source). We should fix this so that ID-like string attributes (e.g., a CustomerID in the Churn dataset such as "7590-VHVEG") are also detected as IDs.
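
One possible direction for the string case is sketched below. It is only an illustration, not the fix that was merged: `string_looks_like_id` is a hypothetical helper, and the assumptions that ID-like strings are single tokens of one fixed length, along with the 0.99 uniqueness threshold, are mine rather than the issue's. Apart from "7590-VHVEG", the sample values are made up.

```python
import pandas as pd


def string_looks_like_id(series: pd.Series, threshold: float = 0.99) -> bool:
    """Hypothetical check for ID-like string columns."""
    values = series.dropna().astype(str)
    if values.empty:
        return False
    # Same cardinality idea as for integers: nearly all values are distinct.
    mostly_unique = values.nunique() / len(values) >= threshold
    # Assumption: ID codes are single tokens (no whitespace) ...
    single_token = not values.str.contains(r"\s", regex=True).any()
    # ... and every code has the same length, as in "7590-VHVEG".
    fixed_length = values.str.len().nunique() == 1
    return bool(mostly_unique and single_token and fixed_length)


# Made-up codes in the same shape as the CustomerID example.
customer_ids = pd.Series(["7590-VHVEG", "1234-ABCDE", "5678-FGHIJ"])
# A low-cardinality categorical column should not be flagged.
contracts = pd.Series(["One year", "Two year", "One year"])

print(string_looks_like_id(customer_ids))
print(string_looks_like_id(contracts))
```

The fixed-length rule would miss variable-length IDs such as usernames, so in practice it would probably need to be one signal among several rather than a hard requirement.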

Some test data can be found here; feel free to find your own on Kaggle or elsewhere. For a pull request, please include tests that exercise the bugfix on several datasets to verify that ID fields are detected and that non-ID fields are not.

dorisjlee added the bug and easy labels on Sep 18, 2020
dorisjlee added a commit that referenced this issue Sep 30, 2020
* string id detection bug (#88)
* remove bolded Filter description
* id type visualized as nominal
* expanded nominal integer type criteria
* added additional type tests
* making univariate sorted but not top-k
dorisjlee linked a pull request on Jan 18, 2021 that will close this issue