Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Id function improvised #234

Merged
merged 15 commits into from
Jan 27, 2021
Merged

Id function improvised #234

merged 15 commits into from
Jan 27, 2021

Conversation

Priyansdesai
Copy link
Contributor

@Priyansdesai Priyansdesai commented Jan 17, 2021

Added more regex checking within the ID checking to cover more cases. Also, added to check if the difference between consecutive ID values is the same - even intervals.

Had a query regarding falsely identifying columns that are NOT IDs but are still recognized as IDs.

Added more datasets for testing.

Update 12:23AM PST 16th Jan: Complete

@dorisjlee
Copy link
Member

Hi @Priyansdesai, Thanks for the PR! Could you please pull in the latest code from master so that code changes reflect your most recent changes? For the dataset, we generally put the datasets to lux-datasets and use a link for the tests. Many of the datasets that you've used for testing is already in that folder (absenteeism, census, spotify), so it would be great if we could just reuse the same dataset.

@dorisjlee dorisjlee linked an issue Jan 18, 2021 that may be closed by this pull request
@Priyansdesai
Copy link
Contributor Author

Hi @dorisjlee! I have added all the relevant datasets to lux-datasets and raised a PR in that repo. I have also changed the links in the test_type.py file to github links as well as pulled new changes.

@dorisjlee
Copy link
Member

Hi @Priyansdesai , I'm still seeing some files that were changed that shouldn't be part of the PR. Could you use the Files Changed view to check that the PR only contains the intended changes? Thanks!

@Priyansdesai
Copy link
Contributor Author

Hi @dorisjlee! The files car.csv might have changed because I took the dataset from lux-datasets, so that extra record might have been added. Other than that, I haven't changed files other than test_type.py and utils.py. Those changes might have been the result due to merging the new changes in the main repo.

@dorisjlee
Copy link
Member

Hi @Priyansdesai, It would be great if you can make sure that the new changes (even the merged ones) are up-to-date with the latest (i.e., does not show up as color in "Files Changed"). Like you mentioned the PR should only contain changes for test_type.py and utils.py.

@Priyansdesai
Copy link
Contributor Author

@dorisjlee
If you see now, I have reverted all those files that I was not supposed to change to the versions in sink with the master. You can check files changed and see that the changes are same and it is just probably some spacing problem.

For the datasets, I re-downloaded the master branch and copied the same versions of the dataset - car.csv and college.csv

return high_cardinality and (attribute_contain_id or almost_all_vals_unique)
if len(df) >= 2:
diff = df[attribute].diff()
evenly_spaced = all(diff.loc[1:] == diff.loc[1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line fails on examples with indexes.

python -m pytest tests/test_action.py

image

@dorisjlee
Copy link
Member

I've made some edits to fix the formatting issues. However, the line that checks the data diffs leads to errors in the test suite.
image

@Priyansdesai
Copy link
Contributor Author

Hi @dorisjlee! I have fixed the Index issue now. But, one test test_check_datetime_numeric_value is failing in test_type.py. The code miscategorizes the "Year"column to be Nominal instead of Temporal. I do not know where this is coming from, but I think it is not related to the check_id_like function.

I have not yet pushed the code. I just wanted to post an update.

@Priyansdesai
Copy link
Contributor Author

@dorisjlee, I have added the support for Index columns for ID function. The tests that are failing, these are also failing in the Master branch.

Copy link
Contributor

@jerrysong1324 jerrysong1324 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove the one line; please do so before merging as it simplifies control flow logic.

evenly_spaced = True
if attribute_contain_id:
almost_all_vals_unique = df.cardinality[attribute] >= 0.75 * len(df)
return high_cardinality and (almost_all_vals_unique or evenly_spaced)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line can be deleted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to delete all lines from 100 to 103?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jerry - Done the required changes and pushed as well.

@Priyansdesai
Copy link
Contributor Author

@jerrysong1324 - Changes added. The tests that fail, fail in master 2. A test fails in test_type.py, but that does not have anything to do with my code for id detection.

@dorisjlee dorisjlee merged commit 1c0e2eb into lux-org:master Jan 27, 2021
@Priyansdesai
Copy link
Contributor Author

Thank you for merging this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve type detection mechanism for IDs
3 participants