Id function improvised #234

Priyansdesai · 2021-01-17T07:48:19Z

Added more regex checks to the ID detection to cover more cases. Also added a check for whether the difference between consecutive ID values is constant (evenly spaced intervals).

I have a question about columns that are NOT IDs but are still falsely identified as IDs.

Added more datasets for testing.

Update 12:23AM PST 16th Jan: Complete
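
For readers following along, here is a minimal sketch of the kind of check described above, written against plain pandas rather than Lux internals. The function name `looks_like_id`, the name pattern `ID_NAME_PATTERN`, and the `unique_ratio` parameter are hypothetical and are not the code in this PR; the sketch also omits the separate high-cardinality condition that the actual function uses.

```python
import re

import pandas as pd

# Hypothetical pattern: "id" as a whole token in the column name
# (e.g. "id", "user_id", "customer id"), but not inside words like "width".
ID_NAME_PATTERN = re.compile(r"(^|[_\s])id($|[_\s])", re.IGNORECASE)


def looks_like_id(df: pd.DataFrame, attribute: str, unique_ratio: float = 0.75) -> bool:
    """Guess whether a column is an ID column from its name and its values."""
    if not ID_NAME_PATTERN.search(str(attribute)):
        return False

    column = df[attribute]
    # Most values are distinct.
    mostly_unique = column.nunique() >= unique_ratio * len(df)

    # Consecutive values differ by one constant, non-zero step.
    evenly_spaced = False
    if pd.api.types.is_numeric_dtype(column) and len(column) >= 2:
        steps = column.diff().dropna()
        evenly_spaced = steps.nunique() == 1 and steps.iloc[0] != 0

    return bool(mostly_unique or evenly_spaced)


df = pd.DataFrame(
    {
        "user_id": [100, 105, 110, 115],
        "width": [100, 105, 110, 115],
        "group_id": [1, 1, 2, 2],
    }
)
print(looks_like_id(df, "user_id"))   # True: name matches and values are unique
print(looks_like_id(df, "width"))     # False: name does not contain an "id" token
print(looks_like_id(df, "group_id"))  # False: few distinct values, uneven steps
```

A name-only or value-only test would misclassify either `width` or `group_id` here, which is why the sketch requires both signals.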

dorisjlee · 2021-01-18T02:08:51Z

Hi @Priyansdesai, thanks for the PR! Could you please pull in the latest code from master so that the diff reflects only your most recent changes? We generally put datasets in lux-datasets and link to them from the tests. Many of the datasets that you've used for testing are already in that folder (absenteeism, census, spotify), so it would be great if we could just reuse them.

Priyansdesai · 2021-01-18T05:15:09Z

Hi @dorisjlee! I have added all the relevant datasets to lux-datasets and raised a PR in that repo. I have also changed the links in the test_type.py file to GitHub links and pulled in the new changes.

dorisjlee · 2021-01-19T05:24:39Z

Hi @Priyansdesai, I'm still seeing some changed files that shouldn't be part of the PR. Could you use the Files Changed view to check that the PR only contains the intended changes? Thanks!

Priyansdesai · 2021-01-19T05:28:43Z

Hi @dorisjlee! The file car.csv might have changed because I took the dataset from lux-datasets, so an extra record might have been added. Other than that, I haven't changed any files besides test_type.py and utils.py. The other changes might be the result of merging the new changes from the main repo.

dorisjlee · 2021-01-19T05:33:44Z

Hi @Priyansdesai, it would be great if you could make sure that all changes (even the merged ones) are up to date with the latest master, i.e., unrelated files do not show up as changed in "Files Changed". As you mentioned, the PR should only contain changes to test_type.py and utils.py.

Priyansdesai · 2021-01-19T05:48:40Z

@dorisjlee
If you look now, I have reverted all the files that I was not supposed to change to the versions in sync with master. You can check Files Changed and see that the contents are the same; any remaining difference is probably just a spacing problem.

For the datasets, I re-downloaded the master branch and copied over the same versions of car.csv and college.csv.

dorisjlee · 2021-01-19T09:19:16Z

lux/utils/utils.py

-        return high_cardinality and (attribute_contain_id or almost_all_vals_unique)
+        if len(df) >= 2:
+            diff = df[attribute].diff()
+            evenly_spaced = all(diff.loc[1:] == diff.loc[1])


This line fails on examples with indexes. To reproduce, run:

python -m pytest tests/test_action.py
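
To illustrate the failure mode: `diff.loc[1]` is a label-based lookup, so it raises a `KeyError` whenever the dataframe's index has no label `1`. The sketch below shows one position-based alternative using `.iloc`. The helper name `is_evenly_spaced` is hypothetical, and this is not necessarily the fix that was later pushed to this PR; treating columns with fewer than two rows as evenly spaced is also an assumption of the sketch.

```python
import pandas as pd


def is_evenly_spaced(series: pd.Series) -> bool:
    """Return True if consecutive values differ by one constant step."""
    if not pd.api.types.is_numeric_dtype(series):
        return False
    if len(series) < 2:
        return True  # assumption: too short to contradict even spacing
    # .iloc selects by position, so the index labels do not matter.
    diffs = series.diff().iloc[1:]
    return bool((diffs == diffs.iloc[0]).all())


# A dataframe whose index labels are strings, not 0..n-1.
df = pd.DataFrame(
    {"id": [10, 20, 30, 40], "v": [1, 5, 2, 9]},
    index=["a", "b", "c", "d"],
)

try:
    df["id"].diff().loc[1]  # label-based lookup: there is no label 1
except KeyError:
    print("label-based lookup failed")

print(is_evenly_spaced(df["id"]))  # True
print(is_evenly_spaced(df["v"]))   # False
```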

dorisjlee · 2021-01-19T09:23:54Z

I've made some edits to fix the formatting issues. However, the line that checks the data diffs leads to errors in the test suite.

Priyansdesai · 2021-01-19T10:30:04Z

Hi @dorisjlee! I have fixed the index issue now. However, one test, test_check_datetime_numeric_value in test_type.py, is failing. The code miscategorizes the "Year" column as Nominal instead of Temporal. I do not know where this is coming from, but I think it is not related to the check_id_like function.

I have not yet pushed the code. I just wanted to post an update.

Priyansdesai · 2021-01-19T10:44:41Z

@dorisjlee, I have added support for index columns to the ID function. The tests that are failing also fail on the master branch.

jerrysong1324

One line can be removed; please do so before merging, as it simplifies the control flow.

jerrysong1324 · 2021-01-23T20:06:18Z

lux/utils/utils.py

+            evenly_spaced = True
+        if attribute_contain_id:
+            almost_all_vals_unique = df.cardinality[attribute] >= 0.75 * len(df)
+            return high_cardinality and (almost_all_vals_unique or evenly_spaced)


This line can be deleted.

Priyansdesai · 2021-01-24

Do you want to delete all lines from 100 to 103?

Priyansdesai · 2021-01-25

@jerry - I have made the required changes and pushed them.

Priyansdesai · 2021-01-25T11:57:15Z

@jerrysong1324 - Changes added. The tests that fail also fail in master. One test fails in test_type.py, but that has nothing to do with my code for ID detection.

Priyansdesai · 2021-01-27T06:54:30Z

Thank you for merging this!

Priyansdesai added 4 commits January 16, 2021 23:44

Id function added

283e8d1

removed unnecessary datasets

e3d9a24

More accurate differentiation between an actual ID field and an ident…

96d4b14

…ification field

Done

34e9152

dorisjlee linked an issue Jan 18, 2021 that may be closed by this pull request

Improve type detection mechanism for IDs #88

Closed

Solving merge conflicts; Removing unnecessary datasets; Dataset links…

830cf9a

… changed to lux-datasets links

dorisjlee requested a review from jerrysong1324 January 19, 2021 05:24

Priyansdesai added 3 commits January 18, 2021 21:37

Only changing intended files

83de78d

Reverted to orig versions of dataset

3b422d1

Reverted to orig versions of dataset

0220526

reformat indentation issue, run black, revert college dataset

27be821

dorisjlee reviewed Jan 19, 2021

broken data link

6e95143

Index columns support added for ID function

f3157c6

black

69962f7

dorisjlee approved these changes Jan 19, 2021

jerrysong1324 reviewed Jan 23, 2021

Priyansdesai added 2 commits January 25, 2021 03:46

new changes

c56a489

Merge branch 'master' of https://github.com/Priyansdesai/lux

41dd51a

Priyansdesai requested a review from jerrysong1324 January 25, 2021 11:58

revert plot config changes

d1a2f17

dorisjlee merged commit 1c0e2eb into lux-org:master Jan 27, 2021
