Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redundancy check for MGC #216

Closed
wants to merge 15 commits into from
Closed

Redundancy check for MGC #216

wants to merge 15 commits into from

Conversation

Verathagnus
Copy link
Contributor

@Verathagnus Verathagnus commented Oct 9, 2021

Reference issue

Add warning when X or Y have redundant rows; close #125

Type of change

Documentation

What does this implement/fix?

Added a function _check_redundancy in mgc.py to check if redundancies are present in x or y and report a warning to the user.

Additional information

I used the following code to find if redundant rows exist

    def _check_redundancy(self, x, y, mgc_dict):
        """Check if there are redundant rows in input arrays x and y"""
        if x.shape[0] == 1:
            return

        if [len(x), len(y)] > mgc_dict["opt_scale"]:
            warnings.warn("Redundant rows exist")

Added a test in test_mgc.py for checking if UserWarning is printed successfully.

@netlify
Copy link

netlify bot commented Oct 9, 2021

✔️ Deploy Preview for hyppo ready!

🔨 Explore the source changes: da4d4e6

🔍 Inspect the deploy log: https://app.netlify.com/sites/hyppo/deploys/6168440698284b000715a0f8

😎 Browse the preview: https://deploy-preview-216--hyppo.netlify.app

Copy link
Member

@PSSF23 PSSF23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @sampan501 pointed out earlier, #125 asks for redundant rows, not elements.

Also, you don't need to re-open new PR from the same branch. Just commit directly, and the PR will be updated.

@Verathagnus
Copy link
Contributor Author

Thank you for pointing out my mistake. I'll correct it.

@Verathagnus
Copy link
Contributor Author

Done. There are too many commits but not much changes from the main repo. Should I create a new fork and then add just the changed files?

@codecov
Copy link

codecov bot commented Oct 14, 2021

Codecov Report

Merging #216 (da4d4e6) into main (e96c986) will decrease coverage by 0.05%.
The diff coverage is 97.82%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #216      +/-   ##
==========================================
- Coverage   96.54%   96.49%   -0.06%     
==========================================
  Files          31       31              
  Lines        1477     1484       +7     
==========================================
+ Hits         1426     1432       +6     
- Misses         51       52       +1     
Impacted Files Coverage Δ
mgc.py 97.82% <97.82%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e96c986...da4d4e6. Read the comment docs.

@sampan501
Copy link
Member

@Verathagnus Few things:

  • Adding the commit message "Add files via upload" tells me nothing about the changes that were done and makes it harder for me to review the PR
  • if x.shape[0] == 1 is not a possible case, since the test will not run under 5 samples
  • if [len(x), len(y)] > mgc_dict["opt_scale"] is not the correct logic needed to throw this warning. The optimal scale is only equal to the global scale (in your case [len(x), len(y)] when the relationship is linear and there are no duplicate rows. The current logic will throw a warning when the dependency relationship is nonlinear

Looking through the commits, it is a little difficult to follow what all was done (code disappeared, reappeared, files were duplicated, etc.). I would go through, and cherry-pick the commits that are off or restart with a clean branch.

@Verathagnus
Copy link
Contributor Author

Verathagnus commented Oct 14, 2021

First, I tried using a dictionary(or set) to check for duplicate rows but it caused issue with tests

def _check_redundancy(self, x, y):
        """Check if there are redundant rows in input arrays x and y"""
        redundancy_check_dict = dict()
        redundancy_flag = False
        for i in zip(x, y):
            if str(i) in redundancy_check_dict:
                redundancy_flag = True
                break
            else:
                redundancy_check_dict[str(i)] = 1

        if redundancy_flag:
            warnings.warn("Redundant rows exist")

Next I tried this

        combined = [(x[i], y[i]) for i in range(x.shape[0])]
        unique_rows = set(combined)
        redundancy_flag = len(combined) != len(unique_rows)
        if redundancy_flag:
            warnings.warn("Redundant rows exist")

This one also caused issues in the tests.
I'm not sure what was wrong. Could you please check?

@sampan501
Copy link
Member

I believe this fix should be about a 1 line check of inputs X and Y, something using np.unique

@Verathagnus
Copy link
Contributor Author

Verathagnus commented Oct 14, 2021

I tried with that as well previously but there was a test in which it came out wrong.

combined = np.vstack([X, Y]).T
unique_rows = np.unique(combined, axis=0)

I placed the check_redundancy call before the check_input call previously during that test. That may have caused the issue. I'll try that again this time.

@sampan501
Copy link
Member

yes it should be after

@Verathagnus
Copy link
Contributor Author

#220
Please check this. I have started from fresh and it generated a new PR on contributing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add warning when X or Y have redundant rows
3 participants