Zero and inf fix #4

Merged
merged 9 commits into from Mar 15, 2017
Collaborator

@sergeyf sergeyf commented Mar 15, 2017

Re: #3

@iskandr Please check it out to make sure these changes don't fundamentally break something else.

@amueller

Maybe add a minimal regression test using

X = np.array([[1, 1, np.nan], [1, 1, 1]])
knnimpute.knn_impute_few_observed(X, np.isnan(X), k=1)

?

@sergeyf
Collaborator Author

sergeyf commented Mar 15, 2017

Yup, good idea. It's in there.
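
For context on why this tiny input makes a useful regression case: the two rows agree on every column they share, so their distance is zero, and an inverse-distance weight of 1/0 becomes inf. The sketch below is not the library's code. It only reproduces the arithmetic with plain NumPy, and `weighted_average` is a hypothetical helper introduced for illustration. The `min_dist=1e-6` floor mirrors the default that appears in the signature under review.

```python
import numpy as np


def weighted_average(values, distances, min_dist=None):
    """Inverse-distance weighted mean (hypothetical helper, illustration only)."""
    values = np.asarray(values, dtype=float)
    distances = np.asarray(distances, dtype=float)
    if min_dist is not None:
        # Clamp tiny distances so that 1 / distance stays finite.
        distances = np.maximum(distances, min_dist)
    # Silence the divide-by-zero and inf/inf warnings; we want to inspect the result.
    with np.errstate(divide="ignore", invalid="ignore"):
        weights = 1.0 / distances
        return np.sum(weights * values) / np.sum(weights)


# The only neighbor sits at distance 0 and holds the value 1.
unclamped = weighted_average([1.0], [0.0])               # inf / inf
clamped = weighted_average([1.0], [0.0], min_dist=1e-6)  # finite weight
```

Without the floor the weight is infinite and the average degenerates to inf / inf; with the floor the single neighbor's value comes back unchanged.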

# points considering themselves as neighbors
for i in range(X.shape[0]):
    D[i, i] = np.inf
max_dist = 1e6 * np.maximum(1, D.max())
Owner

In [1]: np.array([1,2,np.inf]).max()
Out[1]: inf

Collaborator Author

Oh right, I should do D[np.isfinite(D)].max()

Collaborator Author

Err unless that makes D empty. That's impossible, right?

Owner

Well, I guess you could play it very safe and have:

finite_distances = D[np.isfinite(D)]
if len(finite_distances) > 0:
    max_dist = 1e6 * max(1, finite_distances.max())
else:
    max_dist = 1e6
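
The guarded computation above can be wrapped in a small function and checked on both branches. This is a minimal sketch, not the merged code: `finite_max_dist` is a hypothetical name, and passing the scale factor as `max_dist_multiplier` is an assumption borrowed from the parameter name in the signature under review.

```python
import numpy as np


def finite_max_dist(D, max_dist_multiplier=1e6):
    """Return a large finite stand-in for infinity, scaled from D's largest finite entry."""
    finite_distances = D[np.isfinite(D)]
    if finite_distances.size > 0:
        return max_dist_multiplier * max(1.0, float(finite_distances.max()))
    # No finite entries at all: fall back to the bare multiplier.
    return max_dist_multiplier


# Diagonal already set to inf, one finite off-diagonal distance.
D = np.array([[np.inf, 2.0],
              [2.0, np.inf]])
max_dist = finite_max_dist(D)

# Degenerate case: every entry is infinite.
fallback = finite_max_dist(np.full((2, 2), np.inf))
```

Filtering with `np.isfinite` before taking the maximum is what keeps the `inf` on the diagonal from propagating into `max_dist`.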

# points considering themselves as neighbors
np.fill_diagonal(D, np.inf)
max_dist = 1e6 * np.maximum(1, D.max())
Owner

If infinity is in the distance matrix, won't this make max_dist infinity?

In [1]: np.array([1,2,np.inf]).max()
Out[1]: inf

missing_mask,
verbose=False,
min_dist=1e-6,
max_dist_multiplier=1e6):
Owner

What does max_dist_multiplier do?

Collaborator Author

Oh right, it's supposed to be in the code instead of 1e6.

@iskandr
Owner

iskandr commented Mar 15, 2017

FYI, this is going to break the few_observed_entries implementation, which depends on being able to tell which distances are infinite due to no overlap between samples.

    effective_infinity = 10 ** 6 * D[finite_distance_distance_mask].max()
    D[~finite_distance_distance_mask] = effective_infinity
    D_sorted = np.argsort(D, axis=1)
    inv_D = 1.0 / D
    D_valid_mask = D < effective_infinity

@iskandr
Owner

iskandr commented Mar 15, 2017

One way to fix this for few_observed_entries would be to have knn_initialize return max_dist. I can do that as a second PR if you'd like.
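
A rough sketch of that proposal, under stated assumptions: `initialize_distances` is a hypothetical stand-in for `knn_initialize` (whose real signature is not shown in this thread), and it takes a precomputed distance matrix rather than the data. The point is only that returning the sentinel lets the caller rebuild the "no overlap" mask after the infinities have been replaced.

```python
import numpy as np


def initialize_distances(D, max_dist_multiplier=1e6):
    """Replace infinite distances with a finite sentinel and return both (hypothetical)."""
    D = np.array(D, dtype=float)  # work on a copy
    finite_mask = np.isfinite(D)
    if finite_mask.any():
        max_dist = max_dist_multiplier * max(1.0, float(D[finite_mask].max()))
    else:
        max_dist = max_dist_multiplier
    D[~finite_mask] = max_dist
    return D, max_dist


# inf marks the diagonal and pairs of samples with no overlapping observed columns.
raw = np.array([[np.inf, 0.5, np.inf],
                [0.5, np.inf, 3.0],
                [np.inf, 3.0, np.inf]])

D, max_dist = initialize_distances(raw)
D_valid_mask = D < max_dist                        # True only for real distances
valid_distances_per_row = D_valid_mask.sum(axis=1)
```

After the replacement `np.isfinite(D)` is true everywhere, so the returned `max_dist` is the only remaining way to tell real distances from placeholders.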

@sergeyf
Collaborator Author

sergeyf commented Mar 15, 2017

@iskandr I think I fixed that a few commits back?

    effective_infinity = D[0, 0] # since diagonal was replaced by max_dist
    D_valid_mask = D < effective_infinity
    valid_distances_per_row = D_valid_mask.sum(axis=1)

@iskandr
Owner

iskandr commented Mar 15, 2017 via email

@iskandr
Owner

iskandr commented Mar 15, 2017

Test failure: NameError: global name 'knnimpute' is not defined

@sergeyf
Collaborator Author

sergeyf commented Mar 15, 2017

That's weird. I fixed that:

+def test_knn_minimal():
+    X = np.array([[1, 1, np.nan], [1, 1, 1]])
+    res = knn_impute_few_observed(X, np.isnan(X), k=1)
+    assert not np.isnan(res).any(), \
+        "Basic example did not get imputed: %s" % res

@sergeyf
Collaborator Author

sergeyf commented Mar 15, 2017

Did you get the latest version?

@iskandr
Owner

iskandr commented Mar 15, 2017

I took a peek at Travis but I guess it's behind on builds. Wonder why it's so slow today.

@coveralls

coveralls commented Mar 15, 2017

Coverage Status

Coverage decreased (-0.4%) to 95.575% when pulling 324309e on zero_and_inf_fix into d045b5c on master.

@iskandr
Owner

iskandr commented Mar 15, 2017

Not sure which library this bubbles up from, but we get a warning on Python 2.7:

Problem importing module variables.py: No module named backports.functools_lru_cache

Possibly fixed by depending on https://pypi.python.org/pypi/backports.functools_lru_cache/1.3? I can look into it later.
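
For reference, the usual way a library supports both interpreters is a guarded import. This is only a sketch of that general pattern; which dependency actually emits the warning is not identified in this thread, and `square` is a throwaway example function.

```python
try:
    # Python 3.2+ ships lru_cache in the standard library.
    from functools import lru_cache
except ImportError:
    # Python 2.7: needs the backports.functools_lru_cache package installed.
    from backports.functools_lru_cache import lru_cache


@lru_cache(maxsize=None)
def square(n):
    """Trivial cached function, just to show the import works."""
    return n * n


result = square(12)
```

If the fallback import fails on Python 2.7, the backport is missing from the environment, which matches the shape of the warning quoted above.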

@iskandr iskandr merged commit 8ec2611 into master Mar 15, 2017
@iskandr iskandr deleted the zero_and_inf_fix branch March 15, 2017 22:31

4 participants