New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pd.merge_asof() matches out of tolerance when timestamps are duplicated #13709

Closed
chrisaycock opened this Issue Jul 19, 2016 · 3 comments

Comments

Projects
None yet
2 participants
@chrisaycock
Contributor

chrisaycock commented Jul 19, 2016

This is a continuation of #13695.

Starting with the original DataFrames from that issue:

df1 = pd.DataFrame({'time':pd.to_datetime(['2016-07-15 13:30:00.030']), 'username':['bob']})
df2 = pd.DataFrame({'time':pd.to_datetime(['2016-07-15 13:30:00.000', '2016-07-15 13:30:00.030']), 'version':[1, 2]})

I now get the null:

In [82]: pd.merge_asof(df1, df2, on='time', allow_exact_matches=False, tolerance=pd.Timedelta('10ms'))
Out[82]:
                     time username  version
0 2016-07-15 13:30:00.030      bob      NaN

However, if I change the first DataFrame to have duplicate timestamps:

df1 = pd.DataFrame({'time':pd.to_datetime(['2016-07-15 13:30:00.030', '2016-07-15 13:30:00.030']), 'username':['bob', 'charlie']})

then the bug reappears:

In [85]: pd.merge_asof(df1, df2, on='time', allow_exact_matches=False, tolerance=pd.Timedelta('10ms'))
Out[85]:
                     time username  version
0 2016-07-15 13:30:00.030      bob        1
1 2016-07-15 13:30:00.030  charlie        1

This is in pandas version 0.18.0+418.gc46dcfa.

@jreback jreback added this to the 0.19.0 milestone Jul 19, 2016

@chrisaycock

This comment has been minimized.

Show comment
Hide comment
@chrisaycock

chrisaycock Jul 21, 2016

Contributor

I can take a stab at this. I should dive into the asof internals anyway.

Contributor

chrisaycock commented Jul 21, 2016

I can take a stab at this. I should dive into the asof internals anyway.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jul 21, 2016

Contributor

great @chrisaycock yeah the fix is trivial, BUT it breaks the full on test, not entirely sure why, so its not exactly right

Contributor

jreback commented Jul 21, 2016

great @chrisaycock yeah the fix is trivial, BUT it breaks the full on test, not entirely sure why, so its not exactly right

@chrisaycock

This comment has been minimized.

Show comment
Hide comment
@chrisaycock

chrisaycock Jul 29, 2016

Contributor

Alright, I've issued a pull request:

#13836

I just rewrote the Cython logic to compare the factorized keys directly since that was the easiest way forward. Though we don't actually have to factorize the keys at all; we could just compare the timestamps directly, which would be even faster.

Contributor

chrisaycock commented Jul 29, 2016

Alright, I've issued a pull request:

#13836

I just rewrote the Cython logic to compare the factorized keys directly since that was the easiest way forward. Though we don't actually have to factorize the keys at all; we could just compare the timestamps directly, which would be even faster.

@jreback jreback closed this in #13836 Aug 1, 2016

jreback added a commit that referenced this issue Aug 1, 2016

BUG: Fix edge cases in merge_asof() by comparing factorized keys (#13709
) (#13836)

Also removes unnecessary check_duplicates.

Added asv benchmarks for merge_asof()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment