Issue with 'arrange' when df has an index #47

Closed
omri374 opened this issue Jan 9, 2018 · 1 comment

omri374 commented Jan 9, 2018

Hi,
Please take a look at the following example:

from dfply import *
utime = pd.DataFrame({"u":1,"eventTime":["01-01-1971 01:04:00","01-01-1971 02:07:00","01-01-1971 01:09:00","01-01-1971 01:10:00"]})
print(utime >> arrange(X.eventTime))

utime = utime.set_index("u")
print(utime >> arrange(X.eventTime))

In the first case, the result is as expected. After setting an index, the result is incorrect: it contains four times as many rows as before (16 instead of 4).

I'm not sure whether this is a bug or expected behavior, as I'm new to pandas and to data frame indices.

Output for the code:

             eventTime  u
0  01-01-1971 01:04:00  1
2  01-01-1971 01:09:00  1
3  01-01-1971 01:10:00  1
1  01-01-1971 02:07:00  1

             eventTime
u
1  01-01-1971 01:04:00
1  01-01-1971 02:07:00
1  01-01-1971 01:09:00
1  01-01-1971 01:10:00
1  01-01-1971 01:04:00
1  01-01-1971 02:07:00
1  01-01-1971 01:09:00
1  01-01-1971 01:10:00
1  01-01-1971 01:04:00
1  01-01-1971 02:07:00
1  01-01-1971 01:09:00
1  01-01-1971 01:10:00
1  01-01-1971 01:04:00
1  01-01-1971 02:07:00
1  01-01-1971 01:09:00
1  01-01-1971 01:10:00

kieferk (Owner) commented Jan 10, 2018

Good catch! This is in fact a bug. It happened because I was sorting using the original dataframe's index, then re-indexing with the sorted index labels. When the index contains duplicate labels, each label matches every row that shares it, so the rows get duplicated.

It should be fixed now. I switched to positional indexing with .iloc instead.
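For readers who want to see the mechanism in plain pandas, here is a minimal sketch. It is not dfply's actual code: it assumes the old behavior was equivalent to a label-based lookup with `.loc`, and the variable names (`sorted_labels`, `by_label`, `order`, `by_position`) are made up for illustration.

```python
import numpy as np
import pandas as pd

utime = pd.DataFrame({
    "u": 1,
    "eventTime": ["01-01-1971 01:04:00", "01-01-1971 02:07:00",
                  "01-01-1971 01:09:00", "01-01-1971 01:10:00"],
}).set_index("u")

# Label-based approach (assumed to mirror the old behavior):
# sort the column, take the resulting index labels, look rows up by label.
sorted_labels = utime["eventTime"].sort_values().index  # labels are [1, 1, 1, 1]
by_label = utime.loc[sorted_labels]
# Every label 1 matches all four rows, so 4 labels x 4 matches = 16 rows.
print(len(by_label))  # 16

# Position-based approach: compute the sort order as integer positions
# and select with .iloc, which ignores index labels entirely.
order = np.argsort(utime["eventTime"].to_numpy(), kind="stable")
by_position = utime.iloc[order]
print(len(by_position))  # 4
print(by_position)
```

Because `.iloc` selects by integer position, duplicate index labels cannot multiply the rows, whatever the index looks like.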

I tried the same on my machine with the new master branch:

from dfply import *
utime = pd.DataFrame({"u":1,"eventTime":["01-01-1971 01:04:00","01-01-1971 02:07:00","01-01-1971 01:09:00","01-01-1971 01:10:00"]})

print(utime >> arrange(X.eventTime))
             eventTime  u
0  01-01-1971 01:04:00  1
2  01-01-1971 01:09:00  1
3  01-01-1971 01:10:00  1
1  01-01-1971 02:07:00  1

utime = utime.set_index("u")

print(utime >> arrange(X.eventTime))
             eventTime
u                     
1  01-01-1971 01:04:00
1  01-01-1971 01:09:00
1  01-01-1971 01:10:00
1  01-01-1971 02:07:00

This is the behavior you expected. If you pull the master branch and reinstall, it should work.

kieferk closed this as completed Jan 10, 2018