Filter speed performance regression compared to 0.4 #601

Closed
jpdna opened this issue Apr 30, 2019 · 7 comments

Comments

@jpdna
commented Apr 30, 2019

System information

CentOS 7
Modin commit 5b77d24
Python 3.6.8

load (this is fine)

import modin.pandas as mpd
geno_1e9_modin = mpd.read_table("geno_10e9_rows.header.out", sep=" ")

filter command in question in this issue

%%time
geno_1e9_modin_hzalt = geno_1e9_modin[geno_1e9_modin.g == 2]

note, behavior is same when %%time is not used
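
For anyone without the full input file, the load and mask steps can be tried on a tiny synthetic table. This is only a sketch: the column `g` comes from the issue, while the `pos` column, the inline data, and the use of plain pandas (so it runs without a Modin install) are assumptions. Swapping the import for `import modin.pandas as pd` should run the same commands through Modin, but timings on a table this small say nothing about the regression.

```python
import io
import pandas as pd  # assumption: plain pandas; the issue uses modin.pandas

# Hypothetical space-separated data; only the column name "g" is from the issue.
raw = "pos g\n1 0\n2 2\n3 1\n4 2\n"
geno = pd.read_table(io.StringIO(raw), sep=" ")

# Boolean-mask filter, same shape as the command reported in the issue.
geno_hzalt = geno[geno.g == 2]
print(geno_hzalt)
```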

At commit 5b77d24
this command takes

CPU times: user 6min 48s, sys: 26.7 s, total: 7min 15s
Wall time: 7min 1s

But back in Modin 0.4 this took half the time

CPU times: user 3min 34s, sys: 12.3 s, total: 3min 46s
Wall time: 3min 49s

The input file can be found here on google drive.
https://drive.google.com/file/d/1RKl6HN7mm4ehoOelKe4Uybd30WfDI3XK/view?usp=sharing

In addition to this performance regression, we should explore what the bottleneck in Modin is here, even in 0.4, because the Pandas filter time is only 4.04 seconds!

Watching top while Modin runs the filter command, the first 10 seconds show parallelism across many CPUs, but then a single process runs on one CPU for the remaining ~6 minutes, using an increasing amount of RAM (up to 40 GB for that process), as though a collect is happening within that process rather than the dataset staying partitioned across the worker processes. This occurs in Modin 0.4 and in newer versions.

@jpdna

Author

commented Apr 30, 2019

Devin, in your comment on my slide deck you suggested I try:

geno_1e9_modin_hzalt = geno_1e9_modin.filter("g == 2")

Doing that I get an error:

TypeError: only list-like objects are allowed to be passed to isin(), you passed a [str]

@devin-petersohn

Member

commented Apr 30, 2019

@jpdna That should be geno_1e9_modin_hzalt = geno_1e9_modin.query("g == 2"), I believe.
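
The mix-up is easy to make: in pandas, whose API Modin follows, `DataFrame.filter` selects rows or columns by label, while `DataFrame.query` evaluates a boolean expression on the values. A minimal sketch on a made-up frame (plain pandas is used here as an assumption so it runs anywhere; only the column name `g` is from the issue):

```python
import pandas as pd  # assumption: plain pandas stands in for modin.pandas

df = pd.DataFrame({"g": [0, 2, 1, 2], "pos": [1, 2, 3, 4]})  # made-up data

# filter() picks labels (columns by default); it never looks at the values.
only_g_column = df.filter(items=["g"])

# query() evaluates the expression per row and keeps the matching rows.
rows_where_g_is_2 = df.query("g == 2")

print(list(only_g_column.columns))
print(rows_where_g_is_2)
```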

@jpdna

Author

commented Apr 30, 2019

Rockin'!
on Modin 0.4 I just got

%%time
geno_1e9_modin_hzalt = geno_1e9_modin.query("g == 2")

UserWarning: User-defined function verification is still under development in Modin. The function provided is not verified.
CPU times: user 2.51 s, sys: 2.3 s, total: 4.81 s
Wall time: 10.9 s

So speed problem totally solved using .query()
But users need to be strongly warned not to try the earlier way with Modin!

@jpdna

Author

commented Apr 30, 2019

Is there a similar way to speed up groupby() ?

It similarly takes 6 minutes in Modin, but 16 seconds in Pandas

geno_counts = geno_1e9_modin.groupby('g').size()

@devin-petersohn

Member

commented Apr 30, 2019

We should just fix the regression instead of warning users not to use it.

The slowdown is coming from the separation of the data and the metadata. When a mask is performed, the index needs to communicate the updates to the data, which right now is done through a reindex. That reindex is pretty slow, and is essentially what needs to be sped up.
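
To make the mask-then-reindex description concrete, here is a rough pandas analogy. It is not Modin's internal code; it only illustrates, under that assumption, how a mask can be turned into the set of surviving index labels that the data is then reindexed to, and that the result matches direct boolean selection. The frame contents are invented.

```python
import pandas as pd

df = pd.DataFrame({"g": [0, 2, 1, 2], "pos": [10, 11, 12, 13]})  # made-up data

# Step 1: the mask decides which index labels survive (the "metadata" update).
mask = df["g"] == 2
kept_labels = df.index[mask]

# Step 2: the data is brought in line with the new index through a reindex.
via_reindex = df.reindex(kept_labels)

# Direct boolean selection reaches the same result in one step.
direct = df[mask]
print(via_reindex.equals(direct))
```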

@jpdna

Author

commented Apr 30, 2019

Just to make sure we are on the same page - there are possibly two different "regression" issues here: the difference between Modin 0.4 and latest, but more importantly the difference between using
the mask approach:

geno_1e9_modin[geno_1e9_modin.g == 2]

and the currently much faster query method (about 20 times faster by the Modin 0.4 wall times above, 3min 49s versus 10.9 s):

geno_1e9_modin_hzalt = geno_1e9_modin.query("g == 2")

although what changed between 0.4 and latest to slow something down might be interesting too.

Agree - just making Modin fast in all cases is the best plan; what makes Modin so appealing is that it can be used as a drop-in scalable Pandas.
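
The two spellings can be compared outside IPython with a small timing helper. This is a sketch under assumptions: `time_call` is a hypothetical helper, the data is synthetic, and plain pandas is used so it runs without Modin; replace the import with `modin.pandas` to measure Modin itself. No particular timing result is implied.

```python
import time
import pandas as pd  # assumption: swap for modin.pandas to measure Modin

def time_call(fn):
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

# Synthetic stand-in for the genotype table; only column "g" is from the issue.
df = pd.DataFrame({"g": [0, 1, 2] * 1000})

masked, mask_secs = time_call(lambda: df[df.g == 2])
queried, query_secs = time_call(lambda: df.query("g == 2"))

# Both approaches should select exactly the same rows.
assert masked.equals(queried)
print(f"mask: {mask_secs:.4f}s  query: {query_secs:.4f}s")
```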

@devin-petersohn added this to the 0.5.1 milestone May 7, 2019

@devin-petersohn

Member

commented May 22, 2019

Fixed by #613
