df.take is much slower against pandas #6876

YarShev · 2024-01-23T14:52:15Z

Timings below were measured on a machine with 192 CPUs.

# import pandas as pd  # use this import instead of the Modin one to time pandas
import modin.pandas as pd
import numpy as np
import time

# 100M rows x 4 integer columns
df = pd.DataFrame(data=np.random.randint(99999, 99999999, size=(100000000, 4)),
                  columns=['C1', 'C2', 'C3', 'C4'])

# 80M random row positions (repeats allowed)
to_take = np.random.randint(0, 100000000, size=80000000)
t0 = time.time()
df.take(to_take, axis=0)
t1 = time.time()
print('time for take: ', t1 - t0)
# time for take:  44.71011233329773 in Modin
# time for take:  3.082368850708008 in pandas

YarShev · 2024-01-23T14:52:27Z

cc @dchigarev

YarShev · 2024-01-24T09:02:17Z

# Same benchmark for a Series.
# import pandas as pd  # use this import instead of the Modin one to time pandas
import modin.pandas as pd
import numpy as np
import time

# 100M rows in a single column, squeezed into a Series
df = pd.DataFrame(data=np.random.randint(99999, 99999999, size=(100000000, 1)),
                  columns=['C1']).squeeze(axis=1)

# 80M random row positions (repeats allowed)
to_take = np.random.randint(0, 100000000, size=80000000)
t0 = time.time()
df.take(to_take, axis=0)
t1 = time.time()
print('time for take: ', t1 - t0)
# time for take:  37.6995530128479 in Modin
# time for take:  1.492713212966919 in pandas
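
For anyone who wants to reproduce the comparison on a smaller machine, here is a minimal sketch that wraps both snippets above in one helper. The function name `time_take`, its parameters, and the reduced default sizes (1M rows, 800K positions) are my own assumptions, not part of the original report, so the absolute numbers will not match the ones quoted above. Pass `modin.pandas` instead of `pandas` as the first argument to time Modin.

```python
import time

import numpy as np
import pandas  # to compare engines, also `import modin.pandas` and pass it below


def time_take(pd_module, n_rows=1_000_000, n_take=800_000, as_series=False, seed=0):
    """Time a positional ``take`` along axis 0 and return (seconds, result).

    ``pd_module`` is either ``pandas`` or ``modin.pandas``. The sizes are
    scaled-down assumptions; the original report used 100M rows and 80M positions.
    """
    rng = np.random.default_rng(seed)
    n_cols = 1 if as_series else 4
    columns = [f"C{i + 1}" for i in range(n_cols)]
    obj = pd_module.DataFrame(
        rng.integers(99999, 99999999, size=(n_rows, n_cols)), columns=columns
    )
    if as_series:
        obj = obj.squeeze(axis=1)  # single column -> Series

    # Random row positions, repeats allowed, as in the original snippets.
    positions = rng.integers(0, n_rows, size=n_take)

    t0 = time.perf_counter()
    result = obj.take(positions, axis=0)
    elapsed = time.perf_counter() - t0
    return elapsed, result


if __name__ == "__main__":
    for as_series in (False, True):
        elapsed, result = time_take(pandas, as_series=as_series)
        kind = "Series" if as_series else "DataFrame"
        print(f"{kind}.take: {elapsed:.3f} s for {len(result)} rows")
```

The helper uses `time.perf_counter()` rather than `time.time()` because it is a monotonic clock intended for measuring short intervals; everything else mirrors the snippets above.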


YarShev added the Performance 🚀 label Jan 23, 2024

dchigarev added a commit to dchigarev/modin that referenced this issue Jan 24, 2024

PERF-modin-project#6876: Skip the masking stage on 'iloc' where beneficial

d85920a

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev added a commit to dchigarev/modin that referenced this issue Jan 24, 2024

PERF-modin-project#6876: Skip the masking stage on 'iloc' where beneficial

8061516

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev mentioned this issue Jan 24, 2024

PERF-#6876: Skip the masking stage on 'iloc' where beneficial #6878

Merged

7 tasks

anmyachev closed this as completed in #6878 Jan 24, 2024

anmyachev added a commit that referenced this issue Jan 24, 2024

PERF-#6876: Skip the masking stage on 'iloc' where beneficial (#6878)

72de8c0

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com> Co-authored-by: Anatoly Myachev <anatoliimyachev@mail.com>

YarShev mentioned this issue Feb 7, 2024

BUG: HDK runs out of stack and Java heap #6924

Closed

3 tasks
