
BUG: Sometimes code works fine, sometimes the same code with the same data processes forever and eats up unlimited RAM, then crashes #5260

Open
stromal opened this issue Nov 24, 2022 · 4 comments
Labels
bug 🦗 Something isn't working External Pull requests and issues from people who do not regularly contribute to modin P2 Minor bugs or low-priority feature requests Performance 🚀 Performance related issues and pull requests.

Comments

stromal commented Nov 24, 2022

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

Column types:

  1. column: object (string)
  2. column: int
  3. column: float

df_email -> 2 columns, 250k rows

In the main df, replacing emails with IDs:

df.dtypes  # takes 2 s, no excessive RAM usage
df = df.replace(df_email['email'].values, df_email['userid'].values)  # 1 min 5 s

Then it goes haywire:

df.dtypes  # takes forever, eats up all 256 GB of RAM, then crashes
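The setup above can be sketched with small synthetic data (a hypothetical repro; the column names, sizes, and values are assumptions, and plain pandas stands in for the drop-in `import modin.pandas as pd`):

```python
import numpy as np
import pandas as pd  # swap for `import modin.pandas as pd` to reproduce with Modin

# Hypothetical stand-ins for the real data: df has a string email column plus
# an int and a float column; df_email maps email -> userid, mirroring the
# 250k-row lookup frame from the report (scaled down here).
rng = np.random.default_rng(0)
emails = np.array([f"user{i}@example.com" for i in range(200)])
df = pd.DataFrame({
    "email": rng.choice(emails, 5_000),
    "clicks": rng.integers(0, 100, 5_000),
    "score": rng.random(5_000),
})
df_email = pd.DataFrame({"email": emails, "userid": np.arange(len(emails))})

# The step that works, followed by the step reported to hang under Modin at scale.
df = df.replace(df_email["email"].values, df_email["userid"].values)
print(df.dtypes)
```

At this scale both steps finish instantly; the report is that the same pattern on ~206M rows with a 250k-value mapping stalls in Modin.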

Issue Description

But this still runs OK:

df.shape # fast

(206529864, 4)

INPUT

df.to_csv('base_data/data_for_ml_20221121_userids.csv', index=False)  

OUTPUT (it uses up all the RAM just to export a 6 GB file; the imported file was 4 GB, and after replacing the emails with 200k IDs in the 0-200,000 range the data should be even smaller than before. With 256 GB of RAM it should be able to export it, but it always crashes):

2022-11-23 17:49:11,232	WARNING worker.py:1404 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 135135135135135f12f41f2412412f412f24 Worker ID: 23623f523623523f23f4236236f23f6Node ID: 42f34f235235234f235f Worker IP address: 122.222.124.22 Worker port: 172357 Worker PID: 312312

Expected Behavior

This should finish in a few seconds, like before:

df.dtypes 

and this should also run in a few seconds:

df.head()

Error Logs


Installed Versions

0.12.1

Installed via #4719 (comment).

@stromal stromal added bug 🦗 Something isn't working Triage 🩹 Issues that need triage labels Nov 24, 2022
@stromal stromal changed the title from "BUG:" to "BUG: Sometimes code works fine sometimes same code with same data processes forever and eats up unlimited RAM than crashes" Nov 24, 2022
RehanSD (Collaborator) commented Nov 27, 2022

Hi @stromal! Thank you so much for opening this issue! I'm attempting to reproduce it locally, although it seems I may not be able to hit the exact same error, since I'm working off of a 2019 MBP with only 16 GB of RAM. I did find that the call (with a smaller df, since a larger one would crash my laptop) hung for a very long time without completing, while utilizing a large percentage of my CPU.

I believe the reason you are facing this error is that Ray finds it difficult to serialize the array of values to replace plus the array of replacement values when running df.replace. Modin's .replace runs in parallel across each partition, so Ray tries to serialize both arrays so that each is visible to every partition, which I believe overwhelms the system (hence the high RAM usage).

For context, it seems that the script fails on the df.dtypes call rather than the df.replace call since Modin is asynchronous - the .replace call returns before computation is finished, while the .dtypes call's computation is blocked on the computation of the previous calls (including the .replace's).

I recommend trying the .replace call with a smaller subset of the values (perhaps iteratively to accomplish the same task) and checking if the error persists, which would confirm whether or not it is a memory/serialization issue.
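The iterative subset approach suggested above might look like the following (a hypothetical workaround sketch, not a fix; the helper name and `chunk_size` are assumptions, and plain pandas stands in for the drop-in `modin.pandas`):

```python
import numpy as np
import pandas as pd  # with Modin: `import modin.pandas as pd`

def replace_in_chunks(df, to_replace, values, chunk_size=50_000):
    """Run df.replace over slices of the mapping so each call ships a
    smaller pair of arrays to the workers (hypothetical workaround sketch)."""
    for start in range(0, len(to_replace), chunk_size):
        stop = start + chunk_size
        df = df.replace(to_replace[start:stop], values[start:stop])
    return df
```

If a small chunk_size completes where the one-shot call hangs, that would point at the serialization/memory hypothesis above.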

Alternatively, the query could potentially be rewritten - perhaps using a join if the email values are in a specific column in df, or an apply if you could create a function that returns userid when given email as an input. I think this would also make the query faster!
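The join rewrite could be sketched like this (a hedged sketch under the assumption that the emails live in a single column named "email"; shown with plain pandas, and the same lines should work unchanged with `modin.pandas`):

```python
import pandas as pd  # with Modin: `import modin.pandas as pd`

def emails_to_userids(df, df_email):
    """Swap the email column for userid via a left join, instead of
    broadcasting every replacement value to every partition."""
    out = df.merge(df_email, on="email", how="left")
    return out.drop(columns=["email"])
```

A join ships only the 250k-row lookup frame once, rather than two 250k-element arrays to every partition's replace call.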

@RehanSD RehanSD removed the Triage 🩹 Issues that need triage label Nov 27, 2022
stromal (Author) commented Nov 28, 2022

@RehanSD I have run into this a lot previously, not just via .replace: after modifying a data frame, suddenly even a df.head() that previously took less than 1 s loads forever.

@pyrito pyrito added the P2 Minor bugs or low-priority feature requests label Nov 28, 2022
RehanSD (Collaborator) commented Nov 29, 2022

Hi @stromal! I see - would you be able to share some other examples where the code crashes, so we can try to determine what they have in common? That way we can see whether this is one bug affecting multiple workloads or multiple bugs, and then determine a workaround accordingly!

@mvashishtha mvashishtha added the Performance 🚀 Performance related issues and pull requests. label Nov 29, 2022
mvashishtha (Collaborator) commented

@stromal some responses:

First, do you know how much object store memory is available in your ray cluster? How are you initializing the cluster? What do you get if you run the shell command ray status or the python line print(ray.cluster_resources()) while ray is running?

It's hard to tell what's going on without a reproducible example or looking at the ray worker logs, but it's likely that your ray workers are running out of memory. As @RehanSD points out, broadcasting all of the to_replace and value values to each worker will consume a lot of memory when df_email is very large. Ideally you would do the replacement some other way, but I can't think of another way right now.

Does the replace work in pandas? e.g. what if you run

import modin.pandas as mpd

# pull the data out of Modin, run the replace in plain pandas,
# then wrap the result back into a Modin DataFrame
dfp = df._to_pandas()
df = mpd.DataFrame(dfp.replace(df_email['email'].values, df_email['userid'].values))

I have run into this a lot previously, not just via .replace but by modifying a data frame

You could try posting here the exact operations you're trying.

@anmyachev anmyachev added the External Pull requests and issues from people who do not regularly contribute to modin label Apr 19, 2023
5 participants