
BUG: Sometimes code works fine, sometimes the same code with the same data processes forever and eats up unlimited RAM, then crashes #5260

Open
stromal opened this issue Nov 24, 2022 · 4 comments
Labels
bug 🦗 Something isn't working External Pull requests and issues from people who do not regularly contribute to modin P2 Minor bugs or low-priority feature requests Performance 🚀 Performance related issues and pull requests.

Comments

stromal commented Nov 24, 2022

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

Column types:

  1. column: object (string)
  2. column: int
  3. column: float

df_email -> 2 columns, 250k rows

In the main df, replacing emails with IDs:

df.dtypes  # takes 2 s, no excessive RAM usage
df = df.replace(df_email['email'].values, df_email['userid'].values)  # 1 min 5 s

Then it goes haywire:

df.dtypes  # takes forever, eats up all 256 GB of RAM, then crashes
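The setup above can be sketched with small synthetic data (a hypothetical repro; the column names, sizes, and values are assumptions, and plain pandas stands in for the drop-in `import modin.pandas as pd`):

```python
import numpy as np
import pandas as pd  # swap for `import modin.pandas as pd` to reproduce with Modin

# Hypothetical stand-ins for the real data: df has a string email column plus
# an int and a float column; df_email maps email -> userid, mirroring the
# 250k-row lookup frame from the report (scaled down here).
rng = np.random.default_rng(0)
emails = np.array([f"user{i}@example.com" for i in range(200)])
df = pd.DataFrame({
    "email": rng.choice(emails, 5_000),
    "clicks": rng.integers(0, 100, 5_000),
    "score": rng.random(5_000),
})
df_email = pd.DataFrame({"email": emails, "userid": np.arange(len(emails))})

# The step that works, followed by the step reported to hang under Modin at scale.
df = df.replace(df_email["email"].values, df_email["userid"].values)
print(df.dtypes)
```

At this scale both steps finish instantly; the report is that the same pattern on ~206M rows with a 250k-value mapping stalls in Modin.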

Issue Description

But this still runs OK:

df.shape # fast

(206529864, 4)

INPUT

df.to_csv('base_data/data_for_ml_20221121_userids.csv', index=False)  

OUTPUT (it uses up all the RAM just to export a 6 GB file; the imported file was 4 GB, and after replacing the emails with 200k IDs in the 0-200,000 range the data should be even smaller than before. With 256 GB of RAM it should be able to export it, but it always crashes):

2022-11-23 17:49:11,232	WARNING worker.py:1404 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 135135135135135f12f41f2412412f412f24 Worker ID: 23623f523623523f23f4236236f23f6Node ID: 42f34f235235234f235f Worker IP address: 122.222.124.22 Worker port: 172357 Worker PID: 312312

Expected Behavior

This should finish in a few seconds, like before:

df.dtypes 

and this should also run in a few seconds:

df.head()

Error Logs


Installed Versions

0.12.1

Installed via #4719 (comment).

@stromal stromal added bug 🦗 Something isn't working Triage 🩹 Issues that need triage labels Nov 24, 2022
@stromal stromal changed the title from "BUG:" to "BUG: Sometimes code works fine sometimes same code with same data processes forever and eats up unlimited RAM than crashes" Nov 24, 2022
RehanSD (Collaborator) commented Nov 27, 2022

Hi @stromal! Thank you so much for opening this issue! I'm attempting to reproduce it locally, although it seems I may not be able to hit the exact same error, since I'm working off of a 2019 MBP with only 16 GB of RAM. I did find that the call (with a smaller df, since a larger one would crash my laptop) hung for a very long time without completing, while utilizing a large percentage of my CPU.

I believe the reason you are facing this error is that Ray finds it difficult to serialize the array of values to replace plus the array of replacement values when running df.replace. Modin's .replace runs in parallel across each partition, so Ray tries to serialize both arrays so that each is visible to every partition, which I believe overwhelms the system (hence the high RAM usage).

For context, it seems that the script fails on the df.dtypes call rather than the df.replace call since Modin is asynchronous - the .replace call returns before computation is finished, while the .dtypes call's computation is blocked on the computation of the previous calls (including the .replace's).

I recommend trying the .replace call with a smaller subset of the values (perhaps iteratively to accomplish the same task) and checking if the error persists, which would confirm whether or not it is a memory/serialization issue.
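The iterative subset approach suggested above might look like the following (a hypothetical workaround sketch, not a fix; the helper name and `chunk_size` are assumptions, and plain pandas stands in for the drop-in `modin.pandas`):

```python
import numpy as np
import pandas as pd  # with Modin: `import modin.pandas as pd`

def replace_in_chunks(df, to_replace, values, chunk_size=50_000):
    """Run df.replace over slices of the mapping so each call ships a
    smaller pair of arrays to the workers (hypothetical workaround sketch)."""
    for start in range(0, len(to_replace), chunk_size):
        stop = start + chunk_size
        df = df.replace(to_replace[start:stop], values[start:stop])
    return df
```

If a small chunk_size completes where the one-shot call hangs, that would point at the serialization/memory hypothesis above.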

Alternatively, the query could potentially be rewritten - perhaps using a join if the email values are in a specific column in df, or an apply if you could create a function that returns userid when given email as an input. I think this would also make the query faster!
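The join rewrite could be sketched like this (a hedged sketch under the assumption that the emails live in a single column named "email"; shown with plain pandas, and the same lines should work unchanged with `modin.pandas`):

```python
import pandas as pd  # with Modin: `import modin.pandas as pd`

def emails_to_userids(df, df_email):
    """Swap the email column for userid via a left join, instead of
    broadcasting every replacement value to every partition."""
    out = df.merge(df_email, on="email", how="left")
    return out.drop(columns=["email"])
```

A join ships only the 250k-row lookup frame once, rather than two 250k-element arrays to every partition's replace call.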

@RehanSD RehanSD removed the Triage 🩹 Issues that need triage label Nov 27, 2022
stromal (Author) commented Nov 28, 2022

@RehanSD I have run into this a lot previously, not just via .replace: after modifying a data frame, suddenly even a df.head() that previously took less than 1 s loads forever.

@pyrito pyrito added the P2 Minor bugs or low-priority feature requests label Nov 28, 2022
RehanSD (Collaborator) commented Nov 29, 2022

Hi @stromal! I see - would you be able to share some other examples where the code crashes, so we can try to determine what they have in common? That way we can see whether this is one bug affecting multiple workloads or multiple bugs, and then determine a workaround accordingly!

@mvashishtha mvashishtha added the Performance 🚀 Performance related issues and pull requests. label Nov 29, 2022
mvashishtha (Collaborator) commented

@stromal some responses:

First, do you know how much object store memory is available in your ray cluster? How are you initializing the cluster? What do you get if you run the shell command ray status or the python line print(ray.cluster_resources()) while ray is running?

It's hard to tell what's going on without a reproducible example or looking at the ray worker logs, but it's likely that your ray workers are running out of memory. As @RehanSD points out, broadcasting all of the to_replace and value values to each worker will consume a lot of memory when df_email is very large. Ideally you would do the replacement some other way, but I can't think of another way right now.

Does the replace work in pandas? e.g. what if you run

import modin.pandas as mpd

# pull the data out of Modin, run the replace in plain pandas,
# then wrap the result back into a Modin DataFrame
dfp = df._to_pandas()
df = mpd.DataFrame(dfp.replace(df_email['email'].values, df_email['userid'].values))

I have run into this a lot previously, not just via .replace but by modifying a data frame

You could try posting here the exact operations you're trying.

@anmyachev anmyachev added the External Pull requests and issues from people who do not regularly contribute to modin label Apr 19, 2023
5 participants