DOCS: How large do my files have to be to get a speed boost from modin? #5062

Open
florianjehn opened this issue Sep 29, 2022 · 6 comments
Labels: documentation 📜 (Updates and issues with the documentation); External (Pull requests and issues from people who do not regularly contribute to modin); P1 (Important tasks that we should complete soon)

Comments

@florianjehn

I came across modin the other day and really liked the idea. I tried it on my current project (https://github.com/allfed/Seaweed-Growth-Model) and it made my code way slower (~5x). I am only using a test data set of around 16 MB, but the real dataset will be around 45 GB. Would this be enough to see a speed boost? How big should the files be before I see a speed boost?

florianjehn added the question ❓ (Questions about Modin) and Triage 🩹 (Issues that need triage) labels on Sep 29, 2022
@RehanSD
Collaborator

RehanSD commented Sep 29, 2022

Hi @florianjehn! Thank you so much for opening this issue!

Due to the distributed nature of Modin, we are currently slower than pandas on very small datasets, although we are working on optimizations for small and empty DataFrames that should allow us to meet or beat pandas speed on these datasets as well!

Modin should definitely speed up your workload on your 45 GB dataset - datasets that are a GB and up definitely benefit from Modin! If you're interested in a specific breakdown, I can also run some experiments to determine where specifically the boundary lies!
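
An experiment of the kind offered above could be sketched roughly as follows. This is only a minimal illustration, not a Modin-provided benchmark: the helper names (`load_backend`, `best_time`, `groupby_sum`, `run_benchmark`), the group-by workload, and the row counts are all assumptions chosen for the example. It relies on `modin.pandas` being importable as a drop-in for `pandas`, and it skips any backend that is not installed.

```python
import importlib
import time


def load_backend(module_name):
    """Import a dataframe module by name; return None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def best_time(fn, repeats=3):
    """Return the fastest wall-clock time (in seconds) over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def groupby_sum(pd, n_rows):
    """Hypothetical workload: build a frame, then group by key and sum.

    Note that the timing therefore includes DataFrame construction.
    """
    df = pd.DataFrame({
        "key": [i % 100 for i in range(n_rows)],
        "value": list(range(n_rows)),
    })
    # to_numpy() forces the result to materialize, so any asynchronous
    # execution cannot make the measured time look shorter than it is.
    return df.groupby("key")["value"].sum().to_numpy()


def run_benchmark(backends, row_counts, workload=groupby_sum):
    """Time `workload` for every installed backend at every row count."""
    results = {}
    for name, module in backends.items():
        if module is None:  # backend not installed, skip it
            continue
        for n_rows in row_counts:
            results[(name, n_rows)] = best_time(lambda: workload(module, n_rows))
    return results


if __name__ == "__main__":
    backends = {
        "pandas": load_backend("pandas"),
        "modin": load_backend("modin.pandas"),
    }
    # Row counts are illustrative; raise them toward your real data size.
    timings = run_benchmark(backends, [10_000, 1_000_000])
    for (name, n_rows), seconds in timings.items():
        print(f"{name:>6} {n_rows:>10,} rows: {seconds:.4f} s")
```

Swapping in a different `workload` function (for example, one that mirrors the slowest step of your own pipeline) is likely to be more informative than the group-by used here, since the crossover point can differ from one operation to the next.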

@florianjehn
Author

Thank you for the quick answer :)
If it's not too much work, I'd be really curious to know where the boundary is. Might also be interesting for other people, so maybe it's something that could be added to the README?

florianjehn changed the title from "How large does my files have to be to get a speed boost from modin?" to "How large do my files have to be to get a speed boost from modin?" on Sep 30, 2022
vnlitvinov added the External (Pull requests and issues from people who do not regularly contribute to modin) label on Sep 30, 2022
mvashishtha changed the title from "How large do my files have to be to get a speed boost from modin?" to "DOCS: How large do my files have to be to get a speed boost from modin?" on Sep 30, 2022
mvashishtha added the documentation 📜 (Updates and issues with the documentation) and P1 (Important tasks that we should complete soon) labels and removed the Triage 🩹 (Issues that need triage) label on Sep 30, 2022
@mvashishtha
Collaborator

@florianjehn

> If it's not too much work, I'd be really curious to know where the boundary is.

There isn't a simple boundary that holds across all pandas functions, execution environments, and types of data. If you aren't satisfied with Modin's performance on your larger dataset, please do follow up in another GitHub issue.

> Might also be interesting for other people, so maybe it's something that could be added to the README?

We should definitely document this fact somewhere prominent, and I think the README is a good spot. We can also add that generally you can't count on Modin to speed up everything you do in pandas.

mvashishtha removed the question ❓ (Questions about Modin) label on Sep 30, 2022
@Garra1980
Collaborator

> We should definitely document this fact somewhere prominent, and I think the README is a good spot. We can also add that generally you can't count on Modin to speed up everything you do in pandas.

Totally agree with the above. We probably also need to rephrase this one - https://modin.readthedocs.io/en/stable/#modin-is-a-dataframe-for-datasets-from-1mb-to-1tb

@florianjehn
Author

Rephrasing this is a good idea. At least for me the vibe that I got from the readthedocs was that Modin would make basically all my pandas code faster and so I was kinda surprised when it didn't 😅

@mvashishtha
Collaborator

@florianjehn I've updated this issue so that it now tracks the documentation changes you mention. It's on the issue backlog with a high priority of P1, and we hope to get to it soon. Meanwhile, please do continue to engage with the Modin community by filing issues here, posting on the discuss forum, or posting on Slack.

5 participants