DOCS: How large do my files have to be to get a speed boost from modin? #5062
Comments
Hi @florianjehn! Thank you so much for opening this issue! Due to the distributed nature of Modin, we are currently slower than pandas on very small datasets, although we are working on optimizations for small and empty DataFrames that should allow us to meet or beat pandas' speed on these datasets as well. Modin should definitely speed up your workload on your 45 GB dataset; datasets of 1 GB and up definitely benefit from Modin! If you're interested in a specific breakdown, I can also run some experiments to determine where exactly the boundary lies.
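For anyone who wants to look for that boundary on their own machine, here is a minimal sketch of such an experiment. It is not taken from the Modin docs: the helper names (`time_call`, `groupby_workload`, `compare`, `load_backends`) are hypothetical, the groupby workload is only a stand-in for your real code, and it assumes pandas and/or Modin (with an engine such as Ray or Dask) are installed. Backends that cannot be imported are skipped.

```python
import importlib
import time


def time_call(func, repeats=3):
    """Return the best wall-clock time (seconds) over `repeats` calls of func()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def groupby_workload(pd, n_rows):
    """Hypothetical stand-in workload: one groupby-mean over n_rows rows.

    `pd` is either the pandas module or modin.pandas. The final float(...sum())
    reduces to a scalar so that a lazily evaluated result has to be computed
    before the timer stops.
    """
    df = pd.DataFrame({
        "key": [i % 100 for i in range(n_rows)],
        "value": [float(i) for i in range(n_rows)],
    })
    return lambda: float(df.groupby("key")["value"].mean().sum())


def compare(backends, sizes, make_workload, repeats=3):
    """Time make_workload(backend, size) for every pair.

    Returns a dict of the form {size: {backend_name: seconds}}.
    """
    results = {}
    for n_rows in sizes:
        results[n_rows] = {
            name: time_call(make_workload(backend, n_rows), repeats)
            for name, backend in backends.items()
        }
    return results


def load_backends(names=("pandas", "modin.pandas")):
    """Import whichever backends are installed; silently skip the rest."""
    backends = {}
    for name in names:
        try:
            backends[name] = importlib.import_module(name)
        except ImportError:
            pass
    return backends


if __name__ == "__main__":
    # The sizes are arbitrary; increase them until the two timings cross over.
    timings = compare(load_backends(), [10_000, 1_000_000], groupby_workload)
    for n_rows, per_backend in timings.items():
        print(n_rows, {name: round(sec, 4) for name, sec in per_backend.items()})
```

Swap `groupby_workload` for a function that runs the operations you actually care about. A result for one operation says little about others, so the crossover point it shows only applies to that workload on that machine.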
Thank you for the quick answer :)
There isn't a simple boundary that holds across all pandas functions, execution environments, and types of data. If you aren't satisfied with Modin's performance on your larger dataset, please do follow up in another GitHub issue.
We should definitely document this fact somewhere prominent, and I think the README is a good spot. We can also add that generally you can't count on Modin to speed up everything you do in pandas.
Totally agree with the above. We probably also need to rephrase this one: https://modin.readthedocs.io/en/stable/#modin-is-a-dataframe-for-datasets-from-1mb-to-1tb
Rephrasing this is a good idea. At least for me the vibe that I got from the readthedocs was that Modin would make basically all my pandas code faster and so I was kinda surprised when it didn't 😅
@florianjehn I've updated this issue so that it's now tracking the documentation changes you mention. It's on the issue backlog with a high priority of P1, and we hope to get to it soon. Meanwhile, please do continue to engage with the Modin community by filing issues here, posting on the discussion forum, or posting on Slack.
I came across modin the other day and really liked the idea. I tried it on my current project (https://github.com/allfed/Seaweed-Growth-Model) and it made my code way slower (~5x). I am only using a test data set of around 16 MB, but the real dataset will be around 45 GB. Would this be enough to see a speed boost? How big should the files be before I see a speed boost?