Use python read for text dataset #715
Ok :)
One thing though, could we try to read the files in parallel?
We could but I'm not sure this would help a lot since the bottleneck is the drive IO if the files are big enough.
Looks like Windows is not a big fan of this approach
I remember issue #546 where this was kinda requested (but maybe IO would bottleneck). What do you think?
I think it's worth testing multiprocessing. It could also be something we add to our speed benchmarks
It still would be interesting I think, especially in scenarios where IO is less of an issue (SSDs particularly) and where there are many smaller files. Wrapping this function in a
Merging this one for now for the patch release
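For anyone who wants to try the parallel-read idea discussed above, here is a minimal sketch. It is not code from this PR: `read_file` and `read_files_parallel` are hypothetical helper names, and it uses a thread pool from the standard library rather than the multiprocessing mentioned in the thread, since blocking file reads release the GIL. Whether it actually helps depends on the drive and on file sizes, and it has not been benchmarked here.

```python
from concurrent.futures import ThreadPoolExecutor


def read_file(path, encoding="utf-8"):
    # Hypothetical helper: read one whole text file.
    # newline="\n" turns off newline translation, so "\r" is kept as data.
    with open(path, "r", encoding=encoding, newline="\n") as f:
        return f.read()


def read_files_parallel(paths, max_workers=4):
    # Hypothetical helper: read several files concurrently.
    # pool.map returns results in the same order as `paths`.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_file, paths))
```

This is most likely to pay off in the many-small-files case raised above; for a few large files on a slow drive the IO bottleneck probably dominates.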
As mentioned in #622 the pandas reader used for text dataset doesn't work properly when there are \r characters in the text file.
Instead I switched to pure Python using `open` and `read`. From my benchmark on a 100MB text file, it's the same speed as the previous pandas reader.
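To illustrate the idea, here is a minimal sketch of reading a text file in chunks with plain `open` and `read` so that `\r` characters stay inside their line. It is an assumption about how such a reader could look, not the exact code in this PR: the function name `iter_line_batches` and the `chunk_size` default are hypothetical.

```python
def iter_line_batches(path, chunk_size=10 * 1024 * 1024, encoding="utf-8"):
    # Hypothetical helper: yield lists of lines, one list per chunk read.
    # newline="\n" means only "\n" ends a line and nothing is translated,
    # so a "\r" in the text is returned as an ordinary character.
    with open(path, "r", encoding=encoding, newline="\n") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Finish the line that the chunk boundary may have cut in half.
            chunk += f.readline()
            lines = chunk.split("\n")
            # A chunk ending in "\n" leaves one empty trailing item; drop it.
            if lines[-1] == "":
                lines.pop()
            yield lines
```

Note that with this approach a file using `\r\n` line endings keeps the trailing `\r` on each line, so a caller who wants it removed would need to strip it explicitly.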