
Use python read for text dataset #715

Merged — 8 commits into master on Oct 5, 2020

Conversation

lhoestq (Member) commented Oct 5, 2020

As mentioned in #622, the pandas reader used for the text dataset doesn't work properly when there are \r characters in the text file.

Instead I switched to pure Python using open and read.
In my benchmark on a 100MB text file, it's the same speed as the previous pandas reader.
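A minimal sketch of the idea (illustrative only, not the actual `datasets` loader): opening the file with `newline=""` disables Python's universal-newline translation, so a stray `\r` stays inside its line instead of being treated as a record separator, which is the behavior that tripped up the pandas-based reader.

```python
# Illustrative sketch only -- not the actual `datasets` text loader.
# Opening with newline="" disables universal-newline translation, so a
# bare "\r" inside a line is preserved instead of splitting the record
# (the behavior that broke the pandas-based reader, see #622).
def read_text_lines(path, encoding="utf-8"):
    with open(path, encoding=encoding, newline="") as f:
        lines = f.read().split("\n")
    # Drop the empty trailing element produced by a final newline.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

For example, a file containing `a\rb\nc\n` yields the two lines `["a\rb", "c"]`, whereas a \r-sensitive reader would split it into three records.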

thomwolf (Member) left a comment

Ok :)

datasets/text/text.py — outdated review thread, resolved
thomwolf (Member) commented Oct 5, 2020

One thing though, could we try to read the files in parallel?

lhoestq (Member, Author) commented Oct 5, 2020

We could, but I'm not sure it would help much: once the files are big enough, the bottleneck is drive IO.
It could make sense for very small files, though.

lhoestq (Member, Author) commented Oct 5, 2020

Looks like Windows is not a big fan of this approach. I'm working on a fix.

thomwolf (Member) commented Oct 5, 2020

I remember issue #546 where this was kinda requested (but maybe IO would bottleneck). What do you think?

lhoestq (Member, Author) commented Oct 5, 2020

I think it's worth testing multiprocessing. It could also be something we add to our speed benchmarks.

BramVanroy (Contributor) commented

> I remember issue #546 where this was kinda requested (but maybe IO would bottleneck). What do you think?

It would still be interesting, I think, especially in scenarios where IO is less of an issue (SSDs in particular) and where there are many smaller files. Wrapping this function in a pool.map is perhaps an easy thing to try.
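A quick way to try that suggestion could look like the following (a hypothetical sketch, not code from this PR). A thread pool is used here because file reads are IO-bound and threads avoid pickling overhead; the process-based multiprocessing.Pool exposes the same .map interface if true parallelism is wanted.

```python
from multiprocessing.pool import ThreadPool

# Hypothetical sketch of the pool.map idea -- not code from this PR.
# Threads suit IO-bound file reads and avoid pickling; swapping in the
# process-based multiprocessing.Pool gives the same .map interface.
def read_file(path, encoding="utf-8"):
    # newline="" preserves any bare "\r" characters inside lines.
    with open(path, encoding=encoding, newline="") as f:
        return f.read().split("\n")

def read_files_parallel(paths, num_workers=4):
    with ThreadPool(num_workers) as pool:
        return pool.map(read_file, paths)
```

Whether this beats a sequential loop depends on the storage: on an SSD with many small files it can help, while on a spinning disk the drive IO likely dominates, as noted above.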

lhoestq (Member, Author) commented Oct 5, 2020

Merging this one for now for the patch release.

lhoestq merged commit 0ec694c into master on Oct 5, 2020
lhoestq deleted the use-python-read-for-text-dataset branch on Oct 5, 2020 at 13:13