Fix text delimiter #631

Merged: lhoestq merged 1 commit into master from fix-text-delimiter on Sep 15, 2020

lhoestq (Member) commented Sep 15, 2020

I changed the delimiter in the text dataset script.
This should fix the pyarrow.lib.ArrowInvalid: CSV parse error from #622.

The new delimiter is \b (backspace), an ASCII control character that should not appear in ordinary text files.
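
To illustrate the idea behind this change, here is a minimal sketch that uses Python's standard csv module as a stand-in for the pyarrow CSV reader that the error message points to. The two parsers differ in their details, so this only demonstrates the principle. The sample text and the helper name `split_lines` are made up for illustration. With a delimiter that never occurs in the data, every line comes back as a single field, while a common delimiter such as a comma splits lines into several fields.

```python
import csv
import io

# Made-up sample: two lines that contain commas and quote characters.
SAMPLE = 'first line, with a comma\nsecond line with "quotes", and commas, too\n'


def split_lines(text, delimiter):
    """Parse text as CSV and return the list of fields found on each line."""
    # QUOTE_NONE makes the parser treat quote characters as ordinary text.
    reader = csv.reader(
        io.StringIO(text), delimiter=delimiter, quoting=csv.QUOTE_NONE
    )
    return [row for row in reader]


# A comma delimiter splits the sample lines into several columns.
print([len(row) for row in split_lines(SAMPLE, ",")])
# The backspace character does not occur in SAMPLE, so each line stays whole.
print([len(row) for row in split_lines(SAMPLE, "\b")])
```

The approach only holds as long as the chosen character really is absent from the input, which is why a rarely used control character is a more robust choice than a printable one.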

thomwolf (Member) left a comment

Yay 🎶

lhoestq merged commit f38a871 into master on Sep 15, 2020
lhoestq deleted the fix-text-delimiter branch on September 15, 2020 at 08:26
JetRunner pushed a commit that referenced this pull request Sep 17, 2020

abhi1nandy2 left a comment

I got the error below when using the delimiter \b. Reverting to \r resolved it, though I don't know why.

pyarrow.lib.ArrowInvalid: CSV parse error: Expected 1 columns, got 4
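
One possible explanation, not confirmed anywhere in this thread, is that some lines in the input contain the delimiter character itself: a line holding three delimiter characters would be parsed as four columns. The sketch below is a hypothetical diagnostic helper (the name `find_delimiter_lines` and its parameters are assumptions for illustration, not part of datasets or pyarrow) that scans a file in binary mode and reports such lines.

```python
def find_delimiter_lines(path, delimiter=b"\b"):
    """Return (line_number, column_count) for each line containing the delimiter."""
    hits = []
    # Binary mode avoids any decoding or newline translation of the raw bytes.
    with open(path, "rb") as f:
        for line_number, line in enumerate(f, start=1):
            count = line.count(delimiter)
            if count:
                # Ignoring quoting rules, N delimiters on a line give N + 1 columns.
                hits.append((line_number, count + 1))
    return hits
```

Running it over each shard and getting an empty list everywhere would rule this explanation out and point toward something else, such as how the parser handles very long lines.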

lhoestq (Member, Author) commented Sep 22, 2020

@abhi1nandy2 Which OS are you using?

abhi1nandy2 commented

> Which OS are you using?

PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
VERSION_CODENAME=stretch
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

lhoestq (Member, Author) commented Sep 22, 2020

Do you mind sharing the data you used (or part of it) so I can try to reproduce?
Or at least some info about the text file you're using (size, number of lines, encoding)?

abhi1nandy2 commented

It's a lot of data, so it's difficult to share. There are 46 shards, each with about 256000 lines. Running the file command on them gives: ASCII text, with very long lines.

lhoestq (Member, Author) commented Sep 22, 2020

Ok I see, no problem :)
I'll see what I can do

Could you test with a single dummy text file with a few lines to see if you still have the issue?
Also, which version of datasets do you have?
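
A minimal reproduction along these lines could look like the sketch below. The file name, the sample lines, and both helper names are hypothetical. The only library calls used are `datasets.load_dataset("text", data_files=...)` and `datasets.__version__`, and the import is kept inside the function so that the file-writing part runs even without the library installed.

```python
from pathlib import Path


def write_dummy_text_file(
    path, lines=("first dummy line", "second dummy line", "third dummy line")
):
    """Write a few short lines to `path`, one per line, and return the path."""
    path = Path(path)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def try_load(path):
    """Print the installed datasets version and load the file with the text script."""
    # Third-party import, done lazily so the helper above stays standalone.
    import datasets

    print("datasets version:", datasets.__version__)
    return datasets.load_dataset("text", data_files=str(path))
```

Calling `try_load(write_dummy_text_file("dummy.txt"))` would then either load the dummy file or reproduce the CSV parse error on a small input, which narrows down whether the problem depends on the data.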
