Custom feature types in `load_dataset` from CSV #623

lvwerra · 2020-09-12T13:21:34Z

I am trying to load a local file with the `load_dataset` function, and I want to predefine the feature types with the `features` argument. However, the resulting types are always the same, regardless of the value passed to `features`.

I am working with the local files from the emotion dataset. To get the data you can use the following code:

from pathlib import Path
import wget

EMOTION_PATH = Path("./data/emotion")
DOWNLOAD_URLS = [
    "https://www.dropbox.com/s/1pzkadrvffbqw6o/train.txt?dl=1",
    "https://www.dropbox.com/s/2mzialpsgf9k5l3/val.txt?dl=1",
    "https://www.dropbox.com/s/ikkqxfdbdec3fuj/test.txt?dl=1",
]

# Create the target directory (and any missing parents such as ./data) if needed.
EMOTION_PATH.mkdir(parents=True, exist_ok=True)
for url in DOWNLOAD_URLS:
    wget.download(url, str(EMOTION_PATH))

The first five lines of the train set are:

i didnt feel humiliated;sadness
i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake;sadness
im grabbing a minute to post i feel greedy wrong;anger
i am ever feeling nostalgic about the fireplace i will know that it is still on the property;love
i am feeling grouchy;anger
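
For readers who want to check the file layout without downloading anything, here is a minimal sketch that parses the five sample lines above with Python's standard `csv` module. It only illustrates what `delimiter=';'` and explicit column names mean for this file; it is not how `load_dataset` reads CSV internally. The `sample` string and the `rows` variable are assumptions introduced for the illustration.

```python
import csv
import io

# The five sample lines quoted above, standing in for train.txt.
sample = """\
i didnt feel humiliated;sadness
i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake;sadness
im grabbing a minute to post i feel greedy wrong;anger
i am ever feeling nostalgic about the fireplace i will know that it is still on the property;love
i am feeling grouchy;anger
"""

# The file has no header row, so the column names are supplied explicitly,
# and the fields are separated by ';' rather than ','.
reader = csv.DictReader(
    io.StringIO(sample), fieldnames=["text", "label"], delimiter=";"
)
rows = list(reader)

for row in rows:
    print(row["label"], "|", row["text"][:40])
```

Both columns come back as plain strings here, which matches the observed `Value(dtype='string')` types reported further down.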

Here is the code to reproduce the issue:

from datasets import Features, Value, ClassLabel, load_dataset

class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
emotion_features = Features({'text': Value('string'), 'label': ClassLabel(names=class_names)})
file_dict = {'train': EMOTION_PATH/'train.txt'}

dataset = load_dataset('csv', data_files=file_dict, delimiter=';', column_names=['text', 'label'], features=emotion_features)

Observed behaviour:

dataset['train'].features

{'text': Value(dtype='string', id=None),
 'label': Value(dtype='string', id=None)}

Expected behaviour:

dataset['train'].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=6, names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], names_file=None, id=None)}

Things I've tried:

deleting the cache
trying other types such as int64

Am I missing anything? Thanks for any pointer in the right direction.
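
To make the difference between the observed and expected types concrete: a string-typed column keeps the label text as-is, while a class-label column stores an integer id per class name. The sketch below mimics that idea in plain Python, assuming ids are assigned by position in `class_names`. The `label2id` mapping and `sample_labels` list are hypothetical names used only for this illustration, not part of the `datasets` API.

```python
class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Hypothetical mapping: each class name gets its position in class_names as id.
label2id = {name: idx for idx, name in enumerate(class_names)}

# Labels of the five sample lines from the train set shown above.
sample_labels = ["sadness", "sadness", "anger", "love", "anger"]

# Encode the strings as integer ids, then decode them back to names.
encoded = [label2id[label] for label in sample_labels]
decoded = [class_names[idx] for idx in encoded]

print(encoded)  # [0, 0, 3, 2, 3]
print(decoded == sample_labels)  # True
```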


lhoestq · 2020-09-17T15:32:42Z

Currently the csv loader doesn't support the features argument (unlike json).
What you can do for now is cast the features using the in-place transform cast_:

from datasets import load_dataset

dataset = load_dataset('csv', data_files=file_dict, delimiter=';', column_names=['text', 'label'])
dataset.cast_(emotion_features)

lvwerra · 2020-09-18T17:44:18Z

Thanks for the clarification!

lewtun · 2020-09-19T09:06:46Z

Hi @lhoestq we've tried out your suggestion but are now running into the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-163-81ffd5ac18c9> in <module>
----> 1 dataset.cast_(emotion_features)

/usr/local/lib/python3.6/dist-packages/datasets/dataset_dict.py in cast_(self, features)
    125         self._check_values_type()
    126         for dataset in self.values():
--> 127             dataset.cast_(features=features)
    128 
    129     def remove_columns_(self, column_names: Union[str, List[str]]):

/usr/local/lib/python3.6/dist-packages/datasets/fingerprint.py in wrapper(*args, **kwargs)
    161             # Call actual function
    162 
--> 163             out = func(self, *args, **kwargs)
    164 
    165             # Update fingerprint of in-place transforms + update in-place history of transforms

/usr/local/lib/python3.6/dist-packages/datasets/arrow_dataset.py in cast_(self, features)
    602         self._info.features = features
    603         schema = pa.schema(features.type)
--> 604         self._data = self._data.cast(schema)
    605 
    606     @fingerprint(inplace=True)

/usr/local/lib/python3.6/dist-packages/pyarrow/table.pxi in pyarrow.lib.Table.cast()

ValueError: Target schema's field names are not matching the table's field names: ['text', 'label'], ['label', 'text']

Looking at the types in emotion_features we see that label and text appear to be swapped in the Arrow table:

emotion_features.type
StructType(struct<label: int64, text: string>)

Did we define the emotion_features incorrectly? We just followed the instructions from the docs, but perhaps we misunderstood something 😬
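
The mismatch in the error message can be reproduced in miniature. The sketch below is a pure-Python stand-in, not pyarrow's or `datasets`' actual code: it assumes the cast compares field names in order, and that the target schema lists its fields alphabetically while the table keeps the CSV column order. The helpers `strict_cast` and `reorder_then_cast` are hypothetical, and the real fix may work differently.

```python
from typing import Dict, List


def strict_cast(table: Dict[str, list], target_names: List[str]) -> Dict[str, list]:
    """Toy cast that, like the error above, requires identical column order."""
    if list(table) != target_names:
        raise ValueError(
            "Target schema's field names are not matching the table's "
            f"field names: {list(table)}, {target_names}"
        )
    return dict(table)


def reorder_then_cast(table: Dict[str, list], target_names: List[str]) -> Dict[str, list]:
    """Reorder the columns to the target order first, then run the strict cast."""
    reordered = {name: table[name] for name in target_names}
    return strict_cast(reordered, target_names)


# Table columns in CSV order; target names sorted alphabetically (assumption).
table = {"text": ["i didnt feel humiliated"], "label": [0]}
target_names = sorted(table)  # ['label', 'text']

try:
    strict_cast(table, target_names)
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(err)

fixed = reorder_then_cast(table, target_names)
print(list(fixed))  # ['label', 'text']
```

Under these assumptions the feature definition itself is fine; only the column order differs between the two sides of the comparison.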

thomwolf · 2020-09-29T12:11:34Z

In general, I don't think there is any hard reason we don't allow to use features in the csv script, right @lhoestq?

Should I add it?

lhoestq · 2020-09-29T12:17:06Z

> In general, I don't think there is any hard reason we don't allow to use features in the csv script, right @lhoestq?
>
> Should I add it?

Sure let's add it. Setting the convert options should do the job

> Hi @lhoestq we've tried out your suggestion but are now running into the following error:
>
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> <ipython-input-163-81ffd5ac18c9> in <module>
> ----> 1 dataset.cast_(emotion_features)
>
> /usr/local/lib/python3.6/dist-packages/pyarrow/table.pxi in pyarrow.lib.Table.cast()
>
> ValueError: Target schema's field names are not matching the table's field names: ['text', 'label'], ['label', 'text']
>
> Did we define the emotion_features incorrectly? We just followed the instructions from the docs, but perhaps we misunderstood something 😬

Thanks for reporting, that's a bug :) I'm fixing it right now

lhoestq · 2020-09-29T12:58:56Z

PR is open for the ValueError: Target schema's field names are not matching the table's field names error.

I'm adding the features parameter to csv

lewtun · 2020-09-30T19:51:43Z

Thanks a lot for the PR and quick fix @lhoestq!

lhoestq added the "dataset bug" and "enhancement" labels and removed the "dataset bug" label · Sep 14, 2020

lhoestq mentioned this issue Sep 29, 2020

Fix column order issue in cast #684

Merged

lhoestq mentioned this issue Sep 29, 2020

Add features parameter to CSV #685

Merged

lhoestq closed this as completed in #685 Sep 30, 2020

luyug mentioned this issue Jun 24, 2021

Field order issue in loading json #2548

Closed
