Custom feature types in load_dataset from CSV #623

Closed
lvwerra opened this issue Sep 12, 2020 · 7 comments · Fixed by #685
Labels
enhancement New feature or request

Comments

@lvwerra
Member

lvwerra commented Sep 12, 2020

I am trying to load a local file with the load_dataset function and I want to predefine the feature types with the features argument. However, the resulting types are always the same, regardless of the value passed to features.

I am working with the local files from the emotion dataset. To get the data you can use the following code:

from pathlib import Path
import wget

EMOTION_PATH = Path("./data/emotion")
DOWNLOAD_URLS = [
    "https://www.dropbox.com/s/1pzkadrvffbqw6o/train.txt?dl=1",
    "https://www.dropbox.com/s/2mzialpsgf9k5l3/val.txt?dl=1",
    "https://www.dropbox.com/s/ikkqxfdbdec3fuj/test.txt?dl=1",
]

# Create the target directory (and any missing parents) if it does not exist yet
EMOTION_PATH.mkdir(parents=True, exist_ok=True)
for url in DOWNLOAD_URLS:
    wget.download(url, str(EMOTION_PATH))

The first five lines of the train set are:

i didnt feel humiliated;sadness
i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake;sadness
im grabbing a minute to post i feel greedy wrong;anger
i am ever feeling nostalgic about the fireplace i will know that it is still on the property;love
i am feeling grouchy;anger
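
For reference, a minimal sketch of how these semicolon-delimited lines split into the two columns. It uses only Python's standard csv module, not load_dataset, and the names SAMPLE and parse_rows are made up for illustration:

```python
import csv
import io

# Two of the sample lines from train.txt shown above
SAMPLE = (
    "i didnt feel humiliated;sadness\n"
    "i am feeling grouchy;anger\n"
)

def parse_rows(raw, column_names=("text", "label"), delimiter=";"):
    """Split delimiter-separated lines into dicts keyed by column name."""
    reader = csv.DictReader(
        io.StringIO(raw),
        fieldnames=list(column_names),  # the file has no header row
        delimiter=delimiter,
    )
    return [dict(row) for row in reader]

rows = parse_rows(SAMPLE)
print(rows[0])
```

Because the file has no header row, the column names have to be supplied explicitly, which is what column_names does in the load_dataset call below.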

Here is the code to reproduce the issue:

from datasets import Features, Value, ClassLabel, load_dataset

class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
emotion_features = Features({'text': Value('string'), 'label': ClassLabel(names=class_names)})
file_dict = {'train': EMOTION_PATH/'train.txt'}

dataset = load_dataset('csv', data_files=file_dict, delimiter=';', column_names=['text', 'label'], features=emotion_features)

Observed behaviour:

dataset['train'].features
{'text': Value(dtype='string', id=None),
 'label': Value(dtype='string', id=None)}

Expected behaviour:

dataset['train'].features
{'text': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=6, names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], names_file=None, id=None)}
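
The difference between the two outputs is that a ClassLabel column holds integer ids that index into names, whereas the observed column keeps the raw strings. A plain-Python sketch of that encoding follows; encode_label and decode_label are hypothetical helpers for illustration, not the library's implementation:

```python
class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Map each class name to its position in class_names
name_to_id = {name: idx for idx, name in enumerate(class_names)}

def encode_label(name):
    """Return the integer id for a class name."""
    return name_to_id[name]

def decode_label(idx):
    """Return the class name for an integer id."""
    return class_names[idx]

# Labels of the first five training lines shown earlier
labels = ["sadness", "sadness", "anger", "love", "anger"]
encoded = [encode_label(label) for label in labels]
print(encoded)  # [0, 0, 3, 2, 3]
```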

Things I've tried:

  • deleting the cache
  • trying other types such as int64

Am I missing anything? Thanks for any pointer in the right direction.

lhoestq added the "dataset bug" and "enhancement" labels and removed the "dataset bug" label on Sep 14, 2020
@lhoestq
Member

lhoestq commented Sep 17, 2020

Currently the csv loader doesn't support the features argument (unlike json).
What you can do for now is cast the features using the in-place transform cast_:

from datasets import load_dataset

dataset = load_dataset('csv', data_files=file_dict, delimiter=';', column_names=['text', 'label'])
dataset.cast_(emotion_features)

@lvwerra
Member Author

lvwerra commented Sep 18, 2020

Thanks for the clarification!

@lewtun
Member

lewtun commented Sep 19, 2020

Hi @lhoestq we've tried out your suggestion but are now running into the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-163-81ffd5ac18c9> in <module>
----> 1 dataset.cast_(emotion_features)

/usr/local/lib/python3.6/dist-packages/datasets/dataset_dict.py in cast_(self, features)
    125         self._check_values_type()
    126         for dataset in self.values():
--> 127             dataset.cast_(features=features)
    128 
    129     def remove_columns_(self, column_names: Union[str, List[str]]):

/usr/local/lib/python3.6/dist-packages/datasets/fingerprint.py in wrapper(*args, **kwargs)
    161             # Call actual function
    162 
--> 163             out = func(self, *args, **kwargs)
    164 
    165             # Update fingerprint of in-place transforms + update in-place history of transforms

/usr/local/lib/python3.6/dist-packages/datasets/arrow_dataset.py in cast_(self, features)
    602         self._info.features = features
    603         schema = pa.schema(features.type)
--> 604         self._data = self._data.cast(schema)
    605 
    606     @fingerprint(inplace=True)

/usr/local/lib/python3.6/dist-packages/pyarrow/table.pxi in pyarrow.lib.Table.cast()

ValueError: Target schema's field names are not matching the table's field names: ['text', 'label'], ['label', 'text']

Looking at the types in emotion_features we see that label and text appear to be swapped in the Arrow table:

emotion_features.type
StructType(struct<label: int64, text: string>)

Did we define the emotion_features incorrectly? We just followed the instructions from the docs, but perhaps we misunderstood something 😬
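
The mismatch in the error message can be reproduced without any library: the table columns follow column_names (text, label), while emotion_features.type lists its fields alphabetically (label, text). The sketch below aligns the two orders in plain Python. It only illustrates the mismatch and is not the fix that went into the library; reorder_fields is a hypothetical helper:

```python
# Column order of the loaded table, as given by column_names
table_columns = ["text", "label"]

# Field order reported by emotion_features.type (alphabetical)
schema_fields = [("label", "int64"), ("text", "string")]

def reorder_fields(fields, columns):
    """Return the schema fields in the same order as the table columns."""
    by_name = dict(fields)
    if set(by_name) != set(columns):
        raise ValueError(
            f"field names differ: {sorted(by_name)} vs {sorted(columns)}"
        )
    return [(name, by_name[name]) for name in columns]

aligned = reorder_fields(schema_fields, table_columns)
print(aligned)  # [('text', 'string'), ('label', 'int64')]
```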

@thomwolf
Member

In general, I don't think there is any hard reason not to allow features in the csv script, right @lhoestq?

Should I add it?

@lhoestq
Member

lhoestq commented Sep 29, 2020

> In general, I don't think there is any hard reason not to allow features in the csv script, right @lhoestq?
>
> Should I add it?

Sure, let's add it. Setting the convert options should do the job.

> Hi @lhoestq we've tried out your suggestion but are now running into the following error:
>
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> <ipython-input-163-81ffd5ac18c9> in <module>
> ----> 1 dataset.cast_(emotion_features)
>
> /usr/local/lib/python3.6/dist-packages/pyarrow/table.pxi in pyarrow.lib.Table.cast()
>
> ValueError: Target schema's field names are not matching the table's field names: ['text', 'label'], ['label', 'text']
>
> Did we define the emotion_features incorrectly? We just followed the instructions from the docs, but perhaps we misunderstood something 😬

Thanks for reporting, that's a bug :) I'm fixing it right now

@lhoestq
Member

lhoestq commented Sep 29, 2020

PR is open for the ValueError: Target schema's field names are not matching the table's field names error.

I'm adding the features parameter to csv

@lewtun
Member

lewtun commented Sep 30, 2020

Thanks a lot for the PR and quick fix @lhoestq!
