
Unable to use custom dataset: AttributeError: 'list' object has no attribute 'keys' #11455

Closed
tommasodelorenzo opened this issue Apr 26, 2021 · 5 comments



tommasodelorenzo commented Apr 26, 2021

What am I doing wrong?

I encode data with

from transformers import AutoTokenizer

model_name = "dbmdz/bert-base-italian-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=True)

def encode_data(texts):
    return tokenizer.batch_encode_plus(
                texts, 
                add_special_tokens=True, 
                return_attention_mask=True, 
                padding = True,
                truncation=True,
                max_length=200,
                return_tensors='pt'
            )

Then I create my datasets with

import torch

class my_Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = torch.tensor(labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        print(item)
        return item

    def __len__(self):
        return len(self.labels)

So I have

encoded_data_train = encode_data(df_train['text'].tolist())
encoded_data_val = encode_data(df_val['text'].tolist())
encoded_data_test = encode_data(df_test['text'].tolist())
dataset_train = my_Dataset(encoded_data_train, df_train['labels'].tolist())
dataset_val = my_Dataset(encoded_data_val, df_val['labels'].tolist())
dataset_test = my_Dataset(encoded_data_test, df_test['labels'].tolist())
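
As a quick sanity check (a minimal sketch using the objects defined above), each item should come back as a dict of tensors:

sample = dataset_train[0]
print(type(sample))   # expected: <class 'dict'>
print(sample.keys())  # expected: input_ids, token_type_ids, attention_mask, labels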

Then I initiate my Trainer with

from transformers import AutoConfig, AutoModelForSequenceClassification, TrainingArguments, DataCollatorWithPadding, Trainer

training_args = TrainingArguments(
    output_dir='/trial',
    learning_rate=1e-6,
    do_train=True,
    do_eval=True,
    evaluation_strategy='epoch',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=0,
    weight_decay=0.2,
    logging_dir="./logs",
)

num_labels = len(label_dict)
model = AutoModelForSequenceClassification.from_pretrained(model_name,num_labels = num_labels)

trainer = Trainer(
  model=model,
  args=training_args,
  data_collator=DataCollatorWithPadding(tokenizer),
  tokenizer= tokenizer,
  train_dataset=dataset_train,
  eval_dataset=dataset_val,
)

and finally I train

trainer.train()

Here is the error I get

AttributeErrorTraceback (most recent call last)
<ipython-input-22-5d018b4b061d> in <module>
----> 1 trainer.train()

/opt/conda/lib/python3.8/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
   1032             self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
   1033 
-> 1034             for step, inputs in enumerate(epoch_iterator):
   1035 
   1036                 # Skip past any already trained steps if resuming training

/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
    433         if self._sampler_iter is None:
    434             self._reset()
--> 435         data = self._next_data()
    436         self._num_yielded += 1
    437         if self._dataset_kind == _DatasetKind.Iterable and \

/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    473     def _next_data(self):
    474         index = self._next_index()  # may raise StopIteration
--> 475         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    476         if self._pin_memory:
    477             data = _utils.pin_memory.pin_memory(data)

/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     45         else:
     46             data = self.dataset[possibly_batched_index]
---> 47         return self.collate_fn(data)

/opt/conda/lib/python3.8/site-packages/transformers/data/data_collator.py in __call__(self, features)
    116 
    117     def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
--> 118         batch = self.tokenizer.pad(
    119             features,
    120             padding=self.padding,

/opt/conda/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in pad(self, encoded_inputs, padding, max_length, pad_to_multiple_of, return_attention_mask, return_tensors, verbose)
   2558         if self.model_input_names[0] not in encoded_inputs:
   2559             raise ValueError(
-> 2560                 "You should supply an encoding or a list of encodings to this method"
   2561                 f"that includes {self.model_input_names[0]}, but you provided {list(encoded_inputs.keys())}"
   2562             )

AttributeError: 'list' object has no attribute 'keys'

What am I doing wrong?
I also tried using

import torch
from torch.utils.data import TensorDataset

dataset_train = TensorDataset(encoded_data_train['input_ids'], encoded_data_train['attention_mask'], torch.tensor(df_train['labels'].tolist()))
dataset_test = TensorDataset(encoded_data_test['input_ids'], encoded_data_test['attention_mask'], torch.tensor(df_test['labels'].tolist()))
dataset_val = TensorDataset(encoded_data_val['input_ids'], encoded_data_val['attention_mask'], torch.tensor(df_val['labels'].tolist()))

getting the same error.
Using:
torch == 1.7.1
transformers == 4.4.2

Thank you!

@sgugger


sgugger commented Apr 30, 2021

This is really weird. Could you print a few items of your Dataset? The error means that they are not dictionaries containing "input_ids" but they certainly seem to be.

Also note that since you have already applied padding in your preprocessing, you can use the default_data_collator, but the code should work nonetheless.
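
A minimal sketch of that swap, keeping the same model, training_args and datasets as in your snippet:

from transformers import Trainer, default_data_collator

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=default_data_collator,  # inputs were already padded in encode_data
    tokenizer=tokenizer,
    train_dataset=dataset_train,
    eval_dataset=dataset_val,
)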


tommasodelorenzo commented May 3, 2021

> Also note that since you have already applied padding in your preprocessing, you can use the default_data_collator, but the code should work nonetheless.

Yeah, I did try commenting out the data_collator line as well, but I got the same error.

> This is really weird. Could you print a few items of your Dataset? The error means that they are not dictionaries containing "input_ids" but they certainly seem to be.

For instance, dataset_train.__getitem__(1) gives me

{'input_ids': tensor([  102,  2719, 10118, 19614,   784,   366,   119,   142, 17586,   113,
         10885,  4019,  5129,   143, 10885,   119,  4019, 14633,  1354,   137,
           917,  1621,  9048,   360,   151,   143,   784,   366,   113,   213,
          7809,   985,  1941,  1702,  9580,   749, 12993,   135,  9272,   119,
          1202,  1328,  2909,  7427,  2909,   483, 15079,  6766,  2201,  5754,
          4213,  1266,   642,   119,  1968,   115,  7584,  7124,  2899,  9654,
           151,   143,  3684,   137, 17586,   113,  3151,   113,   193,  4283,
           165,  1035,  1354,  4913,  1621,  9048,   360,   137, 17586,   113,
           119,  7809,   985,  1941,  1702,  1621,  9048,   360,  4913, 16829,
           913,   272,  3694,  2909,  7427,   145,  1723, 20957, 15016,   213,
         11171,   119,  7809,   642,  3761,   188,   164,  4706,   119,  3684,
          8941,   119,  6330,  8076,  2199,   642, 23829, 22462, 30934,  4213,
          1354,  2759,   311,  7809,  5434,   137,  1031,   510,  2603,  5569,
          5434,   137,  1031,   510,  3732,  5569,  5434,   137,  1031,   510,
          3627, 14715, 30951,  4543,  8823,  5066,  3625,  3627,  1701,  7900,
           153,  5066,  3625,  3732,  7559,   127,  3732, 13703,   133,   176,
         11576,  2909, 13703,   133,  1621,  9048,   360,  1723,  5230,  9580,
           749, 12993,   114,  1031,   510,   387, 11993,   189, 22264,  8823,
           143,  6766,  3462,  5622, 27082,   113,  7809,  3132,  1011,   189,
          7825,  8823,   143,  6766,   111,   341,  7124,  2899, 18482,   103]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1]),
 'labels': tensor(5)}

Input texts are emails in Italian.

(the issue appears also with transformers 4.5.1)


sgugger commented May 3, 2021

I am unable to reproduce your bug. Are you sure your data frames don't contain a list of texts in one of the lines instead of just a single text?
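
One quick way to check, assuming the df_train from your snippet:

print(df_train['text'].map(type).value_counts())  # every entry should be <class 'str'>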

@tommasodelorenzo
Author

I found the mistake! I was doing something slightly different from what I wrote, namely

from transformers import AutoConfig, AutoModelForSequenceClassification, TrainingArguments, DataCollatorWithPadding, Trainer

train_dataset=dataset_train,
eval_dataset = dataset_val

training_args = TrainingArguments(
    output_dir='/trial',
    learning_rate=1e-6,
    do_train=True,
    do_eval=True,
    evaluation_strategy='epoch',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=0,
    weight_decay=0.2,
    logging_dir="./logs",
)

num_labels = len(label_dict)
model = AutoModelForSequenceClassification.from_pretrained(model_name,num_labels = num_labels)

trainer = Trainer(
  model=model,
  args=training_args,
  data_collator=DataCollatorWithPadding(tokenizer),
  tokenizer= tokenizer,
  train_dataset=train_dataset,
  eval_dataset=eval_dataset,
)

The difference is in lines 3 and 4, and consequently in the last two lines. The mistake is the comma at the end of line 3. My bad, I did not run the example code I published in the question exactly as it was. I am so sorry, and so upset to have spent a week on a stupid comma.
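
Just to spell out what that comma does (a minimal standalone sketch, not the actual training code):

dataset = ["a stand-in object"]

train_dataset = dataset,    # trailing comma: this is a one-element tuple
print(type(train_dataset))  # <class 'tuple'>

train_dataset = dataset     # without the comma: the object itself
print(type(train_dataset))  # <class 'list'>

So the Trainer was receiving a tuple wrapping the dataset instead of the dataset itself.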
Thanks for the help


sgugger commented May 3, 2021

Oh that's a nasty little bug indeed! Glad you found the problem!
