
Unable to use custom dataset: AttributeError: 'list' object has no attribute 'keys' #11455

Closed
tommasodelorenzo opened this issue Apr 26, 2021 · 5 comments



tommasodelorenzo commented Apr 26, 2021

What am I doing wrong?

I encode data with

from transformers import AutoTokenizer

model_name = "dbmdz/bert-base-italian-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=True)

def encode_data(texts):
    return tokenizer.batch_encode_plus(
                texts, 
                add_special_tokens=True, 
                return_attention_mask=True, 
                padding = True,
                truncation=True,
                max_length=200,
                return_tensors='pt'
            )

Then I create my datasets with

import torch

class my_Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = torch.tensor(labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        print(item)
        return item

    def __len__(self):
        return len(self.labels)

So I have

encoded_data_train = encode_data(df_train['text'].tolist())
encoded_data_val = encode_data(df_val['text'].tolist())
encoded_data_test = encode_data(df_test['text'].tolist())
dataset_train = my_Dataset(encoded_data_train, df_train['labels'].tolist())
dataset_val = my_Dataset(encoded_data_val, df_val['labels'].tolist())
dataset_test = my_Dataset(encoded_data_test, df_test['labels'].tolist())
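
As a quick sanity check (a minimal sketch using the objects defined above), each item should come back as a dict of tensors:

sample = dataset_train[0]
print(type(sample))   # expected: <class 'dict'>
print(sample.keys())  # expected: input_ids, token_type_ids, attention_mask, labels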

Then I initiate my Trainer with

from transformers import AutoConfig, AutoModelForSequenceClassification, TrainingArguments, DataCollatorWithPadding, Trainer

training_args = TrainingArguments(
    output_dir='/trial',
    learning_rate=1e-6,
    do_train=True,
    do_eval=True,
    evaluation_strategy='epoch',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=0,
    weight_decay=0.2,
    logging_dir="./logs",
)

num_labels = len(label_dict)
model = AutoModelForSequenceClassification.from_pretrained(model_name,num_labels = num_labels)

trainer = Trainer(
  model=model,
  args=training_args,
  data_collator=DataCollatorWithPadding(tokenizer),
  tokenizer= tokenizer,
  train_dataset=dataset_train,
  eval_dataset=dataset_val,
)

and finally I train

trainer.train()

Here is the error I get

AttributeErrorTraceback (most recent call last)
<ipython-input-22-5d018b4b061d> in <module>
----> 1 trainer.train()

/opt/conda/lib/python3.8/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
   1032             self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
   1033 
-> 1034             for step, inputs in enumerate(epoch_iterator):
   1035 
   1036                 # Skip past any already trained steps if resuming training

/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
    433         if self._sampler_iter is None:
    434             self._reset()
--> 435         data = self._next_data()
    436         self._num_yielded += 1
    437         if self._dataset_kind == _DatasetKind.Iterable and \

/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    473     def _next_data(self):
    474         index = self._next_index()  # may raise StopIteration
--> 475         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    476         if self._pin_memory:
    477             data = _utils.pin_memory.pin_memory(data)

/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     45         else:
     46             data = self.dataset[possibly_batched_index]
---> 47         return self.collate_fn(data)

/opt/conda/lib/python3.8/site-packages/transformers/data/data_collator.py in __call__(self, features)
    116 
    117     def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
--> 118         batch = self.tokenizer.pad(
    119             features,
    120             padding=self.padding,

/opt/conda/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in pad(self, encoded_inputs, padding, max_length, pad_to_multiple_of, return_attention_mask, return_tensors, verbose)
   2558         if self.model_input_names[0] not in encoded_inputs:
   2559             raise ValueError(
-> 2560                 "You should supply an encoding or a list of encodings to this method"
   2561                 f"that includes {self.model_input_names[0]}, but you provided {list(encoded_inputs.keys())}"
   2562             )

AttributeError: 'list' object has no attribute 'keys'

What am I doing wrong?
I also tried using

import torch
from torch.utils.data import TensorDataset

dataset_train = TensorDataset(encoded_data_train['input_ids'], encoded_data_train['attention_mask'], torch.tensor(df_train['labels'].tolist()))
dataset_test = TensorDataset(encoded_data_test['input_ids'], encoded_data_test['attention_mask'], torch.tensor(df_test['labels'].tolist()))
dataset_val = TensorDataset(encoded_data_val['input_ids'], encoded_data_val['attention_mask'], torch.tensor(df_val['labels'].tolist()))

getting the same error.
Using:
torch == 1.7.1
transformers == 4.4.2

Thank you!

@sgugger


sgugger commented Apr 30, 2021

This is really weird. Could you print a few items of your Dataset? The error means that they are not dictionaries containing "input_ids" but they certainly seem to be.

Also note that since you have already applied padding in your preprocessing, you can use the default_data_collator, but the code should work nonetheless.
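
A minimal sketch of that swap, keeping the same model, training_args and datasets as in your snippet:

from transformers import Trainer, default_data_collator

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=default_data_collator,  # inputs were already padded in encode_data
    tokenizer=tokenizer,
    train_dataset=dataset_train,
    eval_dataset=dataset_val,
)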


tommasodelorenzo commented May 3, 2021

> Also note that since you have already applied padding in your preprocessing, you can use the default_data_collator, but the code should work nonetheless.

Yeah, I did try commenting out the data_collator line as well, but I got the same error.

> This is really weird. Could you print a few items of your Dataset? The error means that they are not dictionaries containing "input_ids" but they certainly seem to be.

For instance, dataset_train.__getitem__(1) gives me

{'input_ids': tensor([  102,  2719, 10118, 19614,   784,   366,   119,   142, 17586,   113,
         10885,  4019,  5129,   143, 10885,   119,  4019, 14633,  1354,   137,
           917,  1621,  9048,   360,   151,   143,   784,   366,   113,   213,
          7809,   985,  1941,  1702,  9580,   749, 12993,   135,  9272,   119,
          1202,  1328,  2909,  7427,  2909,   483, 15079,  6766,  2201,  5754,
          4213,  1266,   642,   119,  1968,   115,  7584,  7124,  2899,  9654,
           151,   143,  3684,   137, 17586,   113,  3151,   113,   193,  4283,
           165,  1035,  1354,  4913,  1621,  9048,   360,   137, 17586,   113,
           119,  7809,   985,  1941,  1702,  1621,  9048,   360,  4913, 16829,
           913,   272,  3694,  2909,  7427,   145,  1723, 20957, 15016,   213,
         11171,   119,  7809,   642,  3761,   188,   164,  4706,   119,  3684,
          8941,   119,  6330,  8076,  2199,   642, 23829, 22462, 30934,  4213,
          1354,  2759,   311,  7809,  5434,   137,  1031,   510,  2603,  5569,
          5434,   137,  1031,   510,  3732,  5569,  5434,   137,  1031,   510,
          3627, 14715, 30951,  4543,  8823,  5066,  3625,  3627,  1701,  7900,
           153,  5066,  3625,  3732,  7559,   127,  3732, 13703,   133,   176,
         11576,  2909, 13703,   133,  1621,  9048,   360,  1723,  5230,  9580,
           749, 12993,   114,  1031,   510,   387, 11993,   189, 22264,  8823,
           143,  6766,  3462,  5622, 27082,   113,  7809,  3132,  1011,   189,
          7825,  8823,   143,  6766,   111,   341,  7124,  2899, 18482,   103]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1]),
 'labels': tensor(5)}

Input texts are emails in Italian.

(the issue appears also with transformers 4.5.1)


sgugger commented May 3, 2021

I am unable to reproduce your bug. Are you sure your data frames don't contain a list of texts in one of the lines instead of just a single text?
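
One quick way to check, assuming the df_train from your snippet:

print(df_train['text'].map(type).value_counts())  # every entry should be <class 'str'>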

@tommasodelorenzo
Author

I found the mistake! I was doing something slightly different from what I wrote, namely

from transformers import AutoConfig, AutoModelForSequenceClassification, TrainingArguments, DataCollatorWithPadding, Trainer

train_dataset=dataset_train,
eval_dataset = dataset_val

training_args = TrainingArguments(
    output_dir='/trial',
    learning_rate=1e-6,
    do_train=True,
    do_eval=True,
    evaluation_strategy='epoch',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=0,
    weight_decay=0.2,
    logging_dir="./logs",
)

num_labels = len(label_dict)
model = AutoModelForSequenceClassification.from_pretrained(model_name,num_labels = num_labels)

trainer = Trainer(
  model=model,
  args=training_args,
  data_collator=DataCollatorWithPadding(tokenizer),
  tokenizer= tokenizer,
  train_dataset=train_dataset,
  eval_dataset=eval_dataset,
)

The difference is in lines 3 and 4, and consequently in the last two lines. The mistake is the comma at the end of line 3. My bad, I did not run the example code I published in the question exactly as it was. I am so sorry, and so upset to have spent a week on a stupid comma.
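
Just to spell out what that comma does (a minimal standalone sketch, not the actual training code):

dataset = ["a stand-in object"]

train_dataset = dataset,    # trailing comma: this is a one-element tuple
print(type(train_dataset))  # <class 'tuple'>

train_dataset = dataset     # without the comma: the object itself
print(type(train_dataset))  # <class 'list'>

So the Trainer was receiving a tuple wrapping the dataset instead of the dataset itself.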
Thanks for the help


sgugger commented May 3, 2021

Oh that's a nasty little bug indeed! Glad you found the problem!
