KeyError: 'text' #85

Closed
loretoparisi opened this issue Feb 14, 2019 · 10 comments
Labels: waiting for answer (Further information is requested)

Comments

@loretoparisi commented Feb 14, 2019

Hello,
my training CSV file looks like this:

mbploreto:script loretoparisi$ head -n2 /root/spam_dataset.csv 
label	text
HAM	waiting waiting waiting waiting solitude stands by the window as someone said i tried hard to find you i found fake promises instead the thought behind to join the thought before i thought i was blind sometimes i feel i feel the way to live i thought i had strength to overcome these walls i thought i was wonderful memories keep together things now would you like to know how it feels to be always stuck in the past without any rest the thought behind to join the thought before i thought i was blind
SPAM	please every body click cross

so I have my configuration as a string:

"{input_features: [{name: text, type: text}], output_features: [{name: label, type: category}]}"

and I start training:

ludwig train --data_csv /root/spam_dataset.csv --model_definition "{input_features: [{name: text, type: text}], output_features: [{name: label, type: category}]}"

Then I get this error about the text field:

 _         _        _      
| |_  _ __| |_ __ _(_)__ _ 
| | || / _` \ V  V / / _` |
|_|\_,_\__,_|\_/\_/|_\__, |
                     |___/ 
ludwig v0.1.0 - Train

Experiment name: experiment
Model name: run
Output path: results/experiment_run_1


ludwig_version: '0.1.0'
command: ('ludwig train '
 '--data_csv /root/spam_dataset.csv --model_definition {input_features: '
 '[{name: text, type: text}], output_features: [{name: label, type: '
 'category}]}')
commit_hash: '98b82b3f56c0'
dataset_type: '/root/spam_dataset.csv'
model_definition: {   'combiner': {'type': 'concat'},
    'input_features': [   {   'encoder': 'parallel_cnn',
                              'level': 'word',
                              'name': 'text',
                              'tied_weights': None,
                              'type': 'text'}],
    'output_features': [   {   'dependencies': [],
                               'loss': {   'class_distance_temperature': 0,
                                           'class_weights': 1,
                                           'confidence_penalty': 0,
                                           'distortion': 1,
                                           'labels_smoothing': 0,
                                           'negative_samples': 0,
                                           'robust_lambda': 0,
                                           'sampler': None,
                                           'type': 'softmax_cross_entropy',
                                           'unique': False,
                                           'weight': 1},
                               'name': 'label',
                               'reduce_dependencies': 'sum',
                               'reduce_input': 'sum',
                               'top_k': 3,
                               'type': 'category'}],
    'preprocessing': {   'bag': {   'fill_value': '',
                                    'format': 'space',
                                    'lowercase': 10000,
                                    'missing_value_strategy': 'fill_with_const',
                                    'most_common': False},
                         'binary': {   'fill_value': 0,
                                       'missing_value_strategy': 'fill_with_const'},
                         'category': {   'fill_value': '<UNK>',
                                         'lowercase': False,
                                         'missing_value_strategy': 'fill_with_const',
                                         'most_common': 10000},
                         'force_split': False,
                         'image': {'missing_value_strategy': 'backfill'},
                         'numerical': {   'fill_value': 0,
                                          'missing_value_strategy': 'fill_with_const'},
                         'sequence': {   'fill_value': '',
                                         'format': 'space',
                                         'lowercase': False,
                                         'missing_value_strategy': 'fill_with_const',
                                         'most_common': 20000,
                                         'padding': 'right',
                                         'padding_symbol': '<PAD>',
                                         'sequence_length_limit': 256,
                                         'unknown_symbol': '<UNK>'},
                         'set': {   'fill_value': '',
                                    'format': 'space',
                                    'lowercase': False,
                                    'missing_value_strategy': 'fill_with_const',
                                    'most_common': 10000},
                         'split_probabilities': (0.7, 0.1, 0.2),
                         'stratify': None,
                         'text': {   'char_format': 'characters',
                                     'char_most_common': 70,
                                     'char_sequence_length_limit': 1024,
                                     'fill_value': '',
                                     'lowercase': True,
                                     'missing_value_strategy': 'fill_with_const',
                                     'padding': 'right',
                                     'padding_symbol': '<PAD>',
                                     'unknown_symbol': '<UNK>',
                                     'word_format': 'space_punct',
                                     'word_most_common': 20000,
                                     'word_sequence_length_limit': 256},
                         'timeseries': {   'fill_value': '',
                                           'format': 'space',
                                           'missing_value_strategy': 'fill_with_const',
                                           'padding': 'right',
                                           'padding_value': 0,
                                           'timeseries_length_limit': 256}},
    'training': {   'batch_size': 128,
                    'bucketing_field': None,
                    'decay': False,
                    'decay_rate': 0.96,
                    'decay_steps': 10000,
                    'dropout_rate': 0.0,
                    'early_stop': 3,
                    'epochs': 200,
                    'gradient_clipping': None,
                    'increase_batch_size_on_plateau': 0,
                    'increase_batch_size_on_plateau_max': 512,
                    'increase_batch_size_on_plateau_patience': 5,
                    'increase_batch_size_on_plateau_rate': 2,
                    'learning_rate': 0.001,
                    'learning_rate_warmup_epochs': 5,
                    'optimizer': {   'beta1': 0.9,
                                     'beta2': 0.999,
                                     'epsilon': 1e-08,
                                     'type': 'adam'},
                    'reduce_learning_rate_on_plateau': 0,
                    'reduce_learning_rate_on_plateau_patience': 5,
                    'reduce_learning_rate_on_plateau_rate': 0.5,
                    'regularization_lambda': 0,
                    'regularizer': 'l2',
                    'staircase': False,
                    'validation_field': 'combined',
                    'validation_measure': 'loss'}}


Using full raw csv, no hdf5 and json file with the same name have been found
Building dataset (it may take a while)
Traceback (most recent call last):
  File "/Users/loretoparisi/Documents/Projects/AI/ludwig/venv/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2656, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'text'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ludwig", line 11, in <module>
    load_entry_point('ludwig==0.1.0', 'console_scripts', 'ludwig')()
  File "/Users/loretoparisi/Documents/Projects/AI/ludwig/venv/lib/python3.6/site-packages/ludwig-0.1.0-py3.6.egg/ludwig/cli.py", line 86, in main
  File "/Users/loretoparisi/Documents/Projects/AI/ludwig/venv/lib/python3.6/site-packages/ludwig-0.1.0-py3.6.egg/ludwig/cli.py", line 64, in __init__
  File "/Users/loretoparisi/Documents/Projects/AI/ludwig/venv/lib/python3.6/site-packages/ludwig-0.1.0-py3.6.egg/ludwig/cli.py", line 70, in train
  File "/Users/loretoparisi/Documents/Projects/AI/ludwig/venv/lib/python3.6/site-packages/ludwig-0.1.0-py3.6.egg/ludwig/train.py", line 663, in cli
  File "/Users/loretoparisi/Documents/Projects/AI/ludwig/venv/lib/python3.6/site-packages/ludwig-0.1.0-py3.6.egg/ludwig/train.py", line 224, in full_train
  File "/Users/loretoparisi/Documents/Projects/AI/ludwig/venv/lib/python3.6/site-packages/ludwig-0.1.0-py3.6.egg/ludwig/data/preprocessing.py", line 457, in preprocess_for_training
  File "/Users/loretoparisi/Documents/Projects/AI/ludwig/venv/lib/python3.6/site-packages/ludwig-0.1.0-py3.6.egg/ludwig/data/preprocessing.py", line 62, in build_dataset
  File "/Users/loretoparisi/Documents/Projects/AI/ludwig/venv/lib/python3.6/site-packages/ludwig-0.1.0-py3.6.egg/ludwig/data/preprocessing.py", line 83, in build_dataset_df
  File "/Users/loretoparisi/Documents/Projects/AI/ludwig/venv/lib/python3.6/site-packages/ludwig-0.1.0-py3.6.egg/ludwig/data/preprocessing.py", line 123, in build_metadata
  File "/Users/loretoparisi/Documents/Projects/AI/ludwig/venv/lib/python3.6/site-packages/pandas/core/frame.py", line 2927, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/loretoparisi/Documents/Projects/AI/ludwig/venv/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2658, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'text'
@w4nderlust (Collaborator) commented Feb 14, 2019

Your CSV seems to be tab-separated, not comma-separated. At the moment we don't support TSV, but we are working on it right now. For the time being, please use commas to separate your columns and escape any commas that appear in your text, as described here.
Please confirm that this solves the problem.
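
In the meantime, a pandas round-trip should also work as a stopgap (a minimal sketch using the file path from this thread; it assumes Ludwig's loader accepts standard quoted CSV, which pandas-based loading does):

import csv
import pandas as pd

# Read the tab-separated file...
df = pd.read_csv('/root/spam_dataset.csv', sep='\t')
# ...and write it back comma-separated. quoting=csv.QUOTE_MINIMAL (the default)
# wraps any field that contains a comma in double quotes, so embedded commas
# survive the conversion.
df.to_csv('/root/spam_dataset_comma.csv', index=False, quoting=csv.QUOTE_MINIMAL)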

w4nderlust added the "waiting for answer" label Feb 14, 2019
@loretoparisi (Author) commented Feb 14, 2019

@w4nderlust ah yes, you were right; that was the question in #66.
OK, I will try to change \t to , 🤕

If anyone else is having the same issue:

When on macOS:

sed -i "" $'s/,/ /g' /root/spam_dataset.csv 
sed -i "" $'s/\t/,/g' /root/spam_dataset.csv

(note the "" right after -i for in-place replacement, and the ANSI-C style quoting, since macOS sed does not recognize \t)

while on linux

sed -i "" $'s/,/ /g' /root/spam_dataset.csv 
sed -i "s/\t/,/g" /root/spam_dataset.csv

We need to replace every , in the dataset with something else (a space, for example) before converting tabs to commas; otherwise embedded commas in some columns would break the resulting CSV file.

and if for some reason you have forgotten the header:

sed -i '' -e '1i\'$'\n''label,text' /root/spam_dataset.csv
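
Alternatively, the whole conversion can be done losslessly in Python with the csv module (a sketch using the same paths as above; unlike the sed approach, quoting keeps the original commas instead of turning them into spaces):

import csv

# Read tab-separated rows and rewrite them comma-separated; the default csv
# dialect quotes any field that contains a comma, so no characters are lost.
with open('/root/spam_dataset.csv', newline='') as src, \
        open('/root/spam_dataset_comma.csv', 'w', newline='') as dst:
    reader = csv.reader(src, delimiter='\t')
    writer = csv.writer(dst)
    writer.writerows(reader)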

@w4nderlust (Collaborator) commented

That's a great suggestion, a good workaround until we implement a better solution for reading TSVs and other file formats.

@aminaBm commented Mar 5, 2019

I have the same problem: my data is comma-separated, and it is still showing the same error :(

@loretoparisi (Author) commented Mar 5, 2019

@aminaBm are you sure you do not have any additional , in the text column? This happened to me too, before I normalized the text column and converted tabs to commas.

@cuggla91 commented

I have the same issue as @aminaBm. I used df = pd.read_csv('dump_20190401.csv', escapechar='\\') to try to deal with it, but somehow it is still an issue for me. I get this error for this code:

Code:
from ludwig.api import LudwigModel  # model_definition is defined elsewhere

print('creating model')
model = LudwigModel(model_definition)
print('training model')
train_stats = model.train(data_df=df)
model.close()

Error:
creating model
training model

KeyError Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3077 try:
-> 3078 return self._engine.get_loc(key)
3079 except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'text'

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
in
2 model = LudwigModel(model_definition)
3 print('training model')
----> 4 train_stats = model.train(data_df=df)
5 model.close()

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ludwig/api.py in train(self, data_df, data_train_df, data_validation_df, data_test_df, data_csv, data_train_csv, data_validation_csv, data_test_csv, data_hdf5, data_train_hdf5, data_validation_hdf5, data_test_hdf5, data_dict, train_set_metadata_json, experiment_name, model_name, model_load_path, model_resume_path, skip_save_model, skip_save_progress, skip_save_log, skip_save_processed_input, output_directory, gpus, gpu_fraction, use_horovod, random_seed, logging_level, debug, **kwargs)
448 use_horovod=use_horovod,
449 random_seed=random_seed,
--> 450 debug=debug,
451 )
452

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ludwig/train.py in full_train(model_definition, model_definition_file, data_df, data_train_df, data_validation_df, data_test_df, data_csv, data_train_csv, data_validation_csv, data_test_csv, data_hdf5, data_train_hdf5, data_validation_hdf5, data_test_hdf5, train_set_metadata_json, experiment_name, model_name, model_load_path, model_resume_path, skip_save_model, skip_save_progress, skip_save_log, skip_save_processed_input, output_directory, should_close_session, gpus, gpu_fraction, use_horovod, random_seed, debug, **kwargs)
254 skip_save_processed_input=skip_save_processed_input,
255 preprocessing_params=model_definition['preprocessing'],
--> 256 random_seed=random_seed
257 )
258

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ludwig/data/preprocessing.py in preprocess_for_training(model_definition, data_df, data_train_df, data_validation_df, data_test_df, data_csv, data_train_csv, data_validation_csv, data_test_csv, data_hdf5, data_train_hdf5, data_validation_hdf5, data_test_hdf5, train_set_metadata_json, skip_save_processed_input, preprocessing_params, random_seed)
387 data_test_df,
388 preprocessing_params,
--> 389 random_seed
390 )
391 elif data_csv is not None or data_train_csv is not None:

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ludwig/data/preprocessing.py in _preprocess_df_for_training(features, data_df, data_train_df, data_validation_df, data_test_df, preprocessing_params, random_seed)
638 features,
639 preprocessing_params,
--> 640 random_seed=random_seed
641 )
642 training_set, test_set, validation_set = split_dataset_tvt(

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ludwig/data/preprocessing.py in build_dataset_df(dataset_df, features, global_preprocessing_parameters, train_set_metadata, random_seed, **kwargs)
84 dataset_df,
85 features,
---> 86 global_preprocessing_parameters
87 )
88

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ludwig/data/preprocessing.py in build_metadata(dataset_df, features, global_preprocessing_parameters)
124 ]
125 train_set_metadata[feature['name']] = get_feature_meta(
--> 126 dataset_df[feature['name']].astype(str),
127 preprocessing_parameters
128 )

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/frame.py in getitem(self, key)
2686 return self._getitem_multilevel(key)
2687 else:
-> 2688 return self._getitem_column(key)
2689
2690 def _getitem_column(self, key):

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/frame.py in _getitem_column(self, key)
2693 # get column
2694 if self.columns.is_unique:
-> 2695 return self._get_item_cache(key)
2696
2697 # duplicate columns & possible reduce dimensionality

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
2487 res = cache.get(item)
2488 if res is None:
-> 2489 values = self._data.get(item)
2490 res = self._box_item_values(item, values)
2491 cache[item] = res

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/internals.py in get(self, item, fastpath)
4113
4114 if not isna(item):
-> 4115 loc = self.items.get_loc(item)
4116 else:
4117 indexer = np.arange(len(self.items))[isna(self.items)]

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3078 return self._engine.get_loc(key)
3079 except KeyError:
-> 3080 return self._engine.get_loc(self._maybe_cast_indexer(key))
3081
3082 indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'text'

@w4nderlust (Collaborator) commented

I'm sorry @aminaBm and @cuggla91. Those are pandas errors that suggest a probably malformed CSV. Unfortunately, if you can't share your data there isn't much I can do about it. Try cleaning up your CSV and/or changing the separator until you get a readable CSV, and then let me know which parameters of the pd.read_csv() function worked, so that I can try to improve Ludwig's CSV loading accordingly.
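
A quick way to check what pandas actually parsed (a sketch; substitute your own path and the separator you are testing):

import pandas as pd

# Load the file with the parameters you are experimenting with...
df = pd.read_csv('your_dataset.csv', sep=',')
# ...then inspect the column names Ludwig will look up. If 'text' is not
# in this list, the KeyError above is inevitable.
print(df.columns.tolist())
print(df.head())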

@Kranthiteja7 commented

I also have the same problem, but I load the text data manually using a DataFrame. How can I read the CSV file when it is separated by tabs instead of commas?

@w4nderlust (Collaborator) commented

pandas.read_csv(path, delimiter='\t')
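
Putting it together (a sketch reusing the feature definition from this thread; data_df is the DataFrame entry point of the 0.x API shown earlier):

import pandas as pd
from ludwig.api import LudwigModel

model_definition = {
    'input_features': [{'name': 'text', 'type': 'text'}],
    'output_features': [{'name': 'label', 'type': 'category'}],
}

# Read the tab-separated file into a DataFrame...
df = pd.read_csv('/root/spam_dataset.csv', delimiter='\t')
# ...and pass it to Ludwig directly, bypassing its own CSV loading.
model = LudwigModel(model_definition)
train_stats = model.train(data_df=df)
model.close()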

@rishijain07 commented

I'm also getting the same error:

INFO:ludwig.models.llm:Done.
WARNING:ludwig.utils.tokenizers:No padding token id found. Using eos_token as pad_token.
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
[<ipython-input-21-3b63728b4f1d>](https://localhost:8080/#) in <cell line: 58>()
     56 
     57 model = LudwigModel(config=qlora_fine_tuning_config, logging_level=logging.INFO)
---> 58 results = model.train(dataset=df[:10])

9 frames
/usr/local/lib/python3.10/dist-packages/ludwig/api.py in train(self, dataset, training_set, validation_set, test_set, training_set_metadata, data_format, experiment_name, model_name, model_resume_path, skip_save_training_description, skip_save_training_statistics, skip_save_model, skip_save_progress, skip_save_log, skip_save_processed_input, output_directory, random_seed, **kwargs)
    631                 update_config_with_metadata(self.config_obj, training_set_metadata)
    632                 logger.info("Warnings and other logs:")
--> 633                 self.model = LudwigModel.create_model(self.config_obj, random_seed=random_seed)
    634                 # update config with properties determined during model instantiation
    635                 update_config_with_model(self.config_obj, self.model)

/usr/local/lib/python3.10/dist-packages/ludwig/api.py in create_model(config_obj, random_seed)
   2060             config_obj = ModelConfig.from_dict(config_obj)
   2061         model_type = get_from_registry(config_obj.model_type, model_type_registry)
-> 2062         return model_type(config_obj, random_seed=random_seed)
   2063 
   2064     @staticmethod

/usr/local/lib/python3.10/dist-packages/ludwig/models/llm.py in __init__(self, config_obj, random_seed, _device, **_kwargs)
    138 
    139         self.output_features.update(
--> 140             self.build_outputs(
    141                 output_feature_configs=self.config_obj.output_features,
    142                 # Set the input size to the model vocab size instead of the tokenizer vocab size

/usr/local/lib/python3.10/dist-packages/ludwig/models/llm.py in build_outputs(cls, output_feature_configs, input_size)
    235 
    236         output_features = {}
--> 237         output_feature = cls.build_single_output(output_feature_config, output_features)
    238         output_features[output_feature_config.name] = output_feature
    239 

/usr/local/lib/python3.10/dist-packages/ludwig/models/base.py in build_single_output(feature_config, output_features)
    123         logger.debug(f"Output {feature_config.type} feature {feature_config.name}")
    124         output_feature_class = get_from_registry(feature_config.type, get_output_type_registry())
--> 125         output_feature_obj = output_feature_class(feature_config, output_features=output_features)
    126         return output_feature_obj
    127 

/usr/local/lib/python3.10/dist-packages/ludwig/features/text_feature.py in __init__(self, output_feature_config, output_features, **kwargs)
    308         **kwargs,
    309     ):
--> 310         super().__init__(output_feature_config, output_features, **kwargs)
    311 
    312     @classmethod

/usr/local/lib/python3.10/dist-packages/ludwig/features/sequence_feature.py in __init__(self, output_feature_config, output_features, **kwargs)
    344     ):
    345         super().__init__(output_feature_config, output_features, **kwargs)
--> 346         self.decoder_obj = self.initialize_decoder(output_feature_config.decoder)
    347         self._setup_loss()
    348         self._setup_metrics()

/usr/local/lib/python3.10/dist-packages/ludwig/features/base_feature.py in initialize_decoder(self, decoder_config)
    281         # Input to the decoder is the output feature's FC hidden layer.
    282         decoder_config.input_size = self.fc_stack.output_shape[-1]
--> 283         decoder_cls = get_decoder_cls(self.type(), decoder_config.type)
    284         decoder_schema = decoder_cls.get_schema_cls().Schema()
    285         decoder_params_dict = decoder_schema.dump(decoder_config)

/usr/local/lib/python3.10/dist-packages/ludwig/decoders/registry.py in get_decoder_cls(feature, name)
     30 @DeveloperAPI
     31 def get_decoder_cls(feature: str, name: str) -> Type[Decoder]:
---> 32     return get_decoder_registry()[feature][name]
     33 
     34 

/usr/local/lib/python3.10/dist-packages/ludwig/utils/registry.py in __getitem__(self, key)
     44         if self.parent and key not in self.data:
     45             return self.parent.__getitem__(key)
---> 46         return self.data.__getitem__(key)
     47 
     48     def __contains__(self, key: str):
