Flattening complex nested json #4749
I have deeply nested JSON data that I want to flatten, but Dataset.flatten() doesn't seem to work properly on it:

from datasets import load_dataset
dataset = load_dataset("json", data_files=files, field="named_entity", split="train")

>>> dataset.features
{'board': Value(dtype='string', id=None),
'content': [{'labels': [{'id': Value(dtype='int64', id=None),
'tag': Value(dtype='string', id=None),
'text': Value(dtype='string', id=None)}],
'sentence': Value(dtype='string', id=None)}],
'source_site': Value(dtype='string', id=None),
'subtitle': Sequence(feature=Value(dtype='null', id=None), length=-1, id=None),
'title': [{'labels': [{'id': Value(dtype='int64', id=None),
'tag': Value(dtype='string', id=None),
'text': Value(dtype='string', id=None)}],
'sentence': Value(dtype='string', id=None)}],
'url': Value(dtype='string', id=None),
'write_date': Value(dtype='string', id=None),
'writer': Value(dtype='string', id=None)}

>>> flat_ds = dataset.flatten()
>>> flat_ds.features
{'board': Value(dtype='string', id=None),
'content': [{'labels': [{'id': Value(dtype='int64', id=None),
'tag': Value(dtype='string', id=None),
'text': Value(dtype='string', id=None)}],
'sentence': Value(dtype='string', id=None)}],
'source_site': Value(dtype='string', id=None),
'subtitle': Sequence(feature=Value(dtype='null', id=None), length=-1, id=None),
'title': [{'labels': [{'id': Value(dtype='int64', id=None),
'tag': Value(dtype='string', id=None),
'text': Value(dtype='string', id=None)}],
'sentence': Value(dtype='string', id=None)}],
'url': Value(dtype='string', id=None),
'write_date': Value(dtype='string', id=None),
'writer': Value(dtype='string', id=None)}

Please see this Colab notebook; I have prepared the data here.

Environment: my laptop, Windows 11, python=3.9, datasets=2.4.0
Replies: 1 comment
Hi! .flatten only works with the dictionary feature type ({...}) or Sequence({...}), which is treated as a dictionary of sequences (e.g. Sequence({"a": Value("bool"), "b": Value("string")}) -> {"a": Sequence(Value("bool")), "b": Sequence(Value("string"))}) for consistency with TFDS. Your content and title columns are lists of dictionaries ([{...}]), which is neither of these, so you'll have to use map for flattening in this case. Perhaps the docs should explain more clearly what can be flattened and what can't.
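
As a rough illustration of the map-based approach, here is a minimal sketch of a per-example function that turns the list-of-dictionaries columns into parallel lists. The function name flatten_example, the dotted output column names, and the sample record are assumptions made for this example and are not taken from the original data; the field names (content, title, sentence, labels, id, tag, text) come from the features printed in the question.

```python
# Hypothetical helper: flattens the nested "content" and "title" columns
# of a single example into parallel lists, one output column per leaf field.
def flatten_example(example):
    out = {}
    for column in ("content", "title"):
        items = example[column]  # a list of {"sentence": ..., "labels": [...]} dicts
        out[f"{column}.sentence"] = [item["sentence"] for item in items]
        for key in ("id", "tag", "text"):
            # One inner list per sentence, holding that field for each label
            out[f"{column}.labels.{key}"] = [
                [label[key] for label in item["labels"]] for item in items
            ]
    return out


# Made-up record with the same shape as the features in the question
sample = {
    "board": "news",
    "content": [
        {
            "sentence": "Alice went to Paris.",
            "labels": [
                {"id": 0, "tag": "PER", "text": "Alice"},
                {"id": 1, "tag": "LOC", "text": "Paris"},
            ],
        }
    ],
    "title": [{"sentence": "Trip", "labels": []}],
}

flat = flatten_example(sample)

# Possible usage with the dataset from the question (not executed here):
# flat_ds = dataset.map(flatten_example, remove_columns=["content", "title"])
```

Passing remove_columns to map drops the original nested columns so that only the flattened ones remain. The output naming scheme is a design choice; any unique column names would work.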