Flattening complex nested JSON #4749

Bing-su · 2022-07-27T01:54:32Z

I have deeply nested JSON data that I want to flatten, but .flatten doesn't seem to work on it: the features are unchanged after the call.

from datasets import load_dataset

# `files` is the path (or list of paths) to the JSON data files
dataset = load_dataset("json", data_files=files, field="named_entity", split="train")

>>> dataset.features
{'board': Value(dtype='string', id=None),
 'content': [{'labels': [{'id': Value(dtype='int64', id=None),
     'tag': Value(dtype='string', id=None),
     'text': Value(dtype='string', id=None)}],
   'sentence': Value(dtype='string', id=None)}],
 'source_site': Value(dtype='string', id=None),
 'subtitle': Sequence(feature=Value(dtype='null', id=None), length=-1, id=None),
 'title': [{'labels': [{'id': Value(dtype='int64', id=None),
     'tag': Value(dtype='string', id=None),
     'text': Value(dtype='string', id=None)}],
   'sentence': Value(dtype='string', id=None)}],
 'url': Value(dtype='string', id=None),
 'write_date': Value(dtype='string', id=None),
 'writer': Value(dtype='string', id=None)}

>>> flat_ds = dataset.flatten()
>>> flat_ds.features
{'board': Value(dtype='string', id=None),
 'content': [{'labels': [{'id': Value(dtype='int64', id=None),
     'tag': Value(dtype='string', id=None),
     'text': Value(dtype='string', id=None)}],
   'sentence': Value(dtype='string', id=None)}],
 'source_site': Value(dtype='string', id=None),
 'subtitle': Sequence(feature=Value(dtype='null', id=None), length=-1, id=None),
 'title': [{'labels': [{'id': Value(dtype='int64', id=None),
     'tag': Value(dtype='string', id=None),
     'text': Value(dtype='string', id=None)}],
   'sentence': Value(dtype='string', id=None)}],
 'url': Value(dtype='string', id=None),
 'write_date': Value(dtype='string', id=None),
 'writer': Value(dtype='string', id=None)}

Please see this Colab notebook, where I have prepared the data.

Environment

My laptop: Windows 11, python=3.9, datasets=2.4.0
Colab: python=3.7, datasets=2.4.0

Answered by mariosasko

mariosasko · 2022-08-17T11:12:04Z

Collaborator

Hi! .flatten only works with the dictionary feature type ({...}) or Sequence({...}). The latter is treated as a dictionary of sequences for consistency with TFDS, e.g. Sequence({"a": Value("bool"), "b": Value("string")}) -> {"a": Sequence(Value("bool")), "b": Sequence(Value("string"))}. Your title and content columns are lists of dictionaries ([{...}]), which is neither of these, so you'll have to use map for flattening in this case. Perhaps the docs should explain more clearly what can be flattened and what can't.
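For illustration, here is a minimal sketch of what such a map function might look like. It is not from the original answer: the helper name flatten_example, the dotted output column names (chosen to resemble what .flatten produces), and the sample record are all assumptions, and the layout you actually want may differ.

```python
def flatten_example(example, columns=("title", "content")):
    """Turn each list-of-dicts column into parallel list columns.

    Hypothetical helper: the field names (sentence, labels, id, tag, text)
    follow the features printed in the question.
    """
    flat = {}
    for col in columns:
        items = example[col]
        # One sentence per item in the column.
        flat[f"{col}.sentence"] = [item["sentence"] for item in items]
        # One inner list per item, holding that item's label values.
        for key in ("id", "tag", "text"):
            flat[f"{col}.labels.{key}"] = [
                [label[key] for label in item["labels"]] for item in items
            ]
    return flat


# Made-up record with the same shape as one row of the dataset.
example = {
    "title": [
        {
            "sentence": "Hello world",
            "labels": [{"id": 0, "tag": "GREETING", "text": "Hello"}],
        }
    ],
    "content": [
        {"sentence": "No entities here", "labels": []},
        {
            "sentence": "Paris is nice",
            "labels": [{"id": 0, "tag": "LOC", "text": "Paris"}],
        },
    ],
}

flat = flatten_example(example)
print(flat["content.labels.tag"])  # [[], ['LOC']]
```

With the dataset from the question, this would presumably be applied as flat_ds = dataset.map(flatten_example, remove_columns=["title", "content"]), which adds the new columns and drops the original nested ones. If many rows have empty label lists, type inference may need help, in which case passing an explicit features argument to map is one option.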
