Create Audio feature #2324

Merged on Oct 13, 2021 (180 commits)
Changes from 93 commits
c27da03
Refactor features as a package
albertvillanova May 5, 2021
4d99ff9
Move translation features into own module
albertvillanova May 5, 2021
e8a7457
Create Audio feature
albertvillanova May 5, 2021
5110c68
Fix style
albertvillanova May 5, 2021
f69a7a9
Make late import for audio library
albertvillanova May 5, 2021
fcdb110
Fix imports
albertvillanova May 5, 2021
e6e8ff1
Ignore flake8 errors
albertvillanova May 5, 2021
2d2adcd
Fix imports
albertvillanova May 5, 2021
e7d8904
Fix imports
albertvillanova May 5, 2021
585bba7
Fix imports of private classes/functions
albertvillanova May 5, 2021
827b263
Fix imports of private classes/functions
albertvillanova May 5, 2021
dc5b582
Fix test patch
albertvillanova May 7, 2021
d8e7441
Merge remote-tracking branch 'upstream/master' into features-audio
albertvillanova May 7, 2021
76cb67d
Mimic features package for tests
albertvillanova Jun 21, 2021
7b72de8
Merge remote-tracking branch 'upstream/master' into features-audio
albertvillanova Jun 21, 2021
e0b77c1
Merge remote-tracking branch 'upstream/master' into features-audio
albertvillanova Jun 25, 2021
3c28a61
Add required Feature attributes to Audio
albertvillanova Jun 25, 2021
496a1fb
Implement __call__
albertvillanova Jun 25, 2021
158c545
Add coding_format attribute
albertvillanova Jun 25, 2021
f13b7e8
Validate audio coding format
albertvillanova Jun 28, 2021
7b658f9
Add Audio docs
albertvillanova Jun 28, 2021
22f4131
Test Audio instantiation
albertvillanova Jun 28, 2021
e95d3f9
Add audio test data
albertvillanova Jun 28, 2021
100ceb1
Add Audio dependency requirements
albertvillanova Jun 28, 2021
1d29232
Fix Audio call
albertvillanova Jun 28, 2021
e51a8a9
Test Audio encode example
albertvillanova Jun 28, 2021
b16bc65
Fix style
albertvillanova Jun 28, 2021
977fa33
Add audio dependency requirements to tests
albertvillanova Jun 28, 2021
e0672e4
Merge remote-tracking branch 'upstream/master' into features-audio
albertvillanova Jun 28, 2021
e5d87a7
Skip test for linux
albertvillanova Jun 28, 2021
e02e387
Fix import of private function
albertvillanova Jun 28, 2021
340989a
Return 1D array
albertvillanova Jun 29, 2021
bd98f78
Encode example for Audio feature
albertvillanova Jun 29, 2021
4230843
Test dataset with Audio feature
albertvillanova Jun 29, 2021
bfe7517
Replace Audio encode_example with decode_example
albertvillanova Jun 29, 2021
1a230bd
Implement Features decode_example
albertvillanova Jun 29, 2021
c153e6d
Fix Audio __call__
albertvillanova Jun 29, 2021
13a1341
Test decoding of dataset with Audio feature
albertvillanova Jun 29, 2021
9501298
Replace soundfile with librosa
albertvillanova Aug 18, 2021
e9e6879
Merge remote-tracking branch 'upstream/master' into features-audio
albertvillanova Aug 18, 2021
d25b1d7
Remove array reshape
albertvillanova Aug 18, 2021
34d5b7b
Fix tests
albertvillanova Aug 18, 2021
85fa108
Flatten decode_nested_example
albertvillanova Aug 18, 2021
1342eba
Merge remote-tracking branch 'upstream/master' into features-audio
albertvillanova Sep 2, 2021
401327f
Implement Features.decode_batch
albertvillanova Sep 8, 2021
bd312d1
Decode features in _getitem
albertvillanova Sep 8, 2021
a848928
Refactor test_dataset_with_audio_feature
albertvillanova Sep 8, 2021
f91cf6b
Merge remote-tracking branch 'upstream/master' into features-audio
albertvillanova Sep 8, 2021
5dc0ba8
Fix style
albertvillanova Sep 8, 2021
42ea474
Implement PythonFeaturesDecoder
albertvillanova Sep 13, 2021
8afdb25
Compose Formatter with PythonFeaturesDecoder
albertvillanova Sep 13, 2021
e4eef07
Refactor PythonFormatter.format_row to use PythonFeaturesDecoder
albertvillanova Sep 13, 2021
80a3d06
Pass features to instantiate formatter
albertvillanova Sep 13, 2021
c314ec5
Fix style
albertvillanova Sep 13, 2021
a122552
Refactor decode_nested_example with default for the rest of features
albertvillanova Sep 13, 2021
c0de3aa
Fix missing pass features to instantiate formatter
albertvillanova Sep 13, 2021
b3214e1
Revert flatten of decode_nested_example to return nested examples
albertvillanova Sep 13, 2021
42426e8
Fix test_dataset_with_audio_feature for nested output
albertvillanova Sep 13, 2021
2067083
Return also audio path in decode_example
albertvillanova Sep 13, 2021
af5fc26
Add path to audio tests
albertvillanova Sep 13, 2021
1499f3e
Fix Formatter and NumpyFormatter init
albertvillanova Sep 13, 2021
23973f2
Fix format_table with python_formatter without features
albertvillanova Sep 13, 2021
56e0b7d
Fix PythonFeaturesDecoder.decode_row only if features
albertvillanova Sep 13, 2021
a8d836e
Fix all Formatter subclasses init
albertvillanova Sep 13, 2021
f5b1d13
Merge remote-tracking branch 'upstream/master' into features-audio
albertvillanova Sep 13, 2021
caf2153
Fix typo
albertvillanova Sep 13, 2021
6c1ea4b
Test batch in dataset with Audio feature
albertvillanova Sep 14, 2021
5ec9031
Implement PythonFeaturesDecoder.decode_batch
albertvillanova Sep 14, 2021
ead8001
Use PythonFeaturesDecoder.decode_batch in PythonFormatter.format_batch
albertvillanova Sep 14, 2021
1f1f730
Add docstrings
albertvillanova Sep 14, 2021
8930573
Add mono attribute to Audio feature
albertvillanova Sep 15, 2021
c4e7905
Test formatted dataset with Audio feature
albertvillanova Sep 21, 2021
4623cdd
Implement ArrowFeaturesDecoder
albertvillanova Sep 21, 2021
a7071cc
Compose Formatter with ArrowFeaturesDecoder
albertvillanova Sep 21, 2021
90c5873
Make NumpyFormatter.format_row use SimpleArrowExtractor and ArrowFeat…
albertvillanova Sep 21, 2021
217782c
Merge remote-tracking branch 'upstream/master' into features-audio
albertvillanova Sep 21, 2021
8587ea1
Fix decode_nested_example to decode only keys present in example
albertvillanova Sep 21, 2021
6a916e4
Refactor NumpyFormatter.format_row
albertvillanova Sep 21, 2021
b648074
Test pandas formatted dataset with Audio feature
albertvillanova Sep 21, 2021
5f274e1
Implement PandasFeaturesDecoder
albertvillanova Sep 21, 2021
7dac12a
Compose Formatter with PandasFeaturesDecoder
albertvillanova Sep 21, 2021
7af69e6
Make PandasFormatter.format_row use PandasFeaturesDecoder
albertvillanova Sep 21, 2021
4e34d21
Fix PandasFeaturesDecoder.decode_row for None features and keys not i…
albertvillanova Sep 21, 2021
77f67a8
Fix PandasFeaturesDecoder.decode_row to call transform only for featu…
albertvillanova Sep 21, 2021
2d87e96
Fix unused imports
albertvillanova Sep 21, 2021
8ede53b
Remove ArrowFeaturesDecoder and _nest
albertvillanova Sep 21, 2021
53e18bd
Fix typo
albertvillanova Sep 21, 2021
fdb050e
Remove test skip if linux
albertvillanova Sep 21, 2021
329abc5
Revert "Remove test skip if linux"
albertvillanova Sep 21, 2021
da7edc5
Fix PandasFeaturesDecoder.decode_row to transform and assign transfor…
albertvillanova Sep 22, 2021
149dc51
Make Audio instances hashable
albertvillanova Sep 22, 2021
109c186
Make Audio.decode_example return original value if dependencies not i…
albertvillanova Sep 22, 2021
0c66689
Fix style
albertvillanova Sep 22, 2021
b3939b1
Test audio resampling
albertvillanova Sep 22, 2021
f2e29ac
Test Audio feature decode mp3
albertvillanova Sep 22, 2021
9c09f4c
Refactor Audio.decode_example to support mp3 with torchaudio
albertvillanova Sep 22, 2021
0fd87d7
Fix style
albertvillanova Sep 22, 2021
4c08a52
Fix logic in Audio.decode_example
albertvillanova Sep 23, 2021
7c16d8a
Require torchaudio dependency for tests
albertvillanova Sep 23, 2021
fa3e068
Require torch to test audio mp3
albertvillanova Sep 23, 2021
423dbba
Refactor decoding with torchaudio with mono and librosa resampling
albertvillanova Sep 23, 2021
8174fc2
Set sox_io backend when decoding with torchaudio
albertvillanova Sep 23, 2021
5881d9a
Fix test_audio with more specific pytest markers
albertvillanova Sep 23, 2021
7c0bd9d
Fix unused torchaudio.functional
albertvillanova Sep 23, 2021
443d4e0
Fix requirement of sndfile
albertvillanova Sep 24, 2021
b0e345f
Fix require_sox
albertvillanova Sep 24, 2021
6a23bdb
Refactor import of find_spec
albertvillanova Sep 24, 2021
09ac25f
Revert torchaudio resampling using librosa
albertvillanova Sep 24, 2021
0b36596
Simplify torchaudio resampling
albertvillanova Sep 24, 2021
15cea1e
Merge remote-tracking branch 'upstream/master' into features-audio
albertvillanova Sep 24, 2021
799b138
Fix require_sndfile
albertvillanova Sep 24, 2021
e64b30a
Implement conditionally decoding
albertvillanova Sep 24, 2021
22984b0
Implement decoded param in _getitem
albertvillanova Sep 24, 2021
4069c82
Pass decoded=False when iterating in map
albertvillanova Sep 24, 2021
86d33a1
Rename sampling_rate
albertvillanova Sep 24, 2021
2e5dae9
Fix NumpyFormatter
albertvillanova Sep 24, 2021
f58f802
Test dataset with not decoded Audio feature
albertvillanova Sep 24, 2021
3597d44
Test map dataset with Audio feature is decoded
albertvillanova Sep 27, 2021
d51f1a6
Use lazy dict to decorate arg of mapped function
albertvillanova Sep 27, 2021
e9188ba
Fix decorator in map
albertvillanova Sep 27, 2021
e108697
Fix sampling_rate by torchaudio
albertvillanova Sep 27, 2021
13ff8aa
Fix test decoding mp3
albertvillanova Sep 27, 2021
bacf5d1
Add GitHub Action for audio CI tests
albertvillanova Sep 27, 2021
3169d0f
Remove audio dependencies from test dependencies
albertvillanova Sep 27, 2021
1548bca
Comment unused audio pytest marker
albertvillanova Sep 27, 2021
3ce50cd
Refactor LazyDict
albertvillanova Sep 27, 2021
9d7c3e8
Fix tests
albertvillanova Sep 27, 2021
ac7ef25
Run audio tests in parallel
albertvillanova Sep 28, 2021
46cdebd
Rename audio test job
albertvillanova Sep 28, 2021
4a76773
Merge remote-tracking branch 'upstream/master' into features-audio
albertvillanova Sep 28, 2021
902d173
Call parallel test on Linux as on Windows
albertvillanova Sep 28, 2021
02b1572
Implement Audio decode_batch
albertvillanova Sep 28, 2021
b7cf206
Merge remote-tracking branch 'upstream/master' into features-audio
albertvillanova Sep 28, 2021
9a1b63d
Test batched map dataset with Audio feature is decoded
albertvillanova Sep 29, 2021
835a4d4
Fix _map_single to avoid decoding of batches
albertvillanova Sep 29, 2021
87cb4fd
Implement Example and Batch from LazyDict
albertvillanova Sep 29, 2021
0c46754
Decorate mapped function with Example/Batch lazy dict
albertvillanova Sep 29, 2021
137b29e
Fix PythonFormatter for batch conditional decoding
albertvillanova Sep 29, 2021
58e6a01
Fix typo
albertvillanova Sep 29, 2021
c68505a
Refactor iter
albertvillanova Sep 29, 2021
955bd01
Fix _map_single for batched
albertvillanova Sep 29, 2021
07b6bc5
Refactor _get_item
albertvillanova Sep 29, 2021
0f80e6e
Fix style
albertvillanova Sep 29, 2021
cd5311b
Refactor resampling using torchaudio
albertvillanova Sep 29, 2021
98d962d
Merge remote-tracking branch 'upstream/master' into features-audio
albertvillanova Oct 6, 2021
c26a60e
Add docstring and comments to decorate function
albertvillanova Oct 6, 2021
c22cb6b
Remove comment
albertvillanova Oct 6, 2021
c0d6eec
Test batch numpy formatted dataset with Audio feature
albertvillanova Oct 6, 2021
efba528
Make NumpyFormatter.format_batch use decoder
albertvillanova Oct 6, 2021
71891f0
Test batch pandas formatted dataset with Audio feature
albertvillanova Oct 6, 2021
57720c7
Implement PandasFeaturesDecoder.decode_batch
albertvillanova Oct 6, 2021
da5d138
Make PandasFormatter.format_batch use decoder
albertvillanova Oct 6, 2021
a8ffee8
Fix style
albertvillanova Oct 6, 2021
ec32d2f
Make CustomFormatter use decoder
albertvillanova Oct 6, 2021
3aee1f1
Change Features.decode_example/batch
albertvillanova Oct 7, 2021
f5bd62a
Fix CustomFormatter
albertvillanova Oct 7, 2021
0ce0832
Test resampling when loading a dataset
albertvillanova Oct 8, 2021
e85c61b
Test resampling after loading a dataset
albertvillanova Oct 8, 2021
53d6d73
Implement cast a column to a feature for decoding w/o caching
albertvillanova Oct 8, 2021
633ef09
Make cast_column call cast if not decoding
albertvillanova Oct 8, 2021
9c56ee9
Test decode column in dataset with Audio feature
albertvillanova Oct 12, 2021
2afa6b6
Implement Features.decode_column
albertvillanova Oct 12, 2021
7088e17
Implement PythonFeaturesDecoder.decode_column
albertvillanova Oct 12, 2021
4dd3160
Make PythonFormatter.format_column use PythonFeaturesDecoder
albertvillanova Oct 12, 2021
377cfcb
Add type hints and rename column to column_name
albertvillanova Oct 12, 2021
c0515f2
Merge remote-tracking branch 'upstream/master' into features-audio
albertvillanova Oct 12, 2021
a4a6f25
Fix typo
albertvillanova Oct 12, 2021
7854ea9
Update fingerprint within cast_column
albertvillanova Oct 12, 2021
bf94ac9
Improve docstring
albertvillanova Oct 12, 2021
eb7b8c5
Test decode column in formatted dataset with Audio feature
albertvillanova Oct 12, 2021
0678af2
Make NumpyFormatter.format_column use PythonFeaturesDecoder
albertvillanova Oct 12, 2021
b1ed47e
Implement PandasFeaturesDecoder.decode_column
albertvillanova Oct 12, 2021
a05c9a5
Make PandasFormatter.format_column use PandasFeaturesDecoder
albertvillanova Oct 12, 2021
97bf4ed
Rename variable in PandasFeaturesDecoder.decode_row
albertvillanova Oct 12, 2021
583b24d
Merge remote-tracking branch 'upstream/master' into features-audio
albertvillanova Oct 13, 2021
98bcbc8
Add type hint to cast_column
albertvillanova Oct 13, 2021
d29cfee
Implement cast_column for DatasetDict
albertvillanova Oct 13, 2021
556a7a1
Add cast_column to API docs
albertvillanova Oct 13, 2021
50cbe21
Add example of cast_column to docs how-to guide
albertvillanova Oct 13, 2021
87943c2
Fix Sphinx role name
albertvillanova Oct 13, 2021
3 changes: 3 additions & 0 deletions docs/source/package_reference/main_classes.rst
@@ -117,6 +117,9 @@ Dictionary with split names as keys ('train', 'test' for example), and :obj:`dat
.. autoclass:: datasets.Array5D
    :members:

.. autoclass:: datasets.Audio
    :members:

``MetricInfo``
~~~~~~~~~~~~~~~~~~~~~

7 changes: 7 additions & 0 deletions setup.py
@@ -107,6 +107,10 @@
    "packaging",
]

AUDIO_REQUIRE = [
    "librosa",
]

BENCHMARKS_REQUIRE = [
    "numpy==1.18.5",
    "tensorflow==2.3.0",
@@ -118,6 +122,7 @@
    # test dependencies
    "absl-py",
    "pytest",
    "pytest-datadir",
    "pytest-xdist",
    # optional dependencies
    "apache-beam>=2.26.0",
@@ -179,11 +184,13 @@
    ]
)

TESTS_REQUIRE += AUDIO_REQUIRE

QUALITY_REQUIRE = ["black==21.4b0", "flake8==3.7.9", "isort", "pyyaml>=5.3.1"]


EXTRAS_REQUIRE = {
    "audio": AUDIO_REQUIRE,
    "apache-beam": ["apache-beam>=2.26.0"],
    "tensorflow": ["tensorflow>=2.2.0"],
    "tensorflow_gpu": ["tensorflow-gpu>=2.2.0"],
1 change: 1 addition & 0 deletions src/datasets/__init__.py
@@ -42,6 +42,7 @@
    Array3D,
    Array4D,
    Array5D,
    Audio,
    ClassLabel,
    Features,
    Sequence,
4 changes: 2 additions & 2 deletions src/datasets/arrow_dataset.py
@@ -1598,7 +1598,7 @@ def set_format(

        # Check that the format_type and format_kwargs are valid and make it possible to have a Formatter
        type = get_format_type_from_alias(type)
        _ = get_formatter(type, **format_kwargs)
        _ = get_formatter(type, features=self.features, **format_kwargs)

        # Check filter column
        if isinstance(columns, str):
@@ -1765,7 +1765,7 @@ def _getitem(
        Can be used to index columns (by string names) or rows (by integer index, slices, or iter of indices or bools)
        """
        format_kwargs = format_kwargs if format_kwargs is not None else {}
        formatter = get_formatter(format_type, **format_kwargs)
        formatter = get_formatter(format_type, features=self.features, **format_kwargs)
        pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
        formatted_output = format_table(
            pa_subtable, key, formatter=formatter, format_columns=format_columns, output_all_columns=output_all_columns
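The two `get_formatter` calls above now receive `features=self.features`, which lets a formatter decode values on access. A minimal sketch of that idea follows; `UpperCaseFeature` and `SketchFormatter` are hypothetical stand-ins invented for illustration and are not part of the `datasets` API.

```python
from typing import Any, Dict, Optional


class UpperCaseFeature:
    """Hypothetical feature whose decoding upper-cases a stored string."""

    def decode_example(self, value: str) -> str:
        return value.upper()


class SketchFormatter:
    """Hypothetical formatter that decodes a row only when features are given."""

    def __init__(self, features: Optional[Dict[str, Any]] = None):
        self.features = features

    def format_row(self, row: Dict[str, Any]) -> Dict[str, Any]:
        if self.features is None:
            # Without features there is nothing to decode: return the raw row.
            return dict(row)
        formatted = {}
        for key, value in row.items():
            feature = self.features.get(key)
            # Only features exposing decode_example change the stored value.
            if hasattr(feature, "decode_example"):
                value = feature.decode_example(value)
            formatted[key] = value
        return formatted


row = {"text": "hello", "label": 1}
raw = SketchFormatter().format_row(row)
decoded = SketchFormatter(features={"text": UpperCaseFeature()}).format_row(row)
```

Keeping `features` optional mirrors the commits that guard decoding with "only if features", so a formatter built without features still returns rows unchanged.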
11 changes: 11 additions & 0 deletions src/datasets/features/__init__.py
@@ -0,0 +1,11 @@
# flake8: noqa
from .audio import Audio
from .features import *
from .features import (
    _ArrayXD,
    _ArrayXDExtensionType,
    _arrow_to_datasets_dtype,
    _cast_to_python_objects,
    _is_zero_copy_only,
)
from .translation import Translation, TranslationVariableLanguages
43 changes: 43 additions & 0 deletions src/datasets/features/audio.py
@@ -0,0 +1,43 @@
from dataclasses import dataclass, field
from typing import Any, ClassVar, Optional

import pyarrow as pa


@dataclass(frozen=True)
class Audio:
    """Audio Feature to extract audio data from an audio file.

    Args:
        sampling_rate (:obj:`int`, optional): Target sampling rate. If `None`, the native sampling rate is used.
        mono (:obj:`bool`, default ``True``): Whether to convert the audio signal to mono by averaging samples across channels.
    """

    sampling_rate: Optional[int] = None
    mono: bool = True
    id: Optional[str] = None
    # Automatically constructed
    dtype: ClassVar[str] = "dict"
    pa_type: ClassVar[Any] = None
    _type: str = field(default="Audio", init=False, repr=False)

    def __call__(self):
        return pa.string()

    def decode_example(self, value):
        """Decode example audio file into audio data.

        Args:
            value: Audio file path.

        Returns:
            dict
        """
        # Late import: librosa is an optional dependency, so fall back to the raw value without it.
        try:
            import librosa
        except ImportError:
            return value

        with open(value, "rb") as f:
Member

You can use xopen (currently in streaming_download_manager.py, but feel free to move it if necessary) to support reading from remote files in streaming mode :)

Contributor

This is actually a bit problematic, as it will always fail if the value is an .mp3 file (which is Common Voice's format). To fix it, I think we should do the following:

try:
    array, sample_rate = librosa.load(value, sr=self.sampling_rate, mono=self.mono)
except audioread.NoBackendError as e:
    extension = value.split(".")[-1]
    raise ImportError(f"`librosa` is not able to load audio of format .{extension}. If you are trying to load MP3 audio files, please make sure to install `ffmpeg` as described here: https://github.com/librosa/librosa#audioread-and-mp3-support") from e

-> We should also add a couple of tests for .mp3 IMO (one that fails gracefully and one that correctly loads MP3 with ffmpeg installed).
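The "fails gracefully" behaviour suggested in this comment can be sketched without any audio library installed. The block below is only an illustration under stated assumptions: `NoBackendError` is a local stand-in for `audioread.NoBackendError`, and `load_or_explain` and `failing_loader` are hypothetical helpers, not code from this PR.

```python
import os


class NoBackendError(Exception):
    """Stand-in for audioread.NoBackendError so the sketch runs without audioread."""


def load_or_explain(path, loader):
    """Call loader(path); turn a missing-backend failure into an actionable ImportError."""
    try:
        return loader(path)
    except NoBackendError as err:
        extension = os.path.splitext(path)[1].lstrip(".") or "unknown"
        raise ImportError(
            f"Unable to decode audio of format .{extension}. "
            "For MP3 files, make sure a decoding backend such as ffmpeg is installed."
        ) from err


def failing_loader(path):
    # Simulates a decoder that has no backend for this format.
    raise NoBackendError()


try:
    load_or_explain("clip.mp3", failing_loader)
except ImportError as err:
    message = str(err)
```

Chaining with `from err` keeps the original backend failure in the traceback, which helps when the real cause is something other than a missing ffmpeg install.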

Contributor

Would this be problematic for streaming, @lhoestq? Sadly, I don't think there is a way to open the file first and then pass the binary buffer to librosa for MP3.

Member

For audio file support, I would like to again link to that excellent comment by adefossez: huggingface/transformers#11337 (comment)

Shouldn't we have support for torchaudio too?

Contributor

For MP3, torchaudio is the best choice IMO as well. It's a pretty heavy dependency because of torch, but it's clearly the fastest: in my experiments, torchaudio gives a 50x speed-up compared to librosa for MP3. This is pretty significant. For distributed training, the librosa fallback for MP3 actually makes the preprocessing take more time than the actual training.

Could we maybe add something like "if format == mp3", then check whether torchaudio is installed and, if not, throw an error saying that torchaudio needs to be installed for MP3?

Having played around with the feature a bit, librosa + ffmpeg is very, very slow compared to torchaudio, and I think the better choice here is to actually throw an error instead of falling back to librosa + ffmpeg.

@anton-l, do you know of any fast decoding libraries apart from torchaudio? What do you think?

Member Author

@patrickvonplaten if I understand correctly, MP3 is not supported by torchaudio on Windows machines. So there would be no MP3 support for Windows?

Contributor

Yeah, Windows users will have to implement their own decoding then.
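The policy proposed in this thread (require torchaudio for MP3 and raise an error rather than fall back to librosa + ffmpeg) could look roughly like the sketch below. It is not the code merged in this PR: `choose_backend` and `is_installed` are hypothetical names, and the availability check is injectable only so the logic can be exercised without either library installed.

```python
import os
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if the module can be found without importing it."""
    return find_spec(module_name) is not None


def choose_backend(path, is_available=is_installed):
    """Pick a decoding backend name for an audio file path (hypothetical policy)."""
    extension = os.path.splitext(path)[1].lower()
    if extension == ".mp3":
        # Per the discussion: raise instead of silently using a much slower decoder.
        if not is_available("torchaudio"):
            raise ImportError("Decoding MP3 files requires 'torchaudio' to be installed.")
        return "torchaudio"
    if not is_available("librosa"):
        raise ImportError("Decoding audio files requires 'librosa' to be installed.")
    return "librosa"
```

For example, `choose_backend("clip.mp3")` returns `"torchaudio"` when that package is importable and raises `ImportError` otherwise.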

            array, sample_rate = librosa.load(f, sr=self.sampling_rate, mono=self.mono)
        return {"path": value, "array": array, "sampling_rate": sample_rate}
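To make the return value of `decode_example` concrete, here is a standard-library sketch that produces the same `{"path", "array", "sampling_rate"}` layout for a 16-bit PCM WAV file. It is an assumption-laden stand-in, not the PR's implementation (which calls `librosa.load`): `decode_wav` is a hypothetical helper, it does no resampling, and it returns plain Python lists rather than NumPy arrays.

```python
import os
import struct
import tempfile
import wave


def decode_wav(path, mono=True):
    """Decode a 16-bit PCM WAV file into the dict layout used by decode_example."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sampling_rate = wav_file.getframerate()
        if wav_file.getsampwidth() != 2:
            raise ValueError("This sketch only handles 16-bit PCM samples.")
        frames = wav_file.readframes(wav_file.getnframes())

    # Interleaved little-endian signed 16-bit samples, scaled to [-1.0, 1.0).
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)
    scaled = [sample / 32768.0 for sample in samples]
    # Group interleaved samples into one list per frame.
    per_frame = [scaled[i : i + channels] for i in range(0, len(scaled), channels)]
    if mono:
        # Average across channels, as the Audio docstring describes.
        array = [sum(frame) / channels for frame in per_frame]
    else:
        # One list per channel.
        array = [list(channel) for channel in zip(*per_frame)]
    return {"path": path, "array": array, "sampling_rate": sampling_rate}


with tempfile.TemporaryDirectory() as tmp_dir:
    wav_path = os.path.join(tmp_dir, "stereo.wav")
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(2)
        wav_file.setsampwidth(2)
        wav_file.setframerate(16000)
        # Two stereo frames: (16384, -16384) and (8192, 8192).
        wav_file.writeframes(struct.pack("<4h", 16384, -16384, 8192, 8192))
    decoded = decode_wav(wav_path)
```

With `mono=True` the two stereo frames collapse to one value each, so `decoded["array"]` has two entries and `decoded["sampling_rate"]` is the rate stored in the file header.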
159 changes: 44 additions & 115 deletions src/datasets/features.py → src/datasets/features/features.py
@@ -35,8 +35,10 @@
from pyarrow.lib import TimestampType
from pyarrow.types import is_boolean, is_primitive

from . import config, utils
from .utils.logging import get_logger
from datasets import config, utils
from datasets.features.audio import Audio
from datasets.features.translation import Translation, TranslationVariableLanguages
from datasets.utils.logging import get_logger


logger = get_logger(__name__)
@@ -700,119 +702,6 @@ def _load_names_from_file(names_filepath):
        return [name.strip() for name in f.read().split("\n") if name.strip()]  # Filter empty names


@dataclass
class Translation:
    """`FeatureConnector` for translations with fixed languages per example.
    Here for compatibility with tfds.

    Input: The Translate feature accepts a dictionary for each example mapping
        string language codes to string translations.

    Output: A dictionary mapping string language codes to translations as `Text`
        features.

    Example::

        # At construction time:

        datasets.features.Translation(languages=['en', 'fr', 'de'])

        # During data generation:

        yield {
            'en': 'the cat',
            'fr': 'le chat',
            'de': 'die katze'
        }
    """

    languages: List[str]
    id: Optional[str] = None
    # Automatically constructed
    dtype: ClassVar[str] = "dict"
    pa_type: ClassVar[Any] = None
    _type: str = field(default="Translation", init=False, repr=False)

    def __call__(self):
        return pa.struct({lang: pa.string() for lang in sorted(self.languages)})


@dataclass
class TranslationVariableLanguages:
    """`FeatureConnector` for translations with variable languages per example.
    Here for compatibility with tfds.

    Input: The TranslationVariableLanguages feature accepts a dictionary for each
        example mapping string language codes to one or more string translations.
        The languages present may vary from example to example.

    Output:
        language: variable-length 1D tf.Tensor of tf.string language codes, sorted
            in ascending order.
        translation: variable-length 1D tf.Tensor of tf.string plain text
            translations, sorted to align with language codes.

    Example::

        # At construction time:

        datasets.features.TranslationVariableLanguages(languages=['en', 'fr', 'de'])

        # During data generation:

        yield {
            'en': 'the cat',
            'fr': ['le chat', 'la chatte'],
            'de': 'die katze'
        }

        # Tensor returned :

        {
            'language': ['de', 'en', 'fr', 'fr'],
            'translation': ['die katze', 'the cat', 'la chatte', 'le chat'],
        }
    """

    languages: Optional[List] = None
    num_languages: Optional[int] = None
    id: Optional[str] = None
    # Automatically constructed
    dtype: ClassVar[str] = "dict"
    pa_type: ClassVar[Any] = None
    _type: str = field(default="TranslationVariableLanguages", init=False, repr=False)

    def __post_init__(self):
        self.languages = list(sorted(list(set(self.languages)))) if self.languages else None
        self.num_languages = len(self.languages) if self.languages else None

    def __call__(self):
        return pa.struct({"language": pa.list_(pa.string()), "translation": pa.list_(pa.string())})

    def encode_example(self, translation_dict):
        lang_set = set(self.languages)
        if self.languages and set(translation_dict) - lang_set:
            raise ValueError(
                "Some languages in example ({0}) are not in valid set ({1}).".format(
                    ", ".join(sorted(set(translation_dict) - lang_set)), ", ".join(lang_set)
                )
            )

        # Convert dictionary into tuples, splitting out cases where there are
        # multiple translations for a single language.
        translation_tuples = []
        for lang, text in translation_dict.items():
            if isinstance(text, str):
                translation_tuples.append((lang, text))
            else:
                translation_tuples.extend([(lang, el) for el in text])

        # Ensure translations are in ascending order by language code.
        languages, translations = zip(*sorted(translation_tuples))

        return {"language": languages, "translation": translations}


@dataclass
class Sequence:
    """Construct a list of features from a single type or a dict of types.
@@ -841,6 +730,7 @@ class Sequence:
    Array3D,
    Array4D,
    Array5D,
    Audio,
]


@@ -915,6 +805,20 @@ def encode_nested_example(schema, obj):
    return obj


def decode_nested_example(feature, example):
    if isinstance(feature, dict):
        return {
            col: decode_nested_example(col_feature, col_example)
            for col, (col_feature, col_example) in utils.zip_dict(
                {key: value for key, value in feature.items() if key in example}, example
            )
        }
    elif isinstance(feature, Audio):
        return feature.decode_example(example)
    else:
        return example


def generate_from_dict(obj: Any):
    """Regenerate the nested feature object from a deserialized dict.
    We use the '_type' fields to get the dataclass name to load.
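The recursion in `decode_nested_example` above can be tried in isolation with a small sketch. The names here are assumptions for illustration: `FakeAudio` stands in for `Audio` so that no audio library is needed, and `decode_nested` replaces `utils.zip_dict` with a plain dict comprehension that skips feature keys missing from the example.

```python
class FakeAudio:
    """Hypothetical stand-in for a decodable feature such as Audio."""

    def decode_example(self, value):
        return {"path": value, "array": [], "sampling_rate": None}


def decode_nested(feature, example):
    """Recursively decode only the feature keys that are present in the example."""
    if isinstance(feature, dict):
        return {
            key: decode_nested(sub_feature, example[key])
            for key, sub_feature in feature.items()
            if key in example
        }
    if isinstance(feature, FakeAudio):
        return feature.decode_example(example)
    # Every other feature type is returned unchanged.
    return example


features = {"audio": FakeAudio(), "meta": {"speaker": "string"}, "unused": "string"}
example = {"audio": "clip.wav", "meta": {"speaker": "anna"}}
decoded = decode_nested(features, example)
```

Only the `audio` value changes shape; the nested `meta` dict passes through untouched and the `unused` feature is ignored because the example has no such key.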
@@ -1080,6 +984,31 @@ def encode_batch(self, batch):
            encoded_batch[key] = [encode_nested_example(self[key], obj) for obj in column]
        return encoded_batch

    def decode_example(self, example):
        """Decode example with custom feature decoding.

        Args:
            example (:obj:`dict[str, Any]`): Dataset row data.

        Returns:
            :obj:`dict[str, Any]`
        """
        return decode_nested_example(self, example)

    def decode_batch(self, batch):
        """Decode batch with custom feature decoding.

        Args:
            batch (:obj:`dict[str, list[Any]]`): Dataset batch data.

        Returns:
            :obj:`dict[str, list[Any]]`
        """
        decoded_batch = {}
        for key, column in batch.items():
            decoded_batch[key] = [decode_nested_example(self[key], obj) for obj in column]
        return decoded_batch

    def copy(self) -> "Features":
        """
        Make a deep copy of Features.
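`decode_batch` differs from `decode_example` only in data layout: a batch maps each column name to a list of values, so decoding runs per column. A minimal sketch under assumed names (`PathLength`, `decode_value`, and this free-standing `decode_batch` are hypothetical and handle flat features only):

```python
class PathLength:
    """Hypothetical decodable feature: decodes a path into its length."""

    def decode_example(self, value):
        return len(value)


def decode_value(feature, value):
    """Decode one value if its feature supports decoding, else return it unchanged."""
    if hasattr(feature, "decode_example"):
        return feature.decode_example(value)
    return value


def decode_batch(features, batch):
    """Decode a dict of columns, one list of values per column name."""
    return {
        key: [decode_value(features[key], value) for value in column]
        for key, column in batch.items()
    }


features = {"audio": PathLength(), "label": "int64"}
batch = {"audio": ["a.wav", "bb.wav"], "label": [0, 1]}
decoded_batch = decode_batch(features, batch)
```

Each column keeps its length and order, so row `i` of the decoded batch matches what decoding row `i` on its own would give.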