Create Audio feature #2324

Merged
merged 180 commits into from Oct 13, 2021

Member

@albertvillanova albertvillanova commented May 5, 2021

Create Audio feature to handle raw audio files.

Some decisions to be further discussed:

  • I have chosen soundfile as the audio library. Another interesting library is librosa, but it requires soundfile itself (see here). If we need more advanced functionality, we could eventually switch libraries.
  • I have implemented the audio feature as an extra: pip install datasets[audio]. For the moment, the typical datasets user works only with text datasets and should not have to install additional audio/image packages they do not need.
  • For tests, I require the audio dependencies (so that all audio functionality is checked by our CI test suite). I exclude Linux platforms, which require an additional library to be installed with the distribution's package manager.
    • I also require pytest-datadir, which makes it possible to ship (audio) data files with the tests.
  • The audio data contain two fields: array and sample_rate.
  • The array is reshaped into a 1D array (the expected input for Wav2Vec2).
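
To make the last two points concrete, here is a minimal sketch of decoding a file into an array plus its sample rate, with the array flattened to 1D. It is not the PR's implementation: it uses Python's standard-library wave module instead of soundfile so that it runs without extra dependencies. The helper names write_pcm16 and decode_audio are hypothetical, and averaging the channels to obtain a 1D array is an assumption (the description above only says the array is reshaped).

```python
import struct
import tempfile
import wave
from pathlib import Path


def write_pcm16(path, frames, sample_rate, n_channels):
    """Write interleaved 16-bit PCM samples to a WAV file (test helper)."""
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(n_channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(frames)}h", *frames))


def decode_audio(path):
    """Return {"array": 1D list of floats, "sample_rate": int} for a 16-bit WAV."""
    with wave.open(str(path), "rb") as wav:
        n_channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        raw = wav.readframes(wav.getnframes())
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    # Samples are interleaved by channel; average each frame to get one value
    # per time step (assumption: downmixing is how the 1D array is obtained).
    array = [
        sum(samples[i : i + n_channels]) / n_channels / 32768.0
        for i in range(0, len(samples), n_channels)
    ]
    return {"array": array, "sample_rate": sample_rate}


with tempfile.TemporaryDirectory() as tmp:
    wav_path = Path(tmp) / "stereo.wav"
    # Three stereo frames: (0, 0), (16384, -16384), (32767, 32767)
    write_pcm16(wav_path, [0, 0, 16384, -16384, 32767, 32767], 16_000, n_channels=2)
    audio = decode_audio(wav_path)
```

Whatever the number of channels in the file, audio["array"] is a flat list with one value per frame, and audio["sample_rate"] carries the rate read from the file header.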

Note that to install soundfile on Linux, you need to install libsndfile using your distribution’s package manager, for example sudo apt-get install libsndfile1.
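
Since the audio dependencies are optional, code that decodes audio has to cope with the backend being absent. The following is only a sketch of one common way to guard an optional dependency. The helper names is_available and require_audio_backend are hypothetical and are not taken from this PR.

```python
import importlib.util


def is_available(module_name):
    """Return True if a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def require_audio_backend(module_name="soundfile"):
    """Raise a helpful ImportError when the optional audio backend is missing."""
    if not is_available(module_name):
        raise ImportError(
            f"Decoding audio requires '{module_name}'. "
            "Install it with: pip install datasets[audio] "
            "(on Linux, also install libsndfile with your package manager)."
        )
```

Calling require_audio_backend() right before decoding keeps text-only users unaffected, while audio users get an actionable message instead of a bare import failure.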

Requirements Specification

  • Access example with audio loading and resampling:
    ds[0]["audio"]
  • Map with audio loading & resampling:
    def preprocess(batch):
        batch["input_values"] = processor(batch["audio"]).input_values
        return batch
    
    ds = ds.map(preprocess)
  • Map without audio loading and resampling:
    def preprocess(batch):
        batch["labels"] = processor(batch["target_text"]).input_values
        return batch
    
    ds = ds.map(preprocess)
  • Additional requirement (see the review on Create Audio feature #2324): cast the audio column to change the sampling rate:
    ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
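
The requirements above imply that decoding and resampling happen when an example is accessed, and that casting the column only changes the target sampling rate. Below is a toy sketch of that behaviour under stated assumptions: ToyAudio, ToyDataset and resample_linear are hypothetical stand-ins, not the library's classes, and linear interpolation is used purely for illustration (the PR does not say which resampling method is used).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyAudio:
    """Hypothetical stand-in for the Audio feature: only carries the target rate."""

    sampling_rate: Optional[int] = None


def resample_linear(array, orig_rate, target_rate):
    """Resample by linear interpolation (illustrative, not production quality)."""
    if orig_rate == target_rate or len(array) < 2:
        return list(array)
    n_out = max(1, round(len(array) * target_rate / orig_rate))
    out = []
    for i in range(n_out):
        pos = i * orig_rate / target_rate  # fractional position in the source
        lo = min(int(pos), len(array) - 1)
        hi = min(lo + 1, len(array) - 1)
        frac = pos - lo
        out.append(array[lo] * (1 - frac) + array[hi] * frac)
    return out


class ToyDataset:
    """Keeps the stored audio untouched; resampling happens only on item access."""

    def __init__(self, rows, feature):
        self._rows = rows  # each row: {"array": [...], "sampling_rate": int}
        self.feature = feature

    def cast_column(self, feature):
        # Only the feature metadata changes; the stored rows are shared as-is.
        return ToyDataset(self._rows, feature)

    def __getitem__(self, index):
        row = self._rows[index]
        target = self.feature.sampling_rate or row["sampling_rate"]
        array = resample_linear(row["array"], row["sampling_rate"], target)
        return {"audio": {"array": array, "sampling_rate": target}}


toy_ds = ToyDataset([{"array": [0.0, 1.0, 0.0, -1.0], "sampling_rate": 8_000}], ToyAudio())
toy_ds16 = toy_ds.cast_column(ToyAudio(sampling_rate=16_000))
```

Here toy_ds[0] returns the audio at its original rate, while toy_ds16[0] returns it resampled to 16 kHz. A map function that never reads the audio column would not trigger any of this work.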

@albertvillanova albertvillanova added this to the 1.7 milestone May 5, 2021
@albertvillanova albertvillanova modified the milestones: 1.7, 1.8 May 31, 2021
@albertvillanova albertvillanova modified the milestones: 1.8, 1.9 Jun 8, 2021
@albertvillanova
Member Author

albertvillanova commented Oct 12, 2021

I think the last thing we need to do is make sure that cast_column changes the fingerprint of the dataset. Feel free to use the fingerprint_transform decorator, as for remove_columns.

(note that cast currently doesn't use the decorator, since it's based on map, which already updates the fingerprint).

@lhoestq note that cast_column may call cast in some cases, and the decorator would not be necessary for these cases...

  • I did it by setting inplace=False and updating the fingerprint explicitly only when cast is not called.
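
For readers unfamiliar with the fingerprint discussion, the idea can be sketched as follows: every transform derives a new fingerprint from the previous one plus the transform's name and arguments, so that the same operations on the same data give the same fingerprint. This is a simplified illustration, not the library's fingerprint_transform. The names update_fingerprint, toy_fingerprint_transform and ToyFingerprintedDataset are hypothetical, and the hashing scheme is an assumption.

```python
import functools
import hashlib
import json


def update_fingerprint(fingerprint, transform_name, kwargs):
    """Derive a new fingerprint from the old one, the transform and its arguments."""
    payload = json.dumps([fingerprint, transform_name, kwargs], sort_keys=True, default=repr)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def toy_fingerprint_transform(method):
    """Hypothetical decorator: stamp the returned dataset with an updated fingerprint."""

    @functools.wraps(method)
    def wrapper(self, **kwargs):
        result = method(self, **kwargs)
        result.fingerprint = update_fingerprint(self.fingerprint, method.__name__, kwargs)
        return result

    return wrapper


class ToyFingerprintedDataset:
    def __init__(self, columns, fingerprint="initial"):
        self.columns = dict(columns)
        self.fingerprint = fingerprint

    @toy_fingerprint_transform
    def cast_column(self, column, feature):
        # Not in place: return a new dataset and leave the original untouched.
        new_columns = dict(self.columns)
        new_columns[column] = feature
        return ToyFingerprintedDataset(new_columns)


base = ToyFingerprintedDataset({"audio": "Audio()"})
cast_16k = base.cast_column(column="audio", feature="Audio(sampling_rate=16000)")
cast_8k = base.cast_column(column="audio", feature="Audio(sampling_rate=8000)")
```

Casting to different sampling rates yields different fingerprints, repeating the same cast yields the same one, and the original dataset keeps its fingerprint because the operation is not in place.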

@albertvillanova
Member Author

albertvillanova commented Oct 12, 2021

I think the current state of this PR could be included in our next release as an experimental feature, so that we can stress-test it and try to find any potential issues. What do you think?

CC: @lhoestq @patrickvonplaten @anton-l

@anton-l
Member

anton-l commented Oct 13, 2021

Looks great! Ready to try it out on the transformers examples after the release :)

Member

@lhoestq lhoestq left a comment

This is awesome, good job @albertvillanova !

@patrickvonplaten
Contributor

Think we are good to merge here, no? :-)

5 participants