[SequenceFeatureExtractor] Rewrite padding logic from pure python to numpy #13650
Conversation
     x = np.subtract(x, mean)
     if normalize_vars:
-        var = square_sums / x[:input_length].shape[0] - mean ** 2
-        std = np.sqrt(np.maximum(var, 1e-10))
+        std = x[:input_length].std(axis=0)
Switched this logic to pure numpy to squeeze out a bit more precision when working with `np.float32`, instead of Python `float`s cast to `np.float64`.
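For context, here is a minimal sketch (hypothetical helper, not the actual feature extractor code) of the normalization the diff above is about: statistics are computed over the valid (unpadded) region, staying in `float32` throughout, and the two-pass `np.std` replaces the less precise `E[x^2] - E[x]^2` formulation.

```python
import numpy as np

def zero_mean_unit_var(x: np.ndarray, input_length: int) -> np.ndarray:
    # Hypothetical helper mirroring the diff: normalize a padded waveform
    # using statistics from its valid (unpadded) region only.
    mean = x[:input_length].mean(axis=0)
    x = np.subtract(x, mean)
    # Two-pass std on the centered data replaces the old
    # square_sums / n - mean**2 computation.
    std = x[:input_length].std(axis=0)
    return np.divide(x, std + 1e-10)

signal = np.sin(np.linspace(0, 8 * np.pi, 100)).astype(np.float32)
padded = np.concatenate([signal, np.zeros(20, dtype=np.float32)])
normed = zero_mean_unit_var(padded, input_length=100)
```

The valid region of `normed` comes out with mean approximately 0 and standard deviation approximately 1, without ever leaving `float32`.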
-    if is_batched and not isinstance(raw_speech[0], np.ndarray):
-        raw_speech = [np.asarray(speech) for speech in raw_speech]
+    if is_batched:
+        raw_speech = [np.asarray(speech, dtype=np.float32) for speech in raw_speech]
Forcing float32 here for consistency with other feature extractors
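As a rough illustration of the cast (assumed inputs, not the exact extractor code): batched speech may arrive as plain Python lists or `float64` arrays, and the changed line funnels everything into `np.float32` in one place.

```python
import numpy as np

# Batched input can be lists of floats or float64 arrays; the new code
# casts every element of the batch to float32 for consistency.
raw_speech = [[0.1, 0.2, 0.3], np.zeros(5, dtype=np.float64)]

is_batched = bool(raw_speech) and isinstance(raw_speech[0], (list, np.ndarray))
if is_batched:
    raw_speech = [np.asarray(speech, dtype=np.float32) for speech in raw_speech]
```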
yes!
Thanks for fixing!
LGTM!
@@ -724,7 +724,7 @@ def map_to_array(batch):
         return batch

     ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
-    ds = ds.select(range(num_samples)).map(map_to_array)
+    ds = ds.sort("id").select(range(num_samples)).map(map_to_array)
Unrelated to this PR: the new `datasets` version re-shuffled this dataset, so sorting is needed for reproducibility.
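The ordering issue can be illustrated with plain Python (a toy stand-in for the `datasets` objects, with made-up ids): selecting the first N rows is only deterministic across re-shuffled releases after sorting on a stable key, which is what `ds.sort("id")` provides.

```python
# Toy stand-in for the dataset: the same rows shipped in two different orders.
rows_v1 = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
rows_v2 = [rows_v1[2], rows_v1[0], rows_v1[1]]  # re-shuffled release

def first_ids(rows, n):
    # Mirrors ds.sort("id").select(range(n)): sort on the stable key, then take n.
    return [r["id"] for r in sorted(rows, key=lambda r: r["id"])][:n]
```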
     output = speech_recognizer(waveform)
     self.assertEqual(output, {"text": ""})

     from datasets import load_dataset

-    ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
+    ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation").sort("id")
     filename = ds[0]["file"]
Unrelated to this PR: the new `datasets` version re-shuffled this dataset, so sorting is needed for reproducibility.
@@ -42,9 +42,9 @@ def test_torch_small(self):
         tokenizer="facebook/s2t-small-mustc-en-fr-st",
         framework="pt",
     )
-    waveform = np.zeros((34000,))
+    waveform = np.linspace(0, 1, 34000, dtype=np.float32)
Now that the slight variability due to type casting is eliminated, this input needs to be something specific: the model becomes unstable (non-reproducible across environments) with all-zero inputs.
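A quick numpy check (using the shapes from the diff above) shows why the all-zero input is degenerate for zero-mean/unit-variance normalization while the ramp is well-conditioned:

```python
import numpy as np

flat = np.zeros((34000,), dtype=np.float32)
ramp = np.linspace(0, 1, 34000, dtype=np.float32)

# Zero variance means unit-variance normalization divides by ~0,
# amplifying whatever float noise each environment happens to produce.
flat_std = flat.std()
# A linear ramp on [0, 1] has std ~ 1/sqrt(12) ~ 0.289: stable to normalize.
ramp_std = ramp.std()
```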
Looks good to me! Left a couple of nits:

- Think we don't have to use `np.int64` -> we are using `tf.int32` everywhere
- Would be nice to add `jax` to the `to_py_obj` and `to_numpy_obj` functions
- Do we still need a higher variance for the `no-padding` test for `Speech2Text`?

Also, did you notice a speed-up for larger inputs?
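A sketch of what the suggested `jax` addition to the conversion helpers might look like (hypothetical names mirroring `to_py_obj`/`to_numpy_obj`; the real transformers helpers use their own framework-availability checks):

```python
import numpy as np

def _is_jax_array(obj):
    # Only import jax if it is installed; otherwise nothing is a jax array.
    try:
        import jax.numpy as jnp
    except ImportError:
        return False
    return isinstance(obj, jnp.ndarray)

def to_numpy_obj(obj):
    # Hypothetical converter: jax arrays (like torch/tf tensors in the
    # real helpers) are materialized into plain numpy arrays.
    if _is_jax_array(obj):
        return np.asarray(obj)
    if isinstance(obj, (list, tuple)):
        return np.array(obj)
    return obj
```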
Super cool, LGTM @anton-l!
@patrickvonplaten the benchmarking results are pretty promising:
Great job @anton-l - feel free to merge!
…numpy (huggingface#13650)

* Test np padding
* Pass feature extraction tests
* Update type hints
* Fix flaky integration tests
* Try a more stable waveform
* Add to_numpy jax support
* int32 attention masks
* Refactor normalization tests
What does this PR do?

Resolves #13539

Since speech models universally use NumPy `float32` arrays as input features (the standard way of representing waveforms), it was decided to rewrite `SequenceFeatureExtractor` from pure Python lists (akin to traditional tokenizers) to NumPy arrays. This will also help with solving some inconsistent normalization issues (#13538, #13585) caused by `float` -> `np.float32` conversions.

The feature extractor itself is still dtype-agnostic (it can pad `np.float64` in the future if needed), while the model-specific feature extractors were updated to only work with `np.float32`.
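A toy sketch of the dtype-agnostic numpy padding described above (illustrative names, not the actual `SequenceFeatureExtractor` code): the padded features keep whatever float dtype the inputs carry, and the attention mask comes out as `int32`.

```python
import numpy as np

def pad_batch(batch, padding_value=0.0):
    # Toy sketch: pad a batch of 1-D waveforms to the longest length,
    # preserving the input float dtype (dtype-agnostic), and return an
    # int32 attention mask marking the valid samples.
    max_len = max(len(x) for x in batch)
    padded = np.stack(
        [np.pad(x, (0, max_len - len(x)), constant_values=padding_value) for x in batch]
    )
    mask = np.stack(
        [np.pad(np.ones(len(x), dtype=np.int32), (0, max_len - len(x))) for x in batch]
    )
    return padded, mask

batch = [np.ones(3, dtype=np.float32), np.ones(5, dtype=np.float32)]
padded, mask = pad_batch(batch)
```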