Make chuking smartly (long files) work on asr ctc_with_lm. #15219

Merged
Narsil merged 7 commits into huggingface:master from pipeline_chunking_asr_with_lm on Jan 19, 2022

@Narsil Narsil commented Jan 19, 2022

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil Narsil changed the title from "[WIP] Make chuking smartly (long files) work on asr ctc_with_lm." to "Make chuking smartly (long files) work on asr ctc_with_lm." Jan 19, 2022

@@ -66,14 +66,27 @@ def ffmpeg_read(bpayload: bytes, sampling_rate: int) -> np.array:
return audio


def apply_stride(tokens, stride):
max_token_n = tokens.shape[-1]
def audio_to_logits(tokens_or_logits, stride):
Contributor

What exactly does the function do? Could we add a docstring? Also, I don't really understand the name audio_to_logits.

Contributor Author

It just converts the stride from audio space (for instance, 10 s at 16_000 gives (160_000, 8_000, 8_000)) to logits space (for instance, (2333, 160, 160)).

Can you think of a better name? A docstring could help a little.

Contributor

Ah, I see. Maybe get_output_stride_from_input_stride(input_shape, stride), passing the shape directly? Yeah, I think a little docstring would help here.
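
To make the conversion discussed in this thread concrete, here is a minimal sketch of rescaling strides from audio space to logits space. The function name `rescale_stride`, its signature, and the assumption that the model shrinks every input by the same ratio are illustrative choices, not necessarily what this PR merged. The frame count of 500 is a hypothetical value picked for the example.

```python
def rescale_stride(max_output_n, strides):
    """Rescale (length, left, right) strides from audio samples to logits frames.

    max_output_n: number of logits frames produced for the longest input
                  in the batch (hypothetical parameter for this sketch).
    strides:      list of (input_n, left, right) tuples in audio samples.
    """
    # Assumption: the model downsamples every input by the same ratio.
    max_input_n = max(input_n for input_n, _, _ in strides)
    ratio = max_output_n / max_input_n

    rescaled = []
    for input_n, left, right in strides:
        output_n = int(round(input_n * ratio))
        left_n = int(round(left * ratio))
        right_n = int(round(right * ratio))
        rescaled.append((output_n, left_n, right_n))
    return rescaled


# 10 s and 5 s of audio at 16_000 samples per second, batched together.
audio_strides = [(160_000, 8_000, 8_000), (80_000, 8_000, 0)]
logits_strides = rescale_stride(500, audio_strides)
```

Passing the output length rather than the tensor itself follows the reviewer's suggestion to hand over the shape directly.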

# we need to reconstruct this information
# This won't work with left padding (which doesn't exist right now)
right_n = total_n - right
logits = logits[:, left:right_n]
Contributor

Nice, yes I think that's the only approach that'll work right now

Contributor

Padding is not possible here sadly

Contributor Author

@Narsil Narsil Jan 19, 2022

Padding works; it's handled by the batching mechanism. All I meant here is why we need so much information (the logits might get padded while the stride does not).

We could make it work for left padding relatively trivially:

if self.feature_extractor.padding_side == "left":
    # padding sits before the real frames, so shift both bounds by its length
    pad_n = logits.shape[1] - total_n
    left_n = pad_n + left
    right_n = pad_n + total_n - right
else:
    left_n = left
    right_n = total_n - right

Just thought this was overly complex since left padding doesn't seem likely here.
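
As a rough illustration of why the padding side matters, the sketch below computes the slice bounds for both cases on plain Python lists. The helper `crop_bounds` and its parameters are hypothetical and only mirror the idea in this comment; the pipeline itself slices the logits tensor.

```python
def crop_bounds(padded_n, stride, padding_side="right"):
    """Return (start, end) indices of the frames to keep along the time axis.

    padded_n: length of the (possibly padded) logits sequence.
    stride:   (total_n, left, right) in logits frames, where total_n is the
              number of real (unpadded) frames.
    """
    total_n, left, right = stride
    pad_n = padded_n - total_n
    # With left padding the real frames start after the pad; with right
    # padding they start at index 0.
    offset = pad_n if padding_side == "left" else 0
    return offset + left, offset + total_n - right


real = list(range(10))            # 10 real frames, labelled 0..9
stride = (10, 2, 3)               # drop 2 frames on the left, 3 on the right

right_padded = real + [-1, -1]    # -1 marks a padded frame
left_padded = [-1, -1] + real

start, end = crop_bounds(len(right_padded), stride, "right")
kept_right = right_padded[start:end]

start, end = crop_bounds(len(left_padded), stride, "left")
kept_left = left_padded[start:end]
```

Both calls keep the same real frames, which is the property the stride bookkeeping needs to preserve regardless of where the padding lands.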

Contributor

Ah I see - ok yeah

Contributor

@patrickvonplaten patrickvonplaten left a comment

Super cool! Works well for the Swedish example of:

./eval.py --model_id hf-test/xls-r-300m-sv --dataset speech-recognition-community-internal/tedx_manual_dev_test --config sv --split validation --chunk_length_s 5.0 --stride_length_s 1.0

with the eval script here: https://github.com/huggingface/transformers/blob/master/examples/research_projects/robust-speech-event/eval.py

Made some small changes to make it work for batch size 1 - feel free to refactor those to make it cleaner @Narsil
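
For readers unfamiliar with what `--chunk_length_s 5.0 --stride_length_s 1.0` implies, here is a small sketch of how a long file could be split into overlapping chunks with strides. It is an illustration under stated assumptions (16_000 samples per second, a hypothetical 12 s clip, and a hypothetical helper `chunk_strides`), not the pipeline's actual implementation.

```python
def chunk_strides(total_n, chunk_n, stride_left, stride_right):
    """Split total_n samples into overlapping chunks.

    Returns a list of (start, end, (length, left, right)) where left and
    right are the samples to discard on each side after inference.
    """
    step = chunk_n - stride_left - stride_right
    if step <= 0:
        raise ValueError("chunk length must exceed the sum of both strides")

    chunks = []
    start = 0
    while True:
        end = min(start + chunk_n, total_n)
        is_last = end == total_n
        # The first chunk has nothing to discard on the left, and the last
        # chunk has nothing to discard on the right.
        left = 0 if start == 0 else stride_left
        right = 0 if is_last else stride_right
        chunks.append((start, end, (end - start, left, right)))
        if is_last:
            break
        start += step
    return chunks


sampling_rate = 16_000                      # assumed sampling rate
chunks = chunk_strides(
    total_n=12 * sampling_rate,             # hypothetical 12 s clip
    chunk_n=int(5.0 * sampling_rate),       # chunk_length_s = 5.0
    stride_left=int(1.0 * sampling_rate),   # stride_length_s = 1.0
    stride_right=int(1.0 * sampling_rate),
)

# The kept region of each chunk, once the strides are dropped.
kept = [(start + left, end - right) for start, end, (_, left, right) in chunks]
```

In this sketch the kept regions line up end to end, so every sample is transcribed exactly once while each chunk still sees extra context on both sides.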

@Narsil Narsil merged commit 3fefee9 into huggingface:master Jan 19, 2022
@HuggingFaceDocBuilder

Great job merging this PR! The documentation will now be removed from the staging environment.

@Narsil Narsil deleted the pipeline_chunking_asr_with_lm branch January 19, 2022 20:06

3 participants