Adding chunking for whisper (all seq2seq actually). Very crude matching algorithm. #20104

Narsil · 2022-11-07T14:58:42Z

What does this PR do?

This adds chunk_length_s support to seq2seq models.

Approach

Since seq2seq models give us no way of finding a matching between output and input,
this is an alternative route.

This runs the pipeline on the various chunks and collects all the generated output.
Then it tries to find the longest sequence of non-special token ids that the outputs
of neighbouring chunks have in common, and stitches the chunks together there.
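To make the idea concrete, here is a minimal sketch of this kind of stitching, written in plain Python. The names `stitch` and `special_ids`, the small length bonus, and the "more than one matching token" rule are illustrative assumptions, not necessarily what the PR implements.

```python
def stitch(chunks, special_ids=frozenset()):
    """Merge per-chunk token id lists by the best-scoring overlap.

    Hypothetical helper: for every candidate overlap length i, compare the
    last i merged tokens with the first i tokens of the next chunk and keep
    the length with the highest match ratio.
    """
    merged = [t for t in chunks[0] if t not in special_ids]
    for chunk in chunks[1:]:
        new = [t for t in chunk if t not in special_ids]
        best_len, best_score = 0, 0.0
        for i in range(1, min(len(merged), len(new)) + 1):
            matches = sum(a == b for a, b in zip(merged[-i:], new[:i]))
            # Tiny bonus (an assumption) so longer matches win ties.
            score = matches / i + i / 10000.0
            # Require more than one matching token to avoid spurious stitches.
            if matches > 1 and score > best_score:
                best_len, best_score = i, score
        # With no usable overlap (best_len == 0) this is plain concatenation.
        merged.extend(new[best_len:])
    return merged


# Exact overlap on the tokens 4, 5.
print(stitch([[1, 2, 3, 4, 5], [4, 5, 6, 7]]))
# One wrong token (99 instead of 5) inside the overlap is tolerated.
print(stitch([[1, 2, 3, 4, 5, 6], [3, 4, 99, 6, 7]]))
```

Scoring by match ratio rather than requiring an exact match is what gives the tolerance to a few token errors; the cost is that the loop is quadratic in the chunk length.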

Pros

It should work on any seq2seq model
It should work decently when the stride is long enough for the tokens to overlap well, so that the stitching can work correctly
It should be somewhat robust to a few token errors
It should perform best on mostly continuous talk (so that there is model output that can overlap)

Cons

This method is unsound and will fail under some circumstances
It will fail when there is silence in the overlap. If there is silence, there are no overlapping tokens, and the stitching might lose track. By default it will concatenate, but it might be put off by boundaries in the stride.
It will fail spectacularly when the speaker repeats a single word over and over. The overlap found may then be TOO large. This is impossible to distinguish without access to timestamps (which only whisper can currently produce, and that comes with caveats). The current algorithm favors long chains of matching tokens.
It will have issues with capitalization and out-of-domain areas. For instance, "Yes, sir." and "Sir Thomas" might be 2 chunks with different capitalization. Since the current algorithm works at the token level, the 2 tokens "sir" and "Sir" are different and will fail to match, leading to a "Yes, sir. Sir Thomas" stitching instead of the intended "Yes, Sir Thomas.".
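The repeated-word problem can be illustrated with a small sketch. The helper `perfect_overlaps` is hypothetical and uses whole words as stand-ins for tokens; it only lists which overlap lengths match exactly, to show that several are equally plausible.

```python
def perfect_overlaps(left, right):
    """Return every overlap length i where the last i tokens of `left`
    equal the first i tokens of `right`."""
    limit = min(len(left), len(right))
    return [i for i in range(1, limit + 1) if left[-i:] == right[:i]]


# The chunk boundary falls inside a run of the same word.
left = "he said no no no no".split()
right = "no no no and left".split()

candidates = perfect_overlaps(left, right)
print(candidates)

# Each candidate produces a different number of "no" in the stitched text,
# and nothing at the token level says which one matches the audio.
for i in candidates:
    print(i, " ".join(left + right[i:]))

# Capitalization: "sir" and "Sir" are different tokens, so nothing matches.
print(perfect_overlaps(["Yes", ",", "sir", "."], ["Sir", "Thomas"]))
```

An algorithm that favors the longest match will always pick the largest candidate here, which is only right if the audio overlap really contained that many repetitions.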

HuggingFaceDocBuilderDev · 2022-11-07T15:15:22Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

sgugger

Thanks for working on this. Not sure if the PR is ready for (at least core maintainer) review yet?

sgugger · 2022-11-07T15:19:49Z

src/transformers/pipelines/automatic_speech_recognition.py

+            # if self.type not in {"ctc", "ctc_with_lm"}:
+            #     raise ValueError(
+            #         "`chunk_length_s` is only valid for CTC models, use other chunking options for other models"
+            #     )


To clean up?

sgugger · 2022-11-07T15:20:07Z

tests/pipelines/test_pipelines_automatic_speech_recognition.py

+        # self.assertEqual(
+        #     str(v.exception),
+        #     "`chunk_length_s` is only valid for CTC models, use other chunking options for other models",
+        # )


To clean up as well?

sgugger · 2022-11-07T15:20:23Z

tests/pipelines/test_pipelines_automatic_speech_recognition.py

+        # waveform = np.tile(np.arange(1000, dtype=np.float32), 34)
+        # output = speech_recognizer(waveform)
+        # self.assertEqual(output, {"text": ""})
+
+        ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation").sort("id")
+        filename = ds[40]["file"]
+        # output = speech_recognizer(filename)
+        # self.assertEqual(output, {"text": " A man said to the universe, Sir, I exist."})
+        print(filename)


Comments and print statements to clean up.

Narsil · 2022-11-07T15:36:12Z

> Thanks for working on this. Not sure if the PR is ready for (at least core maintainer) review yet?

Yup sorry it was slightly early for you.
The core idea is still there.

We chunk with a stride, then make a hopeful stitch by finding the longest sequence shared by the subsequences.

PROs:

It's extremely generic.
It should work in a lot of scenarios including repeating tokens

CONs:

It's technically unsound: if the model infers widely varying tokens, there's no way to reconstruct what the model would actually predict on the whole file.
I expect it can fail spectacularly on well-crafted examples where someone repeats the same word over and over, where the longest match will be MUCH longer than the actual overlap in the audio.

ArthurZucker · 2022-11-08T13:15:46Z

As we discussed offline with @Narsil, I will be implementing the find_common_sequence in O(N) 😉 Will open a new PR!

Narsil · 2022-11-08T16:59:55Z

> As we discussed offline with @Narsil, I will be implementing the find_common_sequence in O(N) 😉 Will open a new PR!

Seems it's going to be complex because of fault tolerance which does seem to be important.

You can try doing something like

#!wget https://www.archive.org/download/around_world_80_days_mfs_librivox/around_world_in_80_days_01_verne.mp3
from transformers import pipeline

speech_recognizer = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-small",
    framework="pt",
    batch_size=2,
    device=0,
    chunk_length_s=30,
    generate_kwargs={"max_new_tokens": 1024},
)

out = speech_recognizer(["around_world_in_80_days_01_verne.mp3"])
print(out)

This will require some suboptimal stitches to work.

Narsil · 2022-11-08T17:05:26Z

@sgugger it's now ready for review.

The TODO is left intentionally. It might really become relevant on hour+ long files, where the current naive algorithm might become too slow. However, faster code is likely to be orders of magnitude more complex (if an O(n) solution exists; I'm pretty sure we could find an expected O(n) algorithm, but I'm not sure about the worst case).
The current code works correctly and has the fault tolerance we need to be useful.

I added a warning because the current code will fail in some known circumstances. I updated the PR description to reflect those. If those tradeoffs are not good enough, I'm happy to not merge this PR in this state.

The only other option I see is whisper specific with timestamps and it would only alleviate some of the issues.

ArthurZucker · 2022-11-09T15:42:59Z

Before merging, would love to try a little bit, otherwise LGTM (looking for a solution to the faults)

Narsil · 2022-11-14T08:35:00Z

@ArthurZucker What are your conclusions ?

ArthurZucker · 2022-11-14T11:58:14Z

I think that including timestamp tokens in the process could help with the error tolerance, as they are consistently predicted at the end of pauses in the speech. If the stride is big enough to include at least some pauses in speech, it boils down to matching these.
Moreover, given that we know approximately the time between tokens, we can use this as guiding information. I am working on something, but we can merge for now and have an improved PR later on 😉

Narsil · 2022-11-14T12:26:38Z

@sgugger would like your opinion on this if possible.

The results are pretty decent imo on regular speech. I'm still mentioning the caveats because they are real.

ArthurZucker

LGTM thanks a lot for working on this

ArthurZucker · 2022-11-14T12:26:36Z

tests/pipelines/test_pipelines_automatic_speech_recognition.py

+        output = speech_recognizer([filename], chunk_length_s=5, batch_size=4)
+        self.assertEqual(output, [{"text": " A man said to the universe, Sir, I exist."}])


sgugger

Just one comment on the warning, otherwise LGTM! Thanks!

sgugger · 2022-11-14T17:13:54Z

src/transformers/pipelines/automatic_speech_recognition.py

+                logger.warning(
+                    "Using `chunk_length_s` is very experimental. The results will not necessarily be entirely"
+                    " accurate and will have caveats. More information:"
+                    " https://github.com/huggingface/transformers/pull/20104"
                )


Can we add some logic to only throw this warning once? Users are complaining Transformers is too verbose.

Is there already an existing way to do that?

Otherwise I can create some tool for it.
Is there any other location where we could add this "single" warning? (Will add in a different PR)

We use a dict in the state like this one. No need to overengineer another solution IMO.
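As a rough illustration of the dict-based approach suggested above, a warn-once helper could look like the sketch below. The class name `ChunkingPipeline`, the attribute `_warned`, and the method `_warn_once` are hypothetical and do not refer to actual Transformers internals.

```python
import logging

logger = logging.getLogger(__name__)


class ChunkingPipeline:
    """Hypothetical object that keeps track of which warnings it has emitted."""

    def __init__(self):
        # Maps a warning key to True once that warning has been logged.
        self._warned = {}

    def _warn_once(self, key, message):
        """Log `message` the first time `key` is seen; return whether it was logged."""
        if self._warned.get(key, False):
            return False
        logger.warning(message)
        self._warned[key] = True
        return True


pipe = ChunkingPipeline()
first = pipe._warn_once("chunk_length_s", "Using `chunk_length_s` is very experimental.")
second = pipe._warn_once("chunk_length_s", "Using `chunk_length_s` is very experimental.")
```

Keeping the dict on the instance means each pipeline object warns at most once per key, without needing any global state.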

nickmuchi87 · 2022-11-21T02:10:41Z

Has this been added to the current transformers version? I am getting the "ValueError: chunk_length_s is only valid for CTC models, use other chunking options for other models".

Narsil · 2022-11-21T08:18:58Z

There hasn't been a release yet, you must use main if you want to use it now.

amgmbs · 2022-11-26T10:47:59Z

Using chunk_length_s=10 without stride_length_s=(4, 2)
loses a rather large part of the transcription. It works pretty nicely with stride :) but I get a lot of repetitions despite setting condition_on_previous_text=0

Is there an alternative way to transcribe large audio files when I am using a fine-tuned whisper model?

Narsil · 2022-11-26T13:27:38Z

> loses a rather large part of the transcription. It works pretty nicely with stride :) but I get a lot of repetitions despite setting condition_on_previous_text=0

condition_on_previous_text? What is that?

Could you provide an example with the repetitions? There might be some optimizations to be made on the tying of the chunks.
As I mentioned in this PR, the tying of inferred audio can definitely create repetitions, but with better examples, we might be able to figure out better heuristics.

amgmbs · 2022-11-26T18:45:24Z

condition_on_previous_text: bool
if True, the previous output of the model is provided as a prompt for the next window;
disabling may make the text inconsistent across windows, but the model becomes less prone to
getting stuck in a failure loop, such as repetition looping or timestamps going out of sync.

I think it's true by default and it uses something like GPT to "verify" the transcription. When I simply use whisper with medium or large model, I prefer to set it to False

I get a lot of repetitions even with small samples, like the audio from this commercial

https://www.youtube.com/watch?v=LkllgKVgz8o

or this trailer

https://www.youtube.com/watch?v=xncMdIGR2pk

I have fine tuned whisper for Greek and I am trying to use it with the following lines (after of course loading transformers and the model, etc)

from transformers import pipeline

transcript = pipeline(
    task="automatic-speech-recognition",
    model=model,
    feature_extractor=feature_extractor,
    tokenizer=tokenizer,
    framework="pt",
    batch_size=16,
    device='cuda:0',
    #generate_kwargs={"max_new_tokens": 1024},
    #max_new_tokens = 1024,
    chunk_length_s=10,
    stride_length_s=(4, 2), # must have with chunk_length_s
    condition_on_previous_text=0,
    compression_ratio_threshold=2.5
)

bofenghuang · 2022-11-26T19:00:57Z

hi, I think condition_on_previous_text (and initial_prompt) is a decoding option used in the original OpenAI's version, not (yet?) implemented in HF's version. cc @ArthurZucker

Narsil · 2022-11-26T20:38:01Z

We use a different decoding strategy here, because openai/whisper is not stateless which is kind of a requirement of pipeline. (It means you can actually do batching, which is not possible with original whisper.)

Narsil · 2022-11-26T21:42:15Z

Did you try using chunk_length_s=30? By default the stride is 1/6 of the chunk length (here 5s) on each side, which should be plenty.

I'm getting for the first example:

{'text': " The dance is like life. You don't need to know the steps. You just need to hear the beat of your heart. You don't need rules to make the right move. Your consciousness is enough. Zagori. We have the good in us."}

Which seems correct to me.

Narsil · 2022-11-26T21:45:41Z

{'text': ' The test is ready. Rachel wrote Ross a letter and demanded he read it before they got back together. How many pages was that letter? 18 pages! 18 pages. Front and back! Front and back is correct! Wait, wait, go one more time! Oh my god. Here we go. Where\'s the tissue box? The cast of Friends. Wow. It\'s cool. her lines written on the table? We\'ve literally just slipped right back. We regret. We have such a bond from this show. Were Ross and Rachel on a break? Yes. Yes. Yes. Yes. Bullshit. table read, that\'s the first time I laid eyes on any of you. Everyone was so perfectly cast. Yeah. This is from the one where everyone finds out. I remember I went to the producer of the show I was on and he he said, "That show\'s not gonna make you a star." [laughing] I remember one time I happened to have the news on, and on the TV was an aerial shot of each of our houses. - Oh, jeez. - And I remember looking at it, going, "What the--?" My roof is a mess. [laughing] It was an incredible time. We became best friends. Yeah, I\'m going to cry now. When I watch the episodes, I\'m laughing out loud, because you all make me laugh so hard. I know you know how big the show is. What you have given so many people is an experience of huge comfort. like we had these friends. I love you guys so much.'}

For the second.

amgmbs · 2022-11-26T22:01:59Z

Yes that was good
What value should I use for stride_length_s with chunk 30?

Can you please tell me if this is the only way to transcribe large audio files with pipeline?

Thank you all :)

Narsil · 2022-11-26T22:09:40Z

> What value should I use for stride_length_s with chunk 30?

The stride defaults to chunk_length_s / 6 on each side, so here 5s and 5s. It's important to have something significant on both sides I think (more overlap will reduce the chances for the algorithm to get it wrong).
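To see what these defaults mean in practice, the following sketch computes the chunk windows in seconds. The function `chunk_windows` is a hypothetical helper, and it assumes that consecutive chunks advance by chunk_length_s minus both strides; the pipeline's exact boundary handling may differ.

```python
def chunk_windows(total_s, chunk_length_s=30.0, stride_length_s=None):
    """Return (start, end) windows in seconds covering `total_s` of audio."""
    if stride_length_s is None:
        # Default discussed above: one sixth of the chunk on each side.
        stride_length_s = chunk_length_s / 6
    if isinstance(stride_length_s, (int, float)):
        stride_length_s = (stride_length_s, stride_length_s)
    left, right = stride_length_s
    # Assumed step between chunk starts: the part of a chunk that is not stride.
    step = chunk_length_s - left - right
    if step <= 0:
        raise ValueError("The strides must be smaller than the chunk length.")

    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


# 70 seconds of audio with 30s chunks and the default 5s strides.
print(chunk_windows(70))
# The setting reported earlier in the thread: 10s chunks with strides (4, 2).
print(chunk_windows(20, chunk_length_s=10, stride_length_s=(4, 2)))
```

Under this assumption neighbouring windows share left + right seconds of audio (10s with the defaults), which is the material the stitching has to work with.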

amgmbs · 2022-11-27T12:35:59Z

when I ! pip install git+https://github.com/openai/whisper.git and import whisper, all is fine with medium model

I have tried pipeline with

chunk_length_s=30,
stride_length_s=(5, 5),

and I still get repetitions with openai/whisper-medium, openai/whisper-large and emilios/whisper-medium-el
I've tried other, bigger videos (well, ok, audio) and it is not working as it is supposed to :(

amgmbs · 2022-11-27T13:01:24Z

I've just noticed that the translation is ok, but the transcription in the original language has repetitions

https://www.youtube.com/watch?v=e_eCryyPRus

model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language = "el", task = "transcribe")

greek transcript with repetitions, removed

model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language = "el", task = "translate")

transcript translated into english, removed

Narsil · 2022-11-27T13:19:53Z

Is it possible that the model is the cause, then?

ML generative models are known to be repetitive. And the kind of repetition I'm seeing here really looks like bad model generation more than erroneous stitching.

amgmbs · 2022-11-27T16:32:13Z

Nope, I think the translation engine fixes (or hides) the repetitive phrases

Narsil · 2022-11-27T18:34:18Z

I confirm it is the model.

Take the audio of the video you linked.

ffmpeg -ss 140 -i out.mp3 -c copy -t 20 out_repete.mp3

Then do the inference:

from transformers import pipeline, AutoProcessor

processor = AutoProcessor.from_pretrained("emilios/whisper-medium-el")
pipe = pipeline(model="emilios/whisper-medium-el")
pipe.model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="el", task="transcribe")

out = pipe("out_repete.mp3")
print(out)

And you will see that the model goes looping all by itself. This is not the chunking's doing.

amgmbs · 2022-11-27T20:02:04Z

But I get the same problem with openai model too

[{'text': ' [μουσική] Εμείς σήμερα, εγώ του λαμβάνω,πικά, αλλά φαντάζομαι όσοι από μας προσπαθούν να σκεφτούν σοβαρά, μέσα σε όλο αυτό το χάος του ιστορικού υλικού που έχουμε μπροστά μας, επιλέγουμε μια παράδοση. Αυτό δεν σημαίνει ότι την επιλέγουμε για να σημαίνουμε δούλοι.. Επιλέγουμε ακριβώς την παράδοση εκείνη, δηλαδή αυτήν που ονομάζω Έλληνοδυτική, μέσα στην οποία η αμφισβήτηση της παράδοσης είναι ένα βασικό στοιχείο. Η αμφισβήτηση όχι για την ευχαρίστηση της αμφισβήτησης, Η αμφισβήτηση όταν υπάρχει λόγος, η δυνατότητα της αμφισβήτησης, η δυνατότητα του να σκεφτώ αλλιώς, του να μιλήσω αλλιώς από τη σκέφτετη. Η πλειοψηφία, η εκκλησία, το κράτος, το κόμμα κτλ. Δεν είναι έτσι; Ο δάσκαλος, οι γονείς ενδεχομένως. Και από εκεί και πέρα η δυνατότητα να βάλω σαν άτομο ή να βάλει μια κοινωνική ομάδα ή μια πολιτική κίνηση ερωτήματα σχετικά με το αν η σημερινή θέσμη της κοινωνίας είναι δίκαιη ή δεν είναι δίκαιη, εάν η ισότητα εντός εισαγωγικών, την οποία επαγγέλλεται το Σύνταγμα και ο νόμος για τους πολίτες, τα βασικά χαρακτηριστικά αυτής της παράδοσης, πιο όχι άλλο νόμο. Κάθε κοινωνία δημιουργεί τους θεσμούς της, αλλά η ιδέα ότι η θεσμία αυτή είναι η δική της δημιουργία ακριβώς δε είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρκετά. Είναι αρονομία της κοινωνίας δεν είναι μόνο και δεν είναι τόσο η εκμετάλλευση, η καταπίεση, η υπάρξη μιας εξουσίας χωρισμένης από την κοινωνία. 
Είναι η ιδέα ότι οι θεσμοί ήρθαν απαλού.ει και σε πρωτόγωνες κοινωνίες, στις οποίες δεν βλέπουμε αυτά τα συνόμενα. Η ετερονομία της κοινωνίας είναι το γεγονός ακριβώς ότι η κοινωνία αλοτριώνεται στους θεσμούς της οποίες η ίδια η δημιούργησε, διότι δεν ξέρει ότι η ίδια τους η δημιούργησε, Αν δεν υπήρχε Θεός, όλα θα ήσουν αυτοί που θα έρθουν.ημειωταίων δεν ανήκει στον Ντοστογεύσκη, αλλά μπορεί να το πάει κανείς πίσω, ως τουλάχιστον με έκακε τον Πλάτονα. Και το οποίον εγώ θεωρώ επιχείρημα υπαστηνό μου βήτα, δηλαδή ότι χρειάζεται ένας Θεός, διότι αλλιώς όλα αυτά τα ρεμάλια θα κάνουν, τους κατεύαιναν, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, εγώ, η αρχαία αθηνή, ναι. Τους κάνουμε τους νόμους μας και όσον δεν τους έχουμε αλλάξει, τους ευώμαστε. Αυτό είναι το πράγμα που πρέπει να δίνει. 
Από αυτή την άψη, αυτό το οποίο ενεργώ εγώ ως αυτώνομη κοινωνία, είναι μια κοινωνία, όχι οποία είναι διαφανής, αλλά είναι μια κοινωνία η οποία ξέρει ότι δεν υπάρχει υπερβατικότητα, ότι δεν υπάρχει υπερβατική πηγή των θεσμών και των νόμων, ότι δεν υπάρχει μεταθάνατον ζωή αυτό που ξέραν οι αρχαίοι Έλληνες, οι οποίοι δεν επίστευαν σε μεταθάνατο ζωή,υτό μας, στους εαυτούς μας, σαν κοινωνικό σύνολο, κανόνες και νόμιες, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να κάνουμε, να δούμε ότι όσο πρέπει να δούμε, να δούμε ότι όσο πρέπει να δούμε, να δούμε ότι όσο πρέπει να δούμε, να δούμε ότι όσο πρέπει να δούμε, να δούμες έχουμε να το κάνουμε και έχουμε να δώσουμε στον εαυτό μας, στους εαυτούς μας σαν κοινωνικό σύνολο, κανόνες και νόμους που να μας επιτρέπουν να υπάρχουμε σαν αυτώνομη κοινωνία και σαν αυτώνομα άτομα μέσα σε αυτή την κοινωνία.'}]

Please check it with the following code in a notebook when you can

!pip install git+https://github.com/huggingface/transformers
!pip install pytube
from pytube import YouTube

mymodel = "openai/whisper-medium"
#mymodel = "openai/whisper-large"
#mymodel = "emilios/whisper-medium-el"
#lang="English"
lang="Greek"

from transformers import WhisperForConditionalGeneration
model = WhisperForConditionalGeneration.from_pretrained( mymodel)
from transformers import WhisperTokenizer
tokenizer = WhisperTokenizer.from_pretrained( mymodel, language=lang, task="transcribe")
from transformers import WhisperProcessor
processor = WhisperProcessor.from_pretrained( mymodel, language=lang, task="transcribe")
from transformers import WhisperFeatureExtractor
feature_extractor = WhisperFeatureExtractor.from_pretrained( mymodel, language=lang, task="transcribe")

link = 'https://www.youtube.com/watch?v=e_eCryyPRus'

try:
    yt = YouTube(link)
except Exception:
    print("Connection Error")
yt.streams.filter(file_extension='mp4')
stream = yt.streams.get_by_itag(139)
stream.download('',"YouTube.mp4")

model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language = "el", task = "transcribe")
model.config.suppress_tokens = []
#model.config.max_new_tokens = 1024

from transformers import pipeline

transcript = pipeline(
    task="automatic-speech-recognition",
    model=model,
    feature_extractor=feature_extractor,
    tokenizer=tokenizer,
    framework="pt",
    batch_size=16,
    device='cuda:0',
    #generate_kwargs={"max_new_tokens": 1024},
    #max_new_tokens = 1024,
    chunk_length_s=30, # 12
    stride_length_s=(5, 5), # must have with chunk_length_s
    condition_on_previous_text=0,
    compression_ratio_threshold=2.4
)

out = transcript(["YouTube.mp4"])
print(out)

Narsil · 2022-11-27T23:20:23Z

Yes, this is what I'm saying. The model is repeating itself, there's not much we can do about it.

If you could fine tune it even more, or on more data, or more diverse data, that could probably help.

For a quicker fix, you could try to reduce the amount of repetition with repetition_penalty (there are actually several options for it): https://huggingface.co/docs/transformers/v4.24.0/en/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate

That should help you get started. But please bear in mind it's only a temporary solution, the real solution is fixing the model itself I'm afraid. (But all models end up doing repetition when out of domain).
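Independently of how the repetition is reduced, it can help to flag looping output automatically. The sketch below uses a zlib compression ratio, which to my understanding is the idea behind the compression_ratio_threshold option in the original openai/whisper code; the helper names and the use of the 2.4 value from this thread as a default are assumptions for illustration.

```python
import zlib


def compression_ratio(text):
    """Ratio of raw UTF-8 size to zlib-compressed size; repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_like_loop(text, threshold=2.4):
    """Hypothetical check: True when the text compresses suspiciously well."""
    return compression_ratio(text) > threshold


normal = "The dance is like life. You don't need to know the steps."
looping = "Είναι αρκετά. " * 50

print(round(compression_ratio(normal), 2), looks_like_loop(normal))
print(round(compression_ratio(looping), 2), looks_like_loop(looping))
```

A check like this only detects the symptom; a chunk that trips it could then be re-decoded with different generation settings or reported to the user.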

amgmbs · 2022-11-28T23:10:22Z

I thought that when I use pipeline and hf whisper medium model from openai

mymodel = "openai/whisper-medium"
from transformers import WhisperForConditionalGeneration
model = WhisperForConditionalGeneration.from_pretrained( mymodel)

is exactly the same model with the following code

import whisper
model = whisper.load_model("medium")
result = model.transcribe("GoogleImagen.mp4",
language= "el", fp16=True)
print(result['text'])

Which does not have any repetitions

What am I missing?

Transcript of the last video

Η τ Playstation Εμείς σήμερα, εγώ τουλάχιστον προσωπικά, αλλά φαντάζομαι όσοι από μας προσπαθούν να σκεφτούν σοβαρά, μέσα σε όλο αυτό το χάος του ιστορικού υλικού που έχουμε μπροστά μας, επιλέγουμε μια παράδοση. Αυτό δεν σημαίνει ότι την επιλέγουμε για να τσιμίνουμε δούλοι. Επιλέγουμε ακριβώς την παράδοση εκείνη, δηλαδή αυτή που ονομάζω Έλληνο-Δυτική, μέσα στην οποία η αμφισβήτηση της παράδοσης είναι ένα βασικό στοιχείο. Η αμφισβήτηση όχι για την ευχαρίστηση της αμφισβήτησης, η αμφισβήτηση όταν υπάρχει λόγος, η δυνατότητα της αμφισβήτησης, η δυνατότητα του να σκεφτώ αλλιώς, του να μιλήσω αλλιώς από τι σκέφτεται η πλειοψηφία, η εκκλησία, το κράτος, το κόμμα κτλ. Δεν είναι έτσι? Ο δάσκαλος, οι γονείς ενδεχομένως. Και από εκεί και πέρα η δυνατότητα να βάλω σαν άτομο ή να βάλει μια κοινωνική ομάδα ή μια πολιτική κίνηση ερωτήματα σχετικά με το αν η σημερινή θέσμηση της κοινωνίας είναι δίκαιη ή δεν είναι δίκαιη, εάν η ισότητα εντός αορικών, την οποία αν επαγγέλεται το σύνταγμα και ο νόμος για τους πολίτες, υπάρχει στην πραγματικότητα ή δεν υπάρχει, αυτή η δυνατότητα είναι συμφύσης επίσης με τα βασικά χαρακτηριστικά αυτής της παράδοσης, πιο όχι άλλο νόμο. Κάθε κοινωνία δημιουργεί τους θεσμούς της, αλλά η ιδέα ότι η θεσμία αυτή είναι η δική της δημιουργία ακριβώς δεν υπάρχει στις περισσότερες κοινωνίες. Γι' αυτό και οι θεσμοί μένουν άθηκτοι. Υπάρχει η ιδέα ότι οι θεσμοί ήρθαν απαλού. Η ετερονομία της κοινωνίας δεν είναι μόνο και δεν είναι τόσο η εκμετάλλευση, η καταπίεση, η υπάρξη μιας εξουσίας χωρισμένης από την κοινωνία και τα λοιπά, γιατί η ετερονομία της κοινωνίας υπάρχει και σε πρωτόγωνες κοινωνίες, εσείς όπου δεν βλέπουμε αυτά τα φαινόμενα. Η ετερονομία της κοινωνίας είναι το γιόνωσό ακριβώς ότι η κοινωνία αλοτριώνεται στους θεσμούς τις οποίες η ίδια η δημιούργησε, διότι δεν ξέρει ότι η ίδια τους η δημιούργησε και κατά κάποιο τρόπο δεν μπορεί να το ξέρει, γιατί είναι τρομερά δύσκολο να το ξέρει. 
Και αυτό το περίφουμο επιχείρημα του Ντοστογεύσκη, το οποίο τόσο έχει εκθιαστεί, ότι εάν δεν υπήρχε Θεός όλα θα ήσανε επιτρεπτά, το οποίο σημειωταίων δεν ανήκει στον Ντοστογεύσκη, αλλά μπορεί να το πάει κανείς πίσω, ως τουλάχιστον με έκανε και τον Πλάτονα, και το οποίο εγώ θεωρώ επιχείρημα υπαστεινό μου βήτα, δηλαδή ότι χρειάζεται ένας Θεός, όλα αυτά τα ρεμάλια θα κάνουν ότι τους κατέβαιναν, παρά τη χειδαιότητα του επιχείρημα του Ντοστογεύσκη, παρά τη χειδαιότητά του εκφράζει μια βασική αλήθεια της θέσμης των ετερονόμων κοινωνιών. Δηλαδή χρειάζεται να λεχτεί ότι ο θεσμός έχει έρθει από αλλού, για να μπορεί να κατοχυρωθεί ο θεσμός. Εάν οι άνθρωποι ξέραν ότι κάνουν ήδη τους νόμους τους, θα τους ευβόντουσαν. Εσ' αυτό απαντάνε οι αρχαίοι Έλληνες και οι αρχαίοι Αθηνέοι, ναι, τους κάνουμε τους νόμους μας και όσον δεν τους έχουμε αλλάξει, τους ευόμαστε. Και σ' αυτό, κατά κάποιο τρόπο, προσπάθησε να απαντήσει το νεότερο δημοκρατικό και παναστατικό κίνημα στο μέτρο που απάντησε, προσπαθώντας να βάλει μπροστά την ιδέα ότι τους νόμους τους δημιουργεί ο λαός και ότι αυτό δεν είναι λόγος να μην είναι σεβαστοί αυτή η νόμη, παραδείγματι, δεν είναι έτσι. Από αυτή την άψη, αυτό το οποίο ενεργώ εγώ ως αυτώνομη κοινωνία, είναι μια κοινωνία όχι η οποία είναι διαφανής αλλά είναι μια κοινωνία η οποία ξέρει ότι δεν υπάρχει υπερβατικότητα, ότι δεν υπάρχει υπερβατική πηγή των θεσμών και των νόμων, ότι δεν υπάρχει μεταθάνατο ζωή αυτό που ξέραν οι αρχαίοι Έλληνες οι οποίοι δεν επίστευαν σε μεταθάνατο ζωή ή αν επίστευαν σε μεταθάνατο ζωή της δίνα ένα περιεχόμενο όπως φαίνεται στην Οδύσσια που ήταν 100 φορές χειρότερο από την επίγελ ζωή η οποία ξέρει συναιπώς ότι ό,τι γίνεται γίνεται εδώ κάτω και ότι ό,τι έχουμε να κάνουμε εμείς έχουμε να το κάνουμε και έχουμε να δώσουμε στον εαυτό μας, στους εαυτούς μας σαν κοινωνικό σύνολο κανόνες και νόμους που να μας επιτρέπουν να υπάρχουμε σαν αυτόνομη κοινωνία και σαν αυτόνομα άτομα μέσα σε αυτή την κοινωνία

Narsil · 2022-11-29T10:02:20Z

> What am I missing?

The two methods are different, and I think it just comes down to luck in where the split occurs.
If you check the split I suggested, you can see the duplication, but it does not appear if you shift by 10s left or right.
If OpenAI splits at different locations, then it will work better. But I think it could easily be the other way around. (Unless something else is going on within OpenAI's code regarding timestamps, but afaik it's all really just luck based.)

…ng algorithm. (huggingface#20104) * Very crude matching algorithm. * Fixing tests. * Removing comments * Adding warning + fix short matches. * Cleanup tests. * Quality. * Less noisy. * Fixup.

Narsil requested review from ArthurZucker and sgugger November 7, 2022 14:58

sgugger reviewed Nov 7, 2022

Narsil requested a review from sgugger November 8, 2022 17:02

ArthurZucker mentioned this pull request Nov 9, 2022

ASR pipeline does not work with openai/whisper on current master #19490

Closed

4 tasks

Narsil added 6 commits November 14, 2022 09:35

Very crude matching algorithm.

20af9b4

Fixing tests.

01e833c

Removing comments

4127c84

Adding warning + fix short matches.

1019536

Cleanup tests.

5db3432

Quality.

bd13f54

Narsil force-pushed the whisper_chunking branch from 5eb0179 to bd13f54 Compare November 14, 2022 08:35

ArthurZucker approved these changes Nov 14, 2022

sgugger reviewed Nov 14, 2022

Less noisy.

035c2bc

Fixup.

8b9f1f2

Narsil merged commit 25c451e into huggingface:main Nov 14, 2022

Narsil deleted the whisper_chunking branch November 14, 2022 22:57

ydshieh mentioned this pull request Dec 5, 2022

Fix AutomaticSpeechRecognitionPipelineTests.run_pipeline_test #20597

Merged

pearl-yu mentioned this pull request Mar 16, 2023

whisper return_timestamp error #22214

Closed

4 tasks

ylacombe mentioned this pull request Jan 11, 2024

Seamless M4T-v2 Inference bug when using chunk_length_s parameter #28397

Closed

4 tasks
