Updated Extraction of Embeddings #1

Open
wants to merge 77 commits into base: extract-embeddings

Conversation

ilanit1997

I have found your fork to be exceptionally valuable for extracting encoder and decoder embeddings. As a result, I have decided to integrate the modifications you made for extracting embeddings, ensuring they align with the most recent version of openai/whisper.
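For readers landing here, a minimal sketch of one way to capture encoder output from the stock openai/whisper Python API using a forward hook; this is only an illustration, not necessarily how this fork implements it, and "audio.wav" is a placeholder path.

```python
import whisper

model = whisper.load_model("base")
captured = {}

# Forward hook on the AudioEncoder: its output is the encoder embedding for
# the current 30-second mel window (overwritten on every forward pass).
def save_encoder_output(module, inputs, output):
    captured["encoder"] = output.detach().cpu()

hook = model.encoder.register_forward_hook(save_encoder_output)
result = model.transcribe("audio.wav")
hook.remove()

print(captured["encoder"].shape)  # (1, n_audio_ctx, n_audio_state), e.g. (1, 1500, 512) for base
```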

MichaelMonashev and others added 30 commits May 16, 2023 17:58
Fix bug: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)
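A hedged illustration of the kind of mismatch this commit message describes (the tensors and shapes below are made up): index_select requires the index tensor to live on the same device as the input.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 8, device=device)

idx = torch.tensor([0, 2])   # created on CPU by default
idx = idx.to(x.device)       # without this, CUDA runs raise the RuntimeError quoted above
selected = torch.index_select(x, 0, idx)
```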
Add project summary, license, etc. for display with
"pip show" and similar Python package distribution tools.
- The "large-v2" model is trained for more epochs with regularization and shows improved performance compared to the previous large.
- It has the same architecture as the original large model.
- When `load_model("large")` is called, the "large-v2" model will be loaded.
- We will soon update the paper regarding this new model.
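A short usage example of the aliasing described above; downloading the checkpoint requires network access.

```python
import whisper

# "large" now resolves to the large-v2 checkpoint; the architecture is unchanged.
model = whisper.load_model("large")
print(model.dims)
```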
* Update Hebrew language code to he per IANA registry

Per [IANA registry](https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry), `iw` was deprecated as the code for Hebrew in 1989 and the preferred code is `he`

The correct subtag: 
```
%%
Type: language
Subtag: he
Description: Hebrew
Added: 2005-10-16
Suppress-Script: Hebr
%%
``` 
And the deprecation
```
%%
Type: language
Subtag: iw
Description: Hebrew
Added: 2005-10-16
Deprecated: 1989-01-01
Preferred-Value: he
Suppress-Script: Hebr
%%
```

* Update hebrew ISO code to he

Per discussion, it's ok to make this change without backwards compatibility
s/successfully/successively, which I believe was the intent.
For a 30s long audio file which didn't have any silence, ndimage.median_filter took 7s, whereas signal.medfilt took 30s.
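A rough, illustrative way to reproduce that kind of comparison (the array shape and kernel size below are made up, and boundary handling differs slightly between the two functions):

```python
import time
import numpy as np
from scipy import ndimage, signal

x = np.random.randn(1500, 1500).astype(np.float32)

t0 = time.time()
a = ndimage.median_filter(x, size=(1, 7))
t1 = time.time()
b = signal.medfilt(x, kernel_size=(1, 7))
t2 = time.time()

print(f"ndimage.median_filter: {t1 - t0:.2f}s, signal.medfilt: {t2 - t1:.2f}s")
```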

Co-authored-by: Umar Farooqi <umar@paystash.com>
Co-authored-by: Jong Wook Kim <jongwook@nyu.edu>
…it (openai#681)

* Add github action to automatically push to pypi on Release x.y.z commit

* some housekeeping for pypi upload

* add version.py

Co-authored-by: Jong Wook Kim <jongwook@nyu.edu>
Co-authored-by: Jong Wook Kim <jongwook@openai.com>
jongwook and others added 29 commits May 16, 2023 17:58
* kwargs in decode() for convenience

* formatting fix
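A hedged example of the convenience the first bullet refers to: DecodingOptions fields can be passed to whisper.decode() as keyword arguments ("audio.wav" is a placeholder path).

```python
import whisper

model = whisper.load_model("base")
audio = whisper.pad_or_trim(whisper.load_audio("audio.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Keyword arguments are folded into DecodingOptions instead of building it by hand.
result = whisper.decode(model, mel, language="en", temperature=0.0)
print(result.text)
```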
* use tiktoken==0.3.0

* formatting

* tuple should be safer

* Update whisper/tokenizer.py

Co-authored-by: Ruhollah Majdoddin <r.majdodin@gmail.com>

* use tiktoken 0.3.1

* reflecting suggestions

* cleanup

* bypassing load_tiktoken_bpe to avoid blobfile dep

---------

Co-authored-by: Ruhollah Majdoddin <r.majdodin@gmail.com>
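A minimal sketch of the idea behind the last bullet above ("bypassing load_tiktoken_bpe to avoid blobfile dep"): parse a local ranks file (base64 token and rank per line) and build a tiktoken.Encoding directly. The filename and the GPT-2-style pat_str are illustrative, not necessarily what whisper/tokenizer.py uses verbatim.

```python
import base64
import tiktoken

def load_ranks(path: str) -> dict:
    # Each non-empty line is "<base64-encoded token> <rank>"; parsing it here
    # avoids tiktoken.load.load_tiktoken_bpe and its blobfile dependency.
    with open(path, "rb") as f:
        return {
            base64.b64decode(token): int(rank)
            for token, rank in (line.split() for line in f if line.strip())
        }

ranks = load_ranks("multilingual.tiktoken")   # placeholder filename
encoding = tiktoken.Encoding(
    name="multilingual",
    pat_str=r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
    mergeable_ranks=ranks,
    special_tokens={},
)
```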
* Fix alignment between the segments and the list of words

* Ensure the word index does not overflow
…ai#1076)

Co-authored-by: Akash Mahajan <akash.mahajan@microsoft.com>
Co-authored-by: Jong Wook Kim <jongwook@openai.com>
* Squash long words at window and sentence boundaries.

* Formatting requirements.

* Fix squashing logic to point to correct words.

---------

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
…ng optional (openai#1184)

* Add highlight_words, max_line_width, max_line_count

* Refactor subtitle generator

---------

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
Co-authored-by: Jong Wook Kim <jongwook@openai.com>
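A hedged command-line example of the subtitle options added in openai#1184 (flag names as merged upstream; --word_timestamps must be enabled for them to take effect, and audio.wav is a placeholder):

```
whisper audio.wav --output_format srt --word_timestamps True \
    --highlight_words True --max_line_width 42 --max_line_count 2
```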
* Update decoding.py

Following the suggestions of @Jeronymous in openai#914 and openai#924, this solves the endless-loop problem.

* Removed blank line and whitespaces in empty lines.

* Suggested changes according to the linter

---------

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
* python 3.11

* python 3.11

* fix

* fix

* fix

* revert changes

* Update requirements.txt

* Trying pip3 install instead

* Excluding cp39 - torch 1.10.2

* Removing 1.10.2 from test

---------

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
* Drop ffmpeg-python dependency and call ffmpeg directly.

The last ffmpeg-python module release was in 2019 [1], upstream seems to be
unavailable [2], and project development seems to have stagnated [3]. As the
features it provides are trivial to replace using the Python-native
subprocess module, drop the dependency.

 [1] <URL: https://github.com/kkroening/ffmpeg-python/tags >
 [2] <URL: kkroening/ffmpeg-python#760 >
 [3] <URL: https://openhub.net/p/ffmpeg-python >
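A minimal sketch of loading audio this way, in the spirit of the change (the real whisper.audio.load_audio differs in details such as error reporting):

```python
import subprocess
import numpy as np

def load_audio(path: str, sample_rate: int = 16000) -> np.ndarray:
    # Decode to mono 16-bit PCM on stdout; no ffmpeg-python wrapper needed.
    cmd = [
        "ffmpeg", "-nostdin", "-i", path,
        "-f", "s16le", "-ac", "1", "-acodec", "pcm_s16le",
        "-ar", str(sample_rate), "-",
    ]
    out = subprocess.run(cmd, capture_output=True, check=True).stdout
    return np.frombuffer(out, np.int16).astype(np.float32) / 32768.0
```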

* Rewrote to use subprocess.run() instead of subprocess.Popen().

* formatting changes

* formatting update

* isort fix

* Error checking

* isort 🤦🏻

* flake8 fix

* minor spelling changes

---------

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
* Avoid computing higher temperatures on no_speech

In decode_with_fallback, we retry at higher temperatures when compression_ratio is too high or avg_logprob is too low.
But since the computation of no_speech_prob doesn't depend on sampling, we can avoid computing higher temperatures if the first attempt already shows that the no_speech condition is fulfilled.
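A hedged, self-contained sketch of that short-circuit; the helper decode_at and the default thresholds are illustrative stand-ins for whisper's transcribe.py internals, not the exact diff.

```python
from typing import Callable, Sequence

def decode_with_fallback_sketch(
    decode_at: Callable[[float], "DecodingResult"],  # hypothetical: decode one segment at a temperature
    temperatures: Sequence[float],
    compression_ratio_threshold: float = 2.4,
    logprob_threshold: float = -1.0,
    no_speech_threshold: float = 0.6,
):
    result = None
    for t in temperatures:
        result = decode_at(t)

        # no_speech_prob does not depend on the sampling temperature, so a
        # segment detected as silence is returned immediately instead of
        # being retried at higher temperatures.
        if result.no_speech_prob > no_speech_threshold:
            return result

        needs_fallback = (
            result.compression_ratio > compression_ratio_threshold
            or result.avg_logprob < logprob_threshold
        )
        if not needs_fallback:
            return result
    return result
```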

* Update transcribe.py

---------

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
…ices (openai#1236)

* Updated README.md to provide more insight on BLEU and specific appendices in the research paper

* Update README.md

---------

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
prompt_reset_since is set before all_tokens is extended, and hence does not have the expected effect.
…mbeddings + change pipeline_transcriptions.py to allow save of embedding as npy arrays
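A hedged sketch of what saving embeddings as .npy arrays can look like (the tensor, its shape, and the filename below are placeholders, not taken from pipeline_transcriptions.py):

```python
import numpy as np
import torch

# Placeholder for an embedding captured during transcription, e.g. via a
# forward hook as sketched near the top of this page; the shape is illustrative.
encoder_embedding = torch.randn(1, 1500, 512)

np.save("audio_encoder_embedding.npy", encoder_embedding.squeeze(0).numpy())
reloaded = np.load("audio_encoder_embedding.npy")
print(reloaded.shape)  # (1500, 512)
```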

myyhlee commented May 22, 2023

> I have found your fork to be exceptionally valuable for extracting encoder and decoder embeddings. As a result, I have decided to integrate the modifications you made for extracting embeddings, ensuring they align with the most recent version of openai/whisper.

Thank you for your work. However, when I tried the "command-line usage" examples, it seems no embedding was saved. Could you explain in more detail how to get the embeddings using the usage examples in README.md?
