Updated Extraction of Embeddings #1

Open
wants to merge 77 commits into base: extract-embeddings

Conversation

ilanit1997

I have found your fork to be exceptionally valuable for extracting encoder and decoder embeddings. As a result, I have decided to integrate the modifications you made for extracting embeddings, ensuring they align with the most recent version of openai/whisper.
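For readers landing here, a minimal sketch of one way to capture encoder output from the stock openai/whisper Python API using a forward hook; this is only an illustration, not necessarily how this fork implements it, and "audio.wav" is a placeholder path.

```python
import whisper

model = whisper.load_model("base")
captured = {}

# Forward hook on the AudioEncoder: its output is the encoder embedding for
# the current 30-second mel window (overwritten on every forward pass).
def save_encoder_output(module, inputs, output):
    captured["encoder"] = output.detach().cpu()

hook = model.encoder.register_forward_hook(save_encoder_output)
result = model.transcribe("audio.wav")
hook.remove()

print(captured["encoder"].shape)  # (1, n_audio_ctx, n_audio_state), e.g. (1, 1500, 512) for base
```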

MichaelMonashev and others added 30 commits May 16, 2023 17:58
Fix bug: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)
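A hedged illustration of the kind of mismatch this commit message describes (the tensors and shapes below are made up): index_select requires the index tensor to live on the same device as the input.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 8, device=device)

idx = torch.tensor([0, 2])   # created on CPU by default
idx = idx.to(x.device)       # without this, CUDA runs raise the RuntimeError quoted above
selected = torch.index_select(x, 0, idx)
```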
Add project summary, license, etc. for display with
"pip show" and similar Python package distribution tools.
- The "large-v2" model is trained for more epochs with regularization and shows improved performance compared to the previous large.
- It has the same architecture as the original large model.
- When `load_model("large")` is called, the "large-v2" model will be loaded.
- We will soon update the paper regarding this new model.
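A short usage example of the aliasing described above; downloading the checkpoint requires network access.

```python
import whisper

# "large" now resolves to the large-v2 checkpoint; the architecture is unchanged.
model = whisper.load_model("large")
print(model.dims)
```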
* Update Hebrew language code to he per IANA registry

Per [IANA registry](https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry), `iw` was deprecated as the code for Hebrew in 1989 and the preferred code is `he`

The correct subtag: 
```
%%
Type: language
Subtag: he
Description: Hebrew
Added: 2005-10-16
Suppress-Script: Hebr
%%
``` 
And the deprecation
```
%%
Type: language
Subtag: iw
Description: Hebrew
Added: 2005-10-16
Deprecated: 1989-01-01
Preferred-Value: he
Suppress-Script: Hebr
%%
```

* Update hebrew ISO code to he

Per discussion, it's ok to make this change without backwards compatibility
s/successfully/successively, which I believe was the intent.
For a 30s long audio file which didn't have any silence, ndimage.median_filter took 7s, whereas signal.medfilt took 30s.
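A rough, illustrative way to reproduce that kind of comparison (the array shape and kernel size below are made up, and boundary handling differs slightly between the two functions):

```python
import time
import numpy as np
from scipy import ndimage, signal

x = np.random.randn(1500, 1500).astype(np.float32)

t0 = time.time()
a = ndimage.median_filter(x, size=(1, 7))
t1 = time.time()
b = signal.medfilt(x, kernel_size=(1, 7))
t2 = time.time()

print(f"ndimage.median_filter: {t1 - t0:.2f}s, signal.medfilt: {t2 - t1:.2f}s")
```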

Co-authored-by: Umar Farooqi <umar@paystash.com>
Co-authored-by: Jong Wook Kim <jongwook@nyu.edu>
…it (openai#681)

* Add github action to automatically push to pypi on Release x.y.z commit

* some housekeeping for pypi upload

* add version.py

Co-authored-by: Jong Wook Kim <jongwook@nyu.edu>
Co-authored-by: Jong Wook Kim <jongwook@openai.com>
jongwook and others added 29 commits May 16, 2023 17:58
* kwargs in decode() for convenience

* formatting fix
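A hedged example of the convenience the first bullet refers to: DecodingOptions fields can be passed to whisper.decode() as keyword arguments ("audio.wav" is a placeholder path).

```python
import whisper

model = whisper.load_model("base")
audio = whisper.pad_or_trim(whisper.load_audio("audio.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Keyword arguments are folded into DecodingOptions instead of building it by hand.
result = whisper.decode(model, mel, language="en", temperature=0.0)
print(result.text)
```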
* use tiktoken==0.3.0

* formatting

* tuple should be safer

* Update whisper/tokenizer.py

Co-authored-by: Ruhollah Majdoddin <r.majdodin@gmail.com>

* use tiktoken 0.3.1

* reflecting suggestions

* cleanup

* bypassing load_tiktoken_bpe to avoid blobfile dep

---------

Co-authored-by: Ruhollah Majdoddin <r.majdodin@gmail.com>
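A minimal sketch of the idea behind the last bullet above ("bypassing load_tiktoken_bpe to avoid blobfile dep"): parse a local ranks file (base64 token and rank per line) and build a tiktoken.Encoding directly. The filename and the GPT-2-style pat_str are illustrative, not necessarily what whisper/tokenizer.py uses verbatim.

```python
import base64
import tiktoken

def load_ranks(path: str) -> dict:
    # Each non-empty line is "<base64-encoded token> <rank>"; parsing it here
    # avoids tiktoken.load.load_tiktoken_bpe and its blobfile dependency.
    with open(path, "rb") as f:
        return {
            base64.b64decode(token): int(rank)
            for token, rank in (line.split() for line in f if line.strip())
        }

ranks = load_ranks("multilingual.tiktoken")   # placeholder filename
encoding = tiktoken.Encoding(
    name="multilingual",
    pat_str=r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
    mergeable_ranks=ranks,
    special_tokens={},
)
```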
* Fix alignment between the segments and the list of words

* Ensure the word index does not overflow
…ai#1076)

Co-authored-by: Akash Mahajan <akash.mahajan@microsoft.com>
Co-authored-by: Jong Wook Kim <jongwook@openai.com>
* Squash long words at window and sentence boundaries.

* Formatting requirements.

* Fix squashing logic to point to correct words.

---------

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
…ng optional (openai#1184)

* Add highlight_words, max_line_width, max_line_count

* Refactor subtitle generator

---------

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
Co-authored-by: Jong Wook Kim <jongwook@openai.com>
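A hedged command-line example of the subtitle options added in openai#1184 (flag names as merged upstream; --word_timestamps must be enabled for them to take effect, and audio.wav is a placeholder):

```
whisper audio.wav --output_format srt --word_timestamps True \
    --highlight_words True --max_line_width 42 --max_line_count 2
```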
* Update decoding.py

Following the suggestions of @Jeronymous in openai#914 and openai#924, this solves the endless-loop problem.

* Removed blank line and whitespaces in empty lines.

* Suggested changes according to the linter

---------

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
* python 3.11

* python 3.11

* fix

* fix

* fix

* revert changes

* Update requirements.txt

* Trying pip3 install instead

* Excluding cp39 - torch 1.10.2

* Removing 1.10.2 from test

---------

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
* Drop ffmpeg-python dependency and call ffmpeg directly.

The last ffmpeg-python module release was in 2019 [1], upstream seems to be
unavailable [2], and project development seems to have stagnated [3]. As the
features it provides are trivial to replace using the Python-native
subprocess module, drop the dependency.

 [1] <URL: https://github.com/kkroening/ffmpeg-python/tags >
 [2] <URL: kkroening/ffmpeg-python#760 >
 [3] <URL: https://openhub.net/p/ffmpeg-python >
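A minimal sketch of loading audio this way, in the spirit of the change (the real whisper.audio.load_audio differs in details such as error reporting):

```python
import subprocess
import numpy as np

def load_audio(path: str, sample_rate: int = 16000) -> np.ndarray:
    # Decode to mono 16-bit PCM on stdout; no ffmpeg-python wrapper needed.
    cmd = [
        "ffmpeg", "-nostdin", "-i", path,
        "-f", "s16le", "-ac", "1", "-acodec", "pcm_s16le",
        "-ar", str(sample_rate), "-",
    ]
    out = subprocess.run(cmd, capture_output=True, check=True).stdout
    return np.frombuffer(out, np.int16).astype(np.float32) / 32768.0
```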

* Rewrote to use subprocess.run() instead of subprocess.Popen().

* formatting changes

* formatting update

* isort fix

* Error checking

* isort 🤦🏻

* flake8 fix

* minor spelling changes

---------

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
* Avoid computing higher temperatures on no_speech

In decode_with_fallback, we retry at higher temperatures when compression_ratio is too high or avg_logprob is too low.
But since the computation of no_speech_prob doesn't depend on sampling, we can avoid computing higher temperatures if the first attempt already shows that the no_speech condition is fulfilled.
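A hedged, self-contained sketch of that short-circuit; the helper decode_at and the default thresholds are illustrative stand-ins for whisper's transcribe.py internals, not the exact diff.

```python
from typing import Callable, Sequence

def decode_with_fallback_sketch(
    decode_at: Callable[[float], "DecodingResult"],  # hypothetical: decode one segment at a temperature
    temperatures: Sequence[float],
    compression_ratio_threshold: float = 2.4,
    logprob_threshold: float = -1.0,
    no_speech_threshold: float = 0.6,
):
    result = None
    for t in temperatures:
        result = decode_at(t)

        # no_speech_prob does not depend on the sampling temperature, so a
        # segment detected as silence is returned immediately instead of
        # being retried at higher temperatures.
        if result.no_speech_prob > no_speech_threshold:
            return result

        needs_fallback = (
            result.compression_ratio > compression_ratio_threshold
            or result.avg_logprob < logprob_threshold
        )
        if not needs_fallback:
            return result
    return result
```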

* Update transcribe.py

---------

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
…ices (openai#1236)

* Updated README.md to provide more insight on BLEU and specific appendices in the research paper

* Update README.md

---------

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
prompt_reset_since is set before all_tokens is extended, and hence does not have the expected effect.
…mbeddings + change pipeline_transcriptions.py to allow save of embedding as npy arrays
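A hedged sketch of what saving embeddings as .npy arrays can look like (the tensor, its shape, and the filename below are placeholders, not taken from pipeline_transcriptions.py):

```python
import numpy as np
import torch

# Placeholder for an embedding captured during transcription, e.g. via a
# forward hook as sketched near the top of this page; the shape is illustrative.
encoder_embedding = torch.randn(1, 1500, 512)

np.save("audio_encoder_embedding.npy", encoder_embedding.squeeze(0).numpy())
reloaded = np.load("audio_encoder_embedding.npy")
print(reloaded.shape)  # (1500, 512)
```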

myyhlee commented May 22, 2023

> I have found your fork to be exceptionally valuable for extracting encoder and decoder embeddings. As a result, I have decided to integrate the modifications you made for extracting embeddings, ensuring they align with the most recent version of openai/whisper.

Thank you for your work. However, when I tried the "command-line usage" examples, it seems no embedding was saved. Could you explain in more detail how to get the embeddings using the usage examples in README.md?
