📻 (AST) Audio data classification optimization and data pre-process #762

trajepl · 2023-11-29T11:47:43Z

Describe your changes

For issue #735, I added an example of AST model optimization that uses Olive data configs with the Hugging Face examples, so the model can be optimized without a custom script.

Checklist before requesting a review

Add unit tests for this change.
Make sure all tests can pass.
Update documents if necessary.
Lint and apply fixes to your code by running lintrunner -a
Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

(Optional) Issue link

vymao · 2023-11-29T21:24:00Z

Thanks. Though one thing to note is that the actual speech_commands dataset by itself doesn't match what the AST-v2 model was fine-tuned on, and you probably need to do some pre-processing to remove extra labels: huggingface/datasets#6446
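
A minimal sketch of how the mismatch described above could be detected before evaluation. The label lists here are shortened, hypothetical stand-ins, not the real speech_commands or AST-v2 label sets, and `find_extra_labels` is an illustrative helper rather than anything in Olive or datasets.

```python
def find_extra_labels(dataset_label_names, model_label2id):
    """Return dataset label names that the model's label2id mapping does not contain."""
    return [name for name in dataset_label_names if name not in model_label2id]


# Hypothetical, shortened label sets for illustration only.
dataset_label_names = ["yes", "no", "up", "_silence_"]
model_label2id = {"no": 0, "up": 1, "yes": 2}

extra_labels = find_extra_labels(dataset_label_names, model_label2id)
print(extra_labels)  # ['_silence_']
```

Any name reported here would need to be filtered out of the dataset before its labels are remapped to the model's ids.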

vymao · 2023-11-29T22:57:41Z

Also, I think you need to query the file column from the examples rather than from the tokenized inputs. I'm currently getting an error from here:

  File "/Users/victor/anaconda3/envs/transformers-v2/lib/python3.9/site-packages/olive/data/component/pre_process_data.py", line 259, in _tokenizer_and_align_labels
    for files in tokenized_inputs[kwargs.get("file_column_name", "file")]:
  File "/Users/victor/anaconda3/envs/transformers-v2/lib/python3.9/site-packages/transformers/feature_extraction_utils.py", line 86, in __getitem__
    return self.data[item]
KeyError: 'file'
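
A rough sketch of the idea behind this report, using plain dicts in place of the real batch and feature-extractor output. The assumption is that the extractor output only carries model inputs, so a dataset column such as `file` has to be read from the raw examples. `read_file_column` and the sample data are hypothetical and are not Olive's actual implementation.

```python
def read_file_column(examples, file_column_name="file"):
    """Fetch the file column from the raw batch, with a descriptive error if it is missing."""
    if file_column_name not in examples:
        raise KeyError(
            f"column {file_column_name!r} not found in examples; available: {sorted(examples)}"
        )
    return examples[file_column_name]


# Hypothetical raw batch and extractor output (plain dicts stand in for the real objects).
examples = {"file": ["yes/a.wav", "no/b.wav"], "label": [0, 1]}
tokenized_inputs = {"input_values": [[0.1, 0.2], [0.3, 0.4]]}

print("file" in tokenized_inputs)  # False: the extractor output has no such key
print(read_file_column(examples))  # ['yes/a.wav', 'no/b.wav']
```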

trajepl · 2023-11-30T05:16:44Z

> Thanks. Though one thing to note is that the actual speech_commands dataset by itself doesn't match what the AST-v2 model was fine-tuned on, and you probably need to do some pre-processing to remove extra labels: huggingface/datasets#6446

Yes, I was manually aligning the labels by file prefix. The approach you pointed to is better. Updated.

trajepl · 2023-11-30T05:17:01Z

> Also I think you need to query file from the examples, currently getting an error from here:
>
>   File "/Users/victor/anaconda3/envs/transformers-v2/lib/python3.9/site-packages/olive/data/component/pre_process_data.py", line 259, in _tokenizer_and_align_labels
>     for files in tokenized_inputs[kwargs.get("file_column_name", "file")]:
>   File "/Users/victor/anaconda3/envs/transformers-v2/lib/python3.9/site-packages/transformers/feature_extraction_utils.py", line 86, in __getitem__
>     return self.data[item]
> KeyError: 'file'

Fixed.

trajepl · 2023-11-30T09:48:36Z

[Do not merge] Let us wait for the customer's feedback.

vymao · 2023-11-30T18:59:53Z

From the linked issue, you probably need to remove the _silence_ label before running align_labels_with_mapping; I'm currently getting a _silence_ KeyError. Not very clean, but maybe you can make this an option.
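
One way the suggestion above could look, sketched with plain Python lists instead of a real dataset. `drop_then_align` and its `drop_labels` option are hypothetical names, and the label mapping is a shortened stand-in. With Hugging Face datasets, the equivalent would presumably be a `Dataset.filter` call followed by `Dataset.align_labels_with_mapping`, but that is an assumption about how the steps would be wired together, not a description of the code in this PR.

```python
def drop_then_align(examples, model_label2id, drop_labels=("_silence_",)):
    """Drop examples with unwanted labels, then map remaining label names to model ids."""
    kept = [ex for ex in examples if ex["label"] not in drop_labels]
    # After filtering, every remaining label name is expected to be a key in model_label2id.
    return [{**ex, "label": model_label2id[ex["label"]]} for ex in kept]


# Hypothetical examples and a shortened label mapping, for illustration only.
examples = [
    {"file": "a.wav", "label": "yes"},
    {"file": "b.wav", "label": "_silence_"},
    {"file": "c.wav", "label": "no"},
]
model_label2id = {"no": 0, "yes": 1}

aligned = drop_then_align(examples, model_label2id)
print(aligned)  # [{'file': 'a.wav', 'label': 1}, {'file': 'c.wav', 'label': 0}]
```

Passing `drop_labels=()` skips the filter, in which case the `_silence_` example would raise the same kind of KeyError during the mapping step.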

jambayk · 2023-11-30T21:10:51Z

/azp run

azure-pipelines · 2023-11-30T21:11:03Z

Azure Pipelines successfully started running 1 pipeline(s).

trajepl · 2023-12-01T01:13:03Z

> From the linked issue, you probably need to remove the _silence_ label before running align_labels_with_mapping, currently getting a _silence_ KeyError. Not very clean but maybe you can make this an option.

https://github.com/microsoft/Olive/pull/762/files#:~:text=%23%20align%20labels%20with,%5B0%5D)
I already did that, and the config example works for me. When do you hit the _silence_ KeyError? Have you updated your config JSON to the latest one in this PR? @vymao

Latest changes to the JSON config:
https://github.com/microsoft/Olive/pull/762/files#diff-f9735dfb6fa445feeb666b7f11c9b670283b799fecf876683962741bee8733fe:~:text=%22component_kwargs%22%3A%20%7B,%7D

vymao · 2023-12-02T02:27:35Z

Thanks, works.

trajepl added 2 commits November 29, 2023 15:41

fix

1c1c539

fix

35d6279

trajepl changed the title ~~(AST) Audio data classification optimization and data pre-process~~ 📻 (AST) Audio data classification optimization and data pre-process Nov 29, 2023

trajepl mentioned this pull request Nov 29, 2023

[Bug]: MIT/ast-finetuned-speech-commands-v2 leads to KeyError in evaluation? #735

Closed

guotuofeng reviewed Nov 30, 2023

olive/data/component/pre_process_data.py (outdated, resolved)

trajepl added 3 commits November 30, 2023 13:09

fix with dataset align_labels_with_mapping

cb05607

typing fix

de1fa1b

Merge branch 'main' of https://github.com/microsoft/olive into jiapli/audio_data_preprocess

156c82d

xiaoyu-work approved these changes Nov 30, 2023

Merge branch 'main' into jiapli/audio_data_preprocess

a9e413a

trajepl merged commit 32322a8 into main Dec 1, 2023
31 checks passed

trajepl deleted the jiapli/audio_data_preprocess branch December 1, 2023 06:28
