Use of pretrained weights #5
Hi,

Thanks for making your code open-source. I plan on using your models for tasks involving audio (voice and speech) of hospital patients with respiratory issues. We have had success with pre-trained models before, and yours seems fairly well suited to the type of tasks we consider. As we deal with Brazilian Portuguese audio, we would like to test whether performing additional pre-training on unlabeled Brazilian Portuguese audio data, on top of the weights already pre-trained on AudioSet, could lead to improved results on the types of health-related downstream tasks we will consider later.

I have been able to start pre-training from scratch on a given set of audios following your instructions, using the script train_audio.py. After inspecting the code, I believe loading the pretrained weights (to perform further pre-training) involves passing a --resume argument on the command line, so I run:

```
python train_audio.py --csv_main=my_audios.csv --resume=path_to_pretrained_weights
```

Is this understanding correct? It returns the following error when loading the weights:

I am using the m2d_vit_base-80x608p16x16-221006-mr7 weights you provided. I have also set up the conda environment as suggested in another issue here. Do you have an idea of why the error message is displayed? I have not made any changes to the script train_audio.py, I have set up my_audios.csv following your instructions on a set of Brazilian Portuguese audios we have available, and I am loading the correct path to the pretrained model.

Thanks!
Hi, thanks for your interest. I'm glad to hear that the pre-trained weight is fairly suitable for your tasks so far. It looks like an environmental issue, such as the torch version. The following command works in my environment:

```
python train_audio.py --csv_main data/files_icbhi2017.csv --resume m2d_vit_base-80x608p16x16-221006-mr7/checkpoint-300.pth
```

The PyTorch version is:

```
>>> import torch; torch.__version__
'2.1.2'
```

Then, could you renew your PyTorch? As you have set up the conda environment, yours could be the following if your CUDA version is <= 11.8:

```
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
```

In addition, I am willing to help with your attempt at the additional (further) pre-training. I know that it would need adjusting the pre-training settings (e.g., training and warm-up epochs) in the next step, for example:

```
python train_audio.py --csv_main data/files_icbhi2017.csv --csv_bg data/files_f_s_d_5_0_k.csv --resume m2d_vit_base-80x608p16x16-221006-mr7/checkpoint-300.pth
```

Anyway, let's make the torch.load() issue clear.
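For future readers, a minimal sketch for isolating such a loading error by checking the environment and the checkpoint file directly (the path follows this thread; what the printed keys contain varies by checkpoint):

```python
# Minimal sanity check for the torch.load() issue discussed above.
# Assumption: the checkpoint path matches the extracted release layout
# and the file is a standard PyTorch checkpoint dict.
import torch

print(torch.__version__)  # the maintainer's environment reports '2.1.2'

ckpt = torch.load(
    "m2d_vit_base-80x608p16x16-221006-mr7/checkpoint-300.pth",
    map_location="cpu",
)
print(list(ckpt.keys()))  # inspect the top-level entries of the checkpoint
```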
Thank you for your answer. I inspected my commands and found that I had not realized the pretrained weights file was zipped. To get it to work, all I had to do was unzip the file and give the correct path. That fixes the torch.load issue. Now, if you are willing to help with further pre-training, I have a couple of questions:
To further explain how your model will be used: first, we will do additional pre-training on Brazilian Portuguese speech data and test the new and perhaps improved model on standard Brazilian Portuguese speech tasks we will prepare. If we are successful, the model will later be fine-tuned on hospital patients suffering from respiratory problems. For the part with healthcare patient audio, the dataset sizes will be very small (i.e., minutes, at best reaching 1-2 hours), so it is very important that transfer learning is performed effectively. The hospital data collection will most likely start in a few months, but the first part can already be done now, and will be done to prepare models that can be fine-tuned quickly once we actually get the data. Thanks for your help!
Hi, I have summarized a guideline based on what I have experienced: https://github.com/nttcslab/m2d/blob/master/Guide_app.md

Based on it, quick comments for your use case are:

Recommendation for your #1: "we will do additional pre-training on Brazilian Portuguese speech data and test the new and perhaps improved model on standard Brazilian Portuguese speech tasks we will prepare."

Recommendation for your #2: "the model will later be fine-tuned on hospital patients suffering from respiratory problems."
Further pre-training (Fur-PT) guide

A possible command line for you is:

```
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 train_audio.py --epochs 600 --warmup_epochs 24 --resume m2d_vit_base-80x608p16x16-221006-mr7/checkpoint-300.pth --model m2d_x_vit_base --batch_size 32 --accum_iter 4 --csv_main __your__.csv --csv_bg_noise data/files_f_s_d_5_0_k.csv --noise_ratio 0.01 --save_freq 100 --eval_after 600 --seed 3 --teacher m2d_vit_base-80x608p16x16-221006-mr7/checkpoint-300.pth --blr 3e-4 --loss_off 1. --min_ds_size 10000
```

The following options are for using an existing weight for initialization (resume) and for using a teacher model in the M2D-X regularization setting:

```
--resume m2d_vit_base-80x608p16x16-221006-mr7/checkpoint-300.pth
--teacher m2d_vit_base-80x608p16x16-221006-mr7/checkpoint-300.pth
```

These parameters set an effective batch size of 128, which we used in our Fur-PT:

```
--batch_size 32 --accum_iter 4
```

The number of epochs of 600 (and warm-up epochs of 24) may be good, but could be decreased:

```
--epochs 600 --warmup_epochs 24 --save_freq 100 --eval_after 600
```

This option virtually increases the dataset size by repeating the list of samples; use it if you have fewer than 5,000 samples:

```
--min_ds_size 10000
```

You can change the learning rate with the --blr option (3e-4 in the command above).
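To make two of these settings concrete, here is a small sketch; it mirrors the numbers in this thread but is an illustration, not the actual train_audio.py internals:

```python
# Illustrative sketch only; not the actual train_audio.py implementation.

# 1) Effective batch size: per the comments in this thread, --batch_size 32
#    with --accum_iter 4 yields an effective batch of 128.
batch_size, accum_iter = 32, 4
effective_batch = batch_size * accum_iter
assert effective_batch == 128

# 2) --min_ds_size: one plausible reading of "repeating the list of samples"
#    to virtually enlarge a small dataset.
def enlarge(files, min_ds_size):
    """Repeat the file list until it reaches at least min_ds_size entries."""
    if len(files) >= min_ds_size:
        return files
    repeats = -(-min_ds_size // len(files))  # ceiling division
    return (files * repeats)[:min_ds_size]

print(len(enlarge(["a.wav", "b.wav", "c.wav"], 10)))  # -> 10
```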
Thanks for your comments and help. From what I understand, I have four options:
*A small correction: after updating PyTorch to the latest version, I have found it is possible to set the batch size to 64, not just 32. So we can set --accum_iter to 2 instead of 4, which should help a little. Thank you very much for your help.
A small addendum for others: to set up distributed mode, I had to adapt the command line to be:

```
CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2 train_audio.py ...
```

Without adding torch.distributed.launch, the model does not run in distributed mode.
Yes, you're correct. My example was wrong and has been corrected above (for somebody else's future reference). I will answer the questions above...
Regarding your questions about the four options, yes, those are your options. It was nice that you figured out the 4th option, and I also recommend a 5th option. First, the combination of the number of epochs and the batch size matters because we use an EMA-updated target encoder and an annealing learning-rate schedule.
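To unpack why those two mechanisms couple the epoch count and batch size, here is a rough sketch in the spirit of MAE-style training loops; the function bodies are illustrative stand-ins, not the actual train_audio.py code:

```python
# Illustrative sketch; hyperparameters mirror the command above, but these
# functions are simplified stand-ins for what train_audio.py actually does.
import math
import torch

def lr_at(epoch, base_lr=3e-4, warmup_epochs=24, total_epochs=600, min_lr=0.0):
    """Linear warm-up followed by half-cycle cosine annealing."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + (base_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))

def ema_update(target_params, online_params, momentum=0.9995):
    """Move the target (teacher) encoder slowly toward the online encoder."""
    for t, o in zip(target_params, online_params):
        t.data.mul_(momentum).add_(o.data, alpha=1.0 - momentum)

online = [torch.randn(4)]
target = [p.clone() for p in online]
online[0] += 0.1               # pretend one optimizer step happened
ema_update(target, online)     # the target drifts slowly toward the online encoder
print(lr_at(12), lr_at(300))   # a warm-up value vs. a mid-annealing value
```

Fewer epochs or a larger batch means fewer optimizer and EMA steps, so both the annealing curve and the target encoder's trajectory change, not just the wall-clock time.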
Answers to your four options, and one more from me follow.
Thanks for the answer. I will be analyzing and testing how to schedule the learning rate for option 4, then. That was a very helpful comment that would have taken me time to figure out myself. Option 5 is nice, and I will do it once I actually get the data. As I mentioned, the hospital patient data will be collected over the next months and might take a year or so to reach sufficient quantity. Until then, I can perform the other four options. I am closing the issue now, as I believe it has been solved. You were a great help.
By the way, one last question: do you have an intuition about what values to expect for the loss during pre-training? While the primary measure will be performance on downstream tasks, I am curious whether I can rule out certain runs based on their initial pre-training behavior, i.e., when the loss simply does not get low enough for M2D to have learned effective audio representations.
Regarding the loss values, I have included a log of the ICBHI 2017 Fur-PT here. Regarding the performance check, we use linear evaluation with our evaluator, EVAR; please check "2. Evaluating M2D" in the README.md for details. Regarding the evaluation of speech tasks, the 16x16 model does not perform well on tasks such as ASR/phoneme recognition. Lastly, in our experiments on ICBHI 2017, the 16x4 (40-ms time frame) model outperformed the 16x16 model. Please let me know if you have trouble with the EVAR setup.
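For readers unfamiliar with the protocol: linear evaluation trains only a linear classifier on frozen embeddings. A minimal sketch with placeholder data follows; EVAR automates the real thing, the 768-dimensional features match ViT-Base, and everything else here is invented for illustration:

```python
# Linear-probe sketch with synthetic stand-in data; EVAR handles the real
# feature extraction and task datasets.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 768))  # frozen M2D embeddings (ViT-Base: 768-d)
y_train = rng.integers(0, 4, 200)          # e.g., a 4-class task
X_test = rng.standard_normal((50, 768))
y_test = rng.integers(0, 4, 50)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("linear-probe accuracy:", clf.score(X_test, y_test))
```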
Thank you for your answer. The losses I see seem to be more or less in line with what you observed, though it might be possible to do better, as I am using a larger dataset and the learning-rate schedule could be improved. I will set up EVAR as well as additional Brazilian Portuguese speech tasks (they are classification tasks, such as emotion, speaker, or gender recognition). I intend to avoid ASR tasks, as my understanding was the same as yours: finer time resolution is needed for those. I will ask again if I have problems with EVAR. I also found out I can access a server with 8 A100 GPUs, which is probably enough for pre-training from scratch, as well as most configurations I can imagine using your model.

An additional question, not directly related to your work: in our experiments with multiple pre-trained models, we have found regression tasks to be typically hard, with the task often becoming easier when adapted into some sort of multiclass classification. This seems to be in line with what other researchers have reported to us, namely that it is better to change a regression task into a multiclass classification. I am curious whether you have encountered similar issues, and what your opinion on the matter is. My impression is that the losses we use (the ones I tried were masked-reconstruction losses, but yours is a little different) give the model information about the data distribution, but not about how the space of data points evolves from one state to another, which is quite possibly necessary information for a regression task, since there is a metric on the labels. I am curious whether some form of loss term could be added that helps models also perform well on regression tasks via transfer learning.
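As a concrete illustration of the workaround described above (discretizing a continuous target so a classifier can be used), here is a sketch; the bin count and the severity-score example are invented for illustration:

```python
# Turn a regression target into ordinal classes via quantile binning.
# All numbers here are placeholders, not from any dataset in this thread.
import numpy as np

def to_class_labels(y, n_bins=5):
    """Map continuous targets to class indices 0..n_bins-1 by quantiles."""
    inner_edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(y, inner_edges)

y = np.random.default_rng(0).uniform(0, 5, size=100)  # e.g., severity scores
labels = to_class_labels(y)                           # train a 5-way classifier
print(np.bincount(labels))                            # roughly balanced bins
```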
Please find the logs here: example_logs.zip. I added M2D and M2D-S logs in addition to the M2D-X log for ICBHI 2017. And it's a great question; however, I have no experience with regression tasks. Actually, we need further investigation to understand how masked reconstruction or M2D models the input signal in the output features.
Thank you for the logs; they will be helpful. Unfortunately, I have only exchanged words with other researchers who reported this issue; I have not found a paper documenting the problem. It would be an interesting research problem, in particular for us, as regression tasks will likely be part of the set of tasks we will have to consider for respiratory problems in hospital patients. Indeed, I agree that M2D or masked reconstruction probably induces models to learn simple combinations of the patterns found in the spectrograms, and does not lead them to understand the underlying mechanism of sound generation. That is an interesting research question as well.