GTZAN and CREMA-D #2

Closed

yunzqq opened this issue May 29, 2023 · 9 comments

Comments

@yunzqq

yunzqq commented May 29, 2023

Hi,
Could you please provide the train/test split settings for these two datasets?
Thank you!

@daisukelab
Collaborator

Hi, thanks for your interest.

You can find them in the evaluation package EVAR.
Please follow: https://github.com/nttcslab/eval-audio-repr/blob/main/Preparing-datasets.md

Data splits can be found in evar/metadata/*.csv

Hope it helps.
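
In case it helps further, here is a minimal sketch (not part of the repository) of how the split files could be inspected once the datasets are prepared. The file name gtzan.csv and the column names file_name/label/split are assumptions to verify against the CSVs that the preparation steps actually generate:

import pandas as pd

# Hypothetical example: inspect the GTZAN split shipped with EVAR.
# The CSV path and column names below are assumptions; check the header
# of the files under evar/metadata/ after running the preparation steps.
df = pd.read_csv('evar/metadata/gtzan.csv')
print(df.columns.tolist())            # e.g. ['file_name', 'label', 'split']
print(df['split'].value_counts())     # number of files per train/valid/test partition
train_files = df[df['split'] == 'train']['file_name'].tolist()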

@yunzqq
Author

yunzqq commented May 30, 2023

> Hi, thanks for your interest.
>
> You can find them in the evaluation package EVAR. Please follow: https://github.com/nttcslab/eval-audio-repr/blob/main/Preparing-datasets.md
>
> Data splits can be found in evar/metadata/*.csv
>
> Hope it helps.

Many thanks!

@daisukelab
Collaborator

Please let us know if you publish your paper in the future!

@yunzqq
Author

yunzqq commented May 30, 2023

> Please let us know if you publish your paper in the future!

Haha, OK!

@yunzqq
Author

yunzqq commented Jun 2, 2023

> Please let us know if you publish your paper in the future!

May I ask another question? For long audio recordings, how long are the clips used for training? And at inference time, is the recording split into clips, with the averaged logits used for the final classification result?

Best,
Qiquan

@daisukelab
Collaborator

> May I ask another question? For long audio recordings, how long are the clips used for training? And at inference time, is the recording split into clips, with the averaged logits used for the final classification result?

Thank you for your question! The quick answer is that you are correct.

  • While pre-training, we randomly crop a fixed duration from training samples.
  • At inference time, we do the following:

[Attached image: illustration of the inference-time procedure]

And the runtime implementation is as follows:

m2d/m2d/runtime_audio.py

Lines 173 to 225 in 4cdffb0

def encode_lms(self, lms, return_layers=False):
    x = lms
    patch_fbins = self.backbone.grid_size()[0]
    unit_frames = self.cfg.input_size[1]
    embed_d = self.backbone.patch_embed.proj.out_channels
    cur_frames = x.shape[-1]
    pad_frames = unit_frames - (cur_frames % unit_frames)
    if pad_frames > 0:
        x = torch.nn.functional.pad(x, (0, pad_frames))
    embeddings = []
    if self.cfg.flat_features:
        # flatten all patch embeddings
        mask_ratio = self.cfg.training_mask if self.training else 0.0
        for i in range(x.shape[-1] // unit_frames):
            emb, *_ = self.backbone.forward_encoder(x[..., i*unit_frames:(i+1)*unit_frames], mask_ratio=mask_ratio, return_layers=return_layers)
            cls_token, emb = emb[..., :1, :], emb[..., 1:, :]
            if self.cfg.cls_token:
                # prepend cls token to all frame features.
                # in:
                #   cls_token.shape -> [B, 1, D]
                #   emb.shape -> [B, T*F, D]
                # out:
                #   emb.shape -> [B, 1 + T*F, D]
                emb = torch.cat([cls_token, emb], axis=-1)
            embeddings.append(emb)
        x = torch.cat(embeddings, axis=-2)
        # note: we are not removing the padding frames.
    else:
        # stack embeddings along time frame
        for i in range(x.shape[-1] // unit_frames):
            emb, *_ = self.backbone.forward_encoder(x[..., i*unit_frames:(i+1)*unit_frames], mask_ratio=0., return_layers=return_layers)
            cls_token, emb = emb[..., :1, :], emb[..., 1:, :]
            if len(emb.shape) > 3:
                emb = rearrange(emb, 'L b (f t) d -> L b t (f d)', f=patch_fbins, d=embed_d)  # Layer-wise embeddings
            else:
                emb = rearrange(emb, 'b (f t) d -> b t (f d)', f=patch_fbins, d=embed_d)
            if self.cfg.cls_token:
                # prepend cls token to all frame features.
                # cat([L, B, 1, D].repeat(1, T, 1), [L, B, T, F*D]) -> [L, B, T, (1 + F)*D] or
                # cat([B, 1, D].repeat(1, T, 1), [B, T, F*D]) -> [B, T, (1 + F)*D]
                emb = torch.cat([cls_token.repeat(*([1]*(len(emb.shape) - 2)), emb.shape[-2], 1), emb], axis=-1)
            embeddings.append(emb)
        # cut the padding at the end
        x = torch.cat(embeddings, axis=-2)
        pad_emb_frames = int(embeddings[0].shape[-2] * pad_frames / unit_frames)
        # print(2, x.shape, embeddings[0].shape, pad_emb_frames)
        if pad_emb_frames > 0:
            x = x[..., :-pad_emb_frames, :]  # remove padded tail
        # print(3, x.shape)
    return x if len(emb.shape) == 3 else [x_ for x_ in x]
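
To make the above concrete, here is a minimal sketch (assumptions only, not the repository's training or evaluation code) of the two pieces described: a fixed-duration random crop for training, and chunk-wise encoding followed by averaging the logits at inference time. The linear head and the feature-shape comments are hypothetical:

import torch

def random_crop_lms(lms, unit_frames):
    # Randomly crop a fixed number of time frames from a log-mel spectrogram [B, 1, F, T].
    # If the sample is shorter than the crop length, zero-pad the tail instead.
    T = lms.shape[-1]
    if T <= unit_frames:
        return torch.nn.functional.pad(lms, (0, unit_frames - T))
    start = torch.randint(0, T - unit_frames + 1, (1,)).item()
    return lms[..., start:start + unit_frames]

@torch.no_grad()
def classify_long_recording(model, head, lms):
    # Encode the whole recording with encode_lms() above (it chunks internally),
    # apply a hypothetical linear classification head per time step,
    # then average the logits over time for the clip-level prediction.
    feats = model.encode_lms(lms)      # e.g. [B, T', F*D] in the frame-wise setting
    logits = head(feats)               # [B, T', num_classes]
    return logits.mean(dim=1)          # [B, num_classes]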

Please let me know if you have any more questions.

daisukelab reopened this Jun 2, 2023
@yunzqq
Author

yunzqq commented Jun 4, 2023

Many thanks for your help! Are you in Greece for ICASSP 2023? If so, may I talk with you there? Haha!

@daisukelab
Collaborator

Yes, I'm presenting this paper:

AASP-P4: Anomaly Detection and Representation Learning for Audio Classification
Room: Poster Area 2 - Garden | Type: Poster | 03:35 PM to 5:05 PM
1773 (AASP-P4.3): MASKED MODELING DUO: LEARNING REPRESENTATIONS BY ENCOURAGING BOTH NETWORKS TO MODEL THE INPUT

See you there! :)

@yunzqq
Author

yunzqq commented Jun 5, 2023

> Yes, I'm presenting this paper:
>
> AASP-P4: Anomaly Detection and Representation Learning for Audio Classification
> Room: Poster Area 2 - Garden | Type: Poster | 03:35 PM to 5:05 PM
> 1773 (AASP-P4.3): MASKED MODELING DUO: LEARNING REPRESENTATIONS BY ENCOURAGING BOTH NETWORKS TO MODEL THE INPUT
>
> See you there! :)

Many thanks!
See you there!
Best,

Qiquan Zhang
