GTZAN and CREMA-D #2

Closed

yunzqq opened this issue May 29, 2023 · 9 comments

Comments

@yunzqq

yunzqq commented May 29, 2023

Hi,
Could you please provide the train/test split settings for these two datasets?
Thank you!

@daisukelab
Collaborator

Hi, thanks for your interest.

You can find them in the evaluation package EVAR.
Please follow: https://github.com/nttcslab/eval-audio-repr/blob/main/Preparing-datasets.md

Data splits can be found in evar/metadata/*.csv

Hope it helps.
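
In case it helps further, here is a minimal sketch (not part of the repository) of how the split files could be inspected once the datasets are prepared. The file name gtzan.csv and the column names file_name/label/split are assumptions to verify against the CSVs that the preparation steps actually generate:

import pandas as pd

# Hypothetical example: inspect the GTZAN split shipped with EVAR.
# The CSV path and column names below are assumptions; check the header
# of the files under evar/metadata/ after running the preparation steps.
df = pd.read_csv('evar/metadata/gtzan.csv')
print(df.columns.tolist())            # e.g. ['file_name', 'label', 'split']
print(df['split'].value_counts())     # number of files per train/valid/test partition
train_files = df[df['split'] == 'train']['file_name'].tolist()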

@yunzqq
Author

yunzqq commented May 30, 2023

> Hi, thanks for your interest.
>
> You can find them in the evaluation package EVAR. Please follow: https://github.com/nttcslab/eval-audio-repr/blob/main/Preparing-datasets.md
>
> Data splits can be found in evar/metadata/*.csv
>
> Hope it helps.

Many thanks!

@daisukelab
Collaborator

Please let us know if you publish your paper in the future!

@yunzqq
Author

yunzqq commented May 30, 2023

> Please let us know if you publish your paper in the future!

Haha, OK!

@yunzqq
Author

yunzqq commented Jun 2, 2023

> Please let us know if you publish your paper in the future!

May I ask another question? For long audio recordings, how long are the clips used for training? And at inference time, is the recording split into clips, with the averaged logits used for the final classification result?

Best,
Qiquan

@daisukelab
Collaborator

> May I ask another question? For long audio recordings, how long are the clips used for training? And at inference time, is the recording split into clips, with the averaged logits used for the final classification result?

Thank you for your question! The quick answer is that you are correct.

  • While pre-training, we randomly crop a fixed duration from training samples.
  • At inference time, we do the following:

[Attached image: illustration of the inference-time procedure]

And the runtime implementation is as follows:

m2d/m2d/runtime_audio.py

Lines 173 to 225 in 4cdffb0

def encode_lms(self, lms, return_layers=False):
    x = lms
    patch_fbins = self.backbone.grid_size()[0]
    unit_frames = self.cfg.input_size[1]
    embed_d = self.backbone.patch_embed.proj.out_channels
    cur_frames = x.shape[-1]
    pad_frames = unit_frames - (cur_frames % unit_frames)
    if pad_frames > 0:
        x = torch.nn.functional.pad(x, (0, pad_frames))
    embeddings = []
    if self.cfg.flat_features:
        # flatten all patch embeddings
        mask_ratio = self.cfg.training_mask if self.training else 0.0
        for i in range(x.shape[-1] // unit_frames):
            emb, *_ = self.backbone.forward_encoder(x[..., i*unit_frames:(i+1)*unit_frames], mask_ratio=mask_ratio, return_layers=return_layers)
            cls_token, emb = emb[..., :1, :], emb[..., 1:, :]
            if self.cfg.cls_token:
                # prepend cls token to all frame features.
                # in:
                #   cls_token.shape -> [B, 1, D]
                #   emb.shape -> [B, T*F, D]
                # out:
                #   emb.shape -> [B, 1 + T*F, D]
                emb = torch.cat([cls_token, emb], axis=-1)
            embeddings.append(emb)
        x = torch.cat(embeddings, axis=-2)
        # note: we are not removing the padding frames.
    else:
        # stack embeddings along time frame
        for i in range(x.shape[-1] // unit_frames):
            emb, *_ = self.backbone.forward_encoder(x[..., i*unit_frames:(i+1)*unit_frames], mask_ratio=0., return_layers=return_layers)
            cls_token, emb = emb[..., :1, :], emb[..., 1:, :]
            if len(emb.shape) > 3:
                emb = rearrange(emb, 'L b (f t) d -> L b t (f d)', f=patch_fbins, d=embed_d)  # Layer-wise embeddings
            else:
                emb = rearrange(emb, 'b (f t) d -> b t (f d)', f=patch_fbins, d=embed_d)
            if self.cfg.cls_token:
                # prepend cls token to all frame features.
                # cat([L, B, 1, D].repeat(1, T, 1), [L, B, T, F*D]) -> [L, B, T, (1 + F)*D] or
                # cat([B, 1, D].repeat(1, T, 1), [B, T, F*D]) -> [B, T, (1 + F)*D]
                emb = torch.cat([cls_token.repeat(*([1]*(len(emb.shape) - 2)), emb.shape[-2], 1), emb], axis=-1)
            embeddings.append(emb)
        # cut the padding at the end
        x = torch.cat(embeddings, axis=-2)
        pad_emb_frames = int(embeddings[0].shape[-2] * pad_frames / unit_frames)
        # print(2, x.shape, embeddings[0].shape, pad_emb_frames)
        if pad_emb_frames > 0:
            x = x[..., :-pad_emb_frames, :]  # remove padded tail
        # print(3, x.shape)
    return x if len(emb.shape) == 3 else [x_ for x_ in x]
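
To make the above concrete, here is a minimal sketch (assumptions only, not the repository's training or evaluation code) of the two pieces described: a fixed-duration random crop for training, and chunk-wise encoding followed by averaging the logits at inference time. The linear head and the feature-shape comments are hypothetical:

import torch

def random_crop_lms(lms, unit_frames):
    # Randomly crop a fixed number of time frames from a log-mel spectrogram [B, 1, F, T].
    # If the sample is shorter than the crop length, zero-pad the tail instead.
    T = lms.shape[-1]
    if T <= unit_frames:
        return torch.nn.functional.pad(lms, (0, unit_frames - T))
    start = torch.randint(0, T - unit_frames + 1, (1,)).item()
    return lms[..., start:start + unit_frames]

@torch.no_grad()
def classify_long_recording(model, head, lms):
    # Encode the whole recording with encode_lms() above (it chunks internally),
    # apply a hypothetical linear classification head per time step,
    # then average the logits over time for the clip-level prediction.
    feats = model.encode_lms(lms)      # e.g. [B, T', F*D] in the frame-wise setting
    logits = head(feats)               # [B, T', num_classes]
    return logits.mean(dim=1)          # [B, num_classes]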

Please let me know if you have any more questions.

daisukelab reopened this Jun 2, 2023
@yunzqq
Author

yunzqq commented Jun 4, 2023

Many thanks for your help! Are you in Greece for ICASSP 2023? If so, may I talk with you there? Haha!

@daisukelab
Collaborator

Yes, I'm presenting this paper:

AASP-P4: Anomaly Detection and Representation Learning for Audio Classification
Room: Poster Area 2 - Garden | Type: Poster | 03:35 PM to 5:05 PM
1773 (AASP-P4.3): MASKED MODELING DUO: LEARNING REPRESENTATIONS BY ENCOURAGING BOTH NETWORKS TO MODEL THE INPUT

See you there! :)

@yunzqq
Author

yunzqq commented Jun 5, 2023

> Yes, I'm presenting this paper:
>
> AASP-P4: Anomaly Detection and Representation Learning for Audio Classification
> Room: Poster Area 2 - Garden | Type: Poster | 03:35 PM to 5:05 PM
> 1773 (AASP-P4.3): MASKED MODELING DUO: LEARNING REPRESENTATIONS BY ENCOURAGING BOTH NETWORKS TO MODEL THE INPUT
>
> See you there! :)

Many thanks!
See you there!
Best,

Qiquan Zhang
