Some files in the dev_testset_noisy_testclips seem to be identical #70

Closed
zhaoyj1122 opened this issue Dec 3, 2021 · 5 comments

@zhaoyj1122

For example:
[image: 1]

@motus motus self-assigned this Dec 3, 2021
@motus motus added the data Dataset issue label Dec 3, 2021
@motus
Member

motus commented Dec 3, 2021

@zhaoyj1122 thanks for letting us know! I'll check for the duplicates and get back to you with the update soon.

@motus
Member

motus commented Dec 4, 2021

@zhaoyj1122 yes, we indeed have ten identical files in the test dataset. The files are at dev_testset/noisy_testclips/:

ms_realrec_emotional_speech_crying_1_headphone_chips_pocket_opening_A2H95JVPEKRUWA_1.wav
ms_realrec_emotional_speech_crying_1_headphone_chips_pocket_opening_A2H95JVPEKRUWA_2.wav
ms_realrec_emotional_speech_crying_1_laptop_crying_A72LC42LU78IP.wav
ms_realrec_emotional_speech_crying_2_headphone_air_conditioner_A2H95JVPEKRUWA_1.wav
ms_realrec_emotional_speech_crying_2_headphone_air_conditioner_A2H95JVPEKRUWA_2.wav
ms_realrec_emotional_speech_crying_2_laptop_crying_A72LC42LU78IP.wav
ms_realrec_emotional_speech_yelling_1_headphone_baby_crying_A2H95JVPEKRUWA_1.wav
ms_realrec_emotional_speech_yelling_1_headphone_baby_crying_A2H95JVPEKRUWA_2.wav
ms_realrec_emotional_speech_yelling_2_headphone_dog_barking_A2H95JVPEKRUWA_1.wav
ms_realrec_emotional_speech_yelling_2_headphone_dog_barking_A2H95JVPEKRUWA_2.wav

I will clean up the dataset, update the archive on Azure, and close this issue afterwards. Meanwhile, please feel free to remove any 9 of the 10 clips in the list. I'll also check our other datasets for duplicates. Stay tuned!
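
For anyone who wants to run a similar check locally, here is a minimal sketch of one way to find duplicate clips. It assumes "identical" means byte-identical files, which the thread does not confirm; clips with the same audio but different headers would not be caught. The helper name `find_duplicate_files` and its parameters are hypothetical and are not the maintainers' tooling.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(root, pattern="*.wav", chunk_size=1 << 20):
    """Group files under `root` whose bytes are identical (hypothetical helper)."""
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in chunks so large clips are never loaded whole.
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        by_digest[digest.hexdigest()].append(path)
    # Keep only digests shared by two or more files.
    return [paths for paths in by_digest.values() if len(paths) > 1]


if __name__ == "__main__":
    # Directory name taken from the comment above; adjust to the local copy.
    for group in find_duplicate_files("dev_testset/noisy_testclips"):
        print(f"{len(group)} identical files:")
        for path in group:
            print(f"  {path}")
```

Each printed group lists files that share a SHA-256 digest, so all but one file per group can be removed. Hashing the raw bytes is the simplest choice here; comparing decoded samples would be needed to catch clips that differ only in metadata.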

@motus
Member

motus commented Dec 4, 2021

looks like ms_realrec_emotional_speech_crying_1_laptop_crying_A72LC42LU78IP.wav is the right one. I'll delete the rest

@zhaoyj1122
Author

> looks like ms_realrec_emotional_speech_crying_1_laptop_crying_A72LC42LU78IP.wav is the right one. I'll delete the rest

Got it! Thanks a lot for that.

@motus
Member

motus commented Dec 4, 2021

I've updated the archive on Azure so I think I can close this issue for now. Please open another one if you see duplicates in our other datasets, and I'll do the same. Thanks for your help and good luck!

@motus motus closed this as completed Dec 4, 2021