English and Music in TEDxJP 10k corpus #4

Closed
eiichiroi opened this issue Jan 12, 2021 · 3 comments

eiichiroi commented Jan 12, 2021

The TEDxJP 10k corpus contains some data that is inappropriate for evaluating Japanese speech recognition.

In the following videos, Japanese speakers talk in English,
but the corpus uses subtitles that were automatically translated into Japanese.

In addition, the following videos contain music, and the corpus uses the intervals that contain the music.

It may be better to remove them from the corpus.

@hfujihara
Contributor

@eiichiroi
Thank you for pointing out this issue.

We examined the videos you mentioned and

  • removed the English utterances in Aj-DXM5Zqms, Ba5Jl1_JKZY, and gffgHgnEhtA
  • kept Mc044I55SCY, because the part used in our data includes only Japanese utterances
  • kept ydhfjNRFzaM, BLElQZfR_2M, TWeYkdIQsk0, Ab-KZT06gR0, and kU9LcoHaFLo, because although they contain singing voices (or rap), their transcriptions are actually correct.
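
For anyone who wants to run a similar check on their own copy of the data, here is a rough sketch of how utterances from the flagged videos could be pulled out for manual review. It assumes each utterance ID begins with the YouTube video ID, which is a guess for illustration and not necessarily the corpus's actual ID format; the function name `split_for_review` and the sample IDs are hypothetical as well.

```python
# Video IDs named in this thread as containing English utterances.
ENGLISH_VIDEO_IDS = ("Aj-DXM5Zqms", "Ba5Jl1_JKZY", "gffgHgnEhtA")


def split_for_review(utterance_ids, flagged_video_ids=ENGLISH_VIDEO_IDS):
    """Split utterance IDs into (keep, review) lists.

    ASSUMPTION: every utterance ID starts with its YouTube video ID.
    Adjust the matching rule if the real ID layout differs.
    """
    flagged = tuple(flagged_video_ids)
    keep, review = [], []
    for utt_id in utterance_ids:
        # str.startswith accepts a tuple and matches any of its prefixes.
        if utt_id.startswith(flagged):
            review.append(utt_id)
        else:
            keep.append(utt_id)
    return keep, review


if __name__ == "__main__":
    # Hypothetical utterance IDs, for illustration only.
    sample = ["Aj-DXM5Zqms-0001", "Mc044I55SCY-0002", "gffgHgnEhtA-0003"]
    keep, review = split_for_review(sample)
    print("keep:", keep)
    print("review:", review)
```

Utterances in the `review` list would still need to be checked by ear, since only the English utterances in those videos were removed, not the videos as a whole.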

@hfujihara
Contributor

The above modifications were made in https://github.com/laboroai/TEDxJP-10K as TEDxJP-10K_v1.1.

@eiichiroi
Author

Thank you!
