English and Music in TEDxJP 10k corpus #4

eiichiroi · 2021-01-12T12:01:42Z

The TEDxJP 10k corpus contains some data that is inappropriate for evaluating Japanese speech recognition.

In the following videos, Japanese speakers talk in English,
but the corpus uses subtitles that were automatically translated into Japanese.

In addition, the following videos contain music, and the corpus includes the intervals with music.

It may be better to remove these from the corpus.

hfujihara · 2021-01-14T04:41:47Z

@eiichiroi
Thank you for pointing out this issue.

We examined the videos you mentioned and

removed the English utterances in Aj-DXM5Zqms, Ba5Jl1_JKZY, and gffgHgnEhtA
kept Mc044I55SCY, as the part used in our data includes only Japanese utterances
kept ydhfjNRFzaM, BLElQZfR_2M, TWeYkdIQsk0, Ab-KZT06gR0, and kU9LcoHaFLo because, although they contain singing voices (or rap), their transcriptions are actually correct.

hfujihara · 2021-01-14T04:46:55Z

The above modifications were made in https://github.com/laboroai/TEDxJP-10K as TEDxJP-10K_v1.1.

eiichiroi · 2021-01-14T08:46:10Z

Thank you!

hfujihara closed this as completed Jan 14, 2021
