This repository has been archived by the owner on Mar 8, 2023. It is now read-only.

MLS and MAILABS: considerations and issues ( Have you seen my apostrophe?) #124

Open
nefastosaturo opened this issue Jan 21, 2021 · 9 comments
Labels
bug Something isn't working dataset help wanted Extra attention is needed

Comments

@nefastosaturo
Collaborator

nefastosaturo commented Jan 21, 2021

I was checking these two datasets.

The first thing that came to my mind was possible duplication of the same audio samples and yes, we can confirm this:

from mailabs:

$ it_IT/by_book/male/riccardo_fasol/il_fu_mattia_pascal/metadata.csv
mattiapascal_08_pirandello_f000160|E che avventure! Una più ardita dell'altra...|E che avventure! Una più ardita dell'altra...

from MLS:

$ mls_italian/train/transcripts.txt
1595_4194_001172	e che avventure una più ardita dellaltra ecco qua per dare un altro saggio un brano di dialogo tra lui e una donna maritata
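
A quick way to check this kind of overlap at scale is to normalize the MAILABS raw text the way MLS appears to be normalized (lowercase, punctuation dropped) and test whether it occurs inside an MLS transcript. This is a minimal sketch under that assumption; `normalize` and `is_contained` are hypothetical helper names, not part of either dataset's tooling:

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, drop punctuation (apostrophes included), collapse spaces."""
    text = unicodedata.normalize("NFC", text).lower()
    # Keep word characters (letters incl. accented ones, digits, underscore)
    # and whitespace; everything else is removed.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())


def is_contained(mailabs_raw: str, mls_transcript: str) -> bool:
    """True if the normalized MAILABS text appears as whole words in the MLS text."""
    needle = f" {normalize(mailabs_raw)} "
    haystack = f" {normalize(mls_transcript)} "
    return needle in haystack


mailabs = "E che avventure! Una più ardita dell'altra..."
mls = ("e che avventure una più ardita dellaltra ecco qua per dare un altro "
       "saggio un brano di dialogo tra lui e una donna maritata")
print(is_contained(mailabs, mls))  # True
```

A text match like this only suggests an overlap: the audio segments can still differ in length or come from a different speaker.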

The second thing: there is a big, annoying error, namely missing apostrophe characters. Why?

MAILABS metadata.csv contains: audio_id | transcription | NORMALIZED transcription

Given that A LOT of transcriptions use a different character instead of ' , the normalized version (which is the one taken into account by the DeepSpeech import_mailabs script) ends up without apostrophes

BUT

from the example above WE CAN SEE that MLS IS MISSING the apostrophe too!

So, to recap:
the ' char is sometimes missing from MLS and sometimes from MAILABS

So, what's the best strategy?

  • leave MLS and MAILABS as they are, with some (or many) overlapping samples, and just use a NEW import_mailabs parser that replaces the character
  • try to fix the MLS transcriptions using the raw MAILABS ones and, after checking whether MAILABS is a subset of MLS, discard MAILABS
  • others..
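
The first option could be sketched roughly as follows: read the raw (second) column of metadata.csv and map apostrophe look-alikes to a plain ' before any further normalization. The set of look-alike characters is an assumption (the thread does not say which character the transcriptions actually use), and `parse_mailabs_line` is a hypothetical helper, not the DeepSpeech importer's code:

```python
from typing import Tuple

# Characters sometimes used in place of the ASCII apostrophe (assumed set).
APOSTROPHE_LIKE = "\u2019\u2018\u0060\u00b4\u02bc"
TO_ASCII_APOSTROPHE = str.maketrans({ch: "'" for ch in APOSTROPHE_LIKE})


def parse_mailabs_line(line: str) -> Tuple[str, str]:
    """Split 'audio_id|raw|normalized' and return (audio_id, cleaned raw text)."""
    audio_id, raw, _normalized = line.rstrip("\n").split("|", 2)
    # Use the raw column, since the normalized one may have lost the apostrophe.
    return audio_id, raw.translate(TO_ASCII_APOSTROPHE)


# Synthetic line, made up for illustration (not taken from the dataset).
print(parse_mailabs_line("sample_0001|Un po\u2019 d\u2019acqua|Un po dacqua"))
```

Working from the raw column keeps the fix local to the importer, at the cost of redoing whatever else the normalized column was providing (for example expanded numbers).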

EDIT:

added the book lists from MAILABS and MLS:
MAILABS_book_list.txt
MLS_book_list.txt

@nefastosaturo nefastosaturo added bug Something isn't working help wanted Extra attention is needed dataset labels Jan 21, 2021
@nefastosaturo nefastosaturo changed the title MLS and MAILABS: considerations and issues ( Have you see my apostrophe?) MLS and MAILABS: considerations and issues ( Have you seen my apostrophe?) Jan 25, 2021
@eziolotta
Contributor

Even if the texts of the examples are the same, the speakers may be different.
I think different speakers can be useful even if they say the same thing.

In the example you mention (Pirandello's Mattia Pascal) the speaker is the same (both clips are derived from the same LibriVox recordings), but they are different segments: the MLS one is longer, so they are not duplicates.

I think it's hard to find real duplicates; could we just keep them?

@eziolotta
Contributor

eziolotta commented Jan 30, 2021

To solve the apostrophe bug in M-AILABS and MLS, we would need to parse both strings (original and normalized).
I have made this fix, along with other changes, and I'll open a PR soon.
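
The PR itself is not shown in this thread, so the following is only a sketch of what "parsing both strings" could look like, not the actual fix: take the apostrophe positions from the original text and re-insert them into the matching words of the normalized text. `restore_apostrophes` is a hypothetical name, and treating U+2019 as the only apostrophe variant is an assumption:

```python
import re


def restore_apostrophes(original: str, normalized: str) -> str:
    """Re-insert apostrophes into `normalized`, using `original` as the reference."""
    # Map each apostrophe-less word form to the indexes where apostrophes belong.
    positions = {}
    for word in re.findall(r"[\w'\u2019]+", original):
        word = word.replace("\u2019", "'").strip("'")
        if "'" in word:
            key = word.replace("'", "").lower()
            positions[key] = [i for i, ch in enumerate(word) if ch == "'"]

    def fix(match):
        token = match.group(0)
        # Indexes are ascending, so each insertion keeps later ones valid.
        for i in positions.get(token.lower(), []):
            token = token[:i] + "'" + token[i:]
        return token

    # Words that have no counterpart in the original are left untouched.
    return re.sub(r"\w+", fix, normalized)


print(restore_apostrophes(
    "E che avventure! Una più ardita dell\u2019altra...",
    "e che avventure una più ardita dellaltra",
))
```

One limitation of this sketch: if a sentence contains both an apostrophe form and its look-alike without apostrophe (such as l'oro and loro), every occurrence gets rewritten.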

@eziolotta
Contributor

In M-AILABS my fix works fine (we have the original text!).
For MLS we may need to reuse the raw M-AILABS data, as you say. I'll try...

@eziolotta
Contributor

List of MAILABS apostrophe errors:
mailabs_fixed_token.txt

@nefastosaturo
Collaborator Author

nefastosaturo commented Feb 12, 2021

So, starting from mailabs_fixed_token.txt, I tried to detect the problematic MLS books.

So far I have checked:

Verga, Novelle, "Vita dei campi", book id: 656
656_Verga_Novelle.zip
Pascoli, Myricae, book id: 1590
1590_pascoli.zip
Machiavelli, Il Principe, book id: 10624 <--- I was thinking of discarding this one, there are too many Latinisms

In each zip file you'll find a different set of around 50 wrong words. Some of them already have a correction, most of them don't.

There is also a file listing sentences with strange behaviour (strange chars, bigger errors like words without spaces between them, and so on). I will check those tokens in a future step.

If you can, please choose one set or subset and fill in the correct words. That would be awesome!

The format is one pair per line, the wrong token followed by the corrected one, separated by a comma, e.g.:

dellanima,dell'anima
damore,d'amore
unaltro,un altro

If you think that a token could be ambiguous (e.g. loro,l'oro), please flag it with SKIP:

loro,l'oro,SKIP
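
For reference, a file in this format could be applied to the MLS transcripts along these lines. This is a minimal sketch; `load_corrections` and `apply_corrections` are hypothetical helpers, and rows flagged with SKIP are simply left out so ambiguous tokens stay untouched:

```python
import csv


def load_corrections(lines):
    """Read 'wrong,correct[,SKIP]' rows into a dict, ignoring SKIP rows."""
    corrections = {}
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # blank line or token that has no correction yet
        wrong, correct = row[0].strip(), row[1].strip()
        flagged = len(row) > 2 and row[2].strip().upper() == "SKIP"
        if wrong and correct and not flagged:
            corrections[wrong] = correct
    return corrections


def apply_corrections(text, corrections):
    """Replace whole tokens of a transcript text (without its leading ID)."""
    return " ".join(corrections.get(tok, tok) for tok in text.split())


rows = ["dellanima,dell'anima", "damore,d'amore", "unaltro,un altro", "loro,l'oro,SKIP"]
fixes = load_corrections(rows)
print(apply_corrections("unaltro canto damore per loro", fixes))
```

In practice `load_corrections` would receive an open file object instead of the `rows` list, since `csv.reader` accepts any iterable of lines.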

@Sav22999
Member

@nefastosaturo I'll take the first one: Verga, Novelle, id=656.

@Sav22999
Member

Et voilà, I think I've done everything (I hope it's correct): 656_Verga_Novelle.zip

@eziolotta
Contributor

To check all the texts in MLS, the CSV generated by the importer may help:
train_full.zip

@eziolotta
Contributor

eziolotta commented Mar 13, 2021

In M-AILABS there are other samples to exclude:

  • the transcription does not match the spoken words :-(
  • the audio is truncated before the end of the transcription

(folder mix\novelle_per_un_anno_06)
novelle06_16_pirandello_f000028
novelle06_16_pirandello_f000029
novelle06_16_pirandello_f000030
novelle06_16_pirandello_f000031
novelle06_16_pirandello_f000032
novelle06_16_pirandello_f000033
novelle06_16_pirandello_f000034
novelle06_16_pirandello_f000035
novelle06_16_pirandello_f000036
novelle06_16_pirandello_f000037
novelle06_16_pirandello_f000038
novelle06_16_pirandello_f000039
novelle06_16_pirandello_f000040
novelle06_16_pirandello_f000041

novelle06_17_pirandello_f000387

I was able to find them because 3 of them were filtered out by the importer (see the too_short audio check),
then I checked (by hand) the whole novelle06_16 and novelle06_17 blocks.
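
One way to keep these clips out of training is an explicit exclusion set that is checked while reading metadata.csv. This is a minimal sketch; `excluded_ids` and `keep_row` are hypothetical helpers, not functions of the importer:

```python
def excluded_ids():
    """IDs listed above: f000028..f000041 of novelle06_16 plus one of novelle06_17."""
    ids = {f"novelle06_16_pirandello_f{n:06d}" for n in range(28, 42)}
    ids.add("novelle06_17_pirandello_f000387")
    return ids


def keep_row(metadata_line, excluded):
    """True if the 'audio_id|raw|normalized' line is not in the exclusion set."""
    audio_id = metadata_line.split("|", 1)[0]
    return audio_id not in excluded


excluded = excluded_ids()
print(len(excluded))  # 15
```

Keeping the list as data rather than hard-coding it in the importer would make it easy to extend as more bad clips are found by hand.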


3 participants