Not clear how to do a simple speech recognition #102

foosoftsrlold · 2020-09-11T06:58:10Z

It would be great if the instructions in the README were foolproof.

I just tried to follow them and the results were nonsensical.

This may well be due to an error on our side or to the environment (WSL), but looking at the release, I suspect that some data is missing (I followed the README strictly).

Mte90 · 2020-09-11T07:15:23Z

Can you add more details about which steps didn't work for you?
We need more information to understand what is missing and what you are trying to do.

foosoftsrlold · 2020-09-11T10:47:58Z

What I'm trying to do is... speech-to-text on publicly available content.
The system is Linux (WSL), bullseye/sid distribution.

How to replicate:

virtualenv -p python3 venv
source venv/bin/activate
pip3 install deepspeech==0.8.0
wget https://radiomaria-cdn.thron.com/delivery/public/audio/radiomaria/75451e50-9ecd-482a-8201-08d1a9bf2189/hjtvzk/WEB/commento-al-messaggio-del-25-
ffmpeg -i commento-al-messaggio-del-25- -t 10 -ar 16000 commento-al-messaggio-del-25-.short.wav
deepspeech --model output_graph.pbmm --scorer scorer --audio commento-al-messaggio-del-25-.short.wav

The output of the "deepspeech" command is:

Loading model from file output_graph.pbmm
TensorFlow: v2.2.0-17-g0854bb5
DeepSpeech: v0.8.0-0-gf56b07d
2020-09-11 12:45:02.097968: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.0163s.
Loading scorer from files scorer
Loaded scorer in 0.00397s.
Running inference.
u vogue rire suu sur bert ahh non olh ful up uoh b
Inference took 6.754s for 10.000s audio file.

Mte90 · 2020-09-11T10:50:25Z

It seems you are using it correctly, but you are not using our model for Italian, right?

foosoftsrlold · 2020-09-11T11:02:09Z

Actually... I should be... I forgot to mention these two steps, which I had already done:

Download and unpack the files for the Italian model

curl -LO https://github.com/MozillaItalia/DeepSpeech-Italian-Model/releases/download/2020.08.07/model_tensorflow_it.tar.xz
tar xvf model_tensorflow_it.tar.xz

Just in case, this is the output of ls:

(venv) luca@DESKTOP-QTQGTR0:/mnt/c/Users/luca/dati/voice/it$ ls -l
total 694012
-rwxrwxrwx 1 luca luca        86 Aug  7 00:36 alphabet.txt
-rwxrwxrwx 1 luca luca   9898126 Jun 26 10:51 commento-al-messaggio-del-25-
-rwxrwxrwx 1 luca luca    640078 Sep 11 12:40 commento-al-messaggio-del-25-.short.wav
-rwxrwxrwx 1 luca luca 315068264 Sep 11 12:54 model_tensorflow_it.tar.xz
-rwxrwxrwx 1 luca luca 188998006 Aug 10 11:51 output_graph.pbmm
-rwxrwxrwx 1 luca luca 196052112 Aug  7 00:41 scorer
drwxrwxrwx 1 luca luca       512 Sep 11 12:37 venv
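
A side note on the listing above: the size of commento-al-messaggio-del-25-.short.wav can be compared with what a 10-second, 16 kHz recording should occupy. The sketch below is only arithmetic on the numbers already shown (640078 bytes, `-t 10`, `-ar 16000`); it assumes 16-bit PCM samples, which is ffmpeg's default for .wav output, and the helper name is made up for illustration.

```python
def payload_bytes(seconds, sample_rate, channels, bytes_per_sample=2):
    """Size of the raw PCM data, ignoring the WAV header."""
    return seconds * sample_rate * channels * bytes_per_sample

file_size = 640078  # size of the .short.wav file in the ls listing
seconds = 10        # from "-t 10"
rate = 16000        # from "-ar 16000"

mono_payload = payload_bytes(seconds, rate, channels=1)
stereo_payload = payload_bytes(seconds, rate, channels=2)

print("mono payload:  ", mono_payload)                # 320000
print("stereo payload:", stereo_payload)              # 640000
print("left over:     ", file_size - stereo_payload)  # 78
```

The file is about twice as large as a mono recording of that length would be, and the 78 bytes left over are a plausible size for a WAV header with a small metadata chunk. That is consistent with a two-channel file.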

Mte90 · 2020-09-11T11:07:59Z

So you are using the model without transfer learning from English. It is trained purely on Italian, but since we don't have many hours of training data, it is not very good.

So the procedure you are using is right; it is just the model that does not perform well. I suggest trying the one with transfer learning from English, which is more accurate because it adds over 7000 hours of data on top of the roughly 250 hours of Italian (it is on the release page like the other one).

nefastosaturo · 2020-09-11T11:08:11Z

Hello there, I'm checking it right now. That mp3 file is a stereo file. During conversion, please convert it to mono.

You can also try the same model trained with transfer learning from English. You can find it in the release or at this URL:

https://github.com/MozillaItalia/DeepSpeech-Italian-Model/releases/download/2020.08.07/transfer_model_tensorflow_it.tar.xz
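
A quick way to catch the stereo problem before running inference is to inspect the WAV header. The following is a minimal sketch using only Python's standard `wave` module; it assumes the model wants 16 kHz, 16-bit, mono input (in line with the `-ar 16000` used in the report and the advice above), and the function name is hypothetical, not part of the deepspeech package.

```python
import wave

def check_wav_for_deepspeech(path, expected_rate=16000):
    """Return a list of problems found; an empty list means the file looks fine."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        width = wav.getsampwidth()  # bytes per sample
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate != expected_rate:
        problems.append(f"expected {expected_rate} Hz, found {rate} Hz")
    if width != 2:
        problems.append(f"expected 16-bit samples, found {8 * width}-bit")
    return problems
```

Calling `check_wav_for_deepspeech("commento-al-messaggio-del-25-.short.wav")` on a file converted without a channel option from a stereo source would be expected to report the channel count.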

nefastosaturo · 2020-09-11T11:16:21Z

OK, it works. The transcription of the disclaimer voice:

Loaded model in 0.00849s.
Loading scorer from files /media/nefasto/data/deepspeech/models/ita/scorer
Loaded scorer in 0.000566s.
Running inference.
produzioni radi u eria tutti i diritti sono riservati
Inference took 1.648s for 3.971s audio file.

The first 10 seconds of the speaker's transcription (I cannot understand what he's saying, by the way):

Loaded scorer in 0.000662s.
Running inference.
cari mec le cime come tiamo e messaggio della regia della pace delle viticci cu mita denti festo ver creta nove su mali versario re ra palesin
Inference took 4.550s for 9.996s audio file.
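
The exact conversion command used for these runs is not shown in the thread. As a sketch, assuming ffmpeg as in the original report, the only change needed should be the `-ac 1` option, which asks ffmpeg for a single output channel. The helper names below are illustrative only.

```python
import subprocess

def build_ffmpeg_command(source, target, seconds=10, rate=16000):
    """Build the ffmpeg call from the report, plus "-ac 1" for mono output."""
    return [
        "ffmpeg",
        "-i", source,        # input file (the downloaded audio)
        "-t", str(seconds),  # keep only the first N seconds
        "-ar", str(rate),    # resample to 16 kHz
        "-ac", "1",          # down-mix to one channel (mono)
        target,
    ]

def convert_to_mono(source, target, **options):
    """Run the conversion; needs ffmpeg on PATH and raises if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(source, target, **options), check=True)
```

For the file in this issue, that would be `convert_to_mono("commento-al-messaggio-del-25-", "commento-al-messaggio-del-25-.short.wav")`, which corresponds to the original ffmpeg line with `-ac 1` added before the output name.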

foosoftsrlold · 2020-09-11T11:23:58Z

Ok! Way better with mono. A good candidate for README :)
Unfortunately the output... ehm... is not that fantastic. Any idea?

Original message is "Produzioni radio maria. Tutti i diritti sono riservati... blabla...".

Transfer learning model

(venv) luca@DESKTOP-QTQGTR0:/mnt/c/Users/luca/dati/voice/it$ deepspeech --model transfer_model_tensorflow_it/output_graph.pbmm --scorer scorer --audio commento-al-messaggio-del-25-.short.wav
...
...
produzioni radi umeri tutti i diritti sono riservati grave me nci di loci me come antiamo i mesca gio della gia dora pace delle veci mita dent

Normal model

(venv) luca@DESKTOP-QTQGTR0:/mnt/c/Users/luca/dati/voice/it$ deepspeech --model output_graph.pbmm --scorer scorer --audio commento-al-messaggio-del-25-.short.wav
...
...
zini radio eria puti diri ti sono riservati premerci li legione come diamo messaggio gelar giade la pace delle be dictu mido anti

Mte90 · 2020-09-11T13:21:11Z

Readme updated. As for the quality, as of today the model is not very good anyway.
We have some issues with Italian datasets (finding resources that are legally usable) and also with the text corpora (which we built) used by the model.

If you want to help and contribute to the project I invite you to join us on Telegram with @mozitabot and later in the Developers group where we discuss.

Mte90 added the "question" label (Further information is requested) · Sep 11, 2020

Mte90 closed this as completed Sep 11, 2020
