Add SoX deps, resolves #3487 #3488

KathyReid · 2021-01-04T03:32:39Z

No description provided.

KathyReid · 2021-01-04T04:23:55Z

Note I have also added vim and nano editors to the Dockerfile for ease of editing files within Docker.

lissyx · 2021-01-04T12:19:52Z

Dockerfile.train.tmpl

@@ -18,7 +18,12 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
        locales \
        python3-venv \
        unzip \
-        wget
+        wget \
+	sox \


nit: please ensure proper indentation

Great pickup, thx @lissyx

lissyx

I have mixed feelings:

* it's easy and not a big deal, so why not
* it opens the door to potentially everybody wanting to put their required deps here, and we end up with a mess we can't properly maintain

I'm really not sure we need to add vim/nano there, but maybe I am biased by my workflow, in which I almost never have to work within the container.

For the sox deps, my workflow is always to derive a new image from the base image we produce, so it would be where I place the dep.
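A minimal sketch of that derived-image workflow, under stated assumptions: the base image name and tag (`mozilla/deepspeech-train:v0.9.3`) are a guess at what the published training image is called, and `sox` plus `libsox-fmt-mp3` are assumed to be the Ubuntu packages that give the Common Voice importer MP3 support. Substitute whatever image and packages you actually use.

```dockerfile
# Assumption: name and tag of the published base image; override with --build-arg.
ARG BASE_IMAGE=mozilla/deepspeech-train:v0.9.3
FROM ${BASE_IMAGE}

# Importer-specific deps live in the derived image, not in the base one.
RUN apt-get update && apt-get install -y --no-install-recommends \
        sox \
        libsox-fmt-mp3 \
    && rm -rf /var/lib/apt/lists/*

# Fail the build early if sox is missing or lacks the mp3 format handler.
RUN command -v sox && sox -h | grep 'AUDIO FILE FORMATS' | grep -qw mp3
```

Building this with something like `docker build -t deepspeech-train-cv .` (the tag is hypothetical) leaves the base image untouched, so each user's extra dependencies stay in their own Dockerfile.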

@reuben what's your take on that?

reuben · 2021-01-04T13:20:34Z

I have literally never used the Dockerfiles so my opinion counts very little I'm afraid. I agree with you that adding all required deps of all importers will quickly become untenable, but I guess Common Voice is "first-party" enough that it's warranted to add a dep for that importer. So overall I'm OK with this PR as is.

KathyReid · 2021-01-04T13:37:53Z

All fair and valid points.

The reason that I'm using a Dockerfile for training is to help build out a DeepSpeech Playbook to make training models easier - and in doing so, reduce the support load presented by DeepSpeech. While using Docker for training (to streamline dependency management), it's useful to have editing tools like vim and nano available, for example to edit alphabet.txt (which is how I discovered I didn't have an editor).

lissyx · 2021-01-04T16:48:21Z

I still feel that advising people to produce their own image based on ours, and thus manage their own deps, would be a good requirement (and we could still merge those anyway). What do you think, @KathyReid?

KathyReid · 2021-01-04T23:18:17Z

I think what you propose is a good way forward; with vim and nano people have the ability to tinker around within the current Docker image, and I can add some information to the Playbook on how to add additional deps / requirements to the Dockerfile and have people build their own image from that Dockerfile.

The one downside I can see from this approach is that by encouraging people to train a model from a Docker environment, we are trying to standardise, or reduce the variability of, the environments used for training - thus reducing the range of issues that come up for support, and reducing support volume. If we encourage people to modify their own Docker images for training, that will increase variability in support issues. Should we add something to the ISSUE.md template asking for their modified Dockerfile if they're using a modified one?

nmstoker · 2021-01-09T02:31:10Z

I may be showing my inexperience with Docker here, but would it be feasible to have two Dockerfiles? E.g.:
1. A "Lite" version for the purists, without the editors and possibly without the dependencies for the importers (or only minimal dependencies if Common Voice is in)
2. A slightly more "batteries included" version, derived from the Lite version but which has the editors and SoX.

Those looking to apply it in their own version with specialist needs would potentially start with the Lite one. For use in the playbook or with beginners, the batteries-included one would be the place to start, because it's simple to get going and will reduce support confusion stemming from those who are less experienced.
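One way the two-variant idea could be sketched is with named build stages in a single Dockerfile. This is only an illustration, not something the repository ships: the base image tag, the stage names `lite` and `full`, and the package list are all assumptions.

```dockerfile
# Assumption: name and tag of the published base image; override with --build-arg.
ARG BASE_IMAGE=mozilla/deepspeech-train:v0.9.3

# "Lite" stage: the bare training image, with nothing added.
FROM ${BASE_IMAGE} AS lite

# "Batteries included" stage: derived from lite, adds editors and SoX.
FROM lite AS full
RUN apt-get update && apt-get install -y --no-install-recommends \
        nano \
        vim \
        sox \
        libsox-fmt-mp3 \
    && rm -rf /var/lib/apt/lists/*
```

With this layout, `docker build --target lite .` and `docker build --target full .` would produce the two variants from one file, which keeps them from drifting apart but still means two images to support.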

KathyReid · 2021-01-09T02:49:05Z

Thanks @nmstoker for the feedback - appreciated.

I don't have strong opinions either way on your suggestion - my guiding principle here is "what does the community need to reduce the hurdle of getting to train a model?". For me, that's likely to be a "batteries included" Dockerfile that allows someone to train a model from say the CV corpus, with minimal preparation work.

There are already two Dockerfiles that ship with DeepSpeech - the makefile takes a parameter of either train or build, with dependencies tailored to either using DeepSpeech for training or for inference. A concern with increasing the number of Dockerfiles is that it then makes providing support, or replicating issues, harder.

Sorry I'm not more helpful on this one, but it's probably Lissy or Reuben's call here.

lissyx · 2021-01-11T09:41:00Z

> I don't have strong opinions either way on your suggestion - my guiding principle here is "what does the community need to reduce the hurdle of getting to train a model?". For me, that's likely to be a "batteries included" Dockerfile that allows someone to train a model from say the CV corpus, with minimal preparation work.

The truth is that I was working on that in a side project: https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/

This is the project you want. The goal was for our train Docker image here to be a base for it, which I have not had time to finish yet. Hence the two Dockerfiles in the repo:

* build: to allow people to more easily rebuild the lib if required for their own purpose
* train: to provide a very basic, minimal working setup that should be the base of others.

@KathyReid @reuben Maybe it is time for https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/ to move somewhere else and be more supported?

KathyReid · 2021-02-10T20:16:48Z

I re-tested this with the Docker Hub image (not one I built myself) and tried to do data formatting for CV datasets, which is the example given in the PlayBook, and it fails on sox dependencies as below.

Can we please have this pulled into DeepSpeech so that the Docker Hub image includes sox deps? I know it's not strictly for training, but the actual developer workflow that people are going to use to experiment with DeepSpeech means they will probably use the Docker Hub image to import data from CV.

root@c7f3e6f3c302:/DeepSpeech# bin/import_cv2.py deepspeech-data/cv-corpus-6.1-2020-12-11/vi
/bin/sh: 1: sox: not found
SoX could not be found!

    If you do not have SoX, proceed here:
     - - - http://sox.sourceforge.net/ - - -

    If you do (or think that you should) have SoX, double-check your
    path variables.
    
Loading TSV file:  /DeepSpeech/deepspeech-data/cv-corpus-6.1-2020-12-11/vi/test.tsv
Importing mp3 files...
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "bin/import_cv2.py", line 65, in one_sample
    _maybe_convert_wav(mp3_filename, wav_filename)
  File "bin/import_cv2.py", line 185, in _maybe_convert_wav
    transformer.build(mp3_filename, wav_filename)
  File "/usr/local/lib/python3.6/dist-packages/sox/transform.py", line 594, in build
    input_filepath, input_array, sample_rate_in
  File "/usr/local/lib/python3.6/dist-packages/sox/transform.py", line 496, in _parse_inputs
    input_format['channels'] = file_info.channels(input_filepath)
  File "/usr/local/lib/python3.6/dist-packages/sox/file_info.py", line 82, in channels
    output = soxi(input_filepath, 'c')
  File "/usr/local/lib/python3.6/dist-packages/sox/core.py", line 149, in soxi
    stderr=subprocess.PIPE
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'sox': 'sox'
"""This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "bin/import_cv2.py", line 221, in <module>
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
    main()
  File "bin/import_cv2.py", line 216, in main
    _preprocess_data(PARAMS.tsv_dir, audio_dir, PARAMS.space_after_every_character)
  File "bin/import_cv2.py", line 172, in _preprocess_data
    set_samples = _maybe_convert_set(dataset, tsv_dir, audio_dir, space_after_every_character)
  File "bin/import_cv2.py", line 127, in _maybe_convert_set
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
    for i, processed in enumerate(pool.imap_unordered(one_sample, samples), start=1):
This install of SoX cannot process .mp3 files.
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
    raise value
FileNotFoundError: [Errno 2] No such file or directory: 'sox': 'sox'
This install of SoX cannot process .mp3 files.

lissyx · 2021-02-10T20:26:27Z

> but the actual developer workflow that people are going to use to experiment with DeepSpeech means they will probably use the Docker Hub image to import data from CV.

I'm really not sure we should enable that, as I stated earlier.

KathyReid · 2021-02-10T20:36:50Z

OK, so the alternatives I can see here are:

1. I add instructions to the PlayBook for adding dependencies like sox to the Docker image - and it's likely that other dependencies would be required for other Importers, so if they get stuck they know how to add them
2. I get the user to pull down the image intended for inference, which does have the sox deps? To me that's overkill, and makes the onramp to using DeepSpeech harder

So I think option #1 would be better - what do you think?

lissyx · 2021-02-10T20:39:18Z

* I add instructions to the PlayBook for adding dependencies like `sox` to the Docker image - and it's likely that other dependencies would be required for other Importers, so if they get stuck they know how to add them

Yes, people should derive from our Docker image using FROM. Please note we don't have CI on any of the importers, so breakages are to be expected.

* I get the user to pull down the image intended for _inference_ which does have the `sox` deps? To me that's overkill, and makes the onramp to using DeepSpeech harder

No, that's no good.

You forget #3: finally address my suggestion stated in #3488 (comment)

KathyReid · 2021-02-10T22:29:05Z

`DATA_FORMATTING.md` comes before `ENVIRONMENT.md` in the sequence but the Docker environment is set up in `ENVIRONMENT.md`

Previously, I had instructions to git clone DeepSpeech and create a Python venv. I rewrote these instructions for ENVIRONMENT.md and for TRAINING.md, but missed DATA_FORMATTING.md, which assumes that there is a local copy of DeepSpeech on the filesystem. That is, the user is downloading, extracting and importing the Common Voice data on their local filesystem instead of from within a Docker container.

Which is not what we want to do - we want to get the user comfortable with using a Docker container for their activities.

So the task here is:

Move the position of DATA_FORMATTING.md to come after ENVIRONMENT.md, and ensure that the tasks in DATA_FORMATTING use Docker, not the local host.

Ruled out - using inference `DockerFile`

We've ruled out this option, but I'm just stating it to be clear

Add instructions to the `ENVIRONMENT.md` file for deriving an image and/or adding dependencies

This is the section that deals with setting up Docker and spinning up a container. This is the place where I should talk about adding dependencies to the image if needed, i.e. sox or other dependencies for other importers.

Task:

Add instructions to ENVIRONMENT.md file for deriving an image and/or adding dependencies

French Common Voice and DeepSpeech work

#3488 (comment)

My understanding of this work is that it:

* Uses FROM to create a new Docker image
* Installs dependencies required for the new language
* Uses the Docker image to create a new scorer file for the language
* Creates a new alphabet file for the language

Is this a correct understanding? Or have I misunderstood?

Spinning this off into a new project is beyond the scope of the PlayBook work - my role here is not to provide ongoing support, but to reduce existing support load by creating the PlayBook.

Summary

To reach resolution on this, here is what I propose as tasks:

Move the position of DATA_FORMATTING.md to come after ENVIRONMENT.md, and ensure that the tasks in DATA_FORMATTING use Docker, not the local host.
Add instructions to ENVIRONMENT.md file for deriving an image and/or adding dependencies

Are you comfortable with this approach?

lissyx · 2021-02-10T22:33:52Z

> Spinning this off into a new project is beyond the scope of the PlayBook work - my role here is not to provide ongoing support, but to reduce existing support load by creating the PlayBook.

My point is that reducing ongoing support for the use case you want to cover in the playbook is best addressed by finding an answer to that question.

* Add instructions to `ENVIRONMENT.md` file for deriving an image and/or adding dependencies

If at least we advertise clearly that:

* our Docker image is bare and provided as a basis to build on, and people should refrain from hacking into it unless they know what they are doing
* how to build on top of it through FROM, with CV as a use case

I think it should be okay.

KathyReid · 2021-02-28T03:27:16Z

I've resolved this via
mozilla/deepspeech-playbook#15
So am closing this issue.

KathyReid added 3 commits January 4, 2021 14:32

Add SoX deps, resolves #3487

ac212fe

Add vim and nano editors to Dockerfile build template

f30d8a8

Add vim and nano editors to the Dockerfile train template

27b228f

lissyx reviewed Jan 4, 2021


KathyReid mentioned this pull request Feb 10, 2021

Importing CV data in DATA_FORMATTING.md fails due to sox deps not in Docker Hub image mozilla/deepspeech-playbook#4

Closed

KathyReid closed this Feb 28, 2021

KathyReid deleted the patch-4 branch February 28, 2021 03:27
