Add SoX deps, resolves #3487 #3488

Closed
wants to merge 3 commits into from

Conversation

KathyReid
Contributor

No description provided.

@KathyReid
Contributor Author

Note I have also added vim and nano editors to the Dockerfile for ease of editing files within Docker.

@@ -18,7 +18,12 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
 locales \
 python3-venv \
 unzip \
-wget
+wget \
+sox \
Collaborator

nit: please ensure proper indentation

Contributor Author

Great pickup, thx @lissyx

Collaborator

@lissyx lissyx left a comment

I have mixed feelings:

  • it's easy and not a big deal, so why not
  • it opens the door to potentially everybody wanting to put their required deps here, and we end up with a mess we can't properly maintain

I'm really not sure we need to add vim/nano there, but maybe I am biased by my workflow, in which I almost never have to work within the container.

For the sox deps, my workflow is always to derive a new image from the base image we produce, so that is where I would place the dep.

@reuben what's your take on that?

@reuben
Contributor

reuben commented Jan 4, 2021

I have literally never used the Dockerfiles so my opinion counts very little I'm afraid. I agree with you that adding all required deps of all importers will quickly become untenable, but I guess Common Voice is "first-party" enough that it's warranted to add a dep for that importer. So overall I'm OK with this PR as is.

@KathyReid
Contributor Author

All fair and valid points.

The reason that I'm using a Dockerfile for training is to help build out a DeepSpeech Playbook to make training models easier - and in doing so, reduce the support load presented by DeepSpeech. While using Docker for training (to streamline dependency management), it's useful to have editing tools like vim and nano available, for example to edit alphabet.txt (which is how I discovered I didn't have an editor).

@lissyx
Collaborator

lissyx commented Jan 4, 2021

I still feel that advising people to produce their own image based on ours, and thus manage their own deps, would be a good requirement (and we could still merge those anyway). What do you think, @KathyReid?

@KathyReid
Contributor Author

I think what you propose is a good way forward; with vim and nano people have the ability to tinker around within the current Docker image, and I can add some information to the Playbook on how to add additional deps / requirements to the Dockerfile and have people build their own image from that Dockerfile.

The one downside I can see from this approach is that by encouraging people to train a model from a Docker environment, we are trying to standardise or reduce the variability of environments that are trained with - thus reducing the range of issues that come up for support, and reducing support volume. If we encourage people to modify their own Docker images for training, that will increase variability in support issues. Should we add something to the ISSUE.md template asking for their modified Dockerfile if they're using a modified one?

@nmstoker
Contributor

nmstoker commented Jan 9, 2021

I may be showing my inexperience with Docker here, but would it be feasible to have two Dockerfiles? For example:
  1. A "Lite" version for the purists, without the editors and possibly without the dependencies for the importers (or with only minimal dependencies if Common Voice is in).
  2. A slightly more "batteries included" version, derived from the Lite version, which has the editors and SoX.

Those looking to adapt it for specialist needs would potentially start with the Lite one. For use in the playbook or with beginners, the batteries-included one would be the place to start, because it's simple to get going and will reduce support confusion stemming from those who are less experienced.

@KathyReid
Contributor Author

Thanks @nmstoker for the feedback - appreciated.

I don't have strong opinions either way on your suggestion - my guiding principle here is "what does the community need to reduce the hurdle of getting to train a model?". For me, that's likely to be a "batteries included" Dockerfile that allows someone to train a model from say the CV corpus, with minimal preparation work.

There are already two Dockerfiles that ship with DeepSpeech - the makefile takes a parameter of either train or build, with dependencies tailored to either using DeepSpeech for training or for inference. A concern with increasing the number of Dockerfiles is that it then makes providing support, or replicating issues, harder.

Sorry I'm not more helpful on this one, but it's probably Lissy or Reuben's call here.

@lissyx
Collaborator

lissyx commented Jan 11, 2021

> I don't have strong opinions either way on your suggestion - my guiding principle here is "what does the community need to reduce the hurdle of getting to train a model?". For me, that's likely to be a "batteries included" Dockerfile that allows someone to train a model from say the CV corpus, with minimal preparation work.

The truth is that I was working on that as a side project: https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/

This is the project you want. The goal was for our train Docker image here to be a base for it, which I have not had time to finish yet. Hence the two Dockerfiles in the repo:

  • build: to allow people to more easily rebuild the lib if required for their own purpose
  • train: to provide a very basic minimal working setup that should be the base of others.

@KathyReid @reuben Maybe it is time for https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/ to move somewhere else and be more supported?

@KathyReid
Contributor Author

I re-tested this with the Docker Hub image (not one I built myself) and tried to do data formatting for CV datasets, which is the example given in the PlayBook, and it fails on sox dependencies as below.

Can we please have this pulled into DeepSpeech so that the Docker Hub image includes sox deps? I know it's not strictly for training, but the actual developer workflow that people are going to use to experiment with DeepSpeech means they will probably use the Docker Hub image to import data from CV.

root@c7f3e6f3c302:/DeepSpeech# bin/import_cv2.py deepspeech-data/cv-corpus-6.1-2020-12-11/vi
/bin/sh: 1: sox: not found
SoX could not be found!

    If you do not have SoX, proceed here:
     - - - http://sox.sourceforge.net/ - - -

    If you do (or think that you should) have SoX, double-check your
    path variables.
    
Loading TSV file:  /DeepSpeech/deepspeech-data/cv-corpus-6.1-2020-12-11/vi/test.tsv
Importing mp3 files...
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
WARNING: No --validate_label_locale specified, your might end with inconsistent dataset.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "bin/import_cv2.py", line 65, in one_sample
    _maybe_convert_wav(mp3_filename, wav_filename)
  File "bin/import_cv2.py", line 185, in _maybe_convert_wav
    transformer.build(mp3_filename, wav_filename)
  File "/usr/local/lib/python3.6/dist-packages/sox/transform.py", line 594, in build
    input_filepath, input_array, sample_rate_in
  File "/usr/local/lib/python3.6/dist-packages/sox/transform.py", line 496, in _parse_inputs
    input_format['channels'] = file_info.channels(input_filepath)
  File "/usr/local/lib/python3.6/dist-packages/sox/file_info.py", line 82, in channels
    output = soxi(input_filepath, 'c')
  File "/usr/local/lib/python3.6/dist-packages/sox/core.py", line 149, in soxi
    stderr=subprocess.PIPE
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'sox': 'sox'
"""This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "bin/import_cv2.py", line 221, in <module>
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
    main()
  File "bin/import_cv2.py", line 216, in main
    _preprocess_data(PARAMS.tsv_dir, audio_dir, PARAMS.space_after_every_character)
  File "bin/import_cv2.py", line 172, in _preprocess_data
    set_samples = _maybe_convert_set(dataset, tsv_dir, audio_dir, space_after_every_character)
  File "bin/import_cv2.py", line 127, in _maybe_convert_set
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
    for i, processed in enumerate(pool.imap_unordered(one_sample, samples), start=1):
This install of SoX cannot process .mp3 files.
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
This install of SoX cannot process .mp3 files.
This install of SoX cannot process .mp3 files.
    raise value
FileNotFoundError: [Errno 2] No such file or directory: 'sox': 'sox'
This install of SoX cannot process .mp3 files.
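
The log above only surfaces the missing binary after the worker pool has already started, once per sample. A minimal sketch of a pre-flight check that fails fast with a single message is shown below. The helper names `missing_tools` and `require_tools` are hypothetical and not part of `bin/import_cv2.py`; the sketch only checks that a command is on PATH, not that the installed SoX can decode MP3.

```python
import shutil
import sys


def missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def require_tools(names):
    """Exit with one clear message if any required command is missing."""
    missing = missing_tools(names)
    if missing:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit("Missing required command-line tools: " + ", ".join(missing))
```

Under these assumptions, an importer could call `require_tools(["sox"])` at the top of its `main()` before creating the multiprocessing pool, so a bare image reports the problem once instead of interleaving it with per-file warnings.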

@lissyx
Collaborator

lissyx commented Feb 10, 2021

> but the actual developer workflow that people are going to use to experiment with DeepSpeech means they will probably use the Docker Hub image to import data from CV.

I'm really not sure we should enable that, as I stated earlier.

@KathyReid
Contributor Author

OK, so the alternatives I can see here are:

  • I add instructions to the PlayBook for adding dependencies like sox to the Docker image - and it's likely that other dependencies would be required for other Importers, so if they get stuck they know how to add them

  • I get the user to pull down the image intended for inference which does have the sox deps? To me that's overkill, and makes the onramp to using DeepSpeech harder

So I think option #1 would be better - what do you think?

@lissyx
Collaborator

lissyx commented Feb 10, 2021

> • I add instructions to the PlayBook for adding dependencies like sox to the Docker image - and it's likely that other dependencies would be required for other Importers, so if they get stuck they know how to add them

Yes, people should derive from our Docker image using FROM. Please note we don't have CI on any of the importers, so breakage is very much to be expected.

> • I get the user to pull down the image intended for inference which does have the sox deps? To me that's overkill, and makes the onramp to using DeepSpeech harder

no that's no good

You forget #3: finally address my suggestion stated in #3488 (comment)

@KathyReid
Contributor Author

DATA_FORMATTING.md comes before ENVIRONMENT.md in the sequence but the Docker environment is set up in ENVIRONMENT.md

Previously, I had instructions to git clone DeepSpeech and create a Python venv. I rewrote these instructions for ENVIRONMENT.md and for TRAINING.md, but missed DATA_FORMATTING.md, which assumes that there is a local copy of DeepSpeech on the filesystem. That is, the user is downloading, extracting and importing the Common Voice data on their local filesystem instead of from within a Docker container.

Which is not what we want to do - we want to get the user comfortable with using a Docker container for their activities.

So the task here is:

  • Move the position of DATA_FORMATTING.md to come after ENVIRONMENT.md, and ensure that the tasks in DATA_FORMATTING use Docker, not the local host.

Ruled out - using inference Dockerfile

We've ruled out this option, but I'm just stating it to be clear

Add instructions to the ENVIRONMENT.md file for deriving an image and/or adding dependencies

This is the section that deals with setting up Docker and spinning up a container. This is the place where I should talk about adding dependencies to the image if needed, i.e. sox or other dependencies for other importers.

Task:

  • Add instructions to ENVIRONMENT.md file for deriving an image and/or adding dependencies

French Common Voice and DeepSpeech work

#3488 (comment)

My understanding of this work is that it:

  • Uses FROM to create a new Docker image
  • Installs dependencies required for the new language
  • Uses the Docker image to create a new scorer file for the language
  • New alphabet file for the language

Is this a correct understanding? Or have I misunderstood?

Spinning this off into a new project is beyond the scope of the PlayBook work - my role here is not to provide ongoing support, but to reduce existing support load by creating the PlayBook.

Summary

To reach resolution on this, here is what I propose as tasks:

  • Move the position of DATA_FORMATTING.md to come after ENVIRONMENT.md, and ensure that the tasks in DATA_FORMATTING use Docker, not the local host.

  • Add instructions to ENVIRONMENT.md file for deriving an image and/or adding dependencies

Are you comfortable with this approach?

@lissyx
Collaborator

lissyx commented Feb 10, 2021

> Spinning this off into a new project is beyond the scope of the PlayBook work - my role here is not to provide ongoing support, but to reduce existing support load by creating the PlayBook.

My point is, reducing ongoing support for the usecase you want to cover in the playbook is best addressed by finding an answer to that question

> • Add instructions to ENVIRONMENT.md file for deriving an image and/or adding dependencies

If at least we advertise clearly that:

  • our Docker image is bare and provided as a basis to build on, but people should refrain from hacking into it unless they know what they are doing
  • how to build on top of it through FROM: with CV as a usecase

I think it should be okay.
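
As a rough illustration of the "build on top of it through FROM" approach for the Common Voice case, the sketch below renders the text of a derived Dockerfile from a base image name and a list of apt packages. Everything here is an assumption made for illustration: the helper `derived_dockerfile` is hypothetical, and the base image tag and package names in the usage note are not taken from this thread and should be checked against the image you actually use.

```python
def derived_dockerfile(base_image, apt_packages):
    """Render Dockerfile text that layers `apt_packages` on top of `base_image`."""
    if not apt_packages:
        # Nothing to install: the derived image is just the base image.
        return "FROM {}\n".format(base_image)
    # One package per line, sorted so the output is stable across runs.
    package_lines = " \\\n".join("        " + pkg for pkg in sorted(apt_packages))
    return (
        "FROM {base}\n"
        "RUN apt-get update && apt-get install -y --no-install-recommends \\\n"
        "{packages} \\\n"
        "    && rm -rf /var/lib/apt/lists/*\n"
    ).format(base=base_image, packages=package_lines)
```

For example, `derived_dockerfile("mozilla/deepspeech-train:latest", ["sox", "libsox-fmt-mp3"])` produces a four-instruction-line Dockerfile that can be saved and passed to `docker build`. The image name and the MP3 format package are assumptions here. Keeping extra dependencies in a derived image like this leaves the project's base image untouched, which is the trade-off discussed above.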

@KathyReid
Contributor Author

I've resolved this via
mozilla/deepspeech-playbook#15
So am closing this issue.

@KathyReid KathyReid closed this Feb 28, 2021
@KathyReid KathyReid deleted the patch-4 branch February 28, 2021 03:27