Refactor: factor algorithm out of server; bundle resource lookups into a single object #99

Merged
merged 10 commits into from Sep 13, 2016


ronen
Contributor

@ronen ronen commented Aug 25, 2016

Hi, thanks for making Gentle available!

I'm looking into using Gentle, but without running a server. This PR includes:

  • align.py, a shell command that runs the forced alignment algorithm and outputs JSON
  • To do that cleanly, I factored out the algorithm from serve.py, creating two classes ForcedAligner and FullTranscriber
  • And to do that cleanly, I bundled all resource lookups into a single GentleResources instance which I pass around; and once I had that I propagated it into the other bits of code that used those resources.
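The structure described in the list above can be sketched roughly as follows. This is an illustrative stand-in, not the code from the PR: the attribute names `proto_langdir` and `full_hclg_path`, the constructor signatures, and the placeholder result are all assumptions. The point is only that path knowledge lives in one object that is passed to the aligner, so a shell command can drive the algorithm without the server.

```python
import json
import os


class Resources:
    """Hypothetical stand-in for the single object that bundles resource lookups."""

    def __init__(self, root="."):
        # All path knowledge lives here, so callers never build paths themselves.
        # Attribute names are assumptions made for this sketch.
        self.proto_langdir = os.path.join(root, "PROTO_LANGDIR")
        self.full_hclg_path = os.path.join(root, "data", "graph", "HCLG.fst")


class ForcedAligner:
    """Hypothetical stand-in: receives the resources object instead of looking paths up."""

    def __init__(self, resources, transcript):
        self.resources = resources
        self.transcript = transcript

    def transcribe(self, wavfile):
        # Placeholder result; the real alignment passes are not reproduced here.
        return {
            "transcript": self.transcript,
            "words": [],
            "audio": wavfile,
            "langdir": self.resources.proto_langdir,
        }


def main(audio_path, transcript, root="."):
    """What a command-line entry point might do: build resources, align, emit JSON."""
    resources = Resources(root)
    aligner = ForcedAligner(resources, transcript)
    return json.dumps(aligner.transcribe(audio_path), indent=2)
```

Because the aligner only sees the `Resources` instance, a server and a command-line script can share it without either one knowing where the files live.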

I took my best shot at organizing and naming and whatnot, but of course I'm open to changes that you'd find more suitable.

[Oh I should also point out that I haven't tested FullTranscriber since the install didn't seem to include data/graph/HCLG.fst. It "should" work since the code is just moved over from where it had been earlier, but of course bugs could have easily crept in.]

serve.py now contains no logic for dealing with the resources, building queues, running the multiple passes, etc.  It just runs the aligner/transcriber and displays the output.

NB I haven’t been able to test the FullTranscriber since I don’t have the relevant HCLG file
@strob
Contributor

strob commented Aug 26, 2016

This looks great! I've been experimenting with ways to use Gentle programmatically from python, but hadn't settled on a strategy yet. Two issues to discuss:

  • There are lots of path issues that come along with the desire to import gentle. In particular, thinking through where to put all of the language model resources when the python package is installed system-wide. Until now, every project I make that uses gentle has a mess of symlinks to ext/, PROTO_LANGDIR, etc. (Incidentally, the install_language_model.sh script will provide the `data/graph/HCLG.fst` file that you're missing.)
  • The other thing I often do when using gentle programmatically is align part of a wavefile. I usually do that by using the lower-level StandardKaldi object directly, but we may want to think about exposing a partial transcription/alignment API.

Let me know what you think about these points, and we can figure out how to complete the merge.
Thanks!

@ronen
Contributor Author

ronen commented Aug 26, 2016

Great, glad you like this direction!

... thinking through where to put all of the language model resources when the python package is installed system-wide. Until now, every project I make that uses gentle has a mess of symlinks to ext/, PROTO_LANGDIR, etc.

I don't have a specific thought as to where to put the resources. But I wonder whether the GentleResources object could be made smarter to help deal with this? Here's a proposal:

  • GentleResources would expect a manifest JSON that would list the resources and their locations.
  • GentleResources could allow the manifest to be specified by explicit file path and/or by search path and/or via environment variables. (And for completeness, by programmatically providing the JSON data.)
  • The manifest contents would be keyed by a language name, allowing multiple languages or versions of language data to be installed simultaneously. By default GentleResources would use the first/only language in the manifest.
  • Also provide tools to help create the manifest JSON while installing the resources and/or create the manifest JSON after the fact by somehow finding existing resources?
  • Advanced, maybe: The manifest could support specifying URIs rather than only local filesystem paths, and GentleResources could download & cache the files locally somewhere. In fact, the manifest itself could be specified by a URI.
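A minimal sketch of how the local-file part of this proposal might look. Everything here is hypothetical and not part of Gentle: the environment variable `GENTLE_MANIFEST`, the file name `manifest.json`, the search directories, and the resource key names are placeholders chosen for illustration. The URI download and caching idea is left out.

```python
import json
import os


class Resources:
    """Sketch of manifest-driven resource lookup (all names are assumptions)."""

    ENV_VAR = "GENTLE_MANIFEST"    # assumed environment variable name
    FILENAME = "manifest.json"     # assumed manifest file name
    SEARCH_PATH = [".", os.path.expanduser("~/.gentle")]  # assumed search path

    def __init__(self, manifest=None, manifest_path=None, language=None):
        # A manifest passed in programmatically wins; otherwise load one from disk.
        if manifest is None:
            manifest = self._load(manifest_path)
        if not manifest:
            raise ValueError("manifest lists no languages")
        # Default to the first (or only) language in the manifest.
        self.language = language if language is not None else next(iter(manifest))
        if self.language not in manifest:
            raise KeyError("language not in manifest: %r" % (self.language,))
        self.paths = dict(manifest[self.language])

    @classmethod
    def _load(cls, manifest_path):
        # Precedence: explicit path, then environment variable, then search path.
        candidates = []
        if manifest_path:
            candidates.append(manifest_path)
        if os.environ.get(cls.ENV_VAR):
            candidates.append(os.environ[cls.ENV_VAR])
        candidates.extend(os.path.join(d, cls.FILENAME) for d in cls.SEARCH_PATH)
        for path in candidates:
            if os.path.isfile(path):
                with open(path) as f:
                    return json.load(f)
        raise FileNotFoundError("no manifest found in: %s" % ", ".join(candidates))

    def path(self, name):
        """Return the location recorded for a named resource of the chosen language."""
        return self.paths[name]
```

With a manifest such as `{"en": {"hclg": "/data/en/HCLG.fst"}}`, `Resources(manifest=...).path("hclg")` would return the English graph path. Since `json.load` preserves key order, "first language" here means the first key in the file.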

(and really GentleResources is a bad name, it should just be gentle.Resources I'd say)

What do you think about all that?

(Incidentally, the install_language_model.sh script will provide the `data/graph/HCLG.fst` file that you're missing.)

Oh good to know thanks. I'll get the file and try it out. (Next week :)

... align part of a wavefile. I usually do that by using the lower-level StandardKaldi object directly, but we may want to think about exposing a partial transcription/alignment API.

Yes, that seems like it would make sense. I guess via optional parameters for starting time and length in the wavefile? Would it also make sense to have start and length in the transcript? And/or an easy way to specify a single wavefile and transcript and support the ability to do a partial alignment then later continue where that alignment left off? (I haven't used alignment in practice yet to have a good sense for important use cases.)
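One way to read "optional parameters for starting time and length" is to cut the requested span out of the wavefile before handing it to the aligner. The sketch below does only that cutting step, using Python's standard `wave` module. It is not Gentle's API, and `extract_segment` is a hypothetical helper name.

```python
import wave


def extract_segment(src, dst, start=0.0, length=None):
    """Copy `length` seconds of audio starting at `start` seconds from src to dst.

    src and dst may be file paths or binary file objects.
    Returns the number of frames written.
    """
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        rate = params.framerate
        total = params.nframes
        # Clamp the starting frame to the bounds of the file.
        first = max(0, min(int(round(start * rate)), total))
        if length is None:
            count = total - first
        else:
            count = max(0, min(int(round(length * rate)), total - first))
        reader.setpos(first)
        frames = reader.readframes(count)
    with wave.open(dst, "wb") as writer:
        writer.setnchannels(params.nchannels)
        writer.setsampwidth(params.sampwidth)
        writer.setframerate(rate)
        writer.writeframes(frames)
    return count
```

Any word timings produced from the extracted segment would be relative to the segment, so `start` would need to be added back to map them onto the original file.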

BTW, ultimately for the project I'm looking at I'd need to do everything in C++ without Python, so at some point I'll be reimplementing the core alignment algorithm in C++, directly calling Kaldi. Let me know if that's something you'd want contributed back here--subject to my client OK'ing releasing the C++ code that is.

@ronen
Contributor Author

ronen commented Aug 31, 2016

(Incidentally, the install_language_model.sh script will provide the `data/graph/HCLG.fst` file that you're missing.)
Oh good to know thanks. I'll get the file and try it out. (Next week :)

Needed a small fix, but it's working now.

@ronen
Contributor Author

ronen commented Aug 31, 2016

I've gone a little farther with the refactoring & encapsulation:

  • Now the top-level apps serve.py and align.py just import gentle and use gentle.Resources, gentle.FullTranscriber and gentle.ForcedAligner and don't reach into the gentle package to import anything else.
  • The value returned by the .transcribe() methods is now an instance of gentle.Transcription rather than plain dictionary. The formerly top-level to_csv() and to_json() functions are now methods of that class.
  • Utilities (paths, ffmpeg and cyst) that aren't particularly gentle-specific, and that are used only by the top-level apps or are shared by the gentle package and the top-level apps, are now in a separate package util.
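The move from a plain dictionary to a result object with `to_json()` and `to_csv()` methods could look something like the sketch below. The constructor arguments, the word field names (`word`, `start`, `end`), and the CSV column layout are assumptions made for illustration, not the actual class from this PR.

```python
import csv
import io
import json


class Transcription:
    """Hypothetical result object that wraps what used to be a plain dictionary."""

    def __init__(self, transcript, words):
        self.transcript = transcript
        self.words = words  # list of dicts; field names are assumed

    def to_json(self, **kwargs):
        # Extra keyword arguments (e.g. indent=2) are forwarded to json.dumps.
        return json.dumps(
            {"transcript": self.transcript, "words": self.words}, **kwargs
        )

    def to_csv(self):
        # One row per word: word, start time, end time.
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        for w in self.words:
            writer.writerow([w["word"], w["start"], w["end"]])
        return buf.getvalue()
```

Keeping the serializers on the result object means a caller such as a server or a command-line script only needs the value returned by `.transcribe()` to produce either format.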

Hope you approve

… resampling really are gentle-specific.

also provide context manager version that creates a tempfile
@strob strob merged commit 1ed0f82 into lowerquality:master Sep 13, 2016
@strob
Contributor

strob commented Sep 13, 2016

Thanks so much for this and sorry it took me so long to finish testing. Big code quality improvements!
