5 top-level use-cases for k2 #37

petercwallis opened this issue Oct 1, 2020 · 6 comments

@petercwallis

Based on last week's meeting, I'd suggest there are two sets of stakeholders: 1) ASR researchers and 2) kaldi users. From a very peripheral view, I think the ASR researchers have two use-cases:

  1. a researcher wants to reproduce results from someone else's paper, and then explore variants. A good reason for doing this is when a published result was produced with massive compute power but the success was credited to a novel technique with a cool name.
  2. a researcher wants to develop a novel component of an ASR system, and uses kaldi to provide everything else: infrastructure, peripherals for experimentation, etc.

For the users, a developer wants to have ASR as part of another project and, for whatever reason, doesn't want to use cloud services:
3) the developer wants a speech interface based on LVCS recognition and the classic pipeline model - sound to transcript to meaning - which works fine for speech "command" systems but not so well for unconstrained input. For unconstrained input, the cloud services work better (big computers and far more training data), and the developer should perhaps not be using kaldi.
4) the developer has a new language/vocabulary and wants to train kaldi models for use in a speech interface. In this case kaldi is [a good / the only] option.
5) the developer wants a speech interface for a dialog system where the vocabulary is limited (like command systems) but the input is not (like LVCS). The naive approach is to use "wild cards" in speech grammars or "word spotting". The point is that the developer's envisaged system can provide information that is useful for the ASR, possibly giving better performance (though not as measured by WER) than commercial LVCS systems for the task.

Of these, 1 is good science but not that interesting, and 3 is misguided. From what I saw, it looks like 4 is a popular usage that makes sense. I suspect that 2 and 5 are closely related, but that requires far more conversation. That is the conversation I would like to contribute to.

Can I also point out, Daniel, that kaldi is famous because people use it. It could be more famous if more people used it successfully. Having 'apt-get install' on a Raspberry Pi would guarantee lots of downloads, and although you may not want to do it yourself, it would be good to do. Perhaps you could find the money to pay someone, or someone might do it for you if you can put their name on a paper or two.

@nshmyrev commented Oct 1, 2020

> the developer wants a speech interface for a dialog system where the vocabulary is limited (like command systems) but the input is not (like LVCS).

Hi Peter. Very few people want to recognize commands these days. With a few commands you get a toy system anyway, because it is easier to press a big red button than to shout "left" and "right" and wait half a second for the system to respond. People have gotten used to assistants and want large-vocabulary recognition, sometimes a crazily large vocabulary of a million words, on a Raspberry Pi. Here we have a problem. For example, you can install Vosk with pip on an RPi3, but it is far from accurate due to CPU restrictions (the RPi is much slower than a phone). There is some work to do on quantization and multithreading here.

You are welcome to try yourself.
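For reference, a minimal Vosk transcription loop looks roughly like this (a sketch: the "model" directory and "test.wav" are placeholder paths, and the WAV must be 16-bit mono PCM):

```python
import json
import wave

from vosk import Model, KaldiRecognizer

# Placeholder paths: a downloaded Vosk model directory and a
# 16-bit mono PCM WAV file.
wf = wave.open("test.wav", "rb")
model = Model("model")
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)  # feed audio chunk by chunk

print(json.loads(rec.FinalResult())["text"])
```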

@petercwallis commented Oct 2, 2020

Hi, and thanks for the feedback. I was throwing this out there to get this kind of discussion going. I am in over my head with speech research, but that is how cross-disciplinary research has to work.

I will agree that the "big success" of recent years is Large Vocabulary Continuous Speech (LVCS) recognition. My wife is currently correcting automatic transcriptions of her lectures, and the Word Error Rate (WER) is very low; she is impressed. However, just because that is impressive does not mean people don't want to do other things. The Echo is also impressive, and it is not doing LVCS but rather what I call above "command" recognition: just a stupidly large number of commands (that get mapped onto a much smaller number of "intents"). More explicit examples of command recognition are the in-car speech interfaces, and yes, you are right; people would prefer a button (with several caveats).

What one cannot do effectively with LVCS, nor with command recognition, is implement the patterns in the ELIZA chatbot. This mechanism is how most (and indeed all successful?) conversational AI systems work, and, I claim, it is something lots of people want to do. The problem is that these patterns contain "wild cards", that is, regex ".*"-style expressions. I have been in a talk by serious AI people (an EU 2020 project) in which the language understanding was achieved, they said, using a "magic regular expression", which got a guilty laugh from everyone. I want to do something very similar to these regular expressions, and am using the hardware used for Wake Word Detection to do "word spotting" in continuous speech, but in a rather convoluted way. I believe there might be an opportunity for the kaldi community here.
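To make that concrete, here is a minimal sketch of ELIZA-style wildcard matching over an ASR transcript; the patterns and intent names are invented for illustration:

```python
import re

# Hypothetical ELIZA-style patterns: the ".*" wild cards soak up free
# text, while the fixed words act like a small "command" vocabulary.
PATTERNS = [
    (re.compile(r".*\bmy (\w+) is broken\b.*"), "report_fault"),
    (re.compile(r".*\bturn (on|off) the (\w+)\b.*"), "switch_device"),
]

def match_intent(transcript):
    """Return (intent, captured groups) for the first matching pattern."""
    for pattern, intent in PATTERNS:
        m = pattern.match(transcript.lower())
        if m:
            return intent, m.groups()
    return None, ()

print(match_intent("um so my printer is broken again"))
# -> ('report_fault', ('printer',))
```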

Agreed about the RPi CPU being tiny (although the Pi4 now has a grown-up CPU with power comparable to a modern mobile phone), but the GPU on a Pi has always been reasonable. I have seen OpenCV doing object detection blindingly fast on the GPU of an RPi Zero. I think Kaldi recognition running on a Linux computer that costs under 10 dollars would make a great demo and a very useful "peripheral".

@nshmyrev commented Oct 2, 2020

Thank you Peter, I didn't realize before that the RPi has a GPU accessible through OpenCL. I'll take a closer look at it.

@petercwallis

Glad to help. Keep in mind that the latest Raspberry Pi is the Pi4, which has a significantly bigger CPU, but the ASR problem looks, to me, like the other things (e.g. ML vision) people have been doing on the GPU of an RPi for a while now. I take it that the thread you opened, "Use VideoCore through OpenCL on RPi3", is there to run with this? Great!

This is exciting, but I want to claim that there is another opportunity - a bigger opportunity :-)

That opportunity is to revisit the way downstream Natural Language Understanding (NLU) can help the ASR process by providing expectations. This is not (historically) a new idea, but it does seem to have been lost to history, and it might be an interesting thing to do with kaldi.
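As a toy sketch of what "providing expectations" could mean in practice (the n-best list, scores, and expected phrases below are invented; a real system would take them from the recognizer and the dialog manager):

```python
# Rescore an ASR n-best list with dialog-state expectations: hypotheses
# containing a phrase the dialog expects get a score bonus.
def rescore(nbest, expected_phrases, boost=2.0):
    best = None
    for text, score in nbest:
        score += boost * sum(phrase in text for phrase in expected_phrases)
        if best is None or score > best[1]:
            best = (text, score)
    return best

nbest = [("wreck a nice beach", -12.3), ("recognize speech", -12.9)]
expected = ["recognize", "speech"]          # supplied by the dialog state
print(rescore(nbest, expected))             # ('recognize speech', -8.9)
```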

@kkm000 commented Oct 3, 2020

Peter, people indeed use Kaldi on devices like smartphones and tablets, which are pretty much comparable in CPU power to the Pi3 and Pi4. You reminded me I have a pending ticket for fixing the configure script for one of these cases. Most ARM v7 and v8 chips have NEON extensions (128-bit SIMD) that are taken advantage of by math libraries (ACL is more vision-oriented but certainly an option, and, IIRC, OpenBLAS has kernels for NEON too). For the ?gemm-heavy decode part (the AM), halving the precision to float16 more than doubles ?gemm performance (the Pi4's A72 does indeed support float16_t) and cuts the model size in half, without sacrificing much accuracy, given good engineering and/or good luck (it's data witchcraft, not data science anyway).
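A toy numpy sketch of the size-halving effect (illustration only; the real speedup also needs float16 ?gemm kernels, e.g. NEON on the A72):

```python
import numpy as np

# A made-up 1024x1024 weight matrix standing in for an AM layer.
weights = np.random.randn(1024, 1024).astype(np.float32)
half = weights.astype(np.float16)

print(weights.nbytes // 1024, "KiB as float32")  # 4096 KiB
print(half.nbytes // 1024, "KiB as float16")     # 2048 KiB
print("max rounding error:", np.abs(weights - half.astype(np.float32)).max())
```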

I do not think decoding on the GPU is common, if used at all, as the current Kaldi is CUDA-only. That is likely to change, as the frameworks we are planning to support have device-optimized versions. TensorFlow would certainly make access to the GPU easier.

As for the proposed taxonomy of uses, I can think of areas that would be hard to fit. Education is just one example.

The dichotomy between "researchers" and "users" is also far from hard and fast, as device limitations necessarily require architecting a suitable model. Not necessarily something really novel, on the scale of the invention of LF-MMI or CTC, but certainly a lot of model-building skill and literature research is required. Engineering constraints are much tighter, and tradeoffs are more significant, so it's possible that there is not even a single "apt install", one-size-fits-anything model even for a single language. If you are reducing precision, you are likely to account for that in training too. Pruning the network (and (Sze et al., 2017) has over 1K citations!) may provide significant benefits as well. As with nearly everything, pulling something out of the box and making it "just work" is only the starting point, especially on resource-constrained hardware.
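A rough sketch of the magnitude-pruning idea (invented numbers, not Kaldi code; see Sze et al., 2017 for the real treatment):

```python
import numpy as np

def prune(weights, sparsity=0.9):
    """Zero the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights).astype(weights.dtype)

w = np.random.randn(512, 512).astype(np.float32)
pruned = prune(w)
print("nonzero fraction:", np.count_nonzero(pruned) / pruned.size)  # ~0.1
```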

I do not really think the world has abandoned the idea of an ASR and NLU marriage. Recruiting attention for this very task was mentioned at the 3rd session (I forget who the panelist was; my memory for names is nearly non-existent).

@petercwallis commented Oct 3, 2020 via email
