Option Analysis: Current Language Translators & Video Conferencing Software

Skype Real-Time Language Translator with Microsoft Azure Cognitive Services

SaaS offering for real-time language translation as part of Skype
Voice conversation translation for 10 languages
Text translator for over 60 languages
Utilizes machine learning
Available on Windows 7 and up and on OS X, across desktop, mobile, and wearable devices
Pipeline: automatic speech recognition -> speech correction -> Microsoft Translator -> text to speech
Translation uses statistical (SMT) or neural (NMT) machine translation, with models trained on millions of uploaded sentences, words, and speech samples
https://www.skype.com/en/features/skype-translator/
https://blogs.skype.com/news/2014/12/15/skype-translator-how-it-works/
https://www.microsoft.com/en-us/translator/business/machine-translation/
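
The "speech correction" stage in the pipeline above can be pictured as a text cleanup pass between recognition and translation. The following is a minimal sketch, assuming the stage removes filler words and immediate word repetitions; the filler list and the function name clean_transcript are hypothetical and this is not Skype's implementation.

```python
import re

# Hypothetical filler list; a production system would learn these from data.
FILLERS = {"um", "uh", "er", "ah", "hmm"}


def clean_transcript(raw: str) -> str:
    """Drop filler words and immediate word repetitions from ASR output."""
    words = re.findall(r"[\w']+", raw.lower())
    cleaned = []
    for word in words:
        if word in FILLERS:
            continue  # skip disfluencies such as "um" and "uh"
        if cleaned and cleaned[-1] == word:
            continue  # collapse stutters such as "I I think"
        cleaned.append(word)
    return " ".join(cleaned)


print(clean_transcript("Um, I I think uh we should, er, start"))
```

A rule this simple also collapses legitimate repetitions (for example "had had"), which is one reason real systems rely on trained models for this step.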

Google Assistant's bilingual capabilities work only when speaking with the Google Assistant itself, and only in the US
Google Pixel Buds translate in real time for in-person conversations; they are not a video conferencing app
AutoML Translation and the Translation API: train a model with phrases in the desired languages, evaluate the model, and repeat
Google Translate API: https://cloud.google.com/translate/
How Pixel Buds Work: https://techxplore.com/news/2017-11-google-pixel-buds-earphones-languages.html

IBM Watson Language Translator supports text translation; document translation is in beta
Roughly 25 languages supported
Currently no bilingual video conferencing support
2 cents per thousand characters after the first 250,000; 10 cents per thousand characters for custom models
Documentation: https://www.ibm.com/cloud/blog/announcements/document-translation-made-easy-with-watson-language-translator
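
To compare options on cost, the pricing note above can be turned into a quick estimate. This is a rough sketch under stated assumptions: rates are read as US dollars per thousand characters, the free allowance is read as 250,000 characters, and that allowance is assumed to apply to custom models as well. The constant and function names are hypothetical; check the current price list before relying on the numbers.

```python
FREE_CHARACTERS = 250_000      # assumed free allowance
STANDARD_RATE_PER_1K = 0.02    # assumed: USD per 1,000 characters
CUSTOM_RATE_PER_1K = 0.10      # assumed: USD per 1,000 characters, custom models


def estimated_cost(characters: int, custom_model: bool = False) -> float:
    """Estimate the charge in USD for a given number of translated characters."""
    billable = max(0, characters - FREE_CHARACTERS)
    rate = CUSTOM_RATE_PER_1K if custom_model else STANDARD_RATE_PER_1K
    return round(billable / 1000 * rate, 2)


# 1,000,000 characters leaves 750,000 billable characters.
print(estimated_cost(1_000_000))
print(estimated_cost(1_000_000, custom_model=True))
```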

Amazon Translate works with unstructured text and supports roughly 20 languages (https://docs.aws.amazon.com/translate/latest/dg/how-it-works.html)
It also uses neural networks: an encoder reads the source text and a decoder generates the target text (both one word at a time)
Amazon Comprehend handles automated language detection using neural networks. It recognizes key phrases, words, language, sentiment, and syntax. It uses deep learning, offers both asynchronous and synchronous processing, integrates with other AWS services, and supports customization and clustering (https://docs.aws.amazon.com/comprehend/latest/dg/what-is.html)
Amazon Polly converts text to "life-like" speech. It offers multiple voice options, low latency, logging, and pricing based on the amount of text converted. Limitations: available in only 3 regions, and subject to throttle limits (https://docs.aws.amazon.com/polly/latest/dg/what-is.html)
The pricing model is simple ("pay for what you need"), but costs can add up when multiple services are combined
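
The encoder/decoder split noted above for Amazon Translate can be illustrated with a toy example. This sketch shows only the data flow (read the source one word at a time, then generate the target one word at a time); a lookup table stands in for the neural network, and the lexicon and function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical English-to-Spanish lexicon standing in for learned model weights.
LEXICON: Dict[str, str] = {"hello": "hola", "world": "mundo", "friends": "amigos"}


def encode(source: str) -> List[str]:
    """Read the source one word at a time and build an intermediate representation."""
    representation: List[str] = []
    for word in source.lower().split():
        representation.append(word)  # a real encoder would produce vectors, not strings
    return representation


def decode(representation: List[str]) -> str:
    """Generate the target text one word at a time from the representation."""
    output: List[str] = []
    for unit in representation:
        output.append(LEXICON.get(unit, unit))  # unknown words pass through unchanged
    return " ".join(output)


print(decode(encode("Hello friends")))
```

A real neural model does not map words one-to-one; the decoder conditions on the whole encoded sentence, which is what lets it reorder words and handle context.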

Most popular video conferencing applications in industry (https://www.turbinehq.com/blog/5-video-conferencing-apps, https://zapier.com/blog/best-video-conferencing-apps/)
Currently, none of these platforms support bilingual video conferencing

The basic flow of real-time bilingual video conferencing: condition input, language identification, automatic speech recognition, speech to text, text "cleanup", natural language processing, text to speech
Basic system for a bilingual video conferencing application: frontend display -> video API/Lambda -> speech to text (ASR) -> translation API (neural network) -> text to speech -> output to service -> output to user
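
The system outlined above can be sketched as a chain of stages, each consuming the previous stage's output. This is a minimal sketch with stub functions in place of real ASR, translation, and text-to-speech services; every name here is hypothetical, and a real build would replace each stub with a call to the chosen provider.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[str], str]]


def speech_to_text(audio_ref: str) -> str:
    # Stub ASR: pretend the audio reference already carries its transcript.
    return audio_ref.split(":", 1)[1]


def translate(text: str) -> str:
    # Stub translation: tag the text with a target language instead of translating.
    return f"[es] {text}"


def text_to_speech(text: str) -> str:
    # Stub TTS: wrap the text as if it were synthesized audio.
    return f"audio:{text}"


PIPELINE: List[Stage] = [
    ("speech_to_text", speech_to_text),
    ("translate", translate),
    ("text_to_speech", text_to_speech),
]


def run_pipeline(audio_ref: str, stages: List[Stage] = PIPELINE) -> Tuple[str, List[str]]:
    """Pass the payload through each stage in order and record which stages ran."""
    payload = audio_ref
    trace: List[str] = []
    for name, stage in stages:
        payload = stage(payload)
        trace.append(name)
    return payload, trace


result, trace = run_pipeline("audio:hello team")
print(result, trace)
```

Keeping the stages as a list makes it straightforward to swap one provider for another, or to insert a cleanup step, without touching the rest of the chain.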