Generative AI is producing a bunch of fun new models for us devs to poke at. Did you know you can use these over the phone?
Twilio gives you a superpower called Media Streams. Media Streams provides a WebSocket connection to both sides of a phone call. You can get audio streamed to you, process it, and send audio back.
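To make the receive, process, and send-back loop concrete, here is a minimal sketch of handling Media Streams messages in Node.js. It assumes the standard Media Streams JSON events (`start`, `media`, `stop`) with base64-encoded audio payloads. The function names and the `handlers` object are hypothetical and are not taken from this repo.

```javascript
// Sketch: dispatch one Twilio Media Streams WebSocket message.
// `handlers` and its callback names are hypothetical, not part of this repo.
function handleStreamMessage(rawMessage, handlers) {
  const msg = JSON.parse(rawMessage);
  switch (msg.event) {
    case "start":
      // The stream identifier arrives once, when the stream starts.
      handlers.onStart(msg.start.streamSid);
      break;
    case "media":
      // Caller audio arrives as a base64 string; decode it to raw bytes.
      handlers.onAudio(Buffer.from(msg.media.payload, "base64"));
      break;
    case "stop":
      handlers.onStop();
      break;
    default:
      // Other events (for example "connected" or "mark") are ignored here.
      break;
  }
}

// Build the JSON message that sends audio back to the caller.
function buildMediaMessage(streamSid, audioBuffer) {
  return JSON.stringify({
    event: "media",
    streamSid,
    media: { payload: audioBuffer.toString("base64") },
  });
}
```

In a real server, `handleStreamMessage` would be called from the WebSocket `message` event, and the string returned by `buildMediaMessage` would be passed to the socket's `send` method.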
This repo serves as a demo exploring three models:
- Deepgram for Speech to Text
- ElevenLabs for Text to Speech
- OpenAI for GPT prompt completion
These services combine to create a voice application that is remarkably better at transcribing, understanding, and speaking than traditional IVR systems.
Features:
- Returns responses with low latency, typically around 1 second, by streaming data between services.
- Allows the user to interrupt the GPT assistant and ask a different question.
- Maintains chat history with GPT.
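As a rough illustration of the last two features, the sketch below keeps chat history in the role/content message format used by OpenAI's chat API and builds the Media Streams `clear` message, which asks Twilio to drop audio that is buffered but not yet played. This is one possible approach under those assumptions, not necessarily how this repo implements it. `createConversation` and `buildClearMessage` are hypothetical names.

```javascript
// Sketch only: createConversation and buildClearMessage are hypothetical names.
function createConversation(systemPrompt) {
  // History starts with the system prompt and grows with each turn.
  const messages = [{ role: "system", content: systemPrompt }];
  return {
    addUser(text) {
      messages.push({ role: "user", content: text });
    },
    addAssistant(text) {
      messages.push({ role: "assistant", content: text });
    },
    // Return copies so callers cannot mutate the stored history.
    history() {
      return messages.map((m) => ({ ...m }));
    },
  };
}

// When the caller starts talking over the assistant, sending this message
// on the Media Streams WebSocket clears Twilio's buffered outbound audio.
function buildClearMessage(streamSid) {
  return JSON.stringify({ event: "clear", streamSid });
}
```

The array returned by `history()` can be sent as the `messages` field of each chat completion request, so the model sees earlier turns of the call.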
Sign up for Deepgram, ElevenLabs, and OpenAI. You'll need an API key for each service.
Use ngrok to open a tunnel that exposes port 3000:
ngrok http 3000
Copy .env.example to .env and add all API keys.
Set SERVER to your tunneled ngrok URL.
Install the necessary packages:
npm install
Start the web server:
npm run dev
Wire up your Twilio number using the console or CLI:
twilio phone-numbers:update +1[your-twilio-number] --voice-url=https://your-server.ngrok.io/incoming
The /incoming webhook responds with TwiML. The Stream TwiML verb connects the call's audio stream to your WebSocket server.
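For reference, here is a minimal sketch of the TwiML such a webhook could return. It nests Stream inside Connect, the bidirectional form, so the server can send audio back to the caller. The `buildStreamTwiml` helper and the default `/connection` WebSocket path are assumptions made for illustration. Check the repo's server code for the actual route.

```javascript
// Sketch: build a TwiML response that opens a bidirectional media stream.
// buildStreamTwiml and the default "/connection" path are hypothetical.
function buildStreamTwiml(server, path = "/connection") {
  // Accept either a bare host or a full URL: drop any scheme and trailing slashes.
  const host = server.replace(/^[a-z]+:\/\//i, "").replace(/\/+$/, "");
  return (
    '<?xml version="1.0" encoding="UTF-8"?>' +
    "<Response><Connect>" +
    `<Stream url="wss://${host}${path}" />` +
    "</Connect></Response>"
  );
}
```

A webhook handler would send this string with a `Content-Type` of `text/xml`, passing in the value of the SERVER environment variable.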
Fly.io is a hosting service similar to Heroku that simplifies the deployment process. Because Twilio Media Streams are sent and received from us-east-1, choosing Fly's Ashburn, VA (IAD) region is recommended to keep the server close to Twilio.
Deploy the app using the Fly.io CLI:
fly deploy
Import your secrets from your .env file to your deployed app:
fly secrets import < .env