A comprehensive web application for audio transcription, text-to-speech conversion, and document processing with GPU-accelerated performance.
- 🎙️ Audio/Video Transcription - Support for MP3, WAV, MP4, MOV files
- 📺 YouTube Processing - Extract transcripts from videos and playlists
- 📄 Document Processing - Extract text from PDF, DOCX, and TXT files
- 🔊 Text-to-Speech - Generate audio with multiple voice options
- 🌍 Multi-Language Support - 10 languages including English, Spanish, French, German, and more
- 📊 Multiple Output Formats - Text, Markdown, Word documents, and PDF
- ⚡ GPU Acceleration - Powered by faster-whisper for efficient transcription
- 📈 Real-time Progress - Track processing status with visual indicators
- 💾 Job History - All processing jobs saved to database
- Python 3.11+
- PostgreSQL database
- FFmpeg
- faster-whisper (for transcription)
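Before installing, it can help to confirm that the local prerequisites are in place. The following is a minimal sketch, not part of the project: the function name `check_prerequisites` is hypothetical, and it assumes `ffmpeg` is expected on `PATH` and that faster-whisper is importable as `faster_whisper`. It does not check PostgreSQL, since the database may run on another host.

```python
import importlib.util
import shutil
import sys

# Minimum interpreter version, taken from the requirements list above.
MIN_PYTHON = (3, 11)


def check_prerequisites():
    """Return a dict mapping each local prerequisite to True or False."""
    return {
        # Tuple comparison against sys.version_info checks major and minor.
        "python": sys.version_info >= MIN_PYTHON,
        # shutil.which returns None when the executable is not on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # find_spec returns None when the package is not installed.
        "faster-whisper": importlib.util.find_spec("faster_whisper") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```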
- Clone the repository
  git clone https://github.com/YOUR_USERNAME/YOUR_REPO.git
  cd YOUR_REPO
- Install dependencies
  pip install -r requirements.txt
- Set up environment variables
  cp .env.example .env
  # Edit .env with your configuration
- Run the application
  gunicorn --bind 0.0.0.0:5000 main:app
- Access the app
  Open http://localhost:5000 in your browser
For detailed deployment instructions, see deployment/DEPLOY_TO_UBUAI.md
Quick deployment on Ubuntu:
# On your Ubuntu server
sudo mkdir -p /var/www
sudo git clone https://github.com/YOUR_USERNAME/YOUR_REPO.git /var/www/speech-app
cd /var/www/speech-app
sudo bash deployment/setup_ubuntu.sh
- ✅ Nginx reverse proxy
- ✅ Gunicorn WSGI server
- ✅ Systemd service management
- ✅ PostgreSQL database
- ✅ Automatic service restart
- ✅ Log rotation
- ✅ Security hardening
- Select the Files tab
- Upload your audio/video file (MP3, WAV, MP4, MOV)
- Choose the language
- Select output format
- Click "Process"
- Select the YouTube tab
- Paste a YouTube video or playlist URL
- Choose whether to use existing transcript or transcribe audio
- Select language and output format
- Click "Process"
- Select the Documents tab
- Upload a PDF, DOCX, or TXT file
- Select output format
- Click "Process"
- Select the Text tab
- Enter or paste your text
- Choose a voice and language
- Click "Generate Speech"
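The upload steps for the Files and Documents tabs each accept a fixed set of file types. As a rough illustration of how such a check might look on the server side, the sketch below validates a filename by extension. The names `ALLOWED_EXTENSIONS` and `is_allowed_upload` and the tab keys are hypothetical; the extensions themselves come from the steps above. The application's actual validation may differ.

```python
from pathlib import Path

# Extensions come from the upload steps above; the tab keys are hypothetical.
ALLOWED_EXTENSIONS = {
    "files": {".mp3", ".wav", ".mp4", ".mov"},
    "documents": {".pdf", ".docx", ".txt"},
}


def is_allowed_upload(filename: str, tab: str) -> bool:
    """Return True if the filename's extension is accepted for the given tab."""
    # Compare case-insensitively so "TALK.MP3" is treated like "talk.mp3".
    suffix = Path(filename).suffix.lower()
    return suffix in ALLOWED_EXTENSIONS.get(tab, set())
```

An extension check only filters obvious mistakes; it does not prove that the file content matches its name.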
- Backend: Flask + SQLAlchemy
- Database: PostgreSQL
- Transcription: faster-whisper (GPU-accelerated)
- Web Server: Gunicorn + Nginx
- TTS: gTTS + pyttsx3
- Video Processing: yt-dlp + moviepy
# Database
DATABASE_URL=postgresql://user:password@localhost/dbname
# Whisper Configuration
WHISPER_SERVER=localhost
WHISPER_SCRIPT_PATH=/path/to/faster-whisper/script
# Application
FLASK_ENV=production
SECRET_KEY=your-secret-key
- English (en)
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Russian (ru)
- Japanese (ja)
- Korean (ko)
- Chinese (zh)
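The language codes listed above can be kept in a single lookup table so that user input is validated in one place. This is a minimal sketch under that assumption; `SUPPORTED_LANGUAGES` and `normalize_language` are hypothetical names, not identifiers from the project.

```python
# Codes and names copied from the supported-language list above.
SUPPORTED_LANGUAGES = {
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "de": "German",
    "it": "Italian",
    "pt": "Portuguese",
    "ru": "Russian",
    "ja": "Japanese",
    "ko": "Korean",
    "zh": "Chinese",
}


def normalize_language(code: str) -> str:
    """Return the lowercase language code, or raise ValueError if unsupported."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return normalized
```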
- Plain Text (.txt)
- Markdown (.md)
- Word Document (.docx)
- PDF (.pdf)
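Each output format above corresponds to one file extension, so a result filename can be derived from the uploaded file's name. The sketch below shows one way to do that; the format keys (`"text"`, `"markdown"`, `"word"`, `"pdf"`) and the function `output_filename` are assumptions for illustration only.

```python
from pathlib import Path

# Extensions come from the list above; the dictionary keys are hypothetical.
OUTPUT_EXTENSIONS = {
    "text": ".txt",
    "markdown": ".md",
    "word": ".docx",
    "pdf": ".pdf",
}


def output_filename(source_name: str, output_format: str) -> str:
    """Build the result filename from the source file's base name."""
    if output_format not in OUTPUT_EXTENSIONS:
        raise ValueError(f"Unknown output format: {output_format!r}")
    # Path.stem drops any directory part and the final extension.
    return Path(source_name).stem + OUTPUT_EXTENSIONS[output_format]
```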
The application integrates with a GPU-accelerated faster-whisper instance:
- Primary: Remote GPU server via SSH
- Fallback: Local faster-whisper installation
- Auto-detection of best processing method
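The primary/fallback arrangement described above can be sketched as a small wrapper that tries the remote backend first and switches to the local one when the connection fails. This is an illustration of the pattern only, not the code in `whisper_client.py`: the function `transcribe_with_fallback` and the two callables passed to it are hypothetical, and it assumes a connection failure surfaces as an `OSError` (which includes `ConnectionError` and `TimeoutError`).

```python
from typing import Callable, Tuple

# A transcriber takes an audio file path and returns the transcript text.
Transcriber = Callable[[str], str]


def transcribe_with_fallback(
    audio_path: str, remote: Transcriber, local: Transcriber
) -> Tuple[str, str]:
    """Try the remote backend first; fall back to local on connection errors.

    Returns the transcript together with the name of the backend that produced it.
    """
    try:
        return remote(audio_path), "remote"
    except OSError:
        # ConnectionError and TimeoutError are both subclasses of OSError.
        return local(audio_path), "local"
```

Returning the backend name alongside the transcript makes it easy to record in the job history which path handled each job.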
Service won't start:
sudo journalctl -u speech-app -f
Database connection errors:
sudo systemctl status postgresql
Transcription fails:
- Check faster-whisper installation
- Verify GPU server connectivity
- Review application logs
- Application: /var/log/speech-app/
- Nginx: /var/log/nginx/
- System: sudo journalctl -u speech-app
.
├── app.py # Flask application setup
├── main.py # Application entry point
├── models.py # Database models
├── utils/
│ ├── audio_converter.py
│ ├── youtube_processor.py
│ ├── document_processor.py
│ ├── text_to_speech.py
│ ├── output_formatter.py
│ └── whisper_client.py
├── templates/ # HTML templates
├── static/ # CSS, JS, assets
└── deployment/ # Deployment scripts
- Create utility modules in utils/
- Add routes in app.py
- Update models in models.py
- Add templates in templates/
# Check status
sudo systemctl status speech-app
# Restart service
sudo systemctl restart speech-app
# View logs
sudo journalctl -u speech-app -f
# Update application
cd /var/www/speech-app
git pull
sudo systemctl restart speech-app
- GPU-accelerated transcription with faster-whisper
- Asynchronous job processing
- Efficient file handling for large uploads
- Database query optimization
- Nginx caching for static assets
- Environment-based secret management
- SQL injection protection via SQLAlchemy
- XSS protection headers
- CSRF protection
- File upload validation
- Secure password handling
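Environment-based secret management usually means the application reads its secrets at startup and refuses to run when one is missing. The sketch below illustrates that idea with the variable names from the configuration sample; the function `load_settings` and the default values for the optional keys are assumptions, not the project's actual startup code.

```python
import os
from typing import Mapping, Optional

# Variables that must be set; names come from the configuration sample.
REQUIRED_KEYS = ("DATABASE_URL", "SECRET_KEY")


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read settings from the environment, failing fast on missing secrets."""
    env = os.environ if env is None else env
    # Treat empty strings the same as unset variables.
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {
        "DATABASE_URL": env["DATABASE_URL"],
        "SECRET_KEY": env["SECRET_KEY"],
        # Defaults below are assumptions that mirror the sample values.
        "WHISPER_SERVER": env.get("WHISPER_SERVER", "localhost"),
        "FLASK_ENV": env.get("FLASK_ENV", "production"),
    }
```

Accepting an explicit mapping keeps the function easy to test without touching the real process environment.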
This project is for internal use.
For deployment assistance, see:
Built with:
- Flask
- faster-whisper
- PostgreSQL
- Nginx
- Gunicorn