Documents → Podcast-style conversations → Real-time voice Q&A
Upload any document (PDF, DOCX, TXT) → AI generates a natural two-host podcast conversation → Listen and ask real-time questions by voice.
- Document to Podcast: Upload a PDF/DOCX/TXT, paste text, or snap a photo and get an engaging two-host podcast conversation
- Dual AI Voices: Host + Guest with natural speech synthesis
- Real-time Q&A: Ask questions via voice or text, get audio answers
- No GPU Required: Runs entirely on CPU using cloud AI APIs (free tier)
- Privacy First: Documents stay on your machine; only the extracted text is sent to the LLM API
| Layer | Technology |
|---|---|
| Frontend | React 18 + Vite + Tailwind CSS |
| Backend | FastAPI (Python 3.10+) |
| LLM | Groq Llama 3.1 8B (free tier, generous limits) |
| STT | Groq Whisper (free tier) |
| TTS | edge-tts v7.2+ (free, no key needed, conversational voices) |
| Image OCR | Google Gemini Vision (free tier) |
| Retrieval | In-memory keyword search (demo) |
| Database | SQLite (via SQLAlchemy async) |
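The "in-memory keyword search" retrieval listed in the table can be pictured with a small sketch. This is not the project's `vector_service.py`; the function names `tokenize` and `top_chunks` and the scoring rule (count of question terms in each chunk) are assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return up to k chunks ranked by how often they contain question terms.

    Hypothetical scoring: each occurrence of a question term in a chunk
    adds one point. Chunks with no matching term are dropped.
    """
    query_terms = set(tokenize(question))
    scored = []
    for index, chunk in enumerate(chunks):
        counts = Counter(tokenize(chunk))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, index))
    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunks[index] for _, index in scored[:k]]
```

A keyword match like this needs no embedding model, which fits a demo that runs on CPU, but it will miss paraphrased questions that share no words with the source text.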
| Tool | Version | Install |
|---|---|---|
| Python | 3.10 or higher | python.org or brew install python |
| Node.js | 18 or higher | nodejs.org or brew install node |
| ffmpeg | any | brew install ffmpeg (macOS) / sudo apt install ffmpeg (Ubuntu) / ffmpeg.org (Windows) |
| Git | any | brew install git or git-scm.com |
Groq (for LLM + STT):
- Go to console.groq.com/keys
- Sign up (free, no credit card needed)
- Create an API key and copy it
Google AI Studio (for Image OCR only):
- Go to aistudio.google.com/app/apikey
- Sign in with your Google account (free, no credit card needed)
- Create an API key and copy it
git clone https://github.com/pushkal1234/PaperPod.git
cd PaperPod
cd backend
# Copy the example env file and add your API keys
cp .env.example .env
# Open .env in any editor and replace the placeholders with your actual keys
# Example:
# GROQ_API_KEY=gsk_...
# GOOGLE_API_KEY=AIza...
# Create a Python virtual environment
python3 -m venv venv
# Activate the virtual environment
source venv/bin/activate # macOS / Linux
# venv\Scripts\activate # Windows (Command Prompt)
# venv\Scripts\Activate.ps1 # Windows (PowerShell)
# Upgrade pip (recommended)
pip install --upgrade pip setuptools wheel
# Install dependencies
pip install -r requirements.txt
# Start the backend server
uvicorn app.main:app --reload --port 8000
You should see: INFO: Application startup complete.
# Open a new terminal tab/window, navigate to the project
cd PaperPod/frontend
# Install Node.js dependencies
npm install
# Start the development server
npm run dev
You should see: Local: http://localhost:5173/
- Open http://localhost:5173 in your browser
- Upload a PDF, DOCX, or TXT document
- Wait ~2-3 minutes for podcast generation
- Listen to your AI-generated podcast
- Ask questions via voice or text in the Q&A panel
| Problem | Solution |
|---|---|
| pip install fails with pkg_resources error | Run pip install --upgrade pip setuptools wheel first |
| Backend: No module named 'greenlet' | Run pip install greenlet |
| Backend: Address already in use on port 8000 | Run lsof -ti:8000 \| xargs kill -9 then restart |
| Groq rate limit error | Wait a few seconds and retry; the free tier has generous but finite limits |
| edge-tts 403 error | Run pip install --upgrade edge-tts; v7.2+ has the fix |
| Gemini API quota error | Only used for image OCR; if hitting limits, wait and retry |
| Frontend: blank page | Make sure backend is running on port 8000 first |
| ffmpeg not found | Install ffmpeg: brew install ffmpeg (macOS) |
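The "wait a few seconds and retry" advice for rate-limit errors can also be automated. The following is a minimal sketch of retry with exponential backoff, not code from the project: `retry_with_backoff` and the `is_rate_limit` predicate are hypothetical names, and how a rate-limit error is recognized depends on the client library you use.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    is_rate_limit: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on rate-limit errors with growing delays.

    is_rate_limit is a caller-supplied predicate (an assumption here);
    any other exception is re-raised immediately.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if not is_rate_limit(exc) or last_attempt:
                raise
            # Delay doubles each attempt: 1s, 2s, 4s, ... plus small jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.25)
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the helper can be exercised in tests without actually waiting.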
PaperPod/
├── backend/
│   ├── .env.example                 # Environment config (copy to .env)
│   ├── requirements.txt             # Python dependencies
│   └── app/
│       ├── main.py                  # FastAPI entry point
│       ├── config.py                # Settings & configuration
│       ├── database.py              # SQLAlchemy models (documents → audio_files 1:1)
│       ├── routes/
│       │   ├── documents.py         # Upload, list, status endpoints
│       │   ├── audio.py             # Stream podcast MP3
│       │   └── qa.py                # Q&A: voice/text question → audio answer
│       └── services/
│           ├── document_service.py  # PDF/DOCX/TXT extraction + chunking
│           ├── vector_service.py    # In-memory chunk store + keyword retrieval
│           ├── llm_service.py       # Groq LLM (podcast script + Q&A)
│           ├── tts_service.py       # edge-tts (Host + Guest conversational voices)
│           ├── stt_service.py       # Groq Whisper speech-to-text
│           └── image_service.py     # Google Gemini Vision OCR (camera upload)
├── frontend/
│   ├── src/
│   │   ├── App.jsx                  # Main app (upload → processing → player)
│   │   ├── api.js                   # API client (axios)
│   │   ├── components/
│   │   │   ├── UploadZone.jsx       # File upload + text paste + camera capture
│   │   │   ├── PodcastPlayer.jsx    # Audio player + transcript view
│   │   │   └── QAPanel.jsx          # Voice/text Q&A chat interface
│   │   └── hooks/
│   │       └── useAudioRecorder.js  # MediaRecorder hook for mic input
│   ├── index.html
│   ├── package.json
│   ├── vite.config.js
│   ├── tailwind.config.js
│   └── postcss.config.js
├── .gitignore
└── README.md
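The tree lists `document_service.py` as handling "extraction + chunking". As a rough illustration of the chunking half, here is a sketch that splits extracted text into overlapping word windows. The function name `chunk_text`, the word-based window, and the default sizes are all assumptions; the actual service may chunk by characters, sentences, or pages.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears whole in one of the two chunks.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Smaller chunks make keyword retrieval more precise but give the LLM less context per chunk, so the sizes are worth tuning per document type.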
flowchart LR
    subgraph GROQ["Groq (Free Tier)"]
        LLM["Llama 3.1 8B\n─────────────────\n• Podcast script generation\n• Q&A answering\n• Fast & reliable"]
        STT["Whisper\n─────────────────\n• Speech-to-text\n• Voice question transcription\n• Multi-language support"]
    end
    subgraph TTS["edge-tts (Free, No Key)"]
        HOST["Host: AriaNeural"]
        GUEST["Guest: GuyNeural"]
    end
    subgraph OCR["Google AI Studio (Free)"]
        VISION["Gemini Vision\n─────────────────\n• Image OCR\n• Camera upload"]
    end
    subgraph PIPELINE["How They Connect"]
        DOC["Document"] --> LLM
        CAM["Camera"] --> VISION --> LLM
        LLM -->|dialogue script| HOST
        LLM -->|dialogue script| GUEST
        HOST -->|podcast .mp3| PLAY["Player"]
        GUEST -->|podcast .mp3| PLAY
        PLAY -->|user speaks| STT
        STT -->|question text| LLM
        LLM -->|answer text| GUEST
        GUEST -->|answer .mp3| PLAY
    end
    style GROQ fill:#E8F8F5,stroke:#1ABC9C,stroke-width:2px
    style TTS fill:#FFF3E0,stroke:#FF9800,stroke-width:2px
    style OCR fill:#E3F2FD,stroke:#2196F3,stroke-width:2px
    style PIPELINE fill:#F4ECF7,stroke:#8E44AD,stroke-width:2px
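The "dialogue script" edges in the diagram imply a step that routes each line of the LLM's script to the Host or Guest voice. A minimal sketch of that step follows. The `Host:` / `Guest:` line format, the function name `script_to_segments`, and the `en-US-` locale prefix on the voice names are assumptions; the diagram only names AriaNeural and GuyNeural.

```python
# Assumed mapping from speaker label to an edge-tts voice name.
VOICES = {
    "host": "en-US-AriaNeural",
    "guest": "en-US-GuyNeural",
}


def script_to_segments(script: str) -> list[tuple[str, str]]:
    """Turn a 'Host: ...' / 'Guest: ...' script into (voice, text) pairs.

    Lines without a known speaker label are treated as a continuation
    of the previous speaker's turn; blank lines are skipped.
    """
    segments: list[tuple[str, str]] = []
    for line in script.splitlines():
        speaker, sep, text = line.partition(":")
        key = speaker.strip().lower()
        text = text.strip()
        if sep and key in VOICES and text:
            segments.append((VOICES[key], text))
        elif segments and line.strip():
            voice, previous = segments[-1]
            segments[-1] = (voice, f"{previous} {line.strip()}")
    return segments
```

Each pair could then be synthesized separately, for example by awaiting `edge_tts.Communicate(text, voice).save(path)`, with the resulting clips joined by ffmpeg. Whether the project does it this way is not stated here.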
| Model | Provider | Purpose | Cost |
|---|---|---|---|
| Llama 3.1 8B | Groq | Podcast script generation + Q&A | Free |
| Whisper | Groq | Speech-to-text (voice questions) | Free |
| edge-tts | Microsoft Edge TTS | TTS: Host (Aria) + Guest (Guy) | Free |
| Gemini Vision | Google AI Studio | Image OCR (camera upload) | Free |
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/documents/upload | Upload file (PDF/DOCX/TXT), starts podcast generation |
| POST | /api/documents/text | Paste text, starts podcast generation |
| POST | /api/documents/image | Upload image (camera), OCR + podcast generation |
| GET | /api/documents/{doc_id} | Get document + audio status |
| GET | /api/documents/list | List all documents |
| GET | /api/audio/{audio_id} | Stream podcast audio |
| POST | /api/qa/ask | Ask question (text or voice) |
| GET | /api/qa/audio/{qa_id} | Get Q&A answer audio |
| GET | /api/qa/history/{doc_id} | Q&A history for a document |
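Because generation takes a few minutes, a script that calls the API has to poll `GET /api/documents/{doc_id}` until the podcast is ready. The sketch below shows one way to do that with the standard library. The response schema is not documented here, so the `status` field and its `"completed"` / `"failed"` values are assumptions, as are the helper names.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8000"  # backend port used in the setup steps


def document_url(doc_id: str) -> str:
    """Build the status URL for a document."""
    return f"{BASE_URL}/api/documents/{doc_id}"


def fetch_json(url: str) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_for_podcast(
    doc_id: str,
    fetch: Callable[[str], dict] = fetch_json,
    interval: float = 5.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll the document endpoint until it reports a terminal status.

    Assumes the payload carries a 'status' key that eventually becomes
    'completed' or 'failed' (hypothetical values).
    """
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch(document_url(doc_id))
        if payload.get("status") in ("completed", "failed"):
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"document {doc_id} not ready after {timeout}s")
        sleep(interval)
```

With the backend running, `wait_for_podcast("<doc_id>")` would block until the document settles; check the real response in the browser's network tab or FastAPI's `/docs` page and adjust the field names accordingly.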