🎧 PaperPod

Documents → Podcast-style conversations → Real-time voice Q&A

Upload any document (PDF, DOCX, TXT) → AI generates a natural two-host podcast conversation → Listen & ask real-time questions with voice.


✨ Features

  • Document to Podcast: Upload a PDF/DOCX/TXT, paste text, or snap a photo and get an engaging two-host podcast conversation
  • Dual AI Voices: Host + Guest with natural speech synthesis
  • Real-time Q&A: Ask questions via voice or text, get audio answers
  • No GPU Required: Runs entirely on CPU using cloud AI APIs (free tier)
  • Privacy First: Uploaded documents stay on your machine; only the extracted text is sent to the LLM API (camera images are sent to Gemini for OCR)

Tech Stack

| Layer | Technology |
|---|---|
| Frontend | React 18 + Vite + Tailwind CSS |
| Backend | FastAPI (Python 3.10+) |
| LLM | Groq Llama 3.1 8B (free tier, generous limits) |
| STT | Groq Whisper (free tier) |
| TTS | edge-tts v7.2+ (free, no key needed; conversational voices) |
| Image OCR | Google Gemini Vision (free tier) |
| Retrieval | In-memory keyword search (demo) |
| Database | SQLite (via SQLAlchemy async) |
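
The "in-memory keyword search" retrieval layer can be pictured with a minimal sketch like the one below. It is an illustration, not the code in vector_service.py: the function names (`tokenize`, `score_chunk`, `top_chunks`) and the word-overlap scoring rule are assumptions, and there is no stop-word filtering.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_chunk(question: str, chunk: str) -> int:
    """Count how often the question's distinct words occur in the chunk."""
    counts = Counter(tokenize(chunk))
    return sum(counts[word] for word in set(tokenize(question)))


def top_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return up to k chunks with a non-zero score, best match first."""
    scored = [(score_chunk(question, c), i, c) for i, c in enumerate(chunks)]
    scored = [item for item in scored if item[0] > 0]
    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [c for _, _, c in scored[:k]]
```

The selected chunks would then be placed in the LLM prompt as context for the Q&A answer. A demo-scale store like this keeps everything in a Python list, so it is lost when the backend restarts.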

🚀 Quick Start: Local Setup

Prerequisites

| Tool | Version | Install |
|---|---|---|
| Python | 3.10 or higher | python.org or `brew install python` |
| Node.js | 18 or higher | nodejs.org or `brew install node` |
| ffmpeg | any | `brew install ffmpeg` (macOS) / `sudo apt install ffmpeg` (Ubuntu) / ffmpeg.org (Windows) |
| Git | any | `brew install git` or git-scm.com |
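
Before continuing, the prerequisites can be checked with a small script. The sketch below is not part of the repository; `missing_prerequisites` is a hypothetical helper that only verifies the Python version and that the other tools are on PATH (it does not check the Node.js version).

```python
import shutil
import sys


def missing_prerequisites(min_python=(3, 10), tools=("node", "ffmpeg", "git")):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info[:2] < tuple(min_python):
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        wanted = f"{min_python[0]}.{min_python[1]}"
        problems.append(f"Python {wanted}+ required, found {found}")
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems
```

Calling `missing_prerequisites()` and printing each entry gives a quick list of what still needs installing.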

Step 1: Get free API keys

Groq (for LLM + STT):

  1. Go to console.groq.com/keys
  2. Sign up (free, no credit card needed)
  3. Create an API key and copy it

Google AI Studio (for Image OCR only):

  1. Go to aistudio.google.com/app/apikey
  2. Sign in with your Google account (free, no credit card needed)
  3. Create an API key and copy it

Step 2: Clone the repo

git clone https://github.com/pushkal1234/PaperPod.git
cd PaperPod

Step 3: Set up the Backend

cd backend

# Copy the example env file and add your API keys
cp .env.example .env
# Open .env in any editor and replace the placeholders with your actual keys
# Example:
#   GROQ_API_KEY=gsk_...
#   GOOGLE_API_KEY=AIza...

# Create a Python virtual environment
python3 -m venv venv

# Activate the virtual environment
source venv/bin/activate        # macOS / Linux
# venv\Scripts\activate         # Windows (Command Prompt)
# venv\Scripts\Activate.ps1     # Windows (PowerShell)

# Upgrade pip (recommended)
pip install --upgrade pip setuptools wheel

# Install dependencies
pip install -r requirements.txt

# Start the backend server
uvicorn app.main:app --reload --port 8000

You should see: INFO: Application startup complete.

Step 4: Set up the Frontend (new terminal)

# Open a new terminal tab/window, navigate to the project
cd PaperPod/frontend

# Install Node.js dependencies
npm install

# Start the development server
npm run dev

You should see: Local: http://localhost:5173/

Step 5: Use PaperPod

  1. Open http://localhost:5173 in your browser
  2. Upload a PDF, DOCX, or TXT document
  3. Wait ~2-3 minutes for podcast generation
  4. Listen to your AI-generated podcast
  5. Ask questions via voice or text in the Q&A panel

⚠️ Troubleshooting

| Problem | Solution |
|---|---|
| `pip install` fails with `pkg_resources` error | Run `pip install --upgrade pip setuptools wheel` first |
| Backend: `No module named 'greenlet'` | Run `pip install greenlet` |
| Backend: `Address already in use` on port 8000 | Run `lsof -ti:8000 \| xargs kill -9`, then restart |
| Groq rate limit error | Wait a few seconds and retry; the free tier has generous but finite limits |
| edge-tts 403 error | Run `pip install --upgrade edge-tts`; v7.2+ has the fix |
| Gemini API quota error | Only used for image OCR; if hitting limits, wait and retry |
| Frontend: blank page | Make sure the backend is running on port 8000 first |
| `ffmpeg` not found | Install ffmpeg: `brew install ffmpeg` (macOS) |
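
The "wait and retry" advice for rate-limit and quota errors can be automated with exponential backoff. This is a generic sketch, not code from the repository: `retry_with_backoff` is a hypothetical helper, and it retries on any exception for brevity, whereas a real version would catch only the API client's rate-limit error.

```python
import time


def retry_with_backoff(call, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run call(); after a failure wait base_delay * 2**n seconds and retry.

    Re-raises the last exception once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

With the defaults, the waits between the four attempts are 1, 2, and 4 seconds. The `sleep` parameter exists so the delay can be replaced in tests.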

Project Structure

PaperPod/
├── backend/
│   ├── .env.example              # Environment config (copy to .env)
│   ├── requirements.txt           # Python dependencies
│   └── app/
│       ├── main.py               # FastAPI entry point
│       ├── config.py             # Settings & configuration
│       ├── database.py           # SQLAlchemy models (documents ↔ audio_files 1:1)
│       ├── routes/
│       │   ├── documents.py      # Upload, list, status endpoints
│       │   ├── audio.py          # Stream podcast MP3
│       │   └── qa.py             # Q&A: voice/text question → audio answer
│       └── services/
│           ├── document_service.py   # PDF/DOCX/TXT extraction + chunking
│           ├── vector_service.py     # In-memory chunk store + keyword retrieval
│           ├── llm_service.py        # Groq LLM (podcast script + Q&A)
│           ├── tts_service.py        # edge-tts (Host + Guest conversational voices)
│           ├── stt_service.py        # Groq Whisper speech-to-text
│           └── image_service.py      # Google Gemini Vision OCR (camera upload)
├── frontend/
│   ├── src/
│   │   ├── App.jsx               # Main app (upload → processing → player)
│   │   ├── api.js                # API client (axios)
│   │   ├── components/
│   │   │   ├── UploadZone.jsx    # File upload + text paste + camera capture
│   │   │   ├── PodcastPlayer.jsx # Audio player + transcript view
│   │   │   └── QAPanel.jsx       # Voice/text Q&A chat interface
│   │   └── hooks/
│   │       └── useAudioRecorder.js  # MediaRecorder hook for mic input
│   ├── index.html
│   ├── package.json
│   ├── vite.config.js
│   ├── tailwind.config.js
│   └── postcss.config.js
├── .gitignore
└── README.md
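
The "extraction + chunking" step listed for document_service.py could look roughly like the sketch below. The actual chunking strategy is not described in this README, so the word-window approach, the `chunk_text` name, and the default sizes are all assumptions.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of at most max_words words.

    Consecutive windows share `overlap` words so that a sentence cut at a
    boundary still appears whole in one of the two chunks.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Chunks produced this way are what a keyword retriever would score against a user's question.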

AI Models & Architecture

flowchart LR
    subgraph GROQ["☁️ Groq (Free Tier)"]
        LLM["🧠 Llama 3.1 8B\n─────────────────\n• Podcast script generation\n• Q&A answering\n• Fast & reliable"]
        STT["🎀 Whisper\n─────────────────\n• Speech-to-text\n• Voice question transcription\n• Multi-language support"]
    end

    subgraph TTS["🔊 edge-tts (Free, No Key)"]
        HOST["Host: AriaNeural"]
        GUEST["Guest: GuyNeural"]
    end

    subgraph OCR["📷 Google AI Studio (Free)"]
        VISION["Gemini Vision\n─────────────────\n• Image OCR\n• Camera upload"]
    end

    subgraph PIPELINE["⚙️ How They Connect"]
        DOC["📄 Document"] --> LLM
        CAM["📷 Camera"] --> VISION --> LLM
        LLM -->|dialogue script| HOST
        LLM -->|dialogue script| GUEST
        HOST -->|podcast .mp3| PLAY["🎧 Player"]
        GUEST -->|podcast .mp3| PLAY
        PLAY -->|user speaks| STT
        STT -->|question text| LLM
        LLM -->|answer text| GUEST
        GUEST -->|answer .mp3| PLAY
    end

    style GROQ fill:#E8F8F5,stroke:#1ABC9C,stroke-width:2px
    style TTS fill:#FFF3E0,stroke:#FF9800,stroke-width:2px
    style OCR fill:#E3F2FD,stroke:#2196F3,stroke-width:2px
    style PIPELINE fill:#F4ECF7,stroke:#8E44AD,stroke-width:2px

| Model | Provider | Purpose | Cost |
|---|---|---|---|
| Llama 3.1 8B | Groq | Podcast script generation + Q&A | Free |
| Whisper | Groq | Speech-to-text (voice questions) | Free |
| edge-tts | Microsoft Edge TTS | TTS: Host (Aria) + Guest (Guy) | Free |
| Gemini Vision | Google AI Studio | Image OCR (camera upload) | Free |
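
The hand-off from the LLM's dialogue script to the two TTS voices can be sketched as follows. This is an illustration only: the `Speaker: text` script format, the `parse_script` name, and the `en-US-` locale prefix on the voice IDs are assumptions (the diagram names only AriaNeural and GuyNeural).

```python
VOICES = {
    "Host": "en-US-AriaNeural",   # assumed locale; the diagram says "AriaNeural"
    "Guest": "en-US-GuyNeural",   # assumed locale; the diagram says "GuyNeural"
}


def parse_script(script: str) -> list[tuple[str, str, str]]:
    """Turn 'Speaker: text' lines into (speaker, voice, text) segments.

    Lines without a known speaker label are appended to the previous segment;
    such lines before the first labelled line are dropped.
    """
    segments = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue
        speaker, sep, text = line.partition(":")
        if sep and speaker.strip() in VOICES:
            name = speaker.strip()
            segments.append((name, VOICES[name], text.strip()))
        elif segments:
            name, voice, previous = segments[-1]
            segments[-1] = (name, voice, f"{previous} {line}")
    return segments
```

Each segment could then be synthesized with its voice and the resulting clips concatenated (for example with ffmpeg) into a single podcast MP3.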

API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/documents/upload` | Upload file (PDF/DOCX/TXT), starts podcast generation |
| POST | `/api/documents/text` | Paste text, starts podcast generation |
| POST | `/api/documents/image` | Upload image (camera), OCR + podcast generation |
| GET | `/api/documents/{doc_id}` | Get document + audio status |
| GET | `/api/documents/list` | List all documents |
| GET | `/api/audio/{audio_id}` | Stream podcast audio |
| POST | `/api/qa/ask` | Ask question (text or voice) |
| GET | `/api/qa/audio/{qa_id}` | Get Q&A answer audio |
| GET | `/api/qa/history/{doc_id}` | Q&A history for a document |
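
Because generation takes a few minutes, a client typically uploads and then polls the status endpoint. The sketch below shows that polling loop under stated assumptions: the response schema is not documented here, so the `status` field and its `ready` / `failed` values are hypothetical, and `fetch_json` is a caller-supplied function that performs the GET request and returns the decoded JSON.

```python
import time

BASE_URL = "http://localhost:8000"  # backend address from the Quick Start


def document_url(doc_id: str) -> str:
    """URL of GET /api/documents/{doc_id}."""
    return f"{BASE_URL}/api/documents/{doc_id}"


def audio_url(audio_id: str) -> str:
    """URL of GET /api/audio/{audio_id}."""
    return f"{BASE_URL}/api/audio/{audio_id}"


def wait_until_ready(doc_id, fetch_json, interval=5.0, timeout=300.0,
                     sleep=time.sleep):
    """Poll the document status until it is ready or the timeout elapses."""
    waited = 0.0
    while True:
        info = fetch_json(document_url(doc_id))
        # "status", "ready" and "failed" are assumed names, not a documented schema.
        if info.get("status") == "ready":
            return info
        if info.get("status") == "failed":
            raise RuntimeError(f"generation failed for {doc_id}")
        if waited >= timeout:
            raise TimeoutError(f"{doc_id} not ready after {timeout} seconds")
        sleep(interval)
        waited += interval
```

Passing the HTTP call in as `fetch_json` keeps the loop independent of any particular HTTP library and easy to test with canned responses.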
