
Aura

Audio processing pipeline for wearable memory augmentation. Records conversations, identifies speakers, transcribes speech, and extracts structured knowledge into a persistent graph.

Architecture

Audio File → VAD → Transcription → Diarization → Speaker ID → Alignment → Knowledge Extraction
                                                                                    ↓
                                                              PostgreSQL + pgvector (voiceprints, knowledge graph)

Pipeline stages:

Stage       Model                           What it does
VAD         Silero VAD                      Strips silence and non-speech
Transcribe  Whisper large-v3 (CTranslate2)  Word-level timestamped transcription
Diarize     pyannote 3.1                    Who spoke when
Embed       ECAPA-TDNN (192-dim)            Voiceprint per speaker
Align       Custom                          Merges transcript + diarization
Extract     Claude / GPT-4o                 People, facts, commitments, events
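The Align stage is custom. As a rough illustration, one common approach assigns each transcribed word to the diarization turn it overlaps most in time; the function name and data shapes below are assumptions for the sketch, not the project's actual API:

```python
# Hypothetical alignment sketch: give each word the speaker label of the
# diarization turn with the greatest temporal overlap. Words that overlap
# no turn keep speaker=None.

def align(words, turns):
    """words: [{"word", "start", "end"}]  turns: [{"speaker", "start", "end"}]"""
    aligned = []
    for w in words:
        best, best_overlap = None, 0.0
        for t in turns:
            overlap = min(w["end"], t["end"]) - max(w["start"], t["start"])
            if overlap > best_overlap:
                best, best_overlap = t["speaker"], overlap
        aligned.append({**w, "speaker": best})
    return aligned

words = [{"word": "hello", "start": 0.0, "end": 0.4},
         {"word": "there", "start": 0.5, "end": 0.9}]
turns = [{"speaker": "SPEAKER_00", "start": 0.0, "end": 1.0}]
result = align(words, turns)
print(result)
```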

Quick Start

# 1. Clone and configure
cp .env.example .env
# Edit .env with your HuggingFace token and LLM API key

# 2. Build and run
docker compose build
docker compose run --rm aura process /app/data/uploads/your_audio.wav

# 3. Start the watcher (processes new files automatically)
docker compose up -d watcher

CLI Commands

# Full pipeline
aura process <audio_file> [-o output_dir] [--no-extract] [--owner SPEAKER_00]

# Batch process all unprocessed files
aura batch [-d upload_dir] [-o output_dir]

# Watch for new files (daemon mode)
aura watch [--interval 60]

# Individual stages
aura vad <audio_file>
aura transcribe <audio_file>
aura diarize <audio_file>
aura speakers <audio_file>

# Database
aura db init
aura db status

# System info
aura status

Requirements

Performance

On RTX 3090 (30s test audio):

  • VAD: ~1.4s
  • Transcription: ~10s
  • Diarization: ~6s
  • Embeddings: ~1s
  • Extraction: ~6s (API latency)
  • Total: ~25s for 30s of audio (~1.2x realtime)

Estimated processing time for an 8-hour recording day: ~2-2.5 hours.

Data Flow

data/
├── uploads/          # Drop audio files here
│   └── 2024-03-04_recording.wav
└── processed/        # Results appear here
    └── 2024-03-04_recording/
        ├── *_transcript.json    # Timestamped, speaker-labeled
        ├── *_transcript.txt     # Human-readable
        ├── *_speakers.json      # Speaker metadata
        ├── *_knowledge.json     # Extracted knowledge graph
        └── *_full.json          # Complete processing metadata
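The speaker-labeled transcript can be consumed programmatically. The snippet below sketches a reader for *_transcript.json; the schema (a "segments" list with start/speaker/text fields) is an assumption inferred from the "timestamped, speaker-labeled" description, not the project's documented format:

```python
import json
import os
import tempfile

def format_transcript(path):
    # Assumed schema: {"segments": [{"start": float, "speaker": str, "text": str}]}
    with open(path) as f:
        segments = json.load(f)["segments"]
    return "\n".join(f'[{s["start"]:6.1f}s] {s["speaker"]}: {s["text"]}'
                     for s in segments)

# Demo against a synthetic file standing in for a real *_transcript.json.
sample = {"segments": [
    {"start": 0.0, "speaker": "SPEAKER_00", "text": "Let's sync on Friday."},
    {"start": 3.2, "speaker": "SPEAKER_01", "text": "Works for me."},
]}
tmp = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False)
json.dump(sample, tmp)
tmp.close()
text = format_transcript(tmp.name)
os.unlink(tmp.name)
print(text)
```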

Speaker Identification

Speakers are identified via 192-dimensional ECAPA-TDNN voiceprints. The system maintains a persistent speaker registry:

  • Cosine similarity > 0.75: Confident match (auto-linked)
  • 0.55 - 0.75: Candidate match (flagged for review)
  • < 0.55: New speaker (new profile created)

Embeddings improve over time via running-weighted-average updates.
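A minimal sketch of this registry logic, using the thresholds above; 4-dim vectors stand in for the 192-dim ECAPA-TDNN embeddings, and the function names are illustrative, not the project's actual API:

```python
import math

CONFIDENT, CANDIDATE = 0.75, 0.55

def cosine(a, b):
    # cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def classify(similarity):
    if similarity > CONFIDENT:
        return "auto-link"          # confident match
    if similarity >= CANDIDATE:
        return "flag-for-review"    # candidate match
    return "new-speaker"            # create a new profile

def update_embedding(stored, new, n_prior):
    # running weighted average: stored vector carries the weight of the
    # n_prior samples it already summarizes
    return [(s * n_prior + v) / (n_prior + 1) for s, v in zip(stored, new)]

sim = cosine([1, 0, 0, 0], [0.9, 0.1, 0, 0])
print(classify(sim))  # auto-link
```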

License

Private.
