# 📓 The GenAI Revolution Cookbook

**Title:** How to Build a Local Audio Meeting Copilot with Open-Source LLMs

**Description:** Build a privacy-first meeting copilot that records audio, diarizes speakers, transcribes locally, then generates summaries and action items using open-source LLMs. Follow a runnable, end-to-end project that turns meeting audio into speaker-tagged transcripts, concise summaries, and task lists, no cloud, no data leakage.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try the code yourself.*



Meetings generate valuable information, but capturing it without sending audio to third-party services is a real challenge. Teams in regulated industries, privacy-conscious organizations, or environments with sensitive discussions need a solution that keeps everything local. This guide walks you through building a privacy-first meeting copilot that records audio, identifies speakers, transcribes speech, and generates summaries and action items using only open-source models running on your own hardware.

By the end, you'll have a working system that processes live microphone input or recorded files, produces speaker-tagged transcripts, and extracts structured meeting outputs. You'll learn how to stream and segment audio with voice activity detection, run automatic speech recognition with word-level timestamps, perform speaker diarization, align speakers to transcript segments, and prompt a local LLM to generate summaries and action items with a strict JSON schema. You'll also understand the trade-offs in model selection, hardware requirements, and how to validate and extend the pipeline for production use.

## What You'll Build and Why It Matters

You'll build a complete meeting copilot pipeline that runs entirely on your infrastructure. The system captures audio from a microphone or file, segments speech using voice activity detection, transcribes each segment with Whisper, labels speakers with Pyannote diarization, and sends the formatted transcript to a local LLM for summarization and task extraction. All artifacts, including audio, transcripts, and metadata, are saved as structured JSON for review and reproducibility.

Running locally means your raw audio never leaves your machine, which matters most for sensitive meetings. It also means no per-minute fees and no vendor lock-in. In practice, you can run this on a developer workstation, a small server, or an isolated machine in a regulated environment. For a step-by-step walkthrough on deploying your own language models securely, see our [practical guide to running a self-hosted LLM on your server](/article/how-to-run-a-self-hosted-llm-on-your-server-practical-guide-2025-2).

The end state is a Python pipeline that produces:

- WAV files for each speech segment with unique IDs.
- Speaker-tagged transcripts with timestamps.
- Structured JSON output containing summary, decisions, and action items.
- Full meeting artifacts saved for audit and review.

You'll be able to run this in two modes: live capture from a microphone or batch processing of recorded audio files. The pipeline is modular, so you can swap models, adjust parameters, and extend functionality without rewriting core logic.

### Minimum Hardware and Dependencies

You need Python 3.8 or later, a working microphone or audio file, and enough compute to run inference. For CPU-only setups, use Whisper small with int8 quantization and expect slower-than-realtime transcription. For GPU setups, an NVIDIA GPU with 8GB VRAM will handle Whisper medium and diarization comfortably with float16 precision. Diarization benefits significantly from GPU acceleration.

Install dependencies with:

In [None]:
pip install pyaudio webrtcvad faster-whisper pyannote.audio requests

PyAudio requires PortAudio system libraries. On Ubuntu, install with `apt-get install portaudio19-dev`. On macOS, use `brew install portaudio`. On Windows, PyAudio wheels are available via pip.

Pyannote diarization models are gated on Hugging Face and require a token for initial download. Accept the model terms at https://huggingface.co/pyannote/speaker-diarization and set your token with `export HF_TOKEN=your_token`. For air-gapped or offline environments, download model weights once and cache them locally by setting `HF_HOME` to a persistent directory.

For the LLM, install Ollama from https://ollama.ai/ and pull a model:

In [None]:
!ollama pull mistral

Ollama runs a local API server at `http://localhost:11434`. You can swap models by changing the model name in the pipeline.

### Core Privacy and Compliance Win

All processing happens on your hardware. Audio is never uploaded, transcripts are never sent to external APIs, and you control retention policies. Keeping data on your own systems makes it easier to meet GDPR, HIPAA, and similar requirements, although compliance still depends on how you secure, store, and retain the artifacts. You can encrypt stored artifacts, configure data retention, and run the pipeline on isolated networks without internet access after initial model downloads.
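Retention is the easiest of these controls to automate. As a minimal sketch, assuming all artifacts live under one directory and that file modification time is an acceptable proxy for age, a cleanup helper might look like this. The name `purge_old_artifacts` and the 30-day default are illustrative assumptions, not part of the pipeline built in this guide.

```python
import os
import time

def purge_old_artifacts(root_dir, max_age_days=30, now=None):
    """Delete files under root_dir whose mtime is older than max_age_days.

    Hypothetical helper for illustration. Returns the list of deleted paths.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    deleted = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                deleted.append(path)
    return deleted
```

You could run a helper like this from a scheduled job. Note that plain deletion does not securely erase data from disk, so pair it with disk encryption if that matters for your environment.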

## System Architecture and Component Flow

The pipeline is organized into four stages with explicit inputs and outputs. Each stage is independent, making it easy to test, swap models, and optimize performance.

**Stage 1: Audio Capture and Segmentation**

Input: Live microphone stream or WAV file.
Output: PCM audio segments representing speech intervals.

The system reads audio frames and uses WebRTC VAD to detect speech. Non-speech frames are discarded. Speech frames are buffered with padding and yielded as segments when silence is detected or a maximum length is reached. This reduces downstream compute by processing only spoken audio.

**Stage 2: Automatic Speech Recognition**

Input: WAV file for each segment.
Output: Transcript with word-level timestamps and detected language.

Faster-whisper runs Whisper models efficiently with CTranslate2. Each segment is transcribed independently, producing text and timing information. Word-level timestamps enable precise alignment with diarization.

**Stage 3: Speaker Diarization and Alignment**

Input: Same WAV file used for ASR.
Output: Speaker labels assigned to transcript segments.

Pyannote diarization identifies speaker turns with start and end times. The alignment function matches ASR segments to speaker turns based on maximum time overlap. Each transcript segment is tagged with a speaker ID.

**Stage 4: LLM Post-Processing**

Input: Full speaker-tagged transcript.
Output: Structured JSON with summary, decisions, and action items.

The transcript is formatted and sent to a local LLM via Ollama. A JSON schema included in the prompt specifies the expected output structure. The LLM extracts key information and returns it as JSON, which the pipeline then parses.

Keeping these stage boundaries clean is what makes it easy to swap models and tune performance: audio capture produces PCM frames, VAD produces speech segments, ASR plus diarization produces a speaker-labeled transcript, and LLM post-processing produces the summary and tasks. If you're interested in customizing your models efficiently, our [hands-on guide to parameter-efficient fine-tuning with LoRA](/article/parameter-efficient-fine-tuning-peft-with-lora-2025-hands-on-guide-2) covers how to adapt large language models on a single GPU.
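One lightweight way to keep stage boundaries honest is to check the shape of the records that pass between stages. The sketch below assumes the dictionary keys used by the functions later in this guide (`start`, `end`, `text`, `words`, `speaker`). The `STAGE_KEYS` table and `missing_keys` helper are hypothetical names introduced only for illustration.

```python
# Keys each stage is expected to emit (mirrors the dicts built later in this guide).
STAGE_KEYS = {
    "asr_segment": {"start", "end", "text", "words"},
    "speaker_turn": {"start", "end", "speaker"},
    "labeled_segment": {"start", "end", "text", "words", "speaker"},
}

def missing_keys(kind, record):
    """Return the sorted list of expected keys that are absent from record."""
    return sorted(STAGE_KEYS[kind] - set(record))

# Example: a diarization turn that lost its speaker label.
print(missing_keys("speaker_turn", {"start": 0.0, "end": 1.2}))
```

A check like this is cheap enough to run on every record during development and makes a swapped-in model that returns a different shape fail early.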

## Step-by-Step Implementation

### Step 1: Stream and Segment Audio with VAD

This code captures audio from the microphone and segments it into speech intervals using WebRTC VAD. The generator yields raw PCM bytes for each detected speech segment.

In [None]:
import collections
import time
import wave
import pyaudio
import webrtcvad

RATE = 16000
CHANNELS = 1
FORMAT = pyaudio.paInt16
FRAME_MS = 30
FRAME_SAMPLES = int(RATE * FRAME_MS / 1000)
FRAME_BYTES = FRAME_SAMPLES * 2

vad = webrtcvad.Vad(2)

def frames_from_mic():
    """
    Generator that yields raw audio frames from the default microphone.

    Yields:
        bytes: Raw PCM audio frame of size FRAME_BYTES.
    """
    pa = pyaudio.PyAudio()
    stream = pa.open(format=FORMAT, channels=CHANNELS, rate=RATE, input=True,
                     frames_per_buffer=FRAME_SAMPLES)
    try:
        while True:
            data = stream.read(FRAME_SAMPLES, exception_on_overflow=False)
            yield data
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()

def segment_speech(frame_iter, padding_ms=300, max_segment_s=15):
    """
    Segments incoming audio frames into speech segments using VAD.

    Args:
        frame_iter (iterable): Iterator yielding raw PCM audio frames.
        padding_ms (int): Amount of padding (in ms) to include before/after speech.
        max_segment_s (int): Maximum segment length in seconds to avoid long latency.

    Yields:
        bytes: Concatenated PCM bytes representing a speech segment.
    """
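    num_padding = int(padding_ms / FRAME_MS)
    max_frames = int(max_segment_s * 1000 / FRAME_MS)
    ring = collections.deque(maxlen=num_padding)
    triggered = False
    voiced = []

    for frame in frame_iter:
        is_speech = vad.is_speech(frame, RATE)

        if not triggered:
            ring.append((frame, is_speech))
            # Start a segment once most of the padding window is speech.
            if sum(1 for _, s in ring if s) > 0.8 * ring.maxlen:
                triggered = True
                voiced.extend(f for f, _ in ring)
                ring.clear()
        else:
            voiced.append(frame)
            ring.append((frame, is_speech))
            # End the segment once most of the padding window is silence,
            # or when it reaches max_segment_s of audio (counted in frames,
            # so the limit also holds for file input read faster than real time).
            num_unvoiced = sum(1 for _, s in ring if not s)
            if len(voiced) >= max_frames or num_unvoiced > 0.8 * ring.maxlen:
                yield b"".join(voiced)
                triggered = False
                ring.clear()
                voiced = []

    # Flush a segment that is still open when the input ends (e.g. file input).
    if voiced:
        yield b"".join(voiced)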

VAD aggressiveness ranges from 0 to 3. Mode 2 balances sensitivity and false positives. Adjust `padding_ms` to include context around speech and `max_segment_s` to control latency. Shorter segments reduce wait time but increase processing overhead.
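For batch mode, the same segmentation logic needs frames read from a recorded file instead of the microphone. Here is a minimal sketch using only the standard library. It assumes the input is already 16 kHz, 16-bit mono PCM (convert beforehand otherwise), and `frames_from_wav` is a hypothetical name chosen to mirror `frames_from_mic`.

```python
import wave

def frames_from_wav(path, frame_ms=30, expected_rate=16000):
    """Yield fixed-size PCM frames from a 16-bit mono WAV file."""
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        if wf.getframerate() != expected_rate:
            raise ValueError(
                f"expected {expected_rate} Hz, got {wf.getframerate()} Hz"
            )
        samples_per_frame = int(expected_rate * frame_ms / 1000)
        frame_bytes = samples_per_frame * 2  # 2 bytes per 16-bit sample
        while True:
            data = wf.readframes(samples_per_frame)
            if len(data) < frame_bytes:
                # Drop the trailing partial frame: WebRTC VAD only accepts
                # frames of exactly 10, 20, or 30 ms.
                break
            yield data
```

You would then pass `frames_from_wav("meeting.wav")` to `segment_speech` in place of `frames_from_mic()`. Because a file is read much faster than real time, make sure any segment-length limit is measured in audio time (frame count) rather than wall-clock time.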

### Step 2: Save Segments as WAV Files

Each speech segment is saved as a uniquely named WAV file for downstream processing. This allows ASR and diarization to run on stable file inputs.

In [None]:
import os
import uuid
import wave

def write_wav(pcm_bytes, path, rate=16000):
    """
    Write raw PCM bytes to a WAV file.

    Args:
        pcm_bytes (bytes): Raw PCM audio data.
        path (str): Output file path.
        rate (int): Sample rate in Hz.

    Returns:
        None
    """
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(pcm_bytes)

def save_segment(pcm_bytes, out_dir="segments"):
    """
    Save a PCM audio segment as a uniquely named WAV file.

    Args:
        pcm_bytes (bytes): Raw PCM audio data.
        out_dir (str): Directory to save WAV files.

    Returns:
        tuple: (segment_id, file_path)
    """
    os.makedirs(out_dir, exist_ok=True)
    seg_id = str(uuid.uuid4())
    path = os.path.join(out_dir, f"{seg_id}.wav")
    write_wav(pcm_bytes, path)
    return seg_id, path

Unique IDs ensure segments don't overwrite each other and make it easy to trace outputs back to source audio.

### Step 3: Transcribe with Faster-Whisper

This function transcribes a WAV file using faster-whisper and returns detailed segment and word-level timing information.

In [None]:
from faster_whisper import WhisperModel

asr_model = WhisperModel("small", device="cuda", compute_type="float16")

def transcribe_wav(path):
    """
    Transcribe a WAV file using faster-whisper, returning segment and word-level timestamps.

    Args:
        path (str): Path to the WAV file.

    Returns:
        dict: {
            "language": Detected language code,
            "segments": List of dicts with start, end, text, and word-level timing.
        }
    """
    segments, info = asr_model.transcribe(path, beam_size=3, word_timestamps=True)
    out = []
    for s in segments:
        out.append({
            "start": s.start,
            "end": s.end,
            "text": s.text.strip(),
            "words": [{"start": w.start, "end": w.end, "word": w.word} for w in (s.words or [])]
        })
    return {"language": info.language, "segments": out}

Use `device="cpu"` and `compute_type="int8"` for CPU inference. Beam size controls search quality. Larger beams improve accuracy but increase latency. Word timestamps are essential for aligning speakers to specific words.

### Step 4: Diarize and Assign Speakers

This code runs speaker diarization on a WAV file and assigns speakers to ASR segments based on time overlap.

In [None]:
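import os
from pyannote.audio import Pipeline

# The model is gated, so pass the Hugging Face token exported as HF_TOKEN.
# pyannote.audio 2.x/3.x name this argument `use_auth_token`; newer releases
# may call it `token`, so check the version you have installed.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token=os.environ.get("HF_TOKEN"),
)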

def diarize_wav(path):
    """
    Run speaker diarization on a WAV file.

    Args:
        path (str): Path to the WAV file.

    Returns:
        list: List of dicts with 'start', 'end', and 'speaker' keys.
    """
    diar = pipeline(path)
    turns = []
    for turn, _, speaker in diar.itertracks(yield_label=True):
        turns.append({"start": turn.start, "end": turn.end, "speaker": speaker})
    return turns

def assign_speaker(asr_segments, speaker_turns):
    """
    Assign speakers to ASR segments based on maximum time overlap with diarization turns.

    Args:
        asr_segments (list): List of ASR segment dicts with 'start' and 'end'.
        speaker_turns (list): List of diarization dicts with 'start', 'end', 'speaker'.

    Returns:
        list: ASR segments with added 'speaker' key.
    """
    def overlap(a0, a1, b0, b1):
        return max(0.0, min(a1, b1) - max(a0, b0))

    labeled = []
    for seg in asr_segments:
        best_spk, best_ov = None, 0.0
        for t in speaker_turns:
            ov = overlap(seg["start"], seg["end"], t["start"], t["end"])
            if ov > best_ov:
                best_ov, best_spk = ov, t["speaker"]
        labeled.append({**seg, "speaker": best_spk or "UNKNOWN"})
    return labeled

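Diarization produces speaker turns with timestamps. The assignment function matches each ASR segment to the speaker with the most overlapping time. If no overlap is found, the segment is labeled "UNKNOWN". This approach works well for clear turn-taking but may struggle with crosstalk or very short utterances. Also note that each WAV file is diarized independently, so a label such as `SPEAKER_00` is not guaranteed to refer to the same person from one segment to the next.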
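When one ASR segment spans a speaker change, segment-level overlap assigns the whole segment to a single speaker. A finer-grained alternative is to label each word and then merge consecutive words from the same speaker. The following is a sketch of that idea, not the method used by the pipeline in this guide. It assumes the word dicts produced by `transcribe_wav` and the turn dicts produced by `diarize_wav`, and both function names are hypothetical.

```python
def assign_words_to_speakers(words, speaker_turns, default="UNKNOWN"):
    """Label each word with the speaker whose turn contains the word's midpoint."""
    labeled = []
    for w in words:
        mid = (w["start"] + w["end"]) / 2.0
        speaker = default
        for t in speaker_turns:
            if t["start"] <= mid < t["end"]:
                speaker = t["speaker"]
                break
        labeled.append({**w, "speaker": speaker})
    return labeled

def group_by_speaker(labeled_words):
    """Merge consecutive words from the same speaker into one utterance."""
    utterances = []
    for w in labeled_words:
        token = w["word"].strip()  # Whisper words often carry a leading space
        if utterances and utterances[-1]["speaker"] == w["speaker"]:
            utterances[-1]["end"] = w["end"]
            utterances[-1]["text"] += " " + token
        else:
            utterances.append({
                "start": w["start"],
                "end": w["end"],
                "speaker": w["speaker"],
                "text": token,
            })
    return utterances
```

The resulting utterances have the same `start`, `end`, `speaker`, and `text` keys as the labeled segments, so they can be formatted for the LLM in the same way. With overlapping turns, the first matching turn wins, which is a simplification.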

### Step 5: Summarize with a Local LLM

This function sends the formatted transcript to a local LLM via Ollama and extracts structured output using a strict JSON schema.

In [None]:
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

SYSTEM_PROMPT = """You are a meeting copilot. Produce concise, accurate outputs from the transcript.
Return ONLY valid JSON matching the provided schema. Do not include markdown or extra text."""

SCHEMA = {
  "type": "object",
  "properties": {
    "summary": {"type": "string"},
    "decisions": {"type": "array", "items": {"type": "string"}},
    "action_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "task": {"type": "string"},
          "owner": {"type": "string"},
          "due": {"type": "string"},
          "source_quote": {"type": "string"}
        },
        "required": ["task"]
      }
    }
  },
  "required": ["summary", "action_items"]
}

def ollama_generate(model, prompt):
    """
    Call the Ollama local LLM API to generate a response for a given prompt.

    Args:
        model (str): Model name (e.g., "mistral").
        prompt (str): Prompt string.

    Returns:
        str: Model's raw response (expected to be JSON).
    Raises:
        requests.HTTPError: If the API call fails.
    """
    # "format": "json" asks Ollama to constrain the response to valid JSON.
    payload = {"model": model, "prompt": prompt, "stream": False, "format": "json"}
    r = requests.post(OLLAMA_URL, json=payload, timeout=600)
    r.raise_for_status()
    return r.json()["response"]

def build_prompt(transcript_text):
    """
    Build a prompt for the LLM including the system instruction, JSON schema, and transcript.

    Args:
        transcript_text (str): The formatted transcript.

    Returns:
        str: The full prompt string.
    """
    return f"""{SYSTEM_PROMPT}

JSON schema:
{json.dumps(SCHEMA)}

Transcript:
{transcript_text}
"""

def summarize_transcript(transcript_text, model="mistral"):
    """
    Summarize a transcript and extract action items using the specified LLM model.

    Args:
        transcript_text (str): The formatted transcript.
        model (str): LLM model name.

    Returns:
        dict: Parsed JSON output from the LLM.
    Raises:
        json.JSONDecodeError: If the LLM output is not valid JSON.
    """
    raw = ollama_generate(model, build_prompt(transcript_text))
    return json.loads(raw)

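The schema is embedded in the prompt, so it guides the model's output rather than guaranteeing it. The LLM is asked to return valid JSON with a summary, decisions, and action items, but `json.loads` raises `json.JSONDecodeError` if the model wraps the object in prose or markdown. To handle that, wrap the call in a try-except block and either retry with a repair prompt or extract the first JSON object from the response.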
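As a sketch of the extraction fallback, the standard library's `json.JSONDecoder.raw_decode` can pull the first decodable object out of a noisy response without a regex. The shape check that follows covers only the `required` fields from `SCHEMA`. Both helper names are hypothetical, and a full JSON Schema validator would be stricter.

```python
import json

def extract_first_json_object(text):
    """Return the first decodable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            obj, _end = decoder.raw_decode(text, pos)
            return obj
        except json.JSONDecodeError:
            # Not a valid object at this brace; try the next one.
            pos = text.find("{", pos + 1)
    return None

def check_summary(obj):
    """Return a list of problems; an empty list means the minimal shape is OK."""
    if not isinstance(obj, dict):
        return ["not a JSON object"]
    problems = []
    if not isinstance(obj.get("summary"), str):
        problems.append("summary must be a string")
    items = obj.get("action_items")
    if not isinstance(items, list):
        problems.append("action_items must be a list")
    else:
        for i, item in enumerate(items):
            if not isinstance(item, dict) or not isinstance(item.get("task"), str):
                problems.append(f"action_items[{i}] needs a string 'task'")
    return problems
```

A non-empty problem list is a natural trigger for a repair prompt that quotes the problems back to the model.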

For meeting summaries, you want strong instruction following and a context window large enough for the transcript. Mistral models are a common default (the official site is https://mistral.ai/). If you need a smaller model, choose one in the 7B to 8B range and keep transcript chunks small. To get the most out of your LLM prompts and context window, check out our [guide to in-context learning and prompt engineering](/article/the-magic-of-in-context-learning-teach-your-llm-on-the-fly-3).
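Keeping chunks small can be as simple as splitting the formatted transcript on line boundaries under a size budget. The sketch below uses character count as a rough stand-in for tokens, which is an assumption, as is the 6000-character default. `chunk_transcript` is a hypothetical helper.

```python
def chunk_transcript(transcript_text, max_chars=6000):
    """Split a transcript into chunks of whole lines, each at most max_chars long.

    A single line longer than max_chars becomes its own (oversized) chunk.
    """
    chunks, current, size = [], [], 0
    for line in transcript_text.splitlines():
        line_len = len(line) + 1  # +1 for the newline restored by join
        if current and size + line_len > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += line_len
    if current:
        chunks.append("\n".join(current))
    return chunks
```

One option is to summarize each chunk separately and then ask the model to merge the partial summaries. Splitting on whole lines keeps every speaker tag attached to its utterance.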

### Step 6: Save Meeting Artifacts

This function saves all meeting data, including audio paths, transcripts, and summaries, as a timestamped JSON file.

In [None]:
import json
import os
from datetime import datetime, timezone

def save_meeting_artifacts(out_dir, meeting):
    """
    Save meeting artifacts to a timestamped JSON file.

    Args:
        out_dir (str): Output directory.
        meeting (dict): Meeting data including config, segments, transcript, summary, etc.

    Returns:
        str: Path to the saved JSON file.
    """
    os.makedirs(out_dir, exist_ok=True)
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp.
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = os.path.join(out_dir, f"meeting_{ts}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(meeting, f, ensure_ascii=False, indent=2)
    return path

Storing artifacts as JSON makes it easy to review, audit, and reprocess meetings. You can load the JSON, inspect transcripts, and regenerate summaries with different prompts or models.
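As a small example of reprocessing, the sketch below reloads a saved meeting and totals talk time per speaker. It assumes the JSON layout written by `run_pipeline_live` in Step 7, where each entry in `segments` carries a `labeled` list. The helper names are hypothetical.

```python
import json

def load_meeting(path):
    """Load a saved meeting JSON file into a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def speaker_talk_time(meeting):
    """Return total seconds of labeled speech per speaker across all segments."""
    totals = {}
    for seg in meeting.get("segments", []):
        for s in seg.get("labeled", []):
            duration = s["end"] - s["start"]
            totals[s["speaker"]] = totals.get(s["speaker"], 0.0) + duration
    return totals
```

Because speaker labels come from per-segment diarization, treat these totals as approximate unless you have reconciled labels across segments.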

### Step 7: Run the Full Pipeline

This function ties everything together, running the live meeting pipeline from microphone capture to final output.

In [None]:
def format_transcript(labeled_segments, offset_s=0.0):
    """
    Format labeled transcript segments for LLM input.

    Args:
        labeled_segments (list): List of dicts with 'start', 'end', 'speaker', 'text'.
        offset_s (float): Time offset in seconds to adjust timestamps.

    Returns:
        str: Formatted transcript string.
    """
    lines = []
    for s in labeled_segments:
        a = s["start"] + offset_s
        b = s["end"] + offset_s
        lines.append(f"[{a:06.1f}-{b:06.1f}] {s['speaker']}: {s['text']}")
    return "\n".join(lines)

def run_pipeline_live(out_dir="out", llm_model="mistral"):
    """
    Run the full live meeting pipeline: capture, segment, transcribe, diarize, summarize, and save artifacts.

    Args:
        out_dir (str): Output directory for artifacts.
        llm_model (str): LLM model name for summarization.

    Returns:
        str: Path to the saved meeting JSON file.
    """
    meeting = {
        "config": {
            "rate": RATE,
            "vad_mode": 2,
            "asr_model": "small",
            "llm_model": llm_model
        },
        "segments": [],
        "transcript": "",
        "summary": None
    }

    timeline_s = 0.0
    for pcm in segment_speech(frames_from_mic()):
        seg_id, wav_path = save_segment(pcm)
        asr = transcribe_wav(wav_path)
        turns = diarize_wav(wav_path)
        labeled = assign_speaker(asr["segments"], turns)

        text = format_transcript(labeled, offset_s=timeline_s)
        meeting["segments"].append({
            "id": seg_id,
            "wav_path": wav_path,
            "asr": asr,
            "diarization": turns,
            "labeled": labeled
        })
        meeting["transcript"] += text + "\n"
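        # Advance the transcript clock by this segment's audio duration
        # (16-bit mono PCM is 2 bytes per sample). Timestamps therefore count
        # speech time only; silence between segments is not included.
        timeline_s += len(pcm) / (RATE * 2)

        # Stop after 10 segments so the demo terminates; raise or remove
        # this cap for real meetings.
        if len(meeting["segments"]) >= 10:
            break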

    meeting["summary"] = summarize_transcript(meeting["transcript"], model=llm_model)
    path = save_meeting_artifacts(out_dir, meeting)
    return path

This pipeline processes segments sequentially. For production use, consider running ASR and diarization in parallel threads or async tasks to reduce latency. You can also queue segments and process them in batches.
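The thread-based variant could be sketched as follows. The helper takes the two stage functions as arguments so it stays independent of any particular model; `process_segment_parallel` is a hypothetical name. How much the two calls really overlap depends on whether the underlying libraries release the GIL and on GPU contention, so measure before relying on it.

```python
from concurrent.futures import ThreadPoolExecutor

def process_segment_parallel(wav_path, transcribe_fn, diarize_fn, executor):
    """Run ASR and diarization on the same file concurrently.

    Returns (asr_result, diarization_result). Exceptions raised inside either
    function are re-raised here when .result() is called.
    """
    asr_future = executor.submit(transcribe_fn, wav_path)
    diar_future = executor.submit(diarize_fn, wav_path)
    return asr_future.result(), diar_future.result()

# Usage with stand-in functions; in the pipeline you would pass
# transcribe_wav and diarize_wav instead.
with ThreadPoolExecutor(max_workers=2) as pool:
    asr, turns = process_segment_parallel(
        "segment.wav",
        lambda p: {"language": "en", "segments": []},
        lambda p: [],
        pool,
    )
```

Creating the executor once and reusing it across segments avoids paying thread start-up cost for every segment.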

## Model Selection and Trade-Offs

Whisper comes in multiple sizes. Tiny and base are fast but less accurate. Small is a good balance for most use cases. Medium and large improve accuracy but require more compute. Use faster-whisper for efficient inference. It supports quantization and runs significantly faster than the original Whisper implementation.

For diarization, Pyannote is a mature, widely used open-source option. It runs on CPU, but expect much slower processing without a GPU. Alternative approaches include clustering speaker embeddings or using simpler heuristics based on pause length, but Pyannote generally delivers better accuracy out of the box.

For the LLM, Mistral 7B is a strong default. It handles instruction following well. In float16 a 7B model needs roughly 14GB for its weights alone, so on an 8GB GPU use a 4-bit quantized build, which is what Ollama's default model tags typically provide. Llama 3 8B is another solid choice. For smaller models, Phi-3 or Gemma 2B can work but may struggle with complex extraction tasks. Test your prompts and schema with your chosen model to ensure reliable JSON output.

## Run, Validate, and Next Steps

To run the pipeline, execute:

In [None]:
if __name__ == "__main__":
    meeting_path = run_pipeline_live(out_dir="meetings", llm_model="mistral")
    print(f"Meeting saved to {meeting_path}")

Open the saved JSON file and verify:

- Transcript segments have correct timestamps and speaker labels.
- Summary captures key points from the meeting.
- Action items are extracted with tasks and owners when mentioned.

Measure latency by timing each stage. VAD and segmentation should be near real-time. ASR and diarization will lag depending on hardware. LLM summarization runs after the meeting ends, so latency is less critical.
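A small context manager is enough to collect per-stage timings. This is a sketch using only the standard library; `timed`, `stage_times`, and `real_time_factor` are hypothetical names.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_times = defaultdict(list)  # stage name -> list of durations in seconds

@contextmanager
def timed(stage, sink=stage_times):
    """Record how long the enclosed block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[stage].append(time.perf_counter() - start)

def real_time_factor(processing_s, audio_s):
    """Processing time divided by audio duration; below 1.0 keeps up with live audio."""
    return processing_s / audio_s

# Usage: wrap each stage call, e.g. `with timed("asr"): asr = transcribe_wav(path)`.
with timed("demo"):
    sum(range(1000))
```

Comparing each stage's real-time factor across hardware or model sizes shows quickly where the pipeline falls behind live audio.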

To validate accuracy, record a test meeting with known speakers and content. Compare the transcript to a manual transcription. Check speaker labels for consistency. Review the summary and action items for completeness and correctness.
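The transcript comparison can be quantified with word error rate (WER): the word-level edit distance between the manual reference and the ASR output, divided by the number of reference words. The sketch below lowercases and splits on whitespace only, which is an assumption. Stripping punctuation first would usually give a fairer score.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

# One substitution plus one deletion over four reference words.
print(word_error_rate("the quick brown fox", "the quik brown"))  # → 0.5
```

WER can exceed 1.0 when the hypothesis contains many inserted words, so treat it as an error ratio rather than a percentage of words wrong.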

For production use, add error handling, logging, and monitoring. Implement retry logic for LLM calls. Store raw LLM responses alongside parsed JSON to debug failures. Add a CLI or web UI to review transcripts, edit speaker labels, and regenerate summaries.
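Retry logic can be kept generic by wrapping any zero-argument callable. The following is a minimal sketch with exponential backoff; `call_with_retry` and its defaults are hypothetical choices, and the `sleep` parameter exists only so the delay can be stubbed out in tests.

```python
import time

def call_with_retry(fn, attempts=3, base_delay_s=1.0,
                    retry_on=(Exception,), sleep=time.sleep):
    """Call fn() up to `attempts` times, doubling the delay after each failure.

    Re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on as err:
            last_error = err
            if attempt < attempts - 1:
                sleep(base_delay_s * (2 ** attempt))
    raise last_error
```

For the summarization step you might call something like `call_with_retry(lambda: summarize_transcript(text), retry_on=(json.JSONDecodeError, requests.RequestException))`, so that both malformed JSON and transient HTTP failures trigger another attempt.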

Consider deploying on edge devices for distributed capture. A Raspberry Pi can handle audio capture and VAD, then send segments to a central server for ASR and diarization. This reduces hardware requirements per meeting room while keeping processing local to your network.

Extend the pipeline by adding real-time streaming, multi-language support, or custom entity extraction. You can also fine-tune Whisper on domain-specific audio or adapt the LLM for your meeting format and terminology.