ObsidianYouTubeSync: YouTube History Downloader, Transcript Extractor & AI Knowledge Graph Builder
Download, sync, and organize your entire YouTube watch history into a structured, AI-tagged knowledge graph, ready for Obsidian, OpenClaw, and GraphRAG pipelines.
This tool automates the full pipeline: it downloads YouTube video metadata, extracts captions and transcripts, generates LLM-powered summaries using Google Gemini (GenAI), and saves each video as a richly categorized Markdown note with a consistent hierarchical taxonomy. Built for users with large YouTube watch libraries who want to transform passive viewing history into an active, searchable, and connectable dataset.
Use cases: Personal knowledge management (PKM), YouTube history cataloging, knowledge graph construction, transcript dataset preparation, GraphRAG ingestion, OpenClaw/LLM agent tooling, personal AI training data, second brain building.
If you have hundreds or thousands of videos in your YouTube history, finding insights buried in that content is nearly impossible without structure. This tool solves that by:
- Downloading your full YouTube watch history (incremental or bulk) using `yt-dlp`
- Extracting full captions and transcripts for every video (with fallback to auto-generated captions)
- Summarizing each video with Google Gemini (GenAI) into concise, searchable summaries
- Categorizing content with a shared hierarchical taxonomy (e.g., `Technology/ArtificialIntelligence`, `Science/Neuroscience`) using AI classification
- Organizing notes into Obsidian with clean YAML frontmatter (tags, url, channel, date, summary), queryable via Dataview
- Enabling downstream pipelines: export structured notes as a knowledge graph dataset for GraphRAG, OpenClaw, vector databases, or LLM fine-tuning
graph TD
A[YouTube Watch History] -->|Browser Cookies & yt-dlp| B(Extract Metadata & Transcripts)
B --> C{Google Gemini GenAI}
C -->|Analyze text| D[Generate Concise Summary]
C -->|Classify concepts| E[Apply AI Taxonomy Tags]
D --> F[Construct Structured Markdown]
E --> F
F --> G[(Obsidian Vault / Knowledge Graph)]
G --> H[Ready for GraphRAG & Agents]
classDef default fill:#f9f9f9,stroke:#5c3d9e,stroke-width:2px,color:#333;
classDef ai fill:#e6dfff,stroke:#8b6cef,stroke-width:2px,color:#333;
class C ai;
| Feature | Details |
|---|---|
| LLM Summarization | Google Gemini (GenAI) generates concise, high-quality summaries for every video |
| Full Transcript Download | Pulls captions/transcripts (manual + auto-generated) from YouTube |
| Hierarchical AI Tagging | CamelCase taxonomy tags like Technology/AI/LLM, Person/ElonMusk |
| History Sync | Download & sync your entire YouTube watch history via browser cookies + yt-dlp |
| Incremental Updates | Smart detection: only processes new videos, skips already-synced ones |
| YouTube Shorts Skip | Automatically detects and skips Shorts to keep your knowledge base high-signal |
| Vault-wide Retagging | Retag any folder (Apple Notes, Books, etc.) using the same global AI taxonomy |
| Parallel Processing | ThreadPoolExecutor with a configurable worker pool; syncs hundreds of videos in minutes |
| MCP Server | Native Model Context Protocol server for AI agent integration |
| GraphRAG-Ready Output | Structured Markdown with rich metadata nodes for knowledge graph pipelines |
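The Incremental Updates and Parallel Processing rows can be pictured with a short sketch. This is not the project's actual code: the `<video_id>.md` file naming, the `process_video` placeholder, and the `max_workers` parameter name are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def already_synced_ids(vault_dir):
    """Collect IDs of notes already in the vault (assumes hypothetical <video_id>.md naming)."""
    return {p.stem for p in Path(vault_dir).glob("*.md")}


def process_video(video_id):
    """Placeholder for the real work: fetch transcript, summarize, write the note."""
    return f"processed {video_id}"


def sync_new_videos(history_ids, vault_dir, max_workers=8):
    """Skip videos that already have a note, then process the rest in a thread pool."""
    done = already_synced_ids(vault_dir)
    pending = [v for v in history_ids if v not in done]
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_video, v): v for v in pending}
        for future in as_completed(futures):
            video_id = futures[future]
            try:
                results[video_id] = future.result()
            except Exception as exc:  # one failed video should not abort the whole sync
                results[video_id] = f"failed: {exc}"
    return results
```

Threads suit this kind of workload because most of the time is spent waiting on network calls rather than on CPU work.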
Each YouTube video becomes a structured Markdown node with rich metadata. This format is directly ingestible by OpenClaw, GraphRAG, LlamaIndex, LangChain, or any knowledge graph pipeline:
---
url: https://www.youtube.com/watch?v=VIDEO_ID
channel: "Lex Fridman"
date_synced: 2026-02-27
summary: "A deep conversation on the nature of intelligence and the trajectory of AGI development, exploring consciousness, Gödel's theorems, and the limits of computation."
tags:
- Technology
- Technology/ArtificialIntelligence
- Technology/ArtificialIntelligence/LLM
- Science/CognitiveScience
- Person/LexFridman
- Person/JohnCarmack
- AGI
- ConsciousnessDebate
- ComputationalLimits
---
# The Future of AGI with John Carmack
## Extracted Links
- https://en.wikipedia.org/wiki/Gödel%27s_incompleteness_theorems
## Description
In this episode, Lex Fridman sits down with John Carmack to discuss...
## Raw Transcript
[Full verbatim transcript for semantic search, embedding, and RAG ingestion]
- Entities (`Person/`, `Company/`, `Location/`) form graph nodes
- Taxonomy tags form typed edges between concepts
- Summaries are ideal embedding targets for semantic similarity
- Transcripts provide raw text for chunking, dense retrieval, and fine-tuning
- Dates + channels enable temporal and source-based graph traversals
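To make the node and edge mapping concrete, here is a minimal sketch of how a consumer might read the frontmatter and expand hierarchical tags into parent-child edges. The function names are hypothetical, and the hand-rolled parser only handles the flat `key: value` and `tags:` list shape shown in the example note; a real pipeline would more likely use a YAML library.

```python
ENTITY_PREFIXES = ("Person/", "Company/", "Location/")


def parse_frontmatter(text):
    """Parse the simple frontmatter shape shown above into a dict (not a general YAML parser)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta, current_list = {}, None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            break
        if stripped.startswith("- ") and current_list is not None:
            meta[current_list].append(stripped[2:].strip())
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            value = value.strip().strip('"')
            if value:
                meta[key.strip()] = value
                current_list = None
            else:
                # A key with no inline value starts a list, e.g. "tags:"
                meta[key.strip()] = []
                current_list = key.strip()
    return meta


def tag_edges(tags):
    """Expand 'A/B/C' into (parent, child) edges: (A, A/B) and (A/B, A/B/C)."""
    edges = set()
    for tag in tags:
        parts = tag.split("/")
        for i in range(1, len(parts)):
            edges.add(("/".join(parts[:i]), "/".join(parts[: i + 1])))
    return edges


def entity_nodes(tags):
    """Return the tags that name entities (people, companies, locations)."""
    return [t for t in tags if t.startswith(ENTITY_PREFIXES)]
```

Flat tags such as `AGI` produce no edges here; they would attach to the note as plain labels.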
- Python 3.10+
- uv: fast Python package manager (Installation Guide)
- Google Gemini API key: free tier available at Google AI Studio
- Chrome, Safari, or Firefox logged into YouTube (for `yt-dlp` history access)
- Obsidian (optional but recommended for visualization and Dataview queries)
# Clone the repository
git clone https://github.com/mr8lu/ObsidianYouTubeSync.git
cd ObsidianYouTubeSync
# Set up Python environment using uv (Fast Python Package Manager)
uv sync
# Copy the example environment file
cp .env.example .env
Edit .env with your credentials:
GEMINI_API_KEY=your_gemini_api_key_here
# Optional: Webshare rotating proxy (recommended for large history syncs)
WEBSHARE_PROXY_USER=your_proxy_user
WEBSHARE_PROXY_PASS=your_proxy_password
Note: The default vault path is ~/Documents/Obsidian Vault. If your vault lives elsewhere, update OBSIDIAN_VAULT_PATH in sync.py and retag_notes.py before running any syncs.
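Before a long sync it can help to check the configuration up front. The sketch below is an assumption-laden illustration, not the project's loader: it assumes the `.env` values have already been loaded into the process environment (how the project does this is not stated here), and `load_settings` and its return shape are hypothetical.

```python
import os
from pathlib import Path

PLACEHOLDER_KEY = "your_gemini_api_key_here"  # value shipped in the example .env


def load_settings(env=None, vault="~/Documents/Obsidian Vault"):
    """Validate the Gemini key, collect optional proxy credentials, expand the vault path."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY", "").strip()
    if not api_key or api_key == PLACEHOLDER_KEY:
        raise RuntimeError("GEMINI_API_KEY is missing or still set to the placeholder value")
    user = env.get("WEBSHARE_PROXY_USER", "").strip()
    password = env.get("WEBSHARE_PROXY_PASS", "").strip()
    # The proxy is optional: use it only when both halves are present
    proxy = (user, password) if user and password else None
    return {
        "api_key": api_key,
        "proxy": proxy,
        "vault_path": Path(vault).expanduser(),  # "~" is not expanded automatically
    }
```

Failing fast on a missing key avoids discovering the problem only after transcripts have been downloaded.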
# Full sync (downloads and updates your watch history, skipping existing videos)
./run.sh --sync
# Re-tag and re-categorize all existing notes with the latest taxonomy
./run.sh --retag
Apply consistent LLM-powered categorization to any existing notes folder:
# Dry-run preview (no files modified)
python3 retag_notes.py --folder "~/Documents/Obsidian Vault/Apple Notes" --dry-run
# Live retag
python3 retag_notes.py --folder "~/Documents/Obsidian Vault/Books"
# Control parallelism
python3 retag_notes.py --folder "~/Documents/Obsidian Vault/Books" --workers 10
This toolkit is designed to be fully manageable by local-first AI agents. There are two primary integration paths:
You can teach the OpenClaw Mac AI agent how to autonomously run your syncs and retag folders using a native SKILL.md file.
Installation:
Create a folder at ~/.openclaw/skills/obsidian-youtube-sync/ and drop in a SKILL.md file that explains how to execute ./run.sh and uv run retag_notes.py. (A working template is provided upon request.)
Once installed, you can simply ask OpenClaw: "Sync my YouTube watch history and let me know what themes I learned about today."
Alternatively, you can expose the toolkit as an MCP server to Claude Desktop, Cursor, or other MCP clients.
Setup for Claude Desktop:
Add the following to your claude_desktop_config.json:
{
"mcpServers": {
"ObsidianYouTubeSync": {
"command": "uv",
"args": [
"--directory",
"/absolute/path/to/ObsidianYouTubeSync",
"run",
"mcp_server.py"
]
}
}
}
Note: Replace /absolute/path/to/ with the actual path to the repository on your machine.
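If you already have other MCP servers configured, the entry needs to be merged rather than pasted over the file. A minimal sketch of that merge, using only the JSON shape shown above; the helper names are hypothetical, and locating and writing the actual config file is left out.

```python
from pathlib import Path


def mcp_entry(repo_dir):
    """Build the server entry shown above for a given repository directory."""
    repo = Path(repo_dir).expanduser()
    if not repo.is_absolute():
        raise ValueError("repo_dir must be an absolute path")
    return {
        "command": "uv",
        "args": ["--directory", str(repo), "run", "mcp_server.py"],
    }


def merge_into_config(config, repo_dir, name="ObsidianYouTubeSync"):
    """Return a copy of an existing config dict with this server added or replaced."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # keep any servers already configured
    servers[name] = mcp_entry(repo_dir)
    merged["mcpServers"] = servers
    return merged
```

The result can be serialized with `json.dumps(merged, indent=2)` and written back to `claude_desktop_config.json`.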
This tool is purpose-built as a data pipeline for knowledge graph construction. Here's how to plug it into downstream systems:
- Run `./run.sh --sync` to sync your complete YouTube history
- Point your GraphRAG pipeline at the `~/Documents/Obsidian Vault/YouTube/` folder
- The YAML frontmatter is parsed as node metadata; transcripts are chunked for retrieval; tags form the typed edge schema
- Each note file is a self-contained document node with entity tags (`Person/`, `Company/`, `Location/`) pre-extracted
- The taxonomy hierarchy (`Technology/ArtificialIntelligence/LLM`) maps directly to ontological class trees
- Channel metadata provides provenance edges in the graph
- Summaries are ideal for dense embedding (short, factual, topic-rich)
- Transcripts can be chunked with metadata-aware splitters
- Tags can be used as filters/facets for hybrid search
- Generate instruction-response pairs from `(transcript) → summary` pairs
- Use hierarchical tags as classification labels
- Export with a simple Python script reading the YAML frontmatter
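The export step in the last group could look something like the sketch below, which turns each note into one transcript-to-summary record and emits JSON Lines. It assumes the note layout from the example earlier in this README (a `summary:` frontmatter field and a `## Raw Transcript` section); the record field names and the instruction text are illustrative choices, not a format the project defines.

```python
import json
import re


def extract_summary(note):
    """Return the frontmatter summary value without its surrounding quotes, or None."""
    match = re.search(r'^summary:[ \t]*"?(.*?)"?[ \t]*$', note, flags=re.MULTILINE)
    return match.group(1) if match and match.group(1) else None


def extract_section(note, heading="## Raw Transcript"):
    """Return the text under a level-2 heading, up to the next level-2 heading."""
    lines = note.splitlines()
    start = next((i for i, line in enumerate(lines) if line.strip() == heading), None)
    if start is None:
        return None
    body = []
    for line in lines[start + 1:]:
        if line.startswith("## "):
            break
        body.append(line)
    return "\n".join(body).strip() or None


def to_training_record(note):
    """Build one instruction-response record, or None if either part is missing."""
    summary = extract_summary(note)
    transcript = extract_section(note)
    if not summary or not transcript:
        return None
    return {
        "instruction": "Summarize the following video transcript.",
        "input": transcript,
        "output": summary,
    }


def records_to_jsonl(notes):
    """Serialize all usable notes as JSON Lines, skipping incomplete ones."""
    records = (to_training_record(n) for n in notes)
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records if r)
```

Reading the files themselves is a matter of iterating over `Path(vault).glob("*.md")` and passing each file's text to `records_to_jsonl`.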
| Feature | Status |
|---|---|
| YouTube History Sync (incremental & full) | Done |
| AI Summarization via Google Gemini (GenAI) | Done |
| Full Transcript / Caption Extraction | Done |
| Hierarchical AI Taxonomy Tagging | Done |
| Parallel Processing (ThreadPoolExecutor) | Done |
| Vault-wide Retagging Engine | Done |
| Proxy / Rate-limit Support | Done |
| MCP Server for AI Agent Integration | Done |
| GraphRAG Export Helper Script | Planned |
| Obsidian Community Plugin | Planned |
| Notion / Logseq Export | Planned |
| Local LLM Support (Ollama) | Considering |
| YouTube Playlist Sync | Considering |
| OpenClaw Native Connector | Considering |
- Native YAML Properties: `tags:`, `url:`, `summary:`, `channel:`, `date_synced:` are all recognized by Obsidian natively
- Hierarchical Tags: `Parent/Child` tags map to Obsidian's Tag Pane tree view
- Dataview Queries: Query your entire watch history like a database
- Graph View: Entity tags create visual connections between videos sharing people, companies, or concepts
- Local-First: All notes live on your machine. No cloud sync.
- No Account Data: Only public video metadata (title, description, transcript) is sent to Gemini for summarization. Your YouTube account credentials are never accessed or stored.
- Credential Security: API keys are loaded via `.env` (excluded from git). Proxy credentials are stored locally only.
- macOS Permissions: Ensure Terminal has "Full Disk Access" in System Settings → Privacy & Security.
| Environment | Supported |
|---|---|
| macOS (Intel & Apple Silicon) | Yes |
| Linux | Yes |
| Windows (WSL) | |
| Python 3.10+ | Yes |
| Obsidian 1.0+ | Yes |
| Chrome / Safari / Firefox cookies | Yes |
| Claude Desktop (MCP) | Yes |
| GraphRAG / LlamaIndex | Yes |
| OpenClaw | Yes |
Contributions are welcome! See CONTRIBUTING.md for how to get started.
If you find this useful, please star the repo; it helps more researchers, PKM enthusiasts, and knowledge graph builders discover it.
Distributed under the MIT License. See LICENSE for more information.