
jcenters/distill

Distill

Give your AI agents clean, readable web content.

Raw HTML is noisy — navigation bars, cookie banners, footers, ads, and a thousand lines of boilerplate for every paragraph of actual content. Feeding that to an LLM wastes tokens and buries the signal.

Distill converts a URL into clean, structured Markdown that agents can actually read. It fetches pages through a stealth browser that gets past Cloudflare and similar bot detection, strips boilerplate, and extracts the content that matters. Obsidian integration is included, so you can save clips directly to your vault, but the core capability works with or without Obsidian.

What it does

  1. Fetches the URL using Scrapling — a stealth headless browser that bypasses Cloudflare and other bot-detection systems that block standard HTTP requests
  2. Extracts the main content using readability-lxml, a Python port of the Readability algorithm, the approach behind Obsidian's own web clipper and Firefox Reader Mode
  3. Converts to clean Markdown using markdownify, with trafilatura as fallback for complex pages
  4. Saves to your Obsidian vault with YAML frontmatter (title, source URL, date, topic) — or use the JSON output directly in your pipeline
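
Step 4 can be sketched with the standard library alone. This is a minimal illustration, not Distill's actual code: `slugify` and `build_note` are hypothetical helper names, and the filename pattern is inferred from the example output shown under "Output format".

```python
import re
from datetime import date


def slugify(title: str) -> str:
    """Lowercase the title and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def build_note(title: str, url: str, topic: str, body: str,
               clipped: date) -> tuple[str, str]:
    """Return (relative_filename, note_text) for one clipped page."""
    filename = f"{topic}/{clipped.isoformat()}-{slugify(title)}.md"
    safe_title = title.replace('"', '\\"')  # keep the quoted YAML value valid
    frontmatter = "\n".join([
        "---",
        f'title: "{safe_title}"',
        f"source: {url}",
        f"clipped: {clipped.isoformat()}",
        f"topic: {topic}",
        "---",
    ])
    return filename, f"{frontmatter}\n\n{body.strip()}\n"


filename, note = build_note(
    "Article Title", "https://example.com/article", "research",
    "# Article Title\n\nBody text.", date(2026, 3, 26),
)
print(filename)  # research/2026-03-26-article-title.md
```

Passing the date in explicitly, rather than calling `date.today()` inside the helper, keeps the function deterministic and easy to test.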

Who it's for

AI agent builders. If you're building agents that research the web, you don't want your model wading through raw HTML. Run this as a preprocessing step and feed agents clean markdown instead. Works with Claude Code, LangChain, CrewAI, custom agents — anything that can call a subprocess.

Obsidian power users. Run it on a server or cron job. Clip pages to your vault from anywhere, without a browser open.

Python developers. Use it as a library or subprocess for any pipeline that needs reliable web content extraction.

Installation

git clone https://github.com/jcenters/distill
cd distill
pip install -r requirements.txt
playwright install chromium

Usage

Command line

# No topic given: save to $OBSIDIAN_CLIPPINGS/research/ (the default topic)
python clip.py https://example.com/article

# Save to $OBSIDIAN_CLIPPINGS/tech/
python clip.py https://example.com/article tech

With Claude Code

Claude Code has a Bash tool. Once Distill is installed, Claude can call it mid-conversation — no framework, no setup, just a shell command:

You: "Research this page for me: https://example.com/article"

Claude runs: python /path/to/clip.py https://example.com/article research
Claude reads: the saved Markdown file
Claude answers: based on clean, extracted content

If a direct fetch returns a 403 or a Cloudflare block, Claude can be told to fall back to Distill. The content comes back as readable Markdown instead of a wall of HTML. Works in any Claude Code session: interactive, cron-based, or agentic.

From any AI agent

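import json
import subprocess

url = "https://example.com/article"
topic = "tech"

# clip.py prints a JSON object describing the saved clip to stdout
result = json.loads(
    subprocess.check_output(["python", "clip.py", url, topic])
)
# result = {"file": "clippings/tech/2026-03-26-title.md", "title": "...", "path": "..."}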

The agent can read the saved Markdown file, or capture the content inline and pass it directly to the model. Works with any LLM that can call shell commands — Claude, GPT-4, Gemini, local models.
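
For an agent that wants the Markdown itself rather than just the file location, a small wrapper might look like the sketch below. `clip_to_markdown` is a hypothetical name, and the sketch assumes two things this README does not state: that `clip.py` exits non-zero on failure, and that its stdout contains only the JSON object. It uses `sys.executable` so the clipper runs under the same interpreter as the caller.

```python
import json
import subprocess
import sys
from pathlib import Path


def clip_to_markdown(url: str, topic: str = "research",
                     script: str = "clip.py", timeout: float = 120.0) -> dict:
    """Run the clipper and return its JSON result plus the saved Markdown."""
    proc = subprocess.run(
        [sys.executable, script, url, topic],
        capture_output=True, text=True, timeout=timeout,
    )
    if proc.returncode != 0:
        # Assumption: a failed clip is signalled by a non-zero exit code
        raise RuntimeError(f"clip failed ({proc.returncode}): {proc.stderr.strip()}")
    result = json.loads(proc.stdout)
    # "path" is the absolute path reported in the JSON output
    result["markdown"] = Path(result["path"]).read_text(encoding="utf-8")
    return result
```

The timeout guards against a headless browser that hangs on a slow page; `subprocess.run` raises `subprocess.TimeoutExpired` in that case, which the calling agent can catch and report.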

Output format

JSON to stdout:

{
  "file": "clippings/research/2026-03-26-article-title.md",
  "title": "Article Title",
  "topic": "research",
  "url": "https://example.com/article",
  "path": "/absolute/path/to/file.md"
}

Saved file:

---
title: "Article Title"
source: https://example.com/article
clipped: 2026-03-26
topic: research
---

[Clean article content as Markdown — no nav, no ads, no boilerplate]
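
When a pipeline needs the metadata and the body separately, the saved file can be split without a YAML dependency. The sketch below is an illustration only: `split_frontmatter` is a hypothetical helper, and it handles just the flat `key: value` frontmatter shown above.

```python
def split_frontmatter(note: str) -> tuple[dict, str]:
    """Split a clipped note into (metadata, body).

    Handles only flat `key: value` frontmatter; use a YAML parser for
    anything richer.
    """
    lines = note.splitlines()
    stripped = [line.strip() for line in lines]
    if not stripped or stripped[0] != "---":
        return {}, note
    try:
        end = stripped.index("---", 1)  # closing delimiter
    except ValueError:
        return {}, note  # no closing delimiter, treat everything as body
    meta = {}
    for line in lines[1:end]:
        # Split on the first colon only, so URLs in values stay intact
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return meta, body


example = (
    '---\ntitle: "Article Title"\nsource: https://example.com/article\n'
    "clipped: 2026-03-26\ntopic: research\n---\n\nBody text.\n"
)
meta, body = split_frontmatter(example)
print(meta["source"])  # https://example.com/article
```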

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| OBSIDIAN_VAULT | ~/Documents/Obsidian | Path to your Obsidian vault |
| OBSIDIAN_CLIPPINGS | $OBSIDIAN_VAULT/clippings | Path to clippings folder |

export OBSIDIAN_VAULT=~/my-vault
python clip.py https://example.com/article research
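
The way the two variables interact can be sketched as follows. This is an assumption about the resolution order based on the defaults listed above, not a copy of Distill's code, and `resolve_clippings_dir` is a hypothetical name.

```python
import os
from pathlib import Path


def resolve_clippings_dir(env=None) -> Path:
    """Return the clippings folder implied by the environment."""
    env = os.environ if env is None else env
    vault = Path(env.get("OBSIDIAN_VAULT", "~/Documents/Obsidian")).expanduser()
    clippings = env.get("OBSIDIAN_CLIPPINGS")
    # An explicit OBSIDIAN_CLIPPINGS wins; otherwise fall back to <vault>/clippings
    return Path(clippings).expanduser() if clippings else vault / "clippings"


print(resolve_clippings_dir({"OBSIDIAN_VAULT": "/srv/vault"}))
```

Accepting a mapping instead of reading `os.environ` directly makes the function easy to exercise with test values.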

What it handles that other tools don't

Cloudflare and bot detection. Plain requests or urllib calls get blocked by many sites that sit behind Cloudflare or similar bot protection. Scrapling runs a real headless browser with stealth headers, solving this at the fetch layer.

Boilerplate removal. Readability extracts the main content and discards everything else — the same algorithm Firefox uses for Reader Mode. The Markdown you get is what a human would copy-paste, not the full page dump.

Nested lists and structure. Unlike simpler extractors, the readability + markdownify pipeline preserves nested lists, blockquotes, code blocks, and heading hierarchy.

Fallback resilience. If readability can't identify a main content block, trafilatura takes over — a battle-tested extraction library used by HuggingFace, IBM, and Microsoft Research.
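
The fallback pattern itself is simple to sketch. In the illustration below the extractors are passed in as plain callables, so the example stays independent of any library; `extract_with_fallback` and the `min_chars` threshold are assumptions made for this sketch, since the README does not say how Distill decides that readability has failed.

```python
from typing import Callable, Optional

Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(html: str, primary: Extractor, fallback: Extractor,
                          min_chars: int = 200) -> str:
    """Try `primary`; use `fallback` if it raises or returns too little text."""
    try:
        text = primary(html) or ""
    except Exception:
        text = ""  # treat any extractor error as "no content found"
    if len(text.strip()) >= min_chars:
        return text
    # Keep the short primary result if the fallback finds nothing at all
    return fallback(html) or text


result = extract_with_fallback(
    "<html>...</html>",
    primary=lambda html: "",                     # stands in for readability + markdownify
    fallback=lambda html: "Recovered content.",  # stands in for trafilatura
)
print(result)  # Recovered content.
```

In a real pipeline, `primary` would wrap the readability-lxml and markdownify steps and `fallback` would wrap a trafilatura call.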

Credits

Built on top of excellent open-source work:

  • Scrapling by D4Vinci — stealth headless fetching. BSD 3-Clause License.
  • readability-lxml by Yuri Baburov — Python port of the arc90 Readability algorithm. Apache License 2.0.
  • markdownify by Matthew Tretter (matthewwithanm) — HTML to Markdown conversion. MIT License.
  • trafilatura by Adrien Barbaresi — web scraping and text extraction. Apache License 2.0.

License

MIT

