Distill

Give your AI agents clean, readable web content.

Raw HTML is noisy — navigation bars, cookie banners, footers, ads, and a thousand lines of boilerplate for every paragraph of actual content. Feeding that to an LLM wastes tokens and buries the signal.

This tool converts any URL into clean, structured Markdown that agents can actually read. It bypasses Cloudflare and bot detection, strips boilerplate, and extracts the content that matters. Obsidian integration is included — save clips directly to your vault — but the core capability works with or without Obsidian.

What it does

Fetches the URL using Scrapling — a stealth headless browser that bypasses Cloudflare and other bot-detection systems that block standard HTTP requests
Extracts the main content using readability-lxml, a Python implementation of the Readability algorithm, the same approach behind Firefox Reader Mode and reader-style web clippers
Converts to clean Markdown using markdownify, with trafilatura as a fallback for complex pages
Saves to your Obsidian vault with YAML frontmatter (title, source URL, date, topic) — or use the JSON output directly in your pipeline

Who it's for

AI agent builders. If you're building agents that research the web, you don't want your model wading through raw HTML. Run this as a preprocessing step and feed agents clean markdown instead. Works with Claude Code, LangChain, CrewAI, custom agents — anything that can call a subprocess.

Obsidian power users. Run it on a server or cron job. Clip pages to your vault from anywhere, without a browser open.

Python developers. Use it as a library or subprocess for any pipeline that needs reliable web content extraction.

Installation

git clone https://github.com/jcenters/distill
cd distill
pip install -r requirements.txt
playwright install chromium

Usage

Command line

# Save to $OBSIDIAN_CLIPPINGS/research/
python clip.py https://example.com/article

# Save to $OBSIDIAN_CLIPPINGS/tech/
python clip.py https://example.com/article tech

With Claude Code

Claude Code has a Bash tool. Once Distill is installed, Claude can call it mid-conversation — no framework, no setup, just a shell command:

You: "Research this page for me: https://example.com/article"

Claude runs: python /path/to/clip.py https://example.com/article research
Claude reads: the saved Markdown file
Claude answers: based on clean, extracted content

If a page returns 403 or a Cloudflare block, Claude falls back to Distill automatically. The content comes back as readable Markdown instead of a wall of HTML. Works in any Claude Code session — interactive, cron-based, or agentic.
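
A fallback like this needs some rule for deciding that a plain fetch was blocked. The sketch below shows one possible check. The helper name `looks_blocked`, the status codes, and the marker strings are illustrative assumptions, not part of Distill or Claude Code.

```python
# Hypothetical helper: decide whether a plain HTTP response looks like a bot
# block, so the URL should be retried through Distill. The status codes and
# marker strings are illustrative assumptions, not an exhaustive rule.
BLOCK_STATUSES = {403, 429, 503}
BLOCK_MARKERS = ("just a moment...", "attention required! | cloudflare")


def looks_blocked(status: int, body: str) -> bool:
    """Return True when the response suggests a block or challenge page."""
    if status in BLOCK_STATUSES:
        return True
    head = body[:2000].lower()  # challenge pages identify themselves early
    return any(marker in head for marker in BLOCK_MARKERS)
```

An agent could call `looks_blocked` on the result of its normal fetch and run `clip.py` only when it returns True.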

From any AI agent

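import json
import subprocess

url = "https://example.com/article"  # page to clip
topic = "tech"  # subfolder under the clippings folder

# clip.py prints a JSON description of the saved clip to stdout
result = json.loads(
    subprocess.check_output(["python", "clip.py", url, topic])
)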
# result = {"file": "clippings/tech/2026-03-26-title.md", "title": "...", "path": "..."}

The agent can read the saved Markdown file, or capture the content inline and pass it directly to the model. Works with any LLM that can call shell commands — Claude, GPT-4, Gemini, local models.
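
Reading the saved file back can be wrapped in one function. This is a minimal sketch, assuming `clip.py` prints only the JSON object to stdout and that its `path` field points at the saved file, as described under Output format. The name `fetch_markdown` and the injectable `run` parameter are hypothetical conveniences, not part of Distill.

```python
import json
import subprocess
import sys
from pathlib import Path


def fetch_markdown(url, topic="research", script="clip.py", run=subprocess.check_output):
    """Clip `url` with Distill and return (metadata, markdown_text).

    `script` is the path to clip.py. `run` is injectable so the function can
    be exercised without launching a browser.
    """
    stdout = run([sys.executable, script, url, topic])
    meta = json.loads(stdout)
    # "path" is the absolute location of the saved file in the JSON output.
    text = Path(meta["path"]).read_text(encoding="utf-8")
    return meta, text
```

The returned text still contains the YAML frontmatter, so the agent keeps the source URL and clip date alongside the content.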

Output format

JSON to stdout:

{
  "file": "clippings/research/2026-03-26-article-title.md",
  "title": "Article Title",
  "topic": "research",
  "url": "https://example.com/article",
  "path": "/absolute/path/to/file.md"
}

Saved file:

---
title: "Article Title"
source: https://example.com/article
clipped: 2026-03-26
topic: research
---

[Clean article content as Markdown — no nav, no ads, no boilerplate]
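
A file in this layout can be produced with the standard library alone. The sketch below is an assumption about how such a file might be assembled, not a copy of what `clip.py` does. The helper names `slugify` and `render_clip` are hypothetical, and the date-plus-slug filename is inferred from the example path above.

```python
import re
from datetime import date


def slugify(title: str) -> str:
    """Lowercase the title and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def render_clip(title: str, url: str, topic: str, body: str, clipped: date) -> tuple[str, str]:
    """Return (filename, file_text) in the layout shown above."""
    # Escape backslashes and quotes so the double-quoted YAML value stays valid.
    safe_title = title.replace("\\", "\\\\").replace('"', '\\"')
    front = (
        "---\n"
        f'title: "{safe_title}"\n'
        f"source: {url}\n"
        f"clipped: {clipped.isoformat()}\n"
        f"topic: {topic}\n"
        "---\n\n"
    )
    return f"{clipped.isoformat()}-{slugify(title)}.md", front + body
```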

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| `OBSIDIAN_VAULT` | `~/Documents/Obsidian` | Path to your Obsidian vault |
| `OBSIDIAN_CLIPPINGS` | `$OBSIDIAN_VAULT/clippings` | Path to clippings folder |

export OBSIDIAN_VAULT=~/my-vault
python clip.py https://example.com/article research
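
The way the two variables combine can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `clip.py`: the function name `clippings_dir` is hypothetical, and the precedence of `OBSIDIAN_CLIPPINGS` over the vault-relative default is inferred from the defaults listed above.

```python
import os
from pathlib import Path


def clippings_dir(topic: str = "research", env=None) -> Path:
    """Resolve the target folder from the two variables described above."""
    env = os.environ if env is None else env
    vault = Path(env.get("OBSIDIAN_VAULT", "~/Documents/Obsidian")).expanduser()
    # OBSIDIAN_CLIPPINGS, when set, takes precedence over the vault-relative default.
    base = env.get("OBSIDIAN_CLIPPINGS")
    root = Path(base).expanduser() if base else vault / "clippings"
    return root / topic
```

Passing a plain dict as `env` makes the resolution easy to check without touching the real environment.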

What it handles that other tools don't

Cloudflare and bot detection. Plain HTTP clients such as requests or urllib are blocked by many sites that sit behind bot protection. Scrapling runs a real headless browser with stealth headers, which addresses this at the fetch layer.

Boilerplate removal. Readability extracts the main content and discards everything else, using the same approach as Firefox Reader Mode. The Markdown you get is what a human would copy-paste, not the full page dump.

Nested lists and structure. Unlike simpler extractors, the readability + markdownify pipeline preserves nested lists, blockquotes, code blocks, and heading hierarchy.

Fallback resilience. If readability can't identify a main content block, trafilatura takes over — a battle-tested extraction library used by HuggingFace, IBM, and Microsoft Research.
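
The fallback pattern itself is simple to express. The sketch below is generic and makes assumptions: the name `extract_with_fallback`, the `min_chars` threshold, and the rule "too little text means failure" are illustrative, since this README does not say how `clip.py` decides when to switch extractors.

```python
from typing import Callable, Optional

Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(
    html: str, primary: Extractor, fallback: Extractor, min_chars: int = 200
) -> str:
    """Use `primary`; switch to `fallback` if it fails or returns too little text."""
    try:
        text = primary(html) or ""
    except Exception:
        text = ""  # treat an extractor crash like an empty result
    if len(text.strip()) >= min_chars:
        return text
    return fallback(html) or ""
```

In a setup like Distill's, `primary` would wrap the readability-lxml and markdownify steps and `fallback` would wrap trafilatura.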

Credits

Built on top of excellent open-source work:

Scrapling by D4Vinci: stealth headless fetching. BSD 3-Clause License.
readability-lxml by Yuri Baburov: Python port of the Readability algorithm. Apache License 2.0.
markdownify by Matthew Tretter (matthewwithanm): HTML to Markdown conversion. MIT License.
trafilatura by Adrien Barbaresi: web scraping and text extraction. Apache License 2.0.

License

MIT
