AI-Crawler-Blocklist

AI-Crawler-Blocklist publishes AI crawler blocklists and deployment-ready firewall snippets from official operator-published sources. It separates verified IP ranges, user-agent rules, robots.txt controls, and watch lists so site operators can choose the right enforcement level without mixing signal quality.

Links

Resource	URL
Generated artifacts	`dist/`
Source policy	`docs/source-policy.md`
Firewall deployment notes	`docs/firewalls.md`
Operating modes	`docs/modes.md`
Source health report	`dist/sources-report.md`
Machine-readable metadata	`dist/metadata.json`

Overview

AI-Crawler-Blocklist is built for publishers, application operators, infrastructure teams, and security engineers who need repeatable controls for AI training crawlers, AI search bots, assistant fetchers, and related indexing systems.

The repository consumes curated source definitions from config/sources.json, validates the source policy, fetches official IP feeds where available, normalizes CIDRs, and renders platform-specific outputs under dist/. Source failures are recorded in metadata instead of failing the entire build, which keeps scheduled updates operational while preserving source health visibility.

System Behavior

config/sources.json
        |
        v
scripts/normalize_sources.py  -> confidence, enforcement, source policy
        |
        v
scripts/fetch_sources.py      -> official JSON/text/embedded JSON/static prefixes
        |
        v
scripts/build.py              -> deterministic dist artifacts
        |
        v
dist/metadata.json + firewall snippets + robots.txt + plain lists

Enforcement is derived from source quality:

Class	Source quality	Output behavior
`verified-drop`	Official crawler-specific IP/CIDR feed	Eligible for IP hard drop
`ua-only`	Documented user-agent without verified IP feed	User-agent block rules only
`robots-only`	Robots token such as `Google-Extended`	robots.txt outputs only
`static-watch`	Broad static ranges, CN/watch, platform ranges, weak signals	Observe, challenge, or rate-limit

Features

Verified IPv4/IPv6 lists for official AI crawler IP feeds.
User-agent lists, regex lists, nginx maps, Apache SetEnvIf rules, and Cloudflare expressions.
robots.txt snippets for training opt-out, all AI bots, CN/watch bots, and search-safe AI opt-out.
iptables/ipset, nftables, pf/pfSense, Caddy, HAProxy, and Traefik outputs.
Deterministic builds with fixed timestamp support via CRAWLERSCOPE_GENERATED_AT.
Machine-readable metadata with counts, source health, confidence, enforcement, and failed sources.
Scheduled GitHub Actions update workflow and daily release workflow.

Quick Start

git clone https://github.com/ipanalytics/AI-Crawler-Blocklist.git
cd AI-Crawler-Blocklist
make install-dev
make build
make validate
make test

For sandboxed environments where uv must keep all state inside the worktree:

UV_CACHE_DIR=.uv-cache UV_PYTHON_INSTALL_DIR=.uv-python \
  uv run --python 3.12 python scripts/build.py

Installation

The generated files are intended to be consumed directly from GitHub raw URLs or vendored into your own configuration management.

curl -fsSL https://raw.githubusercontent.com/ipanalytics/AI-Crawler-Blocklist/main/dist/metadata.json

Pinning to a release tag is recommended for controlled production rollout:

curl -fsSL https://github.com/ipanalytics/AI-Crawler-Blocklist/releases/latest/download/ai-crawler-blocklist-dist.tar.gz \
  -o ai-crawler-blocklist-dist.tar.gz

Usage Examples

robots.txt

curl -fsSL https://raw.githubusercontent.com/ipanalytics/AI-Crawler-Blocklist/main/dist/robots-ai-all-block.txt \
  -o /var/www/html/robots.txt

nginx

curl -fsSL https://raw.githubusercontent.com/ipanalytics/AI-Crawler-Blocklist/main/dist/nginx-ai-map.conf \
  -o /etc/nginx/snippets/nginx-ai-map.conf

include /etc/nginx/snippets/nginx-ai-map.conf;

server {
    if ($ai_crawler) {
        return 403;
    }
}

nginx -t && systemctl reload nginx

Apache

curl -fsSL https://raw.githubusercontent.com/ipanalytics/AI-Crawler-Blocklist/main/dist/apache-ai-setenvif.conf \
  -o /etc/apache2/conf-available/ai-crawlers.conf

a2enconf ai-crawlers
apachectl configtest && systemctl reload apache2

Cloudflare WAF

curl -fsSL https://raw.githubusercontent.com/ipanalytics/AI-Crawler-Blocklist/main/dist/cloudflare-ai-expression.txt

Use the expression in a WAF Custom Rule. The output is UA-based and designed for review before deployment.

iptables

curl -fsSL https://raw.githubusercontent.com/ipanalytics/AI-Crawler-Blocklist/main/dist/iptables-ai.sh \
  -o /usr/local/sbin/update-ai-iptables.sh

chmod +x /usr/local/sbin/update-ai-iptables.sh
/usr/local/sbin/update-ai-iptables.sh

The generated script uses ipset for set-based matching.

nftables

curl -fsSL https://raw.githubusercontent.com/ipanalytics/AI-Crawler-Blocklist/main/dist/nftables-ai.nft \
  -o /etc/nftables.d/ai-crawlers.nft

nft -f /etc/nftables.d/ai-crawlers.nft

Caddy

curl -fsSL https://raw.githubusercontent.com/ipanalytics/AI-Crawler-Blocklist/main/dist/caddy-ai-block.caddy \
  -o /etc/caddy/snippets/ai-crawlers.caddy

caddy validate --config /etc/caddy/Caddyfile && systemctl reload caddy

HAProxy

curl -fsSL https://raw.githubusercontent.com/ipanalytics/AI-Crawler-Blocklist/main/dist/haproxy-ai-acl.cfg \
  -o /etc/haproxy/ai-crawlers.cfg

haproxy -c -f /etc/haproxy/haproxy.cfg && systemctl reload haproxy

Outputs

Artifact	Purpose
`dist/ai-ips-verified-v4.txt`	Verified official IPv4 CIDRs
`dist/ai-ips-verified-v6.txt`	Verified official IPv6 CIDRs
`dist/ai-ips-verified-all.txt`	Combined verified CIDRs
`dist/ai-ips-high-confidence-v4.txt`	IPv4 challenge/rate-limit candidates
`dist/ai-ips-high-confidence-v6.txt`	IPv6 challenge/rate-limit candidates
`dist/ai-user-agents.txt`	Plain AI crawler UA tokens
`dist/ai-user-agents-regex.txt`	Escaped UA regex tokens
`dist/ai-cn-user-agents-watch.txt`	CN/watch UA list
`dist/robots-ai-all-block.txt`	robots.txt rules for AI bots and robots-only tokens
`dist/cloudflare-ai-expression.txt`	Cloudflare WAF expression
`dist/metadata.json`	Source health and counts
`dist/sources-report.md`	Human-readable source report

Platform-specific files

Artifact	Platform
`dist/nginx-ai-map.conf`	nginx
`dist/nginx-ai-deny.conf`	nginx
`dist/apache-ai-setenvif.conf`	Apache
`dist/iptables-ai.sh`	iptables/ipset
`dist/nftables-ai.nft`	nftables
`dist/pf-ai-table.conf`	pf / pfSense
`dist/caddy-ai-block.caddy`	Caddy
`dist/haproxy-ai-acl.cfg`	HAProxy
`dist/traefik-ai-middleware.yml`	Traefik

Data Format

All generated text files include a header with project name, generation timestamp, source repository, policy, and review note.

dist/metadata.json is the operational source of truth for current build state:

{
  "generated_at": "2026-06-17T00:00:00Z",
  "project": "AI-Crawler-Blocklist",
  "policy": "official/operator-published sources only",
  "counts": {
    "verified_ipv4_prefixes": 2261,
    "verified_ipv6_prefixes": 1,
    "user_agent_patterns": 24,
    "robots_tokens": 26
  },
  "failed_sources": []
}

Source definitions live in config/sources.json. The normalizer adds confidence, enforcement, ipPolicy, and includeInAiOutputs at build time.

Operational Notes

Use verified-drop artifacts for hard IP enforcement.
Use UA files for application-layer controls where IP ranges are unavailable.
Use watch lists for logging, challenge, bot-score adjustment, or rate limiting.
Treat Google-Extended and Applebot-Extended as robots.txt controls.
Review dist/metadata.json and dist/sources-report.md before rolling changes into production.

Project Scope

The project covers AI crawlers, AI search bots, assistant fetchers, training/indexing bots, and AI-adjacent archive sources such as CCBot. Generic search crawlers, SEO tools, uptime probes, ad verification crawlers, social preview bots, and security scanners are outside the generated AI output set unless explicitly classified by policy.

Use Cases

Publisher AI crawling controls.
WAF rule generation for known AI user agents.
Verified IP hard-drop lists for official crawler feeds.
Bot analytics enrichment from access logs.
Change-controlled distribution of crawler policy into infrastructure automation.

Limitations

User-agent strings can be spoofed.
robots.txt depends on crawler compliance.
Some assistant fetchers are user-triggered and may affect product visibility.
Broad cloud or platform ranges belong in observe/challenge workflows, not default hard drop.

Directory Structure

.
├── config/                 # source definitions, policy, output manifest, schema
├── dist/                   # generated blocklists and platform artifacts
├── docs/                   # operator documentation
├── scripts/                # build, fetch, normalize, validate, render
├── templates/              # Jinja templates for generated configs
├── tests/                  # source policy, parsing, output, workflow tests
└── .github/workflows/      # update, PR validation, daily release

Deployment

The update workflow rebuilds dist/ every six hours and commits changes when generated artifacts differ. The release workflow publishes a daily release containing the current dist/ archive plus metadata and source report.

Production deployments should pin to a release tag or mirror dist/ through internal configuration management. Direct raw URL consumption is suitable for simple hosts and lab environments.

License

MIT. See LICENSE.

Disclaimer

This project provides defensive network and application-layer control data. Operators are responsible for testing enforcement impact in their own environment before blocking traffic.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AI-Crawler-Blocklist

Links

Overview

System Behavior

Features

Quick Start

Installation

Usage Examples

robots.txt

nginx

Apache

Cloudflare WAF

iptables

nftables

Caddy

HAProxy

Outputs

Data Format

Operational Notes

Project Scope

Use Cases

Limitations

Directory Structure

Deployment

License

Disclaimer

About

Uh oh!

Releases 14

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 71 Commits
.github/workflows		.github/workflows
config		config
dist		dist
docs		docs
scripts		scripts
templates		templates
tests		tests
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

AI-Crawler-Blocklist

Links

Overview

System Behavior

Features

Quick Start

Installation

Usage Examples

robots.txt

nginx

Apache

Cloudflare WAF

iptables

nftables

Caddy

HAProxy

Outputs

Data Format

Operational Notes

Project Scope

Use Cases

Limitations

Directory Structure

Deployment

License

Disclaimer

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 14

Contributors

Uh oh!

Languages