Daily first-party IP geolocation from public geofeeds.
GeoFeed Harvester discovers RFC 8805 geofeed files from public RIR data, downloads them, validates every row, adds provenance, checks BGP visibility in bulk, and publishes a clean dataset that can be consumed by GeoForge, MMDB builders, fraud systems, routing tools, and research pipelines.
The goal is simple: use operator-published geolocation at the source instead of repackaging opaque commercial GeoIP databases.
- Generated at: 2026-05-23T12:20:03+00:00
- Valid rows: 505,390
- Raw rows: 567,587
- Unique prefixes: 505,390
- Unique geofeed URLs: 3,358
- Countries: 301
- Failed geofeed fetches: 871
- Added / removed / changed prefixes: 121/173/250
- CSV gzip size: 4.1 MB
- JSONL gzip size: 5.2 MB
- Parquet size: 2.9 MB
Every run writes:
dist/geofeed.csv
dist/geofeed.jsonl
dist/changelog.md
The GitHub workflow uploads compressed artifacts because the full JSONL dataset is larger than GitHub's normal per-file git limit:
geofeed.csv.gz
geofeed.jsonl.gz
geofeed.parquet
failed-geofeeds.csv
diff.json
manifest.json
changelog.md
SHA256SUMS
geofeed.csv is the normalized dataset:
prefix,country,region,city,postal_code,rir,inetnum,url,fetched_at,signed,signature_valid,bgp_valid,confidence,flags
5.23.48.0/24,RU,RU-SPE,Saint Petersburg,,RIPE,0.0.0.0/0,https://example/geofeed.csv,2026-05-23T09:13:26+00:00,false,false,true,0.90,
geofeed.jsonl contains the same records as JSON objects, one row per line.
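Because each line of geofeed.jsonl is a standalone JSON object, the file can be streamed record by record. The following is a minimal sketch, not part of the harvester: `iter_records` is a hypothetical helper, and it assumes the JSON keys match the CSV column names shown above. The JSON value types are not documented here, so the sketch leaves values untouched.

```python
import gzip
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL file (plain or .gz)."""
    # Pick the opener from the file extension; both accept text mode.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

Usage would look like `for record in iter_records("geofeed.jsonl.gz"): ...`, which keeps memory use flat regardless of dataset size.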
changelog.md summarizes row counts, flagged rows, and per-RIR coverage for the
latest run.
If this repository publishes artifacts through GitHub Actions, download the latest run from:
https://github.com/ipanalytics/GeoFeed-Harvester/actions/workflows/harvest.yml
The daily workflow publishes a date-stamped GitHub Release and marks it as the latest release. Download the latest release assets through stable URLs:
curl -L -o geofeed.csv.gz \
https://github.com/ipanalytics/GeoFeed-Harvester/releases/latest/download/geofeed.csv.gz
curl -L -o geofeed.jsonl.gz \
https://github.com/ipanalytics/GeoFeed-Harvester/releases/latest/download/geofeed.jsonl.gz
curl -L -o geofeed.parquet \
https://github.com/ipanalytics/GeoFeed-Harvester/releases/latest/download/geofeed.parquet
curl -L -o manifest.json \
https://github.com/ipanalytics/GeoFeed-Harvester/releases/latest/download/manifest.json
For automation, prefer release assets when available because the URL is stable. Actions artifacts are useful for inspection, but GitHub expires them according to repository retention settings.
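Automated consumers may also want to check a downloaded asset against the published SHA256SUMS file. The sketch below assumes SHA256SUMS follows the conventional `sha256sum` layout (hex digest, whitespace, file name); that layout is an assumption, since the file format is not described here. `parse_sums` and `verify` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def parse_sums(text: str) -> dict[str, str]:
    """Map file name -> hex digest from sha256sum-style lines (assumed layout)."""
    sums = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            # sha256sum marks binary mode with a leading '*'; listed paths may
            # carry a directory prefix, so keep only the base name.
            sums[Path(name.strip().lstrip("*")).name] = digest.lower()
    return sums


def verify(path: str, sums: dict[str, str]) -> bool:
    """Return True when the file's SHA-256 matches the listed digest."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        # Hash in 1 MiB chunks so large files are not read into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return sums.get(Path(path).name) == h.hexdigest()
```

A file that is missing from the checksum list is treated as unverified, which is the safer default for a pipeline.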
The repository also keeps small metadata files in git:
runs/latest-changelog.md
runs/latest-manifest.json
runs/latest-SHA256SUMS
Automatic discovery currently uses unauthenticated public sources:
| Source | Method | Status |
|---|---|---|
| RIPE | public bulk inetnum / inet6num dumps | enabled |
| APNIC | public bulk inetnum / inet6num dumps | enabled |
| AFRINIC | public bulk database dump | enabled |
| LACNIC | public Geofeeds Service CSV | enabled |
| ARIN | authenticated bulk WHOIS or RDAP fallback | not enabled by default |
ARIN bulk WHOIS requires authorization, so it is intentionally not queried as
part of the unauthenticated daily job. ARIN-style records are supported when
provided manually or by a future authenticated adapter: NetRange is treated as
inetnum, and Comment is treated as remarks.
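The key mapping described above can be sketched as follows. This is an illustration under stated assumptions, not the harvester's actual parser: `normalize_record`, `KEY_ALIASES`, and `GEOFEED_URL` are hypothetical names, and the sketch assumes the geofeed reference inside remarks takes the form `Geofeed https://...`.

```python
import re

# Assumed aliases, following the mapping described above.
KEY_ALIASES = {"netrange": "inetnum", "comment": "remarks"}
# Assumed remarks convention: the word "Geofeed" followed by an HTTPS URL.
GEOFEED_URL = re.compile(r"geofeed\s+(https://\S+)", re.IGNORECASE)


def normalize_record(text: str) -> dict:
    """Parse one 'key: value' record and apply the ARIN-style aliases."""
    record = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        if not sep:
            continue  # not a 'key: value' line
        key = key.strip().lower()
        key = KEY_ALIASES.get(key, key)
        value = value.strip()
        # Repeated keys (for example several remarks lines) are joined.
        record[key] = f"{record[key]}\n{value}" if key in record else value
    if "geofeed" not in record:
        # Fall back to a geofeed URL embedded in remarks.
        match = GEOFEED_URL.search(record.get("remarks", ""))
        if match:
            record["geofeed"] = match.group(1)
    return record
```

With this shape, a RIPE-style record with an explicit geofeed attribute and an ARIN-style record with a Comment line both end up with the same `inetnum` and `geofeed` keys.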
The default production run is bulk-first:
flowchart LR
A["RIR bulk dumps"] --> B["Extract inetnum -> geofeed URL"]
C["LACNIC Geofeeds CSV"] --> F["Normalize rows"]
B --> D["Fetch HTTPS geofeed CSV"]
D --> E["Validate RFC 8805 rows"]
E --> G["Team Cymru bulk BGP check"]
F --> G
G --> H["CSV / JSONL / changelog"]
Validation rules include:
- HTTPS-only geofeed URLs.
- RFC 8805 CSV parsing.
- Country code shape validation.
- Region code shape validation.
- Drop rows outside the referring inetnum.
- Prefer the most specific referring inetnum on overlap.
- Add provenance: RIR, source URL, referring inetnum, fetch time.
- Add confidence and conflict flags.
- Validate ISO-3166 country and ISO-3166-2 subdivision codes when the optional pycountry catalog is available.
- Optional Team Cymru bulk BGP visibility checks.
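A few of the rules above can be sketched with the standard library. This is a simplified illustration, not the harvester's validator: `check_row`, the flag names, and the regular expressions are hypothetical, the referring inetnum is assumed to be in CIDR form, and the sketch reports a containment failure as a flag instead of dropping the row.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Shape checks only; they do not confirm the code is actually assigned.
COUNTRY_SHAPE = re.compile(r"^[A-Z]{2}$")
REGION_SHAPE = re.compile(r"^[A-Z]{2}-[A-Z0-9]{1,3}$")


def check_row(prefix: str, country: str, region: str, url: str, inetnum: str) -> list[str]:
    """Return a list of flag names; an empty list means the row passed."""
    flags = []
    if urlsplit(url).scheme != "https":
        flags.append("non_https_url")
    # Empty country and region fields are allowed, so only check non-empty ones.
    if country and not COUNTRY_SHAPE.match(country):
        flags.append("bad_country_shape")
    if region and not REGION_SHAPE.match(region):
        flags.append("bad_region_shape")
    try:
        net = ipaddress.ip_network(prefix, strict=False)
        parent = ipaddress.ip_network(inetnum, strict=False)
    except ValueError:
        flags.append("bad_prefix")
        return flags
    # subnet_of() raises TypeError across IP versions, so compare versions first.
    if net.version != parent.version or not net.subnet_of(parent):
        flags.append("outside_inetnum")
    return flags
```

For example, `check_row("5.23.48.0/24", "RU", "RU-SPE", "https://example/geofeed.csv", "0.0.0.0/0")` returns an empty list, while a row served over plain HTTP or falling outside its referring inetnum would come back with the corresponding flags.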
Install:
python -m venv .venv
. .venv/bin/activate
pip install -e ".[dev]"
Run the full automatic sequence:
geofeed-harvester \
--auto-discover \
--out-dir dist \
--cache-dir .cache/geofeeds \
--bulk-dir .cache/rir-bulk \
--direct-geofeed-dir .cache/direct-geofeeds \
--normalized-rir-dump data/rir.txt \
--concurrency 32 \
--bgp-validator cymru
The first run downloads large bulk files. Daily runs reuse cache metadata and HTTP validators where available.
Optional production enrichments:
geofeed-harvester \
--auto-discover \
--arin-rdap-seed data/arin-rdap-seeds.txt \
--arin-rdap-max-queries 100 \
--signature-verdicts data/signature-verdicts.json
--arin-rdap-seed is intentionally seed-based. It does not scan ARIN address
space; it only enriches explicit IPs or prefixes listed by the operator.
--signature-verdicts accepts JSON produced by an external CMS/RPKI verifier,
for example:
{
"https://example.net/geofeed.csv": {
"signature_valid": true
}
}
You can also provide your own RIR-like records:
inetnum: 203.0.113.0/24
geofeed: https://example.net/geofeed.csv
source: RIPE

NetRange: 198.51.100.0 - 198.51.100.255
Comment: Geofeed https://example.org/geofeed.csv
source: ARIN
Then run:
geofeed-harvester \
--rir-dump data/rir.txt \
--out-dir dist \
--cache-dir .cache/geofeeds \
--concurrency 32 \
--bgp-validator cymru
This repository includes a daily workflow:
.github/workflows/harvest.yml
It runs:
geofeed-harvester --auto-discover ...
and commits:
runs/latest-changelog.md
runs/latest-SHA256SUMS
Large datasets are uploaded as compressed workflow artifacts instead of being committed to git.
The workflow publishes stable daily downloads by attaching:
dist/geofeed.csv.gz
dist/geofeed.jsonl.gz
dist/geofeed.parquet
dist/failed-geofeeds.csv
dist/diff.json
dist/manifest.json
dist/changelog.md
dist/SHA256SUMS
to a date-stamped release such as dataset-2026-05-23 and marks that release
as GitHub's latest release. Stable /releases/latest/download/... URLs continue
to work.
The default workflow does not enable Team Cymru checks because GitHub-hosted
runners can hit TCP/43 rate limits or receive empty responses. Run with
--bgp-validator cymru manually, or from infrastructure with stable egress, when
BGP confidence signals are required.
CSV:
curl -L -o geofeed.csv.gz \
https://github.com/ipanalytics/GeoFeed-Harvester/releases/latest/download/geofeed.csv.gz
JSONL:
curl -L -o geofeed.jsonl.gz \
https://github.com/ipanalytics/GeoFeed-Harvester/releases/latest/download/geofeed.jsonl.gz
Parquet:
curl -L -o geofeed.parquet \
https://github.com/ipanalytics/GeoFeed-Harvester/releases/latest/download/geofeed.parquet
Metadata and daily diff:
curl -L -o manifest.json \
https://github.com/ipanalytics/GeoFeed-Harvester/releases/latest/download/manifest.json
curl -L -o diff.json \
https://github.com/ipanalytics/GeoFeed-Harvester/releases/latest/download/diff.json
Example Python:
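import csv

# Assumes geofeed.csv.gz has already been decompressed to geofeed.csv.
with open("geofeed.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        # Boolean columns are stored as the strings "true" / "false".
        if row["bgp_valid"] == "true":
            print(row["prefix"], row["country"], row["city"])
- Geofeed file format: RFC 8805.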
- Discovery mechanism: RFC 9632, which replaced RFC 9092.
- Large-scale discovery should use RIR bulk data instead of brute-force WHOIS or RDAP scans.
- RPKI CMS signature verification is delegated to external tooling when enabled.
The harvester can use Team Cymru's IP-to-ASN Mapping Service for bulk BGP visibility checks. It sends many probe IPs in one TCP/43 bulk WHOIS session instead of making thousands of individual WHOIS calls.
This is used only for route visibility/confidence. Team Cymru is not treated as a geolocation source.
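The bulk session described above can be sketched as two pure functions: one that builds the query payload and one that reads the reply. This is an assumption-laden illustration, not the harvester's code. `build_bulk_query` and `parse_bulk_response` are hypothetical names, the `begin` / `verbose` / `end` framing and the pipe-separated reply with `NA` for unrouted addresses reflect Team Cymru's bulk interface as commonly documented, and the choice of the network address as the probe IP is arbitrary.

```python
import ipaddress


def build_bulk_query(prefixes: list[str]) -> bytes:
    """Build one bulk WHOIS payload with a probe IP per prefix."""
    lines = ["begin", "verbose"]
    for prefix in prefixes:
        net = ipaddress.ip_network(prefix, strict=False)
        # Probe the network address; any address inside the prefix would do.
        lines.append(str(net.network_address))
    lines.append("end")
    return ("\n".join(lines) + "\n").encode("ascii")


def parse_bulk_response(text: str) -> dict[str, bool]:
    """Map probe IP -> True when an origin AS is reported for it."""
    visible = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 3 or fields[0].upper() == "AS":
            continue  # banner, blank line, or column header
        asn, ip = fields[0], fields[1]
        visible[ip] = bool(asn) and asn.upper() != "NA"
    return visible
```

In use, the payload would be written to a single TCP connection on port 43 and the full reply passed to `parse_bulk_response`; the network step is left out here so the sketch stays testable offline.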
This dataset is not a magic truth oracle. It is a normalized view of operator-published geofeed data with explicit provenance.
Useful confidence signals:
- The row came from a public RIR-discovered geofeed.
- The prefix is inside the referring inetnum.
- The prefix is visible in BGP.
- The row has no schema or overlap flags.
- Future signature validation can confirm signed geofeeds.
Rows with flags are retained because they are useful for debugging and research, but consumers can filter them out.
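A consumer-side filter along those lines might look like the sketch below. `clean_rows` is a hypothetical helper, and the 0.8 confidence threshold is an arbitrary illustration, not a recommended value. It assumes the column names from the geofeed.csv header and that an empty `flags` field means the row has no flags.

```python
import csv
import gzip
from typing import Iterator


def clean_rows(path: str, min_confidence: float = 0.8) -> Iterator[dict]:
    """Yield unflagged rows at or above a confidence threshold."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            if row["flags"].strip():
                continue  # skip rows carrying any flag
            # Treat a missing confidence value as zero.
            if float(row["confidence"] or 0) < min_confidence:
                continue
            yield row
```

Reading straight from geofeed.csv.gz avoids a separate decompression step, and the generator form lets callers stop early or chain further filters.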
Run tests:
python -m pytest
Compile check:
python -m compileall geofeed_harvester tests
This is an early harvester implementation. The core pipeline works, but the next valuable additions are:
- authenticated ARIN bulk adapter;
- first-class CMS signature discovery for signed geofeeds;
- optional release retention policy for historical daily datasets.