A service that continuously fetches IP address information from ipinfo.io and stores it in a ClickHouse database.
- Connects to ClickHouse Cloud
- Creates necessary database and tables if they don't exist
- Handles rate limiting for the ipinfo.io API (up to 10 requests per second)
- Implements retry logic and error handling
- Processes IPs in batches for efficiency
- Processes large tables incrementally by month to avoid memory issues
- Configurable fork digests for filtering IP addresses
- Dockerized for easy deployment
- Automatically sources IPs from your existing database tables
- One-time job mode for batch processing without continuous operation
Requirements:

- Docker and Docker Compose
- ClickHouse database (cloud or self-hosted)
- ipinfo.io API token
This application uses the clickhouse-connect library for ClickHouse communication, which is better suited for ClickHouse Cloud connections than the clickhouse-driver package.
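For reference, a minimal connection sketch with clickhouse-connect looks roughly like this (host, port, and credentials are placeholders that in this project come from the environment variables described below):

```python
import os

import clickhouse_connect

# Placeholder values: in this project they come from the .env configuration.
client = clickhouse_connect.get_client(
    host=os.getenv("CLICKHOUSE_HOST", "localhost"),
    port=int(os.getenv("CLICKHOUSE_PORT", "8443")),
    username=os.getenv("CLICKHOUSE_USER", "default"),
    password=os.getenv("CLICKHOUSE_PASSWORD", ""),
    secure=os.getenv("CLICKHOUSE_SECURE", "true").lower() == "true",
)

print(client.query("SELECT version()").result_rows)
```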
To get started:

- Clone this repository
- Copy `.env.example` to `.env` and update it with your credentials
- Run the service with Docker Compose

```bash
cp .env.example .env
# Edit .env with your settings
docker-compose up -d
```

Run the crawler once to process a single batch of IPs and exit:
```bash
# Local execution
python -m src.crawler --once

# With custom batch size
python -m src.crawler --once --batch-size 200

# Docker execution
docker-compose run --rm -e CRAWLER_MODE=once ip-crawler
```

The crawler can be run as a one-time job instead of a continuous service. This is useful for:
- Processing batches on demand
- Scheduled jobs (cron, Kubernetes CronJob)
- Testing and development
- Resource-constrained environments
Command line:

```bash
python -m src.crawler --once [--batch-size N]
```

Docker:

```bash
# Set environment variable
export CRAWLER_MODE=once
docker-compose up

# Or inline
docker-compose run --rm -e CRAWLER_MODE=once ip-crawler
```

Output: the one-time mode provides a summary of the batch processing results:
```
==================================================
SINGLE RUN SUMMARY
==================================================
Total IPs processed: 150
Successful: 147
Failed: 3
Success rate: 98.0%
Total in database: 15,420
Overall success rate: 96.8%
==================================================
```
Statistics file: results are saved to `logs/last_run_stats.json` for programmatic access.
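For example, a small script can pick up those results after a `--once` run (the key names inside the file are produced by the crawler and are not documented here, so this sketch simply prints whatever it finds):

```python
import json
from pathlib import Path

stats_path = Path("logs/last_run_stats.json")

if stats_path.exists():
    # Load whatever statistics the last one-time run wrote.
    stats = json.loads(stats_path.read_text())
    for key, value in stats.items():
        print(f"{key}: {value}")
else:
    print("No statistics file found - run the crawler with --once first.")
```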
All configuration is handled through environment variables in the `.env` file:

ClickHouse connection:

- `CLICKHOUSE_HOST` - ClickHouse server hostname
- `CLICKHOUSE_PORT` - ClickHouse server port
- `CLICKHOUSE_USER` - ClickHouse username
- `CLICKHOUSE_PASSWORD` - ClickHouse password
- `CLICKHOUSE_DATABASE` - Database name (default: `crawlers_data`)
- `CLICKHOUSE_SECURE` - Use a secure connection (`true`/`false`)

Tables:

- `IP_SOURCE_TABLE` - Table containing IP addresses to process (default: `ip_addresses`)
- `IP_INFO_TABLE` - Table to store IP information (default: `ipinfo`)

ipinfo.io API:

- `IPINFO_API_TOKEN` - ipinfo.io API token
- `IPINFO_RATE_LIMIT` - Requests-per-day limit (default: 1000)

Crawler settings:

- `BATCH_SIZE` - Number of IPs to process in a batch (default: 50)
- `SLEEP_INTERVAL` - Seconds to wait between batches (default: 60)
- `REQUEST_TIMEOUT` - Timeout in seconds for API requests (default: 10)
- `MAX_RETRIES` - Maximum number of retries for failed requests (default: 3)
- `RETRY_DELAY` - Seconds to wait between retries (default: 5)
- `FORK_DIGESTS` - Comma-separated list of fork digests to track (default: `0x56fdb5e0,0x824be431,0x21a6f836,0x3ebfd484,0x7d5aab40,0xf9ab5f85`)
- `CRAWLER_MODE` - Set to `once` for one-time job mode (Docker only)
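As a rough illustration (not the actual contents of `src/config.py`), these variables can be read with `os.getenv`, using the defaults listed above:

```python
import os

# Illustrative sketch only: the real src/config.py may be structured differently.
CLICKHOUSE_HOST = os.getenv("CLICKHOUSE_HOST", "localhost")
CLICKHOUSE_PORT = int(os.getenv("CLICKHOUSE_PORT", "8443"))  # assumed default port
CLICKHOUSE_SECURE = os.getenv("CLICKHOUSE_SECURE", "true").lower() == "true"

BATCH_SIZE = int(os.getenv("BATCH_SIZE", "50"))
SLEEP_INTERVAL = int(os.getenv("SLEEP_INTERVAL", "60"))
REQUEST_TIMEOUT = int(os.getenv("REQUEST_TIMEOUT", "10"))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
RETRY_DELAY = int(os.getenv("RETRY_DELAY", "5"))

# FORK_DIGESTS is a comma-separated string in the environment.
FORK_DIGESTS = [
    d.strip()
    for d in os.getenv(
        "FORK_DIGESTS",
        "0x56fdb5e0,0x824be431,0x21a6f836,0x3ebfd484,0x7d5aab40,0xf9ab5f85",
    ).split(",")
    if d.strip()
]
```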
The application creates the main `ipinfo` table to store IP information:

```sql
CREATE TABLE ipinfo (
    ip String,
    hostname String,
    city String,
    region String,
    country String,
    loc String,
    org String,
    postal String,
    timezone String,
    asn String,
    company String,
    carrier String,
    is_bogon Boolean DEFAULT false,
    is_mobile Boolean DEFAULT false,
    abuse_email String,
    abuse_phone String,
    error String,
    attempts UInt8 DEFAULT 1,
    success Boolean DEFAULT true,
    created_at DateTime DEFAULT now(),
    updated_at DateTime DEFAULT now()
) ENGINE = MergeTree()
ORDER BY (ip, updated_at);
```

The crawler automatically fetches IPs from the `nebula.visits` table that haven't been processed yet. It uses queries that process the table incrementally by month to avoid memory issues. The core filtering logic looks like this:
```sql
SELECT DISTINCT toString(ip) AS ip
FROM (
    SELECT JSONExtractString(toString(peer_properties), 'ip') AS ip
    FROM nebula.visits
    WHERE toStartOfMonth(visit_started_at) = toDate('YYYY-MM-01')
      AND (
        JSONExtractString(toString(peer_properties), 'fork_digest') IN ('0x56fdb5e0', '0x824be431', '0x21a6f836', '0x3ebfd484', '0x7d5aab40', '0xf9ab5f85')
        OR JSONExtractString(toString(peer_properties), 'next_fork_version') LIKE '%064%'
      )
)
WHERE ip != ''
LIMIT {batch_size}
```

To handle very large tables without encountering memory limitations, the crawler:
- Processes data month by month using the table's time partitioning
- Maintains state in a JSON file to track which months have been processed
- Automatically resumes from where it left off if restarted
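The sketch below shows roughly how such month-by-month processing with a resumable state file can work (hypothetical: the real logic lives in `src/partition_tracker.py`, and the state-file keys used here are assumptions):

```python
import json
from datetime import date
from pathlib import Path

STATE_FILE = Path("logs/partition_state.json")  # path from the docs; key names below are assumed

def load_state() -> dict:
    # Resume from the previously saved state, or start fresh.
    return json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

def save_state(state: dict) -> None:
    STATE_FILE.write_text(json.dumps(state, indent=2))

def months(start: date, end: date):
    # Yield the first day of every month from start to end.
    current = date(start.year, start.month, 1)
    while current <= end:
        yield current
        current = (date(current.year + 1, 1, 1) if current.month == 12
                   else date(current.year, current.month + 1, 1))

state = load_state()
last_done = state.get("last_completed_month")  # assumed key name

for month in months(date(2023, 1, 1), date.today()):  # start month is arbitrary here
    if last_done and month.isoformat() <= last_done:
        continue  # this month was finished in a previous run
    # ... run the month-scoped SELECT shown above for `month` and process the batch ...
    state["last_completed_month"] = month.isoformat()
    save_state(state)
```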
The list of fork digests to track can be updated in two ways:
- By changing the `FORK_DIGESTS` environment variable in your `.env` file and restarting the container
- By updating the environment variable while the container is running (it will detect the change automatically)
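One way such runtime detection can be implemented (a sketch only, assuming the mounted `.env` file is re-read at the start of each batch with the python-dotenv package; the crawler's actual mechanism may differ):

```python
from dotenv import dotenv_values  # python-dotenv

current_digests: list[str] = []

def refresh_fork_digests(env_path: str = ".env") -> list[str]:
    # Re-read the .env file and pick up FORK_DIGESTS changes
    # without restarting the process.
    global current_digests
    raw = dotenv_values(env_path).get("FORK_DIGESTS") or ""
    new_digests = [d.strip() for d in raw.split(",") if d.strip()]
    if new_digests != current_digests:
        print(f"Fork digest list changed: {current_digests} -> {new_digests}")
        current_digests = new_digests
    return current_digests
```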
The crawler creates log files in the `logs` directory, which is mounted as a volume. You can monitor the crawler's activity with:

```bash
docker-compose logs -f
```

The `logs` directory also contains:

- `crawler.log` - Main application logs
- `health.log` - Current status for the healthcheck
- `partition_state.json` - Tracks which months have been processed
- `last_run_stats.json` - Statistics from the last one-time job run
The container includes a health check that verifies the crawler is running by checking for the existence of a health log file.
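As an illustration of what such a check might amount to (the actual healthcheck command is defined in the Dockerfile and may differ, and the freshness threshold here is an assumption), a small script could verify that the health file exists and was written recently:

```python
import sys
import time
from pathlib import Path

HEALTH_FILE = Path("logs/health.log")
MAX_AGE_SECONDS = 300  # assumed: treat the crawler as unhealthy after 5 minutes of silence

def main() -> int:
    if not HEALTH_FILE.exists():
        return 1  # unhealthy: the crawler never wrote its health file
    age = time.time() - HEALTH_FILE.stat().st_mtime
    return 0 if age < MAX_AGE_SECONDS else 1

if __name__ == "__main__":
    sys.exit(main())
```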
Project structure:

```
.
├── .env.example          # Template for environment variables
├── docker-compose.yml    # Docker Compose configuration
├── Dockerfile            # Docker container definition
├── entrypoint.sh         # Container entrypoint script
├── migrations/           # Database migration SQL files
│   ├── 01_create_database.sql
│   └── 02_create_ipinfo_table.sql
├── README.md             # Project documentation
├── requirements.txt      # Python dependencies
└── src/                  # Source code
    ├── __init__.py
    ├── config.py              # Configuration management
    ├── crawler.py             # Main crawler logic
    ├── db.py                  # Database interaction
    ├── migrations.py          # Database migration runner
    ├── partition_tracker.py   # Manages incremental processing
    └── utils.py               # Utility functions
```
For development without Docker:
- Create a virtual environment and install the requirements:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  pip install -r requirements.txt
  ```

- Create a `.env` file with your configuration

- Run the migrations:

  ```bash
  python -m src.migrations
  ```

- Start the crawler (continuous mode):

  ```bash
  python -m src.crawler
  ```

- Or run it once:

  ```bash
  python -m src.crawler --once
  ```
If you were previously encountering `MEMORY_LIMIT_EXCEEDED` errors when querying large tables, the incremental processing approach should solve this issue. The crawler now processes data month by month to keep memory usage low.
To check which months have been processed, examine the `partition_state.json` file in the `logs` directory. It contains information about:
- The last fully processed month
- The current month being processed
- Whether the current month is complete
- The list of fork digests being tracked
If you need to start processing from scratch, simply stop the container and delete the `partition_state.json` file from the `logs` directory.
If the one-time job reports "No new IPs to process", it means:
- All IPs in the current partition have already been processed
- The current partition is complete and no new partitions are available
Check `partition_state.json` to see the current processing status.
This project is licensed under the MIT License.