Log Parsing EN
likai edited this page Apr 25, 2026 · 7 revisions
This page explains how NginxPulse parses, incrementally scans, backfills, and cleans logs. For source setup, see Log Sources. For format examples, see Supported Log Formats. For Push Agent, see Agent Collection.
- Initial scan: parse the recent window after startup.
- Incremental scan: scan appended logs at the interval set by system.taskInterval.
- Historical backfill: fill older logs in the background without blocking realtime parsing.
- IP geo backfill: resolve IP locations asynchronously after logs are inserted.
- State file: var/nginxpulse_data/nginx_scan_state.json
- If the current file size is smaller than the last recorded size, the file is treated as rotated and parsed from the beginning.
- Site ID is derived from websites[].name. Renaming a site creates a new site and reparses its logs.
- .gz logs are parsed as whole files, and change detection is based on file metadata.
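The rotation rule above can be sketched as follows. This is a minimal illustration, not NginxPulse's actual code: the function names `plan_scan` and `read_new_lines` are hypothetical, and the recorded size is passed in directly rather than read from the state file, whose layout is not described here.

```python
import os


def plan_scan(path: str, recorded_size: int) -> int:
    """Return the byte offset at which the next scan should start."""
    current_size = os.path.getsize(path)
    if current_size < recorded_size:
        # The file shrank since the last scan: treat it as rotated
        # and parse from the beginning.
        return 0
    # Otherwise only the appended bytes are new.
    return recorded_size


def read_new_lines(path: str, recorded_size: int):
    """Read lines added since the last scan; return (lines, new recorded size)."""
    start = plan_scan(path, recorded_size)
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read()
    lines = data.decode("utf-8", errors="replace").splitlines()
    return lines, start + len(data)
```

A caller would persist the returned size after each scan and pass it back in on the next one.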
- system.parseBatchSize controls the insert batch size (default 100).
- It can be overridden by LOG_PARSE_BATCH_SIZE.
- Larger batches can improve historical import throughput, but increase transaction pressure and memory usage.
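One way to picture the override order is the sketch below. It assumes LOG_PARSE_BATCH_SIZE is an environment variable and that invalid or non-positive values fall back to the next source; both points are assumptions for illustration, and `resolve_batch_size` is a hypothetical helper, not a NginxPulse function.

```python
import os

DEFAULT_PARSE_BATCH_SIZE = 100  # default stated in the list above


def resolve_batch_size(config: dict, env=None) -> int:
    """Pick the batch size: override first, then config, then default."""
    env = os.environ if env is None else env
    raw = env.get("LOG_PARSE_BATCH_SIZE")
    if raw is not None:
        try:
            value = int(raw)
            if value > 0:
                return value
        except ValueError:
            pass  # assumption: an unparsable override is ignored
    value = config.get("system", {}).get("parseBatchSize")
    if isinstance(value, int) and value > 0:
        return value
    return DEFAULT_PARSE_BATCH_SIZE
```

For example, `resolve_batch_size({"system": {"parseBatchSize": 500}}, {"LOG_PARSE_BATCH_SIZE": "1000"})` returns 1000 under these assumptions.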
Endpoint: GET /api/status
- log_parsing_progress: log parsing progress, from 0 to 1.
- log_parsing_estimated_remaining_seconds: estimated remaining seconds for log parsing.
- ip_geo_progress: IP geo progress, from 0 to 1.
- ip_geo_estimated_remaining_seconds: estimated remaining seconds for IP geo backfill.
The frontend can poll this endpoint to refresh progress.
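A client other than the frontend could poll the same endpoint. The sketch below assumes the four fields appear at the top level of a JSON object and that the service is reachable at a base URL you supply; neither detail is confirmed here, and `fetch_status` and `describe_progress` are hypothetical names.

```python
import json
import urllib.request


def fetch_status(base_url: str, timeout: float = 5.0) -> dict:
    """GET {base_url}/api/status and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/status", timeout=timeout) as resp:
        return json.load(resp)


def describe_progress(status: dict) -> str:
    """Turn the progress fields into a short human-readable summary."""
    parts = []
    for label, prefix in (("log parsing", "log_parsing"), ("IP geo", "ip_geo")):
        progress = status.get(f"{prefix}_progress")
        if progress is None:
            continue  # field absent: nothing to report for this task
        text = f"{label}: {progress * 100:.0f}%"
        remaining = status.get(f"{prefix}_estimated_remaining_seconds")
        if remaining is not None:
            text += f" (~{int(remaining)}s left)"
        parts.append(text)
    return ", ".join(parts)
```

Calling `describe_progress(fetch_status(base_url))` in a loop with a sleep of a few seconds gives a simple command-line progress view.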
- system.logRetentionDays controls retention, default 30.
- Cleanup runs at 02:00 in the system timezone.
- Cleanup only removes parsed access data in the database. It does not delete raw log files.
- Application logs in var/nginxpulse_data/nginxpulse.log use log rotation and are not controlled by logRetentionDays.
- Restart the service/container after changing logRetentionDays.
- To rebuild historical parsed data with a new retention value, restart and run “Reparse”.
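The schedule and cutoff can be worked through with plain date arithmetic. This sketch assumes the cutoff is simply "now minus the retention period" and uses naive local datetimes; how NginxPulse computes the exact boundary is not stated here, and both function names are hypothetical.

```python
from datetime import datetime, timedelta


def next_cleanup(now: datetime) -> datetime:
    """Return the next 02:00 strictly after `now` (local time)."""
    run = now.replace(hour=2, minute=0, second=0, microsecond=0)
    return run if run > now else run + timedelta(days=1)


def retention_cutoff(now: datetime, retention_days: int = 30) -> datetime:
    """Parsed rows older than this timestamp would be eligible for removal."""
    return now - timedelta(days=retention_days)
```

With the default of 30 days, a cleanup running at 02:00 on 2026-04-25 would use a cutoff of 02:00 on 2026-03-26 under this assumption.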
- Parsing writes core fields first and queues IP geo work.
- IP geo is resolved in batches in the background.
- For better throughput: increase parseBatchSize, shorten import-time taskInterval, improve disk/database IO, or split logs by day.
When optimizing large log imports, “streaming parsing” can mean two different things:
- Current implementation: read line by line, parse line by line, then synchronously batch insert.
- Ideal streaming architecture: split read / parse / clean / write into independently concurrent pipeline stages.
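The first shape, line-by-line parsing followed by a synchronous batch insert, looks roughly like the sketch below. The four-field line format and `parse_line` are invented for illustration, and `insert_batch` stands in for whatever performs the database write; none of this is NginxPulse's actual parser.

```python
from typing import Callable, Iterable, Optional


def parse_line(line: str) -> Optional[dict]:
    """Hypothetical parser for lines shaped like 'ip method path status'."""
    parts = line.split()
    if len(parts) != 4 or not parts[3].isdigit():
        return None  # skip lines that do not match
    return {"ip": parts[0], "method": parts[1], "path": parts[2], "status": int(parts[3])}


def import_serial(
    lines: Iterable[str],
    insert_batch: Callable[[list], None],
    batch_size: int = 100,
) -> int:
    """Read, parse, and insert in one loop; return the number of rows written."""
    batch, total = [], 0
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        batch.append(record)
        if len(batch) >= batch_size:
            insert_batch(batch)  # reading pauses until the write returns
            total += len(batch)
            batch = []
    if batch:
        insert_batch(batch)  # flush the final partial batch
        total += len(batch)
    return total
```

Because reading stops while each batch is written, memory stays bounded by one batch, but no stage overlaps with another.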
Advantages of the current implementation:
- Simple implementation and shorter debugging path.
- More predictable memory usage without worker backlog explosions.
- Easier consistency for state files, progress, and backfill logic.
- Stable enough for single-node, single-site, low-to-medium growth logs.
Limitations of the current implementation:
- Read, parse, and write are mostly serial, so throughput is limited.
- Very large single-site historical imports are more likely to be CPU-bound on single-threaded parsing.
- Available CPU and disk capacity are not fully utilized.
Advantages of an ideal streaming architecture:
- Higher throughput ceiling across CPU, disk, and database.
- Better fit for high-volume multi-source realtime ingestion or massive historical imports.
- Easier to scale specific stages, such as parser workers or writer workers.
Costs of an ideal streaming architecture:
- Much higher implementation and maintenance complexity.
- Requires backpressure, queue management, duplicate handling, ordering guarantees, and retry design.
- State tracking, progress display, and crash recovery all become more complex.
- Poor throttling can overwhelm PostgreSQL, disk IO, or memory.
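To make the backpressure point concrete, here is a minimal two-stage sketch with a bounded queue between a parser thread and a writer thread. It is an illustration of the pipelined shape only, not something NginxPulse implements: `run_pipeline` is a hypothetical name, and retries, ordering across multiple workers, and recovery from a failed writer are deliberately left out.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def run_pipeline(lines, parse, write_batch, batch_size=100, queue_size=1000):
    """Parse and write concurrently, coupled by a bounded queue."""
    # put() blocks when the queue is full, so a slow writer
    # throttles the parser instead of letting a backlog grow.
    parsed = queue.Queue(maxsize=queue_size)

    def parser():
        try:
            for line in lines:
                record = parse(line)
                if record is not None:
                    parsed.put(record)
        finally:
            parsed.put(_DONE)  # always release the writer

    def writer():
        batch = []
        while True:
            item = parsed.get()
            if item is _DONE:
                break
            batch.append(item)
            if len(batch) >= batch_size:
                write_batch(batch)
                batch = []
        if batch:
            write_batch(batch)  # flush the final partial batch

    threads = [threading.Thread(target=parser), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

In CPython, threads will not parallelize CPU-bound parsing; the sketch shows only the structure, and even this small version already needs a sentinel and a shutdown rule that the serial loop does not.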
For one site with hundreds of GB of historical logs, increasing parseBatchSize alone is usually not enough.
Recommended tuning order:
1. Tune system.taskInterval first
   - Default is usually 1m.
   - During import, temporarily use 5s or 10s.
   - After backfill finishes, restore it to 30s or 1m.
2. Tune system.parseBatchSize
   - Default is 100.
   - Increase gradually: try 500, then 1000.
   - If both the machine and PostgreSQL remain stable, consider 2000.
   - Too large a batch means heavier transactions, higher memory usage, and higher retry cost after failures.
3. Split logs by day
   - Daily or hourly files are easier to backfill, retry, and debug than one huge file.
   - Examples: "/share/logs/nginx/access-*.log" or "/share/logs/nginx/access-*.log.gz"
4. Watch disk and PostgreSQL performance
   - For single-site imports, bottlenecks often move to disk IO and database writes.
   - If PostgreSQL or NginxPulse runs on slow, network, or shared storage, increasing parseBatchSize may not help much.
Import-time example:
{
  "system": {
    "taskInterval": "5s",
    "parseBatchSize": 1000
  }
}

After import, use a more conservative config:
{
  "system": {
    "taskInterval": "30s",
    "parseBatchSize": 500
  }
}

If these symptoms appear, roll back the tuning:
- PostgreSQL CPU or IO remains high.
- Container/process memory rises noticeably.
- Logs frequently show database write failures, deadlock retries, or timeouts.
- Realtime new logs become slower in the frontend.

