halo-spider

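A Scrapy-inspired asynchronous crawling framework for Rust. It supports mixing code callbacks with a JSON DSL for scraping, and from now on uses only OPSX / OpenSpec as the project's specification and change workflow.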

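Current status

Library code lives in src/
Examples live in examples/
The current source of truth for specs is openspec/specs/
New requirements, designs, and tasks all start from openspec/changes/
The collaboration entry points generated by openspec init live in .claude/commands/opsx/ and .codex/skills/

The old design drafts, per-round task drafts, and .cursor rules/skills have been removed and are no longer collaboration entry points for this project.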

Quick start

[dependencies]
halo-spider = "0.0.4"
tokio = { version = "1", features = ["rt-multi-thread", "macros", "signal"] }
tracing-subscriber = "0.3"

use halo_spider::download::{Browser, Http};
use halo_spider::engine::Engine;
use halo_spider::error::SpiderError;
use halo_spider::response::Response;
use halo_spider::scheduler::Memory;
use halo_spider::settings::Settings;
use halo_spider::spider::{Output, Spider};
use halo_spider::value::Value;
use std::time::Duration;

struct MySpider;

impl Spider for MySpider {
    fn name(&self) -> &str {
        "my_spider"
    }

    fn start_urls(&self) -> Vec<String> {
        vec!["https://quotes.toscrape.com/".to_string()]
    }

    async fn parse(&self, response: &Response) -> Result<Output, SpiderError> {
        let items = response
            .css("div.quote span.text::text")
            .all()
            .into_iter()
            .map(|text| halo_spider::item::Item::new().with_field("text", Value::String(text)))
            .collect();

        Ok(Output {
            items,
            requests: vec![],
        })
    }
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt().init();

    let settings = Settings::default()
        .with_download_delay(Duration::from_millis(200))
        .with_idle_timeout(Duration::from_secs(5));

    let mut engine = Engine::new(
        Memory::default(),
        Http::default(),
        Browser::default(),
    )
    .with_settings(settings);

    let handle = engine.shutdown_handle();
    tokio::spawn(async move {
        tokio::signal::ctrl_c().await.ok();
        handle.stop();
    });

    engine.run(&MySpider).await.unwrap();
}

Examples

# Basic examples
cargo run --example quotes_code
cargo run --example quotes_dsl
cargo run --example custom_middleware
cargo run --example plugins_demo

# AI selector example (requires the OPENAI_API_KEY environment variable)
cargo run --example ai_extraction --features ai-selector

# Concurrency control example
cargo run --example concurrency_control

DSL authoring workflow (recommended)

Drive the spider with a JSON DSL rules file, with no parsing code to write:

use halo_spider::rules::Config as RulesConfig;
use halo_spider::spider::Spider;

struct MyDslSpider;

impl Spider for MyDslSpider {
    fn name(&self) -> &str {
        "my_dsl_spider"
    }

    fn start_urls(&self) -> Vec<String> {
        vec!["https://example.com".to_string()]
    }

    fn rules(&self) -> Option<RulesConfig> {
        Some(RulesConfig::local("path/to/rules.json"))
    }
}

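The engine automatically loads and compiles the rules, then dispatches to the DSL steps. For a sample rules file, see examples/rules/quotes.json.

Advanced usage: to control rule compilation and application manually, use compile_rules() and apply_dsl(); this is not the recommended path for getting started.

Known limitations:

HTML parsing does not yet support XPath selectors (the current XPath implementation is built on an XML parser and tolerates malformed HTML poorly)
For HTML, use CSS selectors instead of XPath
Some DSL features have no runtime logic yet (dedup/schedule/retry configuration is parsed, but the runtime behavior is still to be implemented)
See TODO.md for details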

DSL configuration options

meta (pass-through fields):

{
  "id": "parse_list",
  "meta": {
    "source": "homepage",
    "category": "news"
  }
}

dedup (deduplication settings):

{
  "dedup": {
    "enabled": true,
    "key": ["product_id"],
    "ttl": 86400,
    "scope": "TASK"
  }
}
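
Since the README notes that dedup is parsed but has no runtime logic yet, the following is only a rough sketch of one way the `key` and `ttl` fields could be interpreted. It assumes `ttl` is in seconds and ignores `scope`; `DedupFilter` and `dedup_key` are hypothetical names, not part of halo-spider's API.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical in-memory dedup filter (not halo-spider's implementation).
struct DedupFilter {
    ttl: Duration,
    seen: HashMap<String, Instant>,
}

impl DedupFilter {
    /// Assumption: the `ttl` value from the config is expressed in seconds.
    fn new(ttl_secs: u64) -> Self {
        Self {
            ttl: Duration::from_secs(ttl_secs),
            seen: HashMap::new(),
        }
    }

    /// Returns true when the key is new, or when its earlier entry has
    /// outlived the TTL; in both cases the key is (re)recorded at `now`.
    fn check_and_insert(&mut self, key: &str, now: Instant) -> bool {
        let fresh = match self.seen.get(key) {
            Some(first_seen) => now.duration_since(*first_seen) >= self.ttl,
            None => true,
        };
        if fresh {
            self.seen.insert(key.to_string(), now);
        }
        fresh
    }
}

/// Builds a dedup key by joining the values of the configured fields.
/// Returns None if any configured field is missing from the item.
fn dedup_key(fields: &[&str], item: &HashMap<String, String>) -> Option<String> {
    let mut parts = Vec::with_capacity(fields.len());
    for field in fields {
        parts.push(item.get(*field)?.as_str());
    }
    // A unit-separator character keeps multi-field keys unambiguous.
    Some(parts.join("\u{1f}"))
}
```

Passing `now` explicitly rather than calling `Instant::now()` inside the filter keeps the sketch easy to test with synthetic timestamps.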

retry (retry settings):

{
  "retry": {
    "count": 3,
    "http_status": [500, 502, 503],
    "backoff": [1000, 2000, 5000]
  }
}
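
The retry runtime is likewise listed as unimplemented, so the sketch below only illustrates one plausible reading of these fields. It assumes the `backoff` entries are milliseconds, that `count` caps the number of retries, and that attempts beyond the end of the `backoff` list reuse its last entry; `RetryConfig` and `next_delay` are hypothetical names.

```rust
use std::time::Duration;

/// Hypothetical mirror of the `retry` block; field names follow the JSON above.
struct RetryConfig {
    count: u32,
    http_status: Vec<u16>,
    /// Assumption: `backoff` values are milliseconds.
    backoff_ms: Vec<u64>,
}

impl RetryConfig {
    /// Decides whether a response with `status` should be retried.
    /// `attempt` is the zero-based number of retries already performed.
    /// Returns the delay to wait before retrying, or None to give up.
    fn next_delay(&self, status: u16, attempt: u32) -> Option<Duration> {
        if attempt >= self.count || !self.http_status.contains(&status) {
            return None;
        }
        // Clamp to the last backoff entry when the list is shorter than `count`.
        let idx = (attempt as usize).min(self.backoff_ms.len().saturating_sub(1));
        let ms = self.backoff_ms.get(idx).copied().unwrap_or(0);
        Some(Duration::from_millis(ms))
    }
}
```

With the configuration shown above, a 503 response would be retried after 1000 ms, 2000 ms, and 5000 ms under these assumptions, while a 404 would not be retried at all.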

schedule (scheduling settings):

{
  "schedule": {
    "concurrency": 2,
    "interval": 1000
  }
}
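
The README does not define the semantics of `concurrency` and `interval`, and the schedule runtime is not implemented yet. As a guess at the intent, the sketch below treats `concurrency` as the maximum number of in-flight requests for a step and `interval` as the minimum gap in milliseconds between request starts. `StepSchedule` is a hypothetical type, not part of halo-spider.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-step gate combining a concurrency cap with a start interval.
struct StepSchedule {
    concurrency: usize,
    interval: Duration,
    in_flight: usize,
    last_start: Option<Instant>,
}

impl StepSchedule {
    /// Assumption: `interval` from the config is expressed in milliseconds.
    fn new(concurrency: usize, interval_ms: u64) -> Self {
        Self {
            concurrency,
            interval: Duration::from_millis(interval_ms),
            in_flight: 0,
            last_start: None,
        }
    }

    /// Returns true and records the start if a request may begin at `now`.
    fn try_start(&mut self, now: Instant) -> bool {
        if self.in_flight >= self.concurrency {
            return false;
        }
        if let Some(last) = self.last_start {
            if now.duration_since(last) < self.interval {
                return false;
            }
        }
        self.in_flight += 1;
        self.last_start = Some(now);
        true
    }

    /// Marks one in-flight request as finished.
    fn finish(&mut self) {
        self.in_flight = self.in_flight.saturating_sub(1);
    }
}
```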

For complete examples, see the examples/rules/ directory.

AI selector

Use the OpenAI API for intelligent content extraction:

[dependencies]
halo-spider = { version = "0.0.4", features = ["ai-selector"] }

// Set the API key (preferably read from the environment variable)
let settings = Settings::default()
    .with_openai_api_key(std::env::var("OPENAI_API_KEY").expect("OPENAI_API_KEY must be set"))
    .with_openai_model("gpt-4o-mini");

// Use a custom API endpoint (any OpenAI-compatible service)
let settings = Settings::default()
    .with_openai_api_key("your-api-key")
    .with_openai_base_url("https://your-api-endpoint.com/v1")
    .with_openai_model("your-model-name");

// Use it inside parse; retries and timeout are configurable
async fn parse(&self, response: &Response) -> Result<Output, SpiderError> {
    let mut query = response.ai("Extract the main article title and summary")
        .with_max_retries(3)
        .with_timeout(Duration::from_secs(30));
    query.execute().await.map_err(|e| SpiderError::parse(e))?;

    if let Some(result) = query.one() {
        println!("AI extracted: {}", result);
    }
    Ok(Output::empty())
}

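Features:

Automatic retries (exponential backoff)
Configurable timeout
Thorough error handling

Note: AI calls incur API costs, so use them only for complex content-extraction scenarios.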
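
To make the exponential backoff in the feature list concrete, here is a generic, synchronous sketch of the technique. It is not halo-spider's retry code: the base delay, the cap, and the names `backoff_delay` and `retry_with_backoff` are all assumptions chosen for illustration.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (zero-based): base * 2^attempt, capped.
fn backoff_delay(base: Duration, attempt: u32, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(cap)
}

/// Runs `op` until it succeeds or `max_retries` retries have been used,
/// calling `sleep` with an exponentially growing delay between attempts.
/// `sleep` is injected so that callers (and tests) control how waiting happens.
fn retry_with_backoff<T, E>(
    max_retries: u32,
    base: Duration,
    cap: Duration,
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of retries: surface the last error to the caller.
            Err(err) if attempt >= max_retries => return Err(err),
            Err(_) => {
                sleep(backoff_delay(base, attempt, cap));
                attempt += 1;
            }
        }
    }
}
```

In blocking code the `sleep` argument could simply be `std::thread::sleep`; an async version would await a timer instead.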

Concurrency control settings

let settings = Settings::default()
    .with_concurrent_requests(16)              // global maximum number of concurrent requests
    .with_concurrent_requests_per_domain(8)    // maximum concurrent requests per domain
    .with_connection_pool_size(100)            // HTTP connection pool size
    .with_download_delay(Duration::from_millis(200));  // delay between requests

See examples/concurrency_control.rs for a complete example.
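
How a global limit and a per-domain limit interact can be sketched with two counters. The following is a simplified, single-threaded illustration of the idea, not the engine's actual mechanism; `ConcurrencyLimiter` and its methods are hypothetical names.

```rust
use std::collections::HashMap;

/// Hypothetical limiter: a request needs both a global and a per-domain slot.
struct ConcurrencyLimiter {
    global_limit: usize,
    per_domain_limit: usize,
    in_flight_total: usize,
    in_flight_by_domain: HashMap<String, usize>,
}

impl ConcurrencyLimiter {
    fn new(global_limit: usize, per_domain_limit: usize) -> Self {
        Self {
            global_limit,
            per_domain_limit,
            in_flight_total: 0,
            in_flight_by_domain: HashMap::new(),
        }
    }

    /// Returns true and reserves a slot if both limits allow another request.
    fn try_acquire(&mut self, domain: &str) -> bool {
        let current = self.in_flight_by_domain.get(domain).copied().unwrap_or(0);
        if self.in_flight_total >= self.global_limit || current >= self.per_domain_limit {
            return false;
        }
        self.in_flight_total += 1;
        *self.in_flight_by_domain.entry(domain.to_string()).or_insert(0) += 1;
        true
    }

    /// Frees the slot held by a finished request to `domain`.
    fn release(&mut self, domain: &str) {
        if let Some(count) = self.in_flight_by_domain.get_mut(domain) {
            if *count > 0 {
                *count -= 1;
                self.in_flight_total = self.in_flight_total.saturating_sub(1);
            }
        }
    }
}
```

An async engine would more likely hold semaphore permits than poll counters like this, but the admission rule (both limits must have room) is the same.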

Contributing

To take part in development or learn about the project's development process, see CONTRIBUTING.md.

License

MIT
