Gospy


Gospy is a flexible web crawler framework that allows you to develop a complete web crawler in a few minutes.

Its minimalist, unified API greatly reduces the learning cost for new users. With it, you can focus on the data itself rather than implementing a complicated web crawler from scratch. If you are familiar with Java and want to grab some interesting data, hold on: you will have your first crawler running in a few minutes. OK, let's start!

Features

  • Portable, flexible, and modular (you can use a single module on its own, or plug your own module into your Gospy-based crawler)
  • Runs in stand-alone mode (multi-threaded), in distributed mode (via RabbitMQ or Hprose), or in both at once
  • Built-in PhantomJS and Selenium support: call the WebDriver directly to build a browser-kernel-based crawler
  • Element extraction based on regex, XPath, and Jsoup, covering simple to complex tasks
  • Object-oriented processing with annotations
  • Practical structural abstractions, from task scheduling to data persistence
  • Provides a robots.txt interpreter (easy to use if you need it)

Install

Download jar:

Release Version   JDK Version   Release Date   Links
0.2.1-beta        1.8+          07.04.2017     download
0.2.2-beta        1.8+          21.05.2017     download

To add a dependency using Maven, use the following:

<dependency>
    <groupId>cc.gospy</groupId>
    <artifactId>gospy-core</artifactId>
    <version>0.2.2</version>
</dependency>

To add a dependency using Gradle:

compile 'cc.gospy:gospy-core:0.2.2'
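
For reference, in a typical build.gradle this coordinate sits inside the dependencies block (a minimal sketch, using the same group and artifact as the Maven snippet above):

dependencies {
    compile 'cc.gospy:gospy-core:0.2.2'
}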

Learn about Gospy

Module division:

http://7xp1jv.com1.z0.glb.clouddn.com/gospy/img/single-infra.jpg

Run in cluster by Hprose:

http://7xp1jv.com1.z0.glb.clouddn.com/gospy/img/cluster-rpc-infra.jpg

Run in cluster under RabbitMQ-Server runtime:

http://7xp1jv.com1.z0.glb.clouddn.com/gospy/img/cluster-rabbitmq-infra.jpg

Quick start

Visit and print a webpage:

Gospy.custom()
        .setScheduler(Schedulers.VerifiableScheduler.getDefault())
        .addFetcher(Fetchers.HttpFetcher.getDefault())
        .addProcessor(Processors.UniversalProcessor.getDefault())
        .addPipeline(Pipelines.ConsolePipeline.custom().bytesToString().build())
        .build().addTask("https://github.com/zhangjiupeng/gospy").start();
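
This assembles Gospy's default scheduler, HTTP fetcher, and universal processor with a console pipeline; bytesToString() presumably converts the fetched response bytes to text before printing, and addTask(...) seeds the crawl with a single URL.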

Customize the fetcher and set multiple pipelines:

String dir = "D:/"; // specify a valid directory on your OS
Gospy.custom()
        .setScheduler(Schedulers.VerifiableScheduler.custom()
                .setTaskQueue(new PriorityTaskQueue()) // specify a priority queue
                .build())
        .addFetcher(Fetchers.HttpFetcher.custom()
                .setAutoKeepAlive(false)
                .before(request -> { // custom request
                    request.setHeader("Accept", "text/html,image/webp,*/*;q=0.8");
                    request.setHeader("Accept-Encoding", "gzip, deflate, sdch");
                    request.setHeader("Accept-Language", "zh-CN,zh;q=0.8");
                    request.setHeader("Cache-Control", "max-age=0");
                    request.setHeader("Connection", "keep-alive");
                    request.setHeader("DNT", "1");
                    request.setHeader("Host", request.getURI().getHost());
                    request.setHeader("User-Agent", UserAgent.DEFAULT);
                })
                .build())
        .addProcessor(Processors.UniversalProcessor.getDefault())
        .addPipeline(Pipelines.ConsolePipeline.getDefault()) // add multiple pipelines
        .addPipeline(Pipelines.SimpleFilePipeline.custom().setDir(dir).build())
        .build()
        .addTask("https://zhangjiupeng.com/logo.png")
        .addTask("https://www.baidu.com/img/bd_logo1.png")
        .addTasks(UrlBundle.parse("https://www.baidu.com/s?wd=gospy&pn={0~90~10}"))
        .start();
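
The addTasks call above uses a UrlBundle placeholder of the form {start~end~step}. Assuming it expands the pn parameter from 0 to 90 in steps of 10, it is roughly equivalent to seeding the paged URLs by hand, as in this sketch (gospy stands for the instance returned by build()):

// hypothetical hand-written equivalent of UrlBundle.parse("...&pn={0~90~10}"),
// assuming the placeholder expands to pn = 0, 10, 20, ..., 90
for (int pn = 0; pn <= 90; pn += 10) {
    gospy.addTask("https://www.baidu.com/s?wd=gospy&pn=" + pn);
}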

Save a page screenshot with PhantomJS:

String phantomJsPath = "/path/to/phantomjs.exe";
String savePath = "D:/capture.png";
Gospy.custom()
        .setScheduler(Schedulers.VerifiableScheduler.custom()
                .setPendingTimeInSeconds(60)
                .build())
        .addFetcher(Fetchers.TransparentFetcher.getDefault())
        .addProcessor(Processors.PhantomJSProcessor.custom()
                .setPhantomJsBinaryPath(phantomJsPath)
                .setWebDriverExecutor((page, webDriver) -> {
                    TakesScreenshot screenshot = (TakesScreenshot) webDriver;
                    File src = screenshot.getScreenshotAs(OutputType.FILE);
                    FileUtils.copyFile(src, new File(savePath));
                    return new Result<>();
                })
                .build())
        .build()
        .addTask("phantomjs://https://www.taobao.com")
        .start();
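
Note that this example also requires Selenium (TakesScreenshot, OutputType) and Apache Commons IO (FileUtils) on the classpath, in addition to a local PhantomJS binary whose path is set via setPhantomJsBinaryPath. The phantomjs:// prefix on the task URL is presumably what routes the task through the TransparentFetcher to the PhantomJSProcessor.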

Crawl with an annotated class:

@UrlPattern("http://www.baidu.com/.*\\.php") // tasks matching this regex will be processed
public static class BaiduHomepageProcessor extends PageProcessor {
    @ExtractBy.XPath("/html/head/title/text()")
    public String title;

    @ExtractBy.XPath("//*[@id='u1']/a/@href") // fill element data by xpath
    @ExtractBy.XPath("//*[@id='head']/div/div[4]/div/div[2]/div[1]/div/a/@href")
    public Set<String> topBarLinks;

    @ExtractBy.Regex(value = "id=\"su\" value=\"(.*?)\"", group = 1) // fill by regex
    public String searchBtnValue;

    @ExtractBy.XPath
    public String[] allLinks;

    @Override
    public void process() { 
        // process after data filling
        System.out.println("Task url      :" + task.getUrl());
        System.out.println("Title         :" + title);
        System.out.println("Search slogan :" + searchBtnValue);
        System.out.println("Top bar links :");
        topBarLinks.forEach(System.out::println);
    }

    @Override
    public Collection<Task> getNewTasks() {
        return Arrays.asList(new Task("https://www.baidu.com/img/bd_logo1.png"));
    }

    @Override
    @Experimental
    public Object[] getResultData() {
        return Arrays.asList(allLinks).stream()
                .filter(s -> s.matches("^https?://((?!javascript:|mailto:| ).)*")).toArray();
    }
}
Gospy.custom()
        .setScheduler(Schedulers.VerifiableScheduler.getDefault())
        .addFetcher(Fetchers.HttpFetcher.getDefault())
        .addPageProcessor(BaiduHomepageProcessor.class)
        .addProcessor(Processors.UniversalProcessor.getDefault())
        .addPipeline(Pipelines.ConsolePipeline.getDefault())
        .build().addTask("http://www.baidu.com/index.php").start();
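
In this example the @ExtractBy fields are filled from the fetched page before process() runs; getNewTasks() supplies follow-up tasks back to the scheduler, and getResultData() (marked @Experimental) exposes the filtered links as result data for the pipelines. addPageProcessor(BaiduHomepageProcessor.class) registers the class so that only tasks matching its @UrlPattern are handled by it.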

more examples

Troubleshoot

Common questions will be collected and listed here.

Contribute & Contact

Contributions to this project are welcome; anyone who makes a significant contribution will be listed here.

If you are interested in this project, please give it a star. If you have any questions, you can reach us in the following ways:

create an issue | chat on gitter | send an email

Thanks

License

Copyright 2017 ZhangJiupeng

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.