Gospy is a flexible web crawler framework that lets you develop a complete web crawler in a few minutes.
Its minimalist, unified API greatly reduces the learning cost for new users. With it, you can focus on the data itself rather than implementing a complicated web crawler from scratch. If you are familiar with Java and want to grab some interesting data, hold on: you will have your first crawler running in a few minutes. OK, let's start!
- Portable, flexible and modular (you can use just one of the modules, or add your own module to your Gospy-based crawler)
- Can operate in stand-alone mode (multi-threaded), in distributed mode (RabbitMQ or Hprose), or in both at once
- Built-in PhantomJS and Selenium support: you can call the WebDriver directly to build a browser-kernel-based web crawler
- Element extraction based on RegEx, XPath and Jsoup, which cover tasks from simple to complex
- Support object-oriented processing with annotations
- Practical structural abstraction, from task scheduling to data persistence
- Provides a robots.txt interpreter (optional, use it if you need it)
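
To make the robots.txt item above concrete, here is a minimal sketch of what such an interpreter has to decide. It does not use Gospy's own interpreter, whose API is not shown in this README. The class `RobotsTxtSketch` and its methods are hypothetical, and the logic is deliberately simplified: it only honors `Disallow` prefixes in `User-agent: *` groups and ignores `Allow`, wildcards and agent-specific rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified robots.txt check; not Gospy's implementation.
final class RobotsTxtSketch {

    /** Collects the Disallow prefixes declared in "User-agent: *" groups. */
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        boolean lastWasUserAgent = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.replaceFirst("#.*$", "").trim(); // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines belong to the same group.
                boolean wildcard = value.equals("*");
                inWildcardGroup = lastWasUserAgent ? (inWildcardGroup || wildcard) : wildcard;
                lastWasUserAgent = true;
            } else {
                if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                    prefixes.add(value);
                }
                lastWasUserAgent = false;
            }
        }
        return prefixes;
    }

    /** A path is allowed unless it starts with one of the collected prefixes. */
    static boolean isAllowed(String robotsTxt, String path) {
        for (String prefix : disallowedForAll(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\n";
        System.out.println(isAllowed(robots, "/private/data"));
        System.out.println(isAllowed(robots, "/public"));
    }
}
```

A crawler would typically run a check like this before scheduling a task, so that disallowed paths never reach the fetcher.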
Download jar:
Release Version | JDK Version compatibility | Release Date | Links |
---|---|---|---|
0.2.1-beta | 1.8+ | 07.04.2017 | download |
0.2.2-beta | 1.8+ | 21.05.2017 | download |
To add a dependency using Maven, use the following:
<dependency>
    <groupId>cc.gospy</groupId>
    <artifactId>gospy-core</artifactId>
    <version>0.2.2</version>
</dependency>
To add a dependency using Gradle:
compile 'cc.gospy:gospy-core:0.2.2'
Module division:
Run in cluster by Hprose:
Run in cluster under RabbitMQ-Server runtime:
Visit and print the webpage:
Gospy.custom()
        .setScheduler(Schedulers.VerifiableScheduler.getDefault())
        .addFetcher(Fetchers.HttpFetcher.getDefault())
        .addProcessor(Processors.UniversalProcessor.getDefault())
        .addPipeline(Pipelines.ConsolePipeline.custom().bytesToString().build())
        .build().addTask("https://github.com/zhangjiupeng/gospy").start();
Customize the fetcher and set multiple pipelines:
String dir = "D:/"; // you need to specify a valid directory on your OS
Gospy.custom()
        .setScheduler(Schedulers.VerifiableScheduler.custom()
                .setTaskQueue(new PriorityTaskQueue()) // specify a priority queue
                .build())
        .addFetcher(Fetchers.HttpFetcher.custom()
                .setAutoKeepAlive(false)
                .before(request -> { // customize the request
                    request.setHeader("Accept", "text/html,image/webp,*/*;q=0.8");
                    request.setHeader("Accept-Encoding", "gzip, deflate, sdch");
                    request.setHeader("Accept-Language", "zh-CN,zh;q=0.8");
                    request.setHeader("Cache-Control", "max-age=0");
                    request.setHeader("Connection", "keep-alive");
                    request.setHeader("DNT", "1");
                    request.setHeader("Host", request.getURI().getHost());
                    request.setHeader("User-Agent", UserAgent.DEFAULT);
                })
                .build())
        .addProcessor(Processors.UniversalProcessor.getDefault())
        .addPipeline(Pipelines.ConsolePipeline.getDefault()) // add multiple pipelines
        .addPipeline(Pipelines.SimpleFilePipeline.custom().setDir(dir).build())
        .build()
        .addTask("https://zhangjiupeng.com/logo.png")
        .addTask("https://www.baidu.com/img/bd_logo1.png")
        .addTasks(UrlBundle.parse("https://www.baidu.com/s?wd=gospy&pn={0~90~10}"))
        .start();
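
The `{0~90~10}` placeholder passed to `UrlBundle.parse` above looks like a `{start~end~step}` range. The sketch below shows what such an expansion could produce. It is not Gospy's implementation: the class `UrlRangeSketch` is hypothetical, and both the inclusive end value and the single-placeholder handling are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of a {start~end~step} URL template; not Gospy's UrlBundle.
final class UrlRangeSketch {
    // Matches one placeholder made of three non-negative integers.
    private static final Pattern RANGE = Pattern.compile("\\{(\\d+)~(\\d+)~(\\d+)\\}");

    /** Expands the first range placeholder; returns the template unchanged if there is none. */
    static List<String> expand(String template) {
        List<String> urls = new ArrayList<>();
        Matcher m = RANGE.matcher(template);
        if (!m.find()) {
            urls.add(template);
            return urls;
        }
        int start = Integer.parseInt(m.group(1));
        int end = Integer.parseInt(m.group(2));
        int step = Integer.parseInt(m.group(3));
        if (step <= 0) {
            throw new IllegalArgumentException("step must be positive");
        }
        String prefix = template.substring(0, m.start());
        String suffix = template.substring(m.end());
        for (int i = start; i <= end; i += step) { // assumption: end is inclusive
            urls.add(prefix + i + suffix);
        }
        return urls;
    }

    public static void main(String[] args) {
        expand("https://www.baidu.com/s?wd=gospy&pn={0~90~10}").forEach(System.out::println);
    }
}
```

Under these assumptions the template yields ten URLs, with `pn` running from 0 to 90 in steps of 10, which corresponds to ten result pages of a search.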
Save a page screenshot with PhantomJS:
String phantomJsPath = "/path/to/phantomjs.exe";
String savePath = "D:/capture.png";
Gospy.custom()
        .setScheduler(Schedulers.VerifiableScheduler.custom()
                .setPendingTimeInSeconds(60)
                .build())
        .addFetcher(Fetchers.TransparentFetcher.getDefault())
        .addProcessor(Processors.PhantomJSProcessor.custom()
                .setPhantomJsBinaryPath(phantomJsPath)
                .setWebDriverExecutor((page, webDriver) -> {
                    TakesScreenshot screenshot = (TakesScreenshot) webDriver;
                    File src = screenshot.getScreenshotAs(OutputType.FILE);
                    FileUtils.copyFile(src, new File(savePath));
                    return new Result<>();
                })
                .build())
        .build()
        .addTask("phantomjs://https://www.taobao.com")
        .start();
Crawl with an annotated class:
@UrlPattern("http://www.baidu.com/.*\\.php") // tasks matching this regex will be processed
public static class BaiduHomepageProcessor extends PageProcessor {
    @ExtractBy.XPath("/html/head/title/text()")
    public String title;

    @ExtractBy.XPath("//*[@id='u1']/a/@href") // fill element data by XPath
    @ExtractBy.XPath("//*[@id='head']/div/div[4]/div/div[2]/div[1]/div/a/@href")
    public Set<String> topBarLinks;

    @ExtractBy.Regex(value = "id=\"su\" value=\"(.*?)\"", group = 1) // fill by regex
    public String searchBtnValue;

    @ExtractBy.XPath
    public String[] allLinks;

    @Override
    public void process() {
        // runs after the annotated fields have been filled
        System.out.println("Task url :" + task.getUrl());
        System.out.println("Title :" + title);
        System.out.println("Search slogan :" + searchBtnValue);
        System.out.println("Top bar links :");
        topBarLinks.forEach(System.out::println);
    }

    @Override
    public Collection<Task> getNewTasks() {
        return Arrays.asList(new Task("https://www.baidu.com/img/bd_logo1.png"));
    }

    @Override
    @Experimental
    public Object[] getResultData() {
        // keep only http(s) links that contain no spaces
        return Arrays.stream(allLinks)
                .filter(s -> s.matches("^https?://((?!javascript:|mailto:| ).)*")).toArray();
    }
}
Gospy.custom()
        .setScheduler(Schedulers.VerifiableScheduler.getDefault())
        .addFetcher(Fetchers.HttpFetcher.getDefault())
        .addPageProcessor(BaiduHomepageProcessor.class)
        .addProcessor(Processors.UniversalProcessor.getDefault())
        .addPipeline(Pipelines.ConsolePipeline.getDefault())
        .build().addTask("http://www.baidu.com/index.php").start();
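
To see what the expressions in the annotated class select, the sketch below evaluates the same title XPath, the same `id="su"` regex and the same link filter using only JDK classes. It does not show how Gospy fills the fields internally. The class `ExtractionSketch`, its method names and the sample `PAGE` markup are hypothetical, and the JDK XPath engine needs well-formed markup, which real HTML pages often are not.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical JDK-only illustration; Gospy's own extraction code is not shown here.
final class ExtractionSketch {
    // Tiny well-formed stand-in for a fetched page (made-up content).
    static final String PAGE =
            "<html><head><title>Example</title></head>"
            + "<body><input id=\"su\" value=\"Search\"/></body></html>";

    /** Evaluates an XPath expression against well-formed markup and returns the string result. */
    static String byXPath(String expression, String markup) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(markup)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    /** Returns the requested capture group of the first match, or null if nothing matches. */
    static String byRegex(String regex, int group, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(group) : null;
    }

    /** Same filter expression as getResultData() in the example above. */
    static boolean isHttpLink(String link) {
        return link.matches("^https?://((?!javascript:|mailto:| ).)*");
    }

    public static void main(String[] args) {
        System.out.println(byXPath("/html/head/title/text()", PAGE));
        System.out.println(byRegex("id=\"su\" value=\"(.*?)\"", 1, PAGE));
        System.out.println(isHttpLink("https://example.com/a"));
        System.out.println(isHttpLink("mailto:someone@example.com"));
    }
}
```

The link filter accepts only strings that start with `http://` or `https://` and contain no space, so `mailto:` and `javascript:` links are rejected by the scheme check alone.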
Common questions will be collected and listed here.
Code contributions to this project are welcome, and anyone who makes a significant contribution will be listed here.
If you are interested in this project, please give it a star. If you have any questions, you can contact us in the following ways:
create an issue | chat on gitter | send an email
Copyright 2017 ZhangJiupeng
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.