Crawler for gab website emails
This package provides I/O functions for fast file reading and writing.
Fess is a very powerful and easily deployable Enterprise Search Server.
Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
Easy-to-use lightweight web crawler
A set of reusable Java components that implement functionality common to any web crawler
Norconex HTTP Collector is a flexible web crawler for collecting, parsing, and manipulating data from the Internet (or Intranet) to various data repositories such as search engines.
An HTTP+HTTP/2 client for Android and Java applications.
A list of some crawlers
News crawling with SC - stores output as WARC
Open Source Web Crawler for Java
A scalable web crawler framework for Java.
Extract tables from PDF files
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
An agile, distributed crawler framework.
WebCollector is an open source web crawler framework based on Java. It provides some simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
A Java crawler application based on webmagic
A collection of awesome web crawlers and spiders in different languages
An HTML parser with XPath support, based on Jsoup. Maybe it is the best in Java, ha ha. Just try it.
This is a mirror of the script by Giuseppe Attardi, and contains history before the official repo started: https://github.com/attardi/wikiextractor --- Extracts and cleans text from a Wikipedia database dump and stores the output in a number of files of similar size in a given directory.
This repository contains Selenium test code, written in Java, for the San Market (سان مارکت) website.
Anthelion is a plugin for Apache Nutch to crawl semantic annotations within HTML pages
Simple Java web crawler
Simple Java web crawler
The CommonCrawl Crawler Engine and Related MapReduce code
A simple crawler that fetches all the news from http://mehrnews.ir