crawler

Developed for School of International Education, South China University of Technology

This is one module of a news collector system. The other modules are a GUI and an Apache Solr-based indexing module. The modules run on separate server machines and exchange data through HTTP requests that carry JSON strings.
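As a rough illustration of the HTTP/JSON exchange described above, the sketch below serializes one crawled page to a JSON string and wraps it in a POST request. It is not the project's actual code: the class name `PagePayload`, the field names `url`, `title`, and `body`, and the endpoint URL are all hypothetical, and it uses the JDK's built-in `java.net.http` API (Java 11+) with hand-written escaping instead of the Apache HttpClient and JsonObject classes the project relies on.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class PagePayload {

    // Escapes a Java string as a JSON string literal, including the quotes.
    static String quote(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters must be \\u-escaped in JSON.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }

    // Builds the JSON object for one page (field names are assumptions).
    static String toJson(String url, String title, String body) {
        return "{\"url\":" + quote(url)
                + ",\"title\":" + quote(title)
                + ",\"body\":" + quote(body) + "}";
    }

    // Wraps the JSON string in a POST request; nothing is sent here.
    static HttpRequest buildRequest(String endpoint, String json) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/json; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) {
        String json = toJson("http://example.com/news/1", "Sample title", "Sample body");
        // Hypothetical address of the indexing module.
        HttpRequest request = buildRequest("http://localhost:8080/indexer/pages", json);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(json);
    }
}
```

Sending the request would take one more call, for example `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, which needs the receiving module to be running.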

A simple crawler that collects pages from user-specified news websites.
- Used Apache HttpClient to download and parse HTML files
- Used Boilerpipe to extract the news title and body from pages
- Used a Bloom filter to avoid caching duplicate pages
- Used JsonObject to parse and construct the JSON strings that carry content
- Developed as Java servlets and deployed on a Tomcat server
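The duplicate check in the list above can be sketched with a small Bloom filter over URL strings. This is a minimal illustration under stated assumptions, not the project's implementation: the class name `UrlBloomFilter`, the bit-array size, the number of hash functions, and the choice of hashes (`String.hashCode()` combined with FNV-1a through double hashing) are all picked for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

final class UrlBloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits in the filter
    private final int hashCount;  // number of bit positions set per URL

    UrlBloomFilter(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // 32-bit FNV-1a hash of the UTF-8 bytes, used as the second hash.
    private static int fnv1a(String s) {
        int hash = 0x811C9DC5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;
        }
        return hash;
    }

    // Double hashing: the i-th position is (h1 + i * h2) mod size.
    private int index(String url, int i) {
        int h1 = url.hashCode();
        int h2 = fnv1a(url);
        return Math.floorMod(h1 + i * h2, size);
    }

    // Records the URL as seen.
    void add(String url) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(url, i));
        }
    }

    // False means the URL was definitely never added;
    // true means it was probably added (false positives are possible).
    boolean mightContain(String url) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(url, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        UrlBloomFilter seen = new UrlBloomFilter(1 << 20, 4);
        String url = "http://example.com/news/1";
        System.out.println("seen before first visit: " + seen.mightContain(url));
        seen.add(url);
        System.out.println("seen after first visit: " + seen.mightContain(url));
    }
}
```

A Bloom filter never reports a stored URL as unseen, but it can occasionally report a new URL as already seen, so a crawler that uses one trades a small chance of skipping a page for a fixed, small memory footprint.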

Repository contents:
- .settings
- bin
- lib
- src
- 文档 (documentation)
- .classpath
- .gitattributes
- .gitignore
- .project
- PLCrawler.txt
- README.md
- SeedUrls.txt
- config.txt
