itoupeter/crawler

crawler

Developed for the School of International Education, South China University of Technology.

This is one module of a news collector system. The other modules are a GUI and an Apache Solr-based indexing module. The modules run on separate server machines and exchange data through HTTP requests that carry JSON strings.
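
As a rough illustration of that exchange, the sketch below builds a JSON string for one collected article and wraps it in an HTTP POST request. It is not taken from this repository: the field names (`url`, `title`, `body`), the class `ArticleMessage`, and the endpoint are hypothetical, and it uses the JDK's `java.net.http` API (Java 11+) with hand-written escaping so that it runs without extra libraries, whereas the project itself uses Apache HttpClient and JsonObject.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of the repository.
class ArticleMessage {
    // Escape the characters that must not appear raw inside a JSON string.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    // Serialize one article; the field names are assumptions for illustration.
    static String toJson(String url, String title, String body) {
        return "{\"url\":\"" + escape(url)
             + "\",\"title\":\"" + escape(title)
             + "\",\"body\":\"" + escape(body) + "\"}";
    }

    // Build (but do not send) a POST request carrying the JSON string.
    static HttpRequest buildRequest(String endpoint, String json) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/json; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }
}
```

A receiving module would read the request body and parse the same fields back out. Sending the request would go through `java.net.http.HttpClient.send`, which is omitted here to keep the sketch free of network access.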

A simple crawler that collects pages from user-specified news websites.
- Uses Apache HttpClient to download and parse HTML files
- Uses Boilerpipe to extract the news title and body from each page
- Uses a Bloom filter to avoid caching duplicate pages
- Uses JsonObject to parse and construct the JSON strings used to transfer content
- Developed as Java servlets and deployed on a Tomcat server
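
To show how a Bloom filter can guard against fetching the same page twice, here is a minimal sketch using only the JDK. It is an assumption about the general technique, not the repository's actual code: the class `UrlBloomFilter`, the FNV-1a hash, and the double-hashing scheme are all choices made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical Bloom filter over URLs, not part of the repository.
class UrlBloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits in the filter
    private final int hashCount;  // number of bit positions set per URL

    UrlBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // 32-bit FNV-1a hash starting from the given seed.
    private static int fnv1a(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Double hashing: the i-th position is (h1 + i * h2) mod size.
    private int index(byte[] data, int i) {
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193) | 1; // force odd so the step is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    // True if the URL may have been seen; false means it definitely was not.
    boolean mightContain(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(data, i))) return false;
        }
        return true;
    }

    // Records the URL; returns true only if it was not already (possibly) present.
    boolean addIfAbsent(String url) {
        if (mightContain(url)) return false;
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(data, i));
        }
        return true;
    }
}
```

A crawler loop would call `addIfAbsent(url)` before downloading and skip the URL when it returns `false`. The trade-off is that a Bloom filter can report false positives, so a small fraction of new pages may be skipped, but it never reports a seen URL as new and it needs far less memory than storing every URL.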
