Web Crawler

As required, the web crawler crawls a single domain only and builds a simple sitemap showing links to other pages under the same domain, links to static content such as images, and links to external URLs. Given a starting URL, it visits all pages within the domain but does not follow links to external sites such as Google or Twitter.
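How the same-domain rule might look in code: the sketch below is an assumption, not the repository's implementation, and treats "same domain" as a case-insensitive host match against the seed URL.

import java.net.URI;

// Hypothetical helper: a link is followed only when its host matches the seed's host.
final class DomainFilter {

    private final String seedHost;

    DomainFilter(String seedUrl) {
        this.seedHost = URI.create(seedUrl).getHost();
    }

    boolean isSameDomain(String candidateUrl) {
        try {
            String host = URI.create(candidateUrl).getHost();
            return host != null && host.equalsIgnoreCase(seedHost);
        } catch (IllegalArgumentException e) {
            return false; // malformed URLs are never followed
        }
    }
}

For example, new DomainFilter("https://en.wikipedia.org/wiki/Dodo").isSameDomain("https://twitter.com/share") returns false, so such a link would be recorded in the sitemap but not visited.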

The sitemap is printed to the console every X seconds.
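One plausible wiring for that periodic printing, as a sketch only (SitemapBuilder and its render() method are assumed names, and the 10-second interval is arbitrary):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

static void startSitemapPrinter(SitemapBuilder sitemap) {
    // Print the current sitemap snapshot on a fixed schedule.
    ScheduledExecutorService printer = Executors.newSingleThreadScheduledExecutor();
    printer.scheduleAtFixedRate(
            () -> System.out.println(sitemap.render()), // render() is an assumed method
            10, 10, TimeUnit.SECONDS);
}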

How to build and run

The language of choice is Java 8, with Maven used to build and run the project.

Build using Maven:

mvn clean install

To run, from the project root execute the following:

mvn exec:java -Dexec.mainClass="com.micvog.Main" -Dwebsite=<website_url>

For example:

mvn exec:java -Dexec.mainClass="com.micvog.Main" -Dwebsite="https://en.wikipedia.org/wiki/Dodo"

Assumptions

  • Different URLs can resolve to the same content, and 3xx redirects were not considered within the scope of this project. Considering canonical URLs could also improve duplicate removal.

  • The crawler runs as a single thread, but the solution is designed to be extensible so that it can become multithreaded in the future.

  • The project uses the Lombok dependency to create classes without boilerplate code (e.g. equals and hashCode, constructors, toString); see the Lombok sketch after this list.

  • No protection was added against downloading very large web entities. Content-Length checks should be configured in a real system; see the size-guard sketch after this list.

  • Jsoup was used only for its DOM-parsing functionality. For this exercise the HTML parser identifies absolute paths only (e.g. "http://www.google.com/images.html", not "/images.html"); see the link-extraction sketch after this list.

  • The sitemap is printed using a select * approach, which would not be feasible in a real key-value storage system; we would probably perform batch reads in order to build the sitemap iteratively. Select * is not an option in a large-scale storage solution such as Cassandra.

  • Printing the sitemap properly would require a richer XML library to create and render it. For speed, a StringBuilder constructs it here; see the rendering sketch after this list.

  • The solution uses a job scheduler to assign tasks in order to avoid crawling recursively. This allows better control of prioritization (and different crawling strategies if needed): a breadth-first search, for example, would make the job scheduler distribute tasks differently. For simplicity a FIFO queue is used for now; see the crawl-loop sketch after this list.

  • The HTTP client ignores the validity of SSL certificates; see the trust-all sketch after this list.

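To illustrate the Lombok point, a hypothetical value class (not necessarily one from this repository) needs only an annotation to get its boilerplate generated:

import lombok.Value;

// @Value makes the class final with private final fields, and generates
// the constructor, getters, equals/hashCode and toString at compile time.
@Value
public class CrawledPage {
    String url;
    int statusCode;
}
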
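A Content-Length guard of the kind suggested above could look like the following sketch (the withinSizeLimit helper and the HEAD-request approach are assumptions, not repository code):

import java.net.HttpURLConnection;
import java.net.URL;

// Ask for headers only and refuse bodies above a configured limit.
static boolean withinSizeLimit(String url, long maxBytes) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestMethod("HEAD");
    long length = conn.getContentLengthLong(); // -1 when the header is absent
    conn.disconnect();
    return length >= 0 && length <= maxBytes;
}
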
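The absolute-paths-only behaviour described above can be sketched with Jsoup as follows (extractAbsoluteLinks is a hypothetical helper):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayList;
import java.util.List;

static List<String> extractAbsoluteLinks(String html) {
    Document doc = Jsoup.parse(html);
    List<String> links = new ArrayList<>();
    for (Element a : doc.select("a[href]")) {
        String href = a.attr("href");
        // Only absolute URLs are kept; relative paths such as "/images.html" are ignored.
        if (href.startsWith("http://") || href.startsWith("https://")) {
            links.add(href);
        }
    }
    return links;
}
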
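The StringBuilder approach mentioned above amounts to something like this sketch, matching the output shown further down (renderSitemap is a hypothetical name):

import java.util.Collection;

static String renderSitemap(Collection<String> urls) {
    StringBuilder sb = new StringBuilder();
    sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
    sb.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
    for (String url : urls) {
        // A real implementation would XML-escape the URL here.
        sb.append("  <url>\n")
          .append("    <loc>").append(url).append("</loc>\n")
          .append("  </url>\n");
    }
    return sb.append("</urlset>").toString();
}
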
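The FIFO scheduling described above boils down to a queue-driven loop. The sketch below is not taken from the repository and assumes fetch, extractAbsoluteLinks and isSameDomain helpers:

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

static void crawl(String seedUrl) {
    // A FIFO frontier yields breadth-first order; a visited set avoids re-crawling.
    Queue<String> frontier = new ArrayDeque<>();
    Set<String> visited = new HashSet<>();
    frontier.add(seedUrl);

    while (!frontier.isEmpty()) {
        String url = frontier.poll();
        if (!visited.add(url)) {
            continue; // already crawled
        }
        for (String link : extractAbsoluteLinks(fetch(url))) { // fetch(...) is assumed
            if (isSameDomain(link) && !visited.contains(link)) {
                frontier.add(link);
            }
        }
    }
}

Swapping the ArrayDeque for a priority queue would change the crawling strategy without touching the rest of the loop.
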
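"Ignores the validity of SSL certificates" typically means a trust-all manager like the sketch below; this is acceptable for an exercise but unsafe in production:

import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;

static SSLContext trustAllContext() throws Exception {
    // Every certificate chain is accepted without inspection.
    TrustManager[] trustAll = new TrustManager[] {
        new X509TrustManager() {
            public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
            public void checkClientTrusted(X509Certificate[] chain, String authType) { }
            public void checkServerTrusted(X509Certificate[] chain, String authType) { }
        }
    };
    SSLContext ctx = SSLContext.getInstance("TLS");
    ctx.init(null, trustAll, new SecureRandom());
    return ctx;
}
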
The output of the sitemap builder looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.net/?id=who</loc>
  </url>
  <url>
    <loc>http://www.example.net/?id=what</loc>
  </url>
  <url>
    <loc>http://www.example.net/?id=how</loc>
  </url>
</urlset>

We could enhance the output by introducing optional sitemap elements such as <lastmod>, <changefreq> and <priority>.
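For illustration, a <url> entry carrying those optional elements would look like the following (the values are placeholders, not produced by this project):

<url>
  <loc>http://www.example.net/?id=who</loc>
  <lastmod>2018-01-01</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.8</priority>
</url>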

Logging

Slf4j-simple is used for logging.
