httpcrawler

Very basic HTTP crawler, written in python. Crawls a single domain and outputs an SVG graph showing links between pages and resources (images, css, javascript)

Usage:

$ python httpcrawler.py http://m50d.github.com

Graph will be in the file "out.svg". Nodes and links are coloured according to type - black for pages, red for javascript, green for images and yellow for css.

Dependencies:

Twisted (async HTTP client)
BeautifulSoup (HTML parsing)
Pygraphviz (graph drawing)

Known issues:

Hangs if passed an invalid URL (e.g. missing http)
No error handling at all if any http request fails
Will not follow links to query pages, but will happily follow an infinitely generated chain of "plain" links (if you have e.g. /page/1 linking to /page/2 linking to ...)
Redirects may or may not show up as blue nodes in the graph - untested

Potential improvements:

Separate crawling and graphing parts of the program
More control over which links to follow (currently follows any links on the same host)
More control over graphviz output (e.g. filename, format, ...)
Proper packaging using pip or similar

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
.gitignore		.gitignore
README.md		README.md
httpcrawler.py		httpcrawler.py
httpcrawler_test.py		httpcrawler_test.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

httpcrawler

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

httpcrawler

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages