# web-graph

This is a Flask application that scrapes websites, indexes them, and creates a graph visualization of the linking structure. A working version limited to five page visits is available at web-graph.appspot.com.

This is a heavily edited version of a project I completed as part of Udacity's Introduction to Computer Science course. This application implements an early version of Google's PageRank algorithm, along with a web scraper built using Google App Engine's urlfetch API.
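
For reference, the heart of such a scraper (fetch a page with the URL Fetch API, then pull its links out with BeautifulSoup) might look like the sketch below. This is an illustration rather than the repo's exact code; `get_links` is a name chosen here for the example.

```python
# A minimal sketch, assuming a simple "fetch, then parse anchor tags" flow.
from google.appengine.api import urlfetch
from bs4 import BeautifulSoup


def get_links(url):
    """Fetch a page and return the href values of its <a> tags."""
    result = urlfetch.fetch(url, deadline=10)
    if result.status_code != 200:
        return []
    soup = BeautifulSoup(result.content, 'html.parser')
    return [a.get('href') for a in soup.find_all('a') if a.get('href')]
```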

Features include:

- A web interface for initializing the crawl
- A graph visualization, built with the D3.js library, for viewing the results
- A search engine for searching the scraped pages, using the PageRank algorithm (a rough sketch of the ranking computation follows this list)
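
As noted in the last item above, here is a rough sketch of a simplified PageRank computation; the function name, damping factor, and iteration count are illustrative, not the repo's actual values.

```python
# A minimal sketch of iterative PageRank over a link graph.
def compute_ranks(graph, damping=0.85, iterations=10):
    """graph maps each page URL to the list of URLs it links to."""
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_ranks = {}
        for page in graph:
            # Base rank plus a share of the rank of every page linking here.
            rank = (1 - damping) / n
            for other, outlinks in graph.items():
                if page in outlinks:
                    rank += damping * ranks[other] / len(outlinks)
            new_ranks[page] = rank
        ranks = new_ranks
    return ranks
```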

To run locally, clone this project, install the Google App Engine SDK, then cd into the clone and install the dependencies listed in requirements.txt:

$ pip install -r requirements.txt -t lib/ --ignore-installed
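
The `-t lib/` flag vendors the dependencies into a local `lib/` directory. On the first-generation App Engine Python runtime, that directory is typically made importable by an `appengine_config.py` along the following lines (a sketch; the file in this repo may differ):

```python
# appengine_config.py -- add the vendored lib/ directory to the import path.
from google.appengine.ext import vendor

vendor.add('lib')
```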

Then, from the root of the directory, run the development server:

$ dev_appserver.py .
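
By default, the development server serves the app at http://localhost:8080, with an admin console at http://localhost:8000.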

An earlier version of this application required the requests library, but I replaced that with Google App Engine's URL Fetch service. As a result, the only requirements are Flask and beautifulsoup4, as detailed in requirements.txt.

If you would like to reinstall or update the dependencies, edit requirements.txt, cd to the root of the app, and run:

$ pip install -r requirements.txt -t lib/ --ignore-installed

### Modifying the Crawler

The `crawl_web` function is set to wait 0.25 seconds between page reads. You can increase or decrease this value as needed by editing the `time.sleep()` value, but make sure you don't overload anyone's servers. The `max_pages` value is currently set to 5, but you can change that as needed for your crawler. Note that without a maximum number of page visits, this crawler would continue until it had indexed the entire web (that is, forever).
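
As a rough illustration of how those two settings interact, a crawl loop shaped like the following (a sketch with hypothetical names, not the repo's exact code) sleeps between page reads and stops after `max_pages` visits:

```python
import time


def crawl_web(seed, max_pages=5, delay=0.25):
    """Breadth-first crawl from seed, visiting at most max_pages pages."""
    to_crawl = [seed]
    crawled = []
    graph = {}  # url -> list of outgoing links
    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.pop(0)
        if url in crawled:
            continue
        links = get_links(url)  # e.g. the urlfetch/BeautifulSoup helper sketched earlier
        graph[url] = links
        to_crawl.extend(links)
        crawled.append(url)
        time.sleep(delay)  # be polite: pause between page reads
    return graph
```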