Mozilla Festival 2012 PANDA Project Session

WHAT SKILL LEVELS DO WE HAVE
SPLIT INTO PAIRS
INSTALL STUFF

Why write a screen scraper?

    To get data that is available, but not in structured format.

What can I scrape?

    With patience, almost anything. But the more tabular the data, the more straightforward it will be.

When doesn't this work?

    When you can't be certain you've found all the data (for example, the site is search-only and has no predictable URLs).

What is PANDA?

    http://pandaproject.net/

Why put data in PANDA?

    To share with your colleagues. To search it.

Tools and technologies:

    Python, Node, Ruby, ScraperWiki, Mechanize

What are we going to produce today?

    A script you can run to extract structured data from an unstructured website.

What we aren't going to cover:

    Sessions/cookies, regular expressions, POST URLs/search params, broken HTML

Question:

    Does the percentage of runners who finish the race vary with wind speed?
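
The number we want per race is the finish percentage, which we can then line up against wind speed. A minimal sketch of that arithmetic; the function name is made up for this example and is not taken from the step scripts:

```python
def finish_percentage(registered, finished):
    """Return the share of registered runners who finished, as a percentage."""
    if registered <= 0:
        # Avoid dividing by zero when a page has no usable registration count.
        raise ValueError("registered must be a positive number")
    return 100.0 * finished / registered
```

Once every race has a finish percentage and a wind speed, sorting the rows by wind speed (or plotting them) is enough to eyeball whether there is a pattern.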

Step 1:

    Explain boilerplate
    How to fetch a webpage
    Scraping the year
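
A rough sketch of what Step 1 might look like, using only the Python standard library. It assumes the year appears as a four-digit number in the page's <title>; that is a guess about the page layout, and the names fetch, TitleParser and scrape_year are made up for this example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Download a page and return its HTML as text."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TitleParser(HTMLParser):
    """Collect the text inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def scrape_year(html):
    """Return the first four-digit number in the title, or None.

    Assumption: the year is written in the <title>. Check the real page
    and adjust if it lives somewhere else.
    """
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    for word in parser.title.split():
        token = word.strip(".,:;()")
        if len(token) == 4 and token.isdigit():
            return int(token)
    return None
```

In a real run you would call scrape_year(fetch(url)) with the address of one results page. Keeping the fetch and the parse separate means you can test the parsing against a saved copy of the page without hitting the site every time.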

Step 2:

    Scraping the registered and finished runners
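
A sketch of Step 2 under a big assumption: that the counts sit in a two-column table whose rows are labelled "Registered" and "Finished". The labels, the table layout and the names RowParser and scrape_runners are all hypothetical; look at the real page source before relying on any of them.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collect every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.current_row = None
        self.current_cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.current_row = []
        elif tag in ("td", "th") and self.current_row is not None:
            self.current_cell = ""

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.current_cell is not None:
            self.current_row.append(self.current_cell.strip())
            self.current_cell = None
        elif tag == "tr" and self.current_row is not None:
            self.rows.append(self.current_row)
            self.current_row = None
            self.current_cell = None

    def handle_data(self, data):
        if self.current_cell is not None:
            self.current_cell += data


def scrape_runners(html):
    """Return (registered, finished) as integers, or None where missing."""
    parser = RowParser()
    parser.feed(html)
    parser.close()
    counts = {}
    for row in parser.rows:
        if len(row) != 2:
            continue
        label, value = row
        digits = value.replace(",", "")  # "1,500" -> "1500"
        if digits.isdigit():
            counts[label.lower()] = int(digits)
    return counts.get("registered"), counts.get("finished")
```

Returning None for a missing value, instead of guessing, makes gaps obvious when you look at the output later.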

Step 3:

    Scraping the wind speed
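
A sketch of Step 3, assuming the wind speed is the text of an element with a known id, written like "12.5 mph". The id "wind", the text format and the names IdTextParser and scrape_wind_speed are invented for illustration; the real page may mark this up quite differently.

```python
from html.parser import HTMLParser


class IdTextParser(HTMLParser):
    """Collect the text of the first element with a given id.

    Assumption: the element holds plain text with no nested tags.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.capturing_tag = None
        self.text = None

    def handle_starttag(self, tag, attrs):
        if self.text is None and dict(attrs).get("id") == self.target_id:
            self.capturing_tag = tag
            self.text = ""

    def handle_endtag(self, tag):
        if tag == self.capturing_tag:
            self.capturing_tag = None

    def handle_data(self, data):
        if self.capturing_tag is not None:
            self.text += data


def scrape_wind_speed(html, element_id="wind"):
    """Return the first number in the element's text as a float, or None."""
    parser = IdTextParser(element_id)
    parser.feed(html)
    parser.close()
    if parser.text is None:
        return None
    for word in parser.text.split():
        try:
            return float(word)
        except ValueError:
            continue  # Not a number (for example the unit), keep looking.
    return None
```

Note that this drops the unit. If the pages mix units, keep the unit text too and convert before comparing races.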

Step 4:

    Scraping all the URLs
    Writing to a CSV
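
A sketch of Step 4, assuming there is an index page that links to one results page per race. The column names and the names LinkParser, scrape_urls and write_csv are made up for this example.

```python
import csv
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def scrape_urls(index_html, base_url):
    """Return absolute URLs for every link found on the index page."""
    parser = LinkParser()
    parser.feed(index_html)
    parser.close()
    # Links are often relative ("2012.html"), so resolve them against the base.
    return [urljoin(base_url, href) for href in parser.hrefs]


def write_csv(rows, output):
    """Write a header line and one line per race to an open file object."""
    writer = csv.writer(output)
    writer.writerow(["year", "registered", "finished", "wind_speed"])
    writer.writerows(rows)
```

To write a real file, open it with newline="" so the csv module controls the line endings: with open("races.csv", "w", newline="") as f: write_csv(rows, f). An index page usually has navigation links too, so expect to filter the URL list before looping over it.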

Step 5:

    Finished script that scrapes everything