Mozilla Festival 2012 PANDA Project Session
README

WHAT SKILL LEVELS DO WE HAVE?
SPLIT INTO PAIRS
INSTALL STUFF

Why write a screen scraper?

    To get data that is available, but not in a structured format.

What can I scrape?

    With patience, almost anything. But the more tabular the data, the more straightforward it will be.

When doesn't this work?

    When you can't be certain you've found all the data (for example, the site is search-only or has no predictable URLs).

What is PANDA?

    http://pandaproject.net/

Why put data in PANDA?

    To share it with your colleagues and to make it searchable.

Tools and technologies:

    Python, Node, Ruby, ScraperWiki, Mechanize

What are we going to produce today?

    A script you can run to extract structured data from an unstructured website.

What we aren't going to cover:

    Sessions/cookies, regular expressions, POST URLs/search params, broken HTML

Question:

    Does the percentage of runners who finish the race vary with wind speed?

Step 1:

    Explain boilerplate
    How to fetch a webpage
    Scraping the year
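
    A minimal sketch of this step, using only the Python standard library. The
    page layout is an assumption: it supposes the year appears in the page's
    first <h1> heading. The names fetch, HeadingParser and scrape_year are
    made up for illustration and may not match the code in step_1.

```python
import urllib.request
from html.parser import HTMLParser


def fetch(url):
    """Download a page and return its HTML as text."""
    with urllib.request.urlopen(url) as response:
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class HeadingParser(HTMLParser):
    """Collects the text inside the first <h1> element."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.done = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self.done:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            self.in_h1 = False
            self.done = True

    def handle_data(self, data):
        if self.in_h1:
            self.text += data


def scrape_year(html):
    """Return the first four-digit word in the <h1> heading as an int.

    Assumption: the heading looks something like "Results for the 2012 race".
    Returns None when no four-digit word is found.
    """
    parser = HeadingParser()
    parser.feed(html)
    for word in parser.text.split():
        if len(word) == 4 and word.isdigit():
            return int(word)
    return None
```

    Usage would look like scrape_year(fetch(url)), where url is the address of
    one results page.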

Step 2:

    Scraping the registered and finished runners
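
    One way this step might look, again with the standard library only. The
    table layout in SAMPLE_HTML is an assumption, and the numbers in it are
    invented placeholders, not real race data. RowParser, to_int and
    scrape_runner_counts are hypothetical names.

```python
from html.parser import HTMLParser

# Assumed layout: a summary table with one label cell and one value cell
# per row. The numbers are invented for illustration.
SAMPLE_HTML = """
<table>
  <tr><th>Registered</th><td>1,204</td></tr>
  <tr><th>Finished</th><td>987</td></tr>
</table>
"""


class RowParser(HTMLParser):
    """Collects every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.current_row = None   # list of cells while inside a <tr>
        self.current_cell = None  # text buffer while inside a <td>/<th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.current_row = []
        elif tag in ("td", "th") and self.current_row is not None:
            self.current_cell = ""

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.current_cell is not None:
            self.current_row.append(self.current_cell.strip())
            self.current_cell = None
        elif tag == "tr" and self.current_row is not None:
            self.rows.append(self.current_row)
            self.current_row = None

    def handle_data(self, data):
        if self.current_cell is not None:
            self.current_cell += data


def to_int(text):
    """Convert a string such as '1,204' to the integer 1204."""
    return int(text.replace(",", ""))


def scrape_runner_counts(html):
    """Return (registered, finished) from a label/value summary table."""
    parser = RowParser()
    parser.feed(html)
    # Map each lowercased label to its value, ignoring rows of other shapes.
    counts = {row[0].lower(): row[1] for row in parser.rows if len(row) == 2}
    return to_int(counts["registered"]), to_int(counts["finished"])


print(scrape_runner_counts(SAMPLE_HTML))
```

    Stripping the thousands separator before calling int() matters here:
    int("1,204") raises a ValueError.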

Step 3:

    Scraping the wind speed
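
    A sketch of picking out a single value by its class attribute. It assumes
    the wind speed sits in an element marked class="wind" and is written as a
    number followed by a unit; the real page may mark it differently.
    ClassTextParser and scrape_wind_speed are hypothetical names.

```python
from html.parser import HTMLParser


class ClassTextParser(HTMLParser):
    """Collects the text of the first element carrying a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.target_tag = None  # tag name of the element being collected
        self.done = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if self.target_tag is None and not self.done:
            # A class attribute can hold several space-separated names.
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes:
                self.target_tag = tag

    def handle_endtag(self, tag):
        if tag == self.target_tag:
            self.target_tag = None
            self.done = True

    def handle_data(self, data):
        if self.target_tag is not None:
            self.text += data


def scrape_wind_speed(html):
    """Return the first number inside the class="wind" element, or None.

    Assumption: the text looks like "12 mph", number first, then the unit.
    """
    parser = ClassTextParser("wind")
    parser.feed(html)
    for token in parser.text.split():
        try:
            return float(token)
        except ValueError:
            continue
    return None


# Invented sample markup, for illustration only.
print(scrape_wind_speed('<p>Wind: <span class="wind">12 mph</span></p>'))
```

    Returning None when nothing matches lets the calling code decide whether a
    page without a wind reading should be skipped or reported.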

Step 4:

    Scraping all the URLs
    Writing to a CSV
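
    A sketch of both halves of this step. It assumes the site has one results
    page per year at a predictable address; URL_TEMPLATE is a placeholder, not
    the real site, and the column names and the sample row are invented.

```python
import csv
import io

# Hypothetical URL pattern: one results page per year.
URL_TEMPLATE = "http://example.com/results/{year}.html"

FIELDNAMES = ["year", "registered", "finished", "wind_speed"]


def build_urls(first_year, last_year):
    """Return one URL per year, first_year through last_year inclusive."""
    return [URL_TEMPLATE.format(year=year)
            for year in range(first_year, last_year + 1)]


def write_csv(rows, output):
    """Write a header line and one line per row dict to a file-like object."""
    writer = csv.DictWriter(output, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Demonstration with an in-memory buffer and one invented row.
buffer = io.StringIO()
write_csv(
    [{"year": 2012, "registered": 1204, "finished": 987, "wind_speed": 12.0}],
    buffer,
)
print(build_urls(2010, 2012))
print(buffer.getvalue())
```

    To write a real file, open it with open("results.csv", "w", newline="")
    and pass the file object to write_csv. The newline="" argument stops the
    csv module's line endings from being translated a second time.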

Step 5:

    Finished script that scrapes everything
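
    A rough outline of how the pieces might fit together to answer the
    question above. This is not the script in step_5: the fetching and
    scraping functions are passed in as parameters, and the demonstration
    replaces them with stand-ins that return invented numbers so the outline
    runs without a network connection.

```python
import csv
import io

FIELDNAMES = ["year", "registered", "finished", "wind_speed", "percent_finished"]


def percent_finished(registered, finished):
    """Share of registered runners who finished, as a percentage."""
    if registered == 0:
        return None
    return round(100.0 * finished / registered, 1)


def scrape_all(urls, fetch, scrape_page):
    """Fetch every URL, scrape it, and return one row dict per page.

    fetch(url) returns the page content; scrape_page(content) returns a dict
    with the keys year, registered, finished and wind_speed.
    """
    rows = []
    for url in urls:
        row = scrape_page(fetch(url))
        row["percent_finished"] = percent_finished(
            row["registered"], row["finished"])
        rows.append(row)
    return rows


def write_csv(rows, output):
    """Write a header line and one line per row dict to a file-like object."""
    writer = csv.DictWriter(output, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Stand-ins for the real network and parsing code. The URLs are placeholders
# and every number below is invented for illustration.
FAKE_PAGES = {
    "http://example.com/results/2011.html":
        {"year": 2011, "registered": 200, "finished": 150, "wind_speed": 20.0},
    "http://example.com/results/2012.html":
        {"year": 2012, "registered": 1000, "finished": 900, "wind_speed": 5.0},
}

rows = scrape_all(sorted(FAKE_PAGES), fetch=FAKE_PAGES.get, scrape_page=dict)
output = io.StringIO()
write_csv(rows, output)
print(output.getvalue())
```

    With the percentage and the wind speed side by side in one CSV, the
    question becomes a matter of sorting or charting the two columns.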