# Scraping Voice of the Shuttle

This notebook contains the code used to scrape [Voice of the Shuttle](http://vos.ucsb.edu) and transform the semi-structured HTML representation of the site into a flat, tabular data structure.

In [206]:
# import the libraries needed
from lxml import html
import json
from lxml import etree
import sys
import csv

## Scraping the Front Page

The first step is scraping the links to the main subpages off the front page. These are the links to the "Contents" pages and the "Resources" pages. The links are contained in two separate tables, so I need to isolate those tables and then extract each subpage's title and link.

In [277]:
#base_url = "http://vos.ucsb.edu/"
base_url = "http://localhost:8000/"
front_page = html.parse(base_url)

In [11]:
# Isolate all of the tables on the front page
link_tables = front_page.xpath("//table[@bgcolor='#FFFFF3']")

In [89]:
# I know the first table is the Content links, 
# so grab the links out of the first table (except the last link which is broken.)
contents_links = [(link.text_content(), link.xpath("@href")[0]) 
                  for link in link_tables[0].xpath(".//a")[:-1]]
contents_links

[('General Humanities Resources', 'browse.asp?id=2712'),
 ('Postindustrial Business Theory', 'browse.asp?id=2727'),
 ('Anthropology', 'browse.asp?id=2703'),
 ('Archaeology', 'browse.asp?id=2704'),
 ('Architecture', 'browse.asp?id=2705'),
 ('Area & Regional Studies  ', 'browse.asp?id=2706'),
 ('Art (Modern and Contemporary)', 'browse.asp?id=2707'),
 ('Art History', 'browse.asp?id=3404'),
 ('Classical Studies', 'browse.asp?id=2708'),
 ('Cultural Studies', 'browse.asp?id=2709'),
 ('Cyberculture', 'browse.asp?id=2710'),
 ('Dance', 'browse.asp?id=3640'),
 ('Gender and Sexuality Studies', 'browse.asp?id=2711'),
 ('History', 'browse.asp?id=2713'),
 ('Legal Studies', 'browse.asp?id=2716'),
 ('Literature (in English)', 'browse.asp?id=3'),
 ('Literatures (Other Than English)', 'browse.asp?id=2719'),
 ('Literary Theory', 'browse.asp?id=2718'),
 ('Media Studies', 'browse.asp?id=2720'),
 ('Minority Studies', 'browse.asp?id=2721'),
 ('Music ', 'browse.asp?id=2722'),
 ('Philosophy', 'browse.asp?id=27

In [31]:
# Now do the same thing for the Resources table
resources_links = [(link.text_content(), link.xpath("@href")[0]) for link in link_tables[1].xpath(".//a")]
resources_links

[('Academe', 'browse.asp?id=2702'),
 ('Teaching Resources', 'browse.asp?id=2732'),
 ('Libraries &  Museums', 'browse.asp?id=2717'),
 ('Reference', 'browse.asp?id=2729'),
 ('Journals & Zines', 'browse.asp?id=2714'),
 ('Publishers & Booksellers', 'browse.asp?id=2728'),
 ('Listservs & Newsgroups', 'browse.asp?id=2975'),
 ('Conferences', 'browse.asp?id=2972'),
 ('Travel', 'browse.asp?id=2734'),
 ('Laws Of Cool', 'browse.asp?id=2715')]

## Experiments with Subpage Content Extraction

Now that I have the pointers to the main subpages, I need to figure out how to extract the semi-structured information from the HTML markup into a more structured, tabular format. 

This is the information I want to extract for each link:
- The parent categories of the link
- The URL of the link
- The link text
- Any extra text trailing the link
- The ID of the link (which is hidden in the HTML structure)
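As a rough sketch of what one flattened row could look like, the fields above can be held in a `NamedTuple`. The name `LinkRecord` and its field names are hypothetical and do not appear in the notebook's code. The URL, link text, and ID are taken from the test element inspected later in this section; the category path and the shortened extra text are only illustrative.

```python
from typing import NamedTuple, Tuple


class LinkRecord(NamedTuple):
    """One flattened row per link (hypothetical field names)."""
    categories: Tuple[str, ...]  # parent headings, outermost first
    url: str
    text: str
    extra: str
    link_id: str


record = LinkRecord(
    # illustrative category path, not verified against the page
    categories=("Search The Web", "Announcement Services For New Web Sites"),
    url="http://www.csci.csusb.edu/doc/www.sites.html",
    text="Recently Announced WWW Sites",
    extra="(Doc Dick Botting, Calif. State Univ., San Bernardino)",  # shortened
    link_id="info8817",
)

print(" > ".join(record.categories), "|", record.text)
```

A tuple-like record keeps the row easy to pass straight to `csv.writer.writerow` later on.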



In [32]:
test_url = base_url + contents_links[0][1]
test_url

'http://vos.ucsb.edu/browse.asp?id=2712'

In [33]:
test_parse = html.parse(test_url)

In [36]:
for link in test_parse.xpath("//a[@target='VoSLink']"):
    print(link.text_content())

British Academy PORTAL
Virtual Reference Desk
ABCentral
About.com: Authors
Britannica.com
BUBL Information Service
Central Conference for History and the Humanities Online (Conferenza Centrale)
CultureKiosque
EDSITEment
Educator's Reference Desk
Electronic Theses and Dissertations in the Humanities: A Directory of Online References and Resources
Funk & Wagnalls Knowledge Center
General Resources
Humanitas
Humanities Lecture Series at U. Kansas
Humanities Web Sites in Japan
Humanities & Social Sciences Services, Texas A&M U. Libraries
Humanities and Arts on the Information Highways
Infomine: Scholarly Internet Resource Collections
Internet Public Library
INTUTE:  Humanities
Library of Congress
Literature Webliography
The Master Works of Western Civilization
MediaMOO
Michigan State U. Vincent Voice Library__
Munich Found Online
New York Times Sunday Book Review Section
OpenHere! Literature
Resources of Scholarly Societies - General & Interdisciplinary
SCAN: Scholarship from California on

In [54]:
for category in test_parse.xpath("//table[.//img/@src='images/bullet-cat.gif']"):
    taxonomy = (category.xpath("./tr/td/@width")[0], category.xpath(".//b")[0].text_content().strip())
    print(taxonomy)

('32', 'Humanities Metapages & Portals')
('64', 'Texas Cultural & Arts Network')
('32', 'Major Web Sites Relevant To General Humanities')
('64', 'Bureau of Labor Statistics')
('96', 'Occupational Outlook Handbook')
('64', 'Chorus - Exploring New Media in the Arts & Humanities')
('64', 'RAND Home Page')
('32', 'Humanities Text Archives')
('32', 'Humanities Centers & Programs')
('32', 'General Humanities Journals')
('32', 'Humanities Discussion Lists & Newsgroups')
('32', 'General Humanities Courses')
('32', 'General Humanities Conferences & Calls for Papers')
('32', 'Copyright & Intellectual Property: Issues, Law, &   Services')
('64', 'John S. Erickson (Darmouth U.)')
('64', 'Ann Okerson')
('32', 'Bibliographic, Translation, & Typesetting Services')
('64', 'Bibliographic & Research Services')
('64', 'Translation Services')
('64', 'Typesetting Services')
('32', "Text-Analysis, Bibliographic, &Amp; Writers' Software")
('32', 'Guides To Critical Thinking & Argument')
('32', 'Guides To Eva

In [76]:
test_element = test_parse.xpath("//table[.//a[@target='VoSLink']]")[100]

In [219]:
etree.tostring(test_element)

b'<table border="0" cellspacing="0" cellpadding="3">&#13;\n  <tr> &#13;\n    <td align="right" width="128" valign="top"><a name="info8817"/><font color="#CC0000"><img src="images/bullet-link.gif" width="10" height="10"/></font></td>&#13;\n    <td> <a href="http://www.csci.csusb.edu/doc/www.sites.html" target="VoSLink">Recently Announced WWW Sites</a>&#160;(gigantic archive of links mentioned in comp.infosystems.www.announce and related newsgroups) (Doc Dick Botting, Calif. State Univ., San Bernardino) </td>&#13;\n  </tr>&#13;\n</table>&#13;\n&#13;\n'

In [88]:
print(test_element.xpath("tr/td/@width"))

for category in test_element.xpath("preceding::table[.//img/@src='images/bullet-cat.gif']"):
    taxonomy = (category.xpath("./tr/td/@width")[0], category.xpath(".//b")[0].text_content().strip())
    print(taxonomy)

['64']
('32', 'Humanities Metapages & Portals')
('64', 'Texas Cultural & Arts Network')
('32', 'Major Web Sites Relevant To General Humanities')
('64', 'Bureau of Labor Statistics')
('96', 'Occupational Outlook Handbook')
('64', 'Chorus - Exploring New Media in the Arts & Humanities')
('64', 'RAND Home Page')
('32', 'Humanities Text Archives')
('32', 'Humanities Centers & Programs')


In [93]:
# extract ID
test_element.xpath("tr/td/a/@name")[0]

'info8871'

In [218]:
# extract URL
test_element.xpath("tr/td[2]/a/@href")[0]

'http://www.csci.csusb.edu/doc/www.sites.html'

In [217]:
# extract link text
test_element.xpath("tr/td[2]/a/text()")[0]

'Recently Announced WWW Sites'

In [222]:
# extract extra text
test_element.xpath("tr/td[2]/text()[2]")[0]

'\xa0(gigantic archive of links mentioned in comp.infosystems.www.announce and related newsgroups) (Doc Dick Botting, Calif. State Univ., San Bernardino) '

In [145]:
def test_function(link):
    
    link_level = int(link.xpath("tr/td/@width")[0]) - 32
    heading_list = [(int(category.xpath("./tr/td/@width")[0]), category.xpath(".//b")[0].text_content().strip()) 
                    for category in link.xpath("preceding::table[.//img/@src='images/bullet-cat.gif']")]
    
    heading_list.reverse()
    
    headings = []
    layer = link_level
    for (level, category) in heading_list:
        
        if level == layer:
            headings.insert(0, category)
            print("adding ", category)
            
            layer = layer - 32
        else:
            print("Skipping ", (level, category))
        
        if len(headings) == (link_level // 32):
            return headings
        
    
    

test_element = test_parse.xpath("//table[.//a[@target='VoSLink']]")[190]

test_function(test_element)


adding  Comp.Infosystems.Www.* Newsgroups
adding  Announcement Services For New Web Sites
Skipping  (96, '"Giving Something Back" Search Help')
Skipping  (64, 'About Searching The Web')
adding  Search The Web


['Search The Web',
 'Announcement Services For New Web Sites',
 'Comp.Infosystems.Www.* Newsgroups']

### Dealing with Subpages

How many are there?

In [308]:
for heading, link in (contents_links + resources_links):
    local_link = base_url+link.replace("?","%3F")+".html"
    
    parsed_page = html.parse(local_link)
    
    subpages = parsed_page.xpath("//table[.//a/img/@alt = '[Show]']")
    
    
    
    if subpages:
        for page in subpages:
            try:
                label = "///".join(assemble_category(subpages[0], heading))
            except TypeError:  # assemble_category returned None
                label = assemble_category(subpages[0], heading)
            print(label)
            

General Humanities Resources
General Humanities Resources
Area & Regional Studies  ///American (U.S.) Studies///General American Studies Resources///American Studies   Electronic Crossroads
Art (Modern and Contemporary)///Modern and Contemporary Art by Artists and/or Movements///Modern (Through Pop)///Marcel Duchamp
Art (Modern and Contemporary)///Modern and Contemporary Art by Artists and/or Movements///Modern (Through Pop)///Marcel Duchamp
Art (Modern and Contemporary)///Modern and Contemporary Art by Artists and/or Movements///Modern (Through Pop)///Marcel Duchamp
Art (Modern and Contemporary)///Modern and Contemporary Art by Artists and/or Movements///Modern (Through Pop)///Marcel Duchamp
Art (Modern and Contemporary)///Modern and Contemporary Art by Artists and/or Movements///Modern (Through Pop)///Marcel Duchamp
Art (Modern and Contemporary)///Modern and Contemporary Art by Artists and/or Movements///Modern (Through Pop)///Marcel Duchamp
Art (Modern and Contemporary)///Modern and

## Automating the Information Extraction

OK, I think I have figured out how to programmatically extract the links and the categorical information associated with each link. It is a bit tricky because the HTML structure is not semantic. If the HTML were semantic, that is, if the taxonomic relations between category headings and links were expressed in the nesting of the HTML elements, the information would be easier to extract.

Instead, the HTML structure is flat, with a long list of `<table>` elements all hanging off the `<body>` of the HTML document. The hierarchy is expressed only through presentation, specifically the `width` attribute of each table's first cell, which grows by 32 for every level of nesting. I have been able to brew up some XPath selectors (see the code above) that extract the links, the headers, and their depth in the tree.
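To make the width-to-depth idea concrete, here is a small sketch that turns a few of the `(width, heading)` pairs printed earlier into an indented outline. The helper name `depth_of` and the constant `INDENT_STEP` are made up for this illustration, and the step of 32 is an assumption based on the widths observed in the output above (32, 64, 96).

```python
# A subset of the (width, heading) pairs printed by the category XPath above
headings = [
    ("32", "Major Web Sites Relevant To General Humanities"),
    ("64", "Bureau of Labor Statistics"),
    ("96", "Occupational Outlook Handbook"),
    ("64", "RAND Home Page"),
    ("32", "Humanities Text Archives"),
]

INDENT_STEP = 32  # assumption: each nesting level adds 32 to the width


def depth_of(width):
    """Convert a width attribute string into a 1-based nesting depth."""
    return int(width) // INDENT_STEP


# Indent each heading by two spaces per level below the top
outline = ["  " * (depth_of(width) - 1) + title for width, title in headings]
print("\n".join(outline))
```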

The next step is to brew up the logic that iterates over each link in the page, extracts the link, the title, and extra text, and assembles the categorical tree of that link (and only that link). The logic is going to be tricky because there are a few edge cases I'll need to deal with, but I'll deal with those later.
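The functions in this notebook walk backwards from each link through the preceding headings. For comparison, here is a sketch of a different technique: a single forward pass that keeps a stack of the currently open headings. It works on plain `(kind, width, title)` tuples rather than lxml elements, and the function name `category_paths` is hypothetical. The heading titles come from the test output above; the widths of the matched headings are inferred from the matching logic, and the link title is invented.

```python
def category_paths(rows):
    """Yield (path, title) for each link row, in document order.

    rows: iterable of (kind, width, title), where kind is "cat" or "link".
    """
    stack = []  # open category headings as (width, title), outermost first
    for kind, width, title in rows:
        if kind == "cat":
            # A heading closes every open heading at the same or a deeper level
            while stack and stack[-1][0] >= width:
                stack.pop()
            stack.append((width, title))
        else:
            # A link belongs to the open headings that are shallower than it
            yield [name for w, name in stack if w < width], title


rows = [
    ("cat", 32, "Search The Web"),
    ("cat", 64, "About Searching The Web"),
    ("cat", 96, '"Giving Something Back" Search Help'),
    ("cat", 64, "Announcement Services For New Web Sites"),
    ("cat", 96, "Comp.Infosystems.Www.* Newsgroups"),
    ("link", 128, "Example link"),  # hypothetical link title
]

for path, title in category_paths(rows):
    print("///".join(path), "->", title)
```

On well-formed pages this should give the same paths as the backward walk while touching each table only once, but it has not been checked against the edge cases mentioned above.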

In [282]:
def process_page(page_link, heading):
    """This function takes a link to a page and heading, 
    parses the page and processes the links"""
    
    # parse the page
    parsed_page = html.parse(page_link)
    
    # fetch all the table elements that contain links
    vos_links = parsed_page.xpath("//table[.//a[@target='VoSLink']]")
    
    # loop over each link
    parsed_links = [extract_link(link, heading) for link in vos_links]

    
    
    return parsed_links

def extract_link(link, heading):
    """This function takes a link element and returns a tuple of
    the categories, ID, URL, and text (link text plus any trailing text).
    """
    
    # There is always an ID
    ID = link.xpath("tr/td/a/@name")[0]
    
    # Religious Studies has wonky HTML so this deals with it:
    # when assemble_category finds no complete path it returns None,
    # and join then raises a TypeError
    try:
        categories = "///".join(assemble_category(link, heading))
    except TypeError:
        categories = heading
    
    # There isn't always a URL, such as the "Selected Resources"
    try:
        URL = link.xpath("tr//a/@href")[0]
    except IndexError:
        URL = ""
    
    # join returns "" for an empty list, so no fallback is needed here
    text = "".join(link.xpath("tr//text()")).strip()

    return (categories, ID, URL, text)

def assemble_category(link, heading):
    """This function takes a link and heading and computes the category tree."""
    
    link_level = int(link.xpath("./tr/td/@width")[0]) - 32
    heading_list = [(int(category.xpath("./tr/td/@width")[0]), category.xpath(".//a")[0].text_content().strip()) 
                    for category in link.xpath("preceding::table[.//img/@src='images/bullet-cat.gif']")]
    
    heading_list.reverse()
    
    headings = []
    layer = link_level
    for (level, category) in heading_list:
        
        if level == layer:
            headings.insert(0, category)            
            layer = layer - 32

        
        if len(headings) == (link_level // 32):
            headings.insert(0,heading)
            return headings

    # No complete path was found; callers have to handle the None result
    return None


In [284]:

# newline="" lets the csv module control the line endings itself
with open("links.csv", "w", newline="") as csvfile:
    linkwriter = csv.writer(csvfile)

    for heading, link in (contents_links + resources_links):
        print(heading)
        local_link = base_url+link.replace("?","%3F")+".html"
        processed_links = process_page(local_link, heading)
        for row in processed_links:
            if row:
                linkwriter.writerow(row)


      

Religious Studies
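
Since `links.csv` is written without a header row, reading it back means supplying the column names by hand. The sketch below does that with `csv.DictReader` and splits the `///`-joined category string back into a list. The column names in `FIELDS` are assumptions that simply follow the order of the tuple returned by `extract_link`, and an in-memory buffer with one illustrative row stands in for the real file.

```python
import csv
import io

# Column order follows the tuple returned by extract_link; the names are assumptions
FIELDS = ["categories", "id", "url", "text"]

# Stand-in for links.csv holding a single illustrative row
sample = io.StringIO(newline="")
csv.writer(sample).writerow([
    "General Humanities Resources///Search The Web",  # illustrative path
    "info8817",
    "http://www.csci.csusb.edu/doc/www.sites.html",
    "Recently Announced WWW Sites",
])
sample.seek(0)

rows = []
for row in csv.DictReader(sample, fieldnames=FIELDS):
    # Undo the "///" join so the category path is a list again
    row["categories"] = row["categories"].split("///")
    rows.append(row)

print(rows[0]["categories"])
```

To run this against the real output, `sample` could be replaced with `open("links.csv", newline="")`.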
