Web Scraping
======
Adapted from a page created by [John Beieler](https://gist.github.com/johnb30/4743272)

> Even with the best of websites, I don’t think I’ve ever encountered a scraping job that couldn’t be described as *“A small and simple general model with heaps upon  piles of annoying little exceptions”* 

>> \- Swizec Teller [http://swizec.com/blog/scraping-with-mechanize-and-beautifulsoup/swizec/5039](http://swizec.com/blog/scraping-with-mechanize-and-beautifulsoup/swizec/5039)

##What is it?

A large portion of the data that we as social scientists are interested in resides on the web in some manner. Web scraping is a method for pulling data from the structured (or not so structured!) HTML that makes up a web page. Python has numerous libraries for approaching this type of problem, many of which are incredibly powerful. If there is something you want to do, there's usually a way to accomplish it; perhaps not easily, but it can be done.


##How is it accomplished?

In general, there are three problems that you might face when undertaking a scraping task:

1. You have a single page, or a set of pages, that you know of and you want to scrape.
2. You have a source that generates links, e.g., [RSS feeds](http://rss.nytimes.com/services/xml/rss/nyt/World.xml), to various pages with the same structure.
3. You have a website that contains many pages of interest scattered across its file system, and you only have general rules for reaching these pages.

The key is that you must identify which type of problem you have. After this, you must look at the HTML structure of a webpage and construct a script that will select the parts of the page that are of interest to you.

##There's a library for that! 

As mentioned previously, Python has various libraries for scraping tasks. The ones I have found the most useful are:

- [pattern](http://www.clips.ua.ac.be/pages/pattern)
- [lxml](http://lxml.de/)
- [requests](http://docs.python-requests.org/en/latest/)
- [Scrapy](http://doc.scrapy.org/en/0.16/)
- [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/)


### Inspecting a web page
Let's inspect a [Mongabay web page](http://names.mongabay.com/data/1000.html). Open the page in a separate tab; it should look something like the image below. Mongabay is a site that has a nice collection of statistics about first and last names.

<img src="images/page.png" alt="web page" height="420" width="600">

The source language for web pages is HTML (HyperText Markup Language). HTML is a hierarchical description of the visual content of a page. In Chrome, you can view the source by choosing `View/Developer/View Source`. However, commercial web pages are very complex, and inspecting them requires a more powerful method.

The Chrome browser comes with such a tool built in. You can open this tool by choosing `View/Developer/Developer Tools`.

If you are using Firefox, you can install the [Firebug](https://getfirebug.com/) plugin, which gives similar functionality.

The bottom half of the browser page will now have a sophisticated development environment for all things web (HTML, CSS, JavaScript).
In particular, it allows you to click on a visual element in the page and find out where it is in the source and which tags it is associated with.
<img src="images/InspectingHTML.png" alt="web page" height="420" width="600">

Note that the element SMITH is surrounded by `<td>` and `</td>`, and that this element and all of the other elements in that row (`SMITH  1  2376206 ...`) are surrounded by `<tr> ..... </tr>`. In HTML-speak we say that each element in the table is of type 'td' and that the whole row is an element of type 'tr' (which stands for "table row").
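To make the row/cell nesting concrete, here is a minimal sketch on a tiny hand-written table. It is written for Python 3 and uses the standard library's `xml.etree.ElementTree` as a stand-in for a real HTML parser. The fragment is made up to resemble the Mongabay table (only the SMITH values come from the page), and this approach only works because the fragment is well-formed.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment shaped like the table on the page (assumption:
# real pages are rarely this tidy and need a forgiving HTML parser).
snippet = """
<table>
  <tr><td>Name</td><td>Rank</td><td>Number of occurrences</td></tr>
  <tr><td>SMITH</td><td>1</td><td>2376206</td></tr>
</table>
"""

table = ET.fromstring(snippet)

# Every <tr> is one row; its <td> children are the cells of that row.
rows = [[cell.text for cell in row.findall('td')] for row in table.iter('tr')]

for row in rows:
    print(row)
```

Each inner list corresponds to one `<tr>`, and each string in it to one `<td>`.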

We will now see how to inspect these elements from the command line using `scrapy shell`.

### Scrapy

* [Official Scrapy Tutorial](http://doc.scrapy.org/en/0.24/intro/tutorial.html)
* Other important sections in the Scrapy documentation are **selector** and **Scrapy shell**


### Using scrapy shell
*Everything here is done from the terminal window, not inside a notebook*

Install Scrapy. Use one of the following commands, or something similar:
> `pip install scrapy`

> `sudo pip install scrapy`

> `conda install scrapy`

Start Scrapy Shell from a terminal window:
> `scrapy shell`

Fetch a page: 
> `fetch('http://names.mongabay.com/data/1000.html')`

View the page in a browser:
> `view(response)`

Get the HTML text of the response:
> `response.body`

Get just the headers:
> `response.headers`

Get all links to other web pages, i.e. all `<a>` elements of the form `<a href="http://web.site/file"> link text </a>`:
> `response.xpath('//a')`

From those links, keep only the parts that match a particular regular expression:
> `response.xpath('//a').re('data/.+\.html?')`
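To see what that regular expression accepts without starting the shell, here is a small Python 3 sketch using the standard `re` module. The href values are made up for illustration. Note that `.re()` in the shell returns the matched text itself, whereas this sketch just shows which strings the pattern accepts.

```python
import re

# Same pattern as in the shell command above.
pattern = re.compile(r'data/.+\.html?')

# Hypothetical href values, made up for illustration.
hrefs = [
    'http://names.mongabay.com/data/1000.html',
    'http://names.mongabay.com/data/2000.htm',
    'http://names.mongabay.com/about/',
    'http://names.mongabay.com/data/names.csv',
]

# Keep the hrefs in which the pattern occurs somewhere.
matches = [h for h in hrefs if pattern.search(h)]
print(matches)
```

The trailing `l?` makes the final `l` optional, so both `.htm` and `.html` pages are kept.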

Scrapy contains much more than the shell. You can use Scrapy as a library to perform web operations. You can also write a **Spider** or **Crawler**, i.e. a program that follows links to extract and process all of the pages of a particular type from a web site. Later in this notebook there is an example of using Scrapy for crawling.

For now, we switch from scrapy to the libraries `requests` and `lxml.html` which are somewhat easier to use for web page parsing.

**Scraping a page that you know**

The easiest approach to webscraping is getting the content from a page that you know in advance. I'll go ahead and keep using the page we looked at earlier. There are three basic steps to scraping a single page:

1. Get (request) the page
2. Parse the page content
3. Select the content of interest using an XPath selector

The following code executes these three steps and prints the result. 

In [3]:
import requests
import lxml.html as lh

url = 'http://names.mongabay.com/data/1000.html'
page = requests.get(url)
doc = lh.fromstring(page.content)
page.content[:500]

'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">\n<HTML>\n<HEAD>\n<title>Most common last names in the United States, top 1000</title>\n<link href="https://plus.google.com/116584964404644143364" rel="publisher"/>\n<meta name=\'twitter:card\' content=\'summary\'>\n<meta name=\'twitter:site\' content=\'@mongabay\'>\n<meta name=\'twitter:title\' content=\'Most common last names in the United States, top 1000\'>\n<meta name=\'twitter:creator\' content=\'mongabay\'>\n<meta name=\'twitter:domain\' content=\'mongabay.com\'>\n<meta'

In [4]:
# the tag tr (table row) is used in many places, 
# among them the table of interest to us.
# we can identify those rows by the fact that 
# the table contains 11 columns.
tr_elements = doc.xpath('//tr')
[len(T) for T in tr_elements[:20]]

[5, 3, 1, 2, 3, 3, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11]

In [6]:
for i in range(len(tr_elements)):
    if len(tr_elements[i])==11:
        print i, tr_elements[i].text_content()
        break

 6 NameRankNumber of occurrences
Occurrences per 100,000 people
Cumulative proportion per 100,000 people
Percent Non-Hispanic White Only
Percent Non-Hispanic Black Only
Percent Non-Hispanic Asian and Pacific Islander Only
Percent Non-Hispanic American Indian and Alaskan Native Only
Percent Non-Hispanic of Two or More Races
Percent Hispanic Origin


In [7]:
col=[]  # collect column names into col
T=tr_elements[6]
print type(T)
i=0
print len(T)
for t in T.iterchildren():
    i+=1
    name=t.text_content()
    print '%d:"%s"'%(i,name)
    col.append((name,[]))

<class 'lxml.html.HtmlElement'>
11
1:"Name"
2:"Rank"
3:"Number of occurrences"
4:"Occurrences per 100,000 people"
5:"Cumulative proportion per 100,000 people"
6:"Percent Non-Hispanic White Only"
7:"Percent Non-Hispanic Black Only"
8:"Percent Non-Hispanic Asian and Pacific Islander Only"
9:"Percent Non-Hispanic American Indian and Alaskan Native Only"
10:"Percent Non-Hispanic of Two or More Races"
11:"Percent Hispanic Origin"


In [11]:
for j in range(7,len(tr_elements)):
    T=tr_elements[j]
    if len(T)!=11:
        break
    i=0
    for t in T.iterchildren():
        data=t.text_content()
        if i>0:
            try:
                data=float(data)
            except ValueError:
                print data,'cannot be converted to float, row,col=',j,i
                data=None
        col[i][1].append(data)
        i+=1


(S) cannot be converted to float, row,col= 880 6
(S) cannot be converted to float, row,col= 880 8
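The conversion step above can also be pulled out into a small helper. This is a sketch in Python 3, not the code that produced the output above. The function name `to_float_or_none` and the sample cells are made up, and it catches only `ValueError` so that unrelated errors still surface.

```python
def to_float_or_none(text):
    """Return text as a float, or None when it is not numeric (e.g. '(S)')."""
    try:
        return float(text)
    except ValueError:
        return None

# Hypothetical cells from one row: a name followed by numeric columns,
# one of which is suppressed and shown as '(S)'.
cells = ['SMITH', '1', '2376206', '(S)', '73.35']

name = cells[0]                                    # first column stays a string
values = [to_float_or_none(c) for c in cells[1:]]  # the rest become numbers
print(name, values)
```

Storing `None` for the suppressed cells keeps every column the same length, which matters when the columns are later turned into a table.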


In [12]:
[len(C) for (title,C) in col]

[3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000]

In [13]:
min_len=min([len(C) for (title,C) in col])
min_len

3000

In [14]:
Dict={title:column for (title,column) in col}
import pandas as pd
df=pd.DataFrame(Dict)
df.head()

| | Cumulative proportion per 100,000 people | Name | Number of occurrences | Occurrences per 100,000 people | Percent Hispanic Origin | Percent Non-Hispanic American Indian and Alaskan Native Only | Percent Non-Hispanic Asian and Pacific Islander Only | Percent Non-Hispanic Black Only | Percent Non-Hispanic White Only | Percent Non-Hispanic of Two or More Races | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 880.85 | SMITH | 2376206 | 880.85 | 1.56 | 0.85 | 0.4 | 22.22 | 73.35 | 1.63 | 1 |
| 1 | 1569.3 | JOHNSON | 1857160 | 688.44 | 1.5 | 0.91 | 0.42 | 33.8 | 61.55 | 1.82 | 2 |
| 2 | 2137.96 | WILLIAMS | 1534042 | 568.66 | 1.6 | 0.78 | 0.37 | 46.72 | 48.52 | 2.01 | 3 |
| 3 | 2649.58 | BROWN | 1380145 | 511.62 | 1.64 | 0.83 | 0.41 | 34.54 | 60.71 | 1.86 | 4 |
| 4 | 3154.75 | JONES | 1362755 | 505.17 | 1.44 | 0.94 | 0.35 | 37.73 | 57.69 | 1.85 | 5 |


So we now have our lovely output. This output can be manipulated in various ways, or written to an output file.
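As one example of writing the result to a file, here is a minimal Python 3 sketch using the standard `csv` module. The `col` list below is a small hypothetical stand-in with the same `(title, values)` shape as the one built above, and `write_columns` is a name made up for this sketch.

```python
import csv
import io

# Hypothetical stand-in for `col`: (title, values) pairs of equal length.
col = [
    ('Name', ['SMITH', 'JOHNSON']),
    ('Rank', [1.0, 2.0]),
    ('Number of occurrences', [2376206.0, 1857160.0]),
]

def write_columns(columns, handle):
    """Write (title, values) pairs as CSV: one header row, then data rows."""
    writer = csv.writer(handle)
    writer.writerow([title for title, _ in columns])
    # zip(*...) turns the per-column lists into per-row tuples.
    writer.writerows(zip(*[values for _, values in columns]))

# In-memory file for the demo; a real file would be opened with
# open('names.csv', 'w', newline='') instead.
buffer = io.StringIO()
write_columns(col, buffer)
print(buffer.getvalue())
```

With the DataFrame already built, `df.to_csv('names.csv')` is the shorter route.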

**Scraping generated links**

Let's say you want to get a stream of news stories in an easy manner. You could visit the homepage of the NYT and work from there, or you can use an [RSS feed](http://rss.nytimes.com/services/xml/rss/nyt/World.xml). RSS stands for Really Simple Syndication and is, at its heart, an XML document, which makes it easy to parse. The fantastic library `pattern` allows for easy parsing of RSS feeds. Using `pattern`'s `Newsfeed` class, it is possible to parse a feed and obtain attributes of the XML document. Its `search()` method returns an iterable composed of the individual stories. Each result has a variety of attributes such as `.url`, `.title`, `.description`, and more. The following code demonstrates these methods.

In [15]:
import pattern.web
num = 5
url = 'http://rss.nytimes.com/services/xml/rss/nyt/World.xml'
results = pattern.web.Newsfeed().search(url, count=num)

print 'The current top headers from the NY times are:'
for i in range(num):
    print "%d\t%s"%(i,results[i].title)

print '\n\nURL: %s \n\n Header\n%s \n\nFull Article\n %s \n\n' % (results[0].url, results[0].title, results[0].description)

The current top headers from the NY times are:
0	Mali Hotel Attack Leaves Dozens Dead; Hostages Are Taken
1	Jonathan Pollard, American Who Spied for Israel, Released After 30 Years
2	Ebola Cases in 3 Family Members Confirmed in Liberia
3	When You’re Named Isis for the Goddess, Not the Terror Group
4	Jonathan Pollard, American Who Spied for Israel, Released After 30 Years


URL: http://rss.nytimes.com/c/34625/f/642565/s/4ba858d7/sc/7/l/0L0Snytimes0N0C20A150C110C210Cworld0Cafrica0Cmali0Ehotel0Eattack0Eradisson0Bhtml0Dpartner0Frss0Gemc0Frss/story01.htm 

 Header
Mali Hotel Attack Leaves Dozens Dead; Hostages Are Taken 

Full Article
 Gunmen stormed a Radisson Blu hotel on Friday morning in Bamako,seizing scores of hostages and leaving bodies strewn across parts of the building.<br clear="all" /><br /><br /><a href="http://rc.feedsportal.com/r/244158992254/u/0/f/642565/c/34625/s/4ba858d7/sc/7/rc/1/rc.htm" rel="nofollow"><img border="0" src="http://rc.feedsportal.com/r/244158992254/u/0/f/64

That looks pretty good, but the description looks nastier than we would generally prefer. Luckily, `pattern` provides functions to get rid of the HTML in a string.  

In [16]:
print '%s \n\n %s \n\n %s \n\n' % (results[0].url, results[0].title, pattern.web.plaintext(results[0].description))

http://rss.nytimes.com/c/34625/f/642565/s/4ba858d7/sc/7/l/0L0Snytimes0N0C20A150C110C210Cworld0Cafrica0Cmali0Ehotel0Eattack0Eradisson0Bhtml0Dpartner0Frss0Gemc0Frss/story01.htm 

 Mali Hotel Attack Leaves Dozens Dead; Hostages Are Taken 

 Gunmen stormed a Radisson Blu hotel on Friday morning in Bamako,seizing scores of hostages and leaving bodies strewn across parts of the building. 
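Because an RSS feed is just XML, the same title/link/description fields can also be read without `pattern`. The Python 3 sketch below uses only the standard library on a hand-written feed with made-up content. `strip_tags` is a much cruder stand-in for `pattern.web.plaintext` and is not how `pattern` works internally.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hand-written feed with one item (hypothetical content and URLs).
feed = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item>
    <title>Example headline</title>
    <link>http://example.com/story.html</link>
    <description>First sentence.&lt;br /&gt;&lt;a href="http://example.com/ad"&gt;&lt;img src="x.gif" /&gt;&lt;/a&gt;</description>
  </item>
</channel></rss>"""

class TextOnly(HTMLParser):
    """Collect only the text between tags, dropping the tags themselves."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts).strip()

items = []
for item in ET.fromstring(feed).iter('item'):
    items.append({
        'title': item.findtext('title'),
        'url': item.findtext('link'),
        'description': strip_tags(item.findtext('description')),
    })

print(items[0])
```

The description arrives as escaped HTML inside the XML, which is why it needs a second pass to remove the tags.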




While it's all well and good to have the title and description of a story, this is often insufficient (some descriptions are just the title, which isn't particularly helpful). To get further information on the story, it is possible to combine the single-page scraping discussed previously with the results from the RSS scrape. The following code implements a function to scrape the NYT article pages, which can be done easily since the NYT is wonderfully consistent in its HTML, and then iterates over the results, applying the `scrape` function to each one.

In [17]:
import codecs

outputFile = codecs.open('tutorialOutput.txt', encoding='utf-8', mode='w')

def scrape(url):
    page = requests.get(url)
    doc = lh.fromstring(page.content)
    text = doc.xpath('//p[@itemprop="articleBody"]')
    finalText = str()
    for par in text:
        finalText += par.text_content()
    return finalText+'\n\n\n'

for result in results:
    outputText = scrape(result.url)
    outputFile.write(outputText)

outputFile.close()

In [18]:
!head -4 tutorialOutput.txt



WASHINGTON - He was spirited out of a federal prison on Friday under cover of night, eluding witnesses in a cloak-and-dagger coda to a spy story that has strained relations between two allies for three decades.But while Jonathan J. Pollard, one of the most notorious spies of the late Cold War, tried to stay out of sight after emerging from custody almost as if from a time machine, the United States and Israel hoped his release would finally heal a long-festering open wound in their partnership.For 30 years, Mr. Pollard was at the center of a profound struggle between Washington and Jerusalem, one that shadowed American presidents and Israeli prime ministers since Ronald Reagan was in the White House. The Americans called him a traitor. The Israelis deemed him a soldier, to some a hero. At times, both made him a diplomatic bargaining chip.The only American ever sentenced to life in prison for spying for an ally, Mr. Pollard was freed on parole to an uncertain future. After ducking
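The run-together sentences in the output (`decades.But`) come from concatenating the paragraphs with no separator between them. Here is a minimal Python 3 sketch of one way to keep the paragraph breaks. It works on a hand-written, well-formed fragment and uses `xml.etree.ElementTree` instead of `lxml`; the `itemprop="articleBody"` selector is the one from `scrape`, and the article text is made up.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed fragment (hypothetical article text).
page = """<html><body>
  <p class="byline">By A. Reporter</p>
  <p itemprop="articleBody">First paragraph.</p>
  <p itemprop="articleBody">Second <b>bold</b> paragraph.</p>
</body></html>"""

doc = ET.fromstring(page)

# Same selector idea as in scrape(): only <p> tags marked as article body.
paragraphs = doc.findall(".//p[@itemprop='articleBody']")

# itertext() walks nested tags too, similar in spirit to lxml's text_content().
article = '\n\n'.join(''.join(p.itertext()) for p in paragraphs)
print(article)
```

Joining with a blank line keeps the paragraph boundaries that plain `+=` concatenation loses.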

**Scraping arbitrary websites**

The final approach is for a website that contains the information you want, with the pages spread around in a fairly consistent manner, but with no simple, straightforward scheme for how the pages are named.

I'll offer a brief aside here to mention that it is often possible to make slight modifications to the URL of a website and obtain many different pages. For example, a website that contains Indian parliament speeches has the URL `http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl=` with differing values appended after the `=`. Thus, using a `for` loop allows for the programmatic creation of different URLs. Some sample code is below.

In [19]:
url = 'http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl='

for i in xrange(5175,5973):
    newUrl = url + str(i)
    print 'Scraping: %s' % newUrl

Scraping: http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl=5175
Scraping: http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl=5176
Scraping: http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl=5177
Scraping: http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl=5178
Scraping: http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl=5179
Scraping: http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl=5180
Scraping: http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl=5181
Scraping: http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl=5182
Scraping: http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl=5183
Scraping: http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl=5184
Scraping: http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl=5185
Scraping: http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl=5186
Scraping: http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl=5187
Scraping: http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl=5188
Scrapi
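Turning that loop into an actual scrape mostly means fetching each URL, pausing between requests, and not letting one bad page stop the run. Here is a rough Python 3 sketch under those assumptions. `page_urls` and `scrape_all` are names made up for this sketch, and `fetch` is supplied by the caller (for example a thin wrapper around `requests.get`), so nothing is downloaded here.

```python
import time

BASE = 'http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl='

def page_urls(start, stop, base=BASE):
    """Yield one URL per id in [start, stop)."""
    for i in range(start, stop):
        yield base + str(i)

def scrape_all(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests.

    Failures are recorded instead of stopping the loop.
    """
    pages, failed = {}, []
    for url in urls:
        try:
            pages[url] = fetch(url)
        except Exception as err:   # broad on purpose: any failure is logged and skipped
            failed.append((url, err))
        time.sleep(delay)
    return pages, failed

# Dry run with a fake fetcher and no delay.
pages, failed = scrape_all(page_urls(5175, 5178),
                           fetch=lambda u: 'html for ' + u, delay=0)
print(sorted(pages))
```

Passing the fetcher in as an argument makes the loop easy to try out before pointing it at a real server.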

Getting back on topic, it is often more difficult than the above to iterate over numerous webpages within a site. This is where the `Scrapy` library comes in. `Scrapy` allows for the creation of web spiders that crawl over a website, following any links that they find. This is often far more difficult to implement than a simple scraper since it requires the identification of rules for link following. The [State Department](http://www.state.gov/r/pa/prs/dpb/2012/index.htm) offers a good example. I don't really have time to go into the depths of writing a `Scrapy` spider, but I thought I would put up some code to illustrate what it looks like.

In [None]:
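# Note: these import paths are from the Scrapy 0.x series; later releases moved
# CrawlSpider and Rule to scrapy.spiders and replaced SgmlLinkExtractor with
# scrapy.linkextractors.LinkExtractor.
from scrapy.contrib.spiders import CrawlSpider, Rule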
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from BeautifulSoup import BeautifulSoup
import re
import codecs

class MySpider(CrawlSpider):
    name = 'statespider'  # the name used to refer to this spider
    start_urls = ['http://www.state.gov/r/pa/prs/dpb/2010/index.htm',
    ]  # the URL that the spider should start on; adjust the year as needed

    # the rules for the spider
    rules = (
        # follow only links within the navigation panel that have /year/ (here /2010/) in them
        Rule(SgmlLinkExtractor(allow=('/2010/',), restrict_xpaths=('//*[@id="local-nav"]',))),

        # follow links within the calendar on the index page for the individual
        # years, while denying any links with /video/ in them
        Rule(SgmlLinkExtractor(restrict_xpaths=('//*[@id="dpb-calendar"]',), deny=('/video/',)), callback='parse_item'),
    )

    def parse_item(self, response):
        # logs the response.url in the terminal to help with debugging
        self.log('Hi, this is an item page! %s' % response.url)

        # Insert code to scrape page content here. That code must define
        # `texts` (the scraped text) and `filename` (where to save it);
        # neither is defined in this skeleton.

        # opens `filename` and writes `texts` using utf-8
        with codecs.open(filename, 'w', encoding='utf-8') as output:
            output.write(texts)
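The `allow` and `deny` arguments in the rules are regular expressions matched against each link's URL. Here is a plain Python 3 sketch of that filtering idea. It is not Scrapy's implementation, it ignores `restrict_xpaths` entirely, and both the function name `follow` and the links are made up for illustration.

```python
import re

def follow(links, allow=(), deny=()):
    """Keep links matching any `allow` pattern and no `deny` pattern.

    An empty `allow` means "allow everything".
    """
    kept = []
    for link in links:
        allowed = not allow or any(re.search(p, link) for p in allow)
        denied = any(re.search(p, link) for p in deny)
        if allowed and not denied:
            kept.append(link)
    return kept

# Hypothetical links, made up to resemble the site's URL layout.
links = [
    'http://www.state.gov/r/pa/prs/dpb/2010/01/index.htm',
    'http://www.state.gov/r/pa/prs/dpb/2011/01/index.htm',
    'http://www.state.gov/r/pa/prs/dpb/2010/video/briefing.htm',
]

only_2010 = follow(links, allow=(r'/2010/',))
no_video = follow(links, deny=(r'/video/',))
print(only_2010)
print(no_video)
```

Trying candidate patterns on a handful of real links like this is a cheap way to check the rules before running a full crawl.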


### Exercise
Google News provides RSS feeds that can be filtered, at the source, according to the country, the language, and a query term. See [this description](http://thinktostart.com/creating-custom-rss-feeds-with-google/).

Create a feed for all Spanish news about San Diego and print the latest 5 headlines and descriptions.


In [27]:
import pattern.web
num = 5
url = 'https://news.google.com/news/feeds?pz=1&cf=all&ned=es&hl=us&q=san+diego&output=rss'
results = pattern.web.Newsfeed().search(url, count=num)

print 'The current top spanish headers about San Diego are:'
for i in range(num):
    print "%d\t%s"%(i,results[i].title)

for i in range(num):
    print '\n\n Description for "%s" is: \n\n %s' % (results[i].title, pattern.web.plaintext(results[i].description))

The current top spanish headers about San Diego are:
0	Chargers 31-25 Jaguars: Rivers lanza 4 de TD y San Diego sale de ... - Univisión
1	Continúa clima frío en San Diego - SanDiegoRed
2	San Diego se pinta de blanco con la primera nevada de la temporada - Noticieros Televisa
3	Bomberos de Tijuana y San Diego, preparados para actos terroristas - Noticieros Televisa
4	Visita a nuestras filiales de Cancún y San Diego - Diario Judio (blog)


 Description for "Chargers 31-25 Jaguars: Rivers lanza 4 de TD y San Diego sale de ... - Univisión" is: 

 Univisión

Chargers 31-25 Jaguars: Rivers lanza 4 de TD y San Diego sale de ...
Univisión
JACKSONVILLE, Florida - Los San Diego Chargers salieron de la mala racha de 6 derrotas consecutivas luego de vencer 31 - 25 a los Jacksonville Jaguars. Philip Rivers lanzó para 300 yardas con 4 pases de touchdown. Uno a Dontrelle Inman de 2 yardas; ...
Chargers de San Diego visitan a Jaguars de JacksonvilleFrontera.info

los 23 artículos informativos »


 Des
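The feed URL in the solution is just a base address plus query parameters, so it can be assembled rather than typed by hand. Here is a small Python 3 sketch using `urllib.parse.urlencode`. The parameter names and values are copied from the URL above; the function and argument names are made up, the comments on what `ned` and `hl` mean are my reading rather than anything documented here, and whether the endpoint still responds is not checked.

```python
from urllib.parse import urlencode

BASE = 'https://news.google.com/news/feeds'

def news_feed_url(query, edition='es', language='us'):
    """Build a feed URL with the same parameters as the cell above."""
    params = [
        ('pz', '1'),
        ('cf', 'all'),
        ('ned', edition),    # presumably the news edition; 'es' in the exercise
        ('hl', language),    # presumably the interface language
        ('q', query),        # urlencode turns spaces into '+'
        ('output', 'rss'),
    ]
    return BASE + '?' + urlencode(params)

url = news_feed_url('san diego')
print(url)
```

Changing the query term is then a one-argument change, e.g. `news_feed_url('tijuana')`.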

##The Pitfalls of Webscraping

Web scraping is much, much, *much* more of an art than a science. It is often non-trivial to identify the XPath selector that will get you what you want. Also, some web programmers can't seem to decide how they want to structure the pages they write, so they just change the HTML every few pages. Notice that for the NYT example, if `articleBody` gets changed to `articleBody1`, everything breaks. There are ways around this, but they are often convoluted, messy, and hackish. Usually, however, where there is a will there is a way.
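One of the less messy workarounds is to try several selectors in order and use the first one that matches. Here is a Python 3 sketch of that idea on a hand-written page, again using `xml.etree.ElementTree` as a stand-in for `lxml`. Only the first selector comes from the NYT example; the other two are hypothetical fallbacks.

```python
import xml.etree.ElementTree as ET

# Candidate selectors, tried in order. Only the first comes from the
# NYT example; the others are hypothetical fallbacks.
SELECTORS = [
    ".//p[@itemprop='articleBody']",
    ".//p[@itemprop='articleBody1']",
    ".//div[@class='story']/p",
]

def extract_paragraphs(doc, selectors=SELECTORS):
    """Return (selector, texts) for the first selector that matches anything."""
    for selector in selectors:
        nodes = doc.findall(selector)
        if nodes:
            return selector, [''.join(n.itertext()) for n in nodes]
    return None, []   # nothing matched: the caller should log or flag the page

# Hand-written page where the attribute has been renamed.
page = ET.fromstring(
    "<html><body><p itemprop='articleBody1'>Renamed markup.</p></body></html>"
)
used, texts = extract_paragraphs(page)
print(used, texts)
```

Returning which selector matched makes it easier to notice when a site has quietly changed its markup.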

...brief pause to demonstrate the lengths this can go to.

##PITF Human Atrocities

As a wrap up, I show the workflow I have been using to perform real-time scraping from various news sites of stories pertaining to human atrocities. This is illustrative both of web scraping and of the issues that can accompany programming. 

The general flow of the scraper is:

RSS feed -> identify relevant stories -> scrape story -> place results in mongoDB -> repeat every hour
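The flow above can be sketched as a single function that is run once per hour. This Python 3 sketch is not the actual PITF scraper. The keyword list, the feed items, and the function names are made up, a plain dict stands in for the MongoDB collection, and the scraping step is passed in by the caller.

```python
def run_once(feed_items, is_relevant, scrape, store):
    """One pass of the pipeline: filter feed items, scrape, store new ones."""
    added = 0
    for item in feed_items:
        if not is_relevant(item):
            continue
        if item['url'] in store:        # skip stories saved on an earlier pass
            continue
        store[item['url']] = {'title': item['title'],
                              'text': scrape(item['url'])}
        added += 1
    return added

# Hypothetical keyword filter and feed items, for illustration only.
KEYWORDS = ('massacre', 'killed', 'atrocity')

def is_relevant(item):
    title = item['title'].lower()
    return any(word in title for word in KEYWORDS)

feed_items = [
    {'title': 'Dozens killed in attack', 'url': 'http://example.com/a'},
    {'title': 'Local team wins final', 'url': 'http://example.com/b'},
]

def fake_scrape(url):
    # stands in for a real page scrape
    return 'text of ' + url

store = {}   # a dict standing in for a MongoDB collection
first = run_once(feed_items, is_relevant, fake_scrape, store)
second = run_once(feed_items, is_relevant, fake_scrape, store)
print(first, second, list(store))
```

The second pass adds nothing because the story is already stored, which is the behaviour you want from a job that reruns every hour on a feed that changes slowly.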
