# Scraping

Today we'll talk about "scraping": how to get unstructured data and turn it into something usable. We'll primarily focus on _web scraping_. Python has mature tools that make this pretty easy.

The basic workflow is:

1. Find the data you want on the web.
2. Inspect the webpage and figure out how to select the content you want. This usually involves some combination of
    - Viewing the source code of the page (especially if it is simple), and
    - Figuring out the structure of the HTML parse tree.  This step is much easier with something like __Chrome Developer Tools__.
3.  Write code to get out what you want:
    - If the page is very simple, treat it as a bunch of text => __string manipulation / [regular expressions](https://docs.python.org/2/howto/regex.html)__ in Python.
    - If the page is more complicated (and/or written in good style), we want to use the HTML parse tree => __[BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) / [lxml](http://lxml.de/lxmlhtml.html)__ in Python.
4.  Make sure it worked!
5.  If your crawling problem is at all non-trivial, you will now have to go back to Step 2 to zoom in further -- or you'll have parsed the URL of a link you want to follow, in which case you'll go back to Step 1 to figure out how to parse what you want from the new target page.
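As a rough illustration of the string-manipulation route from Step 3, here is a minimal sketch that pulls link targets out of a tiny page with a regular expression. The markup in `SIMPLE_PAGE` and the helper `extract_links` are made up for this example; the approach only holds up when the markup is very regular.

```python
import re

# A tiny, made-up page; real pages are rarely this regular.
SIMPLE_PAGE = """
<ul>
  <li><a href="/tech/1">First technology</a></li>
  <li><a href="/tech/2">Second technology</a></li>
</ul>
"""

# Capture the href value and the link text of each <a> tag.
LINK_RE = re.compile(r'<a\s+href="([^"]+)"\s*>([^<]*)</a>')

def extract_links(html):
    """Return a list of (href, text) pairs found in html."""
    return [(href, text.strip()) for href, text in LINK_RE.findall(html)]

print(extract_links(SIMPLE_PAGE))
```

As soon as attributes change order or tags nest, a pattern like this breaks, which is why the parse-tree tools are usually the safer choice.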

## Example

As an example, suppose we want to crawl the list of "Available Technologies" being licensed by MIT at http://technology.mit.edu and store their basic info; their associated patents; and the reference counts on their associated patents.

### Step 1: Okay, let's go to that URL.

- _First try_:  Aha, a list of links on the right.  Let's click on a few -- what do we see?  Many are empty, the categories are not obviously mutually exclusive, okay.  Maybe there's a better way.
- _Second try_: Let's just search for all technologies at http://technology.mit.edu/technologies.  Okay, better, but it only gives us 50 at a time.  We could just combine the four pages, that's fine.  Let's click on page 2 to see what happens.
- _Third try_: Aha, the URL for page 2 is http://technology.mit.edu/technologies?limit=50&offset=50&query=.  That looks like we can just specify a higher limit and offset 0 and get the whole thing.
- _Final answer_: Indeed, http://technology.mit.edu/technologies?limit=1000 has a giant list.
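If the server had not accepted a large `limit`, combining the pages would have been the fallback. A minimal sketch of that, assuming the `limit` and `offset` parameters behave as seen in the third try; the helper `page_urls` and the total of 200 entries are assumptions made for illustration:

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

BASE_URL = "http://technology.mit.edu/technologies"

def page_urls(total, page_size=50):
    """Build one listing URL per page of page_size results."""
    urls = []
    for offset in range(0, total, page_size):
        query = urlencode([("limit", page_size), ("offset", offset)])
        urls.append(BASE_URL + "?" + query)
    return urls

# 200 is an assumed total, chosen to match "four pages" of 50.
for url in page_urls(200):
    print(url)
```

Each URL could then be fetched and parsed in turn, at the cost of four requests instead of one.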

In [3]:
import requests

url = "http://technology.mit.edu/technologies"
response = requests.get(url, params={"limit": 1000})
print response.url
print response.text[:1000]

http://technology.mit.edu/technologies?limit=1000
<!DOCTYPE html><html lang="en"><head><title>Select Technologies - MIT Technology Licensing Office - MIT Patents and Technologies Available for Licensing - Massachusetts Institute of Technology</title><link href="http://technology.mit.edu/technologies.rss" rel="alternate" title="Select Technologies - MIT Technology Licensing Office - MIT Patents and Technologies Available for Licensing - Massachusetts Institute of Technology" type="application/rss+xml" /><link href="http://technology.mit.edu/technologies.atom" rel="alternate" title="Select Technologies - MIT Technology Licensing Office - MIT Patents and Technologies Available for Licensing - Massachusetts Institute of Technology" type="application/atom+xml" /><meta charset="UTF-8" /><meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible" /><script type="text/javascript">var NREUMQ=NREUMQ||[];NREUMQ.push(["mark","firstbyte",new Date().getTime()]);</script><link href="http://dqyfmjbs

**A quick introduction to HTML and the DOM**

To get started:

- Pull up http://technology.mit.edu/technologies?limit=1000 in Chrome.  
- Open __View->Developer->Developer Tools__.  
- Right click on one of the technology titles, and choose __"Inspect Element"__.

What are we looking at?  Well, this is the structure of the webpage: nested _tags_ of different _types_, each having a variety of _attributes_.

What we learned above:

  - All of the technologies are underneath ("_descendants of_") `<div class="search" id="nouvant-portfolio-content">`
  - In fact, each of them is in its own `<div class="technology" data-images="true" id="technology_XXXX">`
  
Now we're ready to move on:

### Step 2: Extract Information
Now, we need to parse the raw HTML and actually grab the links of detailed info. The two main parser libraries in Python are `BeautifulSoup` and `lxml`. `lxml` is much faster (it leverages several C libraries), but it's also worse at dealing with malformed, crummy HTML. Because parsing speed isn't our bottleneck here, we'll use `BeautifulSoup`.

In [11]:
from bs4 import BeautifulSoup
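# Name the parser explicitly; otherwise bs4 picks whichever one is installed and warns about it.
soup = BeautifulSoup(response.text, "html.parser")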
print soup.prettify()

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   Select Technologies - MIT Technology Licensing Office - MIT Patents and Technologies Available for Licensing - Massachusetts Institute of Technology
  </title>
  <link href="http://technology.mit.edu/technologies.rss" rel="alternate" title="Select Technologies - MIT Technology Licensing Office - MIT Patents and Technologies Available for Licensing - Massachusetts Institute of Technology" type="application/rss+xml"/>
  <link href="http://technology.mit.edu/technologies.atom" rel="alternate" title="Select Technologies - MIT Technology Licensing Office - MIT Patents and Technologies Available for Licensing - Massachusetts Institute of Technology" type="application/atom+xml"/>
  <meta charset="utf-8"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <script type="text/javascript">
   var NREUMQ=NREUMQ||[];NREUMQ.push(["mark","firstbyte",new Date().getTime()]);
  </script>
  <link href="http://dqyfmjbsr92rj.cloudfront

In [24]:
parent_div = soup.find('div', attrs={'id': 'nouvant-portfolio-content'}) #Find (at most) *one*
tech_divs = parent_div.find_all('div', attrs={'class':'technology'})  #Find *all*
print len(tech_divs)
print tech_divs[0].prettify()

78
<div class="technology" data-images="false" id="technology_8914">
 <h2>
  <a href="/technologies/14947-15811_novel-puncture-access-mechanism-with-pre-loading-functionality">
   Novel Puncture Access Mechanism with Pre-Loading Functionality
  </a>
 </h2>
 <p>
  <strong>
   14947-15811
  </strong>
  –
  <span>
   Laparoscopic surgery (trocar); veress needles; venous access needles for catheter placement; epidural or spinal tap needles; lung puncture devices to correct collapsed lungs 

  

 In medical puncture access procedures, a device is inserted axially through layers of tissue to gain access to blood vessels, ducts, or body cavities.  Although highly common, these procedures are risky due to...
   <a href="/technologies/14947-15811_novel-puncture-access-mechanism-with-pre-loading-functionality">
    Read More
   </a>
  </span>
 </p>
</div>



**Introduction to CSS selectors**

This pattern of nested finds, based on tag type, id, and class, is very common. It's so common that there are two special convenience languages for such traversals: [CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp) and [XPath](http://www.w3schools.com/xpath/) (which works for all XML, not just HTML). We'll be using CSS selectors, which are more common for HTML and easier to learn.

With CSS selectors, we can write the above in a more concise and expressive way:

```python
tech_divs = soup.select('div#nouvant-portfolio-content  div.technology')
```

All selectors work like `find_all`.  Some basic building blocks of selectors are:

 - `'mytag'` picks out all tags of type `mytag`.
 - `'#myid'` picks out all tags whose _id_ is equal to `myid`.
 - `'.myclass'` picks out all tags whose _class_ list includes `myclass`.
 - `'mytag#myid'` will pick all tags of type `mytag` **and** `id` equal to `myid` (analogously for `'mytag.myclass'`).
 - If `'selector1'` and `'selector2'` are two selectors, then there is another selector `'selector1 selector2'`.  It picks out all tags satisfying `selector2` that are __descendants__(*) of something satisfying `selector1`, i.e., it's like our nested find.
 
 (*) It doesn't have to be a _direct_ descendant.  I.e., it can be a great-great-..-grandchild of something satisfying `selector1`.  For direct descendants (children) we'd instead write `'selector1 > selector2'`.
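To see these building blocks side by side, here is a small sketch run against made-up markup. The `HTML` string and the `texts` helper are assumptions for this example only, and the parser is named explicitly so the result does not depend on which parsers happen to be installed.

```python
from bs4 import BeautifulSoup

# Made-up markup, used only to exercise the selectors.
HTML = """
<div id="outer">
  <p class="note">direct child</p>
  <section>
    <p class="note">grandchild</p>
    <p>plain paragraph</p>
  </section>
</div>
<p class="note">outside the div</p>
"""

soup = BeautifulSoup(HTML, "html.parser")

def texts(selector):
    """Return the text of every tag matched by a CSS selector."""
    return [tag.get_text() for tag in soup.select(selector)]

print(texts("p.note"))            # every <p> with class "note"
print(texts("div#outer p.note"))  # only those somewhere inside the div
print(texts("div#outer > p"))     # only direct children of the div
```

The second selector keeps the grandchild but drops the paragraph outside the div, while the third keeps only the direct child.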
 
Let's just explain how this applies to our example:

1.  Let's start with the first half
        >    tech_divs = soup.select('div#nouvant-portfolio-content  div.technology')
        >                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This picks out all 'div' tags with id 'nouvant-portfolio-content'.
2.  Then the second half
        >    tech_divs = soup.select('div#nouvant-portfolio-content  div.technology')
        >                                                            ^^^^^^^^^^^^^^
This picks out all 'div' tags with class 'technology'.
3.  Finally the whole thing
        >    tech_divs = soup.select('div#nouvant-portfolio-content  div.technology')
        >                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
does exactly the same as our nested find above!

In [25]:
#parent_div = soup.find('div', attrs={'id': 'nouvant-portfolio-content'}) #Find (at most) *one*
tech_divs = soup.select('div#nouvant-portfolio-content  div.technology')  #Find *all*
print len(tech_divs)

78


Let's check out what we've zoomed in on:

In [26]:
print tech_divs[0].prettify()

<div class="technology" data-images="false" id="technology_8914">
 <h2>
  <a href="/technologies/14947-15811_novel-puncture-access-mechanism-with-pre-loading-functionality">
   Novel Puncture Access Mechanism with Pre-Loading Functionality
  </a>
 </h2>
 <p>
  <strong>
   14947-15811
  </strong>
  –
  <span>
   Laparoscopic surgery (trocar); veress needles; venous access needles for catheter placement; epidural or spinal tap needles; lung puncture devices to correct collapsed lungs 

  

 In medical puncture access procedures, a device is inserted axially through layers of tissue to gain access to blood vessels, ducts, or body cavities.  Although highly common, these procedures are risky due to...
   <a href="/technologies/14947-15811_novel-puncture-access-mechanism-with-pre-loading-functionality">
    Read More
   </a>
  </span>
 </p>
</div>



Now we're ready to pull out some key pieces of info:

- The technology's "title" (the text in the `<a>` element)
- The link to follow for more info on the technology (the _href_ attribute of the `<a>`)
- And a short blurb about the technology (in the `<span>`)

Let's write some code to extract this.  But before we do, let's discuss what _form_ the output should take: It is often convenient to store data in a dictionary (i.e. as a _key-value_ hashtable), in other words to name the bits of data you are collecting.  One big advantage is that this makes it easier to add in extra fields progressively.

Let's see what the code looks like:

In [10]:
firsta = tech_divs[0].find('a')
print firsta.text
print firsta['href']


/technologies/15820_a-low-cost-passive-alignment-system-for-micro-and-nano-imprinting


In [29]:
## 
# We're going to use a "named tuple" to store our key-value data.
# We could also have used a dictionary, with strings as keys.
# Named tuples have some advantages
#  - Better notation with autocomplete, x.field_name instead of x['field_name']
#  - If you change your object structure later and fail to update your
#    code to include the new fields, this will make it easier to find.
#  - They are immutable, preventing certain sorts of bugs.
# .. and some disadvantages:
#  - If you want to augment object structure you need a new type
#    (or to go back and update the code that builds it)
#  - They are immutable.
##
from collections import namedtuple
TechBasic = namedtuple('TechBasic', 'title, url, short')

def td_info(td):
    la = td.select('h2 > a')
    ls = td.select('span')
    if len(la) != 1 or len(ls) != 1:
        print "Uh oh! We did something wrong for:"
        print "\n".join(">>> " + line for line in td.prettify().split("\n"))
        return
    return TechBasic(title = la[0].text, url=la[0]['href'], short=ls[0].text)

# Call td_info once per div, then drop the ones that failed to parse.
tech_links = [info for info in map(td_info, tech_divs) if info is not None]

print tech_links[0].url

/technologies/14947-15811_novel-puncture-access-mechanism-with-pre-loading-functionality
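The trade-offs listed in the comment block above can be seen in a small sketch. The field values below are placeholders, not scraped data; only `TechBasic` comes from the cell above.

```python
from collections import namedtuple

TechBasic = namedtuple('TechBasic', 'title, url, short')

# Placeholder values, not real scraped data.
as_tuple = TechBasic(title="Example title", url="/technologies/example", short="Example blurb")
as_dict = {"title": "Example title", "url": "/technologies/example", "short": "Example blurb"}

# Attribute access versus key access.
print(as_tuple.title)
print(as_dict["title"])

# A dict accepts a new field at any time...
as_dict["patents"] = []

# ...while a namedtuple refuses to be modified in place.
try:
    as_tuple.title = "Changed"
except AttributeError as err:
    print("immutable: %s" % err)

# _replace returns a modified copy and leaves the original untouched.
renamed = as_tuple._replace(title="Changed")
print(renamed.title)
print(as_tuple.title)
```

Adding a `patents` field to the namedtuple version would mean defining a new type, which is the route the next cell takes with `TechDetailed`.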


In [30]:
from urlparse import urljoin

Patent = namedtuple('Patent', 'name url')
TechDetailed = namedtuple('TechDetailed', 'tech_basic, patents')

url_base="http://technology.mit.edu/"

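def get_tech_details(response, tech_basic):
    # Name the parser explicitly so the result does not depend on what is installed.
    soup = BeautifulSoup(response.text, "html.parser")
    patents = [Patent(name=a.text, url=a["href"])
               for a in soup.select('dd.us_patent_issued a')]
    # Note: the tech_basic field is filled with the resolved page URL here;
    # the tech_basic argument itself is not used yet.
    return TechDetailed(tech_basic=response.url, patents=patents)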

tech_details = [get_tech_details(requests.get(urljoin(url_base, tech_basic.url)),
                                 tech_basic)
               for tech_basic in tech_links[:2]]
print tech_details

[TechDetailed(tech_basic=u'http://technology.mit.edu/technologies/14947-15811_novel-puncture-access-mechanism-with-pre-loading-functionality', patents=[Patent(name=u'US Patent 8,419,764', url='http://google.com/patents/US8419764'), Patent(name=u'US Patent 9,039,723', url='http://google.com/patents/US9039723'), Patent(name=u'US Patent 8,894,679', url='http://google.com/patents/US8894679')]), TechDetailed(tech_basic=u'http://technology.mit.edu/technologies/16302_method-for-predicting-patient-outcomes-from-a-quasi-periodic-patient-signal', patents=[])]
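The cell above builds each request URL with `urljoin` rather than plain string concatenation. A short sketch of why that matters; the path `/technologies/example` is a made-up placeholder:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as in the cell above

url_base = "http://technology.mit.edu/"

# A path starting with "/" replaces the whole path of the base URL.
print(urljoin(url_base, "/technologies/example"))

# Plain concatenation would produce a double slash instead.
print(url_base + "/technologies/example")

# An absolute URL is returned unchanged, so off-site links survive.
print(urljoin(url_base, "http://google.com/patents/US8419764"))
```

The last case is relevant for the patent links, which point to a different site than the listing.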


**Note**: 
In the last code segment, we only fetched details for the first two technologies.  If we try to get them all this way, it'll take a while.  Run the next cell for as long (or not) as you wish, and when you get bored use _Kernel->Interrupt_ to stop it.

The problem is that connecting to a remote server and fetching the pages takes a while. Scraping web pages is usually _IO-bound_ and not CPU-bound (that is, we spend most of our time waiting for data and not processing it). Fortunately, Python gives us lots of different ways to deal with this problem.

We'll be using [requests-futures](https://github.com/ross/requests-futures), a nice wrapper combining `requests` with `concurrent.futures`. For a faster, though harder to debug alternative, you can look at [grequests](https://github.com/kennethreitz/grequests).

In [31]:
urls = [urljoin(url_base, tech_basic.url) for tech_basic in tech_links]

In [14]:
# %%timeit
# Slow version

# tech_details = [get_tech_details(requests.get(url), tech_basic)
#                 for url, tech_basic in zip(urls, tech_links)]

In [32]:
# %%timeit
# requests-futures
from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=15)
futures = [session.get(url) for url in urls]

tech_details = [get_tech_details(future.result(), tech_basic)
                for future, tech_basic in zip(futures, tech_links)]
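To see why futures help with IO-bound work without touching the network, here is a minimal sketch using `concurrent.futures` directly (standard library in Python 3; Python 2 needs the `futures` backport). The function `fake_fetch` is a stand-in that sleeps instead of making a request, and the example.com URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    """Stand-in for requests.get: sleep to mimic network latency."""
    time.sleep(0.05)
    return "response for " + url

urls = ["http://example.com/page/%d" % i for i in range(10)]

# One request after another: total time is roughly the sum of the waits.
start = time.time()
sequential = [fake_fetch(url) for url in urls]
sequential_time = time.time() - start

start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    # submit() returns a future immediately; the work runs in a worker thread.
    futures = [pool.submit(fake_fetch, url) for url in urls]
    # result() blocks until that particular call has finished.
    threaded = [future.result() for future in futures]
threaded_time = time.time() - start

print("sequential: %.2fs, threaded: %.2fs" % (sequential_time, threaded_time))
```

Because the futures are collected in submission order, the results line up with `urls`, which is what lets the cell above `zip` futures with `tech_links`.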

Another option is to use Python's [multiprocessing](https://docs.python.org/2/library/multiprocessing.html) interface, which can easily parallelize a map:

In [None]:
from multiprocessing import Pool
p = Pool(15)    # Use 15 processes
responses = p.map(requests.get, urls)
tech_details = [get_tech_details(response, tech_basic)
                for response, tech_basic in zip(responses, tech_links)]
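Since the work is IO-bound, a pool of threads is often enough. A hedged sketch of the same `map` pattern using `multiprocessing.pool.ThreadPool`, with a made-up `fake_fetch` standing in for `requests.get` so that it runs offline:

```python
from multiprocessing.pool import ThreadPool

def fake_fetch(url):
    """Stand-in for requests.get; returns a string instead of a response."""
    return "response for " + url

urls = ["http://example.com/page/%d" % i for i in range(5)]

pool = ThreadPool(3)   # 3 worker threads rather than processes
try:
    responses = pool.map(fake_fetch, urls)
finally:
    pool.close()
    pool.join()

# map() returns results in input order, so zip() pairs them up correctly.
for url, response in zip(urls, responses):
    print("%s -> %s" % (url, response))
```

Threads avoid having to pickle arguments and results between processes, which can matter when the results are large response objects.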

**Exercise**:

Let's put all of that together.  Write a function 
```python
def get_tech_basics(url):
    ...
```

that returns a `TechBasic` for each technology on the page.  Combine this with the pooled requests to `get_tech_details` to obtain a list of `TechDetailed` objects.

**Fin.**
That's it, we now have a basic not-entirely-trivial example. We took some detours along the way; the consolidated code, without those detours, is in the Spoilers section at the end.

**Exercises:**

1. Modify "get_tech_details" to get other interesting information on the technology, like a long form description and/or the authors' names.  (You'll also want to modify TechDetailed.  Do that first and note that now the code breaks when it tries to construct a TechDetailed with the wrong number of fields.)

2. Modify "get_tech_details" to try to follow the link and to get more information on the patent -- for instance when it was filed and granted, or how many other patents reference it.  (Warning: The patent web site is much less regular than MIT's!)

**Coda / More complicated example**:
Suppose we had picked Stanford instead of MIT.  Let's try to do the same thing (it's a bit harder to get a good listing URL, so I just downloaded one).

In [None]:
from collections import namedtuple
from urlparse import urljoin

from bs4 import BeautifulSoup, Comment
import requests

with open("../small_data/Stanford-Tech-Listing.html", "r") as fin:
    soup = BeautifulSoup(fin)

In [None]:
#BeautifulSoup doesn't seem to support 'or' selectors, so:
selector = lambda x: x.has_attr("id") and x["id"].startswith("output_row")
tech_rows = soup.find_all(selector)[1:]
# Alternate -- showing how to go up and down the tree
#tech_rows = soup.find('tr', attrs={'id':'output_row_1'}).parent.findAll('tr')[1:]
print len(tech_rows)
print tech_rows[0].prettify()
print tech_rows[-1].prettify()

**Details**: Let's quickly break down that last cell for two bits of Python syntax that we haven't explicitly talked about:
    >    selector = lambda x: x.has_attr('id') and x['id'].startswith('output_row')
This is a "lambda expression" -- a short, inline, unnamed function. Lambdas are pretty limited, so you should define a named function for anything complicated.  

    >    tech_rows = soup.find_all(selector)[1:]      
                                            ^^^^
This is list slice notation (we already used this above with `[:2]`!).  In this case, we're taking all but the zeroth entry (which is a list header).
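Both pieces of syntax can be tried on a plain list. The `rows` list below is made up; its first entry plays the role of the header row.

```python
rows = ["header", "row 1", "row 2", "row 3"]

# A lambda and a named function that behave identically.
starts_with_row = lambda s: s.startswith("row")

def starts_with_row_named(s):
    return s.startswith("row")

print([r for r in rows if starts_with_row(r)])
print([r for r in rows if starts_with_row_named(r)])

# Slices: everything after the zeroth entry, the first two entries, the last entry.
print(rows[1:])
print(rows[:2])
print(rows[-1])
```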

**UH OH**: 
When originally preparing this, I was using Anaconda.  The same code only showed about _254_ of the _1727_ entries -- BeautifulSoup was incorrectly parsing the file.  These sorts of things are not entirely uncommon, so sometimes it helps to double-check.
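One cheap way to double-check is to count matches in the raw text and compare against what the parser found. The helper `count_raw_rows` and the `RAW` snippet below are made up for illustration; in the notebook, the comparison value would come from the parsed rows instead of being hard-coded.

```python
import re

def count_raw_rows(raw_html, id_prefix="output_row"):
    """Count opening tags in the raw text whose id starts with id_prefix."""
    pattern = r'<\w+[^>]*\bid="%s' % re.escape(id_prefix)
    return len(re.findall(pattern, raw_html))

# Made-up markup standing in for the downloaded listing.
RAW = ('<table><tr id="output_row_0"><td>h</td></tr>'
       '<tr id="output_row_1"><td>a</td></tr></table>')

parsed_count = 2  # assumption: in the notebook this would be len(soup.find_all(selector))
raw_count = count_raw_rows(RAW)
if raw_count != parsed_count:
    print("Parser saw %d rows but the raw text has %d" % (parsed_count, raw_count))
else:
    print("Counts agree: %d" % raw_count)
```

A mismatch does not say which side is wrong, but it is a strong hint that the parser choked somewhere in the file.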

In [None]:
# Warning: This is hacky code!
TechBlurb = namedtuple('TechBlurb', 'docket techid url title')
def parse_tr(tr):
    link = tr.select("td.output_data a")[0]
    
    docket = link.text
    url = link["href"]
    techid = url.split("=")[1]
    title = tr.select("td.output_data")[2].text
    return TechBlurb(docket=docket, techid=techid, url=url, title=title)
tech_blurbs=map(parse_tr, tech_rows)

In [None]:
# And this isn't much better!
def find_comment_by_text_in(soup, comment_text):
    return soup.find(text=lambda text: isinstance(text, Comment) and comment_text in text)

TechDetailed = namedtuple('TechDetailed', 'blurb, abstract, similar')
SimilarHint = namedtuple('SimilarHint', 'techid, docket, title')
def get_tech_details(response, blurb):
    # We're doing a lot of chaining with implicit assumptions here -- 
    #   it might fail in all sorts of ways, in which case we give up.
    soup = BeautifulSoup(response.text)
    contents = soup.find_all('form')[1]
    abstract = find_comment_by_text_in(contents, 'Abstract').find_next_sibling('hr') \
                                                            .find('div') \
                                                            .text
    def parse_similar_tr(r):
        tds = r.find_all('td')
        if len(tds) < 3:
            return None
        return SimilarHint(
            techid = tds[0].find('a')['href'].split('=')[1], 
            docket = "S"+tds[0].text.strip(), 
            title  = tds[2].text.strip()
        )

    similar_trs = find_comment_by_text_in(contents, 'Similar Tech') \
                      .find_next_sibling('table') \
                      .find('div') \
                      .find('table') \
                      .find('table') \
                      .find_all('tr')
    similar = filter(None, [parse_similar_tr(tr) for tr in similar_trs])
    
    # blurb is an explicit argument rather than a global picked up from the driver loop.
    return TechDetailed(blurb=blurb, abstract=abstract, similar=similar)

In [None]:
## Since the point is to show that something goes wrong, let's not wait until the end!
# We fetch and parse one page at a time and stop at the first failure.

url_base="http://techfinder.stanford.edu/"

for blurb in tech_blurbs:
    response = requests.get(urljoin(url_base, blurb.url))
    try:
        get_tech_details(response, blurb)
    except Exception:
        print "Something went wrong!"
        break

**Remark**
When we run the above code, it tells us that [this technology](http://techfinder.stanford.edu/technology_detail.php?ID=30261) did not have a list of similar technologies.  But going to the web page shows that it does!  What went wrong?

In [None]:
url = 'http://techfinder.stanford.edu/technology_detail.php'
soup = BeautifulSoup(requests.get(url, params={"ID": 30261}).text)
contents = soup.find_all('form')[1]
print contents

If we go and look at the same part of the **raw** HTML, we find that there is no `</form>` there:

    >    <!--- Applications --->
    >    <h3>Applications</h3><br/>
    >    <ul><li>Imaging apoptosis<ul type="circle" style="margin-bottom:0in"></li><li>Research</li><li>Clinical<ul type="circle" style="margin-bottom:0in"></li><li>Monitor therapeutic efficacy in cancer patients</li><li>Anti-cancer drug selection</ul></ul></li></ul><br/>
    >    
    >    <!--- Advantages --->
    >    <h3>Advantages</h3><br/>
    >    <ul><li>High specificity for caspase-3 and -7</li><li>High sensitivity</li><li>Non-invasive</li><li>Biocompatible</li><li>Small size of probe allows:<ul type="circle" style="margin-bottom:0in"></li><li>Deep tissue penetration</li><li>More extensive biodistribution</ul></li><li>PET probes:<ul type="circle" style="margin-bottom:0in"></li><li>High tumor/muscle ratio in apoptotic tumors</li><li>High uptake value in apoptotic tumors</ul></li><li>Fluorescent probe:<ul type="circle" style="margin-bottom:0in"></li><li>Possess NIR spectral properties</ul></li><li>May help promote personalized cancer medicine</li><li>Potential for probe design strategy to be applied to other enzyme targets</li></ul><br/>

What there **is** is _malformed HTML_ that is bad enough to confuse BeautifulSoup.  (Note, however, that it's not nearly bad enough to confuse a web browser.)  If you look at more examples, you will find even worse ones -- a stray `</html>` in the middle of a document is not unheard of.  

To fix this, we can pre-"tidy" the page before feeding it to BeautifulSoup using **pytidylib**.

In [None]:
from tidylib import tidy_document
url='http://techfinder.stanford.edu/technology_detail.php'

tidy_page, __ = tidy_document(requests.get(url, params={"ID": 30261}).text)
soup = BeautifulSoup(tidy_page)
contents = soup.find_all('form')[1]
print contents

**Exercises**:

1. Go back and modify `get_tech_details` to use this 'tidy' approach.

2. Sometimes web servers are slow and/or unreliable, and sometimes your connection is.  If we were to run the above test twice, we'd probably find that some of the failures were just due to a connection error.  We didn't notice this because the _outer_ `try` / `except` is also catching these.  So: Modify `get_tech_details` to allow up to 3 retries. <br/>Bonus points if you actually look at what exceptions `requests` raises in those cases instead of using a general catch-all mechanism.  Alternate type of bonus points if you figure out how to do it using the `retrying` package.  You can test these by switching your internet connection on and off to simulate an unreliable connection.
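As a starting point for the retry exercise, here is one possible shape for a generic retry helper, shown against a simulated flaky function so it runs offline. The names `with_retries` and `flaky` are made up; with `requests`, one would likely pass `retry_on=(requests.exceptions.RequestException,)`, but wiring it into `get_tech_details` is left as the exercise.

```python
import time

def with_retries(func, max_tries=3, delay=0.0, retry_on=(IOError,)):
    """Call func(); on one of the retry_on exceptions, try again up to max_tries times."""
    for attempt in range(1, max_tries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_tries:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay)

# Simulated flaky operation: fails twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("simulated connection error")
    return "ok"

print(with_retries(flaky))
print(calls["count"])
```

Catching only the listed exception types means a genuine parsing bug still surfaces immediately instead of being retried three times.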

# Spoilers

In [1]:
from collections import namedtuple
from urlparse import urljoin

from bs4 import BeautifulSoup
from requests_futures.sessions import FuturesSession
import requests

# Getting the list of short 'blurbs' about the techs
TechBasic = namedtuple('TechBasic', 'title, url, short')
def get_tech_basics(url):
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    ## Get the list of tech blurbs
    tech_divs = soup.select('div#nouvant-portfolio-content  div.technology')

    ## Parse a single technology 'div' on the index page
    def td_info(td):
        la = td.select('h2 > a')
        ls = td.select('span')
        if len(la) != 1 or len(ls) != 1:
            print "Uh oh! We did something wrong"
            return
        return TechBasic(title=la[0].text, url=la[0]['href'], short=ls[0].text)
    
    # Drop the divs that failed to parse (td_info returned None).
    return [info for info in map(td_info, tech_divs) if info is not None]


# Adding in some details (just patent info, for now)
Patent = namedtuple('Patent', 'name url')
TechDetailed = namedtuple('TechDetailed', 'tech_basic, patents')

def get_tech_details(response, tech_basic):
    soup = BeautifulSoup(response.text, "html.parser")

    def patent_info(a):
        return Patent(name=a.text, url=a['href'])
    patents = [patent_info(a) for a in soup.select('dd.us_patent_issued a')]

    return TechDetailed(tech_basic=tech_basic, patents=patents)

## The main driver code:
tech_basics = get_tech_basics("http://technology.mit.edu/technologies?limit=1000")
url_base="http://technology.mit.edu/"
session = FuturesSession(max_workers=15)
futures = [session.get(urljoin(url_base, tech_basic.url))
           for tech_basic in tech_basics]
# Pair each response with the TechBasic it was requested for.
tech_details = [get_tech_details(future.result(), tech_basic)
                for future, tech_basic in zip(futures, tech_basics)]




In [None]:
print len(tech_details)
print tech_details[24]

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*