# CPS600 - Python Programming for Finance 
<img src="https://www.syracuse.edu/wp-content/themes/g6-carbon/img/syracuse-university-seal.svg?ver=6.3.9" style="width: 200px;"/>

# Web Scraping & Crawling

###  October 18, 2018

We'll look at some more tools for pulling data down off the web.

**Scraping**

For starters, we have mature tools for parsing HTML (and XML more broadly). Here is **lxml**.

In [1]:
# The usual imports
import pandas as pd
import numpy as np

In [4]:
# Since we are going to grab some html
import lxml.html as lh

# Simple way to make HTTP requests
import requests

URL = "https://en.wikipedia.org/wiki/HTML"
r = requests.get(URL)

Next, we take the response and turn it into an *element tree*.

In [5]:
tree   = lh.fromstring(r.text)

We can now get information by using the structure of the HTML.

In [None]:
tree.findtext('head/title')
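
To see what the element tree looks like without depending on a live page, here is a minimal sketch on a made-up HTML string. The names `toy_html` and `toy_tree` and the page content are assumptions for illustration only.

```python
import lxml.html as lh

# A tiny, made-up page so the example runs without a network request.
toy_html = (
    "<html>"
    "<head><title>Toy page</title></head>"
    "<body class='demo'><div id='top'><p>First paragraph.</p></div></body>"
    "</html>"
)

toy_tree = lh.fromstring(toy_html)

# The root element is <html>; iterating over it yields its child elements.
root_tag = toy_tree.tag
child_tags = [child.tag for child in toy_tree]

# findtext follows a simple path from the root and returns the text it finds.
title = toy_tree.findtext('head/title')

print(root_tag, child_tags, title)
```

The same calls work on the tree built from the Wikipedia response; only the size of the document changes.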

Furthermore, we can explore the structure.

In [None]:
childs = tree.xpath('child::*')
childs

The second child is the body of the document.

In [None]:
body = childs[1]

body.attrib['class']

Let's look for all div elements.

In [10]:
divList = body.xpath('div')

I wonder what is in them.

In [None]:
divList[0].values()

And how many?

In [None]:
len(divList)

Not that many: the path `div` only matches direct children of `body`. We must have meant *find all of them*:

In [13]:
allDivs = body.xpath('//div')

In [None]:
len(allDivs)
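
The difference between these path forms is easy to miss, so here is a small sketch on a made-up document (the HTML and the variable names are assumptions). Note that `//div` always searches the whole document, even when called on an inner element, while `.//div` searches only beneath the element it is called on.

```python
import lxml.html as lh

toy_html = (
    "<html><body>"
    "<div id='a'><div id='b'></div></div>"
    "<div id='c'></div>"
    "</body></html>"
)
toy_body = lh.fromstring(toy_html).xpath('body')[0]

direct = toy_body.xpath('div')       # direct children of <body> only
anywhere = toy_body.xpath('//div')   # every <div> in the whole document
below = toy_body.xpath('.//div')     # every <div> beneath <body>

# Called on an inner element, the two "find all" forms disagree.
first_div = direct[0]
from_root = first_div.xpath('//div')    # still the whole document
from_here = first_div.xpath('.//div')   # only what is nested inside first_div

print(len(direct), len(anywhere), len(below), len(from_root), len(from_here))
```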

I wonder whether there are any interesting tables.

In [15]:
allTables = body.xpath('//table')

In [None]:
len(allTables)

<img src="https://i.imgur.com/QQeZcZL.gif" style="width: 200px;"/>



Let's see the first.

In [None]:
table = allTables[0]
table.values()

That is not much information. What else is in there?

In [None]:
len(table[1].xpath('tr'))

So there is a table with many rows.

This page does not seem to hold any structured datasets (inspect the source to check). OK. Let's try another page from which we can pull some data.

In [22]:
URL = "https://en.wikipedia.org/wiki/Python_(programming_language)"
p = requests.get(URL)

ptree = lh.fromstring(p.text)

Looking at the source, we know what we are looking for. We can select by attribute values.

In [None]:
pTables = ptree.xpath("//table[@class='wikitable']")
len(pTables)
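
One caveat with selecting on `@class`: the comparison is against the whole attribute string, so a table marked `class="wikitable sortable"` is not matched by `@class='wikitable'`. The sketch below, on made-up HTML, contrasts the exact match with a common XPath 1.0 idiom for matching a single class token.

```python
import lxml.html as lh

toy_html = (
    "<html><body>"
    "<table class='wikitable'><tr><td>1</td></tr></table>"
    "<table class='wikitable sortable'><tr><td>2</td></tr></table>"
    "<table class='infobox'><tr><td>3</td></tr></table>"
    "</body></html>"
)
toy_tree = lh.fromstring(toy_html)

# Exact string comparison: only the first table qualifies.
exact = toy_tree.xpath("//table[@class='wikitable']")

# Token match: pad the class list with spaces and look for ' wikitable '.
token = toy_tree.xpath(
    "//table[contains(concat(' ', normalize-space(@class), ' '), ' wikitable ')]"
)

print(len(exact), len(token))
```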

Let's see what we've got here.

In [None]:
chart = pTables[0]
chart.xpath('child::*')

In [25]:
lastBody = chart[1]

In [None]:
len(lastBody.xpath('tr'))

**Exercise** Let's build a `DataFrame` that represents the data in the table stored (and parsed as a tree) in `lastBody`.

In [None]:
toNP = [[x.text_content().strip() for x in r.xpath('th')] for r in lastBody.xpath('tr') ]
toNP

OK, that's not exactly what we expected, but it got the first row, so we can go ahead and use that for the columns.

In [None]:
tableCols = toNP[0]
tableCols

But how to get the remaining data? Let's redo that list comprehension. There is no way around it; we need a helper function.

In [None]:
def tableHelper(node):
    if len(node.xpath('code')) > 0:
        codes = node.xpath('code')
        txtCodes = ' '.join([x.text_content().strip() for x in codes])
        return txtCodes
    else:
        return node.text_content().strip()

Note the changes below. We are looking in different tags (`td`) because that is where the data are. We have inserted our helper function in order to process the nodes properly.

In [None]:
toNP = [[tableHelper(x) for x in r.xpath('td')] for r in lastBody.xpath('tr')[1:] ]
toNP

Note that this is imperfect, but did we at least get the basic structure right?

In [None]:
len(toNP), [len(x) for x in toNP], len([len(x) for x in toNP])

Yes, everything has the right shape. Now let's build an `array` of our data.

In [None]:
tableData = np.array(toNP)
tableData.shape

Good, now let's build a `DataFrame`.

In [None]:
tableFrame = pd.DataFrame(data=tableData, columns=tableCols)
tableFrame
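
The shape check we did by eye can be turned into a small safeguard. This is a sketch on made-up rows; the helper name `check_rows` and the toy data are assumptions, not part of the notebook. It reports which rows do not have one cell per column name, which is worth knowing before handing the list to `np.array`.

```python
def check_rows(rows, columns):
    """Return the indices of rows whose length differs from len(columns)."""
    return [i for i, row in enumerate(rows) if len(row) != len(columns)]

toy_cols = ['Type', 'Mutability']
toy_rows = [['bool', 'immutable'], ['list', 'mutable'], ['set']]

# The third row (index 2) is missing a cell.
bad = check_rows(toy_rows, toy_cols)
print(bad)
```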

There is, of course, room for improvement. Generally, when engaged in a scraping endeavor, you will be interested in a specific kind of element in a specific kind of web page, and you will therefore specialize your processing steps in order to clean things up. This is necessary; no amount of 'wrapping' or elegance will free us from having to express what we mean. That said, let's do some wrapping and also make use of some pre-packaged wrapping!

Below, I am going to put everything into a single function `getFrame`. My function *takes a string* and it *returns a `DataFrame`*.

In [None]:
# Copied from above for completeness of this cell.
def tableHelper(node):
    if len(node.xpath('code')) > 0:
        codes = node.xpath('code')
        txtCodes = ' '.join([x.text_content().strip() for x in codes])
        return txtCodes
    else:
        return node.text_content().strip()

def getFrame(name):
    """ Give me your thing
    that you want me to 
    Wikipedia and I will
    try to extract the first
    table from it."""
    # Request and results
    base = "https://en.wikipedia.org/wiki/"
    URL = base+name
    r = requests.get(URL)
    # Prepare my element tree structure
    tree   = lh.fromstring(r.text)
    
    # Finding the table
    pTables = tree.xpath("//table[@class='wikitable']")
    Body = pTables[0].xpath('tbody')[0]

    # Extracting the data
    toNP = [[tableHelper(x) for x in r.xpath('td')] for r in Body.xpath('tr')[1:] ]
    tableData = np.array(toNP)
    
    # Extracting the column names
    header = Body.xpath('tr')[0]
    tableCols = [x.text_content().strip() for x in header.xpath('th')]
    
    # Building the DataFrame
    tableFrame = pd.DataFrame(data=tableData, columns=tableCols)
    
    return tableFrame

Let's try it out.

In [None]:
newFrame = getFrame('porsche')
newFrame

**Exercise** (for the reader)
1. Make this function more robust against errors/incompatible pages
2. Make this function more fully-featured (w.r.t input)
3. Make this function more refined (w.r.t output)

Let's have a look at *Beautiful Soup*, named for *tag soup* - a term of endearment for messy HTML or XML. The `lxml` [documentation](https://lxml.de/) mentions Beautiful Soup as an alternative for handling *really broken* HTML documents.

In [None]:
# Example from the Wikipedia page on BS4
# Note the use of the native urllib
from bs4 import BeautifulSoup
import urllib.request

with urllib.request.urlopen('https://en.wikipedia.org/wiki/Main_Page') as response:
    webpage = response.read()
    soup = BeautifulSoup(webpage, 'html.parser')
    for anchor in soup.find_all('a'):
        print(anchor.get('href', '/'))

Let's try to replicate what we did above, this time using `bs4`. The setup is the same.

In [39]:
base = "https://en.wikipedia.org/wiki/" 
URL = base+"Python_(programming_language)"
r = requests.get(URL)

Next, we parse the source.

In [40]:
# Name the parser explicitly; otherwise bs4 picks one and emits a warning
soup = BeautifulSoup(r.text, 'html.parser')
table = soup.findAll('table',{'class':'wikitable'})[0]

Actually, the rest of this goes pretty much as before.

In [None]:
# Illustration of the methods we're using
header = table.findAll('th')[0]
header.text.strip()

Let's recreate that list comprehension from before.

In [42]:
toNP = [[x.text.strip() for x in r.findAll('th')] for r in table.findAll('tr') ]
tableCols = toNP[0]

We should probably rewrite `tableHelper` so that it uses the `bs4` methods.

In [43]:
def tableHelper(node):
    if len(node.findAll('code')) > 0:
        codes = node.findAll('code')
        txtCodes = ' '.join([x.text.strip() for x in codes])
        return txtCodes
    else:
        return node.text.strip()

In [44]:
toNP = [[tableHelper(x) for x in r.findAll('td')] for r in table.findAll('tr')[1:] ]

The rest is as before.

In [None]:
tableData = np.array(toNP)
tableFrame = pd.DataFrame(data=tableData, columns=tableCols)
tableFrame.head()

Here is a brief illustration of the options in parsing HTML responses.

In [None]:
doc1 = "<a><b /></a>"
doc2 = "<a></p>"

We can use `bs4` in a few different ways. Here we parse `doc1` as HTML.

In [None]:
BeautifulSoup(doc1)

Here we parse the same document as XML.

In [None]:
BeautifulSoup(doc1,'xml')

Here we use the `lxml` library to parse the second, invalid document.

In [None]:
BeautifulSoup(doc2, 'lxml')

The `lxml` library decided to simply drop the dangling end tag. What will the `html5lib` library do?

In [None]:
BeautifulSoup(doc2,'html5lib')

Instead, this one filled in the missing start tag.

**REGEX**

Let's have a brief look at *regular expressions*. This is another tool that will serve you well in wrangling real-world data, particularly text data. This discussion of scraping is as good a place as any for a review of regexes.

In [51]:
tedURL = "https://www.washingtonpost.com/wp-srv/national/longterm/unabomber/manifesto.text.htm?noredirect=on"
tedMan = requests.get(tedURL)
tedParags =  [x.text_content().strip() for x in lh.fromstring(tedMan.text).xpath('//p')]
tedText = 'PARAG'.join(tedParags)

We didn't really need that last part; it was just for fun. Now we have a really long string. Let's look for patterns in it.

>A regular expression (or RE) specifies a set of strings that matches it

See the rest of the documentation [here](https://docs.python.org/3/library/re.html). Let's look at some examples. We inserted those "PARAG" strings. What follows them?

The example below uses a *lookbehind* pattern: we don't extract the "PARAG" itself, but the thing that follows it.

In [None]:
import re
parags = re.findall(r'(?<=PARAG).*', tedText)  # raw string for the pattern
parags

That's pretty close to what I wanted. Note that most of these are paragraphs with the author's original numbering.
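
Lookbehind and lookahead are easy to mix up, so here is a short contrast on a made-up string (`toy_text` is an assumption for illustration). A lookbehind `(?<=...)` checks what comes *before* the match; a lookahead `(?=...)` checks what comes *after* it. Neither includes the checked text in the result.

```python
import re

toy_text = 'PARAG1. First.PARAG2. Second.'

# Lookbehind: digits that come immediately after "PARAG".
after_marker = re.findall(r'(?<=PARAG)\d+', toy_text)

# Lookahead: a word that is immediately followed by ".PARAG".
before_marker = re.findall(r'\w+(?=\.PARAG)', toy_text)

print(after_marker, before_marker)
```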

In [None]:
parags[15]

Why does it just stop right there? From the documentation:

> `.` matches any character except a newline.

That explains it. So what if we wanted just the first word following a "PARAG"? We could do, for instance

In [None]:
import re
parags = re.findall(r'(?<=PARAG)\S*', tedText)  # raw string so \S reaches the regex engine intact
parags

Not bad. What about the next couple of words? (**exercise**). There is (almost) no limit to what you can express with REs.
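
As one more illustration of that flexibility, the newline limitation we ran into earlier can be lifted with the `re.DOTALL` flag. This is a sketch on a made-up string, not on the scraped text: the lazy `.*?` together with a lookahead stops each match at the next marker or at the end of the string.

```python
import re

toy_text = 'PARAG1. Line one\nline two.PARAG2. Other.'

# Without DOTALL, "." stops at the newline.
short = re.findall(r'(?<=PARAG).*', toy_text)

# With DOTALL, "." also matches newlines, so we bound the match ourselves.
full = re.findall(r'(?<=PARAG).*?(?=PARAG|$)', toy_text, flags=re.DOTALL)

print(short)
print(full)
```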

**Scrapy**

Let's look at a tool for automating web scraping.
>Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Let's step through the tutorial.

1. Create a new terminal in Jupyter (or otherwise open a new terminal)
2. Enter `scrapy startproject tutorial` (or another name for your tutorial project)
3. Copy or type the below code into a text file that we'll call `quotes_spider.py`. Create that file in `tutorial/tutorial/spiders`

In [51]:
# Class definition for your first scrapy spider
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename) # old-fashioned string formatting

Remarks on these definitions:
* `name` identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.

* `start_requests()` must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

* `parse()` is a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.
The `parse()` method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (`Request`) from them.

4. Go to the top-level directory (`tutorial`) and run `scrapy crawl quotes`.
5. Look around (`ls`) the directory. Examine the new files.
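
The file names you see come from the string handling in `parse()`. It can be checked in plain Python without running a crawl; `toy_url` below simply stands in for `response.url`.

```python
toy_url = 'http://quotes.toscrape.com/page/2/'

# Splitting on "/" leaves an empty string after the trailing slash,
# so the page number is the second-to-last piece.
parts = toy_url.split('/')
page = parts[-2]

filename = 'quotes-%s.html' % page
print(parts)
print(filename)
```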

What you'll notice about this example is that we really didn't do any parsing. Let's fix that by updating our spider. Replace the parse method definition in `quotes_spider.py` with the new one below:

(*Remark* you can use the scrapy shell to play with the methods used below: `scrapy shell "some_url.com"`)

In [None]:
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').extract_first(),
            'author': quote.css('small.author::text').extract_first(),
            'tags': quote.css('div.tags a.tag::text').extract(),
        }


Finally, run the command `scrapy crawl quotes -o quotes.json` to extract data from these pages and store the parsed results in the file `quotes.json`.

Then take a look (e.g. `cat quotes.json`).

We are sort of scraping, but there is no *crawling* to speak of. What does that mean, crawling? Let's see how to follow links.

Replace the contents of `quotes_spider.py` with the code below. Note that this one also uses the `start_urls` shortcut (which you can read about on the tutorial page).

In [53]:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Notice what has changed: we have extracted links and included logic to move to the next one in the list. More precisely...
>Now, after extracting the data, the `parse()` method looks for the link to the next page, builds a full absolute URL using the `urljoin()` method (since the links can be relative) and yields a new request to the next page, registering itself as callback to handle the data extraction for the next page and to keep the crawling going through all the pages.
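
To see what resolving a relative link means, the standard library's `urllib.parse.urljoin` performs the same kind of resolution and can be tried offline. This is a stand-in for illustration, not Scrapy's own method; `page_url` plays the role of the URL of the page being parsed.

```python
from urllib.parse import urljoin

page_url = 'http://quotes.toscrape.com/page/1/'

# A root-relative link replaces the whole path.
from_root = urljoin(page_url, '/page/2/')

# A path-relative link is resolved against the current directory.
from_here = urljoin(page_url, '../2/')

# An absolute link is returned unchanged.
unchanged = urljoin(page_url, 'http://quotes.toscrape.com/page/3/')

print(from_root, from_here, unchanged)
```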

What else can you do? Here is one example of a slightly more advanced spider that scrapes author information:

In [None]:
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            # default='' avoids an AttributeError when the selector matches nothing
            return response.css(query).extract_first(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

From the tutorial:
>This spider will start from the main page, it will follow all the links to the authors pages calling the `parse_author` callback for each of them, and also the pagination links with the `parse` callback as we saw before.