# Web Scraping Fundamentals and Data Acquisition with Scrapy

## HTML (Hypertext Markup Language)

HTML is the standard markup language for creating Web pages. Its elements tell the browser how to display the content. 

HTML **tags** are element names surrounded by angle brackets:
`<tag_name>content.....</tag_name>`

For instance, HTML links are defined with the `<a>` tag. 

`<a href="https://www.google.com">This is a link</a>`

<a href="https://www.google.com">This is a link</a>

**Attributes** are used to provide additional information about HTML elements. The link's destination above is specified in the _href_ attribute. 

The `style` attribute specifies the styling of an element, such as its color, font, and size.
`<p style="color:red">Paragraph in red..</p>`

<p style="color:red">Paragraph in red</p>

Some attributes that are particularly important for web scraping are:
- `class`: The class attribute assigns an element to one or more named classes. All HTML elements with the same class name can be styled (and selected) together.
- `id`: The id attribute specifies a unique id for an HTML element.
- `href`: Hyperlink URL (important for web crawling).

An `id` must be unique within the document, so it identifies a single element, while a class name can be shared by multiple elements.
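
To make the role of these attributes concrete, here is a minimal sketch that lists the `class`, `id`, and `href` attributes found in a small HTML string. It uses Python's standard-library `html.parser` rather than Scrapy, and the class name `AttributeCollector` and the sample markup are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect the class, id and href attributes of every start tag."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (tag, attribute name, value) tuples

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        for name, value in attrs:
            if name in ("class", "id", "href"):
                self.found.append((tag, name, value))


# Hypothetical sample markup: two elements share a class, one has an id
html = (
    '<div id="uid" class="courses">'
    '<a class="courses" href="https://www.google.com">This is a link</a>'
    '</div>'
)

collector = AttributeCollector()
collector.feed(html)
for tag, name, value in collector.found:
    print(f"<{tag}> {name} = {value}")
```

Both elements carry the class `courses`, while only the `div` carries the id `uid`, which is why ids are a convenient way to pin down a single element when scraping.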

[source](https://www.w3schools.com/html/default.asp)

## XPath Notation

XPath uses a path notation (as in URLs) for navigating through the hierarchical structure of an XML document. It uses a non-XML syntax so that it can be used in URIs and XML attribute values.

Expression|Description
-|-
`/`|Selects from the root node
`//`|Selects matching nodes anywhere below the current node, no matter how deep they are
`.`|Selects the current node
`..`|Selects the parent of the current node
`@`|Selects attributes
`*`|Wildcard

- `xpath = '/html/body/div[2]/p'`: Specific paragraph element
- `xpath = '//p'`: All paragraph elements
- `xpath = '//div[@id="uid"]'`: Selects every `div` element whose `id` attribute equals `uid`.
- `xpath = '//*[@id="div3"]/p'`: Selects the `p` children of the element (of any tag type) whose `id` is `div3`.
- `xpath = '//p[@id="p1"]/a/@href'`: Selects the `href` attribute (the URL) of each `a` child of the `p` element whose `id` is `p1`.

`/html/body/*` selects all elements one generation below the `body` element regardless of tag type; that is, it selects all children of `body`. In contrast, `/html/body//*` selects all descendants of `body` (children, grandchildren, and so on), again regardless of tag type.
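
The difference between children and descendants can be checked with a short sketch. It uses the standard-library `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup, so this is an illustration of the notation rather than of Scrapy itself. Because the parsed root is the `html` element, the paths are written relative to it (`./body/*` instead of `/html/body/*`). The sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document; the parsed root is <html>
doc = ET.fromstring(
    "<html><body>"
    "<div id='div1'><p>one</p></div>"
    "<div id='div2'><p>two</p><p>three</p></div>"
    "<p>four</p>"
    "</body></html>"
)

# Children of body only (one generation below)
child_tags = [el.tag for el in doc.findall("./body/*")]

# All descendants of body, at any depth, in document order
descendant_tags = [el.tag for el in doc.findall("./body//*")]

# Attribute predicate: the p children of the div whose id is "div2"
by_id = [p.text for p in doc.findall(".//div[@id='div2']/p")]

print(child_tags)       # ['div', 'div', 'p']
print(descendant_tags)  # ['div', 'p', 'div', 'p', 'p', 'p']
print(by_id)            # ['two', 'three']
```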

## Scrapy

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival. It can be used for large scale web scraping.

<img width="500" alt="Scrapy" src="https://docs.scrapy.org/en/0.20/_images/scrapy_architecture.png">

[source](https://docs.scrapy.org/en/0.20/topics/architecture.html)

### Selector Objects

In [13]:
from scrapy import Selector

# Create a string with the html code
html = '''
<html>
<body>
<div class="hello selector">
        <p>Hello World!</p>
</div>
<p>Thank you!</p> </body>
</html>
'''

# Create a scrapy Selector object
sel = Selector(text = html)

In [14]:
sel

<Selector xpath=None data='<html>\n<body>\n<div class="hello selec...'>

In [15]:
# Create a SelectorList of all <p> elements in the HTML document
sel.xpath("//p")
# Output is a SelectorList of Selector objects

[<Selector xpath='//p' data='<p>Hello World!</p>'>,
 <Selector xpath='//p' data='<p>Thank you!</p>'>]

In [16]:
# Extract data from a SelectorList
sel.xpath("//p").extract()

['<p>Hello World!</p>', '<p>Thank you!</p>']

In [17]:
# Extract the first element from a SelectorList
sel.xpath("//p").extract_first()

'<p>Hello World!</p>'

In [18]:
# Extract the second element from a SelectorList
sel.xpath("//p")[1].extract()

'<p>Thank you!</p>'

### Requests Library

The `requests` library is the de facto standard for making HTTP requests in Python.

Note: Although it is possible to do web scraping only with `scrapy` (without `requests`), it is good to learn `requests` as well.

In [25]:
import requests

# Get a webpage and create a Response object
url = 'https://api.github.com/events'
r = requests.get(url)

# Create the string html containing the response body
# (r.text is a decoded str; r.content is raw bytes)
html = r.text

# Create the Selector object sel from html
sel = Selector(text = html)

# Print out the number of elements in the HTML document
print(f"There are {len( sel.xpath('//*'))} elements in the HTML document!")

There are 15 elements in the HTML document!


### Response Objects

A `Response` is similar to a `Selector` in that it offers the `xpath` and `css` methods, which can be chained with `extract`. Both `Response` and `Selector` objects return a `SelectorList` from `xpath` or `css`. The main difference is that a `Response` also keeps track of the **url** from which the HTML was loaded. Moreover, its `follow()` method lets us follow a new link, which is particularly useful when _crawling_ the web.

## CSS (Cascading Style Sheets)

CSS Locator notation can be used as an alternative to XPath using the `.css` method of the `Selector`. One advantage of using CSS Locator is that it often makes **attribute selection** very easy.


- `/` is replaced by `>` (a leading `/` is dropped)
    - XPath: `/html/body/div`
    - CSS Locator: `html > body > div`
- `//` is replaced by a blank space (a leading `//` is dropped)
    - XPath: `//div/span//p`
    - CSS Locator: `div > span p`
- `[N]` is replaced by `:nth-of-type(N)`
    - XPath: `//div/p[2]`
    - CSS Locator: `div > p:nth-of-type(2)`
- To find an element by class:
    - XPath: `//div[@class="courses"]/a`
    - CSS Locator: `div.courses > a`
- To find an element by id:
    - XPath: `//div[@id="uid"]/span//h4`
    - CSS Locator: `div#uid > span h4`

- Attribute selection:
    - XPath: `//div[@id="uid"]/a/@href`
    - CSS Locator: `div#uid > a::attr(href)`
- Text selection:
    - XPath: `//p[@id="p-example"]/text()`
    - CSS Locator: `p#p-example::text`
- Text selection 2:
    - XPath: `//p[@id="p-example"]//text()`
    - CSS Locator: `p#p-example ::text`

The wildcard `*` can also be used in CSS Locators. For example, `'#uid > *'` selects all children (regardless of tag-type) of the unique element in the HTML document that has its `id` attribute equal to `uid`.

In [62]:
# Create a string with the html code
html = '''
<html>
<body>
<div class="hello-selector">
        <p id="p1">Hello World!</p>
        <a href="http://www.google.com">Google</a>
</div>
<div class="hello-selector">
        <p id="p2">Hello <a href="http://www.bing.com">Bing</a> World!</p>
</div>
<p>Thank you!</p> </body>
</html>
'''

# Create a Selector object from the html string
sel = Selector(text = html)

# Select all hyperlinks that are direct children of div elements of class "hello-selector"
course_as = sel.css('div.hello-selector > a')

# Selecting all href attributes chaining with css
hrefs_from_css = course_as.css('::attr(href)')
print(hrefs_from_css.extract())

# Selecting all href attributes chaining with xpath
hrefs_from_xpath = course_as.xpath('./@href')  # needs a glue (.)
print(hrefs_from_xpath.extract())

['http://www.google.com']
['http://www.google.com']


Selecting links:

In [63]:
# Select all hyperlinks that are descendants of div elements of class "hello-selector"
course_as = sel.css('div.hello-selector a')

# Selecting all href attributes chaining with css
hrefs_from_css = course_as.css('::attr(href)')
print(hrefs_from_css.extract())

# Selecting all href attributes chaining with xpath
hrefs_from_xpath = course_as.xpath('./@href') # needs a glue (.)
print(hrefs_from_xpath.extract())

['http://www.google.com', 'http://www.bing.com']
['http://www.google.com', 'http://www.bing.com']


Selecting top level text:

In [65]:
# Create an XPath string to the desired text.
xpath = '//p[@id="p2"]/text()'

# Create a CSS Locator string to the desired text.
css_locator = 'p#p2::text'

# Print the text from our selections
print(sel.xpath(xpath).extract())
print(sel.css(css_locator).extract())

['Hello ', ' World!']
['Hello ', ' World!']


Selecting all level text:

In [66]:
# Create an XPath string to the desired text.
xpath = '//p[@id="p2"]//text()'

# Create a CSS Locator string to the desired text.
css_locator = 'p#p2 ::text'

# Print the text from our selections
print(sel.xpath(xpath).extract())
print(sel.css(css_locator).extract())

['Hello ', 'Bing', ' World!']
['Hello ', 'Bing', ' World!']


## Spiders

Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, Spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites). [source](https://docs.scrapy.org/en/latest/topics/spiders.html)

In [8]:
%%writefile quotes_spider.py

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Writing quotes_spider.py


In [9]:
!scrapy runspider quotes_spider.py -o quotes.json

2019-12-26 23:14:32 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: scrapybot)
2019-12-26 23:14:32 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.3 (default, Mar 27 2019, 16:54:48) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.7, Platform Darwin-19.2.0-x86_64-i386-64bit
2019-12-26 23:14:32 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'json', 'FEED_URI': 'quotes.json', 'SPIDER_LOADER_WARN_ONLY': True}
2019-12-26 23:14:32 [scrapy.extensions.telnet] INFO: Telnet Password: f841ffd460bdbadb
2019-12-26 23:14:32 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2019-12-26 23:14:32 [scrapy.middleware] INFO: Enabled downloa

In [10]:
import pandas as pd

pd.read_json("quotes.json")

index|author|text
-|-|-
0|Jane Austen|“The person, be it gentleman or lady, who has ...
1|Steve Martin|“A day without sunshine is like, you know, nig...
2|Garrison Keillor|“Anyone who thinks sitting in church can make ...
3|Jim Henson|“Beauty is in the eye of the beholder and it m...
4|Charles M. Schulz|“All you need is love. But a little chocolate ...
5|Suzanne Collins|“Remember, we're madly in love, so it's all ri...
6|Charles Bukowski|“Some people never go crazy. What truly horrib...
7|Terry Pratchett|“The trouble with having an open mind, of cour...
8|Dr. Seuss|“Think left and think right and think low and ...
9|George Carlin|“The reason I talk to myself is because I’m th...


In [None]:
# Import the scrapy library
import scrapy

url_short = 'https://www.asdf.com'

# Create the Spider class
class DCdescr( scrapy.Spider ):
    name = 'dcdescr'
    # start_requests method
    def start_requests( self ):
        yield scrapy.Request( url = url_short, callback = self.parse )
  
    # First parse method
    def parse( self, response ):
        links = response.css( 'div.course-block > a::attr(href)' ).extract()
        # Follow each of the extracted links
        for link in links:
            yield response.follow(url = link, callback = self.parse_descr)
      
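    # Second parsing method
    def parse_descr( self, response ):
        # Extract course description
        course_descr = response.css( 'p.course__description::text' ).extract_first()
        # Scrapy callbacks must yield items (e.g. dicts) or Requests, not bare strings,
        # so wrap the description in a dict
        yield { 'description': course_descr }

# Inspect the spider
# Note: inspect_spider is not defined in this notebook and is not part of Scrapy
inspect_spider( DCdescr )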

In [None]:
# Import scrapy
import scrapy

# Import the CrawlerProcess: for running the spider
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class DC_Description_Spider(scrapy.Spider):
    name = "dc_chapter_spider"
    # start_requests method
    def start_requests(self):
        yield scrapy.Request(url = url_short,
                         callback = self.parse_front)
    # First parsing method
    def parse_front(self, response):
        course_blocks = response.css('div.course-block')
        course_links = course_blocks.xpath('./a/@href')
        links_to_follow = course_links.extract()
        for url in links_to_follow:
            yield response.follow(url = url,
                                callback = self.parse_pages)
    # Second parsing method
    def parse_pages(self, response):
        # Create a SelectorList of the course titles text
        crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
        # Extract the text and strip it clean
        crs_title_ext = crs_title.extract_first().strip()
        # Create a SelectorList of course descriptions text
        crs_descr = response.css('p.course__description::text')
        # Extract the text and strip it clean
        crs_descr_ext = crs_descr.extract_first().strip()
        # Fill in the dictionary
        dc_dict[crs_title_ext] = crs_descr_ext

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(DC_Description_Spider)
process.start()

# Print a preview of courses
# Note: previewCourses is not defined in this notebook and is not part of Scrapy
previewCourses(dc_dict)