# L6 Data Collection

In this practical, you will learn
* how to scrape web pages using Scrapy and extract the content using XPath, regex, and CSS selectors with lxml


Useful Tools for Regex, XPath and CSSSelector development
Crawling and extraction rely heavily on XPath and CSS selectors. Developing these patterns from scratch can be challenging, so you might find some of the following tools useful.

XPath Wizard
https://chrome.google.com/webstore/detail/xpath-helper-wizard/jadhpggafkbmpdpmpgigopmodldgfcki?hl=en

Selector Gadget
https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb?hl=en

## Focused crawl

First, let's consider a simple crawler that crawls a quotes website using the focused crawl strategy.
[quotes.toscrape.com](http://quotes.toscrape.com/)


Suppose that by navigating the website, we are able to guess that the page URLs have the form
`http://quotes.toscrape.com/page/1/`,
`http://quotes.toscrape.com/page/2/`,

The quotes.toscrape example is inspired by https://www.jitsejan.com/using-scrapy-in-jupyter-notebook.html.

Recall from the lecture notes that a focused crawl behaves as follows:

1. For each URL `u` in the list of seed URLs,
     1. extract the needed content from `u`. 

We can use a function, or simply hard-code the list, to generate the sequence of start (seed) URLs.
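As a minimal sketch of the function-based option, the helper below builds the seed URLs from a page range. The name `make_seed_urls` and its parameters are made up for illustration; only the URL pattern comes from the site.

```python
def make_seed_urls(first_page, last_page):
    """Build the seed URLs for pages first_page..last_page (inclusive)."""
    base = "http://quotes.toscrape.com/page/{}/"
    return [base.format(n) for n in range(first_page, last_page + 1)]


seed_urls = make_seed_urls(1, 3)
for url in seed_urls:
    print(url)
```

A list built this way could be assigned to a spider's `start_urls` instead of typing each URL by hand.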

Note that a focused crawl does not follow links in the pages. We get a page from the list, and extract the needed content. 


First of all, we need to define some writer classes, which help to debug or save the extracted output.

 * `ConsoleWriterPipeline` receives the extracted results from the spider and prints them out.
 * `JsonWriterPipeline` receives the extracted results from the spider and writes them to a JSON Lines file (each line is a JSON object).
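To make the JSON Lines idea concrete, here is a small sketch that drives a pipeline-like class by hand (without Scrapy) and then reads the file back. The class `JsonLinesWriter`, the sample items, and the temporary file path are all assumptions for illustration; in a real crawl Scrapy calls `open_spider`, `process_item`, and `close_spider` for you.

```python
import json
import os
import tempfile


class JsonLinesWriter:
    """Hypothetical stand-in exposing the same three hooks as the pipelines."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        self.file = open(self.path, "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # One JSON object per line.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item


# Hand-written items standing in for what a spider would yield.
items = [
    {"text": "first quote", "tags": ["a", "b"]},
    {"text": "second quote", "tags": []},
]

path = os.path.join(tempfile.mkdtemp(), "result.json")
writer = JsonLinesWriter(path)
writer.open_spider(spider=None)
for item in items:
    writer.process_item(item, spider=None)
writer.close_spider(spider=None)

# Read the file back: one json.loads call per line.
with open(path) as f:
    loaded = [json.loads(line) for line in f]
print(loaded)
```

Because every line is a complete JSON object, the file can be processed line by line without loading everything at once.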

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
data_dir_path='/content/drive/My Drive/Data/DS6/'

In [3]:
import lxml.etree

import json

# receives the extract result from the spider and prints out the content
class ConsoleWriterPipeline(object):
    def open_spider(self, spider):
        pass
    def close_spider(self, spider):
        pass
    
    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        print(line)
        return item
    
# receives the extract result from the spider and appends them into a JSON Line file, (each line is a json)
class JsonWriterPipeline(object):
    def open_spider(self, spider):
        self.file = open(data_dir_path+'result.json', 'w')

    def close_spider(self, spider):
        print('JSON File Generated')
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

Next we define our spider. `QuotesSpider` is a focused spider.

It reads a list of URLs and calls `parse()` for each page (`response`) fetched from those URLs.

Note that each page contains multiple quotes. Hence, in `parse()` we use a CSS selector to retrieve the list of all `div` elements that contain the quotes, one quote per element.

The `yield` statement constructs the result item (a dictionary) that will be consumed by the downstream writer pipeline selected in `ITEM_PIPELINES`: `ConsoleWriterPipeline` prints each item, while `JsonWriterPipeline` saves it to a file.


![image](https://drive.google.com/uc?id=13tZtcspz_KFJecaNMs8ZQt3RnYz3qHWI)



```
for quote in response.css('div.quote'):
    yield {
        'text': quote.css('span.text::text').get(),
        'author': quote.css('span small::text').get(),
        'tags': quote.css('div.tags a.tag::text').getall(),
    }
```



In [4]:
!pip install scrapy

Collecting scrapy
  Downloading Scrapy-2.6.1-py2.py3-none-any.whl (264 kB)

In [5]:
import logging
import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,                            # Default : Debug
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1}  # Used for pipeline
    }
    
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
    


In the following, we create a process which will start the crawler. By uncommenting and running the code below, we perform the focused crawl of the website. The result will be printed in the output section (or written to a file, depending on the pipeline configured in `ITEM_PIPELINES`). If the crawl does not stop, consider clicking the "Black Square" (stop) button below the menu bar to interrupt the kernel.

*Note* If you hit the `ReactorNotRestartable` error, comment out the other crawler processes in this notebook and restart the kernel.

The `USER_AGENT` setting is the value of the HTTP `User-Agent` request header, which the crawler sends with every request to identify itself to the web server.

For more details, refer to https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
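To see that the User-Agent is just a request header, the sketch below builds (but never sends) a request with the standard library's `urllib.request`, used here instead of Scrapy purely for illustration.

```python
import urllib.request

# Build (but do not send) a request that carries a custom User-Agent header.
req = urllib.request.Request(
    "http://quotes.toscrape.com/page/1/",
    headers={"User-Agent": "Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1)"},
)

# urllib stores header names capitalised as "User-agent".
user_agent = req.get_header("User-agent")
print(user_agent)
```

Scrapy attaches the `USER_AGENT` setting to its outgoing requests in a similar way, so the server sees the string configured in `CrawlerProcess`.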

In [6]:
# uncomment me and run
# '''
quotes_crawler_process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

quotes_crawler_process.crawl(QuotesSpider)
quotes_crawler_process.start()
# '''

2022-05-11 06:02:54 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-05-11 06:02:54 [scrapy.utils.log] INFO: Versions: lxml 4.2.6.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.7.13 (default, Apr 24 2022, 01:04:09) - [GCC 7.5.0], pyOpenSSL 22.0.0 (OpenSSL 3.0.3 3 May 2022), cryptography 37.0.2, Platform Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
2022-05-11 06:02:54 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 30,
 'USER_AGENT': 'Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


JSON File Generated


## Exercise

When you are happy with the result, you may modify the `QuotesSpider` class to use `JsonWriterPipeline` to save the result in a file.

##  Question

Manually collecting the list of input URLs for a focused crawl can be challenging. Is there any way to automate it?

In the following example, we go to the BBC website and get the headline and introduction from all the pages that can be reached from the landing URL.


## General Crawl - News Crawler

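Restart the runtime to avoid the `ReactorNotRestartable` error. After restarting, re-run the cells that mount the drive and define `data_dir_path` if you intend to use `JsonWriterPipeline`.


Starting from the http://www.bbc.co.uk/news/technology/ page, follow the links to the articles and parse them.

For each article, get the text of the headline and the introduction.

Send the parsing result to either the console or a JSON file.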


In [1]:
import lxml.etree

import json
    
class ConsoleWriterPipeline(object):
    def open_spider(self, spider):
        None
    def close_spider(self, spider):
        None
    
    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        print(line)
        return item
    
class JsonWriterPipeline(object):
    def open_spider(self, spider):
        self.file = open(data_dir_path+'newsresult.json', 'w')

    def close_spider(self, spider):
        print('JSON File Generated')
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item 

Define the start URL:


```
"http://www.bbc.co.uk/news/technology/"
```

Set the rule that selects which links on the page are followed and parsed:

```
Rule(LinkExtractor(allow=[r'/technology-\d+']), 'parse_story')
```
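To get a feel for which links the `/technology-\d+` pattern selects, the sketch below applies it with the standard `re` module to a few made-up URLs. As far as I know, `LinkExtractor` searches for its `allow` patterns anywhere in the URL, so `re.search` is used as the closest equivalent; treat that as an assumption.

```python
import re

# Same pattern as the rule; the URLs below are made-up examples.
pattern = re.compile(r"/technology-\d+")

candidate_urls = [
    "https://www.bbc.co.uk/news/technology-12345678",
    "https://www.bbc.co.uk/news/technology/",
    "https://www.bbc.co.uk/news/business-87654321",
]

# Keep only the URLs in which the pattern is found somewhere.
followed = [url for url in candidate_urls if pattern.search(url)]
print(followed)
```

Only the first URL contains `/technology-` followed by digits, so the section landing page and the business article are skipped.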

Parsing function to extract each article headline and introduction

```
    story = NewsItem()
    story['headline'] = response.xpath('//head/title/text()').get()
    story['intro'] = response.xpath('//p/b/text()').get()
    
    yield {
        "headline":story['headline'],
        "intro":story['intro']
        }
```
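The two XPath expressions can be tried offline on a tiny hand-written page. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in, which is an assumption made to keep the example dependency-free: it only understands a subset of XPath and only parses well-formed markup, so real pages still call for Scrapy or lxml.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed page standing in for a news article.
page = """
<html>
  <head><title>Sample headline - Example News</title></head>
  <body>
    <p><b>Sample introduction in bold.</b></p>
    <p>Body text that should be ignored.</p>
  </body>
</html>
"""

root = ET.fromstring(page.strip())

# ElementTree paths are relative to the root element, and the text is
# read from .text instead of a trailing /text() step.
headline = root.find("./head/title").text
intro = root.find(".//p/b").text

print({"headline": headline, "intro": intro})
```

The first path mirrors `//head/title/text()` and the second mirrors `//p/b/text()`: the first bold text inside a paragraph is taken as the introduction.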





Tag for the Headline

![image](https://drive.google.com/uc?id=19r3WBTOHK3tj6k55I7h-tXWOegc3mQvv)

Tag for the introduction

![image](https://drive.google.com/uc?id=1lZnqIqx64sOuaHPGq8sQL8nTemJjoCWM)

In [2]:
import logging
import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor

class NewsItem(scrapy.Item):
  # define the fields for your item here like:
  headline = scrapy.Field()
  intro = scrapy.Field()
  # url = scrapy.Field()

class NewsSpider(CrawlSpider):
  name = "bbcnews"
  allowed_domains = ["bbc.co.uk"]
  start_urls = ["http://www.bbc.co.uk/news/technology/",]
  custom_settings = {
      'LOG_LEVEL': logging.WARNING,
      'ITEM_PIPELINES': {'__main__.ConsoleWriterPipeline': 1} # Used for pipeline 1
      }
  rules = [Rule(LinkExtractor(allow=[r'/technology-\d+']), 'parse_story')]

  def parse_story(self, response):
    story = NewsItem()
    story['headline'] = response.xpath('//head/title/text()').get()
    story['intro'] = response.xpath('//p/b/text()').get()
    
    yield {
        "headline":story['headline'],
        "intro":story['intro']
        }
      


In [3]:
from scrapy.crawler import CrawlerProcess

hgw_crawler_process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

hgw_crawler_process.crawl(NewsSpider)
hgw_crawler_process.start()

2022-05-11 06:05:11 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-05-11 06:05:11 [scrapy.utils.log] INFO: Versions: lxml 4.2.6.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.7.13 (default, Apr 24 2022, 01:04:09) - [GCC 7.5.0], pyOpenSSL 22.0.0 (OpenSSL 3.0.3 3 May 2022), cryptography 37.0.2, Platform Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
2022-05-11 06:05:11 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 30,
 'USER_AGENT': 'Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


{"headline": "Apple to discontinue the iPod after 21 years - BBC News", "intro": "Apple has announced it is discontinuing its music player, the iPod Touch, bringing to an end a device widely praised for revolutionising how people listen to music."}

{"headline": "Influencers and followers need more protection, say MPs - BBC News", "intro": "MPs are calling for more protection for social media influencers and the children who follow them."}

{"headline": "Twitter: X marks the spot for Elon Musk's growth plans - BBC News", "intro": "Elon Musk aims to increase Twitter's revenue fivefold to $26.4bn (\u00a321.4bn) by 2028, a presentation to prospective Twitter investors "}

{"headline": "UK blames Russia for satellite internet hack at start of war - BBC News", "intro": "Russia was behind a cyber-attack targeting American commercial satellite internet company Viasat, UK and US intelligence suggests."}

{"headline": "Prince Charles visits sustainable aviation laboratory - BBC News", "intro": 

## General Crawl - Book Crawler


A general crawler may have only one start URL, and typically two rules. It starts by adding the start URL to its URL queue.
It repeats the following until the URL queue is empty.
 1. get a URL from the URL queue,
     1. rule 1. when a target URL is loaded, extract its content.
     1. rule 2. when a non-target URL is loaded, add all (new) links in the page to the URL queue.
 1. remove the URL from the URL queue.
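The loop above can be sketched without any network access by crawling a toy link graph held in a dictionary. The site layout in `LINKS`, the `is_target` test, and the function names are all hypothetical; the point is only the queue handling.

```python
from collections import deque

# Hypothetical toy site: each URL maps to the links found on that page.
LINKS = {
    "/index": ["/page-2", "/book-a", "/book-b"],
    "/page-2": ["/index", "/book-c"],
    "/book-a": ["/index"],
    "/book-b": [],
    "/book-c": ["/book-a"],
}


def is_target(url):
    """Rule 1 applies to book pages; everything else falls under rule 2."""
    return url.startswith("/book-")


def crawl(start_url):
    queue = deque([start_url])
    seen = {start_url}  # URLs already queued, so none is visited twice
    extracted = []
    while queue:
        url = queue.popleft()  # take the next URL off the queue
        if is_target(url):
            extracted.append(url)  # rule 1: extract the content
        else:
            for link in LINKS[url]:  # rule 2: enqueue new links
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return extracted


result = crawl("/index")
print(result)
```

The `seen` set is what makes "(new) links" precise: a link is queued only the first time it is encountered, which also guarantees that the loop terminates.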
 




Go to http://books.toscrape.com.

Extract each book's title, price and stock.


![image](https://drive.google.com/uc?id=1Z5pnlu47Wjc7Nl11EBRdM5E71QCPuuua)




```
yield {
    'title': response.css('.product_main h1::text').get(),
    'price': response.css('.product_main p.price_color::text').re_first('£(.*)'),
    'stock': int(
        ''.join(
            response.css('.product_main .instock.availability ::text').re(r'(\d+)')
        )
    ),
}
```
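The two regular expressions in the snippet can be tested on plain strings with the standard `re` module. The input strings below are hand-written to resemble the text the CSS selectors return, and `re.search` / `re.findall` are used as rough equivalents of the selector methods `re_first` and `re`.

```python
import re

# Hand-written strings shaped like the text the CSS selectors return.
price_text = "£51.77"
stock_text = "In stock (22 available)"

# Like re_first: take the first capture group of the first match.
price = re.search(r"£(.*)", price_text).group(1)

# Like re: collect every captured run of digits, then join them and
# convert the result to an integer.
stock = int("".join(re.findall(r"(\d+)", stock_text)))

print(price, stock)
```

Note that `price` stays a string, as in the spider; convert it with `float()` if you need to do arithmetic on it.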



In [None]:
import lxml.etree

import json
    
class ConsoleWriterPipeline(object):
    def open_spider(self, spider):
        pass
    def close_spider(self, spider):
        pass
    
    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        print(line)
        return item
    
class JsonWriterPipeline(object):
    def open_spider(self, spider):
        self.file = open(data_dir_path+'bookresult.json', 'w')

    def close_spider(self, spider):
        print('JSON File Generated')
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item 

In [None]:
import logging
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BooksCrawlSpider(CrawlSpider):
    name = 'books-crawlspider'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://books.toscrape.com']
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.ConsoleWriterPipeline': 1},  # Used for pipeline
    }
    rules = [
        # Rule 1: follow the pagination links without extracting anything.
        Rule(
            LinkExtractor(allow=r'/catalogue/page-\d+.html'),
            follow=True,
        ),
        # Rule 2: parse every other link that is not a category or index page.
        Rule(
            LinkExtractor(deny=('/category/books', '.com/index.html')),
            callback='parse_book_page',
            follow=True,
        ),
    ]

    def parse_book_page(self, response):
        yield {
            'title': response.css('.product_main h1::text').get(),
            'price': response.css('.product_main p.price_color::text').re_first('£(.*)'),
            'stock': int(
                ''.join(
                    response.css('.product_main .instock.availability ::text').re(r'(\d+)')
                )
            ),
        }

In [None]:
#  uncomment me and run
from scrapy.crawler import CrawlerProcess

hgw_crawler_process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

hgw_crawler_process.crawl(BooksCrawlSpider)
hgw_crawler_process.start()

2021-05-20 09:51:35 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: scrapybot)
2021-05-20 09:51:35 [scrapy.utils.log] INFO: Versions: lxml 4.2.6.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.7.10 (default, May  3 2021, 02:48:31) - [GCC 7.5.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Linux-5.4.109+-x86_64-with-Ubuntu-18.04-bionic
2021-05-20 09:51:35 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-05-20 09:51:35 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 30,
 'USER_AGENT': 'Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


{"title": "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", "price": "52.29", "stock": 19}

{"title": "A Light in the Attic", "price": "51.77", "stock": 22}

{"title": "Libertarianism for Beginners", "price": "51.33", "stock": 19}

{"title": "Mesaerion: The Best Science Fiction Stories 1800-1849", "price": "37.59", "stock": 19}

{"title": "Rip it Up and Start Again", "price": "35.02", "stock": 19}

{"title": "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991", "price": "57.25", "stock": 19}

{"title": "It's Only the Himalayas", "price": "45.17", "stock": 19}

{"title": "Olio", "price": "23.88", "stock": 19}

{"title": "Set Me Free", "price": "17.46", "stock": 19}

{"title": "Shakespeare's Sonnets", "price": "20.66", "stock": 19}

{"title": "The Black Maria", "price": "52.15", "stock": 19}

{"title": "The Dirty Little Secrets of Getting Your Dream Job", "price": "33.34", "stock": 19}

{"title": "The Boys in the Boat: Nine Americans and Their Epi

# RESTful API

Let's use the API from data.gov.sg to check the PSI readings.

https://api.data.gov.sg/v1/environment/psi

When you click on the above link, you will see that the data returned is in JSON format.

With reference to lecture slides 34 and 35, let's extract the PSI 24-hourly reading.


In [None]:
# importing the requests library
import requests 
# api-endpoint
URL = "https://api.data.gov.sg/v1/environment/psi"

# sending get request and saving the response as response object
r = requests.get(url = URL)

In [None]:
# extracting data in json format
data = r.json()
print(data)

{'region_metadata': [{'name': 'west', 'label_location': {'latitude': 1.35735, 'longitude': 103.7}}, {'name': 'national', 'label_location': {'latitude': 0, 'longitude': 0}}, {'name': 'east', 'label_location': {'latitude': 1.35735, 'longitude': 103.94}}, {'name': 'central', 'label_location': {'latitude': 1.35735, 'longitude': 103.82}}, {'name': 'south', 'label_location': {'latitude': 1.29587, 'longitude': 103.82}}, {'name': 'north', 'label_location': {'latitude': 1.41803, 'longitude': 103.82}}], 'items': [{'timestamp': '2021-05-20T18:00:00+08:00', 'update_timestamp': '2021-05-20T18:08:52+08:00', 'readings': {'o3_sub_index': {'west': 18, 'national': 33, 'east': 24, 'central': 23, 'south': 17, 'north': 33}, 'pm10_twenty_four_hourly': {'west': 38, 'national': 38, 'east': 30, 'central': 35, 'south': 35, 'north': 35}, 'pm10_sub_index': {'west': 38, 'national': 38, 'east': 30, 'central': 35, 'south': 35, 'north': 35}, 'co_sub_index': {'west': 3, 'national': 5, 'east': 5, 'central': 3, 'south':

It is quite difficult to view the JSON data in the notebook. We can make use of an online JSON viewer such as http://jsonviewer.stack.hu/ to help us.

From the JSON viewer, look for `psi_twenty_four_hourly`, which is the data that we want to display.

Note that the sample in the lecture reads the pm25 one-hourly reading, but here we want the PSI 24-hourly reading.
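To practise the navigation without calling the API, the sketch below uses a trimmed, hand-written dictionary with the same nesting as the response. The values are illustrative and only the keys needed here are kept.

```python
# Trimmed, hand-written sample shaped like the API response.
sample = {
    "items": [
        {
            "readings": {
                "psi_twenty_four_hourly": {
                    "west": 58, "national": 61, "east": 55,
                    "central": 60, "south": 52, "north": 61,
                },
            },
        }
    ]
}

# "items" is a list, so [0] selects its first (and here only) entry.
psi_24h = sample["items"][0]["readings"]["psi_twenty_four_hourly"]

# Region with the lowest reading, ignoring the national aggregate.
regions = {name: value for name, value in psi_24h.items() if name != "national"}
lowest_region = min(regions, key=regions.get)
print(lowest_region, regions[lowest_region])
```

Each step in the path corresponds to one level you expand in the JSON viewer: `items`, then the first entry, then `readings`, then the reading type.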

In [None]:
# extracting PSI24 readings
readings = data['items'][0]['readings']['psi_twenty_four_hourly']
print("The PSI 24hourly readings are")
# use `region` as the loop variable so the response object `r` is not overwritten
for region in readings:
    print("%s : %s" % (region, readings[region]))

The PSI 24hourly readings are
west : 58
national : 61
east : 55
central : 60
south : 52
north : 61
