# 3. Higher Level Webcrawling (Solution)


In the first example we created a web crawler from scratch. Now we will use [Scrapy](https://scrapy.org/), probably the most widely used web crawling framework, to do the same thing. <br>
Scrapy is usually run from the command line rather than in a notebook, but for this workshop we will use a small hack to run it inside the notebook. For a tutorial on how to run Scrapy the regular way, see this [tutorial](https://doc.scrapy.org/en/latest/intro/tutorial.html).

In [1]:
# Settings for notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.exporters import JsonItemExporter
import json
import logging
import pandas as pd

First we set up a pipeline that stores all articles in the `articles_pipeline.json` file. <br>
The `JsonWriterPipeline` is a simple component that receives each crawled article and writes it to `articles_pipeline.json`.

In [2]:
class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open("articles_pipeline.json", 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()
 
    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
 
    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
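
To make the call order concrete, here is a minimal sketch that imitates what Scrapy does with a pipeline during a crawl: `process_item` is called once per scraped item and `close_spider` once at the end. `InMemoryPipeline` and `run_pipeline` are hypothetical stand-ins written for illustration and are not part of Scrapy; the real framework passes the running spider instance, which is replaced by `None` here.

```python
import json


class InMemoryPipeline:
    """Hypothetical stand-in with the same hooks as JsonWriterPipeline."""

    def __init__(self):
        self.items = []
        self.output = None

    def process_item(self, item, spider):
        # Called once per scraped item; returning the item lets any
        # later pipeline process it as well.
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # Called once when the crawl ends; a good place to flush output.
        self.output = json.dumps(self.items, ensure_ascii=False)


def run_pipeline(pipeline, items):
    """Imitates the order in which the pipeline hooks are invoked."""
    for item in items:
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)
    return pipeline.output


result = run_pipeline(InMemoryPipeline(), [{"title": "A"}, {"title": "B"}])
print(result)
```

The `JsonWriterPipeline` above follows the same pattern, except that it streams each item to a file through `JsonItemExporter` instead of collecting items in memory.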

### The Spider
A spider is the core of a Scrapy crawler. Every spider needs a list of starting URLs, `start_urls`, and must implement the method `parse(self, response)`. 
To access individual elements of a web page, Scrapy uses [XPath or CSS selectors](https://doc.scrapy.org/en/latest/topics/selectors.html). A few fundamental examples (see also the [XPath syntax](https://www.w3schools.com/xml/xpath_syntax.asp)):
- XPath: Select the text of a link with a specific id: `//a[@id="author-id"]/text()`
- CSS: Get the href of a link, with a specific class: `a.myclass::attr(href)`
- CSS: Get the text of a headline with a specific class: `h1.myheadline::text`

In [3]:
class ArticleSpider(scrapy.Spider):
    """
    Crawls all articles published by TechCrunch.
    """
    
    name = 'articles'

    start_urls = ['https://techcrunch.com/']

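    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        # Option 1: our own pipeline; the number (0-1000) sets the run order
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},
        # Option 2: Scrapy's built-in feed export, which writes articles.json
        'FEED_FORMAT': 'json',
        'FEED_URI': 'articles.json'
    }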
    
    pagination_count = 0
    max_pages = 3 # Maximum number of pages, which list the articles
        
    def parse(self, response):
        """
        Crawls all pages listing the articles.
        """
        print("Starting Crawling: {}".format(response.url))
        # follow links to article pages
        for href in response.css('a.read-more::attr(href)'):
            yield response.follow(href, self.parse_author)

        self.pagination_count += 1
        if self.pagination_count < self.max_pages:
            # follow the pagination link, if the page has one
            next_href = response.css('li.next a::attr(href)').extract_first()
            if next_href:
                next_page = response.urljoin(next_href)
                yield response.follow(next_page, self.parse)

    def parse_author(self, response):
        """
        Extracts information for a given article.
        """
        title   = response.css('h1.tweet-title::text').extract_first(default='').strip()
        authors = response.xpath('//a[@rel="author"]/text()').extract()
        date    = response.css('time::attr(datetime)').extract_first()
        tags    = response.css('a.tag::text').extract()
        
        text_raw = response.css('div.text p::text').extract()
        text = ' '.join(text_raw)
        

        yield {
            'title': title,
            'authors': authors,
            'date': date,
            'tags': tags,
            'text': text
        }
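
The pagination step in `parse` turns the `href` of the "next" link into an absolute URL with `response.urljoin`, which behaves much like `urllib.parse.urljoin` with the current page URL as the base. The sketch below reproduces that step with the standard library only; `next_page_url` is a hypothetical helper, and the relative `href` values are assumptions about what the page might contain.

```python
from urllib.parse import urljoin


def next_page_url(current_url, next_href):
    """Return the absolute URL of the next listing page, or None."""
    if not next_href:
        # No "next" link was found on the page: stop paginating.
        return None
    return urljoin(current_url, next_href)


# A relative href is resolved against the current page URL.
print(next_page_url("https://techcrunch.com/", "/page/2/"))
# An absolute href is returned unchanged.
print(next_page_url("https://techcrunch.com/page/2/", "https://techcrunch.com/page/3/"))
# A missing link (extract_first() returned None) yields None.
print(next_page_url("https://techcrunch.com/page/3/", None))
```

Checking for a missing link matters because `extract_first()` returns `None` when the selector matches nothing, for example on the last listing page.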

### Start Crawling
To start crawling, we create a `CrawlerProcess` that runs our `ArticleSpider` against TechCrunch.

In [4]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(ArticleSpider)
process.start()

2018-03-04 19:31:44 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2018-03-04 19:31:44 [scrapy.utils.log] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


<Deferred at 0x7ff155aac080>

Starting Crawling: https://techcrunch.com/
Starting Crawling: https://techcrunch.com/page/2/
Starting Crawling: https://techcrunch.com/page/3/


### Use data
We can use Pandas to load the JSON file into a dataframe.

In [6]:
dfjson = pd.read_json('articles_pipeline.json')
print(dfjson.shape)
dfjson.head()

(58, 5)


| | authors | date | tags | text | title |
|---|---|---|---|---|---|
| 0 | [Josh Constine] | 2018-03-02 13:30:57 | [Apps, Snapchat, Evan Spiegel, snap inc, snapc... | “Timing”, Snapchat CEO Evan Spiegel said crypt... | Snapchat is stuck in the uncanny valley of AR ... |
| 1 | [Jonathan Salama] | 2018-03-02 07:45:38 | [Transportation, trucking] | \n Jonathan Salama is chief technology officer... | Blockchain will work in trucking — but only if... |
| 2 | [Sarah Perez] | 2018-03-02 10:53:10 | [Apps, iphone apps, storage, iOS apps, Apps] | These days, home movies aren’t recorded with h... | Air’s app lets you record high-quality home mo... |
| 3 | [Devin Coldewey] | 2018-03-03 17:05:20 | [eCommerce, Amazon, counterfeit] | It’s become a standard part of my dwindling Am... | Another small business complains of counterfei... |
| 4 | [Danny Crichton] | 2018-03-04 09:17:01 | [Government, Facebook, Google] | If there is one policy dilemma facing nearly e... | No one wants to build a “feel good” internet |
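
Because `authors` and `tags` are stored as lists, a common next step is to flatten them, for example to count how often each tag occurs. The following is a minimal sketch using only the standard library; to stay self-contained it embeds three records copied from the output above (reduced to two fields) instead of reading `articles_pipeline.json`.

```python
import json
from collections import Counter

# Three records copied from the output above, reduced to the fields used here.
# With the real file, json.load(open("articles_pipeline.json")) would be used.
records_json = """
[
  {"authors": ["Sarah Perez"],
   "tags": ["Apps", "iphone apps", "storage", "iOS apps", "Apps"]},
  {"authors": ["Devin Coldewey"],
   "tags": ["eCommerce", "Amazon", "counterfeit"]},
  {"authors": ["Danny Crichton"],
   "tags": ["Government", "Facebook", "Google"]}
]
"""

records = json.loads(records_json)

# Flatten the per-article tag lists and count each tag.
tag_counts = Counter(tag for record in records for tag in record["tags"])

print(tag_counts.most_common(3))
```

Note that the first record lists `Apps` twice, so duplicates within a single article are counted as well; wrapping `record["tags"]` in `set(...)` would count each tag at most once per article.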
