# 3. Higher Level Webcrawling (Solution)


In the first example we created a web crawler from scratch. Now we will use [Scrapy](https://scrapy.org/), probably the most widely used web crawling framework, to do the same thing. <br>
Scrapy is usually run from the command line, not in a notebook, but for this workshop we will use a small hack to run it inside the notebook. For details on how to run Scrapy the regular way, see this [tutorial](https://doc.scrapy.org/en/latest/intro/tutorial.html).

In [1]:
# Settings for notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.exporters import JsonItemExporter
import json
import logging
import pandas as pd

First we set up a pipeline that stores all articles in the `articles_pipeline.json` file. <br>
The `JsonWriterPipeline` is a simple component that receives each crawled article and writes it to `articles_pipeline.json`.

In [2]:
class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open("articles_pipeline.json", 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()
 
    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
 
    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

### The Spider
A spider is the core of a Scrapy crawler. Every spider needs a list of starting URLs, `start_urls`, and has to implement the method `parse(self, response)`.
To access individual elements of a web page, Scrapy uses XPath or CSS [selectors](https://doc.scrapy.org/en/latest/topics/selectors.html). A few fundamental examples (see also the [XPath syntax](https://www.w3schools.com/xml/xpath_syntax.asp) reference):
- XPath: Select the text of a link with a specific id: `//a[@id="author-id"]/text()`
- CSS: Get the href of a link, with a specific class: `a.myclass::attr(href)`
- CSS: Get the text of a headline with a specific class: `h1.myheadline::text`

In [3]:
class ArticleSpider(scrapy.Spider):
    """
    Crawls the articles published by TechCrunch.
    """
    
    name = 'articles'

    start_urls = ['https://techcrunch.com/']

    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},  # Writes articles_pipeline.json
        'FEED_FORMAT': 'json',  # Built-in feed export, writes articles.json
        'FEED_URI': 'articles.json'
    }
    
    pagination_count = 0
    max_pages = 3 # Maximum number of pages, which list the articles
        
    def parse(self, response):
        """
        Crawls all pages listing the articles.
        """
        print("Starting Crawling: {}".format(response.url))
        # Follow the links to the article pages
        article_urls = response.css('a.post-block__title__link::attr(href)')
        for article_url in article_urls:
            yield response.follow(article_url, self.parse_articles)

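        self.pagination_count += 1
        # Follow pagination while under the page limit; max_pages == -1 means no limit
        if self.pagination_count < self.max_pages or self.max_pages == -1:
            # Follow the link to the next listing page, if there is one
            next_url = response.css('a.load-more::attr(href)').extract_first()
            if next_url is not None:
                yield response.follow(next_url, self.parse)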
            

    def parse_articles(self, response):
        """
        Extracts information for a given article.
        """
        # Extract information from the article; default='' avoids an
        # AttributeError on pages where an element is missing
        
        title = response.css('h1.article__title::text').extract_first(default='').strip()
        author = response.css('div.article__byline a::text').extract_first(default='').strip()
        text_raw = response.css('div.article-content p::text').extract()
        text = ' '.join(text_raw)
        url = response.url
        print(url)
        article_info = {
            'title': title,
            'author': author,
            'content': text,
            'date': '/'.join(url.split('/')[3:6])
        }
        
        #print(article_info)
        yield article_info
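
The `date` field is cut out of the URL by position, which assumes the `/YYYY/MM/DD/` layout of regular article URLs. Below is a minimal sketch of that slicing next to a stricter variant. The helper name `date_from_url` and its regular expression are assumptions for illustration and not part of the spider:

```python
import re


def date_from_url(url):
    """Return 'YYYY/MM/DD' if the URL path starts with a date, else None."""
    match = re.search(r'^https?://[^/]+/(\d{4})/(\d{2})/(\d{2})/', url)
    return '/'.join(match.groups()) if match else None


article_url = 'https://techcrunch.com/2018/03/22/twitchs-extensions-come-to-mobile/'
story_url = 'https://techcrunch.com/story/facebook-responds-to-data-misuse/'

# Positional slicing as used in parse_articles
sliced_article = '/'.join(article_url.split('/')[3:6])
sliced_story = '/'.join(story_url.split('/')[3:6])

print(sliced_article)              # 2018/03/22
print(sliced_story)                # story/facebook-responds-to-data-misuse/
print(date_from_url(article_url))  # 2018/03/22
print(date_from_url(story_url))    # None
```

URLs under `/story/` do not carry a date, so positional slicing stores a path fragment in the `date` field for them, while the stricter variant returns `None`.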

### Start Crawling
To start crawling, we create a crawler process that uses our `ArticleSpider` to crawl TechCrunch.

In [4]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(ArticleSpider)
process.start()

2018-03-22 15:41:58 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2018-03-22 15:41:58 [scrapy.utils.log] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


<Deferred at 0x7f76b707abe0>

Starting Crawling: https://techcrunch.com/
https://techcrunch.com/2018/03/22/twitchs-extensions-come-to-mobile/
https://techcrunch.com/story/facebook-responds-to-data-misuse/
https://techcrunch.com/2018/03/22/swiss-police-order-up-tesla-model-x-police-cars-for-active-duty/
https://techcrunch.com/2018/03/22/ansarada-gets-18m-in-series-a-funding-to-help-companies-better-prepare-for-major-deals/
https://techcrunch.com/2018/03/21/first-impressions-of-the-199-oculus-go-vr-headset/
https://techcrunch.com/2018/03/21/ai-game-trainer-gosu-ai-raises-1-9m-to-give-gamers-a-virtual-assistant/
https://techcrunch.com/2018/03/21/video-the-driver-of-the-autonomous-uber-was-distracted-before-fatal-crash/
https://techcrunch.com/2018/03/21/get-the-latest-tc-stories-read-to-you-over-the-phone-with-braillevoice/
https://techcrunch.com/2018/03/22/revolut-launches-disposable-virtual-cards/
https://techcrunch.com/2018/03/22/gopro-to-license-camera-lenses-and-sensors-through-jabil/
https://techcrunch.com/2018/0

### Use data
We can use Pandas to load the JSON file into a dataframe.

In [5]:
dfjson = pd.read_json('articles_pipeline.json')
print(dfjson.shape)
dfjson.head()

(59, 4)


| | author | content | date | title |
|---|---|---|---|---|
| 0 | Sarah Perez | – the tools that allows streamers to custom... | 2018/03/22 | Twitch’s extensions come to mobile |
| 1 | Natasha Lomas | | story/facebook-responds-to-data-misuse/ | Facebook responds to data misuse |
| 2 | Darrell Etherington | Model X has had some brushes with law enforce... | 2018/03/22 | Swiss police order up Tesla Model X police car... |
| 3 | Catherine Shu | Australian startup , which provides tools for... | 2018/03/22 | Ansarada gets $18M in Series A funding to help... |
| 4 | Lucas Matney | Virtual reality seems to have become a very ti... | 2018/03/21 | First impressions of the $199 Oculus Go VR hea... |
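
Once the articles are in a dataframe, the usual Pandas tools apply. The following is a minimal sketch, assuming a small hand-built dataframe with the `author` and `date` columns from above in place of the crawled file. Rows whose `date` is not in `YYYY/MM/DD` form become `NaT` and can be filtered out:

```python
import pandas as pd

# Hand-built sample with two of the columns of the crawled data
sample = pd.DataFrame({
    'author': ['Sarah Perez', 'Natasha Lomas', 'Lucas Matney'],
    'date': ['2018/03/22', 'story/facebook-responds-to-data-misuse/', '2018/03/21'],
})

# Parse the date strings; values that do not match the format become NaT
sample['parsed_date'] = pd.to_datetime(sample['date'], format='%Y/%m/%d', errors='coerce')

# Keep only rows with a valid date and count articles per day
dated = sample.dropna(subset=['parsed_date'])
per_day = dated.groupby('parsed_date').size()
print(per_day)
```

The same steps should work on `dfjson`, as long as its `date` column holds strings in this format.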
