## Using Scrapy in a Jupyter notebook
This notebook uses the [Scrapy](https://scrapy.org) library to scrape data from a website. Following the basic example from the Scrapy tutorial, we create a QuotesSpider and pass it to a CrawlerProcess to retrieve quotes from http://quotes.toscrape.com.

This notebook shows two ways of writing the results to a JSON file. The first is an item pipeline: a separate class that explicitly defines how each found item is written to a file. This gives more flexibility when dealing with unusual data formats, or if you want to set up a custom way of writing items to file. The pipeline is enabled through the ITEM_PIPELINES entry in the custom_settings attribute of the QuotesSpider class.

The second is Scrapy's feed export. I simply want to write the list of items found by the spider to a JSON file, and therefore this option is easier: only FEED_FORMAT has to be set to json and the output file has to be given in FEED_URI, both inside the custom_settings of the spider. No additional classes or definitions are needed, which makes FEED_FORMAT/FEED_URI the more convenient option.

Once the quotes are retrieved, the JSON file is created on disk and can be loaded into a Pandas dataframe. This dataframe can then be analyzed, modified and used for further processing. This notebook simply loads the JSON file into a dataframe and writes it back to disk as a pickle.

In [1]:
# Settings for notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Show Python version
import platform
platform.python_version()

'3.6.1'

### Import Scrapy

In [2]:
import scrapy
from scrapy.crawler import CrawlerProcess

### Setup a pipeline
This class creates a simple pipeline that writes all found items to a JSON Lines (.jl) file, where each line contains one JSON object.

In [3]:
import json

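class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        self.file = open('../data/quoteresult.jl', 'w')

    def close_spider(self, spider):
        # Called once when the spider finishes: close the output file
        self.file.close()

    def process_item(self, item, spider):
        # Called for every scraped item: write it as one JSON object per line
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item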
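
To see what the pipeline produces without starting a crawl, the three hooks can be called by hand. This is only a sketch: `JsonLinesWriter`, `run_pipeline`, the sample items and the temporary output path are made up for illustration, and the path is passed in as an argument rather than hard-coded as in the class above.

```python
import json
import os
import tempfile


class JsonLinesWriter:
    """Hypothetical stand-in with the same three hooks as the pipeline above."""

    def __init__(self, path):
        # Assumption: the output path is a parameter instead of a fixed string
        self.path = path

    def open_spider(self, spider):
        self.file = open(self.path, "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # One JSON object per line
        self.file.write(json.dumps(dict(item)) + "\n")
        return item


def run_pipeline(items, path):
    """Call the hooks in the order a crawl would: open, once per item, close."""
    writer = JsonLinesWriter(path)
    writer.open_spider(spider=None)
    for item in items:
        writer.process_item(item, spider=None)
    writer.close_spider(spider=None)


# Made-up items shaped like the ones the spider yields
sample_items = [
    {"text": "first quote", "author": "A", "tags": ["x"]},
    {"text": "second quote", "author": "B", "tags": []},
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "sample.jl")
    run_pipeline(sample_items, out_path)
    with open(out_path) as handle:
        written_lines = handle.read().splitlines()

print(len(written_lines))
```

Each item ends up on its own line, so the file can be read back line by line with `json.loads`.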

### Define the spider
The QuotesSpider class defines which URLs to start crawling from and which values to retrieve. I set the logging level of the crawler to WARNING; otherwise the notebook is flooded with DEBUG messages about the retrieved data.

In [4]:
import logging

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1}, # Used for pipeline 1
        'FEED_FORMAT':'json',                                 # Used for pipeline 2
        'FEED_URI': '../data/quoteresult.json'                # Used for pipeline 2
    }
    
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
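
The run below uses Scrapy 1.4.0, where FEED_FORMAT and FEED_URI are the settings for the second option. As an aside, and assuming Scrapy 2.1 or later, the same feed export can be expressed with the single FEEDS setting. The dictionary name `custom_settings_feeds` is made up for this sketch; in a spider it would take the place of `custom_settings`.

```python
# Assumption: Scrapy 2.1 or later, where the FEEDS setting is available.
# Keys of FEEDS are output URIs, values are per-feed options.
custom_settings_feeds = {
    "LOG_LEVEL": "WARNING",
    "FEEDS": {
        "../data/quoteresult.json": {"format": "json"},
    },
}

print(custom_settings_feeds["FEEDS"])
```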

### Start the crawler

In [5]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(QuotesSpider)
process.start()

2017-08-31 12:51:23 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-08-31 12:51:23 [scrapy.utils.log] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


<Deferred at 0x7f46767c2ac8>

### Check the files
Verify that the files have been created on disk. As we can observe, both files exist and contain data. The .jl file has line-separated JSON elements, while the .json file has one big JSON array containing all the quotes.

In [6]:
ll ../data/quoteresult.*

-rw-r--r-- 1 root 5551 Aug 31 12:51 ../data/quoteresult.jl
-rw-r--r-- 1 root 5573 Aug 31 12:51 ../data/quoteresult.json


In [8]:
!tail -n 2 ../data/quoteresult.jl

{"text": "\u201cGood friends, good books, and a sleepy conscience: this is the ideal life.\u201d", "author": "Mark Twain", "tags": ["books", "contentment", "friends", "friendship", "life"]}
{"text": "\u201cLife is what happens to us while we are making other plans.\u201d", "author": "Allen Saunders", "tags": ["fate", "life", "misattributed-john-lennon", "planning", "plans"]}


In [9]:
!tail -n 2 ../data/quoteresult.json

{"text": "\u201cLife is what happens to us while we are making other plans.\u201d", "author": "Allen Saunders", "tags": ["fate", "life", "misattributed-john-lennon", "planning", "plans"]}
]
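
The difference between the two layouts can also be checked with the standard `json` module. The sketch below uses two made-up records held in strings instead of the files on disk: the .jl layout is parsed one line at a time, while the .json layout is parsed in a single call.

```python
import json

# Made-up sample records standing in for the scraped quotes
jl_text = (
    '{"text": "first quote", "author": "A", "tags": ["x"]}\n'
    '{"text": "second quote", "author": "B", "tags": []}\n'
)
json_text = (
    '[\n'
    '{"text": "first quote", "author": "A", "tags": ["x"]},\n'
    '{"text": "second quote", "author": "B", "tags": []}\n'
    ']'
)

# JSON Lines: every non-empty line is a complete JSON document
records_from_jl = [json.loads(line) for line in jl_text.splitlines() if line.strip()]

# JSON array: the whole text is one JSON document
records_from_json = json.loads(json_text)

print(records_from_jl == records_from_json)
```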

### Create dataframes
Pandas can now be used to create dataframes and save them as pickles. The .json file can be loaded directly into a frame, whereas for the .jl file we need to specify that the JSON objects are stored one per line (lines=True).

In [10]:
import pandas as pd
dfjson = pd.read_json('../data/quoteresult.json')
dfjson

|   | author | tags | text |
|---|---|---|---|
| 0 | Albert Einstein | [change, deep-thoughts, thinking, world] | “The world as we have created it is a process ... |
| 1 | J.K. Rowling | [abilities, choices] | “It is our choices, Harry, that show what we t... |
| 2 | Albert Einstein | [inspirational, life, live, miracle, miracles] | “There are only two ways to live your life. On... |
| 3 | Jane Austen | [aliteracy, books, classic, humor] | “The person, be it gentleman or lady, who has ... |
| 4 | Marilyn Monroe | [be-yourself, inspirational] | “Imperfection is beauty, madness is genius and... |
| 5 | Albert Einstein | [adulthood, success, value] | “Try not to become a man of success. Rather be... |
| 6 | André Gide | [life, love] | “It is better to be hated for what you are tha... |
| 7 | Thomas A. Edison | [edison, failure, inspirational, paraphrased] | “I have not failed. I've just found 10,000 way... |
| 8 | Eleanor Roosevelt | [misattributed-eleanor-roosevelt] | “A woman is like a tea bag; you never know how... |
| 9 | Steve Martin | [humor, obvious, simile] | “A day without sunshine is like, you know, nig... |


In [11]:
dfjl = pd.read_json('../data/quoteresult.jl', lines=True)
dfjl

|   | author | tags | text |
|---|---|---|---|
| 0 | Albert Einstein | [change, deep-thoughts, thinking, world] | “The world as we have created it is a process ... |
| 1 | J.K. Rowling | [abilities, choices] | “It is our choices, Harry, that show what we t... |
| 2 | Albert Einstein | [inspirational, life, live, miracle, miracles] | “There are only two ways to live your life. On... |
| 3 | Jane Austen | [aliteracy, books, classic, humor] | “The person, be it gentleman or lady, who has ... |
| 4 | Marilyn Monroe | [be-yourself, inspirational] | “Imperfection is beauty, madness is genius and... |
| 5 | Albert Einstein | [adulthood, success, value] | “Try not to become a man of success. Rather be... |
| 6 | André Gide | [life, love] | “It is better to be hated for what you are tha... |
| 7 | Thomas A. Edison | [edison, failure, inspirational, paraphrased] | “I have not failed. I've just found 10,000 way... |
| 8 | Eleanor Roosevelt | [misattributed-eleanor-roosevelt] | “A woman is like a tea bag; you never know how... |
| 9 | Steve Martin | [humor, obvious, simile] | “A day without sunshine is like, you know, nig... |


In [12]:
dfjson.to_pickle('../data/quotejson.pickle')
dfjl.to_pickle('../data/quotejl.pickle')

In [13]:
ll ../data/*pickle

-rw-r--r-- 1 root 5676 Aug 31 12:52 ../data/quotejl.pickle
-rw-r--r-- 1 root 5676 Aug 31 12:52 ../data/quotejson.pickle
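
To confirm that a pickle can be loaded again, the frame can be read back with `pd.read_pickle` and compared with the original. The sketch below is an assumption-laden illustration: it uses a small made-up frame and a temporary directory rather than the files written above.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with the same columns as the scraped quotes
frame = pd.DataFrame(
    [
        {"author": "A", "tags": ["x"], "text": "first quote"},
        {"author": "B", "tags": [], "text": "second quote"},
    ]
)

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = os.path.join(tmp, "sample.pickle")
    frame.to_pickle(pickle_path)             # write the frame to disk
    restored = pd.read_pickle(pickle_path)   # load it back

# Compare row by row as plain Python records
round_trip_ok = restored.to_dict(orient="records") == frame.to_dict(orient="records")
print(round_trip_ok)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.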
