<a href="https://colab.research.google.com/github/jgamel/learn_n_dev/blob/python_web_scrapping/scrapy_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping Example Using the Scrapy Module

Mount Drive:

In [28]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [29]:
import sys
sys.path.append('/content/gdrive/My Drive/Colab Notebooks')

Run Scrapy code from Jupyter Notebook without issues

Scrapy is an open-source framework for extracting data from websites. It is fast, simple, and extensible. Data scientists often need to gather data this way, so it is worth being familiar with. Data scientists usually prefer some kind of computational notebook for managing their workflow, and Jupyter Notebook is a popular choice, alongside tools such as PyCharm, Zeppelin, VS Code, nteract, Google Colab, and Spyder.

Scrapy spiders are usually run from a .py file, but they can also be started from a notebook. The problem is that running the code block a second time raises a `ReactorNotRestartable` error, because the Twisted reactor that Scrapy runs on cannot be started again once it has stopped.

There is a workaround for this error: the crochet package. In this post, I show the steps I took to run Scrapy code from a Jupyter Notebook without the error.
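To give a rough feel for the idea behind the workaround, here is a stdlib-only sketch. It is not crochet's implementation: `wait_for_sketch`, `_worker`, and `slow_task` are hypothetical names made up for this illustration, and a single long-lived worker thread stands in for the thread that keeps the reactor running. The point is only that the worker is started once and reused, while each decorated call blocks the notebook cell until a result arrives or a timeout expires.

```python
import concurrent.futures
import functools
import threading
import time

# One long-lived worker thread; it plays the role of the thread that
# keeps running in the background and is never restarted.
_worker = concurrent.futures.ThreadPoolExecutor(
    max_workers=1, thread_name_prefix="worker"
)


def wait_for_sketch(timeout):
    """Hypothetical stand-in for a blocking-call decorator (illustration only)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Hand the call over to the background worker thread ...
            future = _worker.submit(func, *args, **kwargs)
            # ... and block the caller until it finishes or the timeout expires.
            return future.result(timeout=timeout)
        return wrapper
    return decorator


@wait_for_sketch(timeout=2)
def slow_task(label):
    time.sleep(0.1)  # pretend to do some slow work, such as a crawl
    return f"{label} ran in {threading.current_thread().name}"


# Calling the function repeatedly is fine: the worker is reused, not restarted.
first = slow_task("first call")
second = slow_task("second call")
print(first)
print(second)
```

If the task took longer than the timeout, `future.result` would raise a timeout error in the calling thread, which is loosely comparable to a crawl exceeding the limit passed to `@wait_for`.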

Demo Project:

In [30]:
import scrapy
from scrapy.crawler import CrawlerRunner
# text cleaning
import re
# Reactor restart
from crochet import setup, wait_for
setup()

In [31]:
class QuotesToCsv(scrapy.Spider):
    """Scrape the first line of each quote from the Wikiquote page
    listed in `start_urls` and save the results to a CSV file."""
    name = "MJKQuotesToCsv"
    start_urls = [
        'https://en.wikiquote.org/wiki/Mahatma_Gandhi',
    ]
    custom_settings = {
        'ITEM_PIPELINES': {
            '__main__.ExtractFirstLine': 1
        },
        'FEEDS': {
            '/content/gdrive/My Drive/quotes.csv': {
                'format': 'csv',
                'overwrite': True
            }
        }
    }

    def parse(self, response):
        """parse data from urls"""
        for quote in response.css('div.mw-parser-output > ul > li'):
            yield {'quote': quote.extract()}


class ExtractFirstLine:
    """Item pipeline that keeps only the first line of each quote, without HTML tags."""

    def process_item(self, item, spider):
        """Keep the first line of the quote and strip its HTML tags."""
        lines = dict(item)["quote"].splitlines()
        first_line = self._remove_html_tags(lines[0])

        return {'quote': first_line}

    def _remove_html_tags(self, text):
        """Remove HTML tags from a string."""
        html_tags = re.compile('<.*?>')
        return re.sub(html_tags, '', text)

@wait_for(10)
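def run_spider():
    """Run the QuotesToCsv spider, waiting up to 10 seconds for it to finish."""
    runner = CrawlerRunner()
    # crawl() returns a Deferred; crochet blocks on it until the crawl is done
    return runner.crawl(QuotesToCsv)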

run_spider()

To summarize, the steps used above are:

1. Using CrawlerRunner instead of CrawlerProcess.

2. Importing setup and wait_for from crochet and initializing it by calling setup().

3. Using the @wait_for(10) decorator on the function that runs the spider. @wait_for makes a blocking call into the Twisted reactor thread; here it waits up to 10 seconds for the result. See the crochet documentation to learn more about this.
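The text cleaning done by the ExtractFirstLine pipeline can be tried on its own, without running a crawl. The following is a minimal sketch of the same logic using only the standard library; the function name `first_line_without_tags` and the sample HTML fragment are made up for illustration and are not taken from the Wikiquote page.

```python
import re

# Same non-greedy pattern as in the pipeline: matches anything shaped like <...>
TAG_PATTERN = re.compile(r"<.*?>")


def first_line_without_tags(html_fragment):
    """Return the first line of an HTML fragment with its tags removed."""
    lines = html_fragment.splitlines()
    # Guard against an empty fragment, where there is no first line at all
    first = lines[0] if lines else ""
    return TAG_PATTERN.sub("", first)


# Hypothetical <li> element, shaped like the items the spider yields
sample = "<li>An <b>example</b> quote.\n<ul><li>Source note</li></ul></li>"
cleaned = first_line_without_tags(sample)
print(cleaned)
```

A regular expression like this is good enough for simple list items, but it is not an HTML parser, so text containing a literal `<` or `>` may be trimmed more than intended.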