<a href="https://colab.research.google.com/github/jgamel/learn_n_dev/blob/python_web_scrapping/scrapy_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping Example Using the Scrapy Module

Run Scrapy code from Jupyter Notebook without issues

Scrapy is an open-source framework for extracting data from websites. It is fast, simple, and extensible. Data scientists often need to gather data this way, so familiarity with Scrapy is useful. Many of them manage their workflow in a computational notebook or a similar environment. Jupyter Notebook is a popular choice, alongside tools such as PyCharm, Zeppelin, VS Code, nteract, Google Colab, and Spyder.

Scrapy spiders are usually run from a .py file, but they can also be started from a notebook. The problem is that running the code block a second time raises `ReactorNotRestartable`, because the Twisted reactor that Scrapy runs on cannot be restarted once it has been stopped.

The crochet package provides a workaround for the `ReactorNotRestartable` error. In this blog post, I show the steps I took to run Scrapy code from a Jupyter Notebook without it.

Demo Project:

In [1]:
!pip install scrapy

Collecting scrapy
  Downloading Scrapy-2.6.1-py2.py3-none-any.whl (264 kB)
Collecting itemadapter>=0.1.0
  Downloading itemadapter-0.5.0-py3-none-any.whl (10 kB)
Collecting queuelib>=1.4.2
  Downloading queuelib-1.6.2-py2.py3-none-any.whl (13 kB)
Collecting pyOpenSSL>=16.2.0
  Downloading pyOpenSSL-22.0.0-py2.py3-none-any.whl (55 kB)
Collecting protego>=0.1.15
  Downloading Protego-0.2.1-py2.py3-none-any.whl (8.2 kB)
Collecting tldextract
  Downloading tldextract-3.3.0-py3-none-any.whl (93 kB)
Collecting PyDispatcher>=2.0.5
  Downloading PyDispatcher-2.0.5.zip (47 kB)
Collecting service-identity>=16.0.0
  Downloading service_identity-21.1.0-py2.py3-none-any.whl (12 kB)
Collecting cssselect>=0.9.1
  Downloading cssselect-1.1.0-py2

In [2]:
!pip install crochet

Collecting crochet
  Downloading crochet-2.0.0-py3-none-any.whl (31 kB)
Installing collected packages: crochet
Successfully installed crochet-2.0.0


In [3]:
import scrapy
from scrapy.crawler import CrawlerRunner
# text cleaning
import re
# Reactor restart
from crochet import setup, wait_for
setup()

In [5]:
class QuotesToCsv(scrapy.Spider):
    """Scrape the first line of each quote on the Mahatma Gandhi
    `wikiquote` page and save the results to a CSV file."""
    name = "MJKQuotesToCsv"
    start_urls = [
        'https://en.wikiquote.org/wiki/Mahatma_Gandhi',
    ]
    custom_settings = {
        'ITEM_PIPELINES': {
            '__main__.ExtractFirstLine': 1
        },
        'FEEDS': {
            '/tmp/quotes.csv': {
                'format': 'csv',
                'overwrite': True
            }
        }
    }

    def parse(self, response):
        """parse data from urls"""
        for quote in response.css('div.mw-parser-output > ul > li'):
            yield {'quote': quote.extract()}


class ExtractFirstLine:
    """Item pipeline that keeps only the first line of each quote."""

    # Compile once; the non-greedy pattern matches one tag at a time
    HTML_TAGS = re.compile(r'<.*?>')

    def process_item(self, item, spider):
        """Keep the first line of the quote and strip its HTML tags."""
        lines = dict(item)["quote"].splitlines()
        first_line = self._remove_html_tags(lines[0])

        return {'quote': first_line}

    def _remove_html_tags(self, text):
        """Remove HTML tags from a string."""
        return self.HTML_TAGS.sub('', text)

@wait_for(10)  # block the caller for up to 10 seconds
def run_spider():
    """Run the MJKQuotesToCsv spider and return its Deferred."""
    runner = CrawlerRunner()
    return runner.crawl(QuotesToCsv)

run_spider()
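
The tag-stripping step in `ExtractFirstLine` can be tried on its own, without running a crawl. The sketch below is a minimal illustration using only the standard library. The `strip_tags` helper and the sample string are hypothetical, and the `html.unescape` call is an optional extra that the pipeline above does not perform.

```python
import html
import re

# Non-greedy pattern: each match stops at the first '>' it finds,
# so tags are removed one at a time and the text between them survives.
TAG_RE = re.compile(r"<.*?>")


def strip_tags(fragment):
    """Remove HTML tags, then decode entities such as &amp; (optional extra)."""
    return html.unescape(TAG_RE.sub("", fragment))


# Hypothetical list item, similar in shape to what the spider yields
sample = '<li><b>Hello</b> &amp; <a href="/wiki/X">world</a></li>'
cleaned = strip_tags(sample)
print(cleaned)  # Hello & world
```

A regular expression like this is adequate for simple, well-formed fragments. For messier markup, extracting text through Scrapy's own selectors may be the safer route.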

In [6]:
# Read the exported CSV file row by row
import csv

# newline='' lets the csv module handle line endings itself
with open('/tmp/quotes.csv', newline='') as file_obj:
    reader_obj = csv.reader(file_obj)

    # Each row is a list of column values; the first row is the header
    for row in reader_obj:
        print(row)


['quote']
['Ours is one continual struggle against a degradation sought to be inflicted upon us by the Europeans, who desire to degrade us to the level of the raw Kaffir whose occupation is hunting, and whose sole ambition is to collect a certain number of cattle to buy a wife with and, then, pass his life in indolence and nakedness.']
['One thing we have endeavoured to observe most scrupulously, namely, never to depart from the strictest facts and, in dealing with the difficult questions that have arisen during the year, we hope that we have used the utmost moderation possible under the circumstances. Our duty is very simple and plain. We want to serve the community, and in our own humble way to serve the Empire. We believe in the righteousness of the cause, which it is our privilege to espouse. We have an abiding faith in the mercy of the Almighty God, and we have firm faith in the British Constitution. That being so, we should fail in our duty if we wrote anything with a view to hur
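
Because the first row of the export is the header `quote`, the file can also be read by column name rather than by position. The sketch below assumes that single-column layout and uses a small in-memory sample (hypothetical data) instead of `/tmp/quotes.csv` so that it runs without a prior crawl.

```python
import csv
import io

# Hypothetical sample standing in for /tmp/quotes.csv; the header row
# matches the single 'quote' field exported by the spider.
sample = io.StringIO('quote\r\n"First quote, with a comma"\r\nSecond quote\r\n')

# DictReader uses the header row as keys, so each row is a dict
reader = csv.DictReader(sample)
quotes = [row["quote"] for row in reader]
print(quotes)  # ['First quote, with a comma', 'Second quote']
```

To read the real export, replace `sample` with the file object returned by `open('/tmp/quotes.csv', newline='')`.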

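The workaround relies on three things:

1. Using `CrawlerRunner` instead of `CrawlerProcess`. `CrawlerProcess` starts and stops the Twisted reactor itself, while `CrawlerRunner` leaves reactor management to the caller (here, crochet).

2. Importing `setup` and `wait_for` from crochet and initializing it with `setup()`.

3. Using the `@wait_for(10)` decorator on the function that runs the spider. `@wait_for` makes a blocking call into the Twisted reactor thread: the caller waits until the result is ready or the timeout (here 10 seconds) expires.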
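
The general pattern behind these points is a long-lived event loop in a background thread, with a blocking call and a timeout on the caller's side. It can be illustrated without Scrapy or crochet. The sketch below is only a loose analogy built on the standard library's `asyncio`, not crochet's actual implementation; `fake_crawl` and `run_blocking` are hypothetical names introduced for illustration.

```python
import asyncio
import threading

# Start one event loop in a background thread and keep it alive,
# loosely mirroring what crochet's setup() does for the Twisted reactor.
_loop = asyncio.new_event_loop()
threading.Thread(target=_loop.run_forever, daemon=True).start()


async def fake_crawl(n):
    """Hypothetical stand-in for a crawl; it just sleeps briefly."""
    await asyncio.sleep(0.01)
    return n * 2


def run_blocking(coro, timeout=10):
    """Submit work to the background loop and block until it finishes.

    Raises a timeout error if no result arrives within `timeout` seconds,
    which is roughly the role @wait_for(10) plays in the notebook.
    """
    future = asyncio.run_coroutine_threadsafe(coro, _loop)
    return future.result(timeout=timeout)


# The loop is never stopped, so repeated calls keep working, much as
# re-running the notebook cell does once the reactor lives in its own thread.
first = run_blocking(fake_crawl(1))
second = run_blocking(fake_crawl(2))
print(first, second)  # 2 4
```

The point of the analogy is that nothing ever restarts the loop: each call only hands it new work, which is why the second run no longer fails.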