<h1>Crawlers and Scrapers</h1>

The goal of this session is to build and run our own Amazon.com scraper using the **scrapy** Python library.

Our scraper will crawl through a specific product's customer review pages and get all of the available ratings and reviews. This will allow us to get complete review details that we can’t get with the Amazon Product Advertising API.

First, we will install scrapy using the pip command in a terminal/cmd:

In [1]:
!pip install scrapy



Next, we'll use scrapy to automatically generate a skeleton of the code needed for our scraper. (In a terminal/cmd, type the following command without the leading exclamation mark.)

In [2]:
!scrapy genspider amazon_scraper amazon.com

Created spider 'amazon_scraper' using template 'basic' 


This creates a new Python script, amazon_scraper.py. The final content of our scraper will be as follows:

In [None]:
# -*- coding: utf-8 -*-
import scrapy

class AmazonScraperSpider(scrapy.Spider):
    name = 'amazon_scraper'
    allowed_domains = ['amazon.com']
    # assign a product-review-page url below
    start_urls = ['https://www.amazon.com/Apple-iPhone-Verizon-Unlocked-Renewed/product-reviews/B07HYDFX8G/ref=cm_cr_arp_d_viewopt_srt?ie=UTF8&reviewerType=all_reviews&sortBy=helpful&pageNumber=1']
    
    def parse(self, response):
        # Join the text nodes of each review body into one string per review
        review_texts = [
            "".join(node.css('::text').extract()).strip()
            for node in response.css('.a-size-base.review-text')
        ]

        # Raw star-rating strings, in page order
        review_ratings = response.css('[data-hook="review-star-rating"] > span::text').extract()

        # zip() stops at the shorter list, so a count mismatch cannot raise an IndexError
        for text, rating in zip(review_texts, review_ratings):
            yield {'text': text, 'rating': rating}

        # The "next page" link is missing on the last page, so follow it only when present
        next_page_url = response.css('.a-last > a::attr(href)').extract_first()
        if next_page_url is not None:
            yield response.follow(next_page_url, self.parse)
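
The `rating` field yielded by `parse` is the raw text of the star-rating element, not a number. The sketch below shows one way to turn it into a float and average it. It assumes the text looks like `5.0 out of 5 stars`, which is an assumption about Amazon's markup rather than something the scraper guarantees. The helper names `parse_rating` and `average_rating` and the sample reviews are hypothetical, made up for illustration.

```python
import re
from statistics import mean

# Assumed rating format: "<number> out of 5 stars", e.g. "4.0 out of 5 stars".
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?) out of 5")


def parse_rating(raw):
    """Return the numeric star rating from a raw rating string, or None if it does not match."""
    match = RATING_PATTERN.search(raw)
    return float(match.group(1)) if match else None


def average_rating(reviews):
    """Average the parseable ratings in a list of review dicts; None if there are none."""
    ratings = [parse_rating(review['rating']) for review in reviews]
    ratings = [rating for rating in ratings if rating is not None]
    return mean(ratings) if ratings else None


# Hypothetical sample items shaped like the dicts yielded by parse()
sample_reviews = [
    {'text': 'Works great.', 'rating': '5.0 out of 5 stars'},
    {'text': 'Battery drains quickly.', 'rating': '2.0 out of 5 stars'},
]
print(average_rating(sample_reviews))
```

Strings that do not match the assumed pattern are skipped rather than raising an error, so a change in the page markup shows up as missing ratings instead of a crash.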


We can call the script from terminal/cmd as follows:


In [3]:
!scrapy runspider amazon_scraper.py -o out.json

2020-03-12 22:30:19 [scrapy.utils.log] INFO: Scrapy 2.0.0 started (bot: scrapybot)
2020-03-12 22:30:19 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.6.10 |Anaconda, Inc.| (default, Jan  7 2020, 15:01:53) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Darwin-17.5.0-x86_64-i386-64bit
2020-03-12 22:30:19 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-03-12 22:30:19 [scrapy.crawler] INFO: Overridden settings:
{'FEED_FORMAT': 'json', 'FEED_URI': 'out.json', 'SPIDER_LOADER_WARN_ONLY': True}
2020-03-12 22:30:20 [scrapy.extensions.telnet] INFO: Telnet Password: 2db2ab647f899c41
2020-03-12 22:30:20 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'sc