# Scraping Winemag with scrapy

## A proof of concept

As a proof of concept, we scrape all reviews of wines from California. Because we can retrieve only 50,000 observations at once, we split the crawl into arbitrary groups (regions and appellations) via the start URLs. The `link` column of the resulting dataset could be passed to a second parse function to retrieve the full review of each individual wine. We hold each entry in a dictionary; lists would be more efficient, but dictionaries keep the code more readable for now.
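
The grouping via start URLs can also be expressed programmatically. The following is a minimal sketch, assuming the query parameters seen in the spider's start URLs; the helper name `build_start_url` and the shortened `regions` list are illustrative and not part of the original notebook.

```python
from urllib.parse import quote, urlencode

BASE_URL = 'https://www.winemag.com/'


def build_start_url(filter_name, filter_value, page=1):
    """Build one search URL for a single group (hypothetical helper)."""
    params = {
        's': '',
        'drink_type': 'wine',
        'country': 'US',
        'state': 'California',
        filter_name: filter_value,  # 'region' or 'appellation'
        'page': page,
    }
    # quote_via=quote encodes spaces as %20 (the default quote_plus would use '+');
    # safe='/' leaves slashes unescaped, as in 'Mendocino/Lake%20Counties'.
    return BASE_URL + '?' + urlencode(params, safe='/', quote_via=quote)


# Illustrative subset of the groups; the spider lists all of them explicitly.
regions = ['Central Coast', 'Sonoma', 'Napa']
start_urls = [build_start_url('region', region) for region in regions]
```

Generating the URLs this way keeps the list of groups short and makes it harder to mistype a percent-encoded name, at the cost of hiding the literal URLs from the reader.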

### Define Spider

In [1]:
from scrapy import Spider
from scrapy.http import Request
import re

class WinemagSpider(Spider):
    name = 'WinemagSpider'
    allowed_domains = ['www.winemag.com']
    start_urls = ['https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Central%20Coast&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Sonoma&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Napa&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=California%20Other&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Mendocino%20County&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Sierra%20Foothills&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Napa-Sonoma&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Central%20Valley&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=North%20Coast&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Lake%20County&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=South%20Coast&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&appellation=California-Washington&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&appellation=California-Oregon&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&appellation=Napa-Monterey-Mendocino&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&appellation=Santa%20Barbara%20County-Sonoma%20County-Monterey%20County&page=1',
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Mendocino/Lake%20Counties&page=1']

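    # `global` inside the class body makes the assignment below bind at module
    # scope, so `parse` and the later notebook cells share the same list.
    global wineries
    wineries = []
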
    def parse(self, response):
        
        organizations = response.xpath('//*[@class="review-item "]')
        for organization in organizations:
            title = organization.xpath('.//*[@class="title"]/text()').extract_first()
            appelation = organization.xpath('.//*[@class="appellation"]//text()').extract_first()
            excerpt = organization.xpath('.//*[@class="excerpt"]//text()').extract_first()
            rating = organization.xpath('.//*[@class="rating"]//text()').extract_first()
            price = organization.xpath('.//*[@class="price"]//text()').extract_first()
            referer = response.url
            link = organization.xpath('.//a/@href').extract_first()
            
            wineries.append({'title': title, 'appelation': appelation, 'excerpt': excerpt, 
                             'rating': rating, 'price': price, 'referer': referer, 'link': link})
        
        page_n = int(re.search(r'\d+$', response.url).group())
        next_pages = response.xpath('//*[@class="pagination"]//a/text()').extract()
                
        if page_n == 1 and next_pages:
            max_page = int(next_pages[-1])
            
            # Strip the trailing page number, keeping everything up to 'page='.
            url = re.sub(r'\d+$', '', response.url)
            
            for i in range(2, max_page + 1):
                absolute_url = url + str(i)
                yield Request(absolute_url, callback=self.parse)
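
The pagination step in `parse` can be illustrated in isolation: on page 1 of a group, the spider reads the highest page number and schedules every remaining page. The sketch below reproduces only the URL arithmetic with the standard library; the function name `follow_up_urls` and the page count of 4 are made up for the example, and `re.sub` is just one way to strip the trailing page number.

```python
import re


def follow_up_urls(first_page_url, max_page):
    """Return the URLs for pages 2..max_page of a group (hypothetical helper)."""
    # Drop the trailing digits so the URL ends in 'page='.
    base = re.sub(r'\d+$', '', first_page_url)
    return [base + str(i) for i in range(2, max_page + 1)]


first = ('https://www.winemag.com/?s=&drink_type=wine&country=US'
         '&state=California&region=Napa&page=1')
urls = follow_up_urls(first, 4)  # 4 is an invented page count
for u in urls:
    print(u)
```

Only page 1 fans out, so each group's pages are requested once even though every response passes through the same `parse` callback.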


### Initiate process and crawl website

In [2]:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0', 
    'CONCURRENT_REQUESTS': 8,
    'DOWNLOAD_DELAY': 0.1
})

2019-07-02 15:08:57 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: scrapybot)
2019-07-02 15:08:57 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.6.8 (default, Jan 14 2019, 11:02:34) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]], pyOpenSSL 18.0.0 (OpenSSL 1.1.0j  20 Nov 2018), cryptography 2.4.2, Platform Linux-4.15.0-52-generic-x86_64-with-Ubuntu-18.04-bionic


In [3]:
process.crawl(WinemagSpider)

2019-07-02 15:08:58 [scrapy.crawler] INFO: Overridden settings: {'CONCURRENT_REQUESTS': 8, 'DOWNLOAD_DELAY': 0.1, 'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0'}
2019-07-02 15:08:58 [scrapy.extensions.telnet] INFO: Telnet Password: 1f7c2855d59d0cfa
2019-07-02 15:08:58 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2019-07-02 15:08:58 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 '

<Deferred at 0x7fd74c0e2fd0>

In [4]:
process.start()

2019-07-02 15:09:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Central%20Coast&page=1> (referer: None)
2019-07-02 15:09:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Sonoma&page=1> (referer: None)
2019-07-02 15:09:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=North%20Coast&page=1> (referer: None)
2019-07-02 15:09:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=California%20Other&page=1> (referer: None)
2019-07-02 15:09:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Central%20Valley&page=1> (referer: None)
2019-07-02 15:09:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ww

In [5]:
process.stop()

<DeferredList at 0x7fd74ef78438 current result: []>

### Verify results

In [6]:
import pandas as pd

wineries = pd.DataFrame(wineries)
len(wineries)

77906
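
Because the groups are arbitrary, one extra check that may be worthwhile is whether the same review link shows up under more than one group. This is a minimal sketch under the assumption that the rows are plain dictionaries shaped like those the spider collects (that is, before the conversion to a DataFrame); the function name `duplicated_links` and the sample rows are invented for illustration and are not real scraped data.

```python
from collections import Counter


def duplicated_links(rows):
    """Map each link that occurs more than once to its count (hypothetical helper)."""
    counts = Counter(row['link'] for row in rows if row.get('link'))
    return {link: n for link, n in counts.items() if n > 1}


# Invented sample rows, not taken from the crawl.
sample = [
    {'title': 'Wine A', 'link': 'https://example.com/a', 'referer': 'group-1'},
    {'title': 'Wine B', 'link': 'https://example.com/b', 'referer': 'group-1'},
    {'title': 'Wine A', 'link': 'https://example.com/a', 'referer': 'group-2'},
]
dupes = duplicated_links(sample)
```

An empty result would suggest the groups do not overlap; a non-empty one lists the links to deduplicate before analysis.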

In [7]:
wineries.head()

| | appelation | excerpt | link | price | rating | referer | title |
|---|---|---|---|---|---|---|---|
| 0 | Central Coast | This is an attention-grabbing bottling from th... | https://www.winemag.com/buying-guide/pali-2016... | $60 | 95 | https://www.winemag.com/?s=&drink_type=wine&co... | Pali 2016 John Sebastiano Vineyard Pinot Noir ... |
| 1 | Central Coast | Very dark in the glass, this top-level bottlin... | https://www.winemag.com/buying-guide/stolpman-... | $68 | 95 | https://www.winemag.com/?s=&drink_type=wine&co... | Stolpman 2017 Angeli Syrah (Ballard Canyon) |
| 2 | Central Coast | Deep and dark aromas of black cherry, mint cho... | https://www.winemag.com/buying-guide/orfila-20... | $65 | 94 | https://www.winemag.com/?s=&drink_type=wine&co... | Orfila 2016 Sequestered Pinot Noir (Santa Mari... |
| 3 | Central Coast | Impressive aromas of roasted cocoa, cappuccino... | https://www.winemag.com/buying-guide/caliza-20... | $60 | 94 | https://www.winemag.com/?s=&drink_type=wine&co... | Caliza 2016 Azimuth Red (Paso Robles Willow Cr... |
| 4 | Central Coast | Black plum, violets and oak aromas come in wav... | https://www.winemag.com/buying-guide/central-c... | $97 | 94 | https://www.winemag.com/?s=&drink_type=wine&co... | Central Coast Group Project 2013 Names Syrah (... |


In [8]:
from datetime import date

# Add the .csv extension so the dated archive is recognizable as a CSV file.
wineries.to_csv('winemag_archive_{}.csv'.format(date.today().isoformat()))