# Scraping Winemag with Scrapy

## A proof of concept

As a proof of concept, we scrape all reviews of wines from California. Because we can only retrieve 50,000 observations at once, we split the observations into arbitrary groups (here, by region or appellation) via the start URLs. The `link` column of the dataset that we create could be piped into another parse function to retrieve the full review of each individual wine. We also hold each entry in a dictionary; using lists instead would be more efficient, but this approach is more readable for now.
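
The grouping via start URLs can be sketched as a small helper that builds one search URL per group. This is only an illustration: `build_start_url`, `BASE`, and `GROUPS` are hypothetical names that the spider does not use, and the query layout is copied from the start URLs listed in the spider definition.

```python
from urllib.parse import quote

# Query prefix shared by all start URLs in this notebook.
BASE = 'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California'

# Hypothetical subset of (filter, value) pairs; the spider lists all of them.
GROUPS = [
    ('region', 'Central Coast'),
    ('region', 'Sonoma'),
    ('appellation', 'California-Oregon'),
    ('region', 'Mendocino/Lake Counties'),
]


def build_start_url(key, value, page=1):
    """Return the search URL for one group, percent-encoding the value."""
    # quote() leaves '/' and '-' untouched and turns spaces into %20.
    return '{}&{}={}&page={}'.format(BASE, key, quote(value), page)


start_urls = [build_start_url(key, value) for key, value in GROUPS]
```

Generating the URLs this way keeps the list of groups short and easy to extend, at the cost of one more level of indirection than the literal list.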

### Define Spider

In [1]:
from scrapy import Spider
from scrapy.http import Request
import re

class WinemagSpider(Spider):
    name = 'WinemagSpider'
    allowed_domains = ['www.winemag.com']
    start_urls = ['https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Central%20Coast&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Sonoma&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Napa&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=California%20Other&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Mendocino%20County&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Sierra%20Foothills&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Napa-Sonoma&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Central%20Valley&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=North%20Coast&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Lake%20County&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=South%20Coast&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&appellation=California-Washington&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&appellation=California-Oregon&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&appellation=Napa-Monterey-Mendocino&page=1', 
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&appellation=Santa%20Barbara%20County-Sonoma%20County-Monterey%20County&page=1',
                  'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Mendocino/Lake%20Counties&page=1']

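    # Module-level list that collects one dict per review; it is declared
    # global so that later notebook cells can read it after the crawl.
    global wineries
    wineries = []
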
    def parse(self, response):
        
        organizations = response.xpath('//*[@class="review-item "]')
        for organization in organizations:
            title = organization.xpath('.//*[@class="title"]/text()').extract_first()
            appelation = organization.xpath('.//*[@class="appellation"]//text()').extract_first()
            excerpt = organization.xpath('.//*[@class="excerpt"]//text()').extract_first()
            rating = organization.xpath('.//*[@class="rating"]//text()').extract_first()
            price = organization.xpath('.//*[@class="price"]//text()').extract_first()
            referer = response.url
            link = organization.xpath('.//a/@href').extract_first()
            
            wineries.append({'title': title, 'appelation': appelation, 'excerpt': excerpt, 
                             'rating': rating, 'price': price, 'referer': referer, 'link': link})
        
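        # The page number is the trailing integer of the URL ("...&page=1").
        page_n = int(re.search(r'\d+$', response.url).group())
        next_pages = response.xpath('//*[@class="pagination"]//a/text()').extract()

        # Only the first page of each group schedules the remaining pages.
        if page_n == 1 and next_pages:
            # Assumes the last pagination link holds the highest page number.
            max_page = int(next_pages[-1])

            # Strip the trailing page number to get the URL prefix.
            url = re.sub(r'\d+$', '', response.url)

            for i in range(2, max_page + 1):
                absolute_url = url + str(i)
                yield Request(absolute_url, callback=self.parse)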
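
The pagination step can be isolated into a plain function, which makes it easy to check without running a crawl. This is a sketch under the same assumption the spider makes, namely that the URL ends in the page number; `remaining_page_urls` is a hypothetical helper name, not part of the spider.

```python
import re


def remaining_page_urls(first_page_url, max_page):
    """Return the URLs of pages 2..max_page for a URL ending in a page number."""
    # Drop the trailing digits ("...&page=1" becomes "...&page=").
    prefix = re.sub(r'\d+$', '', first_page_url)
    return [prefix + str(i) for i in range(2, max_page + 1)]


first = 'https://www.winemag.com/?s=&drink_type=wine&country=US&state=California&region=Napa&page=1'
urls = remaining_page_urls(first, 3)
```

Here `urls` holds the same address with `page=2` and `page=3`; a group with a single page yields an empty list, so no extra requests would be scheduled.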


### Initiate process and crawl website

In [2]:
from scrapy.crawler import CrawlerProcess
from datetime import date

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0', 
    'DOWNLOAD_DELAY': 0.05,  # seconds to wait between requests
#     'CONCURRENT_REQUESTS': 2,
    'LOG_FILE': 'log_{}.txt'.format(date.today().isoformat())
})

In [3]:
process.crawl(WinemagSpider)

<Deferred at 0x105455d90>

In [4]:
process.start()

In [5]:
process.stop()

<DeferredList at 0x1084e2910 current result: []>

### Verify results

In [6]:
import pandas as pd

wineries = pd.DataFrame(wineries)
len(wineries)

1427
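
The introduction notes that lists would be more efficient than dictionaries for holding the entries. A minimal sketch of that alternative, assuming a fixed field order: each review is stored as a tuple and the field names are kept once in `FIELDS`. Both `FIELDS` and `add_row` are hypothetical names, and the sample row reuses the truncated values displayed in this notebook's output.

```python
# Field names are stored once instead of being repeated in every entry.
FIELDS = ('title', 'appelation', 'excerpt', 'rating', 'price', 'referer', 'link')

rows = []


def add_row(title, appelation, excerpt, rating, price, referer, link):
    """Append one review as a tuple in FIELDS order."""
    rows.append((title, appelation, excerpt, rating, price, referer, link))


# Values as displayed (truncated) in this notebook's output.
add_row('Hartford Court 2021 Rosé of Pinot Noir (Russia...', 'Sonoma',
        'Quenching in acid-driven flavors of Meyer lemo...', '94', '$32',
        'https://www.winemag.com/?s=&drink_type=wine&co...',
        'https://www.winemag.com/buying-guide/hartford-...')

# Rebuild a dict view on demand; with pandas, the equivalent would be
# pd.DataFrame(rows, columns=FIELDS).
records = [dict(zip(FIELDS, row)) for row in rows]
```

The trade-off is that the position of each value now carries its meaning, so the field order has to stay in sync everywhere rows are built.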

In [7]:
wineries.head()

| | title | appelation | excerpt | rating | price | referer | link |
|---|---|---|---|---|---|---|---|
| 0 | Hartford Court 2021 Rosé of Pinot Noir (Russia... | Sonoma | Quenching in acid-driven flavors of Meyer lemo... | 94 | $32 | https://www.winemag.com/?s=&drink_type=wine&co... | https://www.winemag.com/buying-guide/hartford-... |
| 1 | Marimar 2018 La Masía Don Miguel Vineyard Esta... | Sonoma | Perfumed in rose and violet, this wine offers ... | 92 | $49 | https://www.winemag.com/?s=&drink_type=wine&co... | https://www.winemag.com/buying-guide/marimar-2... |
| 2 | Rodney Strong 2021 Rosé of Pinot Noir (Sonoma ... | Sonoma | This perennially delicious wine continues its ... | 91 | $25 | https://www.winemag.com/?s=&drink_type=wine&co... | https://www.winemag.com/buying-guide/rodney-st... |
| 3 | Davies 2019 Nobles Vineyard Pinot Noir (Fort R... | Sonoma | Rich and thick in sweet red fruit, this wine o... | 91 | $75 | https://www.winemag.com/?s=&drink_type=wine&co... | https://www.winemag.com/buying-guide/davies-20... |
| 4 | Marimar 2017 Don Miguel Vineyard Earthquake Bl... | Sonoma | High acid underscores the wealth of tart red f... | 91 | $66 | https://www.winemag.com/?s=&drink_type=wine&co... | https://www.winemag.com/buying-guide/marimar-2... |


In [8]:
wineries.to_csv('winemag_archive_{}.csv'.format(date.today().isoformat()))