## Scotch

The scraping notebook walked through some basic web scraping with HTML that was fairly straightforward and well behaved, to show you the basics of the libraries involved. There are also a lot of nicely written tutorials on the web where you can see a polished version of what people did for scraping. But the question I was struggling with while putting this together was: how do you really teach someone to scrape? And what is a good site to teach with?

While I was thinking about it, I received the weekly email from one of my favorite online retailers, [Drink Up NY](http://www.drinkupny.com/Default.asp). Since I am quite fond of their inventory and I had scraping on the brain, I thought I would see if I could scrape the data (and I didn't see any good reason [why I shouldn't](http://www.drinkupny.com/robots.txt)). As I was playing around with this notebook I thought it might be a good way to demonstrate a thought process for approaching scraping. This is not a definitive way to scrape; it is just the way *I* approached *this* page sitting around on my couch the week before this tutorial. Like anything in Python, there are a lot of ways to accomplish something.

I can read and write some HTML and I have broken some CSS on WordPress before, but I am not a web dev. I would call myself barely competent in that realm. If I were a web dev, I imagine this would sometimes go a lot faster. I do like tinkering around and trying things though, so that is how I approach scraping and what I will demonstrate here.

So, let's see what is going on with Scotch prices on Drink Up NY!

## Update

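Drink Up NY is no longer available, so the executed cells below try another site instead: the SMWS shop. The surrounding text still describes the original Drink Up NY walkthrough.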

In [1]:
import requests
from bs4 import BeautifulSoup

I am using [requests](http://docs.python-requests.org/en/master/) here, but I could have just as easily used [urllib](https://docs.python.org/3.5/library/urllib.html). 
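
For comparison, here is a minimal sketch of how the same kind of request could be prepared with the standard library instead. The helper names `build_request` and `fetch`, the example URL, and the user-agent string are placeholders for illustration, not anything from the original notebook. Nothing is sent over the network until `fetch` is called.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent):
    """Prepare a GET request that carries a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=10):
    """Send the request and return (status code, raw bytes of the body)."""
    with urlopen(build_request(url, user_agent), timeout=timeout) as response:
        return response.status, response.read()


# Building the request does not touch the network.
req = build_request("https://example.com/shop/", "my-learning-crawler")
print(req.full_url)
```

The bytes returned by `fetch` could be passed to `BeautifulSoup` in the same way as `html_rpt.content`.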

First, we need to get the page. If you navigate around the [Drink Up NY](http://www.drinkupny.com/Default.asp) site a bit, you can see they have a lot of drop-down menus. I am just interested in the [Scotch](https://www.drinkupny.com/single-malt-scotch-whisky-s/77.htm) page right now. I navigated here through the "SPIRITS ETC" - "Whisk(e)y" - "Scotch Whisky" links. For scraping, I am going to change the "per page" option to "120 per page" so I have everything on one page.

In [6]:
html_file = "https://www.smws.com.au/shop/?product_count=60"
html_rpt = requests.get(html_file, headers={'User-Agent': 'GCLearn (Learning crawler for Gavin Cooper: gavincooper.net)'})
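
The introduction mentions checking robots.txt by eye. As a hedged sketch, the standard library's `urllib.robotparser` can make the same kind of check in code. The rules, bot name, and URLs below are invented for illustration and are not the real rules of any site mentioned here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so no network is needed.
example_rules = """\
User-agent: *
Disallow: /cart/
"""

parser = RobotFileParser()
parser.parse(example_rules.splitlines())

allowed_shop = parser.can_fetch("my-learning-crawler", "https://example.com/shop/")
allowed_cart = parser.can_fetch("my-learning-crawler", "https://example.com/cart/checkout")
print(allowed_shop, allowed_cart)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` would load the real file instead of the string used here.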

In [7]:
if html_rpt.status_code == 200:
    print(html_rpt.content)
else:
    print(html_rpt.status_code)



It worked. Let's parse and prettify it.

In [8]:
soup = BeautifulSoup(html_rpt.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html class="avada-html-layout-wide" lang="en-AU" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#">
 <head>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   Products – The Scotch Malt Whisky Society Australia
  </title>
  <link href="//chimpstatic.com" rel="dns-prefetch"/>
  <link href="//fonts.googleapis.com" rel="dns-prefetch"/>
  <link href="//s.w.org" rel="dns-prefetch"/>
  <link href="https://www.smws.com.au/feed/" rel="alternate" title="The Scotch Malt Whisky Society Australia » Feed" type="application/rss+xml"/>
  <link href="https://www.smws.com.au/comments/feed/" rel="alternate" title="The Scotch Malt Whisky Society Australia » Comments Feed" type="application/rss+xml"/>
  <link href="https://www.smws.com.au/events/?ical=1" rel="alternate" title="The Scotch Malt Whisky Society Australia » iCal

Now let's take a look through the structure and find the data. 
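
To make the navigation in the next few cells easier to follow, here is a small sketch on a made-up HTML string (not the SMWS page). It shows that attribute access such as `soup.head` or `soup.body.a` returns the first matching tag.

```python
from bs4 import BeautifulSoup

# A tiny invented document standing in for a real page.
sample_html = """
<html>
  <head><title>Example Shop</title></head>
  <body>
    <div id="wrapper"><a href="/shop/">Shop</a></div>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.head.title.string   # text inside <title>
first_link = sample_soup.body.a              # first <a> anywhere under <body>
print(page_title, first_link.get("href"))
```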

In [9]:
# I wouldn't expect my data to be here, but there is some metadata if we want it.
soup.head

<head>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>Products – The Scotch Malt Whisky Society Australia</title>
<link href="//chimpstatic.com" rel="dns-prefetch"/>
<link href="//fonts.googleapis.com" rel="dns-prefetch"/>
<link href="//s.w.org" rel="dns-prefetch"/>
<link href="https://www.smws.com.au/feed/" rel="alternate" title="The Scotch Malt Whisky Society Australia » Feed" type="application/rss+xml"/>
<link href="https://www.smws.com.au/comments/feed/" rel="alternate" title="The Scotch Malt Whisky Society Australia » Comments Feed" type="application/rss+xml"/>
<link href="https://www.smws.com.au/events/?ical=1" rel="alternate" title="The Scotch Malt Whisky Society Australia » iCal Feed" type="text/calendar"/>
<link href="https://www.smws.com.au/shop/feed/" rel="alternate" title="The Scotch Malt Whisky Society Australia » Product

In [10]:
soup.body

<body class="archive post-type-archive post-type-archive-product woocommerce woocommerce-page woocommerce-no-js tribe-no-js yith-wcbm-theme-avada fusion-image-hovers fusion-body ltr fusion-sticky-header no-tablet-sticky-header no-mobile-sticky-header no-mobile-slidingbar fusion-disable-outline layout-wide-mode has-sidebar fusion-top-header menu-text-align-left fusion-woo-product-design-clean mobile-menu-design-classic fusion-show-pagination-text fusion-header-layout-v3 avada-responsive avada-footer-fx-bg-parallax fusion-search-form-classic fusion-avatar-square">
<a class="skip-link screen-reader-text" href="#content">Skip to content</a>
<div class="" id="wrapper">
<div id="home" style="position:relative;top:-1px;"></div>
<header class="fusion-header-wrapper fusion-header-shadow">
<div class="fusion-header-v3 fusion-logo-left fusion-sticky-menu- fusion-sticky-logo- fusion-mobile-logo-1 fusion-mobile-menu-design-classic">
<div class="fusion-secondary-header">
<div class="fusion-row">
<di

We can certainly pull a lot of links easily with their navigation system. Here is one way.

In [11]:
for link in soup.find_all('a'):
    print(link.get('href'))

#content
/cdn-cgi/l/email-protection#7e585d4f4e4b45585d4f4f4e4518585d4f4f4f45585d484a450d13585d4f4f47450d50585d474745585d4f4f4f45585d4f4e4745501f0b
https://www.facebook.com/Australiansmws
https://www.instagram.com/smws_aus/
https://twitter.com/SMWS_Australia
https://www.linkedin.com/company/scotch-malt-whisky-society-australia
https://www.smws.com.au/
https://www.smws.com.au/
https://www.smws.com.au/shop/
#
https://www.smws.com.au/membership/
https://www.smws.com.au/product/smws-annual-membership-subscription/
https://www.smws.com.au/product/smws-gift-membership/
https://www.smws.com.au/society/
https://www.smws.com.au/news/
https://www.smws.com.au/outturn-magazine-2/
https://www.smws.com.au/unfiltered-magazine/
https://www.smws.com.au/frequently-asked-questions/
https://www.smws.com.au/our-unique-whisky/
https://www.smws.com.au/our-events/
https://www.smws.com.au/tasting-panel/
https://www.smws.com.au/corporate-private-functions/
https://www.smws.com.au/our-team/
https://www.smws.com.
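
The output above includes fragment links such as `#content` and `#`, and on many sites relative paths are common too. One possible way to tidy such a list is sketched below on an invented snippet; `base_url` and the HTML are assumptions made for the example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://example.com/shop/"   # hypothetical page address
sample_html = """
<a href="#content">Skip to content</a>
<a href="#">Menu</a>
<a>No href at all</a>
<a href="/product/cask-one/">Cask One</a>
<a href="https://example.com/news/">News</a>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

clean_links = []
# href=True skips <a> tags that have no href attribute.
for link in sample_soup.find_all("a", href=True):
    href = link["href"]
    if href.startswith("#"):
        continue  # in-page anchors are not separate pages
    clean_links.append(urljoin(base_url, href))  # make relative paths absolute

print(clean_links)
```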

It looks like there are a few tables in there as well.
### Nope
SMWS doesn't use tables: `soup.body.table` returns nothing and `find_all('table')` finds zero matches.

In [14]:
soup.body.table

In [15]:
len(soup.find_all('table'))

0

In [40]:
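# The keyword is class_ (trailing underscore). The empty output recorded below came from
# the original typo, _class, which matches an attribute literally named "_class".
prods = soup.find_all('h3', class_='product-title')
prods[-40:]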

[]
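
A note on the empty result: Beautiful Soup's keyword for filtering by CSS class is `class_` with a trailing underscore, because `class` is a reserved word in Python. Any other keyword, including `_class`, is treated as a filter on an HTML attribute of that name. The sketch below shows the difference on a made-up snippet; whether the real SMWS markup uses these exact class names is not verified here.

```python
from bs4 import BeautifulSoup

sample_html = """
<div class="product-details"><h3 class="product-title">Cask One</h3></div>
<div class="product-details"><h3 class="product-title">Cask Two</h3></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Looks for an attribute literally named "_class", which no tag has.
wrong = sample_soup.find_all("h3", _class="product-title")

# Filters on the class attribute, which is what was intended.
right = sample_soup.find_all("h3", class_="product-title")

print(len(wrong), len(right))
```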

65 items for that selector. And 65 bottles on the page. That could be it. 

When we use select, Beautiful Soup gives us a list. So let's look at the list elements.
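
As a quick illustration on an invented snippet, `select` returns a list-like result even for a single match, while `select_one` returns the first matching tag or `None`. The class names here are placeholders.

```python
from bs4 import BeautifulSoup

sample_html = """
<div class="item"><a class="item-title" href="/p/1">First</a></div>
<div class="item"><a class="item-title" href="/p/2">Second</a></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

items = sample_soup.select(".item")             # list-like, one entry per match
first_title = items[0].select(".item-title")    # still a list, of length 1
one_title = items[0].select_one(".item-title")  # a single Tag
missing = items[0].select_one(".item-price")    # None when nothing matches

print(len(items), first_title[0].text, one_title.get("href"), missing)
```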

In [None]:
soup.select(".v-product")[0]

So when we use the CSS selector, we get a list where each item in the list is one of our Scotch listings. How can we pull some meaningful data from that? I would like to get the name, the link to the product, and the pricing info. Let's get one item to work with so we can try to parse this out.

In [37]:
# class_ (trailing underscore) filters on the class attribute; the original _class returned [].
prods = soup.find_all("div", class_="product-details")
prods

[]

In [34]:
prods

[]

In [None]:
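# `table` is not defined in any earlier cell; presumably it holds the CSS selector results.
table = soup.select(".v-product")
table[1]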

In [None]:
scotch = table[0]
scotch

In [None]:
scotch.a

In [None]:
scotch.a.string

In [None]:
scotch.a.text

None of that is particularly useful. Let's try working with the CSS selectors again.

In [None]:
scotch.select(".v-product__title")

In [None]:
scotch.select(".v-product__title").text

Oops. Remember, `select` returns a list, and a list has no `.text` attribute.

In [None]:
scotch.select(".v-product__title")[0].text

Now we are getting somewhere.

In [None]:
scotch.select(".product_productprice")[0]

In [None]:
scotch.select(".product_productprice")[0].text

In [None]:
scotch.select(".product_saleprice")[0].text

In [None]:
scotch.select(".product_listprice")[0].text

Now what about the product link?

In [None]:
scotch.select(".v-product__title")[0].get('href')

That's it!  [https://www.drinkupny.com/Aberlour-a-bunadh-p/s0958.htm](https://www.drinkupny.com/Aberlour-a-bunadh-p/s0958.htm)

Now we can manipulate our strings a bit to strip off the parts we don't want.

In [None]:
scotch.select(".v-product__title")[0].text.split('\n', 1)[1]

In [None]:
scotch.select(".v-product__title")[0].text.split('\n', 1)[1].strip()

In [None]:
scotch.select(".v-product__title")[0].text.strip()

In [None]:
scotch.select(".product_productprice")[0].text.split('$', 1)[1]

In [None]:
scotch.select(".product_productprice")[0].text.split('$', 1)[1].strip()

In [None]:
scotch.select(".product_saleprice")[0].text.split('$', 1)[1].strip()

In [None]:
scotch.select(".product_listprice")[0].text.split('$', 1)[1].strip()
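
The split-and-strip steps above can be folded into one small helper. This is a sketch, not part of the original notebook: the function name `parse_price` and the sample strings are made up, and it assumes prices are written with a `$` sign, optional thousands separators, and nothing after the number.

```python
from decimal import Decimal


def parse_price(text):
    """Return the amount after the first '$' as a Decimal, or None if absent."""
    if "$" not in text:
        return None
    amount = text.split("$", 1)[1].strip()
    return Decimal(amount.replace(",", ""))


print(parse_price("Our Price: $89.99 "))
print(parse_price("List Price: $1,250.00"))
print(parse_price("Sold out"))
```

`Decimal` is used here rather than `float` so that prices keep their exact cents; a `float` would also work for rough comparisons.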

Let's pull it all together and see if it works for parsing all of the data.

In [None]:
len(table)

In [None]:
import pandas as pd

In [None]:
column_names = ['name', 'url', 'regular_price', 'our_price', 'sale_price']

In [None]:
# Collect one row per listing, then build the DataFrame once.
# (DataFrame.append was removed in pandas 2.0.)
rows = []
for item in table:
    title = item.select(".v-product__title")[0]
    rows.append([title.text.strip(),
                 title.get('href'),
                 item.select(".product_listprice")[0].text.split('$', 1)[1],
                 item.select(".product_productprice")[0].text.split('$', 1)[1],
                 item.select(".product_saleprice")[0].text.split('$', 1)[1]])

df = pd.DataFrame(rows, columns=column_names)

No errors, so that is a good sign. Can we now easily compare Scotch prices?

In [None]:
df
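
The loop above indexes `[0]` on every selector, so it would raise an `IndexError` for any listing that lacks one of the price elements. One possible defensive variant is sketched below on invented HTML that reuses the notebook's selector names; the helper `price_or_none` is hypothetical.

```python
import pandas as pd
from bs4 import BeautifulSoup

sample_html = """
<div class="v-product">
  <a class="v-product__title" href="/p/1"> Cask One </a>
  <div class="product_productprice">Our Price: $80.00</div>
  <div class="product_saleprice">Sale Price: $72.00</div>
</div>
<div class="v-product">
  <a class="v-product__title" href="/p/2"> Cask Two </a>
  <div class="product_productprice">Our Price: $95.00</div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")


def price_or_none(tag, selector):
    """Return the text after '$' for the first match, or None if missing."""
    found = tag.select_one(selector)
    if found is None or "$" not in found.text:
        return None
    return found.text.split("$", 1)[1].strip()


rows = []
for item in sample_soup.select(".v-product"):
    title = item.select_one(".v-product__title")
    rows.append({
        "name": title.text.strip(),
        "url": title.get("href"),
        "our_price": price_or_none(item, ".product_productprice"),
        "sale_price": price_or_none(item, ".product_saleprice"),
    })

sample_df = pd.DataFrame(rows)
print(sample_df)
```

The second listing has no sale price, so its `sale_price` cell is left empty instead of stopping the whole loop.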

If you went through this notebook instead of trying a page you wanted to scrape, I fully expect you to go try and scrape your own page now!

If you are interested in working with [scrapy](https://doc.scrapy.org/en/latest/index.html), I highly recommend starting with the [tutorial](https://doc.scrapy.org/en/latest/intro/tutorial.html). I have provided a [scrapy sample](./scrapy/scotch) that uses scrapy to begin pulling data from the site utilized here.