# Section 10: HTML, CSS and Web Scraping

### Terminology

A web page can be represented as a tree of objects that make up its structure and content. This representation is known as the **Document Object Model (DOM)**. The DOM gives programs an interface for reading and changing a page's structure, style, and content. It represents the document as nodes and objects, which lets a programming language such as JavaScript modify the page and its HTML interactively.

The DOM and HTML arrange a page's elements in a hierarchy. You can navigate that hierarchy much like a family tree, and this is one of Beautiful Soup's main mechanisms for navigation. Once you have selected a specific element within a page, you can move to related elements, such as the tag's siblings, parent, or descendants.
  
To learn more about the DOM see:  
https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction

<img src="images/DOM-model.svg.png" width="500">

### Beautiful Soup     

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library designed for quick scraping projects. It allows you to select and navigate the tree-like structure of HTML documents, searching for particular tags, attributes, or IDs. You can then traverse the document further through relationships such as children or siblings. In other words, with Beautiful Soup you could first select a specific `div` tag and then search through all of its nested tags.
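The select-then-search pattern and the family-tree relations can be tried on a tiny document first. The sketch below uses a hand-written HTML string (the `shelf`, `title`, and `price` names are made up for illustration, not taken from any real site) and shows one way to move to a parent, a sibling, and all nested tags.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document (hypothetical, not from a real site).
html = """
<div id="shelf">
  <p class="title">First book</p>
  <p class="price">10.00</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

shelf = soup.find("div", {"id": "shelf"})     # select one element
title = shelf.find("p", {"class": "title"})   # search only inside it

parent_name = title.parent.name                      # up to the enclosing tag
sibling = title.find_next_sibling("p")               # across to the next <p>
nested = [tag.name for tag in shelf.find_all(True)]  # every tag nested in shelf

print(parent_name)
print(sibling.text)
print(nested)
```

Calling `find` or `find_all` on a tag rather than on `soup` restricts the search to that tag's descendants, which is what makes the nested lookups later in this section work.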


## Scraping a Single Page

In [1]:
from bs4 import BeautifulSoup
import requests

The site scraped in this section is the sandbox at http://books.toscrape.com/

In [2]:
html_page = requests.get('http://books.toscrape.com/') # Make a GET request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser') # Pass the page contents to Beautiful Soup for parsing


In [6]:
soup

<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="s

In [7]:
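soup.prettify  # without parentheses this displays the bound method; call soup.prettify() to get the formatted HTML string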

<bound method Tag.prettify of <!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" t

In [8]:
soup.find_all('li', {'class': 'col-xs-6 col-sm-4 col-md-3 col-lg-3'})

[<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
 <div class="image_container">
 <a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
 <div class="product_price">
 <p class="price_color">£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>
 </li>, <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
 <div class="image_c

In [9]:
first_20 = soup.find_all('li', {'class': 'col-xs-6 col-sm-4 col-md-3 col-lg-3'})

In [10]:
len(first_20)

20

In [11]:
first = first_20[0]

In [12]:
first

<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>
</li>

In [15]:
first.find('a')

<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>

In [16]:
first.find('a')['href']

'catalogue/a-light-in-the-attic_1000/index.html'

In [17]:
first.find('h3')

<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>

In [14]:
first.find('h3').find('a')['title']

'A Light in the Attic'

In [18]:
first.find('p', {'class': 'price_color'})

<p class="price_color">£51.77</p>

In [19]:
first.find('p', {'class': 'price_color'}).text

'£51.77'
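The price comes back as text, so it cannot be summed or averaged yet. A minimal sketch of a conversion step follows; `parse_price` is a hypothetical helper, and it assumes every price looks like `'£51.77'` (a currency symbol followed by digits and one decimal point).

```python
def parse_price(price_text):
    """Turn a string such as '£51.77' into the float 51.77."""
    # Keep only digits and the decimal point; assumes the '£12.34' shape.
    digits = "".join(ch for ch in price_text if ch.isdigit() or ch == ".")
    return float(digits)

price = parse_price("£51.77")
print(price)
```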

In [21]:
first.find('p', {'class': "instock availability"}).text

'\n\n    \n        In stock\n    \n'
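The availability text carries the newlines and indentation of the original HTML. If a clean label is wanted, one option is to strip the surrounding whitespace, as in this small sketch that reuses the string shown above.

```python
raw = '\n\n    \n        In stock\n    \n'

# strip() removes leading and trailing whitespace, including newlines.
clean = raw.strip()
print(clean)
```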

In [22]:
# this one uses Regex -- a Mod 4 topic -- but could come in handy!!

import re
regex = re.compile("star-rating (.*)")
first.find('p', {'class': regex})

<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>

In [23]:
first.find('p', {'class': regex})['class']

['star-rating', 'Three']

In [24]:
first.find('p', {'class': regex})['class'][1]

'Three'
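Because `class` is a multi-valued attribute, Beautiful Soup can also match on a single class name through the `class_` keyword, which may be a simpler alternative to the regex here. The sketch below runs on a hand-copied snippet and then maps the rating word to a number; `STAR_WORDS` is a hypothetical lookup table that assumes ratings run from One to Five.

```python
from bs4 import BeautifulSoup

# Hypothetical lookup table: rating word -> integer.
STAR_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

snippet = '<p class="star-rating Three"><i class="icon-star"></i></p>'

# class_ matches a tag if any one of its classes equals the given name.
tag = BeautifulSoup(snippet, "html.parser").find("p", class_="star-rating")

# tag["class"] is a list such as ['star-rating', 'Three'].
word = tag["class"][-1]
stars = STAR_WORDS[word]
print(stars)
```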

In [25]:
def clean_scrape(book):
    """Extract title, price, stock status, star rating and URL from one book's <li> tag."""
    info = {}

    info['title'] = book.find('h3').find('a')['title']
    info['price'] = book.find('p', {'class': 'price_color'}).text

    # True when this book's availability text contains 'In stock'
    info['in_stock'] = 'In stock' in book.find('p', {'class': "instock availability"}).text

    # The rating word ('One' ... 'Five') is the last entry in the tag's class list
    regex = re.compile("star-rating (.*)")
    info['stars'] = book.find('p', {'class': regex})['class'][-1]

    info['url'] = 'http://books.toscrape.com/' + book.find('a')['href']

    return info

In [28]:
clean_scrape(first)

{'title': 'A Light in the Attic',
 'price': '£51.77',
 'in_stock': True,
 'stars': 'Three',
 'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'}
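String concatenation works when every `href` is relative to the site root. If a link turns out to be relative to the page it was found on (which may happen on pages that live in a subfolder), `urllib.parse.urljoin` resolves it against that page's URL instead. A minimal sketch follows; the second link, `some-book_1/index.html`, is a made-up example.

```python
from urllib.parse import urljoin

home = "http://books.toscrape.com/"
from_home = urljoin(home, "catalogue/a-light-in-the-attic_1000/index.html")
print(from_home)

# Hypothetical case: a link relative to a page inside catalogue/.
page = "http://books.toscrape.com/catalogue/page-2.html"
from_page = urljoin(page, "some-book_1/index.html")
print(from_page)
```

Using this inside a scraping function would mean passing in the URL of the page being parsed, so that each link is joined against the page it came from.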

In [29]:
book_dicts = [clean_scrape(book) for book in first_20]

In [27]:
book_dicts

[{'title': 'A Light in the Attic',
  'price': '£51.77',
  'in_stock': True,
  'stars': 'Three',
  'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'},
 {'title': 'Tipping the Velvet',
  'price': '£53.74',
  'in_stock': True,
  'stars': 'One',
  'url': 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'},
 {'title': 'Soumission',
  'price': '£50.10',
  'in_stock': True,
  'stars': 'One',
  'url': 'http://books.toscrape.com/catalogue/soumission_998/index.html'},
 {'title': 'Sharp Objects',
  'price': '£47.82',
  'in_stock': True,
  'stars': 'Four',
  'url': 'http://books.toscrape.com/catalogue/sharp-objects_997/index.html'},
 {'title': 'Sapiens: A Brief History of Humankind',
  'price': '£54.23',
  'in_stock': True,
  'stars': 'Five',
  'url': 'http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html'},
 {'title': 'The Requiem Red',
  'price': '£22.65',
  'in_stock': True,
  'stars': 'One',
  'url': 'http:/

In [30]:
import pandas as pd
pd.DataFrame(book_dicts)

| | title | price | in_stock | stars | url |
|---|---|---|---|---|---|
| 0 | A Light in the Attic | £51.77 | True | Three | http://books.toscrape.com/catalogue/a-light-in... |
| 1 | Tipping the Velvet | £53.74 | True | One | http://books.toscrape.com/catalogue/tipping-th... |
| 2 | Soumission | £50.10 | True | One | http://books.toscrape.com/catalogue/soumission... |
| 3 | Sharp Objects | £47.82 | True | Four | http://books.toscrape.com/catalogue/sharp-obje... |
| 4 | Sapiens: A Brief History of Humankind | £54.23 | True | Five | http://books.toscrape.com/catalogue/sapiens-a-... |
| 5 | The Requiem Red | £22.65 | True | One | http://books.toscrape.com/catalogue/the-requie... |
| 6 | The Dirty Little Secrets of Getting Your Dream... | £33.34 | True | Four | http://books.toscrape.com/catalogue/the-dirty-... |
| 7 | The Coming Woman: A Novel Based on the Life of... | £17.93 | True | Three | http://books.toscrape.com/catalogue/the-coming... |
| 8 | The Boys in the Boat: Nine Americans and Their... | £22.60 | True | Four | http://books.toscrape.com/catalogue/the-boys-i... |
| 9 | The Black Maria | £52.15 | True | One | http://books.toscrape.com/catalogue/the-black-... |


## Scraping Multiple Pages (Pagination!)

In [31]:
url = 'http://books.toscrape.com/catalogue/page-1.html'

In [32]:
urls = ['http://books.toscrape.com/catalogue/page-{}.html'.format(i) for i in range(1, 51)]
urls

['http://books.toscrape.com/catalogue/page-1.html',
 'http://books.toscrape.com/catalogue/page-2.html',
 'http://books.toscrape.com/catalogue/page-3.html',
 'http://books.toscrape.com/catalogue/page-4.html',
 'http://books.toscrape.com/catalogue/page-5.html',
 'http://books.toscrape.com/catalogue/page-6.html',
 'http://books.toscrape.com/catalogue/page-7.html',
 'http://books.toscrape.com/catalogue/page-8.html',
 'http://books.toscrape.com/catalogue/page-9.html',
 'http://books.toscrape.com/catalogue/page-10.html',
 'http://books.toscrape.com/catalogue/page-11.html',
 'http://books.toscrape.com/catalogue/page-12.html',
 'http://books.toscrape.com/catalogue/page-13.html',
 'http://books.toscrape.com/catalogue/page-14.html',
 'http://books.toscrape.com/catalogue/page-15.html',
 'http://books.toscrape.com/catalogue/page-16.html',
 'http://books.toscrape.com/catalogue/page-17.html',
 'http://books.toscrape.com/catalogue/page-18.html',
 'http://books.toscrape.com/catalogue/page-19.html',
 '

In [None]:
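def get_20_books(url):
    """Request one catalogue page and return a list of dicts, one per book."""
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'html.parser')

    # Each book on the page sits in an <li> with this exact class string
    raw = soup.find_all('li', {'class': 'col-xs-6 col-sm-4 col-md-3 col-lg-3'})
    to_dicts = [clean_scrape(book) for book in raw]

    return to_dicts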

In [None]:
all_dicts = []

for url in urls:
    all_dicts.extend(get_20_books(url))

print(len(all_dicts))
all_dicts
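A loop over 50 pages sends 50 requests back to back, and a single failed page stops the whole run. One possible variation, sketched below, pauses between requests and records failures instead of raising. `scrape_pages` and `fake_fetch` are hypothetical names; in real use the `fetch` argument would be a function like `get_20_books`, while the stand-in here lets the sketch run without network access.

```python
import time

def scrape_pages(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, pausing between requests.

    fetch is any function that takes a URL and returns a list of dicts.
    Pages that raise an exception are recorded and skipped.
    """
    results, failed = [], []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # wait before every request after the first
        try:
            results.extend(fetch(url))
        except Exception as err:  # deliberately broad for a sketch
            failed.append((url, repr(err)))
    return results, failed

# Stand-in fetcher (hypothetical) that fails on one page.
def fake_fetch(url):
    if url.endswith("page-2.html"):
        raise ValueError("simulated failure")
    return [{"url": url}]

demo_urls = ["http://books.toscrape.com/catalogue/page-{}.html".format(i)
             for i in range(1, 4)]
books, errors = scrape_pages(demo_urls, fake_fetch, delay=0)
print(len(books), len(errors))
```

The list of failed URLs can be retried afterwards, so a long run does not need to start over from page 1.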

In [None]:
df = pd.DataFrame(all_dicts)

In [None]:
df