## Scraping datasets of Books

### What is Web Scraping?

Web scraping is the practice of gathering data available on websites. It can be done manually by a human user or automatically by a bot. A bot can gather data much faster than a human, so that is the approach we focus on here. It is therefore technically possible to collect all the data of a website in a matter of minutes. The legality of this practice, however, is not well defined.

#### How does it work?

A web scraper visits a page of the website, extracts the relevant data, and moves on to the next page. Every website has a different structure, which is why web scrapers are usually built to explore one specific website.
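To illustrate this visit, extract, move-on loop, here is a minimal sketch that crawls a tiny made-up site held in memory. The names `FAKE_SITE`, `PageParser` and `crawl` are hypothetical and exist only for this example. It assumes Python 3 and uses the standard-library `html.parser` so that it runs without network access; it is not the scraper built in the rest of this notebook.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "website": URL -> HTML.
# A real scraper would download each page over HTTP instead.
FAKE_SITE = {
    "page-1.html": '<h3><a title="Book A"></a></h3>'
                   '<a class="next" href="page-2.html">next</a>',
    "page-2.html": '<h3><a title="Book B"></a></h3>',
}


class PageParser(HTMLParser):
    """Collects the book titles and the 'next' link found on one page."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "title" in attrs:
            self.titles.append(attrs["title"])
        if tag == "a" and attrs.get("class") == "next":
            self.next_url = attrs.get("href")


def crawl(start_url, site):
    """Visit a page, extract its data, then follow the 'next' link until none is left."""
    titles, url = [], start_url
    while url is not None:
        parser = PageParser()
        parser.feed(site[url])
        titles.extend(parser.titles)
        url = parser.next_url
    return titles


all_titles = crawl("page-1.html", FAKE_SITE)
print(all_titles)
```

The parser is tied to the structure of this particular fake site, which is the point made above: a scraper written for one website rarely works unchanged on another.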

#### Tools we're going to use

We are going to use the Python modules requests and BeautifulSoup.

- Requests will allow us to send HTTP requests to get the HTML files.
- BeautifulSoup will be used to parse the HTML files. It is one of the most widely used libraries for web scraping. It is quite simple to use and has many features that help gather website data efficiently.

#### Prerequisites

- python 2.7
- requests
- beautifulsoup4
- pandas

We will gather the following information about every product on the website:

- book title
- price
- availability
- image
- category
- rating

### First, we will use the requests module to download the HTML of the website's main page

In [1]:
main_url = "http://books.toscrape.com/index.html"

In [2]:
import requests

In [3]:
result = requests.get(main_url)

In [4]:
# use bs4 to parse the page and extract information

from bs4 import BeautifulSoup
doc = BeautifulSoup(result.text, 'html.parser')

print(doc.prettify()[:1000])   # prettify() makes the HTML more readable

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

In [5]:
# function to request and parse an HTML web page

def getAndParseURL(url):
    result = requests.get(url)
    doc = BeautifulSoup(result.text, 'html.parser')
    return doc

### Find book URLs on the main page

In [6]:
# to find a book URL on the main page, we use the find() function with the tag name
# and its class value

doc.find("article", class_ = "product_pod")

<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">Â£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

In [7]:
doc.find("article", class_ = "product_pod").div.a   # using the other child tags for deeper information

<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>

In [8]:
# to get the URL stored in the 'href' attribute we'll use the .get() function

doc.find("article", class_ = "product_pod").div.a.get('href') 

'catalogue/a-light-in-the-attic_1000/index.html'

In [9]:
# we'll gather all the product URLs at once by using the findAll() function

main_page_products_urls = [x.div.a.get('href') for x in doc.findAll("article", class_ = "product_pod")]

print(str(len(main_page_products_urls)) + " fetched products URLs")
print("One example:")
main_page_products_urls[0]

20 fetched products URLs
One example:


'catalogue/a-light-in-the-attic_1000/index.html'

In [10]:
def getBooksURLs(url):
    doc = getAndParseURL(url)
    
    # remove the index.html part of the base url before returning the results
    
    return(["/".join(url.split("/")[:-1]) + "/" + x.div.a.get('href') for x in doc.findAll("article", class_ = "product_pod")])
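As an alternative to splitting and re-joining the URL by hand, relative links can be resolved with `urljoin` from the standard library. The sketch below assumes Python 3 (where `urljoin` lives in `urllib.parse`). The first `href` is the one fetched above; the `../../media/...` path is a hypothetical value used only to show how parent-directory segments are resolved.

```python
from urllib.parse import urljoin

main_url = "http://books.toscrape.com/index.html"

# href taken from the main page, relative to main_url
book_href = "catalogue/a-light-in-the-attic_1000/index.html"
book_url = urljoin(main_url, book_href)
print(book_url)

# hypothetical relative path containing "../" segments,
# resolved against the book page URL
image_url = urljoin(book_url, "../../media/cache/example.jpg")
print(image_url)
```

`urljoin` drops the last path segment of the base URL (here `index.html`) before appending the relative part, which is what the manual `split`/`join` in `getBooksURLs` does.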

### Find book category URLs on the main page

In [11]:
import re

# strip "index.html" from the main URL before appending the relative link
categories_urls = [main_url.replace("index.html", "") + x.get('href') for x in doc.find_all("a", href=re.compile("catalogue/category/books"))]

# we remove the first one because it corresponds to all the books

categories_urls = categories_urls[1:]

print(str(len(categories_urls)) + " fetched categories URLs")
print("Some examples:")
categories_urls[:10]

50 fetched categories URLs
Some examples:


['http://books.toscrape.com/catalogue/category/books/travel_2/index.html',
 'http://books.toscrape.com/catalogue/category/books/mystery_3/index.html',
 'http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html',
 'http://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html',
 'http://books.toscrape.com/catalogue/category/books/classics_6/index.html',
 'http://books.toscrape.com/catalogue/category/books/philosophy_7/index.html',
 'http://books.toscrape.com/catalogue/category/books/romance_8/index.html',
 'http://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html',
 'http://books.toscrape.com/catalogue/category/books/fiction_10/index.html',
 'http://books.toscrape.com/catalogue/category/books/childrens_11/index.html']

### Scraping all books data

In [12]:
# store all the results into a list
pages_urls = [main_url]

doc = getAndParseURL(pages_urls[0])

# while we get two matches, this means that the webpage contains a 'previous' and a 'next' button
# if there is only one button, this means that we are either on the first page or on the last page
# we stop when we get to the last page

while len(doc.findAll("a", href=re.compile("page"))) == 2 or len(pages_urls) == 1:
    
    # get the new complete url by adding the fetched URL to the base URL (and removing the .html part of the base URL)
    new_url = "/".join(pages_urls[-1].split("/")[:-1]) + "/" + doc.findAll("a", href=re.compile("page"))[-1].get("href")
    
    # add the URL to the list
    pages_urls.append(new_url)
    
    # parse the next page
    doc = getAndParseURL(new_url)
    

print(str(len(pages_urls)) + " fetched URLs")
print("Some examples:")
pages_urls[:5]

50 fetched URLs
Some examples:


['http://books.toscrape.com/index.html',
 'http://books.toscrape.com/catalogue/page-2.html',
 'http://books.toscrape.com/catalogue/page-3.html',
 'http://books.toscrape.com/catalogue/page-4.html',
 'http://books.toscrape.com/catalogue/page-5.html']

In [13]:
result = requests.get("http://books.toscrape.com/catalogue/page-50.html")
print("status code for page 50: " + str(result.status_code))

result = requests.get("http://books.toscrape.com/catalogue/page-51.html")
print("status code for page 51: " + str(result.status_code))

status code for page 50: 200
status code for page 51: 404


In [14]:
pages_urls = []

new_page = "http://books.toscrape.com/catalogue/page-1.html"
while requests.get(new_page).status_code == 200:
    pages_urls.append(new_page)
    new_page = pages_urls[-1].split("-")[0] + "-" + str(int(pages_urls[-1].split("-")[1].split(".")[0]) + 1) + ".html"
    

print(str(len(pages_urls)) + " fetched URLs")
print("Some examples:")
pages_urls[:5]

50 fetched URLs
Some examples:


['http://books.toscrape.com/catalogue/page-1.html',
 'http://books.toscrape.com/catalogue/page-2.html',
 'http://books.toscrape.com/catalogue/page-3.html',
 'http://books.toscrape.com/catalogue/page-4.html',
 'http://books.toscrape.com/catalogue/page-5.html']
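The chained `split` calls used to build the next page URL are hard to read. A minimal sketch of the same idea with a regular expression is shown here; `next_page_url` is a hypothetical helper name, and it assumes that catalogue pages always end in `page-N.html`, as in the URLs listed above.

```python
import re


def next_page_url(url):
    """Return the URL of the following catalogue page (page-N.html -> page-(N+1).html)."""
    match = re.search(r"page-(\d+)\.html$", url)
    if match is None:
        raise ValueError("not a paginated catalogue URL: " + url)
    next_number = int(match.group(1)) + 1
    # keep everything before "page-N.html" and append the incremented page name
    return url[:match.start()] + "page-{}.html".format(next_number)


example = next_page_url("http://books.toscrape.com/catalogue/page-1.html")
print(example)
```

The helper only computes the URL; whether that page exists still has to be checked with a request, as in the status-code loop above.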

### Get all product URLs

In [15]:
# iterating through the pages

booksURLs = []
for page in pages_urls:
    booksURLs.extend(getBooksURLs(page))
    
print(str(len(booksURLs)) + " fetched URLs")
print("Some examples:")
booksURLs[:5]

1000 fetched URLs
Some examples:


['http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
 'http://books.toscrape.com/catalogue/soumission_998/index.html',
 'http://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 'http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html']

### Get product data

This step puts everything together to scrape the data for each product.
With this step we can retrieve the following information:
- book title
- price
- availability
- image
- category
- rating

In [None]:
names = []
prices = []
nb_in_stock = []
img_urls = []
categories = []
ratings = []

# scrape data for every book URL: this may take some time
for url in booksURLs:
    doc = getAndParseURL(url)
    # product name
    names.append(doc.find("div", class_ = re.compile("product_main")).h1.text)
    # product price
    prices.append(re.sub(r"[^0-9.]", "", doc.find("p", class_ = "price_color").text)) # keep only digits and the decimal point, which drops the currency symbol however it was decoded
    # number of available products
    nb_in_stock.append(re.sub("[^0-9]", "", doc.find("p", class_ = "instock availability").text)) # get rid of non numerical characters
    # image url
    img_urls.append(url.replace("index.html", "") + doc.find("img").get("src"))
    # product category
    categories.append(doc.find("a", href = re.compile("../category/books/")).get("href").split("/")[3])
    # ratings
    ratings.append(doc.find("p", class_ = re.compile("star-rating")).get("class")[1])
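The values collected by this loop are all strings: the rating is a word such as `Three`, the category is a slug such as `poetry_23`, and the price is text. If numeric or cleaner values are needed later, small helpers along the following lines could be applied. This is only a sketch: the helper names are hypothetical, and it assumes that ratings range from `One` to `Five` and that category slugs end in `_<number>`.

```python
import re

# assumption: the site only uses these five rating words
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(word):
    """Convert a rating word such as 'Three' to the integer 3."""
    return RATING_WORDS[word]


def category_name(slug):
    """Drop the trailing numeric id: 'historical-fiction_4' -> 'historical-fiction'."""
    return slug.rsplit("_", 1)[0]


def price_to_float(text):
    """Keep digits and the decimal point, then convert to float."""
    return float(re.sub(r"[^0-9.]", "", text))


print(rating_to_int("Three"), category_name("poetry_23"), price_to_float("£51.77"))
```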
    

In [17]:
# add data into pandas df
import pandas as pd

scraped_data = pd.DataFrame({'name': names, 'price': prices, 'nb_in_stock': nb_in_stock, "url_img": img_urls, "product_category": categories, "rating": ratings})
scraped_data

| | name | price | nb_in_stock | url_img | product_category | rating |
|---|---|---|---|---|---|---|
| 0 | A Light in the Attic | 51.77 | 22 | http://books.toscrape.com/catalogue/a-light-in... | poetry_23 | Three |
| 1 | Tipping the Velvet | 53.74 | 20 | http://books.toscrape.com/catalogue/tipping-th... | historical-fiction_4 | One |
| 2 | Soumission | 50.10 | 20 | http://books.toscrape.com/catalogue/soumission... | fiction_10 | One |
| 3 | Sharp Objects | 47.82 | 20 | http://books.toscrape.com/catalogue/sharp-obje... | mystery_3 | Four |
| 4 | Sapiens: A Brief History of Humankind | 54.23 | 20 | http://books.toscrape.com/catalogue/sapiens-a-... | history_32 | Five |
| ... | ... | ... | ... | ... | ... | ... |
| 995 | Alice in Wonderland (Alice's Adventures in Won... | 55.53 | 1 | http://books.toscrape.com/catalogue/alice-in-w... | classics_6 | One |
| 996 | Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1) | 57.06 | 1 | http://books.toscrape.com/catalogue/ajin-demi-... | sequential-art_5 | Four |
| 997 | A Spy's Devotion (The Regency Spies of London #1) | 16.97 | 1 | http://books.toscrape.com/catalogue/a-spys-dev... | historical-fiction_4 | Five |
| 998 | 1st to Die (Women's Murder Club #1) | 53.98 | 1 | http://books.toscrape.com/catalogue/1st-to-die... | mystery_3 | One |


#### We finally got our data: the web scraping experiment is a success.