## Prerequisites

- python 2.7
- requests
- beautifulsoup4
- pandas

## Objective

We want to scrape the data of an online book store: http://books.toscrape.com/

This website is fictional so we can scrape it as much as we want.

In this tutorial we will be gathering the following information about all the products of the website:
- book title
- price
- availability
- image
- category
- rating

## Warm-up: get the content of the main page

First let's use the requests module to get the HTML of the website's main page.

In [65]:
main_url = "http://www.istpravda.com.ua/"

In [100]:
article_url = main_url + "articles/2019/01/8/153493/"
norm_article_url = main_url + "articles/2019/07/23/155994/"

In [101]:
import requests
result = requests.get(norm_article_url)

In [102]:
result.text[:1000]

'<!DOCTYPE html>\n<html>\n    <head>\n    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n    <title>Вербиця. Історія одного села із Закерзоння | Історична правда</title>\n    <meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=yes">\n    <link href="https://fonts.googleapis.com/css?family=Fira+Sans:400,400i,700,700i,800,800i&amp;subset=cyrillic,cyrillic-ext,latin-ext" rel="stylesheet">\n<link rel="stylesheet" href="/misc/v2/index.css?v=17">\r\n<link rel="stylesheet" href="/misc/v2/responsive.css?v=1">\n\n\n<meta name="robots" content="index,follow">\n\n\n<meta name="description" content="">\n<meta name="keywords" content="">\n<meta name="document-state" content="state">\n<meta name="revisit-after" content="1 days">\n\n<script src="/misc/v2/jquery-1.9.1.min.js"></script>\n\n<script src="/misc/v2/common.js?v=1"></script>\n\n\n<meta property="og:title" content="Вербиця. Історія одного села із Закерзоння"/>\n<meta property

The result is quite messy! Let's make this more readable:

In [103]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(result.text, 'html.parser')

In [104]:
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   Вербиця. Історія одного села із Закерзоння | Історична правда
  </title>
  <meta content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=yes" name="viewport"/>
  <link href="https://fonts.googleapis.com/css?family=Fira+Sans:400,400i,700,700i,800,800i&amp;subset=cyrillic,cyrillic-ext,latin-ext" rel="stylesheet"/>
  <link href="/misc/v2/index.css?v=17" rel="stylesheet"/>
  <link href="/misc/v2/responsive.css?v=1" rel="stylesheet"/>
  <meta content="index,follow" name="robots"/>
  <meta content="" name="description"/>
  <meta content="" name="keywords"/>
  <meta content="state" name="document-state"/>
  <meta content="1 days" name="revisit-after"/>
  <script src="/misc/v2/jquery-1.9.1.min.js">
  </script>
  <script src="/misc/v2/common.js?v=1">
  </script>
  <meta content="Вербиця. Історія одного села із Закерзоння" property="og:title">
   <meta content="art

The function prettify() makes the HTML more readable. However, we will not use this output directly to locate the relevant data.

Let's define a function that requests and parses an HTML web page, as we will need this a lot during this tutorial:

In [128]:
def getAndParseURL(url):
    # download the page and parse its HTML into a BeautifulSoup tree
    result = requests.get(url)
    soup = BeautifulSoup(result.text, 'html.parser')
    return soup

In [140]:
soup.find_all("div", "image-box")

[<div class="image-box image-box_center" style="max-width: 660px;">
 <img alt="" height="416" src="/images/doc/5/c/5c8995f-1111.jpg" title="" width="660"/>
 </div>,
 <div class="image-box image-box_center" style="max-width: 660px;"><img alt=" " src="/images/doc/0/8/0818559-roodl.jpg">
 <div class="image-box__caption">
 <p style="text-align: center;">Схема Вербиці. Зроблена колишнім жителем.</p>
 </div>
 <div class="image-box__author"></div>
 </img></div>,
 <div class="image-box image-box_center" style="max-width: 660px;"><img alt=" " src="/images/doc/6/d/6da61c9---ubijovicha.mapa--.png">
 <div class="image-box__caption">
 <p style="text-align: center;">Етнічний склад регіону згідно мапи В Кубійовича</p>
 </div>
 <div class="image-box__author" style="text-align: center;">В. Кубійович Етнічні групи південозахідної України (Галичини) на 01.01.1939</div>
 </img></div>,
 <div class="image-box image-box_left" style="max-width: 200px;"><img alt=" " src="/images/doc/c/9/c9ef5c1-rodovi-gerbi-ro

In [127]:
soup.find("div", class_ = "image-box image-box_center")

<div class="image-box image-box_center" style="max-width: 660px;">
<img alt="" height="416" src="/images/doc/5/c/5c8995f-1111.jpg" title="" width="660"/>
</div>
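
The two calls above match the class attribute differently. As a minimal sketch on a made-up HTML snippet (the file names are placeholders and Python 3 is assumed), a single class name matches any tag that carries that class, whereas the full string "image-box image-box_center" only matches tags whose class attribute is exactly that string, in that order. A CSS selector is one way to require both classes regardless of order:

```python
from bs4 import BeautifulSoup

# hypothetical markup, loosely modelled on the image boxes shown above
sample_html = """
<div class="image-box image-box_center"><img src="/a.jpg"/></div>
<div class="image-box image-box_left"><img src="/b.jpg"/></div>
<div class="image-box_center image-box"><img src="/c.jpg"/></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# matches every div that has "image-box" among its classes
by_single_class = soup.find_all("div", class_="image-box")

# matches only divs whose class attribute is exactly this string
by_exact_string = soup.find_all("div", class_="image-box image-box_center")

# CSS selector: both classes required, in any order
by_css = soup.select("div.image-box.image-box_center")

print(len(by_single_class), len(by_exact_string), len(by_css))
```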

In [152]:
"caption" in soup.find("div", "image-box")

False

In [166]:
dicc = {}

# lst = [x.img.get("src") for x in soup.find_all("div", class_ = "image-box")]
lst = soup.find_all("div", class_ = "image-box")

print(type(lst[1]))

for x in lst:
    # note: find('image-box__caption') looks for a tag *named* 'image-box__caption',
    # not for a class, so this test is never true and every box prints "no"
    if x.find('image-box__caption'):
        print("yes")
    else:
        print("no")
# print("image-box__caption" in x)

# for i in range(len(soup.find_all("div", class_ = "image-box__caption"))):
#     dicc[lst[i]] = soup.find_all("div", class_ = "image-box__caption")[i].tbody.tr.td.img.get('src')

# # main_page_products_urls = [[x.strong.text for x in soup.find_all("td", class_ = "tb_text")], [x.tbody.tr.td.img.get('src') for x in soup.find_all("table", class_ = "tb_center")]]
# print(dicc)


<class 'bs4.element.Tag'>
no
no
no
no
no
no
no
no
no
no
no
no
no
no
no
no
no
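
The loop above prints "no" for every box because `find('image-box__caption')` searches for a tag with that name rather than for a class. One possible way to pair each image with its caption is to search by class inside each box. The sketch below runs on a made-up snippet; the helper name `image_captions` and the sample markup are assumptions for illustration, not part of the original notebook:

```python
from bs4 import BeautifulSoup

# hypothetical markup: one image without a caption, one with a caption
sample_html = """
<div class="image-box image-box_center"><img src="/images/doc/a.jpg"/></div>
<div class="image-box image-box_center"><img src="/images/doc/b.jpg"/>
  <div class="image-box__caption"><p>Sample caption</p></div>
</div>
"""

def image_captions(html):
    """Map each image src to its caption text, or None when there is no caption."""
    soup = BeautifulSoup(html, "html.parser")
    captions = {}
    for box in soup.find_all("div", class_="image-box"):
        img = box.find("img")
        if img is None:
            continue  # skip boxes that contain no image
        # search by class, not by tag name
        caption = box.find("div", class_="image-box__caption")
        captions[img.get("src")] = caption.get_text(strip=True) if caption else None
    return captions

print(image_captions(sample_html))
```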


The find_all() function is very handy for finding all the values at once, but you have to check that everything it collects is relevant. Sometimes the same tag can contain completely different data. That is why it is important to be as specific as possible when choosing the tags. Here we decided to rely on the tag 'article' with the 'product_pod' class because it is very specific and unlikely to contain anything other than product data.

The previous URLs are relative to the main page. To make them complete, we just need to prepend the URL of the main page: http://books.toscrape.com/index.html (after removing the index.html part).

Now let's use this to define a function to retrieve book links on any given page of the website:

In [55]:
def getBooksURLs(url):
    soup = getAndParseURL(url)
    # each book sits in an <article class="product_pod"> whose first link points to the book page
    # remove the index.html part of the base url before returning the results
    return ["/".join(url.split("/")[:-1]) + "/" + x.div.a.get('href') for x in soup.find_all("article", class_ = "product_pod")]
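
As an alternative to the manual string splitting used in getBooksURLs, the standard library can resolve a relative link against the page it was found on. The sketch below assumes Python 3 (`urllib.parse`); the helper name `to_absolute` is made up for illustration:

```python
from urllib.parse import urljoin

def to_absolute(page_url, relative_urls):
    """Resolve each relative link against the URL of the page it came from."""
    return [urljoin(page_url, rel) for rel in relative_urls]

# a link found on the main page
from_index = to_absolute(
    "http://books.toscrape.com/index.html",
    ["catalogue/a-light-in-the-attic_1000/index.html"],
)

# a link found on a catalogue page
from_page = to_absolute(
    "http://books.toscrape.com/catalogue/page-2.html",
    ["page-3.html"],
)

print(from_index)
print(from_page)
```

urljoin also handles links that start with a slash, which avoids producing a double slash when the base URL already ends with one.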

## Find book categories URLs on the main page

Now let's try retrieving the URLs corresponding to the different product categories:

<img src="images/inspect2.png">

By inspecting, we can see that they follow the same URL pattern: 'catalogue/category/books'. 

We can tell BeautifulSoup to match the URLs that contain this pattern in order to easily retrieve the category URLs:

In [57]:
import re

categories_urls = [main_url + x.get('src') for x in soup.find_all("img", src=re.compile("images/doc"))]
categories_urls = categories_urls[1:] # we remove the first one because it corresponds to all the books

print(str(len(categories_urls)) + " fetched categories URLs")
print("Some examples:")
categories_urls[:5]

14 fetched categories URLs
Some examples:


['http://www.istpravda.com.ua//images/doc/4/9/49ba80c-1.jpg',
 'http://www.istpravda.com.ua//images/doc/b/0/b04be88-4.jpg',
 'http://www.istpravda.com.ua//images/doc/4/a/4a45cdd-6.jpg',
 'http://www.istpravda.com.ua//images/doc/3/7/37edc63-7.jpg',
 'http://www.istpravda.com.ua//images/doc/a/b/ab5240e-3.jpg']
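
For reference, here is a small sketch of the pattern-matching idea described in this section, applied to links whose href contains 'catalogue/category/books'. It runs on a made-up navigation snippet rather than the live site, so the markup and the variable names are assumptions (Python 3 is assumed as well):

```python
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "http://books.toscrape.com/index.html"

# hypothetical sidebar markup; the first link stands for "all books"
sample_nav = """
<ul>
  <li><a href="catalogue/category/books_1/index.html">Books</a></li>
  <li><a href="catalogue/category/books/poetry_23/index.html">Poetry</a></li>
  <li><a href="catalogue/category/books/mystery_3/index.html">Mystery</a></li>
  <li><a href="catalogue/page-2.html">next</a></li>
</ul>
"""

soup = BeautifulSoup(sample_nav, "html.parser")

# keep only the links whose href contains the category pattern
links = soup.find_all("a", href=re.compile("catalogue/category/books"))
category_urls = [urljoin(base_url, a.get("href")) for a in links]

# drop the first match, which corresponds to all the books
category_urls = category_urls[1:]

print(category_urls)
```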

## Scrape all books data

For the last part of this tutorial, we will finally tackle our main objective: gather data about all the books of the website.

We know how to get the links of the books within a given page. If all the books were displayed on the same page this would be easy. However, this is unlikely, as displaying the whole catalog on a single page is not very user friendly.

Usually products are displayed on multiple pages, or on one page through scrolling. We can see here at the bottom of the main page that there are 50 product pages and a 'next' button to access the next product page.

<img src="images/next.png">

On the following pages there is also a 'previous' button to go back to the previous product page.

<img src="images/previous_next.png">

### Get all pages URLs

In order to fetch all the products URLs, we need to be able to get through all the pages. To do so, we can go iteratively through all the 'next' buttons.

<img src="images/next_inspect.png">

The 'next' button contains the pattern 'page'. We can use this to retrieve the URLs of the next pages. But let's be careful: the 'previous' button also contains this pattern!

If we have two results when matching with 'page', we should take the second one as it will correspond to the next page. For the first and the last pages we will have only one result because we will have either the 'next' button or the 'previous' button.

In [59]:
# store all the results into a list, starting from the main page of the book store
pages_urls = ["http://books.toscrape.com/index.html"]

soup = getAndParseURL(pages_urls[0])

# while we get two matches, this means that the web page contains a 'previous' and a 'next' button
# if there is only one button, this means that we are either on the first page or on the last page
# we stop when we get to the last page

while len(soup.find_all("a", href=re.compile("page"))) == 2 or len(pages_urls) == 1:

    # get the new complete url by adding the fetched URL to the base URL (and removing the .html part of the base URL)
    new_url = "/".join(pages_urls[-1].split("/")[:-1]) + "/" + soup.find_all("a", href=re.compile("page"))[-1].get("href")

    # add the URL to the list
    pages_urls.append(new_url)

    # parse the next page
    soup = getAndParseURL(new_url)

In [14]:
print(str(len(pages_urls)) + " fetched URLs")
print("Some examples:")
pages_urls[:5]

50 fetched URLs
Some examples:


['http://books.toscrape.com/index.html',
 u'http://books.toscrape.com/catalogue/page-2.html',
 u'http://books.toscrape.com/catalogue/page-3.html',
 u'http://books.toscrape.com/catalogue/page-4.html',
 u'http://books.toscrape.com/catalogue/page-5.html']
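
Counting the links that match 'page' works, but it relies on the first and last pages having exactly one match. A variant, sketched below, follows only the 'next' button until it disappears. It assumes that the button is an `<li class="next">` element wrapping the link, and it takes the page-downloading function as a parameter so that it can be tried on a fake three-page site. The names `collect_pages` and `fake_site` are made up for illustration, and Python 3 is assumed:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def collect_pages(start_url, fetch_html, max_pages=1000):
    """Follow the 'next' link from page to page and return every page URL visited."""
    pages = [start_url]
    while len(pages) < max_pages:  # safety cap against endless loops
        soup = BeautifulSoup(fetch_html(pages[-1]), "html.parser")
        next_item = soup.find("li", class_="next")  # assumed markup of the 'next' button
        if next_item is None or next_item.a is None:
            break  # no 'next' button: this is the last page
        pages.append(urljoin(pages[-1], next_item.a.get("href")))
    return pages

# a fake three-page site used in place of real HTTP requests
fake_site = {
    "http://example.test/index.html":
        '<ul><li class="next"><a href="catalogue/page-2.html">next</a></li></ul>',
    "http://example.test/catalogue/page-2.html":
        '<ul><li class="previous"><a href="page-1.html">previous</a></li>'
        '<li class="next"><a href="page-3.html">next</a></li></ul>',
    "http://example.test/catalogue/page-3.html":
        '<ul><li class="previous"><a href="page-2.html">previous</a></li></ul>',
}

pages = collect_pages("http://example.test/index.html", fake_site.__getitem__)
print(pages)
```

Against a real site, `fetch_html` could be something like `lambda url: requests.get(url).text`.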

We successfully managed to get the URLs of the 50 pages. What is interesting here is that these URLs are highly predictable. We could have just created this list by incrementing 'page-X.html' up to 50.

This solution could work for this exact example but would no longer work if the number of pages changed (e.g. if the website decided to display more products per page, or if the catalog changed).

One solution could be to increment the value until we get on a 404 page.

<img src="images/404.png">

Here we can see that trying to go to the 51st page does get us a 404 error.

Fortunately, the result of a request has a very useful attribute that gives us the status code of the HTTP request.

In [15]:
result = requests.get("http://books.toscrape.com/catalogue/page-50.html")
print("status code for page 50: " + str(result.status_code))

result = requests.get("http://books.toscrape.com/catalogue/page-51.html")
print("status code for page 51: " + str(result.status_code))

status code for page 50: 200
status code for page 51: 404


The 200 code indicates that there is no error. The 404 code tells us that the page was not found.

We can use this information to get all our pages URLs: we should iterate until we get a 404 code.

Let's try this method now:

In [16]:
pages_urls = []

new_page = "http://books.toscrape.com/catalogue/page-1.html"
while requests.get(new_page).status_code == 200:
    pages_urls.append(new_page)
    new_page = pages_urls[-1].split("-")[0] + "-" + str(int(pages_urls[-1].split("-")[1].split(".")[0]) + 1) + ".html"

In [17]:
print(str(len(pages_urls)) + " fetched URLs")
print("Some examples:")
pages_urls[:5]

50 fetched URLs
Some examples:


['http://books.toscrape.com/catalogue/page-1.html',
 'http://books.toscrape.com/catalogue/page-2.html',
 'http://books.toscrape.com/catalogue/page-3.html',
 'http://books.toscrape.com/catalogue/page-4.html',
 'http://books.toscrape.com/catalogue/page-5.html']

We managed to obtain the same URLs using this simpler method!
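
The loop above rebuilds the page number by splitting the previous URL. A slightly simpler variant, sketched here, keeps an explicit counter and stops at the first status code that is not 200. The function name `enumerate_pages`, the URL template and the fake status function are assumptions for illustration, so that the sketch runs without network access:

```python
def enumerate_pages(url_template, get_status, first_page=1, max_pages=1000):
    """Build page URLs from a counter until a request does not return 200."""
    urls = []
    for page in range(first_page, first_page + max_pages):
        url = url_template.format(page)
        if get_status(url) != 200:
            break  # first missing page: stop here
        urls.append(url)
    return urls

def fake_status(url):
    # pretend the site has exactly three pages
    existing = {"http://example.test/catalogue/page-{}.html".format(n) for n in (1, 2, 3)}
    return 200 if url in existing else 404

pages = enumerate_pages("http://example.test/catalogue/page-{}.html", fake_status)
print(pages)
```

With requests, `get_status` could be `lambda url: requests.get(url).status_code`. Note that this approach stops on any non-200 answer, so a temporary server error would end the enumeration early.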

### Get all products URLs

The next step consists of fetching the URLs of all the products on every page. This step is quite simple as we already have the list of all pages and the function to get product URLs from a page.

Let's iterate through the pages and apply our function:

In [18]:
booksURLs = []
for page in pages_urls:
    booksURLs.extend(getBooksURLs(page))

In [19]:
print(str(len(booksURLs)) + " fetched URLs")
print("Some examples:")
booksURLs[:5]

1000 fetched URLs
Some examples:


[u'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 u'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
 u'http://books.toscrape.com/catalogue/soumission_998/index.html',
 u'http://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 u'http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html']

We finally got the 1000 book URLs. This corresponds to the number indicated on the website!

### Get product data

The last step consists of scraping the data for each product. Let's first explore how the information is structured on the product pages:

<img src="images/product_inspect.png">

We can easily retrieve a lot of information for every book:
- book title
- price
- availability
- image
- category
- rating

Let's do it!

In [20]:
%%time

names = []
prices = []
nb_in_stock = []
img_urls = []
categories = []
ratings = []

# scrape data for every book URL: this may take some time
for url in booksURLs:
    soup = getAndParseURL(url)
    # product name
    names.append(soup.find("div", class_ = re.compile("product_main")).h1.text)
    # product price
    prices.append(soup.find("p", class_ = "price_color").text[2:]) # get rid of the pound sign
    # number of available products
    nb_in_stock.append(re.sub("[^0-9]", "", soup.find("p", class_ = "instock availability").text)) # get rid of non numerical characters
    # image url
    img_urls.append(url.replace("index.html", "") + soup.find("img").get("src"))
    # product category
    categories.append(soup.find("a", href = re.compile("../category/books/")).get("href").split("/")[3])
    # ratings
    ratings.append(soup.find("p", class_ = re.compile("star-rating")).get("class")[1])

CPU times: user 45.4 s, sys: 868 ms, total: 46.3 s
Wall time: 2min 30s
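
To experiment with the extraction logic without downloading 1000 pages, the same lookups can be wrapped in a function and run on a single HTML string. The sketch below uses a made-up product page that imitates the structure targeted by the loop above; the markup, the function name `parse_product` and the returned dictionary layout are assumptions, not the site's actual HTML:

```python
import re
from bs4 import BeautifulSoup

# hypothetical product page, reduced to the elements used by the scraper
sample_product = """
<ul class="breadcrumb">
  <li><a href="../../index.html">Home</a></li>
  <li><a href="../category/books_1/index.html">Books</a></li>
  <li><a href="../category/books/poetry_23/index.html">Poetry</a></li>
</ul>
<div class="item active"><img src="../../media/cache/sample.jpg" alt="cover"/></div>
<div class="col-sm-6 product_main">
  <h1>A Light in the Attic</h1>
  <p class="price_color">£51.77</p>
  <p class="instock availability">In stock (22 available)</p>
  <p class="star-rating Three"></p>
</div>
"""

def parse_product(html):
    """Extract the fields gathered in this tutorial from one product page."""
    soup = BeautifulSoup(html, "html.parser")
    price_text = soup.find("p", class_="price_color").get_text()
    stock_text = soup.find("p", class_="instock availability").get_text()
    category_link = soup.find("a", href=re.compile(r"\.\./category/books/"))
    return {
        "name": soup.find("div", class_=re.compile("product_main")).h1.get_text(strip=True),
        # keep only the numeric part, whatever currency symbol precedes it
        "price": float(re.search(r"\d+(?:\.\d+)?", price_text).group()),
        "nb_in_stock": int(re.sub(r"[^0-9]", "", stock_text)),
        "url_img": soup.find("img").get("src"),
        "product_category": category_link.get("href").split("/")[3],
        "rating": soup.find("p", class_=re.compile("star-rating")).get("class")[1],
    }

product = parse_product(sample_product)
print(product)
```

Extracting the price with a regular expression instead of slicing a fixed number of characters keeps the result independent of how the pound sign was decoded.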


In [21]:
# add data into pandas df
import pandas as pd

scraped_data = pd.DataFrame({'name': names, 'price': prices, 'nb_in_stock': nb_in_stock, "url_img": img_urls, "product_category": categories, "rating": ratings})
scraped_data.head()

| | name | nb_in_stock | price | product_category | rating | url_img |
|---|---|---|---|---|---|---|
| 0 | A Light in the Attic | 22 | 51.77 | poetry_23 | Three | http://books.toscrape.com/catalogue/a-light-in... |
| 1 | Tipping the Velvet | 20 | 53.74 | historical-fiction_4 | One | http://books.toscrape.com/catalogue/tipping-th... |
| 2 | Soumission | 20 | 50.1 | fiction_10 | One | http://books.toscrape.com/catalogue/soumission... |
| 3 | Sharp Objects | 20 | 47.82 | mystery_3 | Four | http://books.toscrape.com/catalogue/sharp-obje... |
| 4 | Sapiens: A Brief History of Humankind | 20 | 54.23 | history_32 | Five | http://books.toscrape.com/catalogue/sapiens-a-... |


We got our data: our web scraping experiment is a success. 

Some data cleaning may be useful before using the data:
- transform the ratings into numerical values
- remove the numbers in the product_category column
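
The two cleaning steps listed above could look like the following sketch. The helper names `rating_to_int` and `clean_category` are made up for illustration, and the mapping assumes that ratings are spelled 'One' to 'Five' as in the table above:

```python
import re

# assumed spelling of the ratings, based on the values seen in the scraped data
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def rating_to_int(word):
    """Turn a rating such as 'Three' into 3; return None for unknown values."""
    return RATING_WORDS.get(word)

def clean_category(raw):
    """Remove the trailing '_<number>' suffix, e.g. 'poetry_23' -> 'poetry'."""
    return re.sub(r"_\d+$", "", raw)

print(rating_to_int("Three"), clean_category("historical-fiction_4"))
```

With the DataFrame built earlier, these helpers could be applied column by column, for example `scraped_data["rating"].map(rating_to_int)` and `scraped_data["product_category"].map(clean_category)`.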

## Wrap up

We have seen how to navigate websites and gather data from each web page using automated web scrapers. One key to building efficient web scrapers is understanding the structure of the website you want to scrape. This also means that you will probably have to maintain your scraper if you want it to remain useful after website updates.

This book store website was an easy example, but in real life you may have to deal with more complex websites that render some of their content using JavaScript. You may want to use a browser automator like Selenium for this kind of task (https://www.seleniumhq.org/).