# First Web Scraping Project

For this beginner's web scraping project, we will be using http://books.toscrape.com, a simplified mock book webshop made available for scraping practice. It is ideal for people who are learning about web scraping and want to apply what they learn. For this project, we will be using `BeautifulSoup`. We start off by importing the necessary packages.

In [1]:
import numpy as np
import pandas as pd
import re
import requests
import time
from bs4 import BeautifulSoup

Since we are going to scrape quite a few pages, we create a function that fetches a page and returns its parsed HTML.

In [2]:
def scraper(url):
    response = requests.get(url)
    return BeautifulSoup(response.text, 'html.parser')
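
As an optional variation, the sketch below adds a timeout and an HTTP status check to the same idea. The name `scraper_checked`, the `timeout` default and the injectable `get` parameter are assumptions made for illustration; they are not part of the notebook, which keeps using `scraper`.

```python
import requests
from bs4 import BeautifulSoup


def scraper_checked(url, timeout=10, get=requests.get):
    """Fetch a page, fail loudly on HTTP errors, and return the parsed HTML.

    `get` is a hypothetical hook so that the function can be tried out
    without network access; by default it is simply requests.get.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return BeautifulSoup(response.text, 'html.parser')
```

With this variant, a broken URL raises an exception right away instead of quietly returning the parsed error page.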

Let's begin by examining the home page. We will look at both the rendered page and the HTML source.

In [3]:
url = 'http://books.toscrape.com'

In [4]:
soup = scraper(url)
print(soup)

<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="s

The home page shows all books uncategorized. We could run our scraping from this page, but since this is a learning project, we will be doing some extra and inefficient things as an exercise. We create a dataframe containing all categories shown in the menu.

In [5]:
all_a = soup.find_all('a')
all_a

[<a href="index.html">Books to Scrape</a>,
 <a href="index.html">Home</a>,
 <a href="catalogue/category/books_1/index.html">
                             
                                 Books
                             
                         </a>,
 <a href="catalogue/category/books/travel_2/index.html">
                             
                                 Travel
                             
                         </a>,
 <a href="catalogue/category/books/mystery_3/index.html">
                             
                                 Mystery
                             
                         </a>,
 <a href="catalogue/category/books/historical-fiction_4/index.html">
                             
                                 Historical Fiction
                             
                         </a>,
 <a href="catalogue/category/books/sequential-art_5/index.html">
                             
                                 Sequential Art
            

We see a bunch of URLs. We are only interested in the categories. We will have to filter out the links to the individual pages of the books and other pages linked like the home page.

In [6]:
text_to_select = 'catalogue/category/books/'
cat_name = [text.string.strip() for text in all_a if text_to_select in text.get('href')]

We will also save the URLs of the category index pages. We assign a name code to every category, since the names contain spaces and uppercase letters. We could use the URL fragments, but they contain numbers and a mix of underscores and dashes. The name code may come in handy later on.

In [7]:
cat_urlcode = [text.get('href').split('/books/')[1].split('/')[0] for text in all_a if text_to_select in text.get('href')]
cat_namecode = [name.split('_')[0].replace('-', '_') for name in cat_urlcode]
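
To make the two string manipulations concrete, here is a small standalone sketch that applies them to a single `href` taken from the output above. The helper names `url_code` and `name_code` are hypothetical and only serve this illustration.

```python
def url_code(href):
    """Return the path segment that follows '/books/', e.g. 'travel_2'."""
    return href.split('/books/')[1].split('/')[0]


def name_code(code):
    """Drop the numeric suffix and replace dashes with underscores."""
    return code.split('_')[0].replace('-', '_')


href = 'catalogue/category/books/historical-fiction_4/index.html'
code = url_code(href)
print(code)             # historical-fiction_4
print(name_code(code))  # historical_fiction
```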

We can now combine the extracted information in a structured dataframe.

In [8]:
Categories = pd.DataFrame({
    'Name': cat_name,
    'Code': cat_namecode,
    'URL': cat_urlcode
})

Each category has a number of books. This number is displayed at the top of the category's index page. We have to figure out how to extract that number from the script. Let's see what the script looks like.

In [9]:
url = f'http://books.toscrape.com/catalogue/category/books/{Categories.URL[0]}/index.html'
scraper(url)


<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    Travel | 
     Books to Scrape - Sandbox

</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="
    
" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="../../../../static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="../../../../static/oscar/css/styles.css" rel="stylesheet"

The number is found in a `strong` tag.

In [10]:
soup.find_all('strong')

[<strong>1000</strong>,
 <strong>1</strong>,
 <strong>20</strong>,

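Note that `soup` still holds the home page here, because the previous cell did not reassign it; that is why the list starts with the total of 1000 books. On the category pages, the number we need is the second element of the list. We can now extract the category size in a for loop.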

In [11]:
cat_size = []

for i in range(Categories.shape[0]):
    url = f'http://books.toscrape.com/catalogue/category/books/{Categories.URL[i]}/index.html'
    cat_size.append(scraper(url).find_all('strong')[1].text)

A new column is added to the dataframe `Categories` containing the category sizes. The column type is converted into `int64`. The number of pages is computed by dividing the number of books by 20 and rounding the result up. The number 20 comes from the fact that the pages show no more than 20 books.

In [12]:
Categories['Size'] = cat_size
Categories['Size'] = Categories['Size'].astype('int64')

In [13]:
Categories['Pages'] = np.ceil(Categories['Size']/20).astype('int64')
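
The same rounding-up can be done with plain integer arithmetic, which may help to see what `np.ceil` does here. The function `pages_needed` and its `per_page` parameter are names chosen for this sketch only.

```python
def pages_needed(size, per_page=20):
    """Number of listing pages needed to show `size` books."""
    # Negating before and after floor division rounds up instead of down.
    return -(-size // per_page)


for size in (11, 20, 32, 75):
    print(size, pages_needed(size))
```

For the sizes 11, 32 and 75 this gives 1, 2 and 4 pages, matching the `Pages` column in the dataframe.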

This is the finished dataframe `Categories` containing the information that we might need later on:

In [14]:
Categories

| | Name | Code | URL | Size | Pages |
|---|---|---|---|---|---|
| 0 | Travel | travel | travel_2 | 11 | 1 |
| 1 | Mystery | mystery | mystery_3 | 32 | 2 |
| 2 | Historical Fiction | historical_fiction | historical-fiction_4 | 26 | 2 |
| 3 | Sequential Art | sequential_art | sequential-art_5 | 75 | 4 |
| 4 | Classics | classics | classics_6 | 19 | 1 |
| 5 | Philosophy | philosophy | philosophy_7 | 11 | 1 |
| 6 | Romance | romance | romance_8 | 35 | 2 |
| 7 | Womens Fiction | womens_fiction | womens-fiction_9 | 17 | 1 |
| 8 | Fiction | fiction | fiction_10 | 65 | 4 |
| 9 | Childrens | childrens | childrens_11 | 29 | 2 |


As mentioned before, we could go straight to the individual book pages and gather all the data we want, but we will go through the catalog category by category. We extract the available data from the pages that list the books for each category. The goal is to end up with a dataframe containing the title, star rating, price, whether the book is in stock, the book's individual URL and the category.

In [15]:
# initialize lists (these will be added to in the loop below using the function .extend())
title = []
stars = []
price = []
in_stock = []
book_url = []
category = []

# loop through all categories
for i in range(Categories.shape[0]):
    # extract title, star rating, price and stock information from the index page (notice that URL is not yet included here, see below why)
    url = f'http://books.toscrape.com/catalogue/category/books/{Categories.URL[i]}/index.html'
    soup = scraper(url)
    title.extend([text.get('alt') for text in soup.find_all('img')])
    stars.extend([soup.find_all('p')[i].get('class')[1] for i in range(0, len(soup.find_all('p')), 3)])
    price.extend([soup.find_all('p')[i].text.strip() for i in range(1, len(soup.find_all('p')), 3)])
    in_stock.extend([soup.find_all('p')[i].text.strip() for i in range(2, len(soup.find_all('p')), 3)])

    # categories that have more than one page have different looking pages since they have next/previous buttons at the bottom,
    # which add elements to the list resulting from << soup.find_all('a') >>. this is dealt with using if/else statements
    if Categories.Pages[i] > 1: # category in loop has more than one page
        # first page: 1 element is added to list resulting from << soup.find_all('a') >>: next button
        # also, notice that we start the slice at index 54. that's because all the elements before are other parts of the page
        book_url.extend([f"http://books.toscrape.com/catalogue{url.get('href').rsplit('..', 1)[1]}" for url in soup.find_all('a')[54:-1:2]])

        # loop through pages 2 and up
        for p in range(1, Categories.Pages[i]):
            # extract title, star rating, price and stock information from the index page
            url = f'http://books.toscrape.com/catalogue/category/books/{Categories.URL[i]}/page-{p+1}.html'
            soup = scraper(url)
            title.extend([text.get('alt') for text in soup.find_all('img')])
            stars.extend([soup.find_all('p')[i].get('class')[1] for i in range(0, len(soup.find_all('p')), 3)])
            price.extend([soup.find_all('p')[i].text.strip() for i in range(1, len(soup.find_all('p')), 3)])
            in_stock.extend([soup.find_all('p')[i].text.strip() for i in range(2, len(soup.find_all('p')), 3)])

            # again, categories that have more than two pages have different looking middle pages (so different from the first and last page)
            if p + 1 == Categories.Pages[i]: # page in the loop is the final page (p + 1 is the page number)
                                             # (1 element is added to list resulting from << soup.find_all('a') >>: previous button)
                book_url.extend([f"http://books.toscrape.com/catalogue{url.get('href').rsplit('..', 1)[1]}" for url in soup.find_all('a')[54:-1:2]])
            else: # page in the loop is a middle page (2 elements are added to list resulting from << soup.find_all('a') >>: previous and next button)
                book_url.extend([f"http://books.toscrape.com/catalogue{url.get('href').rsplit('..', 1)[1]}" for url in soup.find_all('a')[54:-2:2]])
    else: # category in loop only has one page (the index page)
        book_url.extend([f"http://books.toscrape.com/catalogue{url.get('href').rsplit('..', 1)[1]}" for url in soup.find_all('a')[54::2]])

    # add category corresponding to the book
    category.extend([Categories.Name[i]] * Categories.Size[i])
    
    # being friendly to the server
    time.sleep(1.5)
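
The index-based slicing above depends on the exact number of links and `p` tags on each page. A possibly more robust alternative is to select each book's container and read its fields by class. The sketch below runs on a hand-written HTML snippet: the `article.product_pod` container and the snippet itself are assumptions about the listing markup, and `parse_listing` is a hypothetical helper, not the method used in this notebook.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hand-written sample, assumed to resemble one entry on a category page.
SAMPLE = """
<article class="product_pod">
  <div class="image_container">
    <a href="../../../its-only-the-himalayas_981/index.html">
      <img alt="It's Only the Himalayas" src="cover.jpg"/>
    </a>
  </div>
  <p class="star-rating Two"></p>
  <p class="price_color">£45.17</p>
  <p class="instock availability">
      In stock
  </p>
</article>
"""


def parse_listing(html, page_url):
    """Return one dict per book found in the listing HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for pod in soup.select('article.product_pod'):
        rows.append({
            'Title': pod.find('img').get('alt'),
            # the class attribute is a list such as ['star-rating', 'Two']
            'Stars': pod.find('p', class_='star-rating').get('class')[1],
            'Price': pod.find('p', class_='price_color').text.strip(),
            'In_Stock': pod.find('p', class_='instock').text.strip(),
            # resolve the relative link against the URL of the listing page
            'URL': urljoin(page_url, pod.find('a').get('href')),
        })
    return rows


page = 'http://books.toscrape.com/catalogue/category/books/travel_2/index.html'
print(parse_listing(SAMPLE, page))
```

Because every field is read from inside its own container, next/previous buttons and sidebar links no longer shift the results, and `urljoin` takes care of the `../` segments in the relative links.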

We combine all the scraped data in a dataframe named `Catalog`.

In [16]:
Catalog = pd.DataFrame({
    'Title': title,
    'Stars': stars,
    'Price': price,
    'In_Stock': in_stock,
    'URL': book_url,
    'Category': category
})

In [17]:
Catalog.head()

| | Title | Stars | Price | In_Stock | URL | Category |
|---|---|---|---|---|---|---|
| 0 | It's Only the Himalayas | Two | Â£45.17 | In stock | http://books.toscrape.com/catalogue/its-only-t... | Travel |
| 1 | Full Moon over Noahâs Ark: An Odyssey to Mou... | Four | Â£49.43 | In stock | http://books.toscrape.com/catalogue/full-moon-... | Travel |
| 2 | See America: A Celebration of Our National Par... | Three | Â£48.87 | In stock | http://books.toscrape.com/catalogue/see-americ... | Travel |
| 3 | Vagabonding: An Uncommon Guide to the Art of L... | Two | Â£36.94 | In stock | http://books.toscrape.com/catalogue/vagabondin... | Travel |
| 4 | Under the Tuscan Sun | Three | Â£37.33 | In stock | http://books.toscrape.com/catalogue/under-the-... | Travel |


Notice that the price is a string, not a float. Because of the encoding issue, the currency sign shows up as two characters (`Â£`), so we drop the first two characters and convert the prices into floats.

In [18]:
Catalog['Price'] = [price[2:] for price in Catalog.Price]
Catalog = Catalog.astype({'Price':'float'})
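
Slicing off a fixed number of characters only works as long as the prefix is always two characters long. A sketch of a less position-dependent approach uses a regular expression to pick out the number; `parse_price` is a hypothetical helper and not part of the notebook.

```python
import re


def parse_price(text):
    """Extract the first decimal number from a price string as a float."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        raise ValueError(f'no number found in {text!r}')
    return float(match.group())


print(parse_price('Â£45.17'))  # garbled currency sign, as in the output above
print(parse_price('£45.17'))   # correctly decoded currency sign
```

Both calls return the same value, so the conversion keeps working whether or not the encoding is fixed upstream.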

Another thing is that the star ratings are written out. It would be more convenient if they were just numerical.

In [19]:
pd.unique(Catalog.Stars)

array(['Two', 'Four', 'Three', 'One', 'Five'], dtype=object)

In [20]:
nmbrs = ['One', 'Two', 'Three', 'Four', 'Five']

for i in range(5):
    Catalog.loc[Catalog['Stars'] == nmbrs[i], 'Stars'] = i + 1

Catalog = Catalog.astype({'Stars':'int64'})
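
An alternative to the loop is a dictionary lookup with `Series.map`. The sketch below uses a small stand-in series rather than the real `Catalog`; the names `star_map` and `stars` are assumptions for illustration.

```python
import pandas as pd

# Mapping from the written-out rating to its numeric value.
star_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

# Stand-in for Catalog['Stars'].
stars = pd.Series(['Two', 'Four', 'Three', 'One', 'Five'])

numeric = stars.map(star_map).astype('int64')
print(numeric.tolist())
```

A value that is missing from the dictionary becomes `NaN`, which makes the `astype('int64')` call fail; that can be a useful signal that an unexpected rating was scraped.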

In [21]:
Catalog.head()

| | Title | Stars | Price | In_Stock | URL | Category |
|---|---|---|---|---|---|---|
| 0 | It's Only the Himalayas | 2 | 45.17 | In stock | http://books.toscrape.com/catalogue/its-only-t... | Travel |
| 1 | Full Moon over Noahâs Ark: An Odyssey to Mou... | 4 | 49.43 | In stock | http://books.toscrape.com/catalogue/full-moon-... | Travel |
| 2 | See America: A Celebration of Our National Par... | 3 | 48.87 | In stock | http://books.toscrape.com/catalogue/see-americ... | Travel |
| 3 | Vagabonding: An Uncommon Guide to the Art of L... | 2 | 36.94 | In stock | http://books.toscrape.com/catalogue/vagabondin... | Travel |
| 4 | Under the Tuscan Sun | 3 | 37.33 | In stock | http://books.toscrape.com/catalogue/under-the-... | Travel |


We will use the book URLs in `Catalog` to scrape each individual page. We first want to take a look at the structure of the HTML script of these pages.

In [22]:
soup = scraper(book_url[0])
soup


<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    It's Only the Himalayas | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="
    âWherever you go, whatever you do, just . . . donât do anything stupid.â âMy MotherDuring her yearlong adventure backpacking from South Africa to Singapore, S. Bedford definitely did a few things her mother might classify as &quot;stupid.&quot; She swam with great white sharks in South Africa, ran from lions in Zimbabwe, climbed a Himalayan mountain without training in Nepal, and wa âWherever you go, whatever you do, jus

There are some elements on this page that we haven't yet added to the dataset but that could be interesting for this project: the product description, the UPC and the number of books still in stock. The page also states the number of reviews, the VAT, the product type, the other viewed products and the warning text. You can even see when the content was created. Since this is a static mock website (showing only one product type: books), we will leave this remaining information out of this beginner project.

The stock size and the description are found in the `p` tags.

In [23]:
soup.find_all('p')

[<p class="price_color">Â£45.17</p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock (19 available)
     
 </p>,
 <p class="star-rating Two">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <!-- <small><a href="/catalogue/its-only-the-himalayas_981/reviews/">
         
                 
                     0 customer reviews
                 
         </a></small>
          --> 
 
 
 <!-- 
     <a id="write_review" href="/catalogue/its-only-the-himalayas_981/reviews/add/#addreview" class="btn btn-success btn-sm">
         Write a review
     </a>
 
  --></p>,
 <p>âWherever you go, whatever you do, just . . . donât do anything stupid.â âMy MotherDuring her yearlong adventure backpacking from South Africa to Singapore, S. Bedford definitely did a few things her mother might classify as "stupid." She swam with great white sharks in South Africa, ran from lion

What we didn't do before, but which helps us now, is make use of the classes of the `p` tags. For example, the stock size is found under the class `instock availability`.

In [24]:
soup.find_all('p', class_='instock availability')

[<p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock (19 available)
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>]

The output shows stock information not just for the book that corresponds to the page, but also for the other books displayed at the bottom of the page. We are only interested in the first element of this list, which is what `find` returns.

In [25]:
re.findall(r'\d+', soup.find('p', class_='instock availability').text)

['19']
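
The other entries in the list above show "In stock" without a number, so a lookup like this can come back empty on some text. The hypothetical helper `stock_count` below is a minimal sketch that returns `None` in that case instead of failing on an empty result.

```python
import re


def stock_count(text):
    """Return the number in '(N available)', or None if it is absent."""
    match = re.search(r'\((\d+) available\)', text)
    return int(match.group(1)) if match else None


print(stock_count('\n    In stock (19 available)\n'))
print(stock_count('\n    In stock\n'))
```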

The description does not have its own class in the `p` tags, but it can also be found in the `meta` tags, where it is easier to find and extract.

In [26]:
soup.find('meta', attrs={'name': 'description'})

<meta content="
    âWherever you go, whatever you do, just . . . donât do anything stupid.â âMy MotherDuring her yearlong adventure backpacking from South Africa to Singapore, S. Bedford definitely did a few things her mother might classify as &quot;stupid.&quot; She swam with great white sharks in South Africa, ran from lions in Zimbabwe, climbed a Himalayan mountain without training in Nepal, and wa âWherever you go, whatever you do, just . . . donât do anything stupid.â âMy MotherDuring her yearlong adventure backpacking from South Africa to Singapore, S. Bedford definitely did a few things her mother might classify as &quot;stupid.&quot; She swam with great white sharks in South Africa, ran from lions in Zimbabwe, climbed a Himalayan mountain without training in Nepal, and watched as her friend was attacked by a monkey in Indonesia.But interspersed in those slightly more crazy moments, Sue Bedfored and her friend &quot;Sara the Stoic&quot; experienced the sights, s

We now need to get the text from the output. 

In [27]:
soup.find('meta', attrs={'name': 'description'}).get('content')

'\n    â\x80\x9cWherever you go, whatever you do, just . . . donâ\x80\x99t do anything stupid.â\x80\x9d â\x80\x94My MotherDuring her yearlong adventure backpacking from South Africa to Singapore, S. Bedford definitely did a few things her mother might classify as "stupid." She swam with great white sharks in South Africa, ran from lions in Zimbabwe, climbed a Himalayan mountain without training in Nepal, and wa â\x80\x9cWherever you go, whatever you do, just . . . donâ\x80\x99t do anything stupid.â\x80\x9d â\x80\x94My MotherDuring her yearlong adventure backpacking from South Africa to Singapore, S. Bedford definitely did a few things her mother might classify as "stupid." She swam with great white sharks in South Africa, ran from lions in Zimbabwe, climbed a Himalayan mountain without training in Nepal, and watched as her friend was attacked by a monkey in Indonesia.But interspersed in those slightly more crazy moments, Sue Bedfored and her friend "Sara the Stoic" experienced the sigh

We could do something about the encoding, but we will leave it as is.
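
For readers who do want to fix it, the garbling has a simple explanation: the page declares `charset=utf-8` in its `meta` tag, but the text was decoded as ISO-8859-1, which is why a curly quote shows up as `â\x80\x9c`. The standalone sketch below reproduces the effect and reverses it. In the scraper itself, setting `response.encoding = 'utf-8'` before reading `response.text` is one possible option, though that is not what this notebook does.

```python
original = '“Wherever you go” costs £45.17'

# Decoding UTF-8 bytes with the wrong codec produces the garbled text.
garbled = original.encode('utf-8').decode('iso-8859-1')

# Reversing the two steps recovers the original string.
repaired = garbled.encode('iso-8859-1').decode('utf-8')

print(repr(garbled))
print(repaired)
```

Note that fixing the encoding at the source would turn `Â£` into a single `£`, so any later step that strips a fixed two-character prefix from the price would have to be adjusted as well.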

The UPC cannot directly be found by searching p tags. It's found in a tr tag.

In [28]:
soup.find_all('tr')[0].text

'\nUPCa22124811bfa8350\n'

It looks like this because it is displayed in a table. On the left side it says 'UPC' and on the right side it shows the UPC. We have to isolate the UPC.

In [29]:
soup.find_all('tr')[0].text[4:-1]

'a22124811bfa8350'
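
Slicing by position works because the label "UPC" is exactly three characters long. If the row consists of a header cell and a data cell, the value can also be read from the data cell directly. The sketch below uses a hand-written row; the `th`/`td` layout is an assumption that is consistent with the text output above, not something shown in the notebook.

```python
from bs4 import BeautifulSoup

# Hand-written sample row, assumed to mirror the first row of the product table.
SAMPLE_ROW = "<table><tr>\n<th>UPC</th><td>a22124811bfa8350</td>\n</tr></table>"

row = BeautifulSoup(SAMPLE_ROW, 'html.parser').find('tr')

label = row.find('th').text.strip()  # left-hand cell
value = row.find('td').text.strip()  # right-hand cell

print(repr(row.text))
print(label, value)
```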

Now that we have figured this all out, we will create a loop that will go through all the pages.

In [30]:
# initialize lists (these will be added to in the loop below using the function .append())
stock = []
description = []
upc = []

# loop through all the books
for url in book_url:
    soup = scraper(url)
    stock.append(re.findall(r'\d+', soup.find('p', class_='instock availability').text)[0])
    description.append(soup.find('meta', attrs={'name': 'description'}).get('content'))
    upc.append(soup.find_all('tr')[0].text[4:-1])
    time.sleep(1.5)

In [31]:
Catalog['Stock_Size'] = stock
Catalog['Description'] = description
Catalog['UPC'] = upc

In [32]:
Catalog = Catalog.astype({'Stock_Size':'int64'})

Recall that we created name codes for the categories. These are essentially unique identifiers. We should also create unique identifiers for the books. This way, a proper database can be built. We will replace `Categories.Code` with `Categories.ID` and we will create `Catalog.ID` for the books.

In [33]:
Categories['ID'] = range(1, Categories.shape[0] + 1)
Catalog['ID'] = range(1, Catalog.shape[0] + 1)

Catalog['Category_ID'] = [0] * Catalog.shape[0]

for i in range(Categories.shape[0]):
    Catalog.loc[Catalog['Category'] == Categories.loc[i, 'Name'], 'Category_ID'] = Categories.loc[i, 'ID']

Categories.drop(columns='Code', inplace=True)
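
The category lookup can also be written without a loop by building a name-to-ID dictionary and passing it to `Series.map`. The sketch below uses two small stand-in dataframes (`categories` and `catalog`, lowercase to keep them apart from the real ones); all values in them are made up for illustration.

```python
import pandas as pd

# Stand-ins for the Categories and Catalog dataframes.
categories = pd.DataFrame({'ID': [1, 2], 'Name': ['Travel', 'Mystery']})
catalog = pd.DataFrame({
    'Title': ['Book A', 'Book B', 'Book C'],
    'Category': ['Travel', 'Mystery', 'Travel'],
})

# Build the lookup once, then translate every category name in one call.
name_to_id = dict(zip(categories['Name'], categories['ID']))
catalog['Category_ID'] = catalog['Category'].map(name_to_id)

print(catalog)
```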

In [34]:
# move the ID column to the front of both dataframes
catcols = list(Categories.columns)
catcols.remove('ID')
Categories = Categories[['ID'] + catcols]

bookcols = list(Catalog.columns)
bookcols.remove('ID')
Catalog = Catalog[['ID'] + bookcols]

In [35]:
Categories.head()

| | ID | Name | URL | Size | Pages |
|---|---|---|---|---|---|
| 0 | 1 | Travel | travel_2 | 11 | 1 |
| 1 | 2 | Mystery | mystery_3 | 32 | 2 |
| 2 | 3 | Historical Fiction | historical-fiction_4 | 26 | 2 |
| 3 | 4 | Sequential Art | sequential-art_5 | 75 | 4 |
| 4 | 5 | Classics | classics_6 | 19 | 1 |


In [36]:
Catalog.head()

| | ID | Title | Stars | Price | In_Stock | URL | Category | Stock_Size | Description | UPC | Category_ID |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | It's Only the Himalayas | 2 | 45.17 | In stock | http://books.toscrape.com/catalogue/its-only-t... | Travel | 19 | \n âWherever you go, whatever you do, jus... | a22124811bfa8350 | 1 |
| 1 | 2 | Full Moon over Noahâs Ark: An Odyssey to Mou... | 4 | 49.43 | In stock | http://books.toscrape.com/catalogue/full-moon-... | Travel | 15 | \n Acclaimed travel writer Rick Antonson se... | ce60436f52c5ee68 | 1 |
| 2 | 3 | See America: A Celebration of Our National Par... | 3 | 48.87 | In stock | http://books.toscrape.com/catalogue/see-americ... | Travel | 14 | \n To coincide with the 2016 centennial ann... | f9705c362f070608 | 1 |
| 3 | 4 | Vagabonding: An Uncommon Guide to the Art of L... | 2 | 36.94 | In stock | http://books.toscrape.com/catalogue/vagabondin... | Travel | 8 | \n With a new foreword by Tim Ferriss â¢Th... | 1809259a5a5f1d8d | 1 |
| 4 | 5 | Under the Tuscan Sun | 3 | 37.33 | In stock | http://books.toscrape.com/catalogue/under-the-... | Travel | 7 | \n A CLASSIC FROM THE BESTSELLING AUTHOR OF... | a94350ee74deaa07 | 1 |


In [37]:
Categories.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ID      50 non-null     int64 
 1   Name    50 non-null     object
 2   URL     50 non-null     object
 3   Size    50 non-null     int64 
 4   Pages   50 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 2.1+ KB


In [38]:
Catalog.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   ID           1000 non-null   int64  
 1   Title        1000 non-null   object 
 2   Stars        1000 non-null   int64  
 3   Price        1000 non-null   float64
 4   In_Stock     1000 non-null   object 
 5   URL          1000 non-null   object 
 6   Category     1000 non-null   object 
 7   Stock_Size   1000 non-null   int64  
 8   Description  1000 non-null   object 
 9   UPC          1000 non-null   object 
 10  Category_ID  1000 non-null   int64  
dtypes: float64(1), int64(4), object(6)
memory usage: 86.1+ KB


We can now export the dataframes to csv files.

In [39]:
Categories.to_csv('Categories.csv', index=False)
Catalog.to_csv('Catalog.csv', index=False)