### **Scraping a book website**

First, go to this link:

https://books.toscrape.com/



For today, let's just focus on the music section:

https://books.toscrape.com/catalogue/category/books/music_14/index.html

Our goal will be to save the following information to a CSV file for every book in the music section:
- Title
- Rating
- UPC
- Product Type
- Price (excl. tax)
- Price (incl. tax)
- Tax
- Availability
- Number of reviews




In [7]:
# installing libraries: you only need to run this once
# this can also be run in your terminal with: pip install requests beautifulsoup4 pandas

import sys
!{sys.executable} -m pip install --user requests beautifulsoup4 pandas




In [8]:
# importing libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urljoin
import re
import time

In [9]:
# step 1: make a request to the website and get the HTML content
url = "https://books.toscrape.com/catalogue/category/books/music_14/index.html" # this is the url to the website we want to scrape

response = requests.get(url) # this sends a GET request to the website and stores the response in the variable "response"

print("status code:", response.status_code) # this prints the status code of the response. A status code of 200 means the request was successful, while a status code of 404 means the page was not found.

html = response.text # this gets the HTML content of the page as a string and stores it in the variable "html"

print(html[:1000]) # this prints the first 1000 characters of the HTML content

status code: 200


<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    Music | 
     Books to Scrape - Sandbox

</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
        <meta name="created" content="24th Jun 2016 09:29" />
        <meta name="description" content="
    
" />
        <meta name="viewport" content="width=device-width" />
        <meta name="robots" content="NOARCHIVE,NOCACHE" />

        <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
        <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->

        
            <link rel="shortcut icon" hr

Now go back to this link: https://books.toscrape.com/catalogue/category/books/music_14/index.html

1. Right click
2. Inspect
3. Click the top left icon that looks like a square with an arrow in the bottom right corner
4. Hover over the area for the first book
5. Click the corresponding html element that is highlighted
6. Copy it
7. Paste it below
8. Go back and hover over the list of similar elements. Notice how each "product pod" is highlighted as you do



In [11]:
# it should look like this (wrapped in a triple-quoted string so the cell runs; raw HTML is not valid Python)
# the <h3><a href="..."> line is the important one: it gives us the hyperlink to the book's page
first_pod_html = """
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
    <article class="product_pod">
            <div class="image_container">
                    <a href="../../../rip-it-up-and-start-again_986/index.html"><img src="../../../../media/cache/81/c4/81c4a973364e17d01f217e1188253d5e.jpg" alt="Rip it Up and Start Again" class="thumbnail"></a>
            </div>
                <p class="star-rating Five">
                    <i class="icon-star"></i>
                    <i class="icon-star"></i>
                    <i class="icon-star"></i>
                    <i class="icon-star"></i>
                    <i class="icon-star"></i>
                </p>
            <h3><a href="../../../rip-it-up-and-start-again_986/index.html" title="Rip it Up and Start Again">Rip it Up and ...</a></h3>
            <div class="product_price">
        <p class="price_color">£35.02</p>
<p class="instock availability">
    <i class="icon-ok"></i>
        In stock
</p>
    <form>
        <button type="submit" class="btn btn-primary btn-block" data-loading-text="Adding...">Add to basket</button>
    </form>
            </div>
    </article>
</li>
"""

In [None]:
# let's pull each hyperlink from each product pod so that we can access the individual pages for each book
soup = BeautifulSoup(html, "html.parser") # creates a beautifulsoup object from the HTML content, which allows us to easily navigate and search the HTML structure
product_pods = soup.find_all("article", class_="product_pod") # this finds all the "article" tags with the class "product_pod" and stores them in a list called "product_pods". Each "article" tag represents a product on the page.
links = [] 

for pod in product_pods: # this loops through each product pod in the list of product pods
    link = pod.h3.a["href"] # this gets the hyperlink from the "a" tag inside the "h3" tag of the product pod. The "href" attribute contains the URL of the individual book page.
    links.append(link) # this adds the hyperlink to the list of existing hyperlinks
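
As a sketch of an alternative, the same links could be pulled with a CSS selector instead of walking through pod.h3.a in a loop. The snippet below uses a tiny hand-written HTML string (sample_html, an illustrative stand-in for the real page) so it runs without a network request; select is BeautifulSoup's CSS-selector method.

```python
from bs4 import BeautifulSoup

# sample_html is a made-up, trimmed-down stand-in for the category page
sample_html = """
<ol>
  <li><article class="product_pod">
    <h3><a href="../../../rip-it-up-and-start-again_986/index.html" title="Rip it Up and Start Again">Rip it Up and ...</a></h3>
  </article></li>
  <li><article class="product_pod">
    <h3><a href="../../../how-music-works_979/index.html" title="How Music Works">How Music Works</a></h3>
  </article></li>
</ol>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# "article.product_pod h3 a" means: every <a> inside an <h3> inside an <article class="product_pod">
anchors = sample_soup.select("article.product_pod h3 a")

sample_links = [a["href"] for a in anchors]
sample_titles = [a["title"] for a in anchors]  # the title attribute holds the full, untruncated title

print(sample_links)
print(sample_titles)
```

Either approach should give the same list; the selector version is just shorter once the page structure is known.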

In [None]:
print(len(links))

13


In [None]:
for link in links:
    print(link)

../../../rip-it-up-and-start-again_986/index.html
../../../our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html
../../../how-music-works_979/index.html
../../../love-is-a-mix-tape-music-1_711/index.html
../../../please-kill-me-the-uncensored-oral-history-of-punk_537/index.html
../../../kill-em-and-leave-searching-for-james-brown-and-the-american-soul_528/index.html
../../../chronicles-vol-1_462/index.html
../../../this-is-your-brain-on-music-the-science-of-a-human-obsession_414/index.html
../../../orchestra-of-exiles-the-story-of-bronislaw-huberman-the-israel-philharmonic-and-the-one-thousand-jews-he-saved-from-nazi-horrors_337/index.html
../../../no-one-here-gets-out-alive_336/index.html
../../../life_104/index.html
../../../old-records-never-die-one-mans-quest-for-his-vinyl-and-his-past_39/index.html
../../../forever-rockers-the-rocker-12_19/index.html


Notice that these are relative hyperlinks. If we tried to copy and paste these into a browser, they would not return a valid page.

But let's try clicking on the first book on the page and looking at the hyperlink structure: https://books.toscrape.com/catalogue/rip-it-up-and-start-again_986/index.html

So if we replace the leading "../../.." in each hyperlink with "https://books.toscrape.com/catalogue", we get valid pages. The urljoin function does this for us by resolving each relative link against the URL of the page it came from.
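
To see why resolving the link is safer than gluing strings together, here is a small sketch that compares the two on the first link. The variable names are illustrative only. Each "../" climbs one folder up from the category page (music_14, then books, then category), which is how the link ends up directly under catalogue.

```python
from urllib.parse import urljoin

category_url = "https://books.toscrape.com/catalogue/category/books/music_14/index.html"
relative_link = "../../../rip-it-up-and-start-again_986/index.html"

# naive approach: glue a fixed prefix onto the link; the "../" segments are left in the url
glued = "https://books.toscrape.com/catalogue/" + relative_link

# urljoin resolves each "../" against the folder that category_url lives in
resolved = urljoin(category_url, relative_link)

print(glued)
print(resolved)
```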


In [12]:
from urllib.parse import urljoin

page_url = "https://books.toscrape.com/catalogue/category/books/music_14/index.html" # this is the url of the page we are scraping, which we will use as the base url to create the full urls for each book page

clean_links = [urljoin(page_url, link) for link in links] # this creates a new list called "clean_links" that contains the full urls for each book page by joining the base url with each hyperlink in the "links" list using the urljoin function from the urllib.parse library

for link in clean_links: 
    print(link)

 # now we can loop through each of the clean links and make a request to each book page to get more information about each book, such as the title, price, stock availability, and star rating

https://books.toscrape.com/catalogue/rip-it-up-and-start-again_986/index.html
https://books.toscrape.com/catalogue/our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html
https://books.toscrape.com/catalogue/how-music-works_979/index.html
https://books.toscrape.com/catalogue/love-is-a-mix-tape-music-1_711/index.html
https://books.toscrape.com/catalogue/please-kill-me-the-uncensored-oral-history-of-punk_537/index.html
https://books.toscrape.com/catalogue/kill-em-and-leave-searching-for-james-brown-and-the-american-soul_528/index.html
https://books.toscrape.com/catalogue/chronicles-vol-1_462/index.html
https://books.toscrape.com/catalogue/this-is-your-brain-on-music-the-science-of-a-human-obsession_414/index.html
https://books.toscrape.com/catalogue/orchestra-of-exiles-the-story-of-bronislaw-huberman-the-israel-philharmonic-and-the-one-thousand-jews-he-saved-from-nazi-horrors_337/index.html
https://books.toscrape.com/catalogue/no-one-here-gets-out-

In [13]:
# now let's make a request to the first link in our list and examine its html structure
first_book_url = clean_links[0]

response = requests.get(first_book_url)
print("status code:", response.status_code)

book_html = response.text

book_soup = BeautifulSoup(book_html, "html.parser")

print(book_soup.prettify())


status code: 200
<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   Rip it Up and Start Again | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="
    Punk's raw power rejuvenated rock, but by the summer of 1977 the movement had become a parody of itself. RIP IT UP AND START AGAIN is a celebration of what happened next.Post-punk bands like PiL, Joy Division, Talking Heads, The Fall and The Human League dedicated themselves to fulfilling punk's unfinished musical revolution. The post-punk groups were fervent modernists; w Punk's raw power rejuvenated r

In [14]:
# title
title = book_soup.find("div", class_="product_main").find("h1").get_text(strip=True)

# rating (stored as a class, e.g. "star-rating Five")
rating = book_soup.find("p", class_="star-rating")["class"][1]

print("title:", title)
print("rating:", rating)


title: Rip it Up and Start Again
rating: Five
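
If a numeric rating is more convenient in the CSV, the word could be mapped to a number. This is a minimal sketch: RATING_WORDS and rating_to_int are hypothetical names, and it assumes the site only ever uses the words One through Five.

```python
# hypothetical lookup table: assumes ratings are always one of these five words
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def rating_to_int(rating_word):
    """Return the rating as an integer, or None if the word is not recognised."""
    return RATING_WORDS.get(rating_word)

print(rating_to_int("Five"))
print(rating_to_int("Zero"))  # not a known rating word, so this gives None
```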


In [15]:
# product information table
table_rows = book_soup.find("table", class_="table table-striped").find_all("tr")

product_info = {}
for row in table_rows:
    key = row.find("th").get_text(strip=True)
    value = row.find("td").get_text(strip=True)
    product_info[key] = value

print(table_rows)


[<tr>
<th>UPC</th><td>a34ba96d4081e6a4</td>
</tr>, <tr>
<th>Product Type</th><td>Books</td>
</tr>, <tr>
<th>Price (excl. tax)</th><td>Â£35.02</td>
</tr>, <tr>
<th>Price (incl. tax)</th><td>Â£35.02</td>
</tr>, <tr>
<th>Tax</th><td>Â£0.00</td>
</tr>, <tr>
<th>Availability</th>
<td>In stock (19 available)</td>
</tr>, <tr>
<th>Number of reviews</th>
<td>0</td>
</tr>]


In [16]:
print("upc:", product_info.get("UPC"))
print("product type:", product_info.get("Product Type"))
print("price (excl. tax):", product_info.get("Price (excl. tax)"))
print("price (incl. tax):", product_info.get("Price (incl. tax)"))
print("tax:", product_info.get("Tax"))
print("availability:", product_info.get("Availability"))
print("number of reviews:", product_info.get("Number of reviews"))

upc: a34ba96d4081e6a4
product type: Books
price (excl. tax): Â£35.02
price (incl. tax): Â£35.02
tax: Â£0.00
availability: In stock (19 available)
number of reviews: 0
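
The stray Â in front of each £ is an encoding mismatch rather than part of the data. The page declares itself as UTF-8, where £ is stored as two bytes; when those bytes are decoded as ISO-8859-1 (the fallback requests uses for text responses whose headers do not name a charset), each byte becomes its own character. The sketch below reproduces the effect without any network request. As an assumption about this site's headers, setting response.encoding = "utf-8" before reading response.text should avoid it.

```python
price_on_page = "£35.02"

# what the server sends: the UTF-8 bytes for the text (£ becomes two bytes)
raw_bytes = price_on_page.encode("utf-8")

# decoding those bytes with the wrong codec turns the two bytes into two characters
wrong = raw_bytes.decode("iso-8859-1")

# decoding with the right codec recovers the original text
right = raw_bytes.decode("utf-8")

print(wrong)
print(right)
```

Because the cleaning step that follows only keeps digits and dots, the numeric values come out the same either way.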


In [17]:
# cleaning output a bit further

# --- clean price fields: extract numeric values ---
price_excl_tax = float(
    re.search(r"[\d.]+", product_info["Price (excl. tax)"]).group()
)

price_incl_tax = float(
    re.search(r"[\d.]+", product_info["Price (incl. tax)"]).group()
)

tax = float(
    re.search(r"[\d.]+", product_info["Tax"]).group()
)

# --- clean availability ---
availability_text = product_info["Availability"]

# availability flag
available_flag = "y" if "In stock" in availability_text else "n"

# number available (extract integer)
match = re.search(r"\((\d+) available\)", availability_text)
number_available = int(match.group(1)) if match else None

# print cleaned results
print("price_excl_tax:", price_excl_tax)
print("price_incl_tax:", price_incl_tax)
print("tax:", tax)
print("available_flag:", available_flag)
print("number_available:", number_available)

price_excl_tax: 35.02
price_incl_tax: 35.02
tax: 0.0
available_flag: y
number_available: 19
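
Since the same cleaning is needed for every book, it could be wrapped in small helper functions. This is only a sketch: parse_price and parse_availability are hypothetical names, and the patterns assume the text looks like the examples printed above.

```python
import re

def parse_price(text):
    """Pull the first number out of a price string, or return None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

def parse_availability(text):
    """Return (flag, count): flag is "y" or "n", count is an int or None."""
    flag = "y" if "In stock" in text else "n"
    match = re.search(r"\((\d+)\s+available\)", text)
    count = int(match.group(1)) if match else None
    return flag, count

print(parse_price("Â£35.02"))
print(parse_price("£0.00"))
print(parse_availability("In stock (19 available)"))
print(parse_availability("Out of stock"))  # made-up text, to show the fallback
```

Returning None instead of raising an error means one oddly formatted page would leave a blank in the table rather than stop the whole run; whether that trade-off is right depends on how strict you want the scrape to be.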


In [18]:
rows = []
total = len(clean_links)

for i, url in enumerate(clean_links, start=1): # loops through each clean link in our list
    print(f"scraping {i}/{total}: {url}")

    response = requests.get(url) # grabs the html for the url
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser") 

    title = soup.find("div", class_="product_main").find("h1").get_text(strip=True) # title
    rating = soup.find("p", class_="star-rating")["class"][1] # rating

    table_rows = soup.find("table", class_="table table-striped").find_all("tr") # product information table
    product_info = {}
    for row in table_rows:
        key = row.find("th").get_text(strip=True)
        value = row.find("td").get_text(strip=True)
        product_info[key] = value

    price_excl_tax = float(re.search(r"[\d.]+", product_info["Price (excl. tax)"]).group()) # cleans prices
    price_incl_tax = float(re.search(r"[\d.]+", product_info["Price (incl. tax)"]).group()) # cleans prices
    tax = float(re.search(r"[\d.]+", product_info["Tax"]).group()) # cleans taxes

    availability_text = product_info["Availability"] 
    available_flag = "y" if "In stock" in availability_text else "n" # cleans availability
    match = re.search(r"\((\d+)\s+available\)", availability_text)
    number_available = int(match.group(1)) if match else None
 
    rows.append({ # adds one dictionary per book; each dictionary becomes one row of the final dataframe
        "book_url": url, 
        "title": title,
        "rating": rating,
        "upc": product_info.get("UPC"),
        "product_type": product_info.get("Product Type"),
        "price_excl_tax": price_excl_tax,
        "price_incl_tax": price_incl_tax,
        "tax": tax,
        "available_flag": available_flag,
        "number_available": number_available,
        "number_of_reviews": product_info.get("Number of reviews"),
    })

    time.sleep(0.5)

df_books = pd.DataFrame(rows)
df_books

scraping 1/13: https://books.toscrape.com/catalogue/rip-it-up-and-start-again_986/index.html
scraping 2/13: https://books.toscrape.com/catalogue/our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html
scraping 3/13: https://books.toscrape.com/catalogue/how-music-works_979/index.html
scraping 4/13: https://books.toscrape.com/catalogue/love-is-a-mix-tape-music-1_711/index.html
scraping 5/13: https://books.toscrape.com/catalogue/please-kill-me-the-uncensored-oral-history-of-punk_537/index.html
scraping 6/13: https://books.toscrape.com/catalogue/kill-em-and-leave-searching-for-james-brown-and-the-american-soul_528/index.html
scraping 7/13: https://books.toscrape.com/catalogue/chronicles-vol-1_462/index.html
scraping 8/13: https://books.toscrape.com/catalogue/this-is-your-brain-on-music-the-science-of-a-human-obsession_414/index.html
scraping 9/13: https://books.toscrape.com/catalogue/orchestra-of-exiles-the-story-of-bronislaw-huberman-the-israel-phil

Unnamed: 0,book_url,title,rating,upc,product_type,price_excl_tax,price_incl_tax,tax,available_flag,number_available,number_of_reviews
0,https://books.toscrape.com/catalogue/rip-it-up...,Rip it Up and Start Again,Five,a34ba96d4081e6a4,Books,35.02,35.02,0.0,y,19,0
1,https://books.toscrape.com/catalogue/our-band-...,Our Band Could Be Your Life: Scenes from the A...,Three,deda3e61b9514b83,Books,57.25,57.25,0.0,y,19,0
2,https://books.toscrape.com/catalogue/how-music...,How Music Works,Two,327f68a59745c102,Books,37.32,37.32,0.0,y,19,0
3,https://books.toscrape.com/catalogue/love-is-a...,Love Is a Mix Tape (Music #1),One,adb5f51c2e2ab13f,Books,18.03,18.03,0.0,y,14,0
4,https://books.toscrape.com/catalogue/please-ki...,Please Kill Me: The Uncensored Oral History of...,Four,4a823d80aa30dbb0,Books,31.19,31.19,0.0,y,8,0
5,https://books.toscrape.com/catalogue/kill-em-a...,Kill 'Em and Leave: Searching for James Brown ...,Five,39eefec3a8498dde,Books,45.05,45.05,0.0,y,8,0
6,https://books.toscrape.com/catalogue/chronicle...,"Chronicles, Vol. 1",Two,4781b6bfd6f7a2c5,Books,52.6,52.6,0.0,y,7,0
7,https://books.toscrape.com/catalogue/this-is-y...,This Is Your Brain on Music: The Science of a ...,One,bc209a900101f28d,Books,38.4,38.4,0.0,y,5,0
8,https://books.toscrape.com/catalogue/orchestra...,Orchestra of Exiles: The Story of Bronislaw Hu...,Three,2ba5ccc84d35006c,Books,12.36,12.36,0.0,y,4,0
9,https://books.toscrape.com/catalogue/no-one-he...,No One Here Gets Out Alive,Five,3a95f5a2df4ff921,Books,20.02,20.02,0.0,y,4,0


In [19]:
df_books.to_csv("books.csv", index=False)
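
As a small illustration of what index=False does, the sketch below writes a one-row dataframe to CSV text twice and reads each version back. It uses made-up data and in-memory strings rather than books.csv.

```python
import io
import pandas as pd

# made-up one-row dataframe, just for illustration
demo_df = pd.DataFrame([{"title": "How Music Works", "price_incl_tax": 37.32}])

# with no file path, to_csv returns the csv text as a string
with_index = demo_df.to_csv()
without_index = demo_df.to_csv(index=False)

# reading the text back shows the extra, unnamed index column in the first version
cols_with_index = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without_index = pd.read_csv(io.StringIO(without_index)).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Leaving the index out keeps the file limited to the columns we actually scraped.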