# Web scraping in Python: Scraping with Requests and BeautifulSoup

Web scraping lets programmers gather information from the internet automatically. It is generally acceptable for non-commercial purposes with publicly available data, but take care not to scrape protected information such as personal data, intellectual property, or confidential information.

Social media is a more complicated case because its content has varying levels of accessibility, which makes cautious and informed scraping practices especially important.


In this Coffee and Coding session, we will focus on two Python libraries: <b>Requests</b> and <b>BeautifulSoup</b>. <br>http://books.toscrape.com/ contains reviews of fake books for beginners learning web scraping. <br>
This session is an introduction to web scraping aimed at beginners (like me!).

To gather information from the internet through web scraping, one typically follows a four-step process:

<li> Sending an HTTP GET request to the URL </li>
<li> Retrieving HTML (Hypertext Markup Language) content </li>
<li> Building the HTML document tree </li>
<li> Extracting information from the HTML document tree </li>   

### Requests


Requests is a widely used Python library for making HTTP requests. It lets you send an HTTP request to a server, receive the response, and handle it in a simple way.

It supports various request methods, such as GET, POST, HEAD, PUT and DELETE.

### BeautifulSoup

BeautifulSoup is a Python library for web scraping and data extraction from HTML and XML files. <br>
It provides a convenient way to parse and navigate through HTML content, so that specific elements and data can be extracted.

<b>Resource: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ </b> 

We will first fetch HTML code from our fake-books site. But before we start, I will show you a very quick overview of basic HTML. 

HTML is a standard markup language used for creating web pages and other information that can be displayed in a web browser. It consists of a set of tags and attributes that define the structure, content, and appearance of a web page. 

In [None]:
# IPython.core.display is a deprecated import path; IPython.display is the public one.
from IPython.display import HTML, display

In [None]:
html_test = """
<html>
  <head>
    <title>My First Web Page</title>
  </head>
  <body>
    <h1>Welcome to my first test web page</h1>
    <p>This is a paragraph of text. I want to list some of my favourite composers and their music 
        <i> Italic font </i>    
        <b> Bold text 1</b>
        <b> Bold text 2</b>
    </p>    
    <p> <b color = "blue"> This is next paragraph </b></p>
    <br>    
    <ul id = "composer" class = "myclass">
      <li>Liszt</li>  
      <a href = "https://en.wikipedia.org/wiki/Franz_Liszt"> Franz Liszt: Hungarian composer, pianist of romantic period. </a>
      <li>Mozart</li>
      <a href = "https://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart"> Wolfgang Amadeus Mozart influential composer of classical period. </a>
      <li>Debussy</li>
      <a href = "https://en.wikipedia.org/wiki/Claude_Debussy"> Claude Debussy, French composer seen as the first impressionist. </a>
      <li>Beethoven</li>
      <a href = "https://en.wikipedia.org/wiki/Ludwig_van_Beethoven"> Ludwig van Beethoven, German composer and pianist.</a>
    </ul>
    <ul id = "piece" class = "myclass">
        <li>Love Dream (No.3)</li>
        <li>Piano Sonata No.16</li>
        <li>Moonlight</li>
        <li>Symphony No.5</li>
    </ul>
  </body>
</html>
"""

In [None]:
# Let's import libraries first 
from bs4 import BeautifulSoup

We explore functions in BeautifulSoup using the sample HTML we've just created. 

In [None]:
# create soup variable
soup = BeautifulSoup(html_test, features = "html.parser")

In [None]:
# soup holds the parsed HTML. Displaying it directly works, but the layout is hard to read.
soup

In [None]:
# prettify() indents the tags to follow the parse tree, which is much easier to read.
print(soup.prettify())

When you read HTML, you usually want to search for a specific part of it. You can find things by `tag`, such as `<head>`, `<title>` etc.

In [None]:
# Code to navigate the data structure (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
display(soup.title) # access the title tag
display(soup.title.string) # access only the text inside the title tag
# You can also modify the string in your tag
# soup.title.string = "My Second web page"
display(soup.p.get_text()) # text of the first p element only
display(soup.ul.get_text()) # text of the first ul element only

Let's try `find()` or `find_all()` functions to search for specific tags in the HTML content. <br>
`find()` returns only the first occurrence of the search query. `find_all()` returns a list of all matches.

In [None]:
display(soup.find('a')) # first element of a tag only
display(soup.find('a').text) # This allows us to extract the inner HTML text
display(soup.find_all('a')) # all a tag elements

In [None]:
display(soup.find_all("p")[0]) # First element of p tag
display(soup.find_all("p")[0].find_all("b")) # inside p tag all b tags

If we only want to extract the composer `<li>` items, we can pass an `attrs={}` dictionary that describes the attributes of the HTML tag we are looking for. The dictionary keys are the attribute names, and the values are the attribute values.<br>

In [None]:
myList = soup.find(attrs = {'id':'composer', 'class':'myclass'})
myList.find_all('li')
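
The same lookup can be written in a few equivalent ways. The sketch below is only an illustration: it uses a trimmed-down copy of the sample list (the `html_snippet`, `snippet_soup` and `by_*` names are made up for this example) so that it runs on its own, and compares the `attrs` dictionary with keyword arguments and with a CSS selector.

```python
from bs4 import BeautifulSoup

# A trimmed-down copy of the sample page, kept inline so the sketch runs on its own.
html_snippet = """
<ul id="composer" class="myclass">
  <li>Liszt</li>
  <li>Mozart</li>
</ul>
<ul id="piece" class="myclass">
  <li>Love Dream (No.3)</li>
  <li>Piano Sonata No.16</li>
</ul>
"""

snippet_soup = BeautifulSoup(html_snippet, "html.parser")

# attrs dictionary: the tag must have id "composer" and class "myclass"
by_attrs = snippet_soup.find(attrs={"id": "composer", "class": "myclass"})

# Keyword form: "class" is a reserved word in Python, so BeautifulSoup uses class_
by_keywords = snippet_soup.find("ul", id="composer", class_="myclass")

# CSS selector form: "#" selects by id and "." selects by class
by_css = snippet_soup.select_one("ul#composer.myclass")

# All three calls point at the same <ul>, so any of them can be searched for <li> items
composers = [li.get_text() for li in by_attrs.find_all("li")]
print(composers)
```

Which form to use is mostly a matter of taste; the dictionary form is handy when an attribute name is not a valid Python keyword argument.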

You can also traverse parent and child elements in the HTML. The code below uses `find_parent()` to find the parent of the first `ul` tag (the `body` tag). It then calls `find_all()` with a list of tag names to find all the `li` and `a` elements in the first `ul` tag and prints the text of each one.

In [None]:
# First find the first ul tag
first_ul = soup.find("ul")

# Find the parent of the first ul tag (the body tag)
body = first_ul.find_parent("body")

# Print the parent's tag name, then the text of each li and a element inside the first ul
print(body.name)
for element in first_ul.find_all(["a", "li"]):
    print(element.text)


### Let's try with fake-book review site 

We've tried a few basic BeautifulSoup functions on our own HTML text. Now we can go to the fake book review site and extract some information about the books: their prices and descriptions.
Before we start, let's open this web page and inspect its HTML: https://books.toscrape.com/catalogue/page-1.html

In [None]:
# Import requests and pandas, re for regex string manipulation, and tqdm for a progress bar
import requests
import pandas as pd
import re
from tqdm import tqdm

#### Read the first page, first book information



First, we read the book title, price and rating. We will then extend this by adding the book description. (When a user clicks a book title, it takes them to the product description page.)

In [None]:
# requests fetches the HTML of the very first page.
url = "https://books.toscrape.com/catalogue/page-1.html" # page-2, page-3 etc. follow the same pattern; we will loop over all pages in a later section
response = requests.get(url) # send the HTTP request and get the content of the page
display(response)
# 200 means 'Success' status
# If you want to see the content of the response (it will be messy)
# response.text
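
Status codes other than 200 are worth recognising too. As a small illustration that needs no network access, the standard library's `http.HTTPStatus` can translate a code into its standard phrase; the `is_success` helper is a made-up name for this sketch. With Requests itself, the code is available as `response.status_code`, and `response.raise_for_status()` raises an exception for 4xx and 5xx responses.

```python
from http import HTTPStatus

# Print the standard phrase for a few common status codes
for code in (200, 301, 404, 500):
    print(code, HTTPStatus(code).phrase)


def is_success(code):
    """Return True for 2xx codes, which mean the request succeeded."""
    return 200 <= code < 300


print(is_success(200), is_success(404))
```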

In [None]:
soup = BeautifulSoup(response.text, "html.parser")
# You can look at the content using 
# soup.prettify()

Let's go back to the browser's HTML inspector and extract what we are interested in (book title, price) from the first page.

The code below extracts the book title, its price and its rating. <br>
First, we find the first `<article>` tag. From there, we read the `title` attribute of the `<a>` tag inside `<h3>`, which gives the book title.
Next, we find the `<p>` tag with the class `price_color` among the children of `<article>` and return its text (£54.12, for example).
We only want to keep the price as a float, so we clean that field with a regular expression.
Similarly, we take the second value of the first `<p>` tag's `class` attribute (One, Two, Three etc.) and convert it into an integer.

In [362]:
# Extract the first book on the first page to check that this works.
review_mapping = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5} # maps the rating word to a number

article = soup.find("article")
title = article.h3.a["title"]
print(title)
price = article.find('p', class_="price_color").text
price = float(re.findall(r"\d+\.\d+", price)[0]) # raw string avoids escape warnings; + means one or more occurrences
print(price)
rating = article.p["class"][1] # second class name is the rating word: One, Two, Three etc.
rating = review_mapping[rating]
print(rating)

A Light in the Attic
51.77
3
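
The price and rating clean-up can also be wrapped in small helper functions that return `None` instead of raising an error when the text is not in the expected shape. This is only a sketch: the helper names are invented here, and the `star-rating` class name in the usage lines is an assumption about the first class value, which the session code skips by taking index 1.

```python
import re

REVIEW_MAPPING = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_price(price_text):
    """Return the first decimal number in price_text as a float, or None."""
    match = re.search(r"\d+\.\d+", price_text)
    return float(match.group()) if match else None


def parse_rating(class_list):
    """Map a class list such as ['star-rating', 'Three'] to an integer, or None."""
    for name in class_list:
        if name in REVIEW_MAPPING:
            return REVIEW_MAPPING[name]
    return None


print(parse_price("£51.77"))
print(parse_rating(["star-rating", "Three"]))  # "star-rating" is an assumed class name
```

Searching the whole class list, rather than taking index 1, keeps the helper working even if the order of the class names changes.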


Now we extend this to every book on page 1 with a `for` loop. As you can see, we now use the `find_all()` function to find all `<article>` tags.

#### Read the first page, all books in the first page information

In [363]:
# Based on the code above, we can now extract all books on the first page.
books = [] # start with an empty list and append one dictionary per book

for article in soup.find_all("article"):
    title = article.h3.a["title"]
    # price = article.select_one(".price_color").get_text() # another way using select_one (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors)
    price = article.find('p', class_="price_color").text
    price = float(re.findall(r"\d+\.\d+", price)[0])
    rating = article.p["class"][1] # second class name is the rating word: One, Two etc.
    rating = review_mapping[rating]
    books.append({"title": title, "price": price, "rating": rating})
      
for book in books[0:5]: # First 5 books
    print(f"Title: {book['title']}")
    print(f"Price: {book['price']}")
    print(f"Rating: {book['rating']}")
    print("")   


Title: A Light in the Attic
Price: 51.77
Rating: 3

Title: Tipping the Velvet
Price: 53.74
Rating: 1

Title: Soumission
Price: 50.1
Rating: 1

Title: Sharp Objects
Price: 47.82
Rating: 4

Title: Sapiens: A Brief History of Humankind
Price: 54.23
Rating: 5



So far, we have extracted the book title, price and rating. Can we add extra information? I would like to add the product description and the genre (Travel, Poetry, Mystery etc.). To do this, we can take the hyperlink for each book and extract the description and genre from the individual book pages.
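
Each book link on the listing page is a relative `href`, so it has to be turned into a full URL before it can be requested. As an alternative to joining strings by hand, here is a minimal sketch using `urllib.parse.urljoin` from the standard library. The `relative_href` value is an assumed example of what such a link might look like, not one taken from the session.

```python
from urllib.parse import urljoin

page_url = "https://books.toscrape.com/catalogue/page-1.html"

# Assumed example of a relative link found in an <a href="..."> on the listing page
relative_href = "a-light-in-the-attic_1000/index.html"

# urljoin resolves the href against the page it was found on
book_url = urljoin(page_url, relative_href)
print(book_url)

# It also handles "../" segments, which plain string concatenation would not
home_url = urljoin(page_url, "../index.html")
print(home_url)
```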

#### Read the first page, all books in the first page information and extract extra information from individual books (description, genre)

In [369]:
# Extend the code above to include the book description and genre.
# We need each book's hyperlink so that we can request its own page
# and extract the description and genre from there.

# Request the listing page
url = "https://books.toscrape.com/catalogue/page-1.html" # page-2 etc. follow the same pattern; we loop over all pages in the next step
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Collect each book's title, price and rating
books = []
review_mapping = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5} # maps the rating word to a number

# This code extends the previous cell.


for article in soup.find_all("article"):
    title = article.h3.a["title"]
    price = article.find('p', class_="price_color").text
    price = float(re.findall(r"\d+\.\d+", price)[0]) # Keep only the numeric price; drops the £ sign and any stray characters.
    rating = article.p["class"][1] # second class name is the rating word: One, Two etc.
    rating = review_mapping[rating]

    # Find the link to the individual book's page
    link = article.h3.a["href"]
    book_url = "http://books.toscrape.com/catalogue/" + link  # full URL of the book's own page, which holds the description
    
    # Again request the individual book's page    
    book_response = requests.get(book_url)    
    book_soup = BeautifulSoup(book_response.text, "html.parser")
    
    # Extract the product description.
    # Select the meta tag with attribute name="description". select_one returns the first matching element,
    # or None if there is no match. ['content'] reads the value of that meta tag's content attribute.

    product_description = book_soup.select_one("meta[name='description']")["content"]
    genre_related_tag = book_soup.select("ul.breadcrumb li") # all li elements in the breadcrumb
    genre_list = [item.text for item in genre_related_tag]
    # print(genre_list)
    # The breadcrumb on this website follows the structure
    # home/books/{topic}/title
    # so the second-to-last item [-2] is the genre (Poetry, Historical Fiction, Fiction etc.)
    genre = genre_list[-2] if len(genre_list) > 2 else None # guard against breadcrumbs that are too short to contain a genre
    books.append({"title": title, "genre": genre, "price": price, "rating": rating, "book_description": product_description})
    
for book in books[0:5]: # First 5 books
    print(f"Title: {book['title']}")
    print(f"Genre: {book['genre']}")    
    print(f"Price: {book['price']}")
    print(f"Rating: {book['rating']}")
    print(f"Description: {book['book_description']}")
    print("")

Title: A Light in the Attic
Genre: 
Poetry

Price: 51.77
Rating: 3
Description: 
    It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cr
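
The genre printed above is surrounded by newlines because `.text` returns the whitespace inside each `<li>` as well. One way to avoid cleaning it up later is `get_text(strip=True)`. The sketch below runs on an inline breadcrumb whose structure follows the home/books/{topic}/title pattern described in the code comments; the `href` values and variable names are invented for the illustration.

```python
from bs4 import BeautifulSoup

# Inline breadcrumb modelled on the home/books/{topic}/title structure (hrefs are made up)
breadcrumb_html = """
<ul class="breadcrumb">
  <li>
    <a href="../../index.html">Home</a>
  </li>
  <li>
    <a href="../category/books/index.html">Books</a>
  </li>
  <li>
    <a href="../category/books/poetry/index.html">Poetry</a>
  </li>
  <li class="active">A Light in the Attic</li>
</ul>
"""

crumb_soup = BeautifulSoup(breadcrumb_html, "html.parser")

# strip=True removes leading and trailing whitespace from each piece of text
crumbs = [li.get_text(strip=True) for li in crumb_soup.select("ul.breadcrumb li")]

# Same guard as in the session code: only index when the breadcrumb is long enough
genre = crumbs[-2] if len(crumbs) > 2 else None
print(crumbs)
print(genre)
```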

We have now included information such as:
<li> Book title </li>
<li> Genre </li>
<li> Price </li>
<li> Rating </li>
<li> Book description </li>

It would be useful to go through all the pages and extract every book on this website. This is our final stage, and we will save the result as a `pandas DataFrame`.

How do we get the total number of pages? We first find the `<li>` tag with the `next` class, then take the `<li>` sibling just before it using `find_previous_sibling('li')`. That sibling holds the text "Page 1 of 50".

`select()` returns a list, so we need `[0]` to take its first element: `find_previous_sibling()` can only be called on a single tag, not on a list.
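
Once the pager text is in hand, pulling out the page count and building the list of page URLs needs only a regular expression and a loop. A minimal sketch, assuming the pager text looks like "Page 1 of 50" padded with whitespace (the `pager_text` value is typed in here rather than scraped):

```python
import re

# Pager text as it might come back from the page, including surrounding whitespace
pager_text = """
            Page 1 of 50
        """

# \s+ tolerates any amount of whitespace between the words
match = re.search(r"Page\s+(\d+)\s+of\s+(\d+)", pager_text)
total_pages = int(match.group(2)) if match else 1  # fall back to a single page

# Build one URL per page, following the page-N.html pattern
page_urls = [
    f"https://books.toscrape.com/catalogue/page-{number}.html"
    for number in range(1, total_pages + 1)
]
print(total_pages)
print(page_urls[0])
print(page_urls[-1])
```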

#### Expand for all pages

In [380]:
review_mapping = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
books = []

# Get the total number of pages
url = "http://books.toscrape.com/catalogue/page-1.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

previous_page = soup.select("li.next")[0].find_previous_sibling("li")
if previous_page:
    text = previous_page.text # returns "Page 1 of 50"; we want to extract the 50
    print(text)
    # extract the last page number
    match = re.search(r'Page (\d+) of (\d+)', text) # group 1 is the current page, group 2 the total
    total_pages = int(match.group(2)) # the total number of pages
    print(total_pages)
else:
    total_pages = 1 # no pager text found, so assume a single page


# Loop through each page
# (commented out because the full scrape takes about 7 minutes to run)

# for page_number in tqdm(range(1, total_pages + 1)):
#     url = f"http://books.toscrape.com/catalogue/page-{page_number}.html"
#     response = requests.get(url)
#     soup = BeautifulSoup(response.text, "html.parser")
#     for article in soup.find_all("article"):
#         title = article.h3.a["title"]
#         price = article.find('p', class_="price_color").text
#         price = float(re.findall(r"\d+\.\d+", price)[0]) # Keep only the numeric price; drops the £ sign and any stray characters.
#         rating = article.p["class"][1] # second class name is the rating word: One, Two etc.
#         rating = review_mapping[rating]

#         # Find the link to the individual book's page
#         link = article.h3.a["href"]
#         book_url = "http://books.toscrape.com/catalogue/" + link  # full URL of the book's own page, which holds the description

#         # Request the individual book's page
#         book_response = requests.get(book_url)
#         book_soup = BeautifulSoup(book_response.text, "html.parser")

#         # Extract the product description.
#         # Select the meta tag with attribute name="description". select_one returns the first matching element,
#         # or None if there is no match. ['content'] reads the value of that meta tag's content attribute.

#         product_description = book_soup.select_one("meta[name='description']")["content"]
#         genre_related_tag = book_soup.select("ul.breadcrumb li") # all li elements in the breadcrumb
#         genre_list = [item.text for item in genre_related_tag]
#         # print(genre_list)
#         # The breadcrumb on this website follows the structure
#         # home/books/{topic}/title
#         # so the second-to-last item [-2] is the genre (Poetry, Historical Fiction, Fiction etc.)
#         genre = genre_list[-2] if len(genre_list) > 2 else None # guard against breadcrumbs that are too short to contain a genre
#         books.append({"title": title, "genre": genre, "price": price, "rating": rating, "book_description": product_description})

        
        
# # Convert the list of dictionaries to a pandas DataFrame
# # (the text is cleaned in the next section)
# df = pd.DataFrame(books)

# df.head()
# # Save to CSV so that the next section can load it without re-scraping.
# df.to_csv("books_web_scraping.csv", index = False)


            
                Page 1 of 50
            
            
50


100%|████████████████████████████████████████████████████████████████████████████████████| 50/50 [06:47<00:00,  8.14s/it]


### Pandas manipulation 

<li> Tidy up book description text by removing non alphanumeric characters. </li>
<li> Descriptive analysis </li>    
<li> We could try NLP on book_description? </li>

In [409]:
# The scrape takes about 7 minutes, so instead of re-running it during the session we load the saved CSV.

df = pd.read_csv("books_web_scraping.csv")
df.head() # genre and book_description could do with cleaning


|  | title | genre | price | rating | book_description |
|---|---|---|---|---|---|
| 0 | A Light in the Attic | \nPoetry\n | 51.77 | 3 | \n It's hard to imagine a world without A L... |
| 1 | Tipping the Velvet | \nHistorical Fiction\n | 53.74 | 1 | \n "Erotic and absorbing...Written with sta... |
| 2 | Soumission | \nFiction\n | 50.1 | 1 | \n Dans une France assez proche de la nÃ´tr... |
| 3 | Sharp Objects | \nMystery\n | 47.82 | 4 | \n WICKED above her hipbone, GIRL across he... |
| 4 | Sapiens: A Brief History of Humankind | \nHistory\n | 54.23 | 5 | \n From a renowned historian comes a ground... |


In [423]:
# Replace newlines and runs of non-alphanumeric characters (apostrophes are kept) with a space,
# then strip leading and trailing whitespace
df[['genre','book_description']] = df[['genre','book_description']].applymap(lambda x: re.sub(r'\n|[^\w\s\']+',' ',x))
df[['genre','book_description']] = df[['genre','book_description']].applymap(lambda x: x.strip())
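
To see what this pattern does without loading the whole DataFrame, it can be tried on a couple of plain strings. The `clean_text` helper below is a made-up name for this sketch; it applies the same regular expression and `strip()`, and also collapses repeated spaces, which is an extra step not in the cell above.

```python
import re


def clean_text(text):
    """Replace newlines and punctuation with spaces, then tidy the whitespace."""
    # Same pattern as in the session: newlines, or runs of characters that are
    # not word characters, whitespace or apostrophes
    no_punctuation = re.sub(r"\n|[^\w\s']+", " ", text)
    # Extra step (not in the session code): collapse repeated whitespace
    return re.sub(r"\s+", " ", no_punctuation).strip()


print(clean_text("\nPoetry\n"))
print(clean_text('\n    "Erotic and absorbing...Written with'))
print(clean_text("It's hard to imagine"))
```

Because `\w` matches letters in any alphabet, accented letters survive this step; only punctuation and symbols are replaced.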


In [411]:
df.head()

|  | title | genre | price | rating | book_description |
|---|---|---|---|---|---|
| 0 | A Light in the Attic | Poetry | 51.77 | 3 | It's hard to imagine a world without A Li... |
| 1 | Tipping the Velvet | Historical Fiction | 53.74 | 1 | Erotic and absorbing Written with starli... |
| 2 | Soumission | Fiction | 50.1 | 1 | Dans une France assez proche de la nÃ tre... |
| 3 | Sharp Objects | Mystery | 47.82 | 4 | WICKED above her hipbone GIRL across her... |
| 4 | Sapiens: A Brief History of Humankind | History | 54.23 | 5 | From a renowned historian comes a groundb... |


In [418]:
category_df = df['genre'].value_counts() \
           .rename('count') \
           .reset_index() 

In [419]:
category_df

|  | index | count |
|---|---|---|
| 0 | Default | 152 |
| 1 | Nonfiction | 110 |
| 2 | Sequential Art | 75 |
| 3 | Add a comment | 67 |
| 4 | Fiction | 65 |
| 5 | Young Adult | 54 |
| 6 | Fantasy | 48 |
| 7 | Romance | 35 |
| 8 | Mystery | 32 |
| 9 | Food and Drink | 30 |


In [435]:
df['book_description'][2]

'Dans une France assez proche de la nÃ tre  un homme sâ engage dans la carriÃ re universitaire  Peu motivÃ  par lâ enseignement  il sâ attend Ã\xa0 une vie ennuyeuse mais calme  protÃ gÃ e des grands drames historiques  Cependant les forces en jeu dans le pays ont fissurÃ  le systÃ me politique jusquâ Ã\xa0 provoquer son effondrement  Cette implosion sans soubresauts  sans vraie rÃ volution  s Dans une France assez proche de la nÃ tre  un homme sâ engage dans la carriÃ re universitaire  Peu motivÃ  par lâ enseignement  il sâ attend Ã\xa0 une vie ennuyeuse mais calme  protÃ gÃ e des grands drames historiques  Cependant les forces en jeu dans le pays ont fissurÃ  le systÃ me politique jusquâ Ã\xa0 provoquer son effondrement  Cette implosion sans soubresauts  sans vraie rÃ volution  se dÃ veloppe comme un mauvais rÃªve Le talent de lâ auteur  sa force visionnaire nous entraÃ nent sur un terrain ambigu et glissant   son regard sur notre civilisation vieillissante fait coexister dans ce rom
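
The odd characters in this description (for example "nÃ tre", which was "nÃ´tr..." before cleaning) look like UTF-8 text that was decoded as ISO-8859-1 (Latin-1). That is a guess about the cause rather than something checked during the session, but the pattern is easy to reproduce with a single word:

```python
original = "nôtre"

# Encode as UTF-8, then decode the bytes with the wrong codec (Latin-1)
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)

# Reversing the two steps recovers the text, as long as no characters were removed
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

If this is indeed what happened, setting `response.encoding = "utf-8"` before reading `response.text` should avoid the problem at the source. It is worth doing before cleaning, because the cleaning step removes symbols such as "´" and the text can no longer be repaired afterwards.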

I've noticed that some of the book descriptions are not written in English. I want to remove 