---
# INTERMEDIATE PYTHON PROGRAMMING
# CHAPTER 4 - Web Scraping Using BeautifulSoup
---


# WEB SCRAPING INTRODUCTION

When open data source is not an option, you can write your own Python codes to grab data from the web. 
 This is known as web scraping.  
![](https://www.promptcloud.com/wp-content/uploads/2024/03/1_CxVccbFGtv6W2qlq0A4hxw-1024x499.png.webp)


**However, do pay attention to**
- Reading HTML is not always easy (HTML codes are often messy)
- Many website implements data protections to prevent data grabbing
- Modern web application generates web content on the fly when the page is loading at the client side.  That meas the HTML page is empty at the begining of loading while progressively loading data by JavaScript.
- Websites have their terms and condition on assessing their data.  (In our case, we are just doing this for learning purpose.  So we will be fine.)

## Use `pandas.read_html()` when you can

Before you actually implement your scraping code, consider if easy approach would do the job.  In this section, we use pandas built-in function `read_html()` to read webpage that contain targeted data in the format of HTML table.

Below is a HTML table codes and its corresponding appearance in browser.
![](https://dotnettutorials.net/wp-content/uploads/2021/11/word-image-533.png)

Use `read_html()` function to read a `url` (the address of a targeted webpage) that you know contains **HTML table** with your wanted data.  

**E.g.**:
```
    pd.read_html("WEB-PAGE-URL", headers=headers)
```

## `pandas.read_html()` Syntax
`pandas.read_html()` function requires a web page link that you want to grab table(s).  And there is chance that the pages contains multiple tables (or sometimes zero table).  Therefore the returned result of `pandas.read_html()` is type of python `list` (or also widely known as array).

### Import required packages

We need `numpy`, `pandas` and `requests`  to get the job done. `requests` is for triggering HTTP network requests.

```
import numpy as np
import pandas as pd
import requests
```

In [1866]:
import numpy as np
import pandas as pd
import requests

### Declare url and read url
`sp500_divident_yield_by_month_url = 'https://www.multpl.com/s-p-500-dividend-yield/table/by-month'`
sp500_divident_yield_by_month_url

In [None]:
sp500_divident_yield_by_month_url = 'https://www.multpl.com/s-p-500-dividend-yield/table/by-month'
sp500_divident_yield_by_month_url

### Read the url

**Call `read_html()` function here**
```
raw_html_tbl = pd.read_html(sp500_divident_yield_by_month_url)
```

In [2017]:
raw_html_tbl = pd.read_html(sp500_divident_yield_by_month_url)

In [1881]:
raw_html_tbl

[              Date     Value
 0     Apr 28, 2025  â 1.36%
 1     Apr 30, 2025  â 1.34%
 2     Mar 31, 2025  â 1.32%
 3     Feb 28, 2025  â 1.24%
 4     Jan 31, 2025  â 1.25%
 ...            ...       ...
 1848  May 31, 1871     5.35%
 1849  Apr 30, 1871     5.49%
 1850  Mar 31, 1871     5.64%
 1851  Feb 28, 1871     5.78%
 1852  Jan 31, 1871     5.86%
 
 [1853 rows x 2 columns]]

### Check it's type

`type(raw_html_tbl)`

It should say `list`

In [1883]:
type(raw_html_tbl)

list

### Check how many tables are retrieved

There could be more than 1 table in the returned result and therefore it requires use `[ ]` with **index number** to tell which table you like to access.  

Sometimes, there are zero table in the returned result and therefore make sure you codes do proper checking on the result before you try to grab data from tables.

Calling `len()` function would tell you how many tables are in the retuned result.


In [1888]:
len(raw_html_tbl)

1

### Retrieve the table by `[ ]` operation

Retrieve the table by `[ ]` operation together with correct index number

the index number starts `0` and therefore you use `0`to refer to the first table the code below get the 
first table in the result

`raw_html_tbl[0]`

In [1892]:
raw_html_tbl[0]

Unnamed: 0,Date,Value
0,"Apr 28, 2025",â 1.36%
1,"Apr 30, 2025",â 1.34%
2,"Mar 31, 2025",â 1.32%
3,"Feb 28, 2025",â 1.24%
4,"Jan 31, 2025",â 1.25%
...,...,...
1848,"May 31, 1871",5.35%
1849,"Apr 30, 1871",5.49%
1850,"Mar 31, 1871",5.64%
1851,"Feb 28, 1871",5.78%


### Check the type of the first table

It shows content in `DataFrame` and this is good becasues `DataFrame` (structured data) is best for data analysis.

`type(raw_html_tbl[0])`

In [2019]:
type(raw_html_tbl[0])

pandas.core.frame.DataFrame

### let's create a variable to store our table

In [2023]:
sp500 = raw_html_tbl[0]
sp500

Unnamed: 0,Date,Value
0,"Apr 28, 2025",â 1.36%
1,"Apr 30, 2025",â 1.34%
2,"Mar 31, 2025",â 1.32%
3,"Feb 28, 2025",â 1.24%
4,"Jan 31, 2025",â 1.25%
...,...,...
1848,"May 31, 1871",5.35%
1849,"Apr 30, 1871",5.49%
1850,"Mar 31, 1871",5.64%
1851,"Feb 28, 1871",5.78%


### Access meta-data and actual data of the `sp500` data-frame

Tables are grabbed and presented to you in the form of `DataFrame`, so use any functions or attributes that you known about `DataFrame` to query it now.

```
sp500.shape
sp500.head()
sp500.tail()
sp500.info()
sp500.describe()
sp500[0:5]
```

In [2025]:
sp500.shape

(1853, 2)

In [2027]:
sp500.head()

Unnamed: 0,Date,Value
0,"Apr 28, 2025",â 1.36%
1,"Apr 30, 2025",â 1.34%
2,"Mar 31, 2025",â 1.32%
3,"Feb 28, 2025",â 1.24%
4,"Jan 31, 2025",â 1.25%


In [2029]:
sp500.tail()

Unnamed: 0,Date,Value
1848,"May 31, 1871",5.35%
1849,"Apr 30, 1871",5.49%
1850,"Mar 31, 1871",5.64%
1851,"Feb 28, 1871",5.78%
1852,"Jan 31, 1871",5.86%


In [2031]:
sp500.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1853 entries, 0 to 1852
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Date    1853 non-null   object
 1   Value   1853 non-null   object
dtypes: object(2)
memory usage: 29.1+ KB


In [2059]:
sp500.describe()

Unnamed: 0,Percent
count,1853.0
mean,0.042294
std,0.017548
min,0.0111
25%,0.0302
50%,0.042
75%,0.0534
max,0.1384


In [2035]:
sp500[0:10]

Unnamed: 0,Date,Value
0,"Apr 28, 2025",â 1.36%
1,"Apr 30, 2025",â 1.34%
2,"Mar 31, 2025",â 1.32%
3,"Feb 28, 2025",â 1.24%
4,"Jan 31, 2025",â 1.25%
5,"Dec 31, 2024",1.24%
6,"Nov 30, 2024",1.25%
7,"Oct 31, 2024",1.28%
8,"Sep 30, 2024",1.31%
9,"Aug 31, 2024",1.33%


### Data Tidying
- Getting Rid of Unwanted Characters
- Convert percentage strinp to float value

Use strip() function to get rid of `'â\x80 '` and `%`character

In [2061]:
sp500["Value"][0]

'â\x80 1.36%'

In [2063]:
sp500["Value"][0].strip("â\x80 ")

'1.36%'

In [2065]:
sp500["Value"][0].strip("â\x80 ").strip('%')

'1.36'

In [2067]:
sp500["Value"]

0       â 1.36%
1       â 1.34%
2       â 1.32%
3       â 1.24%
4       â 1.25%
          ...   
1848       5.35%
1849       5.49%
1850       5.64%
1851       5.78%
1852       5.86%
Name: Value, Length: 1853, dtype: object

In [2069]:
sp500["Percent"] = sp500["Value"].apply(lambda x: float(x.strip("â\x80 ").strip("%")) / 100)

In [2071]:
sp500.drop(columns = ['Value'], inplace=True)

In [2073]:
print(sp500)

              Date  Percent
0     Apr 28, 2025   0.0136
1     Apr 30, 2025   0.0134
2     Mar 31, 2025   0.0132
3     Feb 28, 2025   0.0124
4     Jan 31, 2025   0.0125
...            ...      ...
1848  May 31, 1871   0.0535
1849  Apr 30, 1871   0.0549
1850  Mar 31, 1871   0.0564
1851  Feb 28, 1871   0.0578
1852  Jan 31, 1871   0.0586

[1853 rows x 2 columns]


## `read_html()` doesn't always work

- There are too many broken HTML codes out there
- Data are always presented in the format of table
- Some websites implement blocking policy
- Modern webpage involving more and more JavaScript programming.  A webpage might start EMPTY at first and actual page contents are generated by client-side.

**The following `read_html()` call fails**  
It throws `HTTPError: HTTP Error 403: Forbidden`
```
hkej_url = 'https://stock360.hkej.com/marketWatch/Top20/topGainers'
raw_html_tbl2 = pd.read_html(hkej_url)
raw_html_tbl2
```

In [2075]:
hkej_url = 'https://stock360.hkej.com/marketWatch/Top20/topGainers'
raw_html_tbl2 = pd.read_html(hkej_url)
raw_html_tbl2

HTTPError: HTTP Error 403: Forbidden

# USING `BeautifulSoup` TO AUTOMATE DATA GRABBING

You can extract HTML element by using BeautifulSoup.  

**BeautifulSoup** is a popular web scraping tool.  

Besides BeautifulSoup, **Scrapy** and **Selenium** are also widely used.

## Check if `beautifulsoup4` is install

Run the following magic command to check if beautifulsoup4 is installed on your computer.  
`!conda list beautifulsoup4`

If beautifulsoup4 is NOT installed, run the following magic commnad to install it.  
`conda install beautifulsoup4`

In [2078]:
!conda list beautifulsoup4

# packages in environment at /opt/anaconda3:
#
# Name                    Version                   Build  Channel
beautifulsoup4            4.12.3          py312hca03da5_0  


## Import `BeautifulSoup`
Import `BeautifulSoup` before you use it

```
from bs4 import BeautifulSoup
```

In [2080]:
from bs4 import BeautifulSoup

## Let's start with simple dummy HTML

**Declare the following HTML documents**

```
html_doc = """<!DOCTYPE html>
<html lang="en" dir="ltr">
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Today's News</title>
    <style>
      #website-name {
        color: rgb(164, 11, 11);
      }
      #website-name span {
        color: black;
        font-size: 0.5em;
      }
      .news-title {
        text-transform: uppercase;
      }
      h2 {
        color: rgb(164, 11, 11);
      }

      article {
        border-bottom: solid 1px grey;
      }

      aside {
        border: solid 1px #ccc;
        padding: 10px;
      }
    </style>
  </head>
  <body>
    <h1 id="website-name">Today's News <span>(An ABC Company)</span></h1>
    <b id="today" class="date-style">Date: 2025-04-30</b>
    <hr />
    <main>
      <article id="cover-story">
        <h2 class="news-title">News 001</h2>
        <a href="news001.html" class="news-cover-photo" id="cover-story-photo"
          ><img
            src="https://placehold.co/600x400?text=Dummy+Photo+1"
            alt="photo"
        /></a>
        <br /><i>2025-01-01</i>
        <p>
          <b>News 001</b> dolor sit amet, consectetur adipisicing elit, sed do
          eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
          minim veniam, quis nostrud exercitation ullamco laboris nisi ut
          aliquip ex ea commodo consequat. Duis aute irure dolor in
          reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
          pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
          culpa qui officia deserunt mollit anim id est laborum.
        </p>
      </article>

      <article class="featured">
        <h2 class="news-title">News 002</h2>
        <a href="news002.html" class="news-cover-photo"
          ><img
            src="https://placehold.co/600x400?text=Dummy+Photo+2"
            alt="photo"
        /></a>
        <br /><i>2025-01-01</i>
        <p>
          <b>News 002</b> dolor sit amet, consectetur adipisicing elit, sed do
          eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
          minim veniam, quis nostrud exercitation ullamco laboris nisi ut
          aliquip ex ea commodo consequat. Duis aute irure dolor in
          reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
          pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
          culpa qui officia deserunt mollit anim id est laborum.
        </p>
      </article>

      <article class="featured">
        <h2 class="news-title">News 003</h2>
        <a href="news003.html" class="news-cover-photo"
          ><img
            src="https://placehold.co/600x400?text=Dummy+Photo+3"
            alt="photo"
        /></a>
        <br /><i>2025-01-01</i>
        <p>
          <b>News 003</b> dolor sit amet, consectetur adipisicing elit, sed do
          eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
          minim veniam, quis nostrud exercitation ullamco laboris nisi ut
          aliquip ex ea commodo consequat. Duis aute irure dolor in
          reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
          pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
          culpa qui officia deserunt mollit anim id est laborum.
        </p>
      </article>
    </main>

    <aside class="related-news">
      <h2 id="related-news-section-heading">Related News</h2>
      <a href="news001.html" class="related-news-link">News 101</a><br />
      <a href="news002.html" class="related-news-link">News 102</a><br />
      <a href="news003.html" class="related-news-link">News 103</a><br />
      <a href="news004.html" class="related-news-link">News 104</a><br />
      <a href="news005.html" class="related-news-link">News 105</a><br />
      <button class="related-news-link">Show more</button>
    </aside>

    <hr />
    <footer><span>ABC Company</span>. All rights reserved.</footer>
  </body>
</html>
"""
```

In [2222]:
html_doc = """<!DOCTYPE html>
<html lang="en" dir="ltr">
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Today's News</title>
    <style>
      #website-name {
        color: rgb(164, 11, 11);
      }
      #website-name span {
        color: black;
        font-size: 0.5em;
      }
      .news-title {
        text-transform: uppercase;
      }
      h2 {
        color: rgb(164, 11, 11);
      }

      article {
        border-bottom: solid 1px grey;
      }

      aside {
        border: solid 1px #ccc;
        padding: 10px;
      }
    </style>
  </head>
  <body>
    <h1 id="website-name">Today's News <span>(An ABC Company)</span></h1>
    <b id="today" class="date-style">Date: 2025-04-30</b>
    <hr />
    <main>
      <article id="cover-story">
        <h2 class="news-title">News 001</h2>
        <a href="news001.html" class="news-cover-photo" id="cover-story-photo"
          ><img
            src="https://placehold.co/600x400?text=Dummy+Photo+1"
            alt="photo"
        /></a>
        <br /><i>2025-01-01</i>
        <p>
          <b>News 001</b> dolor sit amet, consectetur adipisicing elit, sed do
          eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
          minim veniam, quis nostrud exercitation ullamco laboris nisi ut
          aliquip ex ea commodo consequat. Duis aute irure dolor in
          reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
          pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
          culpa qui officia deserunt mollit anim id est laborum.
        </p>
      </article>

      <article class="featured">
        <h2 class="news-title">News 002</h2>
        <a href="news002.html" class="news-cover-photo"
          ><img
            src="https://placehold.co/600x400?text=Dummy+Photo+2"
            alt="photo"
        /></a>
        <br /><i>2025-01-01</i>
        <p>
          <b>News 002</b> dolor sit amet, consectetur adipisicing elit, sed do
          eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
          minim veniam, quis nostrud exercitation ullamco laboris nisi ut
          aliquip ex ea commodo consequat. Duis aute irure dolor in
          reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
          pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
          culpa qui officia deserunt mollit anim id est laborum.
        </p>
      </article>

      <article class="featured">
        <h2 class="news-title">News 003</h2>
        <a href="news003.html" class="news-cover-photo"
          ><img
            src="https://placehold.co/600x400?text=Dummy+Photo+3"
            alt="photo"
        /></a>
        <br /><i>2025-01-01</i>
        <p>
          <b>News 003</b> dolor sit amet, consectetur adipisicing elit, sed do
          eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
          minim veniam, quis nostrud exercitation ullamco laboris nisi ut
          aliquip ex ea commodo consequat. Duis aute irure dolor in
          reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
          pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
          culpa qui officia deserunt mollit anim id est laborum.
        </p>
      </article>
    </main>

    <aside class="related-news">
      <h2 id="related-news-section-heading">Related News</h2>
      <a href="news001.html" class="related-news-link">News 101</a><br />
      <a href="news002.html" class="related-news-link">News 102</a><br />
      <a href="news003.html" class="related-news-link">News 103</a><br />
      <a href="news004.html" class="related-news-link">News 104</a><br />
      <a href="news005.html" class="related-news-link">News 105</a><br />
      <button class="related-news-link">Show more</button>
    </aside>

    <hr />
    <footer><span>ABC Company</span>. All rights reserved.</footer>
  </body>
</html>

"""

### Createing BeautifulSoup Object

`soup = BeautifulSoup(html_doc, 'html.parser')`

In [2224]:
soup = BeautifulSoup(html_doc, 'html.parser')

## Use `prettify()` function to show clear HTML codes

The following command will display a neat output
```
print(soup.prettify())
```

In [2084]:
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Today's News
  </title>
  <style>
   #website-name {
        color: rgb(164, 11, 11);
      }
      #website-name span {
        color: black;
        font-size: 0.5em;
      }
      .news-title {
        text-transform: uppercase;
      }
      h2 {
        color: rgb(164, 11, 11);
      }

      article {
        border-bottom: solid 1px grey;
      }

      aside {
        border: solid 1px #ccc;
        padding: 10px;
      }
  </style>
 </head>
 <body>
  <h1 id="website-name">
   Today's News
   <span>
    (An ABC Company)
   </span>
  </h1>
  <b class="date-style" id="today">
   Date: 2025-04-30
  </b>
  <hr/>
  <main>
   <article id="cover-story">
    <h2 class="news-title">
     News 001
    </h2>
    <a class="news-cover-photo" href="news001.html" id="cover-story-photo">
     <img alt="photo" src="https://placehold.

##  Use `find()` function to retrieve child elements

A HTML file is usually long.  It can easy contain hundred or thousand of lines of code.  We can we `find()` function to target our tag.

Examples:
```
soup.find('title')
type(soup.find('title'))
soup.find('h1')
soup.find('p')
p = soup.find('p')
type(p)
```
`find()` _function will only return ONE SINGLE element even if there are multiple matched_

In [2086]:
soup.find('title')

<title>Today's News</title>

In [2088]:
type(soup.find('title'))

bs4.element.Tag

In [2090]:
soup.find('h1')

<h1 id="website-name">Today's News <span>(An ABC Company)</span></h1>

In [2092]:
soup.find('p')

<p>
<b>News 001</b> dolor sit amet, consectetur adipisicing elit, sed do
          eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
          minim veniam, quis nostrud exercitation ullamco laboris nisi ut
          aliquip ex ea commodo consequat. Duis aute irure dolor in
          reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
          pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
          culpa qui officia deserunt mollit anim id est laborum.
        </p>

In [2094]:
a_paragraph = soup.find('p')
type(a_paragraph)

bs4.element.Tag

## Retrieve tag's content

Use the following properties name to get the contents of a tag  

- `.text`	Extracts all text within an element, including **nested tags**.
  Useful for grabbing full text content of an element  

- `.string`	Returns text only if the element has a single text node, otherwise `None`.
  Works when an element contains direct text without nested tags  

- `.content`	Retrieves the raw binary content (bytes) of an HTML element.
  Useful for extracting non-text elements like images  

### let's retrieve a simple tag WITHOUT child tag

In [2098]:
a_heading_2 = soup.find('h2')

In [2100]:
a_heading_2

<h2 class="news-title">News 001</h2>

In [2102]:
a_heading_2.text

'News 001'

In [2104]:
a_heading_2.string

'News 001'

### let's retrieve a tag with child tag

In [2107]:
a_paragraph = soup.find('p')

In [2109]:
a_paragraph.text

'\nNews 001 dolor sit amet, consectetur adipisicing elit, sed do\n          eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad\n          minim veniam, quis nostrud exercitation ullamco laboris nisi ut\n          aliquip ex ea commodo consequat. Duis aute irure dolor in\n          reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla\n          pariatur. Excepteur sint occaecat cupidatat non proident, sunt in\n          culpa qui officia deserunt mollit anim id est laborum.\n        '

In [2111]:
a_paragraph.string # this get nothing as the paragraph has nested child tag

## `find()` vs. `find_all`

- `find()` Finds the first matching element. Single element (Tag object) or `None` if not found.	Extracting a single heading (`<h2>`), first paragraph, etc.
- `find_all()` Finds all matching elements. List of Tag objects (empty list if no match).	Extracting all links (`<h2>`).

In [2115]:
soup.find('h2')

<h2 class="news-title">News 001</h2>

In [2117]:
type(soup.find('h2'))

bs4.element.Tag

In [2119]:
soup.find_all('h2')

[<h2 class="news-title">News 001</h2>,
 <h2 class="news-title">News 002</h2>,
 <h2 class="news-title">News 003</h2>,
 <h2 id="related-news-section-heading">Related News</h2>]

---
**Check the type**
When you use `find_all`, it returns a ResultSet (a list of item found)

In [2121]:
type(soup.find_all('h2'))

bs4.element.ResultSet

---
**Use `[]` to refer to an item in the resultset**

`soup.find_all('h2')[0]`

In [2127]:
# returns the first one: use 0 as index
soup.find_all('h2')[0]

<h2 class="news-title">News 001</h2>

In [2129]:
# returns the second one: use 1 as index
soup.find_all('h2')[1] 

<h2 class="news-title">News 002</h2>

In [2131]:
# returns the last one: use -1 as index
soup.find_all('h2')[-1]

<h2 id="related-news-section-heading">Related News</h2>

In [2135]:
# returns the second last one: use -2 as index
soup.find_all('h2')[-2]

<h2 class="news-title">News 003</h2>

## find by tag name, class name or id
You can find target tag by it's tag name, tag's class or id attributes  

![](https://codetheweb.blog/assets/img/posts/html-syntax/tag-structure-2.png)

### by tag names / element name
This approach is easy.  But this approach will usually targeting TOO  MANY tags (considing a HTML page can easily contain thousand of lines
```
find('h1')
find('h2')
find('p')
```
### by css `class_` attribute
This approach let you target tags with certain css class name.  Parameter name is `class_` instead of `class`, because `class` is a reserved keyword in Python
```
soup.find_all(class_='news-title')
soup.find_all(class_='featured')

```
### by `id` attribute
Use id when targeting unique elements.  
_Note_
- `id` is a unique value among a HTML document. So you should expecting only one matched tag.  
- However there could be exception as it's quite common that HMTL codes are buggy and messy.
```
soup.find_all(id='website-name')
soup.find_all(id='related-news-section-heading')
```


In [2137]:
soup.find_all('h2')

[<h2 class="news-title">News 001</h2>,
 <h2 class="news-title">News 002</h2>,
 <h2 class="news-title">News 003</h2>,
 <h2 id="related-news-section-heading">Related News</h2>]

In [2139]:
soup.find_all(class_='news-title')

[<h2 class="news-title">News 001</h2>,
 <h2 class="news-title">News 002</h2>,
 <h2 class="news-title">News 003</h2>]

In [2141]:
soup.find_all(id='website-name')

[<h1 id="website-name">Today's News <span>(An ABC Company)</span></h1>]

In [2143]:
soup.find_all(id='related-news-section-heading')

[<h2 id="related-news-section-heading">Related News</h2>]

## Use  `.` to refer a child element
To retrieve the `<title>` child tag, use `soup.title`

Or other child elements
```
soup.meta
soup.h1
soup.h2
soup.footer
```

This approach will only return ONE object

In [2146]:
soup.title

<title>Today's News</title>

In [2148]:
soup.meta

<meta charset="utf-8"/>

In [2150]:
soup.h1

<h1 id="website-name">Today's News <span>(An ABC Company)</span></h1>

In [2152]:
soup.h1.span

<span>(An ABC Company)</span>

In [2154]:
soup.h1.span.text

'(An ABC Company)'

In [2156]:
soup.h2

<h2 class="news-title">News 001</h2>

In [2158]:
soup.footer

<footer><span>ABC Company</span>. All rights reserved.</footer>

## Getting a tag
```
a_tag = soup.title 
a_tag.name
```

In [2161]:
soup.title

<title>Today's News</title>

In [2163]:
a_tag = soup.title 

In [2165]:
a_tag.name

'title'

In [2167]:
a_tag.text

"Today's News"

In [2169]:
a_tag.string

"Today's News"

## Get the parent tag
`.parent` gives the parent tag of current tag
```
a_tag.parent
a_tag.parent.name
a_tag.parent.string
```

In [2172]:
a_tag

<title>Today's News</title>

In [2174]:
a_tag.parent

<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Today's News</title>
<style>
      #website-name {
        color: rgb(164, 11, 11);
      }
      #website-name span {
        color: black;
        font-size: 0.5em;
      }
      .news-title {
        text-transform: uppercase;
      }
      h2 {
        color: rgb(164, 11, 11);
      }

      article {
        border-bottom: solid 1px grey;
      }

      aside {
        border: solid 1px #ccc;
        padding: 10px;
      }
    </style>
</head>

In [2176]:
a_tag.parent.name

'head'

In [2178]:
a_tag.parent.text

"\n\n\nToday's News\n\n"

## Tag attributes

The diagram below explain what is a tag's attributes.
![](https://www.scientecheasy.com/wp-content/uploads/2023/03/img-html-attributes.png)

Showing attribute
```
a_tag = soup.a
a_tag
a_tag["class"]
a_tag["href"]
a_tag["id"]
a_tag.attrs
```

`.attrs` lists all attributes

In [2190]:
a_tag = soup.a
a_tag

<a class="news-cover-photo" href="news001.html" id="cover-story-photo"><img alt="photo" src="https://placehold.co/600x400?text=Dummy+Photo+1"/></a>

In [2192]:
a_tag["class"]

['news-cover-photo']

In [2194]:
a_tag["href"]

'news001.html'

In [2196]:
a_tag["id"]

'cover-story-photo'

In [2198]:
a_tag.attrs # show all the attributes of a tag

{'href': 'news001.html',
 'class': ['news-cover-photo'],
 'id': 'cover-story-photo'}

## Specifying both `tag` name and `class` name

We previously use `class_` to find matching elemenet, this approach will return all type of HTML tag that has a matching CSS class name. 
`soup.find_all(class_='related-news-link')`

We can include `tag` name as the first parameter to narrowing search only for certain type of HTML tags

This searches for `<a>` tag with css class named `related-news-link`   
`soup.find_all('a', class_='related-news-link')`

This searches for `<button>` tag with css class named `related-news-link`  
`soup.find_all('button', class_='related-news-link')`


In [2236]:
soup.find_all(class_='related-news-link')

[<a class="related-news-link" href="news001.html">News 101</a>,
 <a class="related-news-link" href="news002.html">News 102</a>,
 <a class="related-news-link" href="news003.html">News 103</a>,
 <a class="related-news-link" href="news004.html">News 104</a>,
 <a class="related-news-link" href="news005.html">News 105</a>,
 <button class="related-news-link">Show more</button>]

In [2238]:
soup.find_all('a', class_='related-news-link')

[<a class="related-news-link" href="news001.html">News 101</a>,
 <a class="related-news-link" href="news002.html">News 102</a>,
 <a class="related-news-link" href="news003.html">News 103</a>,
 <a class="related-news-link" href="news004.html">News 104</a>,
 <a class="related-news-link" href="news005.html">News 105</a>]

In [2240]:
soup.find_all('button', class_='related-news-link')

[<button class="related-news-link">Show more</button>]

## Limiting the number in the search result

Use `limit` parameter to specify how many items you are expecting

In the following example, we limit the search to **TWO**
```
soup.find_all('h2', limit=2) 

```

In [2248]:
soup.find_all('h2')

[<h2 class="news-title">News 001</h2>,
 <h2 class="news-title">News 002</h2>,
 <h2 class="news-title">News 003</h2>,
 <h2 id="related-news-section-heading">Related News</h2>]

In [2250]:
soup.find_all('h2', limit=2)

[<h2 class="news-title">News 001</h2>, <h2 class="news-title">News 002</h2>]

## Search by using advanced CSS selectors
If you are an experienced HTML/CSS coding, you can use complex css selector to be more targeted on small part of the HTML contents.  

Here we use `select()` function by specifying `css selector` as parameter.  It returns all the matching elements.
```
soup.select('body b')
soup.select('p b')
soup.select('body>b')
soup.select('body>p>b')
```

In [2253]:
soup.select('h2')

[<h2 class="news-title">News 001</h2>,
 <h2 class="news-title">News 002</h2>,
 <h2 class="news-title">News 003</h2>,
 <h2 id="related-news-section-heading">Related News</h2>]

In [2255]:
soup.select('article h2')

[<h2 class="news-title">News 001</h2>,
 <h2 class="news-title">News 002</h2>,
 <h2 class="news-title">News 003</h2>]

In [2257]:
soup.select('aside h2')

[<h2 id="related-news-section-heading">Related News</h2>]

In [2259]:
soup.select('b')

[<b class="date-style" id="today">Date: 2025-04-30</b>,
 <b>News 001</b>,
 <b>News 002</b>,
 <b>News 003</b>]

In [2261]:
soup.select('body>b')

[<b class="date-style" id="today">Date: 2025-04-30</b>]

In [2263]:
soup.select('article>p>b')

[<b>News 001</b>, <b>News 002</b>, <b>News 003</b>]

# PRACTICAL SCRPAING USING `requests` PACKAGE
In the previous section, we use a simple **HTML strings** to demonstrate the how to `find()`, `find_all()` and `select()` HTML tags/elements in our targeted HTML string becuase actually web page usually are very long and messy.  We learn BeautifulSoup skills by process a simple document.

In real life, we need to initiate **HTTP request** to webpage(s) over the internet directly and convert the returned HTML texts to a BeautifulSoup object.  

To issue HTTPS request, we need to import `requests` package.
![HTTP Protocol](https://miro.medium.com/v2/resize:fit:853/1*8-fT6K1o6nHiBRxKppcqOg.png)

## Imporint packages

In [2266]:
import requests
from bs4 import BeautifulSoup

## Defining `User-Agent` for HTTP Request Header

Defining a `User-Agent` when web scraping is important because many websites check this header to determine whether the request is coming from a browser or a bot.  

If a request lacks a User-Agent or looks suspicious, websites may block or rate-limit it.

**Why Use a User-Agent?** 
1. Avoid Blocks & Restrictions – Websites may reject requests from bots or unknown sources.
2. Mimic a Real Browser – Helps make the request appear more like human activity.
3. Access Site Content Properly – Some sites serve different content based on the User-Agent.
4. Bypass Captchas or Anti-Bot Measures – Many sites block automated scraping tools.

In Python, headers are defined using `dict` type.
```
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
```

In [2269]:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

## Calling `requests.get()` function

We need to pass extra parameter `hearders` when calling `get()` function.

```
response = requests.get("https://hk.yahoo.com", headers=headers)
```

The HTTP Response from server is stored as `response` for later use.

In [2272]:
response = requests.get("https://hk.yahoo.com", headers=headers)

## Checking on `response` 

Use`.status_code` to check the response status code.

To retrieve response body, you  refer to response `.text` or `.content` attribute
- Use `.text` when you need the response as a readable string (e.g., web pages, JSON).
- Use `.content` when working with binary data like images, PDFs, or file downloads.

In [2275]:
response.status_code

200

In [2277]:
response.text



In [2279]:
response.content



In [2281]:
response.url

'https://hk.yahoo.com/'

## Converting Response Text to BeautifulSoup Object
yahoo = BeautifulSoup(response.content, "html.parser")

In [2284]:
yahoo = BeautifulSoup(response.content, "html.parser")
yahoo

<!DOCTYPE html>
<html class="" data-region="HK" id="atomic" lang="zh-Hant-HK"><head><title>Yahoo Hong Kong 雅虎香港</title><meta content="text/html; charset=utf-8" http-equiv="content-type"/><meta content="on" http-equiv="x-dns-prefetch-control"/><meta content="chrome=1" http-equiv="X-UA-Compatible"/><meta content="在Yahoo! 首頁，您可以一覽即時重要社會新聞、國際新知、娛樂消息、生活資訊；您更可以加入Yahoo! 會員計劃，天天累積點數，免費兌換禮物、美食、優惠券，著數多多！" name="description"/><meta content="在Yahoo! 首頁，您可以一覽即時重要社會新聞、國際新知、娛樂消息、生活資訊；您更可以加入Yahoo! 會員計劃，天天累積點數，免費兌換禮物、美食、優惠券，著數多多！" property="og:description"/><meta content="yahoo, yahoo香港, yahoo首頁, yahoo香港首頁, 雅虎首頁, Yahoo Hong Kong 雅虎香港, 雅虎香港, 新聞, 財經, 體育, 娛樂, TV, 團購, 購物, Style, 旅遊, 電影" name="keywords"/><meta content="131943696968253" property="fb:app_id"/><meta content="https://s.yimg.com/cv/apiv2/social/images/yahoo_default_logo.png" property="og:image"/><meta content="http://hk.yahoo.com" property="og:url"/><meta content="Yahoo Hong Kong 雅虎香港" property="og:title"/><meta content="guce.yahoo.com" name="oa

In [2286]:
yahoo.find('title')

<title>Yahoo Hong Kong 雅虎香港</title>

In [2288]:
yahoo.find_all('meta')

[<meta content="text/html; charset=utf-8" http-equiv="content-type"/>,
 <meta content="on" http-equiv="x-dns-prefetch-control"/>,
 <meta content="chrome=1" http-equiv="X-UA-Compatible"/>,
 <meta content="在Yahoo! 首頁，您可以一覽即時重要社會新聞、國際新知、娛樂消息、生活資訊；您更可以加入Yahoo! 會員計劃，天天累積點數，免費兌換禮物、美食、優惠券，著數多多！" name="description"/>,
 <meta content="在Yahoo! 首頁，您可以一覽即時重要社會新聞、國際新知、娛樂消息、生活資訊；您更可以加入Yahoo! 會員計劃，天天累積點數，免費兌換禮物、美食、優惠券，著數多多！" property="og:description"/>,
 <meta content="yahoo, yahoo香港, yahoo首頁, yahoo香港首頁, 雅虎首頁, Yahoo Hong Kong 雅虎香港, 雅虎香港, 新聞, 財經, 體育, 娛樂, TV, 團購, 購物, Style, 旅遊, 電影" name="keywords"/>,
 <meta content="131943696968253" property="fb:app_id"/>,
 <meta content="https://s.yimg.com/cv/apiv2/social/images/yahoo_default_logo.png" property="og:image"/>,
 <meta content="http://hk.yahoo.com" property="og:url"/>,
 <meta content="Yahoo Hong Kong 雅虎香港" property="og:title"/>,
 <meta content="guce.yahoo.com" name="oath:guce:consent-host"/>,
 <meta content="width=device-width, initial-scale=1" name="view

In [2290]:
yahoo.find(id='module-featurebar')

<div class="wafer-rapid-module featurebar" id="module-featurebar"><div class="react-wafer-apac-featurebar Pos(r) H(40px) Ov(h)"><a class="featurebar-content Pos(a) Bxz(bb) W(100%) H(40px) Px(20px) Py(10px) Bdrs(8px) D(f) Ai(c) Jc(sb) Td(n) Bgc(varHighlightBlue)" data-ylk="cpos:1;ct:STORY;elm:hdln;g:6abef2ec-a541-3975-bad7-0cb8993d97af;itc:0;pos:1;sec:featurebar;slk:古天樂正研發電影新模式救市　游學修頻道結束仍深信網片價值;" href="https://hk.news.yahoo.com/yahoo%E5%A8%9B%E6%A8%82%E5%9C%88%EF%BD%9C%E3%80%8A%E9%80%81%E9%99%A2%E9%80%94%E4%B8%AD%E3%80%8B%E5%B0%88%E8%A8%AA-%E5%8F%A4%E5%A4%A9%E6%A8%82%E6%AD%A3%E7%A0%94%E7%99%BC%E9%9B%BB%E5%BD%B1%E6%96%B0%E6%A8%A1%E5%BC%8F%E6%95%91%E5%B8%82-%E6%B8%B8%E5%AD%B8%E4%BF%AE%E9%A0%BB%E9%81%93%E7%B5%90%E6%9D%9F%E4%BB%8D%E6%B7%B1%E4%BF%A1%E7%B6%B2%E7%89%87%E5%83%B9%E5%80%BC%EF%BC%9A%E7%B6%B2%E7%B5%A1%E4%BB%8D%E6%98%AF%E5%BE%88%E5%A5%BD%E7%9A%84%E5%9C%9F%E5%A3%A4%E5%8E%BB%E5%89%B5%E4%BD%9C-020200115.html" style="transform:translateY(0px);z-index:1"><div class="D(f) Ai(c) W(100%)"><

In [2292]:
yahoo.find(id='module-featurebar').text

'《送院途中》專訪古天樂正研發電影新模式救市\u3000游學修頻道結束仍深信網片價值'

In [2294]:
yahoo.find_all(class_='apac-ntk-item')

[<a class="apac-ntk-item ntk-hero D(b) Pos(r) W(100%) Ov(h) V(h) active_V(v) Bdrs(8px) H(236px) H(250px)--sm1024" data-ylk="elm:img;elmt:ct;cpos:1;itc:0;pkgt:need_to_know;pos:1;subsec:needtoknow;sec:strm;ccode:ntk_single_feed__zh-Hant-HK__frontpage__pinning__default__desktop__ga__noSplit;ct:story;expb:0;g:0e3739ba-72f6-46bb-9079-e613142cd627;slk:元朗幼稚園下午起火　過百師生疏散;cposy:1;aid:2f3d6140-765a-3f16-bee3-50b0eee88df8;p_sys:jarvis;" href="https://hk.news.yahoo.com/%E5%85%83%E6%9C%97%E5%B9%BC%E7%A8%9A%E5%9C%92%E4%B8%8B%E5%8D%88%E8%B5%B7%E7%81%AB-%E9%81%8E%E7%99%BE%E5%B8%AB%E7%94%9F%E7%96%8F%E6%95%A3%EF%BD%9Cyahoo-085134901.html"><img alt="元朗幼稚園下午起火　過百師生疏散" class="apac-ntk-item-image H(100%) W(100%) Bdrs(8px) Objf(cv)" src="https://s.yimg.com/uu/api/res/1.2/uDwXawwhVauDuv6yCiqXRg--~B/Zmk9ZmlsbDtoPTQ3MjtweW9mZj0wO3c9ODA4O2FwcGlkPXl0YWNoeW9u/https://s.yimg.com/os/creatr-uploaded-images/2025-04/15f16bf0-24d6-11f0-bfe7-62411ac9cbc8"/><div class="Pos(a) Start(0) B(0) Px(16px) Pb(16px) Pt(94px) C(whit

In [2296]:
type(yahoo.find_all(class_='apac-ntk-item'))

bs4.element.ResultSet

In [2298]:
yahoo.find_all(class_='apac-ntk-item')[0]

<a class="apac-ntk-item ntk-hero D(b) Pos(r) W(100%) Ov(h) V(h) active_V(v) Bdrs(8px) H(236px) H(250px)--sm1024" data-ylk="elm:img;elmt:ct;cpos:1;itc:0;pkgt:need_to_know;pos:1;subsec:needtoknow;sec:strm;ccode:ntk_single_feed__zh-Hant-HK__frontpage__pinning__default__desktop__ga__noSplit;ct:story;expb:0;g:0e3739ba-72f6-46bb-9079-e613142cd627;slk:元朗幼稚園下午起火　過百師生疏散;cposy:1;aid:2f3d6140-765a-3f16-bee3-50b0eee88df8;p_sys:jarvis;" href="https://hk.news.yahoo.com/%E5%85%83%E6%9C%97%E5%B9%BC%E7%A8%9A%E5%9C%92%E4%B8%8B%E5%8D%88%E8%B5%B7%E7%81%AB-%E9%81%8E%E7%99%BE%E5%B8%AB%E7%94%9F%E7%96%8F%E6%95%A3%EF%BD%9Cyahoo-085134901.html"><img alt="元朗幼稚園下午起火　過百師生疏散" class="apac-ntk-item-image H(100%) W(100%) Bdrs(8px) Objf(cv)" src="https://s.yimg.com/uu/api/res/1.2/uDwXawwhVauDuv6yCiqXRg--~B/Zmk9ZmlsbDtoPTQ3MjtweW9mZj0wO3c9ODA4O2FwcGlkPXl0YWNoeW9u/https://s.yimg.com/os/creatr-uploaded-images/2025-04/15f16bf0-24d6-11f0-bfe7-62411ac9cbc8"/><div class="Pos(a) Start(0) B(0) Px(16px) Pb(16px) Pt(94px) C(white

In [2300]:
yahoo.find_all(class_='apac-ntk-item')[0].text

'元朗幼稚園下午起火\u3000過百師生疏散'

In [2302]:
yahoo.find_all(class_='apac-ntk-item')[0].attrs

{'class': ['apac-ntk-item',
  'ntk-hero',
  'D(b)',
  'Pos(r)',
  'W(100%)',
  'Ov(h)',
  'V(h)',
  'active_V(v)',
  'Bdrs(8px)',
  'H(236px)',
  'H(250px)--sm1024'],
 'data-ylk': 'elm:img;elmt:ct;cpos:1;itc:0;pkgt:need_to_know;pos:1;subsec:needtoknow;sec:strm;ccode:ntk_single_feed__zh-Hant-HK__frontpage__pinning__default__desktop__ga__noSplit;ct:story;expb:0;g:0e3739ba-72f6-46bb-9079-e613142cd627;slk:元朗幼稚園下午起火\u3000過百師生疏散;cposy:1;aid:2f3d6140-765a-3f16-bee3-50b0eee88df8;p_sys:jarvis;',
 'href': 'https://hk.news.yahoo.com/%E5%85%83%E6%9C%97%E5%B9%BC%E7%A8%9A%E5%9C%92%E4%B8%8B%E5%8D%88%E8%B5%B7%E7%81%AB-%E9%81%8E%E7%99%BE%E5%B8%AB%E7%94%9F%E7%96%8F%E6%95%A3%EF%BD%9Cyahoo-085134901.html'}

In [2304]:
yahoo.find_all(class_='apac-ntk-item')[0]['class']

['apac-ntk-item',
 'ntk-hero',
 'D(b)',
 'Pos(r)',
 'W(100%)',
 'Ov(h)',
 'V(h)',
 'active_V(v)',
 'Bdrs(8px)',
 'H(236px)',
 'H(250px)--sm1024']

In [2306]:
yahoo.find_all(class_='apac-ntk-item')[0]['href']

'https://hk.news.yahoo.com/%E5%85%83%E6%9C%97%E5%B9%BC%E7%A8%9A%E5%9C%92%E4%B8%8B%E5%8D%88%E8%B5%B7%E7%81%AB-%E9%81%8E%E7%99%BE%E5%B8%AB%E7%94%9F%E7%96%8F%E6%95%A3%EF%BD%9Cyahoo-085134901.html'