---
# INTERMEDIATE PYTHON PROGRAMMING
# CHAPTER 4 - Web Scraping Using BeautifulSoup
---


# WEB SCRAPING INTRODUCTION

When an open data source is not available, you can write your own Python code to collect data from the web.  
This is known as web scraping.  
![](https://www.promptcloud.com/wp-content/uploads/2024/03/1_CxVccbFGtv6W2qlq0A4hxw-1024x499.png.webp)


**However, do pay attention to the following**
- Grabbing data from HTML is not always easy (HTML code is often messy).
- Many websites implement protections to prevent data grabbing.
- Modern web applications often generate their content on the fly at the client side.  That means the HTML page can be nearly empty at the beginning, and the content is filled in progressively by client-side JavaScript.
- Websites have terms and conditions on accessing their data.  (In our case, we are doing this only for learning purposes, so we will be fine.)

## Use `pandas.read_html()` when you can

Before you actually implement your scraping code, consider whether an easier approach would do the job.  In this section, we use the pandas built-in function `read_html()` to read a webpage that contains the targeted data in the form of an HTML table.

Below is the code of an HTML table and its corresponding appearance in a browser.
![](https://dotnettutorials.net/wp-content/uploads/2021/11/word-image-533.png)

Use the `read_html()` function to read a `url` (the address of a targeted webpage) that you know contains the data you want in **HTML table** form.   

**E.g.**:
```
    pd.read_html("WEB-PAGE-URL")
```

## `pandas.read_html()` Syntax
`pandas.read_html()` requires the **link of the web page** from which you want to grab table(s).  A page may contain multiple tables, so `pandas.read_html()` returns a Python `list` with one `DataFrame` per table.
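A minimal sketch of that return type, using a made-up two-table HTML snippet instead of a live page (the table contents and the name `sample_html` are illustrative assumptions). The markup is wrapped in `io.StringIO` because recent pandas versions expect a URL or file-like object rather than a raw HTML string, and `read_html()` needs a parser such as `lxml` to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page fragment containing two separate HTML tables.
sample_html = """
<table>
  <thead><tr><th>Month</th><th>Yield</th></tr></thead>
  <tbody>
    <tr><td>Jan</td><td>1.27%</td></tr>
    <tr><td>Feb</td><td>1.30%</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Company</th><th>Country</th></tr></thead>
  <tbody>
    <tr><td>ABC Company</td><td>Hong Kong</td></tr>
  </tbody>
</table>
"""

tables = pd.read_html(StringIO(sample_html))

print(type(tables))   # a plain Python list
print(len(tables))    # one DataFrame per <table> found
print(tables[0])
```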

### Import required packages

We need the `numpy`, `pandas` and `requests` packages to get the job done. `requests` is for sending HTTP network requests.

```
import numpy as np
import pandas as pd
import requests
```

In [None]:
import numpy as np
import pandas as pd
import requests

### Declare the url
```
sp500_divident_yield_by_month_url = 'https://www.multpl.com/s-p-500-dividend-yield/table/by-month'
sp500_divident_yield_by_month_url
```

In [None]:
sp500_divident_yield_by_month_url = 'https://www.multpl.com/s-p-500-dividend-yield/table/by-month'
sp500_divident_yield_by_month_url

### Read the url

**Call `read_html()` function here**
```
raw_html_tbl = pd.read_html(sp500_divident_yield_by_month_url)
```

In [None]:
raw_html_tbl = pd.read_html(sp500_divident_yield_by_month_url)

In [None]:
raw_html_tbl

### Check its type

`type(raw_html_tbl)`

It should say `list`

In [None]:
type(raw_html_tbl)

### Check how many tables are retrieved

There could be more than one table in the returned result, so you use `[ ]` with an **index number** to tell which table you would like to access.  

Sometimes a page contains no table at all. In that case `read_html()` raises a `ValueError` instead of returning an empty list, so make sure your code handles that case before it tries to grab data from the tables.

Calling the `len()` function tells you how many tables are in the returned result.
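One way to make that check explicit is sketched below. It assumes that `read_html()` raises `ValueError` when it finds no `<table>` at all, which is how current pandas versions behave; the helper name `read_tables_or_empty` and the sample markup are hypothetical.

```python
from io import StringIO

import pandas as pd


def read_tables_or_empty(html_text):
    """Return the tables found in html_text, or [] when there are none."""
    try:
        return pd.read_html(StringIO(html_text))
    except ValueError:
        # pandas signals "no tables found" with a ValueError.
        return []


page_without_table = "<html><body><p>No tables here.</p></body></html>"
page_with_table = "<table><tr><th>A</th></tr><tr><td>1</td></tr></table>"

for page in (page_without_table, page_with_table):
    found = read_tables_or_empty(page)
    if len(found) > 0:
        print("first table:")
        print(found[0])
    else:
        print("no table to read")
```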


In [None]:
len(raw_html_tbl)

### Retrieve the table by `[ ]` operation

Retrieve the table with the `[ ]` operation together with the correct index number.

The index number starts at `0`, so you use `0` to refer to the first table. The code below gets the first table in the result.

`raw_html_tbl[0]`

In [None]:
raw_html_tbl[0]

### Check the type of the first table

It shows the content as a `DataFrame`, which is good because a `DataFrame` (structured data) is well suited to data analysis.

`type(raw_html_tbl[0])`

In [None]:
type(raw_html_tbl[0])

**The code below raises an `IndexError` because the result contains only one table.**

In [None]:
raw_html_tbl[1]

### Let's create a variable to store our table

In [None]:
sp500 = raw_html_tbl[0]
sp500

### Access meta-data and actual data of the `sp500` data-frame

Table data are grabbed and presented to you as a pandas `DataFrame`, so you can use any `DataFrame` function or attribute you know to query the data.

```
sp500.shape
sp500.head()
sp500.tail()
sp500.info()
sp500.describe()
sp500[0:10]
```

In [None]:
sp500.shape

In [None]:
sp500.head()

In [None]:
sp500.tail()

In [None]:
sp500.info()

In [None]:
sp500.describe()

In [None]:
sp500[0:10]

### Data Tidying
- Getting rid of unwanted characters
- Converting the percentage string to a float value

Use strip() function to get rid of `'â\x80 '` and `%`character

In [None]:
sp500["Value"][0]

In [None]:
sp500["Value"][0].strip("â\x80 ")

In [None]:
sp500["Value"][0].strip("â\x80 ").strip('%')

In [None]:
sp500["Value"]

### Cleaning and transforming the `Value` column
This part uses a lambda (a short, anonymous function).  We define the short function so that each value in the `Value` column is passed through it; the function gets rid of the unwanted symbols and converts the result to the `float` type.

Here we add an extra column named `Percent` to store the cleaned and converted `Value` column.

In [None]:
sp500["Percent"] = sp500["Value"].apply(lambda x: float(x.strip("â\x80 ").strip("%")) / 100)
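An alternative sketch, not the approach used in the cell above: instead of listing the unwanted characters for `strip()`, a regular expression can pull out just the number in front of `%`. The sample values below are made up to resemble the `Value` column.

```python
import pandas as pd

# Made-up values that imitate the scraped column, including stray symbols.
values = pd.Series(["\u2020 1.27%", "1.30%", " 1.35% "])

# Capture the numeric part that sits directly before the percent sign.
numbers = values.str.extract(r"(\d+(?:\.\d+)?)\s*%", expand=False)

# Convert text such as "1.27" to the fraction 0.0127.
percent = numbers.astype(float) / 100
print(percent)
```

The regular expression keeps working if the stray characters change, at the cost of being a little harder to read than two `strip()` calls.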

### Here we drop the unwanted `Value` column

In [None]:
sp500.drop(columns = ['Value'], inplace=True)

In [None]:
print(sp500)

### Another Table Reading Example

Copy and paste the following address into a browser to take a glance at the page.  It contains two tables.

`https://www.w3schools.com/html/html_tables.asp`

In [None]:
url2 = 'https://www.w3schools.com/html/html_tables.asp'

In [None]:
tables = pd.read_html(url2)

In [None]:
len(tables)

In [None]:
tables[0]

In [None]:
tables[1]

## `read_html()` doesn't always work

- There is a lot of broken HTML code out there
- Data are not always presented in table format
- Some websites implement blocking policies
- Modern webpages involve more and more JavaScript programming.  A webpage might start EMPTY, with the actual page contents generated on the client side.

**The following `read_html()` call fails**  
It throws `HTTPError: HTTP Error 403: Forbidden`
```
hkej_url = 'https://stock360.hkej.com/marketWatch/Top20/topGainers'
raw_html_tbl2 = pd.read_html(hkej_url)
raw_html_tbl2
```

The following cell will generate `HTTPError: HTTP Error 403: Forbidden`

In [None]:
hkej_url = 'https://stock360.hkej.com/marketWatch/Top20/topGainers'
raw_html_tbl2 = pd.read_html(hkej_url)
raw_html_tbl2
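A possible workaround, sketched under assumptions: download the page with `requests` while sending a browser-like `User-Agent` header, then pass the HTML text to `read_html()`. The function name `fetch_tables` and the header value are illustrative. A site may still refuse the request, or its HTML may contain no `<table>`, and its terms of use still apply.

```python
from io import StringIO

import pandas as pd
import requests

# Illustrative header; many sites reject requests that have no User-Agent.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


def fetch_tables(url, headers=HEADERS, timeout=10):
    """Download url and return the list of tables pandas finds in it."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()   # raises HTTPError on 4xx/5xx replies
    return pd.read_html(StringIO(response.text))


# Usage (needs network access, so it is left commented out):
# tables = fetch_tables("https://stock360.hkej.com/marketWatch/Top20/topGainers")
```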

# USING `BeautifulSoup` TO AUTOMATE DATA GRABBING

You can extract HTML elements from complex HTML code by using BeautifulSoup.  

**BeautifulSoup** is a popular web scraping tool.  **Scrapy** and **Selenium** are also widely used.

## Using Browser Dev Tools to Explore HTML/CSS/JavaScript

We will use the Developer Tools of the Google Chrome browser to explore how HTML, CSS and JavaScript work together to present a modern webpage. 

If you don't have the Google Chrome browser, click below to download and install it.  
[Download Google Chrome](https://www.google.com/intl/en_hk/chrome/)

## Chrome Dev Tool

It helps developers debug, inspect, and optimize web pages for performance, security, and responsiveness.

Right-click on any empty area of a web page and choose **Inspect** to open the dev tools.  

Let's have some fun with the dev tools.  You will be amazed how much they can do; these features are mostly known to developers, not regular users.

![Chrome Dev Tool](https://developer.chrome.com/static/docs/devtools/overview/image/elements-panel-546127ed29eac.png)

## Let's try to create a very simple web page 

Let's use [JSFiddle](https://jsfiddle.net/)

![](https://b6land.github.io/assets/imgs/2024-01-14/jsFiddle.png)

## Check if `beautifulsoup4` is already installed

Run the following shell command (the leading `!` tells the notebook to run it in the system shell) to check if beautifulsoup4 is installed on your computer.  
`!conda list beautifulsoup4`

In [None]:
!conda list beautifulsoup4


If beautifulsoup4 is NOT installed, run the following command to install it.  

`!conda install beautifulsoup4`

## Import `BeautifulSoup`
Import `BeautifulSoup` before you use it

```
from bs4 import BeautifulSoup
```

In [None]:
from bs4 import BeautifulSoup

## Let's start with simple dummy HTML

**Declare the following HTML document**

```
html_doc = """<!DOCTYPE html>
<html lang="en" dir="ltr">
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Today's News</title>
    <style>
      #website-name {
        color: rgb(164, 11, 11);
      }
      #website-name span {
        color: black;
        font-size: 0.5em;
      }
      .news-title {
        text-transform: uppercase;
      }
      h2 {
        color: rgb(164, 11, 11);
      }

      article {
        border-bottom: solid 1px grey;
      }

      aside {
        border: solid 1px #ccc;
        padding: 10px;
      }
    </style>
  </head>
  <body>
    <h1 id="website-name">Today's News <span>(An ABC Company)</span></h1>
    <b id="today" class="date-style">Date: 2025-04-30</b>
    <hr />
    <main>
      <article id="cover-story">
        <h2 class="news-title">News 001</h2>
        <a href="news001.html" class="news-cover-photo" id="cover-story-photo"
          ><img
            src="https://placehold.co/600x400?text=Dummy+Photo+1"
            alt="photo"
        /></a>
        <br /><i>2025-01-01</i>
        <p>
          <b>News 001</b> dolor sit amet, consectetur adipisicing elit, sed do
          eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
          minim veniam, quis nostrud exercitation ullamco laboris nisi ut
          aliquip ex ea commodo consequat. Duis aute irure dolor in
          reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
          pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
          culpa qui officia deserunt mollit anim id est laborum.
        </p>
      </article>

      <article class="featured">
        <h2 class="news-title">News 002</h2>
        <a href="news002.html" class="news-cover-photo"
          ><img
            src="https://placehold.co/600x400?text=Dummy+Photo+2"
            alt="photo"
        /></a>
        <br /><i>2025-01-01</i>
        <p>
          <b>News 002</b> dolor sit amet, consectetur adipisicing elit, sed do
          eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
          minim veniam, quis nostrud exercitation ullamco laboris nisi ut
          aliquip ex ea commodo consequat. Duis aute irure dolor in
          reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
          pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
          culpa qui officia deserunt mollit anim id est laborum.
        </p>
      </article>

      <article class="featured">
        <h2 class="news-title">News 003</h2>
        <a href="news003.html" class="news-cover-photo"
          ><img
            src="https://placehold.co/600x400?text=Dummy+Photo+3"
            alt="photo"
        /></a>
        <br /><i>2025-01-01</i>
        <p>
          <b>News 003</b> dolor sit amet, consectetur adipisicing elit, sed do
          eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
          minim veniam, quis nostrud exercitation ullamco laboris nisi ut
          aliquip ex ea commodo consequat. Duis aute irure dolor in
          reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
          pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
          culpa qui officia deserunt mollit anim id est laborum.
        </p>
      </article>
    </main>

    <aside class="related-news">
      <h2 id="related-news-section-heading">Related News</h2>
      <a href="news001.html" class="related-news-link">News 101</a><br />
      <a href="news002.html" class="related-news-link">News 102</a><br />
      <a href="news003.html" class="related-news-link">News 103</a><br />
      <a href="news004.html" class="related-news-link">News 104</a><br />
      <a href="news005.html" class="related-news-link">News 105</a><br />
      <button class="related-news-link">Show more</button>
    </aside>

    <hr />
    <footer><span>ABC Company</span>. All rights reserved.</footer>
  </body>
</html>
"""
```

In [None]:
html_doc = """<!DOCTYPE html>
<html lang="en" dir="ltr">
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Today's News</title>
    <style>
      #website-name {
        color: rgb(164, 11, 11);
      }
      #website-name span {
        color: black;
        font-size: 0.5em;
      }
      .news-title {
        text-transform: uppercase;
      }
      h2 {
        color: rgb(164, 11, 11);
      }

      article {
        border-bottom: solid 1px grey;
      }

      aside {
        border: solid 1px #ccc;
        padding: 10px;
      }
    </style>
  </head>
  <body>
    <h1 id="website-name">Today's News <span>(An ABC Company)</span></h1>
    <b id="today" class="date-style">Date: 2025-04-30</b>
    <hr />
    <main>
      <article id="cover-story">
        <h2 class="news-title">News 001</h2>
        <a href="news001.html" class="news-cover-photo" id="cover-story-photo"
          ><img
            src="https://placehold.co/600x400?text=Dummy+Photo+1"
            alt="photo"
        /></a>
        <br /><i>2025-01-01</i>
        <p>
          <b>News 001</b> dolor sit amet, consectetur adipisicing elit, sed do
          eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
          minim veniam, quis nostrud exercitation ullamco laboris nisi ut
          aliquip ex ea commodo consequat. Duis aute irure dolor in
          reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
          pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
          culpa qui officia deserunt mollit anim id est laborum.
        </p>
      </article>

      <article class="featured">
        <h2 class="news-title">News 002</h2>
        <a href="news002.html" class="news-cover-photo"
          ><img
            src="https://placehold.co/600x400?text=Dummy+Photo+2"
            alt="photo"
        /></a>
        <br /><i>2025-01-01</i>
        <p>
          <b>News 002</b> dolor sit amet, consectetur adipisicing elit, sed do
          eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
          minim veniam, quis nostrud exercitation ullamco laboris nisi ut
          aliquip ex ea commodo consequat. Duis aute irure dolor in
          reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
          pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
          culpa qui officia deserunt mollit anim id est laborum.
        </p>
      </article>

      <article class="featured">
        <h2 class="news-title">News 003</h2>
        <a href="news003.html" class="news-cover-photo"
          ><img
            src="https://placehold.co/600x400?text=Dummy+Photo+3"
            alt="photo"
        /></a>
        <br /><i>2025-01-01</i>
        <p>
          <b>News 003</b> dolor sit amet, consectetur adipisicing elit, sed do
          eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
          minim veniam, quis nostrud exercitation ullamco laboris nisi ut
          aliquip ex ea commodo consequat. Duis aute irure dolor in
          reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
          pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
          culpa qui officia deserunt mollit anim id est laborum.
        </p>
      </article>
    </main>

    <aside class="related-news">
      <h2 id="related-news-section-heading">Related News</h2>
      <a href="news001.html" class="related-news-link">News 101</a><br />
      <a href="news002.html" class="related-news-link">News 102</a><br />
      <a href="news003.html" class="related-news-link">News 103</a><br />
      <a href="news004.html" class="related-news-link">News 104</a><br />
      <a href="news005.html" class="related-news-link">News 105</a><br />
      <button class="related-news-link">Show more</button>
    </aside>

    <hr />
    <footer><span>ABC Company</span>. All rights reserved.</footer>
  </body>
</html>

"""

### Creating a BeautifulSoup Object

Create a BeautifulSoup object from the HTML string (`html_doc`) and specify a parser (in this case, `html.parser`).

`soup = BeautifulSoup(html_doc, 'html.parser')`

A parser converts raw web page content into a structured format.  There are several parser options.  They do the same job but handle different situations.
- Use `html.parser` for simple HTML parsing (built into Python, no extra installation needed).
- Use `lxml` for speed and XML support.
- Use `html5lib` for handling poorly formatted HTML.
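A small sketch of what parsing buys you, using only the built-in `html.parser`: the snippet below is deliberately missing its closing tags, yet the parsed tree can still be navigated. The sample markup and the name `soup_demo` are made up, and other parsers may repair broken HTML differently.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: <b> and <p> are never closed.
broken_html = "<p>Hello <b>world"

soup_demo = BeautifulSoup(broken_html, "html.parser")

print(soup_demo.b.text)   # text inside the <b> tag
print(soup_demo.p.text)   # text inside the <p> tag, including the nested <b>
```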

In [None]:
soup = BeautifulSoup(html_doc, 'html.parser')

## Use the `prettify()` function to show cleanly indented HTML code

The following command will display a neat output
```
print(soup.prettify())
```

In [None]:
print(soup.prettify())

##  Use `find()` function to retrieve child elements

An HTML file is usually long.  It can easily contain hundreds or thousands of lines of code.  We can use the `find()` function to target a tag.

Examples:
```
soup.find('title')
type(soup.find('title'))
soup.find('h1')
soup.find('p')
p = soup.find('p')
type(p)
```
`find()` _returns only ONE element (the first match), even if several elements match._

In [None]:
soup.find('title')

In [None]:
type(soup.find('title'))

In [None]:
soup.find('h1')

In [None]:
soup.find('p')

In [None]:
a_paragraph = soup.find('p')
type(a_paragraph)

## Retrieve tag's content

Use the following properties to get the contents of a tag  

- `.text` Extracts all text within an element, including text inside **nested tags**.
  Useful for grabbing the full text content of an element  

- `.string` Returns the text only if the element has a single text node, otherwise `None`.
  Works when an element contains direct text without nested tags  

- `.contents` Returns a list of the element's direct children (tags and text nodes).
  Useful for inspecting what is nested inside an element. (Raw bytes via `.content` belong to a `requests` response, not to a BeautifulSoup tag.)  
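A short sketch of how `.text`, `.string` and `.contents` differ, using a made-up two-tag snippet (the names `snippet` and `demo` are illustrative):

```python
from bs4 import BeautifulSoup

snippet = "<h2>News 001</h2><p><b>News 001</b> dolor sit amet</p>"
demo = BeautifulSoup(snippet, "html.parser")

heading = demo.find("h2")
paragraph = demo.find("p")

print(heading.text)        # 'News 001'
print(heading.string)      # 'News 001' (a single text node)
print(paragraph.text)      # 'News 001 dolor sit amet'
print(paragraph.string)    # None (text mixed with a nested <b>)
print(paragraph.contents)  # direct children: the <b> tag and a text node
```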

### Let's retrieve a simple tag WITHOUT a child tag

In [None]:
a_heading_2 = soup.find('h2')

In [None]:
a_heading_2

In [None]:
a_heading_2.text

In [None]:
a_heading_2.string

### Let's retrieve a tag with a child tag

In [None]:
a_paragraph = soup.find('p')

In [None]:
a_paragraph.text

In [None]:
a_paragraph.string # this returns None because the paragraph has a nested child tag

## `find()` vs. `find_all()`

- `find()` - finds the first matching HTML tag/element.  
   Returns a single element (Tag object) or `None` if not found.  
   Example: extracting a single heading (`<h2>`), the first paragraph, etc.
- `find_all()` - finds all matching HTML tags/elements.  
   Returns a list of Tag objects (empty list if no match).  
   Example: extracting all headings (`<h2>`).
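Because the two functions signal "nothing found" differently, the calling code usually differs too. A minimal sketch with a made-up snippet that contains headings but no table:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<h2>News 001</h2><h2>News 002</h2>", "html.parser")

# find(): one Tag, or None when nothing matches, so check before using it.
table = demo.find("table")
if table is None:
    print("no <table> in this document")

# find_all(): always list-like, possibly empty, so it is safe to loop over.
titles = [h2.text for h2 in demo.find_all("h2")]
print(titles)
print(len(demo.find_all("table")))
```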

In [None]:
soup.find('h2')

In [None]:
type(soup.find('h2'))

In [None]:
soup.find_all('h2')

---
**Check the type**  
When you use `find_all()`, it returns a `ResultSet` (a list of the items found)

In [None]:
type(soup.find_all('h2'))

---
**Use `[]` to refer to an item in the resultset**

`soup.find_all('h2')[0]`

In [None]:
# returns the first one: use 0 as index
soup.find_all('h2')[0]

In [None]:
# returns the second one: use 1 as index
soup.find_all('h2')[1] 

In [None]:
# returns the last one: use -1 as index
soup.find_all('h2')[-1]

In [None]:
# returns the second last one: use -2 as index
soup.find_all('h2')[-2]

## Find by tag name, class name or id
You can find a target tag by its tag name, or by its `class` or `id` attribute  

![](https://codetheweb.blog/assets/img/posts/html-syntax/tag-structure-2.png)

### by tag name / element name
This approach is easy, but it usually targets TOO MANY tags (considering that an HTML page can easily contain thousands of lines).
```
soup.find('h1')
soup.find('h2')
soup.find('p')
```
### by css `class_` attribute
This approach lets you target tags with a certain CSS class name.  The parameter name is `class_` instead of `class`, because `class` is a reserved keyword in Python.
```
soup.find_all(class_='news-title')
soup.find_all(class_='featured')

```
### by `id` attribute
Use id when targeting unique elements.  
_Note_
- An `id` should be unique within an HTML document, so you should expect only one matching tag.  
- However, there can be exceptions, as it's quite common for HTML code to be buggy and messy.
```
soup.find_all(id='website-name')
soup.find_all(id='related-news-section-heading')
```
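The three lookups can be compared side by side on a shortened, made-up version of the dummy page. The last call shows the `attrs` dictionary form, which is another way of writing the class lookup.

```python
from bs4 import BeautifulSoup

snippet = """
<h1 id="website-name">Today's News</h1>
<article id="cover-story"><h2 class="news-title">News 001</h2></article>
<article class="featured"><h2 class="news-title">News 002</h2></article>
<aside><h2 id="related-news-section-heading">Related News</h2></aside>
"""
demo = BeautifulSoup(snippet, "html.parser")

by_tag = demo.find_all("h2")                    # every <h2>
by_class = demo.find_all(class_="news-title")   # only tags with that class
by_id = demo.find(id="website-name")            # an id should match one tag

# Class lookup written with an attribute dictionary instead of class_.
by_attrs = demo.find_all(attrs={"class": "news-title"})

print(len(by_tag), len(by_class), by_id.text)
print([tag.text for tag in by_attrs])
```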


In [None]:
soup.find_all('h2')

In [None]:
soup.find_all(class_='news-title')

In [None]:
soup.find_all(id='website-name')

In [None]:
soup.find_all(id='related-news-section-heading')

## Use `.` to refer to a child element
To retrieve the `<title>` child tag, use `soup.title`

Or other child elements
```
soup.meta
soup.h1
soup.h2
soup.footer
```

This approach returns only ONE object: the first matching tag.

In [None]:
soup.title

In [None]:
soup.meta

In [None]:
soup.h1

In [None]:
soup.h1.span

In [None]:
soup.h1.span.text

In [None]:
soup.h2

In [None]:
soup.footer

## Getting a tag
```
a_tag = soup.title 
a_tag.name
```

In [None]:
soup.title

In [None]:
a_tag = soup.title 

In [None]:
a_tag.name

In [None]:
a_tag.text

In [None]:
a_tag.string

## Get the parent tag
`.parent` gives the parent tag of current tag
```
a_tag.parent
a_tag.parent.name
a_tag.parent.text
```
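A brief sketch of moving upward in the tree, using a made-up snippet. Besides `.parent`, BeautifulSoup also offers `find_parent()`, which climbs until it reaches a tag with the given name.

```python
from bs4 import BeautifulSoup

snippet = '<article id="cover-story"><p><b>News 001</b> dolor sit amet</p></article>'
demo = BeautifulSoup(snippet, "html.parser")

bold = demo.find("b")
print(bold.parent.name)                    # direct parent of <b>
print(bold.parent.parent.name)             # one more level up
print(bold.find_parent("article")["id"])   # nearest enclosing <article>
```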

In [None]:
a_tag

In [None]:
a_tag.parent

In [None]:
a_tag.parent.name

In [None]:
a_tag.parent.text

## Tag attributes

The diagram below explains what a tag's attributes are.
![](https://www.scientecheasy.com/wp-content/uploads/2023/03/img-html-attributes.png)

Showing attributes
```
a_tag = soup.a
a_tag
a_tag["class"]
a_tag["href"]
a_tag["id"]
a_tag.attrs
```

`.attrs` lists all attributes
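Not every tag has every attribute. The sketch below, on a made-up `<a>` tag without an `id`, contrasts square-bracket access (which raises `KeyError` for a missing attribute) with `.get()` (which returns `None` or a fallback value).

```python
from bs4 import BeautifulSoup

snippet = '<a href="news002.html" class="news-cover-photo">photo</a>'
link = BeautifulSoup(snippet, "html.parser").a

print(link["href"])            # attribute exists
print(link["class"])           # class is returned as a list of names
print(link.get("id"))          # missing attribute: None instead of an error
print(link.get("id", "n/a"))   # optional fallback value

try:
    link["id"]
except KeyError:
    print("this <a> tag has no id attribute")
```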

In [None]:
a_tag = soup.a
a_tag

In [None]:
a_tag["class"]

In [None]:
a_tag["href"]

In [None]:
a_tag["id"]

In [None]:
a_tag.attrs # show all the attributes of a tag

## Specifying both `tag` name and `class` name

We previously used `class_` to find matching elements. That approach returns every type of HTML tag that has a matching CSS class name.  
`soup.find_all(class_='related-news-link')`

We can pass the `tag` name as the first parameter to narrow the search to a certain type of HTML tag.

This searches for `<a>` tags with the CSS class `related-news-link`   
`soup.find_all('a', class_='related-news-link')`

This searches for `<button>` tags with the CSS class `related-news-link`  
`soup.find_all('button', class_='related-news-link')`
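A narrowed search like this is often followed by a loop that collects the pieces you need. The sketch below uses a shortened, made-up copy of the related-news block and gathers each link's text and `href` into a `DataFrame`; the column names are illustrative.

```python
import pandas as pd
from bs4 import BeautifulSoup

snippet = """
<aside class="related-news">
  <a href="news001.html" class="related-news-link">News 101</a><br />
  <a href="news002.html" class="related-news-link">News 102</a><br />
  <button class="related-news-link">Show more</button>
</aside>
"""
demo = BeautifulSoup(snippet, "html.parser")

# Only <a> tags are requested, so the <button> with the same class is skipped.
rows = [
    {"title": a.text, "href": a.get("href")}
    for a in demo.find_all("a", class_="related-news-link")
]

links = pd.DataFrame(rows)
print(links)
```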


In [None]:
soup.find_all(class_='related-news-link')

In [None]:
soup.find_all('a', class_='related-news-link')

In [None]:
soup.find_all('button', class_='related-news-link')

## Limiting the number in the search result

Use the `limit` parameter to specify the maximum number of items you want returned.

In the following example, we limit the search to **TWO**
```
soup.find_all('h2', limit=2) 

```
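A quick sketch on a made-up three-heading snippet. It also shows that `limit=1` still gives back a list, whereas `find()` gives back the tag itself.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<h2>A</h2><h2>B</h2><h2>C</h2>", "html.parser")

first_two = demo.find_all("h2", limit=2)
print([h2.text for h2 in first_two])

# limit=1 still returns a list; find() returns the Tag itself.
only_one = demo.find_all("h2", limit=1)
print(only_one)
print(demo.find("h2"))
```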

In [None]:
soup.find_all('h2')

In [None]:
soup.find_all('h2', limit=2)

## Search by using advanced CSS selectors
If you are experienced in HTML/CSS coding, you can use complex CSS selectors to target a small part of the HTML content more precisely.  

Here we use the `select()` function, passing a `css selector` as the parameter.  It returns all the matching elements.
```
soup.select('body b')
soup.select('p b')
soup.select('body>b')
soup.select('body>p>b')
```
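The sketch below runs a few selectors against a shortened, made-up copy of the dummy page. It also uses `select_one()`, which returns the first match (or `None`) instead of a list.

```python
from bs4 import BeautifulSoup

snippet = """
<body>
  <b id="today">Date: 2025-04-30</b>
  <article id="cover-story">
    <h2 class="news-title">News 001</h2>
    <p><b>News 001</b> dolor sit amet</p>
  </article>
  <aside><h2>Related News</h2><a href="news001.html">News 101</a></aside>
</body>
"""
demo = BeautifulSoup(snippet, "html.parser")

print(len(demo.select("body b")))      # any <b> below <body>
print(len(demo.select("body > b")))    # only <b> directly under <body>
print(demo.select("h2.news-title"))    # <h2> tags with class news-title
print(demo.select_one("#cover-story p b").text)       # first match only
print([a["href"] for a in demo.select("a[href]")])    # <a> tags that have href
```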

In [None]:
soup.select('h2')

In [None]:
soup.select('article h2')

In [None]:
soup.select('aside h2')

In [None]:
soup.select('b')

In [None]:
soup.select('body>b')

In [None]:
soup.select('article>p>b')

# PRACTICAL SCRAPING USING `requests` PACKAGE
In the previous section, we used a simple **HTML string** to demonstrate how to `find()`, `find_all()` and `select()` HTML tags/elements, because actual web pages are usually very long and messy.  We learned the BeautifulSoup skills by processing a simple document.

In real life, we need to send an **HTTP request** to the webpage(s) over the internet directly and convert the returned HTML text to a BeautifulSoup object.  

To issue an HTTP(S) request, we need to import the `requests` package.
![HTTP Protocol](https://miro.medium.com/v2/resize:fit:853/1*8-fT6K1o6nHiBRxKppcqOg.png)

## Importing packages

In [None]:
import requests
from bs4 import BeautifulSoup

## Defining `User-Agent` for HTTP Request Header

Defining a `User-Agent` when web scraping is important because many websites check this header to determine whether the request is coming from a browser or a bot.  

If a request lacks a User-Agent or looks suspicious, websites may block or rate-limit it.

**Why Use a User-Agent?** 
1. Avoid Blocks & Restrictions – Websites may reject requests from bots or unknown sources.
2. Mimic a Real Browser – Helps make the request appear more like human activity.
3. Access Site Content Properly – Some sites serve different content based on the User-Agent.
4. Reduce Anti-Bot Rejections – Many sites block automated scraping tools by default; a browser-like User-Agent helps with simple checks, although it will not get past CAPTCHAs on its own.

In Python, headers are defined using the `dict` type.
```
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
```
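To see what such a dictionary turns into, the sketch below builds a request without sending it and prints the header that would go out. `requests.Request(...).prepare()` is used here only for inspection; the URL is the one used later in this section.

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

# Build the request object without sending it over the network.
request = requests.Request("GET", "https://hk.yahoo.com", headers=headers)
prepared = request.prepare()

print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])
```

If many requests need the same headers, a `requests.Session` with `session.headers.update(headers)` is a common alternative to passing `headers=` on every call.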

In [None]:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

## Calling `requests.get()` function

We need to pass the extra parameter `headers` when calling the `get()` function.

```
response = requests.get("https://hk.yahoo.com", headers=headers)
```

The HTTP response from the server is stored in `response` for later use.

In [None]:
response = requests.get("https://hk.yahoo.com", headers=headers)

## Checking on `response` 

Use `.status_code` to check the response status code (`200` means the request succeeded).

To retrieve the response body, refer to the response's `.text` or `.content` attribute
- Use `.text` when you need the response as a readable string (e.g., web pages, JSON).
- Use `.content` when working with binary data like images, PDFs, or file downloads.
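These checks can be wrapped in a small helper so that failed requests do not crash the rest of the notebook. This is a sketch under assumptions: the function name `fetch_html`, the 10-second timeout and the choice to return `None` on failure are all illustrative.

```python
import requests


def fetch_html(url, headers=None, timeout=10):
    """Return the page HTML as text, or None when the request fails."""
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        # Turn 4xx/5xx status codes into an exception.
        response.raise_for_status()
    except requests.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.text


# Usage (needs network access, so it is left commented out):
# html = fetch_html("https://hk.yahoo.com", headers={"User-Agent": "Mozilla/5.0"})
```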

In [None]:
response.status_code

In [None]:
response.text

In [None]:
response.content

In [None]:
response.url

## Converting Response Text to BeautifulSoup Object
`yahoo = BeautifulSoup(response.content, "html.parser")`

In [None]:
yahoo = BeautifulSoup(response.content, "html.parser")
yahoo

In [None]:
yahoo.find('title')

In [None]:
yahoo.find_all('meta')

In [None]:
yahoo.find(id='module-featurebar')

In [None]:
yahoo.find(id='module-featurebar').text

In [None]:
yahoo.find_all(class_='apac-ntk-item')

In [None]:
type(yahoo.find_all(class_='apac-ntk-item'))

In [None]:
yahoo.find_all(class_='apac-ntk-item')[0]

In [None]:
yahoo.find_all(class_='apac-ntk-item')[0].text

In [None]:
yahoo.find_all(class_='apac-ntk-item')[0].attrs

In [None]:
yahoo.find_all(class_='apac-ntk-item')[0]['class']

In [None]:
yahoo.find_all(class_='apac-ntk-item')[0]['href']