# A Simple Introduction to Web Scraping with Beautiful Soup

![](https://github.com/kaopanboonyuen/GISTDA2023/raw/main/img/gistda_day1.png)


Credit: 

[1] https://realpython.com/beautiful-soup-web-scraper-python/

[2] https://www.analyticsvidhya.com/blog/2021/08/a-simple-introduction-to-web-scraping-with-beautiful-soup/

[3] https://www.scrapingbee.com/blog/python-web-scraping-beautiful-soup/

In [1]:
from bs4 import BeautifulSoup 
import requests
import pandas as pd
import re

Beautiful Soup is a Python library for extracting data from HTML and XML files. It parses a page into a tree of objects that mirrors the nesting of the document's tags, because an HTML document is itself a tree of tags. The short HTML example below illustrates this structure.


<!-- <!DOCTYPE html>
<html>
<head>
<title>Tutorial of Web scraping</title>
</head>
<body>
<h1>1. Import libraries</h1>
<p>Let's import: </p>
</body>
</html> -->

![](https://cdn-images-1.medium.com/max/1000/1*iOWLHDOtqxgngIOj9N3Hzw.png)
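As a minimal sketch of the idea, the small example document can be parsed from a string without any network access. The variable names `html_doc`, `demo_soup`, and `tag_names` are illustrative and not part of the original notebook.

```python
from bs4 import BeautifulSoup

# The example document from this section, kept as a plain string.
html_doc = """<!DOCTYPE html>
<html>
<head>
<title>Tutorial of Web scraping</title>
</head>
<body>
<h1>1. Import libraries</h1>
<p>Let's import: </p>
</body>
</html>"""

demo_soup = BeautifulSoup(html_doc, "html.parser")

# find_all(True) returns every tag in document order.
tag_names = [tag.name for tag in demo_soup.find_all(True)]
print(tag_names)
print(demo_soup.title.string)
```

The list of tag names follows the nesting of the document, which is the tree that Beautiful Soup lets you navigate.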

In [2]:
url = 'https://en.wikipedia.org/wiki/Big_data'
req = requests.get(url)
print(req)

<Response [200]>
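A `200` response is not guaranteed, so it can help to wrap the request in a small helper. The sketch below is one possible approach, not part of the original notebook: the function name `fetch_html`, the 10-second timeout, and the `User-Agent` string are all assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising an exception on HTTP errors."""
    # Hypothetical identifier; replace it with something that describes your script.
    headers = {"User-Agent": "bs4-tutorial-example/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://en.wikipedia.org/wiki/Big_data')` would then either return the HTML text or raise an exception instead of silently returning an error page.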


In [3]:
soup = BeautifulSoup(req.text,"html.parser")
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [4]:
print(soup.prettify()[:100])

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-la


## Beautiful Soup DOM Tree
The structure of Beautiful Soup is based on the concept of the DOM, which all web browsers use. The DOM is a tree of all the elements in a web page. Each element node consists of:
- tag
- innerHTML/outerHTML
- id
- attributes
- parent and children

Note: DOM = Document Object Model 
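The node properties listed above map onto Beautiful Soup accessors roughly as follows. This is a sketch on a made-up fragment (`html_doc` is hypothetical), not the page used elsewhere in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only to illustrate the accessors.
html_doc = '<div id="intro" class="box"><p>Hello <b>world</b></p></div>'
node = BeautifulSoup(html_doc, "html.parser").div

tag_name = node.name                  # tag
inner_html = node.decode_contents()   # innerHTML: markup inside the node
outer_html = str(node)                # outerHTML: the node including its own tag
node_id = node.get("id")              # id (None when the attribute is absent)
attributes = node.attrs               # all attributes as a dict
parent_name = node.parent.name        # parent
child_names = [child.name for child in node.children]  # children

print(tag_name, node_id, attributes)
print(inner_html)
print(parent_name, child_names)
```

Note that `class` is a multi-valued attribute, so Beautiful Soup stores it as a list, and that the parent of a top-level tag is the document object itself.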

### Traversing simple HTML's DOM Tree

In our example, the structure is as follows:

```
html
+-- head
|   +-- title
|   +-- meta
|   +-- meta
|   +-- style
+-- body
    +-- div
    |   +-- h1
    |   +-- p
    |       +-- b
    +-- div
    |   +-- a
    |   +-- a
    |   +-- a
    |   +-- a
    +-- div
    |   +--div
    |   |   +-- h2
    |   |   +-- h5
    |   |   +-- ...
    |   +--div
    |       +-- h2
    |       +-- h5
    |       +-- ...
    +-- div
        +-- h2
```
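An outline like the one above can be produced by walking the tree recursively. The sketch below assumes a helper named `tag_tree` (hypothetical, not from the original notebook) and runs on a small made-up document rather than the real page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def tag_tree(node, depth=0):
    """Return one indented line per tag, visiting children depth-first."""
    lines = ["    " * depth + node.name]
    for child in node.children:
        # Skip text nodes; only tags appear in the outline.
        if isinstance(child, Tag):
            lines.extend(tag_tree(child, depth + 1))
    return lines


# Made-up document for illustration.
html_doc = (
    "<html><head><title>t</title></head>"
    "<body><div><h1>h</h1><p>x <b>y</b></p></div></body></html>"
)
root = BeautifulSoup(html_doc, "html.parser").html
outline = tag_tree(root)
print("\n".join(outline))
```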

In [5]:
# title is the tag name of one of the element nodes in the example.
# We can refer to a node by using its tag name.
type(soup.title)

bs4.element.Tag

In [6]:
soup.head.style

In [7]:
# we can get the tag name of a node with 'name'
soup.title.name

'title'

In [8]:
# we can get outerHTML by converting node to string
str(soup.title)

'<title>Big data - Wikipedia</title>'

In [9]:
# 'string' returns the text of a node that has a single text child
# (for the title this is the same as its innerHTML)
soup.title.string

'Big data - Wikipedia'

In [10]:
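# attributes such as 'id' are read with get(); soup.title.id would instead
# search for a child tag named <id> (the title has no id, so this returns None)
soup.title.get('id')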

In [11]:
# getting the parent node with 'parent'
soup.title.parent.name

'head'

In [12]:
# 'children' returns an iterator over the direct children of a node
soup.title.children

<list_iterator at 0x7fdbf97aa580>
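Because `children` is an iterator, it has to be consumed to see what it contains. The sketch below uses a small made-up `<head>` fragment (an assumption for illustration) and wraps the iterator in `list()`.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics part of a page head.
html_doc = "<head><title>Big data - Wikipedia</title><meta charset='utf-8'/></head>"
soup_demo = BeautifulSoup(html_doc, "html.parser")

# The only child of <title> is its text node.
title_children = list(soup_demo.title.children)

# The children of <head> are tags, so each has a name.
head_child_names = [child.name for child in soup_demo.head.children]

print(title_children)
print(head_child_names)
```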

# Wikipedia Page Data Extraction

In this tutorial, we will learn how to fetch a static page and turn it into useful information.

We first get a Wikipedia page using requests.

![](https://www.techhub.in.th/wp-content/uploads/2013/12/wikipedia-logo.jpg)

In [13]:
bigdata = requests.get('https://en.wikipedia.org/wiki/Big_data')

In [14]:
len(bigdata.text)

525615

## Parsing a Wikipedia page

In [15]:
soup = BeautifulSoup(bigdata.text, "lxml")
#print(soup.prettify())

In [16]:
soup.title.string

'Big data - Wikipedia'

In [17]:
# soup.find_all('a')

In [18]:
for link in soup.find_all('a', limit=15):
    print('{} : {}'.format(link.get('class'), link.get('href')))

['mw-jump-link'] : #bodyContent
None : /wiki/Main_Page
None : /wiki/Wikipedia:Contents
None : /wiki/Portal:Current_events
None : /wiki/Special:Random
None : /wiki/Wikipedia:About
None : //en.wikipedia.org/wiki/Wikipedia:Contact_us
None : https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
None : /wiki/Help:Contents
None : /wiki/Help:Introduction
None : /wiki/Wikipedia:Community_portal
None : /wiki/Special:RecentChanges
None : /wiki/Wikipedia:File_upload_wizard
['mw-logo'] : /wiki/Main_Page
['mw-ui-button', 'mw-ui-quiet', 'mw-ui-icon', 'mw-ui-icon-element', 'mw-ui-icon-wikimedia-search', 'search-toggle'] : /wiki/Special:Search


In [19]:
pattern = re.compile(r'/wiki/(.*)')

In [20]:
for link in soup.find_all('a', {'class': None}, limit=20):
    href = link.get('href')
    if href is not None:
        match = re.match(pattern, href)
        if match:
            print(href)

/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
/wiki/Big_data
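The same kind of filtering can also be pushed into `find_all` itself, which accepts a compiled regular expression as an attribute filter. The sketch below runs on a few made-up anchor tags modelled on the output above. Unlike the cell above, it does not require the `class` attribute to be missing.

```python
import re

from bs4 import BeautifulSoup

# Made-up anchors modelled on the links printed earlier.
html_doc = """
<a class="mw-jump-link" href="#bodyContent">Jump</a>
<a href="/wiki/Main_Page">Main page</a>
<a href="//en.wikipedia.org/wiki/Wikipedia:Contact_us">Contact</a>
<a href="/wiki/Big_data">Big data</a>
<a>no href</a>
"""
demo = BeautifulSoup(html_doc, "html.parser")

# Keep only anchors whose href starts with /wiki/.
wiki_links = [a["href"] for a in demo.find_all("a", href=re.compile(r"^/wiki/"))]
print(wiki_links)
```

Anchors without an `href`, fragment links, and protocol-relative links are all left out because the pattern is anchored at the start of the value.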


In [21]:
a_list = soup.select('div.div-col ul a')
a_list

[<a href="/wiki/Big_data_ethics" title="Big data ethics">Big data ethics</a>,
 <a href="/wiki/Big_data_maturity_model" title="Big data maturity model">Big data maturity model</a>,
 <a href="/wiki/Big_memory" title="Big memory">Big memory</a>,
 <a href="/wiki/Data_curation" title="Data curation">Data curation</a>,
 <a href="/wiki/Data_defined_storage" title="Data defined storage">Data defined storage</a>,
 <a href="/wiki/Data_engineering" title="Data engineering">Data engineering</a>,
 <a href="/wiki/Data_lineage" title="Data lineage">Data lineage</a>,
 <a href="/wiki/Data_philanthropy" title="Data philanthropy">Data philanthropy</a>,
 <a href="/wiki/Data_science" title="Data science">Data science</a>,
 <a href="/wiki/Datafication" title="Datafication">Datafication</a>,
 <a href="/wiki/Document-oriented_database" title="Document-oriented database">Document-oriented database</a>,
 <a href="/wiki/List_of_big_data_companies" title="List of big data companies">List of big data companies</a>

In [22]:
for e in a_list:
    print(e['href'])

/wiki/Big_data_ethics
/wiki/Big_data_maturity_model
/wiki/Big_memory
/wiki/Data_curation
/wiki/Data_defined_storage
/wiki/Data_engineering
/wiki/Data_lineage
/wiki/Data_philanthropy
/wiki/Data_science
/wiki/Datafication
/wiki/Document-oriented_database
/wiki/List_of_big_data_companies
/wiki/Very_large_database
/wiki/XLDB
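The links printed above are relative to the site root. If full URLs are needed, one option is `urljoin` from the standard library, as in this sketch. The two sample links are taken from the output above, and `base_url` is the page that was scraped.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/wiki/Big_data"
relative_links = ["/wiki/Big_data_ethics", "/wiki/XLDB"]

# urljoin resolves each relative link against the page it came from.
absolute_links = [urljoin(base_url, href) for href in relative_links]
print(absolute_links)
```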


In [23]:
data = []
for e in a_list:
    data.append({ 'keyword' : e.string, 'link' : e['href'] })
df = pd.DataFrame(data)

In [24]:
df

| | keyword | link |
|---|---|---|
| 0 | Big data ethics | /wiki/Big_data_ethics |
| 1 | Big data maturity model | /wiki/Big_data_maturity_model |
| 2 | Big memory | /wiki/Big_memory |
| 3 | Data curation | /wiki/Data_curation |
| 4 | Data defined storage | /wiki/Data_defined_storage |
| 5 | Data engineering | /wiki/Data_engineering |
| 6 | Data lineage | /wiki/Data_lineage |
| 7 | Data philanthropy | /wiki/Data_philanthropy |
| 8 | Data science | /wiki/Data_science |
| 9 | Datafication | /wiki/Datafication |
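The selector-to-DataFrame steps can be tried offline on a small fragment. The HTML below is made up to resemble the "See also" list. The selector `div.div-col ul a` is the one used above, and `df_demo` is an illustrative name.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up fragment: one list inside div.div-col and one outside it.
html_doc = """
<div class="div-col"><ul>
<li><a href="/wiki/Big_data_ethics" title="Big data ethics">Big data ethics</a></li>
<li><a href="/wiki/Data_science" title="Data science">Data science</a></li>
</ul></div>
<ul><li><a href="/wiki/Other">Other</a></li></ul>
"""
links = BeautifulSoup(html_doc, "html.parser").select("div.div-col ul a")

# One record per link; the link outside div.div-col is not selected.
df_demo = pd.DataFrame(
    [{"keyword": a.get_text(strip=True), "link": a["href"]} for a in links]
)
print(df_demo)
```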


# REST API Data Extraction

![](https://raw.githubusercontent.com/Codecademy/articles/0b631b51723fbb3cc652ef5f009082aa71916e63/images/rest_api.svg)

Gathering data from a REST API is very common. Most single-page applications (SPAs) and AJAX-driven dynamic pages rely on REST APIs. In addition, most vendor-specific APIs, such as those of Facebook and Twitter, are based on REST.

The most important step in extracting data via a REST API is identifying the endpoint.
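An endpoint is typically a base URL plus a path and, optionally, query parameters. As a rough sketch, `requests` can build the final URL without sending anything. The host `api.example.com`, the path, and the `market` parameter are hypothetical and do not describe any real API.

```python
import requests

# Hypothetical endpoint and parameter, used only to show how a URL is assembled.
request = requests.Request(
    "GET",
    "https://api.example.com/market/info",
    params={"market": "SET"},
)
prepared = request.prepare()  # builds the request but does not send it
print(prepared.url)
```

Inspecting the prepared URL this way is a convenient check before calling a real endpoint.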

In [25]:
import requests
import json
import pprint

In [26]:
api_url = 'http://api.settrade.com/api/market/SET/info'

In [27]:
data_info = requests.get(api_url)
data_info.text

'{"market_name":"SET","market_display_name":"SET","market_status":"Closed","datetime":"29/03/2023 23:37:47","gainer_amount":617,"gainer_volume":6.765037939E9,"unchange_amount":612,"unchange_volume":2.117060024E9,"loser_amount":754,"loser_volume":5.633700384E9,"index":[{"index_name":"SET","index_display_name":"SET","market":"SET","prior":1606.91,"last":1610.52,"change":3.61,"percent_change":0.2246,"high":1615.29,"low":1602.43,"total_volume":1.4606734024E10,"total_value":4.454078432419E10,"flag_url":null},{"index_name":"SET50","index_display_name":"SET50","market":"SET","prior":967.78,"last":971.45,"change":3.67,"percent_change":0.3792,"high":974.3,"low":965.49,"total_volume":1.028903641E9,"total_value":2.809787411577E10,"flag_url":null},{"index_name":"SET100","index_display_name":"SET100","market":"SET","prior":2167.87,"last":2175.29,"change":7.42,"percent_change":0.3422,"high":2181.62,"low":2162.55,"total_volume":1.566862302E9,"total_value":3.315673146461E10,"flag_url":null},{"index_na

In [28]:
set_info = json.loads(data_info.text)
pprint.pprint(set_info['index'])

[{'change': 3.61,
  'flag_url': None,
  'high': 1615.29,
  'index_display_name': 'SET',
  'index_name': 'SET',
  'last': 1610.52,
  'low': 1602.43,
  'market': 'SET',
  'percent_change': 0.2246,
  'prior': 1606.91,
  'total_value': 44540784324.19,
  'total_volume': 14606734024.0},
 {'change': 3.67,
  'flag_url': None,
  'high': 974.3,
  'index_display_name': 'SET50',
  'index_name': 'SET50',
  'last': 971.45,
  'low': 965.49,
  'market': 'SET',
  'percent_change': 0.3792,
  'prior': 967.78,
  'total_value': 28097874115.77,
  'total_volume': 1028903641.0},
 {'change': 7.42,
  'flag_url': None,
  'high': 2181.62,
  'index_display_name': 'SET100',
  'index_name': 'SET100',
  'last': 2175.29,
  'low': 2162.55,
  'market': 'SET',
  'percent_change': 0.3422,
  'prior': 2167.87,
  'total_value': 33156731464.61,
  'total_volume': 1566862302.0},
 {'change': 1.73,
  'flag_url': None,
  'high': 1047.12,
  'index_display_name': 'sSET',
  'index_name': 'sSET',
  'last': 1044.79,
  'low': 1040.35,
 

In [29]:
market = set_info['index'][0]
print(market['market'], market['last'])

SET 1610.52


In [30]:
for ind in set_info['index']:
    print(ind['index_name'], ind['last'])

SET 1610.52
SET50 971.45
SET100 2175.29
sSET 1044.79
SETCLMV 957.54
SETHD 1160.24
SETTHSI 1028.85
SETWB 994.3
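The same extraction can be tried without calling the API. The sketch below parses a shortened copy of the response shown above (only two indices and a few fields are kept) and builds a lookup from index name to last value. `sample_body` and `last_by_index` are illustrative names.

```python
import json

# Shortened copy of the response body printed earlier.
sample_body = (
    '{"market_name":"SET","market_status":"Closed",'
    '"index":[{"index_name":"SET","last":1610.52},'
    '{"index_name":"SET50","last":971.45}]}'
)
payload = json.loads(sample_body)

# Map each index name to its last value.
last_by_index = {item["index_name"]: item["last"] for item in payload["index"]}
print(last_by_index)
```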


# Data Table Scraping

In [31]:
# Send a GET request to the website
response = requests.get("https://www.w3schools.com/html/html_tables.asp")

In [32]:
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# Find the table you want to scrape
table = soup.find("table", {"id": "customers"})

In [33]:
# Extract the table headers
headers = []
for th in table.find_all("th"):
    headers.append(th.text.strip())

# Extract the table rows and cells
rows = []
for tr in table.find_all("tr"):
    cells = []
    for td in tr.find_all("td"):
        cells.append(td.text.strip())
    if cells:
        rows.append(cells)

In [34]:
# Store the table data in a Pandas DataFrame
df = pd.DataFrame(rows, columns=headers)
df

| | Company | Contact | Country |
|---|---|---|---|
| 0 | Alfreds Futterkiste | Maria Anders | Germany |
| 1 | Centro comercial Moctezuma | Francisco Chang | Mexico |
| 2 | Ernst Handel | Roland Mendel | Austria |
| 3 | Island Trading | Helen Bennett | UK |
| 4 | Laughing Bacchus Winecellars | Yoshi Tannamuri | Canada |
| 5 | Magazzini Alimentari Riuniti | Giovanni Rovelli | Italy |
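The table-scraping steps can be reproduced on an inline table, which makes them easy to test without a network connection. The HTML below is a made-up, shortened version of the `customers` table, and `df_table` is an illustrative name.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Shortened, made-up copy of the table structure scraped above.
html_doc = """
<table id="customers">
<tr><th>Company</th><th>Contact</th><th>Country</th></tr>
<tr><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Germany</td></tr>
<tr><td>Ernst Handel</td><td>Roland Mendel</td><td>Austria</td></tr>
</table>
"""
table = BeautifulSoup(html_doc, "html.parser").find("table", id="customers")

# Header cells become column names.
headers = [th.get_text(strip=True) for th in table.find_all("th")]

# Each row becomes a list of cell texts; the header row has no <td> and is dropped.
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")]
rows = [row for row in rows if row]

df_table = pd.DataFrame(rows, columns=headers)
print(df_table)
```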


# Real Cases: BBC news homepage

![](https://d.newsweek.com/en/full/881613/33-bbc-breaking-news.jpg?w=466&h=311&f=0717db3d760d0f8559be00d641c9f167)

In [35]:
# make a GET request to the BBC news homepage
response = requests.get('https://www.bbc.com/news')

# create a BeautifulSoup object from the response content
soup = BeautifulSoup(response.content, 'html.parser')

# find all the main news headlines and their URLs
main_headlines = soup.find_all('a', class_='gs-c-promo-heading')

# iterate through the headlines and print their text and URLs
for headline in main_headlines:
    print('headline:',headline.text.strip())
    print('href:',headline['href'])
    print()

headline: Nashville school shooter hid guns in parents' home
href: /news/world-us-canada-65106976

headline: Nashville school shooter hid guns in parents' home
href: /news/world-us-canada-65106976

headline: 'Nashville shooter sent me messages before attack'
href: /news/world-us-canada-65106763

headline: Video of deadly Mexico fire causes outrage
href: /news/world-latin-america-65111258

headline: Young Brits told to stay away from Amsterdam
href: /news/world-europe-65107405

headline: Advanced AI risk to humanity - technology leaders
href: /news/technology-65110030

headline: US clears over-the-counter spray for opioid overdoses
href: /news/world-us-canada-65114337

headline: Swimmers in Hawaii accused of harassing dolphins
href: /news/world-us-canada-65114336

headline: Russian spies more effective than army, say experts
href: /news/world-europe-65113340

headline: Will Serial's Adnan Syed go back to jail?
href: /news/world-us-canada-62964216

headline: TV star Paul O'Grady dies age

In [36]:
# make a GET request to the BBC news homepage
response = requests.get('https://www.bbc.com/news')

# create a BeautifulSoup object from the response content
soup = BeautifulSoup(response.content, 'html.parser')

# find all the main news articles
main_articles = soup.find_all('div', class_='gs-c-promo')

# iterate through the articles and print their headline, description, and URL
for article in main_articles:
    headline = article.find('a', class_='gs-c-promo-heading')
    description = article.find('p', class_='gs-c-promo-summary')

    # skip promos that lack a headline link, an href, or a summary
    if headline is None or description is None or not headline.get('href'):
        continue

    print('headline:', headline.text.strip())
    print('description:', description.text.strip())
    # the href values on this page start with '/', so no extra slash is added
    print('url:', 'https://www.bbc.com' + headline['href'])
    print()


headline: Nashville school shooter hid guns in parents' home
description: Police say the parents felt the suspect should not own weapons, but did not realise guns were in the house.
url: https://www.bbc.com/news/world-us-canada-65106976

headline: 'Nashville shooter sent me messages before attack'
description: "I'm still trying to wrap my head around what we're going through," former classmate Averianna Patton told the BBC.
url: https://www.bbc.com/news/world-us-canada-65106763

headline: Video of deadly Mexico fire causes outrage
description: Footage emerges which appears to show officers failing to open a cell door as the fire erupted.
url: https://www.bbc.com/news/world-latin-america-65111258

headline: Young Brits told to stay away from Amsterdam
description: The Dutch city targets UK men aged 18-35 in an ad campaign aimed at changing its reputation.
url: https://www.bbc.com/news/world-europe-65107405

headline: Advanced AI risk to humanity - technology leaders
description: Elo
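Instead of printing, the articles can be collected into records, keeping those that have no summary. The sketch below works on a made-up fragment that reuses the class names from the cells above. The summary text is a placeholder, and the site markup may have changed since this notebook was run.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up fragment: the second promo has no summary paragraph.
html_doc = """
<div class="gs-c-promo">
  <a class="gs-c-promo-heading" href="/news/technology-65110030">Advanced AI risk to humanity - technology leaders</a>
  <p class="gs-c-promo-summary">Placeholder summary text.</p>
</div>
<div class="gs-c-promo">
  <a class="gs-c-promo-heading" href="/news/world-europe-65107405">Young Brits told to stay away from Amsterdam</a>
</div>
"""
soup_demo = BeautifulSoup(html_doc, "html.parser")

articles = []
for promo in soup_demo.find_all("div", class_="gs-c-promo"):
    heading = promo.find("a", class_="gs-c-promo-heading")
    if heading is None or not heading.get("href"):
        continue  # nothing useful to record
    summary = promo.find("p", class_="gs-c-promo-summary")
    articles.append({
        "headline": heading.get_text(strip=True),
        # Keep the record even when the summary is missing.
        "description": summary.get_text(strip=True) if summary else None,
        "url": urljoin("https://www.bbc.com", heading["href"]),
    })

for article in articles:
    print(article)
```

A list of dictionaries like `articles` can be passed straight to `pd.DataFrame` if a table is preferred.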

# Image Scraping

![](https://www.enostech.com/wp-content/uploads/2022/04/AdobeStock_474211244.jpg)

In [37]:
import requests
from bs4 import BeautifulSoup
import os

os.makedirs('image_scraping_results', exist_ok = True)

In [38]:
url = 'https://www.webdesignerdepot.com/2009/01/the-evolution-of-apple-design-between-1977-2008/'
response = requests.get(url)
html_content = response.content

In [39]:
soup = BeautifulSoup(html_content, 'html.parser')


In [40]:
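image_tags = soup.find_all('img')
# skip <img> tags without a src attribute, which would raise KeyError with tag['src']
image_urls = [tag['src'] for tag in image_tags if tag.has_attr('src')]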

In [41]:
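c = 0  # counts successful downloads
for i, url in enumerate(image_urls):
    try:
        response = requests.get(url)
        # note: every file is saved with a .jpg extension, whatever its real format
        with open(f'image_scraping_results/image_{i}.jpg', 'wb') as f:
            f.write(response.content)
        print(f"-- {c} we found the.jpg format and scrape it")
        c += 1
    except requests.exceptions.RequestException:
        # requests raises here when src is not a fetchable URL (for example a
        # relative path or a data: URI); the file format itself is never checked
        print("!! it is not .jpg format")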


!! it is not .jpg format
!! it is not .jpg format
-- 0 we found the.jpg format and scrape it
-- 1 we found the.jpg format and scrape it
-- 2 we found the.jpg format and scrape it
-- 3 we found the.jpg format and scrape it
-- 4 we found the.jpg format and scrape it
-- 5 we found the.jpg format and scrape it
-- 6 we found the.jpg format and scrape it
-- 7 we found the.jpg format and scrape it
-- 8 we found the.jpg format and scrape it
-- 9 we found the.jpg format and scrape it
-- 10 we found the.jpg format and scrape it
-- 11 we found the.jpg format and scrape it
-- 12 we found the.jpg format and scrape it
-- 13 we found the.jpg format and scrape it
-- 14 we found the.jpg format and scrape it
-- 15 we found the.jpg format and scrape it
-- 16 we found the.jpg format and scrape it
-- 17 we found the.jpg format and scrape it
-- 18 we found the.jpg format and scrape it
-- 19 we found the.jpg format and scrape it
-- 20 we found the.jpg format and scrape it
-- 21 we found the.jpg format and sc
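The failures at the top of the output most likely come from `src` values that are not complete URLs. One way to handle this is to resolve each `src` against the page URL and read the extension from the path, as in the sketch below. The page URL and the `<img>` tags are made up for illustration, and `targets` is a hypothetical name.

```python
import os
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

# Hypothetical page URL and image tags.
page_url = "https://www.example.com/articles/post/"
html_doc = """
<img src="/images/logo.png">
<img src="photos/apple-1977.jpg">
<img src="data:image/gif;base64,R0lGODlhAQABAAAAACw=">
<img alt="no source">
"""
soup_demo = BeautifulSoup(html_doc, "html.parser")

targets = []
for tag in soup_demo.find_all("img"):
    src = tag.get("src")
    # Skip tags without a src and inline data: URIs, which cannot be downloaded.
    if not src or src.startswith("data:"):
        continue
    absolute_url = urljoin(page_url, src)
    # Take the extension from the URL path so files keep their real suffix.
    extension = os.path.splitext(urlparse(absolute_url).path)[1].lower()
    targets.append((absolute_url, extension))

for absolute_url, extension in targets:
    print(extension, absolute_url)
```

Each pair in `targets` could then be downloaded with `requests.get` and saved with the matching extension rather than always using `.jpg`.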