# A Simple Introduction to Web Scraping with Beautiful Soup

![](https://github.com/kaopanboonyuen/GISTDA2023/raw/main/img/gistda_day1.png)


Credit: 

[1] https://realpython.com/beautiful-soup-web-scraper-python/

[2] https://www.analyticsvidhya.com/blog/2021/08/a-simple-introduction-to-web-scraping-with-beautiful-soup/

[3] https://www.scrapingbee.com/blog/python-web-scraping-beautiful-soup/

In [None]:
from bs4 import BeautifulSoup 
import requests
import pandas as pd
import re

Beautiful Soup is a Python library for extracting data from HTML and XML files. It parses a page into a tree of objects that mirrors the document, because an HTML document is itself a tree of nested tags. The HTML example below illustrates this concept.


<!-- <!DOCTYPE html>
<html>
<head>
<title>Tutorial of Web scraping</title>
</head>
<body>
<h1>1. Import libraries</h1>
<p>Let's import: </p>
</body>
</html> -->

![](https://cdn-images-1.medium.com/max/1000/1*iOWLHDOtqxgngIOj9N3Hzw.png)
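
To make the tree idea concrete, the sketch below parses a small HTML document like the one above from a string, so no network access is needed. It assumes `html.parser`, the parser bundled with Python; the variable names `html_doc`, `tree`, and `tag_names` are only for illustration.

```python
from bs4 import BeautifulSoup

# A small HTML document held in a string.
html_doc = """<!DOCTYPE html>
<html>
<head>
<title>Tutorial of Web scraping</title>
</head>
<body>
<h1>1. Import libraries</h1>
<p>Let's import: </p>
</body>
</html>"""

tree = BeautifulSoup(html_doc, "html.parser")

# Each tag is a node in the tree; nested tags are its descendants.
print(tree.title.string)       # text inside <title>
print(tree.title.parent.name)  # name of the tag that encloses <title>
print(tree.h1.parent.name)     # name of the tag that encloses <h1>

# find_all(True) matches every tag, in document order, and skips text nodes.
tag_names = [node.name for node in tree.find_all(True)]
print(tag_names)
```

The list of tag names shows the nesting order: `html` first, then `head` and its `title`, then `body` and its children.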

In [None]:
url = 'https://en.wikipedia.org/wiki/Big_data'
req = requests.get(url)
print(req)

In [None]:
soup = BeautifulSoup(req.text,"html.parser")
print(type(soup))

In [None]:
print(soup.prettify()[:100])

## Beautiful Soup DOM Tree
The structure of Beautiful Soup is based on the concept of the DOM, which is used in all web browsers.  The DOM is a tree of all elements in the webpage.  Each element node consists of:
- tag
- innerHTML/outerHTML
- id
- attributes
- parent and children

Note: DOM = Document Object Model 

### Traversing simple HTML's DOM Tree

In our example, the structure is as follows:

```
html
+-- head
|   +-- title
|   +-- meta
|   +-- meta
|   +-- style
+-- body
    +-- div
    |   +-- h1
    |   +-- p
    |       +-- b
    +-- div
    |   +-- a
    |   +-- a
    |   +-- a
    |   +-- a
    +-- div
    |   +--div
    |   |   +-- h2
    |   |   +-- h5
    |   |   +-- ...
    |   +--div
    |       +-- h2
    |       +-- h5
    |       +-- ...
    +-- div
        +-- h2
```

In [None]:
# title is the tag of one of the element nodes in the example.
# we can refer to a node by using its tag name
type(soup.title)

In [None]:
soup.head.style

In [None]:
# we can get the tag name of a node with 'name'
soup.title.name

In [None]:
# we can get outerHTML by converting node to string
str(soup.title)

In [None]:
# we can get the text inside the tag with 'string' (works when the tag has a single child)
soup.title.string

In [None]:
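# 'id' is an attribute, so read it with get(); it is None here because <title> has no id
# (soup.title.id would instead look for a child tag named <id>)
soup.title.get('id')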

In [None]:
# getting the parent node with 'parent'
soup.title.parent.name

In [None]:
# referring to children ('children' is an iterator, so wrap it in list() to see the nodes)
list(soup.title.children)
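
The same node properties are easier to see on a tag that actually has attributes and several children. The sketch below uses a made-up `snippet` string (not part of the Wikipedia page) and the bundled `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, chosen so the node has an id, a class, and two children.
snippet = '<div id="main" class="box note"><h1>Title</h1><p>Body <b>text</b></p></div>'
node = BeautifulSoup(snippet, "html.parser").div

# Attributes live in a dict-like mapping on the tag.
print(node.get("id"))   # the id attribute
print(node["class"])    # class is multi-valued, so it comes back as a list
print(node.attrs)       # all attributes at once

# Dot access looks for a child *tag* with that name, so node.id is None here.
print(node.id)

# .children iterates over direct children only.
child_names = [child.name for child in node.children]
print(child_names)

# .string is None when a tag has more than one child; get_text() joins all text.
print(node.p.string)
print(node.p.get_text())
```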

# Intro to Beautiful Soup: Build a Web Scraper With Python

![](https://realpython.com/cdn-cgi/image/width=960,format=auto/https://files.realpython.com/media/Build-a-Web-Scraper-With-Requests-and-Beautiful-Soup_Watermarked.37918fb3906c.jpg)

Credit: https://realpython.com/beautiful-soup-web-scraper-python/

In [None]:
import requests
from bs4 import BeautifulSoup

URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

results = soup.find(id="ResultsContainer")

job_elements = results.find_all("div", class_="card-content")

for job_element in job_elements:
    print(job_element, end="\n"*2)

In [None]:
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element)
    print(company_element)
    print(location_element)
    print()

In [None]:
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()
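
The loop above calls `.text` on whatever `find()` returns, which raises `AttributeError` if a card lacks one of the elements. A minimal sketch of a more defensive version follows; the markup in `sample_html` is made up to mimic the card layout, and `text_or_none` is a hypothetical helper, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the card layout; the second card has no location.
sample_html = """
<div id="ResultsContainer">
  <div class="card-content">
    <h2 class="title">Data Engineer</h2>
    <h3 class="company">Acme Corp</h3>
    <p class="location">Bangkok</p>
  </div>
  <div class="card-content">
    <h2 class="title">GIS Analyst</h2>
    <h3 class="company">Example Ltd</h3>
  </div>
</div>
"""


def text_or_none(parent, tag, css_class):
    """Return the stripped text of the first matching child, or None if absent."""
    element = parent.find(tag, class_=css_class)
    return element.get_text(strip=True) if element is not None else None


sample_soup = BeautifulSoup(sample_html, "html.parser")

# Collect one dictionary per card instead of printing, so the data can be reused.
jobs = []
for card in sample_soup.find_all("div", class_="card-content"):
    jobs.append({
        "title": text_or_none(card, "h2", "title"),
        "company": text_or_none(card, "h3", "company"),
        "location": text_or_none(card, "p", "location"),
    })

for job in jobs:
    print(job)
```

A list of dictionaries like `jobs` can be passed directly to `pd.DataFrame` if a table is needed.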

# Wikipedia Page Data Extraction

In this tutorial, we will learn how to fetch a static page and extract useful information from it.

We first get a Wikipedia page using requests.

![](https://www.techhub.in.th/wp-content/uploads/2013/12/wikipedia-logo.jpg)

In [None]:
bigdata = requests.get('https://en.wikipedia.org/wiki/Big_data')

In [None]:
len(bigdata.text)

## Parsing a Wikipedia page

In [None]:
soup = BeautifulSoup(bigdata.text, "lxml")
#print(soup.prettify())

In [None]:
soup.title.string

In [None]:
# soup.find_all('a')

In [None]:
for link in soup.find_all('a', limit=15):
    print('{} : {}'.format(link.get('class'), link.get('href')))

In [None]:
pattern = re.compile(r'/wiki/(.*)')

In [None]:
for link in soup.find_all('a', {'class': None}, limit=20):
    href = link.get('href')
    # keep only internal article links, i.e. hrefs that start with /wiki/
    if href is not None and pattern.match(href):
        print(href)
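
Beautiful Soup can also apply the regular expression itself: `find_all` accepts a compiled pattern as an attribute filter, and tags without that attribute are skipped. The sketch below runs on a made-up `sample_links` string rather than the live page; the names `link_soup` and `wiki_pattern` are only for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical anchors: two internal links, one external, one fragment, one without href.
sample_links = """
<a href="/wiki/Data_science">Data science</a>
<a href="https://example.org/">External</a>
<a href="#cite_note-1">[1]</a>
<a name="anchor-without-href">No link</a>
<a href="/wiki/Machine_learning">Machine learning</a>
"""

link_soup = BeautifulSoup(sample_links, "html.parser")

# Anchored with ^ so that only hrefs starting with /wiki/ match.
wiki_pattern = re.compile(r"^/wiki/(.*)")

# The compiled pattern is passed as the href filter.
wiki_links = link_soup.find_all("a", href=wiki_pattern)

# The capture group holds the article name that follows /wiki/.
titles = [wiki_pattern.match(a["href"]).group(1) for a in wiki_links]
print(titles)
```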

In [None]:
a_list = soup.select('div.div-col ul a')
a_list

In [None]:
for e in a_list:
    print(e['href'])

In [None]:
data = []
for e in a_list:
    data.append({ 'keyword' : e.string, 'link' : e['href'] })
df = pd.DataFrame(data)

In [None]:
df

# REST API Data Extraction

![](https://raw.githubusercontent.com/Codecademy/articles/0b631b51723fbb3cc652ef5f009082aa71916e63/images/rest_api.svg)

Gathering data from a REST API is common.  Most single-page applications (SPAs) and AJAX-driven dynamic pages rely on REST APIs.  In addition, most vendor-specific APIs, such as those of Facebook and Twitter, are based on REST.

The most important step in extracting data via a REST API is identifying the endpoint.

In [None]:
import requests
import json
import pprint

In [None]:
api_url = 'http://api.settrade.com/api/market/SET/info'

In [None]:
data_info = requests.get(api_url)
data_info.text

In [None]:
set_info = json.loads(data_info.text)
pprint.pprint(set_info['index'])

In [None]:
market = set_info['index'][0]
print(market['market'], market['last'])

In [None]:
for ind in set_info['index']:
    print(ind['index_name'], ind['last'])
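
The cells above assume that the request succeeds and that every entry has the expected keys. The sketch below shows one way to read such a payload defensively. The JSON in `sample_text` is made up for illustration (only the key names `index`, `index_name`, and `last` come from the cells above), and `last_values` is a hypothetical helper. On a live response, `data_info.raise_for_status()` and `data_info.json()` from requests would play the roles of the status check and of `json.loads`.

```python
import json

# Made-up payload for illustration only; the values are not real market data.
sample_text = (
    '{"index": ['
    '{"index_name": "SET", "last": 1500.25}, '
    '{"index_name": "SET50", "last": 950.5}, '
    '{"index_name": "mai"}'
    ']}'
)


def last_values(payload_text):
    """Map each index name to its last value, skipping incomplete entries."""
    payload = json.loads(payload_text)
    values = {}
    # .get() with a default avoids KeyError if the top-level key is missing.
    for entry in payload.get("index", []):
        name = entry.get("index_name")
        last = entry.get("last")
        if name is not None and last is not None:
            values[name] = last
    return values


summary = last_values(sample_text)
print(summary)
```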

## Data Table Scraping

In [None]:
# Send a GET request to the website
response = requests.get("https://www.w3schools.com/html/html_tables.asp")

In [None]:
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# Find the table you want to scrape
table = soup.find("table", {"id": "customers"})

In [None]:
# Extract the table headers
headers = []
for th in table.find_all("th"):
    headers.append(th.text.strip())

# Extract the table rows and cells
rows = []
for tr in table.find_all("tr"):
    cells = []
    for td in tr.find_all("td"):
        cells.append(td.text.strip())
    if cells:
        rows.append(cells)

In [None]:
# Store the table data in a Pandas DataFrame
df = pd.DataFrame(rows, columns=headers)
df
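
The header and row loops above can be folded into one reusable function. The sketch below is a minimal version that runs on a made-up `sample_table` string instead of the live page; `table_to_records` is a hypothetical helper name, and it assumes a simple table with one header row and no merged cells.

```python
from bs4 import BeautifulSoup

# Hypothetical table with the same shape as the one scraped above.
sample_table = """
<table id="customers">
  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>
  <tr><td>Company A</td><td>Person A</td><td>Thailand</td></tr>
  <tr><td>Company B</td><td>Person B</td><td>Japan</td></tr>
</table>
"""


def table_to_records(table):
    """Turn a simple HTML table into a list of {header: cell} dictionaries."""
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    records = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # The header row has no <td> cells, so it is skipped here.
        if cells:
            records.append(dict(zip(headers, cells)))
    return records


table_soup = BeautifulSoup(sample_table, "html.parser")
records = table_to_records(table_soup.find("table", {"id": "customers"}))
for record in records:
    print(record)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(records)` to get the same kind of table as above.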

# Real Cases: BBC news homepage

![](https://d.newsweek.com/en/full/881613/33-bbc-breaking-news.jpg?w=466&h=311&f=0717db3d760d0f8559be00d641c9f167)

In [None]:
# make a GET request to the BBC news homepage
response = requests.get('https://www.bbc.com/news')

# create a BeautifulSoup object from the response content
soup = BeautifulSoup(response.content, 'html.parser')

# find all the main news headlines and their URLs
main_headlines = soup.find_all('a', class_='gs-c-promo-heading')

# iterate through the headlines and print their text and URLs
for headline in main_headlines:
    print('headline:',headline.text.strip())
    print('href:',headline['href'])
    print()

In [None]:
# make a GET request to the BBC news homepage
response = requests.get('https://www.bbc.com/news')

# create a BeautifulSoup object from the response content
soup = BeautifulSoup(response.content, 'html.parser')

# find all the main news articles
main_articles = soup.find_all('div', class_='gs-c-promo')

# urljoin resolves relative links and leaves absolute ones unchanged
from urllib.parse import urljoin

# iterate through the articles and print their headline, description, and URL
for article in main_articles:
    headline = article.find('a', class_='gs-c-promo-heading')
    description = article.find('p', class_='gs-c-promo-summary')
    # skip promos that lack a headline link or a summary
    if headline is None or description is None or not headline.get('href'):
        continue

    print('headline:', headline.text.strip())
    print('description:', description.text.strip())
    print('url:', urljoin('https://www.bbc.com/', headline['href']))
    print()
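
Links scraped from a page are often relative, and gluing them onto a base URL by string concatenation can produce a double slash or mangle links that are already absolute. A minimal sketch with the standard library's `urljoin` follows; the paths are made up for illustration.

```python
from urllib.parse import urljoin

base = "https://www.bbc.com/news"

# A root-relative link replaces the whole path of the base URL.
root_relative = urljoin(base, "/news/world-12345")
print(root_relative)

# An absolute link is returned unchanged.
absolute = urljoin(base, "https://www.bbc.co.uk/sport")
print(absolute)

# A plain relative link replaces the last path segment of the base URL.
relative = urljoin(base, "articles/abc")
print(relative)
```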


# Image Scraping

![](https://www.enostech.com/wp-content/uploads/2022/04/AdobeStock_474211244.jpg)

In [None]:
import requests
from bs4 import BeautifulSoup
import os

os.makedirs('image_scraping_results', exist_ok = True)

In [None]:
url = 'https://www.webdesignerdepot.com/2009/01/the-evolution-of-apple-design-between-1977-2008/'
response = requests.get(url)
html_content = response.content

In [None]:
# soup = BeautifulSoup(html_content, 'lxml')
# soup

In [None]:
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
headers = {"user-agent": USER_AGENT}

resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.content, "lxml")
#print(soup)

In [None]:
image_tags = soup.find_all('img')
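# some <img> tags have no src attribute, so skip them instead of raising KeyError
image_urls = [tag.get('src') for tag in image_tags if tag.get('src')]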

In [None]:
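saved = 0
for i, image_url in enumerate(image_urls):
    try:
        response = requests.get(image_url, timeout=10)
        # treat HTTP errors such as 404 as failures
        response.raise_for_status()
    except requests.RequestException as err:
        # relative or malformed URLs and network errors end up here
        print(f"!! skipped {image_url}: {err}")
        continue
    # note: every file is saved with a .jpg name, whatever its real format
    with open(f'image_scraping_results/image_{i}.jpg', 'wb') as f:
        f.write(response.content)
    saved += 1
    print(f"-- saved image {i} ({saved} so far)")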
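
The loop above writes every download to a `.jpg` file name, although the server may return PNG, GIF, or even an HTML error page. One way to tell them apart is the `Content-Type` response header. The sketch below is a minimal, assumed approach: `EXTENSIONS` is a hand-written mapping that covers only a few common types, and `extension_for` is a hypothetical helper name.

```python
# Assumed mapping from MIME type to file extension; extend it as needed.
EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/webp": ".webp",
}


def extension_for(content_type):
    """Return a file extension for an image Content-Type value, or None."""
    if not content_type:
        return None
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime = content_type.split(";")[0].strip().lower()
    return EXTENSIONS.get(mime)


print(extension_for("image/jpeg"))
print(extension_for("image/PNG; charset=binary"))
print(extension_for("text/html; charset=utf-8"))
```

In the download loop, the header value is available as `response.headers.get("Content-Type")`; a result of `None` would mean the response is not one of the listed image types and can be skipped.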


In [None]:
!zip -r image_scraping_results.zip image_scraping_results >> tmp