What Is Web Scraping

Web scraping is the process of gathering information from the internet. Even copying and pasting the lyrics of your favorite song can be considered a form of web scraping! However, the term “web scraping” usually refers to a process that involves automation. Some websites don’t like it when automated scrapers gather their data, which can lead to legal issues, while others don’t mind.

Suppose you’re searching for a job. Instead of checking a job site every day, you can use Python to automate the repetitive parts of your search. With automated web scraping, you write the code once, and it collects the information you need as often as you like and from as many pages as you like. Whether you’re actually on the job hunt or just want to download all the lyrics of your favorite artist automatically, automated web scraping can help you accomplish your goals.

Web scraping steps:

1. Inspect your data source.
2. Scrape HTML content from a page.
3. Parse HTML code with Beautiful Soup.

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
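
As a rough sketch of how the three steps fit together once Beautiful Soup is installed (installation is covered next), the snippet below parses a small hard-coded HTML string in place of a downloaded page. The markup, the `job-title` class name, and the `extract_job_titles` helper are made up for illustration; a real page needs its own selectors, found by inspecting it first.

```python
from bs4 import BeautifulSoup

# Step 1 (inspect): assume inspection showed job titles live in <h2 class="job-title">.
# Step 2 (scrape): a hard-coded string stands in for HTML downloaded from a site.
SAMPLE_HTML = """
<html><body>
  <div class="card"><h2 class="job-title">Python Developer</h2></div>
  <div class="card"><h2 class="job-title">Data Engineer</h2></div>
</body></html>
"""


def extract_job_titles(html):
    """Step 3 (parse): return the text of every <h2 class="job-title">."""
    soup = BeautifulSoup(html, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2", class_="job-title")]


print(extract_job_titles(SAMPLE_HTML))
```

Keeping the parsing step in a function that takes an HTML string makes it easy to try on saved pages before pointing it at a live site.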

Run the command below to install Beautiful Soup.



In [1]:
pip install beautifulsoup4

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.14.3-py3-none-any.whl.metadata (3.8 kB)
Collecting soupsieve>=1.6.1 (from beautifulsoup4)
  Downloading soupsieve-2.8.3-py3-none-any.whl.metadata (4.6 kB)
Collecting typing-extensions>=4.0.0 (from beautifulsoup4)
  Downloading typing_extensions-4.15.0-py3-none-any.whl.metadata (3.3 kB)
Downloading beautifulsoup4-4.14.3-py3-none-any.whl (107 kB)
Downloading soupsieve-2.8.3-py3-none-any.whl (37 kB)
Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
Installing collected packages: typing-extensions, soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.14.3 soupsieve-2.8.3 typing-extensions-4.15.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

type(html_doc)

str

In [4]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



In [5]:
type(soup)

bs4.BeautifulSoup

In [6]:
title = soup.title
title

<title>The Dormouse's story</title>

In [7]:
type(title)

bs4.element.Tag

In [8]:
name = soup.title.name
name

'title'

In [10]:
text = soup.title.text
text

"The Dormouse's story"

In [11]:
string = soup.title.string
string

"The Dormouse's story"

In [12]:
soup.p

<p class="title"><b>The Dormouse's story</b></p>
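
The `.text` and `.string` attributes returned the same value for the title above, but they can differ. The sketch below, which uses a made-up one-line snippet and a separate `demo` object so the notebook's `soup` is left alone, shows the case where a tag has more than one child.

```python
from bs4 import BeautifulSoup

snippet = '<p class="story">Names: <a href="/elsie">Elsie</a> and <a href="/lacie">Lacie</a></p>'
demo = BeautifulSoup(snippet, "html.parser")

# <a> has exactly one child (a text node), so .string returns that text.
single_child = demo.a.string

# <p> has several children (text and tags), so .string is None ...
multi_child_string = demo.p.string

# ... while .text concatenates the text of all descendants.
multi_child_text = demo.p.text

print(repr(single_child), multi_child_string, repr(multi_child_text))
```

In practice this means `.string` is safest on leaf tags, while `.text` or `get_text()` is the more forgiving choice for containers.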

In [14]:
a = soup.a
a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [15]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [17]:
soup.find(id = 'link3')

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
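
Beyond tag names and `id`, `find_all()` can filter on CSS class and cap the number of results, and `select()` accepts CSS selectors. The following sketch uses a trimmed copy of the story markup in a separate `demo` object; the `link4` id is a deliberately missing value used to show the no-match case.

```python
from bs4 import BeautifulSoup

story = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
demo = BeautifulSoup(story, "html.parser")

# class is a reserved word in Python, so the keyword argument is class_.
sisters = demo.find_all("a", class_="sister")

# limit stops the search after the given number of matches.
first_two = demo.find_all("a", limit=2)

# select() and select_one() accept CSS selectors.
by_selector = demo.select("p.story > a")
lacie = demo.select_one("#link2")

# find() returns None when nothing matches, so check before using the result.
missing = demo.find(id="link4")

print(len(sisters), [a["id"] for a in first_two], lacie.get_text(), missing)
```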

In [18]:
for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


In [19]:
print(soup.get_text())


The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
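
The raw `get_text()` output keeps every newline from the source markup. As a small sketch of one way to tidy it, the example below passes the `separator` and `strip` arguments and also uses `stripped_strings`; the fragment is a shortened, slightly altered copy of the story paragraph.

```python
from bs4 import BeautifulSoup

fragment = """
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie">Elsie</a>,
<a href="http://example.com/lacie">Lacie</a> and
<a href="http://example.com/tillie">Tillie</a>.</p>
"""
demo = BeautifulSoup(fragment, "html.parser")

# strip=True trims each text fragment and drops whitespace-only ones;
# separator is placed between the remaining fragments.
flat = demo.get_text(separator=" ", strip=True)

# stripped_strings yields the same trimmed fragments one by one.
pieces = list(demo.stripped_strings)

print(flat)
print(pieces)
```

Because the separator is inserted between every fragment, punctuation that sits in its own text node ends up preceded by a space, so some post-processing may still be needed.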



In [31]:
soup = BeautifulSoup('<b id="bold text" class="boldest another">Extremely bold</b>', 'html.parser')
tag = soup.b
tag

<b class="boldest another" id="bold text">Extremely bold</b>

In [21]:
type(tag)

bs4.element.Tag

In [23]:
tag.name = 'blockquote'
tag

<blockquote class="boldest another" id="bold text">Extremely bold</blockquote>

In [24]:
tag['id']

'bold text'

In [25]:
tag['another attribute'] = 1
tag

<blockquote another attribute="1" class="boldest another" id="bold text">Extremely bold</blockquote>

In [27]:
del tag['another attribute']
tag

<blockquote class="boldest another" id="bold text">Extremely bold</blockquote>

In [28]:
tag.get('id')

'bold text'
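
`tag['id']` and `tag.get('id')` agree when the attribute exists, but they behave differently when it does not. The sketch below illustrates this on a made-up link in a separate `demo` object; the `title` attribute is intentionally absent.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<a href="http://example.com/elsie">Elsie</a>', "html.parser")
link = demo.a

# .get() returns None (or a supplied default) for a missing attribute.
title = link.get("title")
title_or_default = link.get("title", "no title")

# Square brackets raise KeyError for a missing attribute.
try:
    link["title"]
    raised = False
except KeyError:
    raised = True

# has_attr() tests for presence without reading the value.
has_href = link.has_attr("href")

print(title, title_or_default, raised, has_href)
```

When scraping real pages, where attributes are often missing, `.get()` tends to be the safer default.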

In [29]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']

['body', 'strikeout']

In [36]:
soup.b.get_attribute_list('id')

['bold text']

In [37]:
soup.b['id']

'bold text'
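
The cells above show that `class` comes back as a list while `id` stays a single string, even when it contains a space. The sketch below puts both behaviours side by side and also shows the `multi_valued_attributes=None` option, which turns the splitting off; the markup is a made-up example.

```python
from bs4 import BeautifulSoup

markup = '<p id="my id" class="body strikeout">text</p>'

# Default behaviour: class is split on whitespace, id is kept as one string.
default_soup = BeautifulSoup(markup, "html.parser")
class_list = default_soup.p["class"]
id_value = default_soup.p["id"]

# multi_valued_attributes=None turns the splitting off for every attribute.
plain_soup = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)
class_string = plain_soup.p["class"]

# get_attribute_list() always returns a list, whichever mode is in use.
id_list = default_soup.p.get_attribute_list("id")

print(class_list, repr(id_value), repr(class_string), id_list)
```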

**Example**

In [None]:
import requests
from bs4 import BeautifulSoup

url = 'https://data.worldbank.org/country'
res = requests.get(url, timeout=10)  # give up after 10 seconds instead of waiting indefinitely
res.status_code



200
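
Checking `status_code` by eye works in a notebook, but a script usually needs the check in code. Here is a minimal sketch of one way to wrap the request; the `fetch_html` helper, its `session` parameter, and the 10-second timeout are illustrative choices, not part of the original example.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the body of url as text, raising on HTTP errors."""
    # Accepting a session lets callers reuse connections or pass a stub in tests.
    session = session or requests.Session()
    # Without a timeout, requests can wait indefinitely for a response.
    response = session.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

A call such as `fetch_html('https://data.worldbank.org/country')` would then either return the page text or raise an exception, rather than silently handing an error page to the parser.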

In [41]:
soup = BeautifulSoup(res.content, 'html.parser')
soup

<!DOCTYPE html>

<html data-react-checksum="-1928246669" data-reactid="1" data-reactroot=""><head data-reactid="2"><meta charset="utf-8" data-reactid="3"/><title data-react-helmet="true" data-reactid="4">Countries | Data</title><meta content="width=device-width, initial-scale=1, minimal-ui" data-reactid="5" name="viewport"/><meta content="IE=Edge" data-reactid="6" http-equiv="X-UA-Compatible"/><meta content="Countries from The World Bank: Data" data-react-helmet="true" data-reactid="7" name="description"/><link data-reactid="8" href="/favicon.ico?v=1.1" rel="shortcut icon"/><meta content="ByFDZmo3VoJURCHrA3WHjth6IAISYQEbe20bfzTPCPo" data-reactid="9" name="google-site-verification"/><meta content="World Bank Open Data" data-reactid="10" property="og:title"/><meta content="Free and open access to global development data" data-reactid="11" property="og:description"/><meta content="https://data.worldbank.org/assets/images/logo-wb-header-en.svg" data-reactid="12" property="og:image"/><meta 

In [56]:
countries = {}

# Each <section> groups countries under an <h3> heading (a letter such as 'U').
sections = soup.find_all('section')
for section in sections:
    title = section.find('h3')
    if title is None:
        continue  # skip sections that have no heading
    # Collect the text of every link in this section under its heading.
    countries[title.text] = [name.text for name in section.find_all('a')]


In [59]:
countries['U']

['Uganda',
 'Ukraine',
 'United Arab Emirates',
 'United Kingdom',
 'United States',
 'Uruguay',
 'Uzbekistan']
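
The live page can change or be unavailable, so it can help to try the grouping logic on a small fixed input first. The sketch below does that with hypothetical markup that only mimics the `section` / `h3` / `a` layout the loop above relies on; the `group_links_by_heading` name and the sample countries are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the section/h3/a layout the loop relies on.
MOCK_PAGE = """
<section><h3>U</h3><ul>
  <li><a href="/country/uganda">Uganda</a></li>
  <li><a href="/country/ukraine">Ukraine</a></li>
</ul></section>
<section><h3>Z</h3><ul>
  <li><a href="/country/zambia">Zambia</a></li>
</ul></section>
<section><p>Footer without a heading</p></section>
"""


def group_links_by_heading(html):
    """Map each section's <h3> text to the text of the links in that section."""
    soup = BeautifulSoup(html, "html.parser")
    groups = {}
    for section in soup.find_all("section"):
        heading = section.find("h3")
        if heading is None:
            continue  # a section without a heading has no key to file under
        groups[heading.get_text(strip=True)] = [
            a.get_text(strip=True) for a in section.find_all("a")
        ]
    return groups


print(group_links_by_heading(MOCK_PAGE))
```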