In [33]:
from __future__ import print_function

from bs4 import BeautifulSoup

import requests

# Beautiful Soup on test data

Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Below, we create a simple HTML page that includes some frequently used tags. 
Note, however, that we have also left one paragraph tag unclosed. 

In [34]:
source = """
<!DOCTYPE html>  
<html>  
  <head>
    <title>Scraping</title>
  </head>
  <body class="col-sm-12">
    <h1>section1</h1>
    <p>paragraph1</p>
    <p>paragraph2</p>
    <div class="col-sm-2">
      <h2>section2</h2>
      <p>paragraph3</p>
      <p>unclosed
    </div>
  </body>
</html>  
"""

soup = BeautifulSoup(source, "html.parser")

Once the soup object has been created, we can run a number of queries against the DOM. 
The examples below show how to extract tags, how to get the text from tags, how to filter queries by 
attribute, how to retrieve attributes from a returned query, and how the BeautifulSoup parser 
tolerates unclosed tags. 
We first print the parsed document and then request the `head` tag. 
Note that although the result of `find_all` looks like a list of strings, it is actually a list-like `ResultSet` of `bs4.element.Tag` objects. 
Notice that in the original HTML source, the last paragraph is not closed. 

In [35]:
print(soup.prettify())
# The parser repairs the HTML source by adding the missing </p> to the unclosed paragraph

<!DOCTYPE html>
<html>
 <head>
  <title>
   Scraping
  </title>
 </head>
 <body class="col-sm-12">
  <h1>
   section1
  </h1>
  <p>
   paragraph1
  </p>
  <p>
   paragraph2
  </p>
  <div class="col-sm-2">
   <h2>
    section2
   </h2>
   <p>
    paragraph3
   </p>
   <p>
    unclosed
   </p>
  </div>
 </body>
</html>



In [36]:
print('Head:')
print('', soup.find_all("head"))
# [<head>\n<title>Scraping</title>\n</head>]

Head:
 [<head>
<title>Scraping</title>
</head>]


In [37]:
print('\nType of head:')
# In Python 3, map() is lazy, so wrap it in list() to see the types
print('', list(map(type, soup.find_all("head"))))
# [<class 'bs4.element.Tag'>]


Type of head:
 [<class 'bs4.element.Tag'>]


In [38]:
print('\nTitle tag:')
print('', soup.find("title"))
# <title>Scraping</title>


Title tag:
 <title>Scraping</title>


In [39]:
print('\nTitle text:')
print('', soup.find("title").text)
# Scraping


Title text:
 Scraping


In [40]:
divs = soup.find_all("div", attrs={"class": "col-sm-2"})
print('\nDiv with class=col-sm-2:')
print('', divs)
# [<div class="col-sm-2">....</div>]


Div with class=col-sm-2:
 [<div class="col-sm-2">
<h2>section2</h2>
<p>paragraph3</p>
<p>unclosed
    </p></div>]


In [41]:
print('\nClass of first div:')
print('', divs[0].attrs['class'])
# ['col-sm-2']


Class of first div:
 ['col-sm-2']


In [42]:
print('\nAll paragraphs:')
print('', soup.find_all("p"))
# [<p>paragraph1</p>, 
#  <p>paragraph2</p>, 
#  <p>paragraph3</p>, 
#  <p>unclosed\n    </p>]


All paragraphs:
 [<p>paragraph1</p>, <p>paragraph2</p>, <p>paragraph3</p>, <p>unclosed
    </p>]
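
The text of the unclosed paragraph above carries the whitespace that preceded `</div>`. As a minimal sketch (the shortened HTML snippet is an assumption made for illustration, not the page used above), `get_text(strip=True)` trims that whitespace, and `Tag.get` reads an attribute without raising `KeyError` when it is missing:

```python
from bs4 import BeautifulSoup

# Shortened, illustrative snippet with the same kind of unclosed <p> as above
snippet = """
<div class="col-sm-2">
  <p>paragraph3</p>
  <p>unclosed
</div>
"""

demo_soup = BeautifulSoup(snippet, "html.parser")

# get_text(strip=True) strips leading/trailing whitespace from each piece of text
paragraph_texts = [p.get_text(strip=True) for p in demo_soup.find_all("p")]
print(paragraph_texts)

demo_div = demo_soup.find("div")
# class is a multi-valued attribute, so it comes back as a list
div_classes = demo_div.get("class")
# .get returns None (or a supplied default) when the attribute is absent
div_id = demo_div.get("id")
print(div_classes, div_id)
```

Indexing with `demo_div["id"]` would raise `KeyError` here, which is why `.get` is often the safer choice on pages whose markup you do not control.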


# Beautiful Soup on real data 

In this example I will show how you can use BeautifulSoup to retrieve information from live web pages. 
We use The Guardian newspaper's website and retrieve the HTML of an arbitrary article. 
We then create the BeautifulSoup object and query the links found in the DOM. 
Since a large number of links are returned, we then apply attribute filters that significantly reduce 
the number of returned links. 
I chose the filters in this example to focus on the names mentioned in the article. 
The attribute values were found using the `inspect` functionality of Google Chrome.

In [43]:
url = 'https://www.theguardian.com/world/2021/jan/21/johnson-raises-fears-of-covid-lockdown-in-england-continuing-into-summertime'
req = requests.get(url)
source = req.text
soup = BeautifulSoup(source, 'html.parser')

In [44]:
print(source)

m" role="none">
<a class="menu-item__title" role="menuitem" href="https://www.theguardian.com/food" data-link-name="nav2 : secondary : Food">
Food
</a>
</li>
<li class="menu-item" role="none">
<a class="menu-item__title" role="menuitem" href="https://www.theguardian.com/tone/recipes" data-link-name="nav2 : secondary : Recipes">
Recipes
</a>
</li>
<li class="menu-item" role="none">
<a class="menu-item__title" role="menuitem" href="https://www.theguardian.com/uk/travel" data-link-name="nav2 : secondary : Travel">
Travel
</a>
</li>
<li class="menu-item" role="none">
<a class="menu-item__title" role="menuitem" href="https://www.theguardian.com/lifeandstyle/health-and-wellbeing" data-link-name="nav2 : secondary : Health &amp; fitness">
Health &amp; fitness
</a>
</li>
<li class="menu-item" role="none">
<a class="menu-item__title" role="menuitem" href="https://www.theguardian.com/lifeandstyle/women" data-link-name="nav2 : secondary : Women">
Women
</a>
</li>
<li class="menu-item" role="none">

In [45]:
links = soup.find_all('a')
links

[<a class="u-h skip" data-link-name="skip : main content" href="#maincontent">Skip to main content</a>,
 <a class="new-header__logo" data-link-name="nav2 : logo" href="https://www.theguardian.com/uk">
 <span class="u-h">The Guardian - Back to home</span>
 <span class="inline-the-guardian-logo inline-logo">
 <svg class="inline-the-guardian-logo__svg inline-logo__svg" viewbox="0 0 297 95">
 <path d="M66.8 50.7l5-2.6V8.4H68l-9.3 12.4h-1L58.2 7h40.5l.6 13.8h-1.1L89 8.4h-3.9V48l5 2.7V52H66.9v-1.3zm37-1.8V5L100 3.5v-.9L114.2.1h1.5v20.8l.3-.4a19 19 0 0 1 12.2-4.5c6.2 0 9 3.5 9 10v23l3.3 1.7V52H122v-1.3l3.4-1.8V26c0-3.6-1.6-5-4.6-5a7.8 7.8 0 0 0-4.9 1.6V49l3.3 1.8V52h-18.5v-1.2zm48.4-13.4c.4 7.2 3.6 12.8 11.4 12.8 3.7 0 6.3-1.7 8.8-3v1.5a17.4 17.4 0 0 1-13.6 6.2c-12 0-18-6.6-18-18.1 0-11.3 6.6-18.3 17.4-18.3 10.2 0 15.5 5 15.5 18.4v.4zm-.2-1.7l10.5-.7c0-9-1.5-15-4.6-15-3.3 0-5.9 7-5.9 15.6M0 69.6c0-19.1 12.7-26 26.8-26 6 0 11.6 1 14.8 2.3l.3 13.4h-1.4l-8.3-13a12.2 12.2 0 0 0-5.2-.8c-7.5 0-11.3

In [46]:
links = soup.find_all('a', attrs={
    'data-component': 'auto-linked-tag'
})

for link in links: 
    print(link['href'], link.text)

https://www.theguardian.com/politics/boris-johnson Boris Johnson
https://www.theguardian.com/uk-news/england England
https://www.theguardian.com/politics/priti-patel Priti Patel
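
Because the live page may change or be unavailable, the same filter can be tried on a hand-written fragment. This is only a sketch: the fragment and its `example.com` URLs are hypothetical, and only the `data-component="auto-linked-tag"` attribute is taken from the query above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two tagged links and one ordinary navigation link
fragment = """
<p>
  <a href="https://example.com/people/alice" data-component="auto-linked-tag">Alice</a>
  met
  <a href="https://example.com/places/wonderland" data-component="auto-linked-tag">Wonderland</a>
  officials.
</p>
<a href="https://example.com/food" data-link-name="nav">Food</a>
"""

fragment_soup = BeautifulSoup(fragment, "html.parser")

# Unfiltered query: every <a> tag in the fragment
all_links = fragment_soup.find_all("a")

# Filtered query: only <a> tags whose data-component attribute matches
tagged_links = fragment_soup.find_all("a", attrs={"data-component": "auto-linked-tag"})

print(len(all_links), "links in total,", len(tagged_links), "after filtering")
tagged = [(link["href"], link.text) for link in tagged_links]
for href, text in tagged:
    print(href, text)
```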


# Chaining queries

Now, let us consider a more general query that might be run on a website such as this. 
We will query the main technology page and attempt to list all articles linked from it.

In [47]:
url = 'https://www.theguardian.com/uk/technology'
req = requests.get(url)
source = req.text
soup = BeautifulSoup(source, 'html.parser')

After inspecting the DOM (via the `inspect` tool in my browser), I see that the attribute that identifies 
a link to a `technology` article is: 
    
    class = "js-headline-text"

In [48]:
articles = soup.find_all('a', attrs={
    'class': 'js-headline-text'
})

for article in articles: 
    print(article['href'], article.text[:20])

https://www.theguardian.com/technology/2021/feb/01/price-of-bitcoin-jumps-after-elon-musk-says-it-is-a-good-thing Price jumps after El
https://www.theguardian.com/technology/2021/jan/31/the-tyranny-of-passwords-is-it-time-for-a-rethink Is it time for a ret
https://www.theguardian.com/technology/2021/feb/01/microsofts-bing-ready-to-step-in-if-google-pulls-search-from-australia-minister-says Bing ready to step i
https://www.theguardian.com/technology/2021/jan/30/facebook-letting-fake-news-spreaders-profit-investigators-claim Firm ‘still making m
https://www.theguardian.com/business/2021/jan/29/robinhood-to-restore-gamestop-trading-as-it-wins-1bn-backing Shares surge again a
https://www.theguardian.com/commentisfree/2021/jan/30/forget-the-furore-over-trump-facebook-is-interested-only-in-maintaining-its-monopoly Forget the furore ov
https://www.theguardian.com/lifeandstyle/2021/jan/31/hot-tip-to-tackle-the-covid-blues Hot tip to tackle th
https://www.theguardian.com/business/2021/jan/31/re
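
For reference, Beautiful Soup offers other ways to express the same class filter. The sketch below, run on a hypothetical fragment rather than the live page, compares the `attrs` dictionary used above with the `class_` keyword argument and a CSS selector passed to `select`. All three match a tag if `js-headline-text` is among its classes; the class name `extra-class` and the URLs are made up for the demonstration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second link carries two classes
headline_fragment = """
<a class="js-headline-text" href="https://example.com/a">First headline</a>
<a class="extra-class js-headline-text" href="https://example.com/b">Second headline</a>
<a class="menu-item__title" href="https://example.com/c">Not a headline</a>
"""

headline_soup = BeautifulSoup(headline_fragment, "html.parser")

# 1. attrs dictionary, as in the cell above
by_attrs = [a["href"] for a in headline_soup.find_all("a", attrs={"class": "js-headline-text"})]

# 2. class_ keyword (trailing underscore because `class` is reserved in Python)
by_keyword = [a["href"] for a in headline_soup.find_all("a", class_="js-headline-text")]

# 3. CSS selector
by_selector = [a["href"] for a in headline_soup.select("a.js-headline-text")]

print(by_attrs)
print(by_attrs == by_keyword == by_selector)
```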

With this set of articles, it is now possible to chain further queries, for example with code 
similar to the following: 

```python
for article in articles: 
    # Fetch and parse each linked article in turn
    req = requests.get(article['href'])
    source = req.text 
    soup = BeautifulSoup(source, 'html.parser') 
    
    # ... and so on ...
```
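
A slightly fuller sketch of this chaining is given below. The helper names `article_urls` and `crawl_titles`, the injected `fetch` callable, and the delay between requests are all assumptions made for illustration. The demo uses canned HTML so that it runs without network access.

```python
import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def article_urls(index_html, base_url):
    """Return unique, absolute hrefs of headline links, in page order."""
    index_soup = BeautifulSoup(index_html, "html.parser")
    urls = []
    # href=True skips any matching <a> tag that has no href attribute
    for a in index_soup.find_all("a", attrs={"class": "js-headline-text"}, href=True):
        href = urljoin(base_url, a["href"])  # resolve relative links
        if href not in urls:
            urls.append(href)
    return urls


def crawl_titles(urls, fetch, delay=1.0):
    """Fetch each URL with the supplied callable and map it to its <title> text."""
    titles = {}
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay)  # pause between consecutive requests
        page_soup = BeautifulSoup(fetch(url), "html.parser")
        title = page_soup.find("title")
        titles[url] = title.get_text(strip=True) if title else None
    return titles


# Canned pages standing in for real responses (hypothetical content)
pages = {
    "https://example.com/tech/one": "<html><head><title>One</title></head></html>",
    "https://example.com/tech/two": "<html><head><title>Two</title></head></html>",
}
index_html = """
<a class="js-headline-text" href="/tech/one">One</a>
<a class="js-headline-text" href="/tech/one">One (duplicate link)</a>
<a class="js-headline-text" href="https://example.com/tech/two">Two</a>
"""

found = article_urls(index_html, "https://example.com/tech")
titles = crawl_titles(found, fetch=pages.__getitem__, delay=0)
print(titles)
```

With `requests`, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, ideally combined with `raise_for_status()` so that error pages are not parsed as articles.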

However, I won't go into much detail about this now. For scraping like this, tools such as `scrapy` are more 
appropriate than `BeautifulSoup` since they are designed for concurrent, large-scale web crawling. 
Once again, however, I urge caution and hope that before any crawling is initiated you determine whether 
crawling is within the terms of use of the website. 
If in doubt, contact the website administrators. 

https://scrapy.org/
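
One machine-readable signal to check before crawling is a site's `robots.txt`. It does not replace reading the terms of use, but the standard-library `urllib.robotparser` module can evaluate it. In the sketch below the rules and the user agent name are hypothetical and are parsed from a string; for a real site one would call `set_url(...)` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

agent = "my-tutorial-bot"  # made-up user agent name

# can_fetch answers: may this agent request this URL under the parsed rules?
allowed_public = parser.can_fetch(agent, "https://example.com/technology")
allowed_private = parser.can_fetch(agent, "https://example.com/private/page")

# crawl_delay returns the requested pause in seconds, or None if unspecified
delay = parser.crawl_delay(agent)

print(allowed_public, allowed_private, delay)
```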