In [1]:
from __future__ import print_function

from bs4 import BeautifulSoup

import requests

# Beautiful soup on test data

Here, we create some simple HTML that includes some frequently used tags.
Note, however, that we have deliberately left one paragraph tag unclosed.

In [2]:
source = """
<!DOCTYPE html>  
<html>  
  <head>
    <title>Scraping</title>
  </head>
  <body class="col-sm-12">
    <h1>section1</h1>
    <p>paragraph1</p>
    <p>paragraph2</p>
    <div class="col-sm-2">
      <h2>section2</h2>
      <p>paragraph3</p>
      <p>unclosed
    </div>
  </body>
</html>  
"""

soup = BeautifulSoup(source, "html.parser")

Once the soup object has been created, we can run a number of queries on the DOM.
First we request all data from the `head` tag.
Note that although the result looks like a list of strings, each element is actually a `bs4.element.Tag`.
These examples explore how to extract tags and their text, how to filter queries based on
attributes, how to retrieve attributes from a returned query, and how the BeautifulSoup parser
tolerates unclosed tags.

In [4]:
print('Head:')
print('', soup.find_all("head"))
# [<head>\n<title>Scraping</title>\n</head>]

print('\nType of head:')
print('', list(map(type, soup.find_all("head"))))
# [<class 'bs4.element.Tag'>]

print('\nTitle tag:')
print('', soup.find("title"))
# <title>Scraping</title>

print('\nTitle text:')
print('', soup.find("title").text)
# Scraping

divs = soup.find_all("div", attrs={"class": "col-sm-2"})
print('\nDiv with class=col-sm-2:')
print('', divs)
# [<div class="col-sm-2">....</div>]

print('\nClass of first div:')
print('', divs[0].attrs['class'])
# ['col-sm-2']

print('\nAll paragraphs:')
print('', soup.find_all("p"))
# [<p>paragraph1</p>, 
#  <p>paragraph2</p>, 
#  <p>paragraph3</p>, 
#  <p>unclosed\n    </p>]

Head:
 [<head>
<title>Scraping</title>
</head>]

Type of head:
 [<class 'bs4.element.Tag'>]

Title tag:
 <title>Scraping</title>

Title text:
 Scraping

Div with class=col-sm-2:
 [<div class="col-sm-2">
<h2>section2</h2>
<p>paragraph3</p>
<p>unclosed
    </p></div>]

Class of first div:
 ['col-sm-2']

All paragraphs:
 [<p>paragraph1</p>, <p>paragraph2</p>, <p>paragraph3</p>, <p>unclosed
    </p>]
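
The same lookups can also be written with CSS selectors through `select` and `select_one`, which can be more compact than `find_all` with an `attrs` dictionary. The following is a minimal sketch on a shortened copy of the test HTML; the names `html`, `doc`, `nested`, `texts` and `first_div` are chosen only for this illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the test document, including the unclosed <p> tag
html = """
<html>
  <body class="col-sm-12">
    <p>paragraph1</p>
    <div class="col-sm-2">
      <p>paragraph3</p>
      <p>unclosed
    </div>
  </body>
</html>
"""

doc = BeautifulSoup(html, "html.parser")

# "div.col-sm-2 p" selects every <p> nested inside a <div class="col-sm-2">
nested = doc.select("div.col-sm-2 p")

# get_text(strip=True) drops the whitespace left behind by the unclosed tag
texts = [p.get_text(strip=True) for p in nested]
print(texts)

# select_one returns the first match, or None when nothing matches
first_div = doc.select_one("div.col-sm-2")
print(first_div.get("class"))
```

Note that `paragraph1` is not selected, because it sits directly under `body` rather than inside the `div`.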


# Beautiful soup on real data

In this example I will show how you can use BeautifulSoup to retrieve information from live web pages.
We make use of The Guardian newspaper, and retrieve the HTML from an arbitrary article.
We then create the BeautifulSoup object, and query the links that were discovered in the DOM.
Since a large number are returned, we then apply attribute filters that significantly reduce
the number of returned links.
I chose the filters for this example in order to focus on the names in the paper.
The attribute values were found using the `inspect` functionality of Google Chrome.

In [5]:
url = 'https://www.theguardian.com/technology/2017/jan/31/amazon-expedia-microsoft-support-washington-action-against-donald-trump-travel-ban'
req = requests.get(url)
source = req.text
soup = BeautifulSoup(source, 'html.parser')

In [6]:
links = soup.find_all('a')
links

[<a class="u-h skip" data-link-name="skip : main content" href="#maincontent">Skip to main content</a>,
 <a class="dropdown-menu__title dropdown-menu__title--active" data-link-name="nav2 : topbar : edition-picker: UK" href="https://www.theguardian.com/preference/edition/uk">
 <span class="u-h">switch to the </span>
 UK edition
 </a>,
 <a class="dropdown-menu__title " data-link-name="nav2 : topbar : edition-picker: US" href="https://www.theguardian.com/preference/edition/us">
 <span class="u-h">switch to the </span>
 US edition
 </a>,
 <a class="dropdown-menu__title " data-link-name="nav2 : topbar : edition-picker: AU" href="https://www.theguardian.com/preference/edition/au">
 <span class="u-h">switch to the </span>
 Australia edition
 </a>,
 <a class="dropdown-menu__title " data-link-name="nav2 : topbar : edition-picker: INT" href="https://www.theguardian.com/preference/edition/int">
 <span class="u-h">switch to the </span>
 International edition
 </a>,
 <a class="new-header__logo" dat

In [7]:
links = soup.find_all('a', attrs={
    'data-component': 'auto-linked-tag'
})

for link in links: 
    print(link['href'], link.text)

https://www.theguardian.com/us-news/donaldtrump Donald Trump
https://www.theguardian.com/technology/amazon Amazon
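
To experiment with attribute filters without fetching a live page, the same pattern can be tried on a small hand-written snippet. The HTML below is invented for illustration and only imitates the `data-component` attribute used above; it is not the actual Guardian markup, and the `example.com` URLs are placeholders. The sketch also uses `Tag.get`, which is a safer way to read an attribute that may be absent.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the attribute used in the query above
sample = """
<div>
  <a href="#maincontent">Skip to main content</a>
  <a data-component="auto-linked-tag" href="https://example.com/people/a">Person A</a>
  <a data-component="auto-linked-tag">No href here</a>
  <a data-component="other" href="https://example.com/other">Other</a>
</div>
"""

doc = BeautifulSoup(sample, "html.parser")

# Keep only anchors whose data-component attribute matches exactly
tagged = doc.find_all("a", attrs={"data-component": "auto-linked-tag"})

# Tag.get returns None for a missing attribute, whereas tag["href"] raises KeyError
pairs = [(a.get("href"), a.get_text(strip=True)) for a in tagged]
for href, text in pairs:
    print(href, text)
```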


# Chaining queries

Now, let us consider a more general query that might be run on a website such as this.
We will query the base technology page and attempt to list all articles linked from it.

In [8]:
url = 'https://www.theguardian.com/uk/technology'
req = requests.get(url)
source = req.text
soup = BeautifulSoup(source, 'html.parser')

After inspecting the DOM (via the `inspect` tool in my browser), I see that the attributes that define 
a `technology` article are: 
    
    class = "js-headline-text"

In [9]:
articles = soup.find_all('a', attrs={
    'class': 'js-headline-text'
})

for article in articles: 
    print(article['href'][:70], article.text[:20])

https://www.theguardian.com/technology/2018/jan/23/facebook-new-privac Firm to roll out new
https://www.theguardian.com/technology/2018/jan/23/elon-musk-aiming-fo Elon Musk takes aim 
https://www.theguardian.com/media/2018/jan/23/never-get-high-on-your-o Why social media bos
https://www.theguardian.com/technology/2018/jan/23/apple-homepod-avail HomePod finally avai
https://www.theguardian.com/technology/2018/jan/23/bitcoin-ubs-chairma UBS chairman warns a
https://www.theguardian.com/technology/2018/jan/23/cybercrime-130bn-st £130bn stolen from c
https://www.theguardian.com/technology/2018/jan/22/cyber-attack-on-uk- Major cyber-attack o
https://www.theguardian.com/technology/2018/jan/22/rupert-murdoch-face Rupert Murdoch tells
https://www.theguardian.com/us-news/2018/jan/22/amazon-go-convenience- Amazon Go: convenien
https://www.theguardian.com/technology/2018/jan/22/facebook-too-slow-s Facebook: we were to
https://www.theguardian.com/technology/2018/jan/19/tim-cook-i-dont-wan Apple's T

With this set of articles, it is now possible to chain further queries, for example with code
similar to the following:

```python
for article in articles: 
    req = requests.get(article['href'])
    source = req.text 
    soup = BeautifulSoup(source, 'html.parser') 
    
    # ... and so on ...
```
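
One possible way to structure such a chain is to keep the parsing step separate from the fetching step, so that it can be tried out on a plain string first. The sketch below is only an illustration under stated assumptions: `headline_links` is a hypothetical helper, the `listing` HTML is made up, and the `example.com` URLs are placeholders.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def headline_links(html, base_url):
    """Return (absolute_url, headline_text) pairs found in one page."""
    doc = BeautifulSoup(html, "html.parser")
    results = []
    seen = set()
    for a in doc.find_all("a", attrs={"class": "js-headline-text"}):
        href = a.get("href")
        if not href:
            continue  # skip anchors without a target
        url = urljoin(base_url, href)  # resolve relative links against the page URL
        if url in seen:
            continue  # skip repeats in case an article is linked more than once
        seen.add(url)
        results.append((url, a.get_text(strip=True)))
    return results


# Hypothetical listing page used only to exercise the function
listing = """
<a class="js-headline-text" href="/technology/article-1">First headline</a>
<a class="js-headline-text" href="/technology/article-1">First headline</a>
<a class="js-headline-text other" href="https://example.com/article-2">Second headline</a>
<a class="unrelated" href="/about">About</a>
"""

found = headline_links(listing, "https://example.com/uk/technology")
print(found)
```

In a real crawl, `html` would come from `requests.get(url).text`, and each returned URL could then be fetched and parsed in turn, ideally with a pause between requests.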

However, I won't go into much detail about this now. For scraping like this, tools such as `scrapy` are more
appropriate than `BeautifulSoup`, since they are designed for concurrent web crawling.
Once again, however, I urge caution: before starting any crawl, determine whether
crawling is within the terms of use of the website.
If in doubt, contact the website administrators.

https://scrapy.org/
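
As a complementary check, many sites publish a `robots.txt` file stating which paths automated clients may visit. It does not replace reading the terms of use, but Python's standard library can interpret it. The following is a minimal sketch: the `robots_txt` content, the `my-crawler` user agent and the `example.com` URLs are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be downloaded from the site
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(useragent, url) reports whether the rules permit the request
allowed = parser.can_fetch("my-crawler", "https://example.com/technology/article-1")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/page")
print(allowed, blocked)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would download the file instead of parsing a string.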