In [2]:
from bs4 import BeautifulSoup
import requests




# Beautiful soup on test data

Here, we create some simple HTML that includes some frequently used tags.
Note, however, that we have also left one paragraph tag unclosed.

In [3]:
source = """
<!DOCTYPE html>  
<html>  
  <head>
    <title>Scraping</title>
  </head>
  <body class="col-sm-12">
    <h1>section1</h1>
    <p>paragraph1</p>
    <p>paragraph2</p>
    <div class="col-sm-2">
      <h2>section2</h2>
      <p>paragraph3</p>
      <p>unclosed
    </div>
  </body>
</html>  
"""

soup = BeautifulSoup(source, "html.parser")

Once the soup object has been created, we can run a number of queries on the DOM.
First we request all `head` tags.
Note that while the result looks like a list of strings, `find_all` actually returns a list of `bs4.element.Tag` objects.
These examples explore how to extract tags, how to extract the text from tags, how to filter queries based on
attributes, how to retrieve attributes from a returned query, and how the BeautifulSoup engine
tolerates unclosed tags.

In [9]:
print 'Head:'
print '', soup.find_all("head")
# [<head>\n<title>Scraping</title>\n</head>]

print '\nType of head:'
print '', map(type, soup.find_all("head"))
# [<class 'bs4.element.Tag'>]

print '\nTitle tag:'
print '', soup.find("title")
# <title>Scraping</title>

print '\nTitle text:'
print '', soup.find("title").text
# Scraping

divs = soup.find_all("div", attrs={"class": "col-sm-2"})
print '\nDiv with class=col-sm-2:'
print '', divs
# [<div class="col-sm-2">....</div>]

print '\nClass of first div:'
print '', divs[0].attrs['class']
# [u'col-sm-2']

print '\nAll paragraphs:'
print '', soup.find_all("p")
# [<p>paragraph1</p>, 
#  <p>paragraph2</p>, 
#  <p>paragraph3</p>, 
#  <p>unclosed\n    </p>]

Head:
 [<head>\n<title>Scraping</title>\n</head>]

Type of head:
 [<class 'bs4.element.Tag'>]

Title tag:
 <title>Scraping</title>

Title text:
 Scraping

Div with class=col-sm-2:
 [<div class="col-sm-2">\n<h2>section2</h2>\n<p>paragraph3</p>\n<p>unclosed\n    </p></div>]

Class of first div:
 [u'col-sm-2']

All paragraphs:
 [<p>paragraph1</p>, <p>paragraph2</p>, <p>paragraph3</p>, <p>unclosed\n    </p>]
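
As a small follow-on to the queries above, the sketch below collects only the text of each paragraph and reads an attribute with `.get()`, which returns `None` instead of raising `KeyError` when the attribute is absent. The HTML fragment is a trimmed copy of the test data, and the variable names (`fragment`, `paragraph_texts`, and so on) are illustrative assumptions rather than part of the notebook.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the test data, including the unclosed <p> tag
fragment = """
<div class="col-sm-2">
  <h2>section2</h2>
  <p>paragraph3</p>
  <p>unclosed
</div>
"""

fragment_soup = BeautifulSoup(fragment, "html.parser")

# get_text(strip=True) drops the whitespace left behind by the unclosed tag
paragraph_texts = [p.get_text(strip=True) for p in fragment_soup.find_all("p")]

div = fragment_soup.find("div")
div_classes = div.get("class")  # class is multi-valued, so this is a list
div_id = div.get("id")          # attribute is absent, so this is None

print(paragraph_texts)
```

Using `.get()` is a reasonable default when scraping pages whose markup may vary, since a missing attribute then shows up as `None` rather than stopping the script.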


# Beautiful soup on real data 

In this example I will show how you can use BeautifulSoup to retrieve information from live web pages.
We make use of The Guardian newspaper, and retrieve the HTML from an arbitrary article.
We then create the BeautifulSoup object, and query the links that were discovered in the DOM.
Since a large number are returned, we then apply attribute filters that significantly reduce
the number of returned links.
I selected the filters for this example in order to focus on the names mentioned in the article.
The attribute values were discovered by using the `inspect` functionality of Google Chrome.

In [10]:
url = 'https://www.theguardian.com/technology/2017/jan/31/amazon-expedia-microsoft-support-washington-action-against-donald-trump-travel-ban'
req = requests.get(url)
source = req.text
soup = BeautifulSoup(source, 'html.parser')

In [11]:
links = soup.find_all('a')
print links

[<a class="u-h skip" data-link-name="skip : main content" href="#maincontent">Skip to main content</a>, <a aria-haspopup="true" class="brand-bar__item--action popup__toggle" data-link-name="User profile" data-test-id="sign-in-link" data-toggle="popup--profile" data-toggle-signed-in="true" href="https://profile.theguardian.com/signin?INTCMP=DOTCOM_HEADER_SIGNIN">\n<span class="inline-profile-36 inline-icon rounded-icon control__icon-wrapper">\n<svg class="rounded-icon__svg control__icon-wrapper__svg inline-profile-36__svg inline-icon__svg" height="18" viewbox="0 0 18 18" width="18">\n<path d="M9 7.3c1.6 0 3.4-1.8 3.4-3.9S11.1 0 9 0 5.6 1.3 5.6 3.4s2 3.9 3.4 3.9zm5.9 3.4l-.9-.8c-1.7-.6-3.1-.9-5-.9s-3.3.3-5 .9l-.9.9L1 17.2l.9.8h14.3l.9-.9-2.2-6.4z"></path>\n</svg>\n</span>\n<span class="js-profile-info control__info" data-test-id="sign-in-name">sign in</span>\n</a>, <a class="brand-bar__item--action js-comment-activity u-h" data-link-name="Comment activity">Comment activity</a>, <a class=

In [12]:
links = soup.find_all('a', attrs={
    'data-component': 'auto-linked-tag'
})

for link in links: 
    print link['href'], link.text

https://www.theguardian.com/us-news/donaldtrump Donald Trump
https://www.theguardian.com/technology/amazon Amazon
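
The same attribute filter can also be written as a CSS selector via `select`. The sketch below compares the two forms on a small made-up fragment; the HTML and the `example.com` URLs are assumptions for illustration and are not real Guardian markup.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the attribute used above
html = """
<p>
  <a href="https://example.com/about">About</a>
  <a data-component="auto-linked-tag" href="https://example.com/tag/one">Tag One</a>
  <a data-component="auto-linked-tag" href="https://example.com/tag/two">Tag Two</a>
</p>
"""

link_soup = BeautifulSoup(html, "html.parser")

# Attribute filter, as in the cell above
by_attrs = link_soup.find_all("a", attrs={"data-component": "auto-linked-tag"})

# Equivalent CSS attribute selector
by_css = link_soup.select('a[data-component="auto-linked-tag"]')

# Collect (href, text) pairs from the selected links
pairs = [(a["href"], a.get_text()) for a in by_css]
print(pairs)
```

Which form to use is largely a matter of taste; the selector is convenient when it can be copied more or less directly from the browser's inspect tool.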


# Chaining queries

Now, let us consider a more general query that might be done on a website such as this.
We will query the base technology page, and attempt to list all articles linked from this main page.

In [13]:
url = 'https://www.theguardian.com/uk/technology'
req = requests.get(url)
source = req.text
soup = BeautifulSoup(source, 'html.parser')

After inspecting the DOM (via the `inspect` tool in my browser), I see that the attribute that identifies
a `technology` article link is:
    
    class = "js-headline-text"

In [14]:
articles = soup.find_all('a', attrs={
    'class': 'js-headline-text'
})

for article in articles: 
    print article['href'][:70], article.text[:20]

https://www.theguardian.com/technology/2017/jan/31/amazon-expedia-micr Amazon pledges legal
https://www.theguardian.com/technology/2017/jan/31/muslim-video-game-d As a Muslim video-ga
https://www.theguardian.com/technology/2017/jan/31/horizon-zero-dawn-t Horizon Zero Dawn – 
https://www.theguardian.com/technology/2017/jan/31/trump-travel-ban-te #DeleteUber: how tec
https://www.theguardian.com/business/2017/jan/31/hand-ocado-robot-shop Hand delivered: will
https://www.theguardian.com/technology/2017/jan/31/hitman-review-a-bea Hitman review – a be
https://www.theguardian.com/technology/2017/jan/31/apple-record-revenu Apple posts record r
https://www.theguardian.com/technology/2017/jan/27/mark-zuckerberg-don Mark Zuckerberg chal
https://www.theguardian.com/technology/2017/jan/27/ai-artificial-intel AI watchdog needed t
https://www.theguardian.com/technology/2017/jan/27/us-russia-hacking-y Alleged hacker held 
https://www.theguardian.com/technology/askjack/2017/jan/26/what-is-the What is t
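
A class filter like the one above matches any tag that carries the given class among its classes, and it can also be written with the `class_` keyword argument. The sketch below shows both forms on a made-up fragment; the HTML, the extra class names (`fc-item__link`, `nav-link`) and the URLs are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up fragment; only the js-headline-text class comes from the notebook
html = """
<a class="js-headline-text" href="https://example.com/a">First headline</a>
<a class="fc-item__link js-headline-text" href="https://example.com/b">Second headline</a>
<a class="nav-link" href="https://example.com/c">Not a headline</a>
"""

headline_soup = BeautifulSoup(html, "html.parser")

# Dictionary form, as used in the cell above
with_attrs = headline_soup.find_all("a", attrs={"class": "js-headline-text"})

# Keyword form; class_ has a trailing underscore because class is reserved in Python
with_keyword = headline_soup.find_all("a", class_="js-headline-text")

headlines = [a.get_text() for a in with_keyword]
print(headlines)
```

Both calls also match the second link, which has two classes, because the search is applied to each class individually.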

With this set of articles, it is now possible to chain further queries, for example with code
similar to the following:

```python
for article in articles: 
    # Fetch each linked article and parse it into a new soup object
    req = requests.get(article['href'])
    source = req.text 
    soup = BeautifulSoup(source, 'html.parser') 
    
    # ... and so on ...
```

However, I won't go into much detail about this now. For scraping like this, tools such as `scrapy` are more
appropriate than `BeautifulSoup` since they are designed for concurrent web crawling.
Once again, however, I urge caution and hope that before any crawling is initiated you determine whether
crawling is within the terms of use of the website.
If in doubt, contact the website administrators.

https://scrapy.org/
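
One machine-readable signal that can complement (but not replace) reading the terms of use is the site's `robots.txt` file. The sketch below uses the standard library's `urllib.robotparser`, so it assumes Python 3, unlike the Python 2 cells above. The rules and `example.com` URLs are invented so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration; a real crawler would instead load the
# site's own file with parser.set_url(...) followed by parser.read()
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# Ask whether a generic user agent may fetch each URL under these rules
allowed = parser.can_fetch("*", "https://example.com/uk/technology")
blocked = parser.can_fetch("*", "https://example.com/private/page")

print(allowed, blocked)
```

A permissive `robots.txt` does not by itself mean that scraping is permitted, so the advice above about checking the terms of use still applies.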