# Web scraping with BeautifulSoup



### Import BeautifulSoup

First, import the BeautifulSoup library. BeautifulSoup is not part of the Python standard library, so it needs to be installed separately (the package is published on PyPI as `beautifulsoup4`).

In [3]:
# import beautiful soup library
from bs4 import BeautifulSoup

To work with BeautifulSoup, you first need some HTML. HTML can either be loaded from a locally stored file, or it can be 'requested' from a web server over HTTP.
To use the second approach, we will use another Python library called `requests`, which makes HTTP requests and handles the responses.

In [4]:
# import requests library
import requests

We can use the `get` function in the requests library to send an HTTP GET request and retrieve a response object. An HTTP request contains header fields, which may give the server additional information about the request. One of these fields is called 'user-agent', and it tells the server what software is making the request on behalf of the user. It may be a good idea to set this header, to try to 'fool' the server into believing the request is coming from a browser.

The response object has a property, `text`, which contains the HTML that was sent in the response.

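In the following example, the HTML for a web page is retrieved. The URL used here points at a Twitter profile, whereas the commented-out table examples further down were written for a page displaying details about a product on the Tesco website, so change the URL to suit your chosen website.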

In [5]:
# set a user-agent to be sent with the request (uncomment to use)
#headers = {
#    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
#}
# request a resource from a specific URL. You might want to change this for your chosen website.
# to send the header, pass it by keyword: requests.get(url, headers=headers)
r = requests.get("https://twitter.com/seanige")

# put the text that is returned in the response in a variable
data = r.text

# look...some HTML has been sent in the response!
# data
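
The header handling described above can be sketched as follows. This is a minimal illustration rather than the notebook's own code: the shortened user-agent string, the `example.com` URL and the `fetch_html` helper are all assumptions. The request is built but not sent, so the outgoing header can be inspected without touching the network.

```python
import requests

# Hypothetical values, for illustration only
headers = {"user-agent": "Mozilla/5.0 (X11; Linux x86_64)"}
url = "https://example.com/"

# Build the request without sending it, so the headers can be inspected
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.headers["user-agent"])


def fetch_html(url, headers=None, timeout=10):
    """Hypothetical helper: return the response body as text."""
    # headers must be passed by keyword; the second positional
    # argument of requests.get is params, not headers
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx statuses
    return response.text
```

Calling `fetch_html(url, headers=headers)` would then perform the actual request and return the same kind of string that `r.text` holds above.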

The raw HTML is not very easy to work with, because it is one long string of markup. We need to 'parse' the HTML (i.e. split it into its component parts), which will make working with it much easier. To do that, we create an object which is an instance of the BeautifulSoup class. This object is a tree-like data structure: it contains the HTML, but in a form we can search and navigate.

In [6]:
# parse the raw HTML into a 'soup' object
soup = BeautifulSoup(data, "html.parser")
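
The other approach mentioned earlier, loading HTML from a locally stored file, might look like the sketch below. The file name `page.html` and its contents are made up for illustration; a temporary directory is used so the example cleans up after itself. The result is stored in `file_soup` so that it does not overwrite the `soup` object created above.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical page contents, standing in for a previously saved file
html = "<html><head><title>Saved page</title></head><body><h1>Hello</h1></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "page.html")

    # Save the HTML to disk
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)

    # BeautifulSoup accepts an open file handle as well as a string
    with open(path, encoding="utf-8") as f:
        file_soup = BeautifulSoup(f, "html.parser")

print(file_soup.title.string)
```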

Now that we have parsed the HTML, we can call methods of the BeautifulSoup class to access specific elements in the data.

### Extract a single element by tag name
For example, the `find` method will return the first available element with a specified tag name:

In [7]:
h1 = soup.find("h1")
print(h1)

<h1>JavaScript is not available.</h1>
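
To experiment with `find` without depending on a live page, a small inline snippet can be parsed instead. The HTML below is invented for illustration. The sketch also shows that `find` returns `None` when no matching element exists, which is worth checking for before using the result.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used here instead of a live page
html = "<html><body><h1>Product details</h1><p>First</p><p>Second</p></body></html>"
snippet = BeautifulSoup(html, "html.parser")

first_p = snippet.find("p")      # only the first <p> is returned
missing = snippet.find("table")  # no <table> in the snippet, so this is None

print(first_p)
print(missing)
```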


### Extract all of a certain element by tag name
The `find_all` method will return all the elements of a certain type:

In [10]:
# get all the table elements with the class product__info-table
#table = soup.find_all("table", attrs={"class": "product__info-table"})
#print(table)
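
Since the cell above is commented out, here is a self-contained sketch of the same call against invented HTML. The class name `product__info-table` comes from the cell above; everything else in the snippet is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML containing two tables, only one with the target class
html = """
<table class="product__info-table"><tr><td>Salt</td></tr></table>
<table class="other"><tr><td>Ignored</td></tr></table>
"""
snippet = BeautifulSoup(html, "html.parser")

# find_all returns a list-like collection of every match
all_tables = snippet.find_all("table")
info_tables = snippet.find_all("table", attrs={"class": "product__info-table"})

print(len(all_tables), len(info_tables))
```

Because the result behaves like a list, a single table is reached by index, e.g. `info_tables[0]`.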

### Filter elements by attribute

HTML elements can have attributes. These are key-value pairs defined inside the opening tag. For example, a hyperlink (anchor) tag has an href attribute specifying the URL to link to:

        <a href="https://twitter.com/seanige">Dr Sean McGrath</a>
        
We can be more specific about which elements to retrieve with `find_all` by including an attribute value:

In [8]:
# extract all the th elements that have the scope attribute with the value 'row'
#rows = table[0].find_all("th", attrs={"scope": "row"})
#rows
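
A runnable version of this kind of attribute filter is sketched below. The table contents are invented; only the `scope="row"` filter reflects the comment in the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical nutrition table with column headers and row headers
html = """
<table class="product__info-table">
  <tr><th scope="col">Typical values</th><th scope="col">Per 100g</th></tr>
  <tr><th scope="row">Energy</th><td>250kJ</td></tr>
  <tr><th scope="row">Salt</th><td>0.5g</td></tr>
</table>
"""
snippet = BeautifulSoup(html, "html.parser")

# Keep only the <th> elements whose scope attribute equals "row"
row_headers = snippet.find_all("th", attrs={"scope": "row"})
labels = [th.get_text() for th in row_headers]

print(labels)
```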

### Filter elements by contents
We may also decide which elements to extract based on their text contents. The `string` argument is compared against the element's text exactly, including any whitespace. For example:

In [None]:
# extract all td elements whose text is exactly 'Data ' (note the trailing space)
#datastore = table[0].find_all("td", string="Data ")
#datastore
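
The sketch below illustrates how strict the text match is, using invented HTML. A plain string has to equal the element's text exactly, so a stray trailing space produces no matches. A compiled regular expression is one way to match more loosely.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical table rows for illustration
html = (
    "<table>"
    "<tr><th>Salt</th><td>0.5g</td></tr>"
    "<tr><th>Sea salt flakes</th><td>n/a</td></tr>"
    "</table>"
)
snippet = BeautifulSoup(html, "html.parser")

exact = snippet.find_all("th", string="Salt")    # exact match only
padded = snippet.find_all("th", string="Salt ")  # trailing space, so no match
loose = snippet.find_all("th", string=re.compile("salt", re.IGNORECASE))

print(len(exact), len(padded), len(loose))
```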

### Extract the next sibling element
We might want to get at the element that comes directly after another element.

For example, let's suppose I want the value contained in the `td` element following the 'Salt' `th`...

In [None]:
# get the text from the next td element after the first matched element
#datastore[0].find_next("td").text
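
Putting the last two steps together, the 'Salt' lookup might be sketched like this. The table and its values are invented. `find_next` searches everything that follows the element in the document, while `find_next_sibling` only considers elements with the same parent, which is the stricter choice when the value is known to sit in the same row.

```python
from bs4 import BeautifulSoup

# Hypothetical nutrition table for illustration
html = """
<table>
  <tr><th scope="row">Energy</th><td>250kJ</td></tr>
  <tr><th scope="row">Salt</th><td>0.5g</td></tr>
</table>
"""
snippet = BeautifulSoup(html, "html.parser")

# Locate the header cell by its text, then step to the following <td>
salt_header = snippet.find("th", string="Salt")
salt_value = salt_header.find_next("td").get_text()

# Same result here, but restricted to cells within the same row
salt_sibling = salt_header.find_next_sibling("td").get_text()

print(salt_value, salt_sibling)
```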

## Further reading
If you are still feeling a bit lost, you may find this [web scraping article](https://blog.hartleybrody.com/web-scraping/) by Hartley Brody helpful.

The BeautifulSoup documentation can be found here: [BS Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)