Here we'll see how to retrieve HTML pages via HTTP requests using the `requests` module, and how to interpret and extract useful information from them using the BeautifulSoup (`bs4`) module.

# HTTP requests

We have already used the `requests` module to retrieve JSON content. Let's dive deeper into what we can do with it. First, let's make a successful request and see how to check that it succeeded. Keep in mind the following classes of HTTP return codes:
- 1xx: informational. Stuff like *request received*
- 2xx: success processing the request
- 3xx: redirection, more stuff needs to be done
- 4xx: client error. The request may contain a syntax error
- 5xx: server error. The server could not fulfill the request

In [5]:
import requests

url = 'http://www.python.org'
response = requests.get(url)
print(f'This is the type of the object: {type(response)}')
print(f'And here is the return code for our request: {response.status_code}')
print(f'Overall, was our request successful? {response.ok}')

This is the type of the object: <class 'requests.models.Response'>
And here is the return code for our request: 200
Overall, was our request successful? True
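
To make the return-code classes listed earlier concrete, here is a minimal sketch that maps a status code to its class using only the standard library. The `describe` helper and the `CATEGORIES` table are illustrative names made up for this example; they are not part of `requests`.

```python
from http import HTTPStatus

# First digit of the status code -> class of the response
CATEGORIES = {
    1: 'informational',
    2: 'success',
    3: 'redirection',
    4: 'client error',
    5: 'server error',
}

def describe(code):
    """Return a short description such as '404 Not Found (client error)'."""
    category = CATEGORIES.get(code // 100, 'unknown')
    try:
        phrase = HTTPStatus(code).phrase  # standard reason phrase, if registered
    except ValueError:
        phrase = 'unregistered code'
    return f'{code} {phrase} ({category})'

for code in (200, 301, 404, 500):
    print(describe(code))
```

The same integer is what `response.status_code` holds, so `describe(response.status_code)` would work on a real response too.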


Let's see an error code by passing an invalid URL now.

In [6]:
url = 'http://www.python.org/lasanha.html'
response = requests.get(url)
print(f'Here is our error code: {response.status_code}')
print(f'Overall, was our request successful? {response.ok}')

Here is our error code: 404
Overall, was our request successful? False
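
Instead of inspecting `status_code` or `ok` by hand, a response can also raise an exception on failure through `raise_for_status()`. The sketch below is one possible way to use it. The `check` helper is a hypothetical name, and the responses are built by hand (rather than fetched) only so that the example runs without a network connection.

```python
import requests

def check(response):
    """Return True if the request succeeded, otherwise report the HTTP error."""
    try:
        response.raise_for_status()  # raises requests.HTTPError for 4xx and 5xx codes
    except requests.HTTPError as err:
        print(f'Request failed: {err}')
        return False
    return True

# Hand-built responses stand in for real ones (assumption: offline demo only)
found = requests.Response()
found.status_code = 200

not_found = requests.Response()
not_found.status_code = 404

print(check(found))
print(check(not_found))
```

With a real request you would simply pass the result of `requests.get(url)` to `check`.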


OK. Let's make a request that works again and see how to inspect the content it returns. First let's look at the `content` attribute. It returns the page as a bytes object, so we have to decode it in order to interpret it.

In [21]:
url = 'http://www.google.com'
response = requests.get(url)
print(response.content.decode(encoding='ISO 8859-1'))

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="pt-BR"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="ODwKdyQE0m1Z1ovuP8DPHg==">(function(){window.google={kEI:'fQsmXruyJK745gKr6omgCQ',kEXPI:'0,1353747,5662,730,224,3657,1070,377,207,1244,1711,249,10,1051,175,364,214,711,229,3,205,73,4,60,742,208,10,23,1233,1128212,143,1197729,420,39,329079,1294,12383,4855,32691,15248,867,17444,1954,9286,364,3319,5505,8384,4858,1362,283,4040,4967,3029,4739,7,3111,4882,3033,1808,1976,2044,8909,5297,2054,920,873,1217,1714,1,6651,315,724,7432,3874,2883,19,320,1981,2535,2778,520,399,2277,8,4389,667,612,2212,202,328,149,1103,327,515,515,317,1157,48,4258,260,52,1137,2,2063,606,1839,184,595,1138,43,521,1947,692,55,429,44,1009,95,326,1285,15,84,417,2426,1639,607,474,459,880,508,240,1039,3227,773,1217,331,524,7,1320,1574,4739,3,532
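
Why the encoding passed to `decode()` matters can be seen with a tiny example that needs no request at all. This is just a sketch with a made-up string: the same bytes give different text depending on the encoding used to read them.

```python
# Encode a short string to bytes, as a server would before sending it
raw = 'café'.encode('utf-8')
print(type(raw))

# Decoding with the encoding that was used to encode gives the text back
as_utf8 = raw.decode('utf-8')
print(as_utf8)

# Decoding with a different encoding does not fail here, but garbles the text
as_latin1 = raw.decode('ISO-8859-1')
print(as_latin1)
```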

We can also get the content as a Unicode string through the `text` attribute, which decodes the bytes for us. It's the same page as above.

In [30]:
url = 'http://www.google.com'
response = requests.get(url)
print(response.text)

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="pt-BR"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="9hRozkCZkKARbFWAYEYwmA==">(function(){window.google={kEI:'Xw4mXrzlH5Ka5gKGv6jACQ',kEXPI:'0,1353746,5663,731,223,4727,377,207,2954,250,10,1051,175,364,214,762,178,3,205,73,4,60,690,52,208,10,754,41,449,1128224,143,1197775,374,28,329090,1294,12383,4855,32692,15247,867,28684,363,3320,5505,8384,4858,1362,4323,4968,3028,4739,2900,218,7915,1808,1976,2044,8909,5297,2054,920,873,1217,1714,1,1260,5391,315,724,11308,2881,21,317,1981,2537,1396,1381,520,399,2277,8,85,3598,706,423,244,612,2212,202,329,148,1103,327,513,124,393,318,1156,48,158,662,3438,109,151,52,1137,2,2669,1839,184,595,1182,520,1704,243,693,54,429,44,1009,93,328,1284,16,84,417,2426,1639,608,473,1339,748,1039,3227,773,1548,524,7,1320,1574,1911,1,1282,

The `text` attribute decodes the content using the encoding stored in the `encoding` attribute. It's possible to change its value if we don't like the one that has been assumed.

In [31]:
response.encoding

'ISO-8859-1'
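
Here is a minimal sketch of what changing `encoding` does to `text`. To keep it runnable offline, the response is built by hand and its body is set through the private `_content` attribute. That shortcut is an assumption made for this demo only; a real response gets its body from the network.

```python
import requests

# Hand-built response (demo only): `_content` is a private attribute
response = requests.Response()
response._content = 'São Paulo'.encode('utf-8')

# With the wrong encoding, `text` comes out garbled
response.encoding = 'ISO-8859-1'
garbled = response.text
print(garbled)

# After correcting the encoding, `text` is decoded properly
response.encoding = 'utf-8'
fixed = response.text
print(fixed)
```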

We can also see the headers of the HTTP response.

In [32]:
response.headers

{'Date': 'Mon, 20 Jan 2020 20:32:31 GMT', 'Expires': '-1', 'Cache-Control': 'private, max-age=0', 'Content-Type': 'text/html; charset=ISO-8859-1', 'P3P': 'CP="This is not a P3P policy! See g.co/p3phelp for more info."', 'Content-Encoding': 'gzip', 'Server': 'gws', 'Content-Length': '5291', 'X-XSS-Protection': '0', 'X-Frame-Options': 'SAMEORIGIN', 'Set-Cookie': '1P_JAR=2020-01-20-20; expires=Wed, 19-Feb-2020 20:32:31 GMT; path=/; domain=.google.com; Secure, NID=196=k0FzbnxVXKWkm7mOQRqNcOTlKJeJMMOKZTr7qYNK-s4hjc1R19N-UeSRxQNgdndrZRXYFsYKiuvbRyeo6Y-NlSPYV-0rqP1mR9QH8ocyRiX6s9ASjgTlv7CFBWt2fdsSWwPIe7gU4B323koI_hlWLUBaJMQ-dV_q96UmTbBvL2Y; expires=Tue, 21-Jul-2020 20:32:31 GMT; path=/; domain=.google.com; HttpOnly'}
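
The headers behave like a dictionary whose keys are case-insensitive. The sketch below builds such a dictionary by hand with `CaseInsensitiveDict` from `requests.structures`, using two of the values shown above, so it runs without a request. The way the charset is split out is a simplification for illustration.

```python
from requests.structures import CaseInsensitiveDict

# Stand-in for `response.headers`, with values copied from the output above
headers = CaseInsensitiveDict({
    'Content-Type': 'text/html; charset=ISO-8859-1',
    'Server': 'gws',
})

# Lookups ignore the case of the header name
content_type = headers['content-type']
print(content_type)

# `get` returns a default when the header is absent
print(headers.get('X-Missing', 'n/a'))

# Naive extraction of the charset (assumes 'charset=' is the last parameter)
charset = content_type.split('charset=')[-1]
print(charset)
```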

# Scraping with BeautifulSoup
Normally you will get the HTML code from the web. Here, to make things easier to follow, I'll put a simple piece of HTML in a variable. Note that we will use BeautifulSoup's HTML parser. It also has an XML parser.

In [38]:
from bs4 import BeautifulSoup

html = '''
<html>
<head>
     <title>My title</title>
</head>
<body>
This is just a text <br />
<p class='css1'>first paragraph </p>
<p id="myid">second paragraph </p>
<p>third paragraph </p>
<br />
<a href="https://www.xpto.org" class='mylink'>first link</a><br />
<a href="https://www.xyz.org">second link</a><br />
<br />
<div class="some_class":
    this is inside a div<br />
    <p>paragraph inside a div</p>
</div>
</body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')

Let's get the HTML title.

In [42]:
soup.title

<title>My title</title>
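
The title comes back as a tag, markup included. If only the name of the tag or the text inside it is needed, the `name` and `string` attributes can be used. A small sketch on a cut-down document (the `doc` variable is introduced just for this example):

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup(
    '<html><head><title>My title</title></head><body></body></html>',
    'html.parser',
)

title_tag = doc.title
print(title_tag.name)    # the name of the tag
print(title_tag.string)  # the text inside the tag
```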

Now let's get the whole body.

In [43]:
soup.body

<body>
This is just a text <br/>
<p class="css1">first paragraph </p>
<p id="myid">second paragraph </p>
<p>third paragraph </p>
<br/>
<a class="mylink" href="https://www.xpto.org">first link</a><br/>
<a href="https://www.xyz.org">second link</a><br/>
<br/>
<div :="" a="" class="some_class" div<br="" inside="" is="" this=""></div>
<p>paragraph inside a div</p>
</body>

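This returns the first div inside the body. Note that the sample HTML opens this div with `<div class="some_class":` (a `:` where the closing `>` should be), so the parser treats the text that follows as attributes of the tag. That is why the div looks garbled and appears empty in the outputs here; replacing the `:` with `>` would give a div that contains its text and paragraph.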

In [44]:
soup.body.div

<div :="" a="" class="some_class" div<br="" inside="" is="" this=""></div>

The returned value is not a `str` object. It's actually an object of type `Tag`.

In [49]:
type(soup.body.div)

bs4.element.Tag
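
Since it is a `Tag`, it carries more than its text: the tag name and its attributes are available too. A minimal sketch on a single paragraph (the `id='intro'` attribute is made up for the example):

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup("<p class='css1' id='intro'>first paragraph </p>", 'html.parser').p

print(tag.name)          # name of the tag
print(tag['id'])         # attributes are read like dictionary keys
print(tag['class'])      # class can hold several values, so it comes back as a list
print(tag.attrs)         # all attributes as a dictionary
print(tag.get('style'))  # get() returns None when the attribute is missing
```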

Another way of getting the first tag of a certain type is the `find()` method. Here we get the first paragraph object.

In [51]:
soup.body.find('p')

<p class="css1">first paragraph </p>
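
One detail worth keeping in mind is that `find()` returns `None` when nothing matches, so accessing `.text` on the result without checking would raise an `AttributeError`. A short sketch of guarding against that, on a one-paragraph document created for the example:

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup('<body><p>first paragraph </p></body>', 'html.parser')

paragraph = doc.find('p')
table = doc.find('table')  # there is no table in this document
print(table)

# Check the result before using it
if table is None:
    print('No table found')
else:
    print(table.text)
```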

If we want to get a list of all tags of a certain type, we use the `find_all()` method.


In [52]:
soup.body.find_all('p')

[<p class="css1">first paragraph </p>,
 <p id="myid">second paragraph </p>,
 <p>third paragraph </p>,
 <p>paragraph inside a div</p>]
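
When only the first few matches are needed, `find_all()` accepts a `limit` argument. A minimal sketch with a made-up snippet:

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup('<p>one</p><p>two</p><p>three</p>', 'html.parser')

# Stop searching after two matches
first_two = doc.find_all('p', limit=2)
print(first_two)

# Without a limit, every match is returned
print(len(doc.find_all('p')))
```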

If you want only the text of the tag, you use the `text` attribute.

In [69]:
print(soup.find('p').text)

p_list = soup.find_all('p')
print('----------')
for p in p_list:
    print(p.text)

first paragraph 
----------
first paragraph 
second paragraph 
third paragraph 
paragraph inside a div
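
Notice the trailing space that `text` keeps in the paragraphs above. The `get_text()` method can strip that whitespace and also join the pieces of nested tags with a separator. A sketch on a small snippet written for this example:

```python
from bs4 import BeautifulSoup

snippet = '''
<p>first paragraph </p>
<div>
    this is inside a div<br />
    <p>paragraph inside a div</p>
</div>
'''
doc = BeautifulSoup(snippet, 'html.parser')

# strip=True removes leading and trailing whitespace
clean = doc.find('p').get_text(strip=True)
print(repr(clean))

# A separator keeps the text of nested tags from running together
joined = doc.find('div').get_text(' ', strip=True)
print(joined)
```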


The `find_all()` method can also take a list of tags and search for them all.

In [70]:
soup.find_all(['div', 'p'])

[<p class="css1">first paragraph </p>,
 <p id="myid">second paragraph </p>,
 <p>third paragraph </p>,
 <div :="" a="" class="some_class" div<br="" inside="" is="" this=""></div>,
 <p>paragraph inside a div</p>]

It's possible to select tags by their attributes.

In [75]:
print(soup.find_all('p', id='myid'))
print(soup.find_all('div', class_='some_class')) # notice the _ here. Remember class is a Python reserved word.

[<p id="myid">second paragraph </p>]
[<div :="" a="" class="some_class" div<br="" inside="" is="" this=""></div>]
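
Keyword arguments only work for attribute names that are valid Python identifiers. For names such as `data-role`, which contain a hyphen, the filter can be passed as a dictionary through the `attrs` argument instead. A minimal sketch, with a snippet and attribute values invented for the example:

```python
from bs4 import BeautifulSoup

snippet = '<p data-role="intro">first</p><p data-role="body">second</p><p>third</p>'
doc = BeautifulSoup(snippet, 'html.parser')

# data-role cannot be written as a keyword argument, so we use attrs
intro = doc.find_all('p', attrs={'data-role': 'intro'})
print(intro)

# The same form also works for ordinary attributes such as class
print(doc.find_all('p', attrs={'data-role': 'body'}))
```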


And this is how we'd find all the links in the page. With `href=True` we only get anchor elements that have an `href` attribute, that is, the ones that truly work as hyperlinks.

In [76]:
links = soup.find_all('a', href=True)
print(links)

[<a class="mylink" href="https://www.xpto.org">first link</a>, <a href="https://www.xyz.org">second link</a>]


And this is how you'd get the URL out of the links.

In [78]:
for link in links:
    print(link.get('href'))

https://www.xpto.org
https://www.xyz.org
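
On real pages many links are relative, so the `href` value alone is not a complete URL. One common way to handle this is `urljoin` from the standard library, which resolves each link against the address of the page. In the sketch below the base URL and the snippet are hypothetical:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = 'https://www.xpto.org/docs/'  # hypothetical address of the page
snippet = '''
<a href="page.html">relative link</a>
<a href="/about">root-relative link</a>
<a href="https://www.xyz.org">absolute link</a>
<a>anchor without href</a>
'''
doc = BeautifulSoup(snippet, 'html.parser')

# Resolve every href against the page address; absolute URLs are left as they are
absolute_urls = [urljoin(base_url, a['href']) for a in doc.find_all('a', href=True)]
for url in absolute_urls:
    print(url)
```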
