<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Objective" data-toc-modified-id="Objective-1">Objective</a></span></li><li><span><a href="#requests" data-toc-modified-id="requests-2">requests</a></span></li><li><span><a href="#HTTP" data-toc-modified-id="HTTP-3">HTTP</a></span></li><li><span><a href="#HTML" data-toc-modified-id="HTML-4">HTML</a></span><ul class="toc-item"><li><span><a href="#Tags" data-toc-modified-id="Tags-4.1">Tags</a></span></li><li><span><a href="#Comments" data-toc-modified-id="Comments-4.2">Comments</a></span></li><li><span><a href="#Hyperlinks" data-toc-modified-id="Hyperlinks-4.3">Hyperlinks</a></span></li><li><span><a href="#Attributes" data-toc-modified-id="Attributes-4.4">Attributes</a></span></li></ul></li><li><span><a href="#Parsing-HTML" data-toc-modified-id="Parsing-HTML-5">Parsing HTML</a></span></li></ul></div>

# The internet

The internet contains a vast wealth of information, and some of it is even useful. In this lesson we will learn how to retrieve information from the internet in a Python program.

## Objective

Let's set ourselves a fairly simple task. We would like to automatically retrieve and print out the text of the headline article on a news website. In addition, we would like to print out any hyperlinks contained in the article text, and show where those links point to.

As an example website, let's use the international edition of [The Guardian](https://www.theguardian.com/international), a mainstream online newspaper. Our first step is to [assign](extras/glossary.ipynb#assignment) a [string](extras/glossary.ipynb#string) variable containing the [URL](extras/glossary.ipynb#url) (more commonly called 'web address') that points to the main page of the site:

In [1]:
url = 'https://www.theguardian.com/international'

Copy this URL into your web browser and navigate to the page. Keep it open so that you can refer to it and compare what you see in the browser to what you see happening in the Spyder console as you try out the example commands below.

## requests

You might already have guessed what the next step is. That's right, we need to [import](extras/glossary.ipynb#import) an additional [package](extras/glossary.ipynb#package) containing some new [functions](extras/glossary.ipynb#function). For retrieving data from the internet, the most popular package is one called `requests`. It isn't part of the Python [standard library](standard_library.ipynb), but it is very widely used and is included in the default Anaconda installation.

In [2]:
import requests

The most important function in the `requests` package, and often the only one we will need, is called `get()`. As the name suggests, it gets the content of a web page. The [argument](extras/glossary.ipynb#argument) is the URL of the page, which in our case we have already stored in a variable:

In [3]:
response = requests.get(url)

So what did we get?

In [4]:
type(response)

requests.models.Response

We get a `Response` object. This variable contains the content of the webpage (assuming that the URL we entered was a valid one), along with some other information.

If you haven't yet had the occasion to learn about how the internet and webpages are structured, you might want to pause for a moment and think about what you are expecting to find once we work out how to get the content of the webpage out of the `Response` object. Will we get an image showing what the webpage looks like when viewed? Or just the plain text content of the page? Or something else?

## HTTP

To understand what we get from `requests.get()` and how it is organized, let's look very briefly at what happens behind the scenes when an application on our computer, for example a web browser, gets some information, for example a web page, from the internet.

When we open a web page in our browser, we just see the content of the page, as if our computer is looking down a tube into the internet and viewing parts of it. But of course what actually happens is rather different. When we navigate to a new web page in our browser, our browser sends out a 'request' for that page. The request is sent via various intermediate computers until reaching a computer on which the page is stored. This computer then sends back a response. In this setup, we say that our browser is the [client](extras/glossary.ipynb#client) program, and the other computer on the internet that controls access to the web page is the [server](extras/glossary.ipynb#server).

HTTP (HyperText Transfer Protocol) is a standard procedure prescribing how requests and responses over the internet should be formulated and transmitted. We do not need to know about the details of HTTP. When we surf the web in our browser, the browser handles implementing the requirements of HTTP. And when we get data from the web in a Python program, `requests` handles this. This is one of the benefits of using a well-written pre-made [package](extras/glossary.ipynb#package) like `requests`; it hides away unnecessary complexity inside [functions](extras/glossary.ipynb#function) that allow us to control just a few important aspects of a task. (The slogan for the `requests` project is '[HTTP for Humans](https://requests.readthedocs.io/en/master/)'.)

The only thing that we need to know about the details of HTTP is that it prescribes certain 'response status codes'. These codes are short three-digit numbers, each of which has a particular meaning concerning our request. For example, the status code '200' means that our request was successful. This code is contained at the start of the response that we receive from the server. The `requests` package places it in an [attribute](extras/glossary.ipynb#attribute) of the `Response` object called `status_code`.

In [5]:
print(response.status_code)

200


The '200' we see here means that the Guardian's server was able to fulfill our request, and has sent us the web page that we wanted.
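
To make the idea of a status code 'at the start of the response' a little more concrete, here is a minimal sketch of how the first line of a response could be picked apart. It assumes the classic plain-text form of HTTP (version 1.1), and the example lines are written out by hand rather than received from a real server. The function name `parse_status_line` is our own invention, not something provided by `requests`.

```python
def parse_status_line(line):
    """Split the first line of an HTTP/1.1 response into its three parts."""
    # Split on at most two spaces, because the reason phrase
    # (for example 'Not Found') may itself contain spaces.
    version, code, reason = line.split(' ', 2)
    return version, int(code), reason


# Hand-written example lines, not real server output.
print(parse_status_line('HTTP/1.1 200 OK'))
print(parse_status_line('HTTP/1.1 404 Not Found'))
```

In practice we never need to do this ourselves; `requests` does the equivalent work for us and hands over the result in `status_code`.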

There are various [other HTTP response codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes), but you will commonly encounter only a few of them. There is one in particular that you might be familiar with already, having seen it displayed occasionally in your web browser. It occurs if we request a page that the server does not have.

In [6]:
response2 = requests.get('https://www.theguardian.com/top_secret_prince_philip_sex_tape')
print(response2.status_code)

404


'404' means 'not found'.

In fact, the `requests` package goes the extra mile for us and also stores an [attribute](extras/glossary.ipynb#attribute) in the `Response` that gives the human-readable meaning of the status code:

In [7]:
print(response.reason)

OK


In [8]:
print(response2.reason)

Not Found
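
As an aside, the standard meanings of the status codes are also available without making any request at all, via the `HTTPStatus` class in the `http` module of the Python standard library. The short sketch below looks up the two codes we have met so far. It is only an illustration; the `reason` attribute that `requests` gives us is usually all we need.

```python
from http import HTTPStatus

# Look up the standard phrase that belongs to each numeric status code.
for code in (200, 404):
    status = HTTPStatus(code)
    print(status.value, status.phrase)
```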


A nice simple flourish that we can add to this first part of our program is a printout confirming the URL of our request (in case we typed it wrong), and the status of our request:

In [9]:
print(response.status_code, response.reason, url)

200 OK https://www.theguardian.com/international


## HTML

As long as our request was successful, the `Response` object will also contain the web page or other data that we requested. We can get it as a [string](extras/glossary.ipynb#string) from the `text` [attribute](extras/glossary.ipynb#attribute).

Since the web page will probably be quite big, we won't print it all out. Let's instead first assign it into a new variable for convenience, and check how long it is:

In [10]:
page = response.text

len(page)

926314

Let's just print out the first thousand characters to see the top of the web page:

In [11]:
print(page[:1000])


<!DOCTYPE html>
<html id="js-context" class="js-off is-not-modern id--signed-out" lang="en" data-page-path="/international">
<head>
<!--
     __        __                      _     _      _
     \ \      / /__    __ _ _ __ ___  | |__ (_)_ __(_)_ __   __ _
      \ \ /\ / / _ \  / _` | '__/ _ \ | '_ \| | '__| | '_ \ / _` |
       \ V  V /  __/ | (_| | | |  __/ | | | | | |  | | | | | (_| |
        \_/\_/ \___|  \__,_|_|  \___| |_| |_|_|_|  |_|_| |_|\__, |
                                                            |___/
    Ever thought about joining us?
    https://workforus.theguardian.com/careers/digital-development/
     --->
<title>News, sport and opinion from the Guardian's global edition | The Guardian</title>
<meta charset="utf-8">
<meta name="description" content="Latest international news, sport and comment from the Guardian"/>
<meta http-equiv="X-UA-Compatible" content="IE=Edge"/>
<meta name="format-detection" content="telephone=no"/>
<meta name="HandheldFriendly" content="Tr

Go back to your web browser and compare this with what you see there. It is of course completely different. This is because the web browser displays web pages in a prettified format. What we are looking at here in Python is what our web browser initially receives from the Guardian server. This is a plain text set of instructions about how the web page is to be displayed. The browser then implements these instructions, putting colors here, images there, highlighting or linking some of the text, and so on, and then shows us the result.

That things should be this way makes a certain amount of sense. It would be horribly inefficient for web pages to be stored online as complete, pixel-by-pixel images of the page content, all in color and already laid out. Not only would this require sending very large files across the internet every time someone requested a web page, it would also make web pages inflexible, since they would look the same on any computer or device, no matter the dimensions of its screen or the preferences of its human user. Instead, web pages are stored as fairly minimal instructions about how to display the content of a web page, and it is then up to the web browser program to implement those instructions, or to ignore or modify some of them in order to adapt the display to a particular device or user, for example making the interface less crowded on a mobile phone, or making the text bigger for a visually-impaired user.

So behind the scenes, the content of a web page is really a set of instructions for a web browser. Instructions for a computer must be written in a formal language of some sort. Web pages are usually written in a language called HTML (HyperText Markup Language).

Let's take a look at the HTML content of the Guardian web page, and see how the HTML language is structured. It would be a bit unwieldy to print it all out in the Python console, but fortunately you can also view the underlying HTML of a web page in your web browser. Go to the Guardian page that you have opened in your browser and try one of the following (which one works will depend on which web browser you are using, but some variant of this should work in any major browser):

* Find a blank area of the web page and click on it with the right button of your mouse. From the menu that appears, select the option 'View Page Source' or something similar-sounding.
* In the toolbar of the browser, click on the main menu button (often at the top right). Select the option 'Page Source' or something similar-sounding. This option may be contained in a sub-menu called something like 'Web Developer' or 'Developer Tools', which provides tools for web programmers.

This should open a separate panel or a whole new tab displaying the plain text HTML of the web page. Something like this:

![](images/page_source.png)

If you can't make this work, then you can use the following commands in your Spyder console to save the string of HTML into a text file, then go and open the text file:

In [12]:
with open('page_source.html', mode='w', encoding='utf-8') as f:
    f.write(page)

But beware that if you just click on this text file in your file explorer it will most probably open in your web browser and show you the prettified web browser version of the page! Open it in a text editor instead.

### Tags

So what is the general [syntax](extras/glossary.ipynb#syntax) of HTML? The 'M' in HTML stands for 'markup', and a [markup language](extras/glossary.ipynb#markup) is a language that consists mainly of normal human-readable text. In a markup language, the text is 'marked up'; that is, it is decorated with surrounding instructions that tell a computer how to display the text. For example, there are instructions for specifying the color and size of the text, for turning text into a clickable link, for inserting images, and so on.

In HTML, these additional instructions take the form of 'tags'. Tags mark up some part of the text by enclosing it in an opening tag and a closing tag. An opening tag begins with the character `<` and ends with `>`. Between these characters is the name of the tag, and possibly some further information. The information in the opening tag says something about how to treat the text that follows. A closing tag also begins and ends with `<` and `>`, but between these characters comes a `/`, followed by the name of the tag that is being closed.

For example, on line 15 of the HTML for the Guardian front page there is some text enclosed in opening and closing 'title' tags:

> `<title>News, sport and opinion from the Guardian's global edition | The Guardian</title>`

The placement of these tags says merely 'this text is the title of the document'. Typically, web browsers display the title in the header of the browser window and/or in the tab in which the web page is open. But in principle it is entirely up to the web browser program what to do with tagged text. If we were writing our own web browser app, we could instead write it such that text tagged as 'title' is displayed in massive overlaid letters across the middle of the screen, or is fed into a text-to-speech algorithm and then auto-tuned to the melody of a popular light opera and sung out of the speakers.
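
To see that 'what to do with tagged text' really is up to the program reading it, here is a minimal sketch of a program that does just one thing with an HTML document: it collects whatever text sits between the opening and closing 'title' tags. It uses the `HTMLParser` class from the `html.parser` module of the Python standard library, which is one of several ways of reading HTML in Python and is chosen here only for illustration. The class name `TitleFinder` and the shortened `sample` document are our own inventions.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.inside_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag, with the tag name in lower case.
        if tag == 'title':
            self.inside_title = True

    def handle_endtag(self, tag):
        # Called for every closing tag.
        if tag == 'title':
            self.inside_title = False

    def handle_data(self, data):
        # Called for the plain text between tags.
        if self.inside_title:
            self.title += data


# A tiny hand-written document, not the real Guardian page.
sample = ("<html><head><title>News, sport and opinion from the Guardian's "
          "global edition | The Guardian</title></head></html>")

finder = TitleFinder()
finder.feed(sample)
print(finder.title)
```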

In the example HTML file [html_examples.html](examples/html_examples.html) you can see some other common HTML tags in action. Remember, if you just download this file and open it, it will probably open in your web browser and show you the finished web page. To see the underlying text HTML, open the file in a text editor instead.

### Comments

Like Python, the HTML language allows for [comments](extras/glossary.ipynb#comment), pieces of text that have no effect on the computer and are there only for human readers. Python comments are simple; anything that follows the hash symbol `#` on a line is a comment. HTML comments are slightly more complex. HTML marks both the start and the end of a comment. A comment begins with the character combination `<!--` and ends with the combination `-->`. Anything between these two groups of characters is ignored by the web browser.

On the Guardian page, you can see a comment near the top of the page used for a recruitment advertisement for web developers. (That particular comment appears to end with `--->`, but the first hyphen is just the last character of the comment text, followed by the usual closing `-->`.)
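
As a small illustration of the comment syntax, the sketch below pulls the comments out of a short piece of HTML using a [regular expression](https://docs.python.org/3/library/re.html) that looks for `<!--`, then as little text as possible, then `-->`. The `sample` string is hand-written for the example. A pattern like this is fine for a quick look, but it is only a rough approach and should not be relied on for messy real-world pages.

```python
import re

# A tiny hand-written piece of HTML containing two comments.
sample = """<head>
<!--
    Ever thought about joining us?
-->
<title>Example</title>
<!-- a second, one-line comment -->
</head>"""

# re.DOTALL lets '.' match line breaks too, so that comments
# spanning several lines are found as well.
comment_pattern = re.compile(r'<!--(.*?)-->', re.DOTALL)

comments = [text.strip() for text in comment_pattern.findall(sample)]
print(comments)
```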

### Hyperlinks

One of the most important features of HTML is that it can specify links from one document to another (hyperlinks). Somewhat confusingly, it is not the `<link>` tag that turns a piece of text into a clickable hyperlink (this tag has another role that we will not go into here). Instead it is the `<a>` tag. The 'a' here stands for 'anchor', in the sense of one text or position in a text being 'anchored' (i.e. linked) to another.

So in HTML, to turn the piece of plain text 'Complaints & corrections' into a clickable link, it would be enclosed in tags like this:

> `<a>Complaints & corrections</a>`

The result looks like this:

<a>Complaints & corrections</a>

### Attributes

Some HTML tags need to specify additional information. For example the `<a>` tag isn't much use without specifying where the hyperlink should point to. On its own, enclosing text in `<a></a>` at most makes a piece of text look a bit like a hyperlink (depending on how the page is styled, it may be highlighted and respond to the mouse hovering over it), but when the text is clicked nothing happens.

To specify additional things about how a piece of tagged text is to be treated, the opening tag may contain one or more 'attributes'. These attributes have specific names and control further details relevant to that type of tag. The syntax for an attribute is to write the attribute name, followed by the equals sign `=`, followed by some value for the attribute. This is syntactically just like [assignment](extras/glossary.ipynb#assignment) in Python, but don't confuse the two; the `=` does not create a variable in HTML.

The 'href' attribute ('hypertext reference') of an `<a>` tag specifies the [URL](extras/glossary.ipynb#url) of the page that the link should point to. So the link to the Guardian's complaints page looks something like this in the underlying HTML:

> `<a href="https://www.theguardian.com/info/complaints-and-corrections">Complaints & corrections</a>`

And now the link on the resulting web page actually points somewhere:

<a href="https://www.theguardian.com/info/complaints-and-corrections">Complaints & corrections</a>
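
Since our objective involves showing where links point to, here is a minimal sketch of how the text and the 'href' attribute of each `<a>` tag could be collected from a piece of HTML. It again uses `HTMLParser` from the standard library's `html.parser` module purely for illustration. The class name `LinkCollector` and the `sample` string are hypothetical, made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (text, href) pairs
        self._in_link = False  # are we currently inside an <a> tag?
        self._href = None
        self._text = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_link = True
            # attrs is a list of (name, value) pairs from the opening tag.
            self._href = dict(attrs).get('href')
            self._text = ''

    def handle_data(self, data):
        if self._in_link:
            self._text += data

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_link:
            self.links.append((self._text.strip(), self._href))
            self._in_link = False


# A hand-written snippet: one link with an href, one without.
sample = ('<p>Read our <a href="https://www.theguardian.com/info/'
          'complaints-and-corrections">Complaints & corrections</a> page, '
          'or this <a>link to nowhere</a>.</p>')

collector = LinkCollector()
collector.feed(sample)

for text, href in collector.links:
    print(text, '->', href)
```

The second link has no 'href' attribute, so its target comes out as `None`.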

## Parsing HTML

We now have the HTML content of our target web page, and our next task is to extract the part of it that we are interested in. The HTML of real-world web pages is often very convoluted and full of all sorts of irrelevant technical and filler information. If you scroll through the HTML of the Guardian front page for example, you will see lots and lots of incomprehensible tags and comments before you get to anything that looks like an article headline.

We are looking for a link to the main headline article. The first step in finding it is to find out whether there is a particular recognizable tag that encloses this link. For this, it is easiest to head to the page source in the web browser and search there first.

In [13]:
import os

try:
    os.remove('page_source.html')
except FileNotFoundError:
    pass