Contents
---
- [HTTP](#http)
- [Sockets](#sockets)
- [urllib](#urllib)
- [Web Scraping](#scraping)

This module is edited from Charles Severance's Python for Informatics book.

HTTP
---
<a class="anchor" id="http"></a>

Let's now look at reading information from the internet instead of files. 

### Sockets & HyperText Transfer Protocol (HTTP)

The network protocol that powers the web is actually quite simple and there is built-in support in Python called sockets which makes it very easy to make network connections and retrieve data over those sockets in a Python program.

A socket is much like a file, except that a single socket provides a two-way connection between two programs. You can both read from and write to the same socket. If you write something to a socket, it is sent to the application at the other end of the socket. If you read from the socket, you are given the data which the other application has sent.


But if you try to read from a socket when the program on the other end has not sent any data, you just sit and wait. If the programs on both ends of the socket simply wait for some data without sending anything, they will wait for a very long time.


So an important part of programs that communicate over the Internet is to have some sort of protocol. A protocol is a set of precise rules that determine who is to go first, what they are to do, and then what the responses are to that message, and who sends next, and so on. In a sense the two applications at either end of the socket are doing a dance and making sure not to step on each other’s toes.

There are many documents which describe these network protocols. The HyperText Transfer Protocol is described in the following document:


http://www.w3.org/Protocols/rfc2616/rfc2616.txt


This is a long and complex 176-page document with a lot of detail. If you find it interesting, feel free to read it all. But if you take a look around page 36 of RFC2616 you will find the syntax for the GET request. To request a document from a web server, we make a connection to the data.pr4e.org server on port 80, and then send a line of the form


GET http://data.pr4e.org/romeo.txt HTTP/1.0


where the second parameter is the web page we are requesting, and then we also send a blank line. The web server will respond with some header information about the document and a blank line followed by the document content.
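
As a minimal sketch of the message just described, the request can be built as a byte string before any socket is opened. The helper name `build_get_request` is hypothetical, and the sketch assumes each line ends with a carriage return plus line feed, which is the line ending the HTTP specification calls for.

```python
def build_get_request(url):
    """Return the bytes of a minimal HTTP/1.0 GET request (hypothetical helper)."""
    request_line = 'GET ' + url + ' HTTP/1.0'
    # The first CRLF ends the request line; the second CRLF is the blank
    # line that tells the server the request is complete.
    return (request_line + '\r\n\r\n').encode()


request = build_get_request('http://data.pr4e.org/romeo.txt')
print(request)
```

The result is a bytes object because sockets send and receive bytes, not text.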

Retrieving web pages using sockets
---
<a class="anchor" id="sockets"></a>

Let's write a simple Python program that makes a connection to a web server and follows the rules of the HTTP protocol to request a document and display what the server sends back.

Notice that the website http://data.pr4e.org/romeo.txt contains text from Romeo and Juliet. We can read it off the website using the following program:

In [21]:
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if (len(data) < 1):
        break
    print(data.decode())
mysock.close()

HTTP/1.1 200 OK
Date: Fri, 02 Jun 2017 01:12:57 GMT
Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already si
ck and pale with grief



First the program makes a connection to port 80 on the server data.pr4e.org. Since our program is playing the role of the “web browser”, the HTTP protocol says we must send the GET command followed by a blank line.

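Once we send that blank line, we write a loop that receives data from the socket in chunks of up to 512 bytes and prints each chunk until there is no more data to read (i.e., recv() returns an empty bytes object). Because print() adds a newline after each chunk, a word that straddles a chunk boundary appears split in the output, as "sick" does above.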
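
One way to avoid words being split at chunk boundaries is to collect all of the chunks first and decode them once at the end. The sketch below is only an illustration: `read_all` and `FakeSocket` are hypothetical names, and `FakeSocket` stands in for a real connected socket so that the code runs without a network connection.

```python
def read_all(sock, chunk_size=512):
    """Collect every chunk from sock until recv() returns empty bytes."""
    chunks = []
    while True:
        data = sock.recv(chunk_size)
        if len(data) < 1:
            break
        chunks.append(data)
    # Join the raw bytes first, then let the caller decode them in one step.
    return b''.join(chunks)


class FakeSocket:
    """Hypothetical stand-in for a connected socket (offline illustration)."""

    def __init__(self, payload):
        self.payload = payload

    def recv(self, size):
        # Hand back at most `size` bytes and keep the rest for the next call.
        chunk, self.payload = self.payload[:size], self.payload[size:]
        return chunk


# A chunk size of 17 forces a boundary in the middle of the word "sick".
fake = FakeSocket(b'Who is already sick and pale with grief\n')
text = read_all(fake, chunk_size=17).decode()
print(text)
```

With a real socket, `read_all(mysock)` would take the place of the `while` loop in the program above.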

The output starts with headers which the web server sends to describe the document. For example, the Content-Type header indicates that the document is a plain text document (text/plain).


After the server sends us the headers, it adds a blank line to indicate the end of the headers, and then sends the actual data of the file romeo.txt.
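
The blank line makes it straightforward to separate the headers from the document in code. The following sketch works on a shortened, hand-written response rather than on live data, and it assumes the server ends each header line with a carriage return plus line feed; the variable names are illustrative.

```python
# Canned response for illustration only; a real one arrives from the socket.
raw = (b'HTTP/1.1 200 OK\r\n'
       b'Content-Type: text/plain\r\n'
       b'Content-Length: 49\r\n'
       b'\r\n'
       b'But soft what light through yonder window breaks\n')

# Everything before the first blank line is headers; the rest is the document.
head, _, body = raw.partition(b'\r\n\r\n')

lines = head.decode().split('\r\n')
status_line = lines[0]

# Turn each "Name: value" header line into a dictionary entry.
headers = {}
for line in lines[1:]:
    name, _, value = line.partition(':')
    headers[name.strip()] = value.strip()

print(status_line)
print(headers['Content-Type'])
print(body.decode())
```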

This example shows how to make a low-level network connection with sockets. Sockets can be used to communicate with a web server or with a mail server or many other kinds of servers. All that is needed is to find the document which describes the protocol and write the code to send and receive the data according to the protocol.


However, since the protocol that we use most commonly is the HTTP web protocol, Python has a special library specifically designed to support the HTTP protocol for the retrieval of documents and data over the web.


Retrieving web pages using urllib
---
<a class="anchor" id="urllib"></a>

While we can manually send and receive data over HTTP using the socket library, there is a much simpler way to perform this common task in Python by using the urllib library.
Using urllib, you can treat a web page much like a file. You simply indicate which web page you would like to retrieve and urllib handles all of the HTTP protocol and header details.
The equivalent code to read the romeo.txt file from the web using urllib is as follows:

In [20]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief


Once the web page has been opened with urllib.request.urlopen, we can treat it like a
file and read through it using a for loop. When the program runs, we only see the output of the contents of the file. The headers are still sent, but the urllib code consumes the headers and only returns the data to us.
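
Although the loop only sees the document, the headers have not been thrown away. As a small sketch, the object returned by urlopen has a `headers` attribute that can be queried by header name. To keep the example runnable without a network connection, it opens a `data:` URL (which urlopen also accepts) instead of a web page; that substitution is an assumption made for illustration only.

```python
import urllib.request

# A data: URL stands in for a web page so the sketch runs offline.
fhand = urllib.request.urlopen('data:text/plain,But%20soft%20what%20light')

# The headers are stored on the response object rather than in the data.
content_type = fhand.headers.get('Content-Type')
body = fhand.read().decode()

print(content_type)
print(body)
```

Replacing the `data:` URL with 'http://data.pr4e.org/romeo.txt' should give access to the headers sent by the web server in the same way.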

We can extend the urllib code that reads romeo.txt to count how often each word appears:

In [39]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

counts = {}
for line in fhand:
    words = line.split() 
    for word in words:
        counts[word] = counts.get(word,0) + 1 

counts_list=[]
for key, val in counts.items():
    counts_list.append((val,key))

counts_list.sort(reverse = True)
print(counts_list)

[(3, b'the'), (3, b'is'), (3, b'and'), (2, b'sun'), (1, b'yonder'), (1, b'with'), (1, b'window'), (1, b'what'), (1, b'through'), (1, b'soft'), (1, b'sick'), (1, b'pale'), (1, b'moon'), (1, b'light'), (1, b'kill'), (1, b'grief'), (1, b'fair'), (1, b'envious'), (1, b'east'), (1, b'breaks'), (1, b'already'), (1, b'Who'), (1, b'Juliet'), (1, b'It'), (1, b'But'), (1, b'Arise')]


Why is there a "b" before each word? It indicates that the word is stored as bytes rather than as a string. If we want to remove it, we can call .decode() on each word (the dictionary key) as we build the list:

In [40]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

counts = {}
for line in fhand:
    words = line.split() 
    for word in words:
        counts[word] = counts.get(word,0) + 1 

counts_list=[]
for key, val in counts.items():
    counts_list.append((val,key.decode()))

counts_list.sort(reverse = True)
print(counts_list)

[(3, 'the'), (3, 'is'), (3, 'and'), (2, 'sun'), (1, 'yonder'), (1, 'with'), (1, 'window'), (1, 'what'), (1, 'through'), (1, 'soft'), (1, 'sick'), (1, 'pale'), (1, 'moon'), (1, 'light'), (1, 'kill'), (1, 'grief'), (1, 'fair'), (1, 'envious'), (1, 'east'), (1, 'breaks'), (1, 'already'), (1, 'Who'), (1, 'Juliet'), (1, 'It'), (1, 'But'), (1, 'Arise')]
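
An alternative, sketched below, is to decode each line as soon as it is read so that the dictionary never contains bytes at all. The sketch uses `io.BytesIO` with two sample lines as a stand-in for the object returned by urlopen, so it runs offline; the sample text and variable names are for illustration only.

```python
import io

# Stand-in for urlopen(): iterating over it yields lines of bytes.
sample = io.BytesIO(b'But soft what light\n'
                    b'It is the east and Juliet is the sun\n')

counts = {}
for line in sample:
    # Decode once per line; every word is then an ordinary string.
    for word in line.decode().split():
        counts[word] = counts.get(word, 0) + 1

# Sort by count, largest first, without building (count, word) tuples.
top = sorted(counts.items(), key=lambda item: item[1], reverse=True)
print(top[:2])
```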


Here's another example. Suppose we want to get information from the OES Faculty/Staff web page. Read the output carefully below to see where the faculty names are located. 

In [45]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('https://www.oes.edu/academics/upper-school/faculty-staff')
for line in fhand:
    print(line.decode().strip())

<!DOCTYPE html>
<!--[if lte IE 8]>         <html lang="en-US" class="lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-US"> <!--<![endif]-->
<head>
<meta charset="utf-8">

<script type="text/javascript">window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","errorBeacon":"bam.nr-data.net","licenseKey":"87d38be11c","applicationID":"11580779","transactionName":"JVgLEhBaXg4BSxgTWQFSFkkKVFwGCFxoEFQTUA==","queueTime":0,"applicationTime":542,"agent":""}</script>
<script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={xpid:"UwMPVVVUGwIBUVlSAAYO"};window.NREUM||(NREUM={}),__nr_require=function(t,n,e){function r(e){if(!n[e]){var o=n[e]={exports:{}};t[e][0].call(o.exports,function(n){var o=t[e][1][n];return r(o||n)},o,o.exports)}return n[e].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<e.length;o++)r(e[o]);return r}({1:[function(t,n,e){function r(t){try{s.console&&console.log(t)}catch(n){}}var o,i=t("ee"),a=t(15),s={};try{o=local

The names seem to be located below a line that contains the term "FullName." We can make a loop to print out just the names:

In [47]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('https://www.oes.edu/academics/upper-school/faculty-staff')
print_next = False
for line in fhand:
    # Print the line that follows a 'FullName' line, then reset the flag
    if print_next:
        print(line.decode().strip())
        print_next = False
    if b'FullName' in line:
        print_next = True

Deri Bash
Brad Baugher
Carmen Boyle
Peter Buonincontro
Eduard Cecere
Dennis Chang
Chiman Chen
Corbet Clark
Jenny Cleveland
Bevin Daglen
Tessa Daniel
Coleen Davis


### Exercise - Socket 1
Change the socket program to prompt the user for the URL so it can read any web page, not just the Romeo text. You can use split('/') to break the URL into its component parts so you can extract the host name for the socket connect call. Add error checking using try and except to handle the condition where the user enters an improperly formatted or non-existent URL.

In [25]:
#insert socket 1 exercise

### Exercise - Socket 2
Change your socket program so that it counts the number of characters it has received and stops displaying any text after it has shown 3000 characters. The program should retrieve the entire document and count the total number of characters and display the count of the number of characters at the end of the document.

In [26]:
#insert socket 2 exercise

### Exercise - urllib 1
Use urllib to replicate Socket Exercise 2 by (1) retrieving the document from a URL, (2) displaying up to 3000 characters, and (3) counting the overall number of characters in the document. Don't worry about the headers for this exercise, simply show the first 3000 characters of the document contents.

In [None]:
#insert urllib 1 exercise

Web Scraping
---
<a class="anchor" id="scraping"></a>

One of the common uses of the urllib capability in Python is to scrape the web. Web scraping is when we write a program that pretends to be a web browser and retrieves pages, then examines the data in those pages looking for patterns.

As an example, a search engine such as Google will look at the source of one web page and extract the links to other pages and retrieve those pages, extracting links, and so on. Using this technique, Google spiders its way through nearly all of the pages on the web.


Google also uses the frequency of links from pages it finds to a particular page as one measure of how “important” a page is and how high the page should appear in its search results. 
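
The idea of counting links that point to a page can be sketched on a tiny, made-up set of pages. This is only a toy measure under stated assumptions, not Google's actual ranking method: the `links` dictionary is hypothetical data, and each linking page is counted once per target.

```python
from collections import Counter

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    'page1.htm': ['page2.htm', 'page3.htm'],
    'page2.htm': ['page3.htm'],
    'page3.htm': ['page1.htm'],
    'page4.htm': ['page3.htm'],
}

inbound = Counter()
for source, targets in links.items():
    # set() ensures a page that links twice to the same target counts once.
    for target in set(targets):
        inbound[target] += 1

# Pages with the most inbound links come first.
ranking = inbound.most_common()
print(ranking)
```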

BeautifulSoup is one Python package that helps us to scrape the web. BeautifulSoup tolerates highly flawed HTML and still lets you easily extract the data you need.

To install it, type "pip install beautifulsoup4" into your terminal (Terminal on a Mac, Command Prompt on a PC). If that doesn't work, download it directly from this website: https://www.crummy.com/software/BeautifulSoup/

Okay, let's first view what information is contained on the following page:

In [8]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

fhand = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm').read()
print(fhand.decode())


<h1>The First Page</h1>
<p>
If you like, you can switch to the 
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>



Notice that in HTML, headers are enclosed in "h1" tags, web links in "a" tags, and paragraphs in "p" tags. Thus, if we wanted to use BeautifulSoup to search for just the web links, we could type the following:

In [10]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

fhand = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm').read()

soup = BeautifulSoup(fhand, "html.parser")

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

http://www.dr-chuck.com/page2.htm


If we want to get more specific with all of the different types of info stored in the tags, we can type:

In [17]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

fhand = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm').read()

soup = BeautifulSoup(fhand, "html.parser")

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)

TAG: <a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>
URL: http://www.dr-chuck.com/page2.htm
Contents: 
Second Page
Attrs: {'href': 'http://www.dr-chuck.com/page2.htm'}


Notice that the entire HTML tag is stored in "tag", the URL alone can be accessed with tag.get('href', None), the link text with tag.contents[0], and all of the tag's attributes with tag.attrs.

If we wanted instead to search for the headers, we could use "h1":

In [12]:
import urllib.request
from bs4 import BeautifulSoup

fhand = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm').read()

soup = BeautifulSoup(fhand, "html.parser")

headers = soup('h1')
print(headers)

[<h1>The First Page</h1>]
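
For comparison, the link search shown earlier can also be sketched with only Python's standard library, which may help if BeautifulSoup is not installed. This swaps in the built-in `html.parser` module in place of BeautifulSoup; the class name `LinkCollector` is hypothetical, and the page text is pasted in as a string so that the sketch runs offline.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every anchor tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)


# The same HTML that page1.htm returned above, pasted in as a string.
page = '''<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>'''

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

This version needs noticeably more code for the same result, and it makes no attempt to repair badly formed HTML.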


Let's return to our OES faculty example. Look carefully at the teachers' names. Where are they stored? Inside "h3" tags:

In [18]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('https://www.oes.edu/academics/upper-school/faculty-staff')
for line in fhand:
    print(line.decode().strip())

<!DOCTYPE html>
<!--[if lte IE 8]>         <html lang="en-US" class="lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-US"> <!--<![endif]-->
<head>
<meta charset="utf-8">

<script type="text/javascript">window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","errorBeacon":"bam.nr-data.net","licenseKey":"87d38be11c","applicationID":"11580779","transactionName":"JVgLEhBaXg4BSxgTWQFSFkkKVFwGCFxoEFQTUA==","queueTime":0,"applicationTime":456,"agent":""}</script>
<script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={xpid:"UwMPVVVUGwIBUVlSAAYO"};window.NREUM||(NREUM={}),__nr_require=function(t,n,e){function r(e){if(!n[e]){var o=n[e]={exports:{}};t[e][0].call(o.exports,function(n){var o=t[e][1][n];return r(o||n)},o,o.exports)}return n[e].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<e.length;o++)r(e[o]);return r}({1:[function(t,n,e){function r(t){try{s.console&&console.log(t)}catch(n){}}var o,i=t("ee"),a=t(15),s={};try{o=local

Let's try printing out just the h3 info using BeautifulSoup:

In [16]:
import urllib.request
from bs4 import BeautifulSoup

fhand = urllib.request.urlopen('https://www.oes.edu/academics/upper-school/faculty-staff').read()

soup = BeautifulSoup(fhand, "html.parser")

headers = soup('h3')
print(headers)

[<h3 class="fsFullName">
				Deri Bash 
		</h3>, <h3 class="fsFullName">
				Brad Baugher 
		</h3>, <h3 class="fsFullName">
				Carmen Boyle 
		</h3>, <h3 class="fsFullName">
				Peter Buonincontro 
		</h3>, <h3 class="fsFullName">
				Eduard Cecere 
		</h3>, <h3 class="fsFullName">
				Dennis Chang 
		</h3>, <h3 class="fsFullName">
				Chiman Chen 
		</h3>, <h3 class="fsFullName">
				Corbet Clark 
		</h3>, <h3 class="fsFullName">
				Jenny Cleveland 
		</h3>, <h3 class="fsFullName">
				Bevin Daglen 
		</h3>, <h3 class="fsFullName">
				Tessa Daniel 
		</h3>, <h3 class="fsFullName">
				Coleen Davis 
		</h3>]


That's still not quite as pretty as we'd like, but remember that there are multiple attributes stored in the headers. In this case, we want to access its contents:

In [24]:
import urllib.request
from bs4 import BeautifulSoup

fhand = urllib.request.urlopen('https://www.oes.edu/academics/upper-school/faculty-staff').read()

soup = BeautifulSoup(fhand, "html.parser")

headers = soup('h3')
for header in headers:
    print(header.contents[0].strip())

Deri Bash
Brad Baugher
Carmen Boyle
Peter Buonincontro
Eduard Cecere
Dennis Chang
Chiman Chen
Corbet Clark
Jenny Cleveland
Bevin Daglen
Tessa Daniel
Coleen Davis


Notice that by using BeautifulSoup, we didn't need to create a loop to search for "FullName".

Suppose we wanted to find out which department each faculty member works in. First we notice that the departments are stored inside "div" tags that have a class attribute:

In [44]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('https://www.oes.edu/academics/upper-school/faculty-staff')
for line in fhand:
    print(line.decode().strip())

<!DOCTYPE html>
<!--[if lte IE 8]>         <html lang="en-US" class="lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-US"> <!--<![endif]-->
<head>
<meta charset="utf-8">

<script type="text/javascript">window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","errorBeacon":"bam.nr-data.net","licenseKey":"87d38be11c","applicationID":"11580779","transactionName":"JVgLEhBaXg4BSxgTWQFSFkkKVFwGCFxoEFQTUA==","queueTime":0,"applicationTime":532,"agent":""}</script>
<script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={xpid:"UwMPVVVUGwIBUVlSAAYO"};window.NREUM||(NREUM={}),__nr_require=function(t,n,e){function r(e){if(!n[e]){var o=n[e]={exports:{}};t[e][0].call(o.exports,function(n){var o=t[e][1][n];return r(o||n)},o,o.exports)}return n[e].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<e.length;o++)r(e[o]);return r}({1:[function(t,n,e){function r(t){try{s.console&&console.log(t)}catch(n){}}var o,i=t("ee"),a=t(15),s={};try{o=local

Let's first try printing out all of the tags labeled "div":

In [45]:
import urllib.request
from bs4 import BeautifulSoup

fhand = urllib.request.urlopen('https://www.oes.edu/academics/upper-school/faculty-staff').read()
soup = BeautifulSoup(fhand, "html.parser")

divs = soup('div')
print(divs)

[<div id="fsPageWrapper">
<div id="fsMenu">
<div class=" fsMenu fsStyleAutoclear" id="fsEl_493">
<div class="fsElement fsContent close-button-container" id="fsEl_920">
<div class="fsElementContent">
<button class="drawer-trigger" href="#"></button>
</div>
</div>
<div class="fsElement fsNavigation fsList nav-main" id="fsEl_494">
<div class="fsElementContent">
<nav><ul class="fsNavLevel1"><li class="fsNavParentPage"><a href="/aboutoes">ABOUT OES</a><div class="fsNavPageInfo"><ul class="fsNavLevel2"><li><a href="/aboutoes/ataglance">OES At a Glance</a></li><li><a href="/aboutoes/welcome-from-head-of-school">Welcome From Head of School</a></li><li><a href="/aboutoes/mission-vision-identity">Mission, Vision, and Identity</a></li><li><a href="/aboutoes/history">Brief History</a></li><li class="fsNavParentPage"><a href="/aboutoes/leadership">Leadership</a><div class="fsNavPageInfo"><ul class="fsNavLevel3"><li><a href="/aboutoes/leadership/board-of-trustees">Board of Trustees Information</a></

That's still too much info. To narrow down our search, we can use BeautifulSoup's findAll method (spelled find_all in current versions, which keep findAll as an alias) to find the "div" tags whose class is "fsDepartments":

In [46]:
import urllib.request
from bs4 import BeautifulSoup

fhand = urllib.request.urlopen('https://www.oes.edu/academics/upper-school/faculty-staff').read()
soup = BeautifulSoup(fhand, "html.parser")

# Find only the div tags whose class is "fsDepartments"
divs = soup.findAll("div", { "class" : "fsDepartments" })
for div in divs:
    print(div.get_text())
print(divs)


Departments:
    Upper School Administration, Dorms


Departments:
    Information Technology


Departments:
    World Languages


Departments:
    Dorms, Visual & Performing Arts (VaPA)


Departments:
    Information Technology


Departments:
    Athletics, Mathematics


Departments:
    World Languages


Departments:
    Upper School Administration, Administration


Departments:
    Chaplaincy, College Counseling


Departments:
    Science


Departments:
    World Languages


Departments:
    Physical Education, Athletics, Tennis

[<div class="fsDepartments">
<strong>Departments:</strong>
    Upper School Administration, Dorms
</div>, <div class="fsDepartments">
<strong>Departments:</strong>
    Information Technology
</div>, <div class="fsDepartments">
<strong>Departments:</strong>
    World Languages
</div>, <div class="fsDepartments">
<strong>Departments:</strong>
    Dorms, Visual &amp; Performing Arts (VaPA)
</div>, <div class="fsDepartments">
<strong>Departments:</strong>
    
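
The text returned by get_text() still carries the "Departments:" label and extra whitespace. A minimal sketch of tidying one such string with ordinary string methods follows; the sample string is copied from the first result above, and the variable names are illustrative.

```python
# Sample text as returned by div.get_text() for the first faculty member.
raw_text = '\nDepartments:\n    Upper School Administration, Dorms\n'

# Split off the label at the first colon, then split the rest on commas.
label, _, names = raw_text.partition(':')
departments = [name.strip() for name in names.split(',')]

print(departments)
```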

### Exercise - BeautifulSoup 1
Write a program using BeautifulSoup to print the information contained in the paragraphs (between the "p" tags) of the www.oes.edu website. It should look familiar!

In [30]:
import urllib.request
from bs4 import BeautifulSoup

fhand = urllib.request.urlopen('https://www.oes.edu/academics/upper-school/faculty-staff').read()

soup = BeautifulSoup(fhand, "html.parser")

headers = soup('p')
for header in headers:
    print(header)

<p></p>
<p>Oregon Episcopal School is a college preparatory, independent school in Portland, Oregon, serving 860 students from Pre-Kindergarten through Grade 12, including 60 boarding students from around the world in Grades 9-12.</p>
<p></p>


### Exercise - BeautifulSoup 2
Count the number of hyperlink tags on the www.cnn.com website. You don't need to print them, just count them.

In [36]:
#insert Beautiful Soup 2

### Exercise - BeautifulSoup 3
Write a program that prints out the OES faculty's titles.

In [47]:
#insert Beautiful Soup 3


### Exercise - BeautifulSoup 4
Write a program that prints out the OES faculty's phone numbers.

In [48]:
#insert BeautifulSoup 4