<img src="https://www.evernote.com/l/AUXIi328hU9Im7vbcG4SauyAkY-8L11rdU8B/image.png">
Python Seminar (AY250) UC Berkeley

In [1]:
%run ../00_AdvancedPythonConcepts/talktools.py

## Outline


1. Websites and webservers

   - `urllib`, `ftplib`, `http.client` (`httplib` in Python 2), `httplib2`, `requests`
   - Parsing with `html5lib`, `BeautifulSoup4`
   - `conda install beautifulsoup4`

2. Transmission Control Protocol (TCP)

   - `socket`
    
3. Breakout Exercise

   - Focus on automating website access

4. Remote Procedure Call

   - `xmlrpc.server`, `xmlrpc.client` (`SimpleXMLRPCServer`, `xmlrpclib` in Python 2)

## Network Communication Overview

- TCP/IP sockets: carry nearly all network communication (UDP is the other common transport)
- TCP (Transmission Control Protocol): exchange data reliably between two network hosts
- IP (Internet Protocol): handles addressing & routing messages across one or more networks
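
As a minimal sketch of the two roles described above, the snippet below opens a TCP socket on the local machine, connects to it, and echoes a message back. The loopback address, the OS-chosen port, and the `echo_once` helper are assumptions made for illustration; nothing here leaves your computer.

```python
import socket
import threading

# Server side: bind to the loopback address; port 0 asks the OS for any free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # SOCK_STREAM means TCP
server.bind(("127.0.0.1", 0))
server.listen(1)
host, port = server.getsockname()  # the IP address and port actually assigned


def echo_once():
    """Accept one connection and send back whatever arrives (hypothetical helper)."""
    conn, _addr = server.accept()
    with conn:
        conn.sendall(conn.recv(1024))


worker = threading.Thread(target=echo_once)
worker.start()

# Client side: connect to (host, port), send bytes, read the reply.
with socket.create_connection((host, port)) as client:
    client.sendall(b"hello over TCP")
    reply = client.recv(1024)

worker.join()
server.close()
print(reply)
```

The (address, port) pair is the IP part of the picture; the reliable, ordered byte stream between the two sockets is the TCP part.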

<hr>
<img src="http://flylib.com/books/3/475/1/html/2/images/0131777203/graphics/14fig02.gif">
<hr>
<img src="http://i.cloud.opensystemsmedia.com/i__srcbc84f1fa314969f2dc009b8711c679ce_paraf0d99c20bd457d46a92c72841873c47.jpeg">
<hr>
<img src="https://microchip.wdfiles.com/local--files/tcpip:tcp-ip-five-layer-model/layer_terminology.JPG">

# Accessing a Web address (URL)

<quote>Why? Who would ever want to easily automate URL (Uniform Resource Locator) retrieval and form submission in a scripting language?
</quote>

 - Data mining (we’ll do this in the breakout)
 - Submitting information to another system
 - Accessing remote compute resources (“webservices”)
 - Calling microservices

`urllib` provides high-level, though less modern, tools for working with URLs. It can handle complex interactions, including basic and digest authentication, redirections, cookies, and more:

 - `urllib.request` for opening and reading URLs
 - `urllib.error` containing the exceptions raised by urllib.request
 - `urllib.parse` for parsing URLs
 - `urllib.robotparser` for parsing robots.txt files


 
Note: the `urllib.request.urlopen` function always returns an object that can be used as a context manager.

See https://docs.python.org/3/library/urllib.request.html#module-urllib.request
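
A small sketch of two of the pieces listed above: `urllib.parse` splitting a URL into components, and `urlopen` used as a context manager. The `example.com` URL is a placeholder, and a `data:` URL is used (an assumption made so the snippet runs offline); an `http://` URL is opened in the same way.

```python
from urllib.parse import parse_qs, urlsplit
from urllib.request import urlopen

# urllib.parse: split a URL into its components.
parts = urlsplit("http://example.com/search?q=brain+maps&page=2")
print(parts.scheme, parts.netloc, parts.path)

# parse_qs maps each query key to a list of decoded values.
query = parse_qs(parts.query)
print(query)

# urllib.request: inside a with block the response is closed automatically,
# so no explicit response.close() is needed.
with urlopen("data:text/plain,hello%20world") as response:
    payload = response.read()
print(payload)
```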

# Super simple webpage access

In [2]:
from __future__ import absolute_import, division, print_function

In [3]:
# URL = Uniform Resource Locator
try:
    # For Python 3.0 and later
    from urllib.request import urlopen
except ImportError:
    # Fall back to Python 2's urllib2
    from urllib2 import urlopen
    
# Brain maps data "Explore the Brain like never before"
url = "http://brainmaps.org/"  
response = urlopen(url) # response is a file-like object
html_data = response.read()
response.close() # close response as you would a normal file
print(html_data[:300])

b'<!-- start of preamble -->\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"\n      id="brainmaps">\n<HEAD>\n    <title>BRAINMAPS.ORG - BRAIN ATLAS, BRAIN MAPS, BRAIN STRUC'


Small aside: if you have HTML data that you want to render, you can use the `webbrowser` module

see http://docs.python.org/library/webbrowser.html

In [4]:
import webbrowser
# the with block closes (and flushes) the file before the browser reads it
with open("/tmp/tmp.html", "w") as f:
    f.write(html_data.decode("UTF-8"))
webbrowser.open("file:///tmp/tmp.html")

True

# HTTP Overview

Hypertext Transfer Protocol

- HTTP runs over TCP/IP sockets (typically port 80)
- HTTP is used to transmit resources
    - resources can be files, query results, server side script output

<img src="http://www.oreilly.com/openbook/webclient/wcp_0302.gif">

Communication is initiated by the client, which opens a connection and sends a request message to the server. The server then returns a response message containing the requested resource. In the simplest case (HTTP/1.0) the server closes the connection after delivering the response; HTTP/1.1 keeps it open by default so it can be reused.

The two most used request methods are **GET** and **POST**

# Scripting an HTTP GET request

In [5]:
try:
    from urllib.parse import urlencode
except ImportError:
    from urllib import urlencode


# create a dictionary to store the GET data
get_info = {"q": "Joshua S. Bloom", "page": "2"} 

# encode the data in proper URL format
url_values = urlencode(get_info) 
print(url_values)

page=2&q=Joshua+S.+Bloom


In [6]:
url = "http://pubget.com/search"

# a GET request carries the encoded data in the URL itself, after a "?"
# (passing it as urlopen(url, data=...) would send a POST instead)
response = urlopen(url + "?" + url_values)

html = response.read()
response.close()
print(html[8000:9000])

b'ults">\n  <ol>\n      \n<li class="result" id="article-21680812">\n  <div class="tools">\n  </div>\n\n  <!-- Title -->\n  <a href="/articles/elasticsearch_show/21680812" class="title" target="">A possible relativistic jetted outburst from a massive black hole fed by a tidally disrupted star.</a>\n\n  <!-- Language -->\n  \n\n  <!-- Authors -->\n  <div class="authors">\n    <a href="/author/joshua-s-bloom">Joshua S Bloom</a>, <a href="/author/dimitrios-giannios">Dimitrios Giannios</a>, <a href="/author/brian-d-metzger">Brian D Metzger</a>, <a href="/author/s-bradley-cenko">S Bradley Cenko</a>, <a href="/author/daniel-a-perley">Daniel A Perley</a>, <a href="/author/nathaniel-r-butler">Nathaniel R Butler</a>, <a href="/author/nial-r-tanvir">Nial R Tanvir</a>, <a href="/author/andrew-j-levan">Andrew J Levan</a>, <a href="/author/paul-t-o-brien">Paul T O&#x27;Brien</a>, <a href="/author/linda-e-strubbe">Linda E Strubbe</a>, <a href="/author/fabio-de-colle">Fabio De Colle</a>, <a href="/au


   - **GET** is the default method for retrieving resources. Form data is encoded in the URL. GET should be used when the form processing has no side effects on the server, so that repeating the request changes nothing (it is “idempotent”). GET is basically just for retrieving data (e.g., static files).

   - **POST** places form data in the message body. It is more appropriate for a wider range of processes, e.g., storing/updating data, ordering or sending a product, and sending email.
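
To see where the form data travels in each case, here is a sketch that runs a tiny throwaway server on the local machine and sends it the same data once as GET and once as POST. The `EchoMethodHandler` class, the `/search` path, and the query are made up for illustration; the `ProxyHandler({})` opener is only there so a system proxy cannot intercept the loopback request.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlencode
from urllib.request import ProxyHandler, build_opener


class EchoMethodHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that reports the method and where the data arrived."""

    def _reply(self, text):
        body = text.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # GET: the form data is part of the requested path (the query string)
        self._reply("GET " + self.path)

    def do_POST(self):
        # POST: the form data is in the message body
        length = int(self.headers.get("Content-Length", 0))
        self._reply("POST " + self.rfile.read(length).decode("utf-8"))

    def log_message(self, *args):
        pass  # silence per-request logging


server = HTTPServer(("127.0.0.1", 0), EchoMethodHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:{}/search".format(server.server_address[1])

opener = build_opener(ProxyHandler({}))  # ignore any system proxy for this local demo
query = urlencode({"q": "black hole"})

with opener.open(base + "?" + query) as response:  # no data argument: GET
    get_reply = response.read().decode("utf-8")
with opener.open(base, data=query.encode("utf-8")) as response:  # data given: POST
    post_reply = response.read().decode("utf-8")

server.shutdown()
server.server_close()
print(get_reply)
print(post_reply)
```

`urlopen` follows the same rule as `opener.open`: supplying `data` switches the request from GET to POST.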

# Scripting an HTTP POST request

In [7]:
data = {}
data["author"] = "Sagan, Carl"
params = urlencode(data).encode("UTF-8") # same urlencode method
url = "http://adsabs.harvard.edu/cgi-bin/nph-abs_connect"
response = urlopen(url, params) 
# POST request is indicated by including the params in urlopen
html = response.read()
response.close()
print(html[16474:19000])

b'Sagan,&#160;Carl</td><td><br></td><td align="left" valign="top" colspan=3>Ernst Mayr and Carl Sagan debate about the probability of intelligent life in the universe</td></tr>\n<tr><td colspan=6><HR></td></tr>\n<tr><td align="left" valign="baseline" nowrap>2</td><td align="left" valign="baseline" width="5%"><input type="checkbox"  name="bibcode" value="2002C&amp;T...118Q.147J">&nbsp;<a href="http://adsabs.harvard.edu/cgi-bin/nph-data_query?bibcode=2002C%26T...118Q.147J&amp;db_key=AST&amp;link_type=ABSTRACT&amp;high=57dc5c962d06701">2002C&amp;T...118Q.147J</a></td><td><br></td><td align="left" valign="baseline">0.000</td><td align="left" valign="baseline">12/2002</td><td align="left" valign="baseline">&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&

## Basic Authentication

```python
from urllib.request import HTTPBasicAuthHandler, build_opener, install_opener, urlopen
auth_handler = HTTPBasicAuthHandler()
auth_handler.add_password("realm", "example.com", 
       "username", "password")
opener = build_opener(auth_handler)

# ...install it globally so it can be used with urlopen.
install_opener(opener)
urlopen('http://www.example.com/login.html')
```

Browsers handle this by popping up a dialog box asking you to “Enter user name and password for ‘realm’ at http://example.com”.
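
Under the hood, Basic authentication just adds an `Authorization` header to the request. The sketch below builds that header by hand with the placeholder credentials from the example above, to show what is sent on the wire.

```python
import base64

username, password = "username", "password"  # placeholder credentials

# Basic authentication sends "username:password", base64-encoded, in a header.
token = base64.b64encode("{}:{}".format(username, password).encode("utf-8"))
header_value = "Basic " + token.decode("ascii")
print("Authorization:", header_value)

# base64 is an encoding, not encryption: anyone who sees the header can reverse it.
decoded = base64.b64decode(token).decode("utf-8")
print(decoded)
```

Because the credentials are trivially recoverable, Basic authentication is normally only used over HTTPS.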

## Form based Authentication

```python
from urllib.request import build_opener, HTTPCookieProcessor
from urllib.parse import urlencode
opener = build_opener(HTTPCookieProcessor())
# POST data must be bytes in Python 3
params = urlencode(dict(username="uname", password="pswd")).encode("utf-8")
response = opener.open("http://example.com/login/", params)
data = response.read()
response.close()
response = opener.open("http://example.com/my/protected/page.html")
data = response.read()
response.close()
```

Login information is stored in a cookie and included in subsequent requests. The same opener is used first to POST to the login form and then to GET the protected page.
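
The cookie round trip can be tried without a real site. This sketch assumes a made-up local server (`LoginHandler`, the `/login/` and `/page` paths, and the `session=abc123` cookie are all hypothetical) and compares an opener with and without `HTTPCookieProcessor`.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener


class LoginHandler(BaseHTTPRequestHandler):
    """Hypothetical site: any POST sets a session cookie, any GET requires it."""

    def _reply(self, text, cookie=None):
        body = text.encode("utf-8")
        self.send_response(200)
        if cookie:
            self.send_header("Set-Cookie", cookie)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # a real site would check the credentials here
        self._reply("logged in", cookie="session=abc123; Path=/")

    def do_GET(self):
        has_cookie = "session=abc123" in self.headers.get("Cookie", "")
        self._reply("secret page" if has_cookie else "please log in")

    def log_message(self, *args):
        pass  # silence per-request logging


server = HTTPServer(("127.0.0.1", 0), LoginHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:{}".format(server.server_address[1])


def fetch(opener, path, data=None):
    """Open base + path with the given opener and return the decoded body."""
    with opener.open(base + path, data) as response:
        return response.read().decode("utf-8")


plain = build_opener(ProxyHandler({}))  # no cookie handling
with_cookies = build_opener(ProxyHandler({}), HTTPCookieProcessor())

params = urlencode({"username": "uname", "password": "pswd"}).encode("utf-8")
before = fetch(plain, "/page")               # no cookie sent
login = fetch(with_cookies, "/login/", params)  # server sets the cookie
after = fetch(with_cookies, "/page")         # cookie is sent back automatically

server.shutdown()
server.server_close()
print(before, "|", login, "|", after)
```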

See also:
  - *RoboBrowser*: Your friendly neighborhood web scraper (http://robobrowser.readthedocs.org/)
  - *MechanicalSoup*: A Python library for automating interaction with websites (https://github.com/hickford/MechanicalSoup)

## Requests

Most modern web interactions are complicated. `requests` is your friend.


```python
import requests
requests.get('https://api.github.com/user', auth=('user', 'pass'))
```
`requests` also supports streaming, keep-alive, and more:

http://docs.python-requests.org/en/latest/user/advanced/

# Access an FTP server

In [8]:
import ftplib
ftp = ftplib.FTP("ftp.cdc.gov")
ftp.login()

'230 User logged in.'

In [9]:
ftp.cwd("/pub/OPD")

'250 CWD command successful.'

In [10]:
ftp.dir()

drwxrwxrwx   1 owner    group               0 Aug 31 14:32 Susanna
drwxrwxrwx   1 owner    group               0 Sep  1 13:01 T48700


In [11]:
ftp.cwd("Susanna")

'250 CWD command successful.'

In [12]:
ftp.dir()

-rwxrwxrwx   1 owner    group        76730058 Aug 31 14:32 01 TABLE BANNER 51305-51441-ZIKA-OOH-COMUNIDAD-INGLES_JM_PRINT.pdf
-rwxrwxrwx   1 owner    group         5158278 Aug 31 14:32 02 STAND BANNER 51357-51441-ZIKA-GFR-300x600_PRINT.pdf
-rwxrwxrwx   1 owner    group        80219177 Aug 31 14:32 03A POSTER 51547-51441-ZIKA-FP-MUJER_NIN¦âA-INGLES_JM_PRINT.pdf
-rwxrwxrwx   1 owner    group        53729873 Aug 31 14:32 03B POSTER 51548-ZIKA-FP-END-EMBARASADA-INGLES_JM_PRINT.pdf
-rwxrwxrwx   1 owner    group        18281007 Aug 31 14:32 10ft-poster-2.pdf
-rwxrwxrwx   1 owner    group         1559395 Aug 31 14:32 51488-ZIKA-36x84_Table Banner.pdf
-rwxrwxrwx   1 owner    group         4098323 Aug 31 14:32 SIKA_BANNER_7X3_reduced.pdf
-rwxrwxrwx   1 owner    group         1443729 Aug 31 14:32 standing_banner_7x3_v1_dual - Copy.pdf


In [13]:
ftp.retrbinary('RETR SIKA_BANNER_7X3_reduced.pdf', open('zika.pdf', 'wb').write)

'226 Transfer complete.'

In [14]:
pwd = !pwd

In [15]:
import webbrowser

webbrowser.open_new('file://{}/zika.pdf'.format(pwd[0]))

True

# HTML Overview

 - HyperText Markup Language
 - The code in which webpages are written
 - Consists of tags surrounded by angle brackets, < and >
 - An HTML document has a hierarchy enforced by the ordering and nesting of tags
 - It can be thought of like a tree with branches
 
 Examples at 
http://www.w3schools.com/html/html_examples.asp
http://www.sheldonbrown.com/web_sample1.html
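
To make the tree picture concrete, here is a small sketch using the standard library's `html.parser` to print each tag indented by its nesting depth. The HTML snippet and the `TreePrinter` class are made up for illustration, and the sketch assumes every tag is explicitly closed (void tags such as `<br>` would throw the depth count off).

```python
from html.parser import HTMLParser

html_text = "<html><body><h1>Title</h1><p>Some <b>bold</b> text</p></body></html>"


class TreePrinter(HTMLParser):
    """Record each opening tag with indentation that reflects its depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1  # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # closing a tag returns to the parent's level


printer = TreePrinter()
printer.feed(html_text)
printer.close()
print("\n".join(printer.lines))
```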

Let's take a look at a page: http://vizier.u-strasbg.fr/viz-bin/VizieR-3?-source=I/337/gaia

## html5lib

In [25]:
import html5lib
response = urlopen("http://words.bighugelabs.com/")
html = response.read()
doc = html5lib.parse(html)

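- doc is now a tree; in current html5lib releases the default is an `xml.etree.ElementTree` element (older releases used a “simpletree” format).
- html5lib can also build minidom (`"dom"`) and lxml (`"lxml"`) trees via the `treebuilder` argument, and BeautifulSoup can use html5lib as its parser.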
- lxml, in particular, is good for creating well-formed html and xml.

# Parsing HTML with BeautifulSoup

Beautiful Soup parses a (possibly invalid) XML or HTML document into a tree representation. It provides methods and Pythonic idioms that make it easy to navigate, search, and modify the tree.


See: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [24]:
try:
    # For Python 3.0 and later
    from urllib.request import urlopen
except ImportError:
    # Fall back to Python 2's urllib2
    from urllib2 import urlopen
    
response = urlopen("http://words.bighugelabs.com/")
html = response.read()
response.close()

# pip install beautifulsoup4
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"html5lib")
forms = soup.find_all("form")  # find_all is the bs4 name; findAll is the legacy alias
forms

[<form><input id="searchBox" name="q" type="text"/></form>]

In [17]:

print(html)

b'<!DOCTYPE html>\n<html>\n<head>\n<meta http-equiv="Content-type" content="text/html; charset=utf-8">\n<title>Big Huge Thesaurus: Synonyms, antonyms, and rhymes (oh my!)</title>\n<meta name="description" content="Get english synonyms, antonyms, sound-alike, and rhyming words from the Big Huge Thesaurus.">\n<link rel="shortcut icon" href="/images/favicon.ico">\n<link rel="apple-touch-icon" href="/images/apple-touch-icon.png">\n<meta name="viewport" content="width=device-width">\n<script type="text/javascript" src="//ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js"></script>\n<link rel="stylesheet" href="//netdna.bootstrapcdn.com/bootstrap/3.0.3/css/bootstrap.min.css">\n<script src="//netdna.bootstrapcdn.com/bootstrap/3.0.3/js/bootstrap.min.js"></script>\n<link href="//netdna.bootstrapcdn.com/font-awesome/4.0.3/css/font-awesome.css" rel="stylesheet">\n<link href=\'//fonts.googleapis.com/css?family=PT+Sans:400,700,400italic|PT+Serif:400,700,400italic\' rel=\'stylesheet\' type=\'

In [18]:
forms = soup.find_all("form")
for form in forms:
    print(form)

<form action="/" id="inputform" method="get" onsubmit="return lookup()">
    <input class="form-control" id="q" name="q" value=""/>
    <input class="btn btn-primary" type="submit" value="Lookup"/>
  </form>


Let's load up a whole bunch of baby names, by combining scripted webpage access with BeautifulSoup:

In [26]:
from bs4 import BeautifulSoup
url = "http://nameberry.com/search/boys_names/J"
response = urlopen(url)
html = response.read()
response.close()
soup = BeautifulSoup(html,"html.parser")

In [27]:
items = soup.find_all("li", class_="name_in_list")
print(items)

[<li class="name_in_list grid__cell unit-4-12--med unit-3-12--wideload simple-result"><h4><a class="name_link boy" href="/babyname/Jaap" target="_top">Jaap</a></h4></li>, <li class="name_in_list grid__cell unit-4-12--med unit-3-12--wideload simple-result"><h4><a class="name_link boy" href="/babyname/Jabari" target="_top">Jabari</a></h4></li>, <li class="name_in_list grid__cell unit-4-12--med unit-3-12--wideload simple-result"><h4><a class="name_link boy" href="/babyname/Jabbar" target="_top">Jabbar</a></h4></li>, <li class="name_in_list grid__cell unit-4-12--med unit-3-12--wideload simple-result"><h4><a class="name_link boy" href="/babyname/Jabez" target="_top">Jabez</a></h4></li>, <li class="name_in_list grid__cell unit-4-12--med unit-3-12--wideload simple-result"><h4><a class="name_link boy" href="/babyname/Jabin" target="_top">Jabin</a></h4></li>, <li class="name_in_list grid__cell unit-4-12--med unit-3-12--wideload simple-result"><h4><a class="name_link boy" href="/babyname/Jabiru"

In [29]:
items[34].a.get_text()

'Jakez'

In [22]:
import string

boy_names = []
# one results page per initial letter, A through Z
for letter in string.ascii_uppercase:
    url = "http://nameberry.com/search/boys_names/" + letter
    response = urlopen(url)
    html = response.read()
    response.close()
    soup = BeautifulSoup(html,"html.parser")
    items = soup.find_all("li", class_="name_in_list")
    for item in items:
        # keep only list items that contain exactly one link (the name itself)
        if len(item.find_all("a")) == 1:
            boy_names.append(item.a.get_text())

In [23]:
print(boy_names)

['Aaden', 'Aakil', 'Aalto', 'Aarav', 'Aaron', 'Aart', 'Aaru', 'Aarush', 'Abacus', 'Aban', 'Abanito', 'Abanu', 'Abba', 'Abbas', 'Abbott', 'Abdalla', 'Abdallah', 'Abdiel', 'Abdu', 'Abdul', 'Abdullah', 'Abe', 'Abeeku', 'Abel', 'Abelard', 'Abelardo', 'Aberdeen', 'Abi', 'Abiah', 'Abidan', 'Abiel', 'Abijah', 'Abilene', 'Abimael', 'Abir', 'Abner', 'Abraham', 'Abram', 'Abraxas', 'Absalom', 'Abt', 'Abush', 'Acacius', 'Ace', 'Acer', 'Achille', 'Achilles', 'Acker', 'Actaeon', 'Acton', 'Adagio', 'Adaiah', 'Adair', 'Adalius', 'Adam', 'Adan', 'Addar', 'Addison', 'Adelio', 'Aden', 'Adeon', 'Adhit', 'Adil', 'Adir', 'Aditya', 'Adiv', 'Adlai', 'Adler', 'Adley', 'Adnan', 'Adolf', 'Adolfo', 'Adolph', 'Adolphe', 'Adolphus', 'Adonijah', 'Adonis', 'Adrian', 'Adriano', 'Adriel', 'Adrien', 'Aegis', 'Aeneas', 'Aeron', 'Aesop', 'Agassi', 'Agni', 'Agu', 'Agung', 'Agustin', 'Ahab', 'Ahearne', 'Ahmad', 'Ahman', 'Ahmed', 'Ahmet', 'Aidan', 'Aiden', 'Aidyn', 'Aimilios', 'Ainsley', 'Aio', 'Airlie', 'Aither', 'Aja', 'Aj

In [None]:
boy_names.sort()
print(str(len(boy_names)) + " names from " +
      boy_names[0] + " to " + boy_names[-1] + ".")

To demonstrate we downloaded and parsed all the names, and to have a little fun, let's make up an official-sounding name for a childish Congressman.

In [None]:
import random
proper_person_name = ""
for n in range(5):
    proper_person_name += random.choice(boy_names) + " "
proper_person_name = "Congressman " + proper_person_name[:-1] + " XVI" + " PhD"
print(proper_person_name)

# JSON

JSON is a lightweight data interchange format.

Some web service APIs can output JSON, and Python's `json` module makes it easy to parse.

www.json.org/
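
As a quick offline sketch, `json.loads` turns JSON text into ordinary Python dictionaries and lists, and `json.dumps` goes the other way. The sample text below is made up in the style of a thesaurus response; it is not real API output.

```python
import json

# Made-up sample in the style of a thesaurus response; not real API output.
text = '{"noun": {"syn": ["coder", "programmer"]}, "verb": {"syn": ["hack"]}}'

result = json.loads(text)  # JSON object -> Python dict
print(sorted(result))
print(result["noun"]["syn"][0])

# dict -> JSON text, pretty-printed with stable key order
round_trip = json.dumps(result, indent=2, sort_keys=True)
print(round_trip)
```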

In [None]:
import json
import joshkey
base_domain = "http://words.bighugelabs.com/"

api_key =  joshkey.API # get your own damn key!
word = "hacker"

url = base_domain + "api/2/" + api_key + "/" + word + "/json"
print(url)

result = json.loads(urlopen(url).read().decode("UTF-8")) # a dictionary!

print(result)

In [None]:
import pprint
pprint.pprint(result)

A more fleshed-out example that prints the output more cleanly:

In [None]:
base_domain = "http://words.bighugelabs.com/"
api_key = "483e281b60496d7961d852629799e733"
word = "notebook"
print("Retrieving thesaurus entry for \"" + word + "\".")
url = base_domain + "api/2/" + api_key + "/" + word + "/json"
try:
    # the response body is JSON; json.loads turns it into a dictionary
    result = json.loads(urlopen(url).read().decode("UTF-8"))
except (IOError, ValueError):
    # IOError covers network/HTTP failures, ValueError covers malformed JSON
    print("Error - word probably not in thesaurus.")
    result = {}  # nothing to print below
for part_of_speech in result:
    print("-"*50)
    print("These are the " + part_of_speech + " entries:")
    for key in ["syn", "ant", "rel"]:
        # not every part of speech has every kind of entry
        for synonym in result[part_of_speech].get(key, []):
            print(key + " - " + synonym)