<img src="https://www.evernote.com/l/AUXIi328hU9Im7vbcG4SauyAkY-8L11rdU8B/image.png">
Python Seminar (AY250) UC Berkeley

In [None]:
%run ../00_AdvancedPythonConcepts/talktools.py

## Outline


1. Websites and webservers

   - `urllib`, `ftplib`, `httplib` (`http.client` in Python 3), `httplib2`, `requests`
   - Parsing with `html5lib`, `BeautifulSoup4`
   - `conda install beautifulsoup4`

2. Transmission Control Protocol (TCP)

   - `socket`
    
3. Breakout Exercise

   - Focus on automating website access

4. Remote Procedure Call

   - `SimpleXMLRPCServer`, `xmlrpclib` (`xmlrpc.server`, `xmlrpc.client` in Python 3)

## Network Communication Overview

- TCP/IP sockets: the basis of almost all network communication (UDP is the other common transport)
- TCP (Transmission Control Protocol): exchange data reliably between two network hosts
- IP (Internet Protocol): handles addressing & routing messages across one or more networks

<hr>
<img src="http://flylib.com/books/3/475/1/html/2/images/0131777203/graphics/14fig02.gif">
<hr>
<img src="http://i.cloud.opensystemsmedia.com/i__srcbc84f1fa314969f2dc009b8711c679ce_paraf0d99c20bd457d46a92c72841873c47.jpeg">
<hr>
<img src="https://microchip.wdfiles.com/local--files/tcpip:tcp-ip-five-layer-model/layer_terminology.JPG">

# Accessing a Web address (URL)

<quote>Why? Who would ever want to easily automate URL (Uniform Resource Locator) retrieval and form submission in a scripting language?
</quote>

 - Data mining (we’ll do this in the breakout)
 - Submitting information to another system
 - Accessing remote compute resources (“webservices”)
 - Get microservices

`urllib` provides tools & functions for high-level, but less modern, interactions. It's suited for complex interactions, supporting basic and digest authentication, redirections, cookies, and more:

 - `urllib.request` for opening and reading URLs
 - `urllib.error` containing the exceptions raised by urllib.request
 - `urllib.parse` for parsing URLs
 - `urllib.robotparser` for parsing robots.txt files


 
Note: the `urllib.request.urlopen` function always returns an object that can be used as a context manager.

See https://docs.python.org/3/library/urllib.request.html#module-urllib.request
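
A minimal sketch of the context-manager form, combined with the `urllib.error` exceptions listed above. The helper name `fetch` is hypothetical, and the `data:` and `file:` URLs are stand-ins chosen so the example runs without network access (assuming Python 3.4 or later, which ships a `data:` URL handler).

```python
from urllib.request import urlopen
from urllib.error import URLError


def fetch(url):
    """Return the body of `url` as bytes, or None if it cannot be opened.

    `fetch` is a hypothetical helper for illustration, not part of urllib.
    """
    try:
        # The with-block closes the response even if read() raises.
        with urlopen(url) as response:
            return response.read()
    except URLError as err:
        # HTTPError subclasses URLError, so HTTP error statuses land here too.
        print("could not open", url, "->", err.reason)
        return None


# A data: URL carries its payload inline, so no server is needed.
body = fetch("data:text/plain,hello%20world")
print(body)

# A file: URL pointing at a path assumed not to exist raises URLError.
missing = fetch("file:///no/such/dir/no_such_file.html")
print(missing)
```

With a real `http://` URL the same `with` pattern applies unchanged.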

# Super simple webpage access

In [None]:
from __future__ import absolute_import, division, print_function

In [None]:
# URL = Uniform Resource Locator
try:
    # For Python 3.0 and later
    from urllib.request import urlopen
except ImportError:
    # Fall back to Python 2's urllib2
    from urllib2 import urlopen
    
# Brain maps data "Explore the Brain like never before"
url = "http://brainmaps.org/"  
response = urlopen(url) # response is a file-like object
html_data = response.read()
response.close() # close response as you would a normal file
print(html_data[:300])

Small aside: if you have HTML data that you want to render, you can use the `webbrowser` module

see http://docs.python.org/library/webbrowser.html

In [None]:
import webbrowser
# write the page to a temporary file, closing it before the browser reads it
with open("/tmp/tmp.html", "w") as f:
    f.write(html_data.decode("UTF-8"))
webbrowser.open("file:///tmp/tmp.html")

# HTTP Overview

Hypertext Transfer Protocol

- HTTP runs over TCP/IP sockets (typically port 80, or port 443 for HTTPS)
- HTTP is used to transmit resources
    - resources can be files, query results, server side script output

<img src="http://www.oreilly.com/openbook/webclient/wcp_0302.gif">

Communication is initiated by the client, which opens a connection and sends a request message to the server. The server then returns a response message containing the requested resource. Under HTTP/1.0 the server closes the connection after delivering the response; HTTP/1.1 keeps the connection open by default so it can be reused.

The two most used request methods are **GET** and **POST**
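
The request/response exchange described above can be sketched by hand with a raw socket. Everything here is an assumption made for illustration: `HelloHandler` is a hypothetical handler, and the server is a throwaway `http.server` instance on the loopback interface so that no external site is contacted.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = b"hello from the server\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 asks the OS for any free port on the loopback interface.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
host, port = server.server_address
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: open a TCP connection and send a request message by hand.
request = "GET / HTTP/1.0\r\nHost: {}\r\n\r\n".format(host).encode("ascii")
with socket.create_connection((host, port)) as sock:
    sock.sendall(request)
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # empty read: the server closed the connection
            break
        chunks.append(chunk)

server.shutdown()
server.server_close()

# The response is a status line, headers, a blank line, then the resource.
raw_response = b"".join(chunks).decode("ascii")
headers, _, payload = raw_response.partition("\r\n\r\n")
status_line = headers.splitlines()[0]
print(status_line)
print(payload)
```

Libraries such as `urllib` build and parse these same messages for you.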

# Scripting an HTTP GET request

In [None]:
try:
    from urllib.parse import urlencode
except:
    from urllib import urlencode


# create a dictionary to store the GET data
get_info = {"q": "Joshua S. Bloom", "page": "2"} 

# encode the data in proper URL format
url_values = urlencode(get_info) 
print(url_values)

In [None]:
url = "http://pubget.com/search"

# open the url as before; for a GET request the encoded values go in the
# query string (passing them via data= would send a POST instead)
response = urlopen(url + "?" + url_values)

html = response.read()
response.close()
print(html[8000:9000])


   - **GET** is the default method for retrieving resources. Form data is encoded in the URL. GET should be used when the form processing is “idempotent” and has no side effects on the server, that is, when it only retrieves data.

   - **POST** places form data in the message body. It is appropriate for a wider range of processes, e.g., storing or updating data, ordering a product, or sending email.
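
A short sketch of where the form data ends up in each case. The URL is a hypothetical endpoint; no request is actually sent, since `urllib.request.Request` objects can be inspected without opening them.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

# Hypothetical endpoint and form fields, used only for illustration.
url = "http://example.com/search"
form = {"q": "Joshua S. Bloom", "page": "2"}
encoded = urlencode(form)

# GET: the encoded form travels in the URL's query string; there is no body.
get_request = Request(url + "?" + encoded)

# POST: the same encoded form travels in the message body, as bytes.
post_request = Request(url, data=encoded.encode("utf-8"))

for req in (get_request, post_request):
    print(req.get_method(), req.full_url, req.data)

# The query string can be decoded back into the original fields.
recovered = parse_qs(urlsplit(get_request.full_url).query)
print(recovered)
```

`urlopen` chooses the method the same way: supplying `data` turns the request into a POST.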

# Scripting an HTTP POST request

In [None]:
data = {}
data["author"] = "Sagan, Carl"
params = urlencode(data).encode("UTF-8") # same urlencode method
url = "http://adsabs.harvard.edu/cgi-bin/nph-abs_connect"
response = urlopen(url, params) 
# POST request is indicated by including the params in urlopen
html = response.read()
response.close()
print(html[16474:19000])

## Basic Authentication

```python
from urllib.request import HTTPBasicAuthHandler, build_opener, install_opener, urlopen
auth_handler = HTTPBasicAuthHandler()
auth_handler.add_password("realm", "example.com", 
       "username", "password")
opener = build_opener(auth_handler)

# ...install it globally so it can be used with urlopen.
install_opener(opener)
urlopen('http://www.example.com/login.html')
```

Browsers handle this by popping up a dialog box asking you to enter a user name and password for "realm" at http://example.com.
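
Under the hood, basic authentication answers the server's challenge with an `Authorization` header. A minimal sketch of how that header value is formed, using the same placeholder credentials as above (the variable names are illustrative only):

```python
import base64

# Hypothetical credentials, matching the placeholders used above.
username, password = "username", "password"

# Basic authentication sends "username:password", base64-encoded,
# in an Authorization header. Base64 is an encoding, not encryption.
token = base64.b64encode("{}:{}".format(username, password).encode("utf-8"))
header_value = "Basic " + token.decode("ascii")
print("Authorization:", header_value)

# Anyone who sees the header can reverse it, so prefer HTTPS with basic auth.
decoded = base64.b64decode(header_value.split(" ", 1)[1]).decode("utf-8")
print(decoded)
```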

## Form based Authentication

```python
from urllib.request import build_opener, HTTPCookieProcessor
from urllib.parse import urlencode
opener = build_opener(HTTPCookieProcessor())
# POST data must be bytes in Python 3
params = urlencode(dict(username="uname", password="pswd")).encode("utf-8")
response = opener.open("http://example.com/login/", params)
data = response.read()
response.close()
response = opener.open("http://example.com/my/protected/page.html")
data = response.read()
response.close()
```

Login information is stored in a cookie and included in subsequent requests. The opener is used to POST to the login form and the protected page.
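
To watch the cookie being stored and replayed without a real site, here is a sketch against a toy server on the loopback interface. `LoginHandler`, its paths, and the `session=abc123` cookie are all hypothetical; a real site would validate the credentials and issue an unpredictable session value.

```python
import threading
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener


class LoginHandler(BaseHTTPRequestHandler):
    """Hypothetical toy site: POST /login/ sets a cookie, GET /page checks it."""

    def _reply(self, status, body, cookie=None):
        self.send_response(status)
        if cookie:
            self.send_header("Set-Cookie", cookie)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # a real site would check the credentials here
        self._reply(200, b"logged in", cookie="session=abc123; Path=/")

    def do_GET(self):
        if "session=abc123" in self.headers.get("Cookie", ""):
            self._reply(200, b"secret page")
        else:
            self._reply(403, b"forbidden")

    def log_message(self, format, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), LoginHandler)
base = "http://127.0.0.1:{}".format(server.server_address[1])
threading.Thread(target=server.serve_forever, daemon=True).start()

no_proxy = ProxyHandler({})  # talk to the loopback server directly

# Without a cookie the protected page is refused.
try:
    build_opener(no_proxy).open(base + "/page").close()
    anonymous_status = 200
except HTTPError as err:
    anonymous_status = err.code

# With a cookie-aware opener, the login cookie is replayed automatically.
jar = CookieJar()
opener = build_opener(no_proxy, HTTPCookieProcessor(jar))
params = urlencode({"username": "uname", "password": "pswd"}).encode("utf-8")
with opener.open(base + "/login/", params) as response:
    response.read()
with opener.open(base + "/page") as response:
    page = response.read()

server.shutdown()
server.server_close()
cookie_names = [cookie.name for cookie in jar]
print(anonymous_status, cookie_names, page)
```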

See also:
  - *RoboBrowser*: Your friendly neighborhood web scraper (http://robobrowser.readthedocs.org/)
  - *MechanicalSoup*: A Python library for automating interaction with websites (https://github.com/hickford/MechanicalSoup)

## Requests

Most modern web interactions are complicated, and `requests` is your friend:


```python
import requests

requests.get('https://api.github.com/user',
             auth=('user', 'pass'))
```
`requests` also supports streaming, keep-alive, and more:

http://docs.python-requests.org/en/latest/user/advanced/

# Access an FTP server

In [None]:
import ftplib
ftp = ftplib.FTP("ftp.cdc.gov")
ftp.login()

In [None]:
ftp.cwd("/pub/OPD")

In [None]:
ftp.dir()

In [None]:
ftp.cwd("Susanna")

In [None]:
ftp.dir()

In [None]:
# the with-block closes the local file once the download finishes
with open('zika.pdf', 'wb') as f:
    ftp.retrbinary('RETR SIKA_BANNER_7X3_reduced.pdf', f.write)

In [None]:
pwd = !pwd

In [None]:
import webbrowser

webbrowser.open_new('file://{}/zika.pdf'.format(pwd[0]))

# HTML Overview

 - HyperText Markup Language
 - The code in which webpages are written
 - Consists of tags surrounded by angle brackets, < and >
 - An HTML document has a hierarchy enforced by the ordering and nesting of tags
 - It can be thought of like a tree with branches
 
 Examples:
 - http://www.w3schools.com/html/html_examples.asp
 - http://www.sheldonbrown.com/web_sample1.html

Let's take a look at a page: http://vizier.u-strasbg.fr/viz-bin/VizieR-3?-source=I/337/gaia
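
The tree structure can be made visible with the standard library alone. This is a minimal sketch: `PAGE` is a small made-up document, `OutlinePrinter` is a hypothetical class name, and the depth counting assumes every tag is explicitly closed (which real pages often violate, hence the dedicated parsers below).

```python
from html.parser import HTMLParser

# A small, hypothetical page used only for illustration.
PAGE = """
<html>
  <head><title>Demo</title></head>
  <body>
    <h1>Heading</h1>
    <p>Some <b>bold</b> text.</p>
  </body>
</html>
"""


class OutlinePrinter(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes well-formed input where every opened tag is closed.
        self.depth -= 1


parser = OutlinePrinter()
parser.feed(PAGE)
parser.close()

# Indentation shows the branches of the tree.
for depth, tag in parser.outline:
    print("  " * depth + tag)
```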

## html5lib

In [None]:
import html5lib
response = urlopen("http://words.bighugelabs.com/")
html = response.read()
doc = html5lib.parse(html)

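- `doc` is now a tree. Current html5lib releases build an `xml.etree.ElementTree` tree by default; older releases defaulted to the “simpletree” format.
- html5lib can also build other tree formats (minidom and lxml, for example) through the `treebuilder` argument of `parse`.
- lxml, in particular, is good for creating well-formed HTML and XML.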

# Parsing HTML with BeautifulSoup

Beautiful Soup parses a (possibly invalid) XML or HTML document into a tree representation. It provides methods and Pythonic idioms that make it easy to navigate, search, and modify the tree.


See: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [None]:
try:
    # For Python 3.0 and later
    from urllib.request import urlopen
except ImportError:
    # Fall back to Python 2's urllib2
    from urllib2 import urlopen
    
response = urlopen("http://words.bighugelabs.com/")
html = response.read()
response.close()

# pip install beautifulsoup4
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"html5lib")
forms = soup.findAll("form")
forms

In [None]:
print(html)

In [None]:
# anchor tags hold the page's hyperlinks
links = soup.findAll("a")
for link in links:
    print(link)

Let's load up a whole bunch of baby names, by combining scripted webpage access with BeautifulSoup:

In [None]:
from bs4 import BeautifulSoup
url = "http://nameberry.com/search/boys_names/J"
response = urlopen(url)
html = response.read()
response.close()
soup = BeautifulSoup(html,"html.parser")

In [None]:
items = soup.findAll("li", class_="name_in_list")
print(items)

In [None]:
items[35].a.get_text()

In [None]:
import string

boy_names = []
# one results page per initial letter, A through Z
for n in string.ascii_uppercase:
    url = "http://nameberry.com/search/boys_names/" + n
    response = urlopen(url)
    html = response.read()
    response.close()
    soup = BeautifulSoup(html,"html.parser")
    items = soup.findAll("li", class_="name_in_list")
    for item in items:
        if len(item.findAll("a")) == 1:
            boy_names.append(item.a.get_text())

In [None]:
print(boy_names)

In [None]:
boy_names.sort()
print(str(len(boy_names)) + " names from " +
      boy_names[0] + " to " + boy_names[-1] + ".")

To demonstrate we downloaded and parsed all the names, and to have a little fun, let's make up an official-sounding name for a childish Congressman.

In [None]:
import random
proper_person_name = ""
for n in range(5):
    proper_person_name += random.choice(boy_names) + " "
proper_person_name = "Congressman " + proper_person_name[:-1] + " XVI" + " PhD"
print(proper_person_name)

# JSON

JSON (JavaScript Object Notation) is a lightweight data interchange format.

Some web service APIs can output JSON, and Python's `json` module makes it easy to parse.

www.json.org/
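
A minimal sketch of the round trip between JSON text and Python objects. The `payload` string is made up for illustration and is not real output from any web service.

```python
import json

# Hypothetical payload shaped like a web-service reply; not real API output.
payload = '{"word": "notebook", "noun": {"syn": ["notebook computer", "book"]}, "count": 2}'

# json.loads turns JSON text into Python dicts, lists, strings and numbers.
result = json.loads(payload)
print(type(result).__name__, result["noun"]["syn"][0], result["count"] + 1)

# json.dumps goes the other way; indent and sort_keys make the text readable.
text = json.dumps(result, indent=2, sort_keys=True)
print(text)
```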

In [None]:
import json
import joshkey
base_domain = "http://words.bighugelabs.com/"

api_key =  joshkey.API # get your own damn key!
word = "hacker"

url = base_domain + "api/2/" + api_key + "/" + word + "/json"
print(url)

result = json.loads(urlopen(url).read().decode("UTF-8")) # a dictionary!

print(result)

In [None]:
import pprint
pprint.pprint(result)

A more fleshed-out example that prints the output more cleanly:

In [None]:
base_domain = "http://words.bighugelabs.com/"
api_key = "483e281b60496d7961d852629799e733"
word = "notebook"
print("Retrieving thesaurus entry for \"" + word + "\".")
url = base_domain + "api/2/" + api_key + "/" + word + "/json"
try:
    result = json.loads(urlopen(url).read().decode("UTF-8")) # a dictionary!
except (IOError, ValueError):
    # IOError covers URLError/HTTPError; ValueError covers malformed JSON
    print("Error - word probably not in thesaurus.")
    result = {}  # nothing to print below
for part_of_speech in result:
    print("-"*50)
    print("These are the " + part_of_speech + " entries:")
    for key in ["syn", "ant", "rel"]:
        # not every part of speech has every kind of entry
        for synonym in result[part_of_speech].get(key, []):
            print(key + " - " + synonym)