# Web Programming for Data Scientists

This tutorial is meant to introduce data scientists, analysts, statisticians and academic programmers to useful concepts in web programming. We'll cover a bit about how the internet works, then go into depth on how to retrieve datasets over the web, how to build fast crawlers and scrapers for extracting structured information from large numbers of web sites, how to use web APIs, and how to build simple web servers for hosting dashboards or exposing your models to data consumers.

## Assumptions
### The Reader
We assume that the reader is already familiar with concepts in statistics and statistical programming, as well as the Python libraries numpy, scipy, and pandas.

### Programming Environment
This tutorial is written for users running Python 2.7, with the following packages installed:
- Scipy
- Numpy
- Pandas
- Scikit-Learn
- Beautiful Soup

If you are missing any of these packages, you can get them via pip at the command line:

```
pip install numpy scipy pandas scikit-learn beautifulsoup4
```

Let's get started!

## 1. Fetching your First Web Page

In [34]:
import urllib2
import bs4

request = urllib2.Request("http://groups.linguistics.northwestern.edu/soundlab/contact.html")
response = urllib2.urlopen(request)
html = response.read()
soup = bs4.BeautifulSoup(html)
print soup.prettify()

<html>
 <head>
  <title>
   Contact the Sound Lab
  </title>
 </head>
 <body>
  <center>
   <h2>
    Email:
    <br/>
    SoundLab @ ling.northwestern.edu
   </h2>
   <p>
    Phone:
    <br/>
    (847) 491 - 6691
   </p>
   <p>
   </p>
  </center>
 </body>
</html>



### So, what did you just do?
1. Created an HTTP request.
2. Sent it off over the internet and received an HTTP response.
3. Read the response body.
4. Parsed the HTML in the response body into a structured Python object.
5. Printed out a nicely formatted representation of that HTML.

We go through that in more detail below. If you don't care about the plumbing and just want to get things done, you can skip ahead to section 2.

### HTTP Requests and Responses
HTTP, or Hypertext Transfer Protocol, is one of the main protocols that computers use to talk to each other over the internet. An HTTP request is just a bit of text that a *client* computer sends to a *server* to tell it something, typically that it wants some piece of information, like a web page.

An HTTP response is the text that the server sends back to the client.

Both requests and responses have a similar structure. Everything up to the first line break (a carriage return followed by a newline, written `\r\n`) is the *request line* (in a request) or the *status line* (in a response). This is followed by any number of *headers*, each terminated by `\r\n`. After the last header comes an empty line, so the header section ends with `\r\n\r\n`, and everything after that is the message body.
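To make that layout concrete, here is a minimal sketch that splits a raw HTTP message into its three parts using nothing but string operations. The function name `split_http_message` and the sample message are made up for illustration, and the code is written to run unchanged on Python 2.7 and Python 3.

```python
def split_http_message(raw):
    """Split a raw HTTP message into (start_line, header_lines, body)."""
    # An empty line (two CRLFs in a row) separates the head from the body.
    head, _, body = raw.partition("\r\n\r\n")
    # Within the head, every line is terminated by a CRLF.
    lines = head.split("\r\n")
    return lines[0], lines[1:], body


# Hypothetical message, invented for this example.
raw_message = (
    "POST /submit HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Content-Length: 5\r\n"
    "\r\n"
    "hello"
)

start_line, header_lines, body = split_http_message(raw_message)
print(start_line)
print(header_lines)
print(body)
```

A real HTTP library handles many more cases (streaming bodies, repeated headers, and so on), so treat this as a picture of the format rather than a parser to rely on.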

Let's send the request again, this time logging a bit of extra information to see some more detail. The text of the request will be printed out after "`send: `".

In [33]:
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
opener.open(request).read()

send: 'GET /soundlab/contact.html HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: groups.linguistics.northwestern.edu\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Fri, 11 Dec 2015 18:21:30 GMT
header: Server: Apache/2.2.3 (Red Hat)
header: Accept-Ranges: bytes
header: Content-Length: 183
header: Content-Type: text/html
header: X-Cache: MISS from m00180A86462C
header: X-Cache-Lookup: MISS from m00180A86462C:3128
header: Via: 1.1 m00180A86462C (squid/3.3.5)
header: Connection: close


'<html>\n<head>\n<title>\nContact the Sound Lab\n</title>\n<head>\n<body>\n<center>\n<h2>\nEmail:<br>\nSoundLab @ ling.northwestern.edu<p>\n\nPhone:<br>\n(847) 491 - 6691<p>\n</h2>\n</body>\n</html>\n\n'

#### Anatomy of a Request
The request should look something like this:
```
GET /soundlab/contact.html HTTP/1.1
Accept-Encoding: identity
Host: groups.linguistics.northwestern.edu
Connection: close
User-Agent: Python-urllib/2.7
```

This request has the required request line, four headers, and no body.

##### Request Line
The first bit of the request line tells us that it's a `GET` request, which means the client is asking for information. Most of the time, you'll be sending `GET` requests. You'll also sometimes use a `POST` request, typically used when the client is providing new information to a server. There are other request types which you'll rarely have to interact with as a data scientist.

The second bit of the request line is the URI, or Uniform Resource Identifier. Semantically, the URI originally indicated the location of a file on the host computer. These days things are much more dynamic and URIs are basically arbitrary, but we still generally structure them like file paths.

The last element of the request line specifies the version of the HTTP protocol being used.
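As a rough sketch of where those pieces come from, the snippet below takes the URL we fetched earlier apart with the standard library's `urlparse` and rebuilds the request line by hand. The variable names are only for illustration, and the import is wrapped so the same code runs on Python 2.7 and Python 3.

```python
try:
    from urlparse import urlparse          # Python 2.7
except ImportError:
    from urllib.parse import urlparse      # Python 3

url = "http://groups.linguistics.northwestern.edu/soundlab/contact.html"
parts = urlparse(url)

# The path ends up in the request line; the host name ends up in the Host header.
request_line = "GET {0} HTTP/1.1".format(parts.path)
host_header = "Host: {0}".format(parts.netloc)
print(request_line)
print(host_header)

# Going the other way, a request line splits into three space-separated parts.
method, uri, version = request_line.split(" ")
print(method)
print(uri)
print(version)
```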

##### Request Headers
Next come several headers. The HTTP spec defines a lot of them, most of which are not required. You can also add more headers yourself; custom headers have traditionally been prefixed with "X-", although nothing requires it, and servers will usually ignore headers they don't recognize. Here's what the headers in our example mean:
- `"Accept-Encoding: identity"` - The client wants the response body sent as-is, without compression (such as gzip) or any other transformation.
- `"Host: groups.linguistics.northwestern.edu"` - The domain name of the server we want to talk to. Its IP address is found by looking up "groups.linguistics.northwestern.edu" on a nameserver, and because one machine can host many sites, the header also tells the server which site we mean.
- `"Connection: close"` - After the server sends its reply, consider the communication over.
- `"User-Agent: Python-urllib/2.7"` - The system that sent the request is the Python library urllib.
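You can inspect the headers you have set on a request before anything is sent. The sketch below builds a request object, adds a custom header, and lists what is there so far. The header name `X-Tutorial-Demo` is invented for this example, the variable is called `demo_request` so it doesn't clobber the `request` object from earlier, and nothing here touches the network.

```python
try:
    from urllib2 import Request            # Python 2.7
except ImportError:
    from urllib.request import Request     # Python 3

demo_request = Request(
    "http://groups.linguistics.northwestern.edu/soundlab/contact.html",
    headers={"Accept-Encoding": "identity"})

# "X-Tutorial-Demo" is a made-up header; a server will most likely ignore it.
demo_request.add_header("X-Tutorial-Demo", "hello")

# header_items() lists the headers set so far. The library stores header
# names via str.capitalize(), so only the first letter stays upper case.
for name, value in sorted(demo_request.header_items()):
    print("{0}: {1}".format(name, value))
```

`Host`, `Connection`, and `User-Agent` don't show up in this list because the library only fills them in when the request is actually sent.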

`User-Agent` might be the most interesting of these headers. Its purpose is to identify some general information about the client. When browsing web sites in Chrome, Firefox, Safari, or whatever, the user agent typically indicates which web browser and operating system is being used. For instance, my user agent right now is `Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36`. This lets servers know a little bit about the client that is talking to them; in my case, that I'm running Chrome 46 on an Intel Mac.

A very, very important thing to remember about user agents is that **THEY ARE COMPLETELY UNRELIABLE**. Anyone can set their user agent to be whatever they want. For instance, let's pretend that we're sending a request as me using my browser, rather than as a Python script:

In [38]:
fake_ua_request = urllib2.Request(
    "http://groups.linguistics.northwestern.edu/soundlab/contact.html",
    headers={"User-Agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) "
                            "AppleWebKit/537.36 (KHTML, like Gecko) "
                            "Chrome/46.0.2490.86 Safari/537.36")})
opener.open(fake_ua_request).read()

send: 'GET /soundlab/contact.html HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: groups.linguistics.northwestern.edu\r\nConnection: close\r\nUser-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Fri, 11 Dec 2015 19:03:55 GMT
header: Server: Apache/2.2.3 (Red Hat)
header: Accept-Ranges: bytes
header: Content-Length: 183
header: Content-Type: text/html
header: X-Cache: MISS from m00180A86462C
header: X-Cache-Lookup: MISS from m00180A86462C:3128
header: Via: 1.1 m00180A86462C (squid/3.3.5)
header: Connection: close


'<html>\n<head>\n<title>\nContact the Sound Lab\n</title>\n<head>\n<body>\n<center>\n<h2>\nEmail:<br>\nSoundLab @ ling.northwestern.edu<p>\n\nPhone:<br>\n(847) 491 - 6691<p>\n</h2>\n</body>\n</html>\n\n'

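See the difference? The `User-Agent` header in the `send:` line now claims to be Chrome on a Mac, and the server has no way to tell otherwise.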

Because some servers respond differently to different browsers, or mobile versus desktop clients, user agent spoofing can come in handy.
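To see why spoofing works, it helps to picture what user agent sniffing might look like on the server side. The function below is a deliberately crude, hypothetical sketch (`guess_browser` and its token list are made up, not taken from any real server): it simply looks for telltale substrings, so whatever string the client chooses to send decides the answer.

```python
def guess_browser(user_agent):
    """Make a rough guess at the client from its User-Agent string."""
    # Order matters: Chrome's user agent also contains "Safari/".
    known_tokens = [
        ("Python-urllib/", "Python script"),
        ("Chrome/", "Chrome"),
        ("Firefox/", "Firefox"),
        ("Safari/", "Safari"),
    ]
    for token, label in known_tokens:
        if token in user_agent:
            return label
    return "unknown"


script_ua = "Python-urllib/2.7"
spoofed_ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/46.0.2490.86 Safari/537.36")

print(guess_browser(script_ua))
print(guess_browser(spoofed_ua))
```

Both strings could have been sent by the same Python script, which is exactly why the header can't be trusted.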

##### Request Body
`GET` requests normally don't have a body. Let's send a `POST` request to see what a body looks like. When using Python's urllib2, the only thing you need to do to turn a `GET` request into a `POST` is to pass in a `data` argument.

In [43]:
post_request = urllib2.Request(
    "http://groups.linguistics.northwestern.edu/soundlab/contact.html",
    data="Foobarabaz")
opener.open(post_request).read()

send: 'POST /soundlab/contact.html HTTP/1.1\r\nAccept-Encoding: identity\r\nContent-Length: 10\r\nHost: groups.linguistics.northwestern.edu\r\nContent-Type: application/x-www-form-urlencoded\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\nFoobarabaz'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Fri, 11 Dec 2015 19:09:57 GMT
header: Server: Apache/2.2.3 (Red Hat)
header: Accept-Ranges: bytes
header: Content-Length: 183
header: Content-Type: text/html
header: X-Cache: MISS from m00180A86462C
header: X-Cache-Lookup: MISS from m00180A86462C:3128
header: Via: 1.1 m00180A86462C (squid/3.3.5)
header: Connection: close


'<html>\n<head>\n<title>\nContact the Sound Lab\n</title>\n<head>\n<body>\n<center>\n<h2>\nEmail:<br>\nSoundLab @ ling.northwestern.edu<p>\n\nPhone:<br>\n(847) 491 - 6691<p>\n</h2>\n</body>\n</html>\n\n'

Notice that "Foobarabaz" appears at the end of our request, after the empty line (`\r\n\r\n`) that ends the headers.

The page we requested ignored the body and interpreted the `GET` and `POST` requests identically. This isn't always the case. Many URLs will respond only to one type of request, and might do something with the body you pass in, like turn it into a new posting on a message board.
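In practice a `POST` body is rarely an arbitrary string. The `Content-Type: application/x-www-form-urlencoded` header in the request above says the body holds form fields, and the standard library's `urlencode` produces that format. The sketch below builds such a request without sending it; the URL and the field names are placeholders invented for this example, and the imports are wrapped so the code runs on Python 2.7 and Python 3.

```python
try:
    from urllib import urlencode                 # Python 2.7
    from urllib2 import Request
except ImportError:
    from urllib.parse import urlencode           # Python 3
    from urllib.request import Request

# Hypothetical form fields, e.g. for a message-board posting.
form_fields = [("author", "ada"), ("message", "hello world")]

# urlencode joins the fields as key=value pairs separated by "&".
# Encoding to bytes is required on Python 3 and harmless on Python 2.7.
form_body = urlencode(form_fields).encode("ascii")

get_request = Request("http://example.com/post")
form_request = Request("http://example.com/post", data=form_body)

# Supplying data is what switches the request method from GET to POST.
print(get_request.get_method())
print(form_request.get_method())
print(form_body)
```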

<a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html#sec5">You can find the full HTTP/1.1 Request spec here</a>.

#### Anatomy of a Response
You can see that the responses we've been getting back are structured similarly to our requests. They look something like this:
```
HTTP/1.1 200 OK
Date: Fri, 11 Dec 2015 19:09:57 GMT
Server: Apache/2.2.3 (Red Hat)
Accept-Ranges: bytes
Content-Length: 183
Content-Type: text/html
X-Cache: MISS from m00180A86462C
X-Cache-Lookup: MISS from m00180A86462C:3128
Via: 1.1 m00180A86462C (squid/3.3.5)
Connection: close

<html>\n<head>\n<title>\nContact the Sound Lab\n</title>\n<head>\n<body>\n<center>\n<h2>\nEmail:<br>\nSoundLab @ ling.northwestern.edu<p>\n\nPhone:<br>\n(847) 491 - 6691<p>\n</h2>\n</body>\n</html>\n\n
```

##### Response Line
The first line of the response, the *status line*, includes the version of the HTTP protocol being used, just like the request line, followed by a *status code* and a short description of it (here, `OK`). The status code indicates what the server did with your request. You can access it directly on the response object:

In [49]:
response.code

200

200 OK is what you want to see! It means everything worked fine. The <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html">HTTP/1.1 Response spec</a> lists all possible status codes.
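You don't need to memorize every code, because the first digit already tells you the general outcome. As a small sketch (the helper `status_class` is made up for this tutorial), the five classes defined by the spec can be looked up like this:

```python
def status_class(code):
    """Return the general meaning of an HTTP status code from its first digit."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 keeps only the first digit of a 3-digit code.
    return classes.get(code // 100, "unknown")


for code in (200, 301, 404, 503):
    print("{0}: {1}".format(code, status_class(code)))
```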

##### Response Headers
Response headers have the same format as request headers but contain different information. Some are easy to interpret, like "Date", or the same as their request counterparts, like "Connection". You can also see some examples of non-standard headers, such as "X-Cache" and "X-Cache-Lookup".
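Because every header is a `Name: value` pair, header lines are easy to turn into a dictionary. The sketch below does that by hand for a few of the headers shown above; `parse_headers` is a made-up helper, and on a real urllib2 response you would normally rely on `response.info()` instead of parsing anything yourself.

```python
def parse_headers(header_lines):
    """Turn 'Name: value' lines into a dict keyed by lower-cased name."""
    headers = {}
    for line in header_lines:
        # Split on the first colon only; values such as dates contain colons.
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalize them.
        # Note: a repeated header would overwrite the earlier value here.
        headers[name.strip().lower()] = value.strip()
    return headers


header_lines = [
    "Date: Fri, 11 Dec 2015 19:09:57 GMT",
    "Content-Length: 183",
    "Content-Type: text/html",
    "X-Cache: MISS from m00180A86462C",
]

headers = parse_headers(header_lines)
# Header values are always text; convert them when you need a number.
content_length = int(headers["content-length"])
print(headers["content-type"])
print(content_length)
```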

Response headers won't often be terribly 