##### Diving In

HTTP web services can be described as _exchanging data with remote servers using nothing but the operations of HTTP_:

* `HTTP GET` to get data from a server
* `HTTP POST` to send new data to a server
* `HTTP PUT` to create or modify data on a server
* `HTTP DELETE` to delete data on a server.

The "verbs" built into the `HTTP` protocol (`GET`, `POST`, `PUT`, `DELETE`) map directly to application-level operations for retrieving, creating, modifying, and deleting data.
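As a rough illustration of that mapping, the sketch below builds one `urllib.request.Request` per verb without sending anything. The `example.com` URL and the `crud_requests` name are placeholders invented for this example, not part of any real service.

```python
import urllib.request

# Hypothetical resource URL, used only for illustration; nothing is sent.
ITEM_URL = 'http://example.com/items/42'

# One Request object per HTTP verb, keyed by the operation it stands for.
crud_requests = {
    'retrieve': urllib.request.Request(ITEM_URL, method='GET'),
    'create': urllib.request.Request(ITEM_URL, data=b'name=new', method='POST'),
    'modify': urllib.request.Request(ITEM_URL, data=b'name=changed', method='PUT'),
    'delete': urllib.request.Request(ITEM_URL, method='DELETE'),
}

for operation, request in crud_requests.items():
    print(operation, '->', request.get_method(), request.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would send the request with the chosen verb.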

Python comes with two different libraries for interacting with `HTTP` web services:

* `http.client` is a low-level library that implements `RFC` 2616, the `HTTP` protocol
* `urllib.request` is an abstraction layer built on top of `http.client`. It provides a standard `API` for accessing both `HTTP` and `FTP` servers, automatically follows `HTTP` redirects, and handles some common forms of `HTTP` authentication.

A preferred library for `HTTP` use cases in Python is the open source `httplib2` library, which implements `HTTP` more fully than `http.client` and provides a better abstraction than `urllib.request`.


##### Features of HTTP

**Caching**

Network requests are expensive, and every round trip adds latency before a response arrives. For this reason, `HTTP` is designed with caching in mind.

The `Cache-Control` and `Expires` headers tell your browser / client that content can be cached; in addition, the `Expires` header specifies when the cached content expires.

When the browser / client needs to make a request, it may find the content cached locally based on the headers in a previous response. If the content is cached locally, the browser will not make a network request.

If for any reason the local cache has been purged, the browser makes a network request, which may be satisfied by an intermediate caching proxy without a request to the `origin` server.

Python's standard-library `HTTP` libraries do not support caching, but `httplib2` does.
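To make the freshness decision concrete, here is a minimal sketch of how a client might decide whether a cached copy can be reused. The helper names `max_age_seconds` and `is_fresh` are invented for this example, and the logic is deliberately simplified: it looks only at `Cache-Control`, whereas a real cache (including `httplib2`) also has to consider `Expires`, `Vary`, and other details.

```python
import re

def max_age_seconds(cache_control):
    """Return the max-age value from a Cache-Control header, or None."""
    match = re.search(r'max-age=(\d+)', cache_control or '', re.IGNORECASE)
    return int(match.group(1)) if match else None

def is_fresh(cache_control, age_seconds):
    """Decide whether a cached copy may be reused without a network request."""
    directives = [d.strip().lower() for d in (cache_control or '').split(',')]
    if 'no-cache' in directives or 'no-store' in directives:
        return False
    max_age = max_age_seconds(cache_control)
    return max_age is not None and age_seconds < max_age

# A copy cached 497 seconds ago under "max-age=3600" is still usable.
print(is_fresh('max-age=3600', 497))
# The same copy is stale once more than an hour has passed.
print(is_fresh('max-age=3600', 4000))
```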

**Last-Modified Checking**

Some data never changes, while other data changes all the time. In between, there is a vast field of data that _might_ have changed.

CNN.com's data changes every few minutes, while a weblog's data might change once a week or less.

When data has not changed, we do not want to re-download it, since sending and receiving data over a network is expensive. `HTTP` facilitates this with the `Last-Modified` header.

    Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
    
In the example above, the server sent a `Last-Modified` header with the value `Fri, 22 Aug 2008 04:28:16 GMT`. Sending that value back in an `If-Modified-Since` header on the next request allows the service to respond with `304` (`Not Modified`) and no content if the data has not changed since then. The service may also include updated `Cache-Control` and `Expires` headers in that response, allowing intermediate proxies and the client to refresh their cache entries.

`httplib2` supports last-modified date checking.
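The exchange can be sketched with the standard library alone. In `HTTP` the client echoes the saved date back to the server in an `If-Modified-Since` request header. The feed URL below is a placeholder, the request is built but never sent, and `not_modified` is a hypothetical helper showing the comparison a server might make.

```python
import urllib.request
from email.utils import parsedate_to_datetime

# Value of the Last-Modified header saved from an earlier response.
last_modified = 'Fri, 22 Aug 2008 04:28:16 GMT'

# Hypothetical feed URL; the request is only constructed, not sent.
request = urllib.request.Request(
    'http://example.com/feed.xml',
    headers={'If-Modified-Since': last_modified},
)

def not_modified(resource_changed_at, if_modified_since):
    """Server-side view: True when a 304 with no body would be appropriate."""
    changed = parsedate_to_datetime(resource_changed_at)
    since = parsedate_to_datetime(if_modified_since)
    return changed <= since

# urllib stores header names capitalized as 'If-modified-since'.
print(request.get_header('If-modified-since'))
print(not_modified('Fri, 22 Aug 2008 04:28:16 GMT', last_modified))
```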

**ETag Checking**

ETags are an alternative way to accomplish the same thing as _last-modified checking_. With ETags, the service responds with a hash code along with the data. Exactly how the hash is determined is up to the service; the only requirement is that it changes when the data changes.

The second time you request the same data, you include the ETag in an `If-None-Match` header of your request.

    If-None-Match: "3075-ddc8d800"
    
As with last-modified checking, if the data has not changed the server sends back only the `304` status code; it doesn't send the same data a second time. By including the ETag hash in your request, you are telling the service that there is no need to re-send the data if it still matches this hash, since you still have the data from last time.

`httplib2` supports ETags.
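The following sketch plays both sides of an ETag exchange in memory. It is an assumption-laden toy: `make_etag` and `respond` are invented names, and hashing the body with SHA-256 is just one arbitrary way for a service to produce a value that changes when the data changes.

```python
import hashlib

def make_etag(body):
    """Derive a quoted ETag from the body; SHA-256 is an arbitrary choice."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'

def respond(body, if_none_match=None):
    """Return (status, body) the way a minimal ETag-aware service might."""
    if if_none_match == make_etag(body):
        return 304, b''   # client already has this exact data
    return 200, body      # first request, or the data has changed

feed = b'<feed>version 1</feed>'
etag = make_etag(feed)

first = respond(feed)                                            # no ETag yet
second = respond(feed, if_none_match=etag)                       # unchanged data
third = respond(b'<feed>version 2</feed>', if_none_match=etag)   # changed data

print(first[0], second[0], third[0])
```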

**Compression**

`HTTP` supports several compression algorithms. The two most common are `gzip` and `deflate`. Clients can specify that they are willing to accept compressed content via the `Accept-Encoding` header.

If a service supports one of the encodings listed in the client's `Accept-Encoding` header, it responds with the encoded content in the body and names the encoding used in the `Content-Encoding` header.

Most body content in `HTTP` is text, and text compresses very well, hence the support for compression in `HTTP`.

`httplib2` supports compression.
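A small sketch shows both points: how well repetitive text compresses, and how a client might decode a body based on `Content-Encoding`. The `decode_body` helper and the sample markup are invented for illustration. The `deflate` branch assumes the zlib-wrapped format; some servers send raw deflate data instead, which this sketch does not handle.

```python
import gzip
import zlib

def decode_body(body, content_encoding=None):
    """Undo the compression named in a Content-Encoding header, if any."""
    if content_encoding == 'gzip':
        return gzip.decompress(body)
    if content_encoding == 'deflate':
        return zlib.decompress(body)  # assumes zlib-wrapped deflate data
    return body

# Made-up, repetitive XML standing in for a feed body.
text = b'<entry><title>Dive Into Python</title></entry>\n' * 200
compressed = gzip.compress(text)

print(len(text), len(compressed))
print(decode_body(compressed, 'gzip') == text)
```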

**Redirects**

Cool `URI`s don't change, but many `URI`s do. A syndicated feed might be moved. An entire domain might move as an organization expands or reorganizes.

Every time you request any kind of resource from an `HTTP` server, the server includes a status code in its response. Status code `200` means "everything is normal, here's the content you requested".

`404` means "page not found", a common error that everyone is familiar with.

Status codes in the `300`s indicate some kind of redirection. `301` indicates a permanent redirect, and `302` indicates a temporary redirect. In either case, the new location for the resource is specified in the `Location` header.

If you get a `302` status code and a new location, the `HTTP` specification says you should use the new address for this request but retry the old address next time. With a `301` status code, you're supposed to use the new address from now on.

`httplib2` handles permanent redirects for you. It tells you that a permanent redirect occurred, keeps track of such redirects locally, and automatically rewrites redirected URLs before requesting them.
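The difference between the two status codes can be sketched as a small piece of bookkeeping. This is not how `httplib2` is implemented; `next_request_url`, `resolve`, and the `example.com` URLs are hypothetical names chosen for this example.

```python
def next_request_url(original_url, status, location, permanent_redirects):
    """Return the URL to fetch now, remembering permanent redirects."""
    if status == 301:
        # Permanent: record the new address so later requests use it too.
        permanent_redirects[original_url] = location
        return location
    if status == 302:
        # Temporary: follow it this once, but remember nothing.
        return location
    return original_url

def resolve(url, permanent_redirects):
    """Rewrite a URL that is known to have moved permanently."""
    return permanent_redirects.get(url, url)

redirects = {}
old = 'http://example.com/old'

next_request_url(old, 302, 'http://example.com/tmp', redirects)
after_temporary = resolve(old, redirects)   # still the old address

next_request_url(old, 301, 'http://example.com/new', redirects)
after_permanent = resolve(old, redirects)   # now the new address

print(after_temporary, after_permanent)
```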

##### How NOT to fetch data over HTTP

Let us fetch an Atom feed. Being a feed, it will be downloaded repeatedly. Let's do it the quick and dirty way first and then see how to do it better.

In [18]:
import urllib.request

a_url = 'https://feeds.simplecast.com/wgl4xEgL'
data = urllib.request.urlopen(a_url).read()
type(data)

bytes

To see why this is inefficient, let's turn on the debugging features of Python's `HTTP` library and see what's being sent "on the wire".

In [2]:
from http.client import HTTPSConnection, HTTPResponse
HTTPSConnection.debuglevel = 1
HTTPResponse.debuglevel = 1

from urllib.request import urlopen

a_url = 'https://feeds.simplecast.com/wgl4xEgL'
response = urlopen(a_url)

In [3]:
type(response)

http.client.HTTPResponse

In [4]:
print(response.headers.as_string())

Content-Type: application/xml
Content-Length: 2594947
Connection: close
Last-Modified: Fri, 01 Jan 2021 11:18:22 GMT
x-amz-version-id: HtWmRArgQPymxM8PdYbtg8CryJk7C4lT
Server: AmazonS3
Date: Fri, 01 Jan 2021 17:28:03 GMT
Cache-Control: max-age=3600
ETag: "c6601ecd256dae5ec11c8f3714be4483"
Vary: Accept-Encoding
X-Cache: Hit from cloudfront
Via: 1.1 d4cdd862c8bc0148f37b685614031cf5.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: EWR52-C1
X-Amz-Cf-Id: VMQ_M1qqcxh_zgaOb0wdELUlyFNiCHZIhLomqEOD4NeOajAeZa2cdA==
Age: 497




**FAILED!**

Since I couldn't get the debug messages to show up, this write-up relies on the content of the book rather than on verification.
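One possible explanation, offered as an assumption rather than something verified in this notebook: on some Python versions `urllib.request` sets the debug level on each connection from its own handler, which would override the class attributes set above. A sketch of an alternative is to pass `debuglevel=1` to the handlers directly. The names `debug_opener` and `fetch_with_debug` are made up for this example.

```python
import urllib.request

# Handlers that pass debuglevel=1 to every connection they create.
debug_opener = urllib.request.build_opener(
    urllib.request.HTTPHandler(debuglevel=1),
    urllib.request.HTTPSHandler(debuglevel=1),
)

def fetch_with_debug(url):
    """Fetch url through the debug opener and return the raw body bytes."""
    with debug_opener.open(url) as response:
        return response.read()
```

Calling `fetch_with_debug(a_url)` should then print the request line and headers as they are sent, though the exact output depends on the Python version.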

Reviewing the response headers, we can see that there is no `Content-Encoding` header: because we did not send an `Accept-Encoding` header, the feed came back uncompressed. If we made another request, it would ignore the `Cache-Control` and `ETag` headers from this response and download the whole feed again.

Finally, an interesting detail: the `X-Cache` and `Via` headers show that the feed is served through AWS CloudFront.

##### Introducing `httplib2`

`httplib2` solves some of the problems associated with the built-in libraries.

In [6]:
import httplib2

h = httplib2.Http('examples/.cache')

response, content = h.request(a_url)
print(response.status)
print(len(content))

200
2594947


In [7]:
response, content = h.request(a_url)
print(response.fromcache)

True


As we can see in this case, the second request retrieves the response from cache.

How can we force `httplib2` to ignore the cache?

In [8]:
response, content = h.request(a_url, headers= {'Cache-control': 'no-cache'})
print(response.fromcache)

False


Let's review the request headers by enabling debug output.

In [11]:
httplib2.debuglevel = 1

h = httplib2.Http('examples/.cache')
response, content = h.request(a_url, headers= {'Cache-control': 'no-cache'})

connect: (feeds.simplecast.com, 443)
send: b'GET /wgl4xEgL HTTP/1.1\r\nHost: feeds.simplecast.com\r\ncache-control: no-cache\r\nuser-agent: Python-httplib2/0.18.1 (gzip)\r\naccept-encoding: gzip, deflate\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Content-Type: application/xml
header: Transfer-Encoding: chunked
header: Connection: keep-alive
header: Last-Modified: Fri, 01 Jan 2021 11:18:22 GMT
header: x-amz-version-id: HtWmRArgQPymxM8PdYbtg8CryJk7C4lT
header: Server: AmazonS3
header: Content-Encoding: gzip
header: Date: Fri, 01 Jan 2021 17:18:42 GMT
header: Cache-Control: max-age=3600
header: ETag: W/"c6601ecd256dae5ec11c8f3714be4483"
header: Vary: Accept-Encoding
header: X-Cache: Hit from cloudfront
header: Via: 1.1 5d70fbb2ed26aa231fed552696cfa0a5.cloudfront.net (CloudFront)
header: X-Amz-Cf-Pop: EWR52-C1
header: X-Amz-Cf-Id: C6195j8FWBkGol7DgLG4TFRPQQAT8CZSi3t2ONa_C5efNHgdwo8x8A==
header: Age: 1391


`httplib2` makes the request with a number of headers that are best practice for `HTTP`:

    accept-encoding: gzip, deflate
    
The server responds with a `gzip`-compressed response and includes both `Last-Modified` and `ETag` headers.

    Content-Encoding: gzip
    Last-Modified: Fri, 01 Jan 2021 11:18:22 GMT
    ETag: W/"c6601ecd256dae5ec11c8f3714be4483"