# HTTP in Python: The Requests Library

Many libraries are available in the Python ecosystem that can take care of HTTP for us. To name a few:

* Python 3 comes with a built-in module called “urllib,” which can deal with all things HTTP (see https://docs.python.org/3/library/urllib.html).
* “httplib2” (see https://github.com/httplib2/httplib2): a small, fast HTTP client library.
* “urllib3” (see https://urllib3.readthedocs.io/): a powerful HTTP client for Python, used by the requests library below.
* “requests” (see http://docs.python-requests.org/): an elegant and simple HTTP library for Python, built “for human beings.”
* “grequests” (see https://pypi.python.org/pypi/grequests): an extension of requests that deals with asynchronous, concurrent HTTP requests.
* “aiohttp” (see http://aiohttp.readthedocs.io/): another library focusing on asynchronous HTTP.

### Installing/updating the requests library
pip install -U requests

In [None]:
import requests
url = 'http://www.webscrapingfordatascience.com/basichttp/'
#url = 'http://www.google.com'
r = requests.get(url)
# r = requests.request('GET', url)  # same as above
print(r.text)

In [None]:
type(r)

In [None]:
dir(requests)

In [None]:
print(r.status_code) # Which HTTP status code did we get back from the server?
print(r.reason) # What is the textual status code?
print(r.headers) # What were the HTTP response headers?
print(r.request) # The request information is saved as a Python object in r.request:
print(r.request.headers) # What were the HTTP request headers?
print(r.text) # The HTTP response content:

# Query Strings: URLs with Parameters

##### URL parameters

* http://www.webscrapingfordatascience.com/paramhttp/
* http://www.webscrapingfordatascience.com/paramhttp/?query=test

The optional “?…” part in URLs is called the “query string.” Some examples:

* http://www.example.com/product_page.html?product_id=304
* https://www.google.com/search?dcr=0&source=hp&q=test&oq=test
* http://example.com/path/to/page/?type=animal&location=asia

Web servers are smart pieces of software. When a server receives an HTTP request for such URLs, it may run a program that uses the parameters included in the query string — the “URL parameters” — to render different content. 
Compare 
* http://www.webscrapingfordatascience.com/paramhttp/?query=test 
* http://www.webscrapingfordatascience.com/paramhttp/?query=anothertest

for instance. Even for this simple page, you can see how the response dynamically incorporates the parameter data that you provided in the URL.
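To make this concrete, here is a toy stand-in for such a server-side program. It is only a sketch: `render_page` and its response texts are hypothetical and are not the code running behind the example site.

```python
from urllib.parse import urlsplit, parse_qs


def render_page(url):
    """Toy handler: build a response body from the 'query' URL parameter."""
    # parse_qs maps each parameter name to a list of decoded values
    params = parse_qs(urlsplit(url).query)
    values = params.get('query')
    if not values:
        return 'No query parameter was provided.'
    return 'You searched for: ' + values[0]


base = 'http://www.webscrapingfordatascience.com/paramhttp/'
print(render_page(base))
print(render_page(base + '?query=test'))
print(render_page(base + '?query=anothertest'))
```

The same path yields different content depending on the query string, which is the behavior a scraper needs to keep in mind.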

Query strings in URLs should adhere to the following conventions:
* A query string comes at the end of a URL, starting with a single question mark, “?”.
* Parameters are provided as key-value pairs and separated by an ampersand, “&”.
* The key and value are separated using an equals sign, “=”.
* Since some characters cannot be part of a URL or have a special meaning (the characters “/”, “?”, “&”, and “=” for instance), URL “encoding” needs to be applied to properly format such characters when using them inside of a URL. Try this out using the URL http://www.webscrapingfordatascience.com/paramhttp/?query=another%20test%3F%26, which sends “another test?&” as the value for the “query” parameter to the server in an encoded form.
* Other exact semantics are not standardized. In general, the order in which the URL parameters are specified is not taken into account by web servers, though some might. Many web servers can also handle URL parameters without a value, for example, http://www.example.com/?noparam=&anotherparam. Since the full URL is included in the request line of an HTTP request, the web server can decide how to parse and deal with these.
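The conventions above can be checked with the standard library. The sketch below uses `urllib.parse` to split a URL and decode its query string; the helper name `query_pairs` is made up for this illustration, and real web servers may parse edge cases differently.

```python
from urllib.parse import urlsplit, parse_qsl


def query_pairs(url):
    """Return the decoded (key, value) pairs found in a URL's query string."""
    query = urlsplit(url).query
    # keep_blank_values=True keeps parameters that have no value
    return parse_qsl(query, keep_blank_values=True)


encoded = query_pairs('http://www.webscrapingfordatascience.com/paramhttp/?query=another%20test%3F%26')
no_values = query_pairs('http://www.example.com/?noparam=&anotherparam')

print(encoded)    # the encoded characters are decoded back to ' ', '?' and '&'
print(no_values)  # both parameters are kept, each with an empty value
```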

### URL Rewriting

* Most web servers will take care to parse URL parameters on their end in order to use their information while rendering a page (or even ignore them when they’re unused; try the URL http://www.webscrapingfordatascience.com/paramhttp/?query=test&other=ignored, for instance), but in recent years, the usage of URL parameters is being avoided somewhat. 
* Instead, most web frameworks will allow us to define “nice looking” URLs that just include the parameters in the path of a URL, for example, “/product/302/” instead of “products.html?p=302”. The former looks nicer when looking at the URL as a human, and search engine optimization (SEO) people will also tell you that search engines prefer such URLs as well. On the server-side of things, any incoming URL can hence be parsed at will, taking pieces from it and “rewriting” it, as it is called, so some parts might end up being used as input while preparing a reply. 
* For us web scrapers, this basically means that even though you don’t see a query string in a URL, there might still be dynamic parts in the URL to which the server might respond in different ways.
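As a rough illustration of what such rewriting might look like on the server side, the sketch below maps a “nice looking” path onto the parameter it stands for. The route pattern and the `rewrite` function are hypothetical; actual web frameworks each have their own routing mechanisms.

```python
import re

# Hypothetical route: /product/<numeric id>/ stands in for products.html?p=<id>
PRODUCT_ROUTE = re.compile(r'^/product/(?P<product_id>\d+)/?$')


def rewrite(path):
    """Return the parameters hidden in a path, or None if the route does not match."""
    match = PRODUCT_ROUTE.match(path)
    if match is None:
        return None
    return {'p': match.group('product_id')}


print(rewrite('/product/302/'))  # the id in the path becomes the 'p' parameter
print(rewrite('/about/'))        # no dynamic part recognized
```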


#### URL Rewriting in Action

* https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes
* https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes
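Both URLs carry the same page title, once in the path and once in the query string. The following sketch pulls it out of either form; `page_title` is a made-up helper, and the `/wiki/` prefix is an assumption based on the two URLs shown here.

```python
from urllib.parse import urlsplit, parse_qs


def page_title(url):
    """Extract the page title from either URL form shown above."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    if 'title' in params:
        # Query string form: /w/index.php?title=...
        return params['title'][0]
    prefix = '/wiki/'
    if parts.path.startswith(prefix):
        # Rewritten form: /wiki/<title>
        return parts.path[len(prefix):]
    return None


nice = page_title('https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes')
plain = page_title('https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes')
print(nice == plain)
```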

In [None]:
url = 'http://www.webscrapingfordatascience.com/paramhttp/?query=test'
r = requests.get(url)
print(r.text)

In [None]:
# In some circumstances, requests will try to help you out and encode some characters for you:
url = 'http://www.webscrapingfordatascience.com/paramhttp/?query=a query with spaces'
r = requests.get(url)
print(r.request.url)

In [None]:
print(r.text)

In [None]:
# However, sometimes the URL is too ambiguous for requests to make sense of it:
url = 'http://www.webscrapingfordatascience.com/paramhttp/?query=complex?&'
r = requests.get(url)
print(r.request.url)

In this case, requests is unsure whether you meant “?&” to belong to the actual URL as is or whether you wanted to encode it. Hence, requests will do nothing and just request the URL as is. On the server-side, this particular web server is able to derive that the second question mark (“?”) should be part of the URL parameter (and should have been properly encoded, but it won’t complain), though the ampersand “&” is too ambiguous in this case. Here, the web server assumes that it is a normal separator and not part of the URL parameter value.
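We can imitate this server-side behavior locally. The sketch below uses the standard library's query string parser as a stand-in for the web server (an assumption; the actual server may use different parsing code) and shows that the trailing “&” is treated as a separator and dropped from the value.

```python
from urllib.parse import urlsplit, parse_qsl

url = 'http://www.webscrapingfordatascience.com/paramhttp/?query=complex?&'

# Everything after the first '?' is the query string
query_string = urlsplit(url).query
# '&' splits the query string into pairs; the empty pair after it is skipped
pairs = parse_qsl(query_string)

print(query_string)
print(pairs)
```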

###### So how, then, can we properly resolve this issue?

In [None]:
print(r.text)

In [None]:
from urllib.parse import quote, quote_plus

In [None]:
raw_string = 'a query with /, spaces and?&'
print(quote(raw_string))
print(quote_plus(raw_string))
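The two functions differ in how they treat spaces and slashes, and each has a matching decoder. The short sketch below, using only `urllib.parse`, round-trips the same string to make the difference visible.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

raw_string = 'a query with /, spaces and?&'

encoded = quote(raw_string)                # spaces become %20, '/' is left alone by default
encoded_all = quote(raw_string, safe='')   # nothing is safe, so '/' is encoded as well
encoded_plus = quote_plus(raw_string)      # spaces become '+', '/' is encoded

print(encoded)
print(encoded_all)
print(encoded_plus)

# Decoding should match the encoding that was used
print(unquote(encoded_all) == raw_string)
print(unquote_plus(encoded_plus) == raw_string)
# unquote does not turn '+' back into spaces
print(unquote(encoded_plus))
```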

In [None]:
url = 'http://www.webscrapingfordatascience.com/paramhttp/?query='

print('\nUsing quote:')
# Nothing is safe, not even '/' characters, so encode everything
r = requests.get(url + quote(raw_string, safe=''))
print(r.url)
print(r.text)

In [None]:
print('\nUsing quote_plus:')
r = requests.get(url + quote_plus(raw_string))
print(r.url)
print(r.text)

In [None]:
url = 'http://www.webscrapingfordatascience.com/paramhttp/'

parameters = {
    'query': 'a query with /, spaces and?&'
}

r = requests.get(url, params=parameters)
print(r.url)
print(r.text)

In [None]:
# Empty parameters, for example, as in params={'query': ''}, will end up in the URL with an equals sign included

url = 'http://www.webscrapingfordatascience.com/paramhttp/'

parameters = {
    'query':''
}

r = requests.get(url, params=parameters)
print(r.url)
print(r.text)
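To get a feel for what happens to a parameters dictionary, we can build a comparable URL by hand with the standard library's `urlencode`. This is only a sketch of the idea, not a description of how requests works internally; `build_url` is a made-up helper that assumes the base URL has no query string yet.

```python
from urllib.parse import urlencode

base_url = 'http://www.webscrapingfordatascience.com/paramhttp/'


def build_url(base, parameters):
    """Append an encoded query string to a base URL without one."""
    return base + '?' + urlencode(parameters)


full = build_url(base_url, {'query': 'a query with /, spaces and?&'})
empty = build_url(base_url, {'query': ''})

print(full)
print(empty)  # the empty value still shows up with its equals sign
```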

In [None]:
url = 'https://en.wikipedia.org/w/index.php/?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
print(r.text)

As you can see, the response body captured by r.text now spits out a slew of confusing-looking text. This is HTML-formatted text, and although the content we’re looking for is buried somewhere inside this soup, we’ll need to learn about a proper way to get out the information we want from there.

#### The Fragment Identifier

* http://www.example.org/about.htm?p=8#contact

URLs can also include a fragment identifier, or “hash,” as it is sometimes called. It is preceded by a hash mark (“#”) and comes at the very end of a URL, even after the query string, as in http://www.example.org/about.htm?p=8#contact. This part of the URL is meant to identify a portion of the document corresponding to the URL.

For instance, a web page can include a link including a fragment identifier that, if you click on it, immediately scrolls your view to the corresponding part of the page. However, the fragment identifier functions differently than the rest of the URL, as it is processed exclusively by the web browser with no participation at all from the web server.
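The standard library can separate the fragment from the rest of a URL, which is handy when you collect links while scraping and want to avoid fetching the same page twice. A minimal sketch using `urllib.parse`:

```python
from urllib.parse import urldefrag, urlsplit

url = 'http://www.example.org/about.htm?p=8#contact'

parts = urlsplit(url)
print(parts.query)     # the query string, without the fragment
print(parts.fragment)  # the part after '#'

# urldefrag returns the URL without its fragment, plus the fragment itself
without_fragment, fragment = urldefrag(url)
print(without_fragment)
```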