# Making requests with the HTTP package
Pkg, Julia's built-in package manager, provides access to the excellent HTTP package. It exposes powerful functionality for building web clients and servers, and we'll use it extensively.

As you're already accustomed to, extra functionality is only two commands away—pkg> add HTTP and julia> using HTTP.

Recall our discussion about HTTP methods from the previous section; the most important ones were GET, used to ask for a resource from the server, and POST, which sends a data payload to the server and accepts the response. The HTTP package exposes a matching set of functions—we get access to HTTP.get, HTTP.post, HTTP.delete, HTTP.put, and so on.

Let's say we want to request Julia's Wikipedia page. All we need is the page's URL and the HTTP.get method:

In [10]:
using HTTP
resp = HTTP.get("https://en.wikipedia.org/wiki/Julia_(programming_language)") ;

The result will be a Response object that represents Julia's Wikipedia page in all its glory. 

In [11]:
resp

HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Date: Sat, 18 Dec 2021 03:37:30 GMT
Vary: Accept-Encoding,Cookie,Authorization
Server: ATS/8.0.8
X-Content-Type-Options: nosniff
P3p: CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."
Content-Language: en
Last-Modified: Wed, 15 Dec 2021 14:38:35 GMT
Content-Type: text/html; charset=UTF-8
Age: 38729
X-Cache: cp2037 hit, cp2037 hit/3
X-Cache-Status: hit-front
Server-Timing: cache;desc="hit-front", host;desc="cp2037"
Strict-Transport-Security: max-age=106384710; includeSubDomains; preload
Report-To: { "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
NEL: { "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
Permissions-Policy: interest-cohort=()
Set-Cookie: WMF-Last-Access=18-Dec-2021;Path=/;HttpOnly;secure;

# Handling HTTP responses
After receiving and processing a request, the server sends back an HTTP response message. These messages have a standardized structure. They contain a wealth of information, with the most important pieces being the status code, the headers, and the body.

# HTTP status codes
The status code is a three-digit integer where the first digit represents the category, while the next two digits are used to define the subcategory. They are as follows:

- 1XX - Informational: Request was received. This indicates a provisional response.
- 2XX - Success: This is the most important response status, acknowledging that the request was successfully received, understood, and accepted. It's what we're looking for in our web-mining scripts.
- 3XX - Redirection: This class of status codes indicates that the client must take additional action. It usually means that additional requests must be made in order to get to the resource, so our scripts will have to handle this scenario. We also need to actively prevent cyclical redirects. We won't deal with such complex scenarios in our project, but in real-life applications, 3XX status codes will require specialized handling based on the subcategory.
Wikipedia provides a good description of the various 3XX status codes and instructions for what to do in each case: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#3xx_Redirection.
- 4XX - Client Error: This means that we've probably made a mistake when sending our request. Maybe the URL is wrong and the resource cannot be found (404) or maybe we're not allowed to access the page (401 and 403 status codes). There's a long list of 4XX response codes and, similar to 3XX ones, our program should handle the various scenarios to ensure that the requests are eventually successful.
- 5XX - Server Error: Congratulations, you found or caused a problem on the server! Depending on the actual status code, this may or may not be actionable. 503 (service unavailable) or 504 (gateway timeout) are relevant as they indicate that we should try again later.
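
The categories above can be sketched in a few lines of Julia. This is only an illustration: status_category and should_retry are hypothetical helper names, not part of the HTTP package, and the retry rule simply mirrors the 503 and 504 cases mentioned in the list.

```julia
# Hypothetical helper (not part of HTTP.jl): map a status code to the
# category names used in the list above, based on its first digit.
function status_category(code::Integer)
    100 <= code <= 599 || throw(ArgumentError("not a valid HTTP status code: $code"))
    categories = ("Informational", "Success", "Redirection", "Client Error", "Server Error")
    return categories[div(code, 100)]
end

# Hypothetical rule: only 503 and 504 are treated as "try again later".
should_retry(code::Integer) = code in (503, 504)

status_category(200)   # "Success"
status_category(404)   # "Client Error"
should_retry(503)      # true
```

A real scraper would probably refine this per status code, but branching on the first digit is often a reasonable starting point.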

# The HTTP message body
The message body carries the content of the web page itself, which makes it the most important part of the response and the reason for web scraping. Even so, it is an optional part of the response. The presence of the body, its properties, and its size are specified by the Content-* family of headers.

# Understanding HTTP responses
The result of the HTTP.get invocation is an object that closely mirrors a raw HTTP response. The package makes our lives easier by extracting the raw HTTP data and neatly setting it up in a data structure, which makes manipulating it a breeze.

Let's take a look at its properties (or fields in Julia's lingo):

In [12]:
fieldnames(typeof(resp))

(:version, :status, :headers, :body, :request)

The fieldnames function accepts a type as its argument and returns a tuple containing the names of the fields (or properties) of the argument. In order to get the type of a value, we can use the typeof function, like in the previous example.
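
To see that fieldnames is not specific to HTTP responses, here is a small sketch on a type of our own. The Page struct is made up purely for this illustration.

```julia
# Hypothetical type, defined only for this illustration.
struct Page
    url::String
    status::Int
end

page = Page("https://example.com", 200)

fieldnames(typeof(page))    # (:url, :status)
fieldnames(Page)            # same result, passing the type directly
getfield(page, :status)     # 200, equivalent to page.status
```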

Right! The status, headers, and body fields should by now sound familiar. The version field represents the version of the HTTP protocol (the HTTP/1.1 part in the first line of the response). Many web servers still answer with version 1.1 of the protocol, although the newer HTTP/2 is by now widely deployed as well. Finally, the request field holds a reference to the HTTP.Messages.Request object that triggered the current response.

# The status code
Let's take a closer look at the status code:

In [13]:
resp.status

200

# The headers
What about the headers? As already mentioned, they contain critical information indicating whether a message body is present. Let's check them out:

In [14]:
resp.headers

24-element Vector{Pair{SubString{String}, SubString{String}}}:
                   "Date" => "Sat, 18 Dec 2021 03:37:30 GMT"
                   "Vary" => "Accept-Encoding,Cookie,Authorization"
                 "Server" => "ATS/8.0.8"
 "X-Content-Type-Options" => "nosniff"
                    "P3p" => "CP=\"See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info.\""
       "Content-Language" => "en"
          "Last-Modified" => "Wed, 15 Dec 2021 14:38:35 GMT"
           "Content-Type" => "text/html; charset=UTF-8"
                    "Age" => "38729"
                "X-Cache" => "cp2037 hit, cp2037 hit/3"
                          ⋮
     "Permissions-Policy" => "interest-cohort=()"
             "Set-Cookie" => "WMF-Last-Access=18-Dec-2021;Path=/;HttpOnly;secure;Expires=Wed, 19 Jan 2022 12:00:00 GMT"
             "Set-Cookie" => "WMF-Last-Access-Global=18-Dec-2021;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Wed, 19 Jan 2022 12:00:00 GMT"
            "X-Client

# The message body
The Content-Type header tells us to expect an HTML body, so let's see it:

In [15]:
resp.body

352982-element Vector{UInt8}:
 0x3c
 0x21
 0x44
 0x4f
 0x43
 0x54
 0x59
 0x50
 0x45
 0x20
    ⋮
 0x3e
 0x0a
 0x3c
 0x2f
 0x68
 0x74
 0x6d
 0x6c
 0x3e

Oops, that doesn't look like the web page we were expecting. No worries though, these are the bytes of the raw response—which we can easily convert to a human-readable HTML string. Remember that I mentioned the String method when learning about strings? Well, this is where it comes in handy:

In [16]:
resp_body = String(resp.body)

"<!DOCTYPE html>\n<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">\n<head>\n<meta charset=\"UTF-8\"/>\n<title>Julia (programming language) - Wikipedia</title>\n<script>document.documentElement.className=\"client-js\";RLCONF={\"wgBreakFrames\":false,\"wgSeparatorTransformTable\":[\"\",\"\"" ⋯ 352444 bytes ⋯ "pedia\\/commons\\/1\\/1f\\/Julia_Programming_Language_Logo.svg\",\"headline\":\"high-performance dynamic programming language\"}</script>\n<script>(RLQ=window.RLQ||[]).push(function(){mw.config.set({\"wgBackendResponseTime\":202,\"wgHostname\":\"mw1327\"});});</script>\n</body>\n</html>"

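One detail worth knowing about this conversion: calling String on a Vector{UInt8} takes over the vector's data and leaves the vector empty, so resp.body cannot be converted a second time. The sketch below uses a short, made-up byte vector instead of a real response to show the effect, and how converting a copy avoids it.

```julia
# Bytes of a short, made-up document, standing in for resp.body.
bytes = Vector{UInt8}("<!DOCTYPE html>")

bytes[1:2]                    # UInt8[0x3c, 0x21], the characters '<' and '!'

# Converting a copy leaves the original byte vector untouched.
text = String(copy(bytes))    # "<!DOCTYPE html>"
length(bytes)                 # still 15

# Converting the vector itself hands its data over to the new String.
text_again = String(bytes)
isempty(bytes)                # true
```

This is why it makes sense to store the converted string in a variable such as resp_body right away, as we did above.
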
## Learning about pairs
While looking at the response header, you might've noticed that its type is an Array of Pair objects:

In [17]:
resp.headers

24-element Vector{Pair{SubString{String}, SubString{String}}}:
                   "Date" => "Sat, 18 Dec 2021 03:37:30 GMT"
                   "Vary" => "Accept-Encoding,Cookie,Authorization"
                 "Server" => "ATS/8.0.8"
 "X-Content-Type-Options" => "nosniff"
                    "P3p" => "CP=\"See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info.\""
       "Content-Language" => "en"
          "Last-Modified" => "Wed, 15 Dec 2021 14:38:35 GMT"
           "Content-Type" => "text/html; charset=UTF-8"
                    "Age" => "38729"
                "X-Cache" => "cp2037 hit, cp2037 hit/3"
                          ⋮
     "Permissions-Policy" => "interest-cohort=()"
             "Set-Cookie" => "WMF-Last-Access=18-Dec-2021;Path=/;HttpOnly;secure;Expires=Wed, 19 Jan 2022 12:00:00 GMT"
             "Set-Cookie" => "WMF-Last-Access-Global=18-Dec-2021;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Wed, 19 Jan 2022 12:00:00 GMT"
            "X-Client

Pair is a Julia data structure and also the name of the corresponding type. A Pair holds two values and is generally used to express a key-value relationship. The types of the two elements determine the concrete type of the Pair.

For example, we can construct a Pair with the following:

In [19]:
Pair(:foo, "bar") 

:foo => "bar"

In [20]:
typeof(Pair(:foo, "bar")) 

Pair{Symbol, String}

We can also create Pairs by using the x => y literal notation:

3 => "c"

In [21]:
typeof(3 => "c")

Pair{Int64, String}

Obviously, once created, it is possible to access the values stored in a Pair. One way to do it is by indexing into it:

In [22]:
p = 3 => "c"

3 => "c"

In [23]:
p[1]

3

In [24]:
p[2]

"c"

We can also access the first and second fields in order to get to the first and second values, respectively:

In [25]:
p.first

3

In [26]:
p.second

"c"

Pairs are one of the building blocks of Julia and can be used, among other things, for creating dictionaries, one of the most important types and data structures.

In [29]:
dic = Dict(p)

Dict{Int64, String} with 1 entry:
  3 => "c"
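
As a further sketch, pairs can also be taken apart by destructuring, and several of them can be passed to Dict at once. The letters dictionary is made up for this example.

```julia
p = 3 => "c"

# Destructuring assigns both values in one step.
key, value = p

# first and last work on a Pair as well.
first(p), last(p)             # (3, "c")

# Several pairs build a dictionary with several entries.
letters = Dict(1 => "a", 2 => "b", 3 => "c")

# Iterating over a Dict yields Pair objects, which can be destructured
# directly in the loop header. The iteration order is not guaranteed.
for (k, v) in letters
    println(k, " maps to ", v)
end
```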

# Using the HTTP response
Armed with a good understanding of Julia's dictionary data structure, we can now take a closer look at the headers property of resp, our HTTP response object.

To make it easier to access the various headers, first let's convert the array of Pair to a Dict:

In [30]:
headers = Dict(resp.headers)

Dict{SubString{String}, SubString{String}} with 22 entries:
  "Connection"                => "keep-alive"
  "Date"                      => "Sat, 18 Dec 2021 03:37:30 GMT"
  "Age"                       => "38729"
  "Accept-Ranges"             => "bytes"
  "P3p"                       => "CP=\"See https://en.wikipedia.org/wiki/Specia…
  "Cache-Control"             => "private, s-maxage=0, max-age=0, must-revalida…
  "X-Cache-Status"            => "hit-front"
  "Server"                    => "ATS/8.0.8"
  "Content-Length"            => "352982"
  "Server-Timing"             => "cache;desc=\"hit-front\", host;desc=\"cp2037\…
  "Last-Modified"             => "Wed, 15 Dec 2021 14:38:35 GMT"
  "NEL"                       => "{ \"report_to\": \"wm_nel\", \"max_age\": 864…
  "X-Content-Type-Options"    => "nosniff"
  "Permissions-Policy"        => "interest-cohort=()"
  "Vary"                      => "Accept-Encoding,Cookie,Authorization"
  "X-Cache"                   => "cp2037 hit, cp2037 hit/

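Notice that the Dict has fewer entries than the original vector. A dictionary stores each key only once, so headers that appear several times, such as Set-Cookie, are collapsed into a single entry. Here is a minimal sketch of the effect, using a made-up header list rather than the real response:

```julia
# Made-up header list with a repeated name, mimicking the Set-Cookie entries.
raw_headers = [
    "Content-Type" => "text/html",
    "Set-Cookie"   => "a=1",
    "Set-Cookie"   => "b=2",
]

as_dict = Dict(raw_headers)

length(raw_headers)       # 3
length(as_dict)           # 2, the repeated key is stored only once
as_dict["Set-Cookie"]     # "b=2", the later pair overwrote the earlier one

# Filtering the original vector keeps every value of a repeated header.
cookies = [value for (name, value) in raw_headers if name == "Set-Cookie"]
```

For headers that occur only once, such as Content-Length, the dictionary is the more convenient structure.
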
We can check the Content-Length value to determine whether or not we have a response body. If it's larger than 0, that means we got back an HTML message:

In [32]:
headers["Content-Length"]

"352982"

It's important to remember that all the values in the headers dictionary are strings, so comparing them with a number directly fails. We first have to parse the value into an Int:

In [33]:
headers["Content-Length"] > 0 

MethodError: MethodError: no method matching isless(::Int64, ::SubString{String})
Closest candidates are:
  isless(!Matched::AbstractString, ::AbstractString) at C:\Users\gilju\AppData\Local\Programs\Julia-1.7.0\share\julia\base\strings\basic.jl:344
  isless(::Real, !Matched::AbstractFloat) at C:\Users\gilju\AppData\Local\Programs\Julia-1.7.0\share\julia\base\operators.jl:185
  isless(::Real, !Matched::Real) at C:\Users\gilju\AppData\Local\Programs\Julia-1.7.0\share\julia\base\operators.jl:430
  ...

In [34]:
parse(Int, headers["Content-Length"]) > 0 

true
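
If we are not sure that the header is present or well formed, a slightly more defensive variant is possible. The following is a sketch only: content_length is a hypothetical helper, and it relies on Base's get and tryparse, which return a fallback or nothing instead of throwing.

```julia
# Hypothetical helper: return the Content-Length as an Int, or nothing when
# the header is missing or does not contain a valid integer.
function content_length(headers::AbstractDict)
    raw = get(headers, "Content-Length", nothing)
    raw === nothing && return nothing
    return tryparse(Int, raw)
end

content_length(Dict("Content-Length" => "352982"))   # 352982
content_length(Dict("Content-Length" => "abc"))      # nothing
content_length(Dict("Server" => "ATS/8.0.8"))        # nothing
```

Returning nothing keeps "unknown" separate from a length of 0, which the calling code can then handle as it sees fit.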

# Manipulating the response body
Earlier, we read the response body into a String and stored it in the resp_body variable. It's a long HTML string and, in theory, we could use Regex and other string-processing functions to find and extract the data that we need. However, such an approach would be extremely complicated and error-prone. The best way to search for content in an HTML document is via HTML and CSS selectors. The only problem is that these selectors don't operate on strings; they only work against a Document Object Model (DOM).

# Building a DOM representation of the page
The DOM represents an in-memory structure of an HTML document. It is a data structure that allows us to programmatically manipulate the underlying HTML elements. The DOM represents a document as a logical tree, and we can use selectors to traverse and query this hierarchy.
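
To make the idea of a logical tree more concrete, here is a toy sketch. The Node type and the tags function are hypothetical and deliberately minimal; they are not how any real HTML parser represents a document.

```julia
# Hypothetical minimal tree node, for illustration only.
struct Node
    tag::Symbol
    children::Vector{Node}
end

# Convenience constructor for elements without children.
Node(tag::Symbol) = Node(tag, Node[])

# Collect every tag in document order (depth-first, parents before children).
function tags(node::Node, found = Symbol[])
    push!(found, node.tag)
    for child in node.children
        tags(child, found)
    end
    return found
end

page = Node(:html, [
    Node(:head, [Node(:title)]),
    Node(:body, [Node(:h1), Node(:p)]),
])

tags(page)   # [:html, :head, :title, :body, :h1, :p]
```

Querying a document then boils down to walking this hierarchy and picking the nodes that match some condition.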

# Parsing HTML with Gumbo
Julia's Pkg ecosystem provides access to Gumbo, an HTML parser library. Provided with an HTML string, Gumbo will parse it into a document and its corresponding DOM.

In [35]:
using Gumbo

In [36]:
dom = parsehtml(resp_body)

HTML Document:
<!DOCTYPE html>
HTMLElement{:HTML}:<HTML class="client-nojs" dir="ltr" lang="en">
  <head>
    <meta charset="UTF-8"/>
    <title>
      Julia (programming language) - Wikipedia
    </title>
    <script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"3dbfaf3d-4ab8-4294-bb06-9e95b5a775d0","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Julia_(programming_language)","wgTitle":"Julia (programming language)","wgCurRevisionId":1060439771,"wgRevisionId":1060439771,"wgArticleId":38455554,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: uses authors parameter","CS1 maint: 

The dom variable now references a Gumbo.HTMLDocument, an in-memory Julia representation of the web page. It's a simple object that has only two fields:

In [37]:
fieldnames(typeof(dom)) 

(:doctype, :root)

The doctype represents the HTML <!DOCTYPE html> element, which is what the Wikipedia page uses:

In [39]:
dom.doctype 

"html"

Now, let's focus on the root property. This is effectively the outermost element of the HTML page—the <html> tag containing the rest of the elements. It provides us with an entry point into the DOM. We can ask Gumbo about its attributes:

In [40]:
dom.root.attributes 

Dict{AbstractString, AbstractString} with 3 entries:
  "class" => "client-nojs"
  "lang"  => "en"
  "dir"   => "ltr"

When in doubt, we can just ask about the name of an element using the tag method:

In [41]:
tag(dom.root) 

:HTML

Gumbo exposes a children method which returns an array containing all the nested HTMLElement objects. If you just go ahead and execute julia> children(dom.root), the REPL output will be hard to follow. The REPL representation of an HTMLElement is its HTML code, which, for top-level elements with many children, will fill up many terminal screens. Let's use a for loop to iterate over the children and show just their tags:

In [42]:
for c in children(dom.root) 
    @show tag(c) 
end 

tag(c) = :head
tag(c) = :body


Much better!

Since the children are part of a collection, we can index into them:

In [43]:
body = children(dom.root)[2]

HTMLElement{:body}:<body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject mw-editable page-Julia_programming_language rootpage-Julia_programming_language skin-vector action-view skin-vector-legacy">
  <div class="noprint" id="mw-page-base"></div>
  <div class="noprint" id="mw-head-base"></div>
  <div class="mw-body" id="content" role="main">
    <a id="top"></a>
    <div id="siteNotice"></div>
    <div class="mw-indicators"></div>
    <h1 class="firstHeading" id="firstHeading">
      Julia (programming language)
    </h1>
    <div class="vector-body" id="bodyContent">
      <div class="noprint" id="siteSub">
        From Wikipedia, the free encyclopedia
      </div>
      <div id="contentSub"></div>
      <div id="contentSub2"></div>
      <div id="jump-to-nav"></div>
      <a class="mw-jump-link" href="#mw-head">
        Jump to navigation
      </a>
...


The body variable now references an instance of HTMLElement{:body}.

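The last method that we'll need is getattr, which returns the value of the attribute with the given name. If the attribute is not defined for the element, it raises a KeyError; this is what a call such as getattr(dom.root, "href") does, since the <html> tag has no href attribute. For an attribute that is defined, we get back its value: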

In [45]:
getattr(dom.root, "class") 

"client-nojs"

# Coding defensively
An error like the KeyError that getattr raises for a missing attribute, when part of a larger script, has the potential to completely alter a program's execution, leading to undesired and potentially costly results. In general, when something unexpected occurs during the execution of a program, it may leave the software in an erroneous state, making it impossible to return a correct value. In such cases, rather than pushing on and potentially propagating the problem throughout the whole execution stack, it's preferable to explicitly notify the calling code about the situation by throwing an Exception.

Many functions, both in Julia's core and within third-party packages, make good use of the error-throwing mechanism. It's good practice to check the docs for the functions you use and to see what kinds of errors they throw. An error is called an exception in programming lingo.

As in the case of getattr, the author of the Gumbo package warned us that attempting to read an attribute that was not defined would result in a KeyError exception. We'll learn soon how to handle exceptions by capturing them in our code, getting info about the problem, and stopping or allowing the exception to propagate further up the call stack. Sometimes it's the best approach, but it's not a technique we want to abuse since handling errors this way can be resource-intensive. Dealing with exceptions is considerably slower than performing simple data integrity checks and branching.

For our project, the first line of defense is to simply check if the attribute is in fact defined in the element. We can do this by retrieving the keys of the attributes Dict and checking if the one we want is part of the collection. It's a one-liner:

In [46]:
in("href", collect(keys(attrs(dom.root)))) 

false

Clearly, href is not an attribute of the <html> tag.

Using this approach, we can easily write logic to check for the existence of an attribute before we attempt to look up its value.
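
Since the attributes are stored in a Dict, such a check can also be sketched with Base's haskey and get. The attributes dictionary below is a plain stand-in for the attributes of the <html> element, so the snippet runs without Gumbo.

```julia
# Stand-in for the attributes of the <html> element shown earlier.
attributes = Dict("class" => "client-nojs", "lang" => "en", "dir" => "ltr")

# haskey answers the "is it defined?" question without collecting the keys.
haskey(attributes, "href")      # false
haskey(attributes, "lang")      # true

# get returns a fallback value instead of throwing when the key is missing.
get(attributes, "href", "")     # ""
get(attributes, "lang", "")     # "en"
```

Which fallback makes sense (an empty string, nothing, or something else) depends on what the calling code expects.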

# The pipe operator
Reading multiple nested functions can be taxing on the brain. The previous example, collect(keys(attrs(dom.root))), can be rewritten to improve readability using Julia's pipe operator, |>.

For example, the following snippet nests three function calls, with the result of each inner call becoming the argument of the call that encloses it:

In [47]:
collect(keys(attrs(dom.root))) 

3-element Vector{AbstractString}:
 "class"
 "lang"
 "dir"

This can be rewritten for improved readability as a chain of functions using the pipe operator. This code produces the exact same result:

In [48]:
dom.root |> attrs |> keys |> collect 

3-element Vector{AbstractString}:
 "class"
 "lang"
 "dir"

The |> operator takes the value on its left and passes it as the argument to the function on its right. So dom.root |> attrs is identical to attrs(dom.root). Unfortunately, the pipe operator works only for one-argument functions. But it's still very useful for decluttering code and improving readability.
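
One common way around the one-argument limitation is to wrap the call in an anonymous function. The sketch below uses a plain Dict as a stand-in for the element's attributes, so it runs without Gumbo; the variable names are made up.

```julia
attributes = Dict("class" => "client-nojs", "lang" => "en", "dir" => "ltr")

# sort makes the order deterministic, since Dict keys come back in no
# particular order.
attr_names = attributes |> keys |> collect |> sort

# An anonymous function adapts the two-argument in to the pipeline.
# The parentheses keep the function body from swallowing later stages.
has_href = attributes |> keys |> (ks -> in("href", ks))
```

Here attr_names is ["class", "dir", "lang"] and has_href is false.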

# Handling errors like a pro
Sometimes, coding defensively won't be the solution. Maybe a key part of your program requires reading a file on the network or accessing a database. If the resource can't be accessed due to a temporary network failure, there's really not much you can do in the absence of the data.

If you identify parts of your code where you think the execution can go off the rails due to conditions that are out of your control (that is, exceptional conditions—hence the name exception), you can use Julia's try...catch statements. This is exactly what it sounds like—you instruct the compiler to try a piece of code and if, as a result of a problem, an exception is thrown, to catch it. The fact that an exception is caught implies that it won't propagate throughout the whole application.

Let's see it in action:



In [50]:
try 
    getattr(dom.root, "href") 
catch 
    println("The $(tag(dom.root)) tag doesn't have a 'href' attribute.") 
end 

The HTML tag doesn't have a 'href' attribute.


In this example, once an error is encountered, the execution of the code in the try branch is stopped exactly at that point, and the execution flow continues right away, in the catch branch.
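
A bare catch swallows every kind of error, including ones we did not anticipate. A more selective variant, sketched here with a plain Dict and a hypothetical describe function, binds the exception to a variable, handles only the KeyError, and lets anything else propagate with rethrow.

```julia
# Stand-in for the attributes of the <html> element.
attributes = Dict("class" => "client-nojs", "lang" => "en", "dir" => "ltr")

# Hypothetical helper that reports an attribute or the lack of it.
function describe(attributes, name)
    try
        return "$name = $(attributes[name])"
    catch ex
        # Handle only the failure we expect; anything else keeps propagating.
        ex isa KeyError || rethrow()
        return "no '$(ex.key)' attribute"
    end
end

describe(attributes, "lang")   # "lang = en"
describe(attributes, "href")   # "no 'href' attribute"
```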

# The finally clause
In code that performs state changes or uses resources such as files or databases, there is typically some clean-up work (such as closing files or database connections) that needs to be done when the code is finished. This code would normally go into the try branch—but what happens if an exception is thrown?

In such cases, the finally clause comes into play. This can be added after a try or after a catch branch. The code within the finally block is guaranteed to be executed, regardless of whether exceptions are thrown or not:

In [52]:
try 
    getattr(dom.root, "href") 
catch ex 
    println("The $(tag(dom.root)) tag doesn't have a '$(ex.key)' attribute.") 
finally 
    println("I always get called") 
end 

The HTML tag doesn't have a 'href' attribute.
I always get called
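
The same pattern applies to real resources. As a rough sketch, the following writes to a temporary file, simulates a failure half-way through, and still closes and removes the file in the finally block. The events vector exists only to record what happened.

```julia
events = String[]
path, io = mktemp()            # temporary file and an open stream to it

try
    write(io, "partial data")
    error("something went wrong half-way")   # simulated failure
catch ex
    push!(events, "caught: " * ex.msg)
finally
    # Runs whether or not the try block threw.
    close(io)
    rm(path; force = true)
    push!(events, "cleaned up")
end

events   # ["caught: something went wrong half-way", "cleaned up"]
```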


