# HttpCommand

> Ignorance more frequently begets confidence than does knowledge: it is those who know little, not those who know much, who so positively assert that this or that problem will never be solved by science. --_Charles Darwin_

In [1]:
⎕IO ← 0
]box on
]rows on

Ah yes, the web. I'm sure you've heard of it. Dyalog has a nifty HTTP client library built in, called `HttpCommand`. In order to make use of it, we first need to load it up:

In [2]:
hc ← ⎕SE.SALT.Load'HttpCommand'

This loads the `HttpCommand` class, calling it `hc`. We could also have used the Dyalog user command `]load HttpCommand`, which loads it as `HttpCommand` -- but who wants to type all that? The above approach is also usable programmatically. 

## Kanye.rest

To demonstrate, we can put a handy web service to good use: `kanye.rest`, which delivers random Kanye West quotations. Kanye as a Service?

In [3]:
⎕ ← resp ← hc.Get 'https://api.kanye.rest/'

What did we get back? Let's look in the headers to start with:

In [4]:
{(⍵[;0]∊⊂'Content-Type')⌿⍵} resp.Headers
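
To see what that filter does without touching the network, here is a small sketch on a made-up header table. The matrix `hdrs` and its contents are assumptions for illustration only, shaped like the two-column name/value matrix the filter above expects.

```apl
⎕IO ← 0                                  ⍝ same index origin as the rest of this notebook
⍝ Hypothetical stand-in for resp.Headers: one row per header, name in column 0
hdrs ← 3 2⍴'Server' 'nginx' 'Content-Type' 'application/json' 'Connection' 'keep-alive'
ct ← {(⍵[;0]∊⊂'Content-Type')⌿⍵} hdrs    ⍝ Boolean mask over the names, then compress the rows
⍴ct                                      ⍝ shape of the result: one matching row, two columns
ct[0;1]                                  ⍝ the value of the matching header
```

Note that the comparison is case-sensitive, so a server that sends `content-type` in lower case would slip through this particular filter.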

We know JSON; good. Let's unpack that.

In [5]:
body ← ⎕JSON resp.Data

We can use the handy user command `]map` to show what the inside of a namespace looks like (and yes, there is no way you'd ever discover its existence without being told):

In [7]:
]map body  ⍝ list the fields

So the payload we're interested in is in the `quote` field of the `body` namespace:

In [8]:
body.quote

If we already know we're dealing with JSON, there is a handy shortcut, called `GetJSON`:

In [9]:
(hc.GetJSON 'GET' 'https://api.kanye.rest/').Data.quote

`GetJSON` will do a couple of things for us behind the scenes. It will unpack the JSON body of the HTTP response. If we're POSTing to the URL, it will also treat a parameter namespace as the JSON body of the _request_; we'll look at that in more detail below. One last pearl of wisdom from Kanye:

In [10]:
(hc.GetJSON 'GET' 'https://api.kanye.rest/').Data.quote

Thanks for that, Kanye. Here's the seminal [Gold Digger](https://open.spotify.com/track/3QHPHLAkYV5cQBUYs6rowx?si=7JnJeGWiT6aGCAXr1YlRIg), feat. Jamie Foxx. If you're easily offended by colorful language, that's probably not a choon for you.

<iframe src="https://open.spotify.com/embed/track/3QHPHLAkYV5cQBUYs6rowx" width="300" height="380" frameborder="0" allowtransparency="true" allow="encrypted-media"></iframe>

## A more complex API

Anyway. Back to APL. Let's look at a more complex API to examine a large data set. The Cloudant database `https://skruger.cloudant.com/airaccidents` contains a large data set drawn from the FAA, listing air accident reports. [Cloudant](https://www.ibm.com/cloud/cloudant) is a db as a service running the open source [CouchDB](https://couchdb.apache.org/) database, a JSON-over-HTTP distributed document store. Let's pull a few documents from it and see what they look like.

We're going to hit the `_all_docs` endpoint, but as this is a large database, we only want to fetch a few documents. In order to do so, we pass the parameters `limit=3` and `include_docs=true` on the URL.

In [12]:
url ← 'https://skruger.cloudant.com/airaccidents/_all_docs'
(params←⎕NS⍬).(include_docs limit) ← 'true' 3

In [13]:
resp ← hc.Get url params

We know it's going to be JSON, as everything in CouchDB is JSON.

In [14]:
body ← ⎕JSON resp.Data
]map body

For the `_all_docs` API endpoint, the data is returned under `rows`, and if we set the `include_docs` parameter, each of those entries will have a `doc` field, containing the document itself. Let's look at the first one.

In [15]:
]map body.rows[0].doc

Yuck. What _is_ that‽ So the JSON field names aren't valid APL names, meaning Dyalog had to mangle them when converting to namespaces. We _can_ read them like that if we want to, for example

In [16]:
body.rows[0].doc.⍙Event⍙32⍙Date

but it sure hurts the eyes. A handy trick if we want to quickly peer into a nested JSON namespace thing is to... turn it back into JSON, but nicer:

In [41]:
1(⎕JSON⍠'Compact' 0)body.rows[0].doc

As before, we could have used `GetJSON` instead, and we can utilize the scalar extension behavior of arrays of namespaces to pick out all the embedded docs in the `rows` field of the CouchDB `_all_docs` response. 

A quirk of `GetJSON`, if you're used to, say, Python's `requests` library, is that it will encode any parameters given as JSON and pass them in the request body. That isn't going to work against the CouchDB API for a GET, so we need to tack the parameters onto the URL ourselves first:

In [17]:
(hc.GetJSON 'GET' (url,'?include_docs=true&limit=3')).Data.rows.doc
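
Gluing the query string together by hand gets tedious beyond a couple of parameters. A minimal sketch of a helper follows; `QueryString`, `pairs` and `endpoint` are hypothetical names that are not part of `HttpCommand`, and the helper assumes every value is already a character vector and needs no URL-encoding.

```apl
⍝ Hypothetical helper: join a list of (name value) pairs into a query string.
⍝ Assumes a list of at least two pairs; a single pair must be passed as ,⊂('limit' '3').
⍝ Nothing is URL-encoded here.
QueryString ← {1↓∊{'&',(⊃⍵),'=',⊃⌽⍵}¨⍵}   ⍝ build '&name=value' per pair, flatten, drop the leading '&'
pairs ← ('include_docs' 'true')('limit' '3')
endpoint ← 'https://skruger.cloudant.com/airaccidents/_all_docs'
endpoint,'?',QueryString pairs             ⍝ the URL with its parameters appended
```

The resulting character vector can be passed wherever the hand-built URL was used above.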

The other option is to convert the data to an array instead:

In [19]:
⎕ ← body ← ⎕JSON⍠'M' ⊢ resp.Data

Which is actually quite suitable for this data -- the documents are completely flat. The documents themselves are at "depth 4", as indicated by the first column.

The database has a few handy indexes, too, which in CouchDB-speak are called _views_. Let's look at a couple of those. The first view allows us to fetch documents based on the make of aircraft. Here's the first entry in the index where the make of the plane involved was a Cessna:

In [20]:
url ← 'https://skruger.cloudant.com/airaccidents/_design/make/_view/by-make'
(params ← ⎕NS⍬).(limit reduce key) ← 1 'false' '"Cessna"'
resp ← hc.Get url params
⎕ ← body ← ⎕JSON⍠'M' ⊢ resp.Data

This is a materialised view, keyed on make. The view itself contains no particularly interesting information beyond the document id and the "value" 1. We can fetch this document by the id:

In [21]:
url ← 'https://skruger.cloudant.com/airaccidents/5b97c6d78b17b37ceff620baf9657693'
resp ← hc.Get url
⎕ ← body ← ⎕JSON⍠'M' ⊢ resp.Data

but perhaps more interesting is that we can do aggregations if we enable the `reduce` part of the view. We can also exploit the CouchDB API a bit further by using a POST instead, noting that we again treat the body and URL parameters separately. Let's say we want to find the accident distribution, per make, for a make subset:

In [22]:
url ← 'https://skruger.cloudant.com/airaccidents/_design/make/_view/by-make?group=true'
(params ← ⎕NS⍬).keys ← 'Cessna' 'Boeing' 'Airbus' ⍝ Request body payload; will be JSON-encoded
body ← (hc.GetJSON 'POST' url params).Data

As we're now relying on `GetJSON` to encode our parameter list, we no longer need the ugly double-quotes in our list of keys.

Reductions in CouchDB views are similar to reduces in APL. All we did there was a `+/` over the values in the view, which as we saw earlier was a "1", grouping by key:

In [23]:
1(⎕JSON⍠'Compact' 0)body.rows
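
The grouped reduce has a close local analogue in APL's key operator `⌸`. The sketch below uses made-up data (`makes` and `counts` are hypothetical, not drawn from the database) and only illustrates the idea of summing per key; it says nothing about how CouchDB computes its views internally.

```apl
⍝ Made-up sample of view rows: one key per row, each carrying the value 1
makes ← 'Cessna' 'Boeing' 'Cessna' 'Airbus' 'Cessna'
counts ← 1 1 1 1 1
makes {⍺,+/⍵}⌸ counts     ⍝ one row per distinct key: the key, then the sum of its values
```

Each row of the result pairs a distinct make with its total, in order of first appearance, so Cessna comes out with 3 and the other two with 1 each.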

Now we're running the risk of making this about the CouchDB API, but this is quite an interesting data set. I made a [video](https://www.youtube.com/watch?v=2SXPMCvuTQA) a long time ago about it and how to process the data with map-reduce using CouchDB, and this was all inspired by a very old [blog post](https://blog.cloudant.com/2011/01/13/mapreduce-from-the-basics-to-the-actually-useful.html) from Cloudant founder, [Mike Miller](https://www.linkedin.com/in/mlmilleratmit). 

## Other useful bits

You can pass a left argument 1 to `HttpCommand`'s functions to inspect what the request would have looked like had it been issued:

In [24]:
1 hc.GetJSON 'GET' 'https://api.kanye.rest/'

HttpCommand will strip basic auth params passed on the URL and turn them into a header instead:

In [25]:
1 hc.Get 'https://username:password@example.com'

## Web scraping

The Dyalog student competition in 2020 had a web-scraping problem, problem 3 from [Phase 2](https://www.dyalog.com/uploads/files/student_competition/2020_problems_phase2.pdf), asking us to find all URLs referencing PDF files on the competition website, [https://www.dyalog.com/student-competition.htm](https://www.dyalog.com/student-competition.htm). There will be spoilers, so if you want to have a go yourself, stop reading here.

Still here? The suggestion is that we process the data as XML (*sigh*). Let's grab that page and see what we can find:

In [27]:
page ← (_←hc.Get 'https://www.dyalog.com/student-competition.htm').Data

In [28]:
xml ← ⎕XML page

What we get back from `⎕XML` is a matrix with columns for depth, tag, content, attribute and type. We care only about the tag and attribute columns:

In [29]:
(tags attributes) ← (⊂1 3)⌷↓⍉xml

To pick out URLs, we need to look at the anchor tags:

In [30]:
anchors ← ((,'a')∘≡¨tags)/attributes

Each such tag has a set of attribute key-value pairs. Let's grab those:

In [31]:
(names vals) ← ↓⍉⊃⍪/anchors

Now we can look through the attribute values to find things that end in `.pdf`:

In [32]:
pdfs ← vals/⍨{'.pdf'∘≡¯4↑⍵}¨vals
↑3↑pdfs
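
To experiment with these steps offline, the same pipeline can be run on a tiny hand-written document. The markup in `tdoc` is made up for illustration, and every name carries a `t` prefix so that the variables built from the real page are left untouched.

```apl
⎕IO ← 0                                        ⍝ as set at the top of this notebook
⍝ Made-up markup: two anchors, only one of which points at a PDF
tdoc ← '<p><a href="a.pdf">A</a><a href="b.htm">B</a></p>'
(ttags tattrs) ← (⊂1 3)⌷↓⍉⎕XML tdoc            ⍝ tag and attribute columns
tanchors ← ((,'a')∘≡¨ttags)/tattrs             ⍝ attribute matrices of the anchor tags only
(tnames tvals) ← ↓⍉⊃⍪/tanchors                 ⍝ stack into one matrix, split into names and values
tpdfs ← tvals/⍨{'.pdf'∘≡¯4↑⍵}¨tvals            ⍝ keep values whose last four characters are '.pdf'
tpdfs
```

Only the `a.pdf` link should survive the final filter. A value shorter than four characters is padded with spaces by `¯4↑`, so it can never match.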

As we can see, these are all relative URLs. To convert to absolute, we need to extract the `base`, and prepend that:

In [33]:
⎕ ← base ← ⊃⌽⊃('base'∘≡¨tags)/attributes

In [34]:
↑3↑base∘,¨pdfs

Putting it all together, we get something like

In [35]:
]dinput
PastTasks ← {
    (tags attributes) ← (⊂1 3)⌷↓⍉⎕XML(hc.Get ⍵).Data
    (names vals) ← ↓⍉⊃⍪/((,'a')∘≡¨tags)/attributes ⍝ Names and values of attributes of anchor tags
    pdfs ← vals/⍨{'.pdf'∘≡¯4↑⍵}¨vals
    base ← ⊃⌽⊃('base'∘≡¨tags)/attributes
    base∘,¨pdfs
}

Here's what we get:

In [36]:
↑PastTasks 'https://www.dyalog.com/student-competition.htm'

For the sake of completeness, we could equally have solved this with some filthy regexing if we were so inclined:

In [37]:
]dinput
PastTasksRE ← {
    body ← (hc.Get ⍵).Data
    pdfs ← '<a href="(.+?\.pdf)"'⎕S'\1'⊢body
    base ← '<base href="(.+?)"'⎕S'\1'⊢body
    base,¨pdfs
}

In [38]:
↑PastTasksRE 'https://www.dyalog.com/student-competition.htm'

So which one is better? I wrote a bit on this topic in my guest [blog post on Dyalog's blog](https://www.dyalog.com/blog/2021/04/2020-problem-solving-competition-phase-ii-highlights/). In this case, as we _know_ that the page is valid XML, we can delegate a lot of complexity to the `⎕XML` function, such as different quotes, whitespace etc, which we'd need to be explicit about in anything regexy if we wanted it to be robust. However, regular expressions are hard to beat when looking for complex patterns in textual data. If the page had _not_ been correct XML, it would have been a lot harder to solve this problem without reaching for regular expressions.