## Accessing APIs

*I wrote version 1 of this notebook in 2019, based off the tutorial by [Allison Parrish](http://www.decontextualize.com/). It was updated in 2020 by Dan Sinykin and again by me in 2021.* 

Where we are for today's class is as follows: We all agree that we want to do data science with text. We know we can acquire text from digital libraries, and we've also learned how to acquire text via web scraping. But sometimes our best (by which we mean easiest and most efficient) option is to acquire text through web APIs.

Before we go further, what even is an API? [Here](https://medium.com/epfl-extension-school/an-illustrated-introduction-to-apis-10f8000313b9) is a detailed explanation (which you should have read for today's class).

But all you really need to know for now is this:

**A web API is some collection of data, made available on the web, provided in a format easy for computers to parse.**

Note also that the data can be text, but doesn't have to be.

Twitter's [Search API](https://developer.twitter.com/) is probably the most commonly used API by social scientists and computational linguists seeking to analyze text. You might want to use that for your final project.

But let's start with something simple:

[Yes or No?](https://yesno.wtf/) (more fun ones on [this list](https://dev.to/mkrl/apis-you-didnt-know-you-needed-38c))

**Reload "Yes or No?" a few times. What does it do?**

In [None]:
# It shows a gif that accompanies either "Yes" or "No"

Now, if we wanted to use this for a project, we *could* scrape the HTML in the way we've just learned. 

But the site's API gives us an easier way!

Take a look:
[Yes or No API?](https://yesno.wtf/#api)

**What do you think this does?**

In [None]:
# This gives direct access to the back end of the site, giving us the answer ("yes" or "no") and the site where the gif comes from.


**Can anyone identify the data format that's being used?**

In [None]:
# JSON

And since Python has lots of ways of dealing with JSON, we can use this API in a Very Important Decision-Making Program:

In [4]:
import requests # remember this from last class

url = "https://yesno.wtf/api/" # note the url is slightly different than the human-facing URL
response = requests.get(url)
data_dict = response.json() # this turns the json into a dictionary that can be accessed by key/value pairs

print("This is what 'data_dict' looks like as a string: \n" + str(data_dict))

This is what 'data_dict' looks like as a string: 
{'answer': 'no', 'forced': False, 'image': 'https://yesno.wtf/assets/no/22-8806dbccb1edf544723b7f095ff722e8.gif'}


But because our `data_dict` object is, as its name suggests, a dictionary, and because the API documentation (and what's printed above) tells us the name of each key, we can also access the value stored under each of these keys as follows:

In [10]:
print("Key: answer. Value: " + data_dict['answer'])
print("Key: forced. Value: " + data_dict['forced'])
print("Key: image. Value: " + data_dict['image'])

Key: answer. Value: no


TypeError: can only concatenate str (not "bool") to str

**This breaks halfway through... why?**

In [None]:
# False is a boolean not a string

Let's try again!

In [12]:
print("Key: answer. Value: " + data_dict['answer'])
print("Key: forced. Value: " + str(data_dict['forced']))
print("Key: image. Value: " + data_dict['image'])

Key: answer. Value: no
Key: forced. Value: False
Key: image. Value: https://yesno.wtf/assets/no/22-8806dbccb1edf544723b7f095ff722e8.gif


**NOTE:** For those of you wanting more information about / practice with dictionaries, I recommend working through [this notebook](dictionaries-sets-tuples.ipynb).
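Going back to the `TypeError` for a moment: wrapping the value in `str()` is one fix, and an f-string is another, since it converts each value to text for you. Here is a minimal sketch of that idea. The `example_dict` below is a hand-typed stand-in for part of the API response (an assumption, so that the cell runs without a network connection).

```python
# A hand-typed stand-in for part of the Yes or No response (not fetched live)
example_dict = {"answer": "no", "forced": False}

# f-strings convert each value to text automatically, so the boolean
# False no longer causes a TypeError
lines = [f"Key: {key}. Value: {value}" for key, value in example_dict.items()]

for line in lines:
    print(line)
# Key: answer. Value: no
# Key: forced. Value: False
```

Looping over `.items()` also means we don't have to type out each key by hand.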

But for now, let's get back to our Very Important Decision-Making Program:

In [16]:
# remember that we've already imported the requests library and assigned the "url" variable up above
# so we can just do this as many times as we want 

response = requests.get(url) 
data_dict = response.json() 
print("Answer to your very important question: " + data_dict['answer'] + "!")

Answer to your very important question: no!


So you can (hopefully) already start to see how this might be useful in text analysis projects. Especially given the amount of time that we've already spent learning how to scrape the web.

But how does it work?

Let's slow it down and look at all of the parts. Here again is the code for our Very Important Decision-Making Program:

In [17]:
url = "https://yesno.wtf/api/"
response = requests.get(url)
data = response.json() # this turns the json into a dictionary that can be accessed by key/value pairs
data

{'answer': 'yes',
 'forced': False,
 'image': 'https://yesno.wtf/assets/yes/2-5df1b403f2654fa77559af1bf2332d7a.gif'}

Notice that it starts with a URL, just like a regular website. 

And you can [go to the same URL in your web browser](https://yesno.wtf/api/) and see a version of the same thing.

So now let's talk a bit more about URLs.

### URLs

A URL ("uniform resource locator") uniquely identifies a document on the web, and provides instructions for how to access it. It's the thing you type into your web browser's address bar. It's what you cut-and-paste when you want to e-mail an article to a friend. In fact, most of what we do on the web---whether we're using a web browser or writing a program that accesses the web---boils down to manipulating URLs.

It's important to understand the structure of URLs so we can take them apart and put them back together (both in our heads and programmatically). URLs have a conventional structure that is specified in Internet standards documentation, and many web APIs assume knowledge of this structure. So let's take the following URL:

    http://www.example.com/foo/bar?arg1=baz&arg2=quux
    
... and break it down into parts, so we have a common vocabulary.

| Part | Name |
|------|------|
| `http` | scheme |
| `www.example.com` | host |
| `/foo/bar` | path |
| `?arg1=baz&arg2=quux` | query string |

All of these parts are required, except for the query string, which is optional. Explanations:

* The *scheme* determines what *protocol* will be used to access this resource. For our purposes, this will almost always be `http` (HyperText Transfer Protocol) or `https` (HTTP, but over an encrypted connection).
* The *host* specifies which server on the Internet we're going to talk to in order to retrieve the document we want.
* The *path* names a resource on the server, often using slashes (`/`) to represent hierarchical relationships between resources. (Sometimes this corresponds to actual files on the server, but just as often it does not.)
* The *query string* is a means to tell the server *how* we want the document delivered. (More examples of this soon.)

Most of the work you'll do in learning how to use a web API is learning how to construct and manipulate URLs. 
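To make "taking URLs apart and putting them back together" concrete, here is a small sketch using Python's built-in `urllib.parse` module on the example URL from above. The replacement value `"a new value"` is made up purely for illustration. Note that Python calls the host part `netloc`, and that it reports the query string without the leading `?`.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

url = "http://www.example.com/foo/bar?arg1=baz&arg2=quux"

# Take the URL apart into its named pieces
parts = urlsplit(url)
print(parts.scheme)  # http
print(parts.netloc)  # www.example.com
print(parts.path)    # /foo/bar
print(parts.query)   # arg1=baz&arg2=quux

# Turn the query string into a dictionary (each value is a list,
# because a key is allowed to appear more than once)
query = parse_qs(parts.query)
print(query)  # {'arg1': ['baz'], 'arg2': ['quux']}

# Put a URL back together with a different query string
new_query = urlencode({"arg1": "baz", "arg2": "a new value"})
new_url = urlunsplit((parts.scheme, parts.netloc, parts.path, new_query, ""))
print(new_url)  # http://www.example.com/foo/bar?arg1=baz&arg2=a+new+value
```

`urlencode` takes care of characters like spaces that can't appear in a URL as-is, which is one reason to prefer it over gluing strings together by hand.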

### HTML, JSON and web APIs

The most common format for documents on the web is HTML (HyperText Markup Language). HTML was specifically designed to be a tool for creating web pages, and it excels at that. But it's not so great for describing structured data. 

But another popular format---and the format we'll be learning how to work with today---is JSON (JavaScript Object Notation). Like HTML, JSON is a format for exchanging structured data between two computer programs. Unlike HTML, JSON is primarily intended to communicate content, rather than layout.

You saw that already when we accessed the responses to the "Yes or No" API via our web browser just a minute ago. 
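If you're curious what that conversion from JSON to Python looks like on its own, here is a rough sketch using the built-in `json` module. The `json_text` string is typed in by hand to resemble the Yes or No response we saw earlier (an assumption about the raw text, not a live request).

```python
import json

# JSON as it travels over the web: just a string of text.
# Note the lowercase `false`, which is JSON's spelling of the boolean.
json_text = '{"answer": "no", "forced": false, "image": "https://yesno.wtf/assets/no/22-8806dbccb1edf544723b7f095ff722e8.gif"}'

# json.loads ("load string") parses that text into Python objects
parsed = json.loads(json_text)

print(type(parsed))                              # <class 'dict'>
print(parsed["answer"])                          # no
print(parsed["forced"], type(parsed["forced"]))  # False <class 'bool'>
```

This is also why `data_dict['forced']` came back as a boolean rather than a string: JSON's `false` becomes Python's `False`.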

Roughly speaking, whenever a web site exposes a URL for human readers, the document at that URL is in HTML. Whenever a web site exposes a URL for programmatic use, the document at that URL is in JSON. (There are other formats commonly used for computer-readable documents, like XML. But let's keep it simple for now.) To review:

Yes or No has both a human-readable version of its page found at this URL, written in HTML:

> https://yesno.wtf/

and a version of the same content designed to be easily readable by computers. This is the URL, and it returns a document in JSON format, as we saw:

> https://yesno.wtf/api/

Every web site makes available a number of URLs that return human-readable documents; many web sites (like Yes or No, or Twitter, or Genius, which we'll get to next) also make available URLs that return documents intended to be read by computer programs. Often---as is the case with Facebook---these are just two views into the same data.

So, another way to define a web API is as follows: 

**API: A set of URLs, and rules for manipulating URLs, that a website makes available and that are intended to be read by computer programs.** 

(API stands for "application programming interface"; a "web API" is an interface that enables you to program applications that use the web site's data.)

### A Note about API Keys

Very often, when playing around with APIs via your web browser, you will see this message (or one like it):

    {"message": "unauthorized", "type": "error"}

This message results from the fact that most web APIs (unlike most web pages) require some kind of *authentication*. "Authentication" here means some kind of information that associates the request with an individual. In many APIs, this takes the form of a "token" or "key" (also called a "client ID" and/or "secret")---most usually an extra parameter that you pass on the end of the URL (or in an HTTP header) that identifies the request as having come from a unique user. Some services (like Facebook) provide a subset of functionality to non-authenticated ("anonymous") requests; others require authentication for all requests.

So how do you get "keys" or "tokens"? There's usually some kind of sign-up form in or near the developer documentation for the service in question. The form may ask you for a description of your application; it's usually safe to leave this blank, or to put in some placeholder text. Only rarely is this text reviewed by an actual human being; your key is usually issued automatically.

Different services have different requirements regarding how to include your keys in your request; you'll have to consult the documentation to know for sure.
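As a rough illustration of the two most common styles, here is a sketch that builds (but never sends) a request each way. Everything specific in it is made up for the example: the `api.example.com` URL, the `api_key` parameter name, and the `Bearer` header format are assumptions, and a real service's documentation will tell you which names it actually expects.

```python
from urllib.parse import urlencode
from urllib.request import Request

base_url = "https://api.example.com/v1/items"  # hypothetical API address
api_key = "YOUR_KEY_HERE"                      # placeholder, not a real key

# Style 1: the key rides along as a parameter in the query string
url_with_key = base_url + "?" + urlencode({"api_key": api_key})
print(url_with_key)
# https://api.example.com/v1/items?api_key=YOUR_KEY_HERE

# Style 2: the key goes in an HTTP header instead of the URL.
# Creating a Request object only describes the request; nothing is sent.
request_with_header = Request(base_url, headers={"Authorization": "Bearer " + api_key})
print(request_with_header.get_header("Authorization"))
# Bearer YOUR_KEY_HERE
```

One practical difference: a key in the URL shows up anywhere the URL is copied or logged, so be careful about sharing notebooks that contain your own keys.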

The API we'll be using for the next exercise, Wordnik, requires a key. But we can use a pre-existing one which you'll see below. (When we get to the exercise after that, using the Genius API, you'll need to sign up for your own).  

### Making API requests with `requests`

All you need to make (most) web API requests is a web browser. But it would be tedious to do these requests in a web browser and then copy over the responses into Python for analysis. We may also want to make *many* requests to a web API (for example, to look up the definitions of *every* word in a long word list), which is inconvenient to do "by hand" with a web browser. 

Ideally, there would be some way to make web requests *directly inside a Python program* and in fact, we've already used it: the [requests](http://docs.python-requests.org/en/master/) package.

Here's an example of how to use `requests` to get the contents of a document intended to be read by computers. In this case, the request is for the definition of the word "API":

In [19]:
import requests

api_key = "a80a5131f7620c32a8919063dce09d01b6239543e3d0063bf"
url = "http://api.wordnik.com:80/v4/word.json/API/definitions?api_key=" + api_key # notice the string concatenation
response = requests.get(url)
data = response.json()
data

[{'id': 'A5382700-1',
  'partOfSpeech': 'abbreviation',
  'attributionText': 'from The American Heritage® Dictionary of the English Language, 5th Edition.',
  'sourceDictionary': 'ahd-5',
  'text': 'application programming interface',
  'sequence': '1',
  'score': 0,
  'labels': [],
  'citations': [],
  'word': 'API',
  'relatedWords': [],
  'exampleUses': [],
  'textProns': [],
  'notes': [],
  'attributionUrl': 'https://ahdictionary.com/',
  'wordnikUrl': 'https://www.wordnik.com/words/API'},
 {'id': 'A5382700-2',
  'partOfSpeech': 'abbreviation',
  'attributionText': 'from The American Heritage® Dictionary of the English Language, 5th Edition.',
  'sourceDictionary': 'ahd-5',
  'text': 'Asian and Pacific Islander',
  'sequence': '2',
  'score': 0,
  'labels': [],
  'citations': [],
  'word': 'API',
  'relatedWords': [],
  'exampleUses': [],
  'textProns': [],
  'notes': [],
  'attributionUrl': 'https://ahdictionary.com/',
  'wordnikUrl': 'https://www.wordnik.com/words/API'},
 {'part

Oh hey wow look at all the deets it gives us about the word "API"--that's pretty rad! We'll break down *how* exactly to know what the URL for a particular resource is a bit later (and how to add the API key to the request). But for now, let's just note a few key features.
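As a small preview, here is a sketch of how the URL above could be generated for any word. It assumes that the word simply takes the place of `API` in the path, which is a guess based on the one URL we've seen; the `definitions_url` function and the placeholder key are made up for illustration.

```python
from urllib.parse import quote

def definitions_url(word, api_key):
    """Build a Wordnik-style definitions URL for one word (assumed pattern)."""
    # quote() escapes characters, such as spaces, that can't appear in a URL path
    return ("http://api.wordnik.com:80/v4/word.json/"
            + quote(word, safe="")
            + "/definitions?api_key=" + api_key)

example_key = "YOUR_KEY_HERE"  # placeholder, not a real key

for word in ["API", "computer", "ice cream"]:
    print(definitions_url(word, example_key))
# http://api.wordnik.com:80/v4/word.json/API/definitions?api_key=YOUR_KEY_HERE
# http://api.wordnik.com:80/v4/word.json/computer/definitions?api_key=YOUR_KEY_HERE
# http://api.wordnik.com:80/v4/word.json/ice%20cream/definitions?api_key=YOUR_KEY_HERE
```

Each of those strings could then be passed to `requests.get()`, one per word.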

As with the Yes or No example above, the following lines are most important:

    response = requests.get(url)
    data = response.json()

The first line calls the `get()` function in the `requests` package, with one parameter, the URL that you want to fetch (which we previously stored in a variable called `url`). 

When this function gets called, the `requests` library makes a network request to the specified URL and retrieves its contents, returning a special kind of value called a "response," which contains information about the response generated by the remote server, along with the content of that response. 

Note that this is the exact same line that we used to request the contents of the NYT page when we were web scraping. 

The next line is different. With the NYT page, we turned the response into text with the line

    html_str = response.text

Here, we are using the `.json()` method, which takes data in the response in JSON format (if present) and converts it to the corresponding Python data structure. (Note that [response objects have many other methods and attributes as well](http://docs.python-requests.org/en/master/api/#requests.Response), but the one we're most interested in right now is `.json()`). 

    data = response.json()
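Requests don't always succeed (think of the "unauthorized" message from earlier), so it can help to check the response before parsing it. Below is one possible sketch. The `fetch_json` helper and the `FakeResponse` class are invented for this example; `raise_for_status()` and `.json()` are real methods on `requests` response objects. The fake response is only there so that the cell runs without a network connection.

```python
def fetch_json(url, get=None):
    """Fetch a URL and return its parsed JSON, raising an error on failure."""
    if get is None:
        import requests  # only needed when making a real request
        get = requests.get
    response = get(url)
    response.raise_for_status()  # raises an error for 4xx/5xx status codes
    return response.json()

class FakeResponse:
    """A hypothetical stand-in that mimics the two methods used above."""
    def __init__(self, payload):
        self.payload = payload
    def raise_for_status(self):
        pass  # pretend the request succeeded
    def json(self):
        return self.payload

# Offline demonstration: swap in a fake "get" function
result = fetch_json("https://yesno.wtf/api/",
                    get=lambda url: FakeResponse({"answer": "yes"}))
print(result["answer"])  # yes
```

In a live notebook you would call `fetch_json(url)` with no second argument, and it would use `requests.get` as usual.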

If you're used to reading Python code, you might observe that the parsed JSON is just a list of dictionaries. That's why the output above starts with a `[` and then contains a series of curly-brace blocks (`{ ... }`). 

This is helpful to know, since we can use standard ways of iterating through and accessing lists and dictionaries. For example, we can use a `for` loop to print out the text of each definition of "API":

In [23]:
for item in data:
    print("Definition: " + item['text'])

Definition: application programming interface
Definition: Asian and Pacific Islander
Definition: <xref>application programming interface</xref>
Definition: active pharmaceutical ingredient


The combination of JSON, dictionaries, and for-loops is the magic stew that will allow us to automate the gathering of text from web APIs, making it often the most convenient method for developing a corpus for performing data science with text.
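As a sketch of where this can go next, the loop below collects the definitions into a list (the beginnings of a corpus) instead of just printing them. The `sample_data` list is a trimmed, hand-typed version of the Wordnik response above, with one made-up entry that has no `'text'` key added to show how to skip incomplete records. The `definition_texts` function is a name invented for this example.

```python
import re

# Trimmed, hand-typed stand-in for the Wordnik response (not fetched live)
sample_data = [
    {"text": "application programming interface"},
    {"text": "Asian and Pacific Islander"},
    {"text": "<xref>application programming interface</xref>"},
    {"text": "active pharmaceutical ingredient"},
    {"word": "API"},  # hypothetical entry with no 'text' key
]

def definition_texts(entries):
    """Return the cleaned definition text from each entry that has one."""
    texts = []
    for entry in entries:
        text = entry.get("text")  # .get() returns None instead of raising KeyError
        if text is None:
            continue
        # Strip markup such as <xref>...</xref>, keeping the words inside
        texts.append(re.sub(r"<[^>]+>", "", text))
    return texts

corpus = definition_texts(sample_data)
print(corpus)
# ['application programming interface', 'Asian and Pacific Islander', 'application programming interface', 'active pharmaceutical ingredient']
```

With real data, you would pass the result of `response.json()` to the function in place of `sample_data`.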

And believe it or not, you now know (almost) everything you need to get started with APIs.

Hurray!