# Using APIs

- Like many programmers who have worked on large projects, I have my share of horror stories when it comes to working with other people's code. From namespace issues to type issues to misunderstandings of function output, simply trying to get information from point A to method B can be a nightmare.

- This is where application programming interfaces come in handy: they provide nice, convenient interfaces between multiple disparate applications. It doesn't matter if the applications are written by different programmers, with different architectures, or even in different languages; APIs are designed to serve as a lingua franca between different pieces of software that need to share information with each other.

- Although various APIs exist for a variety of different software applications, in recent times "API" has been commonly understood as meaning "web application API." Typically, a programmer will make a request to an API via HTTP for some type of data, and the API will return this data in the form of XML or JSON. Although most APIs still support XML, JSON is quickly becoming the encoding format of choice.

- If taking advantage of a ready-to-use program to get information prepackaged in a useful format seems like a bit of a departure from the rest of this book, well, it is and it isn't. Although using APIs isn't generally considered web scraping by most people, both practices use many of the same techniques (sending HTTP requests) and produce similar results (getting information); they can often be very complementary to each other.

- For instance, you might want to combine information gleaned from a web scraper with information from a published API in order to make the information more useful to you. In an example later in this chapter, we will look at combining Wikipedia edit histories (which contain IP addresses) with an IP address resolver API in order to get the geographic location of Wikipedia edits around the world.

- In this chapter, we'll offer a general overview of APIs and how they work, look at a few popular APIs available today, and look at how you might use an API in your own web scrapers.

## How APIs work

- Although APIs are not nearly as ubiquitous as they should be (a large motivation for writing this book, because if you can't find an API, you can still get the data through scraping), you can find APIs for many types of information. Interested in music? There are a few different APIs that give you songs, artists, albums, and even information about musical styles and related artists. Need sports data? ESPN provides APIs for athlete information, game scores, and more. Google has dozens of APIs in its Developer section for language translations, analytics, geolocation, and more.

## Common Conventions 

- Unlike the subjects of most web scraping, APIs follow an extremely standardized set of rules to produce information, and they produce that information in an extremely standardized way as well. Because of this, it is easy to learn a few simple ground rules that will help you to quickly get up and running with any given API, as long as it's fairly well written.

## Methods

- There are four ways to request information from a web server via HTTP:

- GET
- POST
- PUT
- DELETE

- GET is what you use when you visit a website through the address bar in your browser. GET is the method you are using when you make a call to a URL. You can think of GET as saying, "Hey, web server, please get me this information."

- POST is what you use when you fill out a form, or submit information, presumably to a backend script on the server. Every time you log into a website, you are making a POST request with your username and (hopefully) encrypted password. If you are making a POST request with an API, you are saying, "Please store this information in your database."

- PUT is less commonly used when interacting with websites, but is used from time to time in APIs. A PUT request is used to update an object or information. An API might require a POST request to create a new user, for example, but it might need a PUT request if you want to update that user's email address.

- DELETE is straightforward; it is used to delete an object. For instance, if I send a DELETE request to http://myapi.com/user/23, it will delete the user with the ID 23. DELETE methods are not often encountered in public APIs, which are primarily created to disseminate information rather than allow random users to remove that information from their databases. However, like the PUT method, it's a good one to know about.

- Although a handful of other HTTP methods are defined under the specifications for HTTP, these four constitute the entirety of what is used in just about any API you will ever encounter.
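
As a minimal sketch of how these four methods look from Python, the standard library's `urllib.request.Request` accepts a `method` argument. The `http://myapi.com/user` endpoints and the payload fields below are hypothetical, reusing the example URL above, and the sketch assumes the API accepts JSON request bodies. Nothing is sent over the network; the requests are only constructed:

```python
import json
from urllib.request import Request

BASE = "http://myapi.com/user"  # hypothetical endpoint, for illustration only

def build_request(method, url, payload=None):
    """Build (but do not send) an HTTP request with the given method."""
    data = None
    headers = {}
    if payload is not None:
        # Request bodies must be bytes; we assume the API accepts JSON
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(url, data=data, headers=headers, method=method)

requests_to_send = [
    build_request("GET", BASE + "/23"),                                 # read user 23
    build_request("POST", BASE, {"name": "Ryan"}),                      # create a user
    build_request("PUT", BASE + "/23", {"email": "ryan@example.com"}),  # update user 23
    build_request("DELETE", BASE + "/23"),                              # delete user 23
]

for req in requests_to_send:
    print(req.get_method(), req.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would actually send the request.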

## Authentication

- Although some APIs do not use any authentication to operate (meaning anyone can make an API call for free, without registering with the application first), many modern APIs require some type of authentication before they can be used.

- Some APIs require authentication in order to charge money per API call, or they might offer their service on some sort of a monthly subscription basis. Others authenticate in order to "rate limit" users (restrict them to a certain number of calls per second, hour, or day), or to restrict the access of certain kinds of information or types of API calls for some users. Other APIs might not place restrictions, but they might want to keep track of which users are making which calls for marketing purposes.

- All methods of API authentication generally revolve around the use of a token of some sort, which is passed to the web server with each API call made. This token is either provided to the user when the user registers and is a permanent fixture of the user's calls (generally in lower-security applications), or it can frequently change, and is retrieved from the server using a username and password combination. 

- For example, to make a call to The Echo Nest API in order to retrieve a list of songs by the band Guns N' Roses, we would use:

- This provides the server with the api_key value that was issued to me on registration, allowing the server to identify the requester as Ryan Mitchell and to return the JSON data.

- In addition to passing tokens in the URL of the request itself, tokens might also be passed to the server via a cookie in the request header. 

```python
import urllib.request

token = "<your api key>"
# Send the token in a request header rather than in the URL
webRequest = urllib.request.Request("http://myapi.com", headers={"token": token})
html = urllib.request.urlopen(webRequest)
```
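
The two approaches can be sketched side by side. The `http://myapi.com/songs` endpoint, the `artist` parameter, and the `token` header name are assumptions for illustration; a real API documents its own names. The requests are only built here, not sent:

```python
from urllib.parse import urlencode
from urllib.request import Request

token = "<your api key>"
endpoint = "http://myapi.com/songs"  # hypothetical endpoint

# Option 1: pass the token as a query-string parameter in the URL.
# urlencode escapes spaces and punctuation so the URL stays valid.
query = urlencode({"api_key": token, "artist": "Guns N' Roses"})
url_request = Request(endpoint + "?" + query)

# Option 2: pass the token in a request header instead
header_request = Request(endpoint, headers={"token": token})

print(url_request.full_url)
print(header_request.header_items())
```

Keeping the token out of the URL means it is less likely to end up in browser histories or server access logs, which is one reason an API might prefer the header form.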

## Responses

- As you saw in the FreeGeoIP example at the beginning of the chapter, an important feature of APIs is that they have well-formatted responses. The most common types of response formatting are eXtensible Markup Language (XML) and JavaScript Object Notation (JSON).

- In recent years, JSON has become vastly more popular than XML for a couple of major reasons. First, JSON files are generally smaller than well-designed XML files. Compare, for example, the XML data:

```xml
<user><firstname>Ryan</firstname><lastname>Mitchell</lastname><username>Kludgist</username></user>
```

which clocks in at 98 characters, and the same data in JSON:

```json
{"user":{"firstname":"Ryan","lastname":"Mitchell","username":"Kludgist"}}
```


which is only 73 characters, or about 26% smaller than the equivalent XML.

Of course, one could argue that the XML could be formatted like this:

```xml
<user firstname="Ryan" lastname="Mitchell" username="Kludgist"></user>
```


but this is considered bad practice because it doesn't support deep nesting of data. Regardless, it still requires 70 characters, about the same length as the equivalent JSON.

Another reason JSON is quickly becoming more popular than XML is a shift in web technologies. In the past, it was more common for a server-side technology such as PHP or .NET to be on the receiving end of an API. Nowadays, it is likely that a JavaScript framework, such as Angular or Backbone, will be sending and receiving API calls. Server-side technologies are somewhat agnostic as to the form in which their data comes, but JavaScript libraries like Backbone find JSON easier to handle.
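
To make the comparison concrete, here is a small sketch using only Python's standard library that measures the two encodings of the user record above and parses each one. The variable names are illustrative:

```python
import json
import xml.etree.ElementTree as ET

xml_data = ("<user><firstname>Ryan</firstname><lastname>Mitchell</lastname>"
            "<username>Kludgist</username></user>")
json_data = ('{"user":{"firstname":"Ryan","lastname":"Mitchell",'
             '"username":"Kludgist"}}')

# Compare the raw sizes of the two encodings
print(len(xml_data), len(json_data))

# JSON maps directly onto Python dictionaries
user_from_json = json.loads(json_data)["user"]

# XML has to be walked element by element to build the same structure
root = ET.fromstring(xml_data)
user_from_xml = {child.tag: child.text for child in root}

# Both encodings carry the same record
print(user_from_json == user_from_xml)
```

The JSON version needs a single call to produce a native data structure, while the XML version needs an explicit traversal.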

## Echo Nest

The Echo Nest is a fantastic example of a company that is built on web scrapers. Although some music-based companies, such as Pandora, depend on human intervention to categorize and annotate music, The Echo Nest relies on automated intelligence and information scraped from blogs and news articles in order to categorize musical artists, songs, and albums.

## A Few Examples

The Echo Nest API is built around several basic content types: artists, songs, tracks, and genres. Except for genres, these content types all have unique IDs, which are used to retrieve information about them in various forms, through API calls. For example, if I wanted to retrieve a list of songs performed by Monty Python, I would make the following call to retrieve their ID:

## Twitter

## Google API

## Parsing JSON

## Bringing It All Back Home

Although the raison d'être of many modern web applications is to take existing data and format it in a more appealing way, I would argue that this isn't a very interesting thing to do in most instances. If you're using an API as your only data source, the best you can do is merely copy someone else's database that already exists, and which is, essentially, already published. What can be far more interesting is to take two or more data sources and combine them in a novel way, or use an API as a tool to look at scraped data from a new perspective.

If you've spent much time on Wikipedia, you've likely come across an article's revision history page, which displays a list of recent edits. 

The IP address outlined on the history page belongs to an anonymous editor. According to the freegeoip.net API, as of this writing that IP address is from Quezon, Philippines.

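```python
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

# Seed with a timestamp; newer Python versions reject a datetime object here
random.seed(datetime.datetime.now().timestamp())

def getLinks(articleUrl):
    """Return the internal article links found in a Wikipedia article body."""
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html, "lxml")
    # Match hrefs that start with /wiki/ and contain no colon
    return bsObj.find("div", {"id":"bodyContent"}).find_all("a",
                                    href=re.compile("^(/wiki/)((?!:).)*$"))

def getHistoryIPs(pageUrl):
    """Return the set of anonymous editors' IP addresses for an article."""
    # Format of revision history pages is:
    # http://en.wikipedia.org/w/index.php?title=<title>&action=history
    pageUrl = pageUrl.replace("/wiki/", "")
    historyUrl = ("http://en.wikipedia.org/w/index.php?title="
                  +pageUrl+"&action=history")
    print("history url is : "+historyUrl)
    html = urlopen(historyUrl)
    bsObj = BeautifulSoup(html, "lxml")
    # Anonymous edits are shown as links with the class "mw-anonuserlink"
    ipAddresses = bsObj.find_all("a", {"class":"mw-anonuserlink"})
    addressList = set()
    for ipAddress in ipAddresses:
        addressList.add(ipAddress.get_text())
    return addressList

links = getLinks("/wiki/Python_(programming_language)")

while(len(links) > 0):
    for link in links:
        historyIPs = getHistoryIPs(link.attrs["href"])
        for historyIP in historyIPs:
            print(historyIP)

    # Pick a random link from the current page and continue from there
    newLink = links[random.randint(0, len(links)-1)].attrs["href"]
    links = getLinks(newLink)
```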

Now that we have code that retrieves IP addresses as strings, we can combine this with the getCountry function from the previous section in order to resolve these IP addresses to countries. I modified getCountry slightly, in order to account for invalid or malformed IP addresses that will result in a "404 Not Found" error:

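```python
import json
from urllib.error import HTTPError

# Reuses urlopen, random, getLinks, and getHistoryIPs from the previous listing

def getCountry(ipAddress):
    """Resolve an IP address to a country code, or None if the lookup fails."""
    try:
        response = urlopen("http://freegeoip.net/json/"
                           +ipAddress).read().decode('utf-8')
    except HTTPError:
        return None
    responseJson = json.loads(response)
    return responseJson.get("country_code")

links = getLinks("/wiki/Python_(programming_language)")

while(len(links) > 0):
    for link in links:
        print("--------------")
        historyIPs = getHistoryIPs(link.attrs['href'])
        for historyIP in historyIPs:
            # Look up each address; skip those the API could not resolve
            country = getCountry(historyIP)
            if country is not None:
                print(historyIP + " is from " + country)

    newLink = links[random.randint(0, len(links)-1)].attrs['href']
    links = getLinks(newLink)
```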
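
One possible refinement, sketched here under stated assumptions, is to filter out malformed addresses locally before calling the API and to tally the results per country. The helper names `isValidIp` and `countCountries` are hypothetical, and the `lookup` argument stands in for a function such as `getCountry`. A stub with made-up data is used below so the sketch runs without network access:

```python
import ipaddress
from collections import Counter

def isValidIp(text):
    """Return True if text parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False

def countCountries(addresses, lookup):
    """Tally country codes for the valid addresses, skipping failed lookups."""
    counts = Counter()
    for address in addresses:
        if not isValidIp(address):
            continue  # malformed address: no point in calling the API
        country = lookup(address)
        if country is not None:
            counts[country] += 1
    return counts

# Stub lookup with made-up data, standing in for a real API call
fakeLookup = {"192.0.2.1": "US", "2001:db8::1": "PH"}.get

sample = ["192.0.2.1", "2001:db8::1", "not-an-ip", "198.51.100.7"]
print(countCountries(sample, fakeLookup))
```

Passing the lookup function in as an argument keeps the counting logic testable on its own and makes it easy to swap in a different geolocation service.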

## More About APIs

In this chapter, we've looked at a few ways that modern APIs are commonly used to access data on the Web, in particular uses of APIs that you might find useful in web scraping.

- RESTful Web APIs
- Designing APIs for the Web