<a href="https://colab.research.google.com/github/lblogan14/web_scraping_with_python/blob/master/ch12_crawling_thru_api.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#A Brief Introduction to APIs
An *API* defines a standardized syntax that allows one piece of software to communicate with another piece of software, even though they might be written in different languages or otherwise structured differently.

##HTTP Methods and APIs
There are four main ways to request information from a web server via HTTP:
* `GET` is used to visit a website through the address bar in the browser. A `GET` request makes no changes to the information in the server's database. Nothing is stored; nothing is modified. Information is only read.
* `POST` is used to fill out a form or submit information, presumably to a backend script on the sever. Including log into a website with username and password.
* `PUT` is used to update an object or information. An API might require a `POST` request to create a new user, for example, but it might need a `PUT` request if you want to update that user’s email address
* `DELETE` is used to delete an object.

The data in the body from the web server is typically formatted as JSON or, less commonly, as XML.

If using an API that creates comments on blog posts, the user might make a `PUT` request to


```
http://example.com/comments?post=123
```
with the following request body:

```
{"title": "Great post about APIs!", "body": "Very informative. Really helped me out with a tricky technical challenge I was facing. Thanks for taking the time to write such a detailed blog post about PUT requests!", "author": {"name": "Ryan Mitchell", "website": "http://pythonscraping.com", "company": "O'Reilly Media"}}
```

Note that the ID of the blog post (`123`) is passed as a parameter in the URL, where the content for the new comment is passed in the body of the request.

#Parsing JSON
The example uses the *freegeoip.net* here


```
http://freegeoip.net/json/50.78.253.58
```
Take the output of this request and use Python's JSON-parsing functions to decode it:

In [0]:
import json
from urllib.request import urlopen

In [0]:
def getCountry(ipAddress):
  response = urlopen('http://freegeoip.net/json/'+ipAddress).read().decode('utf-8')
  responseJSON = json.loads(response)
  return responseJSON.get('country_code')

In [0]:
print(getCountry('50.78.253.58'))

The following gives a quick demonstration of how Python's JSON library handles the values that might be encountered in a JSON string:

In [0]:
jsonString = '{"arrayOfNums":[{"number":0}, {"number":1}, {"number":2}], \
               "arrayOfFruits":[{"fruit":"apple"}, {"fruit":"banana"}, {"fruit":"pear"}]}'
jsonObj = json.loads(jsonString)

In [6]:
print(jsonObj.get('arrayOfNums'))

[{'number': 0}, {'number': 1}, {'number': 2}]


In [7]:
print(jsonObj.get('arrayOfNums')[1])

{'number': 1}


In [8]:
print(jsonObj.get('arrayOfNums')[1].get('number') + 
      jsonObj.get('arrayOfNums')[2].get('number'))

3


In [9]:
print(jsonObj.get('arrayOfFruits')[2].get('fruit'))

pear


#Undocumented APIs
As JavaScript frameworks became more ubiquitous, many of the HTML creation
tasks handled by the server moved into the browser. Because the entire content management system (that used to reside only in the web server) had essentially moved to the browser client, even the simplest websites could balloon into several megabytes of content and a dozen HTTP requests.

Because servers are no longer formatting the data into HTML, they often act as thin
wrappers around the database itself. This thin wrapper simply extracts data from the
database, and returns it to the page via an API.

For example, the *New York Times* website loads all of its search results via JSON. If visiting the link (`https://query.nytimes.com/search/sitesearch/#/python`), this will reveal recent news articles for the search term "python". But if this page is scraped using `urllib` or the `requests` library, there will not be any  search results. These are loaded separately via an API call:

```
https://query.nytimes.com/svc/add/v1/sitesearch.json?q=python&spotlight=true&facet=true
```
If this page is loaded with Selenium, this makes about 100 requests and transfers 600-700 kB of data with each search. Using the API directly, only one request is made and transferred aproximately only 60 kb of data needed.


##Finding Undocumented APIs
Use the Chrome inspector to examine the requests and responses of the calls that are used to construct that page.

API calls tend to have several features that are useful for locating them in the list of
network calls:
* They often have JSON or XML in them. You can filter the list of requests by using
the search/filter field.
* With `GET` requests, the URL will contain the parameter values passed to them.
This will be useful if, for example, you’re looking for an API call that returns the
results of a search or is loading data for a specific page. Simply filter the results
with the search term you used, page ID, or other identifying information.
* They will usually be of the type XHR.

##Documenting Undocumented APIs
Every API call can be identified and documented by paying attention to the following
fields:
* HTTP method used
* Inputs
  * Path parameters
  * Headers (including cookies)
  * Body content (for PUT and POST calls)
* Outputs
  * Response headers (including cookies set)
  * Response body type
  * Response body fields

##Finding and Documenting APIs Automatically
The author created a GitHub repository at https://github.com/REMitchell/apiscraper that attempts to finish this task. Find the details about this project on this page.

#Combing APIs with Other Data Sources
Use the following code to create a basi script tha crawls Wikipedia, look for revision history pages, and then look for IP addresses on those revision history pages:

In [0]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json
import datetime
import random
import re

In [0]:
random.seed(datetime.datetime.now())
def getLinks(articleURL):
  html = urlopen('http://en.wikipedia.org{}'.format(articleURL))
  bs = BeautifulSoup(html, 'html.parser')
  return bs.find('div', {'id':'bodyContent'}).find_all('a', href=re.compile('^(/wiki/)((?!:).)*$'))

In [0]:
def getHistoryIPs(pageURL):
  #Format of revision history pages is:
  #http://en.wikipedia.org/w/index.php?title=Title_in_URL&action=history
  pageURL = pageURL.replace('/wiki/', '')
  historyURL = 'http://en.wikipedia.org/w/index.php?title={}&action=history'.format(pageURL)
  print('history URL is: {}'.format(historyURL))
  html = urlopen(historyURL)
  bs = BeautifulSoup(html, 'html.parser')
  #Finds only the links with class "mw-anonuserlink" which has IP addresses
  #instead of usernames
  ipAddresses = bs.find_all('a', {'class': 'mw-anonuserlink'})
  addressList = set()
  
  for ipAddress in ipAddresses:
    addressList.add(ipAddress.get_text())
  return addressList

In [0]:
links = getLinks('/wiki/Python_(programming_language)')

In [24]:
while len(links) > 0:
  for link in links:
    print('-'* 20)
    historyIPs = getHistoryIPs(link.attrs['href'])
    for historyIP in historyIPs:
      print(historyIP)
      
  newLink = links[random.randint(0, len(links)-1)].attrs['href']
  links = getLinks(newLink)

--------------------
history URL is: http://en.wikipedia.org/w/index.php?title=Python_(disambiguation)&action=history
--------------------
history URL is: http://en.wikipedia.org/w/index.php?title=Programming_paradigm&action=history
14.139.174.51
2601:14b:4301:19c3:9107:8bad:dbf7:5803
174.254.128.149
213.133.47.254
154.149.30.36
117.221.183.123
49.197.5.59
129.7.106.20
218.17.157.55
31.223.170.65
192.159.69.162
2605:6000:ec0f:c800:edfd:179f:b648:b4b9
2405:204:6694:d402::2964:90a1
66.87.149.174
--------------------
history URL is: http://en.wikipedia.org/w/index.php?title=Multi-paradigm_programming_language&action=history
75.139.254.117
98.197.198.46
--------------------
history URL is: http://en.wikipedia.org/w/index.php?title=Functional_programming&action=history
45.234.78.9
108.28.15.54
144.51.242.26
2a01:cb1d:8188:8e00:38cf:9550:10ea:c58
194.204.198.199
47.49.35.242
91.11.237.94
140.177.205.223
80.136.234.124
178.168.76.228
2600:1009:b112:79c8:75d8:597d:23fc:a3a4
-------------------

KeyboardInterrupt: ignored

The two main functions, `getLinks` and `getHistoryIPs` search for the contents of all links with the class `mw-anonuserlink` and returns itas a set.

Once the IP addresses were retrieved as a string, the next objective is to combine this with the `getCountry` in order to resolve these IP addresses to countries. Before executing, modify `getCountry` in order to account for invalid or malformed IP addresses that will result in a 404 Not Found error:

In [0]:
def getCountry(ipAddress):
  try:
    response = urlopen('http://freegeoip.net/json/{}'.format(ipAddress)).read().decode('utf-8')
  except HTTPError:
    return None
  responseJSON = json.loads(response)
  return responseJSON.get('country_code')

In [0]:
links = getLinks('/wiki/Python_(programming_language)')

In [27]:
while len(links) > 0:
  for link in links:
    print('-' * 20)
    historyIPs = getHistoryIPs(link.attrs['href'])
    for historyIP in historyIPs:
      country = getCountry(historyIP)
      if country is not None:
        print('{} is from {}'.format(historyIP, country))
        
  newLink = links[random.randint(0, len(links)-1)].attrs['href']
  links = getLinks(newLink)

--------------------
history URL is: http://en.wikipedia.org/w/index.php?title=Python_(disambiguation)&action=history
--------------------
history URL is: http://en.wikipedia.org/w/index.php?title=Programming_paradigm&action=history


NameError: ignored