## APIs

Let's start by looking at [OMDb API](https://www.omdbapi.com/).

The OMDb API is a free web service for obtaining movie information. All content and images on the site are contributed and maintained by its users.

The Python package [urllib](https://docs.python.org/3/howto/urllib2.html) can be used to fetch resources from the internet.

OMDb tells us what kinds of requests we can make. We are going to do a title search. As you can see below, we have an additional parameter "&Season=1" which does not appear in the parameter tables. If you read through the change log, you will see it documented there. 
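As a rough sketch, the query string for a title search can be assembled from a dictionary instead of being typed by hand. The helper name `build_omdb_url` is made up for illustration; only the `t` and `Season` parameters come from the request used in this notebook.

```python
from urllib.parse import quote, urlencode

OMDB_URL = 'http://www.omdbapi.com/'

def build_omdb_url(title, season):
    """Return a title-search URL for one season of a show (hypothetical helper)."""
    # quote_via=quote encodes spaces as %20 rather than '+'
    query = urlencode({'t': title, 'Season': season}, quote_via=quote)
    return OMDB_URL + '?' + query

print(build_omdb_url('Game of Thrones', 1))
```

Building the URL this way keeps special characters in titles from breaking the request.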

Using the urllib and json packages allows us to call an API and store the results locally.

In [1]:
import json
import urllib.request

In [2]:
url = 'http://www.omdbapi.com/?t=Game%20of%20Thrones&Season=1'
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode('utf8'))

What should we expect the type to be for the variable data?

In [3]:
print(type(data))

<class 'dict'>


What do you think the data will look like?

In [4]:
data

{'Episodes': [{'Episode': '1',
   'Released': '2011-04-17',
   'Title': 'Winter Is Coming',
   'imdbID': 'tt1480055',
   'imdbRating': '8.9'},
  {'Episode': '2',
   'Released': '2011-04-24',
   'Title': 'The Kingsroad',
   'imdbID': 'tt1668746',
   'imdbRating': '8.7'},
  {'Episode': '3',
   'Released': '2011-05-01',
   'Title': 'Lord Snow',
   'imdbID': 'tt1829962',
   'imdbRating': '8.6'},
  {'Episode': '4',
   'Released': '2011-05-08',
   'Title': 'Cripples, Bastards, and Broken Things',
   'imdbID': 'tt1829963',
   'imdbRating': '8.7'},
  {'Episode': '5',
   'Released': '2011-05-15',
   'Title': 'The Wolf and the Lion',
   'imdbID': 'tt1829964',
   'imdbRating': '9.0'},
  {'Episode': '6',
   'Released': '2011-05-22',
   'Title': 'A Golden Crown',
   'imdbID': 'tt1837862',
   'imdbRating': '9.1'},
  {'Episode': '7',
   'Released': '2011-05-29',
   'Title': 'You Win or You Die',
   'imdbID': 'tt1837863',
   'imdbRating': '9.2'},
  {'Episode': '8',
   'Released': '2011-06-05',
   'Tit

We now have a dictionary object of our data. We can use Python to manipulate it in a variety of ways. For example, we can print the title and IMDb rating of each episode.

In [6]:
for episode in data['Episodes']:
  print(episode['Title'], episode['imdbRating'])

Winter Is Coming 8.9
The Kingsroad 8.7
Lord Snow 8.6
Cripples, Bastards, and Broken Things 8.7
The Wolf and the Lion 9.0
A Golden Crown 9.1
You Win or You Die 9.2
The Pointy End 9.0
Baelor 9.5
Fire and Blood 9.4
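Another manipulation worth trying is a small summary of the ratings. The sketch below uses a hand-copied sample of three episodes rather than the live `data` dictionary, so the average it prints is for the sample only. Note that the ratings arrive as strings and need converting before any arithmetic.

```python
# A hand-copied sample with the same shape as data['Episodes']
episodes = [
    {'Episode': '1', 'Title': 'Winter Is Coming', 'imdbRating': '8.9'},
    {'Episode': '2', 'Title': 'The Kingsroad', 'imdbRating': '8.7'},
    {'Episode': '9', 'Title': 'Baelor', 'imdbRating': '9.5'},
]

# Ratings are strings, so convert to float before comparing or averaging
best = max(episodes, key=lambda ep: float(ep['imdbRating']))
average = sum(float(ep['imdbRating']) for ep in episodes) / len(episodes)

print(best['Title'], best['imdbRating'])
print(round(average, 2))
```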


We can use pandas to convert the episode information to a dataframe.

In [7]:
import pandas as pd

df = pd.DataFrame.from_dict(data['Episodes'])

In [8]:
df

| | Episode | Released | Title | imdbID | imdbRating |
|---|---|---|---|---|---|
| 0 | 1 | 2011-04-17 | Winter Is Coming | tt1480055 | 8.9 |
| 1 | 2 | 2011-04-24 | The Kingsroad | tt1668746 | 8.7 |
| 2 | 3 | 2011-05-01 | Lord Snow | tt1829962 | 8.6 |
| 3 | 4 | 2011-05-08 | Cripples, Bastards, and Broken Things | tt1829963 | 8.7 |
| 4 | 5 | 2011-05-15 | The Wolf and the Lion | tt1829964 | 9.0 |
| 5 | 6 | 2011-05-22 | A Golden Crown | tt1837862 | 9.1 |
| 6 | 7 | 2011-05-29 | You Win or You Die | tt1837863 | 9.2 |
| 7 | 8 | 2011-06-05 | The Pointy End | tt1837864 | 9.0 |
| 8 | 9 | 2011-06-12 | Baelor | tt1851398 | 9.5 |
| 9 | 10 | 2011-06-19 | Fire and Blood | tt1851397 | 9.4 |


And, we can save our data locally to use later.

In [9]:
with open('omdb_api_data.json', 'w') as f:
    json.dump(data, f)
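Reading the file back later is the mirror image of saving it. This is a minimal sketch that uses a stand-in dictionary and a temporary directory (both assumptions made so the example does not depend on the API call or overwrite a real file).

```python
import json
import os
import tempfile

# Stand-in for the API response; not the real OMDb payload
sample = {'Title': 'Game of Thrones', 'Season': '1'}

# Write to a temporary directory so the sketch does not touch real files
path = os.path.join(tempfile.mkdtemp(), 'omdb_api_data.json')
with open(path, 'w') as f:
    json.dump(sample, f)

# json.load rebuilds the same dictionary from the saved file
with open(path) as f:
    reloaded = json.load(f)

print(reloaded == sample)
```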

Let's try an API that requires an API key!

"The [Digital Public Library of America](https://dp.la/) brings together the riches of America’s libraries, archives, and museums, and makes them freely available to the world. It strives to contain the full breadth of human expression, from the written word, to works of art and culture, to records of America’s heritage, to the efforts and data of science."

And, they have an [API](https://dp.la/info/developers/codex/api-basics/).

In order to use the API, you need to [request a key](https://dp.la/info/developers/codex/policies/#get-a-key). You can do this with an HTTP POST request.

If you are using OS X or Linux, replace the email address in the cell below with your own and execute the cell. This will send the request to DPLA, and they will email your API key to the address you provided. To successfully query the API, you must include the ?api_key= parameter followed by your 32-character key.


In [10]:
# execute this on OS X or Linux
! curl -v -XPOST http://api.dp.la/v2/api_key/nicole@nicoledonnelly.me

*   Trying 52.21.85.131...
* Connected to api.dp.la (52.21.85.131) port 80 (#0)
> POST /v2/api_key/nicole@nicoledonnelly.me HTTP/1.1
> Host: api.dp.la
> User-Agent: curl/7.43.0
> Accept: */*
> 
< HTTP/1.1 201 Created
< Access-Control-Allow-Origin: *
< Cache-Control: max-age=0, private, must-revalidate
< Content-Type: application/json; charset=utf-8
< Date: Sat, 22 Oct 2016 18:28:47 GMT
< ETag: "8b66d9fe7ded79e3151d5a22f0580d99"
< Server: nginx/1.1.19
< Status: 201 Created
< X-Request-Id: d14dfbdc08191f8e0e74286d2fd375a8
< X-Runtime: 0.195748
< X-UA-Compatible: IE=Edge,chrome=1
< Content-Length: 89
< Connection: keep-alive
< 
* Connection #0 to host api.dp.la left intact
{"message":"API key created and sent via email. Be sure to check your Spam folder, too."}

If you are on Windows 7 or 10, [open PowerShell](http://www.tenforums.com/tutorials/25581-windows-powershell-open-windows-10-a.html). Replace "YOUR_EMAIL@example.com" in the cell below with your email address. Copy the code and paste it at the command prompt in PowerShell. This will send the request to DPLA, and they will email your API key to the address you provided. To successfully query the API, you must include the ?api_key= parameter followed by your 32-character key.

In [None]:
#execute this on Windows
Invoke-WebRequest -Uri ("http://api.dp.la/v2/api_key/YOUR_EMAIL@example.com") -Method POST -Verbose -usebasicparsing

You will get a response similar to what is shown below and will receive an email fairly quickly from DPLA with your key.

    *   Trying 52.2.169.251...
    * Connected to api.dp.la (52.2.169.251) port 80 (#0)
    > POST /v2/api_key/YOUR_EMAIL@example.com HTTP/1.1
    > Host: api.dp.la
    > User-Agent: curl/7.43.0
    > Accept: */*
    > 
    < HTTP/1.1 201 Created
    < Access-Control-Allow-Origin: *
    < Cache-Control: max-age=0, private, must-revalidate
    < Content-Type: application/json; charset=utf-8
    < Date: Thu, 20 Oct 2016 20:53:24 GMT
    < ETag: "8b66d9fe7ded79e3151d5a22f0580d99"
    < Server: nginx/1.1.19
    < Status: 201 Created
    < X-Request-Id: d61618751a376452ac3540b3157dcf48
    < X-Runtime: 0.179920
    < X-UA-Compatible: IE=Edge,chrome=1
    < Content-Length: 89
    < Connection: keep-alive
    < 
    * Connection #0 to host api.dp.la left intact
    {"message":"API key created and sent via email. Be sure to check your Spam folder, too."}

It is good practice not to put your keys in your code. You should store them in a file and read them in from there. If you are pushing your code to GitHub, make sure you put your key files in .gitignore.

I created a file on my drive called "dpla_config_secret.json". The contents of the file look like this:

{
	"api_key" : "my api key here"
}

I can then write code to read the information in.

In [12]:
with open("./api/dpla_config_secret.json") as key_file:
    key = json.load(key_file)

In [13]:
key

{'api_key': '22443a3c90575f968c9be540e42d5e81'}
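One possible variation, sketched here as an assumption rather than something this notebook relies on, is to check an environment variable first and fall back to the JSON file. The function name `load_api_key` and the variable name `DPLA_API_KEY` are both made up for illustration.

```python
import json
import os
import tempfile

def load_api_key(path, env_var='DPLA_API_KEY'):
    """Return the key from an environment variable, else from a JSON file."""
    # env_var is a hypothetical name; choose whatever suits your setup
    key = os.environ.get(env_var)
    if key:
        return key
    with open(path) as key_file:
        return json.load(key_file)['api_key']

# Demo with a throwaway file so no real key is involved
demo_path = os.path.join(tempfile.mkdtemp(), 'dpla_config_secret.json')
with open(demo_path, 'w') as f:
    json.dump({'api_key': 'my api key here'}, f)

print(load_api_key(demo_path))
```

Either way, the key stays out of the notebook itself.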

Then, when I create my API query, I can use a variable in place of my actual key.

The Requests library allows us to build urls with different parameters. You build the parameters as a dictionary that contains key/value pairs for everything after the '?' in your url.

In [14]:
import requests

In [15]:
# we are specifying our url and parameters here as variables
url = 'http://api.dp.la/v2/items/'
params = {'api_key' : key['api_key'], 'q' : 'goats+AND+cats'}

In [16]:
# we are creating a response object, r
r = requests.get(url, params=params)

In [17]:
type(r)

requests.models.Response

In [18]:
# we can look at the url that was created by requests with our specified variables
r.url

'http://api.dp.la/v2/items/?api_key=22443a3c90575f968c9be540e42d5e81&q=goats%2BAND%2Bcats'
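The `%2B` sequences in the URL above are percent-encoded plus signs: a literal `+` inside a parameter value gets escaped, while a space is what gets written as `+`. The sketch below shows this with the standard library's `urlencode`, which is used here only as an illustration and may not match Requests in every detail. Whether DPLA treats the two forms of the query the same way is not something this notebook tests.

```python
from urllib.parse import urlencode

# A literal '+' in a value is escaped as %2B
print(urlencode({'q': 'goats+AND+cats'}))

# A space in a value is what gets written as '+'
print(urlencode({'q': 'goats AND cats'}))
```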

In [19]:
# we can check the status code of our request
r.status_code

200

[HTTP Status Codes](http://www.restapitutorial.com/httpstatuscodes.html)
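If you want a readable label for a status code without leaving Python, the standard library's `http.HTTPStatus` enum can supply one. The helper name `describe_status` is made up for this sketch.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label such as '200 OK' for a status code."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    return f'{status.value} {status.phrase}'

# 200 is what our query returned; 201 is what the key request returned
for code in (200, 201, 404):
    print(describe_status(code))
```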

In [20]:
# we can look at the content of our request
print(r.content)

b'{"count":29,"start":0,"limit":10,"docs":[{"@context":"http://dp.la/api/items/context","isShownAt":"http://catalog.hathitrust.org/Record/009232885","dataProvider":["Cornell University"],"@type":"ore:Aggregation","provider":{"@id":"http://dp.la/api/contributor/hathitrust","name":"HathiTrust"},"object":"https://books.google.com/books/content?id=b2RUAAAAYAAJ\\u0026printsec=frontcover\\u0026img=1\\u0026zoom=5","ingestionSequence":27,"id":"ac774bfe366f0d3794a96bced3c9f664","ingestDate":"2016-09-09T16:22:07.603465Z","_rev":"12-278b6e2b378e477de56cd41b13c923d0","aggregatedCHO":"#sourceResource","_id":"hathitrust--009232885","sourceResource":{"subject":[{"name":"Domestic animals--Parasites"}],"rights":"Public domain. Learn more at http://www.hathitrust.org/access_use","format":["Electronic resource","Language material"],"date":{"displayDate":"1937]","end":"1937","begin":"1937"},"type":"text","publisher":["[Pittsburgh, Pa., Gulf Oil Corporation, Gulf Refining Company"],"specType":["Book"],"cre
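The content comes back as bytes holding JSON, so it can be parsed into a dictionary just like the OMDb response. This sketch uses a shortened stand-in for `r.content` with the `docs` list emptied out; with a real response object, `r.json()` from Requests does the same parsing.

```python
import json

# A shortened stand-in for r.content; the real response holds 10 docs
content = b'{"count":29,"start":0,"limit":10,"docs":[]}'

# json.loads accepts bytes directly in Python 3.6 and later
results = json.loads(content)

print(results['count'], results['limit'])
```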

By default, DPLA returns 10 items at a time. The count value shows that our query has 29 results. DPLA gives us a parameter, page_size, that we can set to get up to 500 items at a time.



In [21]:
params = {'api_key' : key['api_key'], 'q' : 'goats+AND+cats', 'page_size': 500}
r = requests.get(url, params=params)
print(r.content)

b'{"count":29,"start":0,"limit":500,"docs":[{"@context":"http://dp.la/api/items/context","isShownAt":"http://catalog.hathitrust.org/Record/009232885","dataProvider":["Cornell University"],"@type":"ore:Aggregation","provider":{"@id":"http://dp.la/api/contributor/hathitrust","name":"HathiTrust"},"object":"https://books.google.com/books/content?id=b2RUAAAAYAAJ\\u0026printsec=frontcover\\u0026img=1\\u0026zoom=5","ingestionSequence":27,"id":"ac774bfe366f0d3794a96bced3c9f664","ingestDate":"2016-09-09T16:22:07.603465Z","_rev":"12-278b6e2b378e477de56cd41b13c923d0","aggregatedCHO":"#sourceResource","_id":"hathitrust--009232885","sourceResource":{"subject":[{"name":"Domestic animals--Parasites"}],"rights":"Public domain. Learn more at http://www.hathitrust.org/access_use","format":["Electronic resource","Language material"],"date":{"displayDate":"1937]","end":"1937","begin":"1937"},"type":"text","publisher":["[Pittsburgh, Pa., Gulf Oil Corporation, Gulf Refining Company"],"specType":["Book"],"cr

If we were working with an API that limited us to only 10 items at a time, we could write a loop to pull our data.

The file "seeclickfix_api.py" in the api folder of this repo is an example of how you can pull multiple pages of data from an API.
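A paging loop might look something like the sketch below. It is not taken from "seeclickfix_api.py"; the function names are hypothetical, and a fake in-memory API with 29 records stands in for real requests so the example runs offline. It assumes each page reports a total `count` and a list of `docs`, as the DPLA response above does.

```python
import math

def fetch_all(fetch_page, page_size=10):
    """Collect the docs from every page of a paged API."""
    first = fetch_page(1)
    # The total count tells us how many pages to request
    pages = math.ceil(first['count'] / page_size)
    docs = list(first['docs'])
    for page in range(2, pages + 1):
        docs.extend(fetch_page(page)['docs'])
    return docs

# Fake API with 29 records, standing in for a real request
RECORDS = list(range(29))

def fake_fetch_page(page, page_size=10):
    start = (page - 1) * page_size
    return {'count': len(RECORDS), 'docs': RECORDS[start:start + page_size]}

print(len(fetch_all(fake_fetch_page)))
```

With a real API, `fetch_page` would wrap a `requests.get` call that passes the page number as a parameter.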

## Scraping

In the API section, we used urllib to call an API and save data. We can also use it to download webpages.

In [22]:
html = urllib.request.urlopen("http://xkcd.com/1481/")
print(html.read())

b'<!DOCTYPE html>\n<html>\n<head>\n<link rel="stylesheet" type="text/css" href="/s/b0dcca.css" title="Default"/>\n<title>xkcd: API</title>\n<meta http-equiv="X-UA-Compatible" content="IE=edge"/>\n<link rel="shortcut icon" href="/s/919f27.ico" type="image/x-icon"/>\n<link rel="icon" href="/s/919f27.ico" type="image/x-icon"/>\n<link rel="alternate" type="application/atom+xml" title="Atom 1.0" href="/atom.xml"/>\n<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="/rss.xml"/>\n<script>\n(function(i,s,o,g,r,a,m){i[\'GoogleAnalyticsObject\']=r;i[r]=i[r]||function(){\n(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),\nm=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)\n})(window,document,\'script\',\'//www.google-analytics.com/analytics.js\',\'ga\');\n\nga(\'create\', \'UA-25700708-7\', \'auto\');\nga(\'send\', \'pageview\');\n</script>\n<script type="text/javascript" src="//xkcd.com/1350/jquery.min.js"></script>\n<

We can use the urlretrieve function to retrieve a specific resource, such as a file, via its url. This is basic web scraping.

If we look through our html above, we can see there is a url for the image in the page. 

In [23]:
urllib.request.urlretrieve("http://imgs.xkcd.com/comics/api.png", "api.png")

('api.png', <http.client.HTTPMessage at 0x10b3c8b00>)
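Reading the source by eye works for one page, but image urls can also be collected in code. The sketch below uses the standard library's `HTMLParser` on a made-up snippet (not the real xkcd page source); the class name `ImageFinder` is hypothetical.

```python
from html.parser import HTMLParser

class ImageFinder(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)

# Made-up snippet for illustration, not the real page
sample_html = '<div id="comic"><img src="//imgs.xkcd.com/comics/api.png" alt="API"/></div>'

finder = ImageFinder()
finder.feed(sample_html)
print(finder.sources)
```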

The cell below this is markdown. Double-click on it so it is in editing mode, then execute it to display the file you downloaded with the previous command.

![](api.png)

Using these methods, we are treating the html as an unstructured string. If we want to retrieve the structured markup, we can use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). "Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work."

In [24]:
from bs4 import BeautifulSoup
url = 'https://litemind.com/best-famous-quotes'

html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html,"html.parser")
for quote in soup.findAll('div',{'class':'wp_quotepage'}):
    text = quote.findChildren()[0].renderContents()
    author = quote.findChildren()[1].renderContents()
    print(text, author)

b'1. You can do anything, but not everything.' b'\xe2\x80\x94David Allen'
b'2. Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.' b'\xe2\x80\x94Antoine de Saint-Exup\xc3\xa9ry'
b'3. The richest man is not he who has the most, but he who needs the least.' b'\xe2\x80\x94Unknown Author'
b'4. You miss 100 percent of the shots you never take.' b'\xe2\x80\x94Wayne Gretzky'
b'5. Courage is not the absence of fear, but rather the judgement that something else is more important than fear.' b'\xe2\x80\x94Ambrose Redmoon'
b'6. You must be the change you wish to see in the world.' b'\xe2\x80\x94Gandhi'
b'7. When hungry, eat your rice; when tired, close your eyes. Fools may laugh at me, but wise men will know what I mean.' b'\xe2\x80\x94Lin-Chi'
b'8. The third-rate mind is only happy when it is thinking with the majority. The second-rate mind is only happy when it is thinking with the minority. The first-rate mind is only happy when it is th
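The `b'...'` prefixes and `\xe2\x80\x94` sequences in this output appear because the values are UTF-8 bytes rather than strings. A minimal sketch of tidying them up, using two values copied from the output above:

```python
# Two byte strings copied from the output above
text = b'1. You can do anything, but not everything.'
author = b'\xe2\x80\x94David Allen'

# decode() turns UTF-8 bytes into str; lstrip drops the leading dash (U+2014)
clean_text = text.decode('utf-8')
clean_author = author.decode('utf-8').lstrip('\u2014')

print(clean_text, '-', clean_author)
```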

Scraping takes work. You need to be able to read the page source to understand how the information is structured and how you can access it. The examples here have been fairly straightforward. Sometimes the markup is messy and poorly formed.

The next commands will look at a [restaurant inspection report](http://dc.healthinspections.us/webadmin/dhd_431/lib/mod/inspection/paper/_paper_food_inspection_report.cfm?inspectionID=105185&wguid=1367&wgunm=sysact&wgdmn=431) from the DC Department of Health.

The markup is messy. It took me a while to parse the data I wanted from it. And once I did, I spent a while cleaning it up too.

In [25]:
# pull in the html and take a look at it
html_file = "http://dc.healthinspections.us/webadmin/dhd_431/lib/mod/inspection/paper/_paper_food_inspection_report.cfm?inspectionID=105185&wguid=1367&wgunm=sysact&wgdmn=431"
html_rpt = urllib.request.urlopen(html_file).read()
html_rpt

b'\r\n<!DOCTYPE html>\r\n<html>\r\n\r\n<head>\r\n<title>Food Establishment Inspection Report</title>\r\n<link href="http://dc.healthinspections.us:80/webadmin/dhd_431/lib/mod/inspection/paper/css/generic.css" rel="stylesheet" type="text/css" media="all, screen, print" />\r\n<style type="text/css">\r\ndiv.container {\r\n\tborder-bottom: 1px solid black;\r\n\tmargin-right:15px;\r\n  }\r\ndiv.container span {\r\n\tposition:relative;\r\n\tbottom:-2px;\r\n\tbackground:#FFFFFF;\r\n  }\r\n.checkboxRedN {\r\nfloat:left;\r\nborder:1px solid red;\r\nwidth:10px;\r\nheight:10px;\r\nfont-size:5px;\r\n}\r\n.checkboxN {\r\nborder:1px solid black;\r\nwidth:10px;\r\nheight:10px;\r\nfont-size:2px;\r\n}\r\n.boxwid2{\r\n\twidth:185px;\r\n}\r\n.spacer{\r\n\twidth:10px;\r\n}\r\n.line1{\r\n\twidth:400px;\r\n}\r\n.line2{\r\n\twidth:230px;\r\n}\r\n.line3{\r\n\twidth:245px;\r\n}\r\n.line5{\r\n\twidth:235px;\r\n}\r\n.hdrSpace{\r\n\twidth: 4px;\r\n}\r\n.hdrSpace2{\r\n\twidth: 4px;\r\n}\r\n.inspTypeSpace{\r\n\twid

In [26]:
# Using Beautiful Soup involved a lot of trial and error. Here are some examples of what I parsed.
soup = BeautifulSoup(html_rpt, 'html.parser')
inspection = soup.find_all('tr')

In [27]:
# phone number
inspection[5].get_text()

'\n\n\nTelephone\n\xa0(202) 667-0010\n\xa0E-mail address\r\n\t\t\t\t\t\t\t\t\xa0kelvin.ferrufino@gmail.com\r\n\t\t\t\t\t\t\t\n\n'

In [28]:
# inspection details
inspection[6].get_text()

'\n\n\nDate of Inspection\n\xa006\n/\n\xa026\n/\n\xa02013\n\xa0\xa0\xa0\xa0Time In\n\xa007\n:\n\xa000\nPM\n\xa0\xa0\xa0\xa0Time Out\n\xa008\n:\n\xa015\nAM\n\xa0\n\n\n'
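Some of the cleanup for text like this can be done with plain string methods. This is only a sketch of one approach, not the cleaning code actually used for this report: it collapses the newlines and non-breaking spaces, then pulls the date out with a regular expression.

```python
import re

# Text copied from the inspection[6].get_text() output above
raw = ('\n\n\nDate of Inspection\n\xa006\n/\n\xa026\n/\n\xa02013\n'
       '\xa0\xa0\xa0\xa0Time In\n\xa007\n:\n\xa000\nPM\n'
       '\xa0\xa0\xa0\xa0Time Out\n\xa008\n:\n\xa015\nAM\n\xa0\n\n\n')

# split() with no arguments also splits on non-breaking spaces (\xa0)
cleaned = ' '.join(raw.split())

# Pull out the month, day, and year that follow the label
match = re.search(r'(\d{2}) / (\d{2}) / (\d{4})', cleaned)
inspection_date = '/'.join(match.groups()) if match else None

print(cleaned)
print(inspection_date)
```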

In [29]:
# the inspector name and badge number
tables = soup.find_all('table')
inspector_info = tables[11]
inspector = inspector_info.find_all("td")

print(inspector[1])
print(inspector[2])

<td style="width:225px; vertical-align: bottom;"> A. Jackson</td>
<td style="width:90px; vertical-align: bottom;">54 </td>


There are a lot of resources out there for building scrapers. Take a look at the resources in the slides. And try out this tutorial for [building your first scraper](http://first-web-scraper.readthedocs.io/en/latest/).

Thanks!