# Lecture 5 - Data Acquisition, Web Scraping and Web APIs *
---
* Some material on web scraping and usage of APIs adapted from Kevin Markham's data science courses at https://github.com/justmarkham

### Content

1. Data gathering via web scraping
2. HTML basics
3. Data gathering via web APIs
4. JSON file format


### Learning Outcomes

At the end of this lecture, you should be able to:

* list the different dynamic sources of data
* explain what HTML is and its basic structure
* make HTTP requests using Python
* traverse the HTML document tree
* perform web scraping at an introductory level
* describe and process the JSON file format
* perform rudimentary data acquisition using Web APIs



---

# Data Acquisition

So far, we have looked at how we can acquire data from pre-prepared Excel and text files in the CSV format. We also saw how we can use the pandas clipboard facility to paste and build data frames.

We also experienced that much of the data does not come in tidy formats that are prepared and ready for data analysis. For this we learned a number of techniques that help us to wrangle and tidy our data into shape. 

Now we are going to look at two additional sources of data that are dynamic and will require the combination of all the techniques we learned previously, such as wrangling, merging, aggregation, as well as some new skills. 

It is becoming common these days that data is acquired from multiple sources and merged into a single dataset. The data sources that are increasingly becoming the backbone of many analytics and information systems are web based.

This section considers how data can be read (scraped) from web pages (HTML documents), and how data can be retrieved from web servers using their application programming interfaces (APIs).

# 1. Web scraping

Often when we need to acquire data, web pages are a great resource to turn to. 

The term "web scraping" refers to an application or script that processes HTML pages. This is done in order to extract data embedded in HTML for manipulation. 

Web scraping applications in effect simulate a person viewing a website with a browser.

Our task then becomes writing scripts that can traverse the structure of HTML documents and locate the particular piece of data we need.

## HTML

Many websites make data available on their web pages for viewing in a browser, but do not make it conveniently downloadable in an easily machine-readable format like JSON, CSV, or XML.

#### What is HTML?

HTML is a markup language (not a programming language) for describing web documents (web pages).

    HTML stands for Hyper Text Markup Language
    A markup language is a set of markup tags
    HTML documents are described by HTML tags
    Each HTML tag describes different document content

HTML pages consist of elements. Elements are marked up by tags and may have attributes inside them which describe how the content should be rendered by web browsers.
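
To make the terms element, tag and attribute concrete, here is a minimal sketch that uses Python 3's built-in `html.parser` module (not BeautifulSoup, which is introduced next) to list every opening tag and its attributes in a small snippet. The snippet and the class name `TagLister` are made up for illustration.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []  # list of (tag_name, {attribute: value}) pairs

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.tags.append((tag, dict(attrs)))


snippet = '<p id="intro">Hello <a href="http://example.com" class="link">world</a></p>'

lister = TagLister()
lister.feed(snippet)

for tag, attributes in lister.tags:
    print(tag, attributes)
```

Each printed line shows one element's tag name followed by a dictionary of its attributes.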

Please refer to http://www.w3schools.com/html/html_intro.asp for an introduction to HTML.

The examples below will show how we can perform web scraping on HTML pages using a Python package called `BeautifulSoup`. 

BeautifulSoup is an HTML/XML parser for Python that turns markup text into a parse tree, which can then be traversed more easily.

In [None]:
from IPython.core.display import HTML
HTML("<iframe src=http://www.crummy.com/software/BeautifulSoup/bs4/doc/ width=1100 height=500></iframe>")

BeautifulSoup provides simple, idiomatic ways of navigating, searching, and modifying the parse tree generated from HTML and XML.

More info on BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Good examples of how this is done can be found at http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/ and http://blog.miguelgrinberg.com/post/easy-web-scraping-with-python

## Intro to Web Scraping

We are going to begin with a toy example first using the simple html page created below:

In [None]:
# imports
import requests                 # How Python gets the webpages
from bs4 import BeautifulSoup   # Creates structured, searchable object
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# First, let's read the toy webpage as a string - this is what happens initially when you scrape any webpage
html_doc = """
<!doctype html>
<html lang="en">
<head>
  <title>Teo's Webpage</title>
</head>

<body>
  <h1>Teo's Webpage</h1>
  <p id="intro">My name is Teo.  I find web scraping interesting.</p>
  <p id="background">I live in Auckland and completed my PhD at Massey University in Computer Science, while studying the field of machine learning.</p>
  <p id="current">I currently work as a lecturer in Information Technology.</p>
  
  <h3>My Interests</h3>
  <ul>
      <li id="my favorite">Data Science and Machine Learning</li>
      <li class="hobby">Tennis</li>
      <li class="hobby">Reading</li>
      <li class="hobby">Travelling</li>
      <li class="hobby">Reading</li>
  </ul>
</body>
</html>
"""
type(html_doc)

In [None]:
# Beautiful Soup allows us to create structure from the HTML elements, and to traverse it
page = BeautifulSoup(html_doc, "html.parser")  # name the parser explicitly
print(type(page))
page

In [None]:
# The most useful methods of a Beautiful Soup object are "find" and "find_all" (spelled "findAll" in older code).
# "find" takes several parameters; the most important are "name" and "attrs".
# "name" lets us find elements by their tag name.
# Let's target "name".
page.find(name='body') # Finds the 'body' tag and everything inside of it.
body = page.find(name='body')
type(body) #element.Tag

The above result tells us that the 'body' element was found in the HTML page, and it tells us what object type it is. We can see its content below:

In [None]:
body.contents

We can recursively search for other elements inside the returned result as well:

In [None]:
h1 = body.find(name='h1') # Find the 'h1' element inside of the 'body' tag
print(h1)
print(h1.text)

Notice how we can access the entire element or just the content. 

Now let's find the 'p' elements:

In [None]:
p = page.find(name='p')
# This only finds the first match.  This is where 'find_all' comes in.
print(p)

In [None]:
all_p = page.find_all(name='p')
print(all_p)
print(type(all_p)) # Result sets are a lot like Python lists

Access a specific element by index:

In [None]:
print(all_p[0])
print(all_p[1])

In [None]:
# Iterable like a list
for one_p in all_p:
    print(one_p.text) # Print text

Access a specific attribute of a tag:

In [None]:
print(all_p[0]) # Specific element
print(all_p[0]['id']) # Specific attribute value of a specific element

Now let's look at 'attrs'. Beautiful soup also allows us to locate elements with specific attributes:

In [None]:
print(page.find(name='p', attrs={"id":"intro"}))

In [None]:
print(page.find(name='p', attrs={"id":"background"}))

In [None]:
result = page.find(name='p', attrs={"id":"current"})
result.text

We can also search for all instances of an element that have a given class name:

In [None]:
print(page.find_all("li", "hobby"))

**Exercise:** Extract the 'h3' element from Teo's webpage.

In [None]:
page.find(name='h3')

**Exercise:** Extract Teo's hobbies from the html_doc.  Print out the text of the hobby. 

In [None]:
# Select only the list items marked with the class "hobby"
hobbies = page.find_all(name='li', attrs={'class': 'hobby'})
for hobby in hobbies:
    print(hobby.text)

**Exercise:** Extract Teo's hobby that has the id "my favorite".

In [None]:
page.find(name='li', attrs={'id':'my favorite'}).text

In order to illustrate HTML web scraping on a real-world site, we will look at http://www.gold.org, a website that lists the up-to-date gold price and refreshes it every minute.

We will attempt to read the asking price of gold from the HTML document.

In [None]:
from IPython.core.display import HTML
HTML("<iframe src=http://www.gold.org width=1100 height=500></iframe>")

The price we are interested in is found in the "ASK" row under the "Spot Price" section. 

In order to find where the price is situated in the HTML document, we must look at the document's source code. Right-clicking on a page in a browser should display an option allowing you to view the source.

We must inspect the source so that we can find the element that houses this value. We can then use Python's BeautifulSoup package to read and iterate through the HTML elements in order to extract the data that we want.

There are three basic steps to scraping a single page:

    1. Get (request) the page
    2. Parse the page content (read and interpret the document structure)
    3. Search through the content for the data of interest
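
The three steps above can be sketched as follows. This sketch deliberately uses only the Python 3 standard library (`urllib.request` and `html.parser`) as a stand-in for `requests` and BeautifulSoup, and the names `get_page`, `ClassTextFinder` and `search` are hypothetical. Step 1 is defined but not called, so the example runs without network access, and the sample page is made up.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def get_page(url):
    """Step 1: request the page and return its HTML as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class ClassTextFinder(HTMLParser):
    """Step 2: parse the HTML, remembering the text of matching elements.

    Assumes that matching elements are not nested inside each other.
    """

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.inside = False   # True while we are inside a matching element
        self.matches = []     # text of every matching element, in page order

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.class_name in classes:
            self.inside = True
            self.matches.append("")

    def handle_data(self, data):
        if self.inside:
            self.matches[-1] += data

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False


def search(html, tag, class_name):
    """Step 3: return the text of every <tag class="class_name"> element."""
    finder = ClassTextFinder(tag, class_name)
    finder.feed(html)
    return [text.strip() for text in finder.matches]


# A made-up page standing in for the result of get_page(url)
sample_html = """
<dl>
  <dt>Ask</dt><dd class="value">1,234.50</dd>
  <dt>Bid</dt><dd class="value">1,233.90</dd>
  <dt>Note</dt><dd class="comment">illustrative numbers only</dd>
</dl>
"""

print(search(sample_html, "dd", "value"))
```

Keeping the three steps in separate functions makes it possible to test the parsing and searching on a saved copy of a page before sending any requests.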


Below is an example of a script that will access and display the latest gold price being traded:


In [None]:
#we first need to make some extra imports
import json
from time import sleep
from datetime import datetime

# You might need to set the proxies if you are doing this from Massey's domain.
# If the below does not work, then try this: "http://get-proxy.massey.ac.nz/"
massey_proxies = {
    "http": "http://alb-cache1.massey.ac.nz/",
    "https": "http://alb-cache1.massey.ac.nz/",
}

# massey_proxies = {
#     "http": "http://get-proxy.massey.ac.nz/",
#     "https": "http://get-proxy.massey.ac.nz/",
# }

# To use no proxy at all:
# massey_proxies = None

**STEP 1: GET** Request the page and keep the response, which will later be read into the BeautifulSoup object

In [None]:
url = "http://gold.org"
response = requests.get(url, proxies=massey_proxies)
response

In [None]:
page = response.content

In [None]:
page

**STEP 2: PARSE** Create a BeautifulSoup object that reads and parses the HTML page into a format that we can search and traverse.

In [None]:
scraping = BeautifulSoup(page, "html.parser")

In [None]:
scraping

Now we can search for a given tag, id or class name.

**STEP 3: SEARCH** Search through the page for 'dd' type tags with the class name 'value':

In [None]:
element = scraping.find("dd", attrs={"class" : "value"})
element

Once we have found the tag we want, we extract its contents through the `.contents` attribute and optionally convert the value into a float.

In [None]:
print(float(str(element.contents[0]).replace(',', '')))
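
The conversion above strips the thousands separator before calling `float`. A small helper makes that step reusable; `parse_price` is a hypothetical name, and the sketch assumes prices formatted like `1,234.50` (comma as thousands separator, dot as decimal point).

```python
def parse_price(text):
    """Convert a price string such as ' 1,234.50 ' to a float."""
    cleaned = text.strip().replace(",", "")
    return float(cleaned)


print(parse_price("1,234.50"))
```

A string that is not a number after cleaning makes `float` raise a `ValueError`, which is a useful signal that the page layout has changed.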

As it turns out, there are multiple tags in the document with this tag and class-name combination.

If we re-run the search from before and ask for all results to be returned that match our criteria, this is what we get:

In [None]:
element = scraping.find_all("dd", attrs={"class" : "value"})
element

Our previous scrape worked because the value of interest was the first match. Say we would now like to scrape the mid price: we can first locate the enclosing element and then search inside it (there could, however, be a shortcut).

In [None]:
element2 = scraping.find("div", attrs={"class" : "asset mid"})

In [None]:
print(element2)

In [None]:
element2.find("dd", "value").contents

**Exercise:** Scrape the bid price from the web page.

In [None]:
#step 1


In [None]:
page = response.content

In [None]:
#step 2
scraping 

In [None]:
element_bid_price = scraping.find_all(attrs={"class" : "asset-inner"})

In [None]:
element_bid_price

In [None]:
element_bid_price[2].find("dd", "value").contents

Below is an example of how we might write a script that continually extracts data from a page every 1-2 seconds:

In [None]:
def GetGoldPrice():
    url = "http://gold.org"
    response = requests.get(url, proxies=massey_proxies)
    page = response.content
    # create a BeautifulSoup object that reads in the HTML page
    scraping = BeautifulSoup(page, "html.parser")
    # search through the page for 'dd' type tags with the class name 'value'
    element = scraping.find("dd", "value")
    # access the contents inside the tags
    price = element.contents[0].string
    return price

for x in range(10):
    time_now = datetime.now().strftime("%I:%M:%S%p")
    print("{0}, Gold price is: {1} \n ".format(time_now, GetGoldPrice()))
    sleep(1)  # wait one second between requests
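
A polling loop like the one above stops at the first failed request. The sketch below shows one way to keep going and to collect the readings for later analysis; `poll` and `fake_price` are hypothetical names, and `fake_price` returns made-up values in place of `GetGoldPrice` so that the example runs offline.

```python
from datetime import datetime
from time import sleep


def poll(fetch, times, delay_seconds):
    """Call fetch() `times` times, pausing between calls, and collect the results."""
    readings = []
    for attempt in range(times):
        try:
            value = fetch()
        except Exception as error:  # keep polling even if one request fails
            value = None
            print("attempt {0} failed: {1}".format(attempt, error))
        readings.append((datetime.now(), value))
        if attempt < times - 1:
            sleep(delay_seconds)
    return readings


# Stand-in for GetGoldPrice so the sketch runs offline; the prices are made up
fake_prices = iter(["1,234.50", "1,234.70", "1,234.60"])


def fake_price():
    return next(fake_prices)


readings = poll(fake_price, times=3, delay_seconds=0)
for time_stamp, price in readings:
    print(time_stamp.strftime("%I:%M:%S%p"), price)
```

Against a real site, `delay_seconds` would be set to a second or more so that the server is not flooded with requests.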

**Exercise**: Extract the current FTSE 100 stock market index from the Google Finance page http://www.google.com/finance

In [None]:
#step 1



In [None]:
#step 2

#scraping

In [None]:
element_FTSE100 

In [None]:
element_FTSE100.contents

We can also read entire HTML tables into DataFrame objects:

In [None]:
scraping_html_table = BeautifulSoup(response.content, "html.parser")

In [None]:
# search the object parsed in the previous cell
scraping_html_table_FTSE100 = scraping_html_table.find_all("table", "quotes")
scraping_html_table_FTSE100

In [None]:
df = pd.read_html(str(scraping_html_table_FTSE100))  # returns a list of DataFrames
df[0]

# 2. Web APIs

Web servers serve out web pages in the HTML format as they are requested by users. Web servers are also capable of providing data that is not formatted in HTML. These web servers provide public (and private) APIs through which users can interact, construct queries that the web servers understand, and receive data from them. Depending on who owns them, web servers will have different APIs. They usually provide developer help pages that demonstrate how the APIs work and how queries that the servers understand can be constructed over HTTP.

Many websites have public APIs providing data feeds via JSON or some other format. We will consider only JSON as it is becoming a standard and is, conveniently, virtually identical to Python's dictionaries in its syntax.

Increasingly though, in order to access these APIs we must register for API keys, which act as credentials. Some of them are free and simply require that an account be created with a given website, while others must be purchased and have limits on the amount of data that can be pulled.

There are a number of ways to access these APIs. REST is becoming the most common mechanism. 

### REST

REST is a lightweight mechanism built on top of the HTTP protocol which enables applications to exchange data with servers. HTTP requests carrying valid REST queries can easily be constructed from Python. One easy-to-use method is through the `requests` package (http://docs.python-requests.org).

Previously, using Web Services and SOAP would result in queries like:

Using REST, such clumsy queries can be transformed into simple HTTP requests of a format (1) like:

Or alternatively, passing arguments using format (2) as follows:

There are slight differences in what you can expect from the two formats. Format 1 (path segment parameter) will return a 404 error when the parameter value does not correspond to an existing resource. 

Format 2 uses optional parameters. Instead of an error, this format will return an empty list when nothing in the query result matches the parameter.
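
As an illustration of the two formats, the sketch below builds both kinds of URL for a made-up endpoint. The address `http://api.example.com/artists` and the identifier `AR123` are hypothetical. The code only constructs the URLs with the standard library and does not send any request.

```python
from urllib.parse import quote, urlencode

base_url = "http://api.example.com/artists"  # hypothetical endpoint
artist_id = "AR123"                           # hypothetical identifier

# Format 1: the parameter is a path segment that names one resource
path_style_url = "{0}/{1}".format(base_url, quote(artist_id))

# Format 2: the parameter is passed in the query string after '?'
query_style_url = "{0}?{1}".format(base_url, urlencode({"id": artist_id, "format": "json"}))

print(path_style_url)
print(query_style_url)
```

With `requests`, the query-string form can also be written as `requests.get(base_url, params={"id": artist_id, "format": "json"})`, which performs the same encoding.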

Example:

In [None]:
#echonest api
url = "http://developer.echonest.com/api/v4/artist/reviews?api_key=YB4F9B7ZLS2YMOGUG&id=ARH6W4X1187B99274F&format=json&results=1&start=0"
response = requests.get(url, proxies=massey_proxies)

#we want HTTP Response 200
response

In [None]:
response_json = response.content
response_json

### JSON

JSON (short for JavaScript Object Notation) has become one of the standard formats
for sending data by HTTP request between web servers and browsers and other applications. 

It is a much more flexible data format than a tabular text form like CSV. 

Here is an example:

In [None]:
# In Python, triple-quoted strings can span several lines and contain double quotes without escaping them.
obj = """
{"name": "Massey University",
"campuses_NZ": ["Albany", "Palmerston North", "Wellington"],
"campuses_international": null,
"colleges": [{"name": "Sciences", "degrees": 10, "majors": 30},
{"name": "Business", "degrees": 8, "majors": 25}]
}
"""
obj


JSON is very nearly valid Python code with the exception of its null value `null` and
some other nuances (such as disallowing trailing commas at the end of lists). The basic
types are objects (dicts), arrays (lists), strings, numbers, booleans, and nulls. 

**All of the keys in an object must be strings**. There are several Python libraries for reading and
writing JSON data. We will use `json` here as it is built into the Python standard library. 

To convert (deserialize) a JSON string from above to an equivalent Python object (`dict`), use `json.loads`:

In [None]:
result = json.loads(obj)
result

`json.dumps` on the other hand converts a Python object back to JSON:

In [None]:
as_json = json.dumps(result)
as_json
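
A short sketch of the points above, using only the built-in `json` module: how JSON values map to Python types, what happens to keys that are not strings, and how a trailing comma is rejected. The sample strings are made up for illustration.

```python
import json

# JSON null/true/false become Python None/True/False
decoded = json.loads('{"a": null, "b": true, "c": [1, 2.5, "x"]}')
print(decoded)

# Python keys that are not strings are converted to strings on output
encoded = json.dumps({1: "one"})
print(encoded)

# A trailing comma is valid in a Python list but not in JSON
try:
    json.loads("[1, 2,]")
    trailing_comma_ok = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    trailing_comma_ok = False
print(trailing_comma_ok)
```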

How you convert a JSON object or list of objects to a DataFrame or some other data
structure for analysis will be up to you. Conveniently, you can pass a list of JSON objects
to the DataFrame constructor and select a subset of the data fields:

In [None]:
massey_colleges = pd.DataFrame(result['colleges'], columns=['name', 'degrees'])
massey_colleges

We can convert a data frame back to a JSON string with the following:

In [None]:
massey_colleges.to_json()

### Data Acquisition from APIs


A popular API provider is https://apigee.com/providers

In [None]:
from IPython.core.display import HTML
HTML("<iframe src=https://apigee.com/providers width=1100 height=500></iframe>")

We will look at getting data from Echo Nest. The Echo Nest offers an array of music data and services for developers to build apps and experiences.

Echo Nest API Console: https://apigee.com/console/echonest

Echo Nest Developer Center: http://developer.echonest.com/

We can use a free session API key from the service:

In [None]:
# request data from the Echo Nest API
url = 'http://developer.echonest.com/api/v4/artist/top_hottt?api_key=YB4F9B7ZLS2YMOGUG&format=json'
response = requests.get(url, proxies=massey_proxies)

#we want HTTP Response 200 - not 404
response

In [None]:
response.text

In [None]:
# decode JSON
print(type(response.json()))
result = response.json()
result
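
The cells above check the response by eye before decoding it. A minimal sketch of doing the same check in code is shown below; `decode_body` is a hypothetical helper, and the status codes and payload are made up so that it runs without contacting any server.

```python
import json


def decode_body(status_code, body):
    """Return the decoded JSON body, or raise if the request did not succeed."""
    if status_code != 200:
        raise ValueError("request failed with HTTP status {0}".format(status_code))
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    return json.loads(body)


# Made-up payload standing in for response.status_code and response.content
payload = decode_body(200, b'{"response": {"status": {"code": 0, "message": "Success"}}}')
print(payload["response"]["status"]["message"])
```

With `requests`, `response.status_code` holds the numeric code, and `response.raise_for_status()` raises an exception for 4xx and 5xx responses.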

In [None]:
# pretty print for easier readability
import pprint
pprint.pprint(result)

In [None]:
# pull out the artist data
artists = result['response']['artists']    # list of 15 dictionaries
artists

In [None]:
# reformat data into a table structure
artists_data = [list(artist.values()) for artist in artists]  # list of 15 lists
artists_data

In [None]:
artists_header = list(artists[0].keys())                # list of 2 strings
artists_header

In [None]:
artists[0]['name']
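
Taking `values()` and `keys()` separately relies on every dictionary listing its fields in the same order. A sketch that fixes the column order explicitly is shown below; the `sample_artists` records are invented stand-ins for the API result, and the field names `id` and `name` are assumptions about its shape.

```python
# Invented records shaped like the list of artist dictionaries
sample_artists = [
    {"name": "Artist A", "id": "AR001"},
    {"id": "AR002", "name": "Artist B"},   # same fields, different order
]

columns = ["id", "name"]  # the column order we want in the table

# One row per artist, with the fields always in the order given by `columns`
rows = [[artist.get(column) for column in columns] for artist in sample_artists]

print(columns)
for row in rows:
    print(row)
```

The same `columns` list can be passed to the DataFrame constructor, `pd.DataFrame(sample_artists, columns=columns)`, in the way the colleges table was built earlier.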

**Exercise:** Have a look through the Echonest API and generate a query.

## API Repositories

A large number of other API repositories can be found under these links:

http://www.publicapis.com/

http://www.programmableweb.com/apis/directory

Mashape (http://www.publicapis.com/) is the Cloud API Marketplace where developers can easily consume Cloud APIs to integrate in their next project, and where existing APIs can be distributed to the community and monetized.

In order to access their APIs, it is usually required to at least create an account, while some web sites will charge fees for accessing their data. There are different ways of communicating with API servers. Mashape has created a Python library that can simplify accessing their data. The library is called *unirest* and can be installed from the command line with pip (`pip install unirest`). Once installed, it can be imported:

In [None]:
import unirest

One of the free APIs listed under this marketplace is Bitcoin Exchange Rates, which lists exchange rates between major currencies and bitcoin as well as exchange rates between the major currencies.

https://www.mashape.com/montanaflynn/bitcoin-exchange-rates#

Below is an example of how to construct a query for the buying price of one bitcoin, with the result returned in USD.

In [None]:
response = unirest.get("https://montanaflynn-bitcoin-exchange-rate.p.mashape.com/prices/buy?qty=1",
  headers={
    "X-Mashape-Key": "2BTWnoXPgrmshykB91haA2hod3UYp1FDVvyjsnjK3EfNKw5329",
    "Accept": "text/plain"
  }
)

response.body

In [None]:
type(response.body)

Notice that the type of the result `response.body` is a familiar dictionary, from which we can easily extract our data.

**Exercise:** Extract the total amount of the cost for 1 bitcoin.

In [None]:
btc = response.body  # the dictionary returned by the query above
btc['total']['amount']

**Exercise:** Execute a query for the cost of 15 bitcoins and extract the total price from the dictionary.

Below is an example of a query for extracting the current exchange rates between the major currencies:

In [None]:
response = unirest.get("https://montanaflynn-bitcoin-exchange-rate.p.mashape.com/currencies/exchange_rates",
  headers={
    "X-Mashape-Key": "QgrDeDPRdFmshQBsi3cDAvZvD6Ykp1AxBj4jsn1po92UN8XxKx",
    "Accept": "text/plain"
  }
)

response.body

**Exercise:** Search through the https://www.mashape.com/montanaflynn/bitcoin-exchange-rates# webpage and find out how to construct a query to extract from their API the sell price for a single bitcoin. Execute this and extract the price.

In [None]:
# These code snippets use an open-source library. http://unirest.io/python
response = unirest.get("https://montanaflynn-bitcoin-exchange-rate.p.mashape.com/prices/sell?qty=1",
  headers={
    "X-Mashape-Key": "2BTWnoXPgrmshykB91haA2hod3UYp1FDVvyjsnjK3EfNKw5329",
    "Accept": "text/plain"
  }
)

response.body['total']['amount']


Markit (http://www.markit.com/Company/About-Markit) is a provider of financial information services.

Below is an example of how the current stock price of Apple can be queried through their API:


In [None]:
url = "http://dev.markitondemand.com/Api/v2/Quote/json?symbol=AAPL"
response = requests.get(url, proxies=massey_proxies)

response

In [None]:
markit_dict = json.loads(response.content)
markit_dict

**Exercise:** Look through their API documentation at http://dev.markitondemand.com/#doc_lookup and construct a query.

Yahoo Finance is one more source of financial data. The two queries below return different levels of detail: the first returns all of the available detail and the second only some of it.

In [None]:
url = "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20yahoo.finance.quotes%20where%20symbol%20IN%20(%22YHOO%22,%22AAPL%22)&format=json&env=http://datatables.org/alltables.env"
response = requests.get(url, proxies=massey_proxies)

response

In [None]:
yahoo_fin1 = json.loads(response.content)
yahoo_fin1

In [None]:
url = "http://finance.yahoo.com/webservice/v1/symbols/YHOO,AAPL/quote?format=json&view=detail"
response = requests.get(url, proxies=massey_proxies)

response

In [None]:
yahoo_fin2 = json.loads(response.content)
yahoo_fin2
