# Data Collection & Data Formats

Term 1 2019 - Instructor: Teerapong Leelanupab

Teaching Assistant: Suttida Satjasunsern
***

## Downloading Data
The built-in Python *urllib.request* module has functions which help in downloading content from HTTP URLs using minimal code.

In [1]:
import urllib.request
url = "https://www.it.kmitl.ac.th/~teerapong/resources/ds4biz/week4/kmitl.txt"
response = urllib.request.urlopen(url)
text = response.read().decode()
print(text)

	The establishment of the Nondhaburi Telecommunication Training Center on August 24, 1960 with academic cooperation from the Government of Japan marked the origination of KMITL. The training center became the Nodhaburi Institute of Telecommunications under the Columbo Plan, later in 1964.
	As specified by the 1971 King Mongkut's Institute of Technology Act, KMITL was originated by an amalgamation of three technical colleges: Nondhaburi Institute of Telecommunications, North Bangkok Technical College and Thonburi Technical College. In the same year, the Nondhaburi Institute of Telecommunications, or known as King Mongkut's Institute of Technology at Nondhaburi Campus, was relocated to the district of Ladkrabang in Bangkok. The new campus was called ''Chao Khun Taharn Ladkrabang Campus''. The Nondhaburi Institute of Telecommunications became the Faculty of Engineering in 1972. In the same year, the College of Design and Construction located at the Bangplad district was transformed into t

In practice, we may often want to wrap code to fetch URLs in a try block, to handle the case where we cannot access the URL.

In [2]:
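import urllib.error

url = "https://somemissinglink.it.kmitl.ac.th/~teerapong/resources/ds4biz/week4/kmitl.txt"
try:
    response = urllib.request.urlopen(url)
    text = response.read().decode()
except urllib.error.URLError:
    # URLError is raised when the URL cannot be reached (HTTPError is a subclass)
    print("Failed to retrieve %s" % url)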

Failed to retrieve https://somemissinglink.it.kmitl.ac.th/~teerapong/resources/ds4biz/week4/kmitl.txt
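The same pattern can be wrapped in a small reusable function. The following is a minimal sketch: `fetch_text` is a hypothetical helper name (not part of *urllib*), and the 10-second timeout is an arbitrary choice. A `data:` URL and a made-up URL scheme are used so that the example runs without network access.

```python
import urllib.error
import urllib.request


def fetch_text(url, timeout=10):
    """Return the decoded body of `url`, or None if it cannot be retrieved.

    Hypothetical helper for illustration; not part of urllib.
    """
    try:
        # The with-statement closes the connection once we are done reading
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError as err:
        # URLError covers DNS failures and refused connections;
        # HTTPError (e.g. a 404 response) is a subclass, so it is caught too
        print("Failed to retrieve %s (%s)" % (url, err.reason))
        return None


# A data: URL lets us try the success path without network access
print(fetch_text("data:text/plain;charset=utf-8,hello"))
# An unknown URL scheme triggers URLError, so the helper returns None
print(fetch_text("nosuchscheme://example"))
```

Returning `None` on failure lets the calling code decide how to react, for instance by skipping that URL and moving on to the next one.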


## Working with CSV

The CSV ("Comma Separated Values") file format is often used to exchange tabular data between different applications, like Excel. Essentially, a CSV file is a plain text file where values are split by a comma separator. Alternatively, the values can be separated by tabs or spaces.

We could download a CSV file using *urllib.request* and manually parse it...

In [None]:
# Download the CSV and store it as a string
url = "https://www.it.kmitl.ac.th/~teerapong/resources/ds4biz/week4/goal_scorers.csv"
response = urllib.request.urlopen(url)
raw_csv = response.read().decode()
# Parse each line
lines = raw_csv.split("\n")
for line in lines:
    line = line.strip()
    # skip empty lines, such as the one after the final newline
    if len(line) > 0:
        # split based on a comma separator
        parts = line.split(",")
        print(parts)
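Splitting on commas by hand breaks as soon as a value itself contains a comma inside quotes. The built-in *csv* module handles this case. The sketch below uses a small made-up string rather than the course file, so the names and numbers in it are purely illustrative.

```python
import csv
import io

# Made-up sample data: the first field of the data row contains a quoted comma
raw_csv = 'name,team,goals\n"Smith, Jane",Team A,12\n'

# Naive splitting cuts the quoted field in two
naive = raw_csv.splitlines()[1].split(",")
print(naive)

# csv.reader understands the quoting rules and keeps the field intact
rows = list(csv.reader(io.StringIO(raw_csv)))
print(rows[1])
```

`io.StringIO` wraps the string in a file-like object, which is what `csv.reader` expects.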

But we can also use Pandas to directly download and parse CSV data for us, to create a Data Frame which is ready to analyse.

In [None]:
import pandas as pd
df = pd.read_csv("https://www.it.kmitl.ac.th/~teerapong/resources/ds4biz/week4/goal_scorers.csv")
df
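As noted earlier, values may be separated by tabs instead of commas, and `read_csv` accepts a `sep` argument for this. The sketch below parses a small made-up tab-separated string (not the course data) through `io.StringIO`, so it runs without network access.

```python
import io

import pandas as pd

# Made-up, tab-separated sample data for illustration
raw_tsv = "player\tgoals\nPlayer A\t12\nPlayer B\t9\n"

# sep="\t" tells read_csv to split on tabs instead of commas
df_tsv = pd.read_csv(io.StringIO(raw_tsv), sep="\t")
print(df_tsv)

# Numeric columns are parsed as numbers, so they can be aggregated directly
print(df_tsv["goals"].sum())
```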

## Working with JSON

[JSON](http://json.org/) is a lightweight format which is becoming increasingly popular for online data exchange. It was originally based on the JavaScript language and is (relatively) easy for humans to read and write.

The built-in module *json* provides an easy way to encode and decode data in JSON in Python.

In [None]:
import json

Let's try downloading and parsing a simple JSON file which contains information about a number of books, originally from librarything.com:

In [None]:
url = "https://www.it.kmitl.ac.th/~teerapong/resources/ds4biz/week4/books.json"
response = urllib.request.urlopen(url)
raw_json = response.read().decode("utf-8")

In [None]:
print(raw_json)

We can now parse the JSON, converting it from a string into a useful Python data structure:

In [None]:
data = json.loads(raw_json)
print(data)

We can now iterate through the books in the list and extract the relevant information that we require.

In [None]:
for book in data:
    print( "%s = %d" % ( book["title"], book["year"] ) )
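Real-world JSON records do not always contain every field. A minimal sketch, using a made-up string in which the second book has no `year`, shows how `dict.get` avoids a `KeyError` and how `json.dumps` converts the data back into a string:

```python
import json

# Made-up sample; the field names mirror the ones used above
raw_json = '[{"title": "Book A", "year": 2001}, {"title": "Book B"}]'
data = json.loads(raw_json)

for book in data:
    # dict.get returns None instead of raising KeyError when "year" is absent
    print("%s = %s" % (book["title"], book.get("year")))

# json.dumps encodes the Python structure back into a JSON string;
# indent=2 makes the output easier to read
pretty = json.dumps(data, indent=2)
print(pretty)
```

Note that `%s` is used here rather than `%d`, because `%d` would fail when the year is `None`.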

We can then use `json_normalize` in Pandas to convert the semi-structured JSON data into a Data Frame that is ready to analyse.

In [None]:
# json_normalize is a top-level pandas function since pandas 1.0
from pandas import json_normalize

df = json_normalize(data)
df.head(5)
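The main benefit of `json_normalize` shows up with nested records, which it flattens into dotted column names. The sketch below uses made-up records with a nested `author` object. Whether the course file contains nested fields is not assumed here.

```python
import pandas as pd

# Made-up records with a nested "author" object
records = [
    {"title": "Book A", "author": {"name": "Author X", "country": "TH"}},
    {"title": "Book B", "author": {"name": "Author Y", "country": "JP"}},
]

# Nested keys become columns such as "author.name" and "author.country"
df_nested = pd.json_normalize(records)
print(df_nested.columns.tolist())
print(df_nested)
```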

### Reading JSON directly with Pandas
Alternatively, we can use Pandas to download and parse the JSON data directly, creating a Data Frame that is ready to analyse.

In [None]:
import pandas as pd
link = "https://www.it.kmitl.ac.th/~teerapong/resources/ds4biz/week4/books.json" 
df = pd.read_json( link, orient="records")
df.head(5)

## Working with XML

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format which is both human-readable and machine-readable. XML is a widely-adopted format. Python includes several built-in modules for parsing XML data.

The *xml.etree.ElementTree* module can be used to extract data from a simple XML file based on its tree structure. 

In [None]:
# download the content
url = "https://www.it.kmitl.ac.th/~teerapong/resources/ds4biz/week4/books.xml"
response = urllib.request.urlopen(url)
raw_xml = response.read().decode()
print(raw_xml)

We can use the *xml.etree.ElementTree.fromstring()* function to parse content from a string containing XML data.

In [None]:
import xml.etree.ElementTree as et
xroot = et.fromstring(raw_xml)

An XML tree has a root node (i.e. the top level of the document), with child nodes at lower levels. We can iterate over these:

In [None]:
for child in xroot:
    # get the name of the tag, along with any XML attributes which the tag has
    print( child.tag, child.attrib )

We can also search for tags with a specific name, such as `<book>`, and then in turn find child nodes of that tag with a specific name.

In [None]:
for book in xroot.findall("book"):
    # get the text inside a <title> tag, contained within a <book> tag
    title = book.find("title").text
    print(title)
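Calling `.text` on the result of `find()` fails with an `AttributeError` when the tag is missing, because `find()` returns `None` in that case. The `findtext()` method accepts a default value instead. The following sketch uses a made-up XML string whose structure is only assumed to resemble *books.xml*:

```python
import xml.etree.ElementTree as et

# Made-up sample; the second book has no <year> tag
raw_xml = """<books>
  <book id="1"><title>Book A</title><year>2001</year></book>
  <book id="2"><title>Book B</title></book>
</books>"""

xroot = et.fromstring(raw_xml)

records = []
for book in xroot.findall("book"):
    records.append({
        # attributes live in the .attrib dictionary of the element
        "id": book.attrib.get("id"),
        "title": book.findtext("title"),
        # findtext returns the default when the tag is missing
        "year": book.findtext("year", default="unknown"),
    })

print(records)
```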

We can also convert the parsed XML into a Pandas Data Frame, which is ready to analyse.

In [None]:
df_cols = ["id", "title", "ISBN", "year", "rating", "language"]
rows = []

for node in xroot:
    # the id is an XML attribute; the other fields are child tags
    s_id = node.attrib.get("id")
    s_title = node.find("title").text
    s_isbn = node.find("ISBN").text
    s_year = node.find("year").text
    s_rating = node.find("rating").text
    s_language = node.find("language").text

    rows.append([s_id, s_title, s_isbn, s_year, s_rating, s_language])

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
df = pd.DataFrame(rows, columns=df_cols)
df

## Working with HTML

[HyperText Markup Language (HTML)](https://en.wikipedia.org/wiki/HTML) is the language that web pages are written in. HTML isn’t a programming language like Python; instead, it’s a markup language that tells a browser how to lay out content. HTML allows you to do similar things to what you do in a word processor like Microsoft Word: make text bold, create paragraphs, and so on. Because HTML isn’t a programming language, it isn’t nearly as complex as Python.

The built-in Python urllib.request module has functions which help in downloading content from HTTP URLs using minimal code:

In [None]:
import urllib.request
link = "https://www.it.kmitl.ac.th/~teerapong/resources/ds4biz/week4/sample_web/sample.html" 
response = urllib.request.urlopen(link)
html = response.read().decode()

We can simply use a for loop to print the HTML line by line and see its structure.

In [None]:
lines = html.strip().split("\n")
for l in lines:
    print(l)

### The requests library

The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python [requests](https://2.python-requests.org//en/master/) library. The requests library will make a `GET` request to a web server, which will download the HTML contents of a given web page for us. There are several different types of requests we can make using `requests`, of which `GET` is just one. If you want to learn more, check out this tutorial on using [API](https://www.dataquest.io/blog/python-api-tutorial/) requests in Python.

Let’s try downloading a simple sample website, [https://www.it.kmitl.ac.th/~teerapong/resources/ds4biz/week4/sample_web/sample.html](https://www.it.kmitl.ac.th/~teerapong/resources/ds4biz/week4/sample_web/sample.html). We’ll need to first download it using the requests.get method.

In [None]:
import requests
page = requests.get("https://www.it.kmitl.ac.th/~teerapong/resources/ds4biz/week4/sample_web/sample.html")
page

After running our request, we get a [Response](https://2.python-requests.org//en/master/user/quickstart/#response-content) object. This object has a `status_code` property, which indicates if the page was downloaded successfully:

In [None]:
page.status_code

A `status_code` of `200` means that the page downloaded successfully. We won’t fully dive into status codes here, but a status code starting with a `2` generally indicates success, and a code starting with a `4` or a `5` indicates an error.
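The rule of thumb about leading digits can be sketched with the standard library alone. `describe_status` is a hypothetical helper name, and the category labels are informal descriptions rather than official terms from any library.

```python
from http import HTTPStatus


def describe_status(code):
    """Classify an HTTP status code by its leading digit (hypothetical helper)."""
    categories = {
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    category = categories.get(code // 100, "other")
    # HTTPStatus(code).phrase gives the standard phrase, e.g. "Not Found";
    # it raises ValueError for codes that HTTPStatus does not know
    return "%d %s (%s)" % (code, HTTPStatus(code).phrase, category)


for code in (200, 404, 500):
    print(describe_status(code))
```

When working with the requests library, calling `page.raise_for_status()` raises an exception for `4xx` and `5xx` responses, which saves checking `status_code` by hand.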

We can print out the HTML content of the page using the content property:

In [None]:
page.content

### Parsing a page with BeautifulSoup

As you can see above, we now have downloaded an HTML document.

We can use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) library to parse this document, and extract the text from the `h3` tag. We first have to import the library, and create an instance of the `BeautifulSoup` class to parse our document:

In [None]:
import bs4 
soup = bs4.BeautifulSoup(page.content, 'html.parser')

for match in soup.find_all("h3"):
    # get_text() returns only the text inside the tag, without the markup
    text = match.get_text()
    print(text)
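The same approach extends to other tags and to tag attributes. The sketch below parses a small made-up HTML string, so the real sample page may well be structured differently.

```python
import bs4

# Made-up HTML for illustration only
html_doc = """
<html><body>
  <h3>First heading</h3>
  <p>See <a href="https://example.com/a">link A</a> and <a href="/b">link B</a>.</p>
  <h3>Second heading</h3>
</body></html>
"""

soup = bs4.BeautifulSoup(html_doc, "html.parser")

# get_text() strips the markup and keeps only the text inside each tag
headings = [tag.get_text() for tag in soup.find_all("h3")]
print(headings)

# Attributes such as href are read with .get(), which returns None if absent
links = [a.get("href") for a in soup.find_all("a")]
print(links)
```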

## Working with APIs

### Example - Wikipedia

As a simple example of using an online API, we will retrieve JSON data from the Wikipedia web API. The Wikipedia page for 'KMITL' is [here](https://en.wikipedia.org/wiki/King_Mongkut%27s_Institute_of_Technology_Ladkrabang). We can retrieve the same content in a cleaner JSON format from the Wikipedia API endpoint (https://en.wikipedia.org/w/api.php).

In [None]:
title = "King_Mongkut%27s_Institute_of_Technology_Ladkrabang"
url = "https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=true&titles=" + title
print(url)
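Building the query string by hand requires escaping special characters ourselves, such as the `%27` that stands for the apostrophe. As an alternative sketch, `urllib.parse.urlencode` can build the same URL from a dictionary of parameters and do the escaping for us:

```python
from urllib.parse import urlencode

params = {
    "format": "json",
    "action": "query",
    "prop": "extracts",
    "exintro": "true",
    # The plain apostrophe is escaped to %27 by urlencode
    "titles": "King_Mongkut's_Institute_of_Technology_Ladkrabang",
}

url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
```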

In [None]:
response = urllib.request.urlopen(url)
raw_json = response.read().decode("utf-8")

Once we have downloaded the JSON data into a string, we parse it using the *loads()* function, which will convert it into an actual Python dictionary.

In [None]:
data = json.loads(raw_json)
data

The response still needs to be inspected. Note that the results we want are in *data["query"]["pages"]*:

In [None]:
print(data["query"]["pages"])

In [None]:
result = data["query"]["pages"]["1232312"]
print(result["title"])
print(result["extract"])
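The page ID used as a dictionary key above is specific to one article, so hard-coding it is brittle. A minimal sketch, assuming the response keeps the nesting shown above, is to loop over whatever pages were returned. The dictionary below is a made-up, abbreviated stand-in for the real response.

```python
# Made-up, abbreviated stand-in with the same nesting as the API response
data = {
    "query": {
        "pages": {
            "12345": {
                "pageid": 12345,
                "title": "Example page",
                "extract": "<p>Example text</p>",
            }
        }
    }
}

titles = []
# Iterate over the pages instead of looking one up by a fixed ID
for page_id, result in data["query"]["pages"].items():
    titles.append(result["title"])
    print(page_id, result["title"])
    # .get avoids a KeyError if a page comes back without an extract
    print(result.get("extract", ""))
```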

### Example - Currency Exchange Rates

In the next example, we will use the *Fixer.io* API to get currency exchange rate information: http://fixer.io

For API documentation: https://fixer.io/documentation

To retrieve all rates relative to the Euro (EUR), we request the following URL:

In [None]:
ACCESS_KEY = "0c9904dea3d2c46b78686bc16bbba722"

In [None]:
url = "http://data.fixer.io/api/latest?access_key=" + ACCESS_KEY
response = urllib.request.urlopen(url)
raw_json = response.read().decode("utf-8")
print(raw_json)

Next, we parse the JSON data:

In [None]:
data = json.loads(raw_json)
# List all the rates
data

In [None]:
# Get a specific rate
data["rates"]["CHF"]
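Because every rate is quoted against the same base currency, a rate between any two listed currencies can be derived by going through the base. The sketch below uses made-up rates shaped like `data["rates"]`, and `convert` is a hypothetical helper name.

```python
# Made-up rates for illustration only, shaped like data["rates"] (base: EUR)
rates = {"EUR": 1.0, "USD": 1.10, "THB": 33.0}


def convert(amount, source, target, rates):
    """Convert `amount` from `source` to `target` via the common base currency."""
    # First express the amount in the base currency, then in the target currency
    amount_in_base = amount / rates[source]
    return amount_in_base * rates[target]


print(convert(100, "USD", "THB", rates))
print(convert(50, "EUR", "USD", rates))
```

With live data, `rates` would simply be replaced by `data["rates"]`.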

We can add the `symbols` parameter to the URL to restrict the results to a specific currency, such as US Dollars (USD):

In [None]:
url = "http://data.fixer.io/api/latest?access_key=" + ACCESS_KEY + "&symbols=USD"
print(url)
# Retrieve the JSON
response = urllib.request.urlopen(url)
raw_json = response.read().decode("utf-8")
# Parse the JSON
data = json.loads(raw_json)
# Display the rates data for US dollars
data["rates"]

In [None]:
data

In [None]:
df = json_normalize(data)
df