# Data Schemas

In our last session, Dr. Peck mentioned a few data schemas that you may not have seen before.  I thought it would be useful to walk through examples of some open-source data formats, along with the tools available in Python for working with them.

## JSON

This format consists of key-value pairs.  It was inspired by a subset of the JavaScript programming language, though it is now language-agnostic and exists as its own standard.  It has been widely adopted because it is easy for both humans and machines to read, write, and understand.  Here is an example of a JSON file; it should look fairly familiar, since you have been using Python dictionaries.

```json
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    },
    {
      "type": "mobile",
      "number": "123 456-7890"
    }
  ],
  "children": [],
  "spouse": null
}

```

Observations: the two primary parts of JSON are keys and values, which together form a key/value pair.
- Key: a key is always a string, enclosed in double quotation marks.
- Value: a value can be a string, number, array, boolean (`true` or `false`), object, or `null`.
- A key/value pair follows a specific syntax: the key, then a colon, then the value.
- Key/value pairs are separated by commas.
- Arrays are enclosed in square brackets; objects are enclosed in curly brackets.
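
To see these rules in action, here is a minimal sketch that parses a JSON string with `json.loads` and reports the Python type of each value.  The sample text is made up for illustration and is not one of the course data files.

```python
import json

# A small, made-up JSON document (not one of the course files).
text = (
    '{"name": "Ada", "age": 36, "tags": ["a", "b"], '
    '"address": {"city": "London"}, "spouse": null, "isAlive": true}'
)

record = json.loads(text)  # parse a JSON string into Python objects

# Each JSON value type arrives as a corresponding Python type.
for key, value in record.items():
    print(key, "->", type(value).__name__)
```

Note that `json.loads` takes a string, while `json.load` (used in the next cell) reads from an open file.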

Python supports JSON natively through the `json` module in the standard library:

In [None]:
import json

with open("Data/ex.json") as data_file:
    data = json.load(data_file)

In [None]:
from pprint import pprint
pprint(data)

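Hmm, if you run this under Python 2, every string is shown with a `u` prefix, which marks it as a unicode string.  The prefix is only part of how Python displays the string; it does not appear in the data itself.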

In [None]:
print(data['address'])


In [None]:
print(data['address']['postalCode'])

The `json` package can also dump Python objects into a JSON file.  This process is sometimes called serialization (transforming data into a series of bytes to be stored or transmitted).  The conversion is fairly intuitive:

| Python | JSON |
| ------ |-----:|
|dict | object|
|list, tuple |array|
|str |string|
|int, long, float|number|
|True | true|
|False |false|
|None |null|
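
As a quick check of the table above, the sketch below serializes one value of each type with `json.dumps` and then reads the result back with `json.loads`.  The `sample` dictionary and its key names are hypothetical and chosen only to cover each row.

```python
import json

# Hypothetical sample object with one entry per row of the table.
sample = {
    "a_dict": {"k": "v"},
    "a_list": [1, 2],
    "a_tuple": (3, 4),
    "a_str": "text",
    "an_int": 7,
    "a_float": 2.5,
    "yes": True,
    "no": False,
    "nothing": None,
}

encoded = json.dumps(sample, sort_keys=True)  # Python object -> JSON text
print(encoded)

decoded = json.loads(encoded)  # JSON text -> Python object
# JSON only has arrays, so the tuple comes back as a list.
print(type(decoded["a_tuple"]).__name__)
```

The mapping is therefore not perfectly reversible: a tuple that is written out and read back in becomes a list.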

In [None]:
mydict = {
    "semester": "Fall 2018",
    "course number": "UN5550",
    "students": 18
}

In [None]:
print(mydict)

In [None]:
with open("mydict_ex1.json","w") as write_file:
    json.dump(mydict,write_file)

This file isn't very readable.  Perhaps we can add some whitespace with the `indent` argument:

In [None]:
with open("mydict_ex1.json","w") as write_file:
    json.dump(mydict,write_file,indent=5)

In [None]:
mydict = {
    "semester": "Fall 2018",
    "course number": "UN5550",
    "students": 18,
    "location": {"building": "fisher",
                "room": 330}
}

In [None]:
with open("mydict_ex2.json","w") as write_file:
    json.dump(mydict,write_file,indent=2)
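
To compare the effect of `indent` without opening the output files, one option is `json.dumps`, which returns the JSON text as a string instead of writing it to a file.  This is a minimal sketch that reuses the same dictionary as above:

```python
import json

mydict = {
    "semester": "Fall 2018",
    "course number": "UN5550",
    "students": 18,
    "location": {"building": "fisher", "room": 330},
}

compact = json.dumps(mydict)           # default: everything on one line
pretty = json.dumps(mydict, indent=2)  # nested levels indented by 2 spaces

print(compact)
print(pretty)

# Both strings describe the same data; only the whitespace differs.
print(json.loads(compact) == json.loads(pretty))
```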

## XML

XML stands for "Extensible Markup Language".  It is a markup language developed by the World Wide Web Consortium (W3C), and it is widely used in document formats (e.g. XHTML, RSS) as well as being the default file format for many common tools, such as Microsoft Office.  Parts of the HL7 (Health Level 7) family of healthcare standards are also based on XML.  The basic constructs are tags and elements.

- A tag is a markup construct that begins with &lt; and ends with &gt;.  There are three types of tags:
    - start-tag, e.g. &lt;section&gt;
    - end-tag, e.g. &lt;/section&gt;
    - empty-element tag, such as &lt;line-break/&gt;

- An element is the component that begins with a start-tag and ends with a matching end-tag, or that consists of only an empty-element tag.  In the examples above, section and line-break are the elements.

- Element content consists of the characters between the start-tag and the end-tag; it may include other elements.

An attribute is a markup construct that contains a name-value pair within a start-tag or empty-element tag.  If you're familiar with HTML, an example might be:

```html
<img src="digits.jpg" alt="MNIST digit data" />
```
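
To make the terminology concrete, here is a minimal sketch that uses Python's built-in `xml.etree.ElementTree` module to pull apart a single element.  The XML fragment, including its `id` and `lang` attributes, is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up fragment: a start-tag with two attributes, some text content,
# an empty-element child, and the matching end-tag.
fragment = '<section id="intro" lang="en">Hello<line-break/></section>'

root = ET.fromstring(fragment)

print(root.tag)      # the element name
print(root.attrib)   # the attributes, as a dict of name -> value
print(root.text)     # the character content before the first child
print(root[0].tag)   # the empty-element child
```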

Here is an example XML file, taken from https://docs.python.org/2/library/xml.etree.elementtree.htm
```XML
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
```

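A popular way to parse XML files is the Beautiful Soup package (`bs4`).  Its `'xml'` parser option relies on the `lxml` package, so that needs to be installed as well.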

In [None]:
from bs4 import BeautifulSoup

# Read the whole file; the with block closes it automatically.
with open("data/country.xml", "r") as datafile:
    contents = datafile.read()
data = BeautifulSoup(contents, 'xml')

In [None]:
print(data.prettify())

In [None]:
data.country 

In [None]:
data.country['name']

In [None]:
cnames = data.find_all('country')
for cname in cnames:
    print(cname['name'])  # index a tag like a dictionary to read an attribute
    print("  GDP per capita = $", cname.gdppc.string)

In [None]:
cname.gdppc

In [None]:
print(cname.gdppc.name)

In [None]:
print(cname.gdppc.string)
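
For comparison, roughly the same traversal can be written with the standard library's `xml.etree.ElementTree` instead of Beautiful Soup.  This is only a sketch: it uses a shortened inline copy of the country data so that it runs without the data file, and the `gdp_by_country` dictionary is a name introduced here for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of the country data, so no file is needed.
xml_text = """<data>
    <country name="Liechtenstein"><gdppc>141100</gdppc></country>
    <country name="Singapore"><gdppc>59900</gdppc></country>
    <country name="Panama"><gdppc>13600</gdppc></country>
</data>"""

root = ET.fromstring(xml_text)

gdp_by_country = {}
for country in root.findall("country"):
    name = country.get("name")                # read an attribute
    gdppc = int(country.find("gdppc").text)   # read a child element's text
    gdp_by_country[name] = gdppc
    print(name, gdppc)
```

Here `get` plays the role of the dictionary-style attribute lookup, and `find(...).text` plays the role of `.gdppc.string`.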

Question: how would you write a script to display all the neighbors of each country?  That is, how would you produce the following output?

Liechtenstein is next to Austria

Liechtenstein is next to Switzerland

Singapore is next to Malaysia

Panama is next to Costa Rica

Panama is next to Colombia

Lastly, if you want to export your XML to a file, you can again use the prettify function.

In [None]:
# The with block closes the file automatically once the write finishes.
with open("test_out.xml", "w") as outfile:
    outfile.write(data.prettify())