# W02D3 - Other Data Types

- Notebook by [James Bain](https://github.com/jcbain/data_2022-03-07/tree/main/w02d03_other_data_types)


## Outline:
- "Tidy Data" concepts and principles
- Very quick tabular data recap
    - Reading in Excel files
- HTML/XML formats
- JSON format

# Tidy Data

- [*Tidy Data*](https://vita.had.co.nz/papers/tidy-data.pdf) by Hadley Wickham

## What is "Tidy Data"?

- A set of principles for structuring tabular data in a data frame (and the reasons they matter).
- Each column holds one variable, so its values share a data type.
- Each row holds one observation.
- Each cell holds a single value.
- We would like our data to be in this form.



![rows/columns/values](images/01_tidy_data.png)

## Examples of Untidy data

### Ex 1

- dataset: `puppies`
- info: measuring the weights (kg) of puppies after 6 months and 24 months
- **What is wrong? How to fix?**


| name  | 6 | 24 |
|----|----|----|
| pippa | 8 | 12 |
| prairie | 5 | 10 |

#### Problem:
* The column names `6` and `24` aren't **variables**; they are **values of a variable** (age in months).
* This layout is called "pivoted" (also known as "wide").


Tidied example (unpivoted):

| name | age | weight |
|------|-----|--------|
| pippa | 6 | 8 |
| prairie | 6 | 5|
| pippa | 24 | 12 |
| prairie | 24 | 10 |

- This format will be easier to work with when we start ML.
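
As a minimal sketch of the unpivoting step, assuming the untidy table is already in a data frame, pandas' `melt` can produce the tidied version. The variable names `puppies_wide` and `puppies_tidy` are chosen here for illustration only.

```python
import pandas as pd

# The untidy (wide) puppies table: ages are spread across the column names.
puppies_wide = pd.DataFrame({
    "name": ["pippa", "prairie"],
    "6": [8, 5],
    "24": [12, 10],
})

# melt() unpivots: the old column names become values of a new `age` column.
puppies_tidy = puppies_wide.melt(
    id_vars="name",       # column(s) kept as identifiers
    var_name="age",       # new column holding the old column names
    value_name="weight",  # new column holding the cell values
)
puppies_tidy["age"] = puppies_tidy["age"].astype(int)  # "6" -> 6

print(puppies_tidy)
```

The result has one row per puppy per age, in the same order as the tidied table above.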

### Ex 2

- dataset: `test_scores`
- info: documenting test scores of students
- **What is wrong? How to fix?**


| student_id | test_name | score |
| ---------- | --------- | ----- |
| 1          | bio       | 76/100|
| 2          | bio       | 90/100|
| 1          | anthro    | 200/250|

#### Problem:

* The `score` column packs two values, points earned and points available, into one string (e.g. `76/100`), so it can't be treated as a number.
* Don't replace it with the computed ratio; split it into two columns so no information is lost (76/100 and 190/250 both reduce to 0.76).

Tidied example:

| student_id | test_name | points_earned | points_available |
| ---------- | --------- | ----- | -----|
| 1          | bio       | 76 | 100 |
| 2          | bio       | 90 | 100 |
| 1          | anthro    | 200 | 250 |
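
One possible way to do this split in pandas is sketched below, assuming the scores always follow the `earned/available` pattern. The variable names are illustrative.

```python
import pandas as pd

test_scores = pd.DataFrame({
    "student_id": [1, 2, 1],
    "test_name": ["bio", "bio", "anthro"],
    "score": ["76/100", "90/100", "200/250"],
})

# Split "76/100" into two string columns, then convert both to integers.
parts = test_scores["score"].str.split("/", expand=True).astype(int)
parts.columns = ["points_earned", "points_available"]

# Swap the composite column for the two new columns.
test_scores_tidy = pd.concat([test_scores.drop(columns="score"), parts], axis=1)
print(test_scores_tidy)
```

With both numbers kept, the percentage can still be computed later as `points_earned / points_available`.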

## Your Turn 
### Is this untidy? How can we fix?

data: TB cases

| country | cases (year 1999) | cases (year 2000) | population (year 1999) | population (year 2000) |
| ------- | ----------------- | ----------------- | ---------------------- | ---------------------- |
| Afghanistan | 745 | 2666 | 19987071 | 20595360 |
| Brazil | 37737 | 80488 | 172006362 | 174504897 |
| China | 212258 | 213766 | 1272915272 | 1280428583 | 



<details>
  <summary>Solution 1 (hidden)</summary>
    
| country | year | cases | population |
| ------- | ----------------- | ----------------- | ---------------------- |
| Afghanistan | 1999 | 745 | 19987071 |
| Afghanistan | 2000 | 2666 | 20595360 |
| Brazil | 1999 | 37737 | 172006362 |
| Brazil | 2000 | 80488 | 174504897 |
| China | 1999 | 212258 | 1272915272 |
| China | 2000 | 213766 | 1280428583 |
</details>

<details>
  <summary>Solution 2 (hidden)</summary>
    
| country | year | attribute | value |
| ------- | ----------------- | ----------------- | ---------------------- |
| Afghanistan | 1999 | cases | 745 |
| Afghanistan | 1999 | population | 19987071 |
| Afghanistan | 2000 | cases | 2666 |
| Afghanistan | 2000 | population | 20595360 |
| Brazil | 1999 | cases | 37737 |
| Brazil | 1999 | population | 172006362 |
| Brazil | 2000 | cases | 80488 |
| Brazil | 2000 | population | 174504897 |
| China | 1999 | cases | 212258 |
| China | 1999 | population | 1272915272 |
| China | 2000 | cases | 213766 |
| China | 2000 | population | 1280428583 |
</details>
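
For reference, here is a sketch of how both solution shapes could be reached in pandas, assuming the column names follow the `attribute (year YYYY)` pattern shown in the table. The intermediate names (`tb_long`, `tb_tidy`) are illustrative.

```python
import pandas as pd

tb_wide = pd.DataFrame({
    "country": ["Afghanistan", "Brazil", "China"],
    "cases (year 1999)": [745, 37737, 212258],
    "cases (year 2000)": [2666, 80488, 213766],
    "population (year 1999)": [19987071, 172006362, 1272915272],
    "population (year 2000)": [20595360, 174504897, 1280428583],
})

# Step 1: unpivot everything except `country`.
tb_long = tb_wide.melt(id_vars="country", var_name="column", value_name="value")

# Step 2: split "cases (year 1999)" into attribute="cases" and year=1999.
extracted = tb_long["column"].str.extract(
    r"^(?P<attribute>\w+) \(year (?P<year>\d{4})\)$"
)
tb_long["attribute"] = extracted["attribute"]
tb_long["year"] = extracted["year"].astype(int)
tb_long = tb_long[["country", "year", "attribute", "value"]]  # shape of Solution 2

# Step 3: pivot `attribute` back out into columns (shape of Solution 1).
tb_tidy = (
    tb_long.pivot(index=["country", "year"], columns="attribute", values="value")
    .reset_index()
    .rename_axis(columns=None)
)
print(tb_tidy)
```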

# Tabular data recap
**... and a small introduction to reading in Excel (`xlsx`) files**

In [None]:
import pandas as pd

In [None]:
# using pandas to construct a dataframe from an xlsx file
puppies = pd.read_excel("data/puppies.xlsx")
puppies.head()

In [None]:
# what if our file actually has multiple sheets?
puppies_weights = pd.read_excel("data/puppies.xlsx", sheet_name="Sheet1")
puppies_desc = pd.read_excel("data/puppies.xlsx", sheet_name="Sheet2")
puppies_weights

In [None]:
puppies_desc

## The power of documentation
When working with any Python library, it helps to know your way around its documentation. Here is the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) for `pandas.read_excel()`.

# Some other types of data that are useful to data scientists

| Data Type | Example |Uses |
| :--------- | :------- | :---- |
| JSON or XML | parsing APIs | gathering data, trend analysis, forecasting... |
| HTML | web scraping | gathering web based document data, social media contacts... |
| image/video | computer vision | self-driving cars, medical imaging diagnostics |
| text | tweets, scripts, books | sentiment analysis, text generation, other natural language processing |

Today we will learn how to work with the HTML, XML, and JSON formats, using these Python libraries:

- XML: `xml`
- HTML: `xml` (also `BeautifulSoup`, not covered but good to be aware of)
- JSON: `json`


# XML and HTML
- `html`: HyperText Markup Language (what our browsers read)
- `xml`: eXtensible Markup Language (more general than HTML: we can define our own "tags")
- both are hierarchical collections of elements (a tree structure)
- an element generally consists of an opening tag, content, and a closing tag

Let's look at some HTML: [Wikipedia page for "Dogs"](https://en.wikipedia.org/wiki/Dog)
- right-click->Inspect (or `F12`)

Let's look at some XML:

```xml
<dog>
    <name>Pippa</name>
    <age>10</age>
    <diet>
        <fooditem>kibbles</fooditem>
        <fooditem>pumpkin</fooditem>
    </diet>
</dog>
```

or, with the same information stored as attributes:

```xml
<dog name="Pippa" age="10">
    <diet>
        <fooditem name="kibbles"></fooditem>
        <fooditem name="pumpkin"></fooditem>  
    </diet>
</dog>
```

Remember, we will usually be consuming this data, so someone else will have designed the XML format and style. Our focus is on parsing an HTML/XML file to get the information we want.
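
As a small preview of the parsing covered in the demo below, this sketch reads the same values from both styles of the dog XML. It assumes the XML is available as a string and uses `et.fromstring` from the standard library. Values stored as content come from `.text`; values stored as attributes come from `.attrib`.

```python
import xml.etree.ElementTree as et

# Style 1: values stored as element content.
dog_text = et.fromstring("""
<dog>
    <name>Pippa</name>
    <age>10</age>
    <diet>
        <fooditem>kibbles</fooditem>
        <fooditem>pumpkin</fooditem>
    </diet>
</dog>
""")
name_from_text = dog_text.find("name").text
foods_from_text = [item.text for item in dog_text.find("diet")]

# Style 2: the same values stored as attributes.
dog_attr = et.fromstring("""
<dog name="Pippa" age="10">
    <diet>
        <fooditem name="kibbles"></fooditem>
        <fooditem name="pumpkin"></fooditem>
    </diet>
</dog>
""")
name_from_attr = dog_attr.attrib["name"]
foods_from_attr = [item.attrib["name"] for item in dog_attr.find("diet")]

print(name_from_text, foods_from_text)
print(name_from_attr, foods_from_attr)
```

Either way, every value arrives as a string, so numbers such as the age need an explicit conversion like `int(...)`.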

## Demo

- data: `olympics.xml`
- path: 'data/olympics.xml'
- description: characteristics of several host countries of the Summer Olympic Games

```xml
<?xml version="1.0"?>
<data>
  <country name="greece">
    <order>1</order>
    <year>1896</year>
    <nexthost name="france"></nexthost>
  </country>
  <country name="united states of america">
    <order>3</order>
    <year>1904</year>
    <previoushost name="france"></previoushost>
    <nexthost name="england"></nexthost>
  </country>
  <country name="australia">
    <order>27</order>
    <year>2000</year>
    <previoushost name="united states of america"></previoushost>
    <nexthost name="greece"></nexthost>
  </country>
</data>
```

A useful site for visualizing the tree structure of an XML file: https://codebeautify.org/xmlviewer

In [None]:
# read in our data
import xml.etree.ElementTree as et
tree = et.parse('data/olympics.xml')
tree

In [None]:
# grab the root element of tree
root = tree.getroot()
root

We now have a handle on the root element. How do we begin exploring?
- what's the root element's **tag**?
- does the root element have any **attributes**? 
- does the root element contain any **children**?
- can we extend this knowledge?

How do we find out more? Check the [docs](https://docs.python.org/3/library/xml.etree.elementtree.html)

In [None]:
# explore some of its features
print("the root element's tag name is:", root.tag)
print("the attributes of the root are:", root.attrib) #dictionary
print("the number of children that the root element has is: ", len(root))

In [None]:
#access the 0-th child
first_child = root[0]
print(first_child.tag)
print(first_child.attrib)
print(len(first_child))

In [None]:
second_child = root[1]
print(second_child.tag)
print(second_child.attrib)
print(len(second_child))

In [None]:
third_child = root[2]
print(third_child.tag)
print(third_child.attrib)
print(len(third_child))

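Instead of indexing each child, we could have used the `.iter()` method. It walks an element's whole subtree (the element itself and every descendant), optionally filtered by tag. Here every `country` element is a direct child of the root, so it visits the same three elements as above. To visit only direct children, loop over the element itself (`for child in root:`).

In [None]:
for country in root.iter('country'): # every element in root's subtree with the tag 'country'
    print(country.tag)
    print(country.attrib)
    print(len(country))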
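
To make the difference between direct children and descendants concrete, here is a small sketch on a two-country excerpt of the Olympics data, parsed from a string so it runs on its own.

```python
import xml.etree.ElementTree as et

root = et.fromstring("""
<data>
  <country name="greece">
    <order>1</order>
    <year>1896</year>
    <nexthost name="france"></nexthost>
  </country>
  <country name="australia">
    <order>27</order>
    <year>2000</year>
    <previoushost name="united states of america"></previoushost>
    <nexthost name="greece"></nexthost>
  </country>
</data>
""")

# Looping over an element yields its direct children only.
child_tags = [child.tag for child in root]

# iter() with no argument yields the element itself plus every descendant.
all_tags = [elem.tag for elem in root.iter()]

# iter(tag) finds matching elements at any depth.
nexthosts = [elem.attrib["name"] for elem in root.iter("nexthost")]

# findall() with a bare tag name only looks at direct children.
direct_nexthosts = root.findall("nexthost")

print(child_tags)
print(all_tags)
print(nexthosts)
print(direct_nexthosts)
```

Because `nexthost` sits two levels below the root, `root.iter("nexthost")` finds it while `root.findall("nexthost")` returns an empty list.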

For nodes with multiple attributes, we can just specify the key of the attribute we want (since it's a dictionary).

In [None]:
third_child.set('abbrev', 'AUS') #adds an attribute called 'abbrev' with a value of 'AUS' (note the XML file is not changed)
third_child.attrib

In [None]:
third_child.attrib['name']
#third_child.attrib['abbrev']
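
Indexing `.attrib` with a key that does not exist raises a `KeyError`, just like any dictionary. A sketch of a more forgiving alternative, using the element's `.get()` method on a standalone `country` element built here for illustration:

```python
import xml.etree.ElementTree as et

country = et.fromstring('<country name="australia"></country>')

name = country.get("name")                      # same value as country.attrib["name"]
abbrev = country.get("abbrev")                  # missing attribute -> None, no KeyError
abbrev_or_default = country.get("abbrev", "unknown")

country.set("abbrev", "AUS")                    # add the attribute (in memory only)
abbrev_after_set = country.get("abbrev")

print(name, abbrev, abbrev_or_default, abbrev_after_set)
```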

We now know how to traverse an XML tree and get tags and attributes. How do we get an actual **value**? (E.g. what year did Greece host the Olympics?)

In [None]:
# grab the 0th country element and check its attribute
first_child = root[0]
first_child.attrib

In [None]:
#find which child has the tag='year'
print(first_child[0].tag)
print(first_child[1].tag)

In [None]:
#use .text to get the value at that node.
first_child[1].text

Surely there must be a more efficient way... Maybe some way to search?

E.g. I want the value of the child with tag='year'

In [None]:
for sub_child in first_child.findall('year'):
    print(sub_child.text)
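
Putting these pieces together, one way to turn the whole file into a data frame is to build one dictionary per `country` element. This is only a sketch: it parses the demo XML from a string, and the column names are chosen here for illustration. `find()` returns the first matching child (or `None`), and `findtext()` returns that child's text directly.

```python
import xml.etree.ElementTree as et
import pandas as pd

root = et.fromstring("""<data>
  <country name="greece">
    <order>1</order>
    <year>1896</year>
    <nexthost name="france"></nexthost>
  </country>
  <country name="united states of america">
    <order>3</order>
    <year>1904</year>
    <previoushost name="france"></previoushost>
    <nexthost name="england"></nexthost>
  </country>
  <country name="australia">
    <order>27</order>
    <year>2000</year>
    <previoushost name="united states of america"></previoushost>
    <nexthost name="greece"></nexthost>
  </country>
</data>""")

rows = []
for country in root.findall("country"):
    previous = country.find("previoushost")  # None when the element is absent
    rows.append({
        "name": country.get("name"),
        "order": int(country.findtext("order")),
        "year": int(country.findtext("year")),
        "previoushost": previous.get("name") if previous is not None else None,
        "nexthost": country.find("nexthost").get("name"),
    })

olympics = pd.DataFrame(rows)
print(olympics)
```

Greece has no `previoushost` element, so that cell is left empty rather than raising an error.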

For more XML methods (and examples), visit this link: https://docs.python.org/3/library/xml.etree.elementtree.html

You'll also practice working with XML further in your compass exercises.

# JSON

**J**ava**S**cript **O**bject **N**otation: one of the most popular semi-structured data formats.

A data-interchange format that is simple for both humans and machines to read and write.

Data are stored as key-value pairs, much like a Python dictionary.


```json
{
  "pizza": ["veggie", "pepperoni", "pineapple", "mushroom"],
  "price": [12.99, 14.00, 11.99, 16.00],
  "averageRating": [5, 4, 4.5, 4.9]
}
```

or 

```json
{
  "pizza": {
    "0":"veggie",
    "1":"pepperoni",
    "2":"pineapple",
    "3":"mushroom"
  },
  "price": {
    "0":12.99,
    "1":14.0,
    "2":11.99,
    "3":16.0
  },
  "averageRating": {
    "0":5.0,
    "1":4.0,
    "2":4.5,
    "3":4.9
  }
}
```


When every key maps to the same number of items, parsing the JSON into a dataframe is simple.

In [None]:
data_url = "https://raw.githubusercontent.com/jcbain/data_2022-03-07/main/w02d03_other_data_types/src/data/pizza.json"
#data_path = "data/pizza.json"
pizza = pd.read_json(data_url)
pizza
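
The two JSON layouts shown above are closely related. As a sketch, the snippet below loads the first layout from a string with the standard `json` module, builds a data frame, and then writes it back out with `to_json()`, whose default orientation for a data frame matches the second layout.

```python
import json
import pandas as pd

pizza_json = """
{
  "pizza": ["veggie", "pepperoni", "pineapple", "mushroom"],
  "price": [12.99, 14.00, 11.99, 16.00],
  "averageRating": [5, 4, 4.5, 4.9]
}
"""

pizza_dict = json.loads(pizza_json)  # JSON text -> Python dict of lists
pizza_df = pd.DataFrame(pizza_dict)  # equal-length lists -> one column each

# Default to_json() output: {column -> {row label -> value}}.
round_trip = json.loads(pizza_df.to_json())

print(pizza_df)
print(round_trip["pizza"])
```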

## Nested JSON
However, not all JSON is this regular...

data: [`data/content.json`](data/content.json)

```json

{
  "articles": [
    {
      "name": "how to program in javascript",
      "author": "prairie",
      "wordCount": 1200
    },
    {
      "name": "should you rewrite your application in python?",
      "author": "pippa",
      "wordCount": 40000
    }
  ],
  "blogs": [
    {
      "title": "differences between js objects and python dictionaries",
      "postedBy": "jennifer"
    }
  ]
}
```

In [None]:
content = pd.read_json("data/content.json") # raises an error: "articles" and "blogs" have different lengths
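
The underlying problem can be reproduced without the file. The sketch below builds the same dictionary by hand and passes it to `pd.DataFrame` instead of `pd.read_json`, on the assumption that both fail for the same reason: each top-level key would become a column, but the two lists have different lengths.

```python
import pandas as pd

content = {
    "articles": [
        {"name": "how to program in javascript", "author": "prairie", "wordCount": 1200},
        {"name": "should you rewrite your application in python?", "author": "pippa", "wordCount": 40000},
    ],
    "blogs": [
        {"title": "differences between js objects and python dictionaries", "postedBy": "jennifer"},
    ],
}

try:
    pd.DataFrame(content)  # 2 articles vs 1 blog: columns of unequal length
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```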

In such cases, we have to work a bit more to get our data into a format that is reasonable to work with.

In [None]:
import json

with open("data/content.json") as file:
    content = json.load(file) #open JSON file as a python dictionary
    
content

Use the `pd.json_normalize` function to mold your data into a dataframe

... but the result looks a little strange: a single row, with each cell holding a whole list

In [None]:
pd.json_normalize(content)

Conceptually, "articles" and "blogs" are distinct kinds of records, so we normalize each one separately with `record_path`.

In [None]:
pd.json_normalize(content, record_path="articles")

In [None]:
pd.json_normalize(content, record_path="blogs")

In [None]:
#same as cell above
blogs = content["blogs"]
pd.json_normalize(blogs)
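
The cells above depend on `data/content.json`. For a version that runs on its own, this sketch defines the same content inline and compares the three calls.

```python
import pandas as pd

content = {
    "articles": [
        {"name": "how to program in javascript", "author": "prairie", "wordCount": 1200},
        {"name": "should you rewrite your application in python?", "author": "pippa", "wordCount": 40000},
    ],
    "blogs": [
        {"title": "differences between js objects and python dictionaries", "postedBy": "jennifer"},
    ],
}

# No record_path: one row, and each cell holds an entire list of records.
flat = pd.json_normalize(content)

# record_path picks which list of records becomes the rows.
articles = pd.json_normalize(content, record_path="articles")
blogs = pd.json_normalize(content, record_path="blogs")

print(flat.shape, articles.shape, blogs.shape)
print(articles)
```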

### Nested with multiple levels
What about a structure that is complex but uniform?
[`data/states.json`](data/states.json)

In [None]:
import json

with open("data/states.json") as file:
    states = json.load(file) #open JSON file as a python dictionary
    
states

In [None]:
pd.json_normalize(states) #naively use pd.json_normalize

Notice the `counties` column (each cell holds a list of records) and the `info.ltGovernor` column (nested keys are flattened into dotted names).

In [None]:
#create a df from just the counties column
pd.json_normalize(states, record_path="counties")

Which state does each of these counties belong to? We can think of attaching that information as a "join".

In [None]:
# meta pulls in a field from one level up (the state that owns each county)
pd.json_normalize(states, record_path="counties", meta=["state"])

In [None]:
# meta can list several parent-level fields
pd.json_normalize(states, record_path="counties", meta=["state", "abbr"])

In [None]:
pd.json_normalize(states, record_path="counties", meta=["state", "abbr", ["info", "governor"]])

In [None]:
#rewriting above in a more visually pleasing way
pd.json_normalize(states,
                  record_path="counties",
                  meta=["state", "abbr", ["info", "governor"]])

In [None]:
# raises a KeyError because Ohio does not have an ltGovernor
pd.json_normalize(states,
                  record_path="counties",
                  meta=["state", "abbr", ["info", "governor"], ["info", "ltGovernor"]])

In [None]:
# errors="ignore" fills missing meta values with NaN instead of raising
pd.json_normalize(states,
                  record_path="counties",
                  meta=["abbr", "state", ["info", "governor"], ["info", "ltGovernor"]],
                  errors="ignore"
                 )

Notice how we started with the most nested level (`counties`) and worked our way back up using `meta`.
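
The whole pattern can be tried without the data file. The sketch below uses a hypothetical stand-in for `data/states.json`: the keys (`state`, `abbr`, `info`, `counties`) follow the calls above, but every value, and the `name`/`population` fields inside each county, are placeholders invented for illustration and are not the real contents of the file.

```python
import pandas as pd

# Hypothetical stand-in for data/states.json (placeholder values only).
states = [
    {
        "state": "State A",
        "abbr": "SA",
        "info": {"governor": "Governor A", "ltGovernor": "Lt. Governor A"},
        "counties": [
            {"name": "County A1", "population": 100},
            {"name": "County A2", "population": 200},
        ],
    },
    {
        "state": "State B",
        "abbr": "SB",
        "info": {"governor": "Governor B"},  # no ltGovernor key
        "counties": [{"name": "County B1", "population": 300}],
    },
]

meta = ["state", "abbr", ["info", "governor"], ["info", "ltGovernor"]]

# Default errors="raise": a missing meta key is an error.
try:
    pd.json_normalize(states, record_path="counties", meta=meta)
    raised = False
except KeyError:
    raised = True

# errors="ignore": the missing value becomes NaN instead.
counties = pd.json_normalize(
    states, record_path="counties", meta=meta, errors="ignore"
)
print(raised)
print(counties)
```

Each county row carries its parent's fields, and nested meta paths such as `["info", "governor"]` appear as dotted column names (`info.governor`).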