# Types of data that are useful to data scientists

| Data Type | Example | Uses |
| :--------- | :------- | :---- |
| text | tweets, scripts, books | sentiment analysis, text generation, other natural language processing |
| JSON or XML | parsing APIs | gathering data, trend analysis, forecasting... |
| HTML | web scraping | gathering web-based document data, social media contacts... |
| images | computer vision | self-driving cars, medical imaging diagnostics |

## For today

- JSON: `json`
- XML: `xml`
- HTML: `xml` (`BeautifulSoup` not covered but good to be aware of)


### But first... a refresher on tabular data
**... and a small introduction to reading in `xlsx` files**

In [1]:
import pandas as pd

In [74]:
# using pandas to construct a dataframe from an xlsx file
puppies = pd.read_excel("../data/puppies.xlsx")
puppies.head()

Unnamed: 0,name,age,weight
0,pippa,6,8
1,prairie,6,5
2,pippa,24,12
3,prairie,24,10


In [75]:
# what if our file actually has multiple sheets?
puppies_weights = pd.read_excel("../data/puppies.xlsx", sheet_name="Sheet1")
puppies_desc = pd.read_excel("../data/puppies.xlsx", sheet_name="Sheet2")

puppies_weights

Unnamed: 0,name,age,weight
0,pippa,6,8
1,prairie,6,5
2,pippa,24,12
3,prairie,24,10


In [76]:
puppies_desc

Unnamed: 0,name,color,type
0,pippa,white,mix
1,prairie,tri-color,mix
2,chewey,white,jack russel terrier
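
Since both sheets share a `name` column, one natural next step is to join them. The sketch below is only an illustration: it rebuilds the two tables in memory (hypothetical copies of the sheets, so it runs without `puppies.xlsx`) and uses a left join so that every weight measurement is kept.

```python
import pandas as pd

# Hypothetical in-memory copies of the two sheets shown above
puppies_weights = pd.DataFrame({
    "name": ["pippa", "prairie", "pippa", "prairie"],
    "age": [6, 6, 24, 24],
    "weight": [8, 5, 12, 10],
})
puppies_desc = pd.DataFrame({
    "name": ["pippa", "prairie", "chewey"],
    "color": ["white", "tri-color", "white"],
    "type": ["mix", "mix", "jack russel terrier"],
})

# Left join: keep every row of puppies_weights and attach that puppy's description
puppies_full = puppies_weights.merge(puppies_desc, on="name", how="left")
print(puppies_full)
```

Because `chewey` has no weight measurements, a left join drops that row; `how="outer"` would keep it with missing values for `age` and `weight`.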


# XML and HTML
- `html`: HyperText Markup Language
- `xml`: Extensible Markup Language
- both are hierarchical collections of elements
- an element generally consists of an opening tag, content, and a closing tag

Let's look at some HTML: [Wikipedia page for "Dogs"](https://en.wikipedia.org/wiki/Dog)


Let's look at some XML:

```xml
<dog>
    <name>Pippa</name>
    <age>10</age>
    <diet>
        <fooditem>kibbles</fooditem>
        <fooditem>pumpkin</fooditem>
    </diet>
</dog>
```

The same data can also be stored in attributes:

```xml
<dog name="Pippa" age="10">
    <diet>
        <fooditem name="kibbles"></fooditem>
        <fooditem name="pumpkin"></fooditem>  
    </diet>
</dog>
```
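
The two layouts are read differently in Python. As a minimal sketch (using the standard library's `xml.etree.ElementTree`, with both snippets pasted in as strings), element content is read through `.text`, while attributes are read with `.get()`:

```python
import xml.etree.ElementTree as et

# Version 1: the data lives in the text content of child elements
dog_elements = et.fromstring("""
<dog>
    <name>Pippa</name>
    <age>10</age>
    <diet>
        <fooditem>kibbles</fooditem>
        <fooditem>pumpkin</fooditem>
    </diet>
</dog>
""")
name_from_text = dog_elements.find("name").text
foods_from_text = [item.text for item in dog_elements.iter("fooditem")]

# Version 2: the same data lives in attributes
dog_attributes = et.fromstring("""
<dog name="Pippa" age="10">
    <diet>
        <fooditem name="kibbles"></fooditem>
        <fooditem name="pumpkin"></fooditem>
    </diet>
</dog>
""")
name_from_attr = dog_attributes.get("name")
foods_from_attr = [item.get("name") for item in dog_attributes.iter("fooditem")]

print(name_from_text, foods_from_text)
print(name_from_attr, foods_from_attr)
```

Either way the values come back as strings, so the age would need `int(...)` before doing arithmetic with it.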

## Demo

- data: `olympics.xml`
- path: '/src/data/olympics.xml'
- description: characteristics of several host countries of the Summer Olympic Games

```xml
<?xml version="1.0"?>
<data>
  <country name="greece">
    <order>1</order>
    <year>1896</year>
    <nexthost name="france"></nexthost>
  </country>
  <country name="united states of america">
    <order>3</order>
    <year>1904</year>
    <previoushost name="france"></previoushost>
    <nexthost name="england"></nexthost>
  </country>
  <country name="australia">
    <order>27</order>
    <year>2000</year>
    <previoushost name="united states of america"></previoushost>
    <nexthost name="greece"></nexthost>
  </country>
</data>
```

In [70]:
# read in our data
import xml.etree.ElementTree as et
tree = et.parse('../data/olympics.xml')
tree

<xml.etree.ElementTree.ElementTree at 0x7f55089c7430>

In [71]:
# grab the root element of tree
root = tree.getroot()
root

<Element 'data' at 0x7f5508426c70>

We now have a handle on the root element. How do we begin exploring?
- what's the root element's tag?
- does the root element have any attributes? 
- does the root element contain any children?
- can we extend this knowledge?

How do we find out more? Check the [docs](https://docs.python.org/3/library/xml.etree.elementtree.html)

In [72]:
# explore some of its features
print("the root's tag is: ", root.tag)
print("the attributes of the root element: ", root.attrib)
print("the number of children of the root element: ", len(root))

the root's tag is:  data
the attributes of the root element:  {}
the number of children of the root element:  3


In [73]:
# root[0]
root[1]

<Element 'country' at 0x7f55085c39f0>

How might we go about displaying the attributes of each `country` tag?

In [55]:
# iterating over an element yields its direct children
for child in root:
    print("tag: {} || attributes: {}".format(child.tag, child.attrib))

tag: country || attributes: {'name': 'greece'}
tag: country || attributes: {'name': 'united states of america'}
tag: country || attributes: {'name': 'australia'}
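
If only one attribute is needed, a possible shortcut is to select children by tag with `findall()` and read the attribute with `get()`. The sketch below uses a trimmed in-memory copy of `olympics.xml` so that it runs on its own; note that `et.fromstring()` returns the root element directly, so no `getroot()` call is needed.

```python
import xml.etree.ElementTree as et

# Trimmed in-memory copy of olympics.xml so the sketch is self-contained
xml_text = """
<data>
  <country name="greece"><year>1896</year></country>
  <country name="united states of america"><year>1904</year></country>
  <country name="australia"><year>2000</year></country>
</data>
"""
root = et.fromstring(xml_text)

# findall() returns the direct children with the given tag;
# get() reads a single attribute and returns None if it is missing
host_names = [country.get("name") for country in root.findall("country")]
print(host_names)
```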


In [77]:
# grab the 0th country element
first_country = root[0]

# grab the element at index 1 (the <year> child) of the first country and display its content
first_country[1].text

'1896'
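
Positional indexing such as `first_country[1]` depends on the order of the children. A sketch of an alternative is to look children up by tag with `find()` and collect one record per country, which also brings us back to tabular data. The helper `host_name` is a hypothetical name introduced for this example, and the XML is pasted in as a string so the block runs without the file.

```python
import xml.etree.ElementTree as et
import pandas as pd

# In-memory copy of olympics.xml so the sketch is self-contained
xml_text = """<data>
  <country name="greece">
    <order>1</order>
    <year>1896</year>
    <nexthost name="france"></nexthost>
  </country>
  <country name="united states of america">
    <order>3</order>
    <year>1904</year>
    <previoushost name="france"></previoushost>
    <nexthost name="england"></nexthost>
  </country>
  <country name="australia">
    <order>27</order>
    <year>2000</year>
    <previoushost name="united states of america"></previoushost>
    <nexthost name="greece"></nexthost>
  </country>
</data>"""
root = et.fromstring(xml_text)


def host_name(country, tag):
    """Return the name attribute of the child with this tag, or None if absent."""
    child = country.find(tag)
    # compare with None explicitly: find() returns None when nothing matches
    return child.get("name") if child is not None else None


records = []
for country in root.findall("country"):
    records.append({
        "name": country.get("name"),
        "order": int(country.find("order").text),   # .text is a string
        "year": int(country.find("year").text),
        "previoushost": host_name(country, "previoushost"),
        "nexthost": host_name(country, "nexthost"),
    })

olympics = pd.DataFrame(records)
print(olympics)
```

Greece has no `previoushost` element, so `find()` returns `None` for it and the record stores a missing value instead of raising an error.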