# Types of data that are useful to data scientists

| Data Type | Example | Uses |
| :--------- | :------- | :---- |
| text | tweets, scripts, books | sentiment analysis, text generation, other natural language processing |
| JSON or XML | parsing APIs | gathering data, trend analysis, forecasting... |
| HTML | web scraping | gathering web-based document data, social media contacts... |
| images | computer vision | self-driving cars, medical imaging diagnostics |

## For today

- JSON: `json`
- XML: `xml`
- HTML: `xml` (`BeautifulSoup` not covered but good to be aware of)


### But first... a refresher on tabular data
**... and a small introduction to reading in `xlsx` files**

In [1]:
import pandas as pd

In [2]:
# using pandas to construct a dataframe from an xlsx file


In [3]:
# what if our file actually has multiple sheets?


# XML and HTML
- `html`: HyperText Markup Language
- `xml`: Extensible Markup Language
- both are hierarchical collections of elements
- an element generally consists of an opening tag, content, and a closing tag

Let's look at some HTML: [Wikipedia page for "Dogs"](https://en.wikipedia.org/wiki/Dog)


Let's look at some XML:

```xml
<dog>
    <name>Pippa</name>
    <age>10</age>
    <diet>
        <fooditem>kibbles</fooditem>
        <fooditem>pumpkin</fooditem>
    </diet>
</dog>
```

Or, perhaps, with attributes:

```xml
<dog name="Pippa" age="10">
    <diet>
        <fooditem name="kibbles"></fooditem>
        <fooditem name="pumpkin"></fooditem>  
    </diet>
</dog>
```
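
To make the difference between the two encodings concrete, here is a minimal sketch that parses both with the standard library's `xml.etree.ElementTree`. The variable names (`as_children`, `as_attributes`, and so on) are illustrative, not part of the original notebook.

```python
import xml.etree.ElementTree as ET

# The two encodings of the same dog shown above.
as_children = """<dog>
    <name>Pippa</name>
    <age>10</age>
    <diet>
        <fooditem>kibbles</fooditem>
        <fooditem>pumpkin</fooditem>
    </diet>
</dog>"""

as_attributes = """<dog name="Pippa" age="10">
    <diet>
        <fooditem name="kibbles"></fooditem>
        <fooditem name="pumpkin"></fooditem>
    </diet>
</dog>"""

dog_a = ET.fromstring(as_children)
dog_b = ET.fromstring(as_attributes)

# Content between tags is exposed as .text on the child element.
name_a = dog_a.find("name").text
foods_a = [item.text for item in dog_a.iter("fooditem")]

# Attributes are exposed through the dict-like .attrib of the element itself.
name_b = dog_b.attrib["name"]
foods_b = [item.get("name") for item in dog_b.iter("fooditem")]

print(name_a, foods_a)
print(name_b, foods_b)
```

Either way the values come back as strings, so `age` would still need an explicit `int(...)` conversion before any arithmetic.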

## Demo

- data: `olympics.xml`
- path: `/src/data/olympics.xml`
- description: characteristics of several host countries of the Summer Olympic Games

```xml
<?xml version="1.0"?>
<data>
  <country name="greece">
    <order>1</order>
    <year>1896</year>
    <nexthost name="france"></nexthost>
  </country>
  <country name="united states of america">
    <order>3</order>
    <year>1904</year>
    <previoushost name="france"></previoushost>
    <nexthost name="england"></nexthost>
  </country>
  <country name="australia">
    <order>27</order>
    <year>2000</year>
    <previoushost name="united states of america"></previoushost>
    <nexthost name="greece"></nexthost>
  </country>
</data>
```

In [4]:
# read in our data


In [5]:
# grab the root element of tree


We now have a handle on the root element. How do we begin exploring?
- what's the root element's tag?
- does the root element have any attributes?
- does the root element contain any children?
- can we ask the same questions of its children?

How do we find out more? Check the [docs](https://docs.python.org/3/library/xml.etree.elementtree.html)
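
One way to answer those questions is sketched below. It is only a sketch: the XML is pasted inline as a string (the name `OLYMPICS_XML` is an assumption made here) so the example runs without the data file.

```python
import xml.etree.ElementTree as ET

# Inline copy of olympics.xml so the sketch runs without the file.
# With the file on disk, ET.parse(path).getroot() yields the same root element.
OLYMPICS_XML = """<?xml version="1.0"?>
<data>
  <country name="greece">
    <order>1</order>
    <year>1896</year>
    <nexthost name="france"></nexthost>
  </country>
  <country name="united states of america">
    <order>3</order>
    <year>1904</year>
    <previoushost name="france"></previoushost>
    <nexthost name="england"></nexthost>
  </country>
  <country name="australia">
    <order>27</order>
    <year>2000</year>
    <previoushost name="united states of america"></previoushost>
    <nexthost name="greece"></nexthost>
  </country>
</data>"""

root = ET.fromstring(OLYMPICS_XML)

print(root.tag)     # the root element's tag
print(root.attrib)  # its attributes, as a dict (empty when there are none)
print(len(root))    # number of direct children

# The same three questions apply to every child element.
for child in root:
    print(child.tag, child.attrib, [grandchild.tag for grandchild in child])
```

An element behaves like a list of its direct children, which is why `len(root)` and `for child in root` both work.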

In [6]:
# explore some of its features


How might we go about displaying the attributes of each `country` tag?

In [7]:
# grab the 0th country element


# grab the element at index 1 of the first country and display its content
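
A possible way to fill in these steps is sketched here. To keep it short, it uses a trimmed two-country stand-in for `olympics.xml` (an assumption of this sketch, stored in the hypothetical name `TRIMMED_XML`). The indexing works the same way on the full file.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for olympics.xml: only the first two countries.
TRIMMED_XML = """<data>
  <country name="greece">
    <order>1</order>
    <year>1896</year>
    <nexthost name="france"></nexthost>
  </country>
  <country name="united states of america">
    <order>3</order>
    <year>1904</year>
    <previoushost name="france"></previoushost>
    <nexthost name="england"></nexthost>
  </country>
</data>"""

root = ET.fromstring(TRIMMED_XML)

# Display the attributes of each country tag.
for country in root.iter("country"):
    print(country.attrib)

# Elements support list-style indexing over their direct children.
first_country = root[0]          # the 0th country element
second_child = first_country[1]  # the child at index 1 of that country

print(second_child.tag, second_child.text)
```

Here `root[0][1]` is the `year` element of the first country, and `.text` holds its content as a string.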


# JSON

**J**ava**S**cript **O**bject **N**otation

A data-interchange format that is simple for humans and machines to read and write.

Data are stored in structures of key-value pairs.


```json
{
  "pizza": ["veggie", "pepperoni", "pineapple", "mushroom"],
  "price": [12.99, 14.00, 11.99, 16.00],
  "averageRating": [5, 4, 4.5, 4.9]
}
```

or, with each value keyed by its row index:

```json
{
  "pizza": {
    "0":"veggie",
    "1":"pepperoni",
    "2":"pineapple",
    "3":"mushroom"
  },
  "price": {
    "0":12.99,
    "1":14.0,
    "2":11.99,
    "3":16.0
  },
  "averageRating": {
    "0":5.0,
    "1":4.0,
    "2":4.5,
    "3":4.9
  }
}
```


When every key has the same number of items associated with it, parsing the data into a dataframe is straightforward.
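
As a rough illustration, both layouts above can be loaded with the standard `json` module and handed straight to `pd.DataFrame`. The variable names are made up for this sketch.

```python
import json

import pandas as pd

# Layout 1: each key maps to a list of values.
as_lists = """{
  "pizza": ["veggie", "pepperoni", "pineapple", "mushroom"],
  "price": [12.99, 14.00, 11.99, 16.00],
  "averageRating": [5, 4, 4.5, 4.9]
}"""

# Layout 2: each key maps to an object keyed by row index.
as_index_maps = """{
  "pizza": {"0": "veggie", "1": "pepperoni", "2": "pineapple", "3": "mushroom"},
  "price": {"0": 12.99, "1": 14.0, "2": 11.99, "3": 16.0},
  "averageRating": {"0": 5.0, "1": 4.0, "2": 4.5, "3": 4.9}
}"""

# json.loads turns JSON text into plain Python dicts and lists.
df_lists = pd.DataFrame(json.loads(as_lists))
df_maps = pd.DataFrame(json.loads(as_index_maps))

print(df_lists)
print(df_maps)
```

The two frames hold the same values. The one difference is the index: JSON object keys are always strings, so `df_maps` is indexed by `"0"` through `"3"` rather than by integers.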

## Nested JSON
however...

data: `content.json`

```json
{
  "articles": [
    {
      "name": "how to program in javascript",
      "author": "prairie",
      "wordCount": 1200
    },
    {
      "name": "should you rewrite your application in python?",
      "author": "pippa",
      "wordCount": 40000
    }
  ],
  "blogs": [
    {
      "title": "differences between js objects and python dictionaries",
      "postedBy": "jennifer"
    }
  ]
}
```

In such cases, we have to work a bit more to get our data into a format that is reasonable to work with.

Use the `pd.json_normalize` function to mold your data into a dataframe.

... but the result looks a little strange,

because conceptually "articles" and "blogs" are distinct.
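
A minimal sketch of what this looks like in practice, with the contents of `content.json` pasted inline so it runs on its own. Normalizing each top-level list separately is one reasonable approach, not the only one.

```python
import json

import pandas as pd

# Inline copy of content.json so the sketch runs without the file.
content = json.loads("""{
  "articles": [
    {"name": "how to program in javascript", "author": "prairie", "wordCount": 1200},
    {"name": "should you rewrite your application in python?", "author": "pippa", "wordCount": 40000}
  ],
  "blogs": [
    {"title": "differences between js objects and python dictionaries", "postedBy": "jennifer"}
  ]
}""")

# Normalizing the whole document gives a single row in which
# each cell holds an entire list of records.
flat = pd.json_normalize(content)
print(flat)

# Normalizing each list on its own gives one tidy frame per concept.
articles = pd.json_normalize(content["articles"])
blogs = pd.json_normalize(content["blogs"])
print(articles)
print(blogs)
```

Keeping `articles` and `blogs` as separate dataframes avoids forcing two record types with different fields into one table.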

How about a complex but uniform structure?

In [8]:
states = [
    {
        "state": "Florida",
        "abbr": "FL",
        "info": { "governor": "Ron DeSantis", "ltGovernor": "Jeanette Nunez"},
        "counties": [
            {"name": "Dade", "population": 12345},
            {"name": "Broward", "population": 400000},
            {"name": "Palm Beach", "population": 600000}
        ]
    },
    {
        "state": "Ohio",
        "abbr": "OH",
        "info": { "governor": "Mike DeWine"},
        "counties": [
            {"name": "Summit", "population": 50324},
            {"name": "Cuyahoga", "population": 200000}
        ]
    }
    
]
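
One way to flatten a structure like this is the `record_path` and `meta` parameters of `pd.json_normalize`. The sketch below repeats the `states` list so that it runs on its own. The choice of which fields to carry along as metadata is an assumption made for illustration. `ltGovernor` is left out of `meta` because Ohio has no such key, and requesting a missing key raises an error unless `errors="ignore"` is passed.

```python
import pandas as pd

states = [
    {
        "state": "Florida",
        "abbr": "FL",
        "info": {"governor": "Ron DeSantis", "ltGovernor": "Jeanette Nunez"},
        "counties": [
            {"name": "Dade", "population": 12345},
            {"name": "Broward", "population": 400000},
            {"name": "Palm Beach", "population": 600000},
        ],
    },
    {
        "state": "Ohio",
        "abbr": "OH",
        "info": {"governor": "Mike DeWine"},
        "counties": [
            {"name": "Summit", "population": 50324},
            {"name": "Cuyahoga", "population": 200000},
        ],
    },
]

# record_path: the nested list that becomes the rows (one row per county).
# meta: fields from the enclosing state to repeat on every county row;
#       a nested field is given as a path, e.g. ["info", "governor"].
counties = pd.json_normalize(
    states,
    record_path="counties",
    meta=["state", "abbr", ["info", "governor"]],
)

print(counties)
```

The result has one row per county, with the state-level fields repeated alongside it. The nested governor field appears as the column `info.governor`.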