# XML, JSON, and Recursion
Agenda today:
- Introduction to XML (Extensible Markup Language)
- Introduction to JSON (JavaScript Object Notation)
    - Working with JSON in Python
    - Working with a real-world dataset
- Recursive functions



Students will be able to...
- Parse and iterate through an XML file using the ElementTree library
- Parse and iterate through a JSON file using the json library 
- Explain the difference between a recursive and an iterative function

#### Why JSON/XML?
In the past, we have worked with clean tabular data, where the rows are observations and the columns are features, such as our housing data. However, data collected from the web (through web scraping or an API call) are usually messy and arrive as JSON or XML. As data scientists, we are expected to know how to turn messy nested dictionaries into tables for querying, cleaning, and modeling.

### Part I. XML - Extensible Markup Language

In [1]:
import xml.etree.ElementTree as ET

In [6]:
snippet = '''
<xs:element name="name">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="name">
        <xs:simpleType>
          <xs:restriction base="xs:string"></xs:restriction>
        </xs:simpleType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>'''

In [8]:
tree = ET.fromstring(snippet)

ParseError: unbound prefix: line 2, column 0 (<string>)
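
The `unbound prefix` error is raised because the snippet uses the `xs:` prefix without declaring which namespace it stands for. Here is a minimal sketch of one way around it, assuming `xs` is meant to be the W3C XML Schema namespace (the snippet itself does not say so). The names `XS_NS` and `fixed_snippet` are introduced here for illustration only.

```python
import xml.etree.ElementTree as ET

# Assumption: the xs prefix refers to the W3C XML Schema namespace.
XS_NS = "http://www.w3.org/2001/XMLSchema"

# Same snippet as above, with the prefix declared on the root element
fixed_snippet = f'''
<xs:element name="name" xmlns:xs="{XS_NS}">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="name">
        <xs:simpleType>
          <xs:restriction base="xs:string"></xs:restriction>
        </xs:simpleType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>'''

root = ET.fromstring(fixed_snippet)

# iter() walks the root and all of its descendants in document order
for elem in root.iter():
    print(elem.tag, elem.attrib)

tags = [elem.tag for elem in root.iter()]
```

ElementTree expands each prefixed tag to the form `{namespace}localname`, so the tags printed here start with the namespace URI in braces rather than `xs:`.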

### Part II. JSON - JavaScript Object Notation

In [None]:
import json
import pandas as pd

In [2]:
with open('12traits_biathlon_data.json','r') as f:
    file = json.load(f)  # parse the open file directly into Python objects
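
The `json` library offers two closely related parsers: `json.loads` takes a string, while `json.load` takes an open file (or any file-like object). The sketch below compares them on a hypothetical one-target string that only imitates the shape of the biathlon data. An `io.StringIO` object stands in for a real file, so nothing is read from disk.

```python
import io
import json

# Hypothetical miniature of the dataset's structure, for illustration only
text = '{"silhouette_targets": [{"name": "A", "shots": [{"x": 0.1, "y": 0.2}]}]}'

from_string = json.loads(text)             # parse a str
from_file = json.load(io.StringIO(text))   # parse a file-like object

# JSON objects become dicts and JSON arrays become lists
print(type(from_string))
print(type(from_string["silhouette_targets"]))
print(from_string == from_file)
```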

In [13]:
file['silhouette_targets'][0]

{'name': 'Persson',
 'shots': [{'x': 0.30481398618938044, 'y': 0.4769089569224076},
  {'x': 0.08254752672437815, 'y': 0.6231545854419579},
  {'x': -0.06570703868287404, 'y': 0.18276801307578938},
  {'x': -0.19950536224575632, 'y': 0.04277514867586857},
  {'x': -0.48209052446388273, 'y': -0.1486862587722918},
  {'x': -0.051966818786095756, 'y': -0.20272205113575173},
  {'x': 0.4761007040959099, 'y': 0.25226710985403933},
  {'x': 0.8388861454357874, 'y': 0.4382699065638803},
  {'x': -0.39174138324186564, 'y': 0.43270940550370157},
  {'x': -0.2090816799767541, 'y': 0.6042319463139916},
  {'x': -0.20956350666407814, 'y': 0.8756834553526813},
  {'x': -0.18699128521674976, 'y': 0.31834164130396214}]}

In [11]:
file

{'silhouette_targets': [{'name': 'Persson',
   'shots': [{'x': 0.30481398618938044, 'y': 0.4769089569224076},
    {'x': 0.08254752672437815, 'y': 0.6231545854419579},
    {'x': -0.06570703868287404, 'y': 0.18276801307578938},
    {'x': -0.19950536224575632, 'y': 0.04277514867586857},
    {'x': -0.48209052446388273, 'y': -0.1486862587722918},
    {'x': -0.051966818786095756, 'y': -0.20272205113575173},
    {'x': 0.4761007040959099, 'y': 0.25226710985403933},
    {'x': 0.8388861454357874, 'y': 0.4382699065638803},
    {'x': -0.39174138324186564, 'y': 0.43270940550370157},
    {'x': -0.2090816799767541, 'y': 0.6042319463139916},
    {'x': -0.20956350666407814, 'y': 0.8756834553526813},
    {'x': -0.18699128521674976, 'y': 0.31834164130396214}]},
  {'name': 'Dahlmeier',
   'shots': [{'x': 0.21946660037482163, 'y': 0.293897627400934},
    {'x': 0.2945606655599632, 'y': 0.2424010485271275},
    {'x': -0.04325340451740329, 'y': 0.0014464374879283107},
    {'x': 0.10292705900243555, 'y': -0.01

In [9]:
for key in file.keys():
    print(key)

silhouette_targets


In [10]:
for value in file.values():
    print(value)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
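
The rate-limit message appears because printing every value dumps the whole dataset to the notebook. A gentler way to explore a large nested structure is to print a summary and one sample record instead. The sketch below uses a hypothetical stand-in called `sample` with the same shape as `file`. The helper `summarize` is also made up for illustration and assumes every top-level value has a length.

```python
# Hypothetical stand-in with the same shape as `file`
sample = {
    "silhouette_targets": [
        {"name": "A", "shots": [{"x": 0.1, "y": 0.2}, {"x": -0.3, "y": 0.4}]},
        {"name": "B", "shots": [{"x": 0.0, "y": 0.5}]},
    ]
}

def summarize(data):
    """Return (key, type name, length) for each top-level entry.

    Assumes every top-level value supports len().
    """
    return [(key, type(value).__name__, len(value)) for key, value in data.items()]

summary = summarize(sample)
print(summary)

# Look at a single record rather than the full list
print(sample["silhouette_targets"][0])
```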



In [26]:
df = pd.DataFrame.from_dict(file['silhouette_targets'])
df.head()

        name                                              shots
0    Persson  [{'x': 0.30481398618938044, 'y': 0.47690895692...
1  Dahlmeier  [{'x': 0.21946660037482163, 'y': 0.29389762740...
2    Persson  [{'x': 0.021290847316312672, 'y': -0.167043218...
3  Dahlmeier  [{'x': -0.1157085563035615, 'y': -0.0319162290...
4     Berger  [{'x': 0.09017508203998442, 'y': 0.08436344424...


In [29]:
df.isnull().sum()

name     0
shots    0
dtype: int64

In [23]:
# next, unpack the shots so we can do feature engineering on them
df.columns

Index(['name', 'shots'], dtype='object')

In [25]:
df.shots[0]

[{'x': 0.30481398618938044, 'y': 0.4769089569224076},
 {'x': 0.08254752672437815, 'y': 0.6231545854419579},
 {'x': -0.06570703868287404, 'y': 0.18276801307578938},
 {'x': -0.19950536224575632, 'y': 0.04277514867586857},
 {'x': -0.48209052446388273, 'y': -0.1486862587722918},
 {'x': -0.051966818786095756, 'y': -0.20272205113575173},
 {'x': 0.4761007040959099, 'y': 0.25226710985403933},
 {'x': 0.8388861454357874, 'y': 0.4382699065638803},
 {'x': -0.39174138324186564, 'y': 0.43270940550370157},
 {'x': -0.2090816799767541, 'y': 0.6042319463139916},
 {'x': -0.20956350666407814, 'y': 0.8756834553526813},
 {'x': -0.18699128521674976, 'y': 0.31834164130396214}]
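
Each entry in `shots` is a list of `{'x': ..., 'y': ...}` dictionaries, so one way to unpack it is to emit one row per shot. The sketch below is one possible approach, not the only one. It runs on a hypothetical two-target list, `sample_targets`, shaped like `file['silhouette_targets']`. The helper `flatten_shots` and the column names `target_id` and `shot_number` are assumptions made for illustration.

```python
# Hypothetical stand-in shaped like file['silhouette_targets']
sample_targets = [
    {"name": "A", "shots": [{"x": 0.1, "y": 0.2}, {"x": -0.3, "y": 0.4}]},
    {"name": "B", "shots": [{"x": 0.0, "y": 0.5}]},
]

def flatten_shots(targets):
    """Turn nested target records into one flat dict per shot."""
    rows = []
    for target_id, target in enumerate(targets):
        for shot_number, shot in enumerate(target["shots"]):
            rows.append({
                "target_id": target_id,      # position of the target in the list
                "name": target["name"],
                "shot_number": shot_number,  # position of the shot within its target
                "x": shot["x"],
                "y": shot["y"],
            })
    return rows

rows = flatten_shots(sample_targets)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` should give a flat table with one shot per row, which is a convenient starting point for features such as distance from the center. pandas also provides `pd.json_normalize` with the `record_path` and `meta` parameters as a built-in alternative to the hand-written loop.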