# core

> Reading XML inputs and writing JSON outputs.

In [1]:
#| default_exp core

In [2]:
#| hide
%load_ext autoreload
%autoreload 2

In [3]:
#| hide
from fastcore.test import *
from nbdev.showdoc import *

## XML
XML is a markup language used to store and transport data. It is both human-readable and machine-readable.  It is flexible: new tags and attributes can be defined as needed.  It is portable and platform-independent, and can be read as plain text or in a variety of editors and parsers.

However, it is verbose, complex, hard to debug, and awkward to query directly.  The goal of this project is to convert XML files to (semi)relational data formats that can support data warehousing and reporting solutions.

This module defines a custom XML parser that produces JSON outputs.


```{mermaid}
flowchart LR
title[XML Parsing Workflow]
  A[XML file] --> B(python\ndict)
  B --> C{remove all lists}
  C --> D[JSON run-level metadata]
  C --> E[JSON lists with run results]

```

## XML Example: Mass-Spectrometry

Let's have a look at a sample XML file.  Here, XML is used to record mass-spectrometry data (`.mzML` files).

In [4]:
import json
import requests
import xmltodict

`xml_data` is a bytestring with the contents of the downloaded XML file:

In [5]:
xml_url = "https://raw.githubusercontent.com/ProteoWizard/pwiz/master/example_data/tiny.pwiz.1.1.1.mzML"
xml_data = requests.get(xml_url).content
xml_data[:2048]

b'<?xml version="1.0" encoding="ISO-8859-1"?>\n<indexedmzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.1_idx.xsd">\n  <mzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.0.xsd" id="urn:lsid:psidev.info:mzML.instanceDocuments.tiny.pwiz" version="1.1.0">\n    <cvList count="2">\n      <cv id="MS" fullName="Proteomics Standards Initiative Mass Spectrometry Ontology" version="2.33.1" URI="http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/controlledVocabulary/psi-ms.obo"/>\n      <cv id="UO" fullName="Unit Ontology" version="11:02:2010" URI="http://obo.cvs.sourceforge.net/*checkout*/obo/obo/ontology/phenotype/unit.obo"/>\n    </cvList>\n    <fileDescription>\n      <fileContent>\n        <cvParam cvRe

In [6]:
#| hide
test_eq(xml_data[:64], b'<?xml version="1.0" encoding="ISO-8859-1"?>\n<indexedmzML xmlns="')

### Reading as a dictionary and converting to JSON
Parsing with `xmltodict` is fast and yields a Python dictionary, which can be readily converted to JSON.  
Among other things, the structure is easier to see in pretty-printed JSON:

In [7]:
xml_dict = xmltodict.parse(xml_data, attr_prefix="")
print(json.dumps(xml_dict, indent=4)[:2048])

{
    "indexedmzML": {
        "xmlns": "http://psi.hupo.org/ms/mzml",
        "xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance",
        "xsi:schemaLocation": "http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.1_idx.xsd",
        "mzML": {
            "xmlns": "http://psi.hupo.org/ms/mzml",
            "xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance",
            "xsi:schemaLocation": "http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.0.xsd",
            "id": "urn:lsid:psidev.info:mzML.instanceDocuments.tiny.pwiz",
            "version": "1.1.0",
            "cvList": {
                "count": "2",
                "cv": [
                    {
                        "id": "MS",
                        "fullName": "Proteomics Standards Initiative Mass Spectrometry Ontology",
                        "version": "2.33.1",
                        "URI": "http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/contr
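
The `attr_prefix=""` argument used above is why attribute names such as `count` and `id` appear without a leading `@`, which is the prefix `xmltodict` adds by default. A minimal sketch of the difference, using a hand-written snippet (not taken from the downloaded file) that imitates the `cvList` element:

```python
import xmltodict

# Hypothetical snippet shaped like the cvList element in the mzML file.
snippet = '<cvList count="2"><cv id="MS"/><cv id="UO"/></cvList>'

# Default behaviour: attribute keys are prefixed with "@".
default = xmltodict.parse(snippet)

# With attr_prefix="", attribute keys keep their plain names.
flat = xmltodict.parse(snippet, attr_prefix="")

print(default["cvList"]["@count"])             # attribute under the "@count" key
print(flat["cvList"]["count"])                 # same attribute under the "count" key
print([cv["id"] for cv in flat["cvList"]["cv"]])  # repeated <cv> tags form a list
```

Dropping the prefix gives cleaner keys, at the cost that an attribute and a child element with the same name would no longer be distinguishable.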

## Separating dicts and lists
A JSON object is a collection of `key: value` pairs, where the `value` can be one of:  
- a single value (a string starting with "`"`", a number, `true`, `false`, or `null`);
- a list (starts with "`[`");
- a nested object, i.e. another branch (starts with "`{`").

In most scenarios, dicts and single values describe the run, while lists hold repeated arrays of measurements.  It makes sense to pull them apart.
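
One way the split described above could be sketched is a recursive walk that copies dicts and single values into a metadata tree and moves every list into a separate mapping keyed by its path. The function name `split_lists`, the `/`-joined path keys, and the sample data are all hypothetical; this is an illustration under those assumptions, not the module's final implementation.

```python
def split_lists(node, path=""):
    """Split a nested dict into (metadata, lists).

    `metadata` mirrors `node` with every list removed;
    `lists` maps the "/"-joined path of each removed list to the list itself.
    Lists are moved as-is, so any lists nested inside them are not split further.
    """
    meta, lists = {}, {}
    for key, value in node.items():
        child_path = f"{path}/{key}" if path else key
        if isinstance(value, list):
            lists[child_path] = value          # repeated measurements
        elif isinstance(value, dict):
            sub_meta, sub_lists = split_lists(value, child_path)
            meta[key] = sub_meta               # keep the branch, minus its lists
            lists.update(sub_lists)
        else:
            meta[key] = value                  # single value describing the run
    return meta, lists


# Hypothetical input, shaped like a small part of a parsed mzML file.
sample = {
    "run": {
        "id": "Experiment 1",
        "spectrumList": {
            "count": "2",
            "spectrum": [{"index": "0"}, {"index": "1"}],
        },
    }
}

meta, lists = split_lists(sample)
print(meta)
print(lists)
```

Here `meta` keeps `id` and `count`, while the `spectrum` list ends up under the key `run/spectrumList/spectrum`, so the two parts can be written to separate JSON outputs.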

In [8]:
# to be continued...

In [9]:
#| export
def foo(): pass

In [10]:
#| hide
import nbdev; nbdev.nbdev_export()