# 201109 Extract additional taxonomy data

In [1]:
from pathlib import Path
import json
import xml.etree.ElementTree as ET
import re
from collections import Counter

from tqdm import tqdm

## File paths

In [2]:
outdir = Path('../../data/intermediate/201031-database-v1.1-software-version-migration/201109-extract-additional-taxonomy-data')
outdir.mkdir(parents=True, exist_ok=True)  # also create missing parent directories

## Load XML data

In [3]:
taxon_xml = dict()  # taxonomy ID -> root <Taxon> element

# Each file is named <taxid>.xml and holds a single <Taxon> record
for file in tqdm(list(Path('tmp/taxa').glob('*.xml'))):
    taxid = int(re.fullmatch(r'(\d+)\.xml', file.name).group(1))
    with file.open() as f:
        txml = ET.parse(f).getroot()
        assert txml.tag == 'Taxon'
        taxon_xml[taxid] = txml

100%|██████████| 22498/22498 [00:04<00:00, 5147.22it/s]


## Inspect schema

It is unclear what other information in the original XML data wasn't transferred to JSON. To find out, recursively collect all unique tag names in the data.

Note: realized afterwards that there is a DTD file for the taxonomy database XML format [here](https://www.ncbi.nlm.nih.gov/entrez/query/DTD/taxon.dtd), but it doesn't contain any additional information.

In [4]:
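def _record_schema(elem, schema):
    """Recursively merge the tag structure of elem into the nested schema dict."""
    try:
        subschema = schema[elem.tag]
    except KeyError:
        subschema = schema[elem.tag] = {}

    for child in elem:
        _record_schema(child, subschema)


def deconstruct_schema(examples):
    """Build a nested dict of all tag names found across the example elements."""
    schema = {}

    for ex in examples:
        _record_schema(ex, schema)

    return schema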
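
As a quick sanity check, the same recursion can be run on a small hand-written fragment. This is only a minimal sketch: the two XML records are made up for illustration and are not taken from the taxonomy data, and `record_schema` is a hypothetical compact stand-in for `_record_schema` that uses `dict.setdefault` instead of `try`/`except`.

```python
import xml.etree.ElementTree as ET


def record_schema(elem, schema):
    """Merge the tag structure of elem into schema (nested dicts keyed by tag name)."""
    subschema = schema.setdefault(elem.tag, {})
    for child in elem:
        record_schema(child, subschema)


# Two made-up records with different optional children
examples = [
    ET.fromstring('<Taxon><TaxId>1</TaxId><OtherNames><Synonym>a</Synonym></OtherNames></Taxon>'),
    ET.fromstring('<Taxon><TaxId>2</TaxId><Rank>species</Rank></Taxon>'),
]

toy_schema = {}
for ex in examples:
    record_schema(ex, toy_schema)

print(toy_schema)
# {'Taxon': {'TaxId': {}, 'OtherNames': {'Synonym': {}}, 'Rank': {}}}
```

Tags that appear in only some records still end up in the merged result, which is why this approach finds optional children such as `<OtherNames>`.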

In [5]:
schema = deconstruct_schema(taxon_xml.values())

In [6]:
schema

{'Taxon': {'TaxId': {},
  'ScientificName': {},
  'ParentTaxId': {},
  'Rank': {},
  'Division': {},
  'GeneticCode': {'GCId': {}, 'GCName': {}},
  'MitoGeneticCode': {'MGCId': {}, 'MGCName': {}},
  'Lineage': {},
  'LineageEx': {'Taxon': {'TaxId': {}, 'ScientificName': {}, 'Rank': {}}},
  'CreateDate': {},
  'UpdateDate': {},
  'PubDate': {},
  'OtherNames': {'Name': {'ClassCDE': {}, 'DispName': {}},
   'EquivalentName': {},
   'Synonym': {},
   'Includes': {},
   'GenbankSynonym': {},
   'GenbankCommonName': {},
   'BlastName': {},
   'Inpart': {},
   'CommonName': {},
   'Acronym': {},
   'GenbankAcronym': {}},
  'AkaTaxIds': {'TaxId': {}}}}

* `<TaxId>`, `<ParentTaxId>`, `<AkaTaxIds>`, `<ScientificName>`, `<Rank>`, `<CreateDate>`, `<UpdateDate>`, `<PubDate>` were used in the previous JSON export.
* `<GeneticCode>` and `<MitoGeneticCode>` not relevant.
* `<Lineage>` redundant with `<LineageEx>` and both can be reconstructed using parent ID relationships.
* Extracting data from `<OtherNames>` is what this notebook is primarily for. The schema under this is simple except for `<Name>` tags.
* `<Division>` - not sure what this is.
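
The claim that lineages can be reconstructed from parent ID relationships can be sketched as follows. This is an illustration under assumptions, not code from this notebook: `lineage_from_parents` and `toy_parents` are hypothetical names, and the IDs are made up. With the real data, the parent mapping could presumably be built from the `<ParentTaxId>` text of each record, though parents that are not among the loaded files would cut the walk short.

```python
def lineage_from_parents(taxid, parent_of):
    """Return the taxon IDs from the topmost known ancestor down to taxid."""
    path = []
    seen = set()
    current = taxid
    # Stop when a parent is missing from the mapping, or when an ID repeats
    # (guards against a root that lists itself as its own parent)
    while current is not None and current not in seen:
        seen.add(current)
        path.append(current)
        current = parent_of.get(current)
    return path[::-1]


# Made-up hierarchy: 1 -> 10 -> 100 -> 1000, where 1 is its own parent
toy_parents = {1: 1, 10: 1, 100: 10, 1000: 100}

print(lineage_from_parents(1000, toy_parents))
# [1, 10, 100, 1000]
```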

### Inspect `<Division>` values

In [7]:
Counter(txml.findtext('./Division') for txml in taxon_xml.values())

Counter({'Bacteria': 22496, 'Unassigned': 2})

All but two taxa are in the `Bacteria` division, so this field doesn't seem useful.

### Inspect `<Name>` child tag values

`<Name>` tags contain `<ClassCDE>` and `<DispName>` children, as opposed to the other child tags of `<OtherNames>`, which just contain text. I suspect that the `<ClassCDE>` tag has only a few possible values, which would make it easier to export the contents of all these tags to a unified JSON format.

In [8]:
Counter(el.text for txml in taxon_xml.values() for el in txml.findall('./OtherNames/Name/ClassCDE'))

Counter({'authority': 4802,
         'type material': 23178,
         'misspelling': 1610,
         'unpublished name': 9})

## Convert `<OtherNames>` data to JSON

In [9]:
othernames_json = dict()

for taxid, txml in taxon_xml.items():
    entries = []
    
    for el in txml.findall('./OtherNames/*'):
        if el.tag == 'Name':
            _type = el.findtext('./ClassCDE')
            _name = el.findtext('./DispName')
            assert _type and _name
            entries.append(dict(type=_type, name=_name))
        else:
            assert el.text
            entries.append(dict(type=el.tag, name=el.text))
    
    if entries:
        othernames_json[taxid] = entries
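
To make the unified format concrete, here is a minimal sketch of the same flattening applied to a hand-written fragment. The XML and the names in it are made up for illustration, and `othernames_entries` is a hypothetical helper that mirrors the loop above without the assertions.

```python
import xml.etree.ElementTree as ET


def othernames_entries(txml):
    """Flatten the children of <OtherNames> into a list of {type, name} dicts."""
    entries = []
    for el in txml.findall('./OtherNames/*'):
        if el.tag == 'Name':
            # <Name> wraps its type and display text in child tags
            entries.append(dict(type=el.findtext('./ClassCDE'), name=el.findtext('./DispName')))
        else:
            # All other tags hold the name as text; the tag name serves as the type
            entries.append(dict(type=el.tag, name=el.text))
    return entries


toy = ET.fromstring(
    '<Taxon><OtherNames>'
    '<Synonym>Example synonym</Synonym>'
    '<Name><ClassCDE>misspelling</ClassCDE><DispName>Exampel name</DispName></Name>'
    '</OtherNames></Taxon>'
)

print(othernames_entries(toy))
# [{'type': 'Synonym', 'name': 'Example synonym'}, {'type': 'misspelling', 'name': 'Exampel name'}]
```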

In [10]:
with open(outdir / 'taxon-othernames.json', 'w') as f:
    json.dump(othernames_json, f)
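
One thing to keep in mind when reading this file back: `json.dump` writes dictionary keys as strings, so the integer taxonomy IDs come back as `str`. A minimal sketch of restoring them, using a made-up entry and an in-memory round trip rather than the actual output file:

```python
import json

# Made-up entry in the same shape as othernames_json
toy = {12345: [{'type': 'Synonym', 'name': 'Example synonym'}]}

# Round trip through JSON text; integer keys are serialized as strings
roundtrip = json.loads(json.dumps(toy))
print(list(roundtrip))
# ['12345']

# Convert the keys back to integers after loading
restored = {int(k): v for k, v in roundtrip.items()}
print(restored == toy)
# True
```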