# CONCORD-3 - Metadata
## Global surveillance of trends in cancer survival 2000–14
###### Analysis of individual records for 37,513,025 patients diagnosed with one of 18 cancers from 322 population-based registries in 71 countries.

- De Miguel González, Gerardo E.
- Ferreño Blanco, Diego
- Prado Rujas, Ignacio Iker
- Ruiz Martínez, Estela
- Zamudio López, Manuel

## Creating metadata with `dcxml`

Load the required module, [simpledc](https://dcxml.readthedocs.io/en/latest/):

In [1]:
from dcxml import simpledc

These are the available namespaces:

In [2]:
simpledc.ns

{'dc': 'http://purl.org/dc/elements/1.1/',
 'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
 'xml': 'xml',
 'xsi': 'http://www.w3.org/2001/XMLSchema-instance'}

More can be added if necessary:

In [3]:
simpledc.ns['Test'] = 'New namespace'
display(simpledc.ns)
# Delete it
_ = simpledc.ns.pop('Test')

{'dc': 'http://purl.org/dc/elements/1.1/',
 'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
 'xml': 'xml',
 'xsi': 'http://www.w3.org/2001/XMLSchema-instance',
 'Test': 'New namespace'}

First, we create a dictionary containing the basic set of elements comprising **Dublin Core**:

In [4]:
metadata_dc = dict(
    # An entity responsible for making contributions to the resource:
    contributors = ['Gerardo E. De Miguel González',
                    'Diego Ferreño Blanco',
                    'Ignacio Iker Prado Rujas',
                    'Estela Ruiz Martínez',
                    'Manuel Zamudio López'],
    # The spatial or temporal topic of the resource, the spatial
    # applicability of the resource, or the jurisdiction under which 
    # the resource is relevant:
    coverage = ['Africa', 
                'America (Central and South)', 
                'America (North)',
                'Asia',
                'Europe',
                'Oceania'],
    # An entity primarily responsible for making the resource:
    creators = ['Cancer Survival Group',
                'London School of Hygiene & Tropical Medicine'],
    # A point or period of time associated with an event in the
    # lifecycle of the resource:
    dates = [str(i) for i in range(2000, 2015)],
    # An account of the resource:
    descriptions = ["""CONCORD data, divided into: Continent, Country, Registry, Year and Sex.
                    CONCORD is the program for global surveillance of cancer survival, 
                    led by the London School of Hygiene & Tropical Medicine. The CONCORD 
                    (Global Surveillance of Cancer Survival) program is endorsed by more 
                    than 40 international agencies, including the OECD (Organization for 
                    Economic Co-operation and Development) of the WHO (World Health Organization)
                    and the World Bank."""],
    # The file format, physical medium, or dimensions of the resource:
    formats = ['data/csv'],
    # An unambiguous reference to the resource within a given context:
    identifiers = ['10.5281/zenodo.xxxxxxx',
                   'CONCORD-3',
                   'CONCORD'],
    # A language of the resource:
    languages = ['en'],
    # An entity responsible for making the resource available:
    publishers = ['London School of Hygiene & Tropical Medicine'],
    # A related resource:
    relations = ['https://doi.org/10.1016/S0140-6736(17)33326-3',
                 'CONCORD',
                 'CONCORD-2'],
    # Information about rights held in and over the resource:
    rights = ['Attribution-NonCommercial-ShareAlike'],
    # A related resource from which the described resource is derived:
    sources = ['https://csg.lshtm.ac.uk/life-tables/'],
    # The topic of the resource:
    subjects = ['cancer',
               'cancer-surveillance',
               'concord',
               'concord-3'],
    # A name given to the resource:
    titles = ['CONCORD-3: Global surveillance of cancer survival'],
    # The nature or genre of the resource:
    types = ['data',
             'software tools']
)
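
Since the dictionary is typed by hand, a quick sanity check before conversion can catch misspelled keys or non-string values. The following is a minimal sketch, assuming the plural key names used in `metadata_dc` above; the helper `check_dc_dict` and the constant `DC_KEYS` are hypothetical names introduced here and are not part of `dcxml`.

```python
# The 15 Dublin Core elements, written with the plural key names
# that appear in the metadata_dc dictionary.
DC_KEYS = {
    'contributors', 'coverage', 'creators', 'dates', 'descriptions',
    'formats', 'identifiers', 'languages', 'publishers', 'relations',
    'rights', 'sources', 'subjects', 'titles', 'types',
}

def check_dc_dict(metadata):
    """Return a list of problems found in a simple Dublin Core dictionary."""
    problems = []
    for key, values in metadata.items():
        if key not in DC_KEYS:
            problems.append(f'unknown key: {key}')
        elif not isinstance(values, list) or not all(
                isinstance(v, str) for v in values):
            problems.append(f'{key} should be a list of strings')
    return problems

# 'dates' holds an int and 'Country' is not a basic Dublin Core element
sample = {'titles': ['CONCORD-3'], 'dates': [2013], 'Country': ['Spain']}
print(check_dc_dict(sample))
```

An empty list means that the dictionary only uses the basic element names and that every value is a list of strings.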

Now we convert the dictionary into an `xml` string:

In [5]:
xml = simpledc.tostring(metadata_dc)
print(xml)

<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:contributor>Gerardo E. De Miguel González</dc:contributor>
  <dc:contributor>Diego Ferreño Blanco</dc:contributor>
  <dc:contributor>Ignacio Iker Prado Rujas</dc:contributor>
  <dc:contributor>Estela Ruiz Martínez</dc:contributor>
  <dc:contributor>Manuel Zamudio López</dc:contributor>
  <dc:coverage>Africa</dc:coverage>
  <dc:coverage>America (Central and South)</dc:coverage>
  <dc:coverage>America (North)</dc:coverage>
  <dc:coverage>Asia</dc:coverage>
  <dc:coverage>Europe</dc:coverage>
  <dc:coverage>Oceania</dc:coverage>
  <dc:creator>Cancer Survival Group</dc:creator>
  <dc:creator>London School of Hygiene &amp; Tropical Medicine</dc:creator>
  <dc:date>2000</

We now convert the xml string into a `tree` from `xml.etree.ElementTree`:

In [6]:
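import xml.etree.ElementTree as ET
# ET.fromstring returns the root element; wrap it in an ElementTree
# so that it can be written to a file later on
tree = ET.ElementTree(ET.fromstring(xml))
# getiterator() was removed in Python 3.9; iter() is its replacement
for table in tree.iter('{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'):
    for child in table:
        # Drop the 34-character '{http://purl.org/dc/elements/1.1/}' prefix
        print(child.tag[34:], '->', child.text)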

contributor -> Gerardo E. De Miguel González
contributor -> Diego Ferreño Blanco
contributor -> Ignacio Iker Prado Rujas
contributor -> Estela Ruiz Martínez
contributor -> Manuel Zamudio López
coverage -> Africa
coverage -> America (Central and South)
coverage -> America (North)
coverage -> Asia
coverage -> Europe
coverage -> Oceania
creator -> Cancer Survival Group
creator -> London School of Hygiene & Tropical Medicine
date -> 2000
date -> 2001
date -> 2002
date -> 2003
date -> 2004
date -> 2005
date -> 2006
date -> 2007
date -> 2008
date -> 2009
date -> 2010
date -> 2011
date -> 2012
date -> 2013
date -> 2014
description -> CONCORD data, divided into: Continent, Country, Registry, Year and Sex.
                    CONCORD is the program for global surveillance of cancer survival, 
                    led by the London School of Hygiene & Tropical Medicine. The CONCORD 
                    (Global Surveillance of Cancer Survival) program is endorsed by more 
                    than 
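
The slice `child.tag[34:]` works only because the Dublin Core namespace, wrapped in braces, happens to be 34 characters long. A possible alternative that does not depend on the namespace length is sketched below; `local_name` is a hypothetical helper, and the small `record` document is made up for the example.

```python
import xml.etree.ElementTree as ET

DC_NS = 'http://purl.org/dc/elements/1.1/'

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts before a tag."""
    return tag.rsplit('}', 1)[-1]

# Tiny stand-in document with two Dublin Core elements
root = ET.fromstring(
    f'<record xmlns:dc="{DC_NS}">'
    '<dc:creator>Cancer Survival Group</dc:creator>'
    '<dc:date>2000</dc:date>'
    '</record>'
)
pairs = [(local_name(child.tag), child.text) for child in root]
for name, text in pairs:
    print(name, '->', text)
```

Tags without a namespace pass through unchanged, so the same helper can be used on mixed documents.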

We can add more elements to obtain an extended version of **Dublin Core**:

In [7]:
root = tree.getroot()
a = ET.SubElement(root, '{http://purl.org/dc/elements/1.1/}test-tag')
a.text = 'This is a test'

namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'}
test = tree.find('dc:test-tag', namespaces=namespaces)
print(test.tag[34:], '->', test.text)

test-tag -> This is a test


Finally, we can save it into an `xml` file to be used in the desired repository:

In [8]:
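# Pass the file name so that ElementTree opens and closes the file itself,
# and write UTF-8 explicitly so the accented names survive on any platform
tree.write('concord-dc.xml', encoding='utf-8', xml_declaration=True)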
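
One detail to keep in mind when writing with `xml.etree.ElementTree`: the standard-library serializer generates prefixes such as `ns0` for namespaces it does not know, so the `oai_dc` prefix may be lost in the saved file. The sketch below shows one way to keep it with `ET.register_namespace`. It builds a tiny tree by hand instead of using `dcxml`, so the element content is only illustrative.

```python
import xml.etree.ElementTree as ET

OAI_DC = 'http://www.openarchives.org/OAI/2.0/oai_dc/'
DC = 'http://purl.org/dc/elements/1.1/'

# Register the prefixes globally so that serialization reuses them
ET.register_namespace('oai_dc', OAI_DC)
ET.register_namespace('dc', DC)

# Hand-built stand-in for the tree produced from the dictionary
root = ET.Element(f'{{{OAI_DC}}}dc')
title = ET.SubElement(root, f'{{{DC}}}title')
title.text = 'CONCORD-3: Global surveillance of cancer survival'

serialized = ET.tostring(root, encoding='unicode')
print(serialized)
```

The registration is global to the interpreter session, so it only needs to run once before any tree is written.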

## Automating the metadata generation

Recall that the CONCORD-3 project involves different types of resources:

- Raw data coming directly from the different registries from each country. 
- Aggregated and curated datasets already processed. 
- Charts and graphs produced after analysing the datasets. 
- Results and outputs (estimations and regressions) obtained from the data. 
- Publications, papers and reports explaining the results. 

We can therefore distinguish five general metadata groups and produce an `xml` file for each of them.
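
One way to organise the groups is a table of per-group overrides that is merged into the shared base dictionary. The sketch below is only an illustration: the group names, the `types` labels and the helper `metadata_for` are hypothetical and are not an agreed vocabulary for the project.

```python
# Base elements shared by every resource (shortened for the example)
base = {
    'languages': ['en'],
    'types': ['data', 'software tools'],
}

# Hypothetical overrides, one entry per resource group listed above.
# The 'types' labels are illustrative only.
GROUP_OVERRIDES = {
    'raw-data': {'types': ['data', 'raw']},
    'curated-dataset': {'types': ['data', 'curated']},
    'chart': {'types': ['chart']},
    'result': {'types': ['result']},
    'publication': {'types': ['publication']},
}

def metadata_for(group, base_dict):
    """Return a new dictionary: the base elements plus the group overrides."""
    return {**base_dict, **GROUP_OVERRIDES[group]}

chart_md = metadata_for('chart', base)
print(chart_md)
```

Because the merge builds a new dictionary, the base elements stay available for the next group.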

Having gone through the procedure step by step, we can now encapsulate the work in a few functions with a simpler interface:

In [9]:
def update_metadata(metadata_dict, new_vals):
    """Update the dictionary in place, overwriting existing keys."""
    metadata_dict.update(new_vals)
    return metadata_dict
    
def dict_to_tree(metadata_dict):
    """Transform dictionary into a XML Element Tree."""
    return ET.ElementTree(ET.fromstring(simpledc.tostring(metadata_dict)))
    
def add_to_tree(tree, new_vals):
    """Add new values to existing XML Element Tree."""
    root = tree.getroot()
    for k, v in new_vals.items():
        for item in v:
            elem = ET.SubElement(root, '{http://purl.org/dc/elements/1.1/}' + k)
            elem.text = item 
            elem.tail = '\n  '
    return tree

def save_metadata(tree, fname):
    """Save the tree to a UTF-8 encoded XML file."""
    # Passing the file name lets ElementTree open and close the file itself
    tree.write(fname, encoding='utf-8', xml_declaration=True)

def process_metadata(orig_dict, dict_for_dict, dict_for_tree, fname=None):
    """Put everything together."""
    # 1st - Update existing dictionary
    md_dict = orig_dict.copy()
    md_dict = update_metadata(md_dict, dict_for_dict)

    # 2nd - Transform dictionary into tree
    tree = dict_to_tree(md_dict)

    # 3rd - Add new elements to extend Dublin Core
    tree = add_to_tree(tree, dict_for_tree)
    
    # 4th - Save to file
    if fname is not None:
        save_metadata(tree, fname)
    return tree
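
A note on the `copy()` followed by `update()` pattern in `process_metadata`: it leaves the original dictionary untouched as long as whole keys are replaced, but the shallow copy still shares its list objects with the original. The short sketch below, with made-up values, illustrates the difference and a possible `copy.deepcopy` alternative.

```python
import copy

base = {'coverage': ['Africa', 'Europe'], 'languages': ['en']}

# Same pattern as process_metadata: shallow copy, then replace whole keys
derived = base.copy()
derived.update({'coverage': ['Europe']})
# base['coverage'] is untouched because the key was rebound, not mutated

# Pitfall: mutating a list in place changes the original too,
# because the shallow copy shares the list objects
derived['languages'].append('es')

# A deep copy avoids the sharing altogether
safe = copy.deepcopy(base)
safe['languages'].append('fr')

print(base)
```

As written, `process_metadata` only replaces whole keys, so the shallow copy is sufficient there; the deep copy matters only if values are later modified in place.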

For example, consider a dataset containing cancer survival probabilities from all registries in Spain (so `coverage` is only Europe) for 2013, with its own DOI and processed with specific software. We can then update the original `metadata_dc` dictionary and produce the metadata terms as follows:

In [10]:
raw_dict = {
    'coverage': ['Europe'],
    'dates': ['2013'],
    'types': ['data'],
    'identifiers': ['10.5281/zenodo.SPxxxxx', 'CONCORD-3', 'CONCORD']
}
raw_tree = {
    'Country': ['Spain'],
    'Registry': ['Albacete', 'Asturias', 'Basque Country',
                 'Canaries', 'Cuenca Spain', 'Girona', 
                 'Granada', 'Mallorca', 'Murcia', 'Navarra',
                 'Spain Childhood', 'Tarragona', 'Valencia Childhood'],
    'Analyzer': ['scikit-learn', '0.20.2']
}
fname = 'concord-Spain.xml'
# Remember we have basic metadata elements in this dict -> metadata_dc
t = process_metadata(metadata_dc, raw_dict, raw_tree, fname=fname)

# Check results:
loaded_tree = ET.parse(fname)
# getiterator() was removed in Python 3.9; iter() is its replacement
for table in loaded_tree.getroot().iter():
    for child in table:
        print(child.tag[34:], '->', child.text)

contributor -> Gerardo E. De Miguel González
contributor -> Diego Ferreño Blanco
contributor -> Ignacio Iker Prado Rujas
contributor -> Estela Ruiz Martínez
contributor -> Manuel Zamudio López
coverage -> Europe
creator -> Cancer Survival Group
creator -> London School of Hygiene & Tropical Medicine
date -> 2013
description -> CONCORD data, divided into: Continent, Country, Registry, Year and Sex.
                    CONCORD is the program for global surveillance of cancer survival, 
                    led by the London School of Hygiene & Tropical Medicine. The CONCORD 
                    (Global Surveillance of Cancer Survival) program is endorsed by more 
                    than 40 international agencies, including the OECD (Organization for 
                    Economic Co-operation and Development) of the WHO (World Health Organization)
                    and the World Bank.
format -> data/csv
identifier -> 10.5281/zenodo.SPxxxxx
identifier -> CONCORD-3
identifier -> CONCORD

Now it is straightforward to extend this to a chart, for instance:

In [11]:
raw_dict = {
    'coverage': ['Oceania'],
    'dates': ['2010'],
    'types': ['chart', 'kde'],
    'identifiers': ['10.4975/zenodo.AUxxxxx', 'CONCORD-3', 'CONCORD']
}
raw_tree = {
    'Country': ['Australia'],
    'Registry': ['Australian Capital Territory', 'New South Wales', 
                 'Northern Territory', 'Queensland', 
                 'South Australia', 'Tasmania', 
                 'Victoria', 'Western Australia'],
    'Analyzer': ['seaborn', '0.9.0'],
    'Chart-type': ['violin-plot']
}
fname = 'concord-Australia.xml'
# Remember we have basic metadata elements in this dict -> metadata_dc
t = process_metadata(metadata_dc, raw_dict, raw_tree, fname=fname)

# Check results:
loaded_tree = ET.parse(fname)
# getiterator() was removed in Python 3.9; iter() is its replacement
for table in loaded_tree.getroot().iter():
    for child in table:
        print(child.tag[34:], '->', child.text)

contributor -> Gerardo E. De Miguel González
contributor -> Diego Ferreño Blanco
contributor -> Ignacio Iker Prado Rujas
contributor -> Estela Ruiz Martínez
contributor -> Manuel Zamudio López
coverage -> Oceania
creator -> Cancer Survival Group
creator -> London School of Hygiene & Tropical Medicine
date -> 2010
description -> CONCORD data, divided into: Continent, Country, Registry, Year and Sex.
                    CONCORD is the program for global surveillance of cancer survival, 
                    led by the London School of Hygiene & Tropical Medicine. The CONCORD 
                    (Global Surveillance of Cancer Survival) program is endorsed by more 
                    than 40 international agencies, including the OECD (Organization for 
                    Economic Co-operation and Development) of the WHO (World Health Organization)
                    and the World Bank.
format -> data/csv
identifier -> 10.4975/zenodo.AUxxxxx
identifier -> CONCORD-3
identifier -> CONCORD
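
Instead of reading the printed listing by eye, the saved file can also be checked programmatically. The following is a minimal sketch using `findall` with a namespace map; `dc_values` is a hypothetical helper, and `sample_xml` is a small stand-in for one of the generated files, not their full content.

```python
import xml.etree.ElementTree as ET

DC = 'http://purl.org/dc/elements/1.1/'
OAI_DC = 'http://www.openarchives.org/OAI/2.0/oai_dc/'
NS = {'dc': DC}

# Small stand-in for one of the generated files
sample_xml = f"""<oai_dc:dc xmlns:oai_dc="{OAI_DC}" xmlns:dc="{DC}">
  <dc:coverage>Oceania</dc:coverage>
  <dc:identifier>CONCORD-3</dc:identifier>
  <dc:identifier>CONCORD</dc:identifier>
  <dc:Country>Australia</dc:Country>
</oai_dc:dc>"""

def dc_values(root, element):
    """Return the text of every direct child <dc:element> of root."""
    return [e.text for e in root.findall(f'dc:{element}', NS)]

root = ET.fromstring(sample_xml)
print(dc_values(root, 'identifier'))
print(dc_values(root, 'Country'))
```

With a real file, `ET.parse(fname).getroot()` would take the place of `ET.fromstring(sample_xml)`.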


## Summary

In short, this notebook develops the infrastructure needed to automate the creation of metadata for all resources produced throughout the life cycle of the CONCORD-3 project.