# Gestión y uso de Metadatos

En este Notebook descubriremos cómo pueden explotarse metadatos publicados en formatos basados en etiquetas, como XML.

<img src="https://www.republica.com/wp-content/uploads/2017/04/grito.jpg " width="250">

Vamos a empezar por describir un par de objetos, empezando por un cuadro, "El grito", de  Edvard Munch.

* Title: 
* Creator: 
* Subject:  
* Description: 
* Publisher: 
* Contributor:
* Date: 
* Type: 
* Format: 
* Identifier: 
* Source: 
* Language:
* Relation:
* Coverage:  
* Rights:

Con Dublin Core también podemos describir datasets científicos. Vamos a probar con:

https://zenodo.org/record/583733#.WfL3V3V96-o

<div class="alert alert-warning" role="alert" style="margin: 10px">
<p>**Consejo**</p>

<p>En el propio repositorio puedes encontrar metadatos</p>
</div>

* Title: Snow dataset for Mount-Lebanon (2011-2016)
* Creator: Fayad, Abbas
* Subject:  Snow observations; Snow water equivalent
* Description: We present a comprehensive snow dataset for Mont-Lebanon. The dataset includes continuous meteorological observations from three high elevation automatic weather stations (AWS), snowpack field measurements collected at 30 different snow courses (elevation range 1300-2900 m a.s.l.), and post-processed MODIS snow products.  Meteorological and snow observations are presented for the snow seasons (November-June) between 2011 and 2016 for Mzar (MZA, 2294 m a.s.l.), 2014-2016 for the Cedars (CED, 2834 m a.s.l.), and 2015-2016 for Laqlouq (LAQ, 1830 m a.s.l.). Meteorological and snow data includes snow depth, temperature, relative humidity, incoming and reflected solar radiation, wind speed and direction, and atmospheric pressure measured at 30-min interval. Snow depth, snow density, and snow water equivalent were measured at the 30 different snow courses during snow season 2015 and 2016 with an average revisit time of 11.4 days. Post-processed daily MODIS snow cover area (SCA) and snow cover duration (SCD) products are presented for the three snow dominated basins (Abou Ali, Ibrahim, and El Kelb) and cover the time period from 01 November 2011 to 31 June 2016.
* Publisher: Zenodo
* Contributor:
* Date: 2017-05-28
* Type: publication-article
* Format: CSV
* Identifier: 10.5281/zenodo.583733
* Source: 
* Language:
* Relation: 
* Coverage:  
* Rights: https://creativecommons.org/licenses/by/4.0/

A partir de las descripciones, podemos crear documentos XML que sean interpretables por máquinas (entendiendo máquinas como scripts, software, etc). 

El grito:
  
  ```XML
 <dc:contributor> </dc:contributor>
  <dc:coverage> </dc:coverage>
  <dc:creator> </dc:creator>
  <dc:date> </dc:date>
  <dc:description> </dc:description>
  <dc:format>  </dc:format>
  <dc:identifier> </dc:identifier>
  <dc:language> </dc:language>
  <dc:publisher> </dc:publisher>
  <dc:relation> </dc:relation>
  <dc:rights> </dc:rights>
  <dc:source> </dc:source>
  <dc:title> </dc:title>
  <dc:type> </dc:type>
```

Dataset:
  ```XML
 <dc:contributor> </dc:contributor>
  <dc:coverage> </dc:coverage>
  <dc:creator> </dc:creator>
  <dc:date>2017-05-28 </dc:date>
  <dc:description> </dc:description>
  <dc:format>  </dc:format>
  <dc:identifier>10.5281/zenodo.583733</dc:identifier>
  <dc:language> </dc:language>
  <dc:publisher>Zenodo</dc:publisher>
  <dc:relation> </dc:relation>
  <dc:rights> </dc:rights>
  <dc:source> </dc:source>
  <dc:title>Snow dataset for Mount-Lebanon (2011-2016)</dc:title>
  <dc:type>publication-article</dc:type>
```

Ahora vamos a ver cómo podemos manejar estos datos en Python. Para ello, utilizaremos la librería xml.

PAra crear un documento XML bien formado, es necesario definir dónde está descrito el prefijo Dublin Core o "dc:". Para ello, añadimos antes de los datos la siguiente cabecera:

```XML
<?xml version="1.0" encoding="UTF-8" standalone="no"?><?xml-stylesheet type="text/xsl" href="/webservices/catalog/xsl/searchRetrieveResponse.xsl"?>
<searchRetrieveResponse xmlns:oclcterms="http://purl.org/oclc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
```

Sin olvidar añadir al final:

```XML
</searchRetrieveResponse>
```

In [3]:
import xml.etree.ElementTree as ET
dc_xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?><?xml-stylesheet type="text/xsl" href="/webservices/catalog/xsl/searchRetrieveResponse.xsl"?>
<searchRetrieveResponse xmlns:oclcterms="http://purl.org/oclc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">


    <dc:contributor>asdsadsad</dc:contributor>
    <dc:coverage>dfsd</dc:coverage>
    <dc:creator>sadsa</dc:creator>
    <dc:date>sadas</dc:date>
    <dc:description>sadsa</dc:description>
    <dc:format>sadasd</dc:format>
    <dc:identifier>sadsad</dc:identifier>
    <dc:language>asdasd</dc:language>
    <dc:publisher>wqewq</dc:publisher>
    <dc:relation >wqeqw</dc:relation>
    <dc:rights>ffefe</dc:rights>
    <dc:source>vfvf</dc:source>
    <dc:title>wqewqe</dc:title>
    <dc:type>ewfrb</dc:type>



</searchRetrieveResponse>'''

tree = ET.fromstring(dc_xml)
tree

<Element 'searchRetrieveResponse' at 0x7f54a2f26318>

Si queremos recorrer los elementos del XML que hemos formado, podemos utilizar un bucle, teniendo en cuenta que la información que nos interesa la tenemos en 'searchRetrieveResponse':

In [4]:
for table in tree.getiterator('searchRetrieveResponse'):
    for child in table:
        print(child.tag, child.text)

{http://purl.org/dc/elements/1.1/}contributor asdsadsad
{http://purl.org/dc/elements/1.1/}coverage dfsd
{http://purl.org/dc/elements/1.1/}creator sadsa
{http://purl.org/dc/elements/1.1/}date sadas
{http://purl.org/dc/elements/1.1/}description sadsa
{http://purl.org/dc/elements/1.1/}format sadasd
{http://purl.org/dc/elements/1.1/}identifier sadsad
{http://purl.org/dc/elements/1.1/}language asdasd
{http://purl.org/dc/elements/1.1/}publisher wqewq
{http://purl.org/dc/elements/1.1/}relation wqeqw
{http://purl.org/dc/elements/1.1/}rights ffefe
{http://purl.org/dc/elements/1.1/}source vfvf
{http://purl.org/dc/elements/1.1/}title wqewqe
{http://purl.org/dc/elements/1.1/}type ewfrb


Observa que, al utilizar el prefijo 'dc:' e indicarle que está descrito en la URL 'http://purl.org/dc/elements/1.1/', la eqtiqueta o "tag" aparece como, por ejemplo {URL}contributor.

Prueba a mostrar los metadatos que has creado a partir del cuadro y del dataset:

In [5]:
for table in tree.getiterator('searchRetrieveResponse'):
    for child in table:
        print(child.text)

asdsadsad
dfsd
sadsa
sadas
sadsa
sadasd
sadsad
asdasd
wqewq
wqeqw
ffefe
vfvf
wqewqe
ewfrb


Utilizando findall() sobre el albor (tree), podemos encontrar todos los elementos con una etiqueta determinada.

In [6]:
relation = tree.findall('{http://purl.org/dc/elements/1.1/}relation')
print(relation)

[<Element '{http://purl.org/dc/elements/1.1/}relation' at 0x7f54a2c7de08>]


Ten en cuenta que lo que encontramos es, en realidad, una parte del XML, por lo que hay que iterarlo como antes:

In [7]:
for child in relation:
    print(child.tag, child.text)

{http://purl.org/dc/elements/1.1/}relation wqeqw


XML utiliza prefijos para no necesitar referenciar a la URL de un tipo cada vez, lo podemos ver en la cabecera:

```XML
<?xml version="1.0" encoding="UTF-8" standalone="no"?><?xml-stylesheet type="text/xsl" href="/webservices/catalog/xsl/searchRetrieveResponse.xsl"?>
<searchRetrieveResponse xmlns:oclcterms="http://purl.org/oclc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'''
```

Por ejemplo, cada vez que queremos utilizar un tipo de Dublin Core, utilizamos el prefijo dc: que equivale a llamar a la definición:

xmlns:dc="http://purl.org/dc/elements/1.1/"

Sin embargo, para utilizar ElementTree en Python, tenemos que utilizar la URL completa. Esto puede resultar un poco engorroso, así que podemos definir el namespace para utilizar también el prefijo:

In [16]:
namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'} # add more as needed

tree.find('dc:rights',namespaces).text

'ffefe'

Los documentos XML, aparte de las etiquetas y los valores, pueden contener atributos. Dado el siguiente ejemplo, vamos a ver cómo obtener la lista y los valores de los atributos

In [17]:
dc_xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?><?xml-stylesheet type="text/xsl" href="/webservices/catalog/xsl/searchRetrieveResponse.xsl"?>
<searchRetrieveResponse xmlns:oclcterms="http://purl.org/oclc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<dc:contributor>asdsadsad</dc:contributor>
<dc:coverage>dfsd</dc:coverage>
<dc:creator>sadsa</dc:creator>
<dc:date>sadas</dc:date>
<dc:description atributo1="valor1" atributo2="valor2">sadsa</dc:description>
<dc:format>sadasd</dc:format>
<dc:identifier>sadsad</dc:identifier>
<dc:language>asdasd</dc:language>
<dc:publisher>wqewq</dc:publisher>
<dc:relation >wqeqw</dc:relation>
<dc:rights>ffefe</dc:rights>
<dc:source>vfvf</dc:source>
<dc:title>wqewqe</dc:title>
<dc:type>ewfrb</dc:type>
</searchRetrieveResponse>'''

tree2 = ET.fromstring(dc_xml)

In [18]:
tree2.find('dc:description',namespaces).attrib

{'atributo1': 'valor1', 'atributo2': 'valor2'}

In [19]:
print(tree2.find('dc:description',namespaces).attrib['atributo1'])
print(tree2.find('dc:description',namespaces).attrib['atributo2'])


valor1
valor2


Vamos a analizar un documento XML más complejo, empezando por descargarlo:

In [20]:
import requests

response = requests.get('https://gist.githubusercontent.com/vivien/580729/raw/651d1b216357c0d7d9fc47075071fb482e11fb36/dublincore-example.xml')
if response.status_code == 200:
    with open("./dublincore-example.xml", 'wb') as f:
        f.write(response.content)

In [21]:
ls

dublincore-example.xml  MetadataIntro.ipynb


Y lo cargamos en python:

In [20]:
tree = ET.parse('dublincore-example.xml')
namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'} # add more as needed
for table in tree.getiterator('{http://www.loc.gov/zing/srw/}searchRetrieveResponse'):
    for child in table:
        print(child.tag, child.text)

{http://www.loc.gov/zing/srw/}version 1.1
{http://www.loc.gov/zing/srw/}numberOfRecords 33587
{http://www.loc.gov/zing/srw/}records 

{http://www.loc.gov/zing/srw/}nextRecordPosition 11
{http://www.loc.gov/zing/srw/}resultSetIdleTime None
{http://www.loc.gov/zing/srw/}echoedSearchRetrieveRequest 



In [32]:
all_records = tree.findall('{http://www.loc.gov/zing/srw/}records')
print(all_records)

[<Element '{http://www.loc.gov/zing/srw/}records' at 0x7f54a14da368>]


In [24]:
for table in tree.getiterator('{http://www.loc.gov/zing/srw/}record'):
    for child in table:
        print(child.tag, child.text)

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http:

In [33]:
for table in tree.getiterator('{http://www.loc.gov/zing/srw/}recordData'):
    for child in table:
        print(child.tag, child.text)

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 



In [36]:
for table in tree.getiterator('{http://www.loc.gov/zing/srw/}oclcdcs'):
    for child in table:
        print(child.tag, child.text)

{http://purl.org/dc/elements/1.1/}creator Snelling, Lauraine.
{http://purl.org/dc/elements/1.1/}date c2003
{http://purl.org/dc/elements/1.1/}description "Ruby Torvald sets out on a daunting journey with her young sister, Opal, to hopefully see their long-lost father once more and claim the promised inheritance. But instead of the treasure they expected, the sisters discover something most shocking." -- Book Cover.
{http://purl.org/dc/elements/1.1/}format 320 p. ; 22 cm.
{http://purl.org/dc/elements/1.1/}identifier 0764290762
{http://purl.org/dc/elements/1.1/}identifier 9780764290763
{http://purl.org/dc/elements/1.1/}identifier 0764222228
{http://purl.org/dc/elements/1.1/}identifier 9780764222221
{http://purl.org/dc/elements/1.1/}language eng
{http://purl.org/dc/elements/1.1/}publisher Bethany House Publishers
{http://purl.org/dc/elements/1.1/}relation Dakotah treasures ; 1
{http://purl.org/dc/elements/1.1/}subject Inheritance and succession--Fiction.
{http://purl.org/dc/elements/1.1/}s

In [39]:
for table in tree.getiterator('{http://purl.org/dc/elements/1.1/}identifier'):
    for child in table:
        print(child.tag, child.text)

Como ves, hay que ir entendiendo la jerarquía del XML para poder obtener la información. 

¿Puedes obtener los títulos de los recursos descritos en el XML?

In [19]:
relation = tree.findall('//{http://purl.org/dc/elements/1.1/}title')
for elem in relation:
    print(elem.tag, elem.text)

  """Entry point for launching an IPython kernel.


In [21]:
namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'} # add more as needed
relation = tree.findall('//dc:title',namespaces)
for elem in relation:
    print(elem.tag, elem.text)

{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title Ruby Holler 
{http://purl.org/dc/elements/1.1/}title Through my eyes 
{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title I'm not Julia Roberts 
{http://purl.org/dc/elements/1.1/}title I am not Julia Roberts
{http://purl.org/dc/elements/1.1/}title The chaos king 
{http://purl.org/dc/elements/1.1/}title Bad apple 
{http://purl.org/dc/elements/1.1/}title The Wall and the Wing 
{http://purl.org/dc/elements/1.1/}title Good girls 


  


Ejemplo con EML

In [47]:
import requests

response = requests.get('https://zenodo.org/record/841691/files/amt_prototype.xml')
if response.status_code == 200:
    with open("./amt_prototype.xml", 'wb') as f:
        f.write(response.content)
        


In [48]:
ls

amt_prototype.xml  dublincore-example.xml  MetadataIntro.ipynb


En estándares más complejos, el xml de base puede tener una jerarquía anidada, como es el caso de EML. Entonces, cada elemento puede tener de 0 a N "hijos", formando nuevos árboles.

In [54]:
tree = ET.parse('amt_prototype.xml')
root = tree.getroot()

for table in root.getiterator():
    for child in table:
        if len(child)==0:
            print(child.tag, child.text)

alternateIdentifier 
10.5281/zenodo.841183

title water reservoir of Cuerda del Pozo
organizationName IFCA
electronicMailAddress marco@ifca.unican.es
salutation Mr
givenName Jesus Marco
surName De Lucas
deliveryPoint Avda Castros s/n
city Santander
postalCode 39005
country Spain
organizationName IFCA
electronicMailAddress aguilarf@ifca.unican.es
role guardian
givenName Fernando
surName Aguilar
deliveryPoint Avda Castros s/n
city Santander
postalCode 39005
country Spain
para The CTD 60 is a precision probe for oceanographic and limnological measurements of physical, chemical and optical parameters up to a depth of 2000 m. It allows the simultaneous measurement of following parameters: Pressure (depth), temperature, conductivity, raw O2, REDOX, dissolved oxygen, pH, Oxigen Saturation, Salinity.
keyword measure
keyword water reservoir
keyword sensor
keyword physical and chemical parameters
geographicDescription water reservoir
westBoundingCoordinate -3.75
eastBoundingCoordinate -2.375
nor

Explora un poco: Nombre del proyecto, autores, lista de atributos...

In [77]:
#elementos = tree.findall('/dataset[1]/creator/individualName/salutation')
elementos = tree.findall('//attributeList/attribute[@id="1465311292527"]/attributeName')
for e in elementos:
    print(e.tag + ":", e.text)

attributeName: date


  


In [None]:
dataset = ET.SubElement(root,'dataset')
for table in dataset.getiterator():
    print(child.tag, child.text)

# Ejercicio personal

A partir del ejemplo completo del esquema de metadatos de DataCite, muestra por pantalla los elementos que sean equivalentes a los propuestos por Dublin Core (cada uno en una línea). Es posible que tengas que combinar en uno varios campos del archivo de metadatos (Por ejemplo, en coverage las coordenadas + el nombre).

* Title: 
* Creator: 
* Subject:  
* Description: 
* Publisher: 
* Contributor:
* Date: 
* Type: 
* Format: 
* Identifier: 
* Source: 
* Language:
* Relation:
* Coverage:  
* Rights:

Recurso: https://schema.datacite.org/meta/kernel-3.1/example/datacite-example-full-v3.1.xml

In [1]:
import requests

response = requests.get('https://schema.datacite.org/meta/kernel-3.1/example/datacite-example-full-v3.1.xml')
if response.status_code == 200:
    with open("./datacite.xml", 'wb') as f:
        f.write(response.content)

In [48]:
import xml.etree.ElementTree as ET

tree = ET.parse('datacite.xml')
root = tree.getroot()

for table in root.getiterator():
    for child in table:
        if len(child)==0:
            print(child.tag, child.text)

{http://datacite.org/schema/kernel-3}identifier 10.5072/example-full
{http://datacite.org/schema/kernel-3}publisher DataCite
{http://datacite.org/schema/kernel-3}publicationYear 2014
{http://datacite.org/schema/kernel-3}language en-us
{http://datacite.org/schema/kernel-3}resourceType XML
{http://datacite.org/schema/kernel-3}version 3.1
{http://datacite.org/schema/kernel-3}creatorName Miller, Elizabeth
{http://datacite.org/schema/kernel-3}nameIdentifier 0000-0001-5000-0007
{http://datacite.org/schema/kernel-3}affiliation DataCite
{http://datacite.org/schema/kernel-3}title Full DataCite XML Example
{http://datacite.org/schema/kernel-3}title Demonstration of DataCite Properties.
{http://datacite.org/schema/kernel-3}subject 000 computer science
{http://datacite.org/schema/kernel-3}contributorName Starr, Joan
{http://datacite.org/schema/kernel-3}nameIdentifier 0000-0002-7285-027X
{http://datacite.org/schema/kernel-3}affiliation California Digital Library
{http://datacite.org/schema/kernel-3

In [55]:
namespaces = {'schemaLocation':'http://schema.datacite.org/meta/kernel-3/metadata.xsd'} # add more as needed
elementos = tree.findall('//creatorName',namespaces)
for e in elementos:
    print("Titulo:", e.text)


  


In [43]:
tree = ET.parse('amt_prototype.xml')
root = tree.getroot()

elementos = tree.findall('//project/title',namespaces)
for e in elementos:
    print("Título proyecto:", e.text)
    
elementos = tree.findall('//dataset/title',namespaces)
for e in elementos:
    print("Título Dataset:", e.text)
elementos = tree.findall('//geographicDescription',namespaces)
for e in elementos:
    print("Descripción:", e.text)

elementos = tree.findall('//attributeLabel',namespaces)
for e in elementos:
    print("Columna:", e.text)
    
elementos = tree.findall('//url',namespaces)
for e in elementos:
    print("Descargar:", e.text)
elementos = tree.findall('//size',namespaces)
for e in elementos:
    print("Tamaño:", e.text)
elementos = tree.findall('//authentication',namespaces)
for e in elementos:
    print("CheckSum:", e.text)
elementos = tree.findall('//fieldDelimiter',namespaces)
for e in elementos:
    print("Delimitador:", e.text)

Título Dataset: water reservoir of Cuerda del Pozo
Descripción: water reservoir
Columna: date
Columna: Temp
Columna: Press
Columna: Cond
Columna: Salinity
Columna: DO
Columna: rawO2
Columna: OxySat
Columna: ph
Columna: redox
Descargar: http://doriiie02.ifca.es/datasets/amt.csv
Tamaño: 4275737
CheckSum: f38ded28383d9f69af7cb9c98aed798b
Delimitador: ;


  after removing the cwd from sys.path.
  
  # This is added back by InteractiveShellApp.init_path()
  from ipykernel import kernelapp as app
