# Managing and Using Metadata

## Required Libraries

xml.etree.ElementTree

requests

In [None]:
import xml.etree.ElementTree as ET
import requests

## Metadata attachment

In this notebook we explore how to work with metadata published in tag-based formats such as XML.

<img src="https://www.republica.com/wp-content/uploads/2017/04/grito.jpg " width="250">

Let's start by describing a couple of objects, beginning with a painting: "El grito" (The Scream), by Edvard Munch.

* Title: El grito
* Creator: Edvard Munch
* Subject: modern art; painting; artwork; expressionism; existential anguish; human emotion
* Description: Expressionist painting depicting an androgynous figure holding its face in an expression of despair and anguish, against a background of turbulent skies and intense colors.
* Publisher: -
* Contributor: -
* Date: 1893
* Type: Artwork/Painting
* Format: Oil on cardboard
* Identifier: -
* Source: National Museum of Norway
* Language: -
* Relation: -
* Coverage: Oslo, Norway
* Rights: -

Dublin Core can also describe scientific datasets. Let's try it with:

https://zenodo.org/record/3372754#.XcFkhE9Kg5k

<div class="alert alert-warning" role="alert" style="margin: 10px">
<p>**Tip**</p>

<p>You can find metadata in the repository itself</p>
</div>

* Title: Global River Ice Dataset - validation dataset
* Creator: Yang, Xiao; Pavelsky, Tamlin; Allen, George 
* Subject: -
* Description: Alaskan river ice records from National Weather Service (NWS), including nws_breakup_nogeo.csv containing [...]
* Publisher: Zenodo
* Contributor: -
* Date: August 20 2019
* Type: Dataset
* Format: -
* Identifier: 10.5281/zenodo.3372754
* Source: zenodo.org
* Language: -
* Relation: https://doi.org/10.5281/zenodo.3372753
* Coverage: - 
* Rights: Creative Commons Attribution 4.0 International

From these descriptions we can create XML documents that machines (scripts, software, and so on) can interpret.

El grito:

  ```XML
  <dc:contributor></dc:contributor>
  <dc:coverage>Oslo, Norway</dc:coverage>
  <dc:creator>Edvard Munch</dc:creator>
  <dc:date>1893</dc:date>
  <dc:description>Expressionist painting depicting an androgynous figure holding its face in an expression of despair and anguish, against a background of turbulent skies and intense colors.</dc:description>
  <dc:format>Oil on cardboard</dc:format>
  <dc:identifier></dc:identifier>
  <dc:language></dc:language>
  <dc:publisher></dc:publisher>
  <dc:relation></dc:relation>
  <dc:rights></dc:rights>
  <dc:source>National Museum of Norway</dc:source>
  <dc:subject>modern art; painting; artwork; expressionism; existential anguish; human emotion</dc:subject>
  <dc:title>El grito</dc:title>
  <dc:type>Artwork/Painting</dc:type>
```

Dataset:
  ```XML
  <dc:contributor></dc:contributor>
  <dc:coverage></dc:coverage>
  <dc:creator>Yang, Xiao; Pavelsky, Tamlin; Allen, George</dc:creator>
  <dc:date>2019-08-20</dc:date>
  <dc:description>Alaskan river ice records from National Weather Service (NWS), including nws_breakup_nogeo.csv containing [...]</dc:description>
  <dc:format></dc:format>
  <dc:identifier>10.5281/zenodo.3372754</dc:identifier>
  <dc:language></dc:language>
  <dc:publisher>Zenodo</dc:publisher>
  <dc:relation>https://doi.org/10.5281/zenodo.3372753</dc:relation>
  <dc:rights>Creative Commons Attribution 4.0 International</dc:rights>
  <dc:source>zenodo.org</dc:source>
  <dc:subject></dc:subject>
  <dc:title>Global River Ice Dataset - validation dataset</dc:title>
  <dc:type>Dataset</dc:type>
```
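The same kind of record can also be generated from Python instead of being typed by hand. The following is a minimal sketch, not a definitive implementation: the helper `build_dc_record` and the wrapper element name `metadata` are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
# Ask ElementTree to serialise this namespace with the "dc" prefix.
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields):
    """Return a wrapper element with one dc:* child per field (hypothetical helper)."""
    root = ET.Element("metadata")  # wrapper name is an assumption
    for name, value in sorted(fields.items()):
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


painting = {"title": "El grito", "creator": "Edvard Munch", "date": "1893"}
record = build_dc_record(painting)
print(ET.tostring(record, encoding="unicode"))
```

Sorting the field names keeps the output in the same alphabetical order used in the listings above.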

Now let's see how to handle this data in Python. We will use the xml.etree.ElementTree module from the standard library.

To create a well-formed XML document, the Dublin Core prefix "dc:" must be bound to the namespace where it is defined. To do so, we add the following header before the data:

```XML
<?xml version="1.0" encoding="UTF-8" standalone="no"?><?xml-stylesheet type="text/xsl" href="/webservices/catalog/xsl/searchRetrieveResponse.xsl"?>
<searchRetrieveResponse xmlns:oclcterms="http://purl.org/oclc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
```

And we must not forget to close the root element at the end:

```XML
</searchRetrieveResponse>
```
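To see why the declaration matters, the sketch below parses the same fragment with and without a bound "dc" prefix. The wrapper element `record` is an assumption used only for this illustration.

```python
import xml.etree.ElementTree as ET

fragment = "<dc:title>El grito</dc:title>"

# Without a declaration for "dc", the parser rejects the document.
try:
    ET.fromstring(f"<record>{fragment}</record>")
    parsed_without_declaration = True
except ET.ParseError as err:
    parsed_without_declaration = False
    print("Parse failed:", err)

# Declaring xmlns:dc on the root binds the prefix for every descendant.
wrapped = (
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    f"{fragment}"
    "</record>"
)
root = ET.fromstring(wrapped)
print(root[0].tag)
```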

In [40]:
import xml.etree.ElementTree as ET
dc_xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?><?xml-stylesheet type="text/xsl" href="/webservices/catalog/xsl/searchRetrieveResponse.xsl"?>
<searchRetrieveResponse xmlns:oclcterms="http://purl.org/oclc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">


     <dc:contributor>Edvard Munch </dc:contributor>
  <dc:coverage>Lugar indeterminado</dc:coverage>
  <dc:creator>Edvard Munch </dc:creator>
  <dc:date>1910</dc:date>
  <dc:description>Cuadro...</dc:description>
  <dc:format>Oleo sobre carton</dc:format>
  <dc:identifier>id_museo_grito</dc:identifier>
  <dc:language></dc:language>
  <dc:publisher>Galeria nacional de Oslo</dc:publisher>
  <dc:relation>cuadro1, cuadro2, cuadro3</dc:relation>
  <dc:rights>Acceso al museo</dc:rights>
  <dc:source></dc:source>
  <dc:title>El grito</dc:title>
  <dc:type>Cuadro</dc:type>



</searchRetrieveResponse>'''

tree = ET.fromstring(dc_xml)
tree

<Element 'searchRetrieveResponse' at 0x000001DCDA1C56C0>

To walk through the elements of the XML we have built, we can use a loop, keeping in mind that the information we want sits under the root element 'searchRetrieveResponse':

In [41]:
for table in tree.iter('searchRetrieveResponse'):
    for child in table:
        print(child.tag, child.text)

{http://purl.org/dc/elements/1.1/}contributor Edvard Munch 
{http://purl.org/dc/elements/1.1/}coverage Lugar indeterminado
{http://purl.org/dc/elements/1.1/}creator Edvard Munch 
{http://purl.org/dc/elements/1.1/}date 1910
{http://purl.org/dc/elements/1.1/}description Cuadro...
{http://purl.org/dc/elements/1.1/}format Oleo sobre carton
{http://purl.org/dc/elements/1.1/}identifier id_museo_grito
{http://purl.org/dc/elements/1.1/}language None
{http://purl.org/dc/elements/1.1/}publisher Galeria nacional de Oslo
{http://purl.org/dc/elements/1.1/}relation cuadro1, cuadro2, cuadro3
{http://purl.org/dc/elements/1.1/}rights Acceso al museo
{http://purl.org/dc/elements/1.1/}source None
{http://purl.org/dc/elements/1.1/}title El grito
{http://purl.org/dc/elements/1.1/}type Cuadro


Note that, because the 'dc:' prefix is bound to the URL 'http://purl.org/dc/elements/1.1/', ElementTree expands each tag to the form {URL}name, for example {http://purl.org/dc/elements/1.1/}contributor.
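When only the short element name is needed, the expanded tag can be split by hand. This is a small sketch; `split_tag` is a hypothetical helper, not part of ElementTree.

```python
def split_tag(tag):
    """Split '{namespace}local' into (namespace, local); namespace is '' if absent."""
    if tag.startswith("{"):
        namespace, _, local = tag[1:].partition("}")
        return namespace, local
    return "", tag


print(split_tag("{http://purl.org/dc/elements/1.1/}contributor"))
print(split_tag("searchRetrieveResponse"))
```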

Try displaying the metadata you created for the painting and for the dataset:

In [42]:
for table in tree.iter('searchRetrieveResponse'):
    for child in table:
        print(child.text)

Edvard Munch 
Lugar indeterminado
Edvard Munch 
1910
Cuadro...
Oleo sobre carton
id_museo_grito
None
Galeria nacional de Oslo
cuadro1, cuadro2, cuadro3
Acceso al museo
None
El grito
Cuadro


Calling findall() on the tree returns all direct children of the element that carry a given tag.

In [43]:
relation = tree.findall('{http://purl.org/dc/elements/1.1/}relation')
print(relation)

[<Element '{http://purl.org/dc/elements/1.1/}relation' at 0x000001DCDA1E3FB0>]


Keep in mind that findall() returns a list of elements, each one a fragment of the XML document, so we iterate over the result as before:

In [44]:
for child in relation:
    print(child.tag, child.text)

{http://purl.org/dc/elements/1.1/}relation cuadro1, cuadro2, cuadro3
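Besides findall(), ElementTree offers find() and findtext() for the common case where a single value is expected. The sketch below compares the three on a small inline document; the `record` wrapper and its contents are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"
root = ET.fromstring(
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:relation>cuadro1</dc:relation>"
    "<dc:relation>cuadro2</dc:relation>"
    "</record>"
)

all_relations = root.findall(DC + "relation")        # list of Element objects
first_relation = root.find(DC + "relation")          # first match, or None
first_text = root.findtext(DC + "relation")          # text of the first match
missing = root.findtext(DC + "rights", default="-")  # fallback for absent tags

print([e.text for e in all_relations], first_relation.text, first_text, missing)
```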


XML uses prefixes so that the full URL of a vocabulary does not have to be repeated every time. We can see this in the header:

```XML
<?xml version="1.0" encoding="UTF-8" standalone="no"?><?xml-stylesheet type="text/xsl" href="/webservices/catalog/xsl/searchRetrieveResponse.xsl"?>
<searchRetrieveResponse xmlns:oclcterms="http://purl.org/oclc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
```

For example, every time we want to use a Dublin Core term, we use the dc: prefix, which is equivalent to referring to the definition:

xmlns:dc="http://purl.org/dc/elements/1.1/"

With ElementTree in Python, however, tags must be written with the full URL. This can be cumbersome, so we can define a namespace mapping and use the prefix there as well:

In [45]:
namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'} # add more as needed

tree.find('dc:rights',namespaces).text

'Acceso al museo'
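With the prefix mapping in place, several fields can be read in one pass. The following sketch is only an illustration: the helper `read_fields`, the field list and the inline `record` document are assumptions, and "-" is used for empty or missing values as in the listings above.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {"dc": "http://purl.org/dc/elements/1.1/"}
FIELDS = ["title", "creator", "language", "subject"]

root = ET.fromstring(
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>El grito</dc:title>"
    "<dc:creator>Edvard Munch </dc:creator>"
    "<dc:language></dc:language>"
    "</record>"
)


def read_fields(element):
    """Map each field name to its stripped text, or '-' when empty or absent."""
    values = {}
    for name in FIELDS:
        text = element.findtext(f"dc:{name}", default="", namespaces=NAMESPACES)
        values[name] = (text or "").strip() or "-"
    return values


fields = read_fields(root)
print(fields)
```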

Besides tags and values, XML documents can carry attributes. Using the following example, let's see how to list the attributes and read their values:

In [46]:
dc_xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?><?xml-stylesheet type="text/xsl" href="/webservices/catalog/xsl/searchRetrieveResponse.xsl"?>
<searchRetrieveResponse xmlns:oclcterms="http://purl.org/oclc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<dc:contributor>asdsadsad</dc:contributor>
<dc:coverage>dfsd</dc:coverage>
<dc:creator>sadsa</dc:creator>
<dc:date>sadas</dc:date>
<dc:description atributo1="valor1" atributo2="valor2">sadsa</dc:description>
<dc:format>sadasd</dc:format>
<dc:identifier>sadsad</dc:identifier>
<dc:language>asdasd</dc:language>
<dc:publisher>wqewq</dc:publisher>
<dc:relation >wqeqw</dc:relation>
<dc:rights>ffefe</dc:rights>
<dc:source>vfvf</dc:source>
<dc:title>wqewqe</dc:title>
<dc:type>ewfrb</dc:type>
</searchRetrieveResponse>'''

tree2 = ET.fromstring(dc_xml)

In [47]:
tree2.find('dc:description',namespaces).attrib

{'atributo1': 'valor1', 'atributo2': 'valor2'}

Once you know the attribute names, you can extract their values. Attributes are useful for adding information about the content of a tag. For example, the language could be recorded as an attribute of the description.

In [48]:
print(tree2.find('dc:description',namespaces).attrib['atributo1'])
print(tree2.find('dc:description',namespaces).attrib['atributo2'])


valor1
valor2
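One common way to record the language is the standard xml:lang attribute. The sketch below assumes a small inline document with multilingual descriptions; the fallback code "und" (undetermined) is also an assumption. Note that ElementTree exposes xml:lang under its expanded name.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined "xml" prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

root = ET.fromstring(
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:description xml:lang="es">Pintura expresionista</dc:description>'
    '<dc:description xml:lang="en">Expressionist painting</dc:description>'
    "<dc:description>No language given</dc:description>"
    "</record>"
)

# Element.get() returns the default instead of raising KeyError when the attribute is absent.
languages = []
for desc in root.findall("{http://purl.org/dc/elements/1.1/}description"):
    lang = desc.get(XML_LANG, "und")
    languages.append(lang)
    print(lang, desc.text)
```

Using get() with a default avoids the KeyError that indexing attrib directly would raise for the third description.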


Let's analyse a more complex XML document, starting by downloading it:

In [49]:
import requests

response = requests.get('https://gist.githubusercontent.com/vivien/580729/raw/651d1b216357c0d7d9fc47075071fb482e11fb36/dublincore-example.xml')
if response.status_code == 200:
    with open("./dublincore-example.xml", 'wb') as f:
        f.write(response.content)

<div class="alert alert-warning" role="alert" style="margin: 10px">
<p>**Remember!**</p>

<p>Jupyter lets you run certain shell commands</p>
</div>

In [50]:
ls

 El volumen de la unidad D es DATA Torre
 El número de serie del volumen es: 4851-22EB

 Directorio de d:\UNI\DataScience\DLC

05/12/2024  23:53    <DIR>          .
05/12/2024  23:53    <DIR>          ..
05/12/2024  23:53            13.595 amt_prototype.xml
05/12/2024  23:53            24.132 datacite.xml
05/12/2024  23:52           175.167 delaCal_mdg648_metadata.ipynb
27/11/2024  17:29           605.084 DOI20242025.ipynb
05/12/2024  23:53            19.563 dublincore-example.xml
27/11/2024  18:41         1.085.074 imagen_test.jpg
27/11/2024  18:27         1.649.013 OAI-PMH-APIs2425.ipynb
               7 archivos      3.571.628 bytes
               2 dirs     544.235.520 bytes libres


And we load it in Python:

In [51]:
tree = ET.parse('dublincore-example.xml')
namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'} # add more as needed
for table in tree.iter('{http://www.loc.gov/zing/srw/}searchRetrieveResponse'):
    for child in table:
        print(child.tag, child.text)

{http://www.loc.gov/zing/srw/}version 1.1
{http://www.loc.gov/zing/srw/}numberOfRecords 33587
{http://www.loc.gov/zing/srw/}records 

{http://www.loc.gov/zing/srw/}nextRecordPosition 11
{http://www.loc.gov/zing/srw/}resultSetIdleTime None
{http://www.loc.gov/zing/srw/}echoedSearchRetrieveRequest 



In [52]:
all_records = tree.findall('{http://www.loc.gov/zing/srw/}records')
print(all_records)

[<Element '{http://www.loc.gov/zing/srw/}records' at 0x000001DCDA1F1D00>]


In [53]:
for table in tree.iter('{http://www.loc.gov/zing/srw/}record'):
    for child in table:
        print(child.tag, child.text)

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http:

In [54]:
for table in tree.iter('{http://www.loc.gov/zing/srw/}recordData'):
    for child in table:
        print(child.tag, child.text)

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 



In [55]:
for table in tree.iter('{http://www.loc.gov/zing/srw/}oclcdcs'):
    for child in table:
        print(child.tag, child.text)

{http://purl.org/dc/elements/1.1/}creator Snelling, Lauraine.
{http://purl.org/dc/elements/1.1/}date c2003
{http://purl.org/dc/elements/1.1/}description "Ruby Torvald sets out on a daunting journey with her young sister, Opal, to hopefully see their long-lost father once more and claim the promised inheritance. But instead of the treasure they expected, the sisters discover something most shocking." -- Book Cover.
{http://purl.org/dc/elements/1.1/}format 320 p. ; 22 cm.
{http://purl.org/dc/elements/1.1/}identifier 0764290762
{http://purl.org/dc/elements/1.1/}identifier 9780764290763
{http://purl.org/dc/elements/1.1/}identifier 0764222228
{http://purl.org/dc/elements/1.1/}identifier 9780764222221
{http://purl.org/dc/elements/1.1/}language eng
{http://purl.org/dc/elements/1.1/}publisher Bethany House Publishers
{http://purl.org/dc/elements/1.1/}relation Dakotah treasures ; 1
{http://purl.org/dc/elements/1.1/}subject Inheritance and succession--Fiction.
{http://purl.org/dc/elements/1.1/}s

In [56]:
# './/' searches all descendants; a bare leading '//' on an ElementTree raises a FutureWarning
table = tree.findall('.//{http://purl.org/dc/elements/1.1/}identifier')
for child in table:
    print(child.tag, child.text)

{http://purl.org/dc/elements/1.1/}identifier 0764290762
{http://purl.org/dc/elements/1.1/}identifier 9780764290763
{http://purl.org/dc/elements/1.1/}identifier 0764222228
{http://purl.org/dc/elements/1.1/}identifier 9780764222221
{http://purl.org/dc/elements/1.1/}identifier 0060277327
{http://purl.org/dc/elements/1.1/}identifier 9780060277321
{http://purl.org/dc/elements/1.1/}identifier 0060277335 (lib. bdg)
{http://purl.org/dc/elements/1.1/}identifier 9780060277338 (lib. bdg)
{http://purl.org/dc/elements/1.1/}identifier 0590189239 (hc)
{http://purl.org/dc/elements/1.1/}identifier 9780590189231 (hc)
{http://purl.org/dc/elements/1.1/}identifier 0671759353
{http://purl.org/dc/elements/1.1/}identifier 9780671759353
{http://purl.org/dc/elements/1.1/}identifier 0316236438 (lib. bdg.) 
{http://purl.org/dc/elements/1.1/}identifier 9780316236430 (lib. bdg.)
{http://purl.org/dc/elements/1.1/}identifier 0316236608 (pbk.)
{http://purl.org/dc/elements/1.1/}identifier 9780316236607 (pbk.)
{http://p



In [57]:
# './/' searches all descendants; a bare leading '//' on an ElementTree raises a FutureWarning
relation = tree.findall('.//{http://purl.org/dc/elements/1.1/}identifier')
for elem in relation:
    print(elem.tag, elem.text)

{http://purl.org/dc/elements/1.1/}identifier 0764290762
{http://purl.org/dc/elements/1.1/}identifier 9780764290763
{http://purl.org/dc/elements/1.1/}identifier 0764222228
{http://purl.org/dc/elements/1.1/}identifier 9780764222221
{http://purl.org/dc/elements/1.1/}identifier 0060277327
{http://purl.org/dc/elements/1.1/}identifier 9780060277321
{http://purl.org/dc/elements/1.1/}identifier 0060277335 (lib. bdg)
{http://purl.org/dc/elements/1.1/}identifier 9780060277338 (lib. bdg)
{http://purl.org/dc/elements/1.1/}identifier 0590189239 (hc)
{http://purl.org/dc/elements/1.1/}identifier 9780590189231 (hc)
{http://purl.org/dc/elements/1.1/}identifier 0671759353
{http://purl.org/dc/elements/1.1/}identifier 9780671759353
{http://purl.org/dc/elements/1.1/}identifier 0316236438 (lib. bdg.) 
{http://purl.org/dc/elements/1.1/}identifier 9780316236430 (lib. bdg.)
{http://purl.org/dc/elements/1.1/}identifier 0316236608 (pbk.)
{http://purl.org/dc/elements/1.1/}identifier 9780316236607 (pbk.)
{http://p



## XPath

XPath is a language for building expressions that traverse and process an XML document. The idea is similar to using regular expressions to select parts of plain text, except that XPath searches and selects according to the hierarchical structure of the XML. ElementTree supports the following subset of XPath:

<table border="1" class="docutils">
<colgroup>
<col width="30%">
<col width="70%">
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Syntax</th>
<th class="head">Meaning</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">tag</span></code></td>
<td>Selects all child elements with the given tag.
For example, <code class="docutils literal notranslate"><span class="pre">spam</span></code> selects all child elements
named <code class="docutils literal notranslate"><span class="pre">spam</span></code>, and <code class="docutils literal notranslate"><span class="pre">spam/egg</span></code> selects all
grandchildren named <code class="docutils literal notranslate"><span class="pre">egg</span></code> in all children named
<code class="docutils literal notranslate"><span class="pre">spam</span></code>.</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">*</span></code></td>
<td>Selects all child elements.  For example, <code class="docutils literal notranslate"><span class="pre">*/egg</span></code>
selects all grandchildren named <code class="docutils literal notranslate"><span class="pre">egg</span></code>.</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">.</span></code></td>
<td>Selects the current node.  This is mostly useful
at the beginning of the path, to indicate that it’s
a relative path.</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">//</span></code></td>
<td>Selects all subelements, on all levels beneath the
current  element.  For example, <code class="docutils literal notranslate"><span class="pre">.//egg</span></code> selects
all <code class="docutils literal notranslate"><span class="pre">egg</span></code> elements in the entire tree.</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">..</span></code></td>
<td>Selects the parent element.</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">[@attrib]</span></code></td>
<td>Selects all elements that have the given attribute.</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">[@attrib='value']</span></code></td>
<td>Selects all elements for which the given attribute
has the given value.  The value cannot contain
quotes.</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">[tag]</span></code></td>
<td>Selects all elements that have a child named
<code class="docutils literal notranslate"><span class="pre">tag</span></code>.  Only immediate children are supported.</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">[tag='text']</span></code></td>
<td>Selects all elements that have a child named
<code class="docutils literal notranslate"><span class="pre">tag</span></code> whose complete text content, including
descendants, equals the given <code class="docutils literal notranslate"><span class="pre">text</span></code>.</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">[position]</span></code></td>
<td>Selects all elements that are located at the given
position.  The position can be either an integer
(1 is the first position), the expression <code class="docutils literal notranslate"><span class="pre">last()</span></code>
(for the last position), or a position relative to
the last position (e.g. <code class="docutils literal notranslate"><span class="pre">last()-1</span></code>).</td>
</tr>
</tbody>
</table>
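The expressions in the table can be tried out on a small document. The sketch below uses a made-up `library` document (element names, ids and the helper `titles` are assumptions) and exercises one expression per row of the table.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<library>"
    '<book id="b1" lang="en"><title>Ruby</title><year>2003</year></book>'
    '<book id="b2"><title>Ruby Holler</title><year>2002</year></book>'
    '<shelf><book id="b3" lang="en"><title>Bad apple</title></book></shelf>'
    "</library>"
)


def titles(path):
    """Return the title text of every book matched by `path`."""
    return [book.findtext("title") for book in root.findall(path)]


direct = titles("book")                    # tag: direct children only
anywhere = titles(".//book")               # //: every level below the root
with_lang = titles(".//book[@lang]")       # [@attrib]
by_id = titles(".//book[@id='b2']")        # [@attrib='value']
with_year = titles(".//book[year]")        # [tag]
by_text = titles(".//book[title='Ruby']")  # [tag='text']
last_direct = titles("book[last()]")       # [position]
parents = [e.tag for e in root.findall(".//book[@id='b3']/..")]  # ..: parent element

print(direct, anywhere, with_lang, by_id, with_year, by_text, last_direct, parents)
```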

As you can see, you need to understand the hierarchy of the XML in order to get at the information.

Can you obtain the titles of the resources described in the XML?

<div class="alert alert-warning" role="alert" style="margin: 10px">
<p>**Hint**</p>

<p>Use './/' to search every level below the current element, followed by the namespace and the name of the element to find ({http://purl.org/dc/elements/1.1/}title)</p>
</div>

In [58]:
relation = tree.findall('.//{http://purl.org/dc/elements/1.1/}title')
for elem in relation:
    print(elem.tag, elem.text)

{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title Ruby Holler 
{http://purl.org/dc/elements/1.1/}title Through my eyes 
{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title I'm not Julia Roberts 
{http://purl.org/dc/elements/1.1/}title I am not Julia Roberts
{http://purl.org/dc/elements/1.1/}title The chaos king 
{http://purl.org/dc/elements/1.1/}title Bad apple 
{http://purl.org/dc/elements/1.1/}title The Wall and the Wing 
{http://purl.org/dc/elements/1.1/}title Good girls 


Do the same using the namespace prefix:

In [59]:
namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'} # add more as needed
relation = tree.findall('.//dc:title',namespaces)
for elem in relation:
    print(elem.tag, elem.text)

{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title Ruby Holler 
{http://purl.org/dc/elements/1.1/}title Through my eyes 
{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title I'm not Julia Roberts 
{http://purl.org/dc/elements/1.1/}title I am not Julia Roberts
{http://purl.org/dc/elements/1.1/}title The chaos king 
{http://purl.org/dc/elements/1.1/}title Bad apple 
{http://purl.org/dc/elements/1.1/}title The Wall and the Wing 
{http://purl.org/dc/elements/1.1/}title Good girls 
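The flat listings above lose track of which values belong to which record. A possible way to keep them together is to iterate over each record and search inside it. The sketch below assumes a reduced inline document that mimics the record structure seen earlier; the identifier values are placeholders, not real data.

```python
import xml.etree.ElementTree as ET

NS = {
    "srw": "http://www.loc.gov/zing/srw/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Reduced, made-up sample that follows the records/record/recordData/oclcdcs layout.
sample = """
<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/"
                        xmlns:dc="http://purl.org/dc/elements/1.1/">
  <records>
    <record><recordData><oclcdcs>
      <dc:title>Ruby </dc:title>
      <dc:identifier>ID-A1</dc:identifier>
      <dc:identifier>ID-A2</dc:identifier>
    </oclcdcs></recordData></record>
    <record><recordData><oclcdcs>
      <dc:title>Ruby Holler </dc:title>
      <dc:identifier>ID-B1</dc:identifier>
    </oclcdcs></recordData></record>
  </records>
</searchRetrieveResponse>
"""
root = ET.fromstring(sample)

books = []
for record in root.findall(".//srw:record", NS):
    # Searching from `record` keeps each title with its own identifiers.
    title = record.findtext(".//dc:title", default="", namespaces=NS).strip()
    identifiers = [e.text.strip() for e in record.findall(".//dc:identifier", NS)]
    books.append({"title": title, "identifiers": identifiers})

for book in books:
    print(book["title"], book["identifiers"])
```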


## Example with EML

In [60]:
import requests

response = requests.get('https://zenodo.org/record/841691/files/amt_prototype.xml')
if response.status_code == 200:
    with open("./amt_prototype.xml", 'wb') as f:
        f.write(response.content)
        


In [61]:
ls

 El volumen de la unidad D es DATA Torre
 El número de serie del volumen es: 4851-22EB

 Directorio de d:\UNI\DataScience\DLC

05/12/2024  23:53    <DIR>          .
05/12/2024  23:53    <DIR>          ..
05/12/2024  23:53            13.595 amt_prototype.xml
05/12/2024  23:53            24.132 datacite.xml
05/12/2024  23:52           175.167 delaCal_mdg648_metadata.ipynb
27/11/2024  17:29           605.084 DOI20242025.ipynb
05/12/2024  23:53            19.563 dublincore-example.xml
27/11/2024  18:41         1.085.074 imagen_test.jpg
27/11/2024  18:27         1.649.013 OAI-PMH-APIs2425.ipynb
               7 archivos      3.571.628 bytes
               2 dirs     544.235.520 bytes libres


In more complex standards, such as EML, the base XML can have a nested hierarchy. Each element can then have from 0 to N children, which in turn form new subtrees.
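A nested hierarchy like this can be walked recursively, recording the path to each leaf. The sketch below is only an illustration on a shortened inline sample; `walk` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Shortened, hand-written sample with an EML-like nesting.
sample = (
    "<dataset>"
    "<title>water reservoir of Cuerda del Pozo</title>"
    "<creator><individualName>"
    "<givenName>Jesus Marco</givenName><surName>De Lucas</surName>"
    "</individualName><organizationName>IFCA</organizationName></creator>"
    "</dataset>"
)


def walk(element, path=""):
    """Yield (path, text) for every leaf element at or below `element`."""
    current = f"{path}/{element.tag}"
    if len(element) == 0:  # no children: this is a leaf
        yield current, (element.text or "").strip()
    for child in element:
        yield from walk(child, current)


leaves = list(walk(ET.fromstring(sample)))
for path, text in leaves:
    print(path, "=", text)
```

Keeping the full path makes it possible to tell apart elements that share a name but sit under different parents.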

In [62]:
tree = ET.parse('amt_prototype.xml')
root = tree.getroot()

for table in root.iter():
    for child in table:
        if len(child)==0:
            print(child.tag, child.text)

alternateIdentifier 
10.5281/zenodo.841183

title water reservoir of Cuerda del Pozo
organizationName IFCA
electronicMailAddress marco@ifca.unican.es
salutation Mr
givenName Jesus Marco
surName De Lucas
deliveryPoint Avda Castros s/n
city Santander
postalCode 39005
country Spain
organizationName IFCA
electronicMailAddress aguilarf@ifca.unican.es
role guardian
givenName Fernando
surName Aguilar
deliveryPoint Avda Castros s/n
city Santander
postalCode 39005
country Spain
para The CTD 60 is a precision probe for oceanographic and limnological measurements of physical, chemical and optical parameters up to a depth of 2000 m. It allows the simultaneous measurement of following parameters: Pressure (depth), temperature, conductivity, raw O2, REDOX, dissolved oxygen, pH, Oxigen Saturation, Salinity.
keyword measure
keyword water reservoir
keyword sensor
keyword physical and chemical parameters
geographicDescription water reservoir
westBoundingCoordinate -3.75
eastBoundingCoordinate -2.375
nor

Explore a little: project name, authors, list of attributes...

In [63]:
#elementos = tree.findall('/dataset[1]/creator/individualName/salutation')
elementos = tree.findall('.//attributeList/attribute[@id="1465311292527"]/attributeName')
for e in elementos:
    print(e.tag + ":", e.text)

attributeName: date


In [64]:
elementos = tree.findall('.//dataset')
for e in elementos:
    print(e.tag + ":", e.text)
    for i in e.iter():
        print(i.tag + ":", i.text)

dataset:  

dataset:  

title: water reservoir of Cuerda del Pozo
creator:  
individualName: None
salutation: Mr
givenName: Jesus Marco
surName: De Lucas
organizationName: IFCA
address: None
deliveryPoint: Avda Castros s/n
city: Santander
postalCode: 39005
country: Spain
electronicMailAddress: marco@ifca.unican.es
associatedParty: None
individualName: None
givenName: Fernando
surName: Aguilar
organizationName: IFCA
address: None
deliveryPoint: Avda Castros s/n
city: Santander
postalCode: 39005
country: Spain
electronicMailAddress: aguilarf@ifca.unican.es
role: guardian
abstract: None
para: The CTD 60 is a precision probe for oceanographic and limnological measurements of physical, chemical and optical parameters up to a depth of 2000 m. It allows the simultaneous measurement of following parameters: Pressure (depth), temperature, conductivity, raw O2, REDOX, dissolved oxygen, pH, Oxigen Saturation, Salinity.
keywordSet: None
keyword: measure
keyword: water reservoir
keyword: sensor
key

In [65]:
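# Look up the existing <dataset> element; ET.SubElement would append a new, empty one.
dataset = root.find('dataset')
# Print each element of the dataset (the original loop printed a stale `child` variable).
for child in dataset.iter():
    print(child.tag, child.text)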


# Personal exercise

Author: Miguel de la Cal García

In [66]:
import xml.etree.ElementTree as ET

## Exercise 1

Using the full example of the DataCite metadata schema, print the elements that are equivalent to those proposed by Dublin Core (one per line). You may need to combine several fields of the metadata file into one (for example, for coverage, the coordinates plus the place name).

* Title: Example Title
* Creator: ExampleFamilyName, ExampleGivenName; ExampleOrganization
* Subject: FOS: Computer and information sciences; Digital curation and preservation; Example Subject
* Description: Example Abstract. Example Methods. Example SeriesInformation. Example TableOfContents. Example TechnicalInfo. Example Other
* Publisher: Example Publisher
* Contributor: ExampleFamilyName, ExampleGivenName; ExampleFamilyName, ExampleGivenName; ExampleFamilyName, ExampleGivenName; ExampleFamilyName, ExampleGivenName; ExampleOrganization; ExampleFamilyName, ExampleGivenName; ExampleOrganization; ExampleFamilyName, ExampleGivenName; ExampleFamilyName, ExampleGivenName; ExampleFamilyName, ExampleGivenName; ExampleFamilyName, ExampleGivenName; DataCite; International DOI Foundation; ExampleFamilyName, ExampleGivenName; ExampleFamilyName, ExampleGivenName; ExampleContributor; ExampleFamilyName, ExampleGivenName; ExampleContributor; ExampleFamilyName, ExampleGivenName; ExampleOrganization; ExampleFamilyName, ExampleGivenName
* Date: 2023-01-01
* Type: Example ResourceType
* Format: application/xml; text/plain
* Identifier: 10.82433/B09Z-4K37
* Source: 10.1016/j.epsl.2011.11.037
* Language: en
* Relation: ark:/13030/tqb3kh97gh8w; arXiv:0706.0001; 2018AGUFM.A24K..07S; 10.1016/j.epsl.2011.11.037; 9783468111242; 1562-6865; 10013/epic.10033; IECUR0097; 978-3-905673-82-1; 0077-5606; 0A9 2002 12B4A105 7; 1188-1534; urn:lsid:ubio.org:namebank:11815; 12082125; http://purl.oclc.org/foo/bar; 123456789999; http://www.heatflow.und.edu/index2.html; urn:nbn:de:101:1-201102033592; https://w3id.org/games/spec/coil#Coil_Bomb_Die_Of_Age; 10.1016/j.epsl.2011.11.037; 10.1016/j.epsl.2011.11.037; 10.1016/j.epsl.2011.11.037; 10.1016/j.epsl.2011.11.037; 10.1016/j.epsl.2011.11.037; 10.1016/j.epsl.2011.11.037; 10.1016/j.epsl.2011.11.037; 10.1016/j.epsl.2011.11.037; 10.1016/j.epsl.2011.11.037; 10.1016/j.epsl.2011.11.037; 10.1016/j.epsl.2011.11.037; 10.1016/j.epsl.2011.11.037; 10.1016/j.epsl.2011.11.037; 10.1016/j.epsl.2011.11.037; 10.1016/j.epsl.2011.11.037; 10.1016/j.epsl.2011.11.037; 10.1016/j.epsl.2011.11.037
* Coverage: (49.2827, -123.1207) Vancouver, British Columbia, Canada
* Rights: Creative Commons Attribution 4.0 International

Resource: https://schema.datacite.org/meta/kernel-4.5/example/datacite-example-full-v4.xml

In [67]:
import requests

response = requests.get('https://schema.datacite.org/meta/kernel-4.5/example/datacite-example-full-v4.xml')
if response.status_code == 200:
    with open("./datacite.xml", 'wb') as f:
        f.write(response.content)

First, I will inspect the XML content by eye to find out which information can be mapped to Dublin Core.

In [68]:
# Serialise the XML tree to a single string to inspect its content.
# Note: at this point `tree` still refers to the last parsed file
# (amt_prototype.xml); datacite.xml is parsed in the next cell.
xml_str = ET.tostring(tree.getroot(), encoding='unicode')

# Print the XML
print(xml_str)


<ns0:eml xmlns:ns0="eml://ecoinformatics.org/eml-2.1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" packageId="thesis11.1" system="knb" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd"> 
<resource>
<alternateIdentifier>
10.5281/zenodo.841183
</alternateIdentifier>
</resource>

<dataset> 
<title>water reservoir of Cuerda del Pozo</title>
 <creator id="1465301815644"> <individualName><salutation>Mr</salutation>
<givenName>Jesus Marco</givenName>
<surName>De Lucas</surName>
</individualName>
<organizationName>IFCA</organizationName>
<address><deliveryPoint>Avda Castros s/n</deliveryPoint>
<city>Santander</city>
<postalCode>39005</postalCode>
<country>Spain</country>
</address>
<electronicMailAddress>marco@ifca.unican.es</electronicMailAddress>
</creator>
 <associatedParty id="1465302104227"><individualName><givenName>Fernando</givenName>
<surName>Aguilar</surName>
</individualName>
<organizationName>IFCA</organizationName>
<address><deliveryPoint>Avda Castros s/n<

In [69]:
tree = ET.parse('datacite.xml')
root = tree.getroot()
namespaces = {'dc': 'http://purl.org/dc/elements/1.1/', 'datacite': 'http://datacite.org/schema/kernel-4'}


In [70]:
# Title
# The document lists a title, a subtitle,
# a title translated into French and an alternative title.
# I choose to take only the first one, the main title.
title = root.find('datacite:titles/datacite:title', namespaces)
print(f"Title: {title.text}")

# Creator
creators = root.findall('datacite:creators/datacite:creator/datacite:creatorName', namespaces)
for creator in creators:
    print(f"Creator: {creator.text}")

# Subject
subjects = root.findall('datacite:subjects/datacite:subject', namespaces)
for subject in subjects:
    print(f"Subject: {subject.text}")

# Description
descriptions = root.findall('datacite:descriptions/datacite:description', namespaces)
for description in descriptions:
    print(f"Description: {description.text}")

# Publisher
publisher = root.find('datacite:publisher', namespaces)
print(f"Publisher: {publisher.text}")

# Contributor
contributors = root.findall('datacite:contributors/datacite:contributor/datacite:contributorName', namespaces)
for contributor in contributors:
    print(f"Contributor: {contributor.text}")

# Date
dates = root.findall('datacite:dates/datacite:date', namespaces)
# The document has several dates, many of them identical (publication, update,
# etc.). We select the first one, which is the acceptance date
print(f"Date: {dates[0].text}")

# Type
resource_type = root.find('datacite:resourceType', namespaces)
print(f"Type: {resource_type.text if resource_type is not None else 'N/A'}")

# Format
formats = root.findall('datacite:formats/datacite:format', namespaces)
for fmt in formats:  # `fmt` avoids shadowing the built-in format()
    print(f"Format: {fmt.text}")

# Identifier
identifier = root.find('datacite:identifier', namespaces)
print(f"Identifier: {identifier.text}")

# Source
# There are related items, but none is clearly listed
# as a source, so I take the closest match:
# a related item from which the resource is derived.
# Find the relatedIdentifier with relationType = "IsDerivedFrom"
source = root.find('datacite:relatedIdentifiers/datacite:relatedIdentifier[@relationType="IsDerivedFrom"]', namespaces)
print(f"Source: {source.text}")

# Language
# Only one language is listed: English
language = root.find('datacite:language', namespaces)
print(f"Language: {language.text}")

# Relation
# Several related-document identifiers are repeated;
# I show all of them.
# Note: this also includes the IsDerivedFrom entry used above as Source
relations = root.findall('datacite:relatedIdentifiers/datacite:relatedIdentifier', namespaces)
for relation in relations:
    print(f"Relation: {relation.text}")

# Coverage
# There seems to be only one geographic location,
# with several coordinates attached (a point,
# an area, etc.). I use the point coordinates to represent it.
coverage = root.find('datacite:geoLocations/datacite:geoLocation/datacite:geoLocationPlace', namespaces)
lat = root.find('datacite:geoLocations/datacite:geoLocation/datacite:geoLocationPoint/datacite:pointLatitude', namespaces)
lon = root.find('datacite:geoLocations/datacite:geoLocation/datacite:geoLocationPoint/datacite:pointLongitude', namespaces)
print(f"Coverage: ({lat.text}, {lon.text}) {coverage.text}")

# Rights
rights = root.find('datacite:rightsList/datacite:rights', namespaces)
print(f"Rights: {rights.text}")

Title: Example Title
Creator: ExampleFamilyName, ExampleGivenName
Creator: ExampleOrganization
Subject: FOS: Computer and information sciences
Subject: Digital curation and preservation
Subject: Example Subject
Description: Example Abstract
Description: Example Methods
Description: Example SeriesInformation
Description: Example TableOfContents
Description: Example TechnicalInfo
Description: Example Other
Publisher: Example Publisher
Contributor: ExampleFamilyName, ExampleGivenName
Contributor: ExampleFamilyName, ExampleGivenName
Contributor: ExampleFamilyName, ExampleGivenName
Contributor: ExampleFamilyName, ExampleGivenName
Contributor: ExampleOrganization
Contributor: ExampleFamilyName, ExampleGivenName
Contributor: ExampleOrganization
Contributor: ExampleFamilyName, ExampleGivenName
Contributor: ExampleFamilyName, ExampleGivenName
Contributor: ExampleFamilyName, ExampleGivenName
Contributor: ExampleFamilyName, ExampleGivenName
Contributor: DataCite
Contributor: International DOI Fou

Above, for clarity, I list every value on its own line. Condensed to one line per Dublin Core element, it looks like this:

In [71]:
# Title
title = root.find('datacite:titles/datacite:title', namespaces)
print(f"Title: {title.text}")

# Creator
creators = root.findall('datacite:creators/datacite:creator/datacite:creatorName', namespaces)
creators_str = '; '.join([creator.text for creator in creators])
print(f"Creator: {creators_str}")

# Subject
subjects = root.findall('datacite:subjects/datacite:subject', namespaces)
subjects_str = '; '.join([subject.text for subject in subjects])
print(f"Subject: {subjects_str}")

# Description
descriptions = root.findall('datacite:descriptions/datacite:description', namespaces)
descriptions_str = '. '.join([description.text for description in descriptions])
print(f"Description: {descriptions_str}")

# Publisher
publisher = root.find('datacite:publisher', namespaces)
print(f"Publisher: {publisher.text}")

# Contributor
contributors = root.findall('datacite:contributors/datacite:contributor/datacite:contributorName', namespaces)
contributors_str = '; '.join([contributor.text for contributor in contributors])
print(f"Contributor: {contributors_str}")

# Date
dates = root.findall('datacite:dates/datacite:date', namespaces)
print(f"Date: {dates[0].text}")

# Type
resource_type = root.find('datacite:resourceType', namespaces)
print(f"Type: {resource_type.text if resource_type is not None else 'N/A'}")

# Format
formats = root.findall('datacite:formats/datacite:format', namespaces)
formats_str = '; '.join([fmt.text for fmt in formats])
print(f"Format: {formats_str}")

# Identifier
identifier = root.find('datacite:identifier', namespaces)
print(f"Identifier: {identifier.text}")

# Source
source = root.find('datacite:relatedIdentifiers/datacite:relatedIdentifier[@relationType="IsDerivedFrom"]', namespaces)
print(f"Source: {source.text}")

# Language
# Only one language is listed: English
language = root.find('datacite:language', namespaces)
print(f"Language: {language.text}")

# Relation
relations = root.findall('datacite:relatedIdentifiers/datacite:relatedIdentifier', namespaces)
relations_str = '; '.join([relation.text for relation in relations])
print(f"Relation: {relations_str}")

# Coverage
coverage = root.find('datacite:geoLocations/datacite:geoLocation/datacite:geoLocationPlace', namespaces)
lat = root.find('datacite:geoLocations/datacite:geoLocation/datacite:geoLocationPoint/datacite:pointLatitude', namespaces)
lon = root.find('datacite:geoLocations/datacite:geoLocation/datacite:geoLocationPoint/datacite:pointLongitude', namespaces)
print(f"Coverage: ({lat.text}, {lon.text}) {coverage.text}")

# Rights
rights = root.find('datacite:rightsList/datacite:rights', namespaces)
print(f"Rights: {rights.text}")

Title: Example Title
Creator: ExampleFamilyName, ExampleGivenName; ExampleOrganization
Subject: FOS: Computer and information sciences; Digital curation and preservation; Example Subject
Description: Example Abstract. Example Methods. Example SeriesInformation. Example TableOfContents. Example TechnicalInfo. Example Other
Publisher: Example Publisher
Contributor: ExampleFamilyName, ExampleGivenName; ExampleFamilyName, ExampleGivenName; ExampleFamilyName, ExampleGivenName; ExampleFamilyName, ExampleGivenName; ExampleOrganization; ExampleFamilyName, ExampleGivenName; ExampleOrganization; ExampleFamilyName, ExampleGivenName; ExampleFamilyName, ExampleGivenName; ExampleFamilyName, ExampleGivenName; ExampleFamilyName, ExampleGivenName; DataCite; International DOI Foundation; ExampleFamilyName, ExampleGivenName; ExampleFamilyName, ExampleGivenName; ExampleContributor; ExampleFamilyName, ExampleGivenName; ExampleContributor; ExampleFamilyName, ExampleGivenName; ExampleOrganization; ExampleFam

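The repeated find-and-join pattern above could also be driven by a table of paths. This is only a sketch under assumptions: `dc_summary`, `DC_PATHS`, `NS` and the inline `sample` are hypothetical names introduced here, only three Dublin Core elements are covered, and, unlike the cell above, every matching title would be joined rather than just the first.

```python
import xml.etree.ElementTree as ET

NS = {'datacite': 'http://datacite.org/schema/kernel-4'}

# Hypothetical mapping from a Dublin Core element to a DataCite path
DC_PATHS = {
    'Title': 'datacite:titles/datacite:title',
    'Creator': 'datacite:creators/datacite:creator/datacite:creatorName',
    'Publisher': 'datacite:publisher',
}

def dc_summary(root, paths=DC_PATHS, sep='; '):
    """Return {element: joined text}, using 'N/A' when nothing is found."""
    summary = {}
    for name, path in paths.items():
        values = [e.text.strip() for e in root.findall(path, NS) if e.text]
        summary[name] = sep.join(values) if values else 'N/A'
    return summary

# Made-up miniature record, not the DataCite example file
sample = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <titles><title>Example Title</title></titles>
  <creators>
    <creator><creatorName>A, B</creatorName></creator>
    <creator><creatorName>Org</creatorName></creator>
  </creators>
</resource>"""

for name, value in dc_summary(ET.fromstring(sample)).items():
    print(f"{name}: {value}")
```

Falling back to 'N/A' also avoids an `AttributeError` when an element is missing from the record.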
## Exercise 2

List every tag in the XML document together with its attributes (if it has any).

In [72]:
for elem in tree.iter():
    print(f"Tag: {elem.tag}, Attributos: {elem.attrib}")

Tag: {http://datacite.org/schema/kernel-4}resource, Attributos: {'{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4/metadata.xsd'}
Tag: {http://datacite.org/schema/kernel-4}identifier, Attributos: {'identifierType': 'DOI'}
Tag: {http://datacite.org/schema/kernel-4}creators, Attributos: {}
Tag: {http://datacite.org/schema/kernel-4}creator, Attributos: {}
Tag: {http://datacite.org/schema/kernel-4}creatorName, Attributos: {'nameType': 'Personal'}
Tag: {http://datacite.org/schema/kernel-4}givenName, Attributos: {}
Tag: {http://datacite.org/schema/kernel-4}familyName, Attributos: {}
Tag: {http://datacite.org/schema/kernel-4}nameIdentifier, Attributos: {'nameIdentifierScheme': 'ORCID', 'schemeURI': 'https://orcid.org'}
Tag: {http://datacite.org/schema/kernel-4}affiliation, Attributos: {'affiliationIdentifier': 'https://ror.org/04wxnsj81', 'affiliationIdentifierScheme': 'ROR', 'schemeURI': 'https://ror.org

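ElementTree reports each tag with its namespace URI in braces, which makes the listing hard to read. A small sketch of stripping that prefix; `local_name` and the inline `sample` are hypothetical and introduced only for illustration, and namespaced attribute keys are left untouched:

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.split('}', 1)[1] if tag.startswith('{') else tag

# Made-up miniature record, not the DataCite example file
sample = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.1234/example</identifier>
</resource>"""

listing = [(local_name(e.tag), dict(e.attrib)) for e in ET.fromstring(sample).iter()]
for tag, attrs in listing:
    print(f"Tag: {tag}, Attributes: {attrs}")
```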
## Exercise 3

Show the different identifiers this document contains, in this format: Identificador [tipo] = [identificador]

Example: Identificador DOI = 10.3122/121321

In [73]:
identifiers = tree.findall('.//datacite:identifier', namespaces)
for identifier in identifiers:
    id_type = identifier.attrib.get('identifierType', 'Unknown')
    print(f"Identificador {id_type} = {identifier.text}")
related_identifiers = tree.findall('.//datacite:relatedIdentifier', namespaces)
for identifier in related_identifiers:
    id_type = identifier.attrib.get('relatedIdentifierType', 'Unknown')
    print(f"Identificador de relación {id_type} = {identifier.text}")

Identificador DOI = 10.82433/B09Z-4K37
Identificador de relación ARK = ark:/13030/tqb3kh97gh8w
Identificador de relación arXiv = arXiv:0706.0001
Identificador de relación bibcode = 2018AGUFM.A24K..07S
Identificador de relación DOI = 10.1016/j.epsl.2011.11.037
Identificador de relación EAN13 = 9783468111242
Identificador de relación EISSN = 1562-6865
Identificador de relación Handle = 10013/epic.10033
Identificador de relación IGSN = IECUR0097
Identificador de relación ISBN = 978-3-905673-82-1
Identificador de relación ISSN = 0077-5606
Identificador de relación ISTC = 0A9 2002 12B4A105 7
Identificador de relación LISSN = 1188-1534
Identificador de relación LSID = urn:lsid:ubio.org:namebank:11815
Identificador de relación PMID = 12082125
Identificador de relación PURL = http://purl.oclc.org/foo/bar
Identificador de relación UPC = 123456789999
Identificador de relación URL = http://www.heatflow.und.edu/index2.html
Identificador de relación URN = urn:nbn:de:101:1-201102033592
Identificador

Several related-document identifiers are repeated, but this seems to be expected, since the same list appears several times in the XML itself.

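If the repeats get in the way, one option is to count each identifier and print it once. A hedged sketch: the inline `sample` is invented, and it assumes the repeats differ only in attributes such as `relationType`, which has not been verified against the real file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = {'datacite': 'http://datacite.org/schema/kernel-4'}

# Made-up miniature record with one repeated identifier
sample = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="Cites">10.1234/a</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsCitedBy">10.1234/a</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="URL" relationType="Cites">http://example.org</relatedIdentifier>
  </relatedIdentifiers>
</resource>"""

related = ET.fromstring(sample).findall('.//datacite:relatedIdentifier', NS)
# Count each (type, value) pair; a Counter keeps first-seen order
counts = Counter((e.get('relatedIdentifierType', 'Unknown'), e.text) for e in related)
for (id_type, value), n in counts.items():
    print(f"Identificador de relación {id_type} = {value} (x{n})")
```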
## Exercise 4

Modify the XML document "amt_prototype.xml" so that every attribute includes its unit (fill it in where necessary).

Once edited, show the list of attributes with their units.

Example:

"Atribute: Salinity | Unit: psu"

In [74]:
import xml.etree.ElementTree as ET

# Load the XML file
tree = ET.parse('amt_prototype.xml')
root = tree.getroot()

# Inspect the content by eye to see which attributes exist
# and decide which units to assign
xml_str = ET.tostring(tree.getroot(), encoding='unicode')
print(xml_str)


<ns0:eml xmlns:ns0="eml://ecoinformatics.org/eml-2.1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" packageId="thesis11.1" system="knb" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd"> 
<resource>
<alternateIdentifier>
10.5281/zenodo.841183
</alternateIdentifier>
</resource>

<dataset> 
<title>water reservoir of Cuerda del Pozo</title>
 <creator id="1465301815644"> <individualName><salutation>Mr</salutation>
<givenName>Jesus Marco</givenName>
<surName>De Lucas</surName>
</individualName>
<organizationName>IFCA</organizationName>
<address><deliveryPoint>Avda Castros s/n</deliveryPoint>
<city>Santander</city>
<postalCode>39005</postalCode>
<country>Spain</country>
</address>
<electronicMailAddress>marco@ifca.unican.es</electronicMailAddress>
</creator>
 <associatedParty id="1465302104227"><individualName><givenName>Fernando</givenName>
<surName>Aguilar</surName>
</individualName>
<organizationName>IFCA</organizationName>
<address><deliveryPoint>Avda Castros s/n<

Exploring the content, I see that every attribute already has a unit assigned, except rawO2 (mg/L), Oxygen Saturation (a percentage) and Redox, which I assume is the redox potential in volts.

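The same check can be sketched in code rather than by eye, by listing the attributes whose unit is still `dimensionless`. The fragment below is a hypothetical imitation of the `attributeList` structure, not the real file, and the result still needs manual review, because some quantities (pH, for instance) are legitimately dimensionless.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment imitating the attributeList structure
sample = """<attributeList>
  <attribute><attributeName>Temperature</attributeName>
    <measurementScale><interval><unit><standardUnit>celsius</standardUnit></unit></interval></measurementScale>
  </attribute>
  <attribute><attributeName>Redox</attributeName>
    <measurementScale><ratio><unit><standardUnit>dimensionless</standardUnit></unit></ratio></measurementScale>
  </attribute>
</attributeList>"""

sample_root = ET.fromstring(sample)
pending = []
for attribute in sample_root.findall('.//attribute'):
    # './/standardUnit' matches under either interval or ratio
    unit = attribute.find('.//standardUnit')
    if unit is not None and unit.text == 'dimensionless':
        pending.append(attribute.find('attributeName').text)
print(pending)
```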
In [75]:
units = {"rawO2": "milligramsPerLiter", "Oxygen Saturation": "percent", "Redox": "Volts"}

# Replace "dimensionless" with the proper unit for these attributes:
# rawO2 -> milligramsPerLiter, Oxygen Saturation -> percent, Redox -> Volts
for table in root.iter():
    for child in table:
        if child.tag == 'attributeList':
            for attribute in child:
                name = attribute.find('attributeName').text
                # Only attributes listed in the units dictionary are updated
                if name in units:
                    # The unit may sit under either an interval or a ratio scale
                    for scale in ('interval', 'ratio'):
                        unit_elem = attribute.find(f'measurementScale/{scale}/unit/standardUnit')
                        if unit_elem is not None:
                            unit_elem.text = units[name]

# Write the updated content to a new file
tree.write('amt_prototype_nuevo.xml')

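When writing the edited tree, `ElementTree.write` accepts `encoding` and `xml_declaration` arguments, which can help keep non-ASCII text unambiguous. A minimal round-trip sketch, using a temporary directory and a made-up document rather than the notebook's files:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Made-up document with a non-ASCII character
sample_tree = ET.ElementTree(ET.fromstring("<dataset><title>Título</title></dataset>"))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_out.xml")  # hypothetical file name
    # Write UTF-8 with an explicit XML declaration, then read the file back
    sample_tree.write(path, encoding="utf-8", xml_declaration=True)
    reloaded_title = ET.parse(path).getroot().find("title").text

print(reloaded_title)
```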
In [76]:
# Show the list of attributes with their units
# Candidate paths for the unit, tried in this order
unit_paths = [
    'measurementScale/ratio/unit/standardUnit',
    'measurementScale/interval/unit/customUnit',
    'measurementScale/interval/unit/standardUnit',
]
for attribute in root.findall('.//attribute'):
    attribute_name = attribute.find('attributeName').text
    # Fall back to N/A when no path matches, as happens with date, which has no unit
    unit = 'N/A'
    for path in unit_paths:
        unit_elem = attribute.find(path)
        if unit_elem is not None:
            unit = unit_elem.text
            break
    print(f"Atribute: {attribute_name} | Unit: {unit}")

Atribute: date | Unit: N/A
Atribute: Temperature | Unit: celsius
Atribute: Press | Unit: dbar
Atribute: Conductivity | Unit: mS/cm
Atribute: Salinity | Unit: psu
Atribute: Dissolved oxygen | Unit: milligramsPerLiter
Atribute: rawO2 | Unit: milligramsPerLiter
Atribute: Oxygen Saturation | Unit: percent
Atribute: ph | Unit: dimensionless
Atribute: Redox | Unit: Volts
