<a href="https://colab.research.google.com/github/pocketfall/biodata/blob/main/testing_pydwca.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Testing pydwca 0.3.0
* # **Summary of results at the end of the notebook**
* ## Using the .zip archives dwca-plagioscion and dwca-pygmy_jellyfish
* ## EML files taken from the same .zip archives
* ## Testing was initially done locally in PyCharm; the code was then adapted for Google Colab
* ## Documentation:
> https://pydwca.readthedocs.io/en/latest/index.html
* ## References
> * CSIRO National Collections and Marine Infrastructure (NCMI) Information and Data Centre (IDC): A new pygmy species of box jellyfish (Cubozoa: Chirodropida) from sub-tropical Australia https://doi.org/10.15468/bdr9bd accessed via GBIF.org on 2024-08-12.
> * Lilian Casatti, Stierhof T (2005). Revision of the South American freshwater genus Plagioscion (Teleostei, Perciformes, Sciaenidae). Plazi.org taxonomic treatments database. Checklist dataset https://doi.org/10.15468/wk03zz accessed via GBIF.org on 2024-08-12.

# dwca module
## 1.
### Reading dwca-plagioscion.zip with DarwinCoreArchive

### Example given in the documentation:
```
from dwca import DarwinCoreArchive

darwin_core = DarwinCoreArchive.from_archive("DwCArchive.zip")
```

In pydwca 0.3.0, `.from_archive()` is not a valid method, so `.from_file()` is used instead.
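
One quick way to see which alternate constructors a class actually exposes, instead of relying on the documentation, is to list its `from_*` attributes. This is a minimal sketch: `alt_constructors` is a hypothetical helper written for this notebook, and it is demonstrated on the built-in `int` so that it runs without pydwca installed.

```python
def alt_constructors(cls) -> list:
    """Return the names of the attributes of cls that start with 'from_'."""
    return sorted(name for name in dir(cls) if name.startswith("from_"))


# Demo on a built-in type; in this notebook the call would be
# alt_constructors(DarwinCoreArchive)
found = alt_constructors(int)
print(found)
```

Running the same helper on `DarwinCoreArchive` should show whether `from_file` or `from_archive` is present in the installed version.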

In [None]:
# dependencies
try:
  from dwca import DarwinCoreArchive
except ImportError:
  # install pydwca only when it is missing from the environment
  %pip install pydwca
  from dwca import DarwinCoreArchive
%pip show pydwca

from pathlib import Path
from os import walk, mkdir, path
import requests

Collecting pydwca
  Downloading pydwca-0.3.0-py3-none-any.whl.metadata (4.5 kB)
Collecting datetime-interval>=0.2 (from pydwca)
  Downloading datetime-interval-0.2.tar.gz (4.2 kB)
  Preparing metadata (setup.py) ... done
Downloading pydwca-0.3.0-py3-none-any.whl (129 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.9/129.9 kB 1.7 MB/s eta 0:00:00
Building wheels for collected packages: datetime-interval
  Building wheel for datetime-interval (setup.py) ... done
  Created wheel for datetime-interval: filename=datetime_interval-0.2-py3-none-any.whl size=5157 sha256=74a475aaa77fa17acbd6187e743a960088524a2388a7ef4a48c76306fbe9e984
  Stored in directory: /root/.cache/pip/wheels/a1/19/33/73c880a0d97af4c8e0739141dd70914bf9b7339ccce84fe01e
Successfully built datetime-interval
Installing collected packages: datetime-interval, pydwca
Successfully installed datetime-interval-0.2 pydwca-0.3.0
Name: pydwca
Version: 0.3.0
Summa

In [None]:
# function to download data
def download_data(source: str,
                  is_list: bool = False) -> Path:
    """Downloads a file from source into the local data/ folder.

    Args:
        source (str): A link to a file containing data.
        is_list (bool): Pass True when calling once per item of a list of
            sources, so each file is downloaded even if data/ already exists.

    Returns:
        pathlib.Path to the data/ folder that holds the downloaded file.

    Example usage:
        download_data(source="https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip")
    """
    # Setup path to data folder
    data_path = Path("data/")

    # Skip the download only when data/ exists and a single source was requested
    if data_path.is_dir() and is_list is False:
        print(f"[INFO] {data_path} directory exists, skipping download.")
    else:
        print(f"[INFO] Did not find {data_path / source.split('/')[-1]} directory, creating one...")
        data_path.mkdir(parents=True, exist_ok=True)

        # Download data
        target_file = Path(source).name
        with open(data_path / target_file, "wb") as f:
            request = requests.get(source)
            print(f"[INFO] Downloading {target_file} from {source}...")
            f.write(request.content)
    return data_path

In [None]:
# getting the paths of the .zip and .xml files
sources = ["https://github.com/pocketfall/biodata/raw/main/data/dwca_colab/dwca-pygmy_jellyfish-v1.5.zip",
           "https://github.com/pocketfall/biodata/raw/main/data/dwca_colab/eml_plag.xml",
           "https://github.com/pocketfall/biodata/raw/main/data/dwca_colab/eml_pyg.xml",
           "https://github.com/pocketfall/biodata/raw/main/data/dwca_colab/dwca-plagioscion.zip"]
source_paths = [download_data(source= i, is_list= True) / i.split("/")[-1] for i in sources]
source_paths

[INFO] Did not find data/dwca-pygmy_jellyfish-v1.5.zip directory, creating one...
[INFO] Downloading dwca-pygmy_jellyfish-v1.5.zip from https://github.com/pocketfall/biodata/raw/main/data/dwca_colab/dwca-pygmy_jellyfish-v1.5.zip...
[INFO] Did not find data/eml_plag.xml directory, creating one...
[INFO] Downloading eml_plag.xml from https://github.com/pocketfall/biodata/raw/main/data/dwca_colab/eml_plag.xml...
[INFO] Did not find data/eml_pyg.xml directory, creating one...
[INFO] Downloading eml_pyg.xml from https://github.com/pocketfall/biodata/raw/main/data/dwca_colab/eml_pyg.xml...
[INFO] Did not find data/dwca-plagioscion.zip directory, creating one...
[INFO] Downloading dwca-plagioscion.zip from https://github.com/pocketfall/biodata/raw/main/data/dwca_colab/dwca-plagioscion.zip...


[PosixPath('data/dwca-pygmy_jellyfish-v1.5.zip'),
 PosixPath('data/eml_plag.xml'),
 PosixPath('data/eml_pyg.xml'),
 PosixPath('data/dwca-plagioscion.zip')]

In [None]:
# trying to read with DarwinCoreArchive.from_file()
# as in the documentation, the path is passed as a string
print(source_paths[3])
print(source_paths[0])
# data = DarwinCoreArchive.from_file(str(source_paths[0]))
data = DarwinCoreArchive.from_file(str(source_paths[3]))

# neither the plagioscion archive nor the pygmy-jellyfish archive can be read

data/dwca-plagioscion.zip
data/dwca-pygmy_jellyfish-v1.5.zip


ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
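
The message above is worded in terms of XML declarations, so a first diagnostic step is to look at how each XML file inside the archive begins. This is a minimal sketch using only the standard library: `xml_first_lines` is a hypothetical helper, and the in-memory archive in the demo is made up so that the block runs without the real datasets. Whether the declaration is really what trips pydwca here is an assumption, not something this notebook verified.

```python
import io
import zipfile


def xml_first_lines(archive) -> dict:
    """Map each .xml member of a zip archive to its first line, decoded leniently."""
    first_lines = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".xml"):
                with zf.open(name) as member:
                    raw = member.readline()
                first_lines[name] = raw.decode("utf-8", errors="replace").strip()
    return first_lines


# Demo on an in-memory archive with made-up contents (not the real datasets)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("meta.xml", '<?xml version="1.0" encoding="UTF-8"?>\n<archive/>')
    zf.writestr("taxon.txt", "id\tscientificName\n")
buffer.seek(0)

demo = xml_first_lines(buffer)
print(demo)
```

In the notebook, `xml_first_lines(source_paths[3])` would show the first line of every XML file in dwca-plagioscion.zip, which can then be compared with the pygmy-jellyfish archive.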

## 2. Writing DarwinCoreArchive files

```
from dwca import DarwinCoreArchive
from eml.resources import EMLResource
from eml.types import ResponsibleParty, IndividualName

# Define the metadata file future location
darwin_core = DarwinCoreArchive(metadata="eml.xml")
```
```
darwin_core.metadata.initialize_resource(
    "Example for Darwin Core Archive",
    ResponsibleParty(
        individual_name=IndividualName(
            _id="1",
            last_name="Doe",
            first_name="John",
            salutation="Mr."
        )
    ),
    contact=[ResponsibleParty(_id="1", referencing=True)]
)

# Add core data
darwin_core.set_core("taxon.txt")
# Add an extension
darwin_core.add_extension("identifier.txt")

# Write the archive
with open("example.zip", "wb") as example_file:
    darwin_core.to_file(example_file)
```

In [None]:
# TBD

# EML module
## 1. Reading EML files
## using the plagioscion and pygmy-jellyfish EML files

## Documentation
```
from eml import EML

eml_file = EML.from_xml("eml.xml")

# To see a summary of the content of the metadata file:
print(eml_file)
```

In [None]:
# dependencies
from eml import EML

In [None]:
# reading the pygmy-jellyfish EML
print(source_paths[2])
data_eml_pygmy = EML.from_xml(str(source_paths[2]))
print(data_eml_pygmy)

data/eml_pyg.xml
EML:
	Resource Type: DATASET
	Title: A new pygmy species of box jellyfish (Cubozoa: Chirodropida) from sub-tropical Australia
	Creator: Gershwin, L. (Investigator at CSIRO Oceans and Atmosphere)
	MetadataProvider: Gershwin, L. (Investigator at CSIRO Oceans and Atmosphere)
	Publisher: OBIS Australia Node manager (OBIS Australia Data Manager at CSIRO National Collections and Marine Infrastructure)


In [None]:
# reading the plagioscion EML
print(source_paths[1])
data_eml = EML.from_xml(str(source_paths[1]))

# the plagioscion EML cannot be read

data/eml_plag.xml


ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
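
The error text names two ways out: bytes input, or XML without a declaration. The sketch below covers the second one by removing a leading `<?xml ... ?>` declaration from a string. `strip_xml_declaration` is a hypothetical helper, and it has not been tested whether `EML.from_xml()` then accepts the cleaned file.

```python
import re

# A leading XML declaration, optionally preceded by a BOM and whitespace
_DECLARATION = re.compile(r"^\ufeff?\s*<\?xml[^>]*\?>\s*")


def strip_xml_declaration(text: str) -> str:
    """Return text without its leading XML declaration, if it has one."""
    return _DECLARATION.sub("", text, count=1)


# Made-up sample input, not the content of eml_plag.xml
sample = '<?xml version="1.0" encoding="UTF-8"?>\n<eml/>'
cleaned = strip_xml_declaration(sample)
print(cleaned)
```

In this notebook, the cleaned text of `source_paths[1]` could be written to a new file and that path passed to `EML.from_xml()`; treat this as an experiment, not a confirmed fix.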

## 2. Creating EML files
### The method given in the documentation will be tested

```
import datetime as dt

from eml import EML
from eml.resources import EMLResource
from eml.types import ResponsibleParty, IndividualName, OrganizationName

eml_file = EML(
    package_id="Example package",
    system="http://my.system",
    resource_type=EMLResource.DATASET,
)
eml_file.add_title("Example for Darwin Core Archive")
eml_file.add_creator(ResponsibleParty(
    individual_name=IndividualName(
        last_name="Doe",
        first_name="John",
        salutation="Mr."
    )
))
eml_file.add_metadata_provider(ResponsibleParty(
    organization_name=OrganizationName("Metadata Provider Organization")
))
eml_file.set_publication_date(dt.date(2024, 2, 9))

# For other possible information to add check the full documentation of the module.

# To write the XML file
with open("eml.xml", "w", encoding="utf-8") as file:
    file.write(eml_file.to_xml())
```

In [None]:
import datetime as dt
from eml import EML
from eml.resources import EMLResource
from eml.types import ResponsibleParty, IndividualName, OrganizationName

In [None]:
eml_file = EML(
    package_id="11111",
    system="http://my.system",
    resource_type=EMLResource.DATASET,
)
eml_file.add_title("Example for Darwin Core Archive")
# eml_file.add_creator(ResponsibleParty(
#     individual_name=IndividualName(
#         last_name="Doe",
#         first_name="John",
#         salutation="Dr."
#     )
# ))
# eml_file.add_metadata_provider(ResponsibleParty(
#     organization_name=OrganizationName("Metadata Provider Organization")
# ))
# eml_file.set_publication_date(dt.date(2024, 2, 9))

# For other possible information to add check the full documentation of the module.

# To write the XML file
# with open("eml.xml", "w", encoding="utf-8") as f:
#     f.write(eml_file.to_xml())

RuntimeError: Resource not initialized yet

# Summary
## dwca module
> * ### Used only for reading archives
* ### Neither plagioscion nor other .zip archives could be read with `DarwinCoreArchive.from_file()`
* ### Writing a DarwinCoreArchive has not been tested yet

## EML module
> * ### Used to read the eml.xml files of plagioscion and pygmy-jellyfish
* ### The plagioscion EML could not be read
* ### The pygmy-jellyfish EML was read successfully
* ### The pygmy-jellyfish EML printed the file's data; however, that dataset contains only occurrences
>>
  ```
  EML:
  	Resource Type: DATASET
  	Title: A new pygmy species of box jellyfish (Cubozoa: Chirodropida) from sub-tropical Australia
  	Creator: Gershwin, L. (Investigator at CSIRO Oceans and Atmosphere)
  	MetadataProvider: Gershwin, L. (Investigator at CSIRO Oceans and Atmosphere)
  	Publisher: OBIS Australia Node manager (OBIS Australia Data Manager at CSIRO National Collections and Marine Infrastructure)
  ```
* ### When trying to write an EML file, a problem was found with `EML.add_title()`
>>
  ```
  ValueError: No titles provided

  During handling of the above exception, another exception occurred:

  RuntimeError                              Traceback (most recent call last)
  /usr/local/lib/python3.10/dist-packages/eml/base/eml.py in resource(self)
      150                 return self.__resource__
      151             except Exception as _:
  --> 152                 raise RuntimeError("Resource not initialized yet")
      153         return self.__resource__
  ```
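
The traceback shows a property that catches any exception and raises a generic `RuntimeError` in its place, which hides the more informative `ValueError: No titles provided`. Python still keeps the original error on the `__context__` attribute of the new one. The sketch below reproduces that pattern with a hypothetical `LazyHolder` class (a stand-in, not pydwca code) and a hypothetical `root_cause` helper that walks the chain back to the first exception.

```python
class LazyHolder:
    """Stand-in for the pattern seen in the traceback; not pydwca code."""

    def __init__(self):
        self._resource = None

    def _build(self):
        # Mimics the inner failure reported in the traceback
        raise ValueError("No titles provided")

    @property
    def resource(self):
        if self._resource is None:
            try:
                self._resource = self._build()
            except Exception:
                # Raising inside an except block chains the original error implicitly
                raise RuntimeError("Resource not initialized yet")
        return self._resource


def root_cause(exc: BaseException) -> BaseException:
    """Follow __cause__ / __context__ links back to the first exception raised."""
    while exc.__cause__ is not None or exc.__context__ is not None:
        exc = exc.__cause__ or exc.__context__
    return exc


try:
    LazyHolder().resource
except RuntimeError as err:
    original = root_cause(err)
    print(type(original).__name__, original)
```

Wrapping the failing `eml_file.add_title(...)` call in the same `try`/`except` and passing the caught error to `root_cause` would surface the underlying message directly in the notebook.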