vignettes/a03_Template_edits.Rmd

---
title: "Editing .txt Files"
description: |
  Resources and Guides for using EMLassemblyline to create EML for National Park Service data packages
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Editing .txt Files}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

---
output: github_document
---

Metadata inferred during the templating process should be validated by the user and missing info added. Use spreadsheet and text editors for this process. Template specific guides are listed below.

_NOTES:_

*  _Templates can be generated as .docx, .md, or .txt files. Here we focus on .txt. This is the default file format and also the simplest and least error-prone format._  
*  _Tabular templates: Leave empty cells blank, don't fill with NAs.
*  _Free-text templates: Keep template content simple. Complex formatting can lead to errors._

## abstract.txt

[__Example__](https://docs.google.com/document/d/1c0hkaA8iLKhMBBvnvdvSJq8Y1g4CCUlSJG2a2MFCK08/edit?usp=sharing)

Describes the salient features of a dataset in a concise summary much like an abstract does in a journal article. It should cover what the data are and why they were created. A good rule of thumb is that an abstract should be about 250 words or less. The abstract will become a publicly-facing piece of text featured on the DataStore reference page as well as sent on to DataCite, data.gov, and Google's Dataset Search. A thoughtful and well-planned abstract may be useable not only for the data package but also for the Data Release Report (DRR). 

If you write your abstract in a word processor (such as MS Word) and paste it in to the abstract.txt template file, please pass it through a text editor (such as Notepad) to make sure it is UTF-8 encoded and does not have special characters, including line breaks such as \<br>.

*Note:* editing abstract.txt is best done via a text editor.

## methods.txt

[__Example__](https://docs.google.com/document/d/1a7BIGrmLrU6eTlIsQvWALU4moByYR3eND2_a42bxIkg/edit?usp=sharing)

Describes the data creation methods. Includes enough detail for future users to correctly use the data. Lists instrument descriptions, protocols, etc. Methods sections can include citations. It may be appropriate to cite the Protocol, datasets that were ingested to generate the data package, software (e.g. R), packages (e.g. dplyr, ggplot2) or custom scripts.

*Note:* editing methods.txt is best done via a text editor.

## keywords.txt

[__Example__](https://docs.google.com/spreadsheets/d/1u9LzpfeyBMet4AMe9SLbHb8pRsHwT_7Qx87ZP-SlTKI/edit?usp=sharing)

 Describes the data in a small set of terms. Keywords facilitate search and discovery on scientific terms, as well as names of research groups, field stations, and other organizations. Using a controlled vocabulary or thesaurus vastly improves discovery. We recommend using the [LTER Controlled Vocabulary](http://vocab.lternet.edu/vocab/vocab/index.php) when possible.

*Note:* editing keywords.txt is best done via a spreadsheet application.

Columns:

*  **keyword** One keyword per line
*  **keywordThesaurus** URI of the vocabulary from which the keyword originates.

## personnel.txt

[__Example__](https://docs.google.com/spreadsheets/d/14vFIC1wyR6_tExz3QkI8O82jQepei8po87uxaeJvyak/edit?usp=sharing)

Describes the personnel and funding sources involved in the creation of the data. This facilitates attribution and reporting. 

Valid EML requires at least one person with a **creator** role. Creator is a synonym for Author in DataStore. 

DataStore also requires at least one person with the role if **contact**. If this is the same person, list that person twice (i.e. on two separate rows). 

Additional personnel (field technicians, consultants, collaborators, contributors, etc) may be added to give credit as necessary. Any roles other than "creator" or "contact" will be listed as associatedParties. 

For the purposes of NPS data packages, it is likely that you will not have Principle Investigators (PIs), or information about funding or funding agencies.

*Note:* editing personnel.txt is best done through a spreadsheet application.

Columns:

* **givenName** First name
* **middleInitial** Middle initial
* **surName** Last name
* **organizationName** Organization the person belongs to
* **electronicMailAddress** Email address
* **userId** Persons research identifier (e.g. [ORCID](https://orcid.org/)). Links a persons research profile to a data publication.
* **role** Role of the person with respect to the data. Persons serving more than one role are listed on separate lines (e.g. replicate the persons info on separate lines but change the role. Valid options:
    + **creator** Author(s) of the data. Will appear in the data citation.
    + **PI** Principal investigator the data were created under. Will appear with project level metadata. It is OK to leave this blank as there will be no PIs for many NPS data packages.
    + **contact** A point of contact for questions about the data. Can be an organization or position (e.g. data manager). To do this, enter the organization or position name under *givenName* and leave *middleInitial* and *surName* empty.
    + Other roles (e.g. Field Technician) will be listed as associated parties to the data. Their specific role (e.g. "Field Tech" will also be listed in metadata)
* Funding information is listed with PIs
    + **projectTitle** Title of project the data were created under. If ancillary projects were involved, then add as new lines below the primary project with the PIs info replicated. This can typically be left blank.
    + **fundingAgency** Agency the project was funded by. This can be left blank.
    + **fundingNumber** Grant or award number. Likely leave this blank.

## intellectual_rights.txt

There is no need to edit the intellectual rights file now. 

EMLassemblyline autopopulates an "intellectual_rights.txt" file and will use that file to add information to the <intellectualRights> element in your EML. Once you have finished generating your EML you need to update your intellectual rights to coincide with NPS guidance using a separate `EMLeditor::set_int_rights()` function.

## attributes_*.txt

[__Example 1__](https://docs.google.com/spreadsheets/d/1VV6SY_757j5R7anJNXGTo7U2PQ_GSgg-F6YZNBwzqM0/edit?usp=sharing), [__Example 2__](https://docs.google.com/spreadsheets/d/1e7eZAQHmQIPKUwaqGjC3htK-7YK9brBuVFVY03HED5E/edit?usp=sharing)

If you have multiple data files (.csvs), multiple text files will be generated, each starting with "attributes", followed by your csv file name, and having the extension ".txt". These files Describe columns of a data table (classes, units, datetime formats, missing value codes).

*Note:* editing attribute_<data file name>.txt is best done using a spreadsheet application.

Columns:

* **attributeName** Column name. Make sure that each column has an attributeName and tha they match (including case sensitivity)
* **attributeDefinition** Column definition
* **class** Column class. Valid options are:
    + **numeric** Numeric variable
    + **categorical** Categorical variable (i.e. nominal)
    + **character** Free text character strings (e.g. notes)
    + **Date** Date and time variable
* **unit** Column unit. Required for _numeric_ classes. Select from EML's standard unit dictionary, accessible with `view_unit_dictionary()`. Use values in the "id" column. If not found, then define as a custom unit (see custom_units.txt).
* **dateTimeFormatString** Format string. Required for Date classes. Valid format string components are:
    + **Y** Year
    + **M** Month
    + **D** Day
    + **h** Hour
    + **m** Minute
    + **s** Second
Common separators of format string components (e.g. - / \ :) are supported.
* **missingValueCode** Missing value code. Required for columns containing a missing value code).
* **missingValueCodeExplanation** Definition of missing value code.

## custom_units.txt

[__Example__](https://docs.google.com/spreadsheets/d/1XPoFiegWw7BIugkRmmMqBdC6hqKtIEfdN4g-QSImUxI/edit?usp=sharing)

Describes non-standard units used in a data table attribute template.

*Note:* custom-units.txt is best edited via a spreadsheet application.

Columns:

* **id** Unit name listed in the unit column of the table attributes template (e.g. feetPerSecond)
* **unitType** Unit type (e.g. velocity)
* **parentSI** SI equivalent (e.g. metersPerSecond)
* **multiplierToSI** Multiplier to SI equivalent (e.g. 0.3048)
* **description** Abbreviation (e.g. ft/s)

## catvars_*.txt

[__Example 1__](https://docs.google.com/spreadsheets/d/1cKgLv9ffLtTqHrGX0WbpTCeWQkHXw1lbMRIt6IccuaQ/edit?usp=sharing), [__Example 2__](https://docs.google.com/spreadsheets/d/13lRuvBElEr8RQWQrWUyBei6rlE9xoDRkEOxhysHZ2J4/edit?usp=sharing)

Describes categorical variables of a data table (if any columns are classified as categorical in table attributes template). If you have multiple data files (csvs), multiple catvars files will be created, one for each csv. 

*Note:* The catvars files are best edited with a spreadsheet application.

Columns:

* **attributeName** Column name
* **code** Categorical variable
* **definition** Definition of categorical variable

## geographic_coverage.txt

[__Example__](https://docs.google.com/spreadsheets/d/1lSvQsA6tG35egBp-ueXoCKFxVdd_6rzknBz1gFew3aA/edit?usp=sharing)

Describes where the data were collected.

If the only geographic coverage information you plan on using are park boundaries, you can skip this step. You can add park unit connections using EMLeditor, which will automatically generate properly formatted GPS coordinates for the park bounding boxes.

If you would like to add additional GPS coordinates (such as for specific site locations, transects, survey plots, or bounding boxes for locations within a park, etc) please do! 

*Note:* Hopefully you won't have to edit these, but if so they are best edited with a spreadsheet application.

Columns:

* **geographicDescription** Brief description of location.
* **northBoundingCoordinate** North coordinate
* **southBoundingCoordinate** South coordinate
* **eastBoundingCoordinate** East coordinate
* **westBoundingCoordinate** West coordinate

Coordinates must be in decimal degrees and include a minus sign (-) for latitudes south of the equator and longitudes west of the prime meridian. For points, repeat latitude and longitude coordinates in respective north/south and east/west columns. If you need to convert from UTMs, try using the `utm_to_ll()` function in the [R/QCkit](https://github.com/nationalparkservice/QCkit) package.

Currently EML handles points and rectangles well. At the least precise end of spectrum you could enter an entire park unit as geographic For a convenient way to get these coordinates, see the `get_park_polygon()` function in the [R/EMLeditor](https://github.com/nationalparkservice/EMLeditor) package.

We strongly encourage you to be as precise as possible with your geographicCoverage and provide sampling points (e.g. along a transect) whenever possible. This information will (eventually) be displayed on a map on the DataStore Reference page for the data package and these points will also be directly discoverable through DataStore searches.

If you have CUI concerns about the specific locations of your sites, consider fuzzing them rather than completely removing them. One good tool for fuzzing geographic coordinates is the `fuzz_location()` function in the  [R/QCkit](https://github.com/nationalparkservice/QCkit) package.

## taxonomic_coverage.txt

[__Example__](https://docs.google.com/spreadsheets/d/1jpOmcSq93KOpWnC26pHrxPtt4dFhB9SHXNHLs5Du7MQ/edit?usp=sharing)

Describes biological organisms occurring in the data and helps resolve them to authority systems. If matches can be made, then the full taxonomic hierarchy of scientific and common names are automatically rendered in the final EML metadata. This enables future users to search on any taxonomic level of interest across data packages in repositories.

*Note:* Hopefully you don't have to edit these.

Columns:

* **taxa_raw** Taxon name as it occurs in the data and as it will be listed in the metadata if no value is listed under the name_resolved column. Can be single word or species binomial.
* **name_type** Type of name. Can be "scientific" or "common".
* **name_resolved** Taxons name as found in an authority system.
* **authority_system** Authority system in which the taxa’s name was found. Can be: "[ITIS](https://www.itis.gov/)", "[WORMS](http://www.marinespecies.org/)", "or "[GBIF](https://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c)".
* **authority_id** Taxa’s identifier in the authority system (e.g. 168469).

## provenance.txt

[__Example__](https://docs.google.com/spreadsheets/d/1P7NwIgntemkAciZi3uZgZ_ponZ73cXSh-pEihydwrsM/edit?usp=sharing)

Describes source datasets. Explicitly listing the DOIs and/or URLs of input data help future users understand in greater detail how the derived data were created and may some day be able to assign attribution to the creators of referenced datasets.

Provenance metadata can be automatically generated for supported repositories simply by specifying an identifier (i.e. EDI) in the systemID column. For unsupported repositories (e.g. DataStore), the systemID column should be left blank.

For many monitoring protocols, there may not be any input datasets, instead the data package is based on newly collected & original data. In this case, leave provenance.txt blank.

Columns:

* **dataPackageID** Data package identifier. Supplying a valid packageID and systemID (of supported systems) is all that is needed to create a complete provenance record.
* **systemID** System (i.e. data repository) identifier. Currently supported systems are: EDI (Environmental Data Initiative). Leave this column blank unless specifying a supported system.
* **url** URL linking to an online source (i.e. data, paper, etc.). Required when a source can't be defined by a packageID and systemID.
* **onlineDescription** Description of the data source. Required when a source can't be defined by a packageID and systemID.
* **title** The source title. Required when a source can't be defined by a packageID and systemID.
* **givenName** A creator or contacts given name. Required when a source can't be defined by a packageID and systemID.
* **middleInitial** A creator or contacts middle initial. Required when a source can't be defined by a packageID and systemID.
* **surName** A creator or contacts middle initial. Required when a source can't be defined by a packageID and systemID.
* **role** "creator" and "contact" of the data source. Required when a source can't be defined by a packageID and systemID. Add both the creator and contact as separate rows within the template, where the information in each row is duplicated except for the givenName, middleInitial, surName (or organizationName), and role fields.
* **organizationName** Name of organization the creator or contact belongs to. Required when a source can't be defined by a packageID and systemID.
* **email** Email of the creator or contact. Required when a source can't be defined by a packageID and systemID.

## annotations.txt

[__Example__](https://docs.google.com/spreadsheets/d/1TOS1-yCKUJEvDZwenZs88ok8L3tQkIAUQp24m-6WuL8/edit?usp=sharing)

Adds semantic meaning to metadata (variables, locations, persons, etc.) through links to ontology terms. This enables greater human understanding and machine actionability (linked data) and greatly improves the discoverability and interoperability of data in general.

Columns:

* **id** A unique identifier for the element being annotated.
* **element** The element being annotated.
* **context** The context of the subject (i.e. element value) being annotated (e.g. If the same column name occurs in more than one data tables, you will need to know which table it came from.).
* **subject** The element value to be annotated.
* **predicate_label** The predicate label (a.k.a. property) describing the relation of the subject to the object. This label should be copied directly from an ontology.
* **predicate_uri** The predicate label URI copied directly from an ontology.
* **object_label** The object label (a.k.a. value) describing the subject. This label should be copied directly from an ontology.
* **object_uri** The object URI copied from an ontology.

## additional_info

[__Example__](https://docs.google.com/document/d/1bbZ8iR9MOtTNGbbcev7fdjNCT7va9I8v-Ryv6jFCJYU/edit?usp=sharing)

Ancillary info not captured by any of the other templates.