NeXML Manual

yeban edited this page Sep 20, 2014 · 15 revisions

The NeXML Manual (version 0.1), by Rutger A. Vos, James P. Balhoff, Hilmar Lapp, Peter E. Midford, Arlin Stoltzfus

About this manual

This manual is intended to complement Vos, et al. (PubMed link), which presents the design of NeXML in the context of ongoing challenges in data interoperability in comparative analysis. By contrast, this manual provides technical information and practical guidance to users, including scientific end-users and scientific software developers.

The current version of this manual is available via the NeXML web site, as indicated in the resources section below.

We welcome comments, especially from readers who have tried to use NeXML or who have found bugs in the examples given here. Comments and suggestions should be directed to the nexml-discuss mailing list or to an appropriate tracker (see resources section below).

Guide to Resources

Various resources relating to NeXML are available online. Please note that URLs for some types of resources change. In cases where the location of a resource is likely to remain stable, we provide exact URLs. In other cases, the resource is listed as being available "via" some URL.

The NeXML Project

Resource Location
publication Vos, et al., 2012 (PubMed link)
information via http://www.nexml.org
bug trackers via http://www.nexml.org
libraries via http://www.nexml.org
mailing list https://lists.sourceforge.net/lists/listinfo/nexml-discuss
manual via http://www.nexml.org
schema via http://www.nexml.org

Applications that support NeXML for input or output

Application Location/Download Related
DAMBE http://dambe.bio.uottawa.ca/dambe.asp
Mesquite http://mesquiteproject.org nexml support package: http://nexml.github.io/downloads/mesquite-nexml.zip mailing list: http://mesquiteproject.org/mailman/listinfo/mesquitelist
mx http://mx.phenomix.org
Phenex https://www.phenoscape.org/wiki/Phenex
Phylobox http://phylobox.appspot.com
jsPhyloSVG http://www.jsphylosvg.com
When NeXML support is added to NCL (see libraries, below), applications such as Brownie, GARLI, Phycas, and Treeview X may choose to support NeXML via that library. Mesquite (Maddison, W. and Maddison, D.R. 2001), an extensible graphical workbench for comparative analysis, provides support for reading and writing NeXML via an add-on package. The phylogenetic analysis package DAMBE (Xia, X. and Xie, Z. 2001) now reads and writes NeXML using its own implementation, in Visual Basic, of NeXML I/O; and the Phenex program (Balhoff, J., Dahdul, W., et al. 2010) reads and writes NeXML using Java code generated directly from the XML schema. Among tree viewers, PhyloBox and jsPhyloSVG read and visualize NeXML data.

Databases and Repositories that support NeXML

Database Location
TreeBASE http://Treebase.org

TreeBASE (Piel, W., Chan, L., et al. 2009) generates NeXML output as an option accessible via its web interface, or via its PhyloWS web-services interface.

Programming Libraries (APIs) for NeXML

Library Language Download Related
Bio::Phylo Perl http://search.cpan.org/dist/Bio-Phylo/
BioPerl Perl http://www.bioperl.org/ Nexml how-to: http://www.bioperl.org/wiki/HOWTO:Nexml
BioRuby Ruby https://github.com/nexml/bio-nexml documentation: http://nexml.github.com/bio-nexml/
DendroPy Python http://packages.python.org/DendroPy/
NeXML Java Library Java http://www.nexml.org/nexml/java/
NeXML support in the NEXUS Class Library (NCL, Lewis 2003) is currently under development. In addition to the applications listed above, NCL support for NeXML will facilitate its support in the phylobase (http://phylobase.r-forge.r-project.org/) library for R.

Programming interfaces

In parallel with developing the NeXML syntax itself, members of the EvoInfo Working Group and other interested stakeholders have developed libraries for reading and writing NeXML in a variety of programming languages. By using these libraries in applications or simple scripts, data expressed in NeXML documents are accessible at a higher level of abstraction, obviating the need for studying the NeXML schema itself. This development process has also allowed deficiencies or awkward design decisions in the schema to be identified more easily and syntax proposals to be tested in a variety of programming environments. The libraries described below are freely available as open source software, and researcher-programmers are encouraged to evaluate these projects carefully and consider reusing their deliverables if they are planning to support NeXML I/O in their applications:
  • C++ — At time of writing, development is under way to add NeXML support to NCL (Lewis, P. 2003). This project is of interest to researcher-programmers working in a number of other programming languages to which C++ libraries can be linked.
  • Java — A compact, standalone library for Java version 5 is used by the Mesquite and Phenex applications (see above) and by TreeBASE to provide NeXML support.
  • JavaScript — A commonly used method for exchanging structured data on the internet is by encoding it using JavaScript Object Notation (JSON; http://www.json.org). Several ways to map XML data onto JSON have been proposed. For the mapping used by the Google Data Protocol, RV has developed a JavaScript library that presents such data in an object-oriented application programming interface.
  • Perl — The Bio::Phylo libraries for phyloinformatics (Vos, R.A., Caravas, J., et al. 2011) comprise the most complete implementation of the NeXML specification in Perl. These libraries are compatible with BioPerl (Stajich, J., Block, D., et al. 2002), and are used by the document validator, the NEXUS-to-NeXML translator, the NeXML-to-JSON translator and several demo web service wrappers for TimeTree (Hedges, B., Dudley, J., et al. 2006) and the Tree of Life Web Project (Maddison, D., Schulz, K.-S., et al. 2007) on the NeXML website.
  • Python — The DendroPy library (Sukumaran, J. and Holder, M. 2009) provides classes and functions for simulating, processing, and manipulating phylogenetic trees and character state matrices, and supports reading and writing phylogenetic data in a range of formats, including NeXML. In addition, python-based modules for reading and writing NeXML are under development for the ETE (Huerta-Cepas, J., Dopazo, J., et al. 2010) and BioPython (Cock, P., Antao, T., et al. 2009) toolkits.
  • Ruby — At time of writing, some authors (AP, RV) are adding NeXML support, including a flexible semantic annotation system, to BioRuby (Mitsuteru, N., Goto, N., et al. 2003). NeXML support is also implemented in Ruby by the mx toolkit (http://mx.phenomix.org).

Getting started with NeXML

This section focuses on the needs of phylogenetics end-users who are new to NeXML. The manual assumes that the typical end-user

  • generates phylogenetic trees in the context of research on the biology or evolution of some set of characters
  • has a highly customized non-automated workflow that uses a variety of tools
  • understands the theory and practice of comparative evolutionary analysis
  • knows how to install software but typically is not an accomplished programmer

Below we provide guidance both to a) users who wish to use interactive tools without any programming; and b) users who are willing to write and execute simple scripts based on examples provided.

Setting yourself up to use NeXML without any programming

Users who are not able to use a programming library described in Vos, et al., 2012 may use existing tools to create and manipulate NeXML files. In the sections below, directions for using NeXML without programming are given first, followed by directions for scripting. To follow the directions in any given case, the user must have access to one or more of the following tools (for full capacity to use NeXML without programming, install them all):
  • access to a local installation of Mesquite, an extensible software workbench for comparative evolutionary analysis that provides a graphical user interface and a variety of tools. If the mesquite-nexml package (below) is installed, Mesquite can read, write, and display NeXML. For more information, including downloading and installation instructions, see the Mesquite web site (resources section, above).
  • installation of the mesquite-nexml package, an extension for Mesquite. At the time of this writing, NeXML support is not part of the standard Mesquite installation. You have to install it yourself, as follows:
    1. download mesquite-nexml.zip from the "downloads" section of the NeXML web site
    2. unpack the zip archive to find two folders: "org" and "mesquite"
    3. move the contents of these folders into the appropriate places in your Mesquite installation:
      • move the "nexml" folder in "org" into Mesquite_Folder/org
      • move the "nexml" folder in "mesquite" into Mesquite_Folder/mesquite
    4. restart Mesquite
  • (optionally) access to an installation of mx (resources section), a data management system for systematics with a web-based front-end and an SQL back end. Installation requires a knowledgeable administrator, but mx may have certain advantages (see "Getting other character data into NeXML", below).
  • web access to the NeXML project web site (resources section) provides a limited set of online tools, including format validation and conversion from Newick (or eNewick or NHX) to NeXML.
  • web access to a format translation server may be needed for some operations. Most are front ends to Don Gilbert's venerable "ReadSeq" program. Some translation servers active at the time of this writing are:

Setting yourself up to do simple programming with NeXML

Below we give examples in Perl and Python. However, NeXML APIs are available in Java, C and other programming languages, as described by Vos, et al., 2012. The examples given below generally are simple. For more complex problems, please consult the programming manuals for the various libraries.

Setting up Perl language tools

Bio::Phylo, a programming library in Perl, available from CPAN, may be installed using the following command:

 sudo perl -MCPAN -e 'install Bio::Phylo'

BioPerl, a programming library in Perl, available from CPAN, may be installed using the following command:

 sudo perl -MCPAN -e 'install Bundle::Bio'

For more information on these packages, see the resources section above. BioPerl installation can be a major hurdle. Developers who have not used it before are advised to have patience or to find help from someone more experienced.

Setting up Python language tools

If you have Python installed, DendroPy (see the resources section above) may be installed using pip with the command

 sudo pip install dendropy

or using setuptools with the command

 sudo easy_install -U dendropy

The DendroPy package has extensive, well written online documentation, with examples (see the resources section, above).

Getting phylogeny data into NeXML

Users may wish to generate a NeXML representation of phylogenetic data encoded in another format. How this is done will depend on the format.

Converting a Newick tree to NeXML

To convert a Newick tree to NeXML without programming, begin by pointing your web browser to the NeXML home page (see the resources section):

  1. on the NeXML home page, use the pull-down to select newick-->nexml
  2. click "browse" to locate the Newick file on your computer
  3. press "submit" to upload the file and translate its contents
  4. save the resulting NeXML file

The following Perl code, which uses the Bio::Phylo package, will convert a Newick tree to a NeXML tree:

 use Bio::Phylo::IO 'parse';
 my $file = shift @ARGV;
 print parse( '-format' => 'newick', '-file' => $file, '-as_project' => 1 )->to_xml;

The analogous Python code will read the Newick tree in "data.nwk" and write out a NeXML file named "data.xml":

 import dendropy
 data = dendropy.DataSet.get_from_path("data.nwk", "newick")
 data.write_to_path("data.xml", "nexml")

Converting a NEXUS file to NeXML

To convert a NEXUS file to NeXML without programming, begin by pointing your web browser to the NeXML home page (see the resources section):

  1. on the NeXML home page, use the pull-down to select nexus-->nexml
  2. click "browse" to locate the NEXUS file on your computer
  3. press "submit" to upload the file and translate its contents
  4. save the resulting NeXML file

Alternatively, this same web service can be accessed using cURL from the command line, where filename is a file to upload and convert:

 curl -F file=@filename http://nexml.org/nexml/nex2xml

The following Perl code, which uses the Bio::Phylo package, will convert a NEXUS file to a NeXML string:

 use Bio::Phylo::IO 'parse';
 my $file = shift @ARGV;
 print parse( '-format' => 'nexus', '-file' => $file, '-as_project' => 1 )->to_xml;

The analogous operation in Python (using DendroPy) is carried out as follows

 import dendropy
 data = dendropy.DataSet.get_from_path("data.nex", "nexus")
 data.write_to_path("data.xml", "nexml")

which converts data in "data.nex" (NEXUS) to a NeXML file called "data.xml".

Converting ToLWeb trees to NeXML

Converting ToLWeb trees to NeXML without programming can be done in 2 steps, as follows:

  1. get the tree in Mesquite, and export it as NEXUS
    1. in Mesquite, select "File: Open Other: Open ToLWeb Tree"
    2. provide a group or clade name with which to search ToLWeb
    3. choose "File: Export" and select "NEXUS" from the format list
  2. follow the instructions above under "Converting a NEXUS file to NeXML"

Trees for ToLWeb groups (or the entire tree of all groups) can be obtained using a web services interface, like this:

 http://tolweb.org/onlinecontributors/app?service=external&page=xml/TreeStructureService&node_id=194 

where "194" is the group ID for the group "Bembidion" (to discover the ID for a group via string search, ToLWeb has a different service http://tolweb.org/tree/home.pages/downloadtree.html). Trees obtained from this web-service interface are returned in an XML format that is not the same as NeXML. The following Perl code, which uses the Bio::Phylo package, will convert a ToLWeb XML tree to a NeXML tree:

 use Bio::Phylo::IO 'parse';
 my $file = shift @ARGV;
 print parse( '-format' => 'tolweb', '-file' => $file, '-as_project' => 1 )->to_xml;

Note that, because ToLWeb has a stable web interface, the data can be passed to the parse function via the '-url' argument, whose value is a string URL that points to some output generated by ToLWeb. See the Bio::Phylo docs for more details.
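
The web-service address shown above follows a fixed pattern, so it can be assembled for any group ID. The following is a minimal Python sketch, using only the standard library. The helper name tolweb_tree_url is hypothetical; the base address and query parameters are copied from the example URL above.

```python
from urllib.parse import urlencode

# Base address taken from the example URL in this manual.
TOLWEB_BASE = "http://tolweb.org/onlinecontributors/app"


def tolweb_tree_url(node_id):
    """Return the tree-structure service URL for a ToLWeb group ID.

    Hypothetical helper: it only builds the address and does not contact ToLWeb.
    """
    params = [
        ("service", "external"),
        ("page", "xml/TreeStructureService"),
        ("node_id", int(node_id)),
    ]
    # safe="/" leaves the slash in the page parameter unescaped,
    # matching the form of the URL shown in the manual.
    return TOLWEB_BASE + "?" + urlencode(params, safe="/")


if __name__ == "__main__":
    print(tolweb_tree_url(194))
```

A string built this way is the kind of value that would be supplied for the '-url' argument mentioned above.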

Converting an NHX tree to NeXML

We are not aware of any way to do this without programming.

The following Perl code, which uses the BioPerl and Bio::Phylo packages, will extract the trees from an NHX file, though the NHX-tagged content will be lost:

 use Bio::TreeIO;
 use Bio::Phylo::Factory;
 use Bio::Phylo::Forest::Tree;
 
 my $file = shift @ARGV;
 my $fac = Bio::Phylo::Factory->new;
 my $treein = Bio::TreeIO->new('-format'=>'nhx','-file'=>$file);
 my $proj = $fac->create_project;
 my $forest = $fac->create_forest;
 $proj->insert($forest);
 
 # convert each BioPerl tree and add it to the forest
 while ( my $nhxtree = $treein->next_tree ) {
    my $tree = Bio::Phylo::Forest::Tree->new_from_bioperl($nhxtree);
    $forest->insert($tree);
 }
 print $proj->to_xml;

Converting a PhyloXML tree to NeXML

We are not aware of any way to do this without programming.

Translation via XSLT is in development for this task, via Kate Rachwal of U Florida.

Using BioPerl and Bio::Phylo, it can be done as follows:

 use Bio::TreeIO;
 use Bio::Phylo::Factory;
 use Bio::Phylo::Forest::Tree;
 
 my $file = shift @ARGV;
 my $reader = Bio::TreeIO->new( '-file' => $file, '-format' => 'phyloxml');
 my $factory = Bio::Phylo::Factory->new;
 my $project = $factory->create_project;
 my $forest = $factory->create_forest;
 
 # convert each BioPerl tree and add it to the forest
 while( my $tree = $reader->next_tree ) {
  $forest->insert(Bio::Phylo::Forest::Tree->new_from_bioperl($tree));
 }
 
 $project->insert($forest);
 $project->insert($forest->make_taxa);
 
 print $project->to_xml;

Getting aligned sequence data into NeXML

Aligned sequence data in various formats can be converted into NeXML format without any programming, via a format translation server (see the resources section). Typically these servers support all commonly used multi-sequence formats: FASTA, Phylip, ClustalW ("aln"), GenBank, NEXUS, GCG, MSF, NBRF (PIR), and others.

  1. convert the input file to NEXUS using a format translation server (see the resources section).
  2. convert the NEXUS file to NeXML (just as we did before for converting a tree):
    1. on the NeXML home page, use the pull-down to select nexus-->nexml
    2. click "browse" to locate the NEXUS file on your computer
    3. press "submit" to upload the file and translate its contents
    4. save the resulting NeXML file

For batch processing, alignments may be converted to NeXML using Python or Perl. In Python (using DendroPy), the following code:

 import dendropy
 data = dendropy.DataSet.get_from_path("data.nex", "nexus")
 data.write_to_path("data.xml", "nexml")

converts data in "data.nex" (NEXUS) to a NeXML file called "data.xml". The second argument to DataSet.get_from_path is the format, which can be "nexus", "phylip", "nexml" or "fasta" (or "newick" for trees). The DendroPy tutorial (see the resources section) provides explanation.

Using BioPerl and Bio::Phylo allows a greater range of input formats. BioPerl accepts the following multiple sequence formats as input:

  • fasta: FASTA format
  • pfam: pfam format
  • selex: selex (hmmer) format
  • stockholm: stockholm format
  • prodom: prodom (protein domain) format
  • clustalw: clustalw (.aln) format
  • msf: msf (GCG) format
  • mase: mase (seaview) format
  • bl2seq: Bl2seq Blast output
  • nexus: Swofford et al NEXUS format
  • pfam: Pfam sequence alignment format
  • phylip: Felsenstein's PHYLIP format

The input file in one of the above formats can be converted to NeXML using the code below as a script. If the script is called 'align2nexml.pl' then the command

 align2nexml.pl <filename> <format>

will convert the contents of <filename> from <format> (see list above) to nexml.

 use Bio::AlignIO; 
 use Bio::Phylo::Factory;
 use Bio::Phylo::Matrices::Matrix; 
 
 my $filename = shift; 
 my $format = shift; 
 my $inaln = Bio::AlignIO->new(-file=>$filename, -format=>$format);
 
 my $matrix = Bio::Phylo::Matrices::Matrix->new_from_bioperl($inaln->next_aln);
 my $fac  = Bio::Phylo::Factory->new;
 my $proj = $fac->create_project;
 my $taxa = $matrix->make_taxa; 
 $proj->insert($taxa);
 $proj->insert($matrix);
 print $proj->to_xml('compact' => 1);

Getting other character data into NeXML

Systematists and evolutionary biologists frequently use evolutionary models to analyze data other than molecular sequence characters. Such data typically are stored in NEXUS files or in other files supported by Mesquite. If the file is already in NEXUS format, then it can be converted to NeXML using the format converter accessible via the NeXML home page (the resources section).

If the file is in another format, note that Mesquite supports TNT, Hennig86, and Pagel formats, as well as tab-delimited data. If you have the NeXML compatibility package installed in Mesquite, then simply:

  1. load the file into Mesquite, then write out its contents as NeXML

If you do not have the NeXML compatibility package installed in Mesquite, then:

  1. convert the file to NEXUS using Mesquite
    1. choose "File: Open File" to select the file to open
    2. select the input format from the list provided
    3. save the data in NEXUS format
  2. convert the NEXUS file to NeXML (just as we did before for converting a tree):
    1. on the NeXML home page, use the pull-down to select nexus-->nexml
    2. click "browse" to locate the NEXUS file on your computer
    3. press "submit" to upload the file and translate its contents
    4. save the resulting NeXML file

Note that mx (see resources section) also will write out NeXML. mx may have an advantage over Mesquite in that it allows multi-user teams of systematists to "manage morphological matrices, specimen and sequence data, references, taxonomic descriptions, bifurcating and multi-entry keys, biological associations, images, collecting events, taxonomic hierarchies, ontologies and more."

Combining Phylogeny and Alignment (or other Character Data)

A common problem is to combine a tree from one file with a character matrix or alignment from another file. This is an important step because it links the tree with rows in the data matrix.

To do this without any programming, use Mesquite. See the Mesquite manual (resources section) for details.

To aggregate data from across multiple files using DendroPy, enforcing a single taxon domain:

 import dendropy
 dataset = dendropy.DataSet()
 dataset.attach_taxon_set()
 dataset.read_from_path("trees.tre", "nexus")
 dataset.read_from_path("16s.fas", "fasta", data_type="dna")
 dataset.write_to_path("data.xml", "nexml")

Bio::Phylo allows for a more generalized process of merging the set of OTUs in the tree with the set of OTUs in the matrix, as follows:

 use strict;
 use warnings;
 use Bio::Phylo::IO 'parse';
 
 my ( $treefile, $charfile ) = @ARGV;
 
 # load trees, get taxa from first tree
 my $trees_proj = parse( '-format' => 'newick', '-file' => $treefile, '-as_project' => 1 );
 my $trees_taxa = $trees_proj->get_taxa->[0];
 
 # load chars, get taxa from first char matrix
 my $chars_proj = parse( '-format' => 'nexus', '-file' => $charfile, '-as_project' => 1 );
 my $chars_taxa = $chars_proj->get_taxa->[0];
 
 # merge these two sets (collapses matching names)
 my $merged_taxa = $chars_taxa->merge_by_name($trees_taxa);
 
 # link the trees from $treefile to this merged set of OTUs
 $trees_proj->get_forests->[0]->set_taxa($merged_taxa);
 
 # link the first chars block from $charfile to merged set
 $chars_proj->get_matrices->[0]->set_taxa($merged_taxa);
 
 # collect new merged taxa block and the trees into one project
 $chars_proj->insert($merged_taxa);
 $chars_proj->insert($trees_proj->get_forests->[0]);
 
 # delete the old taxa block read from the nexus file
 $chars_proj->delete($chars_taxa);
 
 # write to nexml
 print $chars_proj->to_xml;
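
Because the merge above collapses matching names, it can help to check beforehand which labels actually occur in both files. The following plain-Python sketch assumes that names are compared as exact strings. The function name compare_labels and the example labels are hypothetical, and extracting the labels from the tree and matrix files is left to the library of your choice.

```python
def compare_labels(tree_labels, matrix_labels):
    """Sort OTU labels into shared, tree-only and matrix-only groups.

    Labels are compared as exact strings (an assumption of this sketch).
    """
    tree_set = set(tree_labels)
    matrix_set = set(matrix_labels)
    return {
        "shared": sorted(tree_set & matrix_set),
        "tree_only": sorted(tree_set - matrix_set),
        "matrix_only": sorted(matrix_set - tree_set),
    }


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    report = compare_labels(
        ["Homo_sapiens", "Pan_troglodytes", "Gorilla_gorilla"],
        ["Homo_sapiens", "Pan_troglodytes", "Pongo_abelii"],
    )
    for group, names in report.items():
        print(group, names)
```

Labels reported as tree-only or matrix-only would remain separate OTUs after a merge by name, so they are worth inspecting for spelling or formatting differences.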

Validating a NeXML file

To determine whether a file conforms to the NeXML schema, use the web server accessible via the NeXML home page (see the resources section). Alternatively, to validate many files at once, you may wish to implement your own validator. This requires the NeXML schema, which is available for download from the NeXML web site (resources section), and an XML processor library with validation methods. Such libraries are readily available (e.g., PHP's DOMDocument::schemaValidate or Java's javax.xml.validation.Validator).

Visualizing Data in NeXML files

If the mesquite-nexml package is installed, Mesquite will allow you to visualize and interactively manipulate trees and character data from NeXML input files. Directions for installing this package are given in the section "Setting yourself up to use NeXML without any programming" above.

Programmers may find it more helpful to get a direct view of NeXML using a dedicated XML tool. There are many such tools available, including a commercial tool called Oxygen, and the free version of Serna.

Getting Phylogeny Data out of NeXML

Let's suppose that, in order to be compatible with legacy software, you want a Newick or NEXUS version of a tree represented in NeXML. Here is the Python way:

 import dendropy
 data = dendropy.DataSet.get_from_path("data.xml", "nexml")
 data.write_to_path("data.nex", "nexus")

This is an example using BioPerl with Bio::Phylo, adapted from Chase Miller's NeXML HowTo (http://www.bioperl.org/wiki/HOWTO:Nexml):

 use Bio::NexmlIO;
 
 # initialize stream
 my $in = Bio::NexmlIO->new(-file => "characters_and_trees.xml");
 # extract, convert, and write data types
 $in->extract_seqs(-file => ">seqs.fas", -format => "fasta");
 $in->extract_alns(-file => ">alns.nex", -format => "nexus");
 $in->extract_trees(-file => ">trees.nwk", -format => "newick");

Metadata annotations and NeXML

To facilitate integration and re-use of phylogenies, it is important to associate them with metadata such as publication information, taxonomic identifiers, accession numbers (and other kinds of specimen information), and with other types of data such as geographic coordinates.

One of the designed features of NeXML is the ability to encode metadata annotations that are linked to data elements. Rather than trying to provide vocabulary for all possible metadata types, NeXML provides for the use of externally defined vocabularies, such as ontologies. Currently this feature is used by the Phenoscape project to encode phenotype annotations in NeXML.

How Metadata are Represented using the <meta> Element

Fundamental data objects in NeXML can be annotated using RDFa (Adida, B., Birbeck, M., et al. 2008). This way, the annotations are directly available to any off-the-shelf RDFa parser but are also simple to integrate into non-RDF-aware processing libraries. The annotations are expressed using recursively nested meta elements, and are essentially triples whose subjects are identified by the about attribute. To specify that a NeXML element such as a tree is the subject, the about attribute and the id attribute of the element must match. This way, core NeXML elements can be converted to RDF (for example by using an XSL style sheet), RDFa annotations can be extracted from the NeXML (using another style sheet), and the subjects of their respective triples can be used to align the two resulting graphs.

  • If the triple's object value is a literal such as a string or a number, the exact data type is specified by the datatype attribute; these types are typically core XML schema types such as xsd:string for atomic types, or rdf:Literal for nested XML element structures that are to be parsed as opaque literals in an RDF graph. The object value is enclosed inside the meta element and the predicate is specified using the property attribute (whose value must be a CURIE). meta elements of this type are of the subclass nex:LiteralMeta.
  • If the triple's object value is a remote resource, its location is specified using the href attribute. The predicate for this class of triples is specified as a CURIE using the rel attribute. meta elements of this type are of the subclass nex:ResourceMeta.
  • If the triple's object value is a nested annotation, its predicate is specified as a CURIE using the rel attribute. Since this enclosing meta element is to be transformed into an anonymous RDF node, it needs to be identified as the subject of a reification by assigning it an about attribute (as per the RDFa rules for identifying triple subjects).

For example, a tree might be annotated like this:

 <!-- nested inside /nexml/trees, preceded by an otus element -->
 <tree id="foo1" about="foo1" xmlns:eg="http://example.org/terms#" xmlns:foo="http://foo.org">
    <!-- object value is an atomic literal string "My Literal Value" -->
    <meta
       property="eg:hasLiteralTerm"
       xsi:type="nex:LiteralMeta"
       datatype="xsd:string">My Literal Value</meta>
    <!-- object value is an XML literal foo:bar -->
    <meta
       property="eg:hasLiteralXml"
       xsi:type="nex:LiteralMeta"
       datatype="rdf:Literal"><foo:bar/></meta>
    <!-- object value is a remote resource at http://example.org -->
    <meta
       rel="eg:hasReference"
       href="http://example.org"
       xsi:type="nex:ResourceMeta"/>
    <!-- object value is a recursively nested annotation -->
    <meta
       rel="eg:hasNestedAnnotations"
       xsi:type="nex:ResourceMeta"
       id="meta1"
       about="meta1">
       <meta
          property="eg:hasLiteralTerm"
          xsi:type="nex:LiteralMeta"
          datatype="xsd:string">My Nested Literal Value</meta>
    </meta>
    <!-- other contents, i.e. nodes and edges, go here -->
 </tree>
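
To make the three cases above concrete, the following Python sketch walks the meta elements of a tree and lists them as (subject, predicate, object) triples. It is an illustration using only the standard library, not a full RDFa processor. The function name meta_triples and the embedded sample document are hypothetical, the sample is a fragment rather than a schema-valid file, and CURIEs are reported as written rather than expanded to full URIs.

```python
import xml.etree.ElementTree as ET

NEX = "http://www.nexml.org/2009"

# Illustrative fragment only; it is not a complete, schema-valid document.
SAMPLE = """<nex:nexml version="0.9"
    xmlns:nex="http://www.nexml.org/2009"
    xmlns="http://www.nexml.org/2009"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:eg="http://example.org/terms#">
  <trees id="trees1">
    <tree id="foo1" about="foo1">
      <meta property="eg:hasLiteralTerm" xsi:type="nex:LiteralMeta"
            datatype="xsd:string">My Literal Value</meta>
      <meta rel="eg:hasReference" href="http://example.org"
            xsi:type="nex:ResourceMeta"/>
      <meta rel="eg:hasNestedAnnotations" xsi:type="nex:ResourceMeta"
            id="meta1" about="meta1">
        <meta property="eg:hasLiteralTerm" xsi:type="nex:LiteralMeta"
              datatype="xsd:string">My Nested Literal Value</meta>
      </meta>
    </tree>
  </trees>
</nex:nexml>"""


def meta_triples(element, subject=None):
    """Yield (subject, predicate, object) for the meta children of element."""
    # The about attribute, when present, names the subject of the triples.
    subject = element.get("about", subject)
    for meta in element.findall("{%s}meta" % NEX):
        if meta.get("property") is not None:
            # Literal annotation: the object is the text inside the element.
            yield subject, meta.get("property"), (meta.text or "").strip()
        elif meta.get("href") is not None:
            # Resource annotation: the object is the href target.
            yield subject, meta.get("rel"), meta.get("href")
        else:
            # Nested annotation: the object is the nested node, which then
            # becomes the subject of the annotations it contains.
            nested = meta.get("about")
            yield subject, meta.get("rel"), nested
            yield from meta_triples(meta, nested)


if __name__ == "__main__":
    tree = ET.fromstring(SAMPLE).find(".//{%s}tree" % NEX)
    for triple in meta_triples(tree):
        print(triple)
```

The sketch distinguishes the three cases by the attributes present (property, href, or neither) rather than by xsi:type, which keeps it short but is a simplification.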

Adding metadata to data stored in NeXML

While NeXML supports the representation of metadata, generalized tools and libraries for this task do not exist at present.

Developers interested in working with metadata are encouraged to join the NeXML mailing list (see resources section).

Extracting metadata from NeXML into RDF

The W3C RDFa Distiller (http://www.w3.org/2007/08/pyRdfa/) has been found to work with NeXML documents.

Programmer's guide to NeXML elements

The following subsections describe the element structures seen in NeXML instance documents. How NeXML elements relate to each other, either by nesting or by id references, is also visualized in simplified form in Figure 2 of Vos, et al., 2012. Note that neither of these descriptions is normative: NeXML is formally described by its schema.

The root element: nexml

The root element of the schema is called nexml. This root element has the following attributes:

  • A required version attribute whose value is a decimal number indicating the NeXML schema version. At present this value is 0.9.
  • An optional generator attribute, which is used to identify the program that generated the file. The attribute's value is a free form string.

In addition, the root element will commonly define a number of xml namespace prefixes. (Where it says "by convention" in the list below, the convention applies to the three-letter prefixes that are free to vary in most cases, not the namespaces themselves):

  • The xml namespace prefix that identifies xml schema syntax fragments that might be used in the file. By convention this is of the format xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" so that parts of schema language used inside NeXML (e.g. where a concrete subclass must be specified) are identified by the xsi prefix.
  • The NeXML namespace prefix, by convention of the format xmlns:nex="http://www.nexml.org/2009", so that locations where NeXML-specific types are referenced (e.g. data type subclasses) these are identified by their nex prefix.
  • The default namespace, xmlns="http://www.nexml.org/2009", which is necessary for namespace aware processors (such as the NeXML-to-CDAO xslt stylesheet produced by an EvoInfo subgroup at the Spring 2009 hackathon).
  • The xml namespace prefix, required to be of the format xmlns:xml="http://www.w3.org/XML/1998/namespace". This may be used, for example, to specify the base address of the document (using the xml:base attribute, which is relevant when translating to RDF).

Lastly, to associate the instance document with the NeXML schema, the root element requires an attribute that specifies the schema location and the namespace it applies to. This is of the format xsi:schemaLocation="http://www.nexml.org/2009 http://www.nexml.org/2009/nexml.xsd". Note that this attribute is a schema language snippet (identified by the xsi: prefix) that identifies a namespace (http://www.nexml.org/2009) and associates it with a physical schema location (http://www.nexml.org/2009/nexml.xsd). Together, this makes the root element look something like the following:

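 <nex:nexml
    version="0.9"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:nex="http://www.nexml.org/2009"
    xmlns="http://www.nexml.org/2009"
    xmlns:xml="http://www.w3.org/XML/1998/namespace"
    xsi:schemaLocation="http://www.nexml.org/2009 http://www.nexml.org/2009/nexml.xsd">
 
    <!-- contents go here -->
 
 </nex:nexml>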

The root element can contain zero or more semantic annotations, one or more otus elements, zero or more characters elements, and zero or more trees elements (in mixed order with characters elements).
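
As a quick way to inspect these properties in an existing file, the following Python sketch reports the root attributes and counts the top-level elements. It uses only the standard library. The function name summarize_root and the embedded sample are hypothetical, and the check is far weaker than validation against the schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NEX = "http://www.nexml.org/2009"

# Illustrative fragment only; it is not a complete, schema-valid document.
SAMPLE = """<nex:nexml version="0.9"
    xmlns:nex="http://www.nexml.org/2009"
    xmlns="http://www.nexml.org/2009">
  <otus id="tax1">
    <otu id="t1"/>
    <otu id="t2"/>
  </otus>
  <trees id="trees1"/>
</nex:nexml>"""


def summarize_root(xml_text):
    """Report root attributes and count the direct children of the root."""
    root = ET.fromstring(xml_text)
    if root.tag != "{%s}nexml" % NEX:
        raise ValueError("root element is not nexml in the NeXML namespace")
    counts = Counter()
    for child in root:
        # Strip the "{namespace}" part to get the local element name.
        counts[child.tag.rsplit("}", 1)[-1]] += 1
    return {
        "version": root.get("version"),
        "generator": root.get("generator"),  # optional, so may be None
        "children": dict(counts),
        "has_otus": counts["otus"] >= 1,  # at least one otus element is expected
    }


if __name__ == "__main__":
    print(summarize_root(SAMPLE))
```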

The otus element

In some phylogenetic analyses (such as Bayesian analyses that yield a credible set of trees, or a set of equally parsimonious trees) many nodes in different trees refer to the same sequence. On the other hand, in some other analyses (such as those that involve simulation of a set of sequence alignments) many sequences - in different alignments - refer to the same node in the generating tree. Creating a third entity from which one-to-many links point both to nodes and sequences can normalize this relationship. In NEXUS files, these entities are defined in the "taxa" block using a set of labels that sequences and nodes later on in the file must refer to. In NeXML, this same functionality is provided by otus elements. The name change is a result of the ongoing integration of phylogenetics and taxonomy, which now causes concept confusion because of the overloading of the term "taxa" when it was introduced in the NEXUS standard.

In its simplest form, an otus element looks something like this:

 <otus id="tax1">
    <otu id="t1"/>
    <otu id="t2"/>
 </otus>

That is, the otus element and each of its contained otu elements require an id attribute that is unique at the file level. In addition, these elements can have an optional label attribute that defines a human-readable name for the element, and can contain semantic annotations. An otu element can also have an optional class attribute that designates the class (i.e. set) to which the OTU belongs. Classes are defined with the class tag (see the section on Sets). The value of the class attribute of an OTU should be the id of the defined class.
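
The uniqueness requirement on id attributes can be checked with a few lines of code. The following is a minimal sketch, not part of any NeXML library; the function name duplicate_ids and the two sample fragments are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(xml_text):
    """Return the sorted id values that occur more than once in a document."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        el.get("id") for el in root.iter() if el.get("id") is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical fragments: the second one reuses the id "t1".
GOOD = '<otus id="tax1"><otu id="t1"/><otu id="t2"/></otus>'
BAD = '<otus id="tax1"><otu id="t1"/><otu id="t1"/></otus>'

good_result = duplicate_ids(GOOD)
bad_result = duplicate_ids(BAD)
```

A schema-aware validator would catch the same problem; this check is only meant to show what "unique at the file level" implies for a processing library.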

Comparative data of various types: characters

The characters element is analogous to the NEXUS characters block: it stores comparative data such as molecular sequences or categorical or continuous morphological data. The element differs from the NEXUS characters block in that it allows for more detailed specification of the allowed states per character, strict validation of the observed states, and annotation of characters (columns), states, rows and individual observations. In addition, the characters element is designed to allow for representation of non-homologized data: the element is more accurately described as a bucket of observations and the allowed parameter space for those observations. Only if the Boolean attribute aligned of the matrix element is set to "1" (true) can subsequent observations be assumed to be homologous across row elements.

The schema specifies the characters element to be of an abstract type, so that instance documents need to specify the concrete subclass (i.e. data type) using the xsi:type attribute. At present, six data types are supported: DNA, RNA, protein, restriction sites, standard categorical and continuous. For each of these data types there are two subclasses, one for individual cells and one for sequences, whose names are constructed as follows (in Backus-Naur form):

 sub_type_name ::= data_type ("Cells" | "Seqs" )

For instance, for characters of the DNA type, "DnaCells" is the more verbose representation and "DnaSeqs" the more compact representation (as in xsi:type="nex:DnaSeqs").
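
The naming rule can be expanded mechanically. The sketch below enumerates the twelve resulting names and splits an xsi:type value back into its parts. The spellings "Dna" and "Standard" follow the xsi:type values used in this manual's examples (nex:DnaSeqs, nex:StandardCells); the remaining spellings are assumptions by analogy, and the helper name split_sub_type is hypothetical.

```python
# Data type spellings: "Dna" and "Standard" appear in this manual's
# examples; the others are assumed to follow the same capitalization.
DATA_TYPES = ["Dna", "Rna", "Protein", "Restriction", "Standard", "Continuous"]
LAYOUTS = ["Cells", "Seqs"]

# sub_type_name ::= data_type ("Cells" | "Seqs")
SUB_TYPE_NAMES = [data_type + layout for data_type in DATA_TYPES for layout in LAYOUTS]


def split_sub_type(xsi_type):
    """Split a value such as 'nex:DnaSeqs' into ('Dna', 'Seqs')."""
    name = xsi_type.split(":", 1)[-1]  # drop the namespace prefix, if any
    for layout in LAYOUTS:
        data_type = name[: -len(layout)]
        if name.endswith(layout) and data_type in DATA_TYPES:
            return data_type, layout
    raise ValueError("not a recognized characters subtype: %r" % xsi_type)


dna_seqs = split_sub_type("nex:DnaSeqs")
```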

Categorical data: Standard

For categorical data (e.g. xsi:type="nex:StandardCells"), the matrix element must be preceded by a format element that specifies the allowed states per character. An example is shown below:

 <!-- nested inside /nexml/characters element -->
 <format>
   <!-- The first elements inside a format element are state set definitions.
      In this example, there is a set of six states, each tagged with an id.
      The symbol attribute is a shorthand token.
   -->
    <states id="states1">
       <state id="s1" symbol="1"/>
       <state id="s2" symbol="2"/>
       <state id="s3" symbol="3"/>
       <state id="s4" symbol="4"/>
       <polymorphic_state_set id="s5" symbol="5">
          <mapping state="s1"/>
          <mapping state="s2"/>
       </polymorphic_state_set>
       <uncertain_state_set id="s6" symbol="6">
          <mapping state="s3"/>
          <mapping state="s4"/>
       </uncertain_state_set>
    </states>
    <!-- The matrix in this example contains two columns, both referring 
    to the same stateset - and so cells in both columns can occupy one of 
    six states. -->
    <char states="states1" id="c1"/>
    <char states="states1" id="c2"/>
 </format>
 <!-- row elements follow -->

In this case, then, the matrix holds two six-state characters. State s5 functions as an ambiguity code: a state that maps onto two other states (s1 and s2). Because the element is called polymorphic_state_set, the mapping to the two other states indicates true polymorphism (i.e. both states are observed in a population). State s6 is uncertain: the observation is either s3 or s4, but it is not known which. In practice, "polymorphic" can be read as "AND", and "uncertain" as "OR". The elements states, polymorphic_state_set, uncertain_state_set, state and char can all be semantically annotated to provide a facility that is loosely analogous to, but more powerful than, the NEXUS tokens CHARSTATELABELS and CHARLABELS.
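
To make the "AND" versus "OR" reading concrete, the following sketch expands each state definition into the set of fundamental states it maps onto. It is an illustration only: the function name expand_states and the "single"/"AND"/"OR" labels are hypothetical, and the fragment is parsed without a namespace for brevity.

```python
import xml.etree.ElementTree as ET

FORMAT = """<format>
  <states id="states1">
    <state id="s1" symbol="1"/>
    <state id="s2" symbol="2"/>
    <state id="s3" symbol="3"/>
    <state id="s4" symbol="4"/>
    <polymorphic_state_set id="s5" symbol="5">
      <mapping state="s1"/>
      <mapping state="s2"/>
    </polymorphic_state_set>
    <uncertain_state_set id="s6" symbol="6">
      <mapping state="s3"/>
      <mapping state="s4"/>
    </uncertain_state_set>
  </states>
  <char states="states1" id="c1"/>
  <char states="states1" id="c2"/>
</format>"""


def expand_states(states_el):
    """Map each state id to (kind, frozenset of fundamental state ids)."""
    expanded = {}
    for el in states_el:
        if el.tag == "state":
            expanded[el.get("id")] = ("single", frozenset([el.get("id")]))
        elif el.tag in ("polymorphic_state_set", "uncertain_state_set"):
            # polymorphic: all members observed; uncertain: one of the members
            kind = "AND" if el.tag == "polymorphic_state_set" else "OR"
            members = frozenset(m.get("state") for m in el.findall("mapping"))
            expanded[el.get("id")] = (kind, members)
    return expanded


expanded = expand_states(ET.fromstring(FORMAT).find("states"))
```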

Following the format element, the matrix element collects subsequent row elements that contain the mappings between the defined states and the actual observations. For example:

 <!-- nested inside /nexml/characters, which is preceded by an otus element -->
 <!-- matrix is preceded by a format element at the same nesting level -->
 <matrix>
    <row id="r1" otu="t1">
       <!-- Each cell must contain a reference to the column it belongs to, 
           and to a state allowed within that column. -->
       <cell char="c1" state="s2"/>
       <cell char="c2" state="s2"/>
    </row>
 </matrix>

This structure means that the entity defined by OTU t1 has state s2 (symbol 2) for both characters. Cell elements within different row elements are homologous if they have the same value for the char attribute. DNA, RNA, protein and restriction data can be described in a similar, verbose way, with the "symbol" attribute's value being an IUPAC nucleotide symbol, an IUPAC amino acid symbol or a Boolean (0/1), respectively. In a compact representation, the same STANDARD information is marked up like this:

 <!-- nested inside /nexml/characters/matrix, preceded by an otus element -->
 <row id="r2" otu="t2">
    <seq>2 2</seq>
 </row>

Notice how the symbols are space-separated (this is because STANDARD states aren't necessarily single-character symbols: integers greater than 9 are allowed also).
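
The relationship between the compact and verbose notations can be sketched as a small conversion: split the seq text on whitespace, then pair each symbol with its column and look up the corresponding state id. The function name seq_to_cells and the lookup table are hypothetical, built from the six-state example above.

```python
def seq_to_cells(seq_text, char_ids, symbol_to_state):
    """Turn compact STANDARD notation into (char id, state id) pairs.

    Symbols are whitespace-separated because a STANDARD symbol may be
    longer than one character (e.g. the integer 10).
    """
    symbols = seq_text.split()
    if len(symbols) != len(char_ids):
        raise ValueError("expected %d symbols, found %d" % (len(char_ids), len(symbols)))
    return [(char_id, symbol_to_state[symbol]) for char_id, symbol in zip(char_ids, symbols)]


# Lookup table for the example state set: symbol "1" -> "s1", ..., "6" -> "s6".
symbol_to_state = {str(i): "s%d" % i for i in range(1, 7)}

cells = seq_to_cells("2 2", ["c1", "c2"], symbol_to_state)
```

The resulting pairs correspond to the char and state attributes of the cell elements in the verbose notation.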

Categorical data with defined alphabets: Dna, Rna, Protein and Restriction

By analogy to the state set specification for standard categorical data, the other categorical data types currently implemented in NeXML also have a format element populated with a states element, within which all possible states for the respective data type are enumerated. For DNA, RNA and protein these are the single-character symbols for the fundamental states (nucleotides or amino acids) and the ambiguity codes that map onto them as uncertain_state_set elements. For restriction site data these are only the fundamental states 1 and 0 (presence or absence of a restriction site, no uncertainty). These state sets are then linked to the character columns they apply to, just as in standard categorical data. In its most compact form, a DNA sequence alignment as expressed in a characters element would look something like this:

 <!-- nested inside /nexml, preceded by an otus element -->
 <characters otus="tax1" id="m1" xsi:type="nex:DnaSeqs">
    <format>
       <!--
       state symbols as per the IUPAC codes, i.e. upper-case single-character letters;
       ambiguity letters are "uncertain_state_set" referencing the unambiguous states;
       gaps have symbol '-' and are an uncertain_state_set referencing none;
       missing states have symbol '?' and are an uncertain_state_set referencing all (including gaps)
       -->
    </format>
    <matrix aligned="1">
       <row id="r1" otu="t1"><seq>AACATATCTC</seq></row>
       <row id="r2" otu="t2"><seq>ATACCAGCAT</seq></row>
       <row id="r3" otu="t3"><seq>GAGGGTATGG</seq></row>
       <row id="r4" otu="t4"><seq>GGTCTTAGAG</seq></row>
       <row id="r5" otu="t5"><seq>CGTCACAGTG</seq></row>
    </matrix>
 </characters>

The example above shows how, for a compact representation of DNA, the characters are concatenated as a string because all the symbols are single characters. The RNA subclasses are virtually identical to the equivalent DNA subclasses (but U is used in place of T). The protein and restriction data types likewise use single-character symbols from well-defined alphabets: the IUPAC single-character amino acid symbols and 0 or 1, respectively.
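
Because every symbol is a single character, a compact DNA row can be split character by character, and an aligned matrix can then be read column-wise. The sketch below assumes a namespace-free fragment with two of the rows from the example; the function name read_dna_rows is hypothetical.

```python
import xml.etree.ElementTree as ET

MATRIX = """<matrix aligned="1">
  <row id="r1" otu="t1"><seq>AACATATCTC</seq></row>
  <row id="r2" otu="t2"><seq>ATACCAGCAT</seq></row>
</matrix>"""


def read_dna_rows(matrix_el):
    """Return {otu id: list of single-character symbols} for compact rows."""
    rows = {}
    for row in matrix_el.findall("row"):
        # Remove any whitespace inside <seq>, then split into characters.
        seq = "".join(row.findtext("seq", default="").split())
        rows[row.get("otu")] = list(seq)
    return rows


matrix = ET.fromstring(MATRIX)
rows = read_dna_rows(matrix)

# xs:boolean allows both "1" and "true" as lexical forms for true.
is_aligned = matrix.get("aligned") in ("1", "true")

# Only for aligned matrices do positions correspond to homologous columns.
columns = list(zip(*rows.values())) if is_aligned else []
```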

Continuous data

For continuous data, the format element defines the characters (i.e. char elements) but not their states, and observation values (i.e. either the state attribute in verbose notation, or space-separated symbols in compact notation) are double precision numbers (i.e. of type xs:double).
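
A compact continuous row can therefore be read by splitting on whitespace and converting each token to a floating point number. This is a minimal sketch with a hypothetical function name and made-up sample values.

```python
def parse_continuous_seq(seq_text):
    """Convert the whitespace-separated content of a <seq> into floats."""
    return [float(token) for token in seq_text.split()]


# Made-up observations for three continuous characters.
values = parse_continuous_seq("0.25 -1.5e2 3")
```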

Container for phylogenetic trees and networks: trees

Tree descriptions that use nested elements (as suggested in Felsenstein's book Inferring Phylogenies) can pose special problems for XML parsers: a parser can only hand off an element once all its children have been processed and stored in memory. Large trees described using nested elements can therefore develop huge memory requirements. NeXML instead describes trees as node and edge tables, following the syntax of GraphML. This also allows a similar syntax for both trees and networks, which nested elements cannot provide. Trees or networks in a NeXML file are nested within a trees tag. A NeXML file can contain zero or more trees elements, each containing one or more phylogenetic trees or networks. A trees element is linked to an otus element through the compulsory otus attribute, must have an id, and may have an optional label. See the example below:

 <trees otus="tax1" id="Trees" label="TreesBlockFromXML">
    <!-- followed by definition of one or more trees 
    or networks (as in the later examples) -->
 </trees>

A phylogeny: tree

Phylogenetic trees are defined in NeXML with the tree tag, which has a compulsory id and an optional label attribute. Nested within tree are the definitions of the nodes and edges. Nodes must be defined before edges (because the edges reference the nodes).

A node in a tree

Nodes are defined with the node tag, which must have an id. Nodes can optionally be linked to an OTU with the otu attribute. To define a root node, the optional root attribute can be set to true. The tree is considered rooted, multiply rooted or unrooted based on how many root nodes it has.
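
The rule that rootedness follows from the number of root nodes can be sketched as follows. The function name rootedness and the sample fragments are hypothetical; the check accepts both "true" and "1", which are the two lexical forms of a true xs:boolean.

```python
import xml.etree.ElementTree as ET


def rootedness(tree_el):
    """Classify a tree by the number of nodes flagged with root="true"."""
    n_roots = sum(
        1 for node in tree_el.findall("node") if node.get("root") in ("true", "1")
    )
    if n_roots == 0:
        return "unrooted"
    if n_roots == 1:
        return "rooted"
    return "multiply rooted"


# Hypothetical fragments with zero, one and two root nodes.
unrooted = ET.fromstring('<tree id="a"><node id="n1"/><node id="n2"/></tree>')
rooted = ET.fromstring('<tree id="b"><node id="n1" root="true"/><node id="n2"/></tree>')
multi = ET.fromstring(
    '<tree id="c"><node id="n1" root="true"/><node id="n2" root="true"/></tree>'
)

results = [rootedness(t) for t in (unrooted, rooted, multi)]
```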

An edge in a tree

Edges are defined with the edge tag, which must have an id. An edge must have a direction, defined by the compulsory source and target attributes. The edge length is provided with the length attribute.

The concrete subclasses IntTree and FloatTree describe a tree shape following GraphML syntax. The classes differ in that the optional length attribute is either an integer or an IEEE 754-1985 compliant floating point number. Below is an example:

 <!-- nested inside /nexml/trees, preceded by an otus element -->
 <!-- A tree with float edges. -->
 <tree id="tree1" xsi:type="nex:FloatTree" label="tree1">
    <node id="n1" label="n1" root="true"/>  <!-- this is the root node -->
    <node id="n2" label="n2" otu="t1"/>  <!-- this is probably a tip -->
    <node id="n3" label="n3"/>
    <node id="n4" label="n4"/>
    <node id="n5" label="n5" otu="t3"/>
    <node id="n6" label="n6" otu="t2"/>
    <node id="n7" label="n7"/>
    <node id="n8" label="n8" otu="t5"/>
    <node id="n9" label="n9" otu="t4"/>
    <!-- optional root edge, for coalescent trees.
       note: only has target -->
    <rootedge target="n1" id="re1" length="0.34765" />
    <edge source="n1" target="n3" id="e1" length="0.34534"/>
    <edge source="n1" target="n2" id="e2" length="0.4353"/>
    <edge source="n3" target="n4" id="e3" length="0.324"/>
    <edge source="n3" target="n7" id="e4" length="0.3247"/>
    <edge source="n4" target="n5" id="e5" length="0.234"/>
    <edge source="n4" target="n6" id="e6" length="0.3243"/>
    <edge source="n7" target="n8" id="e7" length="0.32443"/>
    <edge source="n7" target="n9" id="e8" length="0.2342"/>
 </tree>

This is an XML representation of the Newick string (((t4,t5)n7,(t2,t3)n4)n3,t1)n1;. In the NeXML representation, the root is principally identified by having an in-degree of zero or one: no edge element exists with a target attribute that references that node, but a rootedge element may exist to indicate a time span leading up to the root (principally for coalescent trees). An additional root attribute is used to indicate that this tree is in fact considered truly rooted. This attribute may be used on multiple nodes to indicate multiple rootings. Tips are identified by there being no edge elements with source attributes that reference them. To attach additional objects, such as bootstrap values, to nodes or edges, a semantic annotation (see below) is used.
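
The correspondence between the node and edge tables and a Newick string can be sketched in a few lines. The code below finds the root as the node that no edge targets, treats nodes without outgoing edges as tips, and writes tips by their otu reference and internal nodes by their id. It is an illustration under simplifying assumptions: the fragment omits namespaces, the xsi:type attribute, edge lengths and the rootedge, and the recursive helper would need to be made iterative for very deep trees.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TREE = """<tree id="tree1" label="tree1">
  <node id="n1" root="true"/>
  <node id="n2" otu="t1"/>
  <node id="n3"/>
  <node id="n4"/>
  <node id="n5" otu="t3"/>
  <node id="n6" otu="t2"/>
  <node id="n7"/>
  <node id="n8" otu="t5"/>
  <node id="n9" otu="t4"/>
  <edge source="n1" target="n3" id="e1"/>
  <edge source="n1" target="n2" id="e2"/>
  <edge source="n3" target="n4" id="e3"/>
  <edge source="n3" target="n7" id="e4"/>
  <edge source="n4" target="n5" id="e5"/>
  <edge source="n4" target="n6" id="e6"/>
  <edge source="n7" target="n8" id="e7"/>
  <edge source="n7" target="n9" id="e8"/>
</tree>"""


def to_newick(tree_el):
    """Write a single-rooted tree as Newick (topology and names only)."""
    otu_of = {n.get("id"): n.get("otu") for n in tree_el.findall("node")}
    children = defaultdict(list)
    targets = set()
    for edge in tree_el.findall("edge"):
        children[edge.get("source")].append(edge.get("target"))
        targets.add(edge.get("target"))

    # The root is the node with in-degree zero.
    roots = [node_id for node_id in otu_of if node_id not in targets]
    if len(roots) != 1:
        raise ValueError("expected exactly one node with in-degree zero")

    def render(node_id):
        kids = children.get(node_id, [])
        if not kids:  # a tip: no edge has this node as its source
            return otu_of[node_id] or node_id
        return "(" + ",".join(render(k) for k in kids) + ")" + node_id

    return render(roots[0]) + ";"


newick = to_newick(ET.fromstring(TREE))
```

The children are written in the order in which the edges appear, so the result describes the same topology as the Newick string quoted above but may list sister clades in a different order.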

Cyclical graph: network

The NEXUS standard does not specify syntax for defining reticulate evolution (though "private" blocks may exist that add this facility). Extensions to describe networks in Newick syntax have been proposed (e.g. see Cardona, G., Rossello, F., et al. 2008; Than, C., Ruths, D., et al. 2008), but these are not in wide usage at time of writing, and no development to subsume these into NEXUS is under way. This means that there is currently no interoperable, rich syntax for networks. NeXML seeks to address this with the IntNetwork and FloatNetwork subclasses. These differ from the tree subclasses only in that the key constraints on the in-degree of nodes are relaxed, so that a node can have multiple parents. The proposed Newick extensions use special tokens, attached to edges, to distinguish between reticulations resulting from hybridization, lateral gene transfer and genomic introgression. Analogously, in NeXML semantic annotations that make this distinction can be attached to the extra edges that form the reticulation. In the example below, node n6 has an additional parent node n7, creating a reticulation:

 <network id="tree3" xsi:type="nex:IntNetwork" label="tree2">
    <node id="n1" label="n1"/>
    <node id="n2" label="n2" otu="t1"/>
    <node id="n3" label="n3"/>
    <node id="n4" label="n4"/>
    <node id="n5" label="n5" otu="t3"/>
    <node id="n6" label="n6" otu="t2"/>
    <node id="n7" label="n7"/>
    <node id="n8" label="n8" otu="t5"/>
    <node id="n9" label="n9" otu="t4"/>
    <edge source="n1" target="n3" id="e1" length="1"/>
    <edge source="n1" target="n2" id="e2" length="2"/>
    <edge source="n3" target="n4" id="e3" length="3"/>
    <edge source="n3" target="n7" id="e4" length="1"/>
    <edge source="n4" target="n5" id="e5" length="2"/>
    <edge source="n4" target="n6" id="e6" length="1"/>
    <edge source="n7" target="n6" id="e7" length="1"/>
    <edge source="n7" target="n8" id="e8" length="1"/>
    <edge source="n7" target="n9" id="e9" length="1"/>
 </network>
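
Since a network differs from a tree only in allowing an in-degree greater than one, reticulation nodes can be found by counting incoming edges. The sketch below works on (source, target) pairs copied from the example; the function names are hypothetical.

```python
from collections import Counter

# (source, target) pairs of the edges in the network example.
EDGES = [
    ("n1", "n3"), ("n1", "n2"), ("n3", "n4"), ("n3", "n7"),
    ("n4", "n5"), ("n4", "n6"), ("n7", "n6"), ("n7", "n8"), ("n7", "n9"),
]


def reticulation_nodes(edges):
    """Return the sorted nodes that have more than one parent."""
    indegree = Counter(target for _, target in edges)
    return sorted(node for node, degree in indegree.items() if degree > 1)


def parents_of(node, edges):
    """Return the sorted source nodes of all edges pointing at node."""
    return sorted(source for source, target in edges if target == node)


reticulations = reticulation_nodes(EDGES)
n6_parents = parents_of("n6", EDGES)
```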

Sets of things: set

NEXUS files can define sets of various things. In NeXML this is implemented using set elements with attributes (named after the set member types) that reference the id attributes of the set members. This is most easily demonstrated with an example:

 <otus id="tax1">
    <otu id="t1" label="Pan paniscus"/>
    <otu id="t2" label="Gorilla gorilla"/>
    <otu id="t3" label="Eulemur mongoz"/>
    <otu id="t4" label="Lemur catta"/>
    <otu id="t5" label="Rhinopithecus roxellana"/>
    
    <set id="set1" label="Apes" otu="t1 t2"/>
    <set id="set2" label="Lemurs" otu="t3 t4"/>
    <set id="set3" label="Primates" otu="t1 t2 t3 t4 t5"/>
 </otus>
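
Resolving a set amounts to splitting the referencing attribute on whitespace and looking up each id. The following sketch does this for OTU sets in a namespace-free fragment modeled on the example; the function name resolve_sets is hypothetical.

```python
import xml.etree.ElementTree as ET

OTUS = """<otus id="tax1">
  <otu id="t1" label="Pan paniscus"/>
  <otu id="t2" label="Gorilla gorilla"/>
  <otu id="t3" label="Eulemur mongoz"/>
  <otu id="t4" label="Lemur catta"/>
  <otu id="t5" label="Rhinopithecus roxellana"/>
  <set id="set1" label="Apes" otu="t1 t2"/>
  <set id="set2" label="Lemurs" otu="t3 t4"/>
  <set id="set3" label="Primates" otu="t1 t2 t3 t4 t5"/>
</otus>"""


def resolve_sets(otus_el):
    """Return {set label: [labels of the member OTUs]}."""
    labels = {otu.get("id"): otu.get("label") for otu in otus_el.findall("otu")}
    return {
        s.get("label"): [labels[ref] for ref in s.get("otu", "").split()]
        for s in otus_el.findall("set")
    }


sets = resolve_sets(ET.fromstring(OTUS))
```

A reference to an id that is not defined raises a KeyError here, which is one simple way to surface dangling set members.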

General recommendations for developers

Developers of software projects seeking to support the NeXML format are encouraged to first explore the processing libraries discussed in the previous section. If none of those libraries meets the requirements of the project, several recommendations can be made.

Firstly, the developers of the Phenex project (Balhoff, J., Dahdul, W., et al. 2010) have found the "XML beans" approach useful. In this approach, off-the-shelf tools are used to generate code that processes NeXML documents based on the schema specification. Such tools are particularly mature for the Java programming language. Code generators for C++ exist also, but experimentation with one of them (http://www.codesynthesis.com/products/xsd/) proved to be somewhat problematic: certain XML schema constructs (e.g. cycles in inclusions of multiple schema files, which the standard allows) are not supported very well.

Secondly, if the decision is made to develop a NeXML processing library "from scratch", it is strongly recommended that the library be built on top of an off-the-shelf XML parsing library, which will deal with low-level issues such as character encodings, line break formats and XML comments automatically. Free, open source XML parsing libraries exist for all commonly used programming languages. These libraries generally implement at least one of the two popular, standardized application programming interfaces: the stream-based Simple API for XML (SAX) or the Document Object Model (DOM). The SAX model presents the contents of an XML file as a stream of elements and attributes through which a "cursor" moves. This model is memory-efficient, as the library never stores the entire document in memory. A downside is that backtracking in the stream is impossible (although NeXML is designed such that the need for this is minimized), so the processing library must be designed carefully to keep track of the current context of the cursor. The DOM model presents the contents of an XML file as a tree of parent and child nodes that can be traversed in all directions. This is easier to work with, but requires more memory. It is possible that some NeXML documents are too large to be processed this way, although preliminary experiments with a NeXML version of the XML output of the full Tree of Life Web Project (Maddison, D., Schulz, K.-S., et al. 2007) showed that this could be processed with a DOM parser in the Perl programming language (which is not an especially memory-efficient language).
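
The difference between the two interfaces can be illustrated with Python's standard library, which ships both a SAX parser (xml.sax) and a DOM implementation (xml.dom.minidom). The sketch below extracts OTU labels both ways from a small, hypothetical, namespace-free fragment. The handler class and function names are illustrative only.

```python
import xml.dom.minidom
import xml.sax

DOC = b'<otus id="tax1"><otu id="t1" label="A"/><otu id="t2" label="B"/></otus>'


class OtuLabelHandler(xml.sax.ContentHandler):
    """SAX: react to each start tag as the cursor moves through the stream."""

    def __init__(self):
        super().__init__()
        self.labels = {}

    def startElement(self, name, attrs):
        if name == "otu":
            self.labels[attrs["id"]] = attrs.get("label")


def labels_with_sax(data):
    handler = OtuLabelHandler()
    xml.sax.parseString(data, handler)  # nothing is retained but handler state
    return handler.labels


def labels_with_dom(data):
    # DOM: the whole document is built in memory and can then be queried freely.
    dom = xml.dom.minidom.parseString(data)
    return {
        el.getAttribute("id"): el.getAttribute("label")
        for el in dom.getElementsByTagName("otu")
    }


sax_labels = labels_with_sax(DOC)
dom_labels = labels_with_dom(DOC)
```

Both functions return the same mapping; the SAX version keeps only the dictionary it is building, whereas the DOM version holds the full document tree while it runs.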

A third recommendation is that class hierarchies for NeXML libraries are best designed such that they mirror the complex type hierarchy in the NeXML schema fairly closely, at least initially. Using the idioms for abstraction in the chosen implementation language, the resulting class hierarchy can subsequently be made more concise while maintaining fidelity with the standard. For example, the development of an abstract "factory" that generates phylogenetic data objects based on the encountered elements in an instance document can greatly reduce code size in some implementation languages. Another pattern that proved useful in simplifying the logic in a number of processing libraries is the development of a lookup table such that phylogenetic data objects can be found quickly based on their identifiers in instance documents so that references to these objects can be resolved without backtracking in the document structure.
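
The factory and lookup table patterns mentioned above can be sketched together. In the example below, a registry dispatches on the element name to build a data object and records every object under its id, so that a later otu reference on a node is resolved by a dictionary lookup instead of by backtracking. All class and method names are hypothetical, only two element types are handled, and the document is a simplified, namespace-free fragment.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class Otu:
    id: str
    label: Optional[str] = None


@dataclass
class Node:
    id: str
    otu: Optional[Otu] = None


class Registry:
    """A small factory plus an id lookup table."""

    def __init__(self):
        self.by_id = {}
        self.makers = {"otu": self._make_otu, "node": self._make_node}

    def _make_otu(self, el):
        return Otu(el.get("id"), el.get("label"))

    def _make_node(self, el):
        ref = el.get("otu")
        # The referenced OTU was defined earlier in the document,
        # so it is already present in the lookup table.
        return Node(el.get("id"), self.by_id[ref] if ref else None)

    def build(self, el):
        maker = self.makers.get(el.tag)
        if maker is None:
            return None  # element type not handled in this sketch
        obj = maker(el)
        self.by_id[obj.id] = obj
        return obj


DOC = """<nexml>
  <otus id="tax1"><otu id="t1" label="Pan paniscus"/></otus>
  <trees otus="tax1" id="Trees">
    <tree id="tree1"><node id="n1"/><node id="n2" otu="t1"/></tree>
  </trees>
</nexml>"""

registry = Registry()
for element in ET.fromstring(DOC).iter():  # iter() visits elements in document order
    registry.build(element)
```

Because otus elements precede the trees that reference them, a single forward pass is enough to resolve every reference in this fragment.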

A final recommendation for developers is to make use of the resources available through the NeXML project. The website features an online translator from NEXUS to NeXML files, which shows developers familiar with NEXUS how its constructs map onto NeXML, and a validator to test any produced output. Support for developers is also available through an active mailing list that is advertised on the NeXML website.

References

Adida B, Birbeck M, McCarron S, Pemberton S. 2008. RDFa in XHTML: Syntax and Processing. W3C Recommendation.

Balhoff J, Dahdul W, Kothari C, Lapp H, Lundberg J, Mabee P, Midford P, Westerfield M, Vision T. 2010. Phenex: Ontological Annotation of Phenotypic Diversity. PLoS ONE, 5:e10500.

Cardona G, Rossello F, Valiente G. 2008. Extended Newick: it is time for a standard representation of phylogenetic networks. BMC Bioinformatics, 9:532.

Maddison D, Schulz K-S, Maddison W. 2007. The Tree of Life Web Project. Zootaxa:19-40.

Than C, Ruths D, Nakhleh L. 2008. PhyloNet: a software package for analyzing and reconstructing reticulate evolutionary relationships. BMC Bioinformatics, 9:322.

Vos RA, Balhoff JP, Caravas JA, Holder MT, Lapp H, Midford PE, Priyam A, Sukumaran J, Xia X, Stoltzfus A. 2012. NeXML: rich, extensible, and verifiable representation of comparative data and metadata. Syst Biol, 61(4):675-689.