Merge pull request #79 from jjmccollum/77-add-support-for-nexus-style-csv-excel-output

⚡ Updated tabular output options; ⚡ added support for creatin…
jjmccollum committed Dec 23, 2023
2 parents 19b46b8 + d1730d3 commit a977b25
Showing 7 changed files with 358 additions and 61 deletions.
17 changes: 16 additions & 1 deletion README.rst
@@ -130,7 +130,7 @@ If you use this software, please cite the paper: Joey McCollum and Robert Turnbu

.. code-block:: bibtex
@article{McCollum2022,
@article{MT2022,
author = {Joey McCollum and Robert Turnbull},
title = {{teiphy: A Python Package for Converting TEI XML Collations to NEXUS and Other Formats}},
journal = {Journal of Open Source Software},
@@ -143,5 +143,20 @@ If you use this software, please cite the paper: Joey McCollum and Robert Turnbu
url = {https://doi.org/10.21105/joss.04879}
}
Further details on the capabilities of ``teiphy``, particularly in terms of the text-critically valuable features it can map from TEI XML collations to BEAST 2 inputs, are discussed in Joey McCollum and Robert Turnbull, "Using Bayesian Phylogenetics to Infer Manuscript Transmission History," *DSH* TBD (2024), DOI: 10.1093/llc/fqad089.

.. code-block:: bibtex

    @article{MT2024,
        author = {Joey McCollum and Robert Turnbull},
        title = {{Using Bayesian Phylogenetics to Infer Manuscript Transmission History}},
        journal = {Digital Scholarship in the Humanities},
        year = {2024},
        volume = {TBD},
        number = {TBD},
        pages = {TBD},
        doi = {10.1093/llc/fqad089},
        url = {https://doi.org/10.1093/llc/fqad089}
    }

.. end-credits
13 changes: 8 additions & 5 deletions docs/advanced.rst
@@ -588,17 +588,20 @@ Note that for the ``nexus``, ``hennig86``, ``phylip``, and ``fasta`` output form
This is a requirement for Hennig86 format, and some phylogenetic programs that use these formats (such as IQTREE and RAxML) do not support symbols outside of the basic 36 alphanumeric characters or a 32-character alphabet at this time.

Collations can also be converted to tabular formats.
Within Python, the ``collation`` class's ``to_numpy`` method can be invoked to convert a collation to a NumPy matrix with rows for variant readings, columns for witnesses, and frequency values in the cells.
Within Python, the ``collation`` class's ``to_numpy`` method can be invoked to convert a collation to a NumPy ``array`` with rows for variant readings, columns for witnesses, and frequency values in the cells.
Where a witness has missing data at a variation, its frequencies for different readings at this unit can be split evenly over 1 using the ``split_missing`` argument; otherwise, the witness will have frequencies of 0 for all readings at that unit.
The same class's ``to_distance_matrix`` method produces a NumPy matrix with rows and columns for witnesses, where each cell contains the number of units where the row witness and column witness both have unambiguous readings and these readings disagree.
The same class's ``to_distance_matrix`` method produces a NumPy ``array`` with rows and columns for witnesses, where each cell contains the number of units where the row witness and column witness both have unambiguous readings and these readings disagree.
The cells can instead be populated with the proportion of disagreements to units where the row and column witnesses have readings with the ``proportion`` argument.
The same class's ``to_long_table`` method produces a NumPy matrix with columns for witness ID, variation unit ID, reading index, and reading text and rows for all combinations of these values found in the collation.
The ``to_dataframe`` method invokes either ``to_numpy`` or ``to_long_table`` (depending on whether its ``long_table`` argument is true) and returns a Pandas ``DataFrame`` augmented with row and column labels (or, in the case of a long table, just column labels).
The same class's ``to_nexus_table`` method produces a NumPy ``array`` with rows for witnesses, columns for variation unit IDs, and attested reading IDs in the cells, resembling a NEXUS sequence.
By default, cells corresponding to ambiguous readings are written as space-separated sequences of readings between braces, but they can be written as missing states with the ``ambiguous_as_missing`` argument.
The same class's ``to_long_table`` method produces a NumPy ``array`` with columns for witness ID, variation unit ID, reading index, and reading text and rows for all combinations of these values found in the collation.
The ``to_dataframe`` method invokes ``to_numpy`` by default, but if the ``table_type`` argument is ``distance``, ``nexus`` or ``long``, then it will invoke ``to_distance_matrix``, ``to_nexus_table`` or ``to_long_table``, respectively.
It returns a Pandas ``DataFrame`` augmented with row and column labels (or, in the case of a long table, just column labels).
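
As a rough illustration of the distance calculation described above, the following pure-Python sketch counts disagreements between witnesses over the units where both have unambiguous readings, with an optional proportional variant. It is not ``teiphy``'s implementation: the ``distance_matrix`` helper, the toy data, and the use of ``None`` to stand for a missing or ambiguous reading are all assumptions made for this example.

```python
def distance_matrix(readings, proportion=False):
    """Build a witness-by-witness distance matrix from toy data.

    readings maps a witness ID to a list with one reading index per
    variation unit; None marks a missing or ambiguous reading
    (a convention assumed for this sketch only).
    """
    witnesses = list(readings)
    matrix = []
    for row_wit in witnesses:
        row = []
        for col_wit in witnesses:
            comparable = 0      # units where both witnesses have a reading
            disagreements = 0   # comparable units where the readings differ
            for a, b in zip(readings[row_wit], readings[col_wit]):
                if a is None or b is None:
                    continue
                comparable += 1
                if a != b:
                    disagreements += 1
            if proportion:
                # Guard against witnesses with no comparable units.
                row.append(disagreements / comparable if comparable else 0.0)
            else:
                row.append(disagreements)
        matrix.append(row)
    return witnesses, matrix


# Hypothetical collation: three witnesses, four variation units.
toy = {
    "A": [0, 1, 0, None],
    "B": [0, 0, 1, 1],
    "C": [None, 1, 0, 1],
}
labels, counts = distance_matrix(toy)
_, proportions = distance_matrix(toy, proportion=True)
```

Here witnesses ``A`` and ``B`` can be compared at three units and disagree at two of them, so the count is 2 and the proportion is 2/3; units where either witness lacks a reading contribute to neither figure.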

From the command line, the standard reading-witness matrix or long table can be written to a specified CSV, TSV, or Excel (.xlsx) file.
If you specify the output filename with its extension, ``teiphy`` will infer which format to use.
If you are writing a reading-witness matrix to output, you can set the method's ``split_missing`` argument using the ``--split-missing`` command-line flag.
If you want to write a long table to output instead of a reading-witness matrix, then you can do so by including the ``--long-table`` command-line flag.
If you want to write a distance matrix, a NEXUS-style table, or a long table to output instead of a reading-witness matrix, then you can do so by specifying the ``--table distance``, ``--table nexus``, or ``--table long`` command-line argument, respectively.
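
The extension-based format inference mentioned above can be pictured with a small sketch like the one below. The ``infer_table_format`` function and the ``SUFFIX_TO_FORMAT`` mapping are hypothetical names invented for illustration; how ``teiphy`` actually performs this lookup is not shown in this diff.

```python
from pathlib import Path

# Hypothetical mapping from output-file suffix to a tabular format label.
SUFFIX_TO_FORMAT = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".xlsx": "excel",
}


def infer_table_format(output_path):
    """Return the format label implied by the output filename's extension."""
    suffix = Path(output_path).suffix.lower()
    if suffix not in SUFFIX_TO_FORMAT:
        raise ValueError(f"Unsupported tabular output extension: {suffix!r}")
    return SUFFIX_TO_FORMAT[suffix]
```

Under these assumptions, ``infer_table_format("distances.csv")`` yields ``"csv"``, while an unrecognized extension raises a ``ValueError`` rather than silently choosing a default.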

Other Options
-------------
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "teiphy"
version = "0.1.7"
version = "0.1.8"
description = "Converts TEI XML collations to NEXUS and other formats"
authors = ["Joey McCollum and Robert Turnbull"]
license = "MIT"