## Exploring data in a NEXUS source file

I'm using this Jupyter notebook to ask questions about the contents of the NEXUS file checked into GitHub here: `https://raw.githubusercontent.com/neelsmith/nexus/master/jvm/src/test/resources/CaveTrechineCOI.nex`



### Configure notebook for custom libraries

The `nexus` library I'll use is not yet published to JCenter, so I'll configure this notebook to use my personal Bintray repository, where it is already published.

In [None]:
val personalBintray = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(personalBintray)


In [None]:
import $ivy.`edu.holycross.shot::nexus:1.4.1`

### Load NEXUS data from a URL using `nexus` library


In [None]:
import edu.holycross.shot.nexus._
val dataUrl = "https://raw.githubusercontent.com/neelsmith/nexus/master/jvm/src/test/resources/CaveTrechineCOI.nex"
val nexus = NexusSource.fromUrl(dataUrl)


### Explore data set

In [None]:
// These are the names of the blocks in the data set:
nexus.blockNames


I'm pretty sure I don't understand how the CODONS block is organized:

In [None]:
val codons = nexus.block("codons").get
println("Block: " + codons)

### Data matrix

Most immediately, I want to be able to work with the matrix of character data.

Use the library to create a structured `DataMatrix` object:

In [None]:
val m = nexus.matrix

In [None]:
// Basic structure is good: same number of labels and data rows!
m.labels.size == m.data.size

In [None]:
// The rows method gives us a Vector of structured `NexusCharacters` objects.
// Count 'em and peek at the first one:

In [None]:
m.rows.size
m.rows.head

For the `NexusCharacters` class, the `size` method counts the number of characters ("columns").

I expected all of these to be the same size, but they are not!

In [None]:
m.rows.map(r => r.size).distinct
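
Listing the distinct sizes shows that rows differ in length, but not how many rows have each length. Here is a minimal sketch of one way to get that tally. It does not use the `nexus` library: `Row` is a hypothetical stand-in for `NexusCharacters`, and the data is made up for illustration.

```scala
// Hypothetical stand-in for the library's row type: a label plus a string of characters.
case class Row(label: String, characters: String) {
  def size: Int = characters.length
}

// Toy data, NOT taken from CaveTrechineCOI.nex.
val rows = Vector(
  Row("taxonA", "ACGTAC"),
  Row("taxonB", "ACGTAA"),
  Row("taxonC", "???"),
  Row("taxonD", "AC?")
)

// Map each distinct row length to the number of rows of that length.
val sizeCounts: Map[Int, Int] =
  rows.groupBy(_.size).map { case (size, group) => size -> group.size }

// Report the lengths in ascending order.
sizeCounts.toSeq.sortBy(_._1).foreach { case (size, n) =>
  println(s"$size characters: $n rows")
}
```

If the real rows behave like this stand-in, the same `groupBy` call on `m.rows` should show at a glance how many rows fall into each length group.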

So let's look at the rows that have only 48 characters.

In [None]:
val shortRows = m.rows.filter(r => r.size == 48)
println(s"${shortRows.size} rows with 48 characters.")

I'm curious about those rows with no known characters.  How many of the 48-character rows do they account for?


In [None]:
val noData = shortRows.filter(r => r.characters == "????????????????????????????????????????????????")
println(s"${noData.size} rows with no data.")




Interesting!  So what's the *one* other short row?

In [None]:
shortRows.filterNot(r => r.characters == "????????????????????????????????????????????????")


I guess I should look at the labels for all the no-data rows:  maybe that would tell us something?

In [None]:
val noDataLabels = noData.map(row => row.label)
println(noDataLabels.mkString(", "))

## Questions: what's up with the short character lists?

- Is there something special about `Trechuscoloradensis`, which has characters but only 48 of them?
- Do we know why we have no data for the other group of 175 records?
- Should our library have a simple (higher-order) method for identifying records with no data (e.g., a method named something like `noData` on the `NexusCharacters` class), so that we could easily filter or distinguish rows in a matrix with no data?
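
To make the last question concrete, here is a rough sketch of what such a method might look like. This is only an assumption about a possible design, not the library's actual API: `LabeledRow` is a hypothetical stand-in for `NexusCharacters`, the data is invented, and I'm assuming that `?` is the only marker for a missing character.

```scala
// Hypothetical stand-in for the library's row type.
case class LabeledRow(label: String, characters: String) {
  // True when every character is the missing-data marker '?'.
  // Note: an empty character string also counts as "no data" here.
  def noData: Boolean = characters.forall(_ == '?')
}

// Toy data, NOT taken from CaveTrechineCOI.nex.
val sample = Vector(
  LabeledRow("taxonA", "ACGT"),
  LabeledRow("taxonB", "AC??"),
  LabeledRow("taxonC", "????"),
  LabeledRow("taxonD", "?CGT")
)

// Split the matrix rows into those with no data and those with at least one known character.
val (missing, present) = sample.partition(_.noData)

println(s"${missing.size} rows with no data: " + missing.map(_.label).mkString(", "))
println(s"${present.size} rows with some data.")
```

Because the test checks every character rather than comparing against a fixed string of 48 question marks, it would work for rows of any length.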
