# Answer to Joe's question --- Part II

Character data, most often DNA or other sequence data, is typically stored in a textual format. For instance, the NCBI databases use the FASTA format (https://en.wikipedia.org/wiki/FASTA_format). This means the data can be stored as `char` type and read by a C program. For now, we'll work with some hard-coded data of this kind:

In [1]:
int num_taxa = 5;
int num_sites = 5; // The number of DNA sites in the data matrix
char rawalignment[] = "TTTAA" 
                      "TTT-C"
                      "T?TTC"
                      "TT--A"
                      "TGTTG";
// Note: in C/C++ adjacent string literals are concatenated, so the above is a single string.
// I've broken the lines to show the "shape" of the data matrix: 5 taxa (tips, one per row)
// by 5 characters (sites, one per column).

As we saw in the lectures, we can write a function that defines how each input symbol is encoded. In this case, if we want to use DNA characters, we can give each base its own bit with the following function:

In [2]:
#include <limits.h>
#include <iostream>

char char2DNAbase(const char c)
{
    if (c == 'A' || c == 'a') {
        return (char)1;
    }
    else if (c == 'C' || c == 'c') {
        return (char)1 << 1;
    }
    else if (c == 'G' || c == 'g') {
        return (char)1 << 2;
    }
    else if (c == 'T' || c == 't') {
        return (char)1 << 3;
    }
    else if (c == '?' || c == '-') {
        return CHAR_MAX; // All usable bits set: 01111111 (127) when char is signed
    }
    
    return -1; // A nonsense value for an error check
}

We can now read this input and convert it to a bitwise representation.

In [3]:
#include <stdlib.h> // calloc(), exit()
// First we need storage for the matrix:
char* matrix = NULL;
matrix = (char*)calloc(num_sites * num_taxa, sizeof(char));

if (matrix == NULL) {
    exit(EXIT_FAILURE); // Kill program if the calloc failed.
}
// I'm storing the data in a 'char' type because there are only 4 bases of DNA but 8 bits in a char.

// Loop over the whole matrix and convert it to our bitwise convention defined in char2DNAbase()
int i = 0;
for (i = 0; i < (num_taxa * num_sites); ++i) {
    matrix[i] = char2DNAbase(rawalignment[i]);
}
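The conversion loop above stores whatever `char2DNAbase()` returns, including its error value. As a minimal sketch, assuming we would rather reject bad input before converting anything, the check below scans the raw string first. The names `is_dna_symbol` and `first_bad_cell` are hypothetical helpers, not part of the original notebook.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: true if c is one of the symbols char2DNAbase() accepts.
bool is_dna_symbol(char c)
{
    // strchr() would also match the terminator, so exclude '\0' explicitly.
    return c != '\0' && std::strchr("ACGTacgt?-", c) != nullptr;
}

// Hypothetical pre-flight check. Returns -1 if raw holds exactly ncells valid
// symbols; otherwise returns the index of the first problem (an unknown
// symbol, an early end of string, or extra data after the last cell).
long first_bad_cell(const char* raw, long ncells)
{
    assert(raw != nullptr);
    for (long i = 0; i < ncells; ++i) {
        if (!is_dna_symbol(raw[i])) {
            return i; // also catches a string that is too short
        }
    }
    return raw[ncells] == '\0' ? -1 : ncells; // too long counts as a problem
}
```

For the alignment above, `first_bad_cell(rawalignment, num_taxa * num_sites)` returns -1, so the conversion loop is safe to run.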

It's worth making a digression here to think about matrices in C. It is possible to define a matrix in C if the dimensions are compile-time constants:

`int intMatrix[5][10];`

You can get and set any position in this matrix by subscripting once per dimension:

`intMatrix[1][4] // refers to the data in the second row and the fifth column`

However, a matrix in C/C++ is stored as a single linear array, one row after another. The matrix notation is just 'syntactic sugar'. For more flexible matrices, such as those whose dimensions are only known at run time, it helps to write functions that get and set values in a linear array as though it were a matrix. For instance:

In [4]:
char get_from_data_matrix(char* mat, int i, int j, int width)
{
    return mat[j * width + i];
}

In [5]:
void set_in_data_matrix(char indata, char* mat, int i, int j, int width)
{
    mat[j * width + i] = indata;
}
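The arithmetic inside the getter and setter can be isolated in one place. The sketch below uses hypothetical helper names (they are not part of the original notebook) to show the row-major mapping in both directions. Here `width` is the number of columns, which is `num_sites` for our data matrix.

```cpp
#include <cassert>

// Hypothetical helper: position of (row, col) in a row-major linear array.
long flat_index(long row, long col, long width)
{
    assert(width > 0 && col >= 0 && col < width);
    return row * width + col;
}

// Hypothetical inverses: recover the row and the column from a linear position.
long row_of(long index, long width)
{
    assert(width > 0);
    return index / width;
}

long col_of(long index, long width)
{
    assert(width > 0);
    return index % width;
}
```

With a width of 10, `flat_index(1, 4, 10)` is 14, which is where `intMatrix[1][4]` sits in the underlying storage of `int intMatrix[5][10]`.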

In [6]:
// This snippet checks the getter by printing the matrix one row (taxon) at a time
int j = 0;
for (j = 0; j < num_taxa; ++j) {
    for (i = 0; i < num_sites; ++i) {
        // Cast it as an int as the binary might not correspond to an ASCII character
        int x = (int)get_from_data_matrix(matrix, i, j, num_sites);
        std::cout << x << "\t";
    }
    std::cout << std::endl;
}

8	8	8	1	1	
8	8	8	127	2	
8	127	8	8	2	
8	8	127	127	1	
8	4	8	8	4	


If you compare the output above to the raw alignment, you should be able to check that the conversion is accurate. An easy way to check the missing-data cells is to figure out what '127' is in binary and compare that to our `char2DNAbase()` function.
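To read values such as 127 from the output above as sets of states, it can help to print them in binary. The following is a small sketch using `std::bitset` from the standard library; `bits_of` is a hypothetical helper name, not something defined elsewhere in this notebook.

```cpp
#include <bitset>
#include <cassert>
#include <climits>
#include <string>

// Hypothetical helper: render the eight low bits of a state set as text,
// most significant bit first.
std::string bits_of(char state)
{
    assert(CHAR_BIT == 8); // the fixed width below assumes 8-bit chars
    // Convert through unsigned char so the value passed to bitset is never negative.
    return std::bitset<8>((unsigned char)state).to_string();
}
```

`bits_of(8)` gives `00001000` (the T bit alone), while `bits_of(127)` gives `01111111`: every base bit is set, which is how missing data and gaps are encoded.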

The following syntax:

`int amatrix[3][5];`

is not really a three-by-five grid. Instead, it should be read as an array of three arrays, each of size 5. In other words:

`amatrix == { {_,_,_,_,_}, {_,_,_,_,_}, {_,_,_,_,_} };`    

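In terms of indexing syntax, this looks a lot like a pointer to pointers (eek!), even though the memory layout differs: a true 2D array is one contiguous block, whereas a pointer to pointers refers to separately allocated rows. The shared syntax lets us employ a dirty trick.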

In [7]:
char **nodeData = NULL; // This is our big matrix for all nodal sets in our tree

// Notice below we are using 2 * num_taxa -1 so that we can also store results for calculations on the internal nodes
nodeData = (char**)calloc(2*num_taxa - 1, sizeof(char*)); // Allocate the pointers to the rows of data

// We can now think of nodeData as something like an array of arrays (really, it's a block of memory for pointers
// to blocks of char). That means we can loop over each element in nodeData and point it at a new block
// of memory corresponding to the width of our matrix:
for (i = 0; i < 2 * num_taxa - 1; ++i) {
    nodeData[i] = (char*)calloc(num_sites, sizeof(char));
}
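The allocation loop above does not check the result of each row's `calloc`, and nothing in this notebook releases the memory afterwards. As a sketch only, assuming we want both behaviours wrapped in functions, a matching pair might look like this. `alloc_node_data` and `free_node_data` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: release the first nrows rows, then the block of row pointers.
void free_node_data(char** data, int nrows)
{
    if (data == NULL) {
        return;
    }
    for (int r = 0; r < nrows; ++r) {
        std::free(data[r]);
    }
    std::free(data);
}

// Hypothetical helper: allocate nrows zeroed rows of ncols chars each.
// Returns NULL (after cleaning up) if any allocation fails.
char** alloc_node_data(int nrows, int ncols)
{
    assert(nrows > 0 && ncols > 0);
    char** data = (char**)std::calloc(nrows, sizeof(char*));
    if (data == NULL) {
        return NULL;
    }
    for (int r = 0; r < nrows; ++r) {
        data[r] = (char*)std::calloc(ncols, sizeof(char));
        if (data[r] == NULL) {
            free_node_data(data, r); // rows 0..r-1 were allocated successfully
            return NULL;
        }
    }
    return data;
}
```

With these, the cell above would reduce to `nodeData = alloc_node_data(2 * num_taxa - 1, num_sites);` followed by a NULL check, and `free_node_data(nodeData, 2 * num_taxa - 1);` once the analysis is finished.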

Because the C/C++ compiler allows a lot of interchange between pointer and array syntax, we can cheat a little and index into nodeData as though it were a 2-dimensional array (i.e. a matrix):

In [8]:
// Set a value in node data (valid column indices run from 0 to num_sites - 1):
nodeData[3][4] = 99;

// Read out the same value:
std::cout << "The value at 3 and 4: " << (int)nodeData[3][4] << std::endl;

The value at 3 and 4: 99


In [9]:
// This is technically the same as:
std::cout << "The value at 3 and 4: " << (int) *(*(nodeData + 3) + 4) << std::endl;

The value at 3 and 4: 99


Now we have enough storage for all the tips and all the nodal sets. What we need to do now is write the data 'into' the cells for our tips. We do this just as we did earlier on a smaller example matrix:

In [10]:
for (i = 0; i < num_taxa; ++i) {
    int j = 0;
    for (j = 0; j < num_sites; ++j) {
        nodeData[i][j] = char2DNAbase(rawalignment[i * num_sites + j]); // Row i starts at offset i * num_sites
    }
}

Once again, verify the output just for the sake of it:

In [11]:
for (i = 0; i < num_taxa; ++i) {
    int j = 0;
    for (j = 0; j < num_sites; ++j) {
        std::cout << (int)nodeData[i][j] << "\t";
    }
    std::cout << std::endl;
}
std::cout << std::endl;

8	8	8	1	1	
8	8	8	127	2	
8	127	8	8	2	
8	8	127	127	1	
8	4	8	8	4	



Now that we have a matrix and storage for all of our data, we can write a function that performs ancestral state reconstruction at a single node. We just need to pass this function the indices of the two descendant nodes and of their parent:

In [12]:
long unord_parsimony_downpass(long left, long right, long parent, char** data, long nchars)
{
    long i = 0;   // Same type as nchars
    long res = 0; // Same type as the return value
    
    for (i = 0; i < nchars; ++i) {
        if (data[left][i] & data[right][i]) { // If any states in common between descendant sets...
            data[parent][i] = data[left][i] & data[right][i]; // parent set is the intersection of sets (bitwise AND)
        }
        else {
            data[parent][i] = data[left][i] | data[right][i]; // parent is the union of descendant sets (bitwise OR)
            ++res; // Add a step to the tree
        }
    }
    
    return res; // Return the number of evolutionary steps implied at this node
}
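To see the rule at work on individual sites, here is a minimal single-site sketch of the same logic. `downpass_site` is a hypothetical helper written for illustration; it is not part of the original notebook, and the state values follow the bit convention of `char2DNAbase()` (A = 1, C = 2, G = 4, T = 8, missing = 127).

```cpp
#include <cassert>

// Hypothetical single-site version of the rule in unord_parsimony_downpass().
// Returns the parent's state set and adds 1 to *steps when a change is implied.
char downpass_site(char left, char right, int* steps)
{
    assert(steps != nullptr);
    char shared = left & right; // states present in both descendants
    if (shared) {
        return shared;          // intersection, no step added
    }
    ++(*steps);                 // no states in common, so count one change
    return left | right;        // union of the descendant sets
}
```

For example, descendants A (1) and C (2) share nothing, so the parent set is 3 (A or C) and one step is counted. Descendants T (8) and missing data (127) intersect to 8, so the parent is T and no step is added, which is why setting every bit is a convenient encoding for `?` and `-`.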

Now, you could write this into your traversal function from Part I, but this isn't very flexible. Traversals are used a lot in tree-related algorithms and you might prefer to have only a single, general-purpose traversal function. The traversal function could write the postorder sequence of nodes into an array. Then, starting from the first internal node in postorder sequence, you could loop over the postorder sequence and apply the function above. We'll save that for Part III.